2020-04-08 22:01:19

by harshad shirwadkar

Subject: [PATCH v6 01/20] ext4: update docs for fast commit feature

From: Harshad Shirwadkar <[email protected]>

This patch series adds support for fast commits, a simplified
version of the scheme proposed by Park and Shin in their paper
"iJournaling: Fine-Grained Journaling for Improving the Latency of
Fsync System Call"[1]. The basic idea of fast commits is to make JBD2
give the client file system an opportunity to perform a faster
commit. JBD2 falls back to traditional commits only if the file
system cannot perform such a commit.

Because JBD2 operates at block granularity, every file system
metadata update it commits causes all the changed blocks to be written
to the journal at commit time. This is inefficient because updates to
some of the blocks that JBD2 commits are derivable from other blocks.
For example, if a new extent is added to an inode, then the corresponding
updates to the inode table, the block bitmap, the group descriptor and
the superblock can be derived from just the extent information and
the corresponding inode information. So, if we take this relationship
between blocks into account and replay the journalled blocks smartly,
we could increase the performance of file system commits significantly.
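
As a rough illustration of the "derivable" idea above, the following
user-space sketch marks an extent's blocks in a block bitmap and
recomputes the used-block count from the bitmap. It is not code from this
series: struct extent, bitmap_mark_extent() and bitmap_count_used() are
hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified extent: a run of physical blocks. */
struct extent {
	uint32_t start;	/* first physical block */
	uint32_t len;	/* number of blocks */
};

/* Set the bits covered by @ex in @bitmap, which tracks @nblocks blocks. */
static void bitmap_mark_extent(uint8_t *bitmap, size_t nblocks,
			       const struct extent *ex)
{
	assert((size_t)ex->start + ex->len <= nblocks);
	for (uint32_t b = ex->start; b < ex->start + ex->len; b++)
		bitmap[b / 8] |= (uint8_t)(1u << (b % 8));
}

/* Derive the used-block count from the bitmap instead of logging it. */
static size_t bitmap_count_used(const uint8_t *bitmap, size_t nblocks)
{
	size_t used = 0;

	for (size_t b = 0; b < nblocks; b++)
		if (bitmap[b / 8] & (1u << (b % 8)))
			used++;
	return used;
}
```

Because setting a bit is idempotent, marking the same extent twice leaves
both the bitmap and the derived count unchanged, which is the property a
replay path needs if it may see the same update more than once.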

The fast commits introduced in this patch series make two main
contributions:

(1) Make JBD2 fast commit aware, so that clients of JBD2 can
implement fast commits.

(2) Add support in ext4 to use JBD2's new interfaces and implement
fast commits.

Ext4 supports two modes of fast commits: 1) fast commits with hard
consistency guarantees and 2) fast commits with soft consistency
guarantees.

When hard consistency is enabled, fast commit guarantees that all the
updates will be committed. After a successful replay of fast commit
blocks in hard consistency mode, the entire file system would be in
the same state as when fsync() returned before the crash. This
guarantee is similar to what jbd2 gives with full commits.

With soft consistency, the file system only guarantees consistency for
the inode in question. In this mode, the file system will try to write as
little data to the backend as possible during the commit time. To be
precise, the file system records all the data updates for the inode in
question and the directory updates that are required for guaranteeing
consistency of the inode in question.

In our evaluations, fast commits with hard consistency performed
better than fast commits with soft consistency. That's because with
hard consistency, a fast commit often ends up committing other inodes
together, while with soft consistency commits get serialized. Future
work can look at creating a hybrid approach between the two extremes
present in this patchset.

Testing
-------

e2fsprogs was updated to set the fast commit feature flag and to ignore
fast commit blocks during e2fsck.

https://github.com/harshadjs/e2fsprogs.git

After applying all the patches in this series, the following runs of
xfstests were performed:

- kvm-xfstests.sh -g log -c 4k
- kvm-xfstests.sh smoke

All the log tests were successful and the smoke tests didn't introduce
any additional failures.

Performance Evaluation
----------------------

Ext4 file system performance was tested with full commits, with fast
commits with soft consistency and with fast commits with hard
consistency. The fs_mark benchmark showed a performance improvement of
up to 50%, depending on the file size. Soft fast commits performed
slightly worse than hard fast commits, but they ended up writing
slightly fewer blocks to disk.

Changes since V5:

- Rebased on top of v5.6

Harshad Shirwadkar (20):
ext4: add debug mount option to test fast commit replay
ext4: add fast commit replay path
ext4: disable certain features in replay path
ext4: add idempotent helpers to manipulate bitmaps
ext4: fast commit recovery path preparation
jbd2: add fast commit recovery path support
ext4: main commit routine for fast commits
jbd2: add new APIs for commit path of fast commits
ext4: add fast commit on-disk format structs
ext4: add fast commit track points
ext4: break ext4_unlink() and ext4_link()
ext4: add inode tracking and ineligible marking routines
ext4: add directory entry tracking routines
ext4: add generic diff tracking routines and range tracking
jbd2: fast commit main commit path changes
jbd2: disable fast commits if journal is empty
jbd2: add fast commit block tracker variables
ext4, jbd2: add fast commit initialization routines
ext4: add handling for extended mount options
ext4: update docs for fast commit feature

Documentation/filesystems/ext4/journal.rst | 127 ++-
Documentation/filesystems/journalling.rst | 18 +
fs/ext4/acl.c | 1 +
fs/ext4/balloc.c | 10 +-
fs/ext4/ext4.h | 126 +++
fs/ext4/ext4_jbd2.c | 1484 +++++++++++++++++++++++++++-
fs/ext4/ext4_jbd2.h | 71 ++
fs/ext4/extents.c | 5 +
fs/ext4/extents_status.c | 24 +
fs/ext4/fsync.c | 2 +-
fs/ext4/ialloc.c | 165 +++-
fs/ext4/inline.c | 3 +
fs/ext4/inode.c | 76 +-
fs/ext4/ioctl.c | 11 +-
fs/ext4/mballoc.c | 158 ++-
fs/ext4/mballoc.h | 2 +
fs/ext4/migrate.c | 1 +
fs/ext4/namei.c | 182 ++--
fs/ext4/super.c | 72 +-
fs/ext4/xattr.c | 6 +
fs/jbd2/commit.c | 61 ++
fs/jbd2/journal.c | 217 +++-
fs/jbd2/recovery.c | 67 +-
include/linux/jbd2.h | 83 +-
include/trace/events/ext4.h | 208 +++-
25 files changed, 3046 insertions(+), 134 deletions(-)

Signed-off-by: Harshad Shirwadkar <[email protected]>
---
Documentation/filesystems/ext4/journal.rst | 127 ++++++++++++++++++++-
Documentation/filesystems/journalling.rst | 18 +++
2 files changed, 139 insertions(+), 6 deletions(-)

diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index ea613ee701f5..f94e66f2f8c4 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -29,10 +29,10 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
disk before the metadata are written to disk through the journal.

The journal inode is typically inode 8. The first 68 bytes of the
-journal inode are replicated in the ext4 superblock. The journal itself
-is normal (but hidden) file within the filesystem. The file usually
-consumes an entire block group, though mke2fs tries to put it in the
-middle of the disk.
+journal inode are replicated in the ext4 superblock. The journal
+itself is normal (but hidden) file within the filesystem. The file
+usually consumes an entire block group, though mke2fs tries to put it
+in the middle of the disk.

All fields in jbd2 are written to disk in big-endian order. This is the
opposite of ext4.
@@ -42,22 +42,74 @@ NOTE: Both ext4 and ocfs2 use jbd2.
The maximum size of a journal embedded in an ext4 filesystem is 2^32
blocks. jbd2 itself does not seem to care.

+Fast Commits
+~~~~~~~~~~~~
+
+Ext4 also implements fast commits and integrates them with JBD2 journalling.
+Fast commits store metadata changes made to the file system as inode-level
+diffs. In other words, each fast commit block identifies updates made to
+a particular inode, and collectively they represent the total changes made to
+the file system.
+
+A fast commit is valid only if there is no full commit after that particular
+fast commit. Because of this feature, fast commit blocks can be reused by
+the following transactions.
+
+Each fast commit block stores updates to 1 particular inode. Updates in each
+fast commit block are one of the 2 types:
+- Data updates (add range / delete range)
+- Directory entry updates (Add / remove links)
+
+Fast commit blocks must be replayed in the order in which they appear on disk.
+That's because directory entry updates are written in fast commit blocks
+in the order in which they are applied on the file system before crash.
+Changing the order of replay for directory entry updates may result
+in an inconsistent file system. Note that only directory entry updates need
+ordering; data updates, since they apply to only one inode, do not require
+ordered replay. Also, fast commits guarantee that the file system is in a
+consistent state after replay of each fast commit block as long as the order
+of replay has been followed.
+
+Note that directory inode updates are never directly recorded in fast commits.
+Just like other file system level metadata, updates to directories are always
+implied based on directory entry updates stored in fast commit blocks.
+
+Based on which directory entry updates are committed with an inode, fast
+commits have two modes of operation:
+
+- Hard Consistency (default)
+- Soft Consistency (can be enabled by setting mount flag "fc_soft_consistency")
+
+When hard consistency is enabled, fast commit guarantees that all the updates
+will be committed. After a successful replay of fast commit blocks
+in hard consistency mode, the entire file system would be in the same state as
+when fsync() returned before the crash. This guarantee is similar to what
+jbd2 gives.
+
+With soft consistency, the file system only guarantees consistency for the
+inode in question. In this mode, the file system will try to write as little data
+to the backend as possible during the commit time. To be precise, the file system
+records all the data updates for the inode in question and directory updates
+that are required for guaranteeing consistency of the inode in question.
+
Layout
~~~~~~

Generally speaking, the journal has this format:

.. list-table::
- :widths: 16 48 16
+ :widths: 16 48 16 18
:header-rows: 1

* - Superblock
- descriptor\_block (data\_blocks or revocation\_block) [more data or
revocations] commmit\_block
- [more transactions...]
+ - [Fast commits...]
* -
- One transaction
-
+ -

Notice that a transaction begins with either a descriptor and some data,
or a block revocation list. A finished transaction always ends with a
@@ -76,7 +128,7 @@ The journal superblock will be in the next full block after the
superblock.

.. list-table::
- :widths: 12 12 12 32 12
+ :widths: 12 12 12 32 12 12
:header-rows: 1

* - 1024 bytes of padding
@@ -85,11 +137,13 @@ superblock.
- descriptor\_block (data\_blocks or revocation\_block) [more data or
revocations] commmit\_block
- [more transactions...]
+ - [Fast commits...]
* -
-
-
- One transaction
-
+ -

Block Header
~~~~~~~~~~~~
@@ -609,3 +663,64 @@ bytes long (but uses a full block):
- h\_commit\_nsec
- Nanoseconds component of the above timestamp.

+Fast Commit Block
+~~~~~~~~~~~~~~~~~
+
+The fast commit block indicates an append to the last commit block
+that was written to the journal. One fast commit block records updates
+to one inode. So, typically you would find as many fast commit blocks
+as the number of inodes that got changed since the last commit. A fast
+commit block is valid only if there is no commit block present with
+transaction ID greater than that of the fast commit block. If such a
+block is present, then there is no need to replay the fast commit
+block.
+
+.. list-table::
+ :widths: 8 8 24 40
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Descriptor
+ * - 0x0
+ - journal\_header\_s
+ - (open coded)
+ - Common block header.
+ * - 0xC
+ - \_\_le32
+ - fc\_magic
+ - Magic value which should be set to 0xE2540090. This identifies
+ that this block is a fast commit block.
+ * - 0x10
+ - \_\_u8
+ - fc\_features
+ - Features used by this fast commit block.
+ * - 0x11
+ - \_\_le16
+ - fc\_num\_tlvs
+ - Number of TLVs contained in this fast commit block
+ * - 0x13
+ - \_\_le32
+ - \_\_fc\_len
+ - Length of the fast commit block in terms of number of blocks
+ * - 0x17
+ - \_\_le32
+ - fc\_ino
+ - Inode number of the inode that will be recovered using this fast commit
+ * - 0x2B
+ - struct ext4\_inode
+ - inode
+ - On-disk copy of the inode at the commit time
+ * - <Variable based on inode size>
+ - struct ext4\_fc\_tl
+ - Array of struct ext4\_fc\_tl
+ - The actual delta with the last commit. Starting at this offset,
+ there is an array of TLVs that indicates which extents
+ should be present in the corresponding inode. Currently, the
+ following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
+ should be present in the inode), EXT4\_FC\_TAG\_HOLE (extent
+ that should be removed from the inode), EXT4\_FC\_TAG\_ADD\_DENTRY
+ (dentry that should be linked), EXT4\_FC\_TAG\_DEL\_DENTRY
+ (dentry that should be unlinked), EXT4\_FC\_TAG\_CREATE\_DENTRY
+ (dentry for the file that should be created for the first time).
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index 58ce6b395206..1cb116ab27ab 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -115,6 +115,24 @@ called after each transaction commit. You can also use
``transaction->t_private_list`` for attaching entries to a transaction
that need processing when the transaction commits.

+JBD2 also allows client file systems to implement file system specific
+commits, which are called ``fast commits``. Fast commits are
+asynchronous in nature, i.e. file systems can call their own commit
+functions at any time. In order to avoid racing with the kjournald
+thread and other possible fast commits that may be happening in
+parallel, file systems should first call
+:c:func:`jbd2_start_async_fc()`. File system can call
+:c:func:`jbd2_map_fc_buf()` to get buffers reserved for fast
+commits. Once a fast commit is completed, file system should call
+:c:func:`jbd2_stop_async_fc()` to indicate and unblock other
+committers and the kjournald thread. After performing either a fast
+or a full commit, JBD2 calls ``journal->j_fc_cleanup_cb`` to allow
+file systems to perform cleanups for their internal fast commit
+related data structures. At the replay time, JBD2 passes each and
+every fast commit block to the file system via
+``journal->j_fc_replay_cb``. Ext4 effectively uses this fast commit
+mechanism to improve journal commit performance.
+
JBD2 also provides a way to block all transaction updates via
:c:func:`jbd2_journal_lock_updates()` /
:c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
--
2.26.0.110.g2183baf09c-goog


2020-04-08 22:01:22

by harshad shirwadkar

Subject: [PATCH v6 04/20] jbd2: add fast commit block tracker variables

From: Harshad Shirwadkar <[email protected]>

Add j_first_fc, j_last_fc and j_fc_off variables to track the fast commit
area. j_first_fc and j_last_fc mark the start and the end of the area,
while j_fc_off tracks how many fast commit blocks in the area are
currently in use.
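
The way the fast commit area is carved out of the journal can be pictured
with the stand-alone sketch below. It is only an illustration: struct
fc_layout and fc_split_journal() are hypothetical names, and the
arithmetic simply mirrors the assignments made in journal_reset() and
load_superblock() in this patch.

```c
#include <assert.h>

/* Hypothetical mirror of the journal_t fields touched by this patch. */
struct fc_layout {
	unsigned long j_first;		/* first block of the regular log */
	unsigned long j_last;		/* end of the regular log */
	unsigned long j_first_fc;	/* first fast commit block */
	unsigned long j_last_fc;	/* end of the fast commit area */
	unsigned long j_fc_off;		/* fast commit blocks in use */
	unsigned long j_free;		/* free blocks in the regular log */
};

/*
 * Reserve the last @fc_blocks blocks of a journal bounded by @first and
 * @last as the fast commit area, using the same arithmetic as the patch.
 * With @fc_blocks == 0 the whole journal stays a regular log.
 */
static struct fc_layout fc_split_journal(unsigned long first,
					 unsigned long last,
					 unsigned long fc_blocks)
{
	struct fc_layout l = { .j_first = first };

	assert(first <= last && fc_blocks <= last - first);
	if (fc_blocks > 0) {
		l.j_last_fc = last;
		l.j_last = last - fc_blocks;
		l.j_first_fc = l.j_last + 1;
		l.j_fc_off = 0;
	} else {
		l.j_last = last;
	}
	l.j_free = l.j_last - l.j_first;
	return l;
}
```

For example, with first = 1, last = 1024 and 128 reserved blocks (the
value of EXT4_NUM_FC_BLKS in this series), the sketch yields j_last = 896
and j_first_fc = 897, so the regular log shrinks by exactly the size of
the fast commit area.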

Signed-off-by: Harshad Shirwadkar <[email protected]>
---
fs/jbd2/journal.c | 33 ++++++++++++++++++++++++++++-----
include/linux/jbd2.h | 24 ++++++++++++++++++++++++
2 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 4e5d41d79b24..79f015f7bf54 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1178,6 +1178,11 @@ static journal_t *journal_init_common(struct block_device *bdev,
if (!journal->j_wbuf)
goto err_cleanup;

+ if (journal->j_fc_wbufsize > 0) {
+ journal->j_wbufsize = n - journal->j_fc_wbufsize;
+ journal->j_fc_wbuf = &journal->j_wbuf[journal->j_wbufsize];
+ }
+
bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
if (!bh) {
pr_err("%s: Cannot get buffer for journal superblock\n",
@@ -1321,11 +1326,20 @@ static int journal_reset(journal_t *journal)
}

journal->j_first = first;
- journal->j_last = last;

- journal->j_head = first;
- journal->j_tail = first;
- journal->j_free = last - first;
+ if (jbd2_has_feature_fast_commit(journal) &&
+ journal->j_fc_wbufsize > 0) {
+ journal->j_last_fc = last;
+ journal->j_last = last - journal->j_fc_wbufsize;
+ journal->j_first_fc = journal->j_last + 1;
+ journal->j_fc_off = 0;
+ } else {
+ journal->j_last = last;
+ }
+
+ journal->j_head = journal->j_first;
+ journal->j_tail = journal->j_first;
+ journal->j_free = journal->j_last - journal->j_first;

journal->j_tail_sequence = journal->j_transaction_sequence;
journal->j_commit_sequence = journal->j_transaction_sequence - 1;
@@ -1667,9 +1681,18 @@ static int load_superblock(journal_t *journal)
journal->j_tail_sequence = be32_to_cpu(sb->s_sequence);
journal->j_tail = be32_to_cpu(sb->s_start);
journal->j_first = be32_to_cpu(sb->s_first);
- journal->j_last = be32_to_cpu(sb->s_maxlen);
journal->j_errno = be32_to_cpu(sb->s_errno);

+ if (jbd2_has_feature_fast_commit(journal) &&
+ journal->j_fc_wbufsize > 0) {
+ journal->j_last_fc = be32_to_cpu(sb->s_maxlen);
+ journal->j_last = journal->j_last_fc - journal->j_fc_wbufsize;
+ journal->j_first_fc = journal->j_last + 1;
+ journal->j_fc_off = 0;
+ } else {
+ journal->j_last = be32_to_cpu(sb->s_maxlen);
+ }
+
return 0;
}

diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 3bd1431cb222..1fc981cca479 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -910,6 +910,30 @@ struct journal_s
*/
unsigned long j_last;

+ /**
+ * @j_first_fc:
+ *
+ * The block number of the first fast commit block in the journal
+ * [j_state_lock].
+ */
+ unsigned long j_first_fc;
+
+ /**
+ * @j_fc_off:
+ *
+ * Number of fast commit blocks currently allocated.
+ * [j_state_lock].
+ */
+ unsigned long j_fc_off;
+
+ /**
+ * @j_last_fc:
+ *
+ * The block number one beyond the last fast commit block in the journal
+ * [j_state_lock].
+ */
+ unsigned long j_last_fc;
+
/**
* @j_dev: Device where we store the journal.
*/
--
2.26.0.110.g2183baf09c-goog

2020-04-08 22:01:22

by harshad shirwadkar

Subject: [PATCH v6 08/20] ext4: add directory entry tracking routines

From: Harshad Shirwadkar <[email protected]>

Add directory entry change tracking routines for fast commits. An
in-memory list of directory updates is used to track directory entry
updates.
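
A user-space sketch of such a list is shown below, under the assumption
that short names are stored inline in the node and longer names get a
heap copy, as the patch does with fcd_iname. The names struct
dentry_update, struct update_queue, queue_dentry_update() and
INLINE_NAME_LEN are hypothetical and not part of this series.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define INLINE_NAME_LEN 32	/* stand-in for the kernel's DNAME_INLINE_LEN */

/* Hypothetical user-space analogue of struct ext4_fc_dentry_update. */
struct dentry_update {
	int op;				/* create / add / delete */
	unsigned long parent;		/* parent directory inode number */
	unsigned long ino;		/* inode number */
	const char *name;		/* points at iname or a heap copy */
	size_t name_len;
	char iname[INLINE_NAME_LEN];	/* storage for short names */
	struct dentry_update *next;
};

/* FIFO queue, so updates can be replayed in the order they were made. */
struct update_queue {
	struct dentry_update *head, *tail;
};

/* Record one update at the tail of @q; returns 0, or -1 if out of memory. */
static int queue_dentry_update(struct update_queue *q, int op,
			       unsigned long parent, unsigned long ino,
			       const char *name)
{
	struct dentry_update *u = calloc(1, sizeof(*u));
	size_t len = strlen(name);

	if (!u)
		return -1;
	if (len >= INLINE_NAME_LEN) {
		char *copy = malloc(len + 1);

		if (!copy) {	/* check the heap copy, not the inline array */
			free(u);
			return -1;
		}
		memcpy(copy, name, len + 1);
		u->name = copy;
	} else {
		memcpy(u->iname, name, len + 1);
		u->name = u->iname;
	}
	u->op = op;
	u->parent = parent;
	u->ino = ino;
	u->name_len = len;
	if (q->tail)
		q->tail->next = u;
	else
		q->head = u;
	q->tail = u;
	return 0;
}
```

Appending at the tail keeps the queue in the order the updates happened,
which matters because directory entry updates have to be replayed in that
same order.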

Signed-off-by: Harshad Shirwadkar <[email protected]>
---
fs/ext4/ext4.h | 26 +++++++++
fs/ext4/ext4_jbd2.c | 102 ++++++++++++++++++++++++++++++++++++
fs/ext4/ext4_jbd2.h | 4 ++
fs/ext4/super.c | 7 +++
include/trace/events/ext4.h | 28 ++++++++++
5 files changed, 167 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c07ab844c335..669ecf12d392 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -950,6 +950,26 @@ enum {
};


+/* Fast commit tags */
+#define EXT4_FC_TAG_ADD_RANGE 0x1
+#define EXT4_FC_TAG_DEL_RANGE 0x2
+#define EXT4_FC_TAG_CREAT_DENTRY 0x3
+#define EXT4_FC_TAG_ADD_DENTRY 0x4
+#define EXT4_FC_TAG_DEL_DENTRY 0x5
+
+/*
+ * In memory list of dentry updates that are performed on the file
+ * system used by fast commit code.
+ */
+struct ext4_fc_dentry_update {
+ int fcd_op; /* Type of update create / add / del */
+ int fcd_parent; /* Parent inode number */
+ int fcd_ino; /* Inode number */
+ struct qstr fcd_name; /* Dirent name qstr */
+ unsigned char fcd_iname[DNAME_INLINE_LEN]; /* Dirent name string */
+ struct list_head fcd_list;
+};
+
/*
* fourth extended file system inode data in memory
*/
@@ -1009,6 +1029,11 @@ struct ext4_inode_info {

rwlock_t i_fc_lock;

+ /*
+ * Last mdata / dirent update that happened on this inode.
+ */
+ struct ext4_fc_dentry_update *i_fc_mdata_update;
+
/*
* i_disksize keeps track of what the inode size is ON DISK, not
* in memory. During truncate, i_size is set to the new size by
@@ -1598,6 +1623,7 @@ struct ext4_sb_info {
struct list_head s_fc_q; /* Inodes staged for fast commit
* that have data changes in them.
*/
+ struct list_head s_fc_dentry_q;
spinlock_t s_fc_lock;
};

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 151a4558c338..ccaaf1c09ba6 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -368,6 +368,8 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
return err;
}

+static struct kmem_cache *ext4_fc_dentry_cachep;
+
static inline
void ext4_reset_inode_fc_info(struct inode *inode)
{
@@ -376,6 +378,7 @@ void ext4_reset_inode_fc_info(struct inode *inode)
ei->i_fc_tid = 0;
ei->i_fc_lblk_start = 0;
ei->i_fc_lblk_end = 0;
+ ei->i_fc_mdata_update = NULL;
}

void ext4_init_inode_fc_info(struct inode *inode)
@@ -444,6 +447,94 @@ static int __ext4_fc_track_template(

return ret;
}
+
+struct __ext4_dentry_update_args {
+ struct dentry *dentry;
+ int op;
+};
+
+static int __ext4_dentry_update(struct inode *inode, void *arg, bool update)
+{
+ struct ext4_fc_dentry_update *node;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct __ext4_dentry_update_args *dentry_update =
+ (struct __ext4_dentry_update_args *)arg;
+ struct dentry *dentry = dentry_update->dentry;
+
+ write_unlock(&ei->i_fc_lock);
+ node = kmem_cache_alloc(ext4_fc_dentry_cachep, GFP_NOFS);
+ if (!node) {
+ write_lock(&ei->i_fc_lock);
+ return -ENOMEM;
+ }
+
+ node->fcd_op = dentry_update->op;
+ node->fcd_parent = dentry->d_parent->d_inode->i_ino;
+ node->fcd_ino = inode->i_ino;
+ if (dentry->d_name.len > DNAME_INLINE_LEN) {
+ node->fcd_name.name = kmalloc(dentry->d_name.len + 1,
+ GFP_KERNEL);
+ if (!node->fcd_name.name) {
+ kmem_cache_free(ext4_fc_dentry_cachep, node);
+ return -ENOMEM;
+ }
+ memcpy((u8 *)node->fcd_name.name, dentry->d_name.name,
+ dentry->d_name.len);
+ } else {
+ memcpy(node->fcd_iname, dentry->d_name.name,
+ dentry->d_name.len);
+ node->fcd_name.name = node->fcd_iname;
+ }
+ node->fcd_name.len = dentry->d_name.len;
+
+ spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+ list_add_tail(&node->fcd_list, &EXT4_SB(inode->i_sb)->s_fc_dentry_q);
+ spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+ write_lock(&ei->i_fc_lock);
+ EXT4_I(inode)->i_fc_mdata_update = node;
+
+ return 0;
+}
+
+void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry)
+{
+ struct __ext4_dentry_update_args args;
+ int ret;
+
+ args.dentry = dentry;
+ args.op = EXT4_FC_TAG_DEL_DENTRY;
+
+ ret = __ext4_fc_track_template(inode, __ext4_dentry_update,
+ (void *)&args);
+ trace_ext4_fc_track_unlink(inode, dentry, ret);
+}
+
+void ext4_fc_track_link(struct inode *inode, struct dentry *dentry)
+{
+ struct __ext4_dentry_update_args args;
+ int ret;
+
+ args.dentry = dentry;
+ args.op = EXT4_FC_TAG_ADD_DENTRY;
+
+ ret = __ext4_fc_track_template(inode, __ext4_dentry_update,
+ (void *)&args);
+ trace_ext4_fc_track_link(inode, dentry, ret);
+}
+
+void ext4_fc_track_create(struct inode *inode, struct dentry *dentry)
+{
+ struct __ext4_dentry_update_args args;
+ int ret;
+
+ args.dentry = dentry;
+ args.op = EXT4_FC_TAG_CREAT_DENTRY;
+
+ ret = __ext4_fc_track_template(inode, __ext4_dentry_update,
+ (void *)&args);
+ trace_ext4_fc_track_create(inode, dentry, ret);
+}
+
struct __ext4_fc_track_range_args {
ext4_lblk_t start, end;
};
@@ -494,3 +585,14 @@ void ext4_init_fast_commit(struct super_block *sb, journal_t *journal)
return;
jbd2_init_fast_commit(journal, EXT4_NUM_FC_BLKS);
}
+
+int __init ext4_init_fc_dentry_cache(void)
+{
+ ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
+ SLAB_RECLAIM_ACCOUNT);
+
+ if (ext4_fc_dentry_cachep == NULL)
+ return -ENOMEM;
+
+ return 0;
+}
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 06d1e4a885b7..8fbd09dbfeca 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -534,4 +534,8 @@ void ext4_init_fast_commit(struct super_block *sb, journal_t *journal);
void ext4_init_inode_fc_info(struct inode *inode);
void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
ext4_lblk_t end);
+void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry);
+void ext4_fc_track_link(struct inode *inode, struct dentry *dentry);
+void ext4_fc_track_create(struct inode *inode, struct dentry *dentry);
+int __init ext4_init_fc_dentry_cache(void);
#endif /* _EXT4_JBD2_H */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 99b24156933a..a93dada07623 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4421,6 +4421,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
mutex_init(&sbi->s_orphan_lock);

INIT_LIST_HEAD(&sbi->s_fc_q);
+ INIT_LIST_HEAD(&sbi->s_fc_dentry_q);
spin_lock_init(&sbi->s_fc_lock);
sb->s_root = NULL;

@@ -6249,6 +6250,11 @@ static int __init ext4_init_fs(void)
err = init_inodecache();
if (err)
goto out1;
+
+ err = ext4_init_fc_dentry_cache();
+ if (err)
+ goto out05;
+
register_as_ext3();
register_as_ext2();
err = register_filesystem(&ext4_fs_type);
@@ -6259,6 +6265,7 @@ static int __init ext4_init_fs(void)
out:
unregister_as_ext2();
unregister_as_ext3();
+out05:
destroy_inodecache();
out1:
ext4_exit_mballoc();
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 9424ffb2a54b..577c6230b23a 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2723,6 +2723,34 @@ TRACE_EVENT(ext4_error,
__entry->function, __entry->line)
);

+#define DEFINE_TRACE_DENTRY_EVENT(__type) \
+ TRACE_EVENT(ext4_fc_track_##__type, \
+ TP_PROTO(struct inode *inode, struct dentry *dentry, int ret), \
+ \
+ TP_ARGS(inode, dentry, ret), \
+ \
+ TP_STRUCT__entry( \
+ __field(dev_t, dev) \
+ __field(int, ino) \
+ __field(int, error) \
+ ), \
+ \
+ TP_fast_assign( \
+ __entry->dev = inode->i_sb->s_dev; \
+ __entry->ino = inode->i_ino; \
+ __entry->error = ret; \
+ ), \
+ \
+ TP_printk("dev %d:%d, inode %d, error %d, fc_%s", \
+ MAJOR(__entry->dev), MINOR(__entry->dev), \
+ __entry->ino, __entry->error, \
+ #__type) \
+ )
+
+DEFINE_TRACE_DENTRY_EVENT(create);
+DEFINE_TRACE_DENTRY_EVENT(link);
+DEFINE_TRACE_DENTRY_EVENT(unlink);
+
TRACE_EVENT(ext4_fc_track_range,
TP_PROTO(struct inode *inode, long start, long end, int ret),

--
2.26.0.110.g2183baf09c-goog

2020-04-08 22:01:23

by harshad shirwadkar

Subject: [PATCH v6 07/20] ext4: add generic diff tracking routines and range tracking

From: Harshad Shirwadkar <[email protected]>

In fast commits, we need to track changes that have been made to the
file system since the last full commit. Add generic diff tracking
infrastructure. We use those helpers to track logical block ranges
that have been affected for inodes. The diff tracking helpers are used
in the following patches to track directory entry updates as well.
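
The range tracking idea can be sketched in user space as follows. This is
a simplified illustration, not the kernel code: struct fc_inode_info and
fc_track_range() are hypothetical names, locking is omitted, and the
transaction ID is passed in directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fast commit fields of ext4_inode_info. */
struct fc_inode_info {
	uint32_t fc_tid;	/* transaction the range below belongs to */
	uint32_t lblk_start;	/* first logical block touched */
	uint32_t lblk_end;	/* last logical block touched */
};

/*
 * Track [@start, @end] for the transaction @running_tid. The first call
 * in a transaction replaces the stored range; later calls widen it.
 * Returns true if an existing range was updated.
 */
static bool fc_track_range(struct fc_inode_info *ei, uint32_t running_tid,
			   uint32_t start, uint32_t end)
{
	assert(start <= end);
	if (ei->fc_tid != running_tid) {
		ei->fc_tid = running_tid;
		ei->lblk_start = start;
		ei->lblk_end = end;
		return false;
	}
	if (start < ei->lblk_start)
		ei->lblk_start = start;
	if (end > ei->lblk_end)
		ei->lblk_end = end;
	return true;
}
```

Note that a single range per inode is deliberately coarse: tracking
blocks 5-20 and then 30-40 in the same transaction yields 5-40, which
also covers blocks that were never touched.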

Signed-off-by: Harshad Shirwadkar <[email protected]>
Reported-by: kbuild test robot <[email protected]>
---
fs/ext4/ext4.h | 32 ++++++++++
fs/ext4/ext4_jbd2.c | 121 ++++++++++++++++++++++++++++++++++++
fs/ext4/ext4_jbd2.h | 3 +
fs/ext4/inode.c | 18 ++++++
fs/ext4/super.c | 5 ++
include/trace/events/ext4.h | 27 ++++++++
6 files changed, 206 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 57f8fd4fe6ad..c07ab844c335 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -983,6 +983,32 @@ struct ext4_inode_info {

struct list_head i_orphan; /* unlinked but open inodes */

+ struct list_head i_fc_list; /*
+ * inodes that need fast commit
+ * protected by sbi->s_fc_lock.
+ */
+ /*
+ * TID of when this struct was last updated. If i_fc_tid !=
+ * running transaction tid, then none of the other fields in this
+ * struct are valid. Don't directly modify fields in this struct.
+ * Use wrappers provided in ext4_jbd2.c.
+ */
+ tid_t i_fc_tid;
+
+ /*
+ * Start of logical block range that needs to be committed in
+ * this fast commit.
+ */
+ ext4_lblk_t i_fc_lblk_start;
+
+ /*
+ * End of logical block range that needs to be committed in this fast
+ * commit
+ */
+ ext4_lblk_t i_fc_lblk_end;
+
+ rwlock_t i_fc_lock;
+
/*
* i_disksize keeps track of what the inode size is ON DISK, not
* in memory. During truncate, i_size is set to the new size by
@@ -1102,6 +1128,7 @@ struct ext4_inode_info {
#define EXT4_VALID_FS 0x0001 /* Unmounted cleanly */
#define EXT4_ERROR_FS 0x0002 /* Errors detected */
#define EXT4_ORPHAN_FS 0x0004 /* Orphans being recovered */
+#define EXT4_FC_REPLAY 0x0008 /* Fast commit replay ongoing */

/*
* Misc. filesystem flags
@@ -1567,6 +1594,11 @@ struct ext4_sb_info {
#ifdef CONFIG_EXT4_DEBUG
unsigned long s_simulate_fail;
#endif
+ /* Ext4 fast commit stuff */
+ struct list_head s_fc_q; /* Inodes staged for fast commit
+ * that have data changes in them.
+ */
+ spinlock_t s_fc_lock;
};

static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 91d6437bc9b3..151a4558c338 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -367,6 +367,127 @@ int __ext4_handle_dirty_super(const char *where, unsigned int line,
mark_buffer_dirty(bh);
return err;
}
+
+static inline
+void ext4_reset_inode_fc_info(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ ei->i_fc_tid = 0;
+ ei->i_fc_lblk_start = 0;
+ ei->i_fc_lblk_end = 0;
+}
+
+void ext4_init_inode_fc_info(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ ext4_reset_inode_fc_info(inode);
+ INIT_LIST_HEAD(&ei->i_fc_list);
+}
+
+static void ext4_fc_enqueue_inode(struct inode *inode)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+ if (!ext4_should_fast_commit(inode->i_sb) ||
+ (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
+ return;
+
+ spin_lock(&sbi->s_fc_lock);
+ if (list_empty(&EXT4_I(inode)->i_fc_list))
+ list_add_tail(&EXT4_I(inode)->i_fc_list, &sbi->s_fc_q);
+ spin_unlock(&sbi->s_fc_lock);
+}
+
+static inline tid_t get_running_txn_tid(struct super_block *sb)
+{
+ if (EXT4_SB(sb)->s_journal)
+ return EXT4_SB(sb)->s_journal->j_commit_sequence + 1;
+ return 0;
+}
+
+/*
+ * Generic fast commit tracking function. If this is the first
+ * time we are called after a full commit, we initialize
+ * fast commit fields and then call __fc_track_fn() with
+ * update = 0. If we have already been called after a full commit,
+ * we pass update = 1. Based on that, the track function can
+ * determine if it needs to track a field for the first time
+ * or if it needs to just update the previously tracked value.
+ */
+static int __ext4_fc_track_template(
+ struct inode *inode,
+ int (*__fc_track_fn)(struct inode *, void *, bool),
+ void *args)
+{
+ tid_t running_txn_tid = get_running_txn_tid(inode->i_sb);
+ bool update = false;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ int ret;
+
+ if (!ext4_should_fast_commit(inode->i_sb) ||
+ (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
+ return -EOPNOTSUPP;
+
+ write_lock(&ei->i_fc_lock);
+ if (running_txn_tid == ei->i_fc_tid) {
+ update = true;
+ } else {
+ ext4_reset_inode_fc_info(inode);
+ ei->i_fc_tid = running_txn_tid;
+ }
+ ret = __fc_track_fn(inode, args, update);
+ write_unlock(&ei->i_fc_lock);
+
+ ext4_fc_enqueue_inode(inode);
+
+ return ret;
+}
+struct __ext4_fc_track_range_args {
+ ext4_lblk_t start, end;
+};
+
+#define MIN(__a, __b) ((__a) < (__b) ? (__a) : (__b))
+#define MAX(__a, __b) ((__a) > (__b) ? (__a) : (__b))
+
+int __ext4_fc_track_range(struct inode *inode, void *arg, bool update)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct __ext4_fc_track_range_args *__arg =
+ (struct __ext4_fc_track_range_args *)arg;
+
+ if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb)) {
+ ext4_debug("Special inode %ld being modified\n", inode->i_ino);
+ return -ECANCELED;
+ }
+
+ if (update) {
+ ei->i_fc_lblk_start = MIN(ei->i_fc_lblk_start, __arg->start);
+ ei->i_fc_lblk_end = MAX(ei->i_fc_lblk_end, __arg->end);
+ } else {
+ ei->i_fc_lblk_start = __arg->start;
+ ei->i_fc_lblk_end = __arg->end;
+ }
+
+ return 0;
+}
+
+void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
+ ext4_lblk_t end)
+{
+ struct __ext4_fc_track_range_args args;
+ int ret;
+
+ args.start = start;
+ args.end = end;
+
+ ret = __ext4_fc_track_template(inode,
+ __ext4_fc_track_range, &args);
+
+ trace_ext4_fc_track_range(inode, start, end, ret);
+}
+
void ext4_init_fast_commit(struct super_block *sb, journal_t *journal)
{
if (!ext4_should_fast_commit(sb))
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index b15cfa89cf1d..06d1e4a885b7 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -531,4 +531,7 @@ static inline int ext4_should_dioread_nolock(struct inode *inode)

#define EXT4_NUM_FC_BLKS 128
void ext4_init_fast_commit(struct super_block *sb, journal_t *journal);
+void ext4_init_inode_fc_info(struct inode *inode);
+void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
+ ext4_lblk_t end);
#endif /* _EXT4_JBD2_H */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e416096fc081..3bf0ad4d7d32 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -725,6 +725,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
if (ret)
return ret;
}
+ ext4_fc_track_range(inode, map->m_lblk,
+ map->m_lblk + map->m_len - 1);
}
return retval;
}
@@ -4073,6 +4075,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)

up_write(&EXT4_I(inode)->i_data_sem);
}
+ ext4_fc_track_range(inode, first_block, stop_block);
if (IS_SYNC(inode))
ext4_handle_sync(handle);

@@ -4684,6 +4687,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
for (block = 0; block < EXT4_N_BLOCKS; block++)
ei->i_data[block] = raw_inode->i_block[block];
INIT_LIST_HEAD(&ei->i_orphan);
+ ext4_init_inode_fc_info(&ei->vfs_inode);

/*
* Set transaction id's of transactions that have to be committed
@@ -5351,6 +5355,20 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
inode->i_mtime = current_time(inode);
inode->i_ctime = inode->i_mtime;
}
+
+ if (shrink)
+ ext4_fc_track_range(
+ inode, attr->ia_size >>
+ inode->i_sb->s_blocksize_bits,
+ oldsize >>
+ inode->i_sb->s_blocksize_bits);
+ else
+ ext4_fc_track_range(
+ inode, oldsize >>
+ inode->i_sb->s_blocksize_bits,
+ attr->ia_size >>
+ inode->i_sb->s_blocksize_bits);
+
down_write(&EXT4_I(inode)->i_data_sem);
EXT4_I(inode)->i_disksize = attr->ia_size;
rc = ext4_mark_inode_dirty(handle, inode);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 0bfaf76200d2..99b24156933a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1151,6 +1151,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->i_datasync_tid = 0;
atomic_set(&ei->i_unwritten, 0);
INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
+ ext4_init_inode_fc_info(&ei->vfs_inode);
+ rwlock_init(&ei->i_fc_lock);
return &ei->vfs_inode;
}

@@ -1193,6 +1195,7 @@ static void init_once(void *foo)
init_rwsem(&ei->i_data_sem);
init_rwsem(&ei->i_mmap_sem);
inode_init_once(&ei->vfs_inode);
+ ext4_init_inode_fc_info(&ei->vfs_inode);
}

static int __init init_inodecache(void)
@@ -4417,6 +4420,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
mutex_init(&sbi->s_orphan_lock);

+ INIT_LIST_HEAD(&sbi->s_fc_q);
+ spin_lock_init(&sbi->s_fc_lock);
sb->s_root = NULL;

needs_recovery = (es->s_last_orphan != 0 ||
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 19c87661eeec..9424ffb2a54b 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2723,6 +2723,33 @@ TRACE_EVENT(ext4_error,
__entry->function, __entry->line)
);

+TRACE_EVENT(ext4_fc_track_range,
+ TP_PROTO(struct inode *inode, long start, long end, int ret),
+
+ TP_ARGS(inode, start, end, ret),
+
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(int, ino)
+ __field(long, start)
+ __field(long, end)
+ __field(int, error)
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->start = start;
+ __entry->end = end;
+ __entry->error = ret;
+ ),
+
+ TP_printk("dev %d:%d, inode %d, error %d, start %ld, end %ld",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino, __entry->error, __entry->start,
+ __entry->end)
+ );
+
#endif /* _TRACE_EXT4_H */

/* This part must be outside protection */
--
2.26.0.110.g2183baf09c-goog
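
The range tracking added by this patch can be modelled in user space. The following is a minimal sketch, not the kernel code: `struct fc_inode` and `fc_track_range()` are hypothetical stand-ins for the fast commit fields of `struct ext4_inode_info` and for `ext4_fc_track_range()`, the transaction ID is a plain integer, and locking is left out. It shows the two cases that `__ext4_fc_track_template()` distinguishes: an update in the same transaction widens the tracked range, while an update in a new transaction replaces it.

```c
#include <stdint.h>

/* Hypothetical stand-in for the fast commit fields of struct ext4_inode_info. */
struct fc_inode {
	uint32_t fc_tid;      /* transaction that last updated the range */
	uint32_t lblk_start;  /* first tracked logical block */
	uint32_t lblk_end;    /* last tracked logical block, inclusive */
};

/*
 * Record that blocks [start, end] changed while transaction running_tid
 * was open. Within one transaction the tracked range only grows; a new
 * transaction discards the old range and starts over.
 */
static void fc_track_range(struct fc_inode *in, uint32_t running_tid,
			   uint32_t start, uint32_t end)
{
	if (in->fc_tid == running_tid) {
		/* Same transaction: keep the smallest start and largest end. */
		if (start < in->lblk_start)
			in->lblk_start = start;
		if (end > in->lblk_end)
			in->lblk_end = end;
	} else {
		/* New transaction: forget the previous range. */
		in->fc_tid = running_tid;
		in->lblk_start = start;
		in->lblk_end = end;
	}
}
```

Because only one start/end pair is kept per inode, two disjoint updates in the same transaction end up as a single range that also covers the blocks between them.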

2020-04-08 22:01:23

by harshad shirwadkar

Subject: [PATCH v6 13/20] jbd2: add new APIs for commit path of fast commits

From: Harshad Shirwadkar <[email protected]>

Add the following helpers for the commit path:
- jbd2_map_fc_buf() - allocates a fast commit buffer for the caller
- jbd2_wait_on_fc_bufs() - waits on fast commit buffers allocated
using jbd2_map_fc_buf()
- jbd2_submit_inode_data() - submits data buffers for one inode
- jbd2_wait_inode_data() - waits for inode data

Signed-off-by: Harshad Shirwadkar <[email protected]>
---
fs/jbd2/commit.c | 40 +++++++++++++++++++++++
fs/jbd2/journal.c | 76 ++++++++++++++++++++++++++++++++++++++++++++
include/linux/jbd2.h | 6 ++++
3 files changed, 122 insertions(+)

diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 280d11591bcb..2ef2dfb029e4 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -202,6 +202,46 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping,
return ret;
}

+/* Send all the data buffers related to an inode */
+int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode)
+{
+ struct address_space *mapping;
+ loff_t dirty_start;
+ loff_t dirty_end;
+ int ret;
+
+ if (!jinode)
+ return 0;
+
+ dirty_start = jinode->i_dirty_start;
+ dirty_end = jinode->i_dirty_end;
+
+ if (!(jinode->i_flags & JI_WRITE_DATA))
+ return 0;
+
+ dirty_start = jinode->i_dirty_start;
+ dirty_end = jinode->i_dirty_end;
+
+ mapping = jinode->i_vfs_inode->i_mapping;
+
+ trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
+ ret = journal_submit_inode_data_buffers(mapping, dirty_start,
+ dirty_end);
+
+ return ret;
+}
+EXPORT_SYMBOL(jbd2_submit_inode_data);
+
+int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode)
+{
+ if (!jinode || !(jinode->i_flags & JI_WAIT_DATA))
+ return 0;
+ return filemap_fdatawait_range_keep_errors(
+ jinode->i_vfs_inode->i_mapping, jinode->i_dirty_start,
+ jinode->i_dirty_end);
+}
+EXPORT_SYMBOL(jbd2_wait_inode_data);
+
/*
* Submit all the data buffers of inode associated with the transaction to
* disk.
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index d3897d155fb9..e4e0b55dd077 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -864,6 +864,82 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
return jbd2_journal_bmap(journal, blocknr, retp);
}

+/* Map one fast commit buffer for use by the file system */
+int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out)
+{
+ unsigned long long pblock;
+ unsigned long blocknr;
+ int ret = 0;
+ struct buffer_head *bh;
+ int fc_off;
+ journal_header_t *jhdr;
+
+ write_lock(&journal->j_state_lock);
+
+ if (journal->j_fc_off + journal->j_first_fc < journal->j_last_fc) {
+ fc_off = journal->j_fc_off;
+ blocknr = journal->j_first_fc + fc_off;
+ journal->j_fc_off++;
+ } else {
+ ret = -EINVAL;
+ }
+ write_unlock(&journal->j_state_lock);
+
+ if (ret)
+ return ret;
+
+ ret = jbd2_journal_bmap(journal, blocknr, &pblock);
+ if (ret)
+ return ret;
+
+ bh = __getblk(journal->j_dev, pblock, journal->j_blocksize);
+ if (!bh)
+ return -ENOMEM;
+
+ lock_buffer(bh);
+ jhdr = (journal_header_t *)bh->b_data;
+ jhdr->h_magic = cpu_to_be32(JBD2_MAGIC_NUMBER);
+ jhdr->h_blocktype = cpu_to_be32(JBD2_FC_BLOCK);
+ jhdr->h_sequence = cpu_to_be32(journal->j_running_transaction->t_tid);
+
+ set_buffer_uptodate(bh);
+ unlock_buffer(bh);
+ journal->j_fc_wbuf[fc_off] = bh;
+
+ *bh_out = bh;
+
+ return 0;
+}
+EXPORT_SYMBOL(jbd2_map_fc_buf);
+
+/*
+ * Wait on fast commit buffers that were allocated by jbd2_map_fc_buf
+ * for completion.
+ */
+int jbd2_wait_on_fc_bufs(journal_t *journal, int num_blks)
+{
+ struct buffer_head *bh;
+ int i, j_fc_off;
+
+ read_lock(&journal->j_state_lock);
+ j_fc_off = journal->j_fc_off;
+ read_unlock(&journal->j_state_lock);
+
+ /*
+ * Wait in reverse order to minimize chances of us being woken up before
+ * all IOs have completed
+ */
+ for (i = j_fc_off - 1; i >= j_fc_off - num_blks; i--) {
+ bh = journal->j_fc_wbuf[i];
+ wait_on_buffer(bh);
+ if (unlikely(!buffer_uptodate(bh)))
+ return -EIO;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(jbd2_wait_on_fc_bufs);
+
/*
* Conversion of logical to physical block numbers for the journal
*
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 0a4d9d484528..599113bef67f 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -123,6 +123,7 @@ typedef struct journal_s journal_t; /* Journal control structure */
#define JBD2_SUPERBLOCK_V1 3
#define JBD2_SUPERBLOCK_V2 4
#define JBD2_REVOKE_BLOCK 5
+#define JBD2_FC_BLOCK 6

/*
* Standard header for all descriptor blocks:
@@ -1562,6 +1563,11 @@ int jbd2_start_async_fc_nowait(journal_t *journal, tid_t tid);
int jbd2_start_async_fc_wait(journal_t *journal, tid_t tid);
void jbd2_stop_async_fc(journal_t *journal, tid_t tid);
void jbd2_init_fast_commit(journal_t *journal, int num_fc_blks);
+int jbd2_map_fc_buf(journal_t *journal, struct buffer_head **bh_out);
+int jbd2_wait_on_fc_bufs(journal_t *journal, int num_blks);
+int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode);
+int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode);
+
/*
* is_journal_abort
*
--
2.26.0.110.g2183baf09c-goog
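
The bookkeeping behind jbd2_map_fc_buf() and jbd2_wait_on_fc_bufs() can be sketched in user space. This is an illustration under stated assumptions, not the JBD2 implementation: `struct fc_area`, `fc_next_block()` and `fc_wait_window()` are hypothetical names, the fields mirror `j_first_fc`, `j_last_fc` and `j_fc_off` from the patch, and the `j_state_lock` locking and all buffer I/O are omitted.

```c
#include <errno.h>

/* Hypothetical stand-in for the fast commit fields of journal_t. */
struct fc_area {
	unsigned long first_fc;  /* first journal block of the fast commit area */
	unsigned long last_fc;   /* upper bound; the check below never hands it out */
	unsigned long fc_off;    /* number of blocks handed out so far */
};

/*
 * Hand out the next free fast commit block, using the same bounds check
 * as jbd2_map_fc_buf() in the patch. Returns 0 and stores the journal
 * block number in *blocknr, or -EINVAL once the area is exhausted.
 */
static int fc_next_block(struct fc_area *fc, unsigned long *blocknr)
{
	if (fc->first_fc + fc->fc_off >= fc->last_fc)
		return -EINVAL;
	*blocknr = fc->first_fc + fc->fc_off;
	fc->fc_off++;
	return 0;
}

/*
 * Compute the index window walked by jbd2_wait_on_fc_bufs(): the last
 * num_blks buffers handed out, visited from *newest down to *oldest.
 */
static void fc_wait_window(const struct fc_area *fc, unsigned long num_blks,
			   long *newest, long *oldest)
{
	*newest = (long)fc->fc_off - 1;
	*oldest = (long)fc->fc_off - (long)num_blks;
}
```

A caller that has written N blocks in one fast commit would pass N as `num_blks`, so that only the buffers of that commit are waited on.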

2020-04-08 22:01:48

by harshad shirwadkar

Subject: [PATCH v6 16/20] ext4: fast commit recovery path preparation

From: Harshad Shirwadkar <[email protected]>

Prepare for the ext4 fast commit recovery path changes. Make a few
existing static functions visible outside their files. Split up
ext4_get_inode_loc and add a wrapper around it so that an inode can be
read from disk without a corresponding VFS inode.

Signed-off-by: Harshad Shirwadkar <[email protected]>
---
fs/ext4/ext4.h | 7 ++++
fs/ext4/inode.c | 64 +++++++++++++++++++++++++++----------
fs/ext4/ioctl.c | 6 ++--
fs/ext4/namei.c | 2 +-
include/trace/events/ext4.h | 8 ++---
5 files changed, 63 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c4e74bcbbf90..7c9ca8b962f8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2740,6 +2740,8 @@ extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid);

/* inode.c */
+void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+ struct ext4_inode_info *ei);
int ext4_inode_is_fast_symlink(struct inode *inode);
struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
@@ -2786,6 +2788,8 @@ extern int ext4_sync_inode(handle_t *, struct inode *);
extern void ext4_dirty_inode(struct inode *, int);
extern int ext4_change_inode_journal_flag(struct inode *, int);
extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
+extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+ struct ext4_iloc *iloc);
extern int ext4_inode_attach_jinode(struct inode *inode);
extern int ext4_can_truncate(struct inode *inode);
extern int ext4_truncate(struct inode *);
@@ -2819,12 +2823,15 @@ extern int ext4_ind_remove_space(handle_t *handle, struct inode *inode,
/* ioctl.c */
extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
+extern void ext4_reset_inode_seed(struct inode *inode);

/* migrate.c */
extern int ext4_ext_migrate(struct inode *);
extern int ext4_ind_migrate(struct inode *inode);

/* namei.c */
+extern int ext4_init_new_dir(handle_t *handle, struct inode *dir,
+ struct inode *inode);
extern int ext4_dirblock_csum_verify(struct inode *inode,
struct buffer_head *bh);
extern int ext4_orphan_add(handle_t *, struct inode *);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9cbd3f98c5f3..b5ca07497bbc 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -101,8 +101,8 @@ static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw,
return provided == calculated;
}

-static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
- struct ext4_inode_info *ei)
+void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+ struct ext4_inode_info *ei)
{
__u32 csum;

@@ -4251,22 +4251,22 @@ int ext4_truncate(struct inode *inode)
* data in memory that is needed to recreate the on-disk version of this
* inode.
*/
-static int __ext4_get_inode_loc(struct inode *inode,
- struct ext4_iloc *iloc, int in_mem)
+static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
+ struct ext4_iloc *iloc, int in_mem,
+ ext4_fsblk_t *ret_block)
{
struct ext4_group_desc *gdp;
struct buffer_head *bh;
- struct super_block *sb = inode->i_sb;
ext4_fsblk_t block;
struct blk_plug plug;
int inodes_per_block, inode_offset;

iloc->bh = NULL;
- if (inode->i_ino < EXT4_ROOT_INO ||
- inode->i_ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
+ if (ino < EXT4_ROOT_INO ||
+ ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
return -EFSCORRUPTED;

- iloc->block_group = (inode->i_ino - 1) / EXT4_INODES_PER_GROUP(sb);
+ iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
gdp = ext4_get_group_desc(sb, iloc->block_group, NULL);
if (!gdp)
return -EIO;
@@ -4275,7 +4275,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
* Figure out the offset within the block group inode table
*/
inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
- inode_offset = ((inode->i_ino - 1) %
+ inode_offset = ((ino - 1) %
EXT4_INODES_PER_GROUP(sb));
block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
@@ -4376,7 +4376,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
* has in-inode xattrs, or we don't have this inode in memory.
* Read the block from disk.
*/
- trace_ext4_load_inode(inode);
+ trace_ext4_load_inode(sb, ino);
get_bh(bh);
bh->b_end_io = end_buffer_read_sync;
submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh);
@@ -4384,8 +4384,8 @@ static int __ext4_get_inode_loc(struct inode *inode,
wait_on_buffer(bh);
if (!buffer_uptodate(bh)) {
simulate_eio:
- ext4_error_inode_block(inode, block, EIO,
- "unable to read itable block");
+ if (ret_block)
+ *ret_block = block;
brelse(bh);
return -EIO;
}
@@ -4395,11 +4395,43 @@ static int __ext4_get_inode_loc(struct inode *inode,
return 0;
}

+static int __ext4_get_inode_loc_noinmem(struct inode *inode,
+ struct ext4_iloc *iloc)
+{
+ ext4_fsblk_t err_blk;
+ int ret;
+
+ ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc, 0,
+ &err_blk);
+
+ if (ret == -EIO)
+ ext4_error_inode_block(inode, err_blk, EIO,
+ "unable to read itable block");
+
+ return ret;
+}
+
int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
{
+ ext4_fsblk_t err_blk;
+ int ret;
+
/* We have all inode data except xattrs in memory here. */
- return __ext4_get_inode_loc(inode, iloc,
- !ext4_test_inode_state(inode, EXT4_STATE_XATTR));
+ ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc,
+ !ext4_test_inode_state(inode, EXT4_STATE_XATTR), &err_blk);
+
+ if (ret == -EIO)
+ ext4_error_inode_block(inode, err_blk, EIO,
+ "unable to read itable block");
+
+ return ret;
+}
+
+
+int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+ struct ext4_iloc *iloc)
+{
+ return __ext4_get_inode_loc(sb, ino, iloc, 0, NULL);
}

static bool ext4_should_use_dax(struct inode *inode)
@@ -4551,7 +4583,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
ei = EXT4_I(inode);
iloc.bh = NULL;

- ret = __ext4_get_inode_loc(inode, &iloc, 0);
+ ret = __ext4_get_inode_loc_noinmem(inode, &iloc);
if (ret < 0)
goto bad_inode;
raw_inode = ext4_raw_inode(&iloc);
@@ -5142,7 +5174,7 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
} else {
struct ext4_iloc iloc;

- err = __ext4_get_inode_loc(inode, &iloc, 0);
+ err = __ext4_get_inode_loc_noinmem(inode, &iloc);
if (err)
return err;
/*
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index f66bcf185f5b..93523709f039 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -86,7 +86,7 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
i_size_write(inode2, isize);
}

-static void reset_inode_seed(struct inode *inode)
+void ext4_reset_inode_seed(struct inode *inode)
{
struct ext4_inode_info *ei = EXT4_I(inode);
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -199,8 +199,8 @@ static long swap_inode_boot_loader(struct super_block *sb,

inode->i_generation = prandom_u32();
inode_bl->i_generation = prandom_u32();
- reset_inode_seed(inode);
- reset_inode_seed(inode_bl);
+ ext4_reset_inode_seed(inode);
+ ext4_reset_inode_seed(inode_bl);

ext4_discard_preallocations(inode);

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 2d9c3767d8d6..3e69006e79f4 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2742,7 +2742,7 @@ struct ext4_dir_entry_2 *ext4_init_dot_dotdot(struct inode *inode,
return ext4_next_entry(de, blocksize);
}

-static int ext4_init_new_dir(handle_t *handle, struct inode *dir,
+int ext4_init_new_dir(handle_t *handle, struct inode *dir,
struct inode *inode)
{
struct buffer_head *dir_block = NULL;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 4f7c9c00910e..8f31fd427ccc 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1755,9 +1755,9 @@ TRACE_EVENT(ext4_ext_load_extent,
);

TRACE_EVENT(ext4_load_inode,
- TP_PROTO(struct inode *inode),
+ TP_PROTO(struct super_block *sb, unsigned long ino),

- TP_ARGS(inode),
+ TP_ARGS(sb, ino),

TP_STRUCT__entry(
__field( dev_t, dev )
@@ -1765,8 +1765,8 @@ TRACE_EVENT(ext4_load_inode,
),

TP_fast_assign(
- __entry->dev = inode->i_sb->s_dev;
- __entry->ino = inode->i_ino;
+ __entry->dev = sb->s_dev;
+ __entry->ino = ino;
),

TP_printk("dev %d,%d ino %ld",
--
2.26.0.110.g2183baf09c-goog
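
Once __ext4_get_inode_loc() takes a superblock and an inode number, locating an inode is pure arithmetic on the file system geometry. The sketch below reproduces that arithmetic in user space. `struct fs_geom`, `struct inode_loc` and `locate_inode()` are hypothetical names, and the geometry values are supplied by the caller, whereas the kernel reads them from the superblock and adds the group's inode table start to the block index.

```c
#include <stdint.h>

/* Hypothetical geometry; the kernel derives these from the superblock. */
struct fs_geom {
	uint32_t inodes_per_group;
	uint32_t inodes_per_block;
	uint32_t inode_size;        /* bytes */
};

struct inode_loc {
	uint32_t block_group;  /* group whose inode table holds the inode */
	uint32_t table_block;  /* block index within that group's inode table */
	uint32_t offset;       /* byte offset of the inode within that block */
};

/*
 * Locate inode number ino (inode numbers start at 1). The caller must
 * reject out-of-range numbers first, as the kernel does before this step.
 */
static struct inode_loc locate_inode(const struct fs_geom *g, uint32_t ino)
{
	struct inode_loc loc;
	uint32_t index = (ino - 1) % g->inodes_per_group;

	loc.block_group = (ino - 1) / g->inodes_per_group;
	loc.table_block = index / g->inodes_per_block;
	loc.offset = (index % g->inodes_per_block) * g->inode_size;
	return loc;
}
```

None of these steps needs a `struct inode`, which is what lets ext4_get_fc_inode_loc() work from just `sb` and `ino`.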

2020-04-09 22:10:12

by Andreas Dilger

Subject: Re: [PATCH v6 01/20] ext4: update docs for fast commit feature

On Apr 8, 2020, at 3:55 PM, Harshad Shirwadkar <[email protected]> wrote:
>
> From: Harshad Shirwadkar <[email protected]>
>
> This patch series adds support for fast commits which is a simplified
> version of the scheme proposed by Park and Shin, in their paper,
> "iJournaling: Fine-Grained Journaling for Improving the Latency of
> Fsync System Call"[1]. The basic idea of fast commits is to make JBD2
> give the client file system an opportunity to perform a faster
> commit. Only if the file system cannot perform such a commit
> operation, then JBD2 should fall back to traditional commits.
>
>
> Signed-off-by: Harshad Shirwadkar <[email protected]>

It's not clear if all of this was intended to be in a 00/20 email,
but having it (probably minus full diffstat) in Git is OK as well.

Reviewed-by: Andreas Dilger <[email protected]>

> ---
> Documentation/filesystems/ext4/journal.rst | 127 ++++++++++++++++++++-
> Documentation/filesystems/journalling.rst | 18 +++
> 2 files changed, 139 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> index ea613ee701f5..f94e66f2f8c4 100644
> --- a/Documentation/filesystems/ext4/journal.rst
> +++ b/Documentation/filesystems/ext4/journal.rst
> @@ -29,10 +29,10 @@ safest. If ``data=writeback``, dirty data blocks are not flushed to the
> disk before the metadata are written to disk through the journal.
>
> The journal inode is typically inode 8. The first 68 bytes of the
> -journal inode are replicated in the ext4 superblock. The journal itself
> -is normal (but hidden) file within the filesystem. The file usually
> -consumes an entire block group, though mke2fs tries to put it in the
> -middle of the disk.
> +journal inode are replicated in the ext4 superblock. The journal
> +itself is normal (but hidden) file within the filesystem. The file
> +usually consumes an entire block group, though mke2fs tries to put it
> +in the middle of the disk.
>
> All fields in jbd2 are written to disk in big-endian order. This is the
> opposite of ext4.
> @@ -42,22 +42,74 @@ NOTE: Both ext4 and ocfs2 use jbd2.
> The maximum size of a journal embedded in an ext4 filesystem is 2^32
> blocks. jbd2 itself does not seem to care.
>
> +Fast Commits
> +~~~~~~~~~~~~
> +
> +Ext4 also implements fast commits and integrates them with JBD2 journalling.
> +Fast commits store metadata changes made to the file system as inode-level
> +diffs. In other words, each fast commit block identifies updates made to
> +a particular inode, and collectively they represent the total changes made
> +to the file system.
> +
> +A fast commit is valid only if there is no full commit after that particular
> +fast commit. Because of this feature, fast commit blocks can be reused by
> +the following transactions.
> +
> +Each fast commit block stores updates to one particular inode. Updates in each
> +fast commit block are one of two types:
> +- Data updates (add range / delete range)
> +- Directory entry updates (Add / remove links)
> +
> +Fast commit blocks must be replayed in the order in which they appear on disk.
> +That's because directory entry updates are written in fast commit blocks
> +in the order in which they were applied to the file system before the crash.
> +Changing the replay order of directory entry updates may result in an
> +inconsistent file system. Note that only directory entry updates need
> +ordering; data updates, since they apply to only one inode, do not require
> +ordered replay. Also, fast commits guarantee that the file system is in a
> +consistent state after the replay of each fast commit block, as long as the
> +order of replay has been followed.
> +
> +Note that directory inode updates are never directly recorded in fast commits.
> +Just like other file system level metadata, updates to directories are always
> +implied based on directory entry updates stored in fast commit blocks.
> +
> +Based on which directory entry updates are committed with an inode, fast
> +commits have two modes of operation:
> +
> +- Hard Consistency (default)
> +- Soft Consistency (can be enabled by setting mount flag "fc_soft_consistency")
> +
> +When hard consistency is enabled, fast commit guarantees that all the updates
> +will be committed. After a successful replay of fast commit blocks
> +in hard consistency mode, the entire file system would be in the same state as
> +it was when fsync() returned before the crash. This guarantee is similar to
> +what jbd2 gives with full commits.
> +
> +With soft consistency, the file system only guarantees consistency for the
> +inode in question. In this mode, the file system will try to write as little
> +data to the backend as possible at commit time. To be precise, the file system
> +records all the data updates for the inode in question and the directory
> +updates that are required for guaranteeing consistency of that inode.
> +
> Layout
> ~~~~~~
>
> Generally speaking, the journal has this format:
>
> .. list-table::
> - :widths: 16 48 16
> + :widths: 16 48 16 18
> :header-rows: 1
>
> * - Superblock
> - descriptor\_block (data\_blocks or revocation\_block) [more data or
> revocations] commmit\_block
> - [more transactions...]
> + - [Fast commits...]
> * -
> - One transaction
> -
> + -
>
> Notice that a transaction begins with either a descriptor and some data,
> or a block revocation list. A finished transaction always ends with a
> @@ -76,7 +128,7 @@ The journal superblock will be in the next full block after the
> superblock.
>
> .. list-table::
> - :widths: 12 12 12 32 12
> + :widths: 12 12 12 32 12 12
> :header-rows: 1
>
> * - 1024 bytes of padding
> @@ -85,11 +137,13 @@ superblock.
> - descriptor\_block (data\_blocks or revocation\_block) [more data or
> revocations] commmit\_block
> - [more transactions...]
> + - [Fast commits...]
> * -
> -
> -
> - One transaction
> -
> + -
>
> Block Header
> ~~~~~~~~~~~~
> @@ -609,3 +663,64 @@ bytes long (but uses a full block):
> - h\_commit\_nsec
> - Nanoseconds component of the above timestamp.
>
> +Fast Commit Block
> +~~~~~~~~~~~~~~~~~
> +
> +The fast commit block indicates an append to the last commit block
> +that was written to the journal. One fast commit block records updates
> +to one inode. So, typically you would find as many fast commit blocks
> +as the number of inodes that got changed since the last commit. A fast
> +commit block is valid only if there is no commit block present with a
> +transaction ID greater than that of the fast commit block. If such a
> +block is present, then there is no need to replay the fast commit
> +block.
> +
> +.. list-table::
> + :widths: 8 8 24 40
> + :header-rows: 1
> +
> + * - Offset
> + - Type
> + - Name
> + - Descriptor
> + * - 0x0
> + - journal\_header\_s
> + - (open coded)
> + - Common block header.
> + * - 0xC
> + - \_\_le32
> + - fc\_magic
> + - Magic value which should be set to 0xE2540090. This identifies
> + that this block is a fast commit block.
> + * - 0x10
> + - \_\_u8
> + - fc\_features
> + - Features used by this fast commit block.
> + * - 0x11
> + - \_\_le16
> + - fc\_num\_tlvs
> + - Number of TLVs contained in this fast commit block
> + * - 0x13
> + - \_\_le32
> + - \_\_fc\_len
> + - Length of the fast commit block in terms of number of blocks
> + * - 0x17
> + - \_\_le32
> + - fc\_ino
> + - Inode number of the inode that will be recovered using this fast commit
> + * - 0x2B
> + - struct ext4\_inode
> + - inode
> + - On-disk copy of the inode at the commit time
> + * - <Variable based on inode size>
> + - struct ext4\_fc\_tl
> + - Array of struct ext4\_fc\_tl
> + - The actual delta with the last commit. Starting at this offset,
> + there is an array of TLVs that indicates which extents
> + should be present in the corresponding inode. Currently, the
> + following tags are supported: EXT4\_FC\_TAG\_EXT (extent that
> + should be present in the inode), EXT4\_FC\_TAG\_HOLE (extent
> + that should be removed from the inode), EXT4\_FC\_TAG\_ADD\_DENTRY
> + (dentry that should be linked), EXT4\_FC\_TAG\_DEL\_DENTRY
> + (dentry that should be unlinked), EXT4\_FC\_TAG\_CREATE\_DENTRY
> + (dentry for the file that should be created for the first time).
> diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> index 58ce6b395206..1cb116ab27ab 100644
> --- a/Documentation/filesystems/journalling.rst
> +++ b/Documentation/filesystems/journalling.rst
> @@ -115,6 +115,24 @@ called after each transaction commit. You can also use
> ``transaction->t_private_list`` for attaching entries to a transaction
> that need processing when the transaction commits.
>
> +JBD2 also allows client file systems to implement file system specific
> +commits, which are called ``fast commits``. Fast commits are
> +asynchronous in nature, i.e. file systems can call their own commit
> +functions at any time. In order to avoid racing with the kjournald
> +thread and other fast commits that may be happening in
> +parallel, file systems should first call
> +:c:func:`jbd2_start_async_fc()`. The file system can then call
> +:c:func:`jbd2_map_fc_buf()` to get buffers reserved for fast
> +commits. Once a fast commit is completed, the file system should call
> +:c:func:`jbd2_stop_async_fc()` to unblock other
> +committers and the kjournald thread. After performing either a fast
> +or a full commit, JBD2 calls ``journal->j_fc_cleanup_cb`` to allow
> +file systems to clean up their internal fast commit
> +related data structures. At replay time, JBD2 passes each
> +fast commit block to the file system via
> +``journal->j_fc_replay_cb``. Ext4 uses this fast commit
> +mechanism to improve journal commit performance.
> +
> JBD2 also provides a way to block all transaction updates via
> :c:func:`jbd2_journal_lock_updates()` /
> :c:func:`jbd2_journal_unlock_updates()`. Ext4 uses this when it wants a
> --
> 2.26.0.110.g2183baf09c-goog
>


Cheers, Andreas
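
The validity rule in the quoted "Fast Commit Block" section (a fast commit block is replayed only if no commit block with a greater transaction ID exists) reduces to one comparison of transaction IDs. The sketch below is an illustration, not the replay code from this series: `tid_newer()` follows the wraparound-safe signed-difference idea of `tid_gt()` in include/linux/jbd2.h, while `fc_block_needs_replay()` and its parameters are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Wraparound-safe "x is newer than y": transaction IDs are 32-bit
 * counters, so compare the signed difference instead of the raw values.
 */
static bool tid_newer(uint32_t x, uint32_t y)
{
	return (int32_t)(x - y) > 0;
}

/*
 * A fast commit block tagged with fc_tid must be replayed unless a full
 * commit with a greater transaction ID is present in the journal.
 */
static bool fc_block_needs_replay(uint32_t fc_tid, uint32_t last_full_commit_tid)
{
	return !tid_newer(last_full_commit_tid, fc_tid);
}
```

A full commit with the same transaction ID does not invalidate the block, since the rule only excludes strictly greater IDs.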





