2012-10-26 13:12:42

by Zheng Liu

[permalink] [raw]
Subject: [PATCH 0/8 v3] ext4: extent status tree (step 1)

Hi all,

Here is the v3 of extent status tree. In this version, the biggest change is
the i_es_lock (a rwlock_t) instead of i_data_sem, which is introduced to protect
extent status tree. Moreover I improve the SEEK_DATA/SEEK_HOLE. In previous
version, the unwritten extent is as a data, and now it will lookup page cache to
determine it as a data or a hole. When it has some data at the given range, the
unwritten extent will be as a data. Otherwise, it will be as a hole. Other
changes contain some bug fixes.

About introduction of extent status tree and future works in step2, please see
the comments in patch 2.

Here is a defect after introducing SEEK_DATA/SEEK_HOLE support, which will cause
xfstest #285 failure when the file is block-mapped because in ext4 fallocate(2)
is not supported by block-mapped file. However, in current version of xfstest,
it cannot trigger any failure because in xfstest the test case has a bug that
will cause no error is reported from #285. I have submitted a patch to fix it,
but it still doesn't be applied. Thus, if someone are willing to test
SEEK_DATA/SEEK_HOLE patch, please apply this patch firstly [1].

1. http://patchwork.xfs.org/patch/4276/

I really thanks Jeff Liu, who gives me some advices and inspires me about
SEEK_DATA/SEEK_HOLE improvement, and the function, which looks up page cache to
determine whether an unwritten extent is as a data or a hole, is inspired by
his work for xfs (d126d43f631f996daeee5006714fed914be32368). Thus, I add a
Signed-off-by for his crediting.

Jeff, I really appreicate if you could allow me to add Signed-off-by in patch 8.
If doing like this is incorrect, please let me know. Thanks!

v3 <- v2:
- add rwlock to protect extent status tree
- improve SEEK_DATA/SEEK_HOLE
- fix some bugs

Here is the second version:
http://lwn.net/Articles/512899/

v2 <- v1:
- add a BUG_ON to do a sanity check in extent_status_end
- fix off-by-one problem in extent_status_end accroding to Lukas's comment
- add more detailed comments in ext4_es_find_extent
- try to lookup in extent tree cache firstly in ext4_es_find_extent
- rename ext4_es_add_space to ext4_es_insert_extent
- rename ext4_es_remove_space to ext4_es_remove_extent

The first version is in this link:
http://www.spinics.net/lists/linux-ext4/msg33101.html

Any feedbacks are appreciated. Thanks!

Regards,
Zheng
---
Zheng Liu (8):
ext4: add two structures supporting extent status tree
ext4: add operations on extent status tree
ext4: initialize extent status tree
ext4: let ext4 maintain extent status tree
ext4: add some tracepoints in extent status tree
ext4: reimplement ext4_find_delay_alloc_range on extent status tree
ext4: reimplement fiemap on extent status tree
ext4: introduce lseek SEEK_DATA/SEEK_HOLE support

fs/ext4/Makefile | 2 +-
fs/ext4/ext4.h | 10 +-
fs/ext4/ext4_extents.h | 3 +-
fs/ext4/extents.c | 305 +++++-----------------------------
fs/ext4/extents_status.c | 499 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/extents_status.h | 45 +++++
fs/ext4/file.c | 334 ++++++++++++++++++++++++++++++++++++-
fs/ext4/indirect.c | 1 +
fs/ext4/inode.c | 91 ++++------
fs/ext4/super.c | 14 +-
include/trace/events/ext4.h | 101 +++++++++++
11 files changed, 1078 insertions(+), 327 deletions(-)


2012-10-26 13:12:46

by Zheng Liu

[permalink] [raw]
Subject: [PATCH 1/8 v3] ext4: add two structures supporting extent status tree

From: Zheng Liu <[email protected]>

This patch adds two structures that supports extent status tree, extent_status
and ext4_es_tree. Currently extent_status is used to track a delay extent for
an inode, which record the start block and the length of the delay extent.
ext4_es_tree is used to store all extent_status for an inode in memory.

CC: "Theodore Ts'o" <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
Signed-off-by: Allison Henderson <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
v3 <- v2:
- add i_es_lock to protect extent status tree

fs/ext4/ext4.h | 6 ++++++
fs/ext4/extents_status.h | 25 +++++++++++++++++++++++++
2 files changed, 31 insertions(+)
create mode 100644 fs/ext4/extents_status.h

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3c20de1..bcc634b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -811,6 +811,8 @@ struct ext4_ext_cache {
__u32 ec_len; /* must be 32bit to return holes */
};

+#include "extents_status.h"
+
/*
* fourth extended file system inode data in memory
*/
@@ -888,6 +890,10 @@ struct ext4_inode_info {
struct list_head i_prealloc_list;
spinlock_t i_prealloc_lock;

+ /* extents status tree */
+ struct ext4_es_tree i_es_tree;
+ rwlock_t i_es_lock;
+
/* ialloc */
ext4_group_t i_last_alloc_group;

diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
new file mode 100644
index 0000000..8be2ab9
--- /dev/null
+++ b/fs/ext4/extents_status.h
@@ -0,0 +1,25 @@
+/*
+ * fs/ext4/extents_status.h
+ *
+ * Written by Yongqiang Yang <[email protected]>
+ * Modified by
+ * Allison Henderson <[email protected]>
+ * Zheng Liu <[email protected]>
+ *
+ */
+
+#ifndef _EXT4_EXTENTS_STATUS_H
+#define _EXT4_EXTENTS_STATUS_H
+
+struct extent_status {
+ struct rb_node rb_node;
+ ext4_lblk_t start; /* first block extent covers */
+ ext4_lblk_t len; /* length of extent in block */
+};
+
+struct ext4_es_tree {
+ struct rb_root root;
+ struct extent_status *cache_es; /* recently accessed extent */
+};
+
+#endif /* _EXT4_EXTENTS_STATUS_H */
--
1.7.12.rc2.18.g61b472e


2012-10-26 13:12:51

by Zheng Liu

[permalink] [raw]
Subject: [PATCH 2/8 v3] ext4: add operations on extent status tree

From: Zheng Liu <[email protected]>

This patch adds operations on a extent status tree.

CC: "Theodore Ts'o" <[email protected]>
CC: Lukas Czerner <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
Signed-off-by: Allison Henderson <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
v3 <- v2:
- improve try_to_merge_left/right() functions (Hugh, Thank you)
- fix some bugs on inserting/removing an extent
- use i_es_lock to protect extent status tree
- add some comments to describe the extent status tree and future works in
step2

fs/ext4/Makefile | 2 +-
fs/ext4/extents_status.c | 491 +++++++++++++++++++++++++++++++++++++++++++++++
fs/ext4/extents_status.h | 20 ++
3 files changed, 512 insertions(+), 1 deletion(-)
create mode 100644 fs/ext4/extents_status.c

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 56fd8f86..41f22be 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -7,7 +7,7 @@ obj-$(CONFIG_EXT4_FS) += ext4.o
ext4-y := balloc.o bitmap.o dir.o file.o fsync.o ialloc.o inode.o page-io.o \
ioctl.o namei.o super.o symlink.o hash.o resize.o extents.o \
ext4_jbd2.o migrate.o mballoc.o block_validity.o move_extent.o \
- mmp.o indirect.o
+ mmp.o indirect.o extents_status.o

ext4-$(CONFIG_EXT4_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext4-$(CONFIG_EXT4_FS_POSIX_ACL) += acl.o
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
new file mode 100644
index 0000000..048ae8c
--- /dev/null
+++ b/fs/ext4/extents_status.c
@@ -0,0 +1,491 @@
+/*
+ * fs/ext4/extents_status.c
+ *
+ * Written by Yongqiang Yang <[email protected]>
+ * Modified by
+ * Allison Henderson <[email protected]>
+ * Hugh Dickins <[email protected]>
+ * Zheng Liu <[email protected]>
+ *
+ * Ext4 extents status tree core functions.
+ */
+#include <linux/rbtree.h>
+#include "ext4.h"
+#include "extents_status.h"
+#include "ext4_extents.h"
+
+/*
+ * According to previous discussion in Ext4 Developer Workshop, we will
+ * introduce a new structure called io tree to track all extent status in order
+ * to solve some problems that we have met (e.g. Reservation space warning), and
+ * provide extent-level locking. Delay extent tree is the first step to achieve
+ * this goal. It is original built by Yongqiang Yang. At that time it is
+ * called delay extent tree, whose goal is only track delay extent in memory to
+ * simplify the implementation of fiemap and bigalloc, and introduce lseek
+ * SEEK_DATA/SEEK_HOLE support. That is why it is still called delay extent
+ * tree at the following comment. But for better understand what it does, it
+ * has been rename to extent status tree.
+ *
+ * Currently the first step has been done. All delay extents are tracked in the
+ * tree. It maintains the delay extent when a delay allocation is issued, and
+ * the delay extent is written out or invalidated. Therefore the implementation
+ * of fiemap and bigalloc are simplified, and SEEK_DATA/SEEK_HOLE are
+ * introduced.
+ *
+ * The following comment describes the implemenmtation of extent status tree and
+ * future works.
+ */
+
+/*
+ * extents status tree implementation for ext4.
+ *
+ *
+ * ==========================================================================
+ * Extents status encompass delayed extents and extent locks
+ *
+ * 1. Why delayed extent implementation ?
+ *
+ * Without delayed extent, ext4 identifies a delayed extent by looking up
+ * page cache, this has several deficiencies - complicated, buggy, and
+ * inefficient code.
+ *
+ * FIEMAP, SEEK_HOLE/DATA, bigalloc, punch hole and writeout all need to know if
+ * a block or a range of blocks are belonged to a delayed extent.
+ *
+ * Let us have a look at how they do without delayed extents implementation.
+ * -- FIEMAP
+ * FIEMAP looks up page cache to identify delayed allocations from holes.
+ *
+ * -- SEEK_HOLE/DATA
+ * SEEK_HOLE/DATA has the same problem as FIEMAP.
+ *
+ * -- bigalloc
+ * bigalloc looks up page cache to figure out if a block is already
+ * under delayed allocation or not to determine whether quota reserving
+ * is needed for the cluster.
+ *
+ * -- punch hole
+ * punch hole looks up page cache to identify a delayed extent.
+ *
+ * -- writeout
+ * Writeout looks up whole page cache to see if a buffer is mapped, If
+ * there are not very many delayed buffers, then it is time comsuming.
+ *
+ * With delayed extents implementation, FIEMAP, SEEK_HOLE/DATA, bigalloc and
+ * writeout can figure out if a block or a range of blocks is under delayed
+ * allocation(belonged to a delayed extent) or not by searching the delayed
+ * extent tree.
+ *
+ *
+ * ==========================================================================
+ * 2. ext4 delayed extents impelmentation
+ *
+ * -- delayed extent
+ * A delayed extent is a range of blocks which are contiguous logically and
+ * under delayed allocation. Unlike extent in ext4, delayed extent in ext4
+ * is a in-memory struct, there is no corresponding on-disk data. There is
+ * no limit on length of delayed extent, so a delayed extent can contain as
+ * many blocks as they are contiguous logically.
+ *
+ * -- delayed extent tree
+ * Every inode has a delayed extent tree and all under delayed allocation
+ * blocks are added to the tree as delayed extents. Delayed extents in
+ * the tree are ordered by logical block no.
+ *
+ * -- operations on a delayed extent tree
+ * There are three operations on a delayed extent tree: find next delayed
+ * extent, adding a space(a range of blocks) and removing a space.
+ *
+ * -- race on a delayed extent tree
+ * Delayed extent tree is protected inode->i_es_lock.
+ *
+ *
+ * ==========================================================================
+ * 3. performance analysis
+ * -- overhead
+ * 1. Apart from operations on a delayed extent tree, we need to
+ * down_write(inode->i_data_sem) in delayed write path to maintain delayed
+ * extent tree, this can have impact on parallel read-write and write-write
+ *
+ * 2. There is a cache extent for write access, so if writes are not very
+ * random, adding space operaions are in O(1) time.
+ *
+ * -- gain
+ * 3. Code is much simpler, more readable, more maintainable and
+ * more efficient.
+ *
+ *
+ * ==========================================================================
+ * 4. TODO list
+ * -- Track all extent status
+ *
+ * -- Improve get block process
+ *
+ * -- Extent-level locking
+ */
+
+static struct kmem_cache *ext4_es_cachep;
+
+int __init ext4_init_es(void)
+{
+ ext4_es_cachep = KMEM_CACHE(extent_status, SLAB_RECLAIM_ACCOUNT);
+ if (ext4_es_cachep == NULL)
+ return -ENOMEM;
+ return 0;
+}
+
+void ext4_exit_es(void)
+{
+ if (ext4_es_cachep)
+ kmem_cache_destroy(ext4_es_cachep);
+}
+
+void ext4_es_init_tree(struct ext4_es_tree *tree)
+{
+ tree->root = RB_ROOT;
+ tree->cache_es = NULL;
+}
+
+#ifdef ES_DEBUG__
+static void ext4_es_print_tree(struct inode *inode)
+{
+ struct ext4_es_tree *tree;
+ struct rb_node *node;
+
+ printk(KERN_DEBUG "status extents for inode %lu:", inode->i_ino);
+ tree = &EXT4_I(inode)->i_es_tree;
+ node = rb_first(&tree->root);
+ while (node) {
+ struct extent_status *es;
+ es = rb_entry(node, struct extent_status, rb_node);
+ printk(KERN_DEBUG " [%u/%u)", es->start, es->len);
+ node = rb_next(node);
+ }
+ printk(KERN_DEBUG "\n");
+}
+#else
+#define ext4_es_print_tree(inode)
+#endif
+
+static inline ext4_lblk_t extent_status_end(struct extent_status *es)
+{
+ BUG_ON(es->start + es->len < es->start);
+ return es->start + es->len - 1;
+}
+
+/*
+ * search through the tree for an delayed extent with a given offset. If
+ * it can't be found, try to find next extent.
+ */
+static struct extent_status *__es_tree_search(struct rb_root *root,
+ ext4_lblk_t offset)
+{
+ struct rb_node *node = root->rb_node;
+ struct extent_status *es = NULL;
+
+ while (node) {
+ es = rb_entry(node, struct extent_status, rb_node);
+ if (offset < es->start)
+ node = node->rb_left;
+ else if (offset > extent_status_end(es))
+ node = node->rb_right;
+ else
+ return es;
+ }
+
+ if (es && offset < es->start)
+ return es;
+
+ if (es && offset > extent_status_end(es)) {
+ node = rb_next(&es->rb_node);
+ return node ? rb_entry(node, struct extent_status, rb_node) :
+ NULL;
+ }
+
+ return NULL;
+}
+
+/*
+ * ext4_es_find_extent: find the 1st delayed extent covering @es->start
+ * if it exists, otherwise, the next extent after @es->start.
+ *
+ * @inode: the inode which owns delayed extents
+ * @es: delayed extent that we found
+ *
+ * Returns the first block of the next extent after es, otherwise
+ * EXT_MAX_BLOCKS if no delay extent is found.
+ * Delayed extent is returned via @es.
+ */
+ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
+{
+ struct ext4_es_tree *tree = NULL;
+ struct extent_status *es1 = NULL;
+ struct rb_node *node;
+ ext4_lblk_t ret = EXT_MAX_BLOCKS;
+
+ read_lock(&EXT4_I(inode)->i_es_lock);
+ tree = &EXT4_I(inode)->i_es_tree;
+
+ /* find delay extent in cache firstly */
+ if (tree->cache_es) {
+ es1 = tree->cache_es;
+ if (in_range(es->start, es1->start, es1->len)) {
+ es_debug("%u cached by [%u/%u)\n",
+ es->start, es1->start, es1->len);
+ goto out;
+ }
+ }
+
+ es->len = 0;
+ es1 = __es_tree_search(&tree->root, es->start);
+
+out:
+ if (es1) {
+ tree->cache_es = es1;
+ es->start = es1->start;
+ es->len = es1->len;
+ node = rb_next(&es1->rb_node);
+ if (node) {
+ es1 = rb_entry(node, struct extent_status, rb_node);
+ ret = es1->start;
+ }
+ }
+
+ read_unlock(&EXT4_I(inode)->i_es_lock);
+ return ret;
+}
+
+static struct extent_status *
+ext4_es_alloc_extent(ext4_lblk_t start, ext4_lblk_t len)
+{
+ struct extent_status *es;
+ es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
+ if (es == NULL)
+ return NULL;
+ es->start = start;
+ es->len = len;
+ return es;
+}
+
+static void ext4_es_free_extent(struct extent_status *es)
+{
+ kmem_cache_free(ext4_es_cachep, es);
+}
+
+static struct extent_status *
+ext4_es_try_to_merge_left(struct ext4_es_tree *tree, struct extent_status *es)
+{
+ struct extent_status *es1;
+ struct rb_node *node;
+
+ node = rb_prev(&es->rb_node);
+ if (!node)
+ return es;
+
+ es1 = rb_entry(node, struct extent_status, rb_node);
+ if (es->start == extent_status_end(es1) + 1) {
+ es1->len += es->len;
+ rb_erase(&es->rb_node, &tree->root);
+ ext4_es_free_extent(es);
+ es = es1;
+ }
+
+ return es;
+}
+
+static struct extent_status *
+ext4_es_try_to_merge_right(struct ext4_es_tree *tree, struct extent_status *es)
+{
+ struct extent_status *es1;
+ struct rb_node *node;
+
+ node = rb_next(&es->rb_node);
+ if (!node)
+ return es;
+
+ es1 = rb_entry(node, struct extent_status, rb_node);
+ if (es1->start == extent_status_end(es) + 1) {
+ es->len += es1->len;
+ rb_erase(node, &tree->root);
+ ext4_es_free_extent(es1);
+ }
+
+ return es;
+}
+
+static int __es_insert_extent(struct ext4_es_tree *tree, ext4_lblk_t offset,
+ ext4_lblk_t len)
+{
+ struct rb_node **p = &tree->root.rb_node;
+ struct rb_node *parent = NULL;
+ struct extent_status *es;
+ ext4_lblk_t end = offset + len - 1;
+
+ BUG_ON(end < offset);
+ es = tree->cache_es;
+ if (es && offset == (extent_status_end(es) + 1)) {
+ es_debug("cached by [%u/%u)\n", es->start, es->len);
+ es->len += len;
+ es = ext4_es_try_to_merge_right(tree, es);
+ goto out;
+ } else if (es && es->start == end + 1) {
+ es_debug("cached by [%u/%u)\n", es->start, es->len);
+ es->start = offset;
+ es->len += len;
+ es = ext4_es_try_to_merge_left(tree, es);
+ goto out;
+ } else if (es && es->start <= offset &&
+ end <= extent_status_end(es)) {
+ es_debug("cached by [%u/%u)\n", es->start, es->len);
+ goto out;
+ }
+
+ while (*p) {
+ parent = *p;
+ es = rb_entry(parent, struct extent_status, rb_node);
+
+ if (offset < es->start) {
+ if (es->start == end + 1) {
+ es->start = offset;
+ es->len += len;
+ es = ext4_es_try_to_merge_left(tree, es);
+ goto out;
+ }
+ p = &(*p)->rb_left;
+ } else if (offset > extent_status_end(es)) {
+ if (offset == extent_status_end(es) + 1) {
+ es->len += len;
+ es = ext4_es_try_to_merge_right(tree, es);
+ goto out;
+ }
+ p = &(*p)->rb_right;
+ } else {
+ if (extent_status_end(es) <= end)
+ es->len = offset - es->start + len;
+ goto out;
+ }
+ }
+
+ es = ext4_es_alloc_extent(offset, len);
+ if (!es)
+ return -ENOMEM;
+ rb_link_node(&es->rb_node, parent, p);
+ rb_insert_color(&es->rb_node, &tree->root);
+
+out:
+ tree->cache_es = es;
+ return 0;
+}
+
+/*
+ * ext4_es_insert_extent() adds a space to a delayed extent tree.
+ * Caller holds inode->i_es_lock.
+ *
+ * ext4_es_insert_extent is called by ext4_da_write_begin and
+ * ext4_es_remove_extent.
+ *
+ * Return 0 on success, error code on failure.
+ */
+int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t offset,
+ ext4_lblk_t len)
+{
+ struct ext4_es_tree *tree;
+ int err = 0;
+
+ es_debug("add [%u/%u) to extent status tree of inode %lu\n",
+ offset, len, inode->i_ino);
+
+ write_lock(&EXT4_I(inode)->i_es_lock);
+ tree = &EXT4_I(inode)->i_es_tree;
+ err = __es_insert_extent(tree, offset, len);
+ write_unlock(&EXT4_I(inode)->i_es_lock);
+
+ ext4_es_print_tree(inode);
+
+ return err;
+}
+
+/*
+ * ext4_es_remove_extent() removes a space from a delayed extent tree.
+ * Caller holds inode->i_es_lock.
+ *
+ * Return 0 on success, error code on failure.
+ */
+int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
+ ext4_lblk_t len)
+{
+ struct rb_node *node;
+ struct ext4_es_tree *tree;
+ struct extent_status *es;
+ struct extent_status orig_es;
+ ext4_lblk_t len1, len2, end;
+ int err = 0;
+
+ es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
+ offset, len, inode->i_ino);
+
+ end = offset + len - 1;
+ BUG_ON(end < offset);
+ write_lock(&EXT4_I(inode)->i_es_lock);
+ tree = &EXT4_I(inode)->i_es_tree;
+ es = __es_tree_search(&tree->root, offset);
+ if (!es)
+ goto out;
+ if (es->start > end)
+ goto out;
+
+ /* Simply invalidate cache_es. */
+ tree->cache_es = NULL;
+
+ orig_es.start = es->start;
+ orig_es.len = es->len;
+ len1 = offset > es->start ? offset - es->start : 0;
+ len2 = extent_status_end(es) > end ?
+ extent_status_end(es) - end : 0;
+ if (len1 > 0)
+ es->len = len1;
+ if (len2 > 0) {
+ if (len1 > 0) {
+ err = __es_insert_extent(tree, end + 1, len2);
+ if (err) {
+ es->start = orig_es.start;
+ es->len = orig_es.len;
+ goto out;
+ }
+ } else {
+ es->start = end + 1;
+ es->len = len2;
+ }
+ goto out;
+ }
+
+ if (len1 > 0) {
+ node = rb_next(&es->rb_node);
+ if (node)
+ es = rb_entry(node, struct extent_status, rb_node);
+ else
+ es = NULL;
+ }
+
+ while (es && extent_status_end(es) <= end) {
+ node = rb_next(&es->rb_node);
+ rb_erase(&es->rb_node, &tree->root);
+ ext4_es_free_extent(es);
+ if (!node) {
+ es = NULL;
+ break;
+ }
+ es = rb_entry(node, struct extent_status, rb_node);
+ }
+
+ if (es && es->start < end + 1) {
+ len1 = extent_status_end(es) - end;
+ es->start = end + 1;
+ es->len = len1;
+ }
+
+out:
+ write_unlock(&EXT4_I(inode)->i_es_lock);
+ ext4_es_print_tree(inode);
+ return err;
+}
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 8be2ab9..077f82d 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -11,6 +11,15 @@
#ifndef _EXT4_EXTENTS_STATUS_H
#define _EXT4_EXTENTS_STATUS_H

+/*
+ * Turn on ES_DEBUG__ to get lots of info about extent status operations.
+ */
+#ifdef ES_DEBUG__
+#define es_debug(fmt, ...) printk(fmt, ##__VA_ARGS__)
+#else
+#define es_debug(fmt, ...) no_printk(fmt, ##__VA_ARGS__)
+#endif
+
struct extent_status {
struct rb_node rb_node;
ext4_lblk_t start; /* first block extent covers */
@@ -22,4 +31,15 @@ struct ext4_es_tree {
struct extent_status *cache_es; /* recently accessed extent */
};

+extern int __init ext4_init_es(void);
+extern void ext4_exit_es(void);
+extern void ext4_es_init_tree(struct ext4_es_tree *tree);
+
+extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t start,
+ ext4_lblk_t len);
+extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t start,
+ ext4_lblk_t len);
+extern ext4_lblk_t ext4_es_find_extent(struct inode *inode,
+ struct extent_status *es);
+
#endif /* _EXT4_EXTENTS_STATUS_H */
--
1.7.12.rc2.18.g61b472e


2012-10-26 13:12:55

by Zheng Liu

[permalink] [raw]
Subject: [PATCH 3/8 v3] ext4: initialize extent status tree

From: Zheng Liu <[email protected]>

Let ext4 initialize extent status tree of an inode.

CC: "Theodore Ts'o" <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
Signed-off-by: Allison Henderson <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
v3 <- v2:
- initialize i_es_lock

fs/ext4/super.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 80928f7..e0c824b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -943,6 +943,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
memset(&ei->i_cached_extent, 0, sizeof(struct ext4_ext_cache));
INIT_LIST_HEAD(&ei->i_prealloc_list);
spin_lock_init(&ei->i_prealloc_lock);
+ ext4_es_init_tree(&ei->i_es_tree);
+ rwlock_init(&ei->i_es_lock);
ei->i_reserved_data_blocks = 0;
ei->i_reserved_meta_blocks = 0;
ei->i_allocated_meta_blocks = 0;
--
1.7.12.rc2.18.g61b472e


2012-10-26 13:12:59

by Zheng Liu

[permalink] [raw]
Subject: [PATCH 4/8 v3] ext4: let ext4 maintain extent status tree

From: Zheng Liu <[email protected]>

This patch lets ext4 maintain extent status tree.

Currently it only tracks delay extent status in extent status tree. When a
delay allocation is issued, the related delay extent will be inserted into
extent status tree. When a delay extent is written out or invalidated, it will
be removed from this tree.

CC: "Theodore Ts'o" <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
Signed-off-by: Allison Henderson <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
v3 <- v2:
- remove delay extent in ext4_da_page_release_reservation() and
ext4_da_block_invalidatepages().

fs/ext4/extents.c | 4 ++++
fs/ext4/indirect.c | 1 +
fs/ext4/inode.c | 38 +++++++++++++++++++++++++++++++++++---
fs/ext4/super.c | 12 +++++++++++-
4 files changed, 51 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 7011ac9..0e1a925 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4344,6 +4344,8 @@ void ext4_ext_truncate(struct inode *inode)

last_block = (inode->i_size + sb->s_blocksize - 1)
>> EXT4_BLOCK_SIZE_BITS(sb);
+ err = ext4_es_remove_extent(inode, last_block,
+ EXT_MAX_BLOCKS - last_block);
err = ext4_ext_remove_space(inode, last_block, EXT_MAX_BLOCKS - 1);

/* In a multi-transaction truncate, we only make the final
@@ -4971,6 +4973,8 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
ext4_ext_invalidate_cache(inode);
ext4_discard_preallocations(inode);

+ err = ext4_es_remove_extent(inode, first_block,
+ stop_block - first_block);
err = ext4_ext_remove_space(inode, first_block, stop_block - 1);

ext4_ext_invalidate_cache(inode);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 792e388..9a6c2b1 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -1412,6 +1412,7 @@ void ext4_ind_truncate(struct inode *inode)
down_write(&ei->i_data_sem);

ext4_discard_preallocations(inode);
+ ext4_es_remove_extent(inode, last_block, EXT_MAX_BLOCKS - last_block);

/*
* The orphan list entry will now protect us from any crash which
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b3c243b..e4da45d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -574,7 +574,16 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
up_read((&EXT4_I(inode)->i_data_sem));

if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
- int ret = check_block_validity(inode, map);
+ int ret;
+ if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
+ /* delayed alloc may be allocated by fallocate and
+ * coverted to initialized by directIO.
+ * we need to handle delayed extent here.
+ */
+ down_write((&EXT4_I(inode)->i_data_sem));
+ goto delayed_mapped;
+ }
+ ret = check_block_validity(inode, map);
if (ret != 0)
return ret;
}
@@ -656,8 +665,16 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* set the BH_Da_Mapped bit on them. Its important to do this
* under the protection of i_data_sem.
*/
- if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED)
+ if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
+ int ret;
set_buffers_da_mapped(inode, map);
+delayed_mapped:
+ /* delayed allocation blocks has been allocated */
+ ret = ext4_es_remove_extent(inode, map->m_lblk,
+ map->m_len);
+ if (ret < 0)
+ retval = ret;
+ }
}

up_write((&EXT4_I(inode)->i_data_sem));
@@ -1301,6 +1318,7 @@ static void ext4_da_page_release_reservation(struct page *page,
struct inode *inode = page->mapping->host;
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
int num_clusters;
+ ext4_fsblk_t lblk;

head = page_buffers(page);
bh = head;
@@ -1315,11 +1333,15 @@ static void ext4_da_page_release_reservation(struct page *page,
curr_off = next_off;
} while ((bh = bh->b_this_page) != head);

+ if (to_release) {
+ lblk = page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ ext4_es_remove_extent(inode, lblk, to_release);
+ }
+
/* If we have released all the blocks belonging to a cluster, then we
* need to release the reserved space for that cluster. */
num_clusters = EXT4_NUM_B2C(sbi, to_release);
while (num_clusters > 0) {
- ext4_fsblk_t lblk;
lblk = (page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits)) +
((num_clusters - 1) << sbi->s_cluster_bits);
if (sbi->s_cluster_ratio == 1 ||
@@ -1500,9 +1522,15 @@ static void ext4_da_block_invalidatepages(struct mpage_da_data *mpd)
struct pagevec pvec;
struct inode *inode = mpd->inode;
struct address_space *mapping = inode->i_mapping;
+ ext4_lblk_t start, last;

index = mpd->first_page;
end = mpd->next_page - 1;
+
+ start = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ last = end << (PAGE_CACHE_SHIFT - inode->i_blkbits);
+ ext4_es_remove_extent(inode, start, last - start + 1);
+
while (index <= end) {
nr_pages = pagevec_lookup(&pvec, mapping, index, PAGEVEC_SIZE);
if (nr_pages == 0)
@@ -1814,6 +1842,10 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
goto out_unlock;
}

+ retval = ext4_es_insert_extent(inode, map->m_lblk, map->m_len);
+ if (retval)
+ goto out_unlock;
+
/* Clear EXT4_MAP_FROM_CLUSTER flag since its purpose is served
* and it should not appear on the bh->b_state.
*/
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index e0c824b..6d4a712 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -50,6 +50,7 @@
#include "xattr.h"
#include "acl.h"
#include "mballoc.h"
+#include "ext4_extents.h"

#define CREATE_TRACE_POINTS
#include <trace/events/ext4.h>
@@ -1033,6 +1034,7 @@ void ext4_clear_inode(struct inode *inode)
clear_inode(inode);
dquot_drop(inode);
ext4_discard_preallocations(inode);
+ ext4_es_remove_extent(inode, 0, EXT_MAX_BLOCKS);
if (EXT4_I(inode)->jinode) {
jbd2_journal_release_jbd_inode(EXT4_JOURNAL(inode),
EXT4_I(inode)->jinode);
@@ -5291,9 +5293,14 @@ static int __init ext4_init_fs(void)
init_waitqueue_head(&ext4__ioend_wq[i]);
}

- err = ext4_init_pageio();
+ err = ext4_init_es();
if (err)
return err;
+
+ err = ext4_init_pageio();
+ if (err)
+ goto out7;
+
err = ext4_init_system_zone();
if (err)
goto out6;
@@ -5343,6 +5350,9 @@ out5:
ext4_exit_system_zone();
out6:
ext4_exit_pageio();
+out7:
+ ext4_exit_es();
+
return err;
}

--
1.7.12.rc2.18.g61b472e


2012-10-26 13:13:03

by Zheng Liu

[permalink] [raw]
Subject: [PATCH 5/8 v3] ext4: add some tracepoints in extent status tree

From: Zheng Liu <[email protected]>

This patch adds some tracepoints in extent status tree.

Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/extents_status.c | 8 ++++
include/trace/events/ext4.h | 101 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 109 insertions(+)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 048ae8c..1c4af0a 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -14,6 +14,8 @@
#include "extents_status.h"
#include "ext4_extents.h"

+#include <trace/events/ext4.h>
+
/*
* According to previous discussion in Ext4 Developer Workshop, we will
* introduce a new structure called io tree to track all extent status in order
@@ -223,6 +225,8 @@ ext4_lblk_t ext4_es_find_extent(struct inode *inode, struct extent_status *es)
struct rb_node *node;
ext4_lblk_t ret = EXT_MAX_BLOCKS;

+ trace_ext4_es_find_extent_enter(inode, es->start);
+
read_lock(&EXT4_I(inode)->i_es_lock);
tree = &EXT4_I(inode)->i_es_tree;

@@ -252,6 +256,8 @@ out:
}

read_unlock(&EXT4_I(inode)->i_es_lock);
+
+ trace_ext4_es_find_extent_exit(inode, es, ret);
return ret;
}

@@ -392,6 +398,7 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t offset,
struct ext4_es_tree *tree;
int err = 0;

+ trace_ext4_es_insert_extent(inode, offset, len);
es_debug("add [%u/%u) to extent status tree of inode %lu\n",
offset, len, inode->i_ino);

@@ -421,6 +428,7 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t offset,
ext4_lblk_t len1, len2, end;
int err = 0;

+ trace_ext4_es_remove_extent(inode, offset, len);
es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
offset, len, inode->i_ino);

diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index d49b285..b01a901 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -15,6 +15,7 @@ struct ext4_inode_info;
struct mpage_da_data;
struct ext4_map_blocks;
struct ext4_extent;
+struct extent_status;

#define EXT4_I(inode) (container_of(inode, struct ext4_inode_info, vfs_inode))

@@ -2055,6 +2056,106 @@ TRACE_EVENT(ext4_ext_remove_space_done,
(unsigned short) __entry->eh_entries)
);

+TRACE_EVENT(ext4_es_insert_extent,
+ TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t len),
+
+ TP_ARGS(inode, start, len),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( loff_t, start )
+ __field( loff_t, len )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->start = start;
+ __entry->len = len;
+ ),
+
+ TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ __entry->start, __entry->len)
+);
+
+TRACE_EVENT(ext4_es_remove_extent,
+ TP_PROTO(struct inode *inode, ext4_lblk_t start, ext4_lblk_t len),
+
+ TP_ARGS(inode, start, len),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( loff_t, start )
+ __field( loff_t, len )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->start = start;
+ __entry->len = len;
+ ),
+
+ TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ __entry->start, __entry->len)
+);
+
+TRACE_EVENT(ext4_es_find_extent_enter,
+ TP_PROTO(struct inode *inode, ext4_lblk_t start),
+
+ TP_ARGS(inode, start),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( ext4_lblk_t, start )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->start = start;
+ ),
+
+ TP_printk("dev %d,%d ino %lu start %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino, __entry->start)
+);
+
+TRACE_EVENT(ext4_es_find_extent_exit,
+ TP_PROTO(struct inode *inode, struct extent_status *es,
+ ext4_lblk_t ret),
+
+ TP_ARGS(inode, es, ret),
+
+ TP_STRUCT__entry(
+ __field( dev_t, dev )
+ __field( ino_t, ino )
+ __field( ext4_lblk_t, start )
+ __field( ext4_lblk_t, len )
+ __field( ext4_lblk_t, ret )
+ ),
+
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->start = es->start;
+ __entry->len = es->len;
+ __entry->ret = ret;
+ ),
+
+ TP_printk("dev %d,%d ino %lu es [%u/%u) ret %u",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ (unsigned long) __entry->ino,
+ __entry->start, __entry->len, __entry->ret)
+);
+
#endif /* _TRACE_EXT4_H */

/* This part must be outside protection */
--
1.7.12.rc2.18.g61b472e


2012-10-26 13:13:07

by Zheng Liu

[permalink] [raw]
Subject: [PATCH 6/8 v3] ext4: reimplement ext4_find_delay_alloc_range on extent status tree

From: Zheng Liu <[email protected]>

ext4_find_delay_alloc_range is reimplemented on extent status tree.

CC: "Theodore Ts'o" <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
Signed-off-by: Allison Henderson <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/ext4.h | 4 --
fs/ext4/ext4_extents.h | 3 +-
fs/ext4/extents.c | 117 ++++++++-----------------------------------------
fs/ext4/inode.c | 53 +---------------------
4 files changed, 20 insertions(+), 157 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bcc634b..246e38f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2451,14 +2451,10 @@ enum ext4_state_bits {
* never, ever appear in a buffer_head's state
* flag. See EXT4_MAP_FROM_CLUSTER to see where
* this is used. */
- BH_Da_Mapped, /* Delayed allocated block that now has a mapping. This
- * flag is set when ext4_map_blocks is called on a
- * delayed allocated block to get its real mapping. */
};

BUFFER_FNS(Uninit, uninit)
TAS_BUFFER_FNS(Uninit, uninit)
-BUFFER_FNS(Da_Mapped, da_mapped)

/*
* Add new method to test wether block and inode bitmaps are properly
diff --git a/fs/ext4/ext4_extents.h b/fs/ext4/ext4_extents.h
index cb1b2c9..603bb11 100644
--- a/fs/ext4/ext4_extents.h
+++ b/fs/ext4/ext4_extents.h
@@ -314,7 +314,6 @@ extern struct ext4_ext_path *ext4_ext_find_extent(struct inode *, ext4_lblk_t,
struct ext4_ext_path *);
extern void ext4_ext_drop_refs(struct ext4_ext_path *);
extern int ext4_ext_check_inode(struct inode *inode);
-extern int ext4_find_delalloc_cluster(struct inode *inode, ext4_lblk_t lblk,
- int search_hint_reverse);
+extern int ext4_find_delalloc_cluster(struct inode *inode, ext4_lblk_t lblk);
#endif /* _EXT4_EXTENTS */

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0e1a925..d6da3a8 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3461,115 +3461,34 @@ out:
/**
* ext4_find_delalloc_range: find delayed allocated block in the given range.
*
- * Goes through the buffer heads in the range [lblk_start, lblk_end] and returns
- * whether there are any buffers marked for delayed allocation. It returns '1'
- * on the first delalloc'ed buffer head found. If no buffer head in the given
- * range is marked for delalloc, it returns 0.
- * lblk_start should always be <= lblk_end.
- * search_hint_reverse is to indicate that searching in reverse from lblk_end to
- * lblk_start might be more efficient (i.e., we will likely hit the delalloc'ed
- * block sooner). This is useful when blocks are truncated sequentially from
- * lblk_start towards lblk_end.
+ * Return 1 if there is a delalloc block in the range, otherwise 0.
*/
static int ext4_find_delalloc_range(struct inode *inode,
ext4_lblk_t lblk_start,
- ext4_lblk_t lblk_end,
- int search_hint_reverse)
+ ext4_lblk_t lblk_end)
{
- struct address_space *mapping = inode->i_mapping;
- struct buffer_head *head, *bh = NULL;
- struct page *page;
- ext4_lblk_t i, pg_lblk;
- pgoff_t index;
-
- if (!test_opt(inode->i_sb, DELALLOC))
- return 0;
-
- /* reverse search wont work if fs block size is less than page size */
- if (inode->i_blkbits < PAGE_CACHE_SHIFT)
- search_hint_reverse = 0;
+ struct extent_status es;

- if (search_hint_reverse)
- i = lblk_end;
+ es.start = lblk_start;
+ ext4_es_find_extent(inode, &es);
+ if (es.len == 0)
+ return 0; /* there is no delay extent in this tree */
+ else if (es.start <= lblk_start && lblk_start < es.start + es.len)
+ return 1;
+ else if (lblk_start <= es.start && es.start <= lblk_end)
+ return 1;
else
- i = lblk_start;
-
- index = i >> (PAGE_CACHE_SHIFT - inode->i_blkbits);
-
- while ((i >= lblk_start) && (i <= lblk_end)) {
- page = find_get_page(mapping, index);
- if (!page)
- goto nextpage;
-
- if (!page_has_buffers(page))
- goto nextpage;
-
- head = page_buffers(page);
- if (!head)
- goto nextpage;
-
- bh = head;
- pg_lblk = index << (PAGE_CACHE_SHIFT -
- inode->i_blkbits);
- do {
- if (unlikely(pg_lblk < lblk_start)) {
- /*
- * This is possible when fs block size is less
- * than page size and our cluster starts/ends in
- * middle of the page. So we need to skip the
- * initial few blocks till we reach the 'lblk'
- */
- pg_lblk++;
- continue;
- }
-
- /* Check if the buffer is delayed allocated and that it
- * is not yet mapped. (when da-buffers are mapped during
- * their writeout, their da_mapped bit is set.)
- */
- if (buffer_delay(bh) && !buffer_da_mapped(bh)) {
- page_cache_release(page);
- trace_ext4_find_delalloc_range(inode,
- lblk_start, lblk_end,
- search_hint_reverse,
- 1, i);
- return 1;
- }
- if (search_hint_reverse)
- i--;
- else
- i++;
- } while ((i >= lblk_start) && (i <= lblk_end) &&
- ((bh = bh->b_this_page) != head));
-nextpage:
- if (page)
- page_cache_release(page);
- /*
- * Move to next page. 'i' will be the first lblk in the next
- * page.
- */
- if (search_hint_reverse)
- index--;
- else
- index++;
- i = index << (PAGE_CACHE_SHIFT - inode->i_blkbits);
- }
-
- trace_ext4_find_delalloc_range(inode, lblk_start, lblk_end,
- search_hint_reverse, 0, 0);
- return 0;
+ return 0;
}

-int ext4_find_delalloc_cluster(struct inode *inode, ext4_lblk_t lblk,
- int search_hint_reverse)
+int ext4_find_delalloc_cluster(struct inode *inode, ext4_lblk_t lblk)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
ext4_lblk_t lblk_start, lblk_end;
lblk_start = lblk & (~(sbi->s_cluster_ratio - 1));
lblk_end = lblk_start + sbi->s_cluster_ratio - 1;

- return ext4_find_delalloc_range(inode, lblk_start, lblk_end,
- search_hint_reverse);
+ return ext4_find_delalloc_range(inode, lblk_start, lblk_end);
}

/**
@@ -3630,7 +3549,7 @@ get_reserved_cluster_alloc(struct inode *inode, ext4_lblk_t lblk_start,
lblk_from = lblk_start & (~(sbi->s_cluster_ratio - 1));
lblk_to = lblk_from + c_offset - 1;

- if (ext4_find_delalloc_range(inode, lblk_from, lblk_to, 0))
+ if (ext4_find_delalloc_range(inode, lblk_from, lblk_to))
allocated_clusters--;
}

@@ -3640,7 +3559,7 @@ get_reserved_cluster_alloc(struct inode *inode, ext4_lblk_t lblk_start,
lblk_from = lblk_start + num_blks;
lblk_to = lblk_from + (sbi->s_cluster_ratio - c_offset) - 1;

- if (ext4_find_delalloc_range(inode, lblk_from, lblk_to, 0))
+ if (ext4_find_delalloc_range(inode, lblk_from, lblk_to))
allocated_clusters--;
}

@@ -3927,7 +3846,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
if (ext4_ext_in_cache(inode, map->m_lblk, &newex)) {
if (!newex.ee_start_lo && !newex.ee_start_hi) {
if ((sbi->s_cluster_ratio > 1) &&
- ext4_find_delalloc_cluster(inode, map->m_lblk, 0))
+ ext4_find_delalloc_cluster(inode, map->m_lblk))
map->m_flags |= EXT4_MAP_FROM_CLUSTER;

if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
@@ -4015,7 +3934,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
}

if ((sbi->s_cluster_ratio > 1) &&
- ext4_find_delalloc_cluster(inode, map->m_lblk, 0))
+ ext4_find_delalloc_cluster(inode, map->m_lblk))
map->m_flags |= EXT4_MAP_FROM_CLUSTER;

/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e4da45d..8009d51 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -484,49 +484,6 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
}

/*
- * Sets the BH_Da_Mapped bit on the buffer heads corresponding to the given map.
- */
-static void set_buffers_da_mapped(struct inode *inode,
- struct ext4_map_blocks *map)
-{
- struct address_space *mapping = inode->i_mapping;
- struct pagevec pvec;
- int i, nr_pages;
- pgoff_t index, end;
-
- index = map->m_lblk >> (PAGE_CACHE_SHIFT - inode->i_blkbits);
- end = (map->m_lblk + map->m_len - 1) >>
- (PAGE_CACHE_SHIFT - inode->i_blkbits);
-
- pagevec_init(&pvec, 0);
- while (index <= end) {
- nr_pages = pagevec_lookup(&pvec, mapping, index,
- min(end - index + 1,
- (pgoff_t)PAGEVEC_SIZE));
- if (nr_pages == 0)
- break;
- for (i = 0; i < nr_pages; i++) {
- struct page *page = pvec.pages[i];
- struct buffer_head *bh, *head;
-
- if (unlikely(page->mapping != mapping) ||
- !PageDirty(page))
- break;
-
- if (page_has_buffers(page)) {
- bh = head = page_buffers(page);
- do {
- set_buffer_da_mapped(bh);
- bh = bh->b_this_page;
- } while (bh != head);
- }
- index++;
- }
- pagevec_release(&pvec);
- }
-}
-
-/*
* The ext4_map_blocks() function tries to look up the requested blocks,
* and returns if the blocks are already mapped.
*
@@ -661,13 +618,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
ext4_clear_inode_state(inode, EXT4_STATE_DELALLOC_RESERVED);

- /* If we have successfully mapped the delayed allocated blocks,
- * set the BH_Da_Mapped bit on them. Its important to do this
- * under the protection of i_data_sem.
- */
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
int ret;
- set_buffers_da_mapped(inode, map);
delayed_mapped:
/* delayed allocation blocks has been allocated */
ret = ext4_es_remove_extent(inode, map->m_lblk,
@@ -1328,7 +1280,6 @@ static void ext4_da_page_release_reservation(struct page *page,
if ((offset <= curr_off) && (buffer_delay(bh))) {
to_release++;
clear_buffer_delay(bh);
- clear_buffer_da_mapped(bh);
}
curr_off = next_off;
} while ((bh = bh->b_this_page) != head);
@@ -1345,7 +1296,7 @@ static void ext4_da_page_release_reservation(struct page *page,
lblk = (page->index << (PAGE_CACHE_SHIFT - inode->i_blkbits)) +
((num_clusters - 1) << sbi->s_cluster_bits);
if (sbi->s_cluster_ratio == 1 ||
- !ext4_find_delalloc_cluster(inode, lblk, 1))
+ !ext4_find_delalloc_cluster(inode, lblk))
ext4_da_release_space(inode, 1);

num_clusters--;
@@ -1451,8 +1402,6 @@ static int mpage_da_submit_io(struct mpage_da_data *mpd,
clear_buffer_delay(bh);
bh->b_blocknr = pblock;
}
- if (buffer_da_mapped(bh))
- clear_buffer_da_mapped(bh);
if (buffer_unwritten(bh) ||
buffer_mapped(bh))
BUG_ON(bh->b_blocknr != pblock);
--
1.7.12.rc2.18.g61b472e


2012-10-26 13:13:11

by Zheng Liu

[permalink] [raw]
Subject: [PATCH 7/8 v3] ext4: reimplement fiemap on extent status tree

From: Zheng Liu <[email protected]>

This patch reimplements fiemap on extent status tree.

CC: "Theodore Ts'o" <[email protected]>
Signed-off-by: Yongqiang Yang <[email protected]>
Signed-off-by: Allison Henderson <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
fs/ext4/extents.c | 184 +++++++-----------------------------------------------
1 file changed, 21 insertions(+), 163 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index d6da3a8..a72197b 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4499,193 +4499,51 @@ static int ext4_ext_fiemap_cb(struct inode *inode, ext4_lblk_t next,
struct ext4_ext_cache *newex, struct ext4_extent *ex,
void *data)
{
+ struct extent_status es;
__u64 logical;
__u64 physical;
__u64 length;
__u32 flags = 0;
+ ext4_lblk_t next_del;
int ret = 0;
struct fiemap_extent_info *fieinfo = data;
unsigned char blksize_bits;

- blksize_bits = inode->i_sb->s_blocksize_bits;
- logical = (__u64)newex->ec_block << blksize_bits;
+ es.start = newex->ec_block;
+ next_del = ext4_es_find_extent(inode, &es);

+ next = min(next_del, next);
if (newex->ec_start == 0) {
/*
* No extent in extent-tree contains block @newex->ec_start,
* then the block may stay in 1)a hole or 2)delayed-extent.
- *
- * Holes or delayed-extents are processed as follows.
- * 1. lookup dirty pages with specified range in pagecache.
- * If no page is got, then there is no delayed-extent and
- * return with EXT_CONTINUE.
- * 2. find the 1st mapped buffer,
- * 3. check if the mapped buffer is both in the request range
- * and a delayed buffer. If not, there is no delayed-extent,
- * then return.
- * 4. a delayed-extent is found, the extent will be collected.
*/
- ext4_lblk_t end = 0;
- pgoff_t last_offset;
- pgoff_t offset;
- pgoff_t index;
- pgoff_t start_index = 0;
- struct page **pages = NULL;
- struct buffer_head *bh = NULL;
- struct buffer_head *head = NULL;
- unsigned int nr_pages = PAGE_SIZE / sizeof(struct page *);
-
- pages = kmalloc(PAGE_SIZE, GFP_KERNEL);
- if (pages == NULL)
- return -ENOMEM;
-
- offset = logical >> PAGE_SHIFT;
-repeat:
- last_offset = offset;
- head = NULL;
- ret = find_get_pages_tag(inode->i_mapping, &offset,
- PAGECACHE_TAG_DIRTY, nr_pages, pages);
-
- if (!(flags & FIEMAP_EXTENT_DELALLOC)) {
- /* First time, try to find a mapped buffer. */
- if (ret == 0) {
-out:
- for (index = 0; index < ret; index++)
- page_cache_release(pages[index]);
- /* just a hole. */
- kfree(pages);
- return EXT_CONTINUE;
- }
- index = 0;
-
-next_page:
- /* Try to find the 1st mapped buffer. */
- end = ((__u64)pages[index]->index << PAGE_SHIFT) >>
- blksize_bits;
- if (!page_has_buffers(pages[index]))
- goto out;
- head = page_buffers(pages[index]);
- if (!head)
- goto out;
-
- index++;
- bh = head;
- do {
- if (end >= newex->ec_block +
- newex->ec_len)
- /* The buffer is out of
- * the request range.
- */
- goto out;
-
- if (buffer_mapped(bh) &&
- end >= newex->ec_block) {
- start_index = index - 1;
- /* get the 1st mapped buffer. */
- goto found_mapped_buffer;
- }
-
- bh = bh->b_this_page;
- end++;
- } while (bh != head);
-
- /* No mapped buffer in the range found in this page,
- * We need to look up next page.
- */
- if (index >= ret) {
- /* There is no page left, but we need to limit
- * newex->ec_len.
- */
- newex->ec_len = end - newex->ec_block;
- goto out;
- }
- goto next_page;
- } else {
- /*Find contiguous delayed buffers. */
- if (ret > 0 && pages[0]->index == last_offset)
- head = page_buffers(pages[0]);
- bh = head;
- index = 1;
- start_index = 0;
+ if (es.len == 0)
+ /* A hole found. */
+ return EXT_CONTINUE;
+
+ if (es.start > newex->ec_block) {
+ /* A hole found. */
+ newex->ec_len = min(es.start - newex->ec_block,
+ newex->ec_len);
+ return EXT_CONTINUE;
}

-found_mapped_buffer:
- if (bh != NULL && buffer_delay(bh)) {
- /* 1st or contiguous delayed buffer found. */
- if (!(flags & FIEMAP_EXTENT_DELALLOC)) {
- /*
- * 1st delayed buffer found, record
- * the start of extent.
- */
- flags |= FIEMAP_EXTENT_DELALLOC;
- newex->ec_block = end;
- logical = (__u64)end << blksize_bits;
- }
- /* Find contiguous delayed buffers. */
- do {
- if (!buffer_delay(bh))
- goto found_delayed_extent;
- bh = bh->b_this_page;
- end++;
- } while (bh != head);
-
- for (; index < ret; index++) {
- if (!page_has_buffers(pages[index])) {
- bh = NULL;
- break;
- }
- head = page_buffers(pages[index]);
- if (!head) {
- bh = NULL;
- break;
- }
-
- if (pages[index]->index !=
- pages[start_index]->index + index
- - start_index) {
- /* Blocks are not contiguous. */
- bh = NULL;
- break;
- }
- bh = head;
- do {
- if (!buffer_delay(bh))
- /* Delayed-extent ends. */
- goto found_delayed_extent;
- bh = bh->b_this_page;
- end++;
- } while (bh != head);
- }
- } else if (!(flags & FIEMAP_EXTENT_DELALLOC))
- /* a hole found. */
- goto out;
-
-found_delayed_extent:
- newex->ec_len = min(end - newex->ec_block,
- (ext4_lblk_t)EXT_INIT_MAX_LEN);
- if (ret == nr_pages && bh != NULL &&
- newex->ec_len < EXT_INIT_MAX_LEN &&
- buffer_delay(bh)) {
- /* Have not collected an extent and continue. */
- for (index = 0; index < ret; index++)
- page_cache_release(pages[index]);
- goto repeat;
- }
-
- for (index = 0; index < ret; index++)
- page_cache_release(pages[index]);
- kfree(pages);
+ flags |= FIEMAP_EXTENT_DELALLOC;
+ newex->ec_len = es.start + es.len - newex->ec_block;
}

- physical = (__u64)newex->ec_start << blksize_bits;
- length = (__u64)newex->ec_len << blksize_bits;

2012-10-26 13:13:14

by Zheng Liu

[permalink] [raw]
Subject: [PATCH 8/8 v3] ext4: introduce lseek SEEK_DATA/SEEK_HOLE support

From: Zheng Liu <[email protected]>

This patch makes ext4 really support SEEK_DATA/SEEK_HOLE flags. Block-mapped
and extent-mapped files are fully implemented together because ext4_map_blocks
hides this differences.

After applying this patch, it will cause a failure in xfstest #285 when the file
is block-mapped due to block-mapped file isn't support fallocate(2).

I had tried to use ext4_ext_walk_space() to retrieve the offset for a
extent-mapped file. But finally I decide to keep using ext4_map_blocks() to
support SEEK_DATA/SEEK_HOLE because ext4_map_blocks() can hide the difference
between block-mapped file and extent-mapped file. Moreover, in next step,
extent status tree will track all extent status, and we can get all mappings
from this tree. So I think that using ext4_map_blocks() is a better choice.

CC: "Theodore Ts'o" <[email protected]>
CC: Hugh Dickins <[email protected]>
Signed-off-by: Jie Liu <[email protected]>
Signed-off-by: Zheng Liu <[email protected]>
---
v3 <- v2:
- determine an unwritten extent as a data/hole according to the whether there
has some data in page cache or not.
- remove ext4_seek_data_hole(), and call ext4_seek_data/hole() directly.
- fix some bugs

Please apply this patch before testing this patch using xfstest #285 due to
there is a bug in xfstest (http://patchwork.xfs.org/patch/4276/).

The function ext4_find_unwritten_pgoff() is inspired by Jeff's work for xfs
(d126d43f631f996daeee5006714fed914be32368). Thus, I add a Signed-off-by for
his crediting.

Jeff, I really appreicate if you could allow me to add Signed-off-by in this
patch. Please let me know if you think it is not well. Thanks!

fs/ext4/file.c | 334 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 332 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index bf3966b..2f5759e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -24,6 +24,7 @@
#include <linux/mount.h>
#include <linux/path.h>
#include <linux/quotaops.h>
+#include <linux/pagevec.h>
#include "ext4.h"
#include "ext4_jbd2.h"
#include "xattr.h"
@@ -286,6 +287,324 @@ static int ext4_file_open(struct inode * inode, struct file * filp)
}

/*
+ * Here we use ext4_map_blocks() to get a block mapping for a extent-based
+ * file rather than ext4_ext_walk_space() because we can introduce
+ * SEEK_DATA/SEEK_HOLE for block-mapped and extent-mapped file at the same
+ * function. When extent status tree has been fully implemented, it will
+ * track all extent status for a file and we can directly use it to
+ * retrieve the offset for SEEK_DATA/SEEK_HOLE.
+ */
+
+/*
+ * When we retrieve the offset for SEEK_DATA/SEEK_HOLE, we would need to
+ * lookup page cache to check whether or not there has some data between
+ * [startoff, endoff] because, if this range contains an unwritten extent,
+ * we determine this extent as a data or a hole according to whether the
+ * page cache has data or not.
+ */
+static int ext4_find_unwritten_pgoff(struct inode *inode,
+ int origin,
+ struct ext4_map_blocks *map,
+ loff_t *offset)
+{
+ struct pagevec pvec;
+ unsigned int blkbits;
+ pgoff_t index;
+ pgoff_t end;
+ loff_t endoff;
+ loff_t startoff;
+ loff_t lastoff;
+ int found = 0;
+
+ blkbits = inode->i_sb->s_blocksize_bits;
+ startoff = *offset;
+ lastoff = startoff;
+ endoff = (map->m_lblk + map->m_len) << blkbits;
+
+ index = startoff >> PAGE_CACHE_SHIFT;
+ end = endoff >> PAGE_CACHE_SHIFT;
+
+ pagevec_init(&pvec, 0);
+ do {
+ int i, num;
+ unsigned long nr_pages;
+
+ num = min_t(pgoff_t, end - index, PAGEVEC_SIZE);
+ nr_pages = pagevec_lookup(&pvec, inode->i_mapping, index,
+ (pgoff_t)num);
+ if (nr_pages == 0) {
+ if (origin == SEEK_DATA)
+ break;
+
+ BUG_ON(origin != SEEK_HOLE);
+ /*
+ * If this is the first time to go into the loop and
+ * offset is not beyond the end offset, it will be a
+ * hole at this offset
+ */
+ if (lastoff == startoff || lastoff < endoff)
+ found = 1;
+ break;
+ }
+
+ /*
+ * If this is the first time to go into the loop and
+ * offset is smaller than the first page offset, it will be a
+ * hole at this offset.
+ */
+ if (lastoff == startoff && origin == SEEK_HOLE &&
+ lastoff < page_offset(pvec.pages[0])) {
+ found = 1;
+ break;
+ }
+
+ for (i = 0; i < nr_pages; i++) {
+ struct page *page = pvec.pages[i];
+ struct buffer_head *bh, *head;
+
+ /*
+ * If the current offset is not beyond the end of given
+ * range, it will be a hole.
+ */
+ if (lastoff < endoff && origin == SEEK_HOLE &&
+ page->index > end) {
+ found = 1;
+ *offset = lastoff;
+ goto out;
+ }
+
+ lock_page(page);
+
+ if (unlikely(page->mapping != inode->i_mapping)) {
+ unlock_page(page);
+ continue;
+ }
+
+ if (!page_has_buffers(page)) {
+ unlock_page(page);
+ continue;
+ }
+
+ if (page_has_buffers(page)) {
+ lastoff = page_offset(page);
+ bh = head = page_buffers(page);
+ do {
+ if (buffer_uptodate(bh) ||
+ buffer_unwritten(bh)) {
+ if (origin == SEEK_DATA)
+ found = 1;
+ } else {
+ if (origin == SEEK_HOLE)
+ found = 1;
+ }
+ if (found) {
+ *offset = max_t(loff_t,
+ startoff, lastoff);
+ unlock_page(page);
+ goto out;
+ }
+ lastoff += bh->b_size;
+ bh = bh->b_this_page;
+ } while (bh != head);
+ }
+
+ lastoff = page_offset(page) + PAGE_SIZE;
+ unlock_page(page);
+ }
+
+ /*
+ * The no. of pages is less than our desired, that would be a
+ * hole in there.
+ */
+ if (nr_pages < num && origin == SEEK_HOLE) {
+ found = 1;
+ *offset = lastoff;
+ break;
+ }
+
+ index = pvec.pages[i - 1]->index + 1;
+ pagevec_release(&pvec);
+ } while (index <= end);
+
+out:
+ pagevec_release(&pvec);
+ return found;
+}
+
+/*
+ * ext4_seek_data() retrieves the offset for SEEK_DATA.
+ */
+static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
+{
+ struct inode *inode = file->f_mapping->host;
+ struct ext4_map_blocks map;
+ struct extent_status es;
+ ext4_lblk_t start, last, end;
+ loff_t dataoff, isize;
+ int blkbits;
+ int ret = 0;
+
+ mutex_lock(&inode->i_mutex);
+
+ isize = i_size_read(inode);
+ if (offset >= isize) {
+ mutex_unlock(&inode->i_mutex);
+ return -ENXIO;
+ }
+
+ blkbits = inode->i_sb->s_blocksize_bits;
+ start = offset >> blkbits;
+ last = start;
+ end = isize >> blkbits;
+ dataoff = offset;
+
+ do {
+ map.m_lblk = last;
+ map.m_len = end - last + 1;
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret > 0 && !(map.m_flags & EXT4_MAP_UNWRITTEN)) {
+ if (last != start)
+ dataoff = last << blkbits;
+ break;
+ }
+
+ /*
+ * If there is a delay extent at this offset,
+ * it will be as a data.
+ */
+ es.start = last;
+ (void)ext4_es_find_extent(inode, &es);
+ if (last >= es.start &&
+ last < es.start + es.len) {
+ if (last != start)
+ dataoff = last << blkbits;
+ break;
+ }
+
+ /*
+ * If there is a unwritten extent at this offset,
+ * it will be as a data or a hole according to page
+ * cache that has data or not.
+ */
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ int unwritten;
+ unwritten = ext4_find_unwritten_pgoff(inode, SEEK_DATA,
+ &map, &dataoff);
+ if (unwritten)
+ break;
+ }
+
+ last++;
+ dataoff = last << blkbits;
+ } while (last <= end);
+
+ mutex_unlock(&inode->i_mutex);
+
+ if (dataoff > isize)
+ return -ENXIO;
+
+ if (dataoff < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET))
+ return -EINVAL;
+ if (dataoff > maxsize)
+ return -EINVAL;
+
+ if (dataoff != file->f_pos) {
+ file->f_pos = dataoff;
+ file->f_version = 0;
+ }
+
+ return dataoff;
+}
+
+/*
+ * ext4_seek_hole() retrieves the offset for SEEK_HOLE.
+ */
+static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
+{
+ struct inode *inode = file->f_mapping->host;
+ struct ext4_map_blocks map;
+ struct extent_status es;
+ ext4_lblk_t start, last, end;
+ loff_t holeoff, isize;
+ int blkbits;
+ int ret = 0;
+
+ mutex_lock(&inode->i_mutex);
+
+ isize = i_size_read(inode);
+ if (offset >= isize) {
+ mutex_unlock(&inode->i_mutex);
+ return -ENXIO;
+ }
+
+ blkbits = inode->i_sb->s_blocksize_bits;
+ start = offset >> blkbits;
+ last = start;
+ end = isize >> blkbits;
+ holeoff = offset;
+
+ do {
+ map.m_lblk = last;
+ map.m_len = end - last + 1;
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret > 0 && !(map.m_flags & EXT4_MAP_UNWRITTEN)) {
+ last += ret;
+ holeoff = last << blkbits;
+ continue;
+ }
+
+ /*
+ * If there is a delay extent at this offset,
+ * we will skip this extent.
+ */
+ es.start = last;
+ (void)ext4_es_find_extent(inode, &es);
+ if (last >= es.start &&
+ last < es.start + es.len) {
+ last = es.start + es.len;
+ holeoff = last << blkbits;
+ continue;
+ }
+
+ /*
+ * If there is a unwritten extent at this offset,
+ * it will be as a data or a hole according to page
+ * cache that has data or not.
+ */
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ int unwritten;
+ unwritten = ext4_find_unwritten_pgoff(inode, SEEK_HOLE,
+ &map, &holeoff);
+ if (!unwritten) {
+ last += ret;
+ holeoff = last << blkbits;
+ continue;
+ }
+ }
+
+ /* find a hole */
+ break;
+ } while (last <= end);
+
+ mutex_unlock(&inode->i_mutex);
+
+ if (holeoff > isize)
+ holeoff = isize;
+
+ if (holeoff < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET))
+ return -EINVAL;
+ if (holeoff > maxsize)
+ return -EINVAL;
+
+ if (holeoff != file->f_pos) {
+ file->f_pos = holeoff;
+ file->f_version = 0;
+ }
+
+ return holeoff;
+}
+
+/*
* ext4_llseek() handles both block-mapped and extent-mapped maxbytes values
* by calling generic_file_llseek_size() with the appropriate maxbytes
* value for each.
@@ -300,8 +619,19 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int origin)
else
maxbytes = inode->i_sb->s_maxbytes;

- return generic_file_llseek_size(file, offset, origin,
- maxbytes, i_size_read(inode));
+ switch (origin) {
+ case SEEK_SET:
+ case SEEK_CUR:
+ case SEEK_END:
+ return generic_file_llseek_size(file, offset, origin,
+ maxbytes, i_size_read(inode));
+ case SEEK_DATA:
+ return ext4_seek_data(file, offset, maxbytes);
+ case SEEK_HOLE:
+ return ext4_seek_hole(file, offset, maxbytes);
+ }
+
+ return -EINVAL;
}

const struct file_operations ext4_file_operations = {
--
1.7.12.rc2.18.g61b472e


2012-10-27 10:16:04

by Jeff Liu

[permalink] [raw]
Subject: Re: [PATCH 8/8 v3] ext4: introduce lseek SEEK_DATA/SEEK_HOLE support

Hi Zheng,

On 10/26/2012 09:23 PM, Zheng Liu wrote:
> From: Zheng Liu <[email protected]>
>
> This patch makes ext4 really support SEEK_DATA/SEEK_HOLE flags. Block-mapped
> and extent-mapped files are fully implemented together because ext4_map_blocks
> hides this differences.
>
> After applying this patch, it will cause a failure in xfstest #285 when the file
> is block-mapped due to block-mapped file isn't support fallocate(2).
>
> I had tried to use ext4_ext_walk_space() to retrieve the offset for a
> extent-mapped file. But finally I decide to keep using ext4_map_blocks() to
> support SEEK_DATA/SEEK_HOLE because ext4_map_blocks() can hide the difference
> between block-mapped file and extent-mapped file. Moreover, in next step,
> extent status tree will track all extent status, and we can get all mappings
> from this tree. So I think that using ext4_map_blocks() is a better choice.
>
> CC: "Theodore Ts'o" <[email protected]>
> CC: Hugh Dickins <[email protected]>
> Signed-off-by: Jie Liu <[email protected]>
> Signed-off-by: Zheng Liu <[email protected]>
> ---
> v3 <- v2:
> - determine an unwritten extent as a data/hole according to the whether there
> has some data in page cache or not.
> - remove ext4_seek_data_hole(), and call ext4_seek_data/hole() directly.
> - fix some bugs
>
> Please apply this patch before testing this patch using xfstest #285 due to
> there is a bug in xfstest (http://patchwork.xfs.org/patch/4276/).
>
> The function ext4_find_unwritten_pgoff() is inspired by Jeff's work for xfs
> (d126d43f631f996daeee5006714fed914be32368). Thus, I add a Signed-off-by for
> his crediting.
>
> Jeff, I really appreicate if you could allow me to add Signed-off-by in this
> patch. Please let me know if you think it is not well. Thanks!
I'm fine. :)
Finally, maybe I'll try to tweak that routine with page cache lookup for
unwritten extents to be a
VFS helper so that Btrfs can make use of it as well, thank you!

-Jeff
>
> fs/ext4/file.c | 334 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 332 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index bf3966b..2f5759e 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -24,6 +24,7 @@
> #include <linux/mount.h>
> #include <linux/path.h>
> #include <linux/quotaops.h>
> +#include <linux/pagevec.h>
> #include "ext4.h"
> #include "ext4_jbd2.h"
> #include "xattr.h"
> @@ -286,6 +287,324 @@ static int ext4_file_open(struct inode * inode, struct file * filp)
> }
>
> /*
> + * Here we use ext4_map_blocks() to get a block mapping for a extent-based
> + * file rather than ext4_ext_walk_space() because we can introduce
> + * SEEK_DATA/SEEK_HOLE for block-mapped and extent-mapped file at the same
> + * function. When extent status tree has been fully implemented, it will
> + * track all extent status for a file and we can directly use it to
> + * retrieve the offset for SEEK_DATA/SEEK_HOLE.
> + */
> +
> +/*
> + * When we retrieve the offset for SEEK_DATA/SEEK_HOLE, we would need to
> + * lookup page cache to check whether or not there has some data between
> + * [startoff, endoff] because, if this range contains an unwritten extent,
> + * we determine this extent as a data or a hole according to whether the
> + * page cache has data or not.
> + */
> +static int ext4_find_unwritten_pgoff(struct inode *inode,
> + int origin,
> + struct ext4_map_blocks *map,
> + loff_t *offset)
> +{
> + struct pagevec pvec;
> + unsigned int blkbits;
> + pgoff_t index;
> + pgoff_t end;
> + loff_t endoff;
> + loff_t startoff;
> + loff_t lastoff;
> + int found = 0;
> +
> + blkbits = inode->i_sb->s_blocksize_bits;
> + startoff = *offset;
> + lastoff = startoff;
> + endoff = (map->m_lblk + map->m_len) << blkbits;
> +
> + index = startoff >> PAGE_CACHE_SHIFT;
> + end = endoff >> PAGE_CACHE_SHIFT;
> +
> + pagevec_init(&pvec, 0);
> + do {
> + int i, num;
> + unsigned long nr_pages;
> +
> + num = min_t(pgoff_t, end - index, PAGEVEC_SIZE);
> + nr_pages = pagevec_lookup(&pvec, inode->i_mapping, index,
> + (pgoff_t)num);
> + if (nr_pages == 0) {
> + if (origin == SEEK_DATA)
> + break;
> +
> + BUG_ON(origin != SEEK_HOLE);
> + /*
> + * If this is the first time to go into the loop and
> + * offset is not beyond the end offset, it will be a
> + * hole at this offset
> + */
> + if (lastoff == startoff || lastoff < endoff)
> + found = 1;
> + break;
> + }
> +
> + /*
> + * If this is the first time to go into the loop and
> + * offset is smaller than the first page offset, it will be a
> + * hole at this offset.
> + */
> + if (lastoff == startoff && origin == SEEK_HOLE &&
> + lastoff < page_offset(pvec.pages[0])) {
> + found = 1;
> + break;
> + }
> +
> + for (i = 0; i < nr_pages; i++) {
> + struct page *page = pvec.pages[i];
> + struct buffer_head *bh, *head;
> +
> + /*
> + * If the current offset is not beyond the end of given
> + * range, it will be a hole.
> + */
> + if (lastoff < endoff && origin == SEEK_HOLE &&
> + page->index > end) {
> + found = 1;
> + *offset = lastoff;
> + goto out;
> + }
> +
> + lock_page(page);
> +
> + if (unlikely(page->mapping != inode->i_mapping)) {
> + unlock_page(page);
> + continue;
> + }
> +
> + if (!page_has_buffers(page)) {
> + unlock_page(page);
> + continue;
> + }
> +
> + if (page_has_buffers(page)) {
> + lastoff = page_offset(page);
> + bh = head = page_buffers(page);
> + do {
> + if (buffer_uptodate(bh) ||
> + buffer_unwritten(bh)) {
> + if (origin == SEEK_DATA)
> + found = 1;
> + } else {
> + if (origin == SEEK_HOLE)
> + found = 1;
> + }
> + if (found) {
> + *offset = max_t(loff_t,
> + startoff, lastoff);
> + unlock_page(page);
> + goto out;
> + }
> + lastoff += bh->b_size;
> + bh = bh->b_this_page;
> + } while (bh != head);
> + }
> +
> + lastoff = page_offset(page) + PAGE_SIZE;
> + unlock_page(page);
> + }
> +
> + /*
> + * The no. of pages is less than our desired, that would be a
> + * hole in there.
> + */
> + if (nr_pages < num && origin == SEEK_HOLE) {
> + found = 1;
> + *offset = lastoff;
> + break;
> + }
> +
> + index = pvec.pages[i - 1]->index + 1;
> + pagevec_release(&pvec);
> + } while (index <= end);
> +
> +out:
> + pagevec_release(&pvec);
> + return found;
> +}
> +
> +/*
> + * ext4_seek_data() retrieves the offset for SEEK_DATA.
> + */
> +static loff_t ext4_seek_data(struct file *file, loff_t offset, loff_t maxsize)
> +{
> + struct inode *inode = file->f_mapping->host;
> + struct ext4_map_blocks map;
> + struct extent_status es;
> + ext4_lblk_t start, last, end;
> + loff_t dataoff, isize;
> + int blkbits;
> + int ret = 0;
> +
> + mutex_lock(&inode->i_mutex);
> +
> + isize = i_size_read(inode);
> + if (offset >= isize) {
> + mutex_unlock(&inode->i_mutex);
> + return -ENXIO;
> + }
> +
> + blkbits = inode->i_sb->s_blocksize_bits;
> + start = offset >> blkbits;
> + last = start;
> + end = isize >> blkbits;
> + dataoff = offset;
> +
> + do {
> + map.m_lblk = last;
> + map.m_len = end - last + 1;
> + ret = ext4_map_blocks(NULL, inode, &map, 0);
> + if (ret > 0 && !(map.m_flags & EXT4_MAP_UNWRITTEN)) {
> + if (last != start)
> + dataoff = last << blkbits;
> + break;
> + }
> +
> + /*
> + * If there is a delay extent at this offset,
> + * it will be as a data.
> + */
> + es.start = last;
> + (void)ext4_es_find_extent(inode, &es);
> + if (last >= es.start &&
> + last < es.start + es.len) {
> + if (last != start)
> + dataoff = last << blkbits;
> + break;
> + }
> +
> + /*
> + * If there is a unwritten extent at this offset,
> + * it will be as a data or a hole according to page
> + * cache that has data or not.
> + */
> + if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> + int unwritten;
> + unwritten = ext4_find_unwritten_pgoff(inode, SEEK_DATA,
> + &map, &dataoff);
> + if (unwritten)
> + break;
> + }
> +
> + last++;
> + dataoff = last << blkbits;
> + } while (last <= end);
> +
> + mutex_unlock(&inode->i_mutex);
> +
> + if (dataoff > isize)
> + return -ENXIO;
> +
> + if (dataoff < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET))
> + return -EINVAL;
> + if (dataoff > maxsize)
> + return -EINVAL;
> +
> + if (dataoff != file->f_pos) {
> + file->f_pos = dataoff;
> + file->f_version = 0;
> + }
> +
> + return dataoff;
> +}
> +
> +/*
> + * ext4_seek_hole() retrieves the offset for SEEK_HOLE.
> + */
> +static loff_t ext4_seek_hole(struct file *file, loff_t offset, loff_t maxsize)
> +{
> + struct inode *inode = file->f_mapping->host;
> + struct ext4_map_blocks map;
> + struct extent_status es;
> + ext4_lblk_t start, last, end;
> + loff_t holeoff, isize;
> + int blkbits;
> + int ret = 0;
> +
> + mutex_lock(&inode->i_mutex);
> +
> + isize = i_size_read(inode);
> + if (offset >= isize) {
> + mutex_unlock(&inode->i_mutex);
> + return -ENXIO;
> + }
> +
> + blkbits = inode->i_sb->s_blocksize_bits;
> + start = offset >> blkbits;
> + last = start;
> + end = isize >> blkbits;
> + holeoff = offset;
> +
> + do {
> + map.m_lblk = last;
> + map.m_len = end - last + 1;
> + ret = ext4_map_blocks(NULL, inode, &map, 0);
> + if (ret > 0 && !(map.m_flags & EXT4_MAP_UNWRITTEN)) {
> + last += ret;
> + holeoff = last << blkbits;
> + continue;
> + }
> +
> + /*
> + * If there is a delay extent at this offset,
> + * we will skip this extent.
> + */
> + es.start = last;
> + (void)ext4_es_find_extent(inode, &es);
> + if (last >= es.start &&
> + last < es.start + es.len) {
> + last = es.start + es.len;
> + holeoff = last << blkbits;
> + continue;
> + }
> +
> + /*
> + * If there is a unwritten extent at this offset,
> + * it will be as a data or a hole according to page
> + * cache that has data or not.
> + */
> + if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> + int unwritten;
> + unwritten = ext4_find_unwritten_pgoff(inode, SEEK_HOLE,
> + &map, &holeoff);
> + if (!unwritten) {
> + last += ret;
> + holeoff = last << blkbits;
> + continue;
> + }
> + }
> +
> + /* find a hole */
> + break;
> + } while (last <= end);
> +
> + mutex_unlock(&inode->i_mutex);
> +
> + if (holeoff > isize)
> + holeoff = isize;
> +
> + if (holeoff < 0 && !(file->f_mode & FMODE_UNSIGNED_OFFSET))
> + return -EINVAL;
> + if (holeoff > maxsize)
> + return -EINVAL;
> +
> + if (holeoff != file->f_pos) {
> + file->f_pos = holeoff;
> + file->f_version = 0;
> + }
> +
> + return holeoff;
> +}
> +
> +/*
> * ext4_llseek() handles both block-mapped and extent-mapped maxbytes values
> * by calling generic_file_llseek_size() with the appropriate maxbytes
> * value for each.
> @@ -300,8 +619,19 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int origin)
> else
> maxbytes = inode->i_sb->s_maxbytes;
>
> - return generic_file_llseek_size(file, offset, origin,
> - maxbytes, i_size_read(inode));
> + switch (origin) {
> + case SEEK_SET:
> + case SEEK_CUR:
> + case SEEK_END:
> + return generic_file_llseek_size(file, offset, origin,
> + maxbytes, i_size_read(inode));
> + case SEEK_DATA:
> + return ext4_seek_data(file, offset, maxbytes);
> + case SEEK_HOLE:
> + return ext4_seek_hole(file, offset, maxbytes);
> + }
> +
> + return -EINVAL;
> }
>
> const struct file_operations ext4_file_operations = {


2012-10-27 15:19:41

by Zheng Liu

[permalink] [raw]
Subject: Re: [PATCH 8/8 v3] ext4: introduce lseek SEEK_DATA/SEEK_HOLE support

On Sat, Oct 27, 2012 at 06:05:59PM +0800, Jeff Liu wrote:
[snip]
> >Jeff, I really appreicate if you could allow me to add Signed-off-by in this
> >patch. Please let me know if you think it is not well. Thanks!
> I'm fine. :)

Thank you. :-)

> Finally, maybe I'll try to tweak that routine with page cache lookup
> for unwritten extents to be a
> VFS helper so that Btrfs can make use of it as well, thank you!

That would be great. I am glad to use it in ext4 as well.

Regards,
Zheng

2012-11-08 23:21:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 2/8 v3] ext4: add operations on extent status tree

On Fri, Oct 26, 2012 at 09:23:39PM +0800, Zheng Liu wrote:
> + * 3. performance analysis
> + * -- overhead
> + * 1. Apart from operations on a delayed extent tree, we need to
> + * down_write(inode->i_data_sem) in delayed write path to maintain delayed
> + * extent tree, this can have impact on parallel read-write and write-write

Hi Zheng,

I can fix this up, before I finalize your commit, but I just want to
check. I believe this comment is out of date --- we are now using a
r/w spinlock, i_es_lock, yes? Since we never hold the spinlock for
very long, I would be surprised if this is going to be a scalability
bottleneck (too bad Eric doesn't have access to the big SMP machine
that he used to use to help us do our scalability testing, so we could
check to be sure).

- Ted

2012-11-09 02:10:40

by Zheng Liu

[permalink] [raw]
Subject: Re: [PATCH 2/8 v3] ext4: add operations on extent status tree

On Thu, Nov 08, 2012 at 06:21:23PM -0500, Theodore Ts'o wrote:
> On Fri, Oct 26, 2012 at 09:23:39PM +0800, Zheng Liu wrote:
> > + * 3. performance analysis
> > + * -- overhead
> > + * 1. Apart from operations on a delayed extent tree, we need to
> > + * down_write(inode->i_data_sem) in delayed write path to maintain delayed
> > + * extent tree, this can have impact on parallel read-write and write-write
>
> Hi Zheng,
>
> I can fix this up, before I finalize your commit, but I just want to
> check. I believe this comment is out of date --- we are now using a
> r/w spinlock, i_es_lock, yes? Since we never hold the spinlock for
> very long, I would be surprised if this is going to be a scalability
> bottleneck (too bad Eric doesn't have access to the big SMP machine
> that he used to use to help us do our scalability testing, so we could
> check to be sure).

Hi Ted,

Oops, it is my fault. Indeed it needs to be replaced with i_es_lock. I
can do some tests in a server which has 16 cores, but I am afraid that
it is not so big as you thought. I am willing to run Eric's tests to
ensure that there is no any scalability problem.

Regards,
Zheng

2012-11-19 03:17:36

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/8 v3] ext4: extent status tree (step 1)

On Fri, Oct 26, 2012 at 09:23:37PM +0800, Zheng Liu wrote:
> Hi all,
>
> Here is the v3 of extent status tree. In this version, the biggest change is
> the i_es_lock (a rwlock_t) instead of i_data_sem, which is introduced to protect
> extent status tree. Moreover I improve the SEEK_DATA/SEEK_HOLE. In previous
> version, the unwritten extent is as a data, and now it will lookup page cache to
> determine it as a data or a hole. When it has some data at the given range, the
> unwritten extent will be as a data. Otherwise, it will be as a hole. Other
> changes contain some bug fixes.

I don't think I ack'ed this patch series earlier; my apologies. In
any case (with the comment fixup we discussed earlier), It's currently
queued for the next merge window.

Regards,

- Ted

2012-11-19 05:16:29

by Zheng Liu

[permalink] [raw]
Subject: Re: [PATCH 0/8 v3] ext4: extent status tree (step 1)

On Sun, Nov 18, 2012 at 10:17:25PM -0500, Theodore Ts'o wrote:
> On Fri, Oct 26, 2012 at 09:23:37PM +0800, Zheng Liu wrote:
> > Hi all,
> >
> > Here is the v3 of extent status tree. In this version, the biggest change is
> > the i_es_lock (a rwlock_t) instead of i_data_sem, which is introduced to protect
> > extent status tree. Moreover I improve the SEEK_DATA/SEEK_HOLE. In previous
> > version, the unwritten extent is as a data, and now it will lookup page cache to
> > determine it as a data or a hole. When it has some data at the given range, the
> > unwritten extent will be as a data. Otherwise, it will be as a hole. Other
> > changes contain some bug fixes.
>
> I don't think I ack'ed this patch series earlier; my apologies. In
> any case (with the comment fixup we discussed earlier), It's currently
> queued for the next merge window.

Thanks. :-)

BTW, I am trying to implement the second step of extent status tree,
which will track all extent status for an inode. After that, some
improvements will be added, such as extent-level locking, etc.

Regards,
- Zheng