In addition to fixing a few bugs and addressing review comments, we now
add a new ioctl, EXT4_IOC_PRECACHE_EXTENTS, which forces all of the
extents in an inode to be cached in the extents status tree, and marks
them to be preferentially protected when under memory pressure.
This is critically important when using AIO to a preallocated file,
since if we need to read in blocks from the extent tree, the
io_submit(2) system call becomes synchronous, which is rather rude to
applications which were expecting the AIO to be "A".
As a bonus, using the extent status tree to store the logical to
physical block mapping is usually more compact that having to keep one
or more extent tree blocks in the buffer cache.
(Should we do this all the time, instead of when the application
explicitly requests it? Maybe; there could be cases with very large,
fragmented files accessed by an application such as "file" is only needs
to look at a small subset of the file where this could result in an
unnecessary work and memory allocated. OTOH, 95%+ of the time this
would probably be a win...)
Theodore Ts'o (5):
ext4: refactor code to read the extent tree block
ext4: print the block number of invalid extent tree blocks
ext4: use unsigned int for es_status values
ext4: cache all of an extent tree's leaf block upon reading
ext4: add new ioctl EXT4_IOC_PRECACHE_EXTENTS
fs/ext4/ext4.h | 19 +++-
fs/ext4/extents.c | 259 +++++++++++++++++++++++++++++---------------
fs/ext4/extents_status.c | 52 ++++++++-
fs/ext4/extents_status.h | 50 +++++----
fs/ext4/inode.c | 6 +-
fs/ext4/ioctl.c | 3 +
fs/ext4/migrate.c | 2 +-
fs/ext4/move_extent.c | 2 +-
include/trace/events/ext4.h | 28 +++--
9 files changed, 296 insertions(+), 125 deletions(-)
--
1.7.12.rc0.22.gcdd159b
When we find an invalid extent tree block, report the block number of
the bad block for debugging purposes.
Signed-off-by: "Theodore Ts'o" <[email protected]>
---
fs/ext4/extents.c | 24 ++++++++++++------------
1 file changed, 12 insertions(+), 12 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index dd422db..ba58815 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -407,7 +407,7 @@ static int ext4_valid_extent_entries(struct inode *inode,
static int __ext4_ext_check(const char *function, unsigned int line,
struct inode *inode, struct ext4_extent_header *eh,
- int depth)
+ int depth, ext4_fsblk_t pblk)
{
const char *error_msg;
int max = 0;
@@ -447,21 +447,21 @@ static int __ext4_ext_check(const char *function, unsigned int line,
corrupted:
ext4_error_inode(inode, function, line, 0,
- "bad header/extent: %s - magic %x, "
- "entries %u, max %u(%u), depth %u(%u)",
- error_msg, le16_to_cpu(eh->eh_magic),
- le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max),
- max, le16_to_cpu(eh->eh_depth), depth);
-
+ "pblk %llu bad header/extent: %s - magic %x, "
+ "entries %u, max %u(%u), depth %u(%u)",
+ (unsigned long long) pblk, error_msg,
+ le16_to_cpu(eh->eh_magic),
+ le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max),
+ max, le16_to_cpu(eh->eh_depth), depth);
return -EIO;
}
-#define ext4_ext_check(inode, eh, depth) \
- __ext4_ext_check(__func__, __LINE__, inode, eh, depth)
+#define ext4_ext_check(inode, eh, depth, pblk) \
+ __ext4_ext_check(__func__, __LINE__, (inode), (eh), (depth), (pblk))
int ext4_ext_check_inode(struct inode *inode)
{
- return ext4_ext_check(inode, ext_inode_hdr(inode), ext_depth(inode));
+ return ext4_ext_check(inode, ext_inode_hdr(inode), ext_depth(inode), 0);
}
static struct buffer_head *
@@ -484,7 +484,7 @@ __read_extent_tree_block(const char *function, unsigned int line,
if (buffer_verified(bh))
return bh;
err = __ext4_ext_check(function, line, inode,
- ext_block_hdr(bh), depth);
+ ext_block_hdr(bh), depth, pblk);
if (err)
goto errout;
set_buffer_verified(bh);
@@ -2775,7 +2775,7 @@ again:
path[0].p_hdr = ext_inode_hdr(inode);
i = 0;
- if (ext4_ext_check(inode, path[0].p_hdr, depth)) {
+ if (ext4_ext_check(inode, path[0].p_hdr, depth, 0)) {
err = -EIO;
goto out;
}
--
1.7.12.rc0.22.gcdd159b
Refactor out the code needed to read the extent tree block into a
single read_extent_tree_block() function. In addition to simplifying
the code, it also makes sure that we call the ext4_ext_load_extent
tracepoint whenever we need to read an extent tree block from disk.
Signed-off-by: "Theodore Ts'o" <[email protected]>
Reviewed-by: Zheng Liu <[email protected]>
---
fs/ext4/extents.c | 97 ++++++++++++++++++++++++-------------------------------
1 file changed, 43 insertions(+), 54 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a618738..dd422db 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -464,25 +464,39 @@ int ext4_ext_check_inode(struct inode *inode)
return ext4_ext_check(inode, ext_inode_hdr(inode), ext_depth(inode));
}
-static int __ext4_ext_check_block(const char *function, unsigned int line,
- struct inode *inode,
- struct ext4_extent_header *eh,
- int depth,
- struct buffer_head *bh)
+static struct buffer_head *
+__read_extent_tree_block(const char *function, unsigned int line,
+ struct inode *inode, ext4_fsblk_t pblk, int depth)
{
- int ret;
+ struct buffer_head *bh;
+ int err;
+ bh = sb_getblk(inode->i_sb, pblk);
+ if (unlikely(!bh))
+ return ERR_PTR(-ENOMEM);
+
+ if (!bh_uptodate_or_lock(bh)) {
+ trace_ext4_ext_load_extent(inode, pblk, _RET_IP_);
+ err = bh_submit_read(bh);
+ if (err < 0)
+ goto errout;
+ }
if (buffer_verified(bh))
- return 0;
- ret = ext4_ext_check(inode, eh, depth);
- if (ret)
- return ret;
+ return bh;
+ err = __ext4_ext_check(function, line, inode,
+ ext_block_hdr(bh), depth);
+ if (err)
+ goto errout;
set_buffer_verified(bh);
- return ret;
+ return bh;
+errout:
+ put_bh(bh);
+ return ERR_PTR(err);
+
}
-#define ext4_ext_check_block(inode, eh, depth, bh) \
- __ext4_ext_check_block(__func__, __LINE__, inode, eh, depth, bh)
+#define read_extent_tree_block(inode, pblk, depth) \
+ __read_extent_tree_block(__func__, __LINE__, (inode), (pblk), (depth))
#ifdef EXT_DEBUG
static void ext4_ext_show_path(struct inode *inode, struct ext4_ext_path *path)
@@ -748,20 +762,12 @@ ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block,
path[ppos].p_depth = i;
path[ppos].p_ext = NULL;
- bh = sb_getblk(inode->i_sb, path[ppos].p_block);
- if (unlikely(!bh)) {
- ret = -ENOMEM;
+ bh = read_extent_tree_block(inode, path[ppos].p_block, --i);
+ if (IS_ERR(bh)) {
+ ret = PTR_ERR(bh);
goto err;
}
- if (!bh_uptodate_or_lock(bh)) {
- trace_ext4_ext_load_extent(inode, block,
- path[ppos].p_block);
- ret = bh_submit_read(bh);
- if (ret < 0) {
- put_bh(bh);
- goto err;
- }
- }
+
eh = ext_block_hdr(bh);
ppos++;
if (unlikely(ppos > depth)) {
@@ -773,11 +779,6 @@ ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block,
}
path[ppos].p_bh = bh;
path[ppos].p_hdr = eh;
- i--;
-
- ret = ext4_ext_check_block(inode, eh, i, bh);
- if (ret < 0)
- goto err;
}
path[ppos].p_depth = i;
@@ -1412,29 +1413,21 @@ got_index:
ix++;
block = ext4_idx_pblock(ix);
while (++depth < path->p_depth) {
- bh = sb_bread(inode->i_sb, block);
- if (bh == NULL)
- return -EIO;
- eh = ext_block_hdr(bh);
/* subtract from p_depth to get proper eh_depth */
- if (ext4_ext_check_block(inode, eh,
- path->p_depth - depth, bh)) {
- put_bh(bh);
- return -EIO;
- }
+ bh = read_extent_tree_block(inode, block,
+ path->p_depth - depth);
+ if (IS_ERR(bh))
+ return PTR_ERR(bh);
+ eh = ext_block_hdr(bh);
ix = EXT_FIRST_INDEX(eh);
block = ext4_idx_pblock(ix);
put_bh(bh);
}
- bh = sb_bread(inode->i_sb, block);
- if (bh == NULL)
- return -EIO;
+ bh = read_extent_tree_block(inode, block, path->p_depth - depth);
+ if (IS_ERR(bh))
+ return PTR_ERR(bh);
eh = ext_block_hdr(bh);
- if (ext4_ext_check_block(inode, eh, path->p_depth - depth, bh)) {
- put_bh(bh);
- return -EIO;
- }
ex = EXT_FIRST_EXTENT(eh);
found_extent:
*logical = le32_to_cpu(ex->ee_block);
@@ -2829,10 +2822,11 @@ again:
ext_debug("move to level %d (block %llu)\n",
i + 1, ext4_idx_pblock(path[i].p_idx));
memset(path + i + 1, 0, sizeof(*path));
- bh = sb_bread(sb, ext4_idx_pblock(path[i].p_idx));
- if (!bh) {
+ bh = read_extent_tree_block(inode,
+ ext4_idx_pblock(path[i].p_idx), depth - i - 1);
+ if (IS_ERR(bh)) {
/* should we reset i_size? */
- err = -EIO;
+ err = PTR_ERR(bh);
break;
}
/* Yield here to deal with large extent trees.
@@ -2842,11 +2836,6 @@ again:
err = -EIO;
break;
}
- if (ext4_ext_check_block(inode, ext_block_hdr(bh),
- depth - i - 1, bh)) {
- err = -EIO;
- break;
- }
path[i + 1].p_bh = bh;
/* save actual number of indexes since this
--
1.7.12.rc0.22.gcdd159b
Add a new ioctl which forces the all of the extents in an inode to be
cached in the extent_status tree. This is critically important when
using AIO to a preallocated file, since if we need to read in blocks
from the extent tree, the io_submit(2) system call becomes
synchronous, and the AIO is no longer "A", which is bad.
In addition, for most files which have an external leaf tree block,
the cost of caching the information in the extent status tree will be
less than caching the entire 4k block in the buffer cache. So it is
generally a win to keep the extent information cached.
---
fs/ext4/ext4.h | 17 +++++++-----
fs/ext4/extents.c | 67 +++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/extents_status.c | 19 ++++++++++----
fs/ext4/ioctl.c | 3 +++
4 files changed, 93 insertions(+), 13 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a8497b0..16b531c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -561,15 +561,16 @@ enum {
#define EXT4_GET_BLOCKS_NO_PUT_HOLE 0x0200
/*
- * The bit position of this flag must not overlap with any of the
- * EXT4_GET_BLOCKS_*. It is used by ext4_ext_find_extent(),
+ * The bit position of these flags must not overlap with any of the
+ * EXT4_GET_BLOCKS_*. They are used by ext4_ext_find_extent(),
* read_extent_tree_block(), ext4_split_extent_at(),
- * ext4_ext_insert_extent(), and ext4_ext_create_new_leaf() to
- * indicate that the we shouldn't be caching the extents when reading
- * from the extent tree while a truncate or punch hole operation
- * is in progress.
+ * ext4_ext_insert_extent(), and ext4_ext_create_new_leaf().
+ * EXT4_EX_NOCACHE is used to indicate that the we shouldn't be
+ * caching the extents when reading from the extent tree while a
+ * truncate or punch hole operation is in progress.
*/
#define EXT4_EX_NOCACHE 0x0400
+#define EXT4_EX_FORCE_CACHE 0x0800
/*
* Flags used by ext4_free_blocks
@@ -601,6 +602,7 @@ enum {
#define EXT4_IOC_MOVE_EXT _IOWR('f', 15, struct move_extent)
#define EXT4_IOC_RESIZE_FS _IOW('f', 16, __u64)
#define EXT4_IOC_SWAP_BOOT _IO('f', 17)
+#define EXT4_IOC_PRECACHE_EXTENTS _IO('f', 18)
#if defined(__KERNEL__) && defined(CONFIG_COMPAT)
/*
@@ -1386,6 +1388,7 @@ enum {
nolocking */
EXT4_STATE_MAY_INLINE_DATA, /* may have in-inode data */
EXT4_STATE_ORDERED_MODE, /* data=ordered mode */
+ EXT4_STATE_EXT_PRECACHED, /* extents have been precached */
};
#define EXT4_INODE_BIT_FNS(name, field, offset) \
@@ -2704,7 +2707,7 @@ extern int ext4_find_delalloc_range(struct inode *inode,
extern int ext4_find_delalloc_cluster(struct inode *inode, ext4_lblk_t lblk);
extern int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
__u64 start, __u64 len);
-
+extern int ext4_ext_precache(struct inode *inode);
/* move_extent.c */
extern void ext4_double_down_write_data_sem(struct inode *first,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 671dd91..f8f0f91 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -482,7 +482,7 @@ __read_extent_tree_block(const char *function, unsigned int line,
if (err < 0)
goto errout;
}
- if (buffer_verified(bh))
+ if (buffer_verified(bh) && !(flags & EXT4_EX_FORCE_CACHE))
return bh;
err = __ext4_ext_check(function, line, inode,
ext_block_hdr(bh), depth, pblk);
@@ -526,6 +526,71 @@ errout:
__read_extent_tree_block(__func__, __LINE__, (inode), (pblk), \
(depth), (flags))
+/*
+ * This function is called to cache a file's extent information in the
+ * extent status tree
+ */
+int ext4_ext_precache(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ struct ext4_ext_path *path = NULL;
+ struct buffer_head *bh = 0;
+ int i = 0, depth, ret = 0;
+
+ if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ return 0; /* not an extent-mapped inode */
+
+ down_read(&ei->i_data_sem);
+ depth = ext_depth(inode);
+
+ path = kzalloc(sizeof(struct ext4_ext_path) * (depth + 1),
+ GFP_NOFS);
+ if (path == NULL) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* Don't cache anything if there are no external extent blocks */
+ if (depth == 0)
+ goto out;
+ path[0].p_hdr = ext_inode_hdr(inode);
+ ret = ext4_ext_check(inode, path[0].p_hdr, depth, 0);
+ if (ret)
+ goto out;
+ path[0].p_idx = EXT_FIRST_INDEX(path[0].p_hdr);
+ while (i >= 0) {
+ /*
+ * If this is a leaf block or we've reached the end of
+ * the index block, go up
+ */
+ if ((i == depth) ||
+ path[i].p_idx > EXT_LAST_INDEX(path[i].p_hdr)) {
+ brelse(path[i].p_bh);
+ path[i].p_bh = NULL;
+ i--;
+ continue;
+ }
+ bh = read_extent_tree_block(inode,
+ ext4_idx_pblock(path[i].p_idx++),
+ depth - i - 1,
+ EXT4_EX_FORCE_CACHE);
+ if (IS_ERR(bh)) {
+ ret = PTR_ERR(bh);
+ break;
+ }
+ i++;
+ path[i].p_bh = bh;
+ path[i].p_hdr = ext_block_hdr(bh);
+ path[i].p_idx = EXT_FIRST_INDEX(path[i].p_hdr);
+ }
+ ext4_set_inode_state(inode, EXT4_STATE_EXT_PRECACHED);
+out:
+ up_read(&ei->i_data_sem);
+ ext4_ext_drop_refs(path);
+ kfree(path);
+ return ret;
+}
+
#ifdef EXT_DEBUG
static void ext4_ext_show_path(struct inode *inode, struct ext4_ext_path *path)
{
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 1dc5df0..a1cbbcf 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -710,11 +710,8 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
write_lock(&EXT4_I(inode)->i_es_lock);
es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk);
- if (es && ((es->es_lblk <= lblk) || (es->es_lblk <= end)))
- goto out;
-
- __es_insert_extent(inode, &newes);
-out:
+ if (!es || es->es_lblk > end)
+ __es_insert_extent(inode, &newes);
write_unlock(&EXT4_I(inode)->i_es_lock);
}
@@ -930,6 +927,12 @@ static int ext4_inode_touch_time_cmp(void *priv, struct list_head *a,
eia = list_entry(a, struct ext4_inode_info, i_es_lru);
eib = list_entry(b, struct ext4_inode_info, i_es_lru);
+ if (ext4_test_inode_state(&eia->vfs_inode, EXT4_STATE_EXT_PRECACHED) &&
+ !ext4_test_inode_state(&eib->vfs_inode, EXT4_STATE_EXT_PRECACHED))
+ return 1;
+ if (!ext4_test_inode_state(&eia->vfs_inode, EXT4_STATE_EXT_PRECACHED) &&
+ ext4_test_inode_state(&eib->vfs_inode, EXT4_STATE_EXT_PRECACHED))
+ return -1;
if (eia->i_touch_when == eib->i_touch_when)
return 0;
if (time_after(eia->i_touch_when, eib->i_touch_when))
@@ -1069,10 +1072,16 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
struct rb_node *node;
struct extent_status *es;
int nr_shrunk = 0;
+ static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+ DEFAULT_RATELIMIT_BURST);
if (ei->i_es_lru_nr == 0)
return 0;
+ if (ext4_test_inode_state(inode, EXT4_STATE_EXT_PRECACHED) &&
+ __ratelimit(&_rs))
+ ext4_warning(inode->i_sb, "forced shrink of precached extents");
+
node = rb_first(&tree->root);
while (node != NULL) {
es = rb_entry(node, struct extent_status, rb_node);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 9491ac0..3c7304c 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -622,6 +622,8 @@ resizefs_out:
return 0;
}
+ case EXT4_IOC_PRECACHE_EXTENTS:
+ return ext4_ext_precache(inode);
default:
return -ENOTTY;
@@ -686,6 +688,7 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
case EXT4_IOC_MOVE_EXT:
case FITRIM:
case EXT4_IOC_RESIZE_FS:
+ case EXT4_IOC_PRECACHE_EXTENTS:
break;
default:
return -ENOIOCTLCMD;
--
1.7.12.rc0.22.gcdd159b
When we read in an extent tree leaf block from disk, arrange to have
all of its entries cached. In nearly all cases the in-memory
representation will be more compact than the on-disk representation in
the buffer cache, and it allows us to get the information without
having to traverse the extent tree for successive extents.
Signed-off-by: "Theodore Ts'o" <[email protected]>
Reviewed-by: Zheng Liu <[email protected]>
---
fs/ext4/ext4.h | 14 +++++++-
fs/ext4/extents.c | 87 +++++++++++++++++++++++++++++++--------------
fs/ext4/extents_status.c | 37 ++++++++++++++++++-
fs/ext4/extents_status.h | 3 ++
fs/ext4/migrate.c | 2 +-
fs/ext4/move_extent.c | 2 +-
include/trace/events/ext4.h | 14 +++++++-
7 files changed, 127 insertions(+), 32 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6ed348d..a8497b0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -561,6 +561,17 @@ enum {
#define EXT4_GET_BLOCKS_NO_PUT_HOLE 0x0200
/*
+ * The bit position of this flag must not overlap with any of the
+ * EXT4_GET_BLOCKS_*. It is used by ext4_ext_find_extent(),
+ * read_extent_tree_block(), ext4_split_extent_at(),
+ * ext4_ext_insert_extent(), and ext4_ext_create_new_leaf() to
+ * indicate that the we shouldn't be caching the extents when reading
+ * from the extent tree while a truncate or punch hole operation
+ * is in progress.
+ */
+#define EXT4_EX_NOCACHE 0x0400
+
+/*
* Flags used by ext4_free_blocks
*/
#define EXT4_FREE_BLOCKS_METADATA 0x0001
@@ -2683,7 +2694,8 @@ extern int ext4_ext_insert_extent(handle_t *, struct inode *,
struct ext4_ext_path *,
struct ext4_extent *, int);
extern struct ext4_ext_path *ext4_ext_find_extent(struct inode *, ext4_lblk_t,
- struct ext4_ext_path *);
+ struct ext4_ext_path *,
+ int flags);
extern void ext4_ext_drop_refs(struct ext4_ext_path *);
extern int ext4_ext_check_inode(struct inode *inode);
extern int ext4_find_delalloc_range(struct inode *inode,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index ba58815..671dd91 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -466,7 +466,8 @@ int ext4_ext_check_inode(struct inode *inode)
static struct buffer_head *
__read_extent_tree_block(const char *function, unsigned int line,
- struct inode *inode, ext4_fsblk_t pblk, int depth)
+ struct inode *inode, ext4_fsblk_t pblk, int depth,
+ int flags)
{
struct buffer_head *bh;
int err;
@@ -488,6 +489,32 @@ __read_extent_tree_block(const char *function, unsigned int line,
if (err)
goto errout;
set_buffer_verified(bh);
+ /*
+ * If this is a leaf block, cache all of its entries
+ */
+ if (!(flags & EXT4_EX_NOCACHE) && depth == 0) {
+ struct ext4_extent_header *eh = ext_block_hdr(bh);
+ struct ext4_extent *ex = EXT_FIRST_EXTENT(eh);
+ ext4_lblk_t prev = 0;
+ int i;
+
+ for (i = le16_to_cpu(eh->eh_entries); i > 0; i--, ex++) {
+ unsigned int status = EXTENT_STATUS_WRITTEN;
+ ext4_lblk_t lblk = le32_to_cpu(ex->ee_block);
+ int len = ext4_ext_get_actual_len(ex);
+
+ if (prev && (prev != lblk))
+ ext4_es_cache_extent(inode, prev,
+ lblk - prev, ~0,
+ EXTENT_STATUS_HOLE);
+
+ if (ext4_ext_is_uninitialized(ex))
+ status = EXTENT_STATUS_UNWRITTEN;
+ ext4_es_cache_extent(inode, lblk, len,
+ ext4_ext_pblock(ex), status);
+ prev = lblk + len;
+ }
+ }
return bh;
errout:
put_bh(bh);
@@ -495,8 +522,9 @@ errout:
}
-#define read_extent_tree_block(inode, pblk, depth) \
- __read_extent_tree_block(__func__, __LINE__, (inode), (pblk), (depth))
+#define read_extent_tree_block(inode, pblk, depth, flags) \
+ __read_extent_tree_block(__func__, __LINE__, (inode), (pblk), \
+ (depth), (flags))
#ifdef EXT_DEBUG
static void ext4_ext_show_path(struct inode *inode, struct ext4_ext_path *path)
@@ -730,7 +758,7 @@ int ext4_ext_tree_init(handle_t *handle, struct inode *inode)
struct ext4_ext_path *
ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block,
- struct ext4_ext_path *path)
+ struct ext4_ext_path *path, int flags)
{
struct ext4_extent_header *eh;
struct buffer_head *bh;
@@ -762,7 +790,8 @@ ext4_ext_find_extent(struct inode *inode, ext4_lblk_t block,
path[ppos].p_depth = i;
path[ppos].p_ext = NULL;
- bh = read_extent_tree_block(inode, path[ppos].p_block, --i);
+ bh = read_extent_tree_block(inode, path[ppos].p_block, --i,
+ flags);
if (IS_ERR(bh)) {
ret = PTR_ERR(bh);
goto err;
@@ -1199,7 +1228,8 @@ out:
* if no free index is found, then it requests in-depth growing.
*/
static int ext4_ext_create_new_leaf(handle_t *handle, struct inode *inode,
- unsigned int flags,
+ unsigned int mb_flags,
+ unsigned int gb_flags,
struct ext4_ext_path *path,
struct ext4_extent *newext)
{
@@ -1221,7 +1251,7 @@ repeat:
if (EXT_HAS_FREE_INDEX(curp)) {
/* if we found index with free entry, then use that
* entry: create all needed subtree and add new leaf */
- err = ext4_ext_split(handle, inode, flags, path, newext, i);
+ err = ext4_ext_split(handle, inode, mb_flags, path, newext, i);
if (err)
goto out;
@@ -1229,12 +1259,12 @@ repeat:
ext4_ext_drop_refs(path);
path = ext4_ext_find_extent(inode,
(ext4_lblk_t)le32_to_cpu(newext->ee_block),
- path);
+ path, gb_flags);
if (IS_ERR(path))
err = PTR_ERR(path);
} else {
/* tree is full, time to grow in depth */
- err = ext4_ext_grow_indepth(handle, inode, flags, newext);
+ err = ext4_ext_grow_indepth(handle, inode, mb_flags, newext);
if (err)
goto out;
@@ -1242,7 +1272,7 @@ repeat:
ext4_ext_drop_refs(path);
path = ext4_ext_find_extent(inode,
(ext4_lblk_t)le32_to_cpu(newext->ee_block),
- path);
+ path, gb_flags);
if (IS_ERR(path)) {
err = PTR_ERR(path);
goto out;
@@ -1415,7 +1445,7 @@ got_index:
while (++depth < path->p_depth) {
/* subtract from p_depth to get proper eh_depth */
bh = read_extent_tree_block(inode, block,
- path->p_depth - depth);
+ path->p_depth - depth, 0);
if (IS_ERR(bh))
return PTR_ERR(bh);
eh = ext_block_hdr(bh);
@@ -1424,7 +1454,7 @@ got_index:
put_bh(bh);
}
- bh = read_extent_tree_block(inode, block, path->p_depth - depth);
+ bh = read_extent_tree_block(inode, block, path->p_depth - depth, 0);
if (IS_ERR(bh))
return PTR_ERR(bh);
eh = ext_block_hdr(bh);
@@ -1786,7 +1816,7 @@ out:
*/
int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
struct ext4_ext_path *path,
- struct ext4_extent *newext, int flag)
+ struct ext4_extent *newext, int gb_flags)
{
struct ext4_extent_header *eh;
struct ext4_extent *ex, *fex;
@@ -1795,7 +1825,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
int depth, len, err;
ext4_lblk_t next;
unsigned uninitialized = 0;
- int flags = 0;
+ int mb_flags = 0;
if (unlikely(ext4_ext_get_actual_len(newext) == 0)) {
EXT4_ERROR_INODE(inode, "ext4_ext_get_actual_len(newext) == 0");
@@ -1810,7 +1840,7 @@ int ext4_ext_insert_extent(handle_t *handle, struct inode *inode,
}
/* try to insert block into found extent and return */
- if (ex && !(flag & EXT4_GET_BLOCKS_PRE_IO)) {
+ if (ex && !(gb_flags & EXT4_GET_BLOCKS_PRE_IO)) {
/*
* Try to see whether we should rather test the extent on
@@ -1913,7 +1943,7 @@ prepend:
if (next != EXT_MAX_BLOCKS) {
ext_debug("next leaf block - %u\n", next);
BUG_ON(npath != NULL);
- npath = ext4_ext_find_extent(inode, next, NULL);
+ npath = ext4_ext_find_extent(inode, next, NULL, 0);
if (IS_ERR(npath))
return PTR_ERR(npath);
BUG_ON(npath->p_depth != path->p_depth);
@@ -1932,9 +1962,10 @@ prepend:
* There is no free space in the found leaf.
* We're gonna add a new leaf in the tree.
*/
- if (flag & EXT4_GET_BLOCKS_METADATA_NOFAIL)
- flags = EXT4_MB_USE_RESERVED;
- err = ext4_ext_create_new_leaf(handle, inode, flags, path, newext);
+ if (gb_flags & EXT4_GET_BLOCKS_METADATA_NOFAIL)
+ mb_flags = EXT4_MB_USE_RESERVED;
+ err = ext4_ext_create_new_leaf(handle, inode, mb_flags, gb_flags,
+ path, newext);
if (err)
goto cleanup;
depth = ext_depth(inode);
@@ -2000,7 +2031,7 @@ has_space:
merge:
/* try to merge extents */
- if (!(flag & EXT4_GET_BLOCKS_PRE_IO))
+ if (!(gb_flags & EXT4_GET_BLOCKS_PRE_IO))
ext4_ext_try_to_merge(handle, inode, path, nearex);
@@ -2043,7 +2074,7 @@ static int ext4_fill_fiemap_extents(struct inode *inode,
path = NULL;
}
- path = ext4_ext_find_extent(inode, block, path);
+ path = ext4_ext_find_extent(inode, block, path, 0);
if (IS_ERR(path)) {
up_read(&EXT4_I(inode)->i_data_sem);
err = PTR_ERR(path);
@@ -2705,7 +2736,7 @@ again:
ext4_lblk_t ee_block;
/* find extent for this block */
- path = ext4_ext_find_extent(inode, end, NULL);
+ path = ext4_ext_find_extent(inode, end, NULL, EXT4_EX_NOCACHE);
if (IS_ERR(path)) {
ext4_journal_stop(handle);
return PTR_ERR(path);
@@ -2747,6 +2778,7 @@ again:
*/
err = ext4_split_extent_at(handle, inode, path,
end + 1, split_flag,
+ EXT4_EX_NOCACHE |
EXT4_GET_BLOCKS_PRE_IO |
EXT4_GET_BLOCKS_METADATA_NOFAIL);
@@ -2823,7 +2855,8 @@ again:
i + 1, ext4_idx_pblock(path[i].p_idx));
memset(path + i + 1, 0, sizeof(*path));
bh = read_extent_tree_block(inode,
- ext4_idx_pblock(path[i].p_idx), depth - i - 1);
+ ext4_idx_pblock(path[i].p_idx), depth - i - 1,
+ EXT4_EX_NOCACHE);
if (IS_ERR(bh)) {
/* should we reset i_size? */
err = PTR_ERR(bh);
@@ -3170,7 +3203,7 @@ static int ext4_split_extent(handle_t *handle,
* result in split of original leaf or extent zeroout.
*/
ext4_ext_drop_refs(path);
- path = ext4_ext_find_extent(inode, map->m_lblk, path);
+ path = ext4_ext_find_extent(inode, map->m_lblk, path, 0);
if (IS_ERR(path))
return PTR_ERR(path);
depth = ext_depth(inode);
@@ -3554,7 +3587,7 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
if (err < 0)
goto out;
ext4_ext_drop_refs(path);
- path = ext4_ext_find_extent(inode, map->m_lblk, path);
+ path = ext4_ext_find_extent(inode, map->m_lblk, path, 0);
if (IS_ERR(path)) {
err = PTR_ERR(path);
goto out;
@@ -4041,7 +4074,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
trace_ext4_ext_map_blocks_enter(inode, map->m_lblk, map->m_len, flags);
/* find extent for this block */
- path = ext4_ext_find_extent(inode, map->m_lblk, NULL);
+ path = ext4_ext_find_extent(inode, map->m_lblk, NULL, 0);
if (IS_ERR(path)) {
err = PTR_ERR(path);
path = NULL;
@@ -4760,6 +4793,6 @@ int ext4_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
error = ext4_fill_fiemap_extents(inode, start_blk,
len_blks, fieinfo);
}
-
+ ext4_es_lru_add(inode);
return error;
}
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index ded2615..1dc5df0 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -419,7 +419,7 @@ static void ext4_es_insert_extent_ext_check(struct inode *inode,
unsigned short ee_len;
int depth, ee_status, es_status;
- path = ext4_ext_find_extent(inode, es->es_lblk, NULL);
+ path = ext4_ext_find_extent(inode, es->es_lblk, NULL, EXT4_EX_NOCACHE);
if (IS_ERR(path))
return;
@@ -684,6 +684,41 @@ error:
}
/*
+ * ext4_es_cache_extent() inserts information into the extent status
+ * tree if and only if there isn't information about the range in
+ * question already.
+ */
+void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
+ ext4_lblk_t len, ext4_fsblk_t pblk,
+ unsigned int status)
+{
+ struct extent_status *es;
+ struct extent_status newes;
+ ext4_lblk_t end = lblk + len - 1;
+
+ newes.es_lblk = lblk;
+ newes.es_len = len;
+ ext4_es_store_pblock(&newes, pblk);
+ ext4_es_store_status(&newes, status);
+ trace_ext4_es_cache_extent(inode, &newes);
+
+ if (!len)
+ return;
+
+ BUG_ON(end < lblk);
+
+ write_lock(&EXT4_I(inode)->i_es_lock);
+
+ es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk);
+ if (es && ((es->es_lblk <= lblk) || (es->es_lblk <= end)))
+ goto out;
+
+ __es_insert_extent(inode, &newes);
+out:
+ write_unlock(&EXT4_I(inode)->i_es_lock);
+}
+
+/*
* ext4_es_lookup_extent() looks up an extent in extent status tree.
*
* ext4_es_lookup_extent is called by ext4_map_blocks/ext4_da_map_blocks.
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index ae1be58..23348d1 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -71,6 +71,9 @@ extern void ext4_es_init_tree(struct ext4_es_tree *tree);
extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len, ext4_fsblk_t pblk,
unsigned int status);
+extern void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
+ ext4_lblk_t len, ext4_fsblk_t pblk,
+ unsigned int status);
extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len);
extern void ext4_es_find_delayed_extent_range(struct inode *inode,
diff --git a/fs/ext4/migrate.c b/fs/ext4/migrate.c
index 49e8bdf..f99bdb8 100644
--- a/fs/ext4/migrate.c
+++ b/fs/ext4/migrate.c
@@ -39,7 +39,7 @@ static int finish_range(handle_t *handle, struct inode *inode,
newext.ee_block = cpu_to_le32(lb->first_block);
newext.ee_len = cpu_to_le16(lb->last_block - lb->first_block + 1);
ext4_ext_store_pblock(&newext, lb->first_pblock);
- path = ext4_ext_find_extent(inode, lb->first_block, NULL);
+ path = ext4_ext_find_extent(inode, lb->first_block, NULL, 0);
if (IS_ERR(path)) {
retval = PTR_ERR(path);
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index e86dddb..7fa4d85 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -37,7 +37,7 @@ get_ext_path(struct inode *inode, ext4_lblk_t lblock,
int ret = 0;
struct ext4_ext_path *path;
- path = ext4_ext_find_extent(inode, lblock, *orig_path);
+ path = ext4_ext_find_extent(inode, lblock, *orig_path, EXT4_EX_NOCACHE);
if (IS_ERR(path))
ret = PTR_ERR(path);
else if (path[ext_depth(inode)].p_ext == NULL)
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 47a355b..d892b55 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2192,7 +2192,7 @@ TRACE_EVENT(ext4_ext_remove_space_done,
(unsigned short) __entry->eh_entries)
);
-TRACE_EVENT(ext4_es_insert_extent,
+DECLARE_EVENT_CLASS(ext4__es_extent,
TP_PROTO(struct inode *inode, struct extent_status *es),
TP_ARGS(inode, es),
@@ -2222,6 +2222,18 @@ TRACE_EVENT(ext4_es_insert_extent,
__entry->pblk, show_extent_status(__entry->status))
);
+DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent,
+ TP_PROTO(struct inode *inode, struct extent_status *es),
+
+ TP_ARGS(inode, es)
+);
+
+DEFINE_EVENT(ext4__es_extent, ext4_es_cache_extent,
+ TP_PROTO(struct inode *inode, struct extent_status *es),
+
+ TP_ARGS(inode, es)
+);
+
TRACE_EVENT(ext4_es_remove_extent,
TP_PROTO(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len),
--
1.7.12.rc0.22.gcdd159b
Don't use an unsigned long long for the es_status flags; this requires
that we pass 64-bit values around which is painful on 32-bit systems.
Instead pass the extent status flags around using the low 4 bits of an
unsigned int, and shift them into place when we are reading or writing
es_pblk.
Signed-off-by: "Theodore Ts'o" <[email protected]>
Reviewed-by: Zheng Liu <[email protected]>
---
fs/ext4/extents_status.c | 6 +++---
fs/ext4/extents_status.h | 47 ++++++++++++++++++++++++++-------------------
fs/ext4/inode.c | 6 +++---
include/trace/events/ext4.h | 14 +++++++-------
4 files changed, 40 insertions(+), 33 deletions(-)
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 91cb110..ded2615 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -263,7 +263,7 @@ void ext4_es_find_delayed_extent_range(struct inode *inode,
if (tree->cache_es) {
es1 = tree->cache_es;
if (in_range(lblk, es1->es_lblk, es1->es_len)) {
- es_debug("%u cached by [%u/%u) %llu %llx\n",
+ es_debug("%u cached by [%u/%u) %llu %x\n",
lblk, es1->es_lblk, es1->es_len,
ext4_es_pblock(es1), ext4_es_status(es1));
goto out;
@@ -641,13 +641,13 @@ out:
*/
int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len, ext4_fsblk_t pblk,
- unsigned long long status)
+ unsigned int status)
{
struct extent_status newes;
ext4_lblk_t end = lblk + len - 1;
int err = 0;
- es_debug("add [%u/%u) %llu %llx to extent status tree of inode %lu\n",
+ es_debug("add [%u/%u) %llu %x to extent status tree of inode %lu\n",
lblk, len, pblk, status, inode->i_ino);
if (!len)
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index e936730..ae1be58 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -29,16 +29,26 @@
/*
* These flags live in the high bits of extent_status.es_pblk
*/
-#define EXTENT_STATUS_WRITTEN (1ULL << 63)
-#define EXTENT_STATUS_UNWRITTEN (1ULL << 62)
-#define EXTENT_STATUS_DELAYED (1ULL << 61)
-#define EXTENT_STATUS_HOLE (1ULL << 60)
+#define ES_SHIFT 60
+
+#define EXTENT_STATUS_WRITTEN (1 << 3)
+#define EXTENT_STATUS_UNWRITTEN (1 << 2)
+#define EXTENT_STATUS_DELAYED (1 << 1)
+#define EXTENT_STATUS_HOLE (1 << 0)
#define EXTENT_STATUS_FLAGS (EXTENT_STATUS_WRITTEN | \
EXTENT_STATUS_UNWRITTEN | \
EXTENT_STATUS_DELAYED | \
EXTENT_STATUS_HOLE)
+#define ES_WRITTEN (1ULL << 63)
+#define ES_UNWRITTEN (1ULL << 62)
+#define ES_DELAYED (1ULL << 61)
+#define ES_HOLE (1ULL << 60)
+
+#define ES_MASK (ES_WRITTEN | ES_UNWRITTEN | \
+ ES_DELAYED | ES_HOLE)
+
struct ext4_sb_info;
struct ext4_extent;
@@ -60,7 +70,7 @@ extern void ext4_es_init_tree(struct ext4_es_tree *tree);
extern int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len, ext4_fsblk_t pblk,
- unsigned long long status);
+ unsigned int status);
extern int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len);
extern void ext4_es_find_delayed_extent_range(struct inode *inode,
@@ -72,32 +82,32 @@ extern int ext4_es_zeroout(struct inode *inode, struct ext4_extent *ex);
static inline int ext4_es_is_written(struct extent_status *es)
{
- return (es->es_pblk & EXTENT_STATUS_WRITTEN) != 0;
+ return (es->es_pblk & ES_WRITTEN) != 0;
}
static inline int ext4_es_is_unwritten(struct extent_status *es)
{
- return (es->es_pblk & EXTENT_STATUS_UNWRITTEN) != 0;
+ return (es->es_pblk & ES_UNWRITTEN) != 0;
}
static inline int ext4_es_is_delayed(struct extent_status *es)
{
- return (es->es_pblk & EXTENT_STATUS_DELAYED) != 0;
+ return (es->es_pblk & ES_DELAYED) != 0;
}
static inline int ext4_es_is_hole(struct extent_status *es)
{
- return (es->es_pblk & EXTENT_STATUS_HOLE) != 0;
+ return (es->es_pblk & ES_HOLE) != 0;
}
-static inline ext4_fsblk_t ext4_es_status(struct extent_status *es)
+static inline unsigned int ext4_es_status(struct extent_status *es)
{
- return (es->es_pblk & EXTENT_STATUS_FLAGS);
+ return (es->es_pblk >> ES_SHIFT);
}
static inline ext4_fsblk_t ext4_es_pblock(struct extent_status *es)
{
- return (es->es_pblk & ~EXTENT_STATUS_FLAGS);
+ return (es->es_pblk & ~ES_MASK);
}
static inline void ext4_es_store_pblock(struct extent_status *es,
@@ -105,19 +115,16 @@ static inline void ext4_es_store_pblock(struct extent_status *es,
{
ext4_fsblk_t block;
- block = (pb & ~EXTENT_STATUS_FLAGS) |
- (es->es_pblk & EXTENT_STATUS_FLAGS);
+ block = (pb & ~ES_MASK) | (es->es_pblk & ES_MASK);
es->es_pblk = block;
}
static inline void ext4_es_store_status(struct extent_status *es,
- unsigned long long status)
+ unsigned int status)
{
- ext4_fsblk_t block;
-
- block = (status & EXTENT_STATUS_FLAGS) |
- (es->es_pblk & ~EXTENT_STATUS_FLAGS);
- es->es_pblk = block;
+ es->es_pblk = (((unsigned long long)
+ (status & EXTENT_STATUS_FLAGS) << ES_SHIFT) |
+ (es->es_pblk & ~ES_MASK));
}
extern void ext4_es_register_shrinker(struct ext4_sb_info *sbi);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ea2cbf1..a1b3d56 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -553,7 +553,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
}
if (retval > 0) {
int ret;
- unsigned long long status;
+ unsigned int status;
#ifdef ES_AGGRESSIVE_TEST
if (retval != map->m_len) {
@@ -654,7 +654,7 @@ found:
if (retval > 0) {
int ret;
- unsigned long long status;
+ unsigned int status;
#ifdef ES_AGGRESSIVE_TEST
if (retval != map->m_len) {
@@ -1635,7 +1635,7 @@ add_delayed:
set_buffer_delay(bh);
} else if (retval > 0) {
int ret;
- unsigned long long status;
+ unsigned int status;
#ifdef ES_AGGRESSIVE_TEST
if (retval != map->m_len) {
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 2068db2..47a355b 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -64,10 +64,10 @@ struct extent_status;
{ EXT4_FREE_BLOCKS_NOFREE_LAST_CLUSTER, "LAST_CLUSTER" })
#define show_extent_status(status) __print_flags(status, "", \
- { (1 << 3), "W" }, \
- { (1 << 2), "U" }, \
- { (1 << 1), "D" }, \
- { (1 << 0), "H" })
+ { EXTENT_STATUS_WRITTEN, "W" }, \
+ { EXTENT_STATUS_UNWRITTEN, "U" }, \
+ { EXTENT_STATUS_DELAYED, "D" }, \
+ { EXTENT_STATUS_HOLE, "H" })
TRACE_EVENT(ext4_free_inode,
@@ -2212,7 +2212,7 @@ TRACE_EVENT(ext4_es_insert_extent,
__entry->lblk = es->es_lblk;
__entry->len = es->es_len;
__entry->pblk = ext4_es_pblock(es);
- __entry->status = ext4_es_status(es) >> 60;
+ __entry->status = ext4_es_status(es);
),
TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s",
@@ -2289,7 +2289,7 @@ TRACE_EVENT(ext4_es_find_delayed_extent_range_exit,
__entry->lblk = es->es_lblk;
__entry->len = es->es_len;
__entry->pblk = ext4_es_pblock(es);
- __entry->status = ext4_es_status(es) >> 60;
+ __entry->status = ext4_es_status(es);
),
TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s",
@@ -2343,7 +2343,7 @@ TRACE_EVENT(ext4_es_lookup_extent_exit,
__entry->lblk = es->es_lblk;
__entry->len = es->es_len;
__entry->pblk = ext4_es_pblock(es);
- __entry->status = ext4_es_status(es) >> 60;
+ __entry->status = ext4_es_status(es);
__entry->found = found;
),
--
1.7.12.rc0.22.gcdd159b
On Tue, Jul 16, 2013 at 11:18:00AM -0400, Theodore Ts'o wrote:
> When we find an invalid extent tree block, report the block number of
> the bad block for debugging purposes.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
Reviewed-by: Zheng Liu <[email protected]>
- Zheng
> ---
> fs/ext4/extents.c | 24 ++++++++++++------------
> 1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index dd422db..ba58815 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -407,7 +407,7 @@ static int ext4_valid_extent_entries(struct inode *inode,
>
> static int __ext4_ext_check(const char *function, unsigned int line,
> struct inode *inode, struct ext4_extent_header *eh,
> - int depth)
> + int depth, ext4_fsblk_t pblk)
> {
> const char *error_msg;
> int max = 0;
> @@ -447,21 +447,21 @@ static int __ext4_ext_check(const char *function, unsigned int line,
>
> corrupted:
> ext4_error_inode(inode, function, line, 0,
> - "bad header/extent: %s - magic %x, "
> - "entries %u, max %u(%u), depth %u(%u)",
> - error_msg, le16_to_cpu(eh->eh_magic),
> - le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max),
> - max, le16_to_cpu(eh->eh_depth), depth);
> -
> + "pblk %llu bad header/extent: %s - magic %x, "
> + "entries %u, max %u(%u), depth %u(%u)",
> + (unsigned long long) pblk, error_msg,
> + le16_to_cpu(eh->eh_magic),
> + le16_to_cpu(eh->eh_entries), le16_to_cpu(eh->eh_max),
> + max, le16_to_cpu(eh->eh_depth), depth);
> return -EIO;
> }
>
> -#define ext4_ext_check(inode, eh, depth) \
> - __ext4_ext_check(__func__, __LINE__, inode, eh, depth)
> +#define ext4_ext_check(inode, eh, depth, pblk) \
> + __ext4_ext_check(__func__, __LINE__, (inode), (eh), (depth), (pblk))
>
> int ext4_ext_check_inode(struct inode *inode)
> {
> - return ext4_ext_check(inode, ext_inode_hdr(inode), ext_depth(inode));
> + return ext4_ext_check(inode, ext_inode_hdr(inode), ext_depth(inode), 0);
> }
>
> static struct buffer_head *
> @@ -484,7 +484,7 @@ __read_extent_tree_block(const char *function, unsigned int line,
> if (buffer_verified(bh))
> return bh;
> err = __ext4_ext_check(function, line, inode,
> - ext_block_hdr(bh), depth);
> + ext_block_hdr(bh), depth, pblk);
> if (err)
> goto errout;
> set_buffer_verified(bh);
> @@ -2775,7 +2775,7 @@ again:
> path[0].p_hdr = ext_inode_hdr(inode);
> i = 0;
>
> - if (ext4_ext_check(inode, path[0].p_hdr, depth)) {
> + if (ext4_ext_check(inode, path[0].p_hdr, depth, 0)) {
> err = -EIO;
> goto out;
> }
> --
> 1.7.12.rc0.22.gcdd159b
>
On Tue, Jul 16, 2013 at 11:18:03AM -0400, Theodore Ts'o wrote:
> Add a new ioctl which forces the all of the extents in an inode to be
> cached in the extent_status tree. This is critically important when
> using AIO to a preallocated file, since if we need to read in blocks
> from the extent tree, the io_submit(2) system call becomes
> synchronous, and the AIO is no longer "A", which is bad.
>
> In addition, for most files which have an external leaf tree block,
> the cost of caching the information in the extent status tree will be
> less than caching the entire 4k block in the buffer cache. So it is
> generally a win to keep the extent information cached.
> ---
[...]
> @@ -1069,10 +1072,16 @@ static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
> struct rb_node *node;
> struct extent_status *es;
> int nr_shrunk = 0;
> + static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
> + DEFAULT_RATELIMIT_BURST);
>
> if (ei->i_es_lru_nr == 0)
> return 0;
>
> + if (ext4_test_inode_state(inode, EXT4_STATE_EXT_PRECACHED) &&
> + __ratelimit(&_rs))
> + ext4_warning(inode->i_sb, "forced shrink of precached extents");
If I understand correctly, we don't want to reclaim from an inode with
EXT4_STATE_EXT_PRECACHED flag when __ratelimit() returns 0, right? So
I propose the following code:
if (ext4_test_inode_state(inode, EXT4_STATE_EXT_PRECACHED) {
if (!__ratelimit(&_rs))
return 0;
ext4_warning(inode->i_sb, "forced shrink of precached extents");
}
- Zheng
> +
> node = rb_first(&tree->root);
> while (node != NULL) {
> es = rb_entry(node, struct extent_status, rb_node);
On Thu, Jul 18, 2013 at 09:19:41AM +0800, Zheng Liu wrote:
>
> If I understand correctly, we don't want to reclaim from an inode with
> EXT4_STATE_EXT_PRECACHED flag when __ratelimit() returns 0, right?
No, the intent of the code was to make sure we don't trigger the
warning too often, in case the system is under massive memory
pressure. In the original implementation of this ioctl which we used
at Google (with an extent cache that was much less functional than the
extent status tree we now have upstream), the extents were pinned in
memory permanently, until the inode is evicted from memory.
I thought about doing this, since normally the cached extents will
take less memory than the extent tree in the buffer cache (especially
in any sane setup where the large tablespace, etc., files are are
fallocated in advance and are largely contiguous). But for upstream,
I was concerned that someone might deliberately create lots of
fragmented files, and then call the precache ioctl on all of them.
So what I did was to change the sort function such that the shrinker
would put those files at the end of the list. And although it's not
in the patch that I've sent out, I've since changed it so that if the
head of the list is an precached inode, and it's been more than 5
seconds, we force a resort of the list.
That way if we are under heavy memory pressure, we will eventually get
rid of the precached extents --- but under normal circumstnaces, we
try very hard not to, at least via the es_shrinker. (If the inode
gets closed, and then eventually the inode gets evicted, then of
course we'll drop all of the precached extents.)
So the ratelimited warning is so we can know if this has happened,
since it's probably a sign that something bad has happened. Either a
process ran wild trying to precache too many extents, or the system
was under far more memory pressure, which is probably something that
needs to be fixed by changing some configuration parameter or by
tweaking the load balancer.
- Ted
Hi Ted,
Thanks for your explanation. I can always learn something from your
reply. :-)
On Wed, Jul 17, 2013 at 10:50:25PM -0400, Theodore Ts'o wrote:
> On Thu, Jul 18, 2013 at 09:19:41AM +0800, Zheng Liu wrote:
> >
> > If I understand correctly, we don't want to reclaim from an inode with
> > EXT4_STATE_EXT_PRECACHED flag when __ratelimit() returns 0, right?
>
> No, the intent of the code was to make sure we don't trigger the
> warning too often, in case the system is under massive memory
> pressure. In the original implementation of this ioctl which we used
> at Google (with an extent cache that was much less functional than the
> extent status tree we now have upstream), the extents were pinned in
> memory permanently, until the inode is evicted from memory.
>
> I thought about doing this, since normally the cached extents will
> take less memory than the extent tree in the buffer cache (especially
> in any sane setup where the large tablespace, etc., files are are
> fallocated in advance and are largely contiguous). But for upstream,
> I was concerned that someone might deliberately create lots of
> fragmented files, and then call the precache ioctl on all of them.
Yes, at least for a internet company we can control everything, but for
upstream the kernel might run under some weird environments. The lesson
from this is that I need to think deeply for non-internet applications,
and make a better design. Now I fully agree with you about this
implementation. Meanwhile the patch looks good to me.
Reviewed-by: Zheng Liu <[email protected]>
- Zheng
>
> So what I did was to change the sort function such that the shrinker
> would put those files at the end of the list. And although it's not
> in the patch that I've sent out, I've since changed it so that if the
> head of the list is an precached inode, and it's been more than 5
> seconds, we force a resort of the list.
>
> That way if we are under heavy memory pressure, we will eventually get
> rid of the precached extents --- but under normal circumstnaces, we
> try very hard not to, at least via the es_shrinker. (If the inode
> gets closed, and then eventually the inode gets evicted, then of
> course we'll drop all of the precached extents.)
>
> So the ratelimited warning is so we can know if this has happened,
> since it's probably a sign that something bad has happened. Either a
> process ran wild trying to precache too many extents, or the system
> was under far more memory pressure, which is probably something that
> needs to be fixed by changing some configuration parameter or by
> tweaking the load balancer.
>
> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Jul 18, 2013 at 09:06:11PM +0800, Zheng Liu wrote:
>
> Yes, at least for a internet company we can control everything, but for
> upstream the kernel might run under some weird environments. The lesson
> from this is that I need to think deeply for non-internet applications,
> and make a better design.
That's the main reaosn why there are patches that are in the Google
kernel for ext4 that haven't gone upstream yet, in fact. It's not
that we didn't want them upstream (in fact it's more work when have to
constantly rebase patches upstream). But often there are short cuts
we can take given a controlled environment, and getting something
which is suitable for mainline takes a lot more work.
One example of this is our patches to disable stable page writebacks,
which was a huge performance improvement but which didn't work for
iSCSI and drives with DIF/DIX support (fortunately we didn't care
about things like DIF/DIX in our environment). Now that Darrick has
implemented patches which only enable stable page writeback when it's
needed, there is a much nicer solution upstream.
Another example of this is we have a direct I/O fastpath patches that
significantly reduce DIO's overhead on extremely high speed devices.
Unfortunately, it's too ugly to live (and not coincidentally,
incredibly painful to rebase). Kent Overstreet was working on some
DIO improvements that will hopefully be clean enough that they can go
upstream at some point, at which point I will be so glad when we can
drop our ugly local hacks....
- Ted
On 7/16/13 10:17 AM, Theodore Ts'o wrote:
> In addition to fixing a few bugs and addressing review comments, we now
> add a new ioctl, EXT4_IOC_PRECACHE_EXTENTS, which forces all of the
> extents in an inode to be cached in the extents status tree, and marks
> them to be preferentially protected when under memory pressure.
>
> This is critically important when using AIO to a preallocated file,
> since if we need to read in blocks from the extent tree, the
> io_submit(2) system call becomes synchronous, which is rather rude to
> applications which were expecting the AIO to be "A".
>
> As a bonus, using the extent status tree to store the logical to
> physical block mapping is usually more compact that having to keep one
> or more extent tree blocks in the buffer cache.
>
> (Should we do this all the time, instead of when the application
> explicitly requests it? Maybe; there could be cases with very large,
> fragmented files accessed by an application such as "file" is only needs
> to look at a small subset of the file where this could result in an
> unnecessary work and memory allocated. OTOH, 95%+ of the time this
> would probably be a win...)
I'd say yes, we should - maybe not in all cases but if you need it for
AIO, try to make it "all the time" at least for that AIO?
We keep telling application writers not to assume certain things about
various filesystems, or to write applications that treat ext4 differently
han ext3 differently than xfs etc...
This goes the other way.
In the end who (besides google?) is really going to call this IOCTL?
I wondered if only doing this when files are opened O_DIRECT might make
sense, but Jeff Moyer pointed out that giant databases probably don't
want to read in their entire block mapping tree - OTOH, they probably use
preallocation if they're smart, and maybe it's not that bad.
Or what about tying this into POSIX_FADV_WILLNEED? Hohum, that gets
into force_page_cache_readahead(). We need POSIX_FADV_WILLNEED_META...
-Eric
>
> Theodore Ts'o (5):
> ext4: refactor code to read the extent tree block
> ext4: print the block number of invalid extent tree blocks
> ext4: use unsigned int for es_status values
> ext4: cache all of an extent tree's leaf block upon reading
> ext4: add new ioctl EXT4_IOC_PRECACHE_EXTENTS
>
> fs/ext4/ext4.h | 19 +++-
> fs/ext4/extents.c | 259 +++++++++++++++++++++++++++++---------------
> fs/ext4/extents_status.c | 52 ++++++++-
> fs/ext4/extents_status.h | 50 +++++----
> fs/ext4/inode.c | 6 +-
> fs/ext4/ioctl.c | 3 +
> fs/ext4/migrate.c | 2 +-
> fs/ext4/move_extent.c | 2 +-
> include/trace/events/ext4.h | 28 +++--
> 9 files changed, 296 insertions(+), 125 deletions(-)
>
On Thu, Jul 18, 2013 at 01:35:24PM -0500, Eric Sandeen wrote:
> > (Should we do this all the time, instead of when the application
> > explicitly requests it? Maybe; there could be cases with very large,
> > fragmented files accessed by an application such as "file" is only needs
> > to look at a small subset of the file where this could result in an
> > unnecessary work and memory allocated. OTOH, 95%+ of the time this
> > would probably be a win...)
>
> I'd say yes, we should - maybe not in all cases but if you need it for
> AIO, try to make it "all the time" at least for that AIO?
The problem is we don't know that we're doing AIO until we see the
first io_submit(2) call. With this patch series, we'll pull the
contents of the entire leaf tree block into extent cache, but if the
extent tree is larger than that, if we read in the entire extent tree
on the first AIO request, then that first request will delayed even
more, and it's not clear that's a good thing.
> We keep telling application writers not to assume certain things about
> various filesystems, or to write applications that treat ext4 differently
> han ext3 differently than xfs etc...
>
> This goes the other way.
That's true, but I couldn't figure out a way where we could make the
file system do it automatically all the time.
> Or what about tying this into POSIX_FADV_WILLNEED? Hohum, that gets
> into force_page_cache_readahead(). We need POSIX_FADV_WILLNEED_META...
Maybe have fadvise(fd, POSIX_FADV_RANDOM), on the theory that a
program which cares enough to call the fadvise would probably want the
extent tree? That's not really an exact match for the requisite
semantics, either, though.
In the long run, I suspect if this proves to be useful, adding a new
fadvise flag is what would make sense, I think. Maybe
POSIX_FADV_WILLNEED_META.
I'd suggest using an ioctl for now, and if application writers find
this functionality useful, we could then add a more generic VFS
interface. After all, initially punch was implemented only as an
XFS-specific ioctl, and after it was proven to be more generally
useful, we added a generic VFS interface only much later.
- Ted
Hi Eric,
On Thu, Jul 18, 2013 at 01:35:24PM -0500, Eric Sandeen wrote:
> On 7/16/13 10:17 AM, Theodore Ts'o wrote:
> > In addition to fixing a few bugs and addressing review comments, we now
> > add a new ioctl, EXT4_IOC_PRECACHE_EXTENTS, which forces all of the
> > extents in an inode to be cached in the extents status tree, and marks
> > them to be preferentially protected when under memory pressure.
> >
> > This is critically important when using AIO to a preallocated file,
> > since if we need to read in blocks from the extent tree, the
> > io_submit(2) system call becomes synchronous, which is rather rude to
> > applications which were expecting the AIO to be "A".
> >
> > As a bonus, using the extent status tree to store the logical to
> > physical block mapping is usually more compact that having to keep one
> > or more extent tree blocks in the buffer cache.
> >
> > (Should we do this all the time, instead of when the application
> > explicitly requests it? Maybe; there could be cases with very large,
> > fragmented files accessed by an application such as "file" is only needs
> > to look at a small subset of the file where this could result in an
> > unnecessary work and memory allocated. OTOH, 95%+ of the time this
> > would probably be a win...)
>
> I'd say yes, we should - maybe not in all cases but if you need it for
> AIO, try to make it "all the time" at least for that AIO?
>
> We keep telling application writers not to assume certain things about
> various filesystems, or to write applications that treat ext4 differently
> han ext3 differently than xfs etc...
Yes, I agree with you. As Ted and I have discussed the problem of
setting 'data=writeback' by default in ext4. Although most application
writers have realized that they need to explicit call fsync to flush all
dirty pages, there are still some legacy applications that depends on
the 'data=ordered' mode to flush all dirty pages.
>
> This goes the other way.
>
> In the end who (besides google?) is really going to call this IOCTL?
>
> I wondered if only doing this when files are opened O_DIRECT might make
> sense, but Jeff Moyer pointed out that giant databases probably don't
> want to read in their entire block mapping tree - OTOH, they probably use
> preallocation if they're smart, and maybe it's not that bad.
I have talked with my colleague who is a MySQL contributor about whether
MySQL tries to preallocate some files or not. As far as I know, at
least MySQL doesn't try to do it until now. I don't have the source
code of Oracle or DB2, these giant databases might use preallocation I
guess.
>
> Or what about tying this into POSIX_FADV_WILLNEED? Hohum, that gets
> into force_page_cache_readahead(). We need POSIX_FADV_WILLNEED_META...
Yes, _WILLNEED_METADATA flag makes sense to me if other file systems
also want to support it. But, as Ted said, now adding it in ioctl might
a good choice because we won't impact other file systems.
- Zheng
On Fri, Jul 19, 2013 at 07:54:51AM +0800, Zheng Liu wrote:
>
> I have talked with my colleague who is a MySQL contributor about whether
> MySQL tries to preallocate some files or not. As far as I know, at
> least MySQL doesn't try to do it until now. I don't have the source
> code of Oracle or DB2, these giant databases might use preallocation I
> guess.
Oracle and DB2 don't use preallocate, because they don't want the
metadata update overhead. So for software packages that are really
critically worried about 99percentile latency, they will generally
either pre-zero the file ahead of time, so all of the extents are
written. Or, they will use the out-of-tree nohidestale patch, and
mark all of the extents as written. (If you are doing A/B benchmark
comparisons, using nohidestale means the setup overhead for each
benchmark run can be measured in minutes instead of hours...)
On at least one of the enterprise databases which I'm familiar with,
they don't pre-zero the entire database file, but they'll do it in
chunks of N megabytes. That means they don't have the huge time lag
when the database is initially created, but then every so often, when
the database will suddenly use most of the disk bandwidth to zero the
next chunk of 16 or 32 or 64 megabytes. (This tends to do a real
number on your 99.9 percentile latency numbers, if you care about such
things....)
- Ted
On Thu, Jul 18, 2013 at 08:07:38PM -0400, Theodore Ts'o wrote:
> On Fri, Jul 19, 2013 at 07:54:51AM +0800, Zheng Liu wrote:
> >
> > I have talked with my colleague who is a MySQL contributor about whether
> > MySQL tries to preallocate some files or not. As far as I know, at
> > least MySQL doesn't try to do it until now. I don't have the source
> > code of Oracle or DB2, these giant databases might use preallocation I
> > guess.
>
> Oracle and DB2 don't use preallocate, because they don't want the
> metadata update overhead. So for software packages that are really
> critically worried about 99percentile latency, they will generally
> either pre-zero the file ahead of time, so all of the extents are
> written. Or, they will use the out-of-tree nohidestale patch, and
> mark all of the extents as written. (If you are doing A/B benchmark
> comparisons, using nohidestale means the setup overhead for each
> benchmark run can be measured in minutes instead of hours...)
>
> On at least one of the enterprise databases which I'm familiar with,
> they don't pre-zero the entire database file, but they'll do it in
> chunks of N megabytes. That means they don't have the huge time lag
> when the database is initially created, but then every so often, when
> the database will suddenly use most of the disk bandwidth to zero the
> next chunk of 16 or 32 or 64 megabytes. (This tends to do a real
> number on your 99.9 percentile latency numbers, if you care about such
> things....)
Thanks for correcting me. :-). Yes, MySQL does like this. But the
difference between them is that MySQL doesn't try to zero any chunks
directly. It just writes out the dirty pages (Yes, in MySQL it has its
own buffer pool and manages it by itself, and it is also called page.),
such as 16 or 32 megabytes, if I understand correctly. So, in general,
it always wins if we keep the metadata of ext4 file system in memory, at
least for database application.
- Zheng
On 7/18/13 1:53 PM, Theodore Ts'o wrote:
> On Thu, Jul 18, 2013 at 01:35:24PM -0500, Eric Sandeen wrote:
>>> (Should we do this all the time, instead of when the application
>>> explicitly requests it? Maybe; there could be cases with very large,
>>> fragmented files accessed by an application such as "file" is only needs
>>> to look at a small subset of the file where this could result in an
>>> unnecessary work and memory allocated. OTOH, 95%+ of the time this
>>> would probably be a win...)
>>
>> I'd say yes, we should - maybe not in all cases but if you need it for
>> AIO, try to make it "all the time" at least for that AIO?
>
> The problem is we don't know that we're doing AIO until we see the
> first io_submit(2) call. With this patch series, we'll pull the
> contents of the entire leaf tree block into extent cache, but if the
> extent tree is larger than that, if we read in the entire extent tree
> on the first AIO request, then that first request will delayed even
> more, and it's not clear that's a good thing.
Is blocking on a pre-AIO ioctl better than blocking on the
first AIO?
-Eric
On Thu, Jul 18, 2013 at 07:56:45PM -0500, Eric Sandeen wrote:
> > The problem is we don't know that we're doing AIO until we see the
> > first io_submit(2) call. With this patch series, we'll pull the
> > contents of the entire leaf tree block into extent cache, but if the
> > extent tree is larger than that, if we read in the entire extent tree
> > on the first AIO request, then that first request will delayed even
> > more, and it's not clear that's a good thing.
>
> Is blocking on a pre-AIO ioctl better than blocking on the
> first AIO?
The precache ioctl is something which the application is expecting to
block. The question is, if we have a heavily fragmented extent tree,
is it better for the first AIO to block long enough to read in one
metadata block --- and then never block again, or to have that first
AIO request take a long, LONG time? Especially if the application
isn't expecting it?
Also there are use cases for the precache ioctl even if you are not
using AIO. If you've taken care to make sure the file is as
contiguous as possible, having the extents be cached will save a lot
of memory compared to if the buffer heads are always entering the
buffer cache. So reading in all of the metadata can be a good thing
to do, especially if you can do this *before* you declare that the
server is healthy and is ready to start receiving traffic.
This becomes especially critical if the server is running in a very
tight memory container, because you are trying to pack as many jobs
(or VM's, if that's the way you roll) as possible on a physical
server.
Cheers,
- Ted
On Thu, Jul 18, 2013 at 10:59:34PM -0400, Theodore Ts'o wrote:
> On Thu, Jul 18, 2013 at 07:56:45PM -0500, Eric Sandeen wrote:
> > > The problem is we don't know that we're doing AIO until we see the
> > > first io_submit(2) call. With this patch series, we'll pull the
> > > contents of the entire leaf tree block into extent cache, but if the
> > > extent tree is larger than that, if we read in the entire extent tree
> > > on the first AIO request, then that first request will delayed even
> > > more, and it's not clear that's a good thing.
> >
> > Is blocking on a pre-AIO ioctl better than blocking on the
> > first AIO?
>
> The precache ioctl is something which the application is expecting to
> block. The question is, if we have a heavily fragmented extent tree,
> is it better for the first AIO to block long enough to read in one
> metadata block --- and then never block again, or to have that first
> AIO request take a long, LONG time? Especially if the application
> isn't expecting it?
>
> Also there are use cases for the precache ioctl even if you are not
> using AIO. If you've taken care to make sure the file is as
> contiguous as possible, having the extents be cached will save a lot
> of memory compared to if the buffer heads are always entering the
> buffer cache. So reading in all of the metadata can be a good thing
> to do, especially if you can do this *before* you declare that the
> server is healthy and is ready to start receiving traffic.
An ioctl is kinda silly for this. Just use O_NONBLOCK when calling
open() and do the prefetch right in the open call. The open() can
block, anyway, and what you are trying to do is non-blocking IO with
AIO, so it seems like we've already got a sensible, generic
interface for triggering this sort of prefetch operation.
Cheers,
Dave.
--
Dave Chinner
[email protected]
Dave Chinner <[email protected]> writes:
> On Thu, Jul 18, 2013 at 10:59:34PM -0400, Theodore Ts'o wrote:
>> On Thu, Jul 18, 2013 at 07:56:45PM -0500, Eric Sandeen wrote:
>> > > The problem is we don't know that we're doing AIO until we see the
>> > > first io_submit(2) call. With this patch series, we'll pull the
>> > > contents of the entire leaf tree block into extent cache, but if the
>> > > extent tree is larger than that, if we read in the entire extent tree
>> > > on the first AIO request, then that first request will delayed even
>> > > more, and it's not clear that's a good thing.
>> >
>> > Is blocking on a pre-AIO ioctl better than blocking on the
>> > first AIO?
>>
>> The precache ioctl is something which the application is expecting to
>> block. The question is, if we have a heavily fragmented extent tree,
>> is it better for the first AIO to block long enough to read in one
>> metadata block --- and then never block again, or to have that first
>> AIO request take a long, LONG time? Especially if the application
>> isn't expecting it?
>>
>> Also there are use cases for the precache ioctl even if you are not
>> using AIO. If you've taken care to make sure the file is as
>> contiguous as possible, having the extents be cached will save a lot
>> of memory compared to if the buffer heads are always entering the
>> buffer cache. So reading in all of the metadata can be a good thing
>> to do, especially if you can do this *before* you declare that the
>> server is healthy and is ready to start receiving traffic.
>
> An ioctl is kinda silly for this. Just use O_NONBLOCK when calling
> open() and do the prefetch right in the open call. The open() can
> block, anyway, and what you are trying to do is non-blocking IO with
> AIO, so it seems like we've already got a sensible, generic
> interface for triggering this sort of prefetch operation.
Hmm, O_NONBLOCK on regular files, eh? That brings back memories:
http://www.gossamer-threads.com/lists/linux/kernel/481855
I don't recall exactly how that ended, but I'm pretty sure the
conclusion was that it was a bad idea.
-Jeff
On Fri, Jul 19, 2013 at 01:33:09PM +1000, Dave Chinner wrote:
> An ioctl is kinda silly for this. Just use O_NONBLOCK when calling
> open() and do the prefetch right in the open call. The open() can
> block, anyway, and what you are trying to do is non-blocking IO with
> AIO, so it seems like we've already got a sensible, generic
> interface for triggering this sort of prefetch operation.
O_NONBLOCK (either set via open or fcntl) is a possibility, since it's
carefully defined to be unspecified for regular files by SUSv3. It is
quite different from the existing semantics for O_NONBLOCK, though.
Currently, for all file types where O_NONBLOCK is not ignored, open(2)
is guaranteed itself not to block. If we use O_NONBLOCK for regular
files to mean that any necessary metadata blocks required for AIO to
be "A" will be cached, then it will make open(2) much more likely to
block. Also, for all file types where O_NONBLOCK is not ignored,
read(2) will not block but instead return -1 and set errno to EAGAIN.
This would also be a change.
If we tried to get this new semantics for O_NONBLOCK to be accepted by
the Austin Group for standardization in the future, would they accept
it, or would they say, "this makes me vommit"? I have a suspicion
there reaction might be closer to the latter....
If we want a VFS-level API, in my opinion an fadvise() flag would be a
better choice.
- Ted
On Fri, Jul 19, 2013 at 12:19:30PM -0400, Theodore Ts'o wrote:
> On Fri, Jul 19, 2013 at 01:33:09PM +1000, Dave Chinner wrote:
> > An ioctl is kinda silly for this. Just use O_NONBLOCK when calling
> > open() and do the prefetch right in the open call. The open() can
> > block, anyway, and what you are trying to do is non-blocking IO with
> > AIO, so it seems like we've already got a sensible, generic
> > interface for triggering this sort of prefetch operation.
>
> O_NONBLOCK (either set via open or fcntl) is a possibility, since it's
> carefully defined to be unspecified for regular files by SUSv3. It is
> quite different from the existing semantics for O_NONBLOCK, though.
> Currently, for all file types where O_NONBLOCK is not ignored, open(2)
> is guaranteed itself not to block. If we use O_NONBLOCK for regular
> files to mean that any necessary metadata blocks required for AIO to
> be "A" will be cached, then it will make open(2) much more likely to
> block. Also, for all file types where O_NONBLOCK is not ignored,
> read(2) will not block but instead return -1 and set errno to EAGAIN.
> This would also be a change.
>
> If we tried to get this new semantics for O_NONBLOCK to be accepted by
> the Austin Group for standardization in the future, would they accept
> it, or would they say, "this makes me vommit"? I have a suspicion
> there reaction might be closer to the latter....
>
> If we want a VFS-level API, in my opinion an fadvise() flag would be a
> better choice.
Sure. Make it an fadvise() flag - just don't add ioctls for things
that are generically useful.
On second thoughts - you're trying to get the extent map read in. We
already have an interface for querying extent maps - fiemap.
FIEMAP_FLAG_PREFETCH along with the range of the file you want the
extent map prefetched for?
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Jul 22, 2013 at 11:38:31AM +1000, Dave Chinner wrote:
> On Fri, Jul 19, 2013 at 12:19:30PM -0400, Theodore Ts'o wrote:
> > On Fri, Jul 19, 2013 at 01:33:09PM +1000, Dave Chinner wrote:
> > > An ioctl is kinda silly for this. Just use O_NONBLOCK when calling
> > > open() and do the prefetch right in the open call. The open() can
> > > block, anyway, and what you are trying to do is non-blocking IO with
> > > AIO, so it seems like we've already got a sensible, generic
> > > interface for triggering this sort of prefetch operation.
> >
> > O_NONBLOCK (either set via open or fcntl) is a possibility, since it's
> > carefully defined to be unspecified for regular files by SUSv3. It is
> > quite different from the existing semantics for O_NONBLOCK, though.
> > Currently, for all file types where O_NONBLOCK is not ignored, open(2)
> > is guaranteed itself not to block. If we use O_NONBLOCK for regular
> > files to mean that any necessary metadata blocks required for AIO to
> > be "A" will be cached, then it will make open(2) much more likely to
> > block. Also, for all file types where O_NONBLOCK is not ignored,
> > read(2) will not block but instead return -1 and set errno to EAGAIN.
> > This would also be a change.
> >
> > If we tried to get this new semantics for O_NONBLOCK to be accepted by
> > the Austin Group for standardization in the future, would they accept
> > it, or would they say, "this makes me vommit"? I have a suspicion
> > there reaction might be closer to the latter....
> >
> > If we want a VFS-level API, in my opinion an fadvise() flag would be a
> > better choice.
>
> Sure. Make it an fadvise() flag - just don't add ioctls for things
> that are generically useful.
>
> On second thoughts - you're trying to get the extent map read in. We
> already have an interface for querying extent maps - fiemap.
> FIEMAP_FLAG_PREFETCH along with the range of the file you want the
> extent map prefetched for?
I don't think fiemap is a good interface. The application uses
fiemap(2) to retrieve extent mapping. That means that the app could use
these mappings in userspace. But now we want to cache these mappings in
kernel space. So it seems that an fadvise(2) flag is better.
Regards,
- Zheng
On Mon, Jul 22, 2013 at 10:17:42AM +0800, Zheng Liu wrote:
> On Mon, Jul 22, 2013 at 11:38:31AM +1000, Dave Chinner wrote:
> > On Fri, Jul 19, 2013 at 12:19:30PM -0400, Theodore Ts'o wrote:
> > > On Fri, Jul 19, 2013 at 01:33:09PM +1000, Dave Chinner wrote:
> > > > An ioctl is kinda silly for this. Just use O_NONBLOCK when calling
> > > > open() and do the prefetch right in the open call. The open() can
> > > > block, anyway, and what you are trying to do is non-blocking IO with
> > > > AIO, so it seems like we've already got a sensible, generic
> > > > interface for triggering this sort of prefetch operation.
> > >
> > > O_NONBLOCK (either set via open or fcntl) is a possibility, since it's
> > > carefully defined to be unspecified for regular files by SUSv3. It is
> > > quite different from the existing semantics for O_NONBLOCK, though.
> > > Currently, for all file types where O_NONBLOCK is not ignored, open(2)
> > > is guaranteed itself not to block. If we use O_NONBLOCK for regular
> > > files to mean that any necessary metadata blocks required for AIO to
> > > be "A" will be cached, then it will make open(2) much more likely to
> > > block. Also, for all file types where O_NONBLOCK is not ignored,
> > > read(2) will not block but instead return -1 and set errno to EAGAIN.
> > > This would also be a change.
> > >
> > > If we tried to get this new semantics for O_NONBLOCK to be accepted by
> > > the Austin Group for standardization in the future, would they accept
> > > it, or would they say, "this makes me vommit"? I have a suspicion
> > > there reaction might be closer to the latter....
> > >
> > > If we want a VFS-level API, in my opinion an fadvise() flag would be a
> > > better choice.
> >
> > Sure. Make it an fadvise() flag - just don't add ioctls for things
> > that are generically useful.
> >
> > On second thoughts - you're trying to get the extent map read in. We
> > already have an interface for querying extent maps - fiemap.
> > FIEMAP_FLAG_PREFETCH along with the range of the file you want the
> > extent map prefetched for?
>
> I don't think fiemap is a good interface. The application uses
> fiemap(2) to retrieve extent mapping.
fiemap is used to query information about extent maps. What it
returns is entirely dependent on the input parameters that are
passed to it. Indeed, from Documentation/filesystems/fiemap.txt:
"If fm_extent_count is zero, then the fm_extents[] array is ignored
(no extents will be returned), and the fm_mapped_extents count will
hold the number of extents needed in fm_extents[] to hold the file's
current mapping."
Think about that for a minute. What does the filesystem do with such
an fiemap query when the extent map is not cached? That's right,
*fiemap reads the extent map from disk into the cache* and then
returns the number of extents in the range.
All I have suggested is adding a flag to make this an *explicit
operation* rather than a side effect of a "count extents" query. I
fail to see any justification for a whole new interface when we
already have a perfectly functional one that already provides the
functionality that is required...
> That means that the app could use
> these mappings in userspace. But now we want to cache these mappings in
> kernel space.
If the filesystem is not caching the extents read in during fiemap
operations then perhaps you should look into fixing that deficiency.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Jul 22, 2013 at 08:02:55PM +1000, Dave Chinner wrote:
> On Mon, Jul 22, 2013 at 10:17:42AM +0800, Zheng Liu wrote:
> > On Mon, Jul 22, 2013 at 11:38:31AM +1000, Dave Chinner wrote:
> > > On Fri, Jul 19, 2013 at 12:19:30PM -0400, Theodore Ts'o wrote:
> > > > On Fri, Jul 19, 2013 at 01:33:09PM +1000, Dave Chinner wrote:
> > > > > An ioctl is kinda silly for this. Just use O_NONBLOCK when calling
> > > > > open() and do the prefetch right in the open call. The open() can
> > > > > block, anyway, and what you are trying to do is non-blocking IO with
> > > > > AIO, so it seems like we've already got a sensible, generic
> > > > > interface for triggering this sort of prefetch operation.
> > > >
> > > > O_NONBLOCK (either set via open or fcntl) is a possibility, since it's
> > > > carefully defined to be unspecified for regular files by SUSv3. It is
> > > > quite different from the existing semantics for O_NONBLOCK, though.
> > > > Currently, for all file types where O_NONBLOCK is not ignored, open(2)
> > > > is guaranteed itself not to block. If we use O_NONBLOCK for regular
> > > > files to mean that any necessary metadata blocks required for AIO to
> > > > be "A" will be cached, then it will make open(2) much more likely to
> > > > block. Also, for all file types where O_NONBLOCK is not ignored,
> > > > read(2) will not block but instead return -1 and set errno to EAGAIN.
> > > > This would also be a change.
> > > >
> > > > If we tried to get this new semantics for O_NONBLOCK to be accepted by
> > > > the Austin Group for standardization in the future, would they accept
> > > > it, or would they say, "this makes me vommit"? I have a suspicion
> > > > there reaction might be closer to the latter....
> > > >
> > > > If we want a VFS-level API, in my opinion an fadvise() flag would be a
> > > > better choice.
> > >
> > > Sure. Make it an fadvise() flag - just don't add ioctls for things
> > > that are generically useful.
> > >
> > > On second thoughts - you're trying to get the extent map read in. We
> > > already have an interface for querying extent maps - fiemap.
> > > FIEMAP_FLAG_PREFETCH along with the range of the file you want the
> > > extent map prefetched for?
> >
> > I don't think fiemap is a good interface. The application uses
> > fiemap(2) to retrieve extent mapping.
>
> fiemap is used to query information about extent maps. What it
> returns is entirely dependent on the input parameters that are
> passed to it. Indeed, from Documentation/filesystems/fiemap.txt:
>
> "If fm_extent_count is zero, then the fm_extents[] array is ignored
> (no extents will be returned), and the fm_mapped_extents count will
> hold the number of extents needed in fm_extents[] to hold the file's
> current mapping."
>
> Think about that for a minute. What does the filesystem do with such
> an fiemap query when the extent map is not cached? That's right,
> *fiemap reads the extent map from disk into the cache* and then
> returns the number of extents in the range.
>
> All I have suggested is adding a flag to make this an *explicit
> operation* rather than a side effect of a "count extents" query. I
> fail to see any justification for a whole new interface when we
> already have a perfectly functional one that already provides the
> functionality that is required...
Yes, I understand your point of view. We can use fiemap to do that.
All I concern is about semantics. When someone mention about fiemap,
first I remember is that I can use it to retrieve the extent mappings.
But for fadvise, it looks like more naturally. When I look at it, I
always think that I can use it to provide a hint to the kernel, and then
the kernel will do the rest of things for me. So that is why I prefer
to use a fadvise flag rather than use fiemap.
Regards,
- Zheng
On Mon, Jul 22, 2013 at 08:57:45PM +0800, Zheng Liu wrote:
> On Mon, Jul 22, 2013 at 08:02:55PM +1000, Dave Chinner wrote:
> > On Mon, Jul 22, 2013 at 10:17:42AM +0800, Zheng Liu wrote:
> > > I don't think fiemap is a good interface. The application uses
> > > fiemap(2) to retrieve extent mapping.
> >
> > fiemap is used to query information about extent maps. What it
> > returns is entirely dependent on the input parameters that are
> > passed to it. Indeed, from Documentation/filesystems/fiemap.txt:
> >
> > "If fm_extent_count is zero, then the fm_extents[] array is ignored
> > (no extents will be returned), and the fm_mapped_extents count will
> > hold the number of extents needed in fm_extents[] to hold the file's
> > current mapping."
> >
> > Think about that for a minute. What does the filesystem do with such
> > an fiemap query when the extent map is not cached? That's right,
> > *fiemap reads the extent map from disk into the cache* and then
> > returns the number of extents in the range.
> >
> > All I have suggested is adding a flag to make this an *explicit
> > operation* rather than a side effect of a "count extents" query. I
> > fail to see any justification for a whole new interface when we
> > already have a perfectly functional one that already provides the
> > functionality that is required...
>
> Yes, I understand your point of view. We can use fiemap to do that.
> All I concern is about semantics. When someone mention about fiemap,
> first I remember is that I can use it to retrieve the extent mappings.
But that's exactly what Ted wants to do - retreive extent maps from
disk. From Documentation/filesystems/fiemap.txt:
"The fiemap ioctl is an efficient method for userspace to get file
extent mappings. .....
Certain flags to modify the way in which mappings are looked up can
be set in fm_flags. ..... This scheme is intended to allow the
fiemap interface to grow in the future but without losing
compatibility with old software. ..... "
The functionality being discussed here is userspace being able to
retreive extents from disk into memory. The initial description of
FIEMAP covers this. We can change the way fiemap behaves by input
flags - that's clearly stated - and it is intended to be extended in
this manner for future functionality. That's what I suggested to
make it explicit, but it's not actually necessary to actually read
the extent map into memory with FIEMAP.
Keep in mind that this "new extent map functionality" is *exactly*
why we designed the FIEMAP interface to be extensible.
> But for fadvise, it looks like more naturally.
fadvise() is for giving data access behaviour hints. It is not for
controlling how filesystems access their internal metadata.
> When I look at it, I
> always think that I can use it to provide a hint to the kernel, and then
> the kernel will do the rest of things for me. So that is why I prefer
> to use a fadvise flag rather than use fiemap.
But Ted's case is not a "hint" - it's a direct command to fetch the
extent map from disk. You can do that already with FIEMAP, so no new
code or interfaces are needed. fadvise() is not the proper interface
for manipulating filesystem metadata behaviour, and fiemap can
already do what you need. There is no need for any new interfaces
here.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Jul 30, 2013 at 01:08:07PM +1000, Dave Chinner wrote:
> But Ted's case is not a "hint" - it's a direct command to fetch the
> extent map from disk. You can do that already with FIEMAP, so no new
> code or interfaces are needed. fadvise() is not the proper interface
> for manipulating filesystem metadata behaviour, and fiemap can
> already do what you need. There is no need for any new interfaces
> here.
I've been looking at the definition of fiemap, and I'm not convinced.
To quote from the fiemap.txt:
The fiemap ioctl is an efficient method for userspace to get file
extent mappings.
That's not what is going on here. We are pre-caching them into kernel
memory, not in user-space. In addition, we're also setting a flag to
keep these extents preferentially in memory compared to other entries
in the extent cache.
I agree that posix_fadvise() isn't really a good match, either:
"posix_fadvise - predeclare an access pattern for file data"
How about this? FIEMAP is an ioctl, anyway. How about if we just
declare this as a new fs-independent ioctl, much like FS_IOC_FIEMAP?
#define FS_IOC_PRECACHE_EXTENTS _IO('f', 18)
This is, of course, assuming that other file systems are interested in
implementing this functionality. If not, we can just keep it as
EXT4_IOC_PRECACHE_EXTENTS, and just call it a day. (We can always add
a definition of FS_IOC_PRECACHE_EXTENTS set to ext4 ioctl's code
point, at some later point, if people change their minds.)
- Ted
On Sat, Aug 03, 2013 at 09:27:40PM -0400, Theodore Ts'o wrote:
> On Tue, Jul 30, 2013 at 01:08:07PM +1000, Dave Chinner wrote:
> > But Ted's case is not a "hint" - it's a direct command to fetch the
> > extent map from disk. You can do that already with FIEMAP, so no new
> > code or interfaces are needed. fadvise() is not the proper interface
> > for manipulating filesystem metadata behaviour, and fiemap can
> > already do what you need. There is no need for any new interfaces
> > here.
>
> I've been looking at the definition of fiemap, and I'm not convinced.
> To quote from the fiemap.txt:
>
> The fiemap ioctl is an efficient method for userspace to get file
> extent mappings.
>
> That's not what is going on here. We are pre-caching them into kernel
> memory, not in user-space. In addition, we're also setting a flag to
> keep these extents preferentially in memory compared to other entries
> in the extent cache.
>
> I agree that posix_fadvise() isn't really a good match, either:
>
> "posix_fadvise - predeclare an access pattern for file data"
>
> How about this? FIEMAP is an ioctl, anyway. How about if we just
> declare this as a new fs-independent ioctl, much like FS_IOC_FIEMAP?
>
> #define FS_IOC_PRECACHE_EXTENTS _IO('f', 18)
>
> This is, of course, assuming that other file systems are interested in
> implementing this functionality. If not, we can just keep it as
> EXT4_IOC_PRECACHE_EXTENTS, and just call it a day. (We can always add
> a definition of FS_IOC_PRECACHE_EXTENTS set to ext4 ioctl's code
> point, at some later point, if people change their minds.)
We *don't need to add any code* to the kernel to read extents into
the kernel cache. The FIEMAP interface as it exists today -without
modification- fulfils your stated requirement.
I do no see any reason for adding a new, duplicated interface that
we have to maintain and hook up to all the relevant filesystems,
write test code for and then support forever more. It just makes no
sense at all.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On 8/12/13 10:10 PM, Dave Chinner wrote:
> On Sat, Aug 03, 2013 at 09:27:40PM -0400, Theodore Ts'o wrote:
>> On Tue, Jul 30, 2013 at 01:08:07PM +1000, Dave Chinner wrote:
>>> But Ted's case is not a "hint" - it's a direct command to fetch the
>>> extent map from disk. You can do that already with FIEMAP, so no new
>>> code or interfaces are needed. fadvise() is not the proper interface
>>> for manipulating filesystem metadata behaviour, and fiemap can
>>> already do what you need. There is no need for any new interfaces
>>> here.
>>
>> I've been looking at the definition of fiemap, and I'm not convinced.
>> To quote from the fiemap.txt:
>>
>> The fiemap ioctl is an efficient method for userspace to get file
>> extent mappings.
Ted -
Changing fiemap.txt is easy, if that's the only problem... :)
>> That's not what is going on here. We are pre-caching them into kernel
>> memory, not in user-space. In addition, we're also setting a flag to
>> keep these extents preferentially in memory compared to other entries
>> in the extent cache.
Reading extents via fiemap almost certainly moves that metadata into
kernel cache, simply by the act of reading the block device to get them.
It doesn't set any caching preference, but how needed is that, really,
in practice?
>> I agree that posix_fadvise() isn't really a good match, either:
>>
>> "posix_fadvise - predeclare an access pattern for file data"
>>
>> How about this? FIEMAP is an ioctl, anyway. How about if we just
>> declare this as a new fs-independent ioctl, much like FS_IOC_FIEMAP?
>>
>> #define FS_IOC_PRECACHE_EXTENTS _IO('f', 18)
>>
>> This is, of course, assuming that other file systems are interested in
>> implementing this functionality. If not, we can just keep it as
>> EXT4_IOC_PRECACHE_EXTENTS, and just call it a day. (We can always add
>> a definition of FS_IOC_PRECACHE_EXTENTS set to ext4 ioctl's code
>> point, at some later point, if people change their minds.)
>
> We *don't need to add any code* to the kernel to read extents into
> the kernel cache. The FIEMAP interface as it exists today -without
> modification- fulfils your stated requirement.
>
> I do no see any reason for adding a new, duplicated interface that
> we have to maintain and hook up to all the relevant filesystems,
> write test code for and then support forever more. It just makes no
> sense at all.
I see Dave's point that we _do_ have an interface today to read
all file extents into cache. We don't mark them as particularly sticky,
however.
This seems pretty clearly driven by a Google workload need; something you
can probably test. Does FIEMAP do the job for you or not? If not, why not?
-Eric
> Cheers,
>
> Dave.
>
On Mon, Aug 12, 2013 at 10:21:45PM -0500, Eric Sandeen wrote:
>
> Reading extents via fiemap almost certainly moves that metadata into
> kernel cache, simply by the act of reading the block device to get them.
Well, if the file system has an extent cache. It certainly will end
up reading the pages involved with the extents into the buffer and/or
page cache (depending on how the file system does things).
> I see Dave's point that we _do_ have an interface today to read
> all file extents into cache. We don't mark them as particularly sticky,
> however.
>
> This seems pretty clearly driven by a Google workload need; something you
> can probably test. Does FIEMAP do the job for you or not? If not, why not?
If you are using memory containers the way we do, in practice every
single process is going to be under memory pressure. See previous
comments I've made about why in a cloud environment, memory is your
most precious resource, since motherboards have limited numbers of
DIMM slots, and high-density DIMMS are expensive --- this is why
services like Amazon EC2 and Linode charge $$$ if you need much more
than 512mb of memory. This is because in order to make cloud systems
cost effective from a financial ROI point of view (especially once you
include power and cooling costs), you need to pack a large number of
workloads on each machine, and this is true regardless of whether you
are using containers or VM's as your method of isolation.
So basically, if you are trying to use your memory efficiently, _and_
you are trying to meet 99.9 percentile latency SLA numbers for your
performance-critical workloads, you need to have a way of telling the
system that certain pieces of memory (in this case, certain parts of
the extent cache) are more important than others (for example, a
no-longer-used inode/dentry in the inode/dentry cache or other slab
objects).
- Ted
P.S. In previous versions of this patch (which never went upstream,
using a different implementation which also never went upstream), this
ioctl nailed the relevant portions of the extent cache into memory
permanently, and they wouldn't be evicted no matter how much memory
pressure you would be under. In the Google environment, this wasn't a
major issue, since all jobs run under a restrictive memory container
and so a buggy or malicious program which attempted to precache too
many files would end up OOM-kiling itself (after which point the
situation would correct itself).
In this version of the patch, I've made the cache entries sticky, but
they aren't permanently nailed in place. This is because not all
systems will be running with containers, and I wanted to make sure we
had a safety valve against abuse. Could someone still degrade the
system performance if they tried to abuse this ioctl? Sure, but
someone can do the same thing with a "while (1) fork();" bomb.
On Tue, Aug 13, 2013 at 09:04:59AM -0400, Theodore Ts'o wrote:
> On Mon, Aug 12, 2013 at 10:21:45PM -0500, Eric Sandeen wrote:
> >
> > Reading extents via fiemap almost certainly moves that metadata into
> > kernel cache, simply by the act of reading the block device to get them.
>
> Well, if the file system has an extent cache. It certainly will end
> up reading the pages involved with the extents into the buffer and/or
> page cache (depending on how the file system does things).
Of course. ext4 has an extent cache, so you're agreeing that FIEMAP
will populate the extent cache for the use case google has, right?
> > I see Dave's point that we _do_ have an interface today to read
> > all file extents into cache. We don't mark them as particularly sticky,
> > however.
> >
> > This seems pretty clearly driven by a Google workload need; something you
> > can probably test. Does FIEMAP do the job for you or not? If not, why not?
>
> If you are using memory containers the way we do, in practice every
> single process is going to be under memory pressure. See previous
> comments I've made about why in a cloud environment, memory is your
> most precious resource.
<snip>
Sure, we know all that. But you haven't answered the question being
asked which was whether FIEMAP can acheive what you need. You've
already admitted it can populate the extent cache for ext4, so
there's little else that is needed to pin the extents is reads in a
range in the cache. Just one flag...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Fri, Aug 16, 2013 at 01:21:24PM +1000, Dave Chinner wrote:
>
> Sure, we know all that. But you haven't answered the question being
> asked which was whether FIEMAP can acheive what you need. You've
> already admitted it can populate the extent cache for ext4, so
> there's little else that is needed to pin the extents is reads in a
> range in the cache. Just one flag...
I thought you were asking a different question, which is whether we
could just use FIEMAP and not try to soft-pin the extents in the
cache. And the answer to that is no.
If the argument is that you think it's better to allocate an extra bit
to FIEMAP rather than allocate a new ioctl code --- shrug. I'm not
sure why you're making such a big deal about that; it's not like we're
short on ioctl code points. I'm willing to do this, although if we
keep on getting API bikeshedding, my fallback position is to just
implement this as an ext4-specific ioctl and call it a day.
I am curious whether any other file system is actually going to
implement this or not, though. If everyone else is convinced that
their file system is so wonderful that they don't need this, or this
is just a wierdo Google use case, I'm not sure why we're spending so
much time bikeshedding the APi.
- Ted