Background
==========
This patch set point to fix the inconsistency problem which has been
discussed and partial fixed in [1].
Now, the problem is on the unstable storage which has a flaky transport
(e.g. iSCSI transport may disconnect few seconds and reconnect due to
the bad network environment), if we failed to async write metadata in
background, the end write routine in block layer will clear the buffer's
uptodate flag, but the data in such buffer is actually uptodate. Finally
we may read "old && inconsistent" metadata from the disk when we get the
buffer later because not only the uptodate flag was cleared but also we
do not check the write io error flag, or even worse the buffer has been
freed due to memory presure.
Fortunately, if the jbd2 do checkpoint after async IO error happens,
the checkpoint routine will check the write_io_error flag and abort the
the journal if detect IO error. And in the journal recover case, the
recover code will invoke sync_blockdev() after recover complete, it will
also detect IO error and refuse to mount the filesystem.
Current ext4 have already deal with this problem in __ext4_get_inode_loc()
and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
containing valid data"), but it's not enough.
[1] https://lore.kernel.org/linux-ext4/[email protected]/
Description
===========
This patch set add and rework 7 wrapper functions of getting metadata
blocks, replace all sb_bread() / sb_getblk*() / ext4_bread() and
sb_breadahead*(). Add buffer_write_io_error() checking into them, if
the buffer isn't uptodate and write_io_error flag was set, which means
that the buffer has been failed to write out to disk, re-add the
uptodate flag to prevent subsequent read operation.
- ext4_sb_getblk(): works the same as sb_getblk(), use to replace all
sb_getblk() used for newly allocated blocks and getting buffers.
- ext4_sb_getblk_locked(): works the same as sb_getblk() except check &
fix buffer uotpdate flag, use to replace all sb_getblk() used for
getting buffers to read.
- ext4_sb_getblk_gfp(): gfp version of ext4_sb_getblk().
- ext4_sb_getblk_locked_gfp(): gfp version of ext4_sb_getblk_locked().
- ext4_sb_bread(): get buffer and submit read bio if buffer is actually
not uptodate.
- ext4_sb_bread_unmovable(): unmovable version of ext4_sb_bread().
- ext4_sb_breadahead_unmovable(): works the same to ext4_sb_bread_unmovable()
except skip submit read bio if failed to lock the buffer.
Patch 1-2: do some small change in ext4 inode eio simulation and add a
helper in buffer.c, just prepare for below patches.
Patch 3: add the ext4_sb_*() function to deal with the write_io_error
flag in buffer.
Patch 4-8: replace all sb_*() with ext4_sb_*() in ext4.
Patch 9: deal with the buffer shrinking case, abort jbd2/fs when
shrinking a buffer with write_io_error flag.
Patch 10: just do some cleanup.
After this patch set, we need to use above 7 wrapper functions to
get/read metadata block instead of invoke sb_*() functions defined in
fs/buffer.h.
Test
====
This patch set is based on linux-5.7-rc7 and has been tests by xfstests
in auto mode.
Thanks,
Yi.
zhangyi (F) (10):
ext4: move inode eio simulation behind io completeion
fs: pick out ll_rw_one_block() helper function
ext4: add ext4_sb_getblk*() wrapper functions
ext4: replace sb_getblk() with ext4_sb_getblk_locked()
ext4: replace sb_bread*() with ext4_sb_bread*()
ext4: replace sb_getblk() with ext4_sb_getblk()
ext4: switch to use ext4_sb_getblk_locked() in ext4_getblk()
ext4: replace sb_breadahead() with ext4_sb_breadahead()
ext4: abort the filesystem while freeing the write error io buffer
ext4: remove unused parameter in jbd2_journal_try_to_free_buffers()
fs/buffer.c | 41 ++++++----
fs/ext4/balloc.c | 6 +-
fs/ext4/ext4.h | 60 ++++++++++++---
fs/ext4/extents.c | 13 ++--
fs/ext4/ialloc.c | 6 +-
fs/ext4/indirect.c | 13 ++--
fs/ext4/inline.c | 2 +-
fs/ext4/inode.c | 53 +++++--------
fs/ext4/mmp.c | 2 +-
fs/ext4/resize.c | 24 +++---
fs/ext4/super.c | 145 +++++++++++++++++++++++++++++++-----
fs/ext4/xattr.c | 4 +-
fs/jbd2/transaction.c | 20 +++--
include/linux/buffer_head.h | 1 +
include/linux/jbd2.h | 3 +-
15 files changed, 277 insertions(+), 116 deletions(-)
--
2.21.3
gfp_mask in jbd2_journal_try_to_free_buffers() is no longer used after
commit 536fc240e7147 ("jbd2: clean up jbd2_journal_try_to_free_buffers()"),
just remove it.
Signed-off-by: zhangyi (F) <[email protected]>
---
fs/ext4/inode.c | 2 +-
fs/ext4/super.c | 4 ++--
fs/jbd2/transaction.c | 7 +------
include/linux/jbd2.h | 3 ++-
4 files changed, 6 insertions(+), 10 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7354edb444c5..8adde60c9b02 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3293,7 +3293,7 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
if (PageChecked(page))
return 0;
if (journal)
- return jbd2_journal_try_to_free_buffers(journal, page, wait);
+ return jbd2_journal_try_to_free_buffers(journal, page);
else
return try_to_free_buffers(page);
}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 1e15179aa1c4..8a84f52fd67c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1394,8 +1394,8 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
if (!page_has_buffers(page))
return 0;
if (journal)
- return jbd2_journal_try_to_free_buffers(journal, page,
- wait & ~__GFP_DIRECT_RECLAIM);
+ return jbd2_journal_try_to_free_buffers(journal, page);
+
return bdev_try_to_free_buffer(sb, page);
}
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index ac6a077afec3..c3ecefdb3abd 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2070,10 +2070,6 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
* int jbd2_journal_try_to_free_buffers() - try to free page buffers.
* @journal: journal for operation
* @page: to try and free
- * @gfp_mask: we use the mask to detect how hard should we try to release
- * buffers. If __GFP_DIRECT_RECLAIM and __GFP_FS is set, we wait for commit
- * code to release the buffers.
- *
*
* For all the buffers on this page,
* if they are fully written out ordered data, move them onto BUF_CLEAN
@@ -2104,8 +2100,7 @@ __journal_try_to_free_buffer(journal_t *journal, struct buffer_head *bh)
*
* Return 0 on failure, 1 on success
*/
-int jbd2_journal_try_to_free_buffers(journal_t *journal,
- struct page *page, gfp_t gfp_mask)
+int jbd2_journal_try_to_free_buffers(journal_t *journal, struct page *page)
{
struct buffer_head *head;
struct buffer_head *bh;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index f613d8529863..230bb7767180 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1376,7 +1376,8 @@ extern int jbd2_journal_dirty_metadata (handle_t *, struct buffer_head *);
extern int jbd2_journal_forget (handle_t *, struct buffer_head *);
extern int jbd2_journal_invalidatepage(journal_t *,
struct page *, unsigned int, unsigned int);
-extern int jbd2_journal_try_to_free_buffers(journal_t *, struct page *, gfp_t);
+extern int jbd2_journal_try_to_free_buffers(journal_t *journal,
+ struct page *page);
extern int jbd2_journal_stop(handle_t *);
extern int jbd2_journal_flush (journal_t *);
extern void jbd2_journal_lock_updates (journal_t *);
--
2.21.3
For the cases of newly allocated block and get block for write buffer,
checking the buffer's write_io_error and uptodate flag is optional.
Replace all sb_getblk() with ext4_sb_getblk(), it works the same as
before.
Signed-off-by: zhangyi (F) <[email protected]>
---
fs/ext4/extents.c | 8 +++++---
fs/ext4/indirect.c | 3 ++-
fs/ext4/inline.c | 2 +-
fs/ext4/inode.c | 2 +-
fs/ext4/mmp.c | 2 +-
fs/ext4/resize.c | 8 ++++----
fs/ext4/xattr.c | 2 +-
7 files changed, 15 insertions(+), 12 deletions(-)
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5db76b46fad5..25a7a8389291 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1066,7 +1066,8 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
err = -EFSCORRUPTED;
goto cleanup;
}
- bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
+ bh = ext4_sb_getblk_gfp(inode->i_sb, newblock,
+ __GFP_MOVABLE | GFP_NOFS);
if (unlikely(!bh)) {
err = -ENOMEM;
goto cleanup;
@@ -1143,7 +1144,7 @@ static int ext4_ext_split(handle_t *handle, struct inode *inode,
while (k--) {
oldblock = newblock;
newblock = ablocks[--a];
- bh = sb_getblk(inode->i_sb, newblock);
+ bh = ext4_sb_getblk(inode->i_sb, newblock);
if (unlikely(!bh)) {
err = -ENOMEM;
goto cleanup;
@@ -1270,7 +1271,8 @@ static int ext4_ext_grow_indepth(handle_t *handle, struct inode *inode,
if (newblock == 0)
return err;
- bh = sb_getblk_gfp(inode->i_sb, newblock, __GFP_MOVABLE | GFP_NOFS);
+ bh = ext4_sb_getblk_gfp(inode->i_sb, newblock,
+ __GFP_MOVABLE | GFP_NOFS);
if (unlikely(!bh))
return -ENOMEM;
lock_buffer(bh);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index bd4d86211ab8..e4c2a8e1afbb 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -347,7 +347,8 @@ static int ext4_alloc_branch(handle_t *handle,
if (i == 0)
continue;
- bh = branch[i].bh = sb_getblk(ar->inode->i_sb, new_blocks[i-1]);
+ bh = branch[i].bh = ext4_sb_getblk(ar->inode->i_sb,
+ new_blocks[i-1]);
if (unlikely(!bh)) {
err = -ENOMEM;
goto failed;
diff --git a/fs/ext4/inline.c b/fs/ext4/inline.c
index f35e289e17aa..94424ea011a5 100644
--- a/fs/ext4/inline.c
+++ b/fs/ext4/inline.c
@@ -1216,7 +1216,7 @@ static int ext4_convert_inline_data_nolock(handle_t *handle,
goto out_restore;
}
- data_bh = sb_getblk(inode->i_sb, map.m_pblk);
+ data_bh = ext4_sb_getblk(inode->i_sb, map.m_pblk);
if (!data_bh) {
error = -ENOMEM;
goto out_restore;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c374870f6bb1..97dc77ec7c3e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -829,7 +829,7 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
if (err < 0)
return ERR_PTR(err);
- bh = sb_getblk(inode->i_sb, map.m_pblk);
+ bh = ext4_sb_getblk(inode->i_sb, map.m_pblk);
if (unlikely(!bh))
return ERR_PTR(-ENOMEM);
if (map.m_flags & EXT4_MAP_NEW) {
diff --git a/fs/ext4/mmp.c b/fs/ext4/mmp.c
index d34cb8c46655..2aa1dbe44e9d 100644
--- a/fs/ext4/mmp.c
+++ b/fs/ext4/mmp.c
@@ -78,7 +78,7 @@ static int read_mmp_block(struct super_block *sb, struct buffer_head **bh,
* that the MD RAID device cache has been bypassed, and that the read
* is not blocked in the elevator. */
if (!*bh) {
- *bh = sb_getblk(sb, mmp_block);
+ *bh = ext4_sb_getblk(sb, mmp_block);
if (!*bh) {
ret = -ENOMEM;
goto warn_exit;
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index ff018e63bb55..27073c715228 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -400,7 +400,7 @@ static struct buffer_head *bclean(handle_t *handle, struct super_block *sb,
struct buffer_head *bh;
int err;
- bh = sb_getblk(sb, blk);
+ bh = ext4_sb_getblk(sb, blk);
if (unlikely(!bh))
return ERR_PTR(-ENOMEM);
BUFFER_TRACE(bh, "get_write_access");
@@ -464,7 +464,7 @@ static int set_flexbg_block_bitmap(struct super_block *sb, handle_t *handle,
if (err < 0)
return err;
- bh = sb_getblk(sb, flex_gd->groups[group].block_bitmap);
+ bh = ext4_sb_getblk(sb, flex_gd->groups[group].block_bitmap);
if (unlikely(!bh))
return -ENOMEM;
@@ -557,7 +557,7 @@ static int setup_new_flex_group_blocks(struct super_block *sb,
if (err < 0)
goto out;
- gdb = sb_getblk(sb, block);
+ gdb = ext4_sb_getblk(sb, block);
if (unlikely(!gdb)) {
err = -ENOMEM;
goto out;
@@ -1130,7 +1130,7 @@ static void update_backups(struct super_block *sb, sector_t blk_off, char *data,
backup_block = (ext4_group_first_block_no(sb, group) +
ext4_bg_has_super(sb, group));
- bh = sb_getblk(sb, backup_block);
+ bh = ext4_sb_getblk(sb, backup_block);
if (unlikely(!bh)) {
err = -ENOMEM;
break;
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 21df43a25328..e2ba4e631b02 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -2056,7 +2056,7 @@ ext4_xattr_block_set(handle_t *handle, struct inode *inode,
ea_idebug(inode, "creating block %llu",
(unsigned long long)block);
- new_bh = sb_getblk(sb, block);
+ new_bh = ext4_sb_getblk(sb, block);
if (unlikely(!new_bh)) {
error = -ENOMEM;
getblk_failed:
--
2.21.3
Now we can prevent reading old metadata buffer from the disk which has
been failed to write out through checking write io error when getting
the buffer. One more thing need to do is to prevent freeing the write
io error buffer. If the buffer was freed, we lose the latest data and
buffer stats, finally it will also lead to inconsistency.
So, this patch abort the journal in journal mode and invoke
ext4_error_err() in nojournal mode to prevent further inconsistency.
Signed-off-by: zhangyi (F) <[email protected]>
---
fs/ext4/super.c | 32 +++++++++++++++++++++++++++++++-
fs/jbd2/transaction.c | 13 +++++++++++++
2 files changed, 44 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d25a0fe44bec..1e15179aa1c4 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1349,6 +1349,36 @@ static int ext4_nfs_commit_metadata(struct inode *inode)
return ext4_write_inode(inode, &wbc);
}
+static int bdev_try_to_free_buffer(struct super_block *sb, struct page *page)
+{
+ struct buffer_head *head, *bh;
+ bool has_write_io_error = false;
+
+ head = page_buffers(page);
+ bh = head;
+ do {
+ /*
+ * If the buffer has been failed to write out, the metadata
+ * in this buffer is uptodate but which on disk is old, may
+ * lead to inconsistency while reading the old data
+ * successfully.
+ */
+ if (buffer_write_io_error(bh) && !buffer_uptodate(bh)) {
+ has_write_io_error = true;
+ break;
+ }
+ } while ((bh = bh->b_this_page) != head);
+
+ if (has_write_io_error)
+ ext4_error_err(sb, EIO, "Free metadata buffer (%llu) that has "
+ "been failed to write out. There is a risk of "
+ "filesystem inconsistency in case of reading "
+ "metadata from this block subsequently.",
+ (unsigned long long) bh->b_blocknr);
+
+ return try_to_free_buffers(page);
+}
+
/*
* Try to release metadata pages (indirect blocks, directories) which are
* mapped via the block device. Since these pages could have journal heads
@@ -1366,7 +1396,7 @@ static int bdev_try_to_free_page(struct super_block *sb, struct page *page,
if (journal)
return jbd2_journal_try_to_free_buffers(journal, page,
wait & ~__GFP_DIRECT_RECLAIM);
- return try_to_free_buffers(page);
+ return bdev_try_to_free_buffer(sb, page);
}
#ifdef CONFIG_FS_ENCRYPTION
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index 3dccc23cf010..ac6a077afec3 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -2109,6 +2109,7 @@ int jbd2_journal_try_to_free_buffers(journal_t *journal,
{
struct buffer_head *head;
struct buffer_head *bh;
+ bool has_write_io_error = false;
int ret = 0;
J_ASSERT(PageLocked(page));
@@ -2133,11 +2134,23 @@ int jbd2_journal_try_to_free_buffers(journal_t *journal,
jbd2_journal_put_journal_head(jh);
if (buffer_jbd(bh))
goto busy;
+
+ /*
+ * If the buffer has been failed to write out, the metadata
+ * in this buffer is uptodate but which on disk is old,
+ * abort journal to prevent subsequent inconsistency while
+ * reading the old data successfully.
+ */
+ if (buffer_write_io_error(bh) && !buffer_uptodate(bh))
+ has_write_io_error = true;
} while ((bh = bh->b_this_page) != head);
ret = try_to_free_buffers(page);
busy:
+ if (has_write_io_error)
+ jbd2_journal_abort(journal, -EIO);
+
return ret;
}
--
2.21.3
For the cases of read ahead blocks, we also need to check the write io
error flag to prevent reading block from disk if it is actually
uptodate. Add a new wrapper ext4_sb_breadahead() to check the uptodate
flag and prevent unnecessary read operation, and replace all
sb_breadahead().
Signed-off-by: zhangyi (F) <[email protected]>
---
fs/ext4/ext4.h | 11 +++++------
fs/ext4/inode.c | 2 +-
fs/ext4/super.c | 19 ++++++++++++++++++-
3 files changed, 24 insertions(+), 8 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 81c1bdfb9397..cafa2617a093 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2767,6 +2767,8 @@ extern struct buffer_head *__ext4_sb_getblk(struct super_block *sb,
extern struct buffer_head *__ext4_sb_bread_gfp(struct super_block *sb,
sector_t block, int op_flags,
gfp_t gfp);
+extern void ext4_sb_breadahead_unmovable(struct super_block *sb,
+ sector_t block);
extern int ext4_seq_options_show(struct seq_file *seq, void *offset);
extern int ext4_calculate_overhead(struct super_block *sb);
extern void ext4_superblock_csum_set(struct super_block *sb);
@@ -3533,13 +3535,10 @@ static inline int ext4_buffer_uptodate(struct buffer_head *bh)
{
/*
* If the buffer has the write error flag, we have failed
- * to write out data in the block. In this case, we don't
- * have to read the block because we may read the old data
- * successfully.
+ * to write out this metadata block. In this case, the data
+ * in this block is uptodate.
*/
- if (!buffer_uptodate(bh) && buffer_write_io_error(bh))
- set_buffer_uptodate(bh);
- return buffer_uptodate(bh);
+ return buffer_uptodate(bh) || buffer_write_io_error(bh);
}
#endif /* __KERNEL__ */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4989a9633fc7..7354edb444c5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4351,7 +4351,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
if (end > table)
end = table;
while (b <= end)
- sb_breadahead_unmovable(sb, b++);
+ ext4_sb_breadahead_unmovable(sb, b++);
}
/*
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b9aab334a5d0..d25a0fe44bec 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -218,6 +218,23 @@ __ext4_sb_bread_gfp(struct super_block *sb, sector_t block,
return ERR_PTR(-EIO);
}
+/*
+ * This works like sb_breadahead_unmovable() except it use
+ * ext4_buffer_uptodate() instead of buffer_uptodate() to check the
+ * metadata buffer is actually uptodate or not. The buffer should be
+ * considered as actually uptodate for the case of it has been
+ * failed to write out.
+ */
+void ext4_sb_breadahead_unmovable(struct super_block *sb, sector_t block)
+{
+ struct buffer_head *bh = ext4_sb_getblk_gfp(sb, block, 0);
+
+ if (likely(bh) && !ext4_buffer_uptodate(bh)) {
+ ll_rw_block(REQ_OP_READ, REQ_META | REQ_RAHEAD, 1, &bh);
+ brelse(bh);
+ }
+}
+
static int ext4_verify_csum_type(struct super_block *sb,
struct ext4_super_block *es)
{
@@ -4395,7 +4412,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
/* Pre-read the descriptors into the buffer cache */
for (i = 0; i < db_count; i++) {
block = descriptor_loc(sb, logical_sb_block, i);
- sb_breadahead_unmovable(sb, block);
+ ext4_sb_breadahead_unmovable(sb, block);
}
for (i = 0; i < db_count; i++) {
--
2.21.3
For the cases of getting blocks for reading in ext4_getblk()
(e.g. ext4_bread() and ext4_bread_batch()), they are also need to check
the write io error flag before read. Switch to use ext4_sb_getblk_locked()
instead.
Signed-off-by: zhangyi (F) <[email protected]>
---
fs/ext4/ext4.h | 3 ++-
fs/ext4/inode.c | 22 +++++++++++++---------
fs/ext4/xattr.c | 2 +-
3 files changed, 16 insertions(+), 11 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 609c2b555d29..81c1bdfb9397 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2642,7 +2642,8 @@ extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid);
/* inode.c */
int ext4_inode_is_fast_symlink(struct inode *inode);
-struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
+struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
+ ext4_lblk_t block, bool lock, int map_flags);
struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
int ext4_bread_batch(struct inode *inode, ext4_lblk_t block, int bh_count,
bool wait, struct buffer_head **bhs);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 97dc77ec7c3e..4989a9633fc7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -811,7 +811,8 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
* `handle' can be NULL if create is zero
*/
struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
- ext4_lblk_t block, int map_flags)
+ ext4_lblk_t block, bool lock,
+ int map_flags)
{
struct ext4_map_blocks map;
struct buffer_head *bh;
@@ -829,7 +830,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
if (err < 0)
return ERR_PTR(err);
- bh = ext4_sb_getblk(inode->i_sb, map.m_pblk);
+ bh = __ext4_sb_getblk(inode->i_sb, map.m_pblk,
+ lock && !(map.m_flags & EXT4_MAP_NEW));
if (unlikely(!bh))
return ERR_PTR(-ENOMEM);
if (map.m_flags & EXT4_MAP_NEW) {
@@ -872,12 +874,12 @@ struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode,
{
struct buffer_head *bh;
- bh = ext4_getblk(handle, inode, block, map_flags);
+ bh = ext4_getblk(handle, inode, block, true, map_flags);
if (IS_ERR(bh))
return bh;
- if (!bh || ext4_buffer_uptodate(bh))
+ if (!bh || buffer_uptodate(bh))
return bh;
- ll_rw_block(REQ_OP_READ, REQ_META | REQ_PRIO, 1, &bh);
+ ll_rw_one_block(REQ_OP_READ, REQ_META | REQ_PRIO, bh);
wait_on_buffer(bh);
if (buffer_uptodate(bh))
return bh;
@@ -892,7 +894,7 @@ int ext4_bread_batch(struct inode *inode, ext4_lblk_t block, int bh_count,
int i, err;
for (i = 0; i < bh_count; i++) {
- bhs[i] = ext4_getblk(NULL, inode, block + i, 0 /* map_flags */);
+ bhs[i] = ext4_getblk(NULL, inode, block + i, true, 0);
if (IS_ERR(bhs[i])) {
err = PTR_ERR(bhs[i]);
bh_count = i;
@@ -902,9 +904,9 @@ int ext4_bread_batch(struct inode *inode, ext4_lblk_t block, int bh_count,
for (i = 0; i < bh_count; i++)
/* Note that NULL bhs[i] is valid because of holes. */
- if (bhs[i] && !ext4_buffer_uptodate(bhs[i]))
- ll_rw_block(REQ_OP_READ, REQ_META | REQ_PRIO, 1,
- &bhs[i]);
+ if (bhs[i] && !buffer_uptodate(bhs[i]))
+ ll_rw_one_block(REQ_OP_READ, REQ_META | REQ_PRIO,
+ bhs[i]);
if (!wait)
return 0;
@@ -923,6 +925,8 @@ int ext4_bread_batch(struct inode *inode, ext4_lblk_t block, int bh_count,
out_brelse:
for (i = 0; i < bh_count; i++) {
+ if (bhs[i] && buffer_locked(bhs[i]))
+ unlock_buffer(bhs[i]);
brelse(bhs[i]);
bhs[i] = NULL;
}
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index e2ba4e631b02..828a23a0e772 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -1358,7 +1358,7 @@ static int ext4_xattr_inode_write(handle_t *handle, struct inode *ea_inode,
brelse(bh);
csize = (bufsize - wsize) > blocksize ? blocksize :
bufsize - wsize;
- bh = ext4_getblk(handle, ea_inode, block, 0);
+ bh = ext4_getblk(handle, ea_inode, block, false, 0);
if (IS_ERR(bh))
return PTR_ERR(bh);
if (!bh) {
--
2.21.3
Current EIO simulation of reading inode from disk isn't accurate since
it will not submit bio if the inode buffer is uptodate. Move this
simulation behind read bio completeion just like inode/block bitmap EIO
simulation does.
Signed-off-by: zhangyi (F) <[email protected]>
---
fs/ext4/inode.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2a4aae6acdcb..e0f7e824b3b9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4279,8 +4279,6 @@ static int __ext4_get_inode_loc(struct inode *inode,
bh = sb_getblk(sb, block);
if (unlikely(!bh))
return -ENOMEM;
- if (ext4_simulate_fail(sb, EXT4_SIM_INODE_EIO))
- goto simulate_eio;
if (!buffer_uptodate(bh)) {
lock_buffer(bh);
@@ -4378,8 +4376,8 @@ static int __ext4_get_inode_loc(struct inode *inode,
submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh);
blk_finish_plug(&plug);
wait_on_buffer(bh);
+ ext4_simulate_fail_bh(sb, bh, EXT4_SIM_INODE_EIO);
if (!buffer_uptodate(bh)) {
- simulate_eio:
ext4_error_inode_block(inode, block, EIO,
"unable to read itable block");
brelse(bh);
--
2.21.3
Pick out ll_rw_one_block() helper function from ll_rw_block() for
submitting one locked buffer for reading/writing.
Signed-off-by: zhangyi (F) <[email protected]>
---
fs/buffer.c | 41 ++++++++++++++++++++++---------------
include/linux/buffer_head.h | 1 +
2 files changed, 26 insertions(+), 16 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index a60f60396cfa..3a2226f88b2d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3081,6 +3081,29 @@ int submit_bh(int op, int op_flags, struct buffer_head *bh)
}
EXPORT_SYMBOL(submit_bh);
+void ll_rw_one_block(int op, int op_flags, struct buffer_head *bh)
+{
+ BUG_ON(!buffer_locked(bh));
+
+ if (op == WRITE) {
+ if (test_clear_buffer_dirty(bh)) {
+ bh->b_end_io = end_buffer_write_sync;
+ get_bh(bh);
+ submit_bh(op, op_flags, bh);
+ return;
+ }
+ } else {
+ if (!buffer_uptodate(bh)) {
+ bh->b_end_io = end_buffer_read_sync;
+ get_bh(bh);
+ submit_bh(op, op_flags, bh);
+ return;
+ }
+ }
+ unlock_buffer(bh);
+}
+EXPORT_SYMBOL(ll_rw_one_block);
+
/**
* ll_rw_block: low-level access to block devices (DEPRECATED)
* @op: whether to %READ or %WRITE
@@ -3116,22 +3139,8 @@ void ll_rw_block(int op, int op_flags, int nr, struct buffer_head *bhs[])
if (!trylock_buffer(bh))
continue;
- if (op == WRITE) {
- if (test_clear_buffer_dirty(bh)) {
- bh->b_end_io = end_buffer_write_sync;
- get_bh(bh);
- submit_bh(op, op_flags, bh);
- continue;
- }
- } else {
- if (!buffer_uptodate(bh)) {
- bh->b_end_io = end_buffer_read_sync;
- get_bh(bh);
- submit_bh(op, op_flags, bh);
- continue;
- }
- }
- unlock_buffer(bh);
+
+ ll_rw_one_block(op, op_flags, bh);
}
}
EXPORT_SYMBOL(ll_rw_block);
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 15b765a181b8..11aa412c0bcd 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -198,6 +198,7 @@ struct buffer_head *alloc_buffer_head(gfp_t gfp_flags);
void free_buffer_head(struct buffer_head * bh);
void unlock_buffer(struct buffer_head *bh);
void __lock_buffer(struct buffer_head *bh);
+void ll_rw_one_block(int op, int op_flags, struct buffer_head *bh);
void ll_rw_block(int, int, int, struct buffer_head * bh[]);
int sync_dirty_buffer(struct buffer_head *bh);
int __sync_dirty_buffer(struct buffer_head *bh, int op_flags);
--
2.21.3
For the read buffer cases, now we invoke sb_getblk() and submit read
bio if the buffer is not uptodate, but the uptodate checking is not
accurate which may lead to read old metadata from the disk if the
buffer has been failed to write out.
Replace all sb_getblk() with ext4_sb_getblk_locked(), this function
will check and fix the buffer's uptodate flag if it has write io error
flag, and lock the buffer if it is actually not uptodate, so the caller
don't need to lock the buffer after ext4_sb_getblk_locked() return.
Signed-off-by: zhangyi (F) <[email protected]>
---
fs/ext4/balloc.c | 6 ++++--
fs/ext4/extents.c | 5 +++--
fs/ext4/ialloc.c | 6 ++++--
fs/ext4/indirect.c | 4 ++--
fs/ext4/inode.c | 23 ++++-------------------
fs/ext4/resize.c | 4 ++--
fs/ext4/super.c | 7 ++++---
7 files changed, 23 insertions(+), 32 deletions(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index a32e5f7b5385..806959644247 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -433,14 +433,15 @@ ext4_read_block_bitmap_nowait(struct super_block *sb, ext4_group_t block_group)
EXT4_GROUP_INFO_BBITMAP_CORRUPT);
return ERR_PTR(-EFSCORRUPTED);
}
- bh = sb_getblk(sb, bitmap_blk);
+ bh = ext4_sb_getblk_locked(sb, bitmap_blk);
if (unlikely(!bh)) {
ext4_warning(sb, "Cannot get buffer for block bitmap - "
"block_group = %u, block_bitmap = %llu",
block_group, bitmap_blk);
return ERR_PTR(-ENOMEM);
}
-
+ if (!buffer_uptodate(bh))
+ goto submit;
if (bitmap_uptodate(bh))
goto verify;
@@ -449,6 +450,7 @@ ext4_read_block_bitmap_nowait(struct super_block *sb, ext4_group_t block_group)
unlock_buffer(bh);
goto verify;
}
+submit:
ext4_lock_group(sb, block_group);
if (ext4_has_group_desc_csum(sb) &&
(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f2b577b315a0..5db76b46fad5 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -488,11 +488,12 @@ __read_extent_tree_block(const char *function, unsigned int line,
struct buffer_head *bh;
int err;
- bh = sb_getblk_gfp(inode->i_sb, pblk, __GFP_MOVABLE | GFP_NOFS);
+ bh = ext4_sb_getblk_locked_gfp(inode->i_sb, pblk,
+ __GFP_MOVABLE | GFP_NOFS);
if (unlikely(!bh))
return ERR_PTR(-ENOMEM);
- if (!bh_uptodate_or_lock(bh)) {
+ if (!buffer_uptodate(bh)) {
trace_ext4_ext_load_extent(inode, pblk, _RET_IP_);
err = bh_submit_read(bh);
if (err < 0)
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 4b8c9a9bdf0c..a386b9126101 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -137,13 +137,15 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
EXT4_GROUP_INFO_IBITMAP_CORRUPT);
return ERR_PTR(-EFSCORRUPTED);
}
- bh = sb_getblk(sb, bitmap_blk);
+ bh = ext4_sb_getblk_locked(sb, bitmap_blk);
if (unlikely(!bh)) {
ext4_warning(sb, "Cannot read inode bitmap - "
"block_group = %u, inode_bitmap = %llu",
block_group, bitmap_blk);
return ERR_PTR(-ENOMEM);
}
+ if (!buffer_uptodate(bh))
+ goto submit;
if (bitmap_uptodate(bh))
goto verify;
@@ -152,7 +154,7 @@ ext4_read_inode_bitmap(struct super_block *sb, ext4_group_t block_group)
unlock_buffer(bh);
goto verify;
}
-
+submit:
ext4_lock_group(sb, block_group);
if (ext4_has_group_desc_csum(sb) &&
(desc->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT))) {
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 107f0043f67f..8dcbf21439c1 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -156,13 +156,13 @@ static Indirect *ext4_get_branch(struct inode *inode, int depth,
if (!p->key)
goto no_block;
while (--depth) {
- bh = sb_getblk(sb, le32_to_cpu(p->key));
+ bh = ext4_sb_getblk_locked(sb, le32_to_cpu(p->key));
if (unlikely(!bh)) {
ret = -ENOMEM;
goto failure;
}
- if (!bh_uptodate_or_lock(bh)) {
+ if (!buffer_uptodate(bh)) {
if (bh_submit_read(bh) < 0) {
put_bh(bh);
goto failure;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e0f7e824b3b9..c374870f6bb1 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4276,27 +4276,10 @@ static int __ext4_get_inode_loc(struct inode *inode,
block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
- bh = sb_getblk(sb, block);
+ bh = ext4_sb_getblk_locked(sb, block);
if (unlikely(!bh))
return -ENOMEM;
if (!buffer_uptodate(bh)) {
- lock_buffer(bh);
-
- /*
- * If the buffer has the write error flag, we have failed
- * to write out another inode in the same block. In this
- * case, we don't have to read the block because we may
- * read the old inode data successfully.
- */
- if (buffer_write_io_error(bh) && !buffer_uptodate(bh))
- set_buffer_uptodate(bh);
-
- if (buffer_uptodate(bh)) {
- /* someone brought it uptodate while we waited */
- unlock_buffer(bh);
- goto has_buffer;
- }
-
/*
* If we have all information of the inode in memory and this
* is the only valid inode in the block, we need not read the
@@ -4309,7 +4292,8 @@ static int __ext4_get_inode_loc(struct inode *inode,
start = inode_offset & ~(inodes_per_block - 1);
/* Is the inode bitmap in cache? */
- bitmap_bh = sb_getblk(sb, ext4_inode_bitmap(sb, gdp));
+ bitmap_bh = ext4_sb_getblk_locked(sb,
+ ext4_inode_bitmap(sb, gdp));
if (unlikely(!bitmap_bh))
goto make_io;
@@ -4319,6 +4303,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
* of one, so skip it.
*/
if (!buffer_uptodate(bitmap_bh)) {
+ unlock_buffer(bitmap_bh);
brelse(bitmap_bh);
goto make_io;
}
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index a50b51270ea9..414198e4d873 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -1239,10 +1239,10 @@ static int ext4_add_new_descs(handle_t *handle, struct super_block *sb,
static struct buffer_head *ext4_get_bitmap(struct super_block *sb, __u64 block)
{
- struct buffer_head *bh = sb_getblk(sb, block);
+ struct buffer_head *bh = ext4_sb_getblk_locked(sb, block);
if (unlikely(!bh))
return NULL;
- if (!bh_uptodate_or_lock(bh)) {
+ if (!buffer_uptodate(bh)) {
if (bh_submit_read(bh) < 0) {
brelse(bh);
return NULL;
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ddc46dbcd5ce..111fff55fada 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -202,13 +202,14 @@ __ext4_sb_getblk(struct super_block *sb, sector_t block, bool lock)
struct buffer_head *
ext4_sb_bread(struct super_block *sb, sector_t block, int op_flags)
{
- struct buffer_head *bh = sb_getblk(sb, block);
+ struct buffer_head *bh;
+ bh = ext4_sb_getblk_locked(sb, block);
if (bh == NULL)
return ERR_PTR(-ENOMEM);
- if (ext4_buffer_uptodate(bh))
+ if (buffer_uptodate(bh))
return bh;
- ll_rw_block(REQ_OP_READ, REQ_META | op_flags, 1, &bh);
+ ll_rw_one_block(REQ_OP_READ, REQ_META | op_flags, bh);
wait_on_buffer(bh);
if (buffer_uptodate(bh))
return bh;
--
2.21.3
On Tue, May 26, 2020 at 03:17:46PM +0800, zhangyi (F) wrote:
> Pick out ll_rw_one_block() helper function from ll_rw_block() for
> submitting one locked buffer for reading/writing.
That should probably read factor out instead of pick out.
>
> Signed-off-by: zhangyi (F) <[email protected]>
> ---
> fs/buffer.c | 41 ++++++++++++++++++++++---------------
> include/linux/buffer_head.h | 1 +
> 2 files changed, 26 insertions(+), 16 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index a60f60396cfa..3a2226f88b2d 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -3081,6 +3081,29 @@ int submit_bh(int op, int op_flags, struct buffer_head *bh)
> }
> EXPORT_SYMBOL(submit_bh);
>
> +void ll_rw_one_block(int op, int op_flags, struct buffer_head *bh)
> +{
> + BUG_ON(!buffer_locked(bh));
> +
> + if (op == WRITE) {
> + if (test_clear_buffer_dirty(bh)) {
> + bh->b_end_io = end_buffer_write_sync;
> + get_bh(bh);
> + submit_bh(op, op_flags, bh);
> + return;
> + }
> + } else {
> + if (!buffer_uptodate(bh)) {
> + bh->b_end_io = end_buffer_read_sync;
> + get_bh(bh);
> + submit_bh(op, op_flags, bh);
> + return;
> + }
> + }
> + unlock_buffer(bh);
> +}
> +EXPORT_SYMBOL(ll_rw_one_block);
I don't think you want separate read and write sides. In fact I'm not
sure you want the helper at all. At this point just open coding it
rather than adding more overhead to core code might be a better idea.
Hi, Christoph
On 2020/5/28 13:07, Christoph Hellwig wrote:
> On Tue, May 26, 2020 at 03:17:46PM +0800, zhangyi (F) wrote:
>> Pick out ll_rw_one_block() helper function from ll_rw_block() for
>> submitting one locked buffer for reading/writing.
>
> That should probably read factor out instead of pick out.
>
>>
>> Signed-off-by: zhangyi (F) <[email protected]>
>> ---
>> fs/buffer.c | 41 ++++++++++++++++++++++---------------
>> include/linux/buffer_head.h | 1 +
>> 2 files changed, 26 insertions(+), 16 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index a60f60396cfa..3a2226f88b2d 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -3081,6 +3081,29 @@ int submit_bh(int op, int op_flags, struct buffer_head *bh)
>> }
>> EXPORT_SYMBOL(submit_bh);
>>
>> +void ll_rw_one_block(int op, int op_flags, struct buffer_head *bh)
>> +{
>> + BUG_ON(!buffer_locked(bh));
>> +
>> + if (op == WRITE) {
>> + if (test_clear_buffer_dirty(bh)) {
>> + bh->b_end_io = end_buffer_write_sync;
>> + get_bh(bh);
>> + submit_bh(op, op_flags, bh);
>> + return;
>> + }
>> + } else {
>> + if (!buffer_uptodate(bh)) {
>> + bh->b_end_io = end_buffer_read_sync;
>> + get_bh(bh);
>> + submit_bh(op, op_flags, bh);
>> + return;
>> + }
>> + }
>> + unlock_buffer(bh);
>> +}
>> +EXPORT_SYMBOL(ll_rw_one_block);
>
> I don't think you want separate read and write sides. In fact I'm not
> sure you want the helper at all. At this point just open coding it
> rather than adding more overhead to core code might be a better idea.
>
Yeah, what I want is only the read side, it's fine by me to open coding it.
Thanks,
Yi.
Hiļ¼Ted and Jan, any suggestions of this patch set?
Thanks,
Yi.
On 2020/5/26 15:17, zhangyi (F) wrote:
> Background
> ==========
>
> This patch set point to fix the inconsistency problem which has been
> discussed and partial fixed in [1].
>
> Now, the problem is on the unstable storage which has a flaky transport
> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
> the bad network environment), if we failed to async write metadata in
> background, the end write routine in block layer will clear the buffer's
> uptodate flag, but the data in such buffer is actually uptodate. Finally
> we may read "old && inconsistent" metadata from the disk when we get the
> buffer later because not only the uptodate flag was cleared but also we
> do not check the write io error flag, or even worse the buffer has been
> freed due to memory presure.
>
> Fortunately, if the jbd2 do checkpoint after async IO error happens,
> the checkpoint routine will check the write_io_error flag and abort the
> the journal if detect IO error. And in the journal recover case, the
> recover code will invoke sync_blockdev() after recover complete, it will
> also detect IO error and refuse to mount the filesystem.
>
> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
> containing valid data"), but it's not enough.
>
> [1] https://lore.kernel.org/linux-ext4/[email protected]/
>
> Description
> ===========
>
> This patch set add and rework 7 wrapper functions of getting metadata
> blocks, replace all sb_bread() / sb_getblk*() / ext4_bread() and
> sb_breadahead*(). Add buffer_write_io_error() checking into them, if
> the buffer isn't uptodate and write_io_error flag was set, which means
> that the buffer has been failed to write out to disk, re-add the
> uptodate flag to prevent subsequent read operation.
>
> - ext4_sb_getblk(): works the same as sb_getblk(), use to replace all
> sb_getblk() used for newly allocated blocks and getting buffers.
> - ext4_sb_getblk_locked(): works the same as sb_getblk() except check &
> fix buffer uotpdate flag, use to replace all sb_getblk() used for
> getting buffers to read.
> - ext4_sb_getblk_gfp(): gfp version of ext4_sb_getblk().
> - ext4_sb_getblk_locked_gfp(): gfp version of ext4_sb_getblk_locked().
> - ext4_sb_bread(): get buffer and submit read bio if buffer is actually
> not uptodate.
> - ext4_sb_bread_unmovable(): unmovable version of ext4_sb_bread().
> - ext4_sb_breadahead_unmovable(): works the same to ext4_sb_bread_unmovable()
> except skip submit read bio if failed to lock the buffer.
>
> Patch 1-2: do some small change in ext4 inode eio simulation and add a
> helper in buffer.c, just prepare for below patches.
> Patch 3: add the ext4_sb_*() function to deal with the write_io_error
> flag in buffer.
> Patch 4-8: replace all sb_*() with ext4_sb_*() in ext4.
> Patch 9: deal with the buffer shrinking case, abort jbd2/fs when
> shrinking a buffer with write_io_error flag.
> Patch 10: just do some cleanup.
>
> After this patch set, we need to use above 7 wrapper functions to
> get/read metadata block instead of invoke sb_*() functions defined in
> fs/buffer.h.
>
> Test
> ====
>
> This patch set is based on linux-5.7-rc7 and has been tests by xfstests
> in auto mode.
>
> Thanks,
> Yi.
>
>
> zhangyi (F) (10):
> ext4: move inode eio simulation behind io completeion
> fs: pick out ll_rw_one_block() helper function
> ext4: add ext4_sb_getblk*() wrapper functions
> ext4: replace sb_getblk() with ext4_sb_getblk_locked()
> ext4: replace sb_bread*() with ext4_sb_bread*()
> ext4: replace sb_getblk() with ext4_sb_getblk()
> ext4: switch to use ext4_sb_getblk_locked() in ext4_getblk()
> ext4: replace sb_breadahead() with ext4_sb_breadahead()
> ext4: abort the filesystem while freeing the write error io buffer
> ext4: remove unused parameter in jbd2_journal_try_to_free_buffers()
>
> fs/buffer.c | 41 ++++++----
> fs/ext4/balloc.c | 6 +-
> fs/ext4/ext4.h | 60 ++++++++++++---
> fs/ext4/extents.c | 13 ++--
> fs/ext4/ialloc.c | 6 +-
> fs/ext4/indirect.c | 13 ++--
> fs/ext4/inline.c | 2 +-
> fs/ext4/inode.c | 53 +++++--------
> fs/ext4/mmp.c | 2 +-
> fs/ext4/resize.c | 24 +++---
> fs/ext4/super.c | 145 +++++++++++++++++++++++++++++++-----
> fs/ext4/xattr.c | 4 +-
> fs/jbd2/transaction.c | 20 +++--
> include/linux/buffer_head.h | 1 +
> include/linux/jbd2.h | 3 +-
> 15 files changed, 277 insertions(+), 116 deletions(-)
>
Hello Yi!
On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
> Background
> ==========
>
> This patch set point to fix the inconsistency problem which has been
> discussed and partial fixed in [1].
>
> Now, the problem is on the unstable storage which has a flaky transport
> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
> the bad network environment), if we failed to async write metadata in
> background, the end write routine in block layer will clear the buffer's
> uptodate flag, but the data in such buffer is actually uptodate. Finally
> we may read "old && inconsistent" metadata from the disk when we get the
> buffer later because not only the uptodate flag was cleared but also we
> do not check the write io error flag, or even worse the buffer has been
> freed due to memory presure.
>
> Fortunately, if the jbd2 do checkpoint after async IO error happens,
> the checkpoint routine will check the write_io_error flag and abort the
> the journal if detect IO error. And in the journal recover case, the
> recover code will invoke sync_blockdev() after recover complete, it will
> also detect IO error and refuse to mount the filesystem.
>
> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
> containing valid data"), but it's not enough.
Before we go and complicate ext4 code like this, I'd like to understand
what is the desired outcome which doesn't seem to be mentioned here, in the
commit 7963e5ac90125, or in the discussion you reference. If you have a
flaky transport that gives you IO errors, IMO it is not a bussiness of the
filesystem to try to fix that. I just has to make sure it properly reports
errors to userspace and (depending of errors= configuration) shuts itself
down to limit further damage. This patch seems to try to mask those errors
and that's, in my opinion, rather futile (as in you can hardly ever deal
with all the cases). BTW are you running these systems on flaky iSCSI with
errors=continue so that the errors don't shut the filesystem down
immediately?
Honza
> [1] https://lore.kernel.org/linux-ext4/[email protected]/
>
> Description
> ===========
>
> This patch set add and rework 7 wrapper functions of getting metadata
> blocks, replace all sb_bread() / sb_getblk*() / ext4_bread() and
> sb_breadahead*(). Add buffer_write_io_error() checking into them, if
> the buffer isn't uptodate and write_io_error flag was set, which means
> that the buffer has been failed to write out to disk, re-add the
> uptodate flag to prevent subsequent read operation.
>
> - ext4_sb_getblk(): works the same as sb_getblk(), use to replace all
> sb_getblk() used for newly allocated blocks and getting buffers.
> - ext4_sb_getblk_locked(): works the same as sb_getblk() except check &
> fix buffer uotpdate flag, use to replace all sb_getblk() used for
> getting buffers to read.
> - ext4_sb_getblk_gfp(): gfp version of ext4_sb_getblk().
> - ext4_sb_getblk_locked_gfp(): gfp version of ext4_sb_getblk_locked().
> - ext4_sb_bread(): get buffer and submit read bio if buffer is actually
> not uptodate.
> - ext4_sb_bread_unmovable(): unmovable version of ext4_sb_bread().
> - ext4_sb_breadahead_unmovable(): works the same to ext4_sb_bread_unmovable()
> except skip submit read bio if failed to lock the buffer.
>
> Patch 1-2: do some small change in ext4 inode eio simulation and add a
> helper in buffer.c, just prepare for below patches.
> Patch 3: add the ext4_sb_*() function to deal with the write_io_error
> flag in buffer.
> Patch 4-8: replace all sb_*() with ext4_sb_*() in ext4.
> Patch 9: deal with the buffer shrinking case, abort jbd2/fs when
> shrinking a buffer with write_io_error flag.
> Patch 10: just do some cleanup.
>
> After this patch set, we need to use above 7 wrapper functions to
> get/read metadata block instead of invoke sb_*() functions defined in
> fs/buffer.h.
>
> Test
> ====
>
> This patch set is based on linux-5.7-rc7 and has been tests by xfstests
> in auto mode.
>
> Thanks,
> Yi.
>
>
> zhangyi (F) (10):
> ext4: move inode eio simulation behind io completeion
> fs: pick out ll_rw_one_block() helper function
> ext4: add ext4_sb_getblk*() wrapper functions
> ext4: replace sb_getblk() with ext4_sb_getblk_locked()
> ext4: replace sb_bread*() with ext4_sb_bread*()
> ext4: replace sb_getblk() with ext4_sb_getblk()
> ext4: switch to use ext4_sb_getblk_locked() in ext4_getblk()
> ext4: replace sb_breadahead() with ext4_sb_breadahead()
> ext4: abort the filesystem while freeing the write error io buffer
> ext4: remove unused parameter in jbd2_journal_try_to_free_buffers()
>
> fs/buffer.c | 41 ++++++----
> fs/ext4/balloc.c | 6 +-
> fs/ext4/ext4.h | 60 ++++++++++++---
> fs/ext4/extents.c | 13 ++--
> fs/ext4/ialloc.c | 6 +-
> fs/ext4/indirect.c | 13 ++--
> fs/ext4/inline.c | 2 +-
> fs/ext4/inode.c | 53 +++++--------
> fs/ext4/mmp.c | 2 +-
> fs/ext4/resize.c | 24 +++---
> fs/ext4/super.c | 145 +++++++++++++++++++++++++++++++-----
> fs/ext4/xattr.c | 4 +-
> fs/jbd2/transaction.c | 20 +++--
> include/linux/buffer_head.h | 1 +
> include/linux/jbd2.h | 3 +-
> 15 files changed, 277 insertions(+), 116 deletions(-)
>
> --
> 2.21.3
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hi, Jan.
On 2020/6/8 16:20, Jan Kara wrote:
> Hello Yi!
>
> On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
>> Background
>> ==========
>>
>> This patch set point to fix the inconsistency problem which has been
>> discussed and partial fixed in [1].
>>
>> Now, the problem is on the unstable storage which has a flaky transport
>> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
>> the bad network environment), if we failed to async write metadata in
>> background, the end write routine in block layer will clear the buffer's
>> uptodate flag, but the data in such buffer is actually uptodate. Finally
>> we may read "old && inconsistent" metadata from the disk when we get the
>> buffer later because not only the uptodate flag was cleared but also we
>> do not check the write io error flag, or even worse the buffer has been
>> freed due to memory presure.
>>
>> Fortunately, if the jbd2 do checkpoint after async IO error happens,
>> the checkpoint routine will check the write_io_error flag and abort the
>> the journal if detect IO error. And in the journal recover case, the
>> recover code will invoke sync_blockdev() after recover complete, it will
>> also detect IO error and refuse to mount the filesystem.
>>
>> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
>> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
>> containing valid data"), but it's not enough.
>
> Before we go and complicate ext4 code like this, I'd like to understand
> what is the desired outcome which doesn't seem to be mentioned here, in the
> commit 7963e5ac90125, or in the discussion you reference. If you have a
> flaky transport that gives you IO errors, IMO it is not a bussiness of the
> filesystem to try to fix that. I just has to make sure it properly reports
If we meet some IO errors due to the flaky transport, IMO the desired outcome
is 1) report IO error; 2) ext4 filesystem act as the "errors=xxx" configuration
specified, if we set "errors=read-only || panic", we expect ext4 could remount
to read-only or panic immediately to avoid inconsistency. In brief, the kernel
should try best to guarantee the filesystem on disk is consistency, this will
reduce fsck's work (AFAIK, the fsck cannot fix those inconsistent in auto mode
for most cases caused by the async error problem I mentioned), so we could
recover the fs automatically in next boot.
But now, in the case of metadata async writeback, (1) is done in
end_buffer_async_write(), but (2) is not guaranteed, because ext4 cannot detect
metadata write error, and it also cannot remount the filesystem or invoke panic
immediately. Finally, if we read the metadata on disk and re-write again, it
may lead to on-disk filesystem inconsistency.
> errors to userspace and (depending of errors= configuration) shuts itself
> down to limit further damage. This patch seems to try to mask those errors
> and that's, in my opinion, rather futile (as in you can hardly ever deal
> with all the cases). BTW are you running these systems on flaky iSCSI with
> errors=continue so that the errors don't shut the filesystem down
> immediately?
>
Yes, I run ext4 filesystem on a flaky iSCSI(it is stable most of the time)
with errors=read-only, in the cases mentioned above, the fs will not be
remount to read-only immediately or remount after it has already been
inconsistency.
Thinking about how to fix this, one method is to invoke ext4_error() or
jbd2_journal_abort() when we detect write error to prevent further use of
the filesystem. But after looking at __ext4_get_inode_loc() and 7963e5ac90125,
I think although the metadata buffer was failed to write back to the disk due
to the occasional unstable network environment, but the data in the buffer
is actually uptodate, the filesystem could self-healing after the network
recovery. In the worst case, if the network is broken for a long time, the
jbd2's checkpoint routing will detect the error, or jbd2 will failed to write
the journal to disk, both will abort the filesystem. So I think we could
re-set the uptodate flag when we read buffer again as 7963e5ac90125 does.
Thanks,
Yi.
>
>> [1] https://lore.kernel.org/linux-ext4/[email protected]/
>>
>> Description
>> ===========
>>
>> This patch set add and rework 7 wrapper functions of getting metadata
>> blocks, replace all sb_bread() / sb_getblk*() / ext4_bread() and
>> sb_breadahead*(). Add buffer_write_io_error() checking into them, if
>> the buffer isn't uptodate and write_io_error flag was set, which means
>> that the buffer has been failed to write out to disk, re-add the
>> uptodate flag to prevent subsequent read operation.
>>
>> - ext4_sb_getblk(): works the same as sb_getblk(), use to replace all
>> sb_getblk() used for newly allocated blocks and getting buffers.
>> - ext4_sb_getblk_locked(): works the same as sb_getblk() except check &
>> fix buffer uotpdate flag, use to replace all sb_getblk() used for
>> getting buffers to read.
>> - ext4_sb_getblk_gfp(): gfp version of ext4_sb_getblk().
>> - ext4_sb_getblk_locked_gfp(): gfp version of ext4_sb_getblk_locked().
>> - ext4_sb_bread(): get buffer and submit read bio if buffer is actually
>> not uptodate.
>> - ext4_sb_bread_unmovable(): unmovable version of ext4_sb_bread().
>> - ext4_sb_breadahead_unmovable(): works the same to ext4_sb_bread_unmovable()
>> except skip submit read bio if failed to lock the buffer.
>>
>> Patch 1-2: do some small change in ext4 inode eio simulation and add a
>> helper in buffer.c, just prepare for below patches.
>> Patch 3: add the ext4_sb_*() function to deal with the write_io_error
>> flag in buffer.
>> Patch 4-8: replace all sb_*() with ext4_sb_*() in ext4.
>> Patch 9: deal with the buffer shrinking case, abort jbd2/fs when
>> shrinking a buffer with write_io_error flag.
>> Patch 10: just do some cleanup.
>>
>> After this patch set, we need to use above 7 wrapper functions to
>> get/read metadata block instead of invoke sb_*() functions defined in
>> fs/buffer.h.
>>
>> Test
>> ====
>>
>> This patch set is based on linux-5.7-rc7 and has been tests by xfstests
>> in auto mode.
>>
>> Thanks,
>> Yi.
>>
>>
>> zhangyi (F) (10):
>> ext4: move inode eio simulation behind io completeion
>> fs: pick out ll_rw_one_block() helper function
>> ext4: add ext4_sb_getblk*() wrapper functions
>> ext4: replace sb_getblk() with ext4_sb_getblk_locked()
>> ext4: replace sb_bread*() with ext4_sb_bread*()
>> ext4: replace sb_getblk() with ext4_sb_getblk()
>> ext4: switch to use ext4_sb_getblk_locked() in ext4_getblk()
>> ext4: replace sb_breadahead() with ext4_sb_breadahead()
>> ext4: abort the filesystem while freeing the write error io buffer
>> ext4: remove unused parameter in jbd2_journal_try_to_free_buffers()
>>
>> fs/buffer.c | 41 ++++++----
>> fs/ext4/balloc.c | 6 +-
>> fs/ext4/ext4.h | 60 ++++++++++++---
>> fs/ext4/extents.c | 13 ++--
>> fs/ext4/ialloc.c | 6 +-
>> fs/ext4/indirect.c | 13 ++--
>> fs/ext4/inline.c | 2 +-
>> fs/ext4/inode.c | 53 +++++--------
>> fs/ext4/mmp.c | 2 +-
>> fs/ext4/resize.c | 24 +++---
>> fs/ext4/super.c | 145 +++++++++++++++++++++++++++++++-----
>> fs/ext4/xattr.c | 4 +-
>> fs/jbd2/transaction.c | 20 +++--
>> include/linux/buffer_head.h | 1 +
>> include/linux/jbd2.h | 3 +-
>> 15 files changed, 277 insertions(+), 116 deletions(-)
>>
>> --
>> 2.21.3
>>
Hi Yi!
On Mon 08-06-20 22:39:31, zhangyi (F) wrote:
> > On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
> >> Background
> >> ==========
> >>
> >> This patch set point to fix the inconsistency problem which has been
> >> discussed and partial fixed in [1].
> >>
> >> Now, the problem is on the unstable storage which has a flaky transport
> >> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
> >> the bad network environment), if we failed to async write metadata in
> >> background, the end write routine in block layer will clear the buffer's
> >> uptodate flag, but the data in such buffer is actually uptodate. Finally
> >> we may read "old && inconsistent" metadata from the disk when we get the
> >> buffer later because not only the uptodate flag was cleared but also we
> >> do not check the write io error flag, or even worse the buffer has been
> >> freed due to memory presure.
> >>
> >> Fortunately, if the jbd2 do checkpoint after async IO error happens,
> >> the checkpoint routine will check the write_io_error flag and abort the
> >> the journal if detect IO error. And in the journal recover case, the
> >> recover code will invoke sync_blockdev() after recover complete, it will
> >> also detect IO error and refuse to mount the filesystem.
> >>
> >> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
> >> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
> >> containing valid data"), but it's not enough.
> >
> > Before we go and complicate ext4 code like this, I'd like to understand
> > what is the desired outcome which doesn't seem to be mentioned here, in the
> > commit 7963e5ac90125, or in the discussion you reference. If you have a
> > flaky transport that gives you IO errors, IMO it is not a bussiness of the
> > filesystem to try to fix that. I just has to make sure it properly reports
>
> If we meet some IO errors due to the flaky transport, IMO the desired outcome
> is 1) report IO error; 2) ext4 filesystem act as the "errors=xxx" configuration
> specified, if we set "errors=read-only || panic", we expect ext4 could remount
> to read-only or panic immediately to avoid inconsistency. In brief, the kernel
> should try best to guarantee the filesystem on disk is consistency, this will
> reduce fsck's work (AFAIK, the fsck cannot fix those inconsistent in auto mode
> for most cases caused by the async error problem I mentioned), so we could
> recover the fs automatically in next boot.
Good, so I fully agree with your goals. Let's now talk about how to achieve
them :)
> But now, in the case of metadata async writeback, (1) is done in
> end_buffer_async_write(), but (2) is not guaranteed, because ext4 cannot detect
> metadata write error, and it also cannot remount the filesystem or invoke panic
> immediately. Finally, if we read the metadata on disk and re-write again, it
> may lead to on-disk filesystem inconsistency.
Ah, I see. This was the important bit I was missing. And I think the
real problem here is that ext4 cannot detect metadata write error from
async writeback. So my plan would be to detect metadata write errors early
and abort the journal and do appropriate errors=xxx handling. And a
relatively simple way how to do that these days would be to use errseq in
the block device's mapping - sb->s_bdev->bd_inode->i_mapping->wb_err - that
gets incremented whenever there's writeback error in the block device
mapping so (probably in ext4_journal_check_start()) we could check whether
wb_err is different from the original value we sampled at mount time an if
yes, we know metadata writeback error has happened and we trigger the error
handling. What do you think?
> > errors to userspace and (depending of errors= configuration) shuts itself
> > down to limit further damage. This patch seems to try to mask those errors
> > and that's, in my opinion, rather futile (as in you can hardly ever deal
> > with all the cases). BTW are you running these systems on flaky iSCSI with
> > errors=continue so that the errors don't shut the filesystem down
> > immediately?
> >
> Yes, I run ext4 filesystem on a flaky iSCSI(it is stable most of the time)
> with errors=read-only, in the cases mentioned above, the fs will not be
> remount to read-only immediately or remount after it has already been
> inconsistency.
>
> Thinking about how to fix this, one method is to invoke ext4_error() or
> jbd2_journal_abort() when we detect write error to prevent further use of
> the filesystem. But after looking at __ext4_get_inode_loc() and 7963e5ac90125,
> I think although the metadata buffer was failed to write back to the disk due
> to the occasional unstable network environment, but the data in the buffer
> is actually uptodate, the filesystem could self-healing after the network
> recovery. In the worst case, if the network is broken for a long time, the
> jbd2's checkpoint routing will detect the error, or jbd2 will failed to write
> the journal to disk, both will abort the filesystem. So I think we could
> re-set the uptodate flag when we read buffer again as 7963e5ac90125 does.
Yeah, but I'm actually against such self-healing logic. IMHO it is too
fragile and also fairly intrusive as your patches show. If we wanted
something like this, we'd need to put a hard thought into whether a
functionality like this belongs to ext4 or to some layer below it (e.g.
think how multipath handles temporary path failures). And even if we
decided it's worth the trouble in the filesystem, I'd rather go and change
how fs/buffer.c deals with buffer writeback errors than resetting uptodate
bits on buffers which just seems dangerous to me...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hi, Jan.
On 2020/6/9 20:19, Jan Kara wrote> On Mon 08-06-20 22:39:31, zhangyi (F) wrote:
>>> On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
>>>> Background
>>>> ==========
>>>>
>>>> This patch set point to fix the inconsistency problem which has been
>>>> discussed and partial fixed in [1].
>>>>
>>>> Now, the problem is on the unstable storage which has a flaky transport
>>>> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
>>>> the bad network environment), if we failed to async write metadata in
>>>> background, the end write routine in block layer will clear the buffer's
>>>> uptodate flag, but the data in such buffer is actually uptodate. Finally
>>>> we may read "old && inconsistent" metadata from the disk when we get the
>>>> buffer later because not only the uptodate flag was cleared but also we
>>>> do not check the write io error flag, or even worse the buffer has been
>>>> freed due to memory presure.
>>>>
>>>> Fortunately, if the jbd2 do checkpoint after async IO error happens,
>>>> the checkpoint routine will check the write_io_error flag and abort the
>>>> the journal if detect IO error. And in the journal recover case, the
>>>> recover code will invoke sync_blockdev() after recover complete, it will
>>>> also detect IO error and refuse to mount the filesystem.
>>>>
>>>> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
>>>> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
>>>> containing valid data"), but it's not enough.
>>>
>>> Before we go and complicate ext4 code like this, I'd like to understand
>>> what is the desired outcome which doesn't seem to be mentioned here, in the
>>> commit 7963e5ac90125, or in the discussion you reference. If you have a
>>> flaky transport that gives you IO errors, IMO it is not a bussiness of the
>>> filesystem to try to fix that. I just has to make sure it properly reports
>>
>> If we meet some IO errors due to the flaky transport, IMO the desired outcome
>> is 1) report IO error; 2) ext4 filesystem act as the "errors=xxx" configuration
>> specified, if we set "errors=read-only || panic", we expect ext4 could remount
>> to read-only or panic immediately to avoid inconsistency. In brief, the kernel
>> should try best to guarantee the filesystem on disk is consistency, this will
>> reduce fsck's work (AFAIK, the fsck cannot fix those inconsistent in auto mode
>> for most cases caused by the async error problem I mentioned), so we could
>> recover the fs automatically in next boot.
>
> Good, so I fully agree with your goals. Let's now talk about how to achieve
> them :)
>
>> But now, in the case of metadata async writeback, (1) is done in
>> end_buffer_async_write(), but (2) is not guaranteed, because ext4 cannot detect
>> metadata write error, and it also cannot remount the filesystem or invoke panic
>> immediately. Finally, if we read the metadata on disk and re-write again, it
>> may lead to on-disk filesystem inconsistency.
>
> Ah, I see. This was the important bit I was missing. And I think the
> real problem here is that ext4 cannot detect metadata write error from
> async writeback. So my plan would be to detect metadata write errors early
> and abort the journal and do appropriate errors=xxx handling. And a
> relatively simple way how to do that these days would be to use errseq in
> the block device's mapping - sb->s_bdev->bd_inode->i_mapping->wb_err - that
> gets incremented whenever there's writeback error in the block device
> mapping so (probably in ext4_journal_check_start()) we could check whether
> wb_err is different from the original value we sampled at mount time an if
> yes, we know metadata writeback error has happened and we trigger the error
> handling. What do you think?
>
Thanks a lot for your suggestion, this solution looks good to me. But Ithink
add 'wb_err' checking into ext4_journal_check_start() maybe too early, see below
race condition (It's just theoretical analysis, I'm not test it):
ext4_journal_start()
ext4_journal_check_start() <-- pass checking
| end_buffer_async_write()
| mark_buffer_write_io_error() <-- set b_page
sb_getblk(bh) <-- read old data from disk
ext4_journal_get_write_access(bh)
modify this bh <-- modify data and lead to inconsistency
ext4_handle_dirty_metadata(bh)
So I guess it may still lead to inconsistency. How about add this checking
into ext4_journal_get_write_access() ?
>>> errors to userspace and (depending of errors= configuration) shuts itself
>>> down to limit further damage. This patch seems to try to mask those errors
>>> and that's, in my opinion, rather futile (as in you can hardly ever deal
>>> with all the cases). BTW are you running these systems on flaky iSCSI with
>>> errors=continue so that the errors don't shut the filesystem down
>>> immediately?
>>>
>> Yes, I run ext4 filesystem on a flaky iSCSI(it is stable most of the time)
>> with errors=read-only, in the cases mentioned above, the fs will not be
>> remount to read-only immediately or remount after it has already been
>> inconsistency.
>>
>> Thinking about how to fix this, one method is to invoke ext4_error() or
>> jbd2_journal_abort() when we detect write error to prevent further use of
>> the filesystem. But after looking at __ext4_get_inode_loc() and 7963e5ac90125,
>> I think although the metadata buffer was failed to write back to the disk due
>> to the occasional unstable network environment, but the data in the buffer
>> is actually uptodate, the filesystem could self-healing after the network
>> recovery. In the worst case, if the network is broken for a long time, the
>> jbd2's checkpoint routing will detect the error, or jbd2 will failed to write
>> the journal to disk, both will abort the filesystem. So I think we could
>> re-set the uptodate flag when we read buffer again as 7963e5ac90125 does.
>
> Yeah, but I'm actually against such self-healing logic. IMHO it is too
> fragile and also fairly intrusive as your patches show. If we wanted
> something like this, we'd need to put a hard thought into whether a
> functionality like this belongs to ext4 or to some layer below it (e.g.
> think how multipath handles temporary path failures). And even if we
> decided it's worth the trouble in the filesystem, I'd rather go and change
> how fs/buffer.c deals with buffer writeback errors than resetting uptodate
> bits on buffers which just seems dangerous to me...
>
Yeah, I see. Invoke error handlers as soon as we detect error flag could
minimize the risk.
Thanks,
Yi.
On Wed 10-06-20 16:55:15, zhangyi (F) wrote:
> Hi, Jan.
>
> On 2020/6/9 20:19, Jan Kara wrote> On Mon 08-06-20 22:39:31, zhangyi (F) wrote:
> >>> On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
> >>>> Background
> >>>> ==========
> >>>>
> >>>> This patch set point to fix the inconsistency problem which has been
> >>>> discussed and partial fixed in [1].
> >>>>
> >>>> Now, the problem is on the unstable storage which has a flaky transport
> >>>> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
> >>>> the bad network environment), if we failed to async write metadata in
> >>>> background, the end write routine in block layer will clear the buffer's
> >>>> uptodate flag, but the data in such buffer is actually uptodate. Finally
> >>>> we may read "old && inconsistent" metadata from the disk when we get the
> >>>> buffer later because not only the uptodate flag was cleared but also we
> >>>> do not check the write io error flag, or even worse the buffer has been
> >>>> freed due to memory presure.
> >>>>
> >>>> Fortunately, if the jbd2 do checkpoint after async IO error happens,
> >>>> the checkpoint routine will check the write_io_error flag and abort the
> >>>> the journal if detect IO error. And in the journal recover case, the
> >>>> recover code will invoke sync_blockdev() after recover complete, it will
> >>>> also detect IO error and refuse to mount the filesystem.
> >>>>
> >>>> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
> >>>> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
> >>>> containing valid data"), but it's not enough.
> >>>
> >>> Before we go and complicate ext4 code like this, I'd like to understand
> >>> what is the desired outcome which doesn't seem to be mentioned here, in the
> >>> commit 7963e5ac90125, or in the discussion you reference. If you have a
> >>> flaky transport that gives you IO errors, IMO it is not a bussiness of the
> >>> filesystem to try to fix that. I just has to make sure it properly reports
> >>
> >> If we meet some IO errors due to the flaky transport, IMO the desired outcome
> >> is 1) report IO error; 2) ext4 filesystem act as the "errors=xxx" configuration
> >> specified, if we set "errors=read-only || panic", we expect ext4 could remount
> >> to read-only or panic immediately to avoid inconsistency. In brief, the kernel
> >> should try best to guarantee the filesystem on disk is consistency, this will
> >> reduce fsck's work (AFAIK, the fsck cannot fix those inconsistent in auto mode
> >> for most cases caused by the async error problem I mentioned), so we could
> >> recover the fs automatically in next boot.
> >
> > Good, so I fully agree with your goals. Let's now talk about how to achieve
> > them :)
> >
> >> But now, in the case of metadata async writeback, (1) is done in
> >> end_buffer_async_write(), but (2) is not guaranteed, because ext4 cannot detect
> >> metadata write error, and it also cannot remount the filesystem or invoke panic
> >> immediately. Finally, if we read the metadata on disk and re-write again, it
> >> may lead to on-disk filesystem inconsistency.
> >
> > Ah, I see. This was the important bit I was missing. And I think the
> > real problem here is that ext4 cannot detect metadata write error from
> > async writeback. So my plan would be to detect metadata write errors early
> > and abort the journal and do appropriate errors=xxx handling. And a
> > relatively simple way how to do that these days would be to use errseq in
> > the block device's mapping - sb->s_bdev->bd_inode->i_mapping->wb_err - that
> > gets incremented whenever there's writeback error in the block device
> > mapping so (probably in ext4_journal_check_start()) we could check whether
> > wb_err is different from the original value we sampled at mount time an if
> > yes, we know metadata writeback error has happened and we trigger the error
> > handling. What do you think?
> >
>
> Thanks a lot for your suggestion, this solution looks good to me. But Ithink
> add 'wb_err' checking into ext4_journal_check_start() maybe too early, see below
> race condition (It's just theoretical analysis, I'm not test it):
>
> ext4_journal_start()
> ext4_journal_check_start() <-- pass checking
> | end_buffer_async_write()
> | mark_buffer_write_io_error() <-- set b_page
> sb_getblk(bh) <-- read old data from disk
> ext4_journal_get_write_access(bh)
> modify this bh <-- modify data and lead to inconsistency
> ext4_handle_dirty_metadata(bh)
>
> So I guess it may still lead to inconsistency. How about add this checking
> into ext4_journal_get_write_access() ?
Yes, this also occured to me later. Adding the check to
ext4_journal_get_write_access() should be safer.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Jun 10, 2020 at 11:57:39AM +0200, Jan Kara wrote:
> > So I guess it may still lead to inconsistency. How about add this checking
> > into ext4_journal_get_write_access() ?
>
> Yes, this also occured to me later. Adding the check to
> ext4_journal_get_write_access() should be safer.
There's another thing which we could do. One of the issues is that we
allow buffered writeback for block devices once the change to the
block has been committed. What if we add a change to block device
writeback code and in fs/buffer.c so that optionally, the file system
can specify a callback function can get called when an I/O error has
been reflected back up from the block layer?
It seems unfortunate that currently, we can immediately report the I/O
error for buffered writes to *files*, but for metadata blocks, we
would only be able to report the problem when we next try to modify
it.
Making changes to fs/buffer.c might be controversial, but I think it
might be result in a better solution.
- Ted
On Wed 10-06-20 11:45:43, Theodore Y. Ts'o wrote:
> On Wed, Jun 10, 2020 at 11:57:39AM +0200, Jan Kara wrote:
> > > So I guess it may still lead to inconsistency. How about add this checking
> > > into ext4_journal_get_write_access() ?
> >
> > Yes, this also occured to me later. Adding the check to
> > ext4_journal_get_write_access() should be safer.
>
> There's another thing which we could do. One of the issues is that we
> allow buffered writeback for block devices once the change to the
> block has been committed. What if we add a change to block device
> writeback code and in fs/buffer.c so that optionally, the file system
> can specify a callback function can get called when an I/O error has
> been reflected back up from the block layer?
>
> It seems unfortunate that currently, we can immediately report the I/O
> error for buffered writes to *files*, but for metadata blocks, we
> would only be able to report the problem when we next try to modify
> it.
>
> Making changes to fs/buffer.c might be controversial, but I think it
> might be result in a better solution.
Yeah, what you propose certainly makes sence could be relatively easily
done by blkdev_writepage() using __block_write_full_page() with appropriate
endio handler which calls fs callback. I'm just not sure how propagate the
callback function from the fs to the blkdev...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 2020/6/11 0:27, Jan Kara wrote:
> On Wed 10-06-20 11:45:43, Theodore Y. Ts'o wrote:
>> On Wed, Jun 10, 2020 at 11:57:39AM +0200, Jan Kara wrote:
>>>> So I guess it may still lead to inconsistency. How about add this checking
>>>> into ext4_journal_get_write_access() ?
>>>
>>> Yes, this also occured to me later. Adding the check to
>>> ext4_journal_get_write_access() should be safer.
>>
>> There's another thing which we could do. One of the issues is that we
>> allow buffered writeback for block devices once the change to the
>> block has been committed. What if we add a change to block device
>> writeback code and in fs/buffer.c so that optionally, the file system
>> can specify a callback function can get called when an I/O error has
>> been reflected back up from the block layer?
>>
>> It seems unfortunate that currently, we can immediately report the I/O
>> error for buffered writes to *files*, but for metadata blocks, we
>> would only be able to report the problem when we next try to modify
>> it.
>>
>> Making changes to fs/buffer.c might be controversial, but I think it
>> might be result in a better solution.
>
> Yeah, what you propose certainly makes sence could be relatively easily
> done by blkdev_writepage() using __block_write_full_page() with appropriate
> endio handler which calls fs callback. I'm just not sure how propagate the
> callback function from the fs to the blkdev...
>
I have thought about this solution, we could add a hook in 'struct super_operations'
and call it in blkdev_writepage() like blkdev_releasepage() does, and pick out a
wrapper from block_write_full_page() to pass our endio handler in, something like
this.
static const struct super_operations ext4_sops = {
...
.bdev_write_page = ext4_bdev_write_page,
...
};
static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
{
struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
if (super && super->s_op->bdev_write_page)
return super->s_op->bdev_write_page(page, blkdev_get_block, wbc);
return block_write_full_page(page, blkdev_get_block, wbc);
}
But I'm not sure it's a optimal ieda. So I continue to realize the "wb_err"
solution now ?
Thanks,
Yi.
On Thu 11-06-20 10:12:45, zhangyi (F) wrote:
> On 2020/6/11 0:27, Jan Kara wrote:
> > On Wed 10-06-20 11:45:43, Theodore Y. Ts'o wrote:
> >> On Wed, Jun 10, 2020 at 11:57:39AM +0200, Jan Kara wrote:
> >>>> So I guess it may still lead to inconsistency. How about add this checking
> >>>> into ext4_journal_get_write_access() ?
> >>>
> >>> Yes, this also occured to me later. Adding the check to
> >>> ext4_journal_get_write_access() should be safer.
> >>
> >> There's another thing which we could do. One of the issues is that we
> >> allow buffered writeback for block devices once the change to the
> >> block has been committed. What if we add a change to block device
> >> writeback code and in fs/buffer.c so that optionally, the file system
> >> can specify a callback function can get called when an I/O error has
> >> been reflected back up from the block layer?
> >>
> >> It seems unfortunate that currently, we can immediately report the I/O
> >> error for buffered writes to *files*, but for metadata blocks, we
> >> would only be able to report the problem when we next try to modify
> >> it.
> >>
> >> Making changes to fs/buffer.c might be controversial, but I think it
> >> might be result in a better solution.
> >
> > Yeah, what you propose certainly makes sence could be relatively easily
> > done by blkdev_writepage() using __block_write_full_page() with appropriate
> > endio handler which calls fs callback. I'm just not sure how propagate the
> > callback function from the fs to the blkdev...
> >
>
> I have thought about this solution, we could add a hook in 'struct super_operations'
> and call it in blkdev_writepage() like blkdev_releasepage() does, and pick out a
> wrapper from block_write_full_page() to pass our endio handler in, something like
> this.
>
> static const struct super_operations ext4_sops = {
> ...
> .bdev_write_page = ext4_bdev_write_page,
> ...
> };
>
> static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
> {
> struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
>
> if (super && super->s_op->bdev_write_page)
> return super->s_op->bdev_write_page(page, blkdev_get_block, wbc);
>
> return block_write_full_page(page, blkdev_get_block, wbc);
> }
>
> But I'm not sure it's a optimal ieda. So I continue to realize the "wb_err"
> solution now ?
The above idea looks good to me. I'm fine with either that solution or
"wb_err" idea so maybe let's leave it for Ted to decide...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, Jun 11, 2020 at 10:21:03AM +0200, Jan Kara wrote:
> > I have thought about this solution, we could add a hook in 'struct super_operations'
> > and call it in blkdev_writepage() like blkdev_releasepage() does, and pick out a
> > wrapper from block_write_full_page() to pass our endio handler in, something like
> > this.
> >
> > static const struct super_operations ext4_sops = {
> > ...
> > .bdev_write_page = ext4_bdev_write_page,
> > ...
> > };
> >
> > static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
> > {
> > struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
> >
> > if (super && super->s_op->bdev_write_page)
> > return super->s_op->bdev_write_page(page, blkdev_get_block, wbc);
> >
> > return block_write_full_page(page, blkdev_get_block, wbc);
> > }
> >
> > But I'm not sure it's a optimal ieda. So I continue to realize the "wb_err"
> > solution now ?
>
> The above idea looks good to me. I'm fine with either that solution or
> "wb_err" idea so maybe let's leave it for Ted to decide...
My preference would be to be able to get the (error from the callback
right away. My reasoning behind that is (a) it allows the file system
to be notified about the problem right away, (b) in the case of a file
system resize, we _really_ want to know about the failure ASAP, so we
can fail the resize before we start allocating inodes and blocks to
use the new space, and (c) over time, we might be able to add some
more intelligence handling of some write errors.
For example, we already have a way of handling CRC errors when we are
reading an allocation bitmap; we simply avoid allocating blocks and
inodes from that blockgroup. Over time, we could theoretically do
other things to try to recover from some write errors --- for example,
we could try allocating a new block for an extent tree block, and try
writing it, and if that succeeds, updating its parent node to point at
the new location. Is it worth it to try to add that kind of
complexity? I'm really not sure; at the end of the day, it might be
simpler to just call ext4_error() and abort using the entire file
system until a system administrator can sort out the mess. But I
think (a) and (b) are still reasons for doing this by intercepting the
writeback error from the buffer head.
Cheers,
- Ted
On 2020/6/12 0:55, Theodore Y. Ts'o wrote:
> On Thu, Jun 11, 2020 at 10:21:03AM +0200, Jan Kara wrote:
>>> I have thought about this solution, we could add a hook in 'struct super_operations'
>>> and call it in blkdev_writepage() like blkdev_releasepage() does, and pick out a
>>> wrapper from block_write_full_page() to pass our endio handler in, something like
>>> this.
>>>
>>> static const struct super_operations ext4_sops = {
>>> ...
>>> .bdev_write_page = ext4_bdev_write_page,
>>> ...
>>> };
>>>
>>> static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
>>> {
>>> struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
>>>
>>> if (super && super->s_op->bdev_write_page)
>>> return super->s_op->bdev_write_page(page, blkdev_get_block, wbc);
>>>
>>> return block_write_full_page(page, blkdev_get_block, wbc);
>>> }
>>>
>>> But I'm not sure it's a optimal ieda. So I continue to realize the "wb_err"
>>> solution now ?
>>
>> The above idea looks good to me. I'm fine with either that solution or
>> "wb_err" idea so maybe let's leave it for Ted to decide...
>
> My preference would be to be able to get the (error from the callback
> right away. My reasoning behind that is (a) it allows the file system
> to be notified about the problem right away, (b) in the case of a file
> system resize, we _really_ want to know about the failure ASAP, so we
> can fail the resize before we start allocating inodes and blocks to
> use the new space, and (c) over time, we might be able to add some
> more intelligence handling of some write errors.
>
> For example, we already have a way of handling CRC errors when we are
> reading an allocation bitmap; we simply avoid allocating blocks and
> inodes from that blockgroup. Over time, we could theoretically do
> other things to try to recover from some write errors --- for example,
> we could try allocating a new block for an extent tree block, and try
> writing it, and if that succeeds, updating its parent node to point at
> the new location. Is it worth it to try to add that kind of
> complexity? I'm really not sure; at the end of the day, it might be
> simpler to just call ext4_error() and abort using the entire file
> system until a system administrator can sort out the mess. But I
> think (a) and (b) are still reasons for doing this by intercepting the
> writeback error from the buffer head.
>
Yeah, it make sense to me, I will realize this callback solution.
Thanks,
Yi.