I got some errors when I repair an ext4 volume which stacked by an
iscsi target:
Entry 'test60' in / (2) has deleted/unused inode 73750. Clear?
It can be reproduced when the network not good enough.
When I debug this I found ext4 will read entry buffer from disk and
the buffer is marked with write_io_error.
If the buffer is marked with write_io_error, it means it already
wroten to journal, and not checked out to disk. IOW, the journal
is newer than the data in disk.
If this journal record 'delete test60', it means the 'test60' still
on the disk metadata.
In this case, if we read the buffer from disk successfully and create
file continue, the new journal record will overwrite the journal
which record 'delete test60', then the entry corruptioned.
So, use the buffer rather than read from disk if the buffer marked
with write_io_error
Signed-off-by: Zhang Xiaoxu <[email protected]>
---
fs/ext4/ext4.h | 13 +++++++++++++
fs/ext4/inode.c | 4 ++--
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1cb6785..5ebb36d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3301,6 +3301,19 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
extern const struct iomap_ops ext4_iomap_ops;
+static inline int ext4_buffer_uptodate(struct buffer_head *bh)
+{
+ /*
+ * If the buffer has the write error flag, we have failed
+ * to write out data in the block. In this case, we don't
+ * have to read the block because we may read the old data
+ * successfully.
+ */
+ if (!buffer_uptodate(bh) && buffer_write_io_error(bh))
+ set_buffer_uptodate(bh);
+ return buffer_uptodate(bh);
+}
+
#endif /* __KERNEL__ */
#define EFSBADCRC EBADMSG /* Bad CRC detected */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 82298c6..3546388 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1018,7 +1018,7 @@ struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode,
bh = ext4_getblk(handle, inode, block, map_flags);
if (IS_ERR(bh))
return bh;
- if (!bh || buffer_uptodate(bh))
+ if (!bh || ext4_buffer_uptodate(bh))
return bh;
ll_rw_block(REQ_OP_READ, REQ_META | REQ_PRIO, 1, &bh);
wait_on_buffer(bh);
@@ -1045,7 +1045,7 @@ int ext4_bread_batch(struct inode *inode, ext4_lblk_t block, int bh_count,
for (i = 0; i < bh_count; i++)
/* Note that NULL bhs[i] is valid because of holes. */
- if (bhs[i] && !buffer_uptodate(bhs[i]))
+ if (bhs[i] && !ext4_buffer_uptodate(bhs[i]))
ll_rw_block(REQ_OP_READ, REQ_META | REQ_PRIO, 1,
&bhs[i]);
--
2.7.4
On Tue, May 14, 2019 at 12:23:37PM +0800, ZhangXiaoxu wrote:
> I got some errors when I repair an ext4 volume which stacked by an
> iscsi target:
> Entry 'test60' in / (2) has deleted/unused inode 73750. Clear?
> It can be reproduced when the network not good enough.
>
> When I debug this I found ext4 will read entry buffer from disk and
> the buffer is marked with write_io_error.
>
> If the buffer is marked with write_io_error, it means it already
> wroten to journal, and not checked out to disk. IOW, the journal
> is newer than the data in disk.
> If this journal record 'delete test60', it means the 'test60' still
> on the disk metadata.
>
> In this case, if we read the buffer from disk successfully and create
> file continue, the new journal record will overwrite the journal
> which record 'delete test60', then the entry corruptioned.
>
> So, use the buffer rather than read from disk if the buffer marked
> with write_io_error
You've raised a number of issues about how we handle write errors,
especially when they occur due to a flaky transport --- in your case,
due to iSCSI. As such, your patch isn't wrong, so much as it is
incomplete.
For example, your assumption that if the buffer is marked
write_io_error, it's safe to clear write_io_error and reset
buffer_uptodate assumes that journalling is enabled. If the file
system does not have the journal, there is no journal to fall back
upon. For file systems which do have a journal, if you are using a
flaky iSCSI transport, there is no protection from write errors which
occur when the journal is replayed. (fs/jbd2/recovery.c simply marks
the buffer dirty and allows the writeback code take care of writing
the buffer.) This means that the buffer could have write_io_error set
due to a failure to write the buffer during recovery, in which case
relying on the journal having a uptodate copy block is invalid.
Also, this patch only patches the ex4_bread() path, which is only used
by directories. It doesn't deal with metadata reads for allocation
bitmaps or extent tree blocks. We are doing this hack for inode table
blocks, already; perhaps you got the idea to do this for ext4_bread()
from __ext4_get_inode_loc()?
We could add some kind of callback from the buffer cache layer when an
aysnchronous writeback fails --- or we could use a synchronous write
in the journal recovery code (which would be bad from a performance
perspective, but ignore that for the moment) --- however, what do we
do when we discover that there is an error? Right now, we do nothing
until we try to read the inode table block (and after your patch,
reading a directory block). Under memory pressure, though, the data
will get lost and we don't even mark the file system as needing to be
checked. We could retry the write, but if it's due to a flaky iSCSI
or FC transport, this write could fail yet again --- and then what?
So while I could apply this patch, since it doesn't make things worse,
I want to make sure you are aware that if you have problems with your
iSCSI device, this patch is far from a complete solution. At the very
least, we should handle reads for other metadata block.
- Ted
I've applied this patch with the modified patch summary:
ext4: treat buffers with write errors as containing valid data
Cheers,
- Ted
On Fri, May 17, 2019 at 06:59:40PM -0400, Theodore Ts'o wrote:
> On Tue, May 14, 2019 at 12:23:37PM +0800, ZhangXiaoxu wrote:
> > I got some errors when I repair an ext4 volume which stacked by an
> > iscsi target:
> > Entry 'test60' in / (2) has deleted/unused inode 73750. Clear?
> > It can be reproduced when the network not good enough.
> >
> > When I debug this I found ext4 will read entry buffer from disk and
> > the buffer is marked with write_io_error.
> >
> > If the buffer is marked with write_io_error, it means it already
> > wroten to journal, and not checked out to disk. IOW, the journal
> > is newer than the data in disk.
> > If this journal record 'delete test60', it means the 'test60' still
> > on the disk metadata.
> >
> > In this case, if we read the buffer from disk successfully and create
> > file continue, the new journal record will overwrite the journal
> > which record 'delete test60', then the entry corruptioned.
> >
> > So, use the buffer rather than read from disk if the buffer marked
> > with write_io_error
>
> You've raised a number of issues about how we handle write errors,
> especially when they occur due to a flaky transport --- in your case,
> due to iSCSI. As such, your patch isn't wrong, so much as it is
> incomplete.
>
> For example, your assumption that if the buffer is marked
> write_io_error, it's safe to clear write_io_error and reset
> buffer_uptodate assumes that journalling is enabled. If the file
> system does not have the journal, there is no journal to fall back
> upon. For file systems which do have a journal, if you are using a
> flaky iSCSI transport, there is no protection from write errors which
> occur when the journal is replayed. (fs/jbd2/recovery.c simply marks
> the buffer dirty and allows the writeback code take care of writing
> the buffer.) This means that the buffer could have write_io_error set
> due to a failure to write the buffer during recovery, in which case
> relying on the journal having a uptodate copy block is invalid.
>
> Also, this patch only patches the ex4_bread() path, which is only used
> by directories. It doesn't deal with metadata reads for allocation
> bitmaps or extent tree blocks. We are doing this hack for inode table
> blocks, already; perhaps you got the idea to do this for ext4_bread()
> from __ext4_get_inode_loc()?
>
> We could add some kind of callback from the buffer cache layer when an
> aysnchronous writeback fails --- or we could use a synchronous write
> in the journal recovery code (which would be bad from a performance
> perspective, but ignore that for the moment) --- however, what do we
> do when we discover that there is an error? Right now, we do nothing
> until we try to read the inode table block (and after your patch,
> reading a directory block). Under memory pressure, though, the data
> will get lost and we don't even mark the file system as needing to be
> checked. We could retry the write, but if it's due to a flaky iSCSI
> or FC transport, this write could fail yet again --- and then what?
>
> So while I could apply this patch, since it doesn't make things worse,
> I want to make sure you are aware that if you have problems with your
> iSCSI device, this patch is far from a complete solution. At the very
> least, we should handle reads for other metadata block.
>
> - Ted