We have caught a data corruption problem on ext4 while truncating
an extent index block. Imagine that we are revoking a buffer which
has been journaled by the committing transaction: the buffer's jbddirty
flag will not be cleared in jbd2_journal_forget(), so the commit code
will set the buffer dirty flag again after refiling the buffer.

fsx                               kjournald2
                                  jbd2_journal_commit_transaction
jbd2_journal_revoke                commit phase 1~5...
 jbd2_journal_forget
   belongs to older transaction    commit phase 6
   jbddirty not clear               __jbd2_journal_refile_buffer
                                     __jbd2_journal_unfile_buffer
                                      test_clear_buffer_jbddirty
                                       mark_buffer_dirty

Finally, if the freed extent index block is allocated again as a data
block by some other file, it may corrupt the file data when cached
pages are written back later, such as at umount time.

This patch marks the buffer as freed and sets b_next_transaction to the
new transaction when the buffer already belongs to the committing
transaction in jbd2_journal_forget(), so that the commit code knows it
should clear the dirty bits when it is done with the buffer.

This problem can be reproduced easily by xfstests generic/455 with
seeds (3246 3247 3248 3249).

Signed-off-by: zhangyi (F) <[email protected]>
Cc: [email protected]
---
fs/jbd2/transaction.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index f07f006..f7f9647 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -1609,14 +1609,19 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
 		/* However, if the buffer is still owned by a prior
 		 * (committing) transaction, we can't drop it yet... */
 		JBUFFER_TRACE(jh, "belongs to older transaction");
-		/* ... but we CAN drop it from the new transaction if we
-		 * have also modified it since the original commit. */
+		/* ... but we CAN drop it from the new transaction, mark
+		 * buffer as freed and set j_next_transaction to the new
+		 * transaction so that commit code knows it should clear
+		 * dirty bits when it is done with the buffer. */
 
-		if (jh->b_next_transaction) {
-			J_ASSERT(jh->b_next_transaction == transaction);
+		set_buffer_freed(bh);
+
+		if (!jh->b_next_transaction) {
 			spin_lock(&journal->j_list_lock);
-			jh->b_next_transaction = NULL;
+			jh->b_next_transaction = transaction;
 			spin_unlock(&journal->j_list_lock);
+		} else {
+			J_ASSERT(jh->b_next_transaction == transaction);
 
 			/*
 			 * only drop a reference if this transaction modified
--
2.7.4
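
For readers following the thread, the flag handling described in the commit
message can be sketched as a small user-space model. This is only an
illustration under stated assumptions: toy_buf, toy_txn, toy_forget() and
toy_commit_done() are hypothetical names, not jbd2 code, and the commit-side
logic is reduced to what the commit message describes (a buffer marked freed
gets its dirty bits cleared when the commit is done with it, otherwise a set
jbddirty flag is turned into a plain dirty flag on refile).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a transaction; not the real transaction_t. */
struct toy_txn { int id; };

/* Hypothetical stand-in for a buffer_head plus its journal_head. */
struct toy_buf {
	bool dirty;                       /* models the buffer dirty flag */
	bool jbddirty;                    /* models the jbddirty flag */
	bool freed;                       /* models the freed flag */
	struct toy_txn *transaction;      /* models jh->b_transaction */
	struct toy_txn *next_transaction; /* models jh->b_next_transaction */
};

/*
 * Models the "belongs to older transaction" branch of the forget path.
 * patched == false: old behaviour, the buffer is only dropped from the
 * running transaction and jbddirty is left untouched.
 * patched == true: behaviour described by the patch, the buffer is marked
 * freed and attached to the running transaction.
 */
static void toy_forget(struct toy_buf *b, struct toy_txn *running, bool patched)
{
	if (patched) {
		b->freed = true;
		if (!b->next_transaction)
			b->next_transaction = running;
	} else {
		b->next_transaction = NULL;
	}
}

/*
 * Models the end of commit for one buffer (simplified assumption):
 * a freed buffer has its dirty bits cleared, then the buffer is refiled,
 * which converts a still-set jbddirty flag into a dirty flag.
 */
static void toy_commit_done(struct toy_buf *b)
{
	bool was_jbddirty;

	if (b->freed) {
		b->freed = false;
		b->jbddirty = false;
		b->dirty = false;
	}

	was_jbddirty = b->jbddirty;
	b->jbddirty = false;
	b->transaction = b->next_transaction;
	b->next_transaction = NULL;
	if (was_jbddirty)
		b->dirty = true;
}
```

In this model, calling toy_forget() with patched == false followed by
toy_commit_done() leaves the buffer dirty, which corresponds to the stale
dirty buffer in the diagram above; with patched == true the buffer ends up
clean and owned by the running transaction.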
On Wed 16-01-19 21:38:23, zhangyi (F) wrote:
> Now, we capture a data corruption problem on ext4 while we're truncating
> an extent index block. Imaging that if we are revoking a buffer which
> has been journaled by the committing transaction, the buffer's jbddirty
> flag will not be cleared in jbd2_journal_forget(), so the commit code
> will set the buffer dirty flag again after refile the buffer.
>
> fsx kjournald2
> jbd2_journal_commit_transaction
> jbd2_journal_revoke commit phase 1~5...
> jbd2_journal_forget
> belongs to older transaction commit phase 6
> jbddirty not clear __jbd2_journal_refile_buffer
> __jbd2_journal_unfile_buffer
> test_clear_buffer_jbddirty
> mark_buffer_dirty
>
> Finally, if the freed extent index block was allocated again as data
> block by some other files, it may corrupt the file data when writing
> cached pages later, such as during umount time.
Thanks for the patch! I'm sorry this didn't occur to me the first time when
I was reading your analysis but now there is one question I have: When the
freed extent index block gets reallocated as data block, we should call
clean_bdev_aliases() or clean_bdev_bh_alias() for it (it usually happens
shortly after block allocation either in ext4_block_write_begin() or
mpage_map_one_extent()). That will clear the buffer dirty bit and thus
should avoid this kind of corruption. So how come this didn't work? Is it
that we for some reason didn't call clean_bdev_aliases() or that function
didn't work for some reason? Can you debug that with your reproducer?
Thanks a lot!
Honza
> This patch mark buffer as freed and set j_next_transaction to the new
> transaction when it already belongs to the committing transaction in
> jbd2_journal_forget(), so that commit code knows it should clear dirty
> bits when it is done with the buffer.
>
> This problem can be reproduced by xfstests generic/455 easily with
> seeds (3246 3247 3248 3249).
>
> Signed-off-by: zhangyi (F) <[email protected]>
> Cc: [email protected]
> ---
> fs/jbd2/transaction.c | 15 ++++++++++-----
> 1 file changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> index f07f006..f7f9647 100644
> --- a/fs/jbd2/transaction.c
> +++ b/fs/jbd2/transaction.c
> @@ -1609,14 +1609,19 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
> /* However, if the buffer is still owned by a prior
> * (committing) transaction, we can't drop it yet... */
> JBUFFER_TRACE(jh, "belongs to older transaction");
> - /* ... but we CAN drop it from the new transaction if we
> - * have also modified it since the original commit. */
> + /* ... but we CAN drop it from the new transaction, mark
> + * buffer as freed and set j_next_transaction to the new
> + * transaction so that commit code knows it should clear
> + * dirty bits when it is done with the buffer. */
>
> - if (jh->b_next_transaction) {
> - J_ASSERT(jh->b_next_transaction == transaction);
> + set_buffer_freed(bh);
> +
> + if (!jh->b_next_transaction) {
> spin_lock(&journal->j_list_lock);
> - jh->b_next_transaction = NULL;
> + jh->b_next_transaction = transaction;
> spin_unlock(&journal->j_list_lock);
> + } else {
> + J_ASSERT(jh->b_next_transaction == transaction);
>
> /*
> * only drop a reference if this transaction modified
> --
> 2.7.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu 17-01-19 18:58:51, zhangyi (F) wrote:
> On 2019/1/16 22:36, Jan Kara Wrote:
> > On Wed 16-01-19 21:38:23, zhangyi (F) wrote:
> >> Now, we capture a data corruption problem on ext4 while we're truncating
> >> an extent index block. Imaging that if we are revoking a buffer which
> >> has been journaled by the committing transaction, the buffer's jbddirty
> >> flag will not be cleared in jbd2_journal_forget(), so the commit code
> >> will set the buffer dirty flag again after refile the buffer.
> >>
> >> fsx kjournald2
> >> jbd2_journal_commit_transaction
> >> jbd2_journal_revoke commit phase 1~5...
> >> jbd2_journal_forget
> >> belongs to older transaction commit phase 6
> >> jbddirty not clear __jbd2_journal_refile_buffer
> >> __jbd2_journal_unfile_buffer
> >> test_clear_buffer_jbddirty
> >> mark_buffer_dirty
> >>
> >> Finally, if the freed extent index block was allocated again as data
> >> block by some other files, it may corrupt the file data when writing
> >> cached pages later, such as during umount time.
> >
> > Thanks for the patch! I'm sorry this didn't occur to me the first time when
> > I was reading your analysis but now there is one question I have: When the
> > freed extent index block gets reallocated as data block, we should call
> > clean_bdev_aliases() or clean_bdev_bh_alias() for it (it usually happens
> > shortly after block allocation either in ext4_block_write_begin() or
> > mpage_map_one_extent()). Which will clear the buffer dirty bit and thus
> > should avoid this kind of corruption. So how come this didn't work? Is it
> > that we for some reason didn't call clean_bdev_aliases() or that function
> > didn't work for some reason? Can you debug that with your reproducer?
> > Thanks a lot!
> >
>
> Indeed,I figure out that the root cause is
> ext4_ext_convert_to_initialized() return incorrect when it does try to
> zeroout the head of the first extent (see case 2 or 5)[1]. If we zeroout
> the tail of the second extent firstly, and then it will set "map->m_len"
> to "allocated" directly in case 2 or 5(cut the zeroed out range).
> Finally, ext4_ext_handle_unwritten_extents() will skip invoking
> clean_bdev_aliases() for the expanded region.
>
> At the same time, IIUC, it also have another two problems,
> 1) It doesn't call clean_bdev_aliases() for the head of the extent if zeroout extra
> blocks (unmap the tail of the extent only)[2].
> 2) If "allocated = ee_len - (map->m_lblk - ee_block)" but doesn't zeroout any extra
> blocks at all, the return value maybe large than requested and cover the uninitialized
> region (seems doesn't serious recently)[3].
OK, I see. Thanks for debugging this!
> For the problem [1][2], I think we could move clean_bdev_aliases() into
> ext4_ext_zeroout(). For the problem [3], it seems that
> ext4_ext_convert_to_initialized() return extra blocks number is
> unnecessary, return the request value on success is also fine after we do
> the previous job. Suggestions?
I have always considered clean_bdev_aliases() logic somewhat fragile since
it's not very clear when we should clear the aliases and bugs like the
above are the result of that. So I think that the best would be to make
sure that jbd2_journal_forget() cannot result in leaving dirty buffer head
behind. And your current patch goes a long way towards that. I think the
only remaining piece is to call __bforget() instead of __brelse() in
not_jbd branch of jbd2_journal_forget(). And then we can have a cleanup
patch removing all clean_bdev_aliases() and clean_bdev_bh_alias() calls
from ext4...
Honza
> >> This patch mark buffer as freed and set j_next_transaction to the new
> >> transaction when it already belongs to the committing transaction in
> >> jbd2_journal_forget(), so that commit code knows it should clear dirty
> >> bits when it is done with the buffer.
> >>
> >> This problem can be reproduced by xfstests generic/455 easily with
> >> seeds (3246 3247 3248 3249).
> >>
> >> Signed-off-by: zhangyi (F) <[email protected]>
> >> Cc: [email protected]
> >> ---
> >> fs/jbd2/transaction.c | 15 ++++++++++-----
> >> 1 file changed, 10 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
> >> index f07f006..f7f9647 100644
> >> --- a/fs/jbd2/transaction.c
> >> +++ b/fs/jbd2/transaction.c
> >> @@ -1609,14 +1609,19 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
> >> /* However, if the buffer is still owned by a prior
> >> * (committing) transaction, we can't drop it yet... */
> >> JBUFFER_TRACE(jh, "belongs to older transaction");
> >> - /* ... but we CAN drop it from the new transaction if we
> >> - * have also modified it since the original commit. */
> >> + /* ... but we CAN drop it from the new transaction, mark
> >> + * buffer as freed and set j_next_transaction to the new
> >> + * transaction so that commit code knows it should clear
> >> + * dirty bits when it is done with the buffer. */
> >>
> >> - if (jh->b_next_transaction) {
> >> - J_ASSERT(jh->b_next_transaction == transaction);
> >> + set_buffer_freed(bh);
> >> +
> >> + if (!jh->b_next_transaction) {
> >> spin_lock(&journal->j_list_lock);
> >> - jh->b_next_transaction = NULL;
> >> + jh->b_next_transaction = transaction;
> >> spin_unlock(&journal->j_list_lock);
> >> + } else {
> >> + J_ASSERT(jh->b_next_transaction == transaction);
> >>
> >> /*
> >> * only drop a reference if this transaction modified
> >> --
> >> 2.7.4
> >>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 2019/1/16 22:36, Jan Kara Wrote:
> On Wed 16-01-19 21:38:23, zhangyi (F) wrote:
>> Now, we capture a data corruption problem on ext4 while we're truncating
>> an extent index block. Imaging that if we are revoking a buffer which
>> has been journaled by the committing transaction, the buffer's jbddirty
>> flag will not be cleared in jbd2_journal_forget(), so the commit code
>> will set the buffer dirty flag again after refile the buffer.
>>
>> fsx kjournald2
>> jbd2_journal_commit_transaction
>> jbd2_journal_revoke commit phase 1~5...
>> jbd2_journal_forget
>> belongs to older transaction commit phase 6
>> jbddirty not clear __jbd2_journal_refile_buffer
>> __jbd2_journal_unfile_buffer
>> test_clear_buffer_jbddirty
>> mark_buffer_dirty
>>
>> Finally, if the freed extent index block was allocated again as data
>> block by some other files, it may corrupt the file data when writing
>> cached pages later, such as during umount time.
>
> Thanks for the patch! I'm sorry this didn't occur to me the first time when
> I was reading your analysis but now there is one question I have: When the
> freed extent index block gets reallocated as data block, we should call
> clean_bdev_aliases() or clean_bdev_bh_alias() for it (it usually happens
> shortly after block allocation either in ext4_block_write_begin() or
> mpage_map_one_extent()). Which will clear the buffer dirty bit and thus
> should avoid this kind of corruption. So how come this didn't work? Is it
> that we for some reason didn't call clean_bdev_aliases() or that function
> didn't work for some reason? Can you debug that with your reproducer?
> Thanks a lot!
>
Indeed, I figured out that the root cause is that ext4_ext_convert_to_initialized()
returns an incorrect value when it tries to zero out the head of the first extent
(see case 2 or 5)[1]. If we zero out the tail of the second extent first, it then sets
"map->m_len" to "allocated" directly in case 2 or 5 (cutting off the zeroed-out range).
As a result, ext4_ext_handle_unwritten_extents() skips invoking clean_bdev_aliases()
for the expanded region.

At the same time, IIUC, there are two other problems:
1) It doesn't call clean_bdev_aliases() for the head of the extent if it zeroes out
extra blocks (it unmaps the tail of the extent only)[2].
2) If "allocated = ee_len - (map->m_lblk - ee_block)" but it doesn't zero out any extra
blocks at all, the return value may be larger than requested and cover the uninitialized
region (this doesn't seem serious at present)[3].

For problems [1] and [2], I think we could move clean_bdev_aliases() into
ext4_ext_zeroout(). For problem [3], it seems unnecessary for
ext4_ext_convert_to_initialized() to return the extra block count; returning the
requested value on success is also fine once the previous change is done. Suggestions?

BTW, this patch is still needed; I can edit the commit log and re-post a patchset
to fix this problem.

Thanks,
Yi.
>
>> This patch mark buffer as freed and set j_next_transaction to the new
>> transaction when it already belongs to the committing transaction in
>> jbd2_journal_forget(), so that commit code knows it should clear dirty
>> bits when it is done with the buffer.
>>
>> This problem can be reproduced by xfstests generic/455 easily with
>> seeds (3246 3247 3248 3249).
>>
>> Signed-off-by: zhangyi (F) <[email protected]>
>> Cc: [email protected]
>> ---
>> fs/jbd2/transaction.c | 15 ++++++++++-----
>> 1 file changed, 10 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
>> index f07f006..f7f9647 100644
>> --- a/fs/jbd2/transaction.c
>> +++ b/fs/jbd2/transaction.c
>> @@ -1609,14 +1609,19 @@ int jbd2_journal_forget (handle_t *handle, struct buffer_head *bh)
>> /* However, if the buffer is still owned by a prior
>> * (committing) transaction, we can't drop it yet... */
>> JBUFFER_TRACE(jh, "belongs to older transaction");
>> - /* ... but we CAN drop it from the new transaction if we
>> - * have also modified it since the original commit. */
>> + /* ... but we CAN drop it from the new transaction, mark
>> + * buffer as freed and set j_next_transaction to the new
>> + * transaction so that commit code knows it should clear
>> + * dirty bits when it is done with the buffer. */
>>
>> - if (jh->b_next_transaction) {
>> - J_ASSERT(jh->b_next_transaction == transaction);
>> + set_buffer_freed(bh);
>> +
>> + if (!jh->b_next_transaction) {
>> spin_lock(&journal->j_list_lock);
>> - jh->b_next_transaction = NULL;
>> + jh->b_next_transaction = transaction;
>> spin_unlock(&journal->j_list_lock);
>> + } else {
>> + J_ASSERT(jh->b_next_transaction == transaction);
>>
>> /*
>> * only drop a reference if this transaction modified
>> --
>> 2.7.4
>>