Currently we capture a NULL data exposure problem after a crash or
poweroff when append writing a file in the data=ordered mode. The
problem is that we were not add inode to the transaction's order data
list when updating i_disksize without new block allocation no matter
the delay allocated block feature is enabled or not.
write jbd2 writeback
append write in allocated block
mark buffer dirty
update i_disksize
mark inode dirty
commit transaction
write inode
(data exposure after a crash)
write dirty buffer
It's fine in the case of new block allocation because we do this job in
ext4_map_blocks(). To fix this problem, this patch add inode to current
transaction's order data list after new data is copied and needing
update i_disksize in the case of no block allocation.
Fixes: 06bd3c36a733ac ("ext4: fix data exposure after a crash")
Fixes: f3b59291a69d0b ("ext4: remove calls to ext4_jbd2_file_inode() from delalloc write path")
Signed-off-by: zhangyi (F) <[email protected]>
---
fs/ext4/inode.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b32a57b..5cfa066 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1419,6 +1419,16 @@ static int ext4_write_end(struct file *file,
if (i_size_changed || inline_data)
ext4_mark_inode_dirty(handle, inode);
+ /*
+ * Updating i_disksize when extending file without block
+ * allocation, the newly written data where should be visible
+ * after transaction commit must be on transaction's ordered
+ * data list.
+ */
+ if (copied && (i_size_changed & 0x2) &&
+ ext4_should_order_data(inode))
+ ext4_jbd2_inode_add_write(handle, inode);
+
if (pos + len > inode->i_size && ext4_can_truncate(inode))
/* if we have allocated more blocks and copied
* less. We will have blocks allocated outside
@@ -3185,6 +3195,15 @@ static int ext4_da_write_end(struct file *file,
* bu greater than i_disksize.(hint delalloc)
*/
ext4_mark_inode_dirty(handle, inode);
+
+ /*
+ * Updating i_disksize when extending file without
+ * block allocation, the newly written data where
+ * should be visible after transaction commit must
+ * be on transaction's ordered data list.
+ */
+ if (ext4_should_order_data(inode))
+ ext4_jbd2_inode_add_write(handle, inode);
}
}
--
2.7.4
On Thu 04-04-19 17:29:52, zhangyi (F) wrote:
> Currently we capture a NULL data exposure problem after a crash or
> poweroff when append writing a file in the data=ordered mode. The
> problem is that we were not add inode to the transaction's order data
> list when updating i_disksize without new block allocation no matter
> the delay allocated block feature is enabled or not.
>
> write jbd2 writeback
> append write in allocated block
> mark buffer dirty
> update i_disksize
> mark inode dirty
> commit transaction
> write inode
> (data exposure after a crash)
> write dirty buffer
>
> It's fine in the case of new block allocation because we do this job in
> ext4_map_blocks(). To fix this problem, this patch add inode to current
> transaction's order data list after new data is copied and needing
> update i_disksize in the case of no block allocation.
>
> Fixes: 06bd3c36a733ac ("ext4: fix data exposure after a crash")
> Fixes: f3b59291a69d0b ("ext4: remove calls to ext4_jbd2_file_inode() from delalloc write path")
> Signed-off-by: zhangyi (F) <[email protected]>
Thanks for the patch. The current behavior is a deliberate decision.
data=ordered mode does guarantee there is no stale data visible in case of
crash. However it does not guarantee you cannot see zeros where data was
written.
Honza
> ---
> fs/ext4/inode.c | 19 +++++++++++++++++++
> 1 file changed, 19 insertions(+)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b32a57b..5cfa066 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1419,6 +1419,16 @@ static int ext4_write_end(struct file *file,
> if (i_size_changed || inline_data)
> ext4_mark_inode_dirty(handle, inode);
>
> + /*
> + * Updating i_disksize when extending file without block
> + * allocation, the newly written data where should be visible
> + * after transaction commit must be on transaction's ordered
> + * data list.
> + */
> + if (copied && (i_size_changed & 0x2) &&
> + ext4_should_order_data(inode))
> + ext4_jbd2_inode_add_write(handle, inode);
> +
> if (pos + len > inode->i_size && ext4_can_truncate(inode))
> /* if we have allocated more blocks and copied
> * less. We will have blocks allocated outside
> @@ -3185,6 +3195,15 @@ static int ext4_da_write_end(struct file *file,
> * bu greater than i_disksize.(hint delalloc)
> */
> ext4_mark_inode_dirty(handle, inode);
> +
> + /*
> + * Updating i_disksize when extending file without
> + * block allocation, the newly written data where
> + * should be visible after transaction commit must
> + * be on transaction's ordered data list.
> + */
> + if (ext4_should_order_data(inode))
> + ext4_jbd2_inode_add_write(handle, inode);
> }
> }
>
> --
> 2.7.4
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On 2019/4/4 18:18, Jan Kara Wrote:
> On Thu 04-04-19 17:29:52, zhangyi (F) wrote:
>> Currently we capture a NULL data exposure problem after a crash or
>> poweroff when append writing a file in the data=ordered mode. The
>> problem is that we were not add inode to the transaction's order data
>> list when updating i_disksize without new block allocation no matter
>> the delay allocated block feature is enabled or not.
>>
>> write jbd2 writeback
>> append write in allocated block
>> mark buffer dirty
>> update i_disksize
>> mark inode dirty
>> commit transaction
>> write inode
>> (data exposure after a crash)
>> write dirty buffer
>>
>> It's fine in the case of new block allocation because we do this job in
>> ext4_map_blocks(). To fix this problem, this patch add inode to current
>> transaction's order data list after new data is copied and needing
>> update i_disksize in the case of no block allocation.
>>
>> Fixes: 06bd3c36a733ac ("ext4: fix data exposure after a crash")
>> Fixes: f3b59291a69d0b ("ext4: remove calls to ext4_jbd2_file_inode() from delalloc write path")
>> Signed-off-by: zhangyi (F) <[email protected]>
>
> Thanks for the patch. The current behavior is a deliberate decision.
> data=ordered mode does guarantee there is no stale data visible in case of
> crash. However it does not guarantee you cannot see zeros where data was
> written.
>
Hi Jan,
Thanks a lot for your explanation. I read the Documentation/admin-guide/ext4.rst,
which said about the ordered mode:
> ... When it's time to write the new metadata out to disk, the associated data
> blocks are written first...
So I reckon that the dirty data block should be written to disk before committing
i_disksize and we cannot see the zero data. Now, I don't find any offical docs to
record the behavior you mentioned, do you have some links talk about this behavior
or am I miss something ?
Thanks,
Yi.
>> ---
>> fs/ext4/inode.c | 19 +++++++++++++++++++
>> 1 file changed, 19 insertions(+)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index b32a57b..5cfa066 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -1419,6 +1419,16 @@ static int ext4_write_end(struct file *file,
>> if (i_size_changed || inline_data)
>> ext4_mark_inode_dirty(handle, inode);
>>
>> + /*
>> + * Updating i_disksize when extending file without block
>> + * allocation, the newly written data where should be visible
>> + * after transaction commit must be on transaction's ordered
>> + * data list.
>> + */
>> + if (copied && (i_size_changed & 0x2) &&
>> + ext4_should_order_data(inode))
>> + ext4_jbd2_inode_add_write(handle, inode);
>> +
>> if (pos + len > inode->i_size && ext4_can_truncate(inode))
>> /* if we have allocated more blocks and copied
>> * less. We will have blocks allocated outside
>> @@ -3185,6 +3195,15 @@ static int ext4_da_write_end(struct file *file,
>> * bu greater than i_disksize.(hint delalloc)
>> */
>> ext4_mark_inode_dirty(handle, inode);
>> +
>> + /*
>> + * Updating i_disksize when extending file without
>> + * block allocation, the newly written data where
>> + * should be visible after transaction commit must
>> + * be on transaction's ordered data list.
>> + */
>> + if (ext4_should_order_data(inode))
>> + ext4_jbd2_inode_add_write(handle, inode);
>> }
>> }
>>
>> --
>> 2.7.4
>>
On Thu 04-04-19 20:46:47, zhangyi (F) wrote:
> On 2019/4/4 18:18, Jan Kara Wrote:
> > On Thu 04-04-19 17:29:52, zhangyi (F) wrote:
> >> Currently we capture a NULL data exposure problem after a crash or
> >> poweroff when append writing a file in the data=ordered mode. The
> >> problem is that we were not add inode to the transaction's order data
> >> list when updating i_disksize without new block allocation no matter
> >> the delay allocated block feature is enabled or not.
> >>
> >> write jbd2 writeback
> >> append write in allocated block
> >> mark buffer dirty
> >> update i_disksize
> >> mark inode dirty
> >> commit transaction
> >> write inode
> >> (data exposure after a crash)
> >> write dirty buffer
> >>
> >> It's fine in the case of new block allocation because we do this job in
> >> ext4_map_blocks(). To fix this problem, this patch add inode to current
> >> transaction's order data list after new data is copied and needing
> >> update i_disksize in the case of no block allocation.
> >>
> >> Fixes: 06bd3c36a733ac ("ext4: fix data exposure after a crash")
> >> Fixes: f3b59291a69d0b ("ext4: remove calls to ext4_jbd2_file_inode() from delalloc write path")
> >> Signed-off-by: zhangyi (F) <[email protected]>
> >
> > Thanks for the patch. The current behavior is a deliberate decision.
> > data=ordered mode does guarantee there is no stale data visible in case of
> > crash. However it does not guarantee you cannot see zeros where data was
> > written.
> >
>
> Hi Jan,
> Thanks a lot for your explanation. I read the
> Documentation/admin-guide/ext4.rst, which said about the ordered mode:
>
> > ... When it's time to write the new metadata out to disk, the associated data
> > blocks are written first...
>
> So I reckon that the dirty data block should be written to disk before
> committing i_disksize and we cannot see the zero data. Now, I don't find
> any offical docs to record the behavior you mentioned, do you have some
> links talk about this behavior or am I miss something ?
Yeah, I agree that paragraph in Documentation/admin-guide/ext4.rst could be
interpretted the way you understood it. Ext4 Wiki [1] has this comment:
(Note, however, that Ordered Mode does not guarantee that the file will be
consistent at an application level; the application must use fsync() at
appropriate commit points in order to guarantee application-level
consistency.)
In addition, there are some applications which depend on data=ordered to
automatically force data blocks to be written to disk soon after the file
is written. Using Writeback Mode extends the time from when a file is
written to when it is pushed out to disk to 30 seconds. This can be
surprising for some users; however, it should be noted that such problems
can still be an issue with Ordered Mode (although they are much rarer).
Again, a careful application or library should always use fsync() at points
where the application is at a stable commit point.
---
And we generally always made it clear that data=ordered mode is a "security
feature" - don't expose potentialy sensitive data after crash. Not app data
consistency feature (fsync(2) is for that). I don't have a good reference
for that except perhaps bugzilla comment from Ted [2] where he explains
some rationale behind data=ordered mode.
So overall no, data=ordered mode is not designed to provide the data
integrity guarantees you'd like to see. And providing more guarantees has
performance (and maintenance) costs so it should be better justified than
just "it would be nice for applications".
Also when you'd like to provide some guarantees about data integrity, you
need to specify your "promise" (currently our promise is: "no uninitialized
data visible"). Because full "writes are atomic operation wrt crash" kind
of guarantee is very hard (expensive) to achieve. I hope this clears things
out a bit.
Honza
[1] https://ext4.wiki.kernel.org/index.php/Ext3_Data=Ordered_vs_Data=Writeback_mode
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45
--
Jan Kara <[email protected]>
SUSE Labs, CR
Thanks a lot for your indepth explanation, I get it now.
Thanks
Yi.
On 2019/4/5 17:12, Jan Kara Wrote:
> On Thu 04-04-19 20:46:47, zhangyi (F) wrote:
>> On 2019/4/4 18:18, Jan Kara Wrote:
>>> On Thu 04-04-19 17:29:52, zhangyi (F) wrote:
>>>> Currently we capture a NULL data exposure problem after a crash or
>>>> poweroff when append writing a file in the data=ordered mode. The
>>>> problem is that we were not add inode to the transaction's order data
>>>> list when updating i_disksize without new block allocation no matter
>>>> the delay allocated block feature is enabled or not.
>>>>
>>>> write jbd2 writeback
>>>> append write in allocated block
>>>> mark buffer dirty
>>>> update i_disksize
>>>> mark inode dirty
>>>> commit transaction
>>>> write inode
>>>> (data exposure after a crash)
>>>> write dirty buffer
>>>>
>>>> It's fine in the case of new block allocation because we do this job in
>>>> ext4_map_blocks(). To fix this problem, this patch add inode to current
>>>> transaction's order data list after new data is copied and needing
>>>> update i_disksize in the case of no block allocation.
>>>>
>>>> Fixes: 06bd3c36a733ac ("ext4: fix data exposure after a crash")
>>>> Fixes: f3b59291a69d0b ("ext4: remove calls to ext4_jbd2_file_inode() from delalloc write path")
>>>> Signed-off-by: zhangyi (F) <[email protected]>
>>>
>>> Thanks for the patch. The current behavior is a deliberate decision.
>>> data=ordered mode does guarantee there is no stale data visible in case of
>>> crash. However it does not guarantee you cannot see zeros where data was
>>> written.
>>>
>>
>> Hi Jan,
>> Thanks a lot for your explanation. I read the
>> Documentation/admin-guide/ext4.rst, which said about the ordered mode:
>>
>>> ... When it's time to write the new metadata out to disk, the associated data
>>> blocks are written first...
>>
>> So I reckon that the dirty data block should be written to disk before
>> committing i_disksize and we cannot see the zero data. Now, I don't find
>> any offical docs to record the behavior you mentioned, do you have some
>> links talk about this behavior or am I miss something ?
>
> Yeah, I agree that paragraph in Documentation/admin-guide/ext4.rst could be
> interpretted the way you understood it. Ext4 Wiki [1] has this comment:
>
> (Note, however, that Ordered Mode does not guarantee that the file will be
> consistent at an application level; the application must use fsync() at
> appropriate commit points in order to guarantee application-level
> consistency.)
>
> In addition, there are some applications which depend on data=ordered to
> automatically force data blocks to be written to disk soon after the file
> is written. Using Writeback Mode extends the time from when a file is
> written to when it is pushed out to disk to 30 seconds. This can be
> surprising for some users; however, it should be noted that such problems
> can still be an issue with Ordered Mode (although they are much rarer).
> Again, a careful application or library should always use fsync() at points
> where the application is at a stable commit point.
>
> ---
>
> And we generally always made it clear that data=ordered mode is a "security
> feature" - don't expose potentialy sensitive data after crash. Not app data
> consistency feature (fsync(2) is for that). I don't have a good reference
> for that except perhaps bugzilla comment from Ted [2] where he explains
> some rationale behind data=ordered mode.
>
> So overall no, data=ordered mode is not designed to provide the data
> integrity guarantees you'd like to see. And providing more guarantees has
> performance (and maintenance) costs so it should be better justified than
> just "it would be nice for applications".
>
> Also when you'd like to provide some guarantees about data integrity, you
> need to specify your "promise" (currently our promise is: "no uninitialized
> data visible"). Because full "writes are atomic operation wrt crash" kind
> of guarantee is very hard (expensive) to achieve. I hope this clears things
> out a bit.
>
> Honza
>
> [1] https://ext4.wiki.kernel.org/index.php/Ext3_Data=Ordered_vs_Data=Writeback_mode
> [2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/317781/comments/45
>