On Wed, Oct 30, 2019 at 04:28:42AM +0000, Xiaohui1 Li 李晓辉 wrote:
> the problem of a file's data waiting in the jbd2 thread in ordered mode is
> also a serious problem which causes long-latency fsync calls.
Yes, this is a separate problem, although note that if the file with a
large amount of data is the file which is being fsync'ed, you have to
write it out at fsync time no matter what.
You could try to write out dirty data earlier (e.g., by decreasing the
30 second writeback window), but there are tradeoffs. For one thing,
if the file ends up being deleted anyway, it's better not to write out
the data at all. For another, if we know how big the file is at the
time when we do the writeout, we can do a better job allocating space
for the file, and it improves the file layout by making it more likely
it will be contiguous, or at least mostly contiguous.
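As a rough illustration only (this snippet is mine, not part of the original
discussion): the "30 second window" corresponds to the writeback knobs
vm.dirty_expire_centisecs (default 3000, i.e. 30 seconds) and
vm.dirty_writeback_centisecs (default 500); shrinking them, for example as in
the sketch below, makes dirty data go out sooner, at the cost of exactly the
tradeoffs above.

#include <stdio.h>

/* Illustrative values only; must be run as root.  Smaller numbers mean
 * dirty pages are expired and written back sooner. */
static int set_knob(const char *path, int centisecs)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%d\n", centisecs);
        return fclose(f);
}

int main(void)
{
        /* Expire dirty pages after ~5 seconds instead of ~30. */
        set_knob("/proc/sys/vm/dirty_expire_centisecs", 500);
        /* Wake the flusher threads every second instead of every 5. */
        set_knob("/proc/sys/vm/dirty_writeback_centisecs", 100);
        return 0;
}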
Also, files that tend to be fsync'ed a lot tend to be database files
(e.g., SQLite files), and they tend to write small amounts of data and
then fsync them. So the problem described below happens when there
are unrelated files that happen to be downloaded in parallel. An
example of this in the Android case might be when the user is
downloading a large video file, such as a movie, to be watched offline
later (such as when they are on a plane).
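Just to make the scenario concrete (this reproducer sketch is mine; the file
names and sizes are invented): one thread does small write()+fsync() cycles
the way a database would, while the main thread streams a large buffered
"download" in parallel on the same data=ordered filesystem, and it is the
fsync latencies of the small file that blow up.

#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Database-like writer: small records, each followed by fsync(). */
static void *db_like_writer(void *arg)
{
        char rec[256];
        int fd = open("small.db", O_CREAT | O_WRONLY | O_APPEND, 0644);
        struct timespec t0, t1;

        (void)arg;
        memset(rec, 'a', sizeof(rec));
        for (int i = 0; i < 100; i++) {
                if (write(fd, rec, sizeof(rec)) < 0)
                        break;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                fsync(fd);              /* the latency of interest */
                clock_gettime(CLOCK_MONOTONIC, &t1);
                printf("fsync #%d: %.1f ms\n", i,
                       (t1.tv_sec - t0.tv_sec) * 1000.0 +
                       (t1.tv_nsec - t0.tv_nsec) / 1e6);
        }
        close(fd);
        return NULL;
}

int main(void)
{
        static char buf[1 << 20];       /* 1 MiB of buffered data per write */
        int fd = open("big-download.bin", O_CREAT | O_WRONLY, 0644);
        pthread_t t;

        memset(buf, 'b', sizeof(buf));
        pthread_create(&t, NULL, db_like_writer, NULL);

        /* The "movie download": ~2 GiB of buffered, never-fsync'ed writes
         * that data=ordered may have to drag into the journal commit. */
        for (int i = 0; i < 2048; i++)
                if (write(fd, buf, sizeof(buf)) < 0)
                        break;

        pthread_join(t, NULL);
        close(fd);
        return 0;
}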
> as pointed out in the iJournaling paper, when three conditions occur at the same time:
> 1: ordered mode must be applied, not writeback mode.
> 2: the delayed block allocation technique of ext4 must be applied.
> 3: background buffered writes are too many.
(1) and (2) are the default. (3) may or may not be a frequent
occurrence, depending on the workload. In practice though, users
aren't downloading large files all *that* often.
> we have no choice, as ordered mode needs to do this work, so the time spent
> waiting for inode data to be flushed to disk is too long in some extreme
> conditions. So it causes the appearance of long-latency fsync calls.
>
> thank you for your reply; I will try to fix this problem in my free time.
So there is a solution; it's just a bit tricky to do, and it's not
been a huge enough deal that anyone has allocated time to fix it.
The idea is to allocate space, but not actually update the metadata
blocks at the time when the data blocks are allocated. Instead, we
reserve them so they won't get allocated for use by another file, and
we note where they are in the extent status cache. We then issue the
writes of the data blocks, and only after they are complete, only
*then* do we update the metadata blocks (which then get updated via
the journal, using either a commit or a fast commit).
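As a rough userspace toy model of that ordering (the names here are
hypothetical; none of them are existing ext4 functions):

#include <stdio.h>

struct extent { long lblk, pblk, len; };

/* 1. Reserve physical blocks and remember the mapping only in a stand-in
 *    for the extent status cache; no metadata blocks are touched yet, but
 *    no other file can be allocated these blocks either. */
static void reserve_in_es_cache(const struct extent *ex)
{
        printf("reserve %ld blocks at pblk %ld (extent status cache only)\n",
               ex->len, ex->pblk);
}

/* 2. Write the data to its final, reserved location and wait for it. */
static void write_data_blocks(const struct extent *ex)
{
        printf("write and wait for %ld data blocks at pblk %ld\n",
               ex->len, ex->pblk);
}

/* 3. Only now journal the extent tree update (commit or fast commit), so
 *    the on-disk metadata never points at blocks whose data isn't there,
 *    and the metadata is modified only once. */
static void journal_extent_tree_update(const struct extent *ex)
{
        printf("journal extent [%ld, +%ld] -> pblk %ld\n",
               ex->lblk, ex->len, ex->pblk);
}

int main(void)
{
        struct extent ex = { .lblk = 0, .pblk = 34816, .len = 50 };

        reserve_in_es_cache(&ex);
        write_data_blocks(&ex);
        journal_extent_tree_update(&ex);
        return 0;
}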
This is similar to the dioread_nolock case, where we update the
metadata blocks first, but mark them as unwritten, then we let the
data blocks get written, and only finally do we update the metadata
blocks so they are marked as written (e.g., initialized). This avoids
the stale data problem as well, but we end up modifying the metadata
blocks twice, and it has resulted in other performance problems since it
increases overhead on the i_data_sem lock. See for example some of
the posts by Liu Bo from Alibaba last year:
If we can allocate space, write the data blocks, and only *then*
update the extent tree metadata blocks, it solves a lot of problems.
We can get rid of the dioread_nolock option; we can get rid of the
data=ordered vs data=writeback distinction; and we can avoid the need
to force data blocks to be written out at commit time. So it improves
performance, and it will reduce code complexity, making it a win-win
approach.
The problem is that this means significantly changing how we do block
allocation and block reservation, so it's a fairly large and invasive
set of changes. But it's the right long-term direction, and we'll get
there eventually.
Cheers,
- Ted
Another way which I think could fix this fsync time cost problem may be
changing the ext4 data mode from ordered to writeback.
In writeback mode, an inode's data does not have to be waited for in the jbd2
thread, so the fsync time cost is also reduced.
Meanwhile, writeback mode can still guarantee filesystem metadata consistency
across an OS crash and reboot, with the one drawback that it can cause
security problems, such as stale data becoming visible.
But on an Android system with file-based encryption enabled, there is no such
security problem, as all files are encrypted.
However, the user may still see wrong file data after a system crash and
reboot with writeback mode enabled.
for example:
-------------
File A has allocated 50 new blocks and has already dirtied the page cache
pages corresponding to these 50 blocks.
After the metadata describing the file's 50 new blocks has been flushed to
the journal area, the system crashes, before file A's data for those 50 new
blocks has been flushed to disk.
After the system reboots and filesystem recovery finishes, file A's size has
been enlarged by the 50 new blocks,
but the data in those 50 new blocks is not correct. So it will mislead the
user, especially if the incorrect data is hard to notice.
-------------
This problem cannot be fixed by a full e2fsck check, as it is not a
filesystem consistency issue,
so we will insist on using ordered mode, not writeback mode.
I will also share my view of the iJournaling paper next time.
On Mon, Nov 04, 2019 at 09:01:28AM +0800, xiaohui li wrote:
>
> In writeback mode, an inode's data does not have to be waited for in the jbd2
> thread, so the fsync time cost is also reduced.
> Meanwhile, writeback mode can still guarantee filesystem metadata consistency
> across an OS crash and reboot, with the one drawback that it can cause
> security problems, such as stale data becoming visible.
It's not just stale data; in data=writeback, today if a file gets
deleted, its blocks are immediately eligible to be reused. If there
is a crash before the transaction is committed, there could be a file
that would have been deleted (and perhaps replaced) that doesn't in fact
get deleted, but its data blocks will have been corrupted.
I'm not fond of that particular behavior, and I may look to fix it,
but in general, data=writeback means that data blocks may be corrupted
or contain stale data after a crash --- for blocks that were freshly
created, or for a file that might have been deleted, except that the
crash means the file deletion doesn't actually get completed.
> But on an Android system with file-based encryption enabled, there is no such
> security problem, as all files are encrypted.
> However, the user may still see wrong file data after a system crash and
> reboot with writeback mode enabled.
If all files are encrypted, then yes, the chances of stale data
causing security issues are significantly reduced.
But see also the case of a file which is deleted immediately before a
crash. Things are more complex in terms of the data guarantees after
a crash, which is why data=ordered is the default.
Regards,
- Ted
Thank you, Ted.
I have now understood the whole design and implementation of the iJournaling
paper, and I think the fast commit feature for ext4 could be designed and
implemented according to the idea of that paper,
as in my opinion the iJournaling approach is the best way to resolve the
problem of a file's data having to be waited for in the jbd2 thread in
ordered mode.
According to that paper, the iJournal records and commits only the changes of
the fsync'ed file to its own iJournal area,
while the changes of the whole ext4 filesystem are left to the normal journal
to process.
The iJournal commit happens only at the end of the thread which is doing the
fsync work; it does not need to be embedded in the jbd2 thread.
The changes of the fsync'ed file which have been committed via the iJournal
will also be committed to the normal journal area subsequently.
The iJournal has no side effect on the normal journal; the two journals run
independently.
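As a rough userspace toy model of my reading of that fsync path (all names
below are made up; none of this is existing ext4 or jbd2 code):

#include <stdio.h>

struct change { int ino; const char *what; };

/* The running jbd2 transaction: metadata changes from every file. */
static struct change running_tx[16];
static int ntx;

static void file_modify(int ino, const char *what)
{
        running_tx[ntx++] = (struct change){ ino, what };
}

/* fsync() under iJournaling: flush this file's data, commit only this
 * inode's changes to the small per-fsync iJournal area, and return
 * without waiting for the global jbd2 commit. */
static void ijournal_fsync(int ino)
{
        printf("flush data blocks of inode %d\n", ino);
        for (int i = 0; i < ntx; i++)
                if (running_tx[i].ino == ino)
                        printf("  ijournal commit: inode %d: %s\n",
                               ino, running_tx[i].what);
        printf("fsync(inode %d) returns\n", ino);
}

/* The normal journal commit still happens later and still covers
 * everything, including the changes already in the iJournal. */
static void jbd2_commit(void)
{
        for (int i = 0; i < ntx; i++)
                printf("normal journal commit: inode %d: %s\n",
                       running_tx[i].ino, running_tx[i].what);
        ntx = 0;
}

int main(void)
{
        file_modify(10, "extent tree update (small database file)");
        file_modify(11, "extent tree updates (large parallel download)");
        ijournal_fsync(10);     /* does not drag inode 11's data along */
        jbd2_commit();
        return 0;
}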
From my viewpoint, all of the above design elements of iJournaling will
simplify the fast commit feature developed recently, and can also help the
fast commit feature achieve its goals.
One of its most important goals, which I have to highlight, should be fixing
the ext4 fsync latency problem caused by a file's data having to be waited
for in the jbd2 thread (the same as the CTX problems pointed out in the
iJournaling paper).
What do you think?
I like the iJournaling idea. Maybe I will do some work on coding this fast
commit feature in my free time.