I noticed that every time I sync ( which happens automatically when
you suspend to ram ), ext4 issues a flush to the block device, even
though there have been no writes to flush. This appears to be because
jbd2_trans_will_send_data_barrier() returns a 0 when no transaction
has been started. The intent appears to be that a transaction that
has completed should return 0, and that when there is NO transaction,
it should return a 1, but the tests were in the wrong order, leading
to the 0 to be returned before checking for the absense of a
transaction at all. Reversing the order allows my disk to remain in
runtime_pm when syncing.
I *think* this is correct, but I'm not very familliar with jbd2, so it
may have unintended consequences. What do you think?
---
fs/jbd2/journal.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index b6c114c11b97..be13dae767be 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -632,14 +632,16 @@ int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid)
if (!(journal->j_flags & JBD2_BARRIER))
return 0;
read_lock(&journal->j_state_lock);
- /* Transaction already committed? */
- if (tid_geq(journal->j_commit_sequence, tid))
- goto out;
commit_trans = journal->j_committing_transaction;
if (!commit_trans || commit_trans->t_tid != tid) {
ret = 1;
goto out;
}
+ /* Transaction already committed? */
+ if (tid_geq(journal->j_commit_sequence, tid))
+ {
+ goto out;
+ }
/*
* Transaction is being committed and we already proceeded to
* submitting a flush to fs partition?
--
2.43.0
On Tue, Feb 27, 2024 at 04:25:46PM -0500, Phillip Susi wrote:
> I noticed that every time I sync ( which happens automatically when
> you suspend to ram ), ext4 issues a flush to the block device, even
> though there have been no writes to flush. This appears to be because
> jbd2_trans_will_send_data_barrier() returns a 0 when no transaction
> has been started. The intent appears to be that a transaction that
> has completed should return 0, and that when there is NO transaction,
> it should return a 1, but the tests were in the wrong order, leading
> to the 0 to be returned before checking for the absense of a
> transaction at all. Reversing the order allows my disk to remain in
> runtime_pm when syncing.
>
> I *think* this is correct, but I'm not very familliar with jbd2, so it
> may have unintended consequences. What do you think?
Yeah, this change is going to problems. The basic idea here is if
when we request that a transaction to commit, will it issue a a
commit? If so, then fsync(2) doesn't need to issue a barrier (i.e., a
cache flush command).
So for example, if a database does an overwriting write to a file
block which is already allocated, and then follows it up with a
fdatasync(2), there won't be any need to make any metadata changes as
part of writing out the changed block. Hence, we won't need to start
a new jbd2 transaction, and in that case, current transaction has
already commited, so the jbd2 layer won't need to do anything, and so
it won't have issued a commit. In that case,
jbd2_trans_will_send_data_barrier() needs to return false, so that
fdatasync(2) will actually issue a cache flush command.
The patch you've proposed will cause fdatasync(2) to not issue a
barrier, which could lead to the write to the database file getting
lost after a power fail event, which would make the database
adminisrtator very sad.
Cheers,
- Ted
"Theodore Tso" <[email protected]> writes:
> Yeah, this change is going to problems. The basic idea here is if
> when we request that a transaction to commit, will it issue a a
> commit? If so, then fsync(2) doesn't need to issue a barrier (i.e., a
> cache flush command).
>
> So for example, if a database does an overwriting write to a file
> block which is already allocated, and then follows it up with a
> fdatasync(2), there won't be any need to make any metadata changes as
> part of writing out the changed block. Hence, we won't need to start
> a new jbd2 transaction, and in that case, current transaction has
> already commited, so the jbd2 layer won't need to do anything, and so
> it won't have issued a commit. In that case,
> jbd2_trans_will_send_data_barrier() needs to return false, so that
> fdatasync(2) will actually issue a cache flush command.
>
> The patch you've proposed will cause fdatasync(2) to not issue a
> barrier, which could lead to the write to the database file getting
> lost after a power fail event, which would make the database
> adminisrtator very sad.
So because no metadata changed, jbd2 will not issue a barrier to end the
transaction? How can we fix this then? Is there some way that jbd2 can
know whether file data has been written, and thus, issue the barrier to
close the transaction?
On Thu, Feb 29, 2024 at 09:23:35AM -0500, Phillip Susi wrote:
>
> So because no metadata changed, jbd2 will not issue a barrier to end the
> transaction? How can we fix this then? Is there some way that jbd2 can
> know whether file data has been written, and thus, issue the barrier to
> close the transaction?
Because no metadata changed, jbd2 will not even *start* a (jbd2)
transaction as a result of that write (overwrite) to an already
allocated data block.. Since it didn't start a jbd2 transaction for
this file system operation, there's no reason to force a jbd2
transaction to close.
(Note: this could because there *is* no currently existing open
transaction, or there might be a currently open transaction, but it's
not relevent to the activity associated with the file descriptor being
fsync'ed.)
This is a critical performance optimization, because for many database
workloads, which are overwriting existing blocks and using
fdatasync(2), there is no reason to force a jbd2 transaction commit
for every single fdatasync(2) issued by the database. However, we
still need to send a cache flush operation so that the data block is
safely persistend onto stable storage.
Cheers,
- Ted
"Theodore Ts'o" <[email protected]> writes:
> Because no metadata changed, jbd2 will not even *start* a (jbd2)
> transaction as a result of that write (overwrite) to an already
> allocated data block.. Since it didn't start a jbd2 transaction for
> this file system operation, there's no reason to force a jbd2
> transaction to close.
>
> (Note: this could because there *is* no currently existing open
> transaction, or there might be a currently open transaction, but it's
> not relevent to the activity associated with the file descriptor being
> fsync'ed.)
>
> This is a critical performance optimization, because for many database
> workloads, which are overwriting existing blocks and using
> fdatasync(2), there is no reason to force a jbd2 transaction commit
> for every single fdatasync(2) issued by the database. However, we
> still need to send a cache flush operation so that the data block is
> safely persistend onto stable storage.
So maybe what ext4's sync_fs needs to know is whether ANY writes have
been done since the last transaction committed? Is there a way to know
that? As long as NOTHING has been written since the last commit, then
there is no need to issue a flush.