LinuxLists.cc - code questions about ext4_inode_datasync

2021-01-12 13:04:38

Subject: code questions about ext4_inode_datasync_dirty()

hi,

I use io_uring to evaluate ext4 randread performance(direct io), observed
obvious overhead in jbd2_transaction_committed():
Samples: 124K of event 'cycles:ppp', Event count (approx.): 80630088951
Overhead Command Shared Object Symbol
7.02% io_uring-sq-per [kernel.kallsyms] [k] jbd2_transaction_committed

The codes:
/*
* Writes that span EOF might trigger an I/O size update on completion,
* so consider them to be dirty for the purpose of O_DSYNC, even if
* there is no other metadata changes being made or are pending.
*/
iomap->flags = 0;
if (ext4_inode_datasync_dirty(inode) ||
offset + length > i_size_read(inode))
iomap->flags |= IOMAP_F_DIRTY;

ext4_inode_datasync_dirty() calls jbd2_transaction_committed(). Sorry, I don't spend
much time to learn iomap codes yet, just ask a quick question here. Do we need to call
ext4_inode_datasync_dirty() for a read operation?

If we must call ext4_inode_datasync_dirty() for a read operation, can we improve
jbd2_transaction_committed() a bit, for example, have a quick check between
inode->i_datasync_tid and j_commit_sequence, if inode->i_datasync_tid is less than
or equal to j_commit_sequence, we also don't call jbd2_transaction_committed()?

Regards,
Xiaoguang Wang

2021-01-13 17:21:59

by Jan Kara

[permalink] [raw]

Subject: Re: code questions about ext4_inode_datasync_dirty()

Hi!

On Tue 12-01-21 19:45:06, Xiaoguang Wang wrote:
> I use io_uring to evaluate ext4 randread performance(direct io), observed
> obvious overhead in jbd2_transaction_committed():
> Samples: 124K of event 'cycles:ppp', Event count (approx.): 80630088951
> Overhead Command Shared Object Symbol
> 7.02% io_uring-sq-per [kernel.kallsyms] [k] jbd2_transaction_committed

Hum, that's quite a bit. Likely due to cacheline contention on
j_state_lock. Thanks for reporting this!

> The codes:
> /*
> * Writes that span EOF might trigger an I/O size update on completion,
> * so consider them to be dirty for the purpose of O_DSYNC, even if
> * there is no other metadata changes being made or are pending.
> */
> iomap->flags = 0;
> if (ext4_inode_datasync_dirty(inode) ||
> offset + length > i_size_read(inode))
> iomap->flags |= IOMAP_F_DIRTY;
>
> ext4_inode_datasync_dirty() calls jbd2_transaction_committed(). Sorry, I
> don't spend much time to learn iomap codes yet, just ask a quick question
> here. Do we need to call ext4_inode_datasync_dirty() for a read
> operation?

So strictly speaking we don't need to know the value of IOMAP_F_DIRTY in
that path (or any read path for that matter) but I'm somewhat reluctant to
optimize out setting of this flag because later some user might start to
depend on it being set correctly.

> If we must call ext4_inode_datasync_dirty() for a read operation, can we improve
> jbd2_transaction_committed() a bit, for example, have a quick check between
> inode->i_datasync_tid and j_commit_sequence, if inode->i_datasync_tid is less than
> or equal to j_commit_sequence, we also don't call jbd2_transaction_committed()?

Well, the modification would belong directly to
jbd2_transaction_committed(). Using j_commit_sequence is somewhat awkward
since TIDs can wrap around and so very old TIDs can suddently start to give
false positives leading to a strange results (this was one of motivations
to switch jbd2_transaction_committed() to a comparison for equality). But
it could certainly be done.

But j_state_lock is a scalability bottleneck in other cases as well. So
what I rather have in mind is that we could change transactions to be RCU
freed and then a majority of places where we use

read_lock(&journal->j_state_lock);

can be switched to using plain RCU which should significantly reduce the
contention on the lock.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-01-19 05:48:20

by Andreas Dilger

[permalink] [raw]

Subject: Re: code questions about ext4_inode_datasync_dirty()

On Jan 13, 2021, at 10:19 AM, Jan Kara <[email protected]> wrote:
>
> Hi!
>
> On Tue 12-01-21 19:45:06, Xiaoguang Wang wrote:
>> I use io_uring to evaluate ext4 randread performance(direct io), observed
>> obvious overhead in jbd2_transaction_committed():

I was going to ask about this - is the filesystem mounted with noatime or
relatime or lazytime? Otherwise, it may be a lot of atime updates that are
causing all of this journal traffic in what _should_ be a read-only workload?

Cheers, Andreas

Attachments:

signature.asc (890.00 B)
Message signed with OpenPGP