On Wed 10-01-24 11:19:42, Free Ekanayaka wrote:
> Jan Kara <[email protected]> writes:
>
> [...]
>
> > >> I've re-run the same code with kernel 6.5.0 and indeed the behavior has
> >> changed and an actual NVMe flush command seems to be issued (the flags
> >> passed to nvme_setup_cmd match the ones that I see in the case I write
> >> to the raw block device). So that part seems fixed. I thought I had
> >> tried with 6.5.1 when I had posted this issue, and that it was present
> >> there too, but maybe I was mistaken.
> >>
> > >> I'm not entirely sure what you mean about flushing the inode buffer
> > >> properly. As far as I can tell, the current behavior matches what I'd
> > >> expect.
> >
> > The thing is: fallocate(2) does preallocate blocks to the file, so block
> > allocation is not needed when writing to those blocks. However, that does
> > not mean metadata changes are not needed when writing to those blocks. In
> > the case of ext4 (and other filesystems such as xfs or btrfs as well),
> > blocks allocated using fallocate(2) are tracked as unwritten in the inode's
> > metadata (so reads from them return 0 instead of random garbage). After the
> > block contents are written with an io_uring write, we need to convert the
> > state of those blocks from unwritten to written, and that needs metadata
> > modifications. So the first write to the file after the fallocate(2) call
> > still needs to do metadata modifications, and these need to be persisted as
> > part of fdatasync(2). And by mistake this did not happen in nojournal mode.
>
> Thanks for the explanation.
>
> Since I want to achieve the lowest possible latency for these writes, is
> there a way to not only preallocate blocks with fallocate() but also to
> avoid the extra write for metadata modifications that you mention?
>
> I mean, I could of course take the brute force approach of performing
> dumb "pre" writes of those blocks, but I'm wondering if there's a better
> way that doesn't require writing those blocks twice (which might end up
> defeating the purpose of lowering overall latency).
No, if you really want to make sure there will be no metadata modifications
while writing to a file, writing 0's (or whatever you wish) to it is the
only way.
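To illustrate what such a brute-force pre-write could look like, here is a
minimal sketch (the file name, size and chunking are just placeholders, not
taken from your code): preallocate, write zeroes once so the
unwritten->written conversion and its metadata updates happen up front, and
fsync before the latency-sensitive writes start. After that, overwrites of
the already written range followed by fdatasync(2) should not need further
extent-state changes.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	/* Illustrative size and file name, not taken from this thread. */
	const off_t size = 8 << 20;		/* 8 MiB */
	const size_t chunk = 1 << 20;		/* 1 MiB per pwrite() */

	int fd = open("prealloc.dat", O_CREAT | O_RDWR, 0600);
	if (fd < 0) { perror("open"); return 1; }

	/* Reserve the blocks; on ext4 they start out as unwritten extents. */
	if (fallocate(fd, 0, 0, size) < 0) { perror("fallocate"); return 1; }

	/* Dumb "pre" writes of zeroes: force the unwritten->written
	 * conversion (and its metadata updates) now, not on the hot path. */
	char *buf = calloc(1, chunk);
	if (!buf) { perror("calloc"); return 1; }
	for (off_t off = 0; off < size; off += chunk)
		if (pwrite(fd, buf, chunk, off) != (ssize_t)chunk) {
			perror("pwrite");
			return 1;
		}
	free(buf);

	/* Persist both the data and the extent-state changes up front. */
	if (fsync(fd) < 0) { perror("fsync"); return 1; }
	close(fd);
	return 0;
}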
> >> For reference I'm attaching below the trace of the same user code, this
> >> time run on kernel 6.5.0, which is the one currently shipping with
> > >> Debian/testing. Note that there are quite a bit fewer trace lines emitted
> > >> by the ext4 sub-system; not sure if it's related/relevant.
> >>
> >> raft-benchmark-35708 [000] ..... 9203.271114: io_uring_submit_req: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, opcode WRITE_FIXED, flags 0x1, sq_thread 0
> >> raft-benchmark-35708 [000] ..... 9203.271117: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0
> >> raft-benchmark-35708 [000] ..... 9203.271117: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
> >> raft-benchmark-35708 [000] ..... 9203.271118: ext4_journal_start_inode: dev 259,5 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 16, caller ext4_dirty_inode+0x38/0x80 [ext4]
> >> raft-benchmark-35708 [000] ..... 9203.271119: ext4_mark_inode_dirty: dev 259,5 ino 16 caller ext4_dirty_inode+0x5b/0x80 [ext4]
> >> raft-benchmark-35708 [000] ..... 9203.271119: block_touch_buffer: 259,5 sector=135 size=4096
> >> raft-benchmark-35708 [000] ..... 9203.271120: block_dirty_buffer: 259,5 sector=135 size=4096
> >> raft-benchmark-35708 [000] ..... 9203.271120: ext4_es_lookup_extent_enter: dev 259,5 ino 16 lblk 0
> >> raft-benchmark-35708 [000] ..... 9203.271121: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
> >> raft-benchmark-35708 [000] ..... 9203.271122: block_bio_remap: 259,0 WFS 498455552 + 8 <- (259,5) 263168
> >> raft-benchmark-35708 [000] ..... 9203.271122: block_bio_queue: 259,0 WFS 498455552 + 8 [raft-benchmark]
> >> raft-benchmark-35708 [000] ..... 9203.271123: block_getrq: 259,0 WFS 498455552 + 8 [raft-benchmark]
> >> raft-benchmark-35708 [000] ..... 9203.271123: block_io_start: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
> >> raft-benchmark-35708 [000] ..... 9203.271124: block_plug: [raft-benchmark]
> >> raft-benchmark-35708 [000] ..... 9203.271124: nvme_setup_cmd: nvme0: disk=nvme0n1, qid=1, cmdid=53265, nsid=1, flags=0x0, meta=0x0, cmd=(nvme_cmd_write slba=498455552, len=7, ctrl=0x4000, dsmgmt=0, reftag=0)
> >> raft-benchmark-35708 [000] ..... 9203.271126: block_rq_issue: 259,0 WFS 4096 () 498455552 + 8 [raft-benchmark]
> >> raft-benchmark-35708 [000] d.h.. 9203.271382: nvme_sq: nvme0: disk=nvme0n1, qid=1, head=79, tail=79
> >> raft-benchmark-35708 [000] d.h.. 9203.271384: nvme_complete_rq: nvme0: disk=nvme0n1, qid=1, cmdid=53265, res=0x0, retries=0, flags=0x0, status=0x0
> >> raft-benchmark-35708 [000] d.h.. 9203.271384: block_rq_complete: 259,0 WFS () 498455552 + 8 [0]
> >> raft-benchmark-35708 [000] dNh.. 9203.271386: block_io_done: 259,0 WFS 0 () 498455552 + 0 [raft-benchmark]
> >> raft-benchmark-35708 [000] ...1. 9203.271391: io_uring_complete: ring 000000007ee609d1, req 00000000221c7d2e, user_data 0x0, result 4096, cflags 0x0 extra1 0 extra2 0
> >> raft-benchmark-35708 [000] ..... 9203.271391: io_uring_task_work_run: tctx 00000000f15587dc, count 1, loops 1
> >
> > So in this case the file blocks seem to have already been written.
>
> Do you mean that they have already been written at some point in the
> past after the file system was created?
No, I mean this particular file offset has already been written to since
the file was created. The line:
raft-benchmark-35708 [000] ..... 9203.271121: ext4_es_lookup_extent_exit: dev 259,5 ino 16 found 1 [0/16) 32896 WR
ends with 'WR', which is the status of the found extent (a contiguous
sequence of blocks); 'WR' means 'written' and 'referenced'.
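If you want to check this state from userspace rather than from the
tracepoints, here is a minimal sketch using the FIEMAP ioctl: extents still
carrying FIEMAP_EXTENT_UNWRITTEN are the preallocated-but-not-yet-written
ones, everything else has already been written. The fixed budget of 32
extents is an arbitrary choice for the example; a real tool would loop until
it sees the last extent.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	int fd = open(argv[1], O_RDONLY);
	if (fd < 0) { perror("open"); return 1; }

	/* Arbitrary extent budget for the example. */
	unsigned int count = 32;
	struct fiemap *fm = calloc(1, sizeof(*fm) +
				   count * sizeof(struct fiemap_extent));
	if (!fm) { perror("calloc"); return 1; }
	fm->fm_start = 0;
	fm->fm_length = ~0ULL;			/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush dirty data first */
	fm->fm_extent_count = count;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}

	for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
		struct fiemap_extent *e = &fm->fm_extents[i];
		printf("logical %llu len %llu: %s\n",
		       (unsigned long long)e->fe_logical,
		       (unsigned long long)e->fe_length,
		       (e->fe_flags & FIEMAP_EXTENT_UNWRITTEN) ?
		       "unwritten (preallocated)" : "written");
	}
	free(fm);
	close(fd);
	return 0;
}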
> I guess it doesn't matter if they were written as part of the new file
> that is being written (and that was created with open()/fallocate()), or
> if they were written before as part of some other file that was deleted
> since then.
No, the block must have been written while it was already part of this file.
Whether a block has been written is tracked on a logical-file-offset basis,
and this tracking is there exactly so that you cannot read stale data
(potentially sensitive data of another user) after fallocating some blocks
to the file.
I understand this protection may not be that interesting for your use case,
and in nojournal mode you may still potentially see such stale data, but
there's a big difference between possibly seeing such data only after a
crash and giving the user a way to read such data at will.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR