2024-04-25 13:29:49

by Ritesh Harjani

Subject: [RFCv3 0/7] ext2 iomap changes and iomap improvements

Hello all,

Here is RFCv3, which implements the ext2 iomap changes. I have also included
some iomap improvements, along with a fix for the BH_Boundary problem within iomap.

Posting this more as an update (before the conference/call) on the current work,
since it has survived a few runs of xfstests -

Patches 1-4 convert ext2 regular file buffered I/O to use the iomap APIs.
Patches 5 & 6 are iomap improvements which Ojaswin and I noticed during code reviews.
Patch 7 optimizes the data access patterns for filesystems with indirect block
mappings. Thanks to Matthew Wilcox for pointing out the problem and providing a
rough solution to the BH_Boundary problem within iomap (which this patch is based
on).

Please note that I would still like to work on the following aspects before
thinking of merging any of these -

1. Look into how dir handling for ext2 should be done. Since ext2 uses the page
cache for that, can we directly use iomap? Do we need any other iomap ops for
implementing dir handling? Basically, either a PoC or some theoretical
understanding of how we should handle dirs for ext2.

2. Integrate Patch 4 into Patch 2. It is kept separate for ease of review.

3. Test patches 5 & 6 separately before thinking of getting those merged. The
changes look ok to me and I would like them to be reviewed, but I hope to
get more testing done on those patches individually, because they are not
dependent on this series.

4. Patch 7 is an early RFC to get an idea of whether it is taking the right
direction or not. If this looks ok, then I can polish the series, carefully
review any missed corner cases (hopefully I have covered them all),
work on the other points in this todo list and do more testing before posting
another version.

5. Write a few fstests to exercise the paths more for the overall series.

Ritesh Harjani (IBM) (7):
ext2: Remove comment related to journal handle
ext2: Convert ext2 regular file buffered I/O to use iomap
ext2: Enable large folio support
ext2: Implement seq counter for validating cached iomap
iomap: Fix iomap_adjust_read_range for plen calculation
iomap: Optimize iomap_read_folio
iomap: Optimize data access patterns for filesystems with indirect mappings

fs/ext2/balloc.c | 1 +
fs/ext2/ext2.h | 6 ++
fs/ext2/file.c | 20 +++++-
fs/ext2/inode.c | 126 +++++++++++++++++++++++++++++++++++---
fs/ext2/super.c | 2 +-
fs/iomap/buffered-io.c | 135 ++++++++++++++++++++++++++++++++---------
6 files changed, 248 insertions(+), 42 deletions(-)

--
2.44.0



2024-04-25 13:30:00

by Ritesh Harjani

Subject: [RFCv3 2/7] ext2: Convert ext2 regular file buffered I/O to use iomap

This patch converts ext2 regular file's buffered-io path to use iomap.
- buffered-io path using iomap_file_buffered_write
- DIO fallback to buffered-io now uses iomap_file_buffered_write
- writeback path now uses a new aops - ext2_file_aops
- truncate now uses iomap_truncate_page
- mmap path of ext2 continues to use generic_file_vm_ops

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext2/file.c | 20 ++++++++++++--
fs/ext2/inode.c | 69 ++++++++++++++++++++++++++++++++++++++++++++-----
2 files changed, 81 insertions(+), 8 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 4ddc36f4dbd4..ee5cd4a2f24f 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -252,7 +252,7 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)

iocb->ki_flags &= ~IOCB_DIRECT;
pos = iocb->ki_pos;
- status = generic_perform_write(iocb, from);
+ status = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
if (unlikely(status < 0)) {
ret = status;
goto out_unlock;
@@ -278,6 +278,22 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
return ret;
}

+static ssize_t ext2_buffered_write_iter(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ ssize_t ret = 0;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ inode_lock(inode);
+ ret = generic_write_checks(iocb, from);
+ if (ret > 0)
+ ret = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
+ inode_unlock(inode);
+ if (ret > 0)
+ ret = generic_write_sync(iocb, ret);
+ return ret;
+}
+
static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
#ifdef CONFIG_FS_DAX
@@ -299,7 +315,7 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (iocb->ki_flags & IOCB_DIRECT)
return ext2_dio_write_iter(iocb, from);

- return generic_file_write_iter(iocb, from);
+ return ext2_buffered_write_iter(iocb, from);
}

const struct file_operations ext2_file_operations = {
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index c4de3a94c4b2..f90d280025d9 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -877,10 +877,14 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE) && written == 0)
return -ENOTBLK;

- if (iomap->type == IOMAP_MAPPED &&
- written < length &&
- (flags & IOMAP_WRITE))
+ if (iomap->type == IOMAP_MAPPED && written < length &&
+ (flags & IOMAP_WRITE)) {
ext2_write_failed(inode->i_mapping, offset + length);
+ return 0;
+ }
+
+ if (iomap->flags & IOMAP_F_SIZE_CHANGED)
+ mark_inode_dirty(inode);
return 0;
}

@@ -912,6 +916,16 @@ static void ext2_readahead(struct readahead_control *rac)
mpage_readahead(rac, ext2_get_block);
}

+static int ext2_file_read_folio(struct file *file, struct folio *folio)
+{
+ return iomap_read_folio(folio, &ext2_iomap_ops);
+}
+
+static void ext2_file_readahead(struct readahead_control *rac)
+{
+ iomap_readahead(rac, &ext2_iomap_ops);
+}
+
static int
ext2_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, struct page **pagep, void **fsdata)
@@ -941,12 +955,41 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
return generic_block_bmap(mapping,block,ext2_get_block);
}

+static sector_t ext2_file_bmap(struct address_space *mapping, sector_t block)
+{
+ return iomap_bmap(mapping, block, &ext2_iomap_ops);
+}
+
static int
ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
return mpage_writepages(mapping, wbc, ext2_get_block);
}

+static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
+ struct inode *inode, loff_t offset,
+ unsigned len)
+{
+ if (offset >= wpc->iomap.offset &&
+ offset < wpc->iomap.offset + wpc->iomap.length)
+ return 0;
+
+ return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
+ IOMAP_WRITE, &wpc->iomap, NULL);
+}
+
+static const struct iomap_writeback_ops ext2_writeback_ops = {
+ .map_blocks = ext2_write_map_blocks,
+};
+
+static int ext2_file_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct iomap_writepage_ctx wpc = { };
+
+ return iomap_writepages(mapping, wbc, &wpc, &ext2_writeback_ops);
+}
+
static int
ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
@@ -955,6 +998,20 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc
return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
}

+const struct address_space_operations ext2_file_aops = {
+ .dirty_folio = iomap_dirty_folio,
+ .release_folio = iomap_release_folio,
+ .invalidate_folio = iomap_invalidate_folio,
+ .read_folio = ext2_file_read_folio,
+ .readahead = ext2_file_readahead,
+ .bmap = ext2_file_bmap,
+ .direct_IO = noop_direct_IO,
+ .writepages = ext2_file_writepages,
+ .migrate_folio = filemap_migrate_folio,
+ .is_partially_uptodate = iomap_is_partially_uptodate,
+ .error_remove_folio = generic_error_remove_folio,
+};
+
const struct address_space_operations ext2_aops = {
.dirty_folio = block_dirty_folio,
.invalidate_folio = block_invalidate_folio,
@@ -1279,8 +1336,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
error = dax_truncate_page(inode, newsize, NULL,
&ext2_iomap_ops);
else
- error = block_truncate_page(inode->i_mapping,
- newsize, ext2_get_block);
+ error = iomap_truncate_page(inode, newsize, NULL,
+ &ext2_iomap_ops);
if (error)
return error;

@@ -1370,7 +1427,7 @@ void ext2_set_file_ops(struct inode *inode)
if (IS_DAX(inode))
inode->i_mapping->a_ops = &ext2_dax_aops;
else
- inode->i_mapping->a_ops = &ext2_aops;
+ inode->i_mapping->a_ops = &ext2_file_aops;
}

struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
--
2.44.0


2024-04-25 13:30:12

by Ritesh Harjani

Subject: [RFCv3 4/7] ext2: Implement seq counter for validating cached iomap

There is a possibility of the following race with iomap during
writeback -

write_cache_pages()
cache extent covering 0..1MB range
write page at offset 0k
truncate(file, 4k)
drops all relevant pages
frees fs blocks
pwrite(file, 4k, 4k)
creates dirty page in the page cache
writes page at offset 4k to a stale block

This race can happen because iomap_writepages() keeps a cached extent mapping
(which can cover a large range) within struct iomap. While write_cache_pages()
is going over each folio, if a truncate happens in parallel on the next folio,
followed by a buffered write to the same offset within the file, this can
change the logical to physical block mapping that the cached iomap describes.
That means the cached iomap has now become stale.

This patch implements a sequence counter approach for revalidating stale
iomap mappings. i_blkseq gets incremented for every block
allocation/free. Here is what we do -

For ext2 buffered-writes, the block allocation happens at the
->write_iter time itself. So at writeback time,
1. We first cache the i_blkseq.
2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
already allocated.
3. Call ext2_get_blocks() the second time with length to be same as
the no. of blocks we know were already allocated.
4. Up to this point the cached i_blkseq remains valid, as no block
allocation has happened yet.
This means that in the next call to ->map_blocks(), we can verify whether
i_blkseq has raced with a truncate or not. If not, then the cached
mapping remains valid.

In case of a hole (which can happen with mmapped writes), we only allocate
1 block at a time anyway. So even if the i_blkseq value changes right
after, we need to allocate the next block in the subsequent
->map_blocks() call anyway.
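
To summarise, below is a condensed sketch (pseudo-C, mirroring the actual
patch further down; not a separate implementation) of how the cached mapping
gets revalidated at writeback time:

	/*
	 * The cached mapping is reusable only if it covers 'offset' and its
	 * validity cookie still matches the inode's current i_blkseq.
	 */
	static bool imap_valid(struct iomap_writepage_ctx *wpc,
			       struct inode *inode, loff_t offset)
	{
		if (offset < wpc->iomap.offset ||
		    offset >= wpc->iomap.offset + wpc->iomap.length)
			return false;
		return wpc->iomap.validity_cookie ==
		       READ_ONCE(EXT2_I(inode)->i_blkseq);
	}

	/*
	 * In ->map_blocks(): if the cached mapping is stale, sample i_blkseq
	 * into the validity cookie first, query the already allocated range
	 * with ext2_get_blocks(create = 0), and then re-fill wpc->iomap for
	 * exactly that range via ext2_iomap_begin().
	 */

Since every block allocation/free bumps i_blkseq, a truncate racing with
writeback invalidates the cookie and forces a fresh block lookup.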

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext2/balloc.c | 1 +
fs/ext2/ext2.h | 6 +++++
fs/ext2/inode.c | 57 ++++++++++++++++++++++++++++++++++++++++++++----
fs/ext2/super.c | 2 +-
4 files changed, 61 insertions(+), 5 deletions(-)

diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
index 1bfd6ab11038..047a8f41a6f5 100644
--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block,
}

ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);
+ ext2_inc_i_blkseq(EXT2_I(inode));

do_more:
overflow = 0;
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index f38bdd46e4f7..67b1acb08eb2 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -663,6 +663,7 @@ struct ext2_inode_info {
struct rw_semaphore xattr_sem;
#endif
rwlock_t i_meta_lock;
+ unsigned int i_blkseq;

/*
* truncate_mutex is for serialising ext2_truncate() against
@@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
return container_of(inode, struct ext2_inode_info, vfs_inode);
}

+static inline void ext2_inc_i_blkseq(struct ext2_inode_info *ei)
+{
+ WRITE_ONCE(ei->i_blkseq, READ_ONCE(ei->i_blkseq) + 1);
+}
+
/* balloc.c */
extern int ext2_bg_has_super(struct super_block *sb, int group);
extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 2b62786130b5..946a614ddfc0 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode,
ext2_fsblk_t current_block = 0;
int ret = 0;

+ ext2_inc_i_blkseq(EXT2_I(inode));
+
/*
* Here we try to allocate the requested multiple blocks at once,
* on a best-effort basis.
@@ -966,15 +968,62 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
return mpage_writepages(mapping, wbc, ext2_get_block);
}

+static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode,
+ loff_t offset)
+{
+ if (offset < wpc->iomap.offset ||
+ offset >= wpc->iomap.offset + wpc->iomap.length)
+ return false;
+
+ if (wpc->iomap.validity_cookie != READ_ONCE(EXT2_I(inode)->i_blkseq))
+ return false;
+
+ return true;
+}
+
static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
struct inode *inode, loff_t offset,
unsigned len)
{
- if (offset >= wpc->iomap.offset &&
- offset < wpc->iomap.offset + wpc->iomap.length)
+ loff_t maxblocks = (loff_t)INT_MAX;
+ u8 blkbits = inode->i_blkbits;
+ u32 bno;
+ bool new, boundary;
+ int ret;
+
+ if (ext2_imap_valid(wpc, inode, offset))
return 0;

- return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
+ /*
+ * For ext2 buffered-writes, the block allocation happens at the
+ * ->write_iter time itself. So at writeback time -
+ * 1. We first cache the i_blkseq.
+ * 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
+ * already allocated.
+ * 3. Call ext2_get_blocks() the second time with length to be same as
+ * the no. of blocks we know were already allocated.
+ * 4. Till now it means, the cached i_blkseq remains valid as no block
+ * allocation has happened yet.
+ * This means the next call to ->map_blocks(), we can verify whether the
+ * i_blkseq has raced with truncate or not. If not, then i_blkseq will
+ * remain valid.
+ *
+ * In case of a hole (could happen with mmaped writes), we only allocate
+ * 1 block at a time anyways. So even if the i_blkseq value changes, we
+ * anyway need to allocate the next block in subsequent ->map_blocks()
+ * call.
+ */
+ wpc->iomap.validity_cookie = READ_ONCE(EXT2_I(inode)->i_blkseq);
+
+ ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
+ &bno, &new, &boundary, 0);
+ if (ret < 0)
+ return ret;
+ /*
+ * ret can be 0 in case of a hole which is possible for mmaped writes.
+ */
+ ret = ret ? ret : 1;
+ return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
IOMAP_WRITE, &wpc->iomap, NULL);
}

@@ -1000,7 +1049,7 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc

const struct address_space_operations ext2_file_aops = {
.dirty_folio = iomap_dirty_folio,
- .release_folio = iomap_release_folio,
+ .release_folio = iomap_release_folio,
.invalidate_folio = iomap_invalidate_folio,
.read_folio = ext2_file_read_folio,
.readahead = ext2_file_readahead,
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 37f7ce56adce..32f5386284d6 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
#ifdef CONFIG_QUOTA
memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
#endif
-
+ WRITE_ONCE(ei->i_blkseq, 0);
return &ei->vfs_inode;
}

--
2.44.0


2024-04-25 13:30:27

by Ritesh Harjani

Subject: [RFCv3 5/7] iomap: Fix iomap_adjust_read_range for plen calculation

If the extent spans the block that contains the i_size, we need to
handle both halves separately but only when the i_size is within the
current folio under processing.
"orig_pos + length > isize" can be true for all folios if the mapped
extent length is greater than the folio size. That causes plen to be cut
short for every folio instead of only the last folio.

So use orig_plen for checking if "orig_pos + orig_plen > isize".
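
As a worked example (hypothetical numbers, purely for illustration): take a
1k block size, 4k folios, i_size = 5k and a single mapped extent covering
0..64k. For the first folio (orig_pos = 0, entirely below i_size), the old
check "0 + 64k > 5k" is true, the end block computed from i_size wraps into
this folio, and plen gets clamped from 4k down to 1k even though nothing in
the folio is beyond i_size. With orig_plen the check becomes "0 + 4k > 5k",
which is false, so the whole folio is processed in one pass and only the
folio that actually contains i_size takes the special handling.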

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
cc: Ojaswin Mujoo <[email protected]>
---
fs/iomap/buffered-io.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e8e41c8b3c0..9f79c82d1f73 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -241,6 +241,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
unsigned block_size = (1 << block_bits);
size_t poff = offset_in_folio(folio, *pos);
size_t plen = min_t(loff_t, folio_size(folio) - poff, length);
+ size_t orig_plen = plen;
unsigned first = poff >> block_bits;
unsigned last = (poff + plen - 1) >> block_bits;

@@ -277,7 +278,7 @@ static void iomap_adjust_read_range(struct inode *inode, struct folio *folio,
* handle both halves separately so that we properly zero data in the
* page cache for blocks that are entirely outside of i_size.
*/
- if (orig_pos <= isize && orig_pos + length > isize) {
+ if (orig_pos <= isize && orig_pos + orig_plen > isize) {
unsigned end = offset_in_folio(folio, isize - 1) >> block_bits;

if (first <= end && last > end)
--
2.44.0


2024-04-25 13:30:32

by Ritesh Harjani

Subject: [RFCv3 6/7] iomap: Optimize iomap_read_folio

iomap_readpage_iter() handles "uptodate blocks" and "not uptodate blocks"
within a folio separately. This makes iomap_read_folio() call into
->iomap_begin() to request an extent mapping even though it might already
have an extent which is not fully processed.

This happens when we either have a large folio or bs < ps. In these
cases we can have sub-blocks which are already uptodate (say, e.g. due to
previous writes). With iomap_read_folio_iter(), this is handled more
efficiently by not calling ->iomap_begin() until all the sub-blocks
within the current folio are processed.

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
cc: Ojaswin Mujoo <[email protected]>
---
fs/iomap/buffered-io.c | 20 +++++++++++++++++++-
1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9f79c82d1f73..0a4269095ae2 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -444,6 +444,24 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
return pos - orig_pos + plen;
}

+static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
+ struct iomap_readpage_ctx *ctx)
+{
+ struct folio *folio = ctx->cur_folio;
+ size_t pos = offset_in_folio(folio, iter->pos);
+ loff_t length = min_t(loff_t, folio_size(folio) - pos,
+ iomap_length(iter));
+ loff_t done, ret;
+
+ for (done = 0; done < length; done += ret) {
+ ret = iomap_readpage_iter(iter, ctx, done);
+ if (ret <= 0)
+ return ret;
+ }
+
+ return done;
+}
+
int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
{
struct iomap_iter iter = {
@@ -459,7 +477,7 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
trace_iomap_readpage(iter.inode, 1);

while ((ret = iomap_iter(&iter, ops)) > 0)
- iter.processed = iomap_readpage_iter(&iter, &ctx, 0);
+ iter.processed = iomap_read_folio_iter(&iter, &ctx);

if (ret < 0)
folio_set_error(folio);
--
2.44.0


2024-04-25 13:30:53

by Ritesh Harjani

Subject: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings

This patch optimizes the data access patterns for filesystems with
indirect block mapping by implementing BH_Boundary handling within
iomap.

Currently the bios for reads within iomap are only submitted in
2 places -
1. If we cannot merge the new request with the previous bio, only then do
we submit the previous bio.
2. At the end of the entire read processing.

This means that for filesystems with indirect block mapping, we call into
->iomap_begin() again without submitting the previous bios. That causes
unoptimized data access patterns for blocks which are of BH_Boundary type.

For example, consider the file mapping -
logical block(4k) physical block(4k)
0-11 1000-1011
12-15 1013-1016

In the above, physical block 1012 is an indirect metadata block which holds the
mapping information for the next set of blocks (1013-1016).
With iomap buffered reads for reading the first 16 logical blocks of a file
(0-15), we get the below I/O pattern -
- submit a bio for 1012
- complete the bio for 1012
- submit a bio for 1000-1011
- submit a bio for 1013-1016
- complete the bios for 1000-1011
- complete the bios for 1013-1016

So as we can see, the above is a non-optimal I/O access pattern, and we also
get 3 bio completions instead of 2.

This patch changes this behavior by submitting any already prepared bio
before calling ->iomap_begin() again.
That means any blocks which have already been processed get submitted for
I/O earlier, and then within ->iomap_begin(), if we need to read an indirect
metadata block, the block layer can merge that read with the already
submitted request to reduce the no. of bio completions.
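
For comparison, with this patch the same 0-15 read would roughly produce
(an illustrative pattern; the ftrace output below shows the merge happening
for a smaller read) -
- submit a bio for 1000-1011
- ->iomap_begin() issues the read for indirect block 1012, which the block
  layer can merge with the already submitted request
- submit a bio for 1013-1016
- complete the (merged) bio covering 1000-1012
- complete the bio for 1013-1016
i.e. a mostly sequential access pattern and only 2 bio completions.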

Now, for bs < ps or for large folios, this patch requires proper handling
of "ifs->read_bytes_pending". We first set ifs->read_bytes_pending to the
folio size, and then handle all the cases where we need to subtract from
ifs->read_bytes_pending: on the submission side (if we don't need to submit
any I/O, e.g. for uptodate sub-blocks), on an I/O error, or at the
completion of an I/O.
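
On the submission side, the accounting boils down to the following
(a condensed pseudo-C sketch of the changes below; "bytes_to_read" and
"bytes_needing_no_io" are illustrative names only) -

	spin_lock_irq(&ifs->state_lock);
	if (!ifs->read_bytes_pending)		/* first range in this folio */
		ifs->read_bytes_pending = bytes_to_read;
	/* ranges that need no I/O, i.e. already uptodate or zeroed */
	ifs->read_bytes_pending -= bytes_needing_no_io;
	rbp_finished = (ifs->read_bytes_pending == 0);
	spin_unlock_irq(&ifs->state_lock);

The folio is then unlocked either from bio completion (once the last pending
bytes drop to zero) or directly on the submission/error path when nothing was
queued for it.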

Here is the ftrace output of iomap and the block layer with the ext2 iomap
conversion patches -

root# filefrag -b512 -v /mnt1/test/f1
Filesystem type is: ef53
Filesystem cylinder groups approximately 32
File size of /mnt1/test/f1 is 65536 (128 blocks of 512 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 95: 98304.. 98399: 96: merged
1: 96.. 127: 98408.. 98439: 32: 98400: last,merged,eof
/mnt1/test/f1: 2 extents found

root# #This reads 4 blocks starting from lblk 10, 11, 12, 13
root# xfs_io -c "pread -b$((4*4096)) $((10*4096)) $((4*4096))" /mnt1/test/f1

w/o this patch - (the indirect block is submitted first and does not get merged, resulting in 3 bio completions)
xfs_io-907 [002] ..... 185.608791: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
xfs_io-907 [002] ..... 185.608819: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
xfs_io-907 [002] ..... 185.608823: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
xfs_io-907 [002] ..... 185.608831: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
xfs_io-907 [002] ..... 185.608859: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
xfs_io-907 [002] ..... 185.608865: block_getrq: 8,16 R 98400 + 8 [xfs_io]
xfs_io-907 [002] ..... 185.608867: block_io_start: 8,16 R 4096 () 98400 + 8 [xfs_io]
xfs_io-907 [002] ..... 185.608869: block_plug: [xfs_io]
xfs_io-907 [002] ..... 185.608872: block_unplug: [xfs_io] 1
xfs_io-907 [002] ..... 185.608874: block_rq_insert: 8,16 R 4096 () 98400 + 8 [xfs_io]
kworker/2:1H-198 [002] ..... 185.608908: block_rq_issue: 8,16 R 4096 () 98400 + 8 [kworker/2:1H]
<idle>-0 [002] d.h2. 185.609579: block_rq_complete: 8,16 R () 98400 + 8 [0]
<idle>-0 [002] dNh2. 185.609631: block_io_done: 8,16 R 0 () 98400 + 0 [swapper/2]
xfs_io-907 [002] ..... 185.609694: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
xfs_io-907 [002] ..... 185.609704: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
xfs_io-907 [002] ..... 185.609718: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
xfs_io-907 [002] ..... 185.609721: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
xfs_io-907 [002] ..... 185.609726: block_plug: [xfs_io]
xfs_io-907 [002] ..... 185.609735: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
xfs_io-907 [002] ..... 185.609736: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
xfs_io-907 [002] ..... 185.609740: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
xfs_io-907 [002] ..... 185.609741: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
xfs_io-907 [002] ..... 185.609756: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
xfs_io-907 [002] ..... 185.609769: block_rq_issue: 8,16 RA 8192 () 98384 + 16 [xfs_io]
<idle>-0 [002] d.H2. 185.610280: block_rq_complete: 8,16 RA () 98408 + 16 [0]
<idle>-0 [002] d.H2. 185.610289: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/2]
<idle>-0 [002] d.H2. 185.610292: block_rq_complete: 8,16 RA () 98384 + 16 [0]
<idle>-0 [002] dNH2. 185.610301: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/2]

v/s with the patch - (optimized I/O access pattern and the bio gets merged, resulting in only 2 bio completions)
xfs_io-944 [005] ..... 99.926187: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
xfs_io-944 [005] ..... 99.926208: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
xfs_io-944 [005] ..... 99.926211: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
xfs_io-944 [005] ..... 99.926222: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
xfs_io-944 [005] ..... 99.926232: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
xfs_io-944 [005] ..... 99.926233: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
xfs_io-944 [005] ..... 99.926234: block_plug: [xfs_io]
xfs_io-944 [005] ..... 99.926235: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
xfs_io-944 [005] ..... 99.926261: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
xfs_io-944 [005] ..... 99.926266: block_bio_backmerge: 8,16 R 98400 + 8 [xfs_io]
xfs_io-944 [005] ..... 99.926271: block_unplug: [xfs_io] 1
xfs_io-944 [005] ..... 99.926272: block_rq_insert: 8,16 RA 12288 () 98384 + 24 [xfs_io]
kworker/5:1H-234 [005] ..... 99.926314: block_rq_issue: 8,16 RA 12288 () 98384 + 24 [kworker/5:1H]
<idle>-0 [005] d.h2. 99.926905: block_rq_complete: 8,16 RA () 98384 + 24 [0]
<idle>-0 [005] dNh2. 99.926931: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/5]
xfs_io-944 [005] ..... 99.926971: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
xfs_io-944 [005] ..... 99.926981: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
xfs_io-944 [005] ..... 99.926989: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
xfs_io-944 [005] ..... 99.926989: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
xfs_io-944 [005] ..... 99.926991: block_plug: [xfs_io]
xfs_io-944 [005] ..... 99.926993: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
xfs_io-944 [005] ..... 99.927001: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
<idle>-0 [005] d.h2. 99.927397: block_rq_complete: 8,16 RA () 98408 + 16 [0]
<idle>-0 [005] dNh2. 99.927414: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/5]

Suggested-by: Matthew Wilcox <[email protected]>
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
cc: Ojaswin Mujoo <[email protected]>
---
fs/iomap/buffered-io.c | 112 +++++++++++++++++++++++++++++++----------
1 file changed, 85 insertions(+), 27 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0a4269095ae2..a1d50086a3f5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -30,7 +30,7 @@ typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length);
*/
struct iomap_folio_state {
spinlock_t state_lock;
- unsigned int read_bytes_pending;
+ size_t read_bytes_pending;
atomic_t write_bytes_pending;

/*
@@ -380,6 +380,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
loff_t orig_pos = pos;
size_t poff, plen;
sector_t sector;
+ bool rbp_finished = false;

if (iomap->type == IOMAP_INLINE)
return iomap_read_inline_data(iter, folio);
@@ -387,21 +388,39 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
/* zero post-eof blocks as the page may be mapped */
ifs = ifs_alloc(iter->inode, folio, iter->flags);
iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
+
+ if (ifs) {
+ loff_t to_read = min_t(loff_t, iter->len - offset,
+ folio_size(folio) - offset_in_folio(folio, orig_pos));
+ size_t padjust;
+
+ spin_lock_irq(&ifs->state_lock);
+ if (!ifs->read_bytes_pending)
+ ifs->read_bytes_pending = to_read;
+ padjust = pos - orig_pos;
+ ifs->read_bytes_pending -= padjust;
+ if (!ifs->read_bytes_pending)
+ rbp_finished = true;
+ spin_unlock_irq(&ifs->state_lock);
+ }
+
if (plen == 0)
goto done;

if (iomap_block_needs_zeroing(iter, pos)) {
+ if (ifs) {
+ spin_lock_irq(&ifs->state_lock);
+ ifs->read_bytes_pending -= plen;
+ if (!ifs->read_bytes_pending)
+ rbp_finished = true;
+ spin_unlock_irq(&ifs->state_lock);
+ }
folio_zero_range(folio, poff, plen);
iomap_set_range_uptodate(folio, poff, plen);
goto done;
}

ctx->cur_folio_in_bio = true;
- if (ifs) {
- spin_lock_irq(&ifs->state_lock);
- ifs->read_bytes_pending += plen;
- spin_unlock_irq(&ifs->state_lock);
- }

sector = iomap_sector(iomap, pos);
if (!ctx->bio ||
@@ -435,6 +454,14 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
}

done:
+ /*
+ * If there is no bio prepared and if rbp is finished and
+ * this was the last offset within this folio then mark
+ * cur_folio_in_bio to false.
+ */
+ if (!ctx->bio && rbp_finished &&
+ offset_in_folio(folio, pos + plen) == 0)
+ ctx->cur_folio_in_bio = false;
/*
* Move the caller beyond our range so that it keeps making progress.
* For that, we have to include any leading non-uptodate ranges, but
@@ -459,9 +486,43 @@ static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
return ret;
}

+ if (ctx->bio) {
+ submit_bio(ctx->bio);
+ WARN_ON_ONCE(!ctx->cur_folio_in_bio);
+ ctx->bio = NULL;
+ }
+ if (offset_in_folio(folio, iter->pos + done) == 0 &&
+ !ctx->cur_folio_in_bio) {
+ folio_unlock(ctx->cur_folio);
+ }
+
return done;
}

+static void iomap_handle_read_error(struct iomap_readpage_ctx *ctx,
+ struct iomap_iter *iter)
+{
+ struct folio *folio = ctx->cur_folio;
+ struct iomap_folio_state *ifs;
+ unsigned long flags;
+ bool rbp_finished = false;
+ size_t rbp_adjust = folio_size(folio) - offset_in_folio(folio,
+ iter->pos);
+ ifs = folio->private;
+ if (!ifs || !ifs->read_bytes_pending)
+ goto unlock;
+
+ spin_lock_irqsave(&ifs->state_lock, flags);
+ ifs->read_bytes_pending -= rbp_adjust;
+ if (!ifs->read_bytes_pending)
+ rbp_finished = true;
+ spin_unlock_irqrestore(&ifs->state_lock, flags);
+
+unlock:
+ if (rbp_finished || !ctx->cur_folio_in_bio)
+ folio_unlock(folio);
+}
+
int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
{
struct iomap_iter iter = {
@@ -479,15 +540,9 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
while ((ret = iomap_iter(&iter, ops)) > 0)
iter.processed = iomap_read_folio_iter(&iter, &ctx);

- if (ret < 0)
+ if (ret < 0) {
folio_set_error(folio);
-
- if (ctx.bio) {
- submit_bio(ctx.bio);
- WARN_ON_ONCE(!ctx.cur_folio_in_bio);
- } else {
- WARN_ON_ONCE(ctx.cur_folio_in_bio);
- folio_unlock(folio);
+ iomap_handle_read_error(&ctx, &iter);
}

/*
@@ -506,12 +561,6 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
loff_t done, ret;

for (done = 0; done < length; done += ret) {
- if (ctx->cur_folio &&
- offset_in_folio(ctx->cur_folio, iter->pos + done) == 0) {
- if (!ctx->cur_folio_in_bio)
- folio_unlock(ctx->cur_folio);
- ctx->cur_folio = NULL;
- }
if (!ctx->cur_folio) {
ctx->cur_folio = readahead_folio(ctx->rac);
ctx->cur_folio_in_bio = false;
@@ -519,6 +568,17 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
ret = iomap_readpage_iter(iter, ctx, done);
if (ret <= 0)
return ret;
+ if (ctx->cur_folio && offset_in_folio(ctx->cur_folio,
+ iter->pos + done + ret) == 0) {
+ if (!ctx->cur_folio_in_bio)
+ folio_unlock(ctx->cur_folio);
+ ctx->cur_folio = NULL;
+ }
+ }
+
+ if (ctx->bio) {
+ submit_bio(ctx->bio);
+ ctx->bio = NULL;
}

return done;
@@ -549,18 +609,16 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
struct iomap_readpage_ctx ctx = {
.rac = rac,
};
+ int ret = 0;

trace_iomap_readahead(rac->mapping->host, readahead_count(rac));

- while (iomap_iter(&iter, ops) > 0)
+ while ((ret = iomap_iter(&iter, ops)) > 0)
iter.processed = iomap_readahead_iter(&iter, &ctx);

- if (ctx.bio)
- submit_bio(ctx.bio);
- if (ctx.cur_folio) {
- if (!ctx.cur_folio_in_bio)
- folio_unlock(ctx.cur_folio);
- }
+ if (ret < 0 && ctx.cur_folio)
+ iomap_handle_read_error(&ctx, &iter);
+
}
EXPORT_SYMBOL_GPL(iomap_readahead);

--
2.44.0


2024-04-25 13:53:52

by Matthew Wilcox

Subject: Re: [RFCv3 6/7] iomap: Optimize iomap_read_folio

On Thu, Apr 25, 2024 at 06:58:50PM +0530, Ritesh Harjani (IBM) wrote:
> +static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
> + struct iomap_readpage_ctx *ctx)
> +{
> + struct folio *folio = ctx->cur_folio;
> + size_t pos = offset_in_folio(folio, iter->pos);

"pos" is position in file. You should call this 'offset'.

> + loff_t length = min_t(loff_t, folio_size(folio) - pos,
> + iomap_length(iter));
> + loff_t done, ret;
> +
> + for (done = 0; done < length; done += ret) {
> + ret = iomap_readpage_iter(iter, ctx, done);
> + if (ret <= 0)
> + return ret;
> + }

2024-04-26 06:50:09

by Christoph Hellwig

Subject: Re: [RFCv3 5/7] iomap: Fix iomap_adjust_read_range for plen calculation

On Thu, Apr 25, 2024 at 06:58:49PM +0530, Ritesh Harjani (IBM) wrote:
> If the extent spans the block that contains the i_size, we need to

s/the i_size/i_size/.

> handle both halves separately

.. so that we properly zero data in the page cache for blocks that are
entirely outside of i_size.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <[email protected]>


2024-04-26 06:53:50

by Christoph Hellwig

Subject: Re: [RFCv3 6/7] iomap: Optimize iomap_read_folio

On Thu, Apr 25, 2024 at 06:58:50PM +0530, Ritesh Harjani (IBM) wrote:
> iomap_readpage_iter() handles "uptodate blocks" and "not uptodate blocks"
> within a folio separately. This makes iomap_read_folio() to call into
> ->iomap_begin() to request for extent mapping even though it might already
> have an extent which is not fully processed.
>
> This happens when we either have a large folio or with bs < ps. In these
> cases we can have sub blocks which can be uptodate (say for e.g. due to
> previous writes). With iomap_read_folio_iter(), this is handled more
> efficiently by not calling ->iomap_begin() call until all the sub blocks
> with the current folio are processed.

Maybe throw in a sentence here that this copies what
iomap_readahead_iter already does?

Otherwise this looks good to me modulo the offset comment from willy.

2024-04-26 08:50:33

by Ritesh Harjani

Subject: Re: [RFCv3 6/7] iomap: Optimize iomap_read_folio

Christoph Hellwig <[email protected]> writes:

> On Thu, Apr 25, 2024 at 06:58:50PM +0530, Ritesh Harjani (IBM) wrote:
>> iomap_readpage_iter() handles "uptodate blocks" and "not uptodate blocks"
>> within a folio separately. This makes iomap_read_folio() to call into
>> ->iomap_begin() to request for extent mapping even though it might already
>> have an extent which is not fully processed.
>>
>> This happens when we either have a large folio or with bs < ps. In these
>> cases we can have sub blocks which can be uptodate (say for e.g. due to
>> previous writes). With iomap_read_folio_iter(), this is handled more
>> efficiently by not calling ->iomap_begin() call until all the sub blocks
>> with the current folio are processed.
>
> Maybe throw in a sentence here that this copies what
> iomap_readahead_iter already does?

Does this sound any better?

iomap_read_folio_iter() handles multiple sub-blocks within a given
folio, but its implementation logic is similar to how
iomap_readahead_iter() handles multiple folios within a single mapped
extent. Both of them iterate over a given range of folio/mapped extent
and call iomap_readpage_iter() for reading.


>
> Otherwise this looks good to me modulo the offset comment from willy.

Yes, I will address willy's comment too.
Thanks for the review!

-ritesh

2024-04-26 08:53:00

by Ritesh Harjani

Subject: Re: [RFCv3 5/7] iomap: Fix iomap_adjust_read_range for plen calculation

Christoph Hellwig <[email protected]> writes:

> On Thu, Apr 25, 2024 at 06:58:49PM +0530, Ritesh Harjani (IBM) wrote:
>> If the extent spans the block that contains the i_size, we need to
>
> s/the i_size/i_size/.
>
>> handle both halves separately
>
> .. so that we properly zero data in the page cache for blocks that are
> entirely outside of i_size.

Sure.

>
> Otherwise looks good:
>
> Reviewed-by: Christoph Hellwig <[email protected]>

Thanks for the review.

-ritesh

2024-04-26 15:29:58

by Darrick J. Wong

Subject: Re: [RFCv3 2/7] ext2: Convert ext2 regular file buffered I/O to use iomap

On Thu, Apr 25, 2024 at 06:58:46PM +0530, Ritesh Harjani (IBM) wrote:
> This patch converts ext2 regular file's buffered-io path to use iomap.
> - buffered-io path using iomap_file_buffered_write
> - DIO fallback to buffered-io now uses iomap_file_buffered_write
> - writeback path now uses a new aops - ext2_file_aops
> - truncate now uses iomap_truncate_page
> - mmap path of ext2 continues to use generic_file_vm_ops
>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
> ---
> fs/ext2/file.c | 20 ++++++++++++--
> fs/ext2/inode.c | 69 ++++++++++++++++++++++++++++++++++++++++++++-----
> 2 files changed, 81 insertions(+), 8 deletions(-)
>
> diff --git a/fs/ext2/file.c b/fs/ext2/file.c
> index 4ddc36f4dbd4..ee5cd4a2f24f 100644
> --- a/fs/ext2/file.c
> +++ b/fs/ext2/file.c
> @@ -252,7 +252,7 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>
> iocb->ki_flags &= ~IOCB_DIRECT;
> pos = iocb->ki_pos;
> - status = generic_perform_write(iocb, from);
> + status = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
> if (unlikely(status < 0)) {
> ret = status;
> goto out_unlock;
> @@ -278,6 +278,22 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> return ret;
> }
>
> +static ssize_t ext2_buffered_write_iter(struct kiocb *iocb,
> + struct iov_iter *from)
> +{
> + ssize_t ret = 0;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + inode_lock(inode);
> + ret = generic_write_checks(iocb, from);
> + if (ret > 0)
> + ret = iomap_file_buffered_write(iocb, from, &ext2_iomap_ops);
> + inode_unlock(inode);
> + if (ret > 0)
> + ret = generic_write_sync(iocb, ret);
> + return ret;
> +}
> +
> static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> {
> #ifdef CONFIG_FS_DAX
> @@ -299,7 +315,7 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (iocb->ki_flags & IOCB_DIRECT)
> return ext2_dio_write_iter(iocb, from);
>
> - return generic_file_write_iter(iocb, from);
> + return ext2_buffered_write_iter(iocb, from);
> }
>
> const struct file_operations ext2_file_operations = {
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index c4de3a94c4b2..f90d280025d9 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -877,10 +877,14 @@ ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE) && written == 0)
> return -ENOTBLK;
>
> - if (iomap->type == IOMAP_MAPPED &&
> - written < length &&
> - (flags & IOMAP_WRITE))
> + if (iomap->type == IOMAP_MAPPED && written < length &&
> + (flags & IOMAP_WRITE)) {
> ext2_write_failed(inode->i_mapping, offset + length);
> + return 0;
> + }
> +
> + if (iomap->flags & IOMAP_F_SIZE_CHANGED)
> + mark_inode_dirty(inode);
> return 0;
> }
>
> @@ -912,6 +916,16 @@ static void ext2_readahead(struct readahead_control *rac)
> mpage_readahead(rac, ext2_get_block);
> }
>
> +static int ext2_file_read_folio(struct file *file, struct folio *folio)
> +{
> + return iomap_read_folio(folio, &ext2_iomap_ops);
> +}
> +
> +static void ext2_file_readahead(struct readahead_control *rac)
> +{
> + iomap_readahead(rac, &ext2_iomap_ops);
> +}
> +
> static int
> ext2_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, struct page **pagep, void **fsdata)
> @@ -941,12 +955,41 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
> return generic_block_bmap(mapping,block,ext2_get_block);
> }
>
> +static sector_t ext2_file_bmap(struct address_space *mapping, sector_t block)
> +{
> + return iomap_bmap(mapping, block, &ext2_iomap_ops);
> +}
> +
> static int
> ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
> {
> return mpage_writepages(mapping, wbc, ext2_get_block);
> }
>
> +static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
> + struct inode *inode, loff_t offset,
> + unsigned len)
> +{
> + if (offset >= wpc->iomap.offset &&
> + offset < wpc->iomap.offset + wpc->iomap.length)
> + return 0;
> +
> + return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> + IOMAP_WRITE, &wpc->iomap, NULL);
> +}

Soooo... this is almost a directio write of the pagecache? ;)

> +
> +static const struct iomap_writeback_ops ext2_writeback_ops = {
> + .map_blocks = ext2_write_map_blocks,
> +};
> +
> +static int ext2_file_writepages(struct address_space *mapping,
> + struct writeback_control *wbc)
> +{
> + struct iomap_writepage_ctx wpc = { };
> +
> + return iomap_writepages(mapping, wbc, &wpc, &ext2_writeback_ops);
> +}
> +
> static int
> ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc)
> {
> @@ -955,6 +998,20 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc
> return dax_writeback_mapping_range(mapping, sbi->s_daxdev, wbc);
> }
>
> +const struct address_space_operations ext2_file_aops = {
> + .dirty_folio = iomap_dirty_folio,
> + .release_folio = iomap_release_folio,

trailing space here ^

> + .invalidate_folio = iomap_invalidate_folio,
> + .read_folio = ext2_file_read_folio,
> + .readahead = ext2_file_readahead,
> + .bmap = ext2_file_bmap,
> + .direct_IO = noop_direct_IO,

Nowadays, it is preferred to set FMODE_CAN_ODIRECT and skip setting
->direct_IO. But I see that ext2 hasn't been converted, so this is a
minor point.
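
(For reference, a minimal sketch of that alternative - assuming ext2 keeps
dquot_file_open() as its ->open - could look like:

	static int ext2_file_open(struct inode *inode, struct file *filp)
	{
		filp->f_mode |= FMODE_CAN_ODIRECT;
		return dquot_file_open(inode, filp);
	}

with ->direct_IO then dropped from the new aops.)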

> + .writepages = ext2_file_writepages,
> + .migrate_folio = filemap_migrate_folio,
> + .is_partially_uptodate = iomap_is_partially_uptodate,
> + .error_remove_folio = generic_error_remove_folio,
> +};
> +
> const struct address_space_operations ext2_aops = {

I wonder, could directories and symlinks get converted to iomap at some
point? (It's ok if that is not in scope for this series.)

Looks good to me,
Reviewed-by: Darrick J. Wong <[email protected]>

--D

> .dirty_folio = block_dirty_folio,
> .invalidate_folio = block_invalidate_folio,
> @@ -1279,8 +1336,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
> error = dax_truncate_page(inode, newsize, NULL,
> &ext2_iomap_ops);
> else
> - error = block_truncate_page(inode->i_mapping,
> - newsize, ext2_get_block);
> + error = iomap_truncate_page(inode, newsize, NULL,
> + &ext2_iomap_ops);
> if (error)
> return error;
>
> @@ -1370,7 +1427,7 @@ void ext2_set_file_ops(struct inode *inode)
> if (IS_DAX(inode))
> inode->i_mapping->a_ops = &ext2_dax_aops;
> else
> - inode->i_mapping->a_ops = &ext2_aops;
> + inode->i_mapping->a_ops = &ext2_file_aops;
> }
>
> struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
> --
> 2.44.0
>
>

2024-04-26 15:42:06

by Darrick J. Wong

Subject: Re: [RFCv3 4/7] ext2: Implement seq counter for validating cached iomap

On Thu, Apr 25, 2024 at 06:58:48PM +0530, Ritesh Harjani (IBM) wrote:
> There is a possibility of following race with iomap during
> writebck -
>
> write_cache_pages()
> cache extent covering 0..1MB range
> write page at offset 0k
> truncate(file, 4k)
> drops all relevant pages
> frees fs blocks
> pwrite(file, 4k, 4k)
> creates dirty page in the page cache
> writes page at offset 4k to a stale block
>
> This race can happen because iomap_writepages() keeps a cached extent mapping
> within struct iomap. While write_cache_pages() is going over each folio,
> (can cache a large extent range), if a truncate happens in parallel on the
> next folio followed by a buffered write to the same offset within the file,
> this can change logical to physical offset of the cached iomap mapping.
> That means, the cached iomap has now become stale.
>
> This patch implements the seq counter approach for revalidation of stale
> iomap mappings. i_blkseq will get incremented for every block
> allocation/free. Here is what we do -
>
> For ext2 buffered-writes, the block allocation happens at the
> ->write_iter time itself. So at writeback time,
> 1. We first cache the i_blkseq.
> 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
> already allocated.
> 3. Call ext2_get_blocks() the second time with length to be same as
> the no. of blocks we know were already allocated.
> 4. Till now it means, the cached i_blkseq remains valid as no block
> allocation has happened yet.
> This means the next call to ->map_blocks(), we can verify whether the
> i_blkseq has raced with truncate or not. If not, then i_blkseq will
> remain valid.
>
> In case of a hole (could happen with mmaped writes), we only allocate
> 1 block at a time anyways. So even if the i_blkseq value changes right
> after, we anyway need to allocate the next block in subsequent
> ->map_blocks() call.
>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
> ---
> fs/ext2/balloc.c | 1 +
> fs/ext2/ext2.h | 6 +++++
> fs/ext2/inode.c | 57 ++++++++++++++++++++++++++++++++++++++++++++----
> fs/ext2/super.c | 2 +-
> 4 files changed, 61 insertions(+), 5 deletions(-)
>
> diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
> index 1bfd6ab11038..047a8f41a6f5 100644
> --- a/fs/ext2/balloc.c
> +++ b/fs/ext2/balloc.c
> @@ -495,6 +495,7 @@ void ext2_free_blocks(struct inode * inode, ext2_fsblk_t block,
> }
>
> ext2_debug ("freeing block(s) %lu-%lu\n", block, block + count - 1);
> + ext2_inc_i_blkseq(EXT2_I(inode));
>
> do_more:
> overflow = 0;
> diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
> index f38bdd46e4f7..67b1acb08eb2 100644
> --- a/fs/ext2/ext2.h
> +++ b/fs/ext2/ext2.h
> @@ -663,6 +663,7 @@ struct ext2_inode_info {
> struct rw_semaphore xattr_sem;
> #endif
> rwlock_t i_meta_lock;
> + unsigned int i_blkseq;
>
> /*
> * truncate_mutex is for serialising ext2_truncate() against
> @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
> return container_of(inode, struct ext2_inode_info, vfs_inode);
> }
>
> +static inline void ext2_inc_i_blkseq(struct ext2_inode_info *ei)
> +{
> + WRITE_ONCE(ei->i_blkseq, READ_ONCE(ei->i_blkseq) + 1);
> +}
> +
> /* balloc.c */
> extern int ext2_bg_has_super(struct super_block *sb, int group);
> extern unsigned long ext2_bg_num_gdb(struct super_block *sb, int group);
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 2b62786130b5..946a614ddfc0 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -406,6 +406,8 @@ static int ext2_alloc_blocks(struct inode *inode,
> ext2_fsblk_t current_block = 0;
> int ret = 0;
>
> + ext2_inc_i_blkseq(EXT2_I(inode));
> +
> /*
> * Here we try to allocate the requested multiple blocks at once,
> * on a best-effort basis.
> @@ -966,15 +968,62 @@ ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
> return mpage_writepages(mapping, wbc, ext2_get_block);
> }
>
> +static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc, struct inode *inode,
> + loff_t offset)

ext2_iomap_valid, to stay consistent with the ext4 conversion series?

> +{
> + if (offset < wpc->iomap.offset ||
> + offset >= wpc->iomap.offset + wpc->iomap.length)
> + return false;
> +
> + if (wpc->iomap.validity_cookie != READ_ONCE(EXT2_I(inode)->i_blkseq))
> + return false;
> +
> + return true;
> +}
> +
> static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
> struct inode *inode, loff_t offset,
> unsigned len)
> {
> - if (offset >= wpc->iomap.offset &&
> - offset < wpc->iomap.offset + wpc->iomap.length)
> + loff_t maxblocks = (loff_t)INT_MAX;
> + u8 blkbits = inode->i_blkbits;
> + u32 bno;
> + bool new, boundary;
> + int ret;
> +
> + if (ext2_imap_valid(wpc, inode, offset))
> return 0;
>
> - return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> + /*
> + * For ext2 buffered-writes, the block allocation happens at the
> + * ->write_iter time itself. So at writeback time -
> + * 1. We first cache the i_blkseq.
> + * 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
> + * already allocated.
> + * 3. Call ext2_get_blocks() the second time with length to be same as
> + * the no. of blocks we know were already allocated.
> + * 4. Till now it means, the cached i_blkseq remains valid as no block
> + * allocation has happened yet.
> + * This means the next call to ->map_blocks(), we can verify whether the
> + * i_blkseq has raced with truncate or not. If not, then i_blkseq will
> + * remain valid.
> + *
> + * In case of a hole (could happen with mmaped writes), we only allocate
> + * 1 block at a time anyways. So even if the i_blkseq value changes, we
> + * anyway need to allocate the next block in subsequent ->map_blocks()
> + * call.

You might want to leave a comment here that ext2 doesn't support
unwritten extents, so the validation cookie is needed only for
writeback, and not for pagecache writes themselves. I don't expect
anyone to port extents (and hence unwritten blocks) to ext2, so this is
a minor point.

> + */
> + wpc->iomap.validity_cookie = READ_ONCE(EXT2_I(inode)->i_blkseq);
> +
> + ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
> + &bno, &new, &boundary, 0);
> + if (ret < 0)
> + return ret;
> + /*
> + * ret can be 0 in case of a hole which is possible for mmaped writes.
> + */
> + ret = ret ? ret : 1;
> + return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
> IOMAP_WRITE, &wpc->iomap, NULL);
> }
>
> @@ -1000,7 +1049,7 @@ ext2_dax_writepages(struct address_space *mapping, struct writeback_control *wbc
>
> const struct address_space_operations ext2_file_aops = {
> .dirty_folio = iomap_dirty_folio,
> - .release_folio = iomap_release_folio,
> + .release_folio = iomap_release_folio,

This fix should be in patch 2.

--D

> .invalidate_folio = iomap_invalidate_folio,
> .read_folio = ext2_file_read_folio,
> .readahead = ext2_file_readahead,
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 37f7ce56adce..32f5386284d6 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
> #ifdef CONFIG_QUOTA
> memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
> #endif
> -
> + WRITE_ONCE(ei->i_blkseq, 0);
> return &ei->vfs_inode;
> }
>
> --
> 2.44.0
>
>

2024-04-26 15:44:20

by Darrick J. Wong

Subject: Re: [RFCv3 5/7] iomap: Fix iomap_adjust_read_range for plen calculation

On Fri, Apr 26, 2024 at 02:22:25PM +0530, Ritesh Harjani wrote:
> Christoph Hellwig <[email protected]> writes:
>
> > On Thu, Apr 25, 2024 at 06:58:49PM +0530, Ritesh Harjani (IBM) wrote:
> >> If the extent spans the block that contains the i_size, we need to
> >
> > s/the i_size/i_size/.
> >
> >> handle both halves separately
> >
> > .. so that we properly zero data in the page cache for blocks that are
> > entirely outside of i_size.
>
> Sure.
>
> >
> > Otherwise looks good:
> >
> > Reviewed-by: Christoph Hellwig <[email protected]>
>
> Thanks for the review.

With Christoph's comments addressed, you can also add
Reviewed-by: Darrick J. Wong <[email protected]>

--D

>
> -ritesh
>

2024-04-26 16:24:57

by Darrick J. Wong

Subject: Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings

On Thu, Apr 25, 2024 at 06:58:51PM +0530, Ritesh Harjani (IBM) wrote:
> This patch optimizes the data access patterns for filesystems with
> indirect block mapping by implementing BH_Boundary handling within
> iomap.
>
> Currently the bios for reads within iomap are only submitted at
> 2 places -
> 1. If we cannot merge the new req. with previous bio, only then we
> submit the previous bio.
> 2. Submit the bio at the end of the entire read processing.
>
> This means for filesystems with indirect block mapping, we call into
> ->iomap_begin() again w/o submitting the previous bios. That causes
> unoptimized data access patterns for blocks which are of BH_Boundary type.
>
> For e.g. consider the file mapping
> logical block(4k) physical block(4k)
> 0-11 1000-1011
> 12-15 1013-1016
>
> In above physical block 1012 is an indirect metadata block which has the
> mapping information for next set of indirect blocks (1013-1016).
> With iomap buffered reads for reading 1st 16 logical blocks of a file
> (0-15), we get below I/O pattern
> - submit a bio for 1012
> - complete the bio for 1012
> - submit a bio for 1000-1011
> - submit a bio for 1013-1016
> - complete the bios for 1000-1011
> - complete the bios for 1013-1016
>
> So as we can see, above is an non-optimal I/O access pattern and also we
> get 3 bio completions instead of 2.
>
> This patch changes this behavior by doing submit_bio() if there are any
> bios already processed, before calling ->iomap_begin() again.
> That means if there are any blocks which are already processed, gets
> submitted for I/O earlier and then within ->iomap_begin(), if we get a
> request for reading an indirect metadata block, then block layer can merge
> those bios with the already submitted read request to reduce the no. of bio
> completions.
>
> Now, for bs < ps or for large folios, this patch requires proper handling
> of "ifs->read_bytes_pending". In that we first set ifs->read_bytes_pending
> to folio_size. Then handle all the cases where we need to subtract
> ifs->read_bytes_pending either during the submission side
> (if we don't need to submit any I/O - for e.g. for uptodate sub blocks),
> or during an I/O error, or at the completion of an I/O.
>
> Here is the ftrace output of iomap and block layer with ext2 iomap
> conversion patches -
>
> root# filefrag -b512 -v /mnt1/test/f1
> Filesystem type is: ef53
> Filesystem cylinder groups approximately 32
> File size of /mnt1/test/f1 is 65536 (128 blocks of 512 bytes)
> ext: logical_offset: physical_offset: length: expected: flags:
> 0: 0.. 95: 98304.. 98399: 96: merged
> 1: 96.. 127: 98408.. 98439: 32: 98400: last,merged,eof
> /mnt1/test/f1: 2 extents found
>
> root# #This reads 4 blocks starting from lblk 10, 11, 12, 13
> root# xfs_io -c "pread -b$((4*4096)) $((10*4096)) $((4*4096))" /mnt1/test/f1
>
> w/o this patch - (indirect block is submitted before and does not get merged, resulting in 3 bios completion)
> xfs_io-907 [002] ..... 185.608791: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
> xfs_io-907 [002] ..... 185.608819: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
> xfs_io-907 [002] ..... 185.608823: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
> xfs_io-907 [002] ..... 185.608831: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
> xfs_io-907 [002] ..... 185.608859: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
> xfs_io-907 [002] ..... 185.608865: block_getrq: 8,16 R 98400 + 8 [xfs_io]
> xfs_io-907 [002] ..... 185.608867: block_io_start: 8,16 R 4096 () 98400 + 8 [xfs_io]
> xfs_io-907 [002] ..... 185.608869: block_plug: [xfs_io]
> xfs_io-907 [002] ..... 185.608872: block_unplug: [xfs_io] 1
> xfs_io-907 [002] ..... 185.608874: block_rq_insert: 8,16 R 4096 () 98400 + 8 [xfs_io]
> kworker/2:1H-198 [002] ..... 185.608908: block_rq_issue: 8,16 R 4096 () 98400 + 8 [kworker/2:1H]
> <idle>-0 [002] d.h2. 185.609579: block_rq_complete: 8,16 R () 98400 + 8 [0]
> <idle>-0 [002] dNh2. 185.609631: block_io_done: 8,16 R 0 () 98400 + 0 [swapper/2]
> xfs_io-907 [002] ..... 185.609694: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
> xfs_io-907 [002] ..... 185.609704: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
> xfs_io-907 [002] ..... 185.609718: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
> xfs_io-907 [002] ..... 185.609721: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
> xfs_io-907 [002] ..... 185.609726: block_plug: [xfs_io]
> xfs_io-907 [002] ..... 185.609735: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
> xfs_io-907 [002] ..... 185.609736: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
> xfs_io-907 [002] ..... 185.609740: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
> xfs_io-907 [002] ..... 185.609741: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
> xfs_io-907 [002] ..... 185.609756: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
> xfs_io-907 [002] ..... 185.609769: block_rq_issue: 8,16 RA 8192 () 98384 + 16 [xfs_io]
> <idle>-0 [002] d.H2. 185.610280: block_rq_complete: 8,16 RA () 98408 + 16 [0]
> <idle>-0 [002] d.H2. 185.610289: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/2]
> <idle>-0 [002] d.H2. 185.610292: block_rq_complete: 8,16 RA () 98384 + 16 [0]
> <idle>-0 [002] dNH2. 185.610301: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/2]

Could this be shortened to ... the iomap calls and
block_bio_queue/backmerge? It's a bit difficult to see the point you're
getting at with all the other noise.

I think you're trying to say that the access pattern here is 98400 ->
98408 -> 98384, which is not sequential?

> v/s with the patch - (optimzed I/O access pattern and bio gets merged resulting in only 2 bios completion)
> xfs_io-944 [005] ..... 99.926187: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
> xfs_io-944 [005] ..... 99.926208: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
> xfs_io-944 [005] ..... 99.926211: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
> xfs_io-944 [005] ..... 99.926222: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
> xfs_io-944 [005] ..... 99.926232: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
> xfs_io-944 [005] ..... 99.926233: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
> xfs_io-944 [005] ..... 99.926234: block_plug: [xfs_io]
> xfs_io-944 [005] ..... 99.926235: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
> xfs_io-944 [005] ..... 99.926261: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
> xfs_io-944 [005] ..... 99.926266: block_bio_backmerge: 8,16 R 98400 + 8 [xfs_io]
> xfs_io-944 [005] ..... 99.926271: block_unplug: [xfs_io] 1
> xfs_io-944 [005] ..... 99.926272: block_rq_insert: 8,16 RA 12288 () 98384 + 24 [xfs_io]
> kworker/5:1H-234 [005] ..... 99.926314: block_rq_issue: 8,16 RA 12288 () 98384 + 24 [kworker/5:1H]
> <idle>-0 [005] d.h2. 99.926905: block_rq_complete: 8,16 RA () 98384 + 24 [0]
> <idle>-0 [005] dNh2. 99.926931: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/5]
> xfs_io-944 [005] ..... 99.926971: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
> xfs_io-944 [005] ..... 99.926981: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
> xfs_io-944 [005] ..... 99.926989: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
> xfs_io-944 [005] ..... 99.926989: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
> xfs_io-944 [005] ..... 99.926991: block_plug: [xfs_io]
> xfs_io-944 [005] ..... 99.926993: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
> xfs_io-944 [005] ..... 99.927001: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
> <idle>-0 [005] d.h2. 99.927397: block_rq_complete: 8,16 RA () 98408 + 16 [0]
> <idle>-0 [005] dNh2. 99.927414: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/5]
>
> Suggested-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
> cc: Ojaswin Mujoo <[email protected]>
> ---
> fs/iomap/buffered-io.c | 112 +++++++++++++++++++++++++++++++----------
> 1 file changed, 85 insertions(+), 27 deletions(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 0a4269095ae2..a1d50086a3f5 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -30,7 +30,7 @@ typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length);
> */
> struct iomap_folio_state {
> spinlock_t state_lock;
> - unsigned int read_bytes_pending;
> + size_t read_bytes_pending;
> atomic_t write_bytes_pending;
>
> /*
> @@ -380,6 +380,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
> loff_t orig_pos = pos;
> size_t poff, plen;
> sector_t sector;
> + bool rbp_finished = false;

What is "rbp"? My assembly programmer brain says x64 frame pointer, but
that's clearly wrong here. Maybe I'm confused...

> if (iomap->type == IOMAP_INLINE)
> return iomap_read_inline_data(iter, folio);
> @@ -387,21 +388,39 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
> /* zero post-eof blocks as the page may be mapped */
> ifs = ifs_alloc(iter->inode, folio, iter->flags);
> iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
> +
> + if (ifs) {
> + loff_t to_read = min_t(loff_t, iter->len - offset,
> + folio_size(folio) - offset_in_folio(folio, orig_pos));
> + size_t padjust;
> +
> + spin_lock_irq(&ifs->state_lock);
> + if (!ifs->read_bytes_pending)
> + ifs->read_bytes_pending = to_read;
> + padjust = pos - orig_pos;
> + ifs->read_bytes_pending -= padjust;
> + if (!ifs->read_bytes_pending)
> + rbp_finished = true;
> + spin_unlock_irq(&ifs->state_lock);
> + }
> +
> if (plen == 0)
> goto done;
>
> if (iomap_block_needs_zeroing(iter, pos)) {
> + if (ifs) {
> + spin_lock_irq(&ifs->state_lock);
> + ifs->read_bytes_pending -= plen;
> + if (!ifs->read_bytes_pending)
> + rbp_finished = true;
> + spin_unlock_irq(&ifs->state_lock);
> + }
> folio_zero_range(folio, poff, plen);
> iomap_set_range_uptodate(folio, poff, plen);
> goto done;
> }
>
> ctx->cur_folio_in_bio = true;
> - if (ifs) {
> - spin_lock_irq(&ifs->state_lock);
> - ifs->read_bytes_pending += plen;
> - spin_unlock_irq(&ifs->state_lock);
> - }
>
> sector = iomap_sector(iomap, pos);
> if (!ctx->bio ||
> @@ -435,6 +454,14 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
> }
>
> done:
> + /*
> + * If there is no bio prepared and if rbp is finished and
> + * this was the last offset within this folio then mark
> + * cur_folio_in_bio to false.
> + */
> + if (!ctx->bio && rbp_finished &&
> + offset_in_folio(folio, pos + plen) == 0)
> + ctx->cur_folio_in_bio = false;

...yes, I think I am confused. When would ctx->bio be NULL but
cur_folio_in_bio is true?

I /think/ what you're doing here is using read_bytes_pending to figure
out if you've processed the folio up to the end of the mapping? But
then you submit the bio unconditionally below for each readpage_iter
call?

Why not add an IOMAP_BOUNDARY flag that means "I will have to do some IO
if you call ->iomap_begin again"? Then if we get to this point in
readpage_iter with a ctx->bio, we can submit the bio, clear
cur_folio_in_bio, and return? And then you don't need this machinery?
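
Purely for illustration (rough, untested sketch; IOMAP_F_BOUNDARY is a
made-up flag name, not an existing iomap flag), something along these
lines at the end of iomap_readpage_iter():

	/*
	 * Hypothetical: ->iomap_begin() set IOMAP_F_BOUNDARY to say
	 * "the next mapping lookup will itself have to read a block".
	 * Push out what we already have so the block layer can merge
	 * the metadata read into it.
	 */
	if ((iomap->flags & IOMAP_F_BOUNDARY) && ctx->bio) {
		submit_bio(ctx->bio);
		ctx->bio = NULL;
		ctx->cur_folio_in_bio = false;
	}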

--D

> /*
> * Move the caller beyond our range so that it keeps making progress.
> * For that, we have to include any leading non-uptodate ranges, but
> @@ -459,9 +486,43 @@ static loff_t iomap_read_folio_iter(const struct iomap_iter *iter,
> return ret;
> }
>
> + if (ctx->bio) {
> + submit_bio(ctx->bio);
> + WARN_ON_ONCE(!ctx->cur_folio_in_bio);
> + ctx->bio = NULL;
> + }
> + if (offset_in_folio(folio, iter->pos + done) == 0 &&
> + !ctx->cur_folio_in_bio) {
> + folio_unlock(ctx->cur_folio);
> + }
> +
> return done;
> }
>
> +static void iomap_handle_read_error(struct iomap_readpage_ctx *ctx,
> + struct iomap_iter *iter)
> +{
> + struct folio *folio = ctx->cur_folio;
> + struct iomap_folio_state *ifs;
> + unsigned long flags;
> + bool rbp_finished = false;
> + size_t rbp_adjust = folio_size(folio) - offset_in_folio(folio,
> + iter->pos);
> + ifs = folio->private;
> + if (!ifs || !ifs->read_bytes_pending)
> + goto unlock;
> +
> + spin_lock_irqsave(&ifs->state_lock, flags);
> + ifs->read_bytes_pending -= rbp_adjust;
> + if (!ifs->read_bytes_pending)
> + rbp_finished = true;
> + spin_unlock_irqrestore(&ifs->state_lock, flags);
> +
> +unlock:
> + if (rbp_finished || !ctx->cur_folio_in_bio)
> + folio_unlock(folio);
> +}
> +
> int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
> {
> struct iomap_iter iter = {
> @@ -479,15 +540,9 @@ int iomap_read_folio(struct folio *folio, const struct iomap_ops *ops)
> while ((ret = iomap_iter(&iter, ops)) > 0)
> iter.processed = iomap_read_folio_iter(&iter, &ctx);
>
> - if (ret < 0)
> + if (ret < 0) {
> folio_set_error(folio);
> -
> - if (ctx.bio) {
> - submit_bio(ctx.bio);
> - WARN_ON_ONCE(!ctx.cur_folio_in_bio);
> - } else {
> - WARN_ON_ONCE(ctx.cur_folio_in_bio);
> - folio_unlock(folio);
> + iomap_handle_read_error(&ctx, &iter);
> }
>
> /*
> @@ -506,12 +561,6 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
> loff_t done, ret;
>
> for (done = 0; done < length; done += ret) {
> - if (ctx->cur_folio &&
> - offset_in_folio(ctx->cur_folio, iter->pos + done) == 0) {
> - if (!ctx->cur_folio_in_bio)
> - folio_unlock(ctx->cur_folio);
> - ctx->cur_folio = NULL;
> - }
> if (!ctx->cur_folio) {
> ctx->cur_folio = readahead_folio(ctx->rac);
> ctx->cur_folio_in_bio = false;
> @@ -519,6 +568,17 @@ static loff_t iomap_readahead_iter(const struct iomap_iter *iter,
> ret = iomap_readpage_iter(iter, ctx, done);
> if (ret <= 0)
> return ret;
> + if (ctx->cur_folio && offset_in_folio(ctx->cur_folio,
> + iter->pos + done + ret) == 0) {
> + if (!ctx->cur_folio_in_bio)
> + folio_unlock(ctx->cur_folio);
> + ctx->cur_folio = NULL;
> + }
> + }
> +
> + if (ctx->bio) {
> + submit_bio(ctx->bio);
> + ctx->bio = NULL;
> }
>
> return done;
> @@ -549,18 +609,16 @@ void iomap_readahead(struct readahead_control *rac, const struct iomap_ops *ops)
> struct iomap_readpage_ctx ctx = {
> .rac = rac,
> };
> + int ret = 0;
>
> trace_iomap_readahead(rac->mapping->host, readahead_count(rac));
>
> - while (iomap_iter(&iter, ops) > 0)
> + while ((ret = iomap_iter(&iter, ops)) > 0)
> iter.processed = iomap_readahead_iter(&iter, &ctx);
>
> - if (ctx.bio)
> - submit_bio(ctx.bio);
> - if (ctx.cur_folio) {
> - if (!ctx.cur_folio_in_bio)
> - folio_unlock(ctx.cur_folio);
> - }
> + if (ret < 0 && ctx.cur_folio)
> + iomap_handle_read_error(&ctx, &iter);
> +
> }
> EXPORT_SYMBOL_GPL(iomap_readahead);
>
> --
> 2.44.0
>
>

2024-04-26 17:19:04

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings

"Darrick J. Wong" <[email protected]> writes:

> On Thu, Apr 25, 2024 at 06:58:51PM +0530, Ritesh Harjani (IBM) wrote:
>> This patch optimizes the data access patterns for filesystems with
>> indirect block mapping by implementing BH_Boundary handling within
>> iomap.
>>
>> Currently the bios for reads within iomap are only submitted at
>> 2 places -
>> 1. If we cannot merge the new req. with previous bio, only then we
>> submit the previous bio.
>> 2. Submit the bio at the end of the entire read processing.
>>
>> This means for filesystems with indirect block mapping, we call into
>> ->iomap_begin() again w/o submitting the previous bios. That causes
>> unoptimized data access patterns for blocks which are of BH_Boundary type.
>>
>> For e.g. consider the file mapping
>> logical block(4k) physical block(4k)
>> 0-11 1000-1011
>> 12-15 1013-1016
>>
>> In above physical block 1012 is an indirect metadata block which has the
>> mapping information for next set of indirect blocks (1013-1016).
>> With iomap buffered reads for reading 1st 16 logical blocks of a file
>> (0-15), we get below I/O pattern
>> - submit a bio for 1012
>> - complete the bio for 1012
>> - submit a bio for 1000-1011
>> - submit a bio for 1013-1016
>> - complete the bios for 1000-1011
>> - complete the bios for 1013-1016
>>
>> So as we can see, above is an non-optimal I/O access pattern and also we
>> get 3 bio completions instead of 2.
>>
>> This patch changes this behavior by doing submit_bio() if there are any
>> bios already processed, before calling ->iomap_begin() again.
>> That means if there are any blocks which are already processed, gets
>> submitted for I/O earlier and then within ->iomap_begin(), if we get a
>> request for reading an indirect metadata block, then block layer can merge
>> those bios with the already submitted read request to reduce the no. of bio
>> completions.
>>
>> Now, for bs < ps or for large folios, this patch requires proper handling
>> of "ifs->read_bytes_pending". In that we first set ifs->read_bytes_pending
>> to folio_size. Then handle all the cases where we need to subtract
>> ifs->read_bytes_pending either during the submission side
>> (if we don't need to submit any I/O - for e.g. for uptodate sub blocks),
>> or during an I/O error, or at the completion of an I/O.
>>
>> Here is the ftrace output of iomap and block layer with ext2 iomap
>> conversion patches -
>>
>> root# filefrag -b512 -v /mnt1/test/f1
>> Filesystem type is: ef53
>> Filesystem cylinder groups approximately 32
>> File size of /mnt1/test/f1 is 65536 (128 blocks of 512 bytes)
>> ext: logical_offset: physical_offset: length: expected: flags:
>> 0: 0.. 95: 98304.. 98399: 96: merged
>> 1: 96.. 127: 98408.. 98439: 32: 98400: last,merged,eof
>> /mnt1/test/f1: 2 extents found
>>
>> root# #This reads 4 blocks starting from lblk 10, 11, 12, 13
>> root# xfs_io -c "pread -b$((4*4096)) $((10*4096)) $((4*4096))" /mnt1/test/f1
>>
>> w/o this patch - (indirect block is submitted before and does not get merged, resulting in 3 bios completion)
>> xfs_io-907 [002] ..... 185.608791: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
>> xfs_io-907 [002] ..... 185.608819: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
>> xfs_io-907 [002] ..... 185.608823: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
>> xfs_io-907 [002] ..... 185.608831: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
>> xfs_io-907 [002] ..... 185.608859: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
>> xfs_io-907 [002] ..... 185.608865: block_getrq: 8,16 R 98400 + 8 [xfs_io]
>> xfs_io-907 [002] ..... 185.608867: block_io_start: 8,16 R 4096 () 98400 + 8 [xfs_io]
>> xfs_io-907 [002] ..... 185.608869: block_plug: [xfs_io]
>> xfs_io-907 [002] ..... 185.608872: block_unplug: [xfs_io] 1
>> xfs_io-907 [002] ..... 185.608874: block_rq_insert: 8,16 R 4096 () 98400 + 8 [xfs_io]
>> kworker/2:1H-198 [002] ..... 185.608908: block_rq_issue: 8,16 R 4096 () 98400 + 8 [kworker/2:1H]
>> <idle>-0 [002] d.h2. 185.609579: block_rq_complete: 8,16 R () 98400 + 8 [0]
>> <idle>-0 [002] dNh2. 185.609631: block_io_done: 8,16 R 0 () 98400 + 0 [swapper/2]
>> xfs_io-907 [002] ..... 185.609694: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
>> xfs_io-907 [002] ..... 185.609704: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
>> xfs_io-907 [002] ..... 185.609718: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
>> xfs_io-907 [002] ..... 185.609721: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>> xfs_io-907 [002] ..... 185.609726: block_plug: [xfs_io]
>> xfs_io-907 [002] ..... 185.609735: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
>> xfs_io-907 [002] ..... 185.609736: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
>> xfs_io-907 [002] ..... 185.609740: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
>> xfs_io-907 [002] ..... 185.609741: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>> xfs_io-907 [002] ..... 185.609756: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>> xfs_io-907 [002] ..... 185.609769: block_rq_issue: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>> <idle>-0 [002] d.H2. 185.610280: block_rq_complete: 8,16 RA () 98408 + 16 [0]
>> <idle>-0 [002] d.H2. 185.610289: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/2]
>> <idle>-0 [002] d.H2. 185.610292: block_rq_complete: 8,16 RA () 98384 + 16 [0]
>> <idle>-0 [002] dNH2. 185.610301: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/2]
>
> Could this be shortened to ... the iomap calls and
> block_bio_queue/backmerge? It's a bit difficult to see the point you're
> getting at with all the other noise.

I will remove this log, move it to the cover letter, and just extend
the simple example I considered earlier in this commit message
to show the difference with and without the patch.

>
> I think you're trying to say that the access pattern here is 98400 ->
> 98408 -> 98384, which is not sequential?
>

it's (98400,8 ==> metadata block) -> (98384,16 ==> lblk 10 & 11) -> (98408,16 ==> lblk 12 & 13)
... w/o the patch

>> v/s with the patch - (optimzed I/O access pattern and bio gets merged resulting in only 2 bios completion)
>> xfs_io-944 [005] ..... 99.926187: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
>> xfs_io-944 [005] ..... 99.926208: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
>> xfs_io-944 [005] ..... 99.926211: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
>> xfs_io-944 [005] ..... 99.926222: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
>> xfs_io-944 [005] ..... 99.926232: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
>> xfs_io-944 [005] ..... 99.926233: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>> xfs_io-944 [005] ..... 99.926234: block_plug: [xfs_io]
>> xfs_io-944 [005] ..... 99.926235: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
>> xfs_io-944 [005] ..... 99.926261: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
>> xfs_io-944 [005] ..... 99.926266: block_bio_backmerge: 8,16 R 98400 + 8 [xfs_io]
>> xfs_io-944 [005] ..... 99.926271: block_unplug: [xfs_io] 1
>> xfs_io-944 [005] ..... 99.926272: block_rq_insert: 8,16 RA 12288 () 98384 + 24 [xfs_io]
>> kworker/5:1H-234 [005] ..... 99.926314: block_rq_issue: 8,16 RA 12288 () 98384 + 24 [kworker/5:1H]
>> <idle>-0 [005] d.h2. 99.926905: block_rq_complete: 8,16 RA () 98384 + 24 [0]
>> <idle>-0 [005] dNh2. 99.926931: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/5]
>> xfs_io-944 [005] ..... 99.926971: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
>> xfs_io-944 [005] ..... 99.926981: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
>> xfs_io-944 [005] ..... 99.926989: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
>> xfs_io-944 [005] ..... 99.926989: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>> xfs_io-944 [005] ..... 99.926991: block_plug: [xfs_io]
>> xfs_io-944 [005] ..... 99.926993: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
>> xfs_io-944 [005] ..... 99.927001: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>> <idle>-0 [005] d.h2. 99.927397: block_rq_complete: 8,16 RA () 98408 + 16 [0]
>> <idle>-0 [005] dNh2. 99.927414: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/5]
>>
>> Suggested-by: Matthew Wilcox <[email protected]>
>> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
>> cc: Ojaswin Mujoo <[email protected]>
>> ---
>> fs/iomap/buffered-io.c | 112 +++++++++++++++++++++++++++++++----------
>> 1 file changed, 85 insertions(+), 27 deletions(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 0a4269095ae2..a1d50086a3f5 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -30,7 +30,7 @@ typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length);
>> */
>> struct iomap_folio_state {
>> spinlock_t state_lock;
>> - unsigned int read_bytes_pending;
>> + size_t read_bytes_pending;
>> atomic_t write_bytes_pending;
>>
>> /*
>> @@ -380,6 +380,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>> loff_t orig_pos = pos;
>> size_t poff, plen;
>> sector_t sector;
>> + bool rbp_finished = false;
>
> What is "rbp"? My assembly programmer brain says x64 frame pointer, but
> that's clearly wrong here. Maybe I'm confused...
>

rbp == read_bytes_pending ;)

>> if (iomap->type == IOMAP_INLINE)
>> return iomap_read_inline_data(iter, folio);
>> @@ -387,21 +388,39 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>> /* zero post-eof blocks as the page may be mapped */
>> ifs = ifs_alloc(iter->inode, folio, iter->flags);
>> iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
>> +
>> + if (ifs) {
>> + loff_t to_read = min_t(loff_t, iter->len - offset,
>> + folio_size(folio) - offset_in_folio(folio, orig_pos));
>> + size_t padjust;
>> +
>> + spin_lock_irq(&ifs->state_lock);
>> + if (!ifs->read_bytes_pending)
>> + ifs->read_bytes_pending = to_read;
>> + padjust = pos - orig_pos;
>> + ifs->read_bytes_pending -= padjust;
>> + if (!ifs->read_bytes_pending)
>> + rbp_finished = true;
>> + spin_unlock_irq(&ifs->state_lock);
>> + }
>> +
>> if (plen == 0)
>> goto done;
>>
>> if (iomap_block_needs_zeroing(iter, pos)) {
>> + if (ifs) {
>> + spin_lock_irq(&ifs->state_lock);
>> + ifs->read_bytes_pending -= plen;
>> + if (!ifs->read_bytes_pending)
>> + rbp_finished = true;
>> + spin_unlock_irq(&ifs->state_lock);
>> + }
>> folio_zero_range(folio, poff, plen);
>> iomap_set_range_uptodate(folio, poff, plen);
>> goto done;
>> }
>>
>> ctx->cur_folio_in_bio = true;
>> - if (ifs) {
>> - spin_lock_irq(&ifs->state_lock);
>> - ifs->read_bytes_pending += plen;
>> - spin_unlock_irq(&ifs->state_lock);
>> - }
>>
>> sector = iomap_sector(iomap, pos);
>> if (!ctx->bio ||
>> @@ -435,6 +454,14 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>> }
>>
>> done:
>> + /*
>> + * If there is no bio prepared and if rbp is finished and
>> + * this was the last offset within this folio then mark
>> + * cur_folio_in_bio to false.
>> + */
>> + if (!ctx->bio && rbp_finished &&
>> + offset_in_folio(folio, pos + plen) == 0)
>> + ctx->cur_folio_in_bio = false;
>
> ...yes, I think I am confused. When would ctx->bio be NULL but
> cur_folio_in_bio is true?

Previously we had submitted the bio and so we set ctx->bio to NULL, but
ctx->cur_folio is still set and ctx->cur_folio_in_bio is still true,
since we haven't completely processed the folio.

>
> I /think/ what you're doing here is using read_bytes_pending to figure
> out if you've processed the folio up to the end of the mapping? But
> then you submit the bio unconditionally below for each readpage_iter
> call?
>

yes, that's right.

> Why not add an IOMAP_BOUNDARY flag that means "I will have to do some IO
> if you call ->iomap_begin again"? Then if we get to this point in
> readpage_iter with a ctx->bio, we can submit the bio, clear
> cur_folio_in_bio, and return? And then you don't need this machinery?

TBH, I initially didn't think the approach taken in the patch would
require such careful handling of r_b_p. It is because of all these
corner cases (where we don't need to read already-uptodate blocks, where
an I/O error occurs, or where the extent spans beyond i_size) that we
need to reduce r_b_p carefully so that we can still unlock the folio.

So it's all about how we know when we can unlock the folio, i.e. that
all of its corresponding blocks/mappings have been processed or
submitted for I/O.

Assume we have a folio which spans multiple extents. In that case,
-> we build a bio for the 1st extent,
-> then we go back to iomap_iter() to get the new extent mapping,
-> we now increment r_b_p with this new plen to be processed,
-> we then submit the previous bio, since the new mapping couldn't be
merged due to discontinuous extents.
So by incrementing r_b_p before doing submit_bio(), we don't
unlock the folio at bio completion.
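
As a concrete example (with made-up numbers): take an 8k folio covering
two discontinuous 4k extents. The first pass sets r_b_p to 4096 and
builds a bio for the 1st extent; the second pass bumps r_b_p to 8192 for
the new extent before that first bio is submitted. So even if the first
bio completes immediately, r_b_p only drops back to 4096 and the
completion handler correctly leaves the folio locked until the 2nd
extent's I/O also completes.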

Maybe it would be helpful if we had an easy mechanism to keep some state
from the time of submit_bio() until the bio completion, to indicate that
the corresponding folio is still being processed and shouldn't be
unlocked.
-> This is currently what we are doing by initializing r_b_p to the
value of folio_size() and then carefully reducing r_b_p for all the
cases I mentioned above.

Let me think about whether adding an IOMAP_BH_BOUNDARY flag could be helpful or not.
Say we have a pagesize of 64k; that means the first 16 blocks all belong
to the same page. So even with an IOMAP_BH_BOUNDARY flag the problem that
still remains is: even if we submit the bio at block 11 (the bh_boundary
block), how will the bio completion side know that the folio is not
completely processed and that it shouldn't unlock the folio?


-ritesh

2024-04-26 17:32:21

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings

Ritesh Harjani (IBM) <[email protected]> writes:

> "Darrick J. Wong" <[email protected]> writes:
>
>> On Thu, Apr 25, 2024 at 06:58:51PM +0530, Ritesh Harjani (IBM) wrote:
>>> This patch optimizes the data access patterns for filesystems with
>>> indirect block mapping by implementing BH_Boundary handling within
>>> iomap.
>>>
>>> Currently the bios for reads within iomap are only submitted at
>>> 2 places -
>>> 1. If we cannot merge the new req. with previous bio, only then we
>>> submit the previous bio.
>>> 2. Submit the bio at the end of the entire read processing.
>>>
>>> This means for filesystems with indirect block mapping, we call into
>>> ->iomap_begin() again w/o submitting the previous bios. That causes
>>> unoptimized data access patterns for blocks which are of BH_Boundary type.
>>>
>>> For e.g. consider the file mapping
>>> logical block(4k) physical block(4k)
>>> 0-11 1000-1011
>>> 12-15 1013-1016
>>>
>>> In above physical block 1012 is an indirect metadata block which has the
>>> mapping information for next set of indirect blocks (1013-1016).
>>> With iomap buffered reads for reading 1st 16 logical blocks of a file
>>> (0-15), we get below I/O pattern
>>> - submit a bio for 1012
>>> - complete the bio for 1012
>>> - submit a bio for 1000-1011
>>> - submit a bio for 1013-1016
>>> - complete the bios for 1000-1011
>>> - complete the bios for 1013-1016
>>>
>>> So as we can see, above is an non-optimal I/O access pattern and also we
>>> get 3 bio completions instead of 2.
>>>
>>> This patch changes this behavior by doing submit_bio() if there are any
>>> bios already processed, before calling ->iomap_begin() again.
>>> That means if there are any blocks which are already processed, gets
>>> submitted for I/O earlier and then within ->iomap_begin(), if we get a
>>> request for reading an indirect metadata block, then block layer can merge
>>> those bios with the already submitted read request to reduce the no. of bio
>>> completions.
>>>
>>> Now, for bs < ps or for large folios, this patch requires proper handling
>>> of "ifs->read_bytes_pending". In that we first set ifs->read_bytes_pending
>>> to folio_size. Then handle all the cases where we need to subtract
>>> ifs->read_bytes_pending either during the submission side
>>> (if we don't need to submit any I/O - for e.g. for uptodate sub blocks),
>>> or during an I/O error, or at the completion of an I/O.
>>>
>>> Here is the ftrace output of iomap and block layer with ext2 iomap
>>> conversion patches -
>>>
>>> root# filefrag -b512 -v /mnt1/test/f1
>>> Filesystem type is: ef53
>>> Filesystem cylinder groups approximately 32
>>> File size of /mnt1/test/f1 is 65536 (128 blocks of 512 bytes)
>>> ext: logical_offset: physical_offset: length: expected: flags:
>>> 0: 0.. 95: 98304.. 98399: 96: merged
>>> 1: 96.. 127: 98408.. 98439: 32: 98400: last,merged,eof
>>> /mnt1/test/f1: 2 extents found
>>>
>>> root# #This reads 4 blocks starting from lblk 10, 11, 12, 13
>>> root# xfs_io -c "pread -b$((4*4096)) $((10*4096)) $((4*4096))" /mnt1/test/f1
>>>
>>> w/o this patch - (indirect block is submitted before and does not get merged, resulting in 3 bios completion)
>>> xfs_io-907 [002] ..... 185.608791: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
>>> xfs_io-907 [002] ..... 185.608819: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
>>> xfs_io-907 [002] ..... 185.608823: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
>>> xfs_io-907 [002] ..... 185.608831: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
>>> xfs_io-907 [002] ..... 185.608859: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
>>> xfs_io-907 [002] ..... 185.608865: block_getrq: 8,16 R 98400 + 8 [xfs_io]
>>> xfs_io-907 [002] ..... 185.608867: block_io_start: 8,16 R 4096 () 98400 + 8 [xfs_io]
>>> xfs_io-907 [002] ..... 185.608869: block_plug: [xfs_io]
>>> xfs_io-907 [002] ..... 185.608872: block_unplug: [xfs_io] 1
>>> xfs_io-907 [002] ..... 185.608874: block_rq_insert: 8,16 R 4096 () 98400 + 8 [xfs_io]
>>> kworker/2:1H-198 [002] ..... 185.608908: block_rq_issue: 8,16 R 4096 () 98400 + 8 [kworker/2:1H]
>>> <idle>-0 [002] d.h2. 185.609579: block_rq_complete: 8,16 R () 98400 + 8 [0]
>>> <idle>-0 [002] dNh2. 185.609631: block_io_done: 8,16 R 0 () 98400 + 0 [swapper/2]
>>> xfs_io-907 [002] ..... 185.609694: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
>>> xfs_io-907 [002] ..... 185.609704: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
>>> xfs_io-907 [002] ..... 185.609718: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
>>> xfs_io-907 [002] ..... 185.609721: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>>> xfs_io-907 [002] ..... 185.609726: block_plug: [xfs_io]
>>> xfs_io-907 [002] ..... 185.609735: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
>>> xfs_io-907 [002] ..... 185.609736: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
>>> xfs_io-907 [002] ..... 185.609740: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
>>> xfs_io-907 [002] ..... 185.609741: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>>> xfs_io-907 [002] ..... 185.609756: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>>> xfs_io-907 [002] ..... 185.609769: block_rq_issue: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>>> <idle>-0 [002] d.H2. 185.610280: block_rq_complete: 8,16 RA () 98408 + 16 [0]
>>> <idle>-0 [002] d.H2. 185.610289: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/2]
>>> <idle>-0 [002] d.H2. 185.610292: block_rq_complete: 8,16 RA () 98384 + 16 [0]
>>> <idle>-0 [002] dNH2. 185.610301: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/2]
>>
>> Could this be shortened to ... the iomap calls and
>> block_bio_queue/backmerge? It's a bit difficult to see the point you're
>> getting at with all the other noise.
>
> I will remove this log and move it to cover letter and will just extend
> the simple example I considered before in this commit message,
> to show the difference with and w/o patch.
>
>>
>> I think you're trying to say that the access pattern here is 98400 ->
>> 98408 -> 98384, which is not sequential?
>>
>
> it's (98400,8 ==> metadata block) -> (98384,16 == lblk 10 & 11) -> (98408,16 ==> lblk 12 & 13)
> ... w/o the patch
>
>>> v/s with the patch - (optimzed I/O access pattern and bio gets merged resulting in only 2 bios completion)
>>> xfs_io-944 [005] ..... 99.926187: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
>>> xfs_io-944 [005] ..... 99.926208: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
>>> xfs_io-944 [005] ..... 99.926211: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
>>> xfs_io-944 [005] ..... 99.926222: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
>>> xfs_io-944 [005] ..... 99.926232: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
>>> xfs_io-944 [005] ..... 99.926233: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>>> xfs_io-944 [005] ..... 99.926234: block_plug: [xfs_io]
>>> xfs_io-944 [005] ..... 99.926235: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
>>> xfs_io-944 [005] ..... 99.926261: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
>>> xfs_io-944 [005] ..... 99.926266: block_bio_backmerge: 8,16 R 98400 + 8 [xfs_io]
>>> xfs_io-944 [005] ..... 99.926271: block_unplug: [xfs_io] 1
>>> xfs_io-944 [005] ..... 99.926272: block_rq_insert: 8,16 RA 12288 () 98384 + 24 [xfs_io]
>>> kworker/5:1H-234 [005] ..... 99.926314: block_rq_issue: 8,16 RA 12288 () 98384 + 24 [kworker/5:1H]
>>> <idle>-0 [005] d.h2. 99.926905: block_rq_complete: 8,16 RA () 98384 + 24 [0]
>>> <idle>-0 [005] dNh2. 99.926931: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/5]
>>> xfs_io-944 [005] ..... 99.926971: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
>>> xfs_io-944 [005] ..... 99.926981: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
>>> xfs_io-944 [005] ..... 99.926989: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
>>> xfs_io-944 [005] ..... 99.926989: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>>> xfs_io-944 [005] ..... 99.926991: block_plug: [xfs_io]
>>> xfs_io-944 [005] ..... 99.926993: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
>>> xfs_io-944 [005] ..... 99.927001: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>>> <idle>-0 [005] d.h2. 99.927397: block_rq_complete: 8,16 RA () 98408 + 16 [0]
>>> <idle>-0 [005] dNh2. 99.927414: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/5]
>>>
>>> Suggested-by: Matthew Wilcox <[email protected]>
>>> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
>>> cc: Ojaswin Mujoo <[email protected]>
>>> ---
>>> fs/iomap/buffered-io.c | 112 +++++++++++++++++++++++++++++++----------
>>> 1 file changed, 85 insertions(+), 27 deletions(-)
>>>
>>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>>> index 0a4269095ae2..a1d50086a3f5 100644
>>> --- a/fs/iomap/buffered-io.c
>>> +++ b/fs/iomap/buffered-io.c
>>> @@ -30,7 +30,7 @@ typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length);
>>> */
>>> struct iomap_folio_state {
>>> spinlock_t state_lock;
>>> - unsigned int read_bytes_pending;
>>> + size_t read_bytes_pending;
>>> atomic_t write_bytes_pending;
>>>
>>> /*
>>> @@ -380,6 +380,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>>> loff_t orig_pos = pos;
>>> size_t poff, plen;
>>> sector_t sector;
>>> + bool rbp_finished = false;
>>
>> What is "rbp"? My assembly programmer brain says x64 frame pointer, but
>> that's clearly wrong here. Maybe I'm confused...
>>
>
> rbp == read_bytes_pending ;)
>
>>> if (iomap->type == IOMAP_INLINE)
>>> return iomap_read_inline_data(iter, folio);
>>> @@ -387,21 +388,39 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>>> /* zero post-eof blocks as the page may be mapped */
>>> ifs = ifs_alloc(iter->inode, folio, iter->flags);
>>> iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
>>> +
>>> + if (ifs) {
>>> + loff_t to_read = min_t(loff_t, iter->len - offset,
>>> + folio_size(folio) - offset_in_folio(folio, orig_pos));
>>> + size_t padjust;
>>> +
>>> + spin_lock_irq(&ifs->state_lock);
>>> + if (!ifs->read_bytes_pending)
>>> + ifs->read_bytes_pending = to_read;
>>> + padjust = pos - orig_pos;
>>> + ifs->read_bytes_pending -= padjust;
>>> + if (!ifs->read_bytes_pending)
>>> + rbp_finished = true;
>>> + spin_unlock_irq(&ifs->state_lock);
>>> + }
>>> +
>>> if (plen == 0)
>>> goto done;
>>>
>>> if (iomap_block_needs_zeroing(iter, pos)) {
>>> + if (ifs) {
>>> + spin_lock_irq(&ifs->state_lock);
>>> + ifs->read_bytes_pending -= plen;
>>> + if (!ifs->read_bytes_pending)
>>> + rbp_finished = true;
>>> + spin_unlock_irq(&ifs->state_lock);
>>> + }
>>> folio_zero_range(folio, poff, plen);
>>> iomap_set_range_uptodate(folio, poff, plen);
>>> goto done;
>>> }
>>>
>>> ctx->cur_folio_in_bio = true;
>>> - if (ifs) {
>>> - spin_lock_irq(&ifs->state_lock);
>>> - ifs->read_bytes_pending += plen;
>>> - spin_unlock_irq(&ifs->state_lock);
>>> - }
>>>
>>> sector = iomap_sector(iomap, pos);
>>> if (!ctx->bio ||
>>> @@ -435,6 +454,14 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>>> }
>>>
>>> done:
>>> + /*
>>> + * If there is no bio prepared and if rbp is finished and
>>> + * this was the last offset within this folio then mark
>>> + * cur_folio_in_bio to false.
>>> + */
>>> + if (!ctx->bio && rbp_finished &&
>>> + offset_in_folio(folio, pos + plen) == 0)
>>> + ctx->cur_folio_in_bio = false;
>>
>> ...yes, I think I am confused. When would ctx->bio be NULL but
>> cur_folio_in_bio is true?
>
> Previously we had the bio submitted and so we make it null, but we still
> have ctx->cur_folio & ctx->cur_folio_in_bio to true, since we haven't
> completely processed the folio.
>
>>
>> I /think/ what you're doing here is using read_bytes_pending to figure
>> out if you've processed the folio up to the end of the mapping? But
>> then you submit the bio unconditionally below for each readpage_iter
>> call?
>>
>
> yes, that's right.
>
>> Why not add an IOMAP_BOUNDARY flag that means "I will have to do some IO
>> if you call ->iomap_begin again"? Then if we get to this point in
>> readpage_iter with a ctx->bio, we can submit the bio, clear
>> cur_folio_in_bio, and return? And then you don't need this machinery?
>
> TBH, I initially didn't think the approach taken in the patch would
> require such careful handling of r_b_p. It was because of all of this
> corner cases when we don't need to read the update blocks and/or in case
> of an error we need to ensure we reduce r_b_p carefully so that we could
> unlock the folio and when extent spans beyond i_size.
>
> So it's all about how do we know if we could unlock the folio and that it's
> corresponding blocks/mapping has been all processed or submitted for
> I/O.
>
> Assume we have a folio which spans over multiple extents. In such a
> case,
> -> we process a bio for 1st extent,
> -> then we go back to iomap_iter() to get new extent mapping,
> -> We now increment the r_b_p with this new plen to be processed.
> -> We then submit the previous bio, since this new mapping couldn't be
> merged due to discontinuous extents.
> So by first incrementing the r_b_p before doing submit_bio(), we don't
> unlock the folio at bio completion.
>
> Maybe, it would be helpful if we have an easy mechanism to keep some state
> from the time of submit_bio() till the bio completion to know that the
> corresponding folio is still being processed and it shouldn't be
> unlocked.
> -> This currently is what we are doing by making r_b_p to the value of
> folio_size() and then carefully reducing r_b_p for all the cases I
> mentioned above.
>
> Let me think if adding a IOMAP_BH_BOUNDARY flag could be helpful or not.
> Say if we have a pagesize of 64k that means all first 16 blocks belongs
> to same page. So even with IOMAP_BH_BOUNDARY flag the problem that still
> remains is that, even if we submit the bio at block 11 (bh_boundary
> block), how will the bio completion side know that the folio is not
> completely processed and so we shouldn't unlock the folio?

Maybe one way could be to add another state flag to ifs for the
BH_BOUNDARY block and check that at bio completion.
That would let the completion side know whether it should unlock the
folio or whether it still needs processing on the submission side.

I guess that might be an easier approach than this one. Thoughts?
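
Purely as an illustrative sketch (the field name here is made up, this
is not a real proposal yet), I mean something like:

	struct iomap_folio_state {
		spinlock_t	state_lock;
		size_t		read_bytes_pending;
		atomic_t	write_bytes_pending;
		/* hypothetical: submission side still owns this folio */
		bool		read_in_progress;
		/* ... rest of the existing fields ... */
	};

where the read completion only does folio_unlock() once
read_bytes_pending drops to zero and read_in_progress has been cleared
by the submission side.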

-ritesh

2024-04-26 17:39:07

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings

On Fri, Apr 26, 2024 at 10:55:23PM +0530, Ritesh Harjani wrote:
> Ritesh Harjani (IBM) <[email protected]> writes:
>
> > "Darrick J. Wong" <[email protected]> writes:
> >
> >> On Thu, Apr 25, 2024 at 06:58:51PM +0530, Ritesh Harjani (IBM) wrote:
> >>> This patch optimizes the data access patterns for filesystems with
> >>> indirect block mapping by implementing BH_Boundary handling within
> >>> iomap.
> >>>
> >>> Currently the bios for reads within iomap are only submitted at
> >>> 2 places -
> >>> 1. If we cannot merge the new req. with previous bio, only then we
> >>> submit the previous bio.
> >>> 2. Submit the bio at the end of the entire read processing.
> >>>
> >>> This means for filesystems with indirect block mapping, we call into
> >>> ->iomap_begin() again w/o submitting the previous bios. That causes
> >>> unoptimized data access patterns for blocks which are of BH_Boundary type.
> >>>
> >>> For e.g. consider the file mapping
> >>> logical block(4k) physical block(4k)
> >>> 0-11 1000-1011
> >>> 12-15 1013-1016
> >>>
> >>> In above physical block 1012 is an indirect metadata block which has the
> >>> mapping information for next set of indirect blocks (1013-1016).
> >>> With iomap buffered reads for reading 1st 16 logical blocks of a file
> >>> (0-15), we get below I/O pattern
> >>> - submit a bio for 1012
> >>> - complete the bio for 1012
> >>> - submit a bio for 1000-1011
> >>> - submit a bio for 1013-1016
> >>> - complete the bios for 1000-1011
> >>> - complete the bios for 1013-1016
> >>>
> >>> So as we can see, above is an non-optimal I/O access pattern and also we
> >>> get 3 bio completions instead of 2.
> >>>
> >>> This patch changes this behavior by doing submit_bio() if there are any
> >>> bios already processed, before calling ->iomap_begin() again.
> >>> That means if there are any blocks which are already processed, gets
> >>> submitted for I/O earlier and then within ->iomap_begin(), if we get a
> >>> request for reading an indirect metadata block, then block layer can merge
> >>> those bios with the already submitted read request to reduce the no. of bio
> >>> completions.
> >>>
> >>> Now, for bs < ps or for large folios, this patch requires proper handling
> >>> of "ifs->read_bytes_pending". In that we first set ifs->read_bytes_pending
> >>> to folio_size. Then handle all the cases where we need to subtract
> >>> ifs->read_bytes_pending either during the submission side
> >>> (if we don't need to submit any I/O - for e.g. for uptodate sub blocks),
> >>> or during an I/O error, or at the completion of an I/O.
> >>>
> >>> Here is the ftrace output of iomap and block layer with ext2 iomap
> >>> conversion patches -
> >>>
> >>> root# filefrag -b512 -v /mnt1/test/f1
> >>> Filesystem type is: ef53
> >>> Filesystem cylinder groups approximately 32
> >>> File size of /mnt1/test/f1 is 65536 (128 blocks of 512 bytes)
> >>> ext: logical_offset: physical_offset: length: expected: flags:
> >>> 0: 0.. 95: 98304.. 98399: 96: merged
> >>> 1: 96.. 127: 98408.. 98439: 32: 98400: last,merged,eof
> >>> /mnt1/test/f1: 2 extents found
> >>>
> >>> root# #This reads 4 blocks starting from lblk 10, 11, 12, 13
> >>> root# xfs_io -c "pread -b$((4*4096)) $((10*4096)) $((4*4096))" /mnt1/test/f1
> >>>
> >>> w/o this patch - (indirect block is submitted before and does not get merged, resulting in 3 bios completion)
> >>> xfs_io-907 [002] ..... 185.608791: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
> >>> xfs_io-907 [002] ..... 185.608819: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
> >>> xfs_io-907 [002] ..... 185.608823: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
> >>> xfs_io-907 [002] ..... 185.608831: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
> >>> xfs_io-907 [002] ..... 185.608859: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
> >>> xfs_io-907 [002] ..... 185.608865: block_getrq: 8,16 R 98400 + 8 [xfs_io]
> >>> xfs_io-907 [002] ..... 185.608867: block_io_start: 8,16 R 4096 () 98400 + 8 [xfs_io]
> >>> xfs_io-907 [002] ..... 185.608869: block_plug: [xfs_io]
> >>> xfs_io-907 [002] ..... 185.608872: block_unplug: [xfs_io] 1
> >>> xfs_io-907 [002] ..... 185.608874: block_rq_insert: 8,16 R 4096 () 98400 + 8 [xfs_io]
> >>> kworker/2:1H-198 [002] ..... 185.608908: block_rq_issue: 8,16 R 4096 () 98400 + 8 [kworker/2:1H]
> >>> <idle>-0 [002] d.h2. 185.609579: block_rq_complete: 8,16 R () 98400 + 8 [0]
> >>> <idle>-0 [002] dNh2. 185.609631: block_io_done: 8,16 R 0 () 98400 + 0 [swapper/2]
> >>> xfs_io-907 [002] ..... 185.609694: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
> >>> xfs_io-907 [002] ..... 185.609704: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
> >>> xfs_io-907 [002] ..... 185.609718: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
> >>> xfs_io-907 [002] ..... 185.609721: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
> >>> xfs_io-907 [002] ..... 185.609726: block_plug: [xfs_io]
> >>> xfs_io-907 [002] ..... 185.609735: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
> >>> xfs_io-907 [002] ..... 185.609736: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
> >>> xfs_io-907 [002] ..... 185.609740: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
> >>> xfs_io-907 [002] ..... 185.609741: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
> >>> xfs_io-907 [002] ..... 185.609756: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
> >>> xfs_io-907 [002] ..... 185.609769: block_rq_issue: 8,16 RA 8192 () 98384 + 16 [xfs_io]
> >>> <idle>-0 [002] d.H2. 185.610280: block_rq_complete: 8,16 RA () 98408 + 16 [0]
> >>> <idle>-0 [002] d.H2. 185.610289: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/2]
> >>> <idle>-0 [002] d.H2. 185.610292: block_rq_complete: 8,16 RA () 98384 + 16 [0]
> >>> <idle>-0 [002] dNH2. 185.610301: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/2]
> >>
> >> Could this be shortened to ... the iomap calls and
> >> block_bio_queue/backmerge? It's a bit difficult to see the point you're
> >> getting at with all the other noise.
> >
> > I will remove this log and move it to cover letter and will just extend
> > the simple example I considered before in this commit message,
> > to show the difference with and w/o patch.
> >
> >>
> >> I think you're trying to say that the access pattern here is 98400 ->
> >> 98408 -> 98384, which is not sequential?
> >>
> >
> > it's (98400,8 ==> metadata block) -> (98384,16 == lblk 10 & 11) -> (98408,16 ==> lblk 12 & 13)
> > ... w/o the patch
> >
> >>> v/s with the patch - (optimzed I/O access pattern and bio gets merged resulting in only 2 bios completion)
> >>> xfs_io-944 [005] ..... 99.926187: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
> >>> xfs_io-944 [005] ..... 99.926208: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
> >>> xfs_io-944 [005] ..... 99.926211: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
> >>> xfs_io-944 [005] ..... 99.926222: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
> >>> xfs_io-944 [005] ..... 99.926232: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
> >>> xfs_io-944 [005] ..... 99.926233: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
> >>> xfs_io-944 [005] ..... 99.926234: block_plug: [xfs_io]
> >>> xfs_io-944 [005] ..... 99.926235: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
> >>> xfs_io-944 [005] ..... 99.926261: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
> >>> xfs_io-944 [005] ..... 99.926266: block_bio_backmerge: 8,16 R 98400 + 8 [xfs_io]
> >>> xfs_io-944 [005] ..... 99.926271: block_unplug: [xfs_io] 1
> >>> xfs_io-944 [005] ..... 99.926272: block_rq_insert: 8,16 RA 12288 () 98384 + 24 [xfs_io]
> >>> kworker/5:1H-234 [005] ..... 99.926314: block_rq_issue: 8,16 RA 12288 () 98384 + 24 [kworker/5:1H]
> >>> <idle>-0 [005] d.h2. 99.926905: block_rq_complete: 8,16 RA () 98384 + 24 [0]
> >>> <idle>-0 [005] dNh2. 99.926931: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/5]
> >>> xfs_io-944 [005] ..... 99.926971: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
> >>> xfs_io-944 [005] ..... 99.926981: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
> >>> xfs_io-944 [005] ..... 99.926989: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
> >>> xfs_io-944 [005] ..... 99.926989: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
> >>> xfs_io-944 [005] ..... 99.926991: block_plug: [xfs_io]
> >>> xfs_io-944 [005] ..... 99.926993: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
> >>> xfs_io-944 [005] ..... 99.927001: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
> >>> <idle>-0 [005] d.h2. 99.927397: block_rq_complete: 8,16 RA () 98408 + 16 [0]
> >>> <idle>-0 [005] dNh2. 99.927414: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/5]
> >>>
> >>> Suggested-by: Matthew Wilcox <[email protected]>
> >>> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
> >>> cc: Ojaswin Mujoo <[email protected]>
> >>> ---
> >>> fs/iomap/buffered-io.c | 112 +++++++++++++++++++++++++++++++----------
> >>> 1 file changed, 85 insertions(+), 27 deletions(-)
> >>>
> >>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> >>> index 0a4269095ae2..a1d50086a3f5 100644
> >>> --- a/fs/iomap/buffered-io.c
> >>> +++ b/fs/iomap/buffered-io.c
> >>> @@ -30,7 +30,7 @@ typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length);
> >>> */
> >>> struct iomap_folio_state {
> >>> spinlock_t state_lock;
> >>> - unsigned int read_bytes_pending;
> >>> + size_t read_bytes_pending;
> >>> atomic_t write_bytes_pending;
> >>>
> >>> /*
> >>> @@ -380,6 +380,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
> >>> loff_t orig_pos = pos;
> >>> size_t poff, plen;
> >>> sector_t sector;
> >>> + bool rbp_finished = false;
> >>
> >> What is "rbp"? My assembly programmer brain says x64 frame pointer, but
> >> that's clearly wrong here. Maybe I'm confused...
> >>
> >
> > rbp == read_bytes_pending ;)
> >
> >>> if (iomap->type == IOMAP_INLINE)
> >>> return iomap_read_inline_data(iter, folio);
> >>> @@ -387,21 +388,39 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
> >>> /* zero post-eof blocks as the page may be mapped */
> >>> ifs = ifs_alloc(iter->inode, folio, iter->flags);
> >>> iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
> >>> +
> >>> + if (ifs) {
> >>> + loff_t to_read = min_t(loff_t, iter->len - offset,
> >>> + folio_size(folio) - offset_in_folio(folio, orig_pos));
> >>> + size_t padjust;
> >>> +
> >>> + spin_lock_irq(&ifs->state_lock);
> >>> + if (!ifs->read_bytes_pending)
> >>> + ifs->read_bytes_pending = to_read;
> >>> + padjust = pos - orig_pos;
> >>> + ifs->read_bytes_pending -= padjust;
> >>> + if (!ifs->read_bytes_pending)
> >>> + rbp_finished = true;
> >>> + spin_unlock_irq(&ifs->state_lock);
> >>> + }
> >>> +
> >>> if (plen == 0)
> >>> goto done;
> >>>
> >>> if (iomap_block_needs_zeroing(iter, pos)) {
> >>> + if (ifs) {
> >>> + spin_lock_irq(&ifs->state_lock);
> >>> + ifs->read_bytes_pending -= plen;
> >>> + if (!ifs->read_bytes_pending)
> >>> + rbp_finished = true;
> >>> + spin_unlock_irq(&ifs->state_lock);
> >>> + }
> >>> folio_zero_range(folio, poff, plen);
> >>> iomap_set_range_uptodate(folio, poff, plen);
> >>> goto done;
> >>> }
> >>>
> >>> ctx->cur_folio_in_bio = true;
> >>> - if (ifs) {
> >>> - spin_lock_irq(&ifs->state_lock);
> >>> - ifs->read_bytes_pending += plen;
> >>> - spin_unlock_irq(&ifs->state_lock);
> >>> - }
> >>>
> >>> sector = iomap_sector(iomap, pos);
> >>> if (!ctx->bio ||
> >>> @@ -435,6 +454,14 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
> >>> }
> >>>
> >>> done:
> >>> + /*
> >>> + * If there is no bio prepared and if rbp is finished and
> >>> + * this was the last offset within this folio then mark
> >>> + * cur_folio_in_bio to false.
> >>> + */
> >>> + if (!ctx->bio && rbp_finished &&
> >>> + offset_in_folio(folio, pos + plen) == 0)
> >>> + ctx->cur_folio_in_bio = false;
> >>
> >> ...yes, I think I am confused. When would ctx->bio be NULL but
> >> cur_folio_in_bio is true?
> >
> > Previously we had the bio submitted and so we make it null, but we still
> > have ctx->cur_folio & ctx->cur_folio_in_bio to true, since we haven't
> > completely processed the folio.
> >
> >>
> >> I /think/ what you're doing here is using read_bytes_pending to figure
> >> out if you've processed the folio up to the end of the mapping? But
> >> then you submit the bio unconditionally below for each readpage_iter
> >> call?
> >>
> >
> > yes, that's right.
> >
> >> Why not add an IOMAP_BOUNDARY flag that means "I will have to do some IO
> >> if you call ->iomap_begin again"? Then if we get to this point in
> >> readpage_iter with a ctx->bio, we can submit the bio, clear
> >> cur_folio_in_bio, and return? And then you don't need this machinery?
> >
> > TBH, I initially didn't think the approach taken in the patch would
> > require such careful handling of r_b_p. It was because of all of this
> > corner cases when we don't need to read the update blocks and/or in case
> > of an error we need to ensure we reduce r_b_p carefully so that we could
> > unlock the folio and when extent spans beyond i_size.
> >
> > So it's all about how do we know if we could unlock the folio and that it's
> > corresponding blocks/mapping has been all processed or submitted for
> > I/O.
> >
> > Assume we have a folio which spans over multiple extents. In such a
> > case,
> > -> we process a bio for 1st extent,
> > -> then we go back to iomap_iter() to get new extent mapping,
> > -> We now increment the r_b_p with this new plen to be processed.
> > -> We then submit the previous bio, since this new mapping couldn't be
> > merged due to discontinuous extents.
> > So by first incrementing the r_b_p before doing submit_bio(), we don't
> > unlock the folio at bio completion.
> >
> > Maybe, it would be helpful if we have an easy mechanism to keep some state
> > from the time of submit_bio() till the bio completion to know that the
> > corresponding folio is still being processed and it shouldn't be
> > unlocked.
> > -> This currently is what we are doing by making r_b_p to the value of
> > folio_size() and then carefully reducing r_b_p for all the cases I
> > mentioned above.
> >
> > Let me think if adding a IOMAP_BH_BOUNDARY flag could be helpful or not.
> > Say if we have a pagesize of 64k that means all first 16 blocks belongs
> > to same page. So even with IOMAP_BH_BOUNDARY flag the problem that still
> > remains is that, even if we submit the bio at block 11 (bh_boundary
> > block), how will the bio completion side know that the folio is not
> > completely processed and so we shouldn't unlock the folio?
>
> Maybe one way could be if we could add another state flag to ifs for
> BH_BOUNDARY block and read that at the bio completion.
> We can then also let the completion side know if it should unlock the
> folio or whether it still needs processing at the submission side.

The approach I suggested was to initialise read_bytes_pending to
folio_size() at the start. Then subtract off blocksize for each
uptodate block, whether you find it already uptodate, or as the
completion handler runs.

Is there a reason that doesn't work?
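
Sketching what I mean (purely illustrative, not against any tree, and
block_size/len here are just placeholders):

	/* when the ifs is set up for this read */
	ifs->read_bytes_pending = folio_size(folio);

	/* submission side: block already uptodate, or zeroed in place */
	ifs->read_bytes_pending -= block_size;

	/* completion side: for each range a bio brought uptodate */
	ifs->read_bytes_pending -= len;
	if (!ifs->read_bytes_pending)
		folio_unlock(folio);

with ifs->state_lock held around each update, of course.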

2024-04-26 17:55:42

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings

Matthew Wilcox <[email protected]> writes:

> On Fri, Apr 26, 2024 at 10:55:23PM +0530, Ritesh Harjani wrote:
>> Ritesh Harjani (IBM) <[email protected]> writes:
>>
>> > "Darrick J. Wong" <[email protected]> writes:
>> >
>> >> On Thu, Apr 25, 2024 at 06:58:51PM +0530, Ritesh Harjani (IBM) wrote:
>> >>> This patch optimizes the data access patterns for filesystems with
>> >>> indirect block mapping by implementing BH_Boundary handling within
>> >>> iomap.
>> >>>
>> >>> Currently the bios for reads within iomap are only submitted at
>> >>> 2 places -
>> >>> 1. If we cannot merge the new req. with previous bio, only then we
>> >>> submit the previous bio.
>> >>> 2. Submit the bio at the end of the entire read processing.
>> >>>
>> >>> This means for filesystems with indirect block mapping, we call into
>> >>> ->iomap_begin() again w/o submitting the previous bios. That causes
>> >>> unoptimized data access patterns for blocks which are of BH_Boundary type.
>> >>>
>> >>> For e.g. consider the file mapping
>> >>> logical block(4k) physical block(4k)
>> >>> 0-11 1000-1011
>> >>> 12-15 1013-1016
>> >>>
>> >>> In the above, physical block 1012 is an indirect metadata block which has the
>> >>> mapping information for the next set of indirect blocks (1013-1016).
>> >>> With iomap buffered reads for reading the first 16 logical blocks of a file
>> >>> (0-15), we get the below I/O pattern
>> >>> - submit a bio for 1012
>> >>> - complete the bio for 1012
>> >>> - submit a bio for 1000-1011
>> >>> - submit a bio for 1013-1016
>> >>> - complete the bios for 1000-1011
>> >>> - complete the bios for 1013-1016
>> >>>
>> >>> So as we can see, the above is a non-optimal I/O access pattern, and we also
>> >>> get 3 bio completions instead of 2.
>> >>>
>> >>> This patch changes this behavior by doing submit_bio() if there are any
>> >>> bios already processed, before calling ->iomap_begin() again.
>> >>> That means any blocks which are already processed get
>> >>> submitted for I/O earlier, and then within ->iomap_begin(), if we get a
>> >>> request for reading an indirect metadata block, the block layer can merge
>> >>> those bios with the already submitted read request to reduce the no. of bio
>> >>> completions.
>> >>>
>> >>> Now, for bs < ps or for large folios, this patch requires proper handling
>> >>> of "ifs->read_bytes_pending". In that, we first set ifs->read_bytes_pending
>> >>> to folio_size(). Then we handle all the cases where we need to subtract
>> >>> ifs->read_bytes_pending, either on the submission side
>> >>> (if we don't need to submit any I/O - e.g. for uptodate sub-blocks),
>> >>> or during an I/O error, or at the completion of an I/O.
>> >>>
>> >>> Here is the ftrace output of iomap and block layer with ext2 iomap
>> >>> conversion patches -
>> >>>
>> >>> root# filefrag -b512 -v /mnt1/test/f1
>> >>> Filesystem type is: ef53
>> >>> Filesystem cylinder groups approximately 32
>> >>> File size of /mnt1/test/f1 is 65536 (128 blocks of 512 bytes)
>> >>> ext: logical_offset: physical_offset: length: expected: flags:
>> >>> 0: 0.. 95: 98304.. 98399: 96: merged
>> >>> 1: 96.. 127: 98408.. 98439: 32: 98400: last,merged,eof
>> >>> /mnt1/test/f1: 2 extents found
>> >>>
>> >>> root# #This reads 4 blocks starting from lblk 10, 11, 12, 13
>> >>> root# xfs_io -c "pread -b$((4*4096)) $((10*4096)) $((4*4096))" /mnt1/test/f1
>> >>>
>> >>> w/o this patch - (the indirect block is submitted first and does not get merged, resulting in 3 bio completions)
>> >>> xfs_io-907 [002] ..... 185.608791: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
>> >>> xfs_io-907 [002] ..... 185.608819: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
>> >>> xfs_io-907 [002] ..... 185.608823: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
>> >>> xfs_io-907 [002] ..... 185.608831: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
>> >>> xfs_io-907 [002] ..... 185.608859: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
>> >>> xfs_io-907 [002] ..... 185.608865: block_getrq: 8,16 R 98400 + 8 [xfs_io]
>> >>> xfs_io-907 [002] ..... 185.608867: block_io_start: 8,16 R 4096 () 98400 + 8 [xfs_io]
>> >>> xfs_io-907 [002] ..... 185.608869: block_plug: [xfs_io]
>> >>> xfs_io-907 [002] ..... 185.608872: block_unplug: [xfs_io] 1
>> >>> xfs_io-907 [002] ..... 185.608874: block_rq_insert: 8,16 R 4096 () 98400 + 8 [xfs_io]
>> >>> kworker/2:1H-198 [002] ..... 185.608908: block_rq_issue: 8,16 R 4096 () 98400 + 8 [kworker/2:1H]
>> >>> <idle>-0 [002] d.h2. 185.609579: block_rq_complete: 8,16 R () 98400 + 8 [0]
>> >>> <idle>-0 [002] dNh2. 185.609631: block_io_done: 8,16 R 0 () 98400 + 0 [swapper/2]
>> >>> xfs_io-907 [002] ..... 185.609694: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
>> >>> xfs_io-907 [002] ..... 185.609704: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
>> >>> xfs_io-907 [002] ..... 185.609718: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
>> >>> xfs_io-907 [002] ..... 185.609721: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>> >>> xfs_io-907 [002] ..... 185.609726: block_plug: [xfs_io]
>> >>> xfs_io-907 [002] ..... 185.609735: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1e1/0x2c0
>> >>> xfs_io-907 [002] ..... 185.609736: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
>> >>> xfs_io-907 [002] ..... 185.609740: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
>> >>> xfs_io-907 [002] ..... 185.609741: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>> >>> xfs_io-907 [002] ..... 185.609756: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>> >>> xfs_io-907 [002] ..... 185.609769: block_rq_issue: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>> >>> <idle>-0 [002] d.H2. 185.610280: block_rq_complete: 8,16 RA () 98408 + 16 [0]
>> >>> <idle>-0 [002] d.H2. 185.610289: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/2]
>> >>> <idle>-0 [002] d.H2. 185.610292: block_rq_complete: 8,16 RA () 98384 + 16 [0]
>> >>> <idle>-0 [002] dNH2. 185.610301: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/2]
>> >>
>> >> Could this be shortened to ... the iomap calls and
>> >> block_bio_queue/backmerge? It's a bit difficult to see the point you're
>> >> getting at with all the other noise.
>> >
>> > I will remove this log and move it to cover letter and will just extend
>> > the simple example I considered before in this commit message,
>> > to show the difference with and w/o patch.
>> >
>> >>
>> >> I think you're trying to say that the access pattern here is 98400 ->
>> >> 98408 -> 98384, which is not sequential?
>> >>
>> >
>> > it's (98400,8 ==> metadata block) -> (98384,16 == lblk 10 & 11) -> (98408,16 ==> lblk 12 & 13)
>> > ... w/o the patch
>> >
>> >>> v/s with the patch - (optimized I/O access pattern and the bio gets merged, resulting in only 2 bio completions)
>> >>> xfs_io-944 [005] ..... 99.926187: iomap_readahead: dev 8:16 ino 0xc nr_pages 4
>> >>> xfs_io-944 [005] ..... 99.926208: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x4000 processed 0 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x9d/0x2c0
>> >>> xfs_io-944 [005] ..... 99.926211: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300a000 offset 0xa000 length 0x2000 type MAPPED flags MERGED
>> >>> xfs_io-944 [005] ..... 99.926222: block_bio_queue: 8,16 RA 98384 + 16 [xfs_io]
>> >>> xfs_io-944 [005] ..... 99.926232: block_getrq: 8,16 RA 98384 + 16 [xfs_io]
>> >>> xfs_io-944 [005] ..... 99.926233: block_io_start: 8,16 RA 8192 () 98384 + 16 [xfs_io]
>> >>> xfs_io-944 [005] ..... 99.926234: block_plug: [xfs_io]
>> >>> xfs_io-944 [005] ..... 99.926235: iomap_iter: dev 8:16 ino 0xc pos 0xa000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
>> >>> xfs_io-944 [005] ..... 99.926261: block_bio_queue: 8,16 R 98400 + 8 [xfs_io]
>> >>> xfs_io-944 [005] ..... 99.926266: block_bio_backmerge: 8,16 R 98400 + 8 [xfs_io]
>> >>> xfs_io-944 [005] ..... 99.926271: block_unplug: [xfs_io] 1
>> >>> xfs_io-944 [005] ..... 99.926272: block_rq_insert: 8,16 RA 12288 () 98384 + 24 [xfs_io]
>> >>> kworker/5:1H-234 [005] ..... 99.926314: block_rq_issue: 8,16 RA 12288 () 98384 + 24 [kworker/5:1H]
>> >>> <idle>-0 [005] d.h2. 99.926905: block_rq_complete: 8,16 RA () 98384 + 24 [0]
>> >>> <idle>-0 [005] dNh2. 99.926931: block_io_done: 8,16 RA 0 () 98384 + 0 [swapper/5]
>> >>> xfs_io-944 [005] ..... 99.926971: iomap_iter_dstmap: dev 8:16 ino 0xc bdev 8:16 addr 0x300d000 offset 0xc000 length 0x2000 type MAPPED flags MERGED
>> >>> xfs_io-944 [005] ..... 99.926981: block_bio_queue: 8,16 RA 98408 + 16 [xfs_io]
>> >>> xfs_io-944 [005] ..... 99.926989: block_getrq: 8,16 RA 98408 + 16 [xfs_io]
>> >>> xfs_io-944 [005] ..... 99.926989: block_io_start: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>> >>> xfs_io-944 [005] ..... 99.926991: block_plug: [xfs_io]
>> >>> xfs_io-944 [005] ..... 99.926993: iomap_iter: dev 8:16 ino 0xc pos 0xc000 length 0x2000 processed 8192 flags (0x0) ops 0xffffffff82242160 caller iomap_readahead+0x1f9/0x2c0
>> >>> xfs_io-944 [005] ..... 99.927001: block_rq_issue: 8,16 RA 8192 () 98408 + 16 [xfs_io]
>> >>> <idle>-0 [005] d.h2. 99.927397: block_rq_complete: 8,16 RA () 98408 + 16 [0]
>> >>> <idle>-0 [005] dNh2. 99.927414: block_io_done: 8,16 RA 0 () 98408 + 0 [swapper/5]
>> >>>
>> >>> Suggested-by: Matthew Wilcox <[email protected]>
>> >>> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
>> >>> cc: Ojaswin Mujoo <[email protected]>
>> >>> ---
>> >>> fs/iomap/buffered-io.c | 112 +++++++++++++++++++++++++++++++----------
>> >>> 1 file changed, 85 insertions(+), 27 deletions(-)
>> >>>
>> >>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> >>> index 0a4269095ae2..a1d50086a3f5 100644
>> >>> --- a/fs/iomap/buffered-io.c
>> >>> +++ b/fs/iomap/buffered-io.c
>> >>> @@ -30,7 +30,7 @@ typedef int (*iomap_punch_t)(struct inode *inode, loff_t offset, loff_t length);
>> >>> */
>> >>> struct iomap_folio_state {
>> >>> spinlock_t state_lock;
>> >>> - unsigned int read_bytes_pending;
>> >>> + size_t read_bytes_pending;
>> >>> atomic_t write_bytes_pending;
>> >>>
>> >>> /*
>> >>> @@ -380,6 +380,7 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>> >>> loff_t orig_pos = pos;
>> >>> size_t poff, plen;
>> >>> sector_t sector;
>> >>> + bool rbp_finished = false;
>> >>
>> >> What is "rbp"? My assembly programmer brain says x64 frame pointer, but
>> >> that's clearly wrong here. Maybe I'm confused...
>> >>
>> >
>> > rbp == read_bytes_pending ;)
>> >
>> >>> if (iomap->type == IOMAP_INLINE)
>> >>> return iomap_read_inline_data(iter, folio);
>> >>> @@ -387,21 +388,39 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>> >>> /* zero post-eof blocks as the page may be mapped */
>> >>> ifs = ifs_alloc(iter->inode, folio, iter->flags);
>> >>> iomap_adjust_read_range(iter->inode, folio, &pos, length, &poff, &plen);
>> >>> +
>> >>> + if (ifs) {
>> >>> + loff_t to_read = min_t(loff_t, iter->len - offset,
>> >>> + folio_size(folio) - offset_in_folio(folio, orig_pos));
>> >>> + size_t padjust;
>> >>> +
>> >>> + spin_lock_irq(&ifs->state_lock);
>> >>> + if (!ifs->read_bytes_pending)
>> >>> + ifs->read_bytes_pending = to_read;
>> >>> + padjust = pos - orig_pos;
>> >>> + ifs->read_bytes_pending -= padjust;
>> >>> + if (!ifs->read_bytes_pending)
>> >>> + rbp_finished = true;
>> >>> + spin_unlock_irq(&ifs->state_lock);
>> >>> + }
>> >>> +
>> >>> if (plen == 0)
>> >>> goto done;
>> >>>
>> >>> if (iomap_block_needs_zeroing(iter, pos)) {
>> >>> + if (ifs) {
>> >>> + spin_lock_irq(&ifs->state_lock);
>> >>> + ifs->read_bytes_pending -= plen;
>> >>> + if (!ifs->read_bytes_pending)
>> >>> + rbp_finished = true;
>> >>> + spin_unlock_irq(&ifs->state_lock);
>> >>> + }
>> >>> folio_zero_range(folio, poff, plen);
>> >>> iomap_set_range_uptodate(folio, poff, plen);
>> >>> goto done;
>> >>> }
>> >>>
>> >>> ctx->cur_folio_in_bio = true;
>> >>> - if (ifs) {
>> >>> - spin_lock_irq(&ifs->state_lock);
>> >>> - ifs->read_bytes_pending += plen;
>> >>> - spin_unlock_irq(&ifs->state_lock);
>> >>> - }
>> >>>
>> >>> sector = iomap_sector(iomap, pos);
>> >>> if (!ctx->bio ||
>> >>> @@ -435,6 +454,14 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
>> >>> }
>> >>>
>> >>> done:
>> >>> + /*
>> >>> + * If there is no bio prepared and if rbp is finished and
>> >>> + * this was the last offset within this folio then mark
>> >>> + * cur_folio_in_bio to false.
>> >>> + */
>> >>> + if (!ctx->bio && rbp_finished &&
>> >>> + offset_in_folio(folio, pos + plen) == 0)
>> >>> + ctx->cur_folio_in_bio = false;
>> >>
>> >> ...yes, I think I am confused. When would ctx->bio be NULL but
>> >> cur_folio_in_bio is true?
>> >
>> > Previously we had the bio submitted and so we make it NULL, but we still
>> > have ctx->cur_folio & ctx->cur_folio_in_bio set to true, since we haven't
>> > completely processed the folio.
>> >
>> >>
>> >> I /think/ what you're doing here is using read_bytes_pending to figure
>> >> out if you've processed the folio up to the end of the mapping? But
>> >> then you submit the bio unconditionally below for each readpage_iter
>> >> call?
>> >>
>> >
>> > yes, that's right.
>> >
>> >> Why not add an IOMAP_BOUNDARY flag that means "I will have to do some IO
>> >> if you call ->iomap_begin again"? Then if we get to this point in
>> >> readpage_iter with a ctx->bio, we can submit the bio, clear
>> >> cur_folio_in_bio, and return? And then you don't need this machinery?
>> >
>> > TBH, I initially didn't think the approach taken in the patch would
>> > require such careful handling of r_b_p. It was because of all of these
>> > corner cases where we don't need to read the uptodate blocks, and/or in case
>> > of an error, we need to ensure we reduce r_b_p carefully so that we can
>> > unlock the folio, and when the extent spans beyond i_size.
>> >
>> > So it's all about how we know whether we can unlock the folio, i.e. that its
>> > corresponding blocks/mappings have all been processed or submitted for
>> > I/O.
>> >
>> > Assume we have a folio which spans over multiple extents. In such a
>> > case,
>> > -> we process a bio for 1st extent,
>> > -> then we go back to iomap_iter() to get new extent mapping,
>> > -> We now increment the r_b_p with this new plen to be processed.
>> > -> We then submit the previous bio, since this new mapping couldn't be
>> > merged due to discontinuous extents.
>> > So by first incrementing the r_b_p before doing submit_bio(), we don't
>> > unlock the folio at bio completion.
>> >
>> > Maybe, it would be helpful if we have an easy mechanism to keep some state
>> > from the time of submit_bio() till the bio completion to know that the
>> > corresponding folio is still being processed and it shouldn't be
>> > unlocked.
>> > -> This currently is what we are doing by setting r_b_p to the value of
>> > folio_size() and then carefully reducing r_b_p for all the cases I
>> > mentioned above.
>> >
>> > Let me think if adding a IOMAP_BH_BOUNDARY flag could be helpful or not.
>> > Say if we have a pagesize of 64k, that means the first 16 blocks all belong
>> > to the same page. So even with the IOMAP_BH_BOUNDARY flag, the problem that still
>> > remains is that, even if we submit the bio at block 11 (bh_boundary
>> > block), how will the bio completion side know that the folio is not
>> > completely processed and so we shouldn't unlock the folio?
>>
>> Maybe one way could be if we could add another state flag to ifs for
>> BH_BOUNDARY block and read that at the bio completion.
>> We can then also let the completion side know if it should unlock the
>> folio or whether it still needs processing at the submission side.
>
> The approach I suggested was to initialise read_bytes_pending to
> folio_size() at the start. Then subtract off blocksize for each
> uptodate block, whether you find it already uptodate, or as the
> completion handler runs.
>
> Is there a reason that doesn't work?

That is what this patch series does, right? The current patch does work
as far as my testing goes.

For e.g. This is what initializes the r_b_p for the first time when
ifs->r_b_p is 0.

+ loff_t to_read = min_t(loff_t, iter->len - offset,
+ folio_size(folio) - offset_in_folio(folio, orig_pos));
<..>
+ if (!ifs->read_bytes_pending)
+ ifs->read_bytes_pending = to_read;


Then this is where we subtract r_b_p for blocks which are uptodate.
+ padjust = pos - orig_pos;
+ ifs->read_bytes_pending -= padjust;


This is when we adjust r_b_p when we directly zero the folio.
if (iomap_block_needs_zeroing(iter, pos)) {
+ if (ifs) {
+ spin_lock_irq(&ifs->state_lock);
+ ifs->read_bytes_pending -= plen;
+ if (!ifs->read_bytes_pending)
+ rbp_finished = true;
+ spin_unlock_irq(&ifs->state_lock);
+ }

But as you see this requires surgery throughout read paths. What if
we add a state flag to ifs only for BH_BOUNDARY. Maybe that could
result in a more simplified approach?
Because all we require is to know whether the folio should be unlocked
or not at the time of completion.
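
Something like the below, purely as an illustration of that idea (the
flag name and helpers are hypothetical, not part of this series):

/* hypothetical flag: folio still has blocks being processed for read */
#define IFS_RD_IN_PROGRESS	0

struct iomap_folio_state {
	spinlock_t	state_lock;
	size_t		read_bytes_pending;
	atomic_t	write_bytes_pending;
	unsigned long	flags;		/* hypothetical, holds the bit above */
	unsigned long	state[];
};

/* submission side: set before submit_bio() at a boundary block */
static void ifs_set_read_in_progress(struct folio *folio)
{
	struct iomap_folio_state *ifs = folio->private;

	if (ifs)
		set_bit(IFS_RD_IN_PROGRESS, &ifs->flags);
}

/* completion side: skip the unlock while the submission side is not done */
static bool ifs_read_in_progress(struct folio *folio)
{
	struct iomap_folio_state *ifs = folio->private;

	return ifs && test_bit(IFS_RD_IN_PROGRESS, &ifs->flags);
}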

Do you think we should try that part or you think the current approach
looks ok?

-ritesh

2024-04-26 18:58:11

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings

Matthew Wilcox <[email protected]> writes:

> On Fri, Apr 26, 2024 at 11:25:25PM +0530, Ritesh Harjani wrote:
>> Matthew Wilcox <[email protected]> writes:
>> > The approach I suggested was to initialise read_bytes_pending to
>> > folio_size() at the start. Then subtract off blocksize for each
>> > uptodate block, whether you find it already uptodate, or as the
>> > completion handler runs.
>> >
>> > Is there a reason that doesn't work?
>>
>> That is what this patch series does, right? The current patch does work
>> as far as my testing goes.
>>
>> For e.g. This is what initializes the r_b_p for the first time when
>> ifs->r_b_p is 0.
>>
>> + loff_t to_read = min_t(loff_t, iter->len - offset,
>> + folio_size(folio) - offset_in_folio(folio, orig_pos));
>> <..>
>> + if (!ifs->read_bytes_pending)
>> + ifs->read_bytes_pending = to_read;
>>
>>
>> Then this is where we subtract r_b_p for blocks which are uptodate.
>> + padjust = pos - orig_pos;
>> + ifs->read_bytes_pending -= padjust;
>>
>>
>> This is when we adjust r_b_p when we directly zero the folio.
>> if (iomap_block_needs_zeroing(iter, pos)) {
>> + if (ifs) {
>> + spin_lock_irq(&ifs->state_lock);
>> + ifs->read_bytes_pending -= plen;
>> + if (!ifs->read_bytes_pending)
>> + rbp_finished = true;
>> + spin_unlock_irq(&ifs->state_lock);
>> + }
>>
>> But as you see this requires surgery throughout read paths. What if
>> we add a state flag to ifs only for BH_BOUNDARY. Maybe that could
>> result in a more simplified approach?
>> Because all we require is to know whether the folio should be unlocked
>> or not at the time of completion.
>>
>> Do you think we should try that part or you think the current approach
>> looks ok?
>
> You've really made life hard for yourself. I had something more like
> this in mind. I may have missed a few places that need to be changed,
> but this should update read_bytes_pending everywhere that we set bits
> in the uptodate bitmap, so it should be right?

Please correct me if I am wrong.

>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 41c8f0c68ef5..f87ca8ee4d19 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -79,6 +79,7 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
> if (ifs) {
> spin_lock_irqsave(&ifs->state_lock, flags);
> uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
> + ifs->read_bytes_pending -= len;
> spin_unlock_irqrestore(&ifs->state_lock, flags);
> }

iomap_set_range_uptodate() gets called from ->write_begin() and
->write_end() too. So what we are saying is we are updating
the state of read_bytes_pending even though we are not in
->read_folio() or ->readahead() call?

>
> @@ -208,6 +209,8 @@ static struct iomap_folio_state *ifs_alloc(struct inode *inode,
> spin_lock_init(&ifs->state_lock);
> if (folio_test_uptodate(folio))
> bitmap_set(ifs->state, 0, nr_blocks);
> + else
> + ifs->read_bytes_pending = folio_size(folio);

We might not come till here during ->read_folio -> ifs_alloc(). Since we
might have a cached ifs which was allocated during write to this folio.

But unless you are saying that during writes, we would have set
ifs->r_b_p to folio_size() and when the read call happens, we use
the same value of the cached ifs.
Ok, I see. I was mostly focusing on updating ifs->r_b_p value only when
the read bytes are actually pending during ->read_folio() or
->readahead() and not updating r_b_p during writes.

...One small problem which I see with this approach is - we might have
some non-zero value in ifs->r_b_p when ifs_free() gets called and it
might give a warning of non-zero ifs->r_b_p, because we updated
ifs->r_b_p during writes to a non-zero value, but the reads
never happened. Then during a call to ->release_folio, it will complain
of a non-zero ifs->r_b_p.


> if (folio_test_dirty(folio))
> bitmap_set(ifs->state, nr_blocks, nr_blocks);
> folio_attach_private(folio, ifs);
> @@ -396,12 +399,6 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
> }
>
> ctx->cur_folio_in_bio = true;
> - if (ifs) {
> - spin_lock_irq(&ifs->state_lock);
> - ifs->read_bytes_pending += plen;
> - spin_unlock_irq(&ifs->state_lock);
> - }
> -
> sector = iomap_sector(iomap, pos);
> if (!ctx->bio ||
> bio_end_sector(ctx->bio) != sector ||

-ritesh

2024-04-26 19:32:51

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings

On Fri, Apr 26, 2024 at 11:25:25PM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <[email protected]> writes:
> > The approach I suggested was to initialise read_bytes_pending to
> > folio_size() at the start. Then subtract off blocksize for each
> > uptodate block, whether you find it already uptodate, or as the
> > completion handler runs.
> >
> > Is there a reason that doesn't work?
>
> That is what this patch series does, right? The current patch does work
> as far as my testing goes.
>
> For e.g. This is what initializes the r_b_p for the first time when
> ifs->r_b_p is 0.
>
> + loff_t to_read = min_t(loff_t, iter->len - offset,
> + folio_size(folio) - offset_in_folio(folio, orig_pos));
> <..>
> + if (!ifs->read_bytes_pending)
> + ifs->read_bytes_pending = to_read;
>
>
> Then this is where we subtract r_b_p for blocks which are uptodate.
> + padjust = pos - orig_pos;
> + ifs->read_bytes_pending -= padjust;
>
>
> This is when we adjust r_b_p when we directly zero the folio.
> if (iomap_block_needs_zeroing(iter, pos)) {
> + if (ifs) {
> + spin_lock_irq(&ifs->state_lock);
> + ifs->read_bytes_pending -= plen;
> + if (!ifs->read_bytes_pending)
> + rbp_finished = true;
> + spin_unlock_irq(&ifs->state_lock);
> + }
>
> But as you see this requires surgery throughout read paths. What if
> we add a state flag to ifs only for BH_BOUNDARY. Maybe that could
> result in a more simplified approach?
> Because all we require is to know whether the folio should be unlocked
> or not at the time of completion.
>
> Do you think we should try that part or you think the current approach
> looks ok?

You've really made life hard for yourself. I had something more like
this in mind. I may have missed a few places that need to be changed,
but this should update read_bytes_pending everywhere that we set bits
in the uptodate bitmap, so it should be right?

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 41c8f0c68ef5..f87ca8ee4d19 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -79,6 +79,7 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
if (ifs) {
spin_lock_irqsave(&ifs->state_lock, flags);
uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
+ ifs->read_bytes_pending -= len;
spin_unlock_irqrestore(&ifs->state_lock, flags);
}

@@ -208,6 +209,8 @@ static struct iomap_folio_state *ifs_alloc(struct inode *inode,
spin_lock_init(&ifs->state_lock);
if (folio_test_uptodate(folio))
bitmap_set(ifs->state, 0, nr_blocks);
+ else
+ ifs->read_bytes_pending = folio_size(folio);
if (folio_test_dirty(folio))
bitmap_set(ifs->state, nr_blocks, nr_blocks);
folio_attach_private(folio, ifs);
@@ -396,12 +399,6 @@ static loff_t iomap_readpage_iter(const struct iomap_iter *iter,
}

ctx->cur_folio_in_bio = true;
- if (ifs) {
- spin_lock_irq(&ifs->state_lock);
- ifs->read_bytes_pending += plen;
- spin_unlock_irq(&ifs->state_lock);
- }
-
sector = iomap_sector(iomap, pos);
if (!ctx->bio ||
bio_end_sector(ctx->bio) != sector ||

2024-04-26 20:32:20

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings

On Sat, Apr 27, 2024 at 12:27:52AM +0530, Ritesh Harjani wrote:
> Matthew Wilcox <[email protected]> writes:
> > @@ -79,6 +79,7 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
> > if (ifs) {
> > spin_lock_irqsave(&ifs->state_lock, flags);
> > uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
> > + ifs->read_bytes_pending -= len;
> > spin_unlock_irqrestore(&ifs->state_lock, flags);
> > }
>
> iomap_set_range_uptodate() gets called from ->write_begin() and
> ->write_end() too. So what we are saying is we are updating
> the state of read_bytes_pending even though we are not in
> ->read_folio() or ->readahead() call?

Exactly.

> >
> > @@ -208,6 +209,8 @@ static struct iomap_folio_state *ifs_alloc(struct inode *inode,
> > spin_lock_init(&ifs->state_lock);
> > if (folio_test_uptodate(folio))
> > bitmap_set(ifs->state, 0, nr_blocks);
> > + else
> > + ifs->read_bytes_pending = folio_size(folio);
>
> We might not come till here during ->read_folio -> ifs_alloc(). Since we
> might have a cached ifs which was allocated during write to this folio.
>
> But unless you are saying that during writes, we would have set
> ifs->r_b_p to folio_size() and when the read call happens, we use
> the same value of the cached ifs.
> Ok, I see. I was mostly focusing on updating ifs->r_b_p value only when
> the read bytes are actually pending during ->read_folio() or
> ->readahead() and not updating r_b_p during writes.

I see why you might want to think that way ... but this way is much less
complex, don't you think? ;-)

> ...One small problem which I see with this approach is - we might have
> some non-zero value in ifs->r_b_p when ifs_free() gets called and it
> might give a warning of non-zero ifs->r_b_p, because we updated
> ifs->r_b_p during writes to a non-zero value, but the reads
> never happened. Then during a call to ->release_folio, it will complain
> of a non-zero ifs->r_b_p.

Yes, we'll have to remove that assertion. I don't think that's a
problem, do you?


2024-04-27 04:44:39

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFCv3 6/7] iomap: Optimize iomap_read_folio

On Fri, Apr 26, 2024 at 02:20:05PM +0530, Ritesh Harjani wrote:
> iomap_read_folio_iter() handles multiple sub blocks within a given
> folio but its implementation logic is similar to how
> iomap_readahead_iter() handles multiple folios within a single mapped
> extent. Both of them iterate over a given range of folio/mapped extent
> and call iomap_readpage_iter() for reading.

Sounds good.

2024-04-27 06:04:01

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings

Matthew Wilcox <[email protected]> writes:

> On Sat, Apr 27, 2024 at 12:27:52AM +0530, Ritesh Harjani wrote:
>> Matthew Wilcox <[email protected]> writes:
>> > @@ -79,6 +79,7 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
>> > if (ifs) {
>> > spin_lock_irqsave(&ifs->state_lock, flags);
>> > uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
>> > + ifs->read_bytes_pending -= len;
>> > spin_unlock_irqrestore(&ifs->state_lock, flags);
>> > }
>>
>> iomap_set_range_uptodate() gets called from ->write_begin() and
>> ->write_end() too. So what we are saying is we are updating
>> the state of read_bytes_pending even though we are not in
>> ->read_folio() or ->readahead() call?
>
> Exactly.
>
>> >
>> > @@ -208,6 +209,8 @@ static struct iomap_folio_state *ifs_alloc(struct inode *inode,
>> > spin_lock_init(&ifs->state_lock);
>> > if (folio_test_uptodate(folio))
>> > bitmap_set(ifs->state, 0, nr_blocks);
>> > + else
>> > + ifs->read_bytes_pending = folio_size(folio);
>>
>> We might not come till here during ->read_folio -> ifs_alloc(). Since we
>> might have a cached ifs which was allocated during write to this folio.
>>
>> But unless you are saying that during writes, we would have set
>> ifs->r_b_p to folio_size() and when the read call happens, we use
>> the same value of the cached ifs.
>> Ok, I see. I was mostly focusing on updating ifs->r_b_p value only when
>> the read bytes are actually pending during ->read_folio() or
>> ->readahead() and not updating r_b_p during writes.
>
> I see why you might want to think that way ... but this way is much less
> complex, don't you think? ;-)
>
>> ...One small problem which I see with this approach is - we might have
>> some non-zero value in ifs->r_b_p when ifs_free() gets called and it
>> might give a warning of non-zero ifs->r_b_p, because we updated
>> ifs->r_b_p during writes to a non-zero value, but the reads
>> > never happened. Then during a call to ->release_folio, it will complain
>> of a non-zero ifs->r_b_p.
>
> Yes, we'll have to remove that assertion. I don't think that's a
> problem, do you?

Sure, I will give it a spin.

-ritesh

2024-04-27 06:11:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings

On Fri, Apr 26, 2024 at 08:19:47PM +0100, Matthew Wilcox wrote:
> > ...One small problem which I see with this approach is - we might have
> > some non-zero value in ifs->r_b_p when ifs_free() gets called and it
> > might give a warning of non-zero ifs->r_b_p, because we updated
> > ifs->r_b_p during writes to a non-zero value, but the reads
> > never happened. Then during a call to ->release_folio, it will complain
> > of a non-zero ifs->r_b_p.
>
> Yes, we'll have to remove that assertion. I don't think that's a
> problem, do you?

Or refine it, as the parts not read should not be uptodate either?

Either way, I had another idea to simplify things a bit, but this might
end up being even simpler, so I'll stop hacking on my version
for now.
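
One way such a refinement could look, as a rough, untested sketch
(assuming read_bytes_pending is pre-charged to folio_size() in
ifs_alloc() as in the diff above): instead of dropping the assertion in
ifs_free() entirely, only warn when bytes that were never read ended up
marked uptodate anyway.

static void ifs_free(struct folio *folio)
{
	struct iomap_folio_state *ifs = folio_detach_private(folio);

	if (!ifs)
		return;
	/*
	 * read_bytes_pending is pre-charged at ifs allocation time, so a
	 * leftover value only means some blocks were never read; in that
	 * case the folio must not be fully uptodate either.
	 */
	WARN_ON_ONCE(ifs->read_bytes_pending != 0 &&
		     folio_test_uptodate(folio));
	WARN_ON_ONCE(atomic_read(&ifs->write_bytes_pending));
	WARN_ON_ONCE(ifs_is_fully_uptodate(folio, ifs) !=
			folio_test_uptodate(folio));
	kfree(ifs);
}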

2024-04-27 07:46:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings

On Thu, Apr 25, 2024 at 06:58:51PM +0530, Ritesh Harjani (IBM) wrote:
> Currently the bios for reads within iomap are only submitted at
> 2 places -
> 1. If we cannot merge the new req. with previous bio, only then we
> submit the previous bio.
> 2. Submit the bio at the end of the entire read processing.
>
> This means for filesystems with indirect block mapping, we call into
> ->iomap_begin() again w/o submitting the previous bios. That causes
> unoptimized data access patterns for blocks which are of BH_Boundary type.

The same is true for extent mappings. And it's not ideal there either,
although it only increases the bio submission latency.


2024-05-13 13:42:58

by Jan Kara

[permalink] [raw]
Subject: Re: [RFCv3 4/7] ext2: Implement seq counter for validating cached iomap

On Thu 25-04-24 18:58:48, Ritesh Harjani (IBM) wrote:
> There is a possibility of the following race with iomap during
> writeback -
>
> write_cache_pages()
> cache extent covering 0..1MB range
> write page at offset 0k
> truncate(file, 4k)
> drops all relevant pages
> frees fs blocks
> pwrite(file, 4k, 4k)
> creates dirty page in the page cache
> writes page at offset 4k to a stale block
>
> This race can happen because iomap_writepages() keeps a cached extent mapping
> within struct iomap. While write_cache_pages() is going over each folio,
> (can cache a large extent range), if a truncate happens in parallel on the
> next folio followed by a buffered write to the same offset within the file,
> this can change the logical-to-physical mapping of the cached iomap.
> That means the cached iomap has now become stale.
>
> This patch implements the seq counter approach for revalidation of stale
> iomap mappings. i_blkseq will get incremented for every block
> allocation/free. Here is what we do -
>
> For ext2 buffered-writes, the block allocation happens at the
> ->write_iter time itself. So at writeback time,
> 1. We first cache the i_blkseq.
> 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
> already allocated.
> 3. Call ext2_get_blocks() the second time with length to be same as
> the no. of blocks we know were already allocated.
> 4. Till now it means, the cached i_blkseq remains valid as no block
> allocation has happened yet.
> This means the next call to ->map_blocks(), we can verify whether the
> i_blkseq has raced with truncate or not. If not, then i_blkseq will
> remain valid.
>
> In case of a hole (could happen with mmaped writes), we only allocate
> 1 block at a time anyways. So even if the i_blkseq value changes right
> after, we anyway need to allocate the next block in subsequent
> ->map_blocks() call.
>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>

A few small comments below.

> @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
> return container_of(inode, struct ext2_inode_info, vfs_inode);
> }
>
> +static inline void ext2_inc_i_blkseq(struct ext2_inode_info *ei)
> +{
> + WRITE_ONCE(ei->i_blkseq, READ_ONCE(ei->i_blkseq) + 1);
> +}
> +

Please add a comment here (and assertion as well) that updates of i_blkseq
are protected by ei->i_truncate_mutex. Reads can race at any moment, so
that's the reason why WRITE_ONCE() is used. You can remove READ_ONCE() here
as it is pointless (we are locked here).
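
A rough sketch of what that could look like (untested; it assumes the
series serializes block allocation/free under ext2's existing
truncate_mutex, which is where the assertion would hold):

static inline void ext2_inc_i_blkseq(struct ext2_inode_info *ei)
{
	/*
	 * Updates are serialized by ei->truncate_mutex; lockless readers
	 * (e.g. the writeback path caching a validity cookie) can race
	 * with this, hence WRITE_ONCE() on the store.
	 */
	lockdep_assert_held(&ei->truncate_mutex);
	WRITE_ONCE(ei->i_blkseq, ei->i_blkseq + 1);
}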

> static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
> struct inode *inode, loff_t offset,
> unsigned len)
> {
> - if (offset >= wpc->iomap.offset &&
> - offset < wpc->iomap.offset + wpc->iomap.length)
> + loff_t maxblocks = (loff_t)INT_MAX;
> + u8 blkbits = inode->i_blkbits;
> + u32 bno;
> + bool new, boundary;
> + int ret;
> +
> + if (ext2_imap_valid(wpc, inode, offset))
> return 0;
>
> - return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> + /*
> + * For ext2 buffered-writes, the block allocation happens at the
> + * ->write_iter time itself. So at writeback time -
> + * 1. We first cache the i_blkseq.
> + * 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
> + * already allocated.
> + * 3. Call ext2_get_blocks() the second time with length to be same as
> + * the no. of blocks we know were already allocated.
> + * 4. Till now it means, the cached i_blkseq remains valid as no block
> + * allocation has happened yet.
> + * This means the next call to ->map_blocks(), we can verify whether the
> + * i_blkseq has raced with truncate or not. If not, then i_blkseq will
> + * remain valid.
> + *
> + * In case of a hole (could happen with mmaped writes), we only allocate
> + * 1 block at a time anyways. So even if the i_blkseq value changes, we
> + * anyway need to allocate the next block in subsequent ->map_blocks()
> + * call.
> + */

I suspect it would be tidier to move this logic into ext2_get_blocks()
itself but I guess it is ok as is for now.

> + wpc->iomap.validity_cookie = READ_ONCE(EXT2_I(inode)->i_blkseq);
> +
> + ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
> + &bno, &new, &boundary, 0);
> + if (ret < 0)
> + return ret;
> + /*
> + * ret can be 0 in case of a hole which is possible for mmaped writes.
> + */
> + ret = ret ? ret : 1;
> + return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
> IOMAP_WRITE, &wpc->iomap, NULL);
> }
>
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 37f7ce56adce..32f5386284d6 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
> #ifdef CONFIG_QUOTA
> memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
> #endif
> -
> + WRITE_ONCE(ei->i_blkseq, 0);
> return &ei->vfs_inode;

No need for write once here. This cannot race with anything...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
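
ext2_imap_valid() is referenced in the diff above but not shown in this
excerpt. A plausible sketch of what such a check does, based on the
description in the commit message (illustrative only, not the actual
patch):

static bool ext2_imap_valid(struct iomap_writepage_ctx *wpc,
			    struct inode *inode, loff_t offset)
{
	/* the offset must fall inside the cached mapping */
	if (offset < wpc->iomap.offset ||
	    offset >= wpc->iomap.offset + wpc->iomap.length)
		return false;
	/*
	 * If the inode's block sequence counter changed since the mapping
	 * was cached, blocks may have been freed or reallocated underneath
	 * us (e.g. by a racing truncate), so the cached mapping is stale.
	 */
	if (wpc->iomap.validity_cookie != READ_ONCE(EXT2_I(inode)->i_blkseq))
		return false;
	return true;
}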