2024-01-02 12:42:42

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 00/25] ext4: use iomap for regular file's buffered IO path and enable large foilo

From: Zhang Yi <[email protected]>

Hello,

This is my second version of the RFC patch series that convert ext4
regular file's buffered IO path to iomap. I've been fixing a lot of bugs
in v1 and rebased it on 6.7 and with Christoph's "map multiple blocks
per ->map_blocks in iomap writeback" series [1].

This series only support ext4 with the default features and mount
options, doesn't support inline_data, bigalloc, dax, fs_verity, fs_crypt
and data=journal mode, ext4 would fall back to buffer_head path
automatically if you enabled these features/options. Although it has
many limitations now, it can satisfy the requirements of common cases
and bring a great performance benefit.

This series have passed kvm-xfstests tests in auto mode, no obvious
issues have been found. I will keep doing stress tests, hope most of the
corner cases had been covered. For the convenience of review, I split
the implementation into small patches, any comments will be helpful.

Changes since v1:
- Introduce seq count for iomap buffered write and writeback to protect
races from extents changes, e.g. truncate, mwrite.
- Always allocate unwritten extents for new blocks, drop dioread_lock
mode, and make no distinctions between dioread_lock and
dioread_nolock.
- Don't add ditry data range to jinode, drop data=ordered mode, and
make no distinctions between data=ordered and data=writeback mode.
- Postpone updating i_disksize to endio.
- Allow splitting extents and use reserved space in endio.
- Instead of reimplement a new delayed mapping helper
ext4_iomap_da_map_blocks() for buffer write, try to reuse
ext4_da_map_blocks().
- Add support for disabling large folio on active inodes.
- Support online defragmentation, make file fall back to buffer_head
and disable large folio in ext4_move_extents().
- Move ext4_nonda_switch() in advance to prevent deadlock in mwrite.
- Add dirty_len and pos trace info to trace_iomap_writepage_map().
- Update patch 1-6 to v2 [2].

Patch 1-6: this is a preparation series, it changes ext4_map_blocks()
and ext4_set_iomap() to recognize delayed only extents, I've send it out
separately[2] because I've done full tests and suppose this can be
reviewed and merged first.

Patch 7-8: these are two minor iomap changes, one is don't update i_size
in zero_range and the other one is for debug in iomap writeback path
that I've discussed whit Christoph [3].

Patch 9-14: this is another preparation series, including some changes
for delayed extents. Firstly, it factor out buffer_head from
ext4_da_map_blocks(), make it to support adding multi-blocks once a
time. Then make unwritten to written extents conversion in endio use to
reserved space, reduce the risk of potential data loss. Finally,
introduce a sequence counter for extent status tree, which is useful
for iomap buffer write and write back.

Patch 15-21: Implement buffered IO iomap path for read, write, mmap,
zero range, truncate and writeback, replace current buffered_head path.
Please look at the following patch for details.

Patch 22-25: Convert to iomap for regular file's buffered IO path
besides inline_data, bigalloc, dax, fs_verity, fs_crypt, and
data=journal mode, and enable large folio. It should be note that
buffered iomap path hasn't support Online defrag yet, so we need fall
back to buffer_head and disable large folio automatically if user call
EXT4_IOC_MOVE_EXT.

About Tests:
- I've test it through kvm-xfstests in auto mode.
- I've done a performance tests below.

Fio tests with psync on my machine with Intel Xeon Gold 6240 CPU
with 400GB system ram, 200GB ramdisk and 1TB nvme ssd disk.

== buffer read ==

buffer head iomap with large folio
type bs IOPS BW(MiB/s) IOPS BW(MiB/s)
----------------------------------------------------
hole 4K 565k 2206 811k 3167
hole 64K 45.1k 2820 78.1k 4879
hole 1M 2744 2744 4890 4891
ramdisk 4K 436k 1703 554k 2163
ramdisk 64K 29.6k 1848 44.0k 2747
ramdisk 1M 1994 1995 2809 2809
nvme 4K 306k 1196 324k 1267
nvme 64K 19.3k 1208 24.3k 1517
nvme 1M 1694 1694 2256 2256

== buffer write ==

buffer head ext4_iomap
type Overwrite Sync Writeback bs IOPS BW IOPS BW
-------------------------------------------------------------
cache N N N 4K 395k 1544 415k 1621
cache N N N 64K 30.8k 1928 80.1k 5005
cache N N N 1M 1963 1963 5641 5642
cache Y N N 4K 423k 1652 443k 1730
cache Y N N 64K 33.0k 2063 80.8k 5051
cache Y N N 1M 2103 2103 5588 5589
ramdisk N N Y 4K 362k 1416 307k 1198
ramdisk N N Y 64K 22.4k 1399 64.8k 4050
ramdisk N N Y 1M 1670 1670 4559 4560
ramdisk N Y N 4K 9830 38.4 13.5k 52.8
ramdisk N Y N 64K 5834 365 10.1k 629
ramdisk N Y N 1M 1011 1011 2064 2064
ramdisk Y N Y 4K 397k 1550 409k 1598
ramdisk Y N Y 64K 29.2k 1827 73.6k 4597
ramdisk Y N Y 1M 1837 1837 4985 4985
ramdisk Y Y N 4K 173k 675 182k 710
ramdisk Y Y N 64K 17.7k 1109 33.7k 2105
ramdisk Y Y N 1M 1128 1129 1790 1791
nvme N N Y 4K 298k 1164 290k 1134
nvme N N Y 64K 21.5k 1343 57.4k 3590
nvme N N Y 1M 1308 1308 3664 3664
nvme N Y N 4K 10.7k 41.8 12.0k 46.9
nvme N Y N 64K 5962 373 8598 537
nvme N Y N 1M 676 677 1417 1418
nvme Y N Y 4K 366k 1430 373k 1456
nvme Y N Y 64K 26.7k 1670 56.8k 3547
nvme Y N Y 1M 1745 1746 3586 3586
nvme Y Y N 4K 59.0k 230 61.2k 239
nvme Y Y N 64K 13.0k 813 21.0k 1311
nvme Y Y N 1M 683 683 1368 1369

TODO
- Keep on doing stress tests and fixing.
- I will rebase and resend my another patch set "ext4: more accurate
metadata reservaion for delalloc mount option[4]", it's useful for
iomap conversion. After this series, I suppose we could totally drop
ext4_nonda_switch() and prevent the risk of data loss caused by
extents splitting.
- Support for more features and mount options in the future.

[1] https://lore.kernel.org/linux-fsdevel/[email protected]/
[2] https://lore.kernel.org/linux-ext4/[email protected]/
[3] https://lore.kernel.org/linux-fsdevel/[email protected]/
[4] https://lore.kernel.org/linux-ext4/[email protected]/

Thanks,
Yi.

Zhang Yi (25):
ext4: refactor ext4_da_map_blocks()
ext4: convert to exclusive lock while inserting delalloc extents
ext4: correct the hole length returned by ext4_map_blocks()
ext4: add a hole extent entry in cache after punch
ext4: make ext4_map_blocks() distinguish delalloc only extent
ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type
iomap: don't increase i_size if it's not a write operation
iomap: add pos and dirty_len into trace_iomap_writepage_map
ext4: allow inserting delalloc extents with multi-blocks
ext4: correct delalloc extent length
ext4: also mark extent as delalloc if it's been unwritten
ext4: factor out bh handles to ext4_da_get_block_prep()
ext4: use reserved metadata blocks when splitting extent in endio
ext4: introduce seq counter for extent entry
ext4: add a new iomap aops for regular file's buffered IO path
ext4: implement buffered read iomap path
ext4: implement buffered write iomap path
ext4: implement writeback iomap path
ext4: implement mmap iomap path
ext4: implement zero_range iomap path
ext4: writeback partial blocks before zero range
ext4: fall back to buffer_head path for defrag
ext4: partially enable iomap for regular file's buffered IO path
filemap: support disable large folios on active inode
ext4: enable large folio for regular file with iomap buffered IO path

fs/ext4/ext4.h | 14 +-
fs/ext4/ext4_jbd2.c | 6 +
fs/ext4/ext4_jbd2.h | 7 +
fs/ext4/extents.c | 145 +++++---
fs/ext4/extents_status.c | 39 ++-
fs/ext4/extents_status.h | 4 +-
fs/ext4/file.c | 19 +-
fs/ext4/ialloc.c | 5 +
fs/ext4/inode.c | 638 +++++++++++++++++++++++++++++-------
fs/ext4/move_extent.c | 35 ++
fs/ext4/page-io.c | 107 ++++++
fs/ext4/super.c | 3 +
fs/iomap/buffered-io.c | 7 +-
fs/iomap/trace.h | 43 ++-
include/linux/pagemap.h | 13 +
include/trace/events/ext4.h | 31 +-
mm/readahead.c | 6 +-
17 files changed, 921 insertions(+), 201 deletions(-)

--
2.39.2



2024-01-02 12:42:51

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 02/25] ext4: convert to exclusive lock while inserting delalloc extents

From: Zhang Yi <[email protected]>

ext4_da_map_blocks() only hold i_data_sem in shared mode and i_rwsem
when inserting delalloc extents, it could be raced by another querying
path of ext4_map_blocks() without i_rwsem, .e.g buffered read path.
Suppose we buffered read a file containing just a hole, and without any
cached extents tree, then it is raced by another delayed buffered write
to the same area or the near area belongs to the same hole, and the new
delalloc extent could be overwritten to a hole extent.

pread() pwrite()
filemap_read_folio()
ext4_mpage_readpages()
ext4_map_blocks()
down_read(i_data_sem)
ext4_ext_determine_hole()
//find hole
ext4_ext_put_gap_in_cache()
ext4_es_find_extent_range()
//no delalloc extent
ext4_da_map_blocks()
down_read(i_data_sem)
ext4_insert_delayed_block()
//insert delalloc extent
ext4_es_insert_extent()
//overwrite delalloc extent to hole

This race could lead to inconsistent delalloc extents tree and
incorrect reserved space counter. Fix this by converting to hold
i_data_sem in exclusive mode when adding a new delalloc extent in
ext4_da_map_blocks().

Cc: [email protected]
Signed-off-by: Zhang Yi <[email protected]>
Suggested-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 25 +++++++++++--------------
1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5b0d3075be12..142c67f5c7fc 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1703,10 +1703,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,

/* Lookup extent status tree firstly */
if (ext4_es_lookup_extent(inode, iblock, NULL, &es)) {
- if (ext4_es_is_hole(&es)) {
- down_read(&EXT4_I(inode)->i_data_sem);
+ if (ext4_es_is_hole(&es))
goto add_delayed;
- }

/*
* Delayed extent could be allocated by fallocate.
@@ -1748,8 +1746,10 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
retval = ext4_ext_map_blocks(NULL, inode, map, 0);
else
retval = ext4_ind_map_blocks(NULL, inode, map, 0);
- if (retval < 0)
- goto out_unlock;
+ if (retval < 0) {
+ up_read(&EXT4_I(inode)->i_data_sem);
+ return retval;
+ }
if (retval > 0) {
unsigned int status;

@@ -1765,24 +1765,21 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
map->m_pblk, status);
- goto out_unlock;
+ up_read(&EXT4_I(inode)->i_data_sem);
+ return retval;
}
+ up_read(&EXT4_I(inode)->i_data_sem);

add_delayed:
- /*
- * XXX: __block_prepare_write() unmaps passed block,
- * is it OK?
- */
+ down_write(&EXT4_I(inode)->i_data_sem);
retval = ext4_insert_delayed_block(inode, map->m_lblk);
+ up_write(&EXT4_I(inode)->i_data_sem);
if (retval)
- goto out_unlock;
+ return retval;

map_bh(bh, inode->i_sb, invalid_block);
set_buffer_new(bh);
set_buffer_delay(bh);
-
-out_unlock:
- up_read((&EXT4_I(inode)->i_data_sem));
return retval;
}

--
2.39.2


2024-01-02 12:43:04

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 04/25] ext4: add a hole extent entry in cache after punch

From: Zhang Yi <[email protected]>

In order to cache hole extents in the extent status tree and keep the
hole length as long as possible, re-add a hole entry to the cache just
after punching a hole.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 142c67f5c7fc..1b5e6409f958 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4000,12 +4000,12 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)

/* If there are blocks to remove, do it */
if (stop_block > first_block) {
+ ext4_lblk_t hole_len = stop_block - first_block;

down_write(&EXT4_I(inode)->i_data_sem);
ext4_discard_preallocations(inode, 0);

- ext4_es_remove_extent(inode, first_block,
- stop_block - first_block);
+ ext4_es_remove_extent(inode, first_block, hole_len);

if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
ret = ext4_ext_remove_space(inode, first_block,
@@ -4014,6 +4014,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
ret = ext4_ind_remove_space(handle, inode, first_block,
stop_block);

+ ext4_es_insert_extent(inode, first_block, hole_len, ~0,
+ EXTENT_STATUS_HOLE);
up_write(&EXT4_I(inode)->i_data_sem);
}
ext4_fc_track_range(handle, inode, first_block, stop_block);
--
2.39.2


2024-01-02 12:43:10

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 05/25] ext4: make ext4_map_blocks() distinguish delalloc only extent

From: Zhang Yi <[email protected]>

Add a new map flag EXT4_MAP_DELAYED to indicate the mapping range is a
delayed allocated only (not unwritten) one, and making
ext4_map_blocks() can distinguish it, no longer mixing it with holes.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/ext4.h | 4 +++-
fs/ext4/extents.c | 5 +++--
fs/ext4/inode.c | 2 ++
3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index a5d784872303..55195909d32f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -252,8 +252,10 @@ struct ext4_allocation_request {
#define EXT4_MAP_MAPPED BIT(BH_Mapped)
#define EXT4_MAP_UNWRITTEN BIT(BH_Unwritten)
#define EXT4_MAP_BOUNDARY BIT(BH_Boundary)
+#define EXT4_MAP_DELAYED BIT(BH_Delay)
#define EXT4_MAP_FLAGS (EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
- EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY)
+ EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\
+ EXT4_MAP_DELAYED)

struct ext4_map_blocks {
ext4_fsblk_t m_pblk;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0892d0568013..fc69f13cf510 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4073,9 +4073,10 @@ static void ext4_ext_determine_hole(struct inode *inode,
} else if (in_range(map->m_lblk, es.es_lblk, es.es_len)) {
/*
* Straddle the beginning of the queried range, it's no
- * longer a hole, adjust the length to the delayed extent's
- * after map->m_lblk.
+ * longer a hole, mark it is a delalloc and adjust the
+ * length to the delayed extent's after map->m_lblk.
*/
+ map->m_flags |= EXT4_MAP_DELAYED;
len = es.es_lblk + es.es_len - map->m_lblk;
goto out;
} else {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1b5e6409f958..c141bf6d8db2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -515,6 +515,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
map->m_len = retval;
} else if (ext4_es_is_delayed(&es) || ext4_es_is_hole(&es)) {
map->m_pblk = 0;
+ map->m_flags |= ext4_es_is_delayed(&es) ?
+ EXT4_MAP_DELAYED : 0;
retval = es.es_len - (map->m_lblk - es.es_lblk);
if (retval > map->m_len)
retval = map->m_len;
--
2.39.2


2024-01-02 12:43:19

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 06/25] ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type

From: Zhang Yi <[email protected]>

Since ext4_map_blocks() can recognize a delayed allocated only extent,
make ext4_set_iomap() can also recognize it, and remove the useless
separate check in ext4_iomap_begin_report().

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 32 +++-----------------------------
1 file changed, 3 insertions(+), 29 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c141bf6d8db2..0458d7f0c059 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3261,6 +3261,9 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
iomap->addr = (u64) map->m_pblk << blkbits;
if (flags & IOMAP_DAX)
iomap->addr += EXT4_SB(inode->i_sb)->s_dax_part_off;
+ } else if (map->m_flags & EXT4_MAP_DELAYED) {
+ iomap->type = IOMAP_DELALLOC;
+ iomap->addr = IOMAP_NULL_ADDR;
} else {
iomap->type = IOMAP_HOLE;
iomap->addr = IOMAP_NULL_ADDR;
@@ -3423,35 +3426,11 @@ const struct iomap_ops ext4_iomap_overwrite_ops = {
.iomap_end = ext4_iomap_end,
};

-static bool ext4_iomap_is_delalloc(struct inode *inode,
- struct ext4_map_blocks *map)
-{
- struct extent_status es;
- ext4_lblk_t offset = 0, end = map->m_lblk + map->m_len - 1;
-
- ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
- map->m_lblk, end, &es);
-
- if (!es.es_len || es.es_lblk > end)
- return false;
-
- if (es.es_lblk > map->m_lblk) {
- map->m_len = es.es_lblk - map->m_lblk;
- return false;
- }
-
- offset = map->m_lblk - es.es_lblk;
- map->m_len = es.es_len - offset;
-
- return true;
-}
-
static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
loff_t length, unsigned int flags,
struct iomap *iomap, struct iomap *srcmap)
{
int ret;
- bool delalloc = false;
struct ext4_map_blocks map;
u8 blkbits = inode->i_blkbits;

@@ -3492,13 +3471,8 @@ static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
ret = ext4_map_blocks(NULL, inode, &map, 0);
if (ret < 0)
return ret;
- if (ret == 0)
- delalloc = ext4_iomap_is_delalloc(inode, &map);
-
set_iomap:
ext4_set_iomap(inode, iomap, &map, offset, length, flags);
- if (delalloc && iomap->type == IOMAP_HOLE)
- iomap->type = IOMAP_DELALLOC;

return 0;
}
--
2.39.2


2024-01-02 12:43:24

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 07/25] iomap: don't increase i_size if it's not a write operation

From: Zhang Yi <[email protected]>

Increase i_size in iomap_zero_range() looks not needed, the caller
should handle it. Especially, when truncate partial block, we should
not increase i_size beyond the new EOF here. It dosn't affect xfs and
gfs2 now because they reset the new file size after zero out, it doesn't
matter that a brief increase in i_size. But it will affect ext4 because
it set file size before truncate, so avoid increasing if it's not a
write path.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/iomap/buffered-io.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index e0c9cede82ee..293ba00e4bc0 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -887,6 +887,7 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
{
const struct iomap *srcmap = iomap_iter_srcmap(iter);
loff_t old_size = iter->inode->i_size;
+ bool update_size = iter->flags & IOMAP_WRITE;
size_t ret;

if (srcmap->type == IOMAP_INLINE) {
@@ -903,13 +904,13 @@ static size_t iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
* cache. It's up to the file system to write the updated size to disk,
* preferably after I/O completion so that no stale data is exposed.
*/
- if (pos + ret > old_size) {
+ if (update_size && pos + ret > old_size) {
i_size_write(iter->inode, pos + ret);
iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
}
__iomap_put_folio(iter, pos, ret, folio);

- if (old_size < pos)
+ if (update_size && old_size < pos)
pagecache_isize_extended(iter->inode, old_size, pos);
if (ret < len)
iomap_write_failed(iter->inode, pos + ret, len - ret);
--
2.39.2


2024-01-02 12:43:36

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 10/25] ext4: correct delalloc extent length

From: Zhang Yi <[email protected]>

When adding a delalloc extent in ext4_da_map_blocks(), the extent
length can be incorrect if we found a cached hole extent entry and the
hole length is smaller than the origin map->m_len. Fortunately, it
should not be able to trigger any issue now because the map->m_len is
always 1. Fix this by adjust length before inserting delalloc extent.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bc29c2e92750..44033828db44 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1712,6 +1712,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,

/* Lookup extent status tree firstly */
if (ext4_es_lookup_extent(inode, iblock, NULL, &es)) {
+ retval = es.es_len - (iblock - es.es_lblk);
+ if (retval > map->m_len)
+ retval = map->m_len;
+ map->m_len = retval;
+
if (ext4_es_is_hole(&es))
goto add_delayed;

@@ -1727,10 +1732,6 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
}

map->m_pblk = ext4_es_pblock(&es) + iblock - es.es_lblk;
- retval = es.es_len - (iblock - es.es_lblk);
- if (retval > map->m_len)
- retval = map->m_len;
- map->m_len = retval;
if (ext4_es_is_written(&es))
map->m_flags |= EXT4_MAP_MAPPED;
else if (ext4_es_is_unwritten(&es))
--
2.39.2


2024-01-02 12:43:42

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 09/25] ext4: allow inserting delalloc extents with multi-blocks

From: Zhang Yi <[email protected]>

Introduce a new helper ext4_insert_delayed_blocks() to replace
ext4_insert_delayed_block() that we could add multi-delayed blocks into
the extent status tree once a time. But for now, it doesn't support
bigalloc feature yet. Also rename ext4_es_insert_delayed_block() to
ext4_es_insert_delayed_extent(), which matches the name style of other
ext4_es_{insert|remove}_extent() functions.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/extents_status.c | 26 ++++++++++++++-----------
fs/ext4/extents_status.h | 4 ++--
fs/ext4/inode.c | 39 ++++++++++++++++++++++---------------
include/trace/events/ext4.h | 12 +++++++-----
4 files changed, 47 insertions(+), 34 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 4a00e2f019d9..324a6b0a6283 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -2052,19 +2052,21 @@ bool ext4_is_pending(struct inode *inode, ext4_lblk_t lblk)
}

/*
- * ext4_es_insert_delayed_block - adds a delayed block to the extents status
- * tree, adding a pending reservation where
- * needed
+ * ext4_es_insert_delayed_extent - adds delayed blocks to the extents status
+ * tree, adding a pending reservation where
+ * needed
*
* @inode - file containing the newly added block
- * @lblk - logical block to be added
+ * @lblk - first logical block to be added
+ * @len - length of blocks to be added
* @allocated - indicates whether a physical cluster has been allocated for
* the logical cluster that contains the block
*/
-void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
- bool allocated)
+void ext4_es_insert_delayed_extent(struct inode *inode, ext4_lblk_t lblk,
+ unsigned int len, bool allocated)
{
struct extent_status newes;
+ ext4_lblk_t end = lblk + len - 1;
int err1 = 0, err2 = 0, err3 = 0;
struct extent_status *es1 = NULL;
struct extent_status *es2 = NULL;
@@ -2073,13 +2075,15 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
return;

- es_debug("add [%u/1) delayed to extent status tree of inode %lu\n",
- lblk, inode->i_ino);
+ es_debug("add [%u/%u) delayed to extent status tree of inode %lu\n",
+ lblk, len, inode->i_ino);
+ if (!len)
+ return;

newes.es_lblk = lblk;
- newes.es_len = 1;
+ newes.es_len = len;
ext4_es_store_pblock_status(&newes, ~0, EXTENT_STATUS_DELAYED);
- trace_ext4_es_insert_delayed_block(inode, &newes, allocated);
+ trace_ext4_es_insert_delayed_extent(inode, &newes, allocated);

ext4_es_insert_extent_check(inode, &newes);

@@ -2092,7 +2096,7 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
pr = __alloc_pending(true);
write_lock(&EXT4_I(inode)->i_es_lock);

- err1 = __es_remove_extent(inode, lblk, lblk, NULL, es1);
+ err1 = __es_remove_extent(inode, lblk, end, NULL, es1);
if (err1 != 0)
goto error;
/* Free preallocated extent if it didn't get used. */
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index d9847a4a25db..24493e682ab4 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -249,8 +249,8 @@ extern void ext4_exit_pending(void);
extern void ext4_init_pending_tree(struct ext4_pending_tree *tree);
extern void ext4_remove_pending(struct inode *inode, ext4_lblk_t lblk);
extern bool ext4_is_pending(struct inode *inode, ext4_lblk_t lblk);
-extern void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
- bool allocated);
+extern void ext4_es_insert_delayed_extent(struct inode *inode, ext4_lblk_t lblk,
+ unsigned int len, bool allocated);
extern unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk,
ext4_lblk_t len);
extern void ext4_clear_inode_es(struct inode *inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0458d7f0c059..bc29c2e92750 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1452,7 +1452,7 @@ static int ext4_journalled_write_end(struct file *file,
/*
* Reserve space for a single cluster
*/
-static int ext4_da_reserve_space(struct inode *inode)
+static int ext4_da_reserve_space(struct inode *inode, unsigned int len)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
struct ext4_inode_info *ei = EXT4_I(inode);
@@ -1463,18 +1463,18 @@ static int ext4_da_reserve_space(struct inode *inode)
* us from metadata over-estimation, though we may go over by
* a small amount in the end. Here we just reserve for data.
*/
- ret = dquot_reserve_block(inode, EXT4_C2B(sbi, 1));
+ ret = dquot_reserve_block(inode, EXT4_C2B(sbi, len));
if (ret)
return ret;

spin_lock(&ei->i_block_reservation_lock);
- if (ext4_claim_free_clusters(sbi, 1, 0)) {
+ if (ext4_claim_free_clusters(sbi, len, 0)) {
spin_unlock(&ei->i_block_reservation_lock);
- dquot_release_reservation_block(inode, EXT4_C2B(sbi, 1));
+ dquot_release_reservation_block(inode, EXT4_C2B(sbi, len));
return -ENOSPC;
}
- ei->i_reserved_data_blocks++;
- trace_ext4_da_reserve_space(inode);
+ ei->i_reserved_data_blocks += len;
+ trace_ext4_da_reserve_space(inode, len);
spin_unlock(&ei->i_block_reservation_lock);

return 0; /* success */
@@ -1620,18 +1620,21 @@ static void ext4_print_free_blocks(struct inode *inode)
return;
}

+
/*
- * ext4_insert_delayed_block - adds a delayed block to the extents status
- * tree, incrementing the reserved cluster/block
- * count or making a pending reservation
- * where needed
+ * ext4_insert_delayed_blocks - adds multi-delayed blocks to the extents
+ * status tree, incrementing the reserved
+ * cluster/block count or making a pending
+ * reservation where needed.
*
* @inode - file containing the newly added block
- * @lblk - logical block to be added
+ * @lblk - start logical block to be added
+ * @len - length of blocks to be added
*
* Returns 0 on success, negative error code on failure.
*/
-static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
+static int ext4_insert_delayed_blocks(struct inode *inode, ext4_lblk_t lblk,
+ ext4_lblk_t len)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
int ret;
@@ -1649,10 +1652,14 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
* extents status tree doesn't get a match.
*/
if (sbi->s_cluster_ratio == 1) {
- ret = ext4_da_reserve_space(inode);
+ ret = ext4_da_reserve_space(inode, len);
if (ret != 0) /* ENOSPC */
return ret;
} else { /* bigalloc */
+ /* TODO: support bigalloc for multi-blocks. */
+ if (len != 1)
+ return -EOPNOTSUPP;
+
if (!ext4_es_scan_clu(inode, &ext4_es_is_delonly, lblk)) {
if (!ext4_es_scan_clu(inode,
&ext4_es_is_mapped, lblk)) {
@@ -1661,7 +1668,7 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
if (ret < 0)
return ret;
if (ret == 0) {
- ret = ext4_da_reserve_space(inode);
+ ret = ext4_da_reserve_space(inode, 1);
if (ret != 0) /* ENOSPC */
return ret;
} else {
@@ -1673,7 +1680,7 @@ static int ext4_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk)
}
}

- ext4_es_insert_delayed_block(inode, lblk, allocated);
+ ext4_es_insert_delayed_extent(inode, lblk, len, allocated);
return 0;
}

@@ -1774,7 +1781,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,

add_delayed:
down_write(&EXT4_I(inode)->i_data_sem);
- retval = ext4_insert_delayed_block(inode, map->m_lblk);
+ retval = ext4_insert_delayed_blocks(inode, map->m_lblk, map->m_len);
up_write(&EXT4_I(inode)->i_data_sem);
if (retval)
return retval;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 65029dfb92fb..53aa7a7fb3be 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1249,14 +1249,15 @@ TRACE_EVENT(ext4_da_update_reserve_space,
);

TRACE_EVENT(ext4_da_reserve_space,
- TP_PROTO(struct inode *inode),
+ TP_PROTO(struct inode *inode, int reserved_blocks),

- TP_ARGS(inode),
+ TP_ARGS(inode, reserved_blocks),

TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
__field( __u64, i_blocks )
+ __field( int, reserved_blocks )
__field( int, reserved_data_blocks )
__field( __u16, mode )
),
@@ -1265,16 +1266,17 @@ TRACE_EVENT(ext4_da_reserve_space,
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
__entry->i_blocks = inode->i_blocks;
+ __entry->reserved_blocks = reserved_blocks;
__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
__entry->mode = inode->i_mode;
),

- TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu "
+ TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu reserved_blocks %u "
"reserved_data_blocks %d",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
__entry->mode, __entry->i_blocks,
- __entry->reserved_data_blocks)
+ __entry->reserved_blocks, __entry->reserved_data_blocks)
);

TRACE_EVENT(ext4_da_release_space,
@@ -2481,7 +2483,7 @@ TRACE_EVENT(ext4_es_shrink,
__entry->scan_time, __entry->nr_skipped, __entry->retried)
);

-TRACE_EVENT(ext4_es_insert_delayed_block,
+TRACE_EVENT(ext4_es_insert_delayed_extent,
TP_PROTO(struct inode *inode, struct extent_status *es,
bool allocated),

--
2.39.2


2024-01-02 12:43:55

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 12/25] ext4: factor out bh handles to ext4_da_get_block_prep()

From: Zhang Yi <[email protected]>

Factor buffer_head related handles out from ext4_da_map_blocks() to
ext4_da_get_block_prep(), and make ext4_da_map_blocks() always return 0
if success, we can distinguish delalloc map through EXT4_MAP_DELAYED
flag.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 75 ++++++++++++++++++++++++-------------------------
1 file changed, 36 insertions(+), 39 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9ac9bb548a4c..9f9b1fce8da8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1690,48 +1690,37 @@ static int ext4_insert_delayed_blocks(struct inode *inode, ext4_lblk_t lblk,
* time. This function looks up the requested blocks and sets the
* buffer delay bit under the protection of i_data_sem.
*/
-static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
- struct ext4_map_blocks *map,
- struct buffer_head *bh)
+static int ext4_da_map_blocks(struct inode *inode, struct ext4_map_blocks *map)
{
struct extent_status es;
int retval;
- sector_t invalid_block = ~((sector_t) 0xffff);
#ifdef ES_AGGRESSIVE_TEST
struct ext4_map_blocks orig_map;

memcpy(&orig_map, map, sizeof(*map));
#endif

- if (invalid_block < ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es))
- invalid_block = ~0;
-
map->m_flags = 0;
ext_debug(inode, "max_blocks %u, logical block %lu\n", map->m_len,
(unsigned long) map->m_lblk);

/* Lookup extent status tree firstly */
- if (ext4_es_lookup_extent(inode, iblock, NULL, &es)) {
- retval = es.es_len - (iblock - es.es_lblk);
- if (retval > map->m_len)
- retval = map->m_len;
- map->m_len = retval;
-
- if (ext4_es_is_hole(&es))
- goto add_delayed;
+ if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
+ retval = es.es_len - (map->m_lblk - es.es_lblk);
+ map->m_len = min_t(unsigned int, map->m_len, retval);

/*
* Delayed extent could be allocated by fallocate.
* So we need to check it.
*/
- if (ext4_es_is_delayed(&es) && !ext4_es_is_unwritten(&es)) {
- map_bh(bh, inode->i_sb, invalid_block);
- set_buffer_new(bh);
- set_buffer_delay(bh);
+ if (ext4_es_is_delonly(&es)) {
+ map->m_flags |= EXT4_MAP_DELAYED;
return 0;
}
+ if (ext4_es_is_hole(&es))
+ goto add_delayed;

- map->m_pblk = ext4_es_pblock(&es) + iblock - es.es_lblk;
+ map->m_pblk = ext4_es_pblock(&es) + map->m_lblk - es.es_lblk;
if (ext4_es_is_written(&es))
map->m_flags |= EXT4_MAP_MAPPED;
else if (ext4_es_is_unwritten(&es))
@@ -1743,7 +1732,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
ext4_map_blocks_es_recheck(NULL, inode, map, &orig_map, 0);
#endif
if (ext4_es_is_delayed(&es) || ext4_es_is_written(&es))
- return retval;
+ return 0;

down_read(&EXT4_I(inode)->i_data_sem);
goto insert_extent;
@@ -1775,6 +1764,7 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
WARN_ON(1);
}
insert_extent:
+ retval = 0;
status = map->m_flags & EXT4_MAP_UNWRITTEN ?
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
if (status == EXTENT_STATUS_UNWRITTEN)
@@ -1788,14 +1778,10 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,

add_delayed:
down_write(&EXT4_I(inode)->i_data_sem);
+ map->m_flags |= EXT4_MAP_DELAYED;
retval = ext4_insert_delayed_blocks(inode, map->m_lblk, map->m_len);
up_write(&EXT4_I(inode)->i_data_sem);
- if (retval)
- return retval;

- map_bh(bh, inode->i_sb, invalid_block);
- set_buffer_new(bh);
- set_buffer_delay(bh);
return retval;
}

@@ -1815,11 +1801,15 @@ int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create)
{
struct ext4_map_blocks map;
+ sector_t invalid_block = ~((sector_t) 0xffff);
int ret = 0;

BUG_ON(create == 0);
BUG_ON(bh->b_size != inode->i_sb->s_blocksize);

+ if (invalid_block < ext4_blocks_count(EXT4_SB(inode->i_sb)->s_es))
+ invalid_block = ~0;
+
map.m_lblk = iblock;
map.m_len = 1;

@@ -1828,22 +1818,29 @@ int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
* preallocated blocks are unmapped but should treated
* the same as allocated blocks.
*/
- ret = ext4_da_map_blocks(inode, iblock, &map, bh);
- if (ret <= 0)
+ ret = ext4_da_map_blocks(inode, &map);
+ if (ret < 0)
return ret;

- map_bh(bh, inode->i_sb, map.m_pblk);
- ext4_update_bh_state(bh, map.m_flags);
-
- if (buffer_unwritten(bh)) {
- /* A delayed write to unwritten bh should be marked
- * new and mapped. Mapped ensures that we don't do
- * get_block multiple times when we write to the same
- * offset and new ensures that we do proper zero out
- * for partial write.
- */
+ if (map.m_flags & EXT4_MAP_DELAYED) {
+ map_bh(bh, inode->i_sb, invalid_block);
set_buffer_new(bh);
- set_buffer_mapped(bh);
+ set_buffer_delay(bh);
+ } else {
+ map_bh(bh, inode->i_sb, map.m_pblk);
+ ext4_update_bh_state(bh, map.m_flags);
+
+ if (buffer_unwritten(bh)) {
+ /*
+ * A delayed write to unwritten bh should be marked
+ * new and mapped. Mapped ensures that we don't do
+ * get_block multiple times when we write to the same
+ * offset and new ensures that we do proper zero out
+ * for partial write.
+ */
+ set_buffer_new(bh);
+ set_buffer_mapped(bh);
+ }
}
return 0;
}
--
2.39.2


2024-01-02 12:44:02

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 13/25] ext4: use reserved metadata blocks when splitting extent in endio

From: Zhang Yi <[email protected]>

Now ext4 only reserved space for delalloc for data blocks in
ext4_da_reserve_space(), not reserve space for meta blocks when
allocating data blocks. Besides, if we enable dioread_nolock mount
option, it also not reserve meta blocks for the unwritten extents to
written extents conversion. In order to prevent data loss due to failed
to allocate meta blocks when writing data back, we reserve 2% space or
4096 blocks for meta data, and use EXT4_GET_BLOCKS_PRE_IO to do the
potential split in advance. But all these two methods are just best
efforts, if it's really running out of sapce, there is no difference
between splitting extent when writing data back and when I/Os are
completed, both will result in data loss.

The best way is to reserve enough space for metadata. Before that, we
can at least make sure that things will not get worse if we postpone
splitting extent in endio. Plus, after converting regular file's
buffered write path from buffer_head to iomap, splitting extents in
endio becomes mandatory. Now, in preparation for the future conversion,
let's use reserved sapce in endio too.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/extents.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index fc69f13cf510..254f8b85653d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3722,7 +3722,8 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
(unsigned long long)map->m_lblk, map->m_len);
#endif
err = ext4_split_convert_extents(handle, inode, map, ppath,
- EXT4_GET_BLOCKS_CONVERT);
+ EXT4_GET_BLOCKS_CONVERT |
+ EXT4_GET_BLOCKS_METADATA_NOFAIL);
if (err < 0)
return err;
path = ext4_find_extent(inode, map->m_lblk, ppath, 0);
--
2.39.2


2024-01-02 12:44:08

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 14/25] ext4: introduce seq counter for extent entry

From: Zhang Yi <[email protected]>

Add a modify counter for extents to indicate the version of one inode's
extent status, increase it once extent changes, it will be used for
converting regular file's buffered write path to iomap.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/extents_status.c | 13 ++++++++++++-
fs/ext4/super.c | 1 +
include/trace/events/ext4.h | 19 +++++++++++++------
4 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 55195909d32f..287284a3f128 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1115,6 +1115,7 @@ struct ext4_inode_info {
ext4_lblk_t i_es_shrink_lblk; /* Offset where we start searching for
extents to shrink. Protected by
i_es_lock */
+ unsigned int i_es_seq; /* modify counter for extents */

/* ialloc */
ext4_group_t i_last_alloc_group;
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 324a6b0a6283..b0a4aa60b311 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -204,6 +204,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
return es->es_lblk + es->es_len - 1;
}

+static inline void ext4_es_inc_seq(struct inode *inode)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
+}
+
/*
* search through the tree for an delayed extent with a given offset. If
* it can't be found, try to find next extent.
@@ -876,6 +883,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
WARN_ON(1);
}

+ ext4_es_inc_seq(inode);
newes.es_lblk = lblk;
newes.es_len = len;
ext4_es_store_pblock_status(&newes, pblk, status);
@@ -1503,13 +1511,15 @@ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
return;

- trace_ext4_es_remove_extent(inode, lblk, len);
es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
lblk, len, inode->i_ino);

if (!len)
return;

+ ext4_es_inc_seq(inode);
+ trace_ext4_es_remove_extent(inode, lblk, len);
+
end = lblk + len - 1;
BUG_ON(end < lblk);

@@ -2080,6 +2090,7 @@ void ext4_es_insert_delayed_extent(struct inode *inode, ext4_lblk_t lblk,
if (!len)
return;

+ ext4_es_inc_seq(inode);
newes.es_lblk = lblk;
newes.es_len = len;
ext4_es_store_pblock_status(&newes, ~0, EXTENT_STATUS_DELAYED);
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index c5fcf377ab1f..d4afaae8ab38 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1421,6 +1421,7 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
ei->i_es_all_nr = 0;
ei->i_es_shk_nr = 0;
ei->i_es_shrink_lblk = 0;
+ ei->i_es_seq = 0;
ei->i_reserved_data_blocks = 0;
spin_lock_init(&(ei->i_block_reservation_lock));
ext4_init_pending_tree(&ei->i_pending_tree);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 53aa7a7fb3be..6ae504c43de7 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -2186,6 +2186,7 @@ DECLARE_EVENT_CLASS(ext4__es_extent,
__field( ext4_lblk_t, len )
__field( ext4_fsblk_t, pblk )
__field( char, status )
+ __field( unsigned int, seq )
),

TP_fast_assign(
@@ -2195,13 +2196,15 @@ DECLARE_EVENT_CLASS(ext4__es_extent,
__entry->len = es->es_len;
__entry->pblk = ext4_es_show_pblock(es);
__entry->status = ext4_es_status(es);
+ __entry->seq = EXT4_I(inode)->i_es_seq;
),

- TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s",
+ TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s seq %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
__entry->lblk, __entry->len,
- __entry->pblk, show_extent_status(__entry->status))
+ __entry->pblk, show_extent_status(__entry->status),
+ __entry->seq)
);

DEFINE_EVENT(ext4__es_extent, ext4_es_insert_extent,
@@ -2226,6 +2229,7 @@ TRACE_EVENT(ext4_es_remove_extent,
__field( ino_t, ino )
__field( loff_t, lblk )
__field( loff_t, len )
+ __field( unsigned int, seq )
),

TP_fast_assign(
@@ -2233,12 +2237,13 @@ TRACE_EVENT(ext4_es_remove_extent,
__entry->ino = inode->i_ino;
__entry->lblk = lblk;
__entry->len = len;
+ __entry->seq = EXT4_I(inode)->i_es_seq;
),

- TP_printk("dev %d,%d ino %lu es [%lld/%lld)",
+ TP_printk("dev %d,%d ino %lu es [%lld/%lld) seq %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
- __entry->lblk, __entry->len)
+ __entry->lblk, __entry->len, __entry->seq)
);

TRACE_EVENT(ext4_es_find_extent_range_enter,
@@ -2497,6 +2502,7 @@ TRACE_EVENT(ext4_es_insert_delayed_extent,
__field( ext4_fsblk_t, pblk )
__field( char, status )
__field( bool, allocated )
+ __field( unsigned int, seq )
),

TP_fast_assign(
@@ -2507,15 +2513,16 @@ TRACE_EVENT(ext4_es_insert_delayed_extent,
__entry->pblk = ext4_es_show_pblock(es);
__entry->status = ext4_es_status(es);
__entry->allocated = allocated;
+ __entry->seq = EXT4_I(inode)->i_es_seq;
),

TP_printk("dev %d,%d ino %lu es [%u/%u) mapped %llu status %s "
- "allocated %d",
+ "allocated %d seq %u",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
__entry->lblk, __entry->len,
__entry->pblk, show_extent_status(__entry->status),
- __entry->allocated)
+ __entry->allocated, __entry->seq)
);

/* fsmap traces */
--
2.39.2


2024-01-02 12:44:15

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 15/25] ext4: add a new iomap aops for regular file's buffered IO path

From: Zhang Yi <[email protected]>

Introduce a new iomap address space operations ext4_iomap_aops to
support regular file's buffered IO path and add an inode state flag
EXT4_STATE_BUFFERED_IOMAP to indicate that one inode use the iomap
path. Most of their callbacks should be able to use general
implementation, the left over read_folio, readahead and writepages
should be implemented later.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/inode.c | 33 +++++++++++++++++++++++++++++++++
2 files changed, 34 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 287284a3f128..3461cb3ff524 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1913,6 +1913,7 @@ enum {
EXT4_STATE_VERITY_IN_PROGRESS, /* building fs-verity Merkle tree */
EXT4_STATE_FC_COMMITTING, /* Fast commit ongoing */
EXT4_STATE_ORPHAN_FILE, /* Inode orphaned in orphan file */
+ EXT4_STATE_BUFFERED_IOMAP, /* Inode use iomap for buffered IO */
};

#define EXT4_INODE_BIT_FNS(name, field, offset) \
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f9b1fce8da8..6b5cd0dab1ad 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3492,6 +3492,22 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};

+static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
+{
+ return 0;
+}
+
+static void ext4_iomap_readahead(struct readahead_control *rac)
+{
+
+}
+
+static int ext4_iomap_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ return 0;
+}
+
/*
* For data=journal mode, folio should be marked dirty only when it was
* writeably mapped. When that happens, it was already attached to the
@@ -3581,6 +3597,21 @@ static const struct address_space_operations ext4_da_aops = {
.swap_activate = ext4_iomap_swap_activate,
};

+static const struct address_space_operations ext4_iomap_aops = {
+ .read_folio = ext4_iomap_read_folio,
+ .readahead = ext4_iomap_readahead,
+ .writepages = ext4_iomap_writepages,
+ .dirty_folio = iomap_dirty_folio,
+ .bmap = ext4_bmap,
+ .invalidate_folio = iomap_invalidate_folio,
+ .release_folio = iomap_release_folio,
+ .direct_IO = noop_direct_IO,
+ .migrate_folio = filemap_migrate_folio,
+ .is_partially_uptodate = iomap_is_partially_uptodate,
+ .error_remove_page = generic_error_remove_page,
+ .swap_activate = ext4_iomap_swap_activate,
+};
+
static const struct address_space_operations ext4_dax_aops = {
.writepages = ext4_dax_writepages,
.direct_IO = noop_direct_IO,
@@ -3603,6 +3634,8 @@ void ext4_set_aops(struct inode *inode)
}
if (IS_DAX(inode))
inode->i_mapping->a_ops = &ext4_dax_aops;
+ else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP))
+ inode->i_mapping->a_ops = &ext4_iomap_aops;
else if (test_opt(inode->i_sb, DELALLOC))
inode->i_mapping->a_ops = &ext4_da_aops;
else
--
2.39.2


2024-01-02 12:44:20

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 16/25] ext4: implement buffered read iomap path

From: Zhang Yi <[email protected]>

Add ext4_iomap_buffered_io_begin() to query map blocks, and use
ext4_set_iomap() to convert ext4 map to iomap, after that we can
convert buffered read path to iomap simply.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 36 ++++++++++++++++++++++++++++++++++--
1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6b5cd0dab1ad..2044b322dfd8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3492,14 +3492,46 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};

-static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
+static int ext4_iomap_buffered_io_begin(struct inode *inode, loff_t offset,
+ loff_t length, unsigned int iomap_flags,
+ struct iomap *iomap, struct iomap *srcmap)
{
+ int ret;
+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
+
+ if (unlikely(ext4_forced_shutdown(inode->i_sb)))
+ return -EIO;
+ if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+ return -EINVAL;
+ if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+ return -ERANGE;
+
+ /* Calculate the first and last logical blocks respectively. */
+ map.m_lblk = offset >> blkbits;
+ map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+ EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret < 0)
+ return ret;
+
+ ext4_set_iomap(inode, iomap, &map, offset, length, iomap_flags);
return 0;
}

-static void ext4_iomap_readahead(struct readahead_control *rac)
+const struct iomap_ops ext4_iomap_buffered_read_ops = {
+ .iomap_begin = ext4_iomap_buffered_io_begin,
+};
+
+static int ext4_iomap_read_folio(struct file *file, struct folio *folio)
{
+ return iomap_read_folio(folio, &ext4_iomap_buffered_read_ops);
+}

+static void ext4_iomap_readahead(struct readahead_control *rac)
+{
+ iomap_readahead(rac, &ext4_iomap_buffered_read_ops);
}

static int ext4_iomap_writepages(struct address_space *mapping,
--
2.39.2


2024-01-02 12:44:40

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 18/25] ext4: implement writeback iomap path

From: Zhang Yi <[email protected]>

Implement buffered writeback iomap path, include map_blocks() and
prepare_ioend() callbacks in iomap_writeback_ops and the corresponding
end io process. Add ext4_iomap_map_blocks() to query mapping status,
start journal handle and allocate new blocks if it's not allocated. Add
ext4_iomap_prepare_ioend() to register the end io handler of converting
unwritten extents to mapped extents.

In order to reduce the number of map times, we map an entire delayed
extent once a time, it means that we could map more blocks than we
expected, so we have to calculate the length through wbc->range_end
carefully. Besides, we won't be able to avoid splitting extents in
endio, so we would allow that because we have been using reserved blocks
in endio, it doesn't bring more risks of data loss than splitting before
submitting IO.

Since we always allocate unwritten extents and keep them until data have
been written to disk, we don't need to write back the data before the
metadata is updated to disk, there is no risk of exposing stale data, so
the data=ordered journal mode becomes useless. We don't need to attach
data to the jinode, and the journal thread won't have to write any
dependent data any more. In the meantime, we don't have to reserve
journal credits for the extent status conversion and we also could
postpone i_disksize updating into endio.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/ext4.h | 4 ++
fs/ext4/ext4_jbd2.c | 6 ++
fs/ext4/extents.c | 23 ++++---
fs/ext4/inode.c | 156 +++++++++++++++++++++++++++++++++++++++++---
fs/ext4/page-io.c | 107 ++++++++++++++++++++++++++++++
fs/ext4/super.c | 2 +
6 files changed, 280 insertions(+), 18 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 03cdcf3d86a5..eaf29bade606 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1147,6 +1147,8 @@ struct ext4_inode_info {
*/
struct list_head i_rsv_conversion_list;
struct work_struct i_rsv_conversion_work;
+ struct list_head i_iomap_ioend_list;
+ struct work_struct i_iomap_ioend_work;
atomic_t i_unwritten; /* Nr. of inflight conversions pending */

spinlock_t i_block_reservation_lock;
@@ -3755,6 +3757,8 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *page,
size_t len);
extern struct ext4_io_end_vec *ext4_alloc_io_end_vec(ext4_io_end_t *io_end);
extern struct ext4_io_end_vec *ext4_last_io_end_vec(ext4_io_end_t *io_end);
+extern void ext4_iomap_end_io(struct work_struct *work);
+extern void ext4_iomap_end_bio(struct bio *bio);

/* mmp.c */
extern int ext4_multi_mount_protect(struct super_block *, ext4_fsblk_t);
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index d1a2e6624401..94c8073b49e7 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -11,6 +11,12 @@ int ext4_inode_journal_mode(struct inode *inode)
{
if (EXT4_JOURNAL(inode) == NULL)
return EXT4_INODE_WRITEBACK_DATA_MODE; /* writeback */
+ /*
+ * Ordered mode is no longer needed for the inode that use the
+ * iomap path, always use writeback mode.
+ */
+ if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP))
+ return EXT4_INODE_WRITEBACK_DATA_MODE; /* writeback */
/* We do not support data journalling with delayed allocation */
if (!S_ISREG(inode->i_mode) ||
ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE) ||
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 254f8b85653d..67ff75108cd1 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3708,18 +3708,23 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
ext_debug(inode, "logical block %llu, max_blocks %u\n",
(unsigned long long)ee_block, ee_len);

- /* If extent is larger than requested it is a clear sign that we still
- * have some extent state machine issues left. So extent_split is still
- * required.
- * TODO: Once all related issues will be fixed this situation should be
- * illegal.
+ /*
+ * For the inodes that use the buffered iomap path need to split
+ * extents in endio, other inodes not.
+ *
+ * TODO: Reserve enough sapce for splitting extents, always split
+ * extents here, and totally remove this warning.
*/
if (ee_block != map->m_lblk || ee_len > map->m_len) {
#ifdef CONFIG_EXT4_DEBUG
- ext4_warning(inode->i_sb, "Inode (%ld) finished: extent logical block %llu,"
- " len %u; IO logical block %llu, len %u",
- inode->i_ino, (unsigned long long)ee_block, ee_len,
- (unsigned long long)map->m_lblk, map->m_len);
+ if (!ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
+ ext4_warning(inode->i_sb,
+ "Inode (%ld) finished: extent logical block %llu, "
+ "len %u; IO logical block %llu, len %u",
+ inode->i_ino, (unsigned long long)ee_block,
+ ee_len, (unsigned long long)map->m_lblk,
+ map->m_len);
+ }
#endif
err = ext4_split_convert_extents(handle, inode, map, ppath,
EXT4_GET_BLOCKS_CONVERT |
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5512f38a1a9d..c22b49521df2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -43,6 +43,7 @@
#include <linux/iversion.h>

#include "ext4_jbd2.h"
+#include "ext4_extents.h"
#include "xattr.h"
#include "acl.h"
#include "truncate.h"
@@ -2139,10 +2140,10 @@ static int mpage_map_and_submit_buffers(struct mpage_da_data *mpd)
return err;
}

-static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
+static int mpage_map_one_extent(handle_t *handle, struct inode *inode,
+ struct ext4_map_blocks *map,
+ struct ext4_io_submit *io)
{
- struct inode *inode = mpd->inode;
- struct ext4_map_blocks *map = &mpd->map;
int get_blocks_flags;
int err, dioread_nolock;

@@ -2174,13 +2175,13 @@ static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
if (err < 0)
return err;
- if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
- if (!mpd->io_submit.io_end->handle &&
+ if (io && dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
+ if (!io->io_end->handle &&
ext4_handle_valid(handle)) {
- mpd->io_submit.io_end->handle = handle->h_rsv_handle;
+ io->io_end->handle = handle->h_rsv_handle;
handle->h_rsv_handle = NULL;
}
- ext4_set_io_unwritten_flag(inode, mpd->io_submit.io_end);
+ ext4_set_io_unwritten_flag(inode, io->io_end);
}

BUG_ON(map->m_len == 0);
@@ -2224,7 +2225,7 @@ static int mpage_map_and_submit_extent(handle_t *handle,
return PTR_ERR(io_end_vec);
io_end_vec->offset = ((loff_t)map->m_lblk) << inode->i_blkbits;
do {
- err = mpage_map_one_extent(handle, mpd);
+ err = mpage_map_one_extent(handle, inode, map, &mpd->io_submit);
if (err < 0) {
struct super_block *sb = inode->i_sb;

@@ -3690,10 +3691,147 @@ static void ext4_iomap_readahead(struct readahead_control *rac)
iomap_readahead(rac, &ext4_iomap_buffered_read_ops);
}

+struct ext4_writeback_ctx {
+ struct iomap_writepage_ctx ctx;
+ struct writeback_control *wbc;
+ unsigned int data_seq;
+};
+
+static int ext4_iomap_map_blocks(struct iomap_writepage_ctx *wpc,
+ struct inode *inode, loff_t offset,
+ unsigned int dirty_len)
+{
+ struct ext4_writeback_ctx *ewpc =
+ container_of(wpc, struct ext4_writeback_ctx, ctx);
+ struct super_block *sb = inode->i_sb;
+ struct journal_s *journal = EXT4_SB(sb)->s_journal;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ int needed_blocks;
+ struct ext4_map_blocks map;
+ handle_t *handle = NULL;
+ unsigned int blkbits = inode->i_blkbits;
+ unsigned int index = offset >> blkbits;
+ unsigned int end, len;
+ int ret;
+
+ if (unlikely(ext4_forced_shutdown(inode->i_sb)))
+ return -EIO;
+
+ /* Check validity of the cached writeback mapping. */
+ if (offset >= wpc->iomap.offset &&
+ offset < wpc->iomap.offset + wpc->iomap.length &&
+ ewpc->data_seq == READ_ONCE(ei->i_es_seq))
+ return 0;
+
+ needed_blocks = ext4_da_writepages_trans_blocks(inode);
+ end = min_t(unsigned int,
+ (ewpc->wbc->range_end >> blkbits), (UINT_MAX - 1));
+ len = (end > index + dirty_len) ? end - index + 1 : dirty_len;
+
+retry:
+ map.m_lblk = index;
+ map.m_len = min_t(unsigned int, EXT_UNWRITTEN_MAX_LEN, len);
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret < 0)
+ return ret;
+
+ /* Already allocated. */
+ if (map.m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN))
+ goto out;
+
+ handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ return ret;
+ }
+
+ ret = mpage_map_one_extent(handle, inode, &map, NULL);
+ if (ret < 0) {
+ if (ext4_forced_shutdown(sb))
+ goto out_journal;
+
+ /*
+ * Retry transient ENOSPC errors, if
+ * ext4_count_free_blocks() is non-zero, a commit
+ * should free up blocks.
+ */
+ if (ret == -ENOSPC && ext4_count_free_clusters(sb)) {
+ ext4_journal_stop(handle);
+ jbd2_journal_force_commit_nested(journal);
+ goto retry;
+ }
+
+ ext4_msg(sb, KERN_CRIT,
+ "Delayed block allocation failed for "
+ "inode %lu at logical offset %llu with "
+ "max blocks %u with error %d",
+ inode->i_ino, (unsigned long long)map.m_lblk,
+ (unsigned int)map.m_len, -ret);
+ ext4_msg(sb, KERN_CRIT,
+ "This should not happen!! Data will "
+ "be lost\n");
+ if (ret == -ENOSPC)
+ ext4_print_free_blocks(inode);
+out_journal:
+ ext4_journal_stop(handle);
+ return ret;
+ }
+
+ ext4_journal_stop(handle);
+out:
+ ewpc->data_seq = READ_ONCE(ei->i_es_seq);
+ ext4_set_iomap(inode, &wpc->iomap, &map, offset,
+ map.m_len << blkbits, 0);
+ return 0;
+}
+
+static int ext4_iomap_prepare_ioend(struct iomap_ioend *ioend, int status)
+{
+ struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
+
+ /* Need to convert unwritten extents when I/Os are completed. */
+ if (ioend->io_type == IOMAP_UNWRITTEN ||
+ ioend->io_offset + ioend->io_size > READ_ONCE(ei->i_disksize))
+ ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
+
+ return status;
+}
+
+static void ext4_iomap_discard_folio(struct folio *folio, loff_t pos)
+{
+ struct inode *inode = folio->mapping->host;
+
+ ext4_iomap_punch_delalloc(inode, pos,
+ folio_pos(folio) + folio_size(folio) - pos);
+}
+
+static const struct iomap_writeback_ops ext4_writeback_ops = {
+ .map_blocks = ext4_iomap_map_blocks,
+ .prepare_ioend = ext4_iomap_prepare_ioend,
+ .discard_folio = ext4_iomap_discard_folio,
+};
+
static int ext4_iomap_writepages(struct address_space *mapping,
struct writeback_control *wbc)
{
- return 0;
+ struct inode *inode = mapping->host;
+ struct super_block *sb = inode->i_sb;
+ long nr = wbc->nr_to_write;
+ int alloc_ctx, ret;
+ struct ext4_writeback_ctx ewpc = {
+ .wbc = wbc,
+ };
+
+ if (unlikely(ext4_forced_shutdown(sb)))
+ return -EIO;
+
+ alloc_ctx = ext4_writepages_down_read(sb);
+ trace_ext4_writepages(inode, wbc);
+ ret = iomap_writepages(mapping, wbc, &ewpc.ctx, &ext4_writeback_ops);
+ trace_ext4_writepages_result(inode, wbc, ret, nr - wbc->nr_to_write);
+ ext4_writepages_up_read(sb, alloc_ctx);
+
+ return ret;
}

/*
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index dfdd7e5cf038..a499208500e4 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -22,6 +22,7 @@
#include <linux/bio.h>
#include <linux/workqueue.h>
#include <linux/kernel.h>
+#include <linux/iomap.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/sched/mm.h>
@@ -565,3 +566,109 @@ int ext4_bio_write_folio(struct ext4_io_submit *io, struct folio *folio,

return 0;
}
+
+static void ext4_iomap_finish_ioend(struct iomap_ioend *ioend)
+{
+ struct inode *inode = ioend->io_inode;
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ loff_t pos = ioend->io_offset;
+ size_t size = ioend->io_size;
+ loff_t new_disksize;
+ handle_t *handle;
+ int credits;
+ int ret, err;
+
+ ret = blk_status_to_errno(ioend->io_bio.bi_status);
+ if (unlikely(ret))
+ goto out;
+
+ /*
+ * We may need to convert up to one extent per block in
+ * the page and we may dirty the inode.
+ */
+ credits = ext4_chunk_trans_blocks(inode,
+ EXT4_MAX_BLOCKS(size, pos, inode->i_blkbits));
+ handle = ext4_journal_start(inode, EXT4_HT_EXT_CONVERT, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out_err;
+ }
+
+ if (ioend->io_type == IOMAP_UNWRITTEN) {
+ ret = ext4_convert_unwritten_extents(handle, inode, pos, size);
+ if (ret)
+ goto out_journal;
+ }
+
+ /*
+ * Update on-disk size after IO is completed. Races with
+ * truncate are avoided by checking i_size under i_data_sem.
+ */
+ new_disksize = pos + size;
+ if (new_disksize > READ_ONCE(ei->i_disksize)) {
+ down_write(&ei->i_data_sem);
+ new_disksize = min(new_disksize, i_size_read(inode));
+ if (new_disksize > ei->i_disksize)
+ ei->i_disksize = new_disksize;
+ up_write(&ei->i_data_sem);
+ ret = ext4_mark_inode_dirty(handle, inode);
+ if (ret)
+ EXT4_ERROR_INODE_ERR(inode, -ret,
+ "Failed to mark inode dirty");
+ }
+
+out_journal:
+ err = ext4_journal_stop(handle);
+ if (!ret)
+ ret = err;
+out_err:
+ if (ret < 0 && !ext4_forced_shutdown(inode->i_sb)) {
+ ext4_msg(inode->i_sb, KERN_EMERG,
+ "failed to convert unwritten extents to "
+ "written extents or update inode size -- "
+ "potential data loss! (inode %lu, error %d)",
+ inode->i_ino, ret);
+ }
+out:
+ iomap_finish_ioends(ioend, ret);
+}
+
+/*
+ * Work on buffered iomap completed IO, to convert unwritten extents to
+ * mapped extents
+ */
+void ext4_iomap_end_io(struct work_struct *work)
+{
+ struct ext4_inode_info *ei = container_of(work, struct ext4_inode_info,
+ i_iomap_ioend_work);
+ struct iomap_ioend *ioend;
+ struct list_head ioend_list;
+ unsigned long flags;
+
+ spin_lock_irqsave(&ei->i_completed_io_lock, flags);
+ list_replace_init(&ei->i_iomap_ioend_list, &ioend_list);
+ spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+
+ iomap_sort_ioends(&ioend_list);
+ while (!list_empty(&ioend_list)) {
+ ioend = list_entry(ioend_list.next, struct iomap_ioend, io_list);
+ list_del_init(&ioend->io_list);
+ iomap_ioend_try_merge(ioend, &ioend_list);
+ ext4_iomap_finish_ioend(ioend);
+ }
+}
+
+void ext4_iomap_end_bio(struct bio *bio)
+{
+ struct iomap_ioend *ioend = iomap_ioend_from_bio(bio);
+ struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
+ struct ext4_sb_info *sbi = EXT4_SB(ioend->io_inode->i_sb);
+ unsigned long flags;
+
+ /* Only reserved conversions from writeback should enter here */
+ spin_lock_irqsave(&ei->i_completed_io_lock, flags);
+ if (list_empty(&ei->i_iomap_ioend_list))
+ queue_work(sbi->rsv_conversion_wq, &ei->i_iomap_ioend_work);
+ list_add_tail(&ioend->io_list, &ei->i_iomap_ioend_list);
+ spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index d4afaae8ab38..811a46f83bb9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1431,11 +1431,13 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
#endif
ei->jinode = NULL;
INIT_LIST_HEAD(&ei->i_rsv_conversion_list);
+ INIT_LIST_HEAD(&ei->i_iomap_ioend_list);
spin_lock_init(&ei->i_completed_io_lock);
ei->i_sync_tid = 0;
ei->i_datasync_tid = 0;
atomic_set(&ei->i_unwritten, 0);
INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
+ INIT_WORK(&ei->i_iomap_ioend_work, ext4_iomap_end_io);
ext4_fc_init_inode(&ei->vfs_inode);
mutex_init(&ei->i_fc_lock);
return &ei->vfs_inode;
--
2.39.2


2024-01-02 12:44:41

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 17/25] ext4: implement buffered write iomap path

From: Zhang Yi <[email protected]>

Implement buffered write iomap path, use ext4_da_map_blocks() to map
delalloc extents and add ext4_iomap_get_blocks() to allocate blocks if
delalloc is disabled or free space is about to run out.

Note that we don't want to support dioread_lock mount option any more,
so we drop the branch of ext4_should_dioread_nolock() and always
allocate unwritten extents for new blocks, also make
ext4_should_dioread_nolock() not controlled by the DIOREAD_NOLOCK mount
option and always return true. Besides, the i_disksize updating is also
postponed to after writeback.

After this, now we map or allocate batch of blocks once a time, so it
should be able to bring a lot of performance gains.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/ext4.h | 3 +
fs/ext4/ext4_jbd2.h | 7 ++
fs/ext4/file.c | 19 ++++-
fs/ext4/inode.c | 168 ++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 190 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3461cb3ff524..03cdcf3d86a5 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2970,6 +2970,7 @@ int ext4_walk_page_buffers(handle_t *handle,
struct buffer_head *bh));
int do_journal_get_write_access(handle_t *handle, struct inode *inode,
struct buffer_head *bh);
+int ext4_nonda_switch(struct super_block *sb);
#define FALL_BACK_TO_NONDELALLOC 1
#define CONVERT_INLINE_DATA 2

@@ -3827,6 +3828,8 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
extern const struct iomap_ops ext4_iomap_ops;
extern const struct iomap_ops ext4_iomap_overwrite_ops;
extern const struct iomap_ops ext4_iomap_report_ops;
+extern const struct iomap_ops ext4_iomap_buffered_write_ops;
+extern const struct iomap_ops ext4_iomap_buffered_da_write_ops;

static inline int ext4_buffer_uptodate(struct buffer_head *bh)
{
diff --git a/fs/ext4/ext4_jbd2.h b/fs/ext4/ext4_jbd2.h
index 0c77697d5e90..c1194ba8d6f2 100644
--- a/fs/ext4/ext4_jbd2.h
+++ b/fs/ext4/ext4_jbd2.h
@@ -499,6 +499,13 @@ static inline int ext4_free_data_revoke_credits(struct inode *inode, int blocks)
*/
static inline int ext4_should_dioread_nolock(struct inode *inode)
{
+ /*
+ * Always enable dioread_nolock for inode which use buffered
+ * iomap path.
+ */
+ if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP))
+ return 1;
+
if (!test_opt(inode->i_sb, DIOREAD_NOLOCK))
return 0;
if (!S_ISREG(inode->i_mode))
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6aa15dafc677..d15bd6ff1b20 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -282,6 +282,20 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return count;
}

+static ssize_t ext4_iomap_buffered_write(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+ const struct iomap_ops *iomap_ops;
+
+ if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
+ iomap_ops = &ext4_iomap_buffered_da_write_ops;
+ else
+ iomap_ops = &ext4_iomap_buffered_write_ops;
+
+ return iomap_file_buffered_write(iocb, from, iomap_ops);
+}
+
static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
struct iov_iter *from)
{
@@ -296,7 +310,10 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
if (ret <= 0)
goto out;

- ret = generic_perform_write(iocb, from);
+ if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP))
+ ret = ext4_iomap_buffered_write(iocb, from);
+ else
+ ret = generic_perform_write(iocb, from);

out:
inode_unlock(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2044b322dfd8..5512f38a1a9d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2827,7 +2827,7 @@ static int ext4_dax_writepages(struct address_space *mapping,
return ret;
}

-static int ext4_nonda_switch(struct super_block *sb)
+int ext4_nonda_switch(struct super_block *sb)
{
s64 free_clusters, dirty_clusters;
struct ext4_sb_info *sbi = EXT4_SB(sb);
@@ -3223,6 +3223,15 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
return inode->i_state & I_DIRTY_DATASYNC;
}

+static bool ext4_iomap_valid(struct inode *inode, const struct iomap *iomap)
+{
+ return iomap->validity_cookie == READ_ONCE(EXT4_I(inode)->i_es_seq);
+}
+
+static const struct iomap_folio_ops ext4_iomap_folio_ops = {
+ .iomap_valid = ext4_iomap_valid,
+};
+
static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
struct ext4_map_blocks *map, loff_t offset,
loff_t length, unsigned int flags)
@@ -3253,6 +3262,9 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
iomap->flags |= IOMAP_F_MERGED;

+ iomap->validity_cookie = READ_ONCE(EXT4_I(inode)->i_es_seq);
+ iomap->folio_ops = &ext4_iomap_folio_ops;
+
/*
* Flags passed to ext4_map_blocks() for direct I/O writes can result
* in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
@@ -3492,11 +3504,42 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};

-static int ext4_iomap_buffered_io_begin(struct inode *inode, loff_t offset,
+static int ext4_iomap_get_blocks(struct inode *inode,
+ struct ext4_map_blocks *map)
+{
+ handle_t *handle;
+ int ret, needed_blocks;
+
+ /*
+ * Reserve one block more for addition to orphan list in case
+ * we allocate blocks but write fails for some reason.
+ */
+ needed_blocks = ext4_writepage_trans_blocks(inode) + 1;
+ handle = ext4_journal_start(inode, EXT4_HT_WRITE_PAGE, needed_blocks);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ ret = ext4_map_blocks(handle, inode, map,
+ EXT4_GET_BLOCKS_CREATE_UNWRIT_EXT);
+ /*
+ * Have to stop journal here since there is a potential deadlock
+ * caused by later balance_dirty_pages(), it might wait on the
+ * ditry pages to be written back, which might start another
+ * handle and wait this handle stop.
+ */
+ ext4_journal_stop(handle);
+
+ return ret;
+}
+
+#define IOMAP_F_EXT4_DELALLOC IOMAP_F_PRIVATE
+
+static int __ext4_iomap_buffered_io_begin(struct inode *inode, loff_t offset,
loff_t length, unsigned int iomap_flags,
- struct iomap *iomap, struct iomap *srcmap)
+ struct iomap *iomap, struct iomap *srcmap,
+ bool delalloc)
{
- int ret;
+ int ret, retries = 0;
struct ext4_map_blocks map;
u8 blkbits = inode->i_blkbits;

@@ -3506,20 +3549,133 @@ static int ext4_iomap_buffered_io_begin(struct inode *inode, loff_t offset,
return -EINVAL;
if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
return -ERANGE;
-
+retry:
/* Calculate the first and last logical blocks respectively. */
map.m_lblk = offset >> blkbits;
map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+ if (iomap_flags & IOMAP_WRITE) {
+ if (delalloc)
+ ret = ext4_da_map_blocks(inode, &map);
+ else
+ ret = ext4_iomap_get_blocks(inode, &map);

- ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret == -ENOSPC &&
+ ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+ } else {
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ }
if (ret < 0)
return ret;

ext4_set_iomap(inode, iomap, &map, offset, length, iomap_flags);
+ if (delalloc)
+ iomap->flags |= IOMAP_F_EXT4_DELALLOC;
+
+ return 0;
+}
+
+static inline int ext4_iomap_buffered_io_begin(struct inode *inode,
+ loff_t offset, loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ return __ext4_iomap_buffered_io_begin(inode, offset, length, flags,
+ iomap, srcmap, false);
+}
+
+static inline int ext4_iomap_buffered_da_write_begin(struct inode *inode,
+ loff_t offset, loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ return __ext4_iomap_buffered_io_begin(inode, offset, length, flags,
+ iomap, srcmap, true);
+}
+
+/*
+ * Drop the staled delayed allocation range from the write failure,
+ * including both start and end blocks. If not, we could leave a range
+ * of delayed extents covered by a clean folio, it could lead to
+ * inaccurate space reservation.
+ */
+static int ext4_iomap_punch_delalloc(struct inode *inode, loff_t offset,
+ loff_t length)
+{
+ ext4_es_remove_extent(inode, offset >> inode->i_blkbits,
+ DIV_ROUND_UP(length, EXT4_BLOCK_SIZE(inode->i_sb)));
return 0;
}

+static int ext4_iomap_buffered_write_end(struct inode *inode, loff_t offset,
+ loff_t length, ssize_t written,
+ unsigned int flags,
+ struct iomap *iomap)
+{
+ handle_t *handle;
+ loff_t end;
+ int ret = 0, ret2;
+
+ /* delalloc */
+ if (iomap->flags & IOMAP_F_EXT4_DELALLOC) {
+ ret = iomap_file_buffered_write_punch_delalloc(inode, iomap,
+ offset, length, written, ext4_iomap_punch_delalloc);
+ if (ret)
+ ext4_warning(inode->i_sb,
+ "Failed to clean up delalloc for inode %lu, %d",
+ inode->i_ino, ret);
+ return ret;
+ }
+
+ /* nodelalloc */
+ end = offset + length;
+ if (!(iomap->flags & IOMAP_F_SIZE_CHANGED) && end <= inode->i_size)
+ return 0;
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ if (iomap->flags & IOMAP_F_SIZE_CHANGED) {
+ ext4_update_i_disksize(inode, inode->i_size);
+ ret = ext4_mark_inode_dirty(handle, inode);
+ }
+
+ /*
+ * If we have allocated more blocks and copied less.
+ * We will have blocks allocated outside inode->i_size,
+ * so truncate them.
+ */
+ if (end > inode->i_size)
+ ext4_orphan_add(handle, inode);
+
+ ret2 = ext4_journal_stop(handle);
+ ret = ret ? : ret2;
+
+ if (end > inode->i_size) {
+ ext4_truncate_failed_write(inode);
+ /*
+ * If truncate failed early the inode might still be
+ * on the orphan list; we need to make sure the inode
+ * is removed from the orphan list in that case.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ }
+
+ return ret;
+}
+
+
+const struct iomap_ops ext4_iomap_buffered_write_ops = {
+ .iomap_begin = ext4_iomap_buffered_io_begin,
+ .iomap_end = ext4_iomap_buffered_write_end,
+};
+
+const struct iomap_ops ext4_iomap_buffered_da_write_ops = {
+ .iomap_begin = ext4_iomap_buffered_da_write_begin,
+ .iomap_end = ext4_iomap_buffered_write_end,
+};
+
const struct iomap_ops ext4_iomap_buffered_read_ops = {
.iomap_begin = ext4_iomap_buffered_io_begin,
};
--
2.39.2


2024-01-02 12:44:57

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 20/25] ext4: implement zero_range iomap path

From: Zhang Yi <[email protected]>

Implement zero_range iomap path, add ext4_iomap_zero_range() to zero
out the already mapped blocks, everything have been done in
iomap_zero_range(), so invoke it directly.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ec390bb59b6b..1ca2c995a889 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4061,6 +4061,13 @@ static int __ext4_block_zero_page_range(handle_t *handle,
return err;
}

+static int ext4_iomap_zero_range(struct inode *inode,
+ loff_t from, loff_t length)
+{
+ return iomap_zero_range(inode, from, length, NULL,
+ &ext4_iomap_buffered_read_ops);
+}
+
/*
* ext4_block_zero_page_range() zeros out a mapping of length 'length'
* starting from file offset 'from'. The range to be zero'd must
@@ -4086,6 +4093,8 @@ static int ext4_block_zero_page_range(handle_t *handle,
if (IS_DAX(inode)) {
return dax_zero_range(inode, from, length, NULL,
&ext4_iomap_ops);
+ } else if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
+ return ext4_iomap_zero_range(inode, from, length);
}
return __ext4_block_zero_page_range(handle, mapping, from, length);
}
--
2.39.2


2024-01-02 12:45:03

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 22/25] ext4: fall back to buffer_head path for defrag

From: Zhang Yi <[email protected]>

Online defrag doesn't support iomap path yet, so we have to fall back to
buffer_head path for inodes which have been useing iomap. Fall back
active inode is dangerous, we must writeback and drop all dirty pages
under inode lock and mapping->invalidate_lock, those can protect us from
adding new folios into mapping.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/move_extent.c | 34 ++++++++++++++++++++++++++++++++++
1 file changed, 34 insertions(+)

diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 3aa57376d9c2..7a9ca71d4cac 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -538,6 +538,34 @@ mext_check_arguments(struct inode *orig_inode,
return 0;
}

+/*
+ * Disable buffered iomap path for the inode that requiring move extents,
+ * fallback to buffer_head path.
+ */
+static int ext4_disable_buffered_iomap_aops(struct inode *inode)
+{
+ int err;
+
+ /*
+ * The buffered_head aops don't know how to handle folios
+ * dirtied by iomap, so before falling back, flush all dirty
+ * folios the inode has.
+ */
+ filemap_invalidate_lock(inode->i_mapping);
+ err = filemap_write_and_wait(inode->i_mapping);
+ if (err < 0) {
+ filemap_invalidate_unlock(inode->i_mapping);
+ return err;
+ }
+ truncate_inode_pages(inode->i_mapping, 0);
+
+ ext4_clear_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+ ext4_set_aops(inode);
+ filemap_invalidate_unlock(inode->i_mapping);
+
+ return 0;
+}
+
/**
* ext4_move_extents - Exchange the specified range of a file
*
@@ -609,6 +637,12 @@ ext4_move_extents(struct file *o_filp, struct file *d_filp, __u64 orig_blk,
inode_dio_wait(orig_inode);
inode_dio_wait(donor_inode);

+ /* Fallback to buffer_head aops for inodes with buffered iomap aops */
+ if (ext4_test_inode_state(orig_inode, EXT4_STATE_BUFFERED_IOMAP))
+ ext4_disable_buffered_iomap_aops(orig_inode);
+ if (ext4_test_inode_state(donor_inode, EXT4_STATE_BUFFERED_IOMAP))
+ ext4_disable_buffered_iomap_aops(donor_inode);
+
/* Protect extent tree against block allocations via delalloc */
ext4_double_down_write_data_sem(orig_inode, donor_inode);
/* Check the filesystem environment whether move_extent can be done */
--
2.39.2


2024-01-02 12:45:11

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 24/25] filemap: support disable large folios on active inode

From: Zhang Yi <[email protected]>

Since commit 730633f0b7f9 ("mm: Protect operations adding pages to page
cache with invalidate_lock"), mapping->invalidate_lock can protect us
from adding new folios into page cache. So it's possible to disable
active inodes' large folios support, even through it might be dangerous.
Filesystems can disable it under mapping->invalidate_lock and drop all
page cache before dropping AS_LARGE_FOLIO_SUPPORT, besides, the order of
folio must also be determined under the lock.

Signed-off-by: Zhang Yi <[email protected]>
---
include/linux/pagemap.h | 13 +++++++++++++
mm/readahead.c | 6 ++++--
2 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 06142ff7f9ce..2554699ba7e3 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -343,6 +343,19 @@ static inline void mapping_set_large_folios(struct address_space *mapping)
__set_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
}

+/**
+ * mapping_clear_large_folios() - The file disable supports large folios.
+ * @mapping: The file.
+ *
+ * The filesystem have to make sure the file is in atomic context and all
+ * cached folios have been cleared under mapping->invalidate_lock before
+ * calling this function.
+ */
+static inline void mapping_clear_large_folios(struct address_space *mapping)
+{
+ __clear_bit(AS_LARGE_FOLIO_SUPPORT, &mapping->flags);
+}
+
/*
* Large folio support currently depends on THP. These dependencies are
* being worked on but are not yet fixed.
diff --git a/mm/readahead.c b/mm/readahead.c
index 6925e6959fd3..c97eceaf7214 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -493,8 +493,11 @@ void page_cache_ra_order(struct readahead_control *ractl,
int err = 0;
gfp_t gfp = readahead_gfp_mask(mapping);

- if (!mapping_large_folio_support(mapping) || ra->size < 4)
+ filemap_invalidate_lock_shared(mapping);
+ if (!mapping_large_folio_support(mapping) || ra->size < 4) {
+ filemap_invalidate_unlock_shared(mapping);
goto fallback;
+ }

limit = min(limit, index + ra->size - 1);

@@ -506,7 +509,6 @@ void page_cache_ra_order(struct readahead_control *ractl,
new_order--;
}

- filemap_invalidate_lock_shared(mapping);
while (index <= limit) {
unsigned int order = new_order;

--
2.39.2


2024-01-02 12:45:19

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 25/25] ext4: enable large folio for regular file with iomap buffered IO path

From: Zhang Yi <[email protected]>

After we convert buffered IO path to iomap for regular files, we can
enable large foilo for them together, that should be able to bring a lot
of performance gains for large IO.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/ialloc.c | 4 +++-
fs/ext4/inode.c | 4 +++-
fs/ext4/move_extent.c | 1 +
3 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 956b9d69c559..5a22fe5aa46b 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1336,8 +1336,10 @@ struct inode *__ext4_new_inode(struct mnt_idmap *idmap,
}
}

- if (ext4_should_use_buffered_iomap(inode))
+ if (ext4_should_use_buffered_iomap(inode)) {
ext4_set_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+ mapping_set_large_folios(inode->i_mapping);
+ }

if (ext4_handle_valid(handle)) {
ei->i_sync_tid = handle->h_transaction->t_tid;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2d2b8f2b634d..49a5b9b85407 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5319,8 +5319,10 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
if (ret)
goto bad_inode;

- if (ext4_should_use_buffered_iomap(inode))
+ if (ext4_should_use_buffered_iomap(inode)) {
ext4_set_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+ mapping_set_large_folios(inode->i_mapping);
+ }

if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index 7a9ca71d4cac..aecd6112d8a2 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -560,6 +560,7 @@ static int ext4_disable_buffered_iomap_aops(struct inode *inode)
truncate_inode_pages(inode->i_mapping, 0);

ext4_clear_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+ mapping_clear_large_folios(inode->i_mapping);
ext4_set_aops(inode);
filemap_invalidate_unlock(inode->i_mapping);

--
2.39.2


2024-01-02 12:51:17

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 19/25] ext4: implement mmap iomap path

From: Zhang Yi <[email protected]>

Implement mmap iomap path, add ext4_iomap_page_mkwrite() to map blocks
and dirty folio, almost everying have been done in iomap_page_mkwrite(),
so invoke it directly.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 25 +++++++++++++++++++++++++
1 file changed, 25 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index c22b49521df2..ec390bb59b6b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -6401,6 +6401,26 @@ static int ext4_bh_unmapped(handle_t *handle, struct inode *inode,
return !buffer_mapped(bh);
}

+static vm_fault_t ext4_iomap_page_mkwrite(struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ const struct iomap_ops *iomap_ops;
+
+ /*
+ * ext4_nonda_switch() could writeback this folio, so have to
+ * call it before lock folio.
+ *
+ * TODO: drop ext4_nonda_switch() after reserving enough sapce
+ * for metadata and merge delalloc and nodelalloc operations.
+ */
+ if (test_opt(inode->i_sb, DELALLOC) && !ext4_nonda_switch(inode->i_sb))
+ iomap_ops = &ext4_iomap_buffered_da_write_ops;
+ else
+ iomap_ops = &ext4_iomap_buffered_write_ops;
+
+ return iomap_page_mkwrite(vmf, iomap_ops);
+}
+
vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
@@ -6424,6 +6444,11 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)

filemap_invalidate_lock_shared(mapping);

+ if (ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)) {
+ ret = ext4_iomap_page_mkwrite(vmf);
+ goto out;
+ }
+
err = ext4_convert_inline_data(inode);
if (err)
goto out_ret;
--
2.39.2


2024-01-02 12:51:45

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 21/25] ext4: writeback partial blocks before zero range

From: Zhang Yi <[email protected]>

If we zero partial blocks, iomap_zero_iter() will skip zeroing out if
the srcmap is IOMAP_UNWRITTEN, it works fine in xfs because this type
means the block is pure unwritten, doesn't contain any delayed data,
but in ext4, IOMAP_UNWRITTEN may contain delayed data. For now we cannot
simply change the meaning of this flag in ext4, so just writeback
partial blocks from the beginning, make sure it becomes IOMAP_MAPPED
before zeroing out.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/extents.c | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 67ff75108cd1..d98c50472a42 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4606,6 +4606,15 @@ static long ext4_zero_range(struct file *file, loff_t offset,
if (ret)
goto out_mutex;

+ ret = filemap_write_and_wait_range(mapping,
+ round_down(offset, 1 << blkbits), offset);
+ if (ret)
+ goto out_mutex;
+
+ ret = filemap_write_and_wait_range(mapping, offset + len,
+ round_up((offset + len), 1 << blkbits));
+ if (ret)
+ goto out_mutex;
}

/* Zero range excluding the unaligned edges */
--
2.39.2


2024-01-02 12:51:48

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 23/25] ext4: partially enable iomap for regular file's buffered IO path

From: Zhang Yi <[email protected]>

Partially enable iomap for regular file's buffered IO path on default
mount option and default filesystem features. Set inode state flag
EXT4_STATE_BUFFERED_IOMAP when creating one inode to indicate that this
inode choice the iomap path.

Now it still have many limitations, it doesn't support inline data,
fs_verity, fs_crypt, defrag, bigalloc, dax and data=journal mode yet, so
we have to fallback to buffered_head path if these options/features were
enabled. I hope these would be supported gradually in the future.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/ialloc.c | 3 +++
fs/ext4/inode.c | 34 ++++++++++++++++++++++++++++++++++
3 files changed, 38 insertions(+)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index eaf29bade606..16dce8701c5e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2972,6 +2972,7 @@ int ext4_walk_page_buffers(handle_t *handle,
struct buffer_head *bh));
int do_journal_get_write_access(handle_t *handle, struct inode *inode,
struct buffer_head *bh);
+bool ext4_should_use_buffered_iomap(struct inode *inode);
int ext4_nonda_switch(struct super_block *sb);
#define FALL_BACK_TO_NONDELALLOC 1
#define CONVERT_INLINE_DATA 2
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index e9bbb1da2d0a..956b9d69c559 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1336,6 +1336,9 @@ struct inode *__ext4_new_inode(struct mnt_idmap *idmap,
}
}

+ if (ext4_should_use_buffered_iomap(inode))
+ ext4_set_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+
if (ext4_handle_valid(handle)) {
ei->i_sync_tid = handle->h_transaction->t_tid;
ei->i_datasync_tid = handle->h_transaction->t_tid;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1ca2c995a889..2d2b8f2b634d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -759,6 +759,8 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,

if (ext4_has_inline_data(inode))
return -ERANGE;
+ if (WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)))
+ return -EINVAL;

map.m_lblk = iblock;
map.m_len = bh->b_size >> inode->i_blkbits;
@@ -2537,6 +2539,9 @@ static int ext4_do_writepages(struct mpage_da_data *mpd)

trace_ext4_writepages(inode, wbc);

+ if (WARN_ON(ext4_test_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP)))
+ return -EINVAL;
+
/*
* No pages to write? This is mainly a kludge to avoid starting
* a transaction for special inodes like journal inode on last iput()
@@ -5024,6 +5029,32 @@ static const char *check_igot_inode(struct inode *inode, ext4_iget_flags flags)
return NULL;
}

+bool ext4_should_use_buffered_iomap(struct inode *inode)
+{
+ struct super_block *sb = inode->i_sb;
+
+ if (ext4_has_feature_inline_data(sb))
+ return false;
+ if (ext4_has_feature_verity(sb))
+ return false;
+ if (ext4_has_feature_bigalloc(sb))
+ return false;
+ if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_JOURNAL_DATA)
+ return false;
+ if (!S_ISREG(inode->i_mode))
+ return false;
+ if (IS_DAX(inode))
+ return false;
+ if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)))
+ return false;
+ if (ext4_test_inode_flag(inode, EXT4_INODE_EA_INODE))
+ return false;
+ if (ext4_test_inode_flag(inode, EXT4_INODE_ENCRYPT))
+ return false;
+
+ return true;
+}
+
struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
ext4_iget_flags flags, const char *function,
unsigned int line)
@@ -5288,6 +5319,9 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
if (ret)
goto bad_inode;

+ if (ext4_should_use_buffered_iomap(inode))
+ ext4_set_inode_state(inode, EXT4_STATE_BUFFERED_IOMAP);
+
if (S_ISREG(inode->i_mode)) {
inode->i_op = &ext4_file_inode_operations;
inode->i_fop = &ext4_file_operations;
--
2.39.2


2024-01-02 12:51:49

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 01/25] ext4: refactor ext4_da_map_blocks()

From: Zhang Yi <[email protected]>

Refactor and cleanup ext4_da_map_blocks(), reduce some unnecessary
parameters and branches, no logic changes.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 39 +++++++++++++++++----------------------
1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 61277f7f8722..5b0d3075be12 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1704,7 +1704,6 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
/* Lookup extent status tree firstly */
if (ext4_es_lookup_extent(inode, iblock, NULL, &es)) {
if (ext4_es_is_hole(&es)) {
- retval = 0;
down_read(&EXT4_I(inode)->i_data_sem);
goto add_delayed;
}
@@ -1749,26 +1748,9 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
retval = ext4_ext_map_blocks(NULL, inode, map, 0);
else
retval = ext4_ind_map_blocks(NULL, inode, map, 0);
-
-add_delayed:
- if (retval == 0) {
- int ret;
-
- /*
- * XXX: __block_prepare_write() unmaps passed block,
- * is it OK?
- */
-
- ret = ext4_insert_delayed_block(inode, map->m_lblk);
- if (ret != 0) {
- retval = ret;
- goto out_unlock;
- }
-
- map_bh(bh, inode->i_sb, invalid_block);
- set_buffer_new(bh);
- set_buffer_delay(bh);
- } else if (retval > 0) {
+ if (retval < 0)
+ goto out_unlock;
+ if (retval > 0) {
unsigned int status;

if (unlikely(retval != map->m_len)) {
@@ -1783,11 +1765,24 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
map->m_pblk, status);
+ goto out_unlock;
}

+add_delayed:
+ /*
+ * XXX: __block_prepare_write() unmaps passed block,
+ * is it OK?
+ */
+ retval = ext4_insert_delayed_block(inode, map->m_lblk);
+ if (retval)
+ goto out_unlock;
+
+ map_bh(bh, inode->i_sb, invalid_block);
+ set_buffer_new(bh);
+ set_buffer_delay(bh);
+
out_unlock:
up_read((&EXT4_I(inode)->i_data_sem));
-
return retval;
}

--
2.39.2


2024-01-02 12:52:07

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 03/25] ext4: correct the hole length returned by ext4_map_blocks()

From: Zhang Yi <[email protected]>

In ext4_map_blocks(), if we can't find a range of mapping in the
extents cache, we are calling ext4_ext_map_blocks() to search the real
path and ext4_ext_determine_hole() to determine the hole range. But if
the querying range was partially or completely overlaped by a delalloc
extent, we can't find it in the real extent path, so the returned hole
length could be incorrect.

Fortunately, ext4_ext_put_gap_in_cache() have already handle delalloc
extent, but it searches start from the expanded hole_start, doesn't
start from the querying range, so the delalloc extent found could not be
the one that overlaped the querying range, plus, it also didn't adjust
the hole length. Let's just remove ext4_ext_put_gap_in_cache(), handle
delalloc and insert adjusted hole extent in ext4_ext_determine_hole().

Signed-off-by: Zhang Yi <[email protected]>
Suggested-by: Jan Kara <[email protected]>
---
fs/ext4/extents.c | 109 +++++++++++++++++++++++++++-------------------
1 file changed, 65 insertions(+), 44 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index d5efe076d3d3..0892d0568013 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2229,7 +2229,7 @@ static int ext4_fill_es_cache_info(struct inode *inode,


/*
- * ext4_ext_determine_hole - determine hole around given block
+ * ext4_ext_find_hole - find hole around given block according to the given path
* @inode: inode we lookup in
* @path: path in extent tree to @lblk
* @lblk: pointer to logical block around which we want to determine hole
@@ -2241,9 +2241,9 @@ static int ext4_fill_es_cache_info(struct inode *inode,
* The function returns the length of a hole starting at @lblk. We update @lblk
* to the beginning of the hole if we managed to find it.
*/
-static ext4_lblk_t ext4_ext_determine_hole(struct inode *inode,
- struct ext4_ext_path *path,
- ext4_lblk_t *lblk)
+static ext4_lblk_t ext4_ext_find_hole(struct inode *inode,
+ struct ext4_ext_path *path,
+ ext4_lblk_t *lblk)
{
int depth = ext_depth(inode);
struct ext4_extent *ex;
@@ -2270,30 +2270,6 @@ static ext4_lblk_t ext4_ext_determine_hole(struct inode *inode,
return len;
}

-/*
- * ext4_ext_put_gap_in_cache:
- * calculate boundaries of the gap that the requested block fits into
- * and cache this gap
- */
-static void
-ext4_ext_put_gap_in_cache(struct inode *inode, ext4_lblk_t hole_start,
- ext4_lblk_t hole_len)
-{
- struct extent_status es;
-
- ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start,
- hole_start + hole_len - 1, &es);
- if (es.es_len) {
- /* There's delayed extent containing lblock? */
- if (es.es_lblk <= hole_start)
- return;
- hole_len = min(es.es_lblk - hole_start, hole_len);
- }
- ext_debug(inode, " -> %u:%u\n", hole_start, hole_len);
- ext4_es_insert_extent(inode, hole_start, hole_len, ~0,
- EXTENT_STATUS_HOLE);
-}
-
/*
* ext4_ext_rm_idx:
* removes index from the index block.
@@ -4062,6 +4038,66 @@ static int get_implied_cluster_alloc(struct super_block *sb,
return 0;
}

+/*
+ * Determine hole length around the given logical block, first try to
+ * locate and expand the hole from the given @path, and then adjust it
+ * if it's partially or completely converted to delayed extents.
+ */
+static void ext4_ext_determine_hole(struct inode *inode,
+ struct ext4_ext_path *path,
+ struct ext4_map_blocks *map)
+{
+ ext4_lblk_t hole_start, len;
+ struct extent_status es;
+
+ hole_start = map->m_lblk;
+ len = ext4_ext_find_hole(inode, path, &hole_start);
+again:
+ ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start,
+ hole_start + len - 1, &es);
+ if (!es.es_len)
+ goto insert_hole;
+
+ /*
+ * Found a delayed range in the hole, handle it if the delayed
+ * range is before, behind and straddle the queried range.
+ */
+ if (map->m_lblk >= es.es_lblk + es.es_len) {
+ /*
+ * Before the queried range, find again from the queried
+ * start block.
+ */
+ len -= map->m_lblk - hole_start;
+ hole_start = map->m_lblk;
+ goto again;
+ } else if (in_range(map->m_lblk, es.es_lblk, es.es_len)) {
+ /*
+ * Straddle the beginning of the queried range, it's no
+ * longer a hole, adjust the length to the delayed extent's
+ * after map->m_lblk.
+ */
+ len = es.es_lblk + es.es_len - map->m_lblk;
+ goto out;
+ } else {
+ /*
+ * Partially or completely behind the queried range, update
+ * hole length until the beginning of the delayed extent.
+ */
+ len = min(es.es_lblk - hole_start, len);
+ }
+
+insert_hole:
+ /* Put just found gap into cache to speed up subsequent requests */
+ ext_debug(inode, " -> %u:%u\n", hole_start, len);
+ ext4_es_insert_extent(inode, hole_start, len, ~0, EXTENT_STATUS_HOLE);
+
+ /* Update hole_len to reflect hole size after map->m_lblk */
+ if (hole_start != map->m_lblk)
+ len -= map->m_lblk - hole_start;
+out:
+ map->m_pblk = 0;
+ map->m_len = min_t(unsigned int, map->m_len, len);
+}

/*
* Block allocation/map/preallocation routine for extents based files
@@ -4179,22 +4215,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
* we couldn't try to create block if create flag is zero
*/
if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
- ext4_lblk_t hole_start, hole_len;
-
- hole_start = map->m_lblk;
- hole_len = ext4_ext_determine_hole(inode, path, &hole_start);
- /*
- * put just found gap into cache to speed up
- * subsequent requests
- */
- ext4_ext_put_gap_in_cache(inode, hole_start, hole_len);
-
- /* Update hole_len to reflect hole size after map->m_lblk */
- if (hole_start != map->m_lblk)
- hole_len -= map->m_lblk - hole_start;
- map->m_pblk = 0;
- map->m_len = min_t(unsigned int, map->m_len, hole_len);
-
+ ext4_ext_determine_hole(inode, path, map);
goto out;
}

--
2.39.2


2024-01-02 12:52:10

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 08/25] iomap: add pos and dirty_len into trace_iomap_writepage_map

From: Zhang Yi <[email protected]>

Since commit "iomap: map multiple blocks at a time", we could map
multi-blocks once a time, and the dirty_len indicates the expected map
length, map_len won't large than it. The pos and dirty_len means the
dirty range that should be mapped to write, add them into
trace_iomap_writepage_map() could be more useful for debug.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/iomap/buffered-io.c | 2 +-
fs/iomap/trace.h | 43 +++++++++++++++++++++++++++++++++++++++++-
2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 293ba00e4bc0..46675f51d9c9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1787,7 +1787,7 @@ static int iomap_writepage_map_blocks(struct iomap_writepage_ctx *wpc,
error = wpc->ops->map_blocks(wpc, inode, pos, dirty_len);
if (error)
break;
- trace_iomap_writepage_map(inode, &wpc->iomap);
+ trace_iomap_writepage_map(inode, pos, dirty_len, &wpc->iomap);

map_len = min_t(u64, dirty_len,
wpc->iomap.offset + wpc->iomap.length - pos);
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index c16fd55f5595..3ef694f9489f 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -154,7 +154,48 @@ DEFINE_EVENT(iomap_class, name, \
TP_ARGS(inode, iomap))
DEFINE_IOMAP_EVENT(iomap_iter_dstmap);
DEFINE_IOMAP_EVENT(iomap_iter_srcmap);
-DEFINE_IOMAP_EVENT(iomap_writepage_map);
+
+TRACE_EVENT(iomap_writepage_map,
+ TP_PROTO(struct inode *inode, u64 pos, unsigned int dirty_len,
+ struct iomap *iomap),
+ TP_ARGS(inode, pos, dirty_len, iomap),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(u64, ino)
+ __field(u64, pos)
+ __field(u64, dirty_len)
+ __field(u64, addr)
+ __field(loff_t, offset)
+ __field(u64, length)
+ __field(u16, type)
+ __field(u16, flags)
+ __field(dev_t, bdev)
+ ),
+ TP_fast_assign(
+ __entry->dev = inode->i_sb->s_dev;
+ __entry->ino = inode->i_ino;
+ __entry->pos = pos;
+ __entry->dirty_len = dirty_len;
+ __entry->addr = iomap->addr;
+ __entry->offset = iomap->offset;
+ __entry->length = iomap->length;
+ __entry->type = iomap->type;
+ __entry->flags = iomap->flags;
+ __entry->bdev = iomap->bdev ? iomap->bdev->bd_dev : 0;
+ ),
+ TP_printk("dev %d:%d ino 0x%llx bdev %d:%d pos 0x%llx dirty len 0x%llx "
+ "addr 0x%llx offset 0x%llx length 0x%llx type %s flags %s",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino,
+ MAJOR(__entry->bdev), MINOR(__entry->bdev),
+ __entry->pos,
+ __entry->dirty_len,
+ __entry->addr,
+ __entry->offset,
+ __entry->length,
+ __print_symbolic(__entry->type, IOMAP_TYPE_STRINGS),
+ __print_flags(__entry->flags, "|", IOMAP_F_FLAGS_STRINGS))
+);

TRACE_EVENT(iomap_iter,
TP_PROTO(struct iomap_iter *iter, const void *ops,
--
2.39.2


2024-01-02 12:52:34

by Zhang Yi

[permalink] [raw]
Subject: [RFC PATCH v2 11/25] ext4: also mark extent as delalloc if it's been unwritten

From: Zhang Yi <[email protected]>

Mark extent as delalloc if it's been unwritten, this delalloc flag will
last until write back. It would be useful to indicate the map length
when writing data back after converting regular file's buffered write
path to iomap.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 44033828db44..9ac9bb548a4c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1742,7 +1742,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
#ifdef ES_AGGRESSIVE_TEST
ext4_map_blocks_es_recheck(NULL, inode, map, &orig_map, 0);
#endif
- return retval;
+ if (ext4_es_is_delayed(&es) || ext4_es_is_written(&es))
+ return retval;
+
+ down_read(&EXT4_I(inode)->i_data_sem);
+ goto insert_extent;
}

/*
@@ -1770,9 +1774,11 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
inode->i_ino, retval, map->m_len);
WARN_ON(1);
}
-
+insert_extent:
status = map->m_flags & EXT4_MAP_UNWRITTEN ?
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
+ if (status == EXTENT_STATUS_UNWRITTEN)
+ status |= EXTENT_STATUS_DELAYED;
ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
map->m_pblk, status);
up_read(&EXT4_I(inode)->i_data_sem);
--
2.39.2


2024-01-03 09:56:41

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH v2 01/25] ext4: refactor ext4_da_map_blocks()

On Tue 02-01-24 20:38:54, Zhang Yi wrote:
> From: Zhang Yi <[email protected]>
>
> Refactor and cleanup ext4_da_map_blocks(), reduce some unnecessary
> parameters and branches, no logic changes.
>
> Signed-off-by: Zhang Yi <[email protected]>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/inode.c | 39 +++++++++++++++++----------------------
> 1 file changed, 17 insertions(+), 22 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 61277f7f8722..5b0d3075be12 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1704,7 +1704,6 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
> /* Lookup extent status tree firstly */
> if (ext4_es_lookup_extent(inode, iblock, NULL, &es)) {
> if (ext4_es_is_hole(&es)) {
> - retval = 0;
> down_read(&EXT4_I(inode)->i_data_sem);
> goto add_delayed;
> }
> @@ -1749,26 +1748,9 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
> retval = ext4_ext_map_blocks(NULL, inode, map, 0);
> else
> retval = ext4_ind_map_blocks(NULL, inode, map, 0);
> -
> -add_delayed:
> - if (retval == 0) {
> - int ret;
> -
> - /*
> - * XXX: __block_prepare_write() unmaps passed block,
> - * is it OK?
> - */
> -
> - ret = ext4_insert_delayed_block(inode, map->m_lblk);
> - if (ret != 0) {
> - retval = ret;
> - goto out_unlock;
> - }
> -
> - map_bh(bh, inode->i_sb, invalid_block);
> - set_buffer_new(bh);
> - set_buffer_delay(bh);
> - } else if (retval > 0) {
> + if (retval < 0)
> + goto out_unlock;
> + if (retval > 0) {
> unsigned int status;
>
> if (unlikely(retval != map->m_len)) {
> @@ -1783,11 +1765,24 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
> EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> map->m_pblk, status);
> + goto out_unlock;
> }
>
> +add_delayed:
> + /*
> + * XXX: __block_prepare_write() unmaps passed block,
> + * is it OK?
> + */
> + retval = ext4_insert_delayed_block(inode, map->m_lblk);
> + if (retval)
> + goto out_unlock;
> +
> + map_bh(bh, inode->i_sb, invalid_block);
> + set_buffer_new(bh);
> + set_buffer_delay(bh);
> +
> out_unlock:
> up_read((&EXT4_I(inode)->i_data_sem));
> -
> return retval;
> }
>
> --
> 2.39.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-01-03 10:03:43

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH v2 02/25] ext4: convert to exclusive lock while inserting delalloc extents

On Tue 02-01-24 20:38:55, Zhang Yi wrote:
> From: Zhang Yi <[email protected]>
>
> ext4_da_map_blocks() only hold i_data_sem in shared mode and i_rwsem
> when inserting delalloc extents, it could be raced by another querying
> path of ext4_map_blocks() without i_rwsem, .e.g buffered read path.
> Suppose we buffered read a file containing just a hole, and without any
> cached extents tree, then it is raced by another delayed buffered write
> to the same area or the near area belongs to the same hole, and the new
> delalloc extent could be overwritten to a hole extent.
>
> pread() pwrite()
> filemap_read_folio()
> ext4_mpage_readpages()
> ext4_map_blocks()
> down_read(i_data_sem)
> ext4_ext_determine_hole()
> //find hole
> ext4_ext_put_gap_in_cache()
> ext4_es_find_extent_range()
> //no delalloc extent
> ext4_da_map_blocks()
> down_read(i_data_sem)
> ext4_insert_delayed_block()
> //insert delalloc extent
> ext4_es_insert_extent()
> //overwrite delalloc extent to hole
>
> This race could lead to inconsistent delalloc extents tree and
> incorrect reserved space counter. Fix this by converting to hold
> i_data_sem in exclusive mode when adding a new delalloc extent in
> ext4_da_map_blocks().
>
> Cc: [email protected]
> Signed-off-by: Zhang Yi <[email protected]>
> Suggested-by: Jan Kara <[email protected]>

Looks good to me! Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/inode.c | 25 +++++++++++--------------
> 1 file changed, 11 insertions(+), 14 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5b0d3075be12..142c67f5c7fc 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1703,10 +1703,8 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
>
> /* Lookup extent status tree firstly */
> if (ext4_es_lookup_extent(inode, iblock, NULL, &es)) {
> - if (ext4_es_is_hole(&es)) {
> - down_read(&EXT4_I(inode)->i_data_sem);
> + if (ext4_es_is_hole(&es))
> goto add_delayed;
> - }
>
> /*
> * Delayed extent could be allocated by fallocate.
> @@ -1748,8 +1746,10 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
> retval = ext4_ext_map_blocks(NULL, inode, map, 0);
> else
> retval = ext4_ind_map_blocks(NULL, inode, map, 0);
> - if (retval < 0)
> - goto out_unlock;
> + if (retval < 0) {
> + up_read(&EXT4_I(inode)->i_data_sem);
> + return retval;
> + }
> if (retval > 0) {
> unsigned int status;
>
> @@ -1765,24 +1765,21 @@ static int ext4_da_map_blocks(struct inode *inode, sector_t iblock,
> EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
> ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
> map->m_pblk, status);
> - goto out_unlock;
> + up_read(&EXT4_I(inode)->i_data_sem);
> + return retval;
> }
> + up_read(&EXT4_I(inode)->i_data_sem);
>
> add_delayed:
> - /*
> - * XXX: __block_prepare_write() unmaps passed block,
> - * is it OK?
> - */
> + down_write(&EXT4_I(inode)->i_data_sem);
> retval = ext4_insert_delayed_block(inode, map->m_lblk);
> + up_write(&EXT4_I(inode)->i_data_sem);
> if (retval)
> - goto out_unlock;
> + return retval;
>
> map_bh(bh, inode->i_sb, invalid_block);
> set_buffer_new(bh);
> set_buffer_delay(bh);
> -
> -out_unlock:
> - up_read((&EXT4_I(inode)->i_data_sem));
> return retval;
> }
>
> --
> 2.39.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-01-03 11:03:07

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH v2 03/25] ext4: correct the hole length returned by ext4_map_blocks()

On Tue 02-01-24 20:38:56, Zhang Yi wrote:
> From: Zhang Yi <[email protected]>
>
> In ext4_map_blocks(), if we can't find a range of mapping in the
> extents cache, we are calling ext4_ext_map_blocks() to search the real
> path and ext4_ext_determine_hole() to determine the hole range. But if
> the querying range was partially or completely overlaped by a delalloc
> extent, we can't find it in the real extent path, so the returned hole
> length could be incorrect.
>
> Fortunately, ext4_ext_put_gap_in_cache() have already handle delalloc
> extent, but it searches start from the expanded hole_start, doesn't
> start from the querying range, so the delalloc extent found could not be
> the one that overlaped the querying range, plus, it also didn't adjust
> the hole length. Let's just remove ext4_ext_put_gap_in_cache(), handle
> delalloc and insert adjusted hole extent in ext4_ext_determine_hole().
>
> Signed-off-by: Zhang Yi <[email protected]>
> Suggested-by: Jan Kara <[email protected]>

Some small comments below.

> @@ -4062,6 +4038,66 @@ static int get_implied_cluster_alloc(struct super_block *sb,
> return 0;
> }
>
> +/*
> + * Determine hole length around the given logical block, first try to
> + * locate and expand the hole from the given @path, and then adjust it
> + * if it's partially or completely converted to delayed extents.
> + */
> +static void ext4_ext_determine_hole(struct inode *inode,
> + struct ext4_ext_path *path,
> + struct ext4_map_blocks *map)
> +{

I'd prefer to keep setting of 'map' inside ext4_ext_map_blocks() to make it
more obvious there and just pass map->m_lblk into
ext4_ext_determine_hole(). ext4_ext_determine_hole() will just return the
length of the extent from lblk that was mapped (i.e. 'len').

Also I'd probably call this function like ext4_ext_determine_insert_hole()
to make it more obvious the function modifies the status tree.

Otherwise the patch looks good to me.

Honza

> + ext4_lblk_t hole_start, len;
> + struct extent_status es;
> +
> + hole_start = map->m_lblk;
> + len = ext4_ext_find_hole(inode, path, &hole_start);
> +again:
> + ext4_es_find_extent_range(inode, &ext4_es_is_delayed, hole_start,
> + hole_start + len - 1, &es);
> + if (!es.es_len)
> + goto insert_hole;
> +
> + /*
> + * Found a delayed range in the hole, handle it if the delayed
> + * range is before, behind and straddle the queried range.
> + */
> + if (map->m_lblk >= es.es_lblk + es.es_len) {
> + /*
> + * Before the queried range, find again from the queried
> + * start block.
> + */
> + len -= map->m_lblk - hole_start;
> + hole_start = map->m_lblk;
> + goto again;
> + } else if (in_range(map->m_lblk, es.es_lblk, es.es_len)) {
> + /*
> + * Straddle the beginning of the queried range, it's no
> + * longer a hole, adjust the length to the delayed extent's
> + * after map->m_lblk.
> + */
> + len = es.es_lblk + es.es_len - map->m_lblk;
> + goto out;
> + } else {
> + /*
> + * Partially or completely behind the queried range, update
> + * hole length until the beginning of the delayed extent.
> + */
> + len = min(es.es_lblk - hole_start, len);
> + }
> +
> +insert_hole:
> + /* Put just found gap into cache to speed up subsequent requests */
> + ext_debug(inode, " -> %u:%u\n", hole_start, len);
> + ext4_es_insert_extent(inode, hole_start, len, ~0, EXTENT_STATUS_HOLE);
> +
> + /* Update hole_len to reflect hole size after map->m_lblk */
> + if (hole_start != map->m_lblk)
> + len -= map->m_lblk - hole_start;
> +out:
> + map->m_pblk = 0;
> + map->m_len = min_t(unsigned int, map->m_len, len);
> +}
>
> /*
> * Block allocation/map/preallocation routine for extents based files
> @@ -4179,22 +4215,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> * we couldn't try to create block if create flag is zero
> */
> if ((flags & EXT4_GET_BLOCKS_CREATE) == 0) {
> - ext4_lblk_t hole_start, hole_len;
> -
> - hole_start = map->m_lblk;
> - hole_len = ext4_ext_determine_hole(inode, path, &hole_start);
> - /*
> - * put just found gap into cache to speed up
> - * subsequent requests
> - */
> - ext4_ext_put_gap_in_cache(inode, hole_start, hole_len);
> -
> - /* Update hole_len to reflect hole size after map->m_lblk */
> - if (hole_start != map->m_lblk)
> - hole_len -= map->m_lblk - hole_start;
> - map->m_pblk = 0;
> - map->m_len = min_t(unsigned int, map->m_len, hole_len);
> -
> + ext4_ext_determine_hole(inode, path, map);
> goto out;
> }
>
> --
> 2.39.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-01-03 11:04:29

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH v2 04/25] ext4: add a hole extent entry in cache after punch

On Tue 02-01-24 20:38:57, Zhang Yi wrote:
> From: Zhang Yi <[email protected]>
>
> In order to cache hole extents in the extent status tree and keep the
> hole length as long as possible, re-add a hole entry to the cache just
> after punching a hole.
>
> Signed-off-by: Zhang Yi <[email protected]>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/inode.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 142c67f5c7fc..1b5e6409f958 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4000,12 +4000,12 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
>
> /* If there are blocks to remove, do it */
> if (stop_block > first_block) {
> + ext4_lblk_t hole_len = stop_block - first_block;
>
> down_write(&EXT4_I(inode)->i_data_sem);
> ext4_discard_preallocations(inode, 0);
>
> - ext4_es_remove_extent(inode, first_block,
> - stop_block - first_block);
> + ext4_es_remove_extent(inode, first_block, hole_len);
>
> if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> ret = ext4_ext_remove_space(inode, first_block,
> @@ -4014,6 +4014,8 @@ int ext4_punch_hole(struct file *file, loff_t offset, loff_t length)
> ret = ext4_ind_remove_space(handle, inode, first_block,
> stop_block);
>
> + ext4_es_insert_extent(inode, first_block, hole_len, ~0,
> + EXTENT_STATUS_HOLE);
> up_write(&EXT4_I(inode)->i_data_sem);
> }
> ext4_fc_track_range(handle, inode, first_block, stop_block);
> --
> 2.39.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-01-03 11:31:47

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH v2 05/25] ext4: make ext4_map_blocks() distinguish delalloc only extent

On Tue 02-01-24 20:38:58, Zhang Yi wrote:
> From: Zhang Yi <[email protected]>
>
> Add a new map flag EXT4_MAP_DELAYED to indicate the mapping range is a
> delayed allocated only (not unwritten) one, and making
> ext4_map_blocks() can distinguish it, no longer mixing it with holes.
>
> Signed-off-by: Zhang Yi <[email protected]>

One small comment below.

> ---
> fs/ext4/ext4.h | 4 +++-
> fs/ext4/extents.c | 5 +++--
> fs/ext4/inode.c | 2 ++
> 3 files changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index a5d784872303..55195909d32f 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -252,8 +252,10 @@ struct ext4_allocation_request {
> #define EXT4_MAP_MAPPED BIT(BH_Mapped)
> #define EXT4_MAP_UNWRITTEN BIT(BH_Unwritten)
> #define EXT4_MAP_BOUNDARY BIT(BH_Boundary)
> +#define EXT4_MAP_DELAYED BIT(BH_Delay)
> #define EXT4_MAP_FLAGS (EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
> - EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY)
> + EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\
> + EXT4_MAP_DELAYED)
>
> struct ext4_map_blocks {
> ext4_fsblk_t m_pblk;
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 0892d0568013..fc69f13cf510 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4073,9 +4073,10 @@ static void ext4_ext_determine_hole(struct inode *inode,
> } else if (in_range(map->m_lblk, es.es_lblk, es.es_len)) {
> /*
> * Straddle the beginning of the queried range, it's no
> - * longer a hole, adjust the length to the delayed extent's
> - * after map->m_lblk.
> + * longer a hole, mark it is a delalloc and adjust the
> + * length to the delayed extent's after map->m_lblk.
> */
> + map->m_flags |= EXT4_MAP_DELAYED;

I wouldn't set delalloc bit here. If there's delalloc extent containing
lblk now, it means the caller of ext4_map_blocks() is not holding i_rwsem
(otherwise we would have found already in ext4_map_blocks()) and thus
delalloc info is unreliable anyway. So I wouldn't bother. But it's worth a
comment here like:

/*
* There's delalloc extent containing lblk. It must have
* been added after ext4_map_blocks() checked the extent
* status tree so we are not holding i_rwsem and delalloc
* info is only stabilized by i_data_sem we are going to
* release soon. Don't modify the extent status tree and
* report extent as a hole.
*/

Honza

> len = es.es_lblk + es.es_len - map->m_lblk;
> goto out;
> } else {
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1b5e6409f958..c141bf6d8db2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -515,6 +515,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> map->m_len = retval;
> } else if (ext4_es_is_delayed(&es) || ext4_es_is_hole(&es)) {
> map->m_pblk = 0;
> + map->m_flags |= ext4_es_is_delayed(&es) ?
> + EXT4_MAP_DELAYED : 0;
> retval = es.es_len - (map->m_lblk - es.es_lblk);
> if (retval > map->m_len)
> retval = map->m_len;
> --
> 2.39.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-01-03 11:35:20

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH v2 06/25] ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type

On Tue 02-01-24 20:38:59, Zhang Yi wrote:
> From: Zhang Yi <[email protected]>
>
> Since ext4_map_blocks() can recognize a delayed allocated only extent,
> make ext4_set_iomap() can also recognize it, and remove the useless
> separate check in ext4_iomap_begin_report().
>
> Signed-off-by: Zhang Yi <[email protected]>

Looks good to me. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/inode.c | 32 +++-----------------------------
> 1 file changed, 3 insertions(+), 29 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index c141bf6d8db2..0458d7f0c059 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3261,6 +3261,9 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> iomap->addr = (u64) map->m_pblk << blkbits;
> if (flags & IOMAP_DAX)
> iomap->addr += EXT4_SB(inode->i_sb)->s_dax_part_off;
> + } else if (map->m_flags & EXT4_MAP_DELAYED) {
> + iomap->type = IOMAP_DELALLOC;
> + iomap->addr = IOMAP_NULL_ADDR;
> } else {
> iomap->type = IOMAP_HOLE;
> iomap->addr = IOMAP_NULL_ADDR;
> @@ -3423,35 +3426,11 @@ const struct iomap_ops ext4_iomap_overwrite_ops = {
> .iomap_end = ext4_iomap_end,
> };
>
> -static bool ext4_iomap_is_delalloc(struct inode *inode,
> - struct ext4_map_blocks *map)
> -{
> - struct extent_status es;
> - ext4_lblk_t offset = 0, end = map->m_lblk + map->m_len - 1;
> -
> - ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
> - map->m_lblk, end, &es);
> -
> - if (!es.es_len || es.es_lblk > end)
> - return false;
> -
> - if (es.es_lblk > map->m_lblk) {
> - map->m_len = es.es_lblk - map->m_lblk;
> - return false;
> - }
> -
> - offset = map->m_lblk - es.es_lblk;
> - map->m_len = es.es_len - offset;
> -
> - return true;
> -}
> -
> static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
> loff_t length, unsigned int flags,
> struct iomap *iomap, struct iomap *srcmap)
> {
> int ret;
> - bool delalloc = false;
> struct ext4_map_blocks map;
> u8 blkbits = inode->i_blkbits;
>
> @@ -3492,13 +3471,8 @@ static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> if (ret < 0)
> return ret;
> - if (ret == 0)
> - delalloc = ext4_iomap_is_delalloc(inode, &map);
> -
> set_iomap:
> ext4_set_iomap(inode, iomap, &map, offset, length, flags);
> - if (delalloc && iomap->type == IOMAP_HOLE)
> - iomap->type = IOMAP_DELALLOC;
>
> return 0;
> }
> --
> 2.39.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-01-03 13:21:09

by Zhang Yi

[permalink] [raw]
Subject: Re: [RFC PATCH v2 05/25] ext4: make ext4_map_blocks() distinguish delalloc only extent

On 2024/1/3 19:31, Jan Kara wrote:
> On Tue 02-01-24 20:38:58, Zhang Yi wrote:
>> From: Zhang Yi <[email protected]>
>>
>> Add a new map flag EXT4_MAP_DELAYED to indicate the mapping range is a
>> delayed allocated only (not unwritten) one, and making
>> ext4_map_blocks() can distinguish it, no longer mixing it with holes.
>>
>> Signed-off-by: Zhang Yi <[email protected]>
>
> One small comment below.
>
>> ---
>> fs/ext4/ext4.h | 4 +++-
>> fs/ext4/extents.c | 5 +++--
>> fs/ext4/inode.c | 2 ++
>> 3 files changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index a5d784872303..55195909d32f 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -252,8 +252,10 @@ struct ext4_allocation_request {
>> #define EXT4_MAP_MAPPED BIT(BH_Mapped)
>> #define EXT4_MAP_UNWRITTEN BIT(BH_Unwritten)
>> #define EXT4_MAP_BOUNDARY BIT(BH_Boundary)
>> +#define EXT4_MAP_DELAYED BIT(BH_Delay)
>> #define EXT4_MAP_FLAGS (EXT4_MAP_NEW | EXT4_MAP_MAPPED |\
>> - EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY)
>> + EXT4_MAP_UNWRITTEN | EXT4_MAP_BOUNDARY |\
>> + EXT4_MAP_DELAYED)
>>
>> struct ext4_map_blocks {
>> ext4_fsblk_t m_pblk;
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index 0892d0568013..fc69f13cf510 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -4073,9 +4073,10 @@ static void ext4_ext_determine_hole(struct inode *inode,
>> } else if (in_range(map->m_lblk, es.es_lblk, es.es_len)) {
>> /*
>> * Straddle the beginning of the queried range, it's no
>> - * longer a hole, adjust the length to the delayed extent's
>> - * after map->m_lblk.
>> + * longer a hole, mark it is a delalloc and adjust the
>> + * length to the delayed extent's after map->m_lblk.
>> */
>> + map->m_flags |= EXT4_MAP_DELAYED;
>
> I wouldn't set delalloc bit here. If there's delalloc extent containing
> lblk now, it means the caller of ext4_map_blocks() is not holding i_rwsem
> (otherwise we would have found already in ext4_map_blocks()) and thus
> delalloc info is unreliable anyway. So I wouldn't bother. But it's worth a
> comment here like:
>
> /*
> * There's delalloc extent containing lblk. It must have
> * been added after ext4_map_blocks() checked the extent
> * status tree so we are not holding i_rwsem and delalloc
> * info is only stabilized by i_data_sem we are going to
> * release soon. Don't modify the extent status tree and
> * report extent as a hole.
> */
>

Yeah, the delalloc info is still unreliable. Thanks for the advice, I
will revise it in my next iteration along with your advice in patch 03.

Thanks,
Yi.

>
>> len = es.es_lblk + es.es_len - map->m_lblk;
>> goto out;
>> } else {
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 1b5e6409f958..c141bf6d8db2 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -515,6 +515,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>> map->m_len = retval;
>> } else if (ext4_es_is_delayed(&es) || ext4_es_is_hole(&es)) {
>> map->m_pblk = 0;
>> + map->m_flags |= ext4_es_is_delayed(&es) ?
>> + EXT4_MAP_DELAYED : 0;
>> retval = es.es_len - (map->m_lblk - es.es_lblk);
>> if (retval > map->m_len)
>> retval = map->m_len;
>> --
>> 2.39.2
>>