LinuxLists.cc - [RFC PATCH 00/16] ext4: more accurate metadata reservaion for delalloc mount option

2023-08-24 12:39:12

Subject: [RFC PATCH 00/16] ext4: more accurate metadata reservaion for delalloc mount option

From: Zhang Yi <[email protected]>

Hello,

The delayed allocation method allocates blocks during page writeback in
ext4_writepages(), which cannot handle block allocation failures due to
e.g. ENOSPC if acquires more extent blocks. In order to deal with this,
commit '79f0be8d2e6e ("ext4: Switch to non delalloc mode when we are low
on free blocks count.")' introduce ext4_nonda_switch() to convert to no
delalloc mode if the space if the free blocks is less than 150% of dirty
blocks or the watermark. In the meantime, '27dd43854227 ("ext4:
introduce reserved space")' reserve some of the file system space (2% or
4096 clusters, whichever is smaller). Both of those to solutions can
make sure that space is not exhausted when mapping delalloc blocks in
most cases, but cannot guarantee work in all cases, which could lead to
infinite loop or data loss (please see patch 14 for details).

This patch set wants to reserve metadata space more accurate for
delalloc mount option. The metadata blocks reservation is very tricky
and is also related to the continuity of physical blocks, an effective
way is to reserve as the worst case, which means that every data block
is discontinuous and one data block costs an extent entry. Reserve
metadata space as the worst case can make sure enough blocks reserved
during data writeback, the unused reservaion space can be released after
mapping data blocks. After doing this, add a worker to submit delayed
allocations to prevent excessive reservations. Finally, we could
completely drop the policy of switching back to non-delayed allocation.

The patch set is based on the latest ext4 dev branch.
Patch 1-2: Fix two reserved data blocks problems triggered when
bigalloc feature is enabled.
Patch 3-6: Move reserved data blocks updating from
ext4_{ext|ind}_map_blocks() to ext4_es_insert_extent(),
preparing for reserving metadata.
Patch 7-14: Reserve metadata blocks for delayed allocation as the worst
case, and update count after allocating or releasing.
Patch 15-16: In order to prevent too many reserved metadata blocks that
could running false positive out of space (doesn't take
that much after allocating data blocks), add a worker to
submit IO if the reservation is too big.

About tests:
1. This patch set has passed 'kvm-xfstests -g auto' many times.
2. The performance looks not significantly affected after doing the
following tests on my virtual machine with 4 CPU core and 32GB
memory, which based on Kunpeng-920 arm64 CPU and 1.5TB nvme ssd.

fio -directory=/test -direct=0 -iodepth=10 -fsync=$sync -rw=$rw \
-numjobs=${numjobs} -bs=${bs}k -ioengine=libaio -size=10G \
-ramp_time=10 -runtime=60 -norandommap=0 -group_reportin \
-name=tests

Disable bigalloc:
| Before | After
rw fsync jobs bs(kB) | iops bw(MiB/s) | iops bw(MiB/s)
------------------------------|------------------|-----------------
write 0 1 4 | 27500 107 | 27100 106
write 0 4 4 | 33900 132 | 35300 138
write 0 1 1024 | 134 135 | 149 150
write 0 4 1024 | 172 174 | 199 200
write 1 1 4 | 1530 6.1 | 1651 6.6
write 1 4 4 | 3139 12.3 | 3131 12.2
write 1 1 1024 | 184 185 | 195 196
write 1 4 1024 | 117 119 | 114 115
randwrite 0 1 4 | 17900 69.7 | 17600 68.9
randwrite 0 4 4 | 32700 128 | 34600 135
randwrite 0 1 1024 | 145 146 | 155 155
randwrite 0 4 1024 | 193 194 | 207 209
randwrite 1 1 4 | 1335 5.3 | 1444 5.7
randwrite 1 4 4 | 3364 13.1 | 3428 13.4
randwrite 1 1 1024 | 180 180 | 171 172
randwrite 1 4 1024 | 132 134 | 141 142

Enable bigalloc:
| Before | After
rw fsync jobs bs(kB) | iops bw(MiB/s) | iops bw(MiB/s)
------------------------------|------------------|-----------------
write 0 1 4 | 27500 107 | 30300 118
write 0 4 4 | 28800 112 | 34000 137
write 0 1 1024 | 141 142 | 162 162
write 0 4 1024 | 172 173 | 195 196
write 1 1 4 | 1410 5.6 | 1302 5.2
write 1 4 4 | 3052 11.9 | 3002 11.7
write 1 1 1024 | 153 153 | 163 164
write 1 4 1024 | 113 114 | 110 111
randwrite 0 1 4 | 17500 68.5 | 18400 72
randwrite 0 4 4 | 26600 104 | 24800 96
randwrite 0 1 1024 | 170 171 | 165 165
randwrite 0 4 1024 | 168 169 | 152 153
randwrite 1 1 4 | 1281 5.1 | 1335 5.3
randwrite 1 4 4 | 3115 12.2 | 3315 12
randwrite 1 1 1024 | 150 150 | 151 152
randwrite 1 4 1024 | 134 135 | 132 133

Tests on ramdisk:

Disable bigalloc
| Before | After
rw fsync jobs bs(kB) | iops bw(MiB/s) | iops bw(MiB/s)
------------------------------|------------------|-----------------
write 1 1 4 | 4699 18.4 | 4858 18
write 1 1 1024 | 245 246 | 247 248

Enable bigalloc
| Before | After
rw fsync jobs bs(kB) | iops bw(MiB/s) | iops bw(MiB/s)
------------------------------|------------------|-----------------
write 1 1 4 | 4634 18.1 | 5073 19.8
write 1 1 1024 | 246 247 | 268 269

Thanks,
Yi.

Zhang Yi (16):
ext4: correct the start block of counting reserved clusters
ext4: make sure allocate pending entry not fail
ext4: let __revise_pending() return the number of new inserts pendings
ext4: count removed reserved blocks for delalloc only es entry
ext4: pass real delayed status into ext4_es_insert_extent()
ext4: move delalloc data reserve spcae updating into
ext4_es_insert_extent()
ext4: count inode's total delalloc data blocks into ext4_es_tree
ext4: refactor delalloc space reservation
ext4: count reserved metadata blocks for delalloc per inode
ext4: reserve meta blocks in ext4_da_reserve_space()
ext4: factor out common part of
ext4_da_{release|update_reserve}_space()
ext4: update reserved meta blocks in
ext4_da_{release|update_reserve}_space()
ext4: calculate the worst extent blocks needed of a delalloc es entry
ext4: reserve extent blocks for delalloc
ext4: flush delalloc blocks if no free space
ext4: drop ext4_nonda_switch()

fs/ext4/balloc.c | 47 ++++-
fs/ext4/ext4.h | 14 +-
fs/ext4/extents.c | 65 +++----
fs/ext4/extents_status.c | 340 +++++++++++++++++++++---------------
fs/ext4/extents_status.h | 3 +-
fs/ext4/indirect.c | 7 -
fs/ext4/inode.c | 191 ++++++++++----------
fs/ext4/super.c | 22 ++-
include/trace/events/ext4.h | 70 ++++++--
9 files changed, 439 insertions(+), 320 deletions(-)

--
2.39.2

2023-08-24 13:44:17

by Zhang Yi

[permalink] [raw]

Subject: [RFC PATCH 07/16] ext4: count inode's total delalloc data blocks into ext4_es_tree

From: Zhang Yi <[email protected]>

Add a parameter in struct ext4_es_tree to count per-inode's total
delalloc data blocks number, it will be used to calculate reserved
meta blocks when creating a new delalloc extent entry, or mapping a
delalloc entry to a real one or releasing a delalloc entry.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/extents_status.c | 19 +++++++++++++++++++
fs/ext4/extents_status.h | 1 +
2 files changed, 20 insertions(+)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 34164c2827f2..b098c3316189 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -178,6 +178,7 @@ void ext4_es_init_tree(struct ext4_es_tree *tree)
{
tree->root = RB_ROOT;
tree->cache_es = NULL;
+ tree->da_es_len = 0;
}

#ifdef ES_DEBUG__
@@ -787,6 +788,20 @@ static inline void ext4_es_insert_extent_check(struct inode *inode,
}
#endif

+/*
+ * Update total delay allocated extent length.
+ */
+static inline void ext4_es_update_da_block(struct inode *inode, long es_len)
+{
+ struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
+
+ if (!es_len)
+ return;
+
+ tree->da_es_len += es_len;
+ es_debug("update da blocks %ld, to %u\n", es_len, tree->da_es_len);
+}
+
static int __es_insert_extent(struct inode *inode, struct extent_status *newes,
struct extent_status *prealloc)
{
@@ -915,6 +930,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
__es_free_extent(es1);
es1 = NULL;
}
+ ext4_es_update_da_block(inode, -rinfo.ndelonly_blk);

err2 = __es_insert_extent(inode, &newes, es2);
if (err2 == -ENOMEM && !ext4_es_must_keep(&newes))
@@ -1571,6 +1587,7 @@ void ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
__es_free_extent(es);
es = NULL;
}
+ ext4_es_update_da_block(inode, -rinfo.ndelonly_blk);
write_unlock(&EXT4_I(inode)->i_es_lock);
if (err)
goto retry;
@@ -2161,6 +2178,8 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
pr = NULL;
}
}
+
+ ext4_es_update_da_block(inode, newes.es_len);
error:
write_unlock(&EXT4_I(inode)->i_es_lock);
if (err1 || err2 || err3 < 0)
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index 7344667eb2cd..ee873b305210 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -66,6 +66,7 @@ struct extent_status {
struct ext4_es_tree {
struct rb_root root;
struct extent_status *cache_es; /* recently accessed extent */
+ ext4_lblk_t da_es_len; /* total delalloc len */
};

struct ext4_es_stats {
--
2.39.2

2023-08-24 13:54:41

by Zhang Yi

[permalink] [raw]

Subject: [RFC PATCH 16/16] ext4: drop ext4_nonda_switch()

From: Zhang Yi <[email protected]>

Now that we have reserve enough metadata blocks for delalloc, the
ext4_nonda_switch() could be dropped, it's safe to keep on delalloc mode
for buffer write if the dirty space is high and free space is low, we
can make sure always successfully allocate the metadata block while
mapping delalloc entries in ext4_writepages().

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/extents_status.c | 9 ++++-----
fs/ext4/inode.c | 39 ++-------------------------------------
2 files changed, 6 insertions(+), 42 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 8e0dec27f967..954c6e49182e 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -971,11 +971,10 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
* reduce the reserved cluster count and claim quota.
*
* Otherwise, we aren't allocating delayed allocated clusters
- * (from fallocate, filemap, DIO, or clusters allocated when
- * delalloc has been disabled by ext4_nonda_switch()), reduce the
- * reserved cluster count by the number of allocated clusters that
- * have previously been delayed allocated. Quota has been claimed
- * by ext4_mb_new_blocks(), so release the quota reservations made
+ * (from fallocate, filemap, DIO), reduce the reserved cluster
+ * count by the number of allocated clusters that have previously
+ * been delayed allocated. Quota has been claimed by
+ * ext4_mb_new_blocks(), so release the quota reservations made
* for any previously delayed allocated clusters.
*/
ext4_da_update_reserve_space(inode, rinfo.ndelonly_clu + pending,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d714bf2e4171..0a76c99ea8c6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2838,40 +2838,6 @@ static int ext4_dax_writepages(struct address_space *mapping,
return ret;
}

-static int ext4_nonda_switch(struct super_block *sb)
-{
- s64 free_clusters, dirty_clusters;
- struct ext4_sb_info *sbi = EXT4_SB(sb);
-
- /*
- * switch to non delalloc mode if we are running low
- * on free block. The free block accounting via percpu
- * counters can get slightly wrong with percpu_counter_batch getting
- * accumulated on each CPU without updating global counters
- * Delalloc need an accurate free block accounting. So switch
- * to non delalloc when we are near to error range.
- */
- free_clusters =
- percpu_counter_read_positive(&sbi->s_freeclusters_counter);
- dirty_clusters =
- percpu_counter_read_positive(&sbi->s_dirtyclusters_counter);
- /*
- * Start pushing delalloc when 1/2 of free blocks are dirty.
- */
- if (dirty_clusters && (free_clusters < 2 * dirty_clusters))
- try_to_writeback_inodes_sb(sb, WB_REASON_FS_FREE_SPACE);
-
- if (2 * free_clusters < 3 * dirty_clusters ||
- free_clusters < (dirty_clusters + EXT4_FREECLUSTERS_WATERMARK)) {
- /*
- * free block count is less than 150% of dirty blocks
- * or free blocks is less than watermark
- */
- return 1;
- }
- return 0;
-}
-
static int ext4_da_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len,
struct page **pagep, void **fsdata)
@@ -2886,7 +2852,7 @@ static int ext4_da_write_begin(struct file *file, struct address_space *mapping,

index = pos >> PAGE_SHIFT;

- if (ext4_nonda_switch(inode->i_sb) || ext4_verity_in_progress(inode)) {
+ if (ext4_verity_in_progress(inode)) {
*fsdata = (void *)FALL_BACK_TO_NONDELALLOC;
return ext4_write_begin(file, mapping, pos,
len, pagep, fsdata);
@@ -6117,8 +6083,7 @@ vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf)
goto retry_alloc;

/* Delalloc case is easy... */
- if (test_opt(inode->i_sb, DELALLOC) &&
- !ext4_nonda_switch(inode->i_sb)) {
+ if (test_opt(inode->i_sb, DELALLOC)) {
do {
err = block_page_mkwrite(vma, vmf,
ext4_da_get_block_prep);
--
2.39.2

2023-08-24 13:56:05

by Zhang Yi

[permalink] [raw]

Subject: [RFC PATCH 01/16] ext4: correct the start block of counting reserved clusters

From: Zhang Yi <[email protected]>

When big allocate feature is enabled, we need to count and update
reserved clusters before removing a delayed only extent_status entry.
{init|count|get}_rsvd() have already done this, but the start block
number of this counting isn's correct in the following case.

lblk end
| |
v v
-------------------------
| | orig_es
-------------------------
^ ^
len1 is 0 | len2 |

If the start block of the orig_es entry founded is bigger than lblk, we
passed lblk as start block to count_rsvd(), but the length is correct,
finally, the range to be counted is offset. This patch fix this by
passing the start blocks to 'orig_es->lblk + len1'.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/extents_status.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 6f7de14c0fa8..5e625ea4545d 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -1405,8 +1405,8 @@ static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
}
}
if (count_reserved)
- count_rsvd(inode, lblk, orig_es.es_len - len1 - len2,
- &orig_es, &rc);
+ count_rsvd(inode, orig_es.es_lblk + len1,
+ orig_es.es_len - len1 - len2, &orig_es, &rc);
goto out_get_reserved;
}

--
2.39.2

2023-08-24 14:35:04

by Zhang Yi

[permalink] [raw]

Subject: [RFC PATCH 06/16] ext4: move delalloc data reserve spcae updating into ext4_es_insert_extent()

From: Zhang Yi <[email protected]>

We update data reserved space for delalloc after allocating new blocks
in ext4_{ind|ext}_map_blocks(). If bigalloc feature is enabled, we also
need to query the extents_status tree to calculate the exact reserved
clusters. If we move it to ext4_es_insert_extent(), just after dropping
delalloc extents_status entry, it could become more simple because
__es_remove_extent() has done most of the work and we could remove
entire ext4_es_delayed_clu().

One important thing needs to take care of is that if bigalloc is
enabled, we should update data reserved count when first converting some
of the delayed only es entries of a caluster which has many other
delayed only entries left over.

| one cluster |
--------------------------------------------------------
| da es 0 | .. | da es 1 | .. | da es 2 | .. | da es 3 |
--------------------------------------------------------
^ ^
| | <- first allocating this delayed extent

The later allocations in that cluster will not count again. We could do
this by counting the new inserts pending clusters.

Another important thing is the quota claiming and i_blocks count, if
the delayed allocating has been raced by another no-delay allocating
(from fallocate, filemap, DIO...), we cannot claim quota as usual
because the racer have already done it. We could distinguish this case
easily through checking EXTENT_STATUS_DELAYED and the reserved only
blocks counted by __es_remove_extent(). If the EXTENT_STATUS_DELAYED is
set, it always means that the allocating is not from the delayed
allocating. But on the contrary, we can only get the opposite
conclusion if bigalloc is not enabled. If bigalloc is enabled, it could
be raced by another fallocate which is writing to other non-delayed
areas of the same cluster. In this case, the EXTENT_STATUS_DELAYED is
not set but we cannot claim quota again.

| one cluster |
-------------------------------------------
| | delayed es |
-------------------------------------------
^ ^
| fallocate |

So we also need to check the counted reserved only blocks, if it is zero
it means that the allocating is not from the delayed allocating, and we
should release reserved qutoa instead of claim it.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/extents.c | 37 -------------
fs/ext4/extents_status.c | 115 +++++++++------------------------------
fs/ext4/extents_status.h | 2 -
fs/ext4/indirect.c | 7 ---
fs/ext4/inode.c | 5 +-
5 files changed, 30 insertions(+), 136 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e4115d338f10..592383effe80 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4323,43 +4323,6 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
goto out;
}

- /*
- * Reduce the reserved cluster count to reflect successful deferred
- * allocation of delayed allocated clusters or direct allocation of
- * clusters discovered to be delayed allocated. Once allocated, a
- * cluster is not included in the reserved count.
- */
- if (test_opt(inode->i_sb, DELALLOC) && allocated_clusters) {
- if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) {
- /*
- * When allocating delayed allocated clusters, simply
- * reduce the reserved cluster count and claim quota
- */
- ext4_da_update_reserve_space(inode, allocated_clusters,
- 1);
- } else {
- ext4_lblk_t lblk, len;
- unsigned int n;
-
- /*
- * When allocating non-delayed allocated clusters
- * (from fallocate, filemap, DIO, or clusters
- * allocated when delalloc has been disabled by
- * ext4_nonda_switch), reduce the reserved cluster
- * count by the number of allocated clusters that
- * have previously been delayed allocated. Quota
- * has been claimed by ext4_mb_new_blocks() above,
- * so release the quota reservations made for any
- * previously delayed allocated clusters.
- */
- lblk = EXT4_LBLK_CMASK(sbi, map->m_lblk);
- len = allocated_clusters << sbi->s_cluster_bits;
- n = ext4_es_delayed_clu(inode, lblk, len);
- if (n > 0)
- ext4_da_update_reserve_space(inode, (int) n, 0);
- }
- }
-
/*
* Cache the extent and update transaction to commit on fdatasync only
* when it is _not_ an unwritten extent.
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 62191c772b82..34164c2827f2 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -856,11 +856,14 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
struct extent_status newes;
ext4_lblk_t end = lblk + len - 1;
int err1 = 0, err2 = 0, err3 = 0;
+ struct rsvd_info rinfo;
+ int pending = 0;
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
struct extent_status *es1 = NULL;
struct extent_status *es2 = NULL;
struct pending_reservation *pr = NULL;
bool revise_pending = false;
+ bool delayed = false;

if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
return;
@@ -878,6 +881,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
* data lose, and the extent has been written, it's safe to remove
* the delayed flag even it's still delayed.
*/
+ delayed = status & EXTENT_STATUS_DELAYED;
if ((status & EXTENT_STATUS_DELAYED) &&
(status & EXTENT_STATUS_WRITTEN))
status &= ~EXTENT_STATUS_DELAYED;
@@ -902,7 +906,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
pr = __alloc_pending(true);
write_lock(&EXT4_I(inode)->i_es_lock);

- err1 = __es_remove_extent(inode, lblk, end, NULL, es1);
+ err1 = __es_remove_extent(inode, lblk, end, &rinfo, es1);
if (err1 != 0)
goto error;
/* Free preallocated extent if it didn't get used. */
@@ -932,9 +936,30 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
__free_pending(pr);
pr = NULL;
}
+ /*
+ * In the first partial allocating some delayed extents of
+ * one cluster, we also need to count the data cluster when
+ * allocating delay only extent entries.
+ */
+ pending = err3;
}
error:
write_unlock(&EXT4_I(inode)->i_es_lock);
+ /*
+ * If EXTENT_STATUS_DELAYED is not set and delayed only blocks is
+ * not zero, we are allocating delayed allocated clusters, simply
+ * reduce the reserved cluster count and claim quota.
+ *
+ * Otherwise, we aren't allocating delayed allocated clusters
+ * (from fallocate, filemap, DIO, or clusters allocated when
+ * delalloc has been disabled by ext4_nonda_switch()), reduce the
+ * reserved cluster count by the number of allocated clusters that
+ * have previously been delayed allocated. Quota has been claimed
+ * by ext4_mb_new_blocks(), so release the quota reservations made
+ * for any previously delayed allocated clusters.
+ */
+ ext4_da_update_reserve_space(inode, rinfo.ndelonly_clu + pending,
+ !delayed && rinfo.ndelonly_blk);
if (err1 || err2 || err3 < 0)
goto retry;

@@ -2146,94 +2171,6 @@ void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
return;
}

-/*
- * __es_delayed_clu - count number of clusters containing blocks that
- * are delayed only
- *
- * @inode - file containing block range
- * @start - logical block defining start of range
- * @end - logical block defining end of range
- *
- * Returns the number of clusters containing only delayed (not delayed
- * and unwritten) blocks in the range specified by @start and @end. Any
- * cluster or part of a cluster within the range and containing a delayed
- * and not unwritten block within the range is counted as a whole cluster.
- */
-static unsigned int __es_delayed_clu(struct inode *inode, ext4_lblk_t start,
- ext4_lblk_t end)
-{
- struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
- struct extent_status *es;
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- struct rb_node *node;
- ext4_lblk_t first_lclu, last_lclu;
- unsigned long long last_counted_lclu;
- unsigned int n = 0;
-
- /* guaranteed to be unequal to any ext4_lblk_t value */
- last_counted_lclu = ~0ULL;
-
- es = __es_tree_search(&tree->root, start);
-
- while (es && (es->es_lblk <= end)) {
- if (ext4_es_is_delonly(es)) {
- if (es->es_lblk <= start)
- first_lclu = EXT4_B2C(sbi, start);
- else
- first_lclu = EXT4_B2C(sbi, es->es_lblk);
-
- if (ext4_es_end(es) >= end)
- last_lclu = EXT4_B2C(sbi, end);
- else
- last_lclu = EXT4_B2C(sbi, ext4_es_end(es));
-
- if (first_lclu == last_counted_lclu)
- n += last_lclu - first_lclu;
- else
- n += last_lclu - first_lclu + 1;
- last_counted_lclu = last_lclu;
- }
- node = rb_next(&es->rb_node);
- if (!node)
- break;
- es = rb_entry(node, struct extent_status, rb_node);
- }
-
- return n;
-}
-
-/*
- * ext4_es_delayed_clu - count number of clusters containing blocks that
- * are both delayed and unwritten
- *
- * @inode - file containing block range
- * @lblk - logical block defining start of range
- * @len - number of blocks in range
- *
- * Locking for external use of __es_delayed_clu().
- */
-unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk,
- ext4_lblk_t len)
-{
- struct ext4_inode_info *ei = EXT4_I(inode);
- ext4_lblk_t end;
- unsigned int n;
-
- if (len == 0)
- return 0;
-
- end = lblk + len - 1;
- WARN_ON(end < lblk);
-
- read_lock(&ei->i_es_lock);
-
- n = __es_delayed_clu(inode, lblk, end);
-
- read_unlock(&ei->i_es_lock);
-
- return n;
-}
-
/*
* __revise_pending - makes, cancels, or leaves unchanged pending cluster
* reservations for a specified block range depending
diff --git a/fs/ext4/extents_status.h b/fs/ext4/extents_status.h
index d9847a4a25db..7344667eb2cd 100644
--- a/fs/ext4/extents_status.h
+++ b/fs/ext4/extents_status.h
@@ -251,8 +251,6 @@ extern void ext4_remove_pending(struct inode *inode, ext4_lblk_t lblk);
extern bool ext4_is_pending(struct inode *inode, ext4_lblk_t lblk);
extern void ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
bool allocated);
-extern unsigned int ext4_es_delayed_clu(struct inode *inode, ext4_lblk_t lblk,
- ext4_lblk_t len);
extern void ext4_clear_inode_es(struct inode *inode);

#endif /* _EXT4_EXTENTS_STATUS_H */
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index a9f3716119d3..448401e02c55 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -652,13 +652,6 @@ int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
ext4_update_inode_fsync_trans(handle, inode, 1);
count = ar.len;

- /*
- * Update reserved blocks/metadata blocks after successful block
- * allocation which had been deferred till now.
- */
- if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
- ext4_da_update_reserve_space(inode, count, 1);
-
got_it:
map->m_flags |= EXT4_MAP_MAPPED;
map->m_pblk = le32_to_cpu(chain[depth-1].key);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 82115d6656d3..546a3b09fd0a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -330,11 +330,14 @@ qsize_t *ext4_get_reserved_space(struct inode *inode)
* ext4_discard_preallocations() from here.
*/
void ext4_da_update_reserve_space(struct inode *inode,
- int used, int quota_claim)
+ int used, int quota_claim)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
struct ext4_inode_info *ei = EXT4_I(inode);

+ if (!used)
+ return;
+
spin_lock(&ei->i_block_reservation_lock);
trace_ext4_da_update_reserve_space(inode, used, quota_claim);
if (unlikely(used > ei->i_reserved_data_blocks)) {
--
2.39.2

2023-08-24 14:48:03

by Zhang Yi

[permalink] [raw]

Subject: [RFC PATCH 11/16] ext4: factor out common part of ext4_da_{release|update_reserve}_space()

From: Zhang Yi <[email protected]>

The reserve blocks updating part in ext4_da_release_space() and
ext4_da_update_reserve_space() are almost the same, so factor them out
to a common helper.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/inode.c | 60 +++++++++++++++++++++----------------------------
1 file changed, 25 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a189009d20fa..13036cecbcc0 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -325,6 +325,27 @@ qsize_t *ext4_get_reserved_space(struct inode *inode)
}
#endif

+static void __ext4_da_update_reserve_space(const char *where,
+ struct inode *inode,
+ int data_len)
+{
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ struct ext4_inode_info *ei = EXT4_I(inode);
+
+ if (unlikely(data_len > ei->i_reserved_data_blocks)) {
+ ext4_warning(inode->i_sb, "%s: ino %lu, clear %d "
+ "with only %d reserved data blocks",
+ where, inode->i_ino, data_len,
+ ei->i_reserved_data_blocks);
+ WARN_ON(1);
+ data_len = ei->i_reserved_data_blocks;
+ }
+
+ /* Update per-inode reservations */
+ ei->i_reserved_data_blocks -= data_len;
+ percpu_counter_sub(&sbi->s_dirtyclusters_counter, data_len);
+}
+
/*
* Called with i_data_sem down, which is important since we can call
* ext4_discard_preallocations() from here.
@@ -340,19 +361,7 @@ void ext4_da_update_reserve_space(struct inode *inode,

spin_lock(&ei->i_block_reservation_lock);
trace_ext4_da_update_reserve_space(inode, used, quota_claim);
- if (unlikely(used > ei->i_reserved_data_blocks)) {
- ext4_warning(inode->i_sb, "%s: ino %lu, used %d "
- "with only %d reserved data blocks",
- __func__, inode->i_ino, used,
- ei->i_reserved_data_blocks);
- WARN_ON(1);
- used = ei->i_reserved_data_blocks;
- }
-
- /* Update per-inode reservations */
- ei->i_reserved_data_blocks -= used;
- percpu_counter_sub(&sbi->s_dirtyclusters_counter, used);
-
+ __ext4_da_update_reserve_space(__func__, inode, used);
spin_unlock(&ei->i_block_reservation_lock);

/* Update quota subsystem for data blocks */
@@ -1483,29 +1492,10 @@ void ext4_da_release_space(struct inode *inode, int to_free)
if (!to_free)
return; /* Nothing to release, exit */

- spin_lock(&EXT4_I(inode)->i_block_reservation_lock);
-
+ spin_lock(&ei->i_block_reservation_lock);
trace_ext4_da_release_space(inode, to_free);
- if (unlikely(to_free > ei->i_reserved_data_blocks)) {
- /*
- * if there aren't enough reserved blocks, then the
- * counter is messed up somewhere. Since this
- * function is called from invalidate page, it's
- * harmless to return without any action.
- */
- ext4_warning(inode->i_sb, "ext4_da_release_space: "
- "ino %lu, to_free %d with only %d reserved "
- "data blocks", inode->i_ino, to_free,
- ei->i_reserved_data_blocks);
- WARN_ON(1);
- to_free = ei->i_reserved_data_blocks;
- }
- ei->i_reserved_data_blocks -= to_free;
-
- /* update fs dirty data blocks counter */
- percpu_counter_sub(&sbi->s_dirtyclusters_counter, to_free);
-
- spin_unlock(&EXT4_I(inode)->i_block_reservation_lock);
+ __ext4_da_update_reserve_space(__func__, inode, to_free);
+ spin_unlock(&ei->i_block_reservation_lock);

dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free));
}
--
2.39.2

2023-08-24 14:59:39

by Zhang Yi

[permalink] [raw]

Subject: [RFC PATCH 12/16] ext4: update reserved meta blocks in ext4_da_{release|update_reserve}_space()

From: Zhang Yi <[email protected]>

The same to ext4_da_reserve_space(), we also need to update reserved
metadata blocks when we release and convert a delalloc space range in
ext4_da_release_space() and ext4_da_update_reserve_space(). So also
prepare to reserve metadata blocks in these two functions, the
reservation logic are the same to data blocks. This patch is just a
preparation, the reserved ext_len is always zero.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/ext4.h | 4 ++--
fs/ext4/inode.c | 47 +++++++++++++++++++++----------------
include/trace/events/ext4.h | 28 ++++++++++++++--------
3 files changed, 47 insertions(+), 32 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ee2dbbde176e..3e0a39653469 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2998,9 +2998,9 @@ extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
extern vm_fault_t ext4_page_mkwrite(struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern int ext4_get_projid(struct inode *inode, kprojid_t *projid);
-extern void ext4_da_release_space(struct inode *inode, int to_free);
+extern void ext4_da_release_space(struct inode *inode, unsigned int data_len);
extern void ext4_da_update_reserve_space(struct inode *inode,
- int used, int quota_claim);
+ unsigned int data_len, int quota_claim);
extern int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk,
ext4_fsblk_t pblk, ext4_lblk_t len);

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 13036cecbcc0..38c47ce1333b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -327,53 +327,59 @@ qsize_t *ext4_get_reserved_space(struct inode *inode)

static void __ext4_da_update_reserve_space(const char *where,
struct inode *inode,
- int data_len)
+ unsigned int data_len, int ext_len)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
struct ext4_inode_info *ei = EXT4_I(inode);

- if (unlikely(data_len > ei->i_reserved_data_blocks)) {
- ext4_warning(inode->i_sb, "%s: ino %lu, clear %d "
- "with only %d reserved data blocks",
- where, inode->i_ino, data_len,
- ei->i_reserved_data_blocks);
+ if (unlikely(data_len > ei->i_reserved_data_blocks ||
+ ext_len > (long)ei->i_reserved_ext_blocks)) {
+ ext4_warning(inode->i_sb, "%s: ino %lu, clear %d,%d "
+ "with only %d,%d reserved data blocks",
+ where, inode->i_ino, data_len, ext_len,
+ ei->i_reserved_data_blocks,
+ ei->i_reserved_ext_blocks);
WARN_ON(1);
- data_len = ei->i_reserved_data_blocks;
+ data_len = min(data_len, ei->i_reserved_data_blocks);
+ ext_len = min_t(unsigned int, ext_len, ei->i_reserved_ext_blocks);
}

/* Update per-inode reservations */
ei->i_reserved_data_blocks -= data_len;
- percpu_counter_sub(&sbi->s_dirtyclusters_counter, data_len);
+ ei->i_reserved_ext_blocks -= ext_len;
+ percpu_counter_sub(&sbi->s_dirtyclusters_counter, (s64)data_len + ext_len);
}

/*
* Called with i_data_sem down, which is important since we can call
* ext4_discard_preallocations() from here.
*/
-void ext4_da_update_reserve_space(struct inode *inode,
- int used, int quota_claim)
+void ext4_da_update_reserve_space(struct inode *inode, unsigned int data_len,
+ int quota_claim)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
struct ext4_inode_info *ei = EXT4_I(inode);
+ int ext_len = 0;

- if (!used)
+ if (!data_len)
return;

spin_lock(&ei->i_block_reservation_lock);
- trace_ext4_da_update_reserve_space(inode, used, quota_claim);
- __ext4_da_update_reserve_space(__func__, inode, used);
+ trace_ext4_da_update_reserve_space(inode, data_len, ext_len,
+ quota_claim);
+ __ext4_da_update_reserve_space(__func__, inode, data_len, ext_len);
spin_unlock(&ei->i_block_reservation_lock);

/* Update quota subsystem for data blocks */
if (quota_claim)
- dquot_claim_block(inode, EXT4_C2B(sbi, used));
+ dquot_claim_block(inode, EXT4_C2B(sbi, data_len));
else {
/*
* We did fallocate with an offset that is already delayed
* allocated. So on delayed allocated writeback we should
* not re-claim the quota for fallocated blocks.
*/
- dquot_release_reservation_block(inode, EXT4_C2B(sbi, used));
+ dquot_release_reservation_block(inode, EXT4_C2B(sbi, data_len));
}

/*
@@ -1484,20 +1490,21 @@ static int ext4_da_reserve_space(struct inode *inode, unsigned int rsv_dlen,
return 0; /* success */
}

-void ext4_da_release_space(struct inode *inode, int to_free)
+void ext4_da_release_space(struct inode *inode, unsigned int data_len)
{
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
struct ext4_inode_info *ei = EXT4_I(inode);
+ int ext_len = 0;

- if (!to_free)
+ if (!data_len)
return; /* Nothing to release, exit */

spin_lock(&ei->i_block_reservation_lock);
- trace_ext4_da_release_space(inode, to_free);
- __ext4_da_update_reserve_space(__func__, inode, to_free);
+ trace_ext4_da_release_space(inode, data_len, ext_len);
+ __ext4_da_update_reserve_space(__func__, inode, data_len, ext_len);
spin_unlock(&ei->i_block_reservation_lock);

- dquot_release_reservation_block(inode, EXT4_C2B(sbi, to_free));
+ dquot_release_reservation_block(inode, EXT4_C2B(sbi, data_len));
}

/*
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 7a9839f2d681..e1e9d7ead20f 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1214,15 +1214,19 @@ TRACE_EVENT(ext4_forget,
);

TRACE_EVENT(ext4_da_update_reserve_space,
- TP_PROTO(struct inode *inode, int used_blocks, int quota_claim),
+ TP_PROTO(struct inode *inode,
+ int data_blocks,
+ int meta_blocks,
+ int quota_claim),

- TP_ARGS(inode, used_blocks, quota_claim),
+ TP_ARGS(inode, data_blocks, meta_blocks, quota_claim),

TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
__field( __u64, i_blocks )
- __field( int, used_blocks )
+ __field( int, data_blocks )
+ __field( int, meta_blocks )
__field( int, reserved_data_blocks )
__field( int, reserved_ext_blocks )
__field( int, quota_claim )
@@ -1233,19 +1237,20 @@ TRACE_EVENT(ext4_da_update_reserve_space,
__entry->dev = inode->i_sb->s_dev;
__entry->ino = inode->i_ino;
__entry->i_blocks = inode->i_blocks;
- __entry->used_blocks = used_blocks;
+ __entry->data_blocks = data_blocks;
+ __entry->meta_blocks = meta_blocks;
__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
__entry->reserved_ext_blocks = EXT4_I(inode)->i_reserved_ext_blocks;
__entry->quota_claim = quota_claim;
__entry->mode = inode->i_mode;
),

- TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu used_blocks %d "
+ TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu data_blocks %d meta_blocks %d "
"reserved_data_blocks %d reserved_ext_blocks %d quota_claim %d",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
__entry->mode, __entry->i_blocks,
- __entry->used_blocks,
+ __entry->data_blocks, __entry->meta_blocks,
__entry->reserved_data_blocks, __entry->reserved_ext_blocks,
__entry->quota_claim)
);
@@ -1289,15 +1294,16 @@ TRACE_EVENT(ext4_da_reserve_space,
);

TRACE_EVENT(ext4_da_release_space,
- TP_PROTO(struct inode *inode, int freed_blocks),
+ TP_PROTO(struct inode *inode, int freed_blocks, int meta_blocks),

- TP_ARGS(inode, freed_blocks),
+ TP_ARGS(inode, freed_blocks, meta_blocks),

TP_STRUCT__entry(
__field( dev_t, dev )
__field( ino_t, ino )
__field( __u64, i_blocks )
__field( int, freed_blocks )
+ __field( int, meta_blocks )
__field( int, reserved_data_blocks )
__field( int, reserved_ext_blocks )
__field( __u16, mode )
@@ -1308,17 +1314,19 @@ TRACE_EVENT(ext4_da_release_space,
__entry->ino = inode->i_ino;
__entry->i_blocks = inode->i_blocks;
__entry->freed_blocks = freed_blocks;
+ __entry->meta_blocks = meta_blocks;
__entry->reserved_data_blocks = EXT4_I(inode)->i_reserved_data_blocks;
__entry->reserved_ext_blocks = EXT4_I(inode)->i_reserved_ext_blocks;
__entry->mode = inode->i_mode;
),

- TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu freed_blocks %d "
+ TP_printk("dev %d,%d ino %lu mode 0%o i_blocks %llu "
+ "freed_blocks %d meta_blocks %d"
"reserved_data_blocks %d reserved_ext_blocks %d",
MAJOR(__entry->dev), MINOR(__entry->dev),
(unsigned long) __entry->ino,
__entry->mode, __entry->i_blocks,
- __entry->freed_blocks,
+ __entry->freed_blocks, __entry->meta_blocks,
__entry->reserved_data_blocks,
__entry->reserved_ext_blocks)
);
--
2.39.2

2023-08-24 17:18:28

by Zhang Yi

[permalink] [raw]

Subject: [RFC PATCH 05/16] ext4: pass real delayed status into ext4_es_insert_extent()

From: Zhang Yi <[email protected]>

Commit 'd2dc317d564a ("ext4: fix data corruption caused by unwritten and
delayed extents")' fix a data corruption issue by stop passing delayed
status into ext4_es_insert_extent() if the mapping range has been
written. This patch change it to still pass the real delayed status and
deal with the 'delayed && written' case in ext4_es_insert_extent(). If
the status have delayed bit is set, it means that the path of delayed
allocation is still running, and this insert process is not allocating
delayed allocated blocks.

Signed-off-by: Zhang Yi <[email protected]>
---
fs/ext4/extents_status.c | 13 +++++++------
fs/ext4/inode.c | 2 --
2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index 3a004ed04570..62191c772b82 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -873,13 +873,14 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,

BUG_ON(end < lblk);

+ /*
+ * Insert extent as delayed and written which can potentially cause
+ * data lose, and the extent has been written, it's safe to remove
+ * the delayed flag even it's still delayed.
+ */
if ((status & EXTENT_STATUS_DELAYED) &&
- (status & EXTENT_STATUS_WRITTEN)) {
- ext4_warning(inode->i_sb, "Inserting extent [%u/%u] as "
- " delayed and written which can potentially "
- " cause data loss.", lblk, len);
- WARN_ON(1);
- }
+ (status & EXTENT_STATUS_WRITTEN))
+ status &= ~EXTENT_STATUS_DELAYED;

newes.es_lblk = lblk;
newes.es_len = len;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6c490f05e2ba..82115d6656d3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -563,7 +563,6 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
status = map->m_flags & EXT4_MAP_UNWRITTEN ?
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
if (!(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) &&
- !(status & EXTENT_STATUS_WRITTEN) &&
ext4_es_scan_range(inode, &ext4_es_is_delayed, map->m_lblk,
map->m_lblk + map->m_len - 1))
status |= EXTENT_STATUS_DELAYED;
@@ -673,7 +672,6 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
status = map->m_flags & EXT4_MAP_UNWRITTEN ?
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
if (!(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE) &&
- !(status & EXTENT_STATUS_WRITTEN) &&
ext4_es_scan_range(inode, &ext4_es_is_delayed, map->m_lblk,
map->m_lblk + map->m_len - 1))
status |= EXTENT_STATUS_DELAYED;
--
2.39.2

2023-09-01 18:01:47

by Zhang Yi

[permalink] [raw]

Subject: Re: [RFC PATCH 00/16] ext4: more accurate metadata reservaion for delalloc mount option

Hello! Thanks for your reply.

On 2023/8/30 23:30, Jan Kara wrote:
> Hello!
>
> On Thu 24-08-23 17:26:03, Zhang Yi wrote:
>> The delayed allocation method allocates blocks during page writeback in
>> ext4_writepages(), which cannot handle block allocation failures due to
>> e.g. ENOSPC if acquires more extent blocks. In order to deal with this,
>> commit '79f0be8d2e6e ("ext4: Switch to non delalloc mode when we are low
>> on free blocks count.")' introduce ext4_nonda_switch() to convert to no
>> delalloc mode if the space if the free blocks is less than 150% of dirty
>> blocks or the watermark.
>
> Well, that functionality is there mainly so that we can really allocate all
> blocks available in the filesystem. But you are right that since we are not
> reserving metadata blocks explicitely anymore, it is questionable whether
> we really still need this.
>
>> In the meantime, '27dd43854227 ("ext4:
>> introduce reserved space")' reserve some of the file system space (2% or
>> 4096 clusters, whichever is smaller). Both of those to solutions can
>> make sure that space is not exhausted when mapping delalloc blocks in
>> most cases, but cannot guarantee work in all cases, which could lead to
>> infinite loop or data loss (please see patch 14 for details).
>
> OK, I agree that in principle there could be problems due to percpu
> counters inaccuracy etc. but were you able to reproduce the problem under
> some at least somewhat realistic conditions? We were discussing making
> free space percpu counters switch to exact counting in case we are running
> tight on space to avoid these problems but it never proved to be a problem
> in practice so we never bothered to actually implement it.
>

Yes, we catch this problem in our products firstly and we reproduced it
when doing stress test on low free space disk, but the frequency is
very low. After analysis we found the root cause, and Zhihao helped to
write a 100% reproducer below.

1. Apply the 'infinite_loop.diff' in attachment, add info and delay
into ext4 code.
2. Run 'enospc.sh' on a virtual machine with 4 CPU (important, because
the cpu number will affect EXT4_FREECLUSTERS_WATERMARK and also
affect the reproduce).

After several minutes, the writeback process will loop infinitely, and
other processes which rely on it will hung.

[ 304.815575] INFO: task sync:7292 blocked for more than 153 seconds.
[ 304.818130] Not tainted 6.5.0-dirty #578
[ 304.819926] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 304.822747] task:sync state:D stack:0 pid:7292 ppid:1 flags:0x00004000
[ 304.825677] Call Trace:
[ 304.826538] <TASK>
[ 304.827307] __schedule+0x577/0x12b0
[ 304.828300] ? sync_fs_one_sb+0x50/0x50
[ 304.829433] schedule+0x9d/0x1e0
[ 304.830451] wb_wait_for_completion+0x82/0xd0
[ 304.831811] ? cpuacct_css_alloc+0x100/0x100
[ 304.833090] sync_inodes_sb+0xf1/0x440
[ 304.834207] ? sync_fs_one_sb+0x50/0x50
[ 304.835304] sync_inodes_one_sb+0x21/0x30
[ 304.836528] iterate_supers+0xd2/0x180
[ 304.837614] ksys_sync+0x50/0xf0
[ 304.838356] __do_sys_sync+0x12/0x20
[ 304.839207] do_syscall_64+0x68/0xf0
[ 304.839964] entry_SYSCALL_64_after_hwframe+0x63/0xcd

On the contrary, after doing a little tweaking the delay injection
procedure, we could reproduce the data loss problem easily.

1. Apply the 'data_loss.diff' in the attachment.
2. Run 'enospc.sh' like the previous one, then we got below error message.

[ 52.226320] EXT4-fs (sda): Delayed block allocation failed for inode 571 at logical offset 8 with max blocks 1 with error 28
[ 52.229126] EXT4-fs (sda): This should not happen!! Data will be lost

>> This patch set wants to reserve metadata space more accurate for
>> delalloc mount option. The metadata blocks reservation is very tricky
>> and is also related to the continuity of physical blocks, an effective
>> way is to reserve as the worst case, which means that every data block
>> is discontinuous and one data block costs an extent entry. Reserve
>> metadata space as the worst case can make sure enough blocks reserved
>> during data writeback, the unused reservaion space can be released after
>> mapping data blocks.
>
> Well, as you say, there is a problem with the worst case estimates - either
> you *heavily* overestimate the number of needed metadata blocks or the code
> to estimate the number of needed metadata blocks is really complex. We used
> to have estimates of needed metadata and we ditched that code (in favor of
> reserved clusters) exactly because it was complex and suffered from
> cornercases that were hard to fix. I haven't quite digested the other
> patches in this series to judge which case is it but it seems to lean on
> the "complex" side :).
>
> So I'm somewhat skeptical this complexity is really needed but I can be
> convinced :).

I understand your concern. At first I tried to solve this problem with
other simple solutions, but failed. I suppose reserve blocks for the
worst case is the only way to cover all cases, and I noticed that xfs
also uses this reservation method, so I learned the implementation from
it, but it's not exactly the same.

Although it's a worst-case reservation, it is not that complicated.
Firstly, the estimate formula is simple, just add the 'extent & node'
blocks calculated by the **total** delay allocated data blocks and the
remaining btree heights (no need to concern whether the logical
positions of each extents are continuous or not, the btree heights can
be merged between each discontinuous extent entries of one inode, this
can reduce overestimate to some extent). Secondary, the method of
reserving metadata blocks is similar to that of reserving data blocks,
just the estimate formula is different. Fortunately, there already have
data reservation helpers like ext4_da_update_reserve_space() and
ext4_da_release_reserve_space(), it works only takes some minor
changes. BTW, I don't really know the ditched estimation and
cornercases you mentioned, how it's like?

Finally, maybe this reservation could bring other benefits in the long
run. For example, after we've done this, maybe we could also reserve
metadata for DIO and buffer IO with dioread_nolock in the future, then
we could drop EXT4_GET_BLOCKS_PRE_IO safely, which looks like a
compromise, maybe we could get some improve performance if do this (I
haven't thought deeply, just a whim :) ). But it's a different thing.

>
>> After doing this, add a worker to submit delayed
>> allocations to prevent excessive reservations. Finally, we could
>> completely drop the policy of switching back to non-delayed allocation.
>
> BTW the worker there in patch 15 seems really pointless. If you do:
> queue_work(), flush_work() then you could just directly do the work inline
> and get as a bonus more efficiency and proper lockdep tracking of
> dependencies. But that's just a minor issue I have noticed.
>

Yes, I added this worker because I want to run the work asynchronously
if the s_dirtyclusters_counter is running beyond watermark. In this way,
the I/O flow could be smoother. But I didn't implement it because I
don't know if you like this estimate solution, I can do it if so.

Thanks,
Yi.

Attachments:

data_loss.diff (2.74 kB)
infinite_loop.diff (2.74 kB)
enospc.sh (908.00 B)
Download all attachments

2023-10-06 02:34:06

by Theodore Ts'o

[permalink] [raw]

Subject: Re: [RFC PATCH 01/16] ext4: correct the start block of counting reserved clusters

On Wed, Aug 30, 2023 at 03:10:31PM +0200, Jan Kara wrote:
> On Thu 24-08-23 17:26:04, Zhang Yi wrote:
> > From: Zhang Yi <[email protected]>
> >
> > When big allocate feature is enabled, we need to count and update
> > reserved clusters before removing a delayed only extent_status entry.
> > {init|count|get}_rsvd() have already done this, but the start block
> > number of this counting isn's correct in the following case.
> >
> > lblk end
> > | |
> > v v
> > -------------------------
> > | | orig_es
> > -------------------------
> > ^ ^
> > len1 is 0 | len2 |
> >
> > If the start block of the orig_es entry founded is bigger than lblk, we
> > passed lblk as start block to count_rsvd(), but the length is correct,
> > finally, the range to be counted is offset. This patch fix this by
> > passing the start blocks to 'orig_es->lblk + len1'.
> >
> > Signed-off-by: Zhang Yi <[email protected]>
>
> Looks good. Feel free to add:
>
> Reviewed-by: Jan Kara <[email protected]>

Thanks, I've applied the first two patches in this series, since these
are bug fixes. The rest of the patch series requires more analysis
and review.

- Ted