Since the beginning of zoned mode, I've promised Josef to get rid of the
extent_buffer redirtying, but never actually got around to doing so.
Then 2 weeks ago our CI has hit an ASSERT() in this area and I started to look
into it again. After some discussion with Christoph we came to the conclusion
to finally take the time and get rid of the extent_buffer redirtying once and
for all.
Patch one renames EXTENT_BUFFER_NO_CHECK into EXTENT_BUFFER_ZONED_ZEROOUT,
because this fits the new model somewhat better.
Number two sets the cancel bit instead of clearing the dirty bit from a zoned
extent_buffer.
Number three removes the last remaining bits of btrfs_redirty_list_add().
The last two patches in this series are just trivial cleanups I came across
while looking at the code.
---
Changes in v2:
- Rename EXTENT_BUFFER_CANCELLED to EXTENT_BUFFER_ZONED_ZEROOUT
- Add comments why we're marking the buffer as zeroout and zero it.
- Add Reviews from Josef and Christoph
- Link to v1: https://lore.kernel.org/r/[email protected]
---
Johannes Thumshirn (5):
btrfs: rename EXTENT_BUFFER_NO_CHECK to EXTENT_BUFFER_ZONED_ZEROOUT
btrfs: zoned: don't clear dirty flag of extent buffer
btrfs: remove now unneeded btrfs_redirty_list_add
btrfs: use memset_page instead of opencoding it
btrfs: reflow btrfs_free_tree_block
fs/btrfs/disk-io.c | 9 ++++-
fs/btrfs/extent-tree.c | 102 ++++++++++++++++++++++++-------------------------
fs/btrfs/extent_io.c | 18 +++++++--
fs/btrfs/extent_io.h | 3 +-
fs/btrfs/tree-log.c | 1 -
fs/btrfs/zoned.c | 16 --------
fs/btrfs/zoned.h | 5 ---
7 files changed, 74 insertions(+), 80 deletions(-)
---
base-commit: 592afe8e8b7ceee58107757fd29ff3290e6539e3
change-id: 20231120-josef-generic-163-f4df4eab2c98
Best regards,
--
Johannes Thumshirn <[email protected]>
Reflow btrfs_free_tree_block() so that there is one level of indentation
needed.
This patch has no functional changes.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
---
fs/btrfs/extent-tree.c | 97 +++++++++++++++++++++++++-------------------------
1 file changed, 49 insertions(+), 48 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4044102271e9..093aaf7aeb3a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3426,6 +3426,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
{
struct btrfs_fs_info *fs_info = trans->fs_info;
struct btrfs_ref generic_ref = { 0 };
+ struct btrfs_block_group *cache;
int ret;
btrfs_init_generic_ref(&generic_ref, BTRFS_DROP_DELAYED_REF,
@@ -3439,64 +3440,64 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
BUG_ON(ret); /* -ENOMEM */
}
- if (last_ref && btrfs_header_generation(buf) == trans->transid) {
- struct btrfs_block_group *cache;
- bool must_pin = false;
-
- if (root_id != BTRFS_TREE_LOG_OBJECTID) {
- ret = check_ref_cleanup(trans, buf->start);
- if (!ret)
- goto out;
- }
+ if (!last_ref)
+ return;
- cache = btrfs_lookup_block_group(fs_info, buf->start);
+ if (btrfs_header_generation(buf) != trans->transid)
+ goto out;
- if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
- pin_down_extent(trans, cache, buf->start, buf->len, 1);
- btrfs_put_block_group(cache);
+ if (root_id != BTRFS_TREE_LOG_OBJECTID) {
+ ret = check_ref_cleanup(trans, buf->start);
+ if (!ret)
goto out;
- }
+ }
- /*
- * If there are tree mod log users we may have recorded mod log
- * operations for this node. If we re-allocate this node we
- * could replay operations on this node that happened when it
- * existed in a completely different root. For example if it
- * was part of root A, then was reallocated to root B, and we
- * are doing a btrfs_old_search_slot(root b), we could replay
- * operations that happened when the block was part of root A,
- * giving us an inconsistent view of the btree.
- *
- * We are safe from races here because at this point no other
- * node or root points to this extent buffer, so if after this
- * check a new tree mod log user joins we will not have an
- * existing log of operations on this node that we have to
- * contend with.
- */
- if (test_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags))
- must_pin = true;
+ cache = btrfs_lookup_block_group(fs_info, buf->start);
- if (must_pin || btrfs_is_zoned(fs_info)) {
- pin_down_extent(trans, cache, buf->start, buf->len, 1);
- btrfs_put_block_group(cache);
- goto out;
- }
+ if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
+ pin_down_extent(trans, cache, buf->start, buf->len, 1);
+ btrfs_put_block_group(cache);
+ goto out;
+ }
- WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
+ /*
+ * If there are tree mod log users we may have recorded mod log
+ * operations for this node. If we re-allocate this node we
+ * could replay operations on this node that happened when it
+ * existed in a completely different root. For example if it
+ * was part of root A, then was reallocated to root B, and we
+ * are doing a btrfs_old_search_slot(root b), we could replay
+ * operations that happened when the block was part of root A,
+ * giving us an inconsistent view of the btree.
+ *
+ * We are safe from races here because at this point no other
+ * node or root points to this extent buffer, so if after this
+ * check a new tree mod log user joins we will not have an
+ * existing log of operations on this node that we have to
+ * contend with.
+ */
- btrfs_add_free_space(cache, buf->start, buf->len);
- btrfs_free_reserved_bytes(cache, buf->len, 0);
+ if (test_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags)
+ || btrfs_is_zoned(fs_info)) {
+ pin_down_extent(trans, cache, buf->start, buf->len, 1);
btrfs_put_block_group(cache);
- trace_btrfs_reserved_extent_free(fs_info, buf->start, buf->len);
+ goto out;
}
+
+ WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
+
+ btrfs_add_free_space(cache, buf->start, buf->len);
+ btrfs_free_reserved_bytes(cache, buf->len, 0);
+ btrfs_put_block_group(cache);
+ trace_btrfs_reserved_extent_free(fs_info, buf->start, buf->len);
+
out:
- if (last_ref) {
- /*
- * Deleting the buffer, clear the corrupt flag since it doesn't
- * matter anymore.
- */
- clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
- }
+
+ /*
+ * Deleting the buffer, clear the corrupt flag since it doesn't
+ * matter anymore.
+ */
+ clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
}
/* Can return -ENOMEM */
--
2.41.0
One a zoned filesystem, never clear the dirty flag of an extent buffer,
but instead mark it as zeroout.
On writeout, when encountering a marked extent_buffer, zero it out.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
---
fs/btrfs/disk-io.c | 7 ++++++-
fs/btrfs/extent_io.c | 16 ++++++++++++++--
fs/btrfs/zoned.c | 3 ++-
3 files changed, 22 insertions(+), 4 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 460b88526f56..9c09062d3d0a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -254,8 +254,13 @@ blk_status_t btree_csum_one_bio(struct btrfs_bio *bbio)
if (WARN_ON_ONCE(bbio->bio.bi_iter.bi_size != eb->len))
return BLK_STS_IOERR;
+ /*
+ * If an extent_buffer is marked as EXTENT_BUFFER_ZONED_ZEROOUT, don't
+ * checksum it but zero-out its content. This is done to preserve
+ * ordering of I/O without unnecessarily writing out data.
+ */
if (test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags)) {
- WARN_ON_ONCE(found_start != 0);
+ memzero_extent_buffer(eb, 0, eb->len);
return BLK_STS_OK;
}
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 49514ef829fb..c378094b5cc8 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3748,6 +3748,20 @@ void btrfs_clear_buffer_dirty(struct btrfs_trans_handle *trans,
if (trans && btrfs_header_generation(eb) != trans->transid)
return;
+ /*
+ * Instead of clearing the dirty flag off of the buffer, mark it as
+ * EXTENT_BUFFER_ZONED_ZEROOUT. This allows us to preserve
+ * write-ordering in zoned mode, without the need to later re-dirty
+ * the extent_buffer.
+ *
+ * The actual zeroout of the buffer will happen later in
+ * btree_csum_one_bio.
+ */
+ if (btrfs_is_zoned(fs_info)) {
+ set_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags);
+ return;
+ }
+
if (!test_and_clear_bit(EXTENT_BUFFER_DIRTY, &eb->bflags))
return;
@@ -4139,8 +4153,6 @@ static void __write_extent_buffer(const struct extent_buffer *eb,
/* For unmapped (dummy) ebs, no need to check their uptodate status. */
const bool check_uptodate = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
- WARN_ON(test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags));
-
if (check_eb_range(eb, start, len))
return;
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index b9bfde6fb929..ed8e002b33e7 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1722,7 +1722,8 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN))
return;
- ASSERT(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+ ASSERT(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+ ASSERT(test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags));
memzero_extent_buffer(eb, 0, eb->len);
set_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags);
--
2.41.0
EXTENT_BUFFER_ZONED_ZEROOUT better describes the state of the extent buffer,
namely it is written as all zeros. This is needed in zoned mode, to
preserve I/O ordering.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
---
fs/btrfs/disk-io.c | 2 +-
fs/btrfs/extent-tree.c | 2 +-
fs/btrfs/extent_io.c | 2 +-
fs/btrfs/extent_io.h | 3 ++-
fs/btrfs/zoned.c | 2 +-
5 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 5ac6789ca55f..460b88526f56 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -254,7 +254,7 @@ blk_status_t btree_csum_one_bio(struct btrfs_bio *bbio)
if (WARN_ON_ONCE(bbio->bio.bi_iter.bi_size != eb->len))
return BLK_STS_IOERR;
- if (test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags)) {
+ if (test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags)) {
WARN_ON_ONCE(found_start != 0);
return BLK_STS_OK;
}
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0455935ff558..2d8379d34a24 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5041,7 +5041,7 @@ btrfs_init_new_buffer(struct btrfs_trans_handle *trans, struct btrfs_root *root,
__btrfs_tree_lock(buf, nest);
btrfs_clear_buffer_dirty(trans, buf);
clear_bit(EXTENT_BUFFER_STALE, &buf->bflags);
- clear_bit(EXTENT_BUFFER_NO_CHECK, &buf->bflags);
+ clear_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &buf->bflags);
set_extent_buffer_uptodate(buf);
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 03cef28d9e37..49514ef829fb 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4139,7 +4139,7 @@ static void __write_extent_buffer(const struct extent_buffer *eb,
/* For unmapped (dummy) ebs, no need to check their uptodate status. */
const bool check_uptodate = !test_bit(EXTENT_BUFFER_UNMAPPED, &eb->bflags);
- WARN_ON(test_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags));
+ WARN_ON(test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags));
if (check_eb_range(eb, start, len))
return;
diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 2171057a4477..b8adb5b6fb21 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -28,7 +28,8 @@ enum {
EXTENT_BUFFER_IN_TREE,
/* write IO error */
EXTENT_BUFFER_WRITE_ERR,
- EXTENT_BUFFER_NO_CHECK,
+ /* Indicate the extent buffer is written zeroed out (for zoned) */
+ EXTENT_BUFFER_ZONED_ZEROOUT,
/* Indicate that extent buffer pages a being read */
EXTENT_BUFFER_READING,
};
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 188378ca19c7..b9bfde6fb929 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1725,7 +1725,7 @@ void btrfs_redirty_list_add(struct btrfs_transaction *trans,
ASSERT(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
memzero_extent_buffer(eb, 0, eb->len);
- set_bit(EXTENT_BUFFER_NO_CHECK, &eb->bflags);
+ set_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags);
set_extent_buffer_dirty(eb);
set_extent_bit(&trans->dirty_pages, eb->start, eb->start + eb->len - 1,
EXTENT_DIRTY, NULL);
--
2.41.0
Use memset_page() in memset_extent_buffer() instead of opencoding it.
This does not not change any functionality.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
---
fs/btrfs/extent_io.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c378094b5cc8..defe0fa04572 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4195,7 +4195,7 @@ static void memset_extent_buffer(const struct extent_buffer *eb, int c,
struct page *page = eb->pages[index];
assert_eb_page_uptodate(eb, page);
- memset(page_address(page) + offset, c, cur_len);
+ memset_page(page, offset, c, cur_len);
cur += cur_len;
}
--
2.41.0
Now that we're not clearing the dirty flag off of extent_buffers in zoned mode,
all that is left of btrfs_redirty_list_add() is a memzero() and some
ASSERT()ions.
As we're also memzero()ing the buffer on write-out btrfs_redirty_list_add()
has become obsolete and can be removed.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Signed-off-by: Johannes Thumshirn <[email protected]>
---
fs/btrfs/extent-tree.c | 5 +----
fs/btrfs/tree-log.c | 1 -
fs/btrfs/zoned.c | 17 -----------------
fs/btrfs/zoned.h | 5 -----
4 files changed, 1 insertion(+), 27 deletions(-)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 2d8379d34a24..4044102271e9 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3445,10 +3445,8 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
if (root_id != BTRFS_TREE_LOG_OBJECTID) {
ret = check_ref_cleanup(trans, buf->start);
- if (!ret) {
- btrfs_redirty_list_add(trans->transaction, buf);
+ if (!ret)
goto out;
- }
}
cache = btrfs_lookup_block_group(fs_info, buf->start);
@@ -3479,7 +3477,6 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
must_pin = true;
if (must_pin || btrfs_is_zoned(fs_info)) {
- btrfs_redirty_list_add(trans->transaction, buf);
pin_down_extent(trans, cache, buf->start, buf->len, 1);
btrfs_put_block_group(cache);
goto out;
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 7d6729d9fd2f..bee065851185 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -2575,7 +2575,6 @@ static int clean_log_buffer(struct btrfs_trans_handle *trans,
ret = btrfs_pin_reserved_extent(trans, eb);
if (ret)
return ret;
- btrfs_redirty_list_add(trans->transaction, eb);
} else {
unaccount_log_buffer(eb->fs_info, eb->start);
}
diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index ed8e002b33e7..931ccc839152 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -1715,23 +1715,6 @@ void btrfs_calc_zone_unusable(struct btrfs_block_group *cache)
cache->zone_unusable = unusable;
}
-void btrfs_redirty_list_add(struct btrfs_transaction *trans,
- struct extent_buffer *eb)
-{
- if (!btrfs_is_zoned(eb->fs_info) ||
- btrfs_header_flag(eb, BTRFS_HEADER_FLAG_WRITTEN))
- return;
-
- ASSERT(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
- ASSERT(test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags));
-
- memzero_extent_buffer(eb, 0, eb->len);
- set_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags);
- set_extent_buffer_dirty(eb);
- set_extent_bit(&trans->dirty_pages, eb->start, eb->start + eb->len - 1,
- EXTENT_DIRTY, NULL);
-}
-
bool btrfs_use_zone_append(struct btrfs_bio *bbio)
{
u64 start = (bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT);
diff --git a/fs/btrfs/zoned.h b/fs/btrfs/zoned.h
index b9cec523b778..7bfe1d677310 100644
--- a/fs/btrfs/zoned.h
+++ b/fs/btrfs/zoned.h
@@ -59,8 +59,6 @@ int btrfs_reset_device_zone(struct btrfs_device *device, u64 physical,
int btrfs_ensure_empty_zones(struct btrfs_device *device, u64 start, u64 size);
int btrfs_load_block_group_zone_info(struct btrfs_block_group *cache, bool new);
void btrfs_calc_zone_unusable(struct btrfs_block_group *cache);
-void btrfs_redirty_list_add(struct btrfs_transaction *trans,
- struct extent_buffer *eb);
bool btrfs_use_zone_append(struct btrfs_bio *bbio);
void btrfs_record_physical_zoned(struct btrfs_bio *bbio);
int btrfs_check_meta_write_pointer(struct btrfs_fs_info *fs_info,
@@ -180,9 +178,6 @@ static inline int btrfs_load_block_group_zone_info(
static inline void btrfs_calc_zone_unusable(struct btrfs_block_group *cache) { }
-static inline void btrfs_redirty_list_add(struct btrfs_transaction *trans,
- struct extent_buffer *eb) { }
-
static inline bool btrfs_use_zone_append(struct btrfs_bio *bbio)
{
return false;
--
2.41.0
On Thu, Nov 23, 2023 at 3:48 PM Johannes Thumshirn
<[email protected]> wrote:
>
> Reflow btrfs_free_tree_block() so that there is one level of indentation
> needed.
>
> This patch has no functional changes.
>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Reviewed-by: Josef Bacik <[email protected]>
> Signed-off-by: Johannes Thumshirn <[email protected]>
> ---
> fs/btrfs/extent-tree.c | 97 +++++++++++++++++++++++++-------------------------
> 1 file changed, 49 insertions(+), 48 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 4044102271e9..093aaf7aeb3a 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3426,6 +3426,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
> {
> struct btrfs_fs_info *fs_info = trans->fs_info;
> struct btrfs_ref generic_ref = { 0 };
> + struct btrfs_block_group *cache;
While at it, can we rename 'cache' to something like 'bg'?
The cache name comes from the times where the structure was named
btrfs_block_group_cache, and having it named cache is always
confusing.
Nevertheless, the change looks fine to me.
Reviewed-by: Filipe Manana <[email protected]>
> int ret;
>
> btrfs_init_generic_ref(&generic_ref, BTRFS_DROP_DELAYED_REF,
> @@ -3439,64 +3440,64 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
> BUG_ON(ret); /* -ENOMEM */
> }
>
> - if (last_ref && btrfs_header_generation(buf) == trans->transid) {
> - struct btrfs_block_group *cache;
> - bool must_pin = false;
> -
> - if (root_id != BTRFS_TREE_LOG_OBJECTID) {
> - ret = check_ref_cleanup(trans, buf->start);
> - if (!ret)
> - goto out;
> - }
> + if (!last_ref)
> + return;
>
> - cache = btrfs_lookup_block_group(fs_info, buf->start);
> + if (btrfs_header_generation(buf) != trans->transid)
> + goto out;
>
> - if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> - pin_down_extent(trans, cache, buf->start, buf->len, 1);
> - btrfs_put_block_group(cache);
> + if (root_id != BTRFS_TREE_LOG_OBJECTID) {
> + ret = check_ref_cleanup(trans, buf->start);
> + if (!ret)
> goto out;
> - }
> + }
>
> - /*
> - * If there are tree mod log users we may have recorded mod log
> - * operations for this node. If we re-allocate this node we
> - * could replay operations on this node that happened when it
> - * existed in a completely different root. For example if it
> - * was part of root A, then was reallocated to root B, and we
> - * are doing a btrfs_old_search_slot(root b), we could replay
> - * operations that happened when the block was part of root A,
> - * giving us an inconsistent view of the btree.
> - *
> - * We are safe from races here because at this point no other
> - * node or root points to this extent buffer, so if after this
> - * check a new tree mod log user joins we will not have an
> - * existing log of operations on this node that we have to
> - * contend with.
> - */
> - if (test_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags))
> - must_pin = true;
> + cache = btrfs_lookup_block_group(fs_info, buf->start);
>
> - if (must_pin || btrfs_is_zoned(fs_info)) {
> - pin_down_extent(trans, cache, buf->start, buf->len, 1);
> - btrfs_put_block_group(cache);
> - goto out;
> - }
> + if (btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN)) {
> + pin_down_extent(trans, cache, buf->start, buf->len, 1);
> + btrfs_put_block_group(cache);
> + goto out;
> + }
>
> - WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
> + /*
> + * If there are tree mod log users we may have recorded mod log
> + * operations for this node. If we re-allocate this node we
> + * could replay operations on this node that happened when it
> + * existed in a completely different root. For example if it
> + * was part of root A, then was reallocated to root B, and we
> + * are doing a btrfs_old_search_slot(root b), we could replay
> + * operations that happened when the block was part of root A,
> + * giving us an inconsistent view of the btree.
> + *
> + * We are safe from races here because at this point no other
> + * node or root points to this extent buffer, so if after this
> + * check a new tree mod log user joins we will not have an
> + * existing log of operations on this node that we have to
> + * contend with.
> + */
>
> - btrfs_add_free_space(cache, buf->start, buf->len);
> - btrfs_free_reserved_bytes(cache, buf->len, 0);
> + if (test_bit(BTRFS_FS_TREE_MOD_LOG_USERS, &fs_info->flags)
> + || btrfs_is_zoned(fs_info)) {
> + pin_down_extent(trans, cache, buf->start, buf->len, 1);
> btrfs_put_block_group(cache);
> - trace_btrfs_reserved_extent_free(fs_info, buf->start, buf->len);
> + goto out;
> }
> +
> + WARN_ON(test_bit(EXTENT_BUFFER_DIRTY, &buf->bflags));
> +
> + btrfs_add_free_space(cache, buf->start, buf->len);
> + btrfs_free_reserved_bytes(cache, buf->len, 0);
> + btrfs_put_block_group(cache);
> + trace_btrfs_reserved_extent_free(fs_info, buf->start, buf->len);
> +
> out:
> - if (last_ref) {
> - /*
> - * Deleting the buffer, clear the corrupt flag since it doesn't
> - * matter anymore.
> - */
> - clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
> - }
> +
> + /*
> + * Deleting the buffer, clear the corrupt flag since it doesn't
> + * matter anymore.
> + */
> + clear_bit(EXTENT_BUFFER_CORRUPT, &buf->bflags);
> }
>
> /* Can return -ENOMEM */
>
> --
> 2.41.0
>
>
On Thu, Nov 23, 2023 at 04:33:02PM +0000, Filipe Manana wrote:
> On Thu, Nov 23, 2023 at 3:48 PM Johannes Thumshirn
> <[email protected]> wrote:
> >
> > Reflow btrfs_free_tree_block() so that there is one level of indentation
> > needed.
> >
> > This patch has no functional changes.
> >
> > Reviewed-by: Christoph Hellwig <[email protected]>
> > Reviewed-by: Josef Bacik <[email protected]>
> > Signed-off-by: Johannes Thumshirn <[email protected]>
> > ---
> > fs/btrfs/extent-tree.c | 97 +++++++++++++++++++++++++-------------------------
> > 1 file changed, 49 insertions(+), 48 deletions(-)
> >
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 4044102271e9..093aaf7aeb3a 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -3426,6 +3426,7 @@ void btrfs_free_tree_block(struct btrfs_trans_handle *trans,
> > {
> > struct btrfs_fs_info *fs_info = trans->fs_info;
> > struct btrfs_ref generic_ref = { 0 };
> > + struct btrfs_block_group *cache;
>
> While at it, can we rename 'cache' to something like 'bg'?
>
> The cache name comes from the times where the structure was named
> btrfs_block_group_cache, and having it named cache is always
> confusing.
Agreed, using the new names in new code is highly recommended,
unfortunatelly we still have too many places using 'cache'.
On Thu, Nov 23, 2023 at 07:47:14AM -0800, Johannes Thumshirn wrote:
> Since the beginning of zoned mode, I've promised Josef to get rid of the
> extent_buffer redirtying, but never actually got around to doing so.
>
> Then 2 weeks ago our CI has hit an ASSERT() in this area and I started to look
> into it again. After some discussion with Christoph we came to the conclusion
> to finally take the time and get rid of the extent_buffer redirtying once and
> for all.
>
> Patch one renames EXTENT_BUFFER_NO_CHECK into EXTENT_BUFFER_ZONED_ZEROOUT,
> because this fits the new model somewhat better.
>
> Number two sets the cancel bit instead of clearing the dirty bit from a zoned
> extent_buffer.
>
> Number three removes the last remaining bits of btrfs_redirty_list_add().
>
> The last two patches in this series are just trivial cleanups I came across
> while looking at the code.
>
> ---
> Changes in v2:
> - Rename EXTENT_BUFFER_CANCELLED to EXTENT_BUFFER_ZONED_ZEROOUT
> - Add comments why we're marking the buffer as zeroout and zero it.
> - Add Reviews from Josef and Christoph
> - Link to v1: https://lore.kernel.org/r/[email protected]
>
> ---
> Johannes Thumshirn (5):
> btrfs: rename EXTENT_BUFFER_NO_CHECK to EXTENT_BUFFER_ZONED_ZEROOUT
> btrfs: zoned: don't clear dirty flag of extent buffer
> btrfs: remove now unneeded btrfs_redirty_list_add
> btrfs: use memset_page instead of opencoding it
> btrfs: reflow btrfs_free_tree_block
With the rename in patch 5/5 added to misc-next, thanks.
On Thu, Nov 23, 2023 at 07:47:14AM -0800, Johannes Thumshirn wrote:
> Since the beginning of zoned mode, I've promised Josef to get rid of the
> extent_buffer redirtying, but never actually got around to doing so.
>
> Then 2 weeks ago our CI has hit an ASSERT() in this area and I started to look
> into it again. After some discussion with Christoph we came to the conclusion
> to finally take the time and get rid of the extent_buffer redirtying once and
> for all.
>
> Patch one renames EXTENT_BUFFER_NO_CHECK into EXTENT_BUFFER_ZONED_ZEROOUT,
> because this fits the new model somewhat better.
>
> Number two sets the cancel bit instead of clearing the dirty bit from a zoned
> extent_buffer.
>
> Number three removes the last remaining bits of btrfs_redirty_list_add().
>
> The last two patches in this series are just trivial cleanups I came across
> while looking at the code.
>
> ---
Hello,
[SUMMARY]
With this series applied, running generic/013 200 times hits a
"__btrfs_free_extent:3251: errno=-117 Filesystem corrupted" error. I
investigated the issue and found a workaround patch. But, I'm still not
sure what is going on in the background. So, that patch might not address
the root cause.
[How it fails]
It fails with the following backtrace and error prints. It failed because
it cannot find a corresponding METADATA_ITEM for a delayed reference item.
It is interesting that the ref is trying to find "byte nr 66948939776
parent 66948939776 root 5 owner 0", but only finds "(66948939776 169 1)",
with the same bytenr and the same root, but different level. And, it fails
like this all the time. It sees the same bytenr but different level (1 !=
0).
Also, note that at this point, the extent buffer had no contents (filled
with zero).
Feb 17 17:25:43 kernel: BTRFS info (device nvme1n2): host-managed zoned block device /dev/nvme1n2, 1844 zones of 2147483648 bytes
Feb 17 17:25:43 kernel: BTRFS info (device nvme1n2): zoned: async discard ignored and disabled for zoned mode
Feb 17 17:25:43 kernel: BTRFS info (device nvme1n2): zoned mode enabled with zone size 2147483648
Feb 17 17:25:45 kernel: ------------[ cut here ]------------
Feb 17 17:25:45 kernel: WARNING: CPU: 9 PID: 29834 at fs/btrfs/extent-tree.c:3248 __btrfs_free_extent.isra.0+0x604/0x1330 [btrfs]
Feb 17 17:25:45 kernel: Modules linked in: dm_flakey algif_hash af_alg xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter ip_tables br_netfilter bridge stp llc overlay sunrpc kvm_amd kvm irqbypass rapl acpi_cpufreq ipmi_ssif i2c_piix4 k10temp btrfs ipmi_si blake2b_generic ipmi_devintf xor ipmi_msghandler raid6_pq bfq loop dm_mod zram bnxt_en ccp pkcs8_key_parser asn1_decoder public_key oid_registry fuse msr ipv6
Feb 17 17:25:45 kernel: CPU: 9 PID: 29834 Comm: fsstress Not tainted 6.8.0-rc4-BTRFS-ZNS+ #403
Feb 17 17:25:45 kernel: Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
Feb 17 17:25:45 kernel: RIP: 0010:__btrfs_free_extent.isra.0+0x604/0x1330 [btrfs]
Feb 17 17:25:45 kernel: Code: 8b 3f e8 bf 69 00 00 48 8b 7d 60 45 8b 4f 40 49 89 d8 8b 54 24 40 4c 89 e9 48 c7 c6 30 64 65 a0 e8 61 fb 0d 00 e9 8f fd ff ff <0f> 0b f0 48 0f ba 28 02 41 b8 00 00 00 00 0f 83 86 04 00 00 b9 8b
Feb 17 17:25:45 kernel: RSP: 0018:ffffc900090cfb80 EFLAGS: 00010246
Feb 17 17:25:45 kernel: RAX: ffff888365c719d8 RBX: 0000000f9677c000 RCX: 0000000000000001
Feb 17 17:25:45 kernel: RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
Feb 17 17:25:45 kernel: RBP: ffff8889a044b220 R08: 0000000000000000 R09: 0000000000000004
Feb 17 17:25:45 kernel: R10: 0000000000000000 R11: 00000000ffffffff R12: 0000000000000001
Feb 17 17:25:45 kernel: R13: ffff888ad87a4c98 R14: 0000000000000005 R15: ffff888a0c7d2a80
Feb 17 17:25:45 kernel: FS: 00007f823f5f7740(0000) GS:ffff889fcea40000(0000) knlGS:0000000000000000
Feb 17 17:25:45 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 17 17:25:45 kernel: CR2: 0000560ce0610b38 CR3: 0000000a907ec000 CR4: 0000000000350ef0
Feb 17 17:25:45 kernel: Call Trace:
Feb 17 17:25:45 kernel: <TASK>
Feb 17 17:25:46 kernel: ? __btrfs_free_extent.isra.0+0x604/0x1330 [btrfs]
Feb 17 17:25:46 kernel: ? __warn+0x81/0x170
Feb 17 17:25:46 kernel: ? __btrfs_free_extent.isra.0+0x604/0x1330 [btrfs]
Feb 17 17:25:46 kernel: ? report_bug+0x18d/0x1c0
Feb 17 17:25:46 kernel: ? handle_bug+0x3c/0x70
Feb 17 17:25:46 kernel: ? exc_invalid_op+0x13/0x60
Feb 17 17:25:46 kernel: ? asm_exc_invalid_op+0x16/0x20
Feb 17 17:25:46 kernel: ? __btrfs_free_extent.isra.0+0x604/0x1330 [btrfs]
Feb 17 17:25:46 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:46 kernel: ? rcu_is_watching+0xd/0x40
Feb 17 17:25:46 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:46 kernel: ? lock_release+0x1e5/0x280
Feb 17 17:25:46 kernel: __btrfs_run_delayed_refs+0x64c/0x1380 [btrfs]
Feb 17 17:25:46 kernel: ? btrfs_commit_transaction+0x3e/0x12d0 [btrfs]
Feb 17 17:25:46 kernel: btrfs_run_delayed_refs+0x92/0x130 [btrfs]
Feb 17 17:25:46 kernel: btrfs_commit_transaction+0xa2/0x12d0 [btrfs]
Feb 17 17:25:46 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:46 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:46 kernel: ? rcu_is_watching+0xd/0x40
Feb 17 17:25:46 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:46 kernel: ? lock_release+0x1e5/0x280
Feb 17 17:25:46 kernel: btrfs_sync_file+0x532/0x660 [btrfs]
Feb 17 17:25:46 kernel: __x64_sys_fsync+0x37/0x60
Feb 17 17:25:46 kernel: do_syscall_64+0x79/0x1a0
Feb 17 17:25:46 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
Feb 17 17:25:46 kernel: RIP: 0033:0x7f823f6f8400
Feb 17 17:25:46 kernel: Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d e1 d1 0d 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
Feb 17 17:25:46 kernel: RSP: 002b:00007ffe3c26e9f8 EFLAGS: 00000202 ORIG_RAX: 000000000000004a
Feb 17 17:25:46 kernel: RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f823f6f8400
Feb 17 17:25:46 kernel: RDX: 0000000000000193 RSI: 0000560cdfcdb048 RDI: 0000000000000003
Feb 17 17:25:46 kernel: RBP: 00000000000002e6 R08: 0000000000000007 R09: 00007ffe3c26ea0c
Feb 17 17:25:46 kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe3c26ea10
Feb 17 17:25:46 kernel: R13: 028f5c28f5c28f5c R14: 8f5c28f5c28f5c29 R15: 0000560cdfcd7180
Feb 17 17:25:46 kernel: </TASK>
Feb 17 17:25:46 kernel: irq event stamp: 0
Feb 17 17:25:46 kernel: hardirqs last enabled at (0): [<0000000000000000>] 0x0
Feb 17 17:25:46 kernel: hardirqs last disabled at (0): [<ffffffff810e5e0e>] copy_process+0xb0e/0x1e00
Feb 17 17:25:46 kernel: softirqs last enabled at (0): [<ffffffff810e5e0e>] copy_process+0xb0e/0x1e00
Feb 17 17:25:46 kernel: softirqs last disabled at (0): [<0000000000000000>] 0x0
Feb 17 17:25:46 kernel: ---[ end trace 0000000000000000 ]---
Feb 17 17:25:46 kernel: ------------[ cut here ]------------
Feb 17 17:25:46 kernel: BTRFS: Transaction aborted (error -117)
Feb 17 17:25:46 kernel: WARNING: CPU: 9 PID: 29834 at fs/btrfs/extent-tree.c:3249 __btrfs_free_extent.isra.0+0xf8e/0x1330 [btrfs]
Feb 17 17:25:46 kernel: Modules linked in: dm_flakey algif_hash af_alg xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype iptable_filter ip_tables br_netfilter bridge stp llc overlay sunrpc kvm_amd kvm irqbypass rapl acpi_cpufreq ipmi_ssif i2c_piix4 k10temp btrfs ipmi_si blake2b_generic ipmi_devintf xor ipmi_msghandler raid6_pq bfq loop dm_mod zram bnxt_en ccp pkcs8_key_parser asn1_decoder public_key oid_registry fuse msr ipv6
Feb 17 17:25:46 kernel: CPU: 9 PID: 29834 Comm: fsstress Tainted: G W 6.8.0-rc4-BTRFS-ZNS+ #403
Feb 17 17:25:46 kernel: Hardware name: Supermicro Super Server/H12SSL-NT, BIOS 2.0 02/22/2021
Feb 17 17:25:46 kernel: RIP: 0010:__btrfs_free_extent.isra.0+0xf8e/0x1330 [btrfs]
Feb 17 17:25:46 kernel: Code: 48 c7 c6 40 5d 65 a0 e8 f0 f1 0d 00 c7 44 24 18 01 00 00 00 e9 ed f7 ff ff be 8b ff ff ff 48 c7 c7 68 5d 65 a0 e8 52 69 c1 e0 <0f> 0b e9 30 fb ff ff 48 8b 45 60 48 05 d8 19 00 00 f0 48 0f ba 28
Feb 17 17:25:46 kernel: RSP: 0018:ffffc900090cfb80 EFLAGS: 00010282
Feb 17 17:25:46 kernel: RAX: 0000000000000000 RBX: 0000000f9677c000 RCX: 0000000000000000
Feb 17 17:25:46 kernel: RDX: 0000000000000002 RSI: ffffffff82464302 RDI: 00000000ffffffff
Feb 17 17:25:46 kernel: RBP: ffff8889a044b220 R08: 0000000000009ffb R09: 00000000ffffdfff
Feb 17 17:25:46 kernel: R10: 00000000ffffdfff R11: ffffffff8264dd80 R12: 0000000000000001
Feb 17 17:25:46 kernel: R13: ffff888ad87a4c98 R14: 0000000000000005 R15: ffff888a0c7d2a80
Feb 17 17:25:46 kernel: FS: 00007f823f5f7740(0000) GS:ffff889fcea40000(0000) knlGS:0000000000000000
Feb 17 17:25:46 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb 17 17:25:46 kernel: CR2: 0000560ce0610b38 CR3: 0000000a907ec000 CR4: 0000000000350ef0
Feb 17 17:25:46 kernel: Call Trace:
Feb 17 17:25:46 kernel: <TASK>
Feb 17 17:25:46 kernel: ? __btrfs_free_extent.isra.0+0xf8e/0x1330 [btrfs]
Feb 17 17:25:46 kernel: ? __warn+0x81/0x170
Feb 17 17:25:46 kernel: ? __btrfs_free_extent.isra.0+0xf8e/0x1330 [btrfs]
Feb 17 17:25:46 kernel: ? report_bug+0x18d/0x1c0
Feb 17 17:25:47 kernel: ? tick_nohz_tick_stopped+0x12/0x30
Feb 17 17:25:47 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:47 kernel: ? handle_bug+0x3c/0x70
Feb 17 17:25:47 kernel: ? exc_invalid_op+0x13/0x60
Feb 17 17:25:47 kernel: ? asm_exc_invalid_op+0x16/0x20
Feb 17 17:25:47 kernel: ? __btrfs_free_extent.isra.0+0xf8e/0x1330 [btrfs]
Feb 17 17:25:47 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:47 kernel: ? rcu_is_watching+0xd/0x40
Feb 17 17:25:47 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:47 kernel: ? lock_release+0x1e5/0x280
Feb 17 17:25:47 kernel: __btrfs_run_delayed_refs+0x64c/0x1380 [btrfs]
Feb 17 17:25:47 kernel: ? btrfs_commit_transaction+0x3e/0x12d0 [btrfs]
Feb 17 17:25:47 kernel: btrfs_run_delayed_refs+0x92/0x130 [btrfs]
Feb 17 17:25:47 kernel: btrfs_commit_transaction+0xa2/0x12d0 [btrfs]
Feb 17 17:25:47 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:47 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:47 kernel: ? rcu_is_watching+0xd/0x40
Feb 17 17:25:47 kernel: ? srso_return_thunk+0x5/0x5f
Feb 17 17:25:47 kernel: ? lock_release+0x1e5/0x280
Feb 17 17:25:47 kernel: btrfs_sync_file+0x532/0x660 [btrfs]
Feb 17 17:25:47 kernel: __x64_sys_fsync+0x37/0x60
Feb 17 17:25:47 kernel: do_syscall_64+0x79/0x1a0
Feb 17 17:25:47 kernel: entry_SYSCALL_64_after_hwframe+0x6e/0x76
Feb 17 17:25:47 kernel: RIP: 0033:0x7f823f6f8400
Feb 17 17:25:47 kernel: Code: 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d e1 d1 0d 00 00 74 17 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 48 c3 0f 1f 80 00 00 00 00 48 83 ec 18 89 7c
Feb 17 17:25:47 kernel: RSP: 002b:00007ffe3c26e9f8 EFLAGS: 00000202 ORIG_RAX: 000000000000004a
Feb 17 17:25:47 kernel: RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f823f6f8400
Feb 17 17:25:47 kernel: RDX: 0000000000000193 RSI: 0000560cdfcdb048 RDI: 0000000000000003
Feb 17 17:25:47 kernel: RBP: 00000000000002e6 R08: 0000000000000007 R09: 00007ffe3c26ea0c
Feb 17 17:25:47 kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 00007ffe3c26ea10
Feb 17 17:25:47 kernel: R13: 028f5c28f5c28f5c R14: 8f5c28f5c28f5c29 R15: 0000560cdfcd7180
Feb 17 17:25:47 kernel: </TASK>
Feb 17 17:25:47 kernel: irq event stamp: 0
Feb 17 17:25:47 kernel: hardirqs last enabled at (0): [<0000000000000000>] 0x0
Feb 17 17:25:47 kernel: hardirqs last disabled at (0): [<ffffffff810e5e0e>] copy_process+0xb0e/0x1e00
Feb 17 17:25:47 kernel: softirqs last enabled at (0): [<ffffffff810e5e0e>] copy_process+0xb0e/0x1e00
Feb 17 17:25:47 kernel: softirqs last disabled at (0): [<0000000000000000>] 0x0
Feb 17 17:25:47 kernel: ---[ end trace 0000000000000000 ]---
Feb 17 17:25:47 kernel: BTRFS: error (device nvme1n2: state A) in __btrfs_free_extent:3249: errno=-117 Filesystem corrupted
Feb 17 17:25:47 kernel: BTRFS info (device nvme1n2: state EA): forced readonly
Feb 17 17:25:47 kernel: BTRFS info (device nvme1n2: state EA): leaf 66957836288 gen 3873 total ptrs 203 free space 1102 owner 2
Feb 17 17:25:47 kernel: BTRFS info (device nvme1n2: state EA): refs 2 lock_owner 29834 current 29834
Feb 17 17:25:47 kernel: item 0 key (63394947072 168 40960) itemoff 16230 itemsize 53
Feb 17 17:25:47 kernel: extent refs 1 gen 3835 flags 1
Feb 17 17:25:47 kernel: ref#0: extent data backref root 5 objectid 552 offset 1802240 count 1
(snip)...
Feb 17 17:25:55 kernel: item 164 key (66948923392 169 0) itemoff 8229 itemsize 33
Feb 17 17:25:55 kernel: extent refs 1 gen 3872 flags 2
Feb 17 17:25:55 kernel: ref#0: tree block backref root 2
Feb 17 17:25:55 kernel: item 165 key (66948939776 169 1) itemoff 8196 itemsize 33
Feb 17 17:25:55 kernel: extent refs 1 gen 3873 flags 2
Feb 17 17:25:55 kernel: ref#0: tree block backref root 5
Feb 17 17:25:55 kernel: item 166 key (68719476736 168 110592) itemoff 8143 itemsize 53
Feb 17 17:25:55 kernel: extent refs 1 gen 3841 flags 1
Feb 17 17:25:55 kernel: ref#0: extent data backref root 5 objectid 440 offset 3100672 count 1
(snip)...
Feb 17 17:25:56 kernel: item 202 key (68722249728 168 110592) itemoff 6177 itemsize 53
Feb 17 17:25:56 kernel: extent refs 1 gen 3842 flags 1
Feb 17 17:25:56 kernel: ref#0: extent data backref root 5 objectid 953 offset 5431296 count 1
Feb 17 17:25:56 kernel: BTRFS critical (device nvme1n2: state EA): unable to find ref byte nr 66948939776 parent 66948939776 root 5 owner 0 offset 0 slot 166
Feb 17 17:25:57 kernel: BTRFS error (device nvme1n2: state EA): failed to run delayed ref for logical 66948939776 num_bytes 16384 type 182 action 2 ref_mod 1: -2
Feb 17 17:25:57 kernel: BTRFS: error (device nvme1n2: state EA) in btrfs_run_delayed_refs:2246: errno=-2 No such entry
Feb 17 17:25:57 kernel: BTRFS warning (device nvme1n2: state EA): Skipping commit of aborted transaction.
Feb 17 17:25:57 kernel: BTRFS: error (device nvme1n2: state EA) in cleanup_transaction:2006: errno=-2 No such entry
Feb 17 17:26:02 kernel: BTRFS info (device nvme1n2: state EA): last unmount of filesystem f33f6f52-ef71-4984-898d-ff20390559d1
[Workaround patch]
I added "level" field to struct extent_buffer to store the tree level and
checked if buf->level is the same as btrfs_header_level(buf) or not in
btrfs_free_tree_block(). It revealed buf->level != btrfs_header_level(buf)
can happen. It is because the extent buffer is already zeroed out in
btree_csum_one_bio(). Since the header is also cleared,
btrfs_header_level(buf) returns 0 while it actually should be 1. As a
result, a faulty delayed ref, whose level = 0 not 1, is inserted.
A workaround fix is easy. We can leave the header intact. With this patch,
the failure goes away and generic/013 passed 300 times.
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8ab185182c30..f43104224c92 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -278,7 +278,9 @@ blk_status_t btree_csum_one_bio(struct btrfs_bio *bbio)
* ordering of I/O without unnecessarily writing out data.
*/
if (test_bit(EXTENT_BUFFER_ZONED_ZEROOUT, &eb->bflags)) {
- memzero_extent_buffer(eb, 0, eb->len);
+ const unsigned long header_size = sizeof(struct btrfs_header);
+
+ memzero_extent_buffer(eb, header_size, eb->len - header_size);
return BLK_STS_OK;
}
[Remaining Questions]
Timeline of the faulty referenced extent buffer should be following:
- (allocated)
- btrfs_clear_buffer_dirty: set EXTENT_BUFFER_ZONED_ZEROOUT
- btree_csum_one_bio: zero out entire buffer
- (written to device?)
- btrfs_free_tree_block: add DROP_DELAYED_REF
Q1. I previously assumed at btrfs_clear_buffer_dirty, there is no reference
left. But, the actual drop of the reference is done after
btrfs_free_tree_block. Is it too early to mark it EXTENT_BUFFER_ZONED_ZEROOUT and
allow it to be zero out? Maybe, if the FS crashed just after writing zeroed
buffer to a device, re-mounted FS will see zero filled node and panic?
Q2. With the "buf->level != btrfs_header_level(buf)" check, I observed some
level mismatch printed with no delayed ref error followed. How can it
happen? I guess it is merged with the existing delayed refs ... but it does
not care the level mismatch?
Q3. My experience on debugging this suggests this bug is highly timing
related: adding debug prints in btrfs_free_tree_block prevents the bug to
appear. What is the reason behind this?