2024-06-04 07:13:20

by Zizhi Wo

[permalink] [raw]
Subject: [PATCH] xfs: Fix file creation failure

We have an xfs image that contains only 2 AGs, the first AG is full and
the second AG is empty, then a concurrent file creation and little writing
could unexpectedly return -ENOSPC error since there is a race window that
the allocator could get the wrong agf->agf_longest.

Write file process steps:
1) Find the entry that best meets the conditions, then calculate the start
address and length of the remaining part of the entry after allocation.
2) Delete this entry. Because the second AG is empty, the btree in its agf
has only one record, and agf->agf_longest will be set to 0 after deletion.
3) Insert the remaining unused parts of this entry based on the
calculations in 1), and update the agf->agf_longest.

Create file process steps:
1) Check whether there are free inodes in the inode chunk.
2) If there is no free inode, check whether there has space for creating
inode chunks, perform the no-lock judgment first.
3) If the judgment succeeds, the judgment is performed again with agf lock
held. Otherwire, an error is returned directly.

If the write process is in step 2) but not go to 3) yet, the create file
process goes to 2) at this time, it will be mistaken for no space,
resulting in the file system still has space but the file creation fails.

Direct write Create file
xfs_file_write_iter
...
xfs_direct_write_iomap_begin
xfs_iomap_write_direct
...
xfs_alloc_ag_vextent_near
xfs_alloc_cur_finish
xfs_alloc_fixup_trees
xfs_btree_delete
xfs_btree_delrec
xfs_allocbt_update_lastrec
// longest = 0 because numrec == 0.
agf->agf_longest = len = 0
xfs_create
...
xfs_dialloc
...
xfs_alloc_fix_freelist
xfs_alloc_space_available
-> as longest=0, it will return
false, no space for inode alloc.

Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t
structure to store the potential longest count that will be updated. The
assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent().

Reported by: Ye Bin <[email protected]>
Signed-off-by: Zizhi Wo <[email protected]>
---
fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++
fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++-
fs/xfs/libxfs/xfs_btree.h | 1 +
3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 6c55a6e88eba..86ba873d57a8 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -577,6 +577,13 @@ xfs_alloc_fixup_trees(
nfbno2 = rbno + rlen;
nflen2 = (fbno + flen) - nfbno2;
}
+
+ /*
+ * Record the potential maximum free length in advance.
+ */
+ if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK)
+ cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2);
+
/*
* Delete the entry from the by-size btree.
*/
@@ -2044,6 +2051,13 @@ xfs_free_ag_extent(
* Now allocate and initialize a cursor for the by-size tree.
*/
cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
+ /*
+ * Record the potential maximum free length in advance.
+ */
+ if (haveleft)
+ cnt_cur->bc_ag.bc_free_longest = ltlen;
+ if (haveright)
+ cnt_cur->bc_ag.bc_free_longest = gtlen;
/*
* Have both left and right contiguous neighbors.
* Merge all three into a single free block.
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 6ef5ddd89600..8e7d1e0f1a63 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec(
rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
len = rrp->ar_blockcount;
} else {
- len = 0;
+ /*
+ * Update in advance to prevent file creation failure
+ * for concurrent processes even though there is no
+ * numrec currently.
+ * And there's no need to worry as the value that no
+ * less than bc_free_longest will be inserted later.
+ */
+ len = cpu_to_be32(cur->bc_ag.bc_free_longest);
}

break;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index f93374278aa1..985b1885a643 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -281,6 +281,7 @@ struct xfs_btree_cur
struct xfs_perag *pag;
struct xfs_buf *agbp;
struct xbtree_afakeroot *afake; /* for staging cursor */
+ xfs_extlen_t bc_free_longest; /* potential longest free space */
} bc_ag;
struct {
struct xfbtree *xfbtree;
--
2.39.2



2024-06-04 16:08:14

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH] xfs: Fix file creation failure

On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote:
> We have an xfs image that contains only 2 AGs, the first AG is full and
> the second AG is empty, then a concurrent file creation and little writing
> could unexpectedly return -ENOSPC error since there is a race window that
> the allocator could get the wrong agf->agf_longest.
>
> Write file process steps:
> 1) Find the entry that best meets the conditions, then calculate the start
> address and length of the remaining part of the entry after allocation.
> 2) Delete this entry. Because the second AG is empty, the btree in its agf
> has only one record, and agf->agf_longest will be set to 0 after deletion.
> 3) Insert the remaining unused parts of this entry based on the
> calculations in 1), and update the agf->agf_longest.
>
> Create file process steps:
> 1) Check whether there are free inodes in the inode chunk.
> 2) If there is no free inode, check whether there has space for creating
> inode chunks, perform the no-lock judgment first.
> 3) If the judgment succeeds, the judgment is performed again with agf lock
> held. Otherwire, an error is returned directly.
>
> If the write process is in step 2) but not go to 3) yet, the create file
> process goes to 2) at this time, it will be mistaken for no space,
> resulting in the file system still has space but the file creation fails.
>
> Direct write Create file
> xfs_file_write_iter
> ...
> xfs_direct_write_iomap_begin
> xfs_iomap_write_direct
> ...
> xfs_alloc_ag_vextent_near
> xfs_alloc_cur_finish
> xfs_alloc_fixup_trees
> xfs_btree_delete
> xfs_btree_delrec
> xfs_allocbt_update_lastrec
> // longest = 0 because numrec == 0.
> agf->agf_longest = len = 0
> xfs_create
> ...
> xfs_dialloc
> ...
> xfs_alloc_fix_freelist
> xfs_alloc_space_available
> -> as longest=0, it will return
> false, no space for inode alloc.
>
> Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t
> structure to store the potential longest count that will be updated. The
> assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent().

This is going to be a reverse-order review due to the way that diff
ordered the chunks, which means that the bulk of my questions are at the
end.

> Reported by: Ye Bin <[email protected]>
> Signed-off-by: Zizhi Wo <[email protected]>
> ---
> fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++
> fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++-
> fs/xfs/libxfs/xfs_btree.h | 1 +
> 3 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 6c55a6e88eba..86ba873d57a8 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees(
> nfbno2 = rbno + rlen;
> nflen2 = (fbno + flen) - nfbno2;
> }
> +
> + /*
> + * Record the potential maximum free length in advance.
> + */
> + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK)
> + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2);

Ok, so if we're allocating space then this sets bc_free_longest to the
longer of the two remaining sections, if any. But if we just allocated
the entirety of the longest extent in the cntbt, then we don't set
bc_free_longest, which means its zero, right? I guess that's ok because
that implies there's zero space left in the AG, so the longest freespace
is indeed zero.

If we just allocated the entirety of a non-longest extent in the cntbt
then we don't call ->lastrec_update so the value of bc_free_longest
doesn't matter?

> +
> /*
> * Delete the entry from the by-size btree.
> */
> @@ -2044,6 +2051,13 @@ xfs_free_ag_extent(
> * Now allocate and initialize a cursor for the by-size tree.
> */
> cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> + /*
> + * Record the potential maximum free length in advance.
> + */
> + if (haveleft)
> + cnt_cur->bc_ag.bc_free_longest = ltlen;
> + if (haveright)
> + cnt_cur->bc_ag.bc_free_longest = gtlen;

What happens in the haveleft && haveright case? Shouldn't
bc_free_longest be set to ltlen + len + gtlen? You could just push the
setting of bc_free_longest into the haveleft/haveright code below.

> /*
> * Have both left and right contiguous neighbors.
> * Merge all three into a single free block.
> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> index 6ef5ddd89600..8e7d1e0f1a63 100644
> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec(
> rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
> len = rrp->ar_blockcount;
> } else {
> - len = 0;
> + /*
> + * Update in advance to prevent file creation failure
> + * for concurrent processes even though there is no
> + * numrec currently.
> + * And there's no need to worry as the value that no
> + * less than bc_free_longest will be inserted later.
> + */
> + len = cpu_to_be32(cur->bc_ag.bc_free_longest);

Humm. In this case, we've called ->update_lastrec on the cntbt cursor
having deleted all the records in this record block. Presumably that
means that we're going to add rec->alloc.ar_blockcount blocks to the
rightmost record in the left sibling of @block? Or already have?

Ahh, right, the pagf_longest checks are done without holding AGF lock.
The concurrent creat call sees this intermediate state (DELREC sets
pagf_longest to zero, a moment later INSREC/UPDATE set it to the correct
nonzero value) and decides to ENOSPC because "nobody" has sufficient
free space.

I think this phony zero never gets written to disk because although
we're logging zero into the ondisk and incore agf_longest here, the next
btree operation will reset it to the correct value. Right?

Would it be simpler to handle this case by duplicating the cntbt cursor
and walking one record leftward in the tree to find the longest extent,
rather than using this "bc_free_longest" variable?

> }
>
> break;
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index f93374278aa1..985b1885a643 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -281,6 +281,7 @@ struct xfs_btree_cur
> struct xfs_perag *pag;
> struct xfs_buf *agbp;
> struct xbtree_afakeroot *afake; /* for staging cursor */
> + xfs_extlen_t bc_free_longest; /* potential longest free space */

This is only used for bnobt/cntbt trees, put it in the per-format
private data area, please.

If the answer to the question about duplicating the btree cursor is "no"
then I think this deserves a much longer comment that captures the fact
that the variable exists to avoid setting pagf_longest to zero for
benefit of the code that does unlocked scanning of AGs for free space.

I also wonder if the unlocked ag scan should just note that it observed
a zero pagf_longest and if no space can be found across all the AGs, to
try again with locks?

--D

> } bc_ag;
> struct {
> struct xfbtree *xfbtree;
> --
> 2.39.2
>
>

2024-06-04 23:00:43

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] xfs: Fix file creation failure

On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote:
> We have an xfs image that contains only 2 AGs, the first AG is full and
> the second AG is empty, then a concurrent file creation and little writing
> could unexpectedly return -ENOSPC error since there is a race window that
> the allocator could get the wrong agf->agf_longest.
>
> Write file process steps:
> 1) Find the entry that best meets the conditions, then calculate the start
> address and length of the remaining part of the entry after allocation.
> 2) Delete this entry. Because the second AG is empty, the btree in its agf
> has only one record, and agf->agf_longest will be set to 0 after deletion.
> 3) Insert the remaining unused parts of this entry based on the
> calculations in 1), and update the agf->agf_longest.
>
> Create file process steps:
> 1) Check whether there are free inodes in the inode chunk.
> 2) If there is no free inode, check whether there has space for creating
> inode chunks, perform the no-lock judgment first.
> 3) If the judgment succeeds, the judgment is performed again with agf lock
> held. Otherwire, an error is returned directly.
>
> If the write process is in step 2) but not go to 3) yet, the create file
> process goes to 2) at this time, it will be mistaken for no space,
> resulting in the file system still has space but the file creation fails.
>
> Direct write Create file
> xfs_file_write_iter
> ...
> xfs_direct_write_iomap_begin
> xfs_iomap_write_direct
> ...
> xfs_alloc_ag_vextent_near
> xfs_alloc_cur_finish
> xfs_alloc_fixup_trees
> xfs_btree_delete
> xfs_btree_delrec
> xfs_allocbt_update_lastrec
> // longest = 0 because numrec == 0.
> agf->agf_longest = len = 0
> xfs_create
> ...
> xfs_dialloc
> ...
> xfs_alloc_fix_freelist
> xfs_alloc_space_available
> -> as longest=0, it will return
> false, no space for inode alloc.

Ok, so this is a another attempt to address the problem Ye Bin
attempted to fix here:

https://lore.kernel.org/linux-xfs/[email protected]/

> Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t
> structure to store the potential longest count that will be updated. The
> assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent().

I outlined how this should be fixed in the above thread:

https://lore.kernel.org/linux-xfs/[email protected]/

This is what I said:

| What we actually want is for pag->pagf_longest not to change
| transiently to zero in xfs_alloc_fixup_trees(). If the delrec that
| zeroes the pagf_longest field is going to follow it up with an
| insrec that will set it back to a valid value, we really should not
| be doing the zeroing in the first place.
|
| Further, the only btree that tracks the right edge of the btree is
| the by-count allocbt. This isn't "generic" functionality, even
| though it is implemented through the generic btree code. If we lift
| ->update_lastrec from the generic code and do it directly in
| xfs_alloc.c whenever we are finished with a by-count tree update
| and the cursor points to a record in the right-most leaf of the
| tree, then we run the lastrec update directly at that point.
| By decoupling the lastrec updates from the individual record
| manipulations, we can make the transients disappear completely.

I'm not sure if this patch is an attempt to implement this - there
is no reference in the commit description to this previous attempt
to fix the issue, nor is the any discussion of why this particular
solution was chosen.

In future, when you are trying to fix an issue that has previously
been discussed/presented on the list, please reference it and
provide a link to the previous discussions in the changelog for the
new version of the patchset fixing the issue.

> Reported by: Ye Bin <[email protected]>
> Signed-off-by: Zizhi Wo <[email protected]>
> ---
> fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++
> fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++-
> fs/xfs/libxfs/xfs_btree.h | 1 +
> 3 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 6c55a6e88eba..86ba873d57a8 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees(
> nfbno2 = rbno + rlen;
> nflen2 = (fbno + flen) - nfbno2;
> }
> +
> + /*
> + * Record the potential maximum free length in advance.
> + */
> + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK)
> + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2);
> +

Why do we store the length of a random extent being freed here?
nflen1/2 almost always have nothing to do with the longest free
space extent in the tree, they are just the new free space extents
we are insering into a random location in the free space tree.

> /*
> * Delete the entry from the by-size btree.
> */
> @@ -2044,6 +2051,13 @@ xfs_free_ag_extent(
> * Now allocate and initialize a cursor for the by-size tree.
> */
> cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> + /*
> + * Record the potential maximum free length in advance.
> + */
> + if (haveleft)
> + cnt_cur->bc_ag.bc_free_longest = ltlen;
> + if (haveright)
> + cnt_cur->bc_ag.bc_free_longest = gtlen;

That doesn't look correct. At this point in the code, ltlen/gtlen
are the sizes of the physically adjacent freespace extents that we
are going to merge the newly freed extent with. i.e. the new
freespace extent is going to have one of 4 possible values:

no merge: len
left merge: ltlen + len
right merge: gtlen + len
both merge: ltlen + gtlen + len

So regardless of anything else, this code isn't setting the longest
freespace extent in teh AGF to the lenght of the longest freespace
extent in the filesystem.

Which leads me to ask: how did you test this code? This bug should
have been triggering verifier, repair and scrub failures quite
quickly with fstests....

> /*
> * Have both left and right contiguous neighbors.
> * Merge all three into a single free block.
> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> index 6ef5ddd89600..8e7d1e0f1a63 100644
> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec(
> rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
> len = rrp->ar_blockcount;
> } else {
> - len = 0;
> + /*
> + * Update in advance to prevent file creation failure
> + * for concurrent processes even though there is no
> + * numrec currently.
> + * And there's no need to worry as the value that no
> + * less than bc_free_longest will be inserted later.
> + */
> + len = cpu_to_be32(cur->bc_ag.bc_free_longest);
> }

So this is in the LASTREC_DELREC case when the last record is
removed from the btree. This is what causes the transient state
as we do this when deleting a record to trim it and then re-insert
the remainder back into the by-count btree.

Writing some random transient value into the AGF *and journalling
it* means we creating a transient on-disk format structure
corruption and potentially writing it to persistent storage (i.e.
the journal). The structure is, at least, not consistent in memory
because the free space tree is empty at this point in time, whilst
the agf->longest field says it has a free space available. This
could trip verifiers, be flagged as corruption by xfs_scrub/repair,
etc.

Now, this *might be safe* because we *may* clean it up later in the
transaction, but if this really is the last extent being removed
from the btree and a cursor has previously been used to do other
insert and free operations that set this field, then we trip over
this stale inforamtion and write a corrupt structure to disk. That's
not good.

As I said above, this "last record tracking" needs to be ripped out
of the generic btree code because only the by-count btree uses it.
Then it can be updated at the end of the by-count btree update
process in the allocation code (i.e. after all record manipulations
are done in the transaction) and that avoids this transient caused
by updating the last record on every btree record update that is
done.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2024-06-04 23:54:06

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH] xfs: Fix file creation failure

On Wed, Jun 05, 2024 at 09:00:28AM +1000, Dave Chinner wrote:
> On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote:
> > We have an xfs image that contains only 2 AGs, the first AG is full and
> > the second AG is empty, then a concurrent file creation and little writing
> > could unexpectedly return -ENOSPC error since there is a race window that
> > the allocator could get the wrong agf->agf_longest.
> >
> > Write file process steps:
> > 1) Find the entry that best meets the conditions, then calculate the start
> > address and length of the remaining part of the entry after allocation.
> > 2) Delete this entry. Because the second AG is empty, the btree in its agf
> > has only one record, and agf->agf_longest will be set to 0 after deletion.
> > 3) Insert the remaining unused parts of this entry based on the
> > calculations in 1), and update the agf->agf_longest.
> >
> > Create file process steps:
> > 1) Check whether there are free inodes in the inode chunk.
> > 2) If there is no free inode, check whether there has space for creating
> > inode chunks, perform the no-lock judgment first.
> > 3) If the judgment succeeds, the judgment is performed again with agf lock
> > held. Otherwire, an error is returned directly.
> >
> > If the write process is in step 2) but not go to 3) yet, the create file
> > process goes to 2) at this time, it will be mistaken for no space,
> > resulting in the file system still has space but the file creation fails.
> >
> > Direct write Create file
> > xfs_file_write_iter
> > ...
> > xfs_direct_write_iomap_begin
> > xfs_iomap_write_direct
> > ...
> > xfs_alloc_ag_vextent_near
> > xfs_alloc_cur_finish
> > xfs_alloc_fixup_trees
> > xfs_btree_delete
> > xfs_btree_delrec
> > xfs_allocbt_update_lastrec
> > // longest = 0 because numrec == 0.
> > agf->agf_longest = len = 0
> > xfs_create
> > ...
> > xfs_dialloc
> > ...
> > xfs_alloc_fix_freelist
> > xfs_alloc_space_available
> > -> as longest=0, it will return
> > false, no space for inode alloc.
>
> Ok, so this is a another attempt to address the problem Ye Bin
> attempted to fix here:
>
> https://lore.kernel.org/linux-xfs/[email protected]/
>
> > Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t
> > structure to store the potential longest count that will be updated. The
> > assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent().
>
> I outlined how this should be fixed in the above thread:
>
> https://lore.kernel.org/linux-xfs/[email protected]/
>
> This is what I said:
>
> | What we actually want is for pag->pagf_longest not to change
> | transiently to zero in xfs_alloc_fixup_trees(). If the delrec that
> | zeroes the pagf_longest field is going to follow it up with an
> | insrec that will set it back to a valid value, we really should not
> | be doing the zeroing in the first place.
> |
> | Further, the only btree that tracks the right edge of the btree is
> | the by-count allocbt. This isn't "generic" functionality, even
> | though it is implemented through the generic btree code. If we lift
> | ->update_lastrec from the generic code and do it directly in
> | xfs_alloc.c whenever we are finished with a by-count tree update
> | and the cursor points to a record in the right-most leaf of the
> | tree, then we run the lastrec update directly at that point.
> | By decoupling the lastrec updates from the individual record
> | manipulations, we can make the transients disappear completely.
>
> I'm not sure if this patch is an attempt to implement this - there
> is no reference in the commit description to this previous attempt
> to fix the issue, nor is the any discussion of why this particular
> solution was chosen.
>
> In future, when you are trying to fix an issue that has previously
> been discussed/presented on the list, please reference it and
> provide a link to the previous discussions in the changelog for the
> new version of the patchset fixing the issue.

Aha, that's why this seemed oddly familiar.

--D

> > Reported by: Ye Bin <[email protected]>
> > Signed-off-by: Zizhi Wo <[email protected]>
> > ---
> > fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++
> > fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++-
> > fs/xfs/libxfs/xfs_btree.h | 1 +
> > 3 files changed, 23 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 6c55a6e88eba..86ba873d57a8 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees(
> > nfbno2 = rbno + rlen;
> > nflen2 = (fbno + flen) - nfbno2;
> > }
> > +
> > + /*
> > + * Record the potential maximum free length in advance.
> > + */
> > + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK)
> > + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2);
> > +
>
> Why do we store the length of a random extent being freed here?
> nflen1/2 almost always have nothing to do with the longest free
> space extent in the tree, they are just the new free space extents
> we are insering into a random location in the free space tree.
>
> > /*
> > * Delete the entry from the by-size btree.
> > */
> > @@ -2044,6 +2051,13 @@ xfs_free_ag_extent(
> > * Now allocate and initialize a cursor for the by-size tree.
> > */
> > cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> > + /*
> > + * Record the potential maximum free length in advance.
> > + */
> > + if (haveleft)
> > + cnt_cur->bc_ag.bc_free_longest = ltlen;
> > + if (haveright)
> > + cnt_cur->bc_ag.bc_free_longest = gtlen;
>
> That doesn't look correct. At this point in the code, ltlen/gtlen
> are the sizes of the physically adjacent freespace extents that we
> are going to merge the newly freed extent with. i.e. the new
> freespace extent is going to have one of 4 possible values:
>
> no merge: len
> left merge: ltlen + len
> right merge: gtlen + len
> both merge: ltlen + gtlen + len
>
> So regardless of anything else, this code isn't setting the longest
> freespace extent in teh AGF to the lenght of the longest freespace
> extent in the filesystem.
>
> Which leads me to ask: how did you test this code? This bug should
> have been triggering verifier, repair and scrub failures quite
> quickly with fstests....
>
> > /*
> > * Have both left and right contiguous neighbors.
> > * Merge all three into a single free block.
> > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> > index 6ef5ddd89600..8e7d1e0f1a63 100644
> > --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> > @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec(
> > rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
> > len = rrp->ar_blockcount;
> > } else {
> > - len = 0;
> > + /*
> > + * Update in advance to prevent file creation failure
> > + * for concurrent processes even though there is no
> > + * numrec currently.
> > + * And there's no need to worry as the value that no
> > + * less than bc_free_longest will be inserted later.
> > + */
> > + len = cpu_to_be32(cur->bc_ag.bc_free_longest);
> > }
>
> So this is in the LASTREC_DELREC case when the last record is
> removed from the btree. This is what causes the transient state
> as we do this when deleting a record to trim it and then re-insert
> the remainder back into the by-count btree.
>
> Writing some random transient value into the AGF *and journalling
> it* means we creating a transient on-disk format structure
> corruption and potentially writing it to persistent storage (i.e.
> the journal). The structure is, at least, not consistent in memory
> because the free space tree is empty at this point in time, whilst
> the agf->longest field says it has a free space available. This
> could trip verifiers, be flagged as corruption by xfs_scrub/repair,
> etc.
>
> Now, this *might be safe* because we *may* clean it up later in the
> transaction, but if this really is the last extent being removed
> from the btree and a cursor has previously been used to do other
> insert and free operations that set this field, then we trip over
> this stale inforamtion and write a corrupt structure to disk. That's
> not good.
>
> As I said above, this "last record tracking" needs to be ripped out
> of the generic btree code because only the by-count btree uses it.
> Then it can be updated at the end of the by-count btree update
> process in the allocation code (i.e. after all record manipulations
> are done in the transaction) and that avoids this transient caused
> by updating the last record on every btree record update that is
> done.
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
>

2024-06-05 02:52:28

by Zizhi Wo

[permalink] [raw]
Subject: Re: [PATCH] xfs: Fix file creation failure



在 2024/6/5 7:00, Dave Chinner 写道:
> On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote:
>> We have an xfs image that contains only 2 AGs, the first AG is full and
>> the second AG is empty, then a concurrent file creation and little writing
>> could unexpectedly return -ENOSPC error since there is a race window that
>> the allocator could get the wrong agf->agf_longest.
>>
>> Write file process steps:
>> 1) Find the entry that best meets the conditions, then calculate the start
>> address and length of the remaining part of the entry after allocation.
>> 2) Delete this entry. Because the second AG is empty, the btree in its agf
>> has only one record, and agf->agf_longest will be set to 0 after deletion.
>> 3) Insert the remaining unused parts of this entry based on the
>> calculations in 1), and update the agf->agf_longest.
>>
>> Create file process steps:
>> 1) Check whether there are free inodes in the inode chunk.
>> 2) If there is no free inode, check whether there has space for creating
>> inode chunks, perform the no-lock judgment first.
>> 3) If the judgment succeeds, the judgment is performed again with agf lock
>> held. Otherwire, an error is returned directly.
>>
>> If the write process is in step 2) but not go to 3) yet, the create file
>> process goes to 2) at this time, it will be mistaken for no space,
>> resulting in the file system still has space but the file creation fails.
>>
>> Direct write Create file
>> xfs_file_write_iter
>> ...
>> xfs_direct_write_iomap_begin
>> xfs_iomap_write_direct
>> ...
>> xfs_alloc_ag_vextent_near
>> xfs_alloc_cur_finish
>> xfs_alloc_fixup_trees
>> xfs_btree_delete
>> xfs_btree_delrec
>> xfs_allocbt_update_lastrec
>> // longest = 0 because numrec == 0.
>> agf->agf_longest = len = 0
>> xfs_create
>> ...
>> xfs_dialloc
>> ...
>> xfs_alloc_fix_freelist
>> xfs_alloc_space_available
>> -> as longest=0, it will return
>> false, no space for inode alloc.
>
> Ok, so this is a another attempt to address the problem Ye Bin
> attempted to fix here:
>
> https://lore.kernel.org/linux-xfs/[email protected]/
>
>> Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t
>> structure to store the potential longest count that will be updated. The
>> assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent().
>
> I outlined how this should be fixed in the above thread:
>
> https://lore.kernel.org/linux-xfs/[email protected]/
>
> This is what I said:
>
> | What we actually want is for pag->pagf_longest not to change
> | transiently to zero in xfs_alloc_fixup_trees(). If the delrec that
> | zeroes the pagf_longest field is going to follow it up with an
> | insrec that will set it back to a valid value, we really should not
> | be doing the zeroing in the first place.
> |
> | Further, the only btree that tracks the right edge of the btree is
> | the by-count allocbt. This isn't "generic" functionality, even
> | though it is implemented through the generic btree code. If we lift
> | ->update_lastrec from the generic code and do it directly in
> | xfs_alloc.c whenever we are finished with a by-count tree update
> | and the cursor points to a record in the right-most leaf of the
> | tree, then we run the lastrec update directly at that point.
> | By decoupling the lastrec updates from the individual record
> | manipulations, we can make the transients disappear completely.
>
> I'm not sure if this patch is an attempt to implement this - there
> is no reference in the commit description to this previous attempt
> to fix the issue, nor is the any discussion of why this particular
> solution was chosen.
>
> In future, when you are trying to fix an issue that has previously
> been discussed/presented on the list, please reference it and
> provide a link to the previous discussions in the changelog for the
> new version of the patchset fixing the issue.

Oh, I'm sorry for the confusion I caused you. And I will reference it
next time.

>
>> Reported by: Ye Bin <[email protected]>
>> Signed-off-by: Zizhi Wo <[email protected]>
>> ---
>> fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++
>> fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++-
>> fs/xfs/libxfs/xfs_btree.h | 1 +
>> 3 files changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>> index 6c55a6e88eba..86ba873d57a8 100644
>> --- a/fs/xfs/libxfs/xfs_alloc.c
>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>> @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees(
>> nfbno2 = rbno + rlen;
>> nflen2 = (fbno + flen) - nfbno2;
>> }
>> +
>> + /*
>> + * Record the potential maximum free length in advance.
>> + */
>> + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK)
>> + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2);
>> +
>
> Why do we store the length of a random extent being freed here?
> nflen1/2 almost always have nothing to do with the longest free
> space extent in the tree, they are just the new free space extents
> we are insering into a random location in the free space tree.
>

First of all, there may be ambiguity in the name of the bc_free_longest
field. I'm sorry for that. Its only role is to give the longest non-0 in
a particular scenario.

Yes, nflen1/2 can't determine the subsequent operation, but they are
used to update the longest record only if the numrec in cntbt is zero,
the last has been deleted and a new record will be added soon (that is,
there is still space left on the file system), and that is their only
function. So at this time nflen1/2 are not random extent, they indicate
the maximum value to be inserted later. If cntbt does not need to be
updated longest or the numrec is not zero, then bc_free_longest will not
be used to update the longest.

>> /*
>> * Delete the entry from the by-size btree.
>> */
>> @@ -2044,6 +2051,13 @@ xfs_free_ag_extent(
>> * Now allocate and initialize a cursor for the by-size tree.
>> */
>> cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
>> + /*
>> + * Record the potential maximum free length in advance.
>> + */
>> + if (haveleft)
>> + cnt_cur->bc_ag.bc_free_longest = ltlen;
>> + if (haveright)
>> + cnt_cur->bc_ag.bc_free_longest = gtlen;
>
> That doesn't look correct. At this point in the code, ltlen/gtlen
> are the sizes of the physically adjacent freespace extents that we
> are going to merge the newly freed extent with. i.e. the new
> freespace extent is going to have one of 4 possible values:
>
> no merge: len
> left merge: ltlen + len
> right merge: gtlen + len
> both merge: ltlen + gtlen + len
>
> So regardless of anything else, this code isn't setting the longest
> freespace extent in teh AGF to the lenght of the longest freespace
> extent in the filesystem.

> Which leads me to ask: how did you test this code? This bug should
> have been triggering verifier, repair and scrub failures quite
> quickly with fstests....
>

The logic I'm considering here is that the record is less than or equal
to the maximum value that will be updated soon, as I wrote "potentially"
in the comment. And consider the following two scenarios:
1) If it is no merge, then haveleft == 0 && haveright == 0, and
bc_free_longest will not be assigned, and is no need to worry about the
longest update at this time.
2) If it is in merge scenario, only updating the original values here,
and the actual updates are put into the subsequent xfs_btree_insert().
There is no need to worry about atomicity, both are carried out in the
same transaction. All we have to do is the longest non-0. As long as the
fast path judgment without locking passes, the longest must be updated
to the correct value during the second lock judgment.

I tested this part of the code, passed xfstests, and local validation
found no problems.

>> /*
>> * Have both left and right contiguous neighbors.
>> * Merge all three into a single free block.
>> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
>> index 6ef5ddd89600..8e7d1e0f1a63 100644
>> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
>> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
>> @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec(
>> rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
>> len = rrp->ar_blockcount;
>> } else {
>> - len = 0;
>> + /*
>> + * Update in advance to prevent file creation failure
>> + * for concurrent processes even though there is no
>> + * numrec currently.
>> + * And there's no need to worry as the value that no
>> + * less than bc_free_longest will be inserted later.
>> + */
>> + len = cpu_to_be32(cur->bc_ag.bc_free_longest);
>> }
>
> So this is in the LASTREC_DELREC case when the last record is
> removed from the btree. This is what causes the transient state
> as we do this when deleting a record to trim it and then re-insert
> the remainder back into the by-count btree.
>
> Writing some random transient value into the AGF *and journalling
> it* means we creating a transient on-disk format structure
> corruption and potentially writing it to persistent storage (i.e.
> the journal). The structure is, at least, not consistent in memory
> because the free space tree is empty at this point in time, whilst
> the agf->longest field says it has a free space available. This
> could trip verifiers, be flagged as corruption by xfs_scrub/repair,
> etc.
>

I'm sorry, but I didn't find the problem during my own screening. In my
opinion, because the trigger scenario for the current problem is only to
delete the last node and be updated shortly, and bc_free_longest is used
only in the following two scenarios:
1) cntbt has only one extent and remains after being used, so nflen 1/2
will be inserted later.
2) cntbt has only one extent and the released extent is adjacent to this
record. This unique record will be deleted firstly, and then the two
extents are merged and inserted.

The above two scenarios are both within the same transaction, and both
are guaranteed by a xfs_buf lock. When run xfs_trans_commit, code have
gone through the delete and insert process, or merge and update process.
So the longest value written to the disk is already the correct value
and matches the cntbt state at this time. In other scenarios, the numrec
of cntbt cannot be 0, so the longest cannot be updated through
bc_free_longest.

Thanks,
Zizhi Wo

> Now, this *might be safe* because we *may* clean it up later in the
> transaction, but if this really is the last extent being removed
> from the btree and a cursor has previously been used to do other
> insert and free operations that set this field, then we trip over
> this stale inforamtion and write a corrupt structure to disk. That's
> not good.
>
> As I said above, this "last record tracking" needs to be ripped out
> of the generic btree code because only the by-count btree uses it.
> Then it can be updated at the end of the by-count btree update
> process in the allocation code (i.e. after all record manipulations
> are done in the transaction) and that avoids this transient caused
> by updating the last record on every btree record update that is
> done.
>
> Cheers,
>
> Dave.

2024-06-05 03:38:36

by Zizhi Wo

[permalink] [raw]
Subject: Re: [PATCH] xfs: Fix file creation failure



在 2024/6/4 23:56, Darrick J. Wong 写道:
> On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote:
>> We have an xfs image that contains only 2 AGs, the first AG is full and
>> the second AG is empty, then a concurrent file creation and little writing
>> could unexpectedly return -ENOSPC error since there is a race window that
>> the allocator could get the wrong agf->agf_longest.
>>
>> Write file process steps:
>> 1) Find the entry that best meets the conditions, then calculate the start
>> address and length of the remaining part of the entry after allocation.
>> 2) Delete this entry. Because the second AG is empty, the btree in its agf
>> has only one record, and agf->agf_longest will be set to 0 after deletion.
>> 3) Insert the remaining unused parts of this entry based on the
>> calculations in 1), and update the agf->agf_longest.
>>
>> Create file process steps:
>> 1) Check whether there are free inodes in the inode chunk.
>> 2) If there is no free inode, check whether there has space for creating
>> inode chunks, perform the no-lock judgment first.
>> 3) If the judgment succeeds, the judgment is performed again with agf lock
>> held. Otherwire, an error is returned directly.
>>
>> If the write process is in step 2) but not go to 3) yet, the create file
>> process goes to 2) at this time, it will be mistaken for no space,
>> resulting in the file system still has space but the file creation fails.
>>
>> Direct write Create file
>> xfs_file_write_iter
>> ...
>> xfs_direct_write_iomap_begin
>> xfs_iomap_write_direct
>> ...
>> xfs_alloc_ag_vextent_near
>> xfs_alloc_cur_finish
>> xfs_alloc_fixup_trees
>> xfs_btree_delete
>> xfs_btree_delrec
>> xfs_allocbt_update_lastrec
>> // longest = 0 because numrec == 0.
>> agf->agf_longest = len = 0
>> xfs_create
>> ...
>> xfs_dialloc
>> ...
>> xfs_alloc_fix_freelist
>> xfs_alloc_space_available
>> -> as longest=0, it will return
>> false, no space for inode alloc.
>>
>> Fix this issue by adding the bc_free_longest field to the xfs_btree_cur_t
>> structure to store the potential longest count that will be updated. The
>> assignment is done in xfs_alloc_fixup_trees() and xfs_free_ag_extent().
>
> This is going to be a reverse-order review due to the way that diff
> ordered the chunks, which means that the bulk of my questions are at the
> end.
>
>> Reported by: Ye Bin <[email protected]>
>> Signed-off-by: Zizhi Wo <[email protected]>
>> ---
>> fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++
>> fs/xfs/libxfs/xfs_alloc_btree.c | 9 ++++++++-
>> fs/xfs/libxfs/xfs_btree.h | 1 +
>> 3 files changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>> index 6c55a6e88eba..86ba873d57a8 100644
>> --- a/fs/xfs/libxfs/xfs_alloc.c
>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>> @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees(
>> nfbno2 = rbno + rlen;
>> nflen2 = (fbno + flen) - nfbno2;
>> }
>> +
>> + /*
>> + * Record the potential maximum free length in advance.
>> + */
>> + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK)
>> + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2);
>
> Ok, so if we're allocating space then this sets bc_free_longest to the
> longer of the two remaining sections, if any. But if we just allocated
> the entirety of the longest extent in the cntbt, then we don't set
> bc_free_longest, which means its zero, right? I guess that's ok because
> that implies there's zero space left in the AG, so the longest freespace
> is indeed zero.
>
> If we just allocated the entirety of a non-longest extent in the cntbt
> then we don't call ->lastrec_update so the value of bc_free_longest
> doesn't matter?

Yes, absolutely right! Thank you!φ(゜▽゜*)♪

>
>> +
>> /*
>> * Delete the entry from the by-size btree.
>> */
>> @@ -2044,6 +2051,13 @@ xfs_free_ag_extent(
>> * Now allocate and initialize a cursor for the by-size tree.
>> */
>> cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
>> + /*
>> + * Record the potential maximum free length in advance.
>> + */
>> + if (haveleft)
>> + cnt_cur->bc_ag.bc_free_longest = ltlen;
>> + if (haveright)
>> + cnt_cur->bc_ag.bc_free_longest = gtlen;
>
> What happens in the haveleft && haveright case? Shouldn't
> bc_free_longest be set to ltlen + len + gtlen? You could just push the
> setting of bc_free_longest into the haveleft/haveright code below.

Oh, as I wrote to Dave, the logic I considered here is that the record
is less than or equal to the maximum value. And no need to worry about
that because it will soon be updated in xfs_btree_insert in the problem
triggering scenario.

>
>> /*
>> * Have both left and right contiguous neighbors.
>> * Merge all three into a single free block.
>> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
>> index 6ef5ddd89600..8e7d1e0f1a63 100644
>> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
>> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
>> @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec(
>> rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
>> len = rrp->ar_blockcount;
>> } else {
>> - len = 0;
>> + /*
>> + * Update in advance to prevent file creation failure
>> + * for concurrent processes even though there is no
>> + * numrec currently.
>> + * And there's no need to worry as the value that no
>> + * less than bc_free_longest will be inserted later.
>> + */
>> + len = cpu_to_be32(cur->bc_ag.bc_free_longest);
>
> Humm. In this case, we've called ->update_lastrec on the cntbt cursor
> having deleted all the records in this record block. Presumably that
> means that we're going to add rec->alloc.ar_blockcount blocks to the
> rightmost record in the left sibling of @block? Or already have?
>

In normal delete operations, cntbt will have a balancing process, moving
data from other nodes or merging to ensure that numrecs >= get_minrecs.
In this scenario, the cntbt is already an -empty- tree, and is in a
temporary state, new values will be inserted later.

> Ahh, right, the pagf_longest checks are done without holding AGF lock.
> The concurrent creat call sees this intermediate state (DELREC sets
> pagf_longest to zero, a moment later INSREC/UPDATE set it to the correct
> nonzero value) and decides to ENOSPC because "nobody" has sufficient
> free space.
>
> I think this phony zero never gets written to disk because although
> we're logging zero into the ondisk and incore agf_longest here, the next
> btree operation will reset it to the correct value. Right?

Yes, this phony zero will not be recorded to disk. It is just a
temporary condition.

>
> Would it be simpler to handle this case by duplicating the cntbt cursor
> and walking one record leftward in the tree to find the longest extent,
> rather than using this "bc_free_longest" variable?
>

In my opinion, this does not solve the problem. As mentioned above, at
this point the cntbt is an -empty- tree with no records.

>> }
>>
>> break;
>> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
>> index f93374278aa1..985b1885a643 100644
>> --- a/fs/xfs/libxfs/xfs_btree.h
>> +++ b/fs/xfs/libxfs/xfs_btree.h
>> @@ -281,6 +281,7 @@ struct xfs_btree_cur
>> struct xfs_perag *pag;
>> struct xfs_buf *agbp;
>> struct xbtree_afakeroot *afake; /* for staging cursor */
>> + xfs_extlen_t bc_free_longest; /* potential longest free space */
>
> This is only used for bnobt/cntbt trees, put it in the per-format
> private data area, please.
>

OK, I will modify it. More specifically, it is only used for cntbt,
because currently only cntbt can set the XFS_BTGEO_LASTREC_UPDATE flag,
and can call ->update_lastrec.

> If the answer to the question about duplicating the btree cursor is "no"
> then I think this deserves a much longer comment that captures the fact
> that the variable exists to avoid setting pagf_longest to zero for
> benefit of the code that does unlocked scanning of AGs for free space.
>
> I also wonder if the unlocked ag scan should just note that it observed
> a zero pagf_longest and if no space can be found across all the AGs, to
> try again with locks?
>
> --D

Currently xfs checks the space using the "check-lock-check again"
algorithm, which I understand to be more efficient. If there are a large
number of AG's and there is no free space in front of them, the
performance may be affected by checking the lock again. So I think
targeting specific AG might be more effective. Although the current code
process has a retry mechanism (in xfs_dialloc), it still can't
completely solve the problem: for example, there is no space for the
first scan, and the second scan has space but the longest is 0 in the
temporary state and return -ENOSPC, etc...

Thanks,
Zizhi Wo

>
>> } bc_ag;
>> struct {
>> struct xfbtree *xfbtree;
>> --
>> 2.39.2
>>
>>

2024-06-05 06:56:24

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] xfs: Fix file creation failure

On Wed, Jun 05, 2024 at 10:51:31AM +0800, Zizhi Wo wrote:
> 在 2024/6/5 7:00, Dave Chinner 写道:
> > On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote:
> > > We have an xfs image that contains only 2 AGs, the first AG is full and
> > > the second AG is empty, then a concurrent file creation and little writing
> > > could unexpectedly return -ENOSPC error since there is a race window that
> > > the allocator could get the wrong agf->agf_longest.
.....
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > index 6c55a6e88eba..86ba873d57a8 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees(
> > > nfbno2 = rbno + rlen;
> > > nflen2 = (fbno + flen) - nfbno2;
> > > }
> > > +
> > > + /*
> > > + * Record the potential maximum free length in advance.
> > > + */
> > > + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK)
> > > + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2);
> > > +
> >
> > Why do we store the length of a random extent being freed here?
> > nflen1/2 almost always have nothing to do with the longest free
> > space extent in the tree, they are just the new free space extents
> > we are insering into a random location in the free space tree.
> >
>
> First of all, there may be ambiguity in the name of the bc_free_longest
> field. I'm sorry for that. Its only role is to give the longest non-0 in
> a particular scenario.
>
> Yes, nflen1/2 can't determine the subsequent operation, but they are
> used to update the longest record only if the numrec in cntbt is zero,
> the last has been deleted and a new record will be added soon (that is,
> there is still space left on the file system), and that is their only
> function. So at this time nflen1/2 are not random extent, they indicate
> the maximum value to be inserted later. If cntbt does not need to be
> updated longest or the numrec is not zero, then bc_free_longest will not
> be used to update the longest.

That's the comment you should have put in the code.

Comments need to explain -why- the code exists, not tell us -what-
the code does. We know the latter from reading the code, we cannot
know the former from reading the code...

> > > /*
> > > * Delete the entry from the by-size btree.
> > > */
> > > @@ -2044,6 +2051,13 @@ xfs_free_ag_extent(
> > > * Now allocate and initialize a cursor for the by-size tree.
> > > */
> > > cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
> > > + /*
> > > + * Record the potential maximum free length in advance.
> > > + */
> > > + if (haveleft)
> > > + cnt_cur->bc_ag.bc_free_longest = ltlen;
> > > + if (haveright)
> > > + cnt_cur->bc_ag.bc_free_longest = gtlen;
> >
> > That doesn't look correct. At this point in the code, ltlen/gtlen
> > are the sizes of the physically adjacent freespace extents that we
> > are going to merge the newly freed extent with. i.e. the new
> > freespace extent is going to have one of 4 possible values:
> >
> > no merge: len
> > left merge: ltlen + len
> > right merge: gtlen + len
> > both merge: ltlen + gtlen + len
> >
> > So regardless of anything else, this code isn't setting the longest
> > freespace extent in teh AGF to the lenght of the longest freespace
> > extent in the filesystem.
>
> > Which leads me to ask: how did you test this code? This bug should
> > have been triggering verifier, repair and scrub failures quite
> > quickly with fstests....
> >
>
> The logic I'm considering here is that the record is less than or equal
> to the maximum value that will be updated soon, as I wrote "potentially"
> in the comment. And consider the following two scenarios:
> 1) If it is no merge, then haveleft == 0 && haveright == 0, and
> bc_free_longest will not be assigned, and is no need to worry about the
> longest update at this time.
> 2) If it is in merge scenario, only updating the original values here,
> and the actual updates are put into the subsequent xfs_btree_insert().
> There is no need to worry about atomicity, both are carried out in the
> same transaction. All we have to do is the longest non-0. As long as the
> fast path judgment without locking passes, the longest must be updated
> to the correct value during the second lock judgment.

And therein lies the problem. We've learnt the had way not to do
partial state updates that might end up on disk even within
transactions. At some point in the future, we'll change the way the
code works, not realising that there's an inconsistent transient
state being logged, and some time after that we'll have occasional
reports of weird failures with values on disk or in the journal we
cannot explain.

> I tested this part of the code, passed xfstests, and local validation
> found no problems.

yeah, fstests won't expose the inconsistent state *right now*; the
problem is that we've just left a landmine for future developers to
step on.

It's also really difficult to follow - you've added non-obvious
coupling between the free space btree updates and the AGF updates
via a field held in a btree cursor. This essentially means that all
this stuff has to occur within the context of a single btree cursor,
and that btree cursor cannot be re-used for further operations
because this field is not reset by things like new lookups.

Then there is the question of what happens if we have duplicated the
cursor for a specific btree record operation, and the field is set
in the duplicated cursor. The core btree operations do this in
several places because they want to retain a cursor to the original
poistion, but the specific operation that needs to be performed will
change the cursor position (e.g. shifts, splits, merges, etc). Hence
there's no guarantee that a single cursor is used for all the
operations in a single btree modification, and hence storing
information in cursors that is supposed to persist until some other
btree modification method is run is asking for trouble.

Hence, IMO, coupling allocation btree coherency operations via the
btree cursor is something we should avoid at all costs. That's why I
keep saying the last record update for the by-count/AGF longest
needs to be moved outside the generic btree code itself. The
coherency and coupling needs to be directly controlled by the high
level alloc code, not by trying to punch special data holes through
the generic btree abstractions.

> > > /*
> > > * Have both left and right contiguous neighbors.
> > > * Merge all three into a single free block.
> > > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> > > index 6ef5ddd89600..8e7d1e0f1a63 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> > > @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec(
> > > rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
> > > len = rrp->ar_blockcount;
> > > } else {
> > > - len = 0;
> > > + /*
> > > + * Update in advance to prevent file creation failure
> > > + * for concurrent processes even though there is no
> > > + * numrec currently.
> > > + * And there's no need to worry as the value that no
> > > + * less than bc_free_longest will be inserted later.
> > > + */
> > > + len = cpu_to_be32(cur->bc_ag.bc_free_longest);
> > > }
> >
> > So this is in the LASTREC_DELREC case when the last record is
> > removed from the btree. This is what causes the transient state
> > as we do this when deleting a record to trim it and then re-insert
> > the remainder back into the by-count btree.
> >
> > Writing some random transient value into the AGF *and journalling
> > it* means we creating a transient on-disk format structure
> > corruption and potentially writing it to persistent storage (i.e.
> > the journal). The structure is, at least, not consistent in memory
> > because the free space tree is empty at this point in time, whilst
> > the agf->longest field says it has a free space available. This
> > could trip verifiers, be flagged as corruption by xfs_scrub/repair,
> > etc.
> >
>
> I'm sorry, but I didn't find the problem during my own screening. In my
> opinion, because the trigger scenario for the current problem is only to
> delete the last node and be updated shortly, and bc_free_longest is used
> only in the following two scenarios:
> 1) cntbt has only one extent and remains after being used, so nflen 1/2
> will be inserted later.
> 2) cntbt has only one extent and the released extent is adjacent to this
> record. This unique record will be deleted firstly, and then the two
> extents are merged and inserted.

Yes, I understand what you've done.

But I don't think it is the right way to fix the issue for the
reasons I've given.

I've attached a quick hack (not even compile tested!) to
demonstrate the way I've been suggesting this issue should be fixed
by the removal of the lastrec update code from the generic
btree implementation. It updates pag->pagf_longest and
agf->longest directly from the high level allocation code once the
free space btree manipulations have been completed. We do this in a
context where we control AGF, the perag and the btree cursors
directly, and there is no need to temporarily store anything in the
btree cursors at all.

Feel free to use this code as the basis of future patches to address
this issue.

-Dave.
--
Dave Chinner
[email protected]

xfs: avoid races with btree lastrec updates

From: Dave Chinner <[email protected]>

When we modify the last record in a generic btree, the
calls a function to indicate to the btree implementation that the
operation is modifying the right-most record in the tree. This is
only used by the by-size freespace btree, and it is used to update
the agf->longest field that tracks the longest contiguous free
extent in the AG.

This, however, is racy with respect to external sampling of the
agf->longest field. We may be doing multiple operations on the
by-size freespace tree to make an update - an extent length change
requires deleting the current extent, then inserting a new extent
with the new size. When the current extent is the only free space
extent in the AG (e.g. it is entirely empty), we end up with a
transient state where there is no free space extents in the btree.

By the time the modification completes, however, there is once again
a free space extent in the tree, and the agf->longest value gets set
to the correct size again. The issue is that external sampling of
agf->longest during allocation AG selection will see the extent as
having no space free, and so will skip it. This can lead to
transient ENOSPC errors when there is an entire AG of free space
available for use.

To fix this, remove the lastrec update triggers from the generic
btree code. Move the by-size last record update code to the end of
the by-size btree modifications, such that it is only changed once
during a free space extent allocation or freeing operation. This
removes the externally visible transient empty state, and so avoids
the transient, erroneous ENOSPC conditions that can currently occur.

XXX: the detection of the last record update being required could
probably be improved - just checking for "in the last block" works
just fine, and we are already going to be logging the AGF for the
free space counter updates, so it might just be that always doing
the update when we land in the last block isn't such a big deal
performance wise.

---
fs/xfs/libxfs/xfs_alloc.c | 94 +++++++++++++++++++++++++++++++++++++++++
fs/xfs/libxfs/xfs_alloc_btree.c | 63 ---------------------------
fs/xfs/libxfs/xfs_btree.c | 51 ----------------------
fs/xfs/libxfs/xfs_btree.h | 6 ---
4 files changed, 94 insertions(+), 120 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 35fbd6b19682..f256b7724cef 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -463,6 +463,76 @@ xfs_alloc_fix_len(
args->len = rlen;
}

+/*
+ * Determine if the cursor points to the block that contains the right-most
+ * block of records in the by-count btree. This block contains the largest
+ * contiguous free extent in the AG, so if we modify ia record in this block we
+ * need to call xfs_alloc_fixup_longest() once the modifications are done to
+ * ensure the agf->agf_longest field is kept up to date with the longest free
+ * extent tracked by the by-count btree.
+ */
+static bool
+xfs_alloc_cursor_at_lastrec(
+ struct xfs_btree_cur *cnt_cur)
+{
+ struct xfs_btree_block *block;
+ union xfs_btree_ptr ptr;
+ struct xfs_buf *bp;
+
+ block = xfs_btree_get_block(cnt_cur, 0, &bp);
+
+ xfs_btree_get_sibling(cnt_cur, block, &ptr, XFS_BB_RIGHTSIB);
+ if (!xfs_btree_ptr_is_null(cnt_cur, &ptr))
+ return false;
+ return true;
+}
+
+/*
+ * Update the longest contiguous free extent in the AG from the by-count cursor
+ * that is passed to us. This should be done at the end of any allocation or
+ * freeing operation that touches the longest extent in the btree.
+ *
+ * Needing to update the longest extent can be determined by calling
+ * xfs_alloc_cursor_at_lastrec() after the cursor is positioned for record
+ * modification but before the modification begins.
+ */
+static int
+xfs_alloc_fixup_longest(
+ struct xfs_btree_cur *cnt_cur)
+{
+ struct xfs_perag *pag = cnt_cur->bc_ag->pag;
+ struct xfs_agf *agf;
+ xfs_agblock_t bno;
+ xfs_extlen_t len;
+ int error;
+ int i;
+
+ /* lookup last rec and update AGF */
+ error = xfs_alloc_lookup_le(cnt_cur, 0, pag->pagf_longest, &i);
+ if (error)
+ return error;
+ if (i == 0) {
+ /* empty tree */
+ pag->pagf_longest = 0;
+ } else {
+ error = xfs_alloc_get_rec(cnt_cur, &bno, &len, &i);
+ if (error)
+ return error;
+ if (XFS_IS_CORRUPT(mp, i != 1)) {
+ xfs_btree_mark_sick(cnt_cur);
+ return -EFSCORRUPTED;
+ }
+ pag->pagf_longest = nflen1;
+ }
+
+
+ agf = XFS_BUF_TO_AGF(cur->bc_ag.agbp);
+ agf->agf_longest = cpu_to_be32(pag->pagf_longest);
+ xfs_alloc_log_agf(cur->bc_tp, cur->bc_ag.agbp, XFS_AGF_LONGEST);
+
+ return 0;
+}
+
/*
* Update the two btrees, logically removing from freespace the extent
* starting at rbno, rlen blocks. The extent is contained within the
@@ -487,6 +557,7 @@ xfs_alloc_fixup_trees(
xfs_extlen_t nflen1=0; /* first new free length */
xfs_extlen_t nflen2=0; /* second new free length */
struct xfs_mount *mp;
+ bool fixup_longest = false;

mp = cnt_cur->bc_mp;

@@ -575,9 +646,13 @@ xfs_alloc_fixup_trees(
nfbno2 = rbno + rlen;
nflen2 = (fbno + flen) - nfbno2;
}
+
/*
* Delete the entry from the by-size btree.
*/
+ if (xfs_alloc_cursor_at_lastrec(cnt_cur))
+ fixup_longest = true;
+
if ((error = xfs_btree_delete(cnt_cur, &i)))
return error;
if (XFS_IS_CORRUPT(mp, i != 1)) {
@@ -615,6 +690,7 @@ xfs_alloc_fixup_trees(
return -EFSCORRUPTED;
}
}
+
/*
* Fix up the by-block btree entry(s).
*/
@@ -652,6 +728,10 @@ xfs_alloc_fixup_trees(
return -EFSCORRUPTED;
}
}
+
+ if (fixup_longest)
+ return xfs_alloc_fixup_longest(cnt_cur);
+
return 0;
}

@@ -1954,6 +2034,7 @@ xfs_free_ag_extent(
int i;
int error;
struct xfs_perag *pag = agbp->b_pag;
+ bool fixup_longest = false;

bno_cur = cnt_cur = NULL;
mp = tp->t_mountp;
@@ -2217,8 +2298,13 @@ xfs_free_ag_extent(
}
xfs_btree_del_cursor(bno_cur, XFS_BTREE_NOERROR);
bno_cur = NULL;
+
/*
* In all cases we need to insert the new freespace in the by-size tree.
+ *
+ * If this new freespace is being inserted in the block that contains
+ * the largest free space in the btree, make sure we also fix up the
+ * agf->agf-longest tracker field.
*/
if ((error = xfs_alloc_lookup_eq(cnt_cur, nbno, nlen, &i)))
goto error0;
@@ -2227,6 +2313,8 @@ xfs_free_ag_extent(
error = -EFSCORRUPTED;
goto error0;
}
+ if (xfs_alloc_cursor_at_lastrec(cnt_cur))
+ fixup_longest = true;
if ((error = xfs_btree_insert(cnt_cur, &i)))
goto error0;
if (XFS_IS_CORRUPT(mp, i != 1)) {
@@ -2234,6 +2322,12 @@ xfs_free_ag_extent(
error = -EFSCORRUPTED;
goto error0;
}
+ if (fixup_longest) {
+ error = xfs_alloc_fixup_longest(cnt_cur);
+ if (error)
+ goto error0;
+ }
+
xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
cnt_cur = NULL;

diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 6ef5ddd89600..866d822dcfe7 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -115,67 +115,6 @@ xfs_allocbt_free_block(
return 0;
}

-/*
- * Update the longest extent in the AGF
- */
-STATIC void
-xfs_allocbt_update_lastrec(
- struct xfs_btree_cur *cur,
- const struct xfs_btree_block *block,
- const union xfs_btree_rec *rec,
- int ptr,
- int reason)
-{
- struct xfs_agf *agf = cur->bc_ag.agbp->b_addr;
- struct xfs_perag *pag;
- __be32 len;
- int numrecs;
-
- ASSERT(!xfs_btree_is_bno(cur->bc_ops));
-
- switch (reason) {
- case LASTREC_UPDATE:
- /*
- * If this is the last leaf block and it's the last record,
- * then update the size of the longest extent in the AG.
- */
- if (ptr != xfs_btree_get_numrecs(block))
- return;
- len = rec->alloc.ar_blockcount;
- break;
- case LASTREC_INSREC:
- if (be32_to_cpu(rec->alloc.ar_blockcount) <=
- be32_to_cpu(agf->agf_longest))
- return;
- len = rec->alloc.ar_blockcount;
- break;
- case LASTREC_DELREC:
- numrecs = xfs_btree_get_numrecs(block);
- if (ptr <= numrecs)
- return;
- ASSERT(ptr == numrecs + 1);
-
- if (numrecs) {
- xfs_alloc_rec_t *rrp;
-
- rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
- len = rrp->ar_blockcount;
- } else {
- len = 0;
- }
-
- break;
- default:
- ASSERT(0);
- return;
- }
-
- agf->agf_longest = len;
- pag = cur->bc_ag.agbp->b_pag;
- pag->pagf_longest = be32_to_cpu(len);
- xfs_alloc_log_agf(cur->bc_tp, cur->bc_ag.agbp, XFS_AGF_LONGEST);
-}
-
STATIC int
xfs_allocbt_get_minrecs(
struct xfs_btree_cur *cur,
@@ -493,7 +432,6 @@ const struct xfs_btree_ops xfs_bnobt_ops = {
.set_root = xfs_allocbt_set_root,
.alloc_block = xfs_allocbt_alloc_block,
.free_block = xfs_allocbt_free_block,
- .update_lastrec = xfs_allocbt_update_lastrec,
.get_minrecs = xfs_allocbt_get_minrecs,
.get_maxrecs = xfs_allocbt_get_maxrecs,
.init_key_from_rec = xfs_allocbt_init_key_from_rec,
@@ -525,7 +463,6 @@ const struct xfs_btree_ops xfs_cntbt_ops = {
.set_root = xfs_allocbt_set_root,
.alloc_block = xfs_allocbt_alloc_block,
.free_block = xfs_allocbt_free_block,
- .update_lastrec = xfs_allocbt_update_lastrec,
.get_minrecs = xfs_allocbt_get_minrecs,
.get_maxrecs = xfs_allocbt_get_maxrecs,
.init_key_from_rec = xfs_allocbt_init_key_from_rec,
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index a119cbf71df1..c727dbcfebee 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -1331,30 +1331,6 @@ xfs_btree_init_block_cur(
xfs_btree_owner(cur));
}

-/*
- * Return true if ptr is the last record in the btree and
- * we need to track updates to this record. The decision
- * will be further refined in the update_lastrec method.
- */
-STATIC int
-xfs_btree_is_lastrec(
- struct xfs_btree_cur *cur,
- struct xfs_btree_block *block,
- int level)
-{
- union xfs_btree_ptr ptr;
-
- if (level > 0)
- return 0;
- if (!(cur->bc_ops->geom_flags & XFS_BTGEO_LASTREC_UPDATE))
- return 0;
-
- xfs_btree_get_sibling(cur, block, &ptr, XFS_BB_RIGHTSIB);
- if (!xfs_btree_ptr_is_null(cur, &ptr))
- return 0;
- return 1;
-}
-
STATIC void
xfs_btree_buf_to_ptr(
struct xfs_btree_cur *cur,
@@ -2476,15 +2452,6 @@ xfs_btree_update(
xfs_btree_copy_recs(cur, rp, rec, 1);
xfs_btree_log_recs(cur, bp, ptr, ptr);

- /*
- * If we are tracking the last record in the tree and
- * we are at the far right edge of the tree, update it.
- */
- if (xfs_btree_is_lastrec(cur, block, 0)) {
- cur->bc_ops->update_lastrec(cur, block, rec,
- ptr, LASTREC_UPDATE);
- }
-
/* Pass new key value up to our parent. */
if (xfs_btree_needs_key_update(cur, ptr)) {
error = xfs_btree_update_keys(cur, 0);
@@ -3786,15 +3753,6 @@ xfs_btree_insrec(
goto error0;
}

- /*
- * If we are tracking the last record in the tree and
- * we are at the far right edge of the tree, update it.
- */
- if (xfs_btree_is_lastrec(cur, block, level)) {
- cur->bc_ops->update_lastrec(cur, block, rec,
- ptr, LASTREC_INSREC);
- }
-
/*
* Return the new block number, if any.
* If there is one, give back a record value and a cursor too.
@@ -4152,15 +4110,6 @@ xfs_btree_delrec(
xfs_btree_set_numrecs(block, --numrecs);
xfs_btree_log_block(cur, bp, XFS_BB_NUMRECS);

- /*
- * If we are tracking the last record in the tree and
- * we are at the far right edge of the tree, update it.
- */
- if (xfs_btree_is_lastrec(cur, block, level)) {
- cur->bc_ops->update_lastrec(cur, block, NULL,
- ptr, LASTREC_DELREC);
- }
-
/*
* We're at the root level. First, shrink the root block in-memory.
* Try to get rid of the next level down. If we can't then there's
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index f93374278aa1..99c1e4a02556 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -154,12 +154,6 @@ struct xfs_btree_ops {
int *stat);
int (*free_block)(struct xfs_btree_cur *cur, struct xfs_buf *bp);

- /* update last record information */
- void (*update_lastrec)(struct xfs_btree_cur *cur,
- const struct xfs_btree_block *block,
- const union xfs_btree_rec *rec,
- int ptr, int reason);
-
/* records in block/level */
int (*get_minrecs)(struct xfs_btree_cur *cur, int level);
int (*get_maxrecs)(struct xfs_btree_cur *cur, int level);

2024-06-05 08:47:28

by Zizhi Wo

[permalink] [raw]
Subject: Re: [PATCH] xfs: Fix file creation failure



在 2024/6/5 14:40, Dave Chinner 写道:
> On Wed, Jun 05, 2024 at 10:51:31AM +0800, Zizhi Wo wrote:
>> 在 2024/6/5 7:00, Dave Chinner 写道:
>>> On Tue, Jun 04, 2024 at 03:11:21PM +0800, Zizhi Wo wrote:
>>>> We have an xfs image that contains only 2 AGs, the first AG is full and
>>>> the second AG is empty, then a concurrent file creation and little writing
>>>> could unexpectedly return -ENOSPC error since there is a race window that
>>>> the allocator could get the wrong agf->agf_longest.
> .....

Yeah...I know it is amazing...

>>>> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
>>>> index 6c55a6e88eba..86ba873d57a8 100644
>>>> --- a/fs/xfs/libxfs/xfs_alloc.c
>>>> +++ b/fs/xfs/libxfs/xfs_alloc.c
>>>> @@ -577,6 +577,13 @@ xfs_alloc_fixup_trees(
>>>> nfbno2 = rbno + rlen;
>>>> nflen2 = (fbno + flen) - nfbno2;
>>>> }
>>>> +
>>>> + /*
>>>> + * Record the potential maximum free length in advance.
>>>> + */
>>>> + if (nfbno1 != NULLAGBLOCK || nfbno2 != NULLAGBLOCK)
>>>> + cnt_cur->bc_ag.bc_free_longest = XFS_EXTLEN_MAX(nflen1, nflen2);
>>>> +
>>>
>>> Why do we store the length of a random extent being freed here?
>>> nflen1/2 almost always have nothing to do with the longest free
>>> space extent in the tree, they are just the new free space extents
>>> we are insering into a random location in the free space tree.
>>>
>>
>> First of all, there may be ambiguity in the name of the bc_free_longest
>> field. I'm sorry for that. Its only role is to give the longest non-0 in
>> a particular scenario.
>>
>> Yes, nflen1/2 can't determine the subsequent operation, but they are
>> used to update the longest record only if the numrec in cntbt is zero,
>> the last has been deleted and a new record will be added soon (that is,
>> there is still space left on the file system), and that is their only
>> function. So at this time nflen1/2 are not random extent, they indicate
>> the maximum value to be inserted later. If cntbt does not need to be
>> updated longest or the numrec is not zero, then bc_free_longest will not
>> be used to update the longest.
>
> That's the comment you should have put in the code.
>
> Comments need to explain -why- the code exists, not tell us -what-
> the code does. We know the latter from reading the code, we cannot
> know the former from reading the code...
>

I am sorry for the trouble caused by my comments. I have indeed left out
a lot of details, and I will correct it next time. /(ㄒoㄒ)/~~

>>>> /*
>>>> * Delete the entry from the by-size btree.
>>>> */
>>>> @@ -2044,6 +2051,13 @@ xfs_free_ag_extent(
>>>> * Now allocate and initialize a cursor for the by-size tree.
>>>> */
>>>> cnt_cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag);
>>>> + /*
>>>> + * Record the potential maximum free length in advance.
>>>> + */
>>>> + if (haveleft)
>>>> + cnt_cur->bc_ag.bc_free_longest = ltlen;
>>>> + if (haveright)
>>>> + cnt_cur->bc_ag.bc_free_longest = gtlen;
>>>
>>> That doesn't look correct. At this point in the code, ltlen/gtlen
>>> are the sizes of the physically adjacent freespace extents that we
>>> are going to merge the newly freed extent with. i.e. the new
>>> freespace extent is going to have one of 4 possible values:
>>>
>>> no merge: len
>>> left merge: ltlen + len
>>> right merge: gtlen + len
>>> both merge: ltlen + gtlen + len
>>>
>>> So regardless of anything else, this code isn't setting the longest
>>> freespace extent in teh AGF to the lenght of the longest freespace
>>> extent in the filesystem.
>>
>>> Which leads me to ask: how did you test this code? This bug should
>>> have been triggering verifier, repair and scrub failures quite
>>> quickly with fstests....
>>>
>>
>> The logic I'm considering here is that the record is less than or equal
>> to the maximum value that will be updated soon, as I wrote "potentially"
>> in the comment. And consider the following two scenarios:
>> 1) If it is no merge, then haveleft == 0 && haveright == 0, and
>> bc_free_longest will not be assigned, and is no need to worry about the
>> longest update at this time.
>> 2) If it is in merge scenario, only updating the original values here,
>> and the actual updates are put into the subsequent xfs_btree_insert().
>> There is no need to worry about atomicity, both are carried out in the
>> same transaction. All we have to do is the longest non-0. As long as the
>> fast path judgment without locking passes, the longest must be updated
>> to the correct value during the second lock judgment.
>
> And therein lies the problem. We've learnt the had way not to do
> partial state updates that might end up on disk even within
> transactions. At some point in the future, we'll change the way the
> code works, not realising that there's an inconsistent transient
> state being logged, and some time after that we'll have occasional
> reports of weird failures with values on disk or in the journal we
> cannot explain.
>
>> I tested this part of the code, passed xfstests, and local validation
>> found no problems.
>
> yeah, fstests won't expose the inconsistent state *right now*; the
> problem is that we've just left a landmine for future developers to
> step on.
>
> It's also really difficult to follow - you've added non-obvious
> coupling between the free space btree updates and the AGF updates
> via a field held in a btree cursor. This essentially means that all
> this stuff has to occur within the context of a single btree cursor,
> and that btree cursor cannot be re-used for further operations
> because this field is not reset by things like new lookups.
>
> Then there is the question of what happens if we have duplicated the
> cursor for a specific btree record operation, and the field is set
> in the duplicated cursor. The core btree operations do this in
> several places because they want to retain a cursor to the original
> poistion, but the specific operation that needs to be performed will
> change the cursor position (e.g. shifts, splits, merges, etc). Hence
> there's no guarantee that a single cursor is used for all the
> operations in a single btree modification, and hence storing
> information in cursors that is supposed to persist until some other
> btree modification method is run is asking for trouble.
>
> Hence, IMO, coupling allocation btree coherency operations via the
> btree cursor is something we should avoid at all costs. That's why I
> keep saying the last record update for the by-count/AGF longest
> needs to be moved outside the generic btree code itself. The
> coherency and coupling needs to be directly controlled by the high
> level alloc code, not by trying to punch special data holes through
> the generic btree abstractions.
>

Oh, I did not consider the problems you pointed out above, and the
previous revision should be avoided. I fully agree with your opinion.

>>>> /*
>>>> * Have both left and right contiguous neighbors.
>>>> * Merge all three into a single free block.
>>>> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
>>>> index 6ef5ddd89600..8e7d1e0f1a63 100644
>>>> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
>>>> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
>>>> @@ -161,7 +161,14 @@ xfs_allocbt_update_lastrec(
>>>> rrp = XFS_ALLOC_REC_ADDR(cur->bc_mp, block, numrecs);
>>>> len = rrp->ar_blockcount;
>>>> } else {
>>>> - len = 0;
>>>> + /*
>>>> + * Update in advance to prevent file creation failure
>>>> + * for concurrent processes even though there is no
>>>> + * numrec currently.
>>>> + * And there's no need to worry as the value that no
>>>> + * less than bc_free_longest will be inserted later.
>>>> + */
>>>> + len = cpu_to_be32(cur->bc_ag.bc_free_longest);
>>>> }
>>>
>>> So this is in the LASTREC_DELREC case when the last record is
>>> removed from the btree. This is what causes the transient state
>>> as we do this when deleting a record to trim it and then re-insert
>>> the remainder back into the by-count btree.
>>>
>>> Writing some random transient value into the AGF *and journalling
>>> it* means we creating a transient on-disk format structure
>>> corruption and potentially writing it to persistent storage (i.e.
>>> the journal). The structure is, at least, not consistent in memory
>>> because the free space tree is empty at this point in time, whilst
>>> the agf->longest field says it has a free space available. This
>>> could trip verifiers, be flagged as corruption by xfs_scrub/repair,
>>> etc.
>>>
>>
>> I'm sorry, but I didn't find the problem during my own screening. In my
>> opinion, because the trigger scenario for the current problem is only to
>> delete the last node and be updated shortly, and bc_free_longest is used
>> only in the following two scenarios:
>> 1) cntbt has only one extent and remains after being used, so nflen 1/2
>> will be inserted later.
>> 2) cntbt has only one extent and the released extent is adjacent to this
>> record. This unique record will be deleted firstly, and then the two
>> extents are merged and inserted.
>
> Yes, I understand what you've done.
>
> But I don't think it is the right way to fix the issue for the
> reasons I've given.
>
> I've attached a quick hack (not even compile tested!) to
> demonstrate the way I've been suggesting this issue should be fixed
> by the removal of the lastrec update code from the generic
> btree implementation. It updates pag->pagf_longest and
> agf->longest directly from the high level allocation code once the
> free space btree manipulations have been completed. We do this in a
> context where we control AGF, the perag and the btree cursors
> directly, and there is no need to temporarily store anything in the
> btree cursors at all.
>
> Feel free to use this code as the basis of future patches to address
> this issue.
>
> -Dave.

That's amazing! Thanks!! I will read the code carefully and submit the
correct fix for this issue again soon.

Thanks,
Zizhi Wo