2021-03-31 01:33:03

by Hugh Dickins

[permalink] [raw]
Subject: BUG_ON(!mapping_empty(&inode->i_data))

Running my usual tmpfs kernel builds swapping load, on Sunday's rc4-mm1
mmotm (I never got to try rc3-mm1 but presume it behaved the same way),
I hit clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)); on two
machines, within an hour or few, repeatably though not to order.

The stack backtrace has always been clear_inode < ext4_clear_inode <
ext4_evict_inode < evict < dispose_list < prune_icache_sb <
super_cache_scan < do_shrink_slab < shrink_slab_memcg < shrink_slab <
shrink_node_memcgs < shrink_node < balance_pgdat < kswapd.

ext4 is the disk filesystem I read the source to build from, and also
the filesystem I use on a loop device on a tmpfs file: I have not tried
with other filesystems, nor checked whether perhaps it happens always on
the loop one or always on the disk one. I have not seen it happen with
tmpfs - probably because its inodes cannot be evicted by the shrinker
anyway; I have not seen it happen when "rm -rf" evicts ext4 or tmpfs
inodes (but suspect that may be down to timing, or less pressure).
I doubt it's a matter of filesystem: think it's an XArray thing.

Whenever I've looked at the XArray nodes involved, the root node
(shift 6) contained one or three (adjacent) pointers to empty shift
0 nodes, which each had offset and parent and array correctly set.
Is there some way in which empty nodes can get left behind, and so
fail eviction's mapping_empty() check?
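
(For reference, as I read your series, the check reduces to this -
my paraphrase of the new helper, not a quote of the patch:

	static inline bool mapping_empty(struct address_space *mapping)
	{
		return xa_empty(&mapping->i_pages);
	}

and xa_empty() is just a NULL check on xa_head: so a node left
chained under the head fails the check even with count 0.)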

I did wonder whether some might get left behind if xas_alloc() fails
(though probably the tree here is too shallow to show that). Printks
showed that occasionally xas_alloc() did fail while testing (maybe at
memcg limit), but there was no correlation with the BUG_ONs.

I did wonder whether this is a long-standing issue, which your new
BUG_ON is the first to detect: so tried 5.12-rc5 clear_inode() with
a BUG_ON(!xa_empty(&inode->i_data.i_pages)) after its nrpages and
nrexceptional BUG_ONs. The result there surprised me: I expected
it to behave the same way, but it hits that BUG_ON in a minute or
so, instead of an hour or so. Was there a fix you made somewhere
to avoid the BUG_ON(!mapping_empty) most of the time, but which
needs more work? I looked around a little, but didn't find one.
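
(That is, with a debug addition along these lines in 5.12-rc5's
clear_inode() - my reconstruction of what I ran, not an exact patch:

	xa_lock_irq(&inode->i_data.i_pages);
	BUG_ON(inode->i_data.nrpages);
	BUG_ON(inode->i_data.nrexceptional);
	BUG_ON(!xa_empty(&inode->i_data.i_pages));	/* the addition */
	xa_unlock_irq(&inode->i_data.i_pages);

- the first two BUG_ONs being the existing ones.)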

I had hoped to work this out myself, and save us both some writing:
but better hand over to you, in the hope that you'll quickly guess
what's up, then I can try patches. I do like the no-nrexceptionals
series, but there's something still to be fixed.

Hugh


2021-03-31 02:51:24

by Matthew Wilcox

[permalink] [raw]
Subject: Re: BUG_ON(!mapping_empty(&inode->i_data))

On Tue, Mar 30, 2021 at 06:30:22PM -0700, Hugh Dickins wrote:
> Running my usual tmpfs kernel builds swapping load, on Sunday's rc4-mm1
> mmotm (I never got to try rc3-mm1 but presume it behaved the same way),
> I hit clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)); on two
> machines, within an hour or few, repeatably though not to order.
>
> The stack backtrace has always been clear_inode < ext4_clear_inode <
> ext4_evict_inode < evict < dispose_list < prune_icache_sb <
> super_cache_scan < do_shrink_slab < shrink_slab_memcg < shrink_slab <
> shrink_node_memcgs < shrink_node < balance_pgdat < kswapd.
>
> ext4 is the disk filesystem I read the source to build from, and also
> the filesystem I use on a loop device on a tmpfs file: I have not tried
> with other filesystems, nor checked whether perhaps it happens always on
> the loop one or always on the disk one. I have not seen it happen with
> tmpfs - probably because its inodes cannot be evicted by the shrinker
> anyway; I have not seen it happen when "rm -rf" evicts ext4 or tmpfs
> inodes (but suspect that may be down to timing, or less pressure).
> I doubt it's a matter of filesystem: think it's an XArray thing.
>
> Whenever I've looked at the XArray nodes involved, the root node
> (shift 6) contained one or three (adjacent) pointers to empty shift
> 0 nodes, which each had offset and parent and array correctly set.
> Is there some way in which empty nodes can get left behind, and so
> fail eviction's mapping_empty() check?

There isn't _supposed_ to be. The XArray is supposed to delete nodes
whenever the ->count reaches zero. It might give me a clue if you could
share a dump of the tree, if you still have that handy.

> I did wonder whether some might get left behind if xas_alloc() fails
> (though probably the tree here is too shallow to show that). Printks
> showed that occasionally xas_alloc() did fail while testing (maybe at
> memcg limit), but there was no correlation with the BUG_ONs.

This is a problem inherited from the radix tree, and I really want to
justify fixing it ... I think I may have enough infrastructure in place
to do it now (as part of the xas_split() commit we can now allocate
multiple xa_nodes in xas->xa_alloc). But you're right; if we allocated
all the way down to an order-0 node, then this isn't the bug.

Were you using the ALLOW_ERROR_INJECTION feature on
__add_to_page_cache_locked()? I haven't looked into how that works,
and maybe that could leave us in an inconsistent state.

> I did wonder whether this is a long-standing issue, which your new
> BUG_ON is the first to detect: so tried 5.12-rc5 clear_inode() with
> a BUG_ON(!xa_empty(&inode->i_data.i_pages)) after its nrpages and
> nrexceptional BUG_ONs. The result there surprised me: I expected
> it to behave the same way, but it hits that BUG_ON in a minute or
> so, instead of an hour or so. Was there a fix you made somewhere
> to avoid the BUG_ON(!mapping_empty) most of the time, but which
> needs more work? I looked around a little, but didn't find one.

I didn't make a fix for this issue; indeed I haven't observed it myself.
It seems like cgroups are a good way to induce allocation failures, so
I should play around with that a bit. The userspace test-suite has a
relatively malicious allocator that will fail every allocation not marked
as GFP_KERNEL, so it always exercises the fallback path for GFP_NOWAIT,
but then it will always succeed eventually.
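
Something of this shape - a sketch of the idea only, not the actual
test-suite code, and real_cache_alloc() is a stand-in name:

	void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t gfp)
	{
		if (!gfpflags_allow_blocking(gfp))
			return NULL;	/* fail every non-blocking attempt */
		return real_cache_alloc(cachep);
	}

so the GFP_NOWAIT attempt under the lock always fails, and the
GFP_KERNEL retry in xas_nomem() always succeeds.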

> I had hoped to work this out myself, and save us both some writing:
> but better hand over to you, in the hope that you'll quickly guess
> what's up, then I can try patches. I do like the no-nrexceptionals
> series, but there's something still to be fixed.

Agreed. It seems like it's unmasking a bug that already existed, so
it's not an argument for dropping the series, but we should fix the bug
so we don't crash people's machines.

Arguably, the condition being checked for is not serious enough for a
BUG_ON. A WARN_ON, yes, and dump the tree for later perusal, but it's
just a memory leak, and not (I think?) likely to lead to later memory
corruption. The nodes don't contain any pages, so there's nothing to
point to the mapping.

2021-03-31 22:02:44

by Hugh Dickins

[permalink] [raw]
Subject: Re: BUG_ON(!mapping_empty(&inode->i_data))

On Wed, 31 Mar 2021, Matthew Wilcox wrote:
> On Tue, Mar 30, 2021 at 06:30:22PM -0700, Hugh Dickins wrote:
> > Running my usual tmpfs kernel builds swapping load, on Sunday's rc4-mm1
> > mmotm (I never got to try rc3-mm1 but presume it behaved the same way),
> > I hit clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)); on two
> > machines, within an hour or few, repeatably though not to order.
> >
> > The stack backtrace has always been clear_inode < ext4_clear_inode <
> > ext4_evict_inode < evict < dispose_list < prune_icache_sb <
> > super_cache_scan < do_shrink_slab < shrink_slab_memcg < shrink_slab <
> > shrink_node_memcgs < shrink_node < balance_pgdat < kswapd.
> >
> > ext4 is the disk filesystem I read the source to build from, and also
> > the filesystem I use on a loop device on a tmpfs file: I have not tried
> > with other filesystems, nor checked whether perhaps it happens always on
> > the loop one or always on the disk one. I have not seen it happen with
> > tmpfs - probably because its inodes cannot be evicted by the shrinker
> > anyway; I have not seen it happen when "rm -rf" evicts ext4 or tmpfs
> > inodes (but suspect that may be down to timing, or less pressure).
> > I doubt it's a matter of filesystem: think it's an XArray thing.
> >
> > Whenever I've looked at the XArray nodes involved, the root node
> > (shift 6) contained one or three (adjacent) pointers to empty shift
> > 0 nodes, which each had offset and parent and array correctly set.
> > Is there some way in which empty nodes can get left behind, and so
> > fail eviction's mapping_empty() check?
>
> There isn't _supposed_ to be. The XArray is supposed to delete nodes
> whenever the ->count reaches zero. It might give me a clue if you could
> share a dump of the tree, if you still have that handy.

Very useful suggestion: the xa_dump() may not give you more of a clue,
but just running again last night to gather that info has revealed more.

>
> > I did wonder whether some might get left behind if xas_alloc() fails
> > (though probably the tree here is too shallow to show that). Printks
> > showed that occasionally xas_alloc() did fail while testing (maybe at
> > memcg limit), but there was no correlation with the BUG_ONs.
>
> This is a problem inherited from the radix tree, and I really want to
> justify fixing it ... I think I may have enough infrastructure in place
> to do it now (as part of the xas_split() commit we can now allocate
> multiple xa_nodes in xas->xa_alloc). But you're right; if we allocated
> all the way down to an order-0 node, then this isn't the bug.
>
> Were you using the ALLOW_ERROR_INJECTION feature on
> __add_to_page_cache_locked()? I haven't looked into how that works,
> and maybe that could leave us in an inconsistent state.

No, no error injection: not something I've ever looked at either.

>
> > I did wonder whether this is a long-standing issue, which your new
> > BUG_ON is the first to detect: so tried 5.12-rc5 clear_inode() with
> > a BUG_ON(!xa_empty(&inode->i_data.i_pages)) after its nrpages and
> > nrexceptional BUG_ONs. The result there surprised me: I expected
> > it to behave the same way, but it hits that BUG_ON in a minute or
> > so, instead of an hour or so. Was there a fix you made somewhere
> > to avoid the BUG_ON(!mapping_empty) most of the time, but which
> > needs more work? I looked around a little, but didn't find one.
>
> I didn't make a fix for this issue; indeed I haven't observed it myself.

That was interesting to me last night, but not so interesting now
we have more info (below).

> It seems like cgroups are a good way to induce allocation failures, so
> I should play around with that a bit. The userspace test-suite has a
> relatively malicious allocator that will fail every allocation not marked
> as GFP_KERNEL, so it always exercises the fallback path for GFP_NOWAIT,
> but then it will always succeed eventually.
>
> > I had hoped to work this out myself, and save us both some writing:
> > but better hand over to you, in the hope that you'll quickly guess
> > what's up, then I can try patches. I do like the no-nrexceptionals
> > series, but there's something still to be fixed.
>
> Agreed. It seems like it's unmasking a bug that already existed, so
> it's not an argument for dropping the series, but we should fix the bug
> so we don't crash people's machines.
>
> Arguably, the condition being checked for is not serious enough for a
> BUG_ON. A WARN_ON, yes, and dump the tree for later perusal, but it's
> just a memory leak, and not (I think?) likely to lead to later memory
> corruption. The nodes don't contain any pages, so there's nothing to
> point to the mapping.

Good suggestion, yes, use a WARN_ON instead of a BUG_ON. And even when
this immediate bug is fixed, it will still be necessary for as long as
the radix-tree -ENOMEM case might leave intermediate nodes behind. (An
easy way to fix that might be just to add a cleanup pass when
!mapping_empty().)
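
(Sketch of what I mean by a cleanup pass - untested; xa_destroy()
takes the xa_lock itself, so it would have to run before clear_inode()
takes xa_lock_irq():

	if (!mapping_empty(&inode->i_data))
		xa_destroy(&inode->i_data.i_pages);

- blunt, but it walks the tree and frees any leftover nodes.)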

I followed your suggestion and used WARN_ON instead of BUG_ON in last
night's runs (when you make that change, do remember to do what I
forgot at first - reset i_pages to NULL - otherwise anyone who inherits
that struct inode thereafter inherits these nodes - I got lots of
warnings on tiny files!).

And also printed out s_dev i_ino i_size: in the hope that they might
shed some more light. Hope fulfilled: *every* instance, on every
machine, has been /usr/bin/ld.bfd.
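
(My instrumentation, roughly - a reconstruction, not the exact debug
patch:

	if (WARN_ON(!mapping_empty(&inode->i_data))) {
		xa_dump(&inode->i_data.i_pages);
		pr_warn("s_dev %u:%u i_ino %lu i_size %lld\n",
			MAJOR(inode->i_sb->s_dev), MINOR(inode->i_sb->s_dev),
			inode->i_ino, i_size_read(inode));
		/* don't pass the stray nodes on to the next struct inode */
		inode->i_data.i_pages.xa_head = NULL;
	}

- with xa_dump() made available for the occasion.)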

My kernels are built with CONFIG_READ_ONLY_THP_FOR_FS=y. And checking
old /var/log/messages from last summer, when I was first playing around
with that, I see the temporary debug messages I had added for it:
"collapse_file(/usr/bin/ld.bfd, 0)" meaning that it was making a THP
at offset 0. (Other executables too, notably cc1; but maybe ld.bfd
is the one most likely to get evicted.)

I suspect there's a bug in the XArray handling in collapse_file(),
which sometimes leaves empty nodes behind.

Here's the xa_dump() info from one of the machines:

[ 348.026010] xarray: ffff88800f01eb98 head ffff88800855f2a2 flags 21 marks 0 0 0
[ 348.028101] 0-4095: node ffff88800855f2a0 max 0 parent 0000000000000000 shift 6 count 1 values 0 array ffff88800f01eb98 list ffff88800855f2b8 ffff88800855f2b8 marks 0 0 0
[ 348.030335] 960-1023: node ffff888023dbd890 offset 15 parent ffff88800855f2a0 shift 0 count 0 values 0 array ffff88800f01eb98 list ffff888023dbd8a8 ffff888023dbd8a8 marks 0 0 0
[ 348.032736] s_dev 259:5 i_ino 935588 i_size 9112288

[ 4521.543662] xarray: ffff888003e8f898 head ffff88801256963a flags 21 marks 0 0 0
[ 4521.545683] 0-4095: node ffff888012569638 max 0 parent 0000000000000000 shift 6 count 1 values 0 array ffff888003e8f898 list ffff888012569650 ffff888012569650 marks 0 0 0
[ 4521.547889] 384-447: node ffff8880215b2088 offset 6 parent ffff888012569638 shift 0 count 0 values 0 array ffff888003e8f898 list ffff8880215b20a0 ffff8880215b20a0 marks 0 0 0
[ 4521.550151] s_dev 259:5 i_ino 935588 i_size 9112288

[17526.162821] xarray: ffff88800f09a998 head ffff88801610808a flags 21 marks 0 0 0
[17526.163936] 0-4095: node ffff888016108088 max 0 parent 0000000000000000 shift 6 count 2 values 0 array ffff88800f09a998 list ffff8880161080a0 ffff8880161080a0 marks 0 0 0
[17526.165176] 384-447: node ffff888022a5fc00 offset 6 parent ffff888016108088 shift 0 count 0 values 0 array ffff88800f09a998 list ffff888022a5fc18 ffff888022a5fc18 marks 0 0 0
[17526.167012] 960-1023: node ffff888008656188 offset 15 parent ffff888016108088 shift 0 count 0 values 0 array ffff88800f09a998 list ffff8880086561a0 ffff8880086561a0 marks 0 0 0
[17526.168225] s_dev 259:5 i_ino 935588 i_size 9112288

[27201.416553] xarray: ffff888026638a98 head ffff8880241fd14a flags 21 marks 0 0 0
[27201.418724] 0-4095: node ffff8880241fd148 max 0 parent 0000000000000000 shift 6 count 1 values 0 array ffff888026638a98 list ffff8880241fd160 ffff8880241fd160 marks 0 0 0
[27201.421067] 384-447: node ffff888026471aa8 offset 6 parent ffff8880241fd148 shift 0 count 0 values 0 array ffff888026638a98 list ffff888026471ac0 ffff888026471ac0 marks 0 0 0
[27201.423353] s_dev 259:5 i_ino 935588 i_size 9112288

[28189.092410] xarray: ffff888022468858 head ffff888016358262 flags 21 marks 0 0 0
[28189.094548] 0-4095: node ffff888016358260 max 0 parent 0000000000000000 shift 6 count 1 values 0 array ffff888022468858 list ffff888016358278 ffff888016358278 marks 0 0 0
[28189.096821] 384-447: node ffff8880124455f8 offset 6 parent ffff888016358260 shift 0 count 0 values 0 array ffff888022468858 list ffff888012445610 ffff888012445610 marks 0 0 0
[28189.099217] s_dev 259:5 i_ino 935588 i_size 9112288

935588 -rwxr-xr-x 1 root root 9112288 Dec 7 14:28 /usr/bin/ld.bfd

Hugh

2021-04-01 18:18:12

by Matthew Wilcox

[permalink] [raw]
Subject: Re: BUG_ON(!mapping_empty(&inode->i_data))

On Wed, Mar 31, 2021 at 02:58:12PM -0700, Hugh Dickins wrote:
> I suspect there's a bug in the XArray handling in collapse_file(),
> which sometimes leaves empty nodes behind.

Urp, yes, that can easily happen.

	/* This will be less messy when we use multi-index entries */
	do {
		xas_lock_irq(&xas);
		xas_create_range(&xas);
		if (!xas_error(&xas))
			break;
		if (!xas_nomem(&xas, GFP_KERNEL)) {
			result = SCAN_FAIL;
			goto out;
		}

xas_create_range() can absolutely create nodes with zero entries.
So if we create m/n nodes and then it runs out of memory (or cgroup
denies it), we can leave nodes in the tree with zero entries.

There are three options for fixing it ...
- Switch to using multi-index entries. We need to do this anyway, but
I don't yet have a handle on the bugs that you found last time I
pushed this into linux-next. At -rc5 seems like a late stage to be
trying this solution.
- Add an xas_prune_range() that gets called on failure. Should be
straightforward to write, but will be obsolete as soon as we do the
above and it's a pain for the callers.
- Change how xas_create_range() works to merely preallocate the xa_nodes
and not insert them into the tree until we're trying to insert data into
them. I favour this option, and this scenario is amenable to writing
a test that will simulate failure halfway through.

I'm going to start on option 3 now.

2021-04-02 03:14:40

by Matthew Wilcox

[permalink] [raw]
Subject: Re: BUG_ON(!mapping_empty(&inode->i_data))

On Thu, Apr 01, 2021 at 06:06:15PM +0100, Matthew Wilcox wrote:
> On Wed, Mar 31, 2021 at 02:58:12PM -0700, Hugh Dickins wrote:
> > I suspect there's a bug in the XArray handling in collapse_file(),
> > which sometimes leaves empty nodes behind.
>
> Urp, yes, that can easily happen.
>
> 	/* This will be less messy when we use multi-index entries */
> 	do {
> 		xas_lock_irq(&xas);
> 		xas_create_range(&xas);
> 		if (!xas_error(&xas))
> 			break;
> 		if (!xas_nomem(&xas, GFP_KERNEL)) {
> 			result = SCAN_FAIL;
> 			goto out;
> 		}
>
> xas_create_range() can absolutely create nodes with zero entries.
> So if we create m/n nodes and then it runs out of memory (or cgroup
> denies it), we can leave nodes in the tree with zero entries.
>
> There are three options for fixing it ...
> - Switch to using multi-index entries. We need to do this anyway, but
> I don't yet have a handle on the bugs that you found last time I
> pushed this into linux-next. At -rc5 seems like a late stage to be
> trying this solution.
> - Add an xas_prune_range() that gets called on failure. Should be
> straightforward to write, but will be obsolete as soon as we do the
> above and it's a pain for the callers.
> - Change how xas_create_range() works to merely preallocate the xa_nodes
> and not insert them into the tree until we're trying to insert data into
> them. I favour this option, and this scenario is amenable to writing
> a test that will simulate failure halfway through.
>
> I'm going to start on option 3 now.

option 3 didn't work out terribly well. So here's option 4; if we fail
to allocate memory when creating a node, prune the tree. This fixes
(I think) the problem inherited from the radix tree, although the test
case is only for xas_create_range(). I should add a couple of test cases
for xas_create() failing, but I just got this to pass and I wanted to
send it out as soon as possible.

diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 8b1c318189ce..84c6057932f3 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -1463,6 +1463,30 @@ static noinline void check_create_range_4(struct xarray *xa,
 	XA_BUG_ON(xa, !xa_empty(xa));
 }
 
+static noinline void check_create_range_5(struct xarray *xa,
+		unsigned long index, unsigned order)
+{
+	XA_STATE_ORDER(xas, xa, index, order);
+	int i = 0;
+	gfp_t gfp = GFP_KERNEL;
+
+	XA_BUG_ON(xa, !xa_empty(xa));
+
+	do {
+		xas_lock(&xas);
+		xas_create_range(&xas);
+		xas_unlock(&xas);
+		if (++i == 4)
+			gfp = GFP_NOWAIT;
+	} while (xas_nomem(&xas, gfp));
+
+	if (!xas_error(&xas))
+		xa_destroy(xa);
+
+	XA_BUG_ON(xa, xas.xa_alloc);
+	XA_BUG_ON(xa, !xa_empty(xa));
+}
+
 static noinline void check_create_range(struct xarray *xa)
 {
 	unsigned int order;
@@ -1490,6 +1514,12 @@ static noinline void check_create_range(struct xarray *xa)
 		check_create_range_4(xa, (3U << order) + 1, order);
 		check_create_range_4(xa, (3U << order) - 1, order);
 		check_create_range_4(xa, (1U << 24) + 1, order);
+
+		check_create_range_5(xa, 0, order);
+		check_create_range_5(xa, (1U << order), order);
+		check_create_range_5(xa, (2U << order), order);
+		check_create_range_5(xa, (3U << order), order);
+		check_create_range_5(xa, (1U << (2 * order)), order);
 	}
 
 	check_create_range_3();
diff --git a/lib/xarray.c b/lib/xarray.c
index f5d8f54907b4..923ccde6379e 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -276,77 +276,6 @@ static void xas_destroy(struct xa_state *xas)
 	}
 }
 
-/**
- * xas_nomem() - Allocate memory if needed.
- * @xas: XArray operation state.
- * @gfp: Memory allocation flags.
- *
- * If we need to add new nodes to the XArray, we try to allocate memory
- * with GFP_NOWAIT while holding the lock, which will usually succeed.
- * If it fails, @xas is flagged as needing memory to continue. The caller
- * should drop the lock and call xas_nomem(). If xas_nomem() succeeds,
- * the caller should retry the operation.
- *
- * Forward progress is guaranteed as one node is allocated here and
- * stored in the xa_state where it will be found by xas_alloc(). More
- * nodes will likely be found in the slab allocator, but we do not tie
- * them up here.
- *
- * Return: true if memory was needed, and was successfully allocated.
- */
-bool xas_nomem(struct xa_state *xas, gfp_t gfp)
-{
-	if (xas->xa_node != XA_ERROR(-ENOMEM)) {
-		xas_destroy(xas);
-		return false;
-	}
-	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
-		gfp |= __GFP_ACCOUNT;
-	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
-	if (!xas->xa_alloc)
-		return false;
-	xas->xa_alloc->parent = NULL;
-	XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
-	xas->xa_node = XAS_RESTART;
-	return true;
-}
-EXPORT_SYMBOL_GPL(xas_nomem);
-
-/*
- * __xas_nomem() - Drop locks and allocate memory if needed.
- * @xas: XArray operation state.
- * @gfp: Memory allocation flags.
- *
- * Internal variant of xas_nomem().
- *
- * Return: true if memory was needed, and was successfully allocated.
- */
-static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
-	__must_hold(xas->xa->xa_lock)
-{
-	unsigned int lock_type = xa_lock_type(xas->xa);
-
-	if (xas->xa_node != XA_ERROR(-ENOMEM)) {
-		xas_destroy(xas);
-		return false;
-	}
-	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
-		gfp |= __GFP_ACCOUNT;
-	if (gfpflags_allow_blocking(gfp)) {
-		xas_unlock_type(xas, lock_type);
-		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
-		xas_lock_type(xas, lock_type);
-	} else {
-		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
-	}
-	if (!xas->xa_alloc)
-		return false;
-	xas->xa_alloc->parent = NULL;
-	XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
-	xas->xa_node = XAS_RESTART;
-	return true;
-}
-
 static void xas_update(struct xa_state *xas, struct xa_node *node)
 {
 	if (xas->xa_update)
@@ -551,6 +480,120 @@ static void xas_free_nodes(struct xa_state *xas, struct xa_node *top)
 	}
 }
 
+static bool __xas_trim(struct xa_state *xas)
+{
+	unsigned long index = xas->xa_index;
+	unsigned char shift = xas->xa_shift;
+	unsigned char sibs = xas->xa_sibs;
+
+	xas->xa_index |= ((sibs + 1UL) << shift) - 1;
+	xas->xa_shift = 0;
+	xas->xa_sibs = 0;
+	xas->xa_node = XAS_RESTART;
+
+	for (;;) {
+		xas_load(xas);
+		if (!xas_is_node(xas))
+			break;
+		xas_delete_node(xas);
+		xas->xa_index -= XA_CHUNK_SIZE;
+		if (xas->xa_index < index)
+			break;
+	}
+
+	xas->xa_shift = shift;
+	xas->xa_sibs = sibs;
+	xas->xa_index = index;
+	xas->xa_node = XA_ERROR(-ENOMEM);
+	return false;
+}
+
+/*
+ * We failed to allocate memory. Trim any nodes we created along the
+ * way which are now unused.
+ */
+static bool xas_trim(struct xa_state *xas)
+{
+	unsigned int lock_type = xa_lock_type(xas->xa);
+
+	xas_lock_type(xas, lock_type);
+	__xas_trim(xas);
+	xas_unlock_type(xas, lock_type);
+
+	return false;
+}
+
+/**
+ * xas_nomem() - Allocate memory if needed.
+ * @xas: XArray operation state.
+ * @gfp: Memory allocation flags.
+ *
+ * If we need to add new nodes to the XArray, we try to allocate memory
+ * with GFP_NOWAIT while holding the lock, which will usually succeed.
+ * If it fails, @xas is flagged as needing memory to continue. The caller
+ * should drop the lock and call xas_nomem(). If xas_nomem() succeeds,
+ * the caller should retry the operation.
+ *
+ * Forward progress is guaranteed as one node is allocated here and
+ * stored in the xa_state where it will be found by xas_alloc(). More
+ * nodes will likely be found in the slab allocator, but we do not tie
+ * them up here.
+ *
+ * Return: true if memory was needed, and was successfully allocated.
+ */
+bool xas_nomem(struct xa_state *xas, gfp_t gfp)
+{
+	if (xas->xa_node != XA_ERROR(-ENOMEM)) {
+		xas_destroy(xas);
+		return false;
+	}
+	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
+		gfp |= __GFP_ACCOUNT;
+	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+	if (!xas->xa_alloc)
+		return xas_trim(xas);
+	xas->xa_alloc->parent = NULL;
+	XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
+	xas->xa_node = XAS_RESTART;
+	return true;
+}
+EXPORT_SYMBOL_GPL(xas_nomem);
+
+/*
+ * __xas_nomem() - Drop locks and allocate memory if needed.
+ * @xas: XArray operation state.
+ * @gfp: Memory allocation flags.
+ *
+ * Internal variant of xas_nomem().
+ *
+ * Return: true if memory was needed, and was successfully allocated.
+ */
+static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
+	__must_hold(xas->xa->xa_lock)
+{
+	unsigned int lock_type = xa_lock_type(xas->xa);
+
+	if (xas->xa_node != XA_ERROR(-ENOMEM)) {
+		xas_destroy(xas);
+		return false;
+	}
+	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
+		gfp |= __GFP_ACCOUNT;
+	if (gfpflags_allow_blocking(gfp)) {
+		xas_unlock_type(xas, lock_type);
+		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+		xas_lock_type(xas, lock_type);
+	} else {
+		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+	}
+	if (!xas->xa_alloc)
+		return __xas_trim(xas);
+	xas->xa_alloc->parent = NULL;
+	XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
+	xas->xa_node = XAS_RESTART;
+	return true;
+}
+
 /*
  * xas_expand adds nodes to the head of the tree until it has reached
  * sufficient height to be able to contain @xas->xa_index

2021-04-02 13:28:20

by Matthew Wilcox

[permalink] [raw]
Subject: Re: BUG_ON(!mapping_empty(&inode->i_data))

On Fri, Apr 02, 2021 at 04:13:05AM +0100, Matthew Wilcox wrote:
> +	for (;;) {
> +		xas_load(xas);
> +		if (!xas_is_node(xas))
> +			break;
> +		xas_delete_node(xas);
> +		xas->xa_index -= XA_CHUNK_SIZE;
> +		if (xas->xa_index < index)
> +			break;

That's a bug. index can be 0, so the condition would never be true.
It should be:

	if (xas->xa_index <= (index | XA_CHUNK_MASK))
		break;
	xas->xa_index -= XA_CHUNK_SIZE;

The test doesn't notice this bug because the tree is otherwise empty,
and the !xas_is_node(xas) condition is hit first. The next test will
notice this.

2021-04-02 17:07:06

by Matthew Wilcox

[permalink] [raw]
Subject: Re: BUG_ON(!mapping_empty(&inode->i_data))

OK, more competent testing, and that previous bug now detected and fixed.
I have a reasonable amount of confidence this will solve your problem.
If you do apply this patch, don't enable CONFIG_TEST_XARRAY as the new
tests assume that attempting to allocate with a GFP flags of 0 will
definitely fail, which is true for my userspace allocator, but not true
inside the kernel. I'll add some ifdeffery to skip these tests inside
the kernel, as without a way to deterministically fail allocation,
there's no way to test this code properly.

diff --git a/lib/test_xarray.c b/lib/test_xarray.c
index 8b1c318189ce..14cbc12bfeca 100644
--- a/lib/test_xarray.c
+++ b/lib/test_xarray.c
@@ -321,12 +321,10 @@ static noinline void check_xa_mark(struct xarray *xa)
 	check_xa_mark_3(xa);
 }
 
-static noinline void check_xa_shrink(struct xarray *xa)
+static noinline void check_xa_shrink_1(struct xarray *xa)
 {
 	XA_STATE(xas, xa, 1);
 	struct xa_node *node;
-	unsigned int order;
-	unsigned int max_order = IS_ENABLED(CONFIG_XARRAY_MULTI) ? 15 : 1;
 
 	XA_BUG_ON(xa, !xa_empty(xa));
 	XA_BUG_ON(xa, xa_store_index(xa, 0, GFP_KERNEL) != NULL);
@@ -349,6 +347,13 @@ static noinline void check_xa_shrink(struct xarray *xa)
 	XA_BUG_ON(xa, xa_load(xa, 0) != xa_mk_value(0));
 	xa_erase_index(xa, 0);
 	XA_BUG_ON(xa, !xa_empty(xa));
+}
+
+static noinline void check_xa_shrink_2(struct xarray *xa)
+{
+	unsigned int order;
+	unsigned int max_order = IS_ENABLED(CONFIG_XARRAY_MULTI) ? 15 : 1;
+	struct xa_node *node;
 
 	for (order = 0; order < max_order; order++) {
 		unsigned long max = (1UL << order) - 1;
@@ -370,6 +375,34 @@ static noinline void check_xa_shrink(struct xarray *xa)
 	}
 }
 
+static noinline void check_xa_shrink_3(struct xarray *xa, int nr,
+		unsigned long anchor, unsigned long newbie)
+{
+	XA_STATE(xas, xa, newbie);
+	int i;
+
+	xa_store_index(xa, anchor, GFP_KERNEL);
+
+	for (i = 0; i < nr; i++) {
+		xas_create(&xas, true);
+		xas_nomem(&xas, GFP_KERNEL);
+	}
+	xas_create(&xas, true);
+	xas_nomem(&xas, 0);
+	XA_BUG_ON(xa, xas_error(&xas) == 0);
+
+	xa_erase_index(xa, anchor);
+	XA_BUG_ON(xa, !xa_empty(xa));
+}
+
+static noinline void check_xa_shrink(struct xarray *xa)
+{
+	check_xa_shrink_1(xa);
+	check_xa_shrink_2(xa);
+	check_xa_shrink_3(xa, 8, 0, 1UL << 31);
+	check_xa_shrink_3(xa, 4, 1UL << 31, 0);
+}
+
 static noinline void check_insert(struct xarray *xa)
 {
 	unsigned long i;
@@ -1463,6 +1496,36 @@ static noinline void check_create_range_4(struct xarray *xa,
 	XA_BUG_ON(xa, !xa_empty(xa));
 }
 
+static noinline void check_create_range_5(struct xarray *xa, void *entry,
+		unsigned long index, unsigned order)
+{
+	XA_STATE_ORDER(xas, xa, index, order);
+	int i = 0;
+	gfp_t gfp = GFP_KERNEL;
+
+	XA_BUG_ON(xa, !xa_empty(xa));
+
+	if (entry)
+		xa_store(xa, xa_to_value(entry), entry, GFP_KERNEL);
+
+	do {
+		xas_lock(&xas);
+		xas_create_range(&xas);
+		xas_unlock(&xas);
+		if (++i == 4)
+			gfp = GFP_NOWAIT;
+	} while (xas_nomem(&xas, gfp));
+
+	if (entry)
+		xa_erase(xa, xa_to_value(entry));
+
+	if (!xas_error(&xas))
+		xa_destroy(xa);
+
+	XA_BUG_ON(xa, xas.xa_alloc);
+	XA_BUG_ON(xa, !xa_empty(xa));
+}
+
 static noinline void check_create_range(struct xarray *xa)
 {
 	unsigned int order;
@@ -1490,6 +1553,24 @@ static noinline void check_create_range(struct xarray *xa)
 		check_create_range_4(xa, (3U << order) + 1, order);
 		check_create_range_4(xa, (3U << order) - 1, order);
 		check_create_range_4(xa, (1U << 24) + 1, order);
+
+		check_create_range_5(xa, NULL, 0, order);
+		check_create_range_5(xa, NULL, (1U << order), order);
+		check_create_range_5(xa, NULL, (2U << order), order);
+		check_create_range_5(xa, NULL, (3U << order), order);
+		check_create_range_5(xa, NULL, (1U << (2 * order)), order);
+
+		check_create_range_5(xa, xa_mk_value(0), 0, order);
+		check_create_range_5(xa, xa_mk_value(0), (1U << order), order);
+		check_create_range_5(xa, xa_mk_value(0), (2U << order), order);
+		check_create_range_5(xa, xa_mk_value(0), (3U << order), order);
+		check_create_range_5(xa, xa_mk_value(0), (1U << (2 * order)), order);
+
+		check_create_range_5(xa, xa_mk_value(1U << order), 0, order);
+		check_create_range_5(xa, xa_mk_value(1U << order), (1U << order), order);
+		check_create_range_5(xa, xa_mk_value(1U << order), (2U << order), order);
+		check_create_range_5(xa, xa_mk_value(1U << order), (3U << order), order);
+		check_create_range_5(xa, xa_mk_value(1U << order), (1U << (2 * order)), order);
 	}
 
 	check_create_range_3();
diff --git a/lib/xarray.c b/lib/xarray.c
index f5d8f54907b4..38a08eb99c7f 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -276,77 +276,6 @@ static void xas_destroy(struct xa_state *xas)
 	}
 }
 
-/**
- * xas_nomem() - Allocate memory if needed.
- * @xas: XArray operation state.
- * @gfp: Memory allocation flags.
- *
- * If we need to add new nodes to the XArray, we try to allocate memory
- * with GFP_NOWAIT while holding the lock, which will usually succeed.
- * If it fails, @xas is flagged as needing memory to continue. The caller
- * should drop the lock and call xas_nomem(). If xas_nomem() succeeds,
- * the caller should retry the operation.
- *
- * Forward progress is guaranteed as one node is allocated here and
- * stored in the xa_state where it will be found by xas_alloc(). More
- * nodes will likely be found in the slab allocator, but we do not tie
- * them up here.
- *
- * Return: true if memory was needed, and was successfully allocated.
- */
-bool xas_nomem(struct xa_state *xas, gfp_t gfp)
-{
-	if (xas->xa_node != XA_ERROR(-ENOMEM)) {
-		xas_destroy(xas);
-		return false;
-	}
-	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
-		gfp |= __GFP_ACCOUNT;
-	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
-	if (!xas->xa_alloc)
-		return false;
-	xas->xa_alloc->parent = NULL;
-	XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
-	xas->xa_node = XAS_RESTART;
-	return true;
-}
-EXPORT_SYMBOL_GPL(xas_nomem);
-
-/*
- * __xas_nomem() - Drop locks and allocate memory if needed.
- * @xas: XArray operation state.
- * @gfp: Memory allocation flags.
- *
- * Internal variant of xas_nomem().
- *
- * Return: true if memory was needed, and was successfully allocated.
- */
-static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
-	__must_hold(xas->xa->xa_lock)
-{
-	unsigned int lock_type = xa_lock_type(xas->xa);
-
-	if (xas->xa_node != XA_ERROR(-ENOMEM)) {
-		xas_destroy(xas);
-		return false;
-	}
-	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
-		gfp |= __GFP_ACCOUNT;
-	if (gfpflags_allow_blocking(gfp)) {
-		xas_unlock_type(xas, lock_type);
-		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
-		xas_lock_type(xas, lock_type);
-	} else {
-		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
-	}
-	if (!xas->xa_alloc)
-		return false;
-	xas->xa_alloc->parent = NULL;
-	XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
-	xas->xa_node = XAS_RESTART;
-	return true;
-}
-
 static void xas_update(struct xa_state *xas, struct xa_node *node)
 {
 	if (xas->xa_update)
@@ -551,6 +480,120 @@ static void xas_free_nodes(struct xa_state *xas, struct xa_node *top)
 	}
 }
 
+static bool __xas_trim(struct xa_state *xas)
+{
+	unsigned long index = xas->xa_index;
+	unsigned char shift = xas->xa_shift;
+	unsigned char sibs = xas->xa_sibs;
+
+	xas->xa_index |= ((sibs + 1UL) << shift) - 1;
+	xas->xa_shift = 0;
+	xas->xa_sibs = 0;
+	xas->xa_node = XAS_RESTART;
+
+	for (;;) {
+		xas_load(xas);
+		if (!xas_is_node(xas))
+			break;
+		xas_delete_node(xas);
+		if (xas->xa_index <= (index | XA_CHUNK_MASK))
+			break;
+		xas->xa_index -= XA_CHUNK_SIZE;
+	}
+
+	xas->xa_shift = shift;
+	xas->xa_sibs = sibs;
+	xas->xa_index = index;
+	xas->xa_node = XA_ERROR(-ENOMEM);
+	return false;
+}
+
+/*
+ * We failed to allocate memory. Trim any nodes we created along the
+ * way which are now unused.
+ */
+static bool xas_trim(struct xa_state *xas)
+{
+	unsigned int lock_type = xa_lock_type(xas->xa);
+
+	xas_lock_type(xas, lock_type);
+	__xas_trim(xas);
+	xas_unlock_type(xas, lock_type);
+
+	return false;
+}
+
+/**
+ * xas_nomem() - Allocate memory if needed.
+ * @xas: XArray operation state.
+ * @gfp: Memory allocation flags.
+ *
+ * If we need to add new nodes to the XArray, we try to allocate memory
+ * with GFP_NOWAIT while holding the lock, which will usually succeed.
+ * If it fails, @xas is flagged as needing memory to continue. The caller
+ * should drop the lock and call xas_nomem(). If xas_nomem() succeeds,
+ * the caller should retry the operation.
+ *
+ * Forward progress is guaranteed as one node is allocated here and
+ * stored in the xa_state where it will be found by xas_alloc(). More
+ * nodes will likely be found in the slab allocator, but we do not tie
+ * them up here.
+ *
+ * Return: true if memory was needed, and was successfully allocated.
+ */
+bool xas_nomem(struct xa_state *xas, gfp_t gfp)
+{
+	if (xas->xa_node != XA_ERROR(-ENOMEM)) {
+		xas_destroy(xas);
+		return false;
+	}
+	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
+		gfp |= __GFP_ACCOUNT;
+	xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+	if (!xas->xa_alloc)
+		return xas_trim(xas);
+	xas->xa_alloc->parent = NULL;
+	XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
+	xas->xa_node = XAS_RESTART;
+	return true;
+}
+EXPORT_SYMBOL_GPL(xas_nomem);
+
+/*
+ * __xas_nomem() - Drop locks and allocate memory if needed.
+ * @xas: XArray operation state.
+ * @gfp: Memory allocation flags.
+ *
+ * Internal variant of xas_nomem().
+ *
+ * Return: true if memory was needed, and was successfully allocated.
+ */
+static bool __xas_nomem(struct xa_state *xas, gfp_t gfp)
+	__must_hold(xas->xa->xa_lock)
+{
+	unsigned int lock_type = xa_lock_type(xas->xa);
+
+	if (xas->xa_node != XA_ERROR(-ENOMEM)) {
+		xas_destroy(xas);
+		return false;
+	}
+	if (xas->xa->xa_flags & XA_FLAGS_ACCOUNT)
+		gfp |= __GFP_ACCOUNT;
+	if (gfpflags_allow_blocking(gfp)) {
+		xas_unlock_type(xas, lock_type);
+		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+		xas_lock_type(xas, lock_type);
+	} else {
+		xas->xa_alloc = kmem_cache_alloc(radix_tree_node_cachep, gfp);
+	}
+	if (!xas->xa_alloc)
+		return __xas_trim(xas);
+	xas->xa_alloc->parent = NULL;
+	XA_NODE_BUG_ON(xas->xa_alloc, !list_empty(&xas->xa_alloc->private_list));
+	xas->xa_node = XAS_RESTART;
+	return true;
+}
+
 /*
  * xas_expand adds nodes to the head of the tree until it has reached
  * sufficient height to be able to contain @xas->xa_index

2021-04-02 20:25:50

by Hugh Dickins

[permalink] [raw]
Subject: Re: BUG_ON(!mapping_empty(&inode->i_data))

On Fri, 2 Apr 2021, Matthew Wilcox wrote:

> OK, more competent testing, and that previous bug now detected and fixed.
> I have a reasonable amount of confidence this will solve your problem.
> If you do apply this patch, don't enable CONFIG_TEST_XARRAY as the new
> tests assume that attempting to allocate with a GFP flags of 0 will
> definitely fail, which is true for my userspace allocator, but not true
> inside the kernel. I'll add some ifdeffery to skip these tests inside
> the kernel, as without a way to deterministically fail allocation,
> there's no way to test this code properly.

Thanks a lot for all your efforts on this, but the news from the front
is disappointing. The lib/xarray.c you sent here is yesterday's plus
the little __xas_trim() fixup you sent this morning: I set that going
then on three machines, two of them are still good, but one is not (and
yes, I've checked several times that it is the intended kernel running).
xa_dump()s appended below, but I don't expect them to have more to tell.

I think you've been focusing on the old radix-tree -ENOMEM case, which
you'd wanted to clean up anyway, but overlooking the THP collapse_file()
case, which is the one actually hitting me. collapse_file() does that
xas_create_range(), which the documentation tells me will create all the nodes
might be needed; and if collapse_file() has to give up and revert for
any of many plausible reasons, those nodes may be left over at the end.

There is a "Put holes back where they were" xas_store(&xas, NULL) on
the failure path, which I think we would expect to delete empty nodes.
But it only goes as far as nr_none. Is it ok to xas_store(&xas, NULL)
where there was no non-NULL entry before? I should try that, maybe
adjusting the !nr_none break will give a very simple fix.
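
(The failure path in question, roughly, from mm/khugepaged.c - quoted
from memory, so treat it as a sketch:

	/* Something went wrong: roll back page cache changes */
	xas_set(&xas, start);
	xas_for_each(&xas, page, end - 1) {
		page = list_first_entry_or_null(&pagelist,
				struct page, lru);
		if (!page || xas.xa_index < page->index) {
			if (!nr_none)
				break;
			nr_none--;
			/* Put holes back where they were */
			xas_store(&xas, NULL);
			continue;
		}
		...
	}

- it is that "if (!nr_none) break" which stops short of the extra
nodes that xas_create_range() left behind.)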

Or, if you remove the "static " from xas_trim(), maybe that provides
the xas_prune_range() you proposed, or the cleanup pass I proposed.
To be called on collapse_file() failure, or when eviction finds
!mapping_empty().

[ 2927.151739] xarray: ffff888017914c80 head ffff888003a10db2 flags 21 marks 0 0 0
[ 2927.171484] 0-4095: node ffff888003a10db0 max 0 parent 0000000000000000 shift 6 count 3 values 0 array ffff888017914c80 list ffff888003a10dc8 ffff888003a10dc8 marks 0 0 0
[ 2927.213313] 1344-1407: node ffff8880055c8490 offset 21 parent ffff888003a10db0 shift 0 count 0 values 0 array ffff888017914c80 list ffff8880055c84a8 ffff8880055c84a8 marks 0 0 0
[ 2927.257924] 1408-1471: node ffff8880055c8248 offset 22 parent ffff888003a10db0 shift 0 count 0 values 0 array ffff888017914c80 list ffff8880055c8260 ffff8880055c8260 marks 0 0 0
[ 2927.305332] 1472-1535: node ffff8880055c8000 offset 23 parent ffff888003a10db0 shift 0 count 0 values 0 array ffff888017914c80 list ffff8880055c8018 ffff8880055c8018 marks 0 0 0
[ 2927.355811] s_dev 8:8 i_ino 274355 i_size 10092280

[ 3813.689018] xarray: ffff888005511408 head ffff888017624db2 flags 21 marks 0 0 0
[ 3813.716012] 0-4095: node ffff888017624db0 max 2 parent 0000000000000000 shift 6 count 3 values 0 array ffff888005511408 list ffff888017624dc8 ffff888017624dc8 marks 0 0 0
[ 3813.771966] 1344-1407: node ffff888000595b60 offset 21 parent ffff888017624db0 shift 0 count 0 values 0 array ffff888005511408 list ffff888000595b78 ffff888000595b78 marks 0 0 0
[ 3813.828102] 1408-1471: node ffff888000594b68 offset 22 parent ffff888017624db0 shift 0 count 0 values 0 array ffff888005511408 list ffff888000594b80 ffff888000594b80 marks 0 0 0
[ 3813.883603] 1472-1535: node ffff888000594248 offset 23 parent ffff888017624db0 shift 0 count 0 values 0 array ffff888005511408 list ffff888000594260 ffff888000594260 marks 0 0 0
[ 3813.939146] s_dev 8:8 i_ino 274355 i_size 10092280

[14157.780505] xarray: ffff888007c8d988 head ffff88800bccfd9a flags 21 marks 0 0 0
[14157.801557] 0-4095: node ffff88800bccfd98 max 7 parent 0000000000000000 shift 6 count 2 values 0 array ffff888007c8d988 list ffff88800bccfdb0 ffff88800bccfdb0 marks 0 0 0
[14157.845337] 896-959: node ffff8880279fdda8 offset 14 parent ffff88800bccfd98 shift 0 count 0 values 0 array ffff888007c8d988 list ffff8880279fddc0 ffff8880279fddc0 marks 0 0 0
[14157.893594] 960-1023: node ffff8880279fe238 offset 15 parent ffff88800bccfd98 shift 0 count 0 values 0 array ffff888007c8d988 list ffff8880279fe250 ffff8880279fe250 marks 0 0 0
[14157.943810] s_dev 8:8 i_ino 274355 i_size 10092280

Hugh

2021-04-02 21:18:06

by Hugh Dickins

[permalink] [raw]
Subject: Re: BUG_ON(!mapping_empty(&inode->i_data))

On Fri, 2 Apr 2021, Hugh Dickins wrote:
>
> There is a "Put holes back where they were" xas_store(&xas, NULL) on
> the failure path, which I think we would expect to delete empty nodes.
> But it only goes as far as nr_none. Is it ok to xas_store(&xas, NULL)
> where there was no non-NULL entry before? I should try that, maybe
> adjusting the !nr_none break will give a very simple fix.

No, XArray did not like that:
xas_update() XA_NODE_BUG_ON(node, !list_empty(&node->private_list)).

But also it's the wrong thing for collapse_file() to do, from a file
integrity point of view. So far as there is a non-NULL page in the list,
or nr_none is non-zero, those subpages are frozen at the src end, and
THP head locked and not Uptodate at the dst end. But go beyond nr_none,
and a racing task could be adding new pages, which THP collapse failure
has no right to delete behind its back.

Not an issue for READ_ONLY_THP_FOR_FS, but important for shmem and future.

>
> Or, if you remove the "static " from xas_trim(), maybe that provides
> the xas_prune_range() you proposed, or the cleanup pass I proposed.
> To be called on collapse_file() failure, or when eviction finds
> !mapping_empty().

Something like this I think.

Hugh

2021-04-30 04:19:01

by Andrew Morton

[permalink] [raw]
Subject: Re: BUG_ON(!mapping_empty(&inode->i_data))

On Fri, 2 Apr 2021 14:16:04 -0700 (PDT) Hugh Dickins <[email protected]> wrote:

> On Fri, 2 Apr 2021, Hugh Dickins wrote:
> >
> > There is a "Put holes back where they were" xas_store(&xas, NULL) on
> > the failure path, which I think we would expect to delete empty nodes.
> > But it only goes as far as nr_none. Is it ok to xas_store(&xas, NULL)
> > where there was no non-NULL entry before? I should try that, maybe
> > adjusting the !nr_none break will give a very simple fix.
>
> No, XArray did not like that:
> xas_update() XA_NODE_BUG_ON(node, !list_empty(&node->private_list)).
>
> But also it's the wrong thing for collapse_file() to do, from a file
> integrity point of view. So far as there is a non-NULL page in the list,
> or nr_none is non-zero, those subpages are frozen at the src end, and
> THP head locked and not Uptodate at the dst end. But go beyond nr_none,
> and a racing task could be adding new pages, which THP collapse failure
> has no right to delete behind its back.
>
> Not an issue for READ_ONLY_THP_FOR_FS, but important for shmem and future.
>
> >
> > Or, if you remove the "static " from xas_trim(), maybe that provides
> > the xas_prune_range() you proposed, or the cleanup pass I proposed.
> > To be called on collapse_file() failure, or when eviction finds
> > !mapping_empty().
>
> Something like this I think.
>

I'm not sure this ever was resolved?

Is it the case that the series "Remove nrexceptional tracking v2" at
least exposed this bug?

IOW, what the heck should I do with

mm-introduce-and-use-mapping_empty.patch
mm-stop-accounting-shadow-entries.patch
dax-account-dax-entries-as-nrpages.patch
mm-remove-nrexceptional-from-inode.patch

Thanks.

2021-04-30 05:42:17

by Hugh Dickins

[permalink] [raw]
Subject: Re: BUG_ON(!mapping_empty(&inode->i_data))

On Thu, 29 Apr 2021, Andrew Morton wrote:
>
> I'm not sure this ever was resolved?

It was not resolved: Matthew had prospective fixes for one way in which
it could happen, but they did not help the case which still hits my
testing (well, I replace the BUG_ON by a WARN_ON, so not hit badly).

>
> Is it the case that the series "Remove nrexceptional tracking v2" at
> least exposed this bug?

Yes: makes a BUG out of a long-standing issue not noticed before.

>
> IOW, what the heck should I do with
>
> mm-introduce-and-use-mapping_empty.patch
> mm-stop-accounting-shadow-entries.patch
> dax-account-dax-entries-as-nrpages.patch
> mm-remove-nrexceptional-from-inode.patch

If Matthew doesn't have a proper fix yet (and it's a bit late for more
than an obvious fix), I think those should go in, with this addition:

[PATCH] mm: remove nrexceptional from inode: remove BUG_ON

clear_inode()'s BUG_ON(!mapping_empty(&inode->i_data)) is unsafe: we know
of two ways in which nodes can and do (on rare occasions) get left behind.
Until those are fixed, do not BUG_ON() nor even WARN_ON(). Yes, this will
then leak those nodes (or the next user of the struct inode may use them);
but this has been happening for years, and the new BUG_ON(!mapping_empty)
was only guilty of revealing that. A proper fix will follow, but no hurry.

Signed-off-by: Hugh Dickins <[email protected]>
---

fs/inode.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

--- mmotm/fs/inode.c	2021-04-22 18:30:46.285908982 -0700
+++ linux/fs/inode.c	2021-04-29 22:13:54.096530691 -0700
@@ -529,7 +529,14 @@ void clear_inode(struct inode *inode)
 	 */
 	xa_lock_irq(&inode->i_data.i_pages);
 	BUG_ON(inode->i_data.nrpages);
-	BUG_ON(!mapping_empty(&inode->i_data));
+	/*
+	 * Almost always, mapping_empty(&inode->i_data) here; but there are
+	 * two known and long-standing ways in which nodes may get left behind
+	 * (when deep radix-tree node allocation failed partway; or when THP
+	 * collapse_file() failed). Until those two known cases are cleaned up,
+	 * or a cleanup function is called here, do not BUG_ON(!mapping_empty),
+	 * nor even WARN_ON(!mapping_empty).
+	 */
 	xa_unlock_irq(&inode->i_data.i_pages);
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));