2022-03-29 08:50:00

by Hugh Dickins

[permalink] [raw]
Subject: WARNING: mm/workingset.c:567 shadow_lru_isolate

Sunday's linux.git, under tmpfs swapping load (including
ext4 file readonly THPs and shmem THPs), gave me the:

if (WARN_ON_ONCE(!node->nr_values))
goto out_invalid;
if (WARN_ON_ONCE(node->count != node->nr_values))
goto out_invalid;

from mm/workingset.c - log below. I did see these on mmotm before
the merge window, but only now got around to bisecting them. And
the bisection, confirmed by revert, arrives surprisingly at:

820c4e2e6f51 ("mm/vmscan: Free non-shmem folios without splitting them")

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1739,8 +1739,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
/* Adding to swap updated mapping */
mapping = page_mapping(page);
}
- } else if (unlikely(PageTransHuge(page))) {
- /* Split file THP */
+ } else if (PageSwapBacked(page) && PageTransHuge(page)) {
+ /* Split shmem THP */
if (split_folio_to_list(folio, page_list))
goto keep_locked;
}

I'm not familiar with workingset.c, I'll make no guesses, over to you!

Hugh

[ 1470.572186] ------------[ cut here ]------------
[ 1470.574478] WARNING: CPU: 1 PID: 47 at mm/workingset.c:567 shadow_lru_isolate+0x84/0x128
[ 1470.575570] CPU: 1 PID: 47 Comm: kswapd0 Not tainted 5.18.0-rc0 #2
[ 1470.576592] Hardware name: LENOVO 20HQS0EG02/20HQS0EG02, BIOS N1MET54W (1.39 ) 04/16/2019
[ 1470.577836] RIP: 0010:shadow_lru_isolate+0x84/0x128
[ 1470.579170] Code: 89 ff 4c 8d 73 e8 e8 16 f4 ff ff 83 ca ff be 09 00 00 00 4c 89 f7 e8 91 3c 05 00 4c 89 ef e8 e9 25 7d 00 8a 43 eb 84 c0 75 04 <0f> 0b eb 2a 3a 43 ea 74 04 0f 0b eb 21 48 c7 c6 6a f7 1f 81 4c 89
[ 1470.580669] RSP: 0018:ffff888000e0bba0 EFLAGS: 00010046
[ 1470.582173] RAX: 0000000080000000 RBX: ffff888016c21510 RCX: 8bc9d3bc33ed677c
[ 1470.583755] RDX: 0000000000000003 RSI: ffffffff823f436f RDI: 00000000ffffffff
[ 1470.585303] RBP: ffff888000e0bbc8 R08: 0000000000000078 R09: 00000000000c8386
[ 1470.586860] R10: 0000000000000001 R11: ffff888000e055b8 R12: ffff8880166c99d0
[ 1470.588518] R13: ffff8880010dd180 R14: ffff888016c214f8 R15: ffff8880010dd1c0
[ 1470.590154] FS: 0000000000000000(0000) GS:ffff88803ec80000(0000) knlGS:0000000000000000
[ 1470.591743] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1470.593296] CR2: 0000555555694f50 CR3: 000000000f940005 CR4: 00000000003706e0
[ 1470.594824] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1470.596198] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1470.597581] Call Trace:
[ 1470.598907] <TASK>
[ 1470.600267] __list_lru_walk_one+0x84/0xf2
[ 1470.601629] ? workingset_update_node+0x84/0x84
[ 1470.602977] list_lru_walk_one_irq+0x59/0x74
[ 1470.604324] ? workingset_update_node+0x84/0x84
[ 1470.605657] scan_shadow_nodes+0x29/0x2b
[ 1470.607073] do_shrink_slab+0x1d0/0x33d
[ 1470.608477] shrink_slab+0x98/0x100
[ 1470.609880] shrink_node_memcgs+0x136/0x182
[ 1470.611164] shrink_node+0x220/0x3d9
[ 1470.612446] balance_pgdat+0x216/0x3cb
[ 1470.613715] ? lock_is_held_type+0xcf/0x10f
[ 1470.615010] kswapd+0x189/0x1ae
[ 1470.616285] ? balance_pgdat+0x3cb/0x3cb
[ 1470.617481] kthread+0xee/0xf6
[ 1470.618690] ? kthread_exit+0x1f/0x1f
[ 1470.619932] ret_from_fork+0x1f/0x30
[ 1470.621110] </TASK>
[ 1470.622237] irq event stamp: 76862920
[ 1470.623284] hardirqs last enabled at (76862919): [<ffffffff819d1ee3>] _raw_spin_unlock_irq+0x23/0x42
[ 1470.624373] hardirqs last disabled at (76862920): [<ffffffff819d1c96>] _raw_spin_lock_irq+0x17/0x4c
[ 1470.625540] softirqs last enabled at (76861226): [<ffffffff81c0036f>] __do_softirq+0x36f/0x3aa
[ 1470.626752] softirqs last disabled at (76861195): [<ffffffff811049cd>] __irq_exit_rcu+0x85/0xc1
[ 1470.627957] ---[ end trace 0000000000000000 ]---
[ 1470.629407] ------------[ cut here ]------------
[ 1470.630849] WARNING: CPU: 1 PID: 47 at mm/workingset.c:569 shadow_lru_isolate+0x8d/0x128
[ 1470.632264] CPU: 1 PID: 47 Comm: kswapd0 Tainted: G W 5.18.0-rc0 #2
[ 1470.633729] Hardware name: LENOVO 20HQS0EG02/20HQS0EG02, BIOS N1MET54W (1.39 ) 04/16/2019
[ 1470.635202] RIP: 0010:shadow_lru_isolate+0x8d/0x128
[ 1470.636601] Code: ff ff 83 ca ff be 09 00 00 00 4c 89 f7 e8 91 3c 05 00 4c 89 ef e8 e9 25 7d 00 8a 43 eb 84 c0 75 04 0f 0b eb 2a 3a 43 ea 74 04 <0f> 0b eb 21 48 c7 c6 6a f7 1f 81 4c 89 f7 e8 71 ca 7a 00 ba 01 00
[ 1470.638206] RSP: 0018:ffff888000e0bba0 EFLAGS: 00010087
[ 1470.639811] RAX: 0000000080000020 RBX: ffff888016c4d020 RCX: 8bc9d3bc33ed677c
[ 1470.641470] RDX: 0000000000000003 RSI: ffffffff823f436f RDI: 00000000ffffffff
[ 1470.643072] RBP: ffff888000e0bbc8 R08: 0000000000000078 R09: 00000000000c8386
[ 1470.644688] R10: 0000000000000001 R11: ffff888000e055b8 R12: ffff888016c54910
[ 1470.646312] R13: ffff8880010dd180 R14: ffff888016c4d008 R15: ffff8880010dd1c0
[ 1470.648341] FS: 0000000000000000(0000) GS:ffff88803ec80000(0000) knlGS:0000000000000000
[ 1470.649820] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1470.651404] CR2: 0000555555694f50 CR3: 000000000f940005 CR4: 00000000003706e0
[ 1470.651408] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1470.651411] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1470.655817] Call Trace:
[ 1470.655821] <TASK>
[ 1470.655824] __list_lru_walk_one+0x84/0xf2
[ 1470.660171] ? workingset_update_node+0x84/0x84
[ 1470.660178] list_lru_walk_one_irq+0x59/0x74
[ 1470.662784] ? workingset_update_node+0x84/0x84
[ 1470.662791] scan_shadow_nodes+0x29/0x2b
[ 1470.662795] do_shrink_slab+0x1d0/0x33d
[ 1470.662800] shrink_slab+0x98/0x100
[ 1470.662805] shrink_node_memcgs+0x136/0x182
[ 1470.662810] shrink_node+0x220/0x3d9
[ 1470.662814] balance_pgdat+0x216/0x3cb
[ 1470.662818] ? lock_is_held_type+0xcf/0x10f
[ 1470.662824] kswapd+0x189/0x1ae
[ 1470.662828] ? balance_pgdat+0x3cb/0x3cb
[ 1470.662833] kthread+0xee/0xf6
[ 1470.662838] ? kthread_exit+0x1f/0x1f
[ 1470.662843] ret_from_fork+0x1f/0x30
[ 1470.662849] </TASK>
[ 1470.662851] irq event stamp: 76862966
[ 1470.662854] hardirqs last enabled at (76862965): [<ffffffff819d1ee3>] _raw_spin_unlock_irq+0x23/0x42
[ 1470.662860] hardirqs last disabled at (76862966): [<ffffffff819d1c96>] _raw_spin_lock_irq+0x17/0x4c
[ 1470.662866] softirqs last enabled at (76862958): [<ffffffff81c0036f>] __do_softirq+0x36f/0x3aa
[ 1470.662871] softirqs last disabled at (76862925): [<ffffffff811049cd>] __irq_exit_rcu+0x85/0xc1
[ 1470.662877] ---[ end trace 0000000000000000 ]---


2022-03-30 01:11:14

by Matthew Wilcox

[permalink] [raw]
Subject: Re: WARNING: mm/workingset.c:567 shadow_lru_isolate

On Mon, Mar 28, 2022 at 09:24:00PM -0700, Hugh Dickins wrote:
> Sunday's linux.git, under tmpfs swapping load (including
> ext4 file readonly THPs and shmem THPs), gave me the:
>
> if (WARN_ON_ONCE(!node->nr_values))
> goto out_invalid;

Fun. I'm assuming it's actually the ext4 file ROTHP that causes the
issue ... but really, I don't think it matters to fixing the bug.

> if (WARN_ON_ONCE(node->count != node->nr_values))
> goto out_invalid;
>
> from mm/workingset.c - log below. I did see these on mmotm before
> the merge window, but only now got around to bisecting them. And
> the bisection, confirmed by revert, arrives surprisingly at:
>
> 820c4e2e6f51 ("mm/vmscan: Free non-shmem folios without splitting them")
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1739,8 +1739,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,
> /* Adding to swap updated mapping */
> mapping = page_mapping(page);
> }
> - } else if (unlikely(PageTransHuge(page))) {
> - /* Split file THP */
> + } else if (PageSwapBacked(page) && PageTransHuge(page)) {
> + /* Split shmem THP */
> if (split_folio_to_list(folio, page_list))
> goto keep_locked;
> }
>
> I'm not familiar with workingset.c, I'll make no guesses, over to you!

OK, so the overall point of this part of workingset.c is that we used
to simply remove pages from the page cache when they fell off the end
of the LRU. Now we replace them with a shadow entry (obviously shmem is
different here which is why you're so unfamiliar with it). The problem
with this is that we still tie up radix tree nodes tracking how old the
pages would have been. So workingset puts nodes which only have shadow
entries on an LRU and will remove old radix tree nodes when there's
enough memory pressure.

These WARN_ON messages are supposed to be checking the invariant that
the nodes we're deleting only contain shadow entries; no real pages.
When the first one fires, it says there are _no_ shadow entries in this
node, and if we get to the second one, we're seeing a node with N entries,
but M of them are shadow entries and M != N. That shouldn't be
possible; we should only put nodes on the list when nr_values is
positive and equal to count (see workingset_update_node()).

OK, so why are we hitting this now? Before, we'd split the THP,
creating 8 new radix tree nodes, all of which would contain 64 shadow
entries and get put on the LRU. What _should_ be happening after the
commit that we bisect to is that the node->count is unaffected, but
node->nr_values increases by 8. I'll investigate a little bit, but
may ask you to add some debug if I'm having trouble reproducing it.

> Hugh
>
> [ 1470.572186] ------------[ cut here ]------------
> [ 1470.574478] WARNING: CPU: 1 PID: 47 at mm/workingset.c:567 shadow_lru_isolate+0x84/0x128
> [ 1470.575570] CPU: 1 PID: 47 Comm: kswapd0 Not tainted 5.18.0-rc0 #2
> [ 1470.576592] Hardware name: LENOVO 20HQS0EG02/20HQS0EG02, BIOS N1MET54W (1.39 ) 04/16/2019
> [ 1470.577836] RIP: 0010:shadow_lru_isolate+0x84/0x128
> [ 1470.579170] Code: 89 ff 4c 8d 73 e8 e8 16 f4 ff ff 83 ca ff be 09 00 00 00 4c 89 f7 e8 91 3c 05 00 4c 89 ef e8 e9 25 7d 00 8a 43 eb 84 c0 75 04 <0f> 0b eb 2a 3a 43 ea 74 04 0f 0b eb 21 48 c7 c6 6a f7 1f 81 4c 89
> [ 1470.580669] RSP: 0018:ffff888000e0bba0 EFLAGS: 00010046
> [ 1470.582173] RAX: 0000000080000000 RBX: ffff888016c21510 RCX: 8bc9d3bc33ed677c
> [ 1470.583755] RDX: 0000000000000003 RSI: ffffffff823f436f RDI: 00000000ffffffff
> [ 1470.585303] RBP: ffff888000e0bbc8 R08: 0000000000000078 R09: 00000000000c8386
> [ 1470.586860] R10: 0000000000000001 R11: ffff888000e055b8 R12: ffff8880166c99d0
> [ 1470.588518] R13: ffff8880010dd180 R14: ffff888016c214f8 R15: ffff8880010dd1c0
> [ 1470.590154] FS: 0000000000000000(0000) GS:ffff88803ec80000(0000) knlGS:0000000000000000
> [ 1470.591743] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1470.593296] CR2: 0000555555694f50 CR3: 000000000f940005 CR4: 00000000003706e0
> [ 1470.594824] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1470.596198] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1470.597581] Call Trace:
> [ 1470.598907] <TASK>
> [ 1470.600267] __list_lru_walk_one+0x84/0xf2
> [ 1470.601629] ? workingset_update_node+0x84/0x84
> [ 1470.602977] list_lru_walk_one_irq+0x59/0x74
> [ 1470.604324] ? workingset_update_node+0x84/0x84
> [ 1470.605657] scan_shadow_nodes+0x29/0x2b
> [ 1470.607073] do_shrink_slab+0x1d0/0x33d
> [ 1470.608477] shrink_slab+0x98/0x100
> [ 1470.609880] shrink_node_memcgs+0x136/0x182
> [ 1470.611164] shrink_node+0x220/0x3d9
> [ 1470.612446] balance_pgdat+0x216/0x3cb
> [ 1470.613715] ? lock_is_held_type+0xcf/0x10f
> [ 1470.615010] kswapd+0x189/0x1ae
> [ 1470.616285] ? balance_pgdat+0x3cb/0x3cb
> [ 1470.617481] kthread+0xee/0xf6
> [ 1470.618690] ? kthread_exit+0x1f/0x1f
> [ 1470.619932] ret_from_fork+0x1f/0x30
> [ 1470.621110] </TASK>
> [ 1470.622237] irq event stamp: 76862920
> [ 1470.623284] hardirqs last enabled at (76862919): [<ffffffff819d1ee3>] _raw_spin_unlock_irq+0x23/0x42
> [ 1470.624373] hardirqs last disabled at (76862920): [<ffffffff819d1c96>] _raw_spin_lock_irq+0x17/0x4c
> [ 1470.625540] softirqs last enabled at (76861226): [<ffffffff81c0036f>] __do_softirq+0x36f/0x3aa
> [ 1470.626752] softirqs last disabled at (76861195): [<ffffffff811049cd>] __irq_exit_rcu+0x85/0xc1
> [ 1470.627957] ---[ end trace 0000000000000000 ]---
> [ 1470.629407] ------------[ cut here ]------------
> [ 1470.630849] WARNING: CPU: 1 PID: 47 at mm/workingset.c:569 shadow_lru_isolate+0x8d/0x128
> [ 1470.632264] CPU: 1 PID: 47 Comm: kswapd0 Tainted: G W 5.18.0-rc0 #2
> [ 1470.633729] Hardware name: LENOVO 20HQS0EG02/20HQS0EG02, BIOS N1MET54W (1.39 ) 04/16/2019
> [ 1470.635202] RIP: 0010:shadow_lru_isolate+0x8d/0x128
> [ 1470.636601] Code: ff ff 83 ca ff be 09 00 00 00 4c 89 f7 e8 91 3c 05 00 4c 89 ef e8 e9 25 7d 00 8a 43 eb 84 c0 75 04 0f 0b eb 2a 3a 43 ea 74 04 <0f> 0b eb 21 48 c7 c6 6a f7 1f 81 4c 89 f7 e8 71 ca 7a 00 ba 01 00
> [ 1470.638206] RSP: 0018:ffff888000e0bba0 EFLAGS: 00010087
> [ 1470.639811] RAX: 0000000080000020 RBX: ffff888016c4d020 RCX: 8bc9d3bc33ed677c
> [ 1470.641470] RDX: 0000000000000003 RSI: ffffffff823f436f RDI: 00000000ffffffff
> [ 1470.643072] RBP: ffff888000e0bbc8 R08: 0000000000000078 R09: 00000000000c8386
> [ 1470.644688] R10: 0000000000000001 R11: ffff888000e055b8 R12: ffff888016c54910
> [ 1470.646312] R13: ffff8880010dd180 R14: ffff888016c4d008 R15: ffff8880010dd1c0
> [ 1470.648341] FS: 0000000000000000(0000) GS:ffff88803ec80000(0000) knlGS:0000000000000000
> [ 1470.649820] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1470.651404] CR2: 0000555555694f50 CR3: 000000000f940005 CR4: 00000000003706e0
> [ 1470.651408] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1470.651411] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 1470.655817] Call Trace:
> [ 1470.655821] <TASK>
> [ 1470.655824] __list_lru_walk_one+0x84/0xf2
> [ 1470.660171] ? workingset_update_node+0x84/0x84
> [ 1470.660178] list_lru_walk_one_irq+0x59/0x74
> [ 1470.662784] ? workingset_update_node+0x84/0x84
> [ 1470.662791] scan_shadow_nodes+0x29/0x2b
> [ 1470.662795] do_shrink_slab+0x1d0/0x33d
> [ 1470.662800] shrink_slab+0x98/0x100
> [ 1470.662805] shrink_node_memcgs+0x136/0x182
> [ 1470.662810] shrink_node+0x220/0x3d9
> [ 1470.662814] balance_pgdat+0x216/0x3cb
> [ 1470.662818] ? lock_is_held_type+0xcf/0x10f
> [ 1470.662824] kswapd+0x189/0x1ae
> [ 1470.662828] ? balance_pgdat+0x3cb/0x3cb
> [ 1470.662833] kthread+0xee/0xf6
> [ 1470.662838] ? kthread_exit+0x1f/0x1f
> [ 1470.662843] ret_from_fork+0x1f/0x30
> [ 1470.662849] </TASK>
> [ 1470.662851] irq event stamp: 76862966
> [ 1470.662854] hardirqs last enabled at (76862965): [<ffffffff819d1ee3>] _raw_spin_unlock_irq+0x23/0x42
> [ 1470.662860] hardirqs last disabled at (76862966): [<ffffffff819d1c96>] _raw_spin_lock_irq+0x17/0x4c
> [ 1470.662866] softirqs last enabled at (76862958): [<ffffffff81c0036f>] __do_softirq+0x36f/0x3aa
> [ 1470.662871] softirqs last disabled at (76862925): [<ffffffff811049cd>] __irq_exit_rcu+0x85/0xc1
> [ 1470.662877] ---[ end trace 0000000000000000 ]---
>

2022-03-31 04:43:18

by Matthew Wilcox

[permalink] [raw]
Subject: Re: WARNING: mm/workingset.c:567 shadow_lru_isolate

On Mon, Mar 28, 2022 at 09:24:00PM -0700, Hugh Dickins wrote:
> I'm not familiar with workingset.c, I'll make no guesses, over to you!

I haven't been able to reproduce it, but I applied Feynman's Problem
Solving Algorithm, and I think this will do the trick:

diff --git a/lib/xarray.c b/lib/xarray.c
index b95e92598b9c..d3b168f619b1 100644
--- a/lib/xarray.c
+++ b/lib/xarray.c
@@ -1079,6 +1079,7 @@ void xas_split(struct xa_state *xas, void *entry, unsigned int order)
xa_mk_node(child));
if (xa_is_value(curr))
values--;
+ xas_update(xas, child);
} else {
unsigned int canon = offset - xas->xa_sibs;

@@ -1093,6 +1094,7 @@ void xas_split(struct xa_state *xas, void *entry, unsigned int order)
} while (offset-- > xas->xa_offset);

node->nr_values += values;
+ xas_update(xas, node);
}
EXPORT_SYMBOL_GPL(xas_split);
#endif

(the key insight is that the update function should have removed the
node from the list; the WARN_ON really means "this node should not be on
the list", so there's somewhere that we're forgetting to call
xas_update(), or somewhere that we're forgetting to call
mapping_set_update(). I didn't find any of the latter, but I did find
one of the former)

2022-03-31 05:44:12

by Hugh Dickins

[permalink] [raw]
Subject: Re: WARNING: mm/workingset.c:567 shadow_lru_isolate

On Wed, 30 Mar 2022, Matthew Wilcox wrote:
> On Mon, Mar 28, 2022 at 09:24:00PM -0700, Hugh Dickins wrote:
> > I'm not familiar with workingset.c, I'll make no guesses, over to you!
>
> I haven't been able to reproduce it, but I applied Feynman's Problem
> Solving Algorithm, and I think this will do the trick:

Excellent! Yes, this fixes it - thanks.

Hugh

>
> diff --git a/lib/xarray.c b/lib/xarray.c
> index b95e92598b9c..d3b168f619b1 100644
> --- a/lib/xarray.c
> +++ b/lib/xarray.c
> @@ -1079,6 +1079,7 @@ void xas_split(struct xa_state *xas, void *entry, unsigned int order)
> xa_mk_node(child));
> if (xa_is_value(curr))
> values--;
> + xas_update(xas, child);
> } else {
> unsigned int canon = offset - xas->xa_sibs;
>
> @@ -1093,6 +1094,7 @@ void xas_split(struct xa_state *xas, void *entry, unsigned int order)
> } while (offset-- > xas->xa_offset);
>
> node->nr_values += values;
> + xas_update(xas, node);
> }
> EXPORT_SYMBOL_GPL(xas_split);
> #endif
>
> (the key insight is that the update function should have removed the
> node from the list; the WARN_ON really means "this node should not be on
> the list", so there's somewhere that we're forgetting to call
> xas_update(), or somewhere that we're forgetting to call
> mapping_set_update(). I didn't find any of the latter, but I did find
> one of the former)