2022-07-12 14:56:38

by Liam R. Howlett

Subject: [PATCH] maple_tree: Fix sparse reported issues

When building with C=1, the maple tree had some RCU type mismatches &
locking mismatches in the destroy functions. These were cosmetic only,
since this happens after the nodes are removed from the tree.

Fixes: f8acc5e9581e ("Maple Tree: add new data structure")
Reported-by: kernel test robot <[email protected]>
Signed-off-by: Liam R. Howlett <[email protected]>
---
lib/maple_tree.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 79e07c7dc323..9dc4ffff18d0 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -2319,7 +2319,7 @@ static inline void mast_topiary(struct maple_subtree_state *mast)
MA_WR_STATE(wr_mas, mast->orig_l, NULL);
unsigned char r_start, r_end;
unsigned char l_start, l_end;
- void **l_slots, **r_slots;
+ void __rcu **l_slots, **r_slots;

wr_mas.type = mte_node_type(mast->orig_l->node);
mast->orig_l->index = mast->orig_l->last;
@@ -5461,6 +5461,7 @@ static void mt_free_walk(struct rcu_head *head)
mt_free_bulk(node->slot_len, slots);

start_slots_free:
+ mas_unlock(&mas);
free_leaf:
mt_free_rcu(&node->rcu);
}
@@ -5513,6 +5514,7 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
do {
enum maple_type type;
unsigned char offset;
+ struct maple_enode *parent, *tmp;

node->slot_len = mas_dead_leaves(&mas, slots);
if (free)
@@ -5524,13 +5526,17 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,

type = mte_node_type(mas.node);
slots = ma_slots(mte_to_node(mas.node), type);
- if ((offset < mt_slots[type]) && mte_node_type(slots[offset]) &&
- mte_to_node(slots[offset])) {
- struct maple_enode *parent = mas.node;
+ if (offset >= mt_slots[type])
+ goto next;

- mas.node = mas_slot_locked(&mas, slots, offset);
+ tmp = mas_slot_locked(&mas, slots, offset);
+ if (mte_node_type(tmp) && mte_to_node(tmp)) {
+
+ parent = mas.node;
+ mas.node = tmp;
slots = mas_destroy_descend(&mas, parent, offset);
}
+next:
node = mas_mn(&mas);
} while (start != mas.node);

--
2.35.1


2022-07-12 15:16:09

by Matthew Wilcox

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

On Tue, Jul 12, 2022 at 02:24:55PM +0000, Liam Howlett wrote:
> When building with C=1, the maple tree had some rcu type mismatch &
> locking mismatches in the destroy functions. There were cosmetic only
> since this happens after the nodes are removed from the tree.

... in the current use-case. It's a legitimate use of the API to do:

ma_init();
ma_store();
ma_destroy();
ma_store();

Can you add a new test that does that?

> @@ -5524,13 +5526,17 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
>
> type = mte_node_type(mas.node);
> slots = ma_slots(mte_to_node(mas.node), type);
> - if ((offset < mt_slots[type]) && mte_node_type(slots[offset]) &&
> - mte_to_node(slots[offset])) {
> - struct maple_enode *parent = mas.node;
> + if (offset >= mt_slots[type])
> + goto next;
>
> - mas.node = mas_slot_locked(&mas, slots, offset);
> + tmp = mas_slot_locked(&mas, slots, offset);
> + if (mte_node_type(tmp) && mte_to_node(tmp)) {
> +

Unnecessary blank line?

> + parent = mas.node;
> + mas.node = tmp;
> slots = mas_destroy_descend(&mas, parent, offset);
> }
> +next:
> node = mas_mn(&mas);
> } while (start != mas.node);
>
> --
> 2.35.1
>

2022-07-12 15:59:34

by Liam R. Howlett

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

* Matthew Wilcox <[email protected]> [220712 10:54]:
> On Tue, Jul 12, 2022 at 02:24:55PM +0000, Liam Howlett wrote:
> > When building with C=1, the maple tree had some rcu type mismatch &
> > locking mismatches in the destroy functions. There were cosmetic only
> > since this happens after the nodes are removed from the tree.
>
> ... in the current use-case. It's a legitimate use of the API to do:
>
> ma_init();
> ma_store();
> ma_destroy();
> ma_store();
>
> Can you add a new test that does that?

I can add this, but a few notes on it.

The function names are mt_init(), mtree_store(), and mtree_destroy().
There is an internal/advanced __mt_destroy(). Both destroy functions
clear the flags - which I think is desirable.

>
> > @@ -5524,13 +5526,17 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags,
> >
> > type = mte_node_type(mas.node);
> > slots = ma_slots(mte_to_node(mas.node), type);
> > - if ((offset < mt_slots[type]) && mte_node_type(slots[offset]) &&
> > - mte_to_node(slots[offset])) {
> > - struct maple_enode *parent = mas.node;
> > + if (offset >= mt_slots[type])
> > + goto next;
> >
> > - mas.node = mas_slot_locked(&mas, slots, offset);
> > + tmp = mas_slot_locked(&mas, slots, offset);
> > + if (mte_node_type(tmp) && mte_to_node(tmp)) {
> > +
>
> Unnecessary blank line?

ack, I'll remove that.


>
> > + parent = mas.node;
> > + mas.node = tmp;
> > slots = mas_destroy_descend(&mas, parent, offset);
> > }
> > +next:
> > node = mas_mn(&mas);
> > } while (start != mas.node);
> >
> > --
> > 2.35.1
> >
>
> --
> maple-tree mailing list
> [email protected]
> https://lists.infradead.org/mailman/listinfo/maple-tree

2022-07-13 08:54:56

by David Hildenbrand

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

On 12.07.22 16:24, Liam Howlett wrote:
> When building with C=1, the maple tree had some rcu type mismatch &
> locking mismatches in the destroy functions. There were cosmetic only
> since this happens after the nodes are removed from the tree.
>
> Fixes: f8acc5e9581e (Maple Tree: add new data structure)
> Reported-by: kernel test robot <[email protected]>
> Signed-off-by: Liam R. Howlett <[email protected]>

Sorry to say, but the fixes become hard to follow (what/where/why). :)

I guess it's time for a new series soon. Eventually it makes sense to
send the fixes as reply to the individual problematic patches. (instead
of fixes to commit ids that are not upstream)

[yes, I'll do more review soon :) ]

--
Thanks,

David / dhildenb

2022-07-13 14:05:00

by Liam R. Howlett

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

* David Hildenbrand <[email protected]> [220713 04:34]:
> On 12.07.22 16:24, Liam Howlett wrote:
> > When building with C=1, the maple tree had some rcu type mismatch &
> > locking mismatches in the destroy functions. There were cosmetic only
> > since this happens after the nodes are removed from the tree.
> >
> > Fixes: f8acc5e9581e (Maple Tree: add new data structure)
> > Reported-by: kernel test robot <[email protected]>
> > Signed-off-by: Liam R. Howlett <[email protected]>
>
> Sorry to say, but the fixes become hard to follow (what/where/why). :)
>
> I guess it's time for a new series soon. Eventually it makes sense to
> send the fixes as reply to the individual problematic patches. (instead
> of fixes to commit ids that are not upstream)
>
> [yes, I'll do more review soon :) ]

I appreciate the feedback, it's much better than yelling into the void.
I have one more fix in the works - for __vma_adjust() of all functions
so that'll be impossible to follow anyways :) I'll work on a v11 to
include that last one.

Thanks,
Liam

2022-07-13 17:13:15

by Hugh Dickins

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

On Wed, 13 Jul 2022, Liam Howlett wrote:
> * David Hildenbrand <[email protected]> [220713 04:34]:
> > On 12.07.22 16:24, Liam Howlett wrote:
> > > When building with C=1, the maple tree had some rcu type mismatch &
> > > locking mismatches in the destroy functions. There were cosmetic only
> > > since this happens after the nodes are removed from the tree.
> > >
> > > Fixes: f8acc5e9581e (Maple Tree: add new data structure)
> > > Reported-by: kernel test robot <[email protected]>
> > > Signed-off-by: Liam R. Howlett <[email protected]>
> >
> > Sorry to say, but the fixes become hard to follow (what/where/why). :)
> >
> > I guess it's time for a new series soon. Eventually it makes sense to
> > send the fixes as reply to the individual problematic patches. (instead
> > of fixes to commit ids that are not upstream)
> >
> > [yes, I'll do more review soon :) ]
>
> I appreciate the feedback, it's much better than yelling into the void.
> I have one more fix in the works - for __vma_adjust() of all functions
> so that'll be impossible to follow anyways :) I'll work on a v11 to
> include that last one.

Please do also post the incremental for that "one more fix" once it's
ready: I have been keeping up with what you've been posting so far,
folding them into my debugging here, and believe we have made some but
still not enough progress on the bugs I hit. Folding in one more fix
will be easy for me, advancing to v11 of a 69-part patchset will be...
dispiriting.

Thanks,
Hugh

2022-07-13 17:55:13

by Liam R. Howlett

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

* Hugh Dickins <[email protected]> [220713 11:56]:
> On Wed, 13 Jul 2022, Liam Howlett wrote:
> > * David Hildenbrand <[email protected]> [220713 04:34]:
> > > On 12.07.22 16:24, Liam Howlett wrote:
> > > > When building with C=1, the maple tree had some rcu type mismatch &
> > > > locking mismatches in the destroy functions. There were cosmetic only
> > > > since this happens after the nodes are removed from the tree.
> > > >
> > > > Fixes: f8acc5e9581e (Maple Tree: add new data structure)
> > > > Reported-by: kernel test robot <[email protected]>
> > > > Signed-off-by: Liam R. Howlett <[email protected]>
> > >
> > > Sorry to say, but the fixes become hard to follow (what/where/why). :)
> > >
> > > I guess it's time for a new series soon. Eventually it makes sense to
> > > send the fixes as reply to the individual problematic patches. (instead
> > > of fixes to commit ids that are not upstream)
> > >
> > > [yes, I'll do more review soon :) ]
> >
> > I appreciate the feedback, it's much better than yelling into the void.
> > I have one more fix in the works - for __vma_adjust() of all functions
> > so that'll be impossible to follow anyways :) I'll work on a v11 to
> > include that last one.
>
> Please do also post the incremental for that "one more fix" once it's
> ready: I have been keeping up with what you've been posting so far,
> folding them into my debugging here, and believe we have made some but
> still not enough progress on the bugs I hit. Folding in one more fix
> will be easy for me, advancing to v11 of a 69-part patchset will be...
> dispiriting.


Okay, thanks. I don't think it will fix your outstanding issues but it
is necessary to fix case 6 of vma_merge() on memory allocation failure
as reported by syzbot.

2022-07-15 20:27:06

by Liam R. Howlett

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

* Liam R. Howlett <[email protected]> [220713 13:50]:
> * Hugh Dickins <[email protected]> [220713 11:56]:
> > On Wed, 13 Jul 2022, Liam Howlett wrote:
> > > * David Hildenbrand <[email protected]> [220713 04:34]:
> > > > On 12.07.22 16:24, Liam Howlett wrote:
> > > > > When building with C=1, the maple tree had some rcu type mismatch &
> > > > > locking mismatches in the destroy functions. There were cosmetic only
> > > > > since this happens after the nodes are removed from the tree.
> > > > >
> > > > > Fixes: f8acc5e9581e (Maple Tree: add new data structure)
> > > > > Reported-by: kernel test robot <[email protected]>
> > > > > Signed-off-by: Liam R. Howlett <[email protected]>
> > > >
> > > > Sorry to say, but the fixes become hard to follow (what/where/why). :)
> > > >
> > > > I guess it's time for a new series soon. Eventually it makes sense to
> > > > send the fixes as reply to the individual problematic patches. (instead
> > > > of fixes to commit ids that are not upstream)
> > > >
> > > > [yes, I'll do more review soon :) ]
> > >
> > > I appreciate the feedback, it's much better than yelling into the void.
> > > I have one more fix in the works - for __vma_adjust() of all functions
> > > so that'll be impossible to follow anyways :) I'll work on a v11 to
> > > include that last one.
> >
> > Please do also post the incremental for that "one more fix" once it's
> > ready: I have been keeping up with what you've been posting so far,
> > folding them into my debugging here, and believe we have made some but
> > still not enough progress on the bugs I hit. Folding in one more fix
> > will be easy for me, advancing to v11 of a 69-part patchset will be...
> > dispiriting.
>
>
> Okay, thanks. I don't think it will fix your outstanding issues but it
> is necessary to fix case 6 of vma_merge() on memory allocation failure
> as reported by syzbot.

Hugh,

Please find attached the last outstanding fix for this series. It
covers a rare failure scenario that one of the build bots reported.
Ironically, it changes __vma_adjust() more than the rest of the series,
but leaves the locking in the existing order.

Thanks,
Liam


Attachments:
0001-mm-mmap-Fix-__vma_adjust-issue-on-memory-failure.patch (3.37 kB)

2022-07-17 21:06:23

by Hugh Dickins

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

On Fri, 15 Jul 2022, Liam Howlett wrote:
> * Liam R. Howlett <[email protected]> [220713 13:50]:
> > * Hugh Dickins <[email protected]> [220713 11:56]:
> > > On Wed, 13 Jul 2022, Liam Howlett wrote:
> > > > * David Hildenbrand <[email protected]> [220713 04:34]:
> > > > > On 12.07.22 16:24, Liam Howlett wrote:
> > > > > > When building with C=1, the maple tree had some rcu type mismatch &
> > > > > > locking mismatches in the destroy functions. There were cosmetic only
> > > > > > since this happens after the nodes are removed from the tree.
> > > > > >
> > > > > > Fixes: f8acc5e9581e (Maple Tree: add new data structure)
> > > > > > Reported-by: kernel test robot <[email protected]>
> > > > > > Signed-off-by: Liam R. Howlett <[email protected]>
> > > > >
> > > > > Sorry to say, but the fixes become hard to follow (what/where/why). :)
> > > > >
> > > > > I guess it's time for a new series soon. Eventually it makes sense to
> > > > > send the fixes as reply to the individual problematic patches. (instead
> > > > > of fixes to commit ids that are not upstream)
> > > > >
> > > > > [yes, I'll do more review soon :) ]
> > > >
> > > > I appreciate the feedback, it's much better than yelling into the void.
> > > > I have one more fix in the works - for __vma_adjust() of all functions
> > > > so that'll be impossible to follow anyways :) I'll work on a v11 to
> > > > include that last one.
> > >
> > > Please do also post the incremental for that "one more fix" once it's
> > > ready: I have been keeping up with what you've been posting so far,
> > > folding them into my debugging here, and believe we have made some but
> > > still not enough progress on the bugs I hit. Folding in one more fix
> > > will be easy for me, advancing to v11 of a 69-part patchset will be...
> > > dispiriting.
> >
> >
> > Okay, thanks. I don't think it will fix your outstanding issues but it
> > is necessary to fix case 6 of vma_merge() on memory allocation failure
> > as reported by syzbot.
>
> Hugh,
>
> Please find attached the last outstanding fix for this series. It
> covers a rare failure scenario that one of the build bots reported.
> Ironically, it changes __vma_adjust() more than the rest of the series,
> but leaves the locking in the existing order.

Thanks, I folded that into my testing on next-20220715, along with
others you posted on Friday (mas_dead_leaves() walk fix); but have not
even glanced at the v11 you posted Saturday, though I see from responses
that v11 has some other changes, including __vma_adjust() again, but I
just have not looked. (I've had my own experiments in __vma_adjust()).

You'll be wanting my report, I'll give it here rather than in the v11
thread. In short, some progress, but still frustratingly none on the
main crashes.

1. swapops.h BUG on !PageLocked migration entry. This is *not* the
most urgent to fix, I'm just listing it first to get it out of the way
here. This is the BUG that terminates every tmpfs swapping load test
on the laptop: only progress was that, just one time, the workstation
hit it before hitting its usual problems: nice to see it there too.
I'll worry about this bug when the rest is running stably. I've only
ever noticed it when it's brk being unmapped, I expect that's a clue.

2. Silly bug in do_mas_munmap():
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2513,7 +2513,7 @@ int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
arch_unmap(mm, start, end);

/* Find the first overlapping VMA */
- vma = mas_find(mas, end - 1);
+ vma = mas_find(mas, start);
if (!vma)
return 0;

Fixing that does fix some bad behaviors seen - I'd been having crashes in
unmap_vmas() and unlink_anon_vmas(), and put "if (WARN_ON(!vma)) return;"
in unmap_region(); but that no longer seems necessary now do_mas_munmap()
is fixed thus. I had high hopes that it would fix all the rest, but no.

3. vma_adjust_trans_huge(). Skip this paragraph, I think there
is actually no problem here, but for safety I have rearranged the
vma_adjust_trans_huge()s inside the rmap locks, because when things
aren't working right, best to rule out some possibilities. Why am
I even mentioning it here? In case I send any code snippets and
you're puzzled by vma_adjust_trans_huge() having moved.

4. unlink_anon_vmas() in __vma_adjust(). Not urgent to fix (can only
be an issue when GFP_KERNEL preallocation fails, which I think means
when current is being OOM-killed), but whereas vma_expand() has careful
anon_cloned flag, __vma_adjust() does not, and I think could be
unlinking a pre-existing anon_vma. Aside from that, I am nervous about
using unlink_anon_vmas() there like that (and in vma_expand()): IIUC it's
an unnecessary "optimization" for a very unlikely case, that's in danger
of doing more harm than good; and should be removed from them both (then
they can both simply return -ENOMEM when mas_preallocate() fails).

5. I was horrified/thrilled to notice last night that mas_store_prealloc()
does a mas_destroy() at the end. So only the first vma_mas_store() in
__vma_adjust() gets to use the carefully preallocated nodes. I thought
that might be responsible for lots of nastiness, but again sadly no.
Maybe it just falls back to GFP_ATOMIC when the preallocateds are gone
(I didn't look) and that often works okay. Whether the carefully
chosen prealloc count allows for __vma_adjust() use would be another
question. (Earlier on I did try doubling its calculation, in case it
was preallocating too few, but that made no difference either.)

6. The main problem, crashes on the workstation (never seen on the
laptop). I keep hacking around with the debug info (and, please,
%px not %p, %lx not %lu in the debug info: I've changed them all,
and a couple of %lds, in my copy of lib/maple_tree.c). I forget
how the typical crashes appeared with the MAPLE_DEBUGs turned off
(the BUG_ON(count != mm->map_count) in exit_mmap() perhaps); I've
turned them on, but usually comment out the mt_validate() and
mt_dump(), which generate far too much noise for non-specialists,
and delay the onset of crash tenfold (but re-enabled to give you
the attached messages.xz).

Every time, it's the cc1 (9.3.1) doing some munmapping (cc1 is
mostly what's running in this load test, so that's not surprising;
but surprising that even when I switched the laptop to using same
gcc-9, couldn't reproduce the problem there). Typically, that
munmapping has involved splitting a small, say three page, vma
into two pages below and one page above (the "insert", and I've
hacked the debug to dump that too, see "mmap: insert" - ah,
looking at the messages now, the insert is bigger this time).

And what has happened each time I've studied it (I don't know
if this is evident from the mt dumps in the messages, I hope so)
is that the vma and the insert have been correctly placed in the
tree, except that the vma has *also* been placed several pages
earlier, and a linear search of the tree finds that misplaced
instance first, vm_start not matching mt index.

The map_count in these cases has always been around 59, 60, 61:
maybe just typical for cc1, or maybe significant for maple tree?

I won't give up quite yet, but I'm hoping you'll have an idea for
the misplaced tree entry. Something going wrong in rebalancing?

For a long time I assumed the problem would be at the mm/mmap.c end,
and I've repeatedly tried different things with the vma_mas stuff
in __vma_adjust() (for example, using vma_mas_remove() to remove
vmas before re-adding them, and/or doing mas_reset() in more places);
but none of those attempts actually fixed the issue. So now I'm
thinking the problem is really at the lib/maple_tree.c end.

7. If you get to do cleanups later, please shrink those double blank
lines to single blank lines. And find a better name for the strange
vma_mas_szero() - vma_mas_erase(), or is erase something different?
I'm not even sure that it's needed: that's a special case for exec's
shift_arg_pages() VM_STACK_INCOMPLETE_SETUP, which uses __vma_adjust()
in a non-standard way. And trace_vma_mas_szero() in vma_mas_remove()
looks wrong.

Hugh


Attachments:
messages.xz (4.46 kB)

2022-07-18 03:00:49

by Liam R. Howlett

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

* Hugh Dickins <[email protected]> [220717 16:58]:
> On Fri, 15 Jul 2022, Liam Howlett wrote:
> > * Liam R. Howlett <[email protected]> [220713 13:50]:
> > > * Hugh Dickins <[email protected]> [220713 11:56]:
> > > > On Wed, 13 Jul 2022, Liam Howlett wrote:
> > > > > * David Hildenbrand <[email protected]> [220713 04:34]:
> > > > > > On 12.07.22 16:24, Liam Howlett wrote:
> > > > > > > When building with C=1, the maple tree had some rcu type mismatch &
> > > > > > > locking mismatches in the destroy functions. There were cosmetic only
> > > > > > > since this happens after the nodes are removed from the tree.
> > > > > > >
> > > > > > > Fixes: f8acc5e9581e (Maple Tree: add new data structure)
> > > > > > > Reported-by: kernel test robot <[email protected]>
> > > > > > > Signed-off-by: Liam R. Howlett <[email protected]>
> > > > > >
> > > > > > Sorry to say, but the fixes become hard to follow (what/where/why). :)
> > > > > >
> > > > > > I guess it's time for a new series soon. Eventually it makes sense to
> > > > > > send the fixes as reply to the individual problematic patches. (instead
> > > > > > of fixes to commit ids that are not upstream)
> > > > > >
> > > > > > [yes, I'll do more review soon :) ]
> > > > >
> > > > > I appreciate the feedback, it's much better than yelling into the void.
> > > > > I have one more fix in the works - for __vma_adjust() of all functions
> > > > > so that'll be impossible to follow anyways :) I'll work on a v11 to
> > > > > include that last one.
> > > >
> > > > Please do also post the incremental for that "one more fix" once it's
> > > > ready: I have been keeping up with what you've been posting so far,
> > > > folding them into my debugging here, and believe we have made some but
> > > > still not enough progress on the bugs I hit. Folding in one more fix
> > > > will be easy for me, advancing to v11 of a 69-part patchset will be...
> > > > dispiriting.
> > >
> > >
> > > Okay, thanks. I don't think it will fix your outstanding issues but it
> > > is necessary to fix case 6 of vma_merge() on memory allocation failure
> > > as reported by syzbot.
> >
> > Hugh,
> >
> > Please find attached the last outstanding fix for this series. It
> > covers a rare failure scenario that one of the build bots reported.
> > Ironically, it changes __vma_adjust() more than the rest of the series,
> > but leaves the locking in the existing order.
>
> Thanks, I folded that in to my testing on next-20220715, along with
> other you posted on Friday (mas_dead_leaves() walk fix);

Please drop that patch, it needs more testing.

> but have not
> even glanced at the v11 you posted Saturday, though I see from responses
> that v11 has some other changes, including __vma_adjust() again, but I
> just have not looked. (I've had my own experiments in __vma_adjust()).
>
> You'll be wanting my report, I'll give it here rather than in the v11
> thread. In short, some progress, but still frustratingly none on the
> main crashes.
>
> 1. swapops.h BUG on !PageLocked migration entry. This is *not* the
> most urgent to fix, I'm just listing it first to get it out of the way
> here. This is the BUG that terminates every tmpfs swapping load test
> on the laptop: only progress was that, just one time, the workstation
> hit it before hitting its usual problems: nice to see it there too.
> I'll worry about this bug when the rest is running stably. I've only
> ever noticed it when it's brk being unmapped, I expect that's a clue.

Thanks for pointing me towards a usable reproducer. I should be able to
narrow it down from there, especially with the brk hint.

>
> 2. Silly in do_mas_mumap():
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2513,7 +2513,7 @@ int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
> arch_unmap(mm, start, end);
>
> /* Find the first overlapping VMA */
> - vma = mas_find(mas, end - 1);
> + vma = mas_find(mas, start);
> if (!vma)
> return 0;
>
> Fixing that does fix some bad behaviors seen - I'd been having crashes in
> unmap_vmas() and unlink_anon_vmas(), and put "if (WARN_ON(!vma)) return;"
> in unmap_region(); but that no longer seems necessary now do_mas_munmap()
> is fixed thus. I had high hopes that it would fix all the rest, but no.

This is actually correct. mas_find() is not the same as vma_find().
mas_find() uses the maple state index and searches until a limit (end
-1 in this case). I haven't seen these crashes, but I think you are
hitting the same issue you are discussing in #6 below. I also hadn't
realised the potential confusion of those APIs.

>
> 3. vma_adjust_trans_huge(). Skip this paragraph, I think there
> is actually no problem here, but for safety I have rearranged the
> vma_adjust_trans_huge()s inside the rmap locks, because when things
> aren't working right, best to rule out some possibilities. Why am
> I even mentioning it here? In case I send any code snippets and
> you're puzzled by vma_adjust_trans_huge() having moved.

Thanks, I will keep this in mind.

>
> 4. unlink_anon_vmas() in __vma_adjust(). Not urgent to fix (can only
> be an issue when GFP_KERNEL preallocation fails, which I think means
> when current is being OOM-killed), but whereas vma_expand() has careful
> anon_cloned flag, __vma_adjust() does not, and I think could be
> unlinking a pre-existing anon_vma. Aside from that, I am nervous about
> using unlink_anon_vmas() there like that (and in vma_expand()): IIUC it's
> an unnecessary "optimization" for a very unlikely case, that's in danger
> of doing more harm than good; and should be removed from them both (then
> they can both simply return -ENOMEM when mas_preallocate() fails).

I will add a flag to __vma_adjust, but I don't see how it could happen.
I guess if importer has an anon_vma already? I added these as an unwind
since I don't want to leak - even on the rare preallocation failure. If
you don't want to unroll, what would you suggest in these cases? Would
a flag be acceptable?

>
> 5. I was horrified/thrilled to notice last night that mas_store_prealloc()
> does a mas_destroy() at the end. So only the first vma_mas_store() in
> __vma_adjust() gets to use the carefully preallocated nodes. I thought
> that might be responsible for lots of nastiness, but again sadly no.
> Maybe it just falls back to GFP_ATOMIC when the preallocateds are gone
> (I didn't look) and that often works okay. Whether the carefully
> chosen prealloc count allows for __vma_adjust() use would be another
> question. (Earlier on I did try doubling its calculation, in case it
> was preallocating too few, but that made no difference either.)

mas_store_prealloc will allocate for the worst case scenario. Since I
cannot guarantee the second store isn't the worst case, and could fail,
I cannot allow for more than one allocation per preallocate. I thought
I was fine in __vma_adjust() since I preallocate after the goto label
for a second removal, but it turns out I wasn't, since the second
preallocate could fail. So I've removed the requirement for a second
store for 'case 6' by updating the tree once and removing both VMAs in
a single pass. There could potentially be an issue if the caller to
__vma_merge() wanted to reduce one limit of the VMA (I guess just the
start..) and
also remove one or more VMAs, and also be in a situation where
allocations could cause issues (fs_reclaim).. and since __vma_adjust()
is so complicated, I looked at the callers. Most use vma_merge(), and
those seem fine. fs/exec only adjusts one at a time. when splitting,
only a single insert happens. Did I miss some situation(s)?

>
> 6. The main problem, crashes on the workstation (never seen on the
> laptop). I keep hacking around with the debug info (and, please,
> %px not %p, %lx not %lu in the debug info: I've changed them all,

Okay, so I tried to remove all %px in the debug code, so I'll revert
those. I use %lu for historic reasons from mt_dump(); I can change
those too. The tree uses ranges to store pointers, so I print the
pointers in %lx and the ranges in %lu.


> and a couple of %lds, in my copy of lib/maple_tree.c). I forget
> how the typical crashes appeared with the MAPLE_DEBUGs turned off
> (the BUG_ON(count != mm->map_count) in exit_mmap() perhaps); I've
> turned them on, but usually comment out the mt_validate() and
> mt_dump(), which generate far too much noise for non-specialists,
> and delay the onset of crash tenfold (but re-enabled to give you
> the attached messages.xz).
>
> Every time, it's the cc1 (9.3.1) doing some munmapping (cc1 is
> mostly what's running in this load test, so that's not surprising;
> but surprising that even when I switched the laptop to using same
> gcc-9, couldn't reproduce the problem there). Typically, that
> munmapping has involved splitting a small, say three page, vma
> into two pages below and one page above (the "insert", and I've
> hacked the debug to dump that too, see "mmap: insert" - ah,
> looking at the messages now, the insert is bigger this time).
>
> And what has happened each time I've studied it (I don't know
> if this is evident from the mt dumps in the messages, I hope so)
> is that the vma and the insert have been correctly placed in the
> tree, except that the vma has *also* been placed several pages
> earlier, and a linear search of the tree finds that misplaced
> instance first, vm_start not matching mt index.
>
> The map_count in these cases has always been around 59, 60, 61:
> maybe just typical for cc1, or maybe significant for maple tree?
>
> I won't give up quite yet, but I'm hoping you'll have an idea for
> the misplaced tree entry. Something going wrong in rebalancing?
>
> For a long time I assumed the problem would be at the mm/mmap.c end,
> and I've repeatedly tried different things with the vma_mas stuff
> in __vma_adjust() (for example, using vma_mas_remove() to remove
> vmas before re-adding them, and/or doing mas_reset() in more places);
> but none of those attempts actually fixed the issue. So now I'm
> thinking the problem is really at the lib/maple_tree.c end.
>

Do you have the patch
"maple_tree-Fix-stale-data-copy-in-mas_wr_node_store.patch"? It sounds
like your issue fits this fix exactly. I was seeing the same issue with
gcc 9.3.1 20200408 and this bug doesn't happen for me now. The logs
you sent also fit the situation. I went through the same exercise
(exorcism?) of debugging the various additions and removals of the VMA
only to find the issue in the tree itself. The fix also modified the
test code to detect the issue - which was actually hit but not detected
in the existing test cases from a live capture of VMA activities. It is
difficult to spot in the tree dump as well. I am sure I sent this to
Andrew as it is included in v11 and did not show up in his diff, but I
cannot find it on lore; perhaps I forgot to CC you? I've attached it
here for you in case you missed it.

You are actually hitting the issue several iterations after it first
occurred. It was introduced earlier in the tree and only exposed by
your later operations, by means of a node split or merge.
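
The failure mode described in this thread, an entry turning up both at its
correct index and at an earlier, stale one, can be illustrated with a toy
node rebuild. This is deliberately simplified (a flat array of indices,
made-up helper names) and is not the real mas_wr_node_store() logic from
the attached patch:

```c
#include <assert.h>
#include <stddef.h>

/* Buggy rebuild: copies the surviving entries from a stale snapshot of
 * the node, taken before an earlier removal, then appends the moved
 * entry at its new index.  The removed index sneaks back in. */
static int buggy_rebuild(const unsigned long *stale, int stale_n,
			 unsigned long new_index, unsigned long *out)
{
	int n = 0;

	for (int i = 0; i < stale_n; i++)
		out[n++] = stale[i];	/* includes the index that should be gone */
	out[n++] = new_index;
	return n;
}

/* Fixed rebuild: copies from the node's current contents instead. */
static int fixed_rebuild(const unsigned long *cur, int cur_n,
			 unsigned long new_index, unsigned long *out)
{
	int n = 0;

	for (int i = 0; i < cur_n; i++)
		out[n++] = cur[i];
	out[n++] = new_index;
	return n;
}
```

A linear search of the buggy result finds the stale, earlier index first,
matching the "vm_start not matching mt index" symptom in Hugh's report.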

> 7. If you get to do cleanups later, please shrink those double blank
> lines to single blank lines. And find a better name for the strange
> vma_mas_szero() - vma_mas_erase(), or is erase something different?
> I'm not even sure that it's needed: that's a special case for exec's
> shift_arg_pages() VM_STACK_INCOMPLETE_SETUP, which uses __vma_adjust()
> in a non-standard way. And trace_vma_mas_szero() in vma_mas_remove()
> looks wrong.

Okay, I'll be sure to have only one blank line. Where do you see this?
I would have thought that checkpatch would complain. I did try,
regretfully, to address more checkpatch issues in v11. It added more
noise to the diff from v10 + patches to v11 than anything else.


Thanks again,
Liam


Attachments:
0001-maple_tree-Fix-stale-data-copy-in-mas_wr_node_store.patch (4.31 kB)

2022-07-18 04:34:10

by Hugh Dickins

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

On Mon, 18 Jul 2022, Liam Howlett wrote:
> * Hugh Dickins <[email protected]> [220717 16:58]:
> > On Fri, 15 Jul 2022, Liam Howlett wrote:
> > >
> > > Please find attached the last outstanding fix for this series. It
> > > covers a rare failure scenario that one of the build bots reported.
> > > Ironically, it changes __vma_adjust() more than the rest of the series,
> > > but leaves the locking in the existing order.
> >
> > Thanks, I folded that in to my testing on next-20220715, along with
> > other you posted on Friday (mas_dead_leaves() walk fix);
>
> Please drop that patch, it needs more testing.

Drop the mas_dead_leaves() walk fix, or the __vma_adjust() changes
which you attached in that mail to me? please let me know a.s.a.p,
since I cannot proceed without knowing which.

>
> > but have not
> > even glanced at the v11 you posted Saturday, though I see from responses
> > that v11 has some other changes, including __vma_adjust() again, but I
> > just have not looked. (I've had my own experiments in __vma_adjust()).
> >
> > You'll be wanting my report, I'll give it here rather than in the v11
> > thread. In short, some progress, but still frustratingly none on the
> > main crashes.
> >
> > 1. swapops.h BUG on !PageLocked migration entry. This is *not* the
> > most urgent to fix, I'm just listing it first to get it out of the way
> > here. This is the BUG that terminates every tmpfs swapping load test
> > on the laptop: only progress was that, just one time, the workstation
> > hit it before hitting its usual problems: nice to see it there too.
> > I'll worry about this bug when the rest is running stably. I've only
> > ever noticed it when it's brk being unmapped, I expect that's a clue.
>
> Thanks for pointing me towards a usable reproducer. I should be able to
> narrow it down from there, especially with the brk hint.

I'm afraid I cannot point you to a good reproducer; but anyway, the BUG
necessarily appears some time after whatever code is at fault has been
exercised, so it needs thought rather than any reproducer. It was not
obvious to me, but I'll think it through again, once the other issues
are resolved.

>
> >
> > 2. Silly in do_mas_munmap():
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2513,7 +2513,7 @@ int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
> > arch_unmap(mm, start, end);
> >
> > /* Find the first overlapping VMA */
> > - vma = mas_find(mas, end - 1);
> > + vma = mas_find(mas, start);
> > if (!vma)
> > return 0;
> >
> > Fixing that does fix some bad behaviors seen - I'd been having crashes in
> > unmap_vmas() and unlink_anon_vmas(), and put "if (WARN_ON(!vma)) return;"
> > in unmap_region(); but that no longer seems necessary now do_mas_munmap()
> > is fixed thus. I had high hopes that it would fix all the rest, but no.
>
> This is actually correct. mas_find() is not the same as vma_find().
> mas_find() uses the maple state index and searches until a limit (end
> -1 in this case). I haven't seen these crashes, but I think you are
> hitting the same issue you are discussing in #6 below. I also hadn't
> realised the potential confusion of those APIs.
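
The mas_find() contract described above can be sketched as a toy model: a
search that starts from the state's current index and stops at a limit,
rather than a vma_find()-style lookup keyed on a start address. This is a
simplified sketch over a sorted range array with illustrative names
(toy_find, toy_state), not the real maple tree API:

```c
#include <assert.h>
#include <stddef.h>

struct toy_range { unsigned long start, end; };	/* [start, end) */

struct toy_state {
	const struct toy_range *ranges;	/* sorted, non-overlapping */
	size_t nr;
	unsigned long index;		/* current search position */
};

/* Return the first range at or after the state's current index whose
 * start does not exceed the limit, advancing the state past it; NULL
 * if nothing qualifies. */
static const struct toy_range *toy_find(struct toy_state *ms,
					unsigned long limit)
{
	for (size_t i = 0; i < ms->nr; i++) {
		const struct toy_range *r = &ms->ranges[i];

		if (r->end <= ms->index)	/* wholly before the position */
			continue;
		if (r->start > limit)		/* beyond the search limit */
			return NULL;
		ms->index = r->end;		/* advance past this range */
		return r;
	}
	return NULL;
}
```

With the state positioned at start, toy_find(&ms, end - 1) returns the
first range overlapping [start, end), which is why the original
mas_find(mas, end - 1) call in do_mas_munmap() was correct despite
looking suspicious.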

You're right, I'm wrong, sorry about that. But then I'm left with the
conundrum of how a class of crashes went away when I changed that. Did
I break it all so badly that it couldn't reach the anon_vma bugs I was
hitting before? Or did making that change coincide with my moving from
DEBUG_MAPLE off to on, so different crashes came sooner? Or did I fold
in another fix from you around that time? I don't know (and I don't
expect you to answer this!), but I've got some backtracking to do.

>
> >
> > 3. vma_adjust_trans_huge(). Skip this paragraph, I think there
> > is actually no problem here, but for safety I have rearranged the
> > vma_adjust_trans_huge()s inside the rmap locks, because when things
> > aren't working right, best to rule out some possibilities. Why am
> > I even mentioning it here? In case I send any code snippets and
> > you're puzzled by vma_adjust_trans_huge() having moved.
>
> Thanks, I will keep this in mind.
>
> >
> > 4. unlink_anon_vmas() in __vma_adjust(). Not urgent to fix (can only
> > be an issue when GFP_KERNEL preallocation fails, which I think means
> > when current is being OOM-killed), but whereas vma_expand() has careful
> > anon_cloned flag, __vma_adjust() does not, and I think could be
> > unlinking a pre-existing anon_vma. Aside from that, I am nervous about
> > using unlink_anon_vmas() there like that (and in vma_expand()): IIUC it's
> > an unnecessary "optimization" for a very unlikely case, that's in danger
> > of doing more harm than good; and should be removed from them both (then
> > they can both simply return -ENOMEM when mas_preallocate() fails).
>
> I will add a flag to __vma_adjust, but I don't see how it could happen.
> I guess if importer has an anon_vma already? I added these as an unwind
> since I don't want to leak - even on the rare preallocation failure. If
> you don't want to unroll, what would you suggest in these cases? Would
> a flag be acceptable?

I cannot see what purpose adding a flag to __vma_adjust() would serve.
If importer had anon_vma already, yes, without checking the code again,
that was I think what I had in mind. But (correct me if you've observed
that I'm wrong) there's no question of a leak there: a vma which wasn't
given an anon_vma before gets linked in to one, and it will all get torn
down correctly when the process exits (indeed, when OOM kill completes).

It's "nice" to delay setting anon_vma until it's needed, but not worth
any effort to rewind it (especially on such an unlikely path): and
normally, once anon_vma has been set, it stays set - I'm not sure of
the consequence of unsetting it again (racing with rmap lookups: may
be okay because of how anon_vma is not trusted when page is not mapped;
but it's easier just to omit the rewind than think all that through).

>
> >
> > 5. I was horrified/thrilled to notice last night that mas_store_prealloc()
> > does a mas_destroy() at the end. So only the first vma_mas_store() in
> > __vma_adjust() gets to use the carefully preallocated nodes. I thought
> > that might be responsible for lots of nastiness, but again sadly no.
> > Maybe it just falls back to GFP_ATOMIC when the preallocateds are gone
> > (I didn't look) and that often works okay. Whether the carefully
> > chosen prealloc count allows for __vma_adjust() use would be another
> > question. (Earlier on I did try doubling its calculation, in case it
> > was preallocating too few, but that made no difference either.)
>
> mas_store_prealloc will allocate for the worst case scenario. Since I
> cannot guarantee the second store isn't the worst case, and could fail,
> I cannot allow for more than one allocation per preallocate. I thought
> I was fine in __vma_adjust since I preallocate after the goto label for
> a second removal but it turns out I wasn't since the second preallocate
> could fail, so I've removed the requirement for a second store for 'case
> 6' by updating the tree once and removing both VMAs in a single pass.
> There could, potentially be an issue if the caller to __vma_merge()
> wanted to reduce one limit of the VMA (I guess just the start..) and
> also remove one or more VMAs, and also be in a situation where
> allocations could cause issues (fs_reclaim).. and since __vma_adjust()
> is so complicated, I looked at the callers. Most use vma_merge(), and
> > those seem fine. fs/exec only adjusts one at a time. When splitting,
> only a single insert happens. Did I miss some situation(s)?
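
The preallocation contract in this exchange, a worst-case reservation
consumed and destroyed by a single store, can be sketched as a toy model.
The names (toy_mas, toy_preallocate, toy_store_prealloc) are illustrative
only; this is not the real mas_preallocate()/mas_store_prealloc()
implementation:

```c
#include <assert.h>
#include <stdbool.h>

struct toy_mas {
	int prealloc_nodes;	/* nodes reserved for the next store */
};

/* Reserve enough nodes for a worst-case store. */
static void toy_preallocate(struct toy_mas *mas, int worst_case)
{
	mas->prealloc_nodes = worst_case;
}

/* A store consumes the whole reservation and destroys any remainder,
 * mirroring the mas_destroy() at the end of mas_store_prealloc().
 * Returns true if this store was backed by a preallocation. */
static bool toy_store_prealloc(struct toy_mas *mas)
{
	bool backed = mas->prealloc_nodes > 0;

	mas->prealloc_nodes = 0;	/* one store per preallocation */
	return backed;
}
```

A second store finds no reservation left, which is the behaviour Hugh
noticed: only the first vma_mas_store() in __vma_adjust() gets to use the
preallocated nodes, and anything after it must allocate on the fly.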

I don't think the fs/exec stack moving will be any worry for this.
Did you miss any case? Yes, the "insert" cases from __split_vma().
I have no appreciation of when maple tree actually needs to make an
allocation, so I don't know whether the adjust_next case ever needs
to make an allocation, but I'd have thought the insert case might
need to sometimes.

But I'll admit to skimming your paragraph there, I'm more concerned
to go on to the following issue than fully grasp your argument above.

>
> >
> > 6. The main problem, crashes on the workstation (never seen on the
> > laptop). I keep hacking around with the debug info (and, please,
> > %px not %p, %lx not %lu in the debug info: I've changed them all,
>
> Okay, so I tried to remove all %px in the debug code so I'll revert
> those. I use %lu for historic reasons from mt_dump(), I can change
> those too. The tree uses ranges to store pointers, so I print the
> pointers in %lx and the ranges in %lu.
>
>
> > and a couple of %lds, in my copy of lib/maple_tree.c). I forget
> > how the typical crashes appeared with the MAPLE_DEBUGs turned off
> > (the BUG_ON(count != mm->map_count) in exit_mmap() perhaps); I've
> > turned them on, but usually comment out the mt_validate() and
> > mt_dump(), which generate far too much noise for non-specialists,
> > and delay the onset of crash tenfold (but re-enabled to give you
> > the attached messages.xz).
> >
> > Every time, it's the cc1 (9.3.1) doing some munmapping (cc1 is
> > mostly what's running in this load test, so that's not surprising;
> > but surprising that even when I switched the laptop to using same
> > gcc-9, couldn't reproduce the problem there). Typically, that
> > munmapping has involved splitting a small, say three page, vma
> > into two pages below and one page above (the "insert", and I've
> > hacked the debug to dump that too, see "mmap: insert" - ah,
> > looking at the messages now, the insert is bigger this time).
> >
> > And what has happened each time I've studied it (I don't know
> > if this is evident from the mt dumps in the messages, I hope so)
> > is that the vma and the insert have been correctly placed in the
> > tree, except that the vma has *also* been placed several pages
> > earlier, and a linear search of the tree finds that misplaced
> > instance first, vm_start not matching mt index.
> >
> > The map_count in these cases has always been around 59, 60, 61:
> > maybe just typical for cc1, or maybe significant for maple tree?
> >
> > I won't give up quite yet, but I'm hoping you'll have an idea for
> > the misplaced tree entry. Something going wrong in rebalancing?
> >
> > For a long time I assumed the problem would be at the mm/mmap.c end,
> > and I've repeatedly tried different things with the vma_mas stuff
> > in __vma_adjust() (for example, using vma_mas_remove() to remove
> > vmas before re-adding them, and/or doing mas_reset() in more places);
> > but none of those attempts actually fixed the issue. So now I'm
> > thinking the problem is really at the lib/maple_tree.c end.
> >
>
> Do you have the patch
> "maple_tree-Fix-stale-data-copy-in-mas_wr_node_store.patch"? It sounds
> like your issue fits this fix exactly. I was seeing the same issue with
> gcc 9.3.1 20200408 and this bug doesn't happen for me now. The logs
> you sent also fit the situation. I went through the same exercise
> (exorcism?) of debugging the various additions and removals of the VMA
> only to find the issue in the tree itself. The fix also modified the
> test code to detect the issue - which was actually hit but not detected
> in the existing test cases from a live capture of VMA activities. It is
> difficult to spot in the tree dump as well. I am sure I sent this to
> Andrew as it is included in v11 and did not show up in his diff, but I
> cannot find it on lore, perhaps I forgot to CC you? I've attached it
> here for you in case you missed it.

Thanks! No, I never received that patch, nor can I see it on lore
or marc.info; but I (still) haven't looked at v11, and don't know
about Andrew's diff. Anyway, sounds exciting, I'm eager to stop
writing this mail and get to testing with that in - but please
let me know whether it's the mas_dead_leaves() or the __vma_adjust()
mods you attached previously, which you want me to leave out.

>
> You are actually hitting the issue several iterations beyond when it
> first occurred. It was introduced earlier in the tree and exposed with
> your other operations by means of a node split or merge.
>
> > 7. If you get to do cleanups later, please shrink those double blank
> > lines to single blank lines. And find a better name for the strange
> > vma_mas_szero() - vma_mas_erase(), or is erase something different?
> > I'm not even sure that it's needed: that's a special case for exec's
> > shift_arg_pages() VM_STACK_INCOMPLETE_SETUP, which uses __vma_adjust()
> > in a non-standard way. And trace_vma_mas_szero() in vma_mas_remove()
> > looks wrong.
>
> Okay, I'll be sure to only have one blank line. Where do you see this?
> I would have thought that checkpatch would complain? I did try to,

No, I'm not going to search for those double blank lines now:
please do your own diff and look through for them. And I don't know
whether checkpatch objects to them or not, but they're bad for patch
application, since they increase the likelihood that a patch applies
in the wrong place.

As to whether this is or is not a good time for such cleanups,
I just don't know: I see Andrew on the one hand intending to drop
maple tree for the moment, but Linus on the other hand promising
an extra week for 5.19. I'll just let others decide what they want.

Hugh

> regretfully, address more checkpatch issues on v11. It added more noise
> to the differences of v10 + patches to v11 than anything else.
>
>
> Thanks again,
> Liam

2022-07-18 07:16:15

by Hugh Dickins

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

On Sun, 17 Jul 2022, Hugh Dickins wrote:
> On Mon, 18 Jul 2022, Liam Howlett wrote:
> > Do you have the patch
> > "maple_tree-Fix-stale-data-copy-in-mas_wr_node_store.patch"? It sounds
> > like your issue fits this fix exactly. I was seeing the same issue with
> > gcc 9.3.1 20200408 and this bug doesn't happen for me now. The logs
> > you sent also fit the situation. I went through the same exercise
> > (exorcism?) of debugging the various additions and removals of the VMA
> > only to find the issue in the tree itself. The fix also modified the
> > test code to detect the issue - which was actually hit but not detected
> > in the existing test cases from a live capture of VMA activities. It is
> > difficult to spot in the tree dump as well. I am sure I sent this to
> > Andrew as it is included in v11 and did not show up in his diff, but I
> > cannot find it on lore, perhaps I forgot to CC you? I've attached it
> > here for you in case you missed it.
>
> Thanks! No, I never received that patch, nor can I see it on lore
> or marc.info; but I (still) haven't looked at v11, and don't know
> about Andrew's diff. Anyway, sounds exciting, I'm eager to stop
> writing this mail and get to testing with that in - but please
> let me know whether it's the mas_dead_leaves() or the __vma_adjust()
> mods you attached previously, which you want me to leave out.

I just went ahead with both of those still in, my do_mas_munmap()
non-fix reverted, and your "Fix stale data copy" added: looks very
promising so far, been running two hours without a problem. I
do expect it to hit the migration entry !PageLocked BUG soon (as
it quickly did on the laptop), but that's okay: I'll switch over
to thinking about that BUG, so long as tonight's run does not
crash in some other way.

Hugh

2022-07-18 13:02:13

by Liam R. Howlett

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

* Hugh Dickins <[email protected]> [220718 00:28]:
> On Mon, 18 Jul 2022, Liam Howlett wrote:
> > * Hugh Dickins <[email protected]> [220717 16:58]:
> > > On Fri, 15 Jul 2022, Liam Howlett wrote:
> > > >
> > > > Please find attached the last outstanding fix for this series. It
> > > > covers a rare failure scenario that one of the build bots reported.
> > > > Ironically, it changes __vma_adjust() more than the rest of the series,
> > > > but leaves the locking in the existing order.
> > >
> > > Thanks, I folded that in to my testing on next-20220715, along with
> > > other you posted on Friday (mas_dead_leaves() walk fix);
> >
> > Please drop that patch, it needs more testing.
>
> Drop the mas_dead_leaves() walk fix, or the __vma_adjust() changes
> which you attached in that mail to me? please let me know a.s.a.p,
> since I cannot proceed without knowing which.

Drop the mas_dead_leaves() walk fix please. I responded to the patch
that it's not tested enough. I'll respond to the rest of this email
soon.

>
> >
> > > but have not
> > > even glanced at the v11 you posted Saturday, though I see from responses
> > > that v11 has some other changes, including __vma_adjust() again, but I
> > > just have not looked. (I've had my own experiments in __vma_adjust()).
> > >
> > > You'll be wanting my report, I'll give it here rather than in the v11
> > > thread. In short, some progress, but still frustratingly none on the
> > > main crashes.
> > >
> > > 1. swapops.h BUG on !PageLocked migration entry. This is *not* the
> > > most urgent to fix, I'm just listing it first to get it out of the way
> > > here. This is the BUG that terminates every tmpfs swapping load test
> > > on the laptop: only progress was that, just one time, the workstation
> > > hit it before hitting its usual problems: nice to see it there too.
> > > I'll worry about this bug when the rest is running stably. I've only
> > > ever noticed it when it's brk being unmapped, I expect that's a clue.
> >
> > Thanks for pointing me towards a usable reproducer. I should be able to
> > narrow it down from there, especially with the brk hint.
>
> I'm afraid I cannot point you to a good reproducer; but anyway, the BUG
> necessarily appears some time after whatever code is at fault has been
> exercised, so it needs thought rather than any reproducer. It was not
> obvious to me, but I'll think it through again, once the other issues
> are resolved.
>
> >
> > >
> > > 2. Silly in do_mas_munmap():
> > > --- a/mm/mmap.c
> > > +++ b/mm/mmap.c
> > > @@ -2513,7 +2513,7 @@ int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
> > > arch_unmap(mm, start, end);
> > >
> > > /* Find the first overlapping VMA */
> > > - vma = mas_find(mas, end - 1);
> > > + vma = mas_find(mas, start);
> > > if (!vma)
> > > return 0;
> > >
> > > Fixing that does fix some bad behaviors seen - I'd been having crashes in
> > > unmap_vmas() and unlink_anon_vmas(), and put "if (WARN_ON(!vma)) return;"
> > > in unmap_region(); but that no longer seems necessary now do_mas_munmap()
> > > is fixed thus. I had high hopes that it would fix all the rest, but no.
> >
> > This is actually correct. mas_find() is not the same as vma_find().
> > mas_find() uses the maple state index and searches until a limit (end
> > -1 in this case). I haven't seen these crashes, but I think you are
> > hitting the same issue you are discussing in #6 below. I also hadn't
> > realised the potential confusion of those APIs.
>
> You're right, I'm wrong, sorry about that. But then I'm left with the
> conundrum of how a class of crashes went away when I changed that. Did
> I break it all so badly that it couldn't reach the anon_vma bugs I was
> hitting before? Or did making that change coincide with my moving from
> DEBUG_MAPLE off to on, so different crashes came sooner? Or did I fold
> in another fix from you around that time? I don't know (and I don't
> expect you to answer this!), but I've got some backtracking to do.
>
> >
> > >
> > > 3. vma_adjust_trans_huge(). Skip this paragraph, I think there
> > > is actually no problem here, but for safety I have rearranged the
> > > vma_adjust_trans_huge()s inside the rmap locks, because when things
> > > aren't working right, best to rule out some possibilities. Why am
> > > I even mentioning it here? In case I send any code snippets and
> > > you're puzzled by vma_adjust_trans_huge() having moved.
> >
> > Thanks, I will keep this in mind.
> >
> > >
> > > 4. unlink_anon_vmas() in __vma_adjust(). Not urgent to fix (can only
> > > be an issue when GFP_KERNEL preallocation fails, which I think means
> > > when current is being OOM-killed), but whereas vma_expand() has careful
> > > anon_cloned flag, __vma_adjust() does not, and I think could be
> > > unlinking a pre-existing anon_vma. Aside from that, I am nervous about
> > > using unlink_anon_vmas() there like that (and in vma_expand()): IIUC it's
> > > an unnecessary "optimization" for a very unlikely case, that's in danger
> > > of doing more harm than good; and should be removed from them both (then
> > > they can both simply return -ENOMEM when mas_preallocate() fails).
> >
> > I will add a flag to __vma_adjust, but I don't see how it could happen.
> > I guess if importer has an anon_vma already? I added these as an unwind
> > since I don't want to leak - even on the rare preallocation failure. If
> > you don't want to unroll, what would you suggest in these cases? Would
> > a flag be acceptable?
>
> I cannot see what purpose adding a flag to __vma_adjust() would serve.
> If importer had anon_vma already, yes, without checking the code again,
> that was I think what I had in mind. But (correct me if you've observed
> that I'm wrong) there's no question of a leak there: a vma which wasn't
> given an anon_vma before gets linked in to one, and it will all get torn
> down correctly when the process exits (indeed, when OOM kill completes).
>
> It's "nice" to delay setting anon_vma until it's needed, but not worth
> any effort to rewind it (especially on such an unlikely path): and
> normally, once anon_vma has been set, it stays set - I'm not sure of
> the consequence of unsetting it again (racing with rmap lookups: may
> be okay because of how anon_vma is not trusted when page is not mapped;
> but it's easier just to omit the rewind than think all that through).
>
> >
> > >
> > > 5. I was horrified/thrilled to notice last night that mas_store_prealloc()
> > > does a mas_destroy() at the end. So only the first vma_mas_store() in
> > > __vma_adjust() gets to use the carefully preallocated nodes. I thought
> > > that might be responsible for lots of nastiness, but again sadly no.
> > > Maybe it just falls back to GFP_ATOMIC when the preallocateds are gone
> > > (I didn't look) and that often works okay. Whether the carefully
> > > chosen prealloc count allows for __vma_adjust() use would be another
> > > question. (Earlier on I did try doubling its calculation, in case it
> > > was preallocating too few, but that made no difference either.)
> >
> > mas_store_prealloc will allocate for the worst case scenario. Since I
> > cannot guarantee the second store isn't the worst case, and could fail,
> > I cannot allow for more than one allocation per preallocate. I thought
> > I was fine in __vma_adjust since I preallocate after the goto label for
> > a second removal but it turns out I wasn't since the second preallocate
> > could fail, so I've removed the requirement for a second store for 'case
> > 6' by updating the tree once and removing both VMAs in a single pass.
> > There could, potentially be an issue if the caller to __vma_merge()
> > wanted to reduce one limit of the VMA (I guess just the start..) and
> > also remove one or more VMAs, and also be in a situation where
> > allocations could cause issues (fs_reclaim).. and since __vma_adjust()
> > is so complicated, I looked at the callers. Most use vma_merge(), and
> > > those seem fine. fs/exec only adjusts one at a time. When splitting,
> > only a single insert happens. Did I miss some situation(s)?
>
> I don't think the fs/exec stack moving will be any worry for this.
> Did you miss any case? Yes, the "insert" cases from __split_vma().
> I have no appreciation of when maple tree actually needs to make an
> allocation, so I don't know whether the adjust_next case ever needs
> to make an allocation, but I'd have thought the insert case might
> need to sometimes.
>
> But I'll admit to skimming your paragraph there, I'm more concerned
> to go on to the following issue than fully grasp your argument above.
>
> >
> > >
> > > 6. The main problem, crashes on the workstation (never seen on the
> > > laptop). I keep hacking around with the debug info (and, please,
> > > %px not %p, %lx not %lu in the debug info: I've changed them all,
> >
> > Okay, so I tried to remove all %px in the debug code so I'll revert
> > those. I use %lu for historic reasons from mt_dump(), I can change
> > those too. The tree uses ranges to store pointers, so I print the
> > pointers in %lx and the ranges in %lu.
> >
> >
> > > and a couple of %lds, in my copy of lib/maple_tree.c). I forget
> > > how the typical crashes appeared with the MAPLE_DEBUGs turned off
> > > (the BUG_ON(count != mm->map_count) in exit_mmap() perhaps); I've
> > > turned them on, but usually comment out the mt_validate() and
> > > mt_dump(), which generate far too much noise for non-specialists,
> > > and delay the onset of crash tenfold (but re-enabled to give you
> > > the attached messages.xz).
> > >
> > > Every time, it's the cc1 (9.3.1) doing some munmapping (cc1 is
> > > mostly what's running in this load test, so that's not surprising;
> > > but surprising that even when I switched the laptop to using same
> > > gcc-9, couldn't reproduce the problem there). Typically, that
> > > munmapping has involved splitting a small, say three page, vma
> > > into two pages below and one page above (the "insert", and I've
> > > hacked the debug to dump that too, see "mmap: insert" - ah,
> > > looking at the messages now, the insert is bigger this time).
> > >
> > > And what has happened each time I've studied it (I don't know
> > > if this is evident from the mt dumps in the messages, I hope so)
> > > is that the vma and the insert have been correctly placed in the
> > > tree, except that the vma has *also* been placed several pages
> > > earlier, and a linear search of the tree finds that misplaced
> > > instance first, vm_start not matching mt index.
> > >
> > > The map_count in these cases has always been around 59, 60, 61:
> > > maybe just typical for cc1, or maybe significant for maple tree?
> > >
> > > I won't give up quite yet, but I'm hoping you'll have an idea for
> > > the misplaced tree entry. Something going wrong in rebalancing?
> > >
> > > For a long time I assumed the problem would be at the mm/mmap.c end,
> > > and I've repeatedly tried different things with the vma_mas stuff
> > > in __vma_adjust() (for example, using vma_mas_remove() to remove
> > > vmas before re-adding them, and/or doing mas_reset() in more places);
> > > but none of those attempts actually fixed the issue. So now I'm
> > > thinking the problem is really at the lib/maple_tree.c end.
> > >
> >
> > Do you have the patch
> > "maple_tree-Fix-stale-data-copy-in-mas_wr_node_store.patch"? It sounds
> > like your issue fits this fix exactly. I was seeing the same issue with
> > gcc 9.3.1 20200408 and this bug doesn't happen for me now. The logs
> > you sent also fit the situation. I went through the same exercise
> > (exorcism?) of debugging the various additions and removals of the VMA
> > only to find the issue in the tree itself. The fix also modified the
> > test code to detect the issue - which was actually hit but not detected
> > in the existing test cases from a live capture of VMA activities. It is
> > difficult to spot in the tree dump as well. I am sure I sent this to
> > Andrew as it is included in v11 and did not show up in his diff, but I
> > cannot find it on lore, perhaps I forgot to CC you? I've attached it
> > here for you in case you missed it.
>
> Thanks! No, I never received that patch, nor can I see it on lore
> or marc.info; but I (still) haven't looked at v11, and don't know
> about Andrew's diff. Anyway, sounds exciting, I'm eager to stop
> writing this mail and get to testing with that in - but please
> let me know whether it's the mas_dead_leaves() or the __vma_adjust()
> mods you attached previously, which you want me to leave out.
>
> >
> > You are actually hitting the issue several iterations beyond when it
> > first occurred. It was introduced earlier in the tree and exposed with
> > your other operations by means of a node split or merge.
> >
> > > 7. If you get to do cleanups later, please shrink those double blank
> > > lines to single blank lines. And find a better name for the strange
> > > vma_mas_szero() - vma_mas_erase(), or is erase something different?
> > > I'm not even sure that it's needed: that's a special case for exec's
> > > shift_arg_pages() VM_STACK_INCOMPLETE_SETUP, which uses __vma_adjust()
> > > in a non-standard way. And trace_vma_mas_szero() in vma_mas_remove()
> > > looks wrong.
> >
> > Okay, I'll be sure to only have one blank line. Where do you see this?
> > I would have thought that checkpatch would complain? I did try to,
>
> No, I'm not going to search for those double blank lines now:
> please do your own diff and look through for them. And I don't know
> whether checkpatch objects to them or not, but they're bad for patch
> application, since they increase the likelihood that a patch applies
> in the wrong place.
>
> As to whether this is or is not a good time for such cleanups,
> I just don't know: I see Andrew on the one hand intending to drop
> maple tree for the moment, but Linus on the other hand promising
> an extra week for 5.19. I'll just let others decide what they want.
>
> Hugh
>
> > regretfully, address more checkpatch issues on v11. It added more noise
> > to the differences of v10 + patches to v11 than anything else.
> >
> >
> > Thanks again,
> > Liam

2022-07-18 14:14:53

by Liam R. Howlett

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

* Liam R. Howlett <[email protected]> [220718 08:56]:
> * Hugh Dickins <[email protected]> [220718 00:28]:
> > On Mon, 18 Jul 2022, Liam Howlett wrote:
> > > * Hugh Dickins <[email protected]> [220717 16:58]:
> > > > On Fri, 15 Jul 2022, Liam Howlett wrote:
> > > > >
> > > > > Please find attached the last outstanding fix for this series. It
> > > > > covers a rare failure scenario that one of the build bots reported.
> > > > > Ironically, it changes __vma_adjust() more than the rest of the series,
> > > > > but leaves the locking in the existing order.
> > > >
> > > > Thanks, I folded that in to my testing on next-20220715, along with
> > > > other you posted on Friday (mas_dead_leaves() walk fix);
> > >
> > > Please drop that patch, it needs more testing.
> >
> > Drop the mas_dead_leaves() walk fix, or the __vma_adjust() changes
> > which you attached in that mail to me? please let me know a.s.a.p,
> > since I cannot proceed without knowing which.
>
> Drop the mas_dead_leaves() walk fix please. I responded to the patch
> that it's not tested enough. I'll respond to the rest of this email
> soon.

So the mas_dead_leaves() patch will most likely produce memory leaks on
big-endian systems.

>
> >
> > >
> > > > but have not
> > > > even glanced at the v11 you posted Saturday, though I see from responses
> > > > that v11 has some other changes, including __vma_adjust() again, but I
> > > > just have not looked. (I've had my own experiments in __vma_adjust()).
> > > >
> > > > You'll be wanting my report, I'll give it here rather than in the v11
> > > > thread. In short, some progress, but still frustratingly none on the
> > > > main crashes.
> > > >
> > > > 1. swapops.h BUG on !PageLocked migration entry. This is *not* the
> > > > most urgent to fix, I'm just listing it first to get it out of the way
> > > > here. This is the BUG that terminates every tmpfs swapping load test
> > > > on the laptop: only progress was that, just one time, the workstation
> > > > hit it before hitting its usual problems: nice to see it there too.
> > > > I'll worry about this bug when the rest is running stably. I've only
> > > > ever noticed it when it's brk being unmapped, I expect that's a clue.
> > >
> > > Thanks for pointing me towards a usable reproducer. I should be able to
> > > narrow it down from there, especially with the brk hint.
> >
> > I'm afraid I cannot point you to a good reproducer; but anyway, the BUG
> > necessarily appears some time after whatever code is at fault has been
> > exercised, so it needs thought rather than any reproducer. It was not
> > obvious to me, but I'll think it through again, once the other issues
> > are resolved.
> >
> > >
> > > >
> > > > 2. Silly in do_mas_munmap():
> > > > --- a/mm/mmap.c
> > > > +++ b/mm/mmap.c
> > > > @@ -2513,7 +2513,7 @@ int do_mas_munmap(struct ma_state *mas, struct mm_struct *mm,
> > > > arch_unmap(mm, start, end);
> > > >
> > > > /* Find the first overlapping VMA */
> > > > - vma = mas_find(mas, end - 1);
> > > > + vma = mas_find(mas, start);
> > > > if (!vma)
> > > > return 0;
> > > >
> > > > Fixing that does fix some bad behaviors seen - I'd been having crashes in
> > > > unmap_vmas() and unlink_anon_vmas(), and put "if (WARN_ON(!vma)) return;"
> > > > in unmap_region(); but that no longer seems necessary now do_mas_munmap()
> > > > is fixed thus. I had high hopes that it would fix all the rest, but no.
> > >
> > > This is actually correct. mas_find() is not the same as vma_find().
> > > mas_find() uses the maple state index and searches until a limit (end
> > > -1 in this case). I haven't seen these crashes, but I think you are
> > > hitting the same issue you are discussing in #6 below. I also hadn't
> > > realised the potential confusion of those APIs.
> >
> > You're right, I'm wrong, sorry about that. But then I'm left with the
> > conundrum of how a class of crashes went away when I changed that. Did
> > I break it all so badly that it couldn't reach the anon_vma bugs I was
> > hitting before? Or did making that change coincide with my moving from
> > DEBUG_MAPLE off to on, so different crashes came sooner? Or did I fold
> > in another fix from you around that time? I don't know (and I don't
> > expect you to answer this!), but I've got some backtracking to do.

I expect it is because your search skipped the badness inserted by the
bug from #6. I would think the workload that this was crashing on would
split the nodes in such a way that the bad VMAs ended up ahead of the
correct data?

> >
> > >
> > > >
> > > > 3. vma_adjust_trans_huge(). Skip this paragraph, I think there
> > > > is actually no problem here, but for safety I have rearranged the
> > > > vma_adjust_trans_huge()s inside the rmap locks, because when things
> > > > aren't working right, best to rule out some possibilities. Why am
> > > > I even mentioning it here? In case I send any code snippets and
> > > > you're puzzled by vma_adjust_trans_huge() having moved.
> > >
> > > Thanks, I will keep this in mind.
> > >
> > > >
> > > > 4. unlink_anon_vmas() in __vma_adjust(). Not urgent to fix (can only
> > > > be an issue when GFP_KERNEL preallocation fails, which I think means
> > > > when current is being OOM-killed), but whereas vma_expand() has careful
> > > > anon_cloned flag, __vma_adjust() does not, and I think could be
> > > > unlinking a pre-existing anon_vma. Aside from that, I am nervous about
> > > > using unlink_anon_vmas() there like that (and in vma_expand()): IIUC it's
> > > > an unnecessary "optimization" for a very unlikely case, that's in danger
> > > > of doing more harm than good; and should be removed from them both (then
> > > > they can both simply return -ENOMEM when mas_preallocate() fails).
> > >
> > > I will add a flag to __vma_adjust, but I don't see how it could happen.
> > > I guess if importer has an anon_vma already? I added these as an unwind
> > > since I don't want to leak - even on the rare preallocation failure. If
> > > you don't want to unroll, what would you suggest in these cases? Would
> > > a flag be acceptable?
> >
> > I cannot see what purpose adding a flag to __vma_adjust() would serve.
> > If importer had anon_vma already, yes, without checking the code again,
> > that was I think what I had in mind. But (correct me if you've observed
> > that I'm wrong) there's no question of a leak there: a vma which wasn't
> > given an anon_vma before gets linked in to one, and it will all get torn
> > down correctly when the process exits (indeed, when OOM kill completes).
> >
> > It's "nice" to delay setting anon_vma until it's needed, but not worth
> > any effort to rewind it (especially on such an unlikely path): and
> > normally, once anon_vma has been set, it stays set - I'm not sure of
> > the consequence of unsetting it again (racing with rmap lookups: may
> > be okay because of how anon_vma is not trusted when page is not mapped;
> > but it's easier just to omit the rewind than think all that through).

So the only time I've ever seen __vma_adjust() fail is with a fault
injector failing mas_preallocate() allocations. If it's safe not to
unwind, I'm happy to drop both unwinds, but I was concerned about the
path of a vma_merge() calling __vma_adjust() and failing out on
allocations, then OOM recovering, leaving a half-merged VMA with
anon_vma incorrectly set, which is an even more unlikely scenario.

> >
> > >
> > > >
> > > > 5. I was horrified/thrilled to notice last night that mas_store_prealloc()
> > > > does a mas_destroy() at the end. So only the first vma_mas_store() in
> > > > __vma_adjust() gets to use the carefully preallocated nodes. I thought
> > > > that might be responsible for lots of nastiness, but again sadly no.
> > > > Maybe it just falls back to GFP_ATOMIC when the preallocateds are gone
> > > > (I didn't look) and that often works okay. Whether the carefully
> > > > chosen prealloc count allows for __vma_adjust() use would be another
> > > > question. (Earlier on I did try doubling its calculation, in case it
> > > > was preallocating too few, but that made no difference either.)
> > >
> > > mas_store_prealloc will allocate for the worst case scenario. Since I
> > > cannot guarantee the second store isn't the worst case, and could fail,
> > > I cannot allow for more than one allocation per preallocate. I thought
> > > I was fine in __vma_adjust since I preallocate after the goto label for
> > > a second removal but it turns out I wasn't since the second preallocate
> > > could fail, so I've removed the requirement for a second store for 'case
> > > 6' by updating the tree once and removing both VMAs in a single pass.
> > > There could, potentially be an issue if the caller to __vma_merge()
> > > wanted to reduce one limit of the VMA (I guess just the start..) and
> > > also remove one or more VMAs, and also be in a situation where
> > > allocations could cause issues (fs_reclaim).. and since __vma_adjust()
> > > is so complicated, I looked at the callers. Most use vma_merge(), and
> > > those seem fine. fs/exec only adjusts one at a time. when splitting,
> > > only a single insert happens. Did I miss some situation(s)?
> >
> > I don't think the fs/exec stack moving will be any worry for this.
> > Did you miss any case? Yes, the "insert" cases from __split_vma().
> > I have no appreciation of when maple tree actually needs to make an
> > allocation, so I don't know whether the adjust_next case ever needs
> > to make an allocation, but I'd have thought the insert case might
> > need to sometimes.

Right, the __split_vma() never adjusts anything but one side of the
'vma' VMA by inserting the 'insert' VMA. This will result in two writes
to the tree - but one will exactly fit in an existing range which will
be placed without an allocation via the mas_wr_slot_store() function in
the maple tree. Exact fits are nice - they are fast.

> >
> > But I'll admit to skimming your paragraph there, I'm more concerned
> > to go on to the following issue than fully grasp your argument above.
> >
> > >
> > > >
> > > > 6. The main problem, crashes on the workstation (never seen on the
> > > > laptop). I keep hacking around with the debug info (and, please,
> > > > %px not %p, %lx not %lu in the debug info: I've changed them all,
> > >
> > > Okay, so I tried to remove all %px in the debug code so I'll revert
> > > those. I use %lu for historic reasons from mt_dump(); I can change
> > > those too. The tree uses ranges to store pointers, so I print the
> > > pointers in %lx and the ranges in %lu.
> > >
> > >
> > > > and a couple of %lds, in my copy of lib/maple_tree.c). I forget
> > > > how the typical crashes appeared with the MAPLE_DEBUGs turned off
> > > > (the BUG_ON(count != mm->map_count) in exit_mmap() perhaps); I've
> > > > turned them on, but usually comment out the mt_validate() and
> > > > mt_dump(), which generate far too much noise for non-specialists,
> > > > and delay the onset of crash tenfold (but re-enabled to give you
> > > > the attached messages.xz).
> > > >
> > > > Every time, it's the cc1 (9.3.1) doing some munmapping (cc1 is
> > > > mostly what's running in this load test, so that's not surprising;
> > > > but surprising that even when I switched the laptop to using same
> > > > gcc-9, couldn't reproduce the problem there). Typically, that
> > > > munmapping has involved splitting a small, say three page, vma
> > > > into two pages below and one page above (the "insert", and I've
> > > > hacked the debug to dump that too, see "mmap: insert" - ah,
> > > > looking at the messages now, the insert is bigger this time).
> > > >
> > > > And what has happened each time I've studied it (I don't know
> > > > if this is evident from the mt dumps in the messages, I hope so)
> > > > is that the vma and the insert have been correctly placed in the
> > > > tree, except that the vma has *also* been placed several pages
> > > > earlier, and a linear search of the tree finds that misplaced
> > > > instance first, vm_start not matching mt index.
> > > >
> > > > The map_count in these cases has always been around 59, 60, 61:
> > > > maybe just typical for cc1, or maybe significant for maple tree?
> > > >
> > > > I won't give up quite yet, but I'm hoping you'll have an idea for
> > > > the misplaced tree entry. Something going wrong in rebalancing?
> > > >
> > > > For a long time I assumed the problem would be at the mm/mmap.c end,
> > > > and I've repeatedly tried different things with the vma_mas stuff
> > > > in __vma_adjust() (for example, using vma_mas_remove() to remove
> > > > vmas before re-adding them, and/or doing mas_reset() in more places);
> > > > but none of those attempts actually fixed the issue. So now I'm
> > > > thinking the problem is really at the lib/maple_tree.c end.
> > > >
> > >
> > > Do you have the patch
> > > "maple_tree-Fix-stale-data-copy-in-mas_wr_node_store.patch"? It sounds
> > > like your issue fits this fix exactly. I was seeing the same issue with
> > > gcc 9.3.1 20200408 and this bug doesn't happen for me now. The logs
> > > you sent also fit the situation. I went through the same exercise
> > > (exorcism?) of debugging the various additions and removals of the VMA
> > > only to find the issue in the tree itself. The fix also modified the
> > > test code to detect the issue - which was actually hit but not detected
> > > in the existing test cases from a live capture of VMA activities. It is
> > > difficult to spot in the tree dump as well. I am sure I sent this to
> > > Andrew as it is included in v11 and did not show up in his diff, but I
> > > cannot find it on lore, perhaps I forgot to CC you? I've attached it
> > > here for you in case you missed it.
> >
> > Thanks! No, I never received that patch, nor can I see it on lore
> > or marc.info; but I (still) haven't looked at v11, and don't know
> > about Andrew's diff. Anyway, sounds exciting, I'm eager to stop
> > writing this mail and get to testing with that in - but please
> > let me know whether it's the mas_dead_leaves() or the __vma_adjust()
> > mods you attached previously, which you want me to leave out.

Sorry, I wandered off after the previous email. It was the
mas_dead_leaves() patch I was asking you to drop. I sent it out before
ensuring it was safe on all architectures, and one of them (arm64 or
s390, probably) wasn't happy with that change.

> >
> > >
> > > You are actually hitting the issue several iterations beyond when it
> > > first occurred. It was introduced earlier in the tree and exposed with
> > > your other operations by means of a node split or merge.
> > >
> > > > 7. If you get to do cleanups later, please shrink those double blank
> > > > lines to single blank lines. And find a better name for the strange
> > > > vma_mas_szero() - vma_mas_erase(), or is erase something different?
> > > > I'm not even sure that it's needed: that's a special case for exec's
> > > > shift_arg_pages() VM_STACK_INCOMPLETE_SETUP, which uses __vma_adjust()
> > > > in a non-standard way. And trace_vma_mas_szero() in vma_mas_remove()
> > > > looks wrong.
> > >
> > > Okay, I'll be sure to only have one blank line. Where do you see this?
> > > I would have thought that checkpatch would complain? I did try to,
> >
> > No, I'm not going to search for those double blank lines now:
> > please do your own diff and look through for them. And I don't know
> > whether checkpatch objects to them or not, but they're bad for patch
> > application, since they increase the likelihood that a patch applies
> > in the wrong place.
> >
> > As to whether this is or is not a good time for such cleanups,
> > I just don't know: I see Andrew on the one hand intending to drop
> > maple tree for the moment, but Linus on the other hand promising
> > an extra week for 5.19. I'll just let others decide what they want.
> >
> > Hugh
> >
> > > regretfully, address more checkpatch issues on v11. It added more noise
> > > to the differences of v10 + patches to v11 than anything else.
> > >
> > >
> > > Thanks again,
> > > Liam

2022-07-18 17:55:12

by Hugh Dickins

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

On Mon, 18 Jul 2022, Liam Howlett wrote:
> * Liam R. Howlett <[email protected]> [220718 08:56]:
> > * Hugh Dickins <[email protected]> [220718 00:28]:
> > > On Mon, 18 Jul 2022, Liam Howlett wrote:
> > > > * Hugh Dickins <[email protected]> [220717 16:58]:
> > > > > On Fri, 15 Jul 2022, Liam Howlett wrote:
> > > > > >
> > > > > > Please find attached the last outstanding fix for this series. It
> > > > > > covers a rare failure scenario that one of the build bots reported.
> > > > > > Ironically, it changes __vma_adjust() more than the rest of the series,
> > > > > > but leaves the locking in the existing order.
> > > > >
> > > > > > Thanks, I folded that into my testing on next-20220715, along with
> > > > > > the other one you posted on Friday (the mas_dead_leaves() walk fix);
> > > >
> > > > Please drop that patch, it needs more testing.
> > >
> > > Drop the mas_dead_leaves() walk fix, or the __vma_adjust() changes
> > > which you attached in that mail to me? please let me know a.s.a.p,
> > > since I cannot proceed without knowing which.
> >
> > Drop the mas_dead_leaves() walk fix please. I responded to the patch
> > that it's not tested enough. I'll respond to the rest of this email
> > soon.

Your mail situation gets even stranger. You sent the mas_dead_leaves()
walk fix on 12th July, followed quickly by that response that it's not
tested enough, so I ignored it completely then. But you sent the same
fix (I've only compared visually, it looks the same) again on 15th July,
so then I took it in.

I wonder whether Oracle's mail server decided to send out a repeat of
that patch in place of the missing all-important stale data copy one!

>
> So the mas_dead_leaves() patch will most likely produce memory leaks on
> big-endian systems.

I'll take out the mas_dead_leaves() walk patch, but since I was only
testing on x86, it won't have mattered.

...

> I expect it is because your search skipped the badness inserted by the
> bug from #6. I would think the workload that this was crashing on would
> split the nodes in such a way that the bad VMAs ended up ahead of the
> correct data?

Maybe; I can't afford to speculate further on it.

...

> So the only time I've ever seen __vma_adjust() fail is with a fault
> injector failing mas_preallocate() allocations. If it's safe not to
> unwind, I'm happy to drop both unwinds, but I was concerned about the
> path of a vma_merge() calling __vma_adjust() and failing out on
> allocations, then OOM recovering, leaving a half-merged VMA with
> anon_vma incorrectly set, which is an even more unlikely scenario.

It's not half-merged, it is correctly set up (just like if a write fault
had occurred somewhere in that extent before the merge), so no need to
unwind.

...

> Right, the __split_vma() never adjusts anything but one side of the
> 'vma' VMA by inserting the 'insert' VMA. This will result in two writes
> to the tree - but one will exactly fit in an existing range which will
> be placed without an allocation via the mas_wr_slot_store() function in
> the maple tree. Exact fits are nice - they are fast.

I'll have to come back and think about this again later on: "Exact fits
are nice" may answer my concern in the end, but I still have the worry
that the first store destroys the prealloc, when it might be the second
store which needs the prealloc.

...

> > > > Do you have the patch
> > > > "maple_tree-Fix-stale-data-copy-in-mas_wr_node_store.patch"? It sounds
> > > > like your issue fits this fix exactly. I was seeing the same issue with
> > > > gcc 9.3.1 20200408 and this bug doesn't happen for me now. The logs
> > > > you sent also fit the situation. I went through the same exercise
> > > > (exorcism?) of debugging the various additions and removals of the VMA
> > > > only to find the issue in the tree itself. The fix also modified the
> > > > test code to detect the issue - which was actually hit but not detected
> > > > in the existing test cases from a live capture of VMA activities. It is
> > > > difficult to spot in the tree dump as well. I am sure I sent this to
> > > > Andrew as it is included in v11 and did not show up in his diff, but I
> > > > cannot find it on lore, perhaps I forgot to CC you? I've attached it
> > > > here for you in case you missed it.
> > >
> > > Thanks! No, I never received that patch, nor can I see it on lore
> > > or marc.info; but I (still) haven't looked at v11, and don't know
> > > about Andrew's diff. Anyway, sounds exciting, I'm eager to stop
> > > writing this mail and get to testing with that in - but please
> > > let me know whether it's the mas_dead_leaves() or the __vma_adjust()
> > > mods you attached previously, which you want me to leave out.

The overnight test run ended in an unexpected way, but I believe we can
count it as a success - a big success for your stale data copy fix.

(If only that fix had got through the mail system on Friday,
my report on Sunday would have been much more optimistic.)

I said before that I expected the test run to hit the swapops.h
migration entry !PageLocked BUG, but it did not. It ran for
nearly 7 hours, and then one of its builds terminated with

{standard input}: Assembler messages:
{standard input}: Error: open CFI at the end of file;
missing .cfi_endproc directive
gcc: fatal error: Killed signal terminated program cc1
compilation terminated.

which I've never seen before. Usually I'd put something like that down
to a error in swap, or a TLB flushing error (but I include Nadav's fix
in my kernel, and wouldn't get very far without it): neither related to
the maple tree patchset.

But on this occasion, my guess is that it's actually an example of what
the swapops.h migration entry !PageLocked BUG is trying to alert us to.

Imagine when such a "stale" migration entry is found, but the page it
points to (now reused for something else) just happens to be PageLocked
at that instant. Then the BUG won't fire, and we proceed to use the
page as if it's ours, but it's not. I think that's what happened.

I must get on with the day: more testing, and thinking.

Hugh

2022-07-18 18:09:47

by Liam R. Howlett

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues


...

>
> > So the only time I've ever seen __vma_adjust() fail is with a fault
> > injector failing mas_preallocate() allocations. If it's safe not to
> > unwind, I'm happy to drop both unwinds, but I was concerned about the
> > path of a vma_merge() calling __vma_adjust() and failing out on
> > allocations, then OOM recovering, leaving a half-merged VMA with
> > anon_vma incorrectly set, which is an even more unlikely scenario.
>
> It's not half-merged, it is correctly set up (just like if a write fault
> had occurred somewhere in that extent before the merge), so no need to
> unwind.
>

I'll drop the incorrect unwinding then.

> ...
>
> > Right, the __split_vma() never adjusts anything but one side of the
> > 'vma' VMA by inserting the 'insert' VMA. This will result in two writes
> > to the tree - but one will exactly fit in an existing range which will
> > be placed without an allocation via the mas_wr_slot_store() function in
> > the maple tree. Exact fits are nice - they are fast.
>
> I'll have to come back and think about this again later on: "Exact fits
> are nice" may answer my concern in the end, but I still have the worry
> that the first store destroys the prealloc, when it might be the second
> store which needs the prealloc.
>
> ...
>
> > > > > Do you have the patch
> > > > > "maple_tree-Fix-stale-data-copy-in-mas_wr_node_store.patch"? It sounds
> > > > > like your issue fits this fix exactly. I was seeing the same issue with
> > > > > gcc 9.3.1 20200408 and this bug doesn't happen for me now. The logs
> > > > > you sent also fit the situation. I went through the same exercise
> > > > > (exorcism?) of debugging the various additions and removals of the VMA
> > > > > only to find the issue in the tree itself. The fix also modified the
> > > > > test code to detect the issue - which was actually hit but not detected
> > > > > in the existing test cases from a live capture of VMA activities. It is
> > > > > difficult to spot in the tree dump as well. I am sure I sent this to
> > > > > Andrew as it is included in v11 and did not show up in his diff, but I
> > > > > cannot find it on lore, perhaps I forgot to CC you? I've attached it
> > > > > here for you in case you missed it.
> > > >
> > > > Thanks! No, I never received that patch, nor can I see it on lore
> > > > or marc.info; but I (still) haven't looked at v11, and don't know
> > > > about Andrew's diff. Anyway, sounds exciting, I'm eager to stop
> > > > writing this mail and get to testing with that in - but please
> > > > let me know whether it's the mas_dead_leaves() or the __vma_adjust()
> > > > mods you attached previously, which you want me to leave out.
>
> The overnight test run ended in an unexpected way, but I believe we can
> count it as a success - a big success for your stale data copy fix.
>
> (If only that fix had got through the mail system on Friday,
> my report on Sunday would have been much more optimistic.)
>
> I said before that I expected the test run to hit the swapops.h
> migration entry !PageLocked BUG, but it did not. It ran for
> nearly 7 hours, and then one of its builds terminated with
>
> {standard input}: Assembler messages:
> {standard input}: Error: open CFI at the end of file;
> missing .cfi_endproc directive
> gcc: fatal error: Killed signal terminated program cc1
> compilation terminated.
>
> which I've never seen before. Usually I'd put something like that down
> to an error in swap, or a TLB flushing error (but I include Nadav's fix
> in my kernel, and wouldn't get very far without it): neither related to
> the maple tree patchset.
>
> But on this occasion, my guess is that it's actually an example of what
> the swapops.h migration entry !PageLocked BUG is trying to alert us to.
>
> Imagine when such a "stale" migration entry is found, but the page it
> points to (now reused for something else) just happens to be PageLocked
> at that instant. Then the BUG won't fire, and we proceed to use the
> page as if it's ours, but it's not. I think that's what happened.
>
> I must get on with the day: more testing, and thinking.


I think this is the same issue seen here:
https://lore.kernel.org/linux-mm/[email protected]/

Note that on 20220616, the maple tree was in linux-next.

I suspect I am doing something wrong in do_brk_munmap(). I am using a
false VMA to munmap a partial vma by setting it up like the part of the
VMA that would have been split, inserted into the tree, then removed and
freed. I must be missing something necessary for this to function
correctly.

Thanks,
Liam



2022-07-18 22:05:35

by Hugh Dickins

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

On Mon, 18 Jul 2022, Liam Howlett wrote:
> >
> > I said before that I expected the test run to hit the swapops.h
> > migration entry !PageLocked BUG, but it did not. It ran for
> > nearly 7 hours, and then one of its builds terminated with
> >
> > {standard input}: Assembler messages:
> > {standard input}: Error: open CFI at the end of file;
> > missing .cfi_endproc directive
> > gcc: fatal error: Killed signal terminated program cc1
> > compilation terminated.
> >
> > which I've never seen before. Usually I'd put something like that down
> > to an error in swap, or a TLB flushing error (but I include Nadav's fix
> > in my kernel, and wouldn't get very far without it): neither related to
> > the maple tree patchset.
> >
> > But on this occasion, my guess is that it's actually an example of what
> > the swapops.h migration entry !PageLocked BUG is trying to alert us to.
> >
> > Imagine when such a "stale" migration entry is found, but the page it
> > points to (now reused for something else) just happens to be PageLocked
> > at that instant. Then the BUG won't fire, and we proceed to use the
> > page as if it's ours, but it's not. I think that's what happened.
> >
> > I must get on with the day: more testing, and thinking.
>
>
> I think this is the same issue seen here:
> https://lore.kernel.org/linux-mm/[email protected]/

Yes, that's a swapops.h migration entry !PageLocked BUG on brk.

>
> Note that on 20220616, the maple tree was in linux-next.
>
> I suspect I am doing something wrong in do_brk_munmap(). I am using a
> false VMA to munmap a partial vma by setting it up like the part of the
> VMA that would have been split, inserted into the tree, then removed and
> freed. I must be missing something necessary for this to function
> correctly.

Thanks for pointing to that, yes, the vma_init(&unmap, mm) etc in
do_brk_munmap(): I hadn't noticed struct vm_area_struct unmap before.

And almost coincident with your mail, my next test run crashed on
kernel BUG at mm/huge_memory.c:2013, VM_BUG_ON_VMA(vma->vm_start > haddr),
in __split_huge_pmd_locked(), while servicing do_brk_munmap():
no doubt for a (different but) related reason.

Presumably you noticed an opportunity to optimize out some maple tree
operations by giving do_brk_munmap() its own processing. But I think
you're going to have to undo all that, and make do_brk_munmap() do
things in the standard, less direct, munmap() way - using
do_mas_align_munmap() I presume.

(Oh dear, I see that doing mas_preallocate() at the beginning,
but then __split_vma()s inside, and __split_vma() will do a
vma_adjust(), which will do a mas_preallocate(). Ah, but they
are on distinct "mas"es at different levels, so it's probably okay.)

If rmap is to be sure to find migration entries that it inserted
earlier, you must hold anon_vma_lock_write() (in the brk case) across
the tricky vma manipulations. No doubt you could write an optimized
do_brk_munmap() which does everything under anon_vma_lock_write(), but
you'd then be duplicating far too much of the care in __vma_adjust()
for my taste (I'm already not so happy to find it duplicated in
vma_expand()).

I'll continue with some testing, but expect it to keep hitting these
issues until do_brk_munmap() is rewritten - perhaps it reduces to
little more than a wrapper like do_munmap() or do_mas_munmap().

Hugh

2022-07-19 01:56:27

by Liam R. Howlett

Subject: Re: [PATCH] maple_tree: Fix sparse reported issues

* Hugh Dickins <[email protected]> [220718 17:34]:
> On Mon, 18 Jul 2022, Liam Howlett wrote:
> > >
> > > I said before that I expected the test run to hit the swapops.h
> > > migration entry !PageLocked BUG, but it did not. It ran for
> > > nearly 7 hours, and then one of its builds terminated with
> > >
> > > {standard input}: Assembler messages:
> > > {standard input}: Error: open CFI at the end of file;
> > > missing .cfi_endproc directive
> > > gcc: fatal error: Killed signal terminated program cc1
> > > compilation terminated.
> > >
> > > which I've never seen before. Usually I'd put something like that down
> > > to an error in swap, or a TLB flushing error (but I include Nadav's fix
> > > in my kernel, and wouldn't get very far without it): neither related to
> > > the maple tree patchset.
> > >
> > > But on this occasion, my guess is that it's actually an example of what
> > > the swapops.h migration entry !PageLocked BUG is trying to alert us to.
> > >
> > > Imagine when such a "stale" migration entry is found, but the page it
> > > points to (now reused for something else) just happens to be PageLocked
> > > at that instant. Then the BUG won't fire, and we proceed to use the
> > > page as if it's ours, but it's not. I think that's what happened.
> > >
> > > I must get on with the day: more testing, and thinking.
> >
> >
> > I think this is the same issue seen here:
> > https://lore.kernel.org/linux-mm/[email protected]/
>
> Yes, that's a swapops.h migration entry !PageLocked BUG on brk.
>
> >
> > Note that on 20220616, the maple tree was in linux-next.
> >
> > I suspect I am doing something wrong in do_brk_munmap(). I am using a
> > false VMA to munmap a partial vma by setting it up like the part of the
> > VMA that would have been split, inserted into the tree, then removed and
> > freed. I must be missing something necessary for this to function
> > correctly.
>
> Thanks for pointing to that, yes, the vma_init(&unmap, mm) etc in
> do_brk_munmap(): I hadn't noticed struct vm_area_struct unmap before.
>
> And almost coincident with your mail, my next test run crashed on
> kernel BUG at mm/huge_memory.c:2013, VM_BUG_ON_VMA(vma->vm_start > haddr),
> in __split_huge_pmd_locked(), while servicing do_brk_munmap():
> no doubt for a (different but) related reason.
>
> Presumably you noticed an opportunity to optimize out some maple tree
> operations by giving do_brk_munmap() its own processing. But I think
> you're going to have to undo all that, and make do_brk_munmap() do
> things in the standard, less direct, munmap() way - using
> do_mas_align_munmap() I presume.
>
> (Oh dear, I see that doing mas_preallocate() at the beginning,
> but then __split_vma()s inside, and __split_vma() will do a
> vma_adjust(), which will do a mas_preallocate(). Ah, but they
> are on distinct "mas"es at different levels, so it's probably okay.)

Yes, this is not as efficient as I'd like, and I'd like to find out which
allocations are actually needed, but right now this is the safest
approach, and it turns out allocating is super fast, especially when you
get 16 nodes per page.  I think any function that already allocates
*should* be safe to use a standard mas_store_gfp() and let the tree
allocate what is necessary.  I had to add the preallocations due to
fs_reclaim lockdep warnings.

Think about the insertion/removal of a vma into/from the linked list +
the gap calculation + the allocation of a new vma + rebalancing the
rbtree that we do today, and how that plays out in the CPU cache.

>
> If rmap is to be sure to find migration entries that it inserted
> earlier, you must hold anon_vma_lock_write() (in the brk case) across
> the tricky vma manipulations. No doubt you could write an optimized
> do_brk_munmap() which does everything under anon_vma_lock_write(), but
> you'd then be duplicating far too much of the care in __vma_adjust()
> for my taste (I'm already not so happy to find it duplicated in
> vma_expand()).

We already have a simplified version of do_mmap() in do_brk_flags(),
according to the comment.  I thought a simplified version of the reverse
would be acceptable as well.


>
> I'll continue with some testing, but expect it to keep hitting these
> issues until do_brk_munmap() is rewritten - perhaps it reduces to
> little more than a wrapper like do_munmap() or do_mas_munmap().

I appreciate the effort on this.

Thanks,
Liam