2022-07-06 02:33:33

by Liam R. Howlett

[permalink] [raw]
Subject: [PATCH] maple_tree: Fix use of node for global range in mas_wr_spanning_store()

When writing a range with a NULL which expands to 0 - ULONG_MAX, don't
use a node to store this value. Instead, call mas_new_root() which will
set the tree pointer to NULL and free all the nodes.

Fix a comment for the allocations in mas_wr_spanning_store().

Add mas_node_count_gfp() and use it to clean up mas_preallocate().

Clean up mas_preallocate() and ensure the ma_state is safe on return.

Update maple_tree.h to set alloc = NULL.

Fixes: d0aac5e48048 (Maple Tree: add new data structure)
Signed-off-by: Liam R. Howlett <[email protected]>
---
include/linux/maple_tree.h | 1 +
lib/maple_tree.c | 34 +++++++++++++++++++++++++++-------
2 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index 3c6642358904..2c9dede989c7 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -434,6 +434,7 @@ struct ma_wr_state {
.node = MAS_START, \
.min = 0, \
.max = ULONG_MAX, \
+ .alloc = NULL, \
}

#define MA_WR_STATE(name, ma_state, wr_entry) \
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 2cccd8f2153f..79e07c7dc323 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1293,17 +1293,31 @@ static inline void mas_free(struct ma_state *mas, struct maple_enode *used)
* there is not enough nodes.
* @mas: The maple state
* @count: The number of nodes needed
+ * @gfp: the gfp flags
*/
-static void mas_node_count(struct ma_state *mas, int count)
+static void mas_node_count_gfp(struct ma_state *mas, int count, gfp_t gfp)
{
unsigned long allocated = mas_allocated(mas);

if (allocated < count) {
mas_set_alloc_req(mas, count - allocated);
- mas_alloc_nodes(mas, GFP_NOWAIT | __GFP_NOWARN);
+ mas_alloc_nodes(mas, gfp);
}
}

+/*
+ * mas_node_count() - Check if enough nodes are allocated and request more if
+ * there is not enough nodes.
+ * @mas: The maple state
+ * @count: The number of nodes needed
+ *
+ * Note: Uses GFP_NOWAIT | __GFP_NOWARN for gfp flags.
+ */
+static void mas_node_count(struct ma_state *mas, int count)
+{
+ return mas_node_count_gfp(mas, count, GFP_NOWAIT | __GFP_NOWARN);
+}
+
/*
* mas_start() - Sets up maple state for operations.
* @mas: The maple state.
@@ -3962,7 +3976,7 @@ static inline int mas_wr_spanning_store(struct ma_wr_state *wr_mas)
if (unlikely(!mas->index && mas->last == ULONG_MAX))
return mas_new_root(mas, wr_mas->entry);
/*
- * Node rebalancing may occur due to this store, so there may be two new
+ * Node rebalancing may occur due to this store, so there may be three new
* entries per level plus a new root.
*/
height = mas_mt_height(mas);
@@ -3995,6 +4009,12 @@ static inline int mas_wr_spanning_store(struct ma_wr_state *wr_mas)
mas->last = l_mas.last = r_mas.last;
}

+ /* expanding NULLs may make this cover the entire range */
+ if (!l_mas.index && r_mas.last == ULONG_MAX) {
+ mas_set_range(mas, 0, ULONG_MAX);
+ return mas_new_root(mas, wr_mas->entry);
+ }
+
memset(&b_node, 0, sizeof(struct maple_big_node));
/* Copy l_mas and store the value in b_node. */
mas_store_b_node(&l_wr_mas, &b_node, l_wr_mas.node_end);
@@ -5657,15 +5677,15 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
{
int ret;

- mas_set_alloc_req(mas, 1 + mas_mt_height(mas) * 3);
- mas_alloc_nodes(mas, gfp);
+ mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
if (likely(!mas_is_err(mas)))
return 0;

mas_set_alloc_req(mas, 0);
- mas_destroy(mas);
ret = xa_err(mas->node);
- mas->node = MAS_START;
+ mas_reset(mas);
+ mas_destroy(mas);
+ mas_reset(mas);
return ret;
}

--
2.35.1


2022-07-06 03:01:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] maple_tree: Fix use of node for global range in mas_wr_spanning_store()

On Wed, 6 Jul 2022 02:05:37 +0000 Liam Howlett <[email protected]> wrote:

> When writing a range with a NULL which expands to 0 - ULONG_MAX, don't
> use a node to store this value. Instead, call mas_new_root() which will
> set the tree pointer to NULL and free all the nodes.
>
> Fix a comment for the allocations in mas_wr_spanning_store().
>
> Add mas_node_count_gfp() and use it to clean up mas_preallocate().
>
> Clean up mas_preallocate() and ensure the ma_state is safe on return.
>
> Update maple_tree.h to set alloc = NULL.

Cool.

How are we looking now? Any known issues still being worked on?

2022-07-06 14:56:14

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH] maple_tree: Fix use of node for global range in mas_wr_spanning_store()

* Andrew Morton <[email protected]> [220705 22:55]:
> On Wed, 6 Jul 2022 02:05:37 +0000 Liam Howlett <[email protected]> wrote:
>
> > When writing a range with a NULL which expands to 0 - ULONG_MAX, don't
> > use a node to store this value. Instead, call mas_new_root() which will
> > set the tree pointer to NULL and free all the nodes.
> >
> > Fix a comment for the allocations in mas_wr_spanning_store().
> >
> > Add mas_node_count_gfp() and use it to clean up mas_preallocate().
> >
> > Clean up mas_preallocate() and ensure the ma_state is safe on return.
> >
> > Update maple_tree.h to set alloc = NULL.
>
> Cool.
>
> How are we looking now? Any known issues still being worked on?

Did you pick up "Subject: [PATCH] mm/mmap: Fix copy_vma() new_vma
check"? I sent that yesterday as well.

I think we are in good shape. There were two outstanding issues I had
and this patch plus the copy_vma() patch fixes both.

2022-07-08 04:45:25

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] maple_tree: Fix use of node for global range in mas_wr_spanning_store()

On Wed, 6 Jul 2022, Liam Howlett wrote:
> * Andrew Morton <[email protected]> [220705 22:55]:
> > On Wed, 6 Jul 2022 02:05:37 +0000 Liam Howlett <[email protected]> wrote:
> >
> > > When writing a range with a NULL which expands to 0 - ULONG_MAX, don't
> > > use a node to store this value. Instead, call mas_new_root() which will
> > > set the tree pointer to NULL and free all the nodes.
> > >
> > > Fix a comment for the allocations in mas_wr_spanning_store().
> > >
> > > Add mas_node_count_gfp() and use it to clean up mas_preallocate().
> > >
> > > Clean up mas_preallocate() and ensure the ma_state is safe on return.
> > >
> > > Update maple_tree.h to set alloc = NULL.
> >
> > Cool.
> >
> > How are we looking now? Any known issues still being worked on?
>
> Did you pick up "Subject: [PATCH] mm/mmap: Fix copy_vma() new_vma
> check"? I sent that yesterday as well.
>
> I think we are in good shape. There were two outstanding issues I had
> and this patch plus the copy_vma() patch fixes both.

I'm not so sure.

I haven't gone back to check, but I do think there was a very recent
lib/maple_tree.c change or set of changes which greatly improved it
(say an hour to reach a problem instead of a minute).

But errors I do see still (I've only had time for it this month, and
there have been three other non-maple-tree bugs in linux-next which
got in the way of my huge tmpfs kernel build swapping load testing).

I see one kind of problem on one machine, and another on another,
and have no idea why the difference, so cannot advise you on how
to reproduce either. I do have CONFIG_DEBUG_VM_MAPLE_TREE=y and
CONFIG_DEBUG_MAPLE_TREE=y on both; but might try removing those,
since they tell me nothing without a long education, and more
important for me to write this mail now than get into capturing
those numbers for you.

One machine does show such output, with BUG at mas_validate_gaps,
doing __vma_adjust from __split_vma from do_mas_align_munmap.
That's fairly typical of all reports from that machine, I think.

Other machine (laptop I'm writing from) has never shown any such MT
debug output, but hits the kernel BUG at include/linux/swapops.h:378!
pfn_swap_entry_to_page() BUG_ON(is_migration_entry() && !PageLocked().

I've often called that the best BUG in the kernel: it tells whether
rmap is holding together: migration entries were inserted on one
rmap walk, some may have been zapped by unmapping meanwhile, but
before midnight strikes and the page is unlocked, all the remaining
ones have to be found by the second rmap walk. (And if we removed the
BUG, then it would be mysterious rare segfaults and/or data leaks:
which I cannot call "stable").

Hitting that BUG suggests that the rmap locking is deficient somewhere
(might be something else, but that's a best guess), and I wouldn't be
surprised if that somewhere is __vma_adjust() - which is not quite
the same as it was before (I wonder if the changes were essential for
maple tree, or "simplifications" which seemed a good idea at the time).
If there's a way to minimally maple-ize how it was before, known working,
that would be a useful comparison to make.

Here's a patch I've applied, that makes no difference to the problems
I see, but fixes or highlights a little that worried me in the source.
Earlier, you had that anon_vma locking in vma_expand() under "else",
it's good to see that fixed, but I believe there's a case it misses;
and I don't think you can change the lock ordering just because it looks
nicer (see comments at head of mm/rmap.c for lock ordering, or mm/mremap.c
for an example of what order we take file versus anon rmap locks).

Hopefully maple tree stable is not far off, but seems not quite yet.

Hugh

--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -528,7 +528,8 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,
if (next->anon_vma && !vma->anon_vma) {
int error;

- vma->anon_vma = next->anon_vma;
+ anon_vma = next->anon_vma;
+ vma->anon_vma = anon_vma;
error = anon_vma_clone(vma, next);
if (error)
return error;
@@ -546,16 +547,19 @@ inline int vma_expand(struct ma_state *mas, struct vm_area_struct *vma,

vma_adjust_trans_huge(vma, start, end, 0);

+ if (file) {
+ mapping = file->f_mapping;
+ root = &mapping->i_mmap;
+ uprobe_munmap(vma, vma->vm_start, vma->vm_end);
+ i_mmap_lock_write(mapping);
+ }
+
if (anon_vma) {
anon_vma_lock_write(anon_vma);
anon_vma_interval_tree_pre_update_vma(vma);
}

if (file) {
- mapping = file->f_mapping;
- root = &mapping->i_mmap;
- uprobe_munmap(vma, vma->vm_start, vma->vm_end);
- i_mmap_lock_write(mapping);
flush_dcache_mmap_lock(mapping);
vma_interval_tree_remove(vma, root);
}
@@ -3021,8 +3025,12 @@ static int do_brk_flags(struct ma_state *mas, struct vm_area_struct *vma,
}
vma->vm_end = addr + len;
vma->vm_flags |= VM_SOFTDIRTY;
- if (mas_store_gfp(mas, vma, GFP_KERNEL))
- return -ENOMEM;
+ if (WARN_ON(mas_store_gfp(mas, vma, GFP_KERNEL))) {
+ /*
+ * Rewind not yet worked out: at the very least
+ * we must unlock if locked, so proceed to below.
+ */
+ }

if (vma->anon_vma) {
anon_vma_interval_tree_post_update_vma(vma);

2022-07-09 01:37:06

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH] maple_tree: Fix use of node for global range in mas_wr_spanning_store()

* Hugh Dickins <[email protected]> [220708 00:37]:
> On Wed, 6 Jul 2022, Liam Howlett wrote:
> > * Andrew Morton <[email protected]> [220705 22:55]:
> > > On Wed, 6 Jul 2022 02:05:37 +0000 Liam Howlett <[email protected]> wrote:
> > >
> > > > When writing a range with a NULL which expands to 0 - ULONG_MAX, don't
> > > > use a node to store this value. Instead, call mas_new_root() which will
> > > > set the tree pointer to NULL and free all the nodes.
> > > >
> > > > Fix a comment for the allocations in mas_wr_spanning_store().
> > > >
> > > > Add mas_node_count_gfp() and use it to clean up mas_preallocate().
> > > >
> > > > Clean up mas_preallocate() and ensure the ma_state is safe on return.
> > > >
> > > > Update maple_tree.h to set alloc = NULL.
> > >
> > > Cool.
> > >
> > > How are we looking now? Any known issues still being worked on?
> >
> > Did you pick up "Subject: [PATCH] mm/mmap: Fix copy_vma() new_vma
> > check"? I sent that yesterday as well.
> >
> > I think we are in good shape. There were two outstanding issues I had
> > and this patch plus the copy_vma() patch fixes both.
>

First, thank you again for looking at this patch set. Your analysis of
my changes always end up educating me and making the code better.


> I'm not so sure.

I agree, I have a few more issues now.

>
> I haven't gone back to check, but I do think there was a very recent
> lib/maple_tree.c change or set of changes which greatly improved it
> (say an hour to reach a problem instead of a minute).
>
> But errors I do see still (I've only had time for it this month, and
> there have been three other non-maple-tree bugs in linux-next which
> got in the way of my huge tmpfs kernel build swapping load testing).
>
> I see one kind of problem on one machine, and another on another,
> and have no idea why the difference, so cannot advise you on how
> to reproduce either. I do have CONFIG_DEBUG_VM_MAPLE_TREE=y and
> CONFIG_DEBUG_MAPLE_TREE=y on both; but might try removing those,
> since they tell me nothing without a long education, and more
> important for me to write this mail now than get into capturing
> those numbers for you.
>
> One machine does show such output, with BUG at mas_validate_gaps,
> doing __vma_adjust from __split_vma from do_mas_align_munmap.
> That's fairly typical of all reports from that machine, I think.

I woke up to a log of this, or something similar on my server - I've not
seen it on my laptop either. There have been some connection issues
here so I've not been able to get the details I need to reproduce or
narrow this down.

>
> Other machine (laptop I'm writing from) has never shown any such MT
> debug output, but hits the kernel BUG at include/linux/swapops.h:378!
> pfn_swap_entry_to_page() BUG_ON(is_migration_entry() && !PageLocked().
>
> I've often called that the best BUG in the kernel: it tells whether
> rmap is holding together: migration entries were inserted on one
> rmap walk, some may have been zapped by unmapping meanwhile, but
> before midnight strikes and the page is unlocked, all the remaining
> ones have to be found by the second rmap walk. (And if we removed the
> BUG, then it would be mysterious rare segfaults and/or data leaks:
> which I cannot call "stable").
>
> Hitting that BUG suggests that the rmap locking is deficient somewhere
> (might be something else, but that's a best guess), and I wouldn't be
> surprised if that somewhere is __vma_adjust() - which is not quite
> the same as it was before (I wonder if the changes were essential for
> maple tree, or "simplifications" which seemed a good idea at the time).
> If there's a way to minimally maple-ize how it was before, known working,
> that would be a useful comparison to make.

I have not seen this bug. I have re-evaluated my changes to
__vma_adjust() and do not see any lock moving as you discovered in
vma_expand(). The changes to __vma_adjust() were to remove the linked
list or inlined functions that were reduced to a line or two. That, and
preallocating to avoid the fs-reclaim lockdep issue.

>
> Here's a patch I've applied, that makes no difference to the problems
> I see, but fixes or highlights a little that worried me in the source.
> Earlier, you had that anon_vma locking in vma_expand() under "else",
> it's good to see that fixed, but I believe there's a case it misses;
> and I don't think you can change the lock ordering just because it looks
> nicer (see comments at head of mm/rmap.c for lock ordering, or mm/mremap.c
> for an example of what order we take file versus anon rmap locks).
>

I will fix the issue you pointed out in your code. I'll also have a look
at any other areas I may have messed up the locking - as soon as read
that block at the start mm/rmap.c a few more dozen times.


> Hopefully maple tree stable is not far off, but seems not quite yet.

Al Viro also told me he saw a significant regression to xfs test 250, a
slow down of a factor of 2. I started to look into this as well. I
have not seen this slow down in my testing, but that must be flushed out
as well. Perhaps my access to larger machines will help in the
reproduction of this issue as well.

So there are still a few issues to contend with before this is 'stable'.
Any analysis of the patches you are able to do would probably expedite
the stability path.

Thanks again,
Liam