Hi reviewers and author [email protected],
Gentle ping.
We hit a BUG_ON() in mas_store_prealloc() in a kernel-6.1 stress-testing
environment.
According to the coredump, the BUG_ON() is triggered because mas->node
holds the error value -ENOMEM (0xffffffffffffffd2).
There are several mas_node_count() calls in mas_wr_store_entry(), and it
seems that mas->node may be set to an error node with -ENOMEM if there is
not enough memory for maple tree operations.
We think that returning -ENOMEM instead of directly triggering BUG_ON()
when memory is unavailable would be more suitable, because in reality the
tree operation shouldn't be performed in this situation.
The backtrace follows:
mas_store_prealloc+0x23c/0x484
vma_mas_store+0xe4/0x2d0
__vma_adjust+0xab0/0x1470
vma_merge+0x5b8/0x5d4
mprotect_fixup+0x1f4/0x478
__arm64_sys_mprotect+0x6b0/0x8f0
invoke_syscall+0x84/0x264
el0_svc_common+0x118/0x1f0
do_el0_svc+0x5c/0x184
el0_svc+0x38/0x98
Any suggestion is appreciated.
Thank you.
BRs,
John Hsu
* John Hsu (許永翰) <[email protected]> [230609 04:37]:
> Hi reviewers and author [email protected],
> Gentle ping.
>
Hello!
Thanks for reporting this issue.
> We hit a BUG_ON() in mas_store_prealloc() in a kernel-6.1 stress-testing
> environment.
Version 6.1 or 6.1.x? Which exact version (git id or version number)?
> According to the coredump, the BUG_ON() is triggered because mas->node
> holds the error value -ENOMEM (0xffffffffffffffd2).
> There are several mas_node_count() calls in mas_wr_store_entry(), and it
> seems that mas->node may be set to an error node with -ENOMEM if there is
> not enough memory for maple tree operations.
> We think that returning -ENOMEM instead of directly triggering BUG_ON()
> when memory is unavailable would be more suitable,
This BUG_ON() is necessary since this function should _never_ run out of
memory; this function does not return an error code. mas_preallocate()
should have gotten you the memory necessary (or returned an -ENOMEM)
prior to the call to mas_store_prealloc(), so this is probably an
internal tree problem.
> because in reality the tree
> operation shouldn't be performed in this situation.
There is a tree operation being performed here. mprotect is merging a
vma by the looks of the call stack. Why do you think no tree operation
is necessary?
>
> The backtrace follows:
> mas_store_prealloc+0x23c/0x484
> vma_mas_store+0xe4/0x2d0
> __vma_adjust+0xab0/0x1470
> vma_merge+0x5b8/0x5d4
> mprotect_fixup+0x1f4/0x478
> __arm64_sys_mprotect+0x6b0/0x8f0
> invoke_syscall+0x84/0x264
> el0_svc_common+0x118/0x1f0
> do_el0_svc+0x5c/0x184
> el0_svc+0x38/0x98
I see this is arm64. Do you have a reproducer? If you don't have a
reproducer, I can try stress-ng on arm64 to simulate your workload using
mprotect, but I need to know the exact kernel version as this issue may
have been fixed in a later stable release.
Thanks,
Liam
* John Hsu (許永翰) <[email protected]> [230614 03:06]:
> Hi Liam, thanks for your reply.
Sorry, your email response with top posting is hard to follow so I will
do my best to answer your questions.
>
>
>
> Version 6.1 or 6.1.x? Which exact version (git id or version number)?
>
> Our environment is kernel-6.1.25-mainline-android14-5-gdea04bf2c398d.
Okay, I can have a look at 6.1.25 then.
>
>
> This BUG_ON() is necessary since this function should _never_ run out of
> memory; this function does not return an error code. mas_preallocate()
> should have gotten you the memory necessary (or returned an -ENOMEM)
> prior to the call to mas_store_prealloc(), so this is probably an
> internal tree problem.
>
> There is a tree operation being performed here. mprotect is merging a
> vma by the looks of the call stack. Why do you think no tree operation
> is necessary?
>
> As you mentioned, mas_preallocate() should allocate enough nodes, but there is still a call to mas_node_count() in mas_store_prealloc().
> mas_node_count() checks whether the *mas* has enough nodes, and allocates memory for nodes if it does not.
Right, we call mas_node_count() so that both code paths are used for
preallocations and regular mas_store()/mas_store_gfp(). It shouldn't
take a significant amount of time to verify there are enough nodes.
> If mas_preallocate() allocates enough nodes, why does mas_node_count() check the node count and allocate more nodes when the mas does not have enough?
We check for the above reason.
>
> We have seen that there may be some maple_tree operations in merge_vma...
If merge_vma() does anything, then there was an operation to the maple
tree.
>
> Moreover, does the maple tree provide an API for assigning the user's gfp flags for allocating nodes?
mas_preallocate() and mas_store_gfp() have gfp flags as an argument. In
your call stack, it will be called in __vma_adjust() as such:
if (mas_preallocate(&mas, vma, GFP_KERNEL))
return -ENOMEM;
line 715 in v6.1.25
> With the rb_tree, we allocated vm_area_struct (the rb_node is embedded in this struct) with GFP_KERNEL, while the maple tree allocates nodes with GFP_NOWAIT and __GFP_NOWARN.
We use GFP_KERNEL as I explained above for the VMA tree.
It also will drop the lock and retry with GFP_KERNEL on failure
when not using the external lock. The mmap_lock is configured as an
external lock.
> Allocation will not wait for reclaiming and compacting when there is not enough available memory.
> Is there any concern for this design?
This has been addressed above, but let me know if I missed anything
here.
>
>
> I see this is arm64. Do you have a reproducer? If you don't have a
> reproducer, I can try stress-ng on arm64 to simulate your workload using
> mprotect, but I need to know the exact kernel version as this issue may
> have been fixed in a later stable release.
>
> It often occurs under low-memory conditions. Maybe you can try stress-ng on arm64 under high memory pressure (e.g. reserve lots of memory).
Okay, I will try arm64 with v6.1.25.
...
Thanks,
Liam
On Wed, 2023-06-14 at 11:58 -0400, Liam R. Howlett wrote:
>
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
> * John Hsu (許永翰) <[email protected]> [230614 03:06]:
> > Hi Liam, thanks for your reply.
>
> Sorry, your email response with top posting is hard to follow so I
> will
> do my best to answer your questions.
Sorry for the wrong format....
> >
> >
> >
> > Version 6.1 or 6.1.x? Which exact version (git id or version number)?
> >
> > Our environment is kernel-6.1.25-mainline-android14-5-gdea04bf2c398d.
>
> Okay, I can have a look at 6.1.25 then.
OK, thanks.
> > This BUG_ON() is necessary since this function should _never_ run out
> > of memory; this function does not return an error code.
> > mas_preallocate() should have gotten you the memory necessary (or
> > returned an -ENOMEM) prior to the call to mas_store_prealloc(), so this
> > is probably an internal tree problem.
> >
> > There is a tree operation being performed here. mprotect is merging a
> > vma by the looks of the call stack. Why do you think no tree operation
> > is necessary?
> >
> > As you mentioned, mas_preallocate() should allocate enough nodes, but
> > there is still a call to mas_node_count() in mas_store_prealloc().
> > mas_node_count() checks whether the *mas* has enough nodes, and
> > allocates memory for nodes if it does not.
>
> Right, we call mas_node_count() so that both code paths are used for
> preallocations and regular mas_store()/mas_store_gfp(). It shouldn't
> take a significant amount of time to verify there are enough nodes.
Yes..., it doesn't take a significant amount of time to verify whether
there are enough nodes. The question is: why does the flow in
mas_node_count() allocate nodes if the mas does not have enough?
> > If mas_preallocate() allocates enough nodes, why does mas_node_count()
> > check the node count and allocate more nodes when the mas does not have
> > enough?
>
> We check for the above reason.
>
OK..., this is one of the root causes of this BUG.
> >
> > We have seen that there may be some maple_tree operations in
> > merge_vma...
>
> If merge_vma() does anything, then there was an operation to the maple
> tree.
>
> >
> > Moreover, does the maple tree provide an API for assigning the user's
> > gfp flags for allocating nodes?
>
> mas_preallocate() and mas_store_gfp() have gfp flags as an argument. In
> your call stack, it will be called in __vma_adjust() as such:
>
> if (mas_preallocate(&mas, vma, GFP_KERNEL))
> return -ENOMEM;
>
> line 715 in v6.1.25
>
> > With the rb_tree, we allocated vm_area_struct (the rb_node is embedded
> > in this struct) with GFP_KERNEL, while the maple tree allocates nodes
> > with GFP_NOWAIT and __GFP_NOWARN.
>
> We use GFP_KERNEL as I explained above for the VMA tree.
Got it! But mas_node_count() always uses GFP_NOWAIT and __GFP_NOWARN
in the tree-inserting flow. Do you consider the performance of
maintaining the structure of the maple_tree?
> It also will drop the lock and retry with GFP_KERNEL on failure
> when not using the external lock. The mmap_lock is configured as an
> external lock.
>
> > Allocation will not wait for reclaiming and compacting when there is
> > not enough available memory.
> > Is there any concern for this design?
>
> This has been addressed above, but let me know if I missed anything
> here.
>
I think that mas_node_count() has a higher rate of triggering the
BUG_ON() when allocating nodes with GFP_NOWAIT and __GFP_NOWARN. If
mas_node_count() used GFP_KERNEL, as mas_preallocate() does in mmap.c,
the allocation failure rate might be lower than with GFP_NOWAIT.
> >
> >
> > I see this is arm64. Do you have a reproducer? If you don't have a
> > reproducer, I can try stress-ng on arm64 to simulate your workload
> > using mprotect, but I need to know the exact kernel version as this
> > issue may have been fixed in a later stable release.
> >
> > It often occurs under low-memory conditions. Maybe you can try
> > stress-ng on arm64 under high memory pressure (e.g. reserve lots of
> > memory).
>
> Okay, I will try arm64 with v6.1.25.
OK, thanks.
> ...
>
> Thanks,
> Liam
BRs,
John Hsu
Apologies for the late response.
* John Hsu (許永翰) <[email protected]> [230616 05:19]:
> On Wed, 2023-06-14 at 11:58 -0400, Liam R. Howlett wrote:
> >
> > * John Hsu (許永翰) <[email protected]> [230614 03:06]:
> > > Hi Liam, thanks for your reply.
> >
> > Sorry, your email response with top posting is hard to follow so I
> > will
> > do my best to answer your questions.
>
> Sorry for the wrong format....
>
...
> >
> > Right, we call mas_node_count() so that both code paths are used for
> > preallocations and regular mas_store()/mas_store_gfp(). It shouldn't
> > take a significant amount of time to verify there are enough nodes.
>
> Yes..., it doesn't take a significant amount of time to verify whether
> there are enough nodes. The question is: why does the flow in
> mas_node_count() allocate nodes if the mas does not have enough?
What I meant is that both methods use the same call path because there
is no reason to duplicate the path. After mas_preallocate() has
allocated the nodes needed, the call to check whether there are enough
nodes will be quick.
>
> > > If mas_preallocate() allocates enough nodes, why does
> > > mas_node_count() check the node count and allocate more nodes when
> > > the mas does not have enough?
> >
> > We check for the above reason.
>
> OK..., this is one of the root causes of this BUG.
The root cause is that there was not enough memory for a store
operation. Regardless of whether we check the allocations in the
mas_store_prealloc() path or not, this would fail. If we removed the
check for nodes within this path, then we would have to BUG_ON() when we
run out of nodes to use or have a null pointer dereference BUG anyway.
>
...
> Got it! But mas_node_count() always uses GFP_NOWAIT and __GFP_NOWARN
> in the tree-inserting flow. Do you consider the performance of
> maintaining the structure of the maple_tree?
Sorry, I don't understand what you mean by 'consider the performance of
maintaining the structure of maple_tree'.
>
...
>
> I think that mas_node_count() has a higher rate of triggering the
> BUG_ON() when allocating nodes with GFP_NOWAIT and __GFP_NOWARN. If
> mas_node_count() used GFP_KERNEL, as mas_preallocate() does in mmap.c,
> the allocation failure rate might be lower than with GFP_NOWAIT.
Which BUG_ON() are you referring to?
If I were to separate the code paths for mas_store_prealloc() and
mas_store_gfp(), then a BUG_ON() would still need to exist and still
would have been triggered. We are in a place in the code where we
should never sleep and we don't have enough memory allocated to do what
was necessary.
Thanks,
Liam
On Thu, 2023-07-06 at 14:54 -0400, Liam R. Howlett wrote:
>
>
> Apologies for the late response.
>
> * John Hsu (許永翰) <[email protected]> [230616 05:19]:
> > On Wed, 2023-06-14 at 11:58 -0400, Liam R. Howlett wrote:
> > >
> > > * John Hsu (許永翰) <[email protected]> [230614 03:06]:
> > > > Hi Liam, thanks for your reply.
> > >
> > > Sorry, your email response with top posting is hard to follow so I
> > > will do my best to answer your questions.
> >
> > Sorry for the wrong format....
> >
...
> >
> > Yes..., it doesn't take a significant amount of time to verify whether
> > there are enough nodes. The question is: why does the flow in
> > mas_node_count() allocate nodes if the mas does not have enough?
>
> What I meant is that both methods use the same call path because there
> is no reason to duplicate the path. After mas_preallocate() has
> allocated the nodes needed, the call to check whether there are enough
> nodes will be quick.
So is the purpose of mas_preallocate() to decrease the lock hold time?
> >
...
> > OK..., this is one of the root causes of this BUG.
>
> The root cause is that there was not enough memory for a store
> operation. Regardless of if we check the allocations in the
> mas_store_prealloc() path or not, this would fail. If we remove the
> check for nodes within this path, then we would have to BUG_ON() when
> we
> run out of nodes to use or have a null pointer dereference BUG
> anyways.
>
Yes, the root cause is OOM. The BUG_ON() is necessary for situations
where the maple tree structure cannot be maintained because of the lack
of memory. But the buddy system in the Linux kernel can reclaim memory
when the system is in a low-memory state. If we fell back to GFP_KERNEL
after a failed GFP_NOWAIT node allocation, maybe we could get enough
memory on the second try.
> >
...
> > > We use GFP_KERNEL as I explained above for the VMA tree.
> >
> > Got it! But mas_node_count() always uses GFP_NOWAIT and __GFP_NOWARN
> > in the tree-inserting flow. Do you consider the performance of
> > maintaining the structure of the maple_tree?
>
> Sorry, I don't understand what you mean by 'consider the performance of
> maintaining the structure of maple_tree'.
>
As I mentioned above, GFP_NOWAIT does not allow the buddy system to
reclaim memory, so by "Do you consider the performance of maintaining
the structure of maple_tree" I meant: is the mas_node_count() path
disallowed from reclaiming or compacting memory for performance reasons?
> >
...
> > I think that mas_node_count() has a higher rate of triggering the
> > BUG_ON() when allocating nodes with GFP_NOWAIT and __GFP_NOWARN. If
> > mas_node_count() used GFP_KERNEL, as mas_preallocate() does in mmap.c,
> > the allocation failure rate might be lower than with GFP_NOWAIT.
>
> Which BUG_ON() are you referring to?
>
> If I were to separate the code paths for mas_store_prealloc() and
> mas_store_gfp(), then a BUG_ON() would still need to exist and still
> would have been triggered. We are in a place in the code where we
> should never sleep and we don't have enough memory allocated to do what
> was necessary.
>
Yes. There is no reason to separate mas_store_prealloc() and
mas_store_gfp(). Is it possible to retry the node allocation with
GFP_KERNEL (waiting for system reclaim and compaction) instead of
triggering the BUG_ON() once the GFP_NOWAIT allocation fails?
> Thanks,
> Liam
Best Regards,
John Hsu
* John Hsu (許永翰) <[email protected]> [230710 08:50]:
...
>
> So is the purpose of mas_preallocate() to decrease the lock hold time?
It could be, but in this case it's the locking order. We have to
pre-allocate and fail early if we are out of memory, because we _cannot_
use GFP_KERNEL where we call mas_store_prealloc(). mas_preallocate()
will use GFP_KERNEL though. We cannot use GFP_KERNEL during the store
because reclaim may sleep and we cannot sleep holding the locks we need
to hold at the time of the store operation in __vma_adjust().
>
...
> Yes, the root cause is OOM. The BUG_ON() is necessary for situations
> where the maple tree structure cannot be maintained because of the lack
> of memory. But the buddy system in the Linux kernel can reclaim memory
> when the system is in a low-memory state. If we fell back to GFP_KERNEL
> after a failed GFP_NOWAIT node allocation, maybe we could get enough
> memory on the second try.
Right, but GFP_KERNEL cannot be used when holding certain locks
because it may sleep.
...
> As I mentioned above, GFP_NOWAIT does not allow the buddy system to
> reclaim memory, so by "Do you consider the performance of maintaining
> the structure of maple_tree" I meant: is the mas_node_count() path
> disallowed from reclaiming or compacting memory for performance reasons?
Ah, no. This is not for performance. It was initially on the road map
for performance, but it was needed for the complicated locking in the MM
code.
The rb tree embedded its nodes in the VMA, which is allocated outside
this critical section, so it could use GFP_KERNEL.
...
> Yes. There is no reason to separate mas_store_prealloc() and
> mas_store_gfp(). Is it possible to retry the node allocation with
> GFP_KERNEL (waiting for system reclaim and compaction) instead of
> triggering the BUG_ON() once the GFP_NOWAIT allocation fails?
Unfortunately not, no. In some cases it may actually work out that the
VMA may not need the locks in question, but it cannot be generalized for
__vma_adjust(). Where I am able, I use the mas_store_gfp() calls so we
can let reclaim and compaction run, but it's not possible here.
Thanks,
Liam
On Thu, 2023-07-13 at 11:25 +0800, John Hsu wrote:
> On Mon, 2023-07-10 at 10:24 -0400, Liam R. Howlett wrote:
> >
> > * John Hsu (許永翰) <[email protected]> [230710 08:50]:
> > ...
> >
> > It could be, but in this case it's the locking order. We have to
> > pre-allocate and fail early if we are out of memory, because we
> > _cannot_ use GFP_KERNEL where we call mas_store_prealloc().
> > mas_preallocate() will use GFP_KERNEL though. We cannot use GFP_KERNEL
> > during the store because reclaim may sleep and we cannot sleep holding
> > the locks we need to hold at the time of the store operation in
> > __vma_adjust().
> >
>
> Yap, if the type of lock is spinlock, the flow shouldn't sleep in the
> critical sections. mmap_lock is implemented by rw_semaphore (a
> sleeping lock). Is there any other lock in this section?
> > >
> > > > >
> > > > > > > I think that if mas_preallocate() allocate enough node, why
> > > > > > > we check the node count and allocate nodes if there was no
> > > > > > > enough nodes in mas in mas_node_count()?
> > > > > >
> > > > > > We check for the above reason.
> > > > > >
> > > > >
> > > > > OK..., this is one of the root cause of this BUG.
> > > >
> > > > The root cause is that there was not enough memory for a store
> > > > operation. Regardless of if we check the allocations in the
> > > > mas_store_prealloc() path or not, this would fail. If we remove
> > > > the check for nodes within this path, then we would have to
> > > > BUG_ON() when we run out of nodes to use or have a null pointer
> > > > dereference BUG anyways.
> > > >
> > >
> > > Yap, the root cause is oom. The BUG_ON() for the situations that
> > > the maple tree struct cannot be maintained because of the lack of
> > > memory is necessary. But the buddy system in linux kernel can
> > > reclaim memory when the system is under the low memory status. If
> > > we use GFP_KERNEL after trying GFP_NOWAIT to allocate node, maybe
> > > we can get enough memory when the second try with GFP_KERNEL.
> >
> > Right, but the GFP_KERNEL cannot be used when holding certain locks
> > because it may sleep.
> >
> > > > >
> > > > > > >
> > > > > > > We have seen that there may be some maple_tree operations in
> > > > > > > merge_vma...
> > > > > >
> > > > > > If merge_vma() does anything, then there was an operation to
> > > > > > the maple tree.
> > > > > >
> > > > > > >
> > > > > > > Moreover, would maple_tree provides an API for assigning
> > > > > > > user's gfp flag for allocating node?
> > > > > >
> > > > > > mas_preallocate() and mas_store_gfp() has gfp flags as an
> > > > > > argument. In your call stack, it will be called in
> > > > > > __vma_adjust() as such:
> > > > > >
> > > > > > if (mas_preallocate(&mas, vma, GFP_KERNEL))
> > > > > > return -ENOMEM;
> > > > > >
> > > > > > line 715 in v6.1.25
> > > > > >
> > > > > > > In rb_tree, we allocate vma_area_struct (rb_node is in this
> > > > > > > struct.) with GFP_KERNEL, and maple_tree allocate node with
> > > > > > > GFP_NOWAIT and __GFP_NOWARN.
> > > > > >
> > > > > > We use GFP_KERNEL as I explained above for the VMA tree.
> > > > >
> > > > > Got it! But the mas_node_count() always use GFP_NOWAIT and
> > > > > __GFP_NOWARN in inserting tree flow. Do you consider the
> > > > > performance of maintaining the structure of maple_tree?
> > > >
> > > > Sorry, I don't understand what you mean by 'consider the
> > > > performance of maintaining the structure of maple_tree'.
> > > >
> > > As I mentioned above, GFP_NOWAIT will not allow buddy system for
> > > reclaiming memory, so "Do you consider the performance of
> > > maintaining the structure of maple_tree" means that: whether the
> > > mas_node_count() path is not allowed to reclaim or compact memory
> > > for the performance.
> >
> > Ah, no. This is not for performance. It was initially on the road
> > map for performance, but it was needed for the complicated locking
> > in the MM code.
> >
> > rb tree embedded the nodes in the VMA which is allocated outside
> > this critical section and so it could use GFP_KERNEL.
> >
>
> As I know, following is rb_tree flow in 5.15.186:
>
> ...
> mmap_write_lock_killable(mm)
> ...
> do_mmap()
> ...
> mmap_region()
> ...
> vm_area_alloc(mm)
> ...
> mmap_write_unlock(mm)
>
> vm_area_alloc is in the mmap_lock holding period.
> It seems that the flow would sleep here in rb_tree flow.
> If I miss anything, please tell me, thanks!
>
>
> > > > >
> > > > > > It also will drop the lock and retry with GFP_KERNEL on
> > > > > > failure when not using the external lock. The mmap_lock is
> > > > > > configured as an external lock.
> > > > > >
> > > > > > > Allocation will not wait for reclaiming and compacting when
> > > > > > > there is no enough available memory.
> > > > > > > Is there any concern for this design?
> > > > > >
> > > > > > This has been addressed above, but let me know if I missed
> > > > > > anything here.
> > > > > >
> > > > >
> > > > > I think that the mas_node_count() has higher rate of triggering
> > > > > BUG_ON() when allocating nodes with GFP_NOWAIT and __GFP_NOWARN.
> > > > > If mas_node_count() use GFP_KERNEL as mas_preallocate() in the
> > > > > mmap.c, the allocation fail rate may be lower than use GFP_NOWAIT.
> > > >
> > > > Which BUG_ON() are you referring to?
> > > >
> > > > If I was to separate the code path for mas_store_prealloc() and
> > > > mas_store_gfp(), then a BUG_ON() would still need to exist and
> > > > still would have been triggered.. We are in a place in the code
> > > > where we should never sleep and we don't have enough memory
> > > > allocated to do what was necessary.
> > > >
> > >
> > > Yap. There is no reason to separate mas_store_prealloc() and
> > > mas_store_gfp(). Is it possible to retry to allocate mas_node with
> > > GFP_KERNEL (wait for system reclaim and compact) instead of
> > > triggering BUG_ON once the GFP_NOWAIT allocation failed?
> >
> > Unfortunately not, no. In some cases it may actually work out that
> > the VMA may not need the locks in question, but it cannot be
> > generalized for __vma_adjust(). Where I am able, I use the
> > mas_store_gfp() calls so we can let reclaim and compact run, but
> > it's not possible here.
> >
>
> We have used GFP_KERNEL as the alloc flag in the mas_node_count flow
> for about 2 months. The problem we mentioned in the first mail didn't
> reproduce under the high stress stability test.
>
> I have seen the patch provided by you. I will verify this patch in
> our stability test. But there is a problem: if maple_tree behavior is
> unexpected (e.g. the redundant write bug this time), we can only
> review the whole mm flow to find out the wrong behavior. Do we have
> an efficient debug method for maple tree?
>
> > Thanks,
> > Liam
Best Regards,
John Hsu
On Mon, 2023-07-10 at 10:24 -0400, Liam R. Howlett wrote:
>
> * John Hsu (許永翰) <[email protected]> [230710 08:50]:
> ...
>
> > > > > > This BUG_ON() is necessary since this function should _never_
> > > > > > run out of memory; this function does not return an error code.
> > > > > > mas_preallocate() should have gotten you the memory necessary
> > > > > > (or returned an -ENOMEM) prior to the call to
> > > > > > mas_store_prealloc(), so this is probably an internal tree
> > > > > > problem.
> > > > > >
> > > > > > There is a tree operation being performed here. mprotect is
> > > > > > merging a vma by the looks of the call stack. Why do you think
> > > > > > no tree operation is necessary?
> > > > > >
> > > > > > As you mentioned, mas_preallocate() should allocate enough
> > > > > > node, but there is such functions mas_node_count() in
> > > > > > mas_store_prealloc(). In mas_node_count() checks whether the
> > > > > > *mas* has enough nodes, and allocate memory for node if there
> > > > > > was no enough nodes in mas.
> > > > >
> > > > > Right, we call mas_node_count() so that both code paths are used
> > > > > for preallocations and regular mas_store()/mas_store_gfp(). It
> > > > > shouldn't take a significant amount of time to verify there is
> > > > > enough nodes.
> > > >
> > > > Yap..., it didn't take a significant amount of time to verify
> > > > whether there is enough nodes. The problem is why the flow in
> > > > mas_node_count will alloc nodes if there was no enough nodes in
> > > > mas?
> > >
> > > What I meant is that both methods use the same call path because
> > > there is not a reason to duplicate the path. After mas_preallocate()
> > > has allocated the nodes needed, the call to check if there is
> > > enough nodes will be quick.
> >
> > So whether the purpose of mas_preallocate() is decreasing the lock
> > retention time?
>
> It could be, but in this case it's the locking order. We have to
> pre-allocate and fail early if we are out of memory, because we
> _cannot_ use GFP_KERNEL where we call mas_store_prealloc().
> mas_preallocate() will use GFP_KERNEL though. We cannot use GFP_KERNEL
> during the store because reclaim may sleep and we cannot sleep holding
> the locks we need to hold at the time of the store operation in
> __vma_adjust().
>
Yap, if the type of lock is spinlock, the flow shouldn't sleep in the
critical sections. mmap_lock is implemented by rw_semaphore (a
sleeping lock). Is there any other lock in this section?
> >
> > > >
> > > > > > I think that if mas_preallocate() allocate enough node, why we
> > > > > > check the node count and allocate nodes if there was no enough
> > > > > > nodes in mas in mas_node_count()?
> > > > >
> > > > > We check for the above reason.
> > > > >
> > > >
> > > > OK..., this is one of the root cause of this BUG.
> > >
> > > The root cause is that there was not enough memory for a store
> > > operation. Regardless of if we check the allocations in the
> > > mas_store_prealloc() path or not, this would fail. If we remove
> > > the check for nodes within this path, then we would have to
> > > BUG_ON() when we run out of nodes to use or have a null pointer
> > > dereference BUG anyways.
> > >
> > Yap, the root cause is oom. The BUG_ON() for the situations that
> > the maple tree struct cannot be maintained because of the lack of
> > memory is necessary. But the buddy system in linux kernel can
> > reclaim memory when the system is under the low memory status. If
> > we use GFP_KERNEL after trying GFP_NOWAIT to allocate node, maybe
> > we can get enough memory when the second try with GFP_KERNEL.
>
> Right, but the GFP_KERNEL cannot be used when holding certain locks
> because it may sleep.
>
> > > >
> > > > > >
> > > > > > We have seen that there may be some maple_tree operations in
> > > > > > merge_vma...
> > > > >
> > > > > If merge_vma() does anything, then there was an operation to the
> > > > > maple tree.
> > > > >
> > > > > >
> > > > > > Moreover, would maple_tree provides an API for assigning user's
> > > > > > gfp flag for allocating node?
> > > > >
> > > > > mas_preallocate() and mas_store_gfp() has gfp flags as an
> > > > > argument. In your call stack, it will be called in __vma_adjust()
> > > > > as such:
> > > > >
> > > > > if (mas_preallocate(&mas, vma, GFP_KERNEL))
> > > > > return -ENOMEM;
> > > > >
> > > > > line 715 in v6.1.25
> > > > >
> > > > > > In rb_tree, we allocate vma_area_struct (rb_node is in this
> > > > > > struct.) with GFP_KERNEL, and maple_tree allocate node with
> > > > > > GFP_NOWAIT and __GFP_NOWARN.
> > > > >
> > > > > We use GFP_KERNEL as I explained above for the VMA tree.
> > > >
> > > > Got it! But the mas_node_count() always use GFP_NOWAIT and
> > > > __GFP_NOWARN in inserting tree flow. Do you consider the
> > > > performance of maintaining the structure of maple_tree?
> > >
> > > Sorry, I don't understand what you mean by 'consider the
> > > performance of maintaining the structure of maple_tree'.
> > >
> > As I mentioned above, GFP_NOWAIT will not allow buddy system for
> > reclaiming memory, so "Do you consider the performance of
> > maintaining the structure of maple_tree" means that: whether the
> > mas_node_count() path is not allowed to reclaim or compact memory
> > for the performance.
>
> Ah, no. This is not for performance. It was initially on the road
> map for performance, but it was needed for the complicated locking
> in the MM code.
>
> rb tree embedded the nodes in the VMA which is allocated outside
> this critical section and so it could use GFP_KERNEL.
>
As I know, following is rb_tree flow in 5.15.186:
...
mmap_write_lock_killable(mm)
...
do_mmap()
...
mmap_region()
...
vm_area_alloc(mm)
...
mmap_write_unlock(mm)
vm_area_alloc is in the mmap_lock holding period.
It seems that the flow would sleep here in rb_tree flow.
If I miss anything, please tell me, thanks!
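The two-step allocation policy discussed in this thread (try a non-sleeping allocation first, and fall back to a sleeping retry only when the internal lock can be dropped) can be sketched as userspace C. This is a simulation only; `alloc_node`, `try_alloc_nowait`, and `struct lk` are invented names for illustration, not kernel APIs.

```c
#include <stdlib.h>

/* Minimal model: GFP_NOWAIT is a single attempt with no reclaim,
 * GFP_KERNEL is a retry that is only legal after dropping the lock. */
struct lk { int held; };

/* Models GFP_NOWAIT | __GFP_NOWARN: one attempt, no reclaim. */
static void *try_alloc_nowait(size_t n)
{
	return malloc(n);
}

/* external_lock models the mmap_lock case: the tree code may not
 * drop a lock it does not own, so it must fail instead of sleeping. */
static void *alloc_node(struct lk *lock, size_t n, int external_lock)
{
	void *p = try_alloc_nowait(n);

	if (p)
		return p;
	if (external_lock)
		return NULL;	/* caller sees -ENOMEM; no retry allowed */

	lock->held = 0;		/* drop the internal lock... */
	p = malloc(n);		/* ...and retry where sleeping is legal */
	lock->held = 1;		/* reacquire before returning */
	return p;
}
```

The external-lock branch is the whole point of the discussion: under the externally held mmap_lock there is nothing the tree code may drop, so the fallback is unavailable and preallocation has to cover the need in advance.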
> > > >
> > > > > It also will drop the lock and retry with GFP_KERNEL on failure
> > > > > when not using the external lock. The mmap_lock is configured as
> > > > > an external lock.
> > > > >
> > > > > > Allocation will not wait for reclaiming and compacting when
> > > > > > there is no enough available memory.
> > > > > > Is there any concern for this design?
> > > > >
> > > > > This has been addressed above, but let me know if I missed
> > > > > anything here.
> > > > >
> > > >
> > > > I think that the mas_node_count() has higher rate of triggering
> > > > BUG_ON() when allocating nodes with GFP_NOWAIT and __GFP_NOWARN.
> > > > If mas_node_count() use GFP_KERNEL as mas_preallocate() in the
> > > > mmap.c, the allocation fail rate may be lower than use GFP_NOWAIT.
> > >
> > > Which BUG_ON() are you referring to?
> > >
> > > If I was to separate the code path for mas_store_prealloc() and
> > > mas_store_gfp(), then a BUG_ON() would still need to exist and
> > > still would have been triggered.. We are in a place in the code
> > > where we should never sleep and we don't have enough memory
> > > allocated to do what was necessary.
> > >
> > Yap. There is no reason to separate mas_store_prealloc() and
> > mas_store_gfp(). Is it possible to retry to allocate mas_node with
> > GFP_KERNEL (wait for system reclaim and compact) instead of
> > triggering BUG_ON once the GFP_NOWAIT allocation failed?
>
> Unfortunately not, no. In some cases it may actually work out that
> the VMA may not need the locks in question, but it cannot be
> generalized for __vma_adjust(). Where I am able, I use the
> mas_store_gfp() calls so we can let reclaim and compact run, but
> it's not possible here.
>
We have used GFP_KERNEL as the alloc flag in the mas_node_count flow
for about 2 months. The problem we mentioned in the first mail didn't
reproduce under the high stress stability test.
I have seen the patch provided by you. I will verify this patch in our
stability test. But there is a problem: if maple_tree behavior is
unexpected (e.g. the redundant write bug this time), we can only review
the whole mm flow to find out the wrong behavior. Do we have an
efficient debug method for maple tree?
> Thanks,
> Liam
* Liam R. Howlett <[email protected]> [230719 14:51]:
> * John Hsu (許永翰) <[email protected]> [230712 23:26]:
> > On Mon, 2023-07-10 at 10:24 -0400, Liam R. Howlett wrote:
>
> ...
>
> > > > > > > > As you mentioned, mas_preallocate() should allocate enough
> > > > > > > > node, but there is such functions mas_node_count() in
> > > > > > > > mas_store_prealloc(). In mas_node_count() checks whether
> > > > > > > > the *mas* has enough nodes, and allocate memory for node if
> > > > > > > > there was no enough nodes in mas.
> > > > > > >
> > > > > > > Right, we call mas_node_count() so that both code paths are
> > > > > > > used for preallocations and regular
> > > > > > > mas_store()/mas_store_gfp(). It shouldn't take a significant
> > > > > > > amount of time to verify there is enough nodes.
> > > > > >
> > > > > > Yap..., it didn't take a significant amount of time to verify
> > > > > > whether there is enough nodes. The problem is why the flow in
> > > > > > mas_node_count will alloc nodes if there was no enough nodes
> > > > > > in mas?
> > > > >
> > > > > What I meant is that both methods use the same call path because
> > > > > there is not a reason to duplicate the path. After
> > > > > mas_preallocate() has allocated the nodes needed, the call to
> > > > > check if there is enough nodes will be quick.
> > > >
> > > > So whether the purpose of mas_preallocate() is decreasing the lock
> > > > retention time?
> > >
> > > It could be, but in this case it's the locking order. We have to
> > > pre-allocate and fail early if we are out of memory, because we
> > > _cannot_ use GFP_KERNEL where we call mas_store_prealloc().
> > > mas_preallocate() will use GFP_KERNEL though. We cannot use
> > > GFP_KERNEL during the store because reclaim may sleep and we cannot
> > > sleep holding the locks we need to hold at the time of the store
> > > operation in __vma_adjust().
> > >
> > Yap, if the type of lock is spinlock, the flow shouldn't sleep in the
> > critical sections. mmap_lock is implemented by rw_semaphore (a
> > sleeping lock). Is there any other lock in this section?
>
> Yes, the i_mmap_lock_write(), the anon_vma_lock_write(),
> flush_dcache_mmap_lock().
>
> > > >
> > > > > >
> > > > > > > > I think that if mas_preallocate() allocate enough node,
> > > > > > > > why we check the node count and allocate nodes if there
> > > > > > > > was no enough nodes in mas in mas_node_count()?
> > > > > > >
> > > > > > > We check for the above reason.
> > > > > > >
> > > > > >
> > > > > > OK..., this is one of the root cause of this BUG.
> > > > >
> > > > > The root cause is that there was not enough memory for a store
> > > > > operation. Regardless of if we check the allocations in the
> > > > > mas_store_prealloc() path or not, this would fail. If we remove
> > > > > the check for nodes within this path, then we would have to
> > > > > BUG_ON() when we run out of nodes to use or have a null pointer
> > > > > dereference BUG anyways.
> > > > >
> > > > Yap, the root cause is oom. The BUG_ON() for the situations that
> > > > the maple tree struct cannot be maintained because of the lack of
> > > > memory is necessary. But the buddy system in linux kernel can
> > > > reclaim memory when the system is under the low memory status. If
> > > > we use GFP_KERNEL after trying GFP_NOWAIT to allocate node, maybe
> > > > we can get enough memory when the second try with GFP_KERNEL.
> > >
> > > Right, but the GFP_KERNEL cannot be used when holding certain locks
> > > because it may sleep.
> > >
> > > > > >
> > > > > > > >
> > > > > > > > We have seen that there may be some maple_tree operations
> > > > > > > > in merge_vma...
> > > > > > >
> > > > > > > If merge_vma() does anything, then there was an operation to
> > > > > > > the maple tree.
> > > > > > >
> > > > > > > >
> > > > > > > > Moreover, would maple_tree provides an API for assigning
> > > > > > > > user's gfp flag for allocating node?
> > > > > > >
> > > > > > > mas_preallocate() and mas_store_gfp() has gfp flags as an
> > > > > > > argument. In your call stack, it will be called in
> > > > > > > __vma_adjust() as such:
> > > > > > >
> > > > > > > if (mas_preallocate(&mas, vma, GFP_KERNEL))
> > > > > > > return -ENOMEM;
> > > > > > >
> > > > > > > line 715 in v6.1.25
> > > > > > >
> > > > > > > > In rb_tree, we allocate vma_area_struct (rb_node is in
> > > > > > > > this struct.) with GFP_KERNEL, and maple_tree allocate
> > > > > > > > node with GFP_NOWAIT and __GFP_NOWARN.
> > > > > > >
> > > > > > > We use GFP_KERNEL as I explained above for the VMA tree.
> > > > > >
> > > > > > Got it! But the mas_node_count() always use GFP_NOWAIT and
> > > > > > __GFP_NOWARN in inserting tree flow. Do you consider the
> > > > > > performance of maintaining the structure of maple_tree?
> > > > >
> > > > > Sorry, I don't understand what you mean by 'consider the
> > > > > performance of maintaining the structure of maple_tree'.
> > > > >
> > > > As I mentioned above, GFP_NOWAIT will not allow buddy system for
> > > > reclaiming memory, so "Do you consider the performance of
> > > > maintaining the structure of maple_tree" means that: whether the
> > > > mas_node_count() path is not allowed to reclaim or compact memory
> > > > for the performance.
> > >
> > > Ah, no. This is not for performance. It was initially on the road
> > > map for performance, but it was needed for the complicated locking
> > > in the MM code.
> > >
> > > rb tree embedded the nodes in the VMA which is allocated outside
> > > this critical section and so it could use GFP_KERNEL.
> > >
> > As I know, following is rb_tree flow in 5.15.186:
> >
> > ...
> > mmap_write_lock_killable(mm)
> > ...
> > do_mmap()
> > ...
> > mmap_region()
> > ...
> > vm_area_alloc(mm)
> > ...
> > mmap_write_unlock(mm)
> >
> > vm_area_alloc is in the mmap_lock holding period.
> > It seems that the flow would sleep here in rb_tree flow.
> > If I miss anything, please tell me, thanks!
>
> Before the mmap_write_unlock(mm) in the above sequence, the
> i_mmap_lock_write(), anon_vma_lock_write(), and/or the
> flush_dcache_mmap_lock() may be taken. Check __vma_adjust().
>
> The insertion into the tree needs to hold some subset of these locks.
> The rb-tree insert did not allocate within these locks, but the maple
> tree would need to allocate within these locks to insert into the tree.
> This is why the preallocation exists and why it is necessary.
>
> >
> >
> > > > > >
> > > > > > > It also will drop the lock and retry with GFP_KERNEL on
> > > > > > > failure when not using the external lock. The mmap_lock is
> > > > > > > configured as an external lock.
> > > > > > >
> > > > > > > > Allocation will not wait for reclaiming and compacting
> > > > > > > > when there is no enough available memory.
> > > > > > > > Is there any concern for this design?
> > > > > > >
> > > > > > > This has been addressed above, but let me know if I missed
> > > > > > > anything here.
> > > > > > >
> > > > > >
> > > > > > I think that the mas_node_count() has higher rate of
> > > > > > triggering BUG_ON() when allocating nodes with GFP_NOWAIT and
> > > > > > __GFP_NOWARN. If mas_node_count() use GFP_KERNEL as
> > > > > > mas_preallocate() in the mmap.c, the allocation fail rate may
> > > > > > be lower than use GFP_NOWAIT.
> > > > >
> > > > > Which BUG_ON() are you referring to?
> > > > >
> > > > > If I was to separate the code path for mas_store_prealloc() and
> > > > > mas_store_gfp(), then a BUG_ON() would still need to exist and
> > > > > still would have been triggered.. We are in a place in the code
> > > > > where we should never sleep and we don't have enough memory
> > > > > allocated to do what was necessary.
> > > > >
> > > > Yap. There is no reason to separate mas_store_prealloc() and
> > > > mas_store_gfp(). Is it possible to retry to allocate mas_node
> > > > with GFP_KERNEL (wait for system reclaim and compact) instead of
> > > > triggering BUG_ON once the GFP_NOWAIT allocation failed?
> > >
> > > Unfortunately not, no. In some cases it may actually work out that
> > > the VMA may not need the locks in question, but it cannot be
> > > generalized for __vma_adjust(). Where I am able, I use the
> > > mas_store_gfp() calls so we can let reclaim and compact run, but
> > > it's not possible here.
> > >
> > We have used GFP_KERNEL as the alloc flag in the mas_node_count flow
> > for about 2 months. The problem we mentioned in the first mail
> > didn't reproduce under the high stress stability test.
> >
> > I have seen the patch provided by you. I will verify this patch in
> > our stability test. But there is a problem: if maple_tree behavior
> > is unexpected (e.g. the redundant write bug this time), we can only
> > review the whole mm flow to find out the wrong behavior. Do we have
> > an efficient debug method for maple tree?
>
> There is a test suite for the maple tree found in
> tools/testing/radix-tree, but it does not test the mm code and
> integration.
>
> There are other mm tests, but they will not test the OOM scenario you
> hit - to the best of my knowledge.
>
> There are also config options to debug the tree operations, but they do
> not detect the redundant write issues. Perhaps I can look at adding
> support for detecting redundant writes, but that will not be backported
> to a stable kernel.
I just wanted to clarify what I said here:
I'm looking at adding this detection within CONFIG_DEBUG_MAPLE_TREE.
This is how I narrowed down the issue in this particular case, but it's
far from production quality. Even if or when it is production quality,
it won't be sent for inclusion into the stable kernel release.
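For anyone wanting to exercise the existing validation while that detection work is in progress, the debug option mentioned above can be enabled in a test kernel. This is a sketch of the usual Kconfig workflow from a kernel source tree (the extra checks are far too slow for production builds):

```shell
# Enable internal maple tree validation in a test kernel build
./scripts/config --enable CONFIG_DEBUG_MAPLE_TREE
make olddefconfig
make -j"$(nproc)"
```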
Thanks,
Liam
* John Hsu (許永翰) <[email protected]> [230712 23:26]:
> On Mon, 2023-07-10 at 10:24 -0400, Liam R. Howlett wrote:
...
> > > > > > > As you mentioned, mas_preallocate() should allocate enough
> > > > > > > node, but there is such functions mas_node_count() in
> > > > > > > mas_store_prealloc(). In mas_node_count() checks whether the
> > > > > > > *mas* has enough nodes, and allocate memory for node if
> > > > > > > there was no enough nodes in mas.
> > > > > >
> > > > > > Right, we call mas_node_count() so that both code paths are
> > > > > > used for preallocations and regular
> > > > > > mas_store()/mas_store_gfp(). It shouldn't take a significant
> > > > > > amount of time to verify there is enough nodes.
> > > > >
> > > > > Yap..., it didn't take a significant amount of time to verify
> > > > > whether there is enough nodes. The problem is why the flow in
> > > > > mas_node_count will alloc nodes if there was no enough nodes in
> > > > > mas?
> > > >
> > > > What I meant is that both methods use the same call path because
> > > > there is not a reason to duplicate the path. After
> > > > mas_preallocate() has allocated the nodes needed, the call to
> > > > check if there is enough nodes will be quick.
> > >
> > > So whether the purpose of mas_preallocate() is decreasing the lock
> > > retention time?
> >
> > It could be, but in this case it's the locking order. We have to
> > pre-allocate and fail early if we are out of memory, because we
> > _cannot_ use GFP_KERNEL where we call mas_store_prealloc().
> > mas_preallocate() will use GFP_KERNEL though. We cannot use
> > GFP_KERNEL during the store because reclaim may sleep and we cannot
> > sleep holding the locks we need to hold at the time of the store
> > operation in __vma_adjust().
> >
> Yap, if the type of lock is spinlock, the flow shouldn't sleep in the
> critical sections. mmap_lock is implemented by rw_semaphore (a
> sleeping lock). Is there any other lock in this section?
Yes, the i_mmap_lock_write(), the anon_vma_lock_write(),
flush_dcache_mmap_lock().
> > >
> > > > >
> > > > > > > I think that if mas_preallocate() allocate enough node, why
> > > > > > > we check the node count and allocate nodes if there was no
> > > > > > > enough nodes in mas in mas_node_count()?
> > > > > >
> > > > > > We check for the above reason.
> > > > > >
> > > > >
> > > > > OK..., this is one of the root cause of this BUG.
> > > >
> > > > The root cause is that there was not enough memory for a store
> > > > operation. Regardless of if we check the allocations in the
> > > > mas_store_prealloc() path or not, this would fail. If we remove
> > > > the check for nodes within this path, then we would have to
> > > > BUG_ON() when we run out of nodes to use or have a null pointer
> > > > dereference BUG anyways.
> > > >
> > > Yap, the root cause is oom. The BUG_ON() for the situations that
> > > the maple tree struct cannot be maintained because of the lack of
> > > memory is necessary. But the buddy system in linux kernel can
> > > reclaim memory when the system is under the low memory status. If
> > > we use GFP_KERNEL after trying GFP_NOWAIT to allocate node, maybe
> > > we can get enough memory when the second try with GFP_KERNEL.
> >
> > Right, but the GFP_KERNEL cannot be used when holding certain locks
> > because it may sleep.
> >
> > > > >
> > > > > > >
> > > > > > > We have seen that there may be some maple_tree operations
> > > > > > > in merge_vma...
> > > > > >
> > > > > > If merge_vma() does anything, then there was an operation to
> > > > > > the maple tree.
> > > > > >
> > > > > > >
> > > > > > > Moreover, would maple_tree provides an API for assigning
> > > > > > > user's gfp flag for allocating node?
> > > > > >
> > > > > > mas_preallocate() and mas_store_gfp() has gfp flags as an
> > > > > > argument. In your call stack, it will be called in
> > > > > > __vma_adjust() as such:
> > > > > >
> > > > > > if (mas_preallocate(&mas, vma, GFP_KERNEL))
> > > > > > return -ENOMEM;
> > > > > >
> > > > > > line 715 in v6.1.25
> > > > > >
> > > > > > > In rb_tree, we allocate vma_area_struct (rb_node is in this
> > > > > > > struct.) with GFP_KERNEL, and maple_tree allocate node with
> > > > > > > GFP_NOWAIT and __GFP_NOWARN.
> > > > > >
> > > > > > We use GFP_KERNEL as I explained above for the VMA tree.
> > > > >
> > > > > Got it! But the mas_node_count() always use GFP_NOWAIT and
> > > > > __GFP_NOWARN in inserting tree flow. Do you consider the
> > > > > performance of maintaining the structure of maple_tree?
> > > >
> > > > Sorry, I don't understand what you mean by 'consider the
> > > > performance of maintaining the structure of maple_tree'.
> > > >
> > > As I mentioned above, GFP_NOWAIT will not allow buddy system for
> > > reclaiming memory, so "Do you consider the performance of
> > > maintaining the structure of maple_tree" means that: whether the
> > > mas_node_count() path is not allowed to reclaim or compact memory
> > > for the performance.
> >
> > Ah, no. This is not for performance. It was initially on the road
> > map for performance, but it was needed for the complicated locking
> > in the MM code.
> >
> > rb tree embedded the nodes in the VMA which is allocated outside
> > this critical section and so it could use GFP_KERNEL.
> >
> As I know, following is rb_tree flow in 5.15.186:
>
> ...
> mmap_write_lock_killable(mm)
> ...
> do_mmap()
> ...
> mmap_region()
> ...
> vm_area_alloc(mm)
> ...
> mmap_write_unlock(mm)
>
> vm_area_alloc is in the mmap_lock holding period.
> It seems that the flow would sleep here in rb_tree flow.
> If I miss anything, please tell me, thanks!
Before the mmap_write_unlock(mm) in the above sequence, the
i_mmap_lock_write(), anon_vma_lock_write(), and/or the
flush_dcache_mmap_lock() may be taken. Check __vma_adjust().
The insertion into the tree needs to hold some subset of these locks.
The rb-tree insert did not allocate within these locks, but the maple
tree would need to allocate within these locks to insert into the tree.
This is why the preallocation exists and why it is necessary.
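The preallocate-then-store discipline described above can be sketched as a small userspace simulation. This is not kernel code: `node_pool`, `preallocate`, and `store_prealloc` are invented names standing in for the `mas_*` API, and the pool is just a counter.

```c
#include <assert.h>
#include <stdlib.h>

/* A pool of tree nodes filled before the critical section (where a
 * sleeping allocation, i.e. GFP_KERNEL, is still legal) and drained
 * during the store (where sleeping is forbidden). */
struct node_pool {
	int nodes;		/* preallocated nodes still unused */
};

/* May sleep: called before the i_mmap/anon_vma style locks are taken.
 * Returns nonzero on failure so the caller can bail with -ENOMEM. */
static int preallocate(struct node_pool *pool, int needed)
{
	void *mem = malloc((size_t)needed * 256);	/* stand-in allocation */

	if (!mem)
		return -1;
	free(mem);		/* the simulation only tracks the count */
	pool->nodes = needed;
	return 0;
}

/* Must not sleep: consumes only what was preallocated. Running dry
 * here is an internal bug, which is what the BUG_ON() catches. */
static void store_prealloc(struct node_pool *pool, int used)
{
	assert(pool->nodes >= used);	/* analogue of the BUG_ON() */
	pool->nodes -= used;
}
```

The point of the split is visible in the assert: once the locks are held, the store can no longer report -ENOMEM, so any shortfall must already have been caught by `preallocate()` before the critical section began.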
>
>
> > > > >
> > > > > > It also will drop the lock and retry with GFP_KERNEL on
> > > > > > failure when not using the external lock. The mmap_lock is
> > > > > > configured as an external lock.
> > > > > >
> > > > > > > Allocation will not wait for reclaiming and compacting when
> > > > > > > there is no enough available memory.
> > > > > > > Is there any concern for this design?
> > > > > >
> > > > > > This has been addressed above, but let me know if I missed
> > > > > > anything here.
> > > > > >
> > > > >
> > > > > I think that the mas_node_count() has higher rate of triggering
> > > > > BUG_ON() when allocating nodes with GFP_NOWAIT and
> > > > > __GFP_NOWARN. If mas_node_count() use GFP_KERNEL as
> > > > > mas_preallocate() in the mmap.c, the allocation fail rate may
> > > > > be lower than use GFP_NOWAIT.
> > > >
> > > > Which BUG_ON() are you referring to?
> > > >
> > > > If I were to separate the code paths for mas_store_prealloc() and
> > > > mas_store_gfp(), then a BUG_ON() would still need to exist and
> > > > still would have been triggered. We are in a place in the code
> > > > where we should never sleep and we don't have enough memory
> > > > allocated to do what was necessary.
> > > >
> > > Yap. There is no reason to separate mas_store_prealloc() and
> > > mas_store_gfp(). Is it possible to retry allocating mas_node with
> > > GFP_KERNEL (wait for system reclaim and compact) instead of
> > > triggering BUG_ON() once the GFP_NOWAIT allocation failed?
> >
> > Unfortunately not, no. In some cases it may actually work out that
> > the VMA may not need the locks in question, but it cannot be
> > generalized for __vma_adjust(). Where I am able, I use the
> > mas_store_gfp() calls so we can let reclaim and compact run, but it's
> > not possible here.
> >
> We have used GFP_KERNEL as the alloc flag in the mas_node_count() flow
> for about 2 months. The problem mentioned in the first mail did not
> reproduce under our high-stress stability test.
>
> I have seen the patch you provided, and I will verify it in our
> stability test. But there is a problem: if maple_tree behavior is
> unexpected (e.g. the redundant write bug this time), we can only review
> the whole mm flow to find the wrong behavior. Do we have an efficient
> debug method for the maple tree?
There is a test suite for the maple tree found in
tools/testing/radix-tree, but it does not test the mm code and
integration.
There are other mm tests, but they will not test the OOM scenario you
hit - to the best of my knowledge.
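For reference, the userspace maple tree test suite can typically be built
and run from a kernel source tree like this (target names may vary
between kernel versions, so check the Makefile in that directory):

```
cd tools/testing/radix-tree
make maple      # builds the userspace maple tree test binary
./maple         # runs the maple tree unit tests
```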
There are also config options to debug the tree operations, but they do
not detect the redundant write issues. Perhaps I can look at adding
support for detecting redundant writes, but that will not be backported
to a stable kernel.
Thanks,
Liam
On Wed, 2023-07-19 at 14:51 -0400, Liam R. Howlett wrote:
>
> External email : Please do not click links or open attachments until
> you have verified the sender or the content.
> * John Hsu (許永翰) <[email protected]> [230712 23:26]:
> > On Mon, 2023-07-10 at 10:24 -0400, Liam R. Howlett wrote:
>
> ...
>
> > > > > > > > As you mentioned, mas_preallocate() should allocate enough
> > > > > > > > nodes, but there is such a function, mas_node_count(), in
> > > > > > > > mas_store_prealloc(). mas_node_count() checks whether the
> > > > > > > > *mas* has enough nodes, and allocates memory for nodes if
> > > > > > > > there were not enough in mas.
> > > > > > >
> > > > > > > Right, we call mas_node_count() so that both code paths are
> > > > > > > used for preallocations and regular
> > > > > > > mas_store()/mas_store_gfp(). It shouldn't take a significant
> > > > > > > amount of time to verify there are enough nodes.
> > > > > >
> > > > > > Yap..., it didn't take a significant amount of time to verify
> > > > > > whether there are enough nodes. The question is why the flow
> > > > > > in mas_node_count() will alloc nodes if there were not enough
> > > > > > nodes in mas?
> > > > >
> > > > > What I meant is that both methods use the same call path because
> > > > > there is not a reason to duplicate the path. After
> > > > > mas_preallocate() has allocated the nodes needed, the call to
> > > > > check if there are enough nodes will be quick.
> > > >
> > > > So is the purpose of mas_preallocate() to decrease the lock
> > > > retention time?
> > >
> > > It could be, but in this case it's the locking order. We have to
> > > pre-allocate and fail early if we are out of memory, because we
> > > _cannot_ use GFP_KERNEL where we call mas_store_prealloc().
> > > mas_preallocate() will use GFP_KERNEL though. We cannot use
> > > GFP_KERNEL during the store because reclaim may sleep, and we cannot
> > > sleep holding the locks we need to hold at the time of the store
> > > operation in __vma_adjust().
> > >
> > Yap, if the lock is a spinlock, the flow shouldn't sleep in the
> > critical section. mmap_lock is implemented by rw_semaphore (a mutex).
> > Is there any other lock in this section?
>
> Yes, the i_mmap_lock_write(), the anon_vma_lock_write(),
> flush_dcache_mmap_lock().
>
> > > >
> > > > > >
> > > > > > > > I think that if mas_preallocate() allocates enough nodes,
> > > > > > > > why do we check the node count and allocate nodes in
> > > > > > > > mas_node_count() if there were not enough nodes in mas?
> > > > > > >
> > > > > > > We check for the above reason.
> > > > > > >
> > > > > >
> > > > > > OK..., this is one of the root causes of this BUG.
> > > > >
> > > > > The root cause is that there was not enough memory for a store
> > > > > operation. Regardless of whether we check the allocations in the
> > > > > mas_store_prealloc() path or not, this would fail. If we remove
> > > > > the check for nodes within this path, then we would have to
> > > > > BUG_ON() when we run out of nodes to use, or have a null pointer
> > > > > dereference BUG anyway.
> > > > >
> > > > Yap, the root cause is OOM. The BUG_ON() for situations where the
> > > > maple tree structure cannot be maintained because of a lack of
> > > > memory is necessary. But the buddy system in the Linux kernel can
> > > > reclaim memory when the system is under low-memory conditions. If
> > > > we use GFP_KERNEL after trying GFP_NOWAIT to allocate a node,
> > > > maybe we can get enough memory on the second try with GFP_KERNEL.
> > >
> > > Right, but GFP_KERNEL cannot be used when holding certain locks
> > > because it may sleep.
> > >
> > > > > >
> > > > > > > >
> > > > > > > > We have seen that there may be some maple_tree operations
> > > > > > > > in merge_vma...
> > > > > > >
> > > > > > > If merge_vma() does anything, then there was an operation to
> > > > > > > the maple tree.
> > > > > > >
> > > > > > > >
> > > > > > > > Moreover, would maple_tree provide an API for assigning
> > > > > > > > the user's gfp flag for allocating nodes?
> > > > > > >
> > > > > > > mas_preallocate() and mas_store_gfp() have gfp flags as an
> > > > > > > argument. In your call stack, it will be called in
> > > > > > > __vma_adjust() as such:
> > > > > > >
> > > > > > > if (mas_preallocate(&mas, vma, GFP_KERNEL))
> > > > > > > return -ENOMEM;
> > > > > > >
> > > > > > > line 715 in v6.1.25
> > > > > > >
> > > > > > > > In rb_tree, we allocate vm_area_struct (rb_node is in this
> > > > > > > > struct) with GFP_KERNEL, and maple_tree allocates nodes
> > > > > > > > with GFP_NOWAIT and __GFP_NOWARN.
> > > > > > >
> > > > > > > We use GFP_KERNEL as I explained above for the VMA tree.
> > > > > >
> > > > > > Got it! But mas_node_count() always uses GFP_NOWAIT and
> > > > > > __GFP_NOWARN in the tree-insertion flow. Do you consider the
> > > > > > performance of maintaining the structure of maple_tree?
> > > > >
> > > > > Sorry, I don't understand what you mean by 'consider the
> > > > > performance of maintaining the structure of maple_tree'.
> > > > >
> > > > As I mentioned above, GFP_NOWAIT will not allow the buddy system
> > > > to reclaim memory, so "Do you consider the performance of
> > > > maintaining the structure of maple_tree" means: is the
> > > > mas_node_count() path disallowed from reclaiming or compacting
> > > > memory for performance?
> > >
> > > Ah, no. This is not for performance. It was initially on the
> > > roadmap for performance, but it was needed for the complicated
> > > locking in the MM code.
> > >
> > > rb tree embedded the nodes in the VMA, which is allocated outside
> > > this critical section and so it could use GFP_KERNEL.
> > >
> > As far as I know, the following is the rb_tree flow in 5.15.186:
> >
> > ...
> > mmap_write_lock_killable(mm)
> > ...
> > do_mmap()
> > ...
> > mmap_region()
> > ...
> > vm_area_alloc(mm)
> > ...
> > mmap_write_unlock(mm)
> >
> > vm_area_alloc() is called while the mmap_lock is held.
> > It seems that the flow could sleep here in the rb_tree path.
> > If I miss anything, please tell me, thanks!
>
> Before the mmap_write_unlock(mm) in the above sequence, the
> i_mmap_lock_write(), anon_vma_lock_write(), and/or the
> flush_dcache_mmap_lock() may be taken. Check __vma_adjust().
>
> The insertion into the tree needs to hold some subset of these locks.
> The rb-tree insert did not allocate within these locks, but the maple
> tree would need to allocate within these locks to insert into the
> tree.
> This is why the preallocation exists and why it is necessary.
>
Yap, preallocation is necessary. anon_vma_lock_write() and
flush_dcache_mmap_lock() hold the lock while manipulating the rb_tree. I
think there are no maple tree manipulations during the lock-holding
period. Is there any future work in this section?
> >
> >
> > > > > >
> > > > > > > It also will drop the lock and retry with GFP_KERNEL on
> > > > > > > failure when not using the external lock. The mmap_lock is
> > > > > > > configured as an external lock.
> > > > > > >
> > > > > > > > Allocation will not wait for reclaiming and compacting
> > > > > > > > when there is not enough available memory.
> > > > > > > > Is there any concern for this design?
> > > > > > >
> > > > > > > This has been addressed above, but let me know if I missed
> > > > > > > anything here.
> > > > > > >
> > > > > >
> > > > > > I think that mas_node_count() has a higher rate of triggering
> > > > > > BUG_ON() when allocating nodes with GFP_NOWAIT and
> > > > > > __GFP_NOWARN. If mas_node_count() used GFP_KERNEL as
> > > > > > mas_preallocate() does in mmap.c, the allocation failure rate
> > > > > > may be lower than with GFP_NOWAIT.
> > > > >
> > > > > Which BUG_ON() are you referring to?
> > > > >
> > > > > If I were to separate the code paths for mas_store_prealloc()
> > > > > and mas_store_gfp(), then a BUG_ON() would still need to exist
> > > > > and still would have been triggered. We are in a place in the
> > > > > code where we should never sleep and we don't have enough memory
> > > > > allocated to do what was necessary.
> > > > >
> > > > Yap. There is no reason to separate mas_store_prealloc() and
> > > > mas_store_gfp(). Is it possible to retry allocating mas_node with
> > > > GFP_KERNEL (wait for system reclaim and compact) instead of
> > > > triggering BUG_ON() once the GFP_NOWAIT allocation failed?
> > >
> > > Unfortunately not, no. In some cases it may actually work out that
> > > the VMA may not need the locks in question, but it cannot be
> > > generalized for __vma_adjust(). Where I am able, I use the
> > > mas_store_gfp() calls so we can let reclaim and compact run, but
> > > it's not possible here.
> > >
> > We have used GFP_KERNEL as the alloc flag in the mas_node_count()
> > flow for about 2 months. The problem mentioned in the first mail did
> > not reproduce under our high-stress stability test.
> >
> > I have seen the patch you provided, and I will verify it in our
> > stability test. But there is a problem: if maple_tree behavior is
> > unexpected (e.g. the redundant write bug this time), we can only
> > review the whole mm flow to find the wrong behavior. Do we have an
> > efficient debug method for the maple tree?
>
> There is a test suite for the maple tree found in
> tools/testing/radix-tree, but it does not test the mm code and
> integration.
>
> There are other mm tests, but they will not test the OOM scenario you
> hit - to the best of my knowledge.
>
> There are also config options to debug the tree operations, but they
> do
> not detect the redundant write issues. Perhaps I can look at adding
> support for detecting redundant writes, but that will not be
> backported
> to a stable kernel.
>
The maple tree test cases are sufficient to ensure the functions work
well, but redundant operations (node alloc, node free, tree
manipulations) of maple_tree are not easy to detect (e.g. the case
reported this time, and mas_preallocate() allocating redundant nodes for
the worst case).
A mechanism for detecting redundant writes may help developers find such
problems more easily. I hope it can be established successfully!
> Thanks,
> Liam
Best Regards,
John Hsu
* John Hsu (許永翰) <[email protected]> [230807 05:55]:
> On Wed, 2023-07-19 at 14:51 -0400, Liam R. Howlett wrote:
...
> > > As far as I know, the following is the rb_tree flow in 5.15.186:
> > >
> > > ...
> > > mmap_write_lock_killable(mm)
> > > ...
> > > do_mmap()
> > > ...
> > > mmap_region()
> > > ...
> > > vm_area_alloc(mm)
> > > ...
> > > mmap_write_unlock(mm)
> > >
> > > vm_area_alloc() is called while the mmap_lock is held.
> > > It seems that the flow could sleep here in the rb_tree path.
> > > If I miss anything, please tell me, thanks!
> >
> > Before the mmap_write_unlock(mm) in the above sequence, the
> > i_mmap_lock_write(), anon_vma_lock_write(), and/or the
> > flush_dcache_mmap_lock() may be taken. Check __vma_adjust().
> >
> > The insertion into the tree needs to hold some subset of these locks.
> > The rb-tree insert did not allocate within these locks, but the maple
> > tree would need to allocate within these locks to insert into the
> > tree.
> > This is why the preallocation exists and why it is necessary.
> >
>
> Yap, preallocation is necessary. anon_vma_lock_write() and
> flush_dcache_mmap_lock() hold the lock while manipulating the rb_tree.
> I think there are no maple tree manipulations during the lock-holding
> period. Is there any future work in this section?
__vma_adjust() does modify the maple tree during the lock holding
section through vma_mas_store() in 6.1. Prior to 6.1, there is no maple
tree.
...
> > There are also config options to debug the tree operations, but they
> > do
> > not detect the redundant write issues. Perhaps I can look at adding
> > support for detecting redundant writes, but that will not be
> > backported
> > to a stable kernel.
> >
>
> The maple tree test cases are sufficient to ensure the functions work
> well, but redundant operations (node alloc, node free, tree
> manipulations) of maple_tree are not easy to detect (e.g. the case
> reported this time, and mas_preallocate() allocating redundant nodes
> for the worst case).
>
> A mechanism for detecting redundant writes may help developers find
> such problems more easily. I hope it can be established successfully!
When I went to add this, I had found I already added it here [1].
This operation was not caught by MA_STATE_PREALLOC because there are two
writes before a mas_destroy(), so there may be nodes left which avoid
the warning. I'll look at improving this situation.
Thanks,
Liam
[1] https://lore.kernel.org/linux-mm/[email protected]/