After upgrading a machine from 5.17.4 to 6.1.12 a couple of weeks ago, I
started getting (inconsistent) failures when building Android:
> dex2oatd F 02-28 11:49:44 40098 40098 mem_map_arena_pool.cc:65] Check failed: map.IsValid() Failed anonymous mmap((nil), 131072, 0x3, 0x22, -1, 0): Cannot allocate memory. See process maps in the log.
While it claims to be using 0x22 (MAP_PRIVATE | MAP_ANONYMOUS) for the
flags, it really uses 0x40 (MAP_32BIT) as well, as shown by strace:
> mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_32BIT, -1, 0) = 0x40720000
> mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_32BIT, -1, 0) = 0x4124e000
> mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_32BIT, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> dex2oatd F 03-01 10:32:33 74063 74063 mem_map_arena_pool.cc:65] Check failed: map.IsValid() Failed anonymous mmap((nil), 131072, 0x3, 0x22, -1, 0): Cannot allocate memory. See process maps in the log.
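For anyone decoding the numbers above: a minimal sketch of how the flag word splits into named bits. It assumes the x86-64 Linux values (MAP_PRIVATE 0x02, MAP_ANONYMOUS 0x20, MAP_32BIT 0x40), redefined locally under hypothetical FLAG_* names so it does not depend on the platform headers. describe_mmap_flags() is an illustrative helper, not something from ART or the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* x86-64 Linux flag values, redefined locally (assumption: other ABIs differ). */
enum {
    FLAG_PRIVATE   = 0x02, /* MAP_PRIVATE   */
    FLAG_ANONYMOUS = 0x20, /* MAP_ANONYMOUS */
    FLAG_32BIT     = 0x40  /* MAP_32BIT     */
};

/* Hypothetical helper: write a strace-like rendering of `flags` into buf.
 * Unknown bits are ignored; output is truncated if buf is too small. */
static const char *describe_mmap_flags(int flags, char *buf, size_t len)
{
    static const struct { int bit; const char *name; } known[] = {
        { FLAG_PRIVATE,   "MAP_PRIVATE"   },
        { FLAG_ANONYMOUS, "MAP_ANONYMOUS" },
        { FLAG_32BIT,     "MAP_32BIT"     },
    };

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
        if (!(flags & known[i].bit))
            continue;
        size_t used = strlen(buf);
        /* Separate names with '|' once something has been written. */
        snprintf(buf + used, len - used, "%s%s", used ? "|" : "", known[i].name);
    }
    return buf;
}
```

With these values, 0x22 renders as MAP_PRIVATE|MAP_ANONYMOUS (what the dex2oatd message prints), while the flags strace shows correspond to 0x62.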
Here's a simple reproducer, which (if my math is correct) tries to mmap
a total of ~600MiB in increasing chunk sizes:
#include <sys/mman.h>
#include <stdio.h>
#include <errno.h>
int main() {
    size_t total_leaks = 0;
    for (int shift=12; shift<=16; shift++) {
        size_t size = ((size_t)1)<<shift;
        for (int i=0; i<5000; ++i) {
            void* m = mmap(NULL, size, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
            if (m == MAP_FAILED || m == NULL) {
                printf(
                    "Failed. m=%p size=%zu (1<<%d) i=%d "
                    " errno=%d total_leaks=%zu (%zu MiB)\n",
                    m, size, shift, i, errno,
                    total_leaks, total_leaks / 1024 / 1024);
                return 1;
            }
            total_leaks += size;
        }
    }
    printf("Success.\n");
    return 0;
}
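As a check on the "~600MiB" estimate, here is a small sketch that adds up what the loops above request. reproducer_total_bytes() is a hypothetical helper introduced only for this arithmetic; it performs no mappings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes requested by `count` mappings at each size
 * 1<<first_shift .. 1<<last_shift, mirroring the reproducer's two loops. */
static size_t reproducer_total_bytes(int first_shift, int last_shift, size_t count)
{
    size_t total = 0;

    assert(first_shift >= 0 && first_shift <= last_shift);
    for (int shift = first_shift; shift <= last_shift; shift++)
        total += count * ((size_t)1 << shift);
    return total;
}
```

reproducer_total_bytes(12, 16, 5000) gives 634880000 bytes, which is 605 MiB when divided down the way the program prints it. One way to push the request past 1 GiB would be to raise the upper shift to 17, which comes to about 1230 MiB; that particular change is only an assumption about how one might extend the test.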
Older kernels fail very consistently at almost exactly 1GiB total_leaks,
if you change the test program to go that far. On 6.1.12, it fails much
earlier, after an arbitrary amount of successful mmaps:
> $ ./mmap-test
> Failed. m=0xffffffffffffffff size=4096 (1<<12) i=1500 errno=12 total_leaks=6144000 (5 MiB)
> $ ./mmap-test
> Failed. m=0xffffffffffffffff size=4096 (1<<12) i=620 errno=12 total_leaks=2539520 (2 MiB)
> $ ./mmap-test
> Failed. m=0xffffffffffffffff size=4096 (1<<12) i=2408 errno=12 total_leaks=9863168 (9 MiB)
> $ ./mmap-test
> Failed. m=0xffffffffffffffff size=4096 (1<<12) i=774 errno=12 total_leaks=3170304 (3 MiB)
> $ ./mmap-test
> Failed. m=0xffffffffffffffff size=4096 (1<<12) i=1648 errno=12 total_leaks=6750208 (6 MiB)
> $ ./mmap-test
I have checked a more recent master commit (ee3f96b1, from March 1st),
and the problem is still there. Bisecting shows that e15e06a8 is the
last good commit, and that 524e00b3 is the first one failing in this
way. The 10 or so commits in between run into a page fault BUG down in
vma_merge() instead.
This range of commits is about the same as mentioned in
https://lore.kernel.org/lkml/[email protected]/,
so I assume that my problem, too, was introduced with the Maple Tree
changes. Sending this to the same people and lists.
//Snild
* Snild Dolkow <[email protected]> [230302 10:33]:
> After upgrading a machine from 5.17.4 to 6.1.12 a couple of weeks ago, I
> started getting (inconsistent) failures when building Android:
Thanks for reporting this.
>
> > dex2oatd F 02-28 11:49:44 40098 40098 mem_map_arena_pool.cc:65] Check failed: map.IsValid() Failed anonymous mmap((nil), 131072, 0x3, 0x22, -1, 0): Cannot allocate memory. See process maps in the log.
>
> While it claims to be using 0x22 (MAP_PRIVATE | MAP_ANONYMOUS) for the
> flags, it really uses 0x40 (MAP_32BIT) as well, as shown by strace:
>
> > mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_32BIT, -1, 0) = 0x40720000
> > mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_32BIT, -1, 0) = 0x4124e000
> > mmap(NULL, 131072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_32BIT, -1, 0) = -1 ENOMEM (Cannot allocate memory)
> > dex2oatd F 03-01 10:32:33 74063 74063 mem_map_arena_pool.cc:65] Check failed: map.IsValid() Failed anonymous mmap((nil), 131072, 0x3, 0x22, -1, 0): Cannot allocate memory. See process maps in the log.
>
> Here's a simple reproducer, which (if my math is correct) tries to mmap a
> total of ~600MiB in increasing chunk sizes:
>
> #include <sys/mman.h>
> #include <stdio.h>
> #include <errno.h>
>
> int main() {
>     size_t total_leaks = 0;
>     for (int shift=12; shift<=16; shift++) {
>         size_t size = ((size_t)1)<<shift;
>         for (int i=0; i<5000; ++i) {
>             void* m = mmap(NULL, size, PROT_READ | PROT_WRITE,
>                            MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
>             if (m == MAP_FAILED || m == NULL) {
>                 printf(
>                     "Failed. m=%p size=%zu (1<<%d) i=%d "
>                     " errno=%d total_leaks=%zu (%zu MiB)\n",
>                     m, size, shift, i, errno,
>                     total_leaks, total_leaks / 1024 / 1024);
>                 return 1;
>             }
>             total_leaks += size;
>         }
>     }
>     printf("Success.\n");
>     return 0;
> }
Very useful, thanks!
>
> Older kernels fail very consistently at almost exactly 1GiB total_leaks, if
> you change the test program to go that far. On 6.1.12, it fails much
> earlier, after an arbitrary amount of successful mmaps:
>
> > $ ./mmap-test
> > Failed. m=0xffffffffffffffff size=4096 (1<<12) i=1500 errno=12 total_leaks=6144000 (5 MiB)
> > $ ./mmap-test
> > Failed. m=0xffffffffffffffff size=4096 (1<<12) i=620 errno=12 total_leaks=2539520 (2 MiB)
> > $ ./mmap-test
> > Failed. m=0xffffffffffffffff size=4096 (1<<12) i=2408 errno=12 total_leaks=9863168 (9 MiB)
> > $ ./mmap-test
> > Failed. m=0xffffffffffffffff size=4096 (1<<12) i=774 errno=12 total_leaks=3170304 (3 MiB)
> > $ ./mmap-test
> > Failed. m=0xffffffffffffffff size=4096 (1<<12) i=1648 errno=12 total_leaks=6750208 (6 MiB)
> > $ ./mmap-test
>
>
> I have checked a more recent master commit (ee3f96b1, from March 1st), and
> the problem is still there. Bisecting shows that e15e06a8 is the last good
> commit, and that 524e00b3 is the first one failing in this way. The 10 or so
> commits in between run into a page fault BUG down in vma_merge() instead.
It does look like it's the maple tree. I am working on this issue now.
>
> This range of commits is about the same as mentioned in https://lore.kernel.org/lkml/[email protected]/,
> so I assume that my problem, too, was introduced with the Maple Tree
> changes. Sending this to the same people and lists.
These are the right people to email.
Hopefully I'll have an update for you soon.
Regards,
Liam
[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few template
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]
On 02.03.23 16:32, Snild Dolkow wrote:
> After upgrading a machine from 5.17.4 to 6.1.12 a couple of weeks ago, I
> started getting (inconsistent) failures when building Android:
> [...]
> I have checked a more recent master commit (ee3f96b1, from March 1st),
> and the problem is still there. Bisecting shows that e15e06a8 is the
> last good commit, and that 524e00b3 is the first one failing in this
> way. The 10 or so commits in between run into a page fault BUG down in
> vma_merge() instead.
>
> This range of commits is about the same as mentioned in
> https://lore.kernel.org/lkml/[email protected]/, so I assume that my problem, too, was introduced with the Maple Tree changes. Sending this to the same people and lists.
>
Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:
#regzbot ^introduced e15e06a8..524e00b3
#regzbot title mm: mmap with MAP_32BIT randomly fails since 6.1
#regzbot ignore-activity
This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.
Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.
Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.
* Linux regression tracking #adding (Thorsten Leemhuis) <[email protected]> [230303 03:31]:
> [TLDR: I'm adding this report to the list of tracked Linux kernel
> regressions; the text you find below is based on a few template
> paragraphs you might have encountered already in similar form.
> See link in footer if these mails annoy you.]
>
> On 02.03.23 16:32, Snild Dolkow wrote:
> > After upgrading a machine from 5.17.4 to 6.1.12 a couple of weeks ago, I
> > started getting (inconsistent) failures when building Android:
> > [...]
> > I have checked a more recent master commit (ee3f96b1, from March 1st),
> > and the problem is still there. Bisecting shows that e15e06a8 is the
> > last good commit, and that 524e00b3 is the first one failing in this
> > way. The 10 or so commits in between run into a page fault BUG down in
> > vma_merge() instead.
> >
> > This range of commits is about the same as mentioned in
> > https://lore.kernel.org/lkml/[email protected]/, so I assume that my problem, too, was introduced with the Maple Tree changes. Sending this to the same people and lists.
> >
>
> Thanks for the report. To be sure the issue doesn't fall through the
> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
> tracking bot:
>
> #regzbot ^introduced e15e06a8..524e00b3
> #regzbot title mm: mmap with MAP_32BIT randomly fails since 6.1
> #regzbot ignore-activity
Thanks!
>
> This isn't a regression? This issue or a fix for it are already
> discussed somewhere else? It was fixed already? You want to clarify when
> the regression started to happen? Or point out I got the title or
> something else totally wrong? Then just reply and tell me -- ideally
> while also telling regzbot about it, as explained by the page listed in
> the footer of this mail.
I've sent a patch which has been tested by Snild and does not fully fix
the issue [1]. I am continuing work on this problem.
>
> Developers: When fixing the issue, remember to add 'Link:' tags pointing
> to the report (the parent of this mail). See page linked in footer for
> details.
Pretty sure I did this part, so maybe the discussion was already picked
up.
1. https://lore.kernel.org/linux-mm/[email protected]/
Cheers,
Liam
It appears that commit 58c5d0d6d522112577c7eeb71d382ea642ed7be4 causes
another regression of allocations with MAP_32BIT.
Reverting it fixes the reproducer from
https://lore.kernel.org/linux-mm/[email protected]/
Do you think this commit is somewhat safe to revert?
The following may be superfluous, but it adds some context and might help
someone find this thread. It merely confirms the observation of this
regression in
https://lore.kernel.org/linux-mm/[email protected]/
From what I can tell, reverting it also fixes my own use case, and
- The program I maintain,
https://github.com/hercules-ci/hercules-ci-agent/issues/514
- Another program, also Haskell:
https://github.com/aristanetworks/nix-serve-ng/issues/27
- An FPGA interface process. I've found them because they list the same
commit id on their blog.
https://jia.je/software/2023/05/06/linux-regression-vivado-en/
On 3/2/23 19:43, Liam R. Howlett wrote:
> * Snild Dolkow <[email protected]> [230302 10:33]:
>> After upgrading a machine from 5.17.4 to 6.1.12 a couple of weeks ago, I
>> started getting (inconsistent) failures when building Android:
>> While it claims to be using 0x22 (MAP_PRIVATE | MAP_ANONYMOUS) for the
>> flags, it really uses 0x40 (MAP_32BIT) as well, as shown by strace:
>>
The same applies to the dynamic linker in the GHC Haskell runtime system.
It also uses MAP_32BIT, in its linker, and reports the error
ghc: mmap 4096 bytes at (nil): Cannot allocate memory
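For context on why these allocations are confined to a small region: mmap(2) documents MAP_32BIT as placing the mapping in the first 2 GiB of the address space. Below is a minimal sketch of that range check; fits_below_2gib() is a hypothetical helper, not code from GHC or ART.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if [addr, addr + len) lies entirely below 2 GiB,
 * the range mmap(2) documents for MAP_32BIT mappings. */
static bool fits_below_2gib(uintptr_t addr, size_t len)
{
    const uint64_t limit = UINT64_C(1) << 31;

    /* Compare against the remaining room so addr + len cannot overflow. */
    return (uint64_t)addr < limit && (uint64_t)len <= limit - (uint64_t)addr;
}
```

Both addresses returned in the strace output earlier in the thread (0x40720000 and 0x4124e000, 131072 bytes each) pass this check.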
I hope this was a somewhat useful contribution to the regressions
thread. (also hi, I'm new here)
Cheers,
Robert Hensing
* Robert Hensing <[email protected]> [230511 21:02]:
> It appears that commit 58c5d0d6d522112577c7eeb71d382ea642ed7be4 causes
> another regression of allocations with MAP_32BIT.
> Reverting it fixes the reproducer from
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Do you think this commit is somewhat safe to revert?
No, don't do that.
Add this [1] instead. The patch is currently in mm-unstable and will
make its way through the normal channels to stable and mainline.
[1] https://lore.kernel.org/linux-mm/[email protected]/
Thanks,
Liam
>
> The following may be superfluous, but it adds some context and might help
> someone find this thread. It merely confirms the observation of this
> regression in
> https://lore.kernel.org/linux-mm/[email protected]/
>
> From what I can tell, reverting it also fixes my own use case, and
>
> - The program I maintain,
>   https://github.com/hercules-ci/hercules-ci-agent/issues/514
>
> - Another program, also Haskell:
>   https://github.com/aristanetworks/nix-serve-ng/issues/27
>
> - An FPGA interface process. I've found them because they list the same
>   commit id on their blog.
>   https://jia.je/software/2023/05/06/linux-regression-vivado-en/
>
>
>
> On 3/2/23 19:43, Liam R. Howlett wrote:
> > * Snild Dolkow <[email protected]> [230302 10:33]:
> >> After upgrading a machine from 5.17.4 to 6.1.12 a couple of weeks ago, I
> >> started getting (inconsistent) failures when building Android:
> >> While it claims to be using 0x22 (MAP_PRIVATE | MAP_ANONYMOUS) for the
> >> flags, it really uses 0x40 (MAP_32BIT) as well, as shown by strace:
> >>
>
> The same applies to the dynamic linker in the GHC Haskell runtime system.
>
> It also uses MAP_32BIT, in its linker, and reports the error
>
> ghc: mmap 4096 bytes at (nil): Cannot allocate memory
>
>
> I hope this was a somewhat useful contribution to the regressions
> thread. (also hi, I'm new here)
>
> Cheers,
>
> Robert Hensing