2021-01-06 15:24:25

by Milan Broz

Subject: Very slow unlockall()

Hi,

we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
and someone tried to use it with a hardened memory allocator library.

Execution time increased to extremes (minutes), and as we found, the problem
is in munlockall().

Here is a plain reproducer for the core issue without any external code - on a
Fedora rawhide kernel, unlocking takes more than 30 seconds!
I can reproduce it on 5.10 kernels and Linus' git.

The reproducer below mmaps a large amount of memory with PROT_NONE (later never used).
The real code of course does something more useful, but the problem is the same.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/mman.h>

int main(int argc, char *argv[])
{
    /* Reserve 2TB of address space with PROT_NONE; the pages are never touched. */
    void *p = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED) return 1;

    if (mlockall(MCL_CURRENT | MCL_FUTURE)) return 1;
    printf("locked\n");

    if (munlockall()) return 1;
    printf("unlocked\n");

    return 0;
}
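
Compile and run with e.g. (the binary name 'lock' matches the time and perf
output below):

    gcc -O2 -o lock lock.c
    time ./lock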

In the backtrace I see that the time is spent in munlock_vma_pages_range():

[ 2962.006813] Call Trace:
[ 2962.006814] ? munlock_vma_pages_range+0xe7/0x4b0
[ 2962.006814] ? vma_merge+0xf3/0x3c0
[ 2962.006815] ? mlock_fixup+0x111/0x190
[ 2962.006815] ? apply_mlockall_flags+0xa7/0x110
[ 2962.006816] ? __do_sys_munlockall+0x2e/0x60
[ 2962.006816] ? do_syscall_64+0x33/0x40
...

Or with perf, I see

# Overhead Command Shared Object Symbol
# ........ ....... ................. .....................................
#
48.18% lock [kernel.kallsyms] [k] lock_is_held_type
11.67% lock [kernel.kallsyms] [k] ___might_sleep
10.65% lock [kernel.kallsyms] [k] follow_page_mask
9.17% lock [kernel.kallsyms] [k] debug_lockdep_rcu_enabled
6.73% lock [kernel.kallsyms] [k] munlock_vma_pages_range
...


Could anyone please check what's wrong here with the memory locking code?
Running it on my notebook, I can effectively DoS the system :)

The original report is https://gitlab.com/cryptsetup/cryptsetup/-/issues/617
but this is apparently a kernel issue, just amplified by the use of munlockall().

Thanks,
Milan


2021-01-08 13:44:38

by Michal Hocko

Subject: Re: Very slow unlockall()

On Wed 06-01-21 16:20:15, Milan Broz wrote:
> Hi,
>
> we use mlockall(MCL_CURRENT | MCL_FUTURE) / munlockall() in cryptsetup code
> and someone tried to use it with a hardened memory allocator library.
>
> Execution time increased to extremes (minutes), and as we found, the problem
> is in munlockall().
>
> Here is a plain reproducer for the core issue without any external code - on a
> Fedora rawhide kernel, unlocking takes more than 30 seconds!
> I can reproduce it on 5.10 kernels and Linus' git.
...

Which kernel version do you see this with? Have older releases worked
better?
--
Michal Hocko
SUSE Labs

2021-01-08 14:41:34

by Milan Broz

Subject: Re: Very slow unlockall()

On 08/01/2021 14:41, Michal Hocko wrote:
> On Wed 06-01-21 16:20:15, Milan Broz wrote:
...
>
> Which kernel version do you see this with? Have older releases worked
> better?

Hi,

I tried 5.10 stable and randomly a few kernels I had built on a testing VM (5.3 was the oldest);
the run time seems very similar, so the problem is apparently old... (I can test a specific kernel version if it makes any sense.)

For mainline (reproducer above):

With 5.11.0-0.rc2.20210106git36bbbd0e234d.117.fc34.x86_64 (latest Fedora rawhide kernel build - many debug options are on)

# time ./lock
locked
unlocked

real 0m32.287s
user 0m0.001s
sys 0m32.126s


Today's Linus git - 5.11.0-rc2+ in my testing x86_64 VM (no extensive kernel debug options):

# time ./lock
locked
unlocked

real 0m4.172s
user 0m0.000s
sys 0m4.172s

m.

2021-01-31 21:37:51

by Milan Broz

Subject: Re: Very slow unlockall()

On 08/01/2021 15:39, Milan Broz wrote:
> On 08/01/2021 14:41, Michal Hocko wrote:
>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
...

Hi,

since there has been no response - is this expected behavior of the memory management subsystem, then?

Thanks,
Milan


2021-02-01 13:11:46

by Vlastimil Babka

Subject: Re: Very slow unlockall()

On 1/8/21 3:39 PM, Milan Broz wrote:
> On 08/01/2021 14:41, Michal Hocko wrote:
>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
...
>>> Or with perf, I see
>>>
>>> # Overhead Command Shared Object Symbol
>>> # ........ ....... ................. .....................................
>>> #
>>> 48.18% lock [kernel.kallsyms] [k] lock_is_held_type
>>> 11.67% lock [kernel.kallsyms] [k] ___might_sleep
>>> 10.65% lock [kernel.kallsyms] [k] follow_page_mask
>>> 9.17% lock [kernel.kallsyms] [k] debug_lockdep_rcu_enabled
>>> 6.73% lock [kernel.kallsyms] [k] munlock_vma_pages_range
>>> ...

This seems to be from the debug kernel, as there's lockdep enabled. That's
expected to be very slow.

...
>
> Hi,
>
> I tried 5.10 stable and randomly a few kernels I had built on a testing VM (5.3 was the oldest);
> the run time seems very similar, so the problem is apparently old... (I can test a specific kernel version if it makes any sense.)
>
> For mainline (reproducer above):
>
> With 5.11.0-0.rc2.20210106git36bbbd0e234d.117.fc34.x86_64 (latest Fedora rawhide kernel build - many debug options are on)

From that, the amount of debugging seems to be rather excessive in the Fedora
rawhide kernel. Is that a special debug flavour?

> # time ./lock
> locked
> unlocked
>
> real 0m32.287s
> user 0m0.001s
> sys 0m32.126s
>
>
> Today's Linus git - 5.11.0-rc2+ in my testing x86_64 VM (no extensive kernel debug options):
>
> # time ./lock
> locked
> unlocked
>
> real 0m4.172s
> user 0m0.000s
> sys 0m4.172s

The perf report would be more interesting from this configuration.

> m.
>

2021-02-01 18:02:50

by Milan Broz

Subject: Re: Very slow unlockall()

On 01/02/2021 14:08, Vlastimil Babka wrote:
> On 1/8/21 3:39 PM, Milan Broz wrote:
>> On 08/01/2021 14:41, Michal Hocko wrote:
>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
...
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <fcntl.h>
>>>> #include <sys/mman.h>
>>>>
>>>> int main (int argc, char *argv[])
>>>> {
>>>> void *p = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>>>
>>>> if (p == MAP_FAILED) return 1;
>>>>
>>>> if (mlockall(MCL_CURRENT | MCL_FUTURE)) return 1;
>>>> printf("locked\n");
>>>>
>>>> if (munlockall()) return 1;
>>>> printf("unlocked\n");
>>>>
>>>> return 0;
>>>> }

...

>> Today's Linus git - 5.11.0-rc2+ in my testing x86_64 VM (no extensive kernel debug options):
>>
>> # time ./lock
>> locked
>> unlocked
>>
>> real 0m4.172s
>> user 0m0.000s
>> sys 0m4.172s
>
> The perf report would be more interesting from this configuration.

OK, I cannot run perf on that particular VM, but I tried the latest Fedora stable
kernel without debug options - 5.10.12-200.fc33.x86_64.

This is the report from running the reproducer above:

time:
real 0m6.123s
user 0m0.099s
sys 0m5.310s

perf:

# Total Lost Samples: 0
#
# Samples: 20K of event 'cycles'
# Event count (approx.): 20397603279
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ............................
#
47.26% lock [kernel.kallsyms] [k] follow_page_mask
20.43% lock [kernel.kallsyms] [k] munlock_vma_pages_range
15.92% lock [kernel.kallsyms] [k] follow_page
7.40% lock [kernel.kallsyms] [k] rcu_all_qs
5.87% lock [kernel.kallsyms] [k] _cond_resched
3.08% lock [kernel.kallsyms] [k] follow_huge_addr
0.01% lock [kernel.kallsyms] [k] __update_load_avg_cfs_rq
0.01% lock [kernel.kallsyms] [k] ____fput
0.01% lock [kernel.kallsyms] [k] rmap_walk_file
0.00% lock [kernel.kallsyms] [k] page_mapped
0.00% lock [kernel.kallsyms] [k] native_irq_return_iret
0.00% lock [kernel.kallsyms] [k] _raw_spin_lock_irq
0.00% lock [kernel.kallsyms] [k] perf_iterate_ctx
0.00% lock [kernel.kallsyms] [k] finish_task_switch
0.00% perf [kernel.kallsyms] [k] native_sched_clock
0.00% lock [kernel.kallsyms] [k] native_write_msr
0.00% perf [kernel.kallsyms] [k] native_write_msr


m.

2021-02-01 18:59:53

by Vlastimil Babka

Subject: Re: Very slow unlockall()

On 2/1/21 7:00 PM, Milan Broz wrote:
> On 01/02/2021 14:08, Vlastimil Babka wrote:
>> On 1/8/21 3:39 PM, Milan Broz wrote:
>>> On 08/01/2021 14:41, Michal Hocko wrote:
>>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
...
>>>>> void *p = mmap(NULL, 1UL << 41, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

So, this is a 2TB memory area, but PROT_NONE means it's never actually populated,
although mlockall(MCL_CURRENT) should do that. Once you put PROT_READ |
PROT_WRITE there, the mlockall() starts taking ages.

So does that reflect your use case - munlockall() with large PROT_NONE areas? If
so, munlock_vma_pages_range() is indeed not optimized for that, but I would
expect such a scenario to be uncommon, so better to clarify first.
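
For illustration, the variant I mean (a sketch - only the mmap flags differ
from the reproducer; MAP_NORESERVE is my addition here, to sidestep overcommit
accounting for such a large mapping):

    void *p = mmap(NULL, 1UL << 41, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);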


2021-02-01 19:23:23

by Milan Broz

Subject: Re: Very slow unlockall()

On 01/02/2021 19:55, Vlastimil Babka wrote:
> On 2/1/21 7:00 PM, Milan Broz wrote:
>> On 01/02/2021 14:08, Vlastimil Babka wrote:
>>> On 1/8/21 3:39 PM, Milan Broz wrote:
>>>> On 08/01/2021 14:41, Michal Hocko wrote:
>>>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
...
>
> So, this is a 2TB memory area, but PROT_NONE means it's never actually populated,
> although mlockall(MCL_CURRENT) should do that. Once you put PROT_READ |
> PROT_WRITE there, the mlockall() starts taking ages.
>
> So does that reflect your use case - munlockall() with large PROT_NONE areas? If
> so, munlock_vma_pages_range() is indeed not optimized for that, but I would
> expect such a scenario to be uncommon, so better to clarify first.

It is just a simple reproducer of the underlying problem, as suggested here:
https://gitlab.com/cryptsetup/cryptsetup/-/issues/617#note_478342301

We use mlockall() in cryptsetup, and with hardened malloc the unlock slows down significantly.
(For the real-world problem, please read the whole issue report above.)

m.

2021-02-10 15:21:13

by Vlastimil Babka

Subject: Re: Very slow unlockall()

On 2/1/21 8:19 PM, Milan Broz wrote:
> On 01/02/2021 19:55, Vlastimil Babka wrote:
>> On 2/1/21 7:00 PM, Milan Broz wrote:
>>> On 01/02/2021 14:08, Vlastimil Babka wrote:
>>>> On 1/8/21 3:39 PM, Milan Broz wrote:
>>>>> On 08/01/2021 14:41, Michal Hocko wrote:
>>>>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
...
>>
>> So, this is a 2TB memory area, but PROT_NONE means it's never actually populated,
>> although mlockall(MCL_CURRENT) should do that. Once you put PROT_READ |
>> PROT_WRITE there, the mlockall() starts taking ages.
>>
>> So does that reflect your use case - munlockall() with large PROT_NONE areas? If
>> so, munlock_vma_pages_range() is indeed not optimized for that, but I would
>> expect such a scenario to be uncommon, so better to clarify first.
>
> It is just a simple reproducer of the underlying problem, as suggested here:
> https://gitlab.com/cryptsetup/cryptsetup/-/issues/617#note_478342301
>
> We use mlockall() in cryptsetup, and with hardened malloc the unlock slows down significantly.
> (For the real-world problem, please read the whole issue report above.)

OK, finally read through the bug report, and learned two things:

1) the PROT_NONE is indeed intentional part of the reproducer
2) Linux mailing lists still have a bad reputation and people avoid them. That's
sad :( Well, thanks for overcoming that :)

Daniel there says "I think the Linux kernel implementation of mlockall is quite
broken and tries to lock all the reserved PROT_NONE regions in advance which
doesn't make any sense."

From my testing this doesn't seem to be the case, as the mlockall() part is very
fast, so I don't think it faults in and mlocks PROT_NONE areas. It only starts
to be slow when they are changed to PROT_READ|PROT_WRITE. But the munlockall()
part is slow even with PROT_NONE, as we don't skip the PROT_NONE areas there. We
probably can't just skip them, as they might actually contain mlocked pages if
those were faulted first with PROT_READ/PROT_WRITE and only then changed to PROT_NONE.
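
A sketch of the sequence I mean (illustrative userspace code, not taken from
the report):

    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 1UL << 21;       /* arbitrary size */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) return 1;
            mlock(p, len);                /* faults pages in; they become mlocked */
            mprotect(p, len, PROT_NONE);  /* VMA is now PROT_NONE, pages stay mlocked */
            munlockall();                 /* must still visit this VMA to munlock them */
            return 0;
    }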

And the munlock path (munlock_vma_pages_range()) is slow because it uses
follow_page_mask() in a loop, incrementing addresses by PAGE_SIZE, so it is
always traversing all levels of page tables from scratch. Funnily enough,
speeding this up was my first linux-mm series years ago. But the speedup only
works if PTEs are present, which is not the case for unpopulated PROT_NONE
areas. That use case was unexpected back then. We should probably convert this
code to a proper page table walk. If there are large areas with unpopulated pmd
entries (or even higher levels), we would traverse them very quickly.
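
Roughly the direction I mean (an untested sketch against <linux/pagewalk.h>;
locking and THP handling are omitted, and the function/ops names are made up
for illustration):

    static int munlock_one_pte(pte_t *pte, unsigned long addr,
                               unsigned long next, struct mm_walk *walk)
    {
            struct page *page;

            if (!pte_present(*pte))
                    return 0;
            page = vm_normal_page(walk->vma, addr, *pte);
            if (page) {
                    /* the real thing needs lock_page() etc., as in
                     * munlock_vma_pages_range() */
                    munlock_vma_page(page);
            }
            return 0;
    }

    static const struct mm_walk_ops munlock_walk_ops = {
            .pte_entry = munlock_one_pte,
            /* no pte_hole callback: unpopulated pmd/pud ranges are skipped
             * without visiting each PAGE_SIZE step */
    };

    /* ... walk_page_range(vma->vm_mm, start, end, &munlock_walk_ops, NULL); */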

2021-02-10 16:59:49

by Michal Hocko

Subject: Re: Very slow unlockall()

On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
> On 2/1/21 8:19 PM, Milan Broz wrote:
> > On 01/02/2021 19:55, Vlastimil Babka wrote:
> >> On 2/1/21 7:00 PM, Milan Broz wrote:
> >>> On 01/02/2021 14:08, Vlastimil Babka wrote:
> >>>> On 1/8/21 3:39 PM, Milan Broz wrote:
> >>>>> On 08/01/2021 14:41, Michal Hocko wrote:
> >>>>>> On Wed 06-01-21 16:20:15, Milan Broz wrote:
...
>
> OK, finally read through the bug report, and learned two things:
>
> 1) the PROT_NONE is indeed intentional part of the reproducer
> 2) Linux mailing lists still have a bad reputation and people avoid them. That's
> sad :( Well, thanks for overcoming that :)
>
> Daniel there says "I think the Linux kernel implementation of mlockall is quite
> broken and tries to lock all the reserved PROT_NONE regions in advance which
> doesn't make any sense."
>
> From my testing this doesn't seem to be the case, as the mlockall() part is very
> fast, so I don't think it faults in and mlocks PROT_NONE areas. It only starts
> to be slow when changed to PROT_READ|PROT_WRITE. But the munlockall() part is
> slow even with PROT_NONE as we don't skip the PROT_NONE areas there. We probably
> can't just skip them, as they might actually contain mlocked pages if those were
> faulted first with PROT_READ/PROT_WRITE and only then changed to PROT_NONE.

The mlock code is quite easy to misunderstand, but IIRC the mlock part
should be rather straightforward. It will mark VMAs as locked, do some
merging/splitting where appropriate, and finally populate the range by
gup. The population should fail because the VMA allows neither read nor
write, right? And mlock should report that. mlockall will not bother,
because it ignores errors on population. So there is no page table walk
happening.
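
A quick way to see that asymmetry from userspace (a sketch; I would expect
mlock() to fail with ENOMEM here after the population failure, while
mlockall() reports success):

    #include <stdio.h>
    #include <errno.h>
    #include <sys/mman.h>

    int main(void)
    {
            void *p = mmap(NULL, 1UL << 20, PROT_NONE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED) return 1;
            /* population by gup fails on the unreadable VMA and is reported: */
            printf("mlock: %d errno %d\n", mlock(p, 1UL << 20), errno);
            /* mlockall ignores errors during population: */
            printf("mlockall: %d\n", mlockall(MCL_CURRENT));
            return 0;
    }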

> And the munlock path (munlock_vma_pages_range()) is slow because it uses
> follow_page_mask() in a loop, incrementing addresses by PAGE_SIZE, so it is
> always traversing all levels of page tables from scratch. Funnily enough,
> speeding this up was my first linux-mm series years ago. But the speedup only
> works if PTEs are present, which is not the case for unpopulated PROT_NONE
> areas. That use case was unexpected back then. We should probably convert this
> code to a proper page table walk. If there are large areas with unpopulated pmd
> entries (or even higher levels), we would traverse them very quickly.

Yes, this is a good idea. I suspect it will be a little bit tricky without
duplicating a large part of the gup page table walker.
--
Michal Hocko
SUSE Labs

2021-02-10 17:45:08

by Michal Hocko

Subject: Re: Very slow unlockall()

On Wed 10-02-21 17:57:29, Michal Hocko wrote:
> On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
[...]
> > And the munlock path (munlock_vma_pages_range()) is slow because it uses
> > follow_page_mask() in a loop, incrementing addresses by PAGE_SIZE, so it is
> > always traversing all levels of page tables from scratch. Funnily enough,
> > speeding this up was my first linux-mm series years ago. But the speedup only
> > works if PTEs are present, which is not the case for unpopulated PROT_NONE
> > areas. That use case was unexpected back then. We should probably convert this
> > code to a proper page table walk. If there are large areas with unpopulated pmd
> > entries (or even higher levels), we would traverse them very quickly.
>
> Yes, this is a good idea. I suspect it will be a little bit tricky without
> duplicating a large part of the gup page table walker.

Thinking about it some more, unmap_page_range would be a better model
for this operation.
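
For reference, the loop shape I have in mind (paraphrased from the pmd-level
walkers in mm/memory.c - a sketch with details elided; pmd/addr/end come from
the enclosing walk): each level can skip a whole aligned block when the entry
is empty.

    do {
            next = pmd_addr_end(addr, end);
            if (pmd_none_or_clear_bad(pmd))
                    continue;       /* skips an unpopulated 2MB range at once */
            /* ... descend to the pte level only when something is mapped ... */
    } while (pmd++, addr = next, addr != end);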
--
Michal Hocko
SUSE Labs

2021-02-11 05:25:09

by Hugh Dickins

Subject: Re: Very slow unlockall()

On Wed, 10 Feb 2021, Michal Hocko wrote:
> On Wed 10-02-21 17:57:29, Michal Hocko wrote:
> > On Wed 10-02-21 16:18:50, Vlastimil Babka wrote:
> [...]
> > > And the munlock path (munlock_vma_pages_range()) is slow because it uses
> > > follow_page_mask() in a loop, incrementing addresses by PAGE_SIZE, so it is
> > > always traversing all levels of page tables from scratch. Funnily enough,
> > > speeding this up was my first linux-mm series years ago. But the speedup only
> > > works if PTEs are present, which is not the case for unpopulated PROT_NONE
> > > areas. That use case was unexpected back then. We should probably convert this
> > > code to a proper page table walk. If there are large areas with unpopulated pmd
> > > entries (or even higher levels), we would traverse them very quickly.
> >
> > Yes, this is a good idea. I suspect it will be a little bit tricky without
> > duplicating a large part of the gup page table walker.
>
> Thinking about it some more, unmap_page_range would be a better model
> for this operation.

Could do, I suppose; but I thought it was just a matter of going back to
using follow_page_mask() in munlock_vma_pages_range() (whose fear of THP
split looks overwrought, since an extra reference now prevents splitting);
and enhancing follow_page_mask() to let the no_page_table() FOLL_DUMP
case set ctx->page_mask appropriately (or perhaps it can be preset
at a higher level, without having to pass ctx so far down, dunno).

Nice little job, but I couldn't quite spare the time to do it: needs a
bit more care than I could afford (I suspect the page_increm business at
the end of munlock_vma_pages_range() is good enough while THP tails are
skipped one by one, but will need to be fixed to apply page_mask correctly
to the start - __get_user_pages()'s page_increm-entation looks superior).
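
(For comparison - quoting from memory, so treat it as a sketch: __get_user_pages()
advances with

        page_increm = 1 + (~(start >> PAGE_SHIFT) & ctx.page_mask);

which steps to the end of the current THP even when start is not head-aligned,
whereas the munlock loop's plain "1 + page_mask" increment is only correct when
start points at a head page.)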

Hugh