2019-07-08 14:43:36

by Max Kellermann

Subject: Kernel 5.1.15 stuck in compaction

Hi,

one of our web servers got repeatedly stuck in the memory compaction
code; two PHP processes have been busy at 100% inside memory
compaction after a page fault:

100.00% 0.00% php-cgi7.0 [kernel.vmlinux] [k] page_fault
|
---page_fault
__do_page_fault
handle_mm_fault
__handle_mm_fault
do_huge_pmd_anonymous_page
__alloc_pages_nodemask
__alloc_pages_slowpath
__alloc_pages_direct_compact
try_to_compact_pages
compact_zone_order
compact_zone
|
|--61.30%--isolate_migratepages_block
| |
| |--20.44%--node_page_state
| |
| |--5.88%--compact_unlock_should_abort.isra.33
| |
| --3.28%--_cond_resched
| |
| --2.19%--rcu_all_qs
|
--3.37%--pageblock_skip_persistent

ftrace:

<...>-962300 [033] .... 236536.493919: node_page_state <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493919: node_page_state <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493919: node_page_state <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493919: _cond_resched <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493919: rcu_all_qs <-_cond_resched
<...>-962300 [033] .... 236536.493919: compact_unlock_should_abort.isra.33 <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493919: pageblock_skip_persistent <-compact_zone
<...>-962300 [033] .... 236536.493919: isolate_migratepages_block <-compact_zone
<...>-962300 [033] .... 236536.493919: node_page_state <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493919: node_page_state <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493919: node_page_state <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493919: node_page_state <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493920: node_page_state <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493920: node_page_state <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493920: _cond_resched <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493920: rcu_all_qs <-_cond_resched
<...>-962300 [033] .... 236536.493920: compact_unlock_should_abort.isra.33 <-isolate_migratepages_block
<...>-962300 [033] .... 236536.493920: pageblock_skip_persistent <-compact_zone
<...>-962300 [033] .... 236536.493920: isolate_migratepages_block <-compact_zone
<...>-962300 [033] .... 236536.493920: node_page_state <-isolate_migratepages_block
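The repeating pattern in the ftrace output above can be summarized by counting caller/callee pairs. The following is only a minimal sketch, assuming the function-tracer line format shown above; `tally_calls` and the regular expression are hypothetical helpers written for illustration, not part of any kernel tooling.

```python
import re
from collections import Counter

# Matches function-tracer lines such as:
#   <...>-962300 [033] .... 236536.493919: node_page_state <-isolate_migratepages_block
# The format is assumed from the output quoted in this mail.
LINE_RE = re.compile(r":\s+(?P<callee>\S+)\s+<-(?P<caller>\S+)\s*$")


def tally_calls(lines):
    """Count (caller, callee) pairs found in function-tracer output lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # lines that do not look like trace entries are ignored
            counts[(match.group("caller"), match.group("callee"))] += 1
    return counts


# A few lines copied from the trace above.
sample = [
    "<...>-962300 [033] .... 236536.493919: node_page_state <-isolate_migratepages_block",
    "<...>-962300 [033] .... 236536.493919: node_page_state <-isolate_migratepages_block",
    "<...>-962300 [033] .... 236536.493919: node_page_state <-isolate_migratepages_block",
    "<...>-962300 [033] .... 236536.493919: _cond_resched <-isolate_migratepages_block",
    "<...>-962300 [033] .... 236536.493919: rcu_all_qs <-_cond_resched",
    "<...>-962300 [033] .... 236536.493919: compact_unlock_should_abort.isra.33 <-isolate_migratepages_block",
    "<...>-962300 [033] .... 236536.493919: pageblock_skip_persistent <-compact_zone",
    "<...>-962300 [033] .... 236536.493919: isolate_migratepages_block <-compact_zone",
]

counts = tally_calls(sample)
for (caller, callee), n in counts.most_common():
    print(f"{n:3d}  {caller} -> {callee}")
```

Run over a longer capture, a tally like this makes it easier to see that compact_zone keeps re-entering isolate_migratepages_block without making progress.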

Nothing useful in /proc/PID/{stack,wchan,syscall}.

slabinfo/kmalloc-{16,32} are going through the roof (~15 GB each);
this memleak-lookalike, which keeps triggering the OOM killer, is
what drew our attention to this server.

Right now, the server is still stuck, and I can attempt to collect
more information on request.

Max


2019-07-08 15:18:37

by Max Kellermann

Subject: Re: Kernel 5.1.15 stuck in compaction

On 2019/07/08 12:35, Max Kellermann wrote:
> one of our web servers got repeatedly stuck in the memory compaction
> code; two PHP processes have been busy at 100% inside memory
> compaction after a page fault:

This trace may be helpful as well; the first PHP process:

275.846 compaction:mm_compaction_isolate_migratepages:range=(0x8a48e0 ~ 0x8a48e0) nr_scanned=0 nr_taken=0
LOST 8 events!
275.894 compaction:mm_compaction_isolate_migratepages:range=(0x8a48e0 ~ 0x8a48e0) nr_scanned=0 nr_taken=0
LOST 8 events!
275.942 compaction:mm_compaction_isolate_migratepages:range=(0x8a48e0 ~ 0x8a48e0) nr_scanned=0 nr_taken=0
LOST 8 events!
275.989 compaction:mm_compaction_isolate_migratepages:range=(0x8a48e0 ~ 0x8a48e0) nr_scanned=0 nr_taken=0

This is the other PHP process:

188.501 compaction:mm_compaction_isolate_migratepages:range=(0x169f40 ~ 0x169f40) nr_scanned=0 nr_taken=0
LOST 16 events!
188.600 compaction:mm_compaction_isolate_migratepages:range=(0x169f40 ~ 0x169f40) nr_scanned=0 nr_taken=0
LOST 5 events!
188.643 compaction:mm_compaction_isolate_migratepages:range=(0x169f40 ~ 0x169f40) nr_scanned=0 nr_taken=0
LOST 17 events!
188.742 compaction:mm_compaction_isolate_migratepages:range=(0x169f40 ~ 0x169f40) nr_scanned=0 nr_taken=0

No pages are being scanned at all; the start and end of the range are
the same.
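
The observation above (empty range, nothing scanned, same address every time) can be checked mechanically over a longer capture. This is a rough sketch, assuming the event line format quoted in this mail; `parse_events` and `looks_stuck` are hypothetical helper names, and the "stuck" criterion is just the heuristic described here, not a kernel-provided diagnostic.

```python
import re

# Matches lines such as:
#   275.846 compaction:mm_compaction_isolate_migratepages:range=(0x8a48e0 ~ 0x8a48e0) nr_scanned=0 nr_taken=0
# The format is assumed from the output quoted in this mail.
EVENT_RE = re.compile(
    r"mm_compaction_isolate_migratepages:"
    r"\s*range=\((?P<start>0x[0-9a-fA-F]+) ~ (?P<end>0x[0-9a-fA-F]+)\)"
    r"\s+nr_scanned=(?P<scanned>\d+)\s+nr_taken=(?P<taken>\d+)"
)


def parse_events(lines):
    """Return (start_pfn, end_pfn, nr_scanned, nr_taken) for each event line."""
    events = []
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue  # skips e.g. "LOST 8 events!"
        events.append((
            int(match["start"], 16),
            int(match["end"], 16),
            int(match["scanned"]),
            int(match["taken"]),
        ))
    return events


def looks_stuck(events):
    """Heuristic: every event covers the same empty range and scans nothing."""
    if not events:
        return False
    first_start = events[0][0]
    return all(
        start == end == first_start and scanned == 0
        for start, end, scanned, _taken in events
    )


# Lines copied from the first PHP process above.
sample = [
    "275.846 compaction:mm_compaction_isolate_migratepages:range=(0x8a48e0 ~ 0x8a48e0) nr_scanned=0 nr_taken=0",
    "LOST 8 events!",
    "275.894 compaction:mm_compaction_isolate_migratepages:range=(0x8a48e0 ~ 0x8a48e0) nr_scanned=0 nr_taken=0",
]

events = parse_events(sample)
print(len(events), "events, stuck:", looks_stuck(events))
```

The "LOST n events!" lines are skipped rather than counted, so the check only says something about the events that were actually recorded.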

However, my perf report contains calls to
compact_unlock_should_abort(), which means that the loop in
isolate_migratepages_block() is not being skipped completely;
it is entered, but exits too early.