Hi all,
We observed our program sometimes running over 1000x slower than normal and
cannot find the root cause. I would be grateful if someone can give me a hint. I
also searched on this mailing list but failed to find anything related.
The affected process spend almost 100% CPU time in kernel. First I ran `perf' to
identify where times are spent. It turns out 99.93% of the time is spent on
do_page_fault, and 98.16% on migrate_pages. And `perf record -e faults -t <tid>`
show 3 seconds per fault, which seems to much to me.
Then I do `perf record -e compaction:*`, and see 12k
compaction:mm_compaction_migratepages events per second. almost all of them has
nr_migrated=0 and nr_failed greater than 0. So it seems a lot of migration
failure is happening. But I cannot understand what causes it. The more
complete perf output is attached below.
This system is used to run machine-learning training programs, with 2 Intel Xeon
CPU, 512GB RAM and 8 NVIDIA GPU attached, running Ubuntu 20.04, with kernel
version 5.4.0-72-generic. Load average is 13, and 367GB of RAM is available when
we observe the problem this time. We have observed the very same problem on
multiple nodes, running different programs, with different models of CPU, GPU.
Currently, only reboot the system can resolve this. When this happens, all
process on the system is affected, and it does not resolve if I restart the
affected process.
Thanks,
Hu Weiwen
SCUT student
`perf report` output, collected with `sudo perf record -F 99 --pid 2413158 -g -- sleep 10`:
Samples: 993 of event 'cycles', Event count (approx.): 36810932529
Children Self Command Shared Object Symbol
- 99.93% 0.00% python libc-2.31.so [.] 0x00007ff5fcc78984 ▒
0x7ff5fcc78984 ▒
page_fault ▒
do_page_fault ▒
__do_page_fault ▒
do_user_addr_fault ▒
handle_mm_fault ▒
__handle_mm_fault ▒
do_huge_pmd_anonymous_page ◆
alloc_pages_vma ▒
__alloc_pages_nodemask ▒
__alloc_pages_slowpath ▒
__alloc_pages_direct_compact ▒
try_to_compact_pages ▒
compact_zone_order ▒
- compact_zone ▒
- 98.16% migrate_pages ▒
- 93.30% try_to_unmap ▒
- rmap_walk ▒
- 92.12% rmap_walk_file ▒
- 90.45% try_to_unmap_one ▒
- 83.48% ptep_clear_flush ▒
- 76.33% flush_tlb_mm_range ▒
- 74.35% native_flush_tlb_others ▒
- 73.75% on_each_cpu_cond_mask ▒
- 70.85% on_each_cpu_mask ▒
- 69.57% smp_call_function_many ▒
- 67.20% smp_call_function_single ▒
4.93% llist_add_batch ▒
+ 1.09% generic_exec_single ▒
1.19% smp_call_function_single ▒
+ 2.01% zalloc_cpumask_var ▒
1.45% page_vma_mapped_walk ▒
0.78% down_read ▒
0.79% up_read ▒
+ 3.29% remove_migration_ptes ▒
0.59% move_to_new_page
`perf report` output, collected with `sudo perf record -e compaction:* -a -- sleep 10`:
Samples: 133K of event 'compaction:mm_compaction_migratepages', Event count (approx.): 133394
Overhead Trace output
26.25% nr_migrated=0 nr_failed=32
5.00% nr_migrated=0 nr_failed=2
4.64% nr_migrated=0 nr_failed=4
4.24% nr_migrated=0 nr_failed=3
4.20% nr_migrated=0 nr_failed=6
3.83% nr_migrated=0 nr_failed=1
3.79% nr_migrated=0 nr_failed=5
3.62% nr_migrated=0 nr_failed=7
3.45% nr_migrated=0 nr_failed=8
3.06% nr_migrated=0 nr_failed=9
2.68% nr_migrated=0 nr_failed=10
2.68% nr_migrated=0 nr_failed=11
2.42% nr_migrated=0 nr_failed=12
2.35% nr_migrated=0 nr_failed=13
2.11% nr_migrated=0 nr_failed=14
1.93% nr_migrated=0 nr_failed=16
1.90% nr_migrated=0 nr_failed=15
1.64% nr_migrated=0 nr_failed=17
1.50% nr_migrated=0 nr_failed=18
1.25% nr_migrated=0 nr_failed=19
1.25% nr_migrated=0 nr_failed=20
1.22% nr_migrated=0 nr_failed=21
...
0.00% nr_migrated=0 nr_failed=207 ▒
0.00% nr_migrated=0 nr_failed=258 ▒
0.00% nr_migrated=0 nr_failed=273 ▒
0.00% nr_migrated=0 nr_failed=277 ▒
0.00% nr_migrated=10 nr_failed=22 ▒
0.00% nr_migrated=13 nr_failed=19 ▒
0.00% nr_migrated=2 nr_failed=2 ▒
0.00% nr_migrated=3 nr_failed=13 ▒
0.00% nr_migrated=3 nr_failed=17 ▒
0.00% nr_migrated=3 nr_failed=25 ▒
0.00% nr_migrated=3 nr_failed=27 ▒
0.00% nr_migrated=3 nr_failed=28 ▒
0.00% nr_migrated=4 nr_failed=10 ▒
0.00% nr_migrated=4 nr_failed=15 ▒
0.00% nr_migrated=4 nr_failed=24 ▒
0.00% nr_migrated=5 nr_failed=10 ▒
0.00% nr_migrated=5 nr_failed=15 ▒
0.00% nr_migrated=6 nr_failed=20 ▒
0.00% nr_migrated=6 nr_failed=21 ▒
0.00% nr_migrated=6 nr_failed=23 ▒
0.00% nr_migrated=8 nr_failed=19
On Sun, May 16, 2021 at 10:50:35AM +0800, Hillf Danton wrote:
> On Sun, 16 May 2021 01:01:04 +0800 Hu Weiwen wrote:
> > Hi all,
> >
> > We observed our program sometimes running over 1000x slower than normal and
> > cannot find the root cause. I would be grateful if someone can give me a hint. I
> > also searched on this mailing list but failed to find anything related.
> >
> > The affected process spend almost 100% CPU time in kernel. First I ran `perf' to
> > identify where times are spent. It turns out 99.93% of the time is spent on
> > do_page_fault, and 98.16% on migrate_pages. And `perf record -e faults -t <tid>`
> > show 3 seconds per fault, which seems to much to me.
> >
> > Then I do `perf record -e compaction:*`, and see 12k
> > compaction:mm_compaction_migratepages events per second. almost all of them has
> > nr_migrated=0 and nr_failed greater than 0. So it seems a lot of migration
> > failure is happening. But I cannot understand what causes it. The more
> > complete perf output is attached below.
> >
> > This system is used to run machine-learning training programs,
> > with 2 Intel Xeon CPU, 512GB RAM and 8 NVIDIA GPU attached,
> > running Ubuntu 20.04, with kernel version 5.4.0-72-generic.
>
> Given the do_huge_pmd_anonymous_page in your perf report below, what you
> need is likely commit facdaa917c4d ("mm: proactive compaction").
>
> You need to port it back to the linux-5.4 powering your box if there is
> anying that prevents you from using the mainline kernel. Feel free to
> report again if it does not work for you.
Thank you for your reply. Will try it. But it is hard to say whether it is
working. We can only reproduce this about once per month on 10+ similar boxes.
But I think things still not adding up.
The last time I see this, I left it running in this state for 6 hours. But it
still does not get better. If it is just fragmented memory that caused too much
compaction on huge page allocation, I'd expect it to solve itself in minutes.
Commit message of facdaa917c4d reported 700GB memory being allocated in 27
minutes without that patch, but we hardly have any progress in 6 hours.
Also, Commit message of facdaa917c4d reported at most 33ms latency on allocation
without that patch, but we see 3 seconds per page fault.
Over 99% of pages passed to `migrate_pages' are failed to migrate. This is
concerning. It might relate to our issue. I plan to investigate more on this
when I see the issue next time. Any tips on how to find out why migrate failed
are appreciated.
> > Load average is 13, and 367GB of RAM is available when
> > we observe the problem this time. We have observed the very same problem on
> > multiple nodes, running different programs, with different models of CPU, GPU.
> >
> > Currently, only reboot the system can resolve this. When this happens, all
> > process on the system is affected, and it does not resolve if I restart the
> > affected process.
> >
> > Thanks,
> > Hu Weiwen
> > SCUT student
> >
> >
> > `perf report` output, collected with `sudo perf record -F 99 --pid 2413158 -g -- sleep 10`:
> >
> > Samples: 993 of event 'cycles', Event count (approx.): 36810932529
> > Children Self Command Shared Object Symbol
> > - 99.93% 0.00% python libc-2.31.so [.] 0x00007ff5fcc78984
> > 0x7ff5fcc78984
> > page_fault
> > do_page_fault
> > __do_page_fault
> > do_user_addr_fault
> > handle_mm_fault
> > __handle_mm_fault
> > do_huge_pmd_anonymous_page
> > alloc_pages_vma
> > __alloc_pages_nodemask
> > __alloc_pages_slowpath
> > __alloc_pages_direct_compact
> > try_to_compact_pages
> > compact_zone_order
> > - compact_zone
> > - 98.16% migrate_pages
> > - 93.30% try_to_unmap
> > - rmap_walk
> > - 92.12% rmap_walk_file
> > - 90.45% try_to_unmap_one
> > - 83.48% ptep_clear_flush
> > - 76.33% flush_tlb_mm_range
> > - 74.35% native_flush_tlb_others
> > - 73.75% on_each_cpu_cond_mask
> > - 70.85% on_each_cpu_mask
> > - 69.57% smp_call_function_many
> > - 67.20% smp_call_function_single
> > 4.93% llist_add_batch
> > + 1.09% generic_exec_single
> > 1.19% smp_call_function_single
> > + 2.01% zalloc_cpumask_var
> > 1.45% page_vma_mapped_walk
> > 0.78% down_read
> > 0.79% up_read
> > + 3.29% remove_migration_ptes
> > 0.59% move_to_new_page
> >
> >
> > `perf report` output, collected with `sudo perf record -e compaction:* -a -- sleep 10`:
> >
> > Samples: 133K of event 'compaction:mm_compaction_migratepages', Event count (approx.): 133394
> > Overhead Trace output
> > 26.25% nr_migrated=0 nr_failed=32
> > 5.00% nr_migrated=0 nr_failed=2
> > 4.64% nr_migrated=0 nr_failed=4
> > 4.24% nr_migrated=0 nr_failed=3
> > 4.20% nr_migrated=0 nr_failed=6
> > 3.83% nr_migrated=0 nr_failed=1
> > 3.79% nr_migrated=0 nr_failed=5
> > 3.62% nr_migrated=0 nr_failed=7
> > 3.45% nr_migrated=0 nr_failed=8
> > 3.06% nr_migrated=0 nr_failed=9
> > 2.68% nr_migrated=0 nr_failed=10
> > 2.68% nr_migrated=0 nr_failed=11
> > 2.42% nr_migrated=0 nr_failed=12
> > 2.35% nr_migrated=0 nr_failed=13
> > 2.11% nr_migrated=0 nr_failed=14
> > 1.93% nr_migrated=0 nr_failed=16
> > 1.90% nr_migrated=0 nr_failed=15
> > 1.64% nr_migrated=0 nr_failed=17
> > 1.50% nr_migrated=0 nr_failed=18
> > 1.25% nr_migrated=0 nr_failed=19
> > 1.25% nr_migrated=0 nr_failed=20
> > 1.22% nr_migrated=0 nr_failed=21
> > ...
> > 0.00% nr_migrated=0 nr_failed=207
> > 0.00% nr_migrated=0 nr_failed=258
> > 0.00% nr_migrated=0 nr_failed=273
> > 0.00% nr_migrated=0 nr_failed=277
> > 0.00% nr_migrated=10 nr_failed=22
> > 0.00% nr_migrated=13 nr_failed=19
> > 0.00% nr_migrated=2 nr_failed=2
> > 0.00% nr_migrated=3 nr_failed=13
> > 0.00% nr_migrated=3 nr_failed=17
> > 0.00% nr_migrated=3 nr_failed=25
> > 0.00% nr_migrated=3 nr_failed=27
> > 0.00% nr_migrated=3 nr_failed=28
> > 0.00% nr_migrated=4 nr_failed=10
> > 0.00% nr_migrated=4 nr_failed=15
> > 0.00% nr_migrated=4 nr_failed=24
> > 0.00% nr_migrated=5 nr_failed=10
> > 0.00% nr_migrated=5 nr_failed=15
> > 0.00% nr_migrated=6 nr_failed=20
> > 0.00% nr_migrated=6 nr_failed=21
> > 0.00% nr_migrated=6 nr_failed=23
> > 0.00% nr_migrated=8 nr_failed=19