2012-06-27 22:16:49

by Jim Schutt

Subject: excessive CPU utilization by isolate_freepages?

Hi,

I'm running into trouble with systems going unresponsive,
and perf suggests it's excessive CPU usage by isolate_freepages().
I'm currently testing 3.5-rc4, but I think this problem may have
first shown up in 3.4. I'm only just learning how to use perf,
so I only currently have results to report for 3.5-rc4.

(FWIW I'm running my distro's version of perf; please let me know
if I need to compile the tools/perf version to match my kernel.)
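
For reference, this is roughly the sequence I used to capture and view
the profile (the 30-second sample window here is just an example):

perf record -a -g -- sleep 30    # sample all CPUs, with call chains
perf report --sort symbol --call-graph fractal,5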

The systems in question have 24 SAS drives spread across 3 HBAs,
running 24 Ceph OSD instances, one per drive. FWIW these servers
are dual-socket Intel 5675 Xeons w/48 GB memory. I've got ~160
Ceph Linux clients doing dd simultaneously to a Ceph file system
backed by 12 of these servers.

In the early phase of such a test, when things are running well,
here's what vmstat reports for the state of one of these servers:

2012-06-27 13:56:58.356-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
31 15 0 287216 576 38606628 0 0 2 1158 2 14 1 3 95 0 0
27 15 0 225288 576 38583384 0 0 18 2222016 203357 134876 11 56 17 15 0
28 17 0 219256 576 38544736 0 0 11 2305932 203141 146296 11 49 23 17 0
6 18 0 215596 576 38552872 0 0 7 2363207 215264 166502 12 45 22 20 0
22 18 0 226984 576 38596404 0 0 3 2445741 223114 179527 12 43 23 22 0
30 12 0 230844 576 38461648 0 0 14 2298537 216580 166661 12 45 23 20 0
29 19 0 237856 576 38422884 0 0 5 2332741 209408 157138 12 42 25 22 0
17 11 0 222156 576 38483992 0 0 4 2380495 210312 173121 12 39 28 22 0
11 13 0 216152 576 38463872 0 0 44 2362186 215236 176454 12 39 27 22 0
12 14 0 223704 576 38546720 0 0 2 2395530 214684 177518 12 39 28 22 0
39 17 0 219932 576 38598184 0 0 4 2428231 223284 179095 12 42 24 21 0
11 10 0 219956 576 38521048 0 0 4 2323853 207472 166665 12 39 28 21 0
23 15 0 216580 576 38451904 0 0 3 2241800 201049 163496 11 37 31 21 0
9 13 0 225792 576 38451916 0 0 13 2281900 204869 171814 11 36 30 23 0
14 12 0 233820 576 38492728 0 0 4 2293423 207686 173019 11 37 31 21 0
11 20 0 213796 576 38533208 0 0 3 2288636 205605 168118 11 37 31 21 0


The system begins to struggle over the next several
minutes; here's what vmstat has to say:

2012-06-27 13:57:58.831-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
21 16 0 224628 576 38526872 0 0 2 1233 9 19 1 4 95 0 0
12 19 0 232060 576 38501020 0 0 4 2366769 221418 159890 12 48 20 19 0
124 8 0 218548 576 38379656 0 0 13 2103075 199660 108618 11 65 13 12 0
24 10 0 300476 576 38230288 0 0 31 1966282 177472 84572 10 76 7 7 0
20 16 0 217584 576 38296700 0 0 9 2062571 195936 128810 10 55 20 15 0
53 12 0 235720 576 38247968 0 0 30 2035407 196973 133921 10 52 23 15 0
20 16 0 360340 576 38067992 0 0 6 2192179 208692 136784 11 54 19 15 0
26 10 0 310800 576 38093884 0 0 43 2138772 207105 118718 11 64 12 13 0
24 15 0 261108 576 38030828 0 0 68 2174015 205793 135302 11 56 18 15 0
9 17 0 241816 576 37982072 0 0 20 2076145 194971 120285 10 60 16 14 0
37 15 0 255972 576 37892868 0 0 14 2225076 205694 126954 11 59 16 13 0
27 16 0 243212 576 37872704 0 0 6 2249476 210885 134684 12 60 15 14 0
30 10 0 217572 576 37795388 0 0 3 2128688 205027 118319 11 66 12 11 0
26 11 0 236420 576 37740164 0 0 23 2109709 205105 133925 10 56 19 14 0
45 15 0 330056 576 37619896 0 0 15 1948311 196188 119330 10 62 15 13 0
54 15 0 242696 576 37631500 0 0 4 2159530 202595 132588 11 59 16 15 0

2012-06-27 13:58:59.569-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
14 15 0 274932 576 37621548 0 0 2 1301 15 0 1 4 95 0 0
44 16 0 278748 576 37509516 0 0 6 2097643 196722 112697 10 66 12 11 0
88 14 0 228088 576 37412008 0 0 6 2089559 202206 116146 10 66 13 11 0
94 11 0 348348 576 37270624 0 0 7 1906390 181488 87333 9 76 9 6 0
42 11 0 215996 576 37288556 0 0 6 1782459 184375 95901 9 74 9 7 0
45 11 0 323112 576 37146500 0 0 12 1868376 187814 103947 9 71 11 9 0
51 7 0 244560 576 37100124 0 0 35 1767496 181383 107259 9 68 12 11 0
74 12 0 221584 576 37016420 0 0 18 1884986 183376 93425 9 75 9 7 0
45 10 0 275564 576 36985324 0 0 23 1683688 167223 97036 8 75 9 8 0
19 10 0 322176 576 36813176 0 0 14 1747378 177594 97218 8 72 12 8 0
122 7 0 225256 576 36838084 0 0 26 1730643 177915 92621 8 75 9 8 0
243 10 0 223464 576 36765460 0 0 18 1730158 173059 79373 8 80 6 5 0
100 10 0 307528 576 36598456 0 0 4 1738567 174077 79585 9 82 6 4 0
243 6 0 370064 576 36358576 0 0 9 1586528 174680 85353 8 81 6 5 0
267 2 0 322640 576 36254044 0 0 40 1011650 129389 42277 5 93 1 1 0
210 4 0 505092 576 35865460 0 0 25 720825 116356 32422 3 96 1 0 0

2012-06-27 14:00:03.219-06:00
vmstat -w 4 16
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st
75 1 0 566988 576 35664800 0 0 2 1355 21 3 1 4 95 0 0
433 1 0 964052 576 35069112 0 0 7 456359 102256 20901 2 98 0 0 0
547 3 0 820116 576 34893932 0 0 57 560507 114878 28115 3 96 0 0 0
806 2 0 606992 576 34848180 0 0 339 309668 101230 21056 2 98 0 0 0
708 1 0 529624 576 34708000 0 0 248 370886 101327 20062 2 97 0 0 0
231 5 0 504772 576 34663880 0 0 305 334824 95045 20407 2 97 1 1 0
158 6 0 1063088 576 33518536 0 0 531 847435 130696 47140 4 92 1 2 0
193 0 0 1449156 576 33035572 0 0 363 371279 94470 18955 2 96 1 1 0
266 6 0 1623512 576 32728164 0 0 77 241114 95730 15483 2 98 0 0 0
243 8 0 1629504 576 32653080 0 0 81 471018 100223 20920 3 96 0 1 0
70 11 0 1342140 576 33084020 0 0 100 925869 139876 56599 6 88 3 3 0
211 7 0 1130316 576 33470432 0 0 290 1008984 150699 74320 6 83 6 5 0
365 3 0 776736 576 34072772 0 0 182 747167 139436 67135 5 88 4 3 0
29 1 0 1528412 576 34110640 0 0 50 612181 137403 77609 4 87 6 3 0
266 5 0 1657688 576 34105696 0 0 3 258307 62879 38508 2 93 3 2 0
1159 2 0 2002256 576 33775476 0 0 19 88554 42112 14230 1 98 0 0 0


Right around 14:00 I was able to get a "perf record -a -g"; here's the
beginning of what "perf report --sort symbol --call-graph fractal,5"
had to say:

#
64.86% [k] _raw_spin_lock_irqsave
|
|--97.94%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| |
| |--99.56%-- try_to_compact_pages
| | __alloc_pages_direct_compact
| | __alloc_pages_slowpath
| | __alloc_pages_nodemask
| | alloc_pages_vma
| | do_huge_pmd_anonymous_page
| | handle_mm_fault
| | do_page_fault
| | page_fault
| | |
| | |--53.53%-- skb_copy_datagram_iovec
| | | tcp_recvmsg
| | | inet_recvmsg
| | | sock_recvmsg
| | | sys_recvfrom
| | | system_call_fastpath
| | | __recv
| | | |
| | | --100.00%-- (nil)
| | |
| | |--27.80%-- __pthread_create_2_1
| | | (nil)
| | |
| | --18.67%-- memcpy
| | |
| | |--57.38%-- 0x50d000005
| | |
| | |--34.52%-- 0x3b300bf271940a35
| | |
| | --8.10%-- 0x1500000000000009
| --0.44%-- [...]
--2.06%-- [...]

6.15% [k] isolate_freepages_block
|
|--99.95%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| |
| |--99.54%-- try_to_compact_pages
| | __alloc_pages_direct_compact
| | __alloc_pages_slowpath
| | __alloc_pages_nodemask
| | alloc_pages_vma
| | do_huge_pmd_anonymous_page
| | handle_mm_fault
| | do_page_fault
| | page_fault
| | |
| | |--54.40%-- skb_copy_datagram_iovec
| | | tcp_recvmsg
| | | inet_recvmsg
| | | sock_recvmsg
| | | sys_recvfrom
| | | system_call_fastpath
| | | __recv
| | | |
| | | --100.00%-- (nil)
| | |
| | |--25.19%-- __pthread_create_2_1
| | | (nil)
| | |
| | --20.41%-- memcpy
| | |
| | |--40.24%-- 0x3b300bf271940a35
| | |
| | |--38.29%-- 0x1500000000000009
| | |
| | --21.47%-- 0x50d000005
| --0.46%-- [...]
--0.05%-- [...]

3.96% [.] ceph_crc32c_le
|
|--99.99%-- 0xb8057558d0065990
--0.01%-- [...]

A different system in the same test had a slightly different
call tree, but isolate_freepages() still seems to show up
prominently:

#
32.32% [k] _raw_spin_lock_irqsave
|
|--97.64%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--65.31%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| | |
| | --100.00%-- (nil)
| |
| |--30.98%-- memcpy
| | |
| | |--50.60%-- 0x50d0000
| | |
| | --49.40%-- 0x50d000005
| --3.70%-- [...]
--2.36%-- [...]

17.10% [k] _raw_spin_lock_irq
|
|--98.27%-- isolate_migratepages_range
| compact_zone
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--99.85%-- __pthread_create_2_1
| | (nil)
| --0.15%-- [...]
--1.73%-- [...]

4.59% [k] mutex_spin_on_owner
|
--- __mutex_lock_slowpath
mutex_lock
|
|--50.14%-- page_lock_anon_vma
| |
| |--99.99%-- try_to_unmap_anon
| | try_to_unmap
| | __unmap_and_move
| | unmap_and_move
| | migrate_pages
| | compact_zone
| | try_to_compact_pages
| | __alloc_pages_direct_compact
| | __alloc_pages_slowpath
| | __alloc_pages_nodemask
| | alloc_pages_vma
| | do_huge_pmd_anonymous_page
| | handle_mm_fault
| | do_page_fault
| | page_fault
| | |
| | |--99.91%-- __pthread_create_2_1
| | | (nil)
| | --0.09%-- [...]
| --0.01%-- [...]
|
|--49.67%-- rmap_walk
| move_to_new_page
| __unmap_and_move
| unmap_and_move
| migrate_pages
| compact_zone
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--99.69%-- __pthread_create_2_1
| | (nil)
| --0.31%-- [...]
--0.20%-- [...]

4.10% [k] isolate_freepages_block
|
|--99.95%-- isolate_freepages
| compaction_alloc
| unmap_and_move
| migrate_pages
| compact_zone
| try_to_compact_pages
| __alloc_pages_direct_compact
| __alloc_pages_slowpath
| __alloc_pages_nodemask
| alloc_pages_vma
| do_huge_pmd_anonymous_page
| handle_mm_fault
| do_page_fault
| page_fault
| |
| |--46.97%-- skb_copy_datagram_iovec
| | tcp_recvmsg
| | inet_recvmsg
| | sock_recvmsg
| | sys_recvfrom
| | system_call_fastpath
| | __recv
| | |
| | --100.00%-- (nil)
| |
| |--31.79%-- __pthread_create_2_1
| | (nil)
| |
| --21.24%-- memcpy
| |
| |--61.90%-- 0x50d000005
| |
| --38.10%-- 0x50d0000
--0.05%-- [...]

3.65% [.] ceph_crc32c_le
|
|--99.86%-- 0xb8057558d0065990
--0.14%-- [...]



I seem to be able to recreate this issue at will, so please
let me know what I can do to help learn what is going on.

Thanks -- Jim

P.S. I got the recipients list via "scripts/get_maintainer.pl -f mm/compaction.c";
please let me know if I should have done something else.


2012-06-27 22:35:29

by Mark Nelson

Subject: Re: excessive CPU utilization by isolate_freepages?

On 06/27/2012 04:59 PM, Jim Schutt wrote:
> Hi,
>
> I'm running into trouble with systems going unresponsive,
> and perf suggests it's excessive CPU usage by isolate_freepages().
> I'm currently testing 3.5-rc4, but I think this problem may have
> first shown up in 3.4. I'm only just learning how to use perf,
> so I only currently have results to report for 3.5-rc4.
>
> <SNIP>

Ok, I looked around a bit and found this:

https://lkml.org/lkml/2011/11/9/252

I think the same thing is happening here, i.e. huge pages are being
requested, but due to memory fragmentation none are available.
try_to_compact_pages() is getting called to assemble a contiguous set of
pages, but apparently it can't move the pages it needs because they are
stuck waiting (presumably dirty pages waiting on writeback).

According to that post, /sys/block/<device>/bdi/max_ratio can be changed
to specify what percentage of dirty cache each of your OSDs can use.
Apparently the default is 20% for each device.
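
For example, something like this would cap each disk at 5% (the device
names are just a guess at your layout; adjust for your 24 OSD drives):

for dev in /sys/block/sd[b-y]; do
    echo 5 > $dev/bdi/max_ratio
done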

Might be worth a try!

Mark

2012-06-27 23:59:42

by Minchan Kim

Subject: Re: excessive CPU utilization by isolate_freepages?

On 06/28/2012 06:59 AM, Jim Schutt wrote:

> Hi,
>
> I'm running into trouble with systems going unresponsive,
> and perf suggests it's excessive CPU usage by isolate_freepages().
> I'm currently testing 3.5-rc4, but I think this problem may have
> first shown up in 3.4. I'm only just learning how to use perf,
> so I only currently have results to report for 3.5-rc4.
>
> <SNIP>



I suspect compaction keeps trying to migrate even though we have no free memory.
Could you apply this patch and retest?

https://lkml.org/lkml/2012/6/21/30

Thanks.

--
Kind regards,
Minchan Kim

2012-06-28 00:28:49

by Rik van Riel

Subject: Re: excessive CPU utilization by isolate_freepages?

On 06/27/2012 07:59 PM, Minchan Kim wrote:

> I suspect compaction keeps trying to migrate even though we have no free memory.
> Could you apply this patch and retest?
>
> https://lkml.org/lkml/2012/6/21/30

Another possibility is that compaction is succeeding every time,
but since we always start scanning all the way at the beginning
and end of each zone, we waste a lot of CPU time rescanning the
same pages (that we just filled up with moved pages) to see if
any are free.

In short, due to the way compaction behaves right now,
compaction + isolate_freepages are essentially quadratic.

What we need to do is remember where we left off after a
successful compaction, so we can continue the search there
at the next invocation.
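
Conceptually, something like this (a sketch only; the field and
function names are made up, not actual kernel code):

/* Cache the free scanner's position across compaction runs. */
struct zone_sketch {
        unsigned long start_pfn;        /* first pfn of the zone */
        unsigned long end_pfn;          /* last pfn of the zone */
        unsigned long cached_free_pfn;  /* where the last run stopped */
};

/* Start the free scanner where the last successful compaction left
 * off, instead of rescanning from the end of the zone every time. */
static unsigned long free_scan_start(struct zone_sketch *z)
{
        if (z->cached_free_pfn > z->start_pfn)
                return z->cached_free_pfn;
        return z->end_pfn;  /* first run, or wrapped: start at the top */
}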

--
All rights reversed

2012-06-28 00:53:39

by David Rientjes

Subject: Re: excessive CPU utilization by isolate_freepages?

On Wed, 27 Jun 2012, Rik van Riel wrote:

> > I suspect compaction keeps trying to migrate even though we have no free
> > memory.
> > Could you apply this patch and retest?
> >
> > https://lkml.org/lkml/2012/6/21/30
>

Not sure if Jim is using memcg; if not, then this won't be helpful.

> Another possibility is that compaction is succeeding every time,
> but since we always start scanning all the way at the beginning
> and end of each zone, we waste a lot of CPU time rescanning the
> same pages (that we just filled up with moved pages) to see if
> any are free.
>
> In short, due to the way compaction behaves right now,
> compaction + isolate_freepages are essentially quadratic.
>
> What we need to do is remember where we left off after a
> successful compaction, so we can continue the search there
> at the next invocation.
>

So when the free and migration scanners meet and compact_finished() ==
COMPACT_CONTINUE, loop around to the start of the zone and continue until
you reach the pfn it started at? Seems appropriate.

2012-06-28 00:55:41

by Minchan Kim

Subject: Re: excessive CPU utilization by isolate_freepages?

On 06/28/2012 09:28 AM, Rik van Riel wrote:

> On 06/27/2012 07:59 PM, Minchan Kim wrote:
>
>> I suspect compaction keeps trying to migrate even though we have no
>> free memory.
>> Could you apply this patch and retest?
>>
>> https://lkml.org/lkml/2012/6/21/30
>
> Another possibility is that compaction is succeeding every time,
> but since we always start scanning all the way at the beginning
> and end of each zone, we waste a lot of CPU time rescanning the
> same pages (that we just filled up with moved pages) to see if
> any are free.


It does make sense.

>
> In short, due to the way compaction behaves right now,
> compaction + isolate_freepages are essentially quadratic.
>
> What we need to do is remember where we left off after a
> successful compaction, so we can continue the search there
> at the next invocation.
>


Good idea.
It could enhance parallel compaction, too.
Of course, if we can't meet the goal, we need to loop around from the start/end of the zone.

--
Kind regards,
Minchan Kim

2012-06-28 00:58:39

by Minchan Kim

Subject: Re: excessive CPU utilization by isolate_freepages?

On 06/28/2012 09:52 AM, David Rientjes wrote:

> On Wed, 27 Jun 2012, Rik van Riel wrote:
>
>>> I suspect compaction keeps trying to migrate even though we have no free
>>> memory.
>>> Could you apply this patch and retest?
>>>
>>> https://lkml.org/lkml/2012/6/21/30
>>
> Not sure if Jim is using memcg; if not, then this won't be helpful.
>


It isn't related to memcg.
If compaction_alloc() can't find a suitable migration target, it returns NULL.
Then migrate_pages() should exit.

--
Kind regards,
Minchan Kim

2012-06-28 01:07:08

by David Rientjes

Subject: Re: excessive CPU utilization by isolate_freepages?

On Thu, 28 Jun 2012, Minchan Kim wrote:

>>>> https://lkml.org/lkml/2012/6/21/30
>>>
>> Not sure if Jim is using memcg; if not, then this won't be helpful.
>>
>
>
> It isn't related to memcg.
> If compaction_alloc() can't find a suitable migration target, it returns NULL.
> Then migrate_pages() should exit.
>

If isolate_freepages() is going to fail, then this zone should have been
skipped when checking for compaction_suitable(). In Jim's perf output,
compaction_suitable() returns COMPACT_CONTINUE for a transparent hugepage.
Why is zone_watermark_ok(zone, 0, low_wmark + 1024, 0, 0) succeeding if
isolate_freepages() is going to fail?

2012-06-28 01:13:36

by Rik van Riel

Subject: Re: excessive CPU utilization by isolate_freepages?

On 06/27/2012 08:52 PM, David Rientjes wrote:
> On Wed, 27 Jun 2012, Rik van Riel wrote:

>> Another possibility is that compaction is succeeding every time,
>> but since we always start scanning all the way at the beginning
>> and end of each zone, we waste a lot of CPU time rescanning the
>> same pages (that we just filled up with moved pages) to see if
>> any are free.
>>
>> In short, due to the way compaction behaves right now,
>> compaction + isolate_freepages are essentially quadratic.
>>
>> What we need to do is remember where we left off after a
>> successful compaction, so we can continue the search there
>> at the next invocation.
>>
>
> So when the free and migration scanners meet and compact_finished() ==
> COMPACT_CONTINUE, loop around to the start of the zone and continue until
> you reach the pfn that it was started at? Seems appropriate.

Exactly.

It would entail changes to struct compact_control, where
we have to remember whether we started at the top of the
zone or not (for a full compaction, i.e. order == -1, we might).

For a compaction of order >0, we would remember the last
pfn where isolate_freepages isolated a page, and start
isolating below that.

If we fail to isolate pages when cc->free_pfn and cc->migrate_pfn
meet, we may want to restart from the top for a second round.
If they meet again after the second round, we have really failed,
and compaction will be deferred.

As long as compaction succeeds, we will slowly move through the
zone, with each invocation of compaction scanning a little more.

For cc->migrate_pfn it is probably fine to start from the
beginning of the zone each time.
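
Building on the sketch from my earlier mail (again, the names are made
up, and scan_until_scanners_meet() is a placeholder for the actual
scanner loop, not real kernel code):

/* Placeholder: run the free/migrate scanners until they meet,
 * returning nonzero if compaction succeeded. */
static int scan_until_scanners_meet(struct zone_sketch *z,
                                    unsigned long free_pfn);

static int compact_zone_sketch(struct zone_sketch *z)
{
        int round;

        /* Round 0 resumes from the cached pfn; if the scanners meet
         * without success, round 1 restarts from the top of the zone.
         * Failing both rounds means compaction gets deferred. */
        for (round = 0; round < 2; round++) {
                unsigned long free_pfn = round ? z->end_pfn
                                               : free_scan_start(z);

                if (scan_until_scanners_meet(z, free_pfn))
                        return 0;
        }
        return -1;  /* really failed: defer compaction */
}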

--
All rights reversed

2012-06-28 01:18:43

by Minchan Kim

Subject: Re: excessive CPU utilization by isolate_freepages?

On 06/28/2012 10:06 AM, David Rientjes wrote:

> On Thu, 28 Jun 2012, Minchan Kim wrote:
>
>>>>> https://lkml.org/lkml/2012/6/21/30
>>>>
>>> Not sure if Jim is using memcg; if not, then this won't be helpful.
>>>
>>
>>
>> It isn't related to memcg.
>> If compaction_alloc() can't find a suitable migration target, it returns NULL.
>> Then migrate_pages() should exit.
>>
>
> If isolate_freepages() is going to fail, then this zone should have been
> skipped when checking for compaction_suitable(). In Jim's perf output,
> compaction_suitable() returns COMPACT_CONTINUE for a transparent hugepage.
> Why is zone_watermark_ok(zone, 0, low_wmark + 1024, 0, 0) succeeding if
> isolate_freepages() is going to fail?
>


zone_watermark_ok() doesn't consider migratetype, but
suitable_migration_target() does.
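
Roughly the difference, as a simplified sketch (not the kernel's actual
code):

enum { MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE };

/* zone_watermark_ok() counts every free page, whatever its type. */
static int watermark_ok_sketch(unsigned long nr_free,
                               unsigned long low_wmark)
{
        return nr_free > low_wmark + 1024;  /* 2 << 9 for a THP */
}

/* suitable_migration_target() additionally requires a free block to
 * be movable (or a whole free pageblock), so pages that passed the
 * watermark check can still be useless to the free scanner. */
static int suitable_target_sketch(int migratetype, int whole_pageblock)
{
        return migratetype == MIGRATE_MOVABLE || whole_pageblock;
}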


--
Kind regards,
Minchan Kim

2012-06-28 11:36:24

by Mel Gorman

Subject: Re: excessive CPU utilization by isolate_freepages?

On Wed, Jun 27, 2012 at 03:59:19PM -0600, Jim Schutt wrote:
> Hi,
>
> I'm running into trouble with systems going unresponsive,
> and perf suggests it's excessive CPU usage by isolate_freepages().
> I'm currently testing 3.5-rc4, but I think this problem may have
> first shown up in 3.4. I'm only just learning how to use perf,
> so I only currently have results to report for 3.5-rc4.
>

Out of curiosity, why do you think it showed up in 3.4? It's not
surprising as such if it did show up there but I'm wondering what you
are basing it on.

It's not a surprise because 3.4 is also where reclaim/compaction stopped
depending on lumpy reclaim. In the past we would have reclaimed more
pages but now rely on compaction more. It's plausible that with many
parallel compactions there would be higher CPU usage now.

> <SNIP>
> 2012-06-27 14:00:03.219-06:00
> vmstat -w 4 16
> procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
> r b swpd free buff cache si so bi bo in cs us sy id wa st
> 75 1 0 566988 576 35664800 0 0 2 1355 21 3 1 4 95 0 0
> 433 1 0 964052 576 35069112 0 0 7 456359 102256 20901 2 98 0 0 0
> 547 3 0 820116 576 34893932 0 0 57 560507 114878 28115 3 96 0 0 0
> 806 2 0 606992 576 34848180 0 0 339 309668 101230 21056 2 98 0 0 0
> 708 1 0 529624 576 34708000 0 0 248 370886 101327 20062 2 97 0 0 0
> 231 5 0 504772 576 34663880 0 0 305 334824 95045 20407 2 97 1 1 0
> 158 6 0 1063088 576 33518536 0 0 531 847435 130696 47140 4 92 1 2 0
> 193 0 0 1449156 576 33035572 0 0 363 371279 94470 18955 2 96 1 1 0
> 266 6 0 1623512 576 32728164 0 0 77 241114 95730 15483 2 98 0 0 0
> 243 8 0 1629504 576 32653080 0 0 81 471018 100223 20920 3 96 0 1 0
> 70 11 0 1342140 576 33084020 0 0 100 925869 139876 56599 6 88 3 3 0
> 211 7 0 1130316 576 33470432 0 0 290 1008984 150699 74320 6 83 6 5 0
> 365 3 0 776736 576 34072772 0 0 182 747167 139436 67135 5 88 4 3 0
> 29 1 0 1528412 576 34110640 0 0 50 612181 137403 77609 4 87 6 3 0
> 266 5 0 1657688 576 34105696 0 0 3 258307 62879 38508 2 93 3 2 0
> 1159 2 0 2002256 576 33775476 0 0 19 88554 42112 14230 1 98 0 0 0
>

ok, so System CPU usage through the roof.

>
> Right around 14:00 I was able to get a "perf record -a -g"; here's the
> beginning of what "perf report --sort symbol --call-graph fractal,5"
> had to say:
>
> #
> 64.86% [k] _raw_spin_lock_irqsave
> |
> |--97.94%-- isolate_freepages
> | compaction_alloc
> | unmap_and_move
> | migrate_pages
> | compact_zone
> | |
> | |--99.56%-- try_to_compact_pages
> | | __alloc_pages_direct_compact
> | | __alloc_pages_slowpath
> | | __alloc_pages_nodemask
> | | alloc_pages_vma
> | | do_huge_pmd_anonymous_page
> | | handle_mm_fault
> | | do_page_fault
> | | page_fault
> | | |
> | | |--53.53%-- skb_copy_datagram_iovec
> | | | tcp_recvmsg
> | | | inet_recvmsg
> | | | sock_recvmsg
> | | | sys_recvfrom
> | | | system_call_fastpath
> | | | __recv
> | | | |
> | | | --100.00%-- (nil)
> | | |
> | | |--27.80%-- __pthread_create_2_1
> | | | (nil)
> | | |
> | | --18.67%-- memcpy
> | | |
> | | |--57.38%-- 0x50d000005
> | | |
> | | |--34.52%-- 0x3b300bf271940a35
> | | |
> | | --8.10%-- 0x1500000000000009
> | --0.44%-- [...]
> --2.06%-- [...]
>

This looks like lock contention to me on zone->lock which
isolate_freepages takes and releases frequently. You do not describe the
exact memory layout but it's likely that there are two very large zones
with 12 CPUs each. If they all were running compaction they would pound
zone->lock pretty heavily.

> <SNIP>

The other call traces also look like they are pounding zone->lock
heavily.

Rik's patch has the potential to reduce contention by virtue of the fact
that less scanning is required. I'd be interested in hearing how much of
an impact that patch has so please test that first.

If that approach does not work I'll put together a patch that backs
off compaction on zone->lock contention.
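
Something along these lines (a sketch only; zone_trylock_irqsave() is a
made-up stand-in here, not the actual interface):

struct zone_lock;                                      /* opaque here */
static int zone_trylock_irqsave(struct zone_lock *l);  /* stand-in */

/* Instead of spinning on zone->lock, trylock and abort this
 * compaction pass if the lock is contended, so dozens of CPUs
 * don't pound the same spinlock. */
static int lock_or_bail(struct zone_lock *l, int *contended)
{
        if (!zone_trylock_irqsave(l)) {
                *contended = 1;   /* caller aborts and falls back */
                return 0;
        }
        return 1;                 /* got the lock: keep scanning */
}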

> I seem to be able to recreate this issue at will, so please
> let me know what I can do to help learn what is going on.
>

Thanks very much for testing.

--
Mel Gorman
SUSE Labs

2012-06-28 15:30:46

by Jim Schutt

Subject: Re: excessive CPU utilization by isolate_freepages?

On 06/28/2012 05:36 AM, Mel Gorman wrote:
> On Wed, Jun 27, 2012 at 03:59:19PM -0600, Jim Schutt wrote:
>> Hi,
>>
>> I'm running into trouble with systems going unresponsive,
>> and perf suggests it's excessive CPU usage by isolate_freepages().
>> I'm currently testing 3.5-rc4, but I think this problem may have
>> first shown up in 3.4. I'm only just learning how to use perf,
>> so I only currently have results to report for 3.5-rc4.
>>
>
> Out of curiosity, why do you think it showed up in 3.4? It's not
> surprising as such if it did show up there but I'm wondering what you
> are basing it on.

If I remember correctly, it was when testing this workload on 3.4 that
I first saw hundreds of runnable threads reported by vmstat.
At that time I couldn't reproduce it quite as reliably, and I didn't
know how to get perf to give me call chains, so I didn't follow up :(

>
> It's not a surprise because 3.4 is also where reclaim/compaction stopped
> depending on lumpy reclaim. In the past we would have reclaimed more
> pages but now rely on compaction more. It's plausible that with many
> parallel compactions there would be higher CPU usage now.
>
>> <SNIP>
>>
>
> ok, so System CPU usage through the roof.
>
>>
>> Right around 14:00 I was able to get a "perf record -a -g"; here's the
>> beginning of what "perf report --sort symbol --call-graph fractal,5"
>> had to say:
>>
>> <SNIP>
>>
>
> This looks like lock contention to me on zone->lock which
> isolate_freepages takes and releases frequently. You do not describe the
> exact memory layout but it's likely that there are two very large zones
> with 12 CPUs each. If they all were running compaction they would pound
> zone->lock pretty heavily.

Does this tell you what you need to know?
If not, let me know what you'd like to see.

# grep -H MemTotal /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 MemTotal: 25156644 kB
/sys/devices/system/node/node1/meminfo:Node 1 MemTotal: 25165824 kB

>
>> <SNIP>
>
> The other call traces also look like they are pounding zone->lock
> heavily.
>
> Rik's patch has the potential to reduce contention by virtue of the fact
> that less scanning is required. I'd be interested in hearing how much of
> an impact that patch has so please test that first.

Working on that now....

>
> If that approach does not work I'll put together a patch that backs
> off compaction on zone->lock contention.
>
>> I seem to be able to recreate this issue at will, so please
>> let me know what I can do to help learn what is going on.
>>
>
> Thanks very much for testing.
>

No problem. Thanks to you all for jumping on this so quickly.

FWIW, perf rocks! Thanks to all who've helped make it so great!

-- Jim