From: Mel Gorman
To: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Cc: Dave Chinner, Chris Mason, Nick Piggin, Rik van Riel, Johannes Weiner, Christoph Hellwig, KAMEZAWA Hiroyuki, Andrew Morton, Mel Gorman
Subject: [PATCH 0/12] Avoid overflowing of stack during page reclaim V2
Date: Mon, 14 Jun 2010 12:17:41 +0100
Message-Id: <1276514273-27693-1-git-send-email-mel@csn.ul.ie>

This is a merging of two series - the first of which reduces stack usage in
page reclaim and the second which writes contiguous pages during reclaim and
avoids writeback in direct reclaimers.

Changelog since V1
  o Merge with series that reduces stack usage in page reclaim in general
  o Allow memcg to writeback pages as they are not expected to overflow stack
  o Drop the contiguous-write patch for the moment

There is a problem in the stack depth usage of page reclaim. Particularly
during direct reclaim, it is possible to overflow the stack if it calls into
the filesystem's writepage function. This patch series aims to trace
writebacks so it can be evaluated how many dirty pages are being written,
reduce stack usage of page reclaim in general and avoid direct reclaim
writing back pages and overflowing the stack.

The first 4 patches are a forward-port of trace points that are partly
based on trace points defined by Larry Woodman but never merged. They trace
parts of kswapd, direct reclaim, LRU page isolation and page writeback.
The tracepoints can be used to evaluate what is happening within reclaim and
whether things are getting better or worse. They do not have to be part of
the final series but might be useful during discussion and for later
regression testing - particularly around the percentage of time spent in
reclaim.

The 6 patches after that reduce the stack footprint of page reclaim by
moving large allocations out of the main call path. Functionally they should
be similar although there is a timing change on when pages get freed exactly.
This is aimed at giving filesystems as much stack as possible if kswapd is
to writeback pages directly.

Patch 11 puts dirty pages as it finds them onto a temporary list and then
writes them all out with a helper function. This simplifies patch 12 and
also increases the chances that IO requests can be optimally merged.

Patch 12 prevents direct reclaim writing out pages at all and instead dirty
pages are put back on the LRU. For lumpy reclaim, the caller will briefly
wait on dirty pages to be written out before trying to reclaim the dirty
pages a second time. This increases the responsibility of kswapd somewhat
because it's now cleaning pages on behalf of direct reclaimers but kswapd
seemed a better fit than background flushers to clean pages as it knows
where the pages needing cleaning are. As it's async IO, it should not cause
kswapd to stall (at least until the queue is congested) but the order that
pages are reclaimed on the LRU is altered. Dirty pages that would have been
reclaimed by direct reclaimers are getting another lap on the LRU. The dirty
pages could have been put on a dedicated list but this increased counter
overhead and the number of lists and it is unclear if it is necessary.

Apologies for the length of the rest of the mail. Measuring the impact of
this is not exactly straight-forward.

I ran a number of tests with monitoring on X86, X86-64 and PPC64 and I'll
cover what the X86-64 results were here.
It's an AMD Phenom 4-core machine with 2G of RAM with a single disk and the
onboard IO controller. Dirty ratio was left at 20 (tests with 40 are in
progress). The filesystem all the tests were run on was XFS.

Three kernels are compared.

traceonly-v2r5		is the first 4 patches of this series
stackreduce-v2r5	is the first 10 patches of this series
nodirect-v2r5		is all patches in the series

The results of each test are broken up into three parts. The first part
compares the results of the test itself. The second part is a report based
on the ftrace postprocessing script in patch 4 and reports on direct reclaim
and kswapd activity. The third part reports what percentage of time was
spent in direct reclaim and what percentage of time kswapd was awake.

To work out the percentage of time spent in direct reclaim, I used
/usr/bin/time to get the User + Sys CPU time. The stalled time was taken
from the post-processing script. The total time is (User + Sys + Stall) and
the percentage is stalled time over total time.
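As a minimal sketch of the calculation described above (this is not the
post-processing script from patch 4; the function name and the sample inputs
are made up for illustration), the percentage could be worked out like this
in Python:

```python
def reclaim_stall_percentage(user_sys_seconds: float, stall_ms: float) -> float:
    """Return stalled time as a percentage of (User + Sys + Stall).

    user_sys_seconds: User + Sys CPU time, e.g. as reported by /usr/bin/time.
    stall_ms: time stalled in direct reclaim, in milliseconds.
    """
    stall_seconds = stall_ms / 1000.0
    total_seconds = user_sys_seconds + stall_seconds
    if total_seconds == 0:
        return 0.0
    return 100.0 * stall_seconds / total_seconds


# Illustrative inputs only, not measurements from this series:
# 100 seconds of User + Sys time and 2500 ms stalled in direct reclaim.
print(f"{reclaim_stall_percentage(100.0, 2500.0):.2f}%")
```

The stall time is converted to seconds first so that both terms of the total
are in the same unit.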

kernbench
=========
                 traceonly-v2r5    stackreduce-v2r5     nodirect-v2r5
Elapsed min         98.16 ( 0.00%)       97.95 ( 0.21%)       98.25 (-0.09%)
Elapsed mean        98.29 ( 0.00%)       98.26 ( 0.03%)       98.40 (-0.11%)
Elapsed stddev       0.08 ( 0.00%)        0.20 (-165.90%)      0.12 (-59.87%)
Elapsed max         98.34 ( 0.00%)       98.51 (-0.17%)       98.58 (-0.24%)
User    min        311.03 ( 0.00%)      311.74 (-0.23%)      311.13 (-0.03%)
User    mean       311.28 ( 0.00%)      312.45 (-0.38%)      311.42 (-0.05%)
User    stddev       0.24 ( 0.00%)        0.51 (-114.08%)      0.38 (-59.06%)
User    max        311.58 ( 0.00%)      312.94 (-0.44%)      312.06 (-0.15%)
System  min         40.54 ( 0.00%)       39.65 ( 2.20%)       40.34 ( 0.49%)
System  mean        40.80 ( 0.00%)       40.01 ( 1.93%)       40.81 (-0.03%)
System  stddev       0.23 ( 0.00%)        0.34 (-47.57%)       0.29 (-25.47%)
System  max         41.04 ( 0.00%)       40.51 ( 1.29%)       41.11 (-0.17%)
CPU     min        357.00 ( 0.00%)      357.00 ( 0.00%)      357.00 ( 0.00%)
CPU     mean       357.75 ( 0.00%)      358.00 (-0.07%)      357.75 ( 0.00%)
CPU     stddev       0.43 ( 0.00%)        0.71 (-63.30%)       0.43 ( 0.00%)
CPU     max        358.00 ( 0.00%)      359.00 (-0.28%)      358.00 ( 0.00%)

FTrace Reclaim Statistics
                                        traceonly-v2r5  stackreduce-v2r5  nodirect-v2r5
Direct reclaims                                      0                 0              0
Direct reclaim pages scanned                         0                 0              0
Direct reclaim write async I/O                       0                 0              0
Direct reclaim write sync I/O                        0                 0              0
Wake kswapd requests                                 0                 0              0
Kswapd wakeups                                       0                 0              0
Kswapd pages scanned                                 0                 0              0
Kswapd reclaim write async I/O                       0                 0              0
Kswapd reclaim write sync I/O                        0                 0              0
Time stalled direct reclaim (ms)                  0.00              0.00           0.00
Time kswapd awake (ms)                            0.00              0.00           0.00

User/Sys Time Running Test (seconds)           2144.58           2146.22         2144.8
Percentage Time Spent Direct Reclaim             0.00%             0.00%          0.00%
Total Elapsed Time (seconds)                    788.42            794.85         793.66
Percentage Time kswapd Awake                     0.00%             0.00%          0.00%

kernbench is a straight-forward kernel compile. The kernel is built 5 times
and an average taken. There was no interesting difference in terms of
performance. As the workload fit easily in memory, there was no page reclaim
activity.

IOZone
======
                 traceonly-v2r5      stackreduce-v2r5    nodirect-v2r5
write-64           395452 ( 0.00%)      397208 ( 0.44%)      397796 ( 0.59%)
write-128          463696 ( 0.00%)      460514 (-0.69%)      458940 (-1.04%)
write-256          504861 ( 0.00%)      506050 ( 0.23%)      502969 (-0.38%)
write-512          490875 ( 0.00%)      485767 (-1.05%)      494264 ( 0.69%)
write-1024         497574 ( 0.00%)      489689 (-1.61%)      505956 ( 1.66%)
write-2048         500993 ( 0.00%)      503076 ( 0.41%)      510097 ( 1.78%)
write-4096         504491 ( 0.00%)      502073 (-0.48%)      506993 ( 0.49%)
write-8192         488398 ( 0.00%)      228857 (-113.41%)    313871 (-55.60%)
write-16384        409006 ( 0.00%)      433783 ( 5.71%)      365696 (-11.84%)
write-32768        473136 ( 0.00%)      481153 ( 1.67%)      486373 ( 2.72%)
write-65536        474833 ( 0.00%)      477970 ( 0.66%)      481192 ( 1.32%)
write-131072       429557 ( 0.00%)      452604 ( 5.09%)      459840 ( 6.59%)
write-262144       397934 ( 0.00%)      401955 ( 1.00%)      397479 (-0.11%)
write-524288       222849 ( 0.00%)      230297 ( 3.23%)      209999 (-6.12%)
rewrite-64        1452528 ( 0.00%)     1492919 ( 2.71%)     1452528 ( 0.00%)
rewrite-128       1622919 ( 0.00%)     1618028 (-0.30%)     1663139 ( 2.42%)
rewrite-256       1694118 ( 0.00%)     1704877 ( 0.63%)     1639786 (-3.31%)
rewrite-512       1753325 ( 0.00%)     1740536 (-0.73%)     1730717 (-1.31%)
rewrite-1024      1741104 ( 0.00%)     1787480 ( 2.59%)     1759651 ( 1.05%)
rewrite-2048      1710867 ( 0.00%)     1747411 ( 2.09%)     1747411 ( 2.09%)
rewrite-4096      1583280 ( 0.00%)     1621536 ( 2.36%)     1599942 ( 1.04%)
rewrite-8192      1308005 ( 0.00%)     1338579 ( 2.28%)     1307358 (-0.05%)
rewrite-16384     1293742 ( 0.00%)     1314178 ( 1.56%)     1291602 (-0.17%)
rewrite-32768     1298360 ( 0.00%)     1314503 ( 1.23%)     1276758 (-1.69%)
rewrite-65536     1289212 ( 0.00%)     1316088 ( 2.04%)     1281351 (-0.61%)
rewrite-131072    1286117 ( 0.00%)     1309070 ( 1.75%)     1283007 (-0.24%)
rewrite-262144    1285902 ( 0.00%)     1305816 ( 1.53%)     1274121 (-0.92%)
rewrite-524288     220417 ( 0.00%)      223971 ( 1.59%)      226133 ( 2.53%)
read-64           3203069 ( 0.00%)     2467108 (-29.83%)    3541098 ( 9.55%)
read-128          3759450 ( 0.00%)     4267461 (11.90%)     4233807 (11.20%)
read-256          4264168 ( 0.00%)     4350555 ( 1.99%)     3935921 (-8.34%)
read-512          3437042 ( 0.00%)     3366987 (-2.08%)     3437042 ( 0.00%)
read-1024         3738636 ( 0.00%)     3821805 ( 2.18%)     3735385 (-0.09%)
read-2048         3938881 ( 0.00%)     3984558 ( 1.15%)     3967993 ( 0.73%)
read-4096         3631489 ( 0.00%)     3828122 ( 5.14%)     3781775 ( 3.97%)
read-8192         3175046 ( 0.00%)     3230268 ( 1.71%)     3247058 ( 2.22%)
read-16384        2923635 ( 0.00%)     2869911 (-1.87%)     2954684 ( 1.05%)
read-32768        2819040 ( 0.00%)     2839776 ( 0.73%)     2852152 ( 1.16%)
read-65536        2659324 ( 0.00%)     2827502 ( 5.95%)     2816464 ( 5.58%)
read-131072       2707652 ( 0.00%)     2727534 ( 0.73%)     2746406 ( 1.41%)
read-262144       2765929 ( 0.00%)     2782166 ( 0.58%)     2776125 ( 0.37%)
read-524288       2810803 ( 0.00%)     2822894 ( 0.43%)     2822394 ( 0.41%)
reread-64         5389653 ( 0.00%)     5860307 ( 8.03%)     5735102 ( 6.02%)
reread-128        5122535 ( 0.00%)     5325799 ( 3.82%)     5325799 ( 3.82%)
reread-256        3245838 ( 0.00%)     3285566 ( 1.21%)     3236056 (-0.30%)
reread-512        4340054 ( 0.00%)     4571003 ( 5.05%)     4742616 ( 8.49%)
reread-1024       4265934 ( 0.00%)     4356809 ( 2.09%)     4374559 ( 2.48%)
reread-2048       3915540 ( 0.00%)     4301837 ( 8.98%)     4338776 ( 9.75%)
reread-4096       3846119 ( 0.00%)     3984379 ( 3.47%)     3979764 ( 3.36%)
reread-8192       3257215 ( 0.00%)     3304518 ( 1.43%)     3325949 ( 2.07%)
reread-16384      2959519 ( 0.00%)     2892622 (-2.31%)     2995773 ( 1.21%)
reread-32768      2570835 ( 0.00%)     2607266 ( 1.40%)     2610783 ( 1.53%)
reread-65536      2731466 ( 0.00%)     2683809 (-1.78%)     2691957 (-1.47%)
reread-131072     2738144 ( 0.00%)     2763886 ( 0.93%)     2776056 ( 1.37%)
reread-262144     2781012 ( 0.00%)     2781786 ( 0.03%)     2784322 ( 0.12%)
reread-524288     2787049 ( 0.00%)     2779851 (-0.26%)     2787681 ( 0.02%)
randread-64       1204796 ( 0.00%)     1204796 ( 0.00%)     1143223 (-5.39%)
randread-128      4135958 ( 0.00%)     4012317 (-3.08%)     4135958 ( 0.00%)
randread-256      3454704 ( 0.00%)     3511189 ( 1.61%)     3511189 ( 1.61%)
randread-512      3437042 ( 0.00%)     3366987 (-2.08%)     3437042 ( 0.00%)
randread-1024     3301774 ( 0.00%)     3401130 ( 2.92%)     3403826 ( 3.00%)
randread-2048     3549844 ( 0.00%)     3391470 (-4.67%)     3477979 (-2.07%)
randread-4096     3214912 ( 0.00%)     3261295 ( 1.42%)     3258820 ( 1.35%)
randread-8192     2818958 ( 0.00%)     2836645 ( 0.62%)     2861450 ( 1.48%)
randread-16384    2571662 ( 0.00%)     2515924 (-2.22%)     2564465 (-0.28%)
randread-32768    2319848 ( 0.00%)     2331892 ( 0.52%)     2333594 ( 0.59%)
randread-65536    2288193 ( 0.00%)     2297103 ( 0.39%)     2301123 ( 0.56%)
randread-131072   2270669 ( 0.00%)     2275707 ( 0.22%)     2279150 ( 0.37%)
randread-262144   2258949 ( 0.00%)     2264700 ( 0.25%)     2259975 ( 0.05%)
randread-524288   2250529 ( 0.00%)     2240365 (-0.45%)     2242837 (-0.34%)
randwrite-64       942521 ( 0.00%)      939223 (-0.35%)      939223 (-0.35%)
randwrite-128      962469 ( 0.00%)      971174 ( 0.90%)      969421 ( 0.72%)
randwrite-256      980760 ( 0.00%)      980760 ( 0.00%)      966633 (-1.46%)
randwrite-512     1190529 ( 0.00%)     1138158 (-4.60%)     1040545 (-14.41%)
randwrite-1024    1361836 ( 0.00%)     1361836 ( 0.00%)     1367037 ( 0.38%)
randwrite-2048    1325646 ( 0.00%)     1364390 ( 2.84%)     1361794 ( 2.65%)
randwrite-4096    1360371 ( 0.00%)     1372653 ( 0.89%)     1363935 ( 0.26%)
randwrite-8192    1291680 ( 0.00%)     1305272 ( 1.04%)     1285206 (-0.50%)
randwrite-16384   1253666 ( 0.00%)     1255865 ( 0.18%)     1231889 (-1.77%)
randwrite-32768   1239139 ( 0.00%)     1250641 ( 0.92%)     1224239 (-1.22%)
randwrite-65536   1223186 ( 0.00%)     1228115 ( 0.40%)     1203094 (-1.67%)
randwrite-131072  1207002 ( 0.00%)     1215733 ( 0.72%)     1198229 (-0.73%)
randwrite-262144  1184954 ( 0.00%)     1201542 ( 1.38%)     1179145 (-0.49%)
randwrite-524288    96156 ( 0.00%)       96502 ( 0.36%)       95942 (-0.22%)
bkwdread-64       2923952 ( 0.00%)     3022727 ( 3.27%)     2772930 (-5.45%)
bkwdread-128      3785961 ( 0.00%)     3657016 (-3.53%)     3657016 (-3.53%)
bkwdread-256      3017775 ( 0.00%)     3052087 ( 1.12%)     3159869 ( 4.50%)
bkwdread-512      2875558 ( 0.00%)     2845081 (-1.07%)     2841317 (-1.21%)
bkwdread-1024     3083680 ( 0.00%)     3181915 ( 3.09%)     3140041 ( 1.79%)
bkwdread-2048     3266376 ( 0.00%)     3282603 ( 0.49%)     3281349 ( 0.46%)
bkwdread-4096     3207709 ( 0.00%)     3287506 ( 2.43%)     3248345 ( 1.25%)
bkwdread-8192     2777710 ( 0.00%)     2792156 ( 0.52%)     2777036 (-0.02%)
bkwdread-16384    2565614 ( 0.00%)     2570412 ( 0.19%)     2541795 (-0.94%)
bkwdread-32768    2472332 ( 0.00%)     2495631 ( 0.93%)     2284260 (-8.23%)
bkwdread-65536    2435202 ( 0.00%)     2391036 (-1.85%)     2361477 (-3.12%)
bkwdread-131072   2417850 ( 0.00%)     2453436 ( 1.45%)     2417903 ( 0.00%)
bkwdread-262144   2468467 ( 0.00%)     2491433 ( 0.92%)     2446649 (-0.89%)
bkwdread-524288   2513411 ( 0.00%)     2534789 ( 0.84%)     2486486 (-1.08%)

FTrace Reclaim Statistics
                                        traceonly-v2r5  stackreduce-v2r5  nodirect-v2r5
Direct reclaims                                      0                 0              0
Direct reclaim pages scanned                         0                 0              0
Direct reclaim write async I/O                       0                 0              0
Direct reclaim write sync I/O                        0                 0              0
Wake kswapd requests                                 0                 0              0
Kswapd wakeups                                       0                 0              0
Kswapd pages scanned                                 0                 0              0
Kswapd reclaim write async I/O                       0                 0              0
Kswapd reclaim write sync I/O                        0                 0              0
Time stalled direct reclaim (ms)                  0.00              0.00           0.00
Time kswapd awake (ms)                            0.00              0.00           0.00

User/Sys Time Running Test (seconds)             14.54             14.43          14.51
Percentage Time Spent Direct Reclaim             0.00%             0.00%          0.00%
Total Elapsed Time (seconds)                    106.39            104.49         107.15
Percentage Time kswapd Awake                     0.00%             0.00%          0.00%

No big surprises in terms of performance. I know there are gains and losses
but I've always had trouble getting very stable figures out of IOZone so I
find it hard to draw conclusions from them. I should revisit what I'm doing
there to see what's wrong.

In terms of reclaim, nothing interesting happened.

SysBench
========
     traceonly-v2r5      stackreduce-v2r5    nodirect-v2r5
1    11025.01 ( 0.00%)   10249.52 (-7.57%)   10430.57 (-5.70%)
2     3844.63 ( 0.00%)    4988.95 (22.94%)    4038.95 ( 4.81%)
3     3210.23 ( 0.00%)    2918.52 (-9.99%)    3113.38 (-3.11%)
4     1958.91 ( 0.00%)    1987.69 ( 1.45%)    1808.37 (-8.32%)
5     2864.92 ( 0.00%)    3126.13 ( 8.36%)    2355.70 (-21.62%)
6     4831.63 ( 0.00%)    3815.67 (-26.63%)   4164.09 (-16.03%)
7     3788.37 ( 0.00%)    3140.39 (-20.63%)   3471.36 (-9.13%)
8     2293.61 ( 0.00%)    1636.87 (-40.12%)   1754.25 (-30.75%)

FTrace Reclaim Statistics
                                        traceonly-v2r5  stackreduce-v2r5  nodirect-v2r5
Direct reclaims                                   9843             13398          51651
Direct reclaim pages scanned                    871367           1008709        3080593
Direct reclaim write async I/O                   24883             30699              0
Direct reclaim write sync I/O                        0                 0              0
Wake kswapd requests                           7070819           6961672       11268341
Kswapd wakeups                                    1578              1500            943
Kswapd pages scanned                          22016558          21779455       17393431
Kswapd reclaim write async I/O                 1161346           1101641        1717759
Kswapd reclaim write sync I/O                        0                 0              0
Time stalled direct reclaim (ms)                 26.11             45.04           2.97
Time kswapd awake (ms)                         5105.06           5135.93        6086.32

User/Sys Time Running Test (seconds)            734.52            712.39          703.9
Percentage Time Spent Direct Reclaim             0.00%             0.00%          0.00%
Total Elapsed Time (seconds)                   9710.02           9589.20        9334.45
Percentage Time kswapd Awake                     0.06%             0.00%          0.00%

Unlike other sysbench results I post, this is the result of a read/write
test. As the machine is under-provisioned for the type of tests, figures are
very unstable. For example, for each of the thread counts from 1-8, the test
is run a minimum of 5 times. If the estimated mean is not within 1%, it's
run up to a maximum of 10 times trying to get a stable average. None of
these averages are stable with variances up to 15%. Part of the problem is
that larger thread counts push the test into swap as the memory is
insufficient. I could tune for this, but it was reclaim that was important.

To illustrate though, here is a graph of total io in comparison to swap io
as reported by vmstat.
The update frequency was 2 seconds, so the IO shown in the graph is about
the maximum capacity of the disk for the entire duration of the test, with
swap kicking in every so often. In other words, this was heavily IO bound.

  http://www.csn.ul.ie/~mel/postings/nodirect-20100614/totalio-swapio-comparison.ps

The writing back of dirty pages was a factor. It did happen, but it was a
negligible portion of the overall IO. For example, with just tracing, a
total of 97MB, or 2% of the pages scanned by direct reclaim, was written
back and it was all async IO. The time stalled in direct reclaim was
negligible although, as you'd expect, it was reduced by not writing back at
all.

What is interesting is that kswapd wake requests were raised by direct
reclaim not writing back pages. My theory is that it's because the processes
are making forward progress, meaning they need more memory faster and are
not calling congestion_wait as they would have before. kswapd is awake
longer as a result of direct reclaim not writing back pages.

Between 5-10% of the pages scanned by kswapd need to be written back based
on these three kernels. As the disk was maxed out all of the time, I'm
having trouble deciding whether this is "too much IO" or not but I'm leaning
towards "no". I'd expect that the flusher was also having a hard time
getting IO bandwidth.
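The writeback figures quoted above can be rechecked from the SysBench FTrace
table. The following Python is only a sketch of that arithmetic: the counts
are copied from the table, while the 4 KiB page size is an assumption (the
usual base page size on x86-64) and the helper names are made up here.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages, as is usual on x86-64

# Counts copied from the SysBench FTrace table, per kernel:
# (direct reclaim pages scanned, direct reclaim async writes,
#  kswapd pages scanned, kswapd async writes)
stats = {
    "traceonly-v2r5": (871367, 24883, 22016558, 1161346),
    "stackreduce-v2r5": (1008709, 30699, 21779455, 1101641),
    "nodirect-v2r5": (3080593, 0, 17393431, 1717759),
}


def mib(pages: int) -> float:
    """Convert a page count to MiB under the PAGE_SIZE assumption."""
    return pages * PAGE_SIZE / (1024 * 1024)


def pct(part: int, whole: int) -> float:
    """Return part as a percentage of whole (0.0 if whole is zero)."""
    return 100.0 * part / whole if whole else 0.0


for kernel, (d_scan, d_write, k_scan, k_write) in stats.items():
    print(f"{kernel}: direct wrote {mib(d_write):.1f} MiB "
          f"({pct(d_write, d_scan):.2f}% of scanned), "
          f"kswapd wrote {pct(k_write, k_scan):.2f}% of scanned")
```

With these inputs the traceonly kernel comes out at roughly 97 MiB written
by direct reclaim, a little under 3% of the pages it scanned, and the kswapd
ratio falls between 5% and 10% for all three kernels.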

Simple Writeback Test
=====================
                                        traceonly-v2r5  stackreduce-v2r5  nodirect-v2r5
Direct reclaims                                   1923              2394           2683
Direct reclaim pages scanned                    246400            252256         282896
Direct reclaim write async I/O                       0                 0              0
Direct reclaim write sync I/O                        0                 0              0
Wake kswapd requests                           1496401           1648245        1709870
Kswapd wakeups                                    1140              1118           1113
Kswapd pages scanned                          10999585          10982748       10851473
Kswapd reclaim write async I/O                       0                 0           1398
Kswapd reclaim write sync I/O                        0                 0              0
Time stalled direct reclaim (ms)                  0.01              0.01           0.01
Time kswapd awake (ms)                          293.54            293.68         285.51

User/Sys Time Running Test (seconds)            105.17            102.81         104.34
Percentage Time Spent Direct Reclaim             0.00%             0.00%          0.00%
Total Elapsed Time (seconds)                    638.75            640.63         639.42
Percentage Time kswapd Awake                     0.04%             0.00%          0.00%

This test starts with 4 threads, doubling the number of threads on each
iteration up to 64. Each iteration writes 4*RAM amount of files to disk
using dd to dirty memory and conv=fsync to have some sort of stability in
the results.

Direct reclaim writeback was not a problem for this test even though a
number of pages were scanned so there is no reason not to disable it. kswapd
encountered some dirty pages in the "nodirect" kernel but it's likely a
timing issue.
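The shape of such a test can be sketched as below. This is not the script
that produced the results above: the even split of 4*RAM across threads, the
/dev/zero input, the 1 MiB block size and the output file names are all
assumptions made for illustration. The Python only builds the dd command
lines and does not run them.

```python
def writeback_plan(ram_bytes, start_threads=4, max_threads=64,
                   ram_multiple=4, block_size=1024 * 1024):
    """Build dd command lines for each iteration of a writeback test.

    Threads double from start_threads up to max_threads. Each iteration
    writes ram_multiple * ram_bytes in total, split evenly across threads
    (the even split is an assumption, not taken from the mail).
    """
    plan = []
    threads = start_threads
    while threads <= max_threads:
        total_bytes = ram_multiple * ram_bytes
        blocks_per_thread = total_bytes // (threads * block_size)
        commands = [
            # Hypothetical file naming; conv=fsync makes dd sync the
            # output file before it exits.
            f"dd if=/dev/zero of=zerofile-{threads}-{i} "
            f"bs={block_size} count={blocks_per_thread} conv=fsync"
            for i in range(threads)
        ]
        plan.append((threads, commands))
        threads *= 2
    return plan


# Example: the 2G machine described in this mail.
for threads, commands in writeback_plan(2 * 1024 ** 3):
    print(threads, "threads, first command:", commands[0])
```

Keeping the total at 4*RAM per iteration means the per-thread file size
shrinks as the thread count grows, so every iteration dirties the same
amount of memory overall.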

Stress HighAlloc
================
           stress-highalloc   stress-highalloc   stress-highalloc
             traceonly-v2r5   stackreduce-v2r5      nodirect-v2r5
Pass 1      76.00 ( 0.00%)     77.00 ( 1.00%)     70.00 (-6.00%)
Pass 2      78.00 ( 0.00%)     78.00 ( 0.00%)     73.00 (-5.00%)
At Rest     80.00 ( 0.00%)     79.00 (-1.00%)     78.00 (-2.00%)

FTrace Reclaim Statistics
                                      stress-highalloc  stress-highalloc  stress-highalloc
                                        traceonly-v2r5  stackreduce-v2r5     nodirect-v2r5
Direct reclaims                                   1245              1247              1369
Direct reclaim pages scanned                    180262            177032            164337
Direct reclaim write async I/O                   35211             30075                 0
Direct reclaim write sync I/O                    17994             13127                 0
Wake kswapd requests                              4211              4868              4386
Kswapd wakeups                                     842               808               739
Kswapd pages scanned                          41757111          33360435           9946823
Kswapd reclaim write async I/O                 4872154           3154195            791840
Kswapd reclaim write sync I/O                        0                 0                 0
Time stalled direct reclaim (ms)               9064.33           7249.29           2868.22
Time kswapd awake (ms)                         6250.77           4612.99           1937.64

User/Sys Time Running Test (seconds)           2822.01           2812.45            2629.6
Percentage Time Spent Direct Reclaim             0.10%             0.00%             0.00%
Total Elapsed Time (seconds)                  11365.05           9682.00           5210.19
Percentage Time kswapd Awake                     0.02%             0.00%             0.00%

This test builds a large number of kernels simultaneously so that the total
workload is 1.5 times the size of RAM. It then attempts to allocate all of
RAM as huge pages. The metric is the percentage of memory allocated under
load (Pass 1), a second attempt under load (Pass 2) and when the kernel
compiles have finished and the system is quiet (At Rest).

The success figures were comparable with or without direct reclaim. I know
PPC64's success rates are hit a lot worse than this but I think it could be
improved by other means than what we have today so I'm less worried about
it.

Unlike the other tests, synchronous direct writeback is a factor in this
test because of lumpy reclaim. This increases the stall time of a lumpy
reclaimer by quite a margin.
Compare the "Time stalled direct reclaim" between the vanilla and nodirect
kernel - the nodirect kernel is stalled less than a third of the time.
Interestingly, the time kswapd is awake is significantly reduced as well and
overall, the test completes a lot faster.

Whether this is because of improved IO patterns or just because lumpy
reclaim stalling on synchronous IO is a bad idea, I'm not sure but either
way, the patch looks like a good idea from the perspective of this test.

Based on this series of tests at least, there appear to be good reasons for
preventing direct reclaim writing back pages and it does not appear we are
currently spending a lot of our time in writeback. It remains to be seen if
this is still true when the dirty ratio is higher (e.g. 40) or the amount of
available memory differs (e.g. 256MB) but the trace points and
post-processing script can be used to help figure it out.

Comments?

KOSAKI Motohiro (2):
  vmscan: kill prev_priority completely
  vmscan: simplify shrink_inactive_list()

Mel Gorman (11):
  tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim
  tracing, vmscan: Add trace events for LRU page isolation
  tracing, vmscan: Add trace event when a page is written
  tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events
  vmscan: Remove unnecessary temporary vars in do_try_to_free_pages
  vmscan: Setup pagevec as late as possible in shrink_inactive_list()
  vmscan: Setup pagevec as late as possible in shrink_page_list()
  vmscan: Update isolated page counters outside of main path in shrink_inactive_list()
  vmscan: Write out dirty pages in batch
  vmscan: Do not writeback pages in direct reclaim
  fix: script formatting

 .../trace/postprocess/trace-vmscan-postprocess.pl |  623 ++++++++++++++++++++
 include/linux/mmzone.h                            |   15 -
 include/trace/events/gfpflags.h                   |   37 ++
 include/trace/events/kmem.h                       |   38 +--
 include/trace/events/vmscan.h                     |  184 ++++++
 mm/page_alloc.c                                   |    2 -
 mm/vmscan.c                                       |  622 ++++++++++++--------
 mm/vmstat.c                                       |    2 -
 8 files changed, 1223 insertions(+), 300 deletions(-)
 create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
 create mode 100644 include/trace/events/gfpflags.h
 create mode 100644 include/trace/events/vmscan.h