2010-06-14 11:18:04

by Mel Gorman

Subject: [PATCH 0/12] Avoid overflowing of stack during page reclaim V2

This is a merging of two series - the first of which reduces stack usage
in page reclaim and the second which writes contiguous pages during reclaim
and avoids writeback in direct reclaimers.

Changelog since V1
o Merge with series that reduces stack usage in page reclaim in general
o Allow memcg to writeback pages as they are not expected to overflow stack
o Drop the contiguous-write patch for the moment

There is a problem in the stack depth usage of page reclaim. Particularly
during direct reclaim, it is possible to overflow the stack if it calls
into the filesystem's writepage function. This patch series aims to trace
writebacks so it can be evaluated how many dirty pages are being written,
to reduce the stack usage of page reclaim in general and to avoid direct
reclaim writing back pages and overflowing the stack.

The first 4 patches are a forward-port of trace points that are partly based
on trace points defined by Larry Woodman but never merged. They trace parts of
kswapd, direct reclaim, LRU page isolation and page writeback. The tracepoints
can be used to evaluate what is happening within reclaim and whether things
are getting better or worse. They do not have to be part of the final series
but might be useful during discussion and for later regression testing -
particularly around the percentage of time spent in reclaim.

The 6 patches after that reduce the stack footprint of page reclaim by
moving large allocations out of the main call path. Functionally they
should be equivalent, although there is a slight change in exactly when
pages get freed. This is aimed at giving filesystems as much stack as
possible if kswapd is to write back pages directly.

Patch 11 puts dirty pages, as they are found, onto a temporary list and
then writes them all out with a helper function. This simplifies patch 12
and also increases the chances that IO requests can be optimally merged.

Patch 12 prevents direct reclaim writing out pages at all and instead dirty
pages are put back on the LRU. For lumpy reclaim, the caller will briefly
wait on dirty pages to be written out before trying to reclaim the dirty
pages a second time. This increases the responsibility of kswapd somewhat
because it's now cleaning pages on behalf of direct reclaimers but kswapd
seemed a better fit than background flushers to clean pages as it knows
where the pages needing cleaning are. As it's async IO, it should not cause
kswapd to stall (at least until the queue is congested) but the order that
pages are reclaimed on the LRU is altered. Dirty pages that would have
been reclaimed by direct reclaimers get another lap on the LRU. The
dirty pages could have been put on a dedicated list, but that increased
counter overhead and the number of lists, and it is unclear if it is
necessary.

Apologies for the length of the rest of the mail. Measuring the impact of
this is not exactly straight-forward.

I ran a number of tests with monitoring on X86, X86-64 and PPC64 and I'll
cover what the X86-64 results were here. It's an AMD Phenom 4-core machine
with 2G of RAM, a single disk and the onboard IO controller. Dirty
ratio was left at 20 (tests with 40 are in progress). All the tests
were run on XFS.

Three kernels are compared.

traceonly-v2r5 is the first 4 patches of this series
stackreduce-v2r5 is the first 10 patches of this series
nodirect-v2r5 is all patches in the series

The results for each test are broken up into three parts. The first part
compares the results of the test itself. The second part is a report based
on the ftrace post-processing script in patch 4 and covers direct reclaim
and kswapd activity. The third part reports what percentage of time was
spent in direct reclaim and with kswapd awake.

To work out the percentage of time spent in direct reclaim, I used
/usr/bin/time to get the User + Sys CPU time. The stalled time was taken
from the post-processing script. The total time is (User + Sys + Stall)
and the reported percentage is the stall time over that total time.

kernbench
=========

traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Elapsed min 98.16 ( 0.00%) 97.95 ( 0.21%) 98.25 (-0.09%)
Elapsed mean 98.29 ( 0.00%) 98.26 ( 0.03%) 98.40 (-0.11%)
Elapsed stddev 0.08 ( 0.00%) 0.20 (-165.90%) 0.12 (-59.87%)
Elapsed max 98.34 ( 0.00%) 98.51 (-0.17%) 98.58 (-0.24%)
User min 311.03 ( 0.00%) 311.74 (-0.23%) 311.13 (-0.03%)
User mean 311.28 ( 0.00%) 312.45 (-0.38%) 311.42 (-0.05%)
User stddev 0.24 ( 0.00%) 0.51 (-114.08%) 0.38 (-59.06%)
User max 311.58 ( 0.00%) 312.94 (-0.44%) 312.06 (-0.15%)
System min 40.54 ( 0.00%) 39.65 ( 2.20%) 40.34 ( 0.49%)
System mean 40.80 ( 0.00%) 40.01 ( 1.93%) 40.81 (-0.03%)
System stddev 0.23 ( 0.00%) 0.34 (-47.57%) 0.29 (-25.47%)
System max 41.04 ( 0.00%) 40.51 ( 1.29%) 41.11 (-0.17%)
CPU min 357.00 ( 0.00%) 357.00 ( 0.00%) 357.00 ( 0.00%)
CPU mean 357.75 ( 0.00%) 358.00 (-0.07%) 357.75 ( 0.00%)
CPU stddev 0.43 ( 0.00%) 0.71 (-63.30%) 0.43 ( 0.00%)
CPU max 358.00 ( 0.00%) 359.00 (-0.28%) 358.00 ( 0.00%)

FTrace Reclaim Statistics
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 0 0 0
Direct reclaim pages scanned 0 0 0
Direct reclaim write async I/O 0 0 0
Direct reclaim write sync I/O 0 0 0
Wake kswapd requests 0 0 0
Kswapd wakeups 0 0 0
Kswapd pages scanned 0 0 0
Kswapd reclaim write async I/O 0 0 0
Kswapd reclaim write sync I/O 0 0 0
Time stalled direct reclaim (ms) 0.00 0.00 0.00
Time kswapd awake (ms) 0.00 0.00 0.00

User/Sys Time Running Test (seconds) 2144.58 2146.22 2144.8
Percentage Time Spent Direct Reclaim 0.00% 0.00% 0.00%
Total Elapsed Time (seconds) 788.42 794.85 793.66
Percentage Time kswapd Awake 0.00% 0.00% 0.00%

kernbench is a straightforward kernel compile. The kernel is built 5 times
and an average taken. There was no interesting difference in terms of
performance. As the workload fit easily in memory, there was no page
reclaim activity.

IOZone
======
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
write-64 395452 ( 0.00%) 397208 ( 0.44%) 397796 ( 0.59%)
write-128 463696 ( 0.00%) 460514 (-0.69%) 458940 (-1.04%)
write-256 504861 ( 0.00%) 506050 ( 0.23%) 502969 (-0.38%)
write-512 490875 ( 0.00%) 485767 (-1.05%) 494264 ( 0.69%)
write-1024 497574 ( 0.00%) 489689 (-1.61%) 505956 ( 1.66%)
write-2048 500993 ( 0.00%) 503076 ( 0.41%) 510097 ( 1.78%)
write-4096 504491 ( 0.00%) 502073 (-0.48%) 506993 ( 0.49%)
write-8192 488398 ( 0.00%) 228857 (-113.41%) 313871 (-55.60%)
write-16384 409006 ( 0.00%) 433783 ( 5.71%) 365696 (-11.84%)
write-32768 473136 ( 0.00%) 481153 ( 1.67%) 486373 ( 2.72%)
write-65536 474833 ( 0.00%) 477970 ( 0.66%) 481192 ( 1.32%)
write-131072 429557 ( 0.00%) 452604 ( 5.09%) 459840 ( 6.59%)
write-262144 397934 ( 0.00%) 401955 ( 1.00%) 397479 (-0.11%)
write-524288 222849 ( 0.00%) 230297 ( 3.23%) 209999 (-6.12%)
rewrite-64 1452528 ( 0.00%) 1492919 ( 2.71%) 1452528 ( 0.00%)
rewrite-128 1622919 ( 0.00%) 1618028 (-0.30%) 1663139 ( 2.42%)
rewrite-256 1694118 ( 0.00%) 1704877 ( 0.63%) 1639786 (-3.31%)
rewrite-512 1753325 ( 0.00%) 1740536 (-0.73%) 1730717 (-1.31%)
rewrite-1024 1741104 ( 0.00%) 1787480 ( 2.59%) 1759651 ( 1.05%)
rewrite-2048 1710867 ( 0.00%) 1747411 ( 2.09%) 1747411 ( 2.09%)
rewrite-4096 1583280 ( 0.00%) 1621536 ( 2.36%) 1599942 ( 1.04%)
rewrite-8192 1308005 ( 0.00%) 1338579 ( 2.28%) 1307358 (-0.05%)
rewrite-16384 1293742 ( 0.00%) 1314178 ( 1.56%) 1291602 (-0.17%)
rewrite-32768 1298360 ( 0.00%) 1314503 ( 1.23%) 1276758 (-1.69%)
rewrite-65536 1289212 ( 0.00%) 1316088 ( 2.04%) 1281351 (-0.61%)
rewrite-131072 1286117 ( 0.00%) 1309070 ( 1.75%) 1283007 (-0.24%)
rewrite-262144 1285902 ( 0.00%) 1305816 ( 1.53%) 1274121 (-0.92%)
rewrite-524288 220417 ( 0.00%) 223971 ( 1.59%) 226133 ( 2.53%)
read-64 3203069 ( 0.00%) 2467108 (-29.83%) 3541098 ( 9.55%)
read-128 3759450 ( 0.00%) 4267461 (11.90%) 4233807 (11.20%)
read-256 4264168 ( 0.00%) 4350555 ( 1.99%) 3935921 (-8.34%)
read-512 3437042 ( 0.00%) 3366987 (-2.08%) 3437042 ( 0.00%)
read-1024 3738636 ( 0.00%) 3821805 ( 2.18%) 3735385 (-0.09%)
read-2048 3938881 ( 0.00%) 3984558 ( 1.15%) 3967993 ( 0.73%)
read-4096 3631489 ( 0.00%) 3828122 ( 5.14%) 3781775 ( 3.97%)
read-8192 3175046 ( 0.00%) 3230268 ( 1.71%) 3247058 ( 2.22%)
read-16384 2923635 ( 0.00%) 2869911 (-1.87%) 2954684 ( 1.05%)
read-32768 2819040 ( 0.00%) 2839776 ( 0.73%) 2852152 ( 1.16%)
read-65536 2659324 ( 0.00%) 2827502 ( 5.95%) 2816464 ( 5.58%)
read-131072 2707652 ( 0.00%) 2727534 ( 0.73%) 2746406 ( 1.41%)
read-262144 2765929 ( 0.00%) 2782166 ( 0.58%) 2776125 ( 0.37%)
read-524288 2810803 ( 0.00%) 2822894 ( 0.43%) 2822394 ( 0.41%)
reread-64 5389653 ( 0.00%) 5860307 ( 8.03%) 5735102 ( 6.02%)
reread-128 5122535 ( 0.00%) 5325799 ( 3.82%) 5325799 ( 3.82%)
reread-256 3245838 ( 0.00%) 3285566 ( 1.21%) 3236056 (-0.30%)
reread-512 4340054 ( 0.00%) 4571003 ( 5.05%) 4742616 ( 8.49%)
reread-1024 4265934 ( 0.00%) 4356809 ( 2.09%) 4374559 ( 2.48%)
reread-2048 3915540 ( 0.00%) 4301837 ( 8.98%) 4338776 ( 9.75%)
reread-4096 3846119 ( 0.00%) 3984379 ( 3.47%) 3979764 ( 3.36%)
reread-8192 3257215 ( 0.00%) 3304518 ( 1.43%) 3325949 ( 2.07%)
reread-16384 2959519 ( 0.00%) 2892622 (-2.31%) 2995773 ( 1.21%)
reread-32768 2570835 ( 0.00%) 2607266 ( 1.40%) 2610783 ( 1.53%)
reread-65536 2731466 ( 0.00%) 2683809 (-1.78%) 2691957 (-1.47%)
reread-131072 2738144 ( 0.00%) 2763886 ( 0.93%) 2776056 ( 1.37%)
reread-262144 2781012 ( 0.00%) 2781786 ( 0.03%) 2784322 ( 0.12%)
reread-524288 2787049 ( 0.00%) 2779851 (-0.26%) 2787681 ( 0.02%)
randread-64 1204796 ( 0.00%) 1204796 ( 0.00%) 1143223 (-5.39%)
randread-128 4135958 ( 0.00%) 4012317 (-3.08%) 4135958 ( 0.00%)
randread-256 3454704 ( 0.00%) 3511189 ( 1.61%) 3511189 ( 1.61%)
randread-512 3437042 ( 0.00%) 3366987 (-2.08%) 3437042 ( 0.00%)
randread-1024 3301774 ( 0.00%) 3401130 ( 2.92%) 3403826 ( 3.00%)
randread-2048 3549844 ( 0.00%) 3391470 (-4.67%) 3477979 (-2.07%)
randread-4096 3214912 ( 0.00%) 3261295 ( 1.42%) 3258820 ( 1.35%)
randread-8192 2818958 ( 0.00%) 2836645 ( 0.62%) 2861450 ( 1.48%)
randread-16384 2571662 ( 0.00%) 2515924 (-2.22%) 2564465 (-0.28%)
randread-32768 2319848 ( 0.00%) 2331892 ( 0.52%) 2333594 ( 0.59%)
randread-65536 2288193 ( 0.00%) 2297103 ( 0.39%) 2301123 ( 0.56%)
randread-131072 2270669 ( 0.00%) 2275707 ( 0.22%) 2279150 ( 0.37%)
randread-262144 2258949 ( 0.00%) 2264700 ( 0.25%) 2259975 ( 0.05%)
randread-524288 2250529 ( 0.00%) 2240365 (-0.45%) 2242837 (-0.34%)
randwrite-64 942521 ( 0.00%) 939223 (-0.35%) 939223 (-0.35%)
randwrite-128 962469 ( 0.00%) 971174 ( 0.90%) 969421 ( 0.72%)
randwrite-256 980760 ( 0.00%) 980760 ( 0.00%) 966633 (-1.46%)
randwrite-512 1190529 ( 0.00%) 1138158 (-4.60%) 1040545 (-14.41%)
randwrite-1024 1361836 ( 0.00%) 1361836 ( 0.00%) 1367037 ( 0.38%)
randwrite-2048 1325646 ( 0.00%) 1364390 ( 2.84%) 1361794 ( 2.65%)
randwrite-4096 1360371 ( 0.00%) 1372653 ( 0.89%) 1363935 ( 0.26%)
randwrite-8192 1291680 ( 0.00%) 1305272 ( 1.04%) 1285206 (-0.50%)
randwrite-16384 1253666 ( 0.00%) 1255865 ( 0.18%) 1231889 (-1.77%)
randwrite-32768 1239139 ( 0.00%) 1250641 ( 0.92%) 1224239 (-1.22%)
randwrite-65536 1223186 ( 0.00%) 1228115 ( 0.40%) 1203094 (-1.67%)
randwrite-131072 1207002 ( 0.00%) 1215733 ( 0.72%) 1198229 (-0.73%)
randwrite-262144 1184954 ( 0.00%) 1201542 ( 1.38%) 1179145 (-0.49%)
randwrite-524288 96156 ( 0.00%) 96502 ( 0.36%) 95942 (-0.22%)
bkwdread-64 2923952 ( 0.00%) 3022727 ( 3.27%) 2772930 (-5.45%)
bkwdread-128 3785961 ( 0.00%) 3657016 (-3.53%) 3657016 (-3.53%)
bkwdread-256 3017775 ( 0.00%) 3052087 ( 1.12%) 3159869 ( 4.50%)
bkwdread-512 2875558 ( 0.00%) 2845081 (-1.07%) 2841317 (-1.21%)
bkwdread-1024 3083680 ( 0.00%) 3181915 ( 3.09%) 3140041 ( 1.79%)
bkwdread-2048 3266376 ( 0.00%) 3282603 ( 0.49%) 3281349 ( 0.46%)
bkwdread-4096 3207709 ( 0.00%) 3287506 ( 2.43%) 3248345 ( 1.25%)
bkwdread-8192 2777710 ( 0.00%) 2792156 ( 0.52%) 2777036 (-0.02%)
bkwdread-16384 2565614 ( 0.00%) 2570412 ( 0.19%) 2541795 (-0.94%)
bkwdread-32768 2472332 ( 0.00%) 2495631 ( 0.93%) 2284260 (-8.23%)
bkwdread-65536 2435202 ( 0.00%) 2391036 (-1.85%) 2361477 (-3.12%)
bkwdread-131072 2417850 ( 0.00%) 2453436 ( 1.45%) 2417903 ( 0.00%)
bkwdread-262144 2468467 ( 0.00%) 2491433 ( 0.92%) 2446649 (-0.89%)
bkwdread-524288 2513411 ( 0.00%) 2534789 ( 0.84%) 2486486 (-1.08%)

FTrace Reclaim Statistics
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 0 0 0
Direct reclaim pages scanned 0 0 0
Direct reclaim write async I/O 0 0 0
Direct reclaim write sync I/O 0 0 0
Wake kswapd requests 0 0 0
Kswapd wakeups 0 0 0
Kswapd pages scanned 0 0 0
Kswapd reclaim write async I/O 0 0 0
Kswapd reclaim write sync I/O 0 0 0
Time stalled direct reclaim (ms) 0.00 0.00 0.00
Time kswapd awake (ms) 0.00 0.00 0.00

User/Sys Time Running Test (seconds) 14.54 14.43 14.51
Percentage Time Spent Direct Reclaim 0.00% 0.00% 0.00%
Total Elapsed Time (seconds) 106.39 104.49 107.15
Percentage Time kswapd Awake 0.00% 0.00% 0.00%

No big surprises in terms of performance. I know there are gains and losses
but I've always had trouble getting very stable figures out of IOZone so
I find it hard to draw conclusions from them. I should revisit what I'm
doing there to see what's wrong.

In terms of reclaim, nothing interesting happened.

SysBench
========
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
1 11025.01 ( 0.00%) 10249.52 (-7.57%) 10430.57 (-5.70%)
2 3844.63 ( 0.00%) 4988.95 (22.94%) 4038.95 ( 4.81%)
3 3210.23 ( 0.00%) 2918.52 (-9.99%) 3113.38 (-3.11%)
4 1958.91 ( 0.00%) 1987.69 ( 1.45%) 1808.37 (-8.32%)
5 2864.92 ( 0.00%) 3126.13 ( 8.36%) 2355.70 (-21.62%)
6 4831.63 ( 0.00%) 3815.67 (-26.63%) 4164.09 (-16.03%)
7 3788.37 ( 0.00%) 3140.39 (-20.63%) 3471.36 (-9.13%)
8 2293.61 ( 0.00%) 1636.87 (-40.12%) 1754.25 (-30.75%)

FTrace Reclaim Statistics
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 9843 13398 51651
Direct reclaim pages scanned 871367 1008709 3080593
Direct reclaim write async I/O 24883 30699 0
Direct reclaim write sync I/O 0 0 0
Wake kswapd requests 7070819 6961672 11268341
Kswapd wakeups 1578 1500 943
Kswapd pages scanned 22016558 21779455 17393431
Kswapd reclaim write async I/O 1161346 1101641 1717759
Kswapd reclaim write sync I/O 0 0 0
Time stalled direct reclaim (ms) 26.11 45.04 2.97
Time kswapd awake (ms) 5105.06 5135.93 6086.32

User/Sys Time Running Test (seconds) 734.52 712.39 703.9
Percentage Time Spent Direct Reclaim 0.00% 0.00% 0.00%
Total Elapsed Time (seconds) 9710.02 9589.20 9334.45
Percentage Time kswapd Awake 0.06% 0.00% 0.00%

Unlike other sysbench results I post, this is the result of a read/write
test. As the machine is under-provisioned for the type of tests, figures
are very unstable. For example, for each of the thread counts from 1-8, the
test is run a minimum of 5 times. If the estimated mean is not within 1%,
it's run up to a maximum of 10 times trying to get a stable average. None
of these averages are stable with variances up to 15%. Part of the problem
is that larger thread counts push the test into swap as the memory is
insufficient. I could tune for this, but it was reclaim that was important.

To illustrate, though, here is a graph of total IO compared with swap IO
as reported by vmstat. The update frequency was 2 seconds. The IO shown in
the graph is about the maximum capacity of the disk for the entire duration
of the test, with swap kicking in every so often, so this was heavily IO
bound.

http://www.csn.ul.ie/~mel/postings/nodirect-20100614/totalio-swapio-comparison.ps

The writing back of dirty pages was a factor. It did happen, but it was
a negligible portion of the overall IO. For example, with just tracing, a
total of 97MB, or 2% of the pages scanned by direct reclaim, was written
back, and it was all async IO. The time stalled in direct reclaim was
negligible although, as you'd expect, it was reduced by not writing back
at all.

What is interesting is that kswapd wake requests were raised by direct
reclaim not writing back pages. My theory is that it's because the
processes are making forward progress, meaning they need more memory faster
and are not calling congestion_wait as they would have before. kswapd is
awake longer as a result of direct reclaim not writing back pages.

Between 5-10% of the pages scanned by kswapd needed to be written back in
these three kernels. As the disk was maxed out all of the time, I'm having
trouble deciding whether this is "too much IO" or not, but I'm leaning
towards "no". I'd expect that the flusher was also having trouble getting
IO bandwidth.

Simple Writeback Test
=====================
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 1923 2394 2683
Direct reclaim pages scanned 246400 252256 282896
Direct reclaim write async I/O 0 0 0
Direct reclaim write sync I/O 0 0 0
Wake kswapd requests 1496401 1648245 1709870
Kswapd wakeups 1140 1118 1113
Kswapd pages scanned 10999585 10982748 10851473
Kswapd reclaim write async I/O 0 0 1398
Kswapd reclaim write sync I/O 0 0 0
Time stalled direct reclaim (ms) 0.01 0.01 0.01
Time kswapd awake (ms) 293.54 293.68 285.51

User/Sys Time Running Test (seconds) 105.17 102.81 104.34
Percentage Time Spent Direct Reclaim 0.00% 0.00% 0.00%
Total Elapsed Time (seconds) 638.75 640.63 639.42
Percentage Time kswapd Awake 0.04% 0.00% 0.00%

This test starts with 4 threads, doubling the number of threads on each
iteration up to 64. Each iteration writes 4*RAM worth of files to disk
using dd to dirty memory, with conv=fsync to get some sort of stability
in the results.

Direct reclaim writeback was not a problem for this test even though a
number of pages were scanned, so there is no reason not to disable it.
kswapd wrote back some pages in the "nodirect" kernel but it's likely a
timing issue.

Stress HighAlloc
================
stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Pass 1 76.00 ( 0.00%) 77.00 ( 1.00%) 70.00 (-6.00%)
Pass 2 78.00 ( 0.00%) 78.00 ( 0.00%) 73.00 (-5.00%)
At Rest 80.00 ( 0.00%) 79.00 (-1.00%) 78.00 (-2.00%)

FTrace Reclaim Statistics
stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 1245 1247 1369
Direct reclaim pages scanned 180262 177032 164337
Direct reclaim write async I/O 35211 30075 0
Direct reclaim write sync I/O 17994 13127 0
Wake kswapd requests 4211 4868 4386
Kswapd wakeups 842 808 739
Kswapd pages scanned 41757111 33360435 9946823
Kswapd reclaim write async I/O 4872154 3154195 791840
Kswapd reclaim write sync I/O 0 0 0
Time stalled direct reclaim (ms) 9064.33 7249.29 2868.22
Time kswapd awake (ms) 6250.77 4612.99 1937.64

User/Sys Time Running Test (seconds) 2822.01 2812.45 2629.6
Percentage Time Spent Direct Reclaim 0.10% 0.00% 0.00%
Total Elapsed Time (seconds) 11365.05 9682.00 5210.19
Percentage Time kswapd Awake 0.02% 0.00% 0.00%

This test builds a large number of kernels simultaneously so that the
total workload is 1.5 times the size of RAM. It then attempts to allocate
all of RAM as huge pages. The metric is the percentage of memory allocated
as huge pages under load (Pass 1), on a second attempt under load (Pass 2)
and when the kernel compiles have finished and the system is quiet (At
Rest).

The success figures were comparable with or without direct reclaim. I know
PPC64's success rates are hit a lot worse than this, but I think that could
be improved by other means than what we have today so I'm less worried
about it.

Unlike the other tests, synchronous direct writeback is a factor in this
test because of lumpy reclaim. This increases the stall time of a lumpy
reclaimer by quite a margin. Compare the "Time stalled direct reclaim"
between the vanilla and nodirect kernels - the nodirect kernel is stalled
less than a third of the time. Interestingly, the time kswapd is awake is
significantly reduced as well and, overall, the test completes a lot
faster.

Whether this is because of improved IO patterns or just because lumpy
reclaim stalling on synchronous IO is a bad idea, I'm not sure but either
way, the patch looks like a good idea from the perspective of this test.

Based on this series of tests at least, there appear to be good reasons
for preventing direct reclaim writing back pages, and it does not appear
we are currently spending a lot of our time in writeback. It remains to
be seen if this still holds when dirty ratio is higher (e.g. 40) or when
the amount of available memory differs (e.g. 256MB), but the trace points
and post-processing script can be used to help figure that out.

Comments?

KOSAKI Motohiro (2):
vmscan: kill prev_priority completely
vmscan: simplify shrink_inactive_list()

Mel Gorman (11):
tracing, vmscan: Add trace events for kswapd wakeup, sleeping and
direct reclaim
tracing, vmscan: Add trace events for LRU page isolation
tracing, vmscan: Add trace event when a page is written
tracing, vmscan: Add a postprocessing script for reclaim-related
ftrace events
vmscan: Remove unnecessary temporary vars in do_try_to_free_pages
vmscan: Setup pagevec as late as possible in shrink_inactive_list()
vmscan: Setup pagevec as late as possible in shrink_page_list()
vmscan: Update isolated page counters outside of main path in
shrink_inactive_list()
vmscan: Write out dirty pages in batch
vmscan: Do not writeback pages in direct reclaim
fix: script formatting

.../trace/postprocess/trace-vmscan-postprocess.pl | 623 ++++++++++++++++++++
include/linux/mmzone.h | 15 -
include/trace/events/gfpflags.h | 37 ++
include/trace/events/kmem.h | 38 +--
include/trace/events/vmscan.h | 184 ++++++
mm/page_alloc.c | 2 -
mm/vmscan.c | 622 ++++++++++++--------
mm/vmstat.c | 2 -
8 files changed, 1223 insertions(+), 300 deletions(-)
create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl
create mode 100644 include/trace/events/gfpflags.h
create mode 100644 include/trace/events/vmscan.h


2010-06-14 11:18:11

by Mel Gorman

Subject: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

When memory is under enough pressure, a process may enter direct reclaim
to free pages in the same manner kswapd does. If a dirty page is
encountered during the scan, this page is written to backing storage using
mapping->writepage. This can result in very deep call stacks, particularly
if the target storage or filesystem is complex. It has already been
observed on XFS that the stack overflows, but the problem is not
XFS-specific.

This patch prevents direct reclaim writing back pages by not setting
may_writepage in scan_control. Instead, dirty pages are placed back on the
LRU lists for either background writing by the BDI threads or kswapd. If
in direct lumpy reclaim and dirty pages are encountered, the process will
kick the background flusher threads before trying again.

Memory control groups do not have a kswapd-like thread, nor do their pages
get directly reclaimed from the page allocator. Instead, memory control
group pages are reclaimed when the quota is being exceeded or the group is
being shrunk. As the entry points into memcg page reclaim are not expected
to be deep call chains, memcg is still allowed to write back dirty pages.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 76 ++++++++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 57 insertions(+), 19 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4856a2a..574e816 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -372,6 +372,12 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
return PAGE_SUCCESS;
}

+/* kswapd and memcg can writeback as they are unlikely to overflow stack */
+static inline bool reclaim_can_writeback(struct scan_control *sc)
+{
+ return current_is_kswapd() || sc->mem_cgroup != NULL;
+}
+
/*
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
@@ -701,6 +707,9 @@ static noinline_for_stack void clean_page_list(struct list_head *page_list,
list_splice(&ret_pages, page_list);
}

+/* Direct lumpy reclaim waits up to a second for background cleaning */
+#define MAX_SWAP_CLEAN_WAIT 10
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -711,15 +720,16 @@ static unsigned long shrink_page_list(struct list_head *page_list,
LIST_HEAD(free_pages);
LIST_HEAD(putback_pages);
LIST_HEAD(dirty_pages);
- struct list_head *ret_list = page_list;
int pgactivate;
- bool cleaned = false;
+ int cleaned = 0;
+ unsigned long nr_dirty;
unsigned long nr_reclaimed = 0;

pgactivate = 0;
cond_resched();

restart_dirty:
+ nr_dirty = 0;
while (!list_empty(page_list)) {
enum page_references references;
struct address_space *mapping;
@@ -811,12 +821,17 @@ restart_dirty:
if (PageDirty(page)) {
/*
* On the first pass, dirty pages are put on a separate
- * list. IO is then queued based on ranges of pages for
- * each unique mapping in the list
+ * list. If kswapd, IO is then queued based on ranges of
+ * pages for each unique mapping in the list. Direct
+ * reclaimers put the dirty pages back on the list for
+ * cleaning by kswapd
*/
- if (!cleaned) {
- /* Keep locked for clean_page_list */
+ if (cleaned < MAX_SWAP_CLEAN_WAIT) {
+ /* Keep locked for kswapd to call clean_page_list */
+ if (!reclaim_can_writeback(sc))
+ unlock_page(page);
list_add(&page->lru, &dirty_pages);
+ nr_dirty++;
goto keep_dirty;
}

@@ -935,17 +950,41 @@ keep_dirty:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}

- if (!cleaned && !list_empty(&dirty_pages)) {
- clean_page_list(&dirty_pages, sc);
- page_list = &dirty_pages;
- cleaned = true;
- goto restart_dirty;
+ if (cleaned < MAX_SWAP_CLEAN_WAIT && !list_empty(&dirty_pages)) {
+ /*
+ * Only kswapd cleans pages. Direct reclaimers entering the filesystem
+ * potentially splices two expensive call-chains and busts the stack
+ * so instead they go to sleep to give background cleaning a chance
+ */
+ list_splice(&dirty_pages, page_list);
+ INIT_LIST_HEAD(&dirty_pages);
+ if (reclaim_can_writeback(sc)) {
+ cleaned = MAX_SWAP_CLEAN_WAIT;
+ clean_page_list(page_list, sc);
+ goto restart_dirty;
+ } else {
+ cleaned++;
+ /*
+ * If lumpy reclaiming, kick the background flusher and wait
+ * for the pages to be cleaned
+ *
+ * XXX: kswapd won't find these isolated pages but the
+ * background flusher does not prioritise pages. It'd
+ * be nice to prioritise a list of pages somehow
+ */
+ if (sync_writeback == PAGEOUT_IO_SYNC) {
+ wakeup_flusher_threads(nr_dirty);
+ congestion_wait(BLK_RW_ASYNC, HZ/10);
+ goto restart_dirty;
+ }
+ }
}
- BUG_ON(!list_empty(&dirty_pages));

free_page_list(&free_pages);

- list_splice(&putback_pages, ret_list);
+ list_splice(&dirty_pages, page_list);
+ list_splice(&putback_pages, page_list);
+
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
@@ -1954,10 +1993,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
* writeout. So in laptop mode, write out the whole world.
*/
writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
- if (total_scanned > writeback_threshold) {
+ if (total_scanned > writeback_threshold)
wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
- sc->may_writepage = 1;
- }

/* Take a nap, wait for some writeback to complete */
if (!sc->hibernation_mode && sc->nr_scanned &&
@@ -1995,7 +2032,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
unsigned long nr_reclaimed;
struct scan_control sc = {
.gfp_mask = gfp_mask,
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.nr_to_reclaim = SWAP_CLUSTER_MAX,
.may_unmap = 1,
.may_swap = 1,
@@ -2024,7 +2061,7 @@ unsigned long mem_cgroup_shrink_node_zone(struct mem_cgroup *mem,
struct zone *zone, int nid)
{
struct scan_control sc = {
- .may_writepage = !laptop_mode,
+ .may_writepage = 0,
.may_unmap = 1,
.may_swap = !noswap,
.swappiness = swappiness,
@@ -2676,7 +2713,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
struct reclaim_state reclaim_state;
int priority;
struct scan_control sc = {
- .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
+ .may_writepage = (current_is_kswapd() &&
+ (zone_reclaim_mode & RECLAIM_WRITE)),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
.may_swap = 1,
.nr_to_reclaim = max_t(unsigned long, nr_pages,
--
1.7.1

2010-06-14 11:18:08

by Mel Gorman

Subject: [PATCH 06/12] vmscan: simplify shrink_inactive_list()

From: KOSAKI Motohiro <[email protected]>

Now, max_scan of shrink_inactive_list() is always passed as no more than
SWAP_CLUSTER_MAX, so we can remove the page scanning loop in it.
This patch also helps reduce stack usage.

detail
- remove "while (nr_scanned < max_scan)" loop
- remove nr_freed (now, we use nr_reclaimed directly)
- remove nr_scan (now, we use nr_scanned directly)
- rename max_scan to nr_to_scan
- pass nr_to_scan into isolate_pages() directly instead of using
SWAP_CLUSTER_MAX

Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 211 ++++++++++++++++++++++++++++-------------------------------
1 files changed, 100 insertions(+), 111 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 29e1ecd..6a42ebc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1132,15 +1132,21 @@ static int too_many_isolated(struct zone *zone, int file,
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
*/
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc,
int priority, int file)
{
LIST_HEAD(page_list);
struct pagevec pvec;
- unsigned long nr_scanned = 0;
+ unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+ struct page *page;
+ unsigned long nr_taken;
+ unsigned long nr_active;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+ unsigned long nr_anon;
+ unsigned long nr_file;

while (unlikely(too_many_isolated(zone, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1155,129 +1161,112 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,

lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- do {
- struct page *page;
- unsigned long nr_taken;
- unsigned long nr_scan;
- unsigned long nr_freed;
- unsigned long nr_active;
- unsigned int count[NR_LRU_LISTS] = { 0, };
- int mode = sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE;
- unsigned long nr_anon;
- unsigned long nr_file;

- if (scanning_global_lru(sc)) {
- nr_taken = isolate_pages_global(SWAP_CLUSTER_MAX,
- &page_list, &nr_scan,
- sc->order, mode,
- zone, 0, file);
- zone->pages_scanned += nr_scan;
- if (current_is_kswapd())
- __count_zone_vm_events(PGSCAN_KSWAPD, zone,
- nr_scan);
- else
- __count_zone_vm_events(PGSCAN_DIRECT, zone,
- nr_scan);
- } else {
- nr_taken = mem_cgroup_isolate_pages(SWAP_CLUSTER_MAX,
- &page_list, &nr_scan,
- sc->order, mode,
- zone, sc->mem_cgroup,
- 0, file);
- /*
- * mem_cgroup_isolate_pages() keeps track of
- * scanned pages on its own.
- */
- }
+ if (scanning_global_lru(sc)) {
+ nr_taken = isolate_pages_global(nr_to_scan,
+ &page_list, &nr_scanned, sc->order,
+ sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE,
+ zone, 0, file);
+ zone->pages_scanned += nr_scanned;
+ if (current_is_kswapd())
+ __count_zone_vm_events(PGSCAN_KSWAPD, zone,
+ nr_scanned);
+ else
+ __count_zone_vm_events(PGSCAN_DIRECT, zone,
+ nr_scanned);
+ } else {
+ nr_taken = mem_cgroup_isolate_pages(nr_to_scan,
+ &page_list, &nr_scanned, sc->order,
+ sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE,
+ zone, sc->mem_cgroup,
+ 0, file);
+ /*
+ * mem_cgroup_isolate_pages() keeps track of
+ * scanned pages on its own.
+ */
+ }

- if (nr_taken == 0)
- goto done;
+ if (nr_taken == 0)
+ goto done;

- nr_active = clear_active_flags(&page_list, count);
- __count_vm_events(PGDEACTIVATE, nr_active);
+ nr_active = clear_active_flags(&page_list, count);
+ __count_vm_events(PGDEACTIVATE, nr_active);

- __mod_zone_page_state(zone, NR_ACTIVE_FILE,
- -count[LRU_ACTIVE_FILE]);
- __mod_zone_page_state(zone, NR_INACTIVE_FILE,
- -count[LRU_INACTIVE_FILE]);
- __mod_zone_page_state(zone, NR_ACTIVE_ANON,
- -count[LRU_ACTIVE_ANON]);
- __mod_zone_page_state(zone, NR_INACTIVE_ANON,
- -count[LRU_INACTIVE_ANON]);
+ __mod_zone_page_state(zone, NR_ACTIVE_FILE,
+ -count[LRU_ACTIVE_FILE]);
+ __mod_zone_page_state(zone, NR_INACTIVE_FILE,
+ -count[LRU_INACTIVE_FILE]);
+ __mod_zone_page_state(zone, NR_ACTIVE_ANON,
+ -count[LRU_ACTIVE_ANON]);
+ __mod_zone_page_state(zone, NR_INACTIVE_ANON,
+ -count[LRU_INACTIVE_ANON]);

- nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
- nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
- __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
+ nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+ nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);

- reclaim_stat->recent_scanned[0] += nr_anon;
- reclaim_stat->recent_scanned[1] += nr_file;
+ reclaim_stat->recent_scanned[0] += nr_anon;
+ reclaim_stat->recent_scanned[1] += nr_file;

- spin_unlock_irq(&zone->lru_lock);
+ spin_unlock_irq(&zone->lru_lock);

- nr_scanned += nr_scan;
- nr_freed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+ nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
+
+ /*
+ * If we are direct reclaiming for contiguous pages and we do
+ * not reclaim everything in the list, try again and wait
+ * for IO to complete. This will stall high-order allocations
+ * but that should be acceptable to the caller
+ */
+ if (nr_reclaimed < nr_taken && !current_is_kswapd() && sc->lumpy_reclaim_mode) {
+ congestion_wait(BLK_RW_ASYNC, HZ/10);

/*
- * If we are direct reclaiming for contiguous pages and we do
- * not reclaim everything in the list, try again and wait
- * for IO to complete. This will stall high-order allocations
- * but that should be acceptable to the caller
+ * The attempt at page out may have made some
+ * of the pages active, mark them inactive again.
*/
- if (nr_freed < nr_taken && !current_is_kswapd() &&
- sc->lumpy_reclaim_mode) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
-
- /*
- * The attempt at page out may have made some
- * of the pages active, mark them inactive again.
- */
- nr_active = clear_active_flags(&page_list, count);
- count_vm_events(PGDEACTIVATE, nr_active);
-
- nr_freed += shrink_page_list(&page_list, sc,
- PAGEOUT_IO_SYNC);
- }
+ nr_active = clear_active_flags(&page_list, count);
+ count_vm_events(PGDEACTIVATE, nr_active);

- nr_reclaimed += nr_freed;
+ nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
+ }

- local_irq_disable();
- if (current_is_kswapd())
- __count_vm_events(KSWAPD_STEAL, nr_freed);
- __count_zone_vm_events(PGSTEAL, zone, nr_freed);
+ local_irq_disable();
+ if (current_is_kswapd())
+ __count_vm_events(KSWAPD_STEAL, nr_reclaimed);
+ __count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);

- spin_lock(&zone->lru_lock);
- /*
- * Put back any unfreeable pages.
- */
- while (!list_empty(&page_list)) {
- int lru;
- page = lru_to_page(&page_list);
- VM_BUG_ON(PageLRU(page));
- list_del(&page->lru);
- if (unlikely(!page_evictable(page, NULL))) {
- spin_unlock_irq(&zone->lru_lock);
- putback_lru_page(page);
- spin_lock_irq(&zone->lru_lock);
- continue;
- }
- SetPageLRU(page);
- lru = page_lru(page);
- add_page_to_lru_list(zone, page, lru);
- if (is_active_lru(lru)) {
- int file = is_file_lru(lru);
- reclaim_stat->recent_rotated[file]++;
- }
- if (!pagevec_add(&pvec, page)) {
- spin_unlock_irq(&zone->lru_lock);
- __pagevec_release(&pvec);
- spin_lock_irq(&zone->lru_lock);
- }
+ spin_lock(&zone->lru_lock);
+ /*
+ * Put back any unfreeable pages.
+ */
+ while (!list_empty(&page_list)) {
+ int lru;
+ page = lru_to_page(&page_list);
+ VM_BUG_ON(PageLRU(page));
+ list_del(&page->lru);
+ if (unlikely(!page_evictable(page, NULL))) {
+ spin_unlock_irq(&zone->lru_lock);
+ putback_lru_page(page);
+ spin_lock_irq(&zone->lru_lock);
+ continue;
}
- __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
-
- } while (nr_scanned < max_scan);
+ SetPageLRU(page);
+ lru = page_lru(page);
+ add_page_to_lru_list(zone, page, lru);
+ if (is_active_lru(lru)) {
+ int file = is_file_lru(lru);
+ reclaim_stat->recent_rotated[file]++;
+ }
+ if (!pagevec_add(&pvec, page)) {
+ spin_unlock_irq(&zone->lru_lock);
+ __pagevec_release(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+ }
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);

done:
spin_unlock_irq(&zone->lru_lock);
--
1.7.1

2010-06-14 11:18:10

by Mel Gorman

Subject: [PATCH 07/12] vmscan: Remove unnecessary temporary vars in do_try_to_free_pages

Remove temporary variables that are used only once and do not help to
clarify the code.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 8 +++-----
1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6a42ebc..246e084 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1721,13 +1721,12 @@ static void shrink_zone(int priority, struct zone *zone,
static bool shrink_zones(int priority, struct zonelist *zonelist,
struct scan_control *sc)
{
- enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
struct zoneref *z;
struct zone *zone;
bool all_unreclaimable = true;

- for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
- sc->nodemask) {
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ gfp_zone(sc->gfp_mask), sc->nodemask) {
if (!populated_zone(zone))
continue;
/*
@@ -1773,7 +1772,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
unsigned long lru_pages = 0;
struct zoneref *z;
struct zone *zone;
- enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
unsigned long writeback_threshold;

get_mems_allowed();
@@ -1785,7 +1783,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
* mem_cgroup will not do shrink_slab.
*/
if (scanning_global_lru(sc)) {
- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+ for_each_zone_zonelist(zone, z, zonelist, gfp_zone(sc->gfp_mask)) {

if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
--
1.7.1

2010-06-14 11:18:06

by Mel Gorman

Subject: [PATCH 03/12] tracing, vmscan: Add trace event when a page is written

This patch adds a trace event for when page reclaim queues a page for IO,
recording whether the IO is synchronous or asynchronous. Excessive synchronous
IO for a process can result in noticeable stalls during direct reclaim.
Excessive IO from page reclaim may indicate that the system is seriously
under-provisioned for the number of dirty pages that exist.

Signed-off-by: Mel Gorman <[email protected]>
---
include/trace/events/vmscan.h | 23 +++++++++++++++++++++++
mm/vmscan.c | 2 ++
2 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index a331454..b26daa9 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -154,6 +154,29 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
__entry->nr_lumpy_dirty,
__entry->nr_lumpy_failed)
);
+
+TRACE_EVENT(mm_vmscan_writepage,
+
+ TP_PROTO(struct page *page,
+ int sync_io),
+
+ TP_ARGS(page, sync_io),
+
+ TP_STRUCT__entry(
+ __field(struct page *, page)
+ __field(int, sync_io)
+ ),
+
+ TP_fast_assign(
+ __entry->page = page;
+ __entry->sync_io = sync_io;
+ ),
+
+ TP_printk("page=%p pfn=%lu sync_io=%d",
+ __entry->page,
+ page_to_pfn(__entry->page),
+ __entry->sync_io)
+);

#endif /* _TRACE_VMSCAN_H */

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 25bf05a..58527c4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -399,6 +399,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
/* synchronous write or broken a_ops? */
ClearPageReclaim(page);
}
+ trace_mm_vmscan_writepage(page,
+ sync_writeback == PAGEOUT_IO_SYNC);
inc_zone_page_state(page, NR_VMSCAN_WRITE);
return PAGE_SUCCESS;
}
--
1.7.1

2010-06-14 11:18:01

by Mel Gorman

Subject: [PATCH 02/12] tracing, vmscan: Add trace events for LRU page isolation

This patch adds an event for when pages are isolated en masse from the
LRU lists. This event augments the information available on LRU traffic
and can be used to evaluate lumpy reclaim.

Signed-off-by: Mel Gorman <[email protected]>
---
include/trace/events/vmscan.h | 46 +++++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 14 ++++++++++++
2 files changed, 60 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index f76521f..a331454 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -109,6 +109,52 @@ TRACE_EVENT(mm_vmscan_direct_reclaim_end,
TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
);

+TRACE_EVENT(mm_vmscan_lru_isolate,
+
+ TP_PROTO(int order,
+ unsigned long nr_requested,
+ unsigned long nr_scanned,
+ unsigned long nr_taken,
+ unsigned long nr_lumpy_taken,
+ unsigned long nr_lumpy_dirty,
+ unsigned long nr_lumpy_failed,
+ int isolate_mode),
+
+ TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode),
+
+ TP_STRUCT__entry(
+ __field(int, order)
+ __field(unsigned long, nr_requested)
+ __field(unsigned long, nr_scanned)
+ __field(unsigned long, nr_taken)
+ __field(unsigned long, nr_lumpy_taken)
+ __field(unsigned long, nr_lumpy_dirty)
+ __field(unsigned long, nr_lumpy_failed)
+ __field(int, isolate_mode)
+ ),
+
+ TP_fast_assign(
+ __entry->order = order;
+ __entry->nr_requested = nr_requested;
+ __entry->nr_scanned = nr_scanned;
+ __entry->nr_taken = nr_taken;
+ __entry->nr_lumpy_taken = nr_lumpy_taken;
+ __entry->nr_lumpy_dirty = nr_lumpy_dirty;
+ __entry->nr_lumpy_failed = nr_lumpy_failed;
+ __entry->isolate_mode = isolate_mode;
+ ),
+
+ TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu contig_taken=%lu contig_dirty=%lu contig_failed=%lu",
+ __entry->isolate_mode,
+ __entry->order,
+ __entry->nr_requested,
+ __entry->nr_scanned,
+ __entry->nr_taken,
+ __entry->nr_lumpy_taken,
+ __entry->nr_lumpy_dirty,
+ __entry->nr_lumpy_failed)
+);
+
#endif /* _TRACE_VMSCAN_H */

/* This part must be outside protection */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6bfb579..25bf05a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -917,6 +917,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
unsigned long *scanned, int order, int mode, int file)
{
unsigned long nr_taken = 0;
+ unsigned long nr_lumpy_taken = 0, nr_lumpy_dirty = 0, nr_lumpy_failed = 0;
unsigned long scan;

for (scan = 0; scan < nr_to_scan && !list_empty(src); scan++) {
@@ -994,12 +995,25 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
list_move(&cursor_page->lru, dst);
mem_cgroup_del_lru(cursor_page);
nr_taken++;
+ nr_lumpy_taken++;
+ if (PageDirty(cursor_page))
+ nr_lumpy_dirty++;
scan++;
+ } else {
+ if (mode == ISOLATE_BOTH &&
+ page_count(cursor_page))
+ nr_lumpy_failed++;
}
}
}

*scanned = scan;
+
+ trace_mm_vmscan_lru_isolate(order,
+ nr_to_scan, scan,
+ nr_taken,
+ nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed,
+ mode);
return nr_taken;
}

--
1.7.1

2010-06-14 11:19:00

by Mel Gorman

Subject: [PATCH 11/12] vmscan: Write out dirty pages in batch

Page reclaim cleans individual pages using a_ops->writepage() because, from
the VM's perspective, pages in a particular zone must be freed soon, the
target page is considered the oldest, and there is no desire to wait while
background flushers clean other pages. From a filesystem perspective this is
extremely inefficient as it generates a very seeky IO pattern, leading to the
perverse situation where it can take longer to clean all dirty pages than it
would have otherwise.

This patch queues all dirty pages at once to maximise the chances that
the write requests get merged efficiently. It also makes the next patch,
which avoids writeout from direct reclaim, more straightforward.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 175 ++++++++++++++++++++++++++++++++++++++++++++---------------
1 files changed, 131 insertions(+), 44 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 019f0af..4856a2a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -323,6 +323,55 @@ typedef enum {
PAGE_CLEAN,
} pageout_t;

+int write_reclaim_page(struct page *page, struct address_space *mapping,
+ enum pageout_io sync_writeback)
+{
+ int res;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .nr_to_write = SWAP_CLUSTER_MAX,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ .nonblocking = 1,
+ .for_reclaim = 1,
+ };
+
+ if (!clear_page_dirty_for_io(page))
+ return PAGE_CLEAN;
+
+ SetPageReclaim(page);
+ res = mapping->a_ops->writepage(page, &wbc);
+ /*
+ * XXX: This is the Holy Hand Grenade of PotentiallyInvalidMapping. As
+ * the page lock has been dropped by ->writepage, that mapping could
+ * be anything
+ */
+ if (res < 0)
+ handle_write_error(mapping, page, res);
+ if (res == AOP_WRITEPAGE_ACTIVATE) {
+ ClearPageReclaim(page);
+ return PAGE_ACTIVATE;
+ }
+
+ /*
+ * Wait on writeback if requested to. This happens when
+ * direct reclaiming a large contiguous area and the
+ * first attempt to free a range of pages fails.
+ */
+ if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
+ wait_on_page_writeback(page);
+
+ if (!PageWriteback(page)) {
+ /* synchronous write or broken a_ops? */
+ ClearPageReclaim(page);
+ }
+ trace_mm_vmscan_writepage(page,
+ sync_writeback == PAGEOUT_IO_SYNC);
+ inc_zone_page_state(page, NR_VMSCAN_WRITE);
+
+ return PAGE_SUCCESS;
+}
+
/*
* pageout is called by shrink_page_list() for each dirty page.
* Calls ->writepage().
@@ -367,45 +416,7 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
if (!may_write_to_queue(mapping->backing_dev_info))
return PAGE_KEEP;

- if (clear_page_dirty_for_io(page)) {
- int res;
- struct writeback_control wbc = {
- .sync_mode = WB_SYNC_NONE,
- .nr_to_write = SWAP_CLUSTER_MAX,
- .range_start = 0,
- .range_end = LLONG_MAX,
- .nonblocking = 1,
- .for_reclaim = 1,
- };
-
- SetPageReclaim(page);
- res = mapping->a_ops->writepage(page, &wbc);
- if (res < 0)
- handle_write_error(mapping, page, res);
- if (res == AOP_WRITEPAGE_ACTIVATE) {
- ClearPageReclaim(page);
- return PAGE_ACTIVATE;
- }
-
- /*
- * Wait on writeback if requested to. This happens when
- * direct reclaiming a large contiguous area and the
- * first attempt to free a range of pages fails.
- */
- if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
- wait_on_page_writeback(page);
-
- if (!PageWriteback(page)) {
- /* synchronous write or broken a_ops? */
- ClearPageReclaim(page);
- }
- trace_mm_vmscan_writepage(page,
- sync_writeback == PAGEOUT_IO_SYNC);
- inc_zone_page_state(page, NR_VMSCAN_WRITE);
- return PAGE_SUCCESS;
- }
-
- return PAGE_CLEAN;
+ return write_reclaim_page(page, mapping, sync_writeback);
}

/*
@@ -640,19 +651,75 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
}

/*
+ * Clean a list of pages. It is expected that all the pages on page_list have been
+ * locked as part of isolation from the LRU.
+ *
+ * XXX: Is there a problem with holding multiple page locks like this?
+ */
+static noinline_for_stack void clean_page_list(struct list_head *page_list,
+ struct scan_control *sc)
+{
+ LIST_HEAD(ret_pages);
+ struct page *page;
+
+ if (!sc->may_writepage)
+ return;
+
+ /* Write the pages out to disk in ranges where possible */
+ while (!list_empty(page_list)) {
+ struct address_space *mapping;
+ bool may_enter_fs;
+
+ page = lru_to_page(page_list);
+ list_del(&page->lru);
+ list_add(&page->lru, &ret_pages);
+
+ mapping = page_mapping(page);
+ if (!mapping || !may_write_to_queue(mapping->backing_dev_info)) {
+ unlock_page(page);
+ continue;
+ }
+
+ may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
+ (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));
+ if (!may_enter_fs) {
+ unlock_page(page);
+ continue;
+ }
+
+ /* Write single page */
+ switch (write_reclaim_page(page, mapping, PAGEOUT_IO_ASYNC)) {
+ case PAGE_KEEP:
+ case PAGE_ACTIVATE:
+ case PAGE_CLEAN:
+ unlock_page(page);
+ break;
+ case PAGE_SUCCESS:
+ break;
+ }
+ }
+ list_splice(&ret_pages, page_list);
+}
+
+/*
* shrink_page_list() returns the number of reclaimed pages
*/
static unsigned long shrink_page_list(struct list_head *page_list,
struct scan_control *sc,
enum pageout_io sync_writeback)
{
- LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
- int pgactivate = 0;
+ LIST_HEAD(putback_pages);
+ LIST_HEAD(dirty_pages);
+ struct list_head *ret_list = page_list;
+ int pgactivate;
+ bool cleaned = false;
unsigned long nr_reclaimed = 0;

+ pgactivate = 0;
cond_resched();

+restart_dirty:
while (!list_empty(page_list)) {
enum page_references references;
struct address_space *mapping;
@@ -741,7 +808,18 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
}

- if (PageDirty(page)) {
+ if (PageDirty(page)) {
+ /*
+ * On the first pass, dirty pages are put on a separate
+ * list. IO is then queued based on ranges of pages for
+ * each unique mapping in the list
+ */
+ if (!cleaned) {
+ /* Keep locked for clean_page_list */
+ list_add(&page->lru, &dirty_pages);
+ goto keep_dirty;
+ }
+
if (references == PAGEREF_RECLAIM_CLEAN)
goto keep_locked;
if (!may_enter_fs)
@@ -852,13 +930,22 @@ activate_locked:
keep_locked:
unlock_page(page);
keep:
- list_add(&page->lru, &ret_pages);
+ list_add(&page->lru, &putback_pages);
+keep_dirty:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}

+ if (!cleaned && !list_empty(&dirty_pages)) {
+ clean_page_list(&dirty_pages, sc);
+ page_list = &dirty_pages;
+ cleaned = true;
+ goto restart_dirty;
+ }
+ BUG_ON(!list_empty(&dirty_pages));
+
free_page_list(&free_pages);

- list_splice(&ret_pages, page_list);
+ list_splice(&putback_pages, ret_list);
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
--
1.7.1

2010-06-14 11:19:20

by Mel Gorman

Subject: [PATCH 10/12] vmscan: Update isolated page counters outside of main path in shrink_inactive_list()

When shrink_inactive_list() isolates pages, it updates a number of
counters using temporary variables to gather them. These consume stack
and sit in the main path that calls ->writepage(). This patch moves the
accounting updates outside of the main path to reduce stack usage.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 63 +++++++++++++++++++++++++++++++++++-----------------------
1 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 165c2f5..019f0af 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1073,7 +1073,8 @@ static unsigned long clear_active_flags(struct list_head *page_list,
ClearPageActive(page);
nr_active++;
}
- count[lru]++;
+ if (count)
+ count[lru]++;
}

return nr_active;
@@ -1153,12 +1154,13 @@ static int too_many_isolated(struct zone *zone, int file,
* TODO: Try merging with migrations version of putback_lru_pages
*/
static noinline_for_stack void putback_lru_pages(struct zone *zone,
- struct zone_reclaim_stat *reclaim_stat,
+ struct scan_control *sc,
unsigned long nr_anon, unsigned long nr_file,
struct list_head *page_list)
{
struct page *page;
struct pagevec pvec;
+ struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);

pagevec_init(&pvec, 1);

@@ -1197,6 +1199,37 @@ static noinline_for_stack void putback_lru_pages(struct zone *zone,
pagevec_release(&pvec);
}

+static noinline_for_stack void update_isolated_counts(struct zone *zone,
+ struct scan_control *sc,
+ unsigned long *nr_anon,
+ unsigned long *nr_file,
+ struct list_head *isolated_list)
+{
+ unsigned long nr_active;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+ struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+
+ nr_active = clear_active_flags(isolated_list, count);
+ __count_vm_events(PGDEACTIVATE, nr_active);
+
+ __mod_zone_page_state(zone, NR_ACTIVE_FILE,
+ -count[LRU_ACTIVE_FILE]);
+ __mod_zone_page_state(zone, NR_INACTIVE_FILE,
+ -count[LRU_INACTIVE_FILE]);
+ __mod_zone_page_state(zone, NR_ACTIVE_ANON,
+ -count[LRU_ACTIVE_ANON]);
+ __mod_zone_page_state(zone, NR_INACTIVE_ANON,
+ -count[LRU_INACTIVE_ANON]);
+
+ *nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
+ *nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, *nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, *nr_file);
+
+ reclaim_stat->recent_scanned[0] += *nr_anon;
+ reclaim_stat->recent_scanned[1] += *nr_file;
+}
+
/*
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
@@ -1208,10 +1241,8 @@ static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to
LIST_HEAD(page_list);
unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
- struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
unsigned long nr_taken;
unsigned long nr_active;
- unsigned int count[NR_LRU_LISTS] = { 0, };
unsigned long nr_anon;
unsigned long nr_file;

@@ -1256,25 +1287,7 @@ static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to
return 0;
}

- nr_active = clear_active_flags(&page_list, count);
- __count_vm_events(PGDEACTIVATE, nr_active);
-
- __mod_zone_page_state(zone, NR_ACTIVE_FILE,
- -count[LRU_ACTIVE_FILE]);
- __mod_zone_page_state(zone, NR_INACTIVE_FILE,
- -count[LRU_INACTIVE_FILE]);
- __mod_zone_page_state(zone, NR_ACTIVE_ANON,
- -count[LRU_ACTIVE_ANON]);
- __mod_zone_page_state(zone, NR_INACTIVE_ANON,
- -count[LRU_INACTIVE_ANON]);
-
- nr_anon = count[LRU_ACTIVE_ANON] + count[LRU_INACTIVE_ANON];
- nr_file = count[LRU_ACTIVE_FILE] + count[LRU_INACTIVE_FILE];
- __mod_zone_page_state(zone, NR_ISOLATED_ANON, nr_anon);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, nr_file);
-
- reclaim_stat->recent_scanned[0] += nr_anon;
- reclaim_stat->recent_scanned[1] += nr_file;
+ update_isolated_counts(zone, sc, &nr_anon, &nr_file, &page_list);

spin_unlock_irq(&zone->lru_lock);

@@ -1293,7 +1306,7 @@ static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to
* The attempt at page out may have made some
* of the pages active, mark them inactive again.
*/
- nr_active = clear_active_flags(&page_list, count);
+ nr_active = clear_active_flags(&page_list, NULL);
count_vm_events(PGDEACTIVATE, nr_active);

nr_reclaimed += shrink_page_list(&page_list, sc, PAGEOUT_IO_SYNC);
@@ -1304,7 +1317,7 @@ static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to
__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);

- putback_lru_pages(zone, reclaim_stat, nr_anon, nr_file, &page_list);
+ putback_lru_pages(zone, sc, nr_anon, nr_file, &page_list);
return nr_reclaimed;
}

--
1.7.1

2010-06-14 11:17:59

by Mel Gorman

Subject: [PATCH 01/12] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim

This patch adds two trace events for kswapd waking up and going to sleep,
for the purposes of tracking kswapd activity, and two trace events for
direct reclaim beginning and ending. The information can be used to work
out how much time a process or the system is spending on the reclamation
of pages and, in the case of direct reclaim, how many pages were reclaimed
for that process. High-frequency triggering of these events could point to
memory pressure problems.

Signed-off-by: Mel Gorman <[email protected]>
---
include/trace/events/gfpflags.h | 37 +++++++++++++
include/trace/events/kmem.h | 38 +-------------
include/trace/events/vmscan.h | 115 +++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 24 +++++++--
4 files changed, 173 insertions(+), 41 deletions(-)
create mode 100644 include/trace/events/gfpflags.h
create mode 100644 include/trace/events/vmscan.h

diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
new file mode 100644
index 0000000..e3615c0
--- /dev/null
+++ b/include/trace/events/gfpflags.h
@@ -0,0 +1,37 @@
+/*
+ * The order of these masks is important. Matching masks will be seen
+ * first and the left over flags will end up showing by themselves.
+ *
+ * For example, if we have GFP_KERNEL before GFP_USER we wil get:
+ *
+ * GFP_KERNEL|GFP_HARDWALL
+ *
+ * Thus most bits set go first.
+ */
+#define show_gfp_flags(flags) \
+ (flags) ? __print_flags(flags, "|", \
+ {(unsigned long)GFP_HIGHUSER_MOVABLE, "GFP_HIGHUSER_MOVABLE"}, \
+ {(unsigned long)GFP_HIGHUSER, "GFP_HIGHUSER"}, \
+ {(unsigned long)GFP_USER, "GFP_USER"}, \
+ {(unsigned long)GFP_TEMPORARY, "GFP_TEMPORARY"}, \
+ {(unsigned long)GFP_KERNEL, "GFP_KERNEL"}, \
+ {(unsigned long)GFP_NOFS, "GFP_NOFS"}, \
+ {(unsigned long)GFP_ATOMIC, "GFP_ATOMIC"}, \
+ {(unsigned long)GFP_NOIO, "GFP_NOIO"}, \
+ {(unsigned long)__GFP_HIGH, "GFP_HIGH"}, \
+ {(unsigned long)__GFP_WAIT, "GFP_WAIT"}, \
+ {(unsigned long)__GFP_IO, "GFP_IO"}, \
+ {(unsigned long)__GFP_COLD, "GFP_COLD"}, \
+ {(unsigned long)__GFP_NOWARN, "GFP_NOWARN"}, \
+ {(unsigned long)__GFP_REPEAT, "GFP_REPEAT"}, \
+ {(unsigned long)__GFP_NOFAIL, "GFP_NOFAIL"}, \
+ {(unsigned long)__GFP_NORETRY, "GFP_NORETRY"}, \
+ {(unsigned long)__GFP_COMP, "GFP_COMP"}, \
+ {(unsigned long)__GFP_ZERO, "GFP_ZERO"}, \
+ {(unsigned long)__GFP_NOMEMALLOC, "GFP_NOMEMALLOC"}, \
+ {(unsigned long)__GFP_HARDWALL, "GFP_HARDWALL"}, \
+ {(unsigned long)__GFP_THISNODE, "GFP_THISNODE"}, \
+ {(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \
+ {(unsigned long)__GFP_MOVABLE, "GFP_MOVABLE"} \
+ ) : "GFP_NOWAIT"
+
diff --git a/include/trace/events/kmem.h b/include/trace/events/kmem.h
index 3adca0c..a9c87ad 100644
--- a/include/trace/events/kmem.h
+++ b/include/trace/events/kmem.h
@@ -6,43 +6,7 @@

#include <linux/types.h>
#include <linux/tracepoint.h>
-
-/*
- * The order of these masks is important. Matching masks will be seen
- * first and the left over flags will end up showing by themselves.
- *
- * For example, if we have GFP_KERNEL before GFP_USER we wil get:
- *
- * GFP_KERNEL|GFP_HARDWALL
- *
- * Thus most bits set go first.
- */
-#define show_gfp_flags(flags) \
- (flags) ? __print_flags(flags, "|", \
- {(unsigned long)GFP_HIGHUSER_MOVABLE, "GFP_HIGHUSER_MOVABLE"}, \
- {(unsigned long)GFP_HIGHUSER, "GFP_HIGHUSER"}, \
- {(unsigned long)GFP_USER, "GFP_USER"}, \
- {(unsigned long)GFP_TEMPORARY, "GFP_TEMPORARY"}, \
- {(unsigned long)GFP_KERNEL, "GFP_KERNEL"}, \
- {(unsigned long)GFP_NOFS, "GFP_NOFS"}, \
- {(unsigned long)GFP_ATOMIC, "GFP_ATOMIC"}, \
- {(unsigned long)GFP_NOIO, "GFP_NOIO"}, \
- {(unsigned long)__GFP_HIGH, "GFP_HIGH"}, \
- {(unsigned long)__GFP_WAIT, "GFP_WAIT"}, \
- {(unsigned long)__GFP_IO, "GFP_IO"}, \
- {(unsigned long)__GFP_COLD, "GFP_COLD"}, \
- {(unsigned long)__GFP_NOWARN, "GFP_NOWARN"}, \
- {(unsigned long)__GFP_REPEAT, "GFP_REPEAT"}, \
- {(unsigned long)__GFP_NOFAIL, "GFP_NOFAIL"}, \
- {(unsigned long)__GFP_NORETRY, "GFP_NORETRY"}, \
- {(unsigned long)__GFP_COMP, "GFP_COMP"}, \
- {(unsigned long)__GFP_ZERO, "GFP_ZERO"}, \
- {(unsigned long)__GFP_NOMEMALLOC, "GFP_NOMEMALLOC"}, \
- {(unsigned long)__GFP_HARDWALL, "GFP_HARDWALL"}, \
- {(unsigned long)__GFP_THISNODE, "GFP_THISNODE"}, \
- {(unsigned long)__GFP_RECLAIMABLE, "GFP_RECLAIMABLE"}, \
- {(unsigned long)__GFP_MOVABLE, "GFP_MOVABLE"} \
- ) : "GFP_NOWAIT"
+#include "gfpflags.h"

DECLARE_EVENT_CLASS(kmem_alloc,

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
new file mode 100644
index 0000000..f76521f
--- /dev/null
+++ b/include/trace/events/vmscan.h
@@ -0,0 +1,115 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM vmscan
+
+#if !defined(_TRACE_VMSCAN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_VMSCAN_H
+
+#include <linux/types.h>
+#include <linux/tracepoint.h>
+#include "gfpflags.h"
+
+TRACE_EVENT(mm_vmscan_kswapd_sleep,
+
+ TP_PROTO(int nid),
+
+ TP_ARGS(nid),
+
+ TP_STRUCT__entry(
+ __field( int, nid )
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ ),
+
+ TP_printk("nid=%d", __entry->nid)
+);
+
+TRACE_EVENT(mm_vmscan_kswapd_wake,
+
+ TP_PROTO(int nid, int order),
+
+ TP_ARGS(nid, order),
+
+ TP_STRUCT__entry(
+ __field( int, nid )
+ __field( int, order )
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->order = order;
+ ),
+
+ TP_printk("nid=%d order=%d", __entry->nid, __entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_wakeup_kswapd,
+
+ TP_PROTO(int nid, int zid, int order),
+
+ TP_ARGS(nid, zid, order),
+
+ TP_STRUCT__entry(
+ __field( int, nid )
+ __field( int, zid )
+ __field( int, order )
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->zid = zid;
+ __entry->order = order;
+ ),
+
+ TP_printk("nid=%d zid=%d order=%d",
+ __entry->nid,
+ __entry->zid,
+ __entry->order)
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_begin,
+
+ TP_PROTO(int order, int may_writepage, gfp_t gfp_flags),
+
+ TP_ARGS(order, may_writepage, gfp_flags),
+
+ TP_STRUCT__entry(
+ __field( int, order )
+ __field( int, may_writepage )
+ __field( gfp_t, gfp_flags )
+ ),
+
+ TP_fast_assign(
+ __entry->order = order;
+ __entry->may_writepage = may_writepage;
+ __entry->gfp_flags = gfp_flags;
+ ),
+
+ TP_printk("order=%d may_writepage=%d gfp_flags=%s",
+ __entry->order,
+ __entry->may_writepage,
+ show_gfp_flags(__entry->gfp_flags))
+);
+
+TRACE_EVENT(mm_vmscan_direct_reclaim_end,
+
+ TP_PROTO(unsigned long nr_reclaimed),
+
+ TP_ARGS(nr_reclaimed),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, nr_reclaimed )
+ ),
+
+ TP_fast_assign(
+ __entry->nr_reclaimed = nr_reclaimed;
+ ),
+
+ TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed)
+);
+
+#endif /* _TRACE_VMSCAN_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c7e57c..6bfb579 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,9 @@

#include "internal.h"

+#define CREATE_TRACE_POINTS
+#include <trace/events/vmscan.h>
+
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
@@ -1886,6 +1889,7 @@ out:
unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
gfp_t gfp_mask, nodemask_t *nodemask)
{
+ unsigned long nr_reclaimed;
struct scan_control sc = {
.gfp_mask = gfp_mask,
.may_writepage = !laptop_mode,
@@ -1898,7 +1902,15 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
.nodemask = nodemask,
};

- return do_try_to_free_pages(zonelist, &sc);
+ trace_mm_vmscan_direct_reclaim_begin(order,
+ sc.may_writepage,
+ gfp_mask);
+
+ nr_reclaimed = do_try_to_free_pages(zonelist, &sc);
+
+ trace_mm_vmscan_direct_reclaim_end(nr_reclaimed);
+
+ return nr_reclaimed;
}

#ifdef CONFIG_CGROUP_MEM_RES_CTLR
@@ -2297,9 +2309,10 @@ static int kswapd(void *p)
* premature sleep. If not, then go fully
* to sleep until explicitly woken up
*/
- if (!sleeping_prematurely(pgdat, order, remaining))
+ if (!sleeping_prematurely(pgdat, order, remaining)) {
+ trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
schedule();
- else {
+ } else {
if (remaining)
count_vm_event(KSWAPD_LOW_WMARK_HIT_QUICKLY);
else
@@ -2319,8 +2332,10 @@ static int kswapd(void *p)
* We can speed up thawing tasks if we don't call balance_pgdat
* after returning from the refrigerator
*/
- if (!ret)
+ if (!ret) {
+ trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
balance_pgdat(pgdat, order);
+ }
}
return 0;
}
@@ -2340,6 +2355,7 @@ void wakeup_kswapd(struct zone *zone, int order)
return;
if (pgdat->kswapd_max_order < order)
pgdat->kswapd_max_order = order;
+ trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
return;
if (!waitqueue_active(&pgdat->kswapd_wait))
--
1.7.1

2010-06-14 11:19:42

by Mel Gorman

Subject: [PATCH 09/12] vmscan: Setup pagevec as late as possible in shrink_page_list()

shrink_page_list() sets up a pagevec to release pages as they become
free. The pagevec consumes a significant amount of stack. This patch
instead adds pages to be freed to a linked list which is then freed
en-masse at the end. This avoids using stack in the main path that
potentially calls writepage().

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 37 +++++++++++++++++++++++++++++--------
1 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 34c5c87..165c2f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -620,6 +620,25 @@ static enum page_references page_check_references(struct page *page,
return PAGEREF_RECLAIM;
}

+static noinline_for_stack void free_page_list(struct list_head *free_pages)
+{
+ struct pagevec freed_pvec;
+ struct page *page, *tmp;
+
+ pagevec_init(&freed_pvec, 1);
+
+ list_for_each_entry_safe(page, tmp, free_pages, lru) {
+ list_del(&page->lru);
+ if (!pagevec_add(&freed_pvec, page)) {
+ __pagevec_free(&freed_pvec);
+ pagevec_reinit(&freed_pvec);
+ }
+ }
+
+ if (pagevec_count(&freed_pvec))
+ __pagevec_free(&freed_pvec);
+}
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -628,13 +647,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
enum pageout_io sync_writeback)
{
LIST_HEAD(ret_pages);
- struct pagevec freed_pvec;
+ LIST_HEAD(free_pages);
int pgactivate = 0;
unsigned long nr_reclaimed = 0;

cond_resched();

- pagevec_init(&freed_pvec, 1);
while (!list_empty(page_list)) {
enum page_references references;
struct address_space *mapping;
@@ -809,10 +827,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
__clear_page_locked(page);
free_it:
nr_reclaimed++;
- if (!pagevec_add(&freed_pvec, page)) {
- __pagevec_free(&freed_pvec);
- pagevec_reinit(&freed_pvec);
- }
+
+ /*
+ * Is there a need to call free_page_list() periodically? It
+ * would appear not, as the counts should be low
+ */
+ list_add(&page->lru, &free_pages);
continue;

cull_mlocked:
@@ -835,9 +855,10 @@ keep:
list_add(&page->lru, &ret_pages);
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
+
+ free_page_list(&free_pages);
+
list_splice(&ret_pages, page_list);
- if (pagevec_count(&freed_pvec))
- __pagevec_free(&freed_pvec);
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
--
1.7.1
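
The batching idea in this patch can be sketched in userspace C: instead of
flushing a small on-stack batch (the pagevec) inside the hot scan loop,
collect items on a linked list and release them in one helper at the end.
The names node, free_list and scan_and_free() are illustrative, not kernel
API:

```c
#include <stdlib.h>

struct node {
	struct node *next;
	void *payload;
};

/* Helper kept out of the scan path, mirroring free_page_list() */
static void free_all(struct node *head, int *nfreed)
{
	while (head) {
		struct node *next = head->next;
		free(head->payload);
		free(head);
		(*nfreed)++;
		head = next;
	}
}

int scan_and_free(int nitems)
{
	struct node *free_list = NULL;
	int nfreed = 0;

	for (int i = 0; i < nitems; i++) {
		struct node *n = malloc(sizeof(*n));
		n->payload = malloc(16);
		/* hot path: just link the item, no batch flushing here */
		n->next = free_list;
		free_list = n;
	}

	/* one en-masse release at the end, as shrink_page_list() now does */
	free_all(free_list, &nfreed);
	return nfreed;
}
```

The trade-off is the timing change noted in the series introduction: pages
are held slightly longer before being freed, in exchange for a smaller frame
on the path that may call writepage().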

2010-06-14 11:19:57

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list()

shrink_inactive_list() sets up a pagevec to release unfreeable pages.
The pagevec consumes a significant amount of stack. This patch splits
shrink_inactive_list() so that the stack usage is moved out of the main
path and the call path into writepage() no longer carries an unused
pagevec on the stack.

Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
---
mm/vmscan.c | 95 +++++++++++++++++++++++++++++++++-------------------------
1 files changed, 54 insertions(+), 41 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 246e084..34c5c87 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,19 +1129,65 @@ static int too_many_isolated(struct zone *zone, int file,
}

/*
+ * TODO: Try merging with migrations version of putback_lru_pages
+ */
+static noinline_for_stack void putback_lru_pages(struct zone *zone,
+ struct zone_reclaim_stat *reclaim_stat,
+ unsigned long nr_anon, unsigned long nr_file,
+ struct list_head *page_list)
+{
+ struct page *page;
+ struct pagevec pvec;
+
+ pagevec_init(&pvec, 1);
+
+ /*
+ * Put back any unfreeable pages.
+ */
+ spin_lock(&zone->lru_lock);
+ while (!list_empty(page_list)) {
+ int lru;
+ page = lru_to_page(page_list);
+ VM_BUG_ON(PageLRU(page));
+ list_del(&page->lru);
+ if (unlikely(!page_evictable(page, NULL))) {
+ spin_unlock_irq(&zone->lru_lock);
+ putback_lru_page(page);
+ spin_lock_irq(&zone->lru_lock);
+ continue;
+ }
+ SetPageLRU(page);
+ lru = page_lru(page);
+ add_page_to_lru_list(zone, page, lru);
+ if (is_active_lru(lru)) {
+ int file = is_file_lru(lru);
+ reclaim_stat->recent_rotated[file]++;
+ }
+ if (!pagevec_add(&pvec, page)) {
+ spin_unlock_irq(&zone->lru_lock);
+ __pagevec_release(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+ }
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+
+ spin_unlock_irq(&zone->lru_lock);
+ pagevec_release(&pvec);
+}
+
+/*
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
*/
-static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
+static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc,
int priority, int file)
{
LIST_HEAD(page_list);
- struct pagevec pvec;
unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
- struct page *page;
unsigned long nr_taken;
unsigned long nr_active;
unsigned int count[NR_LRU_LISTS] = { 0, };
@@ -1157,8 +1203,6 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
}


- pagevec_init(&pvec, 1);
-
lru_add_drain();
spin_lock_irq(&zone->lru_lock);

@@ -1186,8 +1230,10 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
*/
}

- if (nr_taken == 0)
- goto done;
+ if (nr_taken == 0) {
+ spin_unlock_irq(&zone->lru_lock);
+ return 0;
+ }

nr_active = clear_active_flags(&page_list, count);
__count_vm_events(PGDEACTIVATE, nr_active);
@@ -1237,40 +1283,7 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
__count_vm_events(KSWAPD_STEAL, nr_reclaimed);
__count_zone_vm_events(PGSTEAL, zone, nr_reclaimed);

- spin_lock(&zone->lru_lock);
- /*
- * Put back any unfreeable pages.
- */
- while (!list_empty(&page_list)) {
- int lru;
- page = lru_to_page(&page_list);
- VM_BUG_ON(PageLRU(page));
- list_del(&page->lru);
- if (unlikely(!page_evictable(page, NULL))) {
- spin_unlock_irq(&zone->lru_lock);
- putback_lru_page(page);
- spin_lock_irq(&zone->lru_lock);
- continue;
- }
- SetPageLRU(page);
- lru = page_lru(page);
- add_page_to_lru_list(zone, page, lru);
- if (is_active_lru(lru)) {
- int file = is_file_lru(lru);
- reclaim_stat->recent_rotated[file]++;
- }
- if (!pagevec_add(&pvec, page)) {
- spin_unlock_irq(&zone->lru_lock);
- __pagevec_release(&pvec);
- spin_lock_irq(&zone->lru_lock);
- }
- }
- __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
- __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
-
-done:
- spin_unlock_irq(&zone->lru_lock);
- pagevec_release(&pvec);
+ putback_lru_pages(zone, reclaim_stat, nr_anon, nr_file, &page_list);
return nr_reclaimed;
}

--
1.7.1
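
The stack-splitting technique here can be sketched as follows: the large
temporary (a 512-byte scratch buffer standing in for the on-stack pagevec)
is confined to a helper the compiler must not inline, so the frame of the
main path stays small. The kernel spells the attribute noinline_for_stack;
plain GCC noinline is used below, and putback_helper()/shrink_path() are
illustrative names, not kernel symbols:

```c
#include <string.h>

static __attribute__((noinline)) int putback_helper(int n)
{
	char scratch[512];	/* large object lives only in this frame */
	memset(scratch, 0, sizeof(scratch));

	int sum = 0;
	for (int i = 0; i < n; i++)
		sum += i;	/* stand-in for the putback work */
	return sum;
}

int shrink_path(int n)
{
	/* main path: no scratch buffer in this frame, so any deeper
	 * calls (writepage() in the real code) see more free stack */
	return putback_helper(n);
}
```

Because the helper is noinline, its frame exists only for the duration of
the putback phase rather than for the whole of the reclaim path.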

2010-06-14 11:20:17

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 05/12] vmscan: kill prev_priority completely

From: KOSAKI Motohiro <[email protected]>

Since 2.6.28, zone->prev_priority has been unused, so it can be removed
safely. Removing it also reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago I thought prev_priority
could usefully be reintegrated, but four (or more) attempts have not
produced good performance numbers, so I am giving up on that approach.

Signed-off-by: KOSAKI Motohiro <[email protected]>
Reviewed-by: Johannes Weiner <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 15 ------------
mm/page_alloc.c | 2 -
mm/vmscan.c | 57 ------------------------------------------------
mm/vmstat.c | 2 -
4 files changed, 0 insertions(+), 76 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..b578eee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -348,21 +348,6 @@ struct zone {
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];

/*
- * prev_priority holds the scanning priority for this zone. It is
- * defined as the scanning priority at which we achieved our reclaim
- * target at the previous try_to_free_pages() or balance_pgdat()
- * invocation.
- *
- * We use prev_priority as a measure of how much stress page reclaim is
- * under - it drives the swappiness decision: whether to unmap mapped
- * pages.
- *
- * Access to both this field is quite racy even on uniprocessor. But
- * it is expected to average out OK.
- */
- int prev_priority;
-
- /*
* The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
* this zone's LRU. Maintained by the pageout code.
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..0b0b629 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4081,8 +4081,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone_seqlock_init(zone);
zone->zone_pgdat = pgdat;

- zone->prev_priority = DEF_PRIORITY;
-
zone_pcp_init(zone);
for_each_lru(l) {
INIT_LIST_HEAD(&zone->lru[l].list);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 58527c4..29e1ecd 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1286,20 +1286,6 @@ done:
}

/*
- * We are about to scan this zone at a certain priority level. If that priority
- * level is smaller (ie: more urgent) than the previous priority, then note
- * that priority level within the zone. This is done so that when the next
- * process comes in to scan this zone, it will immediately start out at this
- * priority level rather than having to build up its own scanning priority.
- * Here, this priority affects only the reclaim-mapped threshold.
- */
-static inline void note_zone_scanning_priority(struct zone *zone, int priority)
-{
- if (priority < zone->prev_priority)
- zone->prev_priority = priority;
-}
-
-/*
* This moves pages from the active list to the inactive list.
*
* We move them the other way if the page is referenced by one or more
@@ -1762,17 +1748,8 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
if (scanning_global_lru(sc)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
- note_zone_scanning_priority(zone, priority);
-
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */
- } else {
- /*
- * Ignore cpuset limitation here. We just want to reduce
- * # of used pages by us regardless of memory shortage.
- */
- mem_cgroup_note_reclaim_priority(sc->mem_cgroup,
- priority);
}

shrink_zone(priority, zone, sc);
@@ -1878,17 +1855,6 @@ out:
if (priority < 0)
priority = 0;

- if (scanning_global_lru(sc)) {
- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
-
- if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
- continue;
-
- zone->prev_priority = priority;
- }
- } else
- mem_cgroup_record_reclaim_priority(sc->mem_cgroup, priority);
-
delayacct_freepages_end();
put_mems_allowed();

@@ -2054,22 +2020,12 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order)
.order = order,
.mem_cgroup = NULL,
};
- /*
- * temp_priority is used to remember the scanning priority at which
- * this zone was successfully refilled to
- * free_pages == high_wmark_pages(zone).
- */
- int temp_priority[MAX_NR_ZONES];
-
loop_again:
total_scanned = 0;
sc.nr_reclaimed = 0;
sc.may_writepage = !laptop_mode;
count_vm_event(PAGEOUTRUN);

- for (i = 0; i < pgdat->nr_zones; i++)
- temp_priority[i] = DEF_PRIORITY;
-
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
int end_zone = 0; /* Inclusive. 0 = ZONE_DMA */
unsigned long lru_pages = 0;
@@ -2137,9 +2093,7 @@ loop_again:
if (zone->all_unreclaimable && priority != DEF_PRIORITY)
continue;

- temp_priority[i] = priority;
sc.nr_scanned = 0;
- note_zone_scanning_priority(zone, priority);

nid = pgdat->node_id;
zid = zone_idx(zone);
@@ -2212,16 +2166,6 @@ loop_again:
break;
}
out:
- /*
- * Note within each zone the priority level at which this zone was
- * brought into a happy state. So that the next thread which scans this
- * zone will start out at that priority level.
- */
- for (i = 0; i < pgdat->nr_zones; i++) {
- struct zone *zone = pgdat->node_zones + i;
-
- zone->prev_priority = temp_priority[i];
- }
if (!all_zones_ok) {
cond_resched();

@@ -2641,7 +2585,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
*/
priority = ZONE_RECLAIM_PRIORITY;
do {
- note_zone_scanning_priority(zone, priority);
shrink_zone(priority, zone, &sc);
priority--;
} while (priority >= 0 && sc.nr_reclaimed < nr_pages);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 7759941..5c0b1b6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -853,11 +853,9 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
}
seq_printf(m,
"\n all_unreclaimable: %u"
- "\n prev_priority: %i"
"\n start_pfn: %lu"
"\n inactive_ratio: %u",
zone->all_unreclaimable,
- zone->prev_priority,
zone->zone_start_pfn,
zone->inactive_ratio);
seq_putc(m, '\n');
--
1.7.1

2010-06-14 11:20:39

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 04/12] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events

This patch adds a simple post-processing script for the reclaim-related
trace events. It can be used to give an indication of how much traffic
there is on the LRU lists and how severe latencies due to reclaim are.
Example output looks like the following

Reclaim latencies expressed as order-latency_in_ms
uname-3942 9-200.179000000004 9-98.7900000000373 9-99.8330000001006
kswapd0-311 0-662.097999999998 0-2.79700000002049 \
0-149.100000000035 0-3295.73600000003 0-9806.31799999997 0-35528.833 \
0-10043.197 0-129740.979 0-3.50500000000466 0-3.54899999999907 \
0-9297.78999999992 0-3.48499999998603 0-3596.97999999998 0-3.92799999995623 \
0-3.35000000009313 0-16729.017 0-3.57799999997951 0-47435.0630000001 \
0-3.7819999998901 0-5864.06999999995 0-18635.334 0-10541.289 9-186011.565 \
9-3680.86300000001 9-1379.06499999994 9-958571.115 9-66215.474 \
9-6721.14699999988 9-1962.15299999993 9-1094806.125 9-2267.83199999994 \
9-47120.9029999999 9-427653.886 9-2.6359999999404 9-632.148999999976 \
9-476.753000000026 9-495.577000000048 9-8.45900000003166 9-6.6820000000298 \
9-1.30500000016764 9-251.746000000043 9-383.905000000028 9-80.1419999999925 \
9-281.160000000149 9-14.8780000000261 9-381.45299999998 9-512.07799999998 \
9-49.5519999999087 9-167.439000000013 9-183.820999999996 9-239.527999999933 \
9-19.9479999998584 9-148.747999999905 9-164.583000000101 9-16.9480000000913 \
9-192.376000000164 9-64.1010000000242 9-1.40800000005402 9-3.60800000000745 \
9-17.1359999999404 9-4.69500000006519 9-2.06400000001304 9-1582488.554 \
9-6244.19499999983 9-348153.812 9-2.0999999998603 9-0.987999999895692 \
0-32218.473 0-1.6140000000596 0-1.28100000019185 0-1.41300000017509 \
0-1.32299999985844 0-602.584000000032 0-1.34400000004098 0-1.6929999999702 \
1-22101.8190000001 9-174876.724 9-16.2420000000857 9-175.165999999736 \
9-15.8589999997057 9-0.604999999981374 9-3061.09000000032 9-479.277000000235 \
9-1.54499999992549 9-771.985000000335 9-4.88700000010431 9-15.0649999999441 \
9-0.879999999888241 9-252.01500000013 9-1381.03600000031 9-545.689999999944 \
9-3438.0129999998 9-3343.70099999988
bench-stresshig-3942 9-7063.33900000004 9-129960.482 9-2062.27500000002 \
9-3845.59399999992 9-171.82799999998 9-16493.821 9-7615.23900000006 \
9-10217.848 9-983.138000000035 9-2698.39999999991 9-4016.1540000001 \
9-5522.37700000009 9-21630.429 \
9-15061.048 9-10327.953 9-542.69700000016 9-317.652000000002 \
9-8554.71699999995 9-1786.61599999992 9-1899.31499999994 9-2093.41899999999 \
9-4992.62400000007 9-942.648999999976 9-1923.98300000001 9-3.7980000001844 \
9-5.99899999983609 9-0.912000000011176 9-1603.67700000014 9-1.98300000000745 \
9-3.96500000008382 9-0.902999999932945 9-2802.72199999983 9-1078.24799999991 \
9-2155.82900000014 9-10.058999999892 9-1984.723 9-1687.97999999998 \
9-1136.05300000007 9-3183.61699999985 9-458.731000000145 9-6.48600000003353 \
9-1013.25200000009 9-8415.22799999989 9-10065.584 9-2076.79600000009 \
9-3792.65699999989 9-71.2010000001173 9-2560.96999999997 9-2260.68400000012 \
9-2862.65799999982 9-1255.81500000018 9-15.7440000001807 9-4.33499999996275 \
9-1446.63800000004 9-238.635000000009 9-60.1790000000037 9-4.38800000003539 \
9-639.567000000039 9-306.698000000091 9-31.4070000001229 9-74.997999999905 \
9-632.725999999791 9-1625.93200000003 9-931.266000000061 9-98.7749999999069 \
9-984.606999999844 9-225.638999999966 9-421.316000000108 9-653.744999999879 \
9-572.804000000004 9-769.158999999985 9-603.918000000063 9-4.28499999991618 \
9-626.21399999992 9-1721.25 9-0.854999999981374 9-572.39599999995 \
9-681.881999999983 9-1345.12599999993 9-363.666999999899 9-3823.31099999999 \
9-2991.28200000012 9-4.27099999994971 9-309.76500000013 9-3068.35700000008 \
9-788.25 9-3515.73999999999 9-2065.96100000013 9-286.719999999972 \
9-316.076000000117 9-344.151000000071 9-2.51000000000931 9-306.688000000082 \
9-1515.00099999993 9-336.528999999864 9-793.491999999853 9-457.348999999929 \
9-13620.155 9-119.933999999892 9-35.0670000000391 9-918.266999999993 \
9-828.569000000134 9-4863.81099999999 9-105.222000000067 9-894.23900000006 \
9-110.964999999851 9-0.662999999942258 9-12753.3150000002 9-12.6129999998957 \
9-13368.0899999999 9-12.4199999999255 9-1.00300000002608 9-1.41100000008009 \
9-10300.5290000001 9-16.502000000095 9-30.7949999999255 9-6283.0140000002 \
9-4320.53799999994 9-6826.27300000004 9-3.07299999985844 9-1497.26799999992 \
9-13.4040000000969 9-3.12999999988824 9-3.86100000003353 9-11.3539999998175 \
9-0.10799999977462 9-21.780999999959 9-209.695999999996 9-299.647000000114 \
9-6.01699999999255 9-20.8349999999627 9-22.5470000000205 9-5470.16800000006 \
9-7.60499999998137 9-0.821000000229105 9-1.56600000010803 9-14.1669999998994 \
9-0.209000000031665 9-1.82300000009127 9-1.70000000018626 9-19.9429999999702 \
9-124.266999999993 9-0.0389999998733401 9-6.71400000015274 9-16.7710000001825 \
9-31.0409999999683 9-0.516999999992549 9-115.888000000035 9-5.19900000002235 \
9-222.389999999898 9-11.2739999999758 9-80.9050000000279 9-8.14500000001863 \
9-4.44599999999627 9-0.218999999808148 9-0.715000000083819 9-0.233000000007451 \
9-48.2630000000354 9-248.560999999987 9-374.96800000011 9-644.179000000004 \
9-0.835999999893829 9-79.0060000000522 9-128.447999999858 9-0.692000000039116 \
9-5.26500000013039 9-128.449000000022 9-2.04799999995157 9-12.0990000001621 \
9-8.39899999997579 9-10.3860000001732 9-11.9310000000987 9-53.4450000000652 \
9-0.46999999997206 9-2.96299999998882 9-17.9699999999721 9-0.776000000070781 \
9-25.2919999998994 9-33.1110000000335 9-0.434000000124797 9-0.641000000061467 \
9-0.505000000121072 9-1.12800000002608 9-149.222000000067 9-1.17599999997765 \
9-3247.33100000001 9-10.7439999999478 9-153.523000000045 9-1.38300000014715 \
9-794.762000000104 9-3.36199999996461 9-128.765999999829 9-181.543999999994 \
9-78149.8229999999 9-176.496999999974 9-89.9940000001807 9-9.12700000009499 \
9-250.827000000048 9-0.224999999860302 9-0.388999999966472 9-1.16700000036508 \
9-32.1740000001155 9-12.6800000001676 9-0.0720000001601875 9-0.274999999906868 \
9-0.724000000394881 9-266.866000000387 9-45.5709999999963 9-4.54399999976158 \
9-8.27199999988079 9-4.38099999958649 9-0.512000000104308 9-0.0640000002458692 \
9-5.20000000018626 9-0.0839999997988343 9-12.816000000108 9-0.503000000026077 \
9-0.507999999914318 9-6.23999999975786 9-3.35100000025705 9-18.8530000001192 \
9-25.2220000000671 9-68.2309999996796 9-98.9939999999478 9-0.441000000108033 \
9-4.24599999981001 9-261.702000000048 9-3.01599999982864 9-0.0749999997206032 \
9-0.0370000000111759 9-4.375 9-3.21800000034273 9-11.3960000001825 \
9-0.0540000000037253 9-0.286000000312924 9-0.865999999921769 \
9-0.294999999925494 9-6.45999999996275 9-4.31099999975413 9-128.248999999836 \
9-0.282999999821186 9-102.155000000261 9-0.0860000001266599 \
9-0.0540000000037253 9-0.935000000055879 9-0.0670000002719462 \
9-5.8640000000596 9-19.9860000000335 9-4.18699999991804 9-0.566000000108033 \
9-2.55099999997765 9-0.702000000048429 9-131.653999999631 9-0.638999999966472 \
9-14.3229999998584 9-183.398000000045 9-178.095999999903 9-3.22899999981746 \
9-7.31399999978021 9-22.2400000002235 9-11.7979999999516 9-108.10599999968 \
9-99.0159999998286 9-102.640999999829 9-38.414000000339
Process Direct Wokeup Pages Pages Pages
details Rclms Kswapd Scanned Sync-IO ASync-IO
cc1-30800 0 1 0 0 0 wakeup-0=1
cc1-24260 0 1 0 0 0 wakeup-0=1
cc1-24152 0 12 0 0 0 wakeup-0=12
cc1-8139 0 1 0 0 0 wakeup-0=1
cc1-4390 0 1 0 0 0 wakeup-0=1
cc1-4648 0 7 0 0 0 wakeup-0=7
cc1-4552 0 3 0 0 0 wakeup-0=3
dd-4550 0 31 0 0 0 wakeup-0=31
date-4898 0 1 0 0 0 wakeup-0=1
cc1-6549 0 7 0 0 0 wakeup-0=7
as-22202 0 17 0 0 0 wakeup-0=17
cc1-6495 0 9 0 0 0 wakeup-0=9
cc1-8299 0 1 0 0 0 wakeup-0=1
cc1-6009 0 1 0 0 0 wakeup-0=1
cc1-2574 0 2 0 0 0 wakeup-0=2
cc1-30568 0 1 0 0 0 wakeup-0=1
cc1-2679 0 6 0 0 0 wakeup-0=6
sh-13747 0 12 0 0 0 wakeup-0=12
cc1-22193 0 18 0 0 0 wakeup-0=18
cc1-30725 0 2 0 0 0 wakeup-0=2
as-4392 0 2 0 0 0 wakeup-0=2
cc1-28180 0 14 0 0 0 wakeup-0=14
cc1-13697 0 2 0 0 0 wakeup-0=2
cc1-22207 0 8 0 0 0 wakeup-0=8
cc1-15270 0 179 0 0 0 wakeup-0=179
cc1-22011 0 82 0 0 0 wakeup-0=82
cp-14682 0 1 0 0 0 wakeup-0=1
as-11926 0 2 0 0 0 wakeup-0=2
cc1-6016 0 5 0 0 0 wakeup-0=5
make-18554 0 13 0 0 0 wakeup-0=13
cc1-8292 0 12 0 0 0 wakeup-0=12
make-24381 0 1 0 0 0 wakeup-1=1
date-18681 0 33 0 0 0 wakeup-0=33
cc1-32276 0 1 0 0 0 wakeup-0=1
timestamp-outpu-2809 0 253 0 0 0 wakeup-0=240 wakeup-1=13
date-18624 0 7 0 0 0 wakeup-0=7
cc1-30960 0 9 0 0 0 wakeup-0=9
cc1-4014 0 1 0 0 0 wakeup-0=1
cc1-30706 0 22 0 0 0 wakeup-0=22
uname-3942 4 1 306 0 17 direct-9=4 wakeup-9=1
cc1-28207 0 1 0 0 0 wakeup-0=1
cc1-30563 0 9 0 0 0 wakeup-0=9
cc1-22214 0 10 0 0 0 wakeup-0=10
cc1-28221 0 11 0 0 0 wakeup-0=11
cc1-28123 0 6 0 0 0 wakeup-0=6
kswapd0-311 0 7 357302 0 34233 wakeup-0=7
cc1-5988 0 7 0 0 0 wakeup-0=7
as-30734 0 161 0 0 0 wakeup-0=161
cc1-22004 0 45 0 0 0 wakeup-0=45
date-4590 0 4 0 0 0 wakeup-0=4
cc1-15279 0 213 0 0 0 wakeup-0=213
date-30735 0 1 0 0 0 wakeup-0=1
cc1-30583 0 4 0 0 0 wakeup-0=4
cc1-32324 0 2 0 0 0 wakeup-0=2
cc1-23933 0 3 0 0 0 wakeup-0=3
cc1-22001 0 36 0 0 0 wakeup-0=36
bench-stresshig-3942 287 287 80186 6295 12196 direct-9=287 wakeup-9=287
cc1-28170 0 7 0 0 0 wakeup-0=7
date-7932 0 92 0 0 0 wakeup-0=92
cc1-22222 0 6 0 0 0 wakeup-0=6
cc1-32334 0 16 0 0 0 wakeup-0=16
cc1-2690 0 6 0 0 0 wakeup-0=6
cc1-30733 0 9 0 0 0 wakeup-0=9
cc1-32298 0 2 0 0 0 wakeup-0=2
cc1-13743 0 18 0 0 0 wakeup-0=18
cc1-22186 0 4 0 0 0 wakeup-0=4
cc1-28214 0 11 0 0 0 wakeup-0=11
cc1-13735 0 1 0 0 0 wakeup-0=1
updatedb-8173 0 18 0 0 0 wakeup-0=18
cc1-13750 0 3 0 0 0 wakeup-0=3
cat-2808 0 2 0 0 0 wakeup-0=2
cc1-15277 0 169 0 0 0 wakeup-0=169
date-18317 0 1 0 0 0 wakeup-0=1
cc1-15274 0 197 0 0 0 wakeup-0=197
cc1-30732 0 1 0 0 0 wakeup-0=1

Kswapd Kswapd Order Pages Pages Pages
Instance Wakeups Re-wakeup Scanned Sync-IO ASync-IO
kswapd0-311 91 24 357302 0 34233 wake-0=31 wake-1=1 wake-9=59 rewake-0=10 rewake-1=1 rewake-9=13

Summary
Direct reclaims: 291
Direct reclaim pages scanned: 437794
Direct reclaim write sync I/O: 6295
Direct reclaim write async I/O: 46446
Wake kswapd requests: 2152
Time stalled direct reclaim: 519.163009000002 ms

Kswapd wakeups: 91
Kswapd pages scanned: 357302
Kswapd reclaim write sync I/O: 0
Kswapd reclaim write async I/O: 34233
Time kswapd awake: 5282.749757 ms

Signed-off-by: Mel Gorman <[email protected]>
---
.../trace/postprocess/trace-vmscan-postprocess.pl | 654 ++++++++++++++++++++
1 files changed, 654 insertions(+), 0 deletions(-)
create mode 100644 Documentation/trace/postprocess/trace-vmscan-postprocess.pl

diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
new file mode 100644
index 0000000..b48d968
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl
@@ -0,0 +1,654 @@
+#!/usr/bin/perl
+# This is a POC for reading the text representation of trace output related to
+# page reclaim. It makes an attempt to extract some high-level information on
+# what is going on. The accuracy of the parser may vary
+#
+# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
+# other options
+# --read-procstat If the trace lacks process info, get it from /proc
+# --ignore-pid Aggregate processes of the same name together
+#
+# Copyright (c) IBM Corporation 2009
+# Author: Mel Gorman <[email protected]>
+use strict;
+use Getopt::Long;
+
+# Tracepoint events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN => 1;
+use constant MM_VMSCAN_DIRECT_RECLAIM_END => 2;
+use constant MM_VMSCAN_KSWAPD_WAKE => 3;
+use constant MM_VMSCAN_KSWAPD_SLEEP => 4;
+use constant MM_VMSCAN_LRU_SHRINK_ACTIVE => 5;
+use constant MM_VMSCAN_LRU_SHRINK_INACTIVE => 6;
+use constant MM_VMSCAN_LRU_ISOLATE => 7;
+use constant MM_VMSCAN_WRITEPAGE_SYNC => 8;
+use constant MM_VMSCAN_WRITEPAGE_ASYNC => 9;
+use constant EVENT_UNKNOWN => 10;
+
+# Per-order events
+use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11;
+use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER => 12;
+use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER => 13;
+use constant HIGH_KSWAPD_REWAKEUP_PERORDER => 14;
+
+# Constants used to track state
+use constant STATE_DIRECT_BEGIN => 15;
+use constant STATE_DIRECT_ORDER => 16;
+use constant STATE_KSWAPD_BEGIN => 17;
+use constant STATE_KSWAPD_ORDER => 18;
+
+# High-level events extrapolated from tracepoints
+use constant HIGH_DIRECT_RECLAIM_LATENCY => 19;
+use constant HIGH_KSWAPD_LATENCY => 20;
+use constant HIGH_KSWAPD_REWAKEUP => 21;
+use constant HIGH_NR_SCANNED => 22;
+use constant HIGH_NR_TAKEN => 23;
+use constant HIGH_NR_RECLAIM => 24;
+use constant HIGH_NR_CONTIG_DIRTY => 25;
+
+my %perprocesspid;
+my %perprocess;
+my %last_procmap;
+my $opt_ignorepid;
+my $opt_read_procstat;
+
+my $total_wakeup_kswapd;
+my ($total_direct_reclaim, $total_direct_nr_scanned);
+my ($total_direct_latency, $total_kswapd_latency);
+my ($total_direct_writepage_sync, $total_direct_writepage_async);
+my ($total_kswapd_nr_scanned, $total_kswapd_wake);
+my ($total_kswapd_writepage_sync, $total_kswapd_writepage_async);
+
+# Catch sigint and exit on request
+my $sigint_report = 0;
+my $sigint_exit = 0;
+my $sigint_pending = 0;
+my $sigint_received = 0;
+sub sigint_handler {
+ my $current_time = time;
+ if ($current_time - 2 > $sigint_received) {
+ print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
+ $sigint_report = 1;
+ } else {
+ if (!$sigint_exit) {
+ print "Second SIGINT received quickly, exiting\n";
+ }
+ $sigint_exit++;
+ }
+
+ if ($sigint_exit > 3) {
+ print "Many SIGINTs received, exiting now without report\n";
+ exit;
+ }
+
+ $sigint_received = $current_time;
+ $sigint_pending = 1;
+}
+$SIG{INT} = "sigint_handler";
+
+# Parse command line options
+GetOptions(
+ 'ignore-pid' => \$opt_ignorepid,
+ 'read-procstat' => \$opt_read_procstat,
+);
+
+# Defaults for dynamically discovered regexes
+my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)';
+my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)';
+my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)';
+my $regex_kswapd_sleep_default = 'nid=([0-9]*)';
+my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)';
+my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) contig_taken=([0-9]*) contig_dirty=([0-9]*) contig_failed=([0-9]*)';
+my $regex_lru_shrink_inactive_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*)';
+my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)';
+my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) sync_io=([0-9]*)';
+
+# Dynamically discovered regexes
+my $regex_direct_begin;
+my $regex_direct_end;
+my $regex_kswapd_wake;
+my $regex_kswapd_sleep;
+my $regex_wakeup_kswapd;
+my $regex_lru_isolate;
+my $regex_lru_shrink_inactive;
+my $regex_lru_shrink_active;
+my $regex_writepage;
+
+# Static regex used. Specified like this for readability and for use with /o
+# (process_pid) (cpus ) ( time ) (tpoint ) (details)
+my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
+my $regex_statname = '[-0-9]*\s\((.*)\).*';
+my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
+
+sub generate_traceevent_regex {
+ my $event = shift;
+ my $default = shift;
+ my $regex;
+
+ # Read the event format or use the default
+ if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
+ print("WARNING: Event $event format string not found\n");
+ return $default;
+ } else {
+ my $line;
+ while (!eof(FORMAT)) {
+ $line = <FORMAT>;
+ $line =~ s/, REC->.*//;
+ if ($line =~ /^print fmt:\s"(.*)".*/) {
+ $regex = $1;
+ $regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g;
+ $regex =~ s/%p/\([0-9a-f]*\)/g;
+ $regex =~ s/%d/\([-0-9]*\)/g;
+ $regex =~ s/%ld/\([-0-9]*\)/g;
+ $regex =~ s/%lu/\([0-9]*\)/g;
+ }
+ }
+ }
+
+ # Can't handle the print_flags stuff but in the context of this
+ # script, it really doesn't matter
+ $regex =~ s/\(REC.*\) \? __print_flags.*//;
+
+ # Verify fields are in the right order
+ my $tuple;
+ foreach $tuple (split /\s/, $regex) {
+ my ($key, $value) = split(/=/, $tuple);
+ my $expected = shift;
+ if ($key ne $expected) {
+ print("WARNING: Format not as expected for event $event '$key' != '$expected'\n");
+ $regex =~ s/$key=\((.*)\)/$key=$1/;
+ }
+ }
+
+ if (defined shift) {
+ die("Fewer fields than expected in format");
+ }
+
+ return $regex;
+}
+
+$regex_direct_begin = generate_traceevent_regex(
+ "vmscan/mm_vmscan_direct_reclaim_begin",
+ $regex_direct_begin_default,
+ "order", "may_writepage",
+ "gfp_flags");
+$regex_direct_end = generate_traceevent_regex(
+ "vmscan/mm_vmscan_direct_reclaim_end",
+ $regex_direct_end_default,
+ "nr_reclaimed");
+$regex_kswapd_wake = generate_traceevent_regex(
+ "vmscan/mm_vmscan_kswapd_wake",
+ $regex_kswapd_wake_default,
+ "nid", "order");
+$regex_kswapd_sleep = generate_traceevent_regex(
+ "vmscan/mm_vmscan_kswapd_sleep",
+ $regex_kswapd_sleep_default,
+ "nid");
+$regex_wakeup_kswapd = generate_traceevent_regex(
+ "vmscan/mm_vmscan_wakeup_kswapd",
+ $regex_wakeup_kswapd_default,
+ "nid", "zid", "order");
+$regex_lru_isolate = generate_traceevent_regex(
+ "vmscan/mm_vmscan_lru_isolate",
+ $regex_lru_isolate_default,
+ "isolate_mode", "order",
+ "nr_requested", "nr_scanned", "nr_taken",
+ "contig_taken", "contig_dirty", "contig_failed");
+$regex_lru_shrink_inactive = generate_traceevent_regex(
+ "vmscan/mm_vmscan_lru_shrink_inactive",
+ $regex_lru_shrink_inactive_default,
+ "nid", "zid",
+ "lru",
+ "nr_scanned", "nr_reclaimed", "priority");
+$regex_lru_shrink_active = generate_traceevent_regex(
+ "vmscan/mm_vmscan_lru_shrink_active",
+ $regex_lru_shrink_active_default,
+ "nid", "zid",
+ "lru",
+ "nr_scanned", "nr_rotated", "priority");
+$regex_writepage = generate_traceevent_regex(
+ "vmscan/mm_vmscan_writepage",
+ $regex_writepage_default,
+ "page", "pfn", "sync_io");
+
+sub read_statline($) {
+ my $pid = $_[0];
+ my $statline;
+
+ if (open(STAT, "/proc/$pid/stat")) {
+ $statline = <STAT>;
+ close(STAT);
+ }
+
+ if ($statline eq '') {
+ $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
+ }
+
+ return $statline;
+}
+
+sub guess_process_pid($$) {
+ my $pid = $_[0];
+ my $statline = $_[1];
+
+ if ($pid == 0) {
+ return "swapper-0";
+ }
+
+ if ($statline !~ /$regex_statname/o) {
+ die("Failed to match stat line for process name :: $statline");
+ }
+ return "$1-$pid";
+}
+
+# Convert sec.usec timestamp format
+sub timestamp_to_ms($) {
+ my $timestamp = $_[0];
+
+ my ($sec, $usec) = split (/\./, $timestamp);
+ return ($sec * 1000) + ($usec / 1000);
+}
+
+sub process_events {
+ my $traceevent;
+ my $process_pid;
+ my $cpus;
+ my $timestamp;
+ my $tracepoint;
+ my $details;
+ my $statline;
+
+ # Read each line of the event log
+EVENT_PROCESS:
+ while ($traceevent = <STDIN>) {
+ if ($traceevent =~ /$regex_traceevent/o) {
+ $process_pid = $1;
+ $timestamp = $3;
+ $tracepoint = $4;
+
+ $process_pid =~ /(.*)-([0-9]*)$/;
+ my $process = $1;
+ my $pid = $2;
+
+ if ($process eq "") {
+ $process = $last_procmap{$pid};
+ $process_pid = "$process-$pid";
+ }
+ $last_procmap{$pid} = $process;
+
+			if ($opt_read_procstat) {
+				$statline = read_statline($pid);
+				if ($process eq '') {
+					$process_pid = guess_process_pid($pid, $statline);
+				}
+			}
+ } else {
+ next;
+ }
+
+ # Perl Switch() sucks majorly
+ if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") {
+ $timestamp = timestamp_to_ms($timestamp);
+ $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++;
+ $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp;
+
+ $details = $5;
+ if ($details !~ /$regex_direct_begin/o) {
+ print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n";
+ print " $details\n";
+ print " $regex_direct_begin\n";
+ next;
+ }
+ my $order = $1;
+ $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++;
+ $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order;
+ } elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") {
+ # Count the event itself
+ my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+ $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++;
+
+ # Record how long direct reclaim took this time
+ if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) {
+ $timestamp = timestamp_to_ms($timestamp);
+ my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER};
+ my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN});
+ $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency";
+ }
+ } elsif ($tracepoint eq "mm_vmscan_kswapd_wake") {
+ $details = $5;
+ if ($details !~ /$regex_kswapd_wake/o) {
+ print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n";
+ print " $details\n";
+ print " $regex_kswapd_wake\n";
+ next;
+ }
+
+ my $order = $2;
+ $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order;
+ if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) {
+ $timestamp = timestamp_to_ms($timestamp);
+ $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++;
+ $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp;
+ $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++;
+ } else {
+ $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++;
+ $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++;
+ }
+ } elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") {
+
+ # Count the event itself
+ my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP};
+ $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++;
+
+ # Record how long kswapd was awake
+ $timestamp = timestamp_to_ms($timestamp);
+ my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER};
+ my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN});
+ $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency";
+ $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0;
+ } elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") {
+ $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++;
+
+ $details = $5;
+ if ($details !~ /$regex_wakeup_kswapd/o) {
+ print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n";
+ print " $details\n";
+ print " $regex_wakeup_kswapd\n";
+ next;
+ }
+ my $order = $3;
+ $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++;
+ } elsif ($tracepoint eq "mm_vmscan_lru_isolate") {
+ $details = $5;
+ if ($details !~ /$regex_lru_isolate/o) {
+ print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n";
+ print " $details\n";
+				print "          $regex_lru_isolate\n";
+ next;
+ }
+ my $nr_scanned = $4;
+ my $nr_contig_dirty = $7;
+ $perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned;
+ $perprocesspid{$process_pid}->{HIGH_NR_CONTIG_DIRTY} += $nr_contig_dirty;
+ } elsif ($tracepoint eq "mm_vmscan_writepage") {
+ $details = $5;
+ if ($details !~ /$regex_writepage/o) {
+ print "WARNING: Failed to parse mm_vmscan_writepage as expected\n";
+ print " $details\n";
+ print " $regex_writepage\n";
+ next;
+ }
+
+ my $sync_io = $3;
+ if ($sync_io) {
+ $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC}++;
+ } else {
+ $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC}++;
+ }
+ } else {
+ $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
+ }
+
+ if ($sigint_pending) {
+ last EVENT_PROCESS;
+ }
+ }
+}
+
+sub dump_stats {
+ my $hashref = shift;
+ my %stats = %$hashref;
+
+ # Dump per-process stats
+ my $process_pid;
+ my $max_strlen = 0;
+
+	# Find the length of the longest process name
+	foreach $process_pid (keys %stats) {
+ my $len = length($process_pid);
+ if ($len > $max_strlen) {
+ $max_strlen = $len;
+ }
+ }
+ $max_strlen += 2;
+
+ # Work out latencies
+ printf("\n") if !$opt_ignorepid;
+ printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid;
+ foreach $process_pid (keys %stats) {
+
+ if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] &&
+ !$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) {
+ next;
+ }
+
+ printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid;
+ my $index = 0;
+ while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] ||
+ defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) {
+
+ if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+ printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid;
+ my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+ $total_direct_latency += $latency;
+ } else {
+ printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid;
+ my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]);
+ $total_kswapd_latency += $latency;
+ }
+ $index++;
+ }
+ print "\n" if !$opt_ignorepid;
+ }
+
+ # Print out process activity
+ printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Time");
+	printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Sync-IO", "ASync-IO", "Stalled");
+ foreach $process_pid (keys %stats) {
+
+ if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+ next;
+ }
+
+ $total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+ $total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+ $total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+ $total_direct_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+ $total_direct_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+ my $index = 0;
+ my $this_reclaim_delay = 0;
+ while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) {
+ my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]);
+ $this_reclaim_delay += $latency;
+ $index++;
+ }
+
+ printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8.3f",
+ $process_pid,
+ $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN},
+ $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD},
+ $stats{$process_pid}->{HIGH_NR_SCANNED},
+ $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+ $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC},
+ $this_reclaim_delay / 1000);
+
+ if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) {
+ print " ";
+ for (my $order = 0; $order < 20; $order++) {
+ my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+ if ($count != 0) {
+ print "direct-$order=$count ";
+ }
+ }
+ }
+ if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) {
+ print " ";
+ for (my $order = 0; $order < 20; $order++) {
+ my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+ if ($count != 0) {
+ print "wakeup-$order=$count ";
+ }
+ }
+ }
+ if ($stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY}) {
+ print " ";
+ my $count = $stats{$process_pid}->{HIGH_NR_CONTIG_DIRTY};
+ if ($count != 0) {
+ print "contig-dirty=$count ";
+ }
+ }
+
+ print "\n";
+ }
+
+ # Print out kswapd activity
+ printf("\n");
+	printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages");
+	printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Sync-IO", "ASync-IO");
+ foreach $process_pid (keys %stats) {
+
+ if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+ next;
+ }
+
+ $total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+ $total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED};
+ $total_kswapd_writepage_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+ $total_kswapd_writepage_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+		printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u",
+ $process_pid,
+ $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE},
+ $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP},
+ $stats{$process_pid}->{HIGH_NR_SCANNED},
+ $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC},
+ $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC});
+
+ if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) {
+ print " ";
+ for (my $order = 0; $order < 20; $order++) {
+ my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+ if ($count != 0) {
+ print "wake-$order=$count ";
+ }
+ }
+ }
+ if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) {
+ print " ";
+ for (my $order = 0; $order < 20; $order++) {
+ my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order];
+ if ($count != 0) {
+ print "rewake-$order=$count ";
+ }
+ }
+ }
+ printf("\n");
+ }
+
+ # Print out summaries
+ $total_direct_latency /= 1000;
+ $total_kswapd_latency /= 1000;
+ print "\nSummary\n";
+ print "Direct reclaims: $total_direct_reclaim\n";
+ print "Direct reclaim pages scanned: $total_direct_nr_scanned\n";
+ print "Direct reclaim write sync I/O: $total_direct_writepage_sync\n";
+ print "Direct reclaim write async I/O: $total_direct_writepage_async\n";
+ print "Wake kswapd requests: $total_wakeup_kswapd\n";
+	printf "Time stalled direct reclaim: %-1.2f seconds\n", $total_direct_latency;
+ print "\n";
+ print "Kswapd wakeups: $total_kswapd_wake\n";
+ print "Kswapd pages scanned: $total_kswapd_nr_scanned\n";
+ print "Kswapd reclaim write sync I/O: $total_kswapd_writepage_sync\n";
+ print "Kswapd reclaim write async I/O: $total_kswapd_writepage_async\n";
+	printf "Time kswapd awake: %-1.2f seconds\n", $total_kswapd_latency;
+}
+
+sub aggregate_perprocesspid() {
+ my $process_pid;
+ my $process;
+ undef %perprocess;
+
+ foreach $process_pid (keys %perprocesspid) {
+ $process = $process_pid;
+		$process =~ s/-[0-9]*$//;
+ if ($process eq '') {
+ $process = "NO_PROCESS_NAME";
+ }
+
+ $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN};
+ $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE};
+ $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD};
+ $perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP};
+ $perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED};
+ $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_SYNC};
+ $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ASYNC};
+
+ for (my $order = 0; $order < 20; $order++) {
+ $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order];
+ $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order];
+ $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order];
+
+ }
+
+ # Aggregate direct reclaim latencies
+ my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END};
+ my $rd_index = 0;
+ while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) {
+ $perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index];
+ $rd_index++;
+ $wr_index++;
+ }
+ $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index;
+
+		# Aggregate kswapd latencies
+		$wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP};
+		$rd_index = 0;
+		while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) {
+			$perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index];
+			$rd_index++;
+			$wr_index++;
+		}
+		$perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP} = $wr_index;
+ }
+}
+
+sub report() {
+ if (!$opt_ignorepid) {
+ dump_stats(\%perprocesspid);
+ } else {
+ aggregate_perprocesspid();
+ dump_stats(\%perprocess);
+ }
+}
+
+# Process events or signals until neither is available
+sub signal_loop() {
+ my $sigint_processed;
+ do {
+ $sigint_processed = 0;
+ process_events();
+
+ # Handle pending signals if any
+ if ($sigint_pending) {
+ my $current_time = time;
+
+ if ($sigint_exit) {
+ print "Received exit signal\n";
+ $sigint_pending = 0;
+ }
+ if ($sigint_report) {
+ if ($current_time >= $sigint_received + 2) {
+ report();
+ $sigint_report = 0;
+ $sigint_pending = 0;
+ $sigint_processed = 1;
+ }
+ }
+ }
+ } while ($sigint_pending || $sigint_processed);
+}
+
+signal_loop();
+report();
--
1.7.1

2010-06-14 15:10:18

by Christoph Hellwig

Subject: Re: [PATCH 0/12] Avoid overflowing of stack during page reclaim V2

On Mon, Jun 14, 2010 at 12:17:41PM +0100, Mel Gorman wrote:
> This is a merging of two series - the first of which reduces stack usage
> in page reclaim and the second which writes contiguous pages during reclaim
> and avoids writeback in direct reclaimers.

This stuff looks good to me from the filesystem POV.

You might want to throw in a follow on patch to remove the PF_MEMALLOC
checks from the various ->writepage methods.

2010-06-14 15:47:03

by Rik van Riel

Subject: Re: [PATCH 01/12] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> This patch adds two trace events for kswapd waking up and going asleep for
> the purposes of tracking kswapd activity and two trace events for direct
> reclaim beginning and ending. The information can be used to work out how
> much time a process or the system is spending on the reclamation of pages
> and in the case of direct reclaim, how many pages were reclaimed for that
> process. High frequency triggering of these events could point to memory
> pressure problems.
>
> Signed-off-by: Mel Gorman<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-06-14 16:56:19

by Rik van Riel

Subject: Re: [PATCH 02/12] tracing, vmscan: Add trace events for LRU page isolation

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> This patch adds an event for when pages are isolated en-masse from the
> LRU lists. This event augments the information available on LRU traffic
> and can be used to evaluate lumpy reclaim.
>
> Signed-off-by: Mel Gorman<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-06-14 17:38:16

by Rik van Riel

Subject: Re: [PATCH 03/12] tracing, vmscan: Add trace event when a page is written

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> This patch adds a trace event for when page reclaim queues a page for IO and
> records whether it is synchronous or asynchronous. Excessive synchronous
> IO for a process can result in noticeable stalls during direct reclaim.
> Excessive IO from page reclaim may indicate that the system is seriously
> under provisioned for the amount of dirty pages that exist.
>
> Signed-off-by: Mel Gorman<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-06-14 17:56:32

by Rik van Riel

Subject: Re: [PATCH 04/12] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> This patch adds a simple post-processing script for the reclaim-related
> trace events. It can be used to give an indication of how much traffic
> there is on the LRU lists and how severe latencies due to reclaim are.
> Example output looks like the following

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-06-14 18:06:20

by Rik van Riel

Subject: Re: [PATCH 05/12] vmscan: kill prev_priority completely

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> From: KOSAKI Motohiro<[email protected]>
>
> Since 2.6.28 zone->prev_priority is unused. Then it can be removed
> safely. It reduce stack usage slightly.
>
> Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
> can be integrate again, it's useful. but four (or more) times trying
> haven't got good performance number. Thus I give up such approach.
>
> Signed-off-by: KOSAKI Motohiro<[email protected]>
> Reviewed-by: Johannes Weiner<[email protected]>
> Signed-off-by: Mel Gorman<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-06-14 18:07:48

by Rik van Riel

Subject: Re: [PATCH 06/12] vmscan: simplify shrink_inactive_list()

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> From: KOSAKI Motohiro<[email protected]>
>
> Now, max_scan of shrink_inactive_list() is always passed less than
> SWAP_CLUSTER_MAX. then, we can remove scanning pages loop in it.
> This patch also help stack diet.
>
> detail
> - remove "while (nr_scanned< max_scan)" loop
> - remove nr_freed (now, we use nr_reclaimed directly)
> - remove nr_scan (now, we use nr_scanned directly)
> - rename max_scan to nr_to_scan
> - pass nr_to_scan into isolate_pages() directly instead
> using SWAP_CLUSTER_MAX
>
> Signed-off-by: KOSAKI Motohiro<[email protected]>
> Reviewed-by: Johannes Weiner<[email protected]>

Other than the weird whitespace below,

Reviewed-by: Rik van Riel <[email protected]>

> + /*
> + * If we are direct reclaiming for contiguous pages and we do
> + * not reclaim everything in the list, try again and wait
> + * for IO to complete. This will stall high-order allocations
> + * but that should be acceptable to the caller
> + */
> + if (nr_reclaimed< nr_taken&& !current_is_kswapd()&& sc->lumpy_reclaim_mode) {
> + congestion_wait(BLK_RW_ASYNC, HZ/10);

--
All rights reversed

2010-06-14 18:14:50

by Rik van Riel

Subject: Re: [PATCH 07/12] vmscan: Remove unnecessary temporary vars in do_try_to_free_pages

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> Remove temporary variable that is only used once and does not help
> clarify code.
>
> Signed-off-by: Mel Gorman<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-06-14 19:00:11

by Rik van Riel

Subject: Re: [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list()

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> shrink_inactive_list() sets up a pagevec to release unfreeable pages. It
> uses significant amounts of stack doing this. This patch splits
> shrink_inactive_list() to take the stack usage out of the main path so
> that callers to writepage() do not contain an unused pagevec on the
> stack.
>
> Signed-off-by: Mel Gorman<[email protected]>
> Reviewed-by: Johannes Weiner<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-06-14 19:25:25

by Rik van Riel

Subject: Re: [PATCH 09/12] vmscan: Setup pagevec as late as possible in shrink_page_list()

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> shrink_page_list() sets up a pagevec to release pages as according as they
> are free. It uses significant amounts of stack on the pagevec. This
> patch adds pages to be freed via pagevec to a linked list which is then
> freed en-masse at the end. This avoids using stack in the main path that
> potentially calls writepage().
>
> Signed-off-by: Mel Gorman<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-06-14 19:44:14

by Rik van Riel

Subject: Re: [PATCH 10/12] vmscan: Update isolated page counters outside of main path in shrink_inactive_list()

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> When shrink_inactive_list() isolates pages, it updates a number of
> counters using temporary variables to gather them. These consume stack
> and it's in the main path that calls ->writepage(). This patch moves the
> accounting updates outside of the main path to reduce stack usage.
>
> Signed-off-by: Mel Gorman<[email protected]>
> Reviewed-by: Johannes Weiner<[email protected]>

Acked-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-06-14 20:56:51

by Larry Woodman

Subject: Re: [PATCH 01/12] tracing, vmscan: Add trace events for kswapd wakeup, sleeping and direct reclaim

On Mon, 2010-06-14 at 12:17 +0100, Mel Gorman wrote:
> This patch adds two trace events for kswapd waking up and going asleep for
> the purposes of tracking kswapd activity and two trace events for direct
> reclaim beginning and ending. The information can be used to work out how
> much time a process or the system is spending on the reclamation of pages
> and in the case of direct reclaim, how many pages were reclaimed for that
> process. High frequency triggering of these events could point to memory
> pressure problems.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Larry Woodman <[email protected]>

2010-06-14 20:57:59

by Larry Woodman

Subject: Re: [PATCH 02/12] tracing, vmscan: Add trace events for LRU page isolation

On Mon, 2010-06-14 at 12:17 +0100, Mel Gorman wrote:
> This patch adds an event for when pages are isolated en-masse from the
> LRU lists. This event augments the information available on LRU traffic
> and can be used to evaluate lumpy reclaim.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Larry Woodman <[email protected]>

2010-06-14 20:58:11

by Larry Woodman

Subject: Re: [PATCH 03/12] tracing, vmscan: Add trace event when a page is written

On Mon, 2010-06-14 at 12:17 +0100, Mel Gorman wrote:
> This patch adds a trace event for when page reclaim queues a page for IO and
> records whether it is synchronous or asynchronous. Excessive synchronous
> IO for a process can result in noticeable stalls during direct reclaim.
> Excessive IO from page reclaim may indicate that the system is seriously
> under provisioned for the amount of dirty pages that exist.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Larry Woodman <[email protected]>

2010-06-14 20:59:30

by Larry Woodman

Subject: Re: [PATCH 04/12] tracing, vmscan: Add a postprocessing script for reclaim-related ftrace events

On Mon, 2010-06-14 at 12:17 +0100, Mel Gorman wrote:
> This patch adds a simple post-processing script for the reclaim-related
> trace events. It can be used to give an indication of how much traffic
> there is on the LRU lists and how severe latencies due to reclaim are.

Acked-by: Larry Woodman <[email protected]>

2010-06-14 21:14:32

by Rik van Riel

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On 06/14/2010 07:17 AM, Mel Gorman wrote:
> Page reclaim cleans individual pages using a_ops->writepage() because from
> the VM perspective, it is known that pages in a particular zone must be freed
> soon, it considers the target page to be the oldest and it does not want
> to wait while background flushers cleans other pages. From a filesystem
> perspective this is extremely inefficient as it generates a very seeky
> IO pattern leading to the perverse situation where it can take longer to
> clean all dirty pages than it would have otherwise.

Reclaiming clean pages should be fast enough that this should
make little, if any, difference.

> This patch queues all dirty pages at once to maximise the chances that
> the write requests get merged efficiently. It also makes the next patch
> that avoids writeout from direct reclaim more straight-forward.

However, this is a convincing argument :)

> Signed-off-by: Mel Gorman<[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

--
All rights reversed

2010-06-14 21:56:44

by Rik van Riel

Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On 06/14/2010 07:17 AM, Mel Gorman wrote:

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4856a2a..574e816 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -372,6 +372,12 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
> return PAGE_SUCCESS;
> }
>
> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
> +static inline bool reclaim_can_writeback(struct scan_control *sc)
> +{
> + return current_is_kswapd() || sc->mem_cgroup != NULL;
> +}
> +

I'm not entirely convinced on this bit, but am willing to
be convinced by the data.

--
All rights reversed

2010-06-14 23:12:57

by Dave Chinner

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Mon, Jun 14, 2010 at 12:17:52PM +0100, Mel Gorman wrote:
> Page reclaim cleans individual pages using a_ops->writepage() because from
> the VM perspective, it is known that pages in a particular zone must be freed
> soon, it considers the target page to be the oldest and it does not want
> to wait while background flushers cleans other pages. From a filesystem
> perspective this is extremely inefficient as it generates a very seeky
> IO pattern leading to the perverse situation where it can take longer to
> clean all dirty pages than it would have otherwise.
>
> This patch queues all dirty pages at once to maximise the chances that
> the write requests get merged efficiently. It also makes the next patch
> that avoids writeout from direct reclaim more straight-forward.

Seeing as you have a list of pages for IO, perhaps they could be sorted
before issuing ->writepage on them.

That is, while this patch issues all the IO in one hit, it doesn't
change the order in which the IO is issued - it is still issued in
LRU order. Given that they are issued in a short period of time now,
rather than across a longer scan period, it is likely that it will
not be any faster as:

a) IO will not be started as soon, and
b) the IO scheduler still only has a small re-ordering
window and will choke just as much on random IO patterns.

However, there is a list_sort() function that could be used to sort
the list; sorting the list of pages by mapping and page->index
within the mapping would result in all the pages on each mapping
being sent down in ascending offset order at once - exactly how the
filesystems want IO to be sent to it. Perhaps this is a simple
improvement that can be made to this code that will make a big
difference to worst case performance.

FWIW, I did this for delayed metadata buffer writeback in XFS
recently (i.e. sort the queue of (potentially tens of thousands of)
buffers in ascending block order before dispatch) and that showed a
10-15% reduction in seeks on simple kernel compile workloads. This
shows that if we optimise IO patterns at higher layers where the
sort window is much, much larger than in the IO scheduler, then
overall system performance improves....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-06-14 23:23:18

by Andrew Morton

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, 15 Jun 2010 09:11:44 +1000
Dave Chinner <[email protected]> wrote:

> On Mon, Jun 14, 2010 at 12:17:52PM +0100, Mel Gorman wrote:
> > Page reclaim cleans individual pages using a_ops->writepage() because from
> > the VM perspective, it is known that pages in a particular zone must be freed
> > soon, it considers the target page to be the oldest and it does not want
> > to wait while background flushers cleans other pages. From a filesystem
> > perspective this is extremely inefficient as it generates a very seeky
> > IO pattern leading to the perverse situation where it can take longer to
> > clean all dirty pages than it would have otherwise.
> >
> > This patch queues all dirty pages at once to maximise the chances that
> > the write requests get merged efficiently. It also makes the next patch
> > that avoids writeout from direct reclaim more straight-forward.
>
> Seeing as you have a list of pages for IO, perhaps they could be sorted
> before issuing ->writepage on them.
>
> That is, while this patch issues all the IO in one hit, it doesn't
> change the order in which the IO is issued - it is still issued in
> LRU order. Given that they are issued in a short period of time now,
> rather than across a longer scan period, it is likely that it will
> not be any faster as:
>
> a) IO will not be started as soon, and
> b) the IO scheduler still only has a small re-ordering
> window and will choke just as much on random IO patterns.
>
> However, there is a list_sort() function that could be used to sort
> the list; sorting the list of pages by mapping and page->index
> within the mapping would result in all the pages on each mapping
> being sent down in ascending offset order at once - exactly how the
> filesystems want IO to be sent to it. Perhaps this is a simple
> improvement that can be made to this code that will make a big
> difference to worst case performance.
>
> FWIW, I did this for delayed metadata buffer writeback in XFS
> recently (i.e. sort the queue of (potentially tens of thousands of)
> buffers in ascending block order before dispatch) and that showed a
> 10-15% reduction in seeks on simple kernel compile workloads. This
> shows that if we optimise IO patterns at higher layers where the
> sort window is much, much larger than in the IO scheduler, then
> overall system performance improves....

Yup.

But then, this all really should be done at the block layer so other
io-submitting-paths can benefit from it.

IOW, maybe "the sort queue is the submission queue" wasn't a good idea.

2010-06-15 00:13:16

by Kamezawa Hiroyuki

Subject: Re: [PATCH 0/12] Avoid overflowing of stack during page reclaim V2

On Mon, 14 Jun 2010 12:17:41 +0100
Mel Gorman <[email protected]> wrote:

> SysBench
> ========
> traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
> 1 11025.01 ( 0.00%) 10249.52 (-7.57%) 10430.57 (-5.70%)
> 2 3844.63 ( 0.00%) 4988.95 (22.94%) 4038.95 ( 4.81%)
> 3 3210.23 ( 0.00%) 2918.52 (-9.99%) 3113.38 (-3.11%)
> 4 1958.91 ( 0.00%) 1987.69 ( 1.45%) 1808.37 (-8.32%)
> 5 2864.92 ( 0.00%) 3126.13 ( 8.36%) 2355.70 (-21.62%)
> 6 4831.63 ( 0.00%) 3815.67 (-26.63%) 4164.09 (-16.03%)
> 7 3788.37 ( 0.00%) 3140.39 (-20.63%) 3471.36 (-9.13%)
> 8 2293.61 ( 0.00%) 1636.87 (-40.12%) 1754.25 (-30.75%)
> FTrace Reclaim Statistics
> traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
> Direct reclaims 9843 13398 51651
> Direct reclaim pages scanned 871367 1008709 3080593
> Direct reclaim write async I/O 24883 30699 0
> Direct reclaim write sync I/O 0 0 0

Hmm, page-scan and reclaims jumps up but...


> User/Sys Time Running Test (seconds) 734.52 712.39 703.9
> Percentage Time Spent Direct Reclaim 0.00% 0.00% 0.00%
> Total Elapsed Time (seconds) 9710.02 9589.20 9334.45
> Percentage Time kswapd Awake 0.06% 0.00% 0.00%
>

Execution time is reduced. Does this shows removing "I/O noise" by direct
reclaim makes the system happy ? or writeback in direct reclaim give
us too much costs ?

It seems I'll have to consider about avoiding direct-reciam in memcg, later.

BTW, I think we'll have to add a wait-for-pages-to-be-cleaned trick in
direct reclaim later if we want to avoid too much scanning.


Thank you for the interesting test.

Regards,
-Kame

2010-06-15 00:40:46

by Dave Chinner

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Mon, Jun 14, 2010 at 04:21:43PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 09:11:44 +1000
> Dave Chinner <[email protected]> wrote:
>
> > On Mon, Jun 14, 2010 at 12:17:52PM +0100, Mel Gorman wrote:
> > > Page reclaim cleans individual pages using a_ops->writepage() because from
> > > the VM perspective, it is known that pages in a particular zone must be freed
> > > soon, it considers the target page to be the oldest and it does not want
> > > to wait while background flushers clean other pages. From a filesystem
> > > perspective this is extremely inefficient as it generates a very seeky
> > > IO pattern leading to the perverse situation where it can take longer to
> > > clean all dirty pages than it would have otherwise.
> > >
> > > This patch queues all dirty pages at once to maximise the chances that
> > > the write requests get merged efficiently. It also makes the next patch
> > > that avoids writeout from direct reclaim more straightforward.
> >
> > Seeing as you have a list of pages for IO, perhaps they could be sorted
> > before issuing ->writepage on them.
> >
> > That is, while this patch issues all the IO in one hit, it doesn't
> > change the order in which the IO is issued - it is still issued in
> > LRU order. Given that they are issued in a short period of time now,
> > rather than across a longer scan period, it is likely that it will
> > not be any faster as:
> >
> > a) IO will not be started as soon, and
> > b) the IO scheduler still only has a small re-ordering
> > window and will choke just as much on random IO patterns.
> >
> > However, there is a list_sort() function that could be used to sort
> > the list; sorting the list of pages by mapping and page->index
> > within the mapping would result in all the pages on each mapping
> > being sent down in ascending offset order at once - exactly how the
> > filesystems want IO to be sent to it. Perhaps this is a simple
> > improvement that can be made to this code that will make a big
> > difference to worst case performance.
> >
> > FWIW, I did this for delayed metadata buffer writeback in XFS
> > recently (i.e. sort the queue of (potentially tens of thousands of)
> > buffers in ascending block order before dispatch) and that showed a
> > 10-15% reduction in seeks on simple kernel compile workloads. This
> > shows that if we optimise IO patterns at higher layers where the
> > sort window is much, much larger than in the IO scheduler, then
> > overall system performance improves....
>
> Yup.
>
> But then, this all really should be done at the block layer so other
> io-submitting-paths can benefit from it.

That was what we did in the past with really, really deep IO
scheduler queues. That leads to IO latency and OOM problems because
we could lock gigabytes of memory away under IO and take minutes to
clean it.

Besides, there really isn't the right context in the block layer to
be able to queue and prioritise large amounts of IO without
significant penalties to some higher layer operation.

> IOW, maybe "the sort queue is the submission queue" wasn't a good idea.

Perhaps, but IMO sorting should be done where the context allows it
to be done most efficiently. Sorting is most effective whenever a
significant queue of IO is formed, whether it be in the filesystem,
the VFS, the VM or the block layer because the IO stack is very much
a GIGO queue.

Simply put, there's nothing the lower layers can do to optimise bad
IO patterns from the higher layers because they have small sort
windows which are necessary to keep IO latency in check. Hence if
the higher layers feed the lower layers crap they simply don't have
the context or depth to perform the same level of optimisations we
can do easily higher up the stack.

IOWs, IMO anywhere there is a context with significant queue of IO,
that's where we should be doing a better job of sorting before that
IO is dispatched to the lower layers. This is still no guarantee of
better IO (e.g. if the filesystem fragments the file) but it does
give the lower layers a far better chance at optimal allocation and
scheduling of IO...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-06-15 01:17:33

by Rik van Riel

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On 06/14/2010 08:39 PM, Dave Chinner wrote:
> On Mon, Jun 14, 2010 at 04:21:43PM -0700, Andrew Morton wrote:

>> But then, this all really should be done at the block layer so other
>> io-submitting-paths can benefit from it.
>
> That was what we did in the past with really, really deep IO
> scheduler queues. That leads to IO latency and OOM problems because
> we could lock gigabytes of memory away under IO and take minutes to
> clean it.
>
> Besides, there really isn't the right context in the block layer to
> be able to queue and prioritise large amounts of IO without
> significant penalties to some higher layer operation.

Can we kick flushing for the whole inode at once from
vmscan.c?

That way we should:
1) ensure that the page we want is written to disk, and
2) we flush out related pages at the same time, getting
a decent IO pattern

Chances are that if we want to evict one page from a
file, we'll also want to evict other pages from that
same file. In fact, chances are a good number of them
will live nearby on the LRU list.

Does this make sense?

Would it be hard to add a "please flush this file"
way to call the filesystem flushing threads?

--
All rights reversed

2010-06-15 01:41:32

by Andrew Morton

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, 15 Jun 2010 10:39:43 +1000 Dave Chinner <[email protected]> wrote:

> On Mon, Jun 14, 2010 at 04:21:43PM -0700, Andrew Morton wrote:
> > On Tue, 15 Jun 2010 09:11:44 +1000
> > > 10-15% reduction in seeks on simple kernel compile workloads. This
> > > shows that if we optimise IO patterns at higher layers where the
> > > sort window is much, much larger than in the IO scheduler, then
> > > overall system performance improves....
> >
> > Yup.
> >
> > But then, this all really should be done at the block layer so other
> > io-submitting-paths can benefit from it.
>
> That was what we did in the past with really, really deep IO
> scheduler queues. That leads to IO latency and OOM problems because
> we could lock gigabytes of memory away under IO and take minutes to
> clean it.
>
> Besides, there really isn't the right context in the block layer to
> be able to queue and prioritise large amounts of IO without
> significant penalties to some higher layer operation.
>
> > IOW, maybe "the sort queue is the submission queue" wasn't a good idea.
>
> Perhaps, but IMO sorting should be done where the context allows it
> to be done most efficiently. Sorting is most effective whenever a
> significant queue of IO is formed, whether it be in the filesystem,
> the VFS, the VM or the block layer because the IO stack is very much
> a GIGO queue.
>
> Simply put, there's nothing the lower layers can do to optimise bad
> IO patterns from the higher layers because they have small sort
> windows which are necessary to keep IO latency in check. Hence if
> the higher layers feed the lower layers crap they simply don't have
> the context or depth to perform the same level of optimisations we
> can do easily higher up the stack.
>
> IOWs, IMO anywhere there is a context with significant queue of IO,
> that's where we should be doing a better job of sorting before that
> IO is dispatched to the lower layers. This is still no guarantee of
> better IO (e.g. if the filesystem fragments the file) but it does
> give the lower layers a far better chance at optimal allocation and
> scheduling of IO...

None of what you said had much to do with what I said.

What you've described are implementation problems in the current block
layer because it conflates "sorting" with "queueing". I'm saying "fix
that".

And... sorting at the block layer will always be superior to sorting
at the pagecache layer because the block layer sorts at the physical
block level and can handle not-well-laid-out files and can sort and merge
pages from different address_spaces.

Still, I suspect none of it will improve anything anyway. Those pages
are still dirty, possibly-locked and need to go to disk. It doesn't
matter from the MM POV whether they sit in some VM list or in the
request queue.

Possibly there may be some benefit to not putting so many of these
unreclaimable pages into the queue all at the same time. But
that's a shortcoming in the block code: we should be able to shove
arbitrary numbers of dirty page (segments) into the queue and not gum
the system up. Don't try to work around that in the VM.

2010-06-15 01:46:32

by Andrew Morton

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Mon, 14 Jun 2010 21:16:29 -0400 Rik van Riel <[email protected]> wrote:

> Would it be hard to add a "please flush this file"
> way to call the filesystem flushing threads?

Passing the igrab()bed inode into the flusher threads would fix the
iput_final() problems, as long as the alloc_pages() caller never blocks
indefinitely waiting for the work which the flusher threads are doing.

Otherwise we get (very hard-to-hit) deadlocks where the alloc_pages()
caller holds VFS locks and is waiting for the flusher threads while all
the flusher threads are stuck under iput_final() waiting for those VFS
locks.

That's fixable by not using igrab()/iput(). You can use lock_page() to
pin the address_space. Pass the address of the locked page across to
the flusher threads so they don't try to lock it a second time, or just
use trylocking on that writeback path or whatever.

2010-06-15 03:21:28

by Dave Chinner

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Mon, Jun 14, 2010 at 06:39:57PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 10:39:43 +1000 Dave Chinner <[email protected]> wrote:
>
> > On Mon, Jun 14, 2010 at 04:21:43PM -0700, Andrew Morton wrote:
> > > On Tue, 15 Jun 2010 09:11:44 +1000
> > > > 10-15% reduction in seeks on simple kernel compile workloads. This
> > > > shows that if we optimise IO patterns at higher layers where the
> > > > sort window is much, much larger than in the IO scheduler, then
> > > > overall system performance improves....
> > >
> > > Yup.
> > >
> > > But then, this all really should be done at the block layer so other
> > > io-submitting-paths can benefit from it.
> >
> > That was what we did in the past with really, really deep IO
> > scheduler queues. That leads to IO latency and OOM problems because
> > we could lock gigabytes of memory away under IO and take minutes to
> > clean it.
> >
> > Besides, there really isn't the right context in the block layer to
> > be able to queue and prioritise large amounts of IO without
> > significant penalties to some higher layer operation.
> >
> > > IOW, maybe "the sort queue is the submission queue" wasn't a good idea.
> >
> > Perhaps, but IMO sorting should be done where the context allows it
> > to be done most efficiently. Sorting is most effective whenever a
> > significant queue of IO is formed, whether it be in the filesystem,
> > the VFS, the VM or the block layer because the IO stack is very much
> > a GIGO queue.
> >
> > Simply put, there's nothing the lower layers can do to optimise bad
> > IO patterns from the higher layers because they have small sort
> > windows which are necessary to keep IO latency in check. Hence if
> > the higher layers feed the lower layers crap they simply don't have
> > the context or depth to perform the same level of optimisations we
> > can do easily higher up the stack.
> >
> > IOWs, IMO anywhere there is a context with significant queue of IO,
> > that's where we should be doing a better job of sorting before that
> > IO is dispatched to the lower layers. This is still no guarantee of
> > better IO (e.g. if the filesystem fragments the file) but it does
> > give the lower layers a far better chance at optimal allocation and
> > scheduling of IO...
>
> None of what you said had much to do with what I said.
>
> What you've described are implementation problems in the current block
> layer because it conflates "sorting" with "queueing". I'm saying "fix
> that".

You can't sort until you've queued.

> And... sorting at the block layer will always be superior to sorting
> at the pagecache layer because the block layer sorts at the physical
> block level and can handle not-well-laid-out files and can sort and merge
> pages from different address_spaces.

Yes, it can do that. And it still does that even if the higher
layers sort their I/O dispatch better.

Filesystems try very hard to allocate adjacent logical offsets in a
file in adjacent physical blocks on disk - that's the whole point of
extent-indexed filesystems. Hence with modern filesystems there is
generally a direct correlation between the page {mapping,index}
tuple and the physical location of the mapped block.

i.e. there is generally zero physical correlation between pages in
different mappings, but there is a high physical correlation
between the index of pages on the same mapping. Hence by sorting
where we have a {mapping,index} context, we push out IO that is
much more likely to be in contiguous physical chunks than the
current random page shootdown.

We optimise applications to use these sorts of correlations all the
time to improve IO patterns. Why can't we make the same sort of
optimisations to the IO that the VM issues?

> Still, I suspect none of it will improve anything anyway. Those pages
> are still dirty, possibly-locked and need to go to disk. It doesn't
> matter from the MM POV whether they sit in some VM list or in the
> request queue.

Oh, but it does.....

> Possibly there may be some benefit to not putting so many of these
> unreclaimable pages into the queue all at the same time. But
> that's a shortcoming in the block code: we should be able to shove
> arbitrary numbers of dirty page (segments) into the queue and not gum
> the system up. Don't try to work around that in the VM.

I think you know perfectly well why the system gums up when we
increase block layer queue depth: it's the fact that the _VM_ relies
on block layer queue congestion to limit the amount of dirty memory
in the system.

We've got a feedback loop between the block layer and the VM that
only works if block device queues are kept shallow. Keeping the
number of dirty pages under control is a VM responsibility, but it
is putting limitations on the block layer to ensure that the VM
works correctly. If you want the block layer to have deep queues,
then someone needs to fix the VM not to require knowledge of the
internal operation of the block layer for correct operation.

Adding a few lines of code to sort a list in the VM is far, far
easier than redesigning the write throttling code....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-06-15 04:09:08

by Rik van Riel

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On 06/14/2010 09:45 PM, Andrew Morton wrote:
> On Mon, 14 Jun 2010 21:16:29 -0400 Rik van Riel <[email protected]> wrote:
>
>> Would it be hard to add a "please flush this file"
>> way to call the filesystem flushing threads?
>
> Passing the igrab()bed inode into the flusher threads would fix the
> iput_final() problems, as long as the alloc_pages() caller never blocks
> indefinitely waiting for the work which the flusher threads are doing.
>
> Otherwise we get (very hard-to-hit) deadlocks where the alloc_pages()
> caller holds VFS locks and is waiting for the flusher threads while all
> the flusher threads are stuck under iput_final() waiting for those VFS
> locks.
>
> That's fixable by not using igrab()/iput(). You can use lock_page() to
> pin the address_space. Pass the address of the locked page across to
> the flusher threads so they don't try to lock it a second time, or just
> use trylocking on that writeback path or whatever.

Any thread that does not have __GFP_FS set in its gfp_mask
cannot wait for the flusher to complete. This is regardless
of the mechanism used to kick the flusher.

Then again, those threads cannot call ->writepage today
either, so we should be fine keeping that behaviour.

Threads that do have __GFP_FS in their gfp_mask can wait
for the flusher in various ways. Maybe the lock_page()
method can be simplified by having the flusher thread
unlock the page the moment it gets it, and then run the
normal flusher code?

The pageout code (in shrink_page_list) already unlocks
the page anyway before putting it back on the relevant
LRU list. It would be easy enough to skip that unlock
and let the flusher thread take care of it.

--
All rights reversed

2010-06-15 04:16:33

by Andrew Morton

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, 15 Jun 2010 13:20:34 +1000 Dave Chinner <[email protected]> wrote:

> On Mon, Jun 14, 2010 at 06:39:57PM -0700, Andrew Morton wrote:
> > On Tue, 15 Jun 2010 10:39:43 +1000 Dave Chinner <[email protected]> wrote:
> >
> > >
> > > IOWs, IMO anywhere there is a context with significant queue of IO,
> > > that's where we should be doing a better job of sorting before that
> > > IO is dispatched to the lower layers. This is still no guarantee of
> > > better IO (e.g. if the filesystem fragments the file) but it does
> > > give the lower layers a far better chance at optimal allocation and
> > > scheduling of IO...
> >
> > None of what you said had much to do with what I said.
> >
> > What you've described are implementation problems in the current block
> > layer because it conflates "sorting" with "queueing". I'm saying "fix
> > that".
>
> You can't sort until you've queued.

Yes you can. That's exactly what you're recommending! Only you're
recommending doing it at the wrong level. The fs-writeback radix-tree
walks do it at the wrong level too. Sorting should be done within, or
in a layer above the block queues, not within the large number of
individual callers.

> > And... sorting at the block layer will always be superior to sorting
> > at the pagecache layer because the block layer sorts at the physical
> > block level and can handle not-well-laid-out files and can sort and merge
> > pages from different address_spaces.
>
> Yes, it can do that. And it still does that even if the higher
> layers sort their I/O dispatch better.
>
> Filesystems try very hard to allocate adjacent logical offsets in a
> file in adjacent physical blocks on disk - that's the whole point of
> extent-indexed filesystems. Hence with modern filesystems there is
> generally a direct correlation between the page {mapping,index}
> tuple and the physical location of the mapped block.
>
> i.e. there is generally zero physical correlation between pages in
> different mappings, but there is a high physical correlation
> between the index of pages on the same mapping.

Nope. Large-number-of-small-files is a pretty common case. If the fs
doesn't handle that well (ie: by placing them nearby on disk), it's
borked.

> Hence by sorting
> where we have a {mapping,index} context, we push out IO that is
> much more likely to be in contiguous physical chunks than the
> current random page shootdown.
>
> We optimise applications to use these sorts of correlations all the
> time to improve IO patterns. Why can't we make the same sort of
> optimisations to the IO that the VM issues?

We can, but it shouldn't be specific to page reclaim. Other places
submit IO too, and want the same treatment.

> > Still, I suspect none of it will improve anything anyway. Those pages
> > are still dirty, possibly-locked and need to go to disk. It doesn't
> > matter from the MM POV whether they sit in some VM list or in the
> > request queue.
>
> Oh, but it does.....

The only difference is that pages which are in the queue (current
implementation thereof) can't be shot down by truncate.

> > Possibly there may be some benefit to not putting so many of these
> > unreclaimable pages into the queue all at the same time. But
> > that's a shortcoming in the block code: we should be able to shove
> > arbitrary numbers of dirty page (segments) into the queue and not gum
> > the system up. Don't try to work around that in the VM.
>
> I think you know perfectly well why the system gums up when we
> increase block layer queue depth: it's the fact that the _VM_ relies
> on block layer queue congestion to limit the amount of dirty memory
> in the system.

mm, a little bit still, I guess. Mainly because dirty memory
management isn't zone aware, so even though we limit dirty memory
globally, a particular zone(set) can get excessively dirtied.

Most of this problem happens on the balance_dirty_pages() path, where we
already sort the pages in ascending logical order.

> We've got a feedback loop between the block layer and the VM that
> only works if block device queues are kept shallow. Keeping the
> number of dirty pages under control is a VM responsibility, but it
> is putting limitations on the block layer to ensure that the VM
> works correctly. If you want the block layer to have deep queues,
> then someone needs to fix the VM not to require knowledge of the
> internal operation of the block layer for correct operation.
>
> Adding a few lines of code to sort a list in the VM is far, far
> easier than redesigning the write throttling code....

It's a hack and a workaround. And I suspect it won't make any
difference, especially given Mel's measurements of the number of dirty
pages he's seeing coming off the LRU. Although those numbers may well
be due to the new quite-low dirty memory thresholds.

It would be interesting to code up a little test patch though, see if
there's benefit to be had going down this path.

2010-06-15 04:39:00

by Andrew Morton

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, 15 Jun 2010 00:08:14 -0400 Rik van Riel <[email protected]> wrote:

> On 06/14/2010 09:45 PM, Andrew Morton wrote:
> > On Mon, 14 Jun 2010 21:16:29 -0400 Rik van Riel <[email protected]> wrote:
> >
> >> Would it be hard to add a "please flush this file"
> >> way to call the filesystem flushing threads?
> >
> > Passing the igrab()bed inode into the flusher threads would fix the
> > iput_final() problems, as long as the alloc_pages() caller never blocks
> > indefinitely waiting for the work which the flusher threads are doing.
> >
> > Otherwise we get (very hard-to-hit) deadlocks where the alloc_pages()
> > caller holds VFS locks and is waiting for the flusher threads while all
> > the flusher threads are stuck under iput_final() waiting for those VFS
> > locks.
> >
> > That's fixable by not using igrab()/iput(). You can use lock_page() to
> > pin the address_space. Pass the address of the locked page across to
> > the flusher threads so they don't try to lock it a second time, or just
> > use trylocking on that writeback path or whatever.
>
> Any thread that does not have __GFP_FS set in its gfp_mask
> cannot wait for the flusher to complete. This is regardless
> of the mechanism used to kick the flusher.

mm... kinda. A bare order-zero __GFP_WAIT allocation can still wait
forever, afaict.

> Then again, those threads cannot call ->writepage today
> either, so we should be fine keeping that behaviour.

I'm not sure. iput_final() can take a lot of locks, both VFS and
heaven knows what within the individual filesystems. Is it the case
that all allocations which occur under all of those locks are always
!__GFP_FS? Hard to say...

> Threads that do have __GFP_FS in their gfp_mask can wait
> for the flusher in various ways. Maybe the lock_page()
> method can be simplified by having the flusher thread
> unlock the page the moment it gets it, and then run the
> normal flusher code?

Well, _something_ has to pin the address_space. A single locked page
will do.

> The pageout code (in shrink_page_list) already unlocks
> the page anyway before putting it back on the relevant
> LRU list. It would be easy enough to skip that unlock
> and let the flusher thread take care of it.

Once that page is unlocked, we can't touch *mapping - its inode can be
concurrently reclaimed. Although I guess the technique in
handle_write_error() can be reused.

2010-06-15 05:12:53

by Nick Piggin

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Mon, Jun 14, 2010 at 09:37:32PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 00:08:14 -0400 Rik van Riel <[email protected]> wrote:
>
> > On 06/14/2010 09:45 PM, Andrew Morton wrote:
> > > On Mon, 14 Jun 2010 21:16:29 -0400 Rik van Riel <[email protected]> wrote:
> > >
> > >> Would it be hard to add a "please flush this file"
> > >> way to call the filesystem flushing threads?
> > >
> > > Passing the igrab()bed inode into the flusher threads would fix the
> > > iput_final() problems, as long as the alloc_pages() caller never blocks
> > > indefinitely waiting for the work which the flusher threads are doing.
> > >
> > > Otherwise we get (very hard-to-hit) deadlocks where the alloc_pages()
> > > caller holds VFS locks and is waiting for the flusher threads while all
> > > the flusher threads are stuck under iput_final() waiting for those VFS
> > > locks.
> > >
> > > That's fixable by not using igrab()/iput(). You can use lock_page() to
> > > pin the address_space. Pass the address of the locked page across to
> > > the flusher threads so they don't try to lock it a second time, or just
> > > use trylocking on that writeback path or whatever.
> >
> > Any thread that does not have __GFP_FS set in its gfp_mask
> > cannot wait for the flusher to complete. This is regardless
> > of the mechanism used to kick the flusher.
>
> mm... kinda. A bare order-zero __GFP_WAIT allocation can still wait
> forever, afaict.
>
> > Then again, those threads cannot call ->writepage today
> > either, so we should be fine keeping that behaviour.
>
> I'm not sure. iput_final() can take a lot of locks, both VFS and
> heaven knows what within the individual filesystems. Is it the case
> that all allocations which occur under all of those locks are always
> !__GFP_FS? Hard to say...

__GFP_FS is set with i_mutex held in places, and there is nothing to
prevent a filesystem from using that in iput_final paths, AFAIK.


> > Threads that do have __GFP_FS in their gfp_mask can wait
> > for the flusher in various ways. Maybe the lock_page()
> > method can be simplified by having the flusher thread
> > unlock the page the moment it gets it, and then run the
> > normal flusher code?
>
> Well, _something_ has to pin the address_space. A single locked page
> will do.
>
> > The pageout code (in shrink_page_list) already unlocks
> > the page anyway before putting it back on the relevant
> > LRU list. It would be easy enough to skip that unlock
> > and let the flusher thread take care of it.
>
> Once that page is unlocked, we can't touch *mapping - its inode can be
> concurrently reclaimed. Although I guess the technique in
> handle_write_error() can be reused.

Nasty. That guy needs to be using lock_page_nosync().

2010-06-15 05:43:55

by Nick Piggin

Subject: [patch] mm: vmscan fix mapping use after free

On Tue, Jun 15, 2010 at 03:12:42PM +1000, Nick Piggin wrote:
> > Once that page is unlocked, we can't touch *mapping - its inode can be
> > concurrently reclaimed. Although I guess the technique in
> > handle_write_error() can be reused.
>
> Nasty. That guy needs to be using lock_page_nosync().
--

Need lock_page_nosync here because we have no reference to the mapping when
taking the page lock.

Signed-off-by: Nick Piggin <[email protected]>

---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -296,7 +296,7 @@ static int may_write_to_queue(struct bac
static void handle_write_error(struct address_space *mapping,
struct page *page, int error)
{
- lock_page(page);
+ lock_page_nosync(page);
if (page_mapping(page) == mapping)
mapping_set_error(mapping, error);
unlock_page(page);

2010-06-15 06:37:45

by Dave Chinner

Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Mon, Jun 14, 2010 at 09:15:15PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 13:20:34 +1000 Dave Chinner <[email protected]> wrote:
>
> > On Mon, Jun 14, 2010 at 06:39:57PM -0700, Andrew Morton wrote:
> > > On Tue, 15 Jun 2010 10:39:43 +1000 Dave Chinner <[email protected]> wrote:
> > >
> > > >
> > > > IOWs, IMO anywhere there is a context with significant queue of IO,
> > > > that's where we should be doing a better job of sorting before that
> > > > IO is dispatched to the lower layers. This is still no guarantee of
> > > > better IO (e.g. if the filesystem fragments the file) but it does
> > > > give the lower layers a far better chance at optimal allocation and
> > > > scheduling of IO...
> > >
> > > None of what you said had much to do with what I said.
> > >
> > > What you've described are implementation problems in the current block
> > > layer because it conflates "sorting" with "queueing". I'm saying "fix
> > > that".
> >
> > You can't sort until you've queued.
>
> Yes you can. That's exactly what you're recommending!

Umm, I suggested sorting a queue of dirty pages that was built by
reclaim before dispatching them. How does that translate to
me recommending "sort before queuing"?

> Only you're
> recommending doing it at the wrong level.

If you feed a filesystem garbage IO, you'll get garbage performance
and there's nothing that a block layer sort queue can do to fix the
damage it does to both performance and filesystem fragmentation
levels. It's not just about IO issue - delayed allocation pretty
much requires writeback to be issuing well-formed IOs to reap the
benefits it can provide....

> > > And... sorting at the block layer will always be superior to sorting
> > > at the pagecache layer because the block layer sorts at the physical
> > > block level and can handle not-well-laid-out files and can sort and merge
> > > pages from different address_spaces.
> >
> > Yes it, can do that. And it still does that even if the higher
> > layers sort their I/O dispatch better,
> >
> > Filesystems try very hard to allocate adjacent logical offsets in a
> > file in adjacent physical blocks on disk - that's the whole point of
> > extent-indexed filesystems. Hence with modern filesystems there is
> > generally a direct correlation between the page {mapping,index}
> > tuple and the physical location of the mapped block.
> >
> > i.e. there is generally zero physical correlation between pages in
> > different mappings, but there is a high physical correlation
> > between the index of pages on the same mapping.
>
> Nope. Large-number-of-small-files is a pretty common case. If the fs
> doesn't handle that well (ie: by placing them nearby on disk), it's
> borked.

Filesystems already handle this case just fine as we see it from
writeback all the time. Untarring a kernel is a good example of
this...

I suggested sorting all the IO to be issued into per-mapping page
groups because:
a) makes IO issued from reclaim look almost exactly the same
to the filesystem as if writeback is pushing out the IO.
b) it looks to be a trivial addition to the new code.

To me that's a no-brainer.

> It would be interesting to code up a little test patch though, see if
> there's benefit to be had going down this path.

I doubt Mel's test cases will show anything - they simply didn't
show enough IO issued from reclaim to make any difference.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-06-15 10:14:15

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 06/12] vmscan: simplify shrink_inactive_list()

On Mon, Jun 14, 2010 at 02:06:13PM -0400, Rik van Riel wrote:
> On 06/14/2010 07:17 AM, Mel Gorman wrote:
>> From: KOSAKI Motohiro <[email protected]>
>>
>> Now, max_scan of shrink_inactive_list() is always passed a value less
>> than SWAP_CLUSTER_MAX, so we can remove the page scanning loop in it.
>> This patch also helps reduce stack usage.
>>
>> detail
>> - remove "while (nr_scanned < max_scan)" loop
>> - remove nr_freed (now, we use nr_reclaimed directly)
>> - remove nr_scan (now, we use nr_scanned directly)
>> - rename max_scan to nr_to_scan
>> - pass nr_to_scan into isolate_pages() directly instead
>> using SWAP_CLUSTER_MAX
>>
>> Signed-off-by: KOSAKI Motohiro <[email protected]>
>> Reviewed-by: Johannes Weiner <[email protected]>
>
> Other than the weird whitespace below,
>

Not sure where this came out of. It's not in my local patch file nor in
my working tree. Very odd.

> Reviewed-by: Rik van Riel <[email protected]>
>

Thanks

>> + /*
>> + * If we are direct reclaiming for contiguous pages and we do
>> + * not reclaim everything in the list, try again and wait
>> + * for IO to complete. This will stall high-order allocations
>> + * but that should be acceptable to the caller
>> + */
>> + if (nr_reclaimed< nr_taken&& !current_is_kswapd()&& sc->lumpy_reclaim_mode) {
>> + congestion_wait(BLK_RW_ASYNC, HZ/10);
>
> --
> All rights reversed
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 10:19:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Mon, Jun 14, 2010 at 05:13:37PM -0400, Rik van Riel wrote:
> On 06/14/2010 07:17 AM, Mel Gorman wrote:
>> Page reclaim cleans individual pages using a_ops->writepage() because from
>> the VM perspective, it is known that pages in a particular zone must be freed
>> soon, it considers the target page to be the oldest and it does not want
>> to wait while background flushers clean other pages. From a filesystem
>> perspective this is extremely inefficient as it generates a very seeky
>> IO pattern leading to the perverse situation where it can take longer to
>> clean all dirty pages than it would have otherwise.
>
> Reclaiming clean pages should be fast enough that this should
> make little, if any, difference.
>

Indeed, this was a bit weak. The original point of the patch was to write
contiguous pages belonging to the same inode when they were encountered
in the batch, which made a bit more sense, but it didn't work out at the
first pass.

>> This patch queues all dirty pages at once to maximise the chances that
>> the write requests get merged efficiently. It also makes the next patch
>> that avoids writeout from direct reclaim more straight-forward.
>
> However, this is a convincing argument :)
>

Thanks.

>> Signed-off-by: Mel Gorman <[email protected]>
>
> Reviewed-by: Rik van Riel <[email protected]>
>

Thanks again :)

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 10:46:10

by Evgeniy Polyakov

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, Jun 15, 2010 at 04:36:43PM +1000, Dave Chinner ([email protected]) wrote:
> > Nope. Large-number-of-small-files is a pretty common case. If the fs
> > doesn't handle that well (ie: by placing them nearby on disk), it's
> > borked.
>
> Filesystems already handle this case just fine as we see it from
> writeback all the time. Untarring a kernel is a good example of
> this...
>
> I suggested sorting all the IO to be issued into per-mapping page
> groups because:
> a) it makes IO issued from reclaim look almost exactly the same
> to the filesystem as if writeback is pushing out the IO.
> b) it looks to be a trivial addition to the new code.
>
> To me that's a no-brainer.

That doesn't cover the large-number-of-small-files pattern, since
untarring means creating something new, which the FS can optimize. A
much more interesting case is when we have dirtied a large number of
small files in kind-of random order and submitted them to disk.

Per-mapping sorting will not do anything good in this case, even if the
files were previously created in a good fashion, placed close together
and so on; only the block layer will find a correlation between
adjacent blocks in different files. But with the existing queue
management it has quite a small opportunity to do so, and that's what I
think Andrew is arguing about.

--
Evgeniy Polyakov

2010-06-15 10:47:09

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list()

On Mon, Jun 14, 2010 at 12:17:49PM +0100, Mel Gorman wrote:
> /*
> + * TODO: Try merging with migrations version of putback_lru_pages
> + */
> +static noinline_for_stack void putback_lru_pages(struct zone *zone,
> + struct zone_reclaim_stat *reclaim_stat,
> + unsigned long nr_anon, unsigned long nr_file,
> + struct list_head *page_list)
> +{

I hate to nitpick on this, but with noinline_for_stack a prototype
really is unreadable unless the function identifier goes to the next
line. Compare the one above with:

static noinline_for_stack void
putback_lru_pages(struct zone *zone, struct zone_reclaim_stat *reclaim_stat,
unsigned long nr_anon, unsigned long nr_file,
struct list_head *page_list)

> -static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> +static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> struct zone *zone, struct scan_control *sc,
> int priority, int file)

Same here, just even worse due to it spilling over 80 characters.

2010-06-15 10:53:44

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

> + /*
> + * XXX: This is the Holy Hand Grenade of PotentiallyInvalidMapping. As
> + * the page lock has been dropped by ->writepage, that mapping could
> + * be anything
> + */

Why is this an XXX comment?

> + /*
> + * Wait on writeback if requested to. This happens when
> + * direct reclaiming a large contiguous area and the
> + * first attempt to free a range of pages fails.
> + */
> + if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
> + wait_on_page_writeback(page);
> +
> + if (!PageWriteback(page)) {
> + /* synchronous write or broken a_ops? */
> + ClearPageReclaim(page);
> + }

how about:

if (PageWriteback(page) {
if (sync_writeback == PAGEOUT_IO_SYNC)
wait_on_page_writeback(page);
} else {
/* synchronous write or broken a_ops? */
ClearPageReclaim(page);
}

> if (!may_write_to_queue(mapping->backing_dev_info))
> return PAGE_KEEP;
>
> /*
> + * Clean a list of pages. It is expected that all the pages on page_list have been
> + * locked as part of isolation from the LRU.

A rather pointless line of 80 chars. I see the point for long string
literals, but here it's just a pain.

> + *
> + * XXX: Is there a problem with holding multiple page locks like this?

I think there is. There's quite a few places that do hold multiple
pages locked, but they always lock pages in increasing page->index order.
Given that this locks basically in random order it could cause problems
for those places.

2010-06-15 10:55:45

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, Jun 15, 2010 at 02:28:22PM +0400, Evgeniy Polyakov wrote:
> On Tue, Jun 15, 2010 at 04:36:43PM +1000, Dave Chinner ([email protected]) wrote:
> > > Nope. Large-number-of-small-files is a pretty common case. If the fs
> > > doesn't handle that well (ie: by placing them nearby on disk), it's
> > > borked.
> >
> > Filesystems already handle this case just fine as we see it from
> > writeback all the time. Untarring a kernel is a good example of
> > this...
> >
> > I suggested sorting all the IO to be issued into per-mapping page
> > groups because:
> > a) it makes IO issued from reclaim look almost exactly the same
> > to the filesystem as if writeback is pushing out the IO.
> > b) it looks to be a trivial addition to the new code.
> >
> > To me that's a no-brainer.
>
> That doesn't cover the large-number-of-small-files pattern, since
> untarring means creating something new, which the FS can optimize. A
> much more interesting case is when we have dirtied a large number of
> small files in kind-of random order and submitted them to disk.
>
> Per-mapping sorting will not do anything good in this case, even if the
> files were previously created in a good fashion, placed close together
> and so on; only the block layer will find a correlation between
> adjacent blocks in different files. But with the existing queue
> management it has quite a small opportunity to do so, and that's what I
> think Andrew is arguing about.

The solution is not to sort pages on their way to be submitted either,
really.

What I do in fsblock is to maintain a block-nr sorted tree of dirty
blocks. This works nicely because fsblock dirty state is properly
synchronized with page dirty state. So writeout can just walk this in
order and it provides a pretty optimal submission pattern for any
interleaving of data and metadata. No need for buffer boundary or
hacks like that. (needs some intelligence for delalloc, though).

But even with all that, it's not the complete story. It doesn't know
about direct IO, sync IO, or fsyncs, and it would be very hard and
ugly to try to synchronise and sort all that from the pagecache level.
It also is a heuristic in terms of optimal block scheduling behaviour.
With smarter devices and drivers there might be better ways to go.

So what is needed is to get as much info into the block layer as
possible. As Andrew says, there shouldn't be such a big difference
between pages being writeback or dirty in pagecache.

2010-06-15 10:57:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Mon, Jun 14, 2010 at 04:21:43PM -0700, Andrew Morton wrote:
> Yup.
>
> But then, this all really should be done at the block layer so other
> io-submitting-paths can benefit from it.
>
> IOW, maybe "the sort queue is the submission queue" wasn't a good idea.

Even if it has no effect on the actual I/O patterns it has a massive
benefit for the filesystem. When probing delalloc/unwritten space at
least XFS does try to convert a larger extent forward from the index,
but doesn't bother to go backwards. By providing the trivial sort here
we make life a lot easier for the filesystem.

In addition to that we do get better I/O patterns, especially with short
queues, as smart writepage implementations will also submit the next few
pages, which is essentially free given how the storage works.
This means we can already have a page cleaned before we would even have
submitted it without sorting.

2010-06-15 11:01:10

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Mon, Jun 14, 2010 at 09:16:29PM -0400, Rik van Riel wrote:
> >Besides, there really isn't the right context in the block layer to
> >be able to queue and prioritise large amounts of IO without
> >significant penalties to some higher layer operation.
>
> Can we kick flushing for the whole inode at once from
> vmscan.c?

kswapd really should be a last effort tool to clean filesystem pages.
If it does enough I/O for this to matter significantly we need to
fix the VM to move more work to the flusher threads instead of trying
to fix kswapd.

> Would it be hard to add a "please flush this file"
> way to call the filesystem flushing threads?

We already have that API, in Jens' latest tree that's
sync_inodes_sb/writeback_inodes_sb. We could also add a non-waiting
variant if required, but I think the big problem with kswapd is that
we want to wait on I/O completion under some circumstances.

2010-06-15 11:08:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, Jun 15, 2010 at 02:28:22PM +0400, Evgeniy Polyakov wrote:
> That doesn't cover the large-number-of-small-files pattern, since
> untarring means creating something new, which the FS can optimize. A
> much more interesting case is when we have dirtied a large number of
> small files in kind-of random order and submitted them to disk.

That's why we still have block layer sorting. But for the problem
of larger files doing the sorting above the filesystem is a lot
more efficient, not primarily due to the I/O patterns but because
it makes life for the filesystem writeback code and allocator a lot
simpler.

> Per-mapping sorting will not do anything good in this case, even if the
> files were previously created in a good fashion, placed close together
> and so on; only the block layer will find a correlation between
> adjacent blocks in different files. But with the existing queue
> management it has quite a small opportunity to do so, and that's what I
> think Andrew is arguing about.

Which is actually more or less true - if we do larger amounts of
writeback from kswapd we're toast anyway and performance and allocation
patterns go down the toilet. Then again throwing a list_sort in is
a rather trivial addition. Note that in addition to page->index we
can also sort by the inode number in the sort function. At least for
XFS and the traditional ufs derived allocators that will give you
additional locality for small files.

2010-06-15 11:10:32

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, Jun 15, 2010 at 08:55:38PM +1000, Nick Piggin wrote:
>
> What I do in fsblock is to maintain a block-nr sorted tree of dirty
> blocks. This works nicely because fsblock dirty state is properly
> synchronized with page dirty state. So writeout can just walk this in
> order and it provides pretty optimal submission pattern of any
> interleavings of data and metadata. No need for buffer boundary or
> hacks like that. (needs some intelligence for delalloc, though).

I think worrying about indirect blocks really doesn't matter much
these days. For one thing extent based filesystems have a lot less
of these, and second for a journaling filesystem we only need to log
modification to the indirect blocks and not actually write them back
in place during the sync. At least for XFS the actual writeback can
happen a lot later, as part of the ordered list of delwri buffers.

2010-06-15 11:11:40

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, Jun 15, 2010 at 06:53:41AM -0400, Christoph Hellwig wrote:
> > + /*
> > + * XXX: This is the Holy Hand Grenade of PotentiallyInvalidMapping. As
> > + * the page lock has been dropped by ->writepage, that mapping could
> > + * be anything
> > + */
>
> Why is this an XXX comment?
>

With the page lock released, the mapping may no longer be valid. Nick
posted a patch in relation to it that I need to look at. The comment was
there because Andrew highlighted that this was buggy and I wanted to make
sure I didn't forget about it.

> > + /*
> > + * Wait on writeback if requested to. This happens when
> > + * direct reclaiming a large contiguous area and the
> > + * first attempt to free a range of pages fails.
> > + */
> > + if (PageWriteback(page) && sync_writeback == PAGEOUT_IO_SYNC)
> > + wait_on_page_writeback(page);
> > +
> > + if (!PageWriteback(page)) {
> > + /* synchronous write or broken a_ops? */
> > + ClearPageReclaim(page);
> > + }
>
> how about:
>
> if (PageWriteback(page) {
> if (sync_writeback == PAGEOUT_IO_SYNC)
> wait_on_page_writeback(page);
> } else {
> /* synchronous write or broken a_ops? */
> ClearPageReclaim(page);
> }
>

Sure, that's tidier.

> > if (!may_write_to_queue(mapping->backing_dev_info))
> > return PAGE_KEEP;
> >
> > /*
> > + * Clean a list of pages. It is expected that all the pages on page_list have been
> > + * locked as part of isolation from the LRU.
>
> A rather pointless line of 80 chars. I see the point for long string
> literals, but here it's just a pain.
>

I'll trim it.

> > + *
> > + * XXX: Is there a problem with holding multiple page locks like this?
>
> I think there is. There's quite a few places that do hold multiple
> pages locked, but they always lock pages in increasing page->index order.
> Given that this locks basically in random order it could cause problems
> for those places.
>

Hmm, ok. In that case, I'll have to release the locks on the list and
reacquire them. It was something I would have preferred to avoid. Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 11:13:44

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, Jun 15, 2010 at 06:53:41AM -0400, Christoph Hellwig wrote:
> > + /*
> > + * XXX: This is the Holy Hand Grenade of PotentiallyInvalidMapping. As
> > + * the page lock has been dropped by ->writepage, that mapping could
> > + * be anything
> > + */
>
> Why is this an XXX comment?

It's just a pretty simple use-after-free. Maybe people forget it because
->writepage is an asynchronous API.


> > + *
> > + * XXX: Is there a problem with holding multiple page locks like this?
>
> I think there is. There's quite a few places that do hold multiple
> pages locked, but they always lock pages in increasing page->index order.
> Given that this locks basically in random order it could cause problems
> for those places.

There shouldn't be a problem _holding_ the locks, but there is a problem
waiting for multiple locks out of page->index order.

But there is a problem with holding the lock of a lot of pages while
calling ->writepage on them. So yeah, you can't do that.

Hmm, I should rediff that lockdep page_lock patch and get it merged.
(although I don't know if it can catch all these problems easily)

2010-06-15 11:20:24

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, Jun 15, 2010 at 07:10:26AM -0400, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 08:55:38PM +1000, Nick Piggin wrote:
> >
> > What I do in fsblock is to maintain a block-nr sorted tree of dirty
> > blocks. This works nicely because fsblock dirty state is properly
> > synchronized with page dirty state. So writeout can just walk this in
> > order and it provides pretty optimal submission pattern of any
> > interleavings of data and metadata. No need for buffer boundary or
> > hacks like that. (needs some intelligence for delalloc, though).
>
> I think worrying about indirect blocks really doesn't matter much
> these days. For one thing extent based filesystems have a lot less
> of these, and second for a journaling filesystem we only need to log
> modification to the indirect blocks and not actually write them back
> in place during the sync. At least for XFS the actual writeback can
> happen a lot later, as part of the ordered list of delwri buffers.

That's true; more importantly, I meant any interleaving of data from
more than one file too.

2010-06-15 11:44:05

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Mon, Jun 14, 2010 at 09:15:15PM -0700, Andrew Morton wrote:
> On Tue, 15 Jun 2010 13:20:34 +1000 Dave Chinner <[email protected]> wrote:
>
> > On Mon, Jun 14, 2010 at 06:39:57PM -0700, Andrew Morton wrote:
> > > On Tue, 15 Jun 2010 10:39:43 +1000 Dave Chinner <[email protected]> wrote:
> > >
> > > >
> > > > IOWs, IMO anywhere there is a context with significant queue of IO,
> > > > that's where we should be doing a better job of sorting before that
> > > > IO is dispatched to the lower layers. This is still no guarantee of
> > > > better IO (e.g. if the filesystem fragments the file) but it does
> > > > give the lower layers a far better chance at optimal allocation and
> > > > scheduling of IO...
> > >
> > > None of what you said had much to do with what I said.
> > >
> > > What you've described are implementation problems in the current block
> > > layer because it conflates "sorting" with "queueing". I'm saying "fix
> > > that".
> >
> > You can't sort until you've queued.
>
> Yes you can. That's exactly what you're recommending! Only you're
> recommending doing it at the wrong level. The fs-writeback radix-tree
> walks do it at the wrong level too. Sorting should be done within, or
> in a layer above the block queues, not within the large number of
> individual callers.
>
> > > And... sorting at the block layer will always be superior to sorting
> > > at the pagecache layer because the block layer sorts at the physical
> > > block level and can handle not-well-laid-out files and can sort and merge
> > > pages from different address_spaces.
> >
> > Yes, it can do that. And it still does that even if the higher
> > layers sort their I/O dispatch better,
> >
> > Filesystems try very hard to allocate adjacent logical offsets in a
> > file in adjacent physical blocks on disk - that's the whole point of
> > extent-indexed filesystems. Hence with modern filesystems there is
> > generally a direct correlation between the page {mapping,index}
> > tuple and the physical location of the mapped block.
> >
> > i.e. there is generally zero physical correlation between pages in
> > different mappings, but there is a high physical correlation
> > between the index of pages on the same mapping.
>
> Nope. Large-number-of-small-files is a pretty common case. If the fs
> doesn't handle that well (ie: by placing them nearby on disk), it's
> borked.
>
> > Hence by sorting
> > where we have a {mapping,index} context, we push out IO that is
> > much more likely to be in contiguous physical chunks that the
> > current random page shootdown.
> >
> > We optimise applications to use these sorts of correlations all the
> > time to improve IO patterns. Why can't we make the same sort of
> > optmisations to the IO that the VM issues?
>
> We can, but it shouldn't be specific to page reclaim. Other places
> submit IO too, and want the same treatment.
>
> > > Still, I suspect none of it will improve anything anyway. Those pages
> > > are still dirty, possibly-locked and need to go to disk. It doesn't
> > > matter from the MM POV whether they sit in some VM list or in the
> > > request queue.
> >
> > Oh, but it does.....
>
> The only difference is that pages which are in the queue (current
> implementation thereof) can't be shot down by truncate.
>
> > > Possibly there may be some benefit to not putting so many of these
> > > unreclaimable pages into the queue all at the the same time. But
> > > that's a shortcoming in the block code: we should be able to shove
> > > arbitrary numbers of dirty page (segments) into the queue and not gum
> > > the system up. Don't try to work around that in the VM.
> >
> > I think you know perfectly well why the system gums up when we
> > increase block layer queue depth: it's the fact that the _VM_ relies
> > on block layer queue congestion to limit the amount of dirty memory
> > in the system.
>
> mm, a little bit still, I guess. Mainly because dirty memory
> management isn't zone aware, so even though we limit dirty memory
> globally, a particular zone(set) can get excessively dirtied.
>
> Most of this problem happens on the balance_dirty_pages() path, where we
> already sort the pages in ascending logical order.
>
> > We've got a feedback loop between the block layer and the VM that
> > only works if block device queues are kept shallow. Keeping the
> > number of dirty pages under control is a VM responsibility, but it
> > is putting limitations on the block layer to ensure that the VM
> > works correctly. If you want the block layer to have deep queues,
> > then someone needs to fix the VM not to require knowledge of the
> > internal operation of the block layer for correct operation.
> >
> > Adding a few lines of code to sort a list in the VM is far, far
> > easier than redesigning the write throttling code....
>
> It's a hack and a workaround. And I suspect it won't make any
> difference, especially given Mel's measurements of the number of dirty
> pages he's seeing coming off the LRU. Although those numbers may well
> be due to the new quite-low dirty memory thresholds.
>

I tested with a dirty ratio of 40 but I didn't see a major problem.
It was still the case in the tests I ran that direct reclaim of dirty pages
was a relatively rare event except when lumpy reclaim was involved. What
did change is the amount of scanning that direct reclaim and kswapd had to do
(both increased quite a bit) but the percentage of dirty pages encountered was
roughly the same (1-2% of scanned pages were dirty in the case of sysbench).

This is sysbench only rather than flooding with more data.

DIRTY RATIO == 20
FTrace Reclaim Statistics
                                      traceonly-v2r5  stackreduce-v2r5  nodirect-v2r5
Direct reclaims                                 9843             13398          51651
Direct reclaim pages scanned                  871367           1008709        3080593
Direct reclaim write async I/O                 24883             30699              0
Direct reclaim write sync I/O                      0                 0              0
Wake kswapd requests                         7070819           6961672       11268341
Kswapd wakeups                                  1578              1500            943
Kswapd pages scanned                        22016558          21779455       17393431
Kswapd reclaim write async I/O               1161346           1101641        1717759
Kswapd reclaim write sync I/O                      0                 0              0
Time stalled direct reclaim (ms)               26.11             45.04           2.97
Time kswapd awake (ms)                       5105.06           5135.93        6086.32

User/Sys Time Running Test (seconds)          734.52            712.39          703.9
Percentage Time Spent Direct Reclaim           0.00%             0.00%          0.00%
Total Elapsed Time (seconds)                 9710.02           9589.20        9334.45
Percentage Time kswapd Awake                   0.06%             0.00%          0.00%

DIRTY RATIO == 40
FTrace Reclaim Statistics
                                      traceonly-v2r5  stackreduce-v2r5  nodirect-v2r5
Direct reclaims                                29945             41887         163006
Direct reclaim pages scanned                 2853804           3075288       13142072
Direct reclaim write async I/O                 51498             63662              0
Direct reclaim write sync I/O                      0                 0              0
Wake kswapd requests                        11899105          12466894       15645364
Kswapd wakeups                                   945               891            522
Kswapd pages scanned                        20401921          20674788       11319791
Kswapd reclaim write async I/O               1381897           1332436        1711266
Kswapd reclaim write sync I/O                      0                 0              0
Time stalled direct reclaim (ms)              131.78            165.08           5.47
Time kswapd awake (ms)                       6321.11           6413.79        6687.67

User/Sys Time Running Test (seconds)          709.91            718.39         664.28
Percentage Time Spent Direct Reclaim           0.00%             0.00%          0.00%
Total Elapsed Time (seconds)                 9579.90           9700.42        9101.05
Percentage Time kswapd Awake                   0.07%             0.00%          0.00%

I guess what was really interesting was that even though raising the
dirty ratio allowed the test to complete faster, the percentage of time
spent in direct reclaim increased quite a lot. Again, just stopping
writeback in direct reclaim seemed to help.

> It would be interesting to code up a little test patch though, see if
> there's benefit to be had going down this path.
>

I'll do this just to see what it looks like. To be frank, I lack taste when
it comes to how the block layer and filesystem should behave so I am having
trouble deciding whether sorting the pages prior to submission is a good
thing or whether it would just encourage bad or lax behaviour in the IO
submission queueing.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 11:45:31

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Mon, Jun 14, 2010 at 05:55:51PM -0400, Rik van Riel wrote:
> On 06/14/2010 07:17 AM, Mel Gorman wrote:
>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 4856a2a..574e816 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -372,6 +372,12 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
>> return PAGE_SUCCESS;
>> }
>>
>> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
>> +static inline bool reclaim_can_writeback(struct scan_control *sc)
>> +{
>> + return current_is_kswapd() || sc->mem_cgroup != NULL;
>> +}
>> +
>
> I'm not entirely convinced on this bit, but am willing to
> be convinced by the data.
>

Which bit?

You're not convinced that kswapd should be allowed to write back?
You're not convinced that memcg should be allowed to write back?
You're not convinced that direct reclaim writing back pages can overflow
the stack?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 11:45:44

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/12] Avoid overflowing of stack during page reclaim V2

On Mon, Jun 14, 2010 at 11:10:11AM -0400, Christoph Hellwig wrote:
> On Mon, Jun 14, 2010 at 12:17:41PM +0100, Mel Gorman wrote:
> > This is a merging of two series - the first of which reduces stack usage
> > in page reclaim and the second which writes contiguous pages during reclaim
> > and avoids writeback in direct reclaimers.
>
> This stuff looks good to me from the filesystem POV.
>
> You might want to throw in a follow on patch to remove the PF_MEMALLOC
> checks from the various ->writepage methods.
>

Will do, thanks.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 11:50:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/12] Avoid overflowing of stack during page reclaim V2

On Tue, Jun 15, 2010 at 09:08:33AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 14 Jun 2010 12:17:41 +0100
> Mel Gorman <[email protected]> wrote:
>
> > SysBench
> > ========
> >            traceonly-v2r5    stackreduce-v2r5       nodirect-v2r5
> > 1       11025.01 ( 0.00%)   10249.52 (-7.57%)   10430.57 (-5.70%)
> > 2        3844.63 ( 0.00%)    4988.95 (22.94%)    4038.95 ( 4.81%)
> > 3        3210.23 ( 0.00%)    2918.52 (-9.99%)    3113.38 (-3.11%)
> > 4        1958.91 ( 0.00%)    1987.69 ( 1.45%)    1808.37 (-8.32%)
> > 5        2864.92 ( 0.00%)    3126.13 ( 8.36%)   2355.70 (-21.62%)
> > 6        4831.63 ( 0.00%)   3815.67 (-26.63%)   4164.09 (-16.03%)
> > 7        3788.37 ( 0.00%)   3140.39 (-20.63%)    3471.36 (-9.13%)
> > 8        2293.61 ( 0.00%)   1636.87 (-40.12%)   1754.25 (-30.75%)
> > FTrace Reclaim Statistics
> >                                   traceonly-v2r5  stackreduce-v2r5  nodirect-v2r5
> > Direct reclaims                             9843             13398          51651
> > Direct reclaim pages scanned              871367           1008709        3080593
> > Direct reclaim write async I/O             24883             30699              0
> > Direct reclaim write sync I/O                  0                 0              0
>
> Hmm, page-scan and reclaims jumps up but...
>

It could be accounted for by the fact that the direct reclaimers are
stalled less in direct reclaim. They make more forward progress and need
more pages, so they end up scanning more as a result.

>
> > User/Sys Time Running Test (seconds) 734.52 712.39 703.9
> > Percentage Time Spent Direct Reclaim 0.00% 0.00% 0.00%
> > Total Elapsed Time (seconds) 9710.02 9589.20 9334.45
> > Percentage Time kswapd Awake 0.06% 0.00% 0.00%
> >
>
> Execution time is reduced. Does this show that removing "I/O noise" from
> direct reclaim makes the system happy? Or does writeback in direct
> reclaim cost us too much?
>

I think it's accounted for by just making more forward progress rather than
IO noise. The throughput results for sysbench are all over the place because
the disk was maxed so I'm shying away from drawing any conclusions on the
IO efficiency.

> It seems I'll have to consider avoiding direct reclaim in memcg later.
>
> BTW, I think we'll have to add a wait-for-pages-to-be-cleaned trick in
> direct reclaim if we want to avoid too much scanning, later.
>

This happens for lumpy reclaim. I didn't think it was justified for
normal reclaim based on the percentage of dirty pages encountered during
scanning. If the percentage of dirty pages scanned increases, we'll need
to first figure out why that happened and then decide whether stalling
when they are encountered is the correct thing to do.

>
> Thank you for interesting test.
>

You're welcome.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 13:08:18

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, Jun 15, 2010 at 12:43:42PM +0100, Mel Gorman wrote:
>
> I'll do this just to see what it looks like. To be frank, I lack
> taste when it comes to how the block layer and filesystem should
> behave so am having trouble deciding if sorting the pages prior to
> submission is a good thing or if it would just encourage bad or lax
> behaviour in the IO submission queueing.
>

I suspect the right answer is we need to sort both at the block layer
and either (a) before you pass things to the filesystem layer, or if
you don't do that (b) the filesystem will be forced to do its own
queuing/sorting at the very least for delayed allocation pages, so the
allocator can do something sane. And given that there are multiple
file systems that support delayed allocation, it would be nice if this
could be recognized by the writeback code, as opposed to having btrfs,
xfs, ext4, all having to implement something very similar at the fs
layer.

- Ted

2010-06-15 13:24:10

by Mel Gorman

[permalink] [raw]
Subject: Re: [patch] mm: vmscan fix mapping use after free

On Tue, Jun 15, 2010 at 03:43:44PM +1000, Nick Piggin wrote:
> On Tue, Jun 15, 2010 at 03:12:42PM +1000, Nick Piggin wrote:
> > > Once that page is unlocked, we can't touch *mapping - its inode can be
> > > concurrently reclaimed. Although I guess the technique in
> > > handle_write_error() can be reused.
> >
> > Nasty. That guy needs to be using lock_page_nosync().
> --
>
> Need lock_page_nosync here because we have no reference to the mapping when
> taking the page lock.
>
> Signed-off-by: Nick Piggin <[email protected]>
>

Thanks, I've picked this up and merged it into the series and removed
the "hand grenade" comment.

> ---
> mm/vmscan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux-2.6/mm/vmscan.c
> ===================================================================
> --- linux-2.6.orig/mm/vmscan.c
> +++ linux-2.6/mm/vmscan.c
> @@ -296,7 +296,7 @@ static int may_write_to_queue(struct bac
> static void handle_write_error(struct address_space *mapping,
> struct page *page, int error)
> {
> - lock_page(page);
> + lock_page_nosync(page);
> if (page_mapping(page) == mapping)
> mapping_set_error(mapping, error);
> unlock_page(page);
>

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 13:33:31

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On 06/15/2010 07:01 AM, Christoph Hellwig wrote:
> On Mon, Jun 14, 2010 at 09:16:29PM -0400, Rik van Riel wrote:
>>> Besides, there really isn't the right context in the block layer to
>>> be able to queue and prioritise large amounts of IO without
>>> significant penalties to some higher layer operation.
>>
>> Can we kick flushing for the whole inode at once from
>> vmscan.c?
>
> kswapd really should be a last effort tool to clean filesystem pages.
> If it does enough I/O for this to matter significantly we need to
> fix the VM to move more work to the flusher threads instead of trying
> to fix kswapd.
>
>> Would it be hard to add a "please flush this file"
>> way to call the filesystem flushing threads?
>
> We already have that API, in Jens' latest tree that's
> sync_inodes_sb/writeback_inodes_sb. We could also add a non-waiting
> variant if required, but I think the big problem with kswapd is that
> we want to wait on I/O completion under some circumstances.

However, kswapd does not need to wait on I/O completion of
any page in particular - it just wants to wait on I/O
completion of any inactive pages in the zone (or memcg)
where memory is being freed.

--
All rights reversed

2010-06-15 13:35:26

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On 06/15/2010 07:45 AM, Mel Gorman wrote:
> On Mon, Jun 14, 2010 at 05:55:51PM -0400, Rik van Riel wrote:
>> On 06/14/2010 07:17 AM, Mel Gorman wrote:
>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 4856a2a..574e816 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -372,6 +372,12 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
>>> return PAGE_SUCCESS;
>>> }
>>>
>>> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
>>> +static inline bool reclaim_can_writeback(struct scan_control *sc)
>>> +{
>>> + return current_is_kswapd() || sc->mem_cgroup != NULL;
>>> +}
>>> +
>>
>> I'm not entirely convinced on this bit, but am willing to
>> be convinced by the data.
>>
>
> Which bit?
>
> You're not convinced that kswapd should be allowed to write back?
> You're not convinced that memcg should be allowed to write back?
> You're not convinced that direct reclaim writing back pages can overflow
> the stack?

If direct reclaim can overflow the stack, so can direct
memcg reclaim. That means this patch does not solve the
stack overflow, while admitting that we do need the
ability to get specific pages flushed to disk from the
pageout code.

--
All rights reversed

2010-06-15 13:37:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
> If direct reclaim can overflow the stack, so can direct
> memcg reclaim. That means this patch does not solve the
> stack overflow, while admitting that we do need the
> ability to get specific pages flushed to disk from the
> pageout code.

Can you explain what the hell memcg reclaim is and why it needs
to reclaim from random contexts?
It seems everything that has a cg in its name that I stumbled over
lately is some ugly wart..

2010-06-15 13:54:27

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Tue, Jun 15, 2010 at 09:37:27AM -0400, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
> > If direct reclaim can overflow the stack, so can direct
> > memcg reclaim. That means this patch does not solve the
> > stack overflow, while admitting that we do need the
> > ability to get specific pages flushed to disk from the
> > pageout code.
>
> Can you explain what the hell memcg reclaim is and why it needs
> to reclaim from random contexts?

Kamezawa Hiroyuki has the full story here but here is a summary.

memcg is the Memory Controller cgroup
(Documentation/cgroups/memory.txt). It's intended for the control of the
amount of memory usable by a group of processes but its behaviour in
terms of reclaim differs from global reclaim. It has its own LRU lists
and kswapd operates on them. What is surprising is that direct reclaim
for a process in the control group also does not operate within the
cgroup.

Reclaim from a cgroup happens from the fault path. The new page is
"charged" to the cgroup. If it exceeds its allocated resources, some
pages within the group are reclaimed in a path that is similar to direct
reclaim except for its entry point.

So, memcg is not reclaiming from a random context, there is a limited
number of cases where a memcg is reclaiming and it is not expected to
overflow the stack.

> It seems everything that has a cg in it's name that I stumbled over
> lately seems to be some ugly wart..
>

The wart in this case is that the behaviour of page reclaim within a
memcg and globally differ a fair bit.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 13:59:48

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
> On 06/15/2010 07:45 AM, Mel Gorman wrote:
>> On Mon, Jun 14, 2010 at 05:55:51PM -0400, Rik van Riel wrote:
>>> On 06/14/2010 07:17 AM, Mel Gorman wrote:
>>>
>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>> index 4856a2a..574e816 100644
>>>> --- a/mm/vmscan.c
>>>> +++ b/mm/vmscan.c
>>>> @@ -372,6 +372,12 @@ int write_reclaim_page(struct page *page, struct address_space *mapping,
>>>> return PAGE_SUCCESS;
>>>> }
>>>>
>>>> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
>>>> +static inline bool reclaim_can_writeback(struct scan_control *sc)
>>>> +{
>>>> + return current_is_kswapd() || sc->mem_cgroup != NULL;
>>>> +}
>>>> +
>>>
>>> I'm not entirely convinced on this bit, but am willing to
>>> be convinced by the data.
>>>
>>
>> Which bit?
>>
>> You're not convinced that kswapd should be allowed to write back?
>> You're not convinced that memcg should be allowed to write back?
>> You're not convinced that direct reclaim writing back pages can overflow
>> the stack?
>
> If direct reclaim can overflow the stack, so can direct
> memcg reclaim. That means this patch does not solve the
> stack overflow, while admitting that we do need the
> ability to get specific pages flushed to disk from the
> pageout code.
>

What path is taken with memcg != NULL that could overflow the stack? I
couldn't spot one but mm/memcontrol.c is a bit tangled so finding all
its use cases is tricky. The critical path I had in mind though was
direct reclaim and for that path, memcg == NULL or did I miss something?

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 14:05:38

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On 06/15/2010 09:59 AM, Mel Gorman wrote:
> On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
>> On 06/15/2010 07:45 AM, Mel Gorman wrote:

>>>>>
>>>>> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
>>>>> +static inline bool reclaim_can_writeback(struct scan_control *sc)
>>>>> +{
>>>>> + return current_is_kswapd() || sc->mem_cgroup != NULL;
>>>>> +}

>> If direct reclaim can overflow the stack, so can direct
>> memcg reclaim. That means this patch does not solve the
>> stack overflow, while admitting that we do need the
>> ability to get specific pages flushed to disk from the
>> pageout code.
>>
>
> What path is taken with memcg != NULL that could overflow the stack? I
> couldn't spot one but mm/memcontrol.c is a bit tangled so finding all
> its use cases is tricky. The critical path I had in mind though was
> direct reclaim and for that path, memcg == NULL or did I miss something?

mem_cgroup_hierarchical_reclaim -> try_to_free_mem_cgroup_pages

--
All rights reversed

2010-06-15 14:16:21

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Tue, Jun 15, 2010 at 10:04:24AM -0400, Rik van Riel wrote:
> On 06/15/2010 09:59 AM, Mel Gorman wrote:
>> On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
>>> On 06/15/2010 07:45 AM, Mel Gorman wrote:
>
>>>>>>
>>>>>> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
>>>>>> +static inline bool reclaim_can_writeback(struct scan_control *sc)
>>>>>> +{
>>>>>> + return current_is_kswapd() || sc->mem_cgroup != NULL;
>>>>>> +}
>
>>> If direct reclaim can overflow the stack, so can direct
>>> memcg reclaim. That means this patch does not solve the
>>> stack overflow, while admitting that we do need the
>>> ability to get specific pages flushed to disk from the
>>> pageout code.
>>>
>>
>> What path is taken with memcg != NULL that could overflow the stack? I
>> couldn't spot one but mm/memcontrol.c is a bit tangled so finding all
>> its use cases is tricky. The critical path I had in mind though was
>> direct reclaim and for that path, memcg == NULL or did I miss something?
>
> mem_cgroup_hierarchical_reclaim -> try_to_free_mem_cgroup_pages
>

But in turn, where is mem_cgroup_hierarchical_reclaim called from direct
reclaim? It appears to be only called from the fault path or as a result
of the memcg changing size.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 14:38:29

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On 06/15/2010 09:37 AM, Christoph Hellwig wrote:
> On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
>> If direct reclaim can overflow the stack, so can direct
>> memcg reclaim. That means this patch does not solve the
>> stack overflow, while admitting that we do need the
>> ability to get specific pages flushed to disk from the
>> pageout code.
>
> Can you explain what the hell memcg reclaim is and why it needs
> to reclaim from random contexts?

The page fault code will call the cgroup accounting code.

When a cgroup goes over its memory limit, __mem_cgroup_try_charge
will call mem_cgroup_hierarchical_reclaim, which will then go
into the page reclaim code.

> It seems everything that has a cg in it's name that I stumbled over
> lately seems to be some ugly wart..

No argument there. It took me a few minutes to find the code
path above :)

--
All rights reversed

2010-06-15 15:44:30

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, Jun 15, 2010 at 12:43:42PM +0100, Mel Gorman wrote:
> <SNIP on whether sorting should be in page cache or block layer>
>
> > It would be interesting to code up a little test patch though, see if
> > there's benefit to be had going down this path.
> >
>
> I'll do this just to see what it looks like. To be frank, I lack taste when
> it comes to how the block layer and filesystem should behave so am having
> troube deciding if sorting the pages prior to submission is a good thing or
> if it would just encourage bad or lax behaviour in the IO submission queueing.
>

The patch to sort the list being cleaned by reclaim looks like this.
It's not actually tested.

vmscan: Sort pages being queued for IO before submitting to the filesystem

While page reclaim now submits dirty pages in batches, it doesn't change
the order in which the IO is issued - pages are still written in LRU order.
Given that the IO is issued over a short period of time, rather than across
a longer scan period, it is likely to be no faster because:

a) the IO will not be started any sooner, and
b) the IO scheduler still only has a small re-ordering
window and will choke just as much on random IO patterns.

This patch uses the list_sort() function to sort the list of pages by
mapping, and by page->index within each mapping. This results in all the
pages of each mapping being sent down at once in ascending offset order -
exactly how filesystems want IO to be submitted to them.

Credit mostly goes to Dave Chinner for this idea and the changelog text.

----

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 68b3d22..02ab246 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -32,6 +32,7 @@
#include <linux/topology.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
+#include <linux/list_sort.h>
#include <linux/notifier.h>
#include <linux/rwsem.h>
#include <linux/delay.h>
@@ -651,6 +652,34 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
__pagevec_free(&freed_pvec);
}

+/* Sort based on mapping then index */
+static int page_writeback_cmp(void *data, struct list_head *a, struct list_head *b)
+{
+ struct page *ap = list_entry(a, struct page, lru);
+ struct page *bp = list_entry(b, struct page, lru);
+
+ /*
+ * Page not locked but it's not critical, the mapping is just for sorting.
+ * If the mapping is no longer valid, it's of little consequence
+ */
+ if (ap->mapping != bp->mapping) {
+ if (ap->mapping < bp->mapping)
+ return -1;
+ return 1;
+ }
+
+ /*
+ * Then index. Compare directly instead of subtracting;
+ * pgoff_t is unsigned, so a subtraction-based comparison
+ * could never be negative
+ */
+ if (ap->index < bp->index)
+ return -1;
+ if (ap->index > bp->index)
+ return 1;
+ return 0;
+}
+
static noinline_for_stack void clean_page_list(struct list_head *page_list,
struct scan_control *sc)
{
@@ -660,6 +689,8 @@ static noinline_for_stack void clean_page_list(struct list_head *page_list,
if (!sc->may_writepage)
return;

+ list_sort(NULL, page_list, page_writeback_cmp);
+
/* Write the pages out to disk in ranges where possible */
while (!list_empty(page_list)) {
struct address_space *mapping;

2010-06-15 15:56:51

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list()

On Tue, Jun 15, 2010 at 06:47:01AM -0400, Christoph Hellwig wrote:
> On Mon, Jun 14, 2010 at 12:17:49PM +0100, Mel Gorman wrote:
> > /*
> > + * TODO: Try merging with migrations version of putback_lru_pages
> > + */
> > +static noinline_for_stack void putback_lru_pages(struct zone *zone,
> > + struct zone_reclaim_stat *reclaim_stat,
> > + unsigned long nr_anon, unsigned long nr_file,
> > + struct list_head *page_list)
> > +{
>
> I hate to nitpick on this, but with noinline_for_stack a prototype
> really is unreadable unless the function identifier goes to the next
> line. Compare the one above with:
>
> static noinline_for_stack void
> putback_lru_pages(struct zone *zone, struct zone_reclaim_stat *reclaim_stat,
> unsigned long nr_anon, unsigned long nr_file,
> struct list_head *page_list)
>
> > -static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> > +static noinline_for_stack unsigned long shrink_inactive_list(unsigned long nr_to_scan,
> > struct zone *zone, struct scan_control *sc,
> > int priority, int file)
>
> Same here, just even worse due to the spill over ove 80 characters.
>

It's a fair nitpick and so is cleaned up.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-15 23:20:58

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Tue, Jun 15, 2010 at 08:55:38PM +1000, Nick Piggin wrote:
> On Tue, Jun 15, 2010 at 02:28:22PM +0400, Evgeniy Polyakov wrote:
> > On Tue, Jun 15, 2010 at 04:36:43PM +1000, Dave Chinner ([email protected]) wrote:
> > Per-mapping sorting will not do anything good in this case, even if
> > > files were previously created in a good fashion, being placed closely and
> > so on, and only block layer will find a correlation between adjacent
> > blocks in different files. But with existing queue management it has
> > quite a small opportunity, and that's what I think Andrew is arguing
> > about.
>
> The solution is not to sort pages on their way to be submitted either,
> really.
>
> What I do in fsblock is to maintain a block-nr sorted tree of dirty
> blocks. This works nicely because fsblock dirty state is properly
> synchronized with page dirty state.

How does this work with delayed allocation where there is no block
number associated with the page until writeback calls the allocation
routine?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-06-16 00:22:28

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Tue, 15 Jun 2010 15:16:01 +0100
Mel Gorman <[email protected]> wrote:

> On Tue, Jun 15, 2010 at 10:04:24AM -0400, Rik van Riel wrote:
> > On 06/15/2010 09:59 AM, Mel Gorman wrote:
> >> On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
> >>> On 06/15/2010 07:45 AM, Mel Gorman wrote:
> >
> >>>>>>
> >>>>>> +/* kswapd and memcg can writeback as they are unlikely to overflow stack */
> >>>>>> +static inline bool reclaim_can_writeback(struct scan_control *sc)
> >>>>>> +{
> >>>>>> + return current_is_kswapd() || sc->mem_cgroup != NULL;
> >>>>>> +}
> >
> >>> If direct reclaim can overflow the stack, so can direct
> >>> memcg reclaim. That means this patch does not solve the
> >>> stack overflow, while admitting that we do need the
> >>> ability to get specific pages flushed to disk from the
> >>> pageout code.
> >>>
> >>
> >> What path is taken with memcg != NULL that could overflow the stack? I
> >> couldn't spot one but mm/memcontrol.c is a bit tangled so finding all
> >> its use cases is tricky. The critical path I had in mind though was
> >> direct reclaim and for that path, memcg == NULL or did I miss something?
> >
> > mem_cgroup_hierarchical_reclaim -> try_to_free_mem_cgroup_pages
> >
>
> But in turn, where is mem_cgroup_hierarchical_reclaim called from direct
> reclaim? It appears to be only called from the fault path or as a result
> of the memcg changing size.
>
yes. It's only called from
- page fault
- add_to_page_cache()

I think we'll see no stack problem. Now, memcg doesn't wake up kswapd for
reclaiming memory, so it needs direct writeback.

Thanks,
-Kame

2010-06-16 00:32:32

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On 06/15/2010 08:17 PM, KAMEZAWA Hiroyuki wrote:
> On Tue, 15 Jun 2010 15:16:01 +0100
> Mel Gorman<[email protected]> wrote:

>> But in turn, where is mem_cgroup_hierarchical_reclaim called from direct
>> reclaim? It appears to be only called from the fault path or as a result
>> of the memcg changing size.
>>
> yes. It's only called from
> - page fault
> - add_to_page_cache()
>
> I think we'll see no stack problem. Now, memcg doesn't wakeup kswapd for
> reclaiming memory, it needs direct writeback.

Of course, a memcg page fault could still be triggered
from copy_to_user or copy_from_user, with a fairly
arbitrary stack frame above...

--
All rights reversed

2010-06-16 00:35:30

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Tue, 15 Jun 2010 14:54:08 +0100
Mel Gorman <[email protected]> wrote:

> On Tue, Jun 15, 2010 at 09:37:27AM -0400, Christoph Hellwig wrote:
> > On Tue, Jun 15, 2010 at 09:34:18AM -0400, Rik van Riel wrote:
> > > If direct reclaim can overflow the stack, so can direct
> > > memcg reclaim. That means this patch does not solve the
> > > stack overflow, while admitting that we do need the
> > > ability to get specific pages flushed to disk from the
> > > pageout code.
> >
> > Can you explain what the hell memcg reclaim is and why it needs
> > to reclaim from random contexts?
>
> Kamezawa Hiroyuki has the full story here but here is a summary.
>
Thank you.

> memcg is the Memory Controller cgroup
> (Documentation/cgroups/memory.txt). It's intended for the control of the
> amount of memory usable by a group of processes but its behaviour in
> terms of reclaim differs from global reclaim. It has its own LRU lists
> and kswapd operates on them.

No, we don't use kswapd. But we have some hooks in kswapd for implementing
the soft-limit. The soft-limit gives kswapd a hint, "please reclaim memory
from this memcg", when global memory is exhausted and kswapd runs.

What a memcg uses when it hits its limit is just direct reclaim.
(*) Justifying kswapd consuming a cpu just because a memcg hits its limit
is difficult for me, so I haven't used kswapd until now.
When direct reclaim is used, the cost of reclaim is charged against
the cpu cgroup which the thread belongs to.


> What is surprising is that direct reclaim
> for a process in the control group also does not operate within the
> cgroup.
Sorry, I can't understand ....

>
> Reclaim from a cgroup happens from the fault path. The new page is
> "charged" to the cgroup. If it exceeds its allocated resources, some
> pages within the group are reclaimed in a path that is similar to direct
> reclaim except for its entry point.
>
yes.

> So, memcg is not reclaiming from a random context, there is a limited
> number of cases where a memcg is reclaiming and it is not expected to
> overflow the stack.
>

I think so. In particular, we'll never see the 1k stack usage of select().

> > It seems everything that has a cg in it's name that I stumbled over
> > lately seems to be some ugly wart..
> >
>
> The wart in this case is that the behaviour of page reclaim within a
> memcg and globally differ a fair bit.
>

Sorry, but there has been a very long story to reach the current
implementation. Don't worry though, if memcg is not activated (not
mounted), it doesn't affect the behaviour of processes ;)

But Hmm..

>[kamezawa@bluextal mmotm-2.6.35-0611]$ wc -l mm/memcontrol.c
>4705 mm/memcontrol.c

may need some diet :(


Thanks,
-Kame

2010-06-16 00:44:29

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Tue, 15 Jun 2010 20:29:49 -0400
Rik van Riel <[email protected]> wrote:

> On 06/15/2010 08:17 PM, KAMEZAWA Hiroyuki wrote:
> > On Tue, 15 Jun 2010 15:16:01 +0100
> > Mel Gorman<[email protected]> wrote:
>
> >> But in turn, where is mem_cgroup_hierarchical_reclaim called from direct
> >> reclaim? It appears to be only called from the fault path or as a result
> >> of the memcg changing size.
> >>
> > yes. It's only called from
> > - page fault
> > - add_to_page_cache()
> >
> > I think we'll see no stack problem. Now, memcg doesn't wakeup kswapd for
> > reclaiming memory, it needs direct writeback.
>
> Of course, a memcg page fault could still be triggered
> from copy_to_user or copy_from_user, with a fairly
> arbitrary stack frame above...
>

Hmm. But I don't expect copy_from/to_user to be called on a very deep stack.

Should I prepare a thread for reclaiming memcg pages?
Because kswapd's cpu time can't be limited by the CFS cgroup, waking up
kswapd just because "a memcg hit its limit" isn't fun.

Hmm, or do you recommend no dirty-page writeback when a memcg hits its
limit? Maybe we'll see much more swapping.

I want to go with this for a while; changing memcg's behaviour will take
some amount of time, as there are only a few developers.

Thanks,
-Kame

2010-06-16 01:39:54

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On 06/15/2010 08:39 PM, KAMEZAWA Hiroyuki wrote:

> Hmm, or do you recommend no-dirty-page-writeback when a memcg hits limit ?
> Maybe we'll see much swaps.
>
> I want to go with this for a while, changing memcg's behavior will took
> some amounts of time, there are only a few developpers.

One thing we can do, for kswapd, memcg and direct reclaim alike,
is to tell the flusher threads to flush pages related to a pageout
candidate page to disk.

That way the reclaiming processes can wait on some disk IO to
finish, while the flusher thread takes care of the actual flushing.

That should also fix the "kswapd filesystem IO has really poor IO
patterns" issue.

There's no reason not to fix this issue the right way.

--
All rights reversed

2010-06-16 01:45:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Tue, 15 Jun 2010 20:53:43 -0400
Rik van Riel <[email protected]> wrote:

> On 06/15/2010 08:39 PM, KAMEZAWA Hiroyuki wrote:
>
> > Hmm, or do you recommend no-dirty-page-writeback when a memcg hits limit ?
> > Maybe we'll see much swaps.
> >
> > I want to go with this for a while, changing memcg's behavior will took
> > some amounts of time, there are only a few developpers.
>
> One thing we can do, for kswapd, memcg and direct reclaim alike,
> is to tell the flusher threads to flush pages related to a pageout
> candidate page to disk.
>
> That way the reclaiming processes can wait on some disk IO to
> finish, while the flusher thread takes care of the actual flushing.
>
> That should also fix the "kswapd filesystem IO has really poor IO
> patterns" issue.
>
> There's no reason not to fix this issue the right way.
>
Yes, but this patch just stops writeback. I think it's sane to ask
not to change the behaviour until there are some useful changes in the
flusher threads.

IMO, until the flusher threads can work with the I/O cgroup, memcg shouldn't
depend on them, because writeback allows stealing resources without it.

Thanks,
-Kame

2010-06-16 02:24:54

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Wed, 16 Jun 2010 10:40:36 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 15 Jun 2010 20:53:43 -0400
> Rik van Riel <[email protected]> wrote:
>
> > On 06/15/2010 08:39 PM, KAMEZAWA Hiroyuki wrote:
> >
> > > Hmm, or do you recommend no-dirty-page-writeback when a memcg hits limit ?
> > > Maybe we'll see much swaps.
> > >
> > > I want to go with this for a while, changing memcg's behavior will took
> > > some amounts of time, there are only a few developpers.
> >
> > One thing we can do, for kswapd, memcg and direct reclaim alike,
> > is to tell the flusher threads to flush pages related to a pageout
> > candidate page to disk.
> >
> > That way the reclaiming processes can wait on some disk IO to
> > finish, while the flusher thread takes care of the actual flushing.
> >
> > That should also fix the "kswapd filesystem IO has really poor IO
> > patterns" issue.
> >
> > There's no reason not to fix this issue the right way.
> >
> yes. but this patch just stops writeback. I think it's sane to ask
> not to change behavior until there are some useful changes in flusher
> threads.
>
> IMO, until flusher threads can work with I/O cgroup, memcg shoudln't
> depend on it because writeback allows stealing resource without it.
>

BTW, copy_from_user/copy_to_user is a _real_ problem; I'm afraid the
following consumes much more stack than memcg:

handle_mm_fault()
-> handle_pte_fault()
-> do_wp_page()
-> balance_dirty_pages_ratelimited()
-> balance_dirty_pages()
-> writeback_inodes_wbc()
-> writeback_inodes_wb()
-> writeback_sb_inodes()
-> writeback_single_inode()
-> do_writepages()
-> generic_writepages()
-> write_cache_pages() // use on-stack pagevec.
-> writepage()

Maybe this is much more stack-consuming than memcg's writeback after the
vmscan.c diet.

Bye.
-Kame

2010-06-16 05:07:06

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Wed, Jun 16, 2010 at 09:17:55AM +0900, KAMEZAWA Hiroyuki wrote:
> yes. It's only called from
> - page fault
> - add_to_page_cache()
>
> I think we'll see no stack problem. Now, memcg doesn't wakeup kswapd for
> reclaiming memory, it needs direct writeback.

The page fault code should be fine, but add_to_page_cache can be called
with quite deep stacks. Two examples are grab_cache_page_write_begin
which already was part of one of the stack overflows mentioned in this
thread, or find_or_create_page which can be called via
_xfs_buf_lookup_pages, which can be called from under the whole XFS
allocator, or via grow_dev_page which might have a similarly deep
stack for users of the normal buffer cache. Although for the
find_or_create_page we usually should not have __GFP_FS set in the
gfp_mask.

2010-06-16 05:08:07

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Wed, Jun 16, 2010 at 09:39:58AM +0900, KAMEZAWA Hiroyuki wrote:
> Hmm. But I don't expect copy_from/to_user is called in very deep stack.

Actually it is. The poll code mentioned earlier in this thread is just
one nasty example. I'm pretty sure there are tons of others in ioctl
code, as various ioctl implementations have been found to be massive
stack hogs in the past, even worse for out of tree drivers.

2010-06-16 05:11:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Wed, Jun 16, 2010 at 11:20:24AM +0900, KAMEZAWA Hiroyuki wrote:
> BTW, copy_from_user/copy_to_user is _real_ problem, I'm afraid following
> much more than memcg.
>
> handle_mm_fault()
> -> handle_pte_fault()
> -> do_wp_page()
> -> balance_dirty_page_rate_limited()
> -> balance_dirty_pages()
> -> writeback_inodes_wbc()
> -> writeback_inodes_wb()
> -> writeback_sb_inodes()
> -> writeback_single_inode()
> -> do_writepages()
> -> generic_write_pages()
> -> write_cache_pages() // use on-stack pagevec.
> -> writepage()

Yes, this is a massive issue. Strangely enough I just wondered about
this callstack as balance_dirty_pages is the only place calling into the
per-bdi/sb writeback code directly instead of offloading it to the
flusher threads. It's something that should be fixed rather quickly
IMHO. write_cache_pages and other bits of this writeback code can use
quite large amounts of stack.

2010-06-16 06:04:42

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 11/12] vmscan: Write out dirty pages in batch

On Wed, Jun 16, 2010 at 09:20:09AM +1000, Dave Chinner wrote:
> On Tue, Jun 15, 2010 at 08:55:38PM +1000, Nick Piggin wrote:
> > On Tue, Jun 15, 2010 at 02:28:22PM +0400, Evgeniy Polyakov wrote:
> > > On Tue, Jun 15, 2010 at 04:36:43PM +1000, Dave Chinner ([email protected]) wrote:
> > > Per-mapping sorting will not do anything good in this case, even if
> > > files were previously created in a good facion being placed closely and
> > > so on, and only block layer will find a correlation between adjacent
> > > blocks in different files. But with existing queue management it has
> > > quite a small opportunity, and that's what I think Andrew is arguing
> > > about.
> >
> > The solution is not to sort pages on their way to be submitted either,
> > really.
> >
> > What I do in fsblock is to maintain a block-nr sorted tree of dirty
> > blocks. This works nicely because fsblock dirty state is properly
> > synchronized with page dirty state.
>
> How does this work with delayed allocation where there is no block
> number associated with the page until writeback calls the allocation
> routine?

It doesn't. I have been thinking about how best to make that work.
The mm/writeback code is not in a good position to know what to do, so
the fs would have to help.

So either an fs callback, or the fs would have to insert the blocks
(or some marker) into the tree itself. It's relatively easy to do for
a single file (just walk the radix-tree and do delalloc conversions),
but between multiple files is harder (current code has the same problem
though).

2010-06-16 10:51:41

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On 2010-06-16 07:11, Christoph Hellwig wrote:
> On Wed, Jun 16, 2010 at 11:20:24AM +0900, KAMEZAWA Hiroyuki wrote:
>> BTW, copy_from_user/copy_to_user is _real_ problem, I'm afraid following
>> much more than memcg.
>>
>> handle_mm_fault()
>> -> handle_pte_fault()
>> -> do_wp_page()
>> -> balance_dirty_page_rate_limited()
>> -> balance_dirty_pages()
>> -> writeback_inodes_wbc()
>> -> writeback_inodes_wb()
>> -> writeback_sb_inodes()
>> -> writeback_single_inode()
>> -> do_writepages()
>> -> generic_write_pages()
>> -> write_cache_pages() // use on-stack pagevec.
>> -> writepage()
>
> Yes, this is a massive issue. Strangely enough I just wondered about
> this callstack as balance_dirty_pages is the only place calling into the
> per-bdi/sb writeback code directly instead of offloading it to the
> flusher threads. It's something that should be fixed rather quickly
> IMHO. write_cache_pages and other bits of this writeback code can use
> quite large amounts of stack.

I've had the same thought as well: balance_dirty_pages() should just
signal a writeback instead. Much cleaner than doing the cleaning from
that point.

--
Jens Axboe

2010-06-16 23:38:18

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 05/12] vmscan: kill prev_priority completely

On Mon, 14 Jun 2010 12:17:46 +0100
Mel Gorman <[email protected]> wrote:

> From: KOSAKI Motohiro <[email protected]>
>
> Since 2.6.28 zone->prev_priority is unused. Then it can be removed
> safely. It reduce stack usage slightly.
>
> Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
> can be integrate again, it's useful. but four (or more) times trying
> haven't got good performance number. Thus I give up such approach.

This would have been badder in earlier days when we were using the
scanning priority to decide when to start unmapping pte-mapped pages -
page reclaim would have been recirculating large blobs of mapped pages
around the LRU until the priority had built to the level where we
started to unmap them.

However that priority-based decision got removed and right now I don't
recall what it got replaced with. Aren't we now unmapping pages way
too early and suffering an increased major&minor fault rate? Worried.


Things which are still broken after we broke prev_priority:

- If page reclaim is having a lot of trouble, prev_priority would
have permitted do_try_to_free_pages() to call disable_swap_token()
earlier on. As things presently stand, we'll do a lot of
thrash-detection stuff before (presumably correctly) dropping the
swap token.

So. What's up with that? I don't even remember _why_ we disable
the swap token once the scanning priority gets severe and the code
comments there are risible. And why do we wait until priority==0
rather than priority==1?

- Busted prev_priority means that lumpy reclaim will act oddly.
Every time someone goes in to do some reclaim, they'll start out not
doing lumpy reclaim. Then, after a while, they'll get a clue and
will start doing the lumpy thing. Then they return from reclaim and
the next reclaim caller will again forget that he should have done
lumpy reclaim.

I dunno what the effects of this are in the real world, but it
seems dumb.

And one has to wonder: if we're making these incorrect decisions based
upon a bogus view of the current scanning difficulty, why are these
various priority-based thresholding heuristics even in there? Are they
doing anything useful?

So.. either we have a load of useless-crap-and-cruft in there which
should be lopped out, or we don't have a load of useless-crap-and-cruft
in there, and we should fix prev_priority.

> Signed-off-by: KOSAKI Motohiro <[email protected]>
> Reviewed-by: Johannes Weiner <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/mmzone.h | 15 ------------
> mm/page_alloc.c | 2 -
> mm/vmscan.c | 57 ------------------------------------------------
> mm/vmstat.c | 2 -

The patch forgot to remove mem_cgroup_get_reclaim_priority() and friends.

2010-06-16 23:44:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list()

On Mon, 14 Jun 2010 12:17:49 +0100
Mel Gorman <[email protected]> wrote:

> shrink_inactive_list() sets up a pagevec to release unfreeable pages. It
> uses significant amounts of stack doing this. This patch splits
> shrink_inactive_list() to take the stack usage out of the main path so
> that callers to writepage() do not contain an unused pagevec on the
> stack.

You can get the entire pagevec off the stack - just make it a
static-to-shrink_inactive_list() pagevec-per-cpu.

Locking just requires pinning to a CPU. We could trivially co-opt
shrink_inactive_list()'s spin_lock_irq() for that, but
pagevec_release() can be relatively expensive so it'd be sad to move
that inside spin_lock_irq(). It'd be better to slap a
get_cpu()/put_cpu() around the whole thing.
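
A non-runnable sketch of that idea (all names invented for illustration;
the locking caveat is exactly the one discussed above):

```c
/* Sketch only: replace the on-stack pagevec with a per-cpu one.
 * get_cpu_var() pins us to the CPU, which is the only locking the
 * static pagevec needs, provided nothing in interrupt context
 * touches the same pagevec. */
static DEFINE_PER_CPU(struct pagevec, shrink_pvec);

static void release_pages_pinned(struct list_head *page_list)
{
	struct pagevec *pvec = &get_cpu_var(shrink_pvec);
	struct page *page, *next;

	list_for_each_entry_safe(page, next, page_list, lru) {
		list_del(&page->lru);
		if (!pagevec_add(pvec, page))
			__pagevec_release(pvec);
	}
	pagevec_release(pvec);
	put_cpu_var(shrink_pvec);
}
```

Note pagevec_release() here still runs with the CPU pinned, which is the
expensive-work-inside-the-critical-region trade-off mentioned above.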

2010-06-16 23:46:51

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 05/12] vmscan: kill prev_priority completely

On 06/16/2010 07:37 PM, Andrew Morton wrote:

> This would have been badder in earlier days when we were using the
> scanning priority to decide when to start unmapping pte-mapped pages -
> page reclaim would have been recirculating large blobs of mapped pages
> around the LRU until the priority had built to the level where we
> started to unmap them.
>
> However that priority-based decision got removed and right now I don't
> recall what it got replaced with. Aren't we now unmapping pages way
> too early and suffering an increased major&minor fault rate? Worried.

We keep a different set of statistics to decide whether to
reclaim only page cache pages, or both page cache and
anonymous pages. The function get_scan_ratio parses those
statistics.

> So. What's up with that? I don't even remember _why_ we disable
> the swap token once the scanning priority gets severe and the code
> comments there are risible. And why do we wait until priority==0
> rather than priority==1?

The reason is that we never page out the pages belonging to the
process owning the swap token (with the exception of that process
evicting its own pages).

If that process has a really large RSS in the current zone, and
we are having problems freeing pages, it may be beneficial to
also evict pages from that process.

Now that the LRU lists are split out into file backed and swap
backed, it may be a lot easier to find pages to evict. That
may mean we could notice we're getting into trouble at much
higher priority levels and disable the swap token at a higher
priority level.

I do not believe prev_priority will be very useful here, since
we'd like to start out with small scans whenever possible.

> - Busted prev_priority means that lumpy reclaim will act oddly.
> Every time someone goes into do some recalim, they'll start out not
> doing lumpy reclaim. Then, after a while, they'll get a clue and
> will start doing the lumpy thing. Then they return from reclaim and
> the next recalim caller will again forget that he should have done
> lumpy reclaim.

How common are lumpy reclaims, anyway?

Isn't it more likely that in-between every two higher-order
reclaims, a number of order zero reclaims will be happening?

In that case, the prev_priority logic may have introduced the
kind of behavioural bug you describe above...

> And one has to wonder: if we're making these incorrect decisions based
> upon a bogus view of the current scanning difficulty, why are these
> various priority-based thresholding heuristics even in there? Are they
> doing anything useful?

The prev_priority code was useful when we had filesystem and
swap backed pages mixed on the same LRU list.

I am not convinced it still has any use.

--
All rights reversed

2010-06-16 23:49:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 09/12] vmscan: Setup pagevec as late as possible in shrink_page_list()

On Mon, 14 Jun 2010 12:17:50 +0100
Mel Gorman <[email protected]> wrote:

> shrink_page_list() sets up a pagevec to release pages as according as they
> are free. It uses significant amounts of stack on the pagevec. This
> patch adds pages to be freed via pagevec to a linked list which is then
> freed en-masse at the end. This avoids using stack in the main path that
> potentially calls writepage().
>

hm, spose so. I can't see any trivial way to eliminate the local
pagevec there.

> + if (pagevec_count(&freed_pvec))
> + __pagevec_free(&freed_pvec);
> ...
> - if (pagevec_count(&freed_pvec))
> - __pagevec_free(&freed_pvec);

That's an open-coded pagevec_free().
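
For reference, the helper being pointed at looks like this in
include/linux/pagevec.h of this era (reproduced from memory, so treat
it as a sketch):

```c
static inline void pagevec_free(struct pagevec *pvec)
{
	if (pagevec_count(pvec))
		__pagevec_free(pvec);
}
```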

2010-06-17 00:25:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 05/12] vmscan: kill prev_priority completely

On Wed, 16 Jun 2010 19:45:29 -0400
Rik van Riel <[email protected]> wrote:

> On 06/16/2010 07:37 PM, Andrew Morton wrote:
>
> > This would have been badder in earlier days when we were using the
> > scanning priority to decide when to start unmapping pte-mapped pages -
> > page reclaim would have been recirculating large blobs of mapped pages
> > around the LRU until the priority had built to the level where we
> > started to unmap them.
> >
> > However that priority-based decision got removed and right now I don't
> > recall what it got replaced with. Aren't we now unmapping pages way
> > too early and suffering an increased major&minor fault rate? Worried.
>
> We keep a different set of statistics to decide whether to
> reclaim only page cache pages, or both page cache and
> anonymous pages. The function get_scan_ratio parses those
> statistics.

I wasn't talking about anon-vs-file. I was referring to mapped-file
versus not-mapped file. If the code sees a mapped page come off the
tail of the LRU it'll just unmap and reclaim the thing. This policy
caused awful amounts of paging activity when someone started doing lots
of read() activity, which is why the VM was changed to value mapped
pagecache higher than unmapped pagecache. Did this biasing get
retained and if so, how?

> > So. What's up with that? I don't even remember _why_ we disable
> > the swap token once the scanning priority gets severe and the code
> > comments there are risible. And why do we wait until priority==0
> > rather than priority==1?
>
> The reason is that we never page out the pages belonging to the
> process owning the swap token (with the exception of that process
> evicting its own pages).
>
> If that process has a really large RSS in the current zone, and
> we are having problems freeing pages, it may be beneficial to
> also evict pages from that process.
>
> Now that the LRU lists are split out into file backed and swap
> backed, it may be a lot easier to find pages to evict. That
> may mean we could notice we're getting into trouble at much
> higher priority levels and disable the swap token at a higher
> priority level.

hm, lots of "may"s there.

Does thrash-avoidance actually still work?

> I do not believe prev_priority will be very useful here, since
> we'd like to start out with small scans whenever possible.

Why?

> > - Busted prev_priority means that lumpy reclaim will act oddly.
> > Every time someone goes into do some recalim, they'll start out not
> > doing lumpy reclaim. Then, after a while, they'll get a clue and
> > will start doing the lumpy thing. Then they return from reclaim and
> > the next recalim caller will again forget that he should have done
> > lumpy reclaim.
>
> How common are lumpy reclaims, anyway?

A lot more than they should be, I suspect, given the recent trend
towards asking for higher-order allocations. Kernel developers should
be prohibited from using more than 512M of RAM.

> Isn't it more likely that in-between every two higher-order
> reclaims, a number of order zero reclaims will be happening?

Sounds likely, yes. Need prev_priority[DEF_PRIORITY], sigh.

> In that case, the prev_priority logic may have introduced the
> kind of behavioural bug you describe above...
>
> > And one has to wonder: if we're making these incorrect decisions based
> > upon a bogus view of the current scanning difficulty, why are these
> > various priority-based thresholding heuristics even in there? Are they
> > doing anything useful?
>
> The prev_priority code was useful when we had filesystem and
> swap backed pages mixed on the same LRU list.

No, stop saying swap! ;)

It's all to do with mapped pagecache versus unmapped pagecache. "ytf
does my browser get paged out all the time".


2010-06-17 00:30:13

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Wed, 16 Jun 2010 01:06:40 -0400
Christoph Hellwig <[email protected]> wrote:

> On Wed, Jun 16, 2010 at 09:17:55AM +0900, KAMEZAWA Hiroyuki wrote:
> > yes. It's only called from
> > - page fault
> > - add_to_page_cache()
> >
> > I think we'll see no stack problem. Now, memcg doesn't wakeup kswapd for
> > reclaiming memory, it needs direct writeback.
>
> The page fault code should be fine, but add_to_page_cache can be called
> with quite deep stacks. Two examples are grab_cache_page_write_begin
> which already was part of one of the stack overflows mentioned in this
> thread, or find_or_create_page which can be called via
> _xfs_buf_lookup_pages, which can be called from under the whole XFS
> allocator, or via grow_dev_page which might have a similarly deep
> stack for users of the normal buffer cache. Although for the
> find_or_create_page we usually should not have __GFP_FS set in the
> gfp_mask.
>

Hmm. ok, then, memory cgroup needs some care.

BTW, why does xbf_buf_create() use GFP_KERNEL even if it can block?
The memory cgroup just limits pages for users; it doesn't intend to
limit kernel pages. If this buffer is not for users (visible page cache)
but for an internal structure, I'll have to add code for ignoring the
memory cgroup check when gfp_mask doesn't have __GFP_MOVABLE.


Thanks,
-Kame

2010-06-17 00:35:06

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 05/12] vmscan: kill prev_priority completely

On 06/16/2010 08:18 PM, Andrew Morton wrote:
> On Wed, 16 Jun 2010 19:45:29 -0400
> Rik van Riel<[email protected]> wrote:
>
>> On 06/16/2010 07:37 PM, Andrew Morton wrote:
>>
>>> This would have been badder in earlier days when we were using the
>>> scanning priority to decide when to start unmapping pte-mapped pages -
>>> page reclaim would have been recirculating large blobs of mapped pages
>>> around the LRU until the priority had built to the level where we
>>> started to unmap them.
>>>
>>> However that priority-based decision got removed and right now I don't
>>> recall what it got replaced with. Aren't we now unmapping pages way
>>> too early and suffering an increased major&minor fault rate? Worried.
>>
>> We keep a different set of statistics to decide whether to
>> reclaim only page cache pages, or both page cache and
>> anonymous pages. The function get_scan_ratio parses those
>> statistics.
>
> I wasn't talking about anon-vs-file. I was referring to mapped-file
> versus not-mapped file. If the code sees a mapped page come off the
> tail of the LRU it'll just unmap and reclaim the thing. This policy
> caused awful amounts of paging activity when someone started doing lots
> of read() activity, which is why the VM was changed to value mapped
> pagecache higher than unmapped pagecache. Did this biasing get
> retained and if so, how?

It changed a little, but we still have it:

1) we do not deactivate active file pages if the active file
list is smaller than the inactive file list - this protects
the working set from streaming IO

2) we keep mapped referenced executable pages on the active file
list if they got accessed while on the active list, while
other file pages get deactivated unconditionally

> Does thrash-avoidance actually still work?

I suspect it does, but I have not actually tested that code
in years :)

>> I do not believe prev_priority will be very useful here, since
>> we'd like to start out with small scans whenever possible.
>
> Why?

For one, memory sizes today are a lot larger than they were
when 2.6.0 came out.

Secondly, we now know more exactly what is on each LRU list.
That should greatly reduce unnecessary turnover of the list.

For example, if we know there is no swap space available, we
will not bother scanning the anon LRU lists.

If we know there is not enough file cache left to get us up
to the zone high water mark, we will not bother scanning the
few remaining file pages.

Because of those simple checks (in get_scan_ratio), I do
not expect that we will have to scan through all of memory
as frequently as we had to do in 2.6.0.

Furthermore, we unconditionally deactivate most active pages
and have a working used-once scheme for pages on the anon
lists. This should also contribute to a reduction in the
number of pages that get scanned.
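
As an illustration of those shortcuts, the no-swap check in
get_scan_ratio() looks roughly like this (from memory; details may
differ from any particular tree):

```c
	/* If we have no swap space, do not bother scanning anon pages. */
	if (!sc->may_swap || (nr_swap_pages <= 0)) {
		percent[0] = 0;		/* anon */
		percent[1] = 100;	/* file */
		return;
	}
```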

>> In that case, the prev_priority logic may have introduced the
>> kind of behavioural bug you describe above...
>>
>>> And one has to wonder: if we're making these incorrect decisions based
>>> upon a bogus view of the current scanning difficulty, why are these
>>> various priority-based thresholding heuristics even in there? Are they
>>> doing anything useful?
>>
>> The prev_priority code was useful when we had filesystem and
>> swap backed pages mixed on the same LRU list.
>
> No, stop saying swap! ;)
>
> It's all to do with mapped pagecache versus unmapped pagecache. "ytf
> does my browser get paged out all the time".

We have other measures in place now to protect the working set
on the file LRU lists (see above). We are able to have those
measures in the kernel because we no longer have mixed LRU
lists.

--
All rights reversed

2010-06-17 06:17:13

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Thu, Jun 17, 2010 at 09:25:38AM +0900, KAMEZAWA Hiroyuki wrote:
>
> BTW, why xbf_buf_create() use GFP_KERNEL even if it can be blocked ?
> memory cgroup just limits pages for users, then, doesn't intend to
> limit kernel pages.

You mean xfs_buf_allocate? It doesn't in the end. It goes through the
xfs_kmem helper, which clears __GFP_FS if we're currently inside a
filesystem transaction (PF_FSTRANS is set), or if a caller specifically
requested it to be disabled even without that by passing the
XBF_DONT_BLOCK flag.
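
The conversion Christoph describes looks roughly like this in the XFS
kmem wrapper (a from-memory sketch, not an exact quote):

```c
	lflags = GFP_KERNEL | __GFP_NOWARN;
	/* Inside a transaction, or when the caller asked via
	 * XBF_DONT_BLOCK/KM_NOFS, recursion into the fs is not safe. */
	if ((current->flags & PF_FSTRANS) || (flags & KM_NOFS))
		lflags &= ~__GFP_FS;
```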

2010-06-17 06:28:04

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [PATCH 12/12] vmscan: Do not writeback pages in direct reclaim

On Thu, 17 Jun 2010 02:16:47 -0400
Christoph Hellwig <[email protected]> wrote:

> On Thu, Jun 17, 2010 at 09:25:38AM +0900, KAMEZAWA Hiroyuki wrote:
> >
> > BTW, why xbf_buf_create() use GFP_KERNEL even if it can be blocked ?
> > memory cgroup just limits pages for users, then, doesn't intend to
> > limit kernel pages.
>
> You mean xfs_buf_allocate? It doesn't in the end. It goes through the
> xfs_kmem helper which clear __GFP_FS if we're currently inside a
> filesystem transaction (PF_FSTRANS is set) or a caller specificly
> requested it to be disabled even without that by passig the
> XBF_DONT_BLOCK flag.
>
Ah, sorry. My question was wrong.

If xfs_buf_allocate() is not for pages on LRU but for kernel memory,
memory cgroup has no reason to charge against it because we can't reclaim
memory which is not on LRU.

Then, I wonder whether I have to add the following check

	if (!(gfp_mask & __GFP_RECLAIMABLE)) {
		/* ignore this. we just charge against reclaimable memory on LRU. */
		return 0;
	}

to mem_cgroup_charge_cache(), which is a hook for accounting page cache.


Thanks,
-Kame

2010-06-17 10:30:33

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 08/12] vmscan: Setup pagevec as late as possible in shrink_inactive_list()

On Wed, Jun 16, 2010 at 04:43:09PM -0700, Andrew Morton wrote:
> On Mon, 14 Jun 2010 12:17:49 +0100
> Mel Gorman <[email protected]> wrote:
>
> > shrink_inactive_list() sets up a pagevec to release unfreeable pages. It
> > uses significant amounts of stack doing this. This patch splits
> > shrink_inactive_list() to take the stack usage out of the main path so
> > that callers to writepage() do not contain an unused pagevec on the
> > stack.
>
> You can get the entire pagevec off the stack - just make it a
> static-to-shrink_inactive_list() pagevec-per-cpu.
>

That idea has been floated as well. I didn't pursue it because Dave
said that giving page reclaim a stack diet was never going to be the
full solution so I didn't think the complexity was justified.

I kept some of the stack reduction stuff because a) it was there and b)
it would give kswapd extra headroom when calling writepage.

> Locking just requires pinning to a CPU. We could trivially co-opt
> shrink_inactive_list()'s spin_lock_irq() for that, but
> pagevec_release() can be relatively expensive so it'd be sad to move
> that inside spin_lock_irq(). It'd be better to slap a
> get_cpu()/put_cpu() around the whole thing.
>

It'd be something interesting to try out when nothing else was happening but
I'm not going to focus on it for the moment unless I think it will really
help this stack overflow problem.

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-17 10:46:43

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 09/12] vmscan: Setup pagevec as late as possible in shrink_page_list()

On Wed, Jun 16, 2010 at 04:48:01PM -0700, Andrew Morton wrote:
> On Mon, 14 Jun 2010 12:17:50 +0100
> Mel Gorman <[email protected]> wrote:
>
> > shrink_page_list() sets up a pagevec to release pages as according as they
> > are free. It uses significant amounts of stack on the pagevec. This
> > patch adds pages to be freed via pagevec to a linked list which is then
> > freed en-masse at the end. This avoids using stack in the main path that
> > potentially calls writepage().
> >
>
> hm, spose so. I cen't see any trivial way to eliminate the local
> pagevec there.
>
> > + if (pagevec_count(&freed_pvec))
> > + __pagevec_free(&freed_pvec);
> > ...
> > - if (pagevec_count(&freed_pvec))
> > - __pagevec_free(&freed_pvec);
>
> That's an open-coded pagevec_free().
>

Fair point, will correct. Thanks

--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab

2010-06-25 08:29:49

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH 05/12] vmscan: kill prev_priority completely


sorry for the long delay.
(and I'm a bit wondering why I was not CCed on this thread ;)

> On Mon, 14 Jun 2010 12:17:46 +0100
> Mel Gorman <[email protected]> wrote:
>
> > From: KOSAKI Motohiro <[email protected]>
> >
> > Since 2.6.28 zone->prev_priority is unused. Then it can be removed
> > safely. It reduce stack usage slightly.
> >
> > Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
> > can be integrate again, it's useful. but four (or more) times trying
> > haven't got good performance number. Thus I give up such approach.
>
> This would have been badder in earlier days when we were using the
> scanning priority to decide when to start unmapping pte-mapped pages -
> page reclaim would have been recirculating large blobs of mapped pages
> around the LRU until the priority had built to the level where we
> started to unmap them.
>
> However that priority-based decision got removed and right now I don't
> recall what it got replaced with. Aren't we now unmapping pages way
> too early and suffering an increased major&minor fault rate? Worried.
>
>
> Things which are still broken after we broke prev_priority:
>
> - If page reclaim is having a lot of trouble, prev_priority would
> have permitted do_try_to_free_pages() to call disable_swap_token()
> earlier on. As things presently stand, we'll do a lot of
> thrash-detection stuff before (presumably correctly) dropping the
> swap token.
>
> So. What's up with that? I don't even remember _why_ we disable
> the swap token once the scanning priority gets severe and the code
> comments there are risible. And why do we wait until priority==0
> rather than priority==1?
>
> - Busted prev_priority means that lumpy reclaim will act oddly.
> Every time someone goes into do some recalim, they'll start out not
> doing lumpy reclaim. Then, after a while, they'll get a clue and
> will start doing the lumpy thing. Then they return from reclaim and
> the next recalim caller will again forget that he should have done
> lumpy reclaim.
>
> I dunno what the effects of this are in the real world, but it
> seems dumb.
>
> And one has to wonder: if we're making these incorrect decisions based
> upon a bogus view of the current scanning difficulty, why are these
> various priority-based thresholding heuristics even in there? Are they
> doing anything useful?
>
> So.. either we have a load of useless-crap-and-cruft in there which
> should be lopped out, or we don't have a load of useless-crap-and-cruft
> in there, and we should fix prev_priority.

May I explain my experience? I'd like to explain why prev_priority wouldn't
work nowadays.

First of all, yes, current vmscan still has a lot of UP-centric code. It
exposes some weaknesses on machines with some dozens of CPUs. I think we need
more and more improvement.

The problem is, current vmscan mixes up per-system pressure, per-zone pressure
and per-task pressure a bit. For example, prev_priority tries to boost a
task's priority to that of other concurrent reclaimers. But if the other task
has a mempolicy restriction, that's unnecessary, and it also causes wrongly
big latency and excessive reclaim. Per-task priority plus the prev_priority
adjustment emulates per-system pressure, but that has two issues: 1) it is
too rough and brutal an emulation, and 2) we need per-zone pressure, not
per-system.

Another example: currently DEF_PRIORITY is 12. That means the LRU rotates
about 2 cycles (1/4096 + 1/2048 + 1/1024 + ... + 1) before invoking the
OOM killer. But if 10,000 threads enter DEF_PRIORITY reclaim at the same
time, the system has higher memory pressure than priority==0
(1/4096 * 10,000 > 2). prev_priority can't solve such a multithreaded
workload issue.

In other words, the prev_priority concept assumes the system doesn't have
lots of threads.

And I don't think the lumpy reclaim threshold is a big matter, because it
was introduced to address an aim7 corner-case issue. I don't think such a
situation will occur frequently in real workloads, thus end users can't
observe such logic.

For the mapped-vs-unmapped thing, I don't know the exact reason. That was
introduced by Rik; unfortunately I had not joined that activity at
design time. I can only say that in my testing the current code
works well.

That said, my conclusion is the opposite. For the long-term view, we should
consider killing the reclaim priority completely. Instead, we should
consider introducing per-zone pressure statistics.



> > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > Reviewed-by: Johannes Weiner <[email protected]>
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > include/linux/mmzone.h | 15 ------------
> > mm/page_alloc.c | 2 -
> > mm/vmscan.c | 57 ------------------------------------------------
> > mm/vmstat.c | 2 -
>
> The patch forgot to remove mem_cgroup_get_reclaim_priority() and friends.

Sure. thanks.
Will fix.


btw, current zone reclaim has wrong swap token usage.

	static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
	{
	(snip)
		disable_swap_token();
		cond_resched();

I can't understand the reason why zone reclaim _always_ disables the swap
token. That means, if the system has zone reclaim enabled, the swap token
doesn't work at all.

Perhaps the original author's intention was the following, I guess.

	priority = ZONE_RECLAIM_PRIORITY;
	do {
		if ((zone_reclaim_mode & RECLAIM_SWAP) && !priority) // here
			disable_swap_token(); // here

		note_zone_scanning_priority(zone, priority);
		shrink_zone(priority, zone, &sc);
		priority--;
	} while (priority >= 0 && sc.nr_reclaimed < nr_pages);

However, if my understanding is correct, we can remove this
disable_swap_token() completely, because a zone reclaim failure doesn't
lead to the OOM killer; it merely causes a normal try_to_free_pages().



2010-06-28 10:35:39

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 05/12] vmscan: kill prev_priority completely

On Fri, Jun 25, 2010 at 05:29:41PM +0900, KOSAKI Motohiro wrote:
>
> sorry for the long delay.
> (and I'm a bit wonder why I was not CCed this thread ;)
>

My fault, I unintentionally deleted your name from the send script.
Sorry about that.

> > On Mon, 14 Jun 2010 12:17:46 +0100
> > Mel Gorman <[email protected]> wrote:
> >
> > > From: KOSAKI Motohiro <[email protected]>
> > >
> > > Since 2.6.28 zone->prev_priority is unused. Then it can be removed
> > > safely. It reduce stack usage slightly.
> > >
> > > Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
> > > can be integrate again, it's useful. but four (or more) times trying
> > > haven't got good performance number. Thus I give up such approach.
> >
> > This would have been badder in earlier days when we were using the
> > scanning priority to decide when to start unmapping pte-mapped pages -
> > page reclaim would have been recirculating large blobs of mapped pages
> > around the LRU until the priority had built to the level where we
> > started to unmap them.
> >
> > However that priority-based decision got removed and right now I don't
> > recall what it got replaced with. Aren't we now unmapping pages way
> > too early and suffering an increased major&minor fault rate? Worried.
> >
> >
> > Things which are still broken after we broke prev_priority:
> >
> > - If page reclaim is having a lot of trouble, prev_priority would
> > have permitted do_try_to_free_pages() to call disable_swap_token()
> > earlier on. As things presently stand, we'll do a lot of
> > thrash-detection stuff before (presumably correctly) dropping the
> > swap token.
> >
> > So. What's up with that? I don't even remember _why_ we disable
> > the swap token once the scanning priority gets severe and the code
> > comments there are risible. And why do we wait until priority==0
> > rather than priority==1?
> >
> > - Busted prev_priority means that lumpy reclaim will act oddly.
> > Every time someone goes in to do some reclaim, they'll start out not
> > doing lumpy reclaim. Then, after a while, they'll get a clue and
> > will start doing the lumpy thing. Then they return from reclaim and
> > the next reclaim caller will again forget that he should have done
> > lumpy reclaim.
> >

The intention of the code was that for orders < PAGE_ALLOC_COSTLY_ORDER,
there was an expectation that those pages would be free or nearly free
without page reclaim taking the special steps involved in lumpy reclaim. If
this has changed, it's almost certainly because of a greater dependence on
high-order pages than previously, which should be resisted (it has cropped
up a few times recently). I do have a script that uses ftrace to count the
call sites making high-order allocations and how often they occur, which
would be of use if this problem is being investigated.

> > I dunno what the effects of this are in the real world, but it
> > seems dumb.
> >
> > And one has to wonder: if we're making these incorrect decisions based
> > upon a bogus view of the current scanning difficulty, why are these
> > various priority-based thresholding heuristics even in there? Are they
> > doing anything useful?
> >
> > So.. either we have a load of useless-crap-and-cruft in there which
> > should be lopped out, or we don't have a load of useless-crap-and-cruft
> > in there, and we should fix prev_priority.
>
> May I explain my experience? I'd like to explain why prev_priority
> doesn't work nowadays.
>
> First of all, yes, current vmscan still has a lot of UP-centric code. It
> exposes some weaknesses on machines with some dozens of CPUs. I think we
> need more and more improvement.
>
> The problem is that current vmscan mixes up per-system pressure,
> per-zone pressure and per-task pressure a bit. For example, prev_priority
> tries to boost the priority of other concurrent reclaimers, but if
> another task has a mempolicy restriction, that is unnecessary and also
> causes wrongly large latency and excessive reclaim. Per-task priority
> plus the prev_priority adjustment emulates per-system pressure, but that
> has two issues: 1) the emulation is too rough and brutal, and 2) we need
> per-zone pressure, not per-system.
>
> Another example: currently DEF_PRIORITY is 12. That means the LRU
> rotates about 2 cycles (1/4096 + 1/2048 + 1/1024 + ... + 1) before
> invoking the OOM-killer. But if 10,000 threads enter DEF_PRIORITY
> reclaim at the same time, the system is under higher memory pressure
> than priority==0 implies (1/4096 * 10,000 > 2). prev_priority can't
> solve such multithreaded workload issues.
>
> In other words, the prev_priority concept assumes the system doesn't
> have lots of threads.
>
> And I don't think the lumpy reclaim threshold is a big matter, because
> it was introduced to address an aim7 corner-case issue. I don't think
> such situations occur frequently in real workloads, thus end users
> can't observe that logic.
>

I'm not aware of current stalls or problems related to lumpy reclaim, but
it's not something I have specifically investigated. If there is a known
example workload that is felt to trigger lumpy reclaim more than it
should, point me in the general direction and I'll take a look at it with
ftrace and see what falls out.

> For the mapped-vs-unmapped thing, I don't know the exact reason. That
> was introduced by Rik; unfortunately I had not joined the activity at
> design time. I can only say that in my testing the current code works
> well.
>
> That said, my conclusion is the opposite. For the long term, we should
> consider killing the reclaim priority completely. Instead, we should
> consider introducing per-zone pressure statistics.

Ah, the "what is pressure?" rat-hole :)

>
> > > Signed-off-by: KOSAKI Motohiro <[email protected]>
> > > Reviewed-by: Johannes Weiner <[email protected]>
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > ---
> > > include/linux/mmzone.h | 15 ------------
> > > mm/page_alloc.c | 2 -
> > > mm/vmscan.c | 57 ------------------------------------------------
> > > mm/vmstat.c | 2 -
> >
> > The patch forgot to remove mem_cgroup_get_reclaim_priority() and friends.
>
> Sure. thanks.
> Will fix.
>

I've fixed this up in the current patchset that is V3.

>
> btw, current zone reclaim have wrong swap token usage.
>
> static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
> {
> (snip)
> disable_swap_token();
> cond_resched();
>
>
> I can't understand the reason why zone reclaim _always_ disables the
> swap token. That means that if zone reclaim is enabled on the system,
> the swap token doesn't work at all.
>
> Perhaps the original author's intention was the following, I guess:
>
> priority = ZONE_RECLAIM_PRIORITY;
> do {
> if ((zone_reclaim_mode & RECLAIM_SWAP) && !priority) // here
> disable_swap_token(); // here
>
> note_zone_scanning_priority(zone, priority);
> shrink_zone(priority, zone, &sc);
> priority--;
> } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
>
>
> However, if my understanding is correct, we can remove this
> disable_swap_token() completely, because a zone reclaim failure doesn't
> lead to the OOM-killer; it merely causes a normal try_to_free_pages().
>

--
Mel Gorman
Part-time PhD Student, University of Limerick
Linux Technology Center, IBM Dublin Software Lab