This continues to build on the previous feedback and further testing, and
I'm hoping it can be finalised relatively soon. False sharing is still
a major problem, but I think it deserves its own series. At a minimum, I
think the fact that we are now scanning shared pages without much additional
system overhead is a big step in the right direction.
Changelog since V4
o Added code that avoids overloading preferred nodes
o Swap tasks if nodes are overloaded and the swap does not impair locality
Changelog since V3
o Correct detection of unset last nid/pid information
o Dropped nr_preferred_running and replaced it with Peter's load balancing
o Pass in correct node information for THP hinting faults
o Pressure tasks sharing a THP page to move towards same node
o Do not set pmd_numa if false sharing is detected
Changelog since V2
o Reshuffle to match Peter's implied preference for layout
o Reshuffle to move private/shared split towards end of series to make it
easier to evaluate the impact
o Use PID information to identify private accesses
o Set the floor for PTE scanning based on virtual address space scan rates
instead of time
o Some locking improvements
o Do not preempt pinned tasks unless they are kernel threads
Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
  task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
preferred node
o Laughably basic accounting of a compute overloaded node when selecting
the preferred node.
o Applied review comments
This series integrates basic scheduler support for automatic NUMA balancing.
It borrows very heavily from Peter Zijlstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).
This is still far from complete and there are known performance gaps
between this series and manual binding (when that is possible). As before,
the intention is not to complete the work but to incrementally improve
mainline and preserve bisectability for any bug reports that crop up. In
some cases performance may unfortunately be worse, and when that happens
it will have to be judged whether the system overhead is lower and, if so,
whether it is still an acceptable direction as a stepping stone to something better.
Patch 1 adds sysctl documentation
Patch 2 tracks NUMA hinting faults per-task and per-node
Patch 3 corrects a THP NUMA hint fault accounting bug
Patch 4 avoids trying to migrate the THP zero page
Patches 5-7 select a preferred node at the end of a PTE scan based on which
	node incurred the highest number of NUMA faults. When the balancer
	is comparing two CPUs it will prefer to locate tasks on their
	preferred node.
Patch 8 reschedules a task when a preferred node is selected if it is not
running on that node already. This avoids waiting for the scheduler
to move the task slowly.
Patch 9 adds infrastructure to allow separate tracking of shared/private
pages but treats all faults as if they are private accesses. Laying
it out this way reduces churn later in the series when private
fault detection is introduced
Patch 10 replaces the PTE scanning reset hammer and instead increases the
	scanning rate when an otherwise settled task changes its
	preferred node.
Patch 11 avoids some unnecessary allocation
Patch 12 sets the scan rate proportional to the size of the task being scanned.
Patches 13-14 kick away some training wheels and scan shared pages and small VMAs.
Patch 15 introduces private fault detection based on the PID of the faulting
process and accounts for shared/private accesses differently
Patch 16 picks the least loaded CPU on the preferred node based on a scheduling
	domain common to both the source and destination NUMA nodes.
Patch 17 retries task migration if an earlier attempt failed
Patch 18 will swap tasks if the target node is overloaded and the swap would not
impair locality.
Testing on this is only partial as full tests take a long time to run. A
full specjbb for both single and multi takes over 4 hours. NPB D class
also takes a few hours. With all the kernels in question, it still takes
a weekend to churn through them all.
Kernel 3.9 is still the testing baseline.
o vanilla vanilla kernel with automatic numa balancing enabled
o favorpref-v5 Patches 1-11
o scanshared-v5 Patches 1-14
o splitprivate-v5 Patches 1-15
o accountload-v5 Patches 1-16
o retrymigrate-v5 Patches 1-17
o swaptasks-v5 Patches 1-18
This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system. Only a limited number of clients are executed
to save on time.
specjbb
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v5 scanshared-v5 splitprivate-v5 accountload-v5 retrymigrate-v5 swaptasks-v5
TPut 1 24474.00 ( 0.00%) 23503.00 ( -3.97%) 24858.00 ( 1.57%) 23890.00 ( -2.39%) 24303.00 ( -0.70%) 23529.00 ( -3.86%) 26110.00 ( 6.68%)
TPut 7 186914.00 ( 0.00%) 188656.00 ( 0.93%) 186370.00 ( -0.29%) 180352.00 ( -3.51%) 179962.00 ( -3.72%) 183667.00 ( -1.74%) 185912.00 ( -0.54%)
TPut 13 334429.00 ( 0.00%) 327613.00 ( -2.04%) 316733.00 ( -5.29%) 327675.00 ( -2.02%) 327558.00 ( -2.05%) 336418.00 ( 0.59%) 334563.00 ( 0.04%)
TPut 19 422820.00 ( 0.00%) 412078.00 ( -2.54%) 398354.00 ( -5.79%) 443889.00 ( 4.98%) 451359.00 ( 6.75%) 450069.00 ( 6.44%) 426753.00 ( 0.93%)
TPut 25 456121.00 ( 0.00%) 434898.00 ( -4.65%) 432072.00 ( -5.27%) 523230.00 ( 14.71%) 533432.00 ( 16.95%) 504138.00 ( 10.53%) 503152.00 ( 10.31%)
TPut 31 438595.00 ( 0.00%) 391575.00 (-10.72%) 415957.00 ( -5.16%) 520259.00 ( 18.62%) 510638.00 ( 16.43%) 442937.00 ( 0.99%) 486450.00 ( 10.91%)
TPut 37 409654.00 ( 0.00%) 370804.00 ( -9.48%) 398863.00 ( -2.63%) 510303.00 ( 24.57%) 475468.00 ( 16.07%) 427673.00 ( 4.40%) 460531.00 ( 12.42%)
TPut 43 370941.00 ( 0.00%) 327823.00 (-11.62%) 379232.00 ( 2.24%) 443788.00 ( 19.64%) 442169.00 ( 19.20%) 387382.00 ( 4.43%) 425120.00 ( 14.61%)
It's interesting that retrying the migration introduced such a large dent. I
do not know why at this point. Swapping the tasks helped and overall the
performance is all right with room for improvement.
specjbb Peaks
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v5 scanshared-v5 splitprivate-v5 accountload-v5 retrymigrate-v5 swaptasks-v5
Expctd Warehouse 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%)
Actual Warehouse 26.00 ( 0.00%) 26.00 ( 0.00%) 26.00 ( 0.00%) 26.00 ( 0.00%) 26.00 ( 0.00%) 26.00 ( 0.00%) 26.00 ( 0.00%)
Actual Peak Bops 456121.00 ( 0.00%) 434898.00 ( -4.65%) 432072.00 ( -5.27%) 523230.00 ( 14.71%) 533432.00 ( 16.95%) 504138.00 ( 10.53%) 503152.00 ( 10.31%)
Peak performance improved a bit.
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v5 scanshared-v5 splitprivate-v5 accountload-v5 retrymigrate-v5 swaptasks-v5
User 5178.63 5177.18 5166.66 5163.88 5180.82 5210.99 5174.46
System 63.37 77.01 66.88 70.55 71.84 67.78 64.88
Elapsed 254.06 254.28 254.13 254.12 254.66 254.00 259.90
System CPU is marginally increased for the whole series but bear in mind
that shared pages are now scanned too.
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v5 scanshared-v5 splitprivate-v5 accountload-v5 retrymigrate-v5 swaptasks-v5
THP fault alloc 34484 35783 34536 33833 34698 34144 31746
THP collapse alloc 10 11 10 12 9 9 8
THP splits 4 3 4 4 4 3 4
THP fault fallback 0 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0 0
Page migrate success 2012272 2026917 1314521 4443221 4364473 4240500 3978819
Page migrate failure 0 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0 0
Compaction cost 2088 2103 1364 4612 4530 4401 4130
NUMA PTE updates 19189011 19981179 14384847 17238428 15784233 15922331 15400588
NUMA hint faults 198452 200904 79319 89363 85872 96136 91433
NUMA hint local faults 140889 134909 30654 32321 29985 40007 37761
NUMA hint local percent 70 67 38 36 34 41 41
NUMA pages migrated 2012272 2026917 1314521 4443221 4364473 4240500 3978819
AutoNUMA cost 1164 1182 522 651 622 672 640
The percentage of hinting faults that are local is impaired although
this is mostly due to scanning shared pages. That will need to be improved
again. Overall there are fewer PTE updates though.
Next is the autonuma benchmark results. These were only run once so I have no
idea what the variance is. Obviously they could be run multiple times but with
this number of kernels we would die of old age waiting on the results.
autonumabench
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v5 retrymigrate-v5 swaptasks-v5
User NUMA01 52623.86 ( 0.00%) 49514.41 ( 5.91%) 53783.60 ( -2.20%) 51205.78 ( 2.69%) 57578.80 ( -9.42%) 52430.64 ( 0.37%) 53708.31 ( -2.06%)
User NUMA01_THEADLOCAL 17595.48 ( 0.00%) 17620.51 ( -0.14%) 19734.74 (-12.16%) 16966.63 ( 3.57%) 17397.51 ( 1.13%) 16934.59 ( 3.76%) 17136.78 ( 2.61%)
User NUMA02 2043.84 ( 0.00%) 1993.04 ( 2.49%) 2051.29 ( -0.36%) 1901.96 ( 6.94%) 1957.73 ( 4.21%) 1936.44 ( 5.25%) 2032.55 ( 0.55%)
User NUMA02_SMT 1057.11 ( 0.00%) 1005.61 ( 4.87%) 980.19 ( 7.28%) 977.65 ( 7.52%) 938.97 ( 11.18%) 968.34 ( 8.40%) 979.50 ( 7.34%)
System NUMA01 414.17 ( 0.00%) 222.86 ( 46.19%) 145.79 ( 64.80%) 321.93 ( 22.27%) 141.79 ( 65.77%) 333.07 ( 19.58%) 345.33 ( 16.62%)
System NUMA01_THEADLOCAL 105.17 ( 0.00%) 102.35 ( 2.68%) 117.22 (-11.46%) 105.35 ( -0.17%) 104.41 ( 0.72%) 119.39 (-13.52%) 115.53 ( -9.85%)
System NUMA02 9.36 ( 0.00%) 9.96 ( -6.41%) 13.02 (-39.10%) 9.53 ( -1.82%) 8.73 ( 6.73%) 10.68 (-14.10%) 8.73 ( 6.73%)
System NUMA02_SMT 3.54 ( 0.00%) 3.53 ( 0.28%) 3.46 ( 2.26%) 5.85 (-65.25%) 3.32 ( 6.21%) 3.30 ( 6.78%) 4.97 (-40.40%)
Elapsed NUMA01 1201.52 ( 0.00%) 1143.59 ( 4.82%) 1244.61 ( -3.59%) 1182.92 ( 1.55%) 1315.30 ( -9.47%) 1201.92 ( -0.03%) 1246.12 ( -3.71%)
Elapsed NUMA01_THEADLOCAL 393.91 ( 0.00%) 392.49 ( 0.36%) 442.04 (-12.22%) 385.61 ( 2.11%) 414.00 ( -5.10%) 383.56 ( 2.63%) 390.09 ( 0.97%)
Elapsed NUMA02 50.30 ( 0.00%) 50.36 ( -0.12%) 49.53 ( 1.53%) 48.91 ( 2.76%) 48.73 ( 3.12%) 50.48 ( -0.36%) 48.76 ( 3.06%)
Elapsed NUMA02_SMT 58.48 ( 0.00%) 47.79 ( 18.28%) 51.56 ( 11.83%) 55.98 ( 4.27%) 56.05 ( 4.16%) 48.18 ( 17.61%) 46.90 ( 19.80%)
CPU NUMA01 4414.00 ( 0.00%) 4349.00 ( 1.47%) 4333.00 ( 1.84%) 4355.00 ( 1.34%) 4388.00 ( 0.59%) 4389.00 ( 0.57%) 4337.00 ( 1.74%)
CPU NUMA01_THEADLOCAL 4493.00 ( 0.00%) 4515.00 ( -0.49%) 4490.00 ( 0.07%) 4427.00 ( 1.47%) 4227.00 ( 5.92%) 4446.00 ( 1.05%) 4422.00 ( 1.58%)
CPU NUMA02 4081.00 ( 0.00%) 3977.00 ( 2.55%) 4167.00 ( -2.11%) 3908.00 ( 4.24%) 4034.00 ( 1.15%) 3856.00 ( 5.51%) 4186.00 ( -2.57%)
CPU NUMA02_SMT 1813.00 ( 0.00%) 2111.00 (-16.44%) 1907.00 ( -5.18%) 1756.00 ( 3.14%) 1681.00 ( 7.28%) 2016.00 (-11.20%) 2098.00 (-15.72%)
numa01 performance is impacted but it's an adverse workload on this
particular machine and at least the system CPU usage is lower in that
case. Otherwise the performance looks decent although I am mindful that
the system CPU usage is higher in places.
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v5 retrymigrate-v5 swaptasks-v5
THP fault alloc 14325 11724 14906 13553 14033 13994 15838
THP collapse alloc 6 3 7 13 9 6 3
THP splits 4 1 4 2 1 2 4
THP fault fallback 0 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0 0
Page migrate success 9020528 9708110 6677767 6773951 6247795 5812565 6574293
Page migrate failure 0 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0 0
Compaction cost 9363 10077 6931 7031 6485 6033 6824
NUMA PTE updates 119292401 114641446 85954812 74337906 76564821 73541041 79835108
NUMA hint faults 755901 499186 287825 237095 227211 238800 236077
NUMA hint local faults 595478 333483 152899 122210 118620 132560 133470
NUMA hint local percent 78 66 53 51 52 55 56
NUMA pages migrated 9020528 9708110 6677767 6773951 6247795 5812565 6574293
AutoNUMA cost 4785 3482 2167 1834 1790 1819 1864
As these were only run once I will not draw detailed conclusions on each
testcase. However, in general the series is doing a lot less work with
PTE updates, faults and so on. The percentage of local faults
suffers but a large part of this seems to be around where shared pages are
getting scanned.
The following is SpecJBB running with THP enabled and one JVM running per
NUMA node in the system.
specjbb
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v5 retrymigrate-v5 swaptasks-v5
Mean 1 30640.75 ( 0.00%) 31222.25 ( 1.90%) 31275.50 ( 2.07%) 30554.00 ( -0.28%) 31073.25 ( 1.41%) 31210.25 ( 1.86%) 30738.00 ( 0.32%)
Mean 10 136983.25 ( 0.00%) 133072.00 ( -2.86%) 140022.00 ( 2.22%) 119168.25 (-13.01%) 134302.25 ( -1.96%) 140038.50 ( 2.23%) 133968.75 ( -2.20%)
Mean 19 124005.25 ( 0.00%) 121016.25 ( -2.41%) 122189.00 ( -1.46%) 111813.75 ( -9.83%) 120424.25 ( -2.89%) 119575.75 ( -3.57%) 120996.25 ( -2.43%)
Mean 28 114672.00 ( 0.00%) 111643.00 ( -2.64%) 109175.75 ( -4.79%) 101199.50 (-11.75%) 109499.25 ( -4.51%) 109031.25 ( -4.92%) 112354.25 ( -2.02%)
Mean 37 110916.50 ( 0.00%) 105791.75 ( -4.62%) 103103.75 ( -7.04%) 100187.00 ( -9.67%) 104726.75 ( -5.58%) 109913.50 ( -0.90%) 105993.00 ( -4.44%)
Mean 46 110139.25 ( 0.00%) 105383.25 ( -4.32%) 99454.75 ( -9.70%) 99762.00 ( -9.42%) 97961.00 (-11.06%) 105358.50 ( -4.34%) 104700.50 ( -4.94%)
Stddev 1 1002.06 ( 0.00%) 1125.30 (-12.30%) 959.60 ( 4.24%) 960.28 ( 4.17%) 1142.64 (-14.03%) 1245.68 (-24.31%) 860.81 ( 14.10%)
Stddev 10 4656.47 ( 0.00%) 6679.25 (-43.44%) 5946.78 (-27.71%) 10427.37 (-123.93%) 3744.32 ( 19.59%) 3394.82 ( 27.09%) 4160.26 ( 10.66%)
Stddev 19 2578.12 ( 0.00%) 5261.94 (-104.10%) 3414.66 (-32.45%) 5070.00 (-96.65%) 987.10 ( 61.71%) 300.27 ( 88.35%) 3561.43 (-38.14%)
Stddev 28 4123.69 ( 0.00%) 4156.17 ( -0.79%) 6666.32 (-61.66%) 3899.89 ( 5.43%) 1426.42 ( 65.41%) 3823.35 ( 7.28%) 5069.70 (-22.94%)
Stddev 37 2301.94 ( 0.00%) 5225.48 (-127.00%) 5444.18 (-136.50%) 3490.87 (-51.65%) 3133.33 (-36.12%) 2283.83 ( 0.79%) 2626.42 (-14.10%)
Stddev 46 8317.91 ( 0.00%) 6759.04 ( 18.74%) 6587.32 ( 20.81%) 4458.49 ( 46.40%) 5073.30 ( 39.01%) 7422.27 ( 10.77%) 6137.92 ( 26.21%)
TPut 1 122563.00 ( 0.00%) 124889.00 ( 1.90%) 125102.00 ( 2.07%) 122216.00 ( -0.28%) 124293.00 ( 1.41%) 124841.00 ( 1.86%) 122952.00 ( 0.32%)
TPut 10 547933.00 ( 0.00%) 532288.00 ( -2.86%) 560088.00 ( 2.22%) 476673.00 (-13.01%) 537209.00 ( -1.96%) 560154.00 ( 2.23%) 535875.00 ( -2.20%)
TPut 19 496021.00 ( 0.00%) 484065.00 ( -2.41%) 488756.00 ( -1.46%) 447255.00 ( -9.83%) 481697.00 ( -2.89%) 478303.00 ( -3.57%) 483985.00 ( -2.43%)
TPut 28 458688.00 ( 0.00%) 446572.00 ( -2.64%) 436703.00 ( -4.79%) 404798.00 (-11.75%) 437997.00 ( -4.51%) 436125.00 ( -4.92%) 449417.00 ( -2.02%)
TPut 37 443666.00 ( 0.00%) 423167.00 ( -4.62%) 412415.00 ( -7.04%) 400748.00 ( -9.67%) 418907.00 ( -5.58%) 439654.00 ( -0.90%) 423972.00 ( -4.44%)
TPut 46 440557.00 ( 0.00%) 421533.00 ( -4.32%) 397819.00 ( -9.70%) 399048.00 ( -9.42%) 391844.00 (-11.06%) 421434.00 ( -4.34%) 418802.00 ( -4.94%)
This one is more of a black eye. The average and overall performance
is down although there is a considerable amount of noise. This workload
particularly suffers from false sharing and there is a requirement for a
follow-on series to better group related tasks together so the JVMs migrate
to individual nodes properly.
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v5 retrymigrate-v5 swaptasks-v5
User 52899.04 53106.74 53245.67 52828.25 52817.97 52888.09 53476.23
System 250.42 254.20 203.97 222.28 222.24 229.28 232.46
Elapsed 1199.72 1208.35 1206.14 1197.28 1197.35 1205.42 1208.24
At least system CPU usage is lower.
3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0 3.9.0
vanilla favorpref-v4 scanshared-v4 splitprivate-v4 accountload-v5 retrymigrate-v5 swaptasks-v5
THP fault alloc 65188 66217 68158 63283 66020 69390 66853
THP collapse alloc 97 172 91 108 106 106 103
THP splits 38 37 36 34 34 42 35
THP fault fallback 0 0 0 0 0 0 0
THP collapse fail 0 0 0 0 0 0 0
Compaction stalls 0 0 0 0 0 0 0
Compaction success 0 0 0 0 0 0 0
Compaction failures 0 0 0 0 0 0 0
Page migrate success 14583860 14559261 7770770 10131560 10607758 10457889 10145643
Page migrate failure 0 0 0 0 0 0 0
Compaction pages isolated 0 0 0 0 0 0 0
Compaction migrate scanned 0 0 0 0 0 0 0
Compaction free scanned 0 0 0 0 0 0 0
Compaction cost 15138 15112 8066 10516 11010 10855 10531
NUMA PTE updates 128327468 129131539 74033679 72954561 73417999 75269838 74507785
NUMA hint faults 2103190 1712971 1488709 1362365 1338427 1275103 1401975
NUMA hint local faults 734136 640363 405816 471928 402556 389653 473054
NUMA hint local percent 34 37 27 34 30 30 33
NUMA pages migrated 14583860 14559261 7770770 10131560 10607758 10457889 10145643
AutoNUMA cost 11691 9745 8109 7515 7407 7101 7724
Far fewer PTEs are updated but the low percentage of local NUMA hinting faults
shows how much room there is for improvement.
So overall the series performs ok even though it is not the universal win I'd
have liked. However, I think the fact that it is now dealing with shared pages,
that system overhead is generally lower and that it's now taking compute overloading
into account are all important steps in the right direction.
I'd still like to see this treated as a standalone with a separate series
focusing on false sharing detection and reduction, shared accesses used
for selecting preferred nodes, shared accesses used for load balancing and
reintroducing Peter's patch that balances compute nodes relative to each
other. This is to keep each series a manageable size for review even if
it's obvious that more work is required.
Documentation/sysctl/kernel.txt | 68 +++++++
include/linux/migrate.h | 7 +-
include/linux/mm.h | 69 ++++---
include/linux/mm_types.h | 7 +-
include/linux/page-flags-layout.h | 28 +--
include/linux/sched.h | 24 ++-
include/linux/sched/sysctl.h | 1 -
kernel/sched/core.c | 61 ++++++-
kernel/sched/fair.c | 374 ++++++++++++++++++++++++++++++++++----
kernel/sched/sched.h | 13 ++
kernel/sysctl.c | 14 +-
mm/huge_memory.c | 26 ++-
mm/memory.c | 27 +--
mm/mempolicy.c | 8 +-
mm/migrate.c | 21 +--
mm/mm_init.c | 18 +-
mm/mmzone.c | 12 +-
mm/mprotect.c | 28 +--
mm/page_alloc.c | 4 +-
19 files changed, 658 insertions(+), 152 deletions(-)
--
1.8.1.4
THP NUMA hinting faults on pages that are not migrated are being
accounted for incorrectly. Currently the fault will be counted as if the
task was running on a node local to the page, which is not necessarily
true.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/huge_memory.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e2f7f5aa..e4a79fa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1293,7 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int target_nid;
- int current_nid = -1;
+ int src_nid = -1;
bool migrated;
spin_lock(&mm->page_table_lock);
@@ -1302,9 +1302,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = pmd_page(pmd);
get_page(page);
- current_nid = page_to_nid(page);
+ src_nid = numa_node_id();
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (src_nid == page_to_nid(page))
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
target_nid = mpol_misplaced(page, vma, haddr);
@@ -1346,8 +1346,8 @@ clear_pmdnuma:
update_mmu_cache_pmd(vma, addr, pmdp);
out_unlock:
spin_unlock(&mm->page_table_lock);
- if (current_nid != -1)
- task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+ if (src_nid != -1)
+ task_numa_fault(src_nid, HPAGE_PMD_NR, false);
return 0;
}
--
1.8.1.4
The NUMA PTE scan is reset every sysctl_numa_balancing_scan_period_reset
in case of phase changes. This is crude and it is clearly visible in graphs
when the PTE scanner resets even if the workload is already balanced. This
patch increases the scan rate if the preferred node is updated and the
task is currently running on the node to recheck if the placement
decision is correct. In the optimistic expectation that the placement
decisions will be correct, the maximum period between scans is also
increased to reduce overhead due to automatic NUMA balancing.
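For illustration only, the rescan rule amounts to halving the scan period
(doubling the scan rate) for a task that had already settled, clamped to the
configured minimum. The sketch below is plain userspace C, not the kernel
code; the settle count value is an assumption standing in for the sysctl:

    #include <stdio.h>

    #define SCAN_PERIOD_MIN_MS  100   /* mirrors numa_balancing_scan_period_min_ms */
    #define SETTLE_COUNT        3     /* assumed value of the settle sysctl */

    /* Halve the scan period when a settled task changes its preferred node;
     * leave tasks that were not settled alone. */
    static unsigned int rescan_period(unsigned int period_ms, int migrate_seq)
    {
            if (migrate_seq < SETTLE_COUNT)
                    return period_ms;
            return period_ms / 2 > SCAN_PERIOD_MIN_MS ? period_ms / 2
                                                      : SCAN_PERIOD_MIN_MS;
    }

    int main(void)
    {
            printf("%u\n", rescan_period(60000, 4)); /* settled: 60000 -> 30000 */
            printf("%u\n", rescan_period(60000, 1)); /* not settled: unchanged */
            return 0;
    }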
Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/sysctl/kernel.txt | 11 +++--------
include/linux/mm_types.h | 3 ---
include/linux/sched/sysctl.h | 1 -
kernel/sched/core.c | 1 -
kernel/sched/fair.c | 27 ++++++++++++---------------
kernel/sysctl.c | 7 -------
6 files changed, 15 insertions(+), 35 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 246b128..a275042 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -373,15 +373,13 @@ guarantee. If the target workload is already bound to NUMA nodes then this
feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the numa_balancing_scan_period_min_ms,
-numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
-numa_balancing_settle_count sysctls.
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_max_ms,
+numa_balancing_scan_size_mb and numa_balancing_settle_count sysctls.
==============================================================
numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
-numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_size_mb
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb
Automatic NUMA balancing scans tasks address space and unmaps pages to
detect if pages are properly placed or if the data should be migrated to a
@@ -416,9 +414,6 @@ effectively controls the minimum scanning rate for each task.
numa_balancing_scan_size_mb is how many megabytes worth of pages are
scanned for a given scan.
-numa_balancing_scan_period_reset is a blunt instrument that controls how
-often a tasks scan delay is reset to detect sudden changes in task behaviour.
-
numa_balancing_settle_count is how many scan periods must complete before
the schedule balancer stops pushing the task towards a preferred node. This
gives the scheduler a chance to place the task on an alternative node if the
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index ace9a5f..de70964 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -421,9 +421,6 @@ struct mm_struct {
*/
unsigned long numa_next_scan;
- /* numa_next_reset is when the PTE scanner period will be reset */
- unsigned long numa_next_reset;
-
/* Restart point for scanning and setting pte_numa */
unsigned long numa_scan_offset;
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index bf8086b..10d16c4f 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -47,7 +47,6 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;
extern unsigned int sysctl_numa_balancing_scan_delay;
extern unsigned int sysctl_numa_balancing_scan_period_min;
extern unsigned int sysctl_numa_balancing_scan_period_max;
-extern unsigned int sysctl_numa_balancing_scan_period_reset;
extern unsigned int sysctl_numa_balancing_scan_size;
extern unsigned int sysctl_numa_balancing_settle_count;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b67a102..53d8465 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1585,7 +1585,6 @@ static void __sched_fork(struct task_struct *p)
#ifdef CONFIG_NUMA_BALANCING
if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
p->mm->numa_next_scan = jiffies;
- p->mm->numa_next_reset = jiffies;
p->mm->numa_scan_seq = 0;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9590fcd..9002a4a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -782,8 +782,7 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
* numa task sample period in ms
*/
unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
-unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
/* Portion of address space to scan in MB */
unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -873,6 +872,7 @@ static void task_numa_placement(struct task_struct *p)
*/
if (max_faults && max_nid != p->numa_preferred_nid) {
int preferred_cpu;
+ int old_migrate_seq = p->numa_migrate_seq;
/*
* If the task is not on the preferred node then find the most
@@ -888,6 +888,16 @@ static void task_numa_placement(struct task_struct *p)
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 0;
migrate_task_to(p, preferred_cpu);
+
+ /*
+ * If preferred nodes changes frequently then the scan rate
+ * will be continually high. Mitigate this by increasing the
+ * scan rate only if the task was settled.
+ */
+ if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
+ p->numa_scan_period = max(p->numa_scan_period >> 1,
+ sysctl_numa_balancing_scan_period_min);
+ }
}
}
@@ -984,19 +994,6 @@ void task_numa_work(struct callback_head *work)
}
/*
- * Reset the scan period if enough time has gone by. Objective is that
- * scanning will be reduced if pages are properly placed. As tasks
- * can enter different phases this needs to be re-examined. Lacking
- * proper tracking of reference behaviour, this blunt hammer is used.
- */
- migrate = mm->numa_next_reset;
- if (time_after(now, migrate)) {
- p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
- next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
- xchg(&mm->numa_next_reset, next_scan);
- }
-
- /*
* Enforce maximal scan/migration frequency..
*/
migrate = mm->numa_next_scan;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 263486f..1fcbc68 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -373,13 +373,6 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec,
},
{
- .procname = "numa_balancing_scan_period_reset",
- .data = &sysctl_numa_balancing_scan_period_reset,
- .maxlen = sizeof(unsigned int),
- .mode = 0644,
- .proc_handler = proc_dointvec,
- },
- {
.procname = "numa_balancing_scan_period_max_ms",
.data = &sysctl_numa_balancing_scan_period_max,
.maxlen = sizeof(unsigned int),
--
1.8.1.4
The scheduler avoids adding load imbalance when scheduling a task on its
preferred node. This unfortunately can mean that a task continues to access
remote memory. In the event the CPUs are relatively imbalanced this
patch will check if the task running on the target CPU can be swapped
with. An attempt will be made to swap with that task if it is not running
on its preferred node and moving it would not impair its locality.
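For illustration, the filter applied to a potential swap candidate can be
sketched in isolation. This is self-contained userspace C with made-up fault
counts, not the kernel code; the real checks operate on rq->curr and its
numa_faults array as in the diff below:

    #include <stdbool.h>
    #include <stdio.h>

    /* A busy destination CPU's current task is only considered for swapping
     * if it is not already on its preferred node and if moving it to the
     * source node would not leave it with fewer local faults than it has now. */
    static bool can_swap_with(int candidate_preferred_nid, int target_nid,
                              unsigned long faults_on_target_nid,
                              unsigned long faults_on_source_nid)
    {
            if (candidate_preferred_nid == target_nid)
                    return false;   /* never push a task off its preferred node */
            if (faults_on_target_nid > faults_on_source_nid)
                    return false;   /* swap would impair the candidate's locality */
            return true;
    }

    int main(void)
    {
            printf("%d\n", can_swap_with(1, 0, 50, 200)); /* 1: swap is acceptable */
            printf("%d\n", can_swap_with(0, 0, 50, 200)); /* 0: already on preferred node */
            return 0;
    }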
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/core.c | 39 +++++++++++++++++++++++++++++++++++++--
kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 3 ++-
3 files changed, 80 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 53d8465..d679b01 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4857,10 +4857,13 @@ fail:
#ifdef CONFIG_NUMA_BALANCING
/* Migrate current task p to target_cpu */
-int migrate_task_to(struct task_struct *p, int target_cpu)
+int migrate_task_to(struct task_struct *p, int target_cpu,
+ struct task_struct *swap_p)
{
struct migration_arg arg = { p, target_cpu };
int curr_cpu = task_cpu(p);
+ struct rq *rq;
+ int retval;
if (curr_cpu == target_cpu)
return 0;
@@ -4868,7 +4871,39 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
return -EINVAL;
- return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+ if (swap_p == NULL)
+ return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+
+ /* Make sure the target is still running the expected task */
+ rq = cpu_rq(target_cpu);
+ local_irq_disable();
+ raw_spin_lock(&rq->lock);
+ if (rq->curr != swap_p) {
+ raw_spin_unlock(&rq->lock);
+ local_irq_enable();
+ return -EINVAL;
+ }
+
+ /* Take a reference on the running task on the target cpu */
+ get_task_struct(swap_p);
+ raw_spin_unlock(&rq->lock);
+ local_irq_enable();
+
+ /* Move current running task to target CPU */
+ retval = stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+ if (raw_smp_processor_id() != target_cpu) {
+ put_task_struct(swap_p);
+ return retval;
+ }
+
+ /* Move the remote task to the CPU just vacated */
+ local_irq_disable();
+ if (raw_smp_processor_id() == target_cpu)
+ __migrate_task(swap_p, target_cpu, curr_cpu);
+ local_irq_enable();
+
+ put_task_struct(swap_p);
+ return retval;
}
#endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07a9f40..7a8f768 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -851,10 +851,12 @@ static unsigned long target_load(int cpu, int type);
static unsigned long power_of(int cpu);
static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
-static int task_numa_find_cpu(struct task_struct *p, int nid)
+static int task_numa_find_cpu(struct task_struct *p, int nid,
+ struct task_struct **swap_p)
{
int node_cpu = cpumask_first(cpumask_of_node(nid));
int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
+ int src_cpu_node = cpu_to_node(src_cpu);
unsigned long src_load, dst_load;
unsigned long min_load = ULONG_MAX;
struct task_group *tg = task_group(p);
@@ -864,6 +866,8 @@ static int task_numa_find_cpu(struct task_struct *p, int nid)
bool balanced;
int imbalance_pct, idx = -1;
+ *swap_p = NULL;
+
/* No harm being optimistic */
if (idle_cpu(node_cpu))
return node_cpu;
@@ -904,6 +908,8 @@ static int task_numa_find_cpu(struct task_struct *p, int nid)
src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
for_each_cpu(cpu, cpumask_of_node(nid)) {
+ struct task_struct *swap_candidate = NULL;
+
dst_load = target_load(cpu, idx);
/* If the CPU is idle, use it */
@@ -922,12 +928,41 @@ static int task_numa_find_cpu(struct task_struct *p, int nid)
* migrate to its preferred node due to load imbalances.
*/
balanced = (dst_eff_load <= src_eff_load);
- if (!balanced)
- continue;
+ if (!balanced) {
+ struct rq *rq = cpu_rq(cpu);
+ unsigned long src_faults, dst_faults;
+
+ /* Do not move tasks off their preferred node */
+ if (rq->curr->numa_preferred_nid == nid)
+ continue;
+
+ /* Do not attempt an illegal migration */
+ if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(rq->curr)))
+ continue;
+
+ /*
+ * Do not impair locality for the swap candidate.
+ * Destination for the swap candidate is the source cpu
+ */
+ if (rq->curr->numa_faults) {
+ src_faults = rq->curr->numa_faults[task_faults_idx(nid, 1)];
+ dst_faults = rq->curr->numa_faults[task_faults_idx(src_cpu_node, 1)];
+ if (src_faults > dst_faults)
+ continue;
+ }
+
+ /*
+ * The destination is overloaded but running a task
+ * that is not running on its preferred node. Consider
+ * swapping the CPU tasks are running on.
+ */
+ swap_candidate = rq->curr;
+ }
if (dst_load < min_load) {
min_load = dst_load;
dst_cpu = cpu;
+ *swap_p = swap_candidate;
}
}
@@ -938,6 +973,7 @@ static int task_numa_find_cpu(struct task_struct *p, int nid)
static void numa_migrate_preferred(struct task_struct *p)
{
int preferred_cpu = task_cpu(p);
+ struct task_struct *swap_p;
/* Success if task is already running on preferred CPU */
p->numa_migrate_retry = 0;
@@ -945,8 +981,8 @@ static void numa_migrate_preferred(struct task_struct *p)
return;
/* Otherwise, try migrate to a CPU on the preferred node */
- preferred_cpu = task_numa_find_cpu(p, p->numa_preferred_nid);
- if (migrate_task_to(p, preferred_cpu) != 0)
+ preferred_cpu = task_numa_find_cpu(p, p->numa_preferred_nid, &swap_p);
+ if (migrate_task_to(p, preferred_cpu, swap_p) != 0)
p->numa_migrate_retry = jiffies + HZ*5;
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 795346d..90ded64 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,7 +504,8 @@ DECLARE_PER_CPU(struct rq, runqueues);
#define raw_rq() (&__raw_get_cpu_var(runqueues))
#ifdef CONFIG_NUMA_BALANCING
-extern int migrate_task_to(struct task_struct *p, int cpu);
+extern int migrate_task_to(struct task_struct *p, int cpu,
+ struct task_struct *swap_p);
static inline void task_numa_free(struct task_struct *p)
{
kfree(p->numa_faults);
--
1.8.1.4
task_numa_work skips small VMAs. At the time the intent was to reduce the
scanning overhead, which was considerable. It is a dubious hack at best.
It would make much more sense to cache where faults have been observed
and only rescan those regions during subsequent PTE scans. Remove this
hack as motivation to do it properly in the future.
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 4 ----
1 file changed, 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8a392c8..b43122c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1080,10 +1080,6 @@ void task_numa_work(struct callback_head *work)
if (!vma_migratable(vma))
continue;
- /* Skip small VMAs. They are not likely to be of relevance */
- if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
- continue;
-
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
--
1.8.1.4
When a preferred node is selected for a task there is an attempt to migrate
the task to a CPU there. This may fail in which case the task will only
migrate if the active load balancer takes action. This may never happen if
the conditions are not right. This patch will check at NUMA hinting fault
time if another attempt should be made to migrate the task. It will only
make an attempt once every five seconds.
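A minimal sketch of the retry bookkeeping, in userspace C with a millisecond
clock standing in for the jiffies + HZ*5 timestamp used by the patch:

    #include <stdbool.h>
    #include <stdio.h>

    static unsigned long numa_migrate_retry;    /* 0 means no retry is pending */

    static void migration_failed(unsigned long now_ms)
    {
            numa_migrate_retry = now_ms + 5000; /* try again in roughly five seconds */
    }

    /* Called from the NUMA hinting fault path to decide whether to retry. */
    static bool should_retry(unsigned long now_ms)
    {
            return numa_migrate_retry && now_ms > numa_migrate_retry;
    }

    int main(void)
    {
            migration_failed(1000);
            printf("%d %d\n", should_retry(3000), should_retry(7000)); /* prints: 0 1 */
            return 0;
    }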
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 40 +++++++++++++++++++++++-----------------
2 files changed, 24 insertions(+), 17 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d44fbc6..454ad2e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1505,6 +1505,7 @@ struct task_struct {
int numa_migrate_seq;
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
+ unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ee1c8e..07a9f40 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -934,6 +934,22 @@ static int task_numa_find_cpu(struct task_struct *p, int nid)
return dst_cpu;
}
+/* Attempt to migrate a task to a CPU on the preferred node. */
+static void numa_migrate_preferred(struct task_struct *p)
+{
+ int preferred_cpu = task_cpu(p);
+
+ /* Success if task is already running on preferred CPU */
+ p->numa_migrate_retry = 0;
+ if (cpu_to_node(preferred_cpu) == p->numa_preferred_nid)
+ return;
+
+ /* Otherwise, try migrate to a CPU on the preferred node */
+ preferred_cpu = task_numa_find_cpu(p, p->numa_preferred_nid);
+ if (migrate_task_to(p, preferred_cpu) != 0)
+ p->numa_migrate_retry = jiffies + HZ*5;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -968,28 +984,14 @@ static void task_numa_placement(struct task_struct *p)
}
}
- /*
- * Record the preferred node as the node with the most faults,
- * requeue the task to be running on the idlest CPU on the
- * preferred node and reset the scanning rate to recheck
- * the working set placement.
- */
+ /* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
- int preferred_cpu;
int old_migrate_seq = p->numa_migrate_seq;
- /*
- * If the task is not on the preferred node then find
- * a suitable CPU to migrate to.
- */
- preferred_cpu = task_cpu(p);
- if (cpu_to_node(preferred_cpu) != max_nid)
- preferred_cpu = task_numa_find_cpu(p, max_nid);
-
- /* Update the preferred nid and migrate task if possible */
+ /* Queue task on preferred node if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 0;
- migrate_task_to(p, preferred_cpu);
+ numa_migrate_preferred(p);
/*
* If preferred nodes changes frequently then the scan rate
@@ -1050,6 +1052,10 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
task_numa_placement(p);
+ /* Retry task to preferred node migration if it previously failed */
+ if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+ numa_migrate_preferred(p);
+
/* Record the fault, double the weight if pages were migrated */
p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
}
--
1.8.1.4
This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
find_idlest_cpu_node has two critical limitations. It does not take the
scheduling class into account when calculating the load and it is unsuitable
for use when comparing loads between NUMA nodes.
task_numa_find_cpu uses similar load calculations to wake_affine() when
selecting the least loaded CPU within a scheduling domain common to the
source and destination nodes. It avoids causing CPU load imbalances in
the machine by refusing to migrate if the relative load on the target
CPU is higher than the source CPU.
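To illustrate the balance test, here is a standalone sketch with illustrative
load and CPU power values; the per-entity weight adjustment done via
effective_load() in the real code is omitted for brevity:

    #include <stdbool.h>
    #include <stdio.h>

    /* The source side is inflated by half the scheduling domain's
     * imbalance_pct so a migration only happens when the destination is
     * comfortably less loaded than the source. */
    static bool numa_migration_balanced(long src_load, long dst_load,
                                        long src_power, long dst_power,
                                        int imbalance_pct)
    {
            long src_eff_load = (100 + (imbalance_pct - 100) / 2) * src_power * src_load;
            long dst_eff_load = 100 * dst_power * dst_load;

            return dst_eff_load <= src_eff_load;
    }

    int main(void)
    {
            /* imbalance_pct of 125 tolerates a destination ~12% busier than the source */
            printf("%d\n", numa_migration_balanced(1024, 1100, 1024, 1024, 125)); /* 1 */
            printf("%d\n", numa_migration_balanced(1024, 1300, 1024, 1024, 125)); /* 0 */
            return 0;
    }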
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 105 +++++++++++++++++++++++++++++++++++++++++-----------
1 file changed, 83 insertions(+), 22 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3f0519c..8ee1c8e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -846,29 +846,92 @@ static inline int task_faults_idx(int nid, int priv)
return 2 * nid + priv;
}
-static unsigned long weighted_cpuload(const int cpu);
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
+
+static int task_numa_find_cpu(struct task_struct *p, int nid)
+{
+ int node_cpu = cpumask_first(cpumask_of_node(nid));
+ int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
+ unsigned long src_load, dst_load;
+ unsigned long min_load = ULONG_MAX;
+ struct task_group *tg = task_group(p);
+ s64 src_eff_load, dst_eff_load;
+ struct sched_domain *sd;
+ unsigned long weight;
+ bool balanced;
+ int imbalance_pct, idx = -1;
+ /* No harm being optimistic */
+ if (idle_cpu(node_cpu))
+ return node_cpu;
-static int
-find_idlest_cpu_node(int this_cpu, int nid)
-{
- unsigned long load, min_load = ULONG_MAX;
- int i, idlest_cpu = this_cpu;
+ /*
+ * Find the lowest common scheduling domain covering the nodes of both
+ * the CPU the task is currently running on and the target NUMA node.
+ */
+ rcu_read_lock();
+ for_each_domain(src_cpu, sd) {
+ if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
+ /*
+ * busy_idx is used for the load decision as it is the
+ * same index used by the regular load balancer for an
+ * active cpu.
+ */
+ idx = sd->busy_idx;
+ imbalance_pct = sd->imbalance_pct;
+ break;
+ }
+ }
+ rcu_read_unlock();
- BUG_ON(cpu_to_node(this_cpu) == nid);
+ if (WARN_ON_ONCE(idx == -1))
+ return src_cpu;
- rcu_read_lock();
- for_each_cpu(i, cpumask_of_node(nid)) {
- load = weighted_cpuload(i);
+ /*
+ * XXX the below is mostly nicked from wake_affine(); we should
+ * see about sharing a bit if at all possible; also it might want
+ * some per entity weight love.
+ */
+ weight = p->se.load.weight;
- if (load < min_load) {
- min_load = load;
- idlest_cpu = i;
+ src_load = source_load(src_cpu, idx);
+
+ src_eff_load = 100 + (imbalance_pct - 100) / 2;
+ src_eff_load *= power_of(src_cpu);
+ src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
+
+ for_each_cpu(cpu, cpumask_of_node(nid)) {
+ dst_load = target_load(cpu, idx);
+
+ /* If the CPU is idle, use it */
+ if (!dst_load)
+ return dst_cpu;
+
+ /* Otherwise check the target CPU load */
+ dst_eff_load = 100;
+ dst_eff_load *= power_of(cpu);
+ dst_eff_load *= dst_load + effective_load(tg, cpu, weight, weight);
+
+ /*
+ * Destination is considered balanced if the destination CPU is
+ * less loaded than the source CPU. Unfortunately there is a
+ * risk that a task running on a lightly loaded CPU will not
+ * migrate to its preferred node due to load imbalances.
+ */
+ balanced = (dst_eff_load <= src_eff_load);
+ if (!balanced)
+ continue;
+
+ if (dst_load < min_load) {
+ min_load = dst_load;
+ dst_cpu = cpu;
}
}
- rcu_read_unlock();
- return idlest_cpu;
+ return dst_cpu;
}
static void task_numa_placement(struct task_struct *p)
@@ -916,14 +979,12 @@ static void task_numa_placement(struct task_struct *p)
int old_migrate_seq = p->numa_migrate_seq;
/*
- * If the task is not on the preferred node then find the most
- * idle CPU to migrate to.
+ * If the task is not on the preferred node then find
+ * a suitable CPU to migrate to.
*/
preferred_cpu = task_cpu(p);
- if (cpu_to_node(preferred_cpu) != max_nid) {
- preferred_cpu = find_idlest_cpu_node(preferred_cpu,
- max_nid);
- }
+ if (cpu_to_node(preferred_cpu) != max_nid)
+ preferred_cpu = task_numa_find_cpu(p, max_nid);
/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
@@ -3238,7 +3299,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
}
#else
-static inline unsigned long effective_load(struct task_group *tg, int cpu,
+static unsigned long effective_load(struct task_group *tg, int cpu,
unsigned long wl, unsigned long wg)
{
return wl;
--
1.8.1.4
Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.
To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that, in the general case, the private
accesses are more important for cpu->memory locality. Also,
no new infrastructure is required to treat private pages properly but
interleaving for shared pages requires additional infrastructure.
To detect private accesses the pid of the last accessing task is required
but the storage requirements are high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults were measured. Shared
faults are not taken into consideration for a few reasons.
First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will be compute overloaded and then
scheduled away later only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way accounting for shared faults the same as
private faults can result in lower performance overall.
The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and be selected
as a preferred node even though it's the wrong decision.
The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.
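To make the encoding concrete, the following standalone sketch mirrors the
helpers introduced below, using the 8 pid bits from the patch and an assumed
4-bit node field standing in for NODES_SHIFT. It also shows the kind of pid
collision that the scheme tolerates:

    #include <stdio.h>

    #define PID_BITS    8
    #define PID_MASK    ((1 << PID_BITS) - 1)
    #define NID_BITS    4                       /* assumed, stands in for NODES_SHIFT */
    #define NID_MASK    ((1 << NID_BITS) - 1)

    static int nid_pid_to_nidpid(int nid, int pid)
    {
            return ((nid & NID_MASK) << PID_BITS) | (pid & PID_MASK);
    }

    static int nidpid_to_pid(int nidpid)
    {
            return nidpid & PID_MASK;
    }

    static int nidpid_to_nid(int nidpid)
    {
            return (nidpid >> PID_BITS) & NID_MASK;
    }

    int main(void)
    {
            int stored = nid_pid_to_nidpid(2, 12345);   /* node 2, pid 12345 */
            int other_pid = 12345 + 256;                /* different task, same low 8 bits */

            /* A fault is treated as private when the stored pid bits match the
             * faulting task; the collision below is misclassified as private. */
            printf("nid=%d pidbits=%d private=%d\n",
                   nidpid_to_nid(stored), nidpid_to_pid(stored),
                   (other_pid & PID_MASK) == nidpid_to_pid(stored));
            return 0;
    }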
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm.h | 69 ++++++++++++++++++++++++++-------------
include/linux/mm_types.h | 4 +--
include/linux/page-flags-layout.h | 28 +++++++++-------
kernel/sched/fair.c | 12 +++++--
mm/huge_memory.c | 10 +++---
mm/memory.c | 16 ++++-----
mm/mempolicy.c | 8 +++--
mm/migrate.c | 4 +--
mm/mm_init.c | 18 +++++-----
mm/mmzone.c | 12 +++----
mm/mprotect.c | 24 +++++++++-----
mm/page_alloc.c | 4 +--
12 files changed, 128 insertions(+), 81 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e2091b8..93f9feb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -582,11 +582,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
* sets it, so none of the operations on it need to be atomic.
*/
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NID_PGOFF (ZONES_PGOFF - LAST_NID_WIDTH)
+#define LAST_NIDPID_PGOFF (ZONES_PGOFF - LAST_NIDPID_WIDTH)
/*
* Define the bit shifts to access each section. For non-existent
@@ -596,7 +596,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NID_PGSHIFT (LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
+#define LAST_NIDPID_PGSHIFT (LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
#ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -618,7 +618,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NID_MASK ((1UL << LAST_NID_WIDTH) - 1)
+#define LAST_NIDPID_MASK ((1UL << LAST_NIDPID_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)
static inline enum zone_type page_zonenum(const struct page *page)
@@ -662,48 +662,73 @@ static inline int page_to_nid(const struct page *page)
#endif
#ifdef CONFIG_NUMA_BALANCING
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int nid_pid_to_nidpid(int nid, int pid)
{
- return xchg(&page->_last_nid, nid);
+ return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
}
-static inline int page_nid_last(struct page *page)
+static inline int nidpid_to_pid(int nidpid)
{
- return page->_last_nid;
+ return nidpid & LAST__PID_MASK;
}
-static inline void page_nid_reset_last(struct page *page)
+
+static inline int nidpid_to_nid(int nidpid)
+{
+ return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+ return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+}
+
+static inline bool nidpid_nid_unset(int nidpid)
+{
+ return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+}
+
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+static inline int page_nidpid_xchg_last(struct page *page, int nid)
+{
+ return xchg(&page->_last_nidpid, nid);
+}
+
+static inline int page_nidpid_last(struct page *page)
+{
+ return page->_last_nidpid;
+}
+static inline void page_nidpid_reset_last(struct page *page)
{
- page->_last_nid = -1;
+ page->_last_nidpid = -1;
}
#else
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
{
- return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+ return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
}
-extern int page_nid_xchg_last(struct page *page, int nid);
+extern int page_nidpid_xchg_last(struct page *page, int nidpid);
-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
{
- int nid = (1 << LAST_NID_SHIFT) - 1;
+ int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
- page->flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
- page->flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+ page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+ page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
}
-#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
#else
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
{
return page_to_nid(page);
}
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
{
return page_to_nid(page);
}
-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
{
}
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index de70964..4137f67 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
void *shadow;
#endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
- int _last_nid;
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+ int _last_nidpid;
#endif
}
/*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 93506a1..02bc918 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -38,10 +38,10 @@
* The last is when there is insufficient space in page->flags and a separate
* lookup is necessary.
*
- * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
- * " plus space for last_nid: | NODE | ZONE | LAST_NID ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * " plus space for last_nid: | SECTION | NODE | ZONE | LAST_NID ... | FLAGS |
+ * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_nidpid: | NODE | ZONE | LAST_NIDPID ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
* classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
*/
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -62,15 +62,21 @@
#endif
#ifdef CONFIG_NUMA_BALANCING
-#define LAST_NID_SHIFT NODES_SHIFT
+#define LAST__PID_SHIFT 8
+#define LAST__PID_MASK ((1 << LAST__PID_SHIFT)-1)
+
+#define LAST__NID_SHIFT NODES_SHIFT
+#define LAST__NID_MASK ((1 << LAST__NID_SHIFT)-1)
+
+#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
#else
-#define LAST_NID_SHIFT 0
+#define LAST_NIDPID_SHIFT 0
#endif
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NID_WIDTH LAST_NID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
#else
-#define LAST_NID_WIDTH 0
+#define LAST_NIDPID_WIDTH 0
#endif
/*
@@ -81,8 +87,8 @@
#define NODE_NOT_IN_PAGE_FLAGS
#endif
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NID_WIDTH == 0
-#define LAST_NID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
+#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
#endif
#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b43122c..3f0519c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -945,7 +945,7 @@ static void task_numa_placement(struct task_struct *p)
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int last_nid, int node, int pages, bool migrated)
+void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
{
struct task_struct *p = current;
int priv;
@@ -957,8 +957,14 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
if (!p->mm)
return;
- /* For now, do not attempt to detect private/shared accesses */
- priv = 1;
+ /*
+ * First accesses are treated as private, otherwise consider accesses
+ * to be private if the accessing pid has not changed
+ */
+ if (!nidpid_pid_unset(last_nidpid))
+ priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+ else
+ priv = 1;
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9462591..c7f79dd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
- int target_nid, last_nid;
+ int target_nid, last_nidpid;
int src_nid = -1;
bool migrated;
@@ -1316,7 +1316,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (src_nid == page_to_nid(page))
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
target_nid = mpol_misplaced(page, vma, haddr);
if (target_nid == -1) {
put_page(page);
@@ -1342,7 +1342,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (!migrated)
goto check_same;
- task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
+ task_numa_fault(last_nidpid, target_nid, HPAGE_PMD_NR, true);
return 0;
check_same:
@@ -1357,7 +1357,7 @@ clear_pmdnuma:
out_unlock:
spin_unlock(&mm->page_table_lock);
if (src_nid != -1)
- task_numa_fault(last_nid, src_nid, HPAGE_PMD_NR, false);
+ task_numa_fault(last_nidpid, src_nid, HPAGE_PMD_NR, false);
return 0;
}
@@ -1649,7 +1649,7 @@ static void __split_huge_page_refcount(struct page *page)
page_tail->mapping = page->mapping;
page_tail->index = page->index + i;
- page_nid_xchg_last(page_tail, page_nid_last(page));
+ page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index 62ae8a7..374ffa4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@
#include "internal.h"
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nid.
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
#endif
#ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page = NULL;
spinlock_t *ptl;
- int current_nid = -1, last_nid;
+ int current_nid = -1, last_nidpid;
int target_nid;
bool migrated = false;
@@ -3571,7 +3571,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
current_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, current_nid);
pte_unmap_unlock(ptep, ptl);
@@ -3592,7 +3592,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
out:
if (current_nid != -1)
- task_numa_fault(last_nid, current_nid, 1, migrated);
+ task_numa_fault(last_nidpid, current_nid, 1, migrated);
return 0;
}
@@ -3608,7 +3608,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spinlock_t *ptl;
bool numa = false;
int local_nid = numa_node_id();
- int last_nid;
+ int last_nidpid;
spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3658,7 +3658,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* migrated to.
*/
curr_nid = local_nid;
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
target_nid = numa_migrate_prep(page, vma, addr,
page_to_nid(page));
if (target_nid == -1) {
@@ -3671,7 +3671,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
curr_nid = target_nid;
- task_numa_fault(last_nid, curr_nid, 1, migrated);
+ task_numa_fault(last_nidpid, curr_nid, 1, migrated);
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 7431001..4669000 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2288,9 +2288,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
/* Migrate the page towards the node whose CPU is referencing it */
if (pol->flags & MPOL_F_MORON) {
- int last_nid;
+ int last_nidpid;
+ int this_nidpid;
polnid = numa_node_id();
+ this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
/*
* Multi-stage node selection is used in conjunction
@@ -2313,8 +2315,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
* it less likely we act on an unlikely task<->page
* relation.
*/
- last_nid = page_nid_xchg_last(page, polnid);
- if (last_nid != polnid)
+ last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
+ if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
goto out;
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 23f8122..01d653d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1478,7 +1478,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
__GFP_NOWARN) &
~GFP_IOFS, 0);
if (newpage)
- page_nid_xchg_last(newpage, page_nid_last(page));
+ page_nidpid_xchg_last(newpage, page_nidpid_last(page));
return newpage;
}
@@ -1655,7 +1655,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
if (!new_page)
goto out_fail;
- page_nid_xchg_last(new_page, page_nid_last(page));
+ page_nidpid_xchg_last(new_page, page_nidpid_last(page));
isolated = numamigrate_isolate_page(pgdat, page);
if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index c280a02..eecdc64 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -69,26 +69,26 @@ void __init mminit_verify_pageflags_layout(void)
unsigned long or_mask, add_mask;
shift = 8 * sizeof(unsigned long);
- width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NID_SHIFT;
+ width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
- "Section %d Node %d Zone %d Lastnid %d Flags %d\n",
+ "Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
SECTIONS_WIDTH,
NODES_WIDTH,
ZONES_WIDTH,
- LAST_NID_WIDTH,
+ LAST_NIDPID_WIDTH,
NR_PAGEFLAGS);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
- "Section %d Node %d Zone %d Lastnid %d\n",
+ "Section %d Node %d Zone %d Lastnidpid %d\n",
SECTIONS_SHIFT,
NODES_SHIFT,
ZONES_SHIFT,
- LAST_NID_SHIFT);
+ LAST_NIDPID_SHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
- "Section %lu Node %lu Zone %lu Lastnid %lu\n",
+ "Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
(unsigned long)SECTIONS_PGSHIFT,
(unsigned long)NODES_PGSHIFT,
(unsigned long)ZONES_PGSHIFT,
- (unsigned long)LAST_NID_PGSHIFT);
+ (unsigned long)LAST_NIDPID_PGSHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
"Node/Zone ID: %lu -> %lu\n",
(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -100,9 +100,9 @@ void __init mminit_verify_pageflags_layout(void)
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
"Node not in page flags");
#endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
- "Last nid not in page flags");
+ "Last nidpid not in page flags");
#endif
if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 2ac0afb..89b3b7e 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -98,19 +98,19 @@ void lruvec_init(struct lruvec *lruvec)
}
#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NID_NOT_IN_PAGE_FLAGS)
-int page_nid_xchg_last(struct page *page, int nid)
+int page_nidpid_xchg_last(struct page *page, int nidpid)
{
unsigned long old_flags, flags;
- int last_nid;
+ int last_nidpid;
do {
old_flags = flags = page->flags;
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
- flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
- flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+ flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+ flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
- return last_nid;
+ return last_nidpid;
}
#endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index cacc64a..04c9469 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,15 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
- int dirty_accountable, int prot_numa, bool *ret_all_same_node)
+ int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte, oldpte;
spinlock_t *ptl;
unsigned long pages = 0;
- bool all_same_node = true;
+ bool all_same_nidpid = true;
int last_nid = -1;
+ int last_pid = -1;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -64,10 +65,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
page = vm_normal_page(vma, addr, oldpte);
if (page) {
int this_nid = page_to_nid(page);
+ int nidpid = page_nidpid_last(page);
+ int this_pid = nidpid_to_pid(nidpid);
+
if (last_nid == -1)
last_nid = this_nid;
- if (last_nid != this_nid)
- all_same_node = false;
+ if (last_pid == -1)
+ last_pid = this_pid;
+ if (last_nid != this_nid ||
+ last_pid != this_pid) {
+ all_same_nidpid = false;
+ }
if (!pte_numa(oldpte)) {
ptent = pte_mknuma(ptent);
@@ -106,7 +114,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
- *ret_all_same_node = all_same_node;
+ *ret_all_same_nidpid = all_same_nidpid;
return pages;
}
@@ -133,7 +141,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pmd_t *pmd;
unsigned long next;
unsigned long pages = 0;
- bool all_same_node;
+ bool all_same_nidpid;
pmd = pmd_offset(pud, addr);
do {
@@ -151,7 +159,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (pmd_none_or_clear_bad(pmd))
continue;
pages += change_pte_range(vma, pmd, addr, next, newprot,
- dirty_accountable, prot_numa, &all_same_node);
+ dirty_accountable, prot_numa, &all_same_nidpid);
/*
* If we are changing protections for NUMA hinting faults then
@@ -159,7 +167,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
* node. This allows a regular PMD to be handled as one fault
* and effectively batches the taking of the PTL
*/
- if (prot_numa && all_same_node)
+ if (prot_numa && all_same_nidpid)
change_pmd_protnuma(vma->vm_mm, addr, pmd);
} while (pmd++, addr = next, addr != end);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..f7c9c0f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -613,7 +613,7 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
- page_nid_reset_last(page);
+ page_nidpid_reset_last(page);
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
@@ -3910,7 +3910,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
- page_nid_reset_last(page);
+ page_nidpid_reset_last(page);
SetPageReserved(page);
/*
* Mark the block movable so that blocks are reserved for
--
1.8.1.4
Currently automatic NUMA balancing is unable to distinguish between falsely
shared versus private pages except by ignoring pages with an elevated
page_mapcount entirely. This avoids shared pages bouncing between the
nodes whose tasks are using them, but it also ignores quite a lot of data.
This patch kicks away the training wheels in preparation for adding support
for identifying shared/private pages; the necessary infrastructure is now
in place. The ordering is deliberate so that the impact of the shared/private
detection can be easily measured. Note that the patch does not migrate
shared, file-backed pages within VMAs marked VM_EXEC as these are generally
shared library pages. Migrating such pages is not beneficial as there is an
expectation that they are read-shared between caches and that iTLB and iCache
pressure is generally low.
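
For illustration only, the effect on the migrate_misplaced_page() filter can
be sketched as follows; this mirrors the mm/migrate.c hunk below rather than
adding anything new and is not compilable on its own:

	/* Before: refuse to migrate any page mapped by more than one process. */
	if (page_mapcount(page) != 1)
		goto out;

	/* After: only refuse shared, file-backed pages in executable VMAs,
	 * which are almost certainly shared library text. */
	if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
	    (vma->vm_flags & VM_EXEC))
		goto out;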
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/migrate.h | 7 ++++---
mm/memory.c | 7 ++-----
mm/migrate.c | 17 ++++++-----------
mm/mprotect.c | 4 +---
4 files changed, 13 insertions(+), 22 deletions(-)
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..e7e26af 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -92,11 +92,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
#endif /* CONFIG_MIGRATION */
#ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+ struct vm_area_struct *vma, int node);
extern bool migrate_ratelimited(int node);
#else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+ struct vm_area_struct *vma, int node)
{
return -EAGAIN; /* can't migrate now */
}
diff --git a/mm/memory.c b/mm/memory.c
index ab933be..62ae8a7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3586,7 +3586,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
/* Migrate to the requested node */
- migrated = migrate_misplaced_page(page, target_nid);
+ migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
current_nid = target_nid;
@@ -3651,9 +3651,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = vm_normal_page(vma, addr, pteval);
if (unlikely(!page))
continue;
- /* only check non-shared pages */
- if (unlikely(page_mapcount(page) != 1))
- continue;
/*
* Note that the NUMA fault is later accounted to either
@@ -3671,7 +3668,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Migrate to the requested node */
pte_unmap_unlock(pte, ptl);
- migrated = migrate_misplaced_page(page, target_nid);
+ migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
curr_nid = target_nid;
task_numa_fault(last_nid, curr_nid, 1, migrated);
diff --git a/mm/migrate.c b/mm/migrate.c
index 3bbaf5d..23f8122 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1579,7 +1579,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
* node. Caller is expected to have an elevated reference count on
* the page that will be dropped by this function before returning.
*/
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+ int node)
{
pg_data_t *pgdat = NODE_DATA(node);
int isolated;
@@ -1587,10 +1588,11 @@ int migrate_misplaced_page(struct page *page, int node)
LIST_HEAD(migratepages);
/*
- * Don't migrate pages that are mapped in multiple processes.
- * TODO: Handle false sharing detection instead of this hammer
+ * Don't migrate file pages that are mapped in multiple processes
+ * with execute permissions as they are probably shared libraries.
*/
- if (page_mapcount(page) != 1)
+ if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+ (vma->vm_flags & VM_EXEC))
goto out;
/*
@@ -1641,13 +1643,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
int page_lru = page_is_file_cache(page);
/*
- * Don't migrate pages that are mapped in multiple processes.
- * TODO: Handle false sharing detection instead of this hammer
- */
- if (page_mapcount(page) != 1)
- goto out_dropref;
-
- /*
* Rate-limit the amount of data that is being migrated to a node.
* Optimal placement is no good if the memory bus is saturated and
* all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..cacc64a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (last_nid != this_nid)
all_same_node = false;
- /* only check non-shared pages */
- if (!pte_numa(oldpte) &&
- page_mapcount(page) == 1) {
+ if (!pte_numa(oldpte)) {
ptent = pte_mknuma(ptent);
updated = true;
}
--
1.8.1.4
task_numa_placement checks current->mm, but only after the buffers for
tracking faults have already been uselessly allocated. Move the check
earlier, into task_numa_fault.
[[email protected]: Identified the problem]
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9002a4a..022a04c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -834,8 +834,6 @@ static void task_numa_placement(struct task_struct *p)
int seq, nid, max_nid = -1;
unsigned long max_faults = 0;
- if (!p->mm) /* for example, ksmd faulting in a user's mm */
- return;
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
return;
@@ -912,6 +910,10 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
if (!sched_feat_numa(NUMA))
return;
+ /* for example, ksmd faulting in a user's mm */
+ if (!p->mm)
+ return;
+
/* For now, do not attempt to detect private/shared accesses */
priv = 1;
--
1.8.1.4
The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size. This scan rate is independent of the size
of the task and as an aside it is further complicated by the fact that
numa_balancing_scan_size controls how many pages are marked pte_numa and
not how much virtual memory is scanned.
In combination, it is almost impossible to meaningfully tune the min and
max scan periods, and reasoning about performance is complex when the time
to complete a full scan is partially a function of the task's memory
size. This patch alters the semantics of the min and max tunables so that
they tune the length of time it takes to complete a scan of a task's virtual
address space. Conceptually this is a lot easier to understand. There is a
"sanity" check to ensure the scan rate is never extremely fast, based on the
amount of virtual memory that should be scanned in a second. The default
of 2.5G seems arbitrary but it was chosen so that the maximum scan rate after
the patch roughly matches the maximum scan rate before the patch was applied.
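
As a worked example of the new calculation (plain userspace arithmetic, with
a 4G task size assumed purely for illustration; the real logic is in
task_nr_scan_windows() and task_scan_min() below):

	#include <stdio.h>

	int main(void)
	{
		unsigned int scan_size_mb = 256;	/* numa_balancing_scan_size_mb */
		unsigned int period_min_ms = 1000;	/* new scan_period_min default */
		unsigned int max_scan_mb_sec = 2560;	/* MAX_SCAN_WINDOW sanity cap */
		unsigned long task_vm_mb = 4096;	/* assumed task virtual memory */

		/* Number of scan windows needed to cover the address space */
		unsigned int windows = (task_vm_mb + scan_size_mb - 1) / scan_size_mb;
		/* Floor so no task scans faster than max_scan_mb_sec */
		unsigned int floor = 1000 / (max_scan_mb_sec / scan_size_mb);
		unsigned int scan = period_min_ms / windows;
		unsigned int period = scan > floor ? scan : floor;

		printf("windows=%u floor=%ums per-window period=%ums\n",
		       windows, floor, period);
		return 0;
	}

For the 4G example this gives 16 windows and a per-window period of 100ms,
i.e. the 2.5G/sec floor is what limits the scan rate.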
Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/sysctl/kernel.txt | 11 ++++---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 72 +++++++++++++++++++++++++++++++++++------
3 files changed, 70 insertions(+), 14 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a275042..f38d4f4 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -401,15 +401,16 @@ workload pattern changes and minimises performance impact due to remote
memory accesses. These sysctls control the thresholds for scan delays and
the number of pages scanned.
-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.
numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
when it initially forks.
-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.
numa_balancing_scan_size_mb is how many megabytes worth of pages are
scanned for a given scan.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index b81195e..d44fbc6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1504,6 +1504,7 @@ struct task_struct {
int numa_scan_seq;
int numa_migrate_seq;
unsigned int numa_scan_period;
+ unsigned int numa_scan_period_max;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 022a04c..8a392c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -779,10 +779,12 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
#ifdef CONFIG_NUMA_BALANCING
/*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the tasks virtual memory size and
+ * numa_balancing_scan_size.
*/
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*600;
+unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_max = 600000;
/* Portion of address space to scan in MB */
unsigned int sysctl_numa_balancing_scan_size = 256;
@@ -790,6 +792,46 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;
+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+ unsigned long nr_vm_pages = 0;
+ unsigned long nr_scan_pages;
+
+ nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+ nr_vm_pages = p->mm->total_vm;
+ if (!nr_vm_pages)
+ nr_vm_pages = nr_scan_pages;
+
+ nr_vm_pages = round_up(nr_vm_pages, nr_scan_pages);
+ return nr_vm_pages / nr_scan_pages;
+}
+
+/* For sanitys sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */
+#define MAX_SCAN_WINDOW 2560
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+ unsigned int scan, floor;
+ unsigned int windows = 1;
+
+ if (sysctl_numa_balancing_scan_size < MAX_SCAN_WINDOW)
+ windows = MAX_SCAN_WINDOW / sysctl_numa_balancing_scan_size;
+ floor = 1000 / windows;
+
+ scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+ return max_t(unsigned int, floor, scan);
+}
+
+static unsigned int task_scan_max(struct task_struct *p)
+{
+ unsigned int smin = task_scan_min(p);
+ unsigned int smax;
+
+ /* Watch for min being lower than max due to floor calculations */
+ smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+ return max(smin, smax);
+}
+
/*
* Once a preferred node is selected the scheduler balancer will prefer moving
* a task to that node for sysctl_numa_balancing_settle_count number of PTE
@@ -839,6 +881,7 @@ static void task_numa_placement(struct task_struct *p)
return;
p->numa_scan_seq = seq;
p->numa_migrate_seq++;
+ p->numa_scan_period_max = task_scan_max(p);
/* Find the node with the highest number of faults */
for (nid = 0; nid < nr_node_ids; nid++) {
@@ -894,7 +937,7 @@ static void task_numa_placement(struct task_struct *p)
*/
if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
p->numa_scan_period = max(p->numa_scan_period >> 1,
- sysctl_numa_balancing_scan_period_min);
+ task_scan_min(p));
}
}
}
@@ -935,7 +978,7 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
* This is reset periodically in case of phase changes
*/
if (!migrated)
- p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
+ p->numa_scan_period = min(p->numa_scan_period_max,
p->numa_scan_period + jiffies_to_msecs(10));
task_numa_placement(p);
@@ -961,6 +1004,7 @@ void task_numa_work(struct callback_head *work)
struct mm_struct *mm = p->mm;
struct vm_area_struct *vma;
unsigned long start, end;
+ unsigned long nr_pte_updates = 0;
long pages;
WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -1002,8 +1046,10 @@ void task_numa_work(struct callback_head *work)
if (time_before(now, migrate))
return;
- if (p->numa_scan_period == 0)
- p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+ if (p->numa_scan_period == 0) {
+ p->numa_scan_period_max = task_scan_max(p);
+ p->numa_scan_period = task_scan_min(p);
+ }
next_scan = now + msecs_to_jiffies(p->numa_scan_period);
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
@@ -1042,7 +1088,15 @@ void task_numa_work(struct callback_head *work)
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
end = min(end, vma->vm_end);
- pages -= change_prot_numa(vma, start, end);
+ nr_pte_updates += change_prot_numa(vma, start, end);
+
+ /*
+ * Scan sysctl_numa_balancing_scan_size but ensure that
+ * at least one PTE is updated so that unused virtual
+ * address space is quickly skipped.
+ */
+ if (nr_pte_updates)
+ pages -= (end - start) >> PAGE_SHIFT;
start = end;
if (pages <= 0)
@@ -1089,7 +1143,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
if (now - curr->node_stamp > period) {
if (!curr->node_stamp)
- curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+ curr->numa_scan_period = task_scan_min(curr);
curr->node_stamp = now;
if (!time_before(jiffies, curr->mm->numa_next_scan)) {
--
1.8.1.4
A preferred node is selected based on the node on which the most NUMA
hinting faults were incurred. There is no guarantee that the task is running
on that node at the time, so this patch reschedules the task to run on
the most idle CPU of the selected node as soon as the node is chosen. This
avoids waiting for the load balancer to make a decision.
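
As a minimal sketch of the flow added to task_numa_placement() (this mirrors
the kernel/sched/fair.c hunk below):

	if (max_faults && max_nid != p->numa_preferred_nid) {
		int preferred_cpu = task_cpu(p);

		/* Pick the least loaded CPU on the new preferred node */
		if (cpu_to_node(preferred_cpu) != max_nid)
			preferred_cpu = find_idlest_cpu_node(preferred_cpu, max_nid);

		p->numa_preferred_nid = max_nid;
		p->numa_migrate_seq = 0;

		/* Move now instead of waiting for the load balancer */
		migrate_task_to(p, preferred_cpu);
	}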
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/core.c | 17 +++++++++++++++++
kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 1 +
3 files changed, 63 insertions(+), 1 deletion(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5e02507..b67a102 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4856,6 +4856,23 @@ fail:
return ret;
}
+#ifdef CONFIG_NUMA_BALANCING
+/* Migrate current task p to target_cpu */
+int migrate_task_to(struct task_struct *p, int target_cpu)
+{
+ struct migration_arg arg = { p, target_cpu };
+ int curr_cpu = task_cpu(p);
+
+ if (curr_cpu == target_cpu)
+ return 0;
+
+ if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
+ return -EINVAL;
+
+ return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
/*
* migration_cpu_stop - this will be executed by a highprio stopper thread
* and performs thread migration by bumping thread off CPU then
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 49396e1..f68fad5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -800,6 +800,31 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
*/
unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+static unsigned long weighted_cpuload(const int cpu);
+
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+ unsigned long load, min_load = ULONG_MAX;
+ int i, idlest_cpu = this_cpu;
+
+ BUG_ON(cpu_to_node(this_cpu) == nid);
+
+ rcu_read_lock();
+ for_each_cpu(i, cpumask_of_node(nid)) {
+ load = weighted_cpuload(i);
+
+ if (load < min_load) {
+ min_load = load;
+ idlest_cpu = i;
+ }
+ }
+ rcu_read_unlock();
+
+ return idlest_cpu;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -829,10 +854,29 @@ static void task_numa_placement(struct task_struct *p)
}
}
- /* Update the tasks preferred node if necessary */
+ /*
+ * Record the preferred node as the node with the most faults,
+ * requeue the task to be running on the idlest CPU on the
+ * preferred node and reset the scanning rate to recheck
+ * the working set placement.
+ */
if (max_faults && max_nid != p->numa_preferred_nid) {
+ int preferred_cpu;
+
+ /*
+ * If the task is not on the preferred node then find the most
+ * idle CPU to migrate to.
+ */
+ preferred_cpu = task_cpu(p);
+ if (cpu_to_node(preferred_cpu) != max_nid) {
+ preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+ max_nid);
+ }
+
+ /* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 0;
+ migrate_task_to(p, preferred_cpu);
}
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c5f773d..795346d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -504,6 +504,7 @@ DECLARE_PER_CPU(struct rq, runqueues);
#define raw_rq() (&__raw_get_cpu_var(runqueues))
#ifdef CONFIG_NUMA_BALANCING
+extern int migrate_task_to(struct task_struct *p, int cpu);
static inline void task_numa_free(struct task_struct *p)
{
kfree(p->numa_faults);
--
1.8.1.4
Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared. This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.
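
As a minimal sketch of the resulting layout (nothing beyond what the fair.c
hunk below implements):

	/*
	 * One allocation backs both arrays and each node gets two counters,
	 * indexed as [shared, private]:
	 *
	 *   numa_faults:        nid0[shared][private] nid1[shared][private] ...
	 *   numa_faults_buffer: same layout, at numa_faults + 2 * nr_node_ids
	 */
	static inline int task_faults_idx(int nid, int priv)
	{
		return 2 * nid + priv;	/* priv is 1 for private, 0 for shared */
	}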
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 5 +++--
kernel/sched/fair.c | 33 ++++++++++++++++++++++++---------
mm/huge_memory.c | 7 ++++---
mm/memory.c | 9 ++++++---
4 files changed, 37 insertions(+), 17 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 82a6136..b81195e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1600,10 +1600,11 @@ struct task_struct {
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
#ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
extern void set_numabalancing_state(bool enabled);
#else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages,
+ bool migrated)
{
}
static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f68fad5..9590fcd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -800,6 +800,11 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
*/
unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+static inline int task_faults_idx(int nid, int priv)
+{
+ return 2 * nid + priv;
+}
+
static unsigned long weighted_cpuload(const int cpu);
@@ -841,13 +846,19 @@ static void task_numa_placement(struct task_struct *p)
/* Find the node with the highest number of faults */
for (nid = 0; nid < nr_node_ids; nid++) {
unsigned long faults;
+ int priv, i;
- /* Decay existing window and copy faults since last scan */
- p->numa_faults[nid] >>= 1;
- p->numa_faults[nid] += p->numa_faults_buffer[nid];
- p->numa_faults_buffer[nid] = 0;
+ for (priv = 0; priv < 2; priv++) {
+ i = task_faults_idx(nid, priv);
- faults = p->numa_faults[nid];
+ /* Decay existing window, copy faults since last scan */
+ p->numa_faults[i] >>= 1;
+ p->numa_faults[i] += p->numa_faults_buffer[i];
+ p->numa_faults_buffer[i] = 0;
+ }
+
+ /* Find maximum private faults */
+ faults = p->numa_faults[task_faults_idx(nid, 1)];
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
@@ -883,16 +894,20 @@ static void task_numa_placement(struct task_struct *p)
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
{
struct task_struct *p = current;
+ int priv;
if (!sched_feat_numa(NUMA))
return;
+ /* For now, do not attempt to detect private/shared accesses */
+ priv = 1;
+
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * nr_node_ids;
+ int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
@@ -900,7 +915,7 @@ void task_numa_fault(int node, int pages, bool migrated)
return;
BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+ p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
}
/*
@@ -914,7 +929,7 @@ void task_numa_fault(int node, int pages, bool migrated)
task_numa_placement(p);
/* Record the fault, double the weight if pages were migrated */
- p->numa_faults_buffer[node] += pages << migrated;
+ p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
}
static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ec938ed..9462591 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
- int target_nid;
+ int target_nid, last_nid;
int src_nid = -1;
bool migrated;
@@ -1316,6 +1316,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (src_nid == page_to_nid(page))
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
+ last_nid = page_nid_last(page);
target_nid = mpol_misplaced(page, vma, haddr);
if (target_nid == -1) {
put_page(page);
@@ -1341,7 +1342,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (!migrated)
goto check_same;
- task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+ task_numa_fault(last_nid, target_nid, HPAGE_PMD_NR, true);
return 0;
check_same:
@@ -1356,7 +1357,7 @@ clear_pmdnuma:
out_unlock:
spin_unlock(&mm->page_table_lock);
if (src_nid != -1)
- task_numa_fault(src_nid, HPAGE_PMD_NR, false);
+ task_numa_fault(last_nid, src_nid, HPAGE_PMD_NR, false);
return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index 6c6f6b0..ab933be 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3536,7 +3536,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page = NULL;
spinlock_t *ptl;
- int current_nid = -1;
+ int current_nid = -1, last_nid;
int target_nid;
bool migrated = false;
@@ -3571,6 +3571,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}
+ last_nid = page_nid_last(page);
current_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, current_nid);
pte_unmap_unlock(ptep, ptl);
@@ -3591,7 +3592,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
out:
if (current_nid != -1)
- task_numa_fault(current_nid, 1, migrated);
+ task_numa_fault(last_nid, current_nid, 1, migrated);
return 0;
}
@@ -3607,6 +3608,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spinlock_t *ptl;
bool numa = false;
int local_nid = numa_node_id();
+ int last_nid;
spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3659,6 +3661,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* migrated to.
*/
curr_nid = local_nid;
+ last_nid = page_nid_last(page);
target_nid = numa_migrate_prep(page, vma, addr,
page_to_nid(page));
if (target_nid == -1) {
@@ -3671,7 +3674,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
migrated = migrate_misplaced_page(page, target_nid);
if (migrated)
curr_nid = target_nid;
- task_numa_fault(curr_nid, 1, migrated);
+ task_numa_fault(last_nid, curr_nid, 1, migrated);
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
--
1.8.1.4
Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccd4258..0fe678c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,72 @@ utilize.
==============================================================
+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running. Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases. The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases. The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a tasks scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
osrelease, ostype & version:
# cat osrelease
--
1.8.1.4
This patch favours moving tasks towards the preferred NUMA node when it
has just been selected. Ideally this is self-reinforcing as the longer
the task runs on that node, the more faults it should incur, causing
task_numa_placement to keep the task running on that node. In reality a
big weakness is that the node's CPUs can be overloaded and it would be more
efficient to queue tasks on an idle node and migrate memory to the new node.
That would require additional smarts in the balancer, so for now the balancer
simply prefers to place the task on the preferred node for a number of PTE
scans, which is controlled by the numa_balancing_settle_count sysctl. Once
the settle_count number of scans has completed, the scheduler is free to place
the task on an alternative node if the load is imbalanced.
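
The balancer-side test reduces to something like the following sketch; it
mirrors migrate_improves_locality() in the diff below and omits the SD_NUMA
and src_nid != dst_nid guards for brevity:

	/* Pull the task towards dst_nid only while it is still settling on
	 * a newly selected preferred node and dst_nid is that node. */
	if (p->numa_migrate_seq < sysctl_numa_balancing_settle_count &&
	    p->numa_preferred_nid == dst_nid)
		return true;
	return false;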
[[email protected]: Fixed statistics]
Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/sysctl/kernel.txt | 8 +++++-
include/linux/sched.h | 1 +
kernel/sched/core.c | 3 ++-
kernel/sched/fair.c | 60 ++++++++++++++++++++++++++++++++++++++---
kernel/sysctl.c | 7 +++++
5 files changed, 73 insertions(+), 6 deletions(-)
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index 0fe678c..246b128 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,7 +374,8 @@ feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the numa_balancing_scan_period_min_ms,
numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.
==============================================================
@@ -418,6 +419,11 @@ scanned for a given scan.
numa_balancing_scan_period_reset is a blunt instrument that controls how
often a tasks scan delay is reset to detect sudden changes in task behaviour.
+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
==============================================================
osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 42f9818..82a6136 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ enum cpu_idle_type {
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
+#define SD_NUMA 0x4000 /* cross-node balancing */
extern int __weak arch_sd_sibiling_asym_packing(void);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0bd541c..5e02507 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1591,7 +1591,7 @@ static void __sched_fork(struct task_struct *p)
p->node_stamp = 0ULL;
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
- p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+ p->numa_migrate_seq = 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
@@ -6141,6 +6141,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
| 0*SD_SHARE_PKG_RESOURCES
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
+ | 1*SD_NUMA
| sd_local_flags(level)
,
.last_balance = jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c7dd96..49396e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -791,6 +791,15 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;
+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -802,6 +811,7 @@ static void task_numa_placement(struct task_struct *p)
if (p->numa_scan_seq == seq)
return;
p->numa_scan_seq = seq;
+ p->numa_migrate_seq++;
/* Find the node with the highest number of faults */
for (nid = 0; nid < nr_node_ids; nid++) {
@@ -820,8 +830,10 @@ static void task_numa_placement(struct task_struct *p)
}
/* Update the tasks preferred node if necessary */
- if (max_faults && max_nid != p->numa_preferred_nid)
+ if (max_faults && max_nid != p->numa_preferred_nid) {
p->numa_preferred_nid = max_nid;
+ p->numa_migrate_seq = 0;
+ }
}
/*
@@ -3898,6 +3910,35 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
return delta < (s64)sysctl_sched_migration_cost;
}
+#ifdef CONFIG_NUMA_BALANCING
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+ int src_nid, dst_nid;
+
+ if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+ return false;
+
+ src_nid = cpu_to_node(env->src_cpu);
+ dst_nid = cpu_to_node(env->dst_cpu);
+
+ if (src_nid == dst_nid ||
+ p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+ return false;
+
+ if (p->numa_preferred_nid == dst_nid)
+ return true;
+
+ return false;
+}
+#else
+static inline bool migrate_improves_locality(struct task_struct *p,
+ struct lb_env *env)
+{
+ return false;
+}
+#endif
+
/*
* can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
*/
@@ -3946,11 +3987,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
/*
* Aggressive migration if:
- * 1) task is cache cold, or
- * 2) too many balance attempts have failed.
+ * 1) destination numa is preferred
+ * 2) task is cache cold, or
+ * 3) too many balance attempts have failed.
*/
-
tsk_cache_hot = task_hot(p, env->src_rq->clock_task, env->sd);
+
+ if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+ if (tsk_cache_hot) {
+ schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+ schedstat_inc(p, se.statistics.nr_forced_migrations);
+ }
+#endif
+ return 1;
+ }
+
if (!tsk_cache_hot ||
env->sd->nr_balance_failed > env->sd->cache_nice_tries) {
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index afc1dc6..263486f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -393,6 +393,13 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "numa_balancing_settle_count",
+ .data = &sysctl_numa_balancing_settle_count,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_SCHED_DEBUG */
{
--
1.8.1.4
NUMA hinting faults counts and placement decisions are both recorded in the
same array which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artificially high
count due to very recent faulting activity and may create a bouncing effect.
This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.
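
In code the per-scan fold is simply (a sketch of the fair.c hunk below):

	/* Decay history by half and fold in the window that just completed;
	 * a sample's weight is therefore roughly 2^-k after k further scans. */
	p->numa_faults[nid] >>= 1;
	p->numa_faults[nid] += p->numa_faults_buffer[nid];
	p->numa_faults_buffer[nid] = 0;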
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 13 +++++++++++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 16 +++++++++++++---
3 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ba46a64..42f9818 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1506,7 +1506,20 @@ struct task_struct {
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
+ /*
+ * Exponential decaying average of faults on a per-node basis.
+ * Scheduling placement decisions are made based on these counts.
+ * The values remain static for the duration of a PTE scan
+ */
unsigned long *numa_faults;
+
+ /*
+ * numa_faults_buffer records faults per node during the current
+ * scan window. When the scan completes, the counts in numa_faults
+ * decay and these values are copied.
+ */
+ unsigned long *numa_faults_buffer;
+
int numa_preferred_nid;
#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ed4e785..0bd541c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1596,6 +1596,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
+ p->numa_faults_buffer = NULL;
#endif /* CONFIG_NUMA_BALANCING */
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 731ee9e..8c7dd96 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -805,8 +805,14 @@ static void task_numa_placement(struct task_struct *p)
/* Find the node with the highest number of faults */
for (nid = 0; nid < nr_node_ids; nid++) {
- unsigned long faults = p->numa_faults[nid];
+ unsigned long faults;
+
+ /* Decay existing window and copy faults since last scan */
p->numa_faults[nid] >>= 1;
+ p->numa_faults[nid] += p->numa_faults_buffer[nid];
+ p->numa_faults_buffer[nid] = 0;
+
+ faults = p->numa_faults[nid];
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
@@ -832,9 +838,13 @@ void task_numa_fault(int node, int pages, bool migrated)
if (unlikely(!p->numa_faults)) {
int size = sizeof(*p->numa_faults) * nr_node_ids;
- p->numa_faults = kzalloc(size, GFP_KERNEL);
+ /* numa_faults and numa_faults_buffer share the allocation */
+ p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
if (!p->numa_faults)
return;
+
+ BUG_ON(p->numa_faults_buffer);
+ p->numa_faults_buffer = p->numa_faults + nr_node_ids;
}
/*
@@ -848,7 +858,7 @@ void task_numa_fault(int node, int pages, bool migrated)
task_numa_placement(p);
/* Record the fault, double the weight if pages were migrated */
- p->numa_faults[node] += pages << migrated;
+ p->numa_faults_buffer[node] += pages << migrated;
}
static void reset_ptenuma_scan(struct task_struct *p)
--
1.8.1.4
This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 17 +++++++++++++++--
3 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 72861b4..ba46a64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1507,6 +1507,7 @@ struct task_struct {
struct callback_head numa_work;
unsigned long *numa_faults;
+ int numa_preferred_nid;
#endif /* CONFIG_NUMA_BALANCING */
struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f332ec0..ed4e785 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1593,6 +1593,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+ p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 904fd6f..731ee9e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -793,7 +793,8 @@ unsigned int sysctl_numa_balancing_scan_delay = 1000;
static void task_numa_placement(struct task_struct *p)
{
- int seq;
+ int seq, nid, max_nid = -1;
+ unsigned long max_faults = 0;
if (!p->mm) /* for example, ksmd faulting in a user's mm */
return;
@@ -802,7 +803,19 @@ static void task_numa_placement(struct task_struct *p)
return;
p->numa_scan_seq = seq;
- /* FIXME: Scheduling placement policy hints go here */
+ /* Find the node with the highest number of faults */
+ for (nid = 0; nid < nr_node_ids; nid++) {
+ unsigned long faults = p->numa_faults[nid];
+ p->numa_faults[nid] >>= 1;
+ if (faults > max_faults) {
+ max_faults = faults;
+ max_nid = nid;
+ }
+ }
+
+ /* Update the tasks preferred node if necessary */
+ if (max_faults && max_nid != p->numa_preferred_nid)
+ p->numa_preferred_nid = max_nid;
}
/*
--
1.8.1.4
This patch tracks which nodes NUMA hinting faults were incurred on. Greater
weight is given if the pages had to be migrated, on the understanding
that such faults cost significantly more. If a task has paid the cost of
migrating data to a node then it is preferable that the task does not migrate
the data again unnecessarily in the future. This information is later
used to schedule a task on the node incurring the most NUMA hinting faults.
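
The recording step itself is a one-liner, shown here only to make the
weighting explicit (it matches the fair.c hunk below):

	/* migrated is 0 or 1, so faults that required a migration count double */
	p->numa_faults[node] += pages << migrated;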
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 3 +++
kernel/sched/fair.c | 12 +++++++++++-
kernel/sched/sched.h | 11 +++++++++++
4 files changed, 27 insertions(+), 1 deletion(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..72861b4 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1505,6 +1505,8 @@ struct task_struct {
unsigned int numa_scan_period;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
+
+ unsigned long *numa_faults;
#endif /* CONFIG_NUMA_BALANCING */
struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67d0465..f332ec0 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1594,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_work.next = &p->numa_work;
+ p->numa_faults = NULL;
#endif /* CONFIG_NUMA_BALANCING */
}
@@ -1853,6 +1854,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+ task_numa_free(prev);
+
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a33e59..904fd6f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
if (!sched_feat_numa(NUMA))
return;
- /* FIXME: Allocate task-specific structure for placement policy here */
+ /* Allocate buffer to track faults on a per-node basis */
+ if (unlikely(!p->numa_faults)) {
+ int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+ p->numa_faults = kzalloc(size, GFP_KERNEL);
+ if (!p->numa_faults)
+ return;
+ }
/*
* If pages are properly placed (did not migrate) then scan slower.
@@ -826,6 +833,9 @@ void task_numa_fault(int node, int pages, bool migrated)
p->numa_scan_period + jiffies_to_msecs(10));
task_numa_placement(p);
+
+ /* Record the fault, double the weight if pages were migrated */
+ p->numa_faults[node] += pages << migrated;
}
static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index cc03cfd..c5f773d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -503,6 +503,17 @@ DECLARE_PER_CPU(struct rq, runqueues);
#define cpu_curr(cpu) (cpu_rq(cpu)->curr)
#define raw_rq() (&__raw_get_cpu_var(runqueues))
+#ifdef CONFIG_NUMA_BALANCING
+static inline void task_numa_free(struct task_struct *p)
+{
+ kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
#ifdef CONFIG_SMP
#define rcu_dereference_check_sched_domain(p) \
--
1.8.1.4
The zero page is not replicated between nodes and is often shared
between processes. The data is read-only and likely to be cached in
local CPUs if heavily accessed, meaning that the remote memory access
cost is less of a concern. This patch stops accounting for NUMA hinting
faults on the zero page, both in terms of counting faults and of scheduling
tasks on nodes.
Signed-off-by: Mel Gorman <[email protected]>
---
mm/huge_memory.c | 9 +++++++++
mm/memory.c | 7 ++++++-
2 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e4a79fa..ec938ed 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1302,6 +1302,15 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = pmd_page(pmd);
get_page(page);
+
+ /*
+ * Do not account for faults against the huge zero page. The read-only
+ * data is likely to be read-cached on the local CPUs and it is less
+ * useful to know about local versus remote hits on the zero page.
+ */
+ if (is_huge_zero_pfn(page_to_pfn(page)))
+ goto clear_pmdnuma;
+
src_nid = numa_node_id();
count_vm_numa_event(NUMA_HINT_FAULTS);
if (src_nid == page_to_nid(page))
diff --git a/mm/memory.c b/mm/memory.c
index ba94dec..6c6f6b0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3560,8 +3560,13 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
set_pte_at(mm, addr, ptep, pte);
update_mmu_cache(vma, addr, ptep);
+ /*
+ * Do not account for faults against the zero page. The read-only data
+ * is likely to be read-cached on the local CPUs and it is less useful
+ * to know about local versus remote hits on the zero page.
+ */
page = vm_normal_page(vma, addr, pte);
- if (!page) {
+ if (!page || is_zero_pfn(page_to_pfn(page))) {
pte_unmap_unlock(ptep, ptl);
return 0;
}
--
1.8.1.4
On Mon, Jul 15, 2013 at 04:20:18PM +0100, Mel Gorman wrote:
> ---
> kernel/sched/fair.c | 105 +++++++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 83 insertions(+), 22 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3f0519c..8ee1c8e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -846,29 +846,92 @@ static inline int task_faults_idx(int nid, int priv)
> return 2 * nid + priv;
> }
>
> -static unsigned long weighted_cpuload(const int cpu);
> +static unsigned long source_load(int cpu, int type);
> +static unsigned long target_load(int cpu, int type);
> +static unsigned long power_of(int cpu);
> +static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
> +
> +static int task_numa_find_cpu(struct task_struct *p, int nid)
> +{
> + int node_cpu = cpumask_first(cpumask_of_node(nid));
> + int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
> + unsigned long src_load, dst_load;
> + unsigned long min_load = ULONG_MAX;
> + struct task_group *tg = task_group(p);
> + s64 src_eff_load, dst_eff_load;
> + struct sched_domain *sd;
> + unsigned long weight;
> + bool balanced;
> + int imbalance_pct, idx = -1;
>
> + /* No harm being optimistic */
> + if (idle_cpu(node_cpu))
> + return node_cpu;
>
> + /*
> + * Find the lowest common scheduling domain covering the nodes of both
> + * the CPU the task is currently running on and the target NUMA node.
> + */
> + rcu_read_lock();
> + for_each_domain(src_cpu, sd) {
> + if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
> + /*
> + * busy_idx is used for the load decision as it is the
> + * same index used by the regular load balancer for an
> + * active cpu.
> + */
> + idx = sd->busy_idx;
> + imbalance_pct = sd->imbalance_pct;
> + break;
> + }
> + }
> + rcu_read_unlock();
>
> + if (WARN_ON_ONCE(idx == -1))
> + return src_cpu;
>
> + /*
> + * XXX the below is mostly nicked from wake_affine(); we should
> + * see about sharing a bit if at all possible; also it might want
> + * some per entity weight love.
> + */
> + weight = p->se.load.weight;
>
> + src_load = source_load(src_cpu, idx);
> +
> + src_eff_load = 100 + (imbalance_pct - 100) / 2;
> + src_eff_load *= power_of(src_cpu);
> + src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
So did you try with this effective_load() term 'missing'?
> +
> + for_each_cpu(cpu, cpumask_of_node(nid)) {
> + dst_load = target_load(cpu, idx);
> +
> + /* If the CPU is idle, use it */
> + if (!dst_load)
> + return dst_cpu;
> +
> + /* Otherwise check the target CPU load */
> + dst_eff_load = 100;
> + dst_eff_load *= power_of(cpu);
> + dst_eff_load *= dst_load + effective_load(tg, cpu, weight, weight);
> +
> + /*
> + * Destination is considered balanced if the destination CPU is
> + * less loaded than the source CPU. Unfortunately there is a
> + * risk that a task running on a lightly loaded CPU will not
> + * migrate to its preferred node due to load imbalances.
> + */
> + balanced = (dst_eff_load <= src_eff_load);
> + if (!balanced)
> + continue;
> +
> + if (dst_load < min_load) {
> + min_load = dst_load;
> + dst_cpu = cpu;
> }
> }
>
> + return dst_cpu;
> }
On Mon, Jul 15, 2013 at 04:20:20PM +0100, Mel Gorman wrote:
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 53d8465..d679b01 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4857,10 +4857,13 @@ fail:
>
> #ifdef CONFIG_NUMA_BALANCING
> /* Migrate current task p to target_cpu */
> -int migrate_task_to(struct task_struct *p, int target_cpu)
> +int migrate_task_to(struct task_struct *p, int target_cpu,
> + struct task_struct *swap_p)
> {
> struct migration_arg arg = { p, target_cpu };
> int curr_cpu = task_cpu(p);
> + struct rq *rq;
> + int retval;
>
> if (curr_cpu == target_cpu)
> return 0;
> @@ -4868,7 +4871,39 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
> if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
> return -EINVAL;
>
> - return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
> + if (swap_p == NULL)
> + return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
> +
> + /* Make sure the target is still running the expected task */
> + rq = cpu_rq(target_cpu);
> + local_irq_disable();
> + raw_spin_lock(&rq->lock);
raw_spin_lock_irq() :-)
> + if (rq->curr != swap_p) {
> + raw_spin_unlock(&rq->lock);
> + local_irq_enable();
> + return -EINVAL;
> + }
> +
> + /* Take a reference on the running task on the target cpu */
> + get_task_struct(swap_p);
> + raw_spin_unlock(&rq->lock);
> + local_irq_enable();
raw_spin_unlock_irq()
> +
> + /* Move current running task to target CPU */
> + retval = stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
> + if (raw_smp_processor_id() != target_cpu) {
> + put_task_struct(swap_p);
> + return retval;
> + }
(1)
> + /* Move the remote task to the CPU just vacated */
> + local_irq_disable();
> + if (raw_smp_processor_id() == target_cpu)
> + __migrate_task(swap_p, target_cpu, curr_cpu);
> + local_irq_enable();
> +
> + put_task_struct(swap_p);
> + return retval;
> }
So I know this is very much like what Ingo did in his patches, but
there's a whole heap of 'problems' with this approach to task flipping.
So at (1) we just moved ourselves to the remote cpu. This might have
left our original cpu idle and we might have done a newidle balance,
even though we intend another task to run here.
At (1) we just moved ourselves to the remote cpu, however we might not
be eligible to run, so moving the other task to our original CPU might
take a while -- exacerbating the previously mentioned issue.
Since (1) might take a whole lot of time, it might become rather
unlikely that our task @swap_p is still queued on the cpu where we
expected him to be.
On Mon, Jul 15, 2013 at 04:20:02PM +0100, Mel Gorman wrote:
>
> specjbb
> 3.9.0 3.9.0 3.9.0 3.9.0
> vanilla accountload-v5 retrymigrate-v5 swaptasks-v5
> TPut 1 24474.00 ( 0.00%) 24303.00 ( -0.70%) 23529.00 ( -3.86%) 26110.00 ( 6.68%)
> TPut 7 186914.00 ( 0.00%) 179962.00 ( -3.72%) 183667.00 ( -1.74%) 185912.00 ( -0.54%)
> TPut 13 334429.00 ( 0.00%) 327558.00 ( -2.05%) 336418.00 ( 0.59%) 334563.00 ( 0.04%)
> TPut 19 422820.00 ( 0.00%) 451359.00 ( 6.75%) 450069.00 ( 6.44%) 426753.00 ( 0.93%)
> TPut 25 456121.00 ( 0.00%) 533432.00 ( 16.95%) 504138.00 ( 10.53%) 503152.00 ( 10.31%)
> TPut 31 438595.00 ( 0.00%) 510638.00 ( 16.43%) 442937.00 ( 0.99%) 486450.00 ( 10.91%)
> TPut 37 409654.00 ( 0.00%) 475468.00 ( 16.07%) 427673.00 ( 4.40%) 460531.00 ( 12.42%)
> TPut 43 370941.00 ( 0.00%) 442169.00 ( 19.20%) 387382.00 ( 4.43%) 425120.00 ( 14.61%)
>
> It's interesting that retrying the migrate introduced such a large dent. I
> do not know why at this point. Swapping the tasks helped and overall the
> performance is all right with room for improvement.
I think it means that our direct migration scheme is creating too much
imbalance and doing it more often results in more task movement to fix
it up again, hindering page migration efforts to settle on a node.
On Mon, Jul 15, 2013 at 10:03:21PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 15, 2013 at 04:20:18PM +0100, Mel Gorman wrote:
> > ---
> > kernel/sched/fair.c | 105 +++++++++++++++++++++++++++++++++++++++++-----------
> > 1 file changed, 83 insertions(+), 22 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3f0519c..8ee1c8e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -846,29 +846,92 @@ static inline int task_faults_idx(int nid, int priv)
> > return 2 * nid + priv;
> > }
> >
> > -static unsigned long weighted_cpuload(const int cpu);
> > +static unsigned long source_load(int cpu, int type);
> > +static unsigned long target_load(int cpu, int type);
> > +static unsigned long power_of(int cpu);
> > +static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
> > +
> > +static int task_numa_find_cpu(struct task_struct *p, int nid)
> > +{
> > + int node_cpu = cpumask_first(cpumask_of_node(nid));
> > + int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
> > + unsigned long src_load, dst_load;
> > + unsigned long min_load = ULONG_MAX;
> > + struct task_group *tg = task_group(p);
> > + s64 src_eff_load, dst_eff_load;
> > + struct sched_domain *sd;
> > + unsigned long weight;
> > + bool balanced;
> > + int imbalance_pct, idx = -1;
> >
> > + /* No harm being optimistic */
> > + if (idle_cpu(node_cpu))
> > + return node_cpu;
> >
> > + /*
> > + * Find the lowest common scheduling domain covering the nodes of both
> > + * the CPU the task is currently running on and the target NUMA node.
> > + */
> > + rcu_read_lock();
> > + for_each_domain(src_cpu, sd) {
> > + if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
> > + /*
> > + * busy_idx is used for the load decision as it is the
> > + * same index used by the regular load balancer for an
> > + * active cpu.
> > + */
> > + idx = sd->busy_idx;
> > + imbalance_pct = sd->imbalance_pct;
> > + break;
> > + }
> > + }
> > + rcu_read_unlock();
> >
> > + if (WARN_ON_ONCE(idx == -1))
> > + return src_cpu;
> >
> > + /*
> > + * XXX the below is mostly nicked from wake_affine(); we should
> > + * see about sharing a bit if at all possible; also it might want
> > + * some per entity weight love.
> > + */
> > + weight = p->se.load.weight;
> >
> > + src_load = source_load(src_cpu, idx);
> > +
> > + src_eff_load = 100 + (imbalance_pct - 100) / 2;
> > + src_eff_load *= power_of(src_cpu);
> > + src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
>
> So did you try with this effective_load() term 'missing'?
>
Yes, it performed worse in tests. Looking at it, I figured that it would
have to perform worse unless effective_load regularly returns negative
values.
--
Mel Gorman
SUSE Labs
On Mon, Jul 15, 2013 at 10:11:10PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 15, 2013 at 04:20:20PM +0100, Mel Gorman wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 53d8465..d679b01 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4857,10 +4857,13 @@ fail:
> >
> > #ifdef CONFIG_NUMA_BALANCING
> > /* Migrate current task p to target_cpu */
> > -int migrate_task_to(struct task_struct *p, int target_cpu)
> > +int migrate_task_to(struct task_struct *p, int target_cpu,
> > + struct task_struct *swap_p)
> > {
> > struct migration_arg arg = { p, target_cpu };
> > int curr_cpu = task_cpu(p);
> > + struct rq *rq;
> > + int retval;
> >
> > if (curr_cpu == target_cpu)
> > return 0;
> > @@ -4868,7 +4871,39 @@ int migrate_task_to(struct task_struct *p, int target_cpu)
> > if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
> > return -EINVAL;
> >
> > - return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
> > + if (swap_p == NULL)
> > + return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
> > +
> > + /* Make sure the target is still running the expected task */
> > + rq = cpu_rq(target_cpu);
> > + local_irq_disable();
> > + raw_spin_lock(&rq->lock);
>
> raw_spin_lock_irq() :-)
>
damnit!
> > + if (rq->curr != swap_p) {
> > + raw_spin_unlock(&rq->lock);
> > + local_irq_enable();
> > + return -EINVAL;
> > + }
> > +
> > + /* Take a reference on the running task on the target cpu */
> > + get_task_struct(swap_p);
> > + raw_spin_unlock(&rq->lock);
> > + local_irq_enable();
>
> raw_spin_unlock_irq()
>
Fixed.
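For reference, the relevant part of the hunk after this fix presumably reads
(untested sketch, reconstructed from the quoted code above):

	/* Make sure the target is still running the expected task */
	rq = cpu_rq(target_cpu);
	raw_spin_lock_irq(&rq->lock);
	if (rq->curr != swap_p) {
		raw_spin_unlock_irq(&rq->lock);
		return -EINVAL;
	}

	/* Take a reference on the running task on the target cpu */
	get_task_struct(swap_p);
	raw_spin_unlock_irq(&rq->lock);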
> > +
> > + /* Move current running task to target CPU */
> > + retval = stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
> > + if (raw_smp_processor_id() != target_cpu) {
> > + put_task_struct(swap_p);
> > + return retval;
> > + }
>
> (1)
>
> > + /* Move the remote task to the CPU just vacated */
> > + local_irq_disable();
> > + if (raw_smp_processor_id() == target_cpu)
> > + __migrate_task(swap_p, target_cpu, curr_cpu);
> > + local_irq_enable();
> > +
> > + put_task_struct(swap_p);
> > + return retval;
> > }
>
> So I know this is very much like what Ingo did in his patches, but
> there's a whole heap of 'problems' with this approach to task flipping.
>
> So at (1) we just moved ourselves to the remote cpu. This might have
> left our original cpu idle and we might have done a newidle balance,
> even though we intend another task to run here.
>
True. Minimally, a parallel NUMA hinting fault that selected the source
nid as the preferred nid might pass the idle_cpu check and move there immediately.
> At (1) we just moved ourselves to the remote cpu, however we might not
> be eligible to run, so moving the other task to our original CPU might
> take a while -- exacerbating the previously mentioned issue.
>
Also true.
> Since (1) might take a whole lot of time, it might become rather
> unlikely that our task @swap_p is still queued on the cpu where we
> expected him to be.
>
Which would hurt the intentions of patch 17.
hmm.
I did not want to do this lazily via the active load balancer because it
might never happen or, by the time it did happen, it might no longer be the
correct decision. This applied whether I set numa_preferred_nid or added
a numa_preferred_cpu.
What I think I can do is set a preferred CPU, wait until the next
wakeup and then move the task during select_task_rq as long as the load
balancer permits it. I cannot test it right now as all my test machines are
unplugged as part of a move, but the patch against patch 17 is below.
Once p->numa_preferred_cpu exists then I should be able to lazily swap tasks
by setting p->numa_preferred_cpu.
Obviously untested and I need to give it more thought but this is the
general idea of what I mean.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 454ad2e..f388673 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1503,9 +1503,9 @@ struct task_struct {
#ifdef CONFIG_NUMA_BALANCING
int numa_scan_seq;
int numa_migrate_seq;
+ int numa_preferred_cpu;
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
- unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
@@ -1604,6 +1604,14 @@ struct task_struct {
#ifdef CONFIG_NUMA_BALANCING
extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
extern void set_numabalancing_state(bool enabled);
+static inline int numa_preferred_cpu(struct task_struct *p)
+{
+ return p->numa_preferred_cpu;
+}
+static inline void reset_numa_preferred_cpu(struct task_struct *p)
+{
+ p->numa_preferred_cpu = -1;
+}
#else
static inline void task_numa_fault(int last_node, int node, int pages,
bool migrated)
@@ -1612,6 +1620,14 @@ static inline void task_numa_fault(int last_node, int node, int pages,
static inline void set_numabalancing_state(bool enabled)
{
}
+static inline int numa_preferred_cpu(struct task_struct *p)
+{
+ return -1;
+}
+
+static inline void reset_numa_preferred_cpu(struct task_struct *p)
+{
+}
#endif
static inline struct pid *task_pid(struct task_struct *task)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 53d8465..309a27d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1553,6 +1553,9 @@ int wake_up_state(struct task_struct *p, unsigned int state)
*/
static void __sched_fork(struct task_struct *p)
{
+#ifdef CONFIG_NUMA_BALANCING
+ p->numa_preferred_cpu = -1;
+#endif
p->on_rq = 0;
p->se.on_rq = 0;
@@ -1591,6 +1594,7 @@ static void __sched_fork(struct task_struct *p)
p->node_stamp = 0ULL;
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_migrate_seq = 0;
+ p->numa_preferred_cpu = -1;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07a9f40..21806b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -940,14 +940,14 @@ static void numa_migrate_preferred(struct task_struct *p)
int preferred_cpu = task_cpu(p);
/* Success if task is already running on preferred CPU */
- p->numa_migrate_retry = 0;
+ p->numa_preferred_cpu = -1;
if (cpu_to_node(preferred_cpu) == p->numa_preferred_nid)
return;
/* Otherwise, try migrate to a CPU on the preferred node */
preferred_cpu = task_numa_find_cpu(p, p->numa_preferred_nid);
if (migrate_task_to(p, preferred_cpu) != 0)
- p->numa_migrate_retry = jiffies + HZ*5;
+ p->numa_preferred_cpu = preferred_cpu;
}
static void task_numa_placement(struct task_struct *p)
@@ -1052,10 +1052,6 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
task_numa_placement(p);
- /* Retry task to preferred node migration if it previously failed */
- if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
- numa_migrate_preferred(p);
-
/* Record the fault, double the weight if pages were migrated */
p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
}
@@ -3538,10 +3534,25 @@ select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
int new_cpu = cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
+ int numa_cpu;
if (p->nr_cpus_allowed == 1)
return prev_cpu;
+ /*
+ * If a previous NUMA CPU migration failed then recheck now and use a
+ * CPU near the preferred CPU if it would not introduce load imbalance.
+ */
+ numa_cpu = numa_preferred_cpu(p);
+ if (numa_cpu != -1 && cpumask_test_cpu(numa_cpu, tsk_cpus_allowed(p))) {
+ int least_loaded_cpu;
+
+ reset_numa_preferred_cpu(p);
+ least_loaded_cpu = task_numa_find_cpu(p, cpu_to_node(numa_cpu));
+ if (least_loaded_cpu != prev_cpu)
+ return least_loaded_cpu;
+ }
+
if (sd_flag & SD_BALANCE_WAKE) {
if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
want_affine = 1;
On Tue, Jul 16, 2013 at 09:23:42AM +0100, Mel Gorman wrote:
> On Mon, Jul 15, 2013 at 10:03:21PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 15, 2013 at 04:20:18PM +0100, Mel Gorman wrote:
> > > ---
> > > + src_eff_load = 100 + (imbalance_pct - 100) / 2;
> > > + src_eff_load *= power_of(src_cpu);
> > > + src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
> >
> > So did you try with this effective_load() term 'missing'?
> >
>
> Yes, it performed worse in tests. Looking at it, I figured that it would
> have to perform worse unless effective_load regularly returns negative
> values.
In this case it would return negative, seeing as we put a negative in.
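As an illustrative aside, here is a minimal standalone sketch of that
arithmetic, assuming group scheduling is disabled so that effective_load()
collapses to its wl argument; all names and numbers below are made up for
the example and are not the kernel code:

#include <stdio.h>

/*
 * With group scheduling disabled, effective_load(tg, cpu, wl, wg) == wl,
 * so the two sides of the balance check reduce to the expressions below.
 */
static long src_eff_load(long imbalance_pct, long cpu_power,
			 long src_load, long weight)
{
	/* source is biased by imbalance_pct and assumes the task has left */
	return (100 + (imbalance_pct - 100) / 2) * cpu_power *
	       (src_load - weight);
}

static long dst_eff_load(long cpu_power, long dst_load, long weight)
{
	/* destination assumes the task has arrived */
	return 100 * cpu_power * (dst_load + weight);
}

int main(void)
{
	long weight = 1024, src_load = 2048, dst_load = 1024;

	/* balanced if the destination would not end up busier than the source */
	printf("balanced: %d\n",
	       dst_eff_load(1024, dst_load, weight) <=
	       src_eff_load(125, 1024, src_load, weight));
	return 0;
}

Dropping the effective_load() terms removes the -weight/+weight adjustments,
which raises the source side and lowers the destination side, so more
destinations pass the balanced check.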
Summary:
Seeing improvement on a 2 node box when running autonumabenchmark,
but seeing a regression for specjbb on the same box.
Also seeing a huge regression when running autonumabenchmark
on both the 4 node and 8 node boxes.
Below is the autonuma benchmark results on a 2 node machine.
Autonuma benchmark results.
mainline v3.9: (HT enabled)
Testcase: Min Max Avg StdDev
numa01: 220.12 246.96 239.18 9.69
numa02: 41.85 43.02 42.43 0.47
v3.9 + Mel's v5 patches: (HT enabled)
Testcase: Min Max Avg StdDev %Change
numa01: 239.52 242.99 241.61 1.26 -1.00%
numa02: 37.94 38.12 38.05 0.06 11.49%
mainline v3.9:
Testcase: Min Max Avg StdDev
numa01: 118.72 121.04 120.23 0.83
numa02: 36.64 37.56 36.99 0.34
v3.9 + Mel's v5 patches:
Testcase: Min Max Avg StdDev %Change
numa01: 111.34 122.28 118.61 3.77 1.32%
numa02: 36.23 37.27 36.55 0.37 1.18%
Here are results of specjbb run on a 2 node machine.
Specjbb was run on 3 VMs.
In the fit case, one VM was sized to fit within one node.
In the no-fit case, one VM was bigger than the node size.
Specjbb results.
---------------------------------------------------------------------------------------
| | vm| nofit| fit|
| | vm| noksm| ksm| noksm| ksm|
| | vm| nothp| thp| nothp| thp| nothp| thp| nothp| thp|
---------------------------------------------------------------------------------------
| mainline_v39+ | vm_1| 136056| 189423| 135359| 186722| 136983| 191669| 136728| 184253|
| mainline_v39+ | vm_2| 66041| 84779| 64564| 86645| 67426| 84427| 63657| 85043|
| mainline_v39+ | vm_3| 67322| 83301| 63731| 85394| 65015| 85156| 63838| 84199|
| mel_numa_balan| vm_1| 133170| 177883| 136385| 176716| 140650| 174535| 132811| 190120|
| mel_numa_balan| vm_2| 65021| 81707| 62876| 81826| 63635| 84943| 58313| 78997|
| mel_numa_balan| vm_3| 61915| 82198| 60106| 81723| 64222| 81123| 59559| 78299|
| change % | vm_1| -2.12| -6.09| 0.76| -5.36| 2.68| -8.94| -2.86| 3.18|
| change % | vm_2| -1.54| -3.62| -2.61| -5.56| -5.62| 0.61| -8.39| -7.11|
| change % | vm_3| -8.03| -1.32| -5.69| -4.30| -1.22| -4.74| -6.70| -7.01|
---------------------------------------------------------------------------------------
numactl o/p
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 12 13 14 15 16 17
node 0 size: 12276 MB
node 0 free: 10574 MB
node 1 cpus: 6 7 8 9 10 11 18 19 20 21 22 23
node 1 size: 12288 MB
node 1 free: 9697 MB
node distances:
node 0 1
0: 10 21
1: 21 10
Autonuma results on a 4 node machine.
KernelVersion: 3.9.0(HT)
Testcase: Min Max Avg StdDev
numa01: 569.80 624.94 593.12 19.14
numa02: 18.65 21.32 19.69 0.98
KernelVersion: 3.9.0 + Mel's v5 patches(HT)
Testcase: Min Max Avg StdDev %Change
numa01: 718.83 750.46 740.10 11.42 -19.59%
numa02: 20.07 22.36 20.97 0.81 -5.72%
KernelVersion: 3.9.0
Testcase: Min Max Avg StdDev
numa01: 586.75 628.65 604.15 16.13
numa02: 19.67 20.49 19.93 0.29
KernelVersion: 3.9.0 + Mel's v5 patches
Testcase: Min Max Avg StdDev %Change
numa01: 741.48 759.37 747.23 6.36 -18.84%
numa02: 20.55 22.06 21.21 0.52 -5.80%
System x3750 M4 -[8722C1A]-
numactl o/p
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 32 33 34 35 36 37 38 39
node 0 size: 65468 MB
node 0 free: 63069 MB
node 1 cpus: 8 9 10 11 12 13 14 15 40 41 42 43 44 45 46 47
node 1 size: 65536 MB
node 1 free: 63497 MB
node 2 cpus: 16 17 18 19 20 21 22 23 48 49 50 51 52 53 54 55
node 2 size: 65536 MB
node 2 free: 63515 MB
node 3 cpus: 24 25 26 27 28 29 30 31 56 57 58 59 60 61 62 63
node 3 size: 65536 MB
node 3 free: 63659 MB
node distances:
node 0 1 2 3
0: 10 11 11 12
1: 11 10 12 11
2: 11 12 10 11
3: 12 11 11 10
The results on the 8 node box also look similar to the 4 node box.
--
Thanks and Regards
Srikar Dronamraju
On Mon, Jul 15, 2013 at 11:20 PM, Mel Gorman <[email protected]> wrote:
> +
> +static int task_numa_find_cpu(struct task_struct *p, int nid)
> +{
> + int node_cpu = cpumask_first(cpumask_of_node(nid));
[...]
>
> + /* No harm being optimistic */
> + if (idle_cpu(node_cpu))
> + return node_cpu;
>
[...]
> + for_each_cpu(cpu, cpumask_of_node(nid)) {
> + dst_load = target_load(cpu, idx);
> +
> + /* If the CPU is idle, use it */
> + if (!dst_load)
> + return dst_cpu;
> +
Here you want cpu, instead of dst_cpu, I guess.
On Tue, Jul 16, 2013 at 11:55:24PM +0800, Hillf Danton wrote:
> On Mon, Jul 15, 2013 at 11:20 PM, Mel Gorman <[email protected]> wrote:
> > +
> > +static int task_numa_find_cpu(struct task_struct *p, int nid)
> > +{
> > + int node_cpu = cpumask_first(cpumask_of_node(nid));
> [...]
> >
> > + /* No harm being optimistic */
> > + if (idle_cpu(node_cpu))
> > + return node_cpu;
> >
> [...]
> > + for_each_cpu(cpu, cpumask_of_node(nid)) {
> > + dst_load = target_load(cpu, idx);
> > +
> > + /* If the CPU is idle, use it */
> > + if (!dst_load)
> > + return dst_cpu;
> > +
> Here you want cpu, instead of dst_cpu, I guess.
Crap, yes. Thanks!
--
Mel Gorman
SUSE Labs
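The acknowledged fix is presumably the obvious one-liner against the quoted
hunk (untested sketch):

 		/* If the CPU is idle, use it */
 		if (!dst_load)
-			return dst_cpu;
+			return cpu;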
On Mon, Jul 15, 2013 at 11:20 PM, Mel Gorman <[email protected]> wrote:
> THP NUMA hinting fault on pages that are not migrated are being
> accounted for incorrectly. Currently the fault will be counted as if the
> task was running on a node local to the page which is not necessarily
> true.
>
Can you please run the test again without this correction and check the difference?
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/huge_memory.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e2f7f5aa..e4a79fa 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1293,7 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> struct page *page;
> unsigned long haddr = addr & HPAGE_PMD_MASK;
> int target_nid;
> - int current_nid = -1;
> + int src_nid = -1;
> bool migrated;
>
> spin_lock(&mm->page_table_lock);
> @@ -1302,9 +1302,9 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> page = pmd_page(pmd);
> get_page(page);
> - current_nid = page_to_nid(page);
> + src_nid = numa_node_id();
> count_vm_numa_event(NUMA_HINT_FAULTS);
> - if (current_nid == numa_node_id())
> + if (src_nid == page_to_nid(page))
> count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
>
> target_nid = mpol_misplaced(page, vma, haddr);
> @@ -1346,8 +1346,8 @@ clear_pmdnuma:
> update_mmu_cache_pmd(vma, addr, pmdp);
> out_unlock:
> spin_unlock(&mm->page_table_lock);
> - if (current_nid != -1)
> - task_numa_fault(current_nid, HPAGE_PMD_NR, false);
> + if (src_nid != -1)
> + task_numa_fault(src_nid, HPAGE_PMD_NR, false);
> return 0;
> }
On Mon, Jul 15, 2013 at 11:20 PM, Mel Gorman <[email protected]> wrote:
> +static int
> +find_idlest_cpu_node(int this_cpu, int nid)
> +{
> + unsigned long load, min_load = ULONG_MAX;
> + int i, idlest_cpu = this_cpu;
> +
> + BUG_ON(cpu_to_node(this_cpu) == nid);
> +
> + rcu_read_lock();
> + for_each_cpu(i, cpumask_of_node(nid)) {
Check the allowed CPUs first if a task is given?
> + load = weighted_cpuload(i);
> +
> + if (load < min_load) {
> + min_load = load;
> + idlest_cpu = i;
> + }
> + }
> + rcu_read_unlock();
> +
> + return idlest_cpu;
> +}
> +
[...]
> + /*
> + * Record the preferred node as the node with the most faults,
> + * requeue the task to be running on the idlest CPU on the
> + * preferred node and reset the scanning rate to recheck
> + * the working set placement.
> + */
> if (max_faults && max_nid != p->numa_preferred_nid) {
> + int preferred_cpu;
> +
> + /*
> + * If the task is not on the preferred node then find the most
> + * idle CPU to migrate to.
> + */
> + preferred_cpu = task_cpu(p);
> + if (cpu_to_node(preferred_cpu) != max_nid) {
> + preferred_cpu = find_idlest_cpu_node(preferred_cpu,
> + max_nid);
> + }
> +
> + /* Update the preferred nid and migrate task if possible */
> p->numa_preferred_nid = max_nid;
> p->numa_migrate_seq = 0;
> + migrate_task_to(p, preferred_cpu);
> }
On Mon, Jul 15, 2013 at 11:20 PM, Mel Gorman <[email protected]> wrote:
> /*
> * Got a PROT_NONE fault for a page on @node.
> */
> -void task_numa_fault(int node, int pages, bool migrated)
> +void task_numa_fault(int last_nid, int node, int pages, bool migrated)
What is the new parameter for?
> {
> struct task_struct *p = current;
> + int priv;
>
> if (!sched_feat_numa(NUMA))
> return;
>
> + /* For now, do not attempt to detect private/shared accesses */
> + priv = 1;
> +
> /* Allocate buffer to track faults on a per-node basis */
> if (unlikely(!p->numa_faults)) {
> - int size = sizeof(*p->numa_faults) * nr_node_ids;
> + int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
>
> /* numa_faults and numa_faults_buffer share the allocation */
> p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
> @@ -900,7 +915,7 @@ void task_numa_fault(int node, int pages, bool migrated)
> return;
>
> BUG_ON(p->numa_faults_buffer);
> - p->numa_faults_buffer = p->numa_faults + nr_node_ids;
> + p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
> }
>
> /*
> @@ -914,7 +929,7 @@ void task_numa_fault(int node, int pages, bool migrated)
> task_numa_placement(p);
>
> /* Record the fault, double the weight if pages were migrated */
> - p->numa_faults_buffer[node] += pages << migrated;
> + p->numa_faults_buffer[task_faults_idx(node, priv)] += pages << migrated;
> }
>
On 07/15/2013 11:20 PM, Mel Gorman wrote:
> Currently automatic NUMA balancing is unable to distinguish between false
> shared versus private pages except by ignoring pages with an elevated
What's the meaning of false shared?
> page_mapcount entirely. This avoids shared pages bouncing between the
> nodes whose task is using them, but that ignores quite a lot of data.
>
> This patch kicks away the training wheels as the preparation for adding
> support for identifying shared/private pages is now in place. The ordering is
> so that the impact of the shared/private detection can be easily measured. Note
> that the patch does not migrate shared, file-backed pages within vmas marked
> VM_EXEC as these are generally shared library pages. Migrating such pages
> is not beneficial as there is an expectation they are read-shared between
> caches, and iTLB and iCache pressure is generally low.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/migrate.h | 7 ++++---
> mm/memory.c | 7 ++-----
> mm/migrate.c | 17 ++++++-----------
> mm/mprotect.c | 4 +---
> 4 files changed, 13 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index a405d3dc..e7e26af 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -92,11 +92,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
> #endif /* CONFIG_MIGRATION */
>
> #ifdef CONFIG_NUMA_BALANCING
> -extern int migrate_misplaced_page(struct page *page, int node);
> -extern int migrate_misplaced_page(struct page *page, int node);
> +extern int migrate_misplaced_page(struct page *page,
> + struct vm_area_struct *vma, int node);
> extern bool migrate_ratelimited(int node);
> #else
> -static inline int migrate_misplaced_page(struct page *page, int node)
> +static inline int migrate_misplaced_page(struct page *page,
> + struct vm_area_struct *vma, int node)
> {
> return -EAGAIN; /* can't migrate now */
> }
> diff --git a/mm/memory.c b/mm/memory.c
> index ab933be..62ae8a7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3586,7 +3586,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> }
>
> /* Migrate to the requested node */
> - migrated = migrate_misplaced_page(page, target_nid);
> + migrated = migrate_misplaced_page(page, vma, target_nid);
> if (migrated)
> current_nid = target_nid;
>
> @@ -3651,9 +3651,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> page = vm_normal_page(vma, addr, pteval);
> if (unlikely(!page))
> continue;
> - /* only check non-shared pages */
> - if (unlikely(page_mapcount(page) != 1))
> - continue;
>
> /*
> * Note that the NUMA fault is later accounted to either
> @@ -3671,7 +3668,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> /* Migrate to the requested node */
> pte_unmap_unlock(pte, ptl);
> - migrated = migrate_misplaced_page(page, target_nid);
> + migrated = migrate_misplaced_page(page, vma, target_nid);
> if (migrated)
> curr_nid = target_nid;
> task_numa_fault(last_nid, curr_nid, 1, migrated);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 3bbaf5d..23f8122 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1579,7 +1579,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
> * node. Caller is expected to have an elevated reference count on
> * the page that will be dropped by this function before returning.
> */
> -int migrate_misplaced_page(struct page *page, int node)
> +int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
> + int node)
> {
> pg_data_t *pgdat = NODE_DATA(node);
> int isolated;
> @@ -1587,10 +1588,11 @@ int migrate_misplaced_page(struct page *page, int node)
> LIST_HEAD(migratepages);
>
> /*
> - * Don't migrate pages that are mapped in multiple processes.
> - * TODO: Handle false sharing detection instead of this hammer
> + * Don't migrate file pages that are mapped in multiple processes
> + * with execute permissions as they are probably shared libraries.
> */
> - if (page_mapcount(page) != 1)
> + if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
> + (vma->vm_flags & VM_EXEC))
> goto out;
>
> /*
> @@ -1641,13 +1643,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> int page_lru = page_is_file_cache(page);
>
> /*
> - * Don't migrate pages that are mapped in multiple processes.
> - * TODO: Handle false sharing detection instead of this hammer
> - */
> - if (page_mapcount(page) != 1)
> - goto out_dropref;
> -
> - /*
> * Rate-limit the amount of data that is being migrated to a node.
> * Optimal placement is no good if the memory bus is saturated and
> * all the time is being spent migrating!
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 94722a4..cacc64a 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -69,9 +69,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> if (last_nid != this_nid)
> all_same_node = false;
>
> - /* only check non-shared pages */
> - if (!pte_numa(oldpte) &&
> - page_mapcount(page) == 1) {
> + if (!pte_numa(oldpte)) {
> ptent = pte_mknuma(ptent);
> updated = true;
> }
On Mon, Jul 15, 2013 at 04:20:04PM +0100, Mel Gorman wrote:
> index cc03cfd..c5f773d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -503,6 +503,17 @@ DECLARE_PER_CPU(struct rq, runqueues);
> #define cpu_curr(cpu) (cpu_rq(cpu)->curr)
> #define raw_rq() (&__raw_get_cpu_var(runqueues))
>
> +#ifdef CONFIG_NUMA_BALANCING
> +static inline void task_numa_free(struct task_struct *p)
> +{
> + kfree(p->numa_faults);
> +}
> +#else /* CONFIG_NUMA_BALANCING */
> +static inline void task_numa_free(struct task_struct *p)
> +{
> +}
> +#endif /* CONFIG_NUMA_BALANCING */
> +
> #ifdef CONFIG_SMP
>
> #define rcu_dereference_check_sched_domain(p) \
I also need the below hunk to make it compile:
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
#include <linux/tick.h>
+#include <linux/slab.h>
#include "cpupri.h"
#include "cpuacct.h"
On Mon, Jul 15, 2013 at 04:20:18PM +0100, Mel Gorman wrote:
> +static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
And this -- which suggests you always build with cgroups enabled? I generally
try and disable all that nonsense when building new stuff, the scheduler is a
'lot' simpler that way. Once that works make it 'interesting' again.
---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3367,8 +3367,7 @@ static long effective_load(struct task_g
}
#else
-static unsigned long effective_load(struct task_group *tg, int cpu,
- unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
{
return wl;
}
On Mon, Jul 15, 2013 at 04:20:06PM +0100, Mel Gorman wrote:
> The zero page is not replicated between nodes and is often shared
> between processes. The data is read-only and likely to be cached in
> local CPUs if heavily accessed meaning that the remote memory access
> cost is less of a concern. This patch stops accounting for numa hinting
> faults on the zero page in both terms of counting faults and scheduling
> tasks on nodes.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/huge_memory.c | 9 +++++++++
> mm/memory.c | 7 ++++++-
> 2 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e4a79fa..ec938ed 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1302,6 +1302,15 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
>
> page = pmd_page(pmd);
> get_page(page);
> +
> + /*
> + * Do not account for faults against the huge zero page. The read-only
> + * data is likely to be read-cached on the local CPUs and it is less
> + * useful to know about local versus remote hits on the zero page.
> + */
> + if (is_huge_zero_pfn(page_to_pfn(page)))
> + goto clear_pmdnuma;
> +
> src_nid = numa_node_id();
> count_vm_numa_event(NUMA_HINT_FAULTS);
> if (src_nid == page_to_nid(page))
And because of:
5918d10 thp: fix huge zero page logic for page with pfn == 0
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1308,7 +1308,7 @@ int do_huge_pmd_numa_page(struct mm_stru
* data is likely to be read-cached on the local CPUs and it is less
* useful to know about local versus remote hits on the zero page.
*/
- if (is_huge_zero_pfn(page_to_pfn(page)))
+ if (is_huge_zero_page(page))
goto clear_pmdnuma;
src_nid = numa_node_id();
On Mon, 15 Jul 2013 16:20:17 +0100
Mel Gorman <[email protected]> wrote:
> Ideally it would be possible to distinguish between NUMA hinting faults that
> are private to a task and those that are shared. If treated identically
> there is a risk that shared pages bounce between nodes depending on
Your patch 15 breaks the compile with !CONFIG_NUMA_BALANCING.
This little patch fixes it:
The code in change_pte_range unconditionally calls nidpid_to_pid,
even when CONFIG_NUMA_SCHED is disabled. Returning -1 keeps the
value of last_nid at "don't care" and should result in the mprotect
code doing nothing NUMA-related when CONFIG_NUMA_SCHED is disabled.
Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/mm.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 668f03c..0e0d190 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -731,6 +731,26 @@ static inline int page_nidpid_last(struct page *page)
return page_to_nid(page);
}
+static inline int nidpid_to_nid(int nidpid)
+{
+ return -1;
+}
+
+static inline int nidpid_to_pid(int nidpid)
+{
+ return -1;
+}
+
+static inline int nid_pid_to_nidpid(int nid, int pid)
+{
+ return -1;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+ return 1;
+}
+
static inline void page_nidpid_reset_last(struct page *page)
{
}
Subject: stop_machine: Introduce stop_two_cpus()
From: Peter Zijlstra <[email protected]>
Date: Sun Jul 21 12:24:09 CEST 2013
Introduce stop_two_cpus() in order to allow controlled swapping of two
tasks. It repurposes the stop_machine() state machine but only stops
the two cpus which we can do with on-stack structures and avoid
machine wide synchronization issues.
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/stop_machine.h | 1
kernel/stop_machine.c | 243 +++++++++++++++++++++++++------------------
2 files changed, 146 insertions(+), 98 deletions(-)
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -28,6 +28,7 @@ struct cpu_stop_work {
};
int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg);
void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
struct cpu_stop_work *work_buf);
int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -115,6 +115,137 @@ int stop_one_cpu(unsigned int cpu, cpu_s
return done.executed ? done.ret : -ENOENT;
}
+/* This controls the threads on each CPU. */
+enum multi_stop_state {
+ /* Dummy starting state for thread. */
+ MULTI_STOP_NONE,
+ /* Awaiting everyone to be scheduled. */
+ MULTI_STOP_PREPARE,
+ /* Disable interrupts. */
+ MULTI_STOP_DISABLE_IRQ,
+ /* Run the function */
+ MULTI_STOP_RUN,
+ /* Exit */
+ MULTI_STOP_EXIT,
+};
+
+struct multi_stop_data {
+ int (*fn)(void *);
+ void *data;
+ /* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
+ unsigned int num_threads;
+ const struct cpumask *active_cpus;
+
+ enum multi_stop_state state;
+ atomic_t thread_ack;
+};
+
+static void set_state(struct multi_stop_data *msdata,
+ enum multi_stop_state newstate)
+{
+ /* Reset ack counter. */
+ atomic_set(&msdata->thread_ack, msdata->num_threads);
+ smp_wmb();
+ msdata->state = newstate;
+}
+
+/* Last one to ack a state moves to the next state. */
+static void ack_state(struct multi_stop_data *msdata)
+{
+ if (atomic_dec_and_test(&msdata->thread_ack))
+ set_state(msdata, msdata->state + 1);
+}
+
+/* This is the cpu_stop function which stops the CPU. */
+static int multi_cpu_stop(void *data)
+{
+ struct multi_stop_data *msdata = data;
+ enum multi_stop_state curstate = MULTI_STOP_NONE;
+ int cpu = smp_processor_id(), err = 0;
+ unsigned long flags;
+ bool is_active;
+
+ /*
+ * When called from stop_machine_from_inactive_cpu(), irq might
+ * already be disabled. Save the state and restore it on exit.
+ */
+ local_save_flags(flags);
+
+ if (!msdata->active_cpus)
+ is_active = cpu == cpumask_first(cpu_online_mask);
+ else
+ is_active = cpumask_test_cpu(cpu, msdata->active_cpus);
+
+ /* Simple state machine */
+ do {
+ /* Chill out and ensure we re-read multi_stop_state. */
+ cpu_relax();
+ if (msdata->state != curstate) {
+ curstate = msdata->state;
+ switch (curstate) {
+ case MULTI_STOP_DISABLE_IRQ:
+ local_irq_disable();
+ hard_irq_disable();
+ break;
+ case MULTI_STOP_RUN:
+ if (is_active)
+ err = msdata->fn(msdata->data);
+ break;
+ default:
+ break;
+ }
+ ack_state(msdata);
+ }
+ } while (curstate != MULTI_STOP_EXIT);
+
+ local_irq_restore(flags);
+ return err;
+}
+
+/**
+ * stop_two_cpus - stops two cpus
+ * @cpu1: the cpu to stop
+ * @cpu2: the other cpu to stop
+ * @fn: function to execute
+ * @arg: argument to @fn
+ *
+ * Stops both the current and specified CPU and runs @fn on one of them.
+ *
+ * returns when both are completed.
+ */
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg)
+{
+ struct cpu_stop_done done;
+ struct cpu_stop_work work1, work2;
+ struct multi_stop_data msdata = {
+ .fn = fn,
+ .data = arg,
+ .num_threads = 2,
+ .active_cpus = cpumask_of(cpu1),
+ };
+
+ work1 = work2 = (struct cpu_stop_work){
+ .fn = multi_cpu_stop,
+ .arg = &msdata,
+ .done = &done
+ };
+
+ cpu_stop_init_done(&done, 2);
+ set_state(&msdata, MULTI_STOP_PREPARE);
+ /*
+ * Must queue both works with preemption disabled; if cpu1 were
+ * the local cpu we'd never queue the second work, and our fn
+ * might wait forever.
+ */
+ preempt_disable();
+ cpu_stop_queue_work(cpu1, &work1);
+ cpu_stop_queue_work(cpu2, &work2);
+ preempt_enable();
+
+ wait_for_completion(&done.completion);
+ return done.executed ? done.ret : -ENOENT;
+}
+
/**
* stop_one_cpu_nowait - stop a cpu but don't wait for completion
* @cpu: cpu to stop
@@ -359,98 +490,14 @@ early_initcall(cpu_stop_init);
#ifdef CONFIG_STOP_MACHINE
-/* This controls the threads on each CPU. */
-enum stopmachine_state {
- /* Dummy starting state for thread. */
- STOPMACHINE_NONE,
- /* Awaiting everyone to be scheduled. */
- STOPMACHINE_PREPARE,
- /* Disable interrupts. */
- STOPMACHINE_DISABLE_IRQ,
- /* Run the function */
- STOPMACHINE_RUN,
- /* Exit */
- STOPMACHINE_EXIT,
-};
-
-struct stop_machine_data {
- int (*fn)(void *);
- void *data;
- /* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
- unsigned int num_threads;
- const struct cpumask *active_cpus;
-
- enum stopmachine_state state;
- atomic_t thread_ack;
-};
-
-static void set_state(struct stop_machine_data *smdata,
- enum stopmachine_state newstate)
-{
- /* Reset ack counter. */
- atomic_set(&smdata->thread_ack, smdata->num_threads);
- smp_wmb();
- smdata->state = newstate;
-}
-
-/* Last one to ack a state moves to the next state. */
-static void ack_state(struct stop_machine_data *smdata)
-{
- if (atomic_dec_and_test(&smdata->thread_ack))
- set_state(smdata, smdata->state + 1);
-}
-
-/* This is the cpu_stop function which stops the CPU. */
-static int stop_machine_cpu_stop(void *data)
-{
- struct stop_machine_data *smdata = data;
- enum stopmachine_state curstate = STOPMACHINE_NONE;
- int cpu = smp_processor_id(), err = 0;
- unsigned long flags;
- bool is_active;
-
- /*
- * When called from stop_machine_from_inactive_cpu(), irq might
- * already be disabled. Save the state and restore it on exit.
- */
- local_save_flags(flags);
-
- if (!smdata->active_cpus)
- is_active = cpu == cpumask_first(cpu_online_mask);
- else
- is_active = cpumask_test_cpu(cpu, smdata->active_cpus);
-
- /* Simple state machine */
- do {
- /* Chill out and ensure we re-read stopmachine_state. */
- cpu_relax();
- if (smdata->state != curstate) {
- curstate = smdata->state;
- switch (curstate) {
- case STOPMACHINE_DISABLE_IRQ:
- local_irq_disable();
- hard_irq_disable();
- break;
- case STOPMACHINE_RUN:
- if (is_active)
- err = smdata->fn(smdata->data);
- break;
- default:
- break;
- }
- ack_state(smdata);
- }
- } while (curstate != STOPMACHINE_EXIT);
-
- local_irq_restore(flags);
- return err;
-}
-
int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
{
- struct stop_machine_data smdata = { .fn = fn, .data = data,
- .num_threads = num_online_cpus(),
- .active_cpus = cpus };
+ struct multi_stop_data msdata = {
+ .fn = fn,
+ .data = data,
+ .num_threads = num_online_cpus(),
+ .active_cpus = cpus,
+ };
if (!stop_machine_initialized) {
/*
@@ -461,7 +508,7 @@ int __stop_machine(int (*fn)(void *), vo
unsigned long flags;
int ret;
- WARN_ON_ONCE(smdata.num_threads != 1);
+ WARN_ON_ONCE(msdata.num_threads != 1);
local_irq_save(flags);
hard_irq_disable();
@@ -472,8 +519,8 @@ int __stop_machine(int (*fn)(void *), vo
}
/* Set the initial state and stop all online cpus. */
- set_state(&smdata, STOPMACHINE_PREPARE);
- return stop_cpus(cpu_online_mask, stop_machine_cpu_stop, &smdata);
+ set_state(&msdata, MULTI_STOP_PREPARE);
+ return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);
}
int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
@@ -513,25 +560,25 @@ EXPORT_SYMBOL_GPL(stop_machine);
int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
const struct cpumask *cpus)
{
- struct stop_machine_data smdata = { .fn = fn, .data = data,
+ struct multi_stop_data msdata = { .fn = fn, .data = data,
.active_cpus = cpus };
struct cpu_stop_done done;
int ret;
/* Local CPU must be inactive and CPU hotplug in progress. */
BUG_ON(cpu_active(raw_smp_processor_id()));
- smdata.num_threads = num_active_cpus() + 1; /* +1 for local */
+ msdata.num_threads = num_active_cpus() + 1; /* +1 for local */
/* No proper task established and can't sleep - busy wait for lock. */
while (!mutex_trylock(&stop_cpus_mutex))
cpu_relax();
/* Schedule work on other CPUs and execute directly for local CPU */
- set_state(&smdata, STOPMACHINE_PREPARE);
+ set_state(&msdata, MULTI_STOP_PREPARE);
cpu_stop_init_done(&done, num_active_cpus());
- queue_stop_cpus_work(cpu_active_mask, stop_machine_cpu_stop, &smdata,
+ queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
&done);
- ret = stop_machine_cpu_stop(&smdata);
+ ret = multi_cpu_stop(&msdata);
/* Busy wait for completion. */
while (!completion_done(&done.completion))
Subject: sched: Introduce migrate_swap()
From: Peter Zijlstra <[email protected]>
Date: Sun Jul 21 12:30:54 CEST 2013
Use the new stop_two_cpus() to implement migrate_swap(), a function
that flips two tasks between their respective cpus.
I'm fairly sure there's a less crude way than employing the
stop_two_cpus() method, but everything I tried either got horribly
fragile and/or complex. So keep it simple for now.
The notable detail is how we 'migrate' tasks that aren't runnable
anymore. We'll make it appear like we migrated them before they went
to sleep. The sole difference is the previous cpu in the wakeup path,
so we override this.
TODO: I'm fairly sure we can get rid of the wake_cpu != -1 test by
keeping wake_cpu to the actual task cpu; just couldn't be bothered to
think through all the cases.
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/sched.h | 1
kernel/sched/core.c | 103 ++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/fair.c | 3 -
kernel/sched/idle_task.c | 2
kernel/sched/rt.c | 5 --
kernel/sched/sched.h | 3 -
kernel/sched/stop_task.c | 2
7 files changed, 105 insertions(+), 14 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1035,6 +1035,7 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+ int wake_cpu;
#endif
int on_rq;
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1030,6 +1030,90 @@ void set_task_cpu(struct task_struct *p,
__set_task_cpu(p, new_cpu);
}
+static void __migrate_swap_task(struct task_struct *p, int cpu)
+{
+ if (p->on_rq) {
+ struct rq *src_rq, *dst_rq;
+
+ src_rq = task_rq(p);
+ dst_rq = cpu_rq(cpu);
+
+ deactivate_task(src_rq, p, 0);
+ set_task_cpu(p, cpu);
+ activate_task(dst_rq, p, 0);
+ check_preempt_curr(dst_rq, p, 0);
+ } else {
+ /*
+ * Task isn't running anymore; make it appear like we migrated
+ * it before it went to sleep. This means on wakeup we make the
+ * previous cpu our target instead of where it really is.
+ */
+ p->wake_cpu = cpu;
+ }
+}
+
+struct migration_swap_arg {
+ struct task_struct *src_task, *dst_task;
+ int src_cpu, dst_cpu;
+};
+
+static int migrate_swap_stop(void *data)
+{
+ struct migration_swap_arg *arg = data;
+ struct rq *src_rq, *dst_rq;
+ int ret = -EAGAIN;
+
+ src_rq = cpu_rq(arg->src_cpu);
+ dst_rq = cpu_rq(arg->dst_cpu);
+
+ double_rq_lock(src_rq, dst_rq);
+ if (task_cpu(arg->dst_task) != arg->dst_cpu)
+ goto unlock;
+
+ if (task_cpu(arg->src_task) != arg->src_cpu)
+ goto unlock;
+
+ if (!cpumask_test_cpu(arg->dst_cpu, tsk_cpus_allowed(arg->src_task)))
+ goto unlock;
+
+ if (!cpumask_test_cpu(arg->src_cpu, tsk_cpus_allowed(arg->dst_task)))
+ goto unlock;
+
+ __migrate_swap_task(arg->src_task, arg->dst_cpu);
+ __migrate_swap_task(arg->dst_task, arg->src_cpu);
+
+ ret = 0;
+
+unlock:
+ double_rq_unlock(src_rq, dst_rq);
+
+ return ret;
+}
+
+/*
+ * XXX worry about hotplug
+ */
+int migrate_swap(struct task_struct *cur, struct task_struct *p)
+{
+ struct migration_swap_arg arg = {
+ .src_task = cur,
+ .src_cpu = task_cpu(cur),
+ .dst_task = p,
+ .dst_cpu = task_cpu(p),
+ };
+
+ if (arg.src_cpu == arg.dst_cpu)
+ return -EINVAL;
+
+ if (!cpumask_test_cpu(arg.dst_cpu, tsk_cpus_allowed(arg.src_task)))
+ return -EINVAL;
+
+ if (!cpumask_test_cpu(arg.src_cpu, tsk_cpus_allowed(arg.dst_task)))
+ return -EINVAL;
+
+ return stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
+}
+
struct migration_arg {
struct task_struct *task;
int dest_cpu;
@@ -1249,9 +1333,9 @@ static int select_fallback_rq(int cpu, s
* The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
*/
static inline
-int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
- int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
+ cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);
/*
* In order not to call set_task_cpu() on a blocking task we need
@@ -1520,7 +1604,12 @@ try_to_wake_up(struct task_struct *p, un
if (p->sched_class->task_waking)
p->sched_class->task_waking(p);
- cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+ if (p->wake_cpu != -1) { /* XXX make this condition go away */
+ cpu = p->wake_cpu;
+ p->wake_cpu = -1;
+ }
+
+ cpu = select_task_rq(p, cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
set_task_cpu(p, cpu);
@@ -1605,6 +1694,10 @@ static void __sched_fork(struct task_str
{
p->on_rq = 0;
+#ifdef CONFIG_SMP
+ p->wake_cpu = -1;
+#endif
+
p->se.on_rq = 0;
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
@@ -1755,7 +1848,7 @@ void wake_up_new_task(struct task_struct
* - cpus_allowed can change in the fork path
* - any previously selected cpu might disappear through hotplug
*/
- set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
+ set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif
/* Initialize new task's runnable average */
@@ -2083,7 +2176,7 @@ void sched_exec(void)
int dest_cpu;
raw_spin_lock_irqsave(&p->pi_lock, flags);
- dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
+ dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
if (dest_cpu == smp_processor_id())
goto unlock;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3600,11 +3600,10 @@ static int select_idle_sibling(struct ta
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int prev_cpu = task_cpu(p);
int new_cpu = cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -9,7 +9,7 @@
#ifdef CONFIG_SMP
static int
-select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
{
return task_cpu(p); /* IDLE tasks as never migrated */
}
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1169,13 +1169,10 @@ static void yield_task_rt(struct rq *rq)
static int find_lowest_rq(struct task_struct *task);
static int
-select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
{
struct task_struct *curr;
struct rq *rq;
- int cpu;
-
- cpu = task_cpu(p);
if (p->nr_cpus_allowed == 1)
goto out;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -556,6 +556,7 @@ static inline u64 rq_clock_task(struct r
#ifdef CONFIG_NUMA_BALANCING
extern int migrate_task_to(struct task_struct *p, int cpu);
+extern int migrate_swap(struct task_struct *, struct task_struct *);
static inline void task_numa_free(struct task_struct *p)
{
kfree(p->numa_faults);
@@ -987,7 +988,7 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);
#ifdef CONFIG_SMP
- int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
void (*migrate_task_rq)(struct task_struct *p, int next_cpu);
void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -11,7 +11,7 @@
#ifdef CONFIG_SMP
static int
-select_task_rq_stop(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
{
return task_cpu(p); /* stop tasks as never migrate */
}
Subject: sched, numa: Break stuff..
From: Peter Zijlstra <[email protected]>
Date: Tue Jul 23 14:58:41 CEST 2013
This patch is mostly a comment in code. I don't believe the current
scan period adjustment scheme can work properly nor do I think it a
good idea to ratelimit the numa faults as a whole based on migration.
Reasons are in the modified comments...
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched/fair.c | 41 ++++++++++++++++++++++++++++++++---------
1 file changed, 32 insertions(+), 9 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1108,7 +1108,6 @@ static void task_numa_placement(struct t
/* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
- int old_migrate_seq = p->numa_migrate_seq;
/* Queue task on preferred node if possible */
p->numa_preferred_nid = max_nid;
@@ -1116,14 +1115,19 @@ static void task_numa_placement(struct t
numa_migrate_preferred(p);
/*
+ int old_migrate_seq = p->numa_migrate_seq;
+ *
* If preferred nodes changes frequently then the scan rate
* will be continually high. Mitigate this by increasing the
* scan rate only if the task was settled.
- */
+ *
+ * APZ: disabled because we don't lower it again :/
+ *
if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
p->numa_scan_period = max(p->numa_scan_period >> 1,
task_scan_min(p));
}
+ */
}
}
@@ -1167,10 +1171,20 @@ void task_numa_fault(int last_nidpid, in
/*
* If pages are properly placed (did not migrate) then scan slower.
* This is reset periodically in case of phase changes
- */
- if (!migrated)
+ *
+ * APZ: it seems to me that one can get a ton of !migrated faults;
+ * consider the scenario where two threads fight over a shared memory
+ * segment. We'll win half the faults, half of that will be local, half
+ * of that will be remote. This means we'll see 1/4-th of the total
+ * memory being !migrated. Using a fixed increment will completely
+ * flatten the scan speed for a sufficiently large workload. Another
+ * scenario is due to that migration rate limit.
+ *
+ if (!migrated) {
p->numa_scan_period = min(p->numa_scan_period_max,
p->numa_scan_period + jiffies_to_msecs(10));
+ }
+ */
task_numa_placement(p);
@@ -1216,12 +1230,15 @@ void task_numa_work(struct callback_head
if (p->flags & PF_EXITING)
return;
+#if 0
/*
* We do not care about task placement until a task runs on a node
* other than the first one used by the address space. This is
* largely because migrations are driven by what CPU the task
* is running on. If it's never scheduled on another node, it'll
* not migrate so why bother trapping the fault.
+ *
+ * APZ: seems like a bad idea for pure shared memory workloads.
*/
if (mm->first_nid == NUMA_PTE_SCAN_INIT)
mm->first_nid = numa_node_id();
@@ -1233,6 +1250,7 @@ void task_numa_work(struct callback_head
mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
}
+#endif
/*
* Enforce maximal scan/migration frequency..
@@ -1254,9 +1272,14 @@ void task_numa_work(struct callback_head
* Do not set pte_numa if the current running node is rate-limited.
* This loses statistics on the fault but if we are unwilling to
* migrate to this node, it is less likely we can do useful work
- */
+ *
+ * APZ: seems like a bad idea; even if this node can't migrate anymore
+ * other nodes might and we want up-to-date information to do balance
+ * decisions.
+ *
if (migrate_ratelimited(numa_node_id()))
return;
+ */
start = mm->numa_scan_offset;
pages = sysctl_numa_balancing_scan_size;
@@ -1297,10 +1320,10 @@ void task_numa_work(struct callback_head
out:
/*
- * It is possible to reach the end of the VMA list but the last few VMAs are
- * not guaranteed to the vma_migratable. If they are not, we would find the
- * !migratable VMA on the next scan but not reset the scanner to the start
- * so check it now.
+ * It is possible to reach the end of the VMA list but the last few
+ * VMAs are not guaranteed to the vma_migratable. If they are not, we
+ * would find the !migratable VMA on the next scan but not reset the
+ * scanner to the start so check it now.
*/
if (vma)
mm->numa_scan_offset = start;
Subject: mm, numa: Sanitize task_numa_fault() callsites
From: Peter Zijlstra <[email protected]>
Date: Mon Jul 22 10:42:38 CEST 2013
There are three callers of task_numa_fault():
- do_huge_pmd_numa_page():
Accounts against the current node, not the node where the
page resides, unless we migrated, in which case it accounts
against the node we migrated to.
- do_numa_page():
Accounts against the current node, not the node where the
page resides, unless we migrated, in which case it accounts
against the node we migrated to.
- do_pmd_numa_page():
Accounts not at all when the page isn't migrated, otherwise
accounts against the node we migrated towards.
This seems wrong to me; all three sites should have the same
semantics. Furthermore, we should account against where the page
really is; we already know where the task is.
So modify all three sites to always account; we did after all receive
the fault; and always account to where the page is after migration,
regardless of success.
They all still differ on when they clear the PTE/PMD; ideally that
would get sorted too.
Signed-off-by: Peter Zijlstra <[email protected]>
---
mm/huge_memory.c | 24 ++++++++++++++----------
mm/memory.c | 52 ++++++++++++++++++++++------------------------------
2 files changed, 36 insertions(+), 40 deletions(-)
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,9 +1292,9 @@ int do_huge_pmd_numa_page(struct mm_stru
{
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
+ int page_nid = -1, this_nid = numa_node_id();
int target_nid, last_nidpid;
- int src_nid = -1;
- bool migrated;
+ bool migrated = false;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1311,9 +1311,9 @@ int do_huge_pmd_numa_page(struct mm_stru
if (is_huge_zero_page(page))
goto clear_pmdnuma;
- src_nid = numa_node_id();
+ page_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (src_nid == page_to_nid(page))
+ if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
last_nidpid = page_nidpid_last(page);
@@ -1327,7 +1327,7 @@ int do_huge_pmd_numa_page(struct mm_stru
spin_unlock(&mm->page_table_lock);
lock_page(page);
- /* Confirm the PTE did not while locked */
+ /* Confirm the PMD didn't change while we released the page_table_lock */
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp))) {
unlock_page(page);
@@ -1339,11 +1339,12 @@ int do_huge_pmd_numa_page(struct mm_stru
/* Migrate the THP to the requested node */
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
- if (!migrated)
+ if (migrated)
+ page_nid = target_nid;
+ else
goto check_same;
- task_numa_fault(last_nidpid, target_nid, HPAGE_PMD_NR, true);
- return 0;
+ goto out;
check_same:
spin_lock(&mm->page_table_lock);
@@ -1356,8 +1357,11 @@ int do_huge_pmd_numa_page(struct mm_stru
update_mmu_cache_pmd(vma, addr, pmdp);
out_unlock:
spin_unlock(&mm->page_table_lock);
- if (src_nid != -1)
- task_numa_fault(last_nidpid, src_nid, HPAGE_PMD_NR, false);
+
+out:
+ if (page_nid != -1)
+ task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
+
return 0;
}
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3533,8 +3533,8 @@ int do_numa_page(struct mm_struct *mm, s
{
struct page *page = NULL;
spinlock_t *ptl;
- int current_nid = -1, last_nidpid;
- int target_nid;
+ int page_nid = -1;
+ int target_nid, last_nidpid;
bool migrated = false;
/*
@@ -3569,15 +3569,10 @@ int do_numa_page(struct mm_struct *mm, s
}
last_nidpid = page_nidpid_last(page);
- current_nid = page_to_nid(page);
- target_nid = numa_migrate_prep(page, vma, addr, current_nid);
+ page_nid = page_to_nid(page);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
if (target_nid == -1) {
- /*
- * Account for the fault against the current node if it not
- * being replaced regardless of where the page is located.
- */
- current_nid = numa_node_id();
put_page(page);
goto out;
}
@@ -3585,11 +3580,12 @@ int do_numa_page(struct mm_struct *mm, s
/* Migrate to the requested node */
migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
- current_nid = target_nid;
+ page_nid = target_nid;
out:
- if (current_nid != -1)
- task_numa_fault(last_nidpid, current_nid, 1, migrated);
+ if (page_nid != -1)
+ task_numa_fault(last_nidpid, page_nid, 1, migrated);
+
return 0;
}
@@ -3604,7 +3600,6 @@ static int do_pmd_numa_page(struct mm_st
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
- int local_nid = numa_node_id();
int last_nidpid;
spin_lock(&mm->page_table_lock);
@@ -3628,9 +3623,10 @@ static int do_pmd_numa_page(struct mm_st
for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
pte_t pteval = *pte;
struct page *page;
- int curr_nid = local_nid;
+ int page_nid = -1;
int target_nid;
- bool migrated;
+ bool migrated = false;
+
if (!pte_present(pteval))
continue;
if (!pte_numa(pteval))
@@ -3649,26 +3645,22 @@ static int do_pmd_numa_page(struct mm_st
if (unlikely(!page))
continue;
- /*
- * Note that the NUMA fault is later accounted to either
- * the node that is currently running or where the page is
- * migrated to.
- */
- curr_nid = local_nid;
last_nidpid = page_nidpid_last(page);
+ page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr,
- page_to_nid(page));
- if (target_nid == -1) {
+ page_nid);
+ pte_unmap_unlock(pte, ptl);
+
+ if (target_nid != -1) {
+ migrated = migrate_misplaced_page(page, vma, target_nid);
+ if (migrated)
+ page_nid = target_nid;
+ } else {
put_page(page);
- continue;
}
- /* Migrate to the requested node */
- pte_unmap_unlock(pte, ptl);
- migrated = migrate_misplaced_page(page, vma, target_nid);
- if (migrated)
- curr_nid = target_nid;
- task_numa_fault(last_nidpid, curr_nid, 1, migrated);
+ if (page_nid != -1)
+ task_numa_fault(last_nidpid, page_nid, 1, migrated);
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
Subject: sched, numa: migrates_degrades_locality()
From: Peter Zijlstra <[email protected]>
Date: Mon Jul 22 14:02:54 CEST 2013
It just makes heaps of sense; so add it and make both it and
migrate_improves_locality() a sched_feat().
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
---
kernel/sched/fair.c | 35 +++++++++++++++++++++++++++++++++--
kernel/sched/features.h | 2 ++
2 files changed, 35 insertions(+), 2 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4323,7 +4323,7 @@ static bool migrate_improves_locality(st
{
int src_nid, dst_nid;
- if (!sched_feat(NUMA_BALANCE))
+ if (!sched_feat(NUMA_BALANCE) || !sched_feat(NUMA_FAULTS_UP))
return false;
if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
@@ -4336,7 +4336,30 @@ static bool migrate_improves_locality(st
p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
return false;
- if (p->numa_preferred_nid == dst_nid)
+ if (task_faults(p, dst_nid) > task_faults(p, src_nid))
+ return true;
+
+ return false;
+}
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+ int src_nid, dst_nid;
+
+ if (!sched_feat(NUMA_BALANCE) || !sched_feat(NUMA_FAULTS_DOWN))
+ return false;
+
+ if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+ return false;
+
+ src_nid = cpu_to_node(env->src_cpu);
+ dst_nid = cpu_to_node(env->dst_cpu);
+
+ if (src_nid == dst_nid ||
+ p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+ return false;
+
+ if (task_faults(p, dst_nid) < task_faults(p, src_nid))
return true;
return false;
@@ -4347,6 +4370,12 @@ static inline bool migrate_improves_loca
{
return false;
}
+
+static inline bool migrate_degrades_locality(struct task_struct *p,
+ struct lb_env *env)
+{
+ return false;
+}
#endif
/*
@@ -4409,6 +4438,8 @@ int can_migrate_task(struct task_struct
* 3) too many balance attempts have failed.
*/
tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+ if (!tsk_cache_hot)
+ tsk_cache_hot = migrate_degrades_locality(p, env);
if (migrate_improves_locality(p, env)) {
#ifdef CONFIG_SCHEDSTATS
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -70,4 +70,6 @@ SCHED_FEAT(LB_MIN, false)
SCHED_FEAT(NUMA, false)
SCHED_FEAT(NUMA_FORCE, false)
SCHED_FEAT(NUMA_BALANCE, true)
+SCHED_FEAT(NUMA_FAULTS_UP, true)
+SCHED_FEAT(NUMA_FAULTS_DOWN, false)
#endif
Subject: sched, numa: Improve scanner
From: Peter Zijlstra <[email protected]>
Date: Tue Jul 23 17:02:38 CEST 2013
With a trace_printk("working\n"); right after the cmpxchg in
task_numa_work() we can see that for a 4-thread process it's always
the same task winning the race and doing the protection change.
This is a problem since the task doing the protection change pays a
penalty for taking the faults -- it is busy while marking the PTEs. If
it's always the same task, the ->numa_faults[] statistics get severely skewed.
Avoid this by delaying the task doing the protection change such that
it is unlikely to win the privilege again.
Before:
root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
thread 0/0-3232 [022] .... 212.787402: task_numa_work: working
thread 0/0-3232 [022] .... 212.888473: task_numa_work: working
thread 0/0-3232 [022] .... 212.989538: task_numa_work: working
thread 0/0-3232 [022] .... 213.090602: task_numa_work: working
thread 0/0-3232 [022] .... 213.191667: task_numa_work: working
thread 0/0-3232 [022] .... 213.292734: task_numa_work: working
thread 0/0-3232 [022] .... 213.393804: task_numa_work: working
thread 0/0-3232 [022] .... 213.494869: task_numa_work: working
thread 0/0-3232 [022] .... 213.596937: task_numa_work: working
thread 0/0-3232 [022] .... 213.699000: task_numa_work: working
thread 0/0-3232 [022] .... 213.801067: task_numa_work: working
thread 0/0-3232 [022] .... 213.903155: task_numa_work: working
thread 0/0-3232 [022] .... 214.005201: task_numa_work: working
thread 0/0-3232 [022] .... 214.107266: task_numa_work: working
thread 0/0-3232 [022] .... 214.209342: task_numa_work: working
After:
root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
thread 0/0-3253 [005] .... 136.865051: task_numa_work: working
thread 0/2-3255 [026] .... 136.965134: task_numa_work: working
thread 0/3-3256 [024] .... 137.065217: task_numa_work: working
thread 0/3-3256 [024] .... 137.165302: task_numa_work: working
thread 0/3-3256 [024] .... 137.265382: task_numa_work: working
thread 0/0-3253 [004] .... 137.366465: task_numa_work: working
thread 0/2-3255 [026] .... 137.466549: task_numa_work: working
thread 0/0-3253 [004] .... 137.566629: task_numa_work: working
thread 0/0-3253 [004] .... 137.666711: task_numa_work: working
thread 0/1-3254 [028] .... 137.766799: task_numa_work: working
thread 0/0-3253 [004] .... 137.866876: task_numa_work: working
thread 0/2-3255 [026] .... 137.966960: task_numa_work: working
thread 0/1-3254 [028] .... 138.067041: task_numa_work: working
thread 0/2-3255 [026] .... 138.167123: task_numa_work: working
thread 0/3-3256 [024] .... 138.267207: task_numa_work: working
Signed-off-by: Peter Zijlstra <[email protected]>
---
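Why a two-tick bump is enough: task_tick_numa() only queues the scan work
once now - node_stamp exceeds the scan period, so nudging node_stamp
forward for the thread that just did the protection change makes one of
its siblings likely to cross the threshold first next time. Very roughly:

	/* task_tick_numa(), simplified */
	if (now - curr->node_stamp > period) {
		curr->node_stamp += period;	/* was: curr->node_stamp = now; */
		/* queue task_numa_work() */
	}

	/* task_numa_work() */
	p->node_stamp += 2 * TICK_NSEC;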
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1316,6 +1316,12 @@ void task_numa_work(struct callback_head
return;
/*
+ * Delay this task enough that another task of this mm will likely win
+ * the next time around.
+ */
+ p->node_stamp += 2 * TICK_NSEC;
+
+ /*
* Do not set pte_numa if the current running node is rate-limited.
* This loses statistics on the fault but if we are unwilling to
* migrate to this node, it is less likely we can do useful work
@@ -1405,7 +1411,7 @@ void task_tick_numa(struct rq *rq, struc
if (now - curr->node_stamp > period) {
if (!curr->node_stamp)
curr->numa_scan_period = task_scan_min(curr);
- curr->node_stamp = now;
+ curr->node_stamp += period;
if (!time_before(jiffies, curr->mm->numa_next_scan)) {
init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
Subject: mm, sched, numa: Create a per-task MPOL_INTERLEAVE policy
From: Peter Zijlstra <[email protected]>
Date: Mon Jul 22 10:42:38 CEST 2013
Just an idea... the rest of the code doesn't work well enough for this to
matter, and there's also something sickly with it since it makes my box
explode. But I wanted to put the idea out there anyway.
Signed-off-by: Peter Zijlstra <[email protected]>
---
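The core of the idea is small: each time placement runs, rebuild a
per-task interleave mask from the fault statistics, keeping every node
that sees more than half as many faults as the busiest node. Roughly,
from task_numa_mempol() below:

	nodemask_t nodes = NODE_MASK_NONE;
	int node;

	for_each_node(node) {
		if (task_faults(p, node) > max_faults / 2)
			node_set(node, nodes);
	}

	/* Retarget the task's MPOL_INTERLEAVE policy at those nodes */
	mpol_rebind_task(p, &nodes, MPOL_REBIND_STEP1);
	mpol_rebind_task(p, &nodes, MPOL_REBIND_STEP2);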
include/linux/mempolicy.h | 5 +-
kernel/sched/fair.c | 44 +++++++++++++++++++++
kernel/sched/features.h | 1
mm/huge_memory.c | 28 +++++++------
mm/memory.c | 33 ++++++++++------
mm/mempolicy.c | 94 +++++++++++++++++++++++++++++-----------------
6 files changed, 145 insertions(+), 60 deletions(-)
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -60,6 +60,7 @@ struct mempolicy {
* The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
*/
+extern struct mempolicy *__mpol_new(unsigned short, unsigned short);
extern void __mpol_put(struct mempolicy *pol);
static inline void mpol_put(struct mempolicy *pol)
{
@@ -187,7 +188,7 @@ static inline int vma_migratable(struct
return 1;
}
-extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long, int *);
#else
@@ -307,7 +308,7 @@ static inline int mpol_to_str(char *buff
}
static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
- unsigned long address)
+ unsigned long address, int *account_node)
{
return -1; /* no node preference */
}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -893,6 +893,47 @@ static inline unsigned long task_faults(
return p->numa_faults[2*nid] + p->numa_faults[2*nid+1];
}
+/*
+ * Create/Update p->mempolicy MPOL_INTERLEAVE to match p->numa_faults[].
+ */
+static void task_numa_mempol(struct task_struct *p, long max_faults)
+{
+ struct mempolicy *pol = p->mempolicy, *new = NULL;
+ nodemask_t nodes = NODE_MASK_NONE;
+ int node;
+
+ if (!pol) {
+ new = __mpol_new(MPOL_INTERLEAVE, MPOL_F_MOF | MPOL_F_MORON);
+ if (IS_ERR(new))
+ return;
+ }
+
+ task_lock(p);
+
+ pol = p->mempolicy; /* lock forces a re-read */
+ if (!pol) {
+ pol = p->mempolicy = new;
+ new = NULL;
+ }
+
+ if (!(pol->flags & MPOL_F_MORON))
+ goto unlock;
+
+ for_each_node(node) {
+ if (task_faults(p, node) > max_faults/2)
+ node_set(node, nodes);
+ }
+
+ mpol_rebind_task(p, &nodes, MPOL_REBIND_STEP1);
+ mpol_rebind_task(p, &nodes, MPOL_REBIND_STEP2);
+
+unlock:
+ task_unlock(p);
+
+ if (new)
+ __mpol_put(new);
+}
+
static unsigned long weighted_cpuload(const int cpu);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
@@ -1106,6 +1147,9 @@ static void task_numa_placement(struct t
}
}
+ if (sched_feat(NUMA_INTERLEAVE))
+ task_numa_mempol(p, max_faults);
+
/* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -72,4 +72,5 @@ SCHED_FEAT(NUMA_FORCE, false)
SCHED_FEAT(NUMA_BALANCE, true)
SCHED_FEAT(NUMA_FAULTS_UP, true)
SCHED_FEAT(NUMA_FAULTS_DOWN, true)
+SCHED_FEAT(NUMA_INTERLEAVE, false)
#endif
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_stru
{
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
- int page_nid = -1, this_nid = numa_node_id();
+ int page_nid = -1, account_nid = -1, this_nid = numa_node_id();
int target_nid, last_nidpid;
bool migrated = false;
@@ -1301,7 +1301,6 @@ int do_huge_pmd_numa_page(struct mm_stru
goto out_unlock;
page = pmd_page(pmd);
- get_page(page);
/*
* Do not account for faults against the huge zero page. The read-only
@@ -1317,13 +1316,12 @@ int do_huge_pmd_numa_page(struct mm_stru
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
last_nidpid = page_nidpid_last(page);
- target_nid = mpol_misplaced(page, vma, haddr);
- if (target_nid == -1) {
- put_page(page);
+ target_nid = mpol_misplaced(page, vma, haddr, &account_nid);
+ if (target_nid == -1)
goto clear_pmdnuma;
- }
/* Acquire the page lock to serialise THP migrations */
+ get_page(page);
spin_unlock(&mm->page_table_lock);
lock_page(page);
@@ -1332,6 +1330,7 @@ int do_huge_pmd_numa_page(struct mm_stru
if (unlikely(!pmd_same(pmd, *pmdp))) {
unlock_page(page);
put_page(page);
+ account_nid = page_nid = -1; /* someone else took our fault */
goto out_unlock;
}
spin_unlock(&mm->page_table_lock);
@@ -1339,17 +1338,20 @@ int do_huge_pmd_numa_page(struct mm_stru
/* Migrate the THP to the requested node */
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
- if (migrated)
- page_nid = target_nid;
- else
+ if (!migrated) {
+ account_nid = -1; /* account against the old page */
goto check_same;
+ }
+ page_nid = target_nid;
goto out;
check_same:
spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(pmd, *pmdp)))
+ if (unlikely(!pmd_same(pmd, *pmdp))) {
+ page_nid = -1; /* someone else took our fault */
goto out_unlock;
+ }
clear_pmdnuma:
pmd = pmd_mknonnuma(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
@@ -1359,8 +1361,10 @@ int do_huge_pmd_numa_page(struct mm_stru
spin_unlock(&mm->page_table_lock);
out:
- if (page_nid != -1)
- task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
+ if (account_nid == -1)
+ account_nid = page_nid;
+ if (account_nid != -1)
+ task_numa_fault(last_nidpid, account_nid, HPAGE_PMD_NR, migrated);
return 0;
}
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3516,16 +3516,17 @@ static int do_nonlinear_fault(struct mm_
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}
-int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
- unsigned long addr, int current_nid)
+static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+ unsigned long addr, int page_nid,
+ int *account_nid)
{
get_page(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (page_nid == numa_node_id())
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
- return mpol_misplaced(page, vma, addr);
+ return mpol_misplaced(page, vma, addr, account_nid);
}
int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -3533,7 +3534,7 @@ int do_numa_page(struct mm_struct *mm, s
{
struct page *page = NULL;
spinlock_t *ptl;
- int page_nid = -1;
+ int page_nid = -1, account_nid = -1;
int target_nid, last_nidpid;
bool migrated = false;
@@ -3570,7 +3571,7 @@ int do_numa_page(struct mm_struct *mm, s
last_nidpid = page_nidpid_last(page);
page_nid = page_to_nid(page);
- target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid, &account_nid);
pte_unmap_unlock(ptep, ptl);
if (target_nid == -1) {
put_page(page);
@@ -3583,8 +3584,10 @@ int do_numa_page(struct mm_struct *mm, s
page_nid = target_nid;
out:
- if (page_nid != -1)
- task_numa_fault(last_nidpid, page_nid, 1, migrated);
+ if (account_nid == -1)
+ account_nid = page_nid;
+ if (account_nid != -1)
+ task_numa_fault(last_nidpid, account_nid, 1, migrated);
return 0;
}
@@ -3623,7 +3626,7 @@ static int do_pmd_numa_page(struct mm_st
for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
pte_t pteval = *pte;
struct page *page;
- int page_nid = -1;
+ int page_nid = -1, account_nid = -1;
int target_nid;
bool migrated = false;
@@ -3648,19 +3651,25 @@ static int do_pmd_numa_page(struct mm_st
last_nidpid = page_nidpid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr,
- page_nid);
+ page_nid, &account_nid);
pte_unmap_unlock(pte, ptl);
if (target_nid != -1) {
migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
page_nid = target_nid;
+ else
+ account_nid = -1;
} else {
put_page(page);
}
- if (page_nid != -1)
- task_numa_fault(last_nidpid, page_nid, 1, migrated);
+ if (account_nid == -1)
+ account_nid = page_nid;
+ if (account_nid != -1)
+ task_numa_fault(last_nidpid, account_nid, 1, migrated);
+
+ cond_resched();
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -118,22 +118,18 @@ static struct mempolicy default_policy =
.flags = MPOL_F_LOCAL,
};
-static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+static struct mempolicy numa_policy = {
+ .refcnt = ATOMIC_INIT(1), /* never free it */
+ .mode = MPOL_PREFERRED,
+ .flags = MPOL_F_LOCAL | MPOL_F_MOF | MPOL_F_MORON,
+};
static struct mempolicy *get_task_policy(struct task_struct *p)
{
struct mempolicy *pol = p->mempolicy;
- int node;
- if (!pol) {
- node = numa_node_id();
- if (node != NUMA_NO_NODE)
- pol = &preferred_node_policy[node];
-
- /* preferred_node_policy is not initialised early in boot */
- if (!pol->mode)
- pol = NULL;
- }
+ if (!pol)
+ pol = &numa_policy;
return pol;
}
@@ -248,6 +244,20 @@ static int mpol_set_nodemask(struct memp
return ret;
}
+struct mempolicy *__mpol_new(unsigned short mode, unsigned short flags)
+{
+ struct mempolicy *policy;
+
+ policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+ if (!policy)
+ return ERR_PTR(-ENOMEM);
+ atomic_set(&policy->refcnt, 1);
+ policy->mode = mode;
+ policy->flags = flags;
+
+ return policy;
+}
+
/*
* This function just creates a new policy, does some check and simple
* initialization. You must invoke mpol_set_nodemask() to set nodes.
@@ -255,8 +265,6 @@ static int mpol_set_nodemask(struct memp
static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
nodemask_t *nodes)
{
- struct mempolicy *policy;
-
pr_debug("setting mode %d flags %d nodes[0] %lx\n",
mode, flags, nodes ? nodes_addr(*nodes)[0] : NUMA_NO_NODE);
@@ -284,14 +292,8 @@ static struct mempolicy *mpol_new(unsign
mode = MPOL_PREFERRED;
} else if (nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
- policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
- if (!policy)
- return ERR_PTR(-ENOMEM);
- atomic_set(&policy->refcnt, 1);
- policy->mode = mode;
- policy->flags = flags;
- return policy;
+ return __mpol_new(mode, flags);
}
/* Slow path of a mpol destructor. */
@@ -2234,12 +2236,13 @@ static void sp_free(struct sp_node *n)
* Policy determination "mimics" alloc_page_vma().
* Called from fault path where we know the vma and faulting address.
*/
-int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr, int *account_node)
{
struct mempolicy *pol;
struct zone *zone;
int curnid = page_to_nid(page);
unsigned long pgoff;
+ int thisnid = numa_node_id();
int polnid = -1;
int ret = -1;
@@ -2261,7 +2264,7 @@ int mpol_misplaced(struct page *page, st
case MPOL_PREFERRED:
if (pol->flags & MPOL_F_LOCAL)
- polnid = numa_node_id();
+ polnid = thisnid;
else
polnid = pol->v.preferred_node;
break;
@@ -2276,7 +2279,7 @@ int mpol_misplaced(struct page *page, st
if (node_isset(curnid, pol->v.nodes))
goto out;
(void)first_zones_zonelist(
- node_zonelist(numa_node_id(), GFP_HIGHUSER),
+ node_zonelist(thisnid, GFP_HIGHUSER),
gfp_zone(GFP_HIGHUSER),
&pol->v.nodes, &zone);
polnid = zone->node;
@@ -2291,8 +2294,7 @@ int mpol_misplaced(struct page *page, st
int last_nidpid;
int this_nidpid;
- polnid = numa_node_id();
- this_nidpid = nid_pid_to_nidpid(polnid, current->pid);;
+ this_nidpid = nid_pid_to_nidpid(thisnid, current->pid);;
/*
* Multi-stage node selection is used in conjunction
@@ -2318,6 +2320,39 @@ int mpol_misplaced(struct page *page, st
last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
goto out;
+
+ /*
+ * Preserve interleave pages while allowing useful
+ * ->numa_faults[] statistics.
+ *
+ * When migrating into an interleave set, migrate to
+ * the correct interleaved node but account against the
+ * current node (where the task is running).
+ *
+ * Not doing this would result in ->numa_faults[] being
+ * flat across the interleaved nodes, making it
+ * impossible to shrink the node list even when all
+ * tasks are running on a single node.
+ *
+ * src dst migrate account
+ * 0 0 -- this_node $page_node
+ * 0 1 -- policy_node this_node
+ * 1 0 -- this_node $page_node
+ * 1 1 -- policy_node this_node
+ *
+ */
+ switch (pol->mode) {
+ case MPOL_INTERLEAVE:
+ if (node_isset(thisnid, pol->v.nodes)) {
+ if (account_node)
+ *account_node = thisnid;
+ }
+ break;
+
+ default:
+ polnid = thisnid;
+ break;
+ }
}
if (curnid != polnid)
@@ -2580,15 +2615,6 @@ void __init numa_policy_init(void)
sizeof(struct sp_node),
0, SLAB_PANIC, NULL);
- for_each_node(nid) {
- preferred_node_policy[nid] = (struct mempolicy) {
- .refcnt = ATOMIC_INIT(1),
- .mode = MPOL_PREFERRED,
- .flags = MPOL_F_MOF | MPOL_F_MORON,
- .v = { .preferred_node = nid, },
- };
- }
-
/*
* Set interleaving policy for system init. Interleaving is only
* enabled across suitably sized nodes (default is >= 16MB), or
On Thu, Jul 25, 2013 at 12:46:33PM +0200, Peter Zijlstra wrote:
> @@ -2234,12 +2236,13 @@ static void sp_free(struct sp_node *n)
> * Policy determination "mimics" alloc_page_vma().
> * Called from fault path where we know the vma and faulting address.
> */
> -int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
> +int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr, int *account_node)
> {
> struct mempolicy *pol;
> struct zone *zone;
> int curnid = page_to_nid(page);
> unsigned long pgoff;
> + int thisnid = numa_node_id();
> int polnid = -1;
> int ret = -1;
>
> @@ -2261,7 +2264,7 @@ int mpol_misplaced(struct page *page, st
>
> case MPOL_PREFERRED:
> if (pol->flags & MPOL_F_LOCAL)
> - polnid = numa_node_id();
> + polnid = thisnid;
> else
> polnid = pol->v.preferred_node;
> break;
> @@ -2276,7 +2279,7 @@ int mpol_misplaced(struct page *page, st
> if (node_isset(curnid, pol->v.nodes))
> goto out;
> (void)first_zones_zonelist(
> - node_zonelist(numa_node_id(), GFP_HIGHUSER),
> + node_zonelist(thisnid, GFP_HIGHUSER),
> gfp_zone(GFP_HIGHUSER),
> &pol->v.nodes, &zone);
> polnid = zone->node;
> @@ -2291,8 +2294,7 @@ int mpol_misplaced(struct page *page, st
> int last_nidpid;
> int this_nidpid;
>
> - polnid = numa_node_id();
> - this_nidpid = nid_pid_to_nidpid(polnid, current->pid);;
> + this_nidpid = nid_pid_to_nidpid(thisnid, current->pid);;
>
> /*
> * Multi-stage node selection is used in conjunction
> @@ -2318,6 +2320,39 @@ int mpol_misplaced(struct page *page, st
> last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
> if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
That should've become:
if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != thisnid)
> goto out;
> +
> + /*
> + * Preserve interleave pages while allowing useful
> + * ->numa_faults[] statistics.
> + *
> + * When migrating into an interleave set, migrate to
> + * the correct interleaved node but account against the
> + * current node (where the task is running).
> + *
> + * Not doing this would result in ->numa_faults[] being
> + * flat across the interleaved nodes, making it
> + * impossible to shrink the node list even when all
> + * tasks are running on a single node.
> + *
> + * src dst migrate account
> + * 0 0 -- this_node $page_node
> + * 0 1 -- policy_node this_node
> + * 1 0 -- this_node $page_node
> + * 1 1 -- policy_node this_node
> + *
> + */
> + switch (pol->mode) {
> + case MPOL_INTERLEAVE:
> + if (node_isset(thisnid, pol->v.nodes)) {
> + if (account_node)
> + *account_node = thisnid;
> + }
> + break;
> +
> + default:
> + polnid = thisnid;
> + break;
> + }
> }
>
> if (curnid != polnid)
On Mon, Jul 15, 2013 at 04:20:17PM +0100, Mel Gorman wrote:
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index cacc64a..04c9469 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -37,14 +37,15 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
>
> static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> unsigned long addr, unsigned long end, pgprot_t newprot,
> - int dirty_accountable, int prot_numa, bool *ret_all_same_node)
> + int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
> {
> struct mm_struct *mm = vma->vm_mm;
> pte_t *pte, oldpte;
> spinlock_t *ptl;
> unsigned long pages = 0;
> - bool all_same_node = true;
> + bool all_same_nidpid = true;
> int last_nid = -1;
> + int last_pid = -1;
>
> pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> arch_enter_lazy_mmu_mode();
> @@ -64,10 +65,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> page = vm_normal_page(vma, addr, oldpte);
> if (page) {
> int this_nid = page_to_nid(page);
> + int nidpid = page_nidpid_last(page);
> + int this_pid = nidpid_to_pid(nidpid);
> +
> if (last_nid == -1)
> last_nid = this_nid;
> - if (last_nid != this_nid)
> - all_same_node = false;
> + if (last_pid == -1)
> + last_pid = this_pid;
> + if (last_nid != this_nid ||
> + last_pid != this_pid) {
> + all_same_nidpid = false;
> + }
At this point I would've expected something like:
int nidpid = page_nidpid_last(page);
int thisnid = nidpid_to_nid(nidpid);
int thispid = nidpid_to_pid(nidpid);
It seems 'weird' to mix the state like you did; is there a reason the
above is incorrect?
>
> if (!pte_numa(oldpte)) {
> ptent = pte_mknuma(ptent);
> @@ -106,7 +114,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> arch_leave_lazy_mmu_mode();
> pte_unmap_unlock(pte - 1, ptl);
>
> - *ret_all_same_node = all_same_node;
> + *ret_all_same_nidpid = all_same_nidpid;
> return pages;
> }
>
> @@ -133,7 +141,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> pmd_t *pmd;
> unsigned long next;
> unsigned long pages = 0;
> - bool all_same_node;
> + bool all_same_nidpid;
>
> pmd = pmd_offset(pud, addr);
> do {
> @@ -151,7 +159,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> if (pmd_none_or_clear_bad(pmd))
> continue;
> pages += change_pte_range(vma, pmd, addr, next, newprot,
> - dirty_accountable, prot_numa, &all_same_node);
> + dirty_accountable, prot_numa, &all_same_nidpid);
>
> /*
> * If we are changing protections for NUMA hinting faults then
> @@ -159,7 +167,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> * node. This allows a regular PMD to be handled as one fault
> * and effectively batches the taking of the PTL
> */
> - if (prot_numa && all_same_node)
> + if (prot_numa && all_same_nidpid)
> change_pmd_protnuma(vma->vm_mm, addr, pmd);
> } while (pmd++, addr = next, addr != end);
>
Hurmph, I just stumbled upon this PMD 'trick' and I'm not at all sure I
like it. If an application pre-faults/initializes its memory with the
main thread, we'll collapse it into PMDs and forever thereafter (by
virtue of do_pmd_numa_page()) they'll all stay the same, resulting in
PMD granularity.
It seems possible that concurrent faults can break it up, but the window
is tiny so I don't expect to actually see that happening.
In any case, this thing needs comments, both here in mprotect and near
do_pmd_numa_page().
On Mon, Jul 15, 2013 at 04:20:04PM +0100, Mel Gorman wrote:
> +++ b/kernel/sched/fair.c
> @@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
> if (!sched_feat_numa(NUMA))
> return;
>
> - /* FIXME: Allocate task-specific structure for placement policy here */
> + /* Allocate buffer to track faults on a per-node basis */
> + if (unlikely(!p->numa_faults)) {
> + int size = sizeof(*p->numa_faults) * nr_node_ids;
> +
> + p->numa_faults = kzalloc(size, GFP_KERNEL);
We should probably stick a __GFP_NOWARN in there.
> + if (!p->numa_faults)
> + return;
> + }
>
> /*
> * If pages are properly placed (did not migrate) then scan slower.
Subject: mm, numa: Change page last {nid,pid} into {cpu,pid}
From: Peter Zijlstra <[email protected]>
Date: Thu Jul 25 18:44:50 CEST 2013
Change the per-page last fault tracking to use cpu,pid instead of
nid,pid. This will allow us to try and look up the alternate task more
easily.
Signed-off-by: Peter Zijlstra <[email protected]>
---
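The packing keeps the same shape as the old nid,pid encoding; only the
upper field changes meaning, and a node can still be derived when one is
wanted. In short:

	/* cpupid layout: | cpu (NR_CPUS_BITS) | pid (LAST__PID_SHIFT bits) | */
	cpupid = ((cpu & LAST__CPU_MASK) << LAST__PID_SHIFT) |
		 (pid & LAST__PID_MASK);

	nid = cpu_to_node(cpupid_to_cpu(cpupid));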
include/linux/mm.h | 85 ++++++++++++++++++++------------------
include/linux/mm_types.h | 4 -
include/linux/page-flags-layout.h | 22 ++++-----
kernel/bounds.c | 4 +
kernel/sched/fair.c | 6 +-
mm/huge_memory.c | 8 +--
mm/memory.c | 17 ++++---
mm/mempolicy.c | 13 +++--
mm/migrate.c | 4 -
mm/mm_init.c | 18 ++++----
mm/mmzone.c | 14 +++---
mm/mprotect.c | 32 +++++++-------
mm/page_alloc.c | 4 -
13 files changed, 122 insertions(+), 109 deletions(-)
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -588,11 +588,11 @@ static inline pte_t maybe_mkwrite(pte_t
* sets it, so none of the operations on it need to be atomic.
*/
-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NIDPID_PGOFF (ZONES_PGOFF - LAST_NIDPID_WIDTH)
+#define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH)
/*
* Define the bit shifts to access each section. For non-existent
@@ -602,7 +602,7 @@ static inline pte_t maybe_mkwrite(pte_t
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NIDPID_PGSHIFT (LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
+#define LAST_CPUPID_PGSHIFT (LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))
/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
#ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -624,7 +624,7 @@ static inline pte_t maybe_mkwrite(pte_t
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NIDPID_MASK ((1UL << LAST_NIDPID_WIDTH) - 1)
+#define LAST_CPUPID_MASK ((1UL << LAST_CPUPID_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)
static inline enum zone_type page_zonenum(const struct page *page)
@@ -668,96 +668,101 @@ static inline int page_to_nid(const stru
#endif
#ifdef CONFIG_NUMA_BALANCING
-static inline int nid_pid_to_nidpid(int nid, int pid)
+static inline int cpu_pid_to_cpupid(int cpu, int pid)
{
- return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
+ return ((cpu & LAST__CPU_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
}
-static inline int nidpid_to_pid(int nidpid)
+static inline int cpupid_to_pid(int cpupid)
{
- return nidpid & LAST__PID_MASK;
+ return cpupid & LAST__PID_MASK;
}
-static inline int nidpid_to_nid(int nidpid)
+static inline int cpupid_to_cpu(int cpupid)
{
- return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+ return (cpupid >> LAST__PID_SHIFT) & LAST__CPU_MASK;
}
-static inline bool nidpid_pid_unset(int nidpid)
+static inline int cpupid_to_nid(int cpupid)
{
- return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+ return cpu_to_node(cpupid_to_cpu(cpupid));
}
-static inline bool nidpid_nid_unset(int nidpid)
+static inline bool cpupid_pid_unset(int cpupid)
{
- return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+ return cpupid_to_pid(cpupid) == (-1 & LAST__PID_MASK);
}
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-static inline int page_nidpid_xchg_last(struct page *page, int nid)
+static inline bool cpupid_cpu_unset(int cpupid)
{
- return xchg(&page->_last_nidpid, nid);
+ return cpupid_to_cpu(cpupid) == (-1 & LAST__CPU_MASK);
}
-static inline int page_nidpid_last(struct page *page)
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
{
- return page->_last_nidpid;
+ return xchg(&page->_last_cpupid, cpupid);
}
-static inline void page_nidpid_reset_last(struct page *page)
+
+static inline int page_cpupid_last(struct page *page)
+{
+ return page->_last_cpupid;
+}
+static inline void page_cpupid_reset_last(struct page *page)
{
- page->_last_nidpid = -1;
+ page->_last_cpupid = -1;
}
#else
-static inline int page_nidpid_last(struct page *page)
+static inline int page_cpupid_last(struct page *page)
{
- return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
+ return (page->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
}
-extern int page_nidpid_xchg_last(struct page *page, int nidpid);
+extern int page_cpupid_xchg_last(struct page *page, int cpupid);
-static inline void page_nidpid_reset_last(struct page *page)
+static inline void page_cpupid_reset_last(struct page *page)
{
- int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
+ int cpupid = (1 << LAST_CPUPID_SHIFT) - 1;
- page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
- page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
+ page->flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+ page->flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
}
-#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
-#else
-static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
+#endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
+#else /* !CONFIG_NUMA_BALANCING */
+static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
{
- return page_to_nid(page);
+ return page_to_nid(page); /* XXX */
}
-static inline int page_nidpid_last(struct page *page)
+static inline int page_cpupid_last(struct page *page)
{
- return page_to_nid(page);
+ return page_to_nid(page); /* XXX */
}
-static inline int nidpid_to_nid(int nidpid)
+static inline int cpupid_to_nid(int cpupid)
{
return -1;
}
-static inline int nidpid_to_pid(int nidpid)
+static inline int cpupid_to_pid(int cpupid)
{
return -1;
}
-static inline int nid_pid_to_nidpid(int nid, int pid)
+static inline int nid_pid_to_cpupid(int nid, int pid)
{
return -1;
}
-static inline bool nidpid_pid_unset(int nidpid)
+static inline bool cpupid_pid_unset(int cpupid)
{
return 1;
}
-static inline void page_nidpid_reset_last(struct page *page)
+static inline void page_cpupid_reset_last(struct page *page)
{
}
-#endif
+#endif /* CONFIG_NUMA_BALANCING */
static inline struct zone *page_zone(const struct page *page)
{
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
void *shadow;
#endif
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
- int _last_nidpid;
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+ int _last_cpupid;
#endif
}
/*
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -39,9 +39,9 @@
* lookup is necessary.
*
* No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
- * " plus space for last_nidpid: | NODE | ZONE | LAST_NIDPID ... | FLAGS |
+ * " plus space for last_cpupid: | NODE | ZONE | LAST_CPUPID ... | FLAGS |
* classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
+ * " plus space for last_cpupid: | SECTION | NODE | ZONE | LAST_CPUPID ... | FLAGS |
* classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
*/
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -65,18 +65,18 @@
#define LAST__PID_SHIFT 8
#define LAST__PID_MASK ((1 << LAST__PID_SHIFT)-1)
-#define LAST__NID_SHIFT NODES_SHIFT
-#define LAST__NID_MASK ((1 << LAST__NID_SHIFT)-1)
+#define LAST__CPU_SHIFT NR_CPUS_BITS
+#define LAST__CPU_MASK ((1 << LAST__CPU_SHIFT)-1)
-#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
+#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT+LAST__CPU_SHIFT)
#else
-#define LAST_NIDPID_SHIFT 0
+#define LAST_CPUPID_SHIFT 0
#endif
-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
#else
-#define LAST_NIDPID_WIDTH 0
+#define LAST_CPUPID_WIDTH 0
#endif
/*
@@ -87,8 +87,8 @@
#define NODE_NOT_IN_PAGE_FLAGS
#endif
-#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
-#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
+#define LAST_CPUPID_NOT_IN_PAGE_FLAGS
#endif
#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -10,6 +10,7 @@
#include <linux/mmzone.h>
#include <linux/kbuild.h>
#include <linux/page_cgroup.h>
+#include <linux/log2.h>
void foo(void)
{
@@ -17,5 +18,8 @@ void foo(void)
DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
DEFINE(NR_PCG_FLAGS, __NR_PCG_FLAGS);
+#ifdef CONFIG_SMP
+ DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
+#endif
/* End of constants */
}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1225,7 +1225,7 @@ static void task_numa_placement(struct t
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
+void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
{
struct task_struct *p = current;
int priv;
@@ -1241,8 +1241,8 @@ void task_numa_fault(int last_nidpid, in
* First accesses are treated as private, otherwise consider accesses
* to be private if the accessing pid has not changed
*/
- if (!nidpid_pid_unset(last_nidpid))
- priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+ if (!cpupid_pid_unset(last_cpupid))
+ priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
else
priv = 1;
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1293,7 +1293,7 @@ int do_huge_pmd_numa_page(struct mm_stru
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int page_nid = -1, account_nid = -1, this_nid = numa_node_id();
- int target_nid, last_nidpid;
+ int target_nid, last_cpupid;
bool migrated = false;
spin_lock(&mm->page_table_lock);
@@ -1315,7 +1315,7 @@ int do_huge_pmd_numa_page(struct mm_stru
if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
- last_nidpid = page_nidpid_last(page);
+ last_cpupid = page_cpupid_last(page);
target_nid = mpol_misplaced(page, vma, haddr, &account_nid);
if (target_nid == -1)
goto clear_pmdnuma;
@@ -1364,7 +1364,7 @@ int do_huge_pmd_numa_page(struct mm_stru
if (account_nid == -1)
account_nid = page_nid;
if (account_nid != -1)
- task_numa_fault(last_nidpid, account_nid, HPAGE_PMD_NR, migrated);
+ task_numa_fault(last_cpupid, account_nid, HPAGE_PMD_NR, migrated);
return 0;
}
@@ -1664,7 +1664,7 @@ static void __split_huge_page_refcount(s
page_tail->mapping = page->mapping;
page_tail->index = page->index + i;
- page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
+ page_cpupid_xchg_last(page_tail, page_cpupid_last(page));
BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -70,7 +70,7 @@
#include "internal.h"
#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
#endif
#ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3535,7 +3535,7 @@ int do_numa_page(struct mm_struct *mm, s
struct page *page = NULL;
spinlock_t *ptl;
int page_nid = -1, account_nid = -1;
- int target_nid, last_nidpid;
+ int target_nid, last_cpupid;
bool migrated = false;
/*
@@ -3569,7 +3569,7 @@ int do_numa_page(struct mm_struct *mm, s
return 0;
}
- last_nidpid = page_nidpid_last(page);
+ last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid, &account_nid);
pte_unmap_unlock(ptep, ptl);
@@ -3587,13 +3587,16 @@ int do_numa_page(struct mm_struct *mm, s
if (account_nid == -1)
account_nid = page_nid;
if (account_nid != -1)
- task_numa_fault(last_nidpid, account_nid, 1, migrated);
+ task_numa_fault(last_cpupid, account_nid, 1, migrated);
return 0;
}
/* NUMA hinting page fault entry point for regular pmds */
#ifdef CONFIG_NUMA_BALANCING
+/*
+ * See change_pmd_range().
+ */
static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
@@ -3603,7 +3606,7 @@ static int do_pmd_numa_page(struct mm_st
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
- int last_nidpid;
+ int last_cpupid;
spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3648,7 +3651,7 @@ static int do_pmd_numa_page(struct mm_st
if (unlikely(!page))
continue;
- last_nidpid = page_nidpid_last(page);
+ last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr,
page_nid, &account_nid);
@@ -3667,7 +3670,7 @@ static int do_pmd_numa_page(struct mm_st
if (account_nid == -1)
account_nid = page_nid;
if (account_nid != -1)
- task_numa_fault(last_nidpid, account_nid, 1, migrated);
+ task_numa_fault(last_cpupid, account_nid, 1, migrated);
cond_resched();
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2242,7 +2242,8 @@ int mpol_misplaced(struct page *page, st
struct zone *zone;
int curnid = page_to_nid(page);
unsigned long pgoff;
- int thisnid = numa_node_id();
+ int thiscpu = raw_smp_processor_id();
+ int thisnid = cpu_to_node(thiscpu);
int polnid = -1;
int ret = -1;
@@ -2291,10 +2292,10 @@ int mpol_misplaced(struct page *page, st
/* Migrate the page towards the node whose CPU is referencing it */
if (pol->flags & MPOL_F_MORON) {
- int last_nidpid;
- int this_nidpid;
+ int last_cpupid;
+ int this_cpupid;
- this_nidpid = nid_pid_to_nidpid(thisnid, current->pid);;
+ this_cpupid = cpu_pid_to_cpupid(thiscpu, current->pid);;
/*
* Multi-stage node selection is used in conjunction
@@ -2317,8 +2318,8 @@ int mpol_misplaced(struct page *page, st
* it less likely we act on an unlikely task<->page
* relation.
*/
- last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
- if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
+ last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+ if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid)
goto out;
/*
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1495,7 +1495,7 @@ static struct page *alloc_misplaced_dst_
__GFP_NOWARN) &
~GFP_IOFS, 0);
if (newpage)
- page_nidpid_xchg_last(newpage, page_nidpid_last(page));
+ page_cpupid_xchg_last(newpage, page_cpupid_last(page));
return newpage;
}
@@ -1672,7 +1672,7 @@ int migrate_misplaced_transhuge_page(str
if (!new_page)
goto out_fail;
- page_nidpid_xchg_last(new_page, page_nidpid_last(page));
+ page_cpupid_xchg_last(new_page, page_cpupid_last(page));
isolated = numamigrate_isolate_page(pgdat, page);
if (!isolated) {
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,26 +71,26 @@ void __init mminit_verify_pageflags_layo
unsigned long or_mask, add_mask;
shift = 8 * sizeof(unsigned long);
- width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
+ width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_CPUPID_SHIFT;
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
- "Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
+ "Section %d Node %d Zone %d Lastcpupid %d Flags %d\n",
SECTIONS_WIDTH,
NODES_WIDTH,
ZONES_WIDTH,
- LAST_NIDPID_WIDTH,
+ LAST_CPUPID_WIDTH,
NR_PAGEFLAGS);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
- "Section %d Node %d Zone %d Lastnidpid %d\n",
+ "Section %d Node %d Zone %d Lastcpupid %d\n",
SECTIONS_SHIFT,
NODES_SHIFT,
ZONES_SHIFT,
- LAST_NIDPID_SHIFT);
+ LAST_CPUPID_SHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
- "Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
+ "Section %lu Node %lu Zone %lu Lastcpupid %lu\n",
(unsigned long)SECTIONS_PGSHIFT,
(unsigned long)NODES_PGSHIFT,
(unsigned long)ZONES_PGSHIFT,
- (unsigned long)LAST_NIDPID_PGSHIFT);
+ (unsigned long)LAST_CPUPID_PGSHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
"Node/Zone ID: %lu -> %lu\n",
(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -102,9 +102,9 @@ void __init mminit_verify_pageflags_layo
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
"Node not in page flags");
#endif
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
- "Last nidpid not in page flags");
+ "Last cpupid not in page flags");
#endif
if (SECTIONS_WIDTH) {
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -97,20 +97,20 @@ void lruvec_init(struct lruvec *lruvec)
INIT_LIST_HEAD(&lruvec->lists[lru]);
}
-#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NID_NOT_IN_PAGE_FLAGS)
-int page_nidpid_xchg_last(struct page *page, int nidpid)
+#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPU_NOT_IN_PAGE_FLAGS)
+int page_cpupid_xchg_last(struct page *page, int cpupid)
{
unsigned long old_flags, flags;
- int last_nidpid;
+ int last_cpupid;
do {
old_flags = flags = page->flags;
- last_nidpid = page_nidpid_last(page);
+ last_cpupid = page_cpupid_last(page);
- flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
- flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
+ flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+ flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));
- return last_nidpid;
+ return last_cpupid;
}
#endif
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,14 @@ static inline pgprot_t pgprot_modify(pgp
static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
- int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
+ int dirty_accountable, int prot_numa, bool *ret_all_same_cpupid)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte, oldpte;
spinlock_t *ptl;
unsigned long pages = 0;
- bool all_same_nidpid = true;
- int last_nid = -1;
+ bool all_same_cpupid = true;
+ int last_cpu = -1;
int last_pid = -1;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
@@ -64,18 +64,18 @@ static unsigned long change_pte_range(st
page = vm_normal_page(vma, addr, oldpte);
if (page) {
- int this_nid = page_to_nid(page);
- int nidpid = page_nidpid_last(page);
- int this_pid = nidpid_to_pid(nidpid);
+ int cpupid = page_cpupid_last(page);
+ int this_cpu = cpupid_to_cpu(cpupid);
+ int this_pid = cpupid_to_pid(cpupid);
- if (last_nid == -1)
- last_nid = this_nid;
+ if (last_cpu == -1)
+ last_cpu = this_cpu;
if (last_pid == -1)
last_pid = this_pid;
- if (last_nid != this_nid ||
- last_pid != this_pid) {
- all_same_nidpid = false;
- }
+
+ if (last_cpu != this_cpu ||
+ last_pid != this_pid)
+ all_same_cpupid = false;
if (!pte_numa(oldpte)) {
ptent = pte_mknuma(ptent);
@@ -114,7 +114,7 @@ static unsigned long change_pte_range(st
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
- *ret_all_same_nidpid = all_same_nidpid;
+ *ret_all_same_cpupid = all_same_cpupid;
return pages;
}
@@ -141,7 +141,7 @@ static inline unsigned long change_pmd_r
pmd_t *pmd;
unsigned long next;
unsigned long pages = 0;
- bool all_same_nidpid;
+ bool all_same_cpupid;
pmd = pmd_offset(pud, addr);
do {
@@ -159,7 +159,7 @@ static inline unsigned long change_pmd_r
if (pmd_none_or_clear_bad(pmd))
continue;
pages += change_pte_range(vma, pmd, addr, next, newprot,
- dirty_accountable, prot_numa, &all_same_nidpid);
+ dirty_accountable, prot_numa, &all_same_cpupid);
/*
* If we are changing protections for NUMA hinting faults then
@@ -167,7 +167,7 @@ static inline unsigned long change_pmd_r
* node. This allows a regular PMD to be handled as one fault
* and effectively batches the taking of the PTL
*/
- if (prot_numa && all_same_nidpid)
+ if (prot_numa && all_same_cpupid)
change_pmd_protnuma(vma->vm_mm, addr, pmd);
} while (pmd++, addr = next, addr != end);
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -622,7 +622,7 @@ static inline int free_pages_check(struc
bad_page(page);
return 1;
}
- page_nidpid_reset_last(page);
+ page_cpupid_reset_last(page);
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
@@ -3944,7 +3944,7 @@ void __meminit memmap_init_zone(unsigned
mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
- page_nidpid_reset_last(page);
+ page_cpupid_reset_last(page);
SetPageReserved(page);
/*
* Mark the block movable so that blocks are reserved for
Subject: sched, numa: Use {cpu, pid} to create task groups for shared faults
From: Peter Zijlstra <[email protected]>
Date: Tue Jul 30 10:40:20 CEST 2013
A very simple/straightforward shared fault task grouping
implementation.
Concerns are that grouping on a single shared fault might be too
aggressive -- this only works because Mel is excluding DSOs for faults,
otherwise we'd have the world in a single group.
Future work could explore more complex means of picking groups. We
could for example track one group for the entire scan (using something
like PDM) and join it at the end of the scan if we deem it shared a
sufficient amount of memory.
Another avenue to explore is what to do with tasks where private faults
are predominant. Should we exclude them from the group or treat them as
secondary, creating a graded group that tries hardest to collate shared
tasks but also tries to move private tasks near them when possible?
Also, the grouping information is completely unused; it's up to future
patches to make use of it.
Signed-off-by: Peter Zijlstra <[email protected]>
---
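In outline: a fault is treated as shared when the pid recorded in the
page's last cpupid differs from the current task's pid; in that case the
task tries to merge its group with that of whatever is currently running
on the recorded CPU, and the smaller group always joins the larger one.
Condensed from the hunks below:

	/* task_numa_fault() */
	priv = (cpupid_to_pid(last_cpupid) == (p->pid & LAST__PID_MASK));
	if (!priv)
		task_numa_group(p, cpupid_to_cpu(last_cpupid),
				cpupid_to_pid(last_cpupid));

	/* task_numa_group(): the smaller group joins the bigger one */
	if (my_grp->nr_tasks > grp->nr_tasks)
		goto unlock;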
include/linux/sched.h | 4 +
kernel/sched/core.c | 4 +
kernel/sched/fair.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++----
3 files changed, 153 insertions(+), 11 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1338,6 +1338,10 @@ struct task_struct {
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
+ spinlock_t numa_lock; /* for numa_entry / numa_group */
+ struct list_head numa_entry;
+ struct numa_group *numa_group;
+
/*
* Exponential decaying average of faults on a per-node basis.
* Scheduling placement decisions are made based on the these counts.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1730,6 +1730,10 @@ static void __sched_fork(struct task_str
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+
+ spin_lock_init(&p->numa_lock);
+ INIT_LIST_HEAD(&p->numa_entry);
+ p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1160,6 +1160,17 @@ static void numa_migrate_preferred(struc
p->numa_migrate_retry = jiffies + HZ/10;
}
+struct numa_group {
+ atomic_t refcount;
+
+ spinlock_t lock; /* nr_tasks, tasks */
+ int nr_tasks;
+ struct list_head task_list;
+
+ struct rcu_head rcu;
+ atomic_long_t faults[0];
+};
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -1168,6 +1179,7 @@ static void task_numa_placement(struct t
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
return;
+
p->numa_scan_seq = seq;
p->numa_migrate_seq++;
p->numa_scan_period_max = task_scan_max(p);
@@ -1178,14 +1190,24 @@ static void task_numa_placement(struct t
int priv, i;
for (priv = 0; priv < 2; priv++) {
+ long diff;
+
i = task_faults_idx(nid, priv);
+ diff = -p->numa_faults[i];
+
/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
p->numa_faults[i] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
+ diff += p->numa_faults[i];
faults += p->numa_faults[i];
+
+ if (p->numa_group) {
+ /* safe because we can only change our own group */
+ atomic_long_add(diff, &p->numa_group->faults[i]);
+ }
}
if (faults > max_faults) {
@@ -1222,13 +1244,117 @@ static void task_numa_placement(struct t
}
}
+static inline int get_numa_group(struct numa_group *grp)
+{
+ return atomic_inc_not_zero(&grp->refcount);
+}
+
+static inline void put_numa_group(struct numa_group *grp)
+{
+ if (atomic_dec_and_test(&grp->refcount))
+ kfree_rcu(grp, rcu);
+}
+
+static void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+ if (l1 > l2)
+ swap(l1, l2);
+
+ spin_lock(l1);
+ spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
+void task_numa_group(struct task_struct *p, int cpu, int pid)
+{
+ struct task_struct *tsk;
+ struct numa_group *grp, *my_grp;
+ unsigned int size = sizeof(struct numa_group) +
+ 2*nr_node_ids*sizeof(atomic_long_t);
+ int i;
+
+ if (unlikely(!p->numa_group)) {
+ grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+ if (!grp)
+ return;
+
+ atomic_set(&grp->refcount, 1);
+ spin_lock_init(&grp->lock);
+ INIT_LIST_HEAD(&grp->task_list);
+
+ spin_lock(&p->numa_lock);
+ list_add(&p->numa_entry, &grp->task_list);
+ grp->nr_tasks++;
+ rcu_assign_pointer(p->numa_group, grp);
+ spin_unlock(&p->numa_lock);
+ }
+
+ rcu_read_lock();
+ tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
+
+ if ((tsk->pid & LAST__PID_MASK) != pid)
+ goto unlock;
+
+ grp = rcu_dereference(tsk->numa_group);
+ if (!grp)
+ goto unlock;
+
+ my_grp = p->numa_group;
+ if (grp == my_grp)
+ goto unlock;
+
+ /*
+ * Only join the other group if its bigger; if we're the bigger group,
+ * the other task will join us.
+ */
+ if (my_grp->nr_tasks > grp->nr_tasks)
+ goto unlock;
+
+ /*
+ * Tie-break on the grp address.
+ */
+ if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp)
+ goto unlock;
+
+ if (!get_numa_group(grp))
+ goto unlock;
+
+ rcu_read_unlock();
+
+ /* join with @grp */
+
+ for (i = 0; i < 2*nr_node_ids; i++) {
+ atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
+ atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+ }
+
+ spin_lock(&p->numa_lock);
+ double_lock(&my_grp->lock, &grp->lock);
+
+ list_move(&p->numa_entry, &grp->task_list);
+ my_grp->nr_tasks--;
+ grp->nr_tasks++;
+
+ spin_unlock(&my_grp->lock);
+ spin_unlock(&grp->lock);
+
+ rcu_assign_pointer(p->numa_group, grp);
+ spin_unlock(&p->numa_lock);
+
+ put_numa_group(my_grp);
+ return;
+
+
+unlock:
+ rcu_read_unlock();
+}
+
/*
* Got a PROT_NONE fault for a page on @node.
*/
void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
{
struct task_struct *p = current;
- int priv;
+ int priv, cpu, pid;
if (!sched_feat_numa(NUMA))
return;
@@ -1237,21 +1363,12 @@ void task_numa_fault(int last_cpupid, in
if (!p->mm)
return;
- /*
- * First accesses are treated as private, otherwise consider accesses
- * to be private if the accessing pid has not changed
- */
- if (!cpupid_pid_unset(last_cpupid))
- priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
- else
- priv = 1;
-
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
/* numa_faults and numa_faults_buffer share the allocation */
- p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
+ p->numa_faults = kzalloc(size * 2, GFP_KERNEL | __GFP_NOWARN);
if (!p->numa_faults)
return;
@@ -1260,6 +1377,23 @@ void task_numa_fault(int last_cpupid, in
}
/*
+ * First accesses are treated as private, otherwise consider accesses
+ * to be private if the accessing pid has not changed
+ */
+ if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) {
+ cpu = raw_smp_processor_id();
+ pid = p->pid & LAST__PID_MASK;
+ } else {
+ cpu = cpupid_to_cpu(last_cpupid);
+ pid = cpupid_to_pid(last_cpupid);
+ }
+
+ priv = (pid == (p->pid & LAST__PID_MASK));
+
+ if (!priv)
+ task_numa_group(p, cpu, pid);
+
+ /*
* If pages are properly placed (did not migrate) then scan slower.
* This is reset periodically in case of phase changes
*
On Wed, Jul 17, 2013 at 12:50:30PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 15, 2013 at 04:20:04PM +0100, Mel Gorman wrote:
> > index cc03cfd..c5f773d 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -503,6 +503,17 @@ DECLARE_PER_CPU(struct rq, runqueues);
> > #define cpu_curr(cpu) (cpu_rq(cpu)->curr)
> > #define raw_rq() (&__raw_get_cpu_var(runqueues))
> >
> > +#ifdef CONFIG_NUMA_BALANCING
> > +static inline void task_numa_free(struct task_struct *p)
> > +{
> > + kfree(p->numa_faults);
> > +}
> > +#else /* CONFIG_NUMA_BALANCING */
> > +static inline void task_numa_free(struct task_struct *p)
> > +{
> > +}
> > +#endif /* CONFIG_NUMA_BALANCING */
> > +
> > #ifdef CONFIG_SMP
> >
> > #define rcu_dereference_check_sched_domain(p) \
>
>
> I also need the below hunk to make it compile:
>
Weird, I do not see the same problem so it's something .config specific.
Can you send me the .config you used please?
--
Mel Gorman
SUSE Labs
On Mon, Jul 29, 2013 at 12:10:59PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 15, 2013 at 04:20:04PM +0100, Mel Gorman wrote:
> > +++ b/kernel/sched/fair.c
> > @@ -815,7 +815,14 @@ void task_numa_fault(int node, int pages, bool migrated)
> > if (!sched_feat_numa(NUMA))
> > return;
> >
> > - /* FIXME: Allocate task-specific structure for placement policy here */
> > + /* Allocate buffer to track faults on a per-node basis */
> > + if (unlikely(!p->numa_faults)) {
> > + int size = sizeof(*p->numa_faults) * nr_node_ids;
> > +
> > + p->numa_faults = kzalloc(size, GFP_KERNEL);
>
> We should probably stick a __GFP_NOWARN in there.
>
Yes.
--
Mel Gorman
SUSE Labs
On Wed, Jul 17, 2013 at 01:00:53PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 15, 2013 at 04:20:06PM +0100, Mel Gorman wrote:
> > The zero page is not replicated between nodes and is often shared
> > between processes. The data is read-only and likely to be cached in
> > local CPUs if heavily accessed meaning that the remote memory access
> > cost is less of a concern. This patch stops accounting for numa hinting
> > faults on the zero page in both terms of counting faults and scheduling
> > tasks on nodes.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/huge_memory.c | 9 +++++++++
> > mm/memory.c | 7 ++++++-
> > 2 files changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e4a79fa..ec938ed 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1302,6 +1302,15 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
> >
> > page = pmd_page(pmd);
> > get_page(page);
> > +
> > + /*
> > + * Do not account for faults against the huge zero page. The read-only
> > + * data is likely to be read-cached on the local CPUs and it is less
> > + * useful to know about local versus remote hits on the zero page.
> > + */
> > + if (is_huge_zero_pfn(page_to_pfn(page)))
> > + goto clear_pmdnuma;
> > +
> > src_nid = numa_node_id();
> > count_vm_numa_event(NUMA_HINT_FAULTS);
> > if (src_nid == page_to_nid(page))
>
> And because of:
>
> 5918d10 thp: fix huge zero page logic for page with pfn == 0
>
Yes. Thanks.
--
Mel Gorman
SUSE Labs
On Thu, Jul 25, 2013 at 12:40:09PM +0200, Peter Zijlstra wrote:
>
> Subject: sched, numa: migrates_degrades_locality()
> From: Peter Zijlstra <[email protected]>
> Date: Mon Jul 22 14:02:54 CEST 2013
>
> It just makes heaps of sense; so add it and make both it and
> migrate_improve_locality() a sched_feat().
>
Ok. I'll be splitting this patch and merging part of it into "sched:
Favour moving tasks towards the preferred node" and keeping the
degrades_locality as a separate patch. I'm also not a fan of the
tunables names NUMA_FAULTS_UP and NUMA_FAULTS_DOWN because it is hard to
guess what they mean. NUMA_FAVOUR_HIGHER, NUMA_RESIST_LOWER?
The change to just the parent patch is as follows. task_faults() is
not introduced at this point in the series, which is why it is still missing.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 78bfbea..5ea3afe 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3978,8 +3978,10 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
{
int src_nid, dst_nid;
- if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+	if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults ||
+	    !(env->sd->flags & SD_NUMA)) {
return false;
+ }
src_nid = cpu_to_node(env->src_cpu);
dst_nid = cpu_to_node(env->dst_cpu);
@@ -3988,7 +3990,7 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
return false;
- if (p->numa_preferred_nid == dst_nid)
+ if (p->numa_faults[dst_nid] > p->numa_faults[src_nid])
return true;
return false;
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 99399f8..97a1136 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -69,4 +69,11 @@ SCHED_FEAT(LB_MIN, false)
#ifdef CONFIG_NUMA_BALANCING
SCHED_FEAT(NUMA, false)
SCHED_FEAT(NUMA_FORCE, false)
+
+/*
+ * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a
+ * higher number of hinting faults are recorded during active load
+ * balancing.
+ */
+SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
#endif
--
Mel Gorman
SUSE Labs
On Wed, Jul 31, 2013 at 09:44:11AM +0100, Mel Gorman wrote:
> On Thu, Jul 25, 2013 at 12:40:09PM +0200, Peter Zijlstra wrote:
> >
> > Subject: sched, numa: migrates_degrades_locality()
> > From: Peter Zijlstra <[email protected]>
> > Date: Mon Jul 22 14:02:54 CEST 2013
> >
> > It just makes heaps of sense; so add it and make both it and
> > migrate_improve_locality() a sched_feat().
> >
>
> Ok. I'll be splitting this patch and merging part of it into "sched:
> Favour moving tasks towards the preferred node" and keeping the
> degrades_locality as a separate patch. I'm also not a fan of the
> tunables names NUMA_FAULTS_UP and NUMA_FAULTS_DOWN because it is hard to
> guess what they mean. NUMA_FAVOUR_HIGHER, NUMA_RESIST_LOWER?
Sure, I don't much care about the names.. ideally you'd never use them
anyway ;-)
On Wed, Jul 17, 2013 at 09:31:05AM +0800, Hillf Danton wrote:
> On Mon, Jul 15, 2013 at 11:20 PM, Mel Gorman <[email protected]> wrote:
> > +static int
> > +find_idlest_cpu_node(int this_cpu, int nid)
> > +{
> > + unsigned long load, min_load = ULONG_MAX;
> > + int i, idlest_cpu = this_cpu;
> > +
> > + BUG_ON(cpu_to_node(this_cpu) == nid);
> > +
> > + rcu_read_lock();
> > + for_each_cpu(i, cpumask_of_node(nid)) {
>
> Check allowed CPUs first if task is given?
>
If the task is not allowed to run on the CPUs for that node then how
were the NUMA hinting faults recorded?
--
Mel Gorman
SUSE Labs
On Wed, Jul 17, 2013 at 10:17:29AM +0800, Hillf Danton wrote:
> On Mon, Jul 15, 2013 at 11:20 PM, Mel Gorman <[email protected]> wrote:
> > /*
> > * Got a PROT_NONE fault for a page on @node.
> > */
> > -void task_numa_fault(int node, int pages, bool migrated)
> > +void task_numa_fault(int last_nid, int node, int pages, bool migrated)
>
> For what is the new parameter?
>
To weigh the fault more heavily if the page was migrated due to being
improperly placed at fault time.
--
Mel Gorman
SUSE Labs
On Wed, Jul 17, 2013 at 01:22:22PM +0800, Sam Ben wrote:
> On 07/15/2013 11:20 PM, Mel Gorman wrote:
> >Currently automatic NUMA balancing is unable to distinguish between false
> >shared versus private pages except by ignoring pages with an elevated
>
> What's the meaning of false shared?
>
Two tasks may be operating on a shared buffer that is not aligned. It is
expected that they will at least cache-align it to avoid CPU cache line
bouncing, but the buffers are not necessarily page aligned. A page is the
minimum granularity at which we can track NUMA hinting faults, so two tasks
sharing such a page will appear to be sharing data when in fact they are not.
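To make that concrete, here is a minimal userspace sketch (not part of the
series) of the scenario described above: two threads work on their own
cache-line-aligned halves of one buffer, so there is no sharing at cache-line
granularity, yet both halves live on the same page and page-granular NUMA
hinting fault accounting will see the page as shared.

#include <pthread.h>
#include <stdio.h>

/* One page holds two independent, cache-line-aligned work areas. */
struct shared_buf {
	char a[64] __attribute__((aligned(64)));	/* used only by thread A */
	char b[64] __attribute__((aligned(64)));	/* used only by thread B */
};

static struct shared_buf buf __attribute__((aligned(4096)));

static void *worker(void *arg)
{
	char *area = arg;
	int i;

	for (i = 0; i < 1000000; i++)
		area[i % 64]++;
	return NULL;
}

int main(void)
{
	pthread_t ta, tb;

	/*
	 * No cache line bounces between the threads, but both work areas
	 * share one 4K page, so fault accounting at page granularity sees
	 * "shared" accesses.
	 */
	pthread_create(&ta, NULL, worker, buf.a);
	pthread_create(&tb, NULL, worker, buf.b);
	pthread_join(ta, NULL);
	pthread_join(tb, NULL);
	printf("a[0]=%d b[0]=%d\n", buf.a[0], buf.b[0]);
	return 0;
}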
--
Mel Gorman
SUSE Labs
On Wed, Jul 17, 2013 at 09:53:53PM -0400, Rik van Riel wrote:
> On Mon, 15 Jul 2013 16:20:17 +0100
> Mel Gorman <[email protected]> wrote:
>
> > Ideally it would be possible to distinguish between NUMA hinting faults that
> > are private to a task and those that are shared. If treated identically
> > there is a risk that shared pages bounce between nodes depending on
>
> Your patch 15 breaks the compile with !CONFIG_NUMA_BALANCING.
>
> This little patch fixes it:
>
Sloppy of me. Thanks.
--
Mel Gorman
SUSE Labs
On Fri, Jul 26, 2013 at 01:20:50PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 15, 2013 at 04:20:17PM +0100, Mel Gorman wrote:
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index cacc64a..04c9469 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -37,14 +37,15 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)
> >
> > static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > unsigned long addr, unsigned long end, pgprot_t newprot,
> > - int dirty_accountable, int prot_numa, bool *ret_all_same_node)
> > + int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > pte_t *pte, oldpte;
> > spinlock_t *ptl;
> > unsigned long pages = 0;
> > - bool all_same_node = true;
> > + bool all_same_nidpid = true;
> > int last_nid = -1;
> > + int last_pid = -1;
> >
> > pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> > arch_enter_lazy_mmu_mode();
> > @@ -64,10 +65,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > page = vm_normal_page(vma, addr, oldpte);
> > if (page) {
> > int this_nid = page_to_nid(page);
> > + int nidpid = page_nidpid_last(page);
> > + int this_pid = nidpid_to_pid(nidpid);
> > +
> > if (last_nid == -1)
> > last_nid = this_nid;
> > - if (last_nid != this_nid)
> > - all_same_node = false;
> > + if (last_pid == -1)
> > + last_pid = this_pid;
> > + if (last_nid != this_nid ||
> > + last_pid != this_pid) {
> > + all_same_nidpid = false;
> > + }
>
> At this point I would've expected something like:
>
> int nidpid = page_nidpid_last(page);
> int thisnid = nidpid_to_nid(nidpid);
> 	int thispid = nidpid_to_pid(nidpid);
>
> It seems 'weird' to mix the state like you did; is there a reason the
> above is incorrect?
>
No there isn't and it looks like a brain fart. I've changed it to what
you suggested.
> >
> > if (!pte_numa(oldpte)) {
> > ptent = pte_mknuma(ptent);
> > @@ -106,7 +114,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > arch_leave_lazy_mmu_mode();
> > pte_unmap_unlock(pte - 1, ptl);
> >
> > - *ret_all_same_node = all_same_node;
> > + *ret_all_same_nidpid = all_same_nidpid;
> > return pages;
> > }
> >
> > @@ -133,7 +141,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> > pmd_t *pmd;
> > unsigned long next;
> > unsigned long pages = 0;
> > - bool all_same_node;
> > + bool all_same_nidpid;
> >
> > pmd = pmd_offset(pud, addr);
> > do {
> > @@ -151,7 +159,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> > if (pmd_none_or_clear_bad(pmd))
> > continue;
> > pages += change_pte_range(vma, pmd, addr, next, newprot,
> > - dirty_accountable, prot_numa, &all_same_node);
> > + dirty_accountable, prot_numa, &all_same_nidpid);
> >
> > /*
> > * If we are changing protections for NUMA hinting faults then
> > @@ -159,7 +167,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> > * node. This allows a regular PMD to be handled as one fault
> > * and effectively batches the taking of the PTL
> > */
> > - if (prot_numa && all_same_node)
> > + if (prot_numa && all_same_nidpid)
> > change_pmd_protnuma(vma->vm_mm, addr, pmd);
> > } while (pmd++, addr = next, addr != end);
> >
>
> Hurmph I just stumbled upon this PMD 'trick' and I'm not at all sure I
> like it. If an application would pre-fault/initialize its memory with
> the main thread we'll collapse it into a PMDs and forever thereafter (by
> virtue of do_pmd_numa_page()) they'll all stay the same. Resulting in
> PMD granularity.
>
Potentially yes. When that PMD trick was introduced it was because the cost
of faults was very high due to a high scanning rate. The trick mitigated
worst-case scenarios until faults were properly accounted for and the scan
rates were better controlled. As these *should* be addressed by the series
I think I will be adding a patch to kick away this PMD crutch and see how
it looks in profiles.
--
Mel Gorman
SUSE Labs
On Wed, Jul 31, 2013 at 10:29:38AM +0100, Mel Gorman wrote:
> > Hurmph I just stumbled upon this PMD 'trick' and I'm not at all sure I
> > like it. If an application would pre-fault/initialize its memory with
> > the main thread we'll collapse it into a PMDs and forever thereafter (by
> > virtue of do_pmd_numa_page()) they'll all stay the same. Resulting in
> > PMD granularity.
> >
>
> Potentially yes. When that PMD trick was introduced it was because the cost
> of faults was very high due to a high scanning rate. The trick mitigated
> worst-case scenarios until faults were properly accounted for and the scan
> rates were better controlled. As these *should* be addressed by the series
> I think I will be adding a patch to kick away this PMD crutch and see how
> it looks in profiles.
I've been thinking on this a bit and I think we should split these and
thp pages when we get shared faults from different nodes on them and
refuse thp collapses when the pages are on different nodes.
With the exception that when we introduce the interleave mempolicies we
should define 'different node' as being outside of the interleave mask.
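As a rough illustration of the "refuse collapses across nodes" half of that
idea (nothing from the series; the helper and its name are invented),
khugepaged could bail out when the candidate base pages span more than one
node:

/*
 * Sketch only: refuse a THP collapse when the base pages being collapsed
 * do not all live on one node (or, once interleave policies exist, when
 * they fall outside the interleave mask).
 */
static bool thp_collapse_nodes_ok(struct page **pages, int nr)
{
	int i, nid = page_to_nid(pages[0]);

	for (i = 1; i < nr; i++) {
		if (page_to_nid(pages[i]) != nid)
			return false;
	}

	return true;
}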
* Mel Gorman <[email protected]> [2013-07-31 10:07:27]:
> On Wed, Jul 17, 2013 at 09:31:05AM +0800, Hillf Danton wrote:
> > On Mon, Jul 15, 2013 at 11:20 PM, Mel Gorman <[email protected]> wrote:
> > > +static int
> > > +find_idlest_cpu_node(int this_cpu, int nid)
> > > +{
> > > + unsigned long load, min_load = ULONG_MAX;
> > > + int i, idlest_cpu = this_cpu;
> > > +
> > > + BUG_ON(cpu_to_node(this_cpu) == nid);
> > > +
> > > + rcu_read_lock();
> > > + for_each_cpu(i, cpumask_of_node(nid)) {
> >
> > Check allowed CPUs first if task is given?
> >
>
> If the task is not allowed to run on the CPUs for that node then how
> were the NUMA hinting faults recorded?
>
But still we could check if the task is allowed to run on a cpu before we
capture the load of the cpu. This would avoid us trying to select a cpu
whose load is low but which cannot run this task.
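A minimal sketch of that suggestion, assuming the task were passed in as
Hillf implied; weighted_cpuload() stands in for whatever load metric the
real function uses:

static int
find_idlest_cpu_node(struct task_struct *p, int this_cpu, int nid)
{
	unsigned long load, min_load = ULONG_MAX;
	int i, idlest_cpu = this_cpu;

	rcu_read_lock();
	for_each_cpu(i, cpumask_of_node(nid)) {
		/* Do not pick a CPU the task is not allowed to run on */
		if (!cpumask_test_cpu(i, tsk_cpus_allowed(p)))
			continue;

		load = weighted_cpuload(i);
		if (load < min_load) {
			min_load = load;
			idlest_cpu = i;
		}
	}
	rcu_read_unlock();

	return idlest_cpu;
}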
> --
> Mel Gorman
> SUSE Labs
>
--
Thanks and Regards
Srikar Dronamraju
On Wed, Jul 17, 2013 at 12:54:23PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 15, 2013 at 04:20:18PM +0100, Mel Gorman wrote:
> > +static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
>
> And this
> -- which suggests you always build with cgroups enabled?
Yes, the test kernel configuration is one taken from an opensuse kernel
with a bunch of unnecessary drivers removed.
> I generally
> try and disable all that nonsense when building new stuff, the scheduler is a
> 'lot' simpler that way. Once that works make it 'interesting' again.
>
Understood. I'll disable CONFIG_CGROUPS in the next round of testing which
will be based against 3.11-rc3 once I plough through this set of feedback.
Thanks.
--
Mel Gorman
SUSE Labs
On Thu, Jul 25, 2013 at 12:33:52PM +0200, Peter Zijlstra wrote:
>
> Subject: stop_machine: Introduce stop_two_cpus()
> From: Peter Zijlstra <[email protected]>
> Date: Sun Jul 21 12:24:09 CEST 2013
>
> Introduce stop_two_cpus() in order to allow controlled swapping of two
> tasks. It repurposes the stop_machine() state machine but only stops
> the two cpus which we can do with on-stack structures and avoid
> machine wide synchronization issues.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
Clever! I did not spot any problems so will be pulling this (and
presumably the next patch) into the series. Thanks!
--
Mel Gorman
SUSE Labs
On Wed, Jul 31, 2013 at 11:03:31AM +0100, Mel Gorman wrote:
> On Thu, Jul 25, 2013 at 12:33:52PM +0200, Peter Zijlstra wrote:
> >
> > Subject: stop_machine: Introduce stop_two_cpus()
> > From: Peter Zijlstra <[email protected]>
> > Date: Sun Jul 21 12:24:09 CEST 2013
> >
> > Introduce stop_two_cpus() in order to allow controlled swapping of two
> > tasks. It repurposes the stop_machine() state machine but only stops
> > the two cpus which we can do with on-stack structures and avoid
> > machine wide synchronization issues.
> >
> > Signed-off-by: Peter Zijlstra <[email protected]>
>
> Clever! I did not spot any problems so will be pulling this (and
> presumably the next patch) into the series. Thanks!
You mean aside from the glaring lack of hotplug handling? :-)
On Wed, Jul 31, 2013 at 12:05:05PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 31, 2013 at 11:03:31AM +0100, Mel Gorman wrote:
> > On Thu, Jul 25, 2013 at 12:33:52PM +0200, Peter Zijlstra wrote:
> > >
> > > Subject: stop_machine: Introduce stop_two_cpus()
> > > From: Peter Zijlstra <[email protected]>
> > > Date: Sun Jul 21 12:24:09 CEST 2013
> > >
> > > Introduce stop_two_cpus() in order to allow controlled swapping of two
> > > tasks. It repurposes the stop_machine() state machine but only stops
> > > the two cpus which we can do with on-stack structures and avoid
> > > machine wide synchronization issues.
> > >
> > > Signed-off-by: Peter Zijlstra <[email protected]>
> >
> > Clever! I did not spot any problems so will be pulling this (and
> > presumably the next patch) into the series. Thanks!
>
> You mean aside from the glaring lack of hotplug handling? :-)
Other than that which the following patch called out anyway :)
--
Mel Gorman
SUSE Labs
On Wed, Jul 31, 2013 at 11:34:37AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 31, 2013 at 10:29:38AM +0100, Mel Gorman wrote:
> > > Hurmph I just stumbled upon this PMD 'trick' and I'm not at all sure I
> > > like it. If an application would pre-fault/initialize its memory with
> > > the main thread we'll collapse it into a PMDs and forever thereafter (by
> > > virtue of do_pmd_numa_page()) they'll all stay the same. Resulting in
> > > PMD granularity.
> > >
> >
> > Potentially yes. When that PMD trick was introduced it was because the cost
> > of faults was very high due to a high scanning rate. The trick mitigated
> > worst-case scenarios until faults were properly accounted for and the scan
> > rates were better controlled. As these *should* be addressed by the series
> > I think I will be adding a patch to kick away this PMD crutch and see how
> > it looks in profiles.
>
> I've been thinking on this a bit and I think we should split these and
> thp pages when we get shared faults from different nodes on them and
> refuse thp collapses when the pages are on different nodes.
>
Agreed, I reached the same conclusion when thinking about THP false sharing
just before I went on holiday. The first prototype patch was a bit messy
and performed very badly so "Handle false sharing of THP" was chucked onto
the TODO pile to worry about when I got back. It also collided a little with
the PMD handling of base pages which is another reason to get rid of that.
> With the exception that when we introduce the interleave mempolicies we
> should define 'different node' as being outside of the interleave mask.
Understood.
--
Mel Gorman
SUSE Labs
On Thu, Jul 25, 2013 at 12:36:20PM +0200, Peter Zijlstra wrote:
>
> Subject: sched, numa: Break stuff..
> From: Peter Zijlstra <[email protected]>
> Date: Tue Jul 23 14:58:41 CEST 2013
>
> This patch is mostly a comment in code. I don't believe the current
> scan period adjustment scheme can work properly nor do I think it a
> good idea to ratelimit the numa faults as a whole based on migration.
>
> Reasons are in the modified comments...
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> kernel/sched/fair.c | 41 ++++++++++++++++++++++++++++++++---------
> 1 file changed, 32 insertions(+), 9 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1108,7 +1108,6 @@ static void task_numa_placement(struct t
>
> /* Preferred node as the node with the most faults */
> if (max_faults && max_nid != p->numa_preferred_nid) {
> - int old_migrate_seq = p->numa_migrate_seq;
>
> /* Queue task on preferred node if possible */
> p->numa_preferred_nid = max_nid;
> @@ -1116,14 +1115,19 @@ static void task_numa_placement(struct t
> numa_migrate_preferred(p);
>
> /*
> + int old_migrate_seq = p->numa_migrate_seq;
> + *
> * If preferred nodes changes frequently then the scan rate
> * will be continually high. Mitigate this by increasing the
> * scan rate only if the task was settled.
> - */
> + *
> + * APZ: disabled because we don't lower it again :/
> + *
> if (old_migrate_seq >= sysctl_numa_balancing_settle_count) {
> p->numa_scan_period = max(p->numa_scan_period >> 1,
> task_scan_min(p));
> }
> + */
> }
> }
>
I'm not sure I understand your point. The scan rate is decreased again if
the page is found to be properly placed in the future. It's in the next
hunk you modify although the periodically reset comment is now out of date.
> @@ -1167,10 +1171,20 @@ void task_numa_fault(int last_nidpid, in
> /*
> * If pages are properly placed (did not migrate) then scan slower.
> * This is reset periodically in case of phase changes
> - */
> - if (!migrated)
> + *
> + * APZ: it seems to me that one can get a ton of !migrated faults;
> + * consider the scenario where two threads fight over a shared memory
> + * segment. We'll win half the faults, half of that will be local, half
> + * of that will be remote. This means we'll see 1/4-th of the total
> + * memory being !migrated. Using a fixed increment will completely
> + * flatten the scan speed for a sufficiently large workload. Another
> + * scenario is due to that migration rate limit.
> + *
> + if (!migrated) {
> p->numa_scan_period = min(p->numa_scan_period_max,
> p->numa_scan_period + jiffies_to_msecs(10));
> + }
> + */
FWIW, I'm also not happy with how the scan rate is reduced but did not
come up with a better alternative that was not fragile or depended on
gathering too much state. Granted, I also have not been treating it as a
high priority problem.
>
> task_numa_placement(p);
>
> @@ -1216,12 +1230,15 @@ void task_numa_work(struct callback_head
> if (p->flags & PF_EXITING)
> return;
>
> +#if 0
> /*
> * We do not care about task placement until a task runs on a node
> * other than the first one used by the address space. This is
> * largely because migrations are driven by what CPU the task
> * is running on. If it's never scheduled on another node, it'll
> * not migrate so why bother trapping the fault.
> + *
> + * APZ: seems like a bad idea for pure shared memory workloads.
> */
> if (mm->first_nid == NUMA_PTE_SCAN_INIT)
> mm->first_nid = numa_node_id();
At some point in the past scan starts were based on waiting a fixed interval
but that seemed like a hack designed to get around hurting kernel compile
benchmarks. I'll give it more thought and see can I think of a better
alternative that is based on an event but not this event.
> @@ -1233,6 +1250,7 @@ void task_numa_work(struct callback_head
>
> mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
> }
> +#endif
>
> /*
> * Enforce maximal scan/migration frequency..
> @@ -1254,9 +1272,14 @@ void task_numa_work(struct callback_head
> * Do not set pte_numa if the current running node is rate-limited.
> * This loses statistics on the fault but if we are unwilling to
> * migrate to this node, it is less likely we can do useful work
> - */
> + *
> + * APZ: seems like a bad idea; even if this node can't migrate anymore
> + * other nodes might and we want up-to-date information to do balance
> + * decisions.
> + *
> if (migrate_ratelimited(numa_node_id()))
> return;
> + */
>
Ingo also disliked this but I wanted to avoid a situation where the
workload suffered because of a corner case where the interconnect was
filled with migration traffic.
> start = mm->numa_scan_offset;
> pages = sysctl_numa_balancing_scan_size;
> @@ -1297,10 +1320,10 @@ void task_numa_work(struct callback_head
>
> out:
> /*
> - * It is possible to reach the end of the VMA list but the last few VMAs are
> - * not guaranteed to the vma_migratable. If they are not, we would find the
> - * !migratable VMA on the next scan but not reset the scanner to the start
> - * so check it now.
> + * It is possible to reach the end of the VMA list but the last few
> + * VMAs are not guaranteed to the vma_migratable. If they are not, we
> + * would find the !migratable VMA on the next scan but not reset the
> + * scanner to the start so check it now.
> */
> if (vma)
> mm->numa_scan_offset = start;
Will fix.
--
Mel Gorman
SUSE Labs
On Wed, Jul 31, 2013 at 11:30:52AM +0100, Mel Gorman wrote:
> I'm not sure I understand your point. The scan rate is decreased again if
> the page is found to be properly placed in the future. It's in the next
> hunk you modify although the periodically reset comment is now out of date.
Yeah its because of the next hunk. I figured that if we don't lower it,
we shouldn't raise it either.
> > @@ -1167,10 +1171,20 @@ void task_numa_fault(int last_nidpid, in
> > /*
> > * If pages are properly placed (did not migrate) then scan slower.
> > * This is reset periodically in case of phase changes
> > - */
> > - if (!migrated)
> > + *
> > + * APZ: it seems to me that one can get a ton of !migrated faults;
> > + * consider the scenario where two threads fight over a shared memory
> > + * segment. We'll win half the faults, half of that will be local, half
> > + * of that will be remote. This means we'll see 1/4-th of the total
> > + * memory being !migrated. Using a fixed increment will completely
> > + * flatten the scan speed for a sufficiently large workload. Another
> > + * scenario is due to that migration rate limit.
> > + *
> > + if (!migrated) {
> > p->numa_scan_period = min(p->numa_scan_period_max,
> > p->numa_scan_period + jiffies_to_msecs(10));
> > + }
> > + */
>
> FWIW, I'm also not happy with how the scan rate is reduced but did not
> come up with a better alternative that was not fragile or depended on
> gathering too much state. Granted, I also have not been treating it as a
> high priority problem.
Right, so what Ingo did is have the scan rate depend on the convergence.
What exactly did you dislike about that?
We could define the convergence as all the faults inside the interleave
mask vs the total faults, and then run at: min + (1 - c)*(max-min).
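Expressed as code, that interpolation might look like the sketch below;
the names are placeholders rather than anything from a patch, and integer
math is used the way the kernel would:

/*
 * c = faults inside the interleave mask / total faults,
 * value = min + (1 - c) * (max - min).
 */
static unsigned long numa_convergence_interp(unsigned long faults_in_mask,
					     unsigned long faults_total,
					     unsigned long p_min,
					     unsigned long p_max)
{
	/* No samples yet: treat as completely unconverged */
	if (!faults_total)
		return p_max;

	return p_min + (p_max - p_min) *
			(faults_total - faults_in_mask) / faults_total;
}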
> > +#if 0
> > /*
> > * We do not care about task placement until a task runs on a node
> > * other than the first one used by the address space. This is
> > * largely because migrations are driven by what CPU the task
> > * is running on. If it's never scheduled on another node, it'll
> > * not migrate so why bother trapping the fault.
> > + *
> > + * APZ: seems like a bad idea for pure shared memory workloads.
> > */
> > if (mm->first_nid == NUMA_PTE_SCAN_INIT)
> > mm->first_nid = numa_node_id();
>
> At some point in the past scan starts were based on waiting a fixed interval
> but that seemed like a hack designed to get around hurting kernel compile
> benchmarks. I'll give it more thought and see can I think of a better
> alternative that is based on an event but not this event.
Ah, well the reasoning on that was that all this NUMA business is
'expensive' so we'd better only bother with tasks that persist long
enough for it to pay off.
In that regard it makes perfect sense to wait a fixed amount of runtime
before we start scanning.
So it was not a pure hack to make kbuild work again.. that it did so was
good though.
> > @@ -1254,9 +1272,14 @@ void task_numa_work(struct callback_head
> > * Do not set pte_numa if the current running node is rate-limited.
> > * This loses statistics on the fault but if we are unwilling to
> > * migrate to this node, it is less likely we can do useful work
> > - */
> > + *
> > + * APZ: seems like a bad idea; even if this node can't migrate anymore
> > + * other nodes might and we want up-to-date information to do balance
> > + * decisions.
> > + *
> > if (migrate_ratelimited(numa_node_id()))
> > return;
> > + */
> >
>
> Ingo also disliked this but I wanted to avoid a situation where the
> workload suffered because of a corner case where the interconnect was
> filled with migration traffic.
Right, but you already rate limit the actual migrations, this should
leave enough bandwidth to allow the non-migrating scanning.
I think its important we keep up-to-date information if we're going to
do placement based on it.
On that rate-limit, this looks to be a hard-coded number unrelated to
the actual hardware. I think we should at the very least make it a
configurable number and preferably scale the number with the SLIT info.
Or alternatively actually measure the node to node bandwidth.
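One way to read "scale with the SLIT info" is sketched below; it is purely
illustrative, with ratelimit_pages standing in for the existing hard-coded
limit and a farther node getting a proportionally smaller budget:

/*
 * Sketch only: scale the migration rate limit by the SLIT distance to
 * the destination node, relative to LOCAL_DISTANCE.
 */
static unsigned long numa_migrate_budget(int src_nid, int dst_nid)
{
	return ratelimit_pages * LOCAL_DISTANCE / node_distance(src_nid, dst_nid);
}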
On Thu, Jul 25, 2013 at 12:38:45PM +0200, Peter Zijlstra wrote:
>
> Subject: mm, numa: Sanitize task_numa_fault() callsites
> From: Peter Zijlstra <[email protected]>
> Date: Mon Jul 22 10:42:38 CEST 2013
>
> There are three callers of task_numa_fault():
>
> - do_huge_pmd_numa_page():
> Accounts against the current node, not the node where the
> page resides, unless we migrated, in which case it accounts
> against the node we migrated to.
>
> - do_numa_page():
> Accounts against the current node, not the node where the
> page resides, unless we migrated, in which case it accounts
> against the node we migrated to.
>
> - do_pmd_numa_page():
> Accounts not at all when the page isn't migrated, otherwise
> accounts against the node we migrated towards.
>
> This seems wrong to me; all three sites should have the same
> semantics; furthermore we should account against where the page
> really is, we already know where the task is.
>
Agreed. To allow the scheduler parts to still be evaluated in proper
isolation I moved this patch to much earlier in the series.
--
Mel Gorman
SUSE Labs
On Wed, Jul 31, 2013 at 12:48:14PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 31, 2013 at 11:30:52AM +0100, Mel Gorman wrote:
> > I'm not sure I understand your point. The scan rate is decreased again if
> > the page is found to be properly placed in the future. It's in the next
> > hunk you modify although the periodically reset comment is now out of date.
>
> Yeah its because of the next hunk. I figured that if we don't lower it,
> we shouldn't raise it either.
>
hmm, I'm going to punt that to a TODO item and think about it some more
with a fresh head.
> > > @@ -1167,10 +1171,20 @@ void task_numa_fault(int last_nidpid, in
> > > /*
> > > * If pages are properly placed (did not migrate) then scan slower.
> > > * This is reset periodically in case of phase changes
> > > - */
> > > - if (!migrated)
> > > + *
> > > + * APZ: it seems to me that one can get a ton of !migrated faults;
> > > + * consider the scenario where two threads fight over a shared memory
> > > + * segment. We'll win half the faults, half of that will be local, half
> > > + * of that will be remote. This means we'll see 1/4-th of the total
> > > + * memory being !migrated. Using a fixed increment will completely
> > > + * flatten the scan speed for a sufficiently large workload. Another
> > > + * scenario is due to that migration rate limit.
> > > + *
> > > + if (!migrated) {
> > > p->numa_scan_period = min(p->numa_scan_period_max,
> > > p->numa_scan_period + jiffies_to_msecs(10));
> > > + }
> > > + */
> >
> > FWIW, I'm also not happy with how the scan rate is reduced but did not
> > come up with a better alternative that was not fragile or depended on
> > gathering too much state. Granted, I also have not been treating it as a
> > high priority problem.
>
> Right, so what Ingo did is have the scan rate depend on the convergence.
> What exactly did you dislike about that?
>
It depended entirely on properly detecting if we are converged or not. As
things like false share detection within THP is still not there I was
worried that it was too easy to make the wrong decision here and keep it
pinned at the maximum scan rate.
> We could define the convergence as all the faults inside the interleave
> mask vs the total faults, and then run at: min + (1 - c)*(max-min).
>
And when we have such things properly in place then I think we can kick
away the current crutch.
> > > +#if 0
> > > /*
> > > * We do not care about task placement until a task runs on a node
> > > * other than the first one used by the address space. This is
> > > * largely because migrations are driven by what CPU the task
> > > * is running on. If it's never scheduled on another node, it'll
> > > * not migrate so why bother trapping the fault.
> > > + *
> > > + * APZ: seems like a bad idea for pure shared memory workloads.
> > > */
> > > if (mm->first_nid == NUMA_PTE_SCAN_INIT)
> > > mm->first_nid = numa_node_id();
> >
> > At some point in the past scan starts were based on waiting a fixed interval
> > but that seemed like a hack designed to get around hurting kernel compile
> > benchmarks. I'll give it more thought and see can I think of a better
> > alternative that is based on an event but not this event.
>
> Ah, well the reasoning on that was that all this NUMA business is
> 'expensive' so we'd better only bother with tasks that persist long
> enough for it to pay off.
>
Which is fair enough but tasks that lasted *just* longer than the interval
still got punished. Processes running with a slightly slower CPU get
hurt, meaning that it would be a difficult bug report to digest.
> In that regard it makes perfect sense to wait a fixed amount of runtime
> before we start scanning.
>
> > So it was not a pure hack to make kbuild work again.. that it did so was
> > good though.
>
Maybe we should reintroduce the delay then but I really would prefer that
it was triggered on some sort of event.
> > > @@ -1254,9 +1272,14 @@ void task_numa_work(struct callback_head
> > > * Do not set pte_numa if the current running node is rate-limited.
> > > * This loses statistics on the fault but if we are unwilling to
> > > * migrate to this node, it is less likely we can do useful work
> > > - */
> > > + *
> > > + * APZ: seems like a bad idea; even if this node can't migrate anymore
> > > + * other nodes might and we want up-to-date information to do balance
> > > + * decisions.
> > > + *
> > > if (migrate_ratelimited(numa_node_id()))
> > > return;
> > > + */
> > >
> >
> > Ingo also disliked this but I wanted to avoid a situation where the
> > workload suffered because of a corner case where the interconnect was
> > filled with migration traffic.
>
> Right, but you already rate limit the actual migrations, this should
> leave enough bandwidth to allow the non-migrating scanning.
>
> I think its important we keep up-to-date information if we're going to
> do placement based on it.
>
Ok, you convinced me. I slapped a changelog on it that is a cut&paste job
and moved it earlier in the series.
> On that rate-limit, this looks to be a hard-coded number unrelated to
> the actual hardware.
Guesstimate.
> I think we should at the very least make it a
> configurable number and preferably scale the number with the SLIT info.
> Or alternatively actually measure the node to node bandwidth.
>
Ideally we should just kick it away because scan rate limiting works
properly. Lets not make it a tunable just yet so we can avoid having to
deprecate it later.
--
Mel Gorman
SUSE Labs
New version that includes a final put for the numa_group struct and a
few other modifications.
The new task_numa_free() completely blows though, far too expensive.
Good ideas needed.
---
Subject: sched, numa: Use {cpu, pid} to create task groups for shared faults
From: Peter Zijlstra <[email protected]>
Date: Tue Jul 30 10:40:20 CEST 2013
A very simple/straight forward shared fault task grouping
implementation.
Concerns are that grouping on a single shared fault might be too
aggressive -- this only works because Mel is excluding DSOs for faults,
otherwise we'd have the world in a single group.
Future work could explore more complex means of picking groups. We
could for example track one group for the entire scan (using something
like PDM) and join it at the end of the scan if we deem it shared a
sufficient amount of memory.
Another avenue to explore is that to do with tasks where private faults
are predominant. Should we exclude them from the group or treat them as
secondary, creating a graded group that tries hardest to collate shared
tasks but also tries to move private tasks near when possible.
Also, the grouping information is completely unused, its up to future
patches to do this.
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/sched.h | 4 +
kernel/sched/core.c | 4 +
kernel/sched/fair.c | 177 +++++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 5 -
4 files changed, 176 insertions(+), 14 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1341,6 +1341,10 @@ struct task_struct {
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
+ spinlock_t numa_lock; /* for numa_entry / numa_group */
+ struct list_head numa_entry;
+ struct numa_group *numa_group;
+
/*
* Exponential decaying average of faults on a per-node basis.
* Scheduling placement decisions are made based on the these counts.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1733,6 +1733,10 @@ static void __sched_fork(struct task_str
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+
+ spin_lock_init(&p->numa_lock);
+ INIT_LIST_HEAD(&p->numa_entry);
+ p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1160,6 +1160,17 @@ static void numa_migrate_preferred(struc
p->numa_migrate_retry = jiffies + HZ/10;
}
+struct numa_group {
+ atomic_t refcount;
+
+ spinlock_t lock; /* nr_tasks, tasks */
+ int nr_tasks;
+ struct list_head task_list;
+
+ struct rcu_head rcu;
+ atomic_long_t faults[0];
+};
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -1168,6 +1179,7 @@ static void task_numa_placement(struct t
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
return;
+
p->numa_scan_seq = seq;
p->numa_migrate_seq++;
p->numa_scan_period_max = task_scan_max(p);
@@ -1178,14 +1190,24 @@ static void task_numa_placement(struct t
int priv, i;
for (priv = 0; priv < 2; priv++) {
+ long diff;
+
i = task_faults_idx(nid, priv);
+ diff = -p->numa_faults[i];
+
/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
p->numa_faults[i] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
+ diff += p->numa_faults[i];
faults += p->numa_faults[i];
+
+ if (p->numa_group) {
+ /* safe because we can only change our own group */
+ atomic_long_add(diff, &p->numa_group->faults[i]);
+ }
}
if (faults > max_faults) {
@@ -1222,6 +1244,133 @@ static void task_numa_placement(struct t
}
}
+static inline int get_numa_group(struct numa_group *grp)
+{
+ return atomic_inc_not_zero(&grp->refcount);
+}
+
+static inline void put_numa_group(struct numa_group *grp)
+{
+ if (atomic_dec_and_test(&grp->refcount))
+ kfree_rcu(grp, rcu);
+}
+
+static void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+ if (l1 > l2)
+ swap(l1, l2);
+
+ spin_lock(l1);
+ spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
+static void task_numa_group(struct task_struct *p, int cpu, int pid)
+{
+ struct numa_group *grp, *my_grp;
+ struct task_struct *tsk;
+ bool join = false;
+ int i;
+
+ if (unlikely(!p->numa_group)) {
+ unsigned int size = sizeof(struct numa_group) +
+ 2*nr_node_ids*sizeof(atomic_long_t);
+
+ grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+ if (!grp)
+ return;
+
+ atomic_set(&grp->refcount, 1);
+ spin_lock_init(&grp->lock);
+ INIT_LIST_HEAD(&grp->task_list);
+
+ for (i = 0; i < 2*nr_node_ids; i++)
+ atomic_long_set(&grp->faults[i], p->numa_faults[i]);
+
+ spin_lock(&p->numa_lock);
+ list_add(&p->numa_entry, &grp->task_list);
+ grp->nr_tasks++;
+ rcu_assign_pointer(p->numa_group, grp);
+ spin_unlock(&p->numa_lock);
+ }
+
+ rcu_read_lock();
+ tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
+
+ if ((tsk->pid & LAST__PID_MASK) != pid)
+ goto unlock;
+
+ grp = rcu_dereference(tsk->numa_group);
+ if (!grp)
+ goto unlock;
+
+ my_grp = p->numa_group;
+ if (grp == my_grp)
+ goto unlock;
+
+ /*
+ * Only join the other group if its bigger; if we're the bigger group,
+ * the other task will join us.
+ */
+ if (my_grp->nr_tasks > grp->nr_tasks)
+ goto unlock;
+
+ /*
+ * Tie-break on the grp address.
+ */
+ if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp)
+ goto unlock;
+
+ if (!get_numa_group(grp))
+ goto unlock;
+
+ join = true;
+
+unlock:
+ rcu_read_unlock();
+
+ if (!join)
+ return;
+
+ for (i = 0; i < 2*nr_node_ids; i++) {
+ atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
+ atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+ }
+
+ spin_lock(&p->numa_lock);
+ double_lock(&my_grp->lock, &grp->lock);
+
+ list_move(&p->numa_entry, &grp->task_list);
+ my_grp->nr_tasks--;
+ grp->nr_tasks++;
+
+ spin_unlock(&my_grp->lock);
+ spin_unlock(&grp->lock);
+
+ rcu_assign_pointer(p->numa_group, grp);
+ spin_unlock(&p->numa_lock);
+
+ put_numa_group(my_grp);
+}
+
+static void task_numa_free(struct task_struct *p)
+{
+ kfree(p->numa_faults);
+ if (p->numa_group) {
+ struct numa_group *grp = p->numa_group;
+ int i;
+
+ for (i = 0; i < 2*nr_node_ids; i++)
+ atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
+
+ spin_lock(&p->numa_lock);
+ spin_lock(&group->lock);
+ list_del(&p->numa_entry);
+ spin_unlock(&group->lock);
+ rcu_assign_pointer(p->numa_group, NULL);
+ put_numa_group(grp);
+ }
+}
+
/*
* Got a PROT_NONE fault for a page on @node.
*/
@@ -1237,21 +1386,12 @@ void task_numa_fault(int last_cpupid, in
if (!p->mm)
return;
- /*
- * First accesses are treated as private, otherwise consider accesses
- * to be private if the accessing pid has not changed
- */
- if (!cpupid_pid_unset(last_cpupid))
- priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
- else
- priv = 1;
-
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
/* numa_faults and numa_faults_buffer share the allocation */
- p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
+ p->numa_faults = kzalloc(size * 2, GFP_KERNEL | __GFP_NOWARN);
if (!p->numa_faults)
return;
@@ -1260,6 +1400,23 @@ void task_numa_fault(int last_cpupid, in
}
/*
+ * First accesses are treated as private, otherwise consider accesses
+ * to be private if the accessing pid has not changed
+ */
+ if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) {
+ priv = 1;
+ } else {
+ int cpu, pid;
+
+ cpu = cpupid_to_cpu(last_cpupid);
+ pid = cpupid_to_pid(last_cpupid);
+
+ priv = (pid == (p->pid & LAST__PID_MASK));
+ if (!priv)
+ task_numa_group(p, cpu, pid);
+ }
+
+ /*
* If pages are properly placed (did not migrate) then scan slower.
* This is reset periodically in case of phase changes
*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -556,10 +556,7 @@ static inline u64 rq_clock_task(struct r
#ifdef CONFIG_NUMA_BALANCING
extern int migrate_task_to(struct task_struct *p, int cpu);
extern int migrate_swap(struct task_struct *, struct task_struct *);
-static inline void task_numa_free(struct task_struct *p)
-{
- kfree(p->numa_faults);
-}
+extern void task_numa_free(struct task_struct *p);
#else /* CONFIG_NUMA_BALANCING */
static inline void task_numa_free(struct task_struct *p)
{
On Wed, Jul 31, 2013 at 12:57:19PM +0100, Mel Gorman wrote:
> > Right, so what Ingo did is have the scan rate depend on the convergence.
> > What exactly did you dislike about that?
> >
>
> It depended entirely on properly detecting if we are converged or not. As
> things like false share detection within THP is still not there I was
> worried that it was too easy to make the wrong decision here and keep it
> pinned at the maximum scan rate.
>
> > We could define the convergence as all the faults inside the interleave
> > mask vs the total faults, and then run at: min + (1 - c)*(max-min).
> >
>
> And when we have such things properly in place then I think we can kick
> away the current crutch.
OK, so I'll go write that patch I suppose ;-)
> > Ah, well the reasoning on that was that all this NUMA business is
> > 'expensive' so we'd better only bother with tasks that persist long
> > enough for it to pay off.
> >
>
> Which is fair enough but tasks that lasted *just* longer than the interval
> still got punished. Processes running with a slightly slower CPU get
> hurt, meaning that it would be a difficult bug report to digest.
>
> > In that regard it makes perfect sense to wait a fixed amount of runtime
> > before we start scanning.
> >
> > So it was not a pure hack to make kbuild work again.. that it did so was
> > good though.
> >
>
> Maybe we should reintroduce the delay then but I really would prefer that
> it was triggered on some sort of event.
Humm:
kernel/sched/fair.c:
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;
kernel/sched/core.c:__sched_fork():
numa_scan_period = sysctl_numa_balancing_scan_delay
It seems its still there, no need to resuscitate.
I share your preference for a clear event, although nothing really comes
to mind. The entire multi-process space seems devoid of useful triggers.
> > On that rate-limit, this looks to be a hard-coded number unrelated to
> > the actual hardware.
>
> Guesstimate.
>
> > I think we should at the very least make it a
> > configurable number and preferably scale the number with the SLIT info.
> > Or alternatively actually measure the node to node bandwidth.
> >
>
> Ideally we should just kick it away because scan rate limiting works
> properly. Lets not make it a tunable just yet so we can avoid having to
> deprecate it later.
I'm not seeing how the rate-limit as per the convergence is going to
help here. Suppose we migrate the task to another node and its going to
stay there. Then our convergence is going down to 0 (all our memory is
remote) so we end up at the max scan rate migrating every single page
ASAP.
This would completely and utterly saturate any interconnect.
Also, in the case we don't have a fully connected system the memory
transfers will need multiple hops, which greatly complicates the entire
accounting trick :-)
I'm not particularly arguing one way or another, just saying we could
probably blow the interconnect whatever we do.
On Wed, Jul 31, 2013 at 05:07:51PM +0200, Peter Zijlstra wrote:
> @@ -1260,6 +1400,23 @@ void task_numa_fault(int last_cpupid, in
> }
>
> /*
> + * First accesses are treated as private, otherwise consider accesses
> + * to be private if the accessing pid has not changed
> + */
> + if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) {
> + priv = 1;
> + } else {
> + int cpu, pid;
> +
> + cpu = cpupid_to_cpu(last_cpupid);
> + pid = cpupid_to_pid(last_cpupid);
> +
> + priv = (pid == (p->pid & LAST__PID_MASK));
So Rik just pointed out that this condition is likely to generate false
positives due to the birthday paradox. The problem with including
cpu/nid information is another kind of false positives.
We've no idea which is worse..
> + if (!priv)
> + task_numa_group(p, cpu, pid);
> + }
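For a rough feel for that false-positive rate, here is a back-of-the-envelope
calculation; it assumes LAST__PID_MASK keeps 8 bits of the pid, which depends
on how the cpupid field is actually packed:

#include <math.h>
#include <stdio.h>

/*
 * With only B low pid bits kept, an unrelated task matches our bits with
 * probability 1/2^B, so with k distinct tasks faulting on a page the
 * chance of at least one spurious "private" match is roughly
 * 1 - (1 - 1/2^B)^(k-1).
 */
int main(void)
{
	const double slots = 256.0;	/* assumes 8 pid bits in the cpupid */
	int k;

	for (k = 2; k <= 128; k *= 2)
		printf("tasks=%3d  p(false positive) ~= %.3f\n",
		       k, 1.0 - pow(1.0 - 1.0 / slots, k - 1));
	return 0;
}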
On 07/31/2013 11:07 AM, Peter Zijlstra wrote:
>
> New version that includes a final put for the numa_group struct and a
> few other modifications.
>
> The new task_numa_free() completely blows though, far too expensive.
> Good ideas needed.
>
> ---
> Subject: sched, numa: Use {cpu, pid} to create task groups for shared faults
> From: Peter Zijlstra <[email protected]>
> Date: Tue Jul 30 10:40:20 CEST 2013
>
> A very simple/straight forward shared fault task grouping
> implementation.
>
> Concerns are that grouping on a single shared fault might be too
> aggressive -- this only works because Mel is excluding DSOs for faults,
> otherwise we'd have the world in a single group.
>
> Future work could explore more complex means of picking groups. We
> could for example track one group for the entire scan (using something
> like PDM) and join it at the end of the scan if we deem it shared a
> sufficient amount of memory.
>
> Another avenue to explore is that to do with tasks where private faults
> are predominant. Should we exclude them from the group or treat them as
> secondary, creating a graded group that tries hardest to collate shared
> tasks but also tries to move private tasks near when possible.
>
> Also, the grouping information is completely unused, its up to future
> patches to do this.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> include/linux/sched.h | 4 +
> kernel/sched/core.c | 4 +
> kernel/sched/fair.c | 177 +++++++++++++++++++++++++++++++++++++++++++++++---
> kernel/sched/sched.h | 5 -
> 4 files changed, 176 insertions(+), 14 deletions(-)
> +
> +static void task_numa_free(struct task_struct *p)
> +{
> + kfree(p->numa_faults);
> + if (p->numa_group) {
> + struct numa_group *grp = p->numa_group;
See below.
> + int i;
> +
> + for (i = 0; i < 2*nr_node_ids; i++)
> + atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
> +
> + spin_lock(&p->numa_lock);
> + spin_lock(&group->lock);
> + list_del(&p->numa_entry);
> + spin_unlock(&group->lock);
> + rcu_assign_pointer(p->numa_group, NULL);
> + put_numa_group(grp);
So is the local variable group or grp here? Got to be one or the
other to compile...
Don
> + }
> +}
> +
> /*
> * Got a PROT_NONE fault for a page on @node.
> */
On Wed, Jul 31, 2013 at 11:45:09AM -0400, Don Morris wrote:
> > +
> > +static void task_numa_free(struct task_struct *p)
> > +{
> > + kfree(p->numa_faults);
> > + if (p->numa_group) {
> > + struct numa_group *grp = p->numa_group;
>
> See below.
>
> > + int i;
> > +
> > + for (i = 0; i < 2*nr_node_ids; i++)
> > + atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
> > +
> > + spin_lock(&p->numa_lock);
> > + spin_lock(&group->lock);
> > + list_del(&p->numa_entry);
> > + spin_unlock(&group->lock);
> > + rcu_assign_pointer(p->numa_group, NULL);
> > + put_numa_group(grp);
>
> So is the local variable group or grp here? Got to be one or the
> other to compile...
Feh, compiling is soooo overrated! :-)
Thanks.
On Wed, Jul 31, 2013 at 05:30:18PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 31, 2013 at 12:57:19PM +0100, Mel Gorman wrote:
>
> > > Right, so what Ingo did is have the scan rate depend on the convergence.
> > > What exactly did you dislike about that?
> > >
> >
> > It depended entirely on properly detecting if we are converged or not. As
> > things like false share detection within THP is still not there I was
> > worried that it was too easy to make the wrong decision here and keep it
> > pinned at the maximum scan rate.
> >
> > > We could define the convergence as all the faults inside the interleave
> > > mask vs the total faults, and then run at: min + (1 - c)*(max-min).
> > >
> >
> > And when we have such things properly in place then I think we can kick
> > away the current crutch.
>
> OK, so I'll go write that patch I suppose ;-)
>
> > > Ah, well the reasoning on that was that all this NUMA business is
> > > 'expensive' so we'd better only bother with tasks that persist long
> > > enough for it to pay off.
> > >
> >
> > Which is fair enough but tasks that lasted *just* longer than the interval
> > still got punished. Processes running with a slightly slower CPU get
> > hurt, meaning that it would be a difficult bug report to digest.
> >
> > > In that regard it makes perfect sense to wait a fixed amount of runtime
> > > before we start scanning.
> > >
> > > So it was not a pure hack to make kbuild work again.. that is did was
> > > good though.
> > >
> >
> > Maybe we should reintroduce the delay then but I really would prefer that
> > it was triggered on some sort of event.
>
> Humm:
>
> kernel/sched/fair.c:
>
> /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
> unsigned int sysctl_numa_balancing_scan_delay = 1000;
>
>
> kernel/sched/core.c:__sched_fork():
>
> numa_scan_period = sysctl_numa_balancing_scan_delay
>
>
> It seems its still there, no need to resuscitate.
>
Yes, reverting 5bca23035391928c4c7301835accca3551b96cc2 effectively restores
the behaviour you are looking for. It just seems very crude. Then again,
I also should not have left the scan delay on top of the first_nid
check.
> I share your preference for a clear event, although nothing really comes
> to mind. The entire multi-process space seems devoid of useful triggers.
>
RSS was another option but it felt as arbitrary as a plain delay.
Should I revert 5bca23035391928c4c7301835accca3551b96cc2 with an
explanation that it potentially is completely useless in the purely
multi-process shared case?
> > > On that rate-limit, this looks to be a hard-coded number unrelated to
> > > the actual hardware.
> >
> > Guesstimate.
> >
> > > I think we should at the very least make it a
> > > configurable number and preferably scale the number with the SLIT info.
> > > Or alternatively actually measure the node to node bandwidth.
> > >
> >
> > Ideally we should just kick it away because scan rate limiting works
> > properly. Lets not make it a tunable just yet so we can avoid having to
> > deprecate it later.
>
> I'm not seeing how the rate-limit as per the convergence is going to
> help here.
It should reduce the potential number of NUMA hinting faults that can be
incurred. However, I accept your point because even then it does not directly
avoid a large number of migration events.
> Suppose we migrate the task to another node and its going to
> stay there. Then our convergence is going down to 0 (all our memory is
> remote) so we end up at the max scan rate migrating every single page
> ASAP.
>
> This would completely and utterly saturate any interconnect.
>
Good point and we'd arrive back at rate limiting the migration in an
attempt to avoid it.
> Also, in the case we don't have a fully connected system the memory
> transfers will need multiple hops, which greatly complicates the entire
> accounting trick :-)
>
Also unfortunately true. The larger the machine, the more likely this
becomes.
--
Mel Gorman
SUSE Labs
On Wed, Jul 31, 2013 at 05:11:41PM +0100, Mel Gorman wrote:
> RSS was another option but it felt as arbitrary as a plain delay.
Right, it would avoid 'small' programs getting scanning done with the
rationale that their cost isn't that large since they don't have much
memory to begin with.
The same can be said for tasks that don't run much -- irrespective of
how much absolute runtime they've gathered.
Is there any other group of tasks that we do not want to scan?
Maybe if we can list all the various exclusions we can get to a proper
quantifier that way.
So far we've got:
- doesn't run long
- doesn't run much
- doesn't have much memory
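A sketch of how those exclusions might look as early bail-outs in
task_numa_work(); the thresholds are invented placeholders, and
sum_exec_runtime/get_mm_rss() are just the obvious approximations of
"runtime" and "memory":

	/*
	 * Sketch only, thresholds made up:
	 * - doesn't run long / much: require some accumulated runtime
	 * - doesn't have much memory: require a minimum RSS
	 */
	if (p->se.sum_exec_runtime < NSEC_PER_SEC)
		return;

	if (get_mm_rss(p->mm) < ((32UL << 20) >> PAGE_SHIFT))
		return;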
> Should I revert 5bca23035391928c4c7301835accca3551b96cc2 with an
> explanation that it potentially is completely useless in the purely
> multi-process shared case?
Yeah I suppose so..
* Mel Gorman <[email protected]> [2013-07-15 16:20:10]:
> A preferred node is selected based on the node the most NUMA hinting
> faults was incurred on. There is no guarantee that the task is running
> on that node at the time so this patch reschedules the task to run on
> the most idle CPU of the selected node when selected. This avoids
> waiting for the balancer to make a decision.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> kernel/sched/core.c | 17 +++++++++++++++++
> kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 1 +
> 3 files changed, 63 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5e02507..b67a102 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4856,6 +4856,23 @@ fail:
> return ret;
> }
>
> +#ifdef CONFIG_NUMA_BALANCING
> +/* Migrate current task p to target_cpu */
> +int migrate_task_to(struct task_struct *p, int target_cpu)
> +{
> + struct migration_arg arg = { p, target_cpu };
> + int curr_cpu = task_cpu(p);
> +
> + if (curr_cpu == target_cpu)
> + return 0;
> +
> + if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
> + return -EINVAL;
> +
> + return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
As I had noted earlier, this upsets schedstats badly.
Can we add a TODO for this patch which mentions that schedstats need to
be taken care of?
One alternative that I can think of is to have a per-scheduling-class
routine that gets called and does the needful.
For example: for fair share, it could update the schedstats as well as
check for CFS throttling.
But I think it's an issue that needs some fix or we should obsolete
schedstats.
> @@ -904,6 +908,8 @@ static int task_numa_find_cpu(struct task_struct *p, int nid)
> src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
>
> for_each_cpu(cpu, cpumask_of_node(nid)) {
> + struct task_struct *swap_candidate = NULL;
> +
> dst_load = target_load(cpu, idx);
>
> /* If the CPU is idle, use it */
> @@ -922,12 +928,41 @@ static int task_numa_find_cpu(struct task_struct *p, int nid)
> * migrate to its preferred node due to load imbalances.
> */
> balanced = (dst_eff_load <= src_eff_load);
> - if (!balanced)
> - continue;
> + if (!balanced) {
> + struct rq *rq = cpu_rq(cpu);
> + unsigned long src_faults, dst_faults;
> +
> + /* Do not move tasks off their preferred node */
> + if (rq->curr->numa_preferred_nid == nid)
> + continue;
> +
> + /* Do not attempt an illegal migration */
> + if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(rq->curr)))
> + continue;
> +
> + /*
> + * Do not impair locality for the swap candidate.
> + * Destination for the swap candidate is the source cpu
> + */
> + if (rq->curr->numa_faults) {
> + src_faults = rq->curr->numa_faults[task_faults_idx(nid, 1)];
> + dst_faults = rq->curr->numa_faults[task_faults_idx(src_cpu_node, 1)];
> + if (src_faults > dst_faults)
> + continue;
> + }
> +
> + /*
> + * The destination is overloaded but running a task
> + * that is not running on its preferred node. Consider
> + * swapping the CPU tasks are running on.
> + */
> + swap_candidate = rq->curr;
> + }
>
> if (dst_load < min_load) {
> min_load = dst_load;
> dst_cpu = cpu;
> + *swap_p = swap_candidate;
Are we sometimes passing a wrong candidate?
Let's say balanced is false for the first cpu and we set the swap_candidate,
but then find the second (or a later) cpu to be idle or to have a lower
effective load; we could end up sending the task that is running on the
first cpu as the swap candidate.
Would the preferred cpu and the swap_candidate still match then?
--
Thanks and Regards
Srikar
* Mel Gorman <[email protected]> [2013-07-15 16:20:19]:
> When a preferred node is selected for a task there is an attempt to migrate
> the task to a CPU there. This may fail in which case the task will only
> migrate if the active load balancer takes action. This may never happen if
Apart from load imbalance or heavily loaded cpus on the preferred node,
what could the other reasons be for migration failure with
migrate_task_to()? I see it as almost similar to active load balancing,
except for pushing instead of pulling tasks.
If load imbalance is the only reason, do we need to retry? If the task
is really so attached to memory on that node, shouldn't we be getting a
task_numa_placement hit before the next 5 seconds?
> the conditions are not right. This patch will check at NUMA hinting fault
> time if another attempt should be made to migrate the task. It will only
> make an attempt once every five seconds.
>
> Signed-off-by: Mel Gorman <[email protected]>
--
Thanks and Regards
Srikar Dronamraju
Subject: [PATCH,RFC] numa,sched: use group fault statistics in numa placement
Here is a quick strawman on how the group fault stuff could be used
to help pick the best node for a task. This is likely to be quite
suboptimal and in need of tweaking. My main goal is to get this to
Peter & Mel before it's breakfast time on their side of the Atlantic...
This goes on top of "sched, numa: Use {cpu, pid} to create task groups for shared faults"
Enjoy :)
Signed-off-by: Rik van Riel <[email protected]>
---
kernel/sched/fair.c | 32 +++++++++++++++++++++++++++++---
1 file changed, 29 insertions(+), 3 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a06bef..fb2e229 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1135,8 +1135,9 @@ struct numa_group {
static void task_numa_placement(struct task_struct *p)
{
- int seq, nid, max_nid = -1;
- unsigned long max_faults = 0;
+ int seq, nid, max_nid = -1, max_group_nid = -1;
+ unsigned long max_faults = 0, max_group_faults = 0;
+ unsigned long total_faults = 0, total_group_faults = 0;
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
@@ -1148,7 +1149,7 @@ static void task_numa_placement(struct task_struct *p)
/* Find the node with the highest number of faults */
for (nid = 0; nid < nr_node_ids; nid++) {
- unsigned long faults = 0;
+ unsigned long faults = 0, group_faults = 0;
int priv, i;
for (priv = 0; priv < 2; priv++) {
@@ -1169,6 +1170,7 @@ static void task_numa_placement(struct task_struct *p)
if (p->numa_group) {
/* safe because we can only change our own group */
atomic_long_add(diff, &p->numa_group->faults[i]);
+ group_faults += atomic_long_read(&p->numa_group->faults[i]);
}
}
@@ -1176,11 +1178,35 @@ static void task_numa_placement(struct task_struct *p)
max_faults = faults;
max_nid = nid;
}
+
+ if (group_faults > max_group_faults) {
+ max_group_faults = group_faults;
+ max_group_nid = nid;
+ }
+
+ total_faults += faults;
+ total_group_faults += group_faults;
}
if (sched_feat(NUMA_INTERLEAVE))
task_numa_mempol(p, max_faults);
+ /*
+ * Should we stay on our own, or move in with the group?
+ * The absolute count of faults may not be useful, but comparing
+ * the fraction of accesses in each top node may give us a hint
+ * where to start looking for a migration target.
+ *
+ * max_group_faults max_faults
+ * ------------------ > ------------
+ * total_group_faults total_faults
+ */
+ if (max_group_nid >= 0 && max_group_nid != max_nid) {
+ if (max_group_faults * total_faults >
+ max_faults * total_group_faults)
+ max_nid = max_group_nid;
+ }
+
/* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
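To make the comparison above concrete with invented numbers: if the task
has 30 of its 100 faults on its own best node while the group has 300 of
its 500 faults on the group's best node, the test reads 300 * 100 >
30 * 500, i.e. 30000 > 15000, so max_nid is pulled over to the group's
node. Cross-multiplying the two fractions keeps the whole thing in
integer arithmetic and avoids the rounding that per-fraction integer
division would introduce.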
> +static int task_numa_find_cpu(struct task_struct *p, int nid)
> +{
> + int node_cpu = cpumask_first(cpumask_of_node(nid));
> + int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
> + unsigned long src_load, dst_load;
> + unsigned long min_load = ULONG_MAX;
> + struct task_group *tg = task_group(p);
> + s64 src_eff_load, dst_eff_load;
> + struct sched_domain *sd;
> + unsigned long weight;
> + bool balanced;
> + int imbalance_pct, idx = -1;
>
> + /* No harm being optimistic */
> + if (idle_cpu(node_cpu))
> + return node_cpu;
Can't this lead to a lot of imbalance across nodes? Won't it lead to a lot
of ping-ponging of tasks between different nodes, resulting in a performance
hit? Let's say the system is not fully loaded, something like numa01
but with a far smaller number of threads, probably nr_cpus/2 or nr_cpus/4;
then all threads will try to move to a single node because we keep seeing
idle cpus there. No? Won't that lead to all the load moving to one node and
the load balancer spreading it out again...
>
> -static int
> -find_idlest_cpu_node(int this_cpu, int nid)
> -{
> - unsigned long load, min_load = ULONG_MAX;
> - int i, idlest_cpu = this_cpu;
> + /*
> + * Find the lowest common scheduling domain covering the nodes of both
> + * the CPU the task is currently running on and the target NUMA node.
> + */
> + rcu_read_lock();
> + for_each_domain(src_cpu, sd) {
> + if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
> + /*
> + * busy_idx is used for the load decision as it is the
> + * same index used by the regular load balancer for an
> + * active cpu.
> + */
> + idx = sd->busy_idx;
> + imbalance_pct = sd->imbalance_pct;
> + break;
> + }
> + }
> + rcu_read_unlock();
>
> - BUG_ON(cpu_to_node(this_cpu) == nid);
> + if (WARN_ON_ONCE(idx == -1))
> + return src_cpu;
>
> - rcu_read_lock();
> - for_each_cpu(i, cpumask_of_node(nid)) {
> - load = weighted_cpuload(i);
> + /*
> + * XXX the below is mostly nicked from wake_affine(); we should
> + * see about sharing a bit if at all possible; also it might want
> + * some per entity weight love.
> + */
> + weight = p->se.load.weight;
>
> - if (load < min_load) {
> - min_load = load;
> - idlest_cpu = i;
> + src_load = source_load(src_cpu, idx);
> +
> + src_eff_load = 100 + (imbalance_pct - 100) / 2;
> + src_eff_load *= power_of(src_cpu);
> + src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
> +
> + for_each_cpu(cpu, cpumask_of_node(nid)) {
> + dst_load = target_load(cpu, idx);
> +
> + /* If the CPU is idle, use it */
> + if (!dst_load)
> + return dst_cpu;
> +
> + /* Otherwise check the target CPU load */
> + dst_eff_load = 100;
> + dst_eff_load *= power_of(cpu);
> + dst_eff_load *= dst_load + effective_load(tg, cpu, weight, weight);
> +
> + /*
> + * Destination is considered balanced if the destination CPU is
> + * less loaded than the source CPU. Unfortunately there is a
> + * risk that a task running on a lightly loaded CPU will not
> + * migrate to its preferred node due to load imbalances.
> + */
> + balanced = (dst_eff_load <= src_eff_load);
> + if (!balanced)
> + continue;
> +
Okay, same case as above: the cpu could be lightly loaded, but the
destination node could still be more heavily loaded than the source node. No?
> + if (dst_load < min_load) {
> + min_load = dst_load;
> + dst_cpu = cpu;
> }
> }
> - rcu_read_unlock();
>
> - return idlest_cpu;
> + return dst_cpu;
> }
>
--
Thanks and Regards
Srikar Dronamraju
On Thu, Aug 01, 2013 at 02:23:19AM -0400, Rik van Riel wrote:
> Subject: [PATCH,RFC] numa,sched: use group fault statistics in numa placement
>
> Here is a quick strawman on how the group fault stuff could be used
> to help pick the best node for a task. This is likely to be quite
> suboptimal and in need of tweaking. My main goal is to get this to
> Peter & Mel before it's breakfast time on their side of the Atlantic...
>
> This goes on top of "sched, numa: Use {cpu, pid} to create task groups for shared faults"
>
> Enjoy :)
>
> + /*
> + * Should we stay on our own, or move in with the group?
> + * The absolute count of faults may not be useful, but comparing
> + * the fraction of accesses in each top node may give us a hint
> + * where to start looking for a migration target.
> + *
> + * max_group_faults max_faults
> + * ------------------ > ------------
> + * total_group_faults total_faults
> + */
> + if (max_group_nid >= 0 && max_group_nid != max_nid) {
> + if (max_group_faults * total_faults >
> + max_faults * total_group_faults)
> + max_nid = max_group_nid;
> + }
This makes sense.. another part of the problem, which you might already
have spotted, is selecting a task to swap with.
If you only look at per-task faults it is often impossible to find a
suitable swap task, because moving you to a more suitable node would
degrade the other task -- below is a patch you've already seen but that I
haven't yet posted because I'm not at all sure it's something 'sane' :-)
With group information your case might be stronger because you already
have many tasks on that node.
Still, there's the tie where two groups each have exactly half of
their tasks crossed between two nodes. I suppose we should forcefully
tie-break in this case.
And all this while also maintaining the invariants placed by the regular
balancer. It would be no good to move tasks about if the balancer would
then have to shuffle stuff right back (or worse) in order to maintain
fairness.
---
Subject: sched, numa: Alternative migration scheme
From: Peter Zijlstra <[email protected]>
Date: Sun Jul 21 23:12:13 CEST 2013
Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched/fair.c | 260 +++++++++++++++++++++++++++++++++++++---------------
1 file changed, 187 insertions(+), 73 deletions(-)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -816,6 +816,8 @@ update_stats_curr_start(struct cfs_rq *c
* Scheduling class queueing methods:
*/
+static unsigned long task_h_load(struct task_struct *p);
+
#ifdef CONFIG_NUMA_BALANCING
/*
* Approximate time to scan a full NUMA task in ms. The task scan period is
@@ -885,92 +887,206 @@ static inline int task_faults_idx(int ni
return 2 * nid + priv;
}
+static inline unsigned long task_faults(struct task_struct *p, int nid)
+{
+ if (!p->numa_faults)
+ return 0;
+
+ return p->numa_faults[2*nid] + p->numa_faults[2*nid+1];
+}
+
+static unsigned long weighted_cpuload(const int cpu);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
static unsigned long power_of(int cpu);
static long effective_load(struct task_group *tg, int cpu, long wl, long wg);
-static int task_numa_find_cpu(struct task_struct *p, int nid)
+struct numa_stats {
+ unsigned long nr_running;
+ unsigned long load;
+ unsigned long power;
+ unsigned long capacity;
+ int has_capacity;
+};
+
+/*
+ * XXX borrowed from update_sg_lb_stats
+ */
+static void update_numa_stats(struct numa_stats *ns, int nid)
{
- int node_cpu = cpumask_first(cpumask_of_node(nid));
- int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
- unsigned long src_load, dst_load;
- unsigned long min_load = ULONG_MAX;
- struct task_group *tg = task_group(p);
- s64 src_eff_load, dst_eff_load;
- struct sched_domain *sd;
- unsigned long weight;
- bool balanced;
- int imbalance_pct, idx = -1;
+ int cpu;
+
+ memset(ns, 0, sizeof(*ns));
+ for_each_cpu(cpu, cpumask_of_node(nid)) {
+ struct rq *rq = cpu_rq(cpu);
+
+ ns->nr_running += rq->nr_running;
+ ns->load += weighted_cpuload(cpu);
+ ns->power += power_of(cpu);
+ }
+
+ ns->load = (ns->load * SCHED_POWER_SCALE) / ns->power;
+ ns->capacity = DIV_ROUND_CLOSEST(ns->power, SCHED_POWER_SCALE);
+ ns->has_capacity = (ns->nr_running < ns->capacity);
+}
+
+struct task_numa_env {
+ struct task_struct *p;
+
+ int src_cpu, src_nid;
+ int dst_cpu, dst_nid;
+
+ struct numa_stats src_stats, dst_stats;
+
+ int imbalance_pct, idx;
+
+ struct task_struct *best_task;
+ long best_imp;
+ int best_cpu;
+};
+
+static void task_numa_assign(struct task_numa_env *env,
+ struct task_struct *p, long imp)
+{
+ if (env->best_task)
+ put_task_struct(env->best_task);
+ if (p)
+ get_task_struct(p);
+
+ env->best_task = p;
+ env->best_imp = imp;
+ env->best_cpu = env->dst_cpu;
+}
+
+static void task_numa_compare(struct task_numa_env *env, long imp)
+{
+ struct rq *src_rq = cpu_rq(env->src_cpu);
+ struct rq *dst_rq = cpu_rq(env->dst_cpu);
+ struct task_struct *cur;
+ unsigned long dst_load, src_load;
+ unsigned long load;
+
+ rcu_read_lock();
+ cur = ACCESS_ONCE(dst_rq->curr);
+ if (cur->pid == 0) /* idle */
+ cur = NULL;
+
+ if (cur) {
+ imp += task_faults(cur, env->src_nid) -
+ task_faults(cur, env->dst_nid);
+ }
+
+ if (imp < env->best_imp)
+ goto unlock;
+
+ if (!cur) {
+ /* If there's room for an extra task; go ahead */
+ if (env->dst_stats.has_capacity)
+ goto assign;
+
+ /* If we're both over-capacity; balance */
+ if (!env->src_stats.has_capacity)
+ goto balance;
+
+ goto unlock;
+ }
+
+ /* Balance doesn't matter much if we're running a task per cpu */
+ if (src_rq->nr_running == 1 && dst_rq->nr_running == 1)
+ goto assign;
+
+ /*
+ * In the overloaded case, try and keep the load balanced.
+ */
+balance:
+ dst_load = env->dst_stats.load;
+ src_load = env->src_stats.load;
+
+ /* XXX missing power terms */
+ load = task_h_load(env->p);
+ dst_load += load;
+ src_load -= load;
+
+ if (cur) {
+ load = task_h_load(cur);
+ dst_load -= load;
+ src_load += load;
+ }
+
+ /* make src_load the smaller */
+ if (dst_load < src_load)
+ swap(dst_load, src_load);
- /* No harm being optimistic */
- if (idle_cpu(node_cpu))
- return node_cpu;
+ if (src_load * env->imbalance_pct < dst_load * 100)
+ goto unlock;
+
+assign:
+ task_numa_assign(env, cur, imp);
+unlock:
+ rcu_read_unlock();
+}
+
+static int task_numa_migrate(struct task_struct *p)
+{
+ struct task_numa_env env = {
+ .p = p,
+
+ .src_cpu = task_cpu(p),
+ .src_nid = cpu_to_node(task_cpu(p)),
+
+ .imbalance_pct = 112,
+
+ .best_task = NULL,
+ .best_imp = 0,
+ .best_cpu = -1
+ };
+ struct sched_domain *sd;
+ unsigned long faults;
+ int nid, cpu, ret;
/*
* Find the lowest common scheduling domain covering the nodes of both
* the CPU the task is currently running on and the target NUMA node.
*/
rcu_read_lock();
- for_each_domain(src_cpu, sd) {
- if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
- /*
- * busy_idx is used for the load decision as it is the
- * same index used by the regular load balancer for an
- * active cpu.
- */
- idx = sd->busy_idx;
- imbalance_pct = sd->imbalance_pct;
+ for_each_domain(env.src_cpu, sd) {
+ if (cpumask_intersects(cpumask_of_node(env.src_nid), sched_domain_span(sd))) {
+ env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
break;
}
}
rcu_read_unlock();
- if (WARN_ON_ONCE(idx == -1))
- return src_cpu;
+ faults = task_faults(p, env.src_nid);
+ update_numa_stats(&env.src_stats, env.src_nid);
- /*
- * XXX the below is mostly nicked from wake_affine(); we should
- * see about sharing a bit if at all possible; also it might want
- * some per entity weight love.
- */
- weight = p->se.load.weight;
+ for_each_online_node(nid) {
+ long imp;
- src_load = source_load(src_cpu, idx);
-
- src_eff_load = 100 + (imbalance_pct - 100) / 2;
- src_eff_load *= power_of(src_cpu);
- src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
-
- for_each_cpu(cpu, cpumask_of_node(nid)) {
- dst_load = target_load(cpu, idx);
-
- /* If the CPU is idle, use it */
- if (!dst_load)
- return cpu;
-
- /* Otherwise check the target CPU load */
- dst_eff_load = 100;
- dst_eff_load *= power_of(cpu);
- dst_eff_load *= dst_load + effective_load(tg, cpu, weight, weight);
+ if (nid == env.src_nid)
+ continue;
- /*
- * Destination is considered balanced if the destination CPU is
- * less loaded than the source CPU. Unfortunately there is a
- * risk that a task running on a lightly loaded CPU will not
- * migrate to its preferred node due to load imbalances.
- */
- balanced = (dst_eff_load <= src_eff_load);
- if (!balanced)
+ imp = task_faults(p, nid) - faults;
+ if (imp < 0)
continue;
- if (dst_load < min_load) {
- min_load = dst_load;
- dst_cpu = cpu;
+ env.dst_nid = nid;
+ update_numa_stats(&env.dst_stats, env.dst_nid);
+ for_each_cpu(cpu, cpumask_of_node(nid)) {
+ env.dst_cpu = cpu;
+ task_numa_compare(&env, imp);
}
}
- return dst_cpu;
+ if (env.best_cpu == -1)
+ return -EAGAIN;
+
+ if (env.best_task == NULL)
+ return migrate_task_to(p, env.best_cpu);
+
+ ret = migrate_swap(p, env.best_task);
+ put_task_struct(env.best_task);
+ return ret;
}
/* Attempt to migrate a task to a CPU on the preferred node. */
@@ -983,10 +1099,13 @@ static void numa_migrate_preferred(struc
if (cpu_to_node(preferred_cpu) == p->numa_preferred_nid)
return;
- /* Otherwise, try migrate to a CPU on the preferred node */
- preferred_cpu = task_numa_find_cpu(p, p->numa_preferred_nid);
- if (migrate_task_to(p, preferred_cpu) != 0)
- p->numa_migrate_retry = jiffies + HZ*5;
+ if (!sched_feat(NUMA_BALANCE))
+ return;
+
+ task_numa_migrate(p);
+
+ /* Try again until we hit the preferred node */
+ p->numa_migrate_retry = jiffies + HZ/10;
}
static void task_numa_placement(struct task_struct *p)
@@ -1003,7 +1122,7 @@ static void task_numa_placement(struct t
/* Find the node with the highest number of faults */
for (nid = 0; nid < nr_node_ids; nid++) {
- unsigned long faults;
+ unsigned long faults = 0;
int priv, i;
for (priv = 0; priv < 2; priv++) {
@@ -1013,19 +1132,16 @@ static void task_numa_placement(struct t
p->numa_faults[i] >>= 1;
p->numa_faults[i] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
+
+ faults += p->numa_faults[i];
}
- /* Find maximum private faults */
- faults = p->numa_faults[task_faults_idx(nid, 1)];
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
}
}
- if (!sched_feat(NUMA_BALANCE))
- return;
-
/* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
int old_migrate_seq = p->numa_migrate_seq;
@@ -3342,7 +3458,7 @@ static long effective_load(struct task_g
{
struct sched_entity *se = tg->se[cpu];
- if (!tg->parent) /* the trivial, non-cgroup case */
+ if (!tg->parent || !wl) /* the trivial / non-cgroup case */
return wl;
for_each_sched_entity(se) {
@@ -4347,8 +4463,6 @@ static int move_one_task(struct lb_env *
return 0;
}
-static unsigned long task_h_load(struct task_struct *p);
-
static const unsigned int sched_nr_migrate_break = 32;
/*
On Thu, Aug 01, 2013 at 10:17:57AM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <[email protected]> [2013-07-15 16:20:10]:
>
> > A preferred node is selected based on the node the most NUMA hinting
> > faults were incurred on. There is no guarantee that the task is running
> > on that node at the time so this patch reschedules the task to run on
> > the most idle CPU of the selected node when selected. This avoids
> > waiting for the balancer to make a decision.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > kernel/sched/core.c | 17 +++++++++++++++++
> > kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
> > kernel/sched/sched.h | 1 +
> > 3 files changed, 63 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 5e02507..b67a102 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4856,6 +4856,23 @@ fail:
> > return ret;
> > }
> >
> > +#ifdef CONFIG_NUMA_BALANCING
> > +/* Migrate current task p to target_cpu */
> > +int migrate_task_to(struct task_struct *p, int target_cpu)
> > +{
> > + struct migration_arg arg = { p, target_cpu };
> > + int curr_cpu = task_cpu(p);
> > +
> > + if (curr_cpu == target_cpu)
> > + return 0;
> > +
> > + if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
> > + return -EINVAL;
> > +
> > + return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
>
> As I had noted earlier, this upsets schedstats badly.
> Can we add a TODO for this patch noting that schedstats need to be
> taken care of?
>
I added a TODO comment because there is a possibility that this will all
change again with the stop_two_cpus patch.
--
Mel Gorman
SUSE Labs
On Thu, Aug 01, 2013 at 12:40:13PM +0530, Srikar Dronamraju wrote:
> > +static int task_numa_find_cpu(struct task_struct *p, int nid)
> > +{
> > + int node_cpu = cpumask_first(cpumask_of_node(nid));
> > + int cpu, src_cpu = task_cpu(p), dst_cpu = src_cpu;
> > + unsigned long src_load, dst_load;
> > + unsigned long min_load = ULONG_MAX;
> > + struct task_group *tg = task_group(p);
> > + s64 src_eff_load, dst_eff_load;
> > + struct sched_domain *sd;
> > + unsigned long weight;
> > + bool balanced;
> > + int imbalance_pct, idx = -1;
> >
> > + /* No harm being optimistic */
> > + if (idle_cpu(node_cpu))
> > + return node_cpu;
>
> Can't this lead to a lot of imbalance across nodes? Won't it lead to a lot
> of ping-ponging of tasks between different nodes, resulting in a performance
> hit?
Ideally it wouldn't, because if we are trying to migrate the task there in
the first place then it must have been scheduled there for long enough to
accumulate those faults. Now, there might be a ping-pong effect because a
task gets moved off by the load balancer because the CPUs are overloaded and
now we're trying to move it back. If we can detect that this is happening
then one way of dealing with it would be to clear p->numa_faults[] when
a task is moved off a node due to compute overload.
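A minimal sketch of that idea, assuming the fault arrays from this series
(numa_faults and numa_faults_buffer share one allocation of 4 * nr_node_ids
counters); where exactly the load balancer would call it is left open:

static void task_numa_clear_faults(struct task_struct *p)
{
	int i;

	if (!p->numa_faults)
		return;

	/* Wipe both the decayed counters and the current scan buffer */
	for (i = 0; i < 4 * nr_node_ids; i++)
		p->numa_faults[i] = 0;
}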
> Let's say the system is not fully loaded, something like numa01
> but with a far smaller number of threads, probably nr_cpus/2 or nr_cpus/4;
> then all threads will try to move to a single node because we keep seeing
> idle cpus there. No? Won't that lead to all the load moving to one node and
> the load balancer spreading it out again...
>
I cannot be 100% certain. I'm not strong enough on the scheduler yet and
the compute overloading handling is currently too weak.
--
Mel Gorman
SUSE Labs
On Thu, Aug 01, 2013 at 10:43:27AM +0530, Srikar Dronamraju wrote:
> * Mel Gorman <[email protected]> [2013-07-15 16:20:19]:
>
> > When a preferred node is selected for a task there is an attempt to migrate
> > the task to a CPU there. This may fail in which case the task will only
> > migrate if the active load balancer takes action. This may never happen if
>
> Apart from load imbalance or heavily loaded cpus on the preferred node,
> what could the other reasons be for migration failure with
> migrate_task_to()?
These were the reasons I expected that migration might fail.
> I see it as almost similar to active load balancing, except
> for pushing instead of pulling tasks.
>
> If load imbalance is the only reason, do we need to retry? If the task
> is really so attached to memory on that node, shouldn't we be getting a
> task_numa_placement hit before the next 5 seconds?
>
Depends on the PTE scanning rate.
--
Mel Gorman
SUSE Labs
On Thu, Aug 01, 2013 at 10:29:58AM +0530, Srikar Dronamraju wrote:
> > @@ -904,6 +908,8 @@ static int task_numa_find_cpu(struct task_struct *p, int nid)
> > src_eff_load *= src_load + effective_load(tg, src_cpu, -weight, -weight);
> >
> > for_each_cpu(cpu, cpumask_of_node(nid)) {
> > + struct task_struct *swap_candidate = NULL;
> > +
> > dst_load = target_load(cpu, idx);
> >
> > /* If the CPU is idle, use it */
> > @@ -922,12 +928,41 @@ static int task_numa_find_cpu(struct task_struct *p, int nid)
> > * migrate to its preferred node due to load imbalances.
> > */
> > balanced = (dst_eff_load <= src_eff_load);
> > - if (!balanced)
> > - continue;
> > + if (!balanced) {
> > + struct rq *rq = cpu_rq(cpu);
> > + unsigned long src_faults, dst_faults;
> > +
> > + /* Do not move tasks off their preferred node */
> > + if (rq->curr->numa_preferred_nid == nid)
> > + continue;
> > +
> > + /* Do not attempt an illegal migration */
> > + if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(rq->curr)))
> > + continue;
> > +
> > + /*
> > + * Do not impair locality for the swap candidate.
> > + * Destination for the swap candidate is the source cpu
> > + */
> > + if (rq->curr->numa_faults) {
> > + src_faults = rq->curr->numa_faults[task_faults_idx(nid, 1)];
> > + dst_faults = rq->curr->numa_faults[task_faults_idx(src_cpu_node, 1)];
> > + if (src_faults > dst_faults)
> > + continue;
> > + }
> > +
> > + /*
> > + * The destination is overloaded but running a task
> > + * that is not running on its preferred node. Consider
> > + * swapping the CPU tasks are running on.
> > + */
> > + swap_candidate = rq->curr;
> > + }
> >
> > if (dst_load < min_load) {
> > min_load = dst_load;
> > dst_cpu = cpu;
> > + *swap_p = swap_candidate;
>
> Are we sometimes passing a wrong candidate?
> Let's say balanced is false for the first cpu and we set the swap_candidate,
> but then find the second (or a later) cpu to be idle or to have a lower
> effective load; we could end up sending the task that is running on the
> first cpu as the swap candidate.
Then at the second or later CPU, swap_candidate == NULL, so swap_p is
cleared too.
--
Mel Gorman
SUSE Labs
On Wed, Jul 31, 2013 at 06:39:03PM +0200, Peter Zijlstra wrote:
> On Wed, Jul 31, 2013 at 05:11:41PM +0100, Mel Gorman wrote:
> > RSS was another option but it felt as arbitrary as a plain delay.
>
> Right, it would avoid 'small' programs getting scanning done with the
> rationale that their cost isn't that large since they don't have much
> memory to begin with.
>
Yeah, but it's not necessarily true. Whatever value we pick, there can be
an OpenMP process that fits under it.
> The same can be said for tasks that don't run much -- irrespective of
> how much absolute runtime they've gathered.
>
> Is there any other group of tasks that we do not want to scan?
>
strcmp(p->comm, ....)
> Maybe if we can list all the various exclusions we can get to a proper
> quantifier that way.
>
> So far we've got:
>
> - doesn't run long
> - doesn't run much
> - doesn't have much memory
>
- does not have sysV shm sections
> > Should I revert 5bca23035391928c4c7301835accca3551b96cc2 with an
> > explanation that it potentially is completely useless in the purely
> > multi-process shared case?
>
> Yeah I suppose so..
Will do.
--
Mel Gorman
SUSE Labs
On 08/01/2013 06:37 AM, Peter Zijlstra wrote:
> On Thu, Aug 01, 2013 at 02:23:19AM -0400, Rik van Riel wrote:
>> Subject: [PATCH,RFC] numa,sched: use group fault statistics in numa placement
>>
>> Here is a quick strawman on how the group fault stuff could be used
>> to help pick the best node for a task. This is likely to be quite
>> suboptimal and in need of tweaking. My main goal is to get this to
>> Peter & Mel before it's breakfast time on their side of the Atlantic...
>>
>> This goes on top of "sched, numa: Use {cpu, pid} to create task groups for shared faults"
>>
>> Enjoy :)
>>
>> + /*
>> + * Should we stay on our own, or move in with the group?
>> + * The absolute count of faults may not be useful, but comparing
>> + * the fraction of accesses in each top node may give us a hint
>> + * where to start looking for a migration target.
>> + *
>> + * max_group_faults max_faults
>> + * ------------------ > ------------
>> + * total_group_faults total_faults
>> + */
>> + if (max_group_nid >= 0 && max_group_nid != max_nid) {
>> + if (max_group_faults * total_faults >
>> + max_faults * total_group_faults)
>> + max_nid = max_group_nid;
>> + }
>
> This makes sense.. another part of the problem, which you might already
> have spotted, is selecting a task to swap with.
>
> If you only look at per-task faults it is often impossible to find a
> suitable swap task, because moving you to a more suitable node would
> degrade the other task -- below is a patch you've already seen but that I
> haven't yet posted because I'm not at all sure it's something 'sane' :-)
I did not realize you had not posted that patch yet, and was
actually building on top of it :)
I suspect that comparing both per-task and per-group fault weights
in task_numa_compare should make your code do the right thing in
task_numa_migrate.
I suspect there will be enough randomness in accesses that they
will never be exactly the same, so we might not need an explicit
tie breaker.
However, if numa_migrate_preferred fails, we may want to try
migrating to any node that has a better score than the current
one. After all, if we have a group of tasks that would fit in
2 NUMA nodes, we don't want half of the tasks to not migrate
at all because the top node is full. We want them to move to
the #2 node at some point.
--
All rights reversed
On Tue, 30 Jul 2013 13:24:39 +0200
Peter Zijlstra <[email protected]> wrote:
>
> Subject: mm, numa: Change page last {nid,pid} into {cpu,pid}
> From: Peter Zijlstra <[email protected]>
> Date: Thu Jul 25 18:44:50 CEST 2013
>
> Change the per page last fault tracking to use cpu,pid instead of
> nid,pid. This will allow us to try and lookup the alternate task more
> easily.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
Here are some compile fixes for !CONFIG_NUMA_BALANCING
Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/mm.h | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d2f91a2..4f34a37 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -746,7 +746,12 @@ static inline int cpupid_to_pid(int cpupid)
return -1;
}
-static inline int nid_pid_to_cpupid(int nid, int pid)
+static inline int cpupid_to_cpu(int cpupid)
+{
+ return -1;
+}
+
+static inline int cpu_pid_to_cpupid(int nid, int pid)
{
return -1;
}
On Tue, 30 Jul 2013 13:38:57 +0200
Peter Zijlstra <[email protected]> wrote:
>
> Subject: sched, numa: Use {cpu, pid} to create task groups for shared faults
> From: Peter Zijlstra <[email protected]>
> Date: Tue Jul 30 10:40:20 CEST 2013
>
> A very simple/straight forward shared fault task grouping
> implementation.
Here is another (untested) version of task placement on top of
your task grouping. Better send it to you now, rather than on
your Friday evening :)
The algorithm is loosely based on Andrea's placement algorithm,
but not as strict because I am not entirely confident that the
task grouping code works right yet...
It also has no "fall back to a better node than the current one"
code yet, for the case where we fail to migrate to the best node,
but that should be a separate patch anyway.
Subject: [PATCH,RFC] numa,sched: use group fault statistics in numa placement
This version uses the fraction of faults on a particular node for
both task and group, to figure out the best node to place a task.
I wish I had benchmark numbers to report, but our timezones just
don't seem to work out that way. Enjoy at your own peril :)
I will be testing these tomorrow.
Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 94 +++++++++++++++++++++++++++++++++++++++++----------
2 files changed, 77 insertions(+), 18 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9e7fcfe..5e175ae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1355,6 +1355,7 @@ struct task_struct {
* The values remain static for the duration of a PTE scan
*/
unsigned long *numa_faults;
+ unsigned long total_numa_faults;
/*
* numa_faults_buffer records faults per node during the current
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a06bef..3ef4d45 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -844,6 +844,18 @@ static unsigned int task_scan_max(struct task_struct *p)
*/
unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+struct numa_group {
+ atomic_t refcount;
+
+ spinlock_t lock; /* nr_tasks, tasks */
+ int nr_tasks;
+ struct list_head task_list;
+
+ struct rcu_head rcu;
+ atomic_long_t total_faults;
+ atomic_long_t faults[0];
+};
+
static inline int task_faults_idx(int nid, int priv)
{
return 2 * nid + priv;
@@ -857,6 +869,38 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
return p->numa_faults[2*nid] + p->numa_faults[2*nid+1];
}
+static inline unsigned long group_faults(struct task_struct *p, int nid)
+{
+ if (!p->numa_group)
+ return 0;
+
+ return atomic_long_read(&p->numa_group->faults[2*nid]) +
+ atomic_long_read(&p->numa_group->faults[2*nid+1]);
+}
+
+/*
+ * These return the fraction of accesses done by a particular task, or
+ * task group, on a particular numa node. The group weight is given a
+ * larger multiplier, in order to group tasks together that are almost
+ * evenly spread out between numa nodes.
+ */
+static inline unsigned long task_weight(struct task_struct *p, int nid)
+{
+ if (!p->numa_faults)
+ return 0;
+
+ return 1000 * task_faults(p, nid) / p->total_numa_faults;
+}
+
+static inline unsigned long group_weight(struct task_struct *p, int nid)
+{
+ if (!p->numa_group)
+ return 0;
+
+ return 1200 * group_faults(p, nid) /
+ atomic_long_read(&p->numa_group->total_faults);
+}
+
/*
* Create/Update p->mempolicy MPOL_INTERLEAVE to match p->numa_faults[].
*/
@@ -979,8 +1023,10 @@ static void task_numa_compare(struct task_numa_env *env, long imp)
cur = NULL;
if (cur) {
- imp += task_faults(cur, env->src_nid) -
- task_faults(cur, env->dst_nid);
+ imp += task_faults(cur, env->src_nid) +
+ group_faults(cur, env->src_nid) -
+ task_faults(cur, env->dst_nid) -
+ group_faults(cur, env->dst_nid);
}
trace_printk("compare[%d] task:%s/%d improvement: %ld\n",
@@ -1067,7 +1113,7 @@ static int task_numa_migrate(struct task_struct *p)
}
rcu_read_unlock();
- faults = task_faults(p, env.src_nid);
+ faults = task_faults(p, env.src_nid) + group_faults(p, env.src_nid);
update_numa_stats(&env.src_stats, env.src_nid);
for_each_online_node(nid) {
@@ -1076,7 +1122,7 @@ static int task_numa_migrate(struct task_struct *p)
if (nid == env.src_nid)
continue;
- imp = task_faults(p, nid) - faults;
+ imp = task_faults(p, nid) + group_faults(p, nid) - faults;
if (imp < 0)
continue;
@@ -1122,21 +1168,10 @@ static void numa_migrate_preferred(struct task_struct *p)
p->numa_migrate_retry = jiffies + HZ/10;
}
-struct numa_group {
- atomic_t refcount;
-
- spinlock_t lock; /* nr_tasks, tasks */
- int nr_tasks;
- struct list_head task_list;
-
- struct rcu_head rcu;
- atomic_long_t faults[0];
-};
-
static void task_numa_placement(struct task_struct *p)
{
- int seq, nid, max_nid = -1;
- unsigned long max_faults = 0;
+ int seq, nid, max_nid = -1, max_group_nid = -1;
+ unsigned long max_faults = 0, max_group_faults = 0;
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
@@ -1148,7 +1183,7 @@ static void task_numa_placement(struct task_struct *p)
/* Find the node with the highest number of faults */
for (nid = 0; nid < nr_node_ids; nid++) {
- unsigned long faults = 0;
+ unsigned long faults = 0, group_faults = 0;
int priv, i;
for (priv = 0; priv < 2; priv++) {
@@ -1161,6 +1196,7 @@ static void task_numa_placement(struct task_struct *p)
/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
p->numa_faults[i] += p->numa_faults_buffer[i];
+ p->total_numa_faults += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
diff += p->numa_faults[i];
@@ -1169,6 +1205,8 @@ static void task_numa_placement(struct task_struct *p)
if (p->numa_group) {
/* safe because we can only change our own group */
atomic_long_add(diff, &p->numa_group->faults[i]);
+ atomic_long_add(diff, &p->numa_group->total_faults);
+ group_faults += atomic_long_read(&p->numa_group->faults[i]);
}
}
@@ -1176,11 +1214,29 @@ static void task_numa_placement(struct task_struct *p)
max_faults = faults;
max_nid = nid;
}
+
+ if (group_faults > max_group_faults) {
+ max_group_faults = group_faults;
+ max_group_nid = nid;
+ }
}
if (sched_feat(NUMA_INTERLEAVE))
task_numa_mempol(p, max_faults);
+ /*
+ * Should we stay on our own, or move in with the group?
+ * If the task's memory accesses are concentrated on one node, go
+ * to (more likely, stay on) that node. If the group's accesses
+ * are more concentrated than the task's accesses, join the group.
+ *
+ * max_group_faults max_faults
+ * ------------------ > ------------
+ * total_group_faults total_faults
+ */
+ if (group_weight(p, max_group_nid) > task_weight(p, max_nid))
+ max_nid = max_group_nid;
+
/* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
@@ -1242,6 +1298,7 @@ void task_numa_group(struct task_struct *p, int cpu, int pid)
atomic_set(&grp->refcount, 1);
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
+ atomic_long_set(&grp->total_faults, 0);
spin_lock(&p->numa_lock);
list_add(&p->numa_entry, &grp->task_list);
@@ -1336,6 +1393,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
BUG_ON(p->numa_faults_buffer);
p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ p->total_numa_faults = 0;
}
/*
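For a concrete (invented) example of how the 1000/1200 weighting plays
out: a task with 60 of its 100 faults on node 0 gets task_weight =
1000 * 60 / 100 = 600 there, while its group with 550 of 1000 faults on
node 1 gets group_weight = 1200 * 550 / 1000 = 660. Since 660 > 600 the
task is steered to the group's node even though its own accesses are
slightly more concentrated (60% vs 55%); the 1200 vs 1000 multiplier is
what gives the group roughly a 20% edge in such comparisons.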
Here's the latest; it seems to not crash and appears to actually do as
advertised.
---
Subject: sched, numa: Use {cpu, pid} to create task groups for shared faults
From: Peter Zijlstra <[email protected]>
Date: Tue Jul 30 10:40:20 CEST 2013
A very simple/straight forward shared fault task grouping
implementation.
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/sched.h | 3
kernel/sched/core.c | 3
kernel/sched/fair.c | 174 +++++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 5 -
4 files changed, 171 insertions(+), 14 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1341,6 +1341,9 @@ struct task_struct {
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
+ struct list_head numa_entry;
+ struct numa_group *numa_group;
+
/*
* Exponential decaying average of faults on a per-node basis.
* Scheduling placement decisions are made based on the these counts.
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1733,6 +1733,9 @@ static void __sched_fork(struct task_str
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+
+ INIT_LIST_HEAD(&p->numa_entry);
+ p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1149,6 +1149,17 @@ static void numa_migrate_preferred(struc
p->numa_migrate_retry = jiffies + HZ/10;
}
+struct numa_group {
+ atomic_t refcount;
+
+ spinlock_t lock; /* nr_tasks, tasks */
+ int nr_tasks;
+ struct list_head task_list;
+
+ struct rcu_head rcu;
+ atomic_long_t faults[0];
+};
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -1157,6 +1168,7 @@ static void task_numa_placement(struct t
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
return;
+
p->numa_scan_seq = seq;
p->numa_migrate_seq++;
p->numa_scan_period_max = task_scan_max(p);
@@ -1167,14 +1179,24 @@ static void task_numa_placement(struct t
int priv, i;
for (priv = 0; priv < 2; priv++) {
+ long diff;
+
i = task_faults_idx(nid, priv);
+ diff = -p->numa_faults[i];
+
/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
p->numa_faults[i] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
+ diff += p->numa_faults[i];
faults += p->numa_faults[i];
+
+ if (p->numa_group) {
+ /* safe because we can only change our own group */
+ atomic_long_add(diff, &p->numa_group->faults[i]);
+ }
}
if (faults > max_faults) {
@@ -1211,6 +1233,130 @@ static void task_numa_placement(struct t
}
}
+static inline int get_numa_group(struct numa_group *grp)
+{
+ return atomic_inc_not_zero(&grp->refcount);
+}
+
+static inline void put_numa_group(struct numa_group *grp)
+{
+ if (atomic_dec_and_test(&grp->refcount))
+ kfree_rcu(grp, rcu);
+}
+
+static void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+ if (l1 > l2)
+ swap(l1, l2);
+
+ spin_lock(l1);
+ spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
+static void task_numa_group(struct task_struct *p, int cpu, int pid)
+{
+ struct numa_group *grp, *my_grp;
+ struct task_struct *tsk;
+ bool join = false;
+ int i;
+
+ if (unlikely(!p->numa_group)) {
+ unsigned int size = sizeof(struct numa_group) +
+ 2*nr_node_ids*sizeof(atomic_long_t);
+
+ grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+ if (!grp)
+ return;
+
+ atomic_set(&grp->refcount, 1);
+ spin_lock_init(&grp->lock);
+ INIT_LIST_HEAD(&grp->task_list);
+
+ for (i = 0; i < 2*nr_node_ids; i++)
+ atomic_long_set(&grp->faults[i], p->numa_faults[i]);
+
+ list_add(&p->numa_entry, &grp->task_list);
+ grp->nr_tasks++;
+ rcu_assign_pointer(p->numa_group, grp);
+ }
+
+ rcu_read_lock();
+ tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
+
+ if ((tsk->pid & LAST__PID_MASK) != pid)
+ goto unlock;
+
+ grp = rcu_dereference(tsk->numa_group);
+ if (!grp)
+ goto unlock;
+
+ my_grp = p->numa_group;
+ if (grp == my_grp)
+ goto unlock;
+
+ /*
+ * Only join the other group if its bigger; if we're the bigger group,
+ * the other task will join us.
+ */
+ if (my_grp->nr_tasks > grp->nr_tasks)
+ goto unlock;
+
+ /*
+ * Tie-break on the grp address.
+ */
+ if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp)
+ goto unlock;
+
+ if (!get_numa_group(grp))
+ goto unlock;
+
+ join = true;
+
+unlock:
+ rcu_read_unlock();
+
+ if (!join)
+ return;
+
+ for (i = 0; i < 2*nr_node_ids; i++) {
+ atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
+ atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+ }
+
+ double_lock(&my_grp->lock, &grp->lock);
+
+ list_move(&p->numa_entry, &grp->task_list);
+ my_grp->nr_tasks--;
+ grp->nr_tasks++;
+
+ spin_unlock(&my_grp->lock);
+ spin_unlock(&grp->lock);
+
+ rcu_assign_pointer(p->numa_group, grp);
+
+ put_numa_group(my_grp);
+}
+
+void task_numa_free(struct task_struct *p)
+{
+ struct numa_group *grp = p->numa_group;
+ int i;
+
+ kfree(p->numa_faults);
+
+ if (grp) {
+ for (i = 0; i < 2*nr_node_ids; i++)
+ atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
+
+ spin_lock(&grp->lock);
+ list_del(&p->numa_entry);
+ grp->nr_tasks--;
+ spin_unlock(&grp->lock);
+ rcu_assign_pointer(p->numa_group, NULL);
+ put_numa_group(grp);
+ }
+}
+
/*
* Got a PROT_NONE fault for a page on @node.
*/
@@ -1226,21 +1372,12 @@ void task_numa_fault(int last_cpupid, in
if (!p->mm)
return;
- /*
- * First accesses are treated as private, otherwise consider accesses
- * to be private if the accessing pid has not changed
- */
- if (!cpupid_pid_unset(last_cpupid))
- priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
- else
- priv = 1;
-
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
/* numa_faults and numa_faults_buffer share the allocation */
- p->numa_faults = kzalloc(size * 2, GFP_KERNEL);
+ p->numa_faults = kzalloc(size * 2, GFP_KERNEL | __GFP_NOWARN);
if (!p->numa_faults)
return;
@@ -1249,6 +1386,23 @@ void task_numa_fault(int last_cpupid, in
}
/*
+ * First accesses are treated as private, otherwise consider accesses
+ * to be private if the accessing pid has not changed
+ */
+ if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) {
+ priv = 1;
+ } else {
+ int cpu, pid;
+
+ cpu = cpupid_to_cpu(last_cpupid);
+ pid = cpupid_to_pid(last_cpupid);
+
+ priv = (pid == (p->pid & LAST__PID_MASK));
+ if (!priv)
+ task_numa_group(p, cpu, pid);
+ }
+
+ /*
* If pages are properly placed (did not migrate) then scan slower.
* This is reset periodically in case of phase changes
*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -556,10 +556,7 @@ static inline u64 rq_clock_task(struct r
#ifdef CONFIG_NUMA_BALANCING
extern int migrate_task_to(struct task_struct *p, int cpu);
extern int migrate_swap(struct task_struct *, struct task_struct *);
-static inline void task_numa_free(struct task_struct *p)
-{
- kfree(p->numa_faults);
-}
+extern void task_numa_free(struct task_struct *p);
#else /* CONFIG_NUMA_BALANCING */
static inline void task_numa_free(struct task_struct *p)
{
Subject: mm, numa: Do not group on RO pages
From: Peter Zijlstra <[email protected]>
Date: Fri Aug 2 18:38:34 CEST 2013
And here's a little something to make sure the whole world doesn't end up
in a single group.
While we don't migrate shared executable pages, we do scan/fault on
them, and since everybody links to libc, everybody ends up in the same
group.
Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/sched.h | 7 +++++--
kernel/sched/fair.c | 5 +++--
mm/huge_memory.c | 15 +++++++++++++--
mm/memory.c | 31 ++++++++++++++++++++++++++-----
4 files changed, 47 insertions(+), 11 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1438,12 +1438,15 @@ struct task_struct {
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)
+#define TNF_MIGRATED 0x01
+#define TNF_NO_GROUP 0x02
+
#ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern void set_numabalancing_state(bool enabled);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
- bool migrated)
+ int flags)
{
}
static inline void set_numabalancing_state(bool enabled)
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1371,9 +1371,10 @@ void task_numa_free(struct task_struct *
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
+void task_numa_fault(int last_cpupid, int node, int pages, int flags)
{
struct task_struct *p = current;
+ bool migrated = flags & TNF_MIGRATED;
int priv;
if (!numabalancing_enabled)
@@ -1409,7 +1410,7 @@ void task_numa_fault(int last_cpupid, in
pid = cpupid_to_pid(last_cpupid);
priv = (pid == (p->pid & LAST__PID_MASK));
- if (!priv)
+ if (!priv && !(flags & TNF_NO_GROUP))
task_numa_group(p, cpu, pid);
}
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1295,6 +1295,7 @@ int do_huge_pmd_numa_page(struct mm_stru
int page_nid = -1, account_nid = -1, this_nid = numa_node_id();
int target_nid, last_cpupid;
bool migrated = false;
+ int flags = 0;
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1333,6 +1334,15 @@ int do_huge_pmd_numa_page(struct mm_stru
account_nid = page_nid = -1; /* someone else took our fault */
goto out_unlock;
}
+
+ /*
+ * Avoid grouping on DSO/COW pages in specific and RO pages
+ * in general, RO pages shouldn't hurt as much anyway since
+ * they can be in shared cache state.
+ */
+ if (page_mapcount(page) != 1 && !pmd_write(pmd))
+ flags |= TNF_NO_GROUP;
+
spin_unlock(&mm->page_table_lock);
/* Migrate the THP to the requested node */
@@ -1341,7 +1351,8 @@ int do_huge_pmd_numa_page(struct mm_stru
if (!migrated) {
account_nid = -1; /* account against the old page */
goto check_same;
- }
+ } else
+ flags |= TNF_MIGRATED;
page_nid = target_nid;
goto out;
@@ -1364,7 +1375,7 @@ int do_huge_pmd_numa_page(struct mm_stru
if (account_nid == -1)
account_nid = page_nid;
if (account_nid != -1)
- task_numa_fault(last_cpupid, account_nid, HPAGE_PMD_NR, migrated);
+ task_numa_fault(last_cpupid, account_nid, HPAGE_PMD_NR, flags);
return 0;
}
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3537,6 +3537,7 @@ int do_numa_page(struct mm_struct *mm, s
int page_nid = -1, account_nid = -1;
int target_nid, last_cpupid;
bool migrated = false;
+ int flags = 0;
/*
* The "pte" at this point cannot be used safely without
@@ -3569,6 +3570,14 @@ int do_numa_page(struct mm_struct *mm, s
return 0;
}
+ /*
+ * Avoid grouping on DSO/COW pages in specific and RO pages
+ * in general, RO pages shouldn't hurt as much anyway since
+ * they can be in shared cache state.
+ */
+ if (page_mapcount(page) != 1 && !pte_write(pte))
+ flags |= TNF_NO_GROUP;
+
last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid, &account_nid);
@@ -3580,14 +3589,16 @@ int do_numa_page(struct mm_struct *mm, s
/* Migrate to the requested node */
migrated = migrate_misplaced_page(page, vma, target_nid);
- if (migrated)
+ if (migrated) {
page_nid = target_nid;
+ flags |= TNF_MIGRATED;
+ }
out:
if (account_nid == -1)
account_nid = page_nid;
if (account_nid != -1)
- task_numa_fault(last_cpupid, account_nid, 1, migrated);
+ task_numa_fault(last_cpupid, account_nid, 1, flags);
return 0;
}
@@ -3632,6 +3643,7 @@ static int do_pmd_numa_page(struct mm_st
int page_nid = -1, account_nid = -1;
int target_nid;
bool migrated = false;
+ int flags = 0;
if (!pte_present(pteval))
continue;
@@ -3651,6 +3663,14 @@ static int do_pmd_numa_page(struct mm_st
if (unlikely(!page))
continue;
+ /*
+ * Avoid grouping on DSO/COW pages in specific and RO pages
+ * in general, RO pages shouldn't hurt as much anyway since
+ * they can be in shared cache state.
+ */
+ if (page_mapcount(page) != 1 && !pte_write(pteval))
+ flags |= TNF_NO_GROUP;
+
last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr,
@@ -3659,9 +3679,10 @@ static int do_pmd_numa_page(struct mm_st
if (target_nid != -1) {
migrated = migrate_misplaced_page(page, vma, target_nid);
- if (migrated)
+ if (migrated) {
page_nid = target_nid;
- else
+ flags |= TNF_MIGRATED;
+ } else
account_nid = -1;
} else {
put_page(page);
@@ -3670,7 +3691,7 @@ static int do_pmd_numa_page(struct mm_st
if (account_nid == -1)
account_nid = page_nid;
if (account_nid != -1)
- task_numa_fault(last_cpupid, account_nid, 1, migrated);
+ task_numa_fault(last_cpupid, account_nid, 1, flags);
cond_resched();
> + /*
> + * Avoid grouping on DSO/COW pages in specific and RO pages
> + * in general, RO pages shouldn't hurt as much anyway since
> + * they can be in shared cache state.
> + */
OK, so that comment is crap. It's that you cannot write into RO pages,
and thus RO pages don't establish a collaboration.
> + if (page_mapcount(page) != 1 && !pmd_write(pmd))
> + flags |= TNF_NO_GROUP;
Rik also noted that mapcount == 1 will trivially not form groups. This
should indeed be so but I didn't test it without that clause.
On Fri, 2 Aug 2013 18:50:32 +0200
Peter Zijlstra <[email protected]> wrote:
> Subject: mm, numa: Do not group on RO pages
Using the fraction of the faults that happen on each node to
determine both the group weight and the task weight of each
node, and attempting to move the task to the node with the
highest score, seems to work fairly well.
Here are the specjbb scores with this patch, on top of your
task grouping patches:
Warehouses    vanilla    numasched7
         1      40651         45657
         2      82897         88827
         3     116623        130644
         4     144512        171051
         5     176681        209915
         6     190471        247480
         7     204036        283966
         8     214466        318464
         9     223451        348657
        10     227439        380886
        11     226163        374822
        12     220857        370519
        13     215871        367582
        14     210965        361110
I suspect there may be further room for improvement, but it
may be time for this patch to go into Mel's tree, so others
will test it as well, helping us all learn what is broken
and how it can be improved...
Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 109 +++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 91 insertions(+), 19 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9e7fcfe..5e175ae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1355,6 +1355,7 @@ struct task_struct {
* The values remain static for the duration of a PTE scan
*/
unsigned long *numa_faults;
+ unsigned long total_numa_faults;
/*
* numa_faults_buffer records faults per node during the current
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a06bef..2c9c1dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -844,6 +844,18 @@ static unsigned int task_scan_max(struct task_struct *p)
*/
unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+struct numa_group {
+ atomic_t refcount;
+
+ spinlock_t lock; /* nr_tasks, tasks */
+ int nr_tasks;
+ struct list_head task_list;
+
+ struct rcu_head rcu;
+ atomic_long_t total_faults;
+ atomic_long_t faults[0];
+};
+
static inline int task_faults_idx(int nid, int priv)
{
return 2 * nid + priv;
@@ -857,6 +869,51 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
return p->numa_faults[2*nid] + p->numa_faults[2*nid+1];
}
+static inline unsigned long group_faults(struct task_struct *p, int nid)
+{
+ if (!p->numa_group)
+ return 0;
+
+ return atomic_long_read(&p->numa_group->faults[2*nid]) +
+ atomic_long_read(&p->numa_group->faults[2*nid+1]);
+}
+
+/*
+ * These return the fraction of accesses done by a particular task, or
+ * task group, on a particular numa node. The group weight is given a
+ * larger multiplier, in order to group tasks together that are almost
+ * evenly spread out between numa nodes.
+ */
+static inline unsigned long task_weight(struct task_struct *p, int nid)
+{
+ unsigned long total_faults;
+
+ if (!p->numa_faults)
+ return 0;
+
+ total_faults = p->total_numa_faults;
+
+ if (!total_faults)
+ return 0;
+
+ return 1000 * task_faults(p, nid) / total_faults;
+}
+
+static inline unsigned long group_weight(struct task_struct *p, int nid)
+{
+ unsigned long total_faults;
+
+ if (!p->numa_group)
+ return 0;
+
+ total_faults = atomic_long_read(&p->numa_group->total_faults);
+
+ if (!total_faults)
+ return 0;
+
+ return 1200 * group_faults(p, nid) / total_faults;
+}
+
/*
* Create/Update p->mempolicy MPOL_INTERLEAVE to match p->numa_faults[].
*/
@@ -979,8 +1036,10 @@ static void task_numa_compare(struct task_numa_env *env, long imp)
cur = NULL;
if (cur) {
- imp += task_faults(cur, env->src_nid) -
- task_faults(cur, env->dst_nid);
+ imp += task_weight(cur, env->src_nid) +
+ group_weight(cur, env->src_nid) -
+ task_weight(cur, env->dst_nid) -
+ group_weight(cur, env->dst_nid);
}
trace_printk("compare[%d] task:%s/%d improvement: %ld\n",
@@ -1051,7 +1110,7 @@ static int task_numa_migrate(struct task_struct *p)
.best_cpu = -1
};
struct sched_domain *sd;
- unsigned long faults;
+ unsigned long weight;
int nid, cpu, ret;
/*
@@ -1067,7 +1126,7 @@ static int task_numa_migrate(struct task_struct *p)
}
rcu_read_unlock();
- faults = task_faults(p, env.src_nid);
+ weight = task_weight(p, env.src_nid) + group_weight(p, env.src_nid);
update_numa_stats(&env.src_stats, env.src_nid);
for_each_online_node(nid) {
@@ -1076,7 +1135,7 @@ static int task_numa_migrate(struct task_struct *p)
if (nid == env.src_nid)
continue;
- imp = task_faults(p, nid) - faults;
+ imp = task_weight(p, nid) + group_weight(p, nid) - weight;
if (imp < 0)
continue;
@@ -1122,21 +1181,10 @@ static void numa_migrate_preferred(struct task_struct *p)
p->numa_migrate_retry = jiffies + HZ/10;
}
-struct numa_group {
- atomic_t refcount;
-
- spinlock_t lock; /* nr_tasks, tasks */
- int nr_tasks;
- struct list_head task_list;
-
- struct rcu_head rcu;
- atomic_long_t faults[0];
-};
-
static void task_numa_placement(struct task_struct *p)
{
- int seq, nid, max_nid = -1;
- unsigned long max_faults = 0;
+ int seq, nid, max_nid = -1, max_group_nid = -1;
+ unsigned long max_faults = 0, max_group_faults = 0;
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
@@ -1148,7 +1196,7 @@ static void task_numa_placement(struct task_struct *p)
/* Find the node with the highest number of faults */
for (nid = 0; nid < nr_node_ids; nid++) {
- unsigned long faults = 0;
+ unsigned long faults = 0, group_faults = 0;
int priv, i;
for (priv = 0; priv < 2; priv++) {
@@ -1161,6 +1209,7 @@ static void task_numa_placement(struct task_struct *p)
/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
p->numa_faults[i] += p->numa_faults_buffer[i];
+ p->total_numa_faults += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
diff += p->numa_faults[i];
@@ -1169,6 +1218,8 @@ static void task_numa_placement(struct task_struct *p)
if (p->numa_group) {
/* safe because we can only change our own group */
atomic_long_add(diff, &p->numa_group->faults[i]);
+ atomic_long_add(diff, &p->numa_group->total_faults);
+ group_faults += atomic_long_read(&p->numa_group->faults[i]);
}
}
@@ -1176,11 +1227,29 @@ static void task_numa_placement(struct task_struct *p)
max_faults = faults;
max_nid = nid;
}
+
+ if (group_faults > max_group_faults) {
+ max_group_faults = group_faults;
+ max_group_nid = nid;
+ }
}
if (sched_feat(NUMA_INTERLEAVE))
task_numa_mempol(p, max_faults);
+ /*
+ * Should we stay on our own, or move in with the group?
+ * If the task's memory accesses are concentrated on one node, go
+ * to (more likely, stay on) that node. If the group's accesses
+ * are more concentrated than the task's accesses, join the group.
+ *
+ * max_group_faults max_faults
+ * ------------------ > ------------
+ * total_group_faults total_faults
+ */
+ if (group_weight(p, max_group_nid) > task_weight(p, max_nid))
+ max_nid = max_group_nid;
+
/* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
@@ -1242,6 +1311,7 @@ void task_numa_group(struct task_struct *p, int cpu, int pid)
atomic_set(&grp->refcount, 1);
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
+ atomic_long_set(&grp->total_faults, 0);
spin_lock(&p->numa_lock);
list_add(&p->numa_entry, &grp->task_list);
@@ -1336,6 +1406,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
BUG_ON(p->numa_faults_buffer);
p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ p->total_numa_faults = 0;
}
/*
On 08/05/2013 03:36 PM, Rik van Riel wrote:
> On Fri, 2 Aug 2013 18:50:32 +0200
> Peter Zijlstra <[email protected]> wrote:
>
>> Subject: mm, numa: Do not group on RO pages
>
> Using the fraction of the faults that happen on each node to
> determine both the group weight and the task weight of each
> node, and attempting to move the task to the node with the
> highest score, seems to work fairly well.
>
> Here are the specjbb scores with this patch, on top of your
> task grouping patches:
>
> vanilla numasched7
> Warehouses
> 1 40651 45657
> 2 82897 88827
> 3 116623 130644
> 4 144512 171051
> 5 176681 209915
> 6 190471 247480
> 7 204036 283966
> 8 214466 318464
> 9 223451 348657
> 10 227439 380886
> 11 226163 374822
> 12 220857 370519
> 13 215871 367582
> 14 210965 361110
>
> I suspect there may be further room for improvement, but it
> may be time for this patch to go into Mel's tree, so others
> will test it as well, helping us all learn what is broken
> and how it can be improved...
I've been testing what I believe is the accumulation of Mel's original
changes, plus what Peter added via LKML and this thread, and then this
change. I don't think I missed any, but apologies if I did.
Looking at it with Andrea's AutoNUMA tests (modified to automatically
generate power-of-two runs based on the available nodes -- i.e. a 4-node
system runs 2-node then 4-node, an 8-node system runs 2, 4, 8, and a
16-node system (if I had one) would do 2, 4, 8, 16, etc.), it does look
like the "highest score" is being used -- but that's not really a great
thing for this type of private memory accessed by multiple processes.
In the unbound cases everything looks to be concentrating back into a
single node for the runs beyond 2 nodes, taking 1000+ seconds where the
stock kernel takes 670 and the hard binding takes only 483. So it looks
to me like the weighting here is a bit too strong -- we don't want all
the tasks on the same node (more threads than available processors)
when there's an idle node reasonably close we can move some of the
memory to. Granted, this would be easier in cases with really large DBs
where the memory *and* cpu load are both larger than the node
resources....
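For reference, here is a minimal stand-alone sketch of the per-node score
these patches compute (constants lifted from task_weight()/group_weight()
above; the function and variable names are illustrative only, not kernel
code):
/*
 * Stand-alone sketch of the node scoring used by the patches above.
 * task_faults/group_faults are this task's / its group's NUMA hinting
 * faults on the candidate node; *_total are the corresponding totals.
 */
static long node_score(long task_faults, long task_total,
		       long group_faults, long group_total)
{
	long task_w  = task_total  ? 1000 * task_faults  / task_total  : 0;
	long group_w = group_total ? 1200 * group_faults / group_total : 0;
	/*
	 * Same shape as task_weight() + group_weight(): the extra 20%
	 * on the group term is what pulls every task in a group towards
	 * the node with the most group faults, even when a task's own
	 * faults would favour a different node.
	 */
	return task_w + group_w;
}
With more runnable tasks in the group than CPUs on that node, that bias
is what produces the pile-up described above.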
Including a spreadsheet with the memory layout over time for the basic
run and the hard-binding run, plus a run summary for comparison.
Don
>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> include/linux/sched.h | 1 +
> kernel/sched/fair.c | 109 +++++++++++++++++++++++++++++++++++++++++---------
> 2 files changed, 91 insertions(+), 19 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9e7fcfe..5e175ae 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1355,6 +1355,7 @@ struct task_struct {
> * The values remain static for the duration of a PTE scan
> */
> unsigned long *numa_faults;
> + unsigned long total_numa_faults;
>
> /*
> * numa_faults_buffer records faults per node during the current
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6a06bef..2c9c1dd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -844,6 +844,18 @@ static unsigned int task_scan_max(struct task_struct *p)
> */
> unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
>
> +struct numa_group {
> + atomic_t refcount;
> +
> + spinlock_t lock; /* nr_tasks, tasks */
> + int nr_tasks;
> + struct list_head task_list;
> +
> + struct rcu_head rcu;
> + atomic_long_t total_faults;
> + atomic_long_t faults[0];
> +};
> +
> static inline int task_faults_idx(int nid, int priv)
> {
> return 2 * nid + priv;
> @@ -857,6 +869,51 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
> return p->numa_faults[2*nid] + p->numa_faults[2*nid+1];
> }
>
> +static inline unsigned long group_faults(struct task_struct *p, int nid)
> +{
> + if (!p->numa_group)
> + return 0;
> +
> + return atomic_long_read(&p->numa_group->faults[2*nid]) +
> + atomic_long_read(&p->numa_group->faults[2*nid+1]);
> +}
> +
> +/*
> + * These return the fraction of accesses done by a particular task, or
> + * task group, on a particular numa node. The group weight is given a
> + * larger multiplier, in order to group tasks together that are almost
> + * evenly spread out between numa nodes.
> + */
> +static inline unsigned long task_weight(struct task_struct *p, int nid)
> +{
> + unsigned long total_faults;
> +
> + if (!p->numa_faults)
> + return 0;
> +
> + total_faults = p->total_numa_faults;
> +
> + if (!total_faults)
> + return 0;
> +
> + return 1000 * task_faults(p, nid) / total_faults;
> +}
> +
> +static inline unsigned long group_weight(struct task_struct *p, int nid)
> +{
> + unsigned long total_faults;
> +
> + if (!p->numa_group)
> + return 0;
> +
> + total_faults = atomic_long_read(&p->numa_group->total_faults);
> +
> + if (!total_faults)
> + return 0;
> +
> + return 1200 * group_faults(p, nid) / total_faults;
> +}
> +
> /*
> * Create/Update p->mempolicy MPOL_INTERLEAVE to match p->numa_faults[].
> */
> @@ -979,8 +1036,10 @@ static void task_numa_compare(struct task_numa_env *env, long imp)
> cur = NULL;
>
> if (cur) {
> - imp += task_faults(cur, env->src_nid) -
> - task_faults(cur, env->dst_nid);
> + imp += task_weight(cur, env->src_nid) +
> + group_weight(cur, env->src_nid) -
> + task_weight(cur, env->dst_nid) -
> + group_weight(cur, env->dst_nid);
> }
>
> trace_printk("compare[%d] task:%s/%d improvement: %ld\n",
> @@ -1051,7 +1110,7 @@ static int task_numa_migrate(struct task_struct *p)
> .best_cpu = -1
> };
> struct sched_domain *sd;
> - unsigned long faults;
> + unsigned long weight;
> int nid, cpu, ret;
>
> /*
> @@ -1067,7 +1126,7 @@ static int task_numa_migrate(struct task_struct *p)
> }
> rcu_read_unlock();
>
> - faults = task_faults(p, env.src_nid);
> + weight = task_weight(p, env.src_nid) + group_weight(p, env.src_nid);
> update_numa_stats(&env.src_stats, env.src_nid);
>
> for_each_online_node(nid) {
> @@ -1076,7 +1135,7 @@ static int task_numa_migrate(struct task_struct *p)
> if (nid == env.src_nid)
> continue;
>
> - imp = task_faults(p, nid) - faults;
> + imp = task_weight(p, nid) + group_weight(p, nid) - weight;
> if (imp < 0)
> continue;
>
> @@ -1122,21 +1181,10 @@ static void numa_migrate_preferred(struct task_struct *p)
> p->numa_migrate_retry = jiffies + HZ/10;
> }
>
> -struct numa_group {
> - atomic_t refcount;
> -
> - spinlock_t lock; /* nr_tasks, tasks */
> - int nr_tasks;
> - struct list_head task_list;
> -
> - struct rcu_head rcu;
> - atomic_long_t faults[0];
> -};
> -
> static void task_numa_placement(struct task_struct *p)
> {
> - int seq, nid, max_nid = -1;
> - unsigned long max_faults = 0;
> + int seq, nid, max_nid = -1, max_group_nid = -1;
> + unsigned long max_faults = 0, max_group_faults = 0;
>
> seq = ACCESS_ONCE(p->mm->numa_scan_seq);
> if (p->numa_scan_seq == seq)
> @@ -1148,7 +1196,7 @@ static void task_numa_placement(struct task_struct *p)
>
> /* Find the node with the highest number of faults */
> for (nid = 0; nid < nr_node_ids; nid++) {
> - unsigned long faults = 0;
> + unsigned long faults = 0, group_faults = 0;
> int priv, i;
>
> for (priv = 0; priv < 2; priv++) {
> @@ -1161,6 +1209,7 @@ static void task_numa_placement(struct task_struct *p)
> /* Decay existing window, copy faults since last scan */
> p->numa_faults[i] >>= 1;
> p->numa_faults[i] += p->numa_faults_buffer[i];
> + p->total_numa_faults += p->numa_faults_buffer[i];
> p->numa_faults_buffer[i] = 0;
>
> diff += p->numa_faults[i];
> @@ -1169,6 +1218,8 @@ static void task_numa_placement(struct task_struct *p)
> if (p->numa_group) {
> /* safe because we can only change our own group */
> atomic_long_add(diff, &p->numa_group->faults[i]);
> + atomic_long_add(diff, &p->numa_group->total_faults);
> + group_faults += atomic_long_read(&p->numa_group->faults[i]);
> }
> }
>
> @@ -1176,11 +1227,29 @@ static void task_numa_placement(struct task_struct *p)
> max_faults = faults;
> max_nid = nid;
> }
> +
> + if (group_faults > max_group_faults) {
> + max_group_faults = group_faults;
> + max_group_nid = nid;
> + }
> }
>
> if (sched_feat(NUMA_INTERLEAVE))
> task_numa_mempol(p, max_faults);
>
> + /*
> + * Should we stay on our own, or move in with the group?
> + * If the task's memory accesses are concentrated on one node, go
> + * to (more likely, stay on) that node. If the group's accesses
> + * are more concentrated than the task's accesses, join the group.
> + *
> + * max_group_faults max_faults
> + * ------------------ > ------------
> + * total_group_faults total_faults
> + */
> + if (group_weight(p, max_group_nid) > task_weight(p, max_nid))
> + max_nid = max_group_nid;
> +
> /* Preferred node as the node with the most faults */
> if (max_faults && max_nid != p->numa_preferred_nid) {
>
> @@ -1242,6 +1311,7 @@ void task_numa_group(struct task_struct *p, int cpu, int pid)
> atomic_set(&grp->refcount, 1);
> spin_lock_init(&grp->lock);
> INIT_LIST_HEAD(&grp->task_list);
> + atomic_long_set(&grp->total_faults, 0);
>
> spin_lock(&p->numa_lock);
> list_add(&p->numa_entry, &grp->task_list);
> @@ -1336,6 +1406,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
>
> BUG_ON(p->numa_faults_buffer);
> p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
> + p->total_numa_faults = 0;
> }
>
> /*
>
> .
>
OK, here's one that actually works and doesn't have magical crashes.
---
Subject: mm, sched, numa: Create a per-task MPOL_INTERLEAVE policy
From: Peter Zijlstra <[email protected]>
Date: Mon Jul 22 10:42:38 CEST 2013
For those tasks belonging to groups that span nodes -- for whatever
reason -- we want to interleave their memory allocations to minimize
their performance penalty.
There's a subtlety to interleaved memory allocations though: once you
establish an interleave mask, a measurement of where the actual memory
is will be completely flat across those nodes. Therefore we'll never
actually shrink the interleave mask, even if at some point all tasks
can/do run on a single node again.
To fix this issue, change the accounting so that when we find a page
that is part of the interleave set, we still account the fault against
the current node, not the node the page is really at.
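Roughly, the accounting rule is the following (a sketch only, using the
kernel nodemask helpers; the real logic is the MPOL_INTERLEAVE case added
to mpol_misplaced() in this patch):
/*
 * Sketch of the accounting decision described above; not the real code.
 * Relies on linux/nodemask.h for nodemask_t and node_isset().
 */
static int numa_account_nid(int page_nid, int this_nid, nodemask_t *il_mask)
{
	/*
	 * If the node we are running on is part of the interleave mask,
	 * account the hinting fault against the current node so the
	 * per-node fault statistics can still converge and eventually
	 * allow the mask to shrink.
	 */
	if (node_isset(this_nid, *il_mask))
		return this_nid;
	/* Otherwise account against the node the page really lives on. */
	return page_nid;
}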
Finally, simplify the 'default' numa policy. It used a per-node array
of preferred-node policies and always picked the current node; this can
be written with a single MPOL_F_LOCAL policy.
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/mempolicy.h | 5 +-
kernel/sched/fair.c | 56 +++++++++++++++++++++++++++
kernel/sched/features.h | 1
mm/huge_memory.c | 28 +++++++------
mm/memory.c | 33 ++++++++++-----
mm/mempolicy.c | 95 +++++++++++++++++++++++++++++-----------------
6 files changed, 158 insertions(+), 60 deletions(-)
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -60,6 +60,7 @@ struct mempolicy {
* The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
*/
+extern struct mempolicy *__mpol_new(unsigned short, unsigned short);
extern void __mpol_put(struct mempolicy *pol);
static inline void mpol_put(struct mempolicy *pol)
{
@@ -187,7 +188,7 @@ static inline int vma_migratable(struct
return 1;
}
-extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long, int *);
#else
@@ -307,7 +308,7 @@ static inline int mpol_to_str(char *buff
}
static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
- unsigned long address)
+ unsigned long address, int *account_node)
{
return -1; /* no node preference */
}
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -893,6 +893,59 @@ static inline unsigned long task_faults(
return p->numa_faults[2*nid] + p->numa_faults[2*nid+1];
}
+/*
+ * Create/Update p->mempolicy MPOL_INTERLEAVE to match p->numa_faults[].
+ */
+static void task_numa_mempol(struct task_struct *p, long max_faults)
+{
+ struct mempolicy *pol = p->mempolicy, *new = NULL;
+ nodemask_t nodes = NODE_MASK_NONE;
+ int node;
+
+ if (!max_faults)
+ return;
+
+ if (!pol) {
+ new = __mpol_new(MPOL_INTERLEAVE, MPOL_F_MOF | MPOL_F_MORON);
+ if (IS_ERR(new))
+ return;
+ }
+
+ task_lock(p);
+
+ pol = p->mempolicy; /* lock forces a re-read */
+ if (!pol)
+ pol = new;
+
+ if (!(pol->flags & MPOL_F_MORON))
+ goto unlock;
+
+ for_each_node(node) {
+ if (task_faults(p, node) > max_faults/2)
+ node_set(node, nodes);
+ }
+
+ if (pol == new) {
+ /*
+ * XXX 'borrowed' from do_set_mempolicy()
+ */
+ pol->v.nodes = nodes;
+ p->mempolicy = pol;
+ p->flags |= PF_MEMPOLICY;
+ p->il_next = first_node(nodes);
+ new = NULL;
+ } else {
+ mpol_rebind_task(p, &nodes, MPOL_REBIND_STEP1);
+ mpol_rebind_task(p, &nodes, MPOL_REBIND_STEP2);
+ }
+
+unlock:
+ task_unlock(p);
+
+ if (new)
+ __mpol_put(new);
+}
+
static unsigned long weighted_cpuload(const int cpu);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
@@ -1106,6 +1159,9 @@ static void task_numa_placement(struct t
}
}
+ if (sched_feat(NUMA_INTERLEAVE))
+ task_numa_mempol(p, max_faults);
+
/* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -72,4 +72,5 @@ SCHED_FEAT(NUMA_FORCE, false)
SCHED_FEAT(NUMA_BALANCE, true)
SCHED_FEAT(NUMA_FAULTS_UP, true)
SCHED_FEAT(NUMA_FAULTS_DOWN, true)
+SCHED_FEAT(NUMA_INTERLEAVE, false)
#endif
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1292,7 +1292,7 @@ int do_huge_pmd_numa_page(struct mm_stru
{
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
- int page_nid = -1, this_nid = numa_node_id();
+ int page_nid = -1, account_nid = -1, this_nid = numa_node_id();
int target_nid, last_nidpid;
bool migrated = false;
@@ -1301,7 +1301,6 @@ int do_huge_pmd_numa_page(struct mm_stru
goto out_unlock;
page = pmd_page(pmd);
- get_page(page);
/*
* Do not account for faults against the huge zero page. The read-only
@@ -1317,13 +1316,12 @@ int do_huge_pmd_numa_page(struct mm_stru
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
last_nidpid = page_nidpid_last(page);
- target_nid = mpol_misplaced(page, vma, haddr);
- if (target_nid == -1) {
- put_page(page);
+ target_nid = mpol_misplaced(page, vma, haddr, &account_nid);
+ if (target_nid == -1)
goto clear_pmdnuma;
- }
/* Acquire the page lock to serialise THP migrations */
+ get_page(page);
spin_unlock(&mm->page_table_lock);
lock_page(page);
@@ -1332,6 +1330,7 @@ int do_huge_pmd_numa_page(struct mm_stru
if (unlikely(!pmd_same(pmd, *pmdp))) {
unlock_page(page);
put_page(page);
+ account_nid = page_nid = -1; /* someone else took our fault */
goto out_unlock;
}
spin_unlock(&mm->page_table_lock);
@@ -1339,17 +1338,20 @@ int do_huge_pmd_numa_page(struct mm_stru
/* Migrate the THP to the requested node */
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
- if (migrated)
- page_nid = target_nid;
- else
+ if (!migrated) {
+ account_nid = -1; /* account against the old page */
goto check_same;
+ }
+ page_nid = target_nid;
goto out;
check_same:
spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(pmd, *pmdp)))
+ if (unlikely(!pmd_same(pmd, *pmdp))) {
+ page_nid = -1; /* someone else took our fault */
goto out_unlock;
+ }
clear_pmdnuma:
pmd = pmd_mknonnuma(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
@@ -1359,8 +1361,10 @@ int do_huge_pmd_numa_page(struct mm_stru
spin_unlock(&mm->page_table_lock);
out:
- if (page_nid != -1)
- task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
+ if (account_nid == -1)
+ account_nid = page_nid;
+ if (account_nid != -1)
+ task_numa_fault(last_nidpid, account_nid, HPAGE_PMD_NR, migrated);
return 0;
}
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3529,16 +3529,17 @@ static int do_nonlinear_fault(struct mm_
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}
-int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
- unsigned long addr, int current_nid)
+static int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
+ unsigned long addr, int page_nid,
+ int *account_nid)
{
get_page(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (page_nid == numa_node_id())
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
- return mpol_misplaced(page, vma, addr);
+ return mpol_misplaced(page, vma, addr, account_nid);
}
int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
@@ -3546,7 +3547,7 @@ int do_numa_page(struct mm_struct *mm, s
{
struct page *page = NULL;
spinlock_t *ptl;
- int page_nid = -1;
+ int page_nid = -1, account_nid = -1;
int target_nid, last_nidpid;
bool migrated = false;
@@ -3583,7 +3584,7 @@ int do_numa_page(struct mm_struct *mm, s
last_nidpid = page_nidpid_last(page);
page_nid = page_to_nid(page);
- target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid, &account_nid);
pte_unmap_unlock(ptep, ptl);
if (target_nid == -1) {
put_page(page);
@@ -3596,8 +3597,10 @@ int do_numa_page(struct mm_struct *mm, s
page_nid = target_nid;
out:
- if (page_nid != -1)
- task_numa_fault(last_nidpid, page_nid, 1, migrated);
+ if (account_nid == -1)
+ account_nid = page_nid;
+ if (account_nid != -1)
+ task_numa_fault(last_nidpid, account_nid, 1, migrated);
return 0;
}
@@ -3636,7 +3639,7 @@ static int do_pmd_numa_page(struct mm_st
for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
pte_t pteval = *pte;
struct page *page;
- int page_nid = -1;
+ int page_nid = -1, account_nid = -1;
int target_nid;
bool migrated = false;
@@ -3661,19 +3664,25 @@ static int do_pmd_numa_page(struct mm_st
last_nidpid = page_nidpid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr,
- page_nid);
+ page_nid, &account_nid);
pte_unmap_unlock(pte, ptl);
if (target_nid != -1) {
migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
page_nid = target_nid;
+ else
+ account_nid = -1;
} else {
put_page(page);
}
- if (page_nid != -1)
- task_numa_fault(last_nidpid, page_nid, 1, migrated);
+ if (account_nid == -1)
+ account_nid = page_nid;
+ if (account_nid != -1)
+ task_numa_fault(last_nidpid, account_nid, 1, migrated);
+
+ cond_resched();
pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -118,22 +118,18 @@ static struct mempolicy default_policy =
.flags = MPOL_F_LOCAL,
};
-static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+static struct mempolicy numa_policy = {
+ .refcnt = ATOMIC_INIT(1), /* never free it */
+ .mode = MPOL_PREFERRED,
+ .flags = MPOL_F_LOCAL | MPOL_F_MOF | MPOL_F_MORON,
+};
static struct mempolicy *get_task_policy(struct task_struct *p)
{
struct mempolicy *pol = p->mempolicy;
- int node;
- if (!pol) {
- node = numa_node_id();
- if (node != NUMA_NO_NODE)
- pol = &preferred_node_policy[node];
-
- /* preferred_node_policy is not initialised early in boot */
- if (!pol->mode)
- pol = NULL;
- }
+ if (!pol)
+ pol = &numa_policy;
return pol;
}
@@ -248,6 +244,20 @@ static int mpol_set_nodemask(struct memp
return ret;
}
+struct mempolicy *__mpol_new(unsigned short mode, unsigned short flags)
+{
+ struct mempolicy *policy;
+
+ policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+ if (!policy)
+ return ERR_PTR(-ENOMEM);
+ atomic_set(&policy->refcnt, 1);
+ policy->mode = mode;
+ policy->flags = flags;
+
+ return policy;
+}
+
/*
* This function just creates a new policy, does some check and simple
* initialization. You must invoke mpol_set_nodemask() to set nodes.
@@ -255,8 +265,6 @@ static int mpol_set_nodemask(struct memp
static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
nodemask_t *nodes)
{
- struct mempolicy *policy;
-
pr_debug("setting mode %d flags %d nodes[0] %lx\n",
mode, flags, nodes ? nodes_addr(*nodes)[0] : NUMA_NO_NODE);
@@ -284,14 +292,8 @@ static struct mempolicy *mpol_new(unsign
mode = MPOL_PREFERRED;
} else if (nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
- policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
- if (!policy)
- return ERR_PTR(-ENOMEM);
- atomic_set(&policy->refcnt, 1);
- policy->mode = mode;
- policy->flags = flags;
- return policy;
+ return __mpol_new(mode, flags);
}
/* Slow path of a mpol destructor. */
@@ -2242,12 +2244,13 @@ static void sp_free(struct sp_node *n)
* Policy determination "mimics" alloc_page_vma().
* Called from fault path where we know the vma and faulting address.
*/
-int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr, int *account_node)
{
struct mempolicy *pol;
struct zone *zone;
int curnid = page_to_nid(page);
unsigned long pgoff;
+ int thisnid = numa_node_id();
int polnid = -1;
int ret = -1;
@@ -2269,7 +2272,7 @@ int mpol_misplaced(struct page *page, st
case MPOL_PREFERRED:
if (pol->flags & MPOL_F_LOCAL)
- polnid = numa_node_id();
+ polnid = thisnid;
else
polnid = pol->v.preferred_node;
break;
@@ -2284,7 +2287,7 @@ int mpol_misplaced(struct page *page, st
if (node_isset(curnid, pol->v.nodes))
goto out;
(void)first_zones_zonelist(
- node_zonelist(numa_node_id(), GFP_HIGHUSER),
+ node_zonelist(thisnid, GFP_HIGHUSER),
gfp_zone(GFP_HIGHUSER),
&pol->v.nodes, &zone);
polnid = zone->node;
@@ -2299,8 +2302,7 @@ int mpol_misplaced(struct page *page, st
int last_nidpid;
int this_nidpid;
- polnid = numa_node_id();
- this_nidpid = nid_pid_to_nidpid(polnid, current->pid);;
+ this_nidpid = nid_pid_to_nidpid(thisnid, current->pid);
/*
* Multi-stage node selection is used in conjunction
@@ -2326,6 +2328,40 @@ int mpol_misplaced(struct page *page, st
last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
goto out;
+
+ /*
+ * Preserve interleave pages while allowing useful
+ * ->numa_faults[] statistics.
+ *
+ * When migrating into an interleave set, migrate to
+ * the correct interleaved node but account against the
+ * current node (where the task is running).
+ *
+ * Not doing this would result in ->numa_faults[] being
+ * flat across the interleaved nodes, making it
+ * impossible to shrink the node list even when all
+ * tasks are running on a single node.
+ *
+ * src dst migrate account
+ * 0 0 -- this_node $page_node
+ * 0 1 -- policy_node this_node
+ * 1 0 -- this_node $page_node
+ * 1 1 -- policy_node this_node
+ *
+ */
+ switch (pol->mode) {
+ case MPOL_INTERLEAVE:
+ if (node_isset(thisnid, pol->v.nodes)) {
+ if (account_node)
+ *account_node = thisnid;
+ break;
+ }
+ /* fall-through for nodes outside the set */
+
+ default:
+ polnid = thisnid;
+ break;
+ }
}
if (curnid != polnid)
@@ -2588,15 +2624,6 @@ void __init numa_policy_init(void)
sizeof(struct sp_node),
0, SLAB_PANIC, NULL);
- for_each_node(nid) {
- preferred_node_policy[nid] = (struct mempolicy) {
- .refcnt = ATOMIC_INIT(1),
- .mode = MPOL_PREFERRED,
- .flags = MPOL_F_MOF | MPOL_F_MORON,
- .v = { .preferred_node = nid, },
- };
- }
-
/*
* Set interleaving policy for system init. Interleaving is only
* enabled across suitably sized nodes (default is >= 16MB), or
On Mon, Aug 26, 2013 at 06:10:27PM +0200, Peter Zijlstra wrote:
> + if (pol == new) {
> + /*
> + * XXX 'borrowed' from do_set_mempolicy()
This should probably also say something like:
/*
* This is safe without holding mm->mmap_sem for show_numa_map()
* because this is only used for a NULL->pol transition, not
* pol1->pol2 transitions.
*/
> + */
> + pol->v.nodes = nodes;
> + p->mempolicy = pol;
> + p->flags |= PF_MEMPOLICY;
> + p->il_next = first_node(nodes);
> + new = NULL;
> + } else {
On Fri, Aug 02, 2013 at 06:47:15PM +0200, Peter Zijlstra wrote:
> Subject: sched, numa: Use {cpu, pid} to create task groups for shared faults
> From: Peter Zijlstra <[email protected]>
> Date: Tue Jul 30 10:40:20 CEST 2013
>
> A very simple/straight forward shared fault task grouping
> implementation.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
So Rik and I found a possible issue with this -- although in the end it
turned out to be a userspace 'feature' instead.
It might be possible for a COW page to be 'shared' and thus get a
last_cpupid set from another process. When we break COW and reuse the
page, the now private and writable page might still have this last_cpupid
and thus cause a shared fault and form grouping.
Something like the below resets the last_cpupid field on reuse, much like
fresh COW copies have.
There might be something that avoids the above scenario but I'm too
tired to come up with anything.
---
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2730,6 +2730,9 @@ static int do_wp_page(struct mm_struct *
get_page(dirty_page);
reuse:
+ if (old_page)
+ page_cpupid_xchg_last(old_page, (1 << LAST_CPUPID_SHIFT) - 1);
+
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = pte_mkyoung(orig_pte);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
On 08/28/2013 12:41 PM, Peter Zijlstra wrote:
> On Fri, Aug 02, 2013 at 06:47:15PM +0200, Peter Zijlstra wrote:
>> Subject: sched, numa: Use {cpu, pid} to create task groups for shared faults
>> From: Peter Zijlstra <[email protected]>
>> Date: Tue Jul 30 10:40:20 CEST 2013
>>
>> A very simple/straight forward shared fault task grouping
>> implementation.
>>
>> Signed-off-by: Peter Zijlstra <[email protected]>
>
> So Rik and I found a possible issue with this -- although in the end it
> turned out to be a userspace 'feature' instead.
>
> It might be possible for a COW page to be 'shared' and thus get a
> last_cpupid set from another process. When we break COW and reuse the
> page, the now private and writable page might still have this
> last_cpupid and thus cause a shared fault and form grouping.
>
> Something like the below resets the last_cpupid field on reuse, much
> like fresh COW copies have.
>
> There might be something that avoids the above scenario but I'm too
> tired to come up with anything.
I believe this is a real bug.
It can be avoided by either -1ing out the cpupid like you do, or
using the current process's cpupid, when we re-use an old page
in do_wp_page.
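For completeness, that second option would look roughly like this in
do_wp_page() (a sketch only; it assumes a cpu_pid_to_cpupid()-style helper
for building the cpupid value, which may be named differently in this
tree):
	/*
	 * Sketch of the alternative: instead of the "unset" sentinel,
	 * stamp the reused COW page with the current task's cpu/pid.
	 * cpu_pid_to_cpupid() is assumed to exist here.
	 */
	if (old_page)
		page_cpupid_xchg_last(old_page,
			cpu_pid_to_cpupid(raw_smp_processor_id(),
					  current->pid));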
Acked-by: Rik van Riel <[email protected]>
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2730,6 +2730,9 @@ static int do_wp_page(struct mm_struct *
> get_page(dirty_page);
>
> reuse:
> + if (old_page)
> + page_cpupid_xchg_last(old_page, (1 << LAST_CPUPID_SHIFT) - 1);
> +
> flush_cache_page(vma, address, pte_pfn(orig_pte));
> entry = pte_mkyoung(orig_pte);
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>
--
All rights reversed