2012-11-06 09:22:17

by Mel Gorman

Subject: [RFC PATCH 00/19] Foundation for automatic NUMA balancing

There are currently two competing approaches to implementing support for
automatically migrating pages to optimise NUMA locality. Performance results
are available for both, but review highlighted different problems in each.
They are not compatible with each other even though some of the fundamental
mechanics should have been the same.

For example, schednuma implements many of its optimisations before the code
that benefits most from those optimisations is introduced, obscuring what the
cost of schednuma might be and whether the optimisations can be used elsewhere
independent of the series. It also effectively hard-codes PROT_NONE as the
hinting fault mechanism even though that should be an architecture-specific
decision. On the other hand, it is well integrated and does all of its work
in the context of the process that benefits from the migration.

autonuma goes straight to kernel threads for marking PTEs pte_numa to
capture the statistics it depends on. This obscures the cost of
autonuma in a manner that is difficult to measure and hard to retrofit
into the context of the process. Some of these costs are in paths the
scheduler folk are traditionally very wary of making heavier, particularly
if that cost is difficult to measure. On the other hand, performance
tests indicate it is the best performing solution.

As the patch sets do not share any code, it is difficult to incrementally
develop one to take advantage of the strengths of the other. Many of the
patches would be code churn that is annoying to review and fairly measuring
the results would be problematic.

This series addresses part of the integration and sharing problem by
implementing a foundation that either the policy for schednuma or autonuma
can be rebased on. The actual policy it implements is a very stupid
greedy policy called "Migrate On Reference Of pte_numa Node (MORON)".
While stupid, it can be faster than the vanilla kernel and the expectation
is that any clever policy should be able to beat MORON. The advantage is
that it still defines how a policy needs to hook into the core code --
mostly the scheduler and mempolicy -- so that many optimisations (such as
native THP migration) can be shared between different policy implementations.
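
As an illustration only, the MORON decision amounts to something like the
sketch below. This is not the actual patch 17 and the helper name is made
up; it assumes kernel context where numa_node_id() and page_to_nid() are
available.

  /*
   * Illustrative sketch of the MORON decision: on a NUMA hinting fault,
   * greedily move the page towards the node of the CPU that referenced it.
   */
  static int moron_target_node(struct page *page)
  {
          int this_nid = numa_node_id();  /* node of the faulting CPU */

          if (page_to_nid(page) == this_nid)
                  return -1;              /* already local, leave it alone */

          return this_nid;                /* migrate towards the referencing node */
  }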

This series steals very heavily from both autonuma and schednuma with very
little original code. In some cases I removed the signed-off-bys because
the result was too different. I have noted in the changelog where this
happened but the signed-offs can be restored if the original authors agree.

Patches 1-3 move some vmstat counters so that migrated pages get accounted
for. In the past the primary user of migration was compaction but
if pages are to migrate for NUMA optimisation then the counters
need to be generally useful.

Patch 4 defines an arch-specific PTE bit called _PAGE_NUMA that is used
to trigger faults later in the series. A placement policy is expected
to use these faults to determine if a page should migrate. On x86,
the bit is the same as _PAGE_PROTNONE but other architectures
may differ.

Patches 5-7 define pte_numa, pmd_numa, pte_mknuma, pte_mknonnuma and
friends, implement them for x86, handle GUP and preserve
the _PAGE_NUMA bit across THP splits.

Patch 8 creates the fault handler for p[te|md]_numa PTEs and just clears
them again.

Patches 9-11 add a migrate-on-fault mode that applications can specifically
ask for. Applications can take advantage of this if they wish. It
also means that if automatic balancing were broken for some workload,
the application could disable the automatic stuff but still
get some of the benefit.

Patch 12 adds migrate_misplaced_page which is responsible for migrating
a page to a new location.

Patch 13 migrates the page on fault if mpol_misplaced() says to do so.

Patch 14 adds a MPOL_MF_LAZY mempolicy that an interested application can use.
On the next reference the memory should be migrated to the node that
references the memory.

Patch 15 sets pte_numa within the context of the scheduler.

Patch 16 adds some vmstats that can be used to approximate the cost of the
scheduling policy in a more fine-grained fashion than looking at
the system CPU usage.

Patch 17 implements the MORON policy.

Patches 18-19 note that marking the full address space pte_numa in one go has
a number of disadvantages and instead incrementally update a limited
range of the address space each tick.

The obvious next step is to rebase a proper placement policy on top of this
foundation and compare it to MORON (or any other placement policy). It
should be possible to share optimisations between different policies to
allow meaningful comparisons.

For now, I am going to compare this patchset with the most recent posting
of schednuma and autonuma just to get a feeling for where it stands. I
only ran the autonuma benchmark and specjbb tests.

The baseline kernel has stat patches 1-3 applied.

AUTONUMA BENCH
3.7.0 3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r4 rc2-balancenuma-v1r15
User NUMA01 67145.71 ( 0.00%) 30879.07 ( 54.01%) 61162.81 ( 8.91%) 25274.74 ( 62.36%)
User NUMA01_THEADLOCAL 55104.60 ( 0.00%) 17285.49 ( 68.63%) 17007.21 ( 69.14%) 21067.79 ( 61.77%)
User NUMA02 7074.54 ( 0.00%) 2219.11 ( 68.63%) 2193.59 ( 68.99%) 2157.32 ( 69.51%)
User NUMA02_SMT 2916.86 ( 0.00%) 1027.73 ( 64.77%) 1037.28 ( 64.44%) 1016.54 ( 65.15%)
System NUMA01 42.28 ( 0.00%) 511.37 (-1109.48%) 2872.08 (-6693.00%) 363.56 (-759.89%)
System NUMA01_THEADLOCAL 41.71 ( 0.00%) 183.24 (-339.32%) 185.24 (-344.11%) 329.94 (-691.03%)
System NUMA02 34.67 ( 0.00%) 27.85 ( 19.67%) 21.60 ( 37.70%) 26.74 ( 22.87%)
System NUMA02_SMT 0.89 ( 0.00%) 20.34 (-2185.39%) 5.84 (-556.18%) 19.73 (-2116.85%)
Elapsed NUMA01 1512.97 ( 0.00%) 724.38 ( 52.12%) 1407.59 ( 6.97%) 572.77 ( 62.14%)
Elapsed NUMA01_THEADLOCAL 1264.23 ( 0.00%) 389.51 ( 69.19%) 380.64 ( 69.89%) 486.16 ( 61.54%)
Elapsed NUMA02 181.52 ( 0.00%) 60.65 ( 66.59%) 52.68 ( 70.98%) 66.26 ( 63.50%)
Elapsed NUMA02_SMT 163.59 ( 0.00%) 53.45 ( 67.33%) 48.81 ( 70.16%) 61.42 ( 62.45%)
CPU NUMA01 4440.00 ( 0.00%) 4333.00 ( 2.41%) 4549.00 ( -2.45%) 4476.00 ( -0.81%)
CPU NUMA01_THEADLOCAL 4362.00 ( 0.00%) 4484.00 ( -2.80%) 4516.00 ( -3.53%) 4401.00 ( -0.89%)
CPU NUMA02 3916.00 ( 0.00%) 3704.00 ( 5.41%) 4204.00 ( -7.35%) 3295.00 ( 15.86%)
CPU NUMA02_SMT 1783.00 ( 0.00%) 1960.00 ( -9.93%) 2136.00 (-19.80%) 1687.00 ( 5.38%)

All the automatic placement stuff incurs a high system CPU penalty and
it is not consistent which implementation performs the best. However,
balancenuma does relatively well in terms of system CPU usage even without
any special optimisations such as the TLB flush optimisations. It was
relatively good for NUMA01 but the worst for NUMA01_THREADLOCAL. Glancing
at profiles it looks like mmap_sem contention is a problem but a lot of
samples were measured in intel_idle too. This is a profile excerpt for NUMA01:

samples % image name app name symbol name
341728 17.7499 vmlinux-3.7.0-rc2-balancenuma-v1r15 vmlinux-3.7.0-rc2-balancenuma-v1r15 intel_idle
332454 17.2682 cc1 cc1 /usr/lib64/gcc/x86_64-suse-linux/4.7/cc1
312835 16.2492 vmlinux-3.7.0-rc2-balancenuma-v1r15 vmlinux-3.7.0-rc2-balancenuma-v1r15 mutex_spin_on_owner
78978 4.1022 oprofiled oprofiled /usr/bin/oprofiled
56961 2.9586 vmlinux-3.7.0-rc2-balancenuma-v1r15 vmlinux-3.7.0-rc2-balancenuma-v1r15 native_write_msr_safe
56633 2.9416 vmlinux-3.7.0-rc2-balancenuma-v1r15 vmlinux-3.7.0-rc2-balancenuma-v1r15 update_sd_lb_stats

I haven't investigated in more detail at this point.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r4 rc2-balancenuma-v1r15
User 132248.88 101395.25 158084.98 99810.06
System 120.19 1794.22 6283.60 1634.20
Elapsed 3131.10 2771.13 4068.03 2747.31

Overall elapsed time actually scores balancenuma as the best.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r4 rc2-balancenuma-v1r15
Page Ins 37256 167976 167340 189348
Page Outs 28888 164248 161400 169540
Swap Ins 0 0 0 0
Swap Outs 0 0 0 0
Direct pages scanned 0 0 0 0
Kswapd pages scanned 0 0 0 0
Kswapd pages reclaimed 0 0 0 0
Direct pages reclaimed 0 0 0 0
Kswapd efficiency 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0
Page writes file 0 0 0 0
Page writes anon 0 0 0 0
Page reclaim immediate 0 0 0 0
Page rescued immediate 0 0 0 0
Slabs scanned 0 0 0 0
Direct inode steals 0 0 0 0
Kswapd inode steals 0 0 0 0
Kswapd skipped wait 0 0 0 0
THP fault alloc 17370 31018 22082 28615
THP collapse alloc 6 24869 2 993
THP splits 3 25337 2 15032
THP fault fallback 0 0 0 0
THP collapse fail 0 0 0 0
Compaction stalls 0 0 0 0
Compaction success 0 0 0 0
Compaction failures 0 0 0 0
Page migrate success 0 14450122 870091 6776279
Page migrate failure 0 0 0 0
Compaction pages isolated 0 0 0 0
Compaction migrate scanned 0 0 0 0
Compaction free scanned 0 0 0 0
Compaction cost 0 14999 903 7033
NUMA PTE updates 0 396727 1940174013 386573907
NUMA hint faults 0 28622403 4928759 7887705
NUMA hint local faults 0 20605969 4043237 730296
NUMA pages migrated 0 14450122 870091 6776279
AutoNUMA cost 0 143389 38241 42273

In terms of the estimated cost, balancenuma scored reasonably well on a basic
cost metric. Like autonuma, it is also splitting THPs instead of migrating them.

Next was specjbb. In this case the performance of MORON depends entirely
on scheduling decisions. If the scheduler keeps JVM threads on the same
nodes, it'll do well but as it gives no hints to the scheduler there are
no guarantees. The full report for this is quite long so I'm cutting it
a bit shorter.

SPECJBB BOPS
3.7.0 3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r4 rc2-balancenuma-v1r15
Mean 1 25960.00 ( 0.00%) 24808.25 ( -4.44%) 24876.25 ( -4.17%) 25932.75 ( -0.10%)
Mean 2 53997.50 ( 0.00%) 55949.25 ( 3.61%) 51358.50 ( -4.89%) 53729.25 ( -0.50%)
Mean 3 78454.25 ( 0.00%) 83204.50 ( 6.05%) 74280.75 ( -5.32%) 77932.25 ( -0.67%)
Mean 4 101131.25 ( 0.00%) 108606.75 ( 7.39%) 100828.50 ( -0.30%) 100058.75 ( -1.06%)
Mean 5 120807.00 ( 0.00%) 131488.25 ( 8.84%) 118191.00 ( -2.17%) 120264.75 ( -0.45%)
Mean 6 135793.50 ( 0.00%) 154615.75 ( 13.86%) 132698.75 ( -2.28%) 138114.25 ( 1.71%)
Mean 7 137686.75 ( 0.00%) 159637.75 ( 15.94%) 135343.25 ( -1.70%) 138525.00 ( 0.61%)
Mean 8 135802.25 ( 0.00%) 161599.50 ( 19.00%) 138071.75 ( 1.67%) 139256.50 ( 2.54%)
Mean 9 129194.00 ( 0.00%) 162968.50 ( 26.14%) 137107.25 ( 6.13%) 131907.00 ( 2.10%)
Mean 10 125457.00 ( 0.00%) 160352.25 ( 27.81%) 134933.50 ( 7.55%) 128257.75 ( 2.23%)
Mean 11 121733.75 ( 0.00%) 155280.50 ( 27.56%) 135810.00 ( 11.56%) 113742.25 ( -6.56%)
Mean 12 110556.25 ( 0.00%) 149744.50 ( 35.45%) 140871.00 ( 27.42%) 110366.00 ( -0.17%)
Mean 13 107484.75 ( 0.00%) 146110.25 ( 35.94%) 128493.00 ( 19.55%) 107018.50 ( -0.43%)
Mean 14 105733.00 ( 0.00%) 141589.25 ( 33.91%) 122834.50 ( 16.17%) 111093.50 ( 5.07%)
Mean 15 104492.00 ( 0.00%) 139034.25 ( 33.06%) 116800.75 ( 11.78%) 111163.25 ( 6.38%)
Mean 16 103312.75 ( 0.00%) 136828.50 ( 32.44%) 114710.25 ( 11.03%) 109039.75 ( 5.54%)
Mean 17 101999.25 ( 0.00%) 135627.25 ( 32.97%) 112106.75 ( 9.91%) 107185.00 ( 5.08%)
Mean 18 100107.75 ( 0.00%) 134610.50 ( 34.47%) 105763.50 ( 5.65%) 101597.50 ( 1.49%)
Stddev 1 928.73 ( 0.00%) 631.50 ( 32.00%) 668.62 ( 28.01%) 744.53 ( 19.83%)
Stddev 2 882.50 ( 0.00%) 732.74 ( 16.97%) 599.58 ( 32.06%) 1090.89 (-23.61%)
Stddev 3 1374.38 ( 0.00%) 778.22 ( 43.38%) 1114.44 ( 18.91%) 926.30 ( 32.60%)
Stddev 4 1051.34 ( 0.00%) 1338.16 (-27.28%) 636.17 ( 39.49%) 1058.94 ( -0.72%)
Stddev 5 620.49 ( 0.00%) 591.76 ( 4.63%) 1412.99 (-127.72%) 1089.88 (-75.65%)
Stddev 6 1088.39 ( 0.00%) 504.34 ( 53.66%) 1749.26 (-60.72%) 1437.91 (-32.11%)
Stddev 7 4369.58 ( 0.00%) 685.85 ( 84.30%) 2099.44 ( 51.95%) 1234.64 ( 71.74%)
Stddev 8 6533.31 ( 0.00%) 213.43 ( 96.73%) 1727.73 ( 73.56%) 6133.56 ( 6.12%)
Stddev 9 949.54 ( 0.00%) 2030.71 (-113.86%) 2148.63 (-126.28%) 3050.78 (-221.29%)
Stddev 10 2452.75 ( 0.00%) 4121.15 (-68.02%) 2141.49 ( 12.69%) 6328.60 (-158.02%)
Stddev 11 3093.48 ( 0.00%) 6584.90 (-112.86%) 3007.52 ( 2.78%) 5632.18 (-82.07%)
Stddev 12 2352.98 ( 0.00%) 8414.96 (-257.63%) 7615.28 (-223.64%) 4822.33 (-104.95%)
Stddev 13 2773.86 ( 0.00%) 9776.25 (-252.44%) 7559.97 (-172.54%) 5538.51 (-99.67%)
Stddev 14 2581.31 ( 0.00%) 8301.74 (-221.61%) 7714.73 (-198.87%) 3218.30 (-24.68%)
Stddev 15 2641.95 ( 0.00%) 8175.16 (-209.44%) 7929.36 (-200.13%) 3243.36 (-22.76%)
Stddev 16 2613.22 ( 0.00%) 8178.51 (-212.97%) 6375.95 (-143.99%) 3131.85 (-19.85%)
Stddev 17 2062.55 ( 0.00%) 8172.20 (-296.22%) 4925.07 (-138.79%) 4172.83 (-102.31%)
Stddev 18 2558.89 ( 0.00%) 9572.40 (-274.08%) 3663.78 (-43.18%) 5086.46 (-98.78%)
TPut 1 103840.00 ( 0.00%) 99233.00 ( -4.44%) 99505.00 ( -4.17%) 103731.00 ( -0.10%)
TPut 2 215990.00 ( 0.00%) 223797.00 ( 3.61%) 205434.00 ( -4.89%) 214917.00 ( -0.50%)
TPut 3 313817.00 ( 0.00%) 332818.00 ( 6.05%) 297123.00 ( -5.32%) 311729.00 ( -0.67%)
TPut 4 404525.00 ( 0.00%) 434427.00 ( 7.39%) 403314.00 ( -0.30%) 400235.00 ( -1.06%)
TPut 5 483228.00 ( 0.00%) 525953.00 ( 8.84%) 472764.00 ( -2.17%) 481059.00 ( -0.45%)
TPut 6 543174.00 ( 0.00%) 618463.00 ( 13.86%) 530795.00 ( -2.28%) 552457.00 ( 1.71%)
TPut 7 550747.00 ( 0.00%) 638551.00 ( 15.94%) 541373.00 ( -1.70%) 554100.00 ( 0.61%)
TPut 8 543209.00 ( 0.00%) 646398.00 ( 19.00%) 552287.00 ( 1.67%) 557026.00 ( 2.54%)
TPut 9 516776.00 ( 0.00%) 651874.00 ( 26.14%) 548429.00 ( 6.13%) 527628.00 ( 2.10%)
TPut 10 501828.00 ( 0.00%) 641409.00 ( 27.81%) 539734.00 ( 7.55%) 513031.00 ( 2.23%)
TPut 11 486935.00 ( 0.00%) 621122.00 ( 27.56%) 543240.00 ( 11.56%) 454969.00 ( -6.56%)
TPut 12 442225.00 ( 0.00%) 598978.00 ( 35.45%) 563484.00 ( 27.42%) 441464.00 ( -0.17%)
TPut 13 429939.00 ( 0.00%) 584441.00 ( 35.94%) 513972.00 ( 19.55%) 428074.00 ( -0.43%)
TPut 14 422932.00 ( 0.00%) 566357.00 ( 33.91%) 491338.00 ( 16.17%) 444374.00 ( 5.07%)
TPut 15 417968.00 ( 0.00%) 556137.00 ( 33.06%) 467203.00 ( 11.78%) 444653.00 ( 6.38%)
TPut 16 413251.00 ( 0.00%) 547314.00 ( 32.44%) 458841.00 ( 11.03%) 436159.00 ( 5.54%)
TPut 17 407997.00 ( 0.00%) 542509.00 ( 32.97%) 448427.00 ( 9.91%) 428740.00 ( 5.08%)
TPut 18 400431.00 ( 0.00%) 538442.00 ( 34.47%) 423054.00 ( 5.65%) 406390.00 ( 1.49%)

As before autonuma is the best overall. MORON is not great but it is
not terrible either. Where it regresses against the vanilla kernel, the
regressions are marginal and for larger numbers of warehouses it gets some
of the gains of schednuma.

SPECJBB PEAKS
3.7.0 3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r4 rc2-balancenuma-v1r15
Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
Expctd Peak Bops 442225.00 ( 0.00%) 598978.00 ( 35.45%) 563484.00 ( 27.42%) 441464.00 ( -0.17%)
Actual Warehouse 7.00 ( 0.00%) 9.00 ( 28.57%) 12.00 ( 71.43%) 8.00 ( 14.29%)
Actual Peak Bops 550747.00 ( 0.00%) 651874.00 ( 18.36%) 563484.00 ( 2.31%) 557026.00 ( 1.14%)

balancenuma sees a marginal improvement and gets about 50% of the performance
gain of schednuma without any optimisation or much in the way of smarts.

MMTests Statistics: duration
3.7.0 3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r4 rc2-balancenuma-v1r15
User 481580.26 957808.42 930687.08 959635.32
System 179.35 1646.94 32799.65 1146.42
Elapsed 10398.85 20775.06 20825.26 20784.14

Here balancenuma clearly wins in terms of System CPU usage even though
it's still a heavy cost. The overhead is less than autonuma and is *WAY*
cheaper than schednuma. As some of autonuma's cost is incurred by kernel
threads that are not captured here, it may be that balancenuma's system
overhead is way lower than both.

MMTests Statistics: vmstat
3.7.0 3.7.0 3.7.0 3.7.0
rc2-stats-v2r1 rc2-autonuma-v27r8 rc2-schednuma-v1r4 rc2-balancenuma-v1r15
Page Ins 33220 157280 157292 160504
Page Outs 111332 246140 259472 221496
Swap Ins 0 0 0 0
Swap Outs 0 0 0 0
Direct pages scanned 0 0 0 0
Kswapd pages scanned 0 0 0 0
Kswapd pages reclaimed 0 0 0 0
Direct pages reclaimed 0 0 0 0
Kswapd efficiency 100% 100% 100% 100%
Kswapd velocity 0.000 0.000 0.000 0.000
Direct efficiency 100% 100% 100% 100%
Direct velocity 0.000 0.000 0.000 0.000
Percentage direct scans 0% 0% 0% 0%
Page writes by reclaim 0 0 0 0
Page writes file 0 0 0 0
Page writes anon 0 0 0 0
Page reclaim immediate 0 0 0 0
Page rescued immediate 0 0 0 0
Slabs scanned 0 0 0 0
Direct inode steals 0 0 0 0
Kswapd inode steals 0 0 0 0
Kswapd skipped wait 0 0 0 0
THP fault alloc 1 2 3 2
THP collapse alloc 0 0 0 0
THP splits 0 13 0 4
THP fault fallback 0 0 0 0
THP collapse fail 0 0 0 0
Compaction stalls 0 0 0 0
Compaction success 0 0 0 0
Compaction failures 0 0 0 0
Page migrate success 0 16818940 760468681 1107531
Page migrate failure 0 0 0 0
Compaction pages isolated 0 0 0 0
Compaction migrate scanned 0 0 0 0
Compaction free scanned 0 0 0 0
Compaction cost 0 17458 789366 1149
NUMA PTE updates 0 1369 21588065110 2846145462
NUMA hint faults 0 4060111612 5807608305 1705913
NUMA hint local faults 0 3780981882 5046837790 493042
NUMA pages migrated 0 16818940 760468681 1107531
AutoNUMA cost 0 20300877 29203606 28473

The estimated cost overhead of balancenuma is way lower than either of
the other implementations.

MORON is a pretty poor placement policy but it should represent a foundation
that either schednuma or a significant chunk of autonuma could be layered
on with common optimisations shared. It's relatively small at about half
the size of schednuma and a third the size of autonuma.

Comments?

arch/sh/mm/Kconfig | 1 +
arch/x86/include/asm/pgtable.h | 65 ++++++-
arch/x86/include/asm/pgtable_types.h | 20 +++
arch/x86/mm/gup.c | 13 +-
include/asm-generic/pgtable.h | 12 ++
include/linux/huge_mm.h | 10 ++
include/linux/mempolicy.h | 8 +
include/linux/migrate.h | 21 ++-
include/linux/mm.h | 3 +
include/linux/mm_types.h | 14 ++
include/linux/sched.h | 22 +++
include/linux/vm_event_item.h | 12 +-
include/trace/events/migrate.h | 51 ++++++
include/uapi/linux/mempolicy.h | 17 +-
init/Kconfig | 14 ++
kernel/sched/core.c | 13 ++
kernel/sched/fair.c | 146 ++++++++++++++++
kernel/sched/features.h | 7 +
kernel/sched/sched.h | 6 +
kernel/sysctl.c | 38 +++-
mm/compaction.c | 15 +-
mm/huge_memory.c | 54 ++++++
mm/memory-failure.c | 3 +-
mm/memory.c | 132 +++++++++++++-
mm/memory_hotplug.c | 3 +-
mm/mempolicy.c | 319 +++++++++++++++++++++++++++++++---
mm/migrate.c | 121 ++++++++++++-
mm/page_alloc.c | 3 +-
mm/vmstat.c | 16 +-
29 files changed, 1104 insertions(+), 55 deletions(-)
create mode 100644 include/trace/events/migrate.h

--
1.7.9.2


2012-11-06 09:15:06

by Mel Gorman

Subject: [PATCH 01/19] mm: compaction: Move migration fail/success stats to migrate.c

The compact_pages_moved and compact_pagemigrate_failed events are
convenient for determining if compaction is active and to what
degree migration is succeeding but it's at the wrong level. Other
users of migration may also want to know if migration is working
properly and this will be particularly true for any automated
NUMA migration. This patch moves the counters down to migration
with the new events called pgmigrate_success and pgmigrate_fail.
The compact_blocks_moved counter is removed because while it was
useful for debugging initially, it's worthless now as no meaningful
conclusions can be drawn from its value.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/vm_event_item.h | 4 +++-
mm/compaction.c | 4 ----
mm/migrate.c | 6 ++++++
mm/vmstat.c | 7 ++++---
4 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 3d31145..8aa7cb9 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,8 +38,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_MIGRATION
+ PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
+#endif
#ifdef CONFIG_COMPACTION
- COMPACTBLOCKS, COMPACTPAGES, COMPACTPAGEFAILED,
COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
#endif
#ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/compaction.c b/mm/compaction.c
index 9eef558..00ad883 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -994,10 +994,6 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
update_nr_listpages(cc);
nr_remaining = cc->nr_migratepages;

- count_vm_event(COMPACTBLOCKS);
- count_vm_events(COMPACTPAGES, nr_migrate - nr_remaining);
- if (nr_remaining)
- count_vm_events(COMPACTPAGEFAILED, nr_remaining);
trace_mm_compaction_migratepages(nr_migrate - nr_remaining,
nr_remaining);

diff --git a/mm/migrate.c b/mm/migrate.c
index 77ed2d7..04687f6 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -962,6 +962,7 @@ int migrate_pages(struct list_head *from,
{
int retry = 1;
int nr_failed = 0;
+ int nr_succeeded = 0;
int pass = 0;
struct page *page;
struct page *page2;
@@ -988,6 +989,7 @@ int migrate_pages(struct list_head *from,
retry++;
break;
case 0:
+ nr_succeeded++;
break;
default:
/* Permanent failure */
@@ -998,6 +1000,10 @@ int migrate_pages(struct list_head *from,
}
rc = 0;
out:
+ if (nr_succeeded)
+ count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
+ if (nr_failed)
+ count_vm_events(PGMIGRATE_FAIL, nr_failed);
if (!swapwrite)
current->flags &= ~PF_SWAPWRITE;

diff --git a/mm/vmstat.c b/mm/vmstat.c
index c737057..89a7fd6 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -774,10 +774,11 @@ const char * const vmstat_text[] = {

"pgrotated",

+#ifdef CONFIG_MIGRATION
+ "pgmigrate_success",
+ "pgmigrate_fail",
+#endif
#ifdef CONFIG_COMPACTION
- "compact_blocks_moved",
- "compact_pages_moved",
- "compact_pagemigrate_failed",
"compact_stall",
"compact_fail",
"compact_success",
--
1.7.9.2

2012-11-06 09:15:32

by Mel Gorman

Subject: [PATCH 05/19] mm: numa: pte_numa() and pmd_numa()

From: Andrea Arcangeli <[email protected]>

Implement pte_numa and pmd_numa.

We must atomically set the numa bit and clear the present bit to
define a pte_numa or pmd_numa.

Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
a thread touches a virtual address in the corresponding virtual range,
a NUMA hinting page fault will trigger. The NUMA hinting page fault
will clear the NUMA bit and set the present bit again to resolve the
page fault.

The expectation is that a NUMA hinting page fault is used as part
of a placement policy that decides if a page should remain on the
current node or migrated to a different node.

Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/include/asm/pgtable.h | 65 ++++++++++++++++++++++++++++++++++++++--
include/asm-generic/pgtable.h | 12 ++++++++
2 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1f780d..e075d57 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -404,7 +404,8 @@ static inline int pte_same(pte_t a, pte_t b)

static inline int pte_present(pte_t a)
{
- return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+ return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+ _PAGE_NUMA);
}

static inline int pte_hidden(pte_t pte)
@@ -420,7 +421,63 @@ static inline int pmd_present(pmd_t pmd)
* the _PAGE_PSE flag will remain set at all times while the
* _PAGE_PRESENT bit is clear).
*/
- return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
+ return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
+ _PAGE_NUMA);
+}
+
+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
+ * same bit too). It's set only when _PAGE_PRESENT is not set and it's
+ * never set if _PAGE_PRESENT is set.
+ *
+ * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
+ * fault triggers on those regions if pte/pmd_numa returns true
+ * (because _PAGE_PRESENT is not set).
+ */
+static inline int pte_numa(pte_t pte)
+{
+ return (pte_flags(pte) &
+ (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+ return (pmd_flags(pmd) &
+ (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+#endif
+
+/*
+ * pte/pmd_mknuma sets the _PAGE_ACCESSED bitflag automatically
+ * because they're called by the NUMA hinting minor page fault. If we
+ * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
+ * would be forced to set it later while filling the TLB after we
+ * return to userland. That would trigger a second write to memory
+ * that we optimize away by setting _PAGE_ACCESSED here.
+ */
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+ pte = pte_clear_flags(pte, _PAGE_NUMA);
+ return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+ pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
+ return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+
+static inline pte_t pte_mknuma(pte_t pte)
+{
+ pte = pte_set_flags(pte, _PAGE_NUMA);
+ return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+ pmd = pmd_set_flags(pmd, _PAGE_NUMA);
+ return pmd_clear_flags(pmd, _PAGE_PRESENT);
}

static inline int pmd_none(pmd_t pmd)
@@ -479,6 +536,10 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)

static inline int pmd_bad(pmd_t pmd)
{
+#ifdef CONFIG_BALANCE_NUMA
+ if (pmd_numa(pmd))
+ return 0;
+#endif
return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
}

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index b36ce40..896667e 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -554,6 +554,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
#endif
}

+#ifndef CONFIG_BALANCE_NUMA
+static inline int pte_numa(pte_t pte)
+{
+ return 0;
+}
+
+static inline int pmd_numa(pmd_t pmd)
+{
+ return 0;
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
#endif /* CONFIG_MMU */

#endif /* !__ASSEMBLY__ */
--
1.7.9.2

2012-11-06 09:15:30

by Mel Gorman

Subject: [PATCH 06/19] mm: numa: teach gup_fast about pmd_numa

From: Andrea Arcangeli <[email protected]>

When scanning pmds, the pmd may be of numa type (_PAGE_PRESENT not set),
however the pte might be present. Therefore, gup_pmd_range() must return
0 in this case to avoid losing a NUMA hinting page fault during gup_fast.

Note: gup_fast will skip over non present ptes (like numa types), so
no explicit check is needed for the pte_numa case. gup_fast will also
skip over THP when the trans huge pmd is non present. So, the pmd_numa
case will also be correctly skipped with no additional code changes
required.

Acked-by: Rik van Riel <[email protected]>
Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/mm/gup.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..02c5ec5 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -163,8 +163,19 @@ static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
* can't because it has irq disabled and
* wait_split_huge_page() would never return as the
* tlb flush IPI wouldn't run.
+ *
+ * The pmd_numa() check is needed because the code
+ * doesn't check the _PAGE_PRESENT bit of the pmd if
+ * the gup_pte_range() path is taken. NOTE: not all
+ * gup_fast users will access the page contents
+ * using the CPU through the NUMA memory channels like
+ * KVM does. So we're forced to trigger NUMA hinting
+ * page faults unconditionally for all gup_fast users
+ * even though NUMA hinting page faults aren't useful
+ * to I/O drivers that will access the page with DMA
+ * and not with the CPU.
*/
- if (pmd_none(pmd) || pmd_trans_splitting(pmd))
+ if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
return 0;
if (unlikely(pmd_large(pmd))) {
if (!gup_huge_pmd(pmd, addr, next, write, pages, nr))
--
1.7.9.2

2012-11-06 09:15:27

by Mel Gorman

Subject: [PATCH 07/19] mm: numa: split_huge_page: transfer the NUMA type from the pmd to the pte

From: Andrea Arcangeli <[email protected]>

When we split a transparent hugepage, transfer the NUMA type from the
pmd to the pte if needed.

Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/huge_memory.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f17c3..3aaf242 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1363,6 +1363,8 @@ static int __split_huge_page_map(struct page *page,
BUG_ON(page_mapcount(page) != 1);
if (!pmd_young(*pmd))
entry = pte_mkold(entry);
+ if (pmd_numa(*pmd))
+ entry = pte_mknuma(entry);
pte = pte_offset_map(&_pmd, haddr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, haddr, pte, entry);
--
1.7.9.2

2012-11-06 09:15:26

by Mel Gorman

Subject: [PATCH 09/19] mm: mempolicy: Make MPOL_LOCAL a real policy

From: Peter Zijlstra <[email protected]>

Make MPOL_LOCAL a real and exposed policy such that applications that
relied on the previous default behaviour can explicitly request it.
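
A hedged usage sketch from userspace, assuming the usual <numaif.h>
prototypes from libnuma:

  #include <stdio.h>
  #include <numaif.h>

  /* Explicitly request the previously implicit local allocation policy */
  int use_local_policy(void)
  {
          if (set_mempolicy(MPOL_LOCAL, NULL, 0) != 0) {
                  perror("set_mempolicy(MPOL_LOCAL)");
                  return -1;
          }
          return 0;
  }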

Requested-by: Christoph Lameter <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 9 ++++++---
2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 23e62e0..3e835c9 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -20,6 +20,7 @@ enum {
MPOL_PREFERRED,
MPOL_BIND,
MPOL_INTERLEAVE,
+ MPOL_LOCAL,
MPOL_MAX, /* always last member of enum */
};

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 66e90ec..54bd3e5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -269,6 +269,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
(flags & MPOL_F_RELATIVE_NODES)))
return ERR_PTR(-EINVAL);
}
+ } else if (mode == MPOL_LOCAL) {
+ if (!nodes_empty(*nodes))
+ return ERR_PTR(-EINVAL);
+ mode = MPOL_PREFERRED;
} else if (nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
@@ -2399,7 +2403,6 @@ void numa_default_policy(void)
* "local" is pseudo-policy: MPOL_PREFERRED with MPOL_F_LOCAL flag
* Used only for mpol_parse_str() and mpol_to_str()
*/
-#define MPOL_LOCAL MPOL_MAX
static const char * const policy_modes[] =
{
[MPOL_DEFAULT] = "default",
@@ -2452,12 +2455,12 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
if (flags)
*flags++ = '\0'; /* terminate mode string */

- for (mode = 0; mode <= MPOL_LOCAL; mode++) {
+ for (mode = 0; mode < MPOL_MAX; mode++) {
if (!strcmp(str, policy_modes[mode])) {
break;
}
}
- if (mode > MPOL_LOCAL)
+ if (mode >= MPOL_MAX)
goto out;

switch (mode) {
--
1.7.9.2

2012-11-06 09:15:23

by Mel Gorman

Subject: [PATCH 08/19] mm: numa: Create basic numa page hinting infrastructure

Note: This patch started as "mm/mpol: Create special PROT_NONE
infrastructure" and preserves the basic idea but steals *very*
heavily from "autonuma: numa hinting page faults entry points" for
the actual fault handlers without the migration parts. The end
result is barely recognisable as either patch so all Signed-off
and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
this version, I will re-add the signed-offs-by to reflect the history.

In order to facilitate a lazy -- fault driven -- migration of pages, create
a special transient PAGE_NUMA variant; we can then use the 'spurious'
protection faults to drive our migrations.

Pages that already had an effective PROT_NONE mapping will not be detected
to generate these 'spurious' faults for the simple reason that we cannot
distinguish them on their protection bits, see pte_numa(). This isn't
a problem since PROT_NONE (and possibly PROT_WRITE with dirty tracking)
aren't used or are rare enough for us to not care about their placement.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/huge_mm.h | 10 +++++
mm/huge_memory.c | 21 ++++++++++
mm/memory.c | 103 +++++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 131 insertions(+), 3 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index b31cb7d..a13ebb1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -159,6 +159,10 @@ static inline struct page *compound_trans_head(struct page *page)
}
return page;
}
+
+extern int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
+ pmd_t pmd, pmd_t *pmdp);
+
#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
#define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -195,6 +199,12 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
{
return 0;
}
+
+static inline int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
+ pmd_t pmd, pmd_t *pmdp)
+{
+	return 0;
+}
+
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */

#endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3aaf242..92a64d2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1017,6 +1017,27 @@ out:
return page;
}

+/* NUMA hinting page fault entry point for trans huge pmds */
+int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
+ pmd_t pmd, pmd_t *pmdp)
+{
+ struct page *page;
+
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(pmd, *pmdp)))
+ goto out_unlock;
+
+ page = pmd_page(pmd);
+ pmd = pmd_mknonnuma(pmd);
+ set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
+ VM_BUG_ON(pmd_numa(*pmdp));
+ update_mmu_cache_pmd(vma, addr, ptep);
+
+out_unlock:
+ spin_unlock(&mm->page_table_lock);
+ return 0;
+}
+
int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
pmd_t *pmd, unsigned long addr)
{
diff --git a/mm/memory.c b/mm/memory.c
index fb135ba..72092d8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3433,6 +3433,94 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return __do_fault(mm, vma, address, pmd, pgoff, flags, orig_pte);
}

+int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
+{
+ struct page *page;
+ spinlock_t *ptl;
+
+ /*
+ * The "pte" at this point cannot be used safely without
+ * validation through pte_unmap_same(). It's of NUMA type but
+ * the pfn may be screwed if the read is non atomic.
+ *
+ * ptep_modify_prot_start is not called as this is clearing
+ * the _PAGE_NUMA bit and it is not really expected that there
+ * would be concurrent hardware modifications to the PTE.
+ */
+ ptl = pte_lockptr(mm, pmd);
+ spin_lock(ptl);
+ if (unlikely(!pte_same(*ptep, pte)))
+ goto out_unlock;
+ pte = pte_mknonnuma(pte);
+ set_pte_at(mm, addr, ptep, pte);
+ page = vm_normal_page(vma, addr, pte);
+ BUG_ON(!page);
+ update_mmu_cache(vma, addr, ptep);
+
+out_unlock:
+ pte_unmap_unlock(ptep, ptl);
+ return 0;
+}
+
+/* NUMA hinting page fault entry point for regular pmds */
+int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pmd_t *pmdp)
+{
+ pmd_t pmd;
+ pte_t *pte, *orig_pte;
+ unsigned long _addr = addr & PMD_MASK;
+ unsigned long offset;
+ spinlock_t *ptl;
+ bool numa = false;
+
+ spin_lock(&mm->page_table_lock);
+ pmd = *pmdp;
+ if (pmd_numa(pmd)) {
+ set_pmd_at(mm, _addr, pmdp, pmd_mknonnuma(pmd));
+ numa = true;
+ }
+ spin_unlock(&mm->page_table_lock);
+
+ if (!numa)
+ return 0;
+
+ /* we're in a page fault so some vma must be in the range */
+ BUG_ON(!vma);
+ BUG_ON(vma->vm_start >= _addr + PMD_SIZE);
+ offset = max(_addr, vma->vm_start) & ~PMD_MASK;
+ VM_BUG_ON(offset >= PMD_SIZE);
+ orig_pte = pte = pte_offset_map_lock(mm, pmdp, _addr, &ptl);
+ pte += offset >> PAGE_SHIFT;
+ for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
+ pte_t pteval = *pte;
+ struct page *page;
+ if (!pte_present(pteval))
+ continue;
+ if (addr >= vma->vm_end) {
+ vma = find_vma(mm, addr);
+ /* there's a pte present so there must be a vma */
+ BUG_ON(!vma);
+ BUG_ON(addr < vma->vm_start);
+ }
+ if (pte_numa(pteval)) {
+ pteval = pte_mknonnuma(pteval);
+ set_pte_at(mm, addr, pte, pteval);
+ }
+ page = vm_normal_page(vma, addr, pteval);
+ if (unlikely(!page))
+ continue;
+ /* only check non-shared pages */
+ if (unlikely(page_mapcount(page) != 1))
+ continue;
+ pte_unmap_unlock(pte, ptl);
+
+ pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+ }
+ pte_unmap_unlock(orig_pte, ptl);
+ return 0;
+}
+
/*
* These routines also need to handle stuff like marking pages dirty
* and/or accessed for architectures that don't do it in hardware (most
@@ -3471,6 +3559,9 @@ int handle_pte_fault(struct mm_struct *mm,
pte, pmd, flags, entry);
}

+ if (pte_numa(entry))
+ return do_numa_page(mm, vma, address, entry, pte, pmd);
+
ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (unlikely(!pte_same(*pte, entry)))
@@ -3539,9 +3630,11 @@ retry:

barrier();
if (pmd_trans_huge(orig_pmd)) {
- if (flags & FAULT_FLAG_WRITE &&
- !pmd_write(orig_pmd) &&
- !pmd_trans_splitting(orig_pmd)) {
+ if (pmd_numa(*pmd))
+ return do_huge_pmd_numa_page(mm, address,
+ orig_pmd, pmd);
+
+ if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
ret = do_huge_pmd_wp_page(mm, vma, address, pmd,
orig_pmd);
/*
@@ -3553,10 +3646,14 @@ retry:
goto retry;
return ret;
}
+
return 0;
}
}

+ if (pmd_numa(*pmd))
+ return do_pmd_numa_page(mm, vma, address, pmd);
+
/*
* Use __pte_alloc instead of pte_alloc_map, because we can't
* run pte_offset_map on the pmd, if an huge pmd could
--
1.7.9.2

2012-11-06 09:15:20

by Mel Gorman

Subject: [PATCH 10/19] mm: mempolicy: Add MPOL_MF_NOOP

From: Lee Schermerhorn <[email protected]>

NOTE: I have not yet addressed my own review feedback of this patch. At
this point I'm trying to construct a baseline tree and will apply
my own review feedback later and then fold it in.

This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy
to mbind(). When the NOOP policy is used with the 'MOVE' and 'LAZY'
flags, mbind() will map the pages PROT_NONE so that they will be
migrated on the next touch.

This allows an application to prepare for a new phase of operation
where different regions of shared storage will be assigned to
worker threads, w/o changing policy. Note that we could just use
"default" policy in this case. However, this also allows an
application to request that pages be migrated, only if necessary,
to follow any arbitrary policy that might currently apply to a
range of pages, without knowing the policy, or without specifying
multiple mbind()s for ranges with different policies.
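
A hedged usage sketch of what is described above. MPOL_MF_LAZY is
introduced later in the series, so both it and MPOL_NOOP would come from
the updated kernel headers rather than a released libnuma:

  #include <numaif.h>

  /*
   * Ask for lazy migration of an existing range on next touch without
   * changing whatever policy currently applies to it.
   */
  static int make_range_lazy(void *addr, unsigned long len)
  {
          return mbind(addr, len, MPOL_NOOP, NULL, 0,
                       MPOL_MF_MOVE | MPOL_MF_LAZY);
  }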

[ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ]

Reported-by: Fengguang Wu <[email protected]>
Signed-off-by: Lee Schermerhorn <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 11 ++++++-----
2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 3e835c9..d23dca8 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -21,6 +21,7 @@ enum {
MPOL_BIND,
MPOL_INTERLEAVE,
MPOL_LOCAL,
+ MPOL_NOOP, /* retain existing policy for range */
MPOL_MAX, /* always last member of enum */
};

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 54bd3e5..c21e914 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -251,10 +251,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
pr_debug("setting mode %d flags %d nodes[0] %lx\n",
mode, flags, nodes ? nodes_addr(*nodes)[0] : -1);

- if (mode == MPOL_DEFAULT) {
+ if (mode == MPOL_DEFAULT || mode == MPOL_NOOP) {
if (nodes && !nodes_empty(*nodes))
return ERR_PTR(-EINVAL);
- return NULL; /* simply delete any existing policy */
+ return NULL;
}
VM_BUG_ON(!nodes);

@@ -1147,7 +1147,7 @@ static long do_mbind(unsigned long start, unsigned long len,
if (start & ~PAGE_MASK)
return -EINVAL;

- if (mode == MPOL_DEFAULT)
+ if (mode == MPOL_DEFAULT || mode == MPOL_NOOP)
flags &= ~MPOL_MF_STRICT;

len = (len + PAGE_SIZE - 1) & PAGE_MASK;
@@ -2409,7 +2409,8 @@ static const char * const policy_modes[] =
[MPOL_PREFERRED] = "prefer",
[MPOL_BIND] = "bind",
[MPOL_INTERLEAVE] = "interleave",
- [MPOL_LOCAL] = "local"
+ [MPOL_LOCAL] = "local",
+ [MPOL_NOOP] = "noop", /* should not actually be used */
};


@@ -2460,7 +2461,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
break;
}
}
- if (mode >= MPOL_MAX)
+ if (mode >= MPOL_MAX || mode == MPOL_NOOP)
goto out;

switch (mode) {
--
1.7.9.2

2012-11-06 09:17:22

by Mel Gorman

Subject: [PATCH 18/19] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate

From: Peter Zijlstra <[email protected]>

Note: The scan period is much larger than it was in the original patch.
The reason is that the system CPU usage went through the roof
with a sample period of 100ms, and it was unsuitable to have a
situation where a large process could stall for an excessively long
time updating pte_numa. This may need to be tuned again if a placement
policy converges too slowly.

Previously, to probe the working set of a task, we'd use
a very simple and crude method: mark all of its address
space PROT_NONE.

That method has various (obvious) disadvantages:

- it samples the working set at dissimilar rates,
giving some tasks a sampling quality advantage
over others.

- creates performance problems for tasks with very
large working sets

- over-samples processes with large address spaces but
which only very rarely execute

Improve that method by keeping a rotating offset into the
address space that marks the current position of the scan,
and advance it by a constant rate (in a CPU cycles execution
proportional manner). If the offset reaches the last mapped
address of the mm then it starts over at the first address.

The per-task nature of the working set sampling functionality in this tree
allows such constant rate, per task, execution-weight proportional sampling
of the working set, with an adaptive sampling interval/frequency that
goes from once per 2 seconds up to just once per 32 seconds. The current
sampling volume is 256 MB per interval.

As tasks mature and converge their working set, so does the
sampling rate slow down to just a trickle, 256 MB per 8
seconds of CPU time executed.

This, beyond being adaptive, also rate-limits rarely
executing systems and does not over-sample on overloaded
systems.

[ In AutoNUMA speak, this patch deals with the effective sampling
rate of the 'hinting page fault'. AutoNUMA's scanning is
currently rate-limited, but it is also fundamentally
single-threaded, executing in the knuma_scand kernel thread,
so the limit in AutoNUMA is global and does not scale up with
the number of CPUs, nor does it scan tasks in an execution
proportional manner.

So the idea of rate-limiting the scanning was first implemented
in the AutoNUMA tree via a global rate limit. This patch goes
beyond that by implementing an execution rate proportional
working set sampling rate that is not implemented via a single
global scanning daemon. ]

[ Dan Carpenter pointed out a possible NULL pointer dereference in the
first version of this patch. ]

Based-on-idea-by: Andrea Arcangeli <[email protected]>
Bug-Found-By: Dan Carpenter <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
[ Wrote changelog and fixed bug. ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm_types.h | 3 +++
include/linux/sched.h | 1 +
kernel/sched/fair.c | 45 ++++++++++++++++++++++++++++++++-------------
kernel/sysctl.c | 7 +++++++
4 files changed, 43 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index d82accb..b40f4ef 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -406,6 +406,9 @@ struct mm_struct {
*/
unsigned long numa_next_scan;

+ /* Restart point for scanning and setting pte_numa */
+ unsigned long numa_scan_offset;
+
/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ac71181..abb1c70 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2008,6 +2008,7 @@ extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

extern unsigned int sysctl_balance_numa_scan_period_min;
extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_scan_size;
extern unsigned int sysctl_balance_numa_settle_count;

#ifdef CONFIG_SCHED_DEBUG
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 020a8f2..38b911ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -780,10 +780,13 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)

#ifdef CONFIG_BALANCE_NUMA
/*
- * numa task sample period in ms: 5s
+ * numa task sample period in ms
*/
-unsigned int sysctl_balance_numa_scan_period_min = 5000;
-unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
+unsigned int sysctl_balance_numa_scan_period_min = 2000;
+unsigned int sysctl_balance_numa_scan_period_max = 2000*16;
+
+/* Portion of address space to scan in MB */
+unsigned int sysctl_balance_numa_scan_size = 256;

static void task_numa_placement(struct task_struct *p)
{
@@ -817,6 +820,9 @@ void task_numa_work(struct callback_head *work)
unsigned long migrate, next_scan, now = jiffies;
struct task_struct *p = current;
struct mm_struct *mm = p->mm;
+ struct vm_area_struct *vma;
+ unsigned long offset, end;
+ long length;

WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));

@@ -843,18 +849,31 @@ void task_numa_work(struct callback_head *work)
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;

- ACCESS_ONCE(mm->numa_scan_seq)++;
- {
- struct vm_area_struct *vma;
+ offset = mm->numa_scan_offset;
+ length = sysctl_balance_numa_scan_size;
+ length <<= 20;

- down_read(&mm->mmap_sem);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- if (!vma_migratable(vma))
- continue;
- change_prot_numa(vma, vma->vm_start, vma->vm_end);
- }
- up_read(&mm->mmap_sem);
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, offset);
+ if (!vma) {
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ offset = 0;
+ vma = mm->mmap;
+ }
+ for (; vma && length > 0; vma = vma->vm_next) {
+ if (!vma_migratable(vma))
+ continue;
+
+ offset = max(offset, vma->vm_start);
+ end = min(ALIGN(offset + length, HPAGE_SIZE), vma->vm_end);
+ length -= end - offset;
+
+ change_prot_numa(vma, offset, end);
+
+ offset = end;
}
+ mm->numa_scan_offset = offset;
+ up_read(&mm->mmap_sem);
}

/*
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 1359f51..d191203 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -366,6 +366,13 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "balance_numa_scan_size_mb",
+ .data = &sysctl_balance_numa_scan_size,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif /* CONFIG_BALANCE_NUMA */
#endif /* CONFIG_SCHED_DEBUG */
{
--
1.7.9.2

2012-11-06 09:17:21

by Mel Gorman

Subject: [PATCH 19/19] mm: sched: numa: Implement slow start for working set sampling

From: Peter Zijlstra <[email protected]>

Add a 1 second delay before starting to scan the working set of
a task and starting to balance it amongst nodes.

[ note that before the constant per task WSS sampling rate patch
the initial scan would happen much later still, in effect that
patch caused this regression. ]

The theory is that short-run tasks benefit very little from NUMA
placement: they come and go, and they better stick to the node
they were started on. As tasks mature and rebalance to other CPUs
and nodes, so does their NUMA placement have to change and so
does it start to matter more and more.

In practice this change fixes an observable kbuild regression:

# [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]

!NUMA:
45.291088843 seconds time elapsed ( +- 0.40% )
45.154231752 seconds time elapsed ( +- 0.36% )

+NUMA, no slow start:
46.172308123 seconds time elapsed ( +- 0.30% )
46.343168745 seconds time elapsed ( +- 0.25% )

+NUMA, 1 sec slow start:
45.224189155 seconds time elapsed ( +- 0.25% )
45.160866532 seconds time elapsed ( +- 0.17% )

and it also fixes an observable perf bench (hackbench) regression:

# perf stat --null --repeat 10 perf bench sched messaging

-NUMA: 0.246225691 seconds time elapsed ( +- 1.31% )
+NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% )

+NUMA 1sec delay: 0.248076230 seconds time elapsed ( +- 1.35% )

The implementation is simple and straightforward; most of the patch
deals with adding the /proc/sys/kernel/balance_numa_scan_delay_ms tunable
knob.

Signed-off-by: Peter Zijlstra <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
[ Wrote the changelog, ran measurements, tuned the default. ]
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 5 +++++
kernel/sysctl.c | 7 +++++++
4 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index abb1c70..a2b06ea 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2006,6 +2006,7 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+extern unsigned int sysctl_balance_numa_scan_delay;
extern unsigned int sysctl_balance_numa_scan_period_min;
extern unsigned int sysctl_balance_numa_scan_period_max;
extern unsigned int sysctl_balance_numa_scan_size;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 81fa185..047e3c7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1543,7 +1543,7 @@ static void __sched_fork(struct task_struct *p)
p->node_stamp = 0ULL;
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
- p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+ p->numa_scan_period = sysctl_balance_numa_scan_delay;
p->numa_work.next = &p->numa_work;
#endif /* CONFIG_BALANCE_NUMA */
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 38b911ef..8c9c28e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -788,6 +788,9 @@ unsigned int sysctl_balance_numa_scan_period_max = 2000*16;
/* Portion of address space to scan in MB */
unsigned int sysctl_balance_numa_scan_size = 256;

+/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
+unsigned int sysctl_balance_numa_scan_delay = 1000;
+
static void task_numa_placement(struct task_struct *p)
{
int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
@@ -900,6 +903,8 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;

if (now - curr->node_stamp > period) {
+ if (!curr->node_stamp)
+ curr->numa_scan_period = sysctl_balance_numa_scan_period_min;
curr->node_stamp = now;

if (!time_before(jiffies, curr->mm->numa_next_scan)) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d191203..5ee587d 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -353,6 +353,13 @@ static struct ctl_table kern_table[] = {
#endif /* CONFIG_SMP */
#ifdef CONFIG_BALANCE_NUMA
{
+ .procname = "balance_numa_scan_delay_ms",
+ .data = &sysctl_balance_numa_scan_delay,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
.procname = "balance_numa_scan_period_min_ms",
.data = &sysctl_balance_numa_scan_period_min,
.maxlen = sizeof(unsigned int),
--
1.7.9.2

2012-11-06 09:19:15

by Mel Gorman

Subject: [PATCH 11/19] mm: mempolicy: Check for misplaced page

From: Lee Schermerhorn <[email protected]>

This patch provides a new function to test whether a page resides
on a node that is appropriate for the mempolicy for the vma and
address where the page is supposed to be mapped. This involves
looking up the node where the page belongs. So, the function
returns that node so that it may be used to allocated the page
without consulting the policy again.

A subsequent patch will call this function from the fault path.
Because of this, I don't want to go ahead and allocate the page, e.g.,
via alloc_page_vma() only to have to free it if it has the correct
policy. So, I just mimic the alloc_page_vma() node computation
logic--sort of.
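
The fault-path caller is not part of this patch; a hedged sketch of the
kind of call site the changelog anticipates is below, where
migrate_misplaced_page() comes from a later patch in the series and the
exact signature used here is an assumption:

  /* Sketch only -- the real caller is added by a later patch */
  static void numa_hint_fault(struct page *page, struct vm_area_struct *vma,
                              unsigned long addr)
  {
          int target_nid = mpol_misplaced(page, vma, addr);

          if (target_nid != -1)
                  migrate_misplaced_page(page, target_nid);
  }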

Note: we could use this function to implement a MPOL_MF_STRICT
behavior when migrating pages to match mbind() mempolicy--e.g.,
to ensure that pages in an interleaved range are reinterleaved
rather than left where they are when they reside on any page in
the interleave nodemask.

Signed-off-by: Lee Schermerhorn <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Linus Torvalds <[email protected]>
[ Added MPOL_F_LAZY to trigger migrate-on-fault;
simplified code now that we don't have to bother
with special crap for interleaved ]
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mempolicy.h | 8 +++++
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 76 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 85 insertions(+)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index e5ccb9d..c511e25 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -198,6 +198,8 @@ static inline int vma_migratable(struct vm_area_struct *vma)
return 1;
}

+extern int mpol_misplaced(struct page *, struct vm_area_struct *, unsigned long);
+
#else

struct mempolicy {};
@@ -323,5 +325,11 @@ static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
return 0;
}

+static inline int mpol_misplaced(struct page *page, struct vm_area_struct *vma,
+ unsigned long address)
+{
+ return -1; /* no node preference */
+}
+
#endif /* CONFIG_NUMA */
#endif
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index d23dca8..472de8a 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -61,6 +61,7 @@ enum mpol_rebind_step {
#define MPOL_F_SHARED (1 << 0) /* identify shared policies */
#define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */
#define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */
+#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */


#endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c21e914..df1466d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2181,6 +2181,82 @@ static void sp_free(struct sp_node *n)
kmem_cache_free(sn_cache, n);
}

+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page - page to be checked
+ * @vma - vm area where page mapped
+ * @addr - virtual address where page mapped
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ * -1 - not misplaced, page is in the right node
+ * node - node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mempolicy *pol;
+ struct zone *zone;
+ int curnid = page_to_nid(page);
+ unsigned long pgoff;
+ int polnid = -1;
+ int ret = -1;
+
+ BUG_ON(!vma);
+
+ pol = get_vma_policy(current, vma, addr);
+ if (!(pol->flags & MPOL_F_MOF))
+ goto out;
+
+ switch (pol->mode) {
+ case MPOL_INTERLEAVE:
+ BUG_ON(addr >= vma->vm_end);
+ BUG_ON(addr < vma->vm_start);
+
+ pgoff = vma->vm_pgoff;
+ pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+ polnid = offset_il_node(pol, vma, pgoff);
+ break;
+
+ case MPOL_PREFERRED:
+ if (pol->flags & MPOL_F_LOCAL)
+ polnid = numa_node_id();
+ else
+ polnid = pol->v.preferred_node;
+ break;
+
+ case MPOL_BIND:
+ /*
+ * allows binding to multiple nodes.
+ * use current page if in policy nodemask,
+ * else select nearest allowed node, if any.
+ * If no allowed nodes, use current [!misplaced].
+ */
+ if (node_isset(curnid, pol->v.nodes))
+ goto out;
+ (void)first_zones_zonelist(
+ node_zonelist(numa_node_id(), GFP_HIGHUSER),
+ gfp_zone(GFP_HIGHUSER),
+ &pol->v.nodes, &zone);
+ polnid = zone->node;
+ break;
+
+ default:
+ BUG();
+ }
+ if (curnid != polnid)
+ ret = polnid;
+out:
+ mpol_cond_put(pol);
+
+ return ret;
+}
+
static void sp_delete(struct shared_policy *sp, struct sp_node *n)
{
pr_debug("deleting %lx-l%lx\n", n->start, n->end);
--
1.7.9.2

2012-11-06 09:19:20

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 03/19] mm: compaction: Add scanned and isolated counters for compaction

Compaction already has tracepoints to count scanned and isolated pages
but it requires that ftrace be enabled and if that information has to be
written to disk then it can be disruptive. This patch adds vmstat counters
for compaction called compact_migrate_scanned, compact_free_scanned and
compact_isolated.

With these counters, it is possible to define a basic cost model for
compaction. This approximates of how much work compaction is doing and can
be compared that with an oprofile showing TLB misses and see if the cost of
compaction is being offset by THP for example. Minimally a compaction patch
can be evaluated in terms of whether it increases or decreases cost. The
basic cost model looks like this

Fundamental unit u: a word sizeof(void *)

Ca = cost of struct page access = sizeof(struct page) / u

Cmc = Cost migrate page copy = (Ca + PAGE_SIZE/u) * 2
Cmf = Cost migrate failure = Ca * 2
Ci = Cost page isolation = (Ca + Wi)
where Wi is a constant that should reflect the approximate
cost of the locking operation.

Csm = Cost migrate scanning = Ca
Csf = Cost free scanning = Ca

Overall cost = (Csm * compact_migrate_scanned) +
(Csf * compact_free_scanned) +
(Ci * compact_isolated) +
(Cmc * pgmigrate_success) +
(Cmf * pgmigrate_failed)

Where the values are read from /proc/vmstat.

This is very basic and ignores certain costs such as the allocation cost
to do a migrate page copy but any improvement to the model would still
use the same vmstat counters.
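
To illustrate, the following self-contained userspace sketch plugs
sample counter deltas into the model above. The counter values, the
64-byte struct page size, the 4096-byte page size and the lock weight
Wi are assumptions made up for the example rather than measurements.

#include <stdio.h>

int main(void)
{
        /* Example deltas read from /proc/vmstat (made up for illustration) */
        unsigned long compact_migrate_scanned = 100000;
        unsigned long compact_free_scanned    = 250000;
        unsigned long compact_isolated        =  30000;
        unsigned long pgmigrate_success       =  14000;
        unsigned long pgmigrate_failed        =   1000;

        unsigned long u   = sizeof(void *);
        unsigned long Ca  = 64 / u;             /* assumed sizeof(struct page) */
        unsigned long Wi  = 1;                  /* assumed isolation lock weight */
        unsigned long Cmc = (Ca + 4096 / u) * 2;
        unsigned long Cmf = Ca * 2;
        unsigned long Ci  = Ca + Wi;

        unsigned long cost = Ca  * compact_migrate_scanned +    /* Csm */
                             Ca  * compact_free_scanned +       /* Csf */
                             Ci  * compact_isolated +
                             Cmc * pgmigrate_success +
                             Cmf * pgmigrate_failed;

        printf("approximate compaction cost: %lu units\n", cost);
        return 0;
}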

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/vm_event_item.h | 2 ++
mm/compaction.c | 8 ++++++++
mm/vmstat.c | 3 +++
3 files changed, 13 insertions(+)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 8aa7cb9..a1f750b 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -42,6 +42,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
#endif
#ifdef CONFIG_COMPACTION
+ COMPACTMIGRATE_SCANNED, COMPACTFREE_SCANNED,
+ COMPACTISOLATED,
COMPACTSTALL, COMPACTFAIL, COMPACTSUCCESS,
#endif
#ifdef CONFIG_HUGETLB_PAGE
diff --git a/mm/compaction.c b/mm/compaction.c
index 2c077a7..aee7443 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -356,6 +356,10 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
if (blockpfn == end_pfn)
update_pageblock_skip(cc, valid_page, total_isolated, false);

+ count_vm_events(COMPACTFREE_SCANNED, nr_scanned);
+ if (total_isolated)
+ count_vm_events(COMPACTISOLATED, total_isolated);
+
return total_isolated;
}

@@ -646,6 +650,10 @@ next_pageblock:

trace_mm_compaction_isolate_migratepages(nr_scanned, nr_isolated);

+ count_vm_events(COMPACTMIGRATE_SCANNED, nr_scanned);
+ if (nr_isolated)
+ count_vm_events(COMPACTISOLATED, nr_isolated);
+
return low_pfn;
}

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 89a7fd6..3a067fa 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -779,6 +779,9 @@ const char * const vmstat_text[] = {
"pgmigrate_fail",
#endif
#ifdef CONFIG_COMPACTION
+ "compact_migrate_scanned",
+ "compact_free_scanned",
+ "compact_isolated",
"compact_stall",
"compact_fail",
"compact_success",
--
1.7.9.2

2012-11-06 09:19:17

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 04/19] mm: numa: define _PAGE_NUMA

From: Andrea Arcangeli <[email protected]>

The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
faults to identify the per NUMA node working set of the thread at
runtime.

Arming the NUMA hinting page fault mechanism works similarly to
setting up a mprotect(PROT_NONE) virtual range: the present bit is
cleared at the same time that _PAGE_NUMA is set, so when the fault
triggers we can identify it as a NUMA hinting page fault.

_PAGE_NUMA on x86 shares the same bit number of _PAGE_PROTNONE (but it
could also use a different bitflag, it's up to the architecture to
decide).

It would be confusing to refer to the "NUMA hinting page faults" as
"do_prot_none faults". They're different events and _PAGE_NUMA doesn't
alter the semantics of mprotect(PROT_NONE) in any way.

Sharing the same bitflag with _PAGE_PROTNONE in fact complicates
things: it requires us to ensure the code paths executed by
_PAGE_PROTNONE remain mutually exclusive to the code paths executed
by _PAGE_NUMA at all times, so that _PAGE_NUMA and _PAGE_PROTNONE never
step on each other's toes.

Because we want to be able to set this bitflag in any established pte
or pmd (while clearing the present bit at the same time) without
losing information, this bitflag must never be set when the pte and
pmd are present, so the bitflag picked for _PAGE_NUMA usage, must not
be used by the swap entry format.
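
For illustration only, helpers built on top of _PAGE_NUMA could look
roughly like the sketch below; the real definitions arrive in patches
5-7 of this series and may differ in detail. pte_flags(),
pte_set_flags() and pte_clear_flags() are the existing x86 pte
accessors.

static inline int pte_numa(pte_t pte)
{
        /* NUMA hinting pte: _PAGE_NUMA set while _PAGE_PRESENT is clear */
        return (pte_flags(pte) & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
}

static inline pte_t pte_mknuma(pte_t pte)
{
        pte = pte_set_flags(pte, _PAGE_NUMA);
        return pte_clear_flags(pte, _PAGE_PRESENT);
}

static inline pte_t pte_mknonnuma(pte_t pte)
{
        pte = pte_clear_flags(pte, _PAGE_NUMA);
        return pte_set_flags(pte, _PAGE_PRESENT | _PAGE_ACCESSED);
}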

Signed-off-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ec8a1fc..3c32db8 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -64,6 +64,26 @@
#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE)
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)

+/*
+ * _PAGE_NUMA indicates that this page will trigger a numa hinting
+ * minor page fault to gather numa placement statistics (see
+ * pte_numa()). The bit picked (8) is within the range between
+ * _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't
+ * require changes to the swp entry format because that bit is always
+ * zero when the pte is not present.
+ *
+ * The bit picked must always be zero whether the pmd is present or not
+ * present, so that we don't lose information when we set it while
+ * atomically clearing the present bit.
+ *
+ * Because we shared the same bit (8) with _PAGE_PROTNONE this can be
+ * interpreted as _PAGE_NUMA only in places that _PAGE_PROTNONE
+ * couldn't reach, like handle_mm_fault() (see access_error in
+ * arch/x86/mm/fault.c, the vma protection must not be PROT_NONE for
+ * handle_mm_fault() to be invoked).
+ */
+#define _PAGE_NUMA _PAGE_PROTNONE
+
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
_PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
--
1.7.9.2

2012-11-06 09:19:13

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 12/19] mm: migrate: Introduce migrate_misplaced_page()

From: Peter Zijlstra <[email protected]>

Note: This was originally based on Peter's patch "mm/migrate: Introduce
migrate_misplaced_page()" but borrows extremely heavily from Andrea's
"autonuma: memory follows CPU algorithm and task/mm_autonuma stats
collection". The end result is barely recognisable so signed-offs
had to be dropped. If original authors are ok with it, I'll
re-add the signed-off-bys.

Add migrate_misplaced_page() which deals with migrating pages from
faults.
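
A minimal sketch of the expected calling convention, assuming the
caller is in the fault path and already knows the destination node;
numa_hint_fault_migrate() is a hypothetical wrapper used purely to
show the reference-count handover described in the function's
comment.

static void numa_hint_fault_migrate(struct page *page, int target_nid)
{
        /*
         * The reference taken here is handed to migrate_misplaced_page(),
         * which is responsible for dropping it whether or not the page
         * actually gets queued for migration.
         */
        get_page(page);
        if (migrate_misplaced_page(page, target_nid))
                pr_debug("queued misplaced page for node %d\n", target_nid);
}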

Based-on-work-by: Lee Schermerhorn <[email protected]>
Based-on-work-by: Peter Zijlstra <[email protected]>
Based-on-work-by: Andrea Arcangeli <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/migrate.h | 8 ++++
mm/migrate.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9d1c159..69f60b5 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -13,6 +13,7 @@ enum migrate_reason {
MR_MEMORY_HOTPLUG,
MR_SYSCALL, /* also applies to cpusets */
MR_MEMPOLICY_MBIND,
+ MR_NUMA_MISPLACED,
MR_CMA
};

@@ -39,6 +40,7 @@ extern int migrate_vmas(struct mm_struct *mm,
extern void migrate_page_copy(struct page *newpage, struct page *page);
extern int migrate_huge_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page);
+extern int migrate_misplaced_page(struct page *page, int node);
#else

static inline void putback_lru_pages(struct list_head *l) {}
@@ -72,5 +74,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
#define migrate_page NULL
#define fail_migrate_page NULL

+static inline
+int migrate_misplaced_page(struct page *page, int node)
+{
+ return -EAGAIN; /* can't migrate now */
+}
#endif /* CONFIG_MIGRATION */
+
#endif /* _LINUX_MIGRATE_H */
diff --git a/mm/migrate.c b/mm/migrate.c
index 27be9c9..4a92808 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -282,7 +282,7 @@ static int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page,
struct buffer_head *head, enum migrate_mode mode)
{
- int expected_count;
+ int expected_count = 0;
void **pslot;

if (!mapping) {
@@ -1415,4 +1415,104 @@ int migrate_vmas(struct mm_struct *mm, const nodemask_t *to,
}
return err;
}
-#endif
+
+/*
+ * Returns true if this is a safe migration target node for misplaced NUMA
+ * pages. Currently it only checks the watermarks, which is crude.
+ */
+static bool migrate_balanced_pgdat(struct pglist_data *pgdat,
+ int nr_migrate_pages)
+{
+ int z;
+ for (z = pgdat->nr_zones - 1; z >= 0; z--) {
+ struct zone *zone = pgdat->node_zones + z;
+
+ if (!populated_zone(zone))
+ continue;
+
+ if (zone->all_unreclaimable)
+ continue;
+
+ /* Avoid waking kswapd by allocating pages_to_migrate pages. */
+ if (!zone_watermark_ok(zone, 0,
+ high_wmark_pages(zone) +
+ nr_migrate_pages,
+ 0, 0))
+ continue;
+ return true;
+ }
+ return false;
+}
+
+static struct page *alloc_misplaced_dst_page(struct page *page,
+ unsigned long data,
+ int **result)
+{
+ int nid = (int) data;
+ struct page *newpage;
+
+ newpage = alloc_pages_exact_node(nid,
+ (GFP_HIGHUSER_MOVABLE | GFP_THISNODE |
+ __GFP_NOMEMALLOC | __GFP_NORETRY |
+ __GFP_NOWARN) &
+ ~GFP_IOFS, 0);
+ return newpage;
+}
+
+/*
+ * Attempt to migrate a misplaced page to the specified destination
+ * node. Caller is expected to have an elevated reference count on
+ * the page that will be dropped by this function before returning.
+ */
+int migrate_misplaced_page(struct page *page, int node)
+{
+ int isolated = 0;
+ LIST_HEAD(migratepages);
+
+ /*
+ * Don't migrate pages that are mapped in multiple processes.
+ * TODO: Handle false sharing detection instead of this hammer
+ */
+ if (page_mapcount(page) != 1)
+ goto out;
+
+ /* Avoid migrating to a node that is nearly full */
+ if (migrate_balanced_pgdat(NODE_DATA(node), 1)) {
+ int page_lru;
+
+ if (isolate_lru_page(page)) {
+ put_page(page);
+ goto out;
+ }
+ isolated = 1;
+
+ /*
+ * Page is isolated which takes a reference count so now the
+ * caller's reference can be safely dropped without the page
+ * disappearing underneath us during migration
+ */
+ put_page(page);
+
+ page_lru = page_is_file_cache(page);
+ inc_zone_page_state(page, NR_ISOLATED_ANON + page_lru);
+ list_add(&page->lru, &migratepages);
+ }
+
+ if (isolated) {
+ int nr_remaining;
+
+ nr_remaining = migrate_pages(&migratepages,
+ alloc_misplaced_dst_page,
+ node, false, MIGRATE_ASYNC,
+ MR_NUMA_MISPLACED);
+ if (nr_remaining) {
+ putback_lru_pages(&migratepages);
+ isolated = 0;
+ }
+ }
+ BUG_ON(!list_empty(&migratepages));
+out:
+ return isolated;
+}
+
+#endif /* CONFIG_NUMA */
--
1.7.9.2

2012-11-06 09:19:11

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 13/19] mm: mempolicy: Use _PAGE_NUMA to migrate pages

Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
sufficiently different that the signed-off-bys were dropped

Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
pieces into an effective migrate on fault scheme.

Note that (on x86) we rely on PROT_NONE pages being !present and avoid
the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
page-migration performance.

Based-on-work-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/huge_mm.h | 8 ++++----
mm/huge_memory.c | 32 +++++++++++++++++++++++++++++---
mm/memory.c | 29 +++++++++++++++++++++++++----
3 files changed, 58 insertions(+), 11 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a13ebb1..406f81c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -160,8 +160,8 @@ static inline struct page *compound_trans_head(struct page *page)
return page;
}

-extern int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
- pmd_t pmd, pmd_t *pmdp);
+extern int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pmd_t pmd, pmd_t *pmdp);

#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
@@ -200,8 +200,8 @@ static inline int pmd_trans_huge_lock(pmd_t *pmd,
return 0;
}

-static inline int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
- pmd_t pmd, pmd_t *pmdp);
+static inline int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pmd_t pmd, pmd_t *pmdp);
{
}

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 92a64d2..1453c30 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -18,6 +18,7 @@
#include <linux/freezer.h>
#include <linux/mman.h>
#include <linux/pagemap.h>
+#include <linux/migrate.h>
#include <asm/tlb.h>
#include <asm/pgalloc.h>
#include "internal.h"
@@ -1018,16 +1019,39 @@ out:
}

/* NUMA hinting page fault entry point for trans huge pmds */
-int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,
- pmd_t pmd, pmd_t *pmdp)
+int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long addr, pmd_t pmd, pmd_t *pmdp)
{
- struct page *page;
+ struct page *page = NULL;
+ unsigned long haddr = addr & HPAGE_PMD_MASK;
+ int target_nid;

spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;

page = pmd_page(pmd);
+ get_page(page);
+ spin_unlock(&mm->page_table_lock);
+
+ target_nid = mpol_misplaced(page, vma, haddr);
+ if (target_nid == -1)
+ goto clear_pmdnuma;
+
+ /*
+ * Due to lacking code to migrate thp pages, we'll split
+ * (which preserves the special PROT_NONE) and re-take the
+ * fault on the normal pages.
+ */
+ split_huge_page(page);
+ put_page(page);
+ return 0;
+
+clear_pmdnuma:
+ spin_lock(&mm->page_table_lock);
+ if (unlikely(!pmd_same(pmd, *pmdp)))
+ goto out_unlock;
+
pmd = pmd_mknonnuma(pmd);
set_pmd_at(mm, addr & HPAGE_PMD_MASK, pmdp, pmd);
VM_BUG_ON(pmd_numa(*pmdp));
@@ -1035,6 +1059,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, unsigned long addr,

out_unlock:
spin_unlock(&mm->page_table_lock);
+ if (page)
+ put_page(page);
return 0;
}

diff --git a/mm/memory.c b/mm/memory.c
index 72092d8..fb46ef2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
+#include <linux/migrate.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -3436,8 +3437,9 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pte_t pte, pte_t *ptep, pmd_t *pmd)
{
- struct page *page;
+ struct page *page = NULL;
spinlock_t *ptl;
+ int current_nid, target_nid;

/*
* The "pte" at this point cannot be used safely without
@@ -3452,14 +3454,33 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_lock(ptl);
if (unlikely(!pte_same(*ptep, pte)))
goto out_unlock;
- pte = pte_mknonnuma(pte);
- set_pte_at(mm, addr, ptep, pte);
+
page = vm_normal_page(vma, addr, pte);
BUG_ON(!page);
+
+ get_page(page);
+ current_nid = page_to_nid(page);
+ target_nid = mpol_misplaced(page, vma, addr);
+ if (target_nid == -1)
+ goto clear_pmdnuma;
+
+ pte_unmap_unlock(ptep, ptl);
+ migrate_misplaced_page(page, target_nid);
+ page = NULL;
+
+ ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ if (!pte_same(*ptep, pte))
+ goto out_unlock;
+
+clear_pmdnuma:
+ pte = pte_mknonnuma(pte);
+ set_pte_at(mm, addr, ptep, pte);
update_mmu_cache(vma, addr, ptep);

out_unlock:
pte_unmap_unlock(ptep, ptl);
+ if (page)
+ put_page(page);
return 0;
}

@@ -3631,7 +3652,7 @@ retry:
barrier();
if (pmd_trans_huge(orig_pmd)) {
if (pmd_numa(*pmd))
- return do_huge_pmd_numa_page(mm, address,
+ return do_huge_pmd_numa_page(mm, vma, address,
orig_pmd, pmd);

if ((flags & FAULT_FLAG_WRITE) && !pmd_write(orig_pmd)) {
--
1.7.9.2

2012-11-06 09:19:09

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 14/19] mm: mempolicy: Add MPOL_MF_LAZY

From: Lee Schermerhorn <[email protected]>

NOTE: Once again there is a lot of patch stealing and the end result
is sufficiently different that I had to drop the signed-offs.
Will re-add if the original authors are ok with that.

This patch adds another mbind() flag to request "lazy migration". The
flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
pages are marked PROT_NONE. The pages will be migrated in the fault
path on "first touch", if the policy dictates at that time.

"Lazy Migration" will allow testing of migrate-on-fault via mbind().
It also allows applications to specify that only subsequently touched
pages be migrated to obey the new policy, instead of all pages in the range.
This can be useful for multi-threaded applications working on a
large shared data area that is initialized by a single thread, which
results in all pages residing on one node [or a few, if that node
overflowed]. After the PROT_NONE marking, the pages in regions assigned
to the worker threads will be automatically migrated local to those
threads on first touch.
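
As a hedged sketch of how an application might request lazy migration
once this flag exists, the snippet below binds a worker's region to
its node with MPOL_MF_LAZY. It assumes a kernel with this patch
applied and the libnuma mbind() wrapper, defines the flag locally to
match the uapi value added above, and region, length and node are
placeholders.

#include <stddef.h>
#include <numaif.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY    (1 << 3)        /* matches the uapi value above */
#endif

static long bind_region_lazily(void *region, size_t length, int node)
{
        unsigned long nodemask = 1UL << node;

        /*
         * Pages in the region are only migrated to 'node' when a thread
         * actually touches them, i.e. migrate on fault.
         */
        return mbind(region, length, MPOL_BIND, &nodemask,
                     sizeof(nodemask) * 8, MPOL_MF_MOVE | MPOL_MF_LAZY);
}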

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm.h | 3 +
include/uapi/linux/mempolicy.h | 13 ++-
mm/mempolicy.c | 176 ++++++++++++++++++++++++++++++++++++----
3 files changed, 174 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa06804..eed70f8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1548,6 +1548,9 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
}
#endif

+void change_prot_numa(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end);
+
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 472de8a..6a1baae 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -49,9 +49,16 @@ enum mpol_rebind_step {

/* Flags for mbind */
#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
-#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */
-#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */
-#define MPOL_MF_INTERNAL (1<<3) /* Internal flags start here */
+#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform
+ to policy */
+#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to policy */
+#define MPOL_MF_LAZY (1<<3) /* Modifies '_MOVE: lazy migrate on fault */
+#define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */
+
+#define MPOL_MF_VALID (MPOL_MF_STRICT | \
+ MPOL_MF_MOVE | \
+ MPOL_MF_MOVE_ALL | \
+ MPOL_MF_LAZY)

/*
* Internal flags that share the struct mempolicy flags word with
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index df1466d..abe2e45 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -90,6 +90,7 @@
#include <linux/syscalls.h>
#include <linux/ctype.h>
#include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>

#include <asm/tlbflush.h>
#include <asm/uaccess.h>
@@ -566,6 +567,136 @@ static inline int check_pgd_range(struct vm_area_struct *vma,
}

/*
+ * Here we search for non-shared page mappings (mapcount == 1) and we
+ * set up the pmd/pte_numa on those mappings so the very next access
+ * will fire a NUMA hinting page fault.
+ */
+static int
+change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte, *_pte;
+ struct page *page;
+ unsigned long _address, end;
+ spinlock_t *ptl;
+ int ret = 0;
+
+ VM_BUG_ON(address & ~PAGE_MASK);
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pud = pud_offset(pgd, address);
+ if (!pud_present(*pud))
+ goto out;
+
+ pmd = pmd_offset(pud, address);
+ if (pmd_none(*pmd))
+ goto out;
+
+ if (pmd_trans_huge_lock(pmd, vma) == 1) {
+ int page_nid;
+ ret = HPAGE_PMD_NR;
+
+ VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+
+ if (pmd_numa(*pmd)) {
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+
+ page = pmd_page(*pmd);
+
+ /* only check non-shared pages */
+ if (page_mapcount(page) != 1) {
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+
+ page_nid = page_to_nid(page);
+
+ if (pmd_numa(*pmd)) {
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+
+ set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+ /* defer TLB flush to lower the overhead */
+ spin_unlock(&mm->page_table_lock);
+ goto out;
+ }
+
+ if (pmd_trans_unstable(pmd))
+ goto out;
+ VM_BUG_ON(!pmd_present(*pmd));
+
+ end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
+ pte = pte_offset_map_lock(mm, pmd, address, &ptl);
+ for (_address = address, _pte = pte; _address < end;
+ _pte++, _address += PAGE_SIZE) {
+ pte_t pteval = *_pte;
+ if (!pte_present(pteval))
+ continue;
+ if (pte_numa(pteval))
+ continue;
+ page = vm_normal_page(vma, _address, pteval);
+ if (unlikely(!page))
+ continue;
+ /* only check non-shared pages */
+ if (page_mapcount(page) != 1)
+ continue;
+
+ if (pte_numa(pteval))
+ continue;
+
+ set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+
+ /* defer TLB flush to lower the overhead */
+ ret++;
+ }
+ pte_unmap_unlock(pte, ptl);
+
+ if (ret && !pmd_numa(*pmd)) {
+ spin_lock(&mm->page_table_lock);
+ set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+ spin_unlock(&mm->page_table_lock);
+ /* defer TLB flush to lower the overhead */
+ }
+
+out:
+ return ret;
+}
+
+/* Assumes mmap_sem is held */
+void
+change_prot_numa(struct vm_area_struct *vma,
+ unsigned long address, unsigned long end)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ int progress = 0;
+
+ while (address < vma->vm_end) {
+ VM_BUG_ON(address < vma->vm_start ||
+ address + PAGE_SIZE > vma->vm_end);
+
+ progress += change_prot_numa_range(mm, vma, address);
+ address = (address + PMD_SIZE) & PMD_MASK;
+ }
+
+ /*
+ * Flush the TLB for the mm to start the NUMA hinting
+ * page faults after we finish scanning this vma part.
+ */
+ mmu_notifier_invalidate_range_start(vma->vm_mm, address, end);
+ flush_tlb_range(vma, address, end);
+ mmu_notifier_invalidate_range_end(vma->vm_mm, address, end);
+}
+
+/*
* Check if all pages in a range are on a set of nodes.
* If pagelist != NULL then isolate pages from the LRU and
* put them on the pagelist.
@@ -583,22 +714,32 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
return ERR_PTR(-EFAULT);
prev = NULL;
for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+ unsigned long endvma = vma->vm_end;
+
+ if (endvma > end)
+ endvma = end;
+ if (vma->vm_start > start)
+ start = vma->vm_start;
+
if (!(flags & MPOL_MF_DISCONTIG_OK)) {
if (!vma->vm_next && vma->vm_end < end)
return ERR_PTR(-EFAULT);
if (prev && prev->vm_end < vma->vm_start)
return ERR_PTR(-EFAULT);
}
- if (!is_vm_hugetlb_page(vma) &&
- ((flags & MPOL_MF_STRICT) ||
+
+ if (is_vm_hugetlb_page(vma))
+ goto next;
+
+ if (flags & MPOL_MF_LAZY) {
+ change_prot_numa(vma, start, endvma);
+ goto next;
+ }
+
+ if ((flags & MPOL_MF_STRICT) ||
((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
- vma_migratable(vma)))) {
- unsigned long endvma = vma->vm_end;
+ vma_migratable(vma))) {

- if (endvma > end)
- endvma = end;
- if (vma->vm_start > start)
- start = vma->vm_start;
err = check_pgd_range(vma, start, endvma, nodes,
flags, private);
if (err) {
@@ -606,6 +747,7 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
break;
}
}
+next:
prev = vma;
}
return first;
@@ -1138,8 +1280,7 @@ static long do_mbind(unsigned long start, unsigned long len,
int err;
LIST_HEAD(pagelist);

- if (flags & ~(unsigned long)(MPOL_MF_STRICT |
- MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ if (flags & ~(unsigned long)MPOL_MF_VALID)
return -EINVAL;
if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
return -EPERM;
@@ -1162,6 +1303,9 @@ static long do_mbind(unsigned long start, unsigned long len,
if (IS_ERR(new))
return PTR_ERR(new);

+ if (flags & MPOL_MF_LAZY)
+ new->flags |= MPOL_F_MOF;
+
/*
* If we are using the default policy then operation
* on discontinuous address spaces is okay after all
@@ -1198,13 +1342,15 @@ static long do_mbind(unsigned long start, unsigned long len,
vma = check_range(mm, start, end, nmask,
flags | MPOL_MF_INVERT, &pagelist);

- err = PTR_ERR(vma);
- if (!IS_ERR(vma)) {
- int nr_failed = 0;
-
+ err = PTR_ERR(vma); /* maybe ... */
+ if (!IS_ERR(vma) && mode != MPOL_NOOP)
err = mbind_range(mm, start, end, new);

+ if (!err) {
+ int nr_failed = 0;
+
if (!list_empty(&pagelist)) {
+ WARN_ON_ONCE(flags & MPOL_MF_LAZY);
nr_failed = migrate_pages(&pagelist, new_vma_page,
(unsigned long)vma,
false, MIGRATE_SYNC,
@@ -1213,7 +1359,7 @@ static long do_mbind(unsigned long start, unsigned long len,
putback_lru_pages(&pagelist);
}

- if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+ if (nr_failed && (flags & MPOL_MF_STRICT))
err = -EIO;
} else
putback_lru_pages(&pagelist);
--
1.7.9.2

2012-11-06 09:19:07

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 15/19] mm: numa: Add fault driven placement and migration

From: Peter Zijlstra <[email protected]>

NOTE: This patch is based on "sched, numa, mm: Add fault driven
placement and migration policy" but as it throws away all the policy
to just leave a basic foundation I had to drop the signed-offs-by.

This patch creates a bare-bones method for marking PTEs pte_numa in the
context of the scheduler so that, when they are faulted later, the pages
can be migrated towards the node of the faulting CPU. In itself this does
nothing useful but any
placement policy will fundamentally depend on receiving hints on placement
from fault context and doing something intelligent about it.
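
To make "doing something intelligent about it" slightly more
concrete, a hypothetical first step for a policy would be to
accumulate per-node fault counts in the fault hook, roughly as below;
numa_faults[] is an imaginary per-task array and not something this
patch adds.

void task_numa_fault(int node, int pages)
{
        struct task_struct *p = current;

        /* Hypothetical per-node histogram a policy could decay and compare */
        p->numa_faults[node] += pages;

        task_numa_placement(p);
}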

Signed-off-by: Mel Gorman <[email protected]>
---
arch/sh/mm/Kconfig | 1 +
include/linux/mm_types.h | 11 +++++
include/linux/sched.h | 20 ++++++++
init/Kconfig | 14 ++++++
kernel/sched/core.c | 13 +++++
kernel/sched/fair.c | 122 ++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/features.h | 7 +++
kernel/sched/sched.h | 6 +++
kernel/sysctl.c | 24 ++++++++-
mm/huge_memory.c | 6 ++-
mm/memory.c | 7 ++-
11 files changed, 227 insertions(+), 4 deletions(-)

diff --git a/arch/sh/mm/Kconfig b/arch/sh/mm/Kconfig
index cb8f992..ddbcfe7 100644
--- a/arch/sh/mm/Kconfig
+++ b/arch/sh/mm/Kconfig
@@ -111,6 +111,7 @@ config VSYSCALL
config NUMA
bool "Non Uniform Memory Access (NUMA) Support"
depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
+ select NUMA_VARIABLE_LOCALITY
default n
help
Some SH systems have many various memories scattered around
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3a..d82accb 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -398,6 +398,17 @@ struct mm_struct {
#ifdef CONFIG_CPUMASK_OFFSTACK
struct cpumask cpumask_allocation;
#endif
+#ifdef CONFIG_BALANCE_NUMA
+ /*
+ * numa_next_scan is the next time when the PTEs will be marked
+ * pte_numa to gather statistics and migrate pages to new nodes
+ * if necessary
+ */
+ unsigned long numa_next_scan;
+
+ /* numa_scan_seq prevents two threads setting pte_numa */
+ int numa_scan_seq;
+#endif
struct uprobes_state uprobes_state;
};

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0dd42a0..ac71181 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1479,6 +1479,14 @@ struct task_struct {
short il_next;
short pref_node_fork;
#endif
+#ifdef CONFIG_BALANCE_NUMA
+ int numa_scan_seq;
+ int numa_migrate_seq;
+ unsigned int numa_scan_period;
+ u64 node_stamp; /* migration stamp */
+ struct callback_head numa_work;
+#endif /* CONFIG_BALANCE_NUMA */
+
struct rcu_head rcu;

/*
@@ -1553,6 +1561,14 @@ struct task_struct {
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

+#ifdef CONFIG_BALANCE_NUMA
+extern void task_numa_fault(int node, int pages);
+#else
+static inline void task_numa_fault(int node, int pages)
+{
+}
+#endif
+
/*
* Priority of a process goes from 0..MAX_PRIO-1, valid RT
* priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
@@ -1990,6 +2006,10 @@ enum sched_tunable_scaling {
};
extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+extern unsigned int sysctl_balance_numa_scan_period_min;
+extern unsigned int sysctl_balance_numa_scan_period_max;
+extern unsigned int sysctl_balance_numa_settle_count;
+
#ifdef CONFIG_SCHED_DEBUG
extern unsigned int sysctl_sched_migration_cost;
extern unsigned int sysctl_sched_nr_migrate;
diff --git a/init/Kconfig b/init/Kconfig
index 6fdd6e3..aaba45d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -696,6 +696,20 @@ config LOG_BUF_SHIFT
config HAVE_UNSTABLE_SCHED_CLOCK
bool

+#
+# For architectures that (ab)use NUMA to represent different memory regions
+# all cpu-local but of different latencies, such as SuperH.
+#
+config NUMA_VARIABLE_LOCALITY
+ bool
+
+config BALANCE_NUMA
+ bool "Memory placement aware NUMA scheduler"
+ default n
+ depends on SMP && NUMA && MIGRATION && !NUMA_VARIABLE_LOCALITY
+ help
+ This option adds support for automatic NUMA aware memory/task placement.
+
menuconfig CGROUPS
boolean "Control Group support"
depends on EVENTFD
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2d8927f..81fa185 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1533,6 +1533,19 @@ static void __sched_fork(struct task_struct *p)
#ifdef CONFIG_PREEMPT_NOTIFIERS
INIT_HLIST_HEAD(&p->preempt_notifiers);
#endif
+
+#ifdef CONFIG_BALANCE_NUMA
+ if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
+ p->mm->numa_next_scan = jiffies;
+ p->mm->numa_scan_seq = 0;
+ }
+
+ p->node_stamp = 0ULL;
+ p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
+ p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+ p->numa_scan_period = sysctl_balance_numa_scan_period_min;
+ p->numa_work.next = &p->numa_work;
+#endif /* CONFIG_BALANCE_NUMA */
}

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b800a1..020a8f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -26,6 +26,8 @@
#include <linux/slab.h>
#include <linux/profile.h>
#include <linux/interrupt.h>
+#include <linux/mempolicy.h>
+#include <linux/task_work.h>

#include <trace/events/sched.h>

@@ -776,6 +778,123 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
* Scheduling class queueing methods:
*/

+#ifdef CONFIG_BALANCE_NUMA
+/*
+ * numa task sample period in ms: 5s
+ */
+unsigned int sysctl_balance_numa_scan_period_min = 5000;
+unsigned int sysctl_balance_numa_scan_period_max = 5000*16;
+
+static void task_numa_placement(struct task_struct *p)
+{
+ int seq = ACCESS_ONCE(p->mm->numa_scan_seq);
+
+ if (p->numa_scan_seq == seq)
+ return;
+ p->numa_scan_seq = seq;
+
+ /* FIXME: Scheduling placement policy hints go here */
+}
+
+/*
+ * Got a PROT_NONE fault for a page on @node.
+ */
+void task_numa_fault(int node, int pages)
+{
+ struct task_struct *p = current;
+
+ /* FIXME: Allocate task-specific structure for placement policy here */
+
+ task_numa_placement(p);
+}
+
+/*
+ * The expensive part of numa migration is done from task_work context.
+ * Triggered from task_tick_numa().
+ */
+void task_numa_work(struct callback_head *work)
+{
+ unsigned long migrate, next_scan, now = jiffies;
+ struct task_struct *p = current;
+ struct mm_struct *mm = p->mm;
+
+ WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
+
+ work->next = work; /* protect against double add */
+ /*
+ * Who cares about NUMA placement when they're dying.
+ *
+ * NOTE: make sure not to dereference p->mm before this check,
+ * exit_task_work() happens _after_ exit_mm() so we could be called
+ * without p->mm even though we still had it when we enqueued this
+ * work.
+ */
+ if (p->flags & PF_EXITING)
+ return;
+
+ /*
+ * Enforce maximal scan/migration frequency..
+ */
+ migrate = mm->numa_next_scan;
+ if (time_before(now, migrate))
+ return;
+
+ next_scan = now + 2*msecs_to_jiffies(sysctl_balance_numa_scan_period_min);
+ if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
+ return;
+
+ ACCESS_ONCE(mm->numa_scan_seq)++;
+ {
+ struct vm_area_struct *vma;
+
+ down_read(&mm->mmap_sem);
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (!vma_migratable(vma))
+ continue;
+ change_prot_numa(vma, vma->vm_start, vma->vm_end);
+ }
+ up_read(&mm->mmap_sem);
+ }
+}
+
+/*
+ * Drive the periodic memory faults..
+ */
+void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+ struct callback_head *work = &curr->numa_work;
+ u64 period, now;
+
+ /*
+ * We don't care about NUMA placement if we don't have memory.
+ */
+ if (!curr->mm || (curr->flags & PF_EXITING) || work->next != work)
+ return;
+
+ /*
+ * Using runtime rather than walltime has the dual advantage that
+ * we (mostly) drive the selection from busy threads and that the
+ * task needs to have done some actual work before we bother with
+ * NUMA placement.
+ */
+ now = curr->se.sum_exec_runtime;
+ period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
+
+ if (now - curr->node_stamp > period) {
+ curr->node_stamp = now;
+
+ if (!time_before(jiffies, curr->mm->numa_next_scan)) {
+ init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
+ task_work_add(curr, work, true);
+ }
+ }
+}
+#else
+static void task_tick_numa(struct rq *rq, struct task_struct *curr)
+{
+}
+#endif /* CONFIG_BALANCE_NUMA */
+
static void
account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
@@ -4954,6 +5073,9 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
cfs_rq = cfs_rq_of(se);
entity_tick(cfs_rq, se, queued);
}
+
+ if (sched_feat_numa(NUMA))
+ task_tick_numa(rq, curr);
}

/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index eebefca..7cfd289 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -61,3 +61,10 @@ SCHED_FEAT(TTWU_QUEUE, true)
SCHED_FEAT(FORCE_SD_OVERLAP, false)
SCHED_FEAT(RT_RUNTIME_SHARE, true)
SCHED_FEAT(LB_MIN, false)
+
+/*
+ * Apply the automatic NUMA scheduling policy
+ */
+#ifdef CONFIG_BALANCE_NUMA
+SCHED_FEAT(NUMA, true)
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7a7db09..9a43241 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -648,6 +648,12 @@ extern struct static_key sched_feat_keys[__SCHED_FEAT_NR];
#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
#endif /* SCHED_DEBUG && HAVE_JUMP_LABEL */

+#ifdef CONFIG_BALANCE_NUMA
+#define sched_feat_numa(x) sched_feat(x)
+#else
+#define sched_feat_numa(x) (0)
+#endif
+
static inline u64 global_rt_period(void)
{
return (u64)sysctl_sched_rt_period * NSEC_PER_USEC;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 26f65ea..1359f51 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -256,9 +256,11 @@ static int min_sched_granularity_ns = 100000; /* 100 usecs */
static int max_sched_granularity_ns = NSEC_PER_SEC; /* 1 second */
static int min_wakeup_granularity_ns; /* 0 usecs */
static int max_wakeup_granularity_ns = NSEC_PER_SEC; /* 1 second */
+#ifdef CONFIG_SMP
static int min_sched_tunable_scaling = SCHED_TUNABLESCALING_NONE;
static int max_sched_tunable_scaling = SCHED_TUNABLESCALING_END-1;
-#endif
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_SCHED_DEBUG */

#ifdef CONFIG_COMPACTION
static int min_extfrag_threshold;
@@ -301,6 +303,7 @@ static struct ctl_table kern_table[] = {
.extra1 = &min_wakeup_granularity_ns,
.extra2 = &max_wakeup_granularity_ns,
},
+#ifdef CONFIG_SMP
{
.procname = "sched_tunable_scaling",
.data = &sysctl_sched_tunable_scaling,
@@ -347,7 +350,24 @@ static struct ctl_table kern_table[] = {
.extra1 = &zero,
.extra2 = &one,
},
-#endif
+#endif /* CONFIG_SMP */
+#ifdef CONFIG_BALANCE_NUMA
+ {
+ .procname = "balance_numa_scan_period_min_ms",
+ .data = &sysctl_balance_numa_scan_period_min,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+ {
+ .procname = "balance_numa_scan_period_max_ms",
+ .data = &sysctl_balance_numa_scan_period_max,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
+#endif /* CONFIG_BALANCE_NUMA */
+#endif /* CONFIG_SCHED_DEBUG */
{
.procname = "sched_rt_period_us",
.data = &sysctl_sched_rt_period,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1453c30..91f9b06 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1045,6 +1045,8 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
*/
split_huge_page(page);
put_page(page);
+
+ task_numa_fault(target_nid, HPAGE_PMD_NR);
return 0;

clear_pmdnuma:
@@ -1059,8 +1061,10 @@ clear_pmdnuma:

out_unlock:
spin_unlock(&mm->page_table_lock);
- if (page)
+ if (page) {
put_page(page);
+ task_numa_fault(page_to_nid(page), HPAGE_PMD_NR);
+ }
return 0;
}

diff --git a/mm/memory.c b/mm/memory.c
index fb46ef2..a63daf9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3439,7 +3439,8 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page = NULL;
spinlock_t *ptl;
- int current_nid, target_nid;
+ int current_nid = -1;
+ int target_nid;

/*
* The "pte" at this point cannot be used safely without
@@ -3464,6 +3465,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (target_nid == -1)
goto clear_pmdnuma;

+ current_nid = target_nid;
pte_unmap_unlock(ptep, ptl);
migrate_misplaced_page(page, target_nid);
page = NULL;
@@ -3481,6 +3483,9 @@ out_unlock:
pte_unmap_unlock(ptep, ptl);
if (page)
put_page(page);
+
+ if (current_nid != -1)
+ task_numa_fault(current_nid, 1);
return 0;
}

--
1.7.9.2

2012-11-06 09:19:05

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 16/19] mm: numa: Add pte updates, hinting and migration stats

It is tricky to quantify the basic cost of automatic NUMA placement in a
meaningful manner. This patch adds some vmstats that can be used as part
of a basic costing model.

u = basic unit = sizeof(void *)
Ca = cost of struct page access = sizeof(struct page) / u
Cpte = Cost PTE access = Ca
Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
where Cpte is incurred twice for a read and a write and Wlock
is a constant representing the cost of taking or releasing a
lock
Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
Ci = Cost of page isolation = Ca + Wi
where Wi is a constant that should reflect the approximate cost
of the locking operation
Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
where Wnuma is the approximate NUMA factor. 1 is local. 1.2
would imply that remote accesses are 20% more expensive

Balancing cost = Cpte * numa_pte_updates +
Cnumahint * numa_hint_faults +
Ci * numa_pages_migrated +
Cpagecopy * numa_pages_migrated

Note that numa_pages_migrated is used as a measure of how many pages
were isolated even though it would miss pages that failed to migrate. A
vmstat counter could have been added for it but the isolation cost is
pretty marginal in comparison to the overall cost so it seemed overkill.

The ideal way to measure automatic placement benefit would be to count
the number of remote accesses versus local accesses and do something like

benefit = (remote_accesses_before - remote_accesses_after) * Wnuma

but the information is not readily available. As a workload converges, the
expectation would be that the number of remote numa hints would reduce to 0.

convergence = numa_hint_faults_local / numa_hint_faults
where this is measured for the last N number of
numa hints recorded. When the workload is fully
converged the value is 1.

This can measure if the placement policy is converging and how fast it is
doing it.
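
For example, a small userspace sketch could compute the convergence
ratio from two samples of the hint-fault counters; the deltas below
are made up for illustration.

#include <stdio.h>

int main(void)
{
        /* Counter deltas over the last sampling window (made up) */
        unsigned long numa_hint_faults       = 12000;
        unsigned long numa_hint_faults_local = 10800;

        double convergence = (double)numa_hint_faults_local / numa_hint_faults;

        printf("convergence = %.2f (1.00 means fully converged)\n", convergence);
        return 0;
}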

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/vm_event_item.h | 6 ++++++
mm/huge_memory.c | 1 +
mm/memory.c | 3 +++
mm/mempolicy.c | 6 ++++++
mm/migrate.c | 3 ++-
mm/vmstat.c | 6 ++++++
6 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index a1f750b..dded0af 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -38,6 +38,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
KSWAPD_LOW_WMARK_HIT_QUICKLY, KSWAPD_HIGH_WMARK_HIT_QUICKLY,
KSWAPD_SKIP_CONGESTION_WAIT,
PAGEOUTRUN, ALLOCSTALL, PGROTATED,
+#ifdef CONFIG_BALANCE_NUMA
+ NUMA_PTE_UPDATES,
+ NUMA_HINT_FAULTS,
+ NUMA_HINT_FAULTS_LOCAL,
+ NUMA_PAGE_MIGRATE,
+#endif
#ifdef CONFIG_MIGRATION
PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
#endif
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 91f9b06..a82a313 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1033,6 +1033,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = pmd_page(pmd);
get_page(page);
spin_unlock(&mm->page_table_lock);
+ count_vm_event(NUMA_HINT_FAULTS);

target_nid = mpol_misplaced(page, vma, haddr);
if (target_nid == -1)
diff --git a/mm/memory.c b/mm/memory.c
index a63daf9..2780948 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3456,11 +3456,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!pte_same(*ptep, pte)))
goto out_unlock;

+ count_vm_event(NUMA_HINT_FAULTS);
page = vm_normal_page(vma, addr, pte);
BUG_ON(!page);

get_page(page);
current_nid = page_to_nid(page);
+ if (current_nid == numa_node_id())
+ count_vm_event(NUMA_HINT_FAULTS_LOCAL);
target_nid = mpol_misplaced(page, vma, addr);
if (target_nid == -1)
goto clear_pmdnuma;
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index abe2e45..e25da64 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -583,6 +583,7 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long _address, end;
spinlock_t *ptl;
int ret = 0;
+ int nr_pte_updates = 0;

VM_BUG_ON(address & ~PAGE_MASK);

@@ -625,6 +626,7 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
}

set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
+ nr_pte_updates++;
/* defer TLB flush to lower the overhead */
spin_unlock(&mm->page_table_lock);
goto out;
@@ -654,6 +656,7 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
continue;

set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
+ nr_pte_updates++;

/* defer TLB flush to lower the overhead */
ret++;
@@ -668,6 +671,8 @@ change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
}

out:
+ if (nr_pte_updates)
+ count_vm_events(NUMA_PTE_UPDATES, nr_pte_updates);
return ret;
}

@@ -694,6 +699,7 @@ change_prot_numa(struct vm_area_struct *vma,
mmu_notifier_invalidate_range_start(vma->vm_mm, address, end);
flush_tlb_range(vma, address, end);
mmu_notifier_invalidate_range_end(vma->vm_mm, address, end);
+
}

/*
diff --git a/mm/migrate.c b/mm/migrate.c
index 4a92808..14e2a31 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1508,7 +1508,8 @@ int migrate_misplaced_page(struct page *page, int node)
if (nr_remaining) {
putback_lru_pages(&migratepages);
isolated = 0;
- }
+ } else
+ count_vm_event(NUMA_PAGE_MIGRATE);
}
BUG_ON(!list_empty(&migratepages));
out:
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 3a067fa..cfa386da 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -774,6 +774,12 @@ const char * const vmstat_text[] = {

"pgrotated",

+#ifdef CONFIG_BALANCE_NUMA
+ "numa_pte_updates",
+ "numa_hint_faults",
+ "numa_hint_faults_local",
+ "numa_pages_migrated",
+#endif
#ifdef CONFIG_MIGRATION
"pgmigrate_success",
"pgmigrate_fail",
--
1.7.9.2

2012-11-06 09:19:03

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 17/19] mm: numa: Migrate on reference policy

This is the dumbest possible policy that still does something of note.
When a pte_numa is faulted, it is moved immediately. Any replacement
policy must at least do better than this and in all likelihood this
policy regresses normal workloads.

Signed-off-by: Mel Gorman <[email protected]>
---
include/uapi/linux/mempolicy.h | 1 +
mm/mempolicy.c | 37 +++++++++++++++++++++++++++++++++++--
2 files changed, 36 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 6a1baae..b25064f 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -69,6 +69,7 @@ enum mpol_rebind_step {
#define MPOL_F_LOCAL (1 << 1) /* preferred local allocation */
#define MPOL_F_REBINDING (1 << 2) /* identify policies in rebinding */
#define MPOL_F_MOF (1 << 3) /* this policy wants migrate on fault */
+#define MPOL_F_MORON (1 << 4) /* Migrate On pte_numa Reference On Node */


#endif /* _UAPI_LINUX_MEMPOLICY_H */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index e25da64..11d4b6b 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -118,6 +118,22 @@ static struct mempolicy default_policy = {
.flags = MPOL_F_LOCAL,
};

+static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+
+static struct mempolicy *get_task_policy(struct task_struct *p)
+{
+ struct mempolicy *pol = p->mempolicy;
+ int node;
+
+ if (!pol) {
+ node = numa_node_id();
+ if (node != -1)
+ pol = &preferred_node_policy[node];
+ }
+
+ return pol;
+}
+
static const struct mempolicy_operations {
int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
/*
@@ -1704,7 +1720,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
struct mempolicy *get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
{
- struct mempolicy *pol = task->mempolicy;
+ struct mempolicy *pol = get_task_policy(task);

if (vma) {
if (vma->vm_ops && vma->vm_ops->get_policy) {
@@ -2127,7 +2143,7 @@ retry_cpuset:
*/
struct page *alloc_pages_current(gfp_t gfp, unsigned order)
{
- struct mempolicy *pol = current->mempolicy;
+ struct mempolicy *pol = get_task_policy(current);
struct page *page;
unsigned int cpuset_mems_cookie;

@@ -2401,6 +2417,14 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
default:
BUG();
}
+
+ /*
+ * Moronic node selection policy. Migrate the page to the node that is
+ * currently referencing it
+ */
+ if (pol->flags & MPOL_F_MORON)
+ polnid = numa_node_id();
+
if (curnid != polnid)
ret = polnid;
out:
@@ -2589,6 +2613,15 @@ void __init numa_policy_init(void)
sizeof(struct sp_node),
0, SLAB_PANIC, NULL);

+ for_each_node(nid) {
+ preferred_node_policy[nid] = (struct mempolicy) {
+ .refcnt = ATOMIC_INIT(1),
+ .mode = MPOL_PREFERRED,
+ .flags = MPOL_F_MOF | MPOL_F_MORON,
+ .v = { .preferred_node = nid, },
+ };
+ }
+
/*
* Set interleaving policy for system init. Interleaving is only
* enabled across suitably sized nodes (default is >= 16MB), or
--
1.7.9.2

2012-11-06 09:22:15

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 02/19] mm: migrate: Add a tracepoint for migrate_pages

The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
about migration activity but not the type or the reason. This patch adds
a tracepoint to identify the type of page migration and why the page is
being migrated.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/migrate.h | 13 ++++++++--
include/trace/events/migrate.h | 51 ++++++++++++++++++++++++++++++++++++++++
mm/compaction.c | 3 ++-
mm/memory-failure.c | 3 ++-
mm/memory_hotplug.c | 3 ++-
mm/mempolicy.c | 6 +++--
mm/migrate.c | 10 ++++++--
mm/page_alloc.c | 3 ++-
8 files changed, 82 insertions(+), 10 deletions(-)
create mode 100644 include/trace/events/migrate.h

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index ce7e667..9d1c159 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -7,6 +7,15 @@

typedef struct page *new_page_t(struct page *, unsigned long private, int **);

+enum migrate_reason {
+ MR_COMPACTION,
+ MR_MEMORY_FAILURE,
+ MR_MEMORY_HOTPLUG,
+ MR_SYSCALL, /* also applies to cpusets */
+ MR_MEMPOLICY_MBIND,
+ MR_CMA
+};
+
#ifdef CONFIG_MIGRATION

extern void putback_lru_pages(struct list_head *l);
@@ -14,7 +23,7 @@ extern int migrate_page(struct address_space *,
struct page *, struct page *, enum migrate_mode);
extern int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- enum migrate_mode mode);
+ enum migrate_mode mode, int reason);
extern int migrate_huge_page(struct page *, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode);
@@ -35,7 +44,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
static inline void putback_lru_pages(struct list_head *l) {}
static inline int migrate_pages(struct list_head *l, new_page_t x,
unsigned long private, bool offlining,
- enum migrate_mode mode) { return -ENOSYS; }
+ enum migrate_mode mode, int reason) { return -ENOSYS; }
static inline int migrate_huge_page(struct page *page, new_page_t x,
unsigned long private, bool offlining,
enum migrate_mode mode) { return -ENOSYS; }
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
new file mode 100644
index 0000000..ec2a6cc
--- /dev/null
+++ b/include/trace/events/migrate.h
@@ -0,0 +1,51 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM migrate
+
+#if !defined(_TRACE_MIGRATE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_MIGRATE_H
+
+#define MIGRATE_MODE \
+ {MIGRATE_ASYNC, "MIGRATE_ASYNC"}, \
+ {MIGRATE_SYNC_LIGHT, "MIGRATE_SYNC_LIGHT"}, \
+ {MIGRATE_SYNC, "MIGRATE_SYNC"}
+
+#define MIGRATE_REASON \
+ {MR_COMPACTION, "compaction"}, \
+ {MR_MEMORY_FAILURE, "memory_failure"}, \
+ {MR_MEMORY_HOTPLUG, "memory_hotplug"}, \
+ {MR_SYSCALL, "syscall_or_cpuset"}, \
+ {MR_MEMPOLICY_MBIND, "mempolicy_mbind"}, \
+ {MR_CMA, "cma"}
+
+TRACE_EVENT(mm_migrate_pages,
+
+ TP_PROTO(unsigned long succeeded, unsigned long failed,
+ enum migrate_mode mode, int reason),
+
+ TP_ARGS(succeeded, failed, mode, reason),
+
+ TP_STRUCT__entry(
+ __field( unsigned long, succeeded)
+ __field( unsigned long, failed)
+ __field( enum migrate_mode, mode)
+ __field( int, reason)
+ ),
+
+ TP_fast_assign(
+ __entry->succeeded = succeeded;
+ __entry->failed = failed;
+ __entry->mode = mode;
+ __entry->reason = reason;
+ ),
+
+ TP_printk("nr_succeeded=%lu nr_failed=%lu mode=%s reason=%s",
+ __entry->succeeded,
+ __entry->failed,
+ __print_symbolic(__entry->mode, MIGRATE_MODE),
+ __print_symbolic(__entry->reason, MIGRATE_REASON))
+);
+
+#endif /* _TRACE_MIGRATE_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/mm/compaction.c b/mm/compaction.c
index 00ad883..2c077a7 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -990,7 +990,8 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
nr_migrate = cc->nr_migratepages;
err = migrate_pages(&cc->migratepages, compaction_alloc,
(unsigned long)cc, false,
- cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);
+ cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC,
+ MR_COMPACTION);
update_nr_listpages(cc);
nr_remaining = cc->nr_migratepages;

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 6c5899b..ddb68a1 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1558,7 +1558,8 @@ int soft_offline_page(struct page *page, int flags)
page_is_file_cache(page));
list_add(&page->lru, &pagelist);
ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
- false, MIGRATE_SYNC);
+ false, MIGRATE_SYNC,
+ MR_MEMORY_FAILURE);
if (ret) {
putback_lru_pages(&pagelist);
pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 56b758a..af60ce7 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -819,7 +819,8 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
* migrate_pages returns # of failed pages.
*/
ret = migrate_pages(&source, alloc_migrate_target, 0,
- true, MIGRATE_SYNC);
+ true, MIGRATE_SYNC,
+ MR_MEMORY_HOTPLUG);
if (ret)
putback_lru_pages(&source);
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d04a8a5..66e90ec 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -961,7 +961,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,

if (!list_empty(&pagelist)) {
err = migrate_pages(&pagelist, new_node_page, dest,
- false, MIGRATE_SYNC);
+ false, MIGRATE_SYNC,
+ MR_SYSCALL);
if (err)
putback_lru_pages(&pagelist);
}
@@ -1202,7 +1203,8 @@ static long do_mbind(unsigned long start, unsigned long len,
if (!list_empty(&pagelist)) {
nr_failed = migrate_pages(&pagelist, new_vma_page,
(unsigned long)vma,
- false, MIGRATE_SYNC);
+ false, MIGRATE_SYNC,
+ MR_MEMPOLICY_MBIND);
if (nr_failed)
putback_lru_pages(&pagelist);
}
diff --git a/mm/migrate.c b/mm/migrate.c
index 04687f6..27be9c9 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -38,6 +38,9 @@

#include <asm/tlbflush.h>

+#define CREATE_TRACE_POINTS
+#include <trace/events/migrate.h>
+
#include "internal.h"

/*
@@ -958,7 +961,7 @@ out:
*/
int migrate_pages(struct list_head *from,
new_page_t get_new_page, unsigned long private, bool offlining,
- enum migrate_mode mode)
+ enum migrate_mode mode, int reason)
{
int retry = 1;
int nr_failed = 0;
@@ -1004,6 +1007,8 @@ out:
count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
if (nr_failed)
count_vm_events(PGMIGRATE_FAIL, nr_failed);
+ trace_mm_migrate_pages(nr_succeeded, nr_failed, mode, reason);
+
if (!swapwrite)
current->flags &= ~PF_SWAPWRITE;

@@ -1145,7 +1150,8 @@ set_status:
err = 0;
if (!list_empty(&pagelist)) {
err = migrate_pages(&pagelist, new_page_node,
- (unsigned long)pm, 0, MIGRATE_SYNC);
+ (unsigned long)pm, 0, MIGRATE_SYNC,
+ MR_SYSCALL);
if (err)
putback_lru_pages(&pagelist);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb90971..51bef90 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5707,7 +5707,8 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,

ret = migrate_pages(&cc->migratepages,
alloc_migrate_target,
- 0, false, MIGRATE_SYNC);
+ 0, false, MIGRATE_SYNC,
+ MR_CMA);
}

putback_lru_pages(&cc->migratepages);
--
1.7.9.2

2012-11-06 17:30:01

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 01/19] mm: compaction: Move migration fail/success stats to migrate.c

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> The compact_pages_moved and compact_pagemigrate_failed events are
> convenient for determining if compaction is active and to what
> degree migration is succeeding but it's at the wrong level. Other
> users of migration may also want to know if migration is working
> properly and this will be particularly true for any automated
> NUMA migration. This patch moves the counters down to migration
> with the new events called pgmigrate_success and pgmigrate_fail.
> The compact_blocks_moved counter is removed because while it was
> useful for debugging initially, it's worthless now as no meaningful
> conclusions can be drawn from its value.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-06 17:31:33

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 02/19] mm: migrate: Add a tracepoint for migrate_pages

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> The pgmigrate_success and pgmigrate_fail vmstat counters tell the user
> about migration activity but not the type or the reason. This patch adds
> a tracepoint to identify the type of page migration and why the page is
> being migrated.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-06 17:33:25

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 03/19] mm: compaction: Add scanned and isolated counters for compaction

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> Compaction already has tracepoints to count scanned and isolated pages
> but it requires that ftrace be enabled and if that information has to be
> written to disk then it can be disruptive. This patch adds vmstat counters
> for compaction called compact_migrate_scanned, compact_free_scanned and
> compact_isolated.
>
> With these counters, it is possible to define a basic cost model for
> compaction. This approximates how much work compaction is doing and can
> be compared with an oprofile showing TLB misses to see whether the cost of
> compaction is being offset by THP, for example. Minimally a compaction patch
> can be evaluated in terms of whether it increases or decreases cost. The
> basic cost model looks like this


> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-06 18:33:20

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 04/19] mm: numa: define _PAGE_NUMA

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> From: Andrea Arcangeli <[email protected]>
>
> The objective of _PAGE_NUMA is to be able to trigger NUMA hinting page
> faults to identify the per NUMA node working set of the thread at
> runtime.
>
> Arming the NUMA hinting page fault mechanism works similarly to
> setting up a mprotect(PROT_NONE) virtual range: the present bit is
> cleared at the same time that _PAGE_NUMA is set, so when the fault
> triggers we can identify it as a NUMA hinting page fault.
>
> _PAGE_NUMA on x86 shares the same bit number as _PAGE_PROTNONE (but it
> could also use a different bitflag, it's up to the architecture to
> decide).
>
> It would be confusing to call the "NUMA hinting page faults" as
> "do_prot_none faults". They're different events and _PAGE_NUMA doesn't
> alter the semantics of mprotect(PROT_NONE) in any way.
>
> Sharing the same bitflag with _PAGE_PROTNONE in fact complicates
> things: it requires us to ensure the code paths executed by
> _PAGE_PROTNONE remain mutually exclusive to the code paths executed
> by _PAGE_NUMA at all times, to avoid _PAGE_NUMA and _PAGE_PROTNONE
> stepping on each other's toes.
>
> Because we want to be able to set this bitflag in any established pte
> or pmd (while clearing the present bit at the same time) without
> losing information, this bitflag must never be set when the pte and
> pmd are present, so the bitflag picked for _PAGE_NUMA usage, must not
> be used by the swap entry format.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>
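
Arming an established pte as described above comes down to setting _PAGE_NUMA
and clearing _PAGE_PRESENT in one go. A sketch of what an x86 pte_mknuma()
helper plausibly looks like, built on the existing pte_set_flags() and
pte_clear_flags() helpers (the series may differ in detail):

/* Sketch: turn a present pte into a NUMA-hinting pte. The next access
 * faults because _PAGE_PRESENT is clear, and pte_numa() identifies it. */
static inline pte_t pte_mknuma(pte_t pte)
{
	pte = pte_set_flags(pte, _PAGE_NUMA);
	return pte_clear_flags(pte, _PAGE_PRESENT);
}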

2012-11-06 18:56:12

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 08/19] mm: numa: Create basic numa page hinting infrastructure

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> Note: This patch started as "mm/mpol: Create special PROT_NONE
> infrastructure" and preserves the basic idea but steals *very*
> heavily from "autonuma: numa hinting page faults entry points" for
> the actual fault handlers without the migration parts. The end
> result is barely recognisable as either patch so all Signed-off
> and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
> this version, I will re-add the signed-offs-by to reflect the history.
>
> In order to facilitate a lazy -- fault driven -- migration of pages, create
> a special transient PAGE_NUMA variant, we can then use the 'spurious'
> protection faults to drive our migrations from.
>
> Pages that already had an effective PROT_NONE mapping will not be detected

The patch itself is good, but the changelog needs a little
fix. While you are defining _PAGE_NUMA to _PAGE_PROTNONE on
x86, this may be different on other architectures.

Therefore, the changelog should refer to PAGE_NUMA, not
PROT_NONE.

> to generate these 'spurious' faults for the simple reason that we cannot
> distinguish them on their protection bits, see pte_numa(). This isn't
> a problem since PROT_NONE (and possible PROT_WRITE with dirty tracking)
> aren't used or are rare enough for us to not care about their placement.
>
> Signed-off-by: Mel Gorman <[email protected]>

Other than the changelog ...

Reviewed-by: Rik van Riel <[email protected]>
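
For reference, the "just clear them again" handler that patch 8 introduces
reduces to something like the sketch below. The signature and the
pte_mknonuma() name follow the cover letter's naming; none of this is copied
from the patch itself and details will differ:

/* Minimal NUMA hinting fault handler sketch: re-check the pte under its
 * lock and simply make it present again. */
static int do_numa_page_sketch(struct mm_struct *mm, struct vm_area_struct *vma,
			       unsigned long addr, pte_t entry, pmd_t *pmd)
{
	spinlock_t *ptl;
	pte_t *ptep;

	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
	if (unlikely(!pte_same(*ptep, entry)))
		goto out;			/* raced with another fault */

	entry = pte_mknonuma(entry);		/* clear _PAGE_NUMA, set _PAGE_PRESENT */
	set_pte_at(mm, addr, ptep, entry);
	update_mmu_cache(vma, addr, ptep);	/* no TLB flush: pte was not present */
out:
	pte_unmap_unlock(ptep, ptl);
	return 0;
}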

2012-11-06 19:08:11

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 12/19] mm: migrate: Introduce migrate_misplaced_page()

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> From: Peter Zijlstra <[email protected]>
>
> Note: This was originally based on Peter's patch "mm/migrate: Introduce
> migrate_misplaced_page()" but borrows extremely heavily from Andrea's
> "autonuma: memory follows CPU algorithm and task/mm_autonuma stats
> collection". The end result is barely recognisable so signed-offs
> had to be dropped. If original authors are ok with it, I'll
> re-add the signed-off-bys.
>
> Add migrate_misplaced_page() which deals with migrating pages from
> faults.
>
> Based-on-work-by: Lee Schermerhorn <[email protected]>
> Based-on-work-by: Peter Zijlstra <[email protected]>
> Based-on-work-by: Andrea Arcangeli <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Excellent, this avoids the ugliness in Peter's
approach, and the hard to read maze of functions
from Andrea's tree.

Reviewed-by: Rik van Riel <[email protected]>

2012-11-06 19:16:58

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: mempolicy: Add MPOL_MF_LAZY

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> From: Lee Schermerhorn <[email protected]>
>
> NOTE: Once again there is a lot of patch stealing and the end result
> is sufficiently different that I had to drop the signed-offs.
> Will re-add if the original authors are ok with that.
>
> This patch adds another mbind() flag to request "lazy migration". The
> flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
> pages are marked PROT_NONE. The pages will be migrated in the fault
> path on "first touch", if the policy dictates at that time.
>
> "Lazy Migration" will allow testing of migrate-on-fault via mbind().
> Also allows applications to specify that only subsequently touched
> pages be migrated to obey new policy, instead of all pages in range.
> This can be useful for multi-threaded applications working on a
> large shared data area that is initialized by an initial thread
> resulting in all pages on one [or a few, if overflowed] nodes.
> After PROT_NONE, the pages in regions assigned to the worker threads
> will be automatically migrated local to the threads on 1st touch.
>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>
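
From userspace, the lazy migration described in this changelog would be
requested through mbind(). A hedged sketch follows: MPOL_MF_LAZY is not in
released libnuma headers, so the fallback value below is an assumption and
should really be taken from the patched kernel's uapi headers; link with
-lnuma.

#include <numaif.h>		/* mbind(), MPOL_BIND, MPOL_MF_MOVE */
#include <stdio.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY	(1 << 3)	/* assumed value, check the patched headers */
#endif

/* Bind a range to the nodes in nodemask but defer the copy: pages only
 * migrate when they are next touched and found to be misplaced. */
static int bind_lazily(void *addr, unsigned long len,
		       unsigned long *nodemask, unsigned long maxnode)
{
	if (mbind(addr, len, MPOL_BIND, nodemask, maxnode,
		  MPOL_MF_MOVE | MPOL_MF_LAZY) < 0) {
		perror("mbind");
		return -1;
	}
	return 0;
}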

2012-11-06 19:25:34

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 13/19] mm: mempolicy: Use _PAGE_NUMA to migrate pages

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
> sufficiently different that the signed-off-bys were dropped
>
> Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
> pieces into an effective migrate on fault scheme.
>
> Note that (on x86) we rely on PROT_NONE pages being !present and avoid
> the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
> page-migration performance.
>
> Based-on-work-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>


> page = vm_normal_page(vma, addr, pte);
> BUG_ON(!page);
> +
> + get_page(page);
> + current_nid = page_to_nid(page);
> + target_nid = mpol_misplaced(page, vma, addr);
> + if (target_nid == -1)
> + goto clear_pmdnuma;
> +
> + pte_unmap_unlock(ptep, ptl);
> + migrate_misplaced_page(page, target_nid);
> + page = NULL;
> +
> + ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + if (!pte_same(*ptep, pte))
> + goto out_unlock;
> +

I see you tried to avoid the extraneous TLB flush
from inside migrate_misplaced_page. However,
try_to_unmap_one calls ptep_clear_flush, which will
currently still result in a remote TLB flush for
a _PAGE_NUMA pte, despite the pte not being
accessible for memory accesses (_PAGE_PRESENT not set).

Furthermore, if migrate_misplaced_page moved the page,
the !pte_same check will return false, and you will
get a double fault.

I wonder if migrate_misplaced_page should return a
struct page* or a pfn, so we can compute what "pte"
_should_ be, corrected for the new pfn, feed that
value to pte_same, and then avoid the double fault?

Also, we may want the change for ptep_clear_flush
that avoids flushing remote TLBs for a pte without
the _PAGE_PRESENT bit set.
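
One way Rik's suggestion could take shape, sketched loosely against the hunk
quoted above. The returned page, the pfn comparison and the variable names
are illustrative and gloss over dirty/accessed bit handling; this is not the
posted code:

/* Let the migration helper hand back the new page, then accept a pte that
 * maps the migrated copy instead of insisting on pte_same() with the stale
 * value, which would otherwise trigger a second fault. */
new_page = migrate_misplaced_page(page, target_nid);

ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
if (new_page) {
	if (pte_pfn(*ptep) != page_to_pfn(new_page))
		goto out_unlock;		/* something else changed it */
} else if (!pte_same(*ptep, pte)) {
	goto out_unlock;			/* migration failed and pte changed */
}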

2012-11-06 19:38:53

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 15/19] mm: numa: Add fault driven placement and migration

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> From: Peter Zijlstra <[email protected]>
>
> NOTE: This patch is based on "sched, numa, mm: Add fault driven
> placement and migration policy" but as it throws away all the policy
> to just leave a basic foundation I had to drop the signed-offs-by.
>
> This patch creates a bare-bones method for setting PTEs pte_numa in the
> context of the scheduler that when faulted later will be faulted onto the
> node the CPU is running on. In itself this does nothing useful but any
> placement policy will fundamentally depend on receiving hints on placement
> from fault context and doing something intelligent about it.
>
> Signed-off-by: Mel Gorman <[email protected]>

Excellent basis for implementing a smarter NUMA
policy.

Not sure if such a policy should be implemented
as a replacement for this patch, or on top of it...

Either way, thank you for cleaning up all of the
NUMA base code, while I was away at conferences
and stuck in airports :)

Peter, Andrea - does this look like a good basis
for implementing and comparing your NUMA policies?

I mean, it does to me. I am just wondering if there
is any reason at all you two could not use it as a
basis for an apples-to-apples comparison of your
NUMA placement policies?

Sharing 2/3 of the code would sure get rid of the
bulk of the discussion, and allow us to make real
progress.

2012-11-06 19:52:46

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 16/19] mm: numa: Add pte updates, hinting and migration stats

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> It is tricky to quantify the basic cost of automatic NUMA placement in a
> meaningful manner. This patch adds some vmstats that can be used as part
> of a basic costing model.
>
> u = basic unit = sizeof(void *)
> Ca = cost of struct page access = sizeof(struct page) / u
> Cpte = Cost PTE access = Ca
> Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
> where Cpte is incurred twice for a read and a write and Wlock
> is a constant representing the cost of taking or releasing a
> lock
> Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
> Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
> Ci = Cost of page isolation = Ca + Wi
> where Wi is a constant that should reflect the approximate cost
> of the locking operation
> Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
> where Wnuma is the approximate NUMA factor. 1 is local. 1.2
> would imply that remote accesses are 20% more expensive
>
> Balancing cost = Cpte * numa_pte_updates +
> Cnumahint * numa_hint_faults +
> Ci * numa_pages_migrated +
> Cpagecopy * numa_pages_migrated
>
> Note that numa_pages_migrated is used as a measure of how many pages
> were isolated even though it would miss pages that failed to migrate. A
> vmstat counter could have been added for it but the isolation cost is
> pretty marginal in comparison to the overall cost so it seemed overkill.
>
> The ideal way to measure automatic placement benefit would be to count
> the number of remote accesses versus local accesses and do something like
>
> benefit = (remote_accesses_before - remote_accesses_after) * Wnuma
>
> but the information is not readily available. As a workload converges, the
> expectation would be that the number of remote numa hints would reduce to 0.
>
> convergence = numa_hint_faults_local / numa_hint_faults
> where this is measured for the last N number of
> numa hints recorded. When the workload is fully
> converged the value is 1.
>
> This can measure if the placement policy is converging and how fast it is
> doing it.
>
> Signed-off-by: Mel Gorman <[email protected]>

I'm skipping the ACKing of the policy patches, which
appear to be meant to be placeholders for a "real"
policy. However, you have a few more mechanism patches
left in the series, which would be required regardless
of what policy gets merged, so ...

Reviewed-by: Rik van Riel <[email protected]>
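
As a concrete, made-up example of the convergence metric quoted above: with
12000 hinting faults of which 9000 were local, convergence is 0.75, i.e. a
quarter of the recent hints were still remote. A trivial userspace
illustration (the counter values are invented; on a real system they would
come from /proc/vmstat):

#include <stdio.h>

int main(void)
{
	unsigned long numa_hint_faults = 12000;		/* assumed sample */
	unsigned long numa_hint_faults_local = 9000;	/* assumed sample */

	double convergence = (double)numa_hint_faults_local / numa_hint_faults;

	/* 1.00 would mean every recent hinting fault was already local */
	printf("convergence = %.2f\n", convergence);	/* prints 0.75 */
	return 0;
}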

2012-11-06 19:53:31

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 18/19] mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> From: Peter Zijlstra <[email protected]>
>
> Note: The scan period is much larger than it was in the original patch.
> The reason was because the system CPU usage went through the roof
> with a sample period of 100ms but it was unsuitable to have a
> situation where a large process could stall for excessively long
> updating pte_numa. This may need to be tuned again if a placement
> policy converges too slowly.
>
> Previously, to probe the working set of a task, we'd use
> a very simple and crude method: mark all of its address
> space PROT_NONE.
>
> That method has various (obvious) disadvantages:
>
> - it samples the working set at dissimilar rates,
> giving some tasks a sampling quality advantage
> over others.
>
> - creates performance problems for tasks with very
> large working sets
>
> - over-samples processes with large address spaces but
> which only very rarely execute
>
> Improve that method by keeping a rotating offset into the
> address space that marks the current position of the scan,
> and advance it by a constant rate (in a CPU cycles execution
> proportional manner). If the offset reaches the last mapped
> address of the mm then it starts over at the first
> address.
>
> The per-task nature of the working set sampling functionality in this tree
> allows such constant rate, per task, execution-weight proportional sampling
> of the working set, with an adaptive sampling interval/frequency that
> goes from once per 2 seconds up to just once per 32 seconds. The current
> sampling volume is 256 MB per interval.
>
> As tasks mature and converge their working set, so does the
> sampling rate slow down to just a trickle, 256 MB per 8
> seconds of CPU time executed.
>
> This, beyond being adaptive, also rate-limits rarely
> executing systems and does not over-sample on overloaded
> systems.
>
> [ In AutoNUMA speak, this patch deals with the effective sampling
> rate of the 'hinting page fault'. AutoNUMA's scanning is
> currently rate-limited, but it is also fundamentally
> single-threaded, executing in the knuma_scand kernel thread,
> so the limit in AutoNUMA is global and does not scale up with
> the number of CPUs, nor does it scan tasks in an execution
> proportional manner.
>
> So the idea of rate-limiting the scanning was first implemented
> in the AutoNUMA tree via a global rate limit. This patch goes
> beyond that by implementing an execution rate proportional
> working set sampling rate that is not implemented via a single
> global scanning daemon. ]
>
> [ Dan Carpenter pointed out a possible NULL pointer dereference in the
> first version of this patch. ]
>
> Based-on-idea-by: Andrea Arcangeli <[email protected]>
> Bug-Found-By: Dan Carpenter <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Rik van Riel <[email protected]>
> [ Wrote changelog and fixed bug. ]
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>
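
For readers who want the mechanism in code form, a much-simplified sketch of
the rotating-offset scanner the changelog describes; the mm field name and
the mark_range_pte_numa() helper are assumptions, not lifted from the patch:

/* Scan a fixed-size window per pass, wrapping at the end of the address
 * space, so the marking cost is paid at a constant, per-task rate. */
static void task_numa_work_sketch(struct mm_struct *mm)
{
	unsigned long size = 256UL << 20;	/* 256MB per sampling interval */
	unsigned long start = mm->numa_scan_offset;
	unsigned long end;

	if (start >= TASK_SIZE)
		start = 0;			/* wrap to the first address */
	end = min(start + size, TASK_SIZE);

	/* Later touches in [start, end) raise NUMA hinting faults */
	mark_range_pte_numa(mm, start, end);

	mm->numa_scan_offset = end;		/* advance the rotating offset */
}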

2012-11-06 19:54:34

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 19/19] mm: sched: numa: Implement slow start for working set sampling

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> From: Peter Zijlstra <[email protected]>
>
> Add a 1 second delay before starting to scan the working set of
> a task and starting to balance it amongst nodes.
>
> [ note that before the constant per task WSS sampling rate patch
> the initial scan would happen much later still, in effect that
> patch caused this regression. ]
>
> The theory is that short-run tasks benefit very little from NUMA
> placement: they come and go, and they better stick to the node
> they were started on. As tasks mature and rebalance to other CPUs
> and nodes, so does their NUMA placement have to change and so
> does it start to matter more and more.
>
> In practice this change fixes an observable kbuild regression:
>
> # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
>
> !NUMA:
> 45.291088843 seconds time elapsed ( +- 0.40% )
> 45.154231752 seconds time elapsed ( +- 0.36% )
>
> +NUMA, no slow start:
> 46.172308123 seconds time elapsed ( +- 0.30% )
> 46.343168745 seconds time elapsed ( +- 0.25% )
>
> +NUMA, 1 sec slow start:
> 45.224189155 seconds time elapsed ( +- 0.25% )
> 45.160866532 seconds time elapsed ( +- 0.17% )
>
> and it also fixes an observable perf bench (hackbench) regression:
>
> # perf stat --null --repeat 10 perf bench sched messaging
>
> -NUMA:
>
> -NUMA: 0.246225691 seconds time elapsed ( +- 1.31% )
> +NUMA no slow start: 0.252620063 seconds time elapsed ( +- 1.13% )
>
> +NUMA 1sec delay: 0.248076230 seconds time elapsed ( +- 1.35% )
>
> The implementation is simple and straightforward, most of the patch
> deals with adding the /proc/sys/kernel/balance_numa_scan_delay_ms tunable
> knob.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Rik van Riel <[email protected]>
> [ Wrote the changelog, ran measurements, tuned the default. ]
> Signed-off-by: Ingo Molnar <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-07 09:25:24

by Zhouping Liu

[permalink] [raw]
Subject: Re: [RFC PATCH 00/19] Foundation for automatic NUMA balancing

On 11/06/2012 05:14 PM, Mel Gorman wrote:
> There are currently two competing approaches to implement support for
> automatically migrating pages to optimise NUMA locality. Performance results
> are available for both but review highlighted different problems in both.
> They are not compatible with each other even though some fundamental
> mechanics should have been the same.
>
> For example, schednuma implements many of its optimisations before the code
> that benefits most from these optimisations are introduced obscuring what the
> cost of schednuma might be and if the optimisations can be used elsewhere
> independant of the series. It also effectively hard-codes PROT_NONE to be
> the hinting fault even though it should be an achitecture-specific decision.
> On the other hand, it is well integrated and implements all its work in the
> context of the process that benefits from the migration.
>
> autonuma goes straight to kernel threads for marking PTEs pte_numa to
> capture the necessary statistics it depends on. This obscures the cost of
> autonuma in a manner that is difficult to measure and hard to retro-fit
> to put in the context of the process. Some of these costs are in paths the
> scheduler folk traditionally are very wary of making heavier, particularly
> if that cost is difficult to measure. On the other hand, performance
> tests indicate it is the best perfoming solution.
>
> As the patch sets do not share any code, it is difficult to incrementally
> develop one to take advantage of the strengths of the other. Many of the
> patches would be code churn that is annoying to review and fairly measuring
> the results would be problematic.
>
> This series addresses part of the integration and sharing problem by
> implementing a foundation that either the policy for schednuma or autonuma
> can be rebased on. The actual policy it implements is a very stupid
> greedy policy called "Migrate On Reference Of pte_numa Node (MORON)".
> While stupid, it can be faster than the vanilla kernel and the expectation
> is that any clever policy should be able to beat MORON. The advantage is
> that it still defines how the policy needs to hook into the core code --
> scheduler and mempolicy mostly so many optimisations (such as native THP
> migration) can be shared between different policy implementations.
>
> This series steals very heavily from both autonuma and schednuma with very
> little original code. In some cases I removed the signed-off-bys because
> the result was too different. I have noted in the changelog where this
> happened but the signed-offs can be restored if the original authors agree.
>
> Patches 1-3 move some vmstat counters so that migrated pages get accounted
> for. In the past the primary user of migration was compaction but
> if pages are to migrate for NUMA optimisation then the counters
> need to be generally useful.
>
> Patch 4 defines an arch-specific PTE bit called _PAGE_NUMA that is used
> to trigger faults later in the series. A placement policy is expected
> to use these faults to determine if a page should migrate. On x86,
> the bit is the same as _PAGE_PROTNONE but other architectures
> may differ.
>
> Patch 5-7 defines pte_numa, pmd_numa, pte_mknuma, pte_mknonuma and
> friends. It implements them for x86, handles GUP and preserves
> the _PAGE_NUMA bit across THP splits.
>
> Patch 8 creates the fault handler for p[te|md]_numa PTEs and just clears
> them again.
>
> Patches 9-11 add a migrate-on-fault mode that applications can specifically
> ask for. Applications can take advantage of this if they wish. It
> also means that if automatic balancing was broken for some workload,
> the application could disable the automatic stuff but still
> get some advantage.
>
> Patch 12 adds migrate_misplaced_page which is responsible for migrating
> a page to a new location.
>
> Patch 13 migrates the page on fault if mpol_misplaced() says to do so.
>
> Patch 14 adds a MPOL_MF_LAZY mempolicy that an interested application can use.
> On the next reference the memory should be migrated to the node that
> references the memory.
>
> Patch 15 sets pte_numa within the context of the scheduler.
>
> Patch 16 adds some vmstats that can be used to approximate the cost of the
> scheduling policy in a more fine-grained fashion than looking at
> the system CPU usage.
>
> Patch 17 implements the MORON policy.
>
> Patches 18-19 note that the marking of pte_numa has a number of disadvantages and
> instead incrementally updates a limited range of the address space
> each tick.
>
> The obvious next step is to rebase a proper placement policy on top of this
> foundation and compare it to MORON (or any other placement policy). It
> should be possible to share optimisations between different policies to
> allow meaningful comparisons.
>
> For now, I am going to compare this patchset with the most recent posting
> of schednuma and autonuma just to get a feeling for where it stands. I
> only ran the autonuma benchmark and specjbb tests.
>
> The baseline kernel has stat patches 1-3 applied.

Hello Mel,

my 2-node machine hit a panic after applying the patch set (based
on kernel-3.7.0-rc4), please review it:

.....
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.7.0-rc4+
root=UUID=a557cd78-962e-48a2-b606-c77b3d8d22dd console=ttyS0,115200
console=tty0 ro rd.md=0 rd.lvm=0 rd.dm=0 rd.luks=0 init 3 debug
earlyprintk=ttyS0,115200 LANG=en_US.UTF-8
[ 0.000000] PID hash table entries: 4096 (order: 3, 32768 bytes)
[ 0.000000] __ex_table already sorted, skipping sort
[ 0.000000] Checking aperture...
[ 0.000000] No AGP bridge found
[ 0.000000] Memory: 8102020k/10485760k available (6112k kernel code,
2108912k absent, 274828k reserved, 3823k data, 1176k init)
[ 0.000000] ------------[ cut here ]------------
[ 0.000000] kernel BUG at mm/mempolicy.c:1785!
[ 0.000000] invalid opcode: 0000 [#1] SMP
[ 0.000000] Modules linked in:
[ 0.000000] CPU 0
[ 0.000000] Pid: 0, comm: swapper Not tainted 3.7.0-rc4+ #9 IBM IBM
System x3400 M3 Server -[7379I08]-/69Y4356
[ 0.000000] RIP: 0010:[<ffffffff81175b0e>] [<ffffffff81175b0e>]
policy_zonelist+0x1e/0xa0
[ 0.000000] RSP: 0000:ffffffff818afe68 EFLAGS: 00010093
[ 0.000000] RAX: 0000000000000000 RBX: ffffffff81cbfe00 RCX:
000000000000049d
[ 0.000000] RDX: 0000000000000000 RSI: ffffffff81cbfe00 RDI:
0000000000008000
[ 0.000000] RBP: ffffffff818afe78 R08: 203a79726f6d654d R09:
0000000000000179
[ 0.000000] R10: 303138203a79726f R11: 30312f6b30323032 R12:
0000000000008000
[ 0.000000] R13: 0000000000000000 R14: ffffffff818c1420 R15:
ffffffff818c1420
[ 0.000000] FS: 0000000000000000(0000) GS:ffff88017bc00000(0000)
knlGS:0000000000000000
[ 0.000000] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 0.000000] CR2: 0000000000000000 CR3: 00000000018b9000 CR4:
00000000000006b0
[ 0.000000] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 0.000000] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7:
0000000000000400
[ 0.000000] Process swapper (pid: 0, threadinfo ffffffff818ae000,
task ffffffff818c1420)
[ 0.000000] Stack:
[ 0.000000] ffff88027ffbe8c0 ffffffff81cbfe00 ffffffff818afec8
ffffffff81176966
[ 0.000000] 0000000000000000 0000000000000030 ffffffff818afef8
0000000000000100
[ 0.000000] ffffffff81a12000 0000000000000000 ffff88027ffbe8c0
000000007b5d69a0
[ 0.000000] Call Trace:
[ 0.000000] [<ffffffff81176966>] alloc_pages_current+0xa6/0x170
[ 0.000000] [<ffffffff81137a44>] __get_free_pages+0x14/0x50
[ 0.000000] [<ffffffff819efd9b>] kmem_cache_init+0x53/0x2d2
[ 0.000000] [<ffffffff819caa53>] start_kernel+0x1e0/0x3c7
[ 0.000000] [<ffffffff819ca672>] ? repair_env_string+0x5e/0x5e
[ 0.000000] [<ffffffff819ca356>] x86_64_start_reservations+0x131/0x135
[ 0.000000] [<ffffffff819ca45a>] x86_64_start_kernel+0x100/0x10f
[ 0.000000] Code: e4 17 00 48 89 e5 5d c3 0f 1f 44 00 00 e8 cb e2 47
00 55 48 89 e5 53 48 83 ec 08 0f b7 46 04 66 83 f8 01 74 08 66 83 f8 02
74 42 <0f> 0b 89 fb 81 e3 00 00 04 00 f6 46 06 02 75 04 0f bf 56 08 31
[ 0.000000] RIP [<ffffffff81175b0e>] policy_zonelist+0x1e/0xa0
[ 0.000000] RSP <ffffffff818afe68>
[ 0.000000] ---[ end trace ce62cfec816bb3fe ]---
[ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
......

the config file is attached
and no such issue was found in mainline; please let me know if you need
further info.

Thanks,
Zhouping


Attachments:
config_mel (108.68 kB)

2012-11-07 10:38:47

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 08/19] mm: numa: Create basic numa page hinting infrastructure

On Tue, Nov 06, 2012 at 01:58:26PM -0500, Rik van Riel wrote:
> On 11/06/2012 04:14 AM, Mel Gorman wrote:
> >Note: This patch started as "mm/mpol: Create special PROT_NONE
> > infrastructure" and preserves the basic idea but steals *very*
> > heavily from "autonuma: numa hinting page faults entry points" for
> > the actual fault handlers without the migration parts. The end
> > result is barely recognisable as either patch so all Signed-off
> > and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
> > this version, I will re-add the signed-offs-by to reflect the history.
> >
> >In order to facilitate a lazy -- fault driven -- migration of pages, create
> >a special transient PAGE_NUMA variant, we can then use the 'spurious'
> >protection faults to drive our migrations from.
> >
> >Pages that already had an effective PROT_NONE mapping will not be detected
>
> The patch itself is good, but the changelog needs a little
> fix. While you are defining _PAGE_NUMA to _PAGE_PROTNONE on
> x86, this may be different on other architectures.
>
> Therefore, the changelog should refer to PAGE_NUMA, not
> PROT_NONE.
>

Fair point. I still want to record the point that PROT_NONE will not
generate the faults though. How about this?

In order to facilitate a lazy -- fault driven -- migration of pages, create
a special transient PAGE_NUMA variant, we can then use the 'spurious'
protection faults to drive our migrations from.

The meaning of PAGE_NUMA depends on the architecture but on x86 it is
effectively PROT_NONE. In this case, PROT_NONE mappings will not be detected
to generate these 'spurious' faults for the simple reason that we cannot
distinguish them on their protection bits, see pte_numa(). This isn't
a problem since PROT_NONE (and possible PROT_WRITE with dirty tracking)
aren't used or are rare enough for us to not care about their placement.

> >to generate these 'spurious' faults for the simple reason that we cannot
> >distinguish them on their protection bits, see pte_numa(). This isn't
> >a problem since PROT_NONE (and possible PROT_WRITE with dirty tracking)
> >aren't used or are rare enough for us to not care about their placement.
> >
> >Signed-off-by: Mel Gorman <[email protected]>
>
> Other than the changelog ...
>
> Reviewed-by: Rik van Riel <[email protected]>

Thanks.

--
Mel Gorman
SUSE Labs

2012-11-07 10:46:11

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 08/19] mm: numa: Create basic numa page hinting infrastructure

On 11/07/2012 05:38 AM, Mel Gorman wrote:
> On Tue, Nov 06, 2012 at 01:58:26PM -0500, Rik van Riel wrote:
>> On 11/06/2012 04:14 AM, Mel Gorman wrote:
>>> Note: This patch started as "mm/mpol: Create special PROT_NONE
>>> infrastructure" and preserves the basic idea but steals *very*
>>> heavily from "autonuma: numa hinting page faults entry points" for
>>> the actual fault handlers without the migration parts. The end
>>> result is barely recognisable as either patch so all Signed-off
>>> and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
>>> this version, I will re-add the signed-offs-by to reflect the history.
>>>
>>> In order to facilitate a lazy -- fault driven -- migration of pages, create
>>> a special transient PAGE_NUMA variant, we can then use the 'spurious'
>>> protection faults to drive our migrations from.
>>>
>>> Pages that already had an effective PROT_NONE mapping will not be detected
>>
>> The patch itself is good, but the changelog needs a little
>> fix. While you are defining _PAGE_NUMA to _PAGE_PROTNONE on
>> x86, this may be different on other architectures.
>>
>> Therefore, the changelog should refer to PAGE_NUMA, not
>> PROT_NONE.
>>
>
> Fair point. I still want to record the point that PROT_NONE will not
> generate the faults though. How about this?
>
> In order to facilitate a lazy -- fault driven -- migration of pages, create
> a special transient PAGE_NUMA variant, we can then use the 'spurious'
> protection faults to drive our migrations from.
>
> The meaning of PAGE_NUMA depends on the architecture but on x86 it is
> effectively PROT_NONE. In this case, PROT_NONE mappings will not be detected
> to generate these 'spurious' faults for the simple reason that we cannot
> distinguish them on their protection bits, see pte_numa(). This isn't
> a problem since PROT_NONE (and possible PROT_WRITE with dirty tracking)
> aren't used or are rare enough for us to not care about their placement.

Actual PROT_NONE mappings will not generate these NUMA faults
for the reason that the page fault code checks the permission
on the VMA (and will throw a segmentation fault on actual
PROT_NONE mappings), before it ever calls handle_mm_fault.
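
For context, the check Rik refers to is roughly the one below, modelled on
x86's access_error() in arch/x86/mm/fault.c and heavily simplified. A true
PROT_NONE VMA has none of VM_READ, VM_WRITE or VM_EXEC set, so the fault is
turned into SIGSEGV before handle_mm_fault() is ever reached:

/* Sketch: reject the access if the VMA's permissions do not allow it. */
static int access_error_sketch(unsigned long error_code,
			       struct vm_area_struct *vma)
{
	if (error_code & PF_WRITE)
		return !(vma->vm_flags & VM_WRITE);

	/* read or instruction fetch on a VMA with no access rights at all */
	return !(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE));
}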

2012-11-07 10:49:47

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 15/19] mm: numa: Add fault driven placement and migration

On Tue, Nov 06, 2012 at 02:41:13PM -0500, Rik van Riel wrote:
> On 11/06/2012 04:14 AM, Mel Gorman wrote:
> >From: Peter Zijlstra <[email protected]>
> >
> >NOTE: This patch is based on "sched, numa, mm: Add fault driven
> > placement and migration policy" but as it throws away all the policy
> > to just leave a basic foundation I had to drop the signed-offs-by.
> >
> >This patch creates a bare-bones method for setting PTEs pte_numa in the
> >context of the scheduler that when faulted later will be faulted onto the
> >node the CPU is running on. In itself this does nothing useful but any
> >placement policy will fundamentally depend on receiving hints on placement
> >from fault context and doing something intelligent about it.
> >
> >Signed-off-by: Mel Gorman <[email protected]>
>
> Excellent basis for implementing a smarter NUMA
> policy.
>
> Not sure if such a policy should be implemented
> as a replacement for this patch, or on top of it...
>

I'm expecting on top of it. As a POC, I'm looking at implementing the CPU
Follows Memory algorithm (mostly from autonuma) on top of this but using the
home-node logic from schednuma to handle how processes get scheduled. MORON
will need to relax to take the home node into account to avoid fighting
the home-node decisions. task_numa_fault() determines if the home node
needs to change based on statistics it gathers from faults. So far I am
keeping within the framework but it is still a WIP.

> Either way, thank you for cleaning up all of the
> NUMA base code, while I was away at conferences
> and stuck in airports :)
>

My pleasure. Thanks a lot for reviewing this!

--
Mel Gorman
SUSE Labs

2012-11-07 10:57:47

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 16/19] mm: numa: Add pte updates, hinting and migration stats

On Tue, Nov 06, 2012 at 02:55:06PM -0500, Rik van Riel wrote:
> On 11/06/2012 04:14 AM, Mel Gorman wrote:
> >It is tricky to quantify the basic cost of automatic NUMA placement in a
> >meaningful manner. This patch adds some vmstats that can be used as part
> >of a basic costing model.
> >
> >u = basic unit = sizeof(void *)
> >Ca = cost of struct page access = sizeof(struct page) / u
> >Cpte = Cost PTE access = Ca
> >Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock)
> > where Cpte is incurred twice for a read and a write and Wlock
> > is a constant representing the cost of taking or releasing a
> > lock
> >Cnumahint = Cost of a minor page fault = some high constant e.g. 1000
> >Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u
> >Ci = Cost of page isolation = Ca + Wi
> > where Wi is a constant that should reflect the approximate cost
> > of the locking operation
> >Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma)
> > where Wnuma is the approximate NUMA factor. 1 is local. 1.2
> > would imply that remote accesses are 20% more expensive
> >
> >Balancing cost = Cpte * numa_pte_updates +
> > Cnumahint * numa_hint_faults +
> > Ci * numa_pages_migrated +
> > Cpagecopy * numa_pages_migrated
> >
> >Note that numa_pages_migrated is used as a measure of how many pages
> >were isolated even though it would miss pages that failed to migrate. A
> >vmstat counter could have been added for it but the isolation cost is
> >pretty marginal in comparison to the overall cost so it seemed overkill.
> >
> >The ideal way to measure automatic placement benefit would be to count
> >the number of remote accesses versus local accesses and do something like
> >
> > benefit = (remote_accesses_before - remote_accesses_after) * Wnuma
> >
> >but the information is not readily available. As a workload converges, the
> >expectation would be that the number of remote numa hints would reduce to 0.
> >
> > convergence = numa_hint_faults_local / numa_hint_faults
> > where this is measured for the last N number of
> > numa hints recorded. When the workload is fully
> > converged the value is 1.
> >
> >This can measure if the placement policy is converging and how fast it is
> >doing it.
> >
> >Signed-off-by: Mel Gorman <[email protected]>
>
> I'm skipping the ACKing of the policy patches, which
> appear to be meant to be placeholders for a "real"
> policy.

I do expect the MORON policy to disappear or at least change so much it
is not recognisable.

> However, you have a few more mechanism patches
> left in the series, which would be required regardless
> of what policy gets merged, so ...
>

Initially, I had the slow WSS sampling at the end because superficially
it could be considered an optimisation and I wanted to avoid sneaking
optimisations in. On reflection, the slow WSS sampling is pretty fundamental
and I've moved it earlier in the series like so;

mm: mempolicy: Add MPOL_MF_LAZY
mm: mempolicy: Use _PAGE_NUMA to migrate pages
mm: numa: Add fault driven placement and migration
mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate
mm: sched: numa: Implement slow start for working set sampling
mm: numa: Add pte updates, hinting and migration stats
mm: numa: Migrate on reference policy

--
Mel Gorman
SUSE Labs

2012-11-07 11:00:15

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 08/19] mm: numa: Create basic numa page hinting infrastructure

On Wed, Nov 07, 2012 at 05:48:30AM -0500, Rik van Riel wrote:
> On 11/07/2012 05:38 AM, Mel Gorman wrote:
> >On Tue, Nov 06, 2012 at 01:58:26PM -0500, Rik van Riel wrote:
> >>On 11/06/2012 04:14 AM, Mel Gorman wrote:
> >>>Note: This patch started as "mm/mpol: Create special PROT_NONE
> >>> infrastructure" and preserves the basic idea but steals *very*
> >>> heavily from "autonuma: numa hinting page faults entry points" for
> >>> the actual fault handlers without the migration parts. The end
> >>> result is barely recognisable as either patch so all Signed-off
> >>> and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
> >>> this version, I will re-add the signed-offs-by to reflect the history.
> >>>
> >>>In order to facilitate a lazy -- fault driven -- migration of pages, create
> >>>a special transient PAGE_NUMA variant, we can then use the 'spurious'
> >>>protection faults to drive our migrations from.
> >>>
> >>>Pages that already had an effective PROT_NONE mapping will not be detected
> >>
> >>The patch itself is good, but the changelog needs a little
> >>fix. While you are defining _PAGE_NUMA to _PAGE_PROTNONE on
> >>x86, this may be different on other architectures.
> >>
> >>Therefore, the changelog should refer to PAGE_NUMA, not
> >>PROT_NONE.
> >>
> >
> >Fair point. I still want to record the point that PROT_NONE will not
> >generate the faults though. How about this?
> >
> > In order to facilitate a lazy -- fault driven -- migration of pages, create
> > a special transient PAGE_NUMA variant, we can then use the 'spurious'
> > protection faults to drive our migrations from.
> >
> > The meaning of PAGE_NUMA depends on the architecture but on x86 it is
> > effectively PROT_NONE. In this case, PROT_NONE mappings will not be detected
> > to generate these 'spurious' faults for the simple reason that we cannot
> > distinguish them on their protection bits, see pte_numa(). This isn't
> > a problem since PROT_NONE (and possible PROT_WRITE with dirty tracking)
> > aren't used or are rare enough for us to not care about their placement.
>
> Actual PROT_NONE mappings will not generate these NUMA faults
> for the reason that the page fault code checks the permission
> on the VMA (and will throw a segmentation fault on actual
> PROT_NONE mappings), before it ever calls handle_mm_fault.
>

Updated. Thanks.

--
Mel Gorman
SUSE Labs

2012-11-07 11:43:56

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 15/19] mm: numa: Add fault driven placement and migration

On 11/07/2012 05:49 AM, Mel Gorman wrote:
> On Tue, Nov 06, 2012 at 02:41:13PM -0500, Rik van Riel wrote:
>> On 11/06/2012 04:14 AM, Mel Gorman wrote:

>>> Signed-off-by: Mel Gorman <[email protected]>
>>
>> Excellent basis for implementing a smarter NUMA
>> policy.
>>
>> Not sure if such a policy should be implemented
>> as a replacement for this patch, or on top of it...
>>
>
> I'm expecting on top of it.

In that case:

Acked-by: Rik van Riel <[email protected]>

2012-11-07 11:45:06

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 16/19] mm: numa: Add pte updates, hinting and migration stats

On 11/07/2012 05:57 AM, Mel Gorman wrote:
> On Tue, Nov 06, 2012 at 02:55:06PM -0500, Rik van Riel wrote:
>> On 11/06/2012 04:14 AM, Mel Gorman wrote:

>>> Signed-off-by: Mel Gorman <[email protected]>
>>
>> I'm skipping the ACKing of the policy patches, which
>> appear to be meant to be placeholders for a "real"
>> policy.
>
> I do expect the MORON policy to disappear or at least change so much it
> is not recognisable.

On the other hand, maybe it would be better to get
things at least into -mm, so the policy can be built
on top?

Just in case...

Acked-by: Rik van Riel <[email protected]>

2012-11-07 11:54:30

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 17/19] mm: numa: Migrate on reference policy

On 11/06/2012 04:14 AM, Mel Gorman wrote:
> This is the dumbest possible policy that still does something of note.
> When a pte_numa is faulted, it is moved immediately. Any replacement
> policy must at least do better than this and in all likelihood this
> policy regresses normal workloads.
>
> Signed-off-by: Mel Gorman <[email protected]>

I expect this code to be replaced with a smarter policy.
However, it may be appropriate to merge this into -mm,
and then have the smarter policy implemented on top.

In case we go that route ...

Acked-by: Rik van Riel <[email protected]>

2012-11-07 12:32:24

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 13/19] mm: mempolicy: Use _PAGE_NUMA to migrate pages

On Tue, Nov 06, 2012 at 02:18:18PM -0500, Rik van Riel wrote:
> On 11/06/2012 04:14 AM, Mel Gorman wrote:
> >Note: Based on "mm/mpol: Use special PROT_NONE to migrate pages" but
> > sufficiently different that the signed-off-bys were dropped
> >
> >Combine our previous _PAGE_NUMA, mpol_misplaced and migrate_misplaced_page()
> >pieces into an effective migrate on fault scheme.
> >
> >Note that (on x86) we rely on PROT_NONE pages being !present and avoid
> >the TLB flush from try_to_unmap(TTU_MIGRATION). This greatly improves the
> >page-migration performance.
> >
> >Based-on-work-by: Peter Zijlstra <[email protected]>
> >Signed-off-by: Mel Gorman <[email protected]>
>
>
> > page = vm_normal_page(vma, addr, pte);
> > BUG_ON(!page);
> >+
> >+ get_page(page);
> >+ current_nid = page_to_nid(page);
> >+ target_nid = mpol_misplaced(page, vma, addr);
> >+ if (target_nid == -1)
> >+ goto clear_pmdnuma;
> >+
> >+ pte_unmap_unlock(ptep, ptl);
> >+ migrate_misplaced_page(page, target_nid);
> >+ page = NULL;
> >+
> >+ ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
> >+ if (!pte_same(*ptep, pte))
> >+ goto out_unlock;
> >+
>
> I see you tried to avoid the extraneous TLB flush
> from inside migrate_misplaced_page.

Yeah, I leave the pte_numa in place until after the migration to avoid it.

> However,
> try_to_unmap_one calls ptep_clear_flush, which will
> currently still result in a remote TLB flush for
> a _PAGE_NUMA pte, despite the pte not being
> accessible for memory accesses (_PAGE_PRESENT not set).
>

Well spotted, I'll fix it up.

> Furthermore, if migrate_misplaced_page moved the page,
> the !pte_same check will return false, and you will
> get a double fault.
>

Yes, you're right. autonuma avoids this problem by clearing _PAGE_NUMA
before the migration happens but then it will incur the TLB flush
overhead.

> I wonder if migrate_misplaced_page should return a
> struct page* or a pfn, so we can compute what "pte"
> _should_ be, corrected for the new pfn, feed that
> value to pte_same, and then avoid the double fault?
>

I think I can do that without reaching too far into migrate.c by abusing
the migration callback handler to return the location of the new page.
I'll see what I can do.

> Also, we may want the change for ptep_clear_flush
> that avoids flushing remote TLBs for a pte without
> the _PAGE_PRESENT bit set.
>

Maybe but initially I'll limit it to try_to_unmap_one.

Thanks!

--
Mel Gorman
SUSE Labs
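
The try_to_unmap_one() tweak Mel alludes to could be as small as the sketch
below. It is illustrative only: a pte_numa pte has _PAGE_PRESENT clear, so no
CPU can hold a TLB entry for it and the remote flush can be skipped.

/* Sketch: clear the pte without an IPI when it was already not present. */
if (pte_numa(*ptep))
	pteval = ptep_get_and_clear(mm, address, ptep);
else
	pteval = ptep_clear_flush(vma, address, ptep);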

2012-11-07 15:26:06

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 00/19] Foundation for automatic NUMA balancing

On Wed, Nov 07, 2012 at 05:27:12PM +0800, Zhouping Liu wrote:
>
> Hello Mel,
>
> my 2 nodes machine hit a panic fault after applied the patch
> set(based on kernel-3.7.0-rc4), please review it:
>
> <SNIP>

Early initialisation problem by the looks of things. Try this please

---8<---
mm: numa: Check that preferred_node_policy is initialised

Zhouping Liu reported the following

[ 0.000000] ------------[ cut here ]------------
[ 0.000000] kernel BUG at mm/mempolicy.c:1785!
[ 0.000000] invalid opcode: 0000 [#1] SMP
[ 0.000000] Modules linked in:
[ 0.000000] CPU 0
....
[ 0.000000] Call Trace:
[ 0.000000] [<ffffffff81176966>] alloc_pages_current+0xa6/0x170
[ 0.000000] [<ffffffff81137a44>] __get_free_pages+0x14/0x50
[ 0.000000] [<ffffffff819efd9b>] kmem_cache_init+0x53/0x2d2
[ 0.000000] [<ffffffff819caa53>] start_kernel+0x1e0/0x3c7

The problem is that preferred_node_policy is not initialised early in boot
and SLUB initialisation trips up on it. Check that it is initialised.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/mempolicy.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 11d4b6b..8cfa6dc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -129,6 +129,10 @@ static struct mempolicy *get_task_policy(struct task_struct *p)
node = numa_node_id();
if (node != -1)
pol = &preferred_node_policy[node];
+
+ /* preferred_node_policy is not initialised early in boot */
+ if (!pol->mode)
+ pol = NULL;
}

return pol;

2012-11-08 06:35:33

by Zhouping Liu

[permalink] [raw]
Subject: Re: [RFC PATCH 00/19] Foundation for automatic NUMA balancing

On 11/07/2012 11:25 PM, Mel Gorman wrote:
> On Wed, Nov 07, 2012 at 05:27:12PM +0800, Zhouping Liu wrote:
>> Hello Mel,
>>
>> my 2 nodes machine hit a panic fault after applied the patch
>> set(based on kernel-3.7.0-rc4), please review it:
>>
>> <SNIP>
> Early initialisation problem by the looks of things. Try this please

Tested the patch, and the issue is gone.

>
> ---8<---
> mm: numa: Check that preferred_node_policy is initialised
>
> Zhouping Liu reported the following
>
> [ 0.000000] ------------[ cut here ]------------
> [ 0.000000] kernel BUG at mm/mempolicy.c:1785!
> [ 0.000000] invalid opcode: 0000 [#1] SMP
> [ 0.000000] Modules linked in:
> [ 0.000000] CPU 0
> ....
> [ 0.000000] Call Trace:
> [ 0.000000] [<ffffffff81176966>] alloc_pages_current+0xa6/0x170
> [ 0.000000] [<ffffffff81137a44>] __get_free_pages+0x14/0x50
> [ 0.000000] [<ffffffff819efd9b>] kmem_cache_init+0x53/0x2d2
> [ 0.000000] [<ffffffff819caa53>] start_kernel+0x1e0/0x3c7
>
> The problem is that preferred_node_policy is not initialised early in boot
> and SLUB initialisation trips up on it. Check that it is initialised.
>
> Signed-off-by: Mel Gorman <[email protected]>

Tested-by: Zhouping Liu <[email protected]>

Thanks,
Zhouping

> ---
> mm/mempolicy.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 11d4b6b..8cfa6dc 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -129,6 +129,10 @@ static struct mempolicy *get_task_policy(struct task_struct *p)
> node = numa_node_id();
> if (node != -1)
> pol = &preferred_node_policy[node];
> +
> + /* preferred_node_policy is not initialised early in boot */
> + if (!pol->mode)
> + pol = NULL;
> }
>
> return pol;
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2012-11-08 07:01:32

by Zhouping Liu

[permalink] [raw]
Subject: Re: [RFC PATCH 00/19] Foundation for automatic NUMA balancing

On 11/08/2012 02:39 PM, richard wrote:
> Hi all:
> I have run into a problem:
> 1. on an Intel Xeon E5000 family CPU, which supports xapic, one NIC
> irq can be shared across the CPUs based on smp_affinity.
> 2. but on an Intel Xeon E5-2600 family CPU, which supports x2apic, one
> NIC irq stays on CPU0 no matter what I set smp_affinity to ("aa", "55",
> "ff").
> My OS is CentOS 6.2 x32 and I tested 4 cpus; the result is that CPUs which
> only support apic can share one irq across all cpus, while those which
> support x2apic deliver the irq to one cpu only.

richard, I'm not sure whether your problem is caused by the
patch set or not;
if it's not related to the patches, you should report it under a *new* subject.

Thanks,
Zhouping

>
>
> I would appreciate any help with this.
>
> richard
>
>
> 2012/11/8 Zhouping Liu <[email protected]>
>
>> On 11/07/2012 11:25 PM, Mel Gorman wrote:
>>
>>> On Wed, Nov 07, 2012 at 05:27:12PM +0800, Zhouping Liu wrote:
>>>
>>>> Hello Mel,
>>>>
>>>> my 2 nodes machine hit a panic fault after applied the patch
>>>> set(based on kernel-3.7.0-rc4), please review it:
>>>>
>>>> <SNIP>
>>>>
>>> Early initialisation problem by the looks of things. Try this please
>>>
>> Tested the patch, and the issue is gone.
>>
>>
>>> ---8<---
>>> mm: numa: Check that preferred_node_policy is initialised
>>>
>>> Zhouping Liu reported the following
>>>
>>> [ 0.000000] ------------[ cut here ]------------
>>> [ 0.000000] kernel BUG at mm/mempolicy.c:1785!
>>> [ 0.000000] invalid opcode: 0000 [#1] SMP
>>> [ 0.000000] Modules linked in:
>>> [ 0.000000] CPU 0
>>> ....
>>> [ 0.000000] Call Trace:
>>> [ 0.000000] [<ffffffff81176966>] alloc_pages_current+0xa6/0x170
>>> [ 0.000000] [<ffffffff81137a44>] __get_free_pages+0x14/0x50
>>> [ 0.000000] [<ffffffff819efd9b>] kmem_cache_init+0x53/0x2d2
>>> [ 0.000000] [<ffffffff819caa53>] start_kernel+0x1e0/0x3c7
>>>
>>> The problem is that preferred_node_policy is not initialised early in boot
>>> and SLUB initialisation trips up on it. Check that it is initialised.
>>>
>>> Signed-off-by: Mel Gorman <[email protected]>
>>>
>> Tested-by: Zhouping Liu <[email protected]>
>>
>> Thanks,
>> Zhouping
>>
>> ---
>>> mm/mempolicy.c | 4 ++++
>>> 1 file changed, 4 insertions(+)
>>>
>>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>>> index 11d4b6b..8cfa6dc 100644
>>> --- a/mm/mempolicy.c
>>> +++ b/mm/mempolicy.c
>>> @@ -129,6 +129,10 @@ static struct mempolicy *get_task_policy(struct
>>> task_struct *p)
>>> node = numa_node_id();
>>> if (node != -1)
>>> pol = &preferred_node_policy[node];
>>> +
>>> + /* preferred_node_policy is not initialised early in boot
>>> */
>>> + if (!pol->mode)
>>> + pol = NULL;
>>> }
>>> return pol;
>>>
>>> --
>>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>>> the body to [email protected]. For more info on Linux MM,
>>> see: http://www.linux-mm.org/ .
>>> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>

2012-11-09 14:43:21

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC PATCH 00/19] Foundation for automatic NUMA balancing

Hi Mel,

On Tue, Nov 06, 2012 at 09:14:36AM +0000, Mel Gorman wrote:
> This series addresses part of the integration and sharing problem by
> implementing a foundation that either the policy for schednuma or autonuma
> can be rebased on. The actual policy it implements is a very stupid
> greedy policy called "Migrate On Reference Of pte_numa Node (MORON)".
> While stupid, it can be faster than the vanilla kernel and the expectation
> is that any clever policy should be able to beat MORON. The advantage is
> that it still defines how the policy needs to hook into the core code --
> scheduler and mempolicy mostly so many optimisations (such as native THP
> migration) can be shared between different policy implementations.

I haven't had much time to look into it yet, because I've been
attending KVM Forum the last few days, but this foundation looks ok
with me as a starting base and I ack it for merging it upstream. I'll
try to rebase on top of this and send you some patches.

> Patch 14 adds a MPOL_MF_LAZY mempolicy that an interested application can use.
> On the next reference the memory should be migrated to the node that
> references the memory.

This approach of starting with a stripped down foundation won't allow
for easy backportability anyway, so merging the userland API at the
first step shouldn't provide any benefit for the work that is ahead of
us. I would leave this for later rather than make it part of the foundation.

All we need is a failsafe runtime and boot time turn off knob, just in
case.

Thanks,
Andrea

2012-11-09 16:12:43

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 00/19] Foundation for automatic NUMA balancing

On Fri, Nov 09, 2012 at 03:42:57PM +0100, Andrea Arcangeli wrote:
> Hi Mel,
>
> On Tue, Nov 06, 2012 at 09:14:36AM +0000, Mel Gorman wrote:
> > This series addresses part of the integration and sharing problem by
> > implementing a foundation that either the policy for schednuma or autonuma
> > can be rebased on. The actual policy it implements is a very stupid
> > greedy policy called "Migrate On Reference Of pte_numa Node (MORON)".
> > While stupid, it can be faster than the vanilla kernel and the expectation
> > is that any clever policy should be able to beat MORON. The advantage is
> > that it still defines how the policy needs to hook into the core code --
> > scheduler and mempolicy mostly so many optimisations (such as native THP
> > migration) can be shared between different policy implementations.
>
> I haven't had much time to look into it yet, because I've been
> attending KVM Forum the last few days,

That's fine. I knew you were travelling and that there would be delay.

> but this foundation looks ok
> with me as a starting base and I ack it for merging it upstream. I'll
> try to rebase on top of this and send you some patches.
>

Thanks, that's great news! It's not quite ready for merging yet. I found
a few bugs in the foundation that I ironed out since and I would like to
have better figures for specjbb.

With that in mind I'm still in the process of implementing something like
cpu-follow-memory on top. I'll post it early next week, even if the figures
are crap, for the purposes of illustration and to get the existing fixes
out there. Even if you think the version of the cpu-follow implementation is
complete crap, you'll at least see what I thought the integration points
would look like and we'll come up with an alternative.

My hope is that we layer the smallest amount on top each iteration with
benchmark validation at each step until we get something approaching
autonuma or schednumas in terms of performance. Which one we use as the
performance target will depend on whether schednuma or autonuma was better
on that particular test. I'll be using mmtests on a 4-node machine each
step but obviously other testers would be very welcome.

As things stand right now I just finished a script to show where threads
are running and what their per-node memory usage is and it's showing that
specjbb threads are not converging at all. I'm not losing sleep over it
just yet as I would be incredibly surprised if I got this right first time
even with having schednuma and autonuma to look at :) .

> > Patch 14 adds a MPOL_MF_LAZY mempolicy that an interested application can use.
> > On the next reference the memory should be migrated to the node that
> > references the memory.
>
> This approach of starting with a stripped down foundation won't allow
> for easy backportability anyway, so merging the userland API at the
> first step shouldn't provide any benefit for the work that is ahead of
> us. I would leave this for later and not part of the foundation.
>

This needs a bit more consensus. I'm happy to drop the userspace API
until all this settles down but will initially try and keep the internal
mempolicy aspects. Initially I preserved the userspace API because
I understood Peter's logic that we should help application developers
as much as possible before depending entirely on the automatic approach
offered by both autonuma and schednuma.

Peter?

> All we need is a failsafe runtime and boot time turn off knob, just in
> case.

Yes, fully agreed. It's on the TODO list and I consider it a requirement
before it's merged. THP experience has told us that being able to turn
it off at runtime was very handy for debugging.

Thanks Andrea.

--
Mel Gorman
SUSE Labs

2012-11-13 09:36:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 12/19] mm: migrate: Introduce migrate_misplaced_page()


* Mel Gorman <[email protected]> wrote:

> From: Peter Zijlstra <[email protected]>
>
> Note: This was originally based on Peter's patch "mm/migrate: Introduce
> migrate_misplaced_page()" but borrows extremely heavily from Andrea's
> "autonuma: memory follows CPU algorithm and task/mm_autonuma stats
> collection". The end result is barely recognisable so signed-offs
> had to be dropped. If original authors are ok with it, I'll
> re-add the signed-off-bys.
>
> Add migrate_misplaced_page() which deals with migrating pages from
> faults.
>
> Based-on-work-by: Lee Schermerhorn <[email protected]>
> Based-on-work-by: Peter Zijlstra <[email protected]>
> Based-on-work-by: Andrea Arcangeli <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/migrate.h | 8 ++++
> mm/migrate.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 110 insertions(+), 2 deletions(-)

That's a nice patch - the TASK_NUMA_FAULT approach in the
original patch was not very elegant.

I've started testing it to see how well your version works.

Thanks,

Ingo

2012-11-13 09:54:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 05/19] mm: numa: pte_numa() and pmd_numa()


* Mel Gorman <[email protected]> wrote:

> From: Andrea Arcangeli <[email protected]>
>
> Implement pte_numa and pmd_numa.
>
> We must atomically set the numa bit and clear the present bit to
> define a pte_numa or pmd_numa.
>
> Once a pte or pmd has been set as pte_numa or pmd_numa, the next time
> a thread touches a virtual address in the corresponding virtual range,
> a NUMA hinting page fault will trigger. The NUMA hinting page fault
> will clear the NUMA bit and set the present bit again to resolve the
> page fault.
>
> The expectation is that a NUMA hinting page fault is used as part
> of a placement policy that decides if a page should remain on the
> current node or migrated to a different node.
>
> Acked-by: Rik van Riel <[email protected]>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> arch/x86/include/asm/pgtable.h | 65 ++++++++++++++++++++++++++++++++++++++--
> include/asm-generic/pgtable.h | 12 ++++++++
> 2 files changed, 75 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index a1f780d..e075d57 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -404,7 +404,8 @@ static inline int pte_same(pte_t a, pte_t b)
>
> static inline int pte_present(pte_t a)
> {
> - return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
> + return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
> + _PAGE_NUMA);
> }
>
> static inline int pte_hidden(pte_t pte)
> @@ -420,7 +421,63 @@ static inline int pmd_present(pmd_t pmd)
> * the _PAGE_PSE flag will remain set at all times while the
> * _PAGE_PRESENT bit is clear).
> */
> - return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
> + return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
> + _PAGE_NUMA);
> +}
> +
> +#ifdef CONFIG_BALANCE_NUMA
> +/*
> + * _PAGE_NUMA works identical to _PAGE_PROTNONE (it's actually the
> + * same bit too). It's set only when _PAGE_PRESENT is not set and it's
> + * never set if _PAGE_PRESENT is set.
> + *
> + * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
> + * fault triggers on those regions if pte/pmd_numa returns true
> + * (because _PAGE_PRESENT is not set).
> + */
> +static inline int pte_numa(pte_t pte)
> +{
> + return (pte_flags(pte) &
> + (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> +}
> +
> +static inline int pmd_numa(pmd_t pmd)
> +{
> + return (pmd_flags(pmd) &
> + (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
> +}
> +#endif
> +
> +/*
> + * pte/pmd_mknonnuma sets the _PAGE_ACCESSED bitflag automatically
> + * because they're called by the NUMA hinting minor page fault. If we
> + * wouldn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
> + * would be forced to set it later while filling the TLB after we
> + * return to userland. That would trigger a second write to memory
> + * that we optimize away by setting _PAGE_ACCESSED here.
> + */
> +static inline pte_t pte_mknonnuma(pte_t pte)
> +{
> + pte = pte_clear_flags(pte, _PAGE_NUMA);
> + return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
> +}
> +
> +static inline pmd_t pmd_mknonnuma(pmd_t pmd)
> +{
> + pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
> + return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
> +}
> +
> +static inline pte_t pte_mknuma(pte_t pte)
> +{
> + pte = pte_set_flags(pte, _PAGE_NUMA);
> + return pte_clear_flags(pte, _PAGE_PRESENT);
> +}
> +
> +static inline pmd_t pmd_mknuma(pmd_t pmd)
> +{
> + pmd = pmd_set_flags(pmd, _PAGE_NUMA);
> + return pmd_clear_flags(pmd, _PAGE_PRESENT);
> }
>
> static inline int pmd_none(pmd_t pmd)
> @@ -479,6 +536,10 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
>
> static inline int pmd_bad(pmd_t pmd)
> {
> +#ifdef CONFIG_BALANCE_NUMA
> + if (pmd_numa(pmd))
> + return 0;
> +#endif
> return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
> }
>
> diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
> index b36ce40..896667e 100644
> --- a/include/asm-generic/pgtable.h
> +++ b/include/asm-generic/pgtable.h
> @@ -554,6 +554,18 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
> #endif
> }
>
> +#ifndef CONFIG_BALANCE_NUMA
> +static inline int pte_numa(pte_t pte)
> +{
> + return 0;
> +}
> +
> +static inline int pmd_numa(pmd_t pmd)
> +{
> + return 0;
> +}
> +#endif /* CONFIG_BALANCE_NUMA */
> +

Hm, this overcomplicates things quite a bit and adds arch
specific code, and there's no explanation given for that
approach that I can see?

Basically, what's wrong with the generic approach that numa/core
has:

__weak bool pte_numa(struct vm_area_struct *vma, pte_t pte)

[see the full function below.]

Then we can reuse existing protection-changing functionality and
keep it all tidy.

an architecture that wants to do something special could
possibly override it in the future - but we want to keep the
generic logic in generic code.

Thanks,

Ingo

------------>
__weak bool pte_numa(struct vm_area_struct *vma, pte_t pte)
{
/*
* For NUMA page faults, we use PROT_NONE ptes in VMAs with
* "normal" vma->vm_page_prot protections. Genuine PROT_NONE
* VMAs should never get here, because the fault handling code
* will notice that the VMA has no read or write permissions.
*
* This means we cannot get 'special' PROT_NONE faults from genuine
* PROT_NONE maps, nor from PROT_WRITE file maps that do dirty
* tracking.
*
* Neither case is really interesting for our current use though so we
* don't care.
*/
if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
return false;

return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
}

2012-11-13 10:07:42

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 06/19] mm: numa: teach gup_fast about pmd_numa


* Mel Gorman <[email protected]> wrote:

> From: Andrea Arcangeli <[email protected]>
>
> When scanning pmds, the pmd may be of numa type (_PAGE_PRESENT not set),
> however the pte might be present. Therefore, gup_pmd_range() must return
> 0 in this case to avoid losing a NUMA hinting page fault during gup_fast.
>
> Note: gup_fast will skip over non present ptes (like numa
> types), so no explicit check is needed for the pte_numa case.
> [...]

So, why not fix all architectures that choose to expose
pte_numa() and pmd_numa() methods - via the patch below?

Thanks,

Ingo

----------------->
From db4aa58db59a2a296141c698be8b4535d0051ca1 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <[email protected]>
Date: Fri, 5 Oct 2012 21:36:27 +0200
Subject: [PATCH] numa, mm: Support NUMA hinting page faults from gup/gup_fast

Introduce FOLL_NUMA to tell follow_page to check
pte/pmd_numa. get_user_pages must use FOLL_NUMA, and it's safe to do
so because it always invokes handle_mm_fault and retries the
follow_page later.

KVM secondary MMU page faults will trigger the NUMA hinting page
faults through gup_fast -> get_user_pages -> follow_page ->
handle_mm_fault.

Other follow_page callers like KSM should not use FOLL_NUMA, or they
would fail to get the pages if they use follow_page instead of
get_user_pages.

[ This patch was picked up from the AutoNUMA tree. ]

Originally-by: Andrea Arcangeli <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Rik van Riel <[email protected]>
[ ported to this tree. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/mm.h | 1 +
mm/memory.c | 17 +++++++++++++++++
2 files changed, 18 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0025bf9..1821629 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1600,6 +1600,7 @@ struct page *follow_page(struct vm_area_struct *, unsigned long address,
#define FOLL_MLOCK 0x40 /* mark page as mlocked */
#define FOLL_SPLIT 0x80 /* don't return transhuge pages, split them */
#define FOLL_HWPOISON 0x100 /* check page is hwpoisoned */
+#define FOLL_NUMA 0x200 /* force NUMA hinting page fault */

typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
void *data);
diff --git a/mm/memory.c b/mm/memory.c
index e3e8ab2..a660fd0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1536,6 +1536,8 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
}
+ if ((flags & FOLL_NUMA) && pmd_numa(vma, *pmd))
+ goto no_page_table;
if (pmd_trans_huge(*pmd)) {
if (flags & FOLL_SPLIT) {
split_huge_page_pmd(mm, pmd);
@@ -1565,6 +1567,8 @@ split_fallthrough:
pte = *ptep;
if (!pte_present(pte))
goto no_page;
+ if ((flags & FOLL_NUMA) && pte_numa(vma, pte))
+ goto no_page;
if ((flags & FOLL_WRITE) && !pte_write(pte))
goto unlock;

@@ -1716,6 +1720,19 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
(VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
vm_flags &= (gup_flags & FOLL_FORCE) ?
(VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
+
+ /*
+ * If FOLL_FORCE and FOLL_NUMA are both set, handle_mm_fault
+ * would be called on PROT_NONE ranges. We must never invoke
+ * handle_mm_fault on PROT_NONE ranges or the NUMA hinting
+ * page faults would unprotect the PROT_NONE ranges if
+ * _PAGE_NUMA and _PAGE_PROTNONE are sharing the same pte/pmd
+ * bitflag. So to avoid that, don't set FOLL_NUMA if
+ * FOLL_FORCE is set.
+ */
+ if (!(gup_flags & FOLL_FORCE))
+ gup_flags |= FOLL_NUMA;
+
i = 0;

do {

2012-11-13 10:21:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 08/19] mm: numa: Create basic numa page hinting infrastructure


* Mel Gorman <[email protected]> wrote:

> Note: This patch started as "mm/mpol: Create special PROT_NONE
> infrastructure" and preserves the basic idea but steals *very*
> heavily from "autonuma: numa hinting page faults entry points" for
> the actual fault handlers without the migration parts. The end
> result is barely recognisable as either patch so all Signed-off
> and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
> this version, I will re-add the signed-offs-by to reflect the history.

Most of the changes you had to do here relate to the earlier
decision to turn all the NUMA protection fault demultiplexing
and setup code into a per arch facility.

On one hand I'm 100% fine with making the decision to *use* the
new NUMA code per arch and explicitly opt-in - we already have
such a Kconfig switch in our tree. The decision whether
to use any of this for an architecture must be considered and
tested carefully.

But given that most architectures will be just fine reusing the
already existing generic PROT_NONE machinery, the far better
approach is to do what we've been doing in generic kernel code
for the last 10 years: offer a default generic version, and then
to offer per arch hooks on a strict as-needed basis, if they
want or need to do something weird ...

So why fork away this logic into per arch code so early and
without explicit justification? It creates duplication artifacts
all around and makes porting to a new 'sane' architecture
harder.

Also, if there *are* per architecture concerns then I'd very
much like to see that argued very explicitly, on a per arch
basis, as it occurs, not obscured through thick "just in case"
layers of abstraction ...

Thanks,

Ingo

2012-11-13 10:26:08

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: mempolicy: Add MPOL_MF_LAZY


* Mel Gorman <[email protected]> wrote:

> From: Lee Schermerhorn <[email protected]>
>
> NOTE: Once again there is a lot of patch stealing and the end result
> is sufficiently different that I had to drop the signed-offs.
> Will re-add if the original authors are ok with that.
>
> This patch adds another mbind() flag to request "lazy migration". The
> flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
> pages are marked PROT_NONE. The pages will be migrated in the fault
> path on "first touch", if the policy dictates at that time.
>
> "Lazy Migration" will allow testing of migrate-on-fault via mbind().
> Also allows applications to specify that only subsequently touched
> pages be migrated to obey new policy, instead of all pages in range.
> This can be useful for multi-threaded applications working on a
> large shared data area that is initialized by an initial thread
> resulting in all pages on one [or a few, if overflowed] nodes.
> After PROT_NONE, the pages in regions assigned to the worker threads
> will be automatically migrated local to the threads on 1st touch.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> include/linux/mm.h | 3 +
> include/uapi/linux/mempolicy.h | 13 ++-
> mm/mempolicy.c | 176 ++++++++++++++++++++++++++++++++++++----
> 3 files changed, 174 insertions(+), 18 deletions(-)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fa06804..eed70f8 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1548,6 +1548,9 @@ static inline pgprot_t vm_get_page_prot(unsigned long vm_flags)
> }
> #endif
>
> +void change_prot_numa(struct vm_area_struct *vma,
> + unsigned long start, unsigned long end);
> +
> struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
> int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
> unsigned long pfn, unsigned long size, pgprot_t);
> diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
> index 472de8a..6a1baae 100644
> --- a/include/uapi/linux/mempolicy.h
> +++ b/include/uapi/linux/mempolicy.h
> @@ -49,9 +49,16 @@ enum mpol_rebind_step {
>
> /* Flags for mbind */
> #define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
> -#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */
> -#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */
> -#define MPOL_MF_INTERNAL (1<<3) /* Internal flags start here */
> +#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform
> + to policy */
> +#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to policy */
> +#define MPOL_MF_LAZY (1<<3) /* Modifies '_MOVE: lazy migrate on fault */
> +#define MPOL_MF_INTERNAL (1<<4) /* Internal flags start here */
> +
> +#define MPOL_MF_VALID (MPOL_MF_STRICT | \
> + MPOL_MF_MOVE | \
> + MPOL_MF_MOVE_ALL | \
> + MPOL_MF_LAZY)
>
> /*
> * Internal flags that share the struct mempolicy flags word with
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index df1466d..abe2e45 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -90,6 +90,7 @@
> #include <linux/syscalls.h>
> #include <linux/ctype.h>
> #include <linux/mm_inline.h>
> +#include <linux/mmu_notifier.h>
>
> #include <asm/tlbflush.h>
> #include <asm/uaccess.h>
> @@ -566,6 +567,136 @@ static inline int check_pgd_range(struct vm_area_struct *vma,
> }
>
> /*
> + * Here we search for not shared page mappings (mapcount == 1) and we
> + * set up the pmd/pte_numa on those mappings so the very next access
> + * will fire a NUMA hinting page fault.
> + */
> +static int
> +change_prot_numa_range(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address)
> +{
> + pgd_t *pgd;
> + pud_t *pud;
> + pmd_t *pmd;
> + pte_t *pte, *_pte;
> + struct page *page;
> + unsigned long _address, end;
> + spinlock_t *ptl;
> + int ret = 0;
> +
> + VM_BUG_ON(address & ~PAGE_MASK);
> +
> + pgd = pgd_offset(mm, address);
> + if (!pgd_present(*pgd))
> + goto out;
> +
> + pud = pud_offset(pgd, address);
> + if (!pud_present(*pud))
> + goto out;
> +
> + pmd = pmd_offset(pud, address);
> + if (pmd_none(*pmd))
> + goto out;
> +
> + if (pmd_trans_huge_lock(pmd, vma) == 1) {
> + int page_nid;
> + ret = HPAGE_PMD_NR;
> +
> + VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +
> + if (pmd_numa(*pmd)) {
> + spin_unlock(&mm->page_table_lock);
> + goto out;
> + }
> +
> + page = pmd_page(*pmd);
> +
> + /* only check non-shared pages */
> + if (page_mapcount(page) != 1) {
> + spin_unlock(&mm->page_table_lock);
> + goto out;
> + }
> +
> + page_nid = page_to_nid(page);
> +
> + if (pmd_numa(*pmd)) {
> + spin_unlock(&mm->page_table_lock);
> + goto out;
> + }
> +
> + set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
> + /* defer TLB flush to lower the overhead */
> + spin_unlock(&mm->page_table_lock);
> + goto out;
> + }
> +
> + if (pmd_trans_unstable(pmd))
> + goto out;
> + VM_BUG_ON(!pmd_present(*pmd));
> +
> + end = min(vma->vm_end, (address + PMD_SIZE) & PMD_MASK);
> + pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> + for (_address = address, _pte = pte; _address < end;
> + _pte++, _address += PAGE_SIZE) {
> + pte_t pteval = *_pte;
> + if (!pte_present(pteval))
> + continue;
> + if (pte_numa(pteval))
> + continue;
> + page = vm_normal_page(vma, _address, pteval);
> + if (unlikely(!page))
> + continue;
> + /* only check non-shared pages */
> + if (page_mapcount(page) != 1)
> + continue;
> +
> + if (pte_numa(pteval))
> + continue;
> +
> + set_pte_at(mm, _address, _pte, pte_mknuma(pteval));
> +
> + /* defer TLB flush to lower the overhead */
> + ret++;
> + }
> + pte_unmap_unlock(pte, ptl);
> +
> + if (ret && !pmd_numa(*pmd)) {
> + spin_lock(&mm->page_table_lock);
> + set_pmd_at(mm, address, pmd, pmd_mknuma(*pmd));
> + spin_unlock(&mm->page_table_lock);
> + /* defer TLB flush to lower the overhead */
> + }
> +
> +out:
> + return ret;
> +}
> +
> +/* Assumes mmap_sem is held */
> +void
> +change_prot_numa(struct vm_area_struct *vma,
> + unsigned long address, unsigned long end)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + int progress = 0;
> +
> + while (address < vma->vm_end) {
> + VM_BUG_ON(address < vma->vm_start ||
> + address + PAGE_SIZE > vma->vm_end);
> +
> + progress += change_prot_numa_range(mm, vma, address);
> + address = (address + PMD_SIZE) & PMD_MASK;
> + }
> +
> + /*
> + * Flush the TLB for the mm to start the NUMA hinting
> + * page faults after we finish scanning this vma part.
> + */
> + mmu_notifier_invalidate_range_start(vma->vm_mm, address, end);
> + flush_tlb_range(vma, address, end);
> + mmu_notifier_invalidate_range_end(vma->vm_mm, address, end);
> +}
> +

Here you are paying a heavy price for the earlier design
mistake, for forking into per arch approach - the NUMA version
of change_protection() had to be open-coded:

> include/linux/mm.h | 3 +
> include/uapi/linux/mempolicy.h | 13 ++-
> mm/mempolicy.c | 176 ++++++++++++++++++++++++++++++++++++----
> 3 files changed, 174 insertions(+), 18 deletions(-)

Compare it to the generic version that Peter used:

include/uapi/linux/mempolicy.h | 13 ++++++++---
mm/mempolicy.c | 49 +++++++++++++++++++++++++++---------------
2 files changed, 42 insertions(+), 20 deletions(-)

and the cleanliness and maintainability advantages are obvious.

So without some really good arguments in favor of your approach
NAK on that complex approach really.

Thanks,

Ingo

2012-11-13 10:45:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 15/19] mm: numa: Add fault driven placement and migration


* Mel Gorman <[email protected]> wrote:

> NOTE: This patch is based on "sched, numa, mm: Add fault driven
> placement and migration policy" but as it throws away
> all the policy to just leave a basic foundation I had to
> drop the signed-offs-by.

So, much of that has been updated meanwhile - but the split
makes fundamental sense - we considered it before.

One detail you did in this patch was the following rename:

s/EMBEDDED_NUMA/NUMA_VARIABLE_LOCALITY

> --- a/arch/sh/mm/Kconfig
> +++ b/arch/sh/mm/Kconfig
> @@ -111,6 +111,7 @@ config VSYSCALL
> config NUMA
> bool "Non Uniform Memory Access (NUMA) Support"
> depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
> + select NUMA_VARIABLE_LOCALITY
> default n
> help
> Some SH systems have many various memories scattered around
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>
..aaba45d 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -696,6 +696,20 @@ config LOG_BUF_SHIFT
> config HAVE_UNSTABLE_SCHED_CLOCK
> bool
>
> +#
> +# For architectures that (ab)use NUMA to represent different memory regions
> +# all cpu-local but of different latencies, such as SuperH.
> +#
> +config NUMA_VARIABLE_LOCALITY
> + bool

The NUMA_VARIABLE_LOCALITY name slightly misses the real point
though that NUMA_EMBEDDED tried to stress: it's important to
realize that these are systems that (ab-)use our NUMA memory
zoning code to implement support for variable speed RAM modules
- so they can use the existing node binding ABIs.

The cost of that is the losing of the regular NUMA node
structure. So by all means it's a convenient hack - but the name
must signal that. I'm not attached to the NUMA_EMBEDDED naming
overly strongly, but NUMA_VARIABLE_LOCALITY sounds more harmless
than it should.

Perhaps ARCH_WANT_NUMA_VARIABLE_LOCALITY_OVERRIDE? A tad long
but we don't want it to be overused in any case.

Thanks,

Ingo

2012-11-13 11:24:17

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 05/19] mm: numa: pte_numa() and pmd_numa()

Hi Ingo,

On Tue, Nov 13, 2012 at 10:54:17AM +0100, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > From: Andrea Arcangeli <[email protected]>
> >
> > Implement pte_numa and pmd_numa.
> >
> > <Changlog SNIP>
> > ---
> > arch/x86/include/asm/pgtable.h | 65 ++++++++++++++++++++++++++++++++++++++--
> > include/asm-generic/pgtable.h | 12 ++++++++
> > 2 files changed, 75 insertions(+), 2 deletions(-)
> >
> > <Patch SNIP>
>
> Hm, this overcomplicates things quite a bit and adds arch
> specific code, and there's no explanation given for that
> approach that I can see?
>

So there are two possible problems here - the PTE flag naming and how
it's implemented.

On the PTE flag naming front, the changelog explains the disadvantages
of using PROT_NONE and this arrangement allows an architecture to make a
better decision if one is available. The relevant parts of the changelog are:

_PAGE_NUMA on x86 shares the same bit number as _PAGE_PROTNONE (but
it could also use a different bitflag; it's up to the architecture
to decide).

and

Sharing the same bitflag with _PAGE_PROTNONE in fact complicates
things: it requires us to ensure the code paths executed by
_PAGE_PROTNONE remain mutually exclusive to the code paths executed
by _PAGE_NUMA at all times, to avoid _PAGE_NUMA and _PAGE_PROTNONE
stepping on each other's toes.

so I'd like to keep that. Any major objections?
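
For anyone who wants the intended transitions spelled out, here is a
throwaway user-space model of the semantics in the hunks above (the flag
values are arbitrary stand-ins, not the real x86 bits):

/* Toy model of the pte_numa/pte_mknuma/pte_mknonnuma semantics from the
 * quoted patch. Flag values are invented; on x86 _PAGE_NUMA would share
 * the PROT_NONE bit. Compile and run stand-alone to see the cycle. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define _PAGE_PRESENT	(1u << 0)
#define _PAGE_ACCESSED	(1u << 5)
#define _PAGE_NUMA	(1u << 8)

typedef uint32_t pte_t;

static int pte_numa(pte_t pte)
{
	return (pte & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
}

static pte_t pte_mknuma(pte_t pte)
{
	return (pte | _PAGE_NUMA) & ~_PAGE_PRESENT;	/* next touch faults */
}

static pte_t pte_mknonnuma(pte_t pte)
{
	/* hinting fault resolves it and sets ACCESSED to avoid a second write */
	return (pte & ~_PAGE_NUMA) | _PAGE_PRESENT | _PAGE_ACCESSED;
}

int main(void)
{
	pte_t pte = _PAGE_PRESENT;		/* ordinary mapped page */

	pte = pte_mknuma(pte);			/* marked by the scanner */
	assert(pte_numa(pte) && !(pte & _PAGE_PRESENT));

	pte = pte_mknonnuma(pte);		/* NUMA hinting fault */
	assert(!pte_numa(pte) && (pte & _PAGE_PRESENT) && (pte & _PAGE_ACCESSED));

	printf("mark -> fault -> resolve cycle behaves as described\n");
	return 0;
}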

> Basically, what's wrong with the generic approach that numa/core
> has:
>
> __weak bool pte_numa(struct vm_area_struct *vma, pte_t pte)
>
> [see the full function below.]
>
> Then we can reuse existing protection-changing functionality and
> keep it all tidy.
>

I very much like the idea of this approach. Superficially I see nothing
wrong with it. I just didn't think of it when I was trying to resolve
the two trees together.

> an architecture that wants to do something special could
> possibly override it in the future - but we want to keep the
> generic logic in generic code.
>

Sensible and probably less mess in the future.

> __weak bool pte_numa(struct vm_area_struct *vma, pte_t pte)
> {

I'll lift this and see if it can be modified to use _PAGE_NUMA instead of
hard-coding for PROT_NONE. Of course if you beat me to it and send a patch,
that'd be cool too :)

Thanks!

--
Mel Gorman
SUSE Labs

2012-11-13 11:37:45

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 06/19] mm: numa: teach gup_fast about pmd_numa

On Tue, Nov 13, 2012 at 11:07:36AM +0100, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > From: Andrea Arcangeli <[email protected]>
> >
> > When scanning pmds, the pmd may be of numa type (_PAGE_PRESENT not set),
> > however the pte might be present. Therefore, gup_pmd_range() must return
> > 0 in this case to avoid losing a NUMA hinting page fault during gup_fast.
> >
> > Note: gup_fast will skip over non present ptes (like numa
> > types), so no explicit check is needed for the pte_numa case.
> > [...]
>
> So, why not fix all architectures that choose to expose
> pte_numa() and pmd_numa() methods - via the patch below?
>

I'll pick it up. Thanks.

--
Mel Gorman
SUSE Labs

2012-11-13 11:43:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 12/19] mm: migrate: Introduce migrate_misplaced_page()


* Ingo Molnar <[email protected]> wrote:

>
> * Mel Gorman <[email protected]> wrote:
>
> > From: Peter Zijlstra <[email protected]>
> >
> > Note: This was originally based on Peter's patch "mm/migrate: Introduce
> > migrate_misplaced_page()" but borrows extremely heavily from Andrea's
> > "autonuma: memory follows CPU algorithm and task/mm_autonuma stats
> > collection". The end result is barely recognisable so signed-offs
> > had to be dropped. If original authors are ok with it, I'll
> > re-add the signed-off-bys.
> >
> > Add migrate_misplaced_page() which deals with migrating pages from
> > faults.
> >
> > Based-on-work-by: Lee Schermerhorn <[email protected]>
> > Based-on-work-by: Peter Zijlstra <[email protected]>
> > Based-on-work-by: Andrea Arcangeli <[email protected]>
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > include/linux/migrate.h | 8 ++++
> > mm/migrate.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++-
> > 2 files changed, 110 insertions(+), 2 deletions(-)
>
> That's a nice patch - the TASK_NUMA_FAULT approach in the
> original patch was not very elegant.
>
> I've started testing it to see how well your version works.

Hm, I'm seeing some instability - see the boot crash below. If I
undo your patch it goes away.

( To help debugging this I've attached migration.patch which
applies your patch on top of Peter's latest queue of patches.
If I revert this patch then the crash goes away. )

I've gone back to the well-tested page migration code from Peter
for the time being.

Thanks,

Ingo

[ 7.999147] Freeing unused kernel memory: 148k freed
[ 8.004841] Freeing unused kernel memory: 44k freed
[ 8.028683] BUG: Bad page state in process init pfn:815ae6
[ 8.034462] page:ffffea002056b980 count:0 mapcount:1 mapping:ffff8804175c3218 index:0x14

[ 8.042835] page flags: 0xc080000000001c(referenced|uptodate|dirty)
[ 8.049884] Modules linked in:
[ 8.053164] Pid: 1, comm: init Not tainted 3.7.0-rc5-01482-g324e8d9-dirty #349
[ 8.060626] Call Trace:
[ 8.063246] [<ffffffff819bcd9d>] bad_page+0xe6/0xfb
[ 8.068358] [<ffffffff8118d4e4>] free_pages_prepare+0x104/0x110
[ 8.074510] [<ffffffff8118d530>] free_hot_cold_page+0x40/0x160
[ 8.080576] [<ffffffff81192237>] __put_single_page+0x27/0x30
[ 8.084612] usb 3-1: New USB device found, idVendor=1241, idProduct=1503
[ 8.084615] usb 3-1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[ 8.084618] usb 3-1: Product: USB Keyboard
[ 8.084620] usb 3-1: Manufacturer:
[ 8.098878] input: USB Keyboard as /devices/pci0000:00/0000:00:12.0/usb3/3-1/3-1:1.0/input/input5
[ 8.099128] hid-generic 0003:1241:1503.0003: input,hidraw2: USB HID v1.10 Keyboard [ USB Keyboard] on usb-0000:00:12.0-1/input0
[ 8.121649] input: USB Keyboard as /devices/pci0000:00/0000:00:12.0/usb3/3-1/3-1:1.1/input/input6
[ 8.121896] hid-generic 0003:1241:1503.0004: input,hidraw3: USB HID v1.10 Device [ USB Keyboard] on usb-0000:00:12.0-1/input1
[ 8.150442] [<ffffffff81192915>] put_page+0x35/0x50
[ 8.155563] [<ffffffff811ad950>] handle_pte_fault+0x3c0/0xbb0
[ 8.161544] [<ffffffff81098cee>] ? native_flush_tlb_others+0x2e/0x30
[ 8.168137] [<ffffffff811aef08>] handle_mm_fault+0x278/0x340
[ 8.174030] [<ffffffff819cd0c2>] __do_page_fault+0x172/0x4e0
[ 8.179933] [<ffffffff810e947e>] ? task_numa_work+0x21e/0x2f0
[ 8.185935] [<ffffffff810cbc4c>] ? task_work_run+0xac/0xe0
[ 8.191657] [<ffffffff819cd43e>] do_page_fault+0xe/0x10
[ 8.197126] [<ffffffff819c9a58>] page_fault+0x28/0x30
[ 8.202411] Disabling lock debugging due to kernel taint
[ 8.208057] ------------[ cut here ]------------
[ 8.212843] kernel BUG at include/linux/mm.h:419!
[ 8.217688] invalid opcode: 0000 [#1] SMP
[ 8.222075] Modules linked in:
[ 8.225345] CPU 8
[ 8.227256] Pid: 1, comm: init Tainted: G B 3.7.0-rc5-01482-g324e8d9-dirty #349 Supermicro H8DG6/H8DGi/H8DG6/H8DGi
[ 8.239018] RIP: 0010:[<ffffffff819bd455>] [<ffffffff819bd455>] get_page.part.44+0x4/0x6
[ 8.247489] RSP: 0000:ffff880415c67d08 EFLAGS: 00010246
[ 8.252940] RAX: 0000000000000000 RBX: ffff8808161bc5e0 RCX: 000000002056b980
[ 8.260210] RDX: 0000000815ae6100 RSI: 00007f006f4bc915 RDI: 0000000815ae6100
[ 8.267488] RBP: ffff880415c67d08 R08: ffff88081499e2c0 R09: 0000000000000028
[ 8.274763] R10: 00007fffcaf301a0 R11: 00007fffcaf302b0 R12: ffffea002056b980
[ 8.282042] R13: 0000000815ae6100 R14: ffff88081499e2c0 R15: ffffea0020586f30
[ 8.289318] FS: 00007f006f6c3740(0000) GS:ffff880817c00000(0000) knlGS:0000000000000000
[ 8.297626] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 8.303537] CR2: 00007f006f4bc915 CR3: 00000008161b1000 CR4: 00000000000407e0
[ 8.310813] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 8.318091] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 8.325369] Process init (pid: 1, threadinfo ffff880415c66000, task ffff880415c68000)
[ 8.333416] Stack:
[ 8.335570] ffff880415c67dc8 ffffffff811ae135 ffff8808149a02b8 00000000004da000
[ 8.343505] 0000000000400000 00000000004da000 ffff880415c67d68 ffffffff81098cee
[ 8.351475] ffff880415c67d68 ffffffff00000028 ffff8808149a02b8 ffff8808149a0000
[ 8.359384] Call Trace:
[ 8.361977] [<ffffffff811ae135>] handle_pte_fault+0xba5/0xbb0
[ 8.367957] [<ffffffff81098cee>] ? native_flush_tlb_others+0x2e/0x30
[ 8.374538] [<ffffffff81098f4e>] ? flush_tlb_mm_range+0x1ee/0x230
[ 8.380856] [<ffffffff811aef08>] handle_mm_fault+0x278/0x340
[ 8.386740] [<ffffffff819cd0c2>] __do_page_fault+0x172/0x4e0
[ 8.392625] [<ffffffff8105d5d1>] ? __switch_to+0x181/0x4a0
[ 8.398343] [<ffffffff810e947e>] ? task_numa_work+0x21e/0x2f0
[ 8.404315] [<ffffffff810cbc4c>] ? task_work_run+0xac/0xe0
[ 8.410026] [<ffffffff819cd43e>] do_page_fault+0xe/0x10
[ 8.415478] [<ffffffff819c9a58>] page_fault+0x28/0x30
[ 8.420753] Code: 99 ff ff ff 85 c0 74 0c 4c 89 e0 48 c1 e0 06 48 29 d8 eb 02 31 c0 5b 41 5c 5d c3 55 48 89 e5 0f 0b 55 48 89 e5 0f 0b 55 48 89 e5 <0f> 0b 55 48 8b 07 31 c9 48 89 e5 f6 c4 40 74 03 8b 4f 68 bf 00
[ 8.444972] RIP [<ffffffff819bd455>] get_page.part.44+0x4/0x6
[ 8.451021] RSP <ffff880415c67d08>
[ 8.454674] ---[ end trace 871518523836e5de ]---
[ 32.668913] BUG: soft lockup - CPU#8 stuck for 22s! [init:1]


Attachments:
config (89.86 kB)
migration.patch (6.82 kB)

2012-11-13 11:50:38

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 08/19] mm: numa: Create basic numa page hinting infrastructure

On Tue, Nov 13, 2012 at 11:21:20AM +0100, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > Note: This patch started as "mm/mpol: Create special PROT_NONE
> > infrastructure" and preserves the basic idea but steals *very*
> > heavily from "autonuma: numa hinting page faults entry points" for
> > the actual fault handlers without the migration parts. The end
> > result is barely recognisable as either patch so all Signed-off
> > and Reviewed-bys are dropped. If Peter, Ingo and Andrea are ok with
> > this version, I will re-add the signed-offs-by to reflect the history.
>
> Most of the changes you had to do here relate to the earlier
> decision to turn all the NUMA protection fault demultiplexing
> and setup code into a per arch facility.
>

Yes.

> On one hand I'm 100% fine with making the decision to *use* the
> new NUMA code per arch and explicitly opt-in - we already have
> such a Kconfig switch in our tree. The decision whether
> to use any of this for an architecture must be considered and
> tested carefully.
>

Agreed.

> But given that most architectures will be just fine reusing the
> already existing generic PROT_NONE machinery, the far better
> approach is to do what we've been doing in generic kernel code
> for the last 10 years: offer a default generic version, and then
> to offer per arch hooks on a strict as-needed basis, if they
> want or need to do something weird ...
>

If they are *not* fine with it, it's a large retrofit because the PROT_NONE
machinery has been hard-coded throughout. It also requires that anyone
looking at the fault paths must remember at almost all times that PROT_NONE
can also mean PROT_NUMA, depending on context. While that's fine right
now, it'll be harder to maintain in the future.

> So why fork away this logic into per arch code so early and
> without explicit justification? It creates duplication artifacts
> all around and makes porting to a new 'sane' architecture
> harder.
>

I agree that the duplication artifacts were a mistake. I can fix that but
feel that the naming is fine and we shouldn't hard-code the assumption that
change_prot_none() actually means change_prot_numa() when called from
the right place.

> Also, if there *are* per architecture concerns then I'd very
> much like to see that argued very explicitly, on a per arch
> basis, as it occurs, not obscured through thick "just in case"
> layers of abstraction ...
>

Once there is a generic pte_numa handler for example, an arch-specific
overriding of it should raise a red flag for closer inspection.
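
The usual asm-generic shape for that would be roughly the following (sketch
only; the __HAVE_ARCH_PTE_NUMA guard is a made-up name and pte_val() plus
the _PAGE_NUMA/_PAGE_PRESENT encoding are assumed to come from the arch):

/* Sketch of a generic default with an explicit per-arch override hook.
 * Nothing below exists in either tree; it only illustrates the pattern. */

/* include/asm-generic/pgtable.h: default used by every architecture */
#ifndef __HAVE_ARCH_PTE_NUMA
static inline int pte_numa(pte_t pte)
{
	/* _PAGE_NUMA set and _PAGE_PRESENT clear: take a hinting fault */
	return (pte_val(pte) & (_PAGE_NUMA | _PAGE_PRESENT)) == _PAGE_NUMA;
}
#endif

/*
 * An architecture with an unusual pte encoding would define
 * __HAVE_ARCH_PTE_NUMA in its asm/pgtable.h and supply its own
 * pte_numa()/pmd_numa(), which makes the override obvious in review.
 */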

--
Mel Gorman
SUSE Labs

2012-11-13 11:56:29

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 12/19] mm: migrate: Introduce migrate_misplaced_page()

On Tue, Nov 13, 2012 at 12:43:44PM +0100, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> >
> > * Mel Gorman <[email protected]> wrote:
> >
> > > From: Peter Zijlstra <[email protected]>
> > >
> > > Note: This was originally based on Peter's patch "mm/migrate: Introduce
> > > migrate_misplaced_page()" but borrows extremely heavily from Andrea's
> > > "autonuma: memory follows CPU algorithm and task/mm_autonuma stats
> > > collection". The end result is barely recognisable so signed-offs
> > > had to be dropped. If original authors are ok with it, I'll
> > > re-add the signed-off-bys.
> > >
> > > Add migrate_misplaced_page() which deals with migrating pages from
> > > faults.
> > >
> > > Based-on-work-by: Lee Schermerhorn <[email protected]>
> > > Based-on-work-by: Peter Zijlstra <[email protected]>
> > > Based-on-work-by: Andrea Arcangeli <[email protected]>
> > > Signed-off-by: Mel Gorman <[email protected]>
> > > ---
> > > include/linux/migrate.h | 8 ++++
> > > mm/migrate.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++-
> > > 2 files changed, 110 insertions(+), 2 deletions(-)
> >
> > That's a nice patch - the TASK_NUMA_FAULT approach in the
> > original patch was not very elegant.
> >
> > I've started testing it to see how well your version works.
>
> Hm, I'm seeing some instability - see the boot crash below. If I
> undo your patch it goes away.
>

Hah, I would not describe a "boot crash" as some instability. That's
just outright broken :)

I've not built a tree with the latest of Peter's code yet so I don't
know at this time which line it is BUG()ing on. However, it is *very*
likely that this patch is not a drop-in replacement for your tree
because IIRC, there are differences in how and when we call get_page().
That is the likely source of the snag.

--
Mel Gorman
SUSE Labs

2012-11-13 12:02:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 14/19] mm: mempolicy: Add MPOL_MF_LAZY

On Tue, Nov 13, 2012 at 11:25:55AM +0100, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > From: Lee Schermerhorn <[email protected]>
> >
> > NOTE: Once again there is a lot of patch stealing and the end result
> > is sufficiently different that I had to drop the signed-offs.
> > Will re-add if the original authors are ok with that.
> >
> > This patch adds another mbind() flag to request "lazy migration". The
> > flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected
> > pages are marked PROT_NONE. The pages will be migrated in the fault
> > path on "first touch", if the policy dictates at that time.
> >
> > <SNIP>
>
> Here you are paying a heavy price for the earlier design
> mistake, for forking into per arch approach - the NUMA version
> of change_protection() had to be open-coded:
>

I considered this when looking at the two trees.

At the time I also had the option of making change_prot_numa() a
wrapper around change_protection() and, if pte_numa is made generic, that
becomes more attractive.

One of the reasons I went with this version from Andrea's tree is simply
because it does less work than change_protection() while still doing what
should be sufficient for _PAGE_NUMA. I avoid the TLB flush if there are no
PTE updates, for example, but I could shuffle change_protection() and get
the same thing.
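
Roughly, the wrapper option would be this (a sketch only; it assumes
change_protection() grows a way to write _PAGE_NUMA instead of the VMA
protection bits and to return an update count, which neither tree does today):

/* Hypothetical change_prot_numa() built on change_protection() rather
 * than an open-coded walk. The return count and the trailing prot_numa
 * argument are assumed for this sketch, not existing code. */
unsigned long change_prot_numa(struct vm_area_struct *vma,
			       unsigned long addr, unsigned long end)
{
	unsigned long nr_updated;

	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot,
				       0 /* dirty_accountable */,
				       1 /* prot_numa */);

	return nr_updated;
}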

> > include/linux/mm.h | 3 +
> > include/uapi/linux/mempolicy.h | 13 ++-
> > mm/mempolicy.c | 176 ++++++++++++++++++++++++++++++++++++----
> > 3 files changed, 174 insertions(+), 18 deletions(-)
>
> Compare it to the generic version that Peter used:
>
> include/uapi/linux/mempolicy.h | 13 ++++++++---
> mm/mempolicy.c | 49 +++++++++++++++++++++++++++---------------
> 2 files changed, 42 insertions(+), 20 deletions(-)
>
> and the cleanliness and maintainability advantages are obvious.
>
> So without some really good arguments in favor of your approach
> NAK on that complex approach really.
>

I will reimplement around change_protection() and see what effect, if any,
it has on overhead.

--
Mel Gorman
SUSE Labs

2012-11-13 12:09:14

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 15/19] mm: numa: Add fault driven placement and migration

On Tue, Nov 13, 2012 at 11:45:30AM +0100, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > NOTE: This patch is based on "sched, numa, mm: Add fault driven
> > placement and migration policy" but as it throws away
> > all the policy to just leave a basic foundation I had to
> > drop the signed-offs-by.
>
> So, much of that has been updated meanwhile - but the split
> makes fundamental sense - we considered it before.
>

Yes, I saw the new series after I had written the changelog for V2. I
decided to release a V2 anyway and plan to examine the revised patches and
see what's in there. I hope to do that today, but it's more likely it will
be tomorrow as some other issues have piled up on the TODO list.

> One detail you did in this patch was the following rename:
>
> s/EMBEDDED_NUMA/NUMA_VARIABLE_LOCALITY
>

Yes.

> > --- a/arch/sh/mm/Kconfig
> > +++ b/arch/sh/mm/Kconfig
> > @@ -111,6 +111,7 @@ config VSYSCALL
> > config NUMA
> > bool "Non Uniform Memory Access (NUMA) Support"
> > depends on MMU && SYS_SUPPORTS_NUMA && EXPERIMENTAL
> > + select NUMA_VARIABLE_LOCALITY
> > default n
> > help
> > Some SH systems have many various memories scattered around
> > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >
> ..aaba45d 100644
> > --- a/init/Kconfig
> > +++ b/init/Kconfig
> > @@ -696,6 +696,20 @@ config LOG_BUF_SHIFT
> > config HAVE_UNSTABLE_SCHED_CLOCK
> > bool
> >
> > +#
> > +# For architectures that (ab)use NUMA to represent different memory regions
> > +# all cpu-local but of different latencies, such as SuperH.
> > +#
> > +config NUMA_VARIABLE_LOCALITY
> > + bool
>
> The NUMA_VARIABLE_LOCALITY name slightly misses the real point
> though that NUMA_EMBEDDED tried to stress: it's important to
> realize that these are systems that (ab-)use our NUMA memory
> zoning code to implement support for variable speed RAM modules
> - so they can use the existing node binding ABIs.
>
> The cost of that is the losing of the regular NUMA node
> structure. So by all means it's a convenient hack - but the name
> must signal that. I'm not attached to the NUMA_EMBEDDED naming
> overly strongly, but NUMA_VARIABLE_LOCALITY sounds more harmless
> than it should.
>
> Perhaps ARCH_WANT_NUMA_VARIABLE_LOCALITY_OVERRIDE? A tad long
> but we don't want it to be overused in any case.
>

I had two reasons for not using the NUMA_EMBEDDED name.

1. Embedded is too generic a term and could mean anything. There are x86
machines that are considered embedded for which this option is
meaningless. It'd be irritating to get mails about how they cannot enable
the NUMA_EMBEDDED option for their embedded machine.

2. I periodically encounter people who plan to abuse NUMA for building
things like RAM-like regions backed by something else that are not
arch-specific. In some cases, these are far from being an embedded
use-case. While I have heavily discouraged such NUMA abuse in the past,
I still kept it in mind for the naming.

I'll go with the long name you suggest even though it's arch specific
because I never want point 2 above to happen anyway. Maybe the name will
poke the next person who plans to abuse NUMA in the eye hard enough to
discourage them.

--
Mel Gorman
SUSE Labs

2012-11-13 13:39:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 15/19] mm: numa: Add fault driven placement and migration


* Mel Gorman <[email protected]> wrote:

> > The NUMA_VARIABLE_LOCALITY name slightly misses the real
> > point though that NUMA_EMBEDDED tried to stress: it's
> > important to realize that these are systems that (ab-)use
> > our NUMA memory zoning code to implement support for
> > variable speed RAM modules - so they can use the existing
> > node binding ABIs.
> >
> > The cost of that is the losing of the regular NUMA node
> > structure. So by all means it's a convenient hack - but the
> > name must signal that. I'm not attached to the NUMA_EMBEDDED
> > naming overly strongly, but NUMA_VARIABLE_LOCALITY sounds
> > more harmless than it should.
> >
> > Perhaps ARCH_WANT_NUMA_VARIABLE_LOCALITY_OVERRIDE? A tad
> > long but we don't want it to be overused in any case.
> >
>
> I had two reasons for not using the NUMA_EMBEDDED name.

As I indicated I'm fine with not using that.

> I'll go with the long name you suggest even though it's arch
> specific because I never want point 2 above to happen anyway.
> Maybe the name will poke the next person who plans to abuse
> NUMA in the eye hard enough to discourage them.

FYI, I've applied a slightly shorter variant in the numa/core
tree, will send it out later today.

Thanks,

Ingo

2012-11-13 13:49:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 08/19] mm: numa: Create basic numa page hinting infrastructure


* Mel Gorman <[email protected]> wrote:

> > But given that most architectures will be just fine reusing
> > the already existing generic PROT_NONE machinery, the far
> > better approach is to do what we've been doing in generic
> > kernel code for the last 10 years: offer a default generic
> > version, and then to offer per arch hooks on a strict
> > as-needed basis, if they want or need to do something weird
> > ...
>
> If they are *not* fine with it, it's a large retrofit because
> the PROT_NONE machinery has been hard-coded throughout. [...]

That was a valid criticism for earlier versions of the NUMA
patches - but should much less be the case in the latest
iterations of the patches:

- it has generic pte_numa() / pmd_numa() instead of using
prot_none() directly

- the key utility functions are named using the _numa pattern,
not *_prot_none*() anymore.

Let us know if you can still see such instances - it's probably
simple oversight.

Thanks,

Ingo

2012-11-13 13:51:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 06/19] mm: numa: teach gup_fast about pmd_numa


* Mel Gorman <[email protected]> wrote:

> On Tue, Nov 13, 2012 at 11:07:36AM +0100, Ingo Molnar wrote:
> >
> > * Mel Gorman <[email protected]> wrote:
> >
> > > From: Andrea Arcangeli <[email protected]>
> > >
> > > When scanning pmds, the pmd may be of numa type (_PAGE_PRESENT not set),
> > > however the pte might be present. Therefore, gup_pmd_range() must return
> > > 0 in this case to avoid losing a NUMA hinting page fault during gup_fast.
> > >
> > > Note: gup_fast will skip over non present ptes (like numa
> > > types), so no explicit check is needed for the pte_numa case.
> > > [...]
> >
> > So, why not fix all architectures that choose to expose
> > pte_numa() and pmd_numa() methods - via the patch below?
> >
>
> I'll pick it up. Thanks.

FYI, before you do too much restructuring work, that patch is
already part of tip:numa/core, I'll push out our updated version
of the tree later today.

Thanks,

Ingo

2012-11-13 14:26:41

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 08/19] mm: numa: Create basic numa page hinting infrastructure

On Tue, Nov 13, 2012 at 02:49:10PM +0100, Ingo Molnar wrote:
>
> * Mel Gorman <[email protected]> wrote:
>
> > > But given that most architectures will be just fine reusing
> > > the already existing generic PROT_NONE machinery, the far
> > > better approach is to do what we've been doing in generic
> > > kernel code for the last 10 years: offer a default generic
> > > version, and then to offer per arch hooks on a strict
> > > as-needed basis, if they want or need to do something weird
> > > ...
> >
> > If they are *not* fine with it, it's a large retrofit because
> > the PROT_NONE machinery has been hard-coded throughout. [...]
>
> That was a valid criticism for earlier versions of the NUMA
> patches - but should much less be the case in the latest
> iterations of the patches:
>

Which are where? They are possibly somewhere in -tip, maybe
tip/numa/core, but I am seeing this:

$ git diff e657e078d3dfa9f96976db7a2b5fd7d7c9f1f1a6..tip/numa/core | grep change_prot_none
+change_prot_none(struct vm_area_struct *vma, unsigned long start, unsigned long end)
+ change_prot_none(vma, offset, end);
+ change_prot_none(vma, start, endvma);

This is being called from task_numa_work() for example, so it's a case where
the maintainer has to remember that prot_none actually means prot_numa in
that context. Further, the generic implementation of pte_numa is hard-coding
prot_none:

+static bool pte_numa(struct vm_area_struct *vma, pte_t pte)
+{
.......
+ if (pte_same(pte, pte_modify(pte, vma->vm_page_prot)))
+ return false;
+
+ return pte_same(pte, pte_modify(pte, vma_prot_none(vma)));
+}

I can take the structuring idea of moving pte_numa around but it still
should have the _PAGE_NUMA naming. So it still looks to me as if the
PROT_NONE machinery is hard-coded.

> - it has generic pte_numa() / pmd_numa() instead of using
> prot_none() directly
>

I intend to move the pte_numa out myself.

> - the key utility functions are named using the _numa pattern,
> not *_prot_none*() anymore.
>

Where did change_prot_none() come from then?

> Let us know if you can still see such instances - it's probably
> simple oversight.
>

I could be looking at the wrong tip branch. Please post the full series
to the list so it can be reviewed that way instead of trying to
second-guess it.

--
Mel Gorman
SUSE Labs

2012-11-13 14:49:38

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 12/19] mm: migrate: Introduce migrate_misplaced_page()

On 11/13/2012 06:43 AM, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>>
>> * Mel Gorman <[email protected]> wrote:
>>
>>> From: Peter Zijlstra <[email protected]>
>>>
>>> Note: This was originally based on Peter's patch "mm/migrate: Introduce
>>> migrate_misplaced_page()" but borrows extremely heavily from Andrea's
>>> "autonuma: memory follows CPU algorithm and task/mm_autonuma stats
>>> collection". The end result is barely recognisable so signed-offs
>>> had to be dropped. If original authors are ok with it, I'll
>>> re-add the signed-off-bys.
>>>
>>> Add migrate_misplaced_page() which deals with migrating pages from
>>> faults.
>>>
>>> Based-on-work-by: Lee Schermerhorn <[email protected]>
>>> Based-on-work-by: Peter Zijlstra <[email protected]>
>>> Based-on-work-by: Andrea Arcangeli <[email protected]>
>>> Signed-off-by: Mel Gorman <[email protected]>
>>> ---
>>> include/linux/migrate.h | 8 ++++
>>> mm/migrate.c | 104 ++++++++++++++++++++++++++++++++++++++++++++++-
>>> 2 files changed, 110 insertions(+), 2 deletions(-)
>>
>> That's a nice patch - the TASK_NUMA_FAULT approach in the
>> original patch was not very elegant.
>>
>> I've started testing it to see how well your version works.
>
> Hm, I'm seeing some instability - see the boot crash below. If I
> undo your patch it goes away.
>
> ( To help debugging this I've attached migration.patch which
> applies your patch on top of Peter's latest queue of patches.
> If I revert this patch then the crash goes away. )
>
> I've gone back to the well-tested page migration code from Peter
> for the time being.

Is there a place we can see your code?

Peter's patch with MIGRATE_FAULT is very much NAKed, so
this approach does need to be made to work...

You can either make the working tree public somewhere,
so we can help, or figure it out yourself. Your choice :)

--
All rights reversed