2022-01-29 16:34:21

by Bharata B Rao

Subject: [RFC PATCH v0 0/3] sched/numa: Process Adaptive autoNUMA

Hi,

This patchset implements an adaptive algorithm for calculating the autonuma
scan period. In the existing mechanism of scan period calculation,

- scan period is derived from the per-thread stats.
- static threshold (NUMA_PERIOD_THRESHOLD) is used for changing the
scan rate.

In this new approach (Process Adaptive autoNUMA or PAN), we gather NUMA
fault stats at per-process level which allows for capturing the application
behaviour better. In addition, the algorithm learns and adjusts the scan
rate based on remote fault rate. By not sticking to a static threshold, the
algorithm can respond better to different workload behaviours.

Since the threads of a process are already considered as a group,
we add a bunch of metrics to the task's mm to track the various
types of faults and derive the scan rate from them.

The new per-process fault stats contribute only to the per-process
scan period calculation, while the existing per-thread stats continue
to contribute towards the numa_group stats which eventually
determine the thresholds for migrating memory and threads
across nodes.
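
For orientation, here is a simplified sketch of where the new code hooks
in (illustrative only, see the individual patches for the details):

    task_numa_work()
      -> pan_get_scan_period(p)            /* patch 1, under mm->pan_numa_lock */
           -> pan_update_history(p)        /* patch 2: fold per-mm fault counts
                                              into the cumulative history */
           -> pan_calculate_scan_period(p) /* patch 3: adapt mm->numa_scan_period */
      -> next_scan = now + msecs_to_jiffies(<returned scan period>)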

This patchset has been tested with a bunch of benchmarks on the
following system:

2 socket AMD Milan System
32 cores or 64 threads per socket
256GB memory per socket amounting to 512GB in total
transparent_hugepage=never has been used for all the below tests.
NPS1 NUMA configuration where each socket is a NUMA node

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95
node 0 size: 257645 MB
node 0 free: 255193 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127
node 1 size: 257984 MB
node 1 free: 257017 MB
node distances:
node 0 1
0: 10 32
1: 32 10

While this is an early version that we are experimenting with on the NPS1
configuration with THP off, we plan to test it on other configurations
as well. However, we are posting it now for some early feedback.

Here are the numbers from some of the benchmarks that we have
tried. A brief description of these benchmarks is also given at the
end of this mail. The detailed results are available at
https://drive.google.com/drive/folders/1O7sY-YBsT3F5GHZMiOGQbXkgpp6O0BEo

Plots of how scan period varies in default vs PAN for a couple
of benchmarks are also present in the above folder.

------------------------------------------------------
% gain of PAN vs default (Avg of 3 runs)
------------------------------------------------------
NAS-BT -0.17
NAS-CG +9.39
NAS-MG +8.19
NAS-FT +2.23
Hashjoin +0.58
Graph500 +14.93
Pagerank +0.37
------------------------------------------------------
Default PAN %diff
------------------------------------------------------
NUMA hint faults(Total of 3 runs)
------------------------------------------------------
NAS-BT 758282358 539850429 +29
NAS-CG 2179458823 1180301361 +46
NAS-MG 517641172 346066391 +33
NAS-FT 297044964 230033861 +23
Hashjoin 201684863 268436275 -33
Graph500 261808733 154338827 +41
Pagerank 217917818 211260310 +03
------------------------------------------------------
Migrations(Total of 3 runs)
------------------------------------------------------
NAS-BT 106888517 86482076 +19
NAS-CG 81191368 12859924 +84
NAS-MG 83927451 39651254 +53
NAS-FT 61807715 38934618 +37
Hashjoin 45406983 59828843 -32
Graph500 22798837 21560714 +05
Pagerank 59072135 44968673 +24
------------------------------------------------------

And here are results from a few microbenchmarks of the mmtests suite.
(The results are trimmed a bit here; the complete results can
be viewed at the above-mentioned link.)

Hackbench
---------
hackbench-process-pipes
hackbench hackbench
default pan
Min 256 23.5510 ( 0.00%) 23.1900 ( 1.53%)
Amean 256 24.4604 ( 0.00%) 24.0353 * 1.74%*
Stddev 256 0.4420 ( 0.00%) 0.7611 ( -72.18%)
CoeffVar 256 1.8072 ( 0.00%) 3.1666 ( -75.22%)
Max 256 25.4930 ( 0.00%) 30.5450 ( -19.82%)
BAmean-50 256 24.1074 ( 0.00%) 23.6616 ( 1.85%)
BAmean-95 256 24.4111 ( 0.00%) 23.9308 ( 1.97%)
BAmean-99 256 24.4499 ( 0.00%) 23.9696 ( 1.96%)

hackbench hackbench
default pan
Duration User 25810.02 25158.93
Duration System 276322.70 271729.32
Duration Elapsed 2707.75 2671.33

hackbench hackbench
default pan
Ops NUMA alloc hit 1082415453.00 1088025994.00
Ops NUMA alloc miss 0.00 0.00
Ops NUMA interleave hit 0.00 0.00
Ops NUMA alloc local 1082415441.00 1088025974.00
Ops NUMA base-page range updates 33475.00 228900.00
Ops NUMA PTE updates 33475.00 228900.00
Ops NUMA PMD updates 0.00 0.00
Ops NUMA hint faults 15758.00 222100.00
Ops NUMA hint local faults % 15371.00 214570.00
Ops NUMA hint local percent 97.54 96.61
Ops NUMA pages migrated 235.00 4029.00
Ops AutoNUMA cost 79.03 1112.18

tbench
------
tbench4
tbench tbench
default pan
Hmean 1 436.89 ( 0.00%) 432.73 * -0.95%*
Hmean 2 834.27 ( 0.00%) 848.11 * 1.66%*
Hmean 4 1629.50 ( 0.00%) 1614.22 * -0.94%*
Hmean 8 2944.06 ( 0.00%) 3031.66 * 2.98%*
Hmean 16 5418.25 ( 0.00%) 5674.74 * 4.73%*
Hmean 32 9959.60 ( 0.00%) 9009.82 * -9.54%*
Hmean 64 13999.14 ( 0.00%) 12160.51 * -13.13%*
Hmean 128 16797.09 ( 0.00%) 16506.14 * -1.73%*
Hmean 256 25344.27 ( 0.00%) 25683.66 * 1.34%*
Hmean 512 25289.03 ( 0.00%) 25513.77 * 0.89%*
BHmean-50 1 437.13 ( 0.00%) 433.01 ( -0.94%)
BHmean-50 2 836.35 ( 0.00%) 848.85 ( 1.49%)
BHmean-50 4 1631.39 ( 0.00%) 1618.43 ( -0.79%)
BHmean-50 8 2948.25 ( 0.00%) 3037.86 ( 3.04%)
BHmean-50 16 5425.17 ( 0.00%) 5684.25 ( 4.78%)
BHmean-50 32 9969.17 ( 0.00%) 9034.06 ( -9.38%)
BHmean-50 64 14013.93 ( 0.00%) 12202.07 ( -12.93%)
BHmean-50 128 16881.94 ( 0.00%) 16571.27 ( -1.84%)
BHmean-50 256 25379.59 ( 0.00%) 25819.18 ( 1.73%)
BHmean-50 512 25435.41 ( 0.00%) 25718.02 ( 1.11%)
BHmean-95 1 436.92 ( 0.00%) 432.81 ( -0.94%)
BHmean-95 2 834.59 ( 0.00%) 848.23 ( 1.63%)
BHmean-95 4 1629.73 ( 0.00%) 1614.83 ( -0.91%)
BHmean-95 8 2945.02 ( 0.00%) 3032.19 ( 2.96%)
BHmean-95 16 5418.86 ( 0.00%) 5675.91 ( 4.74%)
BHmean-95 32 9962.57 ( 0.00%) 9014.17 ( -9.52%)
BHmean-95 64 14002.44 ( 0.00%) 12164.32 ( -13.13%)
BHmean-95 128 16820.56 ( 0.00%) 16522.82 ( -1.77%)
BHmean-95 256 25347.34 ( 0.00%) 25692.56 ( 1.36%)
BHmean-95 512 25302.10 ( 0.00%) 25528.52 ( 0.89%)
BHmean-99 1 436.90 ( 0.00%) 432.75 ( -0.95%)
BHmean-99 2 834.35 ( 0.00%) 848.17 ( 1.66%)
BHmean-99 4 1629.57 ( 0.00%) 1614.38 ( -0.93%)
BHmean-99 8 2944.36 ( 0.00%) 3031.77 ( 2.97%)
BHmean-99 16 5418.40 ( 0.00%) 5675.01 ( 4.74%)
BHmean-99 32 9961.01 ( 0.00%) 9011.43 ( -9.53%)
BHmean-99 64 14000.68 ( 0.00%) 12161.34 ( -13.14%)
BHmean-99 128 16803.44 ( 0.00%) 16511.94 ( -1.73%)
BHmean-99 256 25344.93 ( 0.00%) 25685.57 ( 1.34%)
BHmean-99 512 25291.87 ( 0.00%) 25516.94 ( 0.89%)

tbench tbench
default pan
Duration User 8482.50 8289.35
Duration System 49462.63 49364.56
Duration Elapsed 2217.10 2217.08

tbench tbench
default pan
Ops NUMA alloc hit 388738400.00 378941469.00
Ops NUMA alloc miss 0.00 0.00
Ops NUMA interleave hit 0.00 0.00
Ops NUMA alloc local 388738391.00 378941455.00
Ops NUMA base-page range updates 266760.00 266275.00
Ops NUMA PTE updates 266760.00 266275.00
Ops NUMA PMD updates 0.00 0.00
Ops NUMA hint faults 241547.00 257790.00
Ops NUMA hint local faults % 145814.00 126410.00
Ops NUMA hint local percent 60.37 49.04
Ops NUMA pages migrated 51535.00 66083.00
Ops AutoNUMA cost 1210.58 1292.07

dbench
------
dbench4 Latency
dbench dbench
default pan
Amean latency-1 2.02 ( 0.00%) 2.05 * -1.52%*
Amean latency-2 2.60 ( 0.00%) 2.55 * 1.64%*
Amean latency-4 3.52 ( 0.00%) 3.56 * -1.17%*
Amean latency-8 12.79 ( 0.00%) 11.83 * 7.49%*
Amean latency-16 23.33 ( 0.00%) 19.09 * 18.19%*
Amean latency-32 19.30 ( 0.00%) 18.83 * 2.43%*
Amean latency-64 25.32 ( 0.00%) 24.30 * 4.00%*
Amean latency-128 45.25 ( 0.00%) 42.93 * 5.13%*
Amean latency-1024 0.00 ( 0.00%) 0.00 * 0.00%*
BAmean-50 latency-1 1.65 ( 0.00%) 1.74 ( -5.16%)
BAmean-50 latency-2 2.10 ( 0.00%) 2.10 ( -0.13%)
BAmean-50 latency-4 2.65 ( 0.00%) 2.71 ( -2.28%)
BAmean-50 latency-8 6.21 ( 0.00%) 4.64 ( 25.30%)
BAmean-50 latency-16 17.64 ( 0.00%) 14.08 ( 20.16%)
BAmean-50 latency-32 15.58 ( 0.00%) 15.90 ( -2.07%)
BAmean-50 latency-64 20.76 ( 0.00%) 20.31 ( 2.15%)
BAmean-50 latency-128 36.22 ( 0.00%) 34.85 ( 3.80%)
BAmean-50 latency-1024 0.00 ( 0.00%) 0.00 ( 0.00%)
BAmean-95 latency-1 1.88 ( 0.00%) 1.94 ( -3.17%)
BAmean-95 latency-2 2.25 ( 0.00%) 2.26 ( -0.26%)
BAmean-95 latency-4 3.00 ( 0.00%) 3.08 ( -2.71%)
BAmean-95 latency-8 11.66 ( 0.00%) 10.03 ( 13.97%)
BAmean-95 latency-16 22.30 ( 0.00%) 17.68 ( 20.73%)
BAmean-95 latency-32 17.95 ( 0.00%) 17.70 ( 1.38%)
BAmean-95 latency-64 23.57 ( 0.00%) 22.72 ( 3.62%)
BAmean-95 latency-128 42.44 ( 0.00%) 39.96 ( 5.84%)
BAmean-95 latency-1024 0.00 ( 0.00%) 0.00 ( 0.00%)
BAmean-99 latency-1 1.90 ( 0.00%) 1.96 ( -3.30%)
BAmean-99 latency-2 2.38 ( 0.00%) 2.37 ( 0.48%)
BAmean-99 latency-4 3.24 ( 0.00%) 3.34 ( -3.26%)
BAmean-99 latency-8 12.34 ( 0.00%) 10.71 ( 13.27%)
BAmean-99 latency-16 22.79 ( 0.00%) 18.27 ( 19.82%)
BAmean-99 latency-32 18.68 ( 0.00%) 18.32 ( 1.93%)
BAmean-99 latency-64 24.69 ( 0.00%) 23.69 ( 4.06%)
BAmean-99 latency-128 44.44 ( 0.00%) 42.15 ( 5.17%)
BAmean-99 latency-1024 0.00 ( 0.00%) 0.00 ( 0.00%)

dbench4 Throughput (misleading but traditional)
dbench dbench
default pan
Hmean 1 505.12 ( 0.00%) 492.96 * -2.41%*
Hmean 2 824.14 ( 0.00%) 824.06 * -0.01%*
Hmean 4 1174.61 ( 0.00%) 1207.86 * 2.83%*
Hmean 8 1665.10 ( 0.00%) 1667.27 * 0.13%*
Hmean 16 2215.59 ( 0.00%) 2160.93 * -2.47%*
Hmean 32 2727.05 ( 0.00%) 2633.26 * -3.44%*
Hmean 64 3128.64 ( 0.00%) 3098.73 * -0.96%*
Hmean 128 3282.89 ( 0.00%) 3340.26 * 1.75%*
Hmean 1024 2551.02 ( 0.00%) 2559.41 * 0.33%*
BHmean-50 1 509.87 ( 0.00%) 495.10 ( -2.90%)
BHmean-50 2 829.35 ( 0.00%) 828.14 ( -0.15%)
BHmean-50 4 1182.38 ( 0.00%) 1219.30 ( 3.12%)
BHmean-50 8 1678.49 ( 0.00%) 1678.83 ( 0.02%)
BHmean-50 16 2251.01 ( 0.00%) 2194.52 ( -2.51%)
BHmean-50 32 2751.39 ( 0.00%) 2678.45 ( -2.65%)
BHmean-50 64 3189.69 ( 0.00%) 3154.45 ( -1.10%)
BHmean-50 128 3396.18 ( 0.00%) 3451.59 ( 1.63%)
BHmean-50 1024 2836.80 ( 0.00%) 2836.84 ( 0.00%)
BHmean-95 1 506.13 ( 0.00%) 493.24 ( -2.55%)
BHmean-95 2 824.84 ( 0.00%) 824.30 ( -0.06%)
BHmean-95 4 1175.91 ( 0.00%) 1208.57 ( 2.78%)
BHmean-95 8 1666.46 ( 0.00%) 1668.22 ( 0.11%)
BHmean-95 16 2219.59 ( 0.00%) 2163.86 ( -2.51%)
BHmean-95 32 2731.26 ( 0.00%) 2640.34 ( -3.33%)
BHmean-95 64 3144.73 ( 0.00%) 3108.59 ( -1.15%)
BHmean-95 128 3306.51 ( 0.00%) 3363.33 ( 1.72%)
BHmean-95 1024 2658.37 ( 0.00%) 2668.88 ( 0.40%)
BHmean-99 1 505.37 ( 0.00%) 493.08 ( -2.43%)
BHmean-99 2 824.31 ( 0.00%) 824.12 ( -0.02%)
BHmean-99 4 1174.94 ( 0.00%) 1208.02 ( 2.81%)
BHmean-99 8 1665.40 ( 0.00%) 1667.48 ( 0.12%)
BHmean-99 16 2216.51 ( 0.00%) 2161.60 ( -2.48%)
BHmean-99 32 2728.09 ( 0.00%) 2635.09 ( -3.41%)
BHmean-99 64 3135.81 ( 0.00%) 3102.12 ( -1.07%)
BHmean-99 128 3291.11 ( 0.00%) 3349.16 ( 1.76%)
BHmean-99 1024 2645.54 ( 0.00%) 2655.67 ( 0.38%)


dbench dbench
default pan
Duration User 822.55 827.85
Duration System 8384.99 8164.83
Duration Elapsed 1671.36 1670.74

dbench dbench
default pan
Ops NUMA alloc hit 183324626.00 182350114.00
Ops NUMA alloc miss 0.00 0.00
Ops NUMA interleave hit 0.00 0.00
Ops NUMA alloc local 183324508.00 182350004.00
Ops NUMA base-page range updates 181531.00 515929.00
Ops NUMA PTE updates 181531.00 515929.00
Ops NUMA PMD updates 0.00 0.00
Ops NUMA hint faults 162742.00 510979.00
Ops NUMA hint local faults % 120309.00 426848.00
Ops NUMA hint local percent 73.93 83.54
Ops NUMA pages migrated 37605.00 59519.00
Ops AutoNUMA cost 815.70 2559.64

Netperf-RR
----------
netperf-udp-rr
netperf netperf
rr-default rr-pan
Min 1 104915.69 ( 0.00%) 104505.71 ( -0.39%)
Hmean 1 105865.46 ( 0.00%) 105899.22 * 0.03%*
Stddev 1 528.45 ( 0.00%) 881.92 ( -66.89%)
CoeffVar 1 0.50 ( 0.00%) 0.83 ( -66.83%)
Max 1 106410.28 ( 0.00%) 107196.52 ( 0.74%)
BHmean-50 1 106232.53 ( 0.00%) 106568.26 ( 0.32%)
BHmean-95 1 105972.05 ( 0.00%) 106056.35 ( 0.08%)
BHmean-99 1 105972.05 ( 0.00%) 106056.35 ( 0.08%)

netperf netperf
rr-default rr-pan
Duration User 11.20 10.74
Duration System 202.40 201.32
Duration Elapsed 303.09 303.08

netperf netperf
rr-default rr-pan
Ops NUMA alloc hit 183999.00 183853.00
Ops NUMA alloc miss 0.00 0.00
Ops NUMA interleave hit 0.00 0.00
Ops NUMA alloc local 183999.00 183853.00
Ops NUMA base-page range updates 0.00 24370.00
Ops NUMA PTE updates 0.00 24370.00
Ops NUMA PMD updates 0.00 0.00
Ops NUMA hint faults 539.00 24470.00
Ops NUMA hint local faults % 539.00 24447.00
Ops NUMA hint local percent 100.00 99.91
Ops NUMA pages migrated 0.00 23.00
Ops AutoNUMA cost 2.69 122.52

netperf-tcp-rr
netperf netperf
rr-default rr-pan
Min 1 96156.03 ( 0.00%) 96556.87 ( 0.42%)
Hmean 1 96627.24 ( 0.00%) 97551.38 * 0.96%*
Stddev 1 284.71 ( 0.00%) 637.74 (-123.99%)
CoeffVar 1 0.29 ( 0.00%) 0.65 (-121.87%)
Max 1 96974.45 ( 0.00%) 98554.94 ( 1.63%)
BHmean-50 1 96840.81 ( 0.00%) 98067.19 ( 1.27%)
BHmean-95 1 96679.89 ( 0.00%) 97663.14 ( 1.02%)
BHmean-99 1 96679.89 ( 0.00%) 97663.14 ( 1.02%)

netperf netperf
rr-default rr-pan
Duration User 10.21 10.26
Duration System 207.90 208.28
Duration Elapsed 302.99 303.02

netperf netperf
rr-default rr-pan
Ops NUMA alloc hit 183669.00 183695.00
Ops NUMA alloc miss 0.00 0.00
Ops NUMA interleave hit 0.00 0.00
Ops NUMA alloc local 183657.00 183695.00
Ops NUMA base-page range updates 3949.00 38561.00
Ops NUMA PTE updates 3949.00 38561.00
Ops NUMA PMD updates 0.00 0.00
Ops NUMA hint faults 4186.00 43328.00
Ops NUMA hint local faults % 4100.00 43195.00
Ops NUMA hint local percent 97.95 99.69
Ops NUMA pages migrated 9.00 73.00
Ops AutoNUMA cost 20.96 216.91

Autonumabench
-------------
autonumabench
autonumabench autonumabench
default pan
Amean syst-NUMA01 11664.40 ( 0.00%) 11616.17 * 0.41%*
Amean syst-NUMA01_THREADLOCAL 0.24 ( 0.00%) 0.22 * 7.78%*
Amean syst-NUMA02 1.55 ( 0.00%) 9.31 *-499.26%*
Amean syst-NUMA02_SMT 1.14 ( 0.00%) 4.04 *-254.39%*
Amean elsp-NUMA01 223.52 ( 0.00%) 221.43 * 0.93%*
Amean elsp-NUMA01_THREADLOCAL 0.95 ( 0.00%) 0.94 * 0.76%*
Amean elsp-NUMA02 6.83 ( 0.00%) 5.74 * 15.90%*
Amean elsp-NUMA02_SMT 6.65 ( 0.00%) 6.25 * 5.97%*
BAmean-50 syst-NUMA01 11455.44 ( 0.00%) 10985.76 ( 4.10%)
BAmean-50 syst-NUMA01_THREADLOCAL 0.22 ( 0.00%) 0.21 ( 7.46%)
BAmean-50 syst-NUMA02 1.11 ( 0.00%) 8.91 (-703.00%)
BAmean-50 syst-NUMA02_SMT 0.94 ( 0.00%) 3.42 (-262.19%)
BAmean-50 elsp-NUMA01 217.38 ( 0.00%) 214.03 ( 1.54%)
BAmean-50 elsp-NUMA01_THREADLOCAL 0.94 ( 0.00%) 0.94 ( 0.35%)
BAmean-50 elsp-NUMA02 6.66 ( 0.00%) 5.45 ( 18.08%)
BAmean-50 elsp-NUMA02_SMT 6.50 ( 0.00%) 6.09 ( 6.31%)
BAmean-95 syst-NUMA01 11611.74 ( 0.00%) 11448.30 ( 1.41%)
BAmean-95 syst-NUMA01_THREADLOCAL 0.23 ( 0.00%) 0.22 ( 7.14%)
BAmean-95 syst-NUMA02 1.27 ( 0.00%) 9.21 (-624.93%)
BAmean-95 syst-NUMA02_SMT 0.97 ( 0.00%) 3.90 (-300.34%)
BAmean-95 elsp-NUMA01 221.75 ( 0.00%) 218.53 ( 1.45%)
BAmean-95 elsp-NUMA01_THREADLOCAL 0.94 ( 0.00%) 0.94 ( 0.53%)
BAmean-95 elsp-NUMA02 6.75 ( 0.00%) 5.68 ( 15.81%)
BAmean-95 elsp-NUMA02_SMT 6.61 ( 0.00%) 6.23 ( 5.82%)
BAmean-99 syst-NUMA01 11611.74 ( 0.00%) 11448.30 ( 1.41%)
BAmean-99 syst-NUMA01_THREADLOCAL 0.23 ( 0.00%) 0.22 ( 7.14%)
BAmean-99 syst-NUMA02 1.27 ( 0.00%) 9.21 (-624.93%)
BAmean-99 syst-NUMA02_SMT 0.97 ( 0.00%) 3.90 (-300.34%)
BAmean-99 elsp-NUMA01 221.75 ( 0.00%) 218.53 ( 1.45%)
BAmean-99 elsp-NUMA01_THREADLOCAL 0.94 ( 0.00%) 0.94 ( 0.53%)
BAmean-99 elsp-NUMA02 6.75 ( 0.00%) 5.68 ( 15.81%)
BAmean-99 elsp-NUMA02_SMT 6.61 ( 0.00%) 6.23 ( 5.82%)

autonumabench autonumabench
default pan
Duration User 94363.43 94436.71
Duration System 81671.72 81408.53
Duration Elapsed 1676.81 1647.99

autonumabench autonumabench
default pan
Ops NUMA alloc hit 539544115.00 539522029.00
Ops NUMA alloc miss 0.00 0.00
Ops NUMA interleave hit 0.00 0.00
Ops NUMA alloc local 279025768.00 281735736.00
Ops NUMA base-page range updates 69695169.00 84767502.00
Ops NUMA PTE updates 69695169.00 84767502.00
Ops NUMA PMD updates 0.00 0.00
Ops NUMA hint faults 69691818.00 87895044.00
Ops NUMA hint local faults % 56565519.00 65819747.00
Ops NUMA hint local percent 81.17 74.88
Ops NUMA pages migrated 5950362.00 8310169.00
Ops AutoNUMA cost 349060.01 440226.49

Here is a short description of different benchmarks used:

NAS Parallel benchmarks
-----------------------
Using OpenMP version from https://www.nas.nasa.gov/software/npb.html
Variations tried: BT, CG, MG and FT
Memory Footprint:
BT - 170GB
CG - 196GB
MG - 211GB
FT - 81GB
Results: Operations per second.

Hashjoin
--------
Benchmark from IISc that does SQL join between two tables.
Used the version from https://github.com/mitosis-project/mitosis-asplos20-artifact
It performs a fixed number of join operations
Memory Footprint: 80GB
Results: Time taken

Graph500
--------
Benchmark from https://graph500.org/ that does breadth-first search
on a graph. The score is reported in TEPS (Traversed Edges Per Second).
It comprises both search and validation phases, but only the
search phase contributes to the final score here.
Memory Footprint: 70GB
Results: TEPS

PageRank
--------
This is part of the GAP Benchmark Suite from https://github.com/sbeamer/gapbs
It performs the PageRank computation on a graph.
Memory Footprint: 91GB
Results: Time taken

Disha Talreja (3):
sched/numa: Process based autonuma scan period framework
sched/numa: Add cumulative history of per-process fault stats
sched/numa: Add adaptive scan period calculation

include/linux/mm_types.h | 14 ++
kernel/sched/debug.c | 2 +
kernel/sched/fair.c | 344 ++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 2 +
4 files changed, 358 insertions(+), 4 deletions(-)

--
2.25.1


2022-01-29 16:34:27

by Bharata B Rao

Subject: [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period framework

From: Disha Talreja <[email protected]>

Add a new framework that calculates autonuma scan period
based on per-process NUMA fault stats.

NUMA faults can be classified into different categories, such
as local vs. remote, or private vs. shared. It is also important
to understand such behavior from the perspective of a process.
The per-process fault stats added here will be used for
calculating the scan period in the adaptive NUMA algorithm.

The actual scan period still uses the original value
p->numa_scan_period until the real implementation is added
in a later commit.
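
For reference, the counters added here follow the same 0/1 indexing
that task_numa_fault() already uses for its "local" and "priv" flags
(sketch only, not part of the patch):

    /* faults_locality[]: index 0 = remote,  index 1 = local   */
    atomic_long_add(pages, &p->mm->faults_locality[local]);
    /* faults_shared[]:   index 0 = shared,  index 1 = private */
    atomic_long_add(pages, &p->mm->faults_shared[priv]);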

Co-developed-by: Wei Huang <[email protected]>
Signed-off-by: Wei Huang <[email protected]>
Signed-off-by: Disha Talreja <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
include/linux/mm_types.h | 7 +++++++
kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++++++--
2 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9db36dc5d4cf..4f978c09d3db 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -610,6 +610,13 @@ struct mm_struct {

/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
+
+ /* Process-based Adaptive NUMA */
+ atomic_long_t faults_locality[2];
+ atomic_long_t faults_shared[2];
+
+ spinlock_t pan_numa_lock;
+ unsigned int numa_scan_period;
#endif
/*
* An operation with batched TLB flushing is going on. Anything
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 095b0aa378df..1d6404b2d42e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2099,6 +2099,20 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
numa_group->active_nodes = active_nodes;
}

+/**********************************************/
+/* Process-based Adaptive NUMA (PAN) Design */
+/**********************************************/
+/*
+ * Updates mm->numa_scan_period under mm->pan_numa_lock.
+ *
+ * Returns p->numa_scan_period now but updated to return
+ * p->mm->numa_scan_period in a later patch.
+ */
+static unsigned long pan_get_scan_period(struct task_struct *p)
+{
+ return p->numa_scan_period;
+}
+
/*
* When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
* increments. The more local the fault statistics are, the higher the scan
@@ -2616,6 +2630,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
task_numa_group(p, last_cpupid, flags, &priv);
}

+ atomic_long_add(pages, &(p->mm->faults_locality[local]));
+ atomic_long_add(pages, &(p->mm->faults_shared[priv]));
+
/*
* If a workload spans multiple NUMA nodes, a shared fault that
* occurs wholly within the set of nodes that the workload is
@@ -2702,12 +2719,20 @@ static void task_numa_work(struct callback_head *work)
if (time_before(now, migrate))
return;

- if (p->numa_scan_period == 0) {
+ if (p->mm->numa_scan_period == 0) {
+ p->numa_scan_period_max = task_scan_max(p);
+ p->numa_scan_period = task_scan_start(p);
+ mm->numa_scan_period = p->numa_scan_period;
+ } else if (p->numa_scan_period == 0) {
p->numa_scan_period_max = task_scan_max(p);
p->numa_scan_period = task_scan_start(p);
}

- next_scan = now + msecs_to_jiffies(p->numa_scan_period);
+ if (!spin_trylock(&p->mm->pan_numa_lock))
+ return;
+ next_scan = now + msecs_to_jiffies(pan_get_scan_period(p));
+ spin_unlock(&p->mm->pan_numa_lock);
+
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
return;

@@ -2807,6 +2832,16 @@ static void task_numa_work(struct callback_head *work)
}
}

+/* Init Process-based Adaptive NUMA */
+static void pan_init_numa(struct task_struct *p)
+{
+ struct mm_struct *mm = p->mm;
+
+ spin_lock_init(&mm->pan_numa_lock);
+ mm->numa_scan_period = sysctl_numa_balancing_scan_delay;
+
+}
+
void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
{
int mm_users = 0;
@@ -2817,6 +2852,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
if (mm_users == 1) {
mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
mm->numa_scan_seq = 0;
+ pan_init_numa(p);
}
}
p->node_stamp = 0;
--
2.25.1

2022-01-29 16:36:14

by Bharata B Rao

Subject: [RFC PATCH v0 2/3] sched/numa: Add cumulative history of per-process fault stats

From: Disha Talreja <[email protected]>

The cumulative history of local/remote (lr) and private/shared (ps)
faults will be used for calculating the adaptive scan period.
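
As a rough illustration of the accumulation done in pan_update_history()
below (numbers are made up): the locality history decays by 1/4 per scan
window before the new window's count is folded in, i.e.

    history = history + (new - history / 4)

so with history = 400 and 100 new local faults in this window, the
history stays at 400 + (100 - 400/4) = 400. In other words, the history
settles at roughly 4x the steady per-window count (2x for the
shared/private history, which decays by 1/2).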

Co-developed-by: Wei Huang <[email protected]>
Signed-off-by: Wei Huang <[email protected]>
Signed-off-by: Disha Talreja <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
include/linux/mm_types.h | 2 ++
kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++-
2 files changed, 50 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4f978c09d3db..2c6f119b947f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -614,6 +614,8 @@ struct mm_struct {
/* Process-based Adaptive NUMA */
atomic_long_t faults_locality[2];
atomic_long_t faults_shared[2];
+ unsigned long faults_locality_history[2];
+ unsigned long faults_shared_history[2];

spinlock_t pan_numa_lock;
unsigned int numa_scan_period;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1d6404b2d42e..4911b3841d00 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2102,14 +2102,56 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
/**********************************************/
/* Process-based Adaptive NUMA (PAN) Design */
/**********************************************/
+/*
+ * Update the cumulative history of local/remote and private/shared
+ * statistics. If the numbers are too small to be worth updating,
+ * return FALSE; otherwise return TRUE.
+ */
+static bool pan_update_history(struct task_struct *p)
+{
+ unsigned long local, remote, shared, private;
+ long diff;
+ int i;
+
+ remote = atomic_long_read(&p->mm->faults_locality[0]);
+ local = atomic_long_read(&p->mm->faults_locality[1]);
+ shared = atomic_long_read(&p->mm->faults_shared[0]);
+ private = atomic_long_read(&p->mm->faults_shared[1]);
+
+ /* skip if the activities in this window are too small */
+ if (local + remote < 100)
+ return false;
+
+ /* decay over the time window by 1/4 */
+ diff = local - (long)(p->mm->faults_locality_history[1] / 4);
+ p->mm->faults_locality_history[1] += diff;
+ diff = remote - (long)(p->mm->faults_locality_history[0] / 4);
+ p->mm->faults_locality_history[0] += diff;
+
+ /* decay over the time window by 1/2 */
+ diff = shared - (long)(p->mm->faults_shared_history[0] / 2);
+ p->mm->faults_shared_history[0] += diff;
+ diff = private - (long)(p->mm->faults_shared_history[1] / 2);
+ p->mm->faults_shared_history[1] += diff;
+
+ /* clear the statistics for the next window */
+ for (i = 0; i < 2; i++) {
+ atomic_long_set(&(p->mm->faults_locality[i]), 0);
+ atomic_long_set(&(p->mm->faults_shared[i]), 0);
+ }
+
+ return true;
+}
+
/*
* Updates mm->numa_scan_period under mm->pan_numa_lock.
- *
* Returns p->numa_scan_period now but updated to return
* p->mm->numa_scan_period in a later patch.
*/
static unsigned long pan_get_scan_period(struct task_struct *p)
{
+ pan_update_history(p);
+
return p->numa_scan_period;
}

@@ -2836,10 +2878,15 @@ static void task_numa_work(struct callback_head *work)
static void pan_init_numa(struct task_struct *p)
{
struct mm_struct *mm = p->mm;
+ int i;

spin_lock_init(&mm->pan_numa_lock);
mm->numa_scan_period = sysctl_numa_balancing_scan_delay;

+ for (i = 0; i < 2; i++) {
+ mm->faults_locality_history[i] = 0;
+ mm->faults_shared_history[i] = 0;
+ }
}

void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
--
2.25.1

2022-01-29 16:36:30

by Bharata B Rao

Subject: [RFC PATCH v0 3/3] sched/numa: Add adaptive scan period calculation

From: Disha Talreja <[email protected]>

This patch implements an adaptive algorithm for calculating
the autonuma scan period. In the existing mechanism of scan
period calculation,

- scan period is derived from the per-thread stats.
- static threshold (NUMA_PERIOD_THRESHOLD) is used for changing
the scan rate.

In this new approach (Process Adaptive autoNUMA), we gather NUMA
fault stats at per-process level which allows for capturing the
application behaviour better. In addition, the algorithm learns
and adjusts the scan rate based on remote fault rate. By not
sticking to a static threshold, the algorithm can respond better
to different workload behaviours.

Since the threads of a process are already considered as a group,
we add a bunch of metrics to the task's mm to track the various
types of faults and derive the scan rate from them.

The new per-process fault stats contribute only to the per-process
scan period calculation, while the existing per-thread stats
continue to contribute towards the numa_group stats which
eventually determine the thresholds for migrating memory and
threads across nodes.

In this algorithm, the remote fault rates are maintained for
the previous two scan windows. These historical remote fault
rates along with the remote fault rate from the current window
are used to determine the intended trend of the scanning period.

An increase in the trend implies an increased period thereby
resulting in slower scanning. A decrease in the trend implies
decreased period and hence faster scanning.

The intended trends for the last two windows are tracked and
the actual trend is reversed (thereby increasing or decreasing
the scan period in that window) only if the same trend reversal
has been intended in the previous two windows.

While the remote fault rate metric is derived from the accumulated
remote and local faults from all the threads of the mm, the
per-mm private and shared faults also contribute in deciding
the trend of the scan period.
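
As a rough worked example of how the period moves (numbers are
illustrative; the exact conditions under which the slope is recomputed
are in pan_changed_rate_update()/pan_const_rate_update() below):

    slope           = cur_remote_rate * 100 / prev_remote_rate
    new_scan_period = cur_scan_period * 100 / slope

    remote fault rate 40% -> 20%: slope = 50,  1000ms -> 2000ms (scan slower)
    remote fault rate 20% -> 40%: slope = 200, 1000ms ->  500ms (scan faster)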

Co-developed-by: Wei Huang <[email protected]>
Signed-off-by: Wei Huang <[email protected]>
Signed-off-by: Disha Talreja <[email protected]>
Signed-off-by: Bharata B Rao <[email protected]>
---
include/linux/mm_types.h | 5 +
kernel/sched/debug.c | 2 +
kernel/sched/fair.c | 265 ++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 2 +
4 files changed, 268 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c6f119b947f..d57cd96d8df0 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -619,6 +619,11 @@ struct mm_struct {

spinlock_t pan_numa_lock;
unsigned int numa_scan_period;
+ int remote_fault_rates[2]; /* histogram of remote fault rate */
+ long scanned_pages;
+ bool trend;
+ int slope;
+ u8 hist_trend;
#endif
/*
* An operation with batched TLB flushing is going on. Anything
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index aa29211de1bf..060bb46166a6 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -334,6 +334,8 @@ static __init int sched_init_debug(void)
debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balancing_scan_period_min);
debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balancing_scan_period_max);
debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_scan_size);
+ debugfs_create_u32("pan_scan_period_min", 0644, numa, &sysctl_pan_scan_period_min);
+ debugfs_create_u32("pan_scan_period_max", 0644, numa, &sysctl_pan_scan_period_max);
#endif

debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4911b3841d00..5a9cacfbf9ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1026,6 +1026,10 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;

+/* Clips of max and min scanning periods */
+unsigned int sysctl_pan_scan_period_min = 50;
+unsigned int sysctl_pan_scan_period_max = 5000;
+
struct numa_group {
refcount_t refcount;

@@ -2102,6 +2106,242 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
/**********************************************/
/* Process-based Adaptive NUMA (PAN) Design */
/**********************************************/
+#define SLOPE(N, D) ((N)/(D))
+
+static unsigned int pan_scan_max(struct task_struct *p)
+{
+ unsigned long smax, nr_scan_pages;
+ unsigned long rss = 0;
+
+ smax = sysctl_pan_scan_period_max;
+ nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+
+ rss = get_mm_rss(p->mm);
+ if (!rss)
+ rss = nr_scan_pages;
+
+ if (READ_ONCE(p->mm->numa_scan_seq) == 0) {
+ smax = p->mm->scanned_pages * sysctl_pan_scan_period_max;
+ smax = smax / rss;
+ smax = max_t(unsigned long, sysctl_pan_scan_period_min, smax);
+ }
+
+ return smax;
+}
+
+/*
+ * Process-based Adaptive NUMA scan period update algorithm
+ *
+ * These are the important concepts behind the scan period update:
+ *
+ * - increase trend (of scan period)
+ * scan period => up, memory coverage => down, overhead => down,
+ * accuracy => down
+ * - decrease trend
+ * scan period => down, memory coverage => up, overhead => up,
+ * accuracy => up
+ * - trend: Reflects the current active trend
+ * 1 means increasing trend, 0 means decreasing trend
+ * - slope
+ * it controls scan_period: new_scan_period = current_scan_period *
+ * 100 / slope
+ * - hist_trend: Reflects the intended trend in the last two
+ * windows. Uses the last two bits (bit0 and bit1) for the same.
+ * 1 if increasing trend was intended, 0 if decreasing was intended.
+ */
+
+/*
+ * Check if the scan period needs updating when the remote fault
+ * rate has changed (delta > 5).
+ *
+ * Returns TRUE if the scan period needs updating, else FALSE.
+ */
+static bool pan_changed_rate_update(struct mm_struct *mm, int ps_ratio,
+ int oldest_remote_fault_rate,
+ int fault_rate_diff)
+{
+ u8 value;
+
+ /*
+ * Set the intended trend for the current window.
+ * - If the remote fault rate has decreased, set the
+ * intended trend to increasing.
+ * - Otherwise leave the intended trend as decreasing.
+ */
+ mm->hist_trend = mm->hist_trend << 1;
+ if (fault_rate_diff < 5)
+ mm->hist_trend |= 0x01;
+
+ value = mm->hist_trend & 0x03;
+
+ if (fault_rate_diff < -5 && value == 3) {
+ /*
+ * The remote fault rate has decreased and the intended
+ * trend was set to increasing in the previous window.
+ *
+ * If on decreasing trend, reverse the trend and change
+ * the slope using the fault rates from (current-1)
+ * and (current-2) windows.
+ *
+ * If already on increasing trend, change the slope using
+ * the fault rates from (current) and (current-1) windows.
+ */
+ if (!mm->trend) {
+ mm->trend = true;
+ mm->slope = SLOPE(mm->remote_fault_rates[0] * 100,
+ oldest_remote_fault_rate);
+ } else {
+ mm->slope = SLOPE(mm->remote_fault_rates[1] * 100,
+ mm->remote_fault_rates[0]);
+ }
+ } else if (fault_rate_diff > 5 && value == 0) {
+ /*
+ * The remote fault rate has increased and the intended
+ * trend was set to decreasing in the previous window.
+ *
+ * If on increasing trend,
+ * - If shared fault ratio is more than 30%, don't yet
+ * reverse the trend, just mark the intended trend as
+ * increasing.
+ * - Otherwise reverse the trend. Change the slope using
+ * the fault rates from (current-1) and (current-2) windows.
+ *
+ * If on decreasing trend
+ * - Continue with a changed slope using the fault
+ * rates from (current) and (current-1) windows.
+ */
+ if (mm->trend) {
+ if (ps_ratio < 7) {
+ mm->hist_trend |= 0x01;
+ return true;
+ }
+
+ mm->trend = false;
+ mm->slope = SLOPE(mm->remote_fault_rates[0] * 100,
+ oldest_remote_fault_rate);
+ } else {
+ mm->slope = SLOPE(mm->remote_fault_rates[1] * 100,
+ mm->remote_fault_rates[0]);
+ }
+ } else if (value == 1 || value == 2) {
+ /*
+ * The intended trend is oscillating
+ *
+ * If on decreasing trend and the shared fault ratio
+ * is more than 30%, reverse the trend and change the slope.
+ *
+ * If on increasing trend, continue as is.
+ */
+ if (!mm->trend && ps_ratio < 7) {
+ mm->hist_trend |= 0x01;
+ mm->trend = true;
+ mm->slope = SLOPE(100 * 100,
+ 100 + ((7 - ps_ratio) * 10));
+ }
+ return false;
+ }
+ return true;
+}
+
+/*
+ * Check if the scan period needs updating when the remote fault
+ * rate has remained more or less the same (delta <= 5).
+ *
+ * Returns TRUE if the scan period needs updating, else FALSE.
+ */
+static bool pan_const_rate_update(struct mm_struct *mm, int ps_ratio,
+ int oldest_remote_fault_rate)
+{
+ int diff1, diff2;
+
+ mm->hist_trend = mm->hist_trend << 1;
+
+ /*
+ * If we are in the increasing trend, don't change anything
+ * except the intended trend for this window that was reset
+ * to decreasing by default.
+ */
+ if (mm->trend)
+ return false;
+
+ /* We are in the decreasing trend, reverse under some conditions. */
+ diff1 = oldest_remote_fault_rate - mm->remote_fault_rates[0];
+ diff2 = mm->remote_fault_rates[0] - mm->remote_fault_rates[1];
+
+ if (ps_ratio < 7) {
+ /*
+ * More than 30% of the pages are shared, so no point in
+ * further reducing the scan period. If increasing trend
+ * was intended in the previous window also, then reverse
+ * the trend to increasing. Else just record the increasing
+ * intended trend for this window and return.
+ */
+ mm->hist_trend |= 0x01;
+ if ((mm->hist_trend & 0x03) == 3) {
+ mm->trend = true;
+ mm->slope = SLOPE(100 * 100,
+ (100 + ((7 - ps_ratio) * 10)));
+ } else
+ return false;
+ } else if (diff1 >= 0 && diff2 >= 0 && mm->numa_scan_seq > 1) {
+ /*
+ * Remote fault rate has reduced successively in the last
+ * two windows and address space has been scanned at least
+ * once. If increasing trend was intended in the previous
+ * window also, then reverse the trend to increasing. Else
+ * just record the increasing trend for this window and return.
+ */
+ mm->hist_trend |= 0x01;
+ if ((mm->hist_trend & 0x03) == 3) {
+ mm->trend = true;
+ mm->slope = SLOPE(100 * 100, 110);
+ mm->hist_trend |= 0x03;
+ } else
+ return false;
+ }
+ return true;
+}
+
+static void pan_calculate_scan_period(struct task_struct *p)
+{
+ int remote_fault_rate, oldest_remote_fault_rate, ps_ratio, i, diff;
+ struct mm_struct *mm = p->mm;
+ unsigned long remote_hist = mm->faults_locality_history[0];
+ unsigned long local_hist = mm->faults_locality_history[1];
+ unsigned long shared_hist = mm->faults_shared_history[0];
+ unsigned long priv_hist = mm->faults_shared_history[1];
+ bool need_update;
+
+ ps_ratio = (priv_hist * 10) / (priv_hist + shared_hist + 1);
+ remote_fault_rate = (remote_hist * 100) / (local_hist + remote_hist + 1);
+
+ /* Keep the remote fault ratio at least 1% */
+ remote_fault_rate = max(remote_fault_rate, 1);
+ for (i = 0; i < 2; i++)
+ if (mm->remote_fault_rates[i] == 0)
+ mm->remote_fault_rates[i] = 1;
+
+ /* Shift right in mm->remote_fault_rates[] to keep track of history */
+ oldest_remote_fault_rate = mm->remote_fault_rates[0];
+ mm->remote_fault_rates[0] = mm->remote_fault_rates[1];
+ mm->remote_fault_rates[1] = remote_fault_rate;
+ diff = remote_fault_rate - oldest_remote_fault_rate;
+
+ if (abs(diff) <= 5)
+ need_update = pan_const_rate_update(mm, ps_ratio,
+ oldest_remote_fault_rate);
+ else
+ need_update = pan_changed_rate_update(mm, ps_ratio,
+ oldest_remote_fault_rate,
+ diff);
+
+ if (need_update) {
+ if (mm->slope == 0)
+ mm->slope = 100;
+ mm->numa_scan_period = (100 * mm->numa_scan_period) / mm->slope;
+ }
+}
+
/*
* Update the cumulative history of local/remote and private/shared
* statistics. If the numbers are too small to be worth updating,
@@ -2145,14 +2385,17 @@ static bool pan_update_history(struct task_struct *p)

/*
* Updates mm->numa_scan_period under mm->pan_numa_lock.
- * Returns p->numa_scan_period now but updated to return
- * p->mm->numa_scan_period in a later patch.
*/
static unsigned long pan_get_scan_period(struct task_struct *p)
{
- pan_update_history(p);
+ if (pan_update_history(p))
+ pan_calculate_scan_period(p);
+
+ p->mm->numa_scan_period = clamp(p->mm->numa_scan_period,
+ READ_ONCE(sysctl_pan_scan_period_min),
+ pan_scan_max(p));

- return p->numa_scan_period;
+ return p->mm->numa_scan_period;
}

/*
@@ -2860,6 +3103,7 @@ static void task_numa_work(struct callback_head *work)
mm->numa_scan_offset = start;
else
reset_ptenuma_scan(p);
+ mm->scanned_pages += ((sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT)) - pages);
mmap_read_unlock(mm);

/*
@@ -2882,10 +3126,15 @@ static void pan_init_numa(struct task_struct *p)

spin_lock_init(&mm->pan_numa_lock);
mm->numa_scan_period = sysctl_numa_balancing_scan_delay;
+ mm->scanned_pages = 0;
+ mm->trend = false;
+ mm->hist_trend = 0;
+ mm->slope = 100;

for (i = 0; i < 2; i++) {
mm->faults_locality_history[i] = 0;
mm->faults_shared_history[i] = 0;
+ mm->remote_fault_rates[i] = 1;
}
}

@@ -2948,6 +3197,9 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
return;

+ if (!spin_trylock(&curr->mm->pan_numa_lock))
+ return;
+
/*
* Using runtime rather than walltime has the dual advantage that
* we (mostly) drive the selection from busy threads and that the
@@ -2955,16 +3207,17 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
* NUMA placement.
*/
now = curr->se.sum_exec_runtime;
- period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
+ period = (u64)curr->mm->numa_scan_period * NSEC_PER_MSEC;

if (now > curr->node_stamp + period) {
if (!curr->node_stamp)
- curr->numa_scan_period = task_scan_start(curr);
+ curr->mm->numa_scan_period = task_scan_start(curr);
curr->node_stamp += period;

if (!time_before(jiffies, curr->mm->numa_next_scan))
task_work_add(curr, work, TWA_RESUME);
}
+ spin_unlock(&curr->mm->pan_numa_lock);
}

static void update_scan_period(struct task_struct *p, int new_cpu)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index de53be905739..635f96bc989d 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2424,6 +2424,8 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
extern unsigned int sysctl_numa_balancing_scan_period_min;
extern unsigned int sysctl_numa_balancing_scan_period_max;
extern unsigned int sysctl_numa_balancing_scan_size;
+extern unsigned int sysctl_pan_scan_period_min;
+extern unsigned int sysctl_pan_scan_period_max;
#endif

#ifdef CONFIG_SCHED_HRTICK
--
2.25.1

2022-02-01 20:13:32

by Mel Gorman

Subject: Re: [RFC PATCH v0 0/3] sched/numa: Process Adaptive autoNUMA

On Fri, Jan 28, 2022 at 10:58:48AM +0530, Bharata B Rao wrote:
> Hi,
>
> This patchset implements an adaptive algorithm for calculating the autonuma
> scan period.

autonuma refers to the khugepaged-like approach to NUMA balancing that
was later superseded by NUMA Balancing (NUMAB) and is generally reflected
by the naming e.g. git grep -i autonuma and note how few references there
are to autonuma versus numab or "NUMA balancing". I know MMTests still
refers to AutoNUMA but mostly because at the time it was written,
autoNUMA was what was being evaluated and I never updated the naming.

> In the existing mechanism of scan period calculation,
>
> - scan period is derived from the per-thread stats.
> - static threshold (NUMA_PERIOD_THRESHOLD) is used for changing the
> scan rate.
>
> In this new approach (Process Adaptive autoNUMA or PAN), we gather NUMA
> fault stats at per-process level which allows for capturing the application
> behaviour better. In addition, the algorithm learns and adjusts the scan
> rate based on remote fault rate. By not sticking to a static threshold, the
> algorithm can respond better to different workload behaviours.
>

NUMA Balancing is concerned with threads (task) and an address space (mm)
so basing the naming on Address Space rather than process may be more
appropriate although I admit the acronym is not as snappy.

> Since the threads of a process are already considered as a group,
> we add a bunch of metrics to the task's mm to track the various
> types of faults and derive the scan rate from them.
>

Enumerate the types of faults and note how the per-thread and
per-address-space metrics are related.

> The new per-process fault stats contribute only to the per-process
> scan period calculation, while the existing per-thread stats continue
> to contribute towards the numa_group stats which eventually
> determine the thresholds for migrating memory and threads
> across nodes.
>
> This patchset has been tested with a bunch of benchmarks on the
> following system:
>

Please include the comparisons of both the headline metrics and notes on
the change in scan rates in the changelog of the patch. Not all people
have access to Google Drive and it is not guaranteed to remain forever.
Similarly, the leader is not guaranteed to appear in the git history.

> ------------------------------------------------------
> % gain of PAN vs default (Avg of 3 runs)
> ------------------------------------------------------
> NAS-BT -0.17
> NAS-CG +9.39
> NAS-MG +8.19
> NAS-FT +2.23
> Hashjoin +0.58
> Graph500 +14.93
> Pagerank +0.37



> ------------------------------------------------------
> Default PAN %diff
> ------------------------------------------------------
> NUMA hint faults(Total of 3 runs)
> ------------------------------------------------------
> NAS-BT 758282358 539850429 +29
> NAS-CG 2179458823 1180301361 +46
> NAS-MG 517641172 346066391 +33
> NAS-FT 297044964 230033861 +23
> Hashjoin 201684863 268436275 -33
> Graph500 261808733 154338827 +41
> Pagerank 217917818 211260310 +03
> ------------------------------------------------------
> Migrations(Total of 3 runs)
> ------------------------------------------------------
> NAS-BT 106888517 86482076 +19
> NAS-CG 81191368 12859924 +84
> NAS-MG 83927451 39651254 +53
> NAS-FT 61807715 38934618 +37
> Hashjoin 45406983 59828843 -32
> Graph500 22798837 21560714 +05
> Pagerank 59072135 44968673 +24
> ------------------------------------------------------
>
> And here are some tests from a few microbenchmarks of mmtests suite.
> (The results are trimmed a bit here, the complete results can
> be viewed in the above mentioned link)
>
> Hackbench
> ---------
> hackbench-process-pipes
> hackbench hackbench
> default pan
> Min 256 23.5510 ( 0.00%) 23.1900 ( 1.53%)
> Amean 256 24.4604 ( 0.00%) 24.0353 * 1.74%*
> Stddev 256 0.4420 ( 0.00%) 0.7611 ( -72.18%)
> CoeffVar 256 1.8072 ( 0.00%) 3.1666 ( -75.22%)
> Max 256 25.4930 ( 0.00%) 30.5450 ( -19.82%)
> BAmean-50 256 24.1074 ( 0.00%) 23.6616 ( 1.85%)
> BAmean-95 256 24.4111 ( 0.00%) 23.9308 ( 1.97%)
> BAmean-99 256 24.4499 ( 0.00%) 23.9696 ( 1.96%)
>
> hackbench hackbench
> default pan
> Duration User 25810.02 25158.93
> Duration System 276322.70 271729.32
> Duration Elapsed 2707.75 2671.33
>

> hackbench hackbench
> default pan
> Ops NUMA alloc hit 1082415453.00 1088025994.00
> Ops NUMA alloc miss 0.00 0.00
> Ops NUMA interleave hit 0.00 0.00
> Ops NUMA alloc local 1082415441.00 1088025974.00
> Ops NUMA base-page range updates 33475.00 228900.00
> Ops NUMA PTE updates 33475.00 228900.00
> Ops NUMA PMD updates 0.00 0.00
> Ops NUMA hint faults 15758.00 222100.00
> Ops NUMA hint local faults % 15371.00 214570.00
> Ops NUMA hint local percent 97.54 96.61
> Ops NUMA pages migrated 235.00 4029.00
> Ops AutoNUMA cost 79.03 1112.18
>

Hackbench processes are generally short-lived enough that NUMA balancing
has a marginal impact. Interesting though that updates and hints were
increased by a lot relatively speaking.

> tbench
> ------
> tbench4
> tbench tbench
> default pan
> Hmean 1 436.89 ( 0.00%) 432.73 * -0.95%*
> Hmean 2 834.27 ( 0.00%) 848.11 * 1.66%*
> Hmean 4 1629.50 ( 0.00%) 1614.22 * -0.94%*
> Hmean 8 2944.06 ( 0.00%) 3031.66 * 2.98%*
> Hmean 16 5418.25 ( 0.00%) 5674.74 * 4.73%*
> Hmean 32 9959.60 ( 0.00%) 9009.82 * -9.54%*
> Hmean 64 13999.14 ( 0.00%) 12160.51 * -13.13%*
> Hmean 128 16797.09 ( 0.00%) 16506.14 * -1.73%*
> Hmean 256 25344.27 ( 0.00%) 25683.66 * 1.34%*
> Hmean 512 25289.03 ( 0.00%) 25513.77 * 0.89%*
> BHmean-50 1 437.13 ( 0.00%) 433.01 ( -0.94%)
> BHmean-50 2 836.35 ( 0.00%) 848.85 ( 1.49%)
> BHmean-50 4 1631.39 ( 0.00%) 1618.43 ( -0.79%)
> BHmean-50 8 2948.25 ( 0.00%) 3037.86 ( 3.04%)
> BHmean-50 16 5425.17 ( 0.00%) 5684.25 ( 4.78%)
> BHmean-50 32 9969.17 ( 0.00%) 9034.06 ( -9.38%)
> BHmean-50 64 14013.93 ( 0.00%) 12202.07 ( -12.93%)
> BHmean-50 128 16881.94 ( 0.00%) 16571.27 ( -1.84%)
> BHmean-50 256 25379.59 ( 0.00%) 25819.18 ( 1.73%)
> BHmean-50 512 25435.41 ( 0.00%) 25718.02 ( 1.11%)
> BHmean-95 1 436.92 ( 0.00%) 432.81 ( -0.94%)
> BHmean-95 2 834.59 ( 0.00%) 848.23 ( 1.63%)
> BHmean-95 4 1629.73 ( 0.00%) 1614.83 ( -0.91%)
> BHmean-95 8 2945.02 ( 0.00%) 3032.19 ( 2.96%)
> BHmean-95 16 5418.86 ( 0.00%) 5675.91 ( 4.74%)
> BHmean-95 32 9962.57 ( 0.00%) 9014.17 ( -9.52%)
> BHmean-95 64 14002.44 ( 0.00%) 12164.32 ( -13.13%)
> BHmean-95 128 16820.56 ( 0.00%) 16522.82 ( -1.77%)
> BHmean-95 256 25347.34 ( 0.00%) 25692.56 ( 1.36%)
> BHmean-95 512 25302.10 ( 0.00%) 25528.52 ( 0.89%)
> BHmean-99 1 436.90 ( 0.00%) 432.75 ( -0.95%)
> BHmean-99 2 834.35 ( 0.00%) 848.17 ( 1.66%)
> BHmean-99 4 1629.57 ( 0.00%) 1614.38 ( -0.93%)
> BHmean-99 8 2944.36 ( 0.00%) 3031.77 ( 2.97%)
> BHmean-99 16 5418.40 ( 0.00%) 5675.01 ( 4.74%)
> BHmean-99 32 9961.01 ( 0.00%) 9011.43 ( -9.53%)
> BHmean-99 64 14000.68 ( 0.00%) 12161.34 ( -13.14%)
> BHmean-99 128 16803.44 ( 0.00%) 16511.94 ( -1.73%)
> BHmean-99 256 25344.93 ( 0.00%) 25685.57 ( 1.34%)
> BHmean-99 512 25291.87 ( 0.00%) 25516.94 ( 0.89%)
>
> tbench tbench
> default pan
> Duration User 8482.50 8289.35
> Duration System 49462.63 49364.56
> Duration Elapsed 2217.10 2217.08
>
> tbench tbench
> default pan
> Ops NUMA alloc hit 388738400.00 378941469.00
> Ops NUMA alloc miss 0.00 0.00
> Ops NUMA interleave hit 0.00 0.00
> Ops NUMA alloc local 388738391.00 378941455.00
> Ops NUMA base-page range updates 266760.00 266275.00
> Ops NUMA PTE updates 266760.00 266275.00
> Ops NUMA PMD updates 0.00 0.00
> Ops NUMA hint faults 241547.00 257790.00
> Ops NUMA hint local faults % 145814.00 126410.00
> Ops NUMA hint local percent 60.37 49.04
> Ops NUMA pages migrated 51535.00 66083.00
> Ops AutoNUMA cost 1210.58 1292.07
>

Not much change.

> dbench
> ------
> dbench4 Latency
> dbench dbench
> default pan
> Amean latency-1 2.02 ( 0.00%) 2.05 * -1.52%*
> Amean latency-2 2.60 ( 0.00%) 2.55 * 1.64%*
> Amean latency-4 3.52 ( 0.00%) 3.56 * -1.17%*
> Amean latency-8 12.79 ( 0.00%) 11.83 * 7.49%*
> Amean latency-16 23.33 ( 0.00%) 19.09 * 18.19%*
> Amean latency-32 19.30 ( 0.00%) 18.83 * 2.43%*
> Amean latency-64 25.32 ( 0.00%) 24.30 * 4.00%*
> Amean latency-128 45.25 ( 0.00%) 42.93 * 5.13%*
> Amean latency-1024 0.00 ( 0.00%) 0.00 * 0.00%*
> BAmean-50 latency-1 1.65 ( 0.00%) 1.74 ( -5.16%)
> BAmean-50 latency-2 2.10 ( 0.00%) 2.10 ( -0.13%)
> BAmean-50 latency-4 2.65 ( 0.00%) 2.71 ( -2.28%)
> BAmean-50 latency-8 6.21 ( 0.00%) 4.64 ( 25.30%)
> BAmean-50 latency-16 17.64 ( 0.00%) 14.08 ( 20.16%)
> BAmean-50 latency-32 15.58 ( 0.00%) 15.90 ( -2.07%)
> BAmean-50 latency-64 20.76 ( 0.00%) 20.31 ( 2.15%)
> BAmean-50 latency-128 36.22 ( 0.00%) 34.85 ( 3.80%)
> BAmean-50 latency-1024 0.00 ( 0.00%) 0.00 ( 0.00%)
> BAmean-95 latency-1 1.88 ( 0.00%) 1.94 ( -3.17%)
> BAmean-95 latency-2 2.25 ( 0.00%) 2.26 ( -0.26%)
> BAmean-95 latency-4 3.00 ( 0.00%) 3.08 ( -2.71%)
> BAmean-95 latency-8 11.66 ( 0.00%) 10.03 ( 13.97%)
> BAmean-95 latency-16 22.30 ( 0.00%) 17.68 ( 20.73%)
> BAmean-95 latency-32 17.95 ( 0.00%) 17.70 ( 1.38%)
> BAmean-95 latency-64 23.57 ( 0.00%) 22.72 ( 3.62%)
> BAmean-95 latency-128 42.44 ( 0.00%) 39.96 ( 5.84%)
> BAmean-95 latency-1024 0.00 ( 0.00%) 0.00 ( 0.00%)
> BAmean-99 latency-1 1.90 ( 0.00%) 1.96 ( -3.30%)
> BAmean-99 latency-2 2.38 ( 0.00%) 2.37 ( 0.48%)
> BAmean-99 latency-4 3.24 ( 0.00%) 3.34 ( -3.26%)
> BAmean-99 latency-8 12.34 ( 0.00%) 10.71 ( 13.27%)
> BAmean-99 latency-16 22.79 ( 0.00%) 18.27 ( 19.82%)
> BAmean-99 latency-32 18.68 ( 0.00%) 18.32 ( 1.93%)
> BAmean-99 latency-64 24.69 ( 0.00%) 23.69 ( 4.06%)
> BAmean-99 latency-128 44.44 ( 0.00%) 42.15 ( 5.17%)
> BAmean-99 latency-1024 0.00 ( 0.00%) 0.00 ( 0.00%)
>
> dbench4 Throughput (misleading but traditional)
> dbench dbench
> default pan
> Hmean 1 505.12 ( 0.00%) 492.96 * -2.41%*
> Hmean 2 824.14 ( 0.00%) 824.06 * -0.01%*
> Hmean 4 1174.61 ( 0.00%) 1207.86 * 2.83%*
> Hmean 8 1665.10 ( 0.00%) 1667.27 * 0.13%*
> Hmean 16 2215.59 ( 0.00%) 2160.93 * -2.47%*
> Hmean 32 2727.05 ( 0.00%) 2633.26 * -3.44%*
> Hmean 64 3128.64 ( 0.00%) 3098.73 * -0.96%*
> Hmean 128 3282.89 ( 0.00%) 3340.26 * 1.75%*
> Hmean 1024 2551.02 ( 0.00%) 2559.41 * 0.33%*
> BHmean-50 1 509.87 ( 0.00%) 495.10 ( -2.90%)
> BHmean-50 2 829.35 ( 0.00%) 828.14 ( -0.15%)
> BHmean-50 4 1182.38 ( 0.00%) 1219.30 ( 3.12%)
> BHmean-50 8 1678.49 ( 0.00%) 1678.83 ( 0.02%)
> BHmean-50 16 2251.01 ( 0.00%) 2194.52 ( -2.51%)
> BHmean-50 32 2751.39 ( 0.00%) 2678.45 ( -2.65%)
> BHmean-50 64 3189.69 ( 0.00%) 3154.45 ( -1.10%)
> BHmean-50 128 3396.18 ( 0.00%) 3451.59 ( 1.63%)
> BHmean-50 1024 2836.80 ( 0.00%) 2836.84 ( 0.00%)
> BHmean-95 1 506.13 ( 0.00%) 493.24 ( -2.55%)
> BHmean-95 2 824.84 ( 0.00%) 824.30 ( -0.06%)
> BHmean-95 4 1175.91 ( 0.00%) 1208.57 ( 2.78%)
> BHmean-95 8 1666.46 ( 0.00%) 1668.22 ( 0.11%)
> BHmean-95 16 2219.59 ( 0.00%) 2163.86 ( -2.51%)
> BHmean-95 32 2731.26 ( 0.00%) 2640.34 ( -3.33%)
> BHmean-95 64 3144.73 ( 0.00%) 3108.59 ( -1.15%)
> BHmean-95 128 3306.51 ( 0.00%) 3363.33 ( 1.72%)
> BHmean-95 1024 2658.37 ( 0.00%) 2668.88 ( 0.40%)
> BHmean-99 1 505.37 ( 0.00%) 493.08 ( -2.43%)
> BHmean-99 2 824.31 ( 0.00%) 824.12 ( -0.02%)
> BHmean-99 4 1174.94 ( 0.00%) 1208.02 ( 2.81%)
> BHmean-99 8 1665.40 ( 0.00%) 1667.48 ( 0.12%)
> BHmean-99 16 2216.51 ( 0.00%) 2161.60 ( -2.48%)
> BHmean-99 32 2728.09 ( 0.00%) 2635.09 ( -3.41%)
> BHmean-99 64 3135.81 ( 0.00%) 3102.12 ( -1.07%)
> BHmean-99 128 3291.11 ( 0.00%) 3349.16 ( 1.76%)
> BHmean-99 1024 2645.54 ( 0.00%) 2655.67 ( 0.38%)
>
>
> dbench dbench
> default pan
> Duration User 822.55 827.85
> Duration System 8384.99 8164.83
> Duration Elapsed 1671.36 1670.74
>
> dbench dbench
> default pan
> Ops NUMA alloc hit 183324626.00 182350114.00
> Ops NUMA alloc miss 0.00 0.00
> Ops NUMA interleave hit 0.00 0.00
> Ops NUMA alloc local 183324508.00 182350004.00
> Ops NUMA base-page range updates 181531.00 515929.00
> Ops NUMA PTE updates 181531.00 515929.00
> Ops NUMA PMD updates 0.00 0.00
> Ops NUMA hint faults 162742.00 510979.00
> Ops NUMA hint local faults % 120309.00 426848.00
> Ops NUMA hint local percent 73.93 83.54
> Ops NUMA pages migrated 37605.00 59519.00
> Ops AutoNUMA cost 815.70 2559.64
>

More hinting faults and migrations

> Netperf-RR
> ----------
> netperf-udp-rr
> netperf netperf
> rr-default rr-pan
> Min 1 104915.69 ( 0.00%) 104505.71 ( -0.39%)
> Hmean 1 105865.46 ( 0.00%) 105899.22 * 0.03%*
> Stddev 1 528.45 ( 0.00%) 881.92 ( -66.89%)
> CoeffVar 1 0.50 ( 0.00%) 0.83 ( -66.83%)
> Max 1 106410.28 ( 0.00%) 107196.52 ( 0.74%)
> BHmean-50 1 106232.53 ( 0.00%) 106568.26 ( 0.32%)
> BHmean-95 1 105972.05 ( 0.00%) 106056.35 ( 0.08%)
> BHmean-99 1 105972.05 ( 0.00%) 106056.35 ( 0.08%)
>
> netperf netperf
> rr-default rr-pan
> Duration User 11.20 10.74
> Duration System 202.40 201.32
> Duration Elapsed 303.09 303.08
>
> netperf netperf
> rr-default rr-pan
> Ops NUMA alloc hit 183999.00 183853.00
> Ops NUMA alloc miss 0.00 0.00
> Ops NUMA interleave hit 0.00 0.00
> Ops NUMA alloc local 183999.00 183853.00
> Ops NUMA base-page range updates 0.00 24370.00
> Ops NUMA PTE updates 0.00 24370.00
> Ops NUMA PMD updates 0.00 0.00
> Ops NUMA hint faults 539.00 24470.00
> Ops NUMA hint local faults % 539.00 24447.00
> Ops NUMA hint local percent 100.00 99.91
> Ops NUMA pages migrated 0.00 23.00
> Ops AutoNUMA cost 2.69 122.52
>

Netperf these days usually runs on the same node so NUMA balancing
triggers very rarely.

> netperf-tcp-rr
> netperf netperf
> rr-default rr-pan
> Min 1 96156.03 ( 0.00%) 96556.87 ( 0.42%)
> Hmean 1 96627.24 ( 0.00%) 97551.38 * 0.96%*
> Stddev 1 284.71 ( 0.00%) 637.74 (-123.99%)
> CoeffVar 1 0.29 ( 0.00%) 0.65 (-121.87%)
> Max 1 96974.45 ( 0.00%) 98554.94 ( 1.63%)
> BHmean-50 1 96840.81 ( 0.00%) 98067.19 ( 1.27%)
> BHmean-95 1 96679.89 ( 0.00%) 97663.14 ( 1.02%)
> BHmean-99 1 96679.89 ( 0.00%) 97663.14 ( 1.02%)
>
> netperf netperf
> rr-default rr-pan
> Duration User 10.21 10.26
> Duration System 207.90 208.28
> Duration Elapsed 302.99 303.02
>
> netperf netperf
> rr-default rr-pan
> Ops NUMA alloc hit 183669.00 183695.00
> Ops NUMA alloc miss 0.00 0.00
> Ops NUMA interleave hit 0.00 0.00
> Ops NUMA alloc local 183657.00 183695.00
> Ops NUMA base-page range updates 3949.00 38561.00
> Ops NUMA PTE updates 3949.00 38561.00
> Ops NUMA PMD updates 0.00 0.00
> Ops NUMA hint faults 4186.00 43328.00
> Ops NUMA hint local faults % 4100.00 43195.00
> Ops NUMA hint local percent 97.95 99.69
> Ops NUMA pages migrated 9.00 73.00
> Ops AutoNUMA cost 20.96 216.91
>

Same.

> Autonumabench
> -------------
> autonumabench
> autonumabench autonumabench
> default pan
> Amean syst-NUMA01 11664.40 ( 0.00%) 11616.17 * 0.41%*
> Amean syst-NUMA01_THREADLOCAL 0.24 ( 0.00%) 0.22 * 7.78%*
> Amean syst-NUMA02 1.55 ( 0.00%) 9.31 *-499.26%*
> Amean syst-NUMA02_SMT 1.14 ( 0.00%) 4.04 *-254.39%*
> Amean elsp-NUMA01 223.52 ( 0.00%) 221.43 * 0.93%*
> Amean elsp-NUMA01_THREADLOCAL 0.95 ( 0.00%) 0.94 * 0.76%*
> Amean elsp-NUMA02 6.83 ( 0.00%) 5.74 * 15.90%*
> Amean elsp-NUMA02_SMT 6.65 ( 0.00%) 6.25 * 5.97%*
> BAmean-50 syst-NUMA01 11455.44 ( 0.00%) 10985.76 ( 4.10%)
> BAmean-50 syst-NUMA01_THREADLOCAL 0.22 ( 0.00%) 0.21 ( 7.46%)
> BAmean-50 syst-NUMA02 1.11 ( 0.00%) 8.91 (-703.00%)
> BAmean-50 syst-NUMA02_SMT 0.94 ( 0.00%) 3.42 (-262.19%)
> BAmean-50 elsp-NUMA01 217.38 ( 0.00%) 214.03 ( 1.54%)
> BAmean-50 elsp-NUMA01_THREADLOCAL 0.94 ( 0.00%) 0.94 ( 0.35%)
> BAmean-50 elsp-NUMA02 6.66 ( 0.00%) 5.45 ( 18.08%)
> BAmean-50 elsp-NUMA02_SMT 6.50 ( 0.00%) 6.09 ( 6.31%)
> BAmean-95 syst-NUMA01 11611.74 ( 0.00%) 11448.30 ( 1.41%)
> BAmean-95 syst-NUMA01_THREADLOCAL 0.23 ( 0.00%) 0.22 ( 7.14%)
> BAmean-95 syst-NUMA02 1.27 ( 0.00%) 9.21 (-624.93%)
> BAmean-95 syst-NUMA02_SMT 0.97 ( 0.00%) 3.90 (-300.34%)
> BAmean-95 elsp-NUMA01 221.75 ( 0.00%) 218.53 ( 1.45%)
> BAmean-95 elsp-NUMA01_THREADLOCAL 0.94 ( 0.00%) 0.94 ( 0.53%)
> BAmean-95 elsp-NUMA02 6.75 ( 0.00%) 5.68 ( 15.81%)
> BAmean-95 elsp-NUMA02_SMT 6.61 ( 0.00%) 6.23 ( 5.82%)
> BAmean-99 syst-NUMA01 11611.74 ( 0.00%) 11448.30 ( 1.41%)
> BAmean-99 syst-NUMA01_THREADLOCAL 0.23 ( 0.00%) 0.22 ( 7.14%)
> BAmean-99 syst-NUMA02 1.27 ( 0.00%) 9.21 (-624.93%)
> BAmean-99 syst-NUMA02_SMT 0.97 ( 0.00%) 3.90 (-300.34%)
> BAmean-99 elsp-NUMA01 221.75 ( 0.00%) 218.53 ( 1.45%)
> BAmean-99 elsp-NUMA01_THREADLOCAL 0.94 ( 0.00%) 0.94 ( 0.53%)
> BAmean-99 elsp-NUMA02 6.75 ( 0.00%) 5.68 ( 15.81%)
> BAmean-99 elsp-NUMA02_SMT 6.61 ( 0.00%) 6.23 ( 5.82%)
>
> autonumabench autonumabench
> default pan
> Duration User 94363.43 94436.71
> Duration System 81671.72 81408.53
> Duration Elapsed 1676.81 1647.99
>
> autonumabench autonumabench
> default pan
> Ops NUMA alloc hit 539544115.00 539522029.00
> Ops NUMA alloc miss 0.00 0.00
> Ops NUMA interleave hit 0.00 0.00
> Ops NUMA alloc local 279025768.00 281735736.00
> Ops NUMA base-page range updates 69695169.00 84767502.00
> Ops NUMA PTE updates 69695169.00 84767502.00
> Ops NUMA PMD updates 0.00 0.00
> Ops NUMA hint faults 69691818.00 87895044.00
> Ops NUMA hint local faults % 56565519.00 65819747.00
> Ops NUMA hint local percent 81.17 74.88
> Ops NUMA pages migrated 5950362.00 8310169.00
> Ops AutoNUMA cost 349060.01 440226.49
>

More hinting faults and migrations. Not clear which sub-test exactly but
most likely NUMA02.

--
Mel Gorman
SUSE Labs

2022-02-01 20:13:35

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH v0 3/3] sched/numa: Add adaptive scan period calculation

On Fri, Jan 28, 2022 at 10:58:51AM +0530, Bharata B Rao wrote:
> From: Disha Talreja <[email protected]>
>
> This patch implements an adaptive algorithm for calculating
> the autonuma scan period. In the existing mechanism of scan
> period calculation,
>
> - scan period is derived from the per-thread stats.
> - static threshold (NUMA_PERIOD_THRESHOLD) is used for changing
> the scan rate.
>
> In this new approach (Process Adaptive autoNUMA), we gather NUMA
> fault stats at per-process level which allows for capturing the
> application behaviour better. In addition, the algorithm learns
> and adjusts the scan rate based on remote fault rate. By not
> sticking to a static threshold, the algorithm can respond better
> to different workload behaviours.
>

This appears to replace the per-task numa_scan_period with a per-mm
numa_scan_period. This likely leads to a more stable rate overall but it
potentially misses that some threads are more active than others and
may miss that different threads may have different local/remote faults and
private/shared faults. I think this may be smoothing the average while
potentially missing outliers.

After the patch, p->numa_scan_period appears to primarily affect if a
page is retried for migration but a lot of infrastructure is still left
behind and it's unclear what purpose it serves.

> Since the threads of a processes are already considered as a group,
> we add a bunch of metrics to the task's mm to track the various
> types of faults and derive the scan rate from them.
>
> The new per-process fault stats contribute only to the per-process
> scan period calculation, while the existing per-thread stats
> continue to contribute towards the numa_group stats which
> eventually determine the thresholds for migrating memory and
> threads across nodes.
>
> In this algorithm, the remote fault rates are maintained for
> the previous two scan windows. These historical remote fault
> rates along with the remote fault rate from the current window
> are used to determine the intended trend of the scanning period.
>
> An increase in the trend implies an increased period thereby
> resulting in slower scanning. A decrease in the trend implies
> decreased period and hence faster scanning.
>

Clarify what affects the trend in the changelog. e.g. how do differences
in local vs remote and private vs shared affect trend?

What happens if one thread is primarily local faults while another is
primarily remote faults, how does that affect the trend and overall
scan period? The per-task scanning is flawed in terms that more active
threads can scan address space regions that the task is uninterested in
but I worry that masking that with an address space average may delay
the correction of an imbalance in one thread because an overall trend
misses the details.

> The intended trends for the last two windows are tracked and
> the actual trend is reversed (thereby increasing or decreasing
> the scan period in that window) only if the same trend reversal
> has been intended in the previous two windows.
>
> While the remote fault rate metric is derived from the accumulated
> remote and local faults from all the threads of the mm, the
> per-mm private and shared faults also contribute in deciding
> the trend of the scan period.
>
> Co-developed-by: Wei Huang <[email protected]>
> Signed-off-by: Wei Huang <[email protected]>
> Signed-off-by: Disha Talreja <[email protected]>
> Signed-off-by: Bharata B Rao <[email protected]>
> ---
> include/linux/mm_types.h | 5 +
> kernel/sched/debug.c | 2 +
> kernel/sched/fair.c | 265 ++++++++++++++++++++++++++++++++++++++-
> kernel/sched/sched.h | 2 +
> 4 files changed, 268 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 2c6f119b947f..d57cd96d8df0 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -619,6 +619,11 @@ struct mm_struct {
>
> spinlock_t pan_numa_lock;
> unsigned int numa_scan_period;
> + int remote_fault_rates[2]; /* histogram of remote fault rate */
> + long scanned_pages;

Why signed? What happens if it wraps (either negative if signed or back
to 0 if unsigned)?

> + bool trend;
> + int slope;
> + u8 hist_trend;

Document the fields.

> #endif
> /*
> * An operation with batched TLB flushing is going on. Anything
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index aa29211de1bf..060bb46166a6 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -334,6 +334,8 @@ static __init int sched_init_debug(void)
> debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balancing_scan_period_min);
> debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balancing_scan_period_max);
> debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_scan_size);
> + debugfs_create_u32("pan_scan_period_min", 0644, numa, &sysctl_pan_scan_period_min);
> + debugfs_create_u32("pan_scan_period_max", 0644, numa, &sysctl_pan_scan_period_max);
> #endif

Update Documentation and what relationship if any scan_period_*_ms has
with pan_scan_period_*. Add the units to be consistent.

>
> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4911b3841d00..5a9cacfbf9ec 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1026,6 +1026,10 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
> /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
> unsigned int sysctl_numa_balancing_scan_delay = 1000;
>
> +/* Clips of max and min scanning periods */
> +unsigned int sysctl_pan_scan_period_min = 50;
> +unsigned int sysctl_pan_scan_period_max = 5000;
> +

Why are the periods different from the min/max for the previous per-task
values? (admittedly, those values were pulled out of a hat).

> struct numa_group {
> refcount_t refcount;
>
> @@ -2102,6 +2106,242 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
> /**********************************************/
> /* Process-based Adaptive NUMA (PAN) Design */
> /**********************************************/
> +#define SLOPE(N, D) ((N)/(D))
> +

Document. N/D implies numerator and denominator. Usage implies a
percentage change in remote faults but not always, and there are a lot of
magic numbers with limited explanation.
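
For reference, a minimal sketch (not code from the series; apply_slope() is
an illustrative name) of how the slope appears to scale the scan period,
based on the pan_calculate_scan_period() hunk further down:

/*
 * Illustrative only: a slope above 100 shrinks the period (faster
 * scanning), a slope below 100 grows it (slower scanning).
 */
static unsigned int apply_slope(unsigned int scan_period, int slope)
{
	if (slope == 0)
		slope = 100;	/* same guard as in the patch */
	return (100 * scan_period) / slope;
}

For example, if the remote fault rate used in the ratio fell from 40 to 20,
SLOPE(20 * 100, 40) = 50 and a 200ms period becomes 400ms, i.e. scanning
slows down as remote faults become rarer.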

> +static unsigned int pan_scan_max(struct task_struct *p)
> +{
> + unsigned long smax, nr_scan_pages;
> + unsigned long rss = 0;
> +
> + smax = sysctl_pan_scan_period_max;
> + nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
> +
> + rss = get_mm_rss(p->mm);
> + if (!rss)
> + rss = nr_scan_pages;
> +
> + if (READ_ONCE(p->mm->numa_scan_seq) == 0) {
> + smax = p->mm->scanned_pages * sysctl_pan_scan_period_max;
> + smax = smax / rss;
> + smax = max_t(unsigned long, sysctl_pan_scan_period_min, smax);
> + }
> +

rss is not necessarily related to virtual address space size e.g. sparse
mappings. May not be relevant but worth commenting on why it doesn't
matter.

> + return smax;
> +}
> +
> +/*
> + * Process-based Adaptive NUMA scan period update alogirthm
> + *
> + * These are the important concepts behind the scan period update:
> + *
> + * - increase trend (of scan period)
> + * scan period => up, memory coverage => down, overhead => down,
> + * accuracy => down
> + * - decrease trend
> + * scan period => down, memory coverage => up, overhead => up,
> + * accuracy => up
> + * - trend: Reflects the current active trend
> + * 1 means increasing trend, 0 means decreasing trend
> + * - slope
> + * it controls scan_period: new_scan_period = current_scan_period *
> + * 100 / slope
> + * - hist_trend: Reflects the intended trend in the last two
> + * windows. Uses the last two bits (bit0 and bit1) for the same.
> + * 1 if increasing trend was intended, 0 if decreasing was intended.
> + */
> +
> +/*
> + * Check if the scan period needs updation when the remote fault
> + * rate has changed (delta > 5)
> + *
> + * Returns TRUE if scan period needs updation, else FALSE.
> + */
> +static bool pan_changed_rate_update(struct mm_struct *mm, int ps_ratio,
> + int oldest_remote_fault_rate,
> + int fault_rate_diff)
> +{
> + u8 value;
> +
> + /*
> + * Set the intended trend for the current window.
> + * - If the remote fault rate has decreased, set the
> + * intended trend to increasing.
> + * - Otherwise leave the intended trend as decreasing.
> + */
> + mm->hist_trend = mm->hist_trend << 1;
> + if (fault_rate_diff < 5)
> + mm->hist_trend |= 0x01;
> +

Why 5? Presumably 50% but not clear.

> + value = mm->hist_trend & 0x03;
> +

Document better what is contained in this u8 value.

> + if (fault_rate_diff < -5 && value == 3) {
> + /*

Document magic numbers.

> + * The remote fault rate has decreased and the intended
> + * trend was set to increasing in the previous window.
> + *
> + * If on decreasing trend, reverse the trend and change
> + * the slope using the fault rates from (current-1)
> + * and (current-2) windows.
> + *
> + * If already on increasing trend, change the slope using
> + * the fault rates from (current) and (current-1) windows.
> + */
> + if (!mm->trend) {
> + mm->trend = true;
> + mm->slope = SLOPE(mm->remote_fault_rates[0] * 100,
> + oldest_remote_fault_rate);
> + } else {
> + mm->slope = SLOPE(mm->remote_fault_rates[1] * 100,
> + mm->remote_fault_rates[0]);
> + }
> + } else if (fault_rate_diff > 5 && value == 0) {
> + /*
> + * The remote fault rate has increased and the intended
> + * trend was set to decreasing in the previous window.
> + *
> + * If on increasing trend,
> + * - If shared fault ratio is more than 30%, don't yet
> + * reverse the trend, just mark the intended trend as
> + * increasing.
> + * - Otherwise reverse the trend. Change the slope using
> + * the fault rates from (current-1) and (current-2) windows.
> + *
> + * If on decreasing trend
> + * - Continue with a changed slope using the fault
> + * rates from (current) and (current-1) windows.
> + */
> + if (mm->trend) {
> + if (ps_ratio < 7) {
> + mm->hist_trend |= 0x01;
> + return true;
> + }
> +
> + mm->trend = false;
> + mm->slope = SLOPE(mm->remote_fault_rates[0] * 100,
> + oldest_remote_fault_rate);
> + } else {
> + mm->slope = SLOPE(mm->remote_fault_rates[1] * 100,
> + mm->remote_fault_rates[0]);
> + }
> + } else if (value == 1 || value == 2) {
> + /*
> + * The intended trend is oscillating
> + *
> + * If on decreasing trend and the shared fault ratio
> + * is more than 30%, reverse the trend and change the slope.
> + *
> + * If on increasing trend, continue as is.
> + */
> + if (!mm->trend && ps_ratio < 7) {
> + mm->hist_trend |= 0x01;
> + mm->trend = true;
> + mm->slope = SLOPE(100 * 100,
> + 100 + ((7 - ps_ratio) * 10));
> + }
> + return false;
> + }
> + return true;
> +}
> +
> +/*
> + * Check if the scan period needs updation when the remote fault
> + * rate has remained more or less the same (delta <= 5)
> + *
> + * Returns TRUE if scan period needs updation, else FALSE.
> + */


s/updation/updating/

> +static bool pan_const_rate_update(struct mm_struct *mm, int ps_ratio,
> + int oldest_remote_fault_rate)

Document the intent behind the difference between pan_const_rate_update
and pan_changed_rate_update.

> +{
> + int diff1, diff2;
> +

Clarify what diff1 and diff2 are the differences between in the naming.

> + mm->hist_trend = mm->hist_trend << 1;
> +
> + /*
> + * If we are in the increasing trend, don't change anything
> + * except the intended trend for this window that was reset
> + * to decreasing by default.
> + */
> + if (mm->trend)
> + return false;
> +
> + /* We are in the decreasing trend, reverse under some condidtions. */
> + diff1 = oldest_remote_fault_rate - mm->remote_fault_rates[0];
> + diff2 = mm->remote_fault_rates[0] - mm->remote_fault_rates[1];
> +
> + if (ps_ratio < 7) {
> + /*
> + * More than 30% of the pages are shared, so no point in
> + * further reducing the scan period. If increasing trend
> + * was intended in the previous window also, then reverse
> + * the trend to increasing. Else just record the increasing
> + * intended trend for this window and return.
> + */
> + mm->hist_trend |= 0x01;
> + if ((mm->hist_trend & 0x03) == 3) {
> + mm->trend = true;
> + mm->slope = SLOPE(100 * 100,
> + (100 + ((7 - ps_ratio) * 10)));
> + } else
> + return false;
> + } else if (diff1 >= 0 && diff2 >= 0 && mm->numa_scan_seq > 1) {
> + /*
> + * Remote fault rate has reduced successively in the last
> + * two windows and address space has been scanned at least
> + * once. If increasing trend was intended in the previous
> + * window also, then reverse the trend to increasing. Else
> + * just record the increasing trend for this window and return.
> + */
> + mm->hist_trend |= 0x01;
> + if ((mm->hist_trend & 0x03) == 3) {
> + mm->trend = true;
> + mm->slope = SLOPE(100 * 100, 110);
> + mm->hist_trend |= 0x03;
> + } else
> + return false;
> + }
> + return true;
> +}
> +
> +static void pan_calculate_scan_period(struct task_struct *p)
> +{
> + int remote_fault_rate, oldest_remote_fault_rate, ps_ratio, i, diff;
> + struct mm_struct *mm = p->mm;
> + unsigned long remote_hist = mm->faults_locality_history[0];
> + unsigned long local_hist = mm->faults_locality_history[1];
> + unsigned long shared_hist = mm->faults_shared_history[0];
> + unsigned long priv_hist = mm->faults_shared_history[1];
> + bool need_update;
> +
> + ps_ratio = (priv_hist * 10) / (priv_hist + shared_hist + 1);
> + remote_fault_rate = (remote_hist * 100) / (local_hist + remote_hist + 1);
> +
> + /* Keep the remote fault ratio at least 1% */
> + remote_fault_rate = max(remote_fault_rate, 1);
> + for (i = 0; i < 2; i++)
> + if (mm->remote_fault_rates[i] == 0)
> + mm->remote_fault_rates[i] = 1;
> +

What if there is one thread in the entire address space that is responsible
for all of the remote faults if it's a shared region? Does this skew the
scan rates for unrelated threads?

> + /* Shift right in mm->remote_fault_rates[] to keep track of history */
> + oldest_remote_fault_rate = mm->remote_fault_rates[0];
> + mm->remote_fault_rates[0] = mm->remote_fault_rates[1];
> + mm->remote_fault_rates[1] = remote_fault_rate;
> + diff = remote_fault_rate - oldest_remote_fault_rate;
> +
> + if (abs(diff) <= 5)
> + need_update = pan_const_rate_update(mm, ps_ratio,
> + oldest_remote_fault_rate);
> + else
> + need_update = pan_changed_rate_update(mm, ps_ratio,
> + oldest_remote_fault_rate,
> + diff);
> +
> + if (need_update) {
> + if (mm->slope == 0)
> + mm->slope = 100;
> + mm->numa_scan_period = (100 * mm->numa_scan_period) / mm->slope;
> + }
> +}
> +
> /*
> * Update the cumulative history of local/remote and private/shared
> * statistics. If the numbers are too small worthy of updating,
> @@ -2145,14 +2385,17 @@ static bool pan_update_history(struct task_struct *p)
>
> /*
> * Updates mm->numa_scan_period under mm->pan_numa_lock.
> - * Returns p->numa_scan_period now but updated to return
> - * p->mm->numa_scan_period in a later patch.
> */

But p->numa_scan_period still exists so it's harder to evaluate the
overall result.

> static unsigned long pan_get_scan_period(struct task_struct *p)
> {
> - pan_update_history(p);
> + if (pan_update_history(p))
> + pan_calculate_scan_period(p);
> +
> + p->mm->numa_scan_period = clamp(p->mm->numa_scan_period,
> + READ_ONCE(sysctl_pan_scan_period_min),
> + pan_scan_max(p));
>
> - return p->numa_scan_period;
> + return p->mm->numa_scan_period;
> }
>
> /*
> @@ -2860,6 +3103,7 @@ static void task_numa_work(struct callback_head *work)
> mm->numa_scan_offset = start;
> else
> reset_ptenuma_scan(p);
> + mm->scanned_pages += ((sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT)) - pages);
> mmap_read_unlock(mm);
>
> /*
> @@ -2882,10 +3126,15 @@ static void pan_init_numa(struct task_struct *p)
>
> spin_lock_init(&mm->pan_numa_lock);
> mm->numa_scan_period = sysctl_numa_balancing_scan_delay;
> + mm->scanned_pages = 0;
> + mm->trend = false;
> + mm->hist_trend = 0;
> + mm->slope = 100;
>
> for (i = 0; i < 2; i++) {
> mm->faults_locality_history[i] = 0;
> mm->faults_shared_history[i] = 0;
> + mm->remote_fault_rates[i] = 1;
> }
> }
>
> @@ -2948,6 +3197,9 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
> if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
> return;
>
> + if (!spin_trylock(&curr->mm->pan_numa_lock))
> + return;
> +
> /*
> * Using runtime rather than walltime has the dual advantage that
> * we (mostly) drive the selection from busy threads and that the

This potentially misses triggering of scans in general but again, the
more stable scan rates may be due to mm-wide averaging while missing
per-task specifics.

> @@ -2955,16 +3207,17 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
> * NUMA placement.
> */
> now = curr->se.sum_exec_runtime;
> - period = (u64)curr->numa_scan_period * NSEC_PER_MSEC;
> + period = (u64)curr->mm->numa_scan_period * NSEC_PER_MSEC;
>
> if (now > curr->node_stamp + period) {
> if (!curr->node_stamp)
> - curr->numa_scan_period = task_scan_start(curr);
> + curr->mm->numa_scan_period = task_scan_start(curr);
> curr->node_stamp += period;
>
> if (!time_before(jiffies, curr->mm->numa_next_scan))
> task_work_add(curr, work, TWA_RESUME);
> }
> + spin_unlock(&curr->mm->pan_numa_lock);
> }
>
> static void update_scan_period(struct task_struct *p, int new_cpu)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index de53be905739..635f96bc989d 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2424,6 +2424,8 @@ extern unsigned int sysctl_numa_balancing_scan_delay;
> extern unsigned int sysctl_numa_balancing_scan_period_min;
> extern unsigned int sysctl_numa_balancing_scan_period_max;
> extern unsigned int sysctl_numa_balancing_scan_size;
> +extern unsigned int sysctl_pan_scan_period_min;
> +extern unsigned int sysctl_pan_scan_period_max;
> #endif
>
> #ifdef CONFIG_SCHED_HRTICK
> --
> 2.25.1
>

--
Mel Gorman
SUSE Labs

2022-02-01 20:13:45

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period framework

On Fri, Jan 28, 2022 at 10:58:49AM +0530, Bharata B Rao wrote:
> From: Disha Talreja <[email protected]>
>
> Add a new framework that calculates autonuma scan period
> based on per-process NUMA fault stats.
>
> NUMA faults can be classified into different categories, such
> as local vs. remote, or private vs. shared. It is also important
> to understand such behavior from the perspective of a process.
> The per-process fault stats added here will be used for
> calculating the scan period in the adaptive NUMA algorithm.
>

Be more specific on how the local vs remote, private vs shared states
are reflections of per-task activity of the same.

> The actual scan period is still using the original value
> p->numa_scan_period before the real implementation is added in
> place in a later commit.
>
> Co-developed-by: Wei Huang <[email protected]>
> Signed-off-by: Wei Huang <[email protected]>
> Signed-off-by: Disha Talreja <[email protected]>
> Signed-off-by: Bharata B Rao <[email protected]>
> ---
> include/linux/mm_types.h | 7 +++++++
> kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 45 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 9db36dc5d4cf..4f978c09d3db 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -610,6 +610,13 @@ struct mm_struct {
>
> /* numa_scan_seq prevents two threads setting pte_numa */
> int numa_scan_seq;
> +
> + /* Process-based Adaptive NUMA */
> + atomic_long_t faults_locality[2];
> + atomic_long_t faults_shared[2];
> +
> + spinlock_t pan_numa_lock;

Document what this lock protects. In the context of this patch it appears
to protect a read of p->numa_scan_period and it's overkill to use a
spinlock for that. Also, given that it's a trylock, the task_numa_work
ends up doing no scanning or updates. This might have some value in
terms of avoiding multiple threads doing updates if they happen to start
at the same time but that's a narrow side-effect given the short hold
time of the lock.

> + unsigned int numa_scan_period;

Document how the per-mm numa_scan_period is related to the per-task
numa_scan_period.

Maybe it's done in a later patch, I haven't read that far yet.

> #endif
> /*
> * An operation with batched TLB flushing is going on. Anything
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 095b0aa378df..1d6404b2d42e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2099,6 +2099,20 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
> numa_group->active_nodes = active_nodes;
> }
>
> +/**********************************************/
> +/* Process-based Adaptive NUMA (PAN) Design */
> +/**********************************************/
> +/*
> + * Updates mm->numa_scan_period under mm->pan_numa_lock.
> + *
> + * Returns p->numa_scan_period now but updated to return
> + * p->mm->numa_scan_period in a later patch.
> + */
> +static unsigned long pan_get_scan_period(struct task_struct *p)
> +{
> + return p->numa_scan_period;
> +}
> +
> /*
> * When adapting the scan rate, the period is divided into NUMA_PERIOD_SLOTS
> * increments. The more local the fault statistics are, the higher the scan
> @@ -2616,6 +2630,9 @@ void task_numa_fault(int last_cpupid, int mem_node, int pages, int flags)
> task_numa_group(p, last_cpupid, flags, &priv);
> }
>
> + atomic_long_add(pages, &(p->mm->faults_locality[local]));
> + atomic_long_add(pages, &(p->mm->faults_shared[priv]));
> +
> /*
> * If a workload spans multiple NUMA nodes, a shared fault that
> * occurs wholly within the set of nodes that the workload is
> @@ -2702,12 +2719,20 @@ static void task_numa_work(struct callback_head *work)
> if (time_before(now, migrate))
> return;
>
> - if (p->numa_scan_period == 0) {
> + if (p->mm->numa_scan_period == 0) {
> + p->numa_scan_period_max = task_scan_max(p);
> + p->numa_scan_period = task_scan_start(p);
> + mm->numa_scan_period = p->numa_scan_period;
> + } else if (p->numa_scan_period == 0) {
> p->numa_scan_period_max = task_scan_max(p);
> p->numa_scan_period = task_scan_start(p);
> }
>
> - next_scan = now + msecs_to_jiffies(p->numa_scan_period);
> + if (!spin_trylock(&p->mm->pan_numa_lock))
> + return;
> + next_scan = now + msecs_to_jiffies(pan_get_scan_period(p));
> + spin_unlock(&p->mm->pan_numa_lock);
> +
> if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
> return;
>
> @@ -2807,6 +2832,16 @@ static void task_numa_work(struct callback_head *work)
> }
> }
>
> +/* Init Process-based Adaptive NUMA */
> +static void pan_init_numa(struct task_struct *p)
> +{
> + struct mm_struct *mm = p->mm;
> +
> + spin_lock_init(&mm->pan_numa_lock);
> + mm->numa_scan_period = sysctl_numa_balancing_scan_delay;
> +
> +}
> +
> void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
> {
> int mm_users = 0;
> @@ -2817,6 +2852,7 @@ void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
> if (mm_users == 1) {
> mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
> mm->numa_scan_seq = 0;
> + pan_init_numa(p);
> }
> }
> p->node_stamp = 0;
> --
> 2.25.1
>

--
Mel Gorman
SUSE Labs

2022-02-01 20:13:46

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH v0 2/3] sched/numa: Add cumulative history of per-process fault stats

On Fri, Jan 28, 2022 at 10:58:50AM +0530, Bharata B Rao wrote:
> From: Disha Talreja <[email protected]>
>
> The cumulative history of local/remote (lr) and private/shared (ps)
> will be used for calculating adaptive scan period.
>

How is it used to calculate the adaptive scan period?

As it is likely used in a later patch, note here that the per-thread
stats are simply accumulated in the address space for now.

> Co-developed-by: Wei Huang <[email protected]>
> Signed-off-by: Wei Huang <[email protected]>
> Signed-off-by: Disha Talreja <[email protected]>
> Signed-off-by: Bharata B Rao <[email protected]>
> ---
> include/linux/mm_types.h | 2 ++
> kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 50 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 4f978c09d3db..2c6f119b947f 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -614,6 +614,8 @@ struct mm_struct {
> /* Process-based Adaptive NUMA */
> atomic_long_t faults_locality[2];
> atomic_long_t faults_shared[2];
> + unsigned long faults_locality_history[2];
> + unsigned long faults_shared_history[2];
>
> spinlock_t pan_numa_lock;
> unsigned int numa_scan_period;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1d6404b2d42e..4911b3841d00 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2102,14 +2102,56 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
> /**********************************************/
> /* Process-based Adaptive NUMA (PAN) Design */
> /**********************************************/
> +/*
> + * Update the cumulative history of local/remote and private/shared
> + * statistics. If the numbers are too small worthy of updating,
> + * return FALSE, otherwise return TRUE.
> + */
> +static bool pan_update_history(struct task_struct *p)
> +{
> + unsigned long local, remote, shared, private;
> + long diff;
> + int i;
> +
> + remote = atomic_long_read(&p->mm->faults_locality[0]);
> + local = atomic_long_read(&p->mm->faults_locality[1]);
> + shared = atomic_long_read(&p->mm->faults_shared[0]);
> + private = atomic_long_read(&p->mm->faults_shared[1]);
> +
> + /* skip if the activities in this window are too small */
> + if (local + remote < 100)
> + return false;
> +

Why 100?

> + /* decay over the time window by 1/4 */
> + diff = local - (long)(p->mm->faults_locality_history[1] / 4);
> + p->mm->faults_locality_history[1] += diff;
> + diff = remote - (long)(p->mm->faults_locality_history[0] / 4);
> + p->mm->faults_locality_history[0] += diff;
> +
> + /* decay over the time window by 1/2 */
> + diff = shared - (long)(p->mm->faults_shared_history[0] / 2);
> + p->mm->faults_shared_history[0] += diff;
> + diff = private - (long)(p->mm->faults_shared_history[1] / 2);
> + p->mm->faults_shared_history[1] += diff;
> +

Why are the decay windows different?


> + /* clear the statistics for the next window */
> + for (i = 0; i < 2; i++) {
> + atomic_long_set(&(p->mm->faults_locality[i]), 0);
> + atomic_long_set(&(p->mm->faults_shared[i]), 0);
> + }
> +
> + return true;
> +}
> +
> /*
> * Updates mm->numa_scan_period under mm->pan_numa_lock.
> - *
> * Returns p->numa_scan_period now but updated to return
> * p->mm->numa_scan_period in a later patch.
> */

Spurious whitespace change.

> static unsigned long pan_get_scan_period(struct task_struct *p)
> {
> + pan_update_history(p);
> +
> return p->numa_scan_period;
> }
>

Ok, so the spinlock is protecting the RMW of the PAN history. It still
may be a concern that task_numa_work gets aborted if the spinlock cannot
be acquired.

> @@ -2836,10 +2878,15 @@ static void task_numa_work(struct callback_head *work)
> static void pan_init_numa(struct task_struct *p)
> {
> struct mm_struct *mm = p->mm;
> + int i;
>
> spin_lock_init(&mm->pan_numa_lock);
> mm->numa_scan_period = sysctl_numa_balancing_scan_delay;
>
> + for (i = 0; i < 2; i++) {
> + mm->faults_locality_history[i] = 0;
> + mm->faults_shared_history[i] = 0;
> + }
> }
>
> void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
> --
> 2.25.1
>

--
Mel Gorman
SUSE Labs

2022-02-02 17:20:24

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC PATCH v0 0/3] sched/numa: Process Adaptive autoNUMA

On 1/31/2022 5:47 PM, Mel Gorman wrote:
> On Fri, Jan 28, 2022 at 10:58:48AM +0530, Bharata B Rao wrote:
>> Hi,
>>
>> This patchset implements an adaptive algorithm for calculating the autonuma
>> scan period.
>
> autonuma refers to the khugepaged-like approach to NUMA balancing that
> was later superseded by NUMA Balancing (NUMAB) and is generally reflected
> by the naming e.g. git grep -i autonuma and note how few references there
> are to autonuma versus numab or "NUMA balancing". I know MMTests still
> refers to AutoNUMA but mostly because at the time it was written,
> autoNUMA was what was being evaluated and I never updated the naming.

Thanks. Noted and will use appropriate terminologies next time onward.

>
>> In the existing mechanism of scan period calculation,
>>
>> - scan period is derived from the per-thread stats.
>> - static threshold (NUMA_PERIOD_THRESHOLD) is used for changing the
>> scan rate.
>>
>> In this new approach (Process Adaptive autoNUMA or PAN), we gather NUMA
>> fault stats at per-process level which allows for capturing the application
>> behaviour better. In addition, the algorithm learns and adjusts the scan
>> rate based on remote fault rate. By not sticking to a static threshold, the
>> algorithm can respond better to different workload behaviours.
>>
>
> NUMA Balancing is concerned with threads (task) and an address space (mm)
> so basing the naming on Address Space rather than process may be more
> appropriate although I admit the acronym is not as snappy.

Sure, will think about more appropriate naming.

>
>> Since the threads of a processes are already considered as a group,
>> we add a bunch of metrics to the task's mm to track the various
>> types of faults and derive the scan rate from them.
>>
>
> Enumerate the types of faults and note how the per-thread and
> per-address-space metrics are related.

Sure, will list the types of faults and describe them.

Per-address-space metrics are essentially an aggregate of the existing
per-thread metrics. Unlike the existing task_numa_group mechanism, the
threads are implicitly/already considered part of the address space group
(p->mm).

>
>> The new per-process fault stats contribute only to the per-process
>> scan period calculation, while the existing per-thread stats continue
>> to contribute towards the numa_group stats which eventually
>> determine the thresholds for migrating memory and threads
>> across nodes.
>>
>> This patchset has been tested with a bunch of benchmarks on the
>> following system:
>>
>
> Please include the comparisons of both the headline metrics and notes on
> the change in scan rates in the changelog of the patch. Not all people
> have access to Google drive and it is not guaranteed to remain forever.
> Similarly, the leader is not guaranteed to appear in the git history

Sure, noted.

>
>> ------------------------------------------------------
>> % gain of PAN vs default (Avg of 3 runs)
>> ------------------------------------------------------
>> NAS-BT -0.17
>> NAS-CG +9.39
>> NAS-MG +8.19
>> NAS-FT +2.23
>> Hashjoin +0.58
>> Graph500 +14.93
>> Pagerank +0.37
>
>
>
>> ------------------------------------------------------
>> Default PAN %diff
>> ------------------------------------------------------
>> NUMA hint faults(Total of 3 runs)
>> ------------------------------------------------------
>> NAS-BT 758282358 539850429 +29
>> NAS-CG 2179458823 1180301361 +46
>> NAS-MG 517641172 346066391 +33
>> NAS-FT 297044964 230033861 +23
>> Hashjoin 201684863 268436275 -33
>> Graph500 261808733 154338827 +41
>> Pagerank 217917818 211260310 +03
>> ------------------------------------------------------
>> Migrations(Total of 3 runs)
>> ------------------------------------------------------
>> NAS-BT 106888517 86482076 +19
>> NAS-CG 81191368 12859924 +84
>> NAS-MG 83927451 39651254 +53
>> NAS-FT 61807715 38934618 +37
>> Hashjoin 45406983 59828843 -32
>> Graph500 22798837 21560714 +05
>> Pagerank 59072135 44968673 +24
>> ------------------------------------------------------
>>
>> And here are some tests from a few microbenchmarks of mmtests suite.
>> (The results are trimmed a bit here, the complete results can
>> be viewed in the above mentioned link)
>>
>> Hackbench
>> ---------
>> hackbench-process-pipes
>> hackbench hackbench
>> default pan
>> Min 256 23.5510 ( 0.00%) 23.1900 ( 1.53%)
>> Amean 256 24.4604 ( 0.00%) 24.0353 * 1.74%*
>> Stddev 256 0.4420 ( 0.00%) 0.7611 ( -72.18%)
>> CoeffVar 256 1.8072 ( 0.00%) 3.1666 ( -75.22%)
>> Max 256 25.4930 ( 0.00%) 30.5450 ( -19.82%)
>> BAmean-50 256 24.1074 ( 0.00%) 23.6616 ( 1.85%)
>> BAmean-95 256 24.4111 ( 0.00%) 23.9308 ( 1.97%)
>> BAmean-99 256 24.4499 ( 0.00%) 23.9696 ( 1.96%)
>>
>> hackbench hackbench
>> default pan
>> Duration User 25810.02 25158.93
>> Duration System 276322.70 271729.32
>> Duration Elapsed 2707.75 2671.33
>>
>
>> hackbench hackbench
>> default pan
>> Ops NUMA alloc hit 1082415453.00 1088025994.00
>> Ops NUMA alloc miss 0.00 0.00
>> Ops NUMA interleave hit 0.00 0.00
>> Ops NUMA alloc local 1082415441.00 1088025974.00
>> Ops NUMA base-page range updates 33475.00 228900.00
>> Ops NUMA PTE updates 33475.00 228900.00
>> Ops NUMA PMD updates 0.00 0.00
>> Ops NUMA hint faults 15758.00 222100.00
>> Ops NUMA hint local faults % 15371.00 214570.00
>> Ops NUMA hint local percent 97.54 96.61
>> Ops NUMA pages migrated 235.00 4029.00
>> Ops AutoNUMA cost 79.03 1112.18
>>
>
> Hackbench processes are generally short-lived enough that NUMA balancing
> has a marginal impact. Interesting though that updates and hints were
> increased by a lot relatively speaking.

Yes, the increased AutoNUMA cost seen mostly with these micro benchmarks
is not typically seen with the other benchmarks that we have listed at
the beginning, which we believe contributes to the gains that those
benchmarks see.

The algorithm tries aggressively to learn the application behaviour
at the beginning, and short-lived tasks will see more scanning than
the default.

Having said that, we need to investigate and check why some of these
micro benchmarks incur higher autonuma cost.

>
>> Netperf-RR
>> ----------
>> netperf-udp-rr
>> netperf netperf
>> rr-default rr-pan
>> Min 1 104915.69 ( 0.00%) 104505.71 ( -0.39%)
>> Hmean 1 105865.46 ( 0.00%) 105899.22 * 0.03%*
>> Stddev 1 528.45 ( 0.00%) 881.92 ( -66.89%)
>> CoeffVar 1 0.50 ( 0.00%) 0.83 ( -66.83%)
>> Max 1 106410.28 ( 0.00%) 107196.52 ( 0.74%)
>> BHmean-50 1 106232.53 ( 0.00%) 106568.26 ( 0.32%)
>> BHmean-95 1 105972.05 ( 0.00%) 106056.35 ( 0.08%)
>> BHmean-99 1 105972.05 ( 0.00%) 106056.35 ( 0.08%)
>>
>> netperf netperf
>> rr-default rr-pan
>> Duration User 11.20 10.74
>> Duration System 202.40 201.32
>> Duration Elapsed 303.09 303.08
>>
>> netperf netperf
>> rr-default rr-pan
>> Ops NUMA alloc hit 183999.00 183853.00
>> Ops NUMA alloc miss 0.00 0.00
>> Ops NUMA interleave hit 0.00 0.00
>> Ops NUMA alloc local 183999.00 183853.00
>> Ops NUMA base-page range updates 0.00 24370.00
>> Ops NUMA PTE updates 0.00 24370.00
>> Ops NUMA PMD updates 0.00 0.00
>> Ops NUMA hint faults 539.00 24470.00
>> Ops NUMA hint local faults % 539.00 24447.00
>> Ops NUMA hint local percent 100.00 99.91
>> Ops NUMA pages migrated 0.00 23.00
>> Ops AutoNUMA cost 2.69 122.52
>>
>
> Netperf these days usually runs on the same node so NUMA balancing
> triggers very rarely.

But we still see an increase in the hint faults; we need to investigate this.


>> autonumabench autonumabench
>> default pan
>> Duration User 94363.43 94436.71
>> Duration System 81671.72 81408.53
>> Duration Elapsed 1676.81 1647.99
>>
>> autonumabench autonumabench
>> default pan
>> Ops NUMA alloc hit 539544115.00 539522029.00
>> Ops NUMA alloc miss 0.00 0.00
>> Ops NUMA interleave hit 0.00 0.00
>> Ops NUMA alloc local 279025768.00 281735736.00
>> Ops NUMA base-page range updates 69695169.00 84767502.00
>> Ops NUMA PTE updates 69695169.00 84767502.00
>> Ops NUMA PMD updates 0.00 0.00
>> Ops NUMA hint faults 69691818.00 87895044.00
>> Ops NUMA hint local faults % 56565519.00 65819747.00
>> Ops NUMA hint local percent 81.17 74.88
>> Ops NUMA pages migrated 5950362.00 8310169.00
>> Ops AutoNUMA cost 349060.01 440226.49
>>
>
> More hinting faults and migrations. Not clear which sub-test exactly but
> most likely NUMA02.

I will have to run them separately and check.

Regards,
Bharata.

2022-02-02 17:58:08

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period framework

On Tue, Feb 01, 2022 at 05:52:55PM +0530, Bharata B Rao wrote:
> On 1/31/2022 5:47 PM, Mel Gorman wrote:
> > On Fri, Jan 28, 2022 at 10:58:49AM +0530, Bharata B Rao wrote:
> >> From: Disha Talreja <[email protected]>
> >>
> >> Add a new framework that calculates autonuma scan period
> >> based on per-process NUMA fault stats.
> >>
> >> NUMA faults can be classified into different categories, such
> >> as local vs. remote, or private vs. shared. It is also important
> >> to understand such behavior from the perspective of a process.
> >> The per-process fault stats added here will be used for
> >> calculating the scan period in the adaptive NUMA algorithm.
> >>
> >
> > Be more specific on how the local vs remote, private vs shared states
> > are reflections of per-task activity of the same.
>
> Sure, will document the algorithm better. However the overall thinking
> here is that the address-space scanning is a per-process activity and
> hence the scan period value derived from the accumulated per-process
> faults is more appropriate than calculating per-task (per-thread) scan
> periods. Participating threads may have their local/shared and private/shared
> behaviors, but when aggregated at the process level, it gives a better
> input for eventual scan period variation. The understanding is that individual
> thread fault rates will start altering the overall process metrics in
> such a manner that we respond by changing the scan rate to do more aggressive
> or less aggressive scanning.
>

I don't have anything to add on your other responses as it would mostly
be an acknowledgment of your response.

However, the major concern I have is that address-space wide decisions
on scan rates have no sensible means of adapting to thread-specific
requirements. I completely agree that it will result in more stable scan
rates, particularly the adjustments. It also side-steps a problem where
new threads may start with a scan rate that is completely inappropriate.

However, I worry that it would be limited overall because each thread
potentially has unique behaviour which is not obvious in a workload like
NAS where threads are all executing similar instructions on different
data. For other applications, threads may operate on thread-local areas
only (low scan rate), others could operate on shared-only regions (high
scan rate until back off and interleave), threads can have phase behaviour
(manager thread collecting data from worker threads) and threads can have
different lifetimes and phase behaviour. Each thread would have a different
optimal scan rate to decide if memory needs to be migrated to a local node
or not. I don't see how address-space wide statistics could ever be mapped
back to threads to adapt scan rates based on thread-specific behaviour.

Thread scanning on the other hand can be improved in multiple ways. If
nothing else, they can do redundant scanning of regions that are
not relevant to a task, which gets increasingly problematic when VSZ
increases. The obvious problems are

1. Scan based on page table updates, not address ranges to mitigate
problems with THP vs base page updates

2. Move scan delay to be a per-vma structure that is kmalloced if
necessary instead of being address space wide.

3. Track what threads access a VMA. The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately what
threads access a VMA. Skip VMAs that did not trap a fault. This would
be approximate because of PID collisions but would reduce scanning
of areas the thread is not interested in (see the sketch after this list)

4. Track active regions within VMAs. Very coarse tracking, use unsigned
long to trap what ranges are active
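
A minimal, hypothetical sketch of item 3 above (the helper names and the
per-VMA mask are assumptions for illustration only, not existing kernel
interfaces):

/*
 * Hypothetical sketch of item 3: tag each VMA with a coarse hash of the
 * PIDs that faulted in it, one bit per (pid % BITS_PER_LONG).  A PID
 * collision only causes extra scanning, never a missed fault.
 */
static inline void vma_pids_record(unsigned long *pid_mask, pid_t pid)
{
	*pid_mask |= 1UL << (pid % BITS_PER_LONG);
}

static inline bool vma_pids_might_access(unsigned long pid_mask, pid_t pid)
{
	return pid_mask & (1UL << (pid % BITS_PER_LONG));
}

The scanner would call vma_pids_might_access() and skip VMAs whose mask does
not include the current thread, resetting the mask periodically so stale
bits age out.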

In different ways, this would reduce the amount of scanning work threads
do and focus them on regions of relevance to reduce overhead overall
without losing thread-specific details.

Unfortunately, I have not had the time yet to prototype anything.

--
Mel Gorman
SUSE Labs

2022-02-02 18:46:36

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC PATCH v0 2/3] sched/numa: Add cumulative history of per-process fault stats


On 1/31/2022 5:47 PM, Mel Gorman wrote:
> On Fri, Jan 28, 2022 at 10:58:50AM +0530, Bharata B Rao wrote:
>> From: Disha Talreja <[email protected]>
>>
>> The cumulative history of local/remote (lr) and private/shared (ps)
>> will be used for calculating adaptive scan period.
>>
>
> How is it used to calculate the adaptive scan period?

Fault stats from different windows are accumulated and the cumulative stats
are used to arrive at the per-mm scan period, unlike in the current case
where the stats from the last window determine the per-task scan period.

>
> As it is likely used in a later patch, note here that the per-thread
> stats are simply accumulated in the address space for now.

Yes, will make that clear in the patch description here.

>
>> Co-developed-by: Wei Huang <[email protected]>
>> Signed-off-by: Wei Huang <[email protected]>
>> Signed-off-by: Disha Talreja <[email protected]>
>> Signed-off-by: Bharata B Rao <[email protected]>
>> ---
>> include/linux/mm_types.h | 2 ++
>> kernel/sched/fair.c | 49 +++++++++++++++++++++++++++++++++++++++-
>> 2 files changed, 50 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 4f978c09d3db..2c6f119b947f 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -614,6 +614,8 @@ struct mm_struct {
>> /* Process-based Adaptive NUMA */
>> atomic_long_t faults_locality[2];
>> atomic_long_t faults_shared[2];
>> + unsigned long faults_locality_history[2];
>> + unsigned long faults_shared_history[2];
>>
>> spinlock_t pan_numa_lock;
>> unsigned int numa_scan_period;
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1d6404b2d42e..4911b3841d00 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2102,14 +2102,56 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
>> /**********************************************/
>> /* Process-based Adaptive NUMA (PAN) Design */
>> /**********************************************/
>> +/*
>> + * Update the cumulative history of local/remote and private/shared
>> + * statistics. If the numbers are too small worthy of updating,
>> + * return FALSE, otherwise return TRUE.
>> + */
>> +static bool pan_update_history(struct task_struct *p)
>> +{
>> + unsigned long local, remote, shared, private;
>> + long diff;
>> + int i;
>> +
>> + remote = atomic_long_read(&p->mm->faults_locality[0]);
>> + local = atomic_long_read(&p->mm->faults_locality[1]);
>> + shared = atomic_long_read(&p->mm->faults_shared[0]);
>> + private = atomic_long_read(&p->mm->faults_shared[1]);
>> +
>> + /* skip if the activities in this window are too small */
>> + if (local + remote < 100)
>> + return false;
>> +
>
> Why 100?

We need some minimum number of faults to make a decision and
figured 100 would be a good minimum here.

>
>> + /* decay over the time window by 1/4 */
>> + diff = local - (long)(p->mm->faults_locality_history[1] / 4);
>> + p->mm->faults_locality_history[1] += diff;
>> + diff = remote - (long)(p->mm->faults_locality_history[0] / 4);
>> + p->mm->faults_locality_history[0] += diff;
>> +
>> + /* decay over the time window by 1/2 */
>> + diff = shared - (long)(p->mm->faults_shared_history[0] / 2);
>> + p->mm->faults_shared_history[0] += diff;
>> + diff = private - (long)(p->mm->faults_shared_history[1] / 2);
>> + p->mm->faults_shared_history[1] += diff;
>> +
>
> Why are the decay windows different?

Like in the existing algorithm, we started with a decay factor of 1/2
for both local/remote and private/shared. However, we found lr_ratio
oscillating too much with that and hence dampened it to 1/4.

Using a decay factor of 1/4 for ps_ratio too may not change the overall
behaviour much, but we will have to experiment and check.
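
To make the decay concrete, the history update in pan_update_history()
boils down to the following recurrence (decay_history() is just an
illustrative name used for this sketch):

/*
 * Sketch of the history update: keep (1 - 1/2^shift) of the old value
 * and add the new window's faults on top.  shift = 2 gives the 1/4
 * decay used for local/remote, shift = 1 the 1/2 decay used for
 * shared/private, so the locality history reacts more slowly to any
 * single window and oscillates less.
 */
static unsigned long decay_history(unsigned long hist,
				   unsigned long new_faults,
				   unsigned int shift)
{
	return hist - (hist >> shift) + new_faults;
}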

>
>
>> + /* clear the statistics for the next window */
>> + for (i = 0; i < 2; i++) {
>> + atomic_long_set(&(p->mm->faults_locality[i]), 0);
>> + atomic_long_set(&(p->mm->faults_shared[i]), 0);
>> + }
>> +
>> + return true;
>> +}
>> +
>> /*
>> * Updates mm->numa_scan_period under mm->pan_numa_lock.
>> - *
>> * Returns p->numa_scan_period now but updated to return
>> * p->mm->numa_scan_period in a later patch.
>> */
>
> Spurious whitespace change.

Sorry, will fix.

>
>> static unsigned long pan_get_scan_period(struct task_struct *p)
>> {
>> + pan_update_history(p);
>> +
>> return p->numa_scan_period;
>> }
>>
>
> Ok, so the spinlock is protecting the RMW of the PAN history. It still
> may be a concern that task_numa_work gets aborted if the spinlock cannot
> be acquired.

As replied in 1/3, our current understanding is that the thread which
holds the lock for the stats update should be the one to start the scanning.

Regards,
Bharata.

2022-02-03 00:38:43

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC PATCH v0 3/3] sched/numa: Add adaptive scan period calculation


On 1/31/2022 5:47 PM, Mel Gorman wrote:
> On Fri, Jan 28, 2022 at 10:58:51AM +0530, Bharata B Rao wrote:
>> From: Disha Talreja <[email protected]>
>>
>> This patch implements an adaptive algorithm for calculating
>> the autonuma scan period. In the existing mechanism of scan
>> period calculation,
>>
>> - scan period is derived from the per-thread stats.
>> - static threshold (NUMA_PERIOD_THRESHOLD) is used for changing
>> the scan rate.
>>
>> In this new approach (Process Adaptive autoNUMA), we gather NUMA
>> fault stats at per-process level which allows for capturing the
>> application behaviour better. In addition, the algorithm learns
>> and adjusts the scan rate based on remote fault rate. By not
>> sticking to a static threshold, the algorithm can respond better
>> to different workload behaviours.
>>
>
> This appears to replace the per-task numa_scan_period with a per-mm
> numa_scan_period. This likely leads to a more stable rate overall but it
> potentially misses that some threads are more active than others and
> may miss that different threads may have different local/remote faults and
> private/shared faults. I think this may be smoothing the average while
> potentially missing outliers.

Some threads may be more active than others, different threads may have
different local/remote and shared/private faults, but we are aggregating
this at per-process level only to derive an effective scan period for the
process. The best we can do here is to factor in those local variations
at per-process level and respond quickly with a scan period update.

Note that per-process faults are used only for determining the scan
period, the existing per-task (per-thread) faults continue to influence
the page and task migration decisions.

>
> After the patch, p->numa_scan_period appears to primarily affect if a
> page is retried for migration but a lot of infrastructure is still left
> behind and it's unclear what purpose it serves.

I suppose you meant p->numa_scan_period now only influences the retrying
of task migration. Yes, and I guess we could just use the per-mm
numa_scan_period there too, but strictly speaking that would require
holding the pan_numa_lock spinlock.

If we take care of migrate retry interval, we can remove per-task
numa_scan_period in the next version.

BTW on a related note, why are task stats aggregation and scan period update
(via task_numa_placement() and update_task_scan_period()) conditional on
p->numa_migrate_retry? Should we ideally separate the task migration retry
from the task stats and period update?

>
>> Since the threads of a processes are already considered as a group,
>> we add a bunch of metrics to the task's mm to track the various
>> types of faults and derive the scan rate from them.
>>
>> The new per-process fault stats contribute only to the per-process
>> scan period calculation, while the existing per-thread stats
>> continue to contribute towards the numa_group stats which
>> eventually determine the thresholds for migrating memory and
>> threads across nodes.
>>
>> In this algorithm, the remote fault rates are maintained for
>> the previous two scan windows. These historical remote fault
>> rates along with the remote fault rate from the current window
>> are used to determine the intended trend of the scanning period.
>>
>> An increase in the trend implies an increased period thereby
>> resulting in slower scanning. A decrease in the trend implies
>> decreased period and hence faster scanning.
>>
>
> Clarify what affects the trend in the changelog. e.g. how do differences
> in local vs remote and private vs shared affect trend?

Sure will document the algorithm more comprehensively including
the above aspects.

>
> What happens if one thread is primarily local faults while another is
> primarily remote faults, how does that affect the trend and overall
> scan period?

1. If those two threads are operating on separate memory pages, we will
see the remote fault rate going up. That causes the trend to decrease in
general (though the algorithm does observe the trend for the previous two
windows before changing it). A decrease in the trend results in a lower
scan period, thus creating more opportunities for page or task migration.

2. If there is shared memory between them, that aspect gets captured in
ps_ratio which controls the decrease in trend. If more than 30% shared
faults are seen, we don't further decrease the trend.
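
As a worked example of the 30% cut-off (made-up numbers, using the ps_ratio
expression from the patch): with 600 private and 400 shared faults in the
history, ps_ratio = (600 * 10) / (600 + 400 + 1) = 5, which is below 7, so
the trend is not decreased further; with 800 private and 200 shared faults,
ps_ratio works out to 7 and the decrease is allowed.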


> The per-task scanning is flawed in terms that more active
> threads can scan address space regions that the task is uninterested in
> but I worry that masking that with an address space average may delay
> the correction of an imbalance in one thread because an overall trend
> misses the details.

BTW we had an additional patch that tracked the per-node locality fault
data at the per-process level and corrected the scan rate trend if we
found too much of an imbalance between nodes. This involved maintaining 4
atomic counters (local, remote, private and shared) for each node in the
process's mm and atomic updates to them from the fault path. Since we
didn't see a marked difference in throughput for the benchmarks we tested
with this RFC v0, we haven't included it in this series.

>
>> The intended trends for the last two windows are tracked and
>> the actual trend is reversed (thereby increasing or decreasing
>> the scan period in that window) only if the same trend reversal
>> has been intended in the previous two windows.
>>
>> While the remote fault rate metric is derived from the accumulated
>> remote and local faults from all the threads of the mm, the
>> per-mm private and shared faults also contribute in deciding
>> the trend of the scan period.
>>
>> Co-developed-by: Wei Huang <[email protected]>
>> Signed-off-by: Wei Huang <[email protected]>
>> Signed-off-by: Disha Talreja <[email protected]>
>> Signed-off-by: Bharata B Rao <[email protected]>
>> ---
>> include/linux/mm_types.h | 5 +
>> kernel/sched/debug.c | 2 +
>> kernel/sched/fair.c | 265 ++++++++++++++++++++++++++++++++++++++-
>> kernel/sched/sched.h | 2 +
>> 4 files changed, 268 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 2c6f119b947f..d57cd96d8df0 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -619,6 +619,11 @@ struct mm_struct {
>>
>> spinlock_t pan_numa_lock;
>> unsigned int numa_scan_period;
>> + int remote_fault_rates[2]; /* histogram of remote fault rate */
>> + long scanned_pages;
>
> Why signed? What happens if it wraps (either negative if signed or back
> to 0 if unsigned)?

scanned_pages should have been unsigned and we have to take care of overflow.

>
>> + bool trend;
>> + int slope;
>> + u8 hist_trend;
>
> Document the fields.

Documented these as code comments, will document here too.

>
>> #endif
>> /*
>> * An operation with batched TLB flushing is going on. Anything
>> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> index aa29211de1bf..060bb46166a6 100644
>> --- a/kernel/sched/debug.c
>> +++ b/kernel/sched/debug.c
>> @@ -334,6 +334,8 @@ static __init int sched_init_debug(void)
>> debugfs_create_u32("scan_period_min_ms", 0644, numa, &sysctl_numa_balancing_scan_period_min);
>> debugfs_create_u32("scan_period_max_ms", 0644, numa, &sysctl_numa_balancing_scan_period_max);
>> debugfs_create_u32("scan_size_mb", 0644, numa, &sysctl_numa_balancing_scan_size);
>> + debugfs_create_u32("pan_scan_period_min", 0644, numa, &sysctl_pan_scan_period_min);
>> + debugfs_create_u32("pan_scan_period_max", 0644, numa, &sysctl_pan_scan_period_max);
>> #endif
>
> Update Documentation and what relationship if any scan_period_*_ms has
> with pan_scan_period_*. Add the units to be consistent.

Sure will add documentation and units. And there is no relation
between the two. The patchset needs some cleanup work here to
ensure that we aren't mixing things.

>
>>
>> debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 4911b3841d00..5a9cacfbf9ec 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -1026,6 +1026,10 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
>> /* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
>> unsigned int sysctl_numa_balancing_scan_delay = 1000;
>>
>> +/* Clips of max and min scanning periods */
>> +unsigned int sysctl_pan_scan_period_min = 50;
>> +unsigned int sysctl_pan_scan_period_max = 5000;
>> +
>
> Why are the periods different from the min/max for the previous per-task
> values? (admittedly, those values were pulled out of a hat).

Currently the minimum period possible is 100ms and the MAX_SCAN_WINDOW
is 2560 MB/s. With the increased memory bandwidth available in newer
systems, we thought it should be possible to handle more migrations.
Hence we figured we could potentially scan double the MAX_SCAN_WINDOW
and reduced the min period to 50.
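
To spell out the arithmetic (assuming the default scan size of 256MB per
window): a 100ms minimum period caps effective scanning at
256MB / 0.1s = 2560MB/s, while a 50ms minimum raises that cap to
256MB / 0.05s = 5120MB/s, i.e. double the previous MAX_SCAN_WINDOW rate.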

Regarding max, no specific reason, but found it to work well.

>
>> struct numa_group {
>> refcount_t refcount;
>>
>> @@ -2102,6 +2106,242 @@ static void numa_group_count_active_nodes(struct numa_group *numa_group)
>> /**********************************************/
>> /* Process-based Adaptive NUMA (PAN) Design */
>> /**********************************************/
>> +#define SLOPE(N, D) ((N)/(D))
>> +
>
> Document. N/D implies numerator and denominator. Usage implies a
> percentage change in remote faults but not always, and there are a lot of
> magic numbers with limited explanation.

Sure will document better.

>
>> +static unsigned int pan_scan_max(struct task_struct *p)
>> +{
>> + unsigned long smax, nr_scan_pages;
>> + unsigned long rss = 0;
>> +
>> + smax = sysctl_pan_scan_period_max;
>> + nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
>> +
>> + rss = get_mm_rss(p->mm);
>> + if (!rss)
>> + rss = nr_scan_pages;
>> +
>> + if (READ_ONCE(p->mm->numa_scan_seq) == 0) {
>> + smax = p->mm->scanned_pages * sysctl_pan_scan_period_max;
>> + smax = smax / rss;
>> + smax = max_t(unsigned long, sysctl_pan_scan_period_min, smax);
>> + }
>> +
>
> rss is not necessarily related to virtual address space size e.g. sparse
> mappings. May not be relevant but worth commenting on why it doesn't
> matter.

During the 1st full scan, the intention is to scale down the max period
by scanned_pages/rss so that the scan period doesn't get to its max value
until the address space is scanned once fully.
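
For example (made-up numbers): with an rss of 262144 pages (1GB of 4K
pages) and 16384 pages scanned so far, smax = 16384 * 5000 / 262144, which
is roughly 312ms, so the period cannot yet grow to the 5000ms maximum; the
max_t() clamp only keeps smax from falling below pan_scan_period_min when
very little has been scanned.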

>
>> + return smax;
>> +}
>> +
>> +/*
>> + * Process-based Adaptive NUMA scan period update alogirthm
>> + *
>> + * These are the important concepts behind the scan period update:
>> + *
>> + * - increase trend (of scan period)
>> + * scan period => up, memory coverage => down, overhead => down,
>> + * accuracy => down
>> + * - decrease trend
>> + * scan period => down, memory coverage => up, overhead => up,
>> + * accuracy => up
>> + * - trend: Reflects the current active trend
>> + * 1 means increasing trend, 0 means decreasing trend
>> + * - slope
>> + * it controls scan_period: new_scan_period = current_scan_period *
>> + * 100 / slope
>> + * - hist_trend: Reflects the intended trend in the last two
>> + * windows. Uses the last two bits (bit0 and bit1) for the same.
>> + * 1 if increasing trend was intended, 0 if decreasing was intended.
>> + */
>> +
>> +/*
>> + * Check if the scan period needs updation when the remote fault
>> + * rate has changed (delta > 5)
>> + *
>> + * Returns TRUE if scan period needs updation, else FALSE.
>> + */
>> +static bool pan_changed_rate_update(struct mm_struct *mm, int ps_ratio,
>> + int oldest_remote_fault_rate,
>> + int fault_rate_diff)
>> +{
>> + u8 value;
>> +
>> + /*
>> + * Set the intended trend for the current window.
>> + * - If the remote fault rate has decreased, set the
>> + * intended trend to increasing.
>> + * - Otherwise leave the intended trend as decreasing.
>> + */
>> + mm->hist_trend = mm->hist_trend << 1;
>> + if (fault_rate_diff < 5)
>> + mm->hist_trend |= 0x01;
>> +
>
> Why 5? Presumably 50% but not clear.

5%. If the difference in the remote fault rates between windows is within
this value, we consider the remote fault rate to have stabilized and the
task isn't seeing much change in it.

>
>> + value = mm->hist_trend & 0x03;
>> +
>
> Document better what is contained in this u8 value.

Sure. It captures the intended trends for the previous and current windows.
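
To spell out the encoding (bit0 = intent for the current window, bit1 = for
the previous window, 1 meaning an increase in scan period was intended):

	value 3 (0b11): increase intended in both the previous and current windows
	value 0 (0b00): decrease intended in both windows
	value 1 or 2  : the intended trend is oscillating between the two windows
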
>
>> + if (fault_rate_diff < -5 && value == 3) {
>> + /*
>
> Document magic numbers.

Sure will do.

>
>> + * The remote fault rate has decreased and the intended
>> + * trend was set to increasing in the previous window.
>> + *
>> + * If on decreasing trend, reverse the trend and change
>> + * the slope using the fault rates from (current-1)
>> + * and (current-2) windows.
>> + *
>> + * If already on increasing trend, change the slope using
>> + * the fault rates from (current) and (current-1) windows.
>> + */
>> + if (!mm->trend) {
>> + mm->trend = true;
>> + mm->slope = SLOPE(mm->remote_fault_rates[0] * 100,
>> + oldest_remote_fault_rate);
>> + } else {
>> + mm->slope = SLOPE(mm->remote_fault_rates[1] * 100,
>> + mm->remote_fault_rates[0]);
>> + }
>> + } else if (fault_rate_diff > 5 && value == 0) {
>> + /*
>> + * The remote fault rate has increased and the intended
>> + * trend was set to decreasing in the previous window.
>> + *
>> + * If on increasing trend,
>> + * - If shared fault ratio is more than 30%, don't yet
>> + * reverse the trend, just mark the intended trend as
>> + * increasing.
>> + * - Otherwise reverse the trend. Change the slope using
>> + * the fault rates from (current-1) and (current-2) windows.
>> + *
>> + * If on decreasing trend
>> + * - Continue with a changed slope using the fault
>> + * rates from (current) and (current-1) windows.
>> + */
>> + if (mm->trend) {
>> + if (ps_ratio < 7) {
>> + mm->hist_trend |= 0x01;
>> + return true;
>> + }
>> +
>> + mm->trend = false;
>> + mm->slope = SLOPE(mm->remote_fault_rates[0] * 100,
>> + oldest_remote_fault_rate);
>> + } else {
>> + mm->slope = SLOPE(mm->remote_fault_rates[1] * 100,
>> + mm->remote_fault_rates[0]);
>> + }
>> + } else if (value == 1 || value == 2) {
>> + /*
>> + * The intended trend is oscillating
>> + *
>> + * If on decreasing trend and the shared fault ratio
>> + * is more than 30%, reverse the trend and change the slope.
>> + *
>> + * If on increasing trend, continue as is.
>> + */
>> + if (!mm->trend && ps_ratio < 7) {
>> + mm->hist_trend |= 0x01;
>> + mm->trend = true;
>> + mm->slope = SLOPE(100 * 100,
>> + 100 + ((7 - ps_ratio) * 10));
>> + }
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> +/*
>> + * Check if the scan period needs updation when the remote fault
>> + * rate has remained more or less the same (delta <= 5)
>> + *
>> + * Returns TRUE if scan period needs updation, else FALSE.
>> + */
>
>
> s/updation/updating/

Sure.

>
>> +static bool pan_const_rate_update(struct mm_struct *mm, int ps_ratio,
>> + int oldest_remote_fault_rate)
>
> Document the intent behind the difference between pan_const_rate_update
> and pan_changed_rate_update.

Sure, will document this better in the next revision. Broadly, scan rate
updates are done for two categories:

1. When the remote fault rate has remained more or less constant across
windows (difference < 5%), the update is handled by pan_const_rate_update().
2. When the difference in the remote fault rate between windows is higher
than 5%, the update is handled by pan_changed_rate_update().

>
>> +{
>> + int diff1, diff2;
>> +
>
> Clarify what diff1 and diff2 are the differences between in the naming.

Sure, will document this.

diff1: Difference in remote fault rates between (current-2) and (current-1) windows
diff2: Difference in remote fault rates between (current-1) and current windows
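
Something like the below comments could go into pan_const_rate_update()
(same code, just annotated):

	/* rate(current-2) - rate(current-1); positive means the rate dropped */
	diff1 = oldest_remote_fault_rate - mm->remote_fault_rates[0];
	/* rate(current-1) - rate(current); positive means it dropped again */
	diff2 = mm->remote_fault_rates[0] - mm->remote_fault_rates[1];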

>
>> + mm->hist_trend = mm->hist_trend << 1;
>> +
>> + /*
>> + * If we are in the increasing trend, don't change anything
>> + * except the intended trend for this window that was reset
>> + * to decreasing by default.
>> + */
>> + if (mm->trend)
>> + return false;
>> +
>> + /* We are in the decreasing trend, reverse under some condidtions. */
>> + diff1 = oldest_remote_fault_rate - mm->remote_fault_rates[0];
>> + diff2 = mm->remote_fault_rates[0] - mm->remote_fault_rates[1];
>> +
>> + if (ps_ratio < 7) {
>> + /*
>> + * More than 30% of the pages are shared, so no point in
>> + * further reducing the scan period. If increasing trend
>> + * was intended in the previous window also, then reverse
>> + * the trend to increasing. Else just record the increasing
>> + * intended trend for this window and return.
>> + */
>> + mm->hist_trend |= 0x01;
>> + if ((mm->hist_trend & 0x03) == 3) {
>> + mm->trend = true;
>> + mm->slope = SLOPE(100 * 100,
>> + (100 + ((7 - ps_ratio) * 10)));
>> + } else
>> + return false;
>> + } else if (diff1 >= 0 && diff2 >= 0 && mm->numa_scan_seq > 1) {
>> + /*
>> + * Remote fault rate has reduced successively in the last
>> + * two windows and address space has been scanned at least
>> + * once. If increasing trend was intended in the previous
>> + * window also, then reverse the trend to increasing. Else
>> + * just record the increasing trend for this window and return.
>> + */
>> + mm->hist_trend |= 0x01;
>> + if ((mm->hist_trend & 0x03) == 3) {
>> + mm->trend = true;
>> + mm->slope = SLOPE(100 * 100, 110);
>> + mm->hist_trend |= 0x03;
>> + } else
>> + return false;
>> + }
>> + return true;
>> +}
>> +
>> +static void pan_calculate_scan_period(struct task_struct *p)
>> +{
>> + int remote_fault_rate, oldest_remote_fault_rate, ps_ratio, i, diff;
>> + struct mm_struct *mm = p->mm;
>> + unsigned long remote_hist = mm->faults_locality_history[0];
>> + unsigned long local_hist = mm->faults_locality_history[1];
>> + unsigned long shared_hist = mm->faults_shared_history[0];
>> + unsigned long priv_hist = mm->faults_shared_history[1];
>> + bool need_update;
>> +
>> + ps_ratio = (priv_hist * 10) / (priv_hist + shared_hist + 1);
>> + remote_fault_rate = (remote_hist * 100) / (local_hist + remote_hist + 1);
>> +
>> + /* Keep the remote fault ratio at least 1% */
>> + remote_fault_rate = max(remote_fault_rate, 1);
>> + for (i = 0; i < 2; i++)
>> + if (mm->remote_fault_rates[i] == 0)
>> + mm->remote_fault_rates[i] = 1;
>> +
>
> What if there is one thread in the entire address that is responsible
> for all of the remote faults if it's a shared region? Does this skew the
> scan rates for unrelated threads?

Adjusting the scan period based on ps_ratio should help to eventually
stabilize the scan period.
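
For reference, since ps_ratio is computed as

	ps_ratio = (priv_hist * 10) / (priv_hist + shared_hist + 1)

e.g. priv = 80, shared = 20 gives 800 / 101 = 7, while priv = 60,
shared = 40 gives 600 / 101 = 5; so the "ps_ratio < 7" checks roughly
correspond to more than ~30% of the faults being shared, which is what
the "30%" in the code comments refers to.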

>
>> + /* Shift right in mm->remote_fault_rates[] to keep track of history */
>> + oldest_remote_fault_rate = mm->remote_fault_rates[0];
>> + mm->remote_fault_rates[0] = mm->remote_fault_rates[1];
>> + mm->remote_fault_rates[1] = remote_fault_rate;
>> + diff = remote_fault_rate - oldest_remote_fault_rate;
>> +
>> + if (abs(diff) <= 5)
>> + need_update = pan_const_rate_update(mm, ps_ratio,
>> + oldest_remote_fault_rate);
>> + else
>> + need_update = pan_changed_rate_update(mm, ps_ratio,
>> + oldest_remote_fault_rate,
>> + diff);
>> +
>> + if (need_update) {
>> + if (mm->slope == 0)
>> + mm->slope = 100;
>> + mm->numa_scan_period = (100 * mm->numa_scan_period) / mm->slope;
>> + }
>> +}
>> +
>> /*
>> * Update the cumulative history of local/remote and private/shared
>> * statistics. If the numbers are too small worthy of updating,
>> @@ -2145,14 +2385,17 @@ static bool pan_update_history(struct task_struct *p)
>>
>> /*
>> * Updates mm->numa_scan_period under mm->pan_numa_lock.
>> - * Returns p->numa_scan_period now but updated to return
>> - * p->mm->numa_scan_period in a later patch.
>> */
>
> But p->numa_scan_period still exists so it's harder to evaluate the
> overall result.

Sorry about this, will remove in the next iteration after taking
care of its dependency on numa_migrate_retry.

>
>> static unsigned long pan_get_scan_period(struct task_struct *p)
>> {
>> - pan_update_history(p);
>> + if (pan_update_history(p))
>> + pan_calculate_scan_period(p);
>> +
>> + p->mm->numa_scan_period = clamp(p->mm->numa_scan_period,
>> + READ_ONCE(sysctl_pan_scan_period_min),
>> + pan_scan_max(p));
>>
>> - return p->numa_scan_period;
>> + return p->mm->numa_scan_period;
>> }
>>
>> /*
>> @@ -2860,6 +3103,7 @@ static void task_numa_work(struct callback_head *work)
>> mm->numa_scan_offset = start;
>> else
>> reset_ptenuma_scan(p);
>> + mm->scanned_pages += ((sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT)) - pages);
>> mmap_read_unlock(mm);
>>
>> /*
>> @@ -2882,10 +3126,15 @@ static void pan_init_numa(struct task_struct *p)
>>
>> spin_lock_init(&mm->pan_numa_lock);
>> mm->numa_scan_period = sysctl_numa_balancing_scan_delay;
>> + mm->scanned_pages = 0;
>> + mm->trend = false;
>> + mm->hist_trend = 0;
>> + mm->slope = 100;
>>
>> for (i = 0; i < 2; i++) {
>> mm->faults_locality_history[i] = 0;
>> mm->faults_shared_history[i] = 0;
>> + mm->remote_fault_rates[i] = 1;
>> }
>> }
>>
>> @@ -2948,6 +3197,9 @@ static void task_tick_numa(struct rq *rq, struct task_struct *curr)
>> if ((curr->flags & (PF_EXITING | PF_KTHREAD)) || work->next != work)
>> return;
>>
>> + if (!spin_trylock(&curr->mm->pan_numa_lock))
>> + return;
>> +
>> /*
>> * Using runtime rather than walltime has the dual advantage that
>> * we (mostly) drive the selection from busy threads and that the
>
> This potentially misses triggering of scans in general but again, the
> more stable scan rates may be due to mm-wide averaging while missing
> per-task specifics.

As noted in the reply to 1/3, we don't think updates will be missed.
Here we try for the lock because we want to read mm->numa_scan_period.
Maybe some READ_ONCE() kind of trick would be enough here, since we take
the lock next and re-read mm->numa_scan_period in the task work context
anyway.
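
Roughly something like this in task_tick_numa() (untested sketch; the value
would still be re-read under pan_numa_lock in the task work path):

	now = curr->se.sum_exec_runtime;
	/* Read the per-mm scan period without taking pan_numa_lock */
	period = (u64)READ_ONCE(curr->mm->numa_scan_period) * NSEC_PER_MSEC;

	if (now > curr->node_stamp + period) {
		...
	}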

Regards,
Bharata.

2022-02-03 15:37:00

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period framework

Thanks, Mel, for taking the time to look at the patchset and providing your valuable review comments.

On 1/31/2022 5:47 PM, Mel Gorman wrote:
> On Fri, Jan 28, 2022 at 10:58:49AM +0530, Bharata B Rao wrote:
>> From: Disha Talreja <[email protected]>
>>
>> Add a new framework that calculates autonuma scan period
>> based on per-process NUMA fault stats.
>>
>> NUMA faults can be classified into different categories, such
>> as local vs. remote, or private vs. shared. It is also important
>> to understand such behavior from the perspective of a process.
>> The per-process fault stats added here will be used for
>> calculating the scan period in the adaptive NUMA algorithm.
>>
>
> Be more specific no how the local vs remote, private vs shared states
> are reflections of per-task activity of the same.

Sure, will document the algorithm better. However, the overall thinking
here is that address-space scanning is a per-process activity and hence
a scan period derived from the accumulated per-process faults is more
appropriate than calculating per-task (per-thread) scan periods.
Participating threads may have their own local/remote and private/shared
behaviours, but aggregating them at the process level gives a better
input for the eventual scan period variation. The understanding is that
individual thread fault rates will alter the overall process metrics in
such a manner that we respond by changing the scan rate to do more or
less aggressive scanning.

>
>> The actual scan period is still using the original value
>> p->numa_scan_period before the real implementation is added in
>> place in a later commit.
>>
>> Co-developed-by: Wei Huang <[email protected]>
>> Signed-off-by: Wei Huang <[email protected]>
>> Signed-off-by: Disha Talreja <[email protected]>
>> Signed-off-by: Bharata B Rao <[email protected]>
>> ---
>> include/linux/mm_types.h | 7 +++++++
>> kernel/sched/fair.c | 40 ++++++++++++++++++++++++++++++++++++++--
>> 2 files changed, 45 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 9db36dc5d4cf..4f978c09d3db 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -610,6 +610,13 @@ struct mm_struct {
>>
>> /* numa_scan_seq prevents two threads setting pte_numa */
>> int numa_scan_seq;
>> +
>> + /* Process-based Adaptive NUMA */
>> + atomic_long_t faults_locality[2];
>> + atomic_long_t faults_shared[2];
>> +
>> + spinlock_t pan_numa_lock;
>
> Document what this lock protects. In the context of this patch it appears
> to protect a read of p->numa_scan_period and it's overkill to use a
> spinlock for that. Also, given that it's a trylock, the task_numa_work
> ends up doing no scanning or updates. This might have some value in
> terms of avoiding multiple threads doing updates if they happen to start
> at the same time but that's a narrow side-effect given the short hold
> time of the lock.

Sure, I put it in a code comment, but will document the usage here as well.

If the trylock fails, it means some other thread is updating the stats, and
most likely that thread will go ahead with the atomic update to
mm->numa_next_scan and start the scanning. So I can't see how this would
stall scanning or stat updates in general. Please note that in the existing
scheme the stats aggregation happens at fault time, while in PAN it happens
in task work context.

>
>> + unsigned int numa_scan_period;
>
> Document how the per-mm numa_scan_period is related to the per-task
> numa_scan_period.

They aren't related; the per-mm numa_scan_period is in fact a replacement
for the per-task numa_scan_period. However, the numa_migrate_retry interval
still depends on the per-task period, as you noted elsewhere. I think we
could replace that usage too with the per-mm numa_scan_period and completely
remove the per-task version.
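
For example, the place where p->numa_migrate_retry is computed (in
numa_migrate_preferred()) could presumably switch to the per-mm period
with something like this (untested sketch):

	/* Periodically retry migrating the task to the preferred node */
	interval = min(interval, msecs_to_jiffies(p->mm->numa_scan_period) / 16);
	p->numa_migrate_retry = jiffies + interval;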

Regards,
Bharata.

2022-02-07 15:40:34

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period framework

On Fri, Feb 04, 2022 at 04:33:57PM +0530, Bharata B Rao wrote:
> > However, the major concern I have is that address-space wide decisions
> > on scan rates has no sensible means of adapting to thread-specific
> > requirements. I completely agree that it will result in more stable scan
> > rates, particularly the adjustments. It also side-steps a problem where
> > new threads may start with a scan rate that is completely inappropriate.
> >
> > However, I worry that it would be limited overall because each thread
> > potentially has unique behaviour which is not obvious in a workload like
> > NAS where threads are all executing similar instructions on different
> > data. For other applications, threads may operate on thread-local areas
> > only (low scan rate), others could operate on shared only regresions (high
> > scan rate until back off and interleave), threads can has phase behaviour
> > (manager thread collecting data from worker threads) and threads can have
> > different lifetimes and phase behaviour. Each thread would have a different
> > optimal scan rate to decide if memory needs to be migrated to a local node
> > or not. I don't see how address-space wide statistics could every be mapped
> > back to threads to adapt scan rates based on thread-specific behaviour.
>
> So if all the threads have similar behavior, wouldn't they all arrive at similar
> scan period independently and shouldn't that stabilize the overall scan period
> variation? But we do see variation in per-thread scan periods and overall benefit
> in numbers from per-mm scan period approach for benchmarks like NAS.
>

NAS is close to being an ideal case. Threads have individual behaviour
but it's mostly SIMD with variations between the different computations.

> And, for a thread that has completely different behaviour in the group, currently
> there is no determinism AFAICS on when that would get its chance to update
> the scan period and also whether the eventual scanning happens in the areas
> of interest for that thread. In that case, isn't changing the scan period in
> isolation to cater to that unique thread an overhead on process address space
> scanning?
>

It would protect against worst-case behaviour and non-determinism of
the thread updating its scan period but it's masking the problem by
averaging worst-case behaviour instead of solving it. Address-space wide
scanning does not necessarily scan areas of interest to a specific thread
because only the thread can determine that. Similarly, address-space
wide identification of hot spots would identify hot spots but not what
threads are creating the hot regions so threads still end up scanning
regions that are not relevant to the thread's access pattern.

> Since process level stats are essentially aggregation of thread level stats,
> process level stats will capture thread level behaviour in general. However,
> if there are certain threads whose behavior is much very different from other
> threads, they should eventually impact the process level behaviour is our
> current thinking.
>

It does capture thread-level behaviour in general, but mostly by averaging
all behaviour. It masks any outlier behaviour, such as worker threads in a
pool being allocated, activated and deactivated versus long-lived threads
that are almost continually active.

> For example, in a micro-benchmark where half the threads have local-only
> access and the other half start with remote-all accesses, the initial
> behaviour of the two sets are completely different, but we do see that
> per-mm scan period approach performing more or less similar to the existing
> approach.

Yes, as an average of the two, whereas the threads with local-only
accesses should scan at the lowest possible rate and the remote-all
threads should scan quickly to either meet the threshold allowing the
pages to be migrated, or back off if the accesses are shared and a mix of
local and remote faults.

> However if we use further optimization related to tuning the scan
> period in response to the detected node imbalance(this optimization patch
> wasn't included as part of this initial series), things improve further.
>
> Having said that, there could be corner cases where per-mm approach may not
> be able to capture the per-thread behaviour effectively as you note. We would
> surely want to explore such cases with per-mm approach to understand the
> behaviour better. We can write micro -benchmarks for this but if there already
> existing well known benchmarks that exhibit such behaviors at per-thread level,
> we are willing to give them a try.
>

And a major concern is that the corner cases where the per-mm approach does
not work well mean we need to get thread-specific information back.
There is value in knowing how the overall address space is being used,
e.g. setting the scan rate of new threads or limiting the minimum scan
period to avoid excessive scanning. However, most of the major problems,
such as threads only scanning areas of interest, cannot be easily
addressed by per-mm-only information as the hot spots could belong to
any thread.

> > Thread scanning on the other hand can be improved in multiple ways. If
> > nothing else, they can do redundant scanning of regions that are
> > not relveant to a task which gets increasingly problematic when VSZ
> > increases. The obvious problems are
> >
> > 1. Scan based on page table updates, not address ranges to mitigate
> > problems with THP vs base page updates
> >
> > 2. Move scan delay to be a per-vma structure that is kmalloced if
> > necessary instead of being address space wide.
> >
> > 3. Track what threads access a VMA. The suggestion was to use a unsigned
> > long pid_mask and use the lower bits to tag approximately what
> > threads access a VMA. Skip VMAs that did not trap a fault. This would
> > be approximate because of PID collisions but would reduce scanning
> > of areas the thread is not interested in
> >
> > 4. Track active regions within VMAs. Very coarse tracking, use unsigned
> > long to trap what ranges are active
> >
> > In different ways, this would reduce the amount of scanning work threads
> > do and focuses them on regions of relevance to reduce overhead overall
> > without losing thread-specific details.
>
> Thanks for these pointers, these are worth exploring. And any approach
> for reducing the redundant scanning should complement the current effort
> to optimize the scan period calculation.
>

I think per-mm stats may be used to bound the scan rates of individual
threads, but the biggest help would be if threads could track what VMAs
they are faulting within (problem 3 above) and identify potential
hotspots (problem 4). There would still be some redundant scanning, and
on occasion a thread would have to rescan the entire address space, but
it would reduce the overhead.

--
Mel Gorman
SUSE Labs

2022-02-07 17:34:44

by Bharata B Rao

[permalink] [raw]
Subject: Re: [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period framework

On 2/1/2022 7:45 PM, Mel Gorman wrote:
> On Tue, Feb 01, 2022 at 05:52:55PM +0530, Bharata B Rao wrote:
>> On 1/31/2022 5:47 PM, Mel Gorman wrote:
>>> On Fri, Jan 28, 2022 at 10:58:49AM +0530, Bharata B Rao wrote:
>>>> From: Disha Talreja <[email protected]>
>>>>
>>>> Add a new framework that calculates autonuma scan period
>>>> based on per-process NUMA fault stats.
>>>>
>>>> NUMA faults can be classified into different categories, such
>>>> as local vs. remote, or private vs. shared. It is also important
>>>> to understand such behavior from the perspective of a process.
>>>> The per-process fault stats added here will be used for
>>>> calculating the scan period in the adaptive NUMA algorithm.
>>>>
>>>
>>> Be more specific no how the local vs remote, private vs shared states
>>> are reflections of per-task activity of the same.
>>
>> Sure, will document the algorithm better. However the overall thinking
>> here is that the address-space scanning is a per-process activity and
>> hence the scan period value derived from the accumulated per-process
>> faults is more appropriate than calculating per-task (per-thread) scan
>> periods. Participating threads may have their local/shared and private/shared
>> behaviors, but when aggregated at the process level, it gives a better
>> input for eventual scan period variation. The understanding is that individual
>> thread fault rates will start altering the overall process metrics in
>> such a manner that we respond by changing the scan rate to do more aggressive
>> or less aggressive scanning.
>>
>
> I don't have anything to add on your other responses as it would mostly
> be an acknowledgment of your response.
>
> However, the major concern I have is that address-space wide decisions
> on scan rates has no sensible means of adapting to thread-specific
> requirements. I completely agree that it will result in more stable scan
> rates, particularly the adjustments. It also side-steps a problem where
> new threads may start with a scan rate that is completely inappropriate.
>
> However, I worry that it would be limited overall because each thread
> potentially has unique behaviour which is not obvious in a workload like
> NAS where threads are all executing similar instructions on different
> data. For other applications, threads may operate on thread-local areas
> only (low scan rate), others could operate on shared only regresions (high
> scan rate until back off and interleave), threads can has phase behaviour
> (manager thread collecting data from worker threads) and threads can have
> different lifetimes and phase behaviour. Each thread would have a different
> optimal scan rate to decide if memory needs to be migrated to a local node
> or not. I don't see how address-space wide statistics could every be mapped
> back to threads to adapt scan rates based on thread-specific behaviour.

So if all the threads have similar behaviour, wouldn't they all arrive at a
similar scan period independently, and shouldn't that stabilize the overall
scan period variation? But we do see variation in per-thread scan periods and
an overall benefit in numbers from the per-mm scan period approach for
benchmarks like NAS.

And, for a thread that has completely different behaviour within the group,
currently there is no determinism AFAICS on when it would get its chance to
update the scan period, or on whether the eventual scanning happens in the
areas of interest for that thread. In that case, isn't changing the scan
period in isolation to cater to that unique thread an overhead on process
address space scanning?

Since process-level stats are essentially an aggregation of thread-level
stats, they will capture thread-level behaviour in general. However, our
current thinking is that if certain threads behave very differently from
the other threads, they should eventually impact the process-level
behaviour.

For example, in a micro-benchmark where half the threads have local-only
accesses and the other half start with remote-all accesses, the initial
behaviour of the two sets is completely different, but we do see the
per-mm scan period approach performing more or less similarly to the
existing approach. However, if we use a further optimization related to
tuning the scan period in response to the detected node imbalance (this
optimization patch wasn't included as part of this initial series), things
improve further.

Having said that, there could be corner cases where the per-mm approach may
not be able to capture the per-thread behaviour effectively, as you note. We
would surely want to explore such cases with the per-mm approach to understand
the behaviour better. We can write micro-benchmarks for this, but if there are
already existing well-known benchmarks that exhibit such behaviours at the
per-thread level, we are willing to give them a try.

>
> Thread scanning on the other hand can be improved in multiple ways. If
> nothing else, they can do redundant scanning of regions that are
> not relveant to a task which gets increasingly problematic when VSZ
> increases. The obvious problems are
>
> 1. Scan based on page table updates, not address ranges to mitigate
> problems with THP vs base page updates
>
> 2. Move scan delay to be a per-vma structure that is kmalloced if
> necessary instead of being address space wide.
>
> 3. Track what threads access a VMA. The suggestion was to use a unsigned
> long pid_mask and use the lower bits to tag approximately what
> threads access a VMA. Skip VMAs that did not trap a fault. This would
> be approximate because of PID collisions but would reduce scanning
> of areas the thread is not interested in
>
> 4. Track active regions within VMAs. Very coarse tracking, use unsigned
> long to trap what ranges are active
>
> In different ways, this would reduce the amount of scanning work threads
> do and focuses them on regions of relevance to reduce overhead overall
> without losing thread-specific details.

Thanks for these pointers; they are worth exploring. Any approach for
reducing the redundant scanning should complement the current effort to
optimize the scan period calculation.

Regards,
Bharata.


2023-06-21 06:04:50

by Raghavendra K T

[permalink] [raw]
Subject: Re: [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period framework

+linux-mm
On 2/1/2022 7:45 PM, Mel Gorman wrote:
> On Tue, Feb 01, 2022 at 05:52:55PM +0530, Bharata B Rao wrote:
>> On 1/31/2022 5:47 PM, Mel Gorman wrote:
>>> On Fri, Jan 28, 2022 at 10:58:49AM +0530, Bharata B Rao wrote:
>>>> From: Disha Talreja <[email protected]>
>>>>
>>>> Add a new framework that calculates autonuma scan period
>>>> based on per-process NUMA fault stats.
>>>>
>>>> NUMA faults can be classified into different categories, such
>>>> as local vs. remote, or private vs. shared. It is also important
>>>> to understand such behavior from the perspective of a process.
>>>> The per-process fault stats added here will be used for
>>>> calculating the scan period in the adaptive NUMA algorithm.
>>>>
>>>
>>> Be more specific no how the local vs remote, private vs shared states
>>> are reflections of per-task activity of the same.
>>
>> Sure, will document the algorithm better. However the overall thinking
>> here is that the address-space scanning is a per-process activity and
>> hence the scan period value derived from the accumulated per-process
>> faults is more appropriate than calculating per-task (per-thread) scan
>> periods. Participating threads may have their local/shared and private/shared
>> behaviors, but when aggregated at the process level, it gives a better
>> input for eventual scan period variation. The understanding is that individual
>> thread fault rates will start altering the overall process metrics in
>> such a manner that we respond by changing the scan rate to do more aggressive
>> or less aggressive scanning.
>>
>
> I don't have anything to add on your other responses as it would mostly
> be an acknowledgment of your response.
>
> However, the major concern I have is that address-space wide decisions
> on scan rates has no sensible means of adapting to thread-specific
> requirements. I completely agree that it will result in more stable scan
> rates, particularly the adjustments. It also side-steps a problem where
> new threads may start with a scan rate that is completely inappropriate.
>
> However, I worry that it would be limited overall because each thread
> potentially has unique behaviour which is not obvious in a workload like
> NAS where threads are all executing similar instructions on different
> data. For other applications, threads may operate on thread-local areas
> only (low scan rate), others could operate on shared only regresions (high
> scan rate until back off and interleave), threads can has phase behaviour
> (manager thread collecting data from worker threads) and threads can have
> different lifetimes and phase behaviour. Each thread would have a different
> optimal scan rate to decide if memory needs to be migrated to a local node
> or not. I don't see how address-space wide statistics could every be mapped
> back to threads to adapt scan rates based on thread-specific behaviour.
>
> Thread scanning on the other hand can be improved in multiple ways. If
> nothing else, they can do redundant scanning of regions that are
> not relveant to a task which gets increasingly problematic when VSZ
> increases. The obvious problems are
>
> 1. Scan based on page table updates, not address ranges to mitigate
> problems with THP vs base page updates
>

Hello Mel,
Sorry for digging up a very old email to seek direction on NUMA scanning.

From the list, we have handled (2) and (3) below, and are looking forward
to continuing with (1) above.

My understanding is that when the 256MB limit was introduced, it was mainly
to limit the total number of PTEs we scan (= 64k PTEs of 4KB pages).

Considering that we can do more if we have THP or hugepages, do we want to
cover more huge PTEs here?

I mean, can we say we want to scan 64k worth of 2MB page table entries (or
the corresponding hugepage entries)?

I started with a simple patch that just handles the 4K/hugetlb page cases,
but it does not handle the THP case properly yet, as it's not trivial to
track how many page table entries' worth we handled in a VMA that has THP
(the patch may have whitespace errors because of copying).

The idea is to scan 64k worth of PTEs instead of 256MB per scan.
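
The rough arithmetic behind that (assuming 4KB base pages and 2MB huge pages):

	256MB / 4KB         = 65536 (64k) PTEs scanned per window today
	64k entries of 2MB  = up to 128GB of hugepage-backed VA per window

i.e. keeping the budget in terms of page table entries rather than bytes lets
a single scan window cover far more address space when it is backed by huge
pages.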

Secondly, unrelated to this, I was also thinking about whether information
on how recently a VMA was accessed could be helpful.

Please let me know your suggestions/comments on the direction/approach etc.

> 2. Move scan delay to be a per-vma structure that is kmalloced if
> necessary instead of being address space wide.
>
> 3. Track what threads access a VMA. The suggestion was to use a unsigned
> long pid_mask and use the lower bits to tag approximately what
> threads access a VMA. Skip VMAs that did not trap a fault. This would
> be approximate because of PID collisions but would reduce scanning
> of areas the thread is not interested in
>
> 4. Track active regions within VMAs. Very coarse tracking, use unsigned
> long to trap what ranges are active
>
> In different ways, this would reduce the amount of scanning work threads
> do and focuses them on regions of relevance to reduce overhead overall
> without losing thread-specific details.
>
> Unfortunately, I have not had the time yet to prototype anything.
>
Comments about the patch:
- may need to scale the virtpages check as well
- needs checking of the exact number of THP PTEs covered in a scan
- does not touch task_scan_min() etc., which influence scan_period (do we
need to?)

---8<---

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6d041aa9f0fe..066e9bee1187 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -260,7 +260,8 @@ int pud_huge(pud_t pud);
 long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot,
 		unsigned long cp_flags);
-
+long hugetllb_effective_scanned_ptes(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end);
 bool is_hugetlb_entry_migration(pte_t pte);
 void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..e64430863f9e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2441,6 +2441,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 extern long change_protection(struct mmu_gather *tlb,
 			      struct vm_area_struct *vma, unsigned long start,
 			      unsigned long end, unsigned long cp_flags);
+extern long effective_scanned_ptes(struct vm_area_struct *vma,
+			      unsigned long start, unsigned long end);
 extern int mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 			  struct vm_area_struct *vma, struct vm_area_struct **pprev,
 			  unsigned long start, unsigned long end, unsigned long newflags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..a8280f589cbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2959,7 +2959,7 @@ static void task_numa_work(struct callback_head *work)
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
-	long pages, virtpages;
+	long pages, virtpages, ptes_to_scan;
 	struct vma_iterator vmi;

 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
@@ -3006,6 +3006,8 @@ static void task_numa_work(struct callback_head *work)
 	start = mm->numa_scan_offset;
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+	/* Consider total number of PTEs to scan rather than sticking to 256MB */
+	ptes_to_scan = pages;
 	virtpages = pages * 8; /* Scan up to this much virtual space */
 	if (!pages)
 		return;
@@ -3099,11 +3101,11 @@ static void task_numa_work(struct callback_head *work)
 		 * areas faster.
 		 */
 		if (nr_pte_updates)
-			pages -= (end - start) >> PAGE_SHIFT;
-		virtpages -= (end - start) >> PAGE_SHIFT;
+			ptes_to_scan -= effective_scanned_ptes(vma, start, end);

+		virtpages -= effective_scanned_ptes(vma, start, end);
 		start = end;
-		if (pages <= 0 || virtpages <= 0)
+		if (ptes_to_scan <= 0 || virtpages <= 0)
 			goto out;

 		cond_resched();
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f154019e6b84..9935b462c479 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6841,6 +6841,15 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 	return pages > 0 ? (pages << h->order) : pages;
 }

+long hugetllb_effective_scanned_ptes(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end)
+{
+	struct hstate *h = hstate_vma(vma);
+
+	return (end - start) >> (PAGE_SHIFT + h->order);
+}
+
+
 /* Return true if reservation was successful, false otherwise. */
 bool hugetlb_reserve_pages(struct inode *inode,
 			   long from, long to,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92d3d3ca390a..8022cb09b47b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -586,6 +586,16 @@ long change_protection(struct mmu_gather *tlb,
 	return pages;
 }

+long effective_scanned_ptes(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end)
+{
+	if (is_vm_hugetlb_page(vma))
+		return hugetllb_effective_scanned_ptes(vma, start, end);
+
+	return (end - start) >> PAGE_SHIFT;
+}
+
+
 static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
 			       unsigned long next, struct mm_walk *walk)
 {