2013-09-10 09:32:39

by Mel Gorman

Subject: [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7

It has been a long time since V6 of this series and time for an update. Much
of this is now stabilised with the most important addition being the inclusion
of Peter and Rik's work on grouping tasks that share pages together.

This series has a number of goals. It reduces overhead of automatic balancing
through scan rate reduction and the avoidance of TLB flushes. It selects a
preferred node and moves tasks towards their memory as well as moving memory
toward their task. It handles shared pages and groups related tasks together.

Changelog since V6
o Group tasks that share pages together
o More scan avoidance of VMAs mapping pages that are not likely to migrate
o cpupid conversion, system-wide searching of tasks to balance with
o Various TLB flush optimisations
o Comment updates
o Sanitise task_numa_fault callsites for consistent semantics
o Revert some of the scanning adaption stuff
o Revert patch that defers scanning until task schedules on another node
o Start delayed scanning properly
o Avoid the same task always performing the PTE scan
o Continue PTE scanning even if migration is rate limited

Changelog since V5
o Add __GFP_NOWARN for numa hinting fault count
o Use is_huge_zero_page
o Favour moving tasks towards nodes with higher faults
o Optionally resist moving tasks towards nodes with lower faults
o Scan shared THP pages

Changelog since V4
o Added code that avoids overloading preferred nodes
o Swap tasks if nodes are overloaded and the swap does not impair locality

Changelog since V3
o Correct detection of unset last nid/pid information
o Dropped nr_preferred_running and replaced it with Peter's load balancing
o Pass in correct node information for THP hinting faults
o Pressure tasks sharing a THP page to move towards same node
o Do not set pmd_numa if false sharing is detected

Changelog since V2
o Reshuffle to match Peter's implied preference for layout
o Reshuffle to move private/shared split towards end of series to make it
easier to evaluate the impact
o Use PID information to identify private accesses
o Set the floor for PTE scanning based on virtual address space scan rates
instead of time
o Some locking improvements
o Do not preempt pinned tasks unless they are kernel threads

Changelog since V1
o Scan pages with elevated map count (shared pages)
o Scale scan rates based on the vsz of the process so the sampling of the
task is independent of its size
o Favour moving towards nodes with more faults even if it's not the
preferred node
o Laughably basic accounting of a compute overloaded node when selecting
the preferred node.
o Applied review comments

This series integrates basic scheduler support for automatic NUMA balancing.
It borrows very heavily from Peter Zijlstra's work in "sched, numa, mm:
Add adaptive NUMA affinity support" but deviates too much to preserve
Signed-off-bys. As before, if the relevant authors are ok with it I'll
add Signed-off-bys (or add them yourselves if you pick the patches up).

There are still gaps between this series and manual binding but it is
an important series of steps in the right direction and the size of the
series is getting unwieldy. As before, the intention is not to complete
the work but to incrementally improve mainline and preserve bisectability
for any bug reports that crop up.

Patch 1 is a monolithic dump of patches that are destined for upstream and that
this series indirectly depends upon.

Patches 2-3 add sysctl documentation and comment fixlets

Patch 4 avoids accounting for a hinting fault if another thread handled the
fault in parallel

Patches 5-6 avoid races with parallel THP migration and THP splits.

Patch 7 corrects a THP NUMA hint fault accounting bug

Patch 8 sanitizes task_numa_fault callsites to have consistent semantics and
always record the fault based on the correct location of the page.

Patch 9 avoids trying to migrate the THP zero page

Patch 10 avoids the same task being selected to perform the PTE scan within
a shared address space.

Patch 11 continues PTE scanning even if migration rate limited

Patch 12 notes that delaying the PTE scan until a task is scheduled on an
alternate node misses the case where the task is only accessing
shared memory on a partially loaded machine, and reverts that patch.

Patches 13 and 15 initialise numa_next_scan properly so that PTE scanning is
delayed when a process starts.

Patch 14 sets the scan rate proportional to the size of the task being
scanned.
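
To illustrate the intent (a userspace model with assumed numbers, not the
kernel code in the patch): the gap between scan windows shrinks for larger
tasks so the address space is covered in roughly constant wall-clock time,
subject to a floor that caps the overall scan rate.

/* Toy model of size-proportional scan rates. The 256MB window, 1000ms
 * base period and 2560MB/sec cap are assumptions for illustration only. */
#include <stdio.h>

#define SCAN_SIZE_MB		256	/* assumed size of one scan window */
#define SCAN_PERIOD_MIN_MS	1000	/* assumed time to cover the task once */
#define MAX_SCAN_RATE_MB_SEC	2560	/* assumed cap on the scanning rate */

static unsigned int model_scan_period_ms(unsigned long task_rss_mb)
{
	unsigned long windows = (task_rss_mb + SCAN_SIZE_MB - 1) / SCAN_SIZE_MB;
	unsigned int floor = (SCAN_SIZE_MB * 1000) / MAX_SCAN_RATE_MB_SEC;
	unsigned int period;

	if (!windows)
		windows = 1;
	period = SCAN_PERIOD_MIN_MS / windows;
	return period > floor ? period : floor;
}

int main(void)
{
	printf("256MB task -> %ums between scan windows\n",
	       model_scan_period_ms(256));
	printf("16GB task  -> %ums between scan windows\n",
	       model_scan_period_ms(16384));
	return 0;
}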

Patches 16-17 avoid TLB flushes during the PTE scan if no updates are made

Patch 18 slows the scan rate if no hinting faults were trapped by an idle task.

Patch 19 tracks NUMA hinting faults per-task and per-node

Patches 20-24 select a preferred node at the end of a PTE scan based on which
node incurred the highest number of NUMA faults. When the balancer
is comparing two CPUs it will prefer to locate tasks on their
preferred node. When the preferred node is initially selected the
task is rescheduled on it if it is not running on that node already.
This avoids waiting for the scheduler to move the task slowly.

Patch 25 adds infrastructure to allow separate tracking of shared/private
pages but treats all faults as if they are private accesses. Laying
it out this way reduces churn later in the series when private
fault detection is introduced
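
For reference, the statistics introduced here end up as a flat per-task array
with two counters per node once the shared/private split lands. A minimal
userspace sketch of that layout, assuming the index is simply 2*nid + priv
(the helper name mirrors task_faults_idx used later in the series):

/* Sketch of per-task NUMA fault accounting with separate shared and
 * private counters per node. Toy code for illustration, not the patch. */
#include <stdio.h>
#include <stdlib.h>

static inline int task_faults_idx(int nid, int priv)
{
	return 2 * nid + priv;	/* [n0 shared][n0 private][n1 shared]... */
}

int main(void)
{
	int nr_node_ids = 4;
	unsigned long *numa_faults = calloc(2 * nr_node_ids,
					    sizeof(*numa_faults));

	if (!numa_faults)
		return 1;

	/* Record a private fault of one page on node 1 */
	numa_faults[task_faults_idx(1, 1)] += 1;
	printf("node 1 private faults: %lu\n",
	       numa_faults[task_faults_idx(1, 1)]);
	free(numa_faults);
	return 0;
}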

Patch 26 avoids some unnecessary allocation

Patches 27-28 kick away some training wheels and scan shared pages and
small VMAs.

Patch 29 introduces private fault detection based on the PID of the faulting
process and accounts for shared/private accesses differently.

Patch 30 avoids migrating memory immediately after the load balancer moves
a task to another node in case it's a transient migration.

Patch 31 picks the least loaded CPU on the preferred node based on
a scheduling domain common to both the source and destination
NUMA nodes.

Patch 32 retries task migration if an earlier attempt failed

Patch 33 will begin task migration immediately if running on its preferred
node

Patch 34 will avoid trapping hinting faults for shared read-only library
pages as these never migrate anyway
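
Conceptually the scanner just skips file-backed VMAs that are mapped
read-only, since replicated read-only text never migrates and the hinting
faults are pure overhead. A hedged userspace sketch of that kind of test
(the exact condition in the patch may differ):

/* Toy model of "skip read-only shared library mappings" during the PTE
 * scan. The flag names mirror the kernel's VM_* flags but this is an
 * illustration, not the code in patch 34. */
#include <stdbool.h>
#include <stdio.h>

#define VM_READ		0x0001UL
#define VM_WRITE	0x0002UL

struct toy_vma {
	bool file_backed;		/* vma->vm_file != NULL in the kernel */
	unsigned long vm_flags;
};

static bool skip_hinting_faults(const struct toy_vma *vma)
{
	return vma->file_backed &&
	       (vma->vm_flags & (VM_READ | VM_WRITE)) == VM_READ;
}

int main(void)
{
	struct toy_vma libtext = { .file_backed = true, .vm_flags = VM_READ };
	struct toy_vma heap = { .file_backed = false,
				.vm_flags = VM_READ | VM_WRITE };

	printf("library text skipped: %d\n", skip_hinting_faults(&libtext));
	printf("anonymous heap skipped: %d\n", skip_hinting_faults(&heap));
	return 0;
}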

Patch 35 avoids handling pmd hinting faults if none of the ptes below it were
marked pte numa

Patches 36-37 introduce a mechanism for swapping tasks

Patch 38 uses a system-wide search to find tasks that can be swapped
to improve the overall locality of the system.
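
As a mental model, a swap is only interesting if the combined locality of the
two tasks improves when they trade places. A toy comparison of that idea
(the real search also weighs load, capacity and, later, group faults):

/* Toy model: does swapping task A (currently on node_a) with task B
 * (currently on node_b) increase the number of faults that become
 * node-local? Illustration only, not the scheduler code. */
#include <stdio.h>

static int swap_improves_locality(const unsigned long *faults_a,
				  const unsigned long *faults_b,
				  int node_a, int node_b)
{
	unsigned long before = faults_a[node_a] + faults_b[node_b];
	unsigned long after  = faults_a[node_b] + faults_b[node_a];

	return after > before;
}

int main(void)
{
	unsigned long faults_a[2] = { 10, 90 };	/* A mostly faults on node 1 */
	unsigned long faults_b[2] = { 80, 20 };	/* B mostly faults on node 0 */

	printf("swap helps: %d\n",
	       swap_improves_locality(faults_a, faults_b, 0, 1));
	return 0;
}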

Patch 39 notes that the system-wide search may ignore the preferred node and
will use the preferred node placement if it has spare compute
capacity.

Patches 40-42 use cpupid to track pages so potential sharing tasks can
be quickly found

Patches 43-44 avoid grouping based on read-only pages

Patches 45-46 schedule tasks based on their numa group

Patch 47 adds some debugging aids

Patches 48-49 separately consider task and group weights when selecting the node to
schedule a task on

Patch 50 avoids migrating tasks away from their preferred node.

Kernel 3.11-rc7 is the testing baseline.

o account-v7 Patches 1-7
o lesspmd-v7 Patches 1-35
o selectweight-v7 Patches 1-49
o avoidmove-v7 Patches 1-50

This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running for the whole system.

specjbb

3.11.0-rc7 3.11.0-rc7 3.11.0-rc7 3.11.0-rc7
account-v7 lesspmd-v7 selectweight-v7 avoidmove-v7
TPut 1 26483.00 ( 0.00%) 26691.00 ( 0.79%) 26618.00 ( 0.51%) 25450.00 ( -3.90%)
TPut 2 55009.00 ( 0.00%) 54744.00 ( -0.48%) 53200.00 ( -3.29%) 53998.00 ( -1.84%)
TPut 3 86711.00 ( 0.00%) 85564.00 ( -1.32%) 86547.00 ( -0.19%) 85424.00 ( -1.48%)
TPut 4 108073.00 ( 0.00%) 112757.00 ( 4.33%) 111408.00 ( 3.09%) 113522.00 ( 5.04%)
TPut 5 138128.00 ( 0.00%) 137733.00 ( -0.29%) 140797.00 ( 1.93%) 140930.00 ( 2.03%)
TPut 6 161949.00 ( 0.00%) 164499.00 ( 1.57%) 164759.00 ( 1.74%) 161916.00 ( -0.02%)
TPut 7 185205.00 ( 0.00%) 190214.00 ( 2.70%) 189409.00 ( 2.27%) 191425.00 ( 3.36%)
TPut 8 214152.00 ( 0.00%) 216550.00 ( 1.12%) 219510.00 ( 2.50%) 217374.00 ( 1.50%)
TPut 9 245408.00 ( 0.00%) 242975.00 ( -0.99%) 241001.00 ( -1.80%) 243116.00 ( -0.93%)
TPut 10 262786.00 ( 0.00%) 267812.00 ( 1.91%) 260897.00 ( -0.72%) 267728.00 ( 1.88%)
TPut 11 293162.00 ( 0.00%) 299621.00 ( 2.20%) 291130.00 ( -0.69%) 300006.00 ( 2.33%)
TPut 12 310423.00 ( 0.00%) 317867.00 ( 2.40%) 307821.00 ( -0.84%) 317531.00 ( 2.29%)
TPut 13 328542.00 ( 0.00%) 347286.00 ( 5.71%) 327800.00 ( -0.23%) 344849.00 ( 4.96%)
TPut 14 362081.00 ( 0.00%) 374173.00 ( 3.34%) 342014.00 ( -5.54%) 366256.00 ( 1.15%)
TPut 15 374475.00 ( 0.00%) 393658.00 ( 5.12%) 348941.00 ( -6.82%) 376056.00 ( 0.42%)
TPut 16 407367.00 ( 0.00%) 409212.00 ( 0.45%) 361272.00 (-11.32%) 409353.00 ( 0.49%)
TPut 17 423282.00 ( 0.00%) 424424.00 ( 0.27%) 377808.00 (-10.74%) 410761.00 ( -2.96%)
TPut 18 447960.00 ( 0.00%) 456736.00 ( 1.96%) 392421.00 (-12.40%) 437756.00 ( -2.28%)
TPut 19 449296.00 ( 0.00%) 475797.00 ( 5.90%) 404142.00 (-10.05%) 446286.00 ( -0.67%)
TPut 20 480073.00 ( 0.00%) 487883.00 ( 1.63%) 414085.00 (-13.75%) 453840.00 ( -5.46%)
TPut 21 476891.00 ( 0.00%) 505589.00 ( 6.02%) 422953.00 (-11.31%) 458974.00 ( -3.76%)
TPut 22 492092.00 ( 0.00%) 503878.00 ( 2.40%) 433232.00 (-11.96%) 461927.00 ( -6.13%)
TPut 23 500602.00 ( 0.00%) 523202.00 ( 4.51%) 433320.00 (-13.44%) 454256.00 ( -9.26%)
TPut 24 500408.00 ( 0.00%) 509350.00 ( 1.79%) 441878.00 (-11.70%) 460559.00 ( -7.96%)
TPut 25 503390.00 ( 0.00%) 521126.00 ( 3.52%) 454313.00 ( -9.75%) 468970.00 ( -6.84%)
TPut 26 514905.00 ( 0.00%) 523315.00 ( 1.63%) 453013.00 (-12.02%) 455508.00 (-11.54%)
TPut 27 513125.00 ( 0.00%) 529317.00 ( 3.16%) 461561.00 (-10.05%) 463229.00 ( -9.72%)
TPut 28 508313.00 ( 0.00%) 540357.00 ( 6.30%) 460727.00 ( -9.36%) 452718.00 (-10.94%)
TPut 29 514726.00 ( 0.00%) 534836.00 ( 3.91%) 451867.00 (-12.21%) 449201.00 (-12.73%)
TPut 30 509362.00 ( 0.00%) 526295.00 ( 3.32%) 453946.00 (-10.88%) 444615.00 (-12.71%)
TPut 31 506812.00 ( 0.00%) 532603.00 ( 5.09%) 448303.00 (-11.54%) 450953.00 (-11.02%)
TPut 32 500600.00 ( 0.00%) 524926.00 ( 4.86%) 452692.00 ( -9.57%) 432748.00 (-13.55%)
TPut 33 491116.00 ( 0.00%) 525059.00 ( 6.91%) 436046.00 (-11.21%) 433109.00 (-11.81%)
TPut 34 483206.00 ( 0.00%) 508843.00 ( 5.31%) 440762.00 ( -8.78%) 408980.00 (-15.36%)
TPut 35 489281.00 ( 0.00%) 504354.00 ( 3.08%) 423368.00 (-13.47%) 408371.00 (-16.54%)
TPut 36 480259.00 ( 0.00%) 489147.00 ( 1.85%) 415108.00 (-13.57%) 397698.00 (-17.19%)
TPut 37 474611.00 ( 0.00%) 497076.00 ( 4.73%) 411894.00 (-13.21%) 396970.00 (-16.36%)
TPut 38 470478.00 ( 0.00%) 487195.00 ( 3.55%) 407295.00 (-13.43%) 389028.00 (-17.31%)
TPut 39 437255.00 ( 0.00%) 477739.00 ( 9.26%) 413837.00 ( -5.36%) 391655.00 (-10.43%)
TPut 40 463513.00 ( 0.00%) 473658.00 ( 2.19%) 407789.00 (-12.02%) 383771.00 (-17.20%)
TPut 41 426922.00 ( 0.00%) 446614.00 ( 4.61%) 384862.00 ( -9.85%) 376937.00 (-11.71%)
TPut 42 423707.00 ( 0.00%) 442783.00 ( 4.50%) 393131.00 ( -7.22%) 389373.00 ( -8.10%)
TPut 43 443489.00 ( 0.00%) 444903.00 ( 0.32%) 375795.00 (-15.26%) 377239.00 (-14.94%)
TPut 44 415987.00 ( 0.00%) 432628.00 ( 4.00%) 367343.00 (-11.69%) 383026.00 ( -7.92%)
TPut 45 409382.00 ( 0.00%) 424978.00 ( 3.81%) 364387.00 (-10.99%) 385429.00 ( -5.85%)
TPut 46 402538.00 ( 0.00%) 393039.00 ( -2.36%) 359730.00 (-10.63%) 370411.00 ( -7.98%)
TPut 47 373125.00 ( 0.00%) 406744.00 ( 9.01%) 342382.00 ( -8.24%) 375368.00 ( 0.60%)
TPut 48 405485.00 ( 0.00%) 421600.00 ( 3.97%) 347063.00 (-14.41%) 400586.00 ( -1.21%)

So this is somewhat of a bad start. The initial bulk of the patches help
but the grouping code did not work out as well. This tends to be a bit
variable as a re-run sometimes behaves very differently. Modelling the task
groupings shows that threads in the same task group are still scheduled to
run on CPUs from different nodes so more work is needed there.

specjbb Peaks
3.11.0-rc7 3.11.0-rc7 3.11.0-rc7 3.11.0-rc7
account-v7 lesspmd-v7 selectweight-v7 avoidmove-v7
Expctd Warehouse 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%) 48.00 ( 0.00%)
Expctd Peak Bops 373125.00 ( 0.00%) 406744.00 ( 9.01%) 342382.00 ( -8.24%) 375368.00 ( 0.60%)
Actual Warehouse 27.00 ( 0.00%) 29.00 ( 7.41%) 28.00 ( 3.70%) 26.00 ( -3.70%)
Actual Peak Bops 514905.00 ( 0.00%) 540357.00 ( 4.94%) 461561.00 (-10.36%) 468970.00 ( -8.92%)
SpecJBB Bops 8275.00 ( 0.00%) 8604.00 ( 3.98%) 7083.00 (-14.40%) 8175.00 ( -1.21%)
SpecJBB Bops/JVM 8275.00 ( 0.00%) 8604.00 ( 3.98%) 7083.00 (-14.40%) 8175.00 ( -1.21%)

The actual specjbb score for the overall series does not look as bad
as the raw figures illustrate.

3.11.0-rc7 3.11.0-rc7 3.11.0-rc7 3.11.0-rc7
account-v7 lesspmd-v7 selectweight-v7 avoidmove-v7
User 43513.28 44403.17 44513.42 44406.55
System 871.01 122.46 107.05 116.15
Elapsed 1665.24 1664.94 1665.03 1665.06

A big positive at least is that system CPU overhead is slashed.

3.11.0-rc7 3.11.0-rc7 3.11.0-rc7 3.11.0-rc7
account-v7 lesspmd-v7 selectweight-v7 avoidmove-v7
Compaction stalls 0 0 0 0
Compaction success 0 0 0 0
Compaction failures 0 0 0 0
Page migrate success 133385393 14958732 9859116 12092458
Page migrate failure 0 0 0 0
Compaction pages isolated 0 0 0 0
Compaction migrate scanned 0 0 0 0
Compaction free scanned 0 0 0 0
Compaction cost 138454 15527 10233 12551
NUMA PTE updates 19952605 712634 674115 730464
NUMA hint faults 4113211 710022 668011 729294
NUMA hint local faults 1197939 274740 251230 273679
NUMA hint local percent 29 38 37 37
NUMA pages migrated 133385393 14958732 9859116 12092458
AutoNUMA cost 23240 3839 3532 3881

And the source of the reduction is obvious here from the much smaller
number of PTE updates and hinting faults.


This is SpecJBB running on a 4-socket machine with THP enabled and one JVM
running per node on the system.

specjbb
3.11.0-rc7 3.11.0-rc7 3.11.0-rc7 3.11.0-rc7
account-v7 lesspmd-v7 selectweight-v7 avoidmove-v7
Mean 1 29995.75 ( 0.00%) 30321.50 ( 1.09%) 29457.25 ( -1.80%) 30791.75 ( 2.65%)
Mean 2 62699.25 ( 0.00%) 60564.75 ( -3.40%) 59721.00 ( -4.75%) 61050.00 ( -2.63%)
Mean 3 88312.75 ( 0.00%) 89286.50 ( 1.10%) 88451.50 ( 0.16%) 90461.75 ( 2.43%)
Mean 4 117827.00 ( 0.00%) 115583.00 ( -1.90%) 116043.00 ( -1.51%) 114945.75 ( -2.45%)
Mean 5 139419.00 ( 0.00%) 137869.25 ( -1.11%) 137761.00 ( -1.19%) 136841.25 ( -1.85%)
Mean 6 156185.25 ( 0.00%) 155811.50 ( -0.24%) 151628.50 ( -2.92%) 149850.25 ( -4.06%)
Mean 7 162258.25 ( 0.00%) 160665.25 ( -0.98%) 154775.25 ( -4.61%) 154356.25 ( -4.87%)
Mean 8 160665.00 ( 0.00%) 160376.75 ( -0.18%) 150849.00 ( -6.11%) 154266.50 ( -3.98%)
Mean 9 156048.00 ( 0.00%) 159689.75 ( 2.33%) 150347.00 ( -3.65%) 150804.75 ( -3.36%)
Mean 10 144640.75 ( 0.00%) 153683.50 ( 6.25%) 146165.50 ( 1.05%) 143256.00 ( -0.96%)
Mean 11 136418.75 ( 0.00%) 146141.75 ( 7.13%) 139216.75 ( 2.05%) 137435.00 ( 0.74%)
Mean 12 132808.00 ( 0.00%) 141567.75 ( 6.60%) 131523.25 ( -0.97%) 139129.50 ( 4.76%)
Mean 13 126834.75 ( 0.00%) 140738.50 ( 10.96%) 124446.25 ( -1.88%) 138181.50 ( 8.95%)
Mean 14 127837.25 ( 0.00%) 140882.00 ( 10.20%) 121495.50 ( -4.96%) 128275.75 ( 0.34%)
Mean 15 122268.50 ( 0.00%) 139983.50 ( 14.49%) 115737.25 ( -5.34%) 119838.75 ( -1.99%)
Mean 16 118739.25 ( 0.00%) 142654.25 ( 20.14%) 110902.25 ( -6.60%) 123369.50 ( 3.90%)
Mean 17 117972.75 ( 0.00%) 136969.50 ( 16.10%) 108398.00 ( -8.12%) 115575.00 ( -2.03%)
Mean 18 116308.50 ( 0.00%) 134009.50 ( 15.22%) 109094.25 ( -6.20%) 118385.75 ( 1.79%)
Mean 19 114594.75 ( 0.00%) 125941.75 ( 9.90%) 108366.75 ( -5.43%) 117998.00 ( 2.97%)
Mean 20 116338.50 ( 0.00%) 121586.50 ( 4.51%) 110267.25 ( -5.22%) 121703.00 ( 4.61%)
Mean 21 114274.00 ( 0.00%) 118586.00 ( 3.77%) 105316.25 ( -7.84%) 112591.75 ( -1.47%)
Mean 22 113135.00 ( 0.00%) 121886.75 ( 7.74%) 108124.25 ( -4.43%) 107672.00 ( -4.83%)
Mean 23 109514.25 ( 0.00%) 117894.25 ( 7.65%) 111499.50 ( 1.81%) 108045.75 ( -1.34%)
Mean 24 112897.00 ( 0.00%) 119902.75 ( 6.21%) 110615.50 ( -2.02%) 117146.00 ( 3.76%)
Mean 25 107127.75 ( 0.00%) 125763.50 ( 17.40%) 107750.75 ( 0.58%) 116425.75 ( 8.68%)
Mean 26 109338.75 ( 0.00%) 119034.00 ( 8.87%) 105875.25 ( -3.17%) 116591.00 ( 6.63%)
Mean 27 110967.75 ( 0.00%) 122720.50 ( 10.59%) 99660.75 (-10.19%) 118399.25 ( 6.70%)
Mean 28 116559.50 ( 0.00%) 121524.50 ( 4.26%) 98095.50 (-15.84%) 116433.50 ( -0.11%)
Mean 29 113278.00 ( 0.00%) 115992.75 ( 2.40%) 101014.00 (-10.83%) 122954.25 ( 8.54%)
Mean 30 110273.75 ( 0.00%) 112436.50 ( 1.96%) 103679.75 ( -5.98%) 127165.50 ( 15.32%)
Mean 31 107409.50 ( 0.00%) 120160.00 ( 11.87%) 101122.75 ( -5.85%) 128566.25 ( 19.70%)
Mean 32 105624.00 ( 0.00%) 122808.50 ( 16.27%) 100410.75 ( -4.94%) 126009.50 ( 19.30%)
Mean 33 107521.75 ( 0.00%) 118049.50 ( 9.79%) 97788.25 ( -9.05%) 124172.00 ( 15.49%)
Mean 34 108135.75 ( 0.00%) 118198.75 ( 9.31%) 99215.25 ( -8.25%) 129010.75 ( 19.30%)
Mean 35 104407.75 ( 0.00%) 115090.50 ( 10.23%) 97804.00 ( -6.32%) 126019.75 ( 20.70%)
Mean 36 101119.00 ( 0.00%) 118554.75 ( 17.24%) 101608.00 ( 0.48%) 126106.00 ( 24.71%)
Mean 37 104228.25 ( 0.00%) 123893.25 ( 18.87%) 99277.75 ( -4.75%) 122410.25 ( 17.44%)
Mean 38 104402.50 ( 0.00%) 118543.50 ( 13.54%) 97255.00 ( -6.85%) 118682.75 ( 13.68%)
Mean 39 100158.50 ( 0.00%) 116866.00 ( 16.68%) 99918.00 ( -0.24%) 122019.75 ( 21.83%)
Mean 40 101911.75 ( 0.00%) 117276.25 ( 15.08%) 98766.25 ( -3.09%) 121322.00 ( 19.05%)
Mean 41 104757.50 ( 0.00%) 116656.75 ( 11.36%) 97970.25 ( -6.48%) 121403.00 ( 15.89%)
Mean 42 104782.50 ( 0.00%) 116385.25 ( 11.07%) 96897.25 ( -7.53%) 118765.25 ( 13.34%)
Mean 43 97073.00 ( 0.00%) 113745.50 ( 17.18%) 93433.00 ( -3.75%) 118571.25 ( 22.15%)
Mean 44 99739.00 ( 0.00%) 116286.00 ( 16.59%) 96193.50 ( -3.55%) 116149.75 ( 16.45%)
Mean 45 104422.25 ( 0.00%) 109978.25 ( 5.32%) 95737.50 ( -8.32%) 113604.75 ( 8.79%)
Mean 46 103389.25 ( 0.00%) 110703.00 ( 7.07%) 93711.50 ( -9.36%) 110550.75 ( 6.93%)
Mean 47 96092.25 ( 0.00%) 108942.50 ( 13.37%) 94220.50 ( -1.95%) 104079.00 ( 8.31%)
Mean 48 97596.25 ( 0.00%) 109194.00 ( 11.88%) 101071.25 ( 3.56%) 101543.00 ( 4.04%)
Stddev 1 1326.20 ( 0.00%) 1351.58 ( -1.91%) 1525.30 (-15.01%) 1048.89 ( 20.91%)
Stddev 2 1837.05 ( 0.00%) 1538.27 ( 16.26%) 919.58 ( 49.94%) 1974.67 ( -7.49%)
Stddev 3 1267.24 ( 0.00%) 2599.37 (-105.12%) 2323.12 (-83.32%) 2091.33 (-65.03%)
Stddev 4 6125.28 ( 0.00%) 2980.50 ( 51.34%) 1706.84 ( 72.13%) 2497.81 ( 59.22%)
Stddev 5 6161.12 ( 0.00%) 2495.59 ( 59.49%) 2466.47 ( 59.97%) 3077.78 ( 50.05%)
Stddev 6 5784.16 ( 0.00%) 4799.20 ( 17.03%) 4580.83 ( 20.80%) 2889.81 ( 50.04%)
Stddev 7 6607.07 ( 0.00%) 1167.21 ( 82.33%) 6196.26 ( 6.22%) 4385.20 ( 33.63%)
Stddev 8 1671.12 ( 0.00%) 6631.06 (-296.80%) 6812.80 (-307.68%) 8598.19 (-414.52%)
Stddev 9 6052.25 ( 0.00%) 6954.93 (-14.91%) 6382.84 ( -5.46%) 8987.78 (-48.50%)
Stddev 10 11473.39 ( 0.00%) 4442.38 ( 61.28%) 6772.50 ( 40.97%) 16758.82 (-46.07%)
Stddev 11 7093.02 ( 0.00%) 4526.31 ( 36.19%) 9026.86 (-27.26%) 13353.17 (-88.26%)
Stddev 12 3865.06 ( 0.00%) 2743.41 ( 29.02%) 15584.41 (-303.21%) 14112.46 (-265.13%)
Stddev 13 2777.36 ( 0.00%) 1050.96 ( 62.16%) 16286.28 (-486.39%) 8243.38 (-196.81%)
Stddev 14 1795.89 ( 0.00%) 536.93 ( 70.10%) 13502.75 (-651.87%) 6328.98 (-252.42%)
Stddev 15 2250.85 ( 0.00%) 1135.62 ( 49.55%) 9908.63 (-340.22%) 11274.74 (-400.91%)
Stddev 16 1963.42 ( 0.00%) 379.50 ( 80.67%) 9645.69 (-391.27%) 2679.87 (-36.49%)
Stddev 17 1592.42 ( 0.00%) 1388.57 ( 12.80%) 6322.29 (-297.02%) 3768.27 (-136.64%)
Stddev 18 3317.92 ( 0.00%) 721.81 ( 78.25%) 3065.44 ( 7.61%) 6375.92 (-92.17%)
Stddev 19 4525.33 ( 0.00%) 3273.36 ( 27.67%) 5565.31 (-22.98%) 3248.71 ( 28.21%)
Stddev 20 4140.94 ( 0.00%) 2332.35 ( 43.68%) 8000.27 (-93.20%) 6237.91 (-50.64%)
Stddev 21 1515.71 ( 0.00%) 3309.22 (-118.33%) 6587.02 (-334.58%) 10217.84 (-574.13%)
Stddev 22 5498.36 ( 0.00%) 2437.41 ( 55.67%) 7920.50 (-44.05%) 8414.84 (-53.04%)
Stddev 23 5637.68 ( 0.00%) 1832.68 ( 67.49%) 6543.07 (-16.06%) 5976.59 ( -6.01%)
Stddev 24 4862.89 ( 0.00%) 6295.82 (-29.47%) 9229.15 (-89.79%) 9046.57 (-86.03%)
Stddev 25 1725.07 ( 0.00%) 2986.87 (-73.15%) 13679.77 (-693.00%) 9521.44 (-451.95%)
Stddev 26 4590.06 ( 0.00%) 1862.17 ( 59.43%) 10773.97 (-134.72%) 5417.65 (-18.03%)
Stddev 27 6060.43 ( 0.00%) 1567.32 ( 74.14%) 10217.36 (-68.59%) 2934.56 ( 51.58%)
Stddev 28 2742.94 ( 0.00%) 2533.06 ( 7.65%) 11375.97 (-314.74%) 3713.72 (-35.39%)
Stddev 29 3878.01 ( 0.00%) 783.58 ( 79.79%) 8718.86 (-124.83%) 2870.90 ( 25.97%)
Stddev 30 4446.49 ( 0.00%) 852.75 ( 80.82%) 5318.24 (-19.61%) 2174.56 ( 51.09%)
Stddev 31 3825.27 ( 0.00%) 876.75 ( 77.08%) 7412.96 (-93.79%) 1517.78 ( 60.32%)
Stddev 32 8118.60 ( 0.00%) 1367.48 ( 83.16%) 5757.34 ( 29.08%) 1025.48 ( 87.37%)
Stddev 33 3237.05 ( 0.00%) 3807.47 (-17.62%) 7493.40 (-131.49%) 4600.54 (-42.12%)
Stddev 34 7413.56 ( 0.00%) 3599.54 ( 51.45%) 8514.89 (-14.86%) 2999.21 ( 59.54%)
Stddev 35 6061.77 ( 0.00%) 3756.88 ( 38.02%) 5594.20 ( 7.71%) 4241.61 ( 30.03%)
Stddev 36 5836.80 ( 0.00%) 2944.03 ( 49.56%) 10641.97 (-82.33%) 1267.44 ( 78.29%)
Stddev 37 2719.65 ( 0.00%) 3819.92 (-40.46%) 4075.76 (-49.86%) 2604.21 ( 4.24%)
Stddev 38 3267.94 ( 0.00%) 2148.38 ( 34.26%) 5219.19 (-59.71%) 4865.10 (-48.87%)
Stddev 39 3596.06 ( 0.00%) 1042.13 ( 71.02%) 5891.17 (-63.82%) 3067.42 ( 14.70%)
Stddev 40 4303.03 ( 0.00%) 2518.02 ( 41.48%) 5279.70 (-22.70%) 1750.86 ( 59.31%)
Stddev 41 10269.08 ( 0.00%) 3602.25 ( 64.92%) 5907.68 ( 42.47%) 3163.17 ( 69.20%)
Stddev 42 3221.41 ( 0.00%) 3707.32 (-15.08%) 6926.80 (-115.02%) 2555.18 ( 20.68%)
Stddev 43 7203.43 ( 0.00%) 3082.74 ( 57.20%) 6537.72 ( 9.24%) 3912.25 ( 45.69%)
Stddev 44 6164.48 ( 0.00%) 2946.14 ( 52.21%) 4702.32 ( 23.72%) 3228.17 ( 47.63%)
Stddev 45 7696.65 ( 0.00%) 2461.14 ( 68.02%) 4697.11 ( 38.97%) 4675.68 ( 39.25%)
Stddev 46 6989.59 ( 0.00%) 3713.96 ( 46.86%) 5105.63 ( 26.95%) 5008.38 ( 28.35%)
Stddev 47 5580.13 ( 0.00%) 4025.00 ( 27.87%) 4034.38 ( 27.70%) 5538.34 ( 0.75%)
Stddev 48 5647.24 ( 0.00%) 1694.00 ( 70.00%) 2980.82 ( 47.22%) 8123.60 (-43.85%)
TPut 1 119983.00 ( 0.00%) 121286.00 ( 1.09%) 117829.00 ( -1.80%) 123167.00 ( 2.65%)
TPut 2 250797.00 ( 0.00%) 242259.00 ( -3.40%) 238884.00 ( -4.75%) 244200.00 ( -2.63%)
TPut 3 353251.00 ( 0.00%) 357146.00 ( 1.10%) 353806.00 ( 0.16%) 361847.00 ( 2.43%)
TPut 4 471308.00 ( 0.00%) 462332.00 ( -1.90%) 464172.00 ( -1.51%) 459783.00 ( -2.45%)
TPut 5 557676.00 ( 0.00%) 551477.00 ( -1.11%) 551044.00 ( -1.19%) 547365.00 ( -1.85%)
TPut 6 624741.00 ( 0.00%) 623246.00 ( -0.24%) 606514.00 ( -2.92%) 599401.00 ( -4.06%)
TPut 7 649033.00 ( 0.00%) 642661.00 ( -0.98%) 619101.00 ( -4.61%) 617425.00 ( -4.87%)
TPut 8 642660.00 ( 0.00%) 641507.00 ( -0.18%) 603396.00 ( -6.11%) 617066.00 ( -3.98%)
TPut 9 624192.00 ( 0.00%) 638759.00 ( 2.33%) 601388.00 ( -3.65%) 603219.00 ( -3.36%)
TPut 10 578563.00 ( 0.00%) 614734.00 ( 6.25%) 584662.00 ( 1.05%) 573024.00 ( -0.96%)
TPut 11 545675.00 ( 0.00%) 584567.00 ( 7.13%) 556867.00 ( 2.05%) 549740.00 ( 0.74%)
TPut 12 531232.00 ( 0.00%) 566271.00 ( 6.60%) 526093.00 ( -0.97%) 556518.00 ( 4.76%)
TPut 13 507339.00 ( 0.00%) 562954.00 ( 10.96%) 497785.00 ( -1.88%) 552726.00 ( 8.95%)
TPut 14 511349.00 ( 0.00%) 563528.00 ( 10.20%) 485982.00 ( -4.96%) 513103.00 ( 0.34%)
TPut 15 489074.00 ( 0.00%) 559934.00 ( 14.49%) 462949.00 ( -5.34%) 479355.00 ( -1.99%)
TPut 16 474957.00 ( 0.00%) 570617.00 ( 20.14%) 443609.00 ( -6.60%) 493478.00 ( 3.90%)
TPut 17 471891.00 ( 0.00%) 547878.00 ( 16.10%) 433592.00 ( -8.12%) 462300.00 ( -2.03%)
TPut 18 465234.00 ( 0.00%) 536038.00 ( 15.22%) 436377.00 ( -6.20%) 473543.00 ( 1.79%)
TPut 19 458379.00 ( 0.00%) 503767.00 ( 9.90%) 433467.00 ( -5.43%) 471992.00 ( 2.97%)
TPut 20 465354.00 ( 0.00%) 486346.00 ( 4.51%) 441069.00 ( -5.22%) 486812.00 ( 4.61%)
TPut 21 457096.00 ( 0.00%) 474344.00 ( 3.77%) 421265.00 ( -7.84%) 450367.00 ( -1.47%)
TPut 22 452540.00 ( 0.00%) 487547.00 ( 7.74%) 432497.00 ( -4.43%) 430688.00 ( -4.83%)
TPut 23 438057.00 ( 0.00%) 471577.00 ( 7.65%) 445998.00 ( 1.81%) 432183.00 ( -1.34%)
TPut 24 451588.00 ( 0.00%) 479611.00 ( 6.21%) 442462.00 ( -2.02%) 468584.00 ( 3.76%)
TPut 25 428511.00 ( 0.00%) 503054.00 ( 17.40%) 431003.00 ( 0.58%) 465703.00 ( 8.68%)
TPut 26 437355.00 ( 0.00%) 476136.00 ( 8.87%) 423501.00 ( -3.17%) 466364.00 ( 6.63%)
TPut 27 443871.00 ( 0.00%) 490882.00 ( 10.59%) 398643.00 (-10.19%) 473597.00 ( 6.70%)
TPut 28 466238.00 ( 0.00%) 486098.00 ( 4.26%) 392382.00 (-15.84%) 465734.00 ( -0.11%)
TPut 29 453112.00 ( 0.00%) 463971.00 ( 2.40%) 404056.00 (-10.83%) 491817.00 ( 8.54%)
TPut 30 441095.00 ( 0.00%) 449746.00 ( 1.96%) 414719.00 ( -5.98%) 508662.00 ( 15.32%)
TPut 31 429638.00 ( 0.00%) 480640.00 ( 11.87%) 404491.00 ( -5.85%) 514265.00 ( 19.70%)
TPut 32 422496.00 ( 0.00%) 491234.00 ( 16.27%) 401643.00 ( -4.94%) 504038.00 ( 19.30%)
TPut 33 430087.00 ( 0.00%) 472198.00 ( 9.79%) 391153.00 ( -9.05%) 496688.00 ( 15.49%)
TPut 34 432543.00 ( 0.00%) 472795.00 ( 9.31%) 396861.00 ( -8.25%) 516043.00 ( 19.30%)
TPut 35 417631.00 ( 0.00%) 460362.00 ( 10.23%) 391216.00 ( -6.32%) 504079.00 ( 20.70%)
TPut 36 404476.00 ( 0.00%) 474219.00 ( 17.24%) 406432.00 ( 0.48%) 504424.00 ( 24.71%)
TPut 37 416913.00 ( 0.00%) 495573.00 ( 18.87%) 397111.00 ( -4.75%) 489641.00 ( 17.44%)
TPut 38 417610.00 ( 0.00%) 474174.00 ( 13.54%) 389020.00 ( -6.85%) 474731.00 ( 13.68%)
TPut 39 400634.00 ( 0.00%) 467464.00 ( 16.68%) 399672.00 ( -0.24%) 488079.00 ( 21.83%)
TPut 40 407647.00 ( 0.00%) 469105.00 ( 15.08%) 395065.00 ( -3.09%) 485288.00 ( 19.05%)
TPut 41 419030.00 ( 0.00%) 466627.00 ( 11.36%) 391881.00 ( -6.48%) 485612.00 ( 15.89%)
TPut 42 419130.00 ( 0.00%) 465541.00 ( 11.07%) 387589.00 ( -7.53%) 475061.00 ( 13.34%)
TPut 43 388292.00 ( 0.00%) 454982.00 ( 17.18%) 373732.00 ( -3.75%) 474285.00 ( 22.15%)
TPut 44 398956.00 ( 0.00%) 465144.00 ( 16.59%) 384774.00 ( -3.55%) 464599.00 ( 16.45%)
TPut 45 417689.00 ( 0.00%) 439913.00 ( 5.32%) 382950.00 ( -8.32%) 454419.00 ( 8.79%)
TPut 46 413557.00 ( 0.00%) 442812.00 ( 7.07%) 374846.00 ( -9.36%) 442203.00 ( 6.93%)
TPut 47 384369.00 ( 0.00%) 435770.00 ( 13.37%) 376882.00 ( -1.95%) 416316.00 ( 8.31%)
TPut 48 390385.00 ( 0.00%) 436776.00 ( 11.88%) 404285.00 ( 3.56%) 406172.00 ( 4.04%)

This is looking a bit better overall. One would generally expect this
JVM configuration to be handled better because there are far fewer problems
dealing with shared pages.

specjbb Peaks
3.11.0-rc7 3.11.0-rc7 3.11.0-rc7 3.11.0-rc7
account-v7 lesspmd-v7 selectweight-v7 avoidmove-v7
Expctd Warehouse 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%) 12.00 ( 0.00%)
Expctd Peak Bops 545675.00 ( 0.00%) 584567.00 ( 7.13%) 556867.00 ( 2.05%) 549740.00 ( 0.74%)
Actual Warehouse 8.00 ( 0.00%) 8.00 ( 0.00%) 8.00 ( 0.00%) 8.00 ( 0.00%)
Actual Peak Bops 649033.00 ( 0.00%) 642661.00 ( -0.98%) 619101.00 ( -4.61%) 617425.00 ( -4.87%)
SpecJBB Bops 474931.00 ( 0.00%) 523877.00 ( 10.31%) 454089.00 ( -4.39%) 482435.00 ( 1.58%)
SpecJBB Bops/JVM 118733.00 ( 0.00%) 130969.00 ( 10.31%) 113522.00 ( -4.39%) 120609.00 ( 1.58%)

Because the specjbb score is based on a lower number of clients this does
not look as impressive but at least the overall series does not have a
worse specjbb score.


3.11.0-rc7 3.11.0-rc7 3.11.0-rc7 3.11.0-rc7
account-v7 lesspmd-v7 selectweight-v7 avoidmove-v7
User 464762.73 474999.54 474756.33 475883.65
System 10593.13 725.15 752.36 689.00
Elapsed 10409.45 10414.85 10416.46 10441.17

On the other hand, look at the system CPU overhead. We are getting comparable
or better performance at a small fraction of the cost.


3.11.0-rc7 3.11.0-rc7 3.11.0-rc7 3.11.0-rc7
account-v7 lesspmd-v7 selectweight-v7 avoidmove-v7
Compaction stalls 0 0 0 0
Compaction success 0 0 0 0
Compaction failures 0 0 0 0
Page migrate success 1339274585 55904453 50672239 48174428
Page migrate failure 0 0 0 0
Compaction pages isolated 0 0 0 0
Compaction migrate scanned 0 0 0 0
Compaction free scanned 0 0 0 0
Compaction cost 1390167 58028 52597 50005
NUMA PTE updates 501107230 9187590 8925627 9120756
NUMA hint faults 69895484 9184458 8917340 9096029
NUMA hint local faults 21848214 3778721 3832832 4025324
NUMA hint local percent 31 41 42 44
NUMA pages migrated 1339274585 55904453 50672239 48174428
AutoNUMA cost 378431 47048 45611 46459

And again the reduced cost is from massively reduced numbers of PTE updates
and faults. This may mean some workloads converge more slowly but the system
will not get hammered constantly trying to converge either.

3.11.0-rc7 3.11.0-rc7 3.11.0-rc7 3.11.0-rc7
account-v7 lesspmd-v7 selectweight-v7 avoidmove-v7
User NUMA01 53586.49 ( 0.00%) 57956.00 ( -8.15%) 38838.20 ( 27.52%) 45977.10 ( 14.20%)
User NUMA01_THEADLOCAL 16956.29 ( 0.00%) 17070.87 ( -0.68%) 16972.80 ( -0.10%) 17262.89 ( -1.81%)
User NUMA02 2024.02 ( 0.00%) 2022.45 ( 0.08%) 2035.17 ( -0.55%) 2013.42 ( 0.52%)
User NUMA02_SMT 968.96 ( 0.00%) 992.63 ( -2.44%) 979.86 ( -1.12%) 1379.96 (-42.42%)
System NUMA01 1442.97 ( 0.00%) 542.69 ( 62.39%) 309.92 ( 78.52%) 405.48 ( 71.90%)
System NUMA01_THEADLOCAL 117.16 ( 0.00%) 72.08 ( 38.48%) 75.60 ( 35.47%) 91.56 ( 21.85%)
System NUMA02 7.12 ( 0.00%) 7.86 (-10.39%) 7.84 (-10.11%) 6.38 ( 10.39%)
System NUMA02_SMT 8.49 ( 0.00%) 3.74 ( 55.95%) 3.53 ( 58.42%) 6.26 ( 26.27%)
Elapsed NUMA01 1216.88 ( 0.00%) 1372.29 (-12.77%) 918.05 ( 24.56%) 1065.63 ( 12.43%)
Elapsed NUMA01_THEADLOCAL 375.15 ( 0.00%) 388.68 ( -3.61%) 386.02 ( -2.90%) 382.63 ( -1.99%)
Elapsed NUMA02 48.61 ( 0.00%) 52.19 ( -7.36%) 49.65 ( -2.14%) 51.85 ( -6.67%)
Elapsed NUMA02_SMT 49.68 ( 0.00%) 51.23 ( -3.12%) 50.36 ( -1.37%) 80.91 (-62.86%)
CPU NUMA01 4522.00 ( 0.00%) 4262.00 ( 5.75%) 4264.00 ( 5.71%) 4352.00 ( 3.76%)
CPU NUMA01_THEADLOCAL 4551.00 ( 0.00%) 4410.00 ( 3.10%) 4416.00 ( 2.97%) 4535.00 ( 0.35%)
CPU NUMA02 4178.00 ( 0.00%) 3890.00 ( 6.89%) 4114.00 ( 1.53%) 3895.00 ( 6.77%)
CPU NUMA02_SMT 1967.00 ( 0.00%) 1944.00 ( 1.17%) 1952.00 ( 0.76%) 1713.00 ( 12.91%)

Elapsed figures here are poor. The numa01 test case saw an improvement but
it's an adverse workload and not that interesting per se. Its main benefit
is from the reduction of system overhead. numa02_smt suffered badly due
to the last patch in the series, which needs addressing.

nas-omp
3.11.0-rc7 3.11.0-rc7 3.11.0-rc7 3.11.0-rc7
account-v7 lesspmd-v7 selectweight-v7 avoidmove-v7
Time bt.C 187.22 ( 0.00%) 188.02 ( -0.43%) 188.19 ( -0.52%) 197.49 ( -5.49%)
Time cg.C 61.58 ( 0.00%) 49.64 ( 19.39%) 61.44 ( 0.23%) 56.84 ( 7.70%)
Time ep.C 13.28 ( 0.00%) 13.28 ( 0.00%) 14.05 ( -5.80%) 13.34 ( -0.45%)
Time ft.C 38.35 ( 0.00%) 37.39 ( 2.50%) 35.08 ( 8.53%) 37.05 ( 3.39%)
Time is.C 2.12 ( 0.00%) 1.75 ( 17.45%) 2.20 ( -3.77%) 2.14 ( -0.94%)
Time lu.C 180.71 ( 0.00%) 183.01 ( -1.27%) 186.64 ( -3.28%) 169.77 ( 6.05%)
Time mg.C 32.02 ( 0.00%) 31.57 ( 1.41%) 29.45 ( 8.03%) 31.98 ( 0.12%)
Time sp.C 413.92 ( 0.00%) 396.36 ( 4.24%) 400.92 ( 3.14%) 388.54 ( 6.13%)
Time ua.C 200.27 ( 0.00%) 204.68 ( -2.20%) 211.46 ( -5.59%) 194.92 ( 2.67%)

This is the NAS Parallel Benchmark (NPB) running with OpenMP. Some small improvements.

3.11.0-rc7 3.11.0-rc7 3.11.0-rc7 3.11.0-rc7
account-v7 lesspmd-v7 selectweight-v7 avoidmove-v7
User 47694.80 47262.68 47998.27 46282.27
System 421.02 136.12 129.91 131.36
Elapsed 1265.34 1242.74 1267.06 1229.70

With large reductions of system CPU usage.

So overall it is still a bit of a mixed bag. There is not a universal
performance win but there are massive reductions in system CPU overhead
which may be of big benefit on larger machines, meaning the series is still
worth considering. The ratio of local to remote NUMA hinting faults is still
very low and the fact that there are tasks sharing a numa group running
on different nodes should be examined more closely.

Documentation/sysctl/kernel.txt | 73 ++
arch/x86/mm/numa.c | 6 +-
fs/proc/array.c | 2 +
include/linux/migrate.h | 7 +-
include/linux/mm.h | 107 ++-
include/linux/mm_types.h | 14 +-
include/linux/page-flags-layout.h | 28 +-
include/linux/sched.h | 45 +-
include/linux/stop_machine.h | 1 +
kernel/bounds.c | 4 +
kernel/fork.c | 5 +-
kernel/sched/core.c | 196 ++++-
kernel/sched/debug.c | 60 +-
kernel/sched/fair.c | 1523 ++++++++++++++++++++++++++++++-------
kernel/sched/features.h | 19 +-
kernel/sched/idle_task.c | 2 +-
kernel/sched/rt.c | 5 +-
kernel/sched/sched.h | 19 +-
kernel/sched/stop_task.c | 2 +-
kernel/stop_machine.c | 272 ++++---
kernel/sysctl.c | 7 +
lib/vsprintf.c | 5 +
mm/huge_memory.c | 103 ++-
mm/memory.c | 95 ++-
mm/mempolicy.c | 24 +-
mm/migrate.c | 21 +-
mm/mm_init.c | 18 +-
mm/mmzone.c | 14 +-
mm/mprotect.c | 70 +-
mm/page_alloc.c | 4 +-
30 files changed, 2147 insertions(+), 604 deletions(-)

--
1.8.1.4


2013-09-10 09:32:44

by Mel Gorman

Subject: [PATCH 04/50] mm: numa: Do not account for a hinting fault if we raced

If another task handled a hinting fault in parallel then do not double
account for it.

Not-signed-off-by: Peter Zijlstra
---
mm/huge_memory.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 860a368..5c37cd2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1337,8 +1337,11 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,

check_same:
spin_lock(&mm->page_table_lock);
- if (unlikely(!pmd_same(pmd, *pmdp)))
+ if (unlikely(!pmd_same(pmd, *pmdp))) {
+ /* Someone else took our fault */
+ current_nid = -1;
goto out_unlock;
+ }
clear_pmdnuma:
pmd = pmd_mknonnuma(pmd);
set_pmd_at(mm, haddr, pmdp, pmd);
--
1.8.1.4

2013-09-10 09:32:54

by Mel Gorman

Subject: [PATCH 17/50] mm: Do not flush TLB during protection change if !pte_present && !migration_entry

NUMA PTE scanning is expensive both in terms of the scanning itself and
the TLB flush if there are any updates. Currently non-present PTEs are
accounted for as an update and incur a TLB flush when it is only
necessary for anonymous migration entries. This patch addresses the
problem and should reduce TLB flushes.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/mprotect.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1f9b54b..1e9cef0 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -109,8 +109,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
make_migration_entry_read(&entry);
set_pte_at(mm, addr, pte,
swp_entry_to_pte(entry));
+
+ pages++;
}
- pages++;
}
} while (pte++, addr += PAGE_SIZE, addr != end);
arch_leave_lazy_mmu_mode();
--
1.8.1.4

2013-09-10 09:33:01

by Mel Gorman

Subject: [PATCH 24/50] sched: Reschedule task on preferred NUMA node once selected

A preferred node is selected based on the node the most NUMA hinting
faults were incurred on. There is no guarantee that the task is running
on that node at the time so this patch reschedules the task to run on
the most idle CPU of the selected node when the preferred node is
selected. This avoids waiting for the balancer to make a decision.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/core.c | 19 +++++++++++++++++++
kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/sched.h | 1 +
3 files changed, 65 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0dbd5cd..e94509d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4368,6 +4368,25 @@ fail:
return ret;
}

+#ifdef CONFIG_NUMA_BALANCING
+/* Migrate current task p to target_cpu */
+int migrate_task_to(struct task_struct *p, int target_cpu)
+{
+ struct migration_arg arg = { p, target_cpu };
+ int curr_cpu = task_cpu(p);
+
+ if (curr_cpu == target_cpu)
+ return 0;
+
+ if (!cpumask_test_cpu(target_cpu, tsk_cpus_allowed(p)))
+ return -EINVAL;
+
+ /* TODO: This is not properly updating schedstats */
+
+ return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
+}
+#endif
+
/*
* migration_cpu_stop - this will be executed by a highprio stopper thread
* and performs thread migration by bumping thread off CPU then
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5649280..350c411 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,31 @@ static unsigned int task_scan_max(struct task_struct *p)
*/
unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;

+static unsigned long weighted_cpuload(const int cpu);
+
+
+static int
+find_idlest_cpu_node(int this_cpu, int nid)
+{
+ unsigned long load, min_load = ULONG_MAX;
+ int i, idlest_cpu = this_cpu;
+
+ BUG_ON(cpu_to_node(this_cpu) == nid);
+
+ rcu_read_lock();
+ for_each_cpu(i, cpumask_of_node(nid)) {
+ load = weighted_cpuload(i);
+
+ if (load < min_load) {
+ min_load = load;
+ idlest_cpu = i;
+ }
+ }
+ rcu_read_unlock();
+
+ return idlest_cpu;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -916,10 +941,29 @@ static void task_numa_placement(struct task_struct *p)
}
}

- /* Update the tasks preferred node if necessary */
+ /*
+ * Record the preferred node as the node with the most faults,
+ * requeue the task to be running on the idlest CPU on the
+ * preferred node and reset the scanning rate to recheck
+ * the working set placement.
+ */
if (max_faults && max_nid != p->numa_preferred_nid) {
+ int preferred_cpu;
+
+ /*
+ * If the task is not on the preferred node then find the most
+ * idle CPU to migrate to.
+ */
+ preferred_cpu = task_cpu(p);
+ if (cpu_to_node(preferred_cpu) != max_nid) {
+ preferred_cpu = find_idlest_cpu_node(preferred_cpu,
+ max_nid);
+ }
+
+ /* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 0;
+ migrate_task_to(p, preferred_cpu);
}
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 46c2068..778f875 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -555,6 +555,7 @@ static inline u64 rq_clock_task(struct rq *rq)
}

#ifdef CONFIG_NUMA_BALANCING
+extern int migrate_task_to(struct task_struct *p, int cpu);
static inline void task_numa_free(struct task_struct *p)
{
kfree(p->numa_faults);
--
1.8.1.4

2013-09-10 09:32:51

by Mel Gorman

Subject: [PATCH 15/50] sched: numa: Correct adjustment of numa_scan_period

numa_scan_period is in milliseconds, not jiffies. Properly placed pages
slow the scanning rate but adding 10 jiffies to numa_scan_period means
that the rate at which scanning slows depends on HZ, which is confusing.
Get rid of the jiffies_to_msecs conversion and treat it as ms.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 23fd1f3..29ba117 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -914,7 +914,7 @@ void task_numa_fault(int node, int pages, bool migrated)
p->numa_scan_period_max = task_scan_max(p);

p->numa_scan_period = min(p->numa_scan_period_max,
- p->numa_scan_period + jiffies_to_msecs(10));
+ p->numa_scan_period + 10);
}

task_numa_placement(p);
--
1.8.1.4

2013-09-10 09:32:58

by Mel Gorman

Subject: [PATCH 19/50] sched: Track NUMA hinting faults on per-node basis

This patch tracks which nodes NUMA hinting faults were incurred on.
This information is later used to schedule a task on the node storing
the pages most frequently faulted by the task.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 2 ++
kernel/sched/core.c | 3 +++
kernel/sched/fair.c | 11 ++++++++++-
kernel/sched/sched.h | 12 ++++++++++++
4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 49b426e..dfba435 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1334,6 +1334,8 @@ struct task_struct {
unsigned int numa_scan_period_max;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
+
+ unsigned long *numa_faults;
#endif /* CONFIG_NUMA_BALANCING */

struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9d7a33a..dbc2de6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1644,6 +1644,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_work.next = &p->numa_work;
+ p->numa_faults = NULL;
#endif /* CONFIG_NUMA_BALANCING */
}

@@ -1905,6 +1906,8 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev)
if (mm)
mmdrop(mm);
if (unlikely(prev_state == TASK_DEAD)) {
+ task_numa_free(prev);
+
/*
* Remove function-return probe instances associated with this
* task and put them back on the free list.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 779ebd7..ebd24c0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -902,7 +902,14 @@ void task_numa_fault(int node, int pages, bool migrated)
if (!numabalancing_enabled)
return;

- /* FIXME: Allocate task-specific structure for placement policy here */
+ /* Allocate buffer to track faults on a per-node basis */
+ if (unlikely(!p->numa_faults)) {
+ int size = sizeof(*p->numa_faults) * nr_node_ids;
+
+ p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
+ if (!p->numa_faults)
+ return;
+ }

/*
* If pages are properly placed (did not migrate) then scan slower.
@@ -918,6 +925,8 @@ void task_numa_fault(int node, int pages, bool migrated)
}

task_numa_placement(p);
+
+ p->numa_faults[node] += pages;
}

static void reset_ptenuma_scan(struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7c17661..46c2068 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -6,6 +6,7 @@
#include <linux/spinlock.h>
#include <linux/stop_machine.h>
#include <linux/tick.h>
+#include <linux/slab.h>

#include "cpupri.h"
#include "cpuacct.h"
@@ -553,6 +554,17 @@ static inline u64 rq_clock_task(struct rq *rq)
return rq->clock_task;
}

+#ifdef CONFIG_NUMA_BALANCING
+static inline void task_numa_free(struct task_struct *p)
+{
+ kfree(p->numa_faults);
+}
+#else /* CONFIG_NUMA_BALANCING */
+static inline void task_numa_free(struct task_struct *p)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
#ifdef CONFIG_SMP

#define rcu_dereference_check_sched_domain(p) \
--
1.8.1.4

2013-09-10 09:33:05

by Mel Gorman

Subject: [PATCH 26/50] sched: Check current->mm before allocating NUMA faults

task_numa_placement checks current->mm, but only after buffers for faults
have already been uselessly allocated. Move the check earlier.

[[email protected]: Identified the problem]
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 108f357..e259241 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -930,8 +930,6 @@ static void task_numa_placement(struct task_struct *p)
int seq, nid, max_nid = -1;
unsigned long max_faults = 0;

- if (!p->mm) /* for example, ksmd faulting in a user's mm */
- return;
seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
return;
@@ -998,6 +996,10 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
if (!numabalancing_enabled)
return;

+ /* for example, ksmd faulting in a user's mm */
+ if (!p->mm)
+ return;
+
/* For now, do not attempt to detect private/shared accesses */
priv = 1;

--
1.8.1.4

2013-09-10 09:33:10

by Mel Gorman

Subject: [PATCH 32/50] sched: Retry migration of tasks to CPU on a preferred node

When a preferred node is selected for a task there is an attempt to migrate
the task to a CPU there. This may fail, in which case the task will only
migrate if the active load balancer takes action, which may never happen if
the conditions are not right. This patch will check at NUMA hinting fault
time if another attempt should be made to migrate the task. It will only
make an attempt once every five seconds.

Signed-off-by: Mel Gorman <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 26 +++++++++++++++++++-------
2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6eb8fa6..3418b0b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1333,6 +1333,7 @@ struct task_struct {
int numa_migrate_seq;
unsigned int numa_scan_period;
unsigned int numa_scan_period_max;
+ unsigned long numa_migrate_retry;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5f0388e..5b4d94e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1011,6 +1011,19 @@ migrate:
return migrate_task_to(p, env.best_cpu);
}

+/* Attempt to migrate a task to a CPU on the preferred node. */
+static void numa_migrate_preferred(struct task_struct *p)
+{
+ /* Success if task is already running on preferred CPU */
+ p->numa_migrate_retry = 0;
+ if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+ return;
+
+ /* Otherwise, try migrate to a CPU on the preferred node */
+ if (task_numa_migrate(p) != 0)
+ p->numa_migrate_retry = jiffies + HZ*5;
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -1045,17 +1058,12 @@ static void task_numa_placement(struct task_struct *p)
}
}

- /*
- * Record the preferred node as the node with the most faults,
- * requeue the task to be running on the idlest CPU on the
- * preferred node and reset the scanning rate to recheck
- * the working set placement.
- */
+ /* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 1;
- task_numa_migrate(p);
+ numa_migrate_preferred(p);
}
}

@@ -1111,6 +1119,10 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)

task_numa_placement(p);

+ /* Retry task to preferred node migration if it previously failed */
+ if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
+ numa_migrate_preferred(p);
+
p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
}

--
1.8.1.4

2013-09-10 09:33:19

by Mel Gorman

Subject: [PATCH 29/50] sched: Set preferred NUMA node based on number of private faults

Ideally it would be possible to distinguish between NUMA hinting faults that
are private to a task and those that are shared. If treated identically
there is a risk that shared pages bounce between nodes depending on
the order they are referenced by tasks. Ultimately what is desirable is
that task private pages remain local to the task while shared pages are
interleaved between sharing tasks running on different nodes to give good
average performance. This is further complicated by THP as even
applications that partition their data may not be partitioning on a huge
page boundary.

To start with, this patch assumes that multi-threaded or multi-process
applications partition their data and that private accesses are generally
more important for cpu->memory locality. Also,
no new infrastructure is required to treat private pages properly but
interleaving for shared pages requires additional infrastructure.

To detect private accesses the pid of the last accessing task is required
but the storage requirements are high. This patch borrows heavily from
Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking"
to encode some bits from the last accessing task in the page flags as
well as the node information. Collisions will occur but it is better than
just depending on the node information. Node information is then used to
determine if a page needs to migrate. The PID information is used to detect
private/shared accesses. The preferred NUMA node is selected based on where
the maximum number of approximately private faults were measured. Shared
faults are not taken into consideration for a few reasons.

First, if there are many tasks sharing the page then they'll all move
towards the same node. The node will become compute overloaded and tasks
will then be scheduled away only to bounce back again. Alternatively the shared
tasks would just bounce around nodes because the fault information is
effectively noise. Either way accounting for shared faults the same as
private faults can result in lower performance overall.

The second reason is based on a hypothetical workload that has a small
number of very important, heavily accessed private pages but a large shared
array. The shared array would dominate the number of faults and its node
would be selected as the preferred node even though it's the wrong decision.

The third reason is that multiple threads in a process will race each
other to fault the shared page making the fault information unreliable.

[[email protected]: Fix compilation error when !NUMA_BALANCING]
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm.h | 89 +++++++++++++++++++++++++++++----------
include/linux/mm_types.h | 4 +-
include/linux/page-flags-layout.h | 28 +++++++-----
kernel/sched/fair.c | 12 ++++--
mm/huge_memory.c | 8 ++--
mm/memory.c | 16 +++----
mm/mempolicy.c | 8 ++--
mm/migrate.c | 4 +-
mm/mm_init.c | 18 ++++----
mm/mmzone.c | 14 +++---
mm/mprotect.c | 26 ++++++++----
mm/page_alloc.c | 4 +-
12 files changed, 149 insertions(+), 82 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f022460..0a0db6c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -588,11 +588,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
* sets it, so none of the operations on it need to be atomic.
*/

-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NID_PGOFF (ZONES_PGOFF - LAST_NID_WIDTH)
+#define LAST_NIDPID_PGOFF (ZONES_PGOFF - LAST_NIDPID_WIDTH)

/*
* Define the bit shifts to access each section. For non-existent
@@ -602,7 +602,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NID_PGSHIFT (LAST_NID_PGOFF * (LAST_NID_WIDTH != 0))
+#define LAST_NIDPID_PGSHIFT (LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))

/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
#ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -624,7 +624,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NID_MASK ((1UL << LAST_NID_WIDTH) - 1)
+#define LAST_NIDPID_MASK ((1UL << LAST_NIDPID_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)

static inline enum zone_type page_zonenum(const struct page *page)
@@ -668,48 +668,93 @@ static inline int page_to_nid(const struct page *page)
#endif

#ifdef CONFIG_NUMA_BALANCING
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int nid_pid_to_nidpid(int nid, int pid)
{
- return xchg(&page->_last_nid, nid);
+ return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
}

-static inline int page_nid_last(struct page *page)
+static inline int nidpid_to_pid(int nidpid)
{
- return page->_last_nid;
+ return nidpid & LAST__PID_MASK;
}
-static inline void page_nid_reset_last(struct page *page)
+
+static inline int nidpid_to_nid(int nidpid)
+{
+ return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+ return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+}
+
+static inline bool nidpid_nid_unset(int nidpid)
{
- page->_last_nid = -1;
+ return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+}
+
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+static inline int page_nidpid_xchg_last(struct page *page, int nid)
+{
+ return xchg(&page->_last_nidpid, nid);
+}
+
+static inline int page_nidpid_last(struct page *page)
+{
+ return page->_last_nidpid;
+}
+static inline void page_nidpid_reset_last(struct page *page)
+{
+ page->_last_nidpid = -1;
}
#else
-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
{
- return (page->flags >> LAST_NID_PGSHIFT) & LAST_NID_MASK;
+ return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
}

-extern int page_nid_xchg_last(struct page *page, int nid);
+extern int page_nidpid_xchg_last(struct page *page, int nidpid);

-static inline void page_nid_reset_last(struct page *page)
+static inline void page_nidpid_reset_last(struct page *page)
{
- int nid = (1 << LAST_NID_SHIFT) - 1;
+ int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;

- page->flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
- page->flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+ page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+ page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
}
-#endif /* LAST_NID_NOT_IN_PAGE_FLAGS */
+#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
#else
-static inline int page_nid_xchg_last(struct page *page, int nid)
+static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
{
return page_to_nid(page);
}

-static inline int page_nid_last(struct page *page)
+static inline int page_nidpid_last(struct page *page)
{
return page_to_nid(page);
}

-static inline void page_nid_reset_last(struct page *page)
+static inline int nidpid_to_nid(int nidpid)
+{
+ return -1;
+}
+
+static inline int nidpid_to_pid(int nidpid)
+{
+ return -1;
+}
+
+static inline int nid_pid_to_nidpid(int nid, int pid)
+{
+ return -1;
+}
+
+static inline bool nidpid_pid_unset(int nidpid)
+{
+ return 1;
+}
+
+static inline void page_nidpid_reset_last(struct page *page)
{
}
#endif
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 4f12073..f46378e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
void *shadow;
#endif

-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
- int _last_nid;
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+ int _last_nidpid;
#endif
}
/*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 93506a1..02bc918 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -38,10 +38,10 @@
* The last is when there is insufficient space in page->flags and a separate
* lookup is necessary.
*
- * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
- * " plus space for last_nid: | NODE | ZONE | LAST_NID ... | FLAGS |
- * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * " plus space for last_nid: | SECTION | NODE | ZONE | LAST_NID ... | FLAGS |
+ * No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_nidpid: | NODE | ZONE | LAST_NIDPID ... | FLAGS |
+ * classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
+ * " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
* classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
*/
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -62,15 +62,21 @@
#endif

#ifdef CONFIG_NUMA_BALANCING
-#define LAST_NID_SHIFT NODES_SHIFT
+#define LAST__PID_SHIFT 8
+#define LAST__PID_MASK ((1 << LAST__PID_SHIFT)-1)
+
+#define LAST__NID_SHIFT NODES_SHIFT
+#define LAST__NID_MASK ((1 << LAST__NID_SHIFT)-1)
+
+#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
#else
-#define LAST_NID_SHIFT 0
+#define LAST_NIDPID_SHIFT 0
#endif

-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NID_WIDTH LAST_NID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
#else
-#define LAST_NID_WIDTH 0
+#define LAST_NIDPID_WIDTH 0
#endif

/*
@@ -81,8 +87,8 @@
#define NODE_NOT_IN_PAGE_FLAGS
#endif

-#if defined(CONFIG_NUMA_BALANCING) && LAST_NID_WIDTH == 0
-#define LAST_NID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
+#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
#endif

#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2d04112..223e1f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,7 +988,7 @@ static void task_numa_placement(struct task_struct *p)
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int last_nid, int node, int pages, bool migrated)
+void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
{
struct task_struct *p = current;
int priv;
@@ -1000,8 +1000,14 @@ void task_numa_fault(int last_nid, int node, int pages, bool migrated)
if (!p->mm)
return;

- /* For now, do not attempt to detect private/shared accesses */
- priv = 1;
+ /*
+ * First accesses are treated as private, otherwise consider accesses
+ * to be private if the accessing pid has not changed
+ */
+ if (!nidpid_pid_unset(last_nidpid))
+ priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+ else
+ priv = 1;

/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a8e624e..622bc7e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1294,7 +1294,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
- int target_nid, last_nid = -1;
+ int target_nid, last_nidpid = -1;
bool page_locked;
bool migrated = false;

@@ -1305,7 +1305,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = pmd_page(pmd);
BUG_ON(is_huge_zero_page(page));
page_nid = page_to_nid(page);
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1377,7 +1377,7 @@ out:
page_unlock_anon_vma_read(anon_vma);

if (page_nid != -1)
- task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);
+ task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);

return 0;
}
@@ -1692,7 +1692,7 @@ static void __split_huge_page_refcount(struct page *page,
page_tail->mapping = page->mapping;

page_tail->index = page->index + i;
- page_nid_xchg_last(page_tail, page_nid_last(page));
+ page_nidpid_xchg_last(page_tail, page_nidpid_last(page));

BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index e335ec0..948ec32 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@

#include "internal.h"

-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nid.
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
#endif

#ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3547,7 +3547,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page = NULL;
spinlock_t *ptl;
int page_nid = -1;
- int last_nid;
+ int last_nidpid;
int target_nid;
bool migrated = false;

@@ -3578,7 +3578,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
BUG_ON(is_zero_pfn(page_to_pfn(page)));

- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
@@ -3594,7 +3594,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,

out:
if (page_nid != -1)
- task_numa_fault(last_nid, page_nid, 1, migrated);
+ task_numa_fault(last_nidpid, page_nid, 1, migrated);
return 0;
}

@@ -3609,7 +3609,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
- int last_nid;
+ int last_nidpid;

spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3654,7 +3654,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!page))
continue;

- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(pte, ptl);
@@ -3667,7 +3667,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

if (page_nid != -1)
- task_numa_fault(last_nid, page_nid, 1, migrated);
+ task_numa_fault(last_nidpid, page_nid, 1, migrated);

pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4baf12e..8e2a364 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2292,9 +2292,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long

/* Migrate the page towards the node whose CPU is referencing it */
if (pol->flags & MPOL_F_MORON) {
- int last_nid;
+ int last_nidpid;
+ int this_nidpid;

polnid = numa_node_id();
+ this_nidpid = nid_pid_to_nidpid(polnid, current->pid);

/*
* Multi-stage node selection is used in conjunction
@@ -2317,8 +2319,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
* it less likely we act on an unlikely task<->page
* relation.
*/
- last_nid = page_nid_xchg_last(page, polnid);
- if (last_nid != polnid)
+ last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
+ if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
goto out;
}

diff --git a/mm/migrate.c b/mm/migrate.c
index 08ac3ba..f56ca20 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1495,7 +1495,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
__GFP_NOWARN) &
~GFP_IOFS, 0);
if (newpage)
- page_nid_xchg_last(newpage, page_nid_last(page));
+ page_nidpid_xchg_last(newpage, page_nidpid_last(page));

return newpage;
}
@@ -1672,7 +1672,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
if (!new_page)
goto out_fail;

- page_nid_xchg_last(new_page, page_nid_last(page));
+ page_nidpid_xchg_last(new_page, page_nidpid_last(page));

isolated = numamigrate_isolate_page(pgdat, page);
if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 633c088..467de57 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,26 +71,26 @@ void __init mminit_verify_pageflags_layout(void)
unsigned long or_mask, add_mask;

shift = 8 * sizeof(unsigned long);
- width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NID_SHIFT;
+ width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
- "Section %d Node %d Zone %d Lastnid %d Flags %d\n",
+ "Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
SECTIONS_WIDTH,
NODES_WIDTH,
ZONES_WIDTH,
- LAST_NID_WIDTH,
+ LAST_NIDPID_WIDTH,
NR_PAGEFLAGS);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
- "Section %d Node %d Zone %d Lastnid %d\n",
+ "Section %d Node %d Zone %d Lastnidpid %d\n",
SECTIONS_SHIFT,
NODES_SHIFT,
ZONES_SHIFT,
- LAST_NID_SHIFT);
+ LAST_NIDPID_SHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
- "Section %lu Node %lu Zone %lu Lastnid %lu\n",
+ "Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
(unsigned long)SECTIONS_PGSHIFT,
(unsigned long)NODES_PGSHIFT,
(unsigned long)ZONES_PGSHIFT,
- (unsigned long)LAST_NID_PGSHIFT);
+ (unsigned long)LAST_NIDPID_PGSHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
"Node/Zone ID: %lu -> %lu\n",
(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -102,9 +102,9 @@ void __init mminit_verify_pageflags_layout(void)
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
"Node not in page flags");
#endif
-#ifdef LAST_NID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
- "Last nid not in page flags");
+ "Last nidpid not in page flags");
#endif

if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 2ac0afb..25bb477 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -97,20 +97,20 @@ void lruvec_init(struct lruvec *lruvec)
INIT_LIST_HEAD(&lruvec->lists[lru]);
}

-#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NID_NOT_IN_PAGE_FLAGS)
-int page_nid_xchg_last(struct page *page, int nid)
+#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NIDPID_NOT_IN_PAGE_FLAGS)
+int page_nidpid_xchg_last(struct page *page, int nidpid)
{
unsigned long old_flags, flags;
- int last_nid;
+ int last_nidpid;

do {
old_flags = flags = page->flags;
- last_nid = page_nid_last(page);
+ last_nidpid = page_nidpid_last(page);

- flags &= ~(LAST_NID_MASK << LAST_NID_PGSHIFT);
- flags |= (nid & LAST_NID_MASK) << LAST_NID_PGSHIFT;
+ flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
+ flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));

- return last_nid;
+ return last_nidpid;
}
#endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 4a21819..70ec934 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,15 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)

static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
- int dirty_accountable, int prot_numa, bool *ret_all_same_node)
+ int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte, oldpte;
spinlock_t *ptl;
unsigned long pages = 0;
- bool all_same_node = true;
+ bool all_same_nidpid = true;
int last_nid = -1;
+ int last_pid = -1;

pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -71,11 +72,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
* hits on the zero page
*/
if (page && !is_zero_pfn(page_to_pfn(page))) {
- int this_nid = page_to_nid(page);
+ int nidpid = page_nidpid_last(page);
+ int this_nid = nidpid_to_nid(nidpid);
+ int this_pid = nidpid_to_pid(nidpid);
+
if (last_nid == -1)
last_nid = this_nid;
- if (last_nid != this_nid)
- all_same_node = false;
+ if (last_pid == -1)
+ last_pid = this_pid;
+ if (last_nid != this_nid ||
+ last_pid != this_pid) {
+ all_same_nidpid = false;
+ }

if (!pte_numa(oldpte)) {
ptent = pte_mknuma(ptent);
@@ -115,7 +123,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);

- *ret_all_same_node = all_same_node;
+ *ret_all_same_nidpid = all_same_nidpid;
return pages;
}

@@ -142,7 +150,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pmd_t *pmd;
unsigned long next;
unsigned long pages = 0;
- bool all_same_node;
+ bool all_same_nidpid;

pmd = pmd_offset(pud, addr);
do {
@@ -166,7 +174,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (pmd_none_or_clear_bad(pmd))
continue;
pages += change_pte_range(vma, pmd, addr, next, newprot,
- dirty_accountable, prot_numa, &all_same_node);
+ dirty_accountable, prot_numa, &all_same_nidpid);

/*
* If we are changing protections for NUMA hinting faults then
@@ -174,7 +182,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
* node. This allows a regular PMD to be handled as one fault
* and effectively batches the taking of the PTL
*/
- if (prot_numa && all_same_node)
+ if (prot_numa && all_same_nidpid)
change_pmd_protnuma(vma->vm_mm, addr, pmd);
} while (pmd++, addr = next, addr != end);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..7bf960e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -622,7 +622,7 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
- page_nid_reset_last(page);
+ page_nidpid_reset_last(page);
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
@@ -3944,7 +3944,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
- page_nid_reset_last(page);
+ page_nidpid_reset_last(page);
SetPageReserved(page);
/*
* Mark the block movable so that blocks are reserved for
--
1.8.1.4

2013-09-10 09:33:32

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 48/50] sched: numa: Decide whether to favour task or group weights based on swap candidate relationships

From: Rik van Riel <[email protected]>

This patch separately considers task and group affinities when searching
for swap candidates during task NUMA placement. If either task is not part
of a group, or both tasks are in the same group, then the task weights are
compared. Otherwise the group weights are compared.
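
As a rough illustration, the decision reduces to the following userspace
sketch (struct and function names here are illustrative, not the kernel's):

/* Illustrative sketch of the swap-candidate decision, not kernel code. */
struct candidate {
        void *numa_group;       /* NULL when the task is not in a NUMA group */
        long task_weight_src, task_weight_dst;
        long group_weight_src, group_weight_dst;
};

static long candidate_improvement(const struct candidate *p,
                                  const struct candidate *cur,
                                  long taskimp, long groupimp)
{
        /* Either task groupless, or both in the same group: task weights. */
        if (!cur->numa_group || !p->numa_group ||
            cur->numa_group == p->numa_group)
                return taskimp + cur->task_weight_src - cur->task_weight_dst;

        /* Tasks belong to different groups: group weights take priority. */
        return groupimp + cur->group_weight_src - cur->group_weight_dst;
}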

Not-signed-off-by: Rik van Riel
---
kernel/sched/fair.c | 59 ++++++++++++++++++++++++++++++++---------------------
1 file changed, 36 insertions(+), 23 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 80906fa..fdb7923 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1039,13 +1039,15 @@ static void task_numa_assign(struct task_numa_env *env,
* into account that it might be best if task running on the dst_cpu should
* be exchanged with the source task
*/
-static void task_numa_compare(struct task_numa_env *env, long imp)
+static void task_numa_compare(struct task_numa_env *env,
+ long taskimp, long groupimp)
{
struct rq *src_rq = cpu_rq(env->src_cpu);
struct rq *dst_rq = cpu_rq(env->dst_cpu);
struct task_struct *cur;
long dst_load, src_load;
long load;
+ long imp = (groupimp > 0) ? groupimp : taskimp;

rcu_read_lock();
cur = ACCESS_ONCE(dst_rq->curr);
@@ -1064,10 +1066,19 @@ static void task_numa_compare(struct task_numa_env *env, long imp)
if (!cpumask_test_cpu(env->src_cpu, tsk_cpus_allowed(cur)))
goto unlock;

- imp += task_weight(cur, env->src_nid) +
- group_weight(cur, env->src_nid) -
- task_weight(cur, env->dst_nid) -
- group_weight(cur, env->dst_nid);
+ /*
+ * If dst and source tasks are in the same NUMA group, or not
+ * in any group then look only at task weights otherwise give
+ * priority to the group weights.
+ */
+ if (!cur->numa_group || ! env->p->numa_group ||
+ cur->numa_group == env->p->numa_group) {
+ imp = taskimp + task_weight(cur, env->src_nid) -
+ task_weight(cur, env->dst_nid);
+ } else {
+ imp = groupimp + group_weight(cur, env->src_nid) -
+ group_weight(cur, env->dst_nid);
+ }
}

if (imp < env->best_imp)
@@ -1117,7 +1128,8 @@ unlock:
rcu_read_unlock();
}

-static void task_numa_find_cpu(struct task_numa_env *env, long imp)
+static void task_numa_find_cpu(struct task_numa_env *env,
+ long taskimp, long groupimp)
{
int cpu;

@@ -1127,7 +1139,7 @@ static void task_numa_find_cpu(struct task_numa_env *env, long imp)
continue;

env->dst_cpu = cpu;
- task_numa_compare(env, imp);
+ task_numa_compare(env, taskimp, groupimp);
}
}

@@ -1147,9 +1159,9 @@ static int task_numa_migrate(struct task_struct *p)
.best_cpu = -1
};
struct sched_domain *sd;
- unsigned long weight;
+ unsigned long taskweight, groupweight;
int nid, ret;
- long imp;
+ long taskimp, groupimp;

/*
* Find the lowest common scheduling domain covering the nodes of both
@@ -1164,10 +1176,12 @@ static int task_numa_migrate(struct task_struct *p)
}
rcu_read_unlock();

- weight = task_weight(p, env.src_nid) + group_weight(p, env.src_nid);
+ taskweight = task_weight(p, env.src_nid);
+ groupweight = group_weight(p, env.src_nid);
update_numa_stats(&env.src_stats, env.src_nid);
env.dst_nid = p->numa_preferred_nid;
- imp = task_weight(p, env.dst_nid) + group_weight(p, env.dst_nid) - weight;
+ taskimp = task_weight(p, env.dst_nid) - taskweight;
+ groupimp = group_weight(p, env.dst_nid) - groupweight;
update_numa_stats(&env.dst_stats, env.dst_nid);

/*
@@ -1175,20 +1189,21 @@ static int task_numa_migrate(struct task_struct *p)
* alternative node with relatively better statistics.
*/
if (env.dst_stats.has_capacity) {
- task_numa_find_cpu(&env, imp);
+ task_numa_find_cpu(&env, taskimp, groupimp);
} else {
for_each_online_node(nid) {
if (nid == env.src_nid || nid == p->numa_preferred_nid)
continue;

/* Only consider nodes where both task and groups benefit */
- imp = task_weight(p, nid) + group_weight(p, nid) - weight;
- if (imp < 0)
+ taskimp = task_weight(p, nid) - taskweight;
+ groupimp = group_weight(p, nid) - groupweight;
+ if (taskimp < 0 && groupimp < 0)
continue;

env.dst_nid = nid;
update_numa_stats(&env.dst_stats, env.dst_nid);
- task_numa_find_cpu(&env, imp);
+ task_numa_find_cpu(&env, taskimp, groupimp);
}
}

@@ -4627,10 +4642,9 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
if (dst_nid == p->numa_preferred_nid)
return true;

- /* After the task has settled, check if the new node is better. */
- if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
- task_weight(p, dst_nid) + group_weight(p, dst_nid) >
- task_weight(p, src_nid) + group_weight(p, src_nid))
+ /* If both task and group weight improve, this move is a winner. */
+ if (task_weight(p, dst_nid) > task_weight(p, src_nid) &&
+ group_weight(p, dst_nid) > group_weight(p, src_nid))
return true;

return false;
@@ -4657,10 +4671,9 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
if (src_nid == p->numa_preferred_nid)
return true;

- /* After the task has settled, check if the new node is worse. */
- if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
- task_weight(p, dst_nid) + group_weight(p, dst_nid) <
- task_weight(p, src_nid) + group_weight(p, src_nid))
+ /* If either task or group weight get worse, don't do it. */
+ if (task_weight(p, dst_nid) < task_weight(p, src_nid) ||
+ group_weight(p, dst_nid) < group_weight(p, src_nid))
return true;

return false;
--
1.8.1.4

2013-09-10 09:33:15

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 30/50] sched: Do not migrate memory immediately after switching node

From: Rik van Riel <[email protected]>

The load balancer can move tasks between nodes and does not take NUMA
locality into account. With automatic NUMA balancing this may result in the
task's working set being migrated to the new node. However, as the fault
buffer will still store faults from the old node, the scheduler may decide to
reset the preferred node and migrate the task back, resulting in more
migrations.

The ideal would be that the scheduler did not migrate tasks with a heavy
memory footprint, but this may result in nodes being overloaded. We could
also discard the fault information on task migration, but this would still
cause all of the task's working set to be migrated. This patch simply avoids
migrating the memory for a short time after a task is migrated.
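
A minimal sketch of the resulting check on the memory-policy side
(illustrative userspace mock-up; only the field names mirror the patch):

/* Illustrative sketch, not kernel code. */
struct task_info {
        int numa_preferred_nid;
        int numa_migrate_seq;   /* 0 right after a cross-node move */
};

/* Called when a hinting fault suggests migrating a page towards polnid. */
static int should_migrate_page(const struct task_info *p, int polnid)
{
        /*
         * If the scheduler just moved the task away from its preferred
         * node, hold off: the move may be short lived and migrating the
         * working set would be wasted work.
         */
        if (polnid != p->numa_preferred_nid && !p->numa_migrate_seq)
                return 0;
        return 1;
}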

Not-signed-off-by: Rik van Riel
---
kernel/sched/core.c | 2 +-
kernel/sched/fair.c | 18 ++++++++++++++++--
mm/mempolicy.c | 12 ++++++++++++
3 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e94509d..374da2b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1641,7 +1641,7 @@ static void __sched_fork(struct task_struct *p)

p->node_stamp = 0ULL;
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
- p->numa_migrate_seq = 0;
+ p->numa_migrate_seq = 1;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 223e1f8..c2f1cf5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -884,7 +884,7 @@ static unsigned int task_scan_max(struct task_struct *p)
* the preferred node but still allow the scheduler to move the task again if
* the nodes CPUs are overloaded.
*/
-unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;

static inline int task_faults_idx(int nid, int priv)
{
@@ -980,7 +980,7 @@ static void task_numa_placement(struct task_struct *p)

/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
- p->numa_migrate_seq = 0;
+ p->numa_migrate_seq = 1;
migrate_task_to(p, preferred_cpu);
}
}
@@ -4074,6 +4074,20 @@ static void move_task(struct task_struct *p, struct lb_env *env)
set_task_cpu(p, env->dst_cpu);
activate_task(env->dst_rq, p, 0);
check_preempt_curr(env->dst_rq, p, 0);
+#ifdef CONFIG_NUMA_BALANCING
+ if (p->numa_preferred_nid != -1) {
+ int src_nid = cpu_to_node(env->src_cpu);
+ int dst_nid = cpu_to_node(env->dst_cpu);
+
+ /*
+ * If the load balancer has moved the task then limit
+ * migrations from taking place in the short term in
+ * case this is a short-lived migration.
+ */
+ if (src_nid != dst_nid && dst_nid != p->numa_preferred_nid)
+ p->numa_migrate_seq = 0;
+ }
+#endif
}

/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8e2a364..adc93b2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2322,6 +2322,18 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
goto out;
+
+#ifdef CONFIG_NUMA_BALANCING
+ /*
+ * If the scheduler has just moved us away from our
+ * preferred node, do not bother migrating pages yet.
+ * This way a short and temporary process migration will
+ * not cause excessive memory migration.
+ */
+ if (polnid != current->numa_preferred_nid &&
+ !current->numa_migrate_seq)
+ goto out;
+#endif
}

if (curnid != polnid)
--
1.8.1.4

2013-09-10 09:33:23

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 44/50] sched: numa: stay on the same node if CLONE_VM

From: Rik van Riel <[email protected]>

A newly spawned thread inside a process should stay on the same
NUMA node as its parent. This prevents processes from being "torn"
across multiple NUMA nodes every time they spawn a new thread.
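
A minimal sketch of the fork-time decision (illustrative, not kernel code;
CLONE_VM is set when creating a thread, so only threads inherit the
parent's preferred node while a plain fork starts with none):

#define CLONE_VM        0x00000100      /* value from the clone(2) flags */

static int child_preferred_nid(unsigned long clone_flags, int parent_nid)
{
        if (clone_flags & CLONE_VM)
                return parent_nid;      /* new thread: stay with the parent */
        return -1;                      /* new process: no preference yet */
}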

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 2 +-
kernel/fork.c | 2 +-
kernel/sched/core.c | 14 +++++++++-----
3 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 15888f5..4f51ceb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2005,7 +2005,7 @@ extern void wake_up_new_task(struct task_struct *tsk);
#else
static inline void kick_process(struct task_struct *tsk) { }
#endif
-extern void sched_fork(struct task_struct *p);
+extern void sched_fork(unsigned long clone_flags, struct task_struct *p);
extern void sched_dead(struct task_struct *p);

extern void proc_caches_init(void);
diff --git a/kernel/fork.c b/kernel/fork.c
index f693bdf..2bc7f88 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1309,7 +1309,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
#endif

/* Perform scheduler related setup. Assign this task to a CPU. */
- sched_fork(p);
+ sched_fork(clone_flags, p);

retval = perf_event_init_task(p);
if (retval)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3808860..7bf0827 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1699,7 +1699,7 @@ int wake_up_state(struct task_struct *p, unsigned int state)
*
* __sched_fork() is basic setup used by init_idle() too:
*/
-static void __sched_fork(struct task_struct *p)
+static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
{
p->on_rq = 0;

@@ -1732,11 +1732,15 @@ static void __sched_fork(struct task_struct *p)
p->mm->numa_scan_seq = 0;
}

+ if (clone_flags & CLONE_VM)
+ p->numa_preferred_nid = current->numa_preferred_nid;
+ else
+ p->numa_preferred_nid = -1;
+
p->node_stamp = 0ULL;
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_migrate_seq = 1;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
- p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
@@ -1768,12 +1772,12 @@ void set_numabalancing_state(bool enabled)
/*
* fork()/clone()-time setup:
*/
-void sched_fork(struct task_struct *p)
+void sched_fork(unsigned long clone_flags, struct task_struct *p)
{
unsigned long flags;
int cpu = get_cpu();

- __sched_fork(p);
+ __sched_fork(clone_flags, p);
/*
* We mark the process as running here. This guarantees that
* nobody will actually run it, and a signal or other external
@@ -4304,7 +4308,7 @@ void init_idle(struct task_struct *idle, int cpu)

raw_spin_lock_irqsave(&rq->lock, flags);

- __sched_fork(idle);
+ __sched_fork(0, idle);
idle->state = TASK_RUNNING;
idle->se.exec_start = sched_clock();

--
1.8.1.4

2013-09-10 09:33:29

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 49/50] sched: numa: fix task or group comparison

From: Rik van Riel <[email protected]>

This patch should probably be folded into

commit 77e0ecbbc5cf0a84764be88b9de5ff13e4338163
Author: Rik van Riel <[email protected]>
Date: Tue Aug 27 21:52:47 2013 +0100

sched: numa: Decide whether to favour task or group weights based on swap candidate relationships

This patch separately considers task and group affinities when
searching for swap candidates during NUMA placement. If tasks
are part of the same group, or no group at all, the task weights
are considered.

Some hysteresis is added to prevent tasks within one group from
getting bounced between NUMA nodes due to tiny differences.

If tasks are part of different groups, the code compares group
weights, in order to favor grouping task groups together.

The patch also changes the group weight multiplier to be the
same as the task weight multiplier, since the two are no longer
added up like before.
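
A sketch of the revised comparison, assuming the same taskimp/groupimp
inputs as the previous patch (illustrative, not kernel code):

/* Illustrative sketch of the comparison with hysteresis, not kernel code. */
static long compare_with_hysteresis(int p_in_group, int cur_in_group,
                                    int same_group,
                                    long taskimp, long groupimp,
                                    long cur_task_delta, long cur_group_delta)
{
        long imp;

        if (same_group) {
                /* Same group (or both groupless): compare task weights. */
                imp = taskimp + cur_task_delta;
                /* Hysteresis: ignore ~6% differences within a group. */
                if (cur_in_group)
                        imp -= imp / 16;
        } else {
                /* Different groups: use the group weight where one exists. */
                imp = p_in_group ? groupimp : taskimp;
                imp += cur_in_group ? cur_group_delta : cur_task_delta;
        }
        return imp;
}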

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 32 +++++++++++++++++++++++++-------
1 file changed, 25 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fdb7923..ac7184d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -962,7 +962,7 @@ static inline unsigned long group_weight(struct task_struct *p, int nid)
if (!total_faults)
return 0;

- return 1200 * group_faults(p, nid) / total_faults;
+ return 1000 * group_faults(p, nid) / total_faults;
}

static unsigned long weighted_cpuload(const int cpu);
@@ -1068,16 +1068,34 @@ static void task_numa_compare(struct task_numa_env *env,

/*
* If dst and source tasks are in the same NUMA group, or not
- * in any group then look only at task weights otherwise give
- * priority to the group weights.
+ * in any group then look only at task weights.
*/
- if (!cur->numa_group || ! env->p->numa_group ||
- cur->numa_group == env->p->numa_group) {
+ if (cur->numa_group == env->p->numa_group) {
imp = taskimp + task_weight(cur, env->src_nid) -
task_weight(cur, env->dst_nid);
+ /*
+ * Add some hysteresis to prevent swapping the
+ * tasks within a group over tiny differences.
+ */
+ if (cur->numa_group)
+ imp -= imp/16;
} else {
- imp = groupimp + group_weight(cur, env->src_nid) -
- group_weight(cur, env->dst_nid);
+ /*
+ * Compare the group weights. If a task is all by
+ * itself (not part of a group), use the task weight
+ * instead.
+ */
+ if (env->p->numa_group)
+ imp = groupimp;
+ else
+ imp = taskimp;
+
+ if (cur->numa_group)
+ imp += group_weight(cur, env->src_nid) -
+ group_weight(cur, env->dst_nid);
+ else
+ imp += task_weight(cur, env->src_nid) -
+ task_weight(cur, env->dst_nid);
}
}

--
1.8.1.4

2013-09-10 09:33:57

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 50/50] sched: numa: Avoid migrating tasks that are placed on their preferred node

From: Peter Zijlstra <[email protected]>

(This changelog needs more work, it's currently inaccurate and it's not
clear at exactly what point rt > env->fbq_type is true for the logic to
kick in)

This patch classifies scheduler domains and runqueues into FBQ (cannot
guess what this expands to) types which are one of

regular: There are tasks running that do not care about their NUMA
placement

remote: There are tasks running that care about their placement but are
currently running on a node remote to their ideal placement

all: No distinction

To implement this, the patch tracks the number of tasks that are optimally
NUMA placed (rq->nr_preferred_running) and the number of running tasks that
care about their placement (rq->nr_numa_running). The load balancer uses this
information to avoid migrating ideally placed NUMA tasks as long as better
options for load balancing exist.
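
A sketch of the per-runqueue classification with a worked example
(illustrative mock-up of the logic in fbq_classify_rq() below):

/*
 * Illustrative sketch, not kernel code. For example, a runqueue with
 * nr_running=4, nr_numa_running=3, nr_preferred_running=2 is "regular":
 * it still has a task that does not care about placement at all.
 */
enum fbq_type { regular, remote, all };

static enum fbq_type classify_rq(unsigned int nr_running,
                                 unsigned int nr_numa_running,
                                 unsigned int nr_preferred_running)
{
        if (nr_running > nr_numa_running)
                return regular; /* some tasks do not care about NUMA */
        if (nr_running > nr_preferred_running)
                return remote;  /* all care, some are on the wrong node */
        return all;             /* nothing to gain by distinguishing */
}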

Not-signed-off-by: Peter Zijlstra
---
kernel/sched/core.c | 29 ++++++++++++
kernel/sched/fair.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 5 ++
3 files changed, 150 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7bf0827..3fc31b7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4485,6 +4485,35 @@ int migrate_task_to(struct task_struct *p, int target_cpu)

return stop_one_cpu(curr_cpu, migration_cpu_stop, &arg);
}
+
+/*
+ * Requeue a task on a given node and accurately track the number of NUMA
+ * tasks on the runqueues
+ */
+void sched_setnuma(struct task_struct *p, int nid)
+{
+ struct rq *rq;
+ unsigned long flags;
+ bool on_rq, running;
+
+ rq = task_rq_lock(p, &flags);
+ on_rq = p->on_rq;
+ running = task_current(rq, p);
+
+ if (on_rq)
+ dequeue_task(rq, p, 0);
+ if (running)
+ p->sched_class->put_prev_task(rq, p);
+
+ p->numa_preferred_nid = nid;
+ p->numa_migrate_seq = 1;
+
+ if (running)
+ p->sched_class->set_curr_task(rq);
+ if (on_rq)
+ enqueue_task(rq, p, 0);
+ task_rq_unlock(rq, p, &flags);
+}
#endif

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ac7184d..27bc89b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -888,6 +888,18 @@ static unsigned int task_scan_max(struct task_struct *p)
*/
unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;

+static void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+ rq->nr_numa_running += (p->numa_preferred_nid != -1);
+ rq->nr_preferred_running += (p->numa_preferred_nid == task_node(p));
+}
+
+static void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+ rq->nr_numa_running -= (p->numa_preferred_nid != -1);
+ rq->nr_preferred_running -= (p->numa_preferred_nid == task_node(p));
+}
+
struct numa_group {
atomic_t refcount;

@@ -1229,6 +1241,8 @@ static int task_numa_migrate(struct task_struct *p)
if (env.best_cpu == -1)
return -EAGAIN;

+ sched_setnuma(p, env.dst_nid);
+
if (env.best_task == NULL) {
int ret = migrate_task_to(p, env.best_cpu);
return ret;
@@ -1340,8 +1354,7 @@ static void task_numa_placement(struct task_struct *p)
/* Preferred node as the node with the most faults */
if (max_faults && max_nid != p->numa_preferred_nid) {
/* Update the preferred nid and migrate task if possible */
- p->numa_preferred_nid = max_nid;
- p->numa_migrate_seq = 1;
+ sched_setnuma(p, max_nid);
numa_migrate_preferred(p);
}
}
@@ -1736,6 +1749,14 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
static void task_tick_numa(struct rq *rq, struct task_struct *curr)
{
}
+
+static inline void account_numa_enqueue(struct rq *rq, struct task_struct *p)
+{
+}
+
+static inline void account_numa_dequeue(struct rq *rq, struct task_struct *p)
+{
+}
#endif /* CONFIG_NUMA_BALANCING */

static void
@@ -1745,8 +1766,12 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
if (!parent_entity(se))
update_load_add(&rq_of(cfs_rq)->load, se->load.weight);
#ifdef CONFIG_SMP
- if (entity_is_task(se))
- list_add(&se->group_node, &rq_of(cfs_rq)->cfs_tasks);
+ if (entity_is_task(se)) {
+ struct rq *rq = rq_of(cfs_rq);
+
+ account_numa_enqueue(rq, task_of(se));
+ list_add(&se->group_node, &rq->cfs_tasks);
+ }
#endif
cfs_rq->nr_running++;
}
@@ -1757,8 +1782,10 @@ account_entity_dequeue(struct cfs_rq *cfs_rq, struct sched_entity *se)
update_load_sub(&cfs_rq->load, se->load.weight);
if (!parent_entity(se))
update_load_sub(&rq_of(cfs_rq)->load, se->load.weight);
- if (entity_is_task(se))
+ if (entity_is_task(se)) {
+ account_numa_dequeue(rq_of(cfs_rq), task_of(se));
list_del_init(&se->group_node);
+ }
cfs_rq->nr_running--;
}

@@ -4553,6 +4580,8 @@ static bool yield_to_task_fair(struct rq *rq, struct task_struct *p, bool preemp

static unsigned long __read_mostly max_load_balance_interval = HZ/10;

+enum fbq_type { regular, remote, all };
+
#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
#define LBF_DST_PINNED 0x04
@@ -4579,6 +4608,8 @@ struct lb_env {
unsigned int loop;
unsigned int loop_break;
unsigned int loop_max;
+
+ enum fbq_type fbq_type;
};

/*
@@ -5044,6 +5075,10 @@ struct sg_lb_stats {
unsigned int group_weight;
int group_imb; /* Is there an imbalance in the group ? */
int group_has_capacity; /* Is there extra capacity in the group? */
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned int nr_numa_running;
+ unsigned int nr_preferred_running;
+#endif
};

/*
@@ -5335,6 +5370,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,

sgs->group_load += load;
sgs->sum_nr_running += nr_running;
+#ifdef CONFIG_NUMA_BALANCING
+ sgs->nr_numa_running += rq->nr_numa_running;
+ sgs->nr_preferred_running += rq->nr_preferred_running;
+#endif
sgs->sum_weighted_load += weighted_cpuload(i);
if (idle_cpu(i))
sgs->idle_cpus++;
@@ -5409,14 +5448,43 @@ static bool update_sd_pick_busiest(struct lb_env *env,
return false;
}

+#ifdef CONFIG_NUMA_BALANCING
+static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
+{
+ if (sgs->sum_nr_running > sgs->nr_numa_running)
+ return regular;
+ if (sgs->sum_nr_running > sgs->nr_preferred_running)
+ return remote;
+ return all;
+}
+
+static inline enum fbq_type fbq_classify_rq(struct rq *rq)
+{
+ if (rq->nr_running > rq->nr_numa_running)
+ return regular;
+ if (rq->nr_running > rq->nr_preferred_running)
+ return remote;
+ return all;
+}
+#else
+static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
+{
+ return all;
+}
+
+static inline enum fbq_type fbq_classify_rq(struct rq *rq)
+{
+ return regular;
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
/**
* update_sd_lb_stats - Update sched_domain's statistics for load balancing.
* @env: The load balancing environment.
* @balance: Should we balance.
* @sds: variable to hold the statistics for this sched_domain.
*/
-static inline void update_sd_lb_stats(struct lb_env *env,
- struct sd_lb_stats *sds)
+static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sds)
{
struct sched_domain *child = env->sd->child;
struct sched_group *sg = env->sd->groups;
@@ -5466,6 +5534,9 @@ static inline void update_sd_lb_stats(struct lb_env *env,

sg = sg->next;
} while (sg != env->sd->groups);
+
+ if (env->sd->flags & SD_NUMA)
+ env->fbq_type = fbq_classify_group(&sds->busiest_stat);
}

/**
@@ -5768,15 +5839,47 @@ static struct rq *find_busiest_queue(struct lb_env *env,
int i;

for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
- unsigned long power = power_of(i);
- unsigned long capacity = DIV_ROUND_CLOSEST(power,
- SCHED_POWER_SCALE);
- unsigned long wl;
+ unsigned long power, capacity, wl;
+ enum fbq_type rt;

+ rq = cpu_rq(i);
+ rt = fbq_classify_rq(rq);
+
+#ifdef CONFIG_NUMA_BALANCING
+ trace_printk("group(%d:%pc) rq(%d): wl: %lu nr: %d nrn: %d nrp: %d gt:%d rt:%d\n",
+ env->sd->level, sched_group_cpus(group), i,
+ weighted_cpuload(i), rq->nr_running,
+ rq->nr_numa_running, rq->nr_preferred_running,
+ env->fbq_type, rt);
+#endif
+
+ /*
+ * We classify groups/runqueues into three groups:
+ * - regular: there are !numa tasks
+ * - remote: there are numa tasks that run on the 'wrong' node
+ * - all: there is no distinction
+ *
+ * In order to avoid migrating ideally placed numa tasks,
+ * ignore those when there's better options.
+ *
+ * If we ignore the actual busiest queue to migrate another
+ * task, the next balance pass can still reduce the busiest
+ * queue by moving tasks around inside the node.
+ *
+ * If we cannot move enough load due to this classification
+ * the next pass will adjust the group classification and
+ * allow migration of more tasks.
+ *
+ * Both cases only affect the total convergence complexity.
+ */
+ if (rt > env->fbq_type)
+ continue;
+
+ power = power_of(i);
+ capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
if (!capacity)
capacity = fix_small_capacity(env->sd, group);

- rq = cpu_rq(i);
wl = weighted_cpuload(i);

/*
@@ -5888,6 +5991,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
.idle = idle,
.loop_break = sched_nr_migrate_break,
.cpus = cpus,
+ .fbq_type = all,
};

/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c6ec25..b9bcea5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -407,6 +407,10 @@ struct rq {
* remote CPUs use both these fields when doing load calculation.
*/
unsigned int nr_running;
+#ifdef CONFIG_NUMA_BALANCING
+ unsigned int nr_numa_running;
+ unsigned int nr_preferred_running;
+#endif
#define CPU_LOAD_IDX_MAX 5
unsigned long cpu_load[CPU_LOAD_IDX_MAX];
unsigned long last_load_update_tick;
@@ -555,6 +559,7 @@ static inline u64 rq_clock_task(struct rq *rq)
}

#ifdef CONFIG_NUMA_BALANCING
+extern void sched_setnuma(struct task_struct *p, int node);
extern int migrate_task_to(struct task_struct *p, int cpu);
extern int migrate_swap(struct task_struct *, struct task_struct *);
extern void task_numa_free(struct task_struct *p);
--
1.8.1.4

2013-09-10 09:34:19

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 47/50] sched: numa: add debugging

From: Ingo Molnar <[email protected]>

Signed-off-by: Ingo Molnar <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
---
include/linux/sched.h | 6 ++++++
kernel/sched/debug.c | 60 +++++++++++++++++++++++++++++++++++++++++++++++++--
kernel/sched/fair.c | 5 ++++-
3 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 46fb36a..ac08eb6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1357,6 +1357,7 @@ struct task_struct {
unsigned long *numa_faults_buffer;

int numa_preferred_nid;
+ unsigned long numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */

struct rcu_head rcu;
@@ -2577,6 +2578,11 @@ static inline unsigned int task_cpu(const struct task_struct *p)
return task_thread_info(p)->cpu;
}

+static inline int task_node(const struct task_struct *p)
+{
+ return cpu_to_node(task_cpu(p));
+}
+
extern void set_task_cpu(struct task_struct *p, unsigned int cpu);

#else
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index e076bdd..49ab782 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -15,6 +15,7 @@
#include <linux/seq_file.h>
#include <linux/kallsyms.h>
#include <linux/utsname.h>
+#include <linux/mempolicy.h>

#include "sched.h"

@@ -137,6 +138,9 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
SEQ_printf(m, "%15Ld %15Ld %15Ld.%06ld %15Ld.%06ld %15Ld.%06ld",
0LL, 0LL, 0LL, 0L, 0LL, 0L, 0LL, 0L);
#endif
+#ifdef CONFIG_NUMA_BALANCING
+ SEQ_printf(m, " %d", cpu_to_node(task_cpu(p)));
+#endif
#ifdef CONFIG_CGROUP_SCHED
SEQ_printf(m, " %s", task_group_path(task_group(p)));
#endif
@@ -159,7 +163,7 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
read_lock_irqsave(&tasklist_lock, flags);

do_each_thread(g, p) {
- if (!p->on_rq || task_cpu(p) != rq_cpu)
+ if (task_cpu(p) != rq_cpu)
continue;

print_task(m, rq, p);
@@ -345,7 +349,7 @@ static void sched_debug_header(struct seq_file *m)
cpu_clk = local_clock();
local_irq_restore(flags);

- SEQ_printf(m, "Sched Debug Version: v0.10, %s %.*s\n",
+ SEQ_printf(m, "Sched Debug Version: v0.11, %s %.*s\n",
init_utsname()->release,
(int)strcspn(init_utsname()->version, " "),
init_utsname()->version);
@@ -488,6 +492,56 @@ static int __init init_sched_debug_procfs(void)

__initcall(init_sched_debug_procfs);

+#define __P(F) \
+ SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)F)
+#define P(F) \
+ SEQ_printf(m, "%-45s:%21Ld\n", #F, (long long)p->F)
+#define __PN(F) \
+ SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)F))
+#define PN(F) \
+ SEQ_printf(m, "%-45s:%14Ld.%06ld\n", #F, SPLIT_NS((long long)p->F))
+
+
+static void sched_show_numa(struct task_struct *p, struct seq_file *m)
+{
+#ifdef CONFIG_NUMA_BALANCING
+ struct mempolicy *pol;
+ int node, i;
+
+ if (p->mm)
+ P(mm->numa_scan_seq);
+
+ task_lock(p);
+ pol = p->mempolicy;
+ if (pol && !(pol->flags & MPOL_F_MORON))
+ pol = NULL;
+ mpol_get(pol);
+ task_unlock(p);
+
+ SEQ_printf(m, "numa_migrations, %ld\n", xchg(&p->numa_pages_migrated, 0));
+
+ for_each_online_node(node) {
+ for (i = 0; i < 2; i++) {
+ unsigned long nr_faults = -1;
+ int cpu_current, home_node;
+
+ if (p->numa_faults)
+ nr_faults = p->numa_faults[2*node + i];
+
+ cpu_current = !i ? (task_node(p) == node) :
+ (pol && node_isset(node, pol->v.nodes));
+
+ home_node = (p->numa_preferred_nid == node);
+
+ SEQ_printf(m, "numa_faults, %d, %d, %d, %d, %ld\n",
+ i, node, cpu_current, home_node, nr_faults);
+ }
+ }
+
+ mpol_put(pol);
+#endif
+}
+
void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
{
unsigned long nr_switches;
@@ -591,6 +645,8 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
SEQ_printf(m, "%-45s:%21Ld\n",
"clock-delta", (long long)(t1-t0));
}
+
+ sched_show_numa(p, m);
}

void proc_sched_set_task(struct task_struct *p)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4653f71..80906fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1138,7 +1138,7 @@ static int task_numa_migrate(struct task_struct *p)
.p = p,

.src_cpu = task_cpu(p),
- .src_nid = cpu_to_node(task_cpu(p)),
+ .src_nid = task_node(p),

.imbalance_pct = 112,

@@ -1510,6 +1510,9 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)
if (p->numa_migrate_retry && time_after(jiffies, p->numa_migrate_retry))
numa_migrate_preferred(p);

+ if (migrated)
+ p->numa_pages_migrated += pages;
+
p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
}

--
1.8.1.4

2013-09-10 09:34:37

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 46/50] sched: numa: Prevent parallel updates to group stats during placement

Having multiple tasks in a group go through task_numa_placement
simultaneously can lead to a task picking a wrong node to run on, because
the group stats may be in the middle of an update. This patch avoids
parallel updates by holding the numa_group lock during placement
decisions.
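
A userspace sketch of the locking pattern (the kernel uses the numa_group
spinlock; the pthread mutex and mock struct below are stand-ins):

#include <pthread.h>

struct group_stats {
        pthread_mutex_t lock;
        long faults[8];         /* per-node fault counts, size illustrative */
};

static void placement(struct group_stats *grp)
{
        if (grp)
                pthread_mutex_lock(&grp->lock);

        /* ... walk the nodes, accumulate task and group faults, pick max ... */

        if (grp)
                pthread_mutex_unlock(&grp->lock);
}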

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 35 +++++++++++++++++++++++------------
1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a92c58..4653f71 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1231,6 +1231,7 @@ static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1, max_group_nid = -1;
unsigned long max_faults = 0, max_group_faults = 0;
+ spinlock_t *group_lock = NULL;

seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
@@ -1239,6 +1240,12 @@ static void task_numa_placement(struct task_struct *p)
p->numa_migrate_seq++;
p->numa_scan_period_max = task_scan_max(p);

+ /* If the task is part of a group prevent parallel updates to group stats */
+ if (p->numa_group) {
+ group_lock = &p->numa_group->lock;
+ spin_lock(group_lock);
+ }
+
/* Find the node with the highest number of faults */
for_each_online_node(nid) {
unsigned long faults = 0, group_faults = 0;
@@ -1277,20 +1284,24 @@ static void task_numa_placement(struct task_struct *p)
}
}

- /*
- * If the preferred task and group nids are different,
- * iterate over the nodes again to find the best place.
- */
- if (p->numa_group && max_nid != max_group_nid) {
- unsigned long weight, max_weight = 0;
-
- for_each_online_node(nid) {
- weight = task_weight(p, nid) + group_weight(p, nid);
- if (weight > max_weight) {
- max_weight = weight;
- max_nid = nid;
+ if (p->numa_group) {
+ /*
+ * If the preferred task and group nids are different,
+ * iterate over the nodes again to find the best place.
+ */
+ if (max_nid != max_group_nid) {
+ unsigned long weight, max_weight = 0;
+
+ for_each_online_node(nid) {
+ weight = task_weight(p, nid) + group_weight(p, nid);
+ if (weight > max_weight) {
+ max_weight = weight;
+ max_nid = nid;
+ }
}
}
+
+ spin_unlock(group_lock);
}

/* Preferred node as the node with the most faults */
--
1.8.1.4

2013-09-10 09:34:51

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 45/50] sched: numa: use group fault statistics in numa placement

This patch uses the fraction of faults on a particular node for both the
task and the group to figure out the best node on which to place a task.
If the task and group statistics disagree on what the preferred node should
be, a full rescan selects the node with the best combined weight.
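
A sketch of the two weights as used in this patch (illustrative; mirrors
task_weight() and group_weight() below, where the group fraction gets the
larger multiplier so related tasks are pulled together even when their
private faults mildly disagree):

static unsigned long task_weight_frac(unsigned long node_faults,
                                      unsigned long total_faults)
{
        return total_faults ? 1000 * node_faults / total_faults : 0;
}

static unsigned long group_weight_frac(unsigned long node_group_faults,
                                       unsigned long total_group_faults)
{
        return total_group_faults ?
                1200 * node_group_faults / total_group_faults : 0;
}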

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 134 +++++++++++++++++++++++++++++++++++++++++---------
2 files changed, 113 insertions(+), 22 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4f51ceb..46fb36a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1347,6 +1347,7 @@ struct task_struct {
* The values remain static for the duration of a PTE scan
*/
unsigned long *numa_faults;
+ unsigned long total_numa_faults;

/*
* numa_faults_buffer records faults per node during the current
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ecfce3e..3a92c58 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -897,6 +897,7 @@ struct numa_group {
struct list_head task_list;

struct rcu_head rcu;
+ atomic_long_t total_faults;
atomic_long_t faults[0];
};

@@ -919,6 +920,51 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
p->numa_faults[task_faults_idx(nid, 1)];
}

+static inline unsigned long group_faults(struct task_struct *p, int nid)
+{
+ if (!p->numa_group)
+ return 0;
+
+ return atomic_long_read(&p->numa_group->faults[2*nid]) +
+ atomic_long_read(&p->numa_group->faults[2*nid+1]);
+}
+
+/*
+ * These return the fraction of accesses done by a particular task, or
+ * task group, on a particular numa node. The group weight is given a
+ * larger multiplier, in order to group tasks together that are almost
+ * evenly spread out between numa nodes.
+ */
+static inline unsigned long task_weight(struct task_struct *p, int nid)
+{
+ unsigned long total_faults;
+
+ if (!p->numa_faults)
+ return 0;
+
+ total_faults = p->total_numa_faults;
+
+ if (!total_faults)
+ return 0;
+
+ return 1000 * task_faults(p, nid) / total_faults;
+}
+
+static inline unsigned long group_weight(struct task_struct *p, int nid)
+{
+ unsigned long total_faults;
+
+ if (!p->numa_group)
+ return 0;
+
+ total_faults = atomic_long_read(&p->numa_group->total_faults);
+
+ if (!total_faults)
+ return 0;
+
+ return 1200 * group_faults(p, nid) / total_faults;
+}
+
static unsigned long weighted_cpuload(const int cpu);
static unsigned long source_load(int cpu, int type);
static unsigned long target_load(int cpu, int type);
@@ -1018,8 +1064,10 @@ static void task_numa_compare(struct task_numa_env *env, long imp)
if (!cpumask_test_cpu(env->src_cpu, tsk_cpus_allowed(cur)))
goto unlock;

- imp += task_faults(cur, env->src_nid) -
- task_faults(cur, env->dst_nid);
+ imp += task_weight(cur, env->src_nid) +
+ group_weight(cur, env->src_nid) -
+ task_weight(cur, env->dst_nid) -
+ group_weight(cur, env->dst_nid);
}

if (imp < env->best_imp)
@@ -1099,7 +1147,7 @@ static int task_numa_migrate(struct task_struct *p)
.best_cpu = -1
};
struct sched_domain *sd;
- unsigned long faults;
+ unsigned long weight;
int nid, ret;
long imp;

@@ -1116,10 +1164,10 @@ static int task_numa_migrate(struct task_struct *p)
}
rcu_read_unlock();

- faults = task_faults(p, env.src_nid);
+ weight = task_weight(p, env.src_nid) + group_weight(p, env.src_nid);
update_numa_stats(&env.src_stats, env.src_nid);
env.dst_nid = p->numa_preferred_nid;
- imp = task_faults(env.p, env.dst_nid) - faults;
+ imp = task_weight(p, env.dst_nid) + group_weight(p, env.dst_nid) - weight;
update_numa_stats(&env.dst_stats, env.dst_nid);

/*
@@ -1133,8 +1181,8 @@ static int task_numa_migrate(struct task_struct *p)
if (nid == env.src_nid || nid == p->numa_preferred_nid)
continue;

- /* Only consider nodes that recorded more faults */
- imp = task_faults(env.p, nid) - faults;
+ /* Only consider nodes where both task and groups benefit */
+ imp = task_weight(p, nid) + group_weight(p, nid) - weight;
if (imp < 0)
continue;

@@ -1181,8 +1229,8 @@ static void numa_migrate_preferred(struct task_struct *p)

static void task_numa_placement(struct task_struct *p)
{
- int seq, nid, max_nid = -1;
- unsigned long max_faults = 0;
+ int seq, nid, max_nid = -1, max_group_nid = -1;
+ unsigned long max_faults = 0, max_group_faults = 0;

seq = ACCESS_ONCE(p->mm->numa_scan_seq);
if (p->numa_scan_seq == seq)
@@ -1193,7 +1241,7 @@ static void task_numa_placement(struct task_struct *p)

/* Find the node with the highest number of faults */
for_each_online_node(nid) {
- unsigned long faults = 0;
+ unsigned long faults = 0, group_faults = 0;
int priv, i;

for (priv = 0; priv < 2; priv++) {
@@ -1209,9 +1257,12 @@ static void task_numa_placement(struct task_struct *p)

faults += p->numa_faults[i];
diff += p->numa_faults[i];
+ p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
atomic_long_add(diff, &p->numa_group->faults[i]);
+ atomic_long_add(diff, &p->numa_group->total_faults);
+ group_faults += atomic_long_read(&p->numa_group->faults[i]);
}
}

@@ -1219,6 +1270,27 @@ static void task_numa_placement(struct task_struct *p)
max_faults = faults;
max_nid = nid;
}
+
+ if (group_faults > max_group_faults) {
+ max_group_faults = group_faults;
+ max_group_nid = nid;
+ }
+ }
+
+ /*
+ * If the preferred task and group nids are different,
+ * iterate over the nodes again to find the best place.
+ */
+ if (p->numa_group && max_nid != max_group_nid) {
+ unsigned long weight, max_weight = 0;
+
+ for_each_online_node(nid) {
+ weight = task_weight(p, nid) + group_weight(p, nid);
+ if (weight > max_weight) {
+ max_weight = weight;
+ max_nid = nid;
+ }
+ }
}

/* Preferred node as the node with the most faults */
@@ -1273,6 +1345,8 @@ static void task_numa_group(struct task_struct *p, int cpu, int pid)
for (i = 0; i < 2*nr_node_ids; i++)
atomic_long_set(&grp->faults[i], p->numa_faults[i]);

+ atomic_long_set(&grp->total_faults, p->total_numa_faults);
+
list_add(&p->numa_entry, &grp->task_list);
grp->nr_tasks++;
rcu_assign_pointer(p->numa_group, grp);
@@ -1320,6 +1394,8 @@ unlock:
atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
atomic_long_add(p->numa_faults[i], &grp->faults[i]);
}
+ atomic_long_sub(p->total_numa_faults, &my_grp->total_faults);
+ atomic_long_add(p->total_numa_faults, &grp->total_faults);

double_lock(&my_grp->lock, &grp->lock);

@@ -1340,12 +1416,12 @@ void task_numa_free(struct task_struct *p)
struct numa_group *grp = p->numa_group;
int i;

- kfree(p->numa_faults);
-
if (grp) {
for (i = 0; i < 2*nr_node_ids; i++)
atomic_long_sub(p->numa_faults[i], &grp->faults[i]);

+ atomic_long_sub(p->total_numa_faults, &grp->total_faults);
+
spin_lock(&grp->lock);
list_del(&p->numa_entry);
grp->nr_tasks--;
@@ -1353,6 +1429,8 @@ void task_numa_free(struct task_struct *p)
rcu_assign_pointer(p->numa_group, NULL);
put_numa_group(grp);
}
+
+ kfree(p->numa_faults);
}

/*
@@ -1382,6 +1460,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, int flags)

BUG_ON(p->numa_faults_buffer);
p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
+ p->total_numa_faults = 0;
}

/*
@@ -4527,12 +4606,17 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
src_nid = cpu_to_node(env->src_cpu);
dst_nid = cpu_to_node(env->dst_cpu);

- if (src_nid == dst_nid ||
- p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+ if (src_nid == dst_nid)
return false;

- if (dst_nid == p->numa_preferred_nid ||
- task_faults(p, dst_nid) > task_faults(p, src_nid))
+ /* Always encourage migration to the preferred node. */
+ if (dst_nid == p->numa_preferred_nid)
+ return true;
+
+ /* After the task has settled, check if the new node is better. */
+ if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
+ task_weight(p, dst_nid) + group_weight(p, dst_nid) >
+ task_weight(p, src_nid) + group_weight(p, src_nid))
return true;

return false;
@@ -4552,14 +4636,20 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
src_nid = cpu_to_node(env->src_cpu);
dst_nid = cpu_to_node(env->dst_cpu);

- if (src_nid == dst_nid ||
- p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+ if (src_nid == dst_nid)
return false;

- if (task_faults(p, dst_nid) < task_faults(p, src_nid))
- return true;
-
- return false;
+ /* Migrating away from the preferred node is always bad. */
+ if (src_nid == p->numa_preferred_nid)
+ return true;
+
+ /* After the task has settled, check if the new node is worse. */
+ if (p->numa_migrate_seq >= sysctl_numa_balancing_settle_count &&
+ task_weight(p, dst_nid) + group_weight(p, dst_nid) <
+ task_weight(p, src_nid) + group_weight(p, src_nid))
+ return true;
+
+ return false;
}

#else
--
1.8.1.4

2013-09-10 09:35:10

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 43/50] mm: numa: Do not group on RO pages

From: Peter Zijlstra <[email protected]>

And here's a little something to make sure the whole world does not end up
in a single group.

While we don't migrate shared executable pages, we do scan and fault on
them, and since everybody links to libc, everybody would otherwise end up
in the same group.
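
A sketch of the flag handling (illustrative; the TNF_* values are the ones
added by this patch):

#define TNF_MIGRATED    0x01
#define TNF_NO_GROUP    0x02

static int hinting_fault_flags(int pte_writable, int migrated)
{
        int flags = 0;

        if (!pte_writable)
                flags |= TNF_NO_GROUP;  /* do not group on RO pages */
        if (migrated)
                flags |= TNF_MIGRATED;
        return flags;
}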

[[email protected]: mapcount 1]
Suggested-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 7 +++++--
kernel/sched/fair.c | 5 +++--
mm/huge_memory.c | 17 ++++++++++++++---
mm/memory.c | 30 ++++++++++++++++++++++++++----
4 files changed, 48 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4fad1f17..15888f5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1434,13 +1434,16 @@ struct task_struct {
/* Future-safe accessor for struct task_struct's cpus_allowed. */
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

+#define TNF_MIGRATED 0x01
+#define TNF_NO_GROUP 0x02
+
#ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, int flags);
extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
- bool migrated)
+ int flags)
{
}
static inline pid_t task_numa_group_id(struct task_struct *p)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1faf3ff..ecfce3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1358,9 +1358,10 @@ void task_numa_free(struct task_struct *p)
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
+void task_numa_fault(int last_cpupid, int node, int pages, int flags)
{
struct task_struct *p = current;
+ bool migrated = flags & TNF_MIGRATED;
int priv;

if (!numabalancing_enabled)
@@ -1396,7 +1397,7 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
pid = cpupid_to_pid(last_cpupid);

priv = (pid == (p->pid & LAST__PID_MASK));
- if (!priv)
+ if (!priv && !(flags & TNF_NO_GROUP))
task_numa_group(p, cpu, pid);
}

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index cf903fc..5c339a1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1297,6 +1297,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
int target_nid, last_cpupid = -1;
bool page_locked;
bool migrated = false;
+ int flags = 0;

spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
@@ -1311,6 +1312,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);

/*
+ * Avoid grouping on DSO/COW pages in specific and RO pages
+ * in general, RO pages shouldn't hurt as much anyway since
+ * they can be in shared cache state.
+ */
+ if (!pmd_write(pmd))
+ flags |= TNF_NO_GROUP;
+
+ /*
* Acquire the page lock to serialise THP migrations but avoid dropping
* page_table_lock if at all possible
*/
@@ -1350,10 +1359,12 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
- if (migrated)
+ if (migrated) {
page_nid = target_nid;
- else
+ flags |= TNF_MIGRATED;
+ } else {
goto check_same;
+ }

goto out;

@@ -1377,7 +1388,7 @@ out:
page_unlock_anon_vma_read(anon_vma);

if (page_nid != -1)
- task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, migrated);
+ task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, flags);

return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index f779403..1aa4187 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3558,6 +3558,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
int last_cpupid;
int target_nid;
bool migrated = false;
+ int flags = 0;

/*
* The "pte" at this point cannot be used safely without
@@ -3586,6 +3587,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
BUG_ON(is_zero_pfn(page_to_pfn(page)));

+ /*
+ * Avoid grouping on DSO/COW pages in particular, and on RO
+ * pages in general; RO pages shouldn't hurt as much anyway
+ * since they can be in a shared cache state.
+ */
+ if (!pte_write(pte))
+ flags |= TNF_NO_GROUP;
+
last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
@@ -3597,12 +3606,14 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,

/* Migrate to the requested node */
migrated = migrate_misplaced_page(page, vma, target_nid);
- if (migrated)
+ if (migrated) {
page_nid = target_nid;
+ flags |= TNF_MIGRATED;
+ }

out:
if (page_nid != -1)
- task_numa_fault(last_cpupid, page_nid, 1, migrated);
+ task_numa_fault(last_cpupid, page_nid, 1, flags);
return 0;
}

@@ -3643,6 +3654,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
int page_nid = -1;
int target_nid;
bool migrated = false;
+ int flags = 0;

if (!pte_present(pteval))
continue;
@@ -3662,20 +3674,30 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!page))
continue;

+ /*
+ * Avoid grouping on DSO/COW pages in particular, and on RO
+ * pages in general; RO pages shouldn't hurt as much anyway
+ * since they can be in a shared cache state.
+ */
+ if (!pte_write(pteval))
+ flags |= TNF_NO_GROUP;
+
last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(pte, ptl);
if (target_nid != -1) {
migrated = migrate_misplaced_page(page, vma, target_nid);
- if (migrated)
+ if (migrated) {
page_nid = target_nid;
+ flags |= TNF_MIGRATED;
+ }
} else {
put_page(page);
}

if (page_nid != -1)
- task_numa_fault(last_cpupid, page_nid, 1, migrated);
+ task_numa_fault(last_cpupid, page_nid, 1, flags);

pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
--
1.8.1.4

2013-09-10 09:35:44

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 42/50] sched: numa: Report a NUMA task group ID

It is desirable to model from userspace how the scheduler groups tasks
over time. This patch adds an ID to the numa_group and reports it via
/proc/PID/status.
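
For illustration only, a minimal userspace sketch (not part of the patch)
that reads the new field back; the "Ngid:" name matches the seq_printf()
format below, everything else here is just example plumbing:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* "Ngid:\t<numa group id>" as added by this patch */
		if (!strncmp(line, "Ngid:", 5))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}

Note the reported ID is 0 until the task has actually been placed in a
numa_group (see task_numa_group_id() below).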

Signed-off-by: Mel Gorman <[email protected]>
---
fs/proc/array.c | 2 ++
include/linux/sched.h | 5 +++++
kernel/sched/fair.c | 7 +++++++
3 files changed, 14 insertions(+)

diff --git a/fs/proc/array.c b/fs/proc/array.c
index cbd0f1b..1bd2077 100644
--- a/fs/proc/array.c
+++ b/fs/proc/array.c
@@ -183,6 +183,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
seq_printf(m,
"State:\t%s\n"
"Tgid:\t%d\n"
+ "Ngid:\t%d\n"
"Pid:\t%d\n"
"PPid:\t%d\n"
"TracerPid:\t%d\n"
@@ -190,6 +191,7 @@ static inline void task_state(struct seq_file *m, struct pid_namespace *ns,
"Gid:\t%d\t%d\t%d\t%d\n",
get_task_state(p),
task_tgid_nr_ns(p, ns),
+ task_numa_group_id(p),
pid_nr_ns(pid, ns),
ppid, tpid,
from_kuid_munged(user_ns, cred->uid),
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ea057a2..4fad1f17 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1436,12 +1436,17 @@ struct task_struct {

#ifdef CONFIG_NUMA_BALANCING
extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
+extern pid_t task_numa_group_id(struct task_struct *p);
extern void set_numabalancing_state(bool enabled);
#else
static inline void task_numa_fault(int last_node, int node, int pages,
bool migrated)
{
}
+static inline pid_t task_numa_group_id(struct task_struct *p)
+{
+ return 0;
+}
static inline void set_numabalancing_state(bool enabled)
{
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b80eaa2..1faf3ff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -893,12 +893,18 @@ struct numa_group {

spinlock_t lock; /* nr_tasks, tasks */
int nr_tasks;
+ pid_t gid;
struct list_head task_list;

struct rcu_head rcu;
atomic_long_t faults[0];
};

+pid_t task_numa_group_id(struct task_struct *p)
+{
+ return p->numa_group ? p->numa_group->gid : 0;
+}
+
static inline int task_faults_idx(int nid, int priv)
{
return 2 * nid + priv;
@@ -1262,6 +1268,7 @@ static void task_numa_group(struct task_struct *p, int cpu, int pid)
atomic_set(&grp->refcount, 1);
spin_lock_init(&grp->lock);
INIT_LIST_HEAD(&grp->task_list);
+ grp->gid = p->pid;

for (i = 0; i < 2*nr_node_ids; i++)
atomic_long_set(&grp->faults[i], p->numa_faults[i]);
--
1.8.1.4

2013-09-10 09:36:01

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 40/50] mm: numa: Change page last {nid,pid} into {cpu,pid}

From: Peter Zijlstra <[email protected]>

Change the per page last fault tracking to use cpu,pid instead of
nid,pid. This will allow us to try and lookup the alternate task more
easily. Note that even though it is the cpu that is stored in the page
flags, the mpol_misplaced decision is still based on the node.
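
As a rough standalone sketch of the encoding (not kernel code), the cpu
and pid are packed into adjacent bit fields; the 8-bit pid mask mirrors
LAST__PID_SHIFT, while the cpu width here is an arbitrary example value
standing in for ilog2(CONFIG_NR_CPUS):

#include <stdio.h>

#define EX_PID_SHIFT	8
#define EX_PID_MASK	((1 << EX_PID_SHIFT) - 1)
#define EX_CPU_SHIFT	6	/* e.g. ilog2(64) for a 64-CPU box */
#define EX_CPU_MASK	((1 << EX_CPU_SHIFT) - 1)

static int cpu_pid_to_cpupid(int cpu, int pid)
{
	return ((cpu & EX_CPU_MASK) << EX_PID_SHIFT) | (pid & EX_PID_MASK);
}

int main(void)
{
	int cpupid = cpu_pid_to_cpupid(3, 4242);

	/* prints cpu=3 pid=146: only the low 8 bits of the pid survive */
	printf("cpu=%d pid=%d\n",
	       (cpupid >> EX_PID_SHIFT) & EX_CPU_MASK,
	       cpupid & EX_PID_MASK);
	return 0;
}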

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm.h | 90 ++++++++++++++++++++++-----------------
include/linux/mm_types.h | 4 +-
include/linux/page-flags-layout.h | 22 +++++-----
kernel/bounds.c | 4 ++
kernel/sched/fair.c | 6 +--
mm/huge_memory.c | 8 ++--
mm/memory.c | 16 +++----
mm/mempolicy.c | 16 ++++---
mm/migrate.c | 4 +-
mm/mm_init.c | 18 ++++----
mm/mmzone.c | 14 +++---
mm/mprotect.c | 28 ++++++------
mm/page_alloc.c | 4 +-
13 files changed, 125 insertions(+), 109 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0a0db6c..61dc023 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -588,11 +588,11 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
* sets it, so none of the operations on it need to be atomic.
*/

-/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_NIDPID] | ... | FLAGS | */
+/* Page flags: | [SECTION] | [NODE] | ZONE | [LAST_CPUPID] | ... | FLAGS | */
#define SECTIONS_PGOFF ((sizeof(unsigned long)*8) - SECTIONS_WIDTH)
#define NODES_PGOFF (SECTIONS_PGOFF - NODES_WIDTH)
#define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
-#define LAST_NIDPID_PGOFF (ZONES_PGOFF - LAST_NIDPID_WIDTH)
+#define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH)

/*
* Define the bit shifts to access each section. For non-existent
@@ -602,7 +602,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define SECTIONS_PGSHIFT (SECTIONS_PGOFF * (SECTIONS_WIDTH != 0))
#define NODES_PGSHIFT (NODES_PGOFF * (NODES_WIDTH != 0))
#define ZONES_PGSHIFT (ZONES_PGOFF * (ZONES_WIDTH != 0))
-#define LAST_NIDPID_PGSHIFT (LAST_NIDPID_PGOFF * (LAST_NIDPID_WIDTH != 0))
+#define LAST_CPUPID_PGSHIFT (LAST_CPUPID_PGOFF * (LAST_CPUPID_WIDTH != 0))

/* NODE:ZONE or SECTION:ZONE is used to ID a zone for the buddy allocator */
#ifdef NODE_NOT_IN_PAGE_FLAGS
@@ -624,7 +624,7 @@ static inline pte_t maybe_mkwrite(pte_t pte, struct vm_area_struct *vma)
#define ZONES_MASK ((1UL << ZONES_WIDTH) - 1)
#define NODES_MASK ((1UL << NODES_WIDTH) - 1)
#define SECTIONS_MASK ((1UL << SECTIONS_WIDTH) - 1)
-#define LAST_NIDPID_MASK ((1UL << LAST_NIDPID_WIDTH) - 1)
+#define LAST_CPUPID_MASK ((1UL << LAST_CPUPID_WIDTH) - 1)
#define ZONEID_MASK ((1UL << ZONEID_SHIFT) - 1)

static inline enum zone_type page_zonenum(const struct page *page)
@@ -668,96 +668,106 @@ static inline int page_to_nid(const struct page *page)
#endif

#ifdef CONFIG_NUMA_BALANCING
-static inline int nid_pid_to_nidpid(int nid, int pid)
+static inline int cpu_pid_to_cpupid(int cpu, int pid)
{
- return ((nid & LAST__NID_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
+ return ((cpu & LAST__CPU_MASK) << LAST__PID_SHIFT) | (pid & LAST__PID_MASK);
}

-static inline int nidpid_to_pid(int nidpid)
+static inline int cpupid_to_pid(int cpupid)
{
- return nidpid & LAST__PID_MASK;
+ return cpupid & LAST__PID_MASK;
}

-static inline int nidpid_to_nid(int nidpid)
+static inline int cpupid_to_cpu(int cpupid)
{
- return (nidpid >> LAST__PID_SHIFT) & LAST__NID_MASK;
+ return (cpupid >> LAST__PID_SHIFT) & LAST__CPU_MASK;
}

-static inline bool nidpid_pid_unset(int nidpid)
+static inline int cpupid_to_nid(int cpupid)
{
- return nidpid_to_pid(nidpid) == (-1 & LAST__PID_MASK);
+ return cpu_to_node(cpupid_to_cpu(cpupid));
}

-static inline bool nidpid_nid_unset(int nidpid)
+static inline bool cpupid_pid_unset(int cpupid)
{
- return nidpid_to_nid(nidpid) == (-1 & LAST__NID_MASK);
+ return cpupid_to_pid(cpupid) == (-1 & LAST__PID_MASK);
}

-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-static inline int page_nidpid_xchg_last(struct page *page, int nid)
+static inline bool cpupid_cpu_unset(int cpupid)
{
- return xchg(&page->_last_nidpid, nid);
+ return cpupid_to_cpu(cpupid) == (-1 & LAST__CPU_MASK);
}

-static inline int page_nidpid_last(struct page *page)
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
{
- return page->_last_nidpid;
+ return xchg(&page->_last_cpupid, cpupid);
}
-static inline void page_nidpid_reset_last(struct page *page)
+
+static inline int page_cpupid_last(struct page *page)
+{
+ return page->_last_cpupid;
+}
+static inline void page_cpupid_reset_last(struct page *page)
{
- page->_last_nidpid = -1;
+ page->_last_cpupid = -1;
}
#else
-static inline int page_nidpid_last(struct page *page)
+static inline int page_cpupid_last(struct page *page)
{
- return (page->flags >> LAST_NIDPID_PGSHIFT) & LAST_NIDPID_MASK;
+ return (page->flags >> LAST_CPUPID_PGSHIFT) & LAST_CPUPID_MASK;
}

-extern int page_nidpid_xchg_last(struct page *page, int nidpid);
+extern int page_cpupid_xchg_last(struct page *page, int cpupid);

-static inline void page_nidpid_reset_last(struct page *page)
+static inline void page_cpupid_reset_last(struct page *page)
{
- int nidpid = (1 << LAST_NIDPID_SHIFT) - 1;
+ int cpupid = (1 << LAST_CPUPID_SHIFT) - 1;

- page->flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
- page->flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
+ page->flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+ page->flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
}
-#endif /* LAST_NIDPID_NOT_IN_PAGE_FLAGS */
-#else
-static inline int page_nidpid_xchg_last(struct page *page, int nidpid)
+#endif /* LAST_CPUPID_NOT_IN_PAGE_FLAGS */
+#else /* !CONFIG_NUMA_BALANCING */
+static inline int page_cpupid_xchg_last(struct page *page, int cpupid)
{
- return page_to_nid(page);
+ return page_to_nid(page); /* XXX */
}

-static inline int page_nidpid_last(struct page *page)
+static inline int page_cpupid_last(struct page *page)
{
- return page_to_nid(page);
+ return page_to_nid(page); /* XXX */
}

-static inline int nidpid_to_nid(int nidpid)
+static inline int cpupid_to_nid(int cpupid)
{
return -1;
}

-static inline int nidpid_to_pid(int nidpid)
+static inline int cpupid_to_pid(int cpupid)
{
return -1;
}

-static inline int nid_pid_to_nidpid(int nid, int pid)
+static inline int cpupid_to_cpu(int cpupid)
{
return -1;
}

-static inline bool nidpid_pid_unset(int nidpid)
+static inline int cpu_pid_to_cpupid(int nid, int pid)
+{
+ return -1;
+}
+
+static inline bool cpupid_pid_unset(int cpupid)
{
return 1;
}

-static inline void page_nidpid_reset_last(struct page *page)
+static inline void page_cpupid_reset_last(struct page *page)
{
}
-#endif
+#endif /* CONFIG_NUMA_BALANCING */

static inline struct zone *page_zone(const struct page *page)
{
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index f46378e..b0370cd 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -174,8 +174,8 @@ struct page {
void *shadow;
#endif

-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
- int _last_nidpid;
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+ int _last_cpupid;
#endif
}
/*
diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
index 02bc918..da52366 100644
--- a/include/linux/page-flags-layout.h
+++ b/include/linux/page-flags-layout.h
@@ -39,9 +39,9 @@
* lookup is necessary.
*
* No sparsemem or sparsemem vmemmap: | NODE | ZONE | ... | FLAGS |
- * " plus space for last_nidpid: | NODE | ZONE | LAST_NIDPID ... | FLAGS |
+ * " plus space for last_cpupid: | NODE | ZONE | LAST_CPUPID ... | FLAGS |
* classic sparse with space for node:| SECTION | NODE | ZONE | ... | FLAGS |
- * " plus space for last_nidpid: | SECTION | NODE | ZONE | LAST_NIDPID ... | FLAGS |
+ * " plus space for last_cpupid: | SECTION | NODE | ZONE | LAST_CPUPID ... | FLAGS |
* classic sparse no space for node: | SECTION | ZONE | ... | FLAGS |
*/
#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP)
@@ -65,18 +65,18 @@
#define LAST__PID_SHIFT 8
#define LAST__PID_MASK ((1 << LAST__PID_SHIFT)-1)

-#define LAST__NID_SHIFT NODES_SHIFT
-#define LAST__NID_MASK ((1 << LAST__NID_SHIFT)-1)
+#define LAST__CPU_SHIFT NR_CPUS_BITS
+#define LAST__CPU_MASK ((1 << LAST__CPU_SHIFT)-1)

-#define LAST_NIDPID_SHIFT (LAST__PID_SHIFT+LAST__NID_SHIFT)
+#define LAST_CPUPID_SHIFT (LAST__PID_SHIFT+LAST__CPU_SHIFT)
#else
-#define LAST_NIDPID_SHIFT 0
+#define LAST_CPUPID_SHIFT 0
#endif

-#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_NIDPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
-#define LAST_NIDPID_WIDTH LAST_NIDPID_SHIFT
+#if SECTIONS_WIDTH+ZONES_WIDTH+NODES_SHIFT+LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
+#define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
#else
-#define LAST_NIDPID_WIDTH 0
+#define LAST_CPUPID_WIDTH 0
#endif

/*
@@ -87,8 +87,8 @@
#define NODE_NOT_IN_PAGE_FLAGS
#endif

-#if defined(CONFIG_NUMA_BALANCING) && LAST_NIDPID_WIDTH == 0
-#define LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#if defined(CONFIG_NUMA_BALANCING) && LAST_CPUPID_WIDTH == 0
+#define LAST_CPUPID_NOT_IN_PAGE_FLAGS
#endif

#endif /* _LINUX_PAGE_FLAGS_LAYOUT */
diff --git a/kernel/bounds.c b/kernel/bounds.c
index 0c9b862..e8ca97b 100644
--- a/kernel/bounds.c
+++ b/kernel/bounds.c
@@ -10,6 +10,7 @@
#include <linux/mmzone.h>
#include <linux/kbuild.h>
#include <linux/page_cgroup.h>
+#include <linux/log2.h>

void foo(void)
{
@@ -17,5 +18,8 @@ void foo(void)
DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
DEFINE(NR_PCG_FLAGS, __NR_PCG_FLAGS);
+#ifdef CONFIG_SMP
+ DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
+#endif
/* End of constants */
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f2bd291..bafa8d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1208,7 +1208,7 @@ static void task_numa_placement(struct task_struct *p)
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
+void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
{
struct task_struct *p = current;
int priv;
@@ -1224,8 +1224,8 @@ void task_numa_fault(int last_nidpid, int node, int pages, bool migrated)
* First accesses are treated as private, otherwise consider accesses
* to be private if the accessing pid has not changed
*/
- if (!nidpid_pid_unset(last_nidpid))
- priv = ((p->pid & LAST__PID_MASK) == nidpid_to_pid(last_nidpid));
+ if (!cpupid_pid_unset(last_cpupid))
+ priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
else
priv = 1;

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 622bc7e..cf903fc 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1294,7 +1294,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
- int target_nid, last_nidpid = -1;
+ int target_nid, last_cpupid = -1;
bool page_locked;
bool migrated = false;

@@ -1305,7 +1305,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = pmd_page(pmd);
BUG_ON(is_huge_zero_page(page));
page_nid = page_to_nid(page);
- last_nidpid = page_nidpid_last(page);
+ last_cpupid = page_cpupid_last(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1377,7 +1377,7 @@ out:
page_unlock_anon_vma_read(anon_vma);

if (page_nid != -1)
- task_numa_fault(last_nidpid, page_nid, HPAGE_PMD_NR, migrated);
+ task_numa_fault(last_cpupid, page_nid, HPAGE_PMD_NR, migrated);

return 0;
}
@@ -1692,7 +1692,7 @@ static void __split_huge_page_refcount(struct page *page,
page_tail->mapping = page->mapping;

page_tail->index = page->index + i;
- page_nidpid_xchg_last(page_tail, page_nidpid_last(page));
+ page_cpupid_xchg_last(page_tail, page_cpupid_last(page));

BUG_ON(!PageAnon(page_tail));
BUG_ON(!PageUptodate(page_tail));
diff --git a/mm/memory.c b/mm/memory.c
index 948ec32..6b558a5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -69,8 +69,8 @@

#include "internal.h"

-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
-#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_nidpid.
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
+#warning Unfortunate NUMA and NUMA Balancing config, growing page-frame for last_cpupid.
#endif

#ifndef CONFIG_NEED_MULTIPLE_NODES
@@ -3547,7 +3547,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page = NULL;
spinlock_t *ptl;
int page_nid = -1;
- int last_nidpid;
+ int last_cpupid;
int target_nid;
bool migrated = false;

@@ -3578,7 +3578,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
BUG_ON(is_zero_pfn(page_to_pfn(page)));

- last_nidpid = page_nidpid_last(page);
+ last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
@@ -3594,7 +3594,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,

out:
if (page_nid != -1)
- task_numa_fault(last_nidpid, page_nid, 1, migrated);
+ task_numa_fault(last_cpupid, page_nid, 1, migrated);
return 0;
}

@@ -3609,7 +3609,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
- int last_nidpid;
+ int last_cpupid;

spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3654,7 +3654,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(!page))
continue;

- last_nidpid = page_nidpid_last(page);
+ last_cpupid = page_cpupid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(pte, ptl);
@@ -3667,7 +3667,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

if (page_nid != -1)
- task_numa_fault(last_nidpid, page_nid, 1, migrated);
+ task_numa_fault(last_cpupid, page_nid, 1, migrated);

pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index adc93b2..a458b82 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2244,6 +2244,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
struct zone *zone;
int curnid = page_to_nid(page);
unsigned long pgoff;
+ int thiscpu = raw_smp_processor_id();
+ int thisnid = cpu_to_node(thiscpu);
int polnid = -1;
int ret = -1;

@@ -2292,11 +2294,11 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long

/* Migrate the page towards the node whose CPU is referencing it */
if (pol->flags & MPOL_F_MORON) {
- int last_nidpid;
- int this_nidpid;
+ int last_cpupid;
+ int this_cpupid;

- polnid = numa_node_id();
- this_nidpid = nid_pid_to_nidpid(polnid, current->pid);
+ polnid = thisnid;
+ this_cpupid = cpu_pid_to_cpupid(thiscpu, current->pid);

/*
* Multi-stage node selection is used in conjunction
@@ -2319,8 +2321,8 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
* it less likely we act on an unlikely task<->page
* relation.
*/
- last_nidpid = page_nidpid_xchg_last(page, this_nidpid);
- if (!nidpid_pid_unset(last_nidpid) && nidpid_to_nid(last_nidpid) != polnid)
+ last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+ if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid)
goto out;

#ifdef CONFIG_NUMA_BALANCING
@@ -2330,7 +2332,7 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
* This way a short and temporary process migration will
* not cause excessive memory migration.
*/
- if (polnid != current->numa_preferred_nid &&
+ if (thisnid != current->numa_preferred_nid &&
!current->numa_migrate_seq)
goto out;
#endif
diff --git a/mm/migrate.c b/mm/migrate.c
index f56ca20..637aac7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1495,7 +1495,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
__GFP_NOWARN) &
~GFP_IOFS, 0);
if (newpage)
- page_nidpid_xchg_last(newpage, page_nidpid_last(page));
+ page_cpupid_xchg_last(newpage, page_cpupid_last(page));

return newpage;
}
@@ -1672,7 +1672,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
if (!new_page)
goto out_fail;

- page_nidpid_xchg_last(new_page, page_nidpid_last(page));
+ page_cpupid_xchg_last(new_page, page_cpupid_last(page));

isolated = numamigrate_isolate_page(pgdat, page);
if (!isolated) {
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 467de57..68562e9 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -71,26 +71,26 @@ void __init mminit_verify_pageflags_layout(void)
unsigned long or_mask, add_mask;

shift = 8 * sizeof(unsigned long);
- width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_NIDPID_SHIFT;
+ width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH - LAST_CPUPID_SHIFT;
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
- "Section %d Node %d Zone %d Lastnidpid %d Flags %d\n",
+ "Section %d Node %d Zone %d Lastcpupid %d Flags %d\n",
SECTIONS_WIDTH,
NODES_WIDTH,
ZONES_WIDTH,
- LAST_NIDPID_WIDTH,
+ LAST_CPUPID_WIDTH,
NR_PAGEFLAGS);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
- "Section %d Node %d Zone %d Lastnidpid %d\n",
+ "Section %d Node %d Zone %d Lastcpupid %d\n",
SECTIONS_SHIFT,
NODES_SHIFT,
ZONES_SHIFT,
- LAST_NIDPID_SHIFT);
+ LAST_CPUPID_SHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_pgshifts",
- "Section %lu Node %lu Zone %lu Lastnidpid %lu\n",
+ "Section %lu Node %lu Zone %lu Lastcpupid %lu\n",
(unsigned long)SECTIONS_PGSHIFT,
(unsigned long)NODES_PGSHIFT,
(unsigned long)ZONES_PGSHIFT,
- (unsigned long)LAST_NIDPID_PGSHIFT);
+ (unsigned long)LAST_CPUPID_PGSHIFT);
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodezoneid",
"Node/Zone ID: %lu -> %lu\n",
(unsigned long)(ZONEID_PGOFF + ZONEID_SHIFT),
@@ -102,9 +102,9 @@ void __init mminit_verify_pageflags_layout(void)
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
"Node not in page flags");
#endif
-#ifdef LAST_NIDPID_NOT_IN_PAGE_FLAGS
+#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
mminit_dprintk(MMINIT_TRACE, "pageflags_layout_nodeflags",
- "Last nidpid not in page flags");
+ "Last cpupid not in page flags");
#endif

if (SECTIONS_WIDTH) {
diff --git a/mm/mmzone.c b/mm/mmzone.c
index 25bb477..2c70c3a 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -97,20 +97,20 @@ void lruvec_init(struct lruvec *lruvec)
INIT_LIST_HEAD(&lruvec->lists[lru]);
}

-#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_NIDPID_NOT_IN_PAGE_FLAGS)
-int page_nidpid_xchg_last(struct page *page, int nidpid)
+#if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
+int page_cpupid_xchg_last(struct page *page, int cpupid)
{
unsigned long old_flags, flags;
- int last_nidpid;
+ int last_cpupid;

do {
old_flags = flags = page->flags;
- last_nidpid = page_nidpid_last(page);
+ last_cpupid = page_cpupid_last(page);

- flags &= ~(LAST_NIDPID_MASK << LAST_NIDPID_PGSHIFT);
- flags |= (nidpid & LAST_NIDPID_MASK) << LAST_NIDPID_PGSHIFT;
+ flags &= ~(LAST_CPUPID_MASK << LAST_CPUPID_PGSHIFT);
+ flags |= (cpupid & LAST_CPUPID_MASK) << LAST_CPUPID_PGSHIFT;
} while (unlikely(cmpxchg(&page->flags, old_flags, flags) != old_flags));

- return last_nidpid;
+ return last_cpupid;
}
#endif
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 191a89a..8ae8909 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -37,14 +37,14 @@ static inline pgprot_t pgprot_modify(pgprot_t oldprot, pgprot_t newprot)

static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
- int dirty_accountable, int prot_numa, bool *ret_all_same_nidpid)
+ int dirty_accountable, int prot_numa, bool *ret_all_same_cpupid)
{
struct mm_struct *mm = vma->vm_mm;
pte_t *pte, oldpte;
spinlock_t *ptl;
unsigned long pages = 0;
- bool all_same_nidpid = true;
- int last_nid = -1;
+ bool all_same_cpupid = true;
+ int last_cpu = -1;
int last_pid = -1;

pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
@@ -72,17 +72,17 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
* hits on the zero page
*/
if (page && !is_zero_pfn(page_to_pfn(page))) {
- int nidpid = page_nidpid_last(page);
- int this_nid = nidpid_to_nid(nidpid);
- int this_pid = nidpid_to_pid(nidpid);
+ int cpupid = page_cpupid_last(page);
+ int this_cpu = cpupid_to_cpu(cpupid);
+ int this_pid = cpupid_to_pid(cpupid);

- if (last_nid == -1)
- last_nid = this_nid;
+ if (last_cpu == -1)
+ last_cpu = this_cpu;
if (last_pid == -1)
last_pid = this_pid;
- if (last_nid != this_nid ||
+ if (last_cpu != this_cpu ||
last_pid != this_pid) {
- all_same_nidpid = false;
+ all_same_cpupid = false;
}

if (!pte_numa(oldpte)) {
@@ -123,7 +123,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);

- *ret_all_same_nidpid = all_same_nidpid;
+ *ret_all_same_cpupid = all_same_cpupid;
return pages;
}

@@ -150,7 +150,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
pmd_t *pmd;
unsigned long next;
unsigned long pages = 0;
- bool all_same_nidpid;
+ bool all_same_cpupid;

pmd = pmd_offset(pud, addr);
do {
@@ -176,7 +176,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (pmd_none_or_clear_bad(pmd))
continue;
this_pages = change_pte_range(vma, pmd, addr, next, newprot,
- dirty_accountable, prot_numa, &all_same_nidpid);
+ dirty_accountable, prot_numa, &all_same_cpupid);
pages += this_pages;

/*
@@ -185,7 +185,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
* node. This allows a regular PMD to be handled as one fault
* and effectively batches the taking of the PTL
*/
- if (prot_numa && this_pages && all_same_nidpid)
+ if (prot_numa && this_pages && all_same_cpupid)
change_pmd_protnuma(vma->vm_mm, addr, pmd);
} while (pmd++, addr = next, addr != end);

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7bf960e..4b6c4e8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -622,7 +622,7 @@ static inline int free_pages_check(struct page *page)
bad_page(page);
return 1;
}
- page_nidpid_reset_last(page);
+ page_cpupid_reset_last(page);
if (page->flags & PAGE_FLAGS_CHECK_AT_PREP)
page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
return 0;
@@ -3944,7 +3944,7 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone,
mminit_verify_page_links(page, zone, nid, pfn);
init_page_count(page);
page_mapcount_reset(page);
- page_nidpid_reset_last(page);
+ page_cpupid_reset_last(page);
SetPageReserved(page);
/*
* Mark the block movable so that blocks are reserved for
--
1.8.1.4

2013-09-10 09:35:59

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 41/50] sched: numa: Use {cpu, pid} to create task groups for shared faults

From: Peter Zijlstra <[email protected]>

While parallel applications tend to align their data on the cache
boundary, they tend not to align on the page or THP boundary.
Consequently, tasks that partition their data can still "false-share"
pages, presenting a problem for optimal NUMA placement.

This patch uses NUMA hinting faults to chain tasks together into
numa_groups. As well as storing the NID a task was running on when
accessing a page, a truncated representation of the faulting PID is
stored. If subsequent faults are from different PIDs, it is reasonable
to assume that those two tasks share a page and are candidates for
being grouped together. Note that this patch makes no scheduling
decisions based on the grouping information.
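
A toy sketch of the classification rule this builds on (illustrative
constants, not kernel code): a fault is treated as private when the
truncated pid stored in the page matches the faulting task, and as a
shared fault, and hence a grouping candidate, otherwise:

#include <stdbool.h>
#include <stdio.h>

#define EX_PID_MASK	0xff	/* truncated pid, as with LAST__PID_MASK */

static bool fault_is_private(int last_pid, int pid)
{
	if (last_pid == (-1 & EX_PID_MASK))
		return true;		/* first fault: assume private */
	return last_pid == (pid & EX_PID_MASK);
}

int main(void)
{
	/* same truncated pid faulted here last time -> private access */
	printf("%d\n", fault_is_private(4242 & EX_PID_MASK, 4242));
	/* another pid faulted here last time -> shared, try to group */
	printf("%d\n", fault_is_private(100, 4242));
	return 0;
}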

Not-signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/sched.h | 3 +
kernel/sched/core.c | 3 +
kernel/sched/fair.c | 169 +++++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/sched.h | 5 +-
mm/memory.c | 8 +++
5 files changed, 175 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3e8c547..ea057a2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1338,6 +1338,9 @@ struct task_struct {
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;

+ struct list_head numa_entry;
+ struct numa_group *numa_group;
+
/*
* Exponential decaying average of faults on a per-node basis.
* Scheduling placement decisions are made based on the these counts.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 67f2b7b..3808860 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1740,6 +1740,9 @@ static void __sched_fork(struct task_struct *p)
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
p->numa_faults_buffer = NULL;
+
+ INIT_LIST_HEAD(&p->numa_entry);
+ p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bafa8d7..b80eaa2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -888,6 +888,17 @@ static unsigned int task_scan_max(struct task_struct *p)
*/
unsigned int sysctl_numa_balancing_settle_count __read_mostly = 4;

+struct numa_group {
+ atomic_t refcount;
+
+ spinlock_t lock; /* nr_tasks, tasks */
+ int nr_tasks;
+ struct list_head task_list;
+
+ struct rcu_head rcu;
+ atomic_long_t faults[0];
+};
+
static inline int task_faults_idx(int nid, int priv)
{
return 2 * nid + priv;
@@ -1180,7 +1191,10 @@ static void task_numa_placement(struct task_struct *p)
int priv, i;

for (priv = 0; priv < 2; priv++) {
+ long diff;
+
i = task_faults_idx(nid, priv);
+ diff = -p->numa_faults[i];

/* Decay existing window, copy faults since last scan */
p->numa_faults[i] >>= 1;
@@ -1188,6 +1202,11 @@ static void task_numa_placement(struct task_struct *p)
p->numa_faults_buffer[i] = 0;

faults += p->numa_faults[i];
+ diff += p->numa_faults[i];
+ if (p->numa_group) {
+ /* safe because we can only change our own group */
+ atomic_long_add(diff, &p->numa_group->faults[i]);
+ }
}

if (faults > max_faults) {
@@ -1205,6 +1224,130 @@ static void task_numa_placement(struct task_struct *p)
}
}

+static inline int get_numa_group(struct numa_group *grp)
+{
+ return atomic_inc_not_zero(&grp->refcount);
+}
+
+static inline void put_numa_group(struct numa_group *grp)
+{
+ if (atomic_dec_and_test(&grp->refcount))
+ kfree_rcu(grp, rcu);
+}
+
+static void double_lock(spinlock_t *l1, spinlock_t *l2)
+{
+ if (l1 > l2)
+ swap(l1, l2);
+
+ spin_lock(l1);
+ spin_lock_nested(l2, SINGLE_DEPTH_NESTING);
+}
+
+static void task_numa_group(struct task_struct *p, int cpu, int pid)
+{
+ struct numa_group *grp, *my_grp;
+ struct task_struct *tsk;
+ bool join = false;
+ int i;
+
+ if (unlikely(!p->numa_group)) {
+ unsigned int size = sizeof(struct numa_group) +
+ 2*nr_node_ids*sizeof(atomic_long_t);
+
+ grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
+ if (!grp)
+ return;
+
+ atomic_set(&grp->refcount, 1);
+ spin_lock_init(&grp->lock);
+ INIT_LIST_HEAD(&grp->task_list);
+
+ for (i = 0; i < 2*nr_node_ids; i++)
+ atomic_long_set(&grp->faults[i], p->numa_faults[i]);
+
+ list_add(&p->numa_entry, &grp->task_list);
+ grp->nr_tasks++;
+ rcu_assign_pointer(p->numa_group, grp);
+ }
+
+ rcu_read_lock();
+ tsk = ACCESS_ONCE(cpu_rq(cpu)->curr);
+
+ if ((tsk->pid & LAST__PID_MASK) != pid)
+ goto unlock;
+
+ grp = rcu_dereference(tsk->numa_group);
+ if (!grp)
+ goto unlock;
+
+ my_grp = p->numa_group;
+ if (grp == my_grp)
+ goto unlock;
+
+ /*
+ * Only join the other group if its bigger; if we're the bigger group,
+ * the other task will join us.
+ */
+ if (my_grp->nr_tasks > grp->nr_tasks)
+ goto unlock;
+
+ /*
+ * Tie-break on the grp address.
+ */
+ if (my_grp->nr_tasks == grp->nr_tasks && my_grp > grp)
+ goto unlock;
+
+ if (!get_numa_group(grp))
+ goto unlock;
+
+ join = true;
+
+unlock:
+ rcu_read_unlock();
+
+ if (!join)
+ return;
+
+ for (i = 0; i < 2*nr_node_ids; i++) {
+ atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
+ atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+ }
+
+ double_lock(&my_grp->lock, &grp->lock);
+
+ list_move(&p->numa_entry, &grp->task_list);
+ my_grp->nr_tasks--;
+ grp->nr_tasks++;
+
+ spin_unlock(&my_grp->lock);
+ spin_unlock(&grp->lock);
+
+ rcu_assign_pointer(p->numa_group, grp);
+
+ put_numa_group(my_grp);
+}
+
+void task_numa_free(struct task_struct *p)
+{
+ struct numa_group *grp = p->numa_group;
+ int i;
+
+ kfree(p->numa_faults);
+
+ if (grp) {
+ for (i = 0; i < 2*nr_node_ids; i++)
+ atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
+
+ spin_lock(&grp->lock);
+ list_del(&p->numa_entry);
+ grp->nr_tasks--;
+ spin_unlock(&grp->lock);
+ rcu_assign_pointer(p->numa_group, NULL);
+ put_numa_group(grp);
+ }
+}
+
/*
* Got a PROT_NONE fault for a page on @node.
*/
@@ -1220,15 +1363,6 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
if (!p->mm)
return;

- /*
- * First accesses are treated as private, otherwise consider accesses
- * to be private if the accessing pid has not changed
- */
- if (!cpupid_pid_unset(last_cpupid))
- priv = ((p->pid & LAST__PID_MASK) == cpupid_to_pid(last_cpupid));
- else
- priv = 1;
-
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;
@@ -1243,6 +1377,23 @@ void task_numa_fault(int last_cpupid, int node, int pages, bool migrated)
}

/*
+ * First accesses are treated as private, otherwise consider accesses
+ * to be private if the accessing pid has not changed
+ */
+ if (unlikely(last_cpupid == (-1 & LAST_CPUPID_MASK))) {
+ priv = 1;
+ } else {
+ int cpu, pid;
+
+ cpu = cpupid_to_cpu(last_cpupid);
+ pid = cpupid_to_pid(last_cpupid);
+
+ priv = (pid == (p->pid & LAST__PID_MASK));
+ if (!priv)
+ task_numa_group(p, cpu, pid);
+ }
+
+ /*
* If pages are properly placed (did not migrate) then scan slower.
* This is reset periodically in case of phase changes
*/
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 99b1ecd..4c6ec25 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -557,10 +557,7 @@ static inline u64 rq_clock_task(struct rq *rq)
#ifdef CONFIG_NUMA_BALANCING
extern int migrate_task_to(struct task_struct *p, int cpu);
extern int migrate_swap(struct task_struct *, struct task_struct *);
-static inline void task_numa_free(struct task_struct *p)
-{
- kfree(p->numa_faults);
-}
+extern void task_numa_free(struct task_struct *p);
#else /* CONFIG_NUMA_BALANCING */
static inline void task_numa_free(struct task_struct *p)
{
diff --git a/mm/memory.c b/mm/memory.c
index 6b558a5..f779403 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2730,6 +2730,14 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
get_page(dirty_page);

reuse:
+ /*
+ * Clear the pages cpupid information as the existing
+ * information potentially belongs to a now completely
+ * unrelated process.
+ */
+ if (old_page)
+ page_cpupid_xchg_last(old_page, (1 << LAST_CPUPID_SHIFT) - 1);
+
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = pte_mkyoung(orig_pte);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
--
1.8.1.4

2013-09-10 09:36:44

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 39/50] sched: numa: Favor placing a task on the preferred node

A task's preferred node is selected based on the number of faults
recorded for a node, but task_numa_migrate() conducts a global
search regardless of the preferred nid. This patch checks if the
preferred nid has capacity and, if so, searches for a CPU within that
node. This avoids a global search when the preferred node is not
overloaded.
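
The capacity test this relies on can be sketched numerically; it mirrors
the update_numa_stats() helper used by this series, with
SCHED_POWER_SCALE assumed to be its usual 1024 and all other numbers
made up:

#include <stdio.h>

#define EX_POWER_SCALE	1024

int main(void)
{
	/* a node with 4 CPUs at full power and 3 runnable tasks */
	unsigned long power = 4 * EX_POWER_SCALE;
	unsigned long nr_running = 3;
	/* DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE) */
	unsigned long capacity = (power + EX_POWER_SCALE / 2) / EX_POWER_SCALE;
	int has_capacity = nr_running < capacity;

	printf("capacity=%lu has_capacity=%d\n", capacity, has_capacity);
	return 0;
}

When has_capacity is true the search is confined to the preferred node;
otherwise the global fallback below is used.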

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 54 ++++++++++++++++++++++++++++++++++-------------------
1 file changed, 35 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 12b42a6..f2bd291 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1052,6 +1052,20 @@ unlock:
rcu_read_unlock();
}

+static void task_numa_find_cpu(struct task_numa_env *env, long imp)
+{
+ int cpu;
+
+ for_each_cpu(cpu, cpumask_of_node(env->dst_nid)) {
+ /* Skip this CPU if the source task cannot migrate */
+ if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(env->p)))
+ continue;
+
+ env->dst_cpu = cpu;
+ task_numa_compare(env, imp);
+ }
+}
+
static int task_numa_migrate(struct task_struct *p)
{
const struct cpumask *cpumask = cpumask_of_node(p->numa_preferred_nid);
@@ -1069,7 +1083,8 @@ static int task_numa_migrate(struct task_struct *p)
};
struct sched_domain *sd;
unsigned long faults;
- int nid, cpu, ret;
+ int nid, ret;
+ long imp;

/*
* Find the lowest common scheduling domain covering the nodes of both
@@ -1086,28 +1101,29 @@ static int task_numa_migrate(struct task_struct *p)

faults = task_faults(p, env.src_nid);
update_numa_stats(&env.src_stats, env.src_nid);
+ env.dst_nid = p->numa_preferred_nid;
+ imp = task_faults(env.p, env.dst_nid) - faults;
+ update_numa_stats(&env.dst_stats, env.dst_nid);

- /* Find an alternative node with relatively better statistics */
- for_each_online_node(nid) {
- long imp;
-
- if (nid == env.src_nid)
- continue;
-
- /* Only consider nodes that recorded more faults */
- imp = task_faults(p, nid) - faults;
- if (imp < 0)
- continue;
+ /*
+ * If the preferred nid has capacity then use it. Otherwise find an
+ * alternative node with relatively better statistics.
+ */
+ if (env.dst_stats.has_capacity) {
+ task_numa_find_cpu(&env, imp);
+ } else {
+ for_each_online_node(nid) {
+ if (nid == env.src_nid || nid == p->numa_preferred_nid)
+ continue;

- env.dst_nid = nid;
- update_numa_stats(&env.dst_stats, env.dst_nid);
- for_each_cpu(cpu, cpumask_of_node(nid)) {
- /* Skip this CPU if the source task cannot migrate */
- if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+ /* Only consider nodes that recorded more faults */
+ imp = task_faults(env.p, nid) - faults;
+ if (imp < 0)
continue;

- env.dst_cpu = cpu;
- task_numa_compare(&env, imp);
+ env.dst_nid = nid;
+ update_numa_stats(&env.dst_stats, env.dst_nid);
+ task_numa_find_cpu(&env, imp);
}
}

--
1.8.1.4

2013-09-10 09:33:13

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 31/50] sched: Avoid overloading CPUs on a preferred NUMA node

This patch replaces find_idlest_cpu_node with task_numa_find_cpu.
find_idlest_cpu_node has two critical limitations. It does not take the
scheduling class into account when calculating the load and it is unsuitable
for using when comparing loads between NUMA nodes.

task_numa_find_cpu uses similar load calculations to wake_affine() when
selecting the least loaded CPU within a scheduling domain common to the
source and destination nodes. It avoids causing CPU load imbalances in
the machine by refusing to migrate if the relative load on the target
CPU is higher than the source CPU.
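
A toy numeric sketch of that balance test (made-up numbers, not kernel
code): the effective load scales the CPU load by its power and an
imbalance factor, and a migration is only considered balanced when the
destination value does not exceed the source value:

#include <stdio.h>

/* eff_load ~= imbalance_factor * cpu_power * (cpu_load + task_delta) */
static long long eff_load(long imb, long power, long load, long delta)
{
	return (long long)imb * power * (load + delta);
}

int main(void)
{
	long weight = 1024;	/* the migrating task's load weight */
	long long src, dst;

	/* source: halved imbalance factor as below, task would leave */
	src = eff_load(100 + (125 - 100) / 2, 1024, 2048, -weight);
	/* destination: plain factor of 100, task would arrive */
	dst = eff_load(100, 1024, 1024, weight);

	printf("src=%lld dst=%lld balanced=%d\n", src, dst, dst <= src);
	return 0;
}

Here the destination would end up busier than the source, so this
candidate CPU is skipped.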

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 131 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 102 insertions(+), 29 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2f1cf5..5f0388e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,28 +901,114 @@ static inline unsigned long task_faults(struct task_struct *p, int nid)
}

static unsigned long weighted_cpuload(const int cpu);
+static unsigned long source_load(int cpu, int type);
+static unsigned long target_load(int cpu, int type);
+static unsigned long power_of(int cpu);
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg);

+struct numa_stats {
+ unsigned long load;
+ s64 eff_load;
+ unsigned long faults;
+};

-static int
-find_idlest_cpu_node(int this_cpu, int nid)
-{
- unsigned long load, min_load = ULONG_MAX;
- int i, idlest_cpu = this_cpu;
+struct task_numa_env {
+ struct task_struct *p;

- BUG_ON(cpu_to_node(this_cpu) == nid);
+ int src_cpu, src_nid;
+ int dst_cpu, dst_nid;

- rcu_read_lock();
- for_each_cpu(i, cpumask_of_node(nid)) {
- load = weighted_cpuload(i);
+ struct numa_stats src_stats, dst_stats;

- if (load < min_load) {
- min_load = load;
- idlest_cpu = i;
+ unsigned long best_load;
+ int best_cpu;
+};
+
+static int task_numa_migrate(struct task_struct *p)
+{
+ int node_cpu = cpumask_first(cpumask_of_node(p->numa_preferred_nid));
+ struct task_numa_env env = {
+ .p = p,
+ .src_cpu = task_cpu(p),
+ .src_nid = cpu_to_node(task_cpu(p)),
+ .dst_cpu = node_cpu,
+ .dst_nid = p->numa_preferred_nid,
+ .best_load = ULONG_MAX,
+ .best_cpu = task_cpu(p),
+ };
+ struct sched_domain *sd;
+ int cpu;
+ struct task_group *tg = task_group(p);
+ unsigned long weight;
+ bool balanced;
+ int imbalance_pct, idx = -1;
+
+ /*
+ * Find the lowest common scheduling domain covering the nodes of both
+ * the CPU the task is currently running on and the target NUMA node.
+ */
+ rcu_read_lock();
+ for_each_domain(env.src_cpu, sd) {
+ if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
+ /*
+ * busy_idx is used for the load decision as it is the
+ * same index used by the regular load balancer for an
+ * active cpu.
+ */
+ idx = sd->busy_idx;
+ imbalance_pct = sd->imbalance_pct;
+ break;
}
}
rcu_read_unlock();

- return idlest_cpu;
+ if (WARN_ON_ONCE(idx == -1))
+ return 0;
+
+ /*
+ * XXX the below is mostly nicked from wake_affine(); we should
+ * see about sharing a bit if at all possible; also it might want
+ * some per entity weight love.
+ */
+ weight = p->se.load.weight;
+ env.src_stats.load = source_load(env.src_cpu, idx);
+ env.src_stats.eff_load = 100 + (imbalance_pct - 100) / 2;
+ env.src_stats.eff_load *= power_of(env.src_cpu);
+ env.src_stats.eff_load *= env.src_stats.load + effective_load(tg, env.src_cpu, -weight, -weight);
+
+ for_each_cpu(cpu, cpumask_of_node(env.dst_nid)) {
+ env.dst_cpu = cpu;
+ env.dst_stats.load = target_load(cpu, idx);
+
+ /* If the CPU is idle, use it */
+ if (!env.dst_stats.load) {
+ env.best_cpu = cpu;
+ goto migrate;
+ }
+
+ /* Otherwise check the target CPU load */
+ env.dst_stats.eff_load = 100;
+ env.dst_stats.eff_load *= power_of(cpu);
+ env.dst_stats.eff_load *= env.dst_stats.load + effective_load(tg, cpu, weight, weight);
+
+ /*
+ * Destination is considered balanced if the destination CPU is
+ * less loaded than the source CPU. Unfortunately there is a
+ * risk that a task running on a lightly loaded CPU will not
+ * migrate to its preferred node due to load imbalances.
+ */
+ balanced = (env.dst_stats.eff_load <= env.src_stats.eff_load);
+ if (!balanced)
+ continue;
+
+ if (env.dst_stats.eff_load < env.best_load) {
+ env.best_load = env.dst_stats.eff_load;
+ env.best_cpu = cpu;
+ }
+ }
+
+migrate:
+ return migrate_task_to(p, env.best_cpu);
}

static void task_numa_placement(struct task_struct *p)
@@ -966,22 +1052,10 @@ static void task_numa_placement(struct task_struct *p)
* the working set placement.
*/
if (max_faults && max_nid != p->numa_preferred_nid) {
- int preferred_cpu;
-
- /*
- * If the task is not on the preferred node then find the most
- * idle CPU to migrate to.
- */
- preferred_cpu = task_cpu(p);
- if (cpu_to_node(preferred_cpu) != max_nid) {
- preferred_cpu = find_idlest_cpu_node(preferred_cpu,
- max_nid);
- }
-
/* Update the preferred nid and migrate task if possible */
p->numa_preferred_nid = max_nid;
p->numa_migrate_seq = 1;
- migrate_task_to(p, preferred_cpu);
+ task_numa_migrate(p);
}
}

@@ -3274,7 +3348,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
{
struct sched_entity *se = tg->se[cpu];

- if (!tg->parent) /* the trivial, non-cgroup case */
+ if (!tg->parent || !wl) /* the trivial, non-cgroup case */
return wl;

for_each_sched_entity(se) {
@@ -3327,8 +3401,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
}
#else

-static inline unsigned long effective_load(struct task_group *tg, int cpu,
- unsigned long wl, unsigned long wg)
+static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
{
return wl;
}
--
1.8.1.4

2013-09-10 09:36:58

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 38/50] sched: numa: Use a system-wide search to find swap/migration candidates

This patch implements a system-wide search for swap/migration candidates
based on total NUMA hinting faults. It has a balance limit; however, it
does not properly consider total node balance.

In the old scheme a task selected a preferred node based on the highest
number of private faults recorded on the node. In this scheme, the preferred
node is based on the total number of faults. If the preferred node for a
task changes then task_numa_migrate will search the whole system looking
for tasks to swap with that would improve both the overall compute
balance and minimise the expected number of remote NUMA hinting faults.
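
The improvement metric can be sketched with plain numbers (illustrative
only): for a candidate swap, the fault differentials of both tasks are
summed, and only a positive total that beats the best seen so far is
worth acting on:

#include <stdio.h>

/* faults each task recorded on the source and destination nodes */
static long swap_improvement(long p_src, long p_dst,
			     long cur_src, long cur_dst)
{
	/* p would gain by moving src->dst, cur by moving dst->src */
	return (p_dst - p_src) + (cur_src - cur_dst);
}

int main(void)
{
	/* p faults mostly on dst, cur faults mostly on src: swap helps */
	printf("imp=%ld\n", swap_improvement(10, 200, 150, 20));
	/* both tasks already sit where they fault most: swap hurts */
	printf("imp=%ld\n", swap_improvement(200, 10, 20, 150));
	return 0;
}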

Note from Mel: There appears to be no guarantee that the node the source
task is placed on by task_numa_migrate() has any relationship
to the newly selected task->numa_preferred_nid. It is not clear
if this is deliberate but it looks accidental.

[[email protected]: Do not swap with tasks that cannot run on source cpu]
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 244 ++++++++++++++++++++++++++++++++++++++--------------
1 file changed, 178 insertions(+), 66 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cf16c1a..12b42a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -816,6 +816,8 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
* Scheduling class queueing methods:
*/

+static unsigned long task_h_load(struct task_struct *p);
+
#ifdef CONFIG_NUMA_BALANCING
/*
* Approximate time to scan a full NUMA task in ms. The task scan period is
@@ -906,12 +908,40 @@ static unsigned long target_load(int cpu, int type);
static unsigned long power_of(int cpu);
static long effective_load(struct task_group *tg, int cpu, long wl, long wg);

+/* Cached statistics for all CPUs within a node */
struct numa_stats {
+ unsigned long nr_running;
unsigned long load;
- s64 eff_load;
- unsigned long faults;
+
+ /* Total compute capacity of CPUs on a node */
+ unsigned long power;
+
+ /* Approximate capacity in terms of runnable tasks on a node */
+ unsigned long capacity;
+ int has_capacity;
};

+/*
+ * XXX borrowed from update_sg_lb_stats
+ */
+static void update_numa_stats(struct numa_stats *ns, int nid)
+{
+ int cpu;
+
+ memset(ns, 0, sizeof(*ns));
+ for_each_cpu(cpu, cpumask_of_node(nid)) {
+ struct rq *rq = cpu_rq(cpu);
+
+ ns->nr_running += rq->nr_running;
+ ns->load += weighted_cpuload(cpu);
+ ns->power += power_of(cpu);
+ }
+
+ ns->load = (ns->load * SCHED_POWER_SCALE) / ns->power;
+ ns->capacity = DIV_ROUND_CLOSEST(ns->power, SCHED_POWER_SCALE);
+ ns->has_capacity = (ns->nr_running < ns->capacity);
+}
+
struct task_numa_env {
struct task_struct *p;

@@ -920,28 +950,126 @@ struct task_numa_env {

struct numa_stats src_stats, dst_stats;

- unsigned long best_load;
+ int imbalance_pct, idx;
+
+ struct task_struct *best_task;
+ long best_imp;
int best_cpu;
};

+static void task_numa_assign(struct task_numa_env *env,
+ struct task_struct *p, long imp)
+{
+ if (env->best_task)
+ put_task_struct(env->best_task);
+ if (p)
+ get_task_struct(p);
+
+ env->best_task = p;
+ env->best_imp = imp;
+ env->best_cpu = env->dst_cpu;
+}
+
+/*
+ * This checks if the overall compute and NUMA accesses of the system would
+ * be improved if the source task were migrated to the target dst_cpu,
+ * taking into account that it might be best if the task running on the
+ * dst_cpu were exchanged with the source task.
+ */
+static void task_numa_compare(struct task_numa_env *env, long imp)
+{
+ struct rq *src_rq = cpu_rq(env->src_cpu);
+ struct rq *dst_rq = cpu_rq(env->dst_cpu);
+ struct task_struct *cur;
+ long dst_load, src_load;
+ long load;
+
+ rcu_read_lock();
+ cur = ACCESS_ONCE(dst_rq->curr);
+ if (cur->pid == 0) /* idle */
+ cur = NULL;
+
+ /*
+ * "imp" is the fault differential for the source task between the
+ * source and destination node. Calculate the total differential for
+ * the source task and potential destination task. The more negative
+ * the value is, the more remote accesses would be expected to
+ * be incurred if the tasks were swapped.
+ */
+ if (cur) {
+ /* Skip this swap candidate if cannot move to the source cpu */
+ if (!cpumask_test_cpu(env->src_cpu, tsk_cpus_allowed(cur)))
+ goto unlock;
+
+ imp += task_faults(cur, env->src_nid) -
+ task_faults(cur, env->dst_nid);
+ }
+
+ if (imp < env->best_imp)
+ goto unlock;
+
+ if (!cur) {
+ /* Is there capacity at our destination? */
+ if (env->src_stats.has_capacity &&
+ !env->dst_stats.has_capacity)
+ goto unlock;
+
+ goto balance;
+ }
+
+ /* Balance doesn't matter much if we're running a task per cpu */
+ if (src_rq->nr_running == 1 && dst_rq->nr_running == 1)
+ goto assign;
+
+ /*
+ * In the overloaded case, try and keep the load balanced.
+ */
+balance:
+ dst_load = env->dst_stats.load;
+ src_load = env->src_stats.load;
+
+ /* XXX missing power terms */
+ load = task_h_load(env->p);
+ dst_load += load;
+ src_load -= load;
+
+ if (cur) {
+ load = task_h_load(cur);
+ dst_load -= load;
+ src_load += load;
+ }
+
+ /* make src_load the smaller */
+ if (dst_load < src_load)
+ swap(dst_load, src_load);
+
+ if (src_load * env->imbalance_pct < dst_load * 100)
+ goto unlock;
+
+assign:
+ task_numa_assign(env, cur, imp);
+unlock:
+ rcu_read_unlock();
+}
+
static int task_numa_migrate(struct task_struct *p)
{
- int node_cpu = cpumask_first(cpumask_of_node(p->numa_preferred_nid));
+ const struct cpumask *cpumask = cpumask_of_node(p->numa_preferred_nid);
struct task_numa_env env = {
.p = p,
+
.src_cpu = task_cpu(p),
.src_nid = cpu_to_node(task_cpu(p)),
- .dst_cpu = node_cpu,
- .dst_nid = p->numa_preferred_nid,
- .best_load = ULONG_MAX,
- .best_cpu = task_cpu(p),
+
+ .imbalance_pct = 112,
+
+ .best_task = NULL,
+ .best_imp = 0,
+ .best_cpu = -1
};
- struct sched_domain *sd;
- int cpu;
- struct task_group *tg = task_group(p);
- unsigned long weight;
- bool balanced;
- int imbalance_pct, idx = -1;
+ struct sched_domain *sd;
+ unsigned long faults;
+ int nid, cpu, ret;

/*
* Find the lowest common scheduling domain covering the nodes of both
@@ -949,66 +1077,52 @@ static int task_numa_migrate(struct task_struct *p)
*/
rcu_read_lock();
for_each_domain(env.src_cpu, sd) {
- if (cpumask_test_cpu(node_cpu, sched_domain_span(sd))) {
- /*
- * busy_idx is used for the load decision as it is the
- * same index used by the regular load balancer for an
- * active cpu.
- */
- idx = sd->busy_idx;
- imbalance_pct = sd->imbalance_pct;
+ if (cpumask_intersects(cpumask, sched_domain_span(sd))) {
+ env.imbalance_pct = 100 + (sd->imbalance_pct - 100) / 2;
break;
}
}
rcu_read_unlock();

- if (WARN_ON_ONCE(idx == -1))
- return 0;
+ faults = task_faults(p, env.src_nid);
+ update_numa_stats(&env.src_stats, env.src_nid);

- /*
- * XXX the below is mostly nicked from wake_affine(); we should
- * see about sharing a bit if at all possible; also it might want
- * some per entity weight love.
- */
- weight = p->se.load.weight;
- env.src_stats.load = source_load(env.src_cpu, idx);
- env.src_stats.eff_load = 100 + (imbalance_pct - 100) / 2;
- env.src_stats.eff_load *= power_of(env.src_cpu);
- env.src_stats.eff_load *= env.src_stats.load + effective_load(tg, env.src_cpu, -weight, -weight);
-
- for_each_cpu(cpu, cpumask_of_node(env.dst_nid)) {
- env.dst_cpu = cpu;
- env.dst_stats.load = target_load(cpu, idx);
-
- /* If the CPU is idle, use it */
- if (!env.dst_stats.load) {
- env.best_cpu = cpu;
- goto migrate;
- }
+ /* Find an alternative node with relatively better statistics */
+ for_each_online_node(nid) {
+ long imp;

- /* Otherwise check the target CPU load */
- env.dst_stats.eff_load = 100;
- env.dst_stats.eff_load *= power_of(cpu);
- env.dst_stats.eff_load *= env.dst_stats.load + effective_load(tg, cpu, weight, weight);
+ if (nid == env.src_nid)
+ continue;

- /*
- * Destination is considered balanced if the destination CPU is
- * less loaded than the source CPU. Unfortunately there is a
- * risk that a task running on a lightly loaded CPU will not
- * migrate to its preferred node due to load imbalances.
- */
- balanced = (env.dst_stats.eff_load <= env.src_stats.eff_load);
- if (!balanced)
+ /* Only consider nodes that recorded more faults */
+ imp = task_faults(p, nid) - faults;
+ if (imp < 0)
continue;

- if (env.dst_stats.eff_load < env.best_load) {
- env.best_load = env.dst_stats.eff_load;
- env.best_cpu = cpu;
+ env.dst_nid = nid;
+ update_numa_stats(&env.dst_stats, env.dst_nid);
+ for_each_cpu(cpu, cpumask_of_node(nid)) {
+ /* Skip this CPU if the source task cannot migrate */
+ if (!cpumask_test_cpu(cpu, tsk_cpus_allowed(p)))
+ continue;
+
+ env.dst_cpu = cpu;
+ task_numa_compare(&env, imp);
}
}

-migrate:
- return migrate_task_to(p, env.best_cpu);
+ /* No better CPU than the current one was found. */
+ if (env.best_cpu == -1)
+ return -EAGAIN;
+
+ if (env.best_task == NULL) {
+ int ret = migrate_task_to(p, env.best_cpu);
+ return ret;
+ }
+
+ ret = migrate_swap(p, env.best_task);
+ put_task_struct(env.best_task);
+ return ret;
}

/* Attempt to migrate a task to a CPU on the preferred node. */
@@ -1046,7 +1160,7 @@ static void task_numa_placement(struct task_struct *p)

/* Find the node with the highest number of faults */
for_each_online_node(nid) {
- unsigned long faults;
+ unsigned long faults = 0;
int priv, i;

for (priv = 0; priv < 2; priv++) {
@@ -1056,10 +1170,10 @@ static void task_numa_placement(struct task_struct *p)
p->numa_faults[i] >>= 1;
p->numa_faults[i] += p->numa_faults_buffer[i];
p->numa_faults_buffer[i] = 0;
+
+ faults += p->numa_faults[i];
}

- /* Find maximum private faults */
- faults = p->numa_faults[task_faults_idx(nid, 1)];
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
@@ -4405,8 +4519,6 @@ static int move_one_task(struct lb_env *env)
return 0;
}

-static unsigned long task_h_load(struct task_struct *p);
-
static const unsigned int sched_nr_migrate_break = 32;

/*
--
1.8.1.4

2013-09-10 09:37:24

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 37/50] sched: Introduce migrate_swap()

From: Peter Zijlstra <[email protected]>

Use the new stop_two_cpus() to implement migrate_swap(), a function that
flips two tasks between their respective cpus.

I'm fairly sure there's a less crude way than employing the stop_two_cpus()
method, but everything I tried either got horribly fragile and/or complex. So
keep it simple for now.

The notable detail is how we 'migrate' tasks that aren't runnable
anymore. We'll make it appear like we migrated them before they went to
sleep. The sole difference is the previous cpu in the wakeup path, so we
override this.

TODO: I'm fairly sure we can get rid of the wake_cpu != -1 test by keeping
wake_cpu to the actual task cpu; just couldn't be bothered to think through
all the cases.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 103 ++++++++++++++++++++++++++++++++++++++++++++---
kernel/sched/fair.c | 3 +-
kernel/sched/idle_task.c | 2 +-
kernel/sched/rt.c | 5 +--
kernel/sched/sched.h | 3 +-
kernel/sched/stop_task.c | 2 +-
7 files changed, 105 insertions(+), 14 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 3418b0b..3e8c547 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1035,6 +1035,7 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+ int wake_cpu;
#endif
int on_rq;

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 374da2b..67f2b7b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1032,6 +1032,90 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
__set_task_cpu(p, new_cpu);
}

+static void __migrate_swap_task(struct task_struct *p, int cpu)
+{
+ if (p->on_rq) {
+ struct rq *src_rq, *dst_rq;
+
+ src_rq = task_rq(p);
+ dst_rq = cpu_rq(cpu);
+
+ deactivate_task(src_rq, p, 0);
+ set_task_cpu(p, cpu);
+ activate_task(dst_rq, p, 0);
+ check_preempt_curr(dst_rq, p, 0);
+ } else {
+ /*
+ * Task isn't running anymore; make it appear like we migrated
+ * it before it went to sleep. This means on wakeup we make the
+ * previous cpu our target instead of where it really is.
+ */
+ p->wake_cpu = cpu;
+ }
+}
+
+struct migration_swap_arg {
+ struct task_struct *src_task, *dst_task;
+ int src_cpu, dst_cpu;
+};
+
+static int migrate_swap_stop(void *data)
+{
+ struct migration_swap_arg *arg = data;
+ struct rq *src_rq, *dst_rq;
+ int ret = -EAGAIN;
+
+ src_rq = cpu_rq(arg->src_cpu);
+ dst_rq = cpu_rq(arg->dst_cpu);
+
+ double_rq_lock(src_rq, dst_rq);
+ if (task_cpu(arg->dst_task) != arg->dst_cpu)
+ goto unlock;
+
+ if (task_cpu(arg->src_task) != arg->src_cpu)
+ goto unlock;
+
+ if (!cpumask_test_cpu(arg->dst_cpu, tsk_cpus_allowed(arg->src_task)))
+ goto unlock;
+
+ if (!cpumask_test_cpu(arg->src_cpu, tsk_cpus_allowed(arg->dst_task)))
+ goto unlock;
+
+ __migrate_swap_task(arg->src_task, arg->dst_cpu);
+ __migrate_swap_task(arg->dst_task, arg->src_cpu);
+
+ ret = 0;
+
+unlock:
+ double_rq_unlock(src_rq, dst_rq);
+
+ return ret;
+}
+
+/*
+ * XXX worry about hotplug
+ */
+int migrate_swap(struct task_struct *cur, struct task_struct *p)
+{
+ struct migration_swap_arg arg = {
+ .src_task = cur,
+ .src_cpu = task_cpu(cur),
+ .dst_task = p,
+ .dst_cpu = task_cpu(p),
+ };
+
+ if (arg.src_cpu == arg.dst_cpu)
+ return -EINVAL;
+
+ if (!cpumask_test_cpu(arg.dst_cpu, tsk_cpus_allowed(arg.src_task)))
+ return -EINVAL;
+
+ if (!cpumask_test_cpu(arg.src_cpu, tsk_cpus_allowed(arg.dst_task)))
+ return -EINVAL;
+
+ return stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
+}
+
struct migration_arg {
struct task_struct *task;
int dest_cpu;
@@ -1251,9 +1335,9 @@ out:
* The caller (fork, wakeup) owns p->pi_lock, ->cpus_allowed is stable.
*/
static inline
-int select_task_rq(struct task_struct *p, int sd_flags, int wake_flags)
+int select_task_rq(struct task_struct *p, int cpu, int sd_flags, int wake_flags)
{
- int cpu = p->sched_class->select_task_rq(p, sd_flags, wake_flags);
+ cpu = p->sched_class->select_task_rq(p, cpu, sd_flags, wake_flags);

/*
* In order not to call set_task_cpu() on a blocking task we need
@@ -1528,7 +1612,12 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
if (p->sched_class->task_waking)
p->sched_class->task_waking(p);

- cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
+ if (p->wake_cpu != -1) { /* XXX make this condition go away */
+ cpu = p->wake_cpu;
+ p->wake_cpu = -1;
+ }
+
+ cpu = select_task_rq(p, cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
set_task_cpu(p, cpu);
@@ -1614,6 +1703,10 @@ static void __sched_fork(struct task_struct *p)
{
p->on_rq = 0;

+#ifdef CONFIG_SMP
+ p->wake_cpu = -1;
+#endif
+
p->se.on_rq = 0;
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
@@ -1765,7 +1858,7 @@ void wake_up_new_task(struct task_struct *p)
* - cpus_allowed can change in the fork path
* - any previously selected cpu might disappear through hotplug
*/
- set_task_cpu(p, select_task_rq(p, SD_BALANCE_FORK, 0));
+ set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
#endif

/* Initialize new task's runnable average */
@@ -2093,7 +2186,7 @@ void sched_exec(void)
int dest_cpu;

raw_spin_lock_irqsave(&p->pi_lock, flags);
- dest_cpu = p->sched_class->select_task_rq(p, SD_BALANCE_EXEC, 0);
+ dest_cpu = p->sched_class->select_task_rq(p, task_cpu(p), SD_BALANCE_EXEC, 0);
if (dest_cpu == smp_processor_id())
goto unlock;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5d244d0..cf16c1a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3655,11 +3655,10 @@ done:
* preempt must be disabled.
*/
static int
-select_task_rq_fair(struct task_struct *p, int sd_flag, int wake_flags)
+select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
{
struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
int cpu = smp_processor_id();
- int prev_cpu = task_cpu(p);
int new_cpu = cpu;
int want_affine = 0;
int sync = wake_flags & WF_SYNC;
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d8da010..516c3d9 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -9,7 +9,7 @@

#ifdef CONFIG_SMP
static int
-select_task_rq_idle(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_idle(struct task_struct *p, int cpu, int sd_flag, int flags)
{
return task_cpu(p); /* IDLE tasks as never migrated */
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 01970c8..d81866d 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1169,13 +1169,10 @@ static void yield_task_rt(struct rq *rq)
static int find_lowest_rq(struct task_struct *task);

static int
-select_task_rq_rt(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_rt(struct task_struct *p, int cpu, int sd_flag, int flags)
{
struct task_struct *curr;
struct rq *rq;
- int cpu;
-
- cpu = task_cpu(p);

if (p->nr_cpus_allowed == 1)
goto out;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 778f875..99b1ecd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -556,6 +556,7 @@ static inline u64 rq_clock_task(struct rq *rq)

#ifdef CONFIG_NUMA_BALANCING
extern int migrate_task_to(struct task_struct *p, int cpu);
+extern int migrate_swap(struct task_struct *, struct task_struct *);
static inline void task_numa_free(struct task_struct *p)
{
kfree(p->numa_faults);
@@ -988,7 +989,7 @@ struct sched_class {
void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
- int (*select_task_rq)(struct task_struct *p, int sd_flag, int flags);
+ int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
void (*migrate_task_rq)(struct task_struct *p, int next_cpu);

void (*pre_schedule) (struct rq *this_rq, struct task_struct *task);
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index e08fbee..47197de 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -11,7 +11,7 @@

#ifdef CONFIG_SMP
static int
-select_task_rq_stop(struct task_struct *p, int sd_flag, int flags)
+select_task_rq_stop(struct task_struct *p, int cpu, int sd_flag, int flags)
{
return task_cpu(p); /* stop tasks as never migrate */
}
--
1.8.1.4

2013-09-10 09:37:48

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 36/50] stop_machine: Introduce stop_two_cpus()

From: Peter Zijlstra <[email protected]>

Introduce stop_two_cpus() in order to allow controlled swapping of two
tasks. It repurposes the stop_machine() state machine but only stops
the two cpus, which we can do with on-stack structures and thereby avoid
machine wide synchronization issues.

The ordering of CPUs is important to avoid deadlocks. If unordered then
two cpus calling stop_two_cpus on each other simultaneously would attempt
to queue in the opposite order on each CPU causing an AB-BA style deadlock.
By always having the lowest number CPU doing the queueing of works, we can
guarantee that works are always queued in the same order, and deadlocks
are avoided.
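
The following standalone userspace sketch illustrates the same ordering rule, with two pthread mutexes standing in for the per-CPU stopper queues. It is only an analogy for the lock-ordering argument, not kernel code.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t cpu_lock[2] = {
        PTHREAD_MUTEX_INITIALIZER, PTHREAD_MUTEX_INITIALIZER
    };

    static void stop_two(int cpu1, int cpu2)
    {
        int lo = cpu1 < cpu2 ? cpu1 : cpu2;
        int hi = cpu1 < cpu2 ? cpu2 : cpu1;

        /* Always queue/lock on the lowest numbered CPU first */
        pthread_mutex_lock(&cpu_lock[lo]);
        pthread_mutex_lock(&cpu_lock[hi]);
        printf("stopped %d and %d\n", cpu1, cpu2);
        pthread_mutex_unlock(&cpu_lock[hi]);
        pthread_mutex_unlock(&cpu_lock[lo]);
    }

    static void *worker(void *arg)
    {
        int self = (int)(long)arg;

        /* Each thread tries to stop itself and the other "CPU" */
        stop_two(self, 1 - self);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];

        pthread_create(&t[0], NULL, worker, (void *)0L);
        pthread_create(&t[1], NULL, worker, (void *)1L);
        pthread_join(t[0], NULL);
        pthread_join(t[1], NULL);
        return 0;
    }

Because both threads acquire the locks in ascending index order, neither can hold one lock while waiting for the other in the opposite order, which is exactly the AB-BA situation the patch avoids.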

[[email protected]: Deadlock avoidance]
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/stop_machine.h | 1 +
kernel/stop_machine.c | 272 +++++++++++++++++++++++++++----------------
2 files changed, 175 insertions(+), 98 deletions(-)

diff --git a/include/linux/stop_machine.h b/include/linux/stop_machine.h
index 3b5e910..d2abbdb 100644
--- a/include/linux/stop_machine.h
+++ b/include/linux/stop_machine.h
@@ -28,6 +28,7 @@ struct cpu_stop_work {
};

int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg);
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg);
void stop_one_cpu_nowait(unsigned int cpu, cpu_stop_fn_t fn, void *arg,
struct cpu_stop_work *work_buf);
int stop_cpus(const struct cpumask *cpumask, cpu_stop_fn_t fn, void *arg);
diff --git a/kernel/stop_machine.c b/kernel/stop_machine.c
index c09f295..32a6c44 100644
--- a/kernel/stop_machine.c
+++ b/kernel/stop_machine.c
@@ -115,6 +115,166 @@ int stop_one_cpu(unsigned int cpu, cpu_stop_fn_t fn, void *arg)
return done.executed ? done.ret : -ENOENT;
}

+/* This controls the threads on each CPU. */
+enum multi_stop_state {
+ /* Dummy starting state for thread. */
+ MULTI_STOP_NONE,
+ /* Awaiting everyone to be scheduled. */
+ MULTI_STOP_PREPARE,
+ /* Disable interrupts. */
+ MULTI_STOP_DISABLE_IRQ,
+ /* Run the function */
+ MULTI_STOP_RUN,
+ /* Exit */
+ MULTI_STOP_EXIT,
+};
+
+struct multi_stop_data {
+ int (*fn)(void *);
+ void *data;
+ /* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
+ unsigned int num_threads;
+ const struct cpumask *active_cpus;
+
+ enum multi_stop_state state;
+ atomic_t thread_ack;
+};
+
+static void set_state(struct multi_stop_data *msdata,
+ enum multi_stop_state newstate)
+{
+ /* Reset ack counter. */
+ atomic_set(&msdata->thread_ack, msdata->num_threads);
+ smp_wmb();
+ msdata->state = newstate;
+}
+
+/* Last one to ack a state moves to the next state. */
+static void ack_state(struct multi_stop_data *msdata)
+{
+ if (atomic_dec_and_test(&msdata->thread_ack))
+ set_state(msdata, msdata->state + 1);
+}
+
+/* This is the cpu_stop function which stops the CPU. */
+static int multi_cpu_stop(void *data)
+{
+ struct multi_stop_data *msdata = data;
+ enum multi_stop_state curstate = MULTI_STOP_NONE;
+ int cpu = smp_processor_id(), err = 0;
+ unsigned long flags;
+ bool is_active;
+
+ /*
+ * When called from stop_machine_from_inactive_cpu(), irq might
+ * already be disabled. Save the state and restore it on exit.
+ */
+ local_save_flags(flags);
+
+ if (!msdata->active_cpus)
+ is_active = cpu == cpumask_first(cpu_online_mask);
+ else
+ is_active = cpumask_test_cpu(cpu, msdata->active_cpus);
+
+ /* Simple state machine */
+ do {
+ /* Chill out and ensure we re-read multi_stop_state. */
+ cpu_relax();
+ if (msdata->state != curstate) {
+ curstate = msdata->state;
+ switch (curstate) {
+ case MULTI_STOP_DISABLE_IRQ:
+ local_irq_disable();
+ hard_irq_disable();
+ break;
+ case MULTI_STOP_RUN:
+ if (is_active)
+ err = msdata->fn(msdata->data);
+ break;
+ default:
+ break;
+ }
+ ack_state(msdata);
+ }
+ } while (curstate != MULTI_STOP_EXIT);
+
+ local_irq_restore(flags);
+ return err;
+}
+
+struct irq_cpu_stop_queue_work_info {
+ int cpu1;
+ int cpu2;
+ struct cpu_stop_work *work1;
+ struct cpu_stop_work *work2;
+};
+
+/*
+ * This function is always run with irqs and preemption disabled.
+ * This guarantees that both work1 and work2 get queued, before
+ * our local migrate thread gets the chance to preempt us.
+ */
+static void irq_cpu_stop_queue_work(void *arg)
+{
+ struct irq_cpu_stop_queue_work_info *info = arg;
+ cpu_stop_queue_work(info->cpu1, info->work1);
+ cpu_stop_queue_work(info->cpu2, info->work2);
+}
+
+/**
+ * stop_two_cpus - stops two cpus
+ * @cpu1: the cpu to stop
+ * @cpu2: the other cpu to stop
+ * @fn: function to execute
+ * @arg: argument to @fn
+ *
+ * Stops both the current and specified CPU and runs @fn on one of them.
+ *
+ * returns when both are completed.
+ */
+int stop_two_cpus(unsigned int cpu1, unsigned int cpu2, cpu_stop_fn_t fn, void *arg)
+{
+ int call_cpu;
+ struct cpu_stop_done done;
+ struct cpu_stop_work work1, work2;
+ struct irq_cpu_stop_queue_work_info call_args;
+ struct multi_stop_data msdata = {
+ .fn = fn,
+ .data = arg,
+ .num_threads = 2,
+ .active_cpus = cpumask_of(cpu1),
+ };
+
+ work1 = work2 = (struct cpu_stop_work){
+ .fn = multi_cpu_stop,
+ .arg = &msdata,
+ .done = &done
+ };
+
+ call_args = (struct irq_cpu_stop_queue_work_info){
+ .cpu1 = cpu1,
+ .cpu2 = cpu2,
+ .work1 = &work1,
+ .work2 = &work2,
+ };
+
+ cpu_stop_init_done(&done, 2);
+ set_state(&msdata, MULTI_STOP_PREPARE);
+
+ /*
+ * Queuing needs to be done by the lowest numbered CPU, to ensure
+ * that works are always queued in the same order on every CPU.
+ * This prevents deadlocks.
+ */
+ call_cpu = min(cpu1, cpu2);
+
+ smp_call_function_single(call_cpu, &irq_cpu_stop_queue_work,
+ &call_args, 0);
+
+ wait_for_completion(&done.completion);
+ return done.executed ? done.ret : -ENOENT;
+}
+
/**
* stop_one_cpu_nowait - stop a cpu but don't wait for completion
* @cpu: cpu to stop
@@ -359,98 +519,14 @@ early_initcall(cpu_stop_init);

#ifdef CONFIG_STOP_MACHINE

-/* This controls the threads on each CPU. */
-enum stopmachine_state {
- /* Dummy starting state for thread. */
- STOPMACHINE_NONE,
- /* Awaiting everyone to be scheduled. */
- STOPMACHINE_PREPARE,
- /* Disable interrupts. */
- STOPMACHINE_DISABLE_IRQ,
- /* Run the function */
- STOPMACHINE_RUN,
- /* Exit */
- STOPMACHINE_EXIT,
-};
-
-struct stop_machine_data {
- int (*fn)(void *);
- void *data;
- /* Like num_online_cpus(), but hotplug cpu uses us, so we need this. */
- unsigned int num_threads;
- const struct cpumask *active_cpus;
-
- enum stopmachine_state state;
- atomic_t thread_ack;
-};
-
-static void set_state(struct stop_machine_data *smdata,
- enum stopmachine_state newstate)
-{
- /* Reset ack counter. */
- atomic_set(&smdata->thread_ack, smdata->num_threads);
- smp_wmb();
- smdata->state = newstate;
-}
-
-/* Last one to ack a state moves to the next state. */
-static void ack_state(struct stop_machine_data *smdata)
-{
- if (atomic_dec_and_test(&smdata->thread_ack))
- set_state(smdata, smdata->state + 1);
-}
-
-/* This is the cpu_stop function which stops the CPU. */
-static int stop_machine_cpu_stop(void *data)
-{
- struct stop_machine_data *smdata = data;
- enum stopmachine_state curstate = STOPMACHINE_NONE;
- int cpu = smp_processor_id(), err = 0;
- unsigned long flags;
- bool is_active;
-
- /*
- * When called from stop_machine_from_inactive_cpu(), irq might
- * already be disabled. Save the state and restore it on exit.
- */
- local_save_flags(flags);
-
- if (!smdata->active_cpus)
- is_active = cpu == cpumask_first(cpu_online_mask);
- else
- is_active = cpumask_test_cpu(cpu, smdata->active_cpus);
-
- /* Simple state machine */
- do {
- /* Chill out and ensure we re-read stopmachine_state. */
- cpu_relax();
- if (smdata->state != curstate) {
- curstate = smdata->state;
- switch (curstate) {
- case STOPMACHINE_DISABLE_IRQ:
- local_irq_disable();
- hard_irq_disable();
- break;
- case STOPMACHINE_RUN:
- if (is_active)
- err = smdata->fn(smdata->data);
- break;
- default:
- break;
- }
- ack_state(smdata);
- }
- } while (curstate != STOPMACHINE_EXIT);
-
- local_irq_restore(flags);
- return err;
-}
-
int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
{
- struct stop_machine_data smdata = { .fn = fn, .data = data,
- .num_threads = num_online_cpus(),
- .active_cpus = cpus };
+ struct multi_stop_data msdata = {
+ .fn = fn,
+ .data = data,
+ .num_threads = num_online_cpus(),
+ .active_cpus = cpus,
+ };

if (!stop_machine_initialized) {
/*
@@ -461,7 +537,7 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
unsigned long flags;
int ret;

- WARN_ON_ONCE(smdata.num_threads != 1);
+ WARN_ON_ONCE(msdata.num_threads != 1);

local_irq_save(flags);
hard_irq_disable();
@@ -472,8 +548,8 @@ int __stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
}

/* Set the initial state and stop all online cpus. */
- set_state(&smdata, STOPMACHINE_PREPARE);
- return stop_cpus(cpu_online_mask, stop_machine_cpu_stop, &smdata);
+ set_state(&msdata, MULTI_STOP_PREPARE);
+ return stop_cpus(cpu_online_mask, multi_cpu_stop, &msdata);
}

int stop_machine(int (*fn)(void *), void *data, const struct cpumask *cpus)
@@ -513,25 +589,25 @@ EXPORT_SYMBOL_GPL(stop_machine);
int stop_machine_from_inactive_cpu(int (*fn)(void *), void *data,
const struct cpumask *cpus)
{
- struct stop_machine_data smdata = { .fn = fn, .data = data,
+ struct multi_stop_data msdata = { .fn = fn, .data = data,
.active_cpus = cpus };
struct cpu_stop_done done;
int ret;

/* Local CPU must be inactive and CPU hotplug in progress. */
BUG_ON(cpu_active(raw_smp_processor_id()));
- smdata.num_threads = num_active_cpus() + 1; /* +1 for local */
+ msdata.num_threads = num_active_cpus() + 1; /* +1 for local */

/* No proper task established and can't sleep - busy wait for lock. */
while (!mutex_trylock(&stop_cpus_mutex))
cpu_relax();

/* Schedule work on other CPUs and execute directly for local CPU */
- set_state(&smdata, STOPMACHINE_PREPARE);
+ set_state(&msdata, MULTI_STOP_PREPARE);
cpu_stop_init_done(&done, num_active_cpus());
- queue_stop_cpus_work(cpu_active_mask, stop_machine_cpu_stop, &smdata,
+ queue_stop_cpus_work(cpu_active_mask, multi_cpu_stop, &msdata,
&done);
- ret = stop_machine_cpu_stop(&smdata);
+ ret = multi_cpu_stop(&msdata);

/* Busy wait for completion. */
while (!completion_done(&done.completion))
--
1.8.1.4

2013-09-10 09:38:05

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 35/50] mm: numa: Only trap pmd hinting faults if we would otherwise trap PTE faults

Base page PMD faulting is meant to batch handle NUMA hinting faults from
PTEs. However, even if no PTE faults would ever be handled within a
range, the kernel still traps PMD hinting faults. This patch avoids that
overhead.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/mprotect.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 70ec934..191a89a 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -154,6 +154,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,

pmd = pmd_offset(pud, addr);
do {
+ unsigned long this_pages;
+
next = pmd_addr_end(addr, end);
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
@@ -173,8 +175,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}
if (pmd_none_or_clear_bad(pmd))
continue;
- pages += change_pte_range(vma, pmd, addr, next, newprot,
+ this_pages = change_pte_range(vma, pmd, addr, next, newprot,
dirty_accountable, prot_numa, &all_same_nidpid);
+ pages += this_pages;

/*
* If we are changing protections for NUMA hinting faults then
@@ -182,7 +185,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
* node. This allows a regular PMD to be handled as one fault
* and effectively batches the taking of the PTL
*/
- if (prot_numa && all_same_nidpid)
+ if (prot_numa && this_pages && all_same_nidpid)
change_pmd_protnuma(vma->vm_mm, addr, pmd);
} while (pmd++, addr = next, addr != end);

--
1.8.1.4

2013-09-10 09:38:32

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 34/50] sched: numa: Do not trap hinting faults for shared libraries

NUMA hinting faults will not migrate a shared executable page mapped by
multiple processes on the grounds that the data is probably in the CPU
cache already and the page may just bounce between tasks running on multiple
nodes. Even if the migration is avoided, there is still the overhead of
trapping the fault, updating the statistics, making scheduler placement
decisions based on the information and so on. If we are never going to migrate
the page, this is overhead for no gain and, worse, a process may be placed on
a sub-optimal node for shared executable pages. This patch avoids trapping
faults for shared libraries entirely.
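
As a rough illustration of the filter added below, here is a small userspace sketch. The MY_VM_* flags and struct fake_vma are invented stand-ins for the kernel's vm_flags and vm_area_struct, so treat this as an analogy rather than the real check.

    #include <stdbool.h>
    #include <stdio.h>

    #define MY_VM_READ  0x1
    #define MY_VM_WRITE 0x2

    struct fake_vma {
        bool has_mm;
        bool has_file;
        unsigned long flags;
    };

    static bool skip_hinting_faults(const struct fake_vma *vma)
    {
        if (!vma->has_mm)
            return true;    /* e.g. the vdso */
        if (vma->has_file &&
            (vma->flags & (MY_VM_READ | MY_VM_WRITE)) == MY_VM_READ)
            return true;    /* read-only file mapping: shared library */
        return false;
    }

    int main(void)
    {
        struct fake_vma libtext = { true, true, MY_VM_READ };
        struct fake_vma heap    = { true, false, MY_VM_READ | MY_VM_WRITE };

        printf("libtext skipped: %d\n", skip_hinting_faults(&libtext)); /* 1 */
        printf("heap skipped:    %d\n", skip_hinting_faults(&heap));    /* 0 */
        return 0;
    }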

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fd724bc..5d244d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1227,6 +1227,16 @@ void task_numa_work(struct callback_head *work)
if (!vma_migratable(vma))
continue;

+ /*
+ * Shared library pages mapped by multiple processes are not
+ * migrated as it is expected they are cache replicated. Avoid
+ * hinting faults in read-only file-backed mappings or the vdso
+ * as migrating the pages will be of marginal benefit.
+ */
+ if (!vma->vm_mm ||
+ (vma->vm_file && (vma->vm_flags & (VM_READ|VM_WRITE)) == (VM_READ)))
+ continue;
+
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
--
1.8.1.4

2013-09-10 09:38:58

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 33/50] sched: numa: increment numa_migrate_seq when task runs in correct location

From: Rik van Riel <[email protected]>

When a task is already running on its preferred node, increment
numa_migrate_seq to indicate that the task is settled if migration is
temporarily disabled, and memory should migrate towards it.

[[email protected]: Only increment migrate_seq if migration temporarily disabled]
Signed-off-by: Rik van Riel <[email protected]>
---
kernel/sched/fair.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b4d94e..fd724bc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1016,8 +1016,16 @@ static void numa_migrate_preferred(struct task_struct *p)
{
/* Success if task is already running on preferred CPU */
p->numa_migrate_retry = 0;
- if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid)
+ if (cpu_to_node(task_cpu(p)) == p->numa_preferred_nid) {
+ /*
+ * If migration is temporarily disabled due to a task migration
+ * then re-enable it now as the task is running on its
+ * preferred node and memory should migrate locally
+ */
+ if (!p->numa_migrate_seq)
+ p->numa_migrate_seq++;
return;
+ }

/* Otherwise, try migrate to a CPU on the preferred node */
if (task_numa_migrate(p) != 0)
--
1.8.1.4

2013-09-10 09:39:29

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 27/50] mm: numa: Scan pages with elevated page_mapcount

Currently automatic NUMA balancing is unable to distinguish between falsely
shared and private pages except by ignoring pages with an elevated
page_mapcount entirely. This avoids shared pages bouncing between the
nodes whose tasks are using them, but it ignores quite a lot of data.

This patch kicks away the training wheels in preparation for the
shared/private page identification support that is now in place. The ordering
is such that the impact of the shared/private detection can be easily
measured. Note that the patch does not migrate shared, file-backed pages
within VMAs marked VM_EXEC as these are generally shared library pages.
Migrating such pages is not beneficial as there is an expectation that they
are read-shared between caches and that iTLB and iCache pressure is
generally low.
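
The relaxed filter can be sketched in a few lines of userspace C; the MY_VM_EXEC flag and skip_migration() helper are made up for illustration and only mirror the condition in the mm/migrate.c hunk below.

    #include <stdbool.h>
    #include <stdio.h>

    #define MY_VM_EXEC 0x4

    static bool skip_migration(int mapcount, bool file_backed,
                               unsigned long vm_flags)
    {
        /* Only shared, file-backed, executable mappings are skipped */
        return mapcount != 1 && file_backed && (vm_flags & MY_VM_EXEC);
    }

    int main(void)
    {
        /* Anonymous page shared by threads of related tasks: migrate */
        printf("shared anon: skip=%d\n", skip_migration(2, false, 0));
        /* Shared library text mapped by many processes: do not migrate */
        printf("lib text:    skip=%d\n", skip_migration(8, true, MY_VM_EXEC));
        return 0;
    }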

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/migrate.h | 7 ++++---
mm/huge_memory.c | 8 ++------
mm/memory.c | 7 ++-----
mm/migrate.c | 17 ++++++-----------
mm/mprotect.c | 4 +---
5 files changed, 15 insertions(+), 28 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index a405d3dc..e7e26af 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -92,11 +92,12 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
#endif /* CONFIG_MIGRATION */

#ifdef CONFIG_NUMA_BALANCING
-extern int migrate_misplaced_page(struct page *page, int node);
-extern int migrate_misplaced_page(struct page *page, int node);
+extern int migrate_misplaced_page(struct page *page,
+ struct vm_area_struct *vma, int node);
extern bool migrate_ratelimited(int node);
#else
-static inline int migrate_misplaced_page(struct page *page, int node)
+static inline int migrate_misplaced_page(struct page *page,
+ struct vm_area_struct *vma, int node)
{
return -EAGAIN; /* can't migrate now */
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ca66a8a..a8e624e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1498,12 +1498,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
} else {
struct page *page = pmd_page(*pmd);

- /*
- * Only check non-shared pages. See change_pte_range
- * for comment on why the zero page is not modified
- */
- if (page_mapcount(page) == 1 &&
- !is_huge_zero_page(page) &&
+ /* See change_pte_range about the zero page */
+ if (!is_huge_zero_page(page) &&
!pmd_numa(*pmd)) {
entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_mknuma(entry);
diff --git a/mm/memory.c b/mm/memory.c
index bd016c2..e335ec0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3588,7 +3588,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

/* Migrate to the requested node */
- migrated = migrate_misplaced_page(page, target_nid);
+ migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
page_nid = target_nid;

@@ -3653,16 +3653,13 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = vm_normal_page(vma, addr, pteval);
if (unlikely(!page))
continue;
- /* only check non-shared pages */
- if (unlikely(page_mapcount(page) != 1))
- continue;

last_nid = page_nid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(pte, ptl);
if (target_nid != -1) {
- migrated = migrate_misplaced_page(page, target_nid);
+ migrated = migrate_misplaced_page(page, vma, target_nid);
if (migrated)
page_nid = target_nid;
} else {
diff --git a/mm/migrate.c b/mm/migrate.c
index 6f0c244..08ac3ba 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1596,7 +1596,8 @@ int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
* node. Caller is expected to have an elevated reference count on
* the page that will be dropped by this function before returning.
*/
-int migrate_misplaced_page(struct page *page, int node)
+int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
+ int node)
{
pg_data_t *pgdat = NODE_DATA(node);
int isolated;
@@ -1604,10 +1605,11 @@ int migrate_misplaced_page(struct page *page, int node)
LIST_HEAD(migratepages);

/*
- * Don't migrate pages that are mapped in multiple processes.
- * TODO: Handle false sharing detection instead of this hammer
+ * Don't migrate file pages that are mapped in multiple processes
+ * with execute permissions as they are probably shared libraries.
*/
- if (page_mapcount(page) != 1)
+ if (page_mapcount(page) != 1 && page_is_file_cache(page) &&
+ (vma->vm_flags & VM_EXEC))
goto out;

/*
@@ -1658,13 +1660,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
int page_lru = page_is_file_cache(page);

/*
- * Don't migrate pages that are mapped in multiple processes.
- * TODO: Handle false sharing detection instead of this hammer
- */
- if (page_mapcount(page) != 1)
- goto out_dropref;
-
- /*
* Rate-limit the amount of data that is being migrated to a node.
* Optimal placement is no good if the memory bus is saturated and
* all the time is being spent migrating!
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 1e9cef0..4a21819 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -77,9 +77,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (last_nid != this_nid)
all_same_node = false;

- /* only check non-shared pages */
- if (!pte_numa(oldpte) &&
- page_mapcount(page) == 1) {
+ if (!pte_numa(oldpte)) {
ptent = pte_mknuma(ptent);
updated = true;
}
--
1.8.1.4

2013-09-10 09:39:28

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 28/50] sched: Remove check that skips small VMAs

task_numa_work skips small VMAs. At the time the logic was to reduce the
scanning overhead which was considerable. It is a dubious hack at best.
It would make much more sense to cache where faults have been observed
and only rescan those regions during subsequent PTE scans. Remove this
hack as motivation to do it properly in the future.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e259241..2d04112 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1127,10 +1127,6 @@ void task_numa_work(struct callback_head *work)
if (!vma_migratable(vma))
continue;

- /* Skip small VMAs. They are not likely to be of relevance */
- if (vma->vm_end - vma->vm_start < HPAGE_SIZE)
- continue;
-
do {
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
--
1.8.1.4

2013-09-10 09:40:02

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 25/50] sched: Add infrastructure for split shared/private accounting of NUMA hinting faults

Ideally it would be possible to distinguish between NUMA hinting faults
that are private to a task and those that are shared. This patch prepares
infrastructure for separately accounting shared and private faults by
allocating the necessary buffers and passing in relevant information. For
now, all faults are treated as private and detection will be introduced
later.
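
A minimal userspace sketch of the 2 * nid + priv array layout introduced by the patch; NR_NODES and the fault counts are made up for illustration.

    #include <stdio.h>

    #define NR_NODES 4

    static unsigned long faults[2 * NR_NODES];

    /* Shared faults at 2*nid, private faults at 2*nid + 1 */
    static int faults_idx(int nid, int priv)
    {
        return 2 * nid + priv;
    }

    int main(void)
    {
        int nid;

        faults[faults_idx(1, 0)] = 10;   /* 10 shared faults on node 1 */
        faults[faults_idx(1, 1)] = 25;   /* 25 private faults on node 1 */

        for (nid = 0; nid < NR_NODES; nid++) {
            unsigned long total = faults[faults_idx(nid, 0)] +
                                  faults[faults_idx(nid, 1)];
            printf("node %d: %lu faults\n", nid, total);
        }
        return 0;
    }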

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 5 +++--
kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++-----------
mm/huge_memory.c | 5 +++--
mm/memory.c | 8 ++++++--
4 files changed, 47 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a2e661d..6eb8fa6 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1430,10 +1430,11 @@ struct task_struct {
#define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

#ifdef CONFIG_NUMA_BALANCING
-extern void task_numa_fault(int node, int pages, bool migrated);
+extern void task_numa_fault(int last_node, int node, int pages, bool migrated);
extern void set_numabalancing_state(bool enabled);
#else
-static inline void task_numa_fault(int node, int pages, bool migrated)
+static inline void task_numa_fault(int last_node, int node, int pages,
+ bool migrated)
{
}
static inline void set_numabalancing_state(bool enabled)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 350c411..108f357 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -886,6 +886,20 @@ static unsigned int task_scan_max(struct task_struct *p)
*/
unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;

+static inline int task_faults_idx(int nid, int priv)
+{
+ return 2 * nid + priv;
+}
+
+static inline unsigned long task_faults(struct task_struct *p, int nid)
+{
+ if (!p->numa_faults)
+ return 0;
+
+ return p->numa_faults[task_faults_idx(nid, 0)] +
+ p->numa_faults[task_faults_idx(nid, 1)];
+}
+
static unsigned long weighted_cpuload(const int cpu);


@@ -928,13 +942,19 @@ static void task_numa_placement(struct task_struct *p)
/* Find the node with the highest number of faults */
for_each_online_node(nid) {
unsigned long faults;
+ int priv, i;

- /* Decay existing window and copy faults since last scan */
- p->numa_faults[nid] >>= 1;
- p->numa_faults[nid] += p->numa_faults_buffer[nid];
- p->numa_faults_buffer[nid] = 0;
+ for (priv = 0; priv < 2; priv++) {
+ i = task_faults_idx(nid, priv);

- faults = p->numa_faults[nid];
+ /* Decay existing window, copy faults since last scan */
+ p->numa_faults[i] >>= 1;
+ p->numa_faults[i] += p->numa_faults_buffer[i];
+ p->numa_faults_buffer[i] = 0;
+ }
+
+ /* Find maximum private faults */
+ faults = p->numa_faults[task_faults_idx(nid, 1)];
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
@@ -970,16 +990,20 @@ static void task_numa_placement(struct task_struct *p)
/*
* Got a PROT_NONE fault for a page on @node.
*/
-void task_numa_fault(int node, int pages, bool migrated)
+void task_numa_fault(int last_nid, int node, int pages, bool migrated)
{
struct task_struct *p = current;
+ int priv;

if (!numabalancing_enabled)
return;

+ /* For now, do not attempt to detect private/shared accesses */
+ priv = 1;
+
/* Allocate buffer to track faults on a per-node basis */
if (unlikely(!p->numa_faults)) {
- int size = sizeof(*p->numa_faults) * nr_node_ids;
+ int size = sizeof(*p->numa_faults) * 2 * nr_node_ids;

/* numa_faults and numa_faults_buffer share the allocation */
p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
@@ -987,7 +1011,7 @@ void task_numa_fault(int node, int pages, bool migrated)
return;

BUG_ON(p->numa_faults_buffer);
- p->numa_faults_buffer = p->numa_faults + nr_node_ids;
+ p->numa_faults_buffer = p->numa_faults + (2 * nr_node_ids);
}

/*
@@ -1005,7 +1029,7 @@ void task_numa_fault(int node, int pages, bool migrated)

task_numa_placement(p);

- p->numa_faults_buffer[node] += pages;
+ p->numa_faults_buffer[task_faults_idx(node, priv)] += pages;
}

static void reset_ptenuma_scan(struct task_struct *p)
@@ -4099,7 +4123,7 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
return false;

if (dst_nid == p->numa_preferred_nid ||
- p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+ task_faults(p, dst_nid) > task_faults(p, src_nid))
return true;

return false;
@@ -4123,7 +4147,7 @@ static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
return false;

- if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+ if (task_faults(p, dst_nid) < task_faults(p, src_nid))
return true;

return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 065a31d..ca66a8a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1294,7 +1294,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int page_nid = -1, this_nid = numa_node_id();
- int target_nid;
+ int target_nid, last_nid = -1;
bool page_locked;
bool migrated = false;

@@ -1305,6 +1305,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
page = pmd_page(pmd);
BUG_ON(is_huge_zero_page(page));
page_nid = page_to_nid(page);
+ last_nid = page_nid_last(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);
@@ -1376,7 +1377,7 @@ out:
page_unlock_anon_vma_read(anon_vma);

if (page_nid != -1)
- task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+ task_numa_fault(last_nid, page_nid, HPAGE_PMD_NR, migrated);

return 0;
}
diff --git a/mm/memory.c b/mm/memory.c
index 86c3caf..bd016c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3547,6 +3547,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct page *page = NULL;
spinlock_t *ptl;
int page_nid = -1;
+ int last_nid;
int target_nid;
bool migrated = false;

@@ -3577,6 +3578,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
BUG_ON(is_zero_pfn(page_to_pfn(page)));

+ last_nid = page_nid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
@@ -3592,7 +3594,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,

out:
if (page_nid != -1)
- task_numa_fault(page_nid, 1, migrated);
+ task_numa_fault(last_nid, page_nid, 1, migrated);
return 0;
}

@@ -3607,6 +3609,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
+ int last_nid;

spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3654,6 +3657,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(page_mapcount(page) != 1))
continue;

+ last_nid = page_nid_last(page);
page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(pte, ptl);
@@ -3666,7 +3670,7 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
}

if (page_nid != -1)
- task_numa_fault(page_nid, 1, migrated);
+ task_numa_fault(last_nid, page_nid, 1, migrated);

pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
--
1.8.1.4

2013-09-10 09:40:29

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 23/50] sched: Resist moving tasks towards nodes with fewer hinting faults

Just as "sched: Favour moving tasks towards the preferred node" favours
moving tasks towards nodes with a higher number of recorded NUMA hinting
faults, this patch resists moving tasks towards nodes with lower faults.

[[email protected]: changelog]
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 33 +++++++++++++++++++++++++++++++++
kernel/sched/features.h | 8 ++++++++
2 files changed, 41 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 216908c..5649280 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4060,12 +4060,43 @@ static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)

return false;
}
+
+
+static bool migrate_degrades_locality(struct task_struct *p, struct lb_env *env)
+{
+ int src_nid, dst_nid;
+
+ if (!sched_feat(NUMA) || !sched_feat(NUMA_RESIST_LOWER))
+ return false;
+
+ if (!p->numa_faults || !(env->sd->flags & SD_NUMA))
+ return false;
+
+ src_nid = cpu_to_node(env->src_cpu);
+ dst_nid = cpu_to_node(env->dst_cpu);
+
+ if (src_nid == dst_nid ||
+ p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+ return false;
+
+ if (p->numa_faults[dst_nid] < p->numa_faults[src_nid])
+ return true;
+
+ return false;
+}
+
#else
static inline bool migrate_improves_locality(struct task_struct *p,
struct lb_env *env)
{
return false;
}
+
+static inline bool migrate_degrades_locality(struct task_struct *p,
+ struct lb_env *env)
+{
+ return false;
+}
#endif

/*
@@ -4130,6 +4161,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* 3) too many balance attempts have failed.
*/
tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+ if (!tsk_cache_hot)
+ tsk_cache_hot = migrate_degrades_locality(p, env);

if (migrate_improves_locality(p, env)) {
#ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index d9278ce..5716929 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -74,4 +74,12 @@ SCHED_FEAT(NUMA, false)
* balancing.
*/
SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
+
+/*
+ * NUMA_RESIST_LOWER will resist moving tasks towards nodes where a
+ * lower number of hinting faults have been recorded. As this has
+ * the potential to prevent a task ever migrating to a new node
+ * due to CPU overload it is disabled by default.
+ */
+SCHED_FEAT(NUMA_RESIST_LOWER, false)
#endif
--
1.8.1.4

2013-09-10 09:40:49

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 22/50] sched: Favour moving tasks towards the preferred node

This patch favours moving tasks towards the NUMA node that recorded a higher
number of NUMA faults during active load balancing. Ideally this is
self-reinforcing as the longer the task runs on that node, the more faults
it should incur, causing task_numa_placement to keep the task running on that
node. In reality a big weakness is that the node's CPUs can be overloaded
and it would be more efficient to queue tasks on an idle node and migrate
to the new node. This would require additional smarts in the balancer so
for now the balancer will simply prefer to place the task on the preferred
node for a number of PTE scans, which is controlled by the
numa_balancing_settle_count sysctl. Once the settle_count number of scans has
completed, the scheduler is free to place the task on an alternative node if
the load is imbalanced.
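
A simplified, self-contained sketch of the bias this introduces into the migration decision; SETTLE_COUNT, the fault numbers and struct fake_task are invented for illustration and only mirror the migrate_improves_locality() hunk below.

    #include <stdbool.h>
    #include <stdio.h>

    #define SETTLE_COUNT 3

    struct fake_task {
        int preferred_nid;
        int migrate_seq;              /* scans since preferred node chosen */
        unsigned long faults[2];      /* per-node hinting faults */
    };

    static bool move_improves_locality(const struct fake_task *p,
                                       int src_nid, int dst_nid)
    {
        if (src_nid == dst_nid || p->migrate_seq >= SETTLE_COUNT)
            return false;
        return dst_nid == p->preferred_nid ||
               p->faults[dst_nid] > p->faults[src_nid];
    }

    int main(void)
    {
        struct fake_task p = { .preferred_nid = 1, .migrate_seq = 1,
                               .faults = { 5, 40 } };

        printf("move 0->1: %d\n", move_improves_locality(&p, 0, 1)); /* 1 */
        p.migrate_seq = SETTLE_COUNT;                /* task has settled */
        printf("move 0->1: %d\n", move_improves_locality(&p, 0, 1)); /* 0 */
        return 0;
    }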

[[email protected]: Fixed statistics]
[[email protected]: Tunable and use higher faults instead of preferred]
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/sysctl/kernel.txt | 8 +++++-
include/linux/sched.h | 1 +
kernel/sched/core.c | 3 +-
kernel/sched/fair.c | 63 ++++++++++++++++++++++++++++++++++++++---
kernel/sched/features.h | 7 +++++
kernel/sysctl.c | 7 +++++
6 files changed, 83 insertions(+), 6 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ad8d4f5..23ff00a 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -374,7 +374,8 @@ feature should be disabled. Otherwise, if the system overhead from the
feature is too high then the rate the kernel samples for NUMA hinting
faults may be controlled by the numa_balancing_scan_period_min_ms,
numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
-numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+numa_balancing_scan_period_max_ms, numa_balancing_scan_size_mb and
+numa_balancing_settle_count sysctls.

==============================================================

@@ -419,6 +420,11 @@ scanned for a given scan.
numa_balancing_scan_period_reset is a blunt instrument that controls how
often a tasks scan delay is reset to detect sudden changes in task behaviour.

+numa_balancing_settle_count is how many scan periods must complete before
+the schedule balancer stops pushing the task towards a preferred node. This
+gives the scheduler a chance to place the task on an alternative node if the
+preferred node is overloaded.
+
==============================================================

osrelease, ostype & version:
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 84fb883..a2e661d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -776,6 +776,7 @@ enum cpu_idle_type {
#define SD_ASYM_PACKING 0x0800 /* Place busy groups earlier in the domain */
#define SD_PREFER_SIBLING 0x1000 /* Prefer to place tasks in a sibling domain */
#define SD_OVERLAP 0x2000 /* sched_domains of this level overlap */
+#define SD_NUMA 0x4000 /* cross-node balancing */

extern int __weak arch_sd_sibiling_asym_packing(void);

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2e8f3e2..0dbd5cd 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1641,7 +1641,7 @@ static void __sched_fork(struct task_struct *p)

p->node_stamp = 0ULL;
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
- p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
+ p->numa_migrate_seq = 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
@@ -5667,6 +5667,7 @@ sd_numa_init(struct sched_domain_topology_level *tl, int cpu)
| 0*SD_SHARE_PKG_RESOURCES
| 1*SD_SERIALIZE
| 0*SD_PREFER_SIBLING
+ | 1*SD_NUMA
| sd_local_flags(level)
,
.last_balance = jiffies,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c2fefa5..216908c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -877,6 +877,15 @@ static unsigned int task_scan_max(struct task_struct *p)
return max(smin, smax);
}

+/*
+ * Once a preferred node is selected the scheduler balancer will prefer moving
+ * a task to that node for sysctl_numa_balancing_settle_count number of PTE
+ * scans. This will give the process the chance to accumulate more faults on
+ * the preferred node but still allow the scheduler to move the task again if
+ * the nodes CPUs are overloaded.
+ */
+unsigned int sysctl_numa_balancing_settle_count __read_mostly = 3;
+
static void task_numa_placement(struct task_struct *p)
{
int seq, nid, max_nid = -1;
@@ -888,6 +897,7 @@ static void task_numa_placement(struct task_struct *p)
if (p->numa_scan_seq == seq)
return;
p->numa_scan_seq = seq;
+ p->numa_migrate_seq++;
p->numa_scan_period_max = task_scan_max(p);

/* Find the node with the highest number of faults */
@@ -907,8 +917,10 @@ static void task_numa_placement(struct task_struct *p)
}

/* Update the tasks preferred node if necessary */
- if (max_faults && max_nid != p->numa_preferred_nid)
+ if (max_faults && max_nid != p->numa_preferred_nid) {
p->numa_preferred_nid = max_nid;
+ p->numa_migrate_seq = 0;
+ }
}

/*
@@ -4024,6 +4036,38 @@ task_hot(struct task_struct *p, u64 now, struct sched_domain *sd)
return delta < (s64)sysctl_sched_migration_cost;
}

+#ifdef CONFIG_NUMA_BALANCING
+/* Returns true if the destination node has incurred more faults */
+static bool migrate_improves_locality(struct task_struct *p, struct lb_env *env)
+{
+ int src_nid, dst_nid;
+
+ if (!sched_feat(NUMA_FAVOUR_HIGHER) || !p->numa_faults ||
+ !(env->sd->flags & SD_NUMA)) {
+ return false;
+ }
+
+ src_nid = cpu_to_node(env->src_cpu);
+ dst_nid = cpu_to_node(env->dst_cpu);
+
+ if (src_nid == dst_nid ||
+ p->numa_migrate_seq >= sysctl_numa_balancing_settle_count)
+ return false;
+
+ if (dst_nid == p->numa_preferred_nid ||
+ p->numa_faults[dst_nid] > p->numa_faults[src_nid])
+ return true;
+
+ return false;
+}
+#else
+static inline bool migrate_improves_locality(struct task_struct *p,
+ struct lb_env *env)
+{
+ return false;
+}
+#endif
+
/*
* can_migrate_task - may task p from runqueue rq be migrated to this_cpu?
*/
@@ -4081,11 +4125,22 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)

/*
* Aggressive migration if:
- * 1) task is cache cold, or
- * 2) too many balance attempts have failed.
+ * 1) destination numa is preferred
+ * 2) task is cache cold, or
+ * 3) too many balance attempts have failed.
*/
-
tsk_cache_hot = task_hot(p, rq_clock_task(env->src_rq), env->sd);
+
+ if (migrate_improves_locality(p, env)) {
+#ifdef CONFIG_SCHEDSTATS
+ if (tsk_cache_hot) {
+ schedstat_inc(env->sd, lb_hot_gained[env->idle]);
+ schedstat_inc(p, se.statistics.nr_forced_migrations);
+ }
+#endif
+ return 1;
+ }
+
if (!tsk_cache_hot ||
env->sd->nr_balance_failed > env->sd->cache_nice_tries) {

diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index cba5c61..d9278ce 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -67,4 +67,11 @@ SCHED_FEAT(LB_MIN, false)
*/
#ifdef CONFIG_NUMA_BALANCING
SCHED_FEAT(NUMA, false)
+
+/*
+ * NUMA_FAVOUR_HIGHER will favor moving tasks towards nodes where a
+ * higher number of hinting faults are recorded during active load
+ * balancing.
+ */
+SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
#endif
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 07f6fc4..0015fb9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -391,6 +391,13 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec,
},
+ {
+ .procname = "numa_balancing_settle_count",
+ .data = &sysctl_numa_balancing_settle_count,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ },
#endif /* CONFIG_NUMA_BALANCING */
#endif /* CONFIG_SCHED_DEBUG */
{
--
1.8.1.4

2013-09-10 09:32:56

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 20/50] sched: Select a preferred node with the most numa hinting faults

This patch selects a preferred node for a task to run on based on the
NUMA hinting faults. This information is later used to migrate tasks
towards the node during balancing.
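
A trivial userspace sketch of the placement rule, with made-up per-node fault counts; it only demonstrates the "pick the node with the most faults" loop added below.

    #include <stdio.h>

    int main(void)
    {
        unsigned long numa_faults[] = { 12, 48, 3, 20 };  /* per node */
        unsigned long max_faults = 0;
        int nid, max_nid = -1;

        for (nid = 0; nid < 4; nid++) {
            if (numa_faults[nid] > max_faults) {
                max_faults = numa_faults[nid];
                max_nid = nid;
            }
        }
        printf("preferred node: %d\n", max_nid);  /* node 1 */
        return 0;
    }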

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 1 +
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 17 +++++++++++++++--
3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dfba435..d6ec68a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1336,6 +1336,7 @@ struct task_struct {
struct callback_head numa_work;

unsigned long *numa_faults;
+ int numa_preferred_nid;
#endif /* CONFIG_NUMA_BALANCING */

struct rcu_head rcu;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dbc2de6..0235ab8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1643,6 +1643,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
p->numa_migrate_seq = p->mm ? p->mm->numa_scan_seq - 1 : 0;
p->numa_scan_period = sysctl_numa_balancing_scan_delay;
+ p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebd24c0..8c60822 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -879,7 +879,8 @@ static unsigned int task_scan_max(struct task_struct *p)

static void task_numa_placement(struct task_struct *p)
{
- int seq;
+ int seq, nid, max_nid = -1;
+ unsigned long max_faults = 0;

if (!p->mm) /* for example, ksmd faulting in a user's mm */
return;
@@ -889,7 +890,19 @@ static void task_numa_placement(struct task_struct *p)
p->numa_scan_seq = seq;
p->numa_scan_period_max = task_scan_max(p);

- /* FIXME: Scheduling placement policy hints go here */
+ /* Find the node with the highest number of faults */
+ for_each_online_node(nid) {
+ unsigned long faults = p->numa_faults[nid];
+ p->numa_faults[nid] >>= 1;
+ if (faults > max_faults) {
+ max_faults = faults;
+ max_nid = nid;
+ }
+ }
+
+ /* Update the tasks preferred node if necessary */
+ if (max_faults && max_nid != p->numa_preferred_nid)
+ p->numa_preferred_nid = max_nid;
}

/*
--
1.8.1.4

2013-09-10 09:41:09

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 21/50] sched: Update NUMA hinting faults once per scan

NUMA hinting fault counts and placement decisions are both recorded in the
same array, which distorts the samples in an unpredictable fashion. The values
linearly accumulate during the scan and then decay, creating a sawtooth-like
pattern in the per-node counts. It also means that placement decisions are
time sensitive. At best it means that it is very difficult to state that
the buffer holds a decaying average of past faulting behaviour. At worst,
it can confuse the load balancer if it sees one node with an artificially high
count due to very recent faulting activity and may create a bouncing effect.

This patch adds a second array. numa_faults stores the historical data
which is used for placement decisions. numa_faults_buffer holds the
fault activity during the current scan window. When the scan completes,
numa_faults decays and the values from numa_faults_buffer are copied
across.
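
A worked userspace example of that decay step; the per-window fault numbers are made up, but the arithmetic matches the hunk below.

    #include <stdio.h>

    int main(void)
    {
        unsigned long numa_faults = 0;
        /* Faults observed per scan window on one node (illustrative) */
        unsigned long buffer[] = { 100, 100, 0, 0, 0 };
        unsigned int i;

        for (i = 0; i < sizeof(buffer) / sizeof(buffer[0]); i++) {
            numa_faults >>= 1;            /* decay existing window */
            numa_faults += buffer[i];     /* fold in the new samples */
            printf("after scan %u: %lu\n", i + 1, numa_faults);
        }
        /* Prints 100, 150, 75, 37, 18: old activity fades away smoothly */
        return 0;
    }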

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/sched.h | 13 +++++++++++++
kernel/sched/core.c | 1 +
kernel/sched/fair.c | 16 +++++++++++++---
3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d6ec68a..84fb883 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1335,7 +1335,20 @@ struct task_struct {
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;

+ /*
+ * Exponential decaying average of faults on a per-node basis.
+ * Scheduling placement decisions are made based on the these counts.
+ * The values remain static for the duration of a PTE scan
+ */
unsigned long *numa_faults;
+
+ /*
+ * numa_faults_buffer records faults per node during the current
+ * scan window. When the scan completes, the counts in numa_faults
+ * decay and these values are copied.
+ */
+ unsigned long *numa_faults_buffer;
+
int numa_preferred_nid;
#endif /* CONFIG_NUMA_BALANCING */

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0235ab8..2e8f3e2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1646,6 +1646,7 @@ static void __sched_fork(struct task_struct *p)
p->numa_preferred_nid = -1;
p->numa_work.next = &p->numa_work;
p->numa_faults = NULL;
+ p->numa_faults_buffer = NULL;
#endif /* CONFIG_NUMA_BALANCING */
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8c60822..c2fefa5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -892,8 +892,14 @@ static void task_numa_placement(struct task_struct *p)

/* Find the node with the highest number of faults */
for_each_online_node(nid) {
- unsigned long faults = p->numa_faults[nid];
+ unsigned long faults;
+
+ /* Decay existing window and copy faults since last scan */
p->numa_faults[nid] >>= 1;
+ p->numa_faults[nid] += p->numa_faults_buffer[nid];
+ p->numa_faults_buffer[nid] = 0;
+
+ faults = p->numa_faults[nid];
if (faults > max_faults) {
max_faults = faults;
max_nid = nid;
@@ -919,9 +925,13 @@ void task_numa_fault(int node, int pages, bool migrated)
if (unlikely(!p->numa_faults)) {
int size = sizeof(*p->numa_faults) * nr_node_ids;

- p->numa_faults = kzalloc(size, GFP_KERNEL|__GFP_NOWARN);
+ /* numa_faults and numa_faults_buffer share the allocation */
+ p->numa_faults = kzalloc(size * 2, GFP_KERNEL|__GFP_NOWARN);
if (!p->numa_faults)
return;
+
+ BUG_ON(p->numa_faults_buffer);
+ p->numa_faults_buffer = p->numa_faults + nr_node_ids;
}

/*
@@ -939,7 +949,7 @@ void task_numa_fault(int node, int pages, bool migrated)

task_numa_placement(p);

- p->numa_faults[node] += pages;
+ p->numa_faults_buffer[node] += pages;
}

static void reset_ptenuma_scan(struct task_struct *p)
--
1.8.1.4

2013-09-10 09:41:47

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 18/50] sched: numa: Slow scan rate if no NUMA hinting faults are being recorded

NUMA PTE scanning slows if a NUMA hinting fault was trapped and no page
was migrated. For long-lived but idle processes there may be no faults
but the scan rate will be high and just waste CPU. This patch will slow
the scan rate for processes that are not trapping faults.
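
A small sketch of the resulting backoff behaviour; the millisecond values are illustrative only and do not correspond to the actual sysctl defaults.

    #include <stdio.h>

    int main(void)
    {
        unsigned int scan_period = 1000;        /* ms */
        const unsigned int scan_period_max = 60000;
        int pass;

        for (pass = 0; pass < 8; pass++) {
            unsigned int nr_pte_updates = 0;  /* idle process: no faults */

            if (!nr_pte_updates) {
                /* Whole address space scanned with no updates: back off */
                scan_period <<= 1;
                if (scan_period > scan_period_max)
                    scan_period = scan_period_max;
            }
            printf("pass %d: next scan in %u ms\n", pass, scan_period);
        }
        return 0;
    }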

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 29ba117..779ebd7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1039,6 +1039,18 @@ void task_numa_work(struct callback_head *work)

out:
/*
+ * If the whole process was scanned without updates then no NUMA
+ * hinting faults are being recorded and scan rate should be lower.
+ */
+ if (mm->numa_scan_offset == 0 && !nr_pte_updates) {
+ p->numa_scan_period = min(p->numa_scan_period_max,
+ p->numa_scan_period << 1);
+
+ next_scan = now + msecs_to_jiffies(p->numa_scan_period);
+ mm->numa_next_scan = next_scan;
+ }
+
+ /*
* It is possible to reach the end of the VMA list but the last few
* VMAs are not guaranteed to the vma_migratable. If they are not, we
* would find the !migratable VMA on the next scan but not reset the
--
1.8.1.4

2013-09-10 09:42:12

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 16/50] mm: Only flush TLBs if a transhuge PMD is modified for NUMA pte scanning

NUMA PTE scanning is expensive both in terms of the scanning itself and
the TLB flush if there are any updates. The TLB flush is avoided if no
PTEs are updated but there is a bug where transhuge PMDs are considered
to be updated even if they were already pmd_numa. This patch addresses
the problem and TLB flushes should be reduced.
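
For reference, a userspace sketch of how a caller can consume the three return values that change_huge_pmd() gains below; FAKE_HPAGE_PMD_NR and the fake helper are stand-ins for the real function.

    #include <stdio.h>

    #define FAKE_HPAGE_PMD_NR 512

    /* Stand-in for change_huge_pmd(): 0 = not locked, 1 = locked but
     * unchanged, FAKE_HPAGE_PMD_NR = protections changed. */
    static int fake_change_huge_pmd(int already_numa)
    {
        if (already_numa)
            return 1;                 /* locked, nothing modified */
        return FAKE_HPAGE_PMD_NR;     /* modified, flush required */
    }

    int main(void)
    {
        unsigned long pages = 0;
        int nr_ptes = fake_change_huge_pmd(1);

        if (nr_ptes) {
            if (nr_ptes == FAKE_HPAGE_PMD_NR)
                pages++;              /* only count real updates */
        }
        printf("PMDs needing flush: %lu\n", pages);  /* prints 0 */
        return 0;
    }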

Signed-off-by: Mel Gorman <[email protected]>
---
mm/huge_memory.c | 19 ++++++++++++++++---
mm/mprotect.c | 14 ++++++++++----
2 files changed, 26 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40f75a6..065a31d 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1474,6 +1474,12 @@ out:
return ret;
}

+/*
+ * Returns
+ * - 0 if PMD could not be locked
+ * - 1 if PMD was locked but protections unchanged and TLB flush unnecessary
+ * - HPAGE_PMD_NR if protections changed and TLB flush necessary
+ */
int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, pgprot_t newprot, int prot_numa)
{
@@ -1482,9 +1488,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,

if (__pmd_trans_huge_lock(pmd, vma) == 1) {
pmd_t entry;
- entry = pmdp_get_and_clear(mm, addr, pmd);
+ ret = 1;
if (!prot_numa) {
+ entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_modify(entry, newprot);
+ ret = HPAGE_PMD_NR;
BUG_ON(pmd_write(entry));
} else {
struct page *page = pmd_page(*pmd);
@@ -1496,12 +1504,17 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
if (page_mapcount(page) == 1 &&
!is_huge_zero_page(page) &&
!pmd_numa(*pmd)) {
+ entry = pmdp_get_and_clear(mm, addr, pmd);
entry = pmd_mknuma(entry);
+ ret = HPAGE_PMD_NR;
}
}
- set_pmd_at(mm, addr, pmd, entry);
+
+ /* Set PMD if cleared earlier */
+ if (ret == HPAGE_PMD_NR)
+ set_pmd_at(mm, addr, pmd, entry);
+
spin_unlock(&vma->vm_mm->page_table_lock);
- ret = 1;
}

return ret;
diff --git a/mm/mprotect.c b/mm/mprotect.c
index faa499e..1f9b54b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -151,10 +151,16 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (pmd_trans_huge(*pmd)) {
if (next - addr != HPAGE_PMD_SIZE)
split_huge_page_pmd(vma, addr, pmd);
- else if (change_huge_pmd(vma, pmd, addr, newprot,
- prot_numa)) {
- pages++;
- continue;
+ else {
+ int nr_ptes = change_huge_pmd(vma, pmd, addr,
+ newprot, prot_numa);
+
+ if (nr_ptes) {
+ if (nr_ptes == HPAGE_PMD_NR)
+ pages++;
+
+ continue;
+ }
}
/* fall through */
}
--
1.8.1.4

2013-09-10 09:32:50

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 10/50] sched: numa: Mitigate chance that same task always updates PTEs

From: Peter Zijlstra <[email protected]>

With a trace_printk("working\n"); right after the cmpxchg in
task_numa_work() we can see that of a 4 thread process, its always the
same task winning the race and doing the protection change.

This is a problem since the task doing the protection change has a
penalty for taking faults -- it is busy when marking the PTEs. If its
always the same task the ->numa_faults[] get severely skewed.

Avoid this by delaying the task doing the protection change such that
it is unlikely to win the privilege again.

Before:

root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
thread 0/0-3232 [022] .... 212.787402: task_numa_work: working
thread 0/0-3232 [022] .... 212.888473: task_numa_work: working
thread 0/0-3232 [022] .... 212.989538: task_numa_work: working
thread 0/0-3232 [022] .... 213.090602: task_numa_work: working
thread 0/0-3232 [022] .... 213.191667: task_numa_work: working
thread 0/0-3232 [022] .... 213.292734: task_numa_work: working
thread 0/0-3232 [022] .... 213.393804: task_numa_work: working
thread 0/0-3232 [022] .... 213.494869: task_numa_work: working
thread 0/0-3232 [022] .... 213.596937: task_numa_work: working
thread 0/0-3232 [022] .... 213.699000: task_numa_work: working
thread 0/0-3232 [022] .... 213.801067: task_numa_work: working
thread 0/0-3232 [022] .... 213.903155: task_numa_work: working
thread 0/0-3232 [022] .... 214.005201: task_numa_work: working
thread 0/0-3232 [022] .... 214.107266: task_numa_work: working
thread 0/0-3232 [022] .... 214.209342: task_numa_work: working

After:

root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
thread 0/0-3253 [005] .... 136.865051: task_numa_work: working
thread 0/2-3255 [026] .... 136.965134: task_numa_work: working
thread 0/3-3256 [024] .... 137.065217: task_numa_work: working
thread 0/3-3256 [024] .... 137.165302: task_numa_work: working
thread 0/3-3256 [024] .... 137.265382: task_numa_work: working
thread 0/0-3253 [004] .... 137.366465: task_numa_work: working
thread 0/2-3255 [026] .... 137.466549: task_numa_work: working
thread 0/0-3253 [004] .... 137.566629: task_numa_work: working
thread 0/0-3253 [004] .... 137.666711: task_numa_work: working
thread 0/1-3254 [028] .... 137.766799: task_numa_work: working
thread 0/0-3253 [004] .... 137.866876: task_numa_work: working
thread 0/2-3255 [026] .... 137.966960: task_numa_work: working
thread 0/1-3254 [028] .... 138.067041: task_numa_work: working
thread 0/2-3255 [026] .... 138.167123: task_numa_work: working
thread 0/3-3256 [024] .... 138.267207: task_numa_work: working

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 227b070..c93a56e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -946,6 +946,12 @@ void task_numa_work(struct callback_head *work)
return;

/*
+ * Delay this task enough that another task of this mm will likely win
+ * the next time around.
+ */
+ p->node_stamp += 2 * TICK_NSEC;
+
+ /*
* Do not set pte_numa if the current running node is rate-limited.
* This loses statistics on the fault but if we are unwilling to
* migrate to this node, it is less likely we can do useful work
@@ -1026,7 +1032,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)
if (now - curr->node_stamp > period) {
if (!curr->node_stamp)
curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
- curr->node_stamp = now;
+ curr->node_stamp += period;

if (!time_before(jiffies, curr->mm->numa_next_scan)) {
init_task_work(work, task_numa_work); /* TODO: move this into sched_fork() */
--
1.8.1.4
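
A minimal sketch of the resulting flow in task_numa_work(), paraphrased
from this patch plus the cmpxchg that already gates the scan -- an
illustration of the idea, not the exact kernel code:

        void task_numa_work_sketch(struct task_struct *p, struct mm_struct *mm)
        {
                unsigned long migrate = mm->numa_next_scan;
                unsigned long next_scan = jiffies +
                        msecs_to_jiffies(p->numa_scan_period);

                /*
                 * Delay this task so that another task of the same mm is
                 * likely to win the race the next time around.
                 */
                p->node_stamp += 2 * TICK_NSEC;

                /* Only one thread of the mm claims this scan window */
                if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
                        return;

                /* ... the winner performs the PTE scan and pays its cost ... */
        }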

2013-09-10 09:32:48

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 11/50] sched: numa: Continue PTE scanning even if migrate rate limited

From: Peter Zijlstra <[email protected]>

Avoiding marking PTEs pte_numa because a particular NUMA node is migrate
rate limited seems like a bad idea. Even if this node can't migrate any
more, other nodes might, and we want up-to-date information to make
balancing decisions. We already rate limit the actual migrations; this
should leave enough bandwidth to allow the non-migrating scanning. I think
it's important that we keep up-to-date information if we're going to do
placement based on it.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 8 --------
1 file changed, 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c93a56e..b5aa546 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -951,14 +951,6 @@ void task_numa_work(struct callback_head *work)
*/
p->node_stamp += 2 * TICK_NSEC;

- /*
- * Do not set pte_numa if the current running node is rate-limited.
- * This loses statistics on the fault but if we are unwilling to
- * migrate to this node, it is less likely we can do useful work
- */
- if (migrate_ratelimited(numa_node_id()))
- return;
-
start = mm->numa_scan_offset;
pages = sysctl_numa_balancing_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
--
1.8.1.4

2013-09-10 09:42:56

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 13/50] sched: numa: Initialise numa_next_scan properly

Scan delay logic and resets are currently initialised to start scanning
immediately instead of delaying properly. Initialise them properly at
fork time and catch when a new mm has been allocated.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/core.c | 4 ++--
kernel/sched/fair.c | 7 +++++++
2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f307c2c..9d7a33a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1634,8 +1634,8 @@ static void __sched_fork(struct task_struct *p)

#ifdef CONFIG_NUMA_BALANCING
if (p->mm && atomic_read(&p->mm->mm_users) == 1) {
- p->mm->numa_next_scan = jiffies;
- p->mm->numa_next_reset = jiffies;
+ p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+ p->mm->numa_next_reset = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
p->mm->numa_scan_seq = 0;
}

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2e4c8d0..2fb978b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -900,6 +900,13 @@ void task_numa_work(struct callback_head *work)
if (p->flags & PF_EXITING)
return;

+ if (!mm->numa_next_reset || !mm->numa_next_scan) {
+ mm->numa_next_scan = now +
+ msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
+ mm->numa_next_reset = now +
+ msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
+ }
+
/*
* Reset the scan period if enough time has gone by. Objective is that
* scanning will be reduced if pages are properly placed. As tasks
--
1.8.1.4
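
For reference, with the defaults in this series (a scan_delay of 1000ms and
a scan period reset of 60000ms), a freshly forked mm now gets
numa_next_scan = jiffies + msecs_to_jiffies(1000), so the first PTE scan of
a new process happens roughly a second after fork instead of at the first
opportunity, which is what spares short-lived processes the scanning
overhead.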

2013-09-10 09:42:54

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 14/50] sched: Set the scan rate proportional to the memory usage of the task being scanned

The NUMA PTE scan rate is controlled with a combination of the
numa_balancing_scan_period_min, numa_balancing_scan_period_max and
numa_balancing_scan_size tunables. This scan rate is independent of the
size of the task and, as an aside, it is further complicated by the fact
that numa_balancing_scan_size controls how many pages are marked pte_numa
and not how much virtual memory is scanned.

In combination, it is almost impossible to meaningfully tune the min and
max scan periods and reasoning about performance is complex when the time
to complete a full scan is partially a function of the task's memory size.
This patch alters the semantics of the min and max tunables to be about
tuning the length of time it takes to complete a scan of a task's occupied
virtual address space. Conceptually this is a lot easier to understand.
There is a "sanity" check to ensure the scan rate is never extremely fast
based on the amount of virtual memory that should be scanned in a second.
The default of 2.5GB/sec seems arbitrary but it was chosen so that the
maximum scan rate after the patch roughly matches the maximum scan rate
before the patch was applied.

Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/sysctl/kernel.txt | 11 +++---
include/linux/sched.h | 1 +
kernel/sched/fair.c | 84 ++++++++++++++++++++++++++++++++++++-----
3 files changed, 81 insertions(+), 15 deletions(-)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ccadb52..ad8d4f5 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -402,15 +402,16 @@ workload pattern changes and minimises performance impact due to remote
memory accesses. These sysctls control the thresholds for scan delays and
the number of pages scanned.

-numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
-between scans. It effectively controls the maximum scanning rate for
-each task.
+numa_balancing_scan_period_min_ms is the minimum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the maximum scanning
+rate for each task.

numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
when it initially forks.

-numa_balancing_scan_period_max_ms is the maximum delay between scans. It
-effectively controls the minimum scanning rate for each task.
+numa_balancing_scan_period_max_ms is the maximum time in milliseconds to
+scan a tasks virtual memory. It effectively controls the minimum scanning
+rate for each task.

numa_balancing_scan_size_mb is how many megabytes worth of pages are
scanned for a given scan.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 078066d..49b426e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1331,6 +1331,7 @@ struct task_struct {
int numa_scan_seq;
int numa_migrate_seq;
unsigned int numa_scan_period;
+ unsigned int numa_scan_period_max;
u64 node_stamp; /* migration stamp */
struct callback_head numa_work;
#endif /* CONFIG_NUMA_BALANCING */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2fb978b..23fd1f3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -818,10 +818,12 @@ update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)

#ifdef CONFIG_NUMA_BALANCING
/*
- * numa task sample period in ms
+ * Approximate time to scan a full NUMA task in ms. The task scan period is
+ * calculated based on the tasks virtual memory size and
+ * numa_balancing_scan_size.
*/
-unsigned int sysctl_numa_balancing_scan_period_min = 100;
-unsigned int sysctl_numa_balancing_scan_period_max = 100*50;
+unsigned int sysctl_numa_balancing_scan_period_min = 1000;
+unsigned int sysctl_numa_balancing_scan_period_max = 600000;
unsigned int sysctl_numa_balancing_scan_period_reset = 100*600;

/* Portion of address space to scan in MB */
@@ -830,6 +832,51 @@ unsigned int sysctl_numa_balancing_scan_size = 256;
/* Scan @scan_size MB every @scan_period after an initial @scan_delay in ms */
unsigned int sysctl_numa_balancing_scan_delay = 1000;

+static unsigned int task_nr_scan_windows(struct task_struct *p)
+{
+ unsigned long rss = 0;
+ unsigned long nr_scan_pages;
+
+ /*
+ * Calculations based on RSS as non-present and empty pages are skipped
+ * by the PTE scanner and NUMA hinting faults should be trapped based
+ * on resident pages
+ */
+ nr_scan_pages = sysctl_numa_balancing_scan_size << (20 - PAGE_SHIFT);
+ rss = get_mm_rss(p->mm);
+ if (!rss)
+ rss = nr_scan_pages;
+
+ rss = round_up(rss, nr_scan_pages);
+ return rss / nr_scan_pages;
+}
+
+/* For sanity's sake, never scan more PTEs than MAX_SCAN_WINDOW MB/sec. */
+#define MAX_SCAN_WINDOW 2560
+
+static unsigned int task_scan_min(struct task_struct *p)
+{
+ unsigned int scan, floor;
+ unsigned int windows = 1;
+
+ if (sysctl_numa_balancing_scan_size < MAX_SCAN_WINDOW)
+ windows = MAX_SCAN_WINDOW / sysctl_numa_balancing_scan_size;
+ floor = 1000 / windows;
+
+ scan = sysctl_numa_balancing_scan_period_min / task_nr_scan_windows(p);
+ return max_t(unsigned int, floor, scan);
+}
+
+static unsigned int task_scan_max(struct task_struct *p)
+{
+ unsigned int smin = task_scan_min(p);
+ unsigned int smax;
+
+ /* Watch for min being lower than max due to floor calculations */
+ smax = sysctl_numa_balancing_scan_period_max / task_nr_scan_windows(p);
+ return max(smin, smax);
+}
+
static void task_numa_placement(struct task_struct *p)
{
int seq;
@@ -840,6 +887,7 @@ static void task_numa_placement(struct task_struct *p)
if (p->numa_scan_seq == seq)
return;
p->numa_scan_seq = seq;
+ p->numa_scan_period_max = task_scan_max(p);

/* FIXME: Scheduling placement policy hints go here */
}
@@ -860,9 +908,14 @@ void task_numa_fault(int node, int pages, bool migrated)
* If pages are properly placed (did not migrate) then scan slower.
* This is reset periodically in case of phase changes
*/
- if (!migrated)
- p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
+ if (!migrated) {
+ /* Initialise if necessary */
+ if (!p->numa_scan_period_max)
+ p->numa_scan_period_max = task_scan_max(p);
+
+ p->numa_scan_period = min(p->numa_scan_period_max,
p->numa_scan_period + jiffies_to_msecs(10));
+ }

task_numa_placement(p);
}
@@ -884,6 +937,7 @@ void task_numa_work(struct callback_head *work)
struct mm_struct *mm = p->mm;
struct vm_area_struct *vma;
unsigned long start, end;
+ unsigned long nr_pte_updates = 0;
long pages;

WARN_ON_ONCE(p != container_of(work, struct task_struct, numa_work));
@@ -915,7 +969,7 @@ void task_numa_work(struct callback_head *work)
*/
migrate = mm->numa_next_reset;
if (time_after(now, migrate)) {
- p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+ p->numa_scan_period = task_scan_min(p);
next_scan = now + msecs_to_jiffies(sysctl_numa_balancing_scan_period_reset);
xchg(&mm->numa_next_reset, next_scan);
}
@@ -927,8 +981,10 @@ void task_numa_work(struct callback_head *work)
if (time_before(now, migrate))
return;

- if (p->numa_scan_period == 0)
- p->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+ if (p->numa_scan_period == 0) {
+ p->numa_scan_period_max = task_scan_max(p);
+ p->numa_scan_period = task_scan_min(p);
+ }

next_scan = now + msecs_to_jiffies(p->numa_scan_period);
if (cmpxchg(&mm->numa_next_scan, migrate, next_scan) != migrate)
@@ -965,7 +1021,15 @@ void task_numa_work(struct callback_head *work)
start = max(start, vma->vm_start);
end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
end = min(end, vma->vm_end);
- pages -= change_prot_numa(vma, start, end);
+ nr_pte_updates += change_prot_numa(vma, start, end);
+
+ /*
+ * Scan sysctl_numa_balancing_scan_size but ensure that
+ * at least one PTE is updated so that unused virtual
+ * address space is quickly skipped.
+ */
+ if (nr_pte_updates)
+ pages -= (end - start) >> PAGE_SHIFT;

start = end;
if (pages <= 0)
@@ -1012,7 +1076,7 @@ void task_tick_numa(struct rq *rq, struct task_struct *curr)

if (now - curr->node_stamp > period) {
if (!curr->node_stamp)
- curr->numa_scan_period = sysctl_numa_balancing_scan_period_min;
+ curr->numa_scan_period = task_scan_min(curr);
curr->node_stamp += period;

if (!time_before(jiffies, curr->mm->numa_next_scan)) {
--
1.8.1.4
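
As a worked example of the new semantics, take a task with 1GB resident
(the figure is only for illustration) and the defaults introduced here:
with scan_size=256MB, task_nr_scan_windows() = 1024/256 = 4 windows. The
floor in task_scan_min() is 1000 / (2560/256) = 100ms, while
scan_period_min=1000ms / 4 windows = 250ms, so the task scans one 256MB
window at most every 250ms -- a full pass over its occupied address space
in about a second at the fastest, and never faster than the 2.5GB/sec
MAX_SCAN_WINDOW cap. At the other end, scan_period_max=600000ms / 4
windows = 150s per window, i.e. a full pass takes roughly ten minutes at
the slowest.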

2013-09-10 09:43:35

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 12/50] Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node"

PTE scanning and NUMA hinting fault handling is expensive, so commit
5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
on a new node") deferred the PTE scan until a task had been scheduled on
another node. The problem is that in the purely shared memory case this
may never happen and no NUMA hinting fault information will be captured.
We are not ruling out the possibility that something better can be done
here but, for now, this patch needs to be reverted and we will depend
entirely on the scan_delay to avoid punishing short-lived processes.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mm_types.h | 10 ----------
kernel/fork.c | 3 ---
kernel/sched/fair.c | 18 ------------------
kernel/sched/features.h | 4 +---
4 files changed, 1 insertion(+), 34 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index faf4b7c..4f12073 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -427,20 +427,10 @@ struct mm_struct {

/* numa_scan_seq prevents two threads setting pte_numa */
int numa_scan_seq;
-
- /*
- * The first node a task was scheduled on. If a task runs on
- * a different node than Make PTE Scan Go Now.
- */
- int first_nid;
#endif
struct uprobes_state uprobes_state;
};

-/* first nid will either be a valid NID or one of these values */
-#define NUMA_PTE_SCAN_INIT -1
-#define NUMA_PTE_SCAN_ACTIVE -2
-
static inline void mm_init_cpumask(struct mm_struct *mm)
{
#ifdef CONFIG_CPUMASK_OFFSTACK
diff --git a/kernel/fork.c b/kernel/fork.c
index e23bb19..f693bdf 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -820,9 +820,6 @@ struct mm_struct *dup_mm(struct task_struct *tsk)
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
mm->pmd_huge_pte = NULL;
#endif
-#ifdef CONFIG_NUMA_BALANCING
- mm->first_nid = NUMA_PTE_SCAN_INIT;
-#endif
if (!mm_init(mm, tsk))
goto fail_nomem;

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b5aa546..2e4c8d0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -901,24 +901,6 @@ void task_numa_work(struct callback_head *work)
return;

/*
- * We do not care about task placement until a task runs on a node
- * other than the first one used by the address space. This is
- * largely because migrations are driven by what CPU the task
- * is running on. If it's never scheduled on another node, it'll
- * not migrate so why bother trapping the fault.
- */
- if (mm->first_nid == NUMA_PTE_SCAN_INIT)
- mm->first_nid = numa_node_id();
- if (mm->first_nid != NUMA_PTE_SCAN_ACTIVE) {
- /* Are we running on a new node yet? */
- if (numa_node_id() == mm->first_nid &&
- !sched_feat_numa(NUMA_FORCE))
- return;
-
- mm->first_nid = NUMA_PTE_SCAN_ACTIVE;
- }
-
- /*
* Reset the scan period if enough time has gone by. Objective is that
* scanning will be reduced if pages are properly placed. As tasks
* can enter different phases this needs to be re-examined. Lacking
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 99399f8..cba5c61 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -63,10 +63,8 @@ SCHED_FEAT(LB_MIN, false)
/*
* Apply the automatic NUMA scheduling policy. Enabled automatically
* at runtime if running on a NUMA machine. Can be controlled via
- * numa_balancing=. Allow PTE scanning to be forced on UMA machines
- * for debugging the core machinery.
+ * numa_balancing=
*/
#ifdef CONFIG_NUMA_BALANCING
SCHED_FEAT(NUMA, false)
-SCHED_FEAT(NUMA_FORCE, false)
#endif
--
1.8.1.4

2013-09-10 09:43:51

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 09/50] mm: numa: Do not migrate or account for hinting faults on the zero page

The zero page is not replicated between nodes and is often shared between
processes. The data is read-only and likely to be cached in local CPUs
if heavily accessed, meaning that the remote memory access cost is less
of a concern. This patch prevents trapping faults on the zero page. For
tasks using the zero page this will reduce the number of PTE updates,
TLB flushes and hinting faults.

[[email protected]: Correct use of is_huge_zero_page]
Signed-off-by: Mel Gorman <[email protected]>
---
mm/huge_memory.c | 7 ++++++-
mm/memory.c | 1 +
mm/mprotect.c | 10 +++++++++-
3 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 94d0739..40f75a6 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1303,6 +1303,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_unlock;

page = pmd_page(pmd);
+ BUG_ON(is_huge_zero_page(page));
page_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (page_nid == this_nid)
@@ -1488,8 +1489,12 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
} else {
struct page *page = pmd_page(*pmd);

- /* only check non-shared pages */
+ /*
+ * Only check non-shared pages. See change_pte_range
+ * for comment on why the zero page is not modified
+ */
if (page_mapcount(page) == 1 &&
+ !is_huge_zero_page(page) &&
!pmd_numa(*pmd)) {
entry = pmd_mknuma(entry);
}
diff --git a/mm/memory.c b/mm/memory.c
index c20f872..86c3caf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3575,6 +3575,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
pte_unmap_unlock(ptep, ptl);
return 0;
}
+ BUG_ON(is_zero_pfn(page_to_pfn(page)));

page_nid = page_to_nid(page);
target_nid = numa_migrate_prep(page, vma, addr, page_nid);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 2bbb648..faa499e 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -62,7 +62,15 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
struct page *page;

page = vm_normal_page(vma, addr, oldpte);
- if (page) {
+
+ /*
+ * Do not trap faults against the zero page.
+ * The read-only data is likely to be
+ * read-cached on the local CPU cache and it
+ * is less useful to know about local vs remote
+ * hits on the zero page
+ */
+ if (page && !is_zero_pfn(page_to_pfn(page))) {
int this_nid = page_to_nid(page);
if (last_nid == -1)
last_nid = this_nid;
--
1.8.1.4

2013-09-10 09:44:09

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 08/50] mm: numa: Sanitize task_numa_fault() callsites

There are three callers of task_numa_fault():

- do_huge_pmd_numa_page():
Accounts against the current node, not the node where the
page resides, unless we migrated, in which case it accounts
against the node we migrated to.

- do_numa_page():
Accounts against the current node, not the node where the
page resides, unless we migrated, in which case it accounts
against the node we migrated to.

- do_pmd_numa_page():
Does not account at all when the page isn't migrated; otherwise
it accounts against the node we migrated towards.

This seems wrong to me; all three sites should have the same
semantics. Furthermore, we should account against where the page
really is; we already know where the task is.

So modify all three sites to always account; we did, after all, receive
the fault. Always account to where the page is after migration,
regardless of success.

They all still differ on when they clear the PTE/PMD; ideally that
would get sorted too.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/huge_memory.c | 26 ++++++++++++++------------
mm/memory.c | 53 +++++++++++++++++++++--------------------------------
2 files changed, 35 insertions(+), 44 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 981d8a2..94d0739 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1293,18 +1293,19 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
+ int page_nid = -1, this_nid = numa_node_id();
int target_nid;
- int current_nid = -1;
- bool migrated, page_locked;
+ bool page_locked;
+ bool migrated = false;

spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;

page = pmd_page(pmd);
- current_nid = page_to_nid(page);
+ page_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (page_nid == this_nid)
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);

/*
@@ -1347,19 +1348,18 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
- if (!migrated)
+ if (migrated)
+ page_nid = target_nid;
+ else
goto check_same;

- task_numa_fault(target_nid, HPAGE_PMD_NR, true);
- if (anon_vma)
- page_unlock_anon_vma_read(anon_vma);
- return 0;
+ goto out;

check_same:
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp))) {
/* Someone else took our fault */
- current_nid = -1;
+ page_nid = -1;
goto out_unlock;
}
clear_pmdnuma:
@@ -1370,11 +1370,13 @@ clear_pmdnuma:
out_unlock:
spin_unlock(&mm->page_table_lock);

+out:
if (anon_vma)
page_unlock_anon_vma_read(anon_vma);

- if (current_nid != -1)
- task_numa_fault(current_nid, HPAGE_PMD_NR, false);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, HPAGE_PMD_NR, migrated);
+
return 0;
}

diff --git a/mm/memory.c b/mm/memory.c
index af84bc0..c20f872 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3530,12 +3530,12 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}

int numa_migrate_prep(struct page *page, struct vm_area_struct *vma,
- unsigned long addr, int current_nid)
+ unsigned long addr, int page_nid)
{
get_page(page);

count_vm_numa_event(NUMA_HINT_FAULTS);
- if (current_nid == numa_node_id())
+ if (page_nid == numa_node_id())
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);

return mpol_misplaced(page, vma, addr);
@@ -3546,7 +3546,7 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *page = NULL;
spinlock_t *ptl;
- int current_nid = -1;
+ int page_nid = -1;
int target_nid;
bool migrated = false;

@@ -3576,15 +3576,10 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
return 0;
}

- current_nid = page_to_nid(page);
- target_nid = numa_migrate_prep(page, vma, addr, current_nid);
+ page_nid = page_to_nid(page);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid);
pte_unmap_unlock(ptep, ptl);
if (target_nid == -1) {
- /*
- * Account for the fault against the current node if it not
- * being replaced regardless of where the page is located.
- */
- current_nid = numa_node_id();
put_page(page);
goto out;
}
@@ -3592,11 +3587,11 @@ int do_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
/* Migrate to the requested node */
migrated = migrate_misplaced_page(page, target_nid);
if (migrated)
- current_nid = target_nid;
+ page_nid = target_nid;

out:
- if (current_nid != -1)
- task_numa_fault(current_nid, 1, migrated);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, 1, migrated);
return 0;
}

@@ -3611,7 +3606,6 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long offset;
spinlock_t *ptl;
bool numa = false;
- int local_nid = numa_node_id();

spin_lock(&mm->page_table_lock);
pmd = *pmdp;
@@ -3634,9 +3628,10 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
for (addr = _addr + offset; addr < _addr + PMD_SIZE; pte++, addr += PAGE_SIZE) {
pte_t pteval = *pte;
struct page *page;
- int curr_nid = local_nid;
+ int page_nid = -1;
int target_nid;
- bool migrated;
+ bool migrated = false;
+
if (!pte_present(pteval))
continue;
if (!pte_numa(pteval))
@@ -3658,25 +3653,19 @@ static int do_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (unlikely(page_mapcount(page) != 1))
continue;

- /*
- * Note that the NUMA fault is later accounted to either
- * the node that is currently running or where the page is
- * migrated to.
- */
- curr_nid = local_nid;
- target_nid = numa_migrate_prep(page, vma, addr,
- page_to_nid(page));
- if (target_nid == -1) {
+ page_nid = page_to_nid(page);
+ target_nid = numa_migrate_prep(page, vma, addr, page_nid);
+ pte_unmap_unlock(pte, ptl);
+ if (target_nid != -1) {
+ migrated = migrate_misplaced_page(page, target_nid);
+ if (migrated)
+ page_nid = target_nid;
+ } else {
put_page(page);
- continue;
}

- /* Migrate to the requested node */
- pte_unmap_unlock(pte, ptl);
- migrated = migrate_misplaced_page(page, target_nid);
- if (migrated)
- curr_nid = target_nid;
- task_numa_fault(curr_nid, 1, migrated);
+ if (page_nid != -1)
+ task_numa_fault(page_nid, 1, migrated);

pte = pte_offset_map_lock(mm, pmdp, addr, &ptl);
}
--
1.8.1.4
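
Condensed from the hunks above, the accounting pattern that all three
callsites now share looks roughly like this (a sketch, not new code; the
huge page case uses migrate_misplaced_transhuge_page() and HPAGE_PMD_NR
instead of 1, and migrated starts out false):

        page_nid = page_to_nid(page);           /* where the page currently is */
        target_nid = numa_migrate_prep(page, vma, addr, page_nid);
        if (target_nid != -1) {
                migrated = migrate_misplaced_page(page, target_nid);
                if (migrated)
                        page_nid = target_nid;  /* account where it ended up */
        } else
                put_page(page);

        if (page_nid != -1)
                task_numa_fault(page_nid, 1, migrated);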

2013-09-10 09:44:49

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 05/50] mm: Wait for THP migrations to complete during NUMA hinting faults

The locking for migrating THP is unusual. While normal page migration
prevents parallel accesses using a migration PTE, THP migration relies on
a combination of the page_table_lock, the page lock and the existence of
the NUMA hinting PTE to guarantee safety, but there is a bug in the scheme.

If a THP page is currently being migrated and another thread traps a
fault on the same page, it checks if the page is misplaced. If it is not,
then pmd_numa is cleared. The problem is that it checks if the page is
misplaced without holding the page lock, meaning that the racing thread
can be migrating the THP when the second thread clears the NUMA bit
and faults a stale page.

This patch checks if the page is potentially being migrated and, if so,
stalls using lock_page before checking whether the page is misplaced.

Not-signed-off-by: Peter Zijlstra
---
mm/huge_memory.c | 23 ++++++++++++++++-------
1 file changed, 16 insertions(+), 7 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5c37cd2..d0a3fce 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1307,13 +1307,14 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
if (current_nid == numa_node_id())
count_vm_numa_event(NUMA_HINT_FAULTS_LOCAL);

- target_nid = mpol_misplaced(page, vma, haddr);
- if (target_nid == -1) {
- put_page(page);
- goto clear_pmdnuma;
- }
+ /*
+ * Acquire the page lock to serialise THP migrations but avoid dropping
+ * page_table_lock if at all possible
+ */
+ if (trylock_page(page))
+ goto got_lock;

- /* Acquire the page lock to serialise THP migrations */
+ /* Serialise against migration and check placement */
spin_unlock(&mm->page_table_lock);
lock_page(page);

@@ -1324,9 +1325,17 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
put_page(page);
goto out_unlock;
}
- spin_unlock(&mm->page_table_lock);
+
+got_lock:
+ target_nid = mpol_misplaced(page, vma, haddr);
+ if (target_nid == -1) {
+ unlock_page(page);
+ put_page(page);
+ goto clear_pmdnuma;
+ }

/* Migrate the THP to the requested node */
+ spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
pmdp, pmd, addr, page, target_nid);
if (!migrated)
--
1.8.1.4

2013-09-10 09:44:48

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 06/50] mm: Prevent parallel splits during THP migration

THP migrations are serialised by the page lock but on its own that does
not prevent THP splits. If the page is split during THP migration then
the pmd_same checks will prevent page table corruption, but the page
unlock and other fix-ups can still cause corruption. This patch takes the
anon_vma lock to prevent parallel splits during migration.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/huge_memory.c | 43 +++++++++++++++++++++++++++++--------------
1 file changed, 29 insertions(+), 14 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d0a3fce..981d8a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1290,18 +1290,18 @@ out:
int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
unsigned long addr, pmd_t pmd, pmd_t *pmdp)
{
+ struct anon_vma *anon_vma = NULL;
struct page *page;
unsigned long haddr = addr & HPAGE_PMD_MASK;
int target_nid;
int current_nid = -1;
- bool migrated;
+ bool migrated, page_locked;

spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp)))
goto out_unlock;

page = pmd_page(pmd);
- get_page(page);
current_nid = page_to_nid(page);
count_vm_numa_event(NUMA_HINT_FAULTS);
if (current_nid == numa_node_id())
@@ -1311,12 +1311,29 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
* Acquire the page lock to serialise THP migrations but avoid dropping
* page_table_lock if at all possible
*/
- if (trylock_page(page))
- goto got_lock;
+ page_locked = trylock_page(page);
+ target_nid = mpol_misplaced(page, vma, haddr);
+ if (target_nid == -1) {
+ /* If the page was locked, there are no parallel migrations */
+ if (page_locked) {
+ unlock_page(page);
+ goto clear_pmdnuma;
+ }

- /* Serialise against migration and check placement */
+ /* Otherwise wait for potential migrations to complete */
+ spin_unlock(&mm->page_table_lock);
+ wait_on_page_locked(page);
+ goto check_same;
+ }
+
+ /* Page is misplaced, serialise migrations and parallel THP splits */
+ get_page(page);
spin_unlock(&mm->page_table_lock);
- lock_page(page);
+ if (!page_locked) {
+ lock_page(page);
+ page_locked = true;
+ }
+ anon_vma = page_lock_anon_vma_read(page);

/* Confirm the PMD did not change while page_table_lock was released */
spin_lock(&mm->page_table_lock);
@@ -1326,14 +1343,6 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
goto out_unlock;
}

-got_lock:
- target_nid = mpol_misplaced(page, vma, haddr);
- if (target_nid == -1) {
- unlock_page(page);
- put_page(page);
- goto clear_pmdnuma;
- }
-
/* Migrate the THP to the requested node */
spin_unlock(&mm->page_table_lock);
migrated = migrate_misplaced_transhuge_page(mm, vma,
@@ -1342,6 +1351,8 @@ got_lock:
goto check_same;

task_numa_fault(target_nid, HPAGE_PMD_NR, true);
+ if (anon_vma)
+ page_unlock_anon_vma_read(anon_vma);
return 0;

check_same:
@@ -1358,6 +1369,10 @@ clear_pmdnuma:
update_mmu_cache_pmd(vma, addr, pmdp);
out_unlock:
spin_unlock(&mm->page_table_lock);
+
+ if (anon_vma)
+ page_unlock_anon_vma_read(anon_vma);
+
if (current_nid != -1)
task_numa_fault(current_nid, HPAGE_PMD_NR, false);
return 0;
--
1.8.1.4
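
The locking order that results from this patch and the previous one,
condensed from the hunks above as an editorial summary rather than kernel
code:

        /*
         * do_huge_pmd_numa_page(), after this patch:
         *   page_table_lock held : pmd_same() check, pick up the page
         *   trylock_page()       : opportunistic, avoids dropping the PTL
         *   mpol_misplaced()
         *     well placed -> clear pmd_numa under the PTL (if the trylock
         *       failed, drop the PTL and wait_on_page_locked() first in
         *       case a migration is in flight, then re-check pmd_same())
         *     misplaced   -> get_page(); drop the PTL; lock_page() if the
         *       trylock failed; page_lock_anon_vma_read() to block parallel
         *       splits; retake the PTL; re-check pmd_same(); then migrate
         */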

2013-09-10 09:44:47

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update

A THP PMD update is accounted for as 512 pages updated in vmstat. This is
a large difference when estimating the cost of automatic NUMA balancing
and can be misleading when comparing results that had collapsed versus
split THP. This patch addresses the accounting issue.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/mprotect.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 94722a4..2bbb648 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
split_huge_page_pmd(vma, addr, pmd);
else if (change_huge_pmd(vma, pmd, addr, newprot,
prot_numa)) {
- pages += HPAGE_PMD_NR;
+ pages++;
continue;
}
/* fall through */
--
1.8.1.4

2013-09-10 09:45:49

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 02/50] mm: numa: Document automatic NUMA balancing sysctls

Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/sysctl/kernel.txt | 66 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 66 insertions(+)

diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index ab7d16e..ccadb52 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -354,6 +354,72 @@ utilize.

==============================================================

+numa_balancing
+
+Enables/disables automatic page fault based NUMA memory
+balancing. Memory is moved automatically to nodes
+that access it often.
+
+Enables/disables automatic NUMA memory balancing. On NUMA machines, there
+is a performance penalty if remote memory is accessed by a CPU. When this
+feature is enabled the kernel samples what task thread is accessing memory
+by periodically unmapping pages and later trapping a page fault. At the
+time of the page fault, it is determined if the data being accessed should
+be migrated to a local memory node.
+
+The unmapping of pages and trapping faults incur additional overhead that
+ideally is offset by improved memory locality but there is no universal
+guarantee. If the target workload is already bound to NUMA nodes then this
+feature should be disabled. Otherwise, if the system overhead from the
+feature is too high then the rate the kernel samples for NUMA hinting
+faults may be controlled by the numa_balancing_scan_period_min_ms,
+numa_balancing_scan_delay_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_period_max_ms and numa_balancing_scan_size_mb sysctls.
+
+==============================================================
+
+numa_balancing_scan_period_min_ms, numa_balancing_scan_delay_ms,
+numa_balancing_scan_period_max_ms, numa_balancing_scan_period_reset,
+numa_balancing_scan_size_mb
+
+Automatic NUMA balancing scans tasks address space and unmaps pages to
+detect if pages are properly placed or if the data should be migrated to a
+memory node local to where the task is running. Every "scan delay" the task
+scans the next "scan size" number of pages in its address space. When the
+end of the address space is reached the scanner restarts from the beginning.
+
+In combination, the "scan delay" and "scan size" determine the scan rate.
+When "scan delay" decreases, the scan rate increases. The scan delay and
+hence the scan rate of every task is adaptive and depends on historical
+behaviour. If pages are properly placed then the scan delay increases,
+otherwise the scan delay decreases. The "scan size" is not adaptive but
+the higher the "scan size", the higher the scan rate.
+
+Higher scan rates incur higher system overhead as page faults must be
+trapped and potentially data must be migrated. However, the higher the scan
+rate, the more quickly a tasks memory is migrated to a local node if the
+workload pattern changes and minimises performance impact due to remote
+memory accesses. These sysctls control the thresholds for scan delays and
+the number of pages scanned.
+
+numa_balancing_scan_period_min_ms is the minimum delay in milliseconds
+between scans. It effectively controls the maximum scanning rate for
+each task.
+
+numa_balancing_scan_delay_ms is the starting "scan delay" used for a task
+when it initially forks.
+
+numa_balancing_scan_period_max_ms is the maximum delay between scans. It
+effectively controls the minimum scanning rate for each task.
+
+numa_balancing_scan_size_mb is how many megabytes worth of pages are
+scanned for a given scan.
+
+numa_balancing_scan_period_reset is a blunt instrument that controls how
+often a tasks scan delay is reset to detect sudden changes in task behaviour.
+
+==============================================================
+
osrelease, ostype & version:

# cat osrelease
--
1.8.1.4
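
For reference, these tunables appear under /proc/sys/kernel/, e.g.
/proc/sys/kernel/numa_balancing and
/proc/sys/kernel/numa_balancing_scan_period_min_ms, so a workload that is
already bound to NUMA nodes can disable the feature at runtime with
echo 0 > /proc/sys/kernel/numa_balancing rather than rebooting.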

2013-09-10 09:45:50

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 01/50] sched: monolithic code dump of what is being pushed upstream

From: Peter Zijlstra <[email protected]>

So I have the below patch in front of all your patches. It contains the
10 or so sched,fair patches I posted to lkml the other day.

I used these to poke at the group_imb crud, and am now digging through
traces of perf bench numa to see if there's anything else I need.

Like I said on IRC: I boot with ftrace=nop to ensure we allocate properly
sized trace buffers. This can also be done at runtime by switching
active tracer -- this allocates the default buffer size, or by
explicitly setting a per-cpu buffer size in
/debug/tracing/buffer_size_kb. By default the thing allocates a single
page per cpu or something uselessly small like that.

I then run a benchmark and at an appropriate time (eg. when I see
something 'weird' happen) I do something like:

echo 0 > /debug/tracing/tracing_on # disable writing into the buffers
cat /debug/tracing/trace > ~/trace # dump to file
echo 0 > /debug/tracing/trace # reset buffers
echo 1 > /debug/tracing/tracing_on # enable writing to the buffers

[ Note I mount debugfs at /debug, this is not the default location but I
think the rest of the world is wrong ;-) ]

Also, the brain seems to adapt once you're staring at them for longer
than a day -- yay for human pattern recognition skillz.

Ingo tends to favour more verbose dumps, I tend to favour minimal
dumps.. whatever works for you is something you'll learn with
experience.
---
arch/x86/mm/numa.c | 6 +-
kernel/sched/core.c | 18 +-
kernel/sched/fair.c | 498 ++++++++++++++++++++++++++++-----------------------
kernel/sched/sched.h | 1 +
lib/vsprintf.c | 5 +
5 files changed, 288 insertions(+), 240 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..4ed4612 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -737,7 +737,6 @@ int early_cpu_to_node(int cpu)
void debug_cpumask_set_cpu(int cpu, int node, bool enable)
{
struct cpumask *mask;
- char buf[64];

if (node == NUMA_NO_NODE) {
/* early_cpu_to_node() already emits a warning and trace */
@@ -755,10 +754,9 @@ void debug_cpumask_set_cpu(int cpu, int node, bool enable)
else
cpumask_clear_cpu(cpu, mask);

- cpulist_scnprintf(buf, sizeof(buf), mask);
- printk(KERN_DEBUG "%s cpu %d node %d: mask now %s\n",
+ printk(KERN_DEBUG "%s cpu %d node %d: mask now %pc\n",
enable ? "numa_add_cpu" : "numa_remove_cpu",
- cpu, node, buf);
+ cpu, node, mask);
return;
}

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 05c39f0..f307c2c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4809,9 +4809,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
struct cpumask *groupmask)
{
struct sched_group *group = sd->groups;
- char str[256];

- cpulist_scnprintf(str, sizeof(str), sched_domain_span(sd));
cpumask_clear(groupmask);

printk(KERN_DEBUG "%*s domain %d: ", level, "", level);
@@ -4824,7 +4822,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
return -1;
}

- printk(KERN_CONT "span %s level %s\n", str, sd->name);
+ printk(KERN_CONT "span %pc level %s\n", sched_domain_span(sd), sd->name);

if (!cpumask_test_cpu(cpu, sched_domain_span(sd))) {
printk(KERN_ERR "ERROR: domain->span does not contain "
@@ -4870,9 +4868,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,

cpumask_or(groupmask, groupmask, sched_group_cpus(group));

- cpulist_scnprintf(str, sizeof(str), sched_group_cpus(group));
-
- printk(KERN_CONT " %s", str);
+ printk(KERN_CONT " %pc", sched_group_cpus(group));
if (group->sgp->power != SCHED_POWER_SCALE) {
printk(KERN_CONT " (cpu_power = %d)",
group->sgp->power);
@@ -4964,7 +4960,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
SD_BALANCE_FORK |
SD_BALANCE_EXEC |
SD_SHARE_CPUPOWER |
- SD_SHARE_PKG_RESOURCES);
+ SD_SHARE_PKG_RESOURCES |
+ SD_PREFER_SIBLING);
if (nr_node_ids == 1)
pflags &= ~SD_SERIALIZE;
}
@@ -5168,6 +5165,13 @@ cpu_attach_domain(struct sched_domain *sd, struct root_domain *rd, int cpu)
tmp->parent = parent->parent;
if (parent->parent)
parent->parent->child = tmp;
+ /*
+ * Transfer SD_PREFER_SIBLING down in case of a
+ * degenerate parent; the spans match for this
+ * so the property transfers.
+ */
+ if (parent->flags & SD_PREFER_SIBLING)
+ tmp->flags |= SD_PREFER_SIBLING;
destroy_sched_domain(parent, cpu);
} else
tmp = tmp->parent;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68f1609..0c085ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3859,7 +3859,8 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;

#define LBF_ALL_PINNED 0x01
#define LBF_NEED_BREAK 0x02
-#define LBF_SOME_PINNED 0x04
+#define LBF_DST_PINNED 0x04
+#define LBF_SOME_PINNED 0x08

struct lb_env {
struct sched_domain *sd;
@@ -3950,6 +3951,8 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)

schedstat_inc(p, se.statistics.nr_failed_migrations_affine);

+ env->flags |= LBF_SOME_PINNED;
+
/*
* Remember if this task can be migrated to any other cpu in
* our sched_group. We may want to revisit it if we couldn't
@@ -3958,13 +3961,13 @@ int can_migrate_task(struct task_struct *p, struct lb_env *env)
* Also avoid computing new_dst_cpu if we have already computed
* one in current iteration.
*/
- if (!env->dst_grpmask || (env->flags & LBF_SOME_PINNED))
+ if (!env->dst_grpmask || (env->flags & LBF_DST_PINNED))
return 0;

/* Prevent to re-select dst_cpu via env's cpus */
for_each_cpu_and(cpu, env->dst_grpmask, env->cpus) {
if (cpumask_test_cpu(cpu, tsk_cpus_allowed(p))) {
- env->flags |= LBF_SOME_PINNED;
+ env->flags |= LBF_DST_PINNED;
env->new_dst_cpu = cpu;
break;
}
@@ -4019,6 +4022,7 @@ static int move_one_task(struct lb_env *env)
continue;

move_task(p, env);
+
/*
* Right now, this is only the second place move_task()
* is called, so we can safely collect move_task()
@@ -4233,50 +4237,65 @@ static unsigned long task_h_load(struct task_struct *p)

/********** Helpers for find_busiest_group ************************/
/*
- * sd_lb_stats - Structure to store the statistics of a sched_domain
- * during load balancing.
- */
-struct sd_lb_stats {
- struct sched_group *busiest; /* Busiest group in this sd */
- struct sched_group *this; /* Local group in this sd */
- unsigned long total_load; /* Total load of all groups in sd */
- unsigned long total_pwr; /* Total power of all groups in sd */
- unsigned long avg_load; /* Average load across all groups in sd */
-
- /** Statistics of this group */
- unsigned long this_load;
- unsigned long this_load_per_task;
- unsigned long this_nr_running;
- unsigned long this_has_capacity;
- unsigned int this_idle_cpus;
-
- /* Statistics of the busiest group */
- unsigned int busiest_idle_cpus;
- unsigned long max_load;
- unsigned long busiest_load_per_task;
- unsigned long busiest_nr_running;
- unsigned long busiest_group_capacity;
- unsigned long busiest_has_capacity;
- unsigned int busiest_group_weight;
-
- int group_imb; /* Is there imbalance in this sd */
-};
-
-/*
* sg_lb_stats - stats of a sched_group required for load_balancing
*/
struct sg_lb_stats {
unsigned long avg_load; /*Avg load across the CPUs of the group */
unsigned long group_load; /* Total load over the CPUs of the group */
- unsigned long sum_nr_running; /* Nr tasks running in the group */
unsigned long sum_weighted_load; /* Weighted load of group's tasks */
- unsigned long group_capacity;
- unsigned long idle_cpus;
- unsigned long group_weight;
+ unsigned long load_per_task;
+ unsigned long group_power;
+ unsigned int sum_nr_running; /* Nr tasks running in the group */
+ unsigned int group_capacity;
+ unsigned int idle_cpus;
+ unsigned int group_weight;
int group_imb; /* Is there an imbalance in the group ? */
int group_has_capacity; /* Is there extra capacity in the group? */
};

+/*
+ * sd_lb_stats - Structure to store the statistics of a sched_domain
+ * during load balancing.
+ */
+struct sd_lb_stats {
+ struct sched_group *busiest; /* Busiest group in this sd */
+ struct sched_group *this; /* Local group in this sd */
+ unsigned long total_load; /* Total load of all groups in sd */
+ unsigned long total_pwr; /* Total power of all groups in sd */
+ unsigned long avg_load; /* Average load across all groups in sd */
+
+ struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
+ struct sg_lb_stats this_stat; /* Statistics of this group */
+};
+
+static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
+{
+ /*
+ * struct sd_lb_stats {
+ * struct sched_group * busiest; // 0 8
+ * struct sched_group * this; // 8 8
+ * long unsigned int total_load; // 16 8
+ * long unsigned int total_pwr; // 24 8
+ * long unsigned int avg_load; // 32 8
+ * struct sg_lb_stats {
+ * long unsigned int avg_load; // 40 8
+ * long unsigned int group_load; // 48 8
+ * ...
+ * } busiest_stat; // 40 64
+ * struct sg_lb_stats this_stat; // 104 64
+ *
+ * // size: 168, cachelines: 3, members: 7
+ * // last cacheline: 40 bytes
+ * };
+ *
+ * Skimp on the clearing to avoid duplicate work. We can avoid clearing
+ * this_stat because update_sg_lb_stats() does a full clear/assignment.
+ * We must however clear busiest_stat::avg_load because
+ * update_sd_pick_busiest() reads this before assignment.
+ */
+ memset(sds, 0, offsetof(struct sd_lb_stats, busiest_stat.group_load));
+}
+
/**
* get_sd_load_idx - Obtain the load index for a given sched domain.
* @sd: The sched_domain whose load_idx is to be obtained.
@@ -4460,60 +4479,66 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
return 0;
}

+/*
+ * Group imbalance indicates (and tries to solve) the problem where balancing
+ * groups is inadequate due to tsk_cpus_allowed() constraints.
+ *
+ * Imagine a situation of two groups of 4 cpus each and 4 tasks each with a
+ * cpumask covering 1 cpu of the first group and 3 cpus of the second group.
+ * Something like:
+ *
+ * { 0 1 2 3 } { 4 5 6 7 }
+ * * * * *
+ *
+ * If we were to balance group-wise we'd place two tasks in the first group and
+ * two tasks in the second group. Clearly this is undesired as it will overload
+ * cpu 3 and leave one of the cpus in the second group unused.
+ *
+ * The current solution to this issue is detecting the skew in the first group
+ * by noticing the lower domain failed to reach balance and had difficulty
+ * moving tasks due to affinity constraints.
+ *
+ * When this is so detected, this group becomes a candidate for busiest; see
+ * update_sd_pick_busiest(). And calculate_imbalance() and
+ * find_busiest_group() avoid some of the usual balance conditions to allow it
+ * to create an effective group imbalance.
+ *
+ * This is a somewhat tricky proposition since the next run might not find the
+ * group imbalance and decide the groups need to be balanced again. A most
+ * subtle and fragile situation.
+ */
+
+static inline int sg_imbalanced(struct sched_group *group)
+{
+ return group->sgp->imbalance;
+}
+
/**
* update_sg_lb_stats - Update sched_group's statistics for load balancing.
* @env: The load balancing environment.
* @group: sched_group whose statistics are to be updated.
* @load_idx: Load index of sched_domain of this_cpu for load calc.
* @local_group: Does group contain this_cpu.
- * @balance: Should we balance.
* @sgs: variable to hold the statistics for this group.
*/
static inline void update_sg_lb_stats(struct lb_env *env,
struct sched_group *group, int load_idx,
- int local_group, int *balance, struct sg_lb_stats *sgs)
+ int local_group, struct sg_lb_stats *sgs)
{
- unsigned long nr_running, max_nr_running, min_nr_running;
- unsigned long load, max_cpu_load, min_cpu_load;
- unsigned int balance_cpu = -1, first_idle_cpu = 0;
- unsigned long avg_load_per_task = 0;
+ unsigned long nr_running;
+ unsigned long load;
int i;

- if (local_group)
- balance_cpu = group_balance_cpu(group);
-
- /* Tally up the load of all CPUs in the group */
- max_cpu_load = 0;
- min_cpu_load = ~0UL;
- max_nr_running = 0;
- min_nr_running = ~0UL;
-
for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
struct rq *rq = cpu_rq(i);

nr_running = rq->nr_running;

/* Bias balancing toward cpus of our domain */
- if (local_group) {
- if (idle_cpu(i) && !first_idle_cpu &&
- cpumask_test_cpu(i, sched_group_mask(group))) {
- first_idle_cpu = 1;
- balance_cpu = i;
- }
-
+ if (local_group)
load = target_load(i, load_idx);
- } else {
+ else
load = source_load(i, load_idx);
- if (load > max_cpu_load)
- max_cpu_load = load;
- if (min_cpu_load > load)
- min_cpu_load = load;
-
- if (nr_running > max_nr_running)
- max_nr_running = nr_running;
- if (min_nr_running > nr_running)
- min_nr_running = nr_running;
- }

sgs->group_load += load;
sgs->sum_nr_running += nr_running;
@@ -4522,46 +4547,25 @@ static inline void update_sg_lb_stats(struct lb_env *env,
sgs->idle_cpus++;
}

- /*
- * First idle cpu or the first cpu(busiest) in this sched group
- * is eligible for doing load balancing at this and above
- * domains. In the newly idle case, we will allow all the cpu's
- * to do the newly idle load balance.
- */
- if (local_group) {
- if (env->idle != CPU_NEWLY_IDLE) {
- if (balance_cpu != env->dst_cpu) {
- *balance = 0;
- return;
- }
- update_group_power(env->sd, env->dst_cpu);
- } else if (time_after_eq(jiffies, group->sgp->next_update))
- update_group_power(env->sd, env->dst_cpu);
- }
+ if (local_group && (env->idle != CPU_NEWLY_IDLE ||
+ time_after_eq(jiffies, group->sgp->next_update)))
+ update_group_power(env->sd, env->dst_cpu);

/* Adjust by relative CPU power of the group */
- sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / group->sgp->power;
+ sgs->group_power = group->sgp->power;
+ sgs->avg_load = (sgs->group_load*SCHED_POWER_SCALE) / sgs->group_power;

- /*
- * Consider the group unbalanced when the imbalance is larger
- * than the average weight of a task.
- *
- * APZ: with cgroup the avg task weight can vary wildly and
- * might not be a suitable number - should we keep a
- * normalized nr_running number somewhere that negates
- * the hierarchy?
- */
if (sgs->sum_nr_running)
- avg_load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
+ sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;

- if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
- (max_nr_running - min_nr_running) > 1)
- sgs->group_imb = 1;
+ sgs->group_imb = sg_imbalanced(group);
+
+ sgs->group_capacity =
+ DIV_ROUND_CLOSEST(sgs->group_power, SCHED_POWER_SCALE);

- sgs->group_capacity = DIV_ROUND_CLOSEST(group->sgp->power,
- SCHED_POWER_SCALE);
if (!sgs->group_capacity)
sgs->group_capacity = fix_small_capacity(env->sd, group);
+
sgs->group_weight = group->group_weight;

if (sgs->group_capacity > sgs->sum_nr_running)
@@ -4586,7 +4590,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
struct sched_group *sg,
struct sg_lb_stats *sgs)
{
- if (sgs->avg_load <= sds->max_load)
+ if (sgs->avg_load <= sds->busiest_stat.avg_load)
return false;

if (sgs->sum_nr_running > sgs->group_capacity)
@@ -4619,11 +4623,11 @@ static bool update_sd_pick_busiest(struct lb_env *env,
* @sds: variable to hold the statistics for this sched_domain.
*/
static inline void update_sd_lb_stats(struct lb_env *env,
- int *balance, struct sd_lb_stats *sds)
+ struct sd_lb_stats *sds)
{
struct sched_domain *child = env->sd->child;
struct sched_group *sg = env->sd->groups;
- struct sg_lb_stats sgs;
+ struct sg_lb_stats tmp_sgs;
int load_idx, prefer_sibling = 0;

if (child && child->flags & SD_PREFER_SIBLING)
@@ -4632,17 +4636,17 @@ static inline void update_sd_lb_stats(struct lb_env *env,
load_idx = get_sd_load_idx(env->sd, env->idle);

do {
+ struct sg_lb_stats *sgs = &tmp_sgs;
int local_group;

local_group = cpumask_test_cpu(env->dst_cpu, sched_group_cpus(sg));
- memset(&sgs, 0, sizeof(sgs));
- update_sg_lb_stats(env, sg, load_idx, local_group, balance, &sgs);
-
- if (local_group && !(*balance))
- return;
+ if (local_group) {
+ sds->this = sg;
+ sgs = &sds->this_stat;
+ }

- sds->total_load += sgs.group_load;
- sds->total_pwr += sg->sgp->power;
+ memset(sgs, 0, sizeof(*sgs));
+ update_sg_lb_stats(env, sg, load_idx, local_group, sgs);

/*
* In case the child domain prefers tasks go to siblings
@@ -4654,26 +4658,17 @@ static inline void update_sd_lb_stats(struct lb_env *env,
* heaviest group when it is already under-utilized (possible
* with a large weight task outweighs the tasks on the system).
*/
- if (prefer_sibling && !local_group && sds->this_has_capacity)
- sgs.group_capacity = min(sgs.group_capacity, 1UL);
+ if (prefer_sibling && !local_group &&
+ sds->this && sds->this_stat.group_has_capacity)
+ sgs->group_capacity = min(sgs->group_capacity, 1U);

- if (local_group) {
- sds->this_load = sgs.avg_load;
- sds->this = sg;
- sds->this_nr_running = sgs.sum_nr_running;
- sds->this_load_per_task = sgs.sum_weighted_load;
- sds->this_has_capacity = sgs.group_has_capacity;
- sds->this_idle_cpus = sgs.idle_cpus;
- } else if (update_sd_pick_busiest(env, sds, sg, &sgs)) {
- sds->max_load = sgs.avg_load;
+ /* Now, start updating sd_lb_stats */
+ sds->total_load += sgs->group_load;
+ sds->total_pwr += sgs->group_power;
+
+ if (!local_group && update_sd_pick_busiest(env, sds, sg, sgs)) {
sds->busiest = sg;
- sds->busiest_nr_running = sgs.sum_nr_running;
- sds->busiest_idle_cpus = sgs.idle_cpus;
- sds->busiest_group_capacity = sgs.group_capacity;
- sds->busiest_load_per_task = sgs.sum_weighted_load;
- sds->busiest_has_capacity = sgs.group_has_capacity;
- sds->busiest_group_weight = sgs.group_weight;
- sds->group_imb = sgs.group_imb;
+ sds->busiest_stat = *sgs;
}

sg = sg->next;
@@ -4718,7 +4713,8 @@ static int check_asym_packing(struct lb_env *env, struct sd_lb_stats *sds)
return 0;

env->imbalance = DIV_ROUND_CLOSEST(
- sds->max_load * sds->busiest->sgp->power, SCHED_POWER_SCALE);
+ sds->busiest_stat.avg_load * sds->busiest_stat.group_power,
+ SCHED_POWER_SCALE);

return 1;
}
@@ -4736,24 +4732,23 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
unsigned long tmp, pwr_now = 0, pwr_move = 0;
unsigned int imbn = 2;
unsigned long scaled_busy_load_per_task;
+ struct sg_lb_stats *this, *busiest;

- if (sds->this_nr_running) {
- sds->this_load_per_task /= sds->this_nr_running;
- if (sds->busiest_load_per_task >
- sds->this_load_per_task)
- imbn = 1;
- } else {
- sds->this_load_per_task =
- cpu_avg_load_per_task(env->dst_cpu);
- }
+ this = &sds->this_stat;
+ busiest = &sds->busiest_stat;

- scaled_busy_load_per_task = sds->busiest_load_per_task
- * SCHED_POWER_SCALE;
- scaled_busy_load_per_task /= sds->busiest->sgp->power;
+ if (!this->sum_nr_running)
+ this->load_per_task = cpu_avg_load_per_task(env->dst_cpu);
+ else if (busiest->load_per_task > this->load_per_task)
+ imbn = 1;

- if (sds->max_load - sds->this_load + scaled_busy_load_per_task >=
- (scaled_busy_load_per_task * imbn)) {
- env->imbalance = sds->busiest_load_per_task;
+ scaled_busy_load_per_task =
+ (busiest->load_per_task * SCHED_POWER_SCALE) /
+ busiest->group_power;
+
+ if (busiest->avg_load - this->avg_load + scaled_busy_load_per_task >=
+ (scaled_busy_load_per_task * imbn)) {
+ env->imbalance = busiest->load_per_task;
return;
}

@@ -4763,34 +4758,37 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
* moving them.
*/

- pwr_now += sds->busiest->sgp->power *
- min(sds->busiest_load_per_task, sds->max_load);
- pwr_now += sds->this->sgp->power *
- min(sds->this_load_per_task, sds->this_load);
+ pwr_now += busiest->group_power *
+ min(busiest->load_per_task, busiest->avg_load);
+ pwr_now += this->group_power *
+ min(this->load_per_task, this->avg_load);
pwr_now /= SCHED_POWER_SCALE;

/* Amount of load we'd subtract */
- tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
- sds->busiest->sgp->power;
- if (sds->max_load > tmp)
- pwr_move += sds->busiest->sgp->power *
- min(sds->busiest_load_per_task, sds->max_load - tmp);
+ tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
+ busiest->group_power;
+ if (busiest->avg_load > tmp) {
+ pwr_move += busiest->group_power *
+ min(busiest->load_per_task,
+ busiest->avg_load - tmp);
+ }

/* Amount of load we'd add */
- if (sds->max_load * sds->busiest->sgp->power <
- sds->busiest_load_per_task * SCHED_POWER_SCALE)
- tmp = (sds->max_load * sds->busiest->sgp->power) /
- sds->this->sgp->power;
- else
- tmp = (sds->busiest_load_per_task * SCHED_POWER_SCALE) /
- sds->this->sgp->power;
- pwr_move += sds->this->sgp->power *
- min(sds->this_load_per_task, sds->this_load + tmp);
+ if (busiest->avg_load * busiest->group_power <
+ busiest->load_per_task * SCHED_POWER_SCALE) {
+ tmp = (busiest->avg_load * busiest->group_power) /
+ this->group_power;
+ } else {
+ tmp = (busiest->load_per_task * SCHED_POWER_SCALE) /
+ this->group_power;
+ }
+ pwr_move += this->group_power *
+ min(this->load_per_task, this->avg_load + tmp);
pwr_move /= SCHED_POWER_SCALE;

/* Move if we gain throughput */
if (pwr_move > pwr_now)
- env->imbalance = sds->busiest_load_per_task;
+ env->imbalance = busiest->load_per_task;
}

/**
@@ -4802,11 +4800,18 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
{
unsigned long max_pull, load_above_capacity = ~0UL;
+ struct sg_lb_stats *this, *busiest;

- sds->busiest_load_per_task /= sds->busiest_nr_running;
- if (sds->group_imb) {
- sds->busiest_load_per_task =
- min(sds->busiest_load_per_task, sds->avg_load);
+ this = &sds->this_stat;
+ busiest = &sds->busiest_stat;
+
+ if (busiest->group_imb) {
+ /*
+ * In the group_imb case we cannot rely on group-wide averages
+ * to ensure cpu-load equilibrium, look at wider averages. XXX
+ */
+ busiest->load_per_task =
+ min(busiest->load_per_task, sds->avg_load);
}

/*
@@ -4814,21 +4819,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* max load less than avg load(as we skip the groups at or below
* its cpu_power, while calculating max_load..)
*/
- if (sds->max_load < sds->avg_load) {
+ if (busiest->avg_load < sds->avg_load) {
env->imbalance = 0;
return fix_small_imbalance(env, sds);
}

- if (!sds->group_imb) {
+ if (!busiest->group_imb) {
/*
* Don't want to pull so many tasks that a group would go idle.
+ * Except of course for the group_imb case, since then we might
+ * have to drop below capacity to reach cpu-load equilibrium.
*/
- load_above_capacity = (sds->busiest_nr_running -
- sds->busiest_group_capacity);
+ load_above_capacity =
+ (busiest->sum_nr_running - busiest->group_capacity);

load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
-
- load_above_capacity /= sds->busiest->sgp->power;
+ load_above_capacity /= busiest->group_power;
}

/*
@@ -4838,15 +4844,14 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* we also don't want to reduce the group load below the group capacity
* (so that we can implement power-savings policies etc). Thus we look
* for the minimum possible imbalance.
- * Be careful of negative numbers as they'll appear as very large values
- * with unsigned longs.
*/
- max_pull = min(sds->max_load - sds->avg_load, load_above_capacity);
+ max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

/* How much load to actually move to equalise the imbalance */
- env->imbalance = min(max_pull * sds->busiest->sgp->power,
- (sds->avg_load - sds->this_load) * sds->this->sgp->power)
- / SCHED_POWER_SCALE;
+ env->imbalance = min(
+ max_pull * busiest->group_power,
+ (sds->avg_load - this->avg_load) * this->group_power
+ ) / SCHED_POWER_SCALE;

/*
* if *imbalance is less than the average load per runnable task
@@ -4854,9 +4859,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* a think about bumping its value to force at least one task to be
* moved
*/
- if (env->imbalance < sds->busiest_load_per_task)
+ if (env->imbalance < busiest->load_per_task)
return fix_small_imbalance(env, sds);
-
}

/******* find_busiest_group() helpers end here *********************/
@@ -4872,69 +4876,62 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
* to restore balance.
*
* @env: The load balancing environment.
- * @balance: Pointer to a variable indicating if this_cpu
- * is the appropriate cpu to perform load balancing at this_level.
*
* Return: - The busiest group if imbalance exists.
* - If no imbalance and user has opted for power-savings balance,
* return the least loaded group whose CPUs can be
* put to idle by rebalancing its tasks onto our group.
*/
-static struct sched_group *
-find_busiest_group(struct lb_env *env, int *balance)
+static struct sched_group *find_busiest_group(struct lb_env *env)
{
+ struct sg_lb_stats *this, *busiest;
struct sd_lb_stats sds;

- memset(&sds, 0, sizeof(sds));
+ init_sd_lb_stats(&sds);

/*
* Compute the various statistics relavent for load balancing at
* this level.
*/
- update_sd_lb_stats(env, balance, &sds);
-
- /*
- * this_cpu is not the appropriate cpu to perform load balancing at
- * this level.
- */
- if (!(*balance))
- goto ret;
+ update_sd_lb_stats(env, &sds);
+ this = &sds.this_stat;
+ busiest = &sds.busiest_stat;

if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
check_asym_packing(env, &sds))
return sds.busiest;

/* There is no busy sibling group to pull tasks from */
- if (!sds.busiest || sds.busiest_nr_running == 0)
+ if (!sds.busiest || busiest->sum_nr_running == 0)
goto out_balanced;

sds.avg_load = (SCHED_POWER_SCALE * sds.total_load) / sds.total_pwr;

/*
* If the busiest group is imbalanced the below checks don't
- * work because they assumes all things are equal, which typically
+ * work because they assume all things are equal, which typically
* isn't true due to cpus_allowed constraints and the like.
*/
- if (sds.group_imb)
+ if (busiest->group_imb)
goto force_balance;

/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
- if (env->idle == CPU_NEWLY_IDLE && sds.this_has_capacity &&
- !sds.busiest_has_capacity)
+ if (env->idle == CPU_NEWLY_IDLE && this->group_has_capacity &&
+ !busiest->group_has_capacity)
goto force_balance;

/*
* If the local group is more busy than the selected busiest group
* don't try and pull any tasks.
*/
- if (sds.this_load >= sds.max_load)
+ if (this->avg_load >= busiest->avg_load)
goto out_balanced;

/*
* Don't pull any tasks if this group is already above the domain
* average load.
*/
- if (sds.this_load >= sds.avg_load)
+ if (this->avg_load >= sds.avg_load)
goto out_balanced;

if (env->idle == CPU_IDLE) {
@@ -4944,15 +4941,16 @@ find_busiest_group(struct lb_env *env, int *balance)
* there is no imbalance between this and busiest group
* wrt to idle cpu's, it is balanced.
*/
- if ((sds.this_idle_cpus <= sds.busiest_idle_cpus + 1) &&
- sds.busiest_nr_running <= sds.busiest_group_weight)
+ if ((this->idle_cpus <= busiest->idle_cpus + 1) &&
+ busiest->sum_nr_running <= busiest->group_weight)
goto out_balanced;
} else {
/*
* In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
* imbalance_pct to be conservative.
*/
- if (100 * sds.max_load <= env->sd->imbalance_pct * sds.this_load)
+ if (100 * busiest->avg_load <=
+ env->sd->imbalance_pct * this->avg_load)
goto out_balanced;
}

@@ -4962,7 +4960,6 @@ force_balance:
return sds.busiest;

out_balanced:
-ret:
env->imbalance = 0;
return NULL;
}
@@ -4974,10 +4971,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
struct sched_group *group)
{
struct rq *busiest = NULL, *rq;
- unsigned long max_load = 0;
+ unsigned long busiest_load = 0, busiest_power = SCHED_POWER_SCALE;
int i;

- for_each_cpu(i, sched_group_cpus(group)) {
+ for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
unsigned long power = power_of(i);
unsigned long capacity = DIV_ROUND_CLOSEST(power,
SCHED_POWER_SCALE);
@@ -4986,9 +4983,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
if (!capacity)
capacity = fix_small_capacity(env->sd, group);

- if (!cpumask_test_cpu(i, env->cpus))
- continue;
-
rq = cpu_rq(i);
wl = weighted_cpuload(i);

@@ -5005,10 +4999,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
* the load can be moved away from the cpu that is potentially
* running at a lower capacity.
*/
- wl = (wl * SCHED_POWER_SCALE) / power;
-
- if (wl > max_load) {
- max_load = wl;
+ if (wl * busiest_power > busiest_load * power) {
+ busiest_load = wl;
+ busiest_power = power;
busiest = rq;
}
}
@@ -5045,15 +5038,50 @@ static int need_active_balance(struct lb_env *env)

static int active_load_balance_cpu_stop(void *data);

+static int should_we_balance(struct lb_env *env)
+{
+ struct sched_group *sg = env->sd->groups;
+ struct cpumask *sg_cpus, *sg_mask;
+ int cpu, balance_cpu = -1;
+
+ /*
+ * In the newly idle case, we will allow all the cpu's
+ * to do the newly idle load balance.
+ */
+ if (env->idle == CPU_NEWLY_IDLE)
+ return 1;
+
+ sg_cpus = sched_group_cpus(sg);
+ sg_mask = sched_group_mask(sg);
+ /* Try to find first idle cpu */
+ for_each_cpu_and(cpu, sg_cpus, env->cpus) {
+ if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
+ continue;
+
+ balance_cpu = cpu;
+ break;
+ }
+
+ if (balance_cpu == -1)
+ balance_cpu = group_balance_cpu(sg);
+
+ /*
+ * First idle cpu or the first cpu(busiest) in this sched group
+ * is eligible for doing load balancing at this and above domains.
+ */
+ return balance_cpu != env->dst_cpu;
+}
+
/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
* tasks if there is an imbalance.
*/
static int load_balance(int this_cpu, struct rq *this_rq,
struct sched_domain *sd, enum cpu_idle_type idle,
- int *balance)
+ int *should_balance)
{
int ld_moved, cur_ld_moved, active_balance = 0;
+ struct sched_domain *sd_parent = sd->parent;
struct sched_group *group;
struct rq *busiest;
unsigned long flags;
@@ -5080,12 +5108,11 @@ static int load_balance(int this_cpu, struct rq *this_rq,

schedstat_inc(sd, lb_count[idle]);

-redo:
- group = find_busiest_group(&env, balance);
-
- if (*balance == 0)
+ if (!(*should_balance = should_we_balance(&env)))
goto out_balanced;

+redo:
+ group = find_busiest_group(&env);
if (!group) {
schedstat_inc(sd, lb_nobusyg[idle]);
goto out_balanced;
@@ -5158,11 +5185,11 @@ more_balance:
* moreover subsequent load balance cycles should correct the
* excess load moved.
*/
- if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+ if ((env.flags & LBF_DST_PINNED) && env.imbalance > 0) {

env.dst_rq = cpu_rq(env.new_dst_cpu);
env.dst_cpu = env.new_dst_cpu;
- env.flags &= ~LBF_SOME_PINNED;
+ env.flags &= ~LBF_DST_PINNED;
env.loop = 0;
env.loop_break = sched_nr_migrate_break;

@@ -5176,6 +5203,18 @@ more_balance:
goto more_balance;
}

+ /*
+ * We failed to reach balance because of affinity.
+ */
+ if (sd_parent) {
+ int *group_imbalance = &sd_parent->groups->sgp->imbalance;
+
+ if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+ *group_imbalance = 1;
+ } else if (*group_imbalance)
+ *group_imbalance = 0;
+ }
+
/* All tasks on this runqueue were pinned by CPU affinity */
if (unlikely(env.flags & LBF_ALL_PINNED)) {
cpumask_clear_cpu(cpu_of(busiest), cpus);
@@ -5298,7 +5337,7 @@ void idle_balance(int this_cpu, struct rq *this_rq)
rcu_read_lock();
for_each_domain(this_cpu, sd) {
unsigned long interval;
- int balance = 1;
+ int should_balance;

if (!(sd->flags & SD_LOAD_BALANCE))
continue;
@@ -5306,7 +5345,8 @@ void idle_balance(int this_cpu, struct rq *this_rq)
if (sd->flags & SD_BALANCE_NEWIDLE) {
/* If we've pulled tasks over stop searching: */
pulled_task = load_balance(this_cpu, this_rq,
- sd, CPU_NEWLY_IDLE, &balance);
+ sd, CPU_NEWLY_IDLE,
+ &should_balance);
}

interval = msecs_to_jiffies(sd->balance_interval);
@@ -5544,7 +5584,7 @@ void update_max_interval(void)
*/
static void rebalance_domains(int cpu, enum cpu_idle_type idle)
{
- int balance = 1;
+ int should_balance = 1;
struct rq *rq = cpu_rq(cpu);
unsigned long interval;
struct sched_domain *sd;
@@ -5576,9 +5616,9 @@ static void rebalance_domains(int cpu, enum cpu_idle_type idle)
}

if (time_after_eq(jiffies, sd->last_balance + interval)) {
- if (load_balance(cpu, rq, sd, idle, &balance)) {
+ if (load_balance(cpu, rq, sd, idle, &should_balance)) {
/*
- * The LBF_SOME_PINNED logic could have changed
+ * The LBF_DST_PINNED logic could have changed
* env->dst_cpu, so we can't know our idle
* state even if we migrated tasks. Update it.
*/
@@ -5599,7 +5639,7 @@ out:
* CPU in our sched group which is doing load balancing more
* actively.
*/
- if (!balance)
+ if (!should_balance)
break;
}
rcu_read_unlock();
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ef0a7b2..7c17661 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -605,6 +605,7 @@ struct sched_group_power {
*/
unsigned int power, power_orig;
unsigned long next_update;
+ int imbalance; /* XXX unrelated to power but shared group state */
/*
* Number of busy cpus in this group.
*/
diff --git a/lib/vsprintf.c b/lib/vsprintf.c
index 739a363..5521015 100644
--- a/lib/vsprintf.c
+++ b/lib/vsprintf.c
@@ -26,6 +26,7 @@
#include <linux/math64.h>
#include <linux/uaccess.h>
#include <linux/ioport.h>
+#include <linux/cpumask.h>
#include <net/addrconf.h>

#include <asm/page.h> /* for PAGE_SIZE */
@@ -1142,6 +1143,7 @@ int kptr_restrict __read_mostly;
* The maximum supported length is 64 bytes of the input. Consider
* to use print_hex_dump() for the larger input.
* - 'a' For a phys_addr_t type and its derivative types (passed by reference)
+ * - 'c' For a cpumask list
*
* Note: The difference between 'S' and 'F' is that on ia64 and ppc64
* function pointers are really function descriptors, which contain a
@@ -1253,6 +1255,8 @@ char *pointer(const char *fmt, char *buf, char *end, void *ptr,
spec.base = 16;
return number(buf, end,
(unsigned long long) *((phys_addr_t *)ptr), spec);
+ case 'c':
+ return buf + cpulist_scnprintf(buf, end - buf, ptr);
}
spec.flags |= SMALL;
if (spec.field_width == -1) {
@@ -1494,6 +1498,7 @@ qualifier:
* case.
* %*ph[CDN] a variable-length hex string with a separator (supports up to 64
* bytes of the input)
+ * %pc print a cpumask as comma-separated list
* %n is ignored
*
* ** Please update Documentation/printk-formats.txt when making changes **
--
1.8.1.4

2013-09-10 09:45:48

by Mel Gorman

[permalink] [raw]
Subject: [PATCH 03/50] sched, numa: Comment fixlets

From: Peter Zijlstra <[email protected]>

Fix an 80 column violation and a PTE vs PMD reference.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 8 ++++----
mm/huge_memory.c | 2 +-
2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0c085ac..227b070 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -988,10 +988,10 @@ void task_numa_work(struct callback_head *work)

out:
/*
- * It is possible to reach the end of the VMA list but the last few VMAs are
- * not guaranteed to the vma_migratable. If they are not, we would find the
- * !migratable VMA on the next scan but not reset the scanner to the start
- * so check it now.
+ * It is possible to reach the end of the VMA list but the last few
+ * VMAs are not guaranteed to the vma_migratable. If they are not, we
+ * would find the !migratable VMA on the next scan but not reset the
+ * scanner to the start so check it now.
*/
if (vma)
mm->numa_scan_offset = start;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a92012a..860a368 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1317,7 +1317,7 @@ int do_huge_pmd_numa_page(struct mm_struct *mm, struct vm_area_struct *vma,
spin_unlock(&mm->page_table_lock);
lock_page(page);

- /* Confirm the PTE did not while locked */
+ /* Confirm the PMD did not change while page_table_lock was released */
spin_lock(&mm->page_table_lock);
if (unlikely(!pmd_same(pmd, *pmdp))) {
unlock_page(page);
--
1.8.1.4

2013-09-11 00:57:59

by Joonsoo Kim

[permalink] [raw]
Subject: Re: [PATCH 01/50] sched: monolithic code dump of what is being pushed upstream

On Tue, Sep 10, 2013 at 10:31:41AM +0100, Mel Gorman wrote:
> @@ -5045,15 +5038,50 @@ static int need_active_balance(struct lb_env *env)
>
> static int active_load_balance_cpu_stop(void *data);
>
> +static int should_we_balance(struct lb_env *env)
> +{
> + struct sched_group *sg = env->sd->groups;
> + struct cpumask *sg_cpus, *sg_mask;
> + int cpu, balance_cpu = -1;
> +
> + /*
> + * In the newly idle case, we will allow all the cpu's
> + * to do the newly idle load balance.
> + */
> + if (env->idle == CPU_NEWLY_IDLE)
> + return 1;
> +
> + sg_cpus = sched_group_cpus(sg);
> + sg_mask = sched_group_mask(sg);
> + /* Try to find first idle cpu */
> + for_each_cpu_and(cpu, sg_cpus, env->cpus) {
> + if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
> + continue;
> +
> + balance_cpu = cpu;
> + break;
> + }
> +
> + if (balance_cpu == -1)
> + balance_cpu = group_balance_cpu(sg);
> +
> + /*
> + * First idle cpu or the first cpu(busiest) in this sched group
> + * is eligible for doing load balancing at this and above domains.
> + */
> + return balance_cpu != env->dst_cpu;
> +}
> +

Hello, Mel.

There is one mistake from me.
The last return statement in should_we_balance() should be
'return balance_cpu == env->dst_cpu'. The fix was submitted yesterday.

You can get more information on below thread.
https://lkml.org/lkml/2013/9/10/1
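
In diff form the change amounts to something like this (an illustrative
sketch of the one-liner described above, not the exact patch that was
submitted):

 	/*
 	 * First idle cpu or the first cpu(busiest) in this sched group
 	 * is eligible for doing load balancing at this and above domains.
 	 */
-	return balance_cpu != env->dst_cpu;
+	return balance_cpu == env->dst_cpu;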

I think that this fix is somewhat important to the scheduler's behavior,
so it may be better to update your test results with this fix.
Sorry for the late notice.

Thanks.

2013-09-11 02:04:23

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7

On 09/10/2013 05:31 AM, Mel Gorman wrote:
> It has been a long time since V6 of this series and time for an update. Much
> of this is now stabilised with the most important addition being the inclusion
> of Peter and Rik's work on grouping tasks that share pages together.
>
> This series has a number of goals. It reduces overhead of automatic balancing
> through scan rate reduction and the avoidance of TLB flushes. It selects a
> preferred node and moves tasks towards their memory as well as moving memory
> toward their task. It handles shared pages and groups related tasks together.

The attached two patches should fix the task grouping issues
we discussed on #mm earlier.

Now on to the load balancer. When specjbb takes up way fewer
CPUs than are available on a node, it is possible for
multiple specjbb processes to end up on the same NUMA node,
and the load balancer makes no attempt to move some of them
to completely idle nodes.

I have not figured out yet how to fix that behaviour...

--
All rights reversed


Attachments:
0061-exec-leave-numa-group.patch (2.19 kB)
0062-numa-join-group-carefully.patch (6.02 kB)

2013-09-11 03:11:07

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 01/50] sched: monolithic code dump of what is being pushed upstream

On Tue, Sep 10, 2013 at 5:31 PM, Mel Gorman <[email protected]> wrote:
> @@ -5045,15 +5038,50 @@ static int need_active_balance(struct lb_env *env)
>
> static int active_load_balance_cpu_stop(void *data);
>
> +static int should_we_balance(struct lb_env *env)
> +{
> + struct sched_group *sg = env->sd->groups;
> + struct cpumask *sg_cpus, *sg_mask;
> + int cpu, balance_cpu = -1;
> +
> + /*
> + * In the newly idle case, we will allow all the cpu's
> + * to do the newly idle load balance.
> + */
> + if (env->idle == CPU_NEWLY_IDLE)
> + return 1;
> +
> + sg_cpus = sched_group_cpus(sg);
> + sg_mask = sched_group_mask(sg);
> + /* Try to find first idle cpu */
> + for_each_cpu_and(cpu, sg_cpus, env->cpus) {
> + if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
> + continue;
> +
> + balance_cpu = cpu;
> + break;
> + }
> +
> + if (balance_cpu == -1)
> + balance_cpu = group_balance_cpu(sg);
> +
> + /*
> + * First idle cpu or the first cpu(busiest) in this sched group
> + * is eligible for doing load balancing at this and above domains.
> + */
> + return balance_cpu != env->dst_cpu;

FYI: Here is a bug reported by Dave Chinner.
https://lkml.org/lkml/2013/9/10/1

And lets see if any changes in your SpecJBB results without it.

Hillf

2013-09-12 02:10:16

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 27/50] mm: numa: Scan pages with elevated page_mapcount

Hillo Mel

On Tue, Sep 10, 2013 at 5:32 PM, Mel Gorman <[email protected]> wrote:
> Currently automatic NUMA balancing is unable to distinguish between false
> shared versus private pages except by ignoring pages with an elevated
> page_mapcount entirely. This avoids shared pages bouncing between the
> nodes whose task is using them but that is ignored quite a lot of data.
>
> This patch kicks away the training wheels in preparation for adding support
> for identifying shared/private pages is now in place. The ordering is so
> that the impact of the shared/private detection can be easily measured. Note
> that the patch does not migrate shared, file-backed within vmas marked
> VM_EXEC as these are generally shared library pages. Migrating such pages
> is not beneficial as there is an expectation they are read-shared between
> caches and iTLB and iCache pressure is generally low.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
[...]
> @@ -1658,13 +1660,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> int page_lru = page_is_file_cache(page);
>
> /*
> - * Don't migrate pages that are mapped in multiple processes.
> - * TODO: Handle false sharing detection instead of this hammer
> - */
> - if (page_mapcount(page) != 1)
> - goto out_dropref;
> -
Is there an rmap walk when migrating THP?

> - /*
> * Rate-limit the amount of data that is being migrated to a node.
> * Optimal placement is no good if the memory bus is saturated and
> * all the time is being spent migrating!

2013-09-12 12:42:21

by Hillf Danton

[permalink] [raw]
Subject: Re: [PATCH 41/50] sched: numa: Use {cpu, pid} to create task groups for shared faults

Hello Mel

On Tue, Sep 10, 2013 at 5:32 PM, Mel Gorman <[email protected]> wrote:
>
> +void task_numa_free(struct task_struct *p)
> +{
> + struct numa_group *grp = p->numa_group;
> + int i;
> +
> + kfree(p->numa_faults);
> +
> + if (grp) {
> + for (i = 0; i < 2*nr_node_ids; i++)
> + atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
> +
use after free, numa_faults ;/

> + spin_lock(&grp->lock);
> + list_del(&p->numa_entry);
> + grp->nr_tasks--;
> + spin_unlock(&grp->lock);
> + rcu_assign_pointer(p->numa_group, NULL);
> + put_numa_group(grp);
> + }
> +}
> +

2013-09-12 14:40:39

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 41/50] sched: numa: Use {cpu, pid} to create task groups for shared faults

On Thu, Sep 12, 2013 at 08:42:18PM +0800, Hillf Danton wrote:
> Hello Mel
>
> On Tue, Sep 10, 2013 at 5:32 PM, Mel Gorman <[email protected]> wrote:
> >
> > +void task_numa_free(struct task_struct *p)
> > +{
> > + struct numa_group *grp = p->numa_group;
> > + int i;
> > +
> > + kfree(p->numa_faults);
> > +
> > + if (grp) {
> > + for (i = 0; i < 2*nr_node_ids; i++)
> > + atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
> > +
> use after free, numa_faults ;/
>

It gets fixed in the patch "sched: numa: use group fault statistics in
numa placement" but I agree that it's the wrong place to fix it.

--
Mel Gorman
SUSE Labs

2013-09-13 08:12:05

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 27/50] mm: numa: Scan pages with elevated page_mapcount

On Thu, Sep 12, 2013 at 10:10:13AM +0800, Hillf Danton wrote:
> Hillo Mel
>
> On Tue, Sep 10, 2013 at 5:32 PM, Mel Gorman <[email protected]> wrote:
> > Currently automatic NUMA balancing is unable to distinguish between false
> > shared versus private pages except by ignoring pages with an elevated
> > page_mapcount entirely. This avoids shared pages bouncing between the
> > nodes whose task is using them but that is ignored quite a lot of data.
> >
> > This patch kicks away the training wheels in preparation for adding support
> > for identifying shared/private pages is now in place. The ordering is so
> > that the impact of the shared/private detection can be easily measured. Note
> > that the patch does not migrate shared, file-backed within vmas marked
> > VM_EXEC as these are generally shared library pages. Migrating such pages
> > is not beneficial as there is an expectation they are read-shared between
> > caches and iTLB and iCache pressure is generally low.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> [...]
> > @@ -1658,13 +1660,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
> > int page_lru = page_is_file_cache(page);
> >
> > /*
> > - * Don't migrate pages that are mapped in multiple processes.
> > - * TODO: Handle false sharing detection instead of this hammer
> > - */
> > - if (page_mapcount(page) != 1)
> > - goto out_dropref;
> > -
> Is there rmap walk when migrating THP?
>

Should not be necessary for THP.

--
Mel Gorman
SUSE Labs

2013-09-13 08:11:27

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 01/50] sched: monolithic code dump of what is being pushed upstream

On Wed, Sep 11, 2013 at 11:11:03AM +0800, Hillf Danton wrote:
> On Tue, Sep 10, 2013 at 5:31 PM, Mel Gorman <[email protected]> wrote:
> > @@ -5045,15 +5038,50 @@ static int need_active_balance(struct lb_env *env)
> >
> > static int active_load_balance_cpu_stop(void *data);
> >
> > +static int should_we_balance(struct lb_env *env)
> > +{
> > + struct sched_group *sg = env->sd->groups;
> > + struct cpumask *sg_cpus, *sg_mask;
> > + int cpu, balance_cpu = -1;
> > +
> > + /*
> > + * In the newly idle case, we will allow all the cpu's
> > + * to do the newly idle load balance.
> > + */
> > + if (env->idle == CPU_NEWLY_IDLE)
> > + return 1;
> > +
> > + sg_cpus = sched_group_cpus(sg);
> > + sg_mask = sched_group_mask(sg);
> > + /* Try to find first idle cpu */
> > + for_each_cpu_and(cpu, sg_cpus, env->cpus) {
> > + if (!cpumask_test_cpu(cpu, sg_mask) || !idle_cpu(cpu))
> > + continue;
> > +
> > + balance_cpu = cpu;
> > + break;
> > + }
> > +
> > + if (balance_cpu == -1)
> > + balance_cpu = group_balance_cpu(sg);
> > +
> > + /*
> > + * First idle cpu or the first cpu(busiest) in this sched group
> > + * is eligible for doing load balancing at this and above domains.
> > + */
> > + return balance_cpu != env->dst_cpu;
>
> FYI: Here is a bug reported by Dave Chinner.
> https://lkml.org/lkml/2013/9/10/1
>
> And lets see if any changes in your SpecJBB results without it.
>

Thanks for pointing that out. I've picked up the one-liner fix.

--
Mel Gorman
SUSE Labs

2013-09-14 02:58:03

by Bob Liu

[permalink] [raw]
Subject: Re: [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7

Hi Mel,

On 09/10/2013 05:31 PM, Mel Gorman wrote:
> It has been a long time since V6 of this series and time for an update. Much
> of this is now stabilised with the most important addition being the inclusion
> of Peter and Rik's work on grouping tasks that share pages together.
>
> This series has a number of goals. It reduces overhead of automatic balancing
> through scan rate reduction and the avoidance of TLB flushes. It selects a
> preferred node and moves tasks towards their memory as well as moving memory
> toward their task. It handles shared pages and groups related tasks together.
>

I found that numa balancing can sometimes be broken after khugepaged
starts, because khugepaged always allocates the huge page from the node
of the first scanned normal page during collapsing.

A simple use case is when a user runs his application interleaving all
nodes using "numactl --interleave=all xxxx".
But after khugepaged has run, most pages of his application will be
located on only one specific node.

I have a simple patch that fixes this issue in the thread:
[PATCH 2/2] mm: thp: khugepaged: add policy for finding target node

I think this may be related to this topic; I don't know whether this
series can also fix the issue I mentioned.
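
To illustrate the kind of policy I mean, here is a rough sketch (only a
sketch; the function below and the way it would be hooked up are made up
for illustration, it is not the actual patch): count how many of the
scanned normal pages live on each node and collapse to the node that
backs most of them, instead of the node of the first page.

static int find_collapse_target_node(pte_t *pte, int nr_ptes)
{
	/* note: a real patch would not put this array on the stack */
	int count[MAX_NUMNODES] = { 0 };
	int nid, target = 0;

	for (; nr_ptes > 0; nr_ptes--, pte++) {
		struct page *page;

		if (!pte_present(*pte))
			continue;

		page = pte_page(*pte);
		count[page_to_nid(page)]++;
	}

	/* pick the node backing the largest number of scanned pages */
	for (nid = 1; nid < MAX_NUMNODES; nid++) {
		if (count[nid] > count[target])
			target = nid;
	}

	return target;
}

In a real patch the counting would presumably be folded into the
existing isolate loop rather than walking the PTEs twice.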

--
Regards,
-Bob

2013-09-16 12:37:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update

On Tue, Sep 10, 2013 at 10:31:47AM +0100, Mel Gorman wrote:
> A THP PMD update is accounted for as 512 pages updated in vmstat. This is
> large difference when estimating the cost of automatic NUMA balancing and
> can be misleading when comparing results that had collapsed versus split
> THP. This patch addresses the accounting issue.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/mprotect.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 94722a4..2bbb648 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> split_huge_page_pmd(vma, addr, pmd);
> else if (change_huge_pmd(vma, pmd, addr, newprot,
> prot_numa)) {
> - pages += HPAGE_PMD_NR;
> + pages++;

But now you're not counting pages anymore..

2013-09-16 13:46:05

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update

On 09/16/2013 08:36 AM, Peter Zijlstra wrote:
> On Tue, Sep 10, 2013 at 10:31:47AM +0100, Mel Gorman wrote:
>> A THP PMD update is accounted for as 512 pages updated in vmstat. This is
>> large difference when estimating the cost of automatic NUMA balancing and
>> can be misleading when comparing results that had collapsed versus split
>> THP. This patch addresses the accounting issue.
>>
>> Signed-off-by: Mel Gorman <[email protected]>
>> ---
>> mm/mprotect.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 94722a4..2bbb648 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
>> split_huge_page_pmd(vma, addr, pmd);
>> else if (change_huge_pmd(vma, pmd, addr, newprot,
>> prot_numa)) {
>> - pages += HPAGE_PMD_NR;
>> + pages++;
>
> But now you're not counting pages anymore..

The migrate statistics still count pages. That makes sense, since the
amount of work scales with the amount of memory moved.

It is just the "number of faults" counters that actually count the
number of faults again, instead of the number of pages represented
by each fault.

IMHO this change makes sense.

--
All rights reversed

2013-09-16 14:54:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update

On Mon, Sep 16, 2013 at 09:39:59AM -0400, Rik van Riel wrote:
> On 09/16/2013 08:36 AM, Peter Zijlstra wrote:
> > On Tue, Sep 10, 2013 at 10:31:47AM +0100, Mel Gorman wrote:
> >> A THP PMD update is accounted for as 512 pages updated in vmstat. This is
> >> large difference when estimating the cost of automatic NUMA balancing and
> >> can be misleading when comparing results that had collapsed versus split
> >> THP. This patch addresses the accounting issue.
> >>
> >> Signed-off-by: Mel Gorman <[email protected]>
> >> ---
> >> mm/mprotect.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/mm/mprotect.c b/mm/mprotect.c
> >> index 94722a4..2bbb648 100644
> >> --- a/mm/mprotect.c
> >> +++ b/mm/mprotect.c
> >> @@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> >> split_huge_page_pmd(vma, addr, pmd);
> >> else if (change_huge_pmd(vma, pmd, addr, newprot,
> >> prot_numa)) {
> >> - pages += HPAGE_PMD_NR;
> >> + pages++;
> >
> > But now you're not counting pages anymore..
>
> The migrate statistics still count pages. That makes sense, since the
> amount of work scales with the amount of memory moved.

Right.

> It is just the "number of faults" counters that actually count the
> number of faults again, instead of the number of pages represented
> by each fault.

So you're suggesting s/pages/faults/ or somesuch?

> IMHO this change makes sense.

I never said the change didn't make sense as such. Just that we're no
longer counting pages in change_*_range().

2013-09-16 15:18:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 14/50] sched: Set the scan rate proportional to the memory usage of the task being scanned

On Tue, Sep 10, 2013 at 10:31:54AM +0100, Mel Gorman wrote:
> @@ -860,9 +908,14 @@ void task_numa_fault(int node, int pages, bool migrated)
> * If pages are properly placed (did not migrate) then scan slower.
> * This is reset periodically in case of phase changes
> */
> - if (!migrated)
> - p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
> + if (!migrated) {
> + /* Initialise if necessary */
> + if (!p->numa_scan_period_max)
> + p->numa_scan_period_max = task_scan_max(p);
> +
> + p->numa_scan_period = min(p->numa_scan_period_max,
> p->numa_scan_period + jiffies_to_msecs(10));

So the next patch changes the jiffies_to_msec() thing.. is that really
worth a whole separate patch?

Also, I really don't believe any of that is 'right', increasing the scan
period by a fixed amount for every !migrated page is just wrong.

Firstly; there's the migration throttle which basically guarantees that
most pages aren't migrated -- even when they ought to be, thus inflating
the period.

Secondly; assume a _huge_ process, so large that even a small fraction
of non-migrated pages will completely clip the scan period.
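
To put a number on it: every !migrated fault adds jiffies_to_msecs(10)
to the period, so on the order of

	(scan_period_max - scan_period_min) / jiffies_to_msecs(10)

non-migrated faults -- a trivial count next to the working set of a huge
task -- are enough to pin the period at the maximum.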

2013-09-16 15:40:41

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 14/50] sched: Set the scan rate proportional to the memory usage of the task being scanned

On Mon, Sep 16, 2013 at 05:18:22PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 10, 2013 at 10:31:54AM +0100, Mel Gorman wrote:
> > @@ -860,9 +908,14 @@ void task_numa_fault(int node, int pages, bool migrated)
> > * If pages are properly placed (did not migrate) then scan slower.
> > * This is reset periodically in case of phase changes
> > */
> > - if (!migrated)
> > - p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
> > + if (!migrated) {
> > + /* Initialise if necessary */
> > + if (!p->numa_scan_period_max)
> > + p->numa_scan_period_max = task_scan_max(p);
> > +
> > + p->numa_scan_period = min(p->numa_scan_period_max,
> > p->numa_scan_period + jiffies_to_msecs(10));
>
> So the next patch changes the jiffies_to_msec() thing.. is that really
> worth a whole separate patch?
>

No, I can collapse them.

> Also, I really don't believe any of that is 'right', increasing the scan
> period by a fixed amount for every !migrated page is just wrong.
>

At the moment Rik and I are both looking at adapting the scan rate based
on whether the faults trapped since the last scan window were local or
remote faults. It should be able to sensibly adapt the scan rate
independently of the RSS of the process.
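
As a rough sketch of the direction (only a sketch; the locality counters
and the thresholds are made up for illustration, this is not the
implementation we are testing):

static void update_scan_period(struct task_struct *p)
{
	/* illustrative fields: faults seen since the last scan window */
	unsigned long local = p->numa_faults_locality_local;
	unsigned long remote = p->numa_faults_locality_remote;
	unsigned long total = local + remote;

	if (!total)
		return;

	if (local * 10 >= total * 9) {
		/* nearly all faults were local: back the scanner off */
		p->numa_scan_period = min(p->numa_scan_period_max,
					  p->numa_scan_period * 2);
	} else {
		/* a meaningful share of remote faults: scan more often */
		p->numa_scan_period = max(sysctl_numa_balancing_scan_period_min,
					  p->numa_scan_period / 2);
	}

	p->numa_faults_locality_local = 0;
	p->numa_faults_locality_remote = 0;
}

The exact thresholds matter much less than the fact that the signal is
the ratio of local to remote faults rather than the raw count of
non-migrated pages, so it does not depend on the size of the task.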

--
Mel Gorman
SUSE Labs

2013-09-16 16:13:04

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update

On Mon, Sep 16, 2013 at 04:54:38PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 16, 2013 at 09:39:59AM -0400, Rik van Riel wrote:
> > On 09/16/2013 08:36 AM, Peter Zijlstra wrote:
> > > On Tue, Sep 10, 2013 at 10:31:47AM +0100, Mel Gorman wrote:
> > >> A THP PMD update is accounted for as 512 pages updated in vmstat. This is
> > >> large difference when estimating the cost of automatic NUMA balancing and
> > >> can be misleading when comparing results that had collapsed versus split
> > >> THP. This patch addresses the accounting issue.
> > >>
> > >> Signed-off-by: Mel Gorman <[email protected]>
> > >> ---
> > >> mm/mprotect.c | 2 +-
> > >> 1 file changed, 1 insertion(+), 1 deletion(-)
> > >>
> > >> diff --git a/mm/mprotect.c b/mm/mprotect.c
> > >> index 94722a4..2bbb648 100644
> > >> --- a/mm/mprotect.c
> > >> +++ b/mm/mprotect.c
> > >> @@ -145,7 +145,7 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
> > >> split_huge_page_pmd(vma, addr, pmd);
> > >> else if (change_huge_pmd(vma, pmd, addr, newprot,
> > >> prot_numa)) {
> > >> - pages += HPAGE_PMD_NR;
> > >> + pages++;
> > >
> > > But now you're not counting pages anymore..
> >
> > The migrate statistics still count pages. That makes sense, since the
> > amount of work scales with the amount of memory moved.
>
> Right.
>
> > It is just the "number of faults" counters that actually count the
> > number of faults again, instead of the number of pages represented
> > by each fault.
>
> So you're suggesting s/pages/faults/ or somesuch?
>

It's really the number of ptes that are updated.

> > IMHO this change makes sense.
>
> I never said the change didn't make sense as such. Just that we're no
> longer counting pages in change_*_range().

well, it's still a THP page. Is it worth renaming?

--
Mel Gorman
SUSE Labs

2013-09-16 16:36:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 17/50] mm: Do not flush TLB during protection change if !pte_present && !migration_entry

On Tue, Sep 10, 2013 at 10:31:57AM +0100, Mel Gorman wrote:
> NUMA PTE scanning is expensive both in terms of the scanning itself and
> the TLB flush if there are any updates. Currently non-present PTEs are
> accounted for as an update and incurring a TLB flush where it is only
> necessary for anonymous migration entries. This patch addresses the
> problem and should reduce TLB flushes.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/mprotect.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 1f9b54b..1e9cef0 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -109,8 +109,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> make_migration_entry_read(&entry);
> set_pte_at(mm, addr, pte,
> swp_entry_to_pte(entry));
> +
> + pages++;
> }
> - pages++;
> }
> } while (pte++, addr += PAGE_SIZE, addr != end);
> arch_leave_lazy_mmu_mode();

Should we fold this into patch 7?

2013-09-16 16:37:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 07/50] mm: Account for a THP NUMA hinting update as one PTE update

On Mon, Sep 16, 2013 at 05:11:50PM +0100, Mel Gorman wrote:
> > I never said the change didn't make sense as such. Just that we're no
> > longer counting pages in change_*_range().
>
> well, it's still a THP page. Is it worth renaming?

Dunno, the pedant in me needed to raise the issue :-)

2013-09-17 14:30:25

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH] hotplug: Optimize {get,put}_online_cpus()

Subject: hotplug: Optimize {get,put}_online_cpus()
From: Peter Zijlstra <[email protected]>
Date: Tue Sep 17 16:17:11 CEST 2013

The cpu hotplug lock is a purely reader biased read-write lock.

The current implementation uses global state, change it so the reader
side uses per-cpu state in the uncontended fast-path.

Cc: Oleg Nesterov <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/cpu.h | 33 ++++++++++++++-
kernel/cpu.c | 108 ++++++++++++++++++++++++++--------------------------
2 files changed, 87 insertions(+), 54 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/cpumask.h>
+#include <linux/percpu.h>

struct device;

@@ -175,8 +176,36 @@ extern struct bus_type cpu_subsys;

extern void cpu_hotplug_begin(void);
extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern struct task_struct *__cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+ might_sleep();
+
+ this_cpu_inc(__cpuhp_refcount);
+ /*
+ * Order the refcount inc against the writer read; pairs with the full
+ * barrier in cpu_hotplug_begin().
+ */
+ smp_mb();
+ if (unlikely(__cpuhp_writer))
+ __get_online_cpus();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+ barrier();
+ this_cpu_dec(__cpuhp_refcount);
+ if (unlikely(__cpuhp_writer))
+ __put_online_cpus();
+}
+
extern void cpu_hotplug_disable(void);
extern void cpu_hotplug_enable(void);
#define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,92 @@ static int cpu_hotplug_disabled;

#ifdef CONFIG_HOTPLUG_CPU

-static struct {
- struct task_struct *active_writer;
- struct mutex lock; /* Synchronizes accesses to refcount, */
- /*
- * Also blocks the new readers during
- * an ongoing cpu hotplug operation.
- */
- int refcount;
-} cpu_hotplug = {
- .active_writer = NULL,
- .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
- .refcount = 0,
-};
+struct task_struct *__cpuhp_writer = NULL;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);

-void get_online_cpus(void)
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
+
+void __get_online_cpus(void)
{
- might_sleep();
- if (cpu_hotplug.active_writer == current)
+ if (__cpuhp_writer == current)
return;
- mutex_lock(&cpu_hotplug.lock);
- cpu_hotplug.refcount++;
- mutex_unlock(&cpu_hotplug.lock);

+again:
+ /*
+ * Ensure a pending reading has a 0 refcount.
+ *
+ * Without this a new reader that comes in before cpu_hotplug_begin()
+ * reads the refcount will deadlock.
+ */
+ this_cpu_dec(__cpuhp_refcount);
+ wait_event(cpuhp_wq, !__cpuhp_writer);
+
+ this_cpu_inc(__cpuhp_refcount);
+ /*
+ * See get_online_cpu().
+ */
+ smp_mb();
+ if (unlikely(__cpuhp_writer))
+ goto again;
}
-EXPORT_SYMBOL_GPL(get_online_cpus);
+EXPORT_SYMBOL_GPL(__get_online_cpus);

-void put_online_cpus(void)
+void __put_online_cpus(void)
{
- if (cpu_hotplug.active_writer == current)
- return;
- mutex_lock(&cpu_hotplug.lock);
+ unsigned int refcnt = 0;
+ int cpu;

- if (WARN_ON(!cpu_hotplug.refcount))
- cpu_hotplug.refcount++; /* try to fix things up */
+ if (__cpuhp_writer == current)
+ return;

- if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
- wake_up_process(cpu_hotplug.active_writer);
- mutex_unlock(&cpu_hotplug.lock);
+ for_each_possible_cpu(cpu)
+ refcnt += per_cpu(__cpuhp_refcount, cpu);

+ if (!refcnt)
+ wake_up_process(__cpuhp_writer);
}
-EXPORT_SYMBOL_GPL(put_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);

/*
* This ensures that the hotplug operation can begin only when the
* refcount goes to zero.
*
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
* Since cpu_hotplug_begin() is always called after invoking
* cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- * writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- * non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
*/
void cpu_hotplug_begin(void)
{
- cpu_hotplug.active_writer = current;
+ __cpuhp_writer = current;

for (;;) {
- mutex_lock(&cpu_hotplug.lock);
- if (likely(!cpu_hotplug.refcount))
+ unsigned int refcnt = 0;
+ int cpu;
+
+ /*
+ * Order the setting of writer against the reading of refcount;
+ * pairs with the full barrier in get_online_cpus().
+ */
+
+ set_current_state(TASK_UNINTERRUPTIBLE);
+
+ for_each_possible_cpu(cpu)
+ refcnt += per_cpu(__cpuhp_refcount, cpu);
+
+ if (!refcnt)
break;
- __set_current_state(TASK_UNINTERRUPTIBLE);
- mutex_unlock(&cpu_hotplug.lock);
+
schedule();
}
+ __set_current_state(TASK_RUNNING);
}

void cpu_hotplug_done(void)
{
- cpu_hotplug.active_writer = NULL;
- mutex_unlock(&cpu_hotplug.lock);
+ __cpuhp_writer = NULL;
+ wake_up_all(&cpuhp_wq);
}

/*

2013-09-17 14:32:49

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 37/50] sched: Introduce migrate_swap()

On Tue, Sep 10, 2013 at 10:32:17AM +0100, Mel Gorman wrote:
> TODO: I'm fairly sure we can get rid of the wake_cpu != -1 test by keeping
> wake_cpu to the actual task cpu; just couldn't be bothered to think through
> all the cases.

> + * XXX worry about hotplug

Combined with the {get,put}_online_cpus() optimization patch, the below
should address the two outstanding issues.

Completely untested for now.. will try and get it some runtime later.

Not-yet-signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched/core.c | 37 ++++++++++++++++++++-----------------
kernel/sched/sched.h | 1 +
2 files changed, 21 insertions(+), 17 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1035,7 +1035,7 @@ static void __migrate_swap_task(struct t
/*
* Task isn't running anymore; make it appear like we migrated
* it before it went to sleep. This means on wakeup we make the
- * previous cpu or targer instead of where it really is.
+ * previous cpu our target instead of where it really is.
*/
p->wake_cpu = cpu;
}
@@ -1080,11 +1080,16 @@ static int migrate_swap_stop(void *data)
}

/*
- * XXX worry about hotplug
+ * Cross migrate two tasks
*/
int migrate_swap(struct task_struct *cur, struct task_struct *p)
{
- struct migration_swap_arg arg = {
+ struct migration_swap_arg arg;
+ int ret = -EINVAL;
+
+ get_online_cpus();
+
+ arg = (struct migration_swap_arg){
.src_task = cur,
.src_cpu = task_cpu(cur),
.dst_task = p,
@@ -1092,15 +1097,22 @@ int migrate_swap(struct task_struct *cur
};

if (arg.src_cpu == arg.dst_cpu)
- return -EINVAL;
+ goto out;
+
+ if (!cpu_active(arg.src_cpu) || !cpu_active(arg.dst_cpu))
+ goto out;

if (!cpumask_test_cpu(arg.dst_cpu, tsk_cpus_allowed(arg.src_task)))
- return -EINVAL;
+ goto out;

if (!cpumask_test_cpu(arg.src_cpu, tsk_cpus_allowed(arg.dst_task)))
- return -EINVAL;
+ goto out;
+
+ ret = stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);

- return stop_two_cpus(arg.dst_cpu, arg.src_cpu, migrate_swap_stop, &arg);
+out:
+ put_online_cpus();
+ return ret;
}

struct migration_arg {
@@ -1608,12 +1620,7 @@ try_to_wake_up(struct task_struct *p, un
if (p->sched_class->task_waking)
p->sched_class->task_waking(p);

- if (p->wake_cpu != -1) { /* XXX make this condition go away */
- cpu = p->wake_cpu;
- p->wake_cpu = -1;
- }
-
- cpu = select_task_rq(p, cpu, SD_BALANCE_WAKE, wake_flags);
+ cpu = select_task_rq(p, p->wake_cpu, SD_BALANCE_WAKE, wake_flags);
if (task_cpu(p) != cpu) {
wake_flags |= WF_MIGRATED;
set_task_cpu(p, cpu);
@@ -1699,10 +1706,6 @@ static void __sched_fork(struct task_str
{
p->on_rq = 0;

-#ifdef CONFIG_SMP
- p->wake_cpu = -1;
-#endif
-
p->se.on_rq = 0;
p->se.exec_start = 0;
p->se.sum_exec_runtime = 0;
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -737,6 +737,7 @@ static inline void __set_task_cpu(struct
*/
smp_wmb();
task_thread_info(p)->cpu = cpu;
+ p->wake_cpu = cpu;
#endif
}

2013-09-17 16:21:00

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Sep 17, 2013 at 04:30:03PM +0200, Peter Zijlstra wrote:
> Subject: hotplug: Optimize {get,put}_online_cpus()
> From: Peter Zijlstra <[email protected]>
> Date: Tue Sep 17 16:17:11 CEST 2013
>
> The cpu hotplug lock is a purely reader biased read-write lock.
>
> The current implementation uses global state, change it so the reader
> side uses per-cpu state in the uncontended fast-path.
>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Paul McKenney <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> include/linux/cpu.h | 33 ++++++++++++++-
> kernel/cpu.c | 108 ++++++++++++++++++++++++++--------------------------
> 2 files changed, 87 insertions(+), 54 deletions(-)
>
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
> #include <linux/node.h>
> #include <linux/compiler.h>
> #include <linux/cpumask.h>
> +#include <linux/percpu.h>
>
> struct device;
>
> @@ -175,8 +176,36 @@ extern struct bus_type cpu_subsys;
>
> extern void cpu_hotplug_begin(void);
> extern void cpu_hotplug_done(void);
> -extern void get_online_cpus(void);
> -extern void put_online_cpus(void);
> +
> +extern struct task_struct *__cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + this_cpu_inc(__cpuhp_refcount);
> + /*
> + * Order the refcount inc against the writer read; pairs with the full
> + * barrier in cpu_hotplug_begin().
> + */
> + smp_mb();
> + if (unlikely(__cpuhp_writer))
> + __get_online_cpus();
> +}
> +

If the problem with get_online_cpus() is the shared global state then a
full barrier in the fast path is still going to hurt. Granted, it will hurt
a lot less and there should be no lock contention.

However, what barrier in cpu_hotplug_begin is the comment referring to? The
other barrier is in the slowpath __get_online_cpus. Did you mean to do
an rmb here and a wmb after __cpuhp_writer is set in cpu_hotplug_begin?
I'm assuming you are currently using a full barrier to guarantee that an
update of __cpuhp_writer will be visible so get_online_cpus blocks but I'm
not 100% sure because of the comments.

> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> + barrier();

Why is this barrier necessary? I could not find anything that stated if an
inline function is an implicit compiler barrier but whether it is or not,
it's not clear why it's necessary at all.

> + this_cpu_dec(__cpuhp_refcount);
> + if (unlikely(__cpuhp_writer))
> + __put_online_cpus();
> +}
> +
> extern void cpu_hotplug_disable(void);
> extern void cpu_hotplug_enable(void);
> #define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -49,88 +49,92 @@ static int cpu_hotplug_disabled;
>
> #ifdef CONFIG_HOTPLUG_CPU
>
> -static struct {
> - struct task_struct *active_writer;
> - struct mutex lock; /* Synchronizes accesses to refcount, */
> - /*
> - * Also blocks the new readers during
> - * an ongoing cpu hotplug operation.
> - */
> - int refcount;
> -} cpu_hotplug = {
> - .active_writer = NULL,
> - .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
> - .refcount = 0,
> -};
> +struct task_struct *__cpuhp_writer = NULL;
> +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> +
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
>
> -void get_online_cpus(void)
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
> +
> +void __get_online_cpus(void)
> {
> - might_sleep();
> - if (cpu_hotplug.active_writer == current)
> + if (__cpuhp_writer == current)
> return;
> - mutex_lock(&cpu_hotplug.lock);
> - cpu_hotplug.refcount++;
> - mutex_unlock(&cpu_hotplug.lock);
>
> +again:
> + /*
> + * Ensure a pending reading has a 0 refcount.
> + *
> + * Without this a new reader that comes in before cpu_hotplug_begin()
> + * reads the refcount will deadlock.
> + */
> + this_cpu_dec(__cpuhp_refcount);
> + wait_event(cpuhp_wq, !__cpuhp_writer);
> +
> + this_cpu_inc(__cpuhp_refcount);
> + /*
> + * See get_online_cpu().
> + */
> + smp_mb();
> + if (unlikely(__cpuhp_writer))
> + goto again;
> }

If CPU hotplug operations are very frequent (or a stupid stress test) then
it's possible for a new hotplug operation to start (updating __cpuhp_writer)
before a caller to __get_online_cpus can update the refcount. Potentially
a caller to __get_online_cpus gets starved although as it only affects a
CPU hotplug stress test it may not be a serious issue.

> -EXPORT_SYMBOL_GPL(get_online_cpus);
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
>
> -void put_online_cpus(void)
> +void __put_online_cpus(void)
> {
> - if (cpu_hotplug.active_writer == current)
> - return;
> - mutex_lock(&cpu_hotplug.lock);
> + unsigned int refcnt = 0;
> + int cpu;
>
> - if (WARN_ON(!cpu_hotplug.refcount))
> - cpu_hotplug.refcount++; /* try to fix things up */
> + if (__cpuhp_writer == current)
> + return;
>
> - if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> - wake_up_process(cpu_hotplug.active_writer);
> - mutex_unlock(&cpu_hotplug.lock);
> + for_each_possible_cpu(cpu)
> + refcnt += per_cpu(__cpuhp_refcount, cpu);
>

This can result in spurious wakeups if CPU N calls get_online_cpus after
its refcnt has been checked but I could not think of a case where it
matters.

> + if (!refcnt)
> + wake_up_process(__cpuhp_writer);
> }
> -EXPORT_SYMBOL_GPL(put_online_cpus);
> +EXPORT_SYMBOL_GPL(__put_online_cpus);
>
> /*
> * This ensures that the hotplug operation can begin only when the
> * refcount goes to zero.
> *
> - * Note that during a cpu-hotplug operation, the new readers, if any,
> - * will be blocked by the cpu_hotplug.lock
> - *
> * Since cpu_hotplug_begin() is always called after invoking
> * cpu_maps_update_begin(), we can be sure that only one writer is active.
> - *
> - * Note that theoretically, there is a possibility of a livelock:
> - * - Refcount goes to zero, last reader wakes up the sleeping
> - * writer.
> - * - Last reader unlocks the cpu_hotplug.lock.
> - * - A new reader arrives at this moment, bumps up the refcount.
> - * - The writer acquires the cpu_hotplug.lock finds the refcount
> - * non zero and goes to sleep again.
> - *
> - * However, this is very difficult to achieve in practice since
> - * get_online_cpus() not an api which is called all that often.
> - *
> */
> void cpu_hotplug_begin(void)
> {
> - cpu_hotplug.active_writer = current;
> + __cpuhp_writer = current;
>
> for (;;) {
> - mutex_lock(&cpu_hotplug.lock);
> - if (likely(!cpu_hotplug.refcount))
> + unsigned int refcnt = 0;
> + int cpu;
> +
> + /*
> + * Order the setting of writer against the reading of refcount;
> + * pairs with the full barrier in get_online_cpus().
> + */
> +
> + set_current_state(TASK_UNINTERRUPTIBLE);
> +
> + for_each_possible_cpu(cpu)
> + refcnt += per_cpu(__cpuhp_refcount, cpu);
> +

CPU 0					CPU 1
get_online_cpus
  refcnt++
					__cpuhp_writer = current
					refcnt > 0
					schedule
__get_online_cpus slowpath
  refcnt--
  wait_event(!__cpuhp_writer)

What wakes up __cpuhp_writer to recheck the refcnts and see that they're
all 0?

> + if (!refcnt)
> break;
> - __set_current_state(TASK_UNINTERRUPTIBLE);
> - mutex_unlock(&cpu_hotplug.lock);
> +
> schedule();
> }
> + __set_current_state(TASK_RUNNING);
> }
>
> void cpu_hotplug_done(void)
> {
> - cpu_hotplug.active_writer = NULL;
> - mutex_unlock(&cpu_hotplug.lock);
> + __cpuhp_writer = NULL;
> + wake_up_all(&cpuhp_wq);
> }
>
> /*

--
Mel Gorman
SUSE Labs

2013-09-17 16:45:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Sep 17, 2013 at 05:20:50PM +0100, Mel Gorman wrote:
> > +extern struct task_struct *__cpuhp_writer;
> > +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> > +
> > +extern void __get_online_cpus(void);
> > +
> > +static inline void get_online_cpus(void)
> > +{
> > + might_sleep();
> > +
> > + this_cpu_inc(__cpuhp_refcount);
> > + /*
> > + * Order the refcount inc against the writer read; pairs with the full
> > + * barrier in cpu_hotplug_begin().
> > + */
> > + smp_mb();
> > + if (unlikely(__cpuhp_writer))
> > + __get_online_cpus();
> > +}
> > +
>
> If the problem with get_online_cpus() is the shared global state then a
> full barrier in the fast path is still going to hurt. Granted, it will hurt
> a lot less and there should be no lock contention.

I went for a lot less; I wasn't smart enough to get rid of it. Also,
since it's a lock op we should at least provide an ACQUIRE barrier.

> However, what barrier in cpu_hotplug_begin is the comment referring to?

set_current_state() implies a full barrier and nicely separates the
write to __cpuhp_writer and the read of __cpuhp_refcount.

> The
> other barrier is in the slowpath __get_online_cpus. Did you mean to do
> a rmb here and a wmb after __cpuhp_writer is set in cpu_hotplug_begin?

No, since we're ordering LOADs and STORES (see below) we must use full
barriers.

> I'm assuming you are currently using a full barrier to guarantee that an
> update of cpuhp_writer will be visible so get_online_cpus blocks but I'm
> not 100% sure because of the comments.

I'm ordering:

CPU0 -- get_online_cpus()               CPU1 -- cpu_hotplug_begin()

STORE __cpuhp_refcount                  STORE __cpuhp_writer

MB                                      MB

LOAD __cpuhp_writer                     LOAD __cpuhp_refcount

Such that neither can miss the state of the other and we get proper
mutual exclusion.
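
To make the pairing concrete, here is the same thing annotated against the
code fragments being discussed (an illustrative restatement only; ACCESS_ONCE()
is added here just to mark the interesting load):

	/* Reader side, as in get_online_cpus(): */
	this_cpu_inc(__cpuhp_refcount);			/* STORE __cpuhp_refcount */
	smp_mb();					/* full barrier */
	if (unlikely(ACCESS_ONCE(__cpuhp_writer)))	/* LOAD __cpuhp_writer */
		__get_online_cpus();			/* slow path */

	/* Writer side, as in cpu_hotplug_begin(): */
	__cpuhp_writer = current;			/* STORE __cpuhp_writer */
	set_current_state(TASK_UNINTERRUPTIBLE);	/* implies the full barrier */
	for_each_possible_cpu(cpu)			/* LOAD __cpuhp_refcount */
		refcnt += per_cpu(__cpuhp_refcount, cpu);

Whichever side "wins" the race, its store is guaranteed visible to the other
side's load, so at least one of them takes its slow path.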

> > +extern void __put_online_cpus(void);
> > +
> > +static inline void put_online_cpus(void)
> > +{
> > + barrier();
>
> Why is this barrier necessary?

To ensure the compiler keeps all loads/stores issued before the read-unlock
ahead of it, i.e. inside the critical section.

Arguably it should be a complete RELEASE barrier. I should've put an XXX
comment here but the brain gave out completely for the day.

> I could not find anything that stated if an
> inline function is an implicit compiler barrier but whether it is or not,
> it's not clear why it's necessary at all.

It is not; only actual function calls are an implied sync point for the
compiler.
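
As a small illustration of that point (shared_data and do_work() are made-up
names; this only sketches the hazard being described):

	static int shared_data;
	extern int do_work(void);

	void reader(void)
	{
		get_online_cpus();
		shared_data = do_work();	/* store made under the read-side "lock" */
		/*
		 * put_online_cpus() is inline, so without the barrier() at its
		 * top the compiler is, in principle, free to sink this store
		 * below the this_cpu_dec(), i.e. outside the critical section.
		 */
		put_online_cpus();
	}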

> > + this_cpu_dec(__cpuhp_refcount);
> > + if (unlikely(__cpuhp_writer))
> > + __put_online_cpus();
> > +}
> > +

> > +struct task_struct *__cpuhp_writer = NULL;
> > +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> > +
> > +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> > +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> >
> > +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
> > +
> > +void __get_online_cpus(void)
> > {
> > + if (__cpuhp_writer == current)
> > return;
> >
> > +again:
> > + /*
> > + * Ensure a pending reading has a 0 refcount.
> > + *
> > + * Without this a new reader that comes in before cpu_hotplug_begin()
> > + * reads the refcount will deadlock.
> > + */
> > + this_cpu_dec(__cpuhp_refcount);
> > + wait_event(cpuhp_wq, !__cpuhp_writer);
> > +
> > + this_cpu_inc(__cpuhp_refcount);
> > + /*
> > + * See get_online_cpu().
> > + */
> > + smp_mb();
> > + if (unlikely(__cpuhp_writer))
> > + goto again;
> > }
>
> If CPU hotplug operations are very frequent (or a stupid stress test) then
> it's possible for a new hotplug operation to start (updating __cpuhp_writer)
> before a caller to __get_online_cpus can update the refcount. Potentially
> a caller to __get_online_cpus gets starved although as it only affects a
> CPU hotplug stress test it may not be a serious issue.

Right.. If that ever becomes a problem we should fix it, but aside from
stress tests hotplug should be extremely rare.

Initially I kept the reference over the wait_event() but realized (as
per the comment) that that would deadlock cpu_hotplug_begin() for it
would never observe !refcount.

One solution for this problem is having refcount as an array of 2 and
flipping the index at the appropriate times.
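
A very rough sketch of that flip idea (purely illustrative: the names, the
structure and the flip protocol are invented here, and the writer side that
flips the index and drains the old slot is omitted):

	struct cpuhp_refs {
		unsigned int cnt[2];
	};
	static DEFINE_PER_CPU(struct cpuhp_refs, cpuhp_refs);
	static int cpuhp_idx;	/* flipped by the writer before it waits */

	static inline int cpuhp_read_enter(void)
	{
		int idx = ACCESS_ONCE(cpuhp_idx);

		this_cpu_inc(cpuhp_refs.cnt[idx]);
		smp_mb();
		return idx;	/* must be handed to the matching exit */
	}

	static inline void cpuhp_read_exit(int idx)
	{
		this_cpu_dec(cpuhp_refs.cnt[idx]);
	}

The writer would flip cpuhp_idx and then wait only for the old slot to drain,
so a reader parked in the slow path could keep its reference in the old slot
instead of dropping it -- essentially what SRCU does internally.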

> > +EXPORT_SYMBOL_GPL(__get_online_cpus);
> >
> > +void __put_online_cpus(void)
> > {
> > + unsigned int refcnt = 0;
> > + int cpu;
> >
> > + if (__cpuhp_writer == current)
> > + return;
> >
> > + for_each_possible_cpu(cpu)
> > + refcnt += per_cpu(__cpuhp_refcount, cpu);
> >
>
> This can result in spurious wakeups if CPU N calls get_online_cpus after
> its refcnt has been checked but I could not think of a case where it
> matters.

Right and right.. too many wakeups aren't a correctness issue. One
should try and minimize them for performance reasons though :-)

> > + if (!refcnt)
> > + wake_up_process(__cpuhp_writer);
> > }


> > /*
> > * This ensures that the hotplug operation can begin only when the
> > * refcount goes to zero.
> > *
> > * Since cpu_hotplug_begin() is always called after invoking
> > * cpu_maps_update_begin(), we can be sure that only one writer is active.
> > */
> > void cpu_hotplug_begin(void)
> > {
> > + __cpuhp_writer = current;
> >
> > for (;;) {
> > + unsigned int refcnt = 0;
> > + int cpu;
> > +
> > + /*
> > + * Order the setting of writer against the reading of refcount;
> > + * pairs with the full barrier in get_online_cpus().
> > + */
> > +
> > + set_current_state(TASK_UNINTERRUPTIBLE);
> > +
> > + for_each_possible_cpu(cpu)
> > + refcnt += per_cpu(__cpuhp_refcount, cpu);
> > +
>
> CPU 0                                 CPU 1
> get_online_cpus
>   refcnt++
>                                       __cpuhp_writer = current
>                                       refcnt > 0
>                                       schedule
> __get_online_cpus slowpath
>   refcnt--
>   wait_event(!__cpuhp_writer)
>
> What wakes up __cpuhp_writer to recheck the refcnts and see that they're
> all 0?

The wakeup in __put_online_cpus() you just commented on?
put_online_cpus() will drop into the slow path __put_online_cpus() if
there's a writer and compute the refcount and perform the wakeup when
!refcount.

> > + if (!refcnt)
> > break;
> > +
> > schedule();
> > }
> > + __set_current_state(TASK_RUNNING);
> > }

2013-09-17 17:00:38

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 17/50] mm: Do not flush TLB during protection change if !pte_present && !migration_entry

On Mon, Sep 16, 2013 at 06:35:47PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 10, 2013 at 10:31:57AM +0100, Mel Gorman wrote:
> > NUMA PTE scanning is expensive both in terms of the scanning itself and
> > the TLB flush if there are any updates. Currently non-present PTEs are
> > accounted for as an update and incurring a TLB flush where it is only
> > necessary for anonymous migration entries. This patch addresses the
> > problem and should reduce TLB flushes.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > mm/mprotect.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/mprotect.c b/mm/mprotect.c
> > index 1f9b54b..1e9cef0 100644
> > --- a/mm/mprotect.c
> > +++ b/mm/mprotect.c
> > @@ -109,8 +109,9 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > make_migration_entry_read(&entry);
> > set_pte_at(mm, addr, pte,
> > swp_entry_to_pte(entry));
> > +
> > + pages++;
> > }
> > - pages++;
> > }
> > } while (pte++, addr += PAGE_SIZE, addr != end);
> > arch_leave_lazy_mmu_mode();
>
> Should we fold this into patch 7 ?

Looking closer at it, I think folding it into the patch would overload
the purpose of patch 7 a little too much but I shuffled the series to
keep the patches together.

--
Mel Gorman
SUSE Labs

2013-09-18 15:50:04

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

New version, now with excessive comments.

I found a deadlock (where both reader and writer would go to sleep);
identified below as case 1b.

The implementation without patch is reader biased, this implementation,
as Mel pointed out, is writer biased. I should try and fix this but I'm
stepping away from the computer now as I have the feeling I'll only
wreck stuff from now on.

---
Subject: hotplug: Optimize {get,put}_online_cpus()
From: Peter Zijlstra <[email protected]>
Date: Tue Sep 17 16:17:11 CEST 2013

The current implementation uses global state, change it so the reader
side uses per-cpu state in the contended fast path.

Cc: Oleg Nesterov <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/cpu.h | 29 ++++++++-
kernel/cpu.c | 159 ++++++++++++++++++++++++++++++++++------------------
2 files changed, 134 insertions(+), 54 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/cpumask.h>
+#include <linux/percpu.h>

struct device;

@@ -175,8 +176,32 @@ extern struct bus_type cpu_subsys;

extern void cpu_hotplug_begin(void);
extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern struct task_struct *__cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+ might_sleep();
+
+ this_cpu_inc(__cpuhp_refcount);
+ smp_mb(); /* see comment near __get_online_cpus() */
+ if (unlikely(__cpuhp_writer))
+ __get_online_cpus();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+ this_cpu_dec(__cpuhp_refcount);
+ smp_mb(); /* see comment near __get_online_cpus() */
+ if (unlikely(__cpuhp_writer))
+ __put_online_cpus();
+}
+
extern void cpu_hotplug_disable(void);
extern void cpu_hotplug_enable(void);
#define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,143 @@ static int cpu_hotplug_disabled;

#ifdef CONFIG_HOTPLUG_CPU

-static struct {
- struct task_struct *active_writer;
- struct mutex lock; /* Synchronizes accesses to refcount, */
- /*
- * Also blocks the new readers during
- * an ongoing cpu hotplug operation.
- */
- int refcount;
-} cpu_hotplug = {
- .active_writer = NULL,
- .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
- .refcount = 0,
-};
+struct task_struct *__cpuhp_writer = NULL;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
+
+/*
+ * We must order things like:
+ *
+ * CPU0 -- read-lock                   CPU1 -- write-lock
+ *
+ * STORE __cpuhp_refcount              STORE __cpuhp_writer
+ * MB                                  MB
+ * LOAD __cpuhp_writer                 LOAD __cpuhp_refcount
+ *
+ *
+ * This gives rise to the following permutations:
+ *
+ * a) all of R happened before W
+ * b) R starts but sees the W store -- therefore W must see the R store
+ * W starts but sees the R store -- therefore R must see the W store
+ * c) all of W happens before R
+ *
+ * 1) RL vs WL:
+ *
+ * 1a) RL proceeds; WL observes refcount and goes wait for !refcount.
+ * 1b) RL drops into the slow path; WL waits for !refcount.
+ * 1c) WL proceeds; RL drops into the slow path.
+ *
+ * 2) RL vs WU:
+ *
+ * 2a) RL drops into the slow path; WU clears writer and wakes RL
+ * 2b) RL proceeds; WU continues to wake others
+ * 2c) RL proceeds.
+ *
+ * 3) RU vs WL:
+ *
+ * 3a) RU proceeds; WL proceeds.
+ * 3b) RU drops to slow path; WL proceeds
+ * 3c) WL waits for !refcount; RU drops to slow path
+ *
+ * 4) RU vs WU:
+ *
+ * Impossible since R and W state are mutually exclusive.
+ *
+ * This leaves us to consider the R slow paths:
+ *
+ * RL
+ *
+ * 1b) we must wake W
+ * 2a) nothing of importance
+ *
+ * RU
+ *
+ * 3b) nothing of importance
+ * 3c) we must wake W
+ *
+ */

-void get_online_cpus(void)
+void __get_online_cpus(void)
{
- might_sleep();
- if (cpu_hotplug.active_writer == current)
+ if (__cpuhp_writer == current)
return;
- mutex_lock(&cpu_hotplug.lock);
- cpu_hotplug.refcount++;
- mutex_unlock(&cpu_hotplug.lock);

+again:
+ /*
+ * Case 1b; we must decrement our refcount again otherwise WL will
+ * never observe !refcount and stay blocked forever. Not good since
+ * we're going to sleep too. Someone must be awake and do something.
+ *
+ * Skip recomputing the refcount, just wake the pending writer and
+ * have him check it -- writers are rare.
+ */
+ this_cpu_dec(__cpuhp_refcount);
+ wake_up_process(__cpuhp_writer); /* implies MB */
+
+ wait_event(cpuhp_wq, !__cpuhp_writer);
+
+ /* Basically re-do the fast-path. Except we can never be the writer. */
+ this_cpu_inc(__cpuhp_refcount);
+ smp_mb();
+ if (unlikely(__cpuhp_writer))
+ goto again;
}
-EXPORT_SYMBOL_GPL(get_online_cpus);
+EXPORT_SYMBOL_GPL(__get_online_cpus);

-void put_online_cpus(void)
+void __put_online_cpus(void)
{
- if (cpu_hotplug.active_writer == current)
- return;
- mutex_lock(&cpu_hotplug.lock);
+ unsigned int refcnt = 0;
+ int cpu;

- if (WARN_ON(!cpu_hotplug.refcount))
- cpu_hotplug.refcount++; /* try to fix things up */
+ if (__cpuhp_writer == current)
+ return;

- if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
- wake_up_process(cpu_hotplug.active_writer);
- mutex_unlock(&cpu_hotplug.lock);
+ /* 3c */
+ for_each_possible_cpu(cpu)
+ refcnt += per_cpu(__cpuhp_refcount, cpu);

+ if (!refcnt)
+ wake_up_process(__cpuhp_writer);
}
-EXPORT_SYMBOL_GPL(put_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);

/*
* This ensures that the hotplug operation can begin only when the
* refcount goes to zero.
*
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
* Since cpu_hotplug_begin() is always called after invoking
* cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- * writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- * non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
*/
void cpu_hotplug_begin(void)
{
- cpu_hotplug.active_writer = current;
+ __cpuhp_writer = current;

for (;;) {
- mutex_lock(&cpu_hotplug.lock);
- if (likely(!cpu_hotplug.refcount))
+ unsigned int refcnt = 0;
+ int cpu;
+
+ set_current_state(TASK_UNINTERRUPTIBLE); /* implies MB */
+
+ for_each_possible_cpu(cpu)
+ refcnt += per_cpu(__cpuhp_refcount, cpu);
+
+ if (!refcnt)
break;
- __set_current_state(TASK_UNINTERRUPTIBLE);
- mutex_unlock(&cpu_hotplug.lock);
+
schedule();
}
+ __set_current_state(TASK_RUNNING);
}

void cpu_hotplug_done(void)
{
- cpu_hotplug.active_writer = NULL;
- mutex_unlock(&cpu_hotplug.lock);
+ __cpuhp_writer = NULL;
+ wake_up_all(&cpuhp_wq); /* implies MB */
}

/*

2013-09-19 14:33:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()




Meh, I should stop poking at this..

This one lost all the comments again :/

It uses preempt_disable/preempt_enable vs synchronize_sched() to remove
the barriers from the fast path.

After that it waits for !refcount before setting state, which stops new
readers.

I used a per-cpu spinlock to keep the state check and refcount inc
atomic vs the setting of state.

So the slow path is still per-cpu and mostly uncontended even in the
pending writer case.

After setting state it again waits for !refcount -- someone could have
sneaked in between the last !refcount and setting state. But this time
we know refcount will stay 0.

The only thing I don't really like is the unconditional writer wake in
the read-unlock slowpath, but I couldn't come up with anything better.
Here at least we guarantee that there is a wakeup after the last dec --
although there might be far too many wakes.

---
Subject: hotplug: Optimize {get,put}_online_cpus()
From: Peter Zijlstra <[email protected]>
Date: Tue Sep 17 16:17:11 CEST 2013

Cc: Oleg Nesterov <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/cpu.h | 32 ++++++++++-
kernel/cpu.c | 151 +++++++++++++++++++++++++++++-----------------------
2 files changed, 116 insertions(+), 67 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/cpumask.h>
+#include <linux/percpu.h>

struct device;

@@ -175,8 +176,35 @@ extern struct bus_type cpu_subsys;

extern void cpu_hotplug_begin(void);
extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern struct task_struct *__cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+ might_sleep();
+
+ preempt_disable();
+ if (likely(!__cpuhp_writer || __cpuhp_writer == current))
+ this_cpu_inc(__cpuhp_refcount);
+ else
+ __get_online_cpus();
+ preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+ preempt_disable();
+ this_cpu_dec(__cpuhp_refcount);
+ if (unlikely(__cpuhp_writer && __cpuhp_writer != current))
+ __put_online_cpus();
+ preempt_enable();
+}
+
extern void cpu_hotplug_disable(void);
extern void cpu_hotplug_enable(void);
#define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,109 @@ static int cpu_hotplug_disabled;

#ifdef CONFIG_HOTPLUG_CPU

-static struct {
- struct task_struct *active_writer;
- struct mutex lock; /* Synchronizes accesses to refcount, */
- /*
- * Also blocks the new readers during
- * an ongoing cpu hotplug operation.
- */
- int refcount;
-} cpu_hotplug = {
- .active_writer = NULL,
- .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
- .refcount = 0,
-};
-
-void get_online_cpus(void)
-{
- might_sleep();
- if (cpu_hotplug.active_writer == current)
- return;
- mutex_lock(&cpu_hotplug.lock);
- cpu_hotplug.refcount++;
- mutex_unlock(&cpu_hotplug.lock);
-
-}
-EXPORT_SYMBOL_GPL(get_online_cpus);
-
-void put_online_cpus(void)
-{
- if (cpu_hotplug.active_writer == current)
- return;
- mutex_lock(&cpu_hotplug.lock);
-
- if (WARN_ON(!cpu_hotplug.refcount))
- cpu_hotplug.refcount++; /* try to fix things up */
-
- if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
- wake_up_process(cpu_hotplug.active_writer);
- mutex_unlock(&cpu_hotplug.lock);
+struct task_struct *__cpuhp_writer = NULL;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);

+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DEFINE_PER_CPU(int, cpuhp_state);
+static DEFINE_PER_CPU(spinlock_t, cpuhp_lock);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
+
+void __get_online_cpus(void)
+{
+ spin_lock(__this_cpu_ptr(&cpuhp_lock));
+ for (;;) {
+ if (!__this_cpu_read(cpuhp_state)) {
+ __this_cpu_inc(__cpuhp_refcount);
+ break;
+ }
+
+ spin_unlock(__this_cpu_ptr(&cpuhp_lock));
+ preempt_enable();
+
+ wait_event(cpuhp_wq, !__cpuhp_writer);
+
+ preempt_disable();
+ spin_lock(__this_cpu_ptr(&cpuhp_lock));
+ }
+ spin_unlock(__this_cpu_ptr(&cpuhp_lock));
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
+
+void __put_online_cpus(void)
+{
+ wake_up_process(__cpuhp_writer);
+}
+EXPORT_SYMBOL_GPL(__put_online_cpus);
+
+static void cpuph_wait_refcount(void)
+{
+ for (;;) {
+ unsigned int refcnt = 0;
+ int cpu;
+
+ set_current_state(TASK_UNINTERRUPTIBLE);
+
+ for_each_possible_cpu(cpu)
+ refcnt += per_cpu(__cpuhp_refcount, cpu);
+
+ if (!refcnt)
+ break;
+
+ schedule();
+ }
+ __set_current_state(TASK_RUNNING);
+}
+
+static void cpuhp_set_state(int state)
+{
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ spinlock_t *lock = &per_cpu(cpuhp_lock, cpu);
+
+ spin_lock(lock);
+ per_cpu(cpuhp_state, cpu) = state;
+ spin_unlock(lock);
+ }
}
-EXPORT_SYMBOL_GPL(put_online_cpus);

/*
* This ensures that the hotplug operation can begin only when the
* refcount goes to zero.
*
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
* Since cpu_hotplug_begin() is always called after invoking
* cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- * writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- * non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
*/
void cpu_hotplug_begin(void)
{
- cpu_hotplug.active_writer = current;
+ lockdep_assert_held(&cpu_add_remove_lock);

- for (;;) {
- mutex_lock(&cpu_hotplug.lock);
- if (likely(!cpu_hotplug.refcount))
- break;
- __set_current_state(TASK_UNINTERRUPTIBLE);
- mutex_unlock(&cpu_hotplug.lock);
- schedule();
- }
+ __cpuhp_writer = current;
+
+ /* After this everybody will observe _writer and take the slow path. */
+ synchronize_sched();
+
+ /* Wait for no readers -- reader preference */
+ cpuhp_wait_refcount();
+
+ /* Stop new readers. */
+ cpuhp_set_state(1);
+
+ /* Wait for no readers */
+ cpuhp_wait_refcount();
}

void cpu_hotplug_done(void)
{
- cpu_hotplug.active_writer = NULL;
- mutex_unlock(&cpu_hotplug.lock);
+ __cpuhp_writer = NULL;
+
+ /* Allow new readers */
+ cpuhp_set_state(0);
+
+ wake_up_all(&cpuhp_wq);
}

/*

2013-09-20 09:55:41

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 46/50] sched: numa: Prevent parallel updates to group stats during placement

On Tue, Sep 10, 2013 at 10:32:26AM +0100, Mel Gorman wrote:
> Having multiple tasks in a group go through task_numa_placement
> simultaneously can lead to a task picking a wrong node to run on, because
> the group stats may be in the middle of an update. This patch avoids
> parallel updates by holding the numa_group lock during placement
> decisions.
>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> kernel/sched/fair.c | 35 +++++++++++++++++++++++------------
> 1 file changed, 23 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3a92c58..4653f71 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1231,6 +1231,7 @@ static void task_numa_placement(struct task_struct *p)
> {
> int seq, nid, max_nid = -1, max_group_nid = -1;
> unsigned long max_faults = 0, max_group_faults = 0;
> + spinlock_t *group_lock = NULL;
>
> seq = ACCESS_ONCE(p->mm->numa_scan_seq);
> if (p->numa_scan_seq == seq)
> @@ -1239,6 +1240,12 @@ static void task_numa_placement(struct task_struct *p)
> p->numa_migrate_seq++;
> p->numa_scan_period_max = task_scan_max(p);
>
> + /* If the task is part of a group prevent parallel updates to group stats */
> + if (p->numa_group) {
> + group_lock = &p->numa_group->lock;
> + spin_lock(group_lock);
> + }
> +
> /* Find the node with the highest number of faults */
> for_each_online_node(nid) {
> unsigned long faults = 0, group_faults = 0;
> @@ -1277,20 +1284,24 @@ static void task_numa_placement(struct task_struct *p)
> }
> }
>
> + if (p->numa_group) {
> + /*
> + * If the preferred task and group nids are different,
> + * iterate over the nodes again to find the best place.
> + */
> + if (max_nid != max_group_nid) {
> + unsigned long weight, max_weight = 0;
> +
> + for_each_online_node(nid) {
> + weight = task_weight(p, nid) + group_weight(p, nid);
> + if (weight > max_weight) {
> + max_weight = weight;
> + max_nid = nid;
> + }
> }
> }
> +
> + spin_unlock(group_lock);
> }
>
> /* Preferred node as the node with the most faults */

If you're going to hold locks you can also do away with all that
atomic_long_*() nonsense :-)

2013-09-20 12:32:18

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 46/50] sched: numa: Prevent parallel updates to group stats during placement

On Fri, Sep 20, 2013 at 11:55:26AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 10, 2013 at 10:32:26AM +0100, Mel Gorman wrote:
> > Having multiple tasks in a group go through task_numa_placement
> > simultaneously can lead to a task picking a wrong node to run on, because
> > the group stats may be in the middle of an update. This patch avoids
> > parallel updates by holding the numa_group lock during placement
> > decisions.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> > ---
> > kernel/sched/fair.c | 35 +++++++++++++++++++++++------------
> > 1 file changed, 23 insertions(+), 12 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 3a92c58..4653f71 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1231,6 +1231,7 @@ static void task_numa_placement(struct task_struct *p)
> > {
> > int seq, nid, max_nid = -1, max_group_nid = -1;
> > unsigned long max_faults = 0, max_group_faults = 0;
> > + spinlock_t *group_lock = NULL;
> >
> > seq = ACCESS_ONCE(p->mm->numa_scan_seq);
> > if (p->numa_scan_seq == seq)
> > @@ -1239,6 +1240,12 @@ static void task_numa_placement(struct task_struct *p)
> > p->numa_migrate_seq++;
> > p->numa_scan_period_max = task_scan_max(p);
> >
> > + /* If the task is part of a group prevent parallel updates to group stats */
> > + if (p->numa_group) {
> > + group_lock = &p->numa_group->lock;
> > + spin_lock(group_lock);
> > + }
> > +
> > /* Find the node with the highest number of faults */
> > for_each_online_node(nid) {
> > unsigned long faults = 0, group_faults = 0;
> > @@ -1277,20 +1284,24 @@ static void task_numa_placement(struct task_struct *p)
> > }
> > }
> >
> > + if (p->numa_group) {
> > + /*
> > + * If the preferred task and group nids are different,
> > + * iterate over the nodes again to find the best place.
> > + */
> > + if (max_nid != max_group_nid) {
> > + unsigned long weight, max_weight = 0;
> > +
> > + for_each_online_node(nid) {
> > + weight = task_weight(p, nid) + group_weight(p, nid);
> > + if (weight > max_weight) {
> > + max_weight = weight;
> > + max_nid = nid;
> > + }
> > }
> > }
> > +
> > + spin_unlock(group_lock);
> > }
> >
> > /* Preferred node as the node with the most faults */
>
> If you're going to hold locks you can also do away with all that
> atomic_long_*() nonsense :-)

Yep! Easily done. The patch is untested but should be straightforward.

---8<---
sched: numa: use longs for numa group fault stats

As Peter says "If you're going to hold locks you can also do away with all
that atomic_long_*() nonsense". Lock aquisition moved slightly to protect
the updates. numa_group faults stats type are still "long" to add a basic
sanity check for fault counts going negative.

Signed-off-by: Mel Gorman <[email protected]>
---
kernel/sched/fair.c | 54 ++++++++++++++++++++++++-----------------------------
1 file changed, 24 insertions(+), 30 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a2963..c09687d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -897,8 +897,8 @@ struct numa_group {
struct list_head task_list;

struct rcu_head rcu;
- atomic_long_t total_faults;
- atomic_long_t faults[0];
+ long total_faults;
+ long faults[0];
};

pid_t task_numa_group_id(struct task_struct *p)
@@ -925,8 +925,7 @@ static inline unsigned long group_faults(struct task_struct *p, int nid)
if (!p->numa_group)
return 0;

- return atomic_long_read(&p->numa_group->faults[2*nid]) +
- atomic_long_read(&p->numa_group->faults[2*nid+1]);
+ return p->numa_group->faults[2*nid] + p->numa_group->faults[2*nid+1];
}

/*
@@ -952,17 +951,10 @@ static inline unsigned long task_weight(struct task_struct *p, int nid)

static inline unsigned long group_weight(struct task_struct *p, int nid)
{
- unsigned long total_faults;
-
- if (!p->numa_group)
- return 0;
-
- total_faults = atomic_long_read(&p->numa_group->total_faults);
-
- if (!total_faults)
+ if (!p->numa_group || !p->numa_group->total_faults)
return 0;

- return 1200 * group_faults(p, nid) / total_faults;
+ return 1200 * group_faults(p, nid) / p->numa_group->total_faults;
}

static unsigned long weighted_cpuload(const int cpu);
@@ -1267,9 +1259,9 @@ static void task_numa_placement(struct task_struct *p)
p->total_numa_faults += diff;
if (p->numa_group) {
/* safe because we can only change our own group */
- atomic_long_add(diff, &p->numa_group->faults[i]);
- atomic_long_add(diff, &p->numa_group->total_faults);
- group_faults += atomic_long_read(&p->numa_group->faults[i]);
+ p->numa_group->faults[i] += diff;
+ p->numa_group->total_faults += diff;
+ group_faults += p->numa_group->faults[i];
}
}

@@ -1343,7 +1335,7 @@ static void task_numa_group(struct task_struct *p, int cpupid)

if (unlikely(!p->numa_group)) {
unsigned int size = sizeof(struct numa_group) +
- 2*nr_node_ids*sizeof(atomic_long_t);
+ 2*nr_node_ids*sizeof(long);

grp = kzalloc(size, GFP_KERNEL | __GFP_NOWARN);
if (!grp)
@@ -1355,9 +1347,9 @@ static void task_numa_group(struct task_struct *p, int cpupid)
grp->gid = p->pid;

for (i = 0; i < 2*nr_node_ids; i++)
- atomic_long_set(&grp->faults[i], p->numa_faults[i]);
+ grp->faults[i] = p->numa_faults[i];

- atomic_long_set(&grp->total_faults, p->total_numa_faults);
+ grp->total_faults = p->total_numa_faults;

list_add(&p->numa_entry, &grp->task_list);
grp->nr_tasks++;
@@ -1402,14 +1394,15 @@ unlock:
if (!join)
return;

+ double_lock(&my_grp->lock, &grp->lock);
+
for (i = 0; i < 2*nr_node_ids; i++) {
- atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
- atomic_long_add(p->numa_faults[i], &grp->faults[i]);
+ my_grp->faults[i] -= p->numa_faults[i];
+ grp->faults[i] -= p->numa_faults[i];
+ WARN_ON_ONCE(grp->faults[i] < 0);
}
- atomic_long_sub(p->total_numa_faults, &my_grp->total_faults);
- atomic_long_add(p->total_numa_faults, &grp->total_faults);
-
- double_lock(&my_grp->lock, &grp->lock);
+ my_grp->total_faults -= p->total_numa_faults;
+ grp->total_faults -= p->total_numa_faults;

list_move(&p->numa_entry, &grp->task_list);
my_grp->nr_tasks--;
@@ -1430,12 +1423,13 @@ void task_numa_free(struct task_struct *p)
void *numa_faults = p->numa_faults;

if (grp) {
- for (i = 0; i < 2*nr_node_ids; i++)
- atomic_long_sub(p->numa_faults[i], &grp->faults[i]);
-
- atomic_long_sub(p->total_numa_faults, &grp->total_faults);
-
spin_lock(&grp->lock);
+ for (i = 0; i < 2*nr_node_ids; i++) {
+ grp->faults[i] -= p->numa_faults[i];
+ WARN_ON_ONCE(grp->faults[i] < 0);
+ }
+ grp->total_faults -= p->total_numa_faults;
+
list_del(&p->numa_entry);
grp->nr_tasks--;
spin_unlock(&grp->lock);

2013-09-20 12:37:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 46/50] sched: numa: Prevent parallel updates to group stats during placement

On Fri, Sep 20, 2013 at 01:31:52PM +0100, Mel Gorman wrote:
> static inline unsigned long group_weight(struct task_struct *p, int nid)
> {
> + if (!p->numa_group || !p->numa_group->total_faults)
> return 0;
>
> + return 1200 * group_faults(p, nid) / p->numa_group->total_faults;
> }

Unrelated to this change, but I recently thought we might want to change
these weight factors based on whether the task is predominantly private or
shared.

For shared we use the bigger weight for group, for private we use the
bigger weight for task.
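
Sketching what that might look like (hypothetical only: task_numa_is_private()
and the exact scale factors are invented for illustration):

	/*
	 * Hypothetical placement weight: scale up whichever component should
	 * dominate, depending on whether the task's faults are mostly private
	 * or mostly shared.
	 */
	static inline unsigned long placement_weight(struct task_struct *p, int nid)
	{
		unsigned long tw = task_weight(p, nid);
		unsigned long gw = group_weight(p, nid);

		if (task_numa_is_private(p))	/* invented helper */
			return 5 * tw / 4 + gw;	/* favour per-task faults */

		return tw + 5 * gw / 4;		/* favour group faults */
	}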

2013-09-20 13:31:36

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 46/50] sched: numa: Prevent parallel updates to group stats during placement

On Fri, Sep 20, 2013 at 01:31:51PM +0100, Mel Gorman wrote:
> @@ -1402,14 +1394,15 @@ unlock:
> if (!join)
> return;
>
> + double_lock(&my_grp->lock, &grp->lock);
> +
> for (i = 0; i < 2*nr_node_ids; i++) {
> - atomic_long_sub(p->numa_faults[i], &my_grp->faults[i]);
> - atomic_long_add(p->numa_faults[i], &grp->faults[i]);
> + my_grp->faults[i] -= p->numa_faults[i];
> + grp->faults[i] -= p->numa_faults[i];
> + WARN_ON_ONCE(grp->faults[i] < 0);
> }

That stupidity got fixed

--
Mel Gorman
SUSE Labs

2013-09-21 16:41:11

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

Sorry for delay, I was sick...

On 09/19, Peter Zijlstra wrote:
>
> I used a per-cpu spinlock to keep the state check and refcount inc
> atomic vs the setting of state.

I think this could be simpler, see below.

> So the slow path is still per-cpu and mostly uncontended even in the
> pending writer case.

Is it really important? I mean, per-cpu/uncontended even if the writer
is pending?

Otherwise we could do

static DEFINE_PER_CPU(long, cpuhp_fast_ctr);
static struct task_struct *cpuhp_writer;
static DEFINE_MUTEX(cpuhp_slow_lock);
static long cpuhp_slow_ctr;

static bool update_fast_ctr(int inc)
{
bool success = true;

preempt_disable();
if (likely(!cpuhp_writer))
__get_cpu_var(cpuhp_fast_ctr) += inc;
else if (cpuhp_writer != current)
success = false;
preempt_enable();

return success;
}

void get_online_cpus(void)
{
if (likely(update_fast_ctr(+1)))
return;

mutex_lock(&cpuhp_slow_lock);
cpuhp_slow_ctr++;
mutex_unlock(&cpuhp_slow_lock);
}

void put_online_cpus(void)
{
if (likely(update_fast_ctr(-1)))
return;

mutex_lock(&cpuhp_slow_lock);
if (!--cpuhp_slow_ctr && cpuhp_writer)
wake_up_process(cpuhp_writer);
mutex_unlock(&cpuhp_slow_lock);
}

static long clear_fast_ctr(void)
{
long total = 0;
int cpu;

for_each_possible_cpu(cpu) {
total += per_cpu(cpuhp_fast_ctr, cpu);
per_cpu(cpuhp_fast_ctr, cpu) = 0;
}

return total;
}

static void cpu_hotplug_begin(void)
{
cpuhp_writer = current;
synchronize_sched();

/* Nobody except us can use cpuhp_fast_ctr */

mutex_lock(&cpuhp_slow_lock);
cpuhp_slow_ctr += clear_fast_ctr();

while (cpuhp_slow_ctr) {
__set_current_state(TASK_UNINTERRUPTIBLE);
mutex_unlock(&&cpuhp_slow_lock);
schedule();
mutex_lock(&cpuhp_slow_lock);
}
}

static void cpu_hotplug_done(void)
{
cpuhp_writer = NULL;
mutex_unlock(&cpuhp_slow_lock);
}

I already sent this code in 2010, it needs some trivial updates.

But. We already have percpu_rw_semaphore, can't we reuse it? In fact
I thought about this from the very beginning. We just need
percpu_down_write_recursive_readers(), which does

bool xxx(brw)
{
if (down_trylock(&brw->rw_sem))
return false;
if (!atomic_read(&brw->slow_read_ctr))
return true;
up_write(&brw->rw_sem);
return false;
}

wait_event(brw->write_waitq, xxx(brw));

instead of down_write() + wait_event(!atomic_read(&brw->slow_read_ctr)).

The only problem is the lockdep annotations in percpu_down_read(), but
this looks simple; we just need down_read_no_lockdep() (like __up_read).

Note also that percpu_down_write/percpu_up_write can be improved wrt
synchronize_sched(). We can turn the 2nd one into call_rcu(), and the
1st one can be avoided if another percpu_down_write() comes "soon after"
percpu_up_write().


As for the patch itself, I am not sure.

> +static void cpuph_wait_refcount(void)
> +{
> + for (;;) {
> + unsigned int refcnt = 0;
> + int cpu;
> +
> + set_current_state(TASK_UNINTERRUPTIBLE);
> +
> + for_each_possible_cpu(cpu)
> + refcnt += per_cpu(__cpuhp_refcount, cpu);
> +
> + if (!refcnt)
> + break;
> +
> + schedule();
> + }
> + __set_current_state(TASK_RUNNING);
> +}

It seems, this can succeed while it should not, see below.

> void cpu_hotplug_begin(void)
> {
> - cpu_hotplug.active_writer = current;
> + lockdep_assert_held(&cpu_add_remove_lock);
>
> - for (;;) {
> - mutex_lock(&cpu_hotplug.lock);
> - if (likely(!cpu_hotplug.refcount))
> - break;
> - __set_current_state(TASK_UNINTERRUPTIBLE);
> - mutex_unlock(&cpu_hotplug.lock);
> - schedule();
> - }
> + __cpuhp_writer = current;
> +
> + /* After this everybody will observe _writer and take the slow path. */
> + synchronize_sched();

Yes, the reader should see _writer, but:

> + /* Wait for no readers -- reader preference */
> + cpuhp_wait_refcount();

but how we can ensure the writer sees the results of the reader's updates?

Suppose that we have 2 CPU's, __cpuhp_refcount[0] = 0, __cpuhp_refcount[1] = 1.
IOW, we have a single R reader which takes this lock on CPU_1 and sleeps.

Now,

- The writer calls cpuph_wait_refcount()

- cpuph_wait_refcount() does refcnt += __cpuhp_refcount[0].
refcnt == 0.

- another reader comes on CPU_0, increments __cpuhp_refcount[0].

- this reader migrates to CPU_1 and does put_online_cpus(),
this decrements __cpuhp_refcount[1] which becomes zero.

- cpuph_wait_refcount() continues and reads __cpuhp_refcount[1]
which is zero. refcnt == 0, return.

- The writer does cpuhp_set_state(1).

- The reader R (original reader) wakes up, calls get_online_cpus()
recursively, and sleeps in wait_event(!__cpuhp_writer).

Btw, I think that __sb_start_write/etc is equally wrong. Perhaps it is
another potential user of percpu_rw_sem.

Oleg.

2013-09-21 19:20:23

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/21, Oleg Nesterov wrote:
>
> As for the patch itself, I am not sure.

Forgot to mention... and with this patch cpu_hotplug_done() loses the
"release" semantics; I'm not sure that is fine.

Oleg.

2013-09-23 09:30:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Sat, Sep 21, 2013 at 06:34:04PM +0200, Oleg Nesterov wrote:
> > So the slow path is still per-cpu and mostly uncontended even in the
> > pending writer case.
>
> Is it really important? I mean, per-cpu/uncontended even if the writer
> is pending?

I think so, once we make {get,put}_online_cpus() really cheap they'll
get in more and more places, and the global count with pending writer
will make things crawl on bigger machines.

> Otherwise we could do

<snip>

> I already sent this code in 2010, it needs some trivial updates.

Yeah, I found that a few days ago.. but per the above I didn't like the
pending writer case.

> But. We already have percpu_rw_semaphore,

Oh urgh, forgot about that one. /me goes read.

/me curses loudly.. that thing has an _expedited() call in it, those
should die.

Also, it suffers the same problem. I think esp. for hotplug we should be
100% geared towards readers and pretty much damn the writers.

I'd dread to think what would happen if a 4k cpu machine were to land in
the slow path on that global mutex. Readers would never go away and
progress would make a glacier seem fast.

> Note also that percpu_down_write/percpu_up_write can be improved wrt
> synchronize_sched(). We can turn the 2nd one into call_rcu(), and the
> 1st one can be avoided if another percpu_down_write() comes "soon after"
> percpu_up_write().

Write side be damned ;-)

It is anyway, with a pure read bias and a large machine..

> As for the patch itself, I am not sure.
>
> > +static void cpuph_wait_refcount(void)
>
> It seems, this can succeed while it should not, see below.
>
> > void cpu_hotplug_begin(void)
> > {
> > + lockdep_assert_held(&cpu_add_remove_lock);
> >
> > + __cpuhp_writer = current;
> > +
> > + /* After this everybody will observe _writer and take the slow path. */
> > + synchronize_sched();
>
> Yes, the reader should see _writer, but:
>
> > + /* Wait for no readers -- reader preference */
> > + cpuhp_wait_refcount();
>
> but how we can ensure the writer sees the results of the reader's updates?
>
> Suppose that we have 2 CPU's, __cpuhp_refcount[0] = 0, __cpuhp_refcount[1] = 1.
> IOW, we have a single R reader which takes this lock on CPU_1 and sleeps.
>
> Now,
>
> - The writer calls cpuph_wait_refcount()
>
> - cpuph_wait_refcount() does refcnt += __cpuhp_refcount[0].
> refcnt == 0.
>
> - another reader comes on CPU_0, increments __cpuhp_refcount[0].
>
> - this reader migrates to CPU_1 and does put_online_cpus(),
> this decrements __cpuhp_refcount[1] which becomes zero.
>
> - cpuph_wait_refcount() continues and reads __cpuhp_refcount[1]
> which is zero. refcnt == 0, return.
>
> - The writer does cpuhp_set_state(1).
>
> - The reader R (original reader) wakes up, calls get_online_cpus()
> recursively, and sleeps in wait_event(!__cpuhp_writer).

Ah indeed..

The best I can come up with is something like:

static unsigned int cpuhp_refcount(void)
{
unsigned int refcount = 0;
int cpu;

for_each_possible_cpu(cpu)
refcount += per_cpu(__cpuhp_refcount, cpu);

return refcount;
}

static void cpuhp_wait_refcount(void)
{
for (;;) {
unsigned int rc1, rc2;

rc1 = cpuhp_refcount();
set_current_state(TASK_UNINTERRUPTIBLE); /* MB */
rc2 = cpuhp_refcount();

if (rc1 == rc2 && !rc1)
break;

schedule();
}
__set_current_state(TASK_RUNNING);
}

2013-09-23 14:50:24

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Thu, 19 Sep 2013 16:32:41 +0200
Peter Zijlstra <[email protected]> wrote:


> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + preempt_disable();
> + if (likely(!__cpuhp_writer || __cpuhp_writer == current))
> + this_cpu_inc(__cpuhp_refcount);
> + else
> + __get_online_cpus();
> + preempt_enable();
> +}


This isn't much different than srcu_read_lock(). What about doing
something like this:

static inline void get_online_cpus(void)
{
might_sleep();

srcu_read_lock(&cpuhp_srcu);
if (unlikely(__cpuhp_writer || __cpuhp_writer != current)) {
srcu_read_unlock(&cpuhp_srcu);
__get_online_cpus();
current->online_cpus_held++;
}
}

static inline void put_online_cpus(void)
{
if (unlikely(current->online_cpus_held)) {
current->online_cpus_held--;
__put_online_cpus();
return;
}

srcu_read_unlock(&cpuhp_srcu);
}

Then have the writer simply do:

__cpuhp_write = current;
synchronize_srcu(&cpuhp_srcu);

<grab the mutex here>

-- Steve

2013-09-23 14:55:14

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Mon, Sep 23, 2013 at 10:50:17AM -0400, Steven Rostedt wrote:
> On Thu, 19 Sep 2013 16:32:41 +0200
> Peter Zijlstra <[email protected]> wrote:
>
>
> > +extern void __get_online_cpus(void);
> > +
> > +static inline void get_online_cpus(void)
> > +{
> > + might_sleep();
> > +
> > + preempt_disable();
> > + if (likely(!__cpuhp_writer || __cpuhp_writer == current))
> > + this_cpu_inc(__cpuhp_refcount);
> > + else
> > + __get_online_cpus();
> > + preempt_enable();
> > +}
>
>
> This isn't much different than srcu_read_lock(). What about doing
> something like this:
>
> static inline void get_online_cpus(void)
> {
> might_sleep();
>
> srcu_read_lock(&cpuhp_srcu);
> if (unlikely(__cpuhp_writer || __cpuhp_writer != current)) {
> srcu_read_unlock(&cpuhp_srcu);
> __get_online_cpus();
> current->online_cpus_held++;
> }
> }

There's a full memory barrier in srcu_read_lock(), while there was no
such thing in the previous fast path.

Also, why current->online_cpus_held()? That would make the write side
O(nr_tasks) instead of O(nr_cpus).

> static inline void put_online_cpus(void)
> {
> if (unlikely(current->online_cpus_held)) {
> current->online_cpus_held--;
> __put_online_cpus();
> return;
> }
>
> srcu_read_unlock(&cpuhp_srcu);
> }

Also, you might not have noticed but, srcu_read_{,un}lock() have an
extra idx thing to pass about. That doesn't fit with the hotplug api.

>
> Then have the writer simply do:
>
> __cpuhp_write = current;
> synchronize_srcu(&cpuhp_srcu);
>
> <grab the mutex here>

How does that do reader preference?

2013-09-23 15:13:09

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Mon, 23 Sep 2013 16:54:46 +0200
Peter Zijlstra <[email protected]> wrote:

> On Mon, Sep 23, 2013 at 10:50:17AM -0400, Steven Rostedt wrote:
> > On Thu, 19 Sep 2013 16:32:41 +0200
> > Peter Zijlstra <[email protected]> wrote:
> >
> >
> > > +extern void __get_online_cpus(void);
> > > +
> > > +static inline void get_online_cpus(void)
> > > +{
> > > + might_sleep();
> > > +
> > > + preempt_disable();
> > > + if (likely(!__cpuhp_writer || __cpuhp_writer == current))
> > > + this_cpu_inc(__cpuhp_refcount);
> > > + else
> > > + __get_online_cpus();
> > > + preempt_enable();
> > > +}
> >
> >
> > This isn't much different than srcu_read_lock(). What about doing
> > something like this:
> >
> > static inline void get_online_cpus(void)
> > {
> > might_sleep();
> >
> > srcu_read_lock(&cpuhp_srcu);
> > if (unlikely(__cpuhp_writer || __cpuhp_writer != current)) {
> > srcu_read_unlock(&cpuhp_srcu);
> > __get_online_cpus();
> > current->online_cpus_held++;
> > }
> > }
>
> There's a full memory barrier in srcu_read_lock(), while there was no
> such thing in the previous fast path.

Yeah, I mentioned this to Paul, and we talked about making
srcu_read_lock() work with no mb's. But currently, doesn't
get_online_cpus() just take a mutex? What's wrong with a mb() as it
still kicks ass over what is currently there today?

>
> Also, why current->online_cpus_held()? That would make the write side
> O(nr_tasks) instead of O(nr_cpus).

?? I'm not sure I understand this. The online_cpus_held++ was there for
recursion. Can't get_online_cpus() nest? I was thinking it can. If so,
once the "__cpuhp_writer" is set, we need to do __put_online_cpus() as
many times as we did a __get_online_cpus(). I don't know where the
O(nr_tasks) comes from. The ref here was just to account for doing the
old "get_online_cpus" instead of a srcu_read_lock().

>
> > static inline void put_online_cpus(void)
> > {
> > if (unlikely(current->online_cpus_held)) {
> > current->online_cpus_held--;
> > __put_online_cpus();
> > return;
> > }
> >
> > srcu_read_unlock(&cpuhp_srcu);
> > }
>
> Also, you might not have noticed but, srcu_read_{,un}lock() have an
> extra idx thing to pass about. That doesn't fit with the hotplug api.

I'll have to look at that, as I'm not exactly sure about the idx thing.

>
> >
> > Then have the writer simply do:
> >
> > __cpuhp_write = current;
> > synchronize_srcu(&cpuhp_srcu);
> >
> > <grab the mutex here>
>
> How does that do reader preference?

Well, the point I was trying to do was to let readers go very fast
(well, with a mb instead of a mutex), and then when the CPU hotplug
happens, it goes back to the current method.

That is, once we set __cpuhp_write, and then run synchronize_srcu(),
the system will be in a state that does what it does today (grabbing
mutexes, and upping refcounts).

I thought the whole point was to speed up the get_online_cpus() when no
hotplug is happening. This does that, and is rather simple. It only
gets slow when hotplug is in effect.

-- Steve

2013-09-23 15:22:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Mon, Sep 23, 2013 at 11:13:03AM -0400, Steven Rostedt wrote:
> Well, the point I was trying to do was to let readers go very fast
> (well, with a mb instead of a mutex), and then when the CPU hotplug
> happens, it goes back to the current method.

Well, for that the thing Oleg proposed works just fine and the
preempt_disable() section vs synchronize_sched() is hardly magic.

But I'd really like to get the writer pending case fast too.

> That is, once we set __cpuhp_write, and then run synchronize_srcu(),
> the system will be in a state that does what it does today (grabbing
> mutexes, and upping refcounts).

Still no point in using srcu for this; preempt_disable +
synchronize_sched() is similar and much faster -- it's the rcu_sched
equivalent of what you propose.

> I thought the whole point was to speed up the get_online_cpus() when no
> hotplug is happening. This does that, and is rather simple. It only
> gets slow when hotplug is in effect.

No, well, it also gets slow when a hotplug is pending, which can be
quite a while if we go sprinkle get_online_cpus() all over the place and
the machine is busy.

One we start a hotplug attempt we must wait for all readers to quiesce
-- since the lock is full reader preference this can take an infinite
amount of time -- while we're waiting for this all 4k+ CPUs will be
bouncing the one mutex around on every get_online_cpus(); of which we'll
have many since that's the entire point of making them cheap, to use
more of them.

2013-09-23 15:52:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Mon, Sep 23, 2013 at 11:13:03AM -0400, Steven Rostedt wrote:
> On Mon, 23 Sep 2013 16:54:46 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > On Mon, Sep 23, 2013 at 10:50:17AM -0400, Steven Rostedt wrote:

[ . . . ]

> ?? I'm not sure I understand this. The online_cpus_held++ was there for
> recursion. Can't get_online_cpus() nest? I was thinking it can. If so,
> once the "__cpuhp_writer" is set, we need to do __put_online_cpus() as
> many times as we did a __get_online_cpus(). I don't know where the
> O(nr_tasks) comes from. The ref here was just to account for doing the
> old "get_online_cpus" instead of a srcu_read_lock().
>
> >
> > > static inline void put_online_cpus(void)
> > > {
> > > if (unlikely(current->online_cpus_held)) {
> > > current->online_cpus_held--;
> > > __put_online_cpus();
> > > return;
> > > }
> > >
> > > srcu_read_unlock(&cpuhp_srcu);
> > > }
> >
> > Also, you might not have noticed but, srcu_read_{,un}lock() have an
> > extra idx thing to pass about. That doesn't fit with the hotplug api.
>
> I'll have to look a that, as I'm not exactly sure about the idx thing.

Not a problem, just stuff the idx into some per-task thing. Either
task_struct or taskinfo will work fine.
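
For illustration, stashing the index could look roughly like this (cpuhp_srcu
and the cpuhp_srcu_idx field are assumed additions, and the nesting problem
discussed above is ignored):

	static struct srcu_struct cpuhp_srcu;	/* assumed, initialised at boot */

	static inline void get_online_cpus(void)
	{
		might_sleep();
		current->cpuhp_srcu_idx = srcu_read_lock(&cpuhp_srcu);
	}

	static inline void put_online_cpus(void)
	{
		srcu_read_unlock(&cpuhp_srcu, current->cpuhp_srcu_idx);
	}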

> > >
> > > Then have the writer simply do:
> > >
> > > __cpuhp_write = current;
> > > synchronize_srcu(&cpuhp_srcu);
> > >
> > > <grab the mutex here>
> >
> > How does that do reader preference?
>
> Well, the point I was trying to do was to let readers go very fast
> (well, with a mb instead of a mutex), and then when the CPU hotplug
> happens, it goes back to the current method.
>
> That is, once we set __cpuhp_write, and then run synchronize_srcu(),
> the system will be in a state that does what it does today (grabbing
> mutexes, and upping refcounts).
>
> I thought the whole point was to speed up the get_online_cpus() when no
> hotplug is happening. This does that, and is rather simple. It only
> gets slow when hotplug is in effect.

Or to put it another way, if the underlying slow-path mutex is
reader-preference, then the whole thing will be reader-preference.

Thanx, Paul

2013-09-23 15:59:13

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Mon, 23 Sep 2013 17:22:23 +0200
Peter Zijlstra <[email protected]> wrote:

> Still no point in using srcu for this; preempt_disable +
> synchronize_sched() is similar and much faster -- its the rcu_sched
> equivalent of what you propose.

To be honest, I sent this out last week and it somehow got trashed by
my laptop when connecting to my smtp server -- back when the last version of
your patch still had the memory barrier ;-)

So yeah, a true synchronize_sched() is better.

-- Steve

2013-09-23 16:01:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Mon, Sep 23, 2013 at 08:50:59AM -0700, Paul E. McKenney wrote:
> Not a problem, just stuff the idx into some per-task thing. Either
> task_struct or taskinfo will work fine.

Still not seeing the point of using srcu though..

srcu_read_lock() vs synchronize_srcu() is the same but far more
expensive than preempt_disable() vs synchronize_sched().

> Or to put it another way, if the underlying slow-path mutex is
> reader-preference, then the whole thing will be reader-preference.

Right, so 1) we have no such mutex so we're going to have to open-code
that anyway, and 2) like I just explained in the other email, I want the
pending writer case to be _fast_ as well.

2013-09-23 16:03:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Mon, Sep 23, 2013 at 11:59:08AM -0400, Steven Rostedt wrote:
> On Mon, 23 Sep 2013 17:22:23 +0200
> Peter Zijlstra <[email protected]> wrote:
>
> > Still no point in using srcu for this; preempt_disable +
> > synchronize_sched() is similar and much faster -- its the rcu_sched
> > equivalent of what you propose.
>
> To be honest, I sent this out last week and it somehow got trashed by
> my laptop and connecting to my smtp server. Where the last version of
> your patch still had the memory barrier ;-)

Ah, ok, yes in that case things start to make sense again.

2013-09-23 17:30:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Mon, Sep 23, 2013 at 10:04:00AM -0700, Paul E. McKenney wrote:
> At some point I suspect that we will want some form of fairness, but in
> the meantime, good point.

I figured we could start a timer on hotplug to force quiesce the readers
after about 10 minutes or so ;-)

Should be a proper discouragement from (ab)using this hotplug stuff...

Muwhahaha

2013-09-23 17:39:09

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/23, Peter Zijlstra wrote:
>
> On Sat, Sep 21, 2013 at 06:34:04PM +0200, Oleg Nesterov wrote:
> > > So the slow path is still per-cpu and mostly uncontended even in the
> > > pending writer case.
> >
> > Is it really important? I mean, per-cpu/uncontended even if the writer
> > is pending?
>
> I think so, once we make {get,put}_online_cpus() really cheap they'll
> get in more and more places, and the global count with pending writer
> will make things crawl on bigger machines.

Hmm. But the writers should be rare.

> > But. We already have percpu_rw_semaphore,
>
> Oh urgh, forgot about that one. /me goes read.
>
> /me curses loudly.. that thing has an _expedited() call in it, those
> should die.

Probably yes, the original reason for _expedited() has gone away.

> I'd dread to think what would happen if a 4k cpu machine were to land in
> the slow path on that global mutex. Readers would never go-away and
> progress would make a glacier seem fast.

Another problem is that write-lock can never succeed unless it
prevents the new readers, but this needs the per-task counter.

> > Note also that percpu_down_write/percpu_up_write can be improved wrt
> > synchronize_sched(). We can turn the 2nd one into call_rcu(), and the
> > 1nd one can be avoided if another percpu_down_write() comes "soon after"
> > percpu_down_up().
>
> Write side be damned ;-)

Suppose that a 4k cpu machine does disable_nonboot_cpus(), every
_cpu_down() does synchronize_sched()... OK, perhaps the locking can be
changed so that cpu_hotplug_begin/end is called only once in this case.

> > - The writer calls cpuph_wait_refcount()
> >
> > - cpuph_wait_refcount() does refcnt += __cpuhp_refcount[0].
> > refcnt == 0.
> >
> > - another reader comes on CPU_0, increments __cpuhp_refcount[0].
> >
> > - this reader migrates to CPU_1 and does put_online_cpus(),
> > this decrements __cpuhp_refcount[1] which becomes zero.
> >
> > - cpuph_wait_refcount() continues and reads __cpuhp_refcount[1]
> > which is zero. refcnt == 0, return.
>
> Ah indeed..
>
> The best I can come up with is something like:
>
> static unsigned int cpuhp_refcount(void)
> {
> unsigned int refcount = 0;
> int cpu;
>
> for_each_possible_cpu(cpu)
> refcount += per_cpu(__cpuhp_refcount, cpu);
> }
>
> static void cpuhp_wait_refcount(void)
> {
> for (;;) {
> unsigned int rc1, rc2;
>
> rc1 = cpuhp_refcount();
> set_current_state(TASK_UNINTERRUPTIBLE); /* MB */
> rc2 = cpuhp_refcount();
>
> if (rc1 == rc2 && !rc1)

But this only makes the race above "theoretical ** 2". Both
cpuhp_refcount()'s can be equally fooled.

Looks like, cpuhp_refcount() should take all per-cpu cpuhp_lock's
before it reads __cpuhp_refcount.
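
A minimal sketch of that suggestion, reusing the per-cpu cpuhp_lock and
__cpuhp_refcount from the patch above (illustrative only; lock nesting and
lockdep details are ignored):

	static unsigned int cpuhp_refcount_locked(void)
	{
		unsigned int refcount = 0;
		int cpu;

		/* Take every per-cpu lock so no reader can race the summation. */
		for_each_possible_cpu(cpu)
			spin_lock(&per_cpu(cpuhp_lock, cpu));

		for_each_possible_cpu(cpu)
			refcount += per_cpu(__cpuhp_refcount, cpu);

		for_each_possible_cpu(cpu)
			spin_unlock(&per_cpu(cpuhp_lock, cpu));

		return refcount;
	}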

Oleg.

2013-09-23 17:58:34

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

And somehow I didn't notice that cpuhp_set_state() doesn't look right,

On 09/19, Peter Zijlstra wrote:
> void cpu_hotplug_begin(void)
> {
> - cpu_hotplug.active_writer = current;
> + lockdep_assert_held(&cpu_add_remove_lock);
>
> - for (;;) {
> - mutex_lock(&cpu_hotplug.lock);
> - if (likely(!cpu_hotplug.refcount))
> - break;
> - __set_current_state(TASK_UNINTERRUPTIBLE);
> - mutex_unlock(&cpu_hotplug.lock);
> - schedule();
> - }
> + __cpuhp_writer = current;
> +
> + /* After this everybody will observe _writer and take the slow path. */
> + synchronize_sched();
> +
> + /* Wait for no readers -- reader preference */
> + cpuhp_wait_refcount();
> +
> + /* Stop new readers. */
> + cpuhp_set_state(1);

But this stops all readers, not only new. Even if cpuhp_wait_refcount()
was correct, a new reader can come right before cpuhp_set_state(1) and
then it can call another recursive get_online_cpus() right after.

Oleg.

2013-09-23 18:16:34

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Mon, Sep 23, 2013 at 06:01:30PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 23, 2013 at 08:50:59AM -0700, Paul E. McKenney wrote:
> > Not a problem, just stuff the idx into some per-task thing. Either
> > task_struct or taskinfo will work fine.
>
> Still not seeing the point of using srcu though..
>
> srcu_read_lock() vs synchronize_srcu() is the same but far more
> expensive than preempt_disable() vs synchronize_sched().

Heh! You want the old-style SRCU. ;-)

> > Or to put it another way, if the underlying slow-path mutex is
> > reader-preference, then the whole thing will be reader-preference.
>
> Right, so 1) we have no such mutex so we're going to have to open-code
> that anyway, and 2) like I just explained in the other email, I want the
> pending writer case to be _fast_ as well.

At some point I suspect that we will want some form of fairness, but in
the meantime, good point.

Thanx, Paul

2013-09-24 12:38:45

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()


OK, so another attempt.

This one is actually fair in that it immediately forces a reader
quiescent state by explicitly implementing reader-reader recursion.

This does away with the potentially long pending writer case and can
thus use the simpler global state.

I don't really like this lock being fair, but alas.

Also, please have a look at the atomic_dec_and_test(cpuhp_waitcount) and
cpu_hotplug_done(). I think it's ok, but I keep confusing myself.

---
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/cpumask.h>
+#include <linux/percpu.h>

struct device;

@@ -173,10 +174,49 @@ extern struct bus_type cpu_subsys;
#ifdef CONFIG_HOTPLUG_CPU
/* Stop CPUs going up and down. */

+extern void cpu_hotplug_init_task(struct task_struct *p);
+
extern void cpu_hotplug_begin(void);
extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern int __cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+ might_sleep();
+
+ if (current->cpuhp_ref++) {
+ barrier();
+ return;
+ }
+
+ preempt_disable();
+ if (likely(!__cpuhp_writer))
+ __this_cpu_inc(__cpuhp_refcount);
+ else
+ __get_online_cpus();
+ preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+ barrier();
+ if (--current->cpuhp_ref)
+ return;
+
+ preempt_disable();
+ if (likely(!__cpuhp_writer))
+ __this_cpu_dec(__cpuhp_refcount);
+ else
+ __put_online_cpus();
+ preempt_enable();
+}
+
extern void cpu_hotplug_disable(void);
extern void cpu_hotplug_enable(void);
#define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
@@ -200,6 +240,8 @@ static inline void cpu_hotplug_driver_un

#else /* CONFIG_HOTPLUG_CPU */

+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
static inline void cpu_hotplug_begin(void) {}
static inline void cpu_hotplug_done(void) {}
#define get_online_cpus() do { } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
unsigned int sequential_io;
unsigned int sequential_io_avg;
#endif
+#ifdef CONFIG_HOTPLUG_CPU
+ int cpuhp_ref;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,115 @@ static int cpu_hotplug_disabled;

#ifdef CONFIG_HOTPLUG_CPU

-static struct {
- struct task_struct *active_writer;
- struct mutex lock; /* Synchronizes accesses to refcount, */
- /*
- * Also blocks the new readers during
- * an ongoing cpu hotplug operation.
- */
- int refcount;
-} cpu_hotplug = {
- .active_writer = NULL,
- .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
- .refcount = 0,
-};
+static struct task_struct *cpuhp_writer_task = NULL;

-void get_online_cpus(void)
-{
- might_sleep();
- if (cpu_hotplug.active_writer == current)
- return;
- mutex_lock(&cpu_hotplug.lock);
- cpu_hotplug.refcount++;
- mutex_unlock(&cpu_hotplug.lock);
+int __cpuhp_writer;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);

+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static atomic_t cpuhp_waitcount;
+static atomic_t cpuhp_slowcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+ p->cpuhp_ref = 0;
}
-EXPORT_SYMBOL_GPL(get_online_cpus);

-void put_online_cpus(void)
+#define cpuhp_writer_wake() \
+ wake_up_process(cpuhp_writer_task)
+
+#define cpuhp_writer_wait(cond) \
+do { \
+ for (;;) { \
+ set_current_state(TASK_UNINTERRUPTIBLE); \
+ if (cond) \
+ break; \
+ schedule(); \
+ } \
+ __set_current_state(TASK_RUNNING); \
+} while (0)
+
+void __get_online_cpus(void)
{
- if (cpu_hotplug.active_writer == current)
+ if (cpuhp_writer_task == current)
return;
- mutex_lock(&cpu_hotplug.lock);

- if (WARN_ON(!cpu_hotplug.refcount))
- cpu_hotplug.refcount++; /* try to fix things up */
+ atomic_inc(&cpuhp_waitcount);
+
+ /*
+ * We either call schedule() in the wait, or we'll fall through
+ * and reschedule on the preempt_enable() in get_online_cpus().
+ */
+ preempt_enable_no_resched();
+ wait_event(cpuhp_wq, !__cpuhp_writer);
+ preempt_disable();
+
+ /*
+ * It would be possible for cpu_hotplug_done() to complete before
+ * the atomic_inc() above; in which case there is no writer waiting
+ * and doing a wakeup would be BAD (tm).
+ *
+ * If however we still observe cpuhp_writer_task here we know
+ * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
+ */
+ if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
+ cpuhp_writer_wake();
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);

- if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
- wake_up_process(cpu_hotplug.active_writer);
- mutex_unlock(&cpu_hotplug.lock);
+void __put_online_cpus(void)
+{
+ if (cpuhp_writer_task == current)
+ return;

+ if (atomic_dec_and_test(&cpuhp_slowcount))
+ cpuhp_writer_wake();
}
-EXPORT_SYMBOL_GPL(put_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);

/*
* This ensures that the hotplug operation can begin only when the
* refcount goes to zero.
*
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
* Since cpu_hotplug_begin() is always called after invoking
* cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- * writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- * non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
*/
void cpu_hotplug_begin(void)
{
- cpu_hotplug.active_writer = current;
+ unsigned int count = 0;
+ int cpu;
+
+ lockdep_assert_held(&cpu_add_remove_lock);

- for (;;) {
- mutex_lock(&cpu_hotplug.lock);
- if (likely(!cpu_hotplug.refcount))
- break;
- __set_current_state(TASK_UNINTERRUPTIBLE);
- mutex_unlock(&cpu_hotplug.lock);
- schedule();
+ __cpuhp_writer = 1;
+ cpuhp_writer_task = current;
+
+ /* After this everybody will observe writer and take the slow path. */
+ synchronize_sched();
+
+ /* Collapse the per-cpu refcount into slowcount */
+ for_each_possible_cpu(cpu) {
+ count += per_cpu(__cpuhp_refcount, cpu);
+ per_cpu(__cpuhp_refcount, cpu) = 0;
}
+ atomic_add(count, &cpuhp_slowcount);
+
+ /* Wait for all readers to go away */
+ cpuhp_writer_wait(!atomic_read(&cpuhp_slowcount));
}

void cpu_hotplug_done(void)
{
- cpu_hotplug.active_writer = NULL;
- mutex_unlock(&cpu_hotplug.lock);
+ /* Signal the writer is done */
+ __cpuhp_writer = 0;
+ wake_up_all(&cpuhp_wq);
+
+ /* Wait for any pending readers to be running */
+ cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
+ cpuhp_writer_task = NULL;
}

/*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
+
+ cpu_hotplug_init_task(p);
}

#ifdef CONFIG_NUMA_BALANCING

2013-09-24 14:42:47

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Sep 24, 2013 at 02:38:21PM +0200, Peter Zijlstra wrote:
>
> OK, so another attempt.
>
> This one is actually fair in that it immediately forces a reader
> quiescent state by explicitly implementing reader-reader recursion.
>
> This does away with the potentially long pending writer case and can
> thus use the simpler global state.
>
> I don't really like this lock being fair, but alas.
>
> Also, please have a look at the atomic_dec_and_test(cpuhp_waitcount) and
> cpu_hotplug_done(). I think it's ok, but I keep confusing myself.

Cute!

Some commentary below. Also one question about how a race leading to
a NULL-pointer dereference is avoided.

Thanx, Paul

> ---
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
> #include <linux/node.h>
> #include <linux/compiler.h>
> #include <linux/cpumask.h>
> +#include <linux/percpu.h>
>
> struct device;
>
> @@ -173,10 +174,49 @@ extern struct bus_type cpu_subsys;
> #ifdef CONFIG_HOTPLUG_CPU
> /* Stop CPUs going up and down. */
>
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
> extern void cpu_hotplug_begin(void);
> extern void cpu_hotplug_done(void);
> -extern void get_online_cpus(void);
> -extern void put_online_cpus(void);
> +
> +extern int __cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + if (current->cpuhp_ref++) {
> + barrier();
> + return;
> + }
> +
> + preempt_disable();
> + if (likely(!__cpuhp_writer))
> + __this_cpu_inc(__cpuhp_refcount);
> + else
> + __get_online_cpus();
> + preempt_enable();
> +}
> +
> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> + barrier();
> + if (--current->cpuhp_ref)
> + return;
> +
> + preempt_disable();
> + if (likely(!__cpuhp_writer))
> + __this_cpu_dec(__cpuhp_refcount);
> + else
> + __put_online_cpus();
> + preempt_enable();
> +}
> +
> extern void cpu_hotplug_disable(void);
> extern void cpu_hotplug_enable(void);
> #define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
> @@ -200,6 +240,8 @@ static inline void cpu_hotplug_driver_un
>
> #else /* CONFIG_HOTPLUG_CPU */
>
> +static inline void cpu_hotplug_init_task(struct task_struct *p) {}
> +
> static inline void cpu_hotplug_begin(void) {}
> static inline void cpu_hotplug_done(void) {}
> #define get_online_cpus() do { } while (0)
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1454,6 +1454,9 @@ struct task_struct {
> unsigned int sequential_io;
> unsigned int sequential_io_avg;
> #endif
> +#ifdef CONFIG_HOTPLUG_CPU
> + int cpuhp_ref;
> +#endif
> };
>
> /* Future-safe accessor for struct task_struct's cpus_allowed. */
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -49,88 +49,115 @@ static int cpu_hotplug_disabled;
>
> #ifdef CONFIG_HOTPLUG_CPU
>
> -static struct {
> - struct task_struct *active_writer;
> - struct mutex lock; /* Synchronizes accesses to refcount, */
> - /*
> - * Also blocks the new readers during
> - * an ongoing cpu hotplug operation.
> - */
> - int refcount;
> -} cpu_hotplug = {
> - .active_writer = NULL,
> - .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
> - .refcount = 0,
> -};
> +static struct task_struct *cpuhp_writer_task = NULL;
>
> -void get_online_cpus(void)
> -{
> - might_sleep();
> - if (cpu_hotplug.active_writer == current)
> - return;
> - mutex_lock(&cpu_hotplug.lock);
> - cpu_hotplug.refcount++;
> - mutex_unlock(&cpu_hotplug.lock);
> +int __cpuhp_writer;
> +EXPORT_SYMBOL_GPL(__cpuhp_writer);
>
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> +
> +static atomic_t cpuhp_waitcount;
> +static atomic_t cpuhp_slowcount;
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_wq);
> +
> +void cpu_hotplug_init_task(struct task_struct *p)
> +{
> + p->cpuhp_ref = 0;
> }
> -EXPORT_SYMBOL_GPL(get_online_cpus);
>
> -void put_online_cpus(void)
> +#define cpuhp_writer_wake() \
> + wake_up_process(cpuhp_writer_task)
> +
> +#define cpuhp_writer_wait(cond) \
> +do { \
> + for (;;) { \
> + set_current_state(TASK_UNINTERRUPTIBLE); \
> + if (cond) \
> + break; \
> + schedule(); \
> + } \
> + __set_current_state(TASK_RUNNING); \
> +} while (0)

Why not wait_event()? Presumably the above is a bit lighter weight,
but is that even something that can be measured?
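
A sketch of the wait_event() alternative being hinted at here, using a second
waitqueue so the writer does not share cpuhp_wq with the readers (a
reconstruction, not code from the thread; the next posting below does add
such a queue):

static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer_wq);

/*
 * wait_event() performs the same set_current_state()/check/schedule()
 * loop internally, so the open-coded macro collapses to:
 */
#define cpuhp_writer_wait(cond)	wait_event(cpuhp_writer_wq, (cond))

/* ...and the readers then kick the writer with: */
#define cpuhp_writer_wake()	wake_up(&cpuhp_writer_wq)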

> +void __get_online_cpus(void)
> {
> - if (cpu_hotplug.active_writer == current)
> + if (cpuhp_writer_task == current)
> return;
> - mutex_lock(&cpu_hotplug.lock);
>
> - if (WARN_ON(!cpu_hotplug.refcount))
> - cpu_hotplug.refcount++; /* try to fix things up */
> + atomic_inc(&cpuhp_waitcount);
> +
> + /*
> + * We either call schedule() in the wait, or we'll fall through
> + * and reschedule on the preempt_enable() in get_online_cpus().
> + */
> + preempt_enable_no_resched();
> + wait_event(cpuhp_wq, !__cpuhp_writer);

Finally! A good use for preempt_enable_no_resched(). ;-)

> + preempt_disable();
> +
> + /*
> + * It would be possible for cpu_hotplug_done() to complete before
> + * the atomic_inc() above; in which case there is no writer waiting
> + * and doing a wakeup would be BAD (tm).
> + *
> + * If however we still observe cpuhp_writer_task here we know
> + * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> + */
> + if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)

OK, I'll bite... What sequence of events results in the
atomic_dec_and_test() returning true but there being no
cpuhp_writer_task?

Ah, I see it...

o Task A becomes the writer.

o Task B tries to read, but stalls for whatever reason before
the atomic_inc().

o Task A completes its write-side operation. It sees no readers
blocked, so goes on its merry way.

o Task B does its atomic_inc(), does its read, then sees
atomic_dec_and_test() return zero, but cpuhp_writer_task
is NULL, so it doesn't do the wakeup.

But what prevents the following sequence of events?

o Task A becomes the writer.

o Task B tries to read, but stalls for whatever reason before
the atomic_inc().

o Task A completes its write-side operation. It sees no readers
blocked, so goes on its merry way, but is delayed before it
NULLs cpuhp_writer_task.

o Task B does its atomic_inc(), does its read, then sees
atomic_dec_and_test() return zero. However, it sees
cpuhp_writer_task as non-NULL.

o Then Task A NULLs cpuhp_writer_task.

o Task B's call to cpuhp_writer_wake() sees a NULL pointer.

> + cpuhp_writer_wake();
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
>
> - if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> - wake_up_process(cpu_hotplug.active_writer);
> - mutex_unlock(&cpu_hotplug.lock);
> +void __put_online_cpus(void)
> +{
> + if (cpuhp_writer_task == current)
> + return;
>
> + if (atomic_dec_and_test(&cpuhp_slowcount))
> + cpuhp_writer_wake();
> }
> -EXPORT_SYMBOL_GPL(put_online_cpus);
> +EXPORT_SYMBOL_GPL(__put_online_cpus);
>
> /*
> * This ensures that the hotplug operation can begin only when the
> * refcount goes to zero.
> *
> - * Note that during a cpu-hotplug operation, the new readers, if any,
> - * will be blocked by the cpu_hotplug.lock
> - *
> * Since cpu_hotplug_begin() is always called after invoking
> * cpu_maps_update_begin(), we can be sure that only one writer is active.
> - *
> - * Note that theoretically, there is a possibility of a livelock:
> - * - Refcount goes to zero, last reader wakes up the sleeping
> - * writer.
> - * - Last reader unlocks the cpu_hotplug.lock.
> - * - A new reader arrives at this moment, bumps up the refcount.
> - * - The writer acquires the cpu_hotplug.lock finds the refcount
> - * non zero and goes to sleep again.
> - *
> - * However, this is very difficult to achieve in practice since
> - * get_online_cpus() not an api which is called all that often.
> - *
> */
> void cpu_hotplug_begin(void)
> {
> - cpu_hotplug.active_writer = current;
> + unsigned int count = 0;
> + int cpu;
> +
> + lockdep_assert_held(&cpu_add_remove_lock);
>
> - for (;;) {
> - mutex_lock(&cpu_hotplug.lock);
> - if (likely(!cpu_hotplug.refcount))
> - break;
> - __set_current_state(TASK_UNINTERRUPTIBLE);
> - mutex_unlock(&cpu_hotplug.lock);
> - schedule();
> + __cpuhp_writer = 1;
> + cpuhp_writer_task = current;

At this point, the value of cpuhp_slowcount can go negative. Can't see
that this causes a problem, given the atomic_add() below.

> +
> + /* After this everybody will observe writer and take the slow path. */
> + synchronize_sched();
> +
> + /* Collapse the per-cpu refcount into slowcount */
> + for_each_possible_cpu(cpu) {
> + count += per_cpu(__cpuhp_refcount, cpu);
> + per_cpu(__cpuhp_refcount, cpu) = 0;
> }

The above is safe because the readers are no longer changing their
__cpuhp_refcount values.

> + atomic_add(count, &cpuhp_slowcount);
> +
> + /* Wait for all readers to go away */
> + cpuhp_writer_wait(!atomic_read(&cpuhp_slowcount));
> }
>
> void cpu_hotplug_done(void)
> {
> - cpu_hotplug.active_writer = NULL;
> - mutex_unlock(&cpu_hotplug.lock);
> + /* Signal the writer is done */
> + cpuhp_writer = 0;
> + wake_up_all(&cpuhp_wq);
> +
> + /* Wait for any pending readers to be running */
> + cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> + cpuhp_writer_task = NULL;
> }
>
> /*
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
> INIT_LIST_HEAD(&p->numa_entry);
> p->numa_group = NULL;
> #endif /* CONFIG_NUMA_BALANCING */
> +
> + cpu_hotplug_init_task(p);
> }
>
> #ifdef CONFIG_NUMA_BALANCING
>

2013-09-24 16:10:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Sep 24, 2013 at 07:42:36AM -0700, Paul E. McKenney wrote:
> > +#define cpuhp_writer_wake() \
> > + wake_up_process(cpuhp_writer_task)
> > +
> > +#define cpuhp_writer_wait(cond) \
> > +do { \
> > + for (;;) { \
> > + set_current_state(TASK_UNINTERRUPTIBLE); \
> > + if (cond) \
> > + break; \
> > + schedule(); \
> > + } \
> > + __set_current_state(TASK_RUNNING); \
> > +} while (0)
>
> Why not wait_event()? Presumably the above is a bit lighter weight,
> but is that even something that can be measured?

I didn't want to mix readers and writers on cpuhp_wq, and I suppose I
could create a second waitqueue; that might also be a better solution
for the NULL thing below.

> > + atomic_inc(&cpuhp_waitcount);
> > +
> > + /*
> > + * We either call schedule() in the wait, or we'll fall through
> > + * and reschedule on the preempt_enable() in get_online_cpus().
> > + */
> > + preempt_enable_no_resched();
> > + wait_event(cpuhp_wq, !__cpuhp_writer);
>
> Finally! A good use for preempt_enable_no_resched(). ;-)

Hehe, there were a few others, but tglx removed most with the
schedule_preempt_disabled() primitive.

In fact, I considered a wait_event_preempt_disabled() but was too lazy.
That whole wait_event macro fest looks like it could use an iteration or
two of collapse anyhow.
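
For the record, a wait_event_preempt_disabled() could plausibly look like the
sketch below; it does not exist in the tree, this is only an illustration of
the idea of dropping the preempt count around the schedule() alone:

#define wait_event_preempt_disabled(wq, cond) \
do { \
	DEFINE_WAIT(__wait); \
	for (;;) { \
		prepare_to_wait(&(wq), &__wait, TASK_UNINTERRUPTIBLE); \
		if (cond) \
			break; \
		/* drop the preempt count only around the schedule() */ \
		preempt_enable_no_resched(); \
		schedule(); \
		preempt_disable(); \
	} \
	finish_wait(&(wq), &__wait); \
} while (0)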

> > + preempt_disable();
> > +
> > + /*
> > + * It would be possible for cpu_hotplug_done() to complete before
> > + * the atomic_inc() above; in which case there is no writer waiting
> > + * and doing a wakeup would be BAD (tm).
> > + *
> > + * If however we still observe cpuhp_writer_task here we know
> > + * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> > + */
> > + if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
>
> OK, I'll bite... What sequence of events results in the
> atomic_dec_and_test() returning true but there being no
> cpuhp_writer_task?
>
> Ah, I see it...

<snip>

Indeed, and

> But what prevents the following sequence of events?

<snip>

> o Task B's call to cpuhp_writer_wake() sees a NULL pointer.

Quite so... nothing. See, there was a reason I kept being confused about
it.

> > void cpu_hotplug_begin(void)
> > {
> > + unsigned int count = 0;
> > + int cpu;
> > +
> > + lockdep_assert_held(&cpu_add_remove_lock);
> >
> > + __cpuhp_writer = 1;
> > + cpuhp_writer_task = current;
>
> At this point, the value of cpuhp_slowcount can go negative. Can't see
> that this causes a problem, given the atomic_add() below.

Agreed.

> > +
> > + /* After this everybody will observe writer and take the slow path. */
> > + synchronize_sched();
> > +
> > + /* Collapse the per-cpu refcount into slowcount */
> > + for_each_possible_cpu(cpu) {
> > + count += per_cpu(__cpuhp_refcount, cpu);
> > + per_cpu(__cpuhp_refcount, cpu) = 0;
> > }
>
> The above is safe because the readers are no longer changing their
> __cpuhp_refcount values.

Yes, I'll expand the comment.

So how about something like this?

---
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/cpumask.h>
+#include <linux/percpu.h>

struct device;

@@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
#ifdef CONFIG_HOTPLUG_CPU
/* Stop CPUs going up and down. */

+extern void cpu_hotplug_init_task(struct task_struct *p);
+
extern void cpu_hotplug_begin(void);
extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern struct task_struct *__cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+ might_sleep();
+
+ /* Support reader-in-reader recursion */
+ if (current->cpuhp_ref++) {
+ barrier();
+ return;
+ }
+
+ preempt_disable();
+ if (likely(!__cpuhp_writer))
+ __this_cpu_inc(__cpuhp_refcount);
+ else
+ __get_online_cpus();
+ preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+ barrier();
+ if (--current->cpuhp_ref)
+ return;
+
+ preempt_disable();
+ if (likely(!__cpuhp_writer))
+ __this_cpu_dec(__cpuhp_refcount);
+ else
+ __put_online_cpus();
+ preempt_enable();
+}
+
extern void cpu_hotplug_disable(void);
extern void cpu_hotplug_enable(void);
#define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
@@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un

#else /* CONFIG_HOTPLUG_CPU */

+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
static inline void cpu_hotplug_begin(void) {}
static inline void cpu_hotplug_done(void) {}
#define get_online_cpus() do { } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
unsigned int sequential_io;
unsigned int sequential_io_avg;
#endif
+#ifdef CONFIG_HOTPLUG_CPU
+ int cpuhp_ref;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,100 @@ static int cpu_hotplug_disabled;

#ifdef CONFIG_HOTPLUG_CPU

-static struct {
- struct task_struct *active_writer;
- struct mutex lock; /* Synchronizes accesses to refcount, */
- /*
- * Also blocks the new readers during
- * an ongoing cpu hotplug operation.
- */
- int refcount;
-} cpu_hotplug = {
- .active_writer = NULL,
- .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
- .refcount = 0,
-};
+struct task_struct *__cpuhp_writer;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);

-void get_online_cpus(void)
-{
- might_sleep();
- if (cpu_hotplug.active_writer == current)
- return;
- mutex_lock(&cpu_hotplug.lock);
- cpu_hotplug.refcount++;
- mutex_unlock(&cpu_hotplug.lock);
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static atomic_t cpuhp_waitcount;
+static atomic_t cpuhp_slowcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);

+void cpu_hotplug_init_task(struct task_struct *p)
+{
+ p->cpuhp_ref = 0;
}
-EXPORT_SYMBOL_GPL(get_online_cpus);

-void put_online_cpus(void)
+void __get_online_cpus(void)
{
- if (cpu_hotplug.active_writer == current)
+ /* Support reader-in-writer recursion */
+ if (__cpuhp_writer == current)
return;
- mutex_lock(&cpu_hotplug.lock);

- if (WARN_ON(!cpu_hotplug.refcount))
- cpu_hotplug.refcount++; /* try to fix things up */
+ atomic_inc(&cpuhp_waitcount);

- if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
- wake_up_process(cpu_hotplug.active_writer);
- mutex_unlock(&cpu_hotplug.lock);
+ /*
+ * We either call schedule() in the wait, or we'll fall through
+ * and reschedule on the preempt_enable() in get_online_cpus().
+ */
+ preempt_enable_no_resched();
+ wait_event(cpuhp_readers, !__cpuhp_writer);
+ preempt_disable();
+
+ if (atomic_dec_and_test(&cpuhp_waitcount))
+ wake_up_all(&cpuhp_writer);
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
+
+void __put_online_cpus(void)
+{
+ if (__cpuhp_writer == current)
+ return;

+ if (atomic_dec_and_test(&cpuhp_slowcount))
+ wake_up_all(&cpuhp_writer);
}
-EXPORT_SYMBOL_GPL(put_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);

/*
* This ensures that the hotplug operation can begin only when the
* refcount goes to zero.
*
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
* Since cpu_hotplug_begin() is always called after invoking
* cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- * writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- * non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
*/
void cpu_hotplug_begin(void)
{
- cpu_hotplug.active_writer = current;
+ unsigned int count = 0;
+ int cpu;
+
+ lockdep_assert_held(&cpu_add_remove_lock);

- for (;;) {
- mutex_lock(&cpu_hotplug.lock);
- if (likely(!cpu_hotplug.refcount))
- break;
- __set_current_state(TASK_UNINTERRUPTIBLE);
- mutex_unlock(&cpu_hotplug.lock);
- schedule();
+ __cpuhp_writer = current;
+
+ /*
+ * After this everybody will observe writer and take the slow path.
+ */
+ synchronize_sched();
+
+ /*
+ * Collapse the per-cpu refcount into slowcount. This is safe because
+ * readers are now taking the slow path (per the above) which doesn't
+ * touch __cpuhp_refcount.
+ */
+ for_each_possible_cpu(cpu) {
+ count += per_cpu(__cpuhp_refcount, cpu);
+ per_cpu(__cpuhp_refcount, cpu) = 0;
}
+ atomic_add(count, &cpuhp_slowcount);
+
+ /* Wait for all readers to go away */
+ wait_event(cpuhp_writer, !atomic_read(&cpuhp_slowcount));
}

void cpu_hotplug_done(void)
{
- cpu_hotplug.active_writer = NULL;
- mutex_unlock(&cpu_hotplug.lock);
+ /* Signal the writer is done */
+ __cpuhp_writer = NULL;
+ wake_up_all(&cpuhp_readers);
+
+ /*
+ * Wait for any pending readers to be running. This ensures readers
+ * after writer and avoids writers starving readers.
+ */
+ wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
}

/*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
+
+ cpu_hotplug_init_task(p);
}

#ifdef CONFIG_NUMA_BALANCING

2013-09-24 16:11:03

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/24, Peter Zijlstra wrote:
>
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + if (current->cpuhp_ref++) {
> + barrier();
> + return;

I don't understand this barrier()... we are going to return if we already
hold the lock, do we really need it?

The same for put_online_cpus().

> +void __get_online_cpus(void)
> {
> - if (cpu_hotplug.active_writer == current)
> + if (cpuhp_writer_task == current)
> return;

Probably it would be better to simply inc/dec ->cpuhp_ref in
cpu_hotplug_begin/end and remove this check here and in
__put_online_cpus().

This also means that the writer doing get/put_online_cpus() will
always use the fast path, __cpuhp_writer can go away, and
cpuhp_writer_task != NULL can be used instead.
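
Roughly, that suggestion amounts to something like the sketch below (not a
posted patch); the reader fast path would then test cpuhp_writer_task instead
of __cpuhp_writer:

void cpu_hotplug_begin(void)
{
	lockdep_assert_held(&cpu_add_remove_lock);

	/* The writer is also a (recursive) reader from now on. */
	current->cpuhp_ref++;
	cpuhp_writer_task = current;	/* doubles as the writer flag */

	synchronize_sched();
	/* ... collapse the per-cpu refcounts and wait, as in the patch ... */
}

void cpu_hotplug_done(void)
{
	/* ... wake the waiting readers, as in the patch ... */
	cpuhp_writer_task = NULL;
	current->cpuhp_ref--;
}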

> + atomic_inc(&cpuhp_waitcount);
> +
> + /*
> + * We either call schedule() in the wait, or we'll fall through
> + * and reschedule on the preempt_enable() in get_online_cpus().
> + */
> + preempt_enable_no_resched();
> + wait_event(cpuhp_wq, !__cpuhp_writer);
> + preempt_disable();
> +
> + /*
> + * It would be possible for cpu_hotplug_done() to complete before
> + * the atomic_inc() above; in which case there is no writer waiting
> + * and doing a wakeup would be BAD (tm).
> + *
> + * If however we still observe cpuhp_writer_task here we know
> + * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> + */
> + if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> + cpuhp_writer_wake();

cpuhp_writer_wake() here and in __put_online_cpus() looks racy...
Not only can cpuhp_writer_wake() hit cpuhp_writer_task == NULL (we need
something like ACCESS_ONCE()); its task_struct can already be freed/reused
if the writer exits.

And I don't really understand the logic... This slow path succeeds without
incrementing any counter (except current->cpuhp_ref)? How can the next writer
notice that it should wait for this reader?

> void cpu_hotplug_done(void)
> {
> - cpu_hotplug.active_writer = NULL;
> - mutex_unlock(&cpu_hotplug.lock);
> + /* Signal the writer is done */
> + __cpuhp_writer = 0;
> + wake_up_all(&cpuhp_wq);
> +
> + /* Wait for any pending readers to be running */
> + cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> + cpuhp_writer_task = NULL;

We also need to ensure that the next reader sees all changes
done by the writer; IOW this lacks "release" semantics.




But, Peter, the main question is: why is this better than
percpu_rw_semaphore performance-wise? (Assuming we add
task_struct->cpuhp_ref).

If the writer is pending, percpu_down_read() does

down_read(&brw->rw_sem);
atomic_inc(&brw->slow_read_ctr);
__up_read(&brw->rw_sem);

is it really much worse than wait_event + atomic_dec_and_test?

And! please note that with your implementation the new readers will
be likely blocked while the writer sleeps in synchronize_sched().
This doesn't happen with percpu_rw_semaphore.

Oleg.
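
For comparison, the percpu_rw_semaphore variant Oleg is describing would look
roughly like this (a sketch under the stated assumption that task_struct
gains cpuhp_ref; the percpu_init_rwsem() of cpuhp_rwsem at boot is omitted):

static struct percpu_rw_semaphore cpuhp_rwsem;

void get_online_cpus(void)
{
	might_sleep();
	if (current->cpuhp_ref++)
		return;
	percpu_down_read(&cpuhp_rwsem);
}

void put_online_cpus(void)
{
	if (--current->cpuhp_ref)
		return;
	percpu_up_read(&cpuhp_rwsem);
}

void cpu_hotplug_begin(void)
{
	percpu_down_write(&cpuhp_rwsem);
	current->cpuhp_ref++;		/* reader-in-writer recursion */
}

void cpu_hotplug_done(void)
{
	current->cpuhp_ref--;
	percpu_up_write(&cpuhp_rwsem);
}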

2013-09-24 16:38:56

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/24, Peter Zijlstra wrote:
>
> +void __get_online_cpus(void)
> {
> - if (cpu_hotplug.active_writer == current)
> + /* Support reader-in-writer recursion */
> + if (__cpuhp_writer == current)
> return;
> - mutex_lock(&cpu_hotplug.lock);
>
> - if (WARN_ON(!cpu_hotplug.refcount))
> - cpu_hotplug.refcount++; /* try to fix things up */
> + atomic_inc(&cpuhp_waitcount);
>
> - if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> - wake_up_process(cpu_hotplug.active_writer);
> - mutex_unlock(&cpu_hotplug.lock);
> + /*
> + * We either call schedule() in the wait, or we'll fall through
> + * and reschedule on the preempt_enable() in get_online_cpus().
> + */
> + preempt_enable_no_resched();
> + wait_event(cpuhp_readers, !__cpuhp_writer);
> + preempt_disable();
> +
> + if (atomic_dec_and_test(&cpuhp_waitcount))
> + wake_up_all(&cpuhp_writer);

Yes, this should fix the races with the exiting writer, but still this
doesn't look right afaics.

In particular let me repeat,

> void cpu_hotplug_begin(void)
> {
> - cpu_hotplug.active_writer = current;
> + unsigned int count = 0;
> + int cpu;
> +
> + lockdep_assert_held(&cpu_add_remove_lock);
>
> - for (;;) {
> - mutex_lock(&cpu_hotplug.lock);
> - if (likely(!cpu_hotplug.refcount))
> - break;
> - __set_current_state(TASK_UNINTERRUPTIBLE);
> - mutex_unlock(&cpu_hotplug.lock);
> - schedule();
> + __cpuhp_writer = current;
> +
> + /*
> + * After this everybody will observe writer and take the slow path.
> + */
> + synchronize_sched();

synchronize_sched() is slow. The new readers will likely notice
__cpuhp_writer != NULL much earlier and they will be blocked in
__get_online_cpus() while the writer sleeps before it actually
enters the critical section.

Or have I completely misunderstood this all?

Oleg.

2013-09-24 16:39:32

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, 24 Sep 2013 14:38:21 +0200
Peter Zijlstra <[email protected]> wrote:

> +#define cpuhp_writer_wait(cond) \
> +do { \
> + for (;;) { \
> + set_current_state(TASK_UNINTERRUPTIBLE); \
> + if (cond) \
> + break; \
> + schedule(); \
> + } \
> + __set_current_state(TASK_RUNNING); \
> +} while (0)
> +
> +void __get_online_cpus(void)

The above really needs a comment about how it is used. Otherwise, I can
envision someone calling this thinking "oh, I can use this when I'm in a
preempt-disabled section", and then the comment below for the
preempt_enable_no_resched() will no longer be true.
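
Something along these lines, perhaps (suggested wording only, not taken from
the thread):

/*
 * __get_online_cpus() - slow path of get_online_cpus()
 *
 * Only to be called from get_online_cpus(), which has already disabled
 * preemption and observed a pending/active writer.  It may drop the
 * preempt count internally (preempt_enable_no_resched() followed by
 * wait_event()), so it must not be used as a general "take the hotplug
 * lock from a preempt-disabled section" helper.
 */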

-- Steve


> {
> - if (cpu_hotplug.active_writer == current)
> + if (cpuhp_writer_task == current)
> return;
> - mutex_lock(&cpu_hotplug.lock);
>
> - if (WARN_ON(!cpu_hotplug.refcount))
> - cpu_hotplug.refcount++; /* try to fix things up */
> + atomic_inc(&cpuhp_waitcount);
> +
> + /*
> + * We either call schedule() in the wait, or we'll fall through
> + * and reschedule on the preempt_enable() in get_online_cpus().
> + */
> + preempt_enable_no_resched();
> + wait_event(cpuhp_wq, !__cpuhp_writer);
> + preempt_disable();
> +
> + /*
> + * It would be possible for cpu_hotplug_done() to complete before
> + * the atomic_inc() above; in which case there is no writer waiting
> + * and doing a wakeup would be BAD (tm).
> + *
> + * If however we still observe cpuhp_writer_task here we know
> + * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> + */
> + if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> + cpuhp_writer_wake();
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
>

2013-09-24 16:43:46

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, 24 Sep 2013 18:03:59 +0200
Oleg Nesterov <[email protected]> wrote:

> On 09/24, Peter Zijlstra wrote:
> >
> > +static inline void get_online_cpus(void)
> > +{
> > + might_sleep();
> > +
> > + if (current->cpuhp_ref++) {
> > + barrier();
> > + return;
>
> I don't understand this barrier()... we are going to return if we already
> hold the lock, do we really need it?

I'm confused too. Unless gcc moves this after the release, but the
release uses preempt_disable() which is its own barrier.

If anything, it requires a comment.

-- Steve

>
> The same for put_online_cpus().
>
> > +void __get_online_cpus(void)
> > {
> > - if (cpu_hotplug.active_writer == current)
> > + if (cpuhp_writer_task == current)
> > return;
>

2013-09-24 16:49:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Sep 24, 2013 at 06:03:59PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > +static inline void get_online_cpus(void)
> > +{
> > + might_sleep();
> > +
> > + if (current->cpuhp_ref++) {
> > + barrier();
> > + return;
>
> I don't understand this barrier()... we are going to return if we already
> hold the lock, do we really need it?
>
> The same for put_online_cpus().

The barrier() is needed because of the possibility of inlining, right?

> > +void __get_online_cpus(void)
> > {
> > - if (cpu_hotplug.active_writer == current)
> > + if (cpuhp_writer_task == current)
> > return;
>
> Probably it would be better to simply inc/dec ->cpuhp_ref in
> cpu_hotplug_begin/end and remove this check here and in
> __put_online_cpus().
>
> This also means that the writer doing get/put_online_cpus() will
> always use the fast path, and __cpuhp_writer can go away,
> cpuhp_writer_task != NULL can be used instead.

I would need to see the code for this change to be sure. ;-)

> > + atomic_inc(&cpuhp_waitcount);
> > +
> > + /*
> > + * We either call schedule() in the wait, or we'll fall through
> > + * and reschedule on the preempt_enable() in get_online_cpus().
> > + */
> > + preempt_enable_no_resched();
> > + wait_event(cpuhp_wq, !__cpuhp_writer);
> > + preempt_disable();
> > +
> > + /*
> > + * It would be possible for cpu_hotplug_done() to complete before
> > + * the atomic_inc() above; in which case there is no writer waiting
> > + * and doing a wakeup would be BAD (tm).
> > + *
> > + * If however we still observe cpuhp_writer_task here we know
> > + * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> > + */
> > + if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> > + cpuhp_writer_wake();
>
> cpuhp_writer_wake() here and in __put_online_cpus() looks racy...
> Not only can cpuhp_writer_wake() hit cpuhp_writer_task == NULL (we need
> something like ACCESS_ONCE()); its task_struct can already be freed/reused
> if the writer exits.
>
> And I don't really understand the logic... This slow path succeeds without
> incrementing any counter (except current->cpuhp_ref)? How can the next writer
> notice that it should wait for this reader?
>
> > void cpu_hotplug_done(void)
> > {
> > - cpu_hotplug.active_writer = NULL;
> > - mutex_unlock(&cpu_hotplug.lock);
> > + /* Signal the writer is done */
> > > + __cpuhp_writer = 0;
> > + wake_up_all(&cpuhp_wq);
> > +
> > + /* Wait for any pending readers to be running */
> > + cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> > + cpuhp_writer_task = NULL;
>
> We also need to ensure that the next reader sees all changes
> done by the writer; IOW this lacks "release" semantics.

Good point -- I was expecting wake_up_all() to provide the release
semantics, but code could be reordered into __wake_up()'s critical
section, especially in the case where there was nothing to wake
up, but where there were new readers starting concurrently with
cpu_hotplug_done().
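
In other words, the writer would need to publish its changes explicitly
before readers can observe __cpuhp_writer == 0, along the lines of the sketch
below (not a posted patch; as discussed further down the thread, the reader
fast path still needs a matching barrier, or the store has to be paired with
another synchronize_sched()):

void cpu_hotplug_done(void)
{
	/* Order the hotplug changes before publishing "writer done". */
	smp_mb();
	__cpuhp_writer = 0;
	wake_up_all(&cpuhp_wq);

	/* Wait for any pending readers to be running, as before. */
	cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
	cpuhp_writer_task = NULL;
}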

> But, Peter, the main question is: why is this better than
> percpu_rw_semaphore performance-wise? (Assuming we add
> task_struct->cpuhp_ref).
>
> If the writer is pending, percpu_down_read() does
>
> down_read(&brw->rw_sem);
> atomic_inc(&brw->slow_read_ctr);
> __up_read(&brw->rw_sem);
>
> is it really much worse than wait_event + atomic_dec_and_test?
>
> And! please note that with your implementation the new readers will
> be likely blocked while the writer sleeps in synchronize_sched().
> This doesn't happen with percpu_rw_semaphore.

Thanx, Paul

2013-09-24 16:51:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Sep 24, 2013 at 06:03:59PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > +static inline void get_online_cpus(void)
> > +{
> > + might_sleep();
> > +
> > + if (current->cpuhp_ref++) {
> > + barrier();
> > + return;
>
> I don't understand this barrier()... we are going to return if we already
> hold the lock, do we really need it?
>
> The same for put_online_cpus().

to make {get,put}_online_cpus() always behave like per-cpu lock
sections.

I don't think it's ever 'correct' for loads/stores to escape the section,
even if not strictly harmful.

> > +void __get_online_cpus(void)
> > {
> > - if (cpu_hotplug.active_writer == current)
> > + if (cpuhp_writer_task == current)
> > return;
>
> Probably it would be better to simply inc/dec ->cpuhp_ref in
> cpu_hotplug_begin/end and remove this check here and in
> __put_online_cpus().

Oh indeed!

> > + if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> > + cpuhp_writer_wake();
>
> cpuhp_writer_wake() here and in __put_online_cpus() looks racy...

Yeah it is. Paul already said.

> But, Peter, the main question is: why is this better than
> percpu_rw_semaphore performance-wise? (Assuming we add
> task_struct->cpuhp_ref).
>
> If the writer is pending, percpu_down_read() does
>
> down_read(&brw->rw_sem);
> atomic_inc(&brw->slow_read_ctr);
> __up_read(&brw->rw_sem);
>
> is it really much worse than wait_event + atomic_dec_and_test?
>
> And! please note that with your implementation the new readers will
> be likely blocked while the writer sleeps in synchronize_sched().
> This doesn't happen with percpu_rw_semaphore.

Good points both, no I don't think there's a significant performance gap
there.

I'm still hoping we can come up with something better though :/ I don't
particularly like either.

2013-09-24 16:54:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Sep 24, 2013 at 09:49:00AM -0700, Paul E. McKenney wrote:
> > > void cpu_hotplug_done(void)
> > > {
> > > + /* Signal the writer is done */
> > > + __cpuhp_writer = 0;
> > > + wake_up_all(&cpuhp_wq);
> > > +
> > > + /* Wait for any pending readers to be running */
> > > + cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> > > + cpuhp_writer_task = NULL;
> >
> > We also need to ensure that the next reader sees all changes
> > done by the writer; IOW this lacks "release" semantics.
>
> Good point -- I was expecting wake_up_all() to provide the release
> semantics, but code could be reordered into __wake_up()'s critical
> section, especially in the case where there was nothing to wake
> up, but where there were new readers starting concurrently with
> cpu_hotplug_done().

Doh, indeed. I missed this in Oleg's email, but yes I made that same
assumption about wake_up_all().

2013-09-24 17:09:27

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/24, Peter Zijlstra wrote:
>
> On Tue, Sep 24, 2013 at 09:49:00AM -0700, Paul E. McKenney wrote:
> > > > void cpu_hotplug_done(void)
> > > > {
> > > > + /* Signal the writer is done */
> > > > + __cpuhp_writer = 0;
> > > > + wake_up_all(&cpuhp_wq);
> > > > +
> > > > + /* Wait for any pending readers to be running */
> > > > + cpuhp_writer_wait(!atomic_read(&cpuhp_waitcount));
> > > > + cpuhp_writer_task = NULL;
> > >
> > > We also need to ensure that the next reader sees all changes
> > > done by the writer; IOW this lacks "release" semantics.
> >
> > Good point -- I was expecting wake_up_all() to provide the release
> > semantics, but code could be reordered into __wake_up()'s critical
> > section, especially in the case where there was nothing to wake
> > up, but where there were new readers starting concurrently with
> > cpu_hotplug_done().
>
> Doh, indeed. I missed this in Oleg's email, but yes I made that same
> assumption about wake_up_all().

Well, I think this is even worse... No matter what the writer does,
the new reader needs mb() after it checks !__cpuhp_writer. Or we
need another synchronize_sched() in cpu_hotplug_done(). This is
what percpu_rw_semaphore() does (to remind, this can be turned into
call_rcu).

Oleg.
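
Concretely, the reader-side mb() would sit in the fast path, roughly as in
the sketch below (which of course gives up the barrier-free fast path; the
alternative is the extra synchronize_sched() in cpu_hotplug_done(), which is
the route the next posting takes):

static inline void get_online_cpus(void)
{
	might_sleep();

	if (current->cpuhp_ref++) {
		barrier();
		return;
	}

	preempt_disable();
	if (likely(!__cpuhp_writer)) {
		__this_cpu_inc(__cpuhp_refcount);
		/*
		 * Order the writer check against the read-side critical
		 * section; pairs with the writer's "done" publication.
		 */
		smp_mb();
	} else {
		__get_online_cpus();
	}
	preempt_enable();
}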

2013-09-24 17:13:45

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/24, Steven Rostedt wrote:
>
> On Tue, 24 Sep 2013 18:03:59 +0200
> Oleg Nesterov <[email protected]> wrote:
>
> > On 09/24, Peter Zijlstra wrote:
> > >
> > > +static inline void get_online_cpus(void)
> > > +{
> > > + might_sleep();
> > > +
> > > + if (current->cpuhp_ref++) {
> > > + barrier();
> > > + return;
> >
> > I don't understand this barrier()... we are going to return if we already
> > hold the lock, do we really need it?
>
> I'm confused too. Unless gcc moves this after the release, but the
> release uses preempt_disable() which is its own barrier.
>
> If anything, it requires a comment.

And I am still confused even after emails from Paul and Peter...

If gcc can actually do something wrong, then I suspect this barrier()
should be unconditional.

Oleg.

2013-09-24 17:47:29

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Sep 24, 2013 at 07:06:31PM +0200, Oleg Nesterov wrote:
> On 09/24, Steven Rostedt wrote:
> >
> > On Tue, 24 Sep 2013 18:03:59 +0200
> > Oleg Nesterov <[email protected]> wrote:
> >
> > > On 09/24, Peter Zijlstra wrote:
> > > >
> > > > +static inline void get_online_cpus(void)
> > > > +{
> > > > + might_sleep();
> > > > +
> > > > + if (current->cpuhp_ref++) {
> > > > + barrier();
> > > > + return;
> > >
> > > I don't understand this barrier()... we are going to return if we already
> > > hold the lock, do we really need it?
> >
> > I'm confused too. Unless gcc moves this after the release, but the
> > release uses preempt_disable() which is its own barrier.
> >
> > If anything, it requires a comment.
>
> And I am still confused even after emails from Paul and Peter...
>
> If gcc can actually do something wrong, then I suspect this barrier()
> should be unconditional.

If you are saying that there should be a barrier() on all return paths
from get_online_cpus(), I agree.

Thanx, Paul
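
That would make get_online_cpus() look something like this (sketch only):

static inline void get_online_cpus(void)
{
	might_sleep();

	if (!current->cpuhp_ref++) {
		preempt_disable();
		if (likely(!__cpuhp_writer))
			__this_cpu_inc(__cpuhp_refcount);
		else
			__get_online_cpus();
		preempt_enable();
	}
	barrier();	/* unconditional: covers the recursive path too */
}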

2013-09-24 18:07:47

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/24, Paul E. McKenney wrote:
>
> On Tue, Sep 24, 2013 at 07:06:31PM +0200, Oleg Nesterov wrote:
> >
> > If gcc can actually do something wrong, then I suspect this barrier()
> > should be unconditional.
>
> If you are saying that there should be a barrier() on all return paths
> from get_online_cpus(), I agree.

Paul, Peter, could you provide any (even completely artificial) example
to explain to me why we need this barrier()? I am puzzled. And
preempt_enable() already has a barrier...

get_online_cpus();
do_something();

Yes, we need to ensure gcc doesn't reorder this code so that
do_something() comes before get_online_cpus(). But it can't? At least
it should check current->cpuhp_ref != 0 first? And if it is non-zero
we do not really care, we are already in the critical section and
this ->cpuhp_ref has only meaning in put_online_cpus().

Confused...

Oleg.

2013-09-24 20:24:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Mon, Sep 23, 2013 at 07:32:03PM +0200, Oleg Nesterov wrote:
> > static void cpuhp_wait_refcount(void)
> > {
> > for (;;) {
> > unsigned int rc1, rc2;
> >
> > rc1 = cpuhp_refcount();
> > set_current_state(TASK_UNINTERRUPTIBLE); /* MB */
> > rc2 = cpuhp_refcount();
> >
> > if (rc1 == rc2 && !rc1)
>
> But this only makes the race above "theoretical ** 2". Both
> cpuhp_refcount()'s can be equally fooled.
>
> Looks like, cpuhp_refcount() should take all per-cpu cpuhp_lock's
> before it reads __cpuhp_refcount.

Ah, so SRCU has a solution for this using a sequence count.

So now we drop from a no memory barriers fast path, into a memory
barrier 'slow' path into blocking.

Only once we block do we hit global state..

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/cpumask.h>
+#include <linux/percpu.h>

struct device;

@@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
#ifdef CONFIG_HOTPLUG_CPU
/* Stop CPUs going up and down. */

+extern void cpu_hotplug_init_task(struct task_struct *p);
+
extern void cpu_hotplug_begin(void);
extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern int __cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+ might_sleep();
+
+ /* Support reader-in-reader recursion */
+ if (current->cpuhp_ref++) {
+ barrier();
+ return;
+ }
+
+ preempt_disable();
+ if (likely(!__cpuhp_writer))
+ __this_cpu_inc(__cpuhp_refcount);
+ else
+ __get_online_cpus();
+ preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+ barrier();
+ if (--current->cpuhp_ref)
+ return;
+
+ preempt_disable();
+ if (likely(!__cpuhp_writer))
+ __this_cpu_dec(__cpuhp_refcount);
+ else
+ __put_online_cpus();
+ preempt_enable();
+}
+
extern void cpu_hotplug_disable(void);
extern void cpu_hotplug_enable(void);
#define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
@@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un

#else /* CONFIG_HOTPLUG_CPU */

+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
static inline void cpu_hotplug_begin(void) {}
static inline void cpu_hotplug_done(void) {}
#define get_online_cpus() do { } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
unsigned int sequential_io;
unsigned int sequential_io_avg;
#endif
+#ifdef CONFIG_HOTPLUG_CPU
+ int cpuhp_ref;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,148 @@ static int cpu_hotplug_disabled;

#ifdef CONFIG_HOTPLUG_CPU

-static struct {
- struct task_struct *active_writer;
- struct mutex lock; /* Synchronizes accesses to refcount, */
+int __cpuhp_writer;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
+static atomic_t cpuhp_waitcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+ p->cpuhp_ref = 0;
+}
+
+void __get_online_cpus(void)
+{
+ if (__cpuhp_writer == 1) {
+ /* See __srcu_read_lock() */
+ __this_cpu_inc(__cpuhp_refcount);
+ smp_mb();
+ __this_cpu_inc(cpuhp_seq);
+ return;
+ }
+
+ atomic_inc(&cpuhp_waitcount);
+
/*
- * Also blocks the new readers during
- * an ongoing cpu hotplug operation.
+ * We either call schedule() in the wait, or we'll fall through
+ * and reschedule on the preempt_enable() in get_online_cpus().
*/
- int refcount;
-} cpu_hotplug = {
- .active_writer = NULL,
- .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
- .refcount = 0,
-};
+ preempt_enable_no_resched();
+ wait_event(cpuhp_readers, !__cpuhp_writer);
+ preempt_disable();

-void get_online_cpus(void)
+ /*
+ * XXX list_empty_careful(&cpuhp_readers.task_list) ?
+ */
+ if (atomic_dec_and_test(&cpuhp_waitcount))
+ wake_up_all(&cpuhp_writer);
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
+
+void __put_online_cpus(void)
{
- might_sleep();
- if (cpu_hotplug.active_writer == current)
- return;
- mutex_lock(&cpu_hotplug.lock);
- cpu_hotplug.refcount++;
- mutex_unlock(&cpu_hotplug.lock);
+ /* See __srcu_read_unlock() */
+ smp_mb();
+ this_cpu_dec(__cpuhp_refcount);

+ /* Prod writer to recheck readers_active */
+ wake_up_all(&cpuhp_writer);
}
-EXPORT_SYMBOL_GPL(get_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);

-void put_online_cpus(void)
+static unsigned int cpuhp_seq(void)
{
- if (cpu_hotplug.active_writer == current)
- return;
- mutex_lock(&cpu_hotplug.lock);
+ unsigned int seq = 0;
+ int cpu;

- if (WARN_ON(!cpu_hotplug.refcount))
- cpu_hotplug.refcount++; /* try to fix things up */
+ for_each_possible_cpu(cpu)
+ seq += per_cpu(cpuhp_seq, cpu);

- if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
- wake_up_process(cpu_hotplug.active_writer);
- mutex_unlock(&cpu_hotplug.lock);
+ return seq;
+}
+
+static unsigned int cpuhp_refcount(void)
+{
+ unsigned int refcount = 0;
+ int cpu;

+ for_each_possible_cpu(cpu)
+ refcount += per_cpu(__cpuhp_refcount, cpu);
+
+ return refcount;
+}
+
+/*
+ * See srcu_readers_active_idx_check()
+ */
+static bool cpuhp_readers_active_check(void)
+{
+ unsigned int seq = cpuhp_seq();
+
+ smp_mb();
+
+ if (cpuhp_refcount() != 0)
+ return false;
+
+ smp_mb();
+
+ return cpuhp_seq() == seq;
}
-EXPORT_SYMBOL_GPL(put_online_cpus);

/*
* This ensures that the hotplug operation can begin only when the
* refcount goes to zero.
*
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
* Since cpu_hotplug_begin() is always called after invoking
* cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- * writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- * non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
*/
void cpu_hotplug_begin(void)
{
- cpu_hotplug.active_writer = current;
+ unsigned int count = 0;
+ int cpu;

- for (;;) {
- mutex_lock(&cpu_hotplug.lock);
- if (likely(!cpu_hotplug.refcount))
- break;
- __set_current_state(TASK_UNINTERRUPTIBLE);
- mutex_unlock(&cpu_hotplug.lock);
- schedule();
- }
+ lockdep_assert_held(&cpu_add_remove_lock);
+
+ /* allow reader-in-writer recursion */
+ current->cpuhp_ref++;
+
+ /* make readers take the slow path */
+ __cpuhp_writer = 1;
+
+ /* See percpu_down_write() */
+ synchronize_sched();
+
+ /* make readers block */
+ __cpuhp_writer = 2;
+
+ /* Wait for all readers to go away */
+ wait_event(cpuhp_writer, cpuhp_readers_active_check());
}

void cpu_hotplug_done(void)
{
- cpu_hotplug.active_writer = NULL;
- mutex_unlock(&cpu_hotplug.lock);
+ /* Signal the writer is done, no fast path yet */
+ __cpuhp_writer = 1;
+ wake_up_all(&cpuhp_readers);
+
+ /* See percpu_up_write() */
+ synchronize_sched();
+
+ /* Let em rip */
+ __cpuhp_writer = 0;
+ current->cpuhp_ref--;
+
+ /*
+ * Wait for any pending readers to be running. This ensures readers
+ * after writer and avoids writers starving readers.
+ */
+ wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
}

/*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
+
+ cpu_hotplug_init_task(p);
}

#ifdef CONFIG_NUMA_BALANCING

2013-09-24 20:35:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Sep 24, 2013 at 08:00:05PM +0200, Oleg Nesterov wrote:
> On 09/24, Paul E. McKenney wrote:
> >
> > On Tue, Sep 24, 2013 at 07:06:31PM +0200, Oleg Nesterov wrote:
> > >
> > > If gcc can actually do something wrong, then I suspect this barrier()
> > > should be unconditional.
> >
> > If you are saying that there should be a barrier() on all return paths
> > from get_online_cpus(), I agree.
>
> Paul, Peter, could you provide any (even completely artificial) example
> to explain to me why we need this barrier()? I am puzzled. And
> preempt_enable() already has a barrier...
>
> get_online_cpus();
> do_something();
>
> Yes, we need to ensure gcc doesn't reorder this code so that
> do_something() comes before get_online_cpus(). But it can't? At least
> it should check current->cpuhp_ref != 0 first? And if it is non-zero
> we do not really care, we are already in the critical section and
> this ->cpuhp_ref has only meaning in put_online_cpus().
>
> Confused...


So the reason I put it in was because of the inline; it could possibly
make it do:

test 0, current->cpuhp_ref
je label1
inc current->cpuhp_ref

label2:
do_something();

label1:
inc %gs:__preempt_count
test 0, __cpuhp_writer
jne label3
inc %gs:__cpuhp_refcount
label5:
dec %gs:__preempt_count
je label4
jmp label2
label3:
call __get_online_cpus();
jmp label5
label4:
call ____preempt_schedule();
jmp label2

In which case the recursive fast path doesn't have a barrier() between
taking the ref and starting do_something().

I wanted to make absolutely sure nothing of do_something leaked before
the label2 thing. The other labels all have barrier() from the
preempt_count ops.

2013-09-24 21:02:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Sep 24, 2013 at 10:24:23PM +0200, Peter Zijlstra wrote:
> +void __get_online_cpus(void)
> +{
> + if (__cpuhp_writer == 1) {
take_ref:
> + /* See __srcu_read_lock() */
> + __this_cpu_inc(__cpuhp_refcount);
> + smp_mb();
> + __this_cpu_inc(cpuhp_seq);
> + return;
> + }
> +
> + atomic_inc(&cpuhp_waitcount);
> +
> /*
> + * We either call schedule() in the wait, or we'll fall through
> + * and reschedule on the preempt_enable() in get_online_cpus().
> */
> + preempt_enable_no_resched();
> + wait_event(cpuhp_readers, !__cpuhp_writer);
> + preempt_disable();
>
> + /*
> + * XXX list_empty_careful(&cpuhp_readers.task_list) ?
> + */
> + if (atomic_dec_and_test(&cpuhp_waitcount))
> + wake_up_all(&cpuhp_writer);
goto take_ref;
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);

It would probably be a good idea to increment __cpuhp_refcount after the
wait_event.
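
Folding both notes in, __get_online_cpus() from the previous mail would
become roughly the following (a reconstruction, not a reposted patch):

void __get_online_cpus(void)
{
	if (__cpuhp_writer == 1) {
take_ref:
		/* See __srcu_read_lock() */
		__this_cpu_inc(__cpuhp_refcount);
		smp_mb();
		__this_cpu_inc(cpuhp_seq);
		return;
	}

	atomic_inc(&cpuhp_waitcount);

	/*
	 * We either call schedule() in the wait, or we'll fall through
	 * and reschedule on the preempt_enable() in get_online_cpus().
	 */
	preempt_enable_no_resched();
	wait_event(cpuhp_readers, !__cpuhp_writer);
	preempt_disable();

	if (atomic_dec_and_test(&cpuhp_waitcount))
		wake_up_all(&cpuhp_writer);

	/* Take the reference only after the wait is over. */
	goto take_ref;
}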

2013-09-24 21:09:58

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Sep 24, 2013 at 06:09:59PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 24, 2013 at 07:42:36AM -0700, Paul E. McKenney wrote:
> > > +#define cpuhp_writer_wake() \
> > > + wake_up_process(cpuhp_writer_task)
> > > +
> > > +#define cpuhp_writer_wait(cond) \
> > > +do { \
> > > + for (;;) { \
> > > + set_current_state(TASK_UNINTERRUPTIBLE); \
> > > + if (cond) \
> > > + break; \
> > > + schedule(); \
> > > + } \
> > > + __set_current_state(TASK_RUNNING); \
> > > +} while (0)
> >
> > Why not wait_event()? Presumably the above is a bit lighter weight,
> > but is that even something that can be measured?
>
> I didn't want to mix readers and writers on cpuhp_wq, and I suppose I
> could create a second waitqueue; that might also be a better solution
> for the NULL thing below.

That would have the advantage of being a bit less racy.

> > > + atomic_inc(&cpuhp_waitcount);
> > > +
> > > + /*
> > > + * We either call schedule() in the wait, or we'll fall through
> > > + * and reschedule on the preempt_enable() in get_online_cpus().
> > > + */
> > > + preempt_enable_no_resched();
> > > + wait_event(cpuhp_wq, !__cpuhp_writer);
> >
> > Finally! A good use for preempt_enable_no_resched(). ;-)
>
> Hehe, there were a few others, but tglx removed most with the
> schedule_preempt_disabled() primitive.

;-)

> In fact, I considered a wait_event_preempt_disabled() but was too lazy.
> That whole wait_event macro fest looks like it could use an iteration or
> two of collapse anyhow.

There are some serious layers there, aren't there?

> > > + preempt_disable();
> > > +
> > > + /*
> > > + * It would be possible for cpu_hotplug_done() to complete before
> > > + * the atomic_inc() above; in which case there is no writer waiting
> > > + * and doing a wakeup would be BAD (tm).
> > > + *
> > > + * If however we still observe cpuhp_writer_task here we know
> > > + * cpu_hotplug_done() is currently stuck waiting for cpuhp_waitcount.
> > > + */
> > > + if (atomic_dec_and_test(&cpuhp_waitcount) && cpuhp_writer_task)
> >
> > OK, I'll bite... What sequence of events results in the
> > atomic_dec_and_test() returning true but there being no
> > cpuhp_writer_task?
> >
> > Ah, I see it...
>
> <snip>
>
> Indeed, and
>
> > But what prevents the following sequence of events?
>
> <snip>
>
> > o Task B's call to cpuhp_writer_wake() sees a NULL pointer.
>
> Quite so.. nothing. See there was a reason I kept being confused about
> it.
>
> > > void cpu_hotplug_begin(void)
> > > {
> > > + unsigned int count = 0;
> > > + int cpu;
> > > +
> > > + lockdep_assert_held(&cpu_add_remove_lock);
> > >
> > > + __cpuhp_writer = 1;
> > > + cpuhp_writer_task = current;
> >
> > At this point, the value of cpuhp_slowcount can go negative. Can't see
> > that this causes a problem, given the atomic_add() below.
>
> Agreed.
>
> > > +
> > > + /* After this everybody will observe writer and take the slow path. */
> > > + synchronize_sched();
> > > +
> > > + /* Collapse the per-cpu refcount into slowcount */
> > > + for_each_possible_cpu(cpu) {
> > > + count += per_cpu(__cpuhp_refcount, cpu);
> > > + per_cpu(__cpuhp_refcount, cpu) = 0;
> > > }
> >
> > The above is safe because the readers are no longer changing their
> > __cpuhp_refcount values.
>
> Yes, I'll expand the comment.
>
> So how about something like this?

A few memory barriers required, if I am reading the code correctly.
Some of them, perhaps all of them, called out by Oleg.

Thanx, Paul

> ---
> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
> #include <linux/node.h>
> #include <linux/compiler.h>
> #include <linux/cpumask.h>
> +#include <linux/percpu.h>
>
> struct device;
>
> @@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
> #ifdef CONFIG_HOTPLUG_CPU
> /* Stop CPUs going up and down. */
>
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
> extern void cpu_hotplug_begin(void);
> extern void cpu_hotplug_done(void);
> -extern void get_online_cpus(void);
> -extern void put_online_cpus(void);
> +
> +extern struct task_struct *__cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + /* Support reader-in-reader recursion */
> + if (current->cpuhp_ref++) {
> + barrier();
> + return;
> + }
> +
> + preempt_disable();
> + if (likely(!__cpuhp_writer))
> + __this_cpu_inc(__cpuhp_refcount);

As Oleg noted, need a barrier here for when a new reader runs concurrently
with a completing writer.

> + else
> + __get_online_cpus();
> + preempt_enable();
> +}
> +
> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> + barrier();
> + if (--current->cpuhp_ref)
> + return;
> +
> + preempt_disable();
> + if (likely(!__cpuhp_writer))
> + __this_cpu_dec(__cpuhp_refcount);

No barrier needed here because synchronize_sched() covers it.

> + else
> + __put_online_cpus();
> + preempt_enable();
> +}
> +
> extern void cpu_hotplug_disable(void);
> extern void cpu_hotplug_enable(void);
> #define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
> @@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un
>
> #else /* CONFIG_HOTPLUG_CPU */
>
> +static inline void cpu_hotplug_init_task(struct task_struct *p) {}
> +
> static inline void cpu_hotplug_begin(void) {}
> static inline void cpu_hotplug_done(void) {}
> #define get_online_cpus() do { } while (0)
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1454,6 +1454,9 @@ struct task_struct {
> unsigned int sequential_io;
> unsigned int sequential_io_avg;
> #endif
> +#ifdef CONFIG_HOTPLUG_CPU
> + int cpuhp_ref;
> +#endif
> };
>
> /* Future-safe accessor for struct task_struct's cpus_allowed. */
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -49,88 +49,100 @@ static int cpu_hotplug_disabled;
>
> #ifdef CONFIG_HOTPLUG_CPU
>
> -static struct {
> - struct task_struct *active_writer;
> - struct mutex lock; /* Synchronizes accesses to refcount, */
> - /*
> - * Also blocks the new readers during
> - * an ongoing cpu hotplug operation.
> - */
> - int refcount;
> -} cpu_hotplug = {
> - .active_writer = NULL,
> - .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
> - .refcount = 0,
> -};
> +struct task_struct *__cpuhp_writer;
> +EXPORT_SYMBOL_GPL(__cpuhp_writer);
>
> -void get_online_cpus(void)
> -{
> - might_sleep();
> - if (cpu_hotplug.active_writer == current)
> - return;
> - mutex_lock(&cpu_hotplug.lock);
> - cpu_hotplug.refcount++;
> - mutex_unlock(&cpu_hotplug.lock);
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> +
> +static atomic_t cpuhp_waitcount;
> +static atomic_t cpuhp_slowcount;
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
>
> +void cpu_hotplug_init_task(struct task_struct *p)
> +{
> + p->cpuhp_ref = 0;
> }
> -EXPORT_SYMBOL_GPL(get_online_cpus);
>
> -void put_online_cpus(void)
> +void __get_online_cpus(void)
> {
> - if (cpu_hotplug.active_writer == current)
> + /* Support reader-in-writer recursion */
> + if (__cpuhp_writer == current)
> return;
> - mutex_lock(&cpu_hotplug.lock);
>
> - if (WARN_ON(!cpu_hotplug.refcount))
> - cpu_hotplug.refcount++; /* try to fix things up */
> + atomic_inc(&cpuhp_waitcount);
>
> - if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> - wake_up_process(cpu_hotplug.active_writer);
> - mutex_unlock(&cpu_hotplug.lock);
> + /*
> + * We either call schedule() in the wait, or we'll fall through
> + * and reschedule on the preempt_enable() in get_online_cpus().
> + */
> + preempt_enable_no_resched();
> + wait_event(cpuhp_readers, !__cpuhp_writer);
> + preempt_disable();
> +
> + if (atomic_dec_and_test(&cpuhp_waitcount))

This provides the needed memory barrier for concurrent write releases.

> + wake_up_all(&cpuhp_writer);
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
> +
> +void __put_online_cpus(void)
> +{
> + if (__cpuhp_writer == current)
> + return;
>
> + if (atomic_dec_and_test(&cpuhp_slowcount))

This provides the needed memory barrier for concurrent write acquisitions.

> + wake_up_all(&cpuhp_writer);
> }
> -EXPORT_SYMBOL_GPL(put_online_cpus);
> +EXPORT_SYMBOL_GPL(__put_online_cpus);
>
> /*
> * This ensures that the hotplug operation can begin only when the
> * refcount goes to zero.
> *
> - * Note that during a cpu-hotplug operation, the new readers, if any,
> - * will be blocked by the cpu_hotplug.lock
> - *
> * Since cpu_hotplug_begin() is always called after invoking
> * cpu_maps_update_begin(), we can be sure that only one writer is active.
> - *
> - * Note that theoretically, there is a possibility of a livelock:
> - * - Refcount goes to zero, last reader wakes up the sleeping
> - * writer.
> - * - Last reader unlocks the cpu_hotplug.lock.
> - * - A new reader arrives at this moment, bumps up the refcount.
> - * - The writer acquires the cpu_hotplug.lock finds the refcount
> - * non zero and goes to sleep again.
> - *
> - * However, this is very difficult to achieve in practice since
> - * get_online_cpus() not an api which is called all that often.
> - *
> */
> void cpu_hotplug_begin(void)
> {
> - cpu_hotplug.active_writer = current;
> + unsigned int count = 0;
> + int cpu;
> +
> + lockdep_assert_held(&cpu_add_remove_lock);
>
> - for (;;) {
> - mutex_lock(&cpu_hotplug.lock);
> - if (likely(!cpu_hotplug.refcount))
> - break;
> - __set_current_state(TASK_UNINTERRUPTIBLE);
> - mutex_unlock(&cpu_hotplug.lock);
> - schedule();
> + __cpuhp_writer = current;
> +
> + /*
> + * After this everybody will observe writer and take the slow path.
> + */
> + synchronize_sched();
> +
> + /*
> + * Collapse the per-cpu refcount into slowcount. This is safe because
> + * readers are now taking the slow path (per the above) which doesn't
> + * touch __cpuhp_refcount.
> + */
> + for_each_possible_cpu(cpu) {
> + count += per_cpu(__cpuhp_refcount, cpu);
> + per_cpu(__cpuhp_refcount, cpu) = 0;
> }
> + atomic_add(count, &cpuhp_slowcount);
> +
> + /* Wait for all readers to go away */
> + wait_event(cpuhp_writer, !atomic_read(&cpuhp_slowcount));

Oddly enough, there appear to be cases where you need a memory barrier
here. Suppose that all the readers finish after the atomic_add() above,
but before the wait_event(). Then wait_event() just checks the condition
without any memory barriers. So smp_mb() needed here.
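
For reference, the fast path of the generic macro is just a bare test (a
simplified sketch of wait_event(), trimmed to the relevant part):

	#define wait_event(wq, condition)				\
	do {								\
		if (condition)						\
			break;						\
		__wait_event(wq, condition);				\
	} while (0)

so nothing orders that condition load against earlier accesses unless the
caller supplies a barrier.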

/me runs off to check RCU's use of wait_event()...

Found one missing. And some places in need of comments. And a few
places that could use an ACCESS_ONCE().

Back to the review...

> }
>
> void cpu_hotplug_done(void)
> {
> - cpu_hotplug.active_writer = NULL;
> - mutex_unlock(&cpu_hotplug.lock);
> + /* Signal the writer is done */

And I believe we need a memory barrier here to keep the write-side
critical section confined from the viewpoint of a reader that starts
just after the NULLing of cpuhp_writer.

Of course, being who I am, I cannot resist pointing out that you have
the same number of memory barriers as would use of SRCU, and that
synchronize_srcu() can be quite a bit faster than synchronize_sched()
in the case where there are no readers. ;-)

> + __cpuhp_writer = NULL;
> + wake_up_all(&cpuhp_readers);
> +
> + /*
> + * Wait for any pending readers to be running. This ensures readers
> + * after writer and avoids writers starving readers.
> + */
> + wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> }
>
> /*
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
> INIT_LIST_HEAD(&p->numa_entry);
> p->numa_group = NULL;
> #endif /* CONFIG_NUMA_BALANCING */
> +
> + cpu_hotplug_init_task(p);
> }
>
> #ifdef CONFIG_NUMA_BALANCING
>

2013-09-25 15:24:06

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/24, Peter Zijlstra wrote:
>
> On Tue, Sep 24, 2013 at 08:00:05PM +0200, Oleg Nesterov wrote:
> >
> > Yes, we need to ensure gcc doesn't reorder this code so that
> > do_something() comes before get_online_cpus(). But it can't? At least
> > it should check current->cpuhp_ref != 0 first? And if it is non-zero
> > we do not really care, we are already in the critical section and
> > this ->cpuhp_ref has only meaning in put_online_cpus().
> >
> > Confused...
>
>
> So the reason I put it in was because of the inline; it could possibly
> make it do:

[...snip...]

> In which case the recursive fast path doesn't have a barrier() between
> taking the ref and starting do_something().

Yes, but my point was, this can only happen in recursive fast path.
And in this case (I think) we do not care, we are already in the critical
section.

current->cpuhp_ref doesn't matter at all until we call put_online_cpus().

Suppose that gcc knows for sure that current->cpuhp_ref != 0. Then I
think, for example,

get_online_cpus();
do_something();
put_online_cpus();

converted to

do_something();
current->cpuhp_ref++;
current->cpuhp_ref--;

is fine. do_something() should not depend on ->cpuhp_ref.

OK, please forget. I guess I will never understand this ;)

Oleg.

2013-09-25 15:35:54

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Wed, Sep 25, 2013 at 05:16:42PM +0200, Oleg Nesterov wrote:
> Yes, but my point was, this can only happen in recursive fast path.

Right, I understood.

> And in this case (I think) we do not care, we are already in the critical
> section.

I tend to agree, however paranoia..

> OK, please forget. I guess I will never understand this ;)

It might just be I'm less certain about there not being any avenue of
mischief.

2013-09-25 16:02:24

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/24, Peter Zijlstra wrote:
>
> So now we drop from a no memory barriers fast path, into a memory
> barrier 'slow' path into blocking.

Cough... can't understand the above ;) In fact I can't understand
the patch... see below. But in any case, afaics the fast path
needs mb() unless you add another synchronize_sched() into
cpu_hotplug_done().

> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + /* Support reader-in-reader recursion */
> + if (current->cpuhp_ref++) {
> + barrier();
> + return;
> + }
> +
> + preempt_disable();
> + if (likely(!__cpuhp_writer))
> + __this_cpu_inc(__cpuhp_refcount);

mb() to ensure the reader can't miss, say, a STORE done inside
the cpu_hotplug_begin/end section.

put_online_cpus() needs mb() as well.

> +void __get_online_cpus(void)
> +{
> + if (__cpuhp_writer == 1) {
> + /* See __srcu_read_lock() */
> + __this_cpu_inc(__cpuhp_refcount);
> + smp_mb();
> + __this_cpu_inc(cpuhp_seq);
> + return;
> + }

OK, cpuhp_seq should guarantee cpuhp_readers_active_check() gets
the "stable" numbers. Looks suspicious... but lets assume this
works.

However, I do not see how "__cpuhp_writer == 1" can work, please
see below.

> + /*
> + * XXX list_empty_careful(&cpuhp_readers.task_list) ?
> + */
> + if (atomic_dec_and_test(&cpuhp_waitcount))
> + wake_up_all(&cpuhp_writer);

Same problem as in previous version. __get_online_cpus() succeeds
without incrementing __cpuhp_refcount. "goto start" can't help
afaics.

> void cpu_hotplug_begin(void)
> {
> - cpu_hotplug.active_writer = current;
> + unsigned int count = 0;
> + int cpu;
>
> - for (;;) {
> - mutex_lock(&cpu_hotplug.lock);
> - if (likely(!cpu_hotplug.refcount))
> - break;
> - __set_current_state(TASK_UNINTERRUPTIBLE);
> - mutex_unlock(&cpu_hotplug.lock);
> - schedule();
> - }
> + lockdep_assert_held(&cpu_add_remove_lock);
> +
> + /* allow reader-in-writer recursion */
> + current->cpuhp_ref++;
> +
> + /* make readers take the slow path */
> + __cpuhp_writer = 1;
> +
> + /* See percpu_down_write() */
> + synchronize_sched();

Suppose there are no readers at this point,

> +
> + /* make readers block */
> + __cpuhp_writer = 2;
> +
> + /* Wait for all readers to go away */
> + wait_event(cpuhp_writer, cpuhp_readers_active_check());

So wait_event() "quickly" returns.

Now: why should the new reader see __cpuhp_writer = 2? It can
still see it == 1, and take that "if (__cpuhp_writer == 1)" path
above.

Oleg.

2013-09-25 16:41:09

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/25, Peter Zijlstra wrote:
>
> On Wed, Sep 25, 2013 at 05:16:42PM +0200, Oleg Nesterov wrote:
>
> > And in this case (I think) we do not care, we are already in the critical
> > section.
>
> I tend to agree, however paranoia..

Ah, in this case I tend to agree. better be paranoid ;)

Oleg.

2013-09-25 16:59:14

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Wed, Sep 25, 2013 at 05:55:15PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > So now we drop from a no memory barriers fast path, into a memory
> > barrier 'slow' path into blocking.
>
> Cough... can't understand the above ;) In fact I can't understand
> the patch... see below. But in any case, afaics the fast path
> needs mb() unless you add another synchronize_sched() into
> cpu_hotplug_done().

For whatever it is worth, I too don't see how it works without read-side
memory barriers.

Thanx, Paul

> > +static inline void get_online_cpus(void)
> > +{
> > + might_sleep();
> > +
> > + /* Support reader-in-reader recursion */
> > + if (current->cpuhp_ref++) {
> > + barrier();
> > + return;
> > + }
> > +
> > + preempt_disable();
> > + if (likely(!__cpuhp_writer))
> > + __this_cpu_inc(__cpuhp_refcount);
>
> mb() to ensure the reader can't miss, say, a STORE done inside
> the cpu_hotplug_begin/end section.
>
> put_online_cpus() needs mb() as well.
>
> > +void __get_online_cpus(void)
> > +{
> > + if (__cpuhp_writer == 1) {
> > + /* See __srcu_read_lock() */
> > + __this_cpu_inc(__cpuhp_refcount);
> > + smp_mb();
> > + __this_cpu_inc(cpuhp_seq);
> > + return;
> > + }
>
> OK, cpuhp_seq should guarantee cpuhp_readers_active_check() gets
> the "stable" numbers. Looks suspicious... but lets assume this
> works.
>
> However, I do not see how "__cpuhp_writer == 1" can work, please
> see below.
>
> > + /*
> > + * XXX list_empty_careful(&cpuhp_readers.task_list) ?
> > + */
> > + if (atomic_dec_and_test(&cpuhp_waitcount))
> > + wake_up_all(&cpuhp_writer);
>
> Same problem as in previous version. __get_online_cpus() succeeds
> without incrementing __cpuhp_refcount. "goto start" can't help
> afaics.
>
> > void cpu_hotplug_begin(void)
> > {
> > - cpu_hotplug.active_writer = current;
> > + unsigned int count = 0;
> > + int cpu;
> >
> > - for (;;) {
> > - mutex_lock(&cpu_hotplug.lock);
> > - if (likely(!cpu_hotplug.refcount))
> > - break;
> > - __set_current_state(TASK_UNINTERRUPTIBLE);
> > - mutex_unlock(&cpu_hotplug.lock);
> > - schedule();
> > - }
> > + lockdep_assert_held(&cpu_add_remove_lock);
> > +
> > + /* allow reader-in-writer recursion */
> > + current->cpuhp_ref++;
> > +
> > + /* make readers take the slow path */
> > + __cpuhp_writer = 1;
> > +
> > + /* See percpu_down_write() */
> > + synchronize_sched();
>
> Suppose there are no readers at this point,
>
> > +
> > + /* make readers block */
> > + __cpuhp_writer = 2;
> > +
> > + /* Wait for all readers to go away */
> > + wait_event(cpuhp_writer, cpuhp_readers_active_check());
>
> So wait_event() "quickly" returns.
>
> Now. Why the new reader should see __cpuhp_writer = 2 ? It can
> still see it == 1, and take that "if (__cpuhp_writer == 1)" path
> above.
>
> Oleg.
>

2013-09-25 17:43:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Wed, Sep 25, 2013 at 05:55:15PM +0200, Oleg Nesterov wrote:
> On 09/24, Peter Zijlstra wrote:
> >
> > So now we drop from a no memory barriers fast path, into a memory
> > barrier 'slow' path into blocking.
>
> Cough... can't understand the above ;) In fact I can't understand
> the patch... see below. But in any case, afaics the fast path
> needs mb() unless you add another synchronize_sched() into
> cpu_hotplug_done().

Sure we can add more ;-) But I went with percpu_up_write(); it too does
the sync_sched() before clearing the fast path state.

> > +static inline void get_online_cpus(void)
> > +{
> > + might_sleep();
> > +
> > + /* Support reader-in-reader recursion */
> > + if (current->cpuhp_ref++) {
> > + barrier();
> > + return;
> > + }
> > +
> > + preempt_disable();
> > + if (likely(!__cpuhp_writer))
> > + __this_cpu_inc(__cpuhp_refcount);
>
> mb() to ensure the reader can't miss, say, a STORE done inside
> the cpu_hotplug_begin/end section.
>
> put_online_cpus() needs mb() as well.

OK, I'm not getting this; why isn't the sync_sched sufficient to get out
of this fast path without barriers?

> > +void __get_online_cpus(void)
> > +{
> > + if (__cpuhp_writer == 1) {
> > + /* See __srcu_read_lock() */
> > + __this_cpu_inc(__cpuhp_refcount);
> > + smp_mb();
> > + __this_cpu_inc(cpuhp_seq);
> > + return;
> > + }
>
> OK, cpuhp_seq should guarantee cpuhp_readers_active_check() gets
> the "stable" numbers. Looks suspicious... but lets assume this
> works.

I 'borrowed' it from SRCU, so if it's broken here it's broken there too I
suppose.

> However, I do not see how "__cpuhp_writer == 1" can work, please
> see below.
>
> > + if (atomic_dec_and_test(&cpuhp_waitcount))
> > + wake_up_all(&cpuhp_writer);
>
> Same problem as in previous version. __get_online_cpus() succeeds
> without incrementing __cpuhp_refcount. "goto start" can't help
> afaics.

I added a goto into the cond-block, not before the cond; but see the
version below.

> > void cpu_hotplug_begin(void)
> > {
> > + unsigned int count = 0;
> > + int cpu;
> >
> > + lockdep_assert_held(&cpu_add_remove_lock);
> > +
> > + /* allow reader-in-writer recursion */
> > + current->cpuhp_ref++;
> > +
> > + /* make readers take the slow path */
> > + __cpuhp_writer = 1;
> > +
> > + /* See percpu_down_write() */
> > + synchronize_sched();
>
> Suppose there are no readers at this point,
>
> > +
> > + /* make readers block */
> > + __cpuhp_writer = 2;
> > +
> > + /* Wait for all readers to go away */
> > + wait_event(cpuhp_writer, cpuhp_readers_active_check());
>
> So wait_event() "quickly" returns.
>
> Now. Why the new reader should see __cpuhp_writer = 2 ? It can
> still see it == 1, and take that "if (__cpuhp_writer == 1)" path
> above.

OK, .. I see the hole, no immediate way to fix it -- too tired atm.

2013-09-25 17:57:59

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/25, Peter Zijlstra wrote:
>
> On Wed, Sep 25, 2013 at 05:55:15PM +0200, Oleg Nesterov wrote:
>
> > > +static inline void get_online_cpus(void)
> > > +{
> > > + might_sleep();
> > > +
> > > + /* Support reader-in-reader recursion */
> > > + if (current->cpuhp_ref++) {
> > > + barrier();
> > > + return;
> > > + }
> > > +
> > > + preempt_disable();
> > > + if (likely(!__cpuhp_writer))
> > > + __this_cpu_inc(__cpuhp_refcount);
> >
> > mb() to ensure the reader can't miss, say, a STORE done inside
> > the cpu_hotplug_begin/end section.
> >
> > put_online_cpus() needs mb() as well.
>
> OK, I'm not getting this; why isn't the sync_sched sufficient to get out
> of this fast path without barriers?

Aah, sorry, I didn't notice this version has another synchronize_sched()
in cpu_hotplug_done().

Then I need to recheck again...

No. Too tired too ;) damn LSB test failures...

> > > + if (atomic_dec_and_test(&cpuhp_waitcount))
> > > + wake_up_all(&cpuhp_writer);
> >
> > Same problem as in previous version. __get_online_cpus() succeeds
> > without incrementing __cpuhp_refcount. "goto start" can't help
> > afaics.
>
> I added a goto into the cond-block, not before the cond; but see the
> version below.

"into the cond-block" doesn't look right too, at first glance. This
always succeeds, but by this time another writer can already hold
the lock.

Oleg.

2013-09-25 18:40:31

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Wed, Sep 25, 2013 at 07:50:55PM +0200, Oleg Nesterov wrote:
> No. Too tired too ;) damn LSB test failures...


ok; I cobbled this together.. I might think better of it tomorrow, but
for now I think I closed the hole before wait_event(readers_active())
you pointed out -- of course I might have created new holes :/

For easy reading the + only version.

---
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/cpumask.h>
+#include <linux/percpu.h>

struct device;

@@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
#ifdef CONFIG_HOTPLUG_CPU
/* Stop CPUs going up and down. */

+extern void cpu_hotplug_init_task(struct task_struct *p);
+
extern void cpu_hotplug_begin(void);
extern void cpu_hotplug_done(void);
+
+extern int __cpuhp_writer;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+ might_sleep();
+
+ /* Support reader-in-reader recursion */
+ if (current->cpuhp_ref++) {
+ barrier();
+ return;
+ }
+
+ preempt_disable();
+ if (likely(!__cpuhp_writer))
+ __this_cpu_inc(__cpuhp_refcount);
+ else
+ __get_online_cpus();
+ preempt_enable();
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+ barrier();
+ if (--current->cpuhp_ref)
+ return;
+
+ preempt_disable();
+ if (likely(!__cpuhp_writer))
+ __this_cpu_dec(__cpuhp_refcount);
+ else
+ __put_online_cpus();
+ preempt_enable();
+}
+
extern void cpu_hotplug_disable(void);
extern void cpu_hotplug_enable(void);
#define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
@@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un

#else /* CONFIG_HOTPLUG_CPU */

+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
static inline void cpu_hotplug_begin(void) {}
static inline void cpu_hotplug_done(void) {}
#define get_online_cpus() do { } while (0)
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
unsigned int sequential_io;
unsigned int sequential_io_avg;
#endif
+#ifdef CONFIG_HOTPLUG_CPU
+ int cpuhp_ref;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
+++ b/kernel/cpu.c
@@ -49,88 +49,140 @@ static int cpu_hotplug_disabled;

#ifdef CONFIG_HOTPLUG_CPU

+enum { readers_fast = 0, readers_slow, readers_block };
+
+int __cpuhp_writer;
+EXPORT_SYMBOL_GPL(__cpuhp_writer);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
+static atomic_t cpuhp_waitcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+ p->cpuhp_ref = 0;
+}

+void __get_online_cpus(void)
{
+again:
+ /* See __srcu_read_lock() */
+ __this_cpu_inc(__cpuhp_refcount);
+ smp_mb(); /* A matches B, E */
+ __this_cpu_inc(cpuhp_seq);
+
+ if (unlikely(__cpuhp_writer == readers_block)) {
+ __put_online_cpus();
+
+ atomic_inc(&cpuhp_waitcount);
+
+ /*
+ * We either call schedule() in the wait, or we'll fall through
+ * and reschedule on the preempt_enable() in get_online_cpus().
+ */
+ preempt_enable_no_resched();
+ __wait_event(cpuhp_readers, __cpuhp_writer != readers_block);
+ preempt_disable();

+ if (atomic_dec_and_test(&cpuhp_waitcount))
+ wake_up_all(&cpuhp_writer);
+
+ goto again;
+ }
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);
+
+void __put_online_cpus(void)
+{
+ /* See __srcu_read_unlock() */
+ smp_mb(); /* C matches D */
+ this_cpu_dec(__cpuhp_refcount);
+
+ /* Prod writer to recheck readers_active */
+ wake_up_all(&cpuhp_writer);
}
+EXPORT_SYMBOL_GPL(__put_online_cpus);

+#define per_cpu_sum(var) \
+({ \
+ typeof(var) __sum = 0; \
+ int cpu; \
+ for_each_possible_cpu(cpu) \
+ __sum += per_cpu(var, cpu); \
+ __sum; \
+})
+
+/*
+ * See srcu_readers_active_idx_check()
+ */
+static bool cpuhp_readers_active_check(void)
{
+ unsigned int seq = per_cpu_sum(cpuhp_seq);
+
+ smp_mb(); /* B matches A */

+ if (per_cpu_sum(__cpuhp_refcount) != 0)
+ return false;

+ smp_mb(); /* D matches C */

+ return per_cpu_sum(cpuhp_seq) == seq;
}

/*
* This ensures that the hotplug operation can begin only when the
* refcount goes to zero.
*
* Since cpu_hotplug_begin() is always called after invoking
* cpu_maps_update_begin(), we can be sure that only one writer is active.
*/
void cpu_hotplug_begin(void)
{
+ unsigned int count = 0;
+ int cpu;

+ lockdep_assert_held(&cpu_add_remove_lock);
+
+ /* allow reader-in-writer recursion */
+ current->cpuhp_ref++;
+
+ /* make readers take the slow path */
+ __cpuhp_writer = readers_slow;
+
+ /* See percpu_down_write() */
+ synchronize_sched();
+
+ /* make readers block */
+ __cpuhp_writer = readers_block;
+
+ smp_mb(); /* E matches A */
+
+ /* Wait for all readers to go away */
+ wait_event(cpuhp_writer, cpuhp_readers_active_check());
}

void cpu_hotplug_done(void)
{
+ /* Signal the writer is done, no fast path yet */
+ __cpuhp_writer = readers_slow;
+ wake_up_all(&cpuhp_readers);
+
+ /* See percpu_up_write() */
+ synchronize_sched();
+
+ /* Let 'em rip */
+ __cpuhp_writer = readers_fast;
+ current->cpuhp_ref--;
+
+ /*
+ * Wait for any pending readers to be running. This ensures readers
+ * after writer and avoids writers starving readers.
+ */
+ wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
}

/*
+++ b/kernel/sched/core.c
@@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
+
+ cpu_hotplug_init_task(p);
}

#ifdef CONFIG_NUMA_BALANCING

2013-09-25 21:22:09

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Wed, Sep 25, 2013 at 08:40:15PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2013 at 07:50:55PM +0200, Oleg Nesterov wrote:
> > No. Too tired too ;) damn LSB test failures...
>
>
> ok; I cobbled this together.. I might think better of it tomorrow, but
> for now I think I closed the hole before wait_event(readers_active())
> you pointed out -- of course I might have created new holes :/
>
> For easy reading the + only version.

A couple of nits and some commentary, but if there are races, they are
quite subtle. ;-)

Thanx, Paul

> ---
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
> #include <linux/node.h>
> #include <linux/compiler.h>
> #include <linux/cpumask.h>
> +#include <linux/percpu.h>
>
> struct device;
>
> @@ -173,10 +174,50 @@ extern struct bus_type cpu_subsys;
> #ifdef CONFIG_HOTPLUG_CPU
> /* Stop CPUs going up and down. */
>
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
> extern void cpu_hotplug_begin(void);
> extern void cpu_hotplug_done(void);
> +
> +extern int __cpuhp_writer;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + /* Support reader-in-reader recursion */
> + if (current->cpuhp_ref++) {
> + barrier();

Oleg was right, this barrier() can go. The value was >=1 and remains
>=1, so reordering causes no harm. (See below.)

> + return;
> + }
> +
> + preempt_disable();
> + if (likely(!__cpuhp_writer))
> + __this_cpu_inc(__cpuhp_refcount);

The barrier required here is provided by synchronize_sched(), and
all the code is contained by the barrier()s in preempt_disable() and
preempt_enable().

> + else
> + __get_online_cpus();

And a memory barrier is unconditionally executed by __get_online_cpus().

> + preempt_enable();

The barrier() in preempt_enable() prevents the compiler from bleeding
the critical section out.

> +}
> +
> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> + barrier();

This barrier() can also be dispensed with.

> + if (--current->cpuhp_ref)

If we leave here, the value was >=1 and remains >=1, so reordering does
no harm.

> + return;
> +
> + preempt_disable();

The barrier() in preempt_disable() prevents the compiler from bleeding
the critical section out.

> + if (likely(!__cpuhp_writer))
> + __this_cpu_dec(__cpuhp_refcount);

The barrier here is supplied by synchronize_sched().

> + else
> + __put_online_cpus();

And a memory barrier is unconditionally executed by __put_online_cpus().

> + preempt_enable();
> +}
> +
> extern void cpu_hotplug_disable(void);
> extern void cpu_hotplug_enable(void);
> #define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
> @@ -200,6 +241,8 @@ static inline void cpu_hotplug_driver_un
>
> #else /* CONFIG_HOTPLUG_CPU */
>
> +static inline void cpu_hotplug_init_task(struct task_struct *p) {}
> +
> static inline void cpu_hotplug_begin(void) {}
> static inline void cpu_hotplug_done(void) {}
> #define get_online_cpus() do { } while (0)
> +++ b/include/linux/sched.h
> @@ -1454,6 +1454,9 @@ struct task_struct {
> unsigned int sequential_io;
> unsigned int sequential_io_avg;
> #endif
> +#ifdef CONFIG_HOTPLUG_CPU
> + int cpuhp_ref;
> +#endif
> };
>
> /* Future-safe accessor for struct task_struct's cpus_allowed. */
> +++ b/kernel/cpu.c
> @@ -49,88 +49,140 @@ static int cpu_hotplug_disabled;
>
> #ifdef CONFIG_HOTPLUG_CPU
>
> +enum { readers_fast = 0, readers_slow, readers_block };
> +
> +int __cpuhp_writer;
> +EXPORT_SYMBOL_GPL(__cpuhp_writer);
> +
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> +
> +static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
> +static atomic_t cpuhp_waitcount;
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
> +
> +void cpu_hotplug_init_task(struct task_struct *p)
> +{
> + p->cpuhp_ref = 0;
> +}
>
> +void __get_online_cpus(void)
> {
> +again:
> + /* See __srcu_read_lock() */
> + __this_cpu_inc(__cpuhp_refcount);
> + smp_mb(); /* A matches B, E */
> + __this_cpu_inc(cpuhp_seq);
> +
> + if (unlikely(__cpuhp_writer == readers_block)) {
> + __put_online_cpus();

Suppose we got delayed here for some time. The writer might complete,
and be awakened by the blocked readers (we have not incremented our
counter yet). We would then drop through, do the atomic_dec_and_test()
and deliver a spurious wake_up_all() at some random time in the future.

Which should be OK because __wait_event() looks to handle spurious
wake_up()s.
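
For reference, the loop in question re-checks the condition after every
wakeup (a simplified sketch of the generic __wait_event() macro, trimmed):

	#define __wait_event(wq, condition)				\
	do {								\
		DEFINE_WAIT(__wait);					\
									\
		for (;;) {						\
			prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE); \
			if (condition)					\
				break;					\
			schedule();					\
		}							\
		finish_wait(&wq, &__wait);				\
	} while (0)

so a spurious wake_up_all() at worst costs an extra trip around the loop.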

> + atomic_inc(&cpuhp_waitcount);
> +
> + /*
> + * We either call schedule() in the wait, or we'll fall through
> + * and reschedule on the preempt_enable() in get_online_cpus().
> + */
> + preempt_enable_no_resched();
> + __wait_event(cpuhp_readers, __cpuhp_writer != readers_block);
> + preempt_disable();
>
> + if (atomic_dec_and_test(&cpuhp_waitcount))
> + wake_up_all(&cpuhp_writer);

There can be only one writer, so why the wake_up_all()?

> +
> + goto again;
> + }
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
> +
> +void __put_online_cpus(void)
> +{
> + /* See __srcu_read_unlock() */
> + smp_mb(); /* C matches D */

In other words, if they see our decrement (presumably to aggregate zero,
as that is the only time it matters) they will also see our critical section.

> + this_cpu_dec(__cpuhp_refcount);
> +
> + /* Prod writer to recheck readers_active */
> + wake_up_all(&cpuhp_writer);
> }
> +EXPORT_SYMBOL_GPL(__put_online_cpus);
>
> +#define per_cpu_sum(var) \
> +({ \
> + typeof(var) __sum = 0; \
> + int cpu; \
> + for_each_possible_cpu(cpu) \
> + __sum += per_cpu(var, cpu); \
> + __sum; \
> +})
> +
> +/*
> + * See srcu_readers_active_idx_check()
> + */
> +static bool cpuhp_readers_active_check(void)
> {
> + unsigned int seq = per_cpu_sum(cpuhp_seq);
> +
> + smp_mb(); /* B matches A */

In other words, if we see __get_online_cpus() cpuhp_seq increment, we
are guaranteed to also see its __cpuhp_refcount increment.

> + if (per_cpu_sum(__cpuhp_refcount) != 0)
> + return false;
>
> + smp_mb(); /* D matches C */
>
> + return per_cpu_sum(cpuhp_seq) == seq;

On equality, we know that there could not be any "sneak path" pairs
where we see a decrement but not the corresponding increment for a
given reader. If we saw its decrement, the memory barriers guarantee
that we now see its cpuhp_seq increment.

> }
>
> /*
> * This ensures that the hotplug operation can begin only when the
> * refcount goes to zero.
> *
> * Since cpu_hotplug_begin() is always called after invoking
> * cpu_maps_update_begin(), we can be sure that only one writer is active.
> */
> void cpu_hotplug_begin(void)
> {
> + unsigned int count = 0;
> + int cpu;
>
> + lockdep_assert_held(&cpu_add_remove_lock);
> +
> + /* allow reader-in-writer recursion */
> + current->cpuhp_ref++;
> +
> + /* make readers take the slow path */
> + __cpuhp_writer = readers_slow;
> +
> + /* See percpu_down_write() */
> + synchronize_sched();

At this point, we know that all readers take the slow path.

> + /* make readers block */
> + __cpuhp_writer = readers_block;
> +
> + smp_mb(); /* E matches A */

If they don't see our write of readers_block to __cpuhp_writer, then
we are guaranteed to see their __cpuhp_refcount increment, and therefore
will wait for them.

> + /* Wait for all readers to go away */
> + wait_event(cpuhp_writer, cpuhp_readers_active_check());
> }
>
> void cpu_hotplug_done(void)
> {
> + /* Signal the writer is done, no fast path yet */
> + __cpuhp_writer = readers_slow;
> + wake_up_all(&cpuhp_readers);

OK, the wait_event()/wake_up_all() prevents the races where the
readers are delayed between fetching __cpuhp_writer and blocking.

> + /* See percpu_up_write() */
> + synchronize_sched();

At this point, readers no longer attempt to block.

You avoid falling into the usual acquire-release-mismatch trap by using
__cpuhp_refcount on both the fastpath and the slowpath, so that it is OK
to acquire on the fastpath and release on the slowpath (and vice versa).
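
A minimal timeline of such a mixed pairing, assuming a single reader that
does not migrate (illustration only, following the patch above):

	/* reader */				/* writer */
	get_online_cpus();
	  /* __cpuhp_state == readers_fast */
	  __this_cpu_inc(__cpuhp_refcount);
						cpu_hotplug_begin();
						  __cpuhp_state = readers_slow;
						  synchronize_sched();
						  __cpuhp_state = readers_block;
						  wait_event(cpuhp_writer, ...);
	put_online_cpus();
	  /* __cpuhp_state != readers_fast */
	  __put_online_cpus();
	    smp_mb();
	    this_cpu_dec(__cpuhp_refcount);
	    wake_up_all(&cpuhp_writer);
						  /* per_cpu_sum() balances,
						     cpuhp_readers_active_check()
						     now succeeds */

The increment (fast path) and the decrement (slow path) hit the same per-cpu
counter, which is why the writer's sum still balances.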

> + /* Let 'em rip */
> + __cpuhp_writer = readers_fast;
> + current->cpuhp_ref--;
> +
> + /*
> + * Wait for any pending readers to be running. This ensures readers
> + * after writer and avoids writers starving readers.
> + */
> + wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> }
>
> /*
> +++ b/kernel/sched/core.c
> @@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
> INIT_LIST_HEAD(&p->numa_entry);
> p->numa_group = NULL;
> #endif /* CONFIG_NUMA_BALANCING */
> +
> + cpu_hotplug_init_task(p);
> }
>
> #ifdef CONFIG_NUMA_BALANCING
>

2013-09-26 11:11:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Wed, Sep 25, 2013 at 02:22:00PM -0700, Paul E. McKenney wrote:
> A couple of nits and some commentary, but if there are races, they are
> quite subtle. ;-)

*whee*..

I made one little change in the logic; I moved the waitcount increment
to before the __put_online_cpus() call, such that the writer will have
to wait for us to wake up before trying again -- not for us to actually
have acquired the read lock, for that we'd need to mess up
__get_online_cpus() a bit more.
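
Concretely, the only ordering change in the reader slow path is (sketch,
comparing with the version posted earlier in this thread):

	/* previous version */			/* this version */
	__put_online_cpus();			atomic_inc(&cpuhp_waitcount);
	atomic_inc(&cpuhp_waitcount);		__put_online_cpus();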

Complete patch below.

---
Subject: hotplug: Optimize {get,put}_online_cpus()
From: Peter Zijlstra <[email protected]>
Date: Tue Sep 17 16:17:11 CEST 2013

The current implementation of get_online_cpus() is global in nature
and thus not suited for any kind of common usage.

Re-implement the current recursive r/w cpu hotplug lock such that the
read side locks are as light as possible.

The current cpu hotplug lock is entirely reader biased; but since
readers are expensive there aren't a lot of them about and writer
starvation isn't a particular problem.

However by making the reader side more usable there is a fair chance
it will get used more and thus the starvation issue becomes a real
possibility.

Therefore this new implementation is fair, alternating readers and
writers; this however requires per-task state to allow the reader
recursion.

Many comments were contributed by Paul McKenney, and many previous
attempts were shown to be inadequate by both Paul and Oleg; many
thanks to them for persisting in poking holes in my attempts.

Cc: Oleg Nesterov <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Steven Rostedt <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/cpu.h | 58 +++++++++++++
include/linux/sched.h | 3
kernel/cpu.c | 209 +++++++++++++++++++++++++++++++++++---------------
kernel/sched/core.c | 2
4 files changed, 208 insertions(+), 64 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -16,6 +16,7 @@
#include <linux/node.h>
#include <linux/compiler.h>
#include <linux/cpumask.h>
+#include <linux/percpu.h>

struct device;

@@ -173,10 +174,61 @@ extern struct bus_type cpu_subsys;
#ifdef CONFIG_HOTPLUG_CPU
/* Stop CPUs going up and down. */

+extern void cpu_hotplug_init_task(struct task_struct *p);
+
extern void cpu_hotplug_begin(void);
extern void cpu_hotplug_done(void);
-extern void get_online_cpus(void);
-extern void put_online_cpus(void);
+
+extern int __cpuhp_state;
+DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
+
+extern void __get_online_cpus(void);
+
+static inline void get_online_cpus(void)
+{
+ might_sleep();
+
+ /* Support reader recursion */
+ /* The value was >= 1 and remains so, reordering causes no harm. */
+ if (current->cpuhp_ref++)
+ return;
+
+ preempt_disable();
+ if (likely(!__cpuhp_state)) {
+ /* The barrier here is supplied by synchronize_sched(). */
+ __this_cpu_inc(__cpuhp_refcount);
+ } else {
+ __get_online_cpus(); /* Unconditional memory barrier. */
+ }
+ preempt_enable();
+ /*
+ * The barrier() from preempt_enable() prevents the compiler from
+ * bleeding the critical section out.
+ */
+}
+
+extern void __put_online_cpus(void);
+
+static inline void put_online_cpus(void)
+{
+ /* The value was >= 1 and remains so, reordering causes no harm. */
+ if (--current->cpuhp_ref)
+ return;
+
+ /*
+ * The barrier() in preempt_disable() prevents the compiler from
+ * bleeding the critical section out.
+ */
+ preempt_disable();
+ if (likely(!__cpuhp_state)) {
+ /* The barrier here is supplied by synchronize_sched(). */
+ __this_cpu_dec(__cpuhp_refcount);
+ } else {
+ __put_online_cpus(); /* Unconditional memory barrier. */
+ }
+ preempt_enable();
+}
+
extern void cpu_hotplug_disable(void);
extern void cpu_hotplug_enable(void);
#define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
@@ -200,6 +252,8 @@ static inline void cpu_hotplug_driver_un

#else /* CONFIG_HOTPLUG_CPU */

+static inline void cpu_hotplug_init_task(struct task_struct *p) {}
+
static inline void cpu_hotplug_begin(void) {}
static inline void cpu_hotplug_done(void) {}
#define get_online_cpus() do { } while (0)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1454,6 +1454,9 @@ struct task_struct {
unsigned int sequential_io;
unsigned int sequential_io_avg;
#endif
+#ifdef CONFIG_HOTPLUG_CPU
+ int cpuhp_ref;
+#endif
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -49,88 +49,173 @@ static int cpu_hotplug_disabled;

#ifdef CONFIG_HOTPLUG_CPU

-static struct {
- struct task_struct *active_writer;
- struct mutex lock; /* Synchronizes accesses to refcount, */
- /*
- * Also blocks the new readers during
- * an ongoing cpu hotplug operation.
- */
- int refcount;
-} cpu_hotplug = {
- .active_writer = NULL,
- .lock = __MUTEX_INITIALIZER(cpu_hotplug.lock),
- .refcount = 0,
-};
+enum { readers_fast = 0, readers_slow, readers_block };
+
+int __cpuhp_state;
+EXPORT_SYMBOL_GPL(__cpuhp_state);
+
+DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
+EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
+
+static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
+static atomic_t cpuhp_waitcount;
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
+static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
+
+void cpu_hotplug_init_task(struct task_struct *p)
+{
+ p->cpuhp_ref = 0;
+}
+
+void __get_online_cpus(void)
+{
+again:
+ /* See __srcu_read_lock() */
+ __this_cpu_inc(__cpuhp_refcount);
+ smp_mb(); /* A matches B, E */
+ __this_cpu_inc(cpuhp_seq);
+
+ if (unlikely(__cpuhp_state == readers_block)) {
+ /*
+ * Make sure an outgoing writer sees the waitcount to ensure
+ * we make progress.
+ */
+ atomic_inc(&cpuhp_waitcount);
+ __put_online_cpus();
+
+ /*
+ * We either call schedule() in the wait, or we'll fall through
+ * and reschedule on the preempt_enable() in get_online_cpus().
+ */
+ preempt_enable_no_resched();
+ __wait_event(cpuhp_readers, __cpuhp_state != readers_block);
+ preempt_disable();
+
+ if (atomic_dec_and_test(&cpuhp_waitcount))
+ wake_up_all(&cpuhp_writer);
+
+ goto again;
+ }
+}
+EXPORT_SYMBOL_GPL(__get_online_cpus);

-void get_online_cpus(void)
+void __put_online_cpus(void)
{
- might_sleep();
- if (cpu_hotplug.active_writer == current)
- return;
- mutex_lock(&cpu_hotplug.lock);
- cpu_hotplug.refcount++;
- mutex_unlock(&cpu_hotplug.lock);
+ /* See __srcu_read_unlock() */
+ smp_mb(); /* C matches D */
+ /*
+ * In other words, if they see our decrement (presumably to aggregate
+ * zero, as that is the only time it matters) they will also see our
+ * critical section.
+ */
+ this_cpu_dec(__cpuhp_refcount);

+ /* Prod writer to recheck readers_active */
+ wake_up_all(&cpuhp_writer);
}
-EXPORT_SYMBOL_GPL(get_online_cpus);
+EXPORT_SYMBOL_GPL(__put_online_cpus);
+
+#define per_cpu_sum(var) \
+({ \
+ typeof(var) __sum = 0; \
+ int cpu; \
+ for_each_possible_cpu(cpu) \
+ __sum += per_cpu(var, cpu); \
+ __sum; \
+})

-void put_online_cpus(void)
+/*
+ * See srcu_readers_active_idx_check() for a rather more detailed explanation.
+ */
+static bool cpuhp_readers_active_check(void)
{
- if (cpu_hotplug.active_writer == current)
- return;
- mutex_lock(&cpu_hotplug.lock);
+ unsigned int seq = per_cpu_sum(cpuhp_seq);
+
+ smp_mb(); /* B matches A */
+
+ /*
+ * In other words, if we see __get_online_cpus() cpuhp_seq increment,
+ * we are guaranteed to also see its __cpuhp_refcount increment.
+ */

- if (WARN_ON(!cpu_hotplug.refcount))
- cpu_hotplug.refcount++; /* try to fix things up */
+ if (per_cpu_sum(__cpuhp_refcount) != 0)
+ return false;

- if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
- wake_up_process(cpu_hotplug.active_writer);
- mutex_unlock(&cpu_hotplug.lock);
+ smp_mb(); /* D matches C */

+ /*
+ * On equality, we know that there could not be any "sneak path" pairs
+ * where we see a decrement but not the corresponding increment for a
+ * given reader. If we saw its decrement, the memory barriers guarantee
+ * that we now see its cpuhp_seq increment.
+ */
+
+ return per_cpu_sum(cpuhp_seq) == seq;
}
-EXPORT_SYMBOL_GPL(put_online_cpus);

/*
- * This ensures that the hotplug operation can begin only when the
- * refcount goes to zero.
- *
- * Note that during a cpu-hotplug operation, the new readers, if any,
- * will be blocked by the cpu_hotplug.lock
- *
- * Since cpu_hotplug_begin() is always called after invoking
- * cpu_maps_update_begin(), we can be sure that only one writer is active.
- *
- * Note that theoretically, there is a possibility of a livelock:
- * - Refcount goes to zero, last reader wakes up the sleeping
- * writer.
- * - Last reader unlocks the cpu_hotplug.lock.
- * - A new reader arrives at this moment, bumps up the refcount.
- * - The writer acquires the cpu_hotplug.lock finds the refcount
- * non zero and goes to sleep again.
- *
- * However, this is very difficult to achieve in practice since
- * get_online_cpus() not an api which is called all that often.
- *
+ * This will notify new readers to block and wait for all active readers to
+ * complete.
*/
void cpu_hotplug_begin(void)
{
- cpu_hotplug.active_writer = current;
+ /*
+ * Since cpu_hotplug_begin() is always called after invoking
+ * cpu_maps_update_begin(), we can be sure that only one writer is
+ * active.
+ */
+ lockdep_assert_held(&cpu_add_remove_lock);

- for (;;) {
- mutex_lock(&cpu_hotplug.lock);
- if (likely(!cpu_hotplug.refcount))
- break;
- __set_current_state(TASK_UNINTERRUPTIBLE);
- mutex_unlock(&cpu_hotplug.lock);
- schedule();
- }
+ /* Allow reader-in-writer recursion. */
+ current->cpuhp_ref++;
+
+ /* Notify readers to take the slow path. */
+ __cpuhp_state = readers_slow;
+
+ /* See percpu_down_write(); guarantees all readers take the slow path */
+ synchronize_sched();
+
+ /*
+ * Notify new readers to block; up until now, and thus throughout the
+ * longish synchronize_sched() above, new readers could still come in.
+ */
+ __cpuhp_state = readers_block;
+
+ smp_mb(); /* E matches A */
+
+ /*
+ * If they don't see our write of readers_block to __cpuhp_state,
+ * then we are guaranteed to see their __cpuhp_refcount increment, and
+ * therefore will wait for them.
+ */
+
+ /* Wait for all now active readers to complete. */
+ wait_event(cpuhp_writer, cpuhp_readers_active_check());
}

void cpu_hotplug_done(void)
{
- cpu_hotplug.active_writer = NULL;
- mutex_unlock(&cpu_hotplug.lock);
+ /* Signal the writer is done, no fast path yet. */
+ __cpuhp_state = readers_slow;
+ wake_up_all(&cpuhp_readers);
+
+ /*
+ * The wait_event()/wake_up_all() prevents the race where the readers
+ * are delayed between fetching __cpuhp_state and blocking.
+ */
+
+ /* See percpu_up_write(); readers will no longer attempt to block. */
+ synchronize_sched();
+
+ /* Let 'em rip */
+ __cpuhp_state = readers_fast;
+ current->cpuhp_ref--;
+
+ /*
+ * Wait for any pending readers to be running. This ensures readers
+ * after writer and avoids writers starving readers.
+ */
+ wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
}

/*
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
INIT_LIST_HEAD(&p->numa_entry);
p->numa_group = NULL;
#endif /* CONFIG_NUMA_BALANCING */
+
+ cpu_hotplug_init_task(p);
}

#ifdef CONFIG_NUMA_BALANCING
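
For reference, reader-side usage is unchanged from the mutex-based
implementation; only its cost changes. A minimal sketch (do_something_with()
is a placeholder):

	int cpu;

	get_online_cpus();
	for_each_online_cpu(cpu)
		do_something_with(cpu);	/* no CPU can go away here */
	put_online_cpus();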

2013-09-26 16:13:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Thu, Sep 26, 2013 at 05:53:21PM +0200, Oleg Nesterov wrote:
> On 09/26, Peter Zijlstra wrote:
> > void cpu_hotplug_done(void)
> > {
> > - cpu_hotplug.active_writer = NULL;
> > - mutex_unlock(&cpu_hotplug.lock);
> > + /* Signal the writer is done, no fast path yet. */
> > + __cpuhp_state = readers_slow;
> > + wake_up_all(&cpuhp_readers);
> > +
> > + /*
> > + * The wait_event()/wake_up_all() prevents the race where the readers
> > + * are delayed between fetching __cpuhp_state and blocking.
> > + */
> > +
> > + /* See percpu_up_write(); readers will no longer attempt to block. */
> > + synchronize_sched();
>
> Shouldn't you move wake_up_all(&cpuhp_readers) down after
> synchronize_sched() (or add another one) ? To ensure that a reader can't
> see state = BLOCK after wakeup().

Well, if they are blocked, the wake_up_all() will do an actual
try_to_wake_up() which issues a MB as per smp_mb__before_spinlock().

The woken task will get a MB from passing through the context switch to
make it actually run. And therefore; like Paul's comment says; it cannot
observe the previous BLOCK state but must indeed see the just issued
SLOW state.

Right?

2013-09-26 16:21:39

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/26, Peter Zijlstra wrote:
>
> On Thu, Sep 26, 2013 at 05:53:21PM +0200, Oleg Nesterov wrote:
> > On 09/26, Peter Zijlstra wrote:
> > > void cpu_hotplug_done(void)
> > > {
> > > - cpu_hotplug.active_writer = NULL;
> > > - mutex_unlock(&cpu_hotplug.lock);
> > > + /* Signal the writer is done, no fast path yet. */
> > > + __cpuhp_state = readers_slow;
> > > + wake_up_all(&cpuhp_readers);
> > > +
> > > + /*
> > > + * The wait_event()/wake_up_all() prevents the race where the readers
> > > + * are delayed between fetching __cpuhp_state and blocking.
> > > + */
> > > +
> > > + /* See percpu_up_write(); readers will no longer attempt to block. */
> > > + synchronize_sched();
> >
> > Shouldn't you move wake_up_all(&cpuhp_readers) down after
> > synchronize_sched() (or add another one) ? To ensure that a reader can't
> > see state = BLOCK after wakeup().
>
> Well, if they are blocked, the wake_up_all() will do an actual
> try_to_wake_up() which issues a MB as per smp_mb__before_spinlock().

Yes. Everything is fine with the already blocked readers.

I meant the new reader which can still see state = BLOCK after we
do wakeup(), but I didn't notice that it does __wait_event(), which
takes the lock unconditionally; it must see the change after that.

> Right?

Yes, I was wrong, thanks.

Oleg.

2013-09-26 16:40:36

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Thu, Sep 26, 2013 at 06:14:26PM +0200, Oleg Nesterov wrote:
> On 09/26, Peter Zijlstra wrote:
> >
> > On Thu, Sep 26, 2013 at 05:53:21PM +0200, Oleg Nesterov wrote:
> > > On 09/26, Peter Zijlstra wrote:
> > > > void cpu_hotplug_done(void)
> > > > {
> > > > - cpu_hotplug.active_writer = NULL;
> > > > - mutex_unlock(&cpu_hotplug.lock);
> > > > + /* Signal the writer is done, no fast path yet. */
> > > > + __cpuhp_state = readers_slow;
> > > > + wake_up_all(&cpuhp_readers);
> > > > +
> > > > + /*
> > > > + * The wait_event()/wake_up_all() prevents the race where the readers
> > > > + * are delayed between fetching __cpuhp_state and blocking.
> > > > + */
> > > > +
> > > > + /* See percpu_up_write(); readers will no longer attempt to block. */
> > > > + synchronize_sched();
> > >
> > > Shouldn't you move wake_up_all(&cpuhp_readers) down after
> > > synchronize_sched() (or add another one) ? To ensure that a reader can't
> > > see state = BLOCK after wakeup().
> >
> > Well, if they are blocked, the wake_up_all() will do an actual
> > try_to_wake_up() which issues a MB as per smp_mb__before_spinlock().
>
> Yes. Everything is fine with the already blocked readers.
>
> I meant the new reader which still can see state = BLOCK after we
> do wakeup(), but I didn't notice it should do __wait_event() which
> takes the lock unconditionally, it must see the change after that.

Ah, because both __wake_up() and __wait_event()->prepare_to_wait() take
q->lock. Thereby matching the __wake_up() RELEASE to the __wait_event()
ACQUIRE, creating the full barrier.
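
A rough picture of that pairing (illustration only, with the waitqueue
internals heavily simplified):

	/* writer, cpu_hotplug_done() */	/* new reader, in __wait_event() */
	__cpuhp_state = readers_slow;
	wake_up_all(&cpuhp_readers);
	  spin_lock(&q->lock);
	  ...
	  spin_unlock(&q->lock);   /* RELEASE */
						prepare_to_wait(&cpuhp_readers, ...);
						  spin_lock(&q->lock);  /* ACQUIRE */
						  ...
						  spin_unlock(&q->lock);
						if (__cpuhp_state != readers_block)
							break;  /* sees readers_slow */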

2013-09-26 17:05:57

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

Peter,

Sorry. Unlikely I will be able to read this patch today. So let me
ask another potentially wrong question without any thinking.

On 09/26, Peter Zijlstra wrote:
>
> +void __get_online_cpus(void)
> +{
> +again:
> + /* See __srcu_read_lock() */
> + __this_cpu_inc(__cpuhp_refcount);
> + smp_mb(); /* A matches B, E */
> + __this_cpu_inc(cpuhp_seq);
> +
> + if (unlikely(__cpuhp_state == readers_block)) {

OK. Either we should see state = BLOCK or the writer should notice the
change in __cpuhp_refcount/seq. (although I'd like to recheck this
cpuhp_seq logic ;)

> + atomic_inc(&cpuhp_waitcount);
> + __put_online_cpus();

OK, this does wake(cpuhp_writer).

> void cpu_hotplug_begin(void)
> {
> ...
> + /*
> + * Notify new readers to block; up until now, and thus throughout the
> + * longish synchronize_sched() above, new readers could still come in.
> + */
> + __cpuhp_state = readers_block;
> +
> + smp_mb(); /* E matches A */
> +
> + /*
> + * If they don't see our writer of readers_block to __cpuhp_state,
> + * then we are guaranteed to see their __cpuhp_refcount increment, and
> + * therefore will wait for them.
> + */
> +
> + /* Wait for all now active readers to complete. */
> + wait_event(cpuhp_writer, cpuhp_readers_active_check());

But. doesn't this mean that we need __wait_event() here as well?

Isn't it possible that the reader sees BLOCK but the writer does _not_
see the change in __cpuhp_refcount/cpuhp_seq? Those mb's guarantee
"either", not "both".

Don't we need to ensure that we can't check cpuhp_readers_active_check()
after wake(cpuhp_writer) was already called by the reader and before we
take the same lock?

Oleg.

2013-09-26 17:50:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Thu, Sep 26, 2013 at 06:58:40PM +0200, Oleg Nesterov wrote:
> Peter,
>
> Sorry. Unlikely I will be able to read this patch today. So let me
> ask another potentially wrong question without any thinking.
>
> On 09/26, Peter Zijlstra wrote:
> >
> > +void __get_online_cpus(void)
> > +{
> > +again:
> > + /* See __srcu_read_lock() */
> > + __this_cpu_inc(__cpuhp_refcount);
> > + smp_mb(); /* A matches B, E */
> > + __this_cpu_inc(cpuhp_seq);
> > +
> > + if (unlikely(__cpuhp_state == readers_block)) {
>
> OK. Either we should see state = BLOCK or the writer should notice the
> change in __cpuhp_refcount/seq. (altough I'd like to recheck this
> cpuhp_seq logic ;)
>
> > + atomic_inc(&cpuhp_waitcount);
> > + __put_online_cpus();
>
> OK, this does wake(cpuhp_writer).
>
> > void cpu_hotplug_begin(void)
> > {
> > ...
> > + /*
> > + * Notify new readers to block; up until now, and thus throughout the
> > + * longish synchronize_sched() above, new readers could still come in.
> > + */
> > + __cpuhp_state = readers_block;
> > +
> > + smp_mb(); /* E matches A */
> > +
> > + /*
> > + * If they don't see our writer of readers_block to __cpuhp_state,
> > + * then we are guaranteed to see their __cpuhp_refcount increment, and
> > + * therefore will wait for them.
> > + */
> > +
> > + /* Wait for all now active readers to complete. */
> > + wait_event(cpuhp_writer, cpuhp_readers_active_check());
>
> But. doesn't this mean that we need __wait_event() here as well?
>
> Isn't it possible that the reader sees BLOCK but the writer does _not_
> see the change in __cpuhp_refcount/cpuhp_seq? Those mb's guarantee
> "either", not "both".

But if the reader does see BLOCK it will not be an active reader any
more; and thus the writer doesn't need to observe and wait for it.

> Don't we need to ensure that we can't check cpuhp_readers_active_check()
> after wake(cpuhp_writer) was already called by the reader and before we
> take the same lock?

I'm too tired to fully grasp what you're asking here; but given the
previous answer I think not.

2013-09-27 18:22:50

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/26, Peter Zijlstra wrote:
>
> But if the readers does see BLOCK it will not be an active reader no
> more; and thus the writer doesn't need to observe and wait for it.

I meant they both can block, but please ignore. Today I simply can't
understand what I was thinking about yesterday.


I tried hard to find any hole in this version but failed, I believe it
is correct.

But, could you help me to understand some details?

> +void __get_online_cpus(void)
> +{
> +again:
> + /* See __srcu_read_lock() */
> + __this_cpu_inc(__cpuhp_refcount);
> + smp_mb(); /* A matches B, E */
> + __this_cpu_inc(cpuhp_seq);
> +
> + if (unlikely(__cpuhp_state == readers_block)) {

Note that there is no barrier() after inc(seq) and __cpuhp_state
check, this inc() can be "postponed" till ...

> +void __put_online_cpus(void)
> {
> - might_sleep();
> - if (cpu_hotplug.active_writer == current)
> - return;
> - mutex_lock(&cpu_hotplug.lock);
> - cpu_hotplug.refcount++;
> - mutex_unlock(&cpu_hotplug.lock);
> + /* See __srcu_read_unlock() */
> + smp_mb(); /* C matches D */

... this mb() in __put_online_cpus().

And this is fine! The question is, perhaps it would be more "natural"
and understandable to shift this_cpu_inc(cpuhp_seq) into
__put_online_cpus().

We need to ensure 2 things:

1. The reader should notice state = BLOCK or the writer should see
inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
__get_online_cpus() and in cpu_hotplug_begin().

We do not care if the writer misses some inc(__cpuhp_refcount)
in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
state = readers_block (and inc(cpuhp_seq) can't help anyway).

2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
from __put_online_cpus() (note that the writer can miss the
corresponding inc() if it was done on another CPU, so this dec()
can lead to sum() == 0), it should also notice the change in cpuhp_seq.

Fortunately, this can only happen if the reader migrates, in
this case schedule() provides a barrier, the writer can't miss
the change in cpuhp_seq.

IOW. Unless I missed something, cpuhp_seq is actually needed to
serialize __put_online_cpus()->this_cpu_dec(__cpuhp_refcount) and
/* D matches C */ in cpuhp_readers_active_check(), and this
is not immediately clear if you look at __get_online_cpus().

I do not suggest to change this code, but please tell me if my
understanding is not correct.
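
(A side note for readers following the thread: the summation being discussed,
written out as an explicit loop. cpuhp_refcount_sum() is an illustrative name
only; the patch uses the per_cpu_sum() helper for this.)

/*
 * Illustrative sketch only -- not part of the patch.  The writer sums
 * the per-cpu reader counts; per point 2 above, a dec() done on CPU Y
 * for an inc() done on CPU X can drive this sum to zero even though
 * the writer never observed the inc() itself, which is the case that
 * cpuhp_seq is meant to let the writer detect.
 */
static unsigned int cpuhp_refcount_sum(void)
{
	unsigned int sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += per_cpu(__cpuhp_refcount, cpu);

	return sum;
}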

> +static bool cpuhp_readers_active_check(void)
> {
> - if (cpu_hotplug.active_writer == current)
> - return;
> - mutex_lock(&cpu_hotplug.lock);
> + unsigned int seq = per_cpu_sum(cpuhp_seq);
> +
> + smp_mb(); /* B matches A */
> +
> + /*
> + * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> + * we are guaranteed to also see its __cpuhp_refcount increment.
> + */
>
> - if (WARN_ON(!cpu_hotplug.refcount))
> - cpu_hotplug.refcount++; /* try to fix things up */
> + if (per_cpu_sum(__cpuhp_refcount) != 0)
> + return false;
>
> - if (!--cpu_hotplug.refcount && unlikely(cpu_hotplug.active_writer))
> - wake_up_process(cpu_hotplug.active_writer);
> - mutex_unlock(&cpu_hotplug.lock);
> + smp_mb(); /* D matches C */

It seems that both barriers could be smp_rmb()? I am not sure the comments
from srcu_readers_active_idx_check() can explain mb(), note that
__srcu_read_lock() always succeeds unlike get_online_cpus().

> void cpu_hotplug_done(void)
> {
> - cpu_hotplug.active_writer = NULL;
> - mutex_unlock(&cpu_hotplug.lock);
> + /* Signal the writer is done, no fast path yet. */
> + __cpuhp_state = readers_slow;
> + wake_up_all(&cpuhp_readers);
> +
> + /*
> + * The wait_event()/wake_up_all() prevents the race where the readers
> + * are delayed between fetching __cpuhp_state and blocking.
> + */
> +
> + /* See percpu_up_write(); readers will no longer attempt to block. */
> + synchronize_sched();
> +
> + /* Let 'em rip */
> + __cpuhp_state = readers_fast;
> + current->cpuhp_ref--;
> +
> + /*
> + * Wait for any pending readers to be running. This ensures readers
> + * after writer and avoids writers starving readers.
> + */
> + wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> }

OK, to some degree I can understand "avoids writers starving readers"
part (although the next writer should do synchronize_sched() first),
but could you explain "ensures readers after writer" ?

Oleg.

2013-09-27 20:41:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> On 09/26, Peter Zijlstra wrote:
> >
> > But if the readers does see BLOCK it will not be an active reader no
> > more; and thus the writer doesn't need to observe and wait for it.
>
> I meant they both can block, but please ignore. Today I simply can't
> understand what I was thinking about yesterday.

I think we all know that state all too well ;-)

> I tried hard to find any hole in this version but failed, I believe it
> is correct.

Yay!

> But, could you help me to understand some details?

I'll try, but I'm not too bright atm myself :-)

> > +void __get_online_cpus(void)
> > +{
> > +again:
> > + /* See __srcu_read_lock() */
> > + __this_cpu_inc(__cpuhp_refcount);
> > + smp_mb(); /* A matches B, E */
> > + __this_cpu_inc(cpuhp_seq);
> > +
> > + if (unlikely(__cpuhp_state == readers_block)) {
>
> Note that there is no barrier() after inc(seq) and __cpuhp_state
> check, this inc() can be "postponed" till ...
>
> > +void __put_online_cpus(void)
> > {
> > + /* See __srcu_read_unlock() */
> > + smp_mb(); /* C matches D */
>
> ... this mb() in __put_online_cpus().
>
> And this is fine! The question is, perhaps it would be more "natural"
> and understandable to shift this_cpu_inc(cpuhp_seq) into
> __put_online_cpus().

Possibly; I never got further than that the required order is:

ref++
MB
seq++
MB
ref--

It doesn't matter if the seq++ is in the lock or unlock primitive. I
never considered one place more natural than the other.
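
(For concreteness, a sketch of the two placements being compared;
reader_lock()/reader_unlock() are made-up names standing in for the fast
paths of __get_online_cpus()/__put_online_cpus(), with the state check and
slow path omitted.)

/* (a) seq++ in the lock primitive, as in the patch */
static void reader_lock(void)
{
	__this_cpu_inc(__cpuhp_refcount);	/* ref++ */
	smp_mb();				/* MB, the "A" barrier */
	__this_cpu_inc(cpuhp_seq);		/* seq++ */
	/* ... __cpuhp_state check and slow path omitted ... */
}

static void reader_unlock(void)
{
	smp_mb();				/* MB, the "C" barrier */
	__this_cpu_dec(__cpuhp_refcount);	/* ref-- */
}

/* (b) the alternative: seq++ moved into the unlock primitive */
static void reader_unlock_alt(void)
{
	__this_cpu_inc(cpuhp_seq);		/* seq++ */
	smp_mb();				/* MB, the "C" barrier */
	__this_cpu_dec(__cpuhp_refcount);	/* ref-- */
}

Either way the ref++ / MB / seq++ / MB / ref-- order above is preserved: in
(b) the barrier separating ref++ from seq++ is the smp_mb() already present
in the lock path.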

> We need to ensure 2 things:
>
> 1. The reader should notice state = BLOCK or the writer should see
> inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
> __get_online_cpus() and in cpu_hotplug_begin().
>
> We do not care if the writer misses some inc(__cpuhp_refcount)
> in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
> state = readers_block (and inc(cpuhp_seq) can't help anyway).

Agreed.

> 2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
> from __put_online_cpus() (note that the writer can miss the
> corresponding inc() if it was done on another CPU, so this dec()
> can lead to sum() == 0), it should also notice the change in cpuhp_seq.
>
> Fortunately, this can only happen if the reader migrates, in
> this case schedule() provides a barrier, the writer can't miss
> the change in cpuhp_seq.

Again, agreed; this is also the message of the second comment in
cpuhp_readers_active_check() by Paul.

> IOW. Unless I missed something, cpuhp_seq is actually needed to
> serialize __put_online_cpus()->this_cpu_dec(__cpuhp_refcount) and
> and /* D matches C */ in cpuhp_readers_active_check(), and this
> is not immediately clear if you look at __get_online_cpus().
>
> I do not suggest to change this code, but please tell me if my
> understanding is not correct.

I think you're entirely right.

> > +static bool cpuhp_readers_active_check(void)
> > {
> > + unsigned int seq = per_cpu_sum(cpuhp_seq);
> > +
> > + smp_mb(); /* B matches A */
> > +
> > + /*
> > + * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > + * we are guaranteed to also see its __cpuhp_refcount increment.
> > + */
> >
> > + if (per_cpu_sum(__cpuhp_refcount) != 0)
> > + return false;
> >
> > + smp_mb(); /* D matches C */
>
> It seems that both barriers could be smp_rmb()? I am not sure the comments
> from srcu_readers_active_idx_check() can explain mb(), note that
> __srcu_read_lock() always succeeds unlike get_online_cpus().

I see what you mean; cpuhp_readers_active_check() is all purely reads;
there are no writes to order.

Paul; is there any argument for the MB here as opposed to RMB; and if
not should we change both these and SRCU?

> > void cpu_hotplug_done(void)
> > {
> > + /* Signal the writer is done, no fast path yet. */
> > + __cpuhp_state = readers_slow;
> > + wake_up_all(&cpuhp_readers);
> > +
> > + /*
> > + * The wait_event()/wake_up_all() prevents the race where the readers
> > + * are delayed between fetching __cpuhp_state and blocking.
> > + */
> > +
> > + /* See percpu_up_write(); readers will no longer attempt to block. */
> > + synchronize_sched();
> > +
> > + /* Let 'em rip */
> > + __cpuhp_state = readers_fast;
> > + current->cpuhp_ref--;
> > +
> > + /*
> > + * Wait for any pending readers to be running. This ensures readers
> > + * after writer and avoids writers starving readers.
> > + */
> > + wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > }
>
> OK, to some degree I can understand "avoids writers starving readers"
> part (although the next writer should do synchronize_sched() first),
> but could you explain "ensures readers after writer" ?

Suppose reader A sees state == BLOCK and goes to sleep; our writer B
does cpu_hotplug_done() and wakes all pending readers. If for some
reason A doesn't schedule to inc ref until B again executes
cpu_hotplug_begin() and state is once again BLOCK, A will not have made
any progress.

The waitcount increment before __put_online_cpus() ensures
cpu_hotplug_done() sees the !0 waitcount and waits until our reader runs
far enough to at least pass the dec_and_test().

And once past the dec_and_test() preemption is disabled and the
sched_sync() in a new cpu_hotplug_begin() will suffice to guarantee
we'll have acquired a reference and are an active reader.

2013-09-28 12:56:13

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/27, Peter Zijlstra wrote:
>
> On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
>
> > > +static bool cpuhp_readers_active_check(void)
> > > {
> > > + unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > +
> > > + smp_mb(); /* B matches A */
> > > +
> > > + /*
> > > + * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > + * we are guaranteed to also see its __cpuhp_refcount increment.
> > > + */
> > >
> > > + if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > + return false;
> > >
> > > + smp_mb(); /* D matches C */
> >
> > It seems that both barriers could be smp_rmb()? I am not sure the comments
> > from srcu_readers_active_idx_check() can explain mb(),

To avoid the confusion, I meant "those comments can't explain mb()s here,
in cpuhp_readers_active_check()".

> > note that
> > __srcu_read_lock() always succeeds unlike get_online_cpus().

And this is where cpu_hotplug_ and synchronize_srcu() differ, see below.

> I see what you mean; cpuhp_readers_active_check() is all purely reads;
> there are no writes to order.
>
> Paul; is there any argument for the MB here as opposed to RMB;

Yes, Paul, please ;)

> and if
> not should we change both these and SRCU?

I guess that SRCU is more "complex" in this respect. IIUC,
cpuhp_readers_active_check() needs "more" barriers because if
synchronize_srcu() succeeds it needs to synchronize with the new readers
which call srcu_read_lock/unlock() "right now". Again, unlike cpu-hotplug
srcu never blocks the readers, srcu_read_*() always succeeds.



Hmm. I am wondering why __srcu_read_lock() needs ACCESS_ONCE() to increment
->c and ->seq. A plain this_cpu_inc() should be fine?

And since it disables preemption, why can't it use __this_cpu_inc() to inc
->c[idx]. OK, in general __this_cpu_inc() is not irq-safe (rmw) so we can't
do __this_cpu_inc(seq[idx]), c[idx] should be fine? If irq does srcu_read_lock()
it should also do _unlock.

But this is minor/offtopic.

> > > void cpu_hotplug_done(void)
> > > {
...
> > > + /*
> > > + * Wait for any pending readers to be running. This ensures readers
> > > + * after writer and avoids writers starving readers.
> > > + */
> > > + wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > > }
> >
> > OK, to some degree I can understand "avoids writers starving readers"
> > part (although the next writer should do synchronize_sched() first),
> > but could you explain "ensures readers after writer" ?
>
> Suppose reader A sees state == BLOCK and goes to sleep; our writer B
> does cpu_hotplug_done() and wakes all pending readers. If for some
> reason A doesn't schedule to inc ref until B again executes
> cpu_hotplug_begin() and state is once again BLOCK, A will not have made
> any progress.

Yes, yes, thanks, this is clear. But this explains "writers starving readers".
And let me repeat, if B again executes cpu_hotplug_begin() it will do
another synchronize_sched() before it sets BLOCK, so I am not sure we
need this "in practice".

I was confused by "ensures readers after writer", I thought this means
we need the additional synchronization with the readers which are going
to increment cpuhp_waitcount, say, some sort of barriers.

Please note that this wait_event() adds a problem... it doesn't allow
to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
in this case. We can solve this, but this wait_event() complicates
the problem.

Oleg.

2013-09-28 14:48:00

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
> > > > void cpu_hotplug_done(void)
> > > > {
> ...
> > > > + /*
> > > > + * Wait for any pending readers to be running. This ensures readers
> > > > + * after writer and avoids writers starving readers.
> > > > + */
> > > > + wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > > > }
> > >
> > > OK, to some degree I can understand "avoids writers starving readers"
> > > part (although the next writer should do synchronize_sched() first),
> > > but could you explain "ensures readers after writer" ?
> >
> > Suppose reader A sees state == BLOCK and goes to sleep; our writer B
> > does cpu_hotplug_done() and wakes all pending readers. If for some
> > reason A doesn't schedule to inc ref until B again executes
> > cpu_hotplug_begin() and state is once again BLOCK, A will not have made
> > any progress.
>
> Yes, yes, thanks, this is clear. But this explains "writers starving readers".
> And let me repeat, if B again executes cpu_hotplug_begin() it will do
> another synchronize_sched() before it sets BLOCK, so I am not sure we
> need this "in practice".
>
> I was confused by "ensures readers after writer", I thought this means
> we need the additional synchronization with the readers which are going
> to increment cpuhp_waitcount, say, some sort of barries.

Ah no; I just wanted to guarantee that any pending readers did get a
chance to run. And yes due to the two sync_sched() calls it seems
somewhat unlikely in practice.

> Please note that this wait_event() adds a problem... it doesn't allow
> to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> in this case. We can solve this, but this wait_event() complicates
> the problem.

That seems like a particularly easy fix; something like so?

---
include/linux/cpu.h | 1
kernel/cpu.c | 84 ++++++++++++++++++++++++++++++++++------------------
2 files changed, 56 insertions(+), 29 deletions(-)

--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -109,6 +109,7 @@ enum {
#define CPU_DOWN_FAILED_FROZEN (CPU_DOWN_FAILED | CPU_TASKS_FROZEN)
#define CPU_DEAD_FROZEN (CPU_DEAD | CPU_TASKS_FROZEN)
#define CPU_DYING_FROZEN (CPU_DYING | CPU_TASKS_FROZEN)
+#define CPU_POST_DEAD_FROZEN (CPU_POST_DEAD | CPU_TASKS_FROZEN)
#define CPU_STARTING_FROZEN (CPU_STARTING | CPU_TASKS_FROZEN)


--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -364,8 +364,7 @@ static int __ref take_cpu_down(void *_pa
return 0;
}

-/* Requires cpu_add_remove_lock to be held */
-static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
+static int __ref __cpu_down(unsigned int cpu, int tasks_frozen)
{
int err, nr_calls = 0;
void *hcpu = (void *)(long)cpu;
@@ -375,21 +374,13 @@ static int __ref _cpu_down(unsigned int
.hcpu = hcpu,
};

- if (num_online_cpus() == 1)
- return -EBUSY;
-
- if (!cpu_online(cpu))
- return -EINVAL;
-
- cpu_hotplug_begin();
-
err = __cpu_notify(CPU_DOWN_PREPARE | mod, hcpu, -1, &nr_calls);
if (err) {
nr_calls--;
__cpu_notify(CPU_DOWN_FAILED | mod, hcpu, nr_calls, NULL);
printk("%s: attempt to take down CPU %u failed\n",
__func__, cpu);
- goto out_release;
+ return err;
}
smpboot_park_threads(cpu);

@@ -398,7 +389,7 @@ static int __ref _cpu_down(unsigned int
/* CPU didn't die: tell everyone. Can't complain. */
smpboot_unpark_threads(cpu);
cpu_notify_nofail(CPU_DOWN_FAILED | mod, hcpu);
- goto out_release;
+ return err;
}
BUG_ON(cpu_online(cpu));

@@ -420,10 +411,27 @@ static int __ref _cpu_down(unsigned int

check_for_tasks(cpu);

-out_release:
+ return err;
+}
+
+/* Requires cpu_add_remove_lock to be held */
+static int __ref _cpu_down(unsigned int cpu, int tasks_frozen)
+{
+ unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
+ int err;
+
+ if (num_online_cpus() == 1)
+ return -EBUSY;
+
+ if (!cpu_online(cpu))
+ return -EINVAL;
+
+ cpu_hotplug_begin();
+ err = __cpu_down(cpu, tasks_frozen);
cpu_hotplug_done();
+
if (!err)
- cpu_notify_nofail(CPU_POST_DEAD | mod, hcpu);
+ cpu_notify_nofail(CPU_POST_DEAD | mod, (void *)(long)cpu);
return err;
}

@@ -447,30 +455,22 @@ int __ref cpu_down(unsigned int cpu)
EXPORT_SYMBOL(cpu_down);
#endif /*CONFIG_HOTPLUG_CPU*/

-/* Requires cpu_add_remove_lock to be held */
-static int _cpu_up(unsigned int cpu, int tasks_frozen)
+static int ___cpu_up(unsigned int cpu, int tasks_frozen)
{
int ret, nr_calls = 0;
void *hcpu = (void *)(long)cpu;
unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
struct task_struct *idle;

- cpu_hotplug_begin();
-
- if (cpu_online(cpu) || !cpu_present(cpu)) {
- ret = -EINVAL;
- goto out;
- }
-
idle = idle_thread_get(cpu);
if (IS_ERR(idle)) {
ret = PTR_ERR(idle);
- goto out;
+ return ret;
}

ret = smpboot_create_threads(cpu);
if (ret)
- goto out;
+ return ret;

ret = __cpu_notify(CPU_UP_PREPARE | mod, hcpu, -1, &nr_calls);
if (ret) {
@@ -492,9 +492,24 @@ static int _cpu_up(unsigned int cpu, int
/* Now call notifier in preparation. */
cpu_notify(CPU_ONLINE | mod, hcpu);

+ return 0;
+
out_notify:
- if (ret != 0)
- __cpu_notify(CPU_UP_CANCELED | mod, hcpu, nr_calls, NULL);
+ __cpu_notify(CPU_UP_CANCELED | mod, hcpu, nr_calls, NULL);
+ return ret;
+}
+
+/* Requires cpu_add_remove_lock to be held */
+static int _cpu_up(unsigned int cpu, int tasks_frozen)
+{
+ cpu_hotplug_begin();
+
+ if (cpu_online(cpu) || !cpu_present(cpu)) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = ___cpu_up(cpu, tasks_frozen);
out:
cpu_hotplug_done();

@@ -572,11 +587,13 @@ int disable_nonboot_cpus(void)
*/
cpumask_clear(frozen_cpus);

+ cpu_hotplug_begin();
+
printk("Disabling non-boot CPUs ...\n");
for_each_online_cpu(cpu) {
if (cpu == first_cpu)
continue;
- error = _cpu_down(cpu, 1);
+ error = __cpu_down(cpu, 1);
if (!error)
cpumask_set_cpu(cpu, frozen_cpus);
else {
@@ -586,6 +603,11 @@ int disable_nonboot_cpus(void)
}
}

+ cpu_hotplug_done();
+
+ for_each_cpu(cpu, frozen_cpus)
+ cpu_notify_nofail(CPU_POST_DEAD_FROZEN, (void*)(long)cpu);
+
if (!error) {
BUG_ON(num_online_cpus() > 1);
/* Make sure the CPUs won't be enabled by someone else */
@@ -619,8 +641,10 @@ void __ref enable_nonboot_cpus(void)

arch_enable_nonboot_cpus_begin();

+ cpu_hotplug_begin();
+
for_each_cpu(cpu, frozen_cpus) {
- error = _cpu_up(cpu, 1);
+ error = ___cpu_up(cpu, 1);
if (!error) {
printk(KERN_INFO "CPU%d is up\n", cpu);
continue;
@@ -628,6 +652,8 @@ void __ref enable_nonboot_cpus(void)
printk(KERN_WARNING "Error taking CPU%d up: %d\n", cpu, error);
}

+ cpu_hotplug_done();
+
arch_enable_nonboot_cpus_end();

cpumask_clear(frozen_cpus);

2013-09-28 16:38:19

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/28, Peter Zijlstra wrote:
>
> On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
>
> > Please note that this wait_event() adds a problem... it doesn't allow
> > to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> > does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> > in this case. We can solve this, but this wait_event() complicates
> > the problem.
>
> That seems like a particularly easy fix; something like so?

Yes, but...

> @@ -586,6 +603,11 @@ int disable_nonboot_cpus(void)
>
> + cpu_hotplug_done();
> +
> + for_each_cpu(cpu, frozen_cpus)
> + cpu_notify_nofail(CPU_POST_DEAD_FROZEN, (void*)(long)cpu);

This changes the protocol, I simply do not know if it is fine in general
to do __cpu_down(another_cpu) without CPU_POST_DEAD(previous_cpu). Say,
currently it is possible that CPU_DOWN_PREPARE takes some global lock
released by CPU_DOWN_FAILED or CPU_POST_DEAD.

Hmm. Now that workqueues do not use CPU_POST_DEAD, it has only 2 users,
mce_cpu_callback() and cpufreq_cpu_callback() and the 1st one even ignores
this notification if FROZEN. So yes, probably this is fine, but needs an
ack from cpufreq maintainers (cc'ed), for example to ensure that it is
fine to call __cpufreq_remove_dev_prepare() twice without _finish().

Oleg.

2013-09-28 20:46:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
> On 09/27, Peter Zijlstra wrote:
> >
> > On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> >
> > > > +static bool cpuhp_readers_active_check(void)
> > > > {
> > > > + unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > > +
> > > > + smp_mb(); /* B matches A */
> > > > +
> > > > + /*
> > > > + * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > > + * we are guaranteed to also see its __cpuhp_refcount increment.
> > > > + */
> > > >
> > > > + if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > > + return false;
> > > >
> > > > + smp_mb(); /* D matches C */
> > >
> > > It seems that both barriers could be smp_rmb()? I am not sure the comments
> > > from srcu_readers_active_idx_check() can explain mb(),
>
> To avoid the confusion, I meant "those comments can't explain mb()s here,
> in cpuhp_readers_active_check()".
>
> > > note that
> > > __srcu_read_lock() always succeeds unlike get_online_cpus().
>
> And this is where cpu_hotplug_ and synchronize_srcu() differ, see below.
>
> > I see what you mean; cpuhp_readers_active_check() is all purely reads;
> > there are no writes to order.
> >
> > Paul; is there any argument for the MB here as opposed to RMB;
>
> Yes, Paul, please ;)

Sorry to be slow -- I will reply by end of Monday Pacific time at the
latest. I need to allow myself enough time so that it seems new...

Also I might try some mechanical proofs of parts of it.

Thanx, Paul

> > and if
> > not should we change both these and SRCU?
>
> I guess that SRCU is more "complex" in this respect. IIUC,
> cpuhp_readers_active_check() needs "more" barriers because if
> synchronize_srcu() succeeds it needs to synchronize with the new readers
> which call srcu_read_lock/unlock() "right now". Again, unlike cpu-hotplug
> srcu never blocks the readers, srcu_read_*() always succeeds.
>
>
>
> Hmm. I am wondering why __srcu_read_lock() needs ACCESS_ONCE() to increment
> ->c and ->seq. A plain this_cpu_inc() should be fine?
>
> And since it disables preemption, why it can't use __this_cpu_inc() to inc
> ->c[idx]. OK, in general __this_cpu_inc() is not irq-safe (rmw) so we can't
> do __this_cpu_inc(seq[idx]), c[idx] should be fine? If irq does srcu_read_lock()
> it should also do _unlock.
>
> But this is minor/offtopic.
>
> > > > void cpu_hotplug_done(void)
> > > > {
> ...
> > > > + /*
> > > > + * Wait for any pending readers to be running. This ensures readers
> > > > + * after writer and avoids writers starving readers.
> > > > + */
> > > > + wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> > > > }
> > >
> > > OK, to some degree I can understand "avoids writers starving readers"
> > > part (although the next writer should do synchronize_sched() first),
> > > but could you explain "ensures readers after writer" ?
> >
> > Suppose reader A sees state == BLOCK and goes to sleep; our writer B
> > does cpu_hotplug_done() and wakes all pending readers. If for some
> > reason A doesn't schedule to inc ref until B again executes
> > cpu_hotplug_begin() and state is once again BLOCK, A will not have made
> > any progress.
>
> Yes, yes, thanks, this is clear. But this explains "writers starving readers".
> And let me repeat, if B again executes cpu_hotplug_begin() it will do
> another synchronize_sched() before it sets BLOCK, so I am not sure we
> need this "in practice".
>
> I was confused by "ensures readers after writer", I thought this means
> we need the additional synchronization with the readers which are going
> to increment cpuhp_waitcount, say, some sort of barries.
>
> Please note that this wait_event() adds a problem... it doesn't allow
> to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> in this case. We can solve this, but this wait_event() complicates
> the problem.
>
> Oleg.
>

2013-09-29 14:04:06

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/27, Oleg Nesterov wrote:
>
> I tried hard to find any hole in this version but failed, I believe it
> is correct.

And I still believe it is. But now I am starting to think that we
don't need cpuhp_seq. (and imo cpuhp_waitcount, but this is minor).

> We need to ensure 2 things:
>
> 1. The reader should notice state = BLOCK or the writer should see
> inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
> __get_online_cpus() and in cpu_hotplug_begin().
>
> We do not care if the writer misses some inc(__cpuhp_refcount)
> in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
> state = readers_block (and inc(cpuhp_seq) can't help anyway).

Yes!

> 2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
> from __put_online_cpus() (note that the writer can miss the
> corresponding inc() if it was done on another CPU, so this dec()
> can lead to sum() == 0),

But this can't happen in this version? Somehow I forgot that
__get_online_cpus() does inc/get under preempt_disable(), always on
the same CPU. And thanks to mb's the writer should not miss the
reader which has already passed the "state != BLOCK" check.

To simplify the discussion, lets ignore the "readers_fast" state,
synchronize_sched() logic looks obviously correct. IOW, lets discuss
only the SLOW -> BLOCK transition.

cpu_hotplug_begin()
{
state = BLOCK;

mb();

wait_event(cpuhp_writer,
per_cpu_sum(__cpuhp_refcount) == 0);
}

should work just fine? Ignoring all details, we have

get_online_cpus()
{
again:
preempt_disable();

__this_cpu_inc(__cpuhp_refcount);

mb();

if (state == BLOCK) {

mb();

__this_cpu_dec(__cpuhp_refcount);
wake_up_all(cpuhp_writer);

preempt_enable();
wait_event(state != BLOCK);
goto again;
}

preempt_enable();
}

It seems to me that these mb's guarantee all we need, no?

It looks really simple. The reader can only succeed if it doesn't see
BLOCK; in this case per_cpu_sum() should see the change.

We have

WRITER READER on CPU X

state = BLOCK; __cpuhp_refcount[X]++;

mb(); mb();

...
count += __cpuhp_refcount[X]; if (state != BLOCK)
... return;

mb();
__cpuhp_refcount[X]--;

Either reader or writer should notice the STORE we care about.

If a reader can decrement __cpuhp_refcount, we have 2 cases:

1. It is the reader holding this lock. In this case we
can't miss the corresponding inc() done by this reader,
because this reader didn't see BLOCK in the past.

It is just the

A == B == 0
CPU_0 CPU_1
----- -----
A = 1; B = 1;
mb(); mb();
b = B; a = A;

pattern, at least one CPU should see 1 in its a/b.

2. It is the reader which tries to take this lock and
noticed state == BLOCK. We could miss the result of
its inc(), but we do not care, this reader is going
to block.

_If_ the reader could migrate between inc/dec, then
yes, we have a problem. Because that dec() could make
the result of per_cpu_sum() = 0. IOW, we could miss
inc() but notice dec(). But given that it does this
on the same CPU this is not possible.

So why do we need cpuhp_seq?

Oleg.

2013-09-29 18:43:53

by Oleg Nesterov

[permalink] [raw]
Subject: [RFC] introduce synchronize_sched_{enter,exit}()

Hello.

Paul, Peter, et al, could you review the code below?

I am not sending the patch, I think it is simpler to read the code
inline (just in case, I didn't try to compile it yet).

It is functionally equivalent to

struct xxx_struct {
atomic_t counter;
};

static inline bool xxx_is_idle(struct xxx_struct *xxx)
{
return atomic_read(&xxx->counter) == 0;
}

static inline void xxx_enter(struct xxx_struct *xxx)
{
atomic_inc(&xxx->counter);
synchronize_sched();
}

static inline void xxx_exit(struct xxx_struct *xxx)
{
synchronize_sched();
atomic_dec(&xxx->counter);
}

except: it records the state and synchronize_sched() is only called by
xxx_enter() and only if necessary.

Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
(Peter, I think they should be unified anyway, but lets ignore this for
now). Or freeze_super() (which currently looks buggy), perhaps something
else. This pattern

writer:
state = SLOW_MODE;
synchronize_rcu/sched();

reader:
preempt_disable(); // or rcu_read_lock();
if (state != SLOW_MODE)
...

is quite common.

Note:
- This implementation allows multiple writers, and sometimes
this makes sense.

- But it's trivial to add "bool xxx->exclusive" set by xxx_init().
If it is true only one xxx_enter() is possible, other callers
should block until xxx_exit(). This is what percpu_down_write()
actually needs.

- Probably it makes sense to add xxx->rcu_domain = RCU/SCHED/ETC.

Do you think it is correct? Makes sense? (BUG_ON's are just comments).

Oleg.

// .h -----------------------------------------------------------------------

struct xxx_struct {
int gp_state;

int gp_count;
wait_queue_head_t gp_waitq;

int cb_state;
struct rcu_head cb_head;
};

static inline bool xxx_is_idle(struct xxx_struct *xxx)
{
return !xxx->gp_state; /* GP_IDLE */
}

extern void xxx_enter(struct xxx_struct *xxx);
extern void xxx_exit(struct xxx_struct *xxx);

// .c -----------------------------------------------------------------------

enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };

enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };

#define xxx_lock gp_waitq.lock

void xxx_enter(struct xxx_struct *xxx)
{
bool need_wait, need_sync;

spin_lock_irq(&xxx->xxx_lock);
need_wait = xxx->gp_count++;
need_sync = xxx->gp_state == GP_IDLE;
if (need_sync)
xxx->gp_state = GP_PENDING;
spin_unlock_irq(&xxx->xxx_lock);

BUG_ON(need_wait && need_sync);

if (need_sync) {
synchronize_sched();
xxx->gp_state = GP_PASSED;
wake_up_all(&xxx->gp_waitq);
} else if (need_wait) {
wait_event(xxx->gp_waitq, xxx->gp_state == GP_PASSED);
} else {
BUG_ON(xxx->gp_state != GP_PASSED);
}
}

static void cb_rcu_func(struct rcu_head *rcu)
{
struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
unsigned long flags;

BUG_ON(xxx->gp_state != GP_PASSED);
BUG_ON(xxx->cb_state == CB_IDLE);

spin_lock_irqsave(&xxx->xxx_lock, flags);
if (xxx->gp_count) {
xxx->cb_state = CB_IDLE;
} else if (xxx->cb_state == CB_REPLAY) {
xxx->cb_state = CB_PENDING;
call_rcu_sched(&xxx->cb_head, cb_rcu_func);
} else {
xxx->cb_state = CB_IDLE;
xxx->gp_state = GP_IDLE;
}
spin_unlock_irqrestore(&xxx->xxx_lock, flags);
}

void xxx_exit(struct xxx_struct *xxx)
{
spin_lock_irq(&xxx->xxx_lock);
if (!--xxx->gp_count) {
if (xxx->cb_state == CB_IDLE) {
xxx->cb_state = CB_PENDING;
call_rcu_sched(&xxx->cb_head, cb_rcu_func);
} else if (xxx->cb_state == CB_PENDING) {
xxx->cb_state = CB_REPLAY;
}
}
spin_unlock_irq(&xxx->xxx_lock);
}

2013-09-29 20:01:27

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [RFC] introduce synchronize_sched_{enter,exit}()

On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> Hello.
>
> Paul, Peter, et al, could you review the code below?
>
> I am not sending the patch, I think it is simpler to read the code
> inline (just in case, I didn't try to compile it yet).
>
> It is functionally equivalent to
>
> struct xxx_struct {
> atomic_t counter;
> };
>
> static inline bool xxx_is_idle(struct xxx_struct *xxx)
> {
> return atomic_read(&xxx->counter) == 0;
> }
>
> static inline void xxx_enter(struct xxx_struct *xxx)
> {
> atomic_inc(&xxx->counter);
> synchronize_sched();
> }
>
> static inline void xxx_exit(struct xxx_struct *xxx)
> {
> synchronize_sched();
> atomic_dec(&xxx->counter);
> }

But there is nothing for synchronize_sched() to wait for in the above.
Presumably the caller of xxx_is_idle() is required to disable preemption
or be under rcu_read_lock_sched()?

> except: it records the state and synchronize_sched() is only called by
> xxx_enter() and only if necessary.
>
> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but lets ignore this for
> now). Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern
>
> writer:
> state = SLOW_MODE;
> synchronize_rcu/sched();
>
> reader:
> preempt_disable(); // or rcu_read_lock();
> if (state != SLOW_MODE)
> ...
>
> is quite common.

And this does guarantee that by the time the writer's synchronize_whatever()
exits, all readers will know that state==SLOW_MODE.
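
(A minimal compilable rendering of the quoted pattern, for concreteness;
"state", SLOW_MODE, writer() and reader() are illustrative names, not taken
from any of the patches discussed here.)

static int state;		/* 0 == fast mode */
#define SLOW_MODE	1

static void writer(void)
{
	state = SLOW_MODE;
	/*
	 * After this returns, every reader section that started before
	 * the store has finished, and every later reader sees SLOW_MODE.
	 */
	synchronize_sched();
	/* ... modify the protected state ... */
}

static void reader(void)
{
	preempt_disable();	/* or rcu_read_lock_sched() */
	if (state != SLOW_MODE) {
		/* fast path */
	} else {
		/* slow path: fall back to the heavyweight primitive */
	}
	preempt_enable();
}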

> Note:
> - This implementation allows multiple writers, and sometimes
> this makes sense.

If each writer atomically incremented SLOW_MODE, did its update, then
atomically decremented it, sure. You could be more clever and avoid
unneeded synchronize_whatever() calls, but I would have to see a good
reason for doing so before recommending this.

OK, but you appear to be doing this below anyway. ;-)

> - But it's trivial to add "bool xxx->exclusive" set by xxx_init().
> If it is true only one xxx_enter() is possible, other callers
> should block until xxx_exit(). This is what percpu_down_write()
> actually needs.

Agreed.

> - Probably it makes sense to add xxx->rcu_domain = RCU/SCHED/ETC.

Or just have pointers to the RCU functions in the xxx structure...

So you are trying to make something that abstracts the RCU-protected
state-change pattern? Or perhaps more accurately, the RCU-protected
state-change-and-back pattern?

> Do you think it is correct? Makes sense? (BUG_ON's are just comments).

... Maybe ... Please see below for commentary and a question.

Thanx, Paul

> Oleg.
>
> // .h -----------------------------------------------------------------------
>
> struct xxx_struct {
> int gp_state;
>
> int gp_count;
> wait_queue_head_t gp_waitq;
>
> int cb_state;
> struct rcu_head cb_head;

spinlock_t xxx_lock; /* ? */

This spinlock might not make the big-system guys happy, but it appears to
be needed below.

> };
>
> static inline bool xxx_is_idle(struct xxx_struct *xxx)
> {
> return !xxx->gp_state; /* GP_IDLE */
> }
>
> extern void xxx_enter(struct xxx_struct *xxx);
> extern void xxx_exit(struct xxx_struct *xxx);
>
> // .c -----------------------------------------------------------------------
>
> enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
>
> enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
>
> #define xxx_lock gp_waitq.lock
>
> void xxx_enter(struct xxx_struct *xxx)
> {
> bool need_wait, need_sync;
>
> spin_lock_irq(&xxx->xxx_lock);
> need_wait = xxx->gp_count++;
> need_sync = xxx->gp_state == GP_IDLE;

Suppose ->gp_state is GP_PASSED. It could transition to GP_IDLE at any
time, right?

> if (need_sync)
> xxx->gp_state = GP_PENDING;
> spin_unlock_irq(&xxx->xxx_lock);
>
> BUG_ON(need_wait && need_sync);
>
> } if (need_sync) {
> synchronize_sched();
> xxx->gp_state = GP_PASSED;
> wake_up_all(&xxx->gp_waitq);
> } else if (need_wait) {
> wait_event(&xxx->gp_waitq, xxx->gp_state == GP_PASSED);

Suppose the wakeup is delayed until after the state has been updated
back to GP_IDLE? Ah, presumably the non-zero ->gp_count prevents this.
Never mind!

> } else {
> BUG_ON(xxx->gp_state != GP_PASSED);
> }
> }
>
> static void cb_rcu_func(struct rcu_head *rcu)
> {
> struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> long flags;
>
> BUG_ON(xxx->gp_state != GP_PASSED);
> BUG_ON(xxx->cb_state == CB_IDLE);
>
> spin_lock_irqsave(&xxx->xxx_lock, flags);
> if (xxx->gp_count) {
> xxx->cb_state = CB_IDLE;
> } else if (xxx->cb_state == CB_REPLAY) {
> xxx->cb_state = CB_PENDING;
> call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> } else {
> xxx->cb_state = CB_IDLE;
> xxx->gp_state = GP_IDLE;
> }

It took me a bit to work out the above. It looks like the intent is
to have the last xxx_exit() put the state back to GP_IDLE, which appears
to be the state in which readers can use a fastpath.

This works because if ->gp_count is non-zero and ->cb_state is CB_IDLE,
there must be an xxx_exit() in our future.

> spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> }
>
> void xxx_exit(struct xxx_struct *xxx)
> {
> spin_lock_irq(&xxx->xxx_lock);
> if (!--xxx->gp_count) {
> if (xxx->cb_state == CB_IDLE) {
> xxx->cb_state = CB_PENDING;
> call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> } else if (xxx->cb_state == CB_PENDING) {
> xxx->cb_state = CB_REPLAY;
> }
> }
> spin_unlock_irq(&xxx->xxx_lock);
> }

Then we also have something like this?

bool xxx_readers_fastpath_ok(struct xxx_struct *xxx)
{
BUG_ON(!rcu_read_lock_sched_held());
return xxx->gp_state == GP_IDLE;
}

2013-09-29 21:34:55

by Steven Rostedt

[permalink] [raw]
Subject: Re: [RFC] introduce synchronize_sched_{enter,exit}()

On Sun, 29 Sep 2013 20:36:34 +0200
Oleg Nesterov <[email protected]> wrote:


> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but lets ignore this for
> now). Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern
>

Just so I'm clear on what you are trying to implement... This is to
handle the case (as Paul said) to see changes to state by RCU and back
again? That is, it isn't enough to see that the state changed to
something (like SLOW MODE), but we also need a way to see it change
back?

With get_online_cpus(), we need to see the state where it changed to
"performing hotplug" where holders need to go into the slow path, and
then also see the state change to "no longer performing hotplug" and the
holders now go back to the fast path. Is this the rationale for this email?

Thanks,

-- Steve

2013-09-30 10:30:47

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/50] Basic scheduler support for automatic NUMA balancing V7

On Sat, Sep 14, 2013 at 10:57:35AM +0800, Bob Liu wrote:
> Hi Mel,
>
> On 09/10/2013 05:31 PM, Mel Gorman wrote:
> > It has been a long time since V6 of this series and time for an update. Much
> > of this is now stabilised with the most important addition being the inclusion
> > of Peter and Rik's work on grouping tasks that share pages together.
> >
> > This series has a number of goals. It reduces overhead of automatic balancing
> > through scan rate reduction and the avoidance of TLB flushes. It selects a
> > preferred node and moves tasks towards their memory as well as moving memory
> > toward their task. It handles shared pages and groups related tasks together.
> >
>
> I found sometimes numa balancing will be broken after khugepaged
> started, because khugepaged always allocate huge page from the node of
> the first scanned normal page during collapsing.
>

This is a real, but separate problem.

> I think this may related with this topic, I don't know whether this
> series can also fix the issue I mentioned.
>

This series does not aim to fix that particular problem. There will be
some interactions between the problems as automatic NUMA balancing deals
with THP migration but they are only indirectly related. If khugepaged
does not collapse to huge pages inappropriately then automatic NUMA
balancing will never encounter them.

--
Mel Gorman
SUSE Labs

2013-09-30 12:49:41

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC] introduce synchronize_sched_{enter,exit}()

On 09/29, Paul E. McKenney wrote:
>
> On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> >
> > struct xxx_struct {
> > atomic_t counter;
> > };
> >
> > static inline bool xxx_is_idle(struct xxx_struct *xxx)
> > {
> > return atomic_read(&xxx->counter) == 0;
> > }
> >
> > static inline void xxx_enter(struct xxx_struct *xxx)
> > {
> > atomic_inc(&xxx->counter);
> > synchronize_sched();
> > }
> >
> > static inline void xxx_exit(struct xxx_struct *xxx)
> > {
> > synchronize_sched();
> > atomic_dec(&xxx->counter);
> > }
>
> But there is nothing for synchronize_sched() to wait for in the above.
> Presumably the caller of xxx_is_idle() is required to disable preemption
> or be under rcu_read_lock_sched()?

Yes, yes, sure, xxx_is_idle() should be called under preempt_disable().
(or rcu_read_lock() if xxx_enter() uses synchronize_rcu()).

> So you are trying to make something that abstracts the RCU-protected
> state-change pattern? Or perhaps more accurately, the RCU-protected
> state-change-and-back pattern?

Yes, exactly.

> > struct xxx_struct {
> > int gp_state;
> >
> > int gp_count;
> > wait_queue_head_t gp_waitq;
> >
> > int cb_state;
> > struct rcu_head cb_head;
>
> spinlock_t xxx_lock; /* ? */

See

#define xxx_lock gp_waitq.lock

in .c below, but we can add another spinlock.

> This spinlock might not make the big-system guys happy, but it appears to
> be needed below.

Only the writers use this spinlock, and they should synchronize with each
other anyway. I don't think this can really penalize, say, percpu_down_write
or cpu_hotplug_begin.

> > // .c -----------------------------------------------------------------------
> >
> > enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
> >
> > enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
> >
> > #define xxx_lock gp_waitq.lock
> >
> > void xxx_enter(struct xxx_struct *xxx)
> > {
> > bool need_wait, need_sync;
> >
> > spin_lock_irq(&xxx->xxx_lock);
> > need_wait = xxx->gp_count++;
> > need_sync = xxx->gp_state == GP_IDLE;
>
> Suppose ->gp_state is GP_PASSED. It could transition to GP_IDLE at any
> time, right?

As you already pointed out below - no.

Once we have incremented ->gp_count, nobody can set GP_IDLE. And if the
caller is the "first" writer (need_sync == T) nobody else can change
->gp_state, so xxx_enter() sets GP_PASSED locklessly.

> > if (need_sync)
> > xxx->gp_state = GP_PENDING;
> > spin_unlock_irq(&xxx->xxx_lock);
> >
> > BUG_ON(need_wait && need_sync);
> >
> > } if (need_sync) {
> > synchronize_sched();
> > xxx->gp_state = GP_PASSED;
> > wake_up_all(&xxx->gp_waitq);
> > } else if (need_wait) {
> > wait_event(&xxx->gp_waitq, xxx->gp_state == GP_PASSED);
>
> Suppose the wakeup is delayed until after the state has been updated
> back to GP_IDLE? Ah, presumably the non-zero ->gp_count prevents this.

Yes, exactly.

> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > long flags;
> >
> > BUG_ON(xxx->gp_state != GP_PASSED);
> > BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > spin_lock_irqsave(&xxx->xxx_lock, flags);
> > if (xxx->gp_count) {
> > xxx->cb_state = CB_IDLE;
> > } else if (xxx->cb_state == CB_REPLAY) {
> > xxx->cb_state = CB_PENDING;
> > call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > } else {
> > xxx->cb_state = CB_IDLE;
> > xxx->gp_state = GP_IDLE;
> > }
>
> It took me a bit to work out the above. It looks like the intent is
> to have the last xxx_exit() put the state back to GP_IDLE, which appears
> to be the state in which readers can use a fastpath.

Yes, and we offload this work to the rcu callback so xxx_exit() doesn't
block.

The only complication is the next writer which does xxx_enter() after
xxx_exit(). If there are no other writers, the next xxx_exit() should do

rcu_cancel(&xxx->cb_head);
call_rcu_sched(&xxx->cb_head, cb_rcu_func);

to "extend" the gp, but since we do not have rcu_cancel() it simply sets
CB_REPLAY to instruct cb_rcu_func() to reschedule itself.
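
(As a reading aid only, the cb_state transitions implied by the code above,
sketched as a comment; this is not part of the proposal.)

/*
 * cb_state transitions (reading aid only):
 *
 *   xxx_exit() drops gp_count to 0, cb_state == CB_IDLE:
 *       CB_IDLE    -> CB_PENDING   (call_rcu_sched() queued)
 *   a later xxx_exit() drops gp_count to 0 while CB_PENDING:
 *       CB_PENDING -> CB_REPLAY    (callback must requeue itself)
 *   cb_rcu_func() runs with gp_count != 0:
 *       any        -> CB_IDLE      (new writer active, gp_state stays GP_PASSED)
 *   cb_rcu_func() runs with gp_count == 0, cb_state == CB_REPLAY:
 *       CB_REPLAY  -> CB_PENDING   (call_rcu_sched() again, extending the GP)
 *   cb_rcu_func() runs with gp_count == 0, cb_state == CB_PENDING:
 *       CB_PENDING -> CB_IDLE, and gp_state -> GP_IDLE (fast path re-enabled)
 */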

> This works because if ->gp_count is non-zero and ->cb_state is CB_IDLE,
> there must be an xxx_exit() in our future.

Yes, but ->cb_state doesn't really matter if ->gp_count != 0 in xxx_exit()
or cb_rcu_func() (except it can't be CB_IDLE in cb_rcu_func).

> > void xxx_exit(struct xxx_struct *xxx)
> > {
> > spin_lock_irq(&xxx->xxx_lock);
> > if (!--xxx->gp_count) {
> > if (xxx->cb_state == CB_IDLE) {
> > xxx->cb_state = CB_PENDING;
> > call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > } else if (xxx->cb_state == CB_PENDING) {
> > xxx->cb_state = CB_REPLAY;
> > }
> > }
> > spin_unlock_irq(&xxx->xxx_lock);
> > }
>
> Then we also have something like this?
>
> bool xxx_readers_fastpath_ok(struct xxx_struct *xxx)
> {
> BUG_ON(!rcu_read_lock_sched_held());
> return xxx->gp_state == GP_IDLE;
> }

Yes, this is what xxx_is_idle() does (ignoring BUG_ON). It actually
checks xxx->gp_state == 0, this is just to avoid the unnecessary export
of GP_* enum.

Oleg.

2013-09-30 13:00:21

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] introduce synchronize_sched_{enter,exit}()

On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> (Peter, I think they should be unified anyway, but lets ignore this for
> now).

If you think the percpu_rwsem users can benefit, sure. So far it's good I
didn't go the percpu_rwsem route, for it looks like we got something
better at the end of it ;-)

> Or freeze_super() (which currently looks buggy), perhaps something
> else. This pattern
>
> writer:
> state = SLOW_MODE;
> synchronize_rcu/sched();
>
> reader:
> preempt_disable(); // or rcu_read_lock();
> if (state != SLOW_MODE)
> ...
>
> is quite common.

Well, if we make percpu_rwsem the defacto container of the pattern and
use that throughout, we'd have only a single implementation and don't
need the abstraction.

That said; we could still use the idea proposed; so let me take a look.

> // .h -----------------------------------------------------------------------
>
> struct xxx_struct {
> int gp_state;
>
> int gp_count;
> wait_queue_head_t gp_waitq;
>
> int cb_state;
> struct rcu_head cb_head;
> };
>
> static inline bool xxx_is_idle(struct xxx_struct *xxx)
> {
> return !xxx->gp_state; /* GP_IDLE */
> }
>
> extern void xxx_enter(struct xxx_struct *xxx);
> extern void xxx_exit(struct xxx_struct *xxx);
>
> // .c -----------------------------------------------------------------------
>
> enum { GP_IDLE = 0, GP_PENDING, GP_PASSED };
>
> enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };
>
> #define xxx_lock gp_waitq.lock
>
> void xxx_enter(struct xxx_struct *xxx)
> {
> bool need_wait, need_sync;
>
> spin_lock_irq(&xxx->xxx_lock);
> need_wait = xxx->gp_count++;
> need_sync = xxx->gp_state == GP_IDLE;
> if (need_sync)
> xxx->gp_state = GP_PENDING;
> spin_unlock_irq(&xxx->xxx_lock);
>
> BUG_ON(need_wait && need_sync);
>
> } if (need_sync) {
> synchronize_sched();
> xxx->gp_state = GP_PASSED;
> wake_up_all(&xxx->gp_waitq);
> } else if (need_wait) {
> wait_event(&xxx->gp_waitq, xxx->gp_state == GP_PASSED);
> } else {
> BUG_ON(xxx->gp_state != GP_PASSED);
> }
> }
>
> static void cb_rcu_func(struct rcu_head *rcu)
> {
> struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> long flags;
>
> BUG_ON(xxx->gp_state != GP_PASSED);
> BUG_ON(xxx->cb_state == CB_IDLE);
>
> spin_lock_irqsave(&xxx->xxx_lock, flags);
> if (xxx->gp_count) {
> xxx->cb_state = CB_IDLE;

This seems to be when a new xxx_begin() has happened after our last
xxx_end() and the sync_sched() from xxx_begin() merges with the
xxx_end() one and we're done.

> } else if (xxx->cb_state == CB_REPLAY) {
> xxx->cb_state = CB_PENDING;
> call_rcu_sched(&xxx->cb_head, cb_rcu_func);

A later xxx_exit() has happened, and we need to requeue to catch a later
GP.

> } else {
> xxx->cb_state = CB_IDLE;
> xxx->gp_state = GP_IDLE;

Nothing fancy happened and we're done.

> }
> spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> }
>
> void xxx_exit(struct xxx_struct *xxx)
> {
> spin_lock_irq(&xxx->xxx_lock);
> if (!--xxx->gp_count) {
> if (xxx->cb_state == CB_IDLE) {
> xxx->cb_state = CB_PENDING;
> call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> } else if (xxx->cb_state == CB_PENDING) {
> xxx->cb_state = CB_REPLAY;
> }
> }
> spin_unlock_irq(&xxx->xxx_lock);
> }

So I don't immediately see the point of the concurrent write side;
percpu_rwsem wouldn't allow this and afaict neither would
freeze_super().

Other than that; yes this makes sense if you care about write side
performance and I think it's solid.

2013-09-30 13:11:30

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC] introduce synchronize_sched_{enter,exit}()

On 09/29, Steven Rostedt wrote:
>
> On Sun, 29 Sep 2013 20:36:34 +0200
> Oleg Nesterov <[email protected]> wrote:
>
>
> > Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> > (Peter, I think they should be unified anyway, but lets ignore this for
> > now). Or freeze_super() (which currently looks buggy), perhaps something
> > else. This pattern
> >
>
> Just so I'm clear to what you are trying to implement... This is to
> handle the case (as Paul said) to see changes to state by RCU and back
> again? That is, it isn't enough to see that the state changed to
> something (like SLOW MODE), but we also need a way to see it change
> back?

Suppose this code was applied as is. Now we can change percpu_rwsem,
see the "patch" below. (please ignore _expedited in the current code).

This immediately makes percpu_up_write() much faster, it no longer
blocks. And the contending writers (or even the same writer which
takes it again) can avoid synchronize_sched() in percpu_down_write().

And to remind, we can add xxx_struct->exclusive (or add the argument
to xxx_enter/exit), and then (with some other changes) we can kill
percpu_rw_semaphore->rw_sem.

> With get_online_cpus(), we need to see the state where it changed to
> "performing hotplug" where holders need to go into the slow path, and
> then also see the state change to "no longe performing hotplug" and the
> holders now go back to fast path. Is this the rational for this email?

The same. cpu_hotplug_begin/end (I mean the code written by Peter) can
be changed to use xxx_enter/exit.

Oleg.

--- x/include/linux/percpu-rwsem.h
+++ x/include/linux/percpu-rwsem.h
@@ -8,8 +8,8 @@
#include <linux/lockdep.h>

struct percpu_rw_semaphore {
+ xxx_struct xxx;
unsigned int __percpu *fast_read_ctr;
- atomic_t write_ctr;
struct rw_semaphore rw_sem;
atomic_t slow_read_ctr;
wait_queue_head_t write_waitq;
--- x/lib/percpu-rwsem.c
+++ x/lib/percpu-rwsem.c
@@ -17,7 +17,7 @@ int __percpu_init_rwsem(struct percpu_rw

/* ->rw_sem represents the whole percpu_rw_semaphore for lockdep */
__init_rwsem(&brw->rw_sem, name, rwsem_key);
- atomic_set(&brw->write_ctr, 0);
+ xxx_init(&brw->xxx, ...);
atomic_set(&brw->slow_read_ctr, 0);
init_waitqueue_head(&brw->write_waitq);
return 0;
@@ -25,6 +25,14 @@ int __percpu_init_rwsem(struct percpu_rw

void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
{
+ might_sleep();
+
+ // pseudo code which needs another simple xxx_ helper
+ if (xxx->gp_state == GP_REPLAY)
+ xxx->gp_state = GP_PENDING;
+ if (xxx->gp_state)
+ synchronize_sched();
+
free_percpu(brw->fast_read_ctr);
brw->fast_read_ctr = NULL; /* catch use after free bugs */
}
@@ -57,7 +65,7 @@ static bool update_fast_ctr(struct percp
bool success = false;

preempt_disable();
- if (likely(!atomic_read(&brw->write_ctr))) {
+ if (likely(xxx_is_idle(&brw->xxx))) {
__this_cpu_add(*brw->fast_read_ctr, val);
success = true;
}
@@ -126,20 +134,7 @@ static int clear_fast_ctr(struct percpu_
*/
void percpu_down_write(struct percpu_rw_semaphore *brw)
{
- /* tell update_fast_ctr() there is a pending writer */
- atomic_inc(&brw->write_ctr);
- /*
- * 1. Ensures that write_ctr != 0 is visible to any down_read/up_read
- * so that update_fast_ctr() can't succeed.
- *
- * 2. Ensures we see the result of every previous this_cpu_add() in
- * update_fast_ctr().
- *
- * 3. Ensures that if any reader has exited its critical section via
- * fast-path, it executes a full memory barrier before we return.
- * See R_W case in the comment above update_fast_ctr().
- */
- synchronize_sched_expedited();
+ xxx_enter(&brw->xxx);

/* exclude other writers, and block the new readers completely */
down_write(&brw->rw_sem);
@@ -159,7 +154,5 @@ void percpu_up_write(struct percpu_rw_se
* Insert the barrier before the next fast-path in down_read,
* see W_R case in the comment above update_fast_ctr().
*/
- synchronize_sched_expedited();
- /* the last writer unblocks update_fast_ctr() */
- atomic_dec(&brw->write_ctr);
+ xxx_exit(&brw->xxx);
}

2013-09-30 14:24:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] introduce synchronize_sched_{enter,exit}()

On Mon, Sep 30, 2013 at 02:59:42PM +0200, Peter Zijlstra wrote:

> >
> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > long flags;
> >
> > BUG_ON(xxx->gp_state != GP_PASSED);
> > BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > spin_lock_irqsave(&xxx->xxx_lock, flags);
> > if (xxx->gp_count) {
> > xxx->cb_state = CB_IDLE;
>
> This seems to be when a new xxx_begin() has happened after our last
> xxx_end() and the sync_sched() from xxx_begin() merges with the
> xxx_end() one and we're done.
>
> > } else if (xxx->cb_state == CB_REPLAY) {
> > xxx->cb_state = CB_PENDING;
> > call_rcu_sched(&xxx->cb_head, cb_rcu_func);
>
> A later xxx_exit() has happened, and we need to requeue to catch a later
> GP.
>
> > } else {
> > xxx->cb_state = CB_IDLE;
> > xxx->gp_state = GP_IDLE;
>
> Nothing fancy happened and we're done.
>
> > }
> > spin_unlock_irqrestore(&xxx->xxx_lock, flags);
> > }
> >
> > void xxx_exit(struct xxx_struct *xxx)
> > {
> > spin_lock_irq(&xxx->xxx_lock);
> > if (!--xxx->gp_count) {
> > if (xxx->cb_state == CB_IDLE) {
> > xxx->cb_state = CB_PENDING;
> > call_rcu_sched(&xxx->cb_head, cb_rcu_func);
> > } else if (xxx->cb_state == CB_PENDING) {
> > xxx->cb_state = CB_REPLAY;
> > }
> > }
> > spin_unlock_irq(&xxx->xxx_lock);
> > }
>
> So I don't immediately see the point of the concurrent write side;
> percpu_rwsem wouldn't allow this and afaict neither would
> freeze_super().
>
> Other than that; yes this makes sense if you care about write side
> performance and I think its solid.

Hmm, wait. I don't see how this is equivalent to:

xxx_end()
{
synchronize_sched();
atomic_dec(&xxx->counter);
}

For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
wouldn't we?

Without that there's no guarantee the fast path readers will have a MB
to observe the write critical section, unless I'm completely missing
something obvious here.

2013-09-30 15:07:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] introduce synchronize_sched_{enter,exit}()

On Mon, Sep 30, 2013 at 04:24:00PM +0200, Peter Zijlstra wrote:
> For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
> wouldn't we?
>
> Without that there's no guarantee the fast path readers will have a MB
> to observe the write critical section, unless I'm completely missing
> > something obvious here.

Duh.. we should be looking at gp_state like Paul said.

2013-09-30 16:45:22

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC] introduce synchronize_sched_{enter,exit}()

On 09/30, Peter Zijlstra wrote:
>
> On Sun, Sep 29, 2013 at 08:36:34PM +0200, Oleg Nesterov wrote:
> > Why? Say, percpu_rw_semaphore, or upcoming changes in get_online_cpus(),
> > (Peter, I think they should be unified anyway, but lets ignore this for
> > now).
>
> If you think the percpu_rwsem users can benefit sure.. So far its good I
> didn't go the percpu_rwsem route for it looks like we got something
> better at the end of it ;-)

I think you could simply improve percpu_rwsem instead. Once we add
task_struct->cpuhp_ctr, percpu_rwsem and get_online_cpus/hotplug_begin
become absolutely congruent.

OTOH, it would be simpler to change hotplug first, then copy-and-paste
the improvements into percpu_rwsem, then see if we can simply convert
cpu_hotplug_begin/end into percpu_down/up_write.

> Well, if we make percpu_rwsem the defacto container of the pattern and
> use that throughout, we'd have only a single implementation

Not sure. I think it can have other users. But even if not, please look
at "struct sb_writers". Yes, I believe it makes sense to use percpu_rwsem
here, but note that it is actually an array of semaphores. I do not think
each element needs its own xxx_struct.

> and don't
> need the abstraction.

And even if struct percpu_rw_semaphore ends up being the only container of
xxx_struct, I think the code looks better and more understandable this
way, exactly because it adds the new abstraction layer. Performance-wise
this should be free.

> > static void cb_rcu_func(struct rcu_head *rcu)
> > {
> > struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
> > unsigned long flags;
> >
> > BUG_ON(xxx->gp_state != GP_PASSED);
> > BUG_ON(xxx->cb_state == CB_IDLE);
> >
> > spin_lock_irqsave(&xxx->xxx_lock, flags);
> > if (xxx->gp_count) {
> > xxx->cb_state = CB_IDLE;
>
> This seems to be when a new xxx_begin() has happened after our last
> xxx_end() and the sync_sched() from xxx_begin() merges with the
> xxx_end() one and we're done.

Yes,

> > } else if (xxx->cb_state == CB_REPLAY) {
> > xxx->cb_state = CB_PENDING;
> > call_rcu_sched(&xxx->cb_head, cb_rcu_func);
>
> A later xxx_exit() has happened, and we need to requeue to catch a later
> GP.

Exactly.

> So I don't immediately see the point of the concurrent write side;
> percpu_rwsem wouldn't allow this and afaict neither would
> freeze_super().

Oh I disagree. Even ignoring the fact that I believe xxx_struct itself
can have more users (I can be wrong of course), I do think that
percpu_down_write_nonexclusive() makes sense (except "exclusive"
should be the argument of percpu_init_rwsem). And in fact the
initial implementation I sent didn't even have the "exclusive" mode.

Please look at uprobes (currently the only user). We do not really
need the global write-lock, we can do the per-uprobe locking. However,
every caller needs to block the percpu_down_read() callers (dup_mmap).

> Other than that; yes this makes sense if you care about write side
> performance and I think its solid.

Great ;)

Oleg.

2013-09-30 17:05:12

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [RFC] introduce synchronize_sched_{enter,exit}()

On 09/30, Peter Zijlstra wrote:
>
> On Mon, Sep 30, 2013 at 04:24:00PM +0200, Peter Zijlstra wrote:
> > For that we'd have to decrement xxx->gp_count from cb_rcu_func(),
> > wouldn't we?
> >
> > Without that there's no guarantee the fast path readers will have a MB
> > to observe the write critical section, unless I'm completely missing
> > something obvious here.
>
> Duh.. we should be looking at gp_state like Paul said.

Yes, yes, that is why we have xxx_is_idle(). Its name is confusing
even ignoring "xxx".

OK, I'll try to invent the naming (but I'd like to hear suggestions ;)
and send the patch. I am going to add "exclusive" and "rcu_domain/ops"
later, currently percpu_rw_semaphore needs ->rw_sem anyway.

Oleg.

2013-09-30 20:00:22

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Saturday, September 28, 2013 06:31:04 PM Oleg Nesterov wrote:
> On 09/28, Peter Zijlstra wrote:
> >
> > On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
> >
> > > Please note that this wait_event() adds a problem... it doesn't allow
> > > to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
> > > does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
> > > in this case. We can solve this, but this wait_event() complicates
> > > the problem.
> >
> > That seems like a particularly easy fix; something like so?
>
> Yes, but...
>
> > @@ -586,6 +603,11 @@ int disable_nonboot_cpus(void)
> >
> > + cpu_hotplug_done();
> > +
> > + for_each_cpu(cpu, frozen_cpus)
> > + cpu_notify_nofail(CPU_POST_DEAD_FROZEN, (void*)(long)cpu);
>
> This changes the protocol, I simply do not know if it is fine in general
> to do __cpu_down(another_cpu) without CPU_POST_DEAD(previous_cpu). Say,
> currently it is possible that CPU_DOWN_PREPARE takes some global lock
> released by CPU_DOWN_FAILED or CPU_POST_DEAD.
>
> Hmm. Now that workqueues do not use CPU_POST_DEAD, it has only 2 users,
> mce_cpu_callback() and cpufreq_cpu_callback() and the 1st one even ignores
> this notification if FROZEN. So yes, probably this is fine, but needs an
> ack from cpufreq maintainers (cc'ed), for example to ensure that it is
> fine to call __cpufreq_remove_dev_prepare() twice without _finish().

To my eyes it will return -EBUSY when it tries to stop an already stopped
governor, which will cause the entire chain to fail I guess.

Srivatsa has touched that code most recently, so he should know better, though.

Thanks,
Rafael

2013-10-01 03:56:11

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > On 09/26, Peter Zijlstra wrote:

[ . . . ]

> > > +static bool cpuhp_readers_active_check(void)
> > > {
> > > + unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > +
> > > + smp_mb(); /* B matches A */
> > > +
> > > + /*
> > > + * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > + * we are guaranteed to also see its __cpuhp_refcount increment.
> > > + */
> > >
> > > + if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > + return false;
> > >
> > > + smp_mb(); /* D matches C */
> >
> > It seems that both barries could be smp_rmb() ? I am not sure the comments
> > from srcu_readers_active_idx_check() can explain mb(), note that
> > __srcu_read_lock() always succeeds unlike get_cpus_online().
>
> I see what you mean; cpuhp_readers_active_check() is all purely reads;
> there are no writes to order.
>
> Paul; is there any argument for the MB here as opposed to RMB; and if
> not should we change both these and SRCU?

Given that these memory barriers execute only on the semi-slow path,
why add the complexity of moving from smp_mb() to either smp_rmb()
or smp_wmb()? Straight smp_mb() is easier to reason about and more
robust against future changes.

Thanx, Paul

2013-10-01 14:21:41

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 09/30, Paul E. McKenney wrote:
>
> On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> > On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > > On 09/26, Peter Zijlstra wrote:
>
> [ . . . ]
>
> > > > +static bool cpuhp_readers_active_check(void)
> > > > {
> > > > + unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > > +
> > > > + smp_mb(); /* B matches A */
> > > > +
> > > > + /*
> > > > + * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > > + * we are guaranteed to also see its __cpuhp_refcount increment.
> > > > + */
> > > >
> > > > + if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > > + return false;
> > > >
> > > > + smp_mb(); /* D matches C */
> > >
> > > It seems that both barries could be smp_rmb() ? I am not sure the comments
> > > from srcu_readers_active_idx_check() can explain mb(), note that
> > > __srcu_read_lock() always succeeds unlike get_cpus_online().
> >
> > I see what you mean; cpuhp_readers_active_check() is all purely reads;
> > there are no writes to order.
> >
> > Paul; is there any argument for the MB here as opposed to RMB; and if
> > not should we change both these and SRCU?
>
> Given that these memory barriers execute only on the semi-slow path,
> why add the complexity of moving from smp_mb() to either smp_rmb()
> or smp_wmb()? Straight smp_mb() is easier to reason about and more
> robust against future changes.

But otoh this looks misleading, and the comments add more confusion.

But please note another email, it seems to me we can simply kill
cpuhp_seq and all the barriers in cpuhp_readers_active_check().

Oleg.

2013-10-01 14:46:00

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Oct 01, 2013 at 04:14:29PM +0200, Oleg Nesterov wrote:
> On 09/30, Paul E. McKenney wrote:
> >
> > On Fri, Sep 27, 2013 at 10:41:16PM +0200, Peter Zijlstra wrote:
> > > On Fri, Sep 27, 2013 at 08:15:32PM +0200, Oleg Nesterov wrote:
> > > > On 09/26, Peter Zijlstra wrote:
> >
> > [ . . . ]
> >
> > > > > +static bool cpuhp_readers_active_check(void)
> > > > > {
> > > > > + unsigned int seq = per_cpu_sum(cpuhp_seq);
> > > > > +
> > > > > + smp_mb(); /* B matches A */
> > > > > +
> > > > > + /*
> > > > > + * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> > > > > + * we are guaranteed to also see its __cpuhp_refcount increment.
> > > > > + */
> > > > >
> > > > > + if (per_cpu_sum(__cpuhp_refcount) != 0)
> > > > > + return false;
> > > > >
> > > > > + smp_mb(); /* D matches C */
> > > >
> > > > It seems that both barries could be smp_rmb() ? I am not sure the comments
> > > > from srcu_readers_active_idx_check() can explain mb(), note that
> > > > __srcu_read_lock() always succeeds unlike get_cpus_online().
> > >
> > > I see what you mean; cpuhp_readers_active_check() is all purely reads;
> > > there are no writes to order.
> > >
> > > Paul; is there any argument for the MB here as opposed to RMB; and if
> > > not should we change both these and SRCU?
> >
> > Given that these memory barriers execute only on the semi-slow path,
> > why add the complexity of moving from smp_mb() to either smp_rmb()
> > or smp_wmb()? Straight smp_mb() is easier to reason about and more
> > robust against future changes.
>
> But otoh this looks misleading, and the comments add more confusion.
>
> But please note another email, it seems to me we can simply kill
> cpuhp_seq and all the barriers in cpuhp_readers_active_check().

If you don't have cpuhp_seq, you need some other way to avoid
counter overflow. Which might be provided by limited number of
tasks, or, on 64-bit systems, 64-bit counters.

Thanx, Paul

2013-10-01 14:48:44

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Oct 01, 2013 at 07:45:37AM -0700, Paul E. McKenney wrote:
> If you don't have cpuhp_seq, you need some other way to avoid
> counter overflow. Which might be provided by limited number of
> tasks, or, on 64-bit systems, 64-bit counters.

How so? PID space is basically limited to 30 bits, so how could we
overflow a 32bit reference counter?

2013-10-01 15:07:46

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/01, Paul E. McKenney wrote:
>
> On Tue, Oct 01, 2013 at 04:14:29PM +0200, Oleg Nesterov wrote:
> >
> > But please note another email, it seems to me we can simply kill
> > cpuhp_seq and all the barriers in cpuhp_readers_active_check().
>
> If you don't have cpuhp_seq, you need some other way to avoid
> counter overflow.

I don't think so. Overflows (especially "unsigned") should be fine and
in fact we can't avoid them.

Say, a task does get() on CPU_0 and put() on CPU_1, after that we have

CTR[0] == 1, CTR[1] = (unsigned)-1

iow, the counter was already overflowed (underflowed). But this is fine,
all we care about is CTR[0] + CTR[1] == 0, and this is only true because
of another overflow.
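
Just to illustrate the arithmetic in user space (a trivial standalone
example, nothing kernel-specific; the two array slots stand in for the
two per-CPU counters):

	#include <assert.h>

	int main(void)
	{
		unsigned int ctr[2] = { 0, 0 };

		ctr[0]++;	/* get() on CPU_0 */
		ctr[1]--;	/* put() on CPU_1, wraps to UINT_MAX */

		/* ctr[1] has wrapped, but the sum wraps right back to 0 */
		assert(ctr[0] + ctr[1] == 0);
		return 0;
	}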

But probably you meant another thing,

> Which might be provided by limited number of
> tasks, or, on 64-bit systems, 64-bit counters.

perhaps you meant that max_threads * max_depth can overflow the counter?
I don't think so... but OK, perhaps this counter should be u_long.

But how cpuhp_seq can help?

Oleg.

2013-10-01 15:25:05

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Oct 01, 2013 at 04:48:20PM +0200, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 07:45:37AM -0700, Paul E. McKenney wrote:
> > If you don't have cpuhp_seq, you need some other way to avoid
> > counter overflow. Which might be provided by limited number of
> > tasks, or, on 64-bit systems, 64-bit counters.
>
> How so? PID space is basically limited to 30 bits, so how could we
> overflow a 32bit reference counter?

Nesting.

Thanx, Paul

2013-10-01 15:38:41

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Sun, Sep 29, 2013 at 03:56:46PM +0200, Oleg Nesterov wrote:
> On 09/27, Oleg Nesterov wrote:
> >
> > I tried hard to find any hole in this version but failed, I believe it
> > is correct.
>
> And I still believe it is. But now I am starting to think that we
> don't need cpuhp_seq. (and imo cpuhp_waitcount, but this is minor).

Here is one scenario that I believe requires cpuhp_seq:

1. Task 0 on CPU 0 increments its counter on entry.

2. Task 1 on CPU 1 starts summing the counters and gets to
CPU 4. The sum thus far is 1 (Task 0).

3. Task 2 on CPU 2 increments its counter on entry.
Upon completing its entry code, it re-enables preemption.

4. Task 2 is preempted, and starts running on CPU 5.

5. Task 2 decrements its counter on exit.

6. Task 1 continues summing. Due to the fact that it saw Task 2's
exit but not its entry, the sum is zero.

One of cpuhp_seq's jobs is to prevent this scenario.
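
To make that concrete, this is the check with the sequence count kept in
(just the code from Peter's patch with the cpuhp_seq lines uncommented,
nothing new):

	static bool cpuhp_readers_active_check(void)
	{
		unsigned int seq = per_cpu_sum(cpuhp_seq);

		smp_mb(); /* B matches A */

		if (per_cpu_sum(__cpuhp_refcount) != 0)
			return false;

		smp_mb(); /* D matches C */

		/*
		 * Task 2's entry in the scenario above bumps cpuhp_seq after
		 * the first sum, so the two sums differ and we report the
		 * readers as still active instead of trusting the bogus zero.
		 */
		return per_cpu_sum(cpuhp_seq) == seq;
	}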

That said, bozo here still hasn't gotten to look at Peter's newest patch,
so perhaps it prevents this scenario some other way, perhaps by your
argument below.

> > We need to ensure 2 things:
> >
> > 1. The reader should notice state = BLOCK or the writer should see
> > inc(__cpuhp_refcount). This is guaranteed by 2 mb's in
> > __get_online_cpus() and in cpu_hotplug_begin().
> >
> > We do not care if the writer misses some inc(__cpuhp_refcount)
> > in per_cpu_sum(__cpuhp_refcount), that reader(s) should notice
> > state = readers_block (and inc(cpuhp_seq) can't help anyway).
>
> Yes!

OK, I will look over the patch with this in mind.

> > 2. If the writer sees the result of this_cpu_dec(__cpuhp_refcount)
> > from __put_online_cpus() (note that the writer can miss the
> > corresponding inc() if it was done on another CPU, so this dec()
> > can lead to sum() == 0),
>
> But this can't happen in this version? Somehow I forgot that
> __get_online_cpus() does inc/get under preempt_disable(), always on
> the same CPU. And thanks to mb's the writer should not miss the
> reader which has already passed the "state != BLOCK" check.
>
> To simplify the discussion, lets ignore the "readers_fast" state,
> synchronize_sched() logic looks obviously correct. IOW, lets discuss
> only the SLOW -> BLOCK transition.
>
> cpu_hotplug_begin()
> {
> state = BLOCK;
>
> mb();
>
> wait_event(cpuhp_writer,
> per_cpu_sum(__cpuhp_refcount) == 0);
> }
>
> should work just fine? Ignoring all details, we have
>
> get_online_cpus()
> {
> again:
> preempt_disable();
>
> __this_cpu_inc(__cpuhp_refcount);
>
> mb();
>
> if (state == BLOCK) {
>
> mb();
>
> __this_cpu_dec(__cpuhp_refcount);
> wake_up_all(cpuhp_writer);
>
> preempt_enable();
> wait_event(state != BLOCK);
> goto again;
> }
>
> preempt_enable();
> }
>
> It seems to me that these mb's guarantee all we need, no?
>
> It looks really simple. The reader can only succeed if it doesn't see
> BLOCK, in this case per_cpu_sum() should see the change,
>
> We have
>
> WRITER READER on CPU X
>
> state = BLOCK; __cpuhp_refcount[X]++;
>
> mb(); mb();
>
> ...
> count += __cpuhp_refcount[X]; if (state != BLOCK)
> ... return;
>
> mb();
> __cpuhp_refcount[X]--;
>
> Either reader or writer should notice the STORE we care about.
>
> If a reader can decrement __cpuhp_refcount, we have 2 cases:
>
> 1. It is the reader holding this lock. In this case we
> can't miss the corresponding inc() done by this reader,
> because this reader didn't see BLOCK in the past.
>
> It is just the
>
> A == B == 0
> CPU_0 CPU_1
> ----- -----
> A = 1; B = 1;
> mb(); mb();
> b = B; a = A;
>
> pattern, at least one CPU should see 1 in its a/b.
>
> 2. It is the reader which tries to take this lock and
> noticed state == BLOCK. We could miss the result of
> its inc(), but we do not care, this reader is going
> to block.
>
> _If_ the reader could migrate between inc/dec, then
> yes, we have a problem. Because that dec() could make
> the result of per_cpu_sum() = 0. IOW, we could miss
> inc() but notice dec(). But given that it does this
> on the same CPU this is not possible.
>
> So why do we need cpuhp_seq?

Good question, I will look again.

Thanx, Paul

2013-10-01 15:42:00

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/01, Paul E. McKenney wrote:
>
> On Tue, Oct 01, 2013 at 04:48:20PM +0200, Peter Zijlstra wrote:
> > On Tue, Oct 01, 2013 at 07:45:37AM -0700, Paul E. McKenney wrote:
> > > If you don't have cpuhp_seq, you need some other way to avoid
> > > counter overflow. Which might be provided by limited number of
> > > tasks, or, on 64-bit systems, 64-bit counters.
> >
> > How so? PID space is basically limited to 30 bits, so how could we
> > overflow a 32bit reference counter?
>
> Nesting.

Still it seems that UINT_MAX / PID_MAX_LIMIT has enough room.

But again, OK, let's make it ulong. The question is how cpuhp_seq can
help and why we can't kill it.

Oleg.

2013-10-01 15:47:09

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/01, Paul E. McKenney wrote:
>
> On Sun, Sep 29, 2013 at 03:56:46PM +0200, Oleg Nesterov wrote:
> > On 09/27, Oleg Nesterov wrote:
> > >
> > > I tried hard to find any hole in this version but failed, I believe it
> > > is correct.
> >
> > And I still believe it is. But now I am starting to think that we
> > don't need cpuhp_seq. (and imo cpuhp_waitcount, but this is minor).
>
> Here is one scenario that I believe requires cpuhp_seq:
>
> 1. Task 0 on CPU 0 increments its counter on entry.
>
> 2. Task 1 on CPU 1 starts summing the counters and gets to
> CPU 4. The sum thus far is 1 (Task 0).
>
> 3. Task 2 on CPU 2 increments its counter on entry.
> Upon completing its entry code, it re-enables preemption.

afaics at this stage it should notice state = BLOCK and decrement
the same counter on the same CPU before it does preempt_enable().

Because:

> > 2. It is the reader which tries to take this lock and
> > noticed state == BLOCK. We could miss the result of
> > its inc(), but we do not care, this reader is going
> > to block.
> >
> > _If_ the reader could migrate between inc/dec, then
> > yes, we have a problem. Because that dec() could make
> > the result of per_cpu_sum() = 0. IOW, we could miss
> > inc() but notice dec(). But given that it does this
> > on the same CPU this is not possible.
> >
> > So why do we need cpuhp_seq?
>
> Good question, I will look again.

Thanks! much appreciated.

Oleg.

2013-10-01 17:15:41

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/01/2013 01:41 AM, Rafael J. Wysocki wrote:
> On Saturday, September 28, 2013 06:31:04 PM Oleg Nesterov wrote:
>> On 09/28, Peter Zijlstra wrote:
>>>
>>> On Sat, Sep 28, 2013 at 02:48:59PM +0200, Oleg Nesterov wrote:
>>>
>>>> Please note that this wait_event() adds a problem... it doesn't allow
>>>> to "offload" the final synchronize_sched(). Suppose a 4k cpu machine
>>>> does disable_nonboot_cpus(), we do not want 2 * 4k * synchronize_sched's
>>>> in this case. We can solve this, but this wait_event() complicates
>>>> the problem.
>>>
>>> That seems like a particularly easy fix; something like so?
>>
>> Yes, but...
>>
>>> @@ -586,6 +603,11 @@ int disable_nonboot_cpus(void)
>>>
>>> + cpu_hotplug_done();
>>> +
>>> + for_each_cpu(cpu, frozen_cpus)
>>> + cpu_notify_nofail(CPU_POST_DEAD_FROZEN, (void*)(long)cpu);
>>
>> This changes the protocol, I simply do not know if it is fine in general
>> to do __cpu_down(another_cpu) without CPU_POST_DEAD(previous_cpu). Say,
>> currently it is possible that CPU_DOWN_PREPARE takes some global lock
>> released by CPU_DOWN_FAILED or CPU_POST_DEAD.
>>
>> Hmm. Now that workqueues do not use CPU_POST_DEAD, it has only 2 users,
>> mce_cpu_callback() and cpufreq_cpu_callback() and the 1st one even ignores
>> this notification if FROZEN. So yes, probably this is fine, but needs an
>> ack from cpufreq maintainers (cc'ed), for example to ensure that it is
>> fine to call __cpufreq_remove_dev_prepare() twice without _finish().
>
> To my eyes it will return -EBUSY when it tries to stop an already stopped
> governor, which will cause the entire chain to fail I guess.
>
> Srivatsa has touched that code most recently, so he should know better, though.
>

Yes it will return -EBUSY, but unfortunately it gets scarier from that
point onwards. When it gets an -EBUSY, __cpufreq_remove_dev_prepare() aborts
its work mid-way and returns, but doesn't bubble up the error to the CPU-hotplug
core. So the CPU hotplug code will continue to take that CPU down, with
further notifications such as CPU_DEAD, and chaos will ensue.

And we can't exactly "fix" this by simply returning the error code to CPU-hotplug
(since that would mean that suspend/resume would _always_ fail). Perhaps we can
teach cpufreq to ignore the error in this particular case (since the governor has
already been stopped and that's precisely what this function wanted to do as well),
but the problems don't seem to end there.

The other issue is that the CPUs in the policy->cpus mask are removed in the
_dev_finish() stage. So if that stage is postponed like this, then _dev_prepare()
will get thoroughly confused since it also depends on seeing an updated
policy->cpus mask to decide when to nominate a new policy->cpu etc. (And the
cpu nomination code itself might start ping-ponging between CPUs, since none of
the CPUs would have been removed from the policy->cpus mask).

So, to summarize, this change to CPU hotplug code will break cpufreq (and
suspend/resume) as things stand today, but I don't think these problems are
insurmountable though..

However, as Oleg said, its definitely worth considering whether this proposed
change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
proved to be very useful in certain challenging situations (commit 1aee40ac9c
explains one such example), so IMHO we should be very careful not to undermine
its utility.

Regards,
Srivatsa S. Bhat

2013-10-01 17:36:37

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
> However, as Oleg said, its definitely worth considering whether this proposed
> change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
> proved to be very useful in certain challenging situations (commit 1aee40ac9c
> explains one such example), so IMHO we should be very careful not to undermine
> its utility.

Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
called at some time after the unplug' with no further guarantees. And my
patch preserves that.

Its not at all clear to me why cpufreq needs more; 1aee40ac9c certainly
doesn't explain it.

What's wrong with leaving a cleanup handle in percpu storage and
effectively doing:

struct cpu_destroy {
void (*destroy)(void *);
void *arg;
};

DEFINE_PER_CPU(struct cpu_destroy, cpu_destroy);

POST_DEAD:
{
struct cpu_destroy x = per_cpu(cpu_destroy, cpu);
if (x.destroy)
x.destroy(x.arg);
}
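
The CPU_DOWN_PREPARE (or CPU_DEAD) side would stash the handle along
these lines (the cpufreq_late_cleanup() name and the policy argument are
made up, purely for illustration):

	case CPU_DOWN_PREPARE:
		per_cpu(cpu_destroy, cpu) = (struct cpu_destroy) {
			.destroy = cpufreq_late_cleanup,
			.arg	 = policy,
		};
		break;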

POST_DEAD cannot fail; so CPU_DEAD/CPU_DOWN_PREPARE can simply assume it
will succeed; it has to.

The cpufreq situation simply doesn't make any kind of sense to me.

2013-10-01 17:52:27

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/01, Peter Zijlstra wrote:
>
> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
> > However, as Oleg said, its definitely worth considering whether this proposed
> > change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
> > proved to be very useful in certain challenging situations (commit 1aee40ac9c
> > explains one such example), so IMHO we should be very careful not to undermine
> > its utility.
>
> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
> called at some time after the unplug' with no further guarantees. And my
> patch preserves that.

I tend to agree with Srivatsa... Without a strong reason it would be better
to preserve the current logic: "some time after" should not be after the
next CPU_DOWN/UP*. But I won't argue too much.

But note that you do not strictly need this change. Just kill cpuhp_waitcount,
then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
another thread, this should likely "join" all synchronize_sched's.

Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
of for_each_online_cpu().
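
Roughly, in disable_nonboot_cpus(), something like the following; only a
sketch of the shape, the cpuhp_readers_*() helper names are made up:

	cpuhp_readers_slow();		/* FAST -> SLOW, the one synchronize_sched() */

	for_each_online_cpu(cpu) {
		if (cpu == first_cpu)
			continue;
		cpuhp_readers_block();	/* SLOW -> BLOCK, wait out active readers */
		error = _cpu_down(cpu, 1);
		cpuhp_readers_unblock();
	}

	cpuhp_readers_fast();		/* back to FAST */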

Oleg.

2013-10-01 17:56:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
> On 10/01, Peter Zijlstra wrote:
> >
> > On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
> > > However, as Oleg said, its definitely worth considering whether this proposed
> > > change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
> > > proved to be very useful in certain challenging situations (commit 1aee40ac9c
> > > explains one such example), so IMHO we should be very careful not to undermine
> > > its utility.
> >
> > Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
> > called at some time after the unplug' with no further guarantees. And my
> > patch preserves that.
>
> I tend to agree with Srivatsa... Without a strong reason it would be better
> to preserve the current logic: "some time after" should not be after the
> next CPU_DOWN/UP*. But I won't argue too much.

Nah, I think breaking it is the right thing :-)

> But note that you do not strictly need this change. Just kill cpuhp_waitcount,
> then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
> another thread, this should likely "join" all synchronize_sched's.

That would still be 4k * sync_sched() == terribly long.

> Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
> SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
> of for_each_online_cpu().

Right, that's more messy but would work if we cannot teach cpufreq (and
possibly others) to not rely on state you shouldn't rely on anyway.

I think the only guarantee POST_DEAD should have is that it should be
called before UP_PREPARE of the same cpu ;-) Nothing more, nothing less.

2013-10-01 18:15:25

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/01, Peter Zijlstra wrote:
>
> On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
> >
> > I tend to agree with Srivatsa... Without a strong reason it would be better
> > to preserve the current logic: "some time after" should not be after the
> > next CPU_DOWN/UP*. But I won't argue too much.
>
> Nah, I think breaking it is the right thing :-)

I don't really agree but I won't argue ;)

> > But note that you do not strictly need this change. Just kill cpuhp_waitcount,
> > then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
> > another thread, this should likely "join" all synchronize_sched's.
>
> That would still be 4k * sync_sched() == terribly long.

No? the next xxx_enter() avoids sync_sched() if rcu callback is still
pending. Unless __cpufreq_remove_dev_finish() is "too slow" of course.

> > Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
> > SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
> > of for_each_online_cpu().
>
> Right, that's more messy but would work if we cannot teach cpufreq (and
> possibly others) to not rely on state you shouldn't rely on anyway.

Yes,

> I think the only guarantee POST_DEAD should have is that it should be
> called before UP_PREPARE of the same cpu ;-) Nothing more, nothing less.

See above... This makes POST_DEAD really "special" compared to other
CPU_* events.

And again. Something like a global lock taken by CPU_DOWN_PREPARE and
released by POST_DEAD or DOWN_FAILED does not look "too wrong" to me.

But I leave this to you and Srivatsa.

Oleg.

2013-10-01 18:19:14

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/01/2013 11:06 PM, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
>> However, as Oleg said, its definitely worth considering whether this proposed
>> change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
>> proved to be very useful in certain challenging situations (commit 1aee40ac9c
>> explains one such example), so IMHO we should be very careful not to undermine
>> its utility.
>
> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
> called at some time after the unplug' with no further guarantees. And my
> patch preserves that.
>
> Its not at all clear to me why cpufreq needs more; 1aee40ac9c certainly
> doesn't explain it.
>

Sorry if I was unclear - I didn't mean to say that cpufreq needs more guarantees
than that. I was just saying that the cpufreq code would need certain additional
changes/restructuring to accommodate the change in the semantics brought about
by this patch. IOW, it won't work as it is, but it can certainly be fixed.

My other point (unrelated to cpufreq) was this: POST_DEAD of course means
that it will be called after unplug, with hotplug lock dropped. But it also
provides the guarantee (in the existing code), that a *new* hotplug operation
won't start until the POST_DEAD stage is also completed. This patch doesn't seem
to honor that part. The concern I have is in cases like those mentioned by
Oleg - say you take a lock at DOWN_PREPARE and want to drop it at POST_DEAD;
or some other requirement that makes it important to finish a full hotplug cycle
before moving on to the next one. I don't really have such a requirement in mind
at present, but I was just trying to think what we would be losing with this
change...

But to reiterate, I believe cpufreq can be reworked so that it doesn't depend
on things such as the above. But I wonder if dropping that latter guarantee
is going to be OK, going forward.

Regards,
Srivatsa S. Bhat

> What's wrong with leaving a cleanup handle in percpu storage and
> effectively doing:
>
> struct cpu_destroy {
> void (*destroy)(void *);
> void *arg;
> };
>
> DEFINE_PER_CPU(struct cpu_destroy, cpu_destroy);
>
> POST_DEAD:
> {
> struct cpu_destroy x = per_cpu(cpu_destroy, cpu);
> if (x.destroy)
> x.destroy(x.arg);
> }
>
> POST_DEAD cannot fail; so CPU_DEAD/CPU_DOWN_PREPARE can simply assume it
> will succeed; it has to.
>
> The cpufreq situation simply doesn't make any kind of sense to me.
>
>

2013-10-01 19:01:14

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/01/2013 11:44 PM, Srivatsa S. Bhat wrote:
> On 10/01/2013 11:06 PM, Peter Zijlstra wrote:
>> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
>>> However, as Oleg said, its definitely worth considering whether this proposed
>>> change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
>>> proved to be very useful in certain challenging situations (commit 1aee40ac9c
>>> explains one such example), so IMHO we should be very careful not to undermine
>>> its utility.
>>
>> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
>> called at some time after the unplug' with no further guarantees. And my
>> patch preserves that.
>>
>> Its not at all clear to me why cpufreq needs more; 1aee40ac9c certainly
>> doesn't explain it.
>>
>
> Sorry if I was unclear - I didn't mean to say that cpufreq needs more guarantees
> than that. I was just saying that the cpufreq code would need certain additional
> changes/restructuring to accommodate the change in the semantics brought about
> by this patch. IOW, it won't work as it is, but it can certainly be fixed.
>

And an important reason why this change can be accommodated without much
trouble is that you are changing it only in the suspend/resume path, where
userspace has already been frozen, so all hotplug operations are initiated by
the suspend path and that path *alone* (and so we enjoy certain "simplifiers" that
we know before-hand, eg: all of them are CPU offline operations, happening one at
a time, in sequence) and we don't expect any "interference" with this routine ;-).
As a result the number and variety of races that we need to take care of tend to
be far fewer. (For example, we don't have to worry about the deadlock caused by
sysfs-writes that 1aee40ac9c was talking about).

On the other hand, if the proposal was to change the regular hotplug path as well
on the same lines, then I guess it would have been a little more difficult to
adjust to it. For example, in cpufreq, _dev_prepare() sends a STOP to the governor,
whereas a part of _dev_finish() sends a START to it; so we might have races there,
due to which we might proceed with CPU offline with a running governor, depending
on the exact timing of the events. Of course, this problem doesn't occur in the
suspend/resume case, and hence I didn't bring it up in my previous mail.

So this is another reason why I'm a little concerned about POST_DEAD: since this
is a change in semantics, it might be worth asking ourselves whether we'd still
want to go with that change, if we happened to be changing regular hotplug as
well, rather than just the more controlled environment of suspend/resume.
Yes, I know that's not what you proposed, but I feel it might be worth considering
its implications while deciding how to solve the POST_DEAD issue.

Regards,
Srivatsa S. Bhat

2013-10-01 19:05:25

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Oct 01, 2013 at 08:07:50PM +0200, Oleg Nesterov wrote:
> On 10/01, Peter Zijlstra wrote:
> >
> > On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
> > >
> > > I tend to agree with Srivatsa... Without a strong reason it would be better
> > > to preserve the current logic: "some time after" should not be after the
> > > next CPU_DOWN/UP*. But I won't argue too much.
> >
> > Nah, I think breaking it is the right thing :-)
>
> I don't really agree but I won't argue ;)

The authors of arch/x86/kernel/cpu/mcheck/mce.c would seem to be the
guys who would need to complain, given that they seem to have the only
use in 3.11.

Thanx, Paul

> > > But note that you do not strictly need this change. Just kill cpuhp_waitcount,
> > > then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
> > > another thread, this should likely "join" all synchronize_sched's.
> >
> > That would still be 4k * sync_sched() == terribly long.
>
> No? the next xxx_enter() avoids sync_sched() if rcu callback is still
> pending. Unless __cpufreq_remove_dev_finish() is "too slow" of course.
>
> > > Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
> > > SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
> > > of for_each_online_cpu().
> >
> > Right, that's more messy but would work if we cannot teach cpufreq (and
> > possibly others) to not rely on state you shouldn't rely on anyway.
>
> Yes,
>
> > I think the only guarantee POST_DEAD should have is that it should be
> > called before UP_PREPARE of the same cpu ;-) Nothing more, nothing less.
>
> See above... This makes POST_DEAD really "special" compared to other
> CPU_* events.
>
> And again. Something like a global lock taken by CPU_DOWN_PREPARE and
> released by POST_DEAD or DOWN_FAILED does not look "too wrong" to me.
>
> But I leave this to you and Srivatsa.
>
> Oleg.
>

2013-10-01 19:08:05

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/01/2013 11:26 PM, Peter Zijlstra wrote:
> On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
>> On 10/01, Peter Zijlstra wrote:
>>>
>>> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
>>>> However, as Oleg said, its definitely worth considering whether this proposed
>>>> change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
>>>> proved to be very useful in certain challenging situations (commit 1aee40ac9c
>>>> explains one such example), so IMHO we should be very careful not to undermine
>>>> its utility.
>>>
>>> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
>>> called at some time after the unplug' with no further guarantees. And my
>>> patch preserves that.
>>
>> I tend to agree with Srivatsa... Without a strong reason it would be better
>> to preserve the current logic: "some time after" should not be after the
>> next CPU_DOWN/UP*. But I won't argue too much.
>
> Nah, I think breaking it is the right thing :-)
>
>> But note that you do not strictly need this change. Just kill cpuhp_waitcount,
>> then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
>> another thread, this should likely "join" all synchronize_sched's.
>
> That would still be 4k * sync_sched() == terribly long.
>
>> Or split cpu_hotplug_begin() into 2 helpers which handle FAST -> SLOW and
>> SLOW -> BLOCK transitions, then move the first "FAST -> SLOW" handler outside
>> of for_each_online_cpu().
>
> Right, that's more messy but would work if we cannot teach cpufreq (and
> possibly others) to not rely on state you shouldn't rely on anyway.
>
> I think the only guarantee POST_DEAD should have is that it should be
> called before UP_PREPARE of the same cpu ;-) Nothing more, nothing less.
>

Conceptually, that hints at a totally per-cpu implementation of CPU hotplug,
in which what happens to one CPU doesn't affect the others in the hotplug
path.. and yeah, that sounds very tempting! ;-) but I guess that will
need to be preceded by a massive rework of many of the existing hotplug
callbacks ;-)

Regards,
Srivatsa S. Bhat

2013-10-01 20:40:19

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Thu, Sep 26, 2013 at 01:10:42PM +0200, Peter Zijlstra wrote:
> On Wed, Sep 25, 2013 at 02:22:00PM -0700, Paul E. McKenney wrote:
> > A couple of nits and some commentary, but if there are races, they are
> > quite subtle. ;-)
>
> *whee*..
>
> I made one little change in the logic; I moved the waitcount increment
> to before the __put_online_cpus() call, such that the writer will have
> to wait for us to wake up before trying again -- not for us to actually
> have acquired the read lock, for that we'd need to mess up
> __get_online_cpus() a bit more.
>
> Complete patch below.

OK, looks like Oleg is correct, the cpuhp_seq can be dispensed with.

I still don't see anything wrong with it, so time for a serious stress
test on a large system. ;-)

Additional commentary interspersed.

Thanx, Paul

> ---
> Subject: hotplug: Optimize {get,put}_online_cpus()
> From: Peter Zijlstra <[email protected]>
> Date: Tue Sep 17 16:17:11 CEST 2013
>
> The current implementation of get_online_cpus() is global of nature
> and thus not suited for any kind of common usage.
>
> Re-implement the current recursive r/w cpu hotplug lock such that the
> read side locks are as light as possible.
>
> The current cpu hotplug lock is entirely reader biased; but since
> readers are expensive there aren't a lot of them about and writer
> starvation isn't a particular problem.
>
> However by making the reader side more usable there is a fair chance
> it will get used more and thus the starvation issue becomes a real
> possibility.
>
> Therefore this new implementation is fair, alternating readers and
> writers; this however requires per-task state to allow the reader
> recursion.
>
> Many comments are contributed by Paul McKenney, and many previous
> attempts were shown to be inadequate by both Paul and Oleg; many
> thanks to them for persisting to poke holes in my attempts.
>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Paul McKenney <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Steven Rostedt <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> include/linux/cpu.h | 58 +++++++++++++
> include/linux/sched.h | 3
> kernel/cpu.c | 209 +++++++++++++++++++++++++++++++++++---------------
> kernel/sched/core.c | 2
> 4 files changed, 208 insertions(+), 64 deletions(-)

I stripped the removed lines to keep my eyes from going buggy.

> --- a/include/linux/cpu.h
> +++ b/include/linux/cpu.h
> @@ -16,6 +16,7 @@
> #include <linux/node.h>
> #include <linux/compiler.h>
> #include <linux/cpumask.h>
> +#include <linux/percpu.h>
>
> struct device;
>
> @@ -173,10 +174,61 @@ extern struct bus_type cpu_subsys;
> #ifdef CONFIG_HOTPLUG_CPU
> /* Stop CPUs going up and down. */
>
> +extern void cpu_hotplug_init_task(struct task_struct *p);
> +
> extern void cpu_hotplug_begin(void);
> extern void cpu_hotplug_done(void);
> +
> +extern int __cpuhp_state;
> +DECLARE_PER_CPU(unsigned int, __cpuhp_refcount);
> +
> +extern void __get_online_cpus(void);
> +
> +static inline void get_online_cpus(void)
> +{
> + might_sleep();
> +
> + /* Support reader recursion */
> + /* The value was >= 1 and remains so, reordering causes no harm. */
> + if (current->cpuhp_ref++)
> + return;
> +
> + preempt_disable();
> + if (likely(!__cpuhp_state)) {
> + /* The barrier here is supplied by synchronize_sched(). */

I guess I shouldn't complain about the comment given where it came
from, but...

A more accurate comment would say that we are in an RCU-sched read-side
critical section, so the writer cannot both change __cpuhp_state from
readers_fast and start checking counters while we are here. So if we see
!__cpuhp_state, we know that the writer won't be checking until we are past
the preempt_enable() and that once the synchronize_sched() is done,
the writer will see anything we did within this RCU-sched read-side
critical section.

(The writer -can- change __cpuhp_state from readers_slow to readers_block
while we are in this read-side critical section and then start summing
counters, but that corresponds to a different "if" statement.)

> + __this_cpu_inc(__cpuhp_refcount);
> + } else {
> + __get_online_cpus(); /* Unconditional memory barrier. */
> + }
> + preempt_enable();
> + /*
> + * The barrier() from preempt_enable() prevents the compiler from
> + * bleeding the critical section out.
> + */
> +}
> +
> +extern void __put_online_cpus(void);
> +
> +static inline void put_online_cpus(void)
> +{
> + /* The value was >= 1 and remains so, reordering causes no harm. */
> + if (--current->cpuhp_ref)
> + return;
> +
> + /*
> + * The barrier() in preempt_disable() prevents the compiler from
> + * bleeding the critical section out.
> + */
> + preempt_disable();
> + if (likely(!__cpuhp_state)) {
> + /* The barrier here is supplied by synchronize_sched(). */

Same here, both for the implied self-criticism and the more complete story.

Due to the basic RCU guarantee, the writer cannot both change __cpuhp_state
and start checking counters while we are in this RCU-sched read-side
critical section. And again, if the synchronize_sched() had to wait on
us (or if we were early enough that no waiting was needed), then once
the synchronize_sched() completes, the writer will see anything that we
did within this RCU-sched read-side critical section.

> + __this_cpu_dec(__cpuhp_refcount);
> + } else {
> + __put_online_cpus(); /* Unconditional memory barrier. */
> + }
> + preempt_enable();
> +}
> +
> extern void cpu_hotplug_disable(void);
> extern void cpu_hotplug_enable(void);
> #define hotcpu_notifier(fn, pri) cpu_notifier(fn, pri)
> @@ -200,6 +252,8 @@ static inline void cpu_hotplug_driver_un
>
> #else /* CONFIG_HOTPLUG_CPU */
>
> +static inline void cpu_hotplug_init_task(struct task_struct *p) {}
> +
> static inline void cpu_hotplug_begin(void) {}
> static inline void cpu_hotplug_done(void) {}
> #define get_online_cpus() do { } while (0)
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1454,6 +1454,9 @@ struct task_struct {
> unsigned int sequential_io;
> unsigned int sequential_io_avg;
> #endif
> +#ifdef CONFIG_HOTPLUG_CPU
> + int cpuhp_ref;
> +#endif
> };
>
> /* Future-safe accessor for struct task_struct's cpus_allowed. */
> --- a/kernel/cpu.c
> +++ b/kernel/cpu.c
> @@ -49,88 +49,173 @@ static int cpu_hotplug_disabled;
>
> #ifdef CONFIG_HOTPLUG_CPU
>
> +enum { readers_fast = 0, readers_slow, readers_block };
> +
> +int __cpuhp_state;
> +EXPORT_SYMBOL_GPL(__cpuhp_state);
> +
> +DEFINE_PER_CPU(unsigned int, __cpuhp_refcount);
> +EXPORT_PER_CPU_SYMBOL_GPL(__cpuhp_refcount);
> +
> +static DEFINE_PER_CPU(unsigned int, cpuhp_seq);
> +static atomic_t cpuhp_waitcount;
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_readers);
> +static DECLARE_WAIT_QUEUE_HEAD(cpuhp_writer);
> +
> +void cpu_hotplug_init_task(struct task_struct *p)
> +{
> + p->cpuhp_ref = 0;
> +}
> +
> +void __get_online_cpus(void)
> +{
> +again:
> + /* See __srcu_read_lock() */
> + __this_cpu_inc(__cpuhp_refcount);
> + smp_mb(); /* A matches B, E */
> + // __this_cpu_inc(cpuhp_seq);

Deleting the above per Oleg's suggestion. We still need the preceding
memory barrier.

> +
> + if (unlikely(__cpuhp_state == readers_block)) {
> + /*
> + * Make sure an outgoing writer sees the waitcount to ensure
> + * we make progress.
> + */
> + atomic_inc(&cpuhp_waitcount);
> + __put_online_cpus();

The decrement happens on the same CPU as the increment, avoiding the
increment-on-one-CPU-and-decrement-on-another problem.

And yes, if the reader misses the writer's assignment of readers_block
to __cpuhp_state, then the writer is guaranteed to see the reader's
increment. Conversely, any readers that increment their __cpuhp_refcount
after the writer looks are guaranteed to see the readers_block value,
which in turn means that they are guaranteed to immediately decrement
their __cpuhp_refcount, so that it doesn't matter that the writer
missed them.

Unfortunately, this trick does not apply back to SRCU, at least not
without adding a second memory barrier to the srcu_read_lock() path
(one to separate reading the index from incrementing the counter and
another to separate incrementing the counter from the critical section).
Can't have everything, I guess!

> +
> + /*
> + * We either call schedule() in the wait, or we'll fall through
> + * and reschedule on the preempt_enable() in get_online_cpus().
> + */
> + preempt_enable_no_resched();
> + __wait_event(cpuhp_readers, __cpuhp_state != readers_block);
> + preempt_disable();
> +
> + if (atomic_dec_and_test(&cpuhp_waitcount))
> + wake_up_all(&cpuhp_writer);

I still don't see why this is a wake_up_all() given that there can be
only one writer. Not that it makes much difference, but...

> +
> + goto again;
> + }
> +}
> +EXPORT_SYMBOL_GPL(__get_online_cpus);
>
> +void __put_online_cpus(void)
> {
> + /* See __srcu_read_unlock() */
> + smp_mb(); /* C matches D */
> + /*
> + * In other words, if they see our decrement (presumably to aggregate
> + * zero, as that is the only time it matters) they will also see our
> + * critical section.
> + */
> + this_cpu_dec(__cpuhp_refcount);
>
> + /* Prod writer to recheck readers_active */
> + wake_up_all(&cpuhp_writer);
> }
> +EXPORT_SYMBOL_GPL(__put_online_cpus);
> +
> +#define per_cpu_sum(var) \
> +({ \
> + typeof(var) __sum = 0; \
> + int cpu; \
> + for_each_possible_cpu(cpu) \
> + __sum += per_cpu(var, cpu); \
> + __sum; \
> +})
>
> +/*
> + * See srcu_readers_active_idx_check() for a rather more detailed explanation.
> + */
> +static bool cpuhp_readers_active_check(void)
> {
> + // unsigned int seq = per_cpu_sum(cpuhp_seq);

Delete the above per Oleg's suggestion.

> +
> + smp_mb(); /* B matches A */
> +
> + /*
> + * In other words, if we see __get_online_cpus() cpuhp_seq increment,
> + * we are guaranteed to also see its __cpuhp_refcount increment.
> + */
>
> + if (per_cpu_sum(__cpuhp_refcount) != 0)
> + return false;
>
> + smp_mb(); /* D matches C */
>
> + /*
> + * On equality, we know that there could not be any "sneak path" pairs
> + * where we see a decrement but not the corresponding increment for a
> + * given reader. If we saw its decrement, the memory barriers guarantee
> + * that we now see its cpuhp_seq increment.
> + */
> +
> + // return per_cpu_sum(cpuhp_seq) == seq;

Delete the above per Oleg's suggestion, but actually need to replace with
"return true;". We should be able to get rid of the first memory barrier
(B matches A) because the smp_mb() in cpu_hotplug_begin() covers it, but we
cannot get rid of the second memory barrier (D matches C).

> }
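
So, as I read it, after those deletions the check collapses to something
like this (only restating the comments above, not a new patch):

	static bool cpuhp_readers_active_check(void)
	{
		/* A/B pairing is now covered by the smp_mb() in cpu_hotplug_begin() */
		if (per_cpu_sum(__cpuhp_refcount) != 0)
			return false;

		smp_mb(); /* D matches C */

		return true;
	}
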
>
> /*
> + * This will notify new readers to block and wait for all active readers to
> + * complete.
> */
> void cpu_hotplug_begin(void)
> {
> + /*
> + * Since cpu_hotplug_begin() is always called after invoking
> + * cpu_maps_update_begin(), we can be sure that only one writer is
> + * active.
> + */
> + lockdep_assert_held(&cpu_add_remove_lock);
>
> + /* Allow reader-in-writer recursion. */
> + current->cpuhp_ref++;
> +
> + /* Notify readers to take the slow path. */
> + __cpuhp_state = readers_slow;
> +
> + /* See percpu_down_write(); guarantees all readers take the slow path */
> + synchronize_sched();
> +
> + /*
> + * Notify new readers to block; up until now, and thus throughout the
> + * longish synchronize_sched() above, new readers could still come in.
> + */
> + __cpuhp_state = readers_block;
> +
> + smp_mb(); /* E matches A */
> +
> + /*
> + * If they don't see our writer of readers_block to __cpuhp_state,
> + * then we are guaranteed to see their __cpuhp_refcount increment, and
> + * therefore will wait for them.
> + */
> +
> + /* Wait for all now active readers to complete. */
> + wait_event(cpuhp_writer, cpuhp_readers_active_check());
> }
>
> void cpu_hotplug_done(void)
> {
> + /* Signal the writer is done, no fast path yet. */
> + __cpuhp_state = readers_slow;
> + wake_up_all(&cpuhp_readers);

And one reason that we cannot just immediately flip to readers_fast
is that new readers might fail to see the results of this writer's
critical section.

> +
> + /*
> + * The wait_event()/wake_up_all() prevents the race where the readers
> + * are delayed between fetching __cpuhp_state and blocking.
> + */
> +
> + /* See percpu_up_write(); readers will no longer attempt to block. */
> + synchronize_sched();
> +
> + /* Let 'em rip */
> + __cpuhp_state = readers_fast;
> + current->cpuhp_ref--;
> +
> + /*
> + * Wait for any pending readers to be running. This ensures readers
> + * after writer and avoids writers starving readers.
> + */
> + wait_event(cpuhp_writer, !atomic_read(&cpuhp_waitcount));
> }
>
> /*
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -1736,6 +1736,8 @@ static void __sched_fork(unsigned long c
> INIT_LIST_HEAD(&p->numa_entry);
> p->numa_group = NULL;
> #endif /* CONFIG_NUMA_BALANCING */
> +
> + cpu_hotplug_init_task(p);
> }
>
> #ifdef CONFIG_NUMA_BALANCING
>

2013-10-02 09:09:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Tue, Oct 01, 2013 at 08:07:50PM +0200, Oleg Nesterov wrote:
> > > But note that you do not strictly need this change. Just kill cpuhp_waitcount,
> > > then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
> > > another thread, this should likely "join" all synchronize_sched's.
> >
> > That would still be 4k * sync_sched() == terribly long.
>
> No? the next xxx_enter() avoids sync_sched() if rcu callback is still
> pending. Unless __cpufreq_remove_dev_finish() is "too slow" of course.

Hmm, not in the version you posted; there xxx_enter() would only not do
the sync_sched if there's a concurrent 'writer', in which case it will
wait for it.

You only avoid the sync_sched in xxx_exit() and potentially join in the
sync_sched() of a next xxx_begin().

So with that scheme:

for (i = 0; i < 4096; i++) {
xxx_begin();
xxx_exit();
}

Will get 4096 sync_sched() calls from the xxx_begin() and all but the
last xxx_exit() will 'drop' the rcu callback.

And given the construct; I'm not entirely sure you can do away with the
sync_sched() in between. While its clear to me you can merge the two
into one; leaving it out entirely doesn't seem right.

2013-10-02 10:19:11

by Srivatsa S. Bhat

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/01/2013 11:44 PM, Srivatsa S. Bhat wrote:
> On 10/01/2013 11:06 PM, Peter Zijlstra wrote:
>> On Tue, Oct 01, 2013 at 10:41:15PM +0530, Srivatsa S. Bhat wrote:
>>> However, as Oleg said, its definitely worth considering whether this proposed
>>> change in semantics is going to hurt us in the future. CPU_POST_DEAD has certainly
>>> proved to be very useful in certain challenging situations (commit 1aee40ac9c
>>> explains one such example), so IMHO we should be very careful not to undermine
>>> its utility.
>>
>> Urgh.. crazy things. I've always understood POST_DEAD to mean 'will be
>> called at some time after the unplug' with no further guarantees. And my
>> patch preserves that.
>>
>> Its not at all clear to me why cpufreq needs more; 1aee40ac9c certainly
>> doesn't explain it.
>>
>
> Sorry if I was unclear - I didn't mean to say that cpufreq needs more guarantees
> than that. I was just saying that the cpufreq code would need certain additional
> changes/restructuring to accommodate the change in the semantics brought about
> by this patch. IOW, it won't work as it is, but it can certainly be fixed.
>


Ok, so I thought a bit more about the changes you are proposing, and I agree
that they would be beneficial in the long run, especially given that it can
eventually lead to a more stream-lined hotplug process where different CPUs
can be hotplugged independently without waiting on each other, like you
mentioned in your other mail. So I'm fine with the new POST_DEAD guarantees
you are proposing - that they are run after unplug, and will be completed
before UP_PREPARE of the same CPU. And its also very convenient that we need
to fix only cpufreq to accommodate this change.

So below is a quick untested patch that modifies the cpufreq hotplug
callbacks appropriately. With this, cpufreq should be able to handle the
POST_DEAD changes, irrespective of whether we do that in the regular path
or in the suspend/resume path. (Because, I've restructured it in such a way
that the races that I had mentioned earlier are totally avoided. That is,
the POST_DEAD handler now performs only the bare-minimal final cleanup, which
doesn't race with or depend on anything else).



diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 04548f7..0a33c1a 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -1165,7 +1165,7 @@ static int __cpufreq_remove_dev_prepare(struct device *dev,
 					bool frozen)
 {
 	unsigned int cpu = dev->id, cpus;
-	int new_cpu, ret;
+	int new_cpu, ret = 0;
 	unsigned long flags;
 	struct cpufreq_policy *policy;
 
@@ -1200,9 +1200,10 @@ static int __cpufreq_remove_dev_prepare(struct device *dev,
 				policy->governor->name, CPUFREQ_NAME_LEN);
 #endif
 
-	lock_policy_rwsem_read(cpu);
+	lock_policy_rwsem_write(cpu);
 	cpus = cpumask_weight(policy->cpus);
-	unlock_policy_rwsem_read(cpu);
+	cpumask_clear_cpu(cpu, policy->cpus);
+	unlock_policy_rwsem_write(cpu);
 
 	if (cpu != policy->cpu) {
 		if (!frozen)
@@ -1220,7 +1221,23 @@ static int __cpufreq_remove_dev_prepare(struct device *dev,
 		}
 	}
 
-	return 0;
+	/* If no target, nothing more to do */
+	if (!cpufreq_driver->target)
+		return 0;
+
+	/* If cpu is last user of policy, cleanup the policy governor */
+	if (cpus == 1) {
+		ret = __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
+		if (ret)
+			pr_err("%s: Failed to exit governor\n", __func__);
+	} else {
+		if ((ret = __cpufreq_governor(policy, CPUFREQ_GOV_START)) ||
+		    (ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS))) {
+			pr_err("%s: Failed to start governor\n", __func__);
+		}
+	}
+
+	return ret;
 }
 
 static int __cpufreq_remove_dev_finish(struct device *dev,
@@ -1243,25 +1260,12 @@ static int __cpufreq_remove_dev_finish(struct device *dev,
 		return -EINVAL;
 	}
 
-	WARN_ON(lock_policy_rwsem_write(cpu));
+	WARN_ON(lock_policy_rwsem_read(cpu));
 	cpus = cpumask_weight(policy->cpus);
-
-	if (cpus > 1)
-		cpumask_clear_cpu(cpu, policy->cpus);
-	unlock_policy_rwsem_write(cpu);
+	unlock_policy_rwsem_read(cpu);
 
 	/* If cpu is last user of policy, free policy */
-	if (cpus == 1) {
-		if (cpufreq_driver->target) {
-			ret = __cpufreq_governor(policy,
-					CPUFREQ_GOV_POLICY_EXIT);
-			if (ret) {
-				pr_err("%s: Failed to exit governor\n",
-						__func__);
-				return ret;
-			}
-		}
-
+	if (cpus == 0) {
 		if (!frozen) {
 			lock_policy_rwsem_read(cpu);
 			kobj = &policy->kobj;
@@ -1294,15 +1298,6 @@ static int __cpufreq_remove_dev_finish(struct device *dev,
 
 		if (!frozen)
 			cpufreq_policy_free(policy);
-	} else {
-		if (cpufreq_driver->target) {
-			if ((ret = __cpufreq_governor(policy, CPUFREQ_GOV_START)) ||
-			    (ret = __cpufreq_governor(policy, CPUFREQ_GOV_LIMITS))) {
-				pr_err("%s: Failed to start governor\n",
-						__func__);
-				return ret;
-			}
-		}
 	}
 
 	per_cpu(cpufreq_cpu_data, cpu) = NULL;



Regards,
Srivatsa S. Bhat

2013-10-02 12:21:16

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/02, Peter Zijlstra wrote:
>
> On Tue, Oct 01, 2013 at 08:07:50PM +0200, Oleg Nesterov wrote:
> > > > But note that you do not strictly need this change. Just kill cpuhp_waitcount,
> > > > then we can change cpu_hotplug_begin/end to use xxx_enter/exit we discuss in
> > > > another thread, this should likely "join" all synchronize_sched's.
> > >
> > > That would still be 4k * sync_sched() == terribly long.
> >
> > No? the next xxx_enter() avoids sync_sched() if rcu callback is still
> > pending. Unless __cpufreq_remove_dev_finish() is "too slow" of course.
>
> Hmm,. not in the version you posted; there xxx_enter() would only not do
> the sync_sched if there's a concurrent 'writer', in which case it will
> wait for it.

No, please see below.

> You only avoid the sync_sched in xxx_exit() and potentially join in the
> sync_sched() of a next xxx_begin().
>
> So with that scheme:
>
> for (i= ; i<4096; i++) {
> xxx_begin();
> xxx_exit();
> }
>
> Will get 4096 sync_sched() calls from the xxx_begin() and all but the
> last xxx_exit() will 'drop' the rcu callback.

No, the code above should call sync_sched() only once, no matter what
this code does between _enter and _exit. This was one of the points.

To clarify, of course I mean the "likely" case. Say, a long preemption
after _exit can lead to another sync_sched().

void xxx_enter(struct xxx_struct *xxx)
{
	bool need_wait, need_sync;

	spin_lock_irq(&xxx->xxx_lock);
	need_wait = xxx->gp_count++;
	need_sync = xxx->gp_state == GP_IDLE;
	if (need_sync)
		xxx->gp_state = GP_PENDING;
	spin_unlock_irq(&xxx->xxx_lock);

	BUG_ON(need_wait && need_sync);

	if (need_sync) {
		synchronize_sched();
		xxx->gp_state = GP_PASSED;
		wake_up_all(&xxx->gp_waitq);
	} else if (need_wait) {
		wait_event(xxx->gp_waitq, xxx->gp_state == GP_PASSED);
	} else {
		BUG_ON(xxx->gp_state != GP_PASSED);
	}
}

The 1st iteration:

xxx_enter() does synchronize_sched() and sets gp_state = GP_PASSED.

xxx_exit() starts the rcu callback, but gp_state is still PASSED.

all other iterations in the "likely" case:

xxx_enter() should likely come before the pending callback fires
and clears gp_state. In this case we only increment ->gp_count
(this "disables" the rcu callback) and do nothing more, gp_state
is still GP_PASSED.

xxx_exit() does another call_rcu_sched(), or does the
CB_PENDING -> CB_REPLAY change. The latter is the same as "start
another callback".

In short: unless a gp elapses between _exit() and _enter(), the next
_enter() does nothing and avoids synchronize_sched().
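
For completeness, one possible shape of the exit side that matches the
description above is sketched below. This is only an illustration: the
cb_state/cb_head fields, the CB_* constants and the xxx_rcu_func() name are
assumptions made here, not the code actually posted in the other thread.

enum { CB_IDLE = 0, CB_PENDING, CB_REPLAY };	/* assumed */

struct xxx_struct {
	spinlock_t		xxx_lock;
	int			gp_state;	/* GP_IDLE / GP_PENDING / GP_PASSED */
	int			gp_count;
	wait_queue_head_t	gp_waitq;

	int			cb_state;	/* CB_* states, assumed */
	struct rcu_head		cb_head;
};

static void xxx_rcu_func(struct rcu_head *rcu)		/* name assumed */
{
	struct xxx_struct *xxx = container_of(rcu, struct xxx_struct, cb_head);
	unsigned long flags;

	spin_lock_irqsave(&xxx->xxx_lock, flags);
	if (xxx->gp_count) {
		/* a new _enter() already "disabled" this callback */
		xxx->cb_state = CB_IDLE;
	} else if (xxx->cb_state == CB_REPLAY) {
		/* another _exit() came in meanwhile: start another callback */
		xxx->cb_state = CB_PENDING;
		call_rcu_sched(&xxx->cb_head, xxx_rcu_func);
	} else {
		/* a full gp elapsed with no writers: back to the fast path */
		xxx->cb_state = CB_IDLE;
		xxx->gp_state = GP_IDLE;
	}
	spin_unlock_irqrestore(&xxx->xxx_lock, flags);
}

void xxx_exit(struct xxx_struct *xxx)
{
	spin_lock_irq(&xxx->xxx_lock);
	if (!--xxx->gp_count) {
		if (xxx->cb_state == CB_IDLE) {
			xxx->cb_state = CB_PENDING;
			call_rcu_sched(&xxx->cb_head, xxx_rcu_func);
		} else if (xxx->cb_state == CB_PENDING) {
			xxx->cb_state = CB_REPLAY;
		}
	}
	spin_unlock_irq(&xxx->xxx_lock);
}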

> And given the construct; I'm not entirely sure you can do away with the
> sync_sched() in between. While its clear to me you can merge the two
> into one; leaving it out entirely doesn't seem right.

Could you explain?

Oleg.

2013-10-02 12:23:51

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/01, Paul E. McKenney wrote:
>
> On Tue, Oct 01, 2013 at 08:07:50PM +0200, Oleg Nesterov wrote:
> > On 10/01, Peter Zijlstra wrote:
> > >
> > > On Tue, Oct 01, 2013 at 07:45:08PM +0200, Oleg Nesterov wrote:
> > > >
> > > > I tend to agree with Srivatsa... Without a strong reason it would be better
> > > > to preserve the current logic: "some time after" should not be after the
> > > > next CPU_DOWN/UP*. But I won't argue too much.
> > >
> > > Nah, I think breaking it is the right thing :-)
> >
> > I don't really agree but I won't argue ;)
>
> The authors of arch/x86/kernel/cpu/mcheck/mce.c would seem to be the
> guys who would need to complain, given that they seem to have the only
> use in 3.11.

mce_cpu_callback() is fine; it ignores POST_DEAD if CPU_TASKS_FROZEN is set.

Oleg.

2013-10-02 12:25:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Wed, Oct 02, 2013 at 02:13:56PM +0200, Oleg Nesterov wrote:
> On 10/02, Peter Zijlstra wrote:
> > And given the construct; I'm not entirely sure you can do away with the
> > sync_sched() in between. While its clear to me you can merge the two
> > into one; leaving it out entirely doesn't seem right.
>
> Could you explain?

Somehow I thought the fastpath got enabled; it doesn't since we never
hit GP_IDLE, so we don't actually need that.

You're right.

2013-10-02 13:32:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Wed, Oct 02, 2013 at 02:13:56PM +0200, Oleg Nesterov wrote:
> In short: unless a gp elapses between _exit() and _enter(), the next
> _enter() does nothing and avoids synchronize_sched().

That does however make the entire scheme entirely writer biased;
increasing the need for the waitcount thing I have. Otherwise we'll
starve pending readers.

2013-10-02 14:07:47

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/02, Peter Zijlstra wrote:
>
> On Wed, Oct 02, 2013 at 02:13:56PM +0200, Oleg Nesterov wrote:
> > In short: unless a gp elapses between _exit() and _enter(), the next
> > _enter() does nothing and avoids synchronize_sched().
>
> That does however make the entire scheme entirely writer biased;

Well, this makes the scheme "a bit more" writer biased, but this is
exactly what we want in this case.

We do not block the readers after xxx_exit() entirely, but we do want
to keep them in SLOW state and avoid the costly SLOW -> FAST -> SLOW
transitions.

Let's even forget about disable_nonboot_cpus() and consider
percpu_rwsem-like logic "in general".

Yes, it is heavily optimized for readers. But if the writers come in
a batch, or the same writer does down_write + up_write twice or more,
I think state == FAST is pointless in between (if we can avoid it).
This is the rare case (the writers should be rare), but if it happens
it makes sense to optimize the writers too. And again, even

for (;;) {
percpu_down_write();
percpu_up_write();
}

should not completely block the readers.

IOW. "turn sync_sched() into call_rcu_sched() in up_write()" is obviously
a win. If the next down_write/xxx_enter "knows" that the readers are
still in SLOW mode because gp was not completed yet, why should we
add the artificial delay?

As for disable_nonboot_cpus(): you are going to move cpu_hotplug_begin()
outside of the loop, which is the same thing.
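
To make the disable_nonboot_cpus() point concrete, the batching would look
roughly like the sketch below. This is purely illustrative:
cpu_hotplug_begin()/cpu_hotplug_done() and _cpu_down() are the mainline
names, but the "batched" wrapper and the assumption that _cpu_down() can
skip its own begin/end here are not taken from the actual patches.

/*
 * Sketch only: take the hotplug "write side" once around the whole batch
 * instead of once per CPU from within _cpu_down().
 */
static int disable_nonboot_cpus_batched(void)
{
	int cpu, first_cpu, error = 0;

	first_cpu = cpumask_first(cpu_online_mask);

	cpu_hotplug_begin();		/* one writer entry for the batch */
	for_each_online_cpu(cpu) {
		if (cpu == first_cpu)
			continue;
		/* assumes a _cpu_down() variant that skips its own begin/end */
		error = _cpu_down(cpu, 1);
		if (error)
			break;
	}
	cpu_hotplug_done();		/* single deferred exit for the whole batch */

	return error;
}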

Oleg.

2013-10-02 15:18:01

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Wed, Oct 02, 2013 at 04:00:20PM +0200, Oleg Nesterov wrote:
> And again, even
>
> for (;;) {
> percpu_down_write();
> percpu_up_write();
> }
>
> should not completely block the readers.

Sure there's a tiny window, but don't forget that a reader will have to
wait for the gp_state cacheline to transfer to shared state and the
per-cpu refcount cachelines to be brought back into exclusive mode and
the above can be aggressive enough that by that time we'll observe
state == blocked again.

So I don't think that in practice a reader will get in.

Also, since the write side is exposed to userspace, you've got an
effective DoS.

So I'll stick to waitcount -- as you can see in the patches I've just
posted.

2013-10-02 16:39:03

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On 10/02, Peter Zijlstra wrote:
>
> On Wed, Oct 02, 2013 at 04:00:20PM +0200, Oleg Nesterov wrote:
> > And again, even
> >
> > for (;;) {
> > percpu_down_write();
> > percpu_up_write();
> > }
> >
> > should not completely block the readers.
>
> Sure there's a tiny window, but don't forget that a reader will have to
> wait for the gp_state cacheline to transfer to shared state and the
> per-cpu refcount cachelines to be brought back into exclusive mode and
> the above can be aggressive enough that by that time we'll observe
> state == blocked again.

Sure, but don't forget that other callers of cpu_down() do a lot more
work before/after they actually call cpu_hotplug_begin/end().

> So I'll stick to waitcount -- as you can see in the patches I've just
> posted.

I still do not believe we need this waitcount "in practice" ;)

But even if I am right this is minor and we can reconsider this later,
so please forget.

Oleg.

2013-10-02 17:52:42

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] hotplug: Optimize {get,put}_online_cpus()

On Wed, Oct 02, 2013 at 04:00:20PM +0200, Oleg Nesterov wrote:
> On 10/02, Peter Zijlstra wrote:
> >
> > On Wed, Oct 02, 2013 at 02:13:56PM +0200, Oleg Nesterov wrote:
> > > In short: unless a gp elapses between _exit() and _enter(), the next
> > > _enter() does nothing and avoids synchronize_sched().
> >
> > That does however make the entire scheme entirely writer biased;
>
> Well, this makes the scheme "a bit more" writer biased, but this is
> exactly what we want in this case.
>
> We do not block the readers after xxx_exit() entirely, but we do want
> to keep them in SLOW state and avoid the costly SLOW -> FAST -> SLOW
> transitions.

Yes -- should help -a- -lot- for bulk write-side operations, such as
onlining all CPUs at boot time. ;-)

Thanx, Paul

> Let's even forget about disable_nonboot_cpus() and consider
> percpu_rwsem-like logic "in general".
>
> Yes, it is heavily optimized for readers. But if the writers come in
> a batch, or the same writer does down_write + up_write twice or more,
> I think state == FAST is pointless in between (if we can avoid it).
> This is the rare case (the writers should be rare), but if it happens
> it makes sense to optimize the writers too. And again, even
>
> for (;;) {
> percpu_down_write();
> percpu_up_write();
> }
>
> should not completely block the readers.
>
> IOW. "turn sync_sched() into call_rcu_sched() in up_write()" is obviously
> a win. If the next down_write/xxx_enter "knows" that the readers are
> still in SLOW mode because gp was not completed yet, why should we
> add the artificial delay?
>
> As for disable_nonboot_cpus(): you are going to move cpu_hotplug_begin()
> outside of the loop, which is the same thing.
>
> Oleg.
>

2013-10-03 07:05:06

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC] introduce synchronize_sched_{enter,exit}()


* Peter Zijlstra <[email protected]> wrote:

>

Fully agreed! :-)

Ingo

2013-10-03 07:43:28

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC] introduce synchronize_sched_{enter,exit}()

On Thu, Oct 03, 2013 at 09:04:59AM +0200, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> >
>
> Fully agreed! :-)

haha.. never realized I sent that email completely empty. It was
supposed to contain the patch I later sent as 2/3.