LinuxLists.cc - [PATCH v3 0/8] Remove dependency on congestion_wait in mm/

2021-10-08 13:54:50

[permalink] [raw]

Subject: [PATCH v3 0/8] Remove dependency on congestion_wait in mm/

This series is also available at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-reclaimcongest-v3r9

This series that removes all calls to congestion_wait
in mm/ and deletes wait_iff_congested. It's not a clever
implementation but congestion_wait has been broken for a long time
(https://lore.kernel.org/linux-mm/[email protected]/).
Even if congestion throttling worked, it was never a great idea. While
excessive dirty/writeback pages at the tail of the LRU is one possibility
that reclaim may be slow, there is also the problem of too many pages
being isolated and reclaim failing for other reasons (elevated references,
too many pages isolated, excessive LRU contention etc).

This series replaces the "congestion" throttling with 3 different types.

o If there are too many dirty/writeback pages, sleep until a timeout
or enough pages get cleaned
o If too many pages are isolated, sleep until enough isolated pages
are either reclaimed or put back on the LRU
o If no progress is being made, direct reclaim tasks sleep until
another task makes progress with acceptable efficiency.

This was initially tested with a mix of workloads that used to trigger
corner cases that no longer work. A new test case was created called
"stutterp" (pagereclaim-stutterp-noreaders in mmtests) using a freshly
created XFS filesystem. Note that it may be necessary to increase the
timeout of ssh if executing remotely as ssh itself can get throttled and
the connection may timeout.

stutterp varies the number of "worker" processes from 4 up to NR_CPUS*4
to check the impact as the number of direct reclaimers increase. It has
four types of worker.

o One "anon latency" worker creates small mappings with mmap() and times
how long it takes to fault the mapping reading it 4K at a time
o X file writers which is fio randomly writing X files where the total
size of the files add up to the allowed dirty_ratio. fio is allowed
to run for a warmup period to allow some file-backed pages to
accumulate. The duration of the warmup is based on the best-case
linear write speed of the storage.
o Y file readers which is fio randomly reading small files
o Z anon memory hogs which continually map (100-dirty_ratio)% of
memory
o Total estimated WSS = (100+dirty_ration) percentage of memory

X+Y+Z+1 == NR_WORKERS varying from 4 up to NR_CPUS*4

The intent is to maximise the total WSS with a mix of file and anon memory
where some anonymous memory must be swapped and there is a high likelihood
of dirty/writeback pages reaching the end of the LRU.

The test can be configured to have no background readers to stress
dirty/writeback pages. The results below are based on having zero readers.

The short summary of the results is that the series works and stalls
until some event occurs but the timeouts may need adjustment.

The test results are not broken down by patch as the series should be
treated as one block that replaces a broken throttling mechanism with a
working one.

Finally, three machines were tested but I'm reporting the worst set of
results. The other two machines had much better latencies for example.

First the results of the "anon latency" latency

stutterp
5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v3r9
Amean mmap-4 31.4003 ( 0.00%) 3502.0437 (-11052.92%)
Amean mmap-7 38.1641 ( 0.00%) 118.7176 (-211.07%)
Amean mmap-12 60.0981 ( 0.00%) 544.4736 (-805.97%)
Amean mmap-21 161.2699 ( 0.00%) 246.8211 ( -53.05%)
Amean mmap-30 174.5589 ( 0.00%) 511.8941 (-193.25%)
Amean mmap-48 8106.8160 ( 0.00%) 5181.3920 ( 36.09%)
Stddev mmap-4 41.3455 ( 0.00%) 35007.1657 (-84569.93%)
Stddev mmap-7 53.5556 ( 0.00%) 3880.7480 (-7146.20%)
Stddev mmap-12 171.3897 ( 0.00%) 11157.8419 (-6410.22%)
Stddev mmap-21 1506.6752 ( 0.00%) 6117.6842 (-306.04%)
Stddev mmap-30 557.5806 ( 0.00%) 9030.5131 (-1519.59%)
Stddev mmap-48 61681.5718 ( 0.00%) 35232.3288 ( 42.88%)
Max-90 mmap-4 31.4243 ( 0.00%) 79.4364 (-152.79%)
Max-90 mmap-7 41.0410 ( 0.00%) 38.8362 ( 5.37%)
Max-90 mmap-12 66.5255 ( 0.00%) 34.0194 ( 48.86%)
Max-90 mmap-21 146.7479 ( 0.00%) 79.2514 ( 45.99%)
Max-90 mmap-30 193.9513 ( 0.00%) 85.9060 ( 55.71%)
Max-90 mmap-48 277.9137 ( 0.00%) 1063.9764 (-282.84%
Max mmap-4 1913.8009 ( 0.00%) 362207.4705 (-18826.08%)
Max mmap-7 2423.9665 ( 0.00%) 192136.1715 (-7826.52%)
Max mmap-12 6845.6573 ( 0.00%) 262738.5257 (-3738.03%)
Max mmap-21 56278.6508 ( 0.00%) 212263.3098 (-277.16%)
Max mmap-30 19716.2990 ( 0.00%) 218858.2147 (-1010.04%)
Max mmap-48 477923.9400 ( 0.00%) 271100.1667 ( 43.28%)

For most thread counts, the time to mmap() is unfortunately increased.
In earlier versions of the series, this was lower but a large number of
throttling events were reaching their timeout increasing the amount of
inefficient scanning of the LRU. There is no prioritisation of reclaim
tasks making progress based on each tasks rate of page allocation versus
progress of reclaim. The variance is also impacted for high worker
counts but in all cases, the differences in latency are not statistically
significant due to very large maximum outliers. Max-90 shows that 90% of
the stalls are comparable but the Max results show the massive outliers
which are increased to to stalling.

It is expected that this will be very machine dependant. Due to the
test design, reclaim is difficult so allocations stall and there are
variances depending on whether THPs can be allocated or not. The amount
of memory will affect exactly how bad the corner cases are and how often
they trigger. The warmup period calculation is not ideal as it's based
on linear writes where as fio is randomly writing multiple files from
multiple tasks so the start state of the test is variable. For
example, these are the latencies on a single-socket machine that had
more memory

Amean mmap-4 20.5437 ( 0.00%) 17.2818 * 15.88%*
Amean mmap-6 39.2860 ( 0.00%) 75.5750 * -92.37%*
Amean mmap-8 2476.1950 ( 0.00%) 184.9578 ( 92.53%)
Amean mmap-12 178.0936 ( 0.00%) 198.2362 ( -11.31%)
Amean mmap-18 3238.9125 ( 0.00%) 168.2480 ( 94.81%)
Amean mmap-24 7922.7016 ( 0.00%) 290.8845 ( 96.33%)
Amean mmap-30 1766.8392 ( 0.00%) 460.1266 ( 73.96%)
Amean mmap-32 7542.2844 ( 0.00%) 512.1812 ( 93.21%)

The overall system CPU usage and elapsed time is as follows

5.15.0-rc3 5.15.0-rc3
vanillamm-reclaimcongest-v3r9
Duration User 6989.03 717.92
Duration System 7308.12 774.12
Duration Elapsed 2277.67 2159.98

The patches reduce system CPU usage by 89% as the vanilla kernel is rarely
stalling. The differences in elapsed time are due to the possibility that
the test controller can also get throttled and miss the timeout.

The high-level /proc/vmstats show

5.15.0-rc1 5.15.0-rc1
vanilla mm-reclaimcongest-v3r9
Ops Direct pages scanned 1056608451.00 154109543.00
Ops Kswapd pages scanned 109795048.00 108898253.00
Ops Kswapd pages reclaimed 63269243.00 22029757.00
Ops Direct pages reclaimed 10803973.00 9135952.00
Ops Kswapd velocity 48204.98 50416.32
Ops Direct velocity 463898.83 71347.67

Kswapd scanned a similar number of pages but the detailed pattern is
different. The vanilla kernel scans slowly over time where as the patches
exhibits burst patterns of scan activity. Direct reclaim scanning is
reduced by 85% due to stalling.

Generally, there are some spikes in reclaim activity (both direct and
kswapd) but crucially, the number of pages reclaimed is relatively
consistent. In other words, with this workload, reclaim rate remains
relatively constant but there are large variations in scan activity
representing useless scanning.

Ops Percentage direct scans 90.59 58.60

For direct reclaim, vanilla scanned 90.59% of pages where as with the
patches, 58.60% were direct reclaim due to throttling

Ops Page writes by reclaim 2613590.00 2320847.00

Page writes from reclaim context are somewhat consistent.

Ops Page writes anon 2932752.00 2567954.00

Swap activity remain somewhat consistent.

Ops Page reclaim immediate 996248528.00 64076505.00

The number of pages encountered at the tail of the LRU tagged for immediate
reclaim but still dirty/writeback is reduced by 94%.

Ops Slabs scanned 164284.00 170222.00

Slab scan activity is similar.

ftrace was used to gather stall activity

Vanilla
-------
1 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=16000
2 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=12000
8 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=8000
29 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=4000
82394 writeback_wait_iff_congested: usec_timeout=100000 usec_delayed=0

The fast majority of wait_iff_congested calls do not stall at all.
What is likely happening is that cond_resched() reschedules the task for
a short period when the BDI is not registering congestion (which it never
will in this test setup).

1 writeback_congestion_wait: usec_timeout=100000 usec_delayed=120000
2 writeback_congestion_wait: usec_timeout=100000 usec_delayed=132000
4 writeback_congestion_wait: usec_timeout=100000 usec_delayed=112000
380 writeback_congestion_wait: usec_timeout=100000 usec_delayed=108000
778 writeback_congestion_wait: usec_timeout=100000 usec_delayed=104000

congestion_wait if called always exceeds the timeout as there is no
trigger to wake it up.

Bottom line: Vanilla will throttle but it's not effective.

Patch series
------------

Kswapd throttle activity was always due to scanning pages tagged for
immediate reclaim at the tail of the LRU

1 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
2 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
2 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
4 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
5 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
7 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK
13 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
119 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
131 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK

The majority of events did not stall or stalled for a short period.
A small number stalled for the entire timeout.

For direct reclaim, the number of times stalled for each
reason were

2053 reason=VMSCAN_THROTTLE_ISOLATED
100704 reason=VMSCAN_THROTTLE_WRITEBACK
106825 reason=VMSCAN_THROTTLE_NOPROGRESS

The most common reason to stall was due to a failure to make forward
progress followed closely by excessive pages tagged for immediate reclaim
at the tail of the LRU. A relatively small number were due to too many
pages isolated from the LRU by parallel threads

For VMSCAN_THROTTLE_ISOLATED, the breakdown of delays was

3 usec_timeout=20000 usect_delayed=16000 reason=VMSCAN_THROTTLE_ISOLATED
8 usec_timeout=20000 usect_delayed=8000 reason=VMSCAN_THROTTLE_ISOLATED
9 usec_timeout=20000 usect_delayed=12000 reason=VMSCAN_THROTTLE_ISOLATED
18 usec_timeout=20000 usect_delayed=4000 reason=VMSCAN_THROTTLE_ISOLATED
69 usec_timeout=20000 usect_delayed=20000 reason=VMSCAN_THROTTLE_ISOLATED
1946 usec_timeout=20000 usect_delayed=0 reason=VMSCAN_THROTTLE_ISOLATED

Most did not stall at all or for a short period. A small percentage reached
the timeout.

For VMSCAN_THROTTLE_NOPROGRESS, the breakdown of stalls were all over the
map
1 usec_timeout=500000 usect_delayed=164000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=244000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=252000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=276000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=332000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=368000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=412000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=460000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usec_timeout=500000 usect_delayed=476000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=196000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=336000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=364000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=444000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usec_timeout=500000 usect_delayed=452000 reason=VMSCAN_THROTTLE_NOPROGRESS
3 usec_timeout=500000 usect_delayed=292000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=188000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=236000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=268000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=328000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=448000 reason=VMSCAN_THROTTLE_NOPROGRESS
4 usec_timeout=500000 usect_delayed=456000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=140000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=144000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=264000 reason=VMSCAN_THROTTLE_NOPROGRESS
5 usec_timeout=500000 usect_delayed=436000 reason=VMSCAN_THROTTLE_NOPROGRESS
6 usec_timeout=500000 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
6 usec_timeout=500000 usect_delayed=356000 reason=VMSCAN_THROTTLE_NOPROGRESS
6 usec_timeout=500000 usect_delayed=380000 reason=VMSCAN_THROTTLE_NOPROGRESS
6 usec_timeout=500000 usect_delayed=440000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=304000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=340000 reason=VMSCAN_THROTTLE_NOPROGRESS
7 usec_timeout=500000 usect_delayed=400000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=148000 reason=VMSCAN_THROTTLE_NOPROGRESS
8 usec_timeout=500000 usect_delayed=392000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=152000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=168000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=240000 reason=VMSCAN_THROTTLE_NOPROGRESS
9 usec_timeout=500000 usect_delayed=316000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=184000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=216000 reason=VMSCAN_THROTTLE_NOPROGRESS
10 usec_timeout=500000 usect_delayed=228000 reason=VMSCAN_THROTTLE_NOPROGRESS
11 usec_timeout=500000 usect_delayed=372000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=212000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=344000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=408000 reason=VMSCAN_THROTTLE_NOPROGRESS
12 usec_timeout=500000 usect_delayed=432000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=160000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=248000 reason=VMSCAN_THROTTLE_NOPROGRESS
13 usec_timeout=500000 usect_delayed=260000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=204000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=256000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=420000 reason=VMSCAN_THROTTLE_NOPROGRESS
14 usec_timeout=500000 usect_delayed=464000 reason=VMSCAN_THROTTLE_NOPROGRESS
15 usec_timeout=500000 usect_delayed=232000 reason=VMSCAN_THROTTLE_NOPROGRESS
16 usec_timeout=500000 usect_delayed=136000 reason=VMSCAN_THROTTLE_NOPROGRESS
16 usec_timeout=500000 usect_delayed=472000 reason=VMSCAN_THROTTLE_NOPROGRESS
17 usec_timeout=500000 usect_delayed=424000 reason=VMSCAN_THROTTLE_NOPROGRESS
18 usec_timeout=500000 usect_delayed=428000 reason=VMSCAN_THROTTLE_NOPROGRESS
19 usec_timeout=500000 usect_delayed=224000 reason=VMSCAN_THROTTLE_NOPROGRESS
19 usec_timeout=500000 usect_delayed=352000 reason=VMSCAN_THROTTLE_NOPROGRESS
21 usec_timeout=500000 usect_delayed=200000 reason=VMSCAN_THROTTLE_NOPROGRESS
21 usec_timeout=500000 usect_delayed=312000 reason=VMSCAN_THROTTLE_NOPROGRESS
22 usec_timeout=500000 usect_delayed=468000 reason=VMSCAN_THROTTLE_NOPROGRESS
25 usec_timeout=500000 usect_delayed=348000 reason=VMSCAN_THROTTLE_NOPROGRESS
28 usec_timeout=500000 usect_delayed=320000 reason=VMSCAN_THROTTLE_NOPROGRESS
28 usec_timeout=500000 usect_delayed=484000 reason=VMSCAN_THROTTLE_NOPROGRESS
28 usec_timeout=500000 usect_delayed=492000 reason=VMSCAN_THROTTLE_NOPROGRESS
29 usec_timeout=500000 usect_delayed=180000 reason=VMSCAN_THROTTLE_NOPROGRESS
29 usec_timeout=500000 usect_delayed=220000 reason=VMSCAN_THROTTLE_NOPROGRESS
29 usec_timeout=500000 usect_delayed=300000 reason=VMSCAN_THROTTLE_NOPROGRESS
29 usec_timeout=500000 usect_delayed=64000 reason=VMSCAN_THROTTLE_NOPROGRESS
32 usec_timeout=500000 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
32 usec_timeout=500000 usect_delayed=360000 reason=VMSCAN_THROTTLE_NOPROGRESS
33 usec_timeout=500000 usect_delayed=132000 reason=VMSCAN_THROTTLE_NOPROGRESS
34 usec_timeout=500000 usect_delayed=296000 reason=VMSCAN_THROTTLE_NOPROGRESS
35 usec_timeout=500000 usect_delayed=76000 reason=VMSCAN_THROTTLE_NOPROGRESS
39 usec_timeout=500000 usect_delayed=284000 reason=VMSCAN_THROTTLE_NOPROGRESS
39 usec_timeout=500000 usect_delayed=324000 reason=VMSCAN_THROTTLE_NOPROGRESS
44 usec_timeout=500000 usect_delayed=384000 reason=VMSCAN_THROTTLE_NOPROGRESS
44 usec_timeout=500000 usect_delayed=480000 reason=VMSCAN_THROTTLE_NOPROGRESS
45 usec_timeout=500000 usect_delayed=416000 reason=VMSCAN_THROTTLE_NOPROGRESS
46 usec_timeout=500000 usect_delayed=192000 reason=VMSCAN_THROTTLE_NOPROGRESS
46 usec_timeout=500000 usect_delayed=488000 reason=VMSCAN_THROTTLE_NOPROGRESS
49 usec_timeout=500000 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
54 usec_timeout=500000 usect_delayed=288000 reason=VMSCAN_THROTTLE_NOPROGRESS
57 usec_timeout=500000 usect_delayed=80000 reason=VMSCAN_THROTTLE_NOPROGRESS
58 usec_timeout=500000 usect_delayed=68000 reason=VMSCAN_THROTTLE_NOPROGRESS
59 usec_timeout=500000 usect_delayed=496000 reason=VMSCAN_THROTTLE_NOPROGRESS
60 usec_timeout=500000 usect_delayed=208000 reason=VMSCAN_THROTTLE_NOPROGRESS
66 usec_timeout=500000 usect_delayed=72000 reason=VMSCAN_THROTTLE_NOPROGRESS
75 usec_timeout=500000 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
91 usec_timeout=500000 usect_delayed=88000 reason=VMSCAN_THROTTLE_NOPROGRESS
96 usec_timeout=500000 usect_delayed=92000 reason=VMSCAN_THROTTLE_NOPROGRESS
97 usec_timeout=500000 usect_delayed=84000 reason=VMSCAN_THROTTLE_NOPROGRESS
139 usec_timeout=500000 usect_delayed=40000 reason=VMSCAN_THROTTLE_NOPROGRESS
160 usec_timeout=500000 usect_delayed=56000 reason=VMSCAN_THROTTLE_NOPROGRESS
160 usec_timeout=500000 usect_delayed=60000 reason=VMSCAN_THROTTLE_NOPROGRESS
171 usec_timeout=500000 usect_delayed=100000 reason=VMSCAN_THROTTLE_NOPROGRESS
175 usec_timeout=500000 usect_delayed=48000 reason=VMSCAN_THROTTLE_NOPROGRESS
181 usec_timeout=500000 usect_delayed=52000 reason=VMSCAN_THROTTLE_NOPROGRESS
203 usec_timeout=500000 usect_delayed=44000 reason=VMSCAN_THROTTLE_NOPROGRESS
235 usec_timeout=500000 usect_delayed=36000 reason=VMSCAN_THROTTLE_NOPROGRESS
267 usec_timeout=500000 usect_delayed=32000 reason=VMSCAN_THROTTLE_NOPROGRESS
295 usec_timeout=500000 usect_delayed=96000 reason=VMSCAN_THROTTLE_NOPROGRESS
395 usec_timeout=500000 usect_delayed=28000 reason=VMSCAN_THROTTLE_NOPROGRESS
471 usec_timeout=500000 usect_delayed=24000 reason=VMSCAN_THROTTLE_NOPROGRESS
548 usec_timeout=500000 usect_delayed=20000 reason=VMSCAN_THROTTLE_NOPROGRESS
972 usec_timeout=500000 usect_delayed=16000 reason=VMSCAN_THROTTLE_NOPROGRESS
1129 usec_timeout=500000 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
1507 usec_timeout=500000 usect_delayed=12000 reason=VMSCAN_THROTTLE_NOPROGRESS
3308 usec_timeout=500000 usect_delayed=8000 reason=VMSCAN_THROTTLE_NOPROGRESS
14459 usec_timeout=500000 usect_delayed=4000 reason=VMSCAN_THROTTLE_NOPROGRESS
34811 usec_timeout=500000 usect_delayed=0 reason=VMSCAN_THROTTLE_NOPROGRESS
45229 usec_timeout=500000 usect_delayed=500000 reason=VMSCAN_THROTTLE_NOPROGRESS

The full timeout is often hit but a large number also do not stall at all.
The remainder slept a little allowing other reclaim tasks to make progress.

While this timeout could be further increased, it could also negatively
impact worst-case behaviour when there is no prioritisation of what
task should make progress.

For VMSCAN_THROTTLE_WRITEBACK, the breakdown was

17 usec_timeout=100000 usect_delayed=68000 reason=VMSCAN_THROTTLE_WRITEBACK
18 usec_timeout=100000 usect_delayed=76000 reason=VMSCAN_THROTTLE_WRITEBACK
19 usec_timeout=100000 usect_delayed=80000 reason=VMSCAN_THROTTLE_WRITEBACK
22 usec_timeout=100000 usect_delayed=92000 reason=VMSCAN_THROTTLE_WRITEBACK
38 usec_timeout=100000 usect_delayed=44000 reason=VMSCAN_THROTTLE_WRITEBACK
41 usec_timeout=100000 usect_delayed=72000 reason=VMSCAN_THROTTLE_WRITEBACK
43 usec_timeout=100000 usect_delayed=56000 reason=VMSCAN_THROTTLE_WRITEBACK
51 usec_timeout=100000 usect_delayed=52000 reason=VMSCAN_THROTTLE_WRITEBACK
51 usec_timeout=100000 usect_delayed=88000 reason=VMSCAN_THROTTLE_WRITEBACK
56 usec_timeout=100000 usect_delayed=60000 reason=VMSCAN_THROTTLE_WRITEBACK
64 usec_timeout=100000 usect_delayed=84000 reason=VMSCAN_THROTTLE_WRITEBACK
74 usec_timeout=100000 usect_delayed=96000 reason=VMSCAN_THROTTLE_WRITEBACK
76 usec_timeout=100000 usect_delayed=48000 reason=VMSCAN_THROTTLE_WRITEBACK
94 usec_timeout=100000 usect_delayed=28000 reason=VMSCAN_THROTTLE_WRITEBACK
99 usec_timeout=100000 usect_delayed=40000 reason=VMSCAN_THROTTLE_WRITEBACK
110 usec_timeout=100000 usect_delayed=32000 reason=VMSCAN_THROTTLE_WRITEBACK
112 usec_timeout=100000 usect_delayed=64000 reason=VMSCAN_THROTTLE_WRITEBACK
152 usec_timeout=100000 usect_delayed=36000 reason=VMSCAN_THROTTLE_WRITEBACK
154 usec_timeout=100000 usect_delayed=24000 reason=VMSCAN_THROTTLE_WRITEBACK
386 usec_timeout=100000 usect_delayed=20000 reason=VMSCAN_THROTTLE_WRITEBACK
617 usec_timeout=100000 usect_delayed=16000 reason=VMSCAN_THROTTLE_WRITEBACK
1052 usec_timeout=100000 usect_delayed=12000 reason=VMSCAN_THROTTLE_WRITEBACK
1621 usec_timeout=100000 usect_delayed=8000 reason=VMSCAN_THROTTLE_WRITEBACK
8406 usec_timeout=100000 usect_delayed=4000 reason=VMSCAN_THROTTLE_WRITEBACK
20317 usec_timeout=100000 usect_delayed=0 reason=VMSCAN_THROTTLE_WRITEBACK
67014 usec_timeout=100000 usect_delayed=100000 reason=VMSCAN_THROTTLE_WRITEBACK

The majority hit the timeout in direct reclaim context although
a sizable number did not stall at all. This is very different to
kswapd where only a tiny percentage of stalls due to writeback
reached the timeout.

Bottom line, the throttling appears to work and the wakeup events may limit
worst case stalls. There might be some grounds for adjusting timeouts but
it's likely futile as the worst-case scenarios depend on the workload,
memory size and the speed of the storage. A better approach to improve
the series further would be to prioritise tasks based on their rate of
allocation with the caveat that it may be very expensive to track.

--
2.31.1

2021-10-08 13:55:22

[permalink] [raw]

Subject: [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback completes if congested

Page reclaim throttles on wait_iff_congested under the following conditions

o kswapd is encountering pages under writeback and marked for immediate
reclaim implying that pages are cycling through the LRU faster than
pages can be cleaned.

o Direct reclaim will stall if all dirty pages are backed by congested
inodes.

wait_iff_congested is almost completely broken with few exceptions. This
patch adds a new node-based workqueue and tracks the number of throttled
tasks and pages written back since throttling started. If enough pages
belonging to the node are written back then the throttled tasks will wake
early. If not, the throttled tasks sleeps until the timeout expires.

[[email protected]: Uninterruptible sleep and simpler wakeups]
[[email protected]: Avoid race when reclaim starts]
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/backing-dev.h | 1 -
include/linux/mmzone.h | 9 ++++
include/trace/events/vmscan.h | 34 +++++++++++++++
include/trace/events/writeback.h | 7 ---
mm/backing-dev.c | 48 ---------------------
mm/filemap.c | 1 +
mm/internal.h | 11 +++++
mm/page_alloc.c | 1 +
mm/vmscan.c | 74 ++++++++++++++++++++++++++------
mm/vmstat.c | 1 +
10 files changed, 119 insertions(+), 68 deletions(-)

diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index ac7f231b8825..9fb1f0ae273c 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -154,7 +154,6 @@ static inline int wb_congested(struct bdi_writeback *wb, int cong_bits)
}

long congestion_wait(int sync, long timeout);
-long wait_iff_congested(int sync, long timeout);

static inline bool mapping_can_writeback(struct address_space *mapping)
{
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 6a1d79d84675..ef0a63ebd21d 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -199,6 +199,7 @@ enum node_stat_item {
NR_VMSCAN_IMMEDIATE, /* Prioritise for reclaim when writeback ends */
NR_DIRTIED, /* page dirtyings since bootup */
NR_WRITTEN, /* page writings since bootup */
+ NR_THROTTLED_WRITTEN, /* NR_WRITTEN while reclaim throttled */
NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */
NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */
NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */
@@ -272,6 +273,10 @@ enum lru_list {
NR_LRU_LISTS
};

+enum vmscan_throttle_state {
+ VMSCAN_THROTTLE_WRITEBACK,
+};
+
#define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)

#define for_each_evictable_lru(lru) for (lru = 0; lru <= LRU_ACTIVE_FILE; lru++)
@@ -841,6 +846,10 @@ typedef struct pglist_data {
int node_id;
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
+ wait_queue_head_t reclaim_wait; /* wq for throttling reclaim */
+ atomic_t nr_reclaim_throttled; /* nr of throtted tasks */
+ unsigned long nr_reclaim_start; /* nr pages written while throttled
+ * when throttling started. */
struct task_struct *kswapd; /* Protected by
mem_hotplug_begin/end() */
int kswapd_order;
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index 88faf2400ec2..c317f9fe0d17 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -27,6 +27,14 @@
{RECLAIM_WB_ASYNC, "RECLAIM_WB_ASYNC"} \
) : "RECLAIM_WB_NONE"

+#define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK)
+
+#define show_throttle_flags(flags) \
+ (flags) ? __print_flags(flags, "|", \
+ {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"} \
+ ) : "VMSCAN_THROTTLE_NONE"
+
+
#define trace_reclaim_flags(file) ( \
(file ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
(RECLAIM_WB_ASYNC) \
@@ -454,6 +462,32 @@ DEFINE_EVENT(mm_vmscan_direct_reclaim_end_template, mm_vmscan_node_reclaim_end,
TP_ARGS(nr_reclaimed)
);

+TRACE_EVENT(mm_vmscan_throttled,
+
+ TP_PROTO(int nid, int usec_timeout, int usec_delayed, int reason),
+
+ TP_ARGS(nid, usec_timeout, usec_delayed, reason),
+
+ TP_STRUCT__entry(
+ __field(int, nid)
+ __field(int, usec_timeout)
+ __field(int, usec_delayed)
+ __field(int, reason)
+ ),
+
+ TP_fast_assign(
+ __entry->nid = nid;
+ __entry->usec_timeout = usec_timeout;
+ __entry->usec_delayed = usec_delayed;
+ __entry->reason = 1U << reason;
+ ),
+
+ TP_printk("nid=%d usec_timeout=%d usect_delayed=%d reason=%s",
+ __entry->nid,
+ __entry->usec_timeout,
+ __entry->usec_delayed,
+ show_throttle_flags(__entry->reason))
+);
#endif /* _TRACE_VMSCAN_H */

/* This part must be outside protection */
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 840d1ba84cf5..3bc759b81897 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -763,13 +763,6 @@ DEFINE_EVENT(writeback_congest_waited_template, writeback_congestion_wait,
TP_ARGS(usec_timeout, usec_delayed)
);

-DEFINE_EVENT(writeback_congest_waited_template, writeback_wait_iff_congested,
-
- TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
-
- TP_ARGS(usec_timeout, usec_delayed)
-);
-
DECLARE_EVENT_CLASS(writeback_single_inode_template,

TP_PROTO(struct inode *inode,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 4a9d4e27d0d9..0ea1a105eae5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1041,51 +1041,3 @@ long congestion_wait(int sync, long timeout)
return ret;
}
EXPORT_SYMBOL(congestion_wait);
-
-/**
- * wait_iff_congested - Conditionally wait for a backing_dev to become uncongested or a pgdat to complete writes
- * @sync: SYNC or ASYNC IO
- * @timeout: timeout in jiffies
- *
- * In the event of a congested backing_dev (any backing_dev) this waits
- * for up to @timeout jiffies for either a BDI to exit congestion of the
- * given @sync queue or a write to complete.
- *
- * The return value is 0 if the sleep is for the full timeout. Otherwise,
- * it is the number of jiffies that were still remaining when the function
- * returned. return_value == timeout implies the function did not sleep.
- */
-long wait_iff_congested(int sync, long timeout)
-{
- long ret;
- unsigned long start = jiffies;
- DEFINE_WAIT(wait);
- wait_queue_head_t *wqh = &congestion_wqh[sync];
-
- /*
- * If there is no congestion, yield if necessary instead
- * of sleeping on the congestion queue
- */
- if (atomic_read(&nr_wb_congested[sync]) == 0) {
- cond_resched();
-
- /* In case we scheduled, work out time remaining */
- ret = timeout - (jiffies - start);
- if (ret < 0)
- ret = 0;
-
- goto out;
- }
-
- /* Sleep until uncongested or a write happens */
- prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
- ret = io_schedule_timeout(timeout);
- finish_wait(wqh, &wait);
-
-out:
- trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
- jiffies_to_usecs(jiffies - start));
-
- return ret;
-}
-EXPORT_SYMBOL(wait_iff_congested);
diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293b5d..59187787fbfc 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1605,6 +1605,7 @@ void end_page_writeback(struct page *page)

smp_mb__after_atomic();
wake_up_page(page, PG_writeback);
+ acct_reclaim_writeback(page);
put_page(page);
}
EXPORT_SYMBOL(end_page_writeback);
diff --git a/mm/internal.h b/mm/internal.h
index cf3cb933eba3..90764d646e02 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -34,6 +34,17 @@

void page_writeback_init(void);

+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+ int nr_throttled);
+static inline void acct_reclaim_writeback(struct page *page)
+{
+ pg_data_t *pgdat = page_pgdat(page);
+ int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
+
+ if (nr_throttled)
+ __acct_reclaim_writeback(pgdat, page, nr_throttled);
+}
+
vm_fault_t do_swap_page(struct vm_fault *vmf);

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..d849ddfc1e51 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7396,6 +7396,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)

init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
+ init_waitqueue_head(&pgdat->reclaim_wait);

pgdat_page_ext_init(pgdat);
lruvec_init(&pgdat->__lruvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 74296c2d1fed..bcd22e53795f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,6 +1006,56 @@ static void handle_write_error(struct address_space *mapping,
unlock_page(page);
}

+static void
+reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+ long timeout)
+{
+ wait_queue_head_t *wqh = &pgdat->reclaim_wait;
+ long ret;
+ DEFINE_WAIT(wait);
+
+ /*
+ * Do not throttle IO workers, kthreads other than kswapd or
+ * workqueues. They may be required for reclaim to make
+ * forward progress (e.g. journalling workqueues or kthreads).
+ */
+ if (!current_is_kswapd() &&
+ current->flags & (PF_IO_WORKER|PF_KTHREAD))
+ return;
+
+ if (atomic_inc_return(&pgdat->nr_reclaim_throttled) == 1) {
+ WRITE_ONCE(pgdat->nr_reclaim_start,
+ node_page_state(pgdat, NR_THROTTLED_WRITTEN));
+ }
+
+ prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+ ret = schedule_timeout(timeout);
+ finish_wait(wqh, &wait);
+ atomic_dec(&pgdat->nr_reclaim_throttled);
+
+ trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
+ jiffies_to_usecs(timeout - ret),
+ reason);
+}
+
+/*
+ * Account for pages written if tasks are throttled waiting on dirty
+ * pages to clean. If enough pages have been cleaned since throttling
+ * started then wakeup the throttled tasks.
+ */
+void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
+ int nr_throttled)
+{
+ unsigned long nr_written;
+
+ __inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+ nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
+ READ_ONCE(pgdat->nr_reclaim_start);
+
+ if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
+ wake_up_all(&pgdat->reclaim_wait);
+}
+
/* possible outcome of pageout() */
typedef enum {
/* failed to write page out, page is locked */
@@ -1412,9 +1462,8 @@ static unsigned int shrink_page_list(struct list_head *page_list,

/*
* The number of dirty pages determines if a node is marked
- * reclaim_congested which affects wait_iff_congested. kswapd
- * will stall and start writing pages if the tail of the LRU
- * is all dirty unqueued pages.
+ * reclaim_congested. kswapd will stall and start writing
+ * pages if the tail of the LRU is all dirty unqueued pages.
*/
page_check_dirty_writeback(page, &dirty, &writeback);
if (dirty || writeback)
@@ -3180,19 +3229,19 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* If kswapd scans pages marked for immediate
* reclaim and under writeback (nr_immediate), it
* implies that pages are cycling through the LRU
- * faster than they are written so also forcibly stall.
+ * faster than they are written so forcibly stall
+ * until some pages complete writeback.
*/
if (sc->nr.immediate)
- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
}

/*
- * Tag a node/memcg as congested if all the dirty pages
- * scanned were backed by a congested BDI and
- * wait_iff_congested will stall.
+ * Tag a node/memcg as congested if all the dirty pages were marked
+ * for writeback and immediate reclaim (counted in nr.congested).
*
* Legacy memcg will stall in page writeback so avoid forcibly
- * stalling in wait_iff_congested().
+ * stalling in reclaim_throttle().
*/
if ((current_is_kswapd() ||
(cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
@@ -3200,15 +3249,15 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);

/*
- * Stall direct reclaim for IO completions if underlying BDIs
- * and node is congested. Allow kswapd to continue until it
+ * Stall direct reclaim for IO completions if the lruvec is
+ * node is congested. Allow kswapd to continue until it
* starts encountering unqueued dirty pages or cycling through
* the LRU too quickly.
*/
if (!current_is_kswapd() && current_may_throttle() &&
!sc->hibernation_mode &&
test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
- wait_iff_congested(BLK_RW_ASYNC, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);

if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc))
@@ -4286,6 +4335,7 @@ static int kswapd(void *p)

WRITE_ONCE(pgdat->kswapd_order, 0);
WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
+ atomic_set(&pgdat->nr_reclaim_throttled, 0);
for ( ; ; ) {
bool ret;

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 8ce2620344b2..9b2bc9d61d4b 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1225,6 +1225,7 @@ const char * const vmstat_text[] = {
"nr_vmscan_immediate_reclaim",
"nr_dirtied",
"nr_written",
+ "nr_throttled_written",
"nr_kernel_misc_reclaimable",
"nr_foll_pin_acquired",
"nr_foll_pin_released",
--
2.31.1

2021-10-08 13:55:35

[permalink] [raw]

Subject: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made

Memcg reclaim throttles on congestion if no reclaim progress is made.
This makes little sense, it might be due to writeback or a host of
other factors.

For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
in the page allocator if it is failing to make progress. Kswapd
throttles if too many pages are under writeback and marked for
immediate reclaim.

This patch explicitly throttles if reclaim is failing to make progress.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 1 +
include/trace/events/vmscan.h | 4 +++-
mm/memcontrol.c | 10 +--------
mm/vmscan.c | 38 +++++++++++++++++++++++++++++++++++
4 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca65d6a64bdd..7c08cc91d526 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -276,6 +276,7 @@ enum lru_list {
enum vmscan_throttle_state {
VMSCAN_THROTTLE_WRITEBACK,
VMSCAN_THROTTLE_ISOLATED,
+ VMSCAN_THROTTLE_NOPROGRESS,
NR_VMSCAN_THROTTLE,
};

diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index d4905bd9e9c4..f25a6149d3ba 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -29,11 +29,13 @@

#define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK)
#define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED)
+#define _VMSCAN_THROTTLE_NOPROGRESS (1 << VMSCAN_THROTTLE_NOPROGRESS)

#define show_throttle_flags(flags) \
(flags) ? __print_flags(flags, "|", \
{_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \
- {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"} \
+ {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"}, \
+ {_VMSCAN_THROTTLE_NOPROGRESS, "VMSCAN_THROTTLE_NOPROGRESS"} \
) : "VMSCAN_THROTTLE_NONE"

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6da5020a8656..8b33152c9b85 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3465,19 +3465,11 @@ static int mem_cgroup_force_empty(struct mem_cgroup *memcg)

/* try to free all pages in this cgroup */
while (nr_retries && page_counter_read(&memcg->memory)) {
- int progress;
-
if (signal_pending(current))
return -EINTR;

- progress = try_to_free_mem_cgroup_pages(memcg, 1,
- GFP_KERNEL, true);
- if (!progress) {
+ if (!try_to_free_mem_cgroup_pages(memcg, 1, GFP_KERNEL, true))
nr_retries--;
- /* maybe some writeback is necessary */
- congestion_wait(BLK_RW_ASYNC, HZ/10);
- }
-
}

return 0;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9ce4195d4123..cdebfc618179 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3311,6 +3311,33 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
return zone_watermark_ok_safe(zone, 0, watermark, sc->reclaim_idx);
}

+static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
+{
+ /* If reclaim is making progress, wake any throttled tasks. */
+ if (sc->nr_reclaimed) {
+ wait_queue_head_t *wqh;
+
+ wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS];
+ if (waitqueue_active(wqh))
+ wake_up_all(wqh);
+
+ return;
+ }
+
+ /*
+ * Do not throttle kswapd on NOPROGRESS as it will throttle on
+ * VMSCAN_THROTTLE_WRITEBACK if there are too many pages under
+ * writeback and marked for immediate reclaim at the tail of
+ * the LRU.
+ */
+ if (current_is_kswapd())
+ return;
+
+ /* Throttle if making no progress at high prioities. */
+ if (sc->priority < DEF_PRIORITY - 2)
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+}
+
/*
* This is the direct reclaim path, for page-allocating processes. We only
* try to reclaim pages from zones which will satisfy the caller's allocation
@@ -3395,6 +3422,7 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
continue;
last_pgdat = zone->zone_pgdat;
shrink_node(zone->zone_pgdat, sc);
+ consider_reclaim_throttle(zone->zone_pgdat, sc);
}

/*
@@ -3769,6 +3797,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
set_task_reclaim_state(current, NULL);

+ if (!nr_reclaimed) {
+ struct zoneref *z;
+ pg_data_t *pgdat;
+
+ z = first_zones_zonelist(zonelist, sc.reclaim_idx, sc.nodemask);
+ pgdat = zonelist_zone(z)->zone_pgdat;
+
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+ }
+
return nr_reclaimed;
}
#endif
--
2.31.1

2021-10-08 13:55:43

[permalink] [raw]

Subject: [PATCH 4/8] mm/writeback: Throttle based on page writeback instead of congestion

do_writepages throttles on congestion if the writepages() fails due to a
lack of memory but congestion_wait() is partially broken as the congestion
state is not updated for all BDIs.

This patch stalls waiting for a number of pages to complete writeback
that located on the local node. The main weakness is that there is no
correlation between the location of the inode's pages and locality but
that is still better than congestion_wait.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page-writeback.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4812a17b288c..f34f54fcd5b4 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2366,8 +2366,15 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
ret = generic_writepages(mapping, wbc);
if ((ret != -ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
break;
- cond_resched();
- congestion_wait(BLK_RW_ASYNC, HZ/50);
+
+ /*
+ * Lacking an allocation context or the locality or writeback
+ * state of any of the inode's pages, throttle based on
+ * writeback activity on the local node. It's as good a
+ * guess as any.
+ */
+ reclaim_throttle(NODE_DATA(numa_node_id()),
+ VMSCAN_THROTTLE_WRITEBACK, HZ/50);
}
/*
* Usually few pages are written by now from those we've just submitted
--
2.31.1

2021-10-08 13:55:55

[permalink] [raw]

Subject: [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator

The page allocator stalls based on the number of pages that are
waiting for writeback to start but this should now be redundant.
shrink_inactive_list() will wake flusher threads if the LRU tail are
unqueued dirty pages so the flusher should be active. If it fails to make
progress due to pages under writeback not being completed quickly then
it should stall on VMSCAN_THROTTLE_WRITEBACK.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 21 +--------------------
1 file changed, 1 insertion(+), 20 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 78e538067651..8fa0109ff417 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4795,30 +4795,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
trace_reclaim_retry_zone(z, order, reclaimable,
available, min_wmark, *no_progress_loops, wmark);
if (wmark) {
- /*
- * If we didn't make any progress and have a lot of
- * dirty + writeback pages then we should wait for
- * an IO to complete to slow down the reclaim and
- * prevent from pre mature OOM
- */
- if (!did_some_progress) {
- unsigned long write_pending;
-
- write_pending = zone_page_state_snapshot(zone,
- NR_ZONE_WRITE_PENDING);
-
- if (2 * write_pending > reclaimable) {
- congestion_wait(BLK_RW_ASYNC, HZ/10);
- return true;
- }
- }
-
ret = true;
- goto out;
+ break;
}
}

-out:
/*
* Memory allocation/reclaim might be called from a WQ context and the
* current implementation of the WQ concurrency control doesn't
--
2.31.1

2021-10-08 13:56:21

[permalink] [raw]

Subject: [PATCH 6/8] mm/vmscan: Centralise timeout values for reclaim_throttle

Neil Brown raised concerns about callers of reclaim_throttle specifying
a timeout value. The original timeout values to congestion_wait() were
probably pulled out of thin air or copy&pasted from somewhere else.
This patch centralises the timeout values and selects a timeout based
on the reason for reclaim throttling. These figures are also pulled
out of the same thin air but better values may be derived

Running a workload that is throttling for inappropriate periods
and tracing mm_vmscan_throttled can be used to pick a more appropriate
value. Excessive throttling would pick a lower timeout where as
excessive CPU usage in reclaim context would select a larger timeout.
Ideally a large value would always be used and the wakeups would
occur before a timeout but that requires careful testing.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/compaction.c | 2 +-
mm/internal.h | 3 +--
mm/page-writeback.c | 2 +-
mm/vmscan.c | 39 +++++++++++++++++++++++++++++++--------
4 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 7359093d8ac0..151b04c4dab3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -828,7 +828,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (cc->mode == MIGRATE_ASYNC)
return -EAGAIN;

- reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);

if (fatal_signal_pending(current))
return -EINTR;
diff --git a/mm/internal.h b/mm/internal.h
index 06d0c376efcd..f8d203cfd4e1 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -129,8 +129,7 @@ extern unsigned long highest_memmap_pfn;
*/
extern int isolate_lru_page(struct page *page);
extern void putback_lru_page(struct page *page);
-extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
- long timeout);
+extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);

/*
* in mm/rmap.c:
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f34f54fcd5b4..4b01a6872f9e 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2374,7 +2374,7 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
* guess as any.
*/
reclaim_throttle(NODE_DATA(numa_node_id()),
- VMSCAN_THROTTLE_WRITEBACK, HZ/50);
+ VMSCAN_THROTTLE_WRITEBACK);
}
/*
* Usually few pages are written by now from those we've just submitted
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cdebfc618179..e096e81dcbd8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,11 +1006,10 @@ static void handle_write_error(struct address_space *mapping,
unlock_page(page);
}

-void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
- long timeout)
+void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
{
wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
- long ret;
+ long timeout, ret;
DEFINE_WAIT(wait);

/*
@@ -1027,6 +1026,30 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
node_page_state(pgdat, NR_THROTTLED_WRITTEN));
}

+ /*
+ * These figures are pulled out of thin air.
+ * VMSCAN_THROTTLE_ISOLATED is a transient condition based on too many
+ * parallel reclaimers which is a short-lived event so the timeout is
+ * short. Failing to make progress or waiting on writeback are
+ * potentially long-lived events so use a longer timeout. This is shaky
+ * logic as a failure to make progress could be due to anything from
+ * writeback to a slow device to excessive references pages at the tail
+ * of the inactive LRU.
+ */
+ switch(reason) {
+ case VMSCAN_THROTTLE_NOPROGRESS:
+ case VMSCAN_THROTTLE_WRITEBACK:
+ timeout = HZ/10;
+ break;
+ case VMSCAN_THROTTLE_ISOLATED:
+ timeout = HZ/50;
+ break;
+ default:
+ WARN_ON_ONCE(1);
+ timeout = HZ;
+ break;
+ }
+
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = schedule_timeout(timeout);
finish_wait(wqh, &wait);
@@ -2307,7 +2330,7 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,

/* wait a bit for the reclaimer. */
stalled = true;
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED);

/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
@@ -3239,7 +3262,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
* until some pages complete writeback.
*/
if (sc->nr.immediate)
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);
}

/*
@@ -3263,7 +3286,7 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
if (!current_is_kswapd() && current_may_throttle() &&
!sc->hibernation_mode &&
test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_WRITEBACK);

if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
sc))
@@ -3335,7 +3358,7 @@ static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)

/* Throttle if making no progress at high prioities. */
if (sc->priority < DEF_PRIORITY - 2)
- reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS);
}

/*
@@ -3804,7 +3827,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
z = first_zones_zonelist(zonelist, sc.reclaim_idx, sc.nodemask);
pgdat = zonelist_zone(z)->zone_pgdat;

- reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS);
}

return nr_reclaimed;
--
2.31.1

2021-10-08 13:56:49

[permalink] [raw]

Subject: [PATCH 2/8] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated

Page reclaim throttles on congestion if too many parallel reclaim instances
have isolated too many pages. This makes no sense, excessive parallelisation
has nothing to do with writeback or congestion.

This patch creates an additional workqueue to sleep on when too many
pages are isolated. The throttled tasks are woken when the number
of isolated pages is reduced or a timeout occurs. There may be
some false positive wakeups for GFP_NOIO/GFP_NOFS callers but
the tasks will throttle again if necessary.

[[email protected]: Wake up from compaction context]
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 4 +++-
include/trace/events/vmscan.h | 4 +++-
mm/compaction.c | 10 ++++++++--
mm/internal.h | 11 +++++++++++
mm/page_alloc.c | 6 +++++-
mm/vmscan.c | 18 ++++++++++++------
6 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ef0a63ebd21d..ca65d6a64bdd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -275,6 +275,8 @@ enum lru_list {

enum vmscan_throttle_state {
VMSCAN_THROTTLE_WRITEBACK,
+ VMSCAN_THROTTLE_ISOLATED,
+ NR_VMSCAN_THROTTLE,
};

#define for_each_lru(lru) for (lru = 0; lru < NR_LRU_LISTS; lru++)
@@ -846,7 +848,7 @@ typedef struct pglist_data {
int node_id;
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
- wait_queue_head_t reclaim_wait; /* wq for throttling reclaim */
+ wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
atomic_t nr_reclaim_throttled; /* nr of throtted tasks */
unsigned long nr_reclaim_start; /* nr pages written while throttled
* when throttling started. */
diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
index c317f9fe0d17..d4905bd9e9c4 100644
--- a/include/trace/events/vmscan.h
+++ b/include/trace/events/vmscan.h
@@ -28,10 +28,12 @@
) : "RECLAIM_WB_NONE"

#define _VMSCAN_THROTTLE_WRITEBACK (1 << VMSCAN_THROTTLE_WRITEBACK)
+#define _VMSCAN_THROTTLE_ISOLATED (1 << VMSCAN_THROTTLE_ISOLATED)

#define show_throttle_flags(flags) \
(flags) ? __print_flags(flags, "|", \
- {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"} \
+ {_VMSCAN_THROTTLE_WRITEBACK, "VMSCAN_THROTTLE_WRITEBACK"}, \
+ {_VMSCAN_THROTTLE_ISOLATED, "VMSCAN_THROTTLE_ISOLATED"} \
) : "VMSCAN_THROTTLE_NONE"

diff --git a/mm/compaction.c b/mm/compaction.c
index bfc93da1c2c7..7359093d8ac0 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -761,6 +761,8 @@ isolate_freepages_range(struct compact_control *cc,
/* Similar to reclaim, but different enough that they don't share logic */
static bool too_many_isolated(pg_data_t *pgdat)
{
+ bool too_many;
+
unsigned long active, inactive, isolated;

inactive = node_page_state(pgdat, NR_INACTIVE_FILE) +
@@ -770,7 +772,11 @@ static bool too_many_isolated(pg_data_t *pgdat)
isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
node_page_state(pgdat, NR_ISOLATED_ANON);

- return isolated > (inactive + active) / 2;
+ too_many = isolated > (inactive + active) / 2;
+ if (!too_many)
+ wake_throttle_isolated(pgdat);
+
+ return too_many;
}

/**
@@ -822,7 +828,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
if (cc->mode == MIGRATE_ASYNC)
return -EAGAIN;

- congestion_wait(BLK_RW_ASYNC, HZ/10);
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);

if (fatal_signal_pending(current))
return -EINTR;
diff --git a/mm/internal.h b/mm/internal.h
index 90764d646e02..06d0c376efcd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -45,6 +45,15 @@ static inline void acct_reclaim_writeback(struct page *page)
__acct_reclaim_writeback(pgdat, page, nr_throttled);
}

+static inline void wake_throttle_isolated(pg_data_t *pgdat)
+{
+ wait_queue_head_t *wqh;
+
+ wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
+ if (waitqueue_active(wqh))
+ wake_up_all(wqh);
+}
+
vm_fault_t do_swap_page(struct vm_fault *vmf);

void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
@@ -120,6 +129,8 @@ extern unsigned long highest_memmap_pfn;
*/
extern int isolate_lru_page(struct page *page);
extern void putback_lru_page(struct page *page);
+extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+ long timeout);

/*
* in mm/rmap.c:
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d849ddfc1e51..78e538067651 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7389,6 +7389,8 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}

static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
{
+ int i;
+
pgdat_resize_init(pgdat);

pgdat_init_split_queue(pgdat);
@@ -7396,7 +7398,9 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)

init_waitqueue_head(&pgdat->kswapd_wait);
init_waitqueue_head(&pgdat->pfmemalloc_wait);
- init_waitqueue_head(&pgdat->reclaim_wait);
+
+ for (i = 0; i < NR_VMSCAN_THROTTLE; i++)
+ init_waitqueue_head(&pgdat->reclaim_wait[i]);

pgdat_page_ext_init(pgdat);
lruvec_init(&pgdat->__lruvec);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bcd22e53795f..9ce4195d4123 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1006,11 +1006,10 @@ static void handle_write_error(struct address_space *mapping,
unlock_page(page);
}

-static void
-reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
+void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
long timeout)
{
- wait_queue_head_t *wqh = &pgdat->reclaim_wait;
+ wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
long ret;
DEFINE_WAIT(wait);

@@ -1053,7 +1052,7 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
READ_ONCE(pgdat->nr_reclaim_start);

if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
- wake_up_all(&pgdat->reclaim_wait);
+ wake_up_all(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
}

/* possible outcome of pageout() */
@@ -2168,6 +2167,7 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
struct scan_control *sc)
{
unsigned long inactive, isolated;
+ bool too_many;

if (current_is_kswapd())
return 0;
@@ -2191,7 +2191,13 @@ static int too_many_isolated(struct pglist_data *pgdat, int file,
if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
inactive >>= 3;

- return isolated > inactive;
+ too_many = isolated > inactive;
+
+ /* Wake up tasks throttled due to too_many_isolated. */
+ if (!too_many)
+ wake_throttle_isolated(pgdat);
+
+ return too_many;
}

/*
@@ -2300,8 +2306,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
return 0;

/* wait a bit for the reclaimer. */
- msleep(100);
stalled = true;
+ reclaim_throttle(pgdat, VMSCAN_THROTTLE_ISOLATED, HZ/10);

/* We are about to die and free our memory. Return now. */
if (fatal_signal_pending(current))
--
2.31.1

2021-10-08 13:56:55

[permalink] [raw]

Subject: [PATCH 8/8] mm/vmscan: Delay waking of tasks throttled on NOPROGRESS

Tracing indicates that tasks throttled on NOPROGRESS are woken
prematurely resulting in occasional massive spikes in direct
reclaim activity. This patch wakes tasks throttled on NOPROGRESS
if reclaim efficiency is at least 12%.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 7b54fec4072c..80a9a26f701f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3338,8 +3338,11 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)

static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
{
- /* If reclaim is making progress, wake any throttled tasks. */
- if (sc->nr_reclaimed) {
+ /*
+ * If reclaim is making progress greater than 12% efficiency then
+ * wake all the NOPROGRESS throttled tasks.
+ */
+ if (sc->nr_reclaimed > (sc->nr_scanned >> 3)) {
wait_queue_head_t *wqh;

wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS];
--
2.31.1

2021-10-08 13:58:15

[permalink] [raw]

Subject: [PATCH 7/8] mm/vmscan: Increase the timeout if page reclaim is not making progress

Tracing of the stutterp workload showed the following delays

1 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=536000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=544000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=556000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=624000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=716000 reason=VMSCAN_THROTTLE_NOPROGRESS
1 usect_delayed=772000 reason=VMSCAN_THROTTLE_NOPROGRESS
2 usect_delayed=512000 reason=VMSCAN_THROTTLE_NOPROGRESS
16 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
53 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
116 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
5907 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
71741 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS

All the throttling hit the full timeout and then there was wakeup delays
meaning that the wakeups are premature as no other reclaimer such as
kswapd has made progress. This patch increases the maximum timeout.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/vmscan.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e096e81dcbd8..7b54fec4072c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1038,6 +1038,8 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
*/
switch(reason) {
case VMSCAN_THROTTLE_NOPROGRESS:
+ timeout = HZ/2;
+ break;
case VMSCAN_THROTTLE_WRITEBACK:
timeout = HZ/10;
break;
--
2.31.1

2021-10-13 15:42:30

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback completes if congested

On 10/8/21 15:53, Mel Gorman wrote:
> Page reclaim throttles on wait_iff_congested under the following conditions
>
> o kswapd is encountering pages under writeback and marked for immediate
> reclaim implying that pages are cycling through the LRU faster than
> pages can be cleaned.
>
> o Direct reclaim will stall if all dirty pages are backed by congested
> inodes.
>
> wait_iff_congested is almost completely broken with few exceptions. This
> patch adds a new node-based workqueue and tracks the number of throttled
> tasks and pages written back since throttling started. If enough pages
> belonging to the node are written back then the throttled tasks will wake
> early. If not, the throttled tasks sleeps until the timeout expires.
>
> [[email protected]: Uninterruptible sleep and simpler wakeups]
> [[email protected]: Avoid race when reclaim starts]
> Signed-off-by: Mel Gorman <[email protected]>

Seems mostly OK, have just some doubts wrt NR_THROTTLED_WRITTEN mechanics,
that may ultimately be just a point of comments to add.

...

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1006,6 +1006,56 @@ static void handle_write_error(struct address_space *mapping,
> unlock_page(page);
> }
>
> +static void
> +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> + long timeout)
> +{
> + wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> + long ret;
> + DEFINE_WAIT(wait);
> +
> + /*
> + * Do not throttle IO workers, kthreads other than kswapd or
> + * workqueues. They may be required for reclaim to make
> + * forward progress (e.g. journalling workqueues or kthreads).
> + */
> + if (!current_is_kswapd() &&
> + current->flags & (PF_IO_WORKER|PF_KTHREAD))
> + return;
> +
> + if (atomic_inc_return(&pgdat->nr_reclaim_throttled) == 1) {
> + WRITE_ONCE(pgdat->nr_reclaim_start,
> + node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> + }
> +
> + prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> + ret = schedule_timeout(timeout);
> + finish_wait(wqh, &wait);
> + atomic_dec(&pgdat->nr_reclaim_throttled);
> +
> + trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
> + jiffies_to_usecs(timeout - ret),
> + reason);
> +}
> +
> +/*
> + * Account for pages written if tasks are throttled waiting on dirty
> + * pages to clean. If enough pages have been cleaned since throttling
> + * started then wakeup the throttled tasks.
> + */
> +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
> + int nr_throttled)
> +{
> + unsigned long nr_written;
> +
> + __inc_node_page_state(page, NR_THROTTLED_WRITTEN);

Is this intentionally using the __ version that normally expects irqs to be
disabled (AFAIK they are not in this path)? I think this is rarely used cold
path so it doesn't seem worth to trade off speed for accuracy.

> + nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> + READ_ONCE(pgdat->nr_reclaim_start);

Even if the inc above was safe, node_page_state() will return only the
global counter, so the value we read here will only actually increment when
some cpu's counter overflows, so it will be "bursty". Maybe it's ok, just
worth documenting?

> +
> + if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> + wake_up_all(&pgdat->reclaim_wait);

Hm it seems a bit weird that the more tasks are throttled, the more we wait,
and then wake up all. Theoretically this will lead to even more
bursty/staggering herd behavior. Could be better to wake up single task each
SWAP_CLUSTER_MAX, and bump nr_reclaim_start? But maybe it's not a problem in
practice due to HZ/10 timeouts being short enough?

> +}
> +

2021-10-14 08:09:10

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 2/8] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated

On 10/8/21 15:53, Mel Gorman wrote:
> Page reclaim throttles on congestion if too many parallel reclaim instances
> have isolated too many pages. This makes no sense, excessive parallelisation
> has nothing to do with writeback or congestion.
>
> This patch creates an additional workqueue to sleep on when too many
> pages are isolated. The throttled tasks are woken when the number
> of isolated pages is reduced or a timeout occurs. There may be
> some false positive wakeups for GFP_NOIO/GFP_NOFS callers but
> the tasks will throttle again if necessary.
>
> [[email protected]: Wake up from compaction context]
> Signed-off-by: Mel Gorman <[email protected]>

...

> diff --git a/mm/internal.h b/mm/internal.h
> index 90764d646e02..06d0c376efcd 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -45,6 +45,15 @@ static inline void acct_reclaim_writeback(struct page *page)
> __acct_reclaim_writeback(pgdat, page, nr_throttled);
> }
>
> +static inline void wake_throttle_isolated(pg_data_t *pgdat)
> +{
> + wait_queue_head_t *wqh;
> +
> + wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
> + if (waitqueue_active(wqh))
> + wake_up_all(wqh);

Again, would it be better to wake up just one task to prevent possible
thundering herd? We can assume that that task will call too_many_isolated()
eventually to wake up the next one? Although it seems strange that
too_many_isolated() is the place where we detect the situation for wake up.
Simpler than to hook into NR_ISOLATED decrementing I guess.

> +}
> +
> vm_fault_t do_swap_page(struct vm_fault *vmf);
>
> void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
...
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1006,11 +1006,10 @@ static void handle_write_error(struct address_space *mapping,
> unlock_page(page);
> }
>
> -static void
> -reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> +void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> long timeout)
> {
> - wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> + wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];

It seems weird that later in this function we increase nr_reclaim_throttled
without distinguishing the reason, so effectively throttling for isolated
pages will trigger acct_reclaim_writeback() doing the NR_THROTTLED_WRITTEN
counting, although it's not related at all? Maybe either have separate
nr_reclaim_throttled counters per vmscan_throttle_state (if counter of
isolated is useful, I haven't seen the rest of series yet), or count only
VMSCAN_THROTTLE_WRITEBACK tasks?

> long ret;
> DEFINE_WAIT(wait);
>
> @@ -1053,7 +1052,7 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
> READ_ONCE(pgdat->nr_reclaim_start);
>
> if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> - wake_up_all(&pgdat->reclaim_wait);
> + wake_up_all(&pgdat->reclaim_wait[VMSCAN_THROTTLE_WRITEBACK]);
> }
>
> /* possible outcome of pageout() */

2021-10-14 11:59:07

[permalink] [raw]

Subject: Re: [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback completes if congested

Thanks Vlastimil

On Wed, Oct 13, 2021 at 05:39:36PM +0200, Vlastimil Babka wrote:
> > +/*
> > + * Account for pages written if tasks are throttled waiting on dirty
> > + * pages to clean. If enough pages have been cleaned since throttling
> > + * started then wakeup the throttled tasks.
> > + */
> > +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
> > + int nr_throttled)
> > +{
> > + unsigned long nr_written;
> > +
> > + __inc_node_page_state(page, NR_THROTTLED_WRITTEN);
>
> Is this intentionally using the __ version that normally expects irqs to be
> disabled (AFAIK they are not in this path)? I think this is rarely used cold
> path so it doesn't seem worth to trade off speed for accuracy.
>

It was intentional because IRQs can be disabled and if it's race-prone,
it's not overly problematic but you're right, better to be safe. I changed
it to the safe type as it's mostly free on x86, arm64 and s390 and for
other architectures, this is a slow path.

> > + nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> > + READ_ONCE(pgdat->nr_reclaim_start);
>
> Even if the inc above was safe, node_page_state() will return only the
> global counter, so the value we read here will only actually increment when
> some cpu's counter overflows, so it will be "bursty". Maybe it's ok, just
> worth documenting?
>

I didn't think the penalty of doing an accurate read while writeback
throttled is worth it. I'll add a comment.

> > +
> > + if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
> > + wake_up_all(&pgdat->reclaim_wait);
>
> Hm it seems a bit weird that the more tasks are throttled, the more we wait,
> and then wake up all. Theoretically this will lead to even more
> bursty/staggering herd behavior. Could be better to wake up single task each
> SWAP_CLUSTER_MAX, and bump nr_reclaim_start? But maybe it's not a problem in
> practice due to HZ/10 timeouts being short enough?
>

Yes, the more tasks are throttled the longer tasks wait because tasks are
allocating faster than writeback can complete so I wanted to reduce the
allocation pressure. I considered waking one task at a time but there is
no prioritisation of tasks on the waitqueue and it's not clear that the
additional complexity is justified. With inaccurate counters, a light
allocator could get throttled for the full timeout unnecessarily.

Even if we were to wake one task at a time, I would prefer it was done
as a potential optimisation on top.

Diff on top based on review feedback;

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bcd22e53795f..735b1f2b5d9e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1048,7 +1048,15 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
{
unsigned long nr_written;

- __inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+ inc_node_page_state(page, NR_THROTTLED_WRITTEN);
+
+ /*
+ * This is an inaccurate read as the per-cpu deltas may not
+ * be synchronised. However, given that the system is
+ * writeback throttled, it is not worth taking the penalty
+ * of getting an accurate count. At worst, the throttle
+ * timeout guarantees forward progress.
+ */
nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
READ_ONCE(pgdat->nr_reclaim_start);

2021-10-14 13:19:47

[permalink] [raw]

Subject: Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made

On Thu, Oct 14, 2021 at 02:31:17PM +0200, Vlastimil Babka wrote:
> On 10/8/21 15:53, Mel Gorman wrote:
> > Memcg reclaim throttles on congestion if no reclaim progress is made.
> > This makes little sense, it might be due to writeback or a host of
> > other factors.
> >
> > For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
> > in the page allocator if it is failing to make progress. Kswapd
> > throttles if too many pages are under writeback and marked for
> > immediate reclaim.
> >
> > This patch explicitly throttles if reclaim is failing to make progress.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
> ...
> > @@ -3769,6 +3797,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> > trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
> > set_task_reclaim_state(current, NULL);
> >
> > + if (!nr_reclaimed) {
> > + struct zoneref *z;
> > + pg_data_t *pgdat;
> > +
> > + z = first_zones_zonelist(zonelist, sc.reclaim_idx, sc.nodemask);
> > + pgdat = zonelist_zone(z)->zone_pgdat;
> > +
> > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
> > + }
>
> Is this necessary? AFAICS here we just returned from:
>
> do_try_to_free_pages()
> shrink_zones()
> for_each_zone()...
> consider_reclaim_throttle()
>
> Which already throttles when needed and using the appropriate pgdat, while
> here we have to somewhat awkwardly assume the preferred one.
>

Yes, you're right, consider_reclaim_throttle not only throttles on the
appropriate pgdat but takes priority into account.

Well spotted!

--
Mel Gorman
SUSE Labs

2021-10-14 14:24:22

[permalink] [raw]

Subject: Re: [PATCH 2/8] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated

On Thu, Oct 14, 2021 at 10:06:25AM +0200, Vlastimil Babka wrote:
> On 10/8/21 15:53, Mel Gorman wrote:
> > Page reclaim throttles on congestion if too many parallel reclaim instances
> > have isolated too many pages. This makes no sense, excessive parallelisation
> > has nothing to do with writeback or congestion.
> >
> > This patch creates an additional workqueue to sleep on when too many
> > pages are isolated. The throttled tasks are woken when the number
> > of isolated pages is reduced or a timeout occurs. There may be
> > some false positive wakeups for GFP_NOIO/GFP_NOFS callers but
> > the tasks will throttle again if necessary.
> >
> > [[email protected]: Wake up from compaction context]
> > Signed-off-by: Mel Gorman <[email protected]>
>
> ...
>
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 90764d646e02..06d0c376efcd 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -45,6 +45,15 @@ static inline void acct_reclaim_writeback(struct page *page)
> > __acct_reclaim_writeback(pgdat, page, nr_throttled);
> > }
> >
> > +static inline void wake_throttle_isolated(pg_data_t *pgdat)
> > +{
> > + wait_queue_head_t *wqh;
> > +
> > + wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
> > + if (waitqueue_active(wqh))
> > + wake_up_all(wqh);
>
> Again, would it be better to wake up just one task to prevent possible
> thundering herd? We can assume that that task will call too_many_isolated()
> eventually to wake up the next one?

Same problem as the writeback throttling, there is no prioritsation of
light vs heavy allocators.

> Although it seems strange that
> too_many_isolated() is the place where we detect the situation for wake up.
> Simpler than to hook into NR_ISOLATED decrementing I guess.
>

Simplier but more costly. Every decrement would have to check
too_many_isolated(). I think the cost of that is too high given that the
VMSCAN_THROTTLE_ISOLATED is relatively hard to trigger and the minority
of throttling events.

> > +}
> > +
> > vm_fault_t do_swap_page(struct vm_fault *vmf);
> >
> > void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> ...
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1006,11 +1006,10 @@ static void handle_write_error(struct address_space *mapping,
> > unlock_page(page);
> > }
> >
> > -static void
> > -reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> > +void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> > long timeout)
> > {
> > - wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> > + wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
>
> It seems weird that later in this function we increase nr_reclaim_throttled
> without distinguishing the reason, so effectively throttling for isolated
> pages will trigger acct_reclaim_writeback() doing the NR_THROTTLED_WRITTEN
> counting, although it's not related at all? Maybe either have separate
> nr_reclaim_throttled counters per vmscan_throttle_state (if counter of
> isolated is useful, I haven't seen the rest of series yet), or count only
> VMSCAN_THROTTLE_WRITEBACK tasks?
>

Very good point, it would be more appropriate to only count the
writeback reason.

Diff on top is below. It'll cause minor conflicts later in the series.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index ca65d6a64bdd..58a25d42c31c 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -849,7 +849,7 @@ typedef struct pglist_data {
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
- atomic_t nr_reclaim_throttled; /* nr of throtted tasks */
+ atomic_t nr_writeback_throttled;/* nr of writeback-throttled tasks */
unsigned long nr_reclaim_start; /* nr pages written while throttled
* when throttling started. */
struct task_struct *kswapd; /* Protected by
diff --git a/mm/internal.h b/mm/internal.h
index 06d0c376efcd..3461a1055975 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -39,7 +39,7 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
static inline void acct_reclaim_writeback(struct page *page)
{
pg_data_t *pgdat = page_pgdat(page);
- int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
+ int nr_throttled = atomic_read(&pgdat->nr_writeback_throttled);

if (nr_throttled)
__acct_reclaim_writeback(pgdat, page, nr_throttled);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6e198bbbd86a..29434d4fc1c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1011,6 +1011,7 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
{
wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
long ret;
+ bool acct_writeback = (reason == VMSCAN_THROTTLE_WRITEBACK);
DEFINE_WAIT(wait);

/*
@@ -1022,7 +1023,8 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
current->flags & (PF_IO_WORKER|PF_KTHREAD))
return;

- if (atomic_inc_return(&pgdat->nr_reclaim_throttled) == 1) {
+ if (acct_writeback &&
+ atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
WRITE_ONCE(pgdat->nr_reclaim_start,
node_page_state(pgdat, NR_THROTTLED_WRITTEN));
}
@@ -1030,7 +1032,9 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
ret = schedule_timeout(timeout);
finish_wait(wqh, &wait);
- atomic_dec(&pgdat->nr_reclaim_throttled);
+
+ if (acct_writeback)
+ atomic_dec(&pgdat->nr_writeback_throttled);

trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
jiffies_to_usecs(timeout - ret),
@@ -4349,7 +4353,7 @@ static int kswapd(void *p)

WRITE_ONCE(pgdat->kswapd_order, 0);
WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
- atomic_set(&pgdat->nr_reclaim_throttled, 0);
+ atomic_set(&pgdat->nr_writeback_throttled, 0);
for ( ; ; ) {
bool ret;

2021-10-14 15:33:23

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made

On 10/8/21 15:53, Mel Gorman wrote:
> Memcg reclaim throttles on congestion if no reclaim progress is made.
> This makes little sense, it might be due to writeback or a host of
> other factors.
>
> For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
> in the page allocator if it is failing to make progress. Kswapd
> throttles if too many pages are under writeback and marked for
> immediate reclaim.
>
> This patch explicitly throttles if reclaim is failing to make progress.
>
> Signed-off-by: Mel Gorman <[email protected]>
...
> @@ -3769,6 +3797,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
> trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
> set_task_reclaim_state(current, NULL);
>
> + if (!nr_reclaimed) {
> + struct zoneref *z;
> + pg_data_t *pgdat;
> +
> + z = first_zones_zonelist(zonelist, sc.reclaim_idx, sc.nodemask);
> + pgdat = zonelist_zone(z)->zone_pgdat;
> +
> + reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
> + }

Is this necessary? AFAICS here we just returned from:

do_try_to_free_pages()
shrink_zones()
for_each_zone()...
consider_reclaim_throttle()

Which already throttles when needed and using the appropriate pgdat, while
here we have to somewhat awkwardly assume the preferred one.

> +
> return nr_reclaimed;
> }
> #endif
>

2021-10-14 18:27:08

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 5/8] mm/page_alloc: Remove the throttling logic from the page allocator

On 10/8/21 15:53, Mel Gorman wrote:
> The page allocator stalls based on the number of pages that are
> waiting for writeback to start but this should now be redundant.
> shrink_inactive_list() will wake flusher threads if the LRU tail are
> unqueued dirty pages so the flusher should be active. If it fails to make
> progress due to pages under writeback not being completed quickly then
> it should stall on VMSCAN_THROTTLE_WRITEBACK.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> mm/page_alloc.c | 21 +--------------------
> 1 file changed, 1 insertion(+), 20 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 78e538067651..8fa0109ff417 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4795,30 +4795,11 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> trace_reclaim_retry_zone(z, order, reclaimable,
> available, min_wmark, *no_progress_loops, wmark);
> if (wmark) {
> - /*
> - * If we didn't make any progress and have a lot of
> - * dirty + writeback pages then we should wait for
> - * an IO to complete to slow down the reclaim and
> - * prevent from pre mature OOM
> - */
> - if (!did_some_progress) {
> - unsigned long write_pending;
> -
> - write_pending = zone_page_state_snapshot(zone,
> - NR_ZONE_WRITE_PENDING);
> -
> - if (2 * write_pending > reclaimable) {
> - congestion_wait(BLK_RW_ASYNC, HZ/10);
> - return true;
> - }
> - }
> -
> ret = true;
> - goto out;
> + break;
> }
> }
>
> -out:
> /*
> * Memory allocation/reclaim might be called from a WQ context and the
> * current implementation of the WQ concurrency control doesn't
>

2021-10-14 18:27:40

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 6/8] mm/vmscan: Centralise timeout values for reclaim_throttle

On 10/8/21 15:53, Mel Gorman wrote:
> Neil Brown raised concerns about callers of reclaim_throttle specifying
> a timeout value. The original timeout values to congestion_wait() were
> probably pulled out of thin air or copy&pasted from somewhere else.
> This patch centralises the timeout values and selects a timeout based
> on the reason for reclaim throttling. These figures are also pulled
> out of the same thin air but better values may be derived
>
> Running a workload that is throttling for inappropriate periods
> and tracing mm_vmscan_throttled can be used to pick a more appropriate
> value. Excessive throttling would pick a lower timeout where as
> excessive CPU usage in reclaim context would select a larger timeout.
> Ideally a large value would always be used and the wakeups would
> occur before a timeout but that requires careful testing.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

2021-10-14 18:29:13

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 1/8] mm/vmscan: Throttle reclaim until some writeback completes if congested

On 10/14/21 12:47, Mel Gorman wrote:
> Thanks Vlastimil
>
> On Wed, Oct 13, 2021 at 05:39:36PM +0200, Vlastimil Babka wrote:
>> > +/*
>> > + * Account for pages written if tasks are throttled waiting on dirty
>> > + * pages to clean. If enough pages have been cleaned since throttling
>> > + * started then wakeup the throttled tasks.
>> > + */
>> > +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
>> > + int nr_throttled)
>> > +{
>> > + unsigned long nr_written;
>> > +
>> > + __inc_node_page_state(page, NR_THROTTLED_WRITTEN);
>>
>> Is this intentionally using the __ version that normally expects irqs to be
>> disabled (AFAIK they are not in this path)? I think this is rarely used cold
>> path so it doesn't seem worth to trade off speed for accuracy.
>>
>
> It was intentional because IRQs can be disabled and if it's race-prone,
> it's not overly problematic but you're right, better to be safe. I changed
> it to the safe type as it's mostly free on x86, arm64 and s390 and for
> other architectures, this is a slow path.

Great, thanks.

>> > + nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
>> > + READ_ONCE(pgdat->nr_reclaim_start);
>>
>> Even if the inc above was safe, node_page_state() will return only the
>> global counter, so the value we read here will only actually increment when
>> some cpu's counter overflows, so it will be "bursty". Maybe it's ok, just
>> worth documenting?
>>
>
> I didn't think the penalty of doing an accurate read while writeback
> throttled is worth it. I'll add a comment.
>
>> > +
>> > + if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
>> > + wake_up_all(&pgdat->reclaim_wait);
>>
>> Hm it seems a bit weird that the more tasks are throttled, the more we wait,
>> and then wake up all. Theoretically this will lead to even more
>> bursty/staggering herd behavior. Could be better to wake up single task each
>> SWAP_CLUSTER_MAX, and bump nr_reclaim_start? But maybe it's not a problem in
>> practice due to HZ/10 timeouts being short enough?
>>
>
> Yes, the more tasks are throttled the longer tasks wait because tasks are
> allocating faster than writeback can complete so I wanted to reduce the
> allocation pressure. I considered waking one task at a time but there is
> no prioritisation of tasks on the waitqueue and it's not clear that the
> additional complexity is justified. With inaccurate counters, a light
> allocator could get throttled for the full timeout unnecessarily.
>
> Even if we were to wake one task at a time, I would prefer it was done
> as a potential optimisation on top.

Fair enough.

> Diff on top based on review feedback;

Thanks, with that you can add

Acked-by: Vlastimil Babka <[email protected]>

to the updated version

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bcd22e53795f..735b1f2b5d9e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1048,7 +1048,15 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
> {
> unsigned long nr_written;
>
> - __inc_node_page_state(page, NR_THROTTLED_WRITTEN);
> + inc_node_page_state(page, NR_THROTTLED_WRITTEN);
> +
> + /*
> + * This is an inaccurate read as the per-cpu deltas may not
> + * be synchronised. However, given that the system is
> + * writeback throttled, it is not worth taking the penalty
> + * of getting an accurate count. At worst, the throttle
> + * timeout guarantees forward progress.
> + */
> nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
> READ_ONCE(pgdat->nr_reclaim_start);
>

2021-10-14 20:15:16

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 7/8] mm/vmscan: Increase the timeout if page reclaim is not making progress

On 10/8/21 15:53, Mel Gorman wrote:
> Tracing of the stutterp workload showed the following delays
>
> 1 usect_delayed=124000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=128000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=176000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=536000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=544000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=556000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=624000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=716000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 1 usect_delayed=772000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 2 usect_delayed=512000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 16 usect_delayed=120000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 53 usect_delayed=116000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 116 usect_delayed=112000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 5907 usect_delayed=108000 reason=VMSCAN_THROTTLE_NOPROGRESS
> 71741 usect_delayed=104000 reason=VMSCAN_THROTTLE_NOPROGRESS
>
> All the throttling hit the full timeout and then there was wakeup delays
> meaning that the wakeups are premature as no other reclaimer such as
> kswapd has made progress. This patch increases the maximum timeout.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> mm/vmscan.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index e096e81dcbd8..7b54fec4072c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1038,6 +1038,8 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason)
> */
> switch(reason) {
> case VMSCAN_THROTTLE_NOPROGRESS:
> + timeout = HZ/2;
> + break;
> case VMSCAN_THROTTLE_WRITEBACK:
> timeout = HZ/10;
> break;
>

2021-10-14 20:17:42

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 4/8] mm/writeback: Throttle based on page writeback instead of congestion

On 10/8/21 15:53, Mel Gorman wrote:
> do_writepages throttles on congestion if the writepages() fails due to a
> lack of memory but congestion_wait() is partially broken as the congestion
> state is not updated for all BDIs.
>
> This patch stalls waiting for a number of pages to complete writeback
> that located on the local node. The main weakness is that there is no
> correlation between the location of the inode's pages and locality but
> that is still better than congestion_wait.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> mm/page-writeback.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 4812a17b288c..f34f54fcd5b4 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2366,8 +2366,15 @@ int do_writepages(struct address_space *mapping, struct writeback_control *wbc)
> ret = generic_writepages(mapping, wbc);
> if ((ret != -ENOMEM) || (wbc->sync_mode != WB_SYNC_ALL))
> break;
> - cond_resched();
> - congestion_wait(BLK_RW_ASYNC, HZ/50);
> +
> + /*
> + * Lacking an allocation context or the locality or writeback
> + * state of any of the inode's pages, throttle based on
> + * writeback activity on the local node. It's as good a
> + * guess as any.
> + */
> + reclaim_throttle(NODE_DATA(numa_node_id()),
> + VMSCAN_THROTTLE_WRITEBACK, HZ/50);
> }
> /*
> * Usually few pages are written by now from those we've just submitted
>

2021-10-14 20:25:59

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 8/8] mm/vmscan: Delay waking of tasks throttled on NOPROGRESS

On 10/8/21 15:53, Mel Gorman wrote:
> Tracing indicates that tasks throttled on NOPROGRESS are woken
> prematurely resulting in occasional massive spikes in direct
> reclaim activity. This patch wakes tasks throttled on NOPROGRESS
> if reclaim efficiency is at least 12%.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> mm/vmscan.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7b54fec4072c..80a9a26f701f 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3338,8 +3338,11 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
>
> static void consider_reclaim_throttle(pg_data_t *pgdat, struct scan_control *sc)
> {
> - /* If reclaim is making progress, wake any throttled tasks. */
> - if (sc->nr_reclaimed) {
> + /*
> + * If reclaim is making progress greater than 12% efficiency then
> + * wake all the NOPROGRESS throttled tasks.
> + */
> + if (sc->nr_reclaimed > (sc->nr_scanned >> 3)) {
> wait_queue_head_t *wqh;
>
> wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_NOPROGRESS];
>

2021-10-14 20:26:37

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 2/8] mm/vmscan: Throttle reclaim and compaction when too may pages are isolated

On 10/14/21 13:56, Mel Gorman wrote:
> On Thu, Oct 14, 2021 at 10:06:25AM +0200, Vlastimil Babka wrote:
>> On 10/8/21 15:53, Mel Gorman wrote:
>> > Page reclaim throttles on congestion if too many parallel reclaim instances
>> > have isolated too many pages. This makes no sense, excessive parallelisation
>> > has nothing to do with writeback or congestion.
>> >
>> > This patch creates an additional workqueue to sleep on when too many
>> > pages are isolated. The throttled tasks are woken when the number
>> > of isolated pages is reduced or a timeout occurs. There may be
>> > some false positive wakeups for GFP_NOIO/GFP_NOFS callers but
>> > the tasks will throttle again if necessary.
>> >
>> > [[email protected]: Wake up from compaction context]
>> > Signed-off-by: Mel Gorman <[email protected]>
>>
>> ...
>>
>> > diff --git a/mm/internal.h b/mm/internal.h
>> > index 90764d646e02..06d0c376efcd 100644
>> > --- a/mm/internal.h
>> > +++ b/mm/internal.h
>> > @@ -45,6 +45,15 @@ static inline void acct_reclaim_writeback(struct page *page)
>> > __acct_reclaim_writeback(pgdat, page, nr_throttled);
>> > }
>> >
>> > +static inline void wake_throttle_isolated(pg_data_t *pgdat)
>> > +{
>> > + wait_queue_head_t *wqh;
>> > +
>> > + wqh = &pgdat->reclaim_wait[VMSCAN_THROTTLE_ISOLATED];
>> > + if (waitqueue_active(wqh))
>> > + wake_up_all(wqh);
>>
>> Again, would it be better to wake up just one task to prevent possible
>> thundering herd? We can assume that that task will call too_many_isolated()
>> eventually to wake up the next one?
>
> Same problem as the writeback throttling, there is no prioritsation of
> light vs heavy allocators.
>
>> Although it seems strange that
>> too_many_isolated() is the place where we detect the situation for wake up.
>> Simpler than to hook into NR_ISOLATED decrementing I guess.
>>
>
> Simplier but more costly. Every decrement would have to check
> too_many_isolated(). I think the cost of that is too high given that the
> VMSCAN_THROTTLE_ISOLATED is relatively hard to trigger and the minority
> of throttling events.

Agreed.

>> > +}
>> > +
>> > vm_fault_t do_swap_page(struct vm_fault *vmf);
>> >
>> > void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>> ...
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1006,11 +1006,10 @@ static void handle_write_error(struct address_space *mapping,
>> > unlock_page(page);
>> > }
>> >
>> > -static void
>> > -reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
>> > +void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
>> > long timeout)
>> > {
>> > - wait_queue_head_t *wqh = &pgdat->reclaim_wait;
>> > + wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
>>
>> It seems weird that later in this function we increase nr_reclaim_throttled
>> without distinguishing the reason, so effectively throttling for isolated
>> pages will trigger acct_reclaim_writeback() doing the NR_THROTTLED_WRITTEN
>> counting, although it's not related at all? Maybe either have separate
>> nr_reclaim_throttled counters per vmscan_throttle_state (if counter of
>> isolated is useful, I haven't seen the rest of series yet), or count only
>> VMSCAN_THROTTLE_WRITEBACK tasks?
>>
>
> Very good point, it would be more appropriate to only count the
> writeback reason.
>
> Diff on top is below. It'll cause minor conflicts later in the series.

Looks good, for the updated version:

Acked-by: Vlastimil Babka <[email protected]>

> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index ca65d6a64bdd..58a25d42c31c 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -849,7 +849,7 @@ typedef struct pglist_data {
> wait_queue_head_t kswapd_wait;
> wait_queue_head_t pfmemalloc_wait;
> wait_queue_head_t reclaim_wait[NR_VMSCAN_THROTTLE];
> - atomic_t nr_reclaim_throttled; /* nr of throtted tasks */
> + atomic_t nr_writeback_throttled;/* nr of writeback-throttled tasks */
> unsigned long nr_reclaim_start; /* nr pages written while throttled
> * when throttling started. */
> struct task_struct *kswapd; /* Protected by
> diff --git a/mm/internal.h b/mm/internal.h
> index 06d0c376efcd..3461a1055975 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -39,7 +39,7 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
> static inline void acct_reclaim_writeback(struct page *page)
> {
> pg_data_t *pgdat = page_pgdat(page);
> - int nr_throttled = atomic_read(&pgdat->nr_reclaim_throttled);
> + int nr_throttled = atomic_read(&pgdat->nr_writeback_throttled);
>
> if (nr_throttled)
> __acct_reclaim_writeback(pgdat, page, nr_throttled);
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 6e198bbbd86a..29434d4fc1c7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1011,6 +1011,7 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> {
> wait_queue_head_t *wqh = &pgdat->reclaim_wait[reason];
> long ret;
> + bool acct_writeback = (reason == VMSCAN_THROTTLE_WRITEBACK);
> DEFINE_WAIT(wait);
>
> /*
> @@ -1022,7 +1023,8 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> current->flags & (PF_IO_WORKER|PF_KTHREAD))
> return;
>
> - if (atomic_inc_return(&pgdat->nr_reclaim_throttled) == 1) {
> + if (acct_writeback &&
> + atomic_inc_return(&pgdat->nr_writeback_throttled) == 1) {
> WRITE_ONCE(pgdat->nr_reclaim_start,
> node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> }
> @@ -1030,7 +1032,9 @@ void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> ret = schedule_timeout(timeout);
> finish_wait(wqh, &wait);
> - atomic_dec(&pgdat->nr_reclaim_throttled);
> +
> + if (acct_writeback)
> + atomic_dec(&pgdat->nr_writeback_throttled);
>
> trace_mm_vmscan_throttled(pgdat->node_id, jiffies_to_usecs(timeout),
> jiffies_to_usecs(timeout - ret),
> @@ -4349,7 +4353,7 @@ static int kswapd(void *p)
>
> WRITE_ONCE(pgdat->kswapd_order, 0);
> WRITE_ONCE(pgdat->kswapd_highest_zoneidx, MAX_NR_ZONES);
> - atomic_set(&pgdat->nr_reclaim_throttled, 0);
> + atomic_set(&pgdat->nr_writeback_throttled, 0);
> for ( ; ; ) {
> bool ret;
>
>

2021-10-14 20:53:44

by Vlastimil Babka

[permalink] [raw]

Subject: Re: [PATCH 3/8] mm/vmscan: Throttle reclaim when no progress is being made

On 10/14/21 15:03, Mel Gorman wrote:
> On Thu, Oct 14, 2021 at 02:31:17PM +0200, Vlastimil Babka wrote:
>> On 10/8/21 15:53, Mel Gorman wrote:
>> > Memcg reclaim throttles on congestion if no reclaim progress is made.
>> > This makes little sense, it might be due to writeback or a host of
>> > other factors.
>> >
>> > For !memcg reclaim, it's messy. Direct reclaim primarily is throttled
>> > in the page allocator if it is failing to make progress. Kswapd
>> > throttles if too many pages are under writeback and marked for
>> > immediate reclaim.
>> >
>> > This patch explicitly throttles if reclaim is failing to make progress.
>> >
>> > Signed-off-by: Mel Gorman <[email protected]>
>> ...
>> > @@ -3769,6 +3797,16 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
>> > trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
>> > set_task_reclaim_state(current, NULL);
>> >
>> > + if (!nr_reclaimed) {
>> > + struct zoneref *z;
>> > + pg_data_t *pgdat;
>> > +
>> > + z = first_zones_zonelist(zonelist, sc.reclaim_idx, sc.nodemask);
>> > + pgdat = zonelist_zone(z)->zone_pgdat;
>> > +
>> > + reclaim_throttle(pgdat, VMSCAN_THROTTLE_NOPROGRESS, HZ/10);
>> > + }
>>
>> Is this necessary? AFAICS here we just returned from:
>>
>> do_try_to_free_pages()
>> shrink_zones()
>> for_each_zone()...
>> consider_reclaim_throttle()
>>
>> Which already throttles when needed and using the appropriate pgdat, while
>> here we have to somewhat awkwardly assume the preferred one.
>>
>
> Yes, you're right, consider_reclaim_throttle not only throttles on the
> appropriate pgdat but takes priority into account.
>
> Well spotted!

So with that part removed
Acked-by: Vlastimil Babka <[email protected]>

Thanks!