2018-04-02 09:26:31

by Buddy Lumpkin

Subject: [RFC PATCH 0/1] mm: Support multiple kswapd threads per node

I created this patch to address performance problems we are seeing in
Oracle Cloud Infrastructure. We run the Oracle Linux UEK4 kernel
internally, which is based on upstream 4.1. I created and tested this
patch for the latest upstream kernel and UEK4. I was able to show
substantial benefits in both kernels, using workloads that provide a
mix of anonymous memory allocations with filesystem writes.

As I went through the process of getting this patch approved internally, I
learned that it was hard to come up with a concise set of test results
that clearly demonstrates that devoting more threads toward proactive page
replacement is actually necessary. I was more focused on the impact that
direct reclaims had on latency at the time, so I came up with a systemtap
script that measures the latency of direct reclaims. On systems that were
doing large volumes of filesystem IO, I saw order 0 allocations regularly
taking over 10ms, and occasionally over 100ms. Since we were seeing large
volumes of direct reclaims triggered as a side effect of filesystem IO, I
figured this had to have a substantial impact on throughput.
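
For reference, the measurement can be approximated with a stap one-liner
along these lines (a sketch, not the exact script; it assumes
try_to_free_pages() is the direct reclaim entry point in mm/vmscan.c and
that kernel debuginfo is installed):

stap -e '
global start, lat
probe kernel.function("try_to_free_pages") {
        start[tid()] = gettimeofday_us()
}
probe kernel.function("try_to_free_pages").return {
        if (tid() in start) {
                # record per-call direct reclaim latency in microseconds
                lat <<< gettimeofday_us() - start[tid()]
                delete start[tid()]
        }
}
probe end {
        print(@hist_log(lat))
}'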

I compared the maximum read throughput that could be obtained using direct
IO streams to standard filesystem IO through the page cache on one of the
dense storage systems that we vend. Direct IO was 55% higher in throughput
than standard filesystem IO. I can't remember the last time I measured
this but I know it was over 15 years ago, and I am quite sure the number
was no more than 10%. I was pretty sure that direct reclaims were to blame
for most of this and it would only take a few more tests to prove it. At
23GB/s, it only takes 32.6 seconds to fill the page cache on one of these
systems, but that is enough time to measure throughput without any page
replacement occurring. In this case direct IO throughput was only 13.5%
higher. It was pretty clear that direct reclaims were causing a
substantial reduction in throughput. I decided this would be the ideal way
to show the benefits of threading kswapd.

On the UEK4 kernel, six kswapd threads provided a 48% increase over one.
When I ran the same tests on upstream kernel version 4.16.0-rc7, I only
saw a 20% increase with six threads, and the numbers fluctuated quite a
bit when I watched with iostat at a 2 second sample interval. The output
stalled periodically as well. When I profiled the system using perf, I
saw that 70% of the CPU time was being spent in a single function:
native_queued_spin_lock_slowpath(). 38% was during shrink_inactive_list()
and another 34% during __lru_cache_add().

I eventually determined that my tests were presenting a difficult pattern
for the logic that uses shadow entries to periodically resize the LRU
lists. This was not a problem in the UEK4 kernel, which also has shadow
entries, so something has changed in that regard. I have not had time to
really dig into this particular problem; however, I assume those who are
more familiar with the code might see the test results below and have an
idea about what is going on.

I have appended a small patch to the end of this cover letter that
effectively disables most of the routines in mm/workingset.c so that
filesystem IO can be used to demonstrate the benefits of a threaded
kswapd. I am not suggesting that this is the correct solution for this
problem.

The test results below are the same ones that were run to demonstrate
threaded kswapd performance. For more context, read the patch commit log
before continuing; the test results below will make more sense.

Direct IO results are roughly the same as expected ...

Test #1: Direct IO - shadow entries enabled
dd sy dd_cpu throughput
6 0 2.33 14726026.40
10 1 2.95 19954974.80
16 1 2.63 24419689.30
22 1 2.63 25430303.20
28 1 2.91 26026513.20
34 1 2.53 26178618.00
40 1 2.18 26239229.20
46 1 1.91 26250550.40
52 1 1.69 26251845.60
58 1 1.54 26253205.60
64 1 1.43 26253780.80
70 1 1.31 26254154.80
76 1 1.21 26253660.80
82 1 1.12 26254214.80
88 1 1.07 26253770.00
90 1 1.04 26252406.40

Going through the pagecache is a different story entirely. Let's look at
throughput with a single kswapd thread, with shadow entries enabled vs.
disabled:

shadow entries ENABLED, 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 5 27.96 35.52 34.94 7964174.80 0 460161197 0
16 8 40.75 84.86 81.92 11143540.00 0 907793664 0
22 12 45.01 99.96 99.98 12790778.40 6751 884827215 162344947
28 18 49.10 99.97 99.97 14410621.02 17989 719328362 536886953
34 22 52.87 99.80 99.98 14331978.80 25180 609680315 661201785
40 26 55.66 99.90 99.96 14612901.20 26843 449047388 810399311
46 28 56.37 99.74 99.96 15831410.40 33854 518952367 807944791
52 37 59.78 99.80 99.97 15264190.80 37042 372258422 881626890
58 50 71.90 99.44 99.53 14979692.40 45761 190511392 1114810023
64 53 72.14 99.84 99.95 14747164.80 83665 168461850 1013498958
70 50 68.09 99.80 99.90 15176129.60 113546 203506041 1008655113
76 59 73.77 99.73 99.96 14947922.40 98798 137174015 1057487320
82 66 79.25 99.66 99.98 14624100.40 100242 101830859 1074332196
88 73 81.26 98.85 99.98 14827533.60 101262 90402914 1086186724
90 78 85.48 99.55 99.98 14469963.20 101063 75722196 1083603245

shadow entries DISABLED, 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 4 26.07 28.56 27.03 7355924.40 0 459316976 0
16 7 34.94 69.33 69.66 10867895.20 0 872661643 0
22 10 36.03 93.99 99.33 13130613.60 489 1037654473 11268334
28 10 30.34 95.90 98.60 14601509.60 671 1182591373 15429142
34 14 34.77 97.50 99.23 16468012.00 10850 1069005644 249839515
40 17 36.32 91.49 97.11 17335987.60 18903 975417728 434467710
46 19 38.40 90.54 91.61 17705394.40 25369 855737040 582427973
52 22 40.88 83.97 83.70 17607680.40 31250 709532935 724282458
58 25 40.89 82.19 80.14 17976905.60 35060 657796473 804117540
64 28 41.77 73.49 75.20 18001910.00 39073 561813658 895289337
70 33 45.51 63.78 64.39 17061897.20 44523 379465571 1020726436
76 36 46.95 57.96 60.32 16964459.60 47717 291299464 1093172384
82 39 47.16 55.43 56.16 16949956.00 49479 247071062 1134163008
88 42 47.41 53.75 47.62 16930911.20 51521 195449924 1180442208
90 43 47.18 51.40 50.59 16864428.00 51618 190758156 1183203901

When shadow entries are disabled, kernel mode CPU consumption drops and
peak throughput increases by 13.7%.

Here is the same test with 4 kswapd threads:

shadow entries ENABLED, 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 6 30.09 17.36 16.82 7692440.40 0 460386412 0
16 11 42.86 34.35 33.86 10836456.80 23 885908695 550482
22 14 46.00 55.30 50.53 13125285.20 0 1075382922 0
28 17 43.74 87.18 44.18 15298355.20 0 1254927179 0
34 26 53.78 99.88 89.93 16203179.20 3443 1247514636 80817567
40 35 62.99 99.88 97.58 16653526.80 15376 960519369 369681969
46 36 51.66 99.85 90.87 18668439.60 10907 1239045416 259575692
52 46 66.96 99.61 99.96 16970211.60 24264 751180033 577278765
58 52 76.53 99.91 99.97 15336601.60 30676 513418729 725394427
64 58 78.20 99.79 99.96 15266654.40 33466 450869495 791218349
70 65 82.98 99.93 99.98 15285421.60 35647 370270673 843608871
76 69 81.52 99.87 99.87 15681812.00 37625 358457523 889023203
82 78 85.68 99.97 99.98 15370775.60 39010 302132025 921379809
88 85 88.52 99.88 99.56 15410439.20 40100 267031806 947441566
90 88 90.11 99.67 99.41 15400593.20 40443 249090848 953893493

shadow entries DISABLED, 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 5 27.09 16.65 14.17 7842605.60 0 459105291 0
16 10 37.12 26.02 24.85 11352920.40 15 920527796 358515
22 11 36.94 37.13 35.82 13771869.60 0 1132169011 0
28 13 35.23 48.43 46.86 16089746.00 0 1312902070 0
34 15 33.37 53.02 55.69 18314856.40 0 1476169080 0
40 19 35.90 69.60 64.41 19836126.80 0 1629999149 0
46 22 36.82 88.55 57.20 20740216.40 0 1708478106 0
52 24 34.38 93.76 68.34 21758352.00 0 1794055559 0
58 24 30.51 79.20 82.33 22735594.00 0 1872794397 0
64 26 30.21 97.12 76.73 23302203.60 176 1916593721 4206821
70 33 32.92 92.91 92.87 23776588.00 3575 1817685086 85574159
76 37 31.62 91.20 89.83 24308196.80 4752 1812262569 113981763
82 29 25.53 93.23 92.33 24802791.20 306 2032093122 7350704
88 43 37.12 76.18 77.01 25145694.40 20310 1253204719 487048202
90 42 38.56 73.90 74.57 22516787.60 22774 1193637495 545463615

With four kswapd threads, the effects are more pronounced. Kernel mode CPU
consumption is substantially higher with shadow entries enabled while
throughput is substantially lower.

When shadow entries are disabled, additional kswapd tasks increase
throughput while kernel mode CPU consumption stays roughly the same.

---
mm/workingset.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/mm/workingset.c b/mm/workingset.c
index b7d616a3bbbe..656451ce2d5e 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -213,6 +213,7 @@ void *workingset_eviction(struct address_space *mapping, struct page *page)
unsigned long eviction;
struct lruvec *lruvec;

+ return NULL;
/* Page is fully exclusive and pins page->mem_cgroup */
VM_BUG_ON_PAGE(PageLRU(page), page);
VM_BUG_ON_PAGE(page_count(page), page);
--

Buddy Lumpkin (1):
vmscan: Support multiple kswapd threads per node

Documentation/sysctl/vm.txt | 21 ++++++++
include/linux/mm.h | 2 +
include/linux/mmzone.h | 10 +++-
kernel/sysctl.c | 10 ++++
mm/page_alloc.c | 15 ++++++
mm/vmscan.c | 116 +++++++++++++++++++++++++++++++++++++-------
6 files changed, 155 insertions(+), 19 deletions(-)

--
1.8.3.1



2018-04-02 09:26:41

by Buddy Lumpkin

Subject: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

Page replacement is handled in the Linux Kernel in one of two ways:

1) Asynchronously via kswapd
2) Synchronously, via direct reclaim

At page allocation time the allocating task is immediately given a page
from the zone free list, allowing it to go right back to work doing
whatever it was doing; probably directly or indirectly executing
business logic.

Just prior to satisfying the allocation, the free page count is checked
to see if it has reached the zone low watermark and, if so, kswapd is
awakened.
Kswapd will start scanning pages looking for inactive pages to evict to
make room for new page allocations. The work of kswapd allows tasks to
continue allocating memory from their respective zone free list without
incurring any delay.

When the demand for free pages exceeds the rate that kswapd tasks can
supply them, page allocation works differently. Once the allocating task
finds that the number of free pages is at or below the zone min watermark,
the task will no longer pull pages from the free list. Instead, the task
will run the same CPU-bound routines as kswapd to satisfy its own
allocation by scanning and evicting pages. This is called a direct reclaim.
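
Both the watermarks and the resulting direct reclaim activity are visible
from userspace; for example (illustrative commands, not part of this
patch):

# Per-zone min/low/high watermarks, in pages
awk '/^Node/ || $1 ~ /^(min|low|high)$/' /proc/zoneinfo

# Cumulative direct reclaim stalls since boot
grep allocstall /proc/vmstat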

The time spent performing a direct reclaim can be substantial, often
tens to hundreds of milliseconds for small order 0 allocations, and half
a second or more for order 9 huge-page allocations. In fact, kswapd is
not actually required on a Linux system. It exists for the sole purpose
of optimizing performance by preventing direct reclaims.

When memory shortfall is sufficient to trigger direct reclaims, they can
occur in any task that is running on the system. A single aggressive
memory allocating task can set the stage for collateral damage to occur in
small tasks that rarely allocate additional memory. Consider the impact of
injecting an additional 100ms of latency when nscd allocates memory to
facilitate caching of a DNS query.

The presence of direct reclaims 10 years ago was a fairly reliable
indicator that too much was being asked of a Linux system. Kswapd was
likely wasting time scanning pages that were ineligible for eviction.
Adding RAM or reducing the working set size would usually make the problem
go away. Since then hardware has evolved to bring a new struggle for
kswapd. Storage speeds have increased by orders of magnitude while CPU
clock speeds stayed the same or even slowed down in exchange for more
cores per package. This presents a throughput problem for a single
threaded kswapd that will get worse with each generation of new hardware.

Test Details

NOTE: The tests below were run with shadow entries disabled. See the
associated patch and cover letter for details

The tests below were designed with the assumption that a kswapd bottleneck
is best demonstrated using filesystem reads. This way, the inactive list
will be full of clean pages, simplifying the analysis and allowing kswapd
to achieve the highest possible steal rate. Maximum steal rates for kswapd
are likely to be the same or lower for any other mix of page types on the
system.

Tests were run on a 2U Oracle X7-2L with 52 Intel Xeon Skylake 2GHz cores,
756GB of RAM and 8 x 3.6 TB NVMe Solid State Disk drives. Each drive has
an XFS file system mounted separately as /d0 through /d7. SSD drives
require multiple concurrent streams to show their potential, so I created
eleven 250GB zero-filled files on each drive so that I could test with
parallel reads.
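
The file creation step amounts to something like this (a sketch; the
file names are hypothetical):

# Create eleven 250GB zero-filled files on each of the eight drives
for d in 0 1 2 3 4 5 6 7; do
        for n in $(seq 1 11); do
                dd if=/dev/zero of=/d${d}/file${n} bs=1M count=256000 &
        done
        wait    # let one drive finish before starting the next
done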

The test script runs in multiple stages. At each stage, the number of dd
tasks run concurrently is increased by 2. I did not include all of the
test output for brevity.

During each stage dd tasks are launched to read from each drive in a round
robin fashion until the specified number of tasks for the stage has been
reached. Then iostat, vmstat and top are started in the background with 10
second intervals. After five minutes, all of the dd tasks are killed and
the iostat, vmstat and top output is parsed in order to report the
following:

CPU consumption
- sy - aggregate kernel mode CPU consumption from vmstat output. The value
doesn't tend to fluctuate much, so I just grab the highest value. Each
sample is averaged over 10 seconds.
- dd_cpu - CPU consumption for all of the dd tasks, averaged across the
top samples since there is a lot of variation.

Throughput
- in Kbytes
- Command is iostat -x -d 10 -g total
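
Putting that together, a single stage reduces to roughly the following
bash sketch (hypothetical file and script names; the metric parsing is
omitted):

# Usage: stage.sh <ntasks>
ntasks=$1
n=1
for ((i = 0; i < ntasks; i++)); do
        d=$((i % 8))                      # round robin across /d0../d7
        dd if=/d${d}/file${n} of=/dev/null bs=4M &
        [ "$d" -eq 7 ] && n=$((n + 1))    # next file set each full pass
done
iostat -x -d 10 -g total > iostat.out &
vmstat 10 > vmstat.out &
top -b -d 10 > top.out &
sleep 300                                 # run the stage for five minutes
pkill -x dd; pkill -x iostat; pkill -x vmstat; pkill -x top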

This first test performs reads using O_DIRECT in order to show the maximum
throughput that can be obtained using these drives. It also demonstrates
how rapidly throughput scales as the number of dd tasks is increased.

The dd command for this test looks like this:

Command Used: dd iflag=direct if=/d${i}/$n of=/dev/null bs=4M

Test #1: Direct IO
dd sy dd_cpu throughput
6 0 2.33 14726026.40
10 1 2.95 19954974.80
16 1 2.63 24419689.30
22 1 2.63 25430303.20
28 1 2.91 26026513.20
34 1 2.53 26178618.00
40 1 2.18 26239229.20
46 1 1.91 26250550.40
52 1 1.69 26251845.60
58 1 1.54 26253205.60
64 1 1.43 26253780.80
70 1 1.31 26254154.80
76 1 1.21 26253660.80
82 1 1.12 26254214.80
88 1 1.07 26253770.00
90 1 1.04 26252406.40

Throughput was close to peak with only 22 dd tasks. Very little system
CPU was consumed, as expected, since the drives DMA directly into the
user address space when using direct IO.

In this next test, the iflag=direct option is removed and we only run
the test until the pgscan_kswapd counter in /proc/vmstat starts to
increment. At that point metrics are parsed and reported and the
pagecache contents are dropped prior to the next test. Lather, rinse,
repeat.
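
The wait-and-reset logic can be approximated like so (a sketch; the real
script also parses the collected metrics):

# Wait for kswapd to start scanning, then reset the pagecache
base=$(awk '/^pgscan_kswapd/ {s += $2} END {print s}' /proc/vmstat)
while :; do
        cur=$(awk '/^pgscan_kswapd/ {s += $2} END {print s}' /proc/vmstat)
        [ "$cur" -gt "$base" ] && break
        sleep 1
done
sync
echo 3 > /proc/sys/vm/drop_caches    # drop the pagecache before the next test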

Test #2: standard file system IO, no page replacement
dd sy dd_cpu throughput
6 2 28.78 5134316.40
10 3 31.40 8051218.40
16 5 34.73 11438106.80
22 7 33.65 14140596.40
28 8 31.24 16393455.20
34 10 29.88 18219463.60
40 11 28.33 19644159.60
46 11 25.05 20802497.60
52 13 26.92 22092370.00
58 13 23.29 22884881.20
64 14 23.12 23452248.80
70 15 22.40 23916468.00
76 16 22.06 24328737.20
82 17 20.97 24718693.20
88 16 18.57 25149404.40
90 16 18.31 25245565.60

Each read has to pause after the buffer in kernel space is populated while
those pages are added to the pagecache and copied into the user address
space. For this reason, more parallel streams are required to achieve peak
throughput. The copy operation consumes substantially more CPU than direct
IO as expected.

The next test measures throughput after kswapd starts running. This is the
same test, only we wait for kswapd to wake up before we start collecting
metrics. The script actually keeps track of a few things that were not
mentioned earlier. It tracks direct reclaims and page scans by watching
the metrics in /proc/vmstat. CPU consumption for kswapd is tracked the
same way it is tracked for dd.

Since the test is 100% reads, you can assume that the page steal rate for
kswapd and direct reclaims is almost identical to the scan rate.
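
That assumption is easy to spot-check against the steal counters in
/proc/vmstat:

# Steal should track scan closely when the inactive list is all clean pages
grep -E '^(pgscan|pgsteal)_(kswapd|direct)' /proc/vmstat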

Test #3: 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 4 26.07 28.56 27.03 7355924.40 0 459316976 0
16 7 34.94 69.33 69.66 10867895.20 0 872661643 0
22 10 36.03 93.99 99.33 13130613.60 489 1037654473 11268334
28 10 30.34 95.90 98.60 14601509.60 671 1182591373 15429142
34 14 34.77 97.50 99.23 16468012.00 10850 1069005644 249839515
40 17 36.32 91.49 97.11 17335987.60 18903 975417728 434467710
46 19 38.40 90.54 91.61 17705394.40 25369 855737040 582427973
52 22 40.88 83.97 83.70 17607680.40 31250 709532935 724282458
58 25 40.89 82.19 80.14 17976905.60 35060 657796473 804117540
64 28 41.77 73.49 75.20 18001910.00 39073 561813658 895289337
70 33 45.51 63.78 64.39 17061897.20 44523 379465571 1020726436
76 36 46.95 57.96 60.32 16964459.60 47717 291299464 1093172384
82 39 47.16 55.43 56.16 16949956.00 49479 247071062 1134163008
88 42 47.41 53.75 47.62 16930911.20 51521 195449924 1180442208
90 43 47.18 51.40 50.59 16864428.00 51618 190758156 1183203901

In the previous test where kswapd was not involved, the system-wide kernel
mode CPU consumption with 90 dd tasks was 16%. In this test CPU consumption
with 90 tasks is at 43%. With 52 cores, and two kswapd tasks (one per NUMA
node), kswapd can only be responsible for a little over 4% of the increase.
The rest is likely caused by 51,618 direct reclaims that scanned 1.2
billion pages over the five minute time period of the test.

Same test, more kswapd tasks:

Test #4: 4 kswapd threads per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 5 27.09 16.65 14.17 7842605.60 0 459105291 0
16 10 37.12 26.02 24.85 11352920.40 15 920527796 358515
22 11 36.94 37.13 35.82 13771869.60 0 1132169011 0
28 13 35.23 48.43 46.86 16089746.00 0 1312902070 0
34 15 33.37 53.02 55.69 18314856.40 0 1476169080 0
40 19 35.90 69.60 64.41 19836126.80 0 1629999149 0
46 22 36.82 88.55 57.20 20740216.40 0 1708478106 0
52 24 34.38 93.76 68.34 21758352.00 0 1794055559 0
58 24 30.51 79.20 82.33 22735594.00 0 1872794397 0
64 26 30.21 97.12 76.73 23302203.60 176 1916593721 4206821
70 33 32.92 92.91 92.87 23776588.00 3575 1817685086 85574159
76 37 31.62 91.20 89.83 24308196.80 4752 1812262569 113981763
82 29 25.53 93.23 92.33 24802791.20 306 2032093122 7350704
88 43 37.12 76.18 77.01 25145694.40 20310 1253204719 487048202
90 42 38.56 73.90 74.57 22516787.60 22774 1193637495 545463615

By increasing the number of kswapd threads, throughput increased by ~50%
while kernel mode CPU utilization decreased or stayed the same, likely due
to a decrease in the number of parallel tasks at any given time doing page
replacement.

Signed-off-by: Buddy Lumpkin <[email protected]>
---
Documentation/sysctl/vm.txt | 23 +++++++++
include/linux/mm.h | 2 +
include/linux/mmzone.h | 10 +++-
kernel/sysctl.c | 10 ++++
mm/page_alloc.c | 15 ++++++
mm/vmscan.c | 116 +++++++++++++++++++++++++++++++++++++-------
6 files changed, 157 insertions(+), 19 deletions(-)

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index ff234d229cbb..aa54cbc14dd9 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -31,6 +31,7 @@ Currently, these files are in /proc/sys/vm:
- drop_caches
- extfrag_threshold
- hugetlb_shm_group
+- kswapd_threads
- laptop_mode
- legacy_va_layout
- lowmem_reserve_ratio
@@ -267,6 +268,28 @@ shared memory segment using hugetlb page.

==============================================================

+kswapd_threads
+
+kswapd_threads allows you to control the number of kswapd threads per node
+running on the system. This provides the ability to devote additional CPU
+resources toward proactive page replacement with the goal of reducing
+direct reclaims. When direct reclaims are prevented, the CPU consumed
+by them is prevented as well. Depending on the workload, the result can
+cause aggregate CPU usage on the system to go up, down or stay the same.
+
+More aggressive page replacement can reduce direct reclaims which cause
+latency for tasks and decrease throughput when doing filesystem IO through
+the pagecache. Direct reclaims are recorded using the allocstall counter
+in /proc/vmstat.
+
+The default value is 1 and the range of acceptable values is 1-16.
+Always start with lower values in the 2-6 range. Higher values should
+be justified with testing. If direct reclaims occur in spite of high
+values, the cost of direct reclaims (in latency) that occur can be
+higher due to increased lock contention.
+
+==============================================================
+
laptop_mode

laptop_mode is a knob that controls "laptop mode". All the things that are
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad06d42adb1a..e25b8da76f7d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2078,6 +2078,7 @@ static inline void zero_resv_unavail(void) {}
extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long,
enum memmap_context, struct vmem_altmap *);
extern void setup_per_zone_wmarks(void);
+extern void update_kswapd_threads(void);
extern int __meminit init_per_zone_wmark_min(void);
extern void mem_init(void);
extern void __init mmap_init(void);
@@ -2098,6 +2099,7 @@ extern __printf(3, 4)
extern void zone_pcp_reset(struct zone *zone);

/* page_alloc.c */
+extern int kswapd_threads;
extern int min_free_kbytes;
extern int watermark_scale_factor;

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 7522a6987595..ad36a5b5c3b8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -36,6 +36,8 @@
*/
#define PAGE_ALLOC_COSTLY_ORDER 3

+#define MAX_KSWAPD_THREADS 16
+
enum migratetype {
MIGRATE_UNMOVABLE,
MIGRATE_MOVABLE,
@@ -653,8 +655,10 @@ struct zonelist {
int node_id;
wait_queue_head_t kswapd_wait;
wait_queue_head_t pfmemalloc_wait;
- struct task_struct *kswapd; /* Protected by
- mem_hotplug_begin/end() */
+ /*
+ * Protected by mem_hotplug_begin/end()
+ */
+ struct task_struct *kswapd[MAX_KSWAPD_THREADS];
int kswapd_order;
enum zone_type kswapd_classzone_idx;

@@ -882,6 +886,8 @@ static inline int is_highmem(struct zone *zone)

/* These two functions are used to setup the per zone pages min values */
struct ctl_table;
+int kswapd_threads_sysctl_handler(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index f98f28c12020..3cef65ce1d46 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -134,6 +134,7 @@
#ifdef CONFIG_PERF_EVENTS
static int six_hundred_forty_kb = 640 * 1024;
#endif
+static int max_kswapd_threads = MAX_KSWAPD_THREADS;

/* this is needed for the proc_doulongvec_minmax of vm_dirty_bytes */
static unsigned long dirty_bytes_min = 2 * PAGE_SIZE;
@@ -1437,6 +1438,15 @@ static int sysrq_sysctl_handler(struct ctl_table *table, int write,
.extra1 = &zero,
},
{
+ .procname = "kswapd_threads",
+ .data = &kswapd_threads,
+ .maxlen = sizeof(kswapd_threads),
+ .mode = 0644,
+ .proc_handler = kswapd_threads_sysctl_handler,
+ .extra1 = &one,
+ .extra2 = &max_kswapd_threads,
+ },
+ {
.procname = "watermark_scale_factor",
.data = &watermark_scale_factor,
.maxlen = sizeof(watermark_scale_factor),
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1741dd23e7c1..de30683aeb0f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7143,6 +7143,21 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
return 0;
}

+int kswapd_threads_sysctl_handler(struct ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ int rc;
+
+ rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (rc)
+ return rc;
+
+ if (write)
+ update_kswapd_threads();
+
+ return 0;
+}
+
int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
diff --git a/mm/vmscan.c b/mm/vmscan.c
index cd5dc3faaa57..663ff14080e7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -118,6 +118,14 @@ struct scan_control {
unsigned long nr_reclaimed;
};

+
+/*
+ * Number of active kswapd threads
+ */
+#define DEF_KSWAPD_THREADS_PER_NODE 1
+int kswapd_threads = DEF_KSWAPD_THREADS_PER_NODE;
+int kswapd_threads_current = DEF_KSWAPD_THREADS_PER_NODE;
+
#ifdef ARCH_HAS_PREFETCH
#define prefetch_prev_lru_page(_page, _base, _field) \
do { \
@@ -3624,21 +3632,83 @@ unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
restore their cpu bindings. */
static int kswapd_cpu_online(unsigned int cpu)
{
- int nid;
+ int nid, hid;
+ int nr_threads = kswapd_threads_current;

for_each_node_state(nid, N_MEMORY) {
pg_data_t *pgdat = NODE_DATA(nid);
const struct cpumask *mask;

mask = cpumask_of_node(pgdat->node_id);
-
- if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids)
- /* One of our CPUs online: restore mask */
- set_cpus_allowed_ptr(pgdat->kswapd, mask);
+ if (cpumask_any_and(cpu_online_mask, mask) < nr_cpu_ids) {
+ for (hid = 0; hid < nr_threads; hid++) {
+ /* One of our CPUs online: restore mask */
+ set_cpus_allowed_ptr(pgdat->kswapd[hid], mask);
+ }
+ }
}
return 0;
}

+static void update_kswapd_threads_node(int nid)
+{
+ pg_data_t *pgdat;
+ int drop, increase;
+ int last_idx, start_idx, hid;
+ int nr_threads = kswapd_threads_current;
+
+ pgdat = NODE_DATA(nid);
+ last_idx = nr_threads - 1;
+ if (kswapd_threads < nr_threads) {
+ drop = nr_threads - kswapd_threads;
+ for (hid = last_idx; hid > (last_idx - drop); hid--) {
+ if (pgdat->kswapd[hid]) {
+ kthread_stop(pgdat->kswapd[hid]);
+ pgdat->kswapd[hid] = NULL;
+ }
+ }
+ } else {
+ increase = kswapd_threads - nr_threads;
+ start_idx = last_idx + 1;
+ for (hid = start_idx; hid < (start_idx + increase); hid++) {
+ pgdat->kswapd[hid] = kthread_run(kswapd, pgdat,
+ "kswapd%d:%d", nid, hid);
+ if (IS_ERR(pgdat->kswapd[hid])) {
+ pr_err("Failed to start kswapd%d on node %d\n",
+ hid, nid);
+ pgdat->kswapd[hid] = NULL;
+ /*
+ * We are out of resources. Do not start any
+ * more threads.
+ */
+ break;
+ }
+ }
+ }
+}
+
+void update_kswapd_threads(void)
+{
+ int nid;
+
+ if (kswapd_threads_current == kswapd_threads)
+ return;
+
+ /*
+ * Hold the memory hotplug lock to avoid racing with memory
+ * hotplug initiated updates
+ */
+ mem_hotplug_begin();
+ for_each_node_state(nid, N_MEMORY)
+ update_kswapd_threads_node(nid);
+
+ pr_info("kswapd_thread count changed, old:%d new:%d\n",
+ kswapd_threads_current, kswapd_threads);
+ kswapd_threads_current = kswapd_threads;
+ mem_hotplug_done();
+}
+
+
/*
* This kswapd start function will be called by init and node-hot-add.
* On node-hot-add, kswapd will moved to proper cpus if cpus are hot-added.
@@ -3647,18 +3717,25 @@ int kswapd_run(int nid)
{
pg_data_t *pgdat = NODE_DATA(nid);
int ret = 0;
+ int hid, nr_threads;

- if (pgdat->kswapd)
+ if (pgdat->kswapd[0])
return 0;

- pgdat->kswapd = kthread_run(kswapd, pgdat, "kswapd%d", nid);
- if (IS_ERR(pgdat->kswapd)) {
- /* failure at boot is fatal */
- BUG_ON(system_state < SYSTEM_RUNNING);
- pr_err("Failed to start kswapd on node %d\n", nid);
- ret = PTR_ERR(pgdat->kswapd);
- pgdat->kswapd = NULL;
+ nr_threads = kswapd_threads;
+ for (hid = 0; hid < nr_threads; hid++) {
+ pgdat->kswapd[hid] = kthread_run(kswapd, pgdat, "kswapd%d:%d",
+ nid, hid);
+ if (IS_ERR(pgdat->kswapd[hid])) {
+ /* failure at boot is fatal */
+ BUG_ON(system_state < SYSTEM_RUNNING);
+ pr_err("Failed to start kswapd%d on node %d\n",
+ hid, nid);
+ ret = PTR_ERR(pgdat->kswapd[hid]);
+ pgdat->kswapd[hid] = NULL;
+ }
}
+ kswapd_threads_current = nr_threads;
return ret;
}

@@ -3668,11 +3745,16 @@ int kswapd_run(int nid)
*/
void kswapd_stop(int nid)
{
- struct task_struct *kswapd = NODE_DATA(nid)->kswapd;
+ struct task_struct *kswapd;
+ int hid;
+ int nr_threads = kswapd_threads_current;

- if (kswapd) {
- kthread_stop(kswapd);
- NODE_DATA(nid)->kswapd = NULL;
+ for (hid = 0; hid < nr_threads; hid++) {
+ kswapd = NODE_DATA(nid)->kswapd[hid];
+ if (kswapd) {
+ kthread_stop(kswapd);
+ NODE_DATA(nid)->kswapd[hid] = NULL;
+ }
}
}

--
1.8.3.1


2018-04-03 13:32:59

by Michal Hocko

Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
> Page replacement is handled in the Linux Kernel in one of two ways:
>
> 1) Asynchronously via kswapd
> 2) Synchronously, via direct reclaim
>
> At page allocation time the allocating task is immediately given a page
> from the zone free list, allowing it to go right back to work doing
> whatever it was doing; probably directly or indirectly executing
> business logic.
>
> Just prior to satisfying the allocation, the free page count is checked
> to see if it has reached the zone low watermark and, if so, kswapd is
> awakened.
> Kswapd will start scanning pages looking for inactive pages to evict to
> make room for new page allocations. The work of kswapd allows tasks to
> continue allocating memory from their respective zone free list without
> incurring any delay.
>
> When the demand for free pages exceeds the rate that kswapd tasks can
> supply them, page allocation works differently. Once the allocating task
> finds that the number of free pages is at or below the zone min watermark,
> the task will no longer pull pages from the free list. Instead, the task
> will run the same CPU-bound routines as kswapd to satisfy its own
> allocation by scanning and evicting pages. This is called a direct reclaim.
>
> The time spent performing a direct reclaim can be substantial, often
> tens to hundreds of milliseconds for small order 0 allocations, and half
> a second or more for order 9 huge-page allocations. In fact, kswapd is
> not actually required on a Linux system. It exists for the sole purpose
> of optimizing performance by preventing direct reclaims.
>
> When memory shortfall is sufficient to trigger direct reclaims, they can
> occur in any task that is running on the system. A single aggressive
> memory allocating task can set the stage for collateral damage to occur in
> small tasks that rarely allocate additional memory. Consider the impact of
> injecting an additional 100ms of latency when nscd allocates memory to
> facilitate caching of a DNS query.
>
> The presence of direct reclaims 10 years ago was a fairly reliable
> indicator that too much was being asked of a Linux system. Kswapd was
> likely wasting time scanning pages that were ineligible for eviction.
> Adding RAM or reducing the working set size would usually make the problem
> go away. Since then hardware has evolved to bring a new struggle for
> kswapd. Storage speeds have increased by orders of magnitude while CPU
> clock speeds stayed the same or even slowed down in exchange for more
> cores per package. This presents a throughput problem for a single
> threaded kswapd that will get worse with each generation of new hardware.

AFAIR we used to scale the number of kswapd workers many years ago. It
just turned out to be not all that great. We have had a kswapd reclaim
window for quite some time now, and that allows tuning how proactive
kswapd should be.

Also please note that the direct reclaim is a way to throttle overly
aggressive memory consumers. The more we do in the background context,
the easier it will be for them to allocate faster. So I am not really
sure that more background threads will solve the underlying problem. It
is just a matter of memory hogs tuning to end up in the very same
situation AFAICS. Moreover, the more they are going to allocate, the
less CPU time _other_ (non-allocating) tasks will get.

> Test Details

I will have to study this more to comment.

[...]
> By increasing the number of kswapd threads, throughput increased by ~50%
> while kernel mode CPU utilization decreased or stayed the same, likely due
> to a decrease in the number of parallel tasks at any given time doing page
> replacement.

Well, isn't that just an effect of more work being done on behalf of
other workloads that might run along with your tests (and which don't
really need to allocate a lot of memory)? In other words, how does the
patch behave with non-artificial mixed workloads?

Please note that I am not saying that we absolutely have to stick with the
current single-thread-per-node implementation but I would really like to
see more background on why we should be allowing heavy memory hogs to
allocate faster or how to prevent that. I would be also very interested
to see how to scale the number of threads based on how CPUs are utilized
by other workloads.
--
Michal Hocko
SUSE Labs

2018-04-03 19:11:11

by Matthew Wilcox

Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote:
> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
> > The presence of direct reclaims 10 years ago was a fairly reliable
> > indicator that too much was being asked of a Linux system. Kswapd was
> > likely wasting time scanning pages that were ineligible for eviction.
> > Adding RAM or reducing the working set size would usually make the problem
> > go away. Since then hardware has evolved to bring a new struggle for
> > kswapd. Storage speeds have increased by orders of magnitude while CPU
> > clock speeds stayed the same or even slowed down in exchange for more
> > cores per package. This presents a throughput problem for a single
> > threaded kswapd that will get worse with each generation of new hardware.
>
> AFAIR we used to scale the number of kswapd workers many years ago. It
> just turned out to be not all that great. We have had a kswapd reclaim
> window for quite some time now, and that allows tuning how proactive
> kswapd should be.
>
> Also please note that the direct reclaim is a way to throttle overly
> aggressive memory consumers. The more we do in the background context,
> the easier it will be for them to allocate faster. So I am not really
> sure that more background threads will solve the underlying problem. It
> is just a matter of memory hogs tuning to end up in the very same
> situation AFAICS. Moreover, the more they are going to allocate, the
> less CPU time _other_ (non-allocating) tasks will get.
>
> > Test Details
>
> I will have to study this more to comment.
>
> [...]
> > By increasing the number of kswapd threads, throughput increased by ~50%
> > while kernel mode CPU utilization decreased or stayed the same, likely due
> > to a decrease in the number of parallel tasks at any given time doing page
> > replacement.
>
> Well, isn't that just an effect of more work being done on behalf of
> other workloads that might run along with your tests (and which don't
> really need to allocate a lot of memory)? In other words, how does the
> patch behave with non-artificial mixed workloads?
>
> Please note that I am not saying that we absolutely have to stick with the
> current single-thread-per-node implementation but I would really like to
> see more background on why we should be allowing heavy memory hogs to
> allocate faster or how to prevent that. I would be also very interested
> to see how to scale the number of threads based on how CPUs are utilized
> by other workloads.

Yes, very much this. If you have a single-threaded workload which is
using the entirety of memory and would like to use even more, then it
makes sense to use as many CPUs as necessary getting memory out of its
way. If you have N CPUs and N-1 threads happily occupying themselves in
their own reasonably-sized working sets with one monster process trying
to use as much RAM as possible, then I'd be pretty unimpressed to see
the N-1 well-behaved threads preempted by kswapd.

My biggest problem with the patch-as-presented is that it's yet one more
thing for admins to get wrong. We should spawn more threads automatically
if system conditions are right to do that.

2018-04-03 20:15:59

by Buddy Lumpkin

Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

Very sorry, I forgot to send my last response as plain text.

> On Apr 3, 2018, at 6:31 AM, Michal Hocko <[email protected]> wrote:
>
> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>> Page replacement is handled in the Linux Kernel in one of two ways:
>>
>> 1) Asynchronously via kswapd
>> 2) Synchronously, via direct reclaim
>>
>> At page allocation time the allocating task is immediately given a page
>> from the zone free list, allowing it to go right back to work doing
>> whatever it was doing; probably directly or indirectly executing
>> business logic.
>>
>> Just prior to satisfying the allocation, the free page count is checked
>> to see if it has reached the zone low watermark and, if so, kswapd is
>> awakened.
>> Kswapd will start scanning pages looking for inactive pages to evict to
>> make room for new page allocations. The work of kswapd allows tasks to
>> continue allocating memory from their respective zone free list without
>> incurring any delay.
>>
>> When the demand for free pages exceeds the rate that kswapd tasks can
>> supply them, page allocation works differently. Once the allocating task
>> finds that the number of free pages is at or below the zone min watermark,
>> the task will no longer pull pages from the free list. Instead, the task
>> will run the same CPU-bound routines as kswapd to satisfy its own
>> allocation by scanning and evicting pages. This is called a direct reclaim.
>>
>> The time spent performing a direct reclaim can be substantial, often
>> tens to hundreds of milliseconds for small order 0 allocations, and half
>> a second or more for order 9 huge-page allocations. In fact, kswapd is
>> not actually required on a Linux system. It exists for the sole purpose
>> of optimizing performance by preventing direct reclaims.
>>
>> When memory shortfall is sufficient to trigger direct reclaims, they can
>> occur in any task that is running on the system. A single aggressive
>> memory allocating task can set the stage for collateral damage to occur in
>> small tasks that rarely allocate additional memory. Consider the impact of
>> injecting an additional 100ms of latency when nscd allocates memory to
>> facilitate caching of a DNS query.
>>
>> The presence of direct reclaims 10 years ago was a fairly reliable
>> indicator that too much was being asked of a Linux system. Kswapd was
>> likely wasting time scanning pages that were ineligible for eviction.
>> Adding RAM or reducing the working set size would usually make the problem
>> go away. Since then hardware has evolved to bring a new struggle for
>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>> clock speeds stayed the same or even slowed down in exchange for more
>> cores per package. This presents a throughput problem for a single
>> threaded kswapd that will get worse with each generation of new hardware.
>
> AFAIR we used to scale the number of kswapd workers many years ago. It
> just turned out to be not all that great. We have had a kswapd reclaim
> window for quite some time now, and that allows tuning how proactive
> kswapd should be.

Are you referring to vm.watermark_scale_factor? This helps quite a bit.
Previously I had to increase min_free_kbytes in order to get a larger gap
between the low and min watermarks. I was very excited when I saw that this
had been added upstream.
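
For anyone following along, the value is in fractions of 10,000: the default
of 10 keeps the watermark gaps at 0.1% of the node's memory, so raising it
widens the buffer that kswapd works within, e.g.:

sysctl vm.watermark_scale_factor          # default is 10 (0.1%)
sysctl -w vm.watermark_scale_factor=100   # widen the gaps to 1%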

>
> Also please note that the direct reclaim is a way to throttle overly
> aggressive memory consumers.

I totally agree; in fact, I think this should be the primary role of direct
reclaims because they have a substantial impact on performance. Direct
reclaims are the emergency brakes for page allocation, and the case I am
making here is that they used to occur only when kswapd had to skip over a
lot of pages.

This changed over time as the rate at which a system can allocate pages
increased. Direct reclaims slowly became a normal part of page replacement.


> The more we do in the background context,
> the easier it will be for them to allocate faster. So I am not really
> sure that more background threads will solve the underlying problem. It
> is just a matter of memory hogs tuning to end up in the very same
> situation AFAICS. Moreover, the more they are going to allocate, the
> less CPU time _other_ (non-allocating) tasks will get.

The important thing to realize here is that kswapd and direct reclaims run
the same code paths. There is very little that they do differently. If you
compare my test results with one kswapd thread vs four, you can see that
direct reclaims increase the kernel mode CPU consumption considerably. By
dedicating more threads to proactive page replacement, you eliminate direct
reclaims, which reduces the total number of parallel threads that are
spinning on the CPU.

>
>> Test Details
>
> I will have to study this more to comment.
>
> [...]
>> By increasing the number of kswapd threads, throughput increased by ~50%
>> while kernel mode CPU utilization decreased or stayed the same, likely due
>> to a decrease in the number of parallel tasks at any given time doing page
>> replacement.
>
> Well, isn't that just an effect of more work being done on behalf of
> other workloads that might run along with your tests (and which don't
> really need to allocate a lot of memory)? In other words, how does the
> patch behave with non-artificial mixed workloads?

It works quite well. We are just starting to test our production apps. I will have
results to share soon.

>
> Please note that I am not saying that we absolutely have to stick with the
> current single-thread-per-node implementation but I would really like to
> see more background on why we should be allowing heavy memory hogs to
> allocate faster or how to prevent that.

My test results demonstrate the problem very well. They show that a handful
of SSDs can create enough demand for kswapd that it consumes ~100% CPU
long before throughput is able to reach its peak. Direct reclaims start
occurring at that point. Aggregate throughput continues to increase, but
eventually the pauses generated by the direct reclaims cause throughput to
plateau:


Test #3: 1 kswapd thread per node
dd sy dd_cpu kswapd0 kswapd1 throughput dr pgscan_kswapd pgscan_direct
10 4 26.07 28.56 27.03 7355924.40 0 459316976 0
16 7 34.94 69.33 69.66 10867895.20 0 872661643 0
22 10 36.03 93.99 99.33 13130613.60 489 1037654473 11268334
28 10 30.34 95.90 98.60 14601509.60 671 1182591373 15429142
34 14 34.77 97.50 99.23 16468012.00 10850 1069005644 249839515
40 17 36.32 91.49 97.11 17335987.60 18903 975417728 434467710
46 19 38.40 90.54 91.61 17705394.40 25369 855737040 582427973
52 22 40.88 83.97 83.70 17607680.40 31250 709532935 724282458
58 25 40.89 82.19 80.14 17976905.60 35060 657796473 804117540
64 28 41.77 73.49 75.20 18001910.00 39073 561813658 895289337
70 33 45.51 63.78 64.39 17061897.20 44523 379465571 1020726436
76 36 46.95 57.96 60.32 16964459.60 47717 291299464 1093172384
82 39 47.16 55.43 56.16 16949956.00 49479 247071062 1134163008
88 42 47.41 53.75 47.62 16930911.20 51521 195449924 1180442208
90 43 47.18 51.40 50.59 16864428.00 51618 190758156 1183203901

I think we have reached the point where it makes sense for page replacement to have more
than one mode. Enterprise class servers with lots of memory and a large number of CPU
cores would benefit heavily if more threads could be devoted toward proactive page
replacement. The polar opposite case is my Raspberry Pi, which I want to run as efficiently
as possible. This problem is only going to get worse. I think it makes sense to be able to
choose between efficiency and performance (throughput and latency reduction).

> I would be also very interested
> to see how to scale the number of threads based on how CPUs are utilized
> by other workloads.
> --
> Michal Hocko
> SUSE Labs

I agree. I think it would be nice to have a work queue that can sense when CPU utilization
for a task peaks at 100% and use that as the criterion to start another task, up to some
maximum that was determined at boot time.

I would also determine a max gap size for the watermarks at boot time, specifically the
gap between min and low, since that provides the buffer that absorbs spiky reclaim behavior
as free pages drop. Each time a direct reclaim occurs, increase the gap up to the limit. Make
the limit tunable as well. If at any time along the way CPU peaks at 100%, start another thread
up to the limit established at boot (which is also tunable).
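
With this patch applied, the thread-scaling half of that idea could even be
prototyped from userspace as a crude sketch (the 90% threshold, 10 second
interval and 16-thread ceiling are assumptions, the last matching the
patch's MAX_KSWAPD_THREADS):

# Watch kswapd CPU usage and add a kswapd thread when it saturates
while sleep 10; do
        busy=$(ps -eo %cpu=,comm= |
               awk '$2 ~ /^kswapd/ {if ($1 > m) m = $1} END {print int(m)}')
        cur=$(sysctl -n vm.kswapd_threads)
        if [ "$busy" -ge 90 ] && [ "$cur" -lt 16 ]; then
                sysctl -w vm.kswapd_threads=$((cur + 1))
        fi
done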


2018-04-03 20:51:51

by Buddy Lumpkin

Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node


> On Apr 3, 2018, at 12:07 PM, Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote:
>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>>> The presence of direct reclaims 10 years ago was a fairly reliable
>>> indicator that too much was being asked of a Linux system. Kswapd was
>>> likely wasting time scanning pages that were ineligible for eviction.
>>> Adding RAM or reducing the working set size would usually make the problem
>>> go away. Since then hardware has evolved to bring a new struggle for
>>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>>> clock speeds stayed the same or even slowed down in exchange for more
>>> cores per package. This presents a throughput problem for a single
>>> threaded kswapd that will get worse with each generation of new hardware.
>>
>> AFAIR we used to scale the number of kswapd workers many years ago. It
>> just turned out to be not all that great. We have had a kswapd reclaim
>> window for quite some time now, and that allows tuning how proactive
>> kswapd should be.
>>
>> Also please note that the direct reclaim is a way to throttle overly
>> aggressive memory consumers. The more we do in the background context,
>> the easier it will be for them to allocate faster. So I am not really
>> sure that more background threads will solve the underlying problem. It
>> is just a matter of memory hogs tuning to end up in the very same
>> situation AFAICS. Moreover, the more they are going to allocate, the
>> less CPU time _other_ (non-allocating) tasks will get.
>>
>>> Test Details
>>
>> I will have to study this more to comment.
>>
>> [...]
>>> By increasing the number of kswapd threads, throughput increased by ~50%
>>> while kernel mode CPU utilization decreased or stayed the same, likely due
>>> to a decrease in the number of parallel tasks at any given time doing page
>>> replacement.
>>
>> Well, isn't that just an effect of more work being done on behalf of
>> other workloads that might run along with your tests (and which don't
>> really need to allocate a lot of memory)? In other words, how does the
>> patch behave with non-artificial mixed workloads?
>>
>> Please note that I am not saying that we absolutely have to stick with the
>> current single-thread-per-node implementation but I would really like to
>> see more background on why we should be allowing heavy memory hogs to
>> allocate faster or how to prevent that. I would be also very interested
>> to see how to scale the number of threads based on how CPUs are utilized
>> by other workloads.
>
> Yes, very much this. If you have a single-threaded workload which is
> using the entirety of memory and would like to use even more, then it
> makes sense to use as many CPUs as necessary getting memory out of its
> way. If you have N CPUs and N-1 threads happily occupying themselves in
> their own reasonably-sized working sets with one monster process trying
> to use as much RAM as possible, then I'd be pretty unimpressed to see
> the N-1 well-behaved threads preempted by kswapd.

The default value provides one kswapd thread per NUMA node, the same as
it was without the patch. Also, I would point out that just because you devote
more threads to kswapd doesn't mean they are busy. If multiple kswapd threads
are busy, they are almost certainly doing work that would have resulted in
direct reclaims, which are often substantially more expensive than a couple of
extra context switches due to preemption.

Also, the code still uses wake_up_interruptible to wake kswapd threads, so
after the first kswapd thread has been woken, free pages minus the size of
the allocation would still need to be below the low watermark at allocation
time for another kswapd thread to wake up.

When I first decided to try this out, I figured a lot of tuning would be needed to
see good behavior. But what I found in practice was that it actually works quite
well. When you look closely, you see that there is very little difference between
a direct reclaim and kswapd. In fact, direct reclaims work a little harder than
kswapd, and they should continue to do so because that prevents the number
of parallel scanning tasks from increasing unnecessarily.

Please try it out; you might be surprised at how well it works.

>
> My biggest problem with the patch-as-presented is that it's yet one more
> thing for admins to get wrong. We should spawn more threads automatically
> if system conditions are right to do that.

I totally agree with this. In my previous response to Michal Hocko, I described
how I think we could scale watermarks in response to direct reclaims, and
launch more kswapd threads when kswapd peaks at 100% CPU usage.

2018-04-03 21:14:58

by Matthew Wilcox

Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
> > Yes, very much this. If you have a single-threaded workload which is
> > using the entirety of memory and would like to use even more, then it
> > makes sense to use as many CPUs as necessary getting memory out of its
> > way. If you have N CPUs and N-1 threads happily occupying themselves in
> > their own reasonably-sized working sets with one monster process trying
> > to use as much RAM as possible, then I'd be pretty unimpressed to see
> > the N-1 well-behaved threads preempted by kswapd.
>
> The default value provides one kswapd thread per NUMA node, the same as
> it was without the patch. Also, I would point out that just because you devote
> more threads to kswapd doesn't mean they are busy. If multiple kswapd threads
> are busy, they are almost certainly doing work that would have resulted in
> direct reclaims, which are often substantially more expensive than a couple of
> extra context switches due to preemption.

[...]

> In my previous response to Michal Hocko, I described
> how I think we could scale watermarks in response to direct reclaims, and
> launch more kswapd threads when kswapd peaks at 100% CPU usage.

I think you're missing my point about the workload ... kswapd isn't
"nice", so it will compete with the N-1 threads which are chugging along
at 100% CPU inside their working sets. In this scenario, we _don't_
want to kick off kswapd at all; we want the monster thread to clean up
its own mess. If we have idle CPUs, then yes, absolutely, lets have
them clean up for the monster, but otherwise, I want my N-1 threads
doing their own thing.

Maybe we should renice kswapd anyway ... thoughts? We don't seem to have
had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
and discovered it was a bad idea?


2018-04-04 10:08:59

by Buddy Lumpkin

Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node


> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>>> Yes, very much this. If you have a single-threaded workload which is
>>> using the entirety of memory and would like to use even more, then it
>>> makes sense to use as many CPUs as necessary getting memory out of its
>>> way. If you have N CPUs and N-1 threads happily occupying themselves in
>>> their own reasonably-sized working sets with one monster process trying
>>> to use as much RAM as possible, then I'd be pretty unimpressed to see
>>> the N-1 well-behaved threads preempted by kswapd.
>>
>> The default value provides one kswapd thread per NUMA node, the same as
>> it was without the patch. Also, I would point out that just because you devote
>> more threads to kswapd doesn't mean they are busy. If multiple kswapd threads
>> are busy, they are almost certainly doing work that would have resulted in
>> direct reclaims, which are often substantially more expensive than a couple of
>> extra context switches due to preemption.
>
> [...]
>
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
>
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets. In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess. If we have idle CPUs, then yes, absolutely, lets have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.

For the scenario you describe above, I have my own opinions, but I would rather
not speculate on what happens. Tomorrow I will try to simulate this situation and
I'll report back on the results. I think this actually makes a case for accepting the
patch as-is for now. Please hear me out on this:

You mentioned being concerned that an admin will do the wrong thing with this
tunable. I worked in the System Administrator/System Engineering job families for
many years and even though I transitioned to spending most of my time on
performance and kernel work, I still maintain an active role in System Engineering
related projects, hiring and mentoring.

The kswapd_threads tunable defaults to a value of one, which is the current default
behavior. I think there are plenty of sysctls that are more confusing than this one.
If you want to make a comparison, I would say that Transparent Hugepages is one
of the best examples of a feature that has confused System Administrators. I am sure
it works a lot better today, but it has a history of really sharp edges, and it has been
shipping enabled by default for a long time in the OS distributions I am familiar with.
I am hopeful that it works better in later kernels as I think we need more features
like it. Specifically, features that bring high performance to naive third party apps
that do not make use of advanced features like hugetlbfs, spoke, direct IO, or clumsy
interfaces like posix_fadvise. But until they are absolutely polished, I wish these kinds
of features would not be turned on by default. This includes kswapd_threads.

More reasons why implementing this tunable makes sense for now:
- A feature like this is a lot easier to reason about after it has been used in the field
for a while. This includes trying to auto-tune it.
- We need an answer for this problem today. Today there are single NVMe drives
capable of 10GB/s, and larger systems than the system I used for testing.
- In the scenario you describe above, an admin would have no reason to touch
this sysctl.
- I think I mentioned this before: I honestly thought a lot of tuning would be necessary
after implementing this, but so far that hasn't been the case. It works pretty well.
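
For completeness, exercising the tunable looks like any other vm sysctl; the
extra threads show up under the kswapdN:M naming used by the patch:

sysctl vm.kswapd_threads                  # default 1 per node, max 16
sysctl -w vm.kswapd_threads=4
ps -e -o comm | grep '^kswapd'            # e.g. kswapd0:0 .. kswapd0:3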


>
> Maybe we should renice kswapd anyway ... thoughts? We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
>


2018-04-05 04:20:50

by Buddy Lumpkin

Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node


> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>> [...]
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
>
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets. In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess. If we have idle CPUs, then yes, absolutely, let's have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.
>
> Maybe we should renice kswapd anyway ... thoughts? We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
>


Trying to distinguish between the monster and a high value task that you want
to run as quickly as possible would be challenging. I like your idea of using
renice. It probably makes sense to continue to run the first thread on each node
at a standard nice value, and run each additional task with a positive nice value.
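
A rough sketch of what that could look like at thread-creation time (an
untested illustration, not a patch; kswapd_threads and the "kswapd%d:%d"
naming scheme are assumptions):

	/* Spawn kswapd_threads workers per node and nice up every thread
	 * after the first, so the extras lose out to normal-priority
	 * work when the CPUs are busy. */
	for (i = 0; i < kswapd_threads; i++) {
		struct task_struct *t;

		t = kthread_run(kswapd, pgdat, "kswapd%d:%d", nid, i);
		if (IS_ERR(t))
			break;
		if (i > 0)
			set_user_nice(t, 5);	/* positive nice: deprioritize */
	}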







2018-04-11 03:17:52

by Buddy Lumpkin

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node


> On Apr 3, 2018, at 6:31 AM, Michal Hocko <[email protected]> wrote:
>
> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>> Page replacement is handled in the Linux Kernel in one of two ways:
>>
>> 1) Asynchronously via kswapd
>> 2) Synchronously, via direct reclaim
>>
>> At page allocation time the allocating task is immediately given a page
>> from the zone free list allowing it to go right back to work doing
>> whatever it was doing; Probably directly or indirectly executing business
>> logic.
>>
>> Just prior to satisfying the allocation, the free page count is checked to
>> see if it has reached the zone low watermark and if so, kswapd is awakened.
>> Kswapd will start scanning pages looking for inactive pages to evict to
>> make room for new page allocations. The work of kswapd allows tasks to
>> continue allocating memory from their respective zone free list without
>> incurring any delay.
>>
>> When the demand for free pages exceeds the rate that kswapd tasks can
>> supply them, page allocation works differently. Once the allocating task
>> finds that the number of free pages is at or below the zone min watermark,
>> the task will no longer pull pages from the free list. Instead, the task
>> will run the same CPU-bound routines as kswapd to satisfy its own
>> allocation by scanning and evicting pages. This is called a direct reclaim.
>>
>> The time spent performing a direct reclaim can be substantial, often
>> taking tens to hundreds of milliseconds for small order-0 allocations to
>> half a second or more for order-9 huge-page allocations. In fact, kswapd is
>> not actually required on a Linux system. It exists for the sole purpose of
>> optimizing performance by preventing direct reclaims.
>>
>> When memory shortfall is sufficient to trigger direct reclaims, they can
>> occur in any task that is running on the system. A single aggressive
>> memory allocating task can set the stage for collateral damage to occur in
>> small tasks that rarely allocate additional memory. Consider the impact of
>> injecting an additional 100ms of latency when nscd allocates memory to
>> facilitate caching of a DNS query.
>>
>> The presence of direct reclaims 10 years ago was a fairly reliable
>> indicator that too much was being asked of a Linux system. Kswapd was
>> likely wasting time scanning pages that were ineligible for eviction.
>> Adding RAM or reducing the working set size would usually make the problem
>> go away. Since then hardware has evolved to bring a new struggle for
>> kswapd. Storage speeds have increased by orders of magnitude while CPU
>> clock speeds stayed the same or even slowed down in exchange for more
>> cores per package. This presents a throughput problem for a single
>> threaded kswapd that will get worse with each generation of new hardware.
>
> AFAIR we used to scale the number of kswapd workers many years ago. It
> just turned out to be not all that great. We have had a kswapd reclaim
> window for quite some time now, and that allows tuning how proactive
> kswapd should be.

I am not aware of a previous version of Linux that offered more than one kswapd
thread per NUMA node.

>
> Also please note that the direct reclaim is a way to throttle overly
> aggressive memory consumers. The more we do in the background context,
> the easier it will be for them to allocate faster. So I am not really
> sure that more background threads will solve the underlying problem.

A single kswapd thread used to keep up with all of the demand you could
create on a Linux system quite easily provided it didn’t have to scan a lot
of pages that were ineligible for eviction. 10 years ago, Fibre Channel was
the popular high performance interconnect and if you were lucky enough
to have the latest hardware rated at 10GFC, you could get 1.2GB/s per host
bus adapter. Also, most high end storage solutions were still using spinning
rust so it took an insane number of spindles behind each host bus adapter
to saturate the channel if the access patterns were random. There really
wasn’t a reason to try to thread kswapd, and I am pretty sure there haven’t
been any attempts to do this in the last 10 years.

> It is just a matter of memory hogs tuning to end up in the very same
> situation AFAICS. Moreover, the more they are going to allocate, the
> less CPU time _other_ (non-allocating) tasks will get.

Please describe the scenario a bit more clearly. Once you start constructing
the workload that can create this scenario, I think you will find that you end
up with a mix that is rarely seen in practice.

>
>> Test Details
>
> I will have to study this more to comment.
>
> [...]
>> By increasing the number of kswapd threads, throughput increased by ~50%
>> while kernel mode CPU utilization decreased or stayed the same, likely due
>> to a decrease in the number of parallel tasks at any given time doing page
>> replacement.
>
> Well, isn't that just an effect of more work being done on behalf of
> other workload that might run along with your tests (and which doesn't
> really need to allocate a lot of memory)? In other words, how
> does the patch behave with non-artificial mixed workloads?

Still working on this. I will share data as soon as I have it.

>
> Please note that I am not saying that we absolutely have to stick with the
> current single-thread-per-node implementation but I would really like to
> see more background on why we should be allowing heavy memory hogs to
> allocate faster or how to prevent that. I would be also very interested
> to see how to scale the number of threads based on how CPUs are utilized
> by other workloads.
> --
> Michal Hocko
> SUSE Labs


2018-04-11 03:57:02

by Buddy Lumpkin

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node


> On Apr 3, 2018, at 12:07 PM, Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Apr 03, 2018 at 03:31:15PM +0200, Michal Hocko wrote:
>> [...]
>> Please note that I am not saying that we absolutely have to stick with the
>> current single-thread-per-node implementation but I would really like to
>> see more background on why we should be allowing heavy memory hogs to
>> allocate faster or how to prevent that. I would be also very interested
>> to see how to scale the number of threads based on how CPUs are utilized
>> by other workloads.
>
> Yes, very much this. If you have a single-threaded workload which is
> using the entirety of memory and would like to use even more, then it
> makes sense to use as many CPUs as necessary getting memory out of its
> way. If you have N CPUs and N-1 threads happily occupying themselves in
> their own reasonably-sized working sets with one monster process trying
> to use as much RAM as possible, then I'd be pretty unimpressed to see
> the N-1 well-behaved threads preempted by kswapd.

A single thread cannot create the demand to keep any number of kswapd tasks
busy, so this memory hog is going to need multiple threads if it is going
to do any measurable damage to the amount of work performed by the compute
bound tasks. And once we increase the number of tasks used for the memory
hog, preemption is already happening.

So let’s say we are willing to accept that it is going to take multiple threads to
create enough demand to keep multiple kswapd tasks busy; we just do not want
any additional preemptions strictly due to additional kswapd tasks. You have to
consider: if we managed to create enough demand to keep multiple kswapd tasks
busy, then we are creating enough demand to trigger direct reclaims. A _lot_ of
direct reclaims, and direct reclaims consume a _lot_ of CPU. So if we are running
multiple kswapd threads, they might be preempting your N-1 threads, but if they
were not running, the memory hog tasks would be preempting your N-1 threads.

>
> My biggest problem with the patch-as-presented is that it's yet one more
> thing for admins to get wrong. We should spawn more threads automatically
> if system conditions are right to do that.

One thing an admin could get wrong with this patch-as-presented is
starting with a setting of 16, deciding that it didn’t help, and reducing it back to
one. It allows for 16 threads because I actually saw a benefit with large numbers
of kswapd threads when a substantial amount of the memory pressure was
created using anonymous memory mappings that do not involve the page cache.
This really is a special case, and the maximum number of threads allowed should
probably be reduced to a more sensible value like 8, or even 6 if there is concern
about admins doing the wrong thing.






2018-04-11 06:42:03

by Buddy Lumpkin

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node


> On Apr 3, 2018, at 2:12 PM, Matthew Wilcox <[email protected]> wrote:
>
> On Tue, Apr 03, 2018 at 01:49:25PM -0700, Buddy Lumpkin wrote:
>> [...]
>> In my previous response to Michal Hocko, I described
>> how I think we could scale watermarks in response to direct reclaims, and
>> launch more kswapd threads when kswapd peaks at 100% CPU usage.
>
> I think you're missing my point about the workload ... kswapd isn't
> "nice", so it will compete with the N-1 threads which are chugging along
> at 100% CPU inside their working sets.

If the memory hog is generating enough demand for multiple kswapd
tasks to be busy, then it is generating enough demand to trigger direct
reclaims. Since direct reclaims are 100% CPU bound, the preemptions
you are concerned about are happening anyway.

> In this scenario, we _don't_
> want to kick off kswapd at all; we want the monster thread to clean up
> its own mess.

This makes direct reclaims sound like a positive thing overall and that
is simply not the case. If cleaning is the metaphor to describe direct
reclaims, then it’s happening in the kitchen using a garden hose.
When conditions for direct reclaims are present they can occur in any
task that is allocating on the system. They inject latency in random places
and they decrease filesystem throughput.

When software engineers try to build their own cache, I usually try to talk
them out of it. This rarely works, as they usually have reasons they believe
make the project compelling, so I just ask that they compare their results
using direct IO and a private cache to simply allowing the page cache to
do its thing. I can’t make this pitch anymore, because direct reclaims have
too much of an impact on filesystem throughput.

The only positive thing that direct reclaims provide is a means to prevent
the system from crashing or deadlocking when it falls too low on memory.

> If we have idle CPUs, then yes, absolutely, let's have
> them clean up for the monster, but otherwise, I want my N-1 threads
> doing their own thing.
>
> Maybe we should renice kswapd anyway ... thoughts? We don't seem to have
> had a nice'd kswapd since 2.6.12, but maybe we played with that earlier
> and discovered it was a bad idea?
>


2018-04-12 13:21:56

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

On Tue 03-04-18 12:41:56, Buddy Lumpkin wrote:
>
> > On Apr 3, 2018, at 6:31 AM, Michal Hocko <[email protected]> wrote:
> >
> > On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
> >> [...]
> >
> > AFAIR we used to scale the number of kswapd workers many years ago. It
> > just turned out to be not all that great. We have had a kswapd reclaim
> > window for quite some time now, and that allows tuning how proactive
> > kswapd should be.
>
> Are you referring to vm.watermark_scale_factor?

Yes, along with min_free_kbytes.
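
For reference, the relationship between the two (a simplified sketch of
__setup_per_zone_wmarks() from ~4.x mm/page_alloc.c, with the highmem and
clamping details elided):

	/* the min watermark is derived from min_free_kbytes (pages_min) */
	tmp = (u64)pages_min * zone->managed_pages;
	do_div(tmp, lowmem_pages);
	zone->watermark[WMARK_MIN] = tmp;

	/* watermark_scale_factor widens the min->low->high gaps, in
	 * units of 0.01% of the zone's managed pages */
	tmp = max_t(u64, tmp >> 2,
		    mult_frac(zone->managed_pages, watermark_scale_factor, 10000));

	zone->watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
	zone->watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;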

> This helps quite a bit. Previously
> I had to increase min_free_kbytes in order to get a larger gap between the low
> and min watermarks. I was very excited when I saw that this had been added
> upstream.
>
> >
> > Also please note that the direct reclaim is a way to throttle overly
> > aggressive memory consumers.
>
> I totally agree, in fact I think this should be the primary role of direct reclaims
> because they have a substantial impact on performance. Direct reclaims are
> the emergency brakes for page allocation, and the case I am making here is
> that they used to only occur when kswapd had to skip over a lot of pages.

Or when it is busy reclaiming which can be the case quite easily if you
do not have the inactive file LRU full of clean page cache. And that is
another problem. If you have a trivial reclaim situation then a single
kswapd thread can reclaim quickly enough. But once you hit a wall with
hard-to-reclaim pages then I would expect multiple threads will simply
contend more (e.g. on fs locks in shrinkers etc...). Or how do you want
to prevent that?

Or more specifically. How is the admin supposed to know how many
background threads are still improving the situation?

> This changed over time as the rate a system can allocate pages increased.
> Direct reclaims slowly became a normal part of page replacement.
>
> > The more we do in the background context,
> > the easier it will be for them to allocate faster. So I am not really
> > sure that more background threads will solve the underlying problem. It
> > is just a matter of memory hogs tuning to end up in the very same
> > situation AFAICS. Moreover, the more they are going to allocate, the
> > less CPU time _other_ (non-allocating) tasks will get.
>
> The important thing to realize here is that kswapd and direct reclaims run the
> same code paths. There is very little that they do differently.

Their target is however completely different. Kswapd wants to keep nodes
balanced while direct reclaim aims to reclaim _some_ memory. That is
quite some difference. Especially for the throttle-by-reclaiming-memory
part.

> If you compare
> my test results with one kswapd vs four, you can see that direct reclaims
> increase the kernel mode CPU consumption considerably. By dedicating
> more threads to proactive page replacement, you eliminate direct reclaims
> which reduces the total number of parallel threads that are spinning on the
> CPU.

I still haven't looked at your test results in detail because they seem
quite artificial. Clean pagecache reclaim is not all that interesting
IMHO

[...]
> > I would be also very interested
> > to see how to scale the number of threads based on how CPUs are utilized
> > by other workloads.
>
> I think we have reached the point where it makes sense for page replacement to have more
> than one mode. Enterprise class servers with lots of memory and a large number of CPU
> cores would benefit heavily if more threads could be devoted toward proactive page
> replacement. The polar opposite case is my Raspberry PI which I want to run as efficiently
> as possible. This problem is only going to get worse. I think it makes sense to be able to
> choose between efficiency and performance (throughput and latency reduction).

The thing is that as long as this would require admin to guess then this
is not all that useful. People will simply not know what to set and we
are going to end up with stupid admin guides claiming that you should
use 1/N of per node cpus for kswapd and that will not work. Not to
mention that the reclaim logic is full of heuristics which change over
time and a subtle implementation detail that would work for a particular
scaling might break without anybody noticing. Really, if we are not able
to come up with some auto tuning then I think that this is not really
worth it.

--
Michal Hocko
SUSE Labs

2018-04-12 13:26:45

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

On Tue 10-04-18 20:10:24, Buddy Lumpkin wrote:
[...]
> > Also please note that the direct reclaim is a way to throttle overly
> > aggressive memory consumers. The more we do in the background context,
> > the easier it will be for them to allocate faster. So I am not really
> > sure that more background threads will solve the underlying problem.
>
> A single kswapd thread used to keep up with all of the demand you could
> create on a Linux system quite easily provided it didn’t have to scan a lot
> of pages that were ineligible for eviction.

Well, what do you mean by ineligible for eviction? Could you be more
specific? Are we talking about pages on the LRU list, or metadata and
shrinker-based reclaim?

> 10 years ago, Fibre Channel was
> the popular high performance interconnect and if you were lucky enough
> to have the latest hardware rated at 10GFC, you could get 1.2GB/s per host
> bus adapter. Also, most high end storage solutions were still using spinning
> rust so it took an insane number of spindles behind each host bus adapter
> to saturate the channel if the access patterns were random. There really
> wasn’t a reason to try to thread kswapd, and I am pretty sure there haven’t
> been any attempts to do this in the last 10 years.

I do not really see your point. Yeah, you can get faster storage today.
So what? Pagecache has always been bound by the RAM speed.

> > It is just a matter of memory hogs tuning to end up in the very same
> > situation AFAICS. Moreover, the more they are going to allocate, the
> > less CPU time _other_ (non-allocating) tasks will get.
>
> Please describe the scenario a bit more clearly. Once you start constructing
> the workload that can create this scenario, I think you will find that you end
> up with a mix that is rarely seen in practice.

What I meant is that the more you reclaim in the background, the more you
allow memory hogs to allocate, because they will not get throttled. All
that at the expense of other workloads which are not memory bound and lose
the CPU cycles that additional kswapd threads would consume. Think of a
computation-intensive workload spread over most CPUs alongside a
memory-hungry data processing job.
--
Michal Hocko
SUSE Labs

2018-04-17 03:04:35

by Buddy Lumpkin

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node


> On Apr 12, 2018, at 6:16 AM, Michal Hocko <[email protected]> wrote:
>
> On Tue 03-04-18 12:41:56, Buddy Lumpkin wrote:
>>
>>> On Apr 3, 2018, at 6:31 AM, Michal Hocko <[email protected]> wrote:
>>>
>>> On Mon 02-04-18 09:24:22, Buddy Lumpkin wrote:
>>>> [...]
>>>
>>> AFAIR we used to scale the number of kswapd workers many years ago. It
>>> just turned out to be not all that great. We have had a kswapd reclaim
>>> window for quite some time now, and that allows tuning how proactive
>>> kswapd should be.
>>
>> Are you referring to vm.watermark_scale_factor?
>
> Yes, along with min_free_kbytes.
>
>> This helps quite a bit. Previously
>> I had to increase min_free_kbytes in order to get a larger gap between the low
>> and min watermarks. I was very excited when I saw that this had been added
>> upstream.
>>
>>>
>>> Also please note that the direct reclaim is a way to throttle overly
>>> aggressive memory consumers.
>>
>> I totally agree, in fact I think this should be the primary role of direct reclaims
>> because they have a substantial impact on performance. Direct reclaims are
>> the emergency brakes for page allocation, and the case I am making here is
>> that they used to only occur when kswapd had to skip over a lot of pages.
>
> Or when it is busy reclaiming which can be the case quite easily if you
> do not have the inactive file LRU full of clean page cache. And that is
> another problem. If you have a trivial reclaim situation then a single
> kswapd thread can reclaim quickly enough.

A single kswapd thread cannot reclaim quickly enough. That is the entire point
of this patch.

> But once you hit a wall with
> hard-to-reclaim pages then I would expect multiple threads will simply
> contend more (e.g. on fs locks in shrinkers etc…).

If that is the case, this is already happening since direct reclaims do just about
everything that kswapd does. I have tested with a mix of filesystem reads, writes
and anonymous memory with and without a swap device. The only locking
problems I have run into so far are related to routines in mm/workingset.c.

It is a lot harder to burden the page scan logic than it used to be. Somewhere
around 2007 a change was made where page types that had to be skipped
over were simply removed from the LRU list. Anonymous pages were only
scanned if a swap device existed, and mlocked pages were not scanned at all.
It took a couple of years before this was available in the common distros
though. Also, 64-bit kernels help, since you don’t have the problem where
objects held in ZONE_NORMAL pin pages in ZONE_HIGHMEM.

Getting real world results is a waiting game on my end. Once we have a version
available to service owners, they need to coordinate an outage so that systems
can be rebooted. Only then can I coordinate with them to test for improvements.

> Or how do you want
> to prevent that?

Kswapd has a throughput problem. Once that problem is solved new bottlenecks
will reveal themselves. There is nothing to prevent here. When you remove
bottlenecks, new bottlenecks materialize and someone will need to identify
them and make them go away.
>
> Or more specifically. How is the admin supposed to know how many
> background threads are still improving the situation?

Reduce the setting and check to see if pgscan_direct is still incrementing.
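
For example, something as simple as this can sample the counter from
/proc/vmstat (a userspace sketch, not part of the patch):

	#include <stdio.h>

	/* Read the cumulative pgscan_direct counter. Sample it twice a
	 * few seconds apart; a growing delta means tasks are still being
	 * forced into direct reclaim. */
	static unsigned long long read_pgscan_direct(void)
	{
		char line[256];
		unsigned long long val = 0;
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f)
			return 0;
		while (fgets(line, sizeof(line), f))
			if (sscanf(line, "pgscan_direct %llu", &val) == 1)
				break;
		fclose(f);
		return val;
	}

	int main(void)
	{
		printf("pgscan_direct = %llu\n", read_pgscan_direct());
		return 0;
	}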

>
>> This changed over time as the rate a system can allocate pages increased.
>> Direct reclaims slowly became a normal part of page replacement.
>>
>>> The more we do in the background context,
>>> the easier it will be for them to allocate faster. So I am not really
>>> sure that more background threads will solve the underlying problem. It
>>> is just a matter of memory hogs tuning to end up in the very same
>>> situation AFAICS. Moreover, the more they are going to allocate, the
>>> less CPU time _other_ (non-allocating) tasks will get.
>>
>> The important thing to realize here is that kswapd and direct reclaims run the
>> same code paths. There is very little that they do differently.
>
> Their target is however completely different. Kswapd wants to keep nodes
> balanced while direct reclaim aims to reclaim _some_ memory. That is
> quite some difference. Especially for the throttle-by-reclaiming-memory
> part.

Routines like balance_pgdat showed up in 2.4.10 when Andrea Arcangeli
rewrote a lot of the page replacement logic. He referred to his work as the
classzone patch, and its whole selling point was making allocation and page
replacement more cohesive and balanced, to avoid cases where kswapd would
behave pathologically, scanning to evict pages in the wrong location or in
the wrong order. That doesn’t mean that kswapd’s primary occupation is
balancing; in fact, if you read the comments, direct reclaims and kswapd
sound pretty similar to me:

/*
* This is the direct reclaim path, for page-allocating processes. We only
* try to reclaim pages from zones which will satisfy the caller's allocation
* request.
*
* If a zone is deemed to be full of pinned pages then just give it a light
* scan then give up on it.
*/
static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)

/*
* For kswapd, balance_pgdat() will reclaim pages across a node from zones
* that are eligible for use by the caller until at least one zone is
* balanced.
*
* Returns the order kswapd finished reclaiming at.
*
* kswapd scans the zones in the highmem->normal->dma direction. It skips
* zones which have free_pages > high_wmark_pages(zone), but once a zone is
* found to have free_pages <= high_wmark_pages(zone), any page in that zone
* or lower is eligible for reclaim until at least one usable zone is
* balanced.
*/
static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)

Kswapd makes an effort toward providing balance, but that is clearly not the
main goal. Both code paths are triggered by a need for memory, and both
code paths scan zones that are eligible to satisfy the allocation that triggered
them.

>
>> If you compare
>> my test results with one kswapd vs four, you can see that direct reclaims
>> increase the kernel mode CPU consumption considerably. By dedicating
>> more threads to proactive page replacement, you eliminate direct reclaims
>> which reduces the total number of parallel threads that are spinning on the
>> CPU.
>
> I still haven't looked at your test results in detail because they seem
> quite artificial. Clean pagecache reclaim is not all that interesting
> IMHO

Clean page cache is extremely interesting for demonstrating this bottleneck.
kswapd reads from the tail of the inactive list, and practically every page it
encounters is eligible for eviction, and yet it still cannot keep up with the demand
for fresh pages.

In the test data I provided, you can see that peak throughput with direct IO was:

26,254,215 Kbytes/s

Peak throughput without direct IO and 1 kswapd thread was:

18,001,910 Kbytes/s

Direct IO is 46% higher, and this gap is only going to continue to increase. It used
to be around 10%.

Any negative effects that can be seen with additional kswapd threads can already be
seen with multiple concurrent direct reclaims. The additional throughput that is gained
by scanning proactively in kswapd can certainly push harder against any additional
lock contention. In that case kswapd is just the canary in the coal mine, finding
problems that would eventually need to be solved anyway.

>
> [...]
>>> I would be also very interested
>>> to see how to scale the number of threads based on how CPUs are utilized
>>> by other workloads.
>>
>> I think we have reached the point where it makes sense for page replacement to have more
>> than one mode. Enterprise class servers with lots of memory and a large number of CPU
>> cores would benefit heavily if more threads could be devoted toward proactive page
>> replacement. The polar opposite case is my Raspberry PI which I want to run as efficiently
>> as possible. This problem is only going to get worse. I think it makes sense to be able to
>> choose between efficiency and performance (throughput and latency reduction).
>
> The thing is that as long as this would require admin to guess then this
> is not all that useful. People will simply not know what to set and we
> are going to end up with stupid admin guides claiming that you should
> use 1/N of per node cpus for kswapd and that will not work.

I think this sysctl is very intuitive to use. Only use it if direct reclaims are
occurring. This can be seen with sar -B. Justify any increase with testing.
That is a whole lot easier to wrap your head around than a lot of the other
sysctls that are available today. Find me an admin that actually understands
what the swappiness tunable does.
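(For reference: in sar -B output, pgscank/s is the kswapd scan rate and
pgscand/s is the direct scan rate; a sustained nonzero pgscand/s is what I
mean by direct reclaims occurring.)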

> Not to
> mention that the reclaim logic is full of heuristics which change over
> time and a subtle implementation detail that would work for a particular
> scaling might break without anybody noticing. Really, if we are not able
> to come up with some auto tuning then I think that this is not really
> worth it.

This is all speculation about how a patch behaves that you have not even
tested. Similar arguments can be made about most of the sysctls that are
available.


> --
> Michal Hocko
> SUSE Labs


2018-04-17 09:06:23

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

On Mon 16-04-18 20:02:22, Buddy Lumpkin wrote:
>
> > On Apr 12, 2018, at 6:16 AM, Michal Hocko <[email protected]> wrote:
[...]
> > But once you hit a wall with
> > hard-to-reclaim pages then I would expect multiple threads will simply
> > contend more (e.g. on fs locks in shrinkers etc…).
>
> If that is the case, this is already happening since direct reclaims do just about
> everything that kswapd does. I have tested with a mix of filesystem reads, writes
> and anonymous memory with and without a swap device. The only locking
> problems I have run into so far are related to routines in mm/workingset.c.

You haven't tried hard enough. Try to generate bigger fs metadata
pressure. In other words, something less of a toy than a pure reader
without any real processing.

[...]

> > Or more specifically. How is the admin supposed to know how many
> > background threads are still improving the situation?
>
> Reduce the setting and check to see if pgscan_direct is still incrementing.

This just doesn't work. You are oversimplifying a lot! There are many
more aspects to this. How many background threads are still worth it
without stealing cycles from others? Is half of the CPUs per NUMA node worth
devoting to background reclaim, or is it better to let those excessive
memory consumers be throttled by the direct reclaim?

You are still ignoring/underestimating the fact that kswapd steals
cycles even from other workloads that are not memory bound, while direct
reclaim throttles (mostly) memory consumers.

[...]
> > I still haven't looked at your test results in detail because they seem
> > quite artificial. Clean pagecache reclaim is not all that interesting
> > IMHO
>
> Clean page cache is extremely interesting for demonstrating this bottleneck.

yes it shows the bottleneck but it is quite artificial. Read data is
usually processed and/or written back and that changes the picture a
lot.

Anyway, I do agree that the reclaim can be made faster. I am just not
(yet) convinced that multiplying the number of workers is the way to achieve
that.

[...]
> >>> I would be also very interested
> >>> to see how to scale the number of threads based on how CPUs are utilized
> >>> by other workloads.
> >>
> >> I think we have reached the point where it makes sense for page replacement to have more
> >> than one mode. Enterprise class servers with lots of memory and a large number of CPU
> >> cores would benefit heavily if more threads could be devoted toward proactive page
> >> replacement. The polar opposite case is my Raspberry PI which I want to run as efficiently
> >> as possible. This problem is only going to get worse. I think it makes sense to be able to
> >> choose between efficiency and performance (throughput and latency reduction).
> >
> > The thing is that as long as this would require admin to guess then this
> > is not all that useful. People will simply not know what to set and we
> > are going to end up with stupid admin guides claiming that you should
> > use 1/N of per node cpus for kswapd and that will not work.
>
> I think this sysctl is very intuitive to use. Only use it if direct reclaims are
> occurring. This can be seen with sar -B. Justify any increase with testing.
> That is a whole lot easier to wrap your head around than a lot of the other
> sysctls that are available today. Find me an admin that actually understands
> what the swappiness tunable does.

Well, you have pointed to a nice example actually. Yes, swappiness is
confusing and you can find _many_ different howtos for tuning it. Do they
work? No, and they haven't for a long time on most workloads, because we are
so pagecache-biased these days that we simply ignore the value most of the
time. I am pretty sure your "just watch sar -B and tune accordingly" will
become obsolete in a short time and people will get confused again, because
they are explicitly tuning for their workload but it doesn't help anymore,
since the internal implementation of the reclaim has changed again (this
happens all the time).

No, I simply do not want to repeat past errors and expose too much of the
implementation details to admins who will most likely have no clue how
to use the tuning and will rely on random advice on the internet or, even
worse, admin guides of questionable quality full of cargo-cult advice
(remember the advice to disable THP for basically any performance problem
you see).

> > Not to
> > mention that the reclaim logic is full of heuristics which change over
> > time and a subtle implementation detail that would work for a particular
> > scaling might break without anybody noticing. Really, if we are not able
> > to come up with some auto tuning then I think that this is not really
> > worth it.
>
> This is all speculation about how a patch behaves that you have not even
> tested. Similar arguments can be made about most of the sysctls that are
> available.

I really do want a solid background for a change like this. You are
throwing corner-case numbers at me and ignoring some important points.
So let me repeat. If we want to allow more kswapd threads per node then
we really have to evaluate the effect on throttling of memory hogs, and we
should have a decent idea of how to scale those threads. If we are not
able to handle that in the kernel, with the full picture, then I fail to
see how an admin can do that.
--
Michal Hocko
SUSE Labs

2020-09-30 19:30:19

by Sebastiaan Meijer

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

> yes it shows the bottleneck but it is quite artificial. Read data is
> usually processed and/or written back and that changes the picture a
> lot.
Apologies for reviving an ancient thread (and apologies in advance for my lack
of knowledge on how mailing lists work), but I'd like to offer up another
reason why merging this might be a good idea.

From what I understand, zswap runs its compression on the same kswapd thread,
limiting it to a single thread for compression. Given enough processing power,
zswap can get great throughput using heavier compression algorithms like zstd,
but this is currently greatly limited by the lack of threading.
People on other sites have claimed applying this patchset greatly improved
zswap performance on their systems even for lighter compression algorithms.
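
As far as I can tell, the path is roughly the following (my own reading of
the code, simplified, so treat it as an assumption):

	/*
	 * kswapd()
	 *   -> shrink_node() -> shrink_page_list()
	 *     -> pageout() -> swap_writepage()
	 *       -> frontswap_store() -> zswap_frontswap_store()
	 *          (compression happens here, synchronously, on the
	 *           one kswapd thread per node)
	 */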

For me personally I currently have a swap-heavy zswap-enabled server with
a single-threaded kswapd0 consuming 100% CPU constantly, and performance
is suffering because of it.
The server has 32 cores sitting mostly idle that I'd love to put to zswap work.

This setup could be considered a corner case, but it's definitely a
production workload that would greatly benefit from this change.
--
Sebastiaan Meijer

2020-10-01 12:33:01

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > yes it shows the bottleneck but it is quite artificial. Read data is
> > usually processed and/or written back and that changes the picture a
> > lot.
> Apologies for reviving an ancient thread (and apologies in advance for my lack
> of knowledge on how mailing lists work), but I'd like to offer up another
> reason why merging this might be a good idea.
>
> From what I understand, zswap runs its compression on the same kswapd thread,
> limiting it to a single thread for compression. Given enough processing power,
> zswap can get great throughput using heavier compression algorithms like zstd,
> but this is currently greatly limited by the lack of threading.

Isn't this a problem of the zswap implementation rather than general
kswapd reclaim? Why doesn't zswap do the same as normal swap out in a
context outside of the reclaim?

My recollection of the particular patch is dim, but I do remember it
tried to add more kswapd threads, which would just paper over the problem
you are seeing rather than solve it.

--
Michal Hocko
SUSE Labs

2020-10-01 16:19:26

by Sebastiaan Meijer

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

(Apologies for messing up the mailing list thread, Gmail had fooled me into
believing that it properly picked up the thread)

On Thu, 1 Oct 2020 at 14:30, Michal Hocko <[email protected]> wrote:
>
> On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > yes it shows the bottleneck but it is quite artificial. Read data is
> > > usually processed and/or written back and that changes the picture a
> > > lot.
> > Apologies for reviving an ancient thread (and apologies in advance for my lack
> > of knowledge on how mailing lists work), but I'd like to offer up another
> > reason why merging this might be a good idea.
> >
> > From what I understand, zswap runs its compression on the same kswapd thread,
> > limiting it to a single thread for compression. Given enough processing power,
> > zswap can get great throughput using heavier compression algorithms like zstd,
> > but this is currently greatly limited by the lack of threading.
>
> Isn't this a problem of the zswap implementation rather than general
> kswapd reclaim? Why doesn't zswap do the same as normal swap out in a
> context outside of the reclaim?

I wouldn't be able to tell you; the documentation on zswap is fairly limited
from what I've found.

> My recollection of the particular patch is dim, but I do remember it
> tried to add more kswapd threads, which would just paper over the problem
> you are seeing rather than solve it.

Yeah, that's exactly what it does, just adding more kswapd threads.
I've tried updating the patch to the latest mainline kernel to test its
viability for our use case, but the kswapd code changed too much over the
past 2 years; updating it is beyond my ability right now, it seems.

For the time being I've switched over to zram, which better suits our use
case either way, and is threaded, but lacks zswap's memory deduplication.

Even with zram I'm still seeing kswapd frequently max out a core though,
so there's definitely still a case for further optimization of kswapd.
In our case it's not a single big application taking up our memory, rather we
are running 2000 high-memory applications. They store a lot of data in swap,
but rarely ever access said data, so the actual swap i/o isn't even that high.

--
Sebastiaan Meijer

2020-10-02 07:05:22

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> (Apologies for messing up the mailing list thread, Gmail had fooled me into
> believing that it properly picked up the thread)
>
> On Thu, 1 Oct 2020 at 14:30, Michal Hocko <[email protected]> wrote:
> >
> > On Wed 30-09-20 21:27:12, Sebastiaan Meijer wrote:
> > > > yes it shows the bottleneck but it is quite artificial. Read data is
> > > > usually processed and/or written back and that changes the picture a
> > > > lot.
> > > Apologies for reviving an ancient thread (and apologies in advance for my lack
> > > of knowledge on how mailing lists work), but I'd like to offer up another
> > > reason why merging this might be a good idea.
> > >
> > > From what I understand, zswap runs its compression on the same kswapd thread,
> > > limiting it to a single thread for compression. Given enough processing power,
> > > zswap can get great throughput using heavier compression algorithms like zstd,
> > > but this is currently greatly limited by the lack of threading.
> >
> > Isn't this a problem of the zswap implementation rather than general
> > kswapd reclaim? Why doesn't zswap do the same as normal swap out in a
> > context outside of the reclaim?
>
> I wouldn't be able to tell you, the documentation on zswap is fairly limited
> from what I've found.

I would recommend you talk to the zswap maintainers, describing your
problem and suggesting they offload the heavy lifting into a separate
context like the standard swap IO does. You are not the only one to hit
this problem:
http://lkml.kernel.org/r/CALvZod43VXKZ3StaGXK_EZG_fKcW3v3=cEYOWFwp4HNJpOOf8g@mail.gmail.com.
CCing Shakeel on such an email might help you to give more use cases.

> > My recollection of the particular patch is dim, but I do remember it
> > tried to add more kswapd threads, which would just paper over the problem
> > you are seeing rather than solve it.
>
> Yeah, that's exactly what it does, just adding more kswapd threads.

Which is far from trivial, because it has side effects on the overall
system balance. See my reply to the original request and the follow-up
discussion. I am not saying this is impossible to achieve and tune
properly, but it is certainly non-trivial and it would require a really
strong justification.
--
Michal Hocko
SUSE Labs

2020-10-02 08:42:15

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

On Fri, Oct 02, 2020 at 09:03:33AM +0200, Michal Hocko wrote:
> > > My recollection of the particular patch is dim, but I do remember it
> > > tried to add more kswapd threads, which would just paper over the problem
> > > you are seeing rather than solve it.
> >
> > Yeah, that's exactly what it does, just adding more kswapd threads.
>
> Which is far from trivial, because it has side effects on the overall
> system balance.

While I have not read the original patches, multiple kswapd threads will
smash into the LRU lock repeatedly. It's already the case that just plain
storms of page cache allocations hammer that lock on pagevec releases, and
it gets worse as memory sizes increase. Increasing LRU lock contention when
memory is low is going to have diminishing returns.
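
To illustrate, the isolation step in shrink_inactive_list() looks roughly
like this (paraphrased from ~v5.9 and simplified), and every reclaimer on a
node, kswapd threads and direct reclaimers alike, serializes on that one
per-node lock:

	spin_lock_irq(&pgdat->lru_lock);
	nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
				     &nr_scanned, sc, lru);
	/* ... per-zone counter updates elided ... */
	spin_unlock_irq(&pgdat->lru_lock);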

--
Mel Gorman
SUSE Labs

2020-10-02 13:54:50

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

On Fri, 2020-10-02 at 09:03 +0200, Michal Hocko wrote:
> On Thu 01-10-20 18:18:10, Sebastiaan Meijer wrote:
> [...]
> > > Isn't this a problem of the zswap implementation rather than general
> > > kswapd reclaim? Why doesn't zswap do the same as normal swap out in a
> > > context outside of the reclaim?

On systems with lots of very fast IO devices, we have
also seen kswapd take 100% CPU time without any zswap
in use.

This seems like a generic issue, though zswap does
manage to bring it out on lower end systems.

--
All Rights Reversed.



2020-10-02 14:02:49

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

On Fri, Oct 02, 2020 at 09:53:05AM -0400, Rik van Riel wrote:
> On Fri, 2020-10-02 at 09:03 +0200, Michal Hocko wrote:
> > [...]
>
> On systems with lots of very fast IO devices, we have
> also seen kswapd take 100% CPU time without any zswap
> in use.
>
> This seems like a generic issue, though zswap does
> manage to bring it out on lower end systems.

Then, given Mel's observation about contention on the LRU lock, what's
the solution? Partition the LRU list? Batch removals from the LRU list
by kswapd and hand off to per-?node?cpu? worker threads?

Rik, if you have access to one of those systems, I'd be interested to know
whether using file THPs would help with your workload. Tracking only
one THP instead of, say, 16 regular size pages is going to reduce the
amount of time taken to pull things off the LRU list.

2020-10-02 14:31:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH 1/1] vmscan: Support multiple kswapd threads per node

On Fri 02-10-20 09:53:05, Rik van Riel wrote:
> [...]
>
> On systems with lots of very fast IO devices, we have
> also seen kswapd take 100% CPU time without any zswap
> in use.

Do you have more details? Does the saturated kswapd lead to premature
direct reclaim? What is the saturated number of reclaimed pages per unit
of time? Have you tried to play with this to see whether an additional
worker would help?

--
Michal Hocko
SUSE Labs