From:   Kairui Song <ryncsn@gmail.com>
To:     linux-mm@kvack.org
Cc:     linux-kernel@vger.kernel.org,
        Andrew Morton <akpm@linux-foundation.org>,
        Kairui Song <kasong@tencent.com>
Subject: [RFC PATCH 0/7] Replace per-task RSS cache with per-CPU RSS cache
Date:   Fri, 29 Jul 2022 04:45:04 +0800
Message-Id: <20220728204511.56348-1-ryncsn@gmail.com>
Reply-To: Kairui Song <kasong@tencent.com>

From: Kairui Song <kasong@tencent.com>

Hi Linux MM,

This is an early RFC patch series that tries to fix the inaccurate RSS
counting issue and also improve performance.

The problem
===========

While reading the code in mm/memory.c, I noticed the 64-event threshold
for RSS accounting, and the per-task design does not seem to be the best
solution for either accuracy or efficiency.

The 64-event threshold is still quite small, so contention on the mm
counters remains. It is not the most contended thing, but it is still a
performance hit. At the same time, the threshold is already large enough
to have a negative effect on RSS accuracy.

Although RSS is not a really good standard for measuring or controlling
process memory usage, it is still the most common way for most users to
check the memory usage of a process (through utilities like top and ps,
or by reading /proc).

It's not hard to find many users complaining about strange RSS
counting just by googling it. And I can easily 'steal' tons of
memory from the kernel without it being counted in RSS, using the
following code snippet on x86:

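#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/mman.h>
#define THREAD_NUM 10000
#define PAGE_SIZE 4096
/* 32 pages per thread, which stays below the 64-event threshold */
#define MAP_SIZE (PAGE_SIZE * 32)
void *thread(void *ptr) {
	char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return NULL;
	}
	/* Touch every byte so that each page is actually faulted in. */
	for (int i = 0; i < MAP_SIZE; ++i)
		p[i] = 0xff;
	/* Keep the memory in use while RSS is inspected. */
	sleep(1000);
	munmap(p, MAP_SIZE);
	return NULL;
}
int main(int argc, char **argv) {
	pthread_t threads[THREAD_NUM];
	int created = 0;
	/* Only join threads that were really created. */
	for (int i = 0; i < THREAD_NUM; ++i) {
		if (pthread_create(&threads[created], NULL, thread, NULL) != 0)
			break;
		created++;
	}
	for (int i = 0; i < created; ++i)
		pthread_join(threads[i], NULL);
	return 0;
}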

And this is the RSS that `ps` reports:
USER         PID %CPU  %MEM      VSZ    RSS   TTY      STAT START   TIME COMMAND
root        7278 33.1  0.1  83245376  43888   pts/0    Sl+  01:09   0:03 ./a.out

But it actually uses about 1.5G of memory, and a `top` user will have no
idea about it.
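
As a rough cross-check of the numbers above, the sketch below computes
how much anonymous memory the demo mappings alone account for and how
much of that is missing from the reported RSS. It assumes 4 KiB pages,
and the helper names are made up for illustration. The mappings add up
to 1280000 KiB (about 1.22 GiB), while `ps` reported only 43888 KiB.

```c
#include <assert.h>

/* Values taken from the demo program above. */
#define DEMO_THREADS          10000UL
#define DEMO_PAGES_PER_THREAD 32UL
#define DEMO_PAGE_KIB         4UL	/* assumes 4 KiB pages (x86) */

/* KiB of anonymous memory touched by all threads (illustrative helper). */
static unsigned long demo_touched_kib(unsigned long threads,
				      unsigned long pages_per_thread)
{
	/* Keep the inputs small enough that the product cannot overflow. */
	assert(threads <= 1000000UL && pages_per_thread <= 1024UL);
	return threads * pages_per_thread * DEMO_PAGE_KIB;
}

/* KiB of touched memory that is missing from a reported RSS value. */
static unsigned long demo_missing_kib(unsigned long touched_kib,
				      unsigned long reported_kib)
{
	return touched_kib > reported_kib ? touched_kib - reported_kib : 0;
}
```

Calling demo_missing_kib(demo_touched_kib(DEMO_THREADS,
DEMO_PAGES_PER_THREAD), 43888) gives the amount of mapped and touched
memory that the reported RSS does not cover.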

RFC
===

In this RFC series, I tried to improve it by using a per-CPU cache
instead of doing RSS caching per task. A CPU caches the RSS updates
of its currently active mm for as long as possible. When an mm is
switched onto or off a CPU, the cached data is flushed.
I first tried to make the cache switch/flush independent of the mm
switch, but it turned out to be easier and more accurate to just stick
with the mm switch.

As long as the mm is not switched, the cache is not flushed, so the
atomic operations on the mm counters are avoided as much as possible.
Instead, the RSS reader now collects the cached values from each
CPU upon reading.
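
To make the idea above more concrete, here is a minimal single-threaded
userspace model of it. This is only a sketch and not the code from the
patches: all names (mm_model, cpu_rss_cache, switch_mm_model, ...) are
hypothetical, the CPU count is fixed at 4, and the atomics, memory
ordering and preemption handling that the kernel needs are left out.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS_MODEL 4	/* hypothetical CPU count for this model */

struct mm_model {
	long rss;		/* shared counter (atomic in the real kernel) */
};

struct cpu_rss_cache {
	struct mm_model *mm;	/* mm whose updates this CPU is caching */
	long delta;		/* pending RSS change, not yet flushed */
};

static struct cpu_rss_cache cpu_cache[NR_CPUS_MODEL];

/* Fold the pending delta of one CPU into the shared counter of its mm. */
static void flush_cpu_cache(int cpu)
{
	struct cpu_rss_cache *c = &cpu_cache[cpu];

	if (c->mm)
		c->mm->rss += c->delta;
	c->delta = 0;
}

/* Model of an mm switch on a CPU: flush the old mm, then cache the new one. */
static void switch_mm_model(int cpu, struct mm_model *next)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	if (cpu_cache[cpu].mm == next)
		return;		/* same mm, keep caching */
	flush_cpu_cache(cpu);
	cpu_cache[cpu].mm = next;
}

/* Updater: touch only the local cache on a hit. */
static void add_rss_model(int cpu, struct mm_model *mm, long pages)
{
	assert(cpu >= 0 && cpu < NR_CPUS_MODEL);
	if (cpu_cache[cpu].mm == mm)
		cpu_cache[cpu].delta += pages;	/* cache hit */
	else
		mm->rss += pages;		/* cache miss: shared counter */
}

/* Reader: shared counter plus every pending per-CPU delta for this mm. */
static long read_rss_model(const struct mm_model *mm)
{
	long sum = mm->rss;

	for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
		if (cpu_cache[cpu].mm == mm)
			sum += cpu_cache[cpu].delta;
	return sum;
}
```

In this model the updater never touches the shared counter while its mm
stays on the CPU, and the reader pays for that by walking all per-CPU
caches.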

As far as I understand, in most cases RSS reading is less frequent
than RSS updating. Checking all callers of
get_mm_counter/get_mm_rss/*_hiwater_rss, they are used by the OOM killer,
periodically by task NUMA balancing, from /proc, and by sparc arch code,
which seems to be the only hot path. So this design should improve
overall performance. Besides, with the previous design, reading RSS at
a high frequency doesn't make much sense anyway, since there is a
64-event delay for each task.

And the overhead of iterating over each CPU can be minimized, so reading
will not be much slower than before. One way is to reuse mm_cpumask, so
the RSS reader only iterates over CPUs that have actually cached the RSS
of the target mm.

Without this optimization, the reader and cache invalidation may suffer
a full CPU sync. But by carefully arranging the data structures and using
a lockless reading design, the performance hit should still be acceptable.
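
The cpumask idea described above can be sketched in userspace like this.
Again, this is a simplified single-threaded model under my own
assumptions, not the kernel implementation: a plain 32-bit word stands
in for mm_cpumask, and every name in it (masked_mm, masked_attach,
masked_read, ...) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MASK_NR_CPUS 8	/* hypothetical CPU count, must be <= 32 here */

struct masked_mm {
	long rss;		/* shared counter */
	uint32_t cpumask;	/* bit n set: CPU n may hold a cached delta */
};

struct masked_cache {
	struct masked_mm *mm;
	long delta;
};

static struct masked_cache masked_caches[MASK_NR_CPUS];

/* CPU starts running this mm: mark it in the mask and start caching. */
static void masked_attach(int cpu, struct masked_mm *mm)
{
	assert(cpu >= 0 && cpu < MASK_NR_CPUS);
	assert(masked_caches[cpu].mm == NULL);	/* detach the old mm first */
	masked_caches[cpu].mm = mm;
	masked_caches[cpu].delta = 0;
	mm->cpumask |= UINT32_C(1) << cpu;
}

/* CPU stops running its mm: flush the delta and clear the mask bit. */
static void masked_detach(int cpu)
{
	struct masked_cache *c;

	assert(cpu >= 0 && cpu < MASK_NR_CPUS);
	c = &masked_caches[cpu];
	if (!c->mm)
		return;
	c->mm->rss += c->delta;
	c->mm->cpumask &= ~(UINT32_C(1) << cpu);
	c->mm = NULL;
	c->delta = 0;
}

/* Local update on a CPU that currently has an mm attached. */
static void masked_add(int cpu, long pages)
{
	assert(cpu >= 0 && cpu < MASK_NR_CPUS && masked_caches[cpu].mm);
	masked_caches[cpu].delta += pages;
}

/* Reader: visit only CPUs whose bit is set; *visited counts them. */
static long masked_read(const struct masked_mm *mm, int *visited)
{
	long sum = mm->rss;
	uint32_t mask = mm->cpumask;

	*visited = 0;
	for (int cpu = 0; mask; cpu++, mask >>= 1) {
		if (!(mask & 1))
			continue;
		sum += masked_caches[cpu].delta;
		(*visited)++;
	}
	return sum;
}
```

With two of the eight CPUs attached, the reader here touches two cache
entries instead of eight, which is the saving the cpumask is meant to
provide.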

Also, removing the cache from the per-task struct may help save a little
bit of space.

In this series:
Patch 1/7 - 3/7 remove the old per-task cache and do some preparation.
Patch 4/7 implements a generic version of the per-CPU RSS cache, with a
performance drop on RSS reading/invalidation because it has to
iterate over all CPUs.
Patch 6/7 and 7/7 implement and enable reusing mm_cpumask for RSS
caching on x86_64.

Tests
=====

Based on this design and this series, I did some tests on x86_64,
and it showed a few advantages compared to the previous design:

- First, the RSS accuracy problem of the demo C program above is now fixed:
USER         PID %CPU  %MEM      VSZ    RSS   TTY      STAT START   TIME COMMAND
root        1201 40.7  4.7  83245376 1498856  pts/0    Sl+  01:11   0:03 ./a.out

Now it's exactly the amount of memory being consumed.

- Cache hit rate:
Booting up and logging in to the system (Fedora 36):
Hit/Miss (1 cache miss per 1800 events avg):
1021027/567

Building the Linux kernel with tinyconfig and `make -j16`:
Hit/Miss (1 cache miss per 521 events avg):
35613357/68327

pgbench:
Hit/Miss (1 cache miss per 2543811 events avg):
35613357/14

The most common cache miss call chain here is:
 => add_mm_counter_fast
 => do_anonymous_page
 => __handle_mm_fault
 => handle_mm_fault
 => __get_user_pages
 => __get_user_pages_remote
 => get_arg_page
 => copy_string_kernel
 => do_execveat_common.isra.0
 => __x64_sys_execve
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

With long-running tasks, the hit rate is extremely high, nearly 100%,
since tasks tend to have a high affinity with the local CPU.
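
For reference, the "1 cache miss per N events" figures in the hit rate
results above can be reproduced from the raw Hit/Miss counts with a
small helper like the one below. The function name is made up, and it
assumes the average was obtained by dividing hits by misses and rounding
down, which matches all three quoted values.

```c
#include <assert.h>

/* Average number of cache hits per cache miss, rounded down. */
static unsigned long long events_per_miss(unsigned long long hits,
					  unsigned long long misses)
{
	assert(misses > 0);	/* avoid dividing by zero */
	return hits / misses;
}
```

For example, events_per_miss(1021027, 567) gives 1800 for the boot and
login case.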

- Some benchmark results:
(with transparent hugepage set to 'never', just to stabilize the
results; with 'always', a similar performance difference is observed):

-- on 1 x 8 core AMD Ryzen 9 5900HX @ 3.30GHz:
pts/pgbench (100 clients, 12 test runs):
Before: 16073 TPS, 6.241 ms latency
After: 16829 TPS, 5.962 ms latency (~4% faster)

hackbench (32 processes, 100 test runs):
Before: took 125.317s
After: took 123.482s (~1% faster)

Linux kernel build (tinyconfig, 20 test runs):
Before: took 20.9947s
After: took 20.9955s (Almost identical)

-- on 2 x 48 core Intel Xeon Platinum 8255C CPU @ 2.50GHz:
pts/pgbench (100 clients, 24 test runs):
Before: 22556 TPS, 4.458 ms latency
After: 23025 TPS, 4.347 ms latency (~2% faster)

hackbench (32 processes, 100 test runs):
Before: took 48.774s
After: took 48.540s (~1% faster)

Linux kernel build (defconfig, 20 test runs):
Before: took 42.7050s
After: took 42.7121s (Almost identical)

Overall, the performance seems slightly better than before in the
above tests. Some code paths can still be optimized (e.g. the
*_hiwater_rss related call sites, making more RSS counting routines use
this cache, etc.), so things may still get better.

This patch series is only in an early RFC state since I'm not sure if
this is an acceptable design, so I hope I can collect some discussion
from the community.

But at least this shows things definitely can be improved.

If this approach is acceptable, I'll try to work on the following
items:
- Arch optimizations, using mm_cpumask. (It seems sparc reads RSS on every
  page fault; I'm not sure whether sparc will conflict with this approach.)
- Currently, the kernel treats the RSS reading helpers as very cheap
  function calls, which they no longer are with this series. E.g.
  *_hiwater_rss might be batched or optimized in some way to reduce the
  performance impact of the slower reader.
- CPU hotplug.
- More code tweaking.

Please excuse me if I've made any silly mistakes. Looking forward
to hearing everyone's opinion on this.

Kairui Song (7):
  mm: remove the per-task RSS counter cache
  mm: move check_mm to memory.c
  mm/headers: change emun order of MM_COUNTERS
  mm: introduce a generic per-CPU RSS cache
  mm: use fast path for pmd setting as well
  mm: introduce CONFIG_ARCH_PCP_RSS_USE_CPUMASK
  x86_64/tlb, mm: enable cpumask optimzation for RSS cache

 Documentation/filesystems/proc.rst |   7 -
 arch/Kconfig                       |   3 +
 arch/x86/Kconfig                   |   1 +
 arch/x86/mm/tlb.c                  |   5 +
 fs/exec.c                          |   2 -
 include/linux/mm.h                 |  34 +---
 include/linux/mm_types_task.h      |  49 ++++--
 include/linux/sched.h              |   3 -
 kernel/exit.c                      |   5 -
 kernel/fork.c                      |  39 +----
 kernel/kthread.c                   |   1 -
 kernel/sched/core.c                |   4 +
 mm/madvise.c                       |   7 +-
 mm/memory.c                        | 243 +++++++++++++++++++++++++----
 14 files changed, 269 insertions(+), 134 deletions(-)

-- 
2.35.2