2022-07-28 20:58:41

by Kairui Song

Subject: [RFC PATCH 0/7] Replace per-task RSS cache with per-CPU RSS cache

From: Kairui Song <[email protected]>

Hi Linux MM,

This is an early RFC patch series which tries to fix the inaccurate RSS
counting issue and also improve performance.

The problem
===========

While reading the code in mm/memory.c, I noticed the 64-event threshold
for RSS accounting, and the per-task caching design doesn't seem to be
the best solution for either accuracy or efficiency.

The 64-event threshold is still quite small, so contention is still
there; it's not the most contended spot, but it is still a performance
hit. At the same time, it is already large enough to have a negative
effect on RSS accuracy.

Although RSS is not a really good metric for measuring or controlling
process memory usage, it's still the most common way for users to check
the memory usage of processes on a system (through utilities like top
and ps, or by reading from /proc).

It's not hard to find users complaining about strange RSS counting
issues just by googling it. And I can easily 'steal' tons of memory
from the kernel without it being counted in RSS, with the following
code snippet on x86:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <linux/mman.h>
#define THREAD_NUM 10000
#define PAGE_SIZE 4096
#define MAP_SIZE (4096 * 32)
void *thread(void *ptr) {
    /* Each thread maps 32 anonymous pages, dirties them, then sleeps */
    char *p = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    for (int i = 0; i < MAP_SIZE; ++i)
        p[i] = 0xff;
    sleep(1000);
    munmap(p, MAP_SIZE);
    return NULL;
}

int main(int argc, char **argv) {
    pthread_t threads[THREAD_NUM];
    for (int i = 0; i < THREAD_NUM; ++i)
        pthread_create(&threads[i], NULL, thread, NULL);
    for (int i = 0; i < THREAD_NUM; ++i)
        pthread_join(threads[i], NULL);
    return 0;
}

And the RSS reported by `ps`:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 7278 33.1 0.1 83245376 43888 pts/0 Sl+ 01:09 0:03 ./a.out

But actually about 1.5G of memory is used by it, and a `top` user will
have no idea about that.

RFC
===

In this RFC series, I tried to improve it by using a per-CPU cache
instead of doing the RSS caching per-task. A CPU will cache the RSS
updates of its current active mm as much as possible. When an mm is
switched onto or off a CPU, the cached data is flushed.
I first tried to make the cache switch/flush independent of the mm
switch, but it turns out it is easier and more accurate to just stick
with the mm switch.

As long as the mm is not switched, the cache is not flushed, so atomic
operations on the mm counters are avoided as much as possible.
Instead, the RSS reader now collects the cached values from each
CPU upon reading.
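
To illustrate the idea, here is a minimal sketch of the per-CPU cache
(the struct name and the miss-path fallback below are illustrative, not
the exact code of patch 4/7; the per-CPU variable cpu_rss_cache and its
mm/count/sync_count fields are the ones used throughout this series):

/* Illustrative sketch only */
struct pcp_rss_cache {
	struct mm_struct *mm;           /* mm whose counters are cached on this CPU */
	long count[NR_MM_COUNTERS];     /* uncommitted deltas, folded into mm->rss_stat on flush */
	unsigned long sync_count;       /* bumped on flush so readers can detect races */
};
static DEFINE_PER_CPU(struct pcp_rss_cache, cpu_rss_cache);

/*
 * Fast path: while the active mm stays on this CPU, an update only
 * touches the local per-CPU counter and needs no atomic operation.
 * What the miss path does exactly is up to the real patch; falling
 * back to the atomic counter is just one option shown here.
 */
static inline void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
{
	if (likely(this_cpu_read(cpu_rss_cache.mm) == mm))
		this_cpu_add(cpu_rss_cache.count[member], val);
	else
		atomic_long_add(val, &mm->rss_stat.count[member]);
}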

To my understanding, in most cases RSS reading is less frequent than
RSS updating (checking all callers of
get_mm_counter/get_mm_rss/*_hiwater_rss: it's used by the OOM killer,
by the task NUMA balancer periodically, from /proc, and by sparc arch
code, which seems to be the only hot path), so this design should
improve the performance overall. Besides, with the previous design,
high-frequency RSS reading doesn't make much sense anyway, since each
task introduces a 64-event delay.

The overhead of iterating over CPUs can be minimized, so reading will
not be much slower than before. One way is to reuse mm_cpumask: the RSS
reader then only iterates over CPUs that actually have the RSS of the
target mm cached.

Without this optimization, the reader and cache invalidation have to do
a full walk over all possible CPUs. But by carefully arranging the data
structures and using a lockless reading design, the performance hit
should still be acceptable.
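
As a rough sketch of the reader side with the cpumask optimization
(simplified from the get_mm_counter() in patch 6/7 below; the real code
also reads cpu_rss_cache.sync_count around the per-CPU values to detect
a flush racing with the read):

/* Simplified reader: committed value plus any per-CPU cached deltas */
unsigned long get_mm_counter(struct mm_struct *mm, int member)
{
	int cpu;
	long ret = atomic_long_read(&mm->rss_stat.count[member]);

	/* Only CPUs in mm_cpumask() may hold cached deltas for this mm */
	for_each_cpu(cpu, mm_cpumask(mm)) {
		if (READ_ONCE(per_cpu(cpu_rss_cache.mm, cpu)) == mm)
			ret += READ_ONCE(per_cpu(cpu_rss_cache.count[member], cpu));
	}

	if (ret < 0)
		ret = 0;
	return ret;
}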

Also, removing the per-task cache may help save a little bit of space
in task_struct.

In this series:
Patches 1/7 - 3/7 remove the old per-task cache and do some preparation.
Patch 4/7 implements a generic version of the per-CPU RSS cache, with a
performance drop on RSS reading/invalidation, because it has to
iterate over all CPUs.
Patches 6/7 and 7/7 implement and enable reusing mm_cpumask for RSS
caching on x86_64.

Tests
=====

Based on this design and this series, I did some tests on x86_64,
which showed a few advantages compared to the previous design:

- First, the RSS accuracy of the demo C program above is now fixed:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1201 40.7 4.7 83245376 1498856 pts/0 Sl+ 01:11 0:03 ./a.out

Now it's exactly the amount of memory being consumed.

- Cache hit rate:
Booting up and login system (Fedora 36):
Hit/Miss (1 cache miss per 1800 events avg):
1021027/567

Build linux kernel with tinyconfig and `make -j16`:
Hit/Miss (1 cache miss per 521 events avg):
35613357/68327

pgbench:
Hit/Miss (1 cache miss per 2543811 events avg):
35613357/14

The most common cache miss call chain here is:
=> add_mm_counter_fast
=> do_anonymous_page
=> __handle_mm_fault
=> handle_mm_fault
=> __get_user_pages
=> __get_user_pages_remote
=> get_arg_page
=> copy_string_kernel
=> do_execveat_common.isra.0
=> __x64_sys_execve
=> do_syscall_64
=> entry_SYSCALL_64_after_hwframe

With long-running tasks, the hit rate is extremely high, nearly 100%,
since tasks tend to have a high affinity with their local CPU.

- Some benchmark results
(with transparent hugepage set to 'never', just to stabilize the
results; with 'always', a similar performance difference is observed):

-- on 1 x 8 core AMD Ryzen 9 5900HX @ 3.30GHz:
pts/pgbench (100 clients, 12 test runs):
Before: 16073 TPS, 6.241 ms latency
After: 16829 TPS, 5.962 ms latency (~4% faster)

hackbench (32 processes, 100 test runs):
Before: 125.317
After: 123.482 (~1% faster)

Linux kernel build (tinyconfig, 20 test runs):
Before: 20.9947s
After: 20.9955s (Almost identical)

-- on 2 x 48 core Intel Xeon Platinum 8255C CPU @ 2.50GHz:
pts/pgbench (100 clients, 24 test runs):
Before: 22556 TPS, 4.458 ms latency
After: 23025 TPS, 4.347 ms latency (~2% faster)

hackbench (32 processes, 100 test runs):
Before: took 48.774s
After: took 48.540s (~1% faster)

Linux kernel build (defconfig, 20 test runs):
Before: took 42.7050s
After: took 42.7121s (Almost identical)

Overall, the performance seems slightly better than before in the
above tests. Some code paths can still be optimized (e.g. the
*_hiwater_rss related call sites, making more RSS counting routines use
this cache, etc.), so things may still get better.

This patch series is only in an early RFC state since I'm not sure
whether this is an acceptable design, so I hope I can collect some
discussion from the community.

But at least this shows things definitely can be improved.

If this approach is acceptable, I'll try to improve the following
work items:
- Arch optimizations using mm_cpumask (it seems sparc reads RSS on every
page fault; not sure if sparc will conflict with this approach).
- Currently, the kernel uses the RSS reading helpers as very cheap
function calls, which is no longer the case with this series. e.g.
*_hiwater_rss might be batched or optimized in some way to reduce the
performance impact of the slower reader (see the helper quoted after
this list).
- CPU hotplug.
- More code tweaking.
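
For reference, update_hiwater_rss() is currently just a thin wrapper
around get_mm_rss() (quoted roughly from include/linux/mm.h; the exact
form may differ between kernel versions), which is why every call site
of it now pays for the per-CPU walk in the slower reader:

#define update_hiwater_rss(mm)	do {			\
	unsigned long _rss = get_mm_rss(mm);		\
	if ((mm)->hiwater_rss < _rss)			\
		(mm)->hiwater_rss = _rss;		\
} while (0)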

Please excuse me if I've made any silly mistakes. Looking forward to
learning everyone's opinion on this.

Kairui Song (7):
mm: remove the per-task RSS counter cache
mm: move check_mm to memory.c
mm/headers: change enum order of MM_COUNTERS
mm: introduce a generic per-CPU RSS cache
mm: use fast path for pmd setting as well
mm: introduce CONFIG_ARCH_PCP_RSS_USE_CPUMASK
x86_64/tlb, mm: enable cpumask optimzation for RSS cache

Documentation/filesystems/proc.rst | 7 -
arch/Kconfig | 3 +
arch/x86/Kconfig | 1 +
arch/x86/mm/tlb.c | 5 +
fs/exec.c | 2 -
include/linux/mm.h | 34 +---
include/linux/mm_types_task.h | 49 ++++--
include/linux/sched.h | 3 -
kernel/exit.c | 5 -
kernel/fork.c | 39 +----
kernel/kthread.c | 1 -
kernel/sched/core.c | 4 +
mm/madvise.c | 7 +-
mm/memory.c | 243 +++++++++++++++++++++++++----
14 files changed, 269 insertions(+), 134 deletions(-)

--
2.35.2


2022-07-28 20:59:25

by Kairui Song

Subject: [RFC PATCH 6/7] mm: introduce CONFIG_ARCH_PCP_RSS_USE_CPUMASK

From: Kairui Song <[email protected]>

If the arch-related code can provide helpers to bind the RSS cache to
mm_cpumask, then the syncing code can just rely on that instead of doing
a full CPU synchronization. This speeds up reading and mm_exit by a lot.

Signed-off-by: Kairui Song <[email protected]>
---
arch/Kconfig | 3 ++
kernel/sched/core.c | 3 +-
mm/memory.c | 94 ++++++++++++++++++++++++++++-----------------
3 files changed, 64 insertions(+), 36 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index 71b9272acb28..8df45b6346ae 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -1403,6 +1403,9 @@ config ARCH_HAS_ELFCORE_COMPAT
config ARCH_HAS_PARANOID_L1D_FLUSH
bool

+config ARCH_PCP_RSS_USE_CPUMASK
+ bool
+
config DYNAMIC_SIGFRAME
bool

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 11df67bb52ee..6f7991caf24b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5143,7 +5143,8 @@ context_switch(struct rq *rq, struct task_struct *prev,
prepare_lock_switch(rq, next, rf);

/* Cache new active_mm */
- switch_pcp_rss_cache_no_irq(next->active_mm);
+ if (!IS_ENABLED(CONFIG_ARCH_PCP_RSS_USE_CPUMASK))
+ switch_pcp_rss_cache_no_irq(next->active_mm);

/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
diff --git a/mm/memory.c b/mm/memory.c
index 09d7d193da51..a819009aa3e0 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -188,9 +188,16 @@ unsigned long get_mm_counter(struct mm_struct *mm, int member)
{
int cpu;
long ret, update, sync_count;
+ const struct cpumask *mm_mask;

ret = atomic_long_read(&mm->rss_stat.count[member]);
- for_each_possible_cpu(cpu) {
+
+ if (IS_ENABLED(CONFIG_ARCH_PCP_RSS_USE_CPUMASK))
+ mm_mask = mm_cpumask(mm);
+ else
+ mm_mask = cpu_possible_mask;
+
+ for_each_cpu(cpu, mm_mask) {
if (READ_ONCE(per_cpu(cpu_rss_cache.mm, cpu)) != mm)
continue;
sync_count = READ_ONCE(per_cpu(cpu_rss_cache.sync_count, cpu));
@@ -217,12 +224,18 @@ unsigned long get_mm_rss(struct mm_struct *mm)
{
int cpu;
long ret, update, sync_count;
+ const struct cpumask *mm_mask;

ret = atomic_long_read(&mm->rss_stat.count[MM_FILEPAGES]),
+ atomic_long_read(&mm->rss_stat.count[MM_ANONPAGES]),
+ atomic_long_read(&mm->rss_stat.count[MM_SHMEMPAGES]);

- for_each_possible_cpu(cpu) {
+ if (IS_ENABLED(CONFIG_ARCH_PCP_RSS_USE_CPUMASK))
+ mm_mask = mm_cpumask(mm);
+ else
+ mm_mask = cpu_possible_mask;
+
+ for_each_cpu(cpu, mm_mask) {
if (READ_ONCE(per_cpu(cpu_rss_cache.mm, cpu)) != mm)
continue;
sync_count = READ_ONCE(per_cpu(cpu_rss_cache.sync_count, cpu));
@@ -266,10 +279,13 @@ void switch_pcp_rss_cache_no_irq(struct mm_struct *next_mm)
if (cpu_mm == NULL)
goto commit_done;

- /* Race with check_discard_rss_cache */
- if (cpu_mm != cmpxchg(this_cpu_ptr(&cpu_rss_cache.mm), cpu_mm,
- __pcp_rss_mm_mark(cpu_mm)))
- goto commit_done;
+ /* Arch will take care of cache invalidation */
+ if (!IS_ENABLED(CONFIG_ARCH_PCP_RSS_USE_CPUMASK)) {
+ /* Race with check_discard_rss_cache */
+ if (cpu_mm != cmpxchg(this_cpu_ptr(&cpu_rss_cache.mm), cpu_mm,
+ __pcp_rss_mm_mark(cpu_mm)))
+ goto commit_done;
+ }

for (int i = 0; i < NR_MM_COUNTERS; i++) {
count = this_cpu_read(cpu_rss_cache.count[i]);
@@ -328,46 +344,54 @@ static void check_discard_rss_cache(struct mm_struct *mm)
long cached_count[NR_MM_COUNTERS] = { 0 };
struct mm_struct *cpu_mm;

- /* Invalidate the RSS cache on every CPU */
- for_each_possible_cpu(cpu) {
- cpu_mm = READ_ONCE(per_cpu(cpu_rss_cache.mm, cpu));
- if (__pcp_rss_mm_unmark(cpu_mm) != mm)
- continue;
-
- /*
- * If not being flusehd, try read-in the counter and mark it NULL,
- * once cache's mm is set NULL, counter are considered invalided
- */
- if (cpu_mm != __pcp_rss_mm_mark(cpu_mm)) {
- long count[NR_MM_COUNTERS];
-
- for (int i = 0; i < NR_MM_COUNTERS; i++)
- count[i] = READ_ONCE(per_cpu(cpu_rss_cache.count[i], cpu));
+ /* Arch will take care of cache invalidation */
+ if (!IS_ENABLED(CONFIG_ARCH_PCP_RSS_USE_CPUMASK)) {
+ /* Invalidate the RSS cache on every CPU */
+ for_each_possible_cpu(cpu) {
+ cpu_mm = READ_ONCE(per_cpu(cpu_rss_cache.mm, cpu));
+ if (__pcp_rss_mm_unmark(cpu_mm) != mm)
+ continue;

/*
- * If successfully set to NULL, the owner CPU is not flushing it, counters
- * are uncommiteed and untouched during this period, since a dying mm won't
- * be accouted anymore
+ * If not being flushed, try to read in the counter and mark it NULL;
+ * once the cache's mm is set to NULL, counters are considered invalidated.
*/
- cpu_mm = cmpxchg(&per_cpu(cpu_rss_cache.mm, cpu), mm, NULL);
- if (cpu_mm == mm) {
+ if (cpu_mm != __pcp_rss_mm_mark(cpu_mm)) {
+ long count[NR_MM_COUNTERS];
+
for (int i = 0; i < NR_MM_COUNTERS; i++)
- cached_count[i] += count[i];
- continue;
+ count[i] = READ_ONCE(per_cpu(cpu_rss_cache.count[i], cpu));
+
+ /*
+ * If successfully set to NULL, the owner CPU is not flushing it,
+ * counters are uncommitted and untouched during this period, since
+ * a dying mm won't be accounted anymore.
+ */
+ cpu_mm = cmpxchg(&per_cpu(cpu_rss_cache.mm, cpu), mm, NULL);
+ if (cpu_mm == mm) {
+ for (int i = 0; i < NR_MM_COUNTERS; i++)
+ cached_count[i] += count[i];
+ continue;
+ }
}
- }

- /* It's being flushed, just busy wait as the critial section is really short */
- do {
- cpu_relax();
- cpu_mm = READ_ONCE(per_cpu(cpu_rss_cache.mm, cpu));
- } while (cpu_mm == __pcp_rss_mm_mark(mm));
+ /*
+ * It's being flushed, just busy wait as the critical section
+ * is really short.
+ */
+ do {
+ cpu_relax();
+ cpu_mm = READ_ONCE(per_cpu(cpu_rss_cache.mm, cpu));
+ } while (cpu_mm == __pcp_rss_mm_mark(mm));
+ }
}

for (int i = 0; i < NR_MM_COUNTERS; i++) {
long val = atomic_long_read(&mm->rss_stat.count[i]);

- val += cached_count[i];
+ if (!IS_ENABLED(CONFIG_ARCH_PCP_RSS_USE_CPUMASK)) {
+ val += cached_count[i];
+ }

if (unlikely(val)) {
pr_alert("BUG: Bad rss-counter state mm:%p type:%s val:%ld\n",
--
2.35.2