2022-07-29 19:39:07

by Mathieu Desnoyers

Subject: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions

Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.

The NUMA node ID field allows implementing a faster getcpu(2) in libc.

The virtual cpu id allows ideal scaling (down or up) of user-space
per-cpu data structures. The virtual cpu ids allocated within a memory
space are tracked by the scheduler, which takes into account the number
of concurrently running threads, and thus implicitly considers the cpu
affinity, the cpusets applying to those threads, and the number of
logical cores on the system.
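
As a rough sketch of the intended user-space usage (not part of this
series; NR_VCPUS_MAX and the rseq_area pointer to the registered struct
rseq are placeholders, and vm_vcpu_id is the field introduced later in
the series), per-vcpu data can be indexed by the virtual cpu id so that
only the low array entries are ever touched:

/* Sketch only: assumes <stdint.h> and a <linux/rseq.h> with the new field. */
#define NR_VCPUS_MAX	64		/* placeholder sizing bound */

struct percpu_count {
	intptr_t count;
} __attribute__((aligned(128)));	/* keep entries on distinct cache lines */

static struct percpu_count counts[NR_VCPUS_MAX];

static void percpu_count_add(volatile struct rseq *rseq_area, intptr_t v)
{
	uint32_t vcpu = rseq_area->vm_vcpu_id;	/* published by the kernel */

	/*
	 * A real user would perform this update within an rseq critical
	 * section so that preemption or migration restarts the sequence.
	 */
	counts[vcpu].count += v;
}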

This series is based on the v5.18.13 tag.

Thanks,

Mathieu

Mathieu Desnoyers (23):
rseq: Introduce feature size and alignment ELF auxiliary vector
entries
rseq: Introduce extensible rseq ABI
rseq: extend struct rseq with numa node id
selftests/rseq: Use ELF auxiliary vector for extensible rseq
selftests/rseq: Implement rseq numa node id field selftest
lib: invert _find_next_bit source arguments
lib: implement find_{first,next}_{zero,one}_and_zero_bit
cpumask: implement cpumask_{first,next}_{zero,one}_and_zero
sched: Introduce per memory space current virtual cpu id
rseq: extend struct rseq with per memory space vcpu id
selftests/rseq: Remove RSEQ_SKIP_FASTPATH code
selftests/rseq: Implement rseq vm_vcpu_id field support
selftests/rseq: x86: Template memory ordering and percpu access mode
selftests/rseq: arm: Template memory ordering and percpu access mode
selftests/rseq: arm64: Template memory ordering and percpu access mode
selftests/rseq: mips: Template memory ordering and percpu access mode
selftests/rseq: ppc: Template memory ordering and percpu access mode
selftests/rseq: s390: Template memory ordering and percpu access mode
selftests/rseq: riscv: Template memory ordering and percpu access mode
selftests/rseq: basic percpu ops vm_vcpu_id test
selftests/rseq: parametrized vm_vcpu_id test
selftests/rseq: x86: Implement rseq_load_u32_u32
selftests/rseq: Implement numa node id vs vm_vcpu_id invariant test

fs/binfmt_elf.c | 5 +
fs/exec.c | 4 +
include/linux/cpumask.h | 86 ++
include/linux/find.h | 123 +-
include/linux/mm.h | 25 +
include/linux/mm_types.h | 111 ++
include/linux/sched.h | 9 +
include/trace/events/rseq.h | 4 +-
include/uapi/linux/auxvec.h | 2 +
include/uapi/linux/rseq.h | 22 +
init/Kconfig | 4 +
kernel/fork.c | 15 +-
kernel/ptrace.c | 2 +-
kernel/rseq.c | 60 +-
kernel/sched/core.c | 82 ++
kernel/sched/deadline.c | 3 +
kernel/sched/debug.c | 13 +
kernel/sched/fair.c | 1 +
kernel/sched/rt.c | 2 +
kernel/sched/sched.h | 357 +++++
kernel/sched/stats.c | 16 +-
lib/find_bit.c | 17 +-
tools/include/linux/find.h | 9 +-
tools/lib/find_bit.c | 17 +-
tools/testing/selftests/rseq/.gitignore | 5 +
tools/testing/selftests/rseq/Makefile | 20 +-
.../testing/selftests/rseq/basic_numa_test.c | 117 ++
.../selftests/rseq/basic_percpu_ops_test.c | 46 +-
tools/testing/selftests/rseq/basic_test.c | 4 +
tools/testing/selftests/rseq/compiler.h | 6 +
tools/testing/selftests/rseq/param_test.c | 152 ++-
tools/testing/selftests/rseq/rseq-abi.h | 22 +
tools/testing/selftests/rseq/rseq-arm-bits.h | 505 +++++++
tools/testing/selftests/rseq/rseq-arm.h | 701 +---------
.../testing/selftests/rseq/rseq-arm64-bits.h | 392 ++++++
tools/testing/selftests/rseq/rseq-arm64.h | 520 +------
.../testing/selftests/rseq/rseq-bits-reset.h | 10 +
.../selftests/rseq/rseq-bits-template.h | 39 +
tools/testing/selftests/rseq/rseq-mips-bits.h | 462 +++++++
tools/testing/selftests/rseq/rseq-mips.h | 646 +--------
tools/testing/selftests/rseq/rseq-ppc-bits.h | 454 +++++++
tools/testing/selftests/rseq/rseq-ppc.h | 617 +--------
.../testing/selftests/rseq/rseq-riscv-bits.h | 410 ++++++
tools/testing/selftests/rseq/rseq-riscv.h | 529 +-------
tools/testing/selftests/rseq/rseq-s390-bits.h | 474 +++++++
tools/testing/selftests/rseq/rseq-s390.h | 495 +------
tools/testing/selftests/rseq/rseq-skip.h | 65 -
tools/testing/selftests/rseq/rseq-x86-bits.h | 1036 ++++++++++++++
tools/testing/selftests/rseq/rseq-x86.h | 1193 +----------------
tools/testing/selftests/rseq/rseq.c | 86 +-
tools/testing/selftests/rseq/rseq.h | 229 +++-
.../testing/selftests/rseq/run_param_test.sh | 5 +
52 files changed, 5536 insertions(+), 4693 deletions(-)
create mode 100644 tools/testing/selftests/rseq/basic_numa_test.c
create mode 100644 tools/testing/selftests/rseq/rseq-arm-bits.h
create mode 100644 tools/testing/selftests/rseq/rseq-arm64-bits.h
create mode 100644 tools/testing/selftests/rseq/rseq-bits-reset.h
create mode 100644 tools/testing/selftests/rseq/rseq-bits-template.h
create mode 100644 tools/testing/selftests/rseq/rseq-mips-bits.h
create mode 100644 tools/testing/selftests/rseq/rseq-ppc-bits.h
create mode 100644 tools/testing/selftests/rseq/rseq-riscv-bits.h
create mode 100644 tools/testing/selftests/rseq/rseq-s390-bits.h
delete mode 100644 tools/testing/selftests/rseq/rseq-skip.h
create mode 100644 tools/testing/selftests/rseq/rseq-x86-bits.h

--
2.17.1


2022-07-29 19:39:24

by Mathieu Desnoyers

Subject: [PATCH v3 06/23] lib: invert _find_next_bit source arguments

Apply bit-invert operations before the AND operation in _find_next_bit.
This allows AND operations on combined bitmasks in which we search either
for one or zero bits, e.g.: find the first bit which is both zero in one
bitmask AND one in the second bitmask.

The existing find-first-zero-bit use does not pass a second bitmask, so
whether the inversion is performed before or after the AND operation does
not matter.
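
For illustration only (not part of the change), the per-word logic
enabled by this patch can be sketched in plain C: with invert_src1 == 0UL
and invert_src2 == ~0UL it returns the first bit position which is one in
word1 AND zero in word2.

/* Word-level sketch of (word1 ^ invert_src1) & (word2 ^ invert_src2). */
static unsigned long first_match_bit(unsigned long word1, unsigned long word2,
				     unsigned long invert_src1,
				     unsigned long invert_src2)
{
	unsigned long tmp = (word1 ^ invert_src1) & (word2 ^ invert_src2);

	return tmp ? (unsigned long)__builtin_ctzl(tmp) : 8 * sizeof(unsigned long);
}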

Signed-off-by: Mathieu Desnoyers <[email protected]>
---
include/linux/find.h | 13 +++++++------
lib/find_bit.c | 17 ++++++++---------
tools/include/linux/find.h | 9 +++++----
tools/lib/find_bit.c | 17 ++++++++---------
4 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/include/linux/find.h b/include/linux/find.h
index 5bb6db213bcb..41941cb9cad7 100644
--- a/include/linux/find.h
+++ b/include/linux/find.h
@@ -10,7 +10,8 @@

extern unsigned long _find_next_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le);
+ unsigned long start, unsigned long invert_src1,
+ unsigned long invert_src2, unsigned long le);
extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
extern unsigned long _find_first_and_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long size);
@@ -41,7 +42,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+ return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 0);
}
#endif

@@ -71,7 +72,7 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+ return _find_next_bit(addr1, addr2, size, offset, 0UL, 0UL, 0);
}
#endif

@@ -99,7 +100,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
return val == ~0UL ? size : ffz(val);
}

- return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+ return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 0);
}
#endif

@@ -247,7 +248,7 @@ unsigned long find_next_zero_bit_le(const void *addr, unsigned
return val == ~0UL ? size : ffz(val);
}

- return _find_next_bit(addr, NULL, size, offset, ~0UL, 1);
+ return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 1);
}
#endif

@@ -266,7 +267,7 @@ unsigned long find_next_bit_le(const void *addr, unsigned
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr, NULL, size, offset, 0UL, 1);
+ return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 1);
}
#endif

diff --git a/lib/find_bit.c b/lib/find_bit.c
index 1b8e4b2a9cba..73e78565e691 100644
--- a/lib/find_bit.c
+++ b/lib/find_bit.c
@@ -25,23 +25,23 @@
/*
* This is a common helper function for find_next_bit, find_next_zero_bit, and
* find_next_and_bit. The differences are:
- * - The "invert" argument, which is XORed with each fetched word before
- * searching it for one bits.
+ * - The "invert_src1" and "invert_src2" arguments, which are XORed to
+ * each source word before applying the 'and' operator.
* - The optional "addr2", which is anded with "addr1" if present.
*/
unsigned long _find_next_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le)
+ unsigned long start, unsigned long invert_src1,
+ unsigned long invert_src2, unsigned long le)
{
unsigned long tmp, mask;

if (unlikely(start >= nbits))
return nbits;

- tmp = addr1[start / BITS_PER_LONG];
+ tmp = addr1[start / BITS_PER_LONG] ^ invert_src1;
if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
+ tmp &= addr2[start / BITS_PER_LONG] ^ invert_src2;

/* Handle 1st word. */
mask = BITMAP_FIRST_WORD_MASK(start);
@@ -57,10 +57,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
if (start >= nbits)
return nbits;

- tmp = addr1[start / BITS_PER_LONG];
+ tmp = addr1[start / BITS_PER_LONG] ^ invert_src1;
if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
+ tmp &= addr2[start / BITS_PER_LONG] ^ invert_src2;
}

if (le)
diff --git a/tools/include/linux/find.h b/tools/include/linux/find.h
index 47e2bd6c5174..5ab0c95086ad 100644
--- a/tools/include/linux/find.h
+++ b/tools/include/linux/find.h
@@ -10,7 +10,8 @@

extern unsigned long _find_next_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le);
+ unsigned long start, unsigned long invert_src1,
+ unsigned long invert_src2, unsigned long le);
extern unsigned long _find_first_bit(const unsigned long *addr, unsigned long size);
extern unsigned long _find_first_and_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long size);
@@ -41,7 +42,7 @@ unsigned long find_next_bit(const unsigned long *addr, unsigned long size,
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr, NULL, size, offset, 0UL, 0);
+ return _find_next_bit(addr, NULL, size, offset, 0UL, 0UL, 0);
}
#endif

@@ -71,7 +72,7 @@ unsigned long find_next_and_bit(const unsigned long *addr1,
return val ? __ffs(val) : size;
}

- return _find_next_bit(addr1, addr2, size, offset, 0UL, 0);
+ return _find_next_bit(addr1, addr2, size, offset, 0UL, 0UL, 0);
}
#endif

@@ -99,7 +100,7 @@ unsigned long find_next_zero_bit(const unsigned long *addr, unsigned long size,
return val == ~0UL ? size : ffz(val);
}

- return _find_next_bit(addr, NULL, size, offset, ~0UL, 0);
+ return _find_next_bit(addr, NULL, size, offset, ~0UL, 0UL, 0);
}
#endif

diff --git a/tools/lib/find_bit.c b/tools/lib/find_bit.c
index ba4b8d94e004..4176232de7f9 100644
--- a/tools/lib/find_bit.c
+++ b/tools/lib/find_bit.c
@@ -24,13 +24,14 @@
/*
* This is a common helper function for find_next_bit, find_next_zero_bit, and
* find_next_and_bit. The differences are:
- * - The "invert" argument, which is XORed with each fetched word before
- * searching it for one bits.
+ * - The "invert_src1" and "invert_src2" arguments, which are XORed to
+ * each source word before applying the 'and' operator.
* - The optional "addr2", which is anded with "addr1" if present.
*/
unsigned long _find_next_bit(const unsigned long *addr1,
const unsigned long *addr2, unsigned long nbits,
- unsigned long start, unsigned long invert, unsigned long le)
+ unsigned long start, unsigned long invert_src1,
+ unsigned long invert_src2, unsigned long le)
{
unsigned long tmp, mask;
(void) le;
@@ -38,10 +39,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
if (unlikely(start >= nbits))
return nbits;

- tmp = addr1[start / BITS_PER_LONG];
+ tmp = addr1[start / BITS_PER_LONG] ^ invert_src1;
if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
+ tmp &= addr2[start / BITS_PER_LONG] ^ invert_src2;

/* Handle 1st word. */
mask = BITMAP_FIRST_WORD_MASK(start);
@@ -64,10 +64,9 @@ unsigned long _find_next_bit(const unsigned long *addr1,
if (start >= nbits)
return nbits;

- tmp = addr1[start / BITS_PER_LONG];
+ tmp = addr1[start / BITS_PER_LONG] ^ invert_src1;
if (addr2)
- tmp &= addr2[start / BITS_PER_LONG];
- tmp ^= invert;
+ tmp &= addr2[start / BITS_PER_LONG] ^ invert_src2;
}

#if (0)
--
2.17.1

2022-07-29 19:42:11

by Mathieu Desnoyers

Subject: [PATCH v3 09/23] sched: Introduce per memory space current virtual cpu id

This feature allows the scheduler to expose a current virtual cpu id
to user-space. This virtual cpu id is within the possible cpus range,
and is temporarily (and uniquely) assigned while threads are actively
running within a memory space. If a memory space has fewer threads than
cores, or is restricted to running concurrently on only a few cores
through scheduler affinity or cgroup cpusets, the virtual cpu ids will be
values close to 0, allowing efficient use of user-space memory for
per-cpu data structures.

The vcpu_ids are NUMA-aware. On NUMA systems, when a vcpu_id is observed
by user-space to be associated with a NUMA node, it is guaranteed to
never change NUMA node unless a kernel-level NUMA configuration change
happens.

This feature is meant to be exposed by a new rseq thread area field.

The primary purpose of this feature is to do the heavy-lifting needed
by memory allocators to allow them to use per-cpu data structures
efficiently in the following situations:

- Single-threaded applications,
- Multi-threaded applications on large systems (many cores) with limited
cpu affinity mask,
- Multi-threaded applications on large systems (many cores) with
restricted cgroup cpuset per container,
- Processes using memory from many NUMA nodes.

One of the key concerns raised by scheduler maintainers is the overhead
associated with additional atomic operations in the scheduler fast path.
This is why the following optimizations are implemented:

1) On context switch between threads belonging to the same memory space,
transfer the mm_vcpu_id from prev to next without any atomic ops.
This takes care of use-cases involving frequent context switch
between threads belonging to the same memory space.

2) Threads belonging to a memory space with a single user (mm_users==1)
can be assigned mm_vcpu_id=0 without any atomic operation on the
scheduler fast-path. In non-NUMA, when a memory space goes from
single to multi-threaded, lazily allocate the vcpu_id 0 in the mm
vcpu mask. This takes care of all single-threaded use-cases
involving context switching between threads belonging to different
memory spaces.

With NUMA, the single-threaded memory space scenario is still
special-cased to eliminate all atomic operations on the fast path,
but rather than returning vcpu_id=0, return the current numa_node_id
to allow single-threaded memory spaces to keep good numa locality.
On systems where the number of cpu ids is lower than the number of
numa node ids, pick the first cpu in the node cpumask rather than the
node ID.

3) Introduce a per-runqueue cache containing { mm, vcpu_id } entries.
Keep track of recently allocated vcpu_ids for each mm rather than
freeing them immediately. This eliminates most atomic ops when
context switching back and forth between threads belonging to
different memory spaces in multi-threaded scenarios (many processes,
each with many threads).
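
For readability, here is a simplified sketch of the context-switch
handling implemented by switch_mm_vcpu() in the kernel/sched/sched.h hunk
below (the PF_EXITING/PF_KTHREAD, vcpu_mm_active and NUMA-node checks are
omitted):

static inline void switch_mm_vcpu_sketch(struct rq *rq, struct task_struct *prev,
					 struct task_struct *next)
{
	if (next->mm) {
		if (prev->mm == next->mm) {
			/* Optimization (1): hand the vcpu id over, no atomics. */
			next->mm_vcpu = prev->mm_vcpu;
		} else {
			/*
			 * Single-threaded fast path (2) or per-runqueue cache
			 * lookup (3), falling back to bitmap allocation.
			 */
			next->mm_vcpu = mm_vcpu_get(rq, next);
		}
	}
	if (prev->mm)
		prev->mm_vcpu = -1;
}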

The credit for the vcpu_id idea goes to Paul Turner (Google). This
feature is implemented based on discussions with Paul Turner and
Peter Oskolkov (Google), but I took the liberty of implementing the
scheduler fast-path optimizations and my own NUMA-awareness scheme.
Rumor has it that Google has been running an rseq vcpu_id extension
internally in production for a year. The tcmalloc source code indeed has
comments hinting at a vcpu_id prototype extension to the rseq system
call [1].

schedstats:

* perf bench sched messaging (single instance, multi-process):

On sched-switch:
single-threaded vcpu-id: 99.985 %
transfer between threads: 0 %
runqueue cache hit: 0.015 %
runqueue cache eviction (bit-clear): 0 %
runqueue cache discard (bit-clear): 0 %
vcpu-id allocation (bit-set): 0 %

On release mm:
vcpu-id remove (bit-clear): 0 %

On migration:
vcpu-id remove (bit-clear): 0 %

* perf bench sched messaging -t (single instance, multi-thread):

On sched-switch:
single-threaded vcpu-id: 0.128 %
transfer between threads: 98.260 %
runqueue cache hit: 1.075 %
runqueue cache eviction (bit-clear): 0.001 %
runqueue cache discard (bit-clear): 0 %
vcpu-id allocation (bit-set): 0.269 %

On release mm:
vcpu-id remove (bit-clear): 0.161 %

On migration:
vcpu-id remove (bit-clear): 0.107 %

* perf bench sched messaging -t (two instances, multi-thread):

On sched-switch:
single-threaded vcpu-id: 0.081 %
transfer between threads: 89.512 %
runqueue cache hit: 9.659 %
runqueue cache eviction (bit-clear): 0.003 %
runqueue cache discard (bit-clear): 0 %
vcpu-id allocation (bit-set): 0.374 %

On release mm:
vcpu-id remove (bit-clear): 0.243 %

On migration:
vcpu-id remove (bit-clear): 0.129 %

* perf bench sched pipe (one instance, multi-process):

On sched-switch:
single-threaded vcpu-id: 99.993 %
transfer between threads: 0.001 %
runqueue cache hit: 0.002 %
runqueue cache eviction (bit-clear): 0 %
runqueue cache discard (bit-clear): 0 %
vcpu-id allocation (bit-set): 0.002 %

On release mm:
vcpu-id remove (bit-clear): 0 %

On migration:
vcpu-id remove (bit-clear): 0.002 %

[1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26

Signed-off-by: Mathieu Desnoyers <[email protected]>
---
fs/exec.c | 4 +
include/linux/mm.h | 25 +++
include/linux/mm_types.h | 111 ++++++++++++
include/linux/sched.h | 5 +
init/Kconfig | 4 +
kernel/fork.c | 15 +-
kernel/sched/core.c | 82 +++++++++
kernel/sched/deadline.c | 3 +
kernel/sched/debug.c | 13 ++
kernel/sched/fair.c | 1 +
kernel/sched/rt.c | 2 +
kernel/sched/sched.h | 357 +++++++++++++++++++++++++++++++++++++++
kernel/sched/stats.c | 16 +-
13 files changed, 635 insertions(+), 3 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 5a75e92b1a0a..dcd4edd565c4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1011,6 +1011,10 @@ static int exec_mmap(struct mm_struct *mm)
active_mm = tsk->active_mm;
tsk->active_mm = mm;
tsk->mm = mm;
+ mm_init_vcpu_users(mm);
+ mm_init_vcpumask(mm);
+ mm_init_node_vcpumask(mm);
+ sched_vcpu_activate_mm(tsk, mm);
/*
* This prevents preemption while active_mm is being loaded and
* it and mm are being updated, which could cause problems for
diff --git a/include/linux/mm.h b/include/linux/mm.h
index da08cce2a9fa..cea172092ddf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3392,4 +3392,29 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
}
#endif

+#ifdef CONFIG_SCHED_MM_VCPU
+void sched_vcpu_release_mm(struct task_struct *t, struct mm_struct *mm);
+void sched_vcpu_activate_mm(struct task_struct *t, struct mm_struct *mm);
+void sched_vcpu_get_mm(struct task_struct *t, struct mm_struct *mm);
+void sched_vcpu_dup_mm(struct task_struct *t, struct mm_struct *mm);
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+ return t->mm_vcpu;
+}
+#else
+static inline void sched_vcpu_release_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline void sched_vcpu_activate_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline void sched_vcpu_get_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline void sched_vcpu_dup_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+ /*
+ * Use the processor id as a fall-back when the mm vcpu feature is
+ * disabled. This provides functional per-cpu data structure accesses
+ * in user-space, although it won't provide the memory usage benefits.
+ */
+ return raw_smp_processor_id();
+}
+#endif
+
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8834e38c06a4..7854ab5b51f4 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -17,6 +17,7 @@
#include <linux/page-flags-layout.h>
#include <linux/workqueue.h>
#include <linux/seqlock.h>
+#include <linux/nodemask.h>

#include <asm/mmu.h>

@@ -524,6 +525,20 @@ struct mm_struct {
*/
atomic_t mm_count;

+#ifdef CONFIG_SCHED_MM_VCPU
+ /**
+ * @mm_vcpu_users: The number of references to &struct mm_struct
+ * from user-space threads.
+ *
+ * Initialized to 1 for the first thread with a reference to
+ * the mm. Incremented for each thread getting a reference to the
+ * mm, and decremented on mm release from user-space threads.
+ * Used to enable single-threaded mm_vcpu accounting (when == 1).
+ */
+
+ atomic_t mm_vcpu_users;
+#endif
+
#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* PTE page table pages */
#endif
@@ -681,6 +696,102 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
return (struct cpumask *)&mm->cpu_bitmap;
}

+#ifdef CONFIG_SCHED_MM_VCPU
+/* Future-safe accessor for struct mm_struct's vcpu_mask. */
+static inline cpumask_t *mm_vcpumask(struct mm_struct *mm)
+{
+ unsigned long vcpu_bitmap = (unsigned long)mm;
+
+ vcpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+ /* Skip cpu_bitmap */
+ vcpu_bitmap += cpumask_size();
+ return (struct cpumask *)vcpu_bitmap;
+}
+
+static inline void mm_init_vcpu_users(struct mm_struct *mm)
+{
+ atomic_set(&mm->mm_vcpu_users, 1);
+}
+
+static inline void mm_init_vcpumask(struct mm_struct *mm)
+{
+ cpumask_clear(mm_vcpumask(mm));
+}
+
+static inline unsigned int mm_vcpumask_size(void)
+{
+ return cpumask_size();
+}
+
+#else
+static inline cpumask_t *mm_vcpumask(struct mm_struct *mm)
+{
+ return NULL;
+}
+
+static inline void mm_init_vcpu_users(struct mm_struct *mm) { }
+static inline void mm_init_vcpumask(struct mm_struct *mm) { }
+
+static inline unsigned int mm_vcpumask_size(void)
+{
+ return 0;
+}
+#endif
+
+#if defined(CONFIG_SCHED_MM_VCPU) && defined(CONFIG_NUMA)
+/*
+ * Layout of node vcpumasks:
+ * - node_alloc vcpumask: cpumask tracking which vcpu_id were
+ * allocated (across nodes) in this
+ * memory space.
+ * - node vcpumask[nr_node_ids]: per-node cpumask tracking which vcpu_id
+ * were allocated in this memory space.
+ */
+static inline cpumask_t *mm_node_alloc_vcpumask(struct mm_struct *mm)
+{
+ unsigned long vcpu_bitmap = (unsigned long)mm_vcpumask(mm);
+
+ /* Skip mm_vcpumask */
+ vcpu_bitmap += cpumask_size();
+ return (struct cpumask *)vcpu_bitmap;
+}
+
+static inline cpumask_t *mm_node_vcpumask(struct mm_struct *mm, unsigned int node)
+{
+ unsigned long vcpu_bitmap = (unsigned long)mm_node_alloc_vcpumask(mm);
+
+ /* Skip node alloc vcpumask */
+ vcpu_bitmap += cpumask_size();
+ vcpu_bitmap += node * cpumask_size();
+ return (struct cpumask *)vcpu_bitmap;
+}
+
+static inline void mm_init_node_vcpumask(struct mm_struct *mm)
+{
+ unsigned int node;
+
+ if (num_possible_nodes() == 1)
+ return;
+ cpumask_clear(mm_node_alloc_vcpumask(mm));
+ for (node = 0; node < nr_node_ids; node++)
+ cpumask_clear(mm_node_vcpumask(mm, node));
+}
+
+static inline unsigned int mm_node_vcpumask_size(void)
+{
+ if (num_possible_nodes() == 1)
+ return 0;
+ return (nr_node_ids + 1) * cpumask_size();
+}
+#else
+static inline void mm_init_node_vcpumask(struct mm_struct *mm) { }
+
+static inline unsigned int mm_node_vcpumask_size(void)
+{
+ return 0;
+}
+#endif
+
struct mmu_gather;
extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 68b23937b4a5..14f102c2fcfd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1299,6 +1299,11 @@ struct task_struct {
unsigned long rseq_event_mask;
#endif

+#ifdef CONFIG_SCHED_MM_VCPU
+ int mm_vcpu; /* Current vcpu in mm */
+ int vcpu_mm_active;
+#endif
+
struct tlbflush_unmap_batch tlb_ubc;

union {
diff --git a/init/Kconfig b/init/Kconfig
index fa63cc019ebf..457211f1ca06 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1041,6 +1041,10 @@ config RT_GROUP_SCHED

endif #CGROUP_SCHED

+config SCHED_MM_VCPU
+ def_bool y
+ depends on SMP && RSEQ
+
config UCLAMP_TASK_GROUP
bool "Utilization clamping per group of tasks"
depends on CGROUP_SCHED
diff --git a/kernel/fork.c b/kernel/fork.c
index 0d8abfb9e0f4..ef5a72d87cee 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1046,6 +1046,11 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
#ifdef CONFIG_MEMCG
tsk->active_memcg = NULL;
#endif
+
+#ifdef CONFIG_SCHED_MM_VCPU
+ tsk->mm_vcpu = 0;
+ tsk->vcpu_mm_active = 0;
+#endif
return tsk;

free_stack:
@@ -1149,6 +1154,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
goto fail_nocontext;

mm->user_ns = get_user_ns(user_ns);
+ mm_init_vcpu_users(mm);
+ mm_init_vcpumask(mm);
+ mm_init_node_vcpumask(mm);
return mm;

fail_nocontext:
@@ -1450,6 +1458,8 @@ static int wait_for_vfork_done(struct task_struct *child,
*/
static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
{
+ sched_vcpu_release_mm(tsk, mm);
+
uprobe_free_utask(tsk);

/* Get rid of any cached register state */
@@ -1569,10 +1579,12 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
if (clone_flags & CLONE_VM) {
mmget(oldmm);
mm = oldmm;
+ sched_vcpu_get_mm(tsk, mm);
} else {
mm = dup_mm(tsk, current->mm);
if (!mm)
return -ENOMEM;
+ sched_vcpu_dup_mm(tsk, mm);
}

tsk->mm = mm;
@@ -3003,7 +3015,8 @@ void __init proc_caches_init(void)
* dynamically sized based on the maximum CPU number this system
* can have, taking hotplug into account (nr_cpu_ids).
*/
- mm_size = sizeof(struct mm_struct) + cpumask_size();
+ mm_size = sizeof(struct mm_struct) + cpumask_size() + mm_vcpumask_size() +
+ mm_node_vcpumask_size();

mm_cachep = kmem_cache_create_usercopy("mm_struct",
mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index dd11daa7a84b..f7bd328ebb24 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2331,6 +2331,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
lockdep_assert_rq_held(rq);

deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+ rq_vcpu_cache_remove_mm_locked(rq, p->mm, false);
set_task_cpu(p, new_cpu);
rq_unlock(rq, rf);

@@ -2518,6 +2519,7 @@ int push_cpu_stop(void *arg)
// XXX validate p is still the highest prio task
if (task_rq(p) == rq) {
deactivate_task(rq, p, 0);
+ rq_vcpu_cache_remove_mm_locked(rq, p->mm, false);
set_task_cpu(p, lowest_rq->cpu);
activate_task(lowest_rq, p, 0);
resched_curr(lowest_rq);
@@ -3157,6 +3159,7 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
rq_pin_lock(dst_rq, &drf);

deactivate_task(src_rq, p, 0);
+ rq_vcpu_cache_remove_mm_locked(src_rq, p->mm, false);
set_task_cpu(p, cpu);
activate_task(dst_rq, p, 0);
check_preempt_curr(dst_rq, p, 0);
@@ -3780,6 +3783,8 @@ static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags
p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);

WRITE_ONCE(rq->ttwu_pending, 1);
+ if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
+ rq_vcpu_cache_remove_mm(task_rq(p), p->mm, false);
__smp_call_single_queue(cpu, &p->wake_entry.llist);
}

@@ -4189,6 +4194,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)

wake_flags |= WF_MIGRATED;
psi_ttwu_dequeue(p);
+ rq_vcpu_cache_remove_mm(task_rq(p), p->mm, false);
set_task_cpu(p, cpu);
}
#else
@@ -4912,6 +4918,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
rseq_preempt(prev);
+ switch_mm_vcpu(rq, prev, next);
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
@@ -6043,6 +6050,7 @@ static bool try_steal_cookie(int this, int that)
goto next;

deactivate_task(src, p, 0);
+ rq_vcpu_cache_remove_mm_locked(src, p->mm, false);
set_task_cpu(p, this);
activate_task(dst, p, 0);

@@ -11109,3 +11117,77 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
{
trace_sched_update_nr_running_tp(rq, count);
}
+
+#ifdef CONFIG_SCHED_MM_VCPU
+void sched_vcpu_release_mm(struct task_struct *t, struct mm_struct *mm)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (!mm)
+ return;
+ WARN_ON_ONCE(t != current);
+ preempt_disable();
+ rq = this_rq();
+ rq_lock_irqsave(rq, &rf);
+ t->vcpu_mm_active = 0;
+ atomic_dec(&mm->mm_vcpu_users);
+ rq_vcpu_cache_remove_mm_locked(rq, mm, true);
+ rq_unlock_irqrestore(rq, &rf);
+ t->mm_vcpu = -1;
+ preempt_enable();
+}
+
+void sched_vcpu_activate_mm(struct task_struct *t, struct mm_struct *mm)
+{
+ WARN_ON_ONCE(t != current);
+ preempt_disable();
+ t->vcpu_mm_active = 1;
+ /* No need to reserve in cpumask because single-threaded. */
+ t->mm_vcpu = mm_vcpu_first_node_vcpu(numa_node_id());
+ preempt_enable();
+}
+
+void sched_vcpu_get_mm(struct task_struct *t, struct mm_struct *mm)
+{
+ int vcpu, mm_vcpu_users;
+ struct rq_flags rf;
+ struct rq *rq;
+
+ preempt_disable();
+ rq = this_rq();
+ t->vcpu_mm_active = 1;
+ mm_vcpu_users = atomic_read(&mm->mm_vcpu_users);
+ atomic_inc(&mm->mm_vcpu_users);
+ t->mm_vcpu = -1;
+ vcpu = current->mm_vcpu;
+ rq_lock_irqsave(rq, &rf);
+ /* On transition from 1 to 2 mm users, reserve vcpu ids. */
+ if (mm_vcpu_users == 1) {
+ mm_vcpu_reserve_nodes(mm);
+ rq_vcpu_cache_remove_mm_locked(rq, mm, true);
+ current->mm_vcpu = __mm_vcpu_get(rq, mm);
+ rq_vcpu_cache_add(rq, mm, current->mm_vcpu);
+ /*
+ * __mm_vcpu_get could get a different vcpu after going
+ * multi-threaded, then back single-threaded, then
+ * multi-threaded on a NUMA configuration using the first CPU
+ * matching the NUMA node as single-threaded vcpu, with
+ * leftover vcpu_id matching the NUMA node set from when this
+ * task was multithreaded.
+ */
+ if (current->mm_vcpu != vcpu)
+ rseq_set_notify_resume(current);
+ }
+ rq_unlock_irqrestore(rq, &rf);
+ preempt_enable();
+}
+
+void sched_vcpu_dup_mm(struct task_struct *t, struct mm_struct *mm)
+{
+ preempt_disable();
+ t->vcpu_mm_active = 1;
+ t->mm_vcpu = -1;
+ preempt_enable();
+}
+#endif
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index b61281d10458..20bbb56d554b 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -693,6 +693,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
raw_spin_unlock(&dl_b->lock);

+ rq_vcpu_cache_remove_mm_locked(rq, p->mm, false);
set_task_cpu(p, later_rq->cpu);
double_unlock_balance(later_rq, rq);

@@ -2319,6 +2320,7 @@ static int push_dl_task(struct rq *rq)
}

deactivate_task(rq, next_task, 0);
+ rq_vcpu_cache_remove_mm_locked(rq, next_task->mm, false);
set_task_cpu(next_task, later_rq->cpu);

/*
@@ -2415,6 +2417,7 @@ static void pull_dl_task(struct rq *this_rq)
push_task = get_push_task(src_rq);
} else {
deactivate_task(src_rq, p, 0);
+ rq_vcpu_cache_remove_mm_locked(src_rq, p->mm, false);
set_task_cpu(p, this_cpu);
activate_task(this_rq, p, 0);
dmin = p->dl.deadline;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index bb3d63bdf4ae..d2f6a6b25a88 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -762,6 +762,19 @@ do { \
P(sched_goidle);
P(ttwu_count);
P(ttwu_local);
+#undef P
+#define P(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, schedstat_val(rq->n));
+ P(nr_vcpu_thread_transfer);
+ P(nr_vcpu_cache_hit);
+ P(nr_vcpu_cache_evict);
+ P(nr_vcpu_cache_discard_wrong_node);
+ P(nr_vcpu_allocate);
+ P(nr_vcpu_allocate_node_reuse);
+ P(nr_vcpu_allocate_node_new);
+ P(nr_vcpu_allocate_node_rebalance);
+ P(nr_vcpu_allocate_node_steal);
+ P(nr_vcpu_remove_release_mm);
+ P(nr_vcpu_remove_migrate);
}
#undef P

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cc8daa3dcc8b..941cc84c5ed8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7883,6 +7883,7 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
lockdep_assert_rq_held(env->src_rq);

deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
+ rq_vcpu_cache_remove_mm_locked(env->src_rq, p->mm, false);
set_task_cpu(p, env->dst_cpu);
}

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7891c0f0e1ff..90321de2bda0 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2104,6 +2104,7 @@ static int push_rt_task(struct rq *rq, bool pull)
}

deactivate_task(rq, next_task, 0);
+ rq_vcpu_cache_remove_mm_locked(rq, next_task->mm, false);
set_task_cpu(next_task, lowest_rq->cpu);
activate_task(lowest_rq, next_task, 0);
resched_curr(lowest_rq);
@@ -2377,6 +2378,7 @@ static void pull_rt_task(struct rq *this_rq)
push_task = get_push_task(src_rq);
} else {
deactivate_task(src_rq, p, 0);
+ rq_vcpu_cache_remove_mm_locked(src_rq, p->mm, false);
set_task_cpu(p, this_cpu);
activate_task(this_rq, p, 0);
resched = true;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 84bba67c92dc..cc621f4691ba 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -901,6 +901,19 @@ struct uclamp_rq {
DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
#endif /* CONFIG_UCLAMP_TASK */

+#ifdef CONFIG_SCHED_MM_VCPU
+# define RQ_VCPU_CACHE_SIZE 8
+struct rq_vcpu_entry {
+ struct mm_struct *mm; /* NULL if unset */
+ int vcpu_id;
+};
+
+struct rq_vcpu_cache {
+ struct rq_vcpu_entry entry[RQ_VCPU_CACHE_SIZE];
+ unsigned int head;
+};
+#endif
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -1071,6 +1084,19 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ unsigned long long nr_vcpu_single_thread;
+ unsigned long long nr_vcpu_thread_transfer;
+ unsigned long long nr_vcpu_cache_hit;
+ unsigned long long nr_vcpu_cache_evict;
+ unsigned long long nr_vcpu_cache_discard_wrong_node;
+ unsigned long long nr_vcpu_allocate;
+ unsigned long long nr_vcpu_allocate_node_reuse;
+ unsigned long long nr_vcpu_allocate_node_new;
+ unsigned long long nr_vcpu_allocate_node_rebalance;
+ unsigned long long nr_vcpu_allocate_node_steal;
+ unsigned long long nr_vcpu_remove_release_mm;
+ unsigned long long nr_vcpu_remove_migrate;
#endif

#ifdef CONFIG_CPU_IDLE
@@ -1101,6 +1127,10 @@ struct rq {
unsigned int core_forceidle_occupation;
u64 core_forceidle_start;
#endif
+
+#ifdef CONFIG_SCHED_MM_VCPU
+ struct rq_vcpu_cache vcpu_cache;
+#endif
};

#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -3112,4 +3142,331 @@ extern int sched_dynamic_mode(const char *str);
extern void sched_dynamic_update(int mode);
#endif

+#ifdef CONFIG_SCHED_MM_VCPU
+static inline int __mm_vcpu_get_single_node(struct mm_struct *mm)
+{
+ struct cpumask *cpumask;
+ int vcpu;
+
+ cpumask = mm_vcpumask(mm);
+ /* Atomically reserve lowest available vcpu number. */
+ do {
+ vcpu = cpumask_first_zero(cpumask);
+ if (vcpu >= nr_cpu_ids)
+ return -1;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ return vcpu;
+}
+
+#ifdef CONFIG_NUMA
+static inline bool mm_node_vcpumask_test_cpu(struct mm_struct *mm, int vcpu_id)
+{
+ if (num_possible_nodes() == 1)
+ return true;
+ return cpumask_test_cpu(vcpu_id, mm_node_vcpumask(mm, numa_node_id()));
+}
+
+static inline int __mm_vcpu_get(struct rq *rq, struct mm_struct *mm)
+{
+ struct cpumask *cpumask = mm_vcpumask(mm),
+ *node_cpumask = mm_node_vcpumask(mm, numa_node_id()),
+ *node_alloc_cpumask = mm_node_alloc_vcpumask(mm);
+ unsigned int node;
+ int vcpu;
+
+ if (num_possible_nodes() == 1)
+ return __mm_vcpu_get_single_node(mm);
+
+ /*
+ * Try to atomically reserve lowest available vcpu number within those
+ * already reserved for this NUMA node.
+ */
+ do {
+ vcpu = cpumask_first_one_and_zero(node_cpumask, cpumask);
+ if (vcpu >= nr_cpu_ids)
+ goto alloc_numa;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ schedstat_inc(rq->nr_vcpu_allocate_node_reuse);
+ goto end;
+
+alloc_numa:
+ /*
+ * Try to atomically reserve lowest available vcpu number within those
+ * not already allocated for numa nodes.
+ */
+ do {
+ vcpu = cpumask_first_zero_and_zero(node_alloc_cpumask, cpumask);
+ if (vcpu >= nr_cpu_ids)
+ goto numa_update;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ cpumask_set_cpu(vcpu, node_cpumask);
+ cpumask_set_cpu(vcpu, node_alloc_cpumask);
+ schedstat_inc(rq->nr_vcpu_allocate_node_new);
+ goto end;
+
+numa_update:
+ /*
+ * NUMA node id configuration changed for at least one CPU in the system.
+ * We need to steal a currently unused vcpu_id from an overprovisioned
+ * node for our current node. Userspace must handle the fact that the
+ * node id associated with this vcpu_id may change due to node ID
+ * reconfiguration.
+ *
+ * Count how many possible cpus are attached to each (other) node id,
+ * and compare this with the per-mm node vcpumask cpu count. Find one
+ * which has too many cpus in its mask to steal from.
+ */
+ for (node = 0; node < nr_node_ids; node++) {
+ struct cpumask *iter_cpumask;
+
+ if (node == numa_node_id())
+ continue;
+ iter_cpumask = mm_node_vcpumask(mm, node);
+ if (nr_cpus_node(node) < cpumask_weight(iter_cpumask)) {
+ /* Try to steal from this node. */
+ do {
+ vcpu = cpumask_first_one_and_zero(iter_cpumask, cpumask);
+ if (vcpu >= nr_cpu_ids)
+ goto steal_fail;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ cpumask_clear_cpu(vcpu, iter_cpumask);
+ cpumask_set_cpu(vcpu, node_cpumask);
+ schedstat_inc(rq->nr_vcpu_allocate_node_rebalance);
+ goto end;
+ }
+ }
+
+steal_fail:
+ /*
+ * Our attempt at gracefully stealing a vcpu_id from another
+ * overprovisioned NUMA node failed. Fallback to grabbing the first
+ * available vcpu_id.
+ */
+ do {
+ vcpu = cpumask_first_zero(cpumask);
+ if (vcpu >= nr_cpu_ids)
+ return -1;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ /* Steal vcpu from its numa node mask. */
+ for (node = 0; node < nr_node_ids; node++) {
+ struct cpumask *iter_cpumask;
+
+ if (node == numa_node_id())
+ continue;
+ iter_cpumask = mm_node_vcpumask(mm, node);
+ if (cpumask_test_cpu(vcpu, iter_cpumask)) {
+ cpumask_clear_cpu(vcpu, iter_cpumask);
+ break;
+ }
+ }
+ cpumask_set_cpu(vcpu, node_cpumask);
+ schedstat_inc(rq->nr_vcpu_allocate_node_steal);
+end:
+ return vcpu;
+}
+
+static inline int mm_vcpu_first_node_vcpu(int node)
+{
+ if (likely(nr_cpu_ids >= nr_node_ids))
+ return node;
+ else {
+ int vcpu;
+
+ vcpu = cpumask_first(cpumask_of_node(node));
+ if (vcpu >= nr_cpu_ids)
+ return -1;
+ return vcpu;
+ }
+}
+
+/*
+ * Single-threaded processes observe a mapping of vcpu_id->node_id where
+ * the vcpu_id returned corresponds to mm_vcpu_first_node_vcpu(). When going
+ * from single to multi-threaded, reserve this same mapping so it stays
+ * invariant.
+ */
+static inline void mm_vcpu_reserve_nodes(struct mm_struct *mm)
+{
+ struct cpumask *node_alloc_cpumask = mm_node_alloc_vcpumask(mm);
+ int node, other_node;
+
+ for (node = 0; node < nr_node_ids; node++) {
+ struct cpumask *iter_cpumask = mm_node_vcpumask(mm, node);
+ int vcpu = mm_vcpu_first_node_vcpu(node);
+
+ /* Skip nodes that have no CPU associated with them. */
+ if (vcpu < 0)
+ continue;
+ cpumask_set_cpu(vcpu, iter_cpumask);
+ cpumask_set_cpu(vcpu, node_alloc_cpumask);
+ for (other_node = 0; other_node < nr_node_ids; other_node++) {
+ if (other_node == node)
+ continue;
+ cpumask_clear_cpu(vcpu, mm_node_vcpumask(mm, other_node));
+ }
+ }
+}
+#else
+static inline bool mm_node_vcpumask_test_cpu(struct mm_struct *mm, int vcpu_id)
+{
+ return true;
+}
+static inline int __mm_vcpu_get(struct rq *rq, struct mm_struct *mm)
+{
+ return __mm_vcpu_get_single_node(mm);
+}
+static inline int mm_vcpu_first_node_vcpu(int node)
+{
+ return 0;
+}
+static inline void mm_vcpu_reserve_nodes(struct mm_struct *mm) { }
+#endif
+
+static inline void __mm_vcpu_put(struct mm_struct *mm, int vcpu)
+{
+ if (vcpu < 0)
+ return;
+ cpumask_clear_cpu(vcpu, mm_vcpumask(mm));
+}
+
+static inline struct rq_vcpu_entry *rq_vcpu_cache_lookup(struct rq *rq, struct mm_struct *mm)
+{
+ struct rq_vcpu_cache *vcpu_cache = &rq->vcpu_cache;
+ int i;
+
+ for (i = 0; i < RQ_VCPU_CACHE_SIZE; i++) {
+ struct rq_vcpu_entry *entry = &vcpu_cache->entry[i];
+
+ if (entry->mm == mm)
+ return entry;
+ }
+ return NULL;
+}
+
+/* Removal from cache simply leaves an unused hole. */
+static inline int rq_vcpu_cache_lookup_remove(struct rq *rq, struct mm_struct *mm)
+{
+ struct rq_vcpu_entry *entry = rq_vcpu_cache_lookup(rq, mm);
+
+ if (!entry)
+ return -1;
+ entry->mm = NULL; /* Remove from cache */
+ return entry->vcpu_id;
+}
+
+static inline void rq_vcpu_cache_remove_mm_locked(struct rq *rq, struct mm_struct *mm, bool release_mm)
+{
+ int vcpu;
+
+ if (!mm)
+ return;
+ /*
+ * Do not remove the cache entry for a runqueue that runs a task which
+ * currently uses the target mm.
+ */
+ if (!release_mm && rq->curr->mm == mm)
+ return;
+ vcpu = rq_vcpu_cache_lookup_remove(rq, mm);
+ if (vcpu < 0)
+ return;
+ if (release_mm)
+ schedstat_inc(rq->nr_vcpu_remove_release_mm);
+ else
+ schedstat_inc(rq->nr_vcpu_remove_migrate);
+ __mm_vcpu_put(mm, vcpu);
+}
+
+static inline void rq_vcpu_cache_remove_mm(struct rq *rq, struct mm_struct *mm, bool release_mm)
+{
+ struct rq_flags rf;
+
+ rq_lock_irqsave(rq, &rf);
+ rq_vcpu_cache_remove_mm_locked(rq, mm, release_mm);
+ rq_unlock_irqrestore(rq, &rf);
+}
+
+/*
+ * Add at head, move head forward. Cheap LRU cache.
+ * Only need to clear the vcpu mask bit from its own mm_vcpumask(mm) when we
+ * overwrite an old entry from the cache. Note that this is not needed if the
+ * overwritten entry is an unused hole. This access to the old_mm from an
+ * unrelated thread requires that cache entry for a given mm gets pruned from
+ * the cache when a task is dequeued from the runqueue.
+ */
+static inline void rq_vcpu_cache_add(struct rq *rq, struct mm_struct *mm, int vcpu_id)
+{
+ struct rq_vcpu_cache *vcpu_cache = &rq->vcpu_cache;
+ struct mm_struct *old_mm;
+ struct rq_vcpu_entry *entry;
+ unsigned int pos;
+
+ pos = vcpu_cache->head;
+ entry = &vcpu_cache->entry[pos];
+ old_mm = entry->mm;
+ if (old_mm) {
+ schedstat_inc(rq->nr_vcpu_cache_evict);
+ __mm_vcpu_put(old_mm, entry->vcpu_id);
+ }
+ entry->mm = mm;
+ entry->vcpu_id = vcpu_id;
+ vcpu_cache->head = (pos + 1) % RQ_VCPU_CACHE_SIZE;
+}
+
+static inline int mm_vcpu_get(struct rq *rq, struct task_struct *t)
+{
+ struct rq_vcpu_entry *entry;
+ struct mm_struct *mm = t->mm;
+ int vcpu;
+
+ /* Skip allocation if mm is single-threaded. */
+ if (atomic_read(&mm->mm_vcpu_users) == 1) {
+ schedstat_inc(rq->nr_vcpu_single_thread);
+ vcpu = mm_vcpu_first_node_vcpu(numa_node_id());
+ goto end;
+ }
+ entry = rq_vcpu_cache_lookup(rq, mm);
+ if (likely(entry)) {
+ vcpu = entry->vcpu_id;
+ if (likely(mm_node_vcpumask_test_cpu(mm, vcpu))) {
+ schedstat_inc(rq->nr_vcpu_cache_hit);
+ goto end;
+ } else {
+ schedstat_inc(rq->nr_vcpu_cache_discard_wrong_node);
+ entry->mm = NULL; /* Remove from cache */
+ __mm_vcpu_put(mm, vcpu);
+ }
+ }
+ schedstat_inc(rq->nr_vcpu_allocate);
+ vcpu = __mm_vcpu_get(rq, mm);
+ rq_vcpu_cache_add(rq, mm, vcpu);
+end:
+ return vcpu;
+}
+
+static inline void switch_mm_vcpu(struct rq *rq, struct task_struct *prev, struct task_struct *next)
+{
+ if (!(next->flags & PF_EXITING) && !(next->flags & PF_KTHREAD) && next->mm && next->vcpu_mm_active) {
+ if (!(prev->flags & PF_EXITING) && !(prev->flags & PF_KTHREAD) && prev->mm == next->mm &&
+ prev->vcpu_mm_active &&
+ mm_node_vcpumask_test_cpu(next->mm, prev->mm_vcpu)) {
+ /*
+ * Switching between threads with the same mm. Simply pass the
+ * vcpu token along to the next thread.
+ */
+ schedstat_inc(rq->nr_vcpu_thread_transfer);
+ next->mm_vcpu = prev->mm_vcpu;
+ } else {
+ next->mm_vcpu = mm_vcpu_get(rq, next);
+ }
+ }
+ if (!(prev->flags & PF_EXITING) && !(prev->flags & PF_KTHREAD) && prev->mm && prev->vcpu_mm_active)
+ prev->mm_vcpu = -1;
+}
+
+#else
+static inline void switch_mm_vcpu(struct rq *rq, struct task_struct *prev, struct task_struct *next) { }
+static inline void rq_vcpu_cache_remove_mm_locked(struct rq *rq, struct mm_struct *mm, bool release_mm) { }
+static inline void rq_vcpu_cache_remove_mm(struct rq *rq, struct mm_struct *mm, bool release_mm) { }
+#endif
+
#endif /* _KERNEL_SCHED_SCHED_H */
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 857f837f52cb..29e168af777b 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -133,12 +133,24 @@ static int show_schedstat(struct seq_file *seq, void *v)

/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->nr_vcpu_single_thread,
+ rq->nr_vcpu_thread_transfer,
+ rq->nr_vcpu_cache_hit,
+ rq->nr_vcpu_cache_evict,
+ rq->nr_vcpu_cache_discard_wrong_node,
+ rq->nr_vcpu_allocate,
+ rq->nr_vcpu_allocate_node_reuse,
+ rq->nr_vcpu_allocate_node_new,
+ rq->nr_vcpu_allocate_node_rebalance,
+ rq->nr_vcpu_allocate_node_steal,
+ rq->nr_vcpu_remove_release_mm,
+ rq->nr_vcpu_remove_migrate);

seq_printf(seq, "\n");

--
2.17.1

2022-07-29 19:43:46

by Mathieu Desnoyers

Subject: [PATCH v3 03/23] rseq: extend struct rseq with numa node id

Adding the NUMA node id to struct rseq is a straightforward thing to do,
and a good way to figure out if anything in the user-space ecosystem
prevents extending struct rseq.

This NUMA node id field allows memory allocators such as tcmalloc to
take advantage of fast access to the current NUMA node id to perform
NUMA-aware memory allocation.

It can also be useful for implementing fast-paths for NUMA-aware
user-space mutexes.

It also allows implementing getcpu(2) purely in user-space.
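
As an illustration of the user-space getcpu() use case (a sketch, not
part of this patch; rseq_thread_area() is a placeholder for however the
libc exposes the registered per-thread struct rseq):

/* Sketch only: assumes a <linux/rseq.h> containing the new node_id field. */
extern volatile struct rseq *rseq_thread_area(void);

static int rseq_getcpu(unsigned int *cpup, unsigned int *nodep)
{
	volatile struct rseq *rs = rseq_thread_area();

	if (cpup)
		*cpup = rs->cpu_id;	/* updated by the kernel */
	if (nodep)
		*nodep = rs->node_id;	/* field added by this patch */
	return 0;
}

As with the vDSO getcpu(), the two fields may be read across a thread
migration, which the getcpu(2) contract already tolerates.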

Signed-off-by: Mathieu Desnoyers <[email protected]>
---
include/trace/events/rseq.h | 4 +++-
include/uapi/linux/rseq.h | 8 ++++++++
kernel/rseq.c | 19 +++++++++++++------
3 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/include/trace/events/rseq.h b/include/trace/events/rseq.h
index a04a64bc1a00..6bd442697354 100644
--- a/include/trace/events/rseq.h
+++ b/include/trace/events/rseq.h
@@ -16,13 +16,15 @@ TRACE_EVENT(rseq_update,

TP_STRUCT__entry(
__field(s32, cpu_id)
+ __field(s32, node_id)
),

TP_fast_assign(
__entry->cpu_id = raw_smp_processor_id();
+ __entry->node_id = cpu_to_node(raw_smp_processor_id());
),

- TP_printk("cpu_id=%d", __entry->cpu_id)
+ TP_printk("cpu_id=%d node_id=%d", __entry->cpu_id, __entry->node_id)
);

TRACE_EVENT(rseq_ip_fixup,
diff --git a/include/uapi/linux/rseq.h b/include/uapi/linux/rseq.h
index 05d3c4cdeb40..1cb90a435c5c 100644
--- a/include/uapi/linux/rseq.h
+++ b/include/uapi/linux/rseq.h
@@ -131,6 +131,14 @@ struct rseq {
*/
__u32 flags;

+ /*
+ * Restartable sequences node_id field. Updated by the kernel. Read by
+ * user-space with single-copy atomicity semantics. This field should
+ * only be read by the thread which registered this data structure.
+ * Aligned on 32-bit. Contains the current NUMA node ID.
+ */
+ __u32 node_id;
+
/*
* Flexible array member at end of structure, after last feature field.
*/
diff --git a/kernel/rseq.c b/kernel/rseq.c
index 46dc5c2ce2b7..cb7d8a5afc82 100644
--- a/kernel/rseq.c
+++ b/kernel/rseq.c
@@ -84,15 +84,17 @@
* F1. <failure>
*/

-static int rseq_update_cpu_id(struct task_struct *t)
+static int rseq_update_cpu_node_id(struct task_struct *t)
{
- u32 cpu_id = raw_smp_processor_id();
struct rseq __user *rseq = t->rseq;
+ u32 cpu_id = raw_smp_processor_id();
+ u32 node_id = cpu_to_node(cpu_id);

if (!user_write_access_begin(rseq, t->rseq_len))
goto efault;
unsafe_put_user(cpu_id, &rseq->cpu_id_start, efault_end);
unsafe_put_user(cpu_id, &rseq->cpu_id, efault_end);
+ unsafe_put_user(node_id, &rseq->node_id, efault_end);
/*
* Additional feature fields added after ORIG_RSEQ_SIZE
* need to be conditionally updated only if
@@ -108,9 +110,9 @@ static int rseq_update_cpu_id(struct task_struct *t)
return -EFAULT;
}

-static int rseq_reset_rseq_cpu_id(struct task_struct *t)
+static int rseq_reset_rseq_cpu_node_id(struct task_struct *t)
{
- u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED;
+ u32 cpu_id_start = 0, cpu_id = RSEQ_CPU_ID_UNINITIALIZED, node_id = 0;

/*
* Reset cpu_id_start to its initial state (0).
@@ -124,6 +126,11 @@ static int rseq_reset_rseq_cpu_id(struct task_struct *t)
*/
if (put_user(cpu_id, &t->rseq->cpu_id))
return -EFAULT;
+ /*
+ * Reset node_id to its initial state (0).
+ */
+ if (put_user(node_id, &t->rseq->node_id))
+ return -EFAULT;
/*
* Additional feature fields added after ORIG_RSEQ_SIZE
* need to be conditionally reset only if
@@ -306,7 +313,7 @@ void __rseq_handle_notify_resume(struct ksignal *ksig, struct pt_regs *regs)
if (unlikely(ret < 0))
goto error;
}
- if (unlikely(rseq_update_cpu_id(t)))
+ if (unlikely(rseq_update_cpu_node_id(t)))
goto error;
return;

@@ -353,7 +360,7 @@ SYSCALL_DEFINE4(rseq, struct rseq __user *, rseq, u32, rseq_len,
return -EINVAL;
if (current->rseq_sig != sig)
return -EPERM;
- ret = rseq_reset_rseq_cpu_id(current);
+ ret = rseq_reset_rseq_cpu_node_id(current);
if (ret)
return ret;
current->rseq = NULL;
--
2.17.1

2022-08-01 17:19:10

by Peter Oskolkov

Subject: Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions

On Fri, Jul 29, 2022 at 12:02 PM Mathieu Desnoyers
<[email protected]> wrote:
>
> Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.

Thanks a lot, Mathieu - it is really exciting to see this happening!

I'll share our experiences here, with the hope that it may be useful.
I've also cc-ed
Chris Kennelly, who worked on the userspace/tcmalloc side, as he can provide
more context/details if I miss or misrepresent something.

The problem:

tcmalloc maintains per-cpu freelists in the userspace to make userspace
memory allocations fast and efficient; it relies on rseq to do so, as
any manipulation
of the freelists has to be protected vs thread migrations.

However, as a typical userspace process at a Google datacenter is confined to
a relatively small number of CPUs (8-16) via cgroups, while the
servers typically
have a much larger number of physical CPUs, the per-cpu freelist model
is somewhat
wasteful: if a process has only at most 10 threads running, for
example, but these threads
can "wander" across 100 CPUs over the lifetime of the process, keeping 100
freelists instead of 10 noticeably wastes memory.

Note that although a typical process at Google has a limited CPU
quota, thus using
only a small number of CPUs at any given time, the process may often have many
hundreds or thousands of threads, so per-thread freelists are not a viable
solution to the problem just described.

Our current solution:

As you outlined in patch 9, tracking the number of currently running threads per
address space and exposing this information via a vcpu_id abstraction helps
tcmalloc to noticeably reduce its freelist overhead in the "narrow
process running
on a wide server" situation, which is typical at Google.

We have experimented with several approaches here. The one that we are
currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.

We did try per-numa-node vcpus, but it did not show any material improvement
over the "flat" model, perhaps because on our most "wide" servers the CPU
topology is multi-level. Chris Kennelly may provide more details here.

On a more technical note, we do use atomic operations extensively in
the kernel to make sure
vcpu IDs are "tightly packed", i.e. if only N threads of a process are currently
running on physical CPUs, vcpu IDs will be in the range [0, N-1], i.e. no gaps,
no going to N and above; this does consume some extra CPU cycles, but the
RAM savings we gain far outweigh the extra CPU cost; it will be interesting to
see what you can do with the optimizations you propose in this patchset.

Again, thanks a lot for this effort!

Peter

[...]

2022-08-02 15:12:04

by Mathieu Desnoyers

Subject: Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions

----- On Aug 1, 2022, at 1:07 PM, Peter Oskolkov [email protected] wrote:

> On Fri, Jul 29, 2022 at 12:02 PM Mathieu Desnoyers
> <[email protected]> wrote:
>>
>> Extend the rseq ABI to expose a NUMA node ID and a vm_vcpu_id field.
>
> Thanks a lot, Mathieu - it is really exciting to see this happening!
>
> I'll share our experiences here, with the hope that it may be useful.
> I've also cc-ed
> Chris Kennelly, who worked on the userspace/tcmalloc side, as he can provide
> more context/details if I miss or misrepresent something.

Thanks for sharing your experiences at Google. This helps put things in
perspective.

>
> The problem:
>
> tcmalloc maintains per-cpu freelists in the userspace to make userspace
> memory allocations fast and efficient; it relies on rseq to do so, as
> any manipulation
> of the freelists has to be protected vs thread migrations.
>
> However, as a typical userspace process at a Google datacenter is confined to
> a relatively small number of CPUs (8-16) via cgroups, while the
> servers typically
> have a much larger number of physical CPUs, the per-cpu freelist model
> is somewhat
> wasteful: if a process has only at most 10 threads running, for
> example, but these threads
> can "wander" across 100 CPUs over the lifetime of the process, keeping 100
> freelists instead of 10 noticeably wastes memory.
>
> Note that although a typical process at Google has a limited CPU
> quota, thus using
> only a small number of CPUs at any given time, the process may often have many
> hundreds or thousands of threads, so per-thread freelists are not a viable
> solution to the problem just described.
>
> Our current solution:
>
> As you outlined in patch 9, tracking the number of currently running threads per
> address space and exposing this information via a vcpu_id abstraction helps
> tcmalloc to noticeably reduce its freelist overhead in the "narrow
> process running
> on a wide server" situation, which is typical at Google.
>
> We have experimented with several approaches here. The one that we are
> currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.
>
> We did try per-numa-node vcpus, but it did not show any material improvement
> over the "flat" model, perhaps because on our most "wide" servers the CPU
> topology is multi-level. Chris Kennelly may provide more details here.

I would really like to know more about Google's per-numa-node vcpus implementation.
I suspect you guys may have taken a different turn somewhere in the design which
led to these results. But having not seen that implementation, I can only guess.

I notice the following Google-specific prototype extension in tcmalloc:

// This is a prototype extension to the rseq() syscall. Since a process may
// run on only a few cores at a time, we can use a dense set of "v(irtual)
// cpus." This can reduce cache requirements, as we only need N caches for
// the cores we actually run on simultaneously, rather than a cache for every
// physical core.
union {
  struct {
    short numa_node_id;
    short vcpu_id;
  };
  int vcpu_flat;
};

Can you tell me more about the way the numa_node_id and vcpu_id are allocated
internally, and how they are expected to be used by userspace ?

>
> On a more technical note, we do use atomic operations extensively in
> the kernel to make sure
> vcpu IDs are "tightly packed", i.e. if only N threads of a process are currently
> running on physical CPUs, vcpu IDs will be in the range [0, N-1], i.e. no gaps,
> no going to N and above; this does consume some extra CPU cycles, but the
> RAM savings we gain far outweigh the extra CPU cost; it will be interesting to
> see what you can do with the optimizations you propose in this patchset.

The optimizations I propose keep those "tightly packed" characteristics, but skip
the atomic operations in common scenarios. I would welcome benchmarks of the added
overhead in representative workloads.

> Again, thanks a lot for this effort!

Thanks for your input. It really helps steer the effort in the right direction.

Mathieu

>
> Peter
>
> [...]

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2022-08-02 17:33:05

by Peter Oskolkov

Subject: Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions

On Tue, Aug 2, 2022 at 8:01 AM Mathieu Desnoyers
<[email protected]> wrote:
>

[...]

> >
> > We have experimented with several approaches here. The one that we are
> > currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.
> >
> > We did try per-numa-node vcpus, but it did not show any material improvement
> > over the "flat" model, perhaps because on our most "wide" servers the CPU
> > topology is multi-level. Chris Kennelly may provide more details here.
>
> I would really like to know more about Google's per-numa-node vcpus implementation.
> I suspect you guys may have taken a different turn somewhere in the design which
> led to these results. But having not seen that implementation, I can only guess.
>
> I notice the following Google-specific prototype extension in tcmalloc:
>
> // This is a prototype extension to the rseq() syscall. Since a process may
> // run on only a few cores at a time, we can use a dense set of "v(irtual)
> // cpus." This can reduce cache requirements, as we only need N caches for
> // the cores we actually run on simultaneously, rather than a cache for every
> // physical core.
> union {
> struct {
> short numa_node_id;
> short vcpu_id;
> };
> int vcpu_flat;
> };
>
> Can you tell me more about the way the numa_node_id and vcpu_id are allocated
> internally, and how they are expected to be used by userspace ?

Based on a "VCPU policy" flag passed by the userspace during rseq registration
request, our kernel would:
- do nothing re: vcpus, i.e. behave like it currently does upstream;
- allocate VCPUs in a "flat" manner, ignoring NUMA;
- populate numa_node_id with the value from the function with the same name in
https://elixir.bootlin.com/linux/latest/source/include/linux/topology.h
and allocate vcpu_id within the numa node in a tight manner.

Basically, if there are M threads running on node 0 and N threads running on
node 1 at time T, vcpu IDs [0,M-1] will be associated with node 0 and vcpu IDs
[0,N-1] with node 1 at that moment. If a thread migrates across nodes, the
balance changes accordingly.

I'm not sure how exactly tcmalloc tried to use VCPUs under this policy, and
what benefits were expected. The simplest way would be to keep a freelist per
node_id/vcpu_id pair (basically, per vcpu_flat in the union), but this would
tend to increase the number of freelists due to thread migrations, so the
benefits should be related to memory locality, and therefore somewhat difficult
to measure precisely.
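
For illustration, a minimal userspace sketch of that "freelist per
node_id/vcpu_id pair" idea (this is not the actual tcmalloc code; the rseq
field names follow the prototype union quoted earlier, and rseq_abi(), the
array bounds and the freelist type are assumptions made for the example):

#define MAX_NODES	8	/* assumed bounds for the example */
#define MAX_VCPUS	512

struct freelist {
	void *head;
	/* ... */
};

/* One freelist per (numa_node_id, vcpu_id) pair, i.e. per vcpu_flat value.
 * Only the slots of currently running vcpus stay hot. */
static struct freelist freelists[MAX_NODES][MAX_VCPUS];

/* rseq_abi() is assumed to return this thread's struct rseq, extended with
 * the prototype union shown above. */
static struct freelist *current_freelist(void)
{
	struct rseq_abi *rs = rseq_abi();

	return &freelists[rs->numa_node_id][rs->vcpu_id];
}

Thread migrations across nodes would leave stale freelists behind in other
slots, which is presumably where the locality-related cost mentioned above
comes from.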

Chris Kennelly may offer more details here.

Thanks,
Peter

2022-08-02 21:19:53

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions

----- On Aug 2, 2022, at 1:06 PM, Peter Oskolkov [email protected] wrote:

> On Tue, Aug 2, 2022 at 8:01 AM Mathieu Desnoyers
> <[email protected]> wrote:
>>
>
> [...]
>
>> >
>> > We have experimented with several approaches here. The one that we are
>> > currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.
>> >
>> > We did try per-numa-node vcpus, but it did not show any material improvement
>> > over the "flat" model, perhaps because on our most "wide" servers the CPU
>> > topology is multi-level. Chris Kennelly may provide more details here.
>>
>> I would really like to know more about Google's per-numa-node vcpus
>> implementation.
>> I suspect you guys may have taken a different turn somewhere in the design which
>> led to these results. But having not seen that implementation, I can only guess.
>>
>> I notice the following Google-specific prototype extension in tcmalloc:
>>
>> // This is a prototype extension to the rseq() syscall. Since a process may
>> // run on only a few cores at a time, we can use a dense set of "v(irtual)
>> // cpus." This can reduce cache requirements, as we only need N caches for
>> // the cores we actually run on simultaneously, rather than a cache for every
>> // physical core.
>> union {
>> struct {
>> short numa_node_id;
>> short vcpu_id;
>> };
>> int vcpu_flat;
>> };
>>
>> Can you tell me more about the way the numa_node_id and vcpu_id are allocated
>> internally, and how they are expected to be used by userspace ?
>
> Based on a "VCPU policy" flag passed by the userspace during rseq registration
> request, our kernel would:
> - do nothing re: vcpus, i.e. behave like it currently does upstream;
> - allocate VCPUs in a "flat" manner, ignoring NUMA;
> - populate numa_node_id with the value from the function with the same name in
> https://elixir.bootlin.com/linux/latest/source/include/linux/topology.h
> and allocate vcpu_id within the numa node in a tight manner.
>
> Basically, if there are M threads running on node 0 and N threads
> running on node 1 at time T, there will be [0,M-1] vcpu IDs associated with
> node 0 and [0,N-1] vcpu IDs associated with node 1 at this moment
> in time. If a thread migrates across nodes, the balance would change
> accordingly.
>
> I'm not sure how exactly tcmalloc tried to use VCPUs under this policy, and
> what were the benefits expected. The simplest way would be to keep
> a freelist per node_id/vcpu_id pair (basically, per vcpu_flat in the union),
> but this would tend to increase the number of freelists due to thread
> migrations,
> so benefits should be related to memory locality, and so somewhat difficult to
> measure precisely.

So, based on your explanation of the Google implementation, for each memory space,
the kernel keeps per-numa-node vcpu-id allocation domains.

This leaves two choices to userspace AFAIU:

1) Userspace takes the vcpu_flat (int, combining the short node_id with the short vcpu_id)
as index in a sparse array. The sparseness of the array may be unwelcome in terms of
memory and cache footprint.

2) Userspace could take a 2-level approach: using the short node_id to index an array of
"numa node" objects, which would then point to a 2nd level indexed by short vcpu_id.
This adds an extra pointer dereference on the fast-path, and touches additional cache
lines on the fast path as well, which is probably unwelcome. In addition, keeping track
of this 2-level table adds extra complexity in userspace, and requires that user-space
designs its data structure specifically for NUMA, which is unusual considering that NUMA
is typically just an optimization hint to improve locality of memory accesses
(a rough sketch of this 2-level layout follows below).
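
A minimal sketch of what choice 2) could look like (purely illustrative; the
field names follow the prototype union quoted above, while rseq_abi() and the
surrounding structure names are assumptions):

struct percpu_data {
	void *freelist_head;	/* per-vcpu payload */
};

struct numa_node_slots {
	struct percpu_data *vcpus;	/* 2nd level, indexed by vcpu_id */
};

static struct numa_node_slots *nodes;	/* 1st level, indexed by numa_node_id */

static struct percpu_data *current_slot(void)
{
	struct rseq_abi *rs = rseq_abi();

	/* Extra pointer dereference and extra cache line touched on the
	 * fast path, compared to a single flat array. */
	return &nodes[rs->numa_node_id].vcpus[rs->vcpu_id];
}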

Hopefully I did not miss anything there.

So here is how I did things differently.

I realized that when userspace uses rseq_abi()->cpu_id as an index into per-cpu data structures,
it expects that once any of the process' threads has observed a given numa_node_id while running
on a given cpu_id, that topology stays invariant for the process lifetime [1]. IOW, the same
cpu_id should always run on the same NUMA node in the future.

This characteristic is what allows cpu_id-indexed data structures to achieve good NUMA locality:
on first use of a given cpu_id, memory allocation can be done on the right NUMA node, and all
subsequent per-cpu accesses are then guaranteed to be local.

So I applied this concept to vcpu_ids.

The trick here is mostly to add a per-numa-node bitmap to the mm_struct in addition to the bitmap
tracking the current vcpu_id allocation. The per-numa-node bitmap keeps track of which vcpu_ids
have been allocated on behalf of each numa node in the past.

So when a thread running on a given NUMA node needs to find the lowest vcpu_id which is available,
it uses cpumask_first_one_and_zero(node_cpumask, cpumask) to find the first bit which has both
been allocated already for this NUMA node, and is currently not in use by another thread.

There is also a node_alloc_vcpumask bitmap which keeps track of which vcpu_ids have already been
allocated in the past, across all NUMA nodes. This allows scanning efficiently for the first
vcpu_id which was _not_ yet allocated to any node and is currently unused, with
cpumask_first_zero_and_zero(node_alloc_vcpumask, cpumask).
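
Putting the two lookups together, the allocation logic can be sketched roughly as follows
(this is a simplification, not the patch code: the mm-level accessors are made-up names, and
locking/atomicity against concurrent threads is omitted):

static int mm_vcpu_get(struct mm_struct *mm, int node)
{
	/* Illustrative accessors for the three bitmaps described above. */
	struct cpumask *cpumask = mm_vcpumask(mm);		/* vcpu_ids currently in use */
	struct cpumask *node_cpumask = mm_node_vcpumask(mm, node); /* ever used on this node */
	struct cpumask *node_alloc_vcpumask = mm_node_alloc_vcpumask(mm); /* ever used on any node */
	int vcpu;

	/* Prefer a vcpu_id already associated with this NUMA node and currently unused. */
	vcpu = cpumask_first_one_and_zero(node_cpumask, cpumask);
	if (vcpu >= nr_cpu_ids) {
		/* Otherwise take the first vcpu_id never handed out to any node. */
		vcpu = cpumask_first_zero_and_zero(node_alloc_vcpumask, cpumask);
		if (vcpu < nr_cpu_ids) {
			cpumask_set_cpu(vcpu, node_cpumask);
			cpumask_set_cpu(vcpu, node_alloc_vcpumask);
		} else {
			/* Fall back to any unused vcpu_id, even if it belongs to another node. */
			vcpu = cpumask_first_zero(cpumask);
		}
	}
	cpumask_set_cpu(vcpu, cpumask);
	return vcpu;
}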

With this, user-space can simply use the vm_vcpu_id field as index into the per-vcpu array,
and the NUMA locality is implicit. Upon initial allocation of the numa-local memory, it just
needs to read both vm_vcpu_id and numa_node_id fields within a rseq critical section to
ensure there is no migration between the two field loads.
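
For example (a sketch only, not part of this series: read_vcpu_and_node() is a placeholder for
such a critical section, e.g. built on the rseq_load_u32_u32() helper added for x86 later in
this series, and numa_alloc_onnode() is the libnuma allocator):

#include <numa.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_VCPUS	512	/* assumed upper bound */

static void *slabs[MAX_VCPUS];	/* NULL until first touched */

/* Placeholder: load vm_vcpu_id and numa_node_id without an intervening
 * migration (rseq critical section). */
static void read_vcpu_and_node(uint32_t *vcpu, uint32_t *node);

static void *get_local_slab(size_t size)
{
	uint32_t vcpu, node;

	read_vcpu_and_node(&vcpu, &node);
	if (!slabs[vcpu]) {
		/* First use of this vcpu_id in the process: allocate on the
		 * NUMA node it is (and will remain) associated with. */
		slabs[vcpu] = numa_alloc_onnode(size, node);
	}
	return slabs[vcpu];
}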

So as long as the scheduler does a good job of keeping the number of threads per NUMA node
relatively constant by pushing back against thread migration across NUMA nodes over the process
lifetime, there should not be too many extra vcpu_ids needed. In the worst-case scenario, the
number of vcpu_ids needed is equal to the number of online cpus in the system.

The performance overhead of keeping track of those additional bitmaps for NUMA locality should
be minimal, considering that the per-numa-node bitmap and the node_alloc_vcpumask bitmap only
need to be updated the first time a vcpu_id is assigned within a process. The rest is lookup
only. And even there, the optimizations I have put in place skip those lookups entirely in the
common scenarios.

Thanks,

Mathieu

[1] There is the exception of Power CPU hotplug which can reconfigure the NUMA topology,
but this seems like a rather odd and rare corner-case. It is supported by my implementation,
but userspace would have to deal with this kind of reconfiguration on its own to
preserve NUMA locality.

>
> Chris Kennelly may offer more details here.
>
> Thanks,
> Peter

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2022-08-04 16:22:48

by Chris Kennelly

[permalink] [raw]
Subject: Re: [PATCH v3 00/23] RSEQ node id and virtual cpu id extensions

On Tue, Aug 2, 2022 at 4:53 PM Mathieu Desnoyers
<[email protected]> wrote:
>
> ----- On Aug 2, 2022, at 1:06 PM, Peter Oskolkov [email protected] wrote:
>
> > On Tue, Aug 2, 2022 at 8:01 AM Mathieu Desnoyers
> > <[email protected]> wrote:
> >>
> >
> > [...]
> >
> >> >
> >> > We have experimented with several approaches here. The one that we are
> >> > currently using is the "flat" model: we allocate vcpu IDs ignoring numa nodes.
> >> >
> >> > We did try per-numa-node vcpus, but it did not show any material improvement
> >> > over the "flat" model, perhaps because on our most "wide" servers the CPU
> >> > topology is multi-level. Chris Kennelly may provide more details here.

As Peter mentioned, though, we solely use the flat vcpu ID implementation.

> >>
> >> I would really like to know more about Google's per-numa-node vcpus
> >> implementation.
> >> I suspect you guys may have taken a different turn somewhere in the design which
> >> led to these results. But having not seen that implementation, I can only guess.
> >>
> >> I notice the following Google-specific prototype extension in tcmalloc:
> >>
> >> // This is a prototype extension to the rseq() syscall. Since a process may
> >> // run on only a few cores at a time, we can use a dense set of "v(irtual)
> >> // cpus." This can reduce cache requirements, as we only need N caches for
> >> // the cores we actually run on simultaneously, rather than a cache for every
> >> // physical core.
> >> union {
> >> struct {
> >> short numa_node_id;
> >> short vcpu_id;
> >> };
> >> int vcpu_flat;
> >> };
> >>
> >> Can you tell me more about the way the numa_node_id and vcpu_id are allocated
> >> internally, and how they are expected to be used by userspace ?
> >
> > Based on a "VCPU policy" flag passed by the userspace during rseq registration
> > request, our kernel would:
> > - do nothing re: vcpus, i.e. behave like it currently does upstream;
> > - allocate VCPUs in a "flat" manner, ignoring NUMA;
> > - populate numa_node_id with the value from the function with the same name in
> > https://elixir.bootlin.com/linux/latest/source/include/linux/topology.h
> > and allocate vcpu_id within the numa node in a tight manner.
> >
> > Basically, if there are M threads running on node 0 and N threads
> > running on node 1 at time T, there will be [0,M-1] vcpu IDs associated with
> > node 0 and [0,N-1] vcpu IDs associated with node 1 at this moment
> > in time. If a thread migrates across nodes, the balance would change
> > accordingly.
> >
> > I'm not sure how exactly tcmalloc tried to use VCPUs under this policy, and
> > what were the benefits expected. The simplest way would be to keep
> > a freelist per node_id/vcpu_id pair (basically, per vcpu_flat in the union),
> > but this would tend to increase the number of freelists due to thread
> > migrations,
> > so benefits should be related to memory locality, and so somewhat difficult to
> > measure precisely.
>
> So, based on your explanation of the Google implementation, for each memory space,
> the kernel keeps per-numa-node vcpu-id allocation domains.
>
> This leaves two choices to userspace AFAIU:
>
> 1) Userspace takes the vcpu_flat (int, combining the short node_id with the short vcpu_id)
> as index in a sparse array. The sparseness of the array may be unwelcome in terms of
> memory and cache footprint.

In general, TCMalloc addresses sparseness by lazily allocating caches.
Virtual address space is relatively inexpensive, so we can allocate
enough space to have cache space for every feasible CPU ID, and structure
things so that, on a fault (of a zero page), we trigger the slow path and
handle initialization
(https://github.com/google/tcmalloc/blob/master/tcmalloc/cpu_cache.h#L824).
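
A minimal illustration of that lazy scheme (not the actual tcmalloc code, which is linked
above; slow_path_init() and the struct layout are made up for the example):

#include <stdint.h>
#include <sys/mman.h>

#define MAX_SLOTS	4096	/* every feasible (v)cpu id, assumed bound */

struct slot {
	uint32_t capacity;	/* 0 until the slow path initializes it */
	/* ... per-cpu cache payload ... */
};

static struct slot *slots;

static void slow_path_init(struct slot *s);	/* hypothetical init routine */

static int reserve_slots(void)
{
	/* Reserve virtual address space only; physical pages are allocated
	 * lazily on first write, and untouched slots read back as zero. */
	slots = mmap(NULL, MAX_SLOTS * sizeof(struct slot),
		     PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
	return slots == MAP_FAILED ? -1 : 0;
}

static struct slot *get_slot(int id)
{
	struct slot *s = &slots[id];

	if (!s->capacity)	/* still zero => never initialized */
		slow_path_init(s);
	return s;
}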

Even without virtual CPU IDs, TCMalloc still preferred to allocate a
cache per possible core, so no additional checks would be needed on
the fast path.

>
>
> 2) Userspace could take a 2-level approach: using the short node_id to index an array of
> "numa node" objects, which would then point to a 2nd level indexed by short vcpu_id.
> This adds an extra pointer dereference on the fast-path, and touches additional cache
> lines on the fast path as well, which is probably unwelcome. In addition, keeping track
> of this 2-level table adds extra complexity in userspace, and requires that user-space
> designs its data structure specifically for NUMA, which is unusual considering that NUMA
> is typically just an optimization hint to improve locality of memory accesses.

This is the strategy I looked at using previously.

> Hopefully I did not miss anything there.
>
> So here is how I did things differently.
>
> I realized that when userspace uses a rseq_abi()->cpu_id as index into per-cpu data structures,
> it expects that when any of the process' threads observe a numa_node_id when running on behalf
> of a given cpu_id once in the process lifetime, this topology is invariant [1]. IOW, the same
> cpu_id should always run on the same NUMA node in the future.
>
> This characteristic is what allows indexing by cpu_id to index data structures that have a
> good NUMA locality: on first use of a given cpu_id, memory allocation can be done on behalf of
> the right NUMA node, and then all per-cpu accesses are guaranteed to be local.
>
> So I applied this concept to vcpu_ids.
>
> The trick here is mostly to add a per-numa-node bitmap to the mm_struct in addition to the bitmap
> tracking the current vcpu_id allocation. The per-numa-node bitmap keeps track of which vcpu_ids
> have been allocated on behalf of each numa node in the past.
>
> So when a thread running on a given NUMA node needs to find the lowest vcpu_id which is available,
> it uses cpumask_first_one_and_zero(node_cpumask, cpumask) to find the first bit which has both
> been allocated already for this NUMA node, and is currently not in use by another thread.
>
> There is also a node_alloc_vcpumask bitmap which keeps track of which vcpu have already been
> allocated in the past, across all NUMA nodes. This allows scanning efficiently for the first
> vcpu which was _not_ yet allocated, and is currently unused with
> cpumask_first_zero_and_zero(node_alloc_cpumask, cpumask).
>
> With this, user-space can simply use the vm_vcpu_id field as index into the per-vcpu array,
> and the NUMA locality is implicit. Upon initial allocation of the numa-local memory, it just
> needs to read both vm_vcpu_id and numa_node_id fields within a rseq critical section to
> ensure there is no migration between the two field loads.
>
>
> So as long as the scheduler does a good job at keeping the number of threads per NUMA node
> relatively constant by pushing back against thread migration across NUMA nodes over the process
> lifetime, there should not be too many extra vcpu_ids needed. In the worse-case scenario, the
> number of vcpu_ids needed is equal to the number of online cpus in the system.

Interesting. It seems like this simplifies the fast path for using
the NUMA ID quite a bit.

As Peter mentions, our implementation prefers lower indexed vCPU IDs
over higher ones, but TCMalloc is agnostic to the precise IDs in use
(it looks at individual cache stats, not IDs, to make decisions).
Definitively skipping a vCPU ID on one node because it is used on
another seems entirely workable.

Chris



>
> The performance overhead of keeping track of those additional bitmaps for NUMA locality should
> be minimal considering that the only update needed in the per-numa-node bitmap and the
> node_alloc_vcpumask bitmap is the first time a vcpu_id is assigned within a process. The rest
> is lookup only. And even there, the optimizations I have put in place skip those lookups in
> the common scenarios entirely.
>
>
> Thanks,
>
> Mathieu
>
> [1] There is the exception of Power CPU hotplug which can reconfigure the NUMA topology,
> but this seems like a rather odd and rare corner-case. It is supported by my implementation,
> but userspace would have to deal with this kind of reconfiguration on its own to
> preserve NUMA locality.
>
> >
> > Chris Kennelly may offer more details here.
> >
> > Thanks,
> > Peter
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com