This feature allows the scheduler to expose a current virtual cpu id
to user-space. This virtual cpu id is within the possible cpus range,
and is temporarily (and uniquely) assigned while threads are actively
running within a memory space. If a memory space has fewer threads than
cores, or is limited to running on a few cores concurrently through sched
affinity or cgroup cpusets, the virtual cpu ids will be values close
to 0, thus allowing efficient use of user-space memory for per-cpu
data structures.
The vcpu_ids are NUMA-aware. On NUMA systems, when a vcpu_id is observed
by user-space to be associated with a NUMA node, it is guaranteed to
never change NUMA node unless a kernel-level NUMA configuration change
happens.
This feature is meant to be exposed by a new rseq thread area field.
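For illustration, here is a minimal user-space sketch of the intended usage.
It is not part of this patch: the rseq thread area field is introduced
separately, so the accessor rseq_current_vm_vcpu_id() used below is purely
hypothetical, as is the pools[] array. The point is only that vcpu ids
staying close to 0 let a flat per-vcpu array be densely used.

  struct percpu_pool {
          long counter;            /* stand-in for free lists, stats, ... */
  };

  /* One slot per possible CPU; mostly only the first few are touched. */
  extern struct percpu_pool pools[];
  /* Hypothetical accessor for the future rseq thread area field. */
  extern int rseq_current_vm_vcpu_id(void);

  static void percpu_account(long v)
  {
          /*
           * Sketch only: a real allocator would wrap this in an rseq
           * critical section (or use atomics), since the thread may be
           * preempted between reading the vcpu id and the update.
           */
          pools[rseq_current_vm_vcpu_id()].counter += v;
  }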
The primary purpose of this feature is to do the heavy-lifting needed
by memory allocators to allow them to use per-cpu data structures
efficiently in the following situations:
- Single-threaded applications,
- Multi-threaded applications on large systems (many cores) with limited
cpu affinity mask,
- Multi-threaded applications on large systems (many cores) with
restricted cgroup cpuset per container,
- Processes using memory from many NUMA nodes.
One of the key concerns from scheduler maintainers is the overhead
associated with additional atomic operations in the scheduler fast-path.
In order to save an atomic set-bit and an atomic clear-bit operation on the
scheduler context-switch fast path, the following optimizations are
implemented (a condensed sketch of the combined logic follows the list):
1) On context switch between threads belonging to the same memory space,
transfer the mm_vcpu_id from prev to next without any atomic ops.
This takes care of use-cases involving frequent context switches
between threads belonging to the same memory space.
2) Threads belonging to a memory space with a single user (mm_users==1)
can be assigned mm_vcpu_id=0 without any atomic operation on the
scheduler fast path. On non-NUMA systems, when a memory space goes from
single to multi-threaded, lazily allocate the vcpu_id 0 in the mm
vcpu mask. This takes care of all single-threaded use-cases
involving context switching between threads belonging to different
memory spaces.
With NUMA, the single-threaded memory space scenario is still
special-cased to eliminate all atomic operations on the fast path,
but rather than returning vcpu_id=0, return the current numa_node_id
to allow single-threaded memory spaces to keep good numa locality.
On systems where the number of cpu ids is lower than the number of
numa node ids, pick the first cpu in the node cpumask rather than the
node ID.
3) Introduce a per-runqueue cache containing { mm, vcpu_id } entries.
Keep track of the recently allocated vcpu_id for each mm rather than
freeing them immediately. This eliminates most atomic ops when
context switching back and forth between threads belonging to
different memory spaces in multi-threaded scenarios (many processes,
each with many threads).
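For reference, here is a condensed, non-authoritative sketch of how the three
cases above combine on the context-switch path. The complete logic, including
PF_EXITING/PF_KTHREAD filtering, the NUMA mask re-check and schedstats
accounting, is in switch_mm_vcpu() and mm_vcpu_get() in kernel/sched/sched.h
below; in the actual code the single-threaded case is handled inside
mm_vcpu_get().

  static inline void sketch_switch_mm_vcpu(struct rq *rq,
                                           struct task_struct *prev,
                                           struct task_struct *next)
  {
          if (next->mm && next->vcpu_mm_active) {
                  if (prev->mm == next->mm && prev->vcpu_mm_active) {
                          /* 1) Same memory space: hand over the vcpu id, no atomics. */
                          next->mm_vcpu = prev->mm_vcpu;
                  } else if (atomic_read(&next->mm->mm_vcpu_users) == 1) {
                          /* 2) Single-threaded mm: derive the vcpu id from the NUMA node. */
                          next->mm_vcpu = mm_vcpu_first_node_vcpu(numa_node_id());
                  } else {
                          /* 3) Per-runqueue { mm, vcpu_id } cache, then bitmap allocation. */
                          next->mm_vcpu = mm_vcpu_get(rq, next);
                  }
          }
          if (prev->mm && prev->vcpu_mm_active)
                  prev->mm_vcpu = -1;
  }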
The credit goes to Paul Turner (Google) for the vcpu_id idea. This
feature is implemented based on discussions with Paul Turner and
Peter Oskolkov (Google), but I took the liberty of implementing the
scheduler fast-path optimizations and my own NUMA-awareness scheme.
Rumor has it that Google has been running an rseq vcpu_id extension
internally in production for a year. The tcmalloc source code indeed has
comments hinting at a vcpu_id prototype extension to the rseq system
call [1].
schedstats:
* perf bench sched messaging (single instance, multi-process):
On sched-switch:
single-threaded vcpu-id: 99.985 %
transfer between threads: 0 %
runqueue cache hit: 0.015 %
runqueue cache eviction (bit-clear): 0 %
runqueue cache discard (bit-clear): 0 %
vcpu-id allocation (bit-set): 0 %
On release mm:
vcpu-id remove (bit-clear): 0 %
On migration:
vcpu-id remove (bit-clear): 0 %
* perf bench sched messaging -t (single instance, multi-thread):
On sched-switch:
single-threaded vcpu-id: 0.128 %
transfer between threads: 98.260 %
runqueue cache hit: 1.075 %
runqueue cache eviction (bit-clear): 0.001 %
runqueue cache discard (bit-clear): 0 %
vcpu-id allocation (bit-set): 0.269 %
On release mm:
vcpu-id remove (bit-clear): 0.161 %
On migration:
vcpu-id remove (bit-clear): 0.107 %
* perf bench sched messaging -t (two instances, multi-thread):
On sched-switch:
single-threaded vcpu-id: 0.081 %
transfer between threads: 89.512 %
runqueue cache hit: 9.659 %
runqueue cache eviction (bit-clear): 0.003 %
runqueue cache discard (bit-clear): 0 %
vcpu-id allocation (bit-set): 0.374 %
On release mm:
vcpu-id remove (bit-clear): 0.243 %
On migration:
vcpu-id remove (bit-clear): 0.129 %
* perf bench sched pipe (one instance, multi-process):
On sched-switch:
single-threaded vcpu-id: 99.993 %
transfer between threads: 0.001 %
runqueue cache hit: 0.002 %
runqueue cache eviction (bit-clear): 0 %
runqueue cache discard (bit-clear): 0 %
vcpu-id allocation (bit-set): 0.002 %
On release mm:
vcpu-id remove (bit-clear): 0 %
On migration:
vcpu-id remove (bit-clear): 0.002 %
[1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26
Signed-off-by: Mathieu Desnoyers <[email protected]>
---
fs/exec.c | 4 +
include/linux/mm.h | 25 +++
include/linux/mm_types.h | 111 ++++++++++++
include/linux/sched.h | 5 +
init/Kconfig | 4 +
kernel/fork.c | 15 +-
kernel/sched/core.c | 82 +++++++++
kernel/sched/deadline.c | 3 +
kernel/sched/debug.c | 13 ++
kernel/sched/fair.c | 1 +
kernel/sched/rt.c | 2 +
kernel/sched/sched.h | 364 +++++++++++++++++++++++++++++++++++++++
kernel/sched/stats.c | 16 +-
13 files changed, 642 insertions(+), 3 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..7b7520b63e95 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1006,6 +1006,10 @@ static int exec_mmap(struct mm_struct *mm)
active_mm = tsk->active_mm;
tsk->active_mm = mm;
tsk->mm = mm;
+ mm_init_vcpu_users(mm);
+ mm_init_vcpumask(mm);
+ mm_init_node_vcpumask(mm);
+ sched_vcpu_activate_mm(tsk, mm);
/*
* This prevents preemption while active_mm is being loaded and
* it and mm are being updated, which could cause problems for
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1a84b1e6787..6ca8a4a85fcd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3374,5 +3374,30 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
}
#endif
+#ifdef CONFIG_SCHED_MM_VCPU
+void sched_vcpu_release_mm(struct task_struct *t, struct mm_struct *mm);
+void sched_vcpu_activate_mm(struct task_struct *t, struct mm_struct *mm);
+void sched_vcpu_get_mm(struct task_struct *t, struct mm_struct *mm);
+void sched_vcpu_dup_mm(struct task_struct *t, struct mm_struct *mm);
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+ return t->mm_vcpu;
+}
+#else
+static inline void sched_vcpu_release_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline void sched_vcpu_activate_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline void sched_vcpu_get_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline void sched_vcpu_dup_mm(struct task_struct *t, struct mm_struct *mm) { }
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+ /*
+ * Use the processor id as a fall-back when the mm vcpu feature is
+ * disabled. This provides functional per-cpu data structure accesses
+ * in user-space, although it won't provide the memory usage benefits.
+ */
+ return raw_smp_processor_id();
+}
+#endif
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9db36dc5d4cf..40fcc526396f 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -17,6 +17,7 @@
#include <linux/page-flags-layout.h>
#include <linux/workqueue.h>
#include <linux/seqlock.h>
+#include <linux/nodemask.h>
#include <asm/mmu.h>
@@ -502,6 +503,20 @@ struct mm_struct {
*/
atomic_t mm_count;
+#ifdef CONFIG_SCHED_MM_VCPU
+ /**
+ * @mm_vcpu_users: The number of references to &struct mm_struct
+ * from user-space threads.
+ *
+ * Initialized to 1 for the first thread with a reference to
+ * the mm. Incremented for each thread getting a reference to the
+ * mm, and decremented on mm release from user-space threads.
+ * Used to enable single-threaded mm_vcpu accounting (when == 1).
+ */
+
+ atomic_t mm_vcpu_users;
+#endif
+
#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* PTE page table pages */
#endif
@@ -659,6 +674,102 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
return (struct cpumask *)&mm->cpu_bitmap;
}
+#ifdef CONFIG_SCHED_MM_VCPU
+/* Future-safe accessor for struct mm_struct's vcpu_mask. */
+static inline cpumask_t *mm_vcpumask(struct mm_struct *mm)
+{
+ unsigned long vcpu_bitmap = (unsigned long)mm;
+
+ vcpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+ /* Skip cpu_bitmap */
+ vcpu_bitmap += cpumask_size();
+ return (struct cpumask *)vcpu_bitmap;
+}
+
+static inline void mm_init_vcpu_users(struct mm_struct *mm)
+{
+ atomic_set(&mm->mm_vcpu_users, 1);
+}
+
+static inline void mm_init_vcpumask(struct mm_struct *mm)
+{
+ cpumask_clear(mm_vcpumask(mm));
+}
+
+static inline unsigned int mm_vcpumask_size(void)
+{
+ return cpumask_size();
+}
+
+#else
+static inline cpumask_t *mm_vcpumask(struct mm_struct *mm)
+{
+ return NULL;
+}
+
+static inline void mm_init_vcpu_users(struct mm_struct *mm) { }
+static inline void mm_init_vcpumask(struct mm_struct *mm) { }
+
+static inline unsigned int mm_vcpumask_size(void)
+{
+ return 0;
+}
+#endif
+
+#if defined(CONFIG_SCHED_MM_VCPU) && defined(CONFIG_NUMA)
+/*
+ * Layout of node vcpumasks:
+ * - node_alloc vcpumask: cpumask tracking which vcpu_id were
+ * allocated (across nodes) in this
+ * memory space.
+ * - node vcpumask[nr_node_ids]: per-node cpumask tracking which vcpu_id
+ * were allocated in this memory space.
+ */
+static inline cpumask_t *mm_node_alloc_vcpumask(struct mm_struct *mm)
+{
+ unsigned long vcpu_bitmap = (unsigned long)mm_vcpumask(mm);
+
+ /* Skip mm_vcpumask */
+ vcpu_bitmap += cpumask_size();
+ return (struct cpumask *)vcpu_bitmap;
+}
+
+static inline cpumask_t *mm_node_vcpumask(struct mm_struct *mm, unsigned int node)
+{
+ unsigned long vcpu_bitmap = (unsigned long)mm_node_alloc_vcpumask(mm);
+
+ /* Skip node alloc vcpumask */
+ vcpu_bitmap += cpumask_size();
+ vcpu_bitmap += node * cpumask_size();
+ return (struct cpumask *)vcpu_bitmap;
+}
+
+static inline void mm_init_node_vcpumask(struct mm_struct *mm)
+{
+ unsigned int node;
+
+ if (num_possible_nodes() == 1)
+ return;
+ cpumask_clear(mm_node_alloc_vcpumask(mm));
+ for (node = 0; node < nr_node_ids; node++)
+ cpumask_clear(mm_node_vcpumask(mm, node));
+}
+
+static inline unsigned int mm_node_vcpumask_size(void)
+{
+ if (num_possible_nodes() == 1)
+ return 0;
+ return (nr_node_ids + 1) * cpumask_size();
+}
+#else
+static inline void mm_init_node_vcpumask(struct mm_struct *mm) { }
+
+static inline unsigned int mm_node_vcpumask_size(void)
+{
+ return 0;
+}
+#endif
+
struct mmu_gather;
extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 838c9e0b4cae..c400d44f8716 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1300,6 +1300,11 @@ struct task_struct {
unsigned long rseq_event_mask;
#endif
+#ifdef CONFIG_SCHED_MM_VCPU
+ int mm_vcpu; /* Current vcpu in mm */
+ int vcpu_mm_active;
+#endif
+
struct tlbflush_unmap_batch tlb_ubc;
union {
diff --git a/init/Kconfig b/init/Kconfig
index e9119bf54b1f..6bd40f303a0d 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1023,6 +1023,10 @@ config RT_GROUP_SCHED
endif #CGROUP_SCHED
+config SCHED_MM_VCPU
+ def_bool y
+ depends on SMP && RSEQ
+
config UCLAMP_TASK_GROUP
bool "Utilization clamping per group of tasks"
depends on CGROUP_SCHED
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..78fcf3277540 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -970,6 +970,11 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
#ifdef CONFIG_MEMCG
tsk->active_memcg = NULL;
#endif
+
+#ifdef CONFIG_SCHED_MM_VCPU
+ tsk->mm_vcpu = 0;
+ tsk->vcpu_mm_active = 0;
+#endif
return tsk;
free_stack:
@@ -1079,6 +1084,9 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
goto fail_nocontext;
mm->user_ns = get_user_ns(user_ns);
+ mm_init_vcpu_users(mm);
+ mm_init_vcpumask(mm);
+ mm_init_node_vcpumask(mm);
return mm;
fail_nocontext:
@@ -1380,6 +1388,8 @@ static int wait_for_vfork_done(struct task_struct *child,
*/
static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
{
+ sched_vcpu_release_mm(tsk, mm);
+
uprobe_free_utask(tsk);
/* Get rid of any cached register state */
@@ -1499,10 +1509,12 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
if (clone_flags & CLONE_VM) {
mmget(oldmm);
mm = oldmm;
+ sched_vcpu_get_mm(tsk, mm);
} else {
mm = dup_mm(tsk, current->mm);
if (!mm)
return -ENOMEM;
+ sched_vcpu_dup_mm(tsk, mm);
}
tsk->mm = mm;
@@ -2901,7 +2913,8 @@ void __init proc_caches_init(void)
* dynamically sized based on the maximum CPU number this system
* can have, taking hotplug into account (nr_cpu_ids).
*/
- mm_size = sizeof(struct mm_struct) + cpumask_size();
+ mm_size = sizeof(struct mm_struct) + cpumask_size() + mm_vcpumask_size() +
+ mm_node_vcpumask_size();
mm_cachep = kmem_cache_create_usercopy("mm_struct",
mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1e08b02e0cd5..70bf2899c9b3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2267,6 +2267,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
lockdep_assert_rq_held(rq);
deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+ rq_vcpu_cache_remove_mm_locked(rq, p->mm, false);
set_task_cpu(p, new_cpu);
rq_unlock(rq, rf);
@@ -2454,6 +2455,7 @@ int push_cpu_stop(void *arg)
// XXX validate p is still the highest prio task
if (task_rq(p) == rq) {
deactivate_task(rq, p, 0);
+ rq_vcpu_cache_remove_mm_locked(rq, p->mm, false);
set_task_cpu(p, lowest_rq->cpu);
activate_task(lowest_rq, p, 0);
resched_curr(lowest_rq);
@@ -3093,6 +3095,7 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
rq_pin_lock(dst_rq, &drf);
deactivate_task(src_rq, p, 0);
+ rq_vcpu_cache_remove_mm_locked(src_rq, p->mm, false);
set_task_cpu(p, cpu);
activate_task(dst_rq, p, 0);
check_preempt_curr(dst_rq, p, 0);
@@ -3716,6 +3719,8 @@ static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags
p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
WRITE_ONCE(rq->ttwu_pending, 1);
+ if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
+ rq_vcpu_cache_remove_mm(task_rq(p), p->mm, false);
__smp_call_single_queue(cpu, &p->wake_entry.llist);
}
@@ -4125,6 +4130,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
wake_flags |= WF_MIGRATED;
psi_ttwu_dequeue(p);
+ rq_vcpu_cache_remove_mm(task_rq(p), p->mm, false);
set_task_cpu(p, cpu);
}
#else
@@ -4796,6 +4802,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
rseq_preempt(prev);
+ switch_mm_vcpu(rq, prev, next);
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
@@ -5922,6 +5929,7 @@ static bool try_steal_cookie(int this, int that)
goto next;
deactivate_task(src, p, 0);
+ rq_vcpu_cache_remove_mm_locked(src, p->mm, false);
set_task_cpu(p, this);
activate_task(dst, p, 0);
@@ -10927,3 +10935,77 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
{
trace_sched_update_nr_running_tp(rq, count);
}
+
+#ifdef CONFIG_SCHED_MM_VCPU
+void sched_vcpu_release_mm(struct task_struct *t, struct mm_struct *mm)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (!mm)
+ return;
+ WARN_ON_ONCE(t != current);
+ preempt_disable();
+ rq = this_rq();
+ rq_lock_irqsave(rq, &rf);
+ t->vcpu_mm_active = 0;
+ atomic_dec(&mm->mm_vcpu_users);
+ rq_vcpu_cache_remove_mm_locked(rq, mm, true);
+ rq_unlock_irqrestore(rq, &rf);
+ t->mm_vcpu = -1;
+ preempt_enable();
+}
+
+void sched_vcpu_activate_mm(struct task_struct *t, struct mm_struct *mm)
+{
+ WARN_ON_ONCE(t != current);
+ preempt_disable();
+ t->vcpu_mm_active = 1;
+ /* No need to reserve in cpumask because single-threaded. */
+ t->mm_vcpu = mm_vcpu_first_node_vcpu(numa_node_id());
+ preempt_enable();
+}
+
+void sched_vcpu_get_mm(struct task_struct *t, struct mm_struct *mm)
+{
+ int vcpu, mm_vcpu_users;
+ struct rq_flags rf;
+ struct rq *rq;
+
+ preempt_disable();
+ rq = this_rq();
+ t->vcpu_mm_active = 1;
+ mm_vcpu_users = atomic_read(&mm->mm_vcpu_users);
+ atomic_inc(&mm->mm_vcpu_users);
+ t->mm_vcpu = -1;
+ vcpu = current->mm_vcpu;
+ rq_lock_irqsave(rq, &rf);
+ /* On transition from 1 to 2 mm users, reserve vcpu ids. */
+ if (mm_vcpu_users == 1) {
+ mm_vcpu_reserve_nodes(mm);
+ rq_vcpu_cache_remove_mm_locked(rq, mm, true);
+ current->mm_vcpu = __mm_vcpu_get(rq, mm);
+ rq_vcpu_cache_add(rq, mm, current->mm_vcpu);
+ /*
+ * __mm_vcpu_get could get a different vcpu after going
+ * multi-threaded, then back single-threaded, then
+ * multi-threaded on a NUMA configuration using the first CPU
+ * matching the NUMA node as single-threaded vcpu, with
+ * leftover vcpu_id matching the NUMA node set from when this
+ * task was multithreaded.
+ */
+ if (current->mm_vcpu != vcpu)
+ rseq_set_notify_resume(current);
+ }
+ rq_unlock_irqrestore(rq, &rf);
+ preempt_enable();
+}
+
+void sched_vcpu_dup_mm(struct task_struct *t, struct mm_struct *mm)
+{
+ preempt_disable();
+ t->vcpu_mm_active = 1;
+ t->mm_vcpu = -1;
+ preempt_enable();
+}
+#endif
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d2c072b0ef01..f4f394db2db8 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -655,6 +655,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
raw_spin_unlock(&dl_b->lock);
+ rq_vcpu_cache_remove_mm_locked(rq, p->mm, false);
set_task_cpu(p, later_rq->cpu);
double_unlock_balance(later_rq, rq);
@@ -2290,6 +2291,7 @@ static int push_dl_task(struct rq *rq)
}
deactivate_task(rq, next_task, 0);
+ rq_vcpu_cache_remove_mm_locked(rq, next_task->mm, false);
set_task_cpu(next_task, later_rq->cpu);
/*
@@ -2386,6 +2388,7 @@ static void pull_dl_task(struct rq *this_rq)
push_task = get_push_task(src_rq);
} else {
deactivate_task(src_rq, p, 0);
+ rq_vcpu_cache_remove_mm_locked(src_rq, p->mm, false);
set_task_cpu(p, this_cpu);
activate_task(this_rq, p, 0);
dmin = p->dl.deadline;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 102d6f70e84d..3b44f8dd064d 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -763,6 +763,19 @@ do { \
P(sched_goidle);
P(ttwu_count);
P(ttwu_local);
+#undef P
+#define P(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, schedstat_val(rq->n));
+ P(nr_vcpu_thread_transfer);
+ P(nr_vcpu_cache_hit);
+ P(nr_vcpu_cache_evict);
+ P(nr_vcpu_cache_discard_wrong_node);
+ P(nr_vcpu_allocate);
+ P(nr_vcpu_allocate_node_reuse);
+ P(nr_vcpu_allocate_node_new);
+ P(nr_vcpu_allocate_node_rebalance);
+ P(nr_vcpu_allocate_node_steal);
+ P(nr_vcpu_remove_release_mm);
+ P(nr_vcpu_remove_migrate);
}
#undef P
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dcbd3110c687..9c8b88e57315 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7817,6 +7817,7 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
lockdep_assert_rq_held(env->src_rq);
deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
+ rq_vcpu_cache_remove_mm_locked(env->src_rq, p->mm, false);
set_task_cpu(p, env->dst_cpu);
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7b4f4fbbb404..fd37e23612f9 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2106,6 +2106,7 @@ static int push_rt_task(struct rq *rq, bool pull)
}
deactivate_task(rq, next_task, 0);
+ rq_vcpu_cache_remove_mm_locked(rq, next_task->mm, false);
set_task_cpu(next_task, lowest_rq->cpu);
activate_task(lowest_rq, next_task, 0);
resched_curr(lowest_rq);
@@ -2379,6 +2380,7 @@ static void pull_rt_task(struct rq *this_rq)
push_task = get_push_task(src_rq);
} else {
deactivate_task(src_rq, p, 0);
+ rq_vcpu_cache_remove_mm_locked(src_rq, p->mm, false);
set_task_cpu(p, this_cpu);
activate_task(this_rq, p, 0);
resched = true;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3da5718cd641..5034a7372452 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -916,6 +916,19 @@ struct uclamp_rq {
DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
#endif /* CONFIG_UCLAMP_TASK */
+#ifdef CONFIG_SCHED_MM_VCPU
+# define RQ_VCPU_CACHE_SIZE 8
+struct rq_vcpu_entry {
+ struct mm_struct *mm; /* NULL if unset */
+ int vcpu_id;
+};
+
+struct rq_vcpu_cache {
+ struct rq_vcpu_entry entry[RQ_VCPU_CACHE_SIZE];
+ unsigned int head;
+};
+#endif
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -1086,6 +1099,19 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ unsigned long long nr_vcpu_single_thread;
+ unsigned long long nr_vcpu_thread_transfer;
+ unsigned long long nr_vcpu_cache_hit;
+ unsigned long long nr_vcpu_cache_evict;
+ unsigned long long nr_vcpu_cache_discard_wrong_node;
+ unsigned long long nr_vcpu_allocate;
+ unsigned long long nr_vcpu_allocate_node_reuse;
+ unsigned long long nr_vcpu_allocate_node_new;
+ unsigned long long nr_vcpu_allocate_node_rebalance;
+ unsigned long long nr_vcpu_allocate_node_steal;
+ unsigned long long nr_vcpu_remove_release_mm;
+ unsigned long long nr_vcpu_remove_migrate;
#endif
#ifdef CONFIG_CPU_IDLE
@@ -1116,6 +1142,10 @@ struct rq {
unsigned int core_forceidle_occupation;
u64 core_forceidle_start;
#endif
+
+#ifdef CONFIG_SCHED_MM_VCPU
+ struct rq_vcpu_cache vcpu_cache;
+#endif
};
#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -3137,3 +3167,337 @@ extern int sched_dynamic_mode(const char *str);
extern void sched_dynamic_update(int mode);
#endif
+#ifdef CONFIG_SCHED_MM_VCPU
+static inline int __mm_vcpu_get_single_node(struct mm_struct *mm)
+{
+ struct cpumask *cpumask;
+ int vcpu;
+
+ cpumask = mm_vcpumask(mm);
+ /* Atomically reserve lowest available vcpu number. */
+ do {
+ vcpu = cpumask_first_zero(cpumask);
+ if (vcpu >= nr_cpu_ids)
+ return -1;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ return vcpu;
+}
+
+#ifdef CONFIG_NUMA
+static inline bool mm_node_vcpumask_test_cpu(struct mm_struct *mm, int vcpu_id)
+{
+ if (num_possible_nodes() == 1)
+ return true;
+ return cpumask_test_cpu(vcpu_id, mm_node_vcpumask(mm, numa_node_id()));
+}
+
+static inline int __mm_vcpu_get(struct rq *rq, struct mm_struct *mm)
+{
+ struct cpumask *cpumask = mm_vcpumask(mm),
+ *node_cpumask = mm_node_vcpumask(mm, numa_node_id()),
+ *node_alloc_cpumask = mm_node_alloc_vcpumask(mm);
+ unsigned int node;
+ int vcpu;
+
+ if (num_possible_nodes() == 1)
+ return __mm_vcpu_get_single_node(mm);
+
+ /*
+ * Try to atomically reserve lowest available vcpu number within those
+ * already reserved for this NUMA node.
+ */
+ do {
+ vcpu = cpumask_first_one_and_zero(node_cpumask, cpumask);
+ if (vcpu >= nr_cpu_ids)
+ goto alloc_numa;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ schedstat_inc(rq->nr_vcpu_allocate_node_reuse);
+ goto end;
+
+alloc_numa:
+ /*
+ * Try to atomically reserve lowest available vcpu number within those
+ * not already allocated for numa nodes.
+ */
+ do {
+ vcpu = cpumask_first_zero_and_zero(node_alloc_cpumask, cpumask);
+ if (vcpu >= nr_cpu_ids)
+ goto numa_update;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ cpumask_set_cpu(vcpu, node_cpumask);
+ cpumask_set_cpu(vcpu, node_alloc_cpumask);
+ schedstat_inc(rq->nr_vcpu_allocate_node_new);
+ goto end;
+
+numa_update:
+ /*
+ * NUMA node id configuration changed for at least one CPU in the system.
+ * We need to steal a currently unused vcpu_id from an overprovisioned
+ * node for our current node. Userspace must handle the fact that the
+ * node id associated with this vcpu_id may change due to node ID
+ * reconfiguration.
+ *
+ * Count how many possible cpus are attached to each (other) node id,
+ * and compare this with the per-mm node vcpumask cpu count. Find one
+ * which has too many cpus in its mask to steal from.
+ */
+ for (node = 0; node < nr_node_ids; node++) {
+ struct cpumask *iter_cpumask;
+
+ if (node == numa_node_id())
+ continue;
+ iter_cpumask = mm_node_vcpumask(mm, node);
+ if (nr_cpus_node(node) < cpumask_weight(iter_cpumask)) {
+ /* Try to steal from this node. */
+ do {
+ vcpu = cpumask_first_one_and_zero(iter_cpumask, cpumask);
+ if (vcpu >= nr_cpu_ids)
+ goto steal_fail;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ cpumask_clear_cpu(vcpu, iter_cpumask);
+ cpumask_set_cpu(vcpu, node_cpumask);
+ schedstat_inc(rq->nr_vcpu_allocate_node_rebalance);
+ goto end;
+ }
+ }
+
+steal_fail:
+ /*
+ * Our attempt at gracefully stealing a vcpu_id from another
+ * overprovisioned NUMA node failed. Fallback to grabbing the first
+ * available vcpu_id.
+ */
+ do {
+ vcpu = cpumask_first_zero(cpumask);
+ if (vcpu >= nr_cpu_ids)
+ return -1;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ /* Steal vcpu from its numa node mask. */
+ for (node = 0; node < nr_node_ids; node++) {
+ struct cpumask *iter_cpumask;
+
+ if (node == numa_node_id())
+ continue;
+ iter_cpumask = mm_node_vcpumask(mm, node);
+ if (cpumask_test_cpu(vcpu, iter_cpumask)) {
+ cpumask_clear_cpu(vcpu, iter_cpumask);
+ break;
+ }
+ }
+ cpumask_set_cpu(vcpu, node_cpumask);
+ schedstat_inc(rq->nr_vcpu_allocate_node_steal);
+end:
+ return vcpu;
+}
+
+static inline int mm_vcpu_first_node_vcpu(int node)
+{
+ int vcpu;
+
+ if (likely(nr_cpu_ids >= nr_node_ids))
+ return node;
+ vcpu = cpumask_first(cpumask_of_node(node));
+ if (vcpu >= nr_cpu_ids)
+ return -1;
+ return vcpu;
+}
+
+/*
+ * Single-threaded processes observe a mapping of vcpu_id->node_id where
+ * the vcpu_id returned corresponds to mm_vcpu_first_node_vcpu(). When going
+ * from single to multi-threaded, reserve this same mapping so it stays
+ * invariant.
+ */
+static inline void mm_vcpu_reserve_nodes(struct mm_struct *mm)
+{
+ struct cpumask *node_alloc_cpumask = mm_node_alloc_vcpumask(mm);
+ int node, other_node;
+
+ for (node = 0; node < nr_node_ids; node++) {
+ struct cpumask *iter_cpumask = mm_node_vcpumask(mm, node);
+ int vcpu = mm_vcpu_first_node_vcpu(node);
+
+ /* Skip nodes that have no CPU associated with them. */
+ if (vcpu < 0)
+ continue;
+ cpumask_set_cpu(vcpu, iter_cpumask);
+ cpumask_set_cpu(vcpu, node_alloc_cpumask);
+ for (other_node = 0; other_node < nr_node_ids; other_node++) {
+ if (other_node == node)
+ continue;
+ cpumask_clear_cpu(vcpu, mm_node_vcpumask(mm, other_node));
+ }
+ }
+}
+#else
+static inline bool mm_node_vcpumask_test_cpu(struct mm_struct *mm, int vcpu_id)
+{
+ return true;
+}
+static inline int __mm_vcpu_get(struct rq *rq, struct mm_struct *mm)
+{
+ return __mm_vcpu_get_single_node(mm);
+}
+static inline int mm_vcpu_first_node_vcpu(int node)
+{
+ return 0;
+}
+static inline void mm_vcpu_reserve_nodes(struct mm_struct *mm) { }
+#endif
+
+static inline void __mm_vcpu_put(struct mm_struct *mm, int vcpu)
+{
+ if (vcpu < 0)
+ return;
+ cpumask_clear_cpu(vcpu, mm_vcpumask(mm));
+}
+
+static inline struct rq_vcpu_entry *rq_vcpu_cache_lookup(struct rq *rq, struct mm_struct *mm)
+{
+ struct rq_vcpu_cache *vcpu_cache = &rq->vcpu_cache;
+ int i;
+
+ for (i = 0; i < RQ_VCPU_CACHE_SIZE; i++) {
+ struct rq_vcpu_entry *entry = &vcpu_cache->entry[i];
+
+ if (entry->mm == mm)
+ return entry;
+ }
+ return NULL;
+}
+
+/* Removal from cache simply leaves an unused hole. */
+static inline int rq_vcpu_cache_lookup_remove(struct rq *rq, struct mm_struct *mm)
+{
+ struct rq_vcpu_entry *entry = rq_vcpu_cache_lookup(rq, mm);
+
+ if (!entry)
+ return -1;
+ entry->mm = NULL; /* Remove from cache */
+ return entry->vcpu_id;
+}
+
+static inline void rq_vcpu_cache_remove_mm_locked(struct rq *rq, struct mm_struct *mm,
+ bool release_mm)
+{
+ int vcpu;
+
+ if (!mm)
+ return;
+ /*
+ * Do not remove the cache entry for a runqueue that runs a task which
+ * currently uses the target mm.
+ */
+ if (!release_mm && rq->curr->mm == mm)
+ return;
+ vcpu = rq_vcpu_cache_lookup_remove(rq, mm);
+ if (vcpu < 0)
+ return;
+ if (release_mm)
+ schedstat_inc(rq->nr_vcpu_remove_release_mm);
+ else
+ schedstat_inc(rq->nr_vcpu_remove_migrate);
+ __mm_vcpu_put(mm, vcpu);
+}
+
+static inline void rq_vcpu_cache_remove_mm(struct rq *rq, struct mm_struct *mm,
+ bool release_mm)
+{
+ struct rq_flags rf;
+
+ rq_lock_irqsave(rq, &rf);
+ rq_vcpu_cache_remove_mm_locked(rq, mm, release_mm);
+ rq_unlock_irqrestore(rq, &rf);
+}
+
+/*
+ * Add at head, move head forward. Cheap LRU cache.
+ * Only need to clear the vcpu mask bit from its own mm_vcpumask(mm) when we
+ * overwrite an old entry from the cache. Note that this is not needed if the
+ * overwritten entry is an unused hole. This access to the old_mm from an
+ * unrelated thread requires that cache entry for a given mm gets pruned from
+ * the cache when a task is dequeued from the runqueue.
+ */
+static inline void rq_vcpu_cache_add(struct rq *rq, struct mm_struct *mm,
+ int vcpu_id)
+{
+ struct rq_vcpu_cache *vcpu_cache = &rq->vcpu_cache;
+ struct mm_struct *old_mm;
+ struct rq_vcpu_entry *entry;
+ unsigned int pos;
+
+ pos = vcpu_cache->head;
+ entry = &vcpu_cache->entry[pos];
+ old_mm = entry->mm;
+ if (old_mm) {
+ schedstat_inc(rq->nr_vcpu_cache_evict);
+ __mm_vcpu_put(old_mm, entry->vcpu_id);
+ }
+ entry->mm = mm;
+ entry->vcpu_id = vcpu_id;
+ vcpu_cache->head = (pos + 1) % RQ_VCPU_CACHE_SIZE;
+}
+
+static inline int mm_vcpu_get(struct rq *rq, struct task_struct *t)
+{
+ struct rq_vcpu_entry *entry;
+ struct mm_struct *mm = t->mm;
+ int vcpu;
+
+ /* Skip allocation if mm is single-threaded. */
+ if (atomic_read(&mm->mm_vcpu_users) == 1) {
+ schedstat_inc(rq->nr_vcpu_single_thread);
+ vcpu = mm_vcpu_first_node_vcpu(numa_node_id());
+ goto end;
+ }
+ entry = rq_vcpu_cache_lookup(rq, mm);
+ if (likely(entry)) {
+ vcpu = entry->vcpu_id;
+ if (likely(mm_node_vcpumask_test_cpu(mm, vcpu))) {
+ schedstat_inc(rq->nr_vcpu_cache_hit);
+ goto end;
+ } else {
+ schedstat_inc(rq->nr_vcpu_cache_discard_wrong_node);
+ entry->mm = NULL; /* Remove from cache */
+ __mm_vcpu_put(mm, vcpu);
+ }
+ }
+ schedstat_inc(rq->nr_vcpu_allocate);
+ vcpu = __mm_vcpu_get(rq, mm);
+ rq_vcpu_cache_add(rq, mm, vcpu);
+end:
+ return vcpu;
+}
+
+static inline void switch_mm_vcpu(struct rq *rq, struct task_struct *prev,
+ struct task_struct *next)
+{
+ if (!(next->flags & PF_EXITING) && !(next->flags & PF_KTHREAD) &&
+ next->mm && next->vcpu_mm_active) {
+ if (!(prev->flags & PF_EXITING) && !(prev->flags & PF_KTHREAD) &&
+ prev->mm == next->mm && prev->vcpu_mm_active &&
+ mm_node_vcpumask_test_cpu(next->mm, prev->mm_vcpu)) {
+ /*
+ * Switching between threads with the same mm. Simply pass the
+ * vcpu token along to the next thread.
+ */
+ schedstat_inc(rq->nr_vcpu_thread_transfer);
+ next->mm_vcpu = prev->mm_vcpu;
+ } else {
+ next->mm_vcpu = mm_vcpu_get(rq, next);
+ }
+ }
+ if (!(prev->flags & PF_EXITING) && !(prev->flags & PF_KTHREAD) &&
+ prev->mm && prev->vcpu_mm_active)
+ prev->mm_vcpu = -1;
+}
+
+#else
+static inline void switch_mm_vcpu(struct rq *rq, struct task_struct *prev,
+ struct task_struct *next) { }
+static inline void rq_vcpu_cache_remove_mm_locked(struct rq *rq,
+ struct mm_struct *mm,
+ bool release_mm) { }
+static inline void rq_vcpu_cache_remove_mm(struct rq *rq, struct mm_struct *mm,
+ bool release_mm) { }
+#endif
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 07dde2928c79..027d0caf2d14 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -134,12 +134,24 @@ static int show_schedstat(struct seq_file *seq, void *v)
/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->nr_vcpu_single_thread,
+ rq->nr_vcpu_thread_transfer,
+ rq->nr_vcpu_cache_hit,
+ rq->nr_vcpu_cache_evict,
+ rq->nr_vcpu_cache_discard_wrong_node,
+ rq->nr_vcpu_allocate,
+ rq->nr_vcpu_allocate_node_reuse,
+ rq->nr_vcpu_allocate_node_new,
+ rq->nr_vcpu_allocate_node_rebalance,
+ rq->nr_vcpu_allocate_node_steal,
+ rq->nr_vcpu_remove_release_mm,
+ rq->nr_vcpu_remove_migrate);
seq_printf(seq, "\n");
--
2.17.1
This feature allows the scheduler to expose a current virtual cpu id
to user-space. This virtual cpu id is within the possible cpus range,
and is temporarily (and uniquely) assigned while threads are actively
running within a memory space. If a memory space has fewer threads than
cores, or is limited to running on a few cores concurrently through sched
affinity or cgroup cpusets, the virtual cpu ids will be values close
to 0, thus allowing efficient use of user-space memory for per-cpu
data structures.
The vcpu_ids are NUMA-aware. On NUMA systems, when a vcpu_id is observed
by user-space to be associated with a NUMA node, it is guaranteed to
never change NUMA node unless a kernel-level NUMA configuration change
happens.
This feature is meant to be exposed by a new rseq thread area field.
The primary purpose of this feature is to do the heavy-lifting needed
by memory allocators to allow them to use per-cpu data structures
efficiently in the following situations:
- Single-threaded applications,
- Multi-threaded applications on large systems (many cores) with limited
cpu affinity mask,
- Multi-threaded applications on large systems (many cores) with
restricted cgroup cpuset per container,
- Processes using memory from many NUMA nodes.
One of the key concerns from scheduler maintainers is the overhead
associated with additional atomic operations in the scheduler fast-path.
In order to save an atomic set-bit and an atomic clear-bit operation on the
scheduler context-switch fast path, the following optimizations are
implemented:
1) On context switch between threads belonging to the same memory space,
transfer the mm_vcpu_id from prev to next without any atomic ops.
This takes care of use-cases involving frequent context switches
between threads belonging to the same memory space.
2) Threads belonging to a memory space with a single user (mm_users==1)
can be assigned mm_vcpu_id=0 without any atomic operation on the
scheduler fast path. On non-NUMA systems, when a memory space goes from
single to multi-threaded, lazily allocate the vcpu_id 0 in the mm
vcpu mask. This takes care of all single-threaded use-cases
involving context switching between threads belonging to different
memory spaces.
With NUMA, the single-threaded memory space scenario is still
special-cased to eliminate all atomic operations on the fast path,
but rather than returning vcpu_id=0, return the current numa_node_id
to allow single-threaded memory spaces to keep good numa locality.
On systems where the number of cpu ids is lower than the number of
numa node ids, pick the first cpu in the node cpumask rather than the
node ID.
3) Introduce a per-runqueue cache containing { mm, vcpu_id } entries.
Keep track of the recently allocated vcpu_id for each mm rather than
freeing them immediately. This eliminates most atomic ops when
context switching back and forth between threads belonging to
different memory spaces in multi-threaded scenarios (many processes,
each with many threads).
The credit goes to Paul Turner (Google) for the vcpu_id idea. This
feature is implemented based on discussions with Paul Turner and
Peter Oskolkov (Google), but I took the liberty of implementing the
scheduler fast-path optimizations and my own NUMA-awareness scheme.
Rumor has it that Google has been running an rseq vcpu_id extension
internally in production for a year. The tcmalloc source code indeed has comments
hinting at a vcpu_id prototype extension to the rseq system call [1].
schedstats:
* perf bench sched messaging (single instance, multi-process):
On sched-switch:
single-threaded vcpu-id: 99.985 %
transfer between threads: 0 %
runqueue cache hit: 0.015 %
runqueue cache eviction (bit-clear): 0 %
runqueue cache discard (bit-clear): 0 %
vcpu-id allocation (bit-set): 0 %
On release mm:
vcpu-id remove (bit-clear): 0 %
On migration:
vcpu-id remove (bit-clear): 0 %
* perf bench sched messaging -t (single instance, multi-thread):
On sched-switch:
single-threaded vcpu-id: 0.128 %
transfer between threads: 98.260 %
runqueue cache hit: 1.075 %
runqueue cache eviction (bit-clear): 0.001 %
runqueue cache discard (bit-clear): 0 %
vcpu-id allocation (bit-set): 0.269 %
On release mm:
vcpu-id remove (bit-clear): 0.161 %
On migration:
vcpu-id remove (bit-clear): 0.107 %
* perf bench sched messaging -t (two instances, multi-thread):
On sched-switch:
single-threaded vcpu-id: 0.081 %
transfer between threads: 89.512 %
runqueue cache hit: 9.659 %
runqueue cache eviction (bit-clear): 0.003 %
runqueue cache discard (bit-clear): 0 %
vcpu-id allocation (bit-set): 0.374 %
On release mm:
vcpu-id remove (bit-clear): 0.243 %
On migration:
vcpu-id remove (bit-clear): 0.129 %
* perf bench sched pipe (one instance, multi-process):
On sched-switch:
single-threaded vcpu-id: 99.993 %
transfer between threads: 0.001 %
runqueue cache hit: 0.002 %
runqueue cache eviction (bit-clear): 0 %
runqueue cache discard (bit-clear): 0 %
vcpu-id allocation (bit-set): 0.002 %
On release mm:
vcpu-id remove (bit-clear): 0 %
On migration:
vcpu-id remove (bit-clear): 0.002 %
[1] https://github.com/google/tcmalloc/blob/master/tcmalloc/internal/linux_syscall_support.h#L26
Signed-off-by: Mathieu Desnoyers <[email protected]>
---
Changes since v2:
- Fix a race between migration and release_mm, leading to a mm pointer
leak in the runqueue vcpu cache: clearing the cache on migration
  should validate that the target mm is not the current mm on the target
  runqueue _and_ that the current task on that runqueue is not exiting
or half-way through exec.
- Introduce struct vcpu_domain to encapsulate the cpumasks and user
  count. This will make it easier to also use this feature from pid
  namespaces in the future.
- Add next_vm_vcpu field to sched_switch tracepoint.
---
fs/exec.c | 3 +
include/linux/mm.h | 40 ++++
include/linux/mm_types.h | 1 +
include/linux/sched.h | 5 +
include/linux/vcpu.h | 80 ++++++++
include/linux/vcpu_types.h | 28 +++
include/trace/events/sched.h | 7 +-
init/Kconfig | 4 +
kernel/fork.c | 13 +-
kernel/sched/core.c | 83 ++++++++
kernel/sched/deadline.c | 3 +
kernel/sched/debug.c | 13 ++
kernel/sched/fair.c | 1 +
kernel/sched/rt.c | 2 +
kernel/sched/sched.h | 375 +++++++++++++++++++++++++++++++++++
kernel/sched/stats.c | 16 +-
16 files changed, 669 insertions(+), 5 deletions(-)
create mode 100644 include/linux/vcpu.h
create mode 100644 include/linux/vcpu_types.h
diff --git a/fs/exec.c b/fs/exec.c
index 79f2c9483302..4c562fa13332 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -66,6 +66,7 @@
#include <linux/io_uring.h>
#include <linux/syscall_user_dispatch.h>
#include <linux/coredump.h>
+#include <linux/vcpu.h>
#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -1006,6 +1007,8 @@ static int exec_mmap(struct mm_struct *mm)
active_mm = tsk->active_mm;
tsk->active_mm = mm;
tsk->mm = mm;
+ vcpu_domain_init(mm_vcpu_domain(mm));
+ vcpu_domain_activate(tsk, mm_vcpu_domain(mm));
/*
* This prevents preemption while active_mm is being loaded and
* it and mm are being updated, which could cause problems for
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e1a84b1e6787..9d44d64cbc7c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3374,5 +3374,45 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
}
#endif
+#ifdef CONFIG_VCPU_DOMAIN
+static inline struct vcpu_domain *mm_vcpu_domain(struct mm_struct *mm)
+{
+ unsigned long vcpu_bitmap = (unsigned long)mm;
+
+ if (!mm)
+ return NULL;
+ vcpu_bitmap += offsetof(struct mm_struct, cpu_bitmap);
+ /* Skip cpu_bitmap */
+ vcpu_bitmap += cpumask_size();
+ return (struct vcpu_domain *)vcpu_bitmap;
+}
+void vcpu_domain_release(struct task_struct *t, struct vcpu_domain *domain);
+void vcpu_domain_activate(struct task_struct *t, struct vcpu_domain *domain);
+void vcpu_domain_get(struct task_struct *t, struct vcpu_domain *domain);
+void vcpu_domain_dup(struct task_struct *t, struct vcpu_domain *domain);
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+ return t->mm_vcpu;
+}
+#else
+static inline struct vcpu_domain *mm_vcpu_domain(struct mm_struct *mm)
+{
+ return NULL;
+}
+static inline void vcpu_domain_release(struct task_struct *t, struct vcpu_domain *domain) { }
+static inline void vcpu_domain_activate(struct task_struct *t, struct vcpu_domain *domain) { }
+static inline void vcpu_domain_get(struct task_struct *t, struct vcpu_domain *domain) { }
+static inline void vcpu_domain_dup(struct task_struct *t, struct vcpu_domain *domain) { }
+static inline int task_mm_vcpu_id(struct task_struct *t)
+{
+ /*
+ * Use the processor id as a fall-back when the mm vcpu feature is
+ * disabled. This provides functional per-cpu data structure accesses
+ * in user-space, although it won't provide the memory usage benefits.
+ */
+ return raw_smp_processor_id();
+}
+#endif
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9db36dc5d4cf..ef23400aee51 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -17,6 +17,7 @@
#include <linux/page-flags-layout.h>
#include <linux/workqueue.h>
#include <linux/seqlock.h>
+#include <linux/nodemask.h>
#include <asm/mmu.h>
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 838c9e0b4cae..96df9e3fc7af 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1300,6 +1300,11 @@ struct task_struct {
unsigned long rseq_event_mask;
#endif
+#ifdef CONFIG_VCPU_DOMAIN
+ int mm_vcpu; /* Current vcpu in mm */
+ int vcpu_domain_active;
+#endif
+
struct tlbflush_unmap_batch tlb_ubc;
union {
diff --git a/include/linux/vcpu.h b/include/linux/vcpu.h
new file mode 100644
index 000000000000..252fc7573025
--- /dev/null
+++ b/include/linux/vcpu.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VCPU_H
+#define _LINUX_VCPU_H
+
+#include <linux/cpumask.h>
+#include <linux/nodemask.h>
+#include <linux/vcpu_types.h>
+
+#ifdef CONFIG_VCPU_DOMAIN
+static inline unsigned int vcpu_domain_vcpumask_size(void)
+{
+ return cpumask_size();
+}
+static inline cpumask_t *vcpu_domain_vcpumask(struct vcpu_domain *vcpu_domain)
+{
+ return vcpu_domain->vcpumasks;
+}
+
+# ifdef CONFIG_NUMA
+static inline unsigned int vcpu_domain_node_vcpumask_size(void)
+{
+ if (num_possible_nodes() == 1)
+ return 0;
+ return (nr_node_ids + 1) * cpumask_size();
+}
+static inline cpumask_t *vcpu_domain_node_alloc_vcpumask(struct vcpu_domain *vcpu_domain)
+{
+ unsigned long vcpu_bitmap = (unsigned long)vcpu_domain_vcpumask(vcpu_domain);
+
+ /* Skip vcpumask */
+ vcpu_bitmap += cpumask_size();
+ return (struct cpumask *)vcpu_bitmap;
+}
+static inline cpumask_t *vcpu_domain_node_vcpumask(struct vcpu_domain *vcpu_domain, unsigned int node)
+{
+ unsigned long vcpu_bitmap = (unsigned long)vcpu_domain_node_alloc_vcpumask(vcpu_domain);
+
+ /* Skip node alloc vcpumask */
+ vcpu_bitmap += cpumask_size();
+ vcpu_bitmap += node * cpumask_size();
+ return (struct cpumask *)vcpu_bitmap;
+}
+static inline void vcpu_domain_node_init(struct vcpu_domain *vcpu_domain)
+{
+ unsigned int node;
+
+ if (num_possible_nodes() == 1)
+ return;
+ cpumask_clear(vcpu_domain_node_alloc_vcpumask(vcpu_domain));
+ for (node = 0; node < nr_node_ids; node++)
+ cpumask_clear(vcpu_domain_node_vcpumask(vcpu_domain, node));
+}
+# else /* CONFIG_NUMA */
+static inline unsigned int vcpu_domain_node_vcpumask_size(void)
+{
+ return 0;
+}
+static inline void vcpu_domain_node_init(struct vcpu_domain *vcpu_domain) { }
+# endif /* CONFIG_NUMA */
+
+static inline unsigned int vcpu_domain_size(void)
+{
+ return offsetof(struct vcpu_domain, vcpumasks) + vcpu_domain_vcpumask_size() +
+ vcpu_domain_node_vcpumask_size();
+}
+static inline void vcpu_domain_init(struct vcpu_domain *vcpu_domain)
+{
+ atomic_set(&vcpu_domain->users, 1);
+ cpumask_clear(vcpu_domain_vcpumask(vcpu_domain));
+ vcpu_domain_node_init(vcpu_domain);
+}
+#else /* CONFIG_VCPU_DOMAIN */
+static inline unsigned int vcpu_domain_size(void)
+{
+ return 0;
+}
+static inline void vcpu_domain_init(struct vcpu_domain *vcpu_domain) { }
+#endif /* CONFIG_VCPU_DOMAIN */
+
+#endif /* _LINUX_VCPU_H */
diff --git a/include/linux/vcpu_types.h b/include/linux/vcpu_types.h
new file mode 100644
index 000000000000..36e520c4e287
--- /dev/null
+++ b/include/linux/vcpu_types.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_VCPU_TYPES_H
+#define _LINUX_VCPU_TYPES_H
+
+#include <linux/atomic.h>
+#include <linux/cpumask.h>
+
+struct vcpu_domain {
+ /**
+ * @users: The number of references to &struct vcpu_domain from
+ * user-space threads.
+ *
+ * Initialized to 1 for the first thread with a reference to
+ * the domain. Incremented for each thread getting a reference to the
+ * domain, and decremented on domain release from user-space threads.
+ * Used to enable single-threaded domain vcpu accounting (when == 1).
+ */
+ atomic_t users;
+ /*
+ * Layout of vcpumasks:
+ * - vcpumask (cpumask_size()),
+ * - node_alloc_vcpumask (cpumask_size(), NUMA=y only),
+ * - array of nr_node_ids node_vcpumask (each cpumask_size(), NUMA=y only).
+ */
+ cpumask_t vcpumasks[];
+};
+
+#endif /* _LINUX_VCPU_TYPES_H */
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 94640482cfe7..050cb749ca1a 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -233,6 +233,7 @@ TRACE_EVENT(sched_switch,
__array( char, next_comm, TASK_COMM_LEN )
__field( pid_t, next_pid )
__field( int, next_prio )
+ __field( pid_t, next_vm_vcpu )
),
TP_fast_assign(
@@ -243,10 +244,11 @@ TRACE_EVENT(sched_switch,
memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
__entry->next_pid = next->pid;
__entry->next_prio = next->prio;
+ __entry->next_vm_vcpu = task_mm_vcpu_id(next);
/* XXX SCHED_DEADLINE */
),
- TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d",
+ TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s%s ==> next_comm=%s next_pid=%d next_prio=%d next_vm_vcpu=%d",
__entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
(__entry->prev_state & (TASK_REPORT_MAX - 1)) ?
@@ -262,7 +264,8 @@ TRACE_EVENT(sched_switch,
"R",
__entry->prev_state & TASK_REPORT_MAX ? "+" : "",
- __entry->next_comm, __entry->next_pid, __entry->next_prio)
+ __entry->next_comm, __entry->next_pid, __entry->next_prio,
+ __entry->next_vm_vcpu)
);
/*
diff --git a/init/Kconfig b/init/Kconfig
index e9119bf54b1f..8d504675d3a3 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1023,6 +1023,10 @@ config RT_GROUP_SCHED
endif #CGROUP_SCHED
+config VCPU_DOMAIN
+ def_bool y
+ depends on SMP && RSEQ
+
config UCLAMP_TASK_GROUP
bool "Utilization clamping per group of tasks"
depends on CGROUP_SCHED
diff --git a/kernel/fork.c b/kernel/fork.c
index d75a528f7b21..97bc7dcadafe 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -97,6 +97,7 @@
#include <linux/scs.h>
#include <linux/io_uring.h>
#include <linux/bpf.h>
+#include <linux/vcpu.h>
#include <asm/pgalloc.h>
#include <linux/uaccess.h>
@@ -970,6 +971,11 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
#ifdef CONFIG_MEMCG
tsk->active_memcg = NULL;
#endif
+
+#ifdef CONFIG_VCPU_DOMAIN
+ tsk->mm_vcpu = 0;
+ tsk->vcpu_domain_active = 0;
+#endif
return tsk;
free_stack:
@@ -1079,6 +1085,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
goto fail_nocontext;
mm->user_ns = get_user_ns(user_ns);
+ vcpu_domain_init(mm_vcpu_domain(mm));
return mm;
fail_nocontext:
@@ -1380,6 +1387,8 @@ static int wait_for_vfork_done(struct task_struct *child,
*/
static void mm_release(struct task_struct *tsk, struct mm_struct *mm)
{
+ vcpu_domain_release(tsk, mm_vcpu_domain(mm));
+
uprobe_free_utask(tsk);
/* Get rid of any cached register state */
@@ -1499,10 +1508,12 @@ static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
if (clone_flags & CLONE_VM) {
mmget(oldmm);
mm = oldmm;
+ vcpu_domain_get(tsk, mm_vcpu_domain(mm));
} else {
mm = dup_mm(tsk, current->mm);
if (!mm)
return -ENOMEM;
+ vcpu_domain_dup(tsk, mm_vcpu_domain(mm));
}
tsk->mm = mm;
@@ -2901,7 +2912,7 @@ void __init proc_caches_init(void)
* dynamically sized based on the maximum CPU number this system
* can have, taking hotplug into account (nr_cpu_ids).
*/
- mm_size = sizeof(struct mm_struct) + cpumask_size();
+ mm_size = sizeof(struct mm_struct) + cpumask_size() + vcpu_domain_size();
mm_cachep = kmem_cache_create_usercopy("mm_struct",
mm_size, ARCH_MIN_MMSTRUCT_ALIGN,
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 1e08b02e0cd5..3894822d548a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -16,6 +16,7 @@
#include <linux/blkdev.h>
#include <linux/kcov.h>
#include <linux/scs.h>
+#include <linux/vcpu.h>
#include <asm/switch_to.h>
#include <asm/tlb.h>
@@ -2267,6 +2268,7 @@ static struct rq *move_queued_task(struct rq *rq, struct rq_flags *rf,
lockdep_assert_rq_held(rq);
deactivate_task(rq, p, DEQUEUE_NOCLOCK);
+ rq_vcpu_domain_migrate_locked(rq, p);
set_task_cpu(p, new_cpu);
rq_unlock(rq, rf);
@@ -2454,6 +2456,7 @@ int push_cpu_stop(void *arg)
// XXX validate p is still the highest prio task
if (task_rq(p) == rq) {
deactivate_task(rq, p, 0);
+ rq_vcpu_domain_migrate_locked(rq, p);
set_task_cpu(p, lowest_rq->cpu);
activate_task(lowest_rq, p, 0);
resched_curr(lowest_rq);
@@ -3093,6 +3096,7 @@ static void __migrate_swap_task(struct task_struct *p, int cpu)
rq_pin_lock(dst_rq, &drf);
deactivate_task(src_rq, p, 0);
+ rq_vcpu_domain_migrate_locked(src_rq, p);
set_task_cpu(p, cpu);
activate_task(dst_rq, p, 0);
check_preempt_curr(dst_rq, p, 0);
@@ -3716,6 +3720,8 @@ static void __ttwu_queue_wakelist(struct task_struct *p, int cpu, int wake_flags
p->sched_remote_wakeup = !!(wake_flags & WF_MIGRATED);
WRITE_ONCE(rq->ttwu_pending, 1);
+ if (WARN_ON_ONCE(task_cpu(p) != cpu_of(rq)))
+ rq_vcpu_domain_migrate_locked(task_rq(p), p);
__smp_call_single_queue(cpu, &p->wake_entry.llist);
}
@@ -4125,6 +4131,7 @@ try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
wake_flags |= WF_MIGRATED;
psi_ttwu_dequeue(p);
+ rq_vcpu_domain_migrate(task_rq(p), p);
set_task_cpu(p, cpu);
}
#else
@@ -4796,6 +4803,7 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
rseq_preempt(prev);
+ switch_mm_vcpu(rq, prev, next);
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
@@ -5922,6 +5930,7 @@ static bool try_steal_cookie(int this, int that)
goto next;
deactivate_task(src, p, 0);
+ rq_vcpu_domain_migrate_locked(src, p);
set_task_cpu(p, this);
activate_task(dst, p, 0);
@@ -10927,3 +10936,77 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
{
trace_sched_update_nr_running_tp(rq, count);
}
+
+#ifdef CONFIG_VCPU_DOMAIN
+void vcpu_domain_release(struct task_struct *t, struct vcpu_domain *vcpu_domain)
+{
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (!vcpu_domain)
+ return;
+ WARN_ON_ONCE(t != current);
+ preempt_disable();
+ rq = this_rq();
+ rq_lock_irqsave(rq, &rf);
+ t->vcpu_domain_active = 0;
+ atomic_dec(&vcpu_domain->users);
+ rq_vcpu_cache_remove_vcpu_domain_locked(rq, vcpu_domain, true);
+ rq_unlock_irqrestore(rq, &rf);
+ t->mm_vcpu = -1;
+ preempt_enable();
+}
+
+void vcpu_domain_activate(struct task_struct *t, struct vcpu_domain *vcpu_domain)
+{
+ WARN_ON_ONCE(t != current);
+ preempt_disable();
+ t->vcpu_domain_active = 1;
+ /* No need to reserve in cpumask because single-threaded. */
+ t->mm_vcpu = vcpu_domain_vcpu_first_node_vcpu(numa_node_id());
+ preempt_enable();
+}
+
+void vcpu_domain_get(struct task_struct *t, struct vcpu_domain *vcpu_domain)
+{
+ int vcpu, vcpu_users;
+ struct rq_flags rf;
+ struct rq *rq;
+
+ preempt_disable();
+ rq = this_rq();
+ t->vcpu_domain_active = 1;
+ vcpu_users = atomic_read(&vcpu_domain->users);
+ atomic_inc(&vcpu_domain->users);
+ t->mm_vcpu = -1;
+ vcpu = current->mm_vcpu;
+ rq_lock_irqsave(rq, &rf);
+ /* On transition from 1 to 2 vcpu domain users, reserve vcpu ids. */
+ if (vcpu_users == 1) {
+ vcpu_domain_vcpu_reserve_nodes(vcpu_domain);
+ rq_vcpu_cache_remove_vcpu_domain_locked(rq, vcpu_domain, true);
+ current->mm_vcpu = __vcpu_domain_vcpu_get(rq, vcpu_domain);
+ rq_vcpu_cache_add(rq, vcpu_domain, current->mm_vcpu);
+ /*
+ * __vcpu_domain_vcpu_get could get a different vcpu after
+ * going multi-threaded, then back single-threaded, then
+ * multi-threaded on a NUMA configuration using the first CPU
+ * matching the NUMA node as single-threaded vcpu, with
+ * leftover vcpu_id matching the NUMA node set from when this
+ * task was multithreaded.
+ */
+ if (current->mm_vcpu != vcpu)
+ rseq_set_notify_resume(current);
+ }
+ rq_unlock_irqrestore(rq, &rf);
+ preempt_enable();
+}
+
+void vcpu_domain_dup(struct task_struct *t, struct vcpu_domain *vcpu_domain)
+{
+ preempt_disable();
+ t->vcpu_domain_active = 1;
+ t->mm_vcpu = -1;
+ preempt_enable();
+}
+#endif
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index d2c072b0ef01..056933d83469 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -655,6 +655,7 @@ static struct rq *dl_task_offline_migration(struct rq *rq, struct task_struct *p
__dl_add(dl_b, p->dl.dl_bw, cpumask_weight(later_rq->rd->span));
raw_spin_unlock(&dl_b->lock);
+ rq_vcpu_domain_migrate_locked(rq, p);
set_task_cpu(p, later_rq->cpu);
double_unlock_balance(later_rq, rq);
@@ -2290,6 +2291,7 @@ static int push_dl_task(struct rq *rq)
}
deactivate_task(rq, next_task, 0);
+ rq_vcpu_domain_migrate_locked(rq, next_task);
set_task_cpu(next_task, later_rq->cpu);
/*
@@ -2386,6 +2388,7 @@ static void pull_dl_task(struct rq *this_rq)
push_task = get_push_task(src_rq);
} else {
deactivate_task(src_rq, p, 0);
+ rq_vcpu_domain_migrate_locked(src_rq, p);
set_task_cpu(p, this_cpu);
activate_task(this_rq, p, 0);
dmin = p->dl.deadline;
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 102d6f70e84d..0b30946de2a4 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -763,6 +763,19 @@ do { \
P(sched_goidle);
P(ttwu_count);
P(ttwu_local);
+#undef P
+#define P(n) SEQ_printf(m, " .%-30s: %Ld\n", #n, schedstat_val(rq->n));
+ P(nr_vcpu_thread_transfer);
+ P(nr_vcpu_cache_hit);
+ P(nr_vcpu_cache_evict);
+ P(nr_vcpu_cache_discard_wrong_node);
+ P(nr_vcpu_allocate);
+ P(nr_vcpu_allocate_node_reuse);
+ P(nr_vcpu_allocate_node_new);
+ P(nr_vcpu_allocate_node_rebalance);
+ P(nr_vcpu_allocate_node_steal);
+ P(nr_vcpu_remove_release);
+ P(nr_vcpu_remove_migrate);
}
#undef P
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dcbd3110c687..c43d97c79237 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7817,6 +7817,7 @@ static void detach_task(struct task_struct *p, struct lb_env *env)
lockdep_assert_rq_held(env->src_rq);
deactivate_task(env->src_rq, p, DEQUEUE_NOCLOCK);
+ rq_vcpu_domain_migrate_locked(env->src_rq, p);
set_task_cpu(p, env->dst_cpu);
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7b4f4fbbb404..4922a87d5527 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2106,6 +2106,7 @@ static int push_rt_task(struct rq *rq, bool pull)
}
deactivate_task(rq, next_task, 0);
+ rq_vcpu_domain_migrate_locked(rq, next_task);
set_task_cpu(next_task, lowest_rq->cpu);
activate_task(lowest_rq, next_task, 0);
resched_curr(lowest_rq);
@@ -2379,6 +2380,7 @@ static void pull_rt_task(struct rq *this_rq)
push_task = get_push_task(src_rq);
} else {
deactivate_task(src_rq, p, 0);
+ rq_vcpu_domain_migrate_locked(src_rq, p);
set_task_cpu(p, this_cpu);
activate_task(this_rq, p, 0);
resched = true;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3da5718cd641..abdd48d1ffe6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -66,6 +66,7 @@
#include <linux/syscalls.h>
#include <linux/task_work.h>
#include <linux/tsacct_kern.h>
+#include <linux/vcpu.h>
#include <asm/tlb.h>
@@ -916,6 +917,19 @@ struct uclamp_rq {
DECLARE_STATIC_KEY_FALSE(sched_uclamp_used);
#endif /* CONFIG_UCLAMP_TASK */
+#ifdef CONFIG_VCPU_DOMAIN
+# define RQ_VCPU_CACHE_SIZE 8
+struct rq_vcpu_entry {
+ struct vcpu_domain *vcpu_domain; /* NULL if unset */
+ int vcpu_id;
+};
+
+struct rq_vcpu_cache {
+ struct rq_vcpu_entry entry[RQ_VCPU_CACHE_SIZE];
+ unsigned int head;
+};
+#endif
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -1086,6 +1100,19 @@ struct rq {
/* try_to_wake_up() stats */
unsigned int ttwu_count;
unsigned int ttwu_local;
+
+ unsigned long long nr_vcpu_single_thread;
+ unsigned long long nr_vcpu_thread_transfer;
+ unsigned long long nr_vcpu_cache_hit;
+ unsigned long long nr_vcpu_cache_evict;
+ unsigned long long nr_vcpu_cache_discard_wrong_node;
+ unsigned long long nr_vcpu_allocate;
+ unsigned long long nr_vcpu_allocate_node_reuse;
+ unsigned long long nr_vcpu_allocate_node_new;
+ unsigned long long nr_vcpu_allocate_node_rebalance;
+ unsigned long long nr_vcpu_allocate_node_steal;
+ unsigned long long nr_vcpu_remove_release;
+ unsigned long long nr_vcpu_remove_migrate;
#endif
#ifdef CONFIG_CPU_IDLE
@@ -1116,6 +1143,10 @@ struct rq {
unsigned int core_forceidle_occupation;
u64 core_forceidle_start;
#endif
+
+#ifdef CONFIG_VCPU_DOMAIN
+ struct rq_vcpu_cache vcpu_cache;
+#endif
};
#ifdef CONFIG_FAIR_GROUP_SCHED
@@ -3137,3 +3168,347 @@ extern int sched_dynamic_mode(const char *str);
extern void sched_dynamic_update(int mode);
#endif
+#ifdef CONFIG_VCPU_DOMAIN
+static inline int __vcpu_domain_vcpu_get_single_node(struct vcpu_domain *vcpu_domain)
+{
+ struct cpumask *cpumask;
+ int vcpu;
+
+ cpumask = vcpu_domain_vcpumask(vcpu_domain);
+ /* Atomically reserve lowest available vcpu number. */
+ do {
+ vcpu = cpumask_first_zero(cpumask);
+ if (vcpu >= nr_cpu_ids)
+ return -1;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ return vcpu;
+}
+
+#ifdef CONFIG_NUMA
+static inline bool vcpu_domain_node_vcpumask_test_cpu(struct vcpu_domain *vcpu_domain, int vcpu_id)
+{
+ if (num_possible_nodes() == 1)
+ return true;
+ return cpumask_test_cpu(vcpu_id, vcpu_domain_node_vcpumask(vcpu_domain, numa_node_id()));
+}
+
+static inline int __vcpu_domain_vcpu_get(struct rq *rq, struct vcpu_domain *vcpu_domain)
+{
+ struct cpumask *cpumask = vcpu_domain_vcpumask(vcpu_domain),
+ *node_cpumask = vcpu_domain_node_vcpumask(vcpu_domain, numa_node_id()),
+ *node_alloc_cpumask = vcpu_domain_node_alloc_vcpumask(vcpu_domain);
+ unsigned int node;
+ int vcpu;
+
+ if (num_possible_nodes() == 1)
+ return __vcpu_domain_vcpu_get_single_node(vcpu_domain);
+
+ /*
+ * Try to atomically reserve lowest available vcpu number within those
+ * already reserved for this NUMA node.
+ */
+ do {
+ vcpu = cpumask_first_one_and_zero(node_cpumask, cpumask);
+ if (vcpu >= nr_cpu_ids)
+ goto alloc_numa;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ schedstat_inc(rq->nr_vcpu_allocate_node_reuse);
+ goto end;
+
+alloc_numa:
+ /*
+ * Try to atomically reserve lowest available vcpu number within those
+ * not already allocated for numa nodes.
+ */
+ do {
+ vcpu = cpumask_first_zero_and_zero(node_alloc_cpumask, cpumask);
+ if (vcpu >= nr_cpu_ids)
+ goto numa_update;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ cpumask_set_cpu(vcpu, node_cpumask);
+ cpumask_set_cpu(vcpu, node_alloc_cpumask);
+ schedstat_inc(rq->nr_vcpu_allocate_node_new);
+ goto end;
+
+numa_update:
+ /*
+ * NUMA node id configuration changed for at least one CPU in the system.
+ * We need to steal a currently unused vcpu_id from an overprovisioned
+ * node for our current node. Userspace must handle the fact that the
+ * node id associated with this vcpu_id may change due to node ID
+ * reconfiguration.
+ *
+ * Count how many possible cpus are attached to each (other) node id, and
+ * compare this with the number of vcpu ids reserved in that node's per-mm
+ * vcpumask: steal an unused vcpu id from any node holding more ids than cpus.
+ */
+ for (node = 0; node < nr_node_ids; node++) {
+ struct cpumask *iter_cpumask;
+
+ if (node == numa_node_id())
+ continue;
+ iter_cpumask = vcpu_domain_node_vcpumask(vcpu_domain, node);
+ if (nr_cpus_node(node) < cpumask_weight(iter_cpumask)) {
+ /* Try to steal from this node. */
+ do {
+ vcpu = cpumask_first_one_and_zero(iter_cpumask, cpumask);
+ if (vcpu >= nr_cpu_ids)
+ goto steal_fail;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ cpumask_clear_cpu(vcpu, iter_cpumask);
+ cpumask_set_cpu(vcpu, node_cpumask);
+ schedstat_inc(rq->nr_vcpu_allocate_node_rebalance);
+ goto end;
+ }
+ }
+
+steal_fail:
+ /*
+ * Our attempt at gracefully stealing a vcpu_id from another
+ * overprovisioned NUMA node failed. Fallback to grabbing the first
+ * available vcpu_id.
+ */
+ do {
+ vcpu = cpumask_first_zero(cpumask);
+ if (vcpu >= nr_cpu_ids)
+ return -1;
+ } while (cpumask_test_and_set_cpu(vcpu, cpumask));
+ /* Steal vcpu from its numa node mask. */
+ for (node = 0; node < nr_node_ids; node++) {
+ struct cpumask *iter_cpumask;
+
+ if (node == numa_node_id())
+ continue;
+ iter_cpumask = vcpu_domain_node_vcpumask(vcpu_domain, node);
+ if (cpumask_test_cpu(vcpu, iter_cpumask)) {
+ cpumask_clear_cpu(vcpu, iter_cpumask);
+ break;
+ }
+ }
+ cpumask_set_cpu(vcpu, node_cpumask);
+ schedstat_inc(rq->nr_vcpu_allocate_node_steal);
+end:
+ return vcpu;
+}
+
+static inline int vcpu_domain_vcpu_first_node_vcpu(int node)
+{
+ int vcpu;
+
+ if (likely(nr_cpu_ids >= nr_node_ids))
+ return node;
+ vcpu = cpumask_first(cpumask_of_node(node));
+ if (vcpu >= nr_cpu_ids)
+ return -1;
+ return vcpu;
+}
+
+/*
+ * Single-threaded processes observe a vcpu_id->node_id mapping in which the
+ * vcpu_id is the one returned by vcpu_domain_vcpu_first_node_vcpu(). When going
+ * from single to multi-threaded, reserve this same mapping so that it stays
+ * invariant.
+ */
+static inline void vcpu_domain_vcpu_reserve_nodes(struct vcpu_domain *vcpu_domain)
+{
+ struct cpumask *node_alloc_cpumask = vcpu_domain_node_alloc_vcpumask(vcpu_domain);
+ int node, other_node;
+
+ for (node = 0; node < nr_node_ids; node++) {
+ struct cpumask *iter_cpumask = vcpu_domain_node_vcpumask(vcpu_domain, node);
+ int vcpu = vcpu_domain_vcpu_first_node_vcpu(node);
+
+ /* Skip nodes that have no CPU associated with them. */
+ if (vcpu < 0)
+ continue;
+ cpumask_set_cpu(vcpu, iter_cpumask);
+ cpumask_set_cpu(vcpu, node_alloc_cpumask);
+ for (other_node = 0; other_node < nr_node_ids; other_node++) {
+ if (other_node == node)
+ continue;
+ cpumask_clear_cpu(vcpu, vcpu_domain_node_vcpumask(vcpu_domain, other_node));
+ }
+ }
+}
+#else
+static inline bool vcpu_domain_node_vcpumask_test_cpu(struct vcpu_domain *vcpu_domain,
+ int vcpu_id)
+{
+ return true;
+}
+static inline int __vcpu_domain_vcpu_get(struct rq *rq, struct vcpu_domain *vcpu_domain)
+{
+ return __vcpu_domain_vcpu_get_single_node(vcpu_domain);
+}
+static inline int vcpu_domain_vcpu_first_node_vcpu(int node)
+{
+ return 0;
+}
+static inline void vcpu_domain_vcpu_reserve_nodes(struct vcpu_domain *vcpu_domain) { }
+#endif
+
+static inline void __vcpu_domain_vcpu_put(struct vcpu_domain *vcpu_domain, int vcpu)
+{
+ if (vcpu < 0)
+ return;
+ cpumask_clear_cpu(vcpu, vcpu_domain_vcpumask(vcpu_domain));
+}
+
+static inline struct rq_vcpu_entry *rq_vcpu_cache_lookup(struct rq *rq, struct vcpu_domain *vcpu_domain)
+{
+ struct rq_vcpu_cache *vcpu_cache = &rq->vcpu_cache;
+ int i;
+
+ for (i = 0; i < RQ_VCPU_CACHE_SIZE; i++) {
+ struct rq_vcpu_entry *entry = &vcpu_cache->entry[i];
+
+ if (entry->vcpu_domain == vcpu_domain)
+ return entry;
+ }
+ return NULL;
+}
+
+/* Removal from cache simply leaves an unused hole. */
+static inline int rq_vcpu_cache_lookup_remove(struct rq *rq, struct vcpu_domain *vcpu_domain)
+{
+ struct rq_vcpu_entry *entry = rq_vcpu_cache_lookup(rq, vcpu_domain);
+
+ if (!entry)
+ return -1;
+ entry->vcpu_domain = NULL; /* Remove from cache */
+ return entry->vcpu_id;
+}
+
+static inline void rq_vcpu_cache_remove_vcpu_domain_locked(struct rq *rq, struct vcpu_domain *vcpu_domain,
+ bool release)
+{
+ int vcpu;
+
+ if (!vcpu_domain)
+ return;
+ /*
+ * Do not remove the cache entry for a runqueue that runs a task which
+ * currently uses the target mm.
+ */
+ if (!release && rq->curr->vcpu_domain_active && mm_vcpu_domain(rq->curr->mm) == vcpu_domain)
+ return;
+ vcpu = rq_vcpu_cache_lookup_remove(rq, vcpu_domain);
+ if (vcpu < 0)
+ return;
+ if (release)
+ schedstat_inc(rq->nr_vcpu_remove_release);
+ else
+ schedstat_inc(rq->nr_vcpu_remove_migrate);
+ __vcpu_domain_vcpu_put(vcpu_domain, vcpu);
+}
+
+static inline void rq_vcpu_cache_remove_vcpu_domain(struct rq *rq, struct vcpu_domain *vcpu_domain,
+ bool release)
+{
+ struct rq_flags rf;
+
+ rq_lock_irqsave(rq, &rf);
+ rq_vcpu_cache_remove_vcpu_domain_locked(rq, vcpu_domain, release);
+ rq_unlock_irqrestore(rq, &rf);
+}
+
+static inline void rq_vcpu_domain_migrate_locked(struct rq *rq, struct task_struct *t)
+{
+ rq_vcpu_cache_remove_vcpu_domain_locked(rq, mm_vcpu_domain(t->mm), false);
+}
+
+static inline void rq_vcpu_domain_migrate(struct rq *rq, struct task_struct *t)
+{
+ rq_vcpu_cache_remove_vcpu_domain(rq, mm_vcpu_domain(t->mm), false);
+}
+
+/*
+ * Add at head, move head forward. Cheap LRU cache.
+ * We only need to clear the vcpu bit in the evicted entry's domain vcpumask
+ * when we overwrite a valid entry; nothing needs to be done if the overwritten
+ * entry is an unused hole. Because this accesses the old_vcpu_domain from an
+ * unrelated thread, the cache entry for a given vcpu_domain must be pruned
+ * from the cache when a task is dequeued from the runqueue.
+ */
+static inline void rq_vcpu_cache_add(struct rq *rq, struct vcpu_domain *vcpu_domain,
+ int vcpu_id)
+{
+ struct rq_vcpu_cache *vcpu_cache = &rq->vcpu_cache;
+ struct vcpu_domain *old_vcpu_domain;
+ struct rq_vcpu_entry *entry;
+ unsigned int pos;
+
+ pos = vcpu_cache->head;
+ entry = &vcpu_cache->entry[pos];
+ old_vcpu_domain = entry->vcpu_domain;
+ if (old_vcpu_domain) {
+ schedstat_inc(rq->nr_vcpu_cache_evict);
+ __vcpu_domain_vcpu_put(old_vcpu_domain, entry->vcpu_id);
+ }
+ entry->vcpu_domain = vcpu_domain;
+ entry->vcpu_id = vcpu_id;
+ vcpu_cache->head = (pos + 1) % RQ_VCPU_CACHE_SIZE;
+}
+
+static inline int vcpu_domain_vcpu_get(struct rq *rq, struct vcpu_domain *vcpu_domain)
+{
+ struct rq_vcpu_entry *entry;
+ int vcpu;
+
+ /* Skip allocation if mm is single-threaded. */
+ if (atomic_read(&vcpu_domain->users) == 1) {
+ schedstat_inc(rq->nr_vcpu_single_thread);
+ vcpu = vcpu_domain_vcpu_first_node_vcpu(numa_node_id());
+ goto end;
+ }
+ entry = rq_vcpu_cache_lookup(rq, vcpu_domain);
+ if (likely(entry)) {
+ vcpu = entry->vcpu_id;
+ if (likely(vcpu_domain_node_vcpumask_test_cpu(vcpu_domain, vcpu))) {
+ schedstat_inc(rq->nr_vcpu_cache_hit);
+ goto end;
+ } else {
+ schedstat_inc(rq->nr_vcpu_cache_discard_wrong_node);
+ entry->vcpu_domain = NULL; /* Remove from cache */
+ __vcpu_domain_vcpu_put(vcpu_domain, vcpu);
+ }
+ }
+ schedstat_inc(rq->nr_vcpu_allocate);
+ vcpu = __vcpu_domain_vcpu_get(rq, vcpu_domain);
+ rq_vcpu_cache_add(rq, vcpu_domain, vcpu);
+end:
+ return vcpu;
+}
+
+static inline void switch_mm_vcpu(struct rq *rq, struct task_struct *prev,
+ struct task_struct *next)
+{
+ if (!(next->flags & PF_KTHREAD) && next->vcpu_domain_active && next->mm) {
+ if (!(prev->flags & PF_KTHREAD) && prev->vcpu_domain_active &&
+ prev->mm == next->mm &&
+ vcpu_domain_node_vcpumask_test_cpu(mm_vcpu_domain(next->mm), prev->mm_vcpu)) {
+ /*
+ * Switching between threads with the same mm. Simply pass the
+ * vcpu token along to the next thread.
+ */
+ schedstat_inc(rq->nr_vcpu_thread_transfer);
+ next->mm_vcpu = prev->mm_vcpu;
+ } else {
+ next->mm_vcpu = vcpu_domain_vcpu_get(rq, mm_vcpu_domain(next->mm));
+ }
+ }
+ if (!(prev->flags & PF_KTHREAD) && prev->vcpu_domain_active && prev->mm)
+ prev->mm_vcpu = -1;
+}
+
+#else
+static inline void switch_mm_vcpu(struct rq *rq, struct task_struct *prev,
+ struct task_struct *next) { }
+static inline void rq_vcpu_cache_remove_vcpu_domain_locked(struct rq *rq,
+ struct vcpu_domain *domain,
+ bool release) { }
+static inline void rq_vcpu_cache_remove_vcpu_domain(struct rq *rq, struct vcpu_domain *domain,
+ bool release) { }
+static inline void rq_vcpu_domain_migrate_locked(struct rq *rq, struct task_struct *t) { }
+static inline void rq_vcpu_domain_migrate(struct rq *rq, struct task_struct *t) { }
+#endif
diff --git a/kernel/sched/stats.c b/kernel/sched/stats.c
index 07dde2928c79..a0bf4eaae296 100644
--- a/kernel/sched/stats.c
+++ b/kernel/sched/stats.c
@@ -134,12 +134,24 @@ static int show_schedstat(struct seq_file *seq, void *v)
/* runqueue-specific stats */
seq_printf(seq,
- "cpu%d %u 0 %u %u %u %u %llu %llu %lu",
+ "cpu%d %u 0 %u %u %u %u %llu %llu %lu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
cpu, rq->yld_count,
rq->sched_count, rq->sched_goidle,
rq->ttwu_count, rq->ttwu_local,
rq->rq_cpu_time,
- rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount);
+ rq->rq_sched_info.run_delay, rq->rq_sched_info.pcount,
+ rq->nr_vcpu_single_thread,
+ rq->nr_vcpu_thread_transfer,
+ rq->nr_vcpu_cache_hit,
+ rq->nr_vcpu_cache_evict,
+ rq->nr_vcpu_cache_discard_wrong_node,
+ rq->nr_vcpu_allocate,
+ rq->nr_vcpu_allocate_node_reuse,
+ rq->nr_vcpu_allocate_node_new,
+ rq->nr_vcpu_allocate_node_rebalance,
+ rq->nr_vcpu_allocate_node_steal,
+ rq->nr_vcpu_remove_release,
+ rq->nr_vcpu_remove_migrate);
seq_printf(seq, "\n");
--
2.17.1
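
As an aside on consuming the new counters: the sketch below (not part of the
patch) shows one way user-space could read them back from /proc/schedstat. It
assumes the cpu-line layout produced by the kernel/sched/stats.c hunk above,
i.e. twelve extra fields appended after the nine existing runqueue fields;
lines that do not match (e.g. from a kernel without this patch) are skipped.

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[1024];
	FILE *f = fopen("/proc/schedstat", "r");

	if (!f) {
		perror("fopen");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		unsigned long long rq_time[2], vcpu[12];
		unsigned int u[6];
		unsigned long pcount;
		int cpu, n;

		if (strncmp(line, "cpu", 3) != 0)
			continue;
		/* Field order follows the seq_printf() in show_schedstat(). */
		n = sscanf(line,
			   "cpu%d %u %u %u %u %u %u %llu %llu %lu"
			   " %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu %llu",
			   &cpu, &u[0], &u[1], &u[2], &u[3], &u[4], &u[5],
			   &rq_time[0], &rq_time[1], &pcount,
			   &vcpu[0], &vcpu[1], &vcpu[2], &vcpu[3], &vcpu[4], &vcpu[5],
			   &vcpu[6], &vcpu[7], &vcpu[8], &vcpu[9], &vcpu[10], &vcpu[11]);
		if (n != 22)
			continue;	/* Not the extended layout. */
		printf("cpu%d single_thread=%llu thread_transfer=%llu cache_hit=%llu allocate=%llu\n",
		       cpu, vcpu[0], vcpu[1], vcpu[2], vcpu[5]);
	}
	fclose(f);
	return 0;
}
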
Mathieu Desnoyers <[email protected]> writes:
> ----- On Feb 25, 2022, at 1:15 PM, Jonathan Corbet [email protected] wrote:
>> That helps, thanks. I do think that something like this belongs in the
>> changelog - or, even better, in the upcoming restartable-sequences
>> section in the userspace-api documentation :)
>
> Just to confirm, when you say "userspace-api documentation" do you refer to
> man pages ?
No, I meant Documentation/userspace-api/. But yes, even having the man
page in place would be a good step in the right direction.
Thanks,
jon
Mathieu Desnoyers <[email protected]> writes:
> This feature allows the scheduler to expose a current virtual cpu id
> to user-space. This virtual cpu id is within the possible cpus range,
> and is temporarily (and uniquely) assigned while threads are actively
> running within a memory space. If a memory space has fewer threads than
> cores, or is limited to run on few cores concurrently through sched
> affinity or cgroup cpusets, the virtual cpu ids will be values close
> to 0, thus allowing efficient use of user-space memory for per-cpu
> data structures.
So I have one possibly (probably) dumb question: if I'm writing a
program to make use of virtual CPU IDs, how do I know what the maximum
ID will be? It seems like one of the advantages of this mechanism would
be not having to be prepared for anything in the physical ID space, but
is there any guarantee that the virtual-ID space will be smaller?
Something like "no larger than the number of threads", say?
Thanks,
jon
----- On Feb 25, 2022, at 1:15 PM, Jonathan Corbet [email protected] wrote:
> Mathieu Desnoyers <[email protected]> writes:
>
>> Some effective upper bounds for the number of vcpu ids observable in a process:
>>
>> - sysconf(3) _SC_NPROCESSORS_CONF,
>> - the number of threads which exist concurrently in the process,
>> - the number of cpus in the cpu affinity mask applied by sched_setaffinity,
>> except in corner-case situations such as cpu hotplug removing all cpus from
>> the affinity set,
>> - cgroup cpuset "partition" limits,
>>
>> Note that AFAIR non-partition cgroup cpusets allow a cgroup to "borrow"
>> additional cores from the rest of the system if they are idle, therefore
>> allowing the number of concurrent threads to go beyond the specified limit.
>>
>> AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
>> Those are two mechanisms both affecting the scheduler task placement.
>>
>> I would expect the user-space code to use some sensible upper bound as a
>> hint about how many per-vcpu data structure elements to expect (and how many
>> to pre-allocate), but have a "lazy initialization" fall-back in case the
>> vcpu id goes up to the number of configured processors - 1. And I suspect
>> that even the number of configured processors may change with CRIU.
>>
>> If the above explanation makes sense (please let me know if I am wrong
>> or missed something), I suspect I should add it to the commit message.
>
> That helps, thanks. I do think that something like this belongs in the
> changelog - or, even better, in the upcoming restartable-sequences
> section in the userspace-api documentation :)
Just to confirm, when you say "userspace-api documentation" do you refer to
man pages ?
I made a few attempts at upstreaming an rseq.2 man page in 2020, but I have been
stuck waiting for feedback from Michael Kerrisk since then.
So for the moment I'm maintaining a rseq.2 man page here:
https://git.kernel.org/pub/scm/libs/librseq/librseq.git/tree/doc/man/rseq.2
I'd gladly accept some help to improve the documentation of rseq.
Thanks,
Mathieu
>
> Thanks,
>
> jon
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
Mathieu Desnoyers <[email protected]> writes:
> Some effective upper bounds for the number of vcpu ids observable in a process:
>
> - sysconf(3) _SC_NPROCESSORS_CONF,
> - the number of threads which exist concurrently in the process,
> - the number of cpus in the cpu affinity mask applied by sched_setaffinity,
> except in corner-case situations such as cpu hotplug removing all cpus from
> the affinity set,
> - cgroup cpuset "partition" limits,
>
> Note that AFAIR non-partition cgroup cpusets allow a cgroup to "borrow"
> additional cores from the rest of the system if they are idle, therefore
> allowing the number of concurrent threads to go beyond the specified limit.
>
> AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
> Those are two mechanisms both affecting the scheduler task placement.
>
> I would expect the user-space code to use some sensible upper bound as a
> hint about how many per-vcpu data structure elements to expect (and how many
> to pre-allocate), but have a "lazy initialization" fall-back in case the
> vcpu id goes up to the number of configured processors - 1. And I suspect
> that even the number of configured processors may change with CRIU.
>
> If the above explanation makes sense (please let me know if I am wrong
> or missed something), I suspect I should add it to the commit message.
That helps, thanks. I do think that something like this belongs in the
changelog - or, even better, in the upcoming restartable-sequences
section in the userspace-api documentation :)
Thanks,
jon
----- On Feb 25, 2022, at 12:56 PM, Mathieu Desnoyers [email protected] wrote:
> ----- On Feb 25, 2022, at 12:35 PM, Jonathan Corbet [email protected] wrote:
>
>> Mathieu Desnoyers <[email protected]> writes:
>>
>>> This feature allows the scheduler to expose a current virtual cpu id
>>> to user-space. This virtual cpu id is within the possible cpus range,
>>> and is temporarily (and uniquely) assigned while threads are actively
>>> running within a memory space. If a memory space has fewer threads than
>>> cores, or is limited to run on few cores concurrently through sched
>>> affinity or cgroup cpusets, the virtual cpu ids will be values close
>>> to 0, thus allowing efficient use of user-space memory for per-cpu
>>> data structures.
>>
>> So I have one possibly (probably) dumb question: if I'm writing a
>> program to make use of virtual CPU IDs, how do I know what the maximum
>> ID will be? It seems like one of the advantages of this mechanism would
>> be not having to be prepared for anything in the physical ID space, but
>> is there any guarantee that the virtual-ID space will be smaller?
>> Something like "no larger than the number of threads", say?
>
> Hi Jonathan,
>
> This is a very relevant question. Let me quote what I answered to Florian
> on the last round of review for this series:
>
> Some effective upper bounds for the number of vcpu ids observable in a process:
>
> - sysconf(3) _SC_NPROCESSORS_CONF,
> - the number of threads which exist concurrently in the process,
One small detail I forgot to mention: on a NUMA system, a single-threaded
process will observe (typically) vcpu_id=numa_node_id. So it can jump around
between vcpu_id values depending on which numa node it runs on at the moment.
So the vcpu_id is not strictly bound by the number of concurrently running
threads.
Thanks,
Mathieu
> - the number of cpus in the cpu affinity mask applied by sched_setaffinity,
> except in corner-case situations such as cpu hotplug removing all cpus from
> the affinity set,
> - cgroup cpuset "partition" limits,
>
> Note that AFAIR non-partition cgroup cpusets allow a cgroup to "borrow"
> additional cores from the rest of the system if they are idle, therefore
> allowing the number of concurrent threads to go beyond the specified limit.
>
> AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
> Those are two mechanisms both affecting the scheduler task placement.
>
> I would expect the user-space code to use some sensible upper bound as a
> hint about how many per-vcpu data structure elements to expect (and how many
> to pre-allocate), but have a "lazy initialization" fall-back in case the
> vcpu id goes up to the number of configured processors - 1. And I suspect
> that even the number of configured processors may change with CRIU.
>
> If the above explanation makes sense (please let me know if I am wrong
> or missed something), I suspect I should add it to the commit message.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
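
To make the upper-bound and lazy-initialization advice above concrete, here is
a minimal user-space sketch. It is not part of this series: current_vcpu_id()
is a hypothetical stand-in for the future rseq thread-area field
(sched_getcpu() is used only to keep the sketch compilable), and the rseq
critical-section handling needed for safe concurrent updates is left out.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct percpu_slot {
	long count;
	int initialized;
};

static struct percpu_slot *slots;
static long nr_slots;

/*
 * Hypothetical accessor: a real implementation would read the virtual cpu id
 * from the new rseq thread-area field. sched_getcpu() is only a placeholder.
 */
static int current_vcpu_id(void)
{
	return sched_getcpu();
}

static int percpu_setup(void)
{
	/* Upper bound hint: the number of configured processors. */
	nr_slots = sysconf(_SC_NPROCESSORS_CONF);
	if (nr_slots < 1)
		return -1;
	slots = calloc(nr_slots, sizeof(*slots));
	return slots ? 0 : -1;
}

static void percpu_inc(void)
{
	int vcpu = current_vcpu_id();

	/*
	 * Lazy initialization: a slot is only touched once its vcpu id is
	 * actually observed. On NUMA, a single-threaded process typically
	 * observes vcpu_id == numa_node_id, so ids are not guaranteed to be
	 * used densely starting from 0.
	 */
	if (vcpu < 0 || vcpu >= nr_slots)
		return;	/* e.g. processor configuration changed; resize or fall back. */
	if (!slots[vcpu].initialized)
		slots[vcpu].initialized = 1;
	slots[vcpu].count++;
}

int main(void)
{
	if (percpu_setup())
		return 1;
	percpu_inc();
	printf("slots pre-allocated: %ld\n", nr_slots);
	return 0;
}
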
----- On Feb 25, 2022, at 12:35 PM, Jonathan Corbet [email protected] wrote:
> Mathieu Desnoyers <[email protected]> writes:
>
>> This feature allows the scheduler to expose a current virtual cpu id
>> to user-space. This virtual cpu id is within the possible cpus range,
>> and is temporarily (and uniquely) assigned while threads are actively
>> running within a memory space. If a memory space has fewer threads than
>> cores, or is limited to run on few cores concurrently through sched
>> affinity or cgroup cpusets, the virtual cpu ids will be values close
>> to 0, thus allowing efficient use of user-space memory for per-cpu
>> data structures.
>
> So I have one possibly (probably) dumb question: if I'm writing a
> program to make use of virtual CPU IDs, how do I know what the maximum
> ID will be? It seems like one of the advantages of this mechanism would
> be not having to be prepared for anything in the physical ID space, but
> is there any guarantee that the virtual-ID space will be smaller?
> Something like "no larger than the number of threads", say?
Hi Jonathan,
This is a very relevant question. Let me quote what I answered to Florian
on the last round of review for this series:
Some effective upper bounds for the number of vcpu ids observable in a process:
- sysconf(3) _SC_NPROCESSORS_CONF,
- the number of threads which exist concurrently in the process,
- the number of cpus in the cpu affinity mask applied by sched_setaffinity,
except in corner-case situations such as cpu hotplug removing all cpus from
the affinity set,
- cgroup cpuset "partition" limits,
Note that AFAIR non-partition cgroup cpusets allow a cgroup to "borrow"
additional cores from the rest of the system if they are idle, therefore
allowing the number of concurrent threads to go beyond the specified limit.
AFAIR the sched affinity mask is tweaked independently of the cgroup cpuset.
Those are two mechanisms both affecting the scheduler task placement.
I would expect the user-space code to use some sensible upper bound as a
hint about how many per-vcpu data structure elements to expect (and how many
to pre-allocate), but have a "lazy initialization" fall-back in case the
vcpu id goes up to the number of configured processors - 1. And I suspect
that even the number of configured processors may change with CRIU.
If the above explanation makes sense (please let me know if I am wrong
or missed something), I suspect I should add it to the commit message.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com