Subject: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

The following commit has been merged into the sched/core branch of tip:

Commit-ID: 223baf9d17f25e2608dbdff7232c095c1e612268
Gitweb: https://git.kernel.org/tip/223baf9d17f25e2608dbdff7232c095c1e612268
Author: Mathieu Desnoyers <[email protected]>
AuthorDate: Thu, 20 Apr 2023 10:55:48 -04:00
Committer: Peter Zijlstra <[email protected]>
CommitterDate: Fri, 21 Apr 2023 13:24:20 +02:00

sched: Fix performance regression introduced by mm_cid

Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
sysbench regression reported by Aaron Lu.

Keep track of the currently allocated mm_cid for each mm/cpu rather than
freeing them immediately on context switch. This eliminates most atomic
operations when context switching back and forth between threads
belonging to different memory spaces in multi-threaded scenarios (many
processes, each with many threads). The per-mm/per-cpu mm_cid values are
serialized by their respective runqueue locks.
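
As a rough illustration of why the common case needs no atomic operations
on the cidmask, the fast path can be sketched as follows (the function
name here is made up; the real mm_cid_get() in kernel/sched/sched.h below
additionally handles the MM_CID_LAZY_PUT state and snapshots the rq
clock):

static inline int mm_cid_get_cached(struct rq *rq, struct mm_struct *mm)
{
        struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
        int cid;

        lockdep_assert_rq_held(rq);
        cid = __this_cpu_read(pcpu_cid->cid);
        if (mm_cid_is_valid(cid))
                return cid;             /* reuse the cid cached for this mm/cpu */
        cid = __mm_cid_get(rq, mm);     /* slow path: first zero bit of mm_cidmask() */
        __this_cpu_write(pcpu_cid->cid, cid);
        return cid;
}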

Thread migration is handled by introducing an invocation of
sched_mm_cid_migrate_to() (with the destination runqueue lock held) in
activate_task() for migrating tasks. If the destination cpu's mm_cid is
unset, and if the source runqueue is not actively using its mm_cid, then
the source cpu's mm_cid is moved to the destination cpu on migration.
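
As a toy model only (plain userspace C, not the kernel implementation),
the intended effect is that a single thread bouncing across cpus keeps
dragging cid 0 along with it, instead of leaving stale cids behind and
allocating new, higher ones:

#include <stdio.h>

#define NCPU    4
#define UNSET   (-1)

static int pcpu_cid[NCPU];      /* per-cpu cached cid for one mm */
static unsigned long cidmask;   /* bit i set => cid i allocated */

static int cid_alloc(void)
{
        for (int cid = 0; cid < NCPU; cid++) {
                if (!(cidmask & (1UL << cid))) {
                        cidmask |= 1UL << cid;
                        return cid;
                }
        }
        return UNSET;
}

/* On migration, move the source cpu's cid if the destination has none. */
static int migrate(int src_cpu, int dst_cpu)
{
        if (pcpu_cid[dst_cpu] == UNSET && pcpu_cid[src_cpu] != UNSET) {
                pcpu_cid[dst_cpu] = pcpu_cid[src_cpu];
                pcpu_cid[src_cpu] = UNSET;
        }
        if (pcpu_cid[dst_cpu] == UNSET)
                pcpu_cid[dst_cpu] = cid_alloc();
        return pcpu_cid[dst_cpu];
}

int main(void)
{
        for (int cpu = 0; cpu < NCPU; cpu++)
                pcpu_cid[cpu] = UNSET;
        pcpu_cid[0] = cid_alloc();                      /* thread starts on cpu 0 with cid 0 */
        printf("cid after 0->1: %d\n", migrate(0, 1));  /* prints 0 */
        printf("cid after 1->3: %d\n", migrate(1, 3));  /* prints 0 */
        return 0;
}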

Introduce a task work executed periodically, similar to the NUMA work,
which delays reclaim of cid values when they have been unused for a
period of time.

Keep track of the allocation time for each per-cpu cid, and let the task
work clear them when they are observed to be older than
SCHED_MM_CID_PERIOD_NS and unused. This task work also clears all
mm_cids that are greater than or equal to the Hamming weight of the mm
cidmask to keep concurrency ids compact.
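
To make the compaction rule concrete, here is a small standalone
illustration (plain userspace C with made-up example values, not kernel
code). With cids 0, 2 and 7 still allocated, the Hamming weight of the
cidmask is 3, so cid 7 is eligible for clearing and a later allocation
will pick a low, compact id instead:

#include <stdio.h>

int main(void)
{
        int cids[] = { 0, 2, 7 };       /* hypothetical leftover cids for one mm */
        unsigned long cidmask = 0;
        int weight;

        for (unsigned int i = 0; i < sizeof(cids) / sizeof(cids[0]); i++)
                cidmask |= 1UL << cids[i];

        /* Hamming weight == number of cids currently in use. */
        weight = __builtin_popcountl(cidmask);

        for (unsigned int i = 0; i < sizeof(cids) / sizeof(cids[0]); i++)
                printf("cid %d: %s\n", cids[i],
                       cids[i] >= weight ? "cleared by the task work (>= weight)"
                                         : "kept (already compact)");
        return 0;
}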

Because we want to ensure the mm_cid converges towards the smaller
values as migrations happen, remove the prior optimization that was done
when context switching between threads belonging to the same mm: it
could delay the lazy release of the destination runqueue mm_cid after it
has been replaced by a migration. Removing this prior optimization is
not an issue performance-wise because the introduced per-mm/per-cpu
mm_cid tracking also covers this more specific case.

Fixes: af7f588d8f73 ("sched: Introduce per-memory-map concurrency ID")
Reported-by: Aaron Lu <[email protected]>
Signed-off-by: Mathieu Desnoyers <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Aaron Lu <[email protected]>
Link: https://lore.kernel.org/lkml/20230327080502.GA570847@ziqianlu-desk2/
---
 include/linux/mm_types.h |  82 +++++-
 include/linux/sched.h    |   3 +-
 include/linux/sched/mm.h |   5 +-
 kernel/fork.c            |   9 +-
 kernel/sched/core.c      | 523 ++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h     | 239 ++++++++++++++---
 6 files changed, 804 insertions(+), 57 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index a57e6ae..5eab611 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -550,6 +550,13 @@ struct vm_area_struct {
struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
} __randomize_layout;

+#ifdef CONFIG_SCHED_MM_CID
+struct mm_cid {
+ u64 time;
+ int cid;
+};
+#endif
+
struct kioctx_table;
struct mm_struct {
struct {
@@ -600,15 +607,19 @@ struct mm_struct {
atomic_t mm_count;
#ifdef CONFIG_SCHED_MM_CID
/**
- * @cid_lock: Protect cid bitmap updates vs lookups.
+ * @pcpu_cid: Per-cpu current cid.
*
- * Prevent situations where updates to the cid bitmap happen
- * concurrently with lookups. Those can lead to situations
- * where a lookup cannot find a free bit simply because it was
- * unlucky enough to load, non-atomically, bitmap words as they
- * were being concurrently updated by the updaters.
+ * Keep track of the currently allocated mm_cid for each cpu.
+ * The per-cpu mm_cid values are serialized by their respective
+ * runqueue locks.
*/
- raw_spinlock_t cid_lock;
+ struct mm_cid __percpu *pcpu_cid;
+ /*
+ * @mm_cid_next_scan: Next mm_cid scan (in jiffies).
+ *
+ * When the next mm_cid scan is due (in jiffies).
+ */
+ unsigned long mm_cid_next_scan;
#endif
#ifdef CONFIG_MMU
atomic_long_t pgtables_bytes; /* size of all page tables */
@@ -873,6 +884,37 @@ static inline void vma_iter_init(struct vma_iterator *vmi,
}

#ifdef CONFIG_SCHED_MM_CID
+
+enum mm_cid_state {
+ MM_CID_UNSET = -1U, /* Unset state has lazy_put flag set. */
+ MM_CID_LAZY_PUT = (1U << 31),
+};
+
+static inline bool mm_cid_is_unset(int cid)
+{
+ return cid == MM_CID_UNSET;
+}
+
+static inline bool mm_cid_is_lazy_put(int cid)
+{
+ return !mm_cid_is_unset(cid) && (cid & MM_CID_LAZY_PUT);
+}
+
+static inline bool mm_cid_is_valid(int cid)
+{
+ return !(cid & MM_CID_LAZY_PUT);
+}
+
+static inline int mm_cid_set_lazy_put(int cid)
+{
+ return cid | MM_CID_LAZY_PUT;
+}
+
+static inline int mm_cid_clear_lazy_put(int cid)
+{
+ return cid & ~MM_CID_LAZY_PUT;
+}
+
/* Accessor for struct mm_struct's cidmask. */
static inline cpumask_t *mm_cidmask(struct mm_struct *mm)
{
@@ -886,16 +928,40 @@ static inline cpumask_t *mm_cidmask(struct mm_struct *mm)

static inline void mm_init_cid(struct mm_struct *mm)
{
- raw_spin_lock_init(&mm->cid_lock);
+ int i;
+
+ for_each_possible_cpu(i) {
+ struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, i);
+
+ pcpu_cid->cid = MM_CID_UNSET;
+ pcpu_cid->time = 0;
+ }
cpumask_clear(mm_cidmask(mm));
}

+static inline int mm_alloc_cid(struct mm_struct *mm)
+{
+ mm->pcpu_cid = alloc_percpu(struct mm_cid);
+ if (!mm->pcpu_cid)
+ return -ENOMEM;
+ mm_init_cid(mm);
+ return 0;
+}
+
+static inline void mm_destroy_cid(struct mm_struct *mm)
+{
+ free_percpu(mm->pcpu_cid);
+ mm->pcpu_cid = NULL;
+}
+
static inline unsigned int mm_cid_size(void)
{
return cpumask_size();
}
#else /* CONFIG_SCHED_MM_CID */
static inline void mm_init_cid(struct mm_struct *mm) { }
+static inline int mm_alloc_cid(struct mm_struct *mm) { return 0; }
+static inline void mm_destroy_cid(struct mm_struct *mm) { }
static inline unsigned int mm_cid_size(void)
{
return 0;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6d654eb..675298d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1314,7 +1314,10 @@ struct task_struct {

#ifdef CONFIG_SCHED_MM_CID
int mm_cid; /* Current cid in mm */
+ int last_mm_cid; /* Most recent cid in mm */
+ int migrate_from_cpu;
int mm_cid_active; /* Whether cid bitmap is active */
+ struct callback_head cid_work;
#endif

struct tlbflush_unmap_batch tlb_ubc;
diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 2a24361..f20fc06 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -37,6 +37,11 @@ static inline void mmgrab(struct mm_struct *mm)
atomic_inc(&mm->mm_count);
}

+static inline void smp_mb__after_mmgrab(void)
+{
+ smp_mb__after_atomic();
+}
+
extern void __mmdrop(struct mm_struct *mm);

static inline void mmdrop(struct mm_struct *mm)
diff --git a/kernel/fork.c b/kernel/fork.c
index 0c92f22..ad2ee22 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -793,6 +793,7 @@ void __mmdrop(struct mm_struct *mm)
check_mm(mm);
put_user_ns(mm->user_ns);
mm_pasid_drop(mm);
+ mm_destroy_cid(mm);

for (i = 0; i < NR_MM_COUNTERS; i++)
percpu_counter_destroy(&mm->rss_stat[i]);
@@ -1057,7 +1058,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)

#ifdef CONFIG_SCHED_MM_CID
tsk->mm_cid = -1;
+ tsk->last_mm_cid = -1;
tsk->mm_cid_active = 0;
+ tsk->migrate_from_cpu = -1;
#endif
return tsk;

@@ -1162,18 +1165,22 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
if (init_new_context(p, mm))
goto fail_nocontext;

+ if (mm_alloc_cid(mm))
+ goto fail_cid;
+
for (i = 0; i < NR_MM_COUNTERS; i++)
if (percpu_counter_init(&mm->rss_stat[i], 0, GFP_KERNEL_ACCOUNT))
goto fail_pcpu;

mm->user_ns = get_user_ns(user_ns);
lru_gen_init_mm(mm);
- mm_init_cid(mm);
return mm;

fail_pcpu:
while (i > 0)
percpu_counter_destroy(&mm->rss_stat[--i]);
+ mm_destroy_cid(mm);
+fail_cid:
fail_nocontext:
mm_free_pgd(mm);
fail_nopgd:
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5a97ceb..898fa3b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2101,6 +2101,8 @@ void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
if (task_on_rq_migrating(p))
flags |= ENQUEUE_MIGRATED;
+ if (flags & ENQUEUE_MIGRATED)
+ sched_mm_cid_migrate_to(rq, p);

enqueue_task(rq, p, flags);

@@ -3210,6 +3212,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
p->sched_class->migrate_task_rq(p, new_cpu);
p->se.nr_migrations++;
rseq_migrate(p);
+ sched_mm_cid_migrate_from(p);
perf_event_task_migrate(p);
}

@@ -4483,6 +4486,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
p->wake_entry.u_flags = CSD_TYPE_TTWU;
p->migration_pending = NULL;
#endif
+ init_sched_mm_cid(p);
}

DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
@@ -5129,7 +5133,6 @@ prepare_task_switch(struct rq *rq, struct task_struct *prev,
sched_info_switch(rq, prev, next);
perf_event_task_sched_out(prev, next);
rseq_preempt(prev);
- switch_mm_cid(prev, next);
fire_sched_out_preempt_notifiers(prev, next);
kmap_local_sched_out();
prepare_task(next);
@@ -5285,6 +5288,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
*
* kernel -> user switch + mmdrop() active
* user -> user switch
+ *
+ * switch_mm_cid() needs to be updated if the barriers provided
+ * by context_switch() are modified.
*/
if (!next->mm) { // to kernel
enter_lazy_tlb(prev->active_mm, next);
@@ -5314,6 +5320,9 @@ context_switch(struct rq *rq, struct task_struct *prev,
}
}

+ /* switch_mm_cid() requires the memory barriers above. */
+ switch_mm_cid(rq, prev, next);
+
rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

prepare_lock_switch(rq, next, rf);
@@ -5602,6 +5611,7 @@ void scheduler_tick(void)
resched_latency = cpu_resched_latency(rq);
calc_global_load_tick(rq);
sched_core_tick(rq);
+ task_tick_mm_cid(rq, curr);

rq_unlock(rq, &rf);

@@ -11469,45 +11479,524 @@ void call_trace_sched_update_nr_running(struct rq *rq, int count)
}

#ifdef CONFIG_SCHED_MM_CID
-void sched_mm_cid_exit_signals(struct task_struct *t)
+
+/**
+ * @cid_lock: Guarantee forward-progress of cid allocation.
+ *
+ * Concurrency ID allocation within a bitmap is mostly lock-free. The cid_lock
+ * is only used when contention is detected by the lock-free allocation so
+ * forward progress can be guaranteed.
+ */
+DEFINE_RAW_SPINLOCK(cid_lock);
+
+/**
+ * @use_cid_lock: Select cid allocation behavior: lock-free vs spinlock.
+ *
+ * When @use_cid_lock is 0, the cid allocation is lock-free. When contention is
+ * detected, it is set to 1 to ensure that all newly coming allocations are
+ * serialized by @cid_lock until the allocation which detected contention
+ * completes and sets @use_cid_lock back to 0. This guarantees forward progress
+ * of a cid allocation.
+ */
+int use_cid_lock;
+
+/*
+ * mm_cid remote-clear implements a lock-free algorithm to clear per-mm/cpu cid
+ * concurrently with respect to the execution of the source runqueue context
+ * switch.
+ *
+ * There is one basic property we want to guarantee here:
+ *
+ * (1) Remote-clear should _never_ mark a per-cpu cid UNSET when it is actively
+ * used by a task. That would lead to concurrent allocation of the cid and
+ * userspace corruption.
+ *
+ * Provide this guarantee by introducing a Dekker memory ordering to guarantee
+ * that a pair of loads observe at least one of a pair of stores, which can be
+ * shown as:
+ *
+ * X = Y = 0
+ *
+ * w[X]=1 w[Y]=1
+ * MB MB
+ * r[Y]=y r[X]=x
+ *
+ * Which guarantees that x==0 && y==0 is impossible. But rather than using
+ * values 0 and 1, this algorithm cares about specific state transitions of the
+ * runqueue current task (as updated by the scheduler context switch), and the
+ * per-mm/cpu cid value.
+ *
+ * Let's introduce task (Y) which has task->mm == mm and task (N) which has
+ * task->mm != mm for the rest of the discussion. There are two scheduler state
+ * transitions on context switch we care about:
+ *
+ * (TSA) Store to rq->curr with transition from (N) to (Y)
+ *
+ * (TSB) Store to rq->curr with transition from (Y) to (N)
+ *
+ * On the remote-clear side, there is one transition we care about:
+ *
+ * (TMA) cmpxchg to *pcpu_cid to set the LAZY flag
+ *
+ * There is also a transition to UNSET state which can be performed from all
+ * sides (scheduler, remote-clear). It is always performed with a cmpxchg which
+ * guarantees that only a single thread will succeed:
+ *
+ * (TMB) cmpxchg to *pcpu_cid to mark UNSET
+ *
+ * Just to be clear, what we do _not_ want to happen is a transition to UNSET
+ * when a thread is actively using the cid (property (1)).
+ *
+ * Let's look at the relevant combinations of TSA/TSB and TMA transitions.
+ *
+ * Scenario A) (TSA)+(TMA) (from next task perspective)
+ *
+ * CPU0 CPU1
+ *
+ * Context switch CS-1 Remote-clear
+ * - store to rq->curr: (N)->(Y) (TSA) - cmpxchg to *pcpu_cid to LAZY (TMA)
+ * (implied barrier after cmpxchg)
+ * - switch_mm_cid()
+ * - memory barrier (see switch_mm_cid()
+ * comment explaining how this barrier
+ * is combined with other scheduler
+ * barriers)
+ * - mm_cid_get (next)
+ * - READ_ONCE(*pcpu_cid) - rcu_dereference(src_rq->curr)
+ *
+ * This Dekker ensures that either task (Y) is observed by the
+ * rcu_dereference() or the LAZY flag is observed by READ_ONCE(), or both are
+ * observed.
+ *
+ * If task (Y) store is observed by rcu_dereference(), it means that there is
+ * still an active task on the cpu. Remote-clear will therefore not transition
+ * to UNSET, which fulfills property (1).
+ *
+ * If task (Y) is not observed, but the lazy flag is observed by READ_ONCE(),
+ * it will move its state to UNSET, which clears the percpu cid perhaps
+ * uselessly (which is not an issue for correctness). Because task (Y) is not
+ * observed, CPU1 can move ahead to set the state to UNSET. Because moving
+ * state to UNSET is done with a cmpxchg expecting that the old state has the
+ * LAZY flag set, only one thread will successfully UNSET.
+ *
+ * If both states (LAZY flag and task (Y)) are observed, the thread on CPU0
+ * will observe the LAZY flag and transition to UNSET (perhaps uselessly), and
+ * CPU1 will observe task (Y) and do nothing more, which is fine.
+ *
+ * What we are effectively preventing with this Dekker is a scenario where
+ * neither LAZY flag nor store (Y) are observed, which would fail property (1)
+ * because this would UNSET a cid which is actively used.
+ */
+
+void sched_mm_cid_migrate_from(struct task_struct *t)
+{
+ t->migrate_from_cpu = task_cpu(t);
+}
+
+static
+int __sched_mm_cid_migrate_from_fetch_cid(struct rq *src_rq,
+ struct task_struct *t,
+ struct mm_cid *src_pcpu_cid)
{
struct mm_struct *mm = t->mm;
- unsigned long flags;
+ struct task_struct *src_task;
+ int src_cid, last_mm_cid;

if (!mm)
+ return -1;
+
+ last_mm_cid = t->last_mm_cid;
+ /*
+ * If the migrated task has no last cid, or if the current
+ * task on src rq uses the cid, it means the source cid does not need
+ * to be moved to the destination cpu.
+ */
+ if (last_mm_cid == -1)
+ return -1;
+ src_cid = READ_ONCE(src_pcpu_cid->cid);
+ if (!mm_cid_is_valid(src_cid) || last_mm_cid != src_cid)
+ return -1;
+
+ /*
+ * If we observe an active task using the mm on this rq, it means we
+ * are not the last task to be migrated from this cpu for this mm, so
+ * there is no need to move src_cid to the destination cpu.
+ */
+ rcu_read_lock();
+ src_task = rcu_dereference(src_rq->curr);
+ if (READ_ONCE(src_task->mm_cid_active) && src_task->mm == mm) {
+ rcu_read_unlock();
+ t->last_mm_cid = -1;
+ return -1;
+ }
+ rcu_read_unlock();
+
+ return src_cid;
+}
+
+static
+int __sched_mm_cid_migrate_from_try_steal_cid(struct rq *src_rq,
+ struct task_struct *t,
+ struct mm_cid *src_pcpu_cid,
+ int src_cid)
+{
+ struct task_struct *src_task;
+ struct mm_struct *mm = t->mm;
+ int lazy_cid;
+
+ if (src_cid == -1)
+ return -1;
+
+ /*
+ * Attempt to clear the source cpu cid to move it to the destination
+ * cpu.
+ */
+ lazy_cid = mm_cid_set_lazy_put(src_cid);
+ if (!try_cmpxchg(&src_pcpu_cid->cid, &src_cid, lazy_cid))
+ return -1;
+
+ /*
+ * The implicit barrier after cmpxchg per-mm/cpu cid before loading
+ * rq->curr->mm matches the scheduler barrier in context_switch()
+ * between store to rq->curr and load of prev and next task's
+ * per-mm/cpu cid.
+ *
+ * The implicit barrier after cmpxchg per-mm/cpu cid before loading
+ * rq->curr->mm_cid_active matches the barrier in
+ * sched_mm_cid_exit_signals(), sched_mm_cid_before_execve(), and
+ * sched_mm_cid_after_execve() between store to t->mm_cid_active and
+ * load of per-mm/cpu cid.
+ */
+
+ /*
+ * If we observe an active task using the mm on this rq after setting
+ * the lazy-put flag, this task will be responsible for transitioning
+ * from lazy-put flag set to MM_CID_UNSET.
+ */
+ rcu_read_lock();
+ src_task = rcu_dereference(src_rq->curr);
+ if (READ_ONCE(src_task->mm_cid_active) && src_task->mm == mm) {
+ rcu_read_unlock();
+ /*
+ * We observed an active task for this mm, there is therefore
+ * no point in moving this cid to the destination cpu.
+ */
+ t->last_mm_cid = -1;
+ return -1;
+ }
+ rcu_read_unlock();
+
+ /*
+ * The src_cid is unused, so it can be unset.
+ */
+ if (!try_cmpxchg(&src_pcpu_cid->cid, &lazy_cid, MM_CID_UNSET))
+ return -1;
+ return src_cid;
+}
+
+/*
+ * Migration to dst cpu. Called with dst_rq lock held.
+ * Interrupts are disabled, which keeps the window of cid ownership without the
+ * source rq lock held small.
+ */
+void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t)
+{
+ struct mm_cid *src_pcpu_cid, *dst_pcpu_cid;
+ struct mm_struct *mm = t->mm;
+ int src_cid, dst_cid, src_cpu;
+ struct rq *src_rq;
+
+ lockdep_assert_rq_held(dst_rq);
+
+ if (!mm)
+ return;
+ src_cpu = t->migrate_from_cpu;
+ if (src_cpu == -1) {
+ t->last_mm_cid = -1;
+ return;
+ }
+ /*
+ * Move the src cid if the dst cid is unset. This keeps id
+ * allocation closest to 0 in cases where few threads migrate around
+ * many cpus.
+ *
+ * If destination cid is already set, we may have to just clear
+ * the src cid to ensure compactness in frequent migrations
+ * scenarios.
+ *
+ * It is not useful to clear the src cid when the number of threads is
+ * greater or equal to the number of allowed cpus, because user-space
+ * can expect that the number of allowed cids can reach the number of
+ * allowed cpus.
+ */
+ dst_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(dst_rq));
+ dst_cid = READ_ONCE(dst_pcpu_cid->cid);
+ if (!mm_cid_is_unset(dst_cid) &&
+ atomic_read(&mm->mm_users) >= t->nr_cpus_allowed)
+ return;
+ src_pcpu_cid = per_cpu_ptr(mm->pcpu_cid, src_cpu);
+ src_rq = cpu_rq(src_cpu);
+ src_cid = __sched_mm_cid_migrate_from_fetch_cid(src_rq, t, src_pcpu_cid);
+ if (src_cid == -1)
+ return;
+ src_cid = __sched_mm_cid_migrate_from_try_steal_cid(src_rq, t, src_pcpu_cid,
+ src_cid);
+ if (src_cid == -1)
+ return;
+ if (!mm_cid_is_unset(dst_cid)) {
+ __mm_cid_put(mm, src_cid);
+ return;
+ }
+ /* Move src_cid to dst cpu. */
+ mm_cid_snapshot_time(dst_rq, mm);
+ WRITE_ONCE(dst_pcpu_cid->cid, src_cid);
+}
+
+static void sched_mm_cid_remote_clear(struct mm_struct *mm, struct mm_cid *pcpu_cid,
+ int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ struct task_struct *t;
+ unsigned long flags;
+ int cid, lazy_cid;
+
+ cid = READ_ONCE(pcpu_cid->cid);
+ if (!mm_cid_is_valid(cid))
return;
+
+ /*
+ * Clear the cpu cid if it is set to keep cid allocation compact. If
+ * there happens to be other tasks left on the source cpu using this
+ * mm, the next task using this mm will reallocate its cid on context
+ * switch.
+ */
+ lazy_cid = mm_cid_set_lazy_put(cid);
+ if (!try_cmpxchg(&pcpu_cid->cid, &cid, lazy_cid))
+ return;
+
+ /*
+ * The implicit barrier after cmpxchg per-mm/cpu cid before loading
+ * rq->curr->mm matches the scheduler barrier in context_switch()
+ * between store to rq->curr and load of prev and next task's
+ * per-mm/cpu cid.
+ *
+ * The implicit barrier after cmpxchg per-mm/cpu cid before loading
+ * rq->curr->mm_cid_active matches the barrier in
+ * sched_mm_cid_exit_signals(), sched_mm_cid_before_execve(), and
+ * sched_mm_cid_after_execve() between store to t->mm_cid_active and
+ * load of per-mm/cpu cid.
+ */
+
+ /*
+ * If we observe an active task using the mm on this rq after setting
+ * the lazy-put flag, that task will be responsible for transitioning
+ * from lazy-put flag set to MM_CID_UNSET.
+ */
+ rcu_read_lock();
+ t = rcu_dereference(rq->curr);
+ if (READ_ONCE(t->mm_cid_active) && t->mm == mm) {
+ rcu_read_unlock();
+ return;
+ }
+ rcu_read_unlock();
+
+ /*
+ * The cid is unused, so it can be unset.
+ * Disable interrupts to keep the window of cid ownership without rq
+ * lock small.
+ */
local_irq_save(flags);
- mm_cid_put(mm, t->mm_cid);
- t->mm_cid = -1;
- t->mm_cid_active = 0;
+ if (try_cmpxchg(&pcpu_cid->cid, &lazy_cid, MM_CID_UNSET))
+ __mm_cid_put(mm, cid);
local_irq_restore(flags);
}

+static void sched_mm_cid_remote_clear_old(struct mm_struct *mm, int cpu)
+{
+ struct rq *rq = cpu_rq(cpu);
+ struct mm_cid *pcpu_cid;
+ struct task_struct *curr;
+ u64 rq_clock;
+
+ /*
+ * rq->clock load is racy on 32-bit but one spurious clear once in a
+ * while is irrelevant.
+ */
+ rq_clock = READ_ONCE(rq->clock);
+ pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu);
+
+ /*
+ * In order to take care of infrequently scheduled tasks, bump the time
+ * snapshot associated with this cid if an active task using the mm is
+ * observed on this rq.
+ */
+ rcu_read_lock();
+ curr = rcu_dereference(rq->curr);
+ if (READ_ONCE(curr->mm_cid_active) && curr->mm == mm) {
+ WRITE_ONCE(pcpu_cid->time, rq_clock);
+ rcu_read_unlock();
+ return;
+ }
+ rcu_read_unlock();
+
+ if (rq_clock < pcpu_cid->time + SCHED_MM_CID_PERIOD_NS)
+ return;
+ sched_mm_cid_remote_clear(mm, pcpu_cid, cpu);
+}
+
+static void sched_mm_cid_remote_clear_weight(struct mm_struct *mm, int cpu,
+ int weight)
+{
+ struct mm_cid *pcpu_cid;
+ int cid;
+
+ pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu);
+ cid = READ_ONCE(pcpu_cid->cid);
+ if (!mm_cid_is_valid(cid) || cid < weight)
+ return;
+ sched_mm_cid_remote_clear(mm, pcpu_cid, cpu);
+}
+
+static void task_mm_cid_work(struct callback_head *work)
+{
+ unsigned long now = jiffies, old_scan, next_scan;
+ struct task_struct *t = current;
+ struct cpumask *cidmask;
+ struct mm_struct *mm;
+ int weight, cpu;
+
+ SCHED_WARN_ON(t != container_of(work, struct task_struct, cid_work));
+
+ work->next = work; /* Prevent double-add */
+ if (t->flags & PF_EXITING)
+ return;
+ mm = t->mm;
+ if (!mm)
+ return;
+ old_scan = READ_ONCE(mm->mm_cid_next_scan);
+ next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
+ if (!old_scan) {
+ unsigned long res;
+
+ res = cmpxchg(&mm->mm_cid_next_scan, old_scan, next_scan);
+ if (res != old_scan)
+ old_scan = res;
+ else
+ old_scan = next_scan;
+ }
+ if (time_before(now, old_scan))
+ return;
+ if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
+ return;
+ cidmask = mm_cidmask(mm);
+ /* Clear cids that were not recently used. */
+ for_each_possible_cpu(cpu)
+ sched_mm_cid_remote_clear_old(mm, cpu);
+ weight = cpumask_weight(cidmask);
+ /*
+ * Clear cids that are greater or equal to the cidmask weight to
+ * recompact it.
+ */
+ for_each_possible_cpu(cpu)
+ sched_mm_cid_remote_clear_weight(mm, cpu, weight);
+}
+
+void init_sched_mm_cid(struct task_struct *t)
+{
+ struct mm_struct *mm = t->mm;
+ int mm_users = 0;
+
+ if (mm) {
+ mm_users = atomic_read(&mm->mm_users);
+ if (mm_users == 1)
+ mm->mm_cid_next_scan = jiffies + msecs_to_jiffies(MM_CID_SCAN_DELAY);
+ }
+ t->cid_work.next = &t->cid_work; /* Protect against double add */
+ init_task_work(&t->cid_work, task_mm_cid_work);
+}
+
+void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
+{
+ struct callback_head *work = &curr->cid_work;
+ unsigned long now = jiffies;
+
+ if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
+ work->next != work)
+ return;
+ if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
+ return;
+ task_work_add(curr, work, TWA_RESUME);
+}
+
+void sched_mm_cid_exit_signals(struct task_struct *t)
+{
+ struct mm_struct *mm = t->mm;
+ struct rq_flags rf;
+ struct rq *rq;
+
+ if (!mm)
+ return;
+
+ preempt_disable();
+ rq = this_rq();
+ rq_lock_irqsave(rq, &rf);
+ preempt_enable_no_resched(); /* holding spinlock */
+ WRITE_ONCE(t->mm_cid_active, 0);
+ /*
+ * Store t->mm_cid_active before loading per-mm/cpu cid.
+ * Matches barrier in sched_mm_cid_remote_clear_old().
+ */
+ smp_mb();
+ mm_cid_put(mm);
+ t->last_mm_cid = t->mm_cid = -1;
+ rq_unlock_irqrestore(rq, &rf);
+}
+
void sched_mm_cid_before_execve(struct task_struct *t)
{
struct mm_struct *mm = t->mm;
- unsigned long flags;
+ struct rq_flags rf;
+ struct rq *rq;

if (!mm)
return;
- local_irq_save(flags);
- mm_cid_put(mm, t->mm_cid);
- t->mm_cid = -1;
- t->mm_cid_active = 0;
- local_irq_restore(flags);
+
+ preempt_disable();
+ rq = this_rq();
+ rq_lock_irqsave(rq, &rf);
+ preempt_enable_no_resched(); /* holding spinlock */
+ WRITE_ONCE(t->mm_cid_active, 0);
+ /*
+ * Store t->mm_cid_active before loading per-mm/cpu cid.
+ * Matches barrier in sched_mm_cid_remote_clear_old().
+ */
+ smp_mb();
+ mm_cid_put(mm);
+ t->last_mm_cid = t->mm_cid = -1;
+ rq_unlock_irqrestore(rq, &rf);
}

void sched_mm_cid_after_execve(struct task_struct *t)
{
struct mm_struct *mm = t->mm;
- unsigned long flags;
+ struct rq_flags rf;
+ struct rq *rq;

if (!mm)
return;
- local_irq_save(flags);
- t->mm_cid = mm_cid_get(mm);
- t->mm_cid_active = 1;
- local_irq_restore(flags);
+
+ preempt_disable();
+ rq = this_rq();
+ rq_lock_irqsave(rq, &rf);
+ preempt_enable_no_resched(); /* holding spinlock */
+ WRITE_ONCE(t->mm_cid_active, 1);
+ /*
+ * Store t->mm_cid_active before loading per-mm/cpu cid.
+ * Matches barrier in sched_mm_cid_remote_clear_old().
+ */
+ smp_mb();
+ t->last_mm_cid = t->mm_cid = mm_cid_get(rq, mm);
+ rq_unlock_irqrestore(rq, &rf);
rseq_set_notify_resume(t);
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0606169..ec7b3e0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -3253,61 +3253,238 @@ static inline void update_current_exec_runtime(struct task_struct *curr,
}

#ifdef CONFIG_SCHED_MM_CID
-static inline int __mm_cid_get(struct mm_struct *mm)
+
+#define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */
+#define MM_CID_SCAN_DELAY 100 /* 100ms */
+
+extern raw_spinlock_t cid_lock;
+extern int use_cid_lock;
+
+extern void sched_mm_cid_migrate_from(struct task_struct *t);
+extern void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t);
+extern void task_tick_mm_cid(struct rq *rq, struct task_struct *curr);
+extern void init_sched_mm_cid(struct task_struct *t);
+
+static inline void __mm_cid_put(struct mm_struct *mm, int cid)
+{
+ if (cid < 0)
+ return;
+ cpumask_clear_cpu(cid, mm_cidmask(mm));
+}
+
+/*
+ * The per-mm/cpu cid can have the MM_CID_LAZY_PUT flag set or transition to
+ * the MM_CID_UNSET state without holding the rq lock, but the rq lock needs to
+ * be held to transition to other states.
+ *
+ * State transitions synchronized with cmpxchg or try_cmpxchg need to be
+ * consistent across cpus, which prevents use of this_cpu_cmpxchg.
+ */
+static inline void mm_cid_put_lazy(struct task_struct *t)
+{
+ struct mm_struct *mm = t->mm;
+ struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
+ int cid;
+
+ lockdep_assert_irqs_disabled();
+ cid = __this_cpu_read(pcpu_cid->cid);
+ if (!mm_cid_is_lazy_put(cid) ||
+ !try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
+ return;
+ __mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
+}
+
+static inline int mm_cid_pcpu_unset(struct mm_struct *mm)
+{
+ struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
+ int cid, res;
+
+ lockdep_assert_irqs_disabled();
+ cid = __this_cpu_read(pcpu_cid->cid);
+ for (;;) {
+ if (mm_cid_is_unset(cid))
+ return MM_CID_UNSET;
+ /*
+ * Attempt transition from valid or lazy-put to unset.
+ */
+ res = cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, cid, MM_CID_UNSET);
+ if (res == cid)
+ break;
+ cid = res;
+ }
+ return cid;
+}
+
+static inline void mm_cid_put(struct mm_struct *mm)
+{
+ int cid;
+
+ lockdep_assert_irqs_disabled();
+ cid = mm_cid_pcpu_unset(mm);
+ if (cid == MM_CID_UNSET)
+ return;
+ __mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
+}
+
+static inline int __mm_cid_try_get(struct mm_struct *mm)
{
struct cpumask *cpumask;
int cid;

cpumask = mm_cidmask(mm);
- cid = cpumask_first_zero(cpumask);
- if (cid >= nr_cpu_ids)
+ /*
+ * Retry finding first zero bit if the mask is temporarily
+ * filled. This only happens during concurrent remote-clear
+ * which owns a cid without holding a rq lock.
+ */
+ for (;;) {
+ cid = cpumask_first_zero(cpumask);
+ if (cid < nr_cpu_ids)
+ break;
+ cpu_relax();
+ }
+ if (cpumask_test_and_set_cpu(cid, cpumask))
return -1;
- __cpumask_set_cpu(cid, cpumask);
return cid;
}

-static inline void mm_cid_put(struct mm_struct *mm, int cid)
+/*
+ * Save a snapshot of the current runqueue time of this cpu
+ * with the per-cpu cid value, allowing to estimate how recently it was used.
+ */
+static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
{
- lockdep_assert_irqs_disabled();
- if (cid < 0)
- return;
- raw_spin_lock(&mm->cid_lock);
- __cpumask_clear_cpu(cid, mm_cidmask(mm));
- raw_spin_unlock(&mm->cid_lock);
+ struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(rq));
+
+ lockdep_assert_rq_held(rq);
+ WRITE_ONCE(pcpu_cid->time, rq->clock);
}

-static inline int mm_cid_get(struct mm_struct *mm)
+static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
{
- int ret;
+ int cid;

- lockdep_assert_irqs_disabled();
- raw_spin_lock(&mm->cid_lock);
- ret = __mm_cid_get(mm);
- raw_spin_unlock(&mm->cid_lock);
- return ret;
+ /*
+ * All allocations (even those using the cid_lock) are lock-free. If
+ * use_cid_lock is set, hold the cid_lock to perform cid allocation to
+ * guarantee forward progress.
+ */
+ if (!READ_ONCE(use_cid_lock)) {
+ cid = __mm_cid_try_get(mm);
+ if (cid >= 0)
+ goto end;
+ raw_spin_lock(&cid_lock);
+ } else {
+ raw_spin_lock(&cid_lock);
+ cid = __mm_cid_try_get(mm);
+ if (cid >= 0)
+ goto unlock;
+ }
+
+ /*
+ * cid concurrently allocated. Retry while forcing following
+ * allocations to use the cid_lock to ensure forward progress.
+ */
+ WRITE_ONCE(use_cid_lock, 1);
+ /*
+ * Set use_cid_lock before allocation. Only care about program order
+ * because this is only required for forward progress.
+ */
+ barrier();
+ /*
+ * Retry until it succeeds. It is guaranteed to eventually succeed once
+ * all newcoming allocations observe the use_cid_lock flag set.
+ */
+ do {
+ cid = __mm_cid_try_get(mm);
+ cpu_relax();
+ } while (cid < 0);
+ /*
+ * Allocate before clearing use_cid_lock. Only care about
+ * program order because this is for forward progress.
+ */
+ barrier();
+ WRITE_ONCE(use_cid_lock, 0);
+unlock:
+ raw_spin_unlock(&cid_lock);
+end:
+ mm_cid_snapshot_time(rq, mm);
+ return cid;
}

-static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
+static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
{
+ struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
+ struct cpumask *cpumask;
+ int cid;
+
+ lockdep_assert_rq_held(rq);
+ cpumask = mm_cidmask(mm);
+ cid = __this_cpu_read(pcpu_cid->cid);
+ if (mm_cid_is_valid(cid)) {
+ mm_cid_snapshot_time(rq, mm);
+ return cid;
+ }
+ if (mm_cid_is_lazy_put(cid)) {
+ if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
+ __mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
+ }
+ cid = __mm_cid_get(rq, mm);
+ __this_cpu_write(pcpu_cid->cid, cid);
+ return cid;
+}
+
+static inline void switch_mm_cid(struct rq *rq,
+ struct task_struct *prev,
+ struct task_struct *next)
+{
+ /*
+ * Provide a memory barrier between rq->curr store and load of
+ * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
+ *
+ * Should be adapted if context_switch() is modified.
+ */
+ if (!next->mm) { // to kernel
+ /*
+ * user -> kernel transition does not guarantee a barrier, but
+ * we can use the fact that it performs an atomic operation in
+ * mmgrab().
+ */
+ if (prev->mm) // from user
+ smp_mb__after_mmgrab();
+ /*
+ * kernel -> kernel transition does not change rq->curr->mm
+ * state. It stays NULL.
+ */
+ } else { // to user
+ /*
+ * kernel -> user transition does not provide a barrier
+ * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
+ * Provide it here.
+ */
+ if (!prev->mm) // from kernel
+ smp_mb();
+ /*
+ * user -> user transition guarantees a memory barrier through
+ * switch_mm() when current->mm changes. If current->mm is
+ * unchanged, no barrier is needed.
+ */
+ }
if (prev->mm_cid_active) {
- if (next->mm_cid_active && next->mm == prev->mm) {
- /*
- * Context switch between threads in same mm, hand over
- * the mm_cid from prev to next.
- */
- next->mm_cid = prev->mm_cid;
- prev->mm_cid = -1;
- return;
- }
- mm_cid_put(prev->mm, prev->mm_cid);
+ mm_cid_snapshot_time(rq, prev->mm);
+ mm_cid_put_lazy(prev);
prev->mm_cid = -1;
}
if (next->mm_cid_active)
- next->mm_cid = mm_cid_get(next->mm);
+ next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next->mm);
}

#else
-static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next) { }
+static inline void switch_mm_cid(struct rq *rq, struct task_struct *prev, struct task_struct *next) { }
+static inline void sched_mm_cid_migrate_from(struct task_struct *t) { }
+static inline void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t) { }
+static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
+static inline void init_sched_mm_cid(struct task_struct *t) { }
#endif

#endif /* _KERNEL_SCHED_SCHED_H */


Date: 2023-06-20 08:51:24
From: Swapnil Sapkal
Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

Hello Mathieu,

On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote:
> The following commit has been merged into the sched/core branch of tip:
>
> Commit-ID: 223baf9d17f25e2608dbdff7232c095c1e612268
> Gitweb: https://git.kernel.org/tip/223baf9d17f25e2608dbdff7232c095c1e612268
> Author: Mathieu Desnoyers <[email protected]>
> AuthorDate: Thu, 20 Apr 2023 10:55:48 -04:00
> Committer: Peter Zijlstra <[email protected]>
> CommitterDate: Fri, 21 Apr 2023 13:24:20 +02:00
>
> sched: Fix performance regression introduced by mm_cid

I run standard benchmarks as part of kernel performance regression
testing. When I ran these benchmarks against v6.3.0 and v6.4-rc1,
I saw a performance regression in hackbench running with threads. A git
bisect pointed to this commit, and reverting it regains the performance.
This regression is not seen with hackbench processes. Following are the
results from a 1-socket 4th generation EPYC processor (1 x 96C/192T)
configured in NPS1 mode. The regression becomes more severe as the core
count increases.

The numbers on a 1-socket Bergamo (1 x 128 cores/256 threads) are
significantly worse.

Threads:

Test:             With-mmcid-patch     Without-mmcid-patch
 1-groups:         5.23 (0.00 pct)      4.61 (+11.85 pct)
 2-groups:         4.99 (0.00 pct)      4.72 (+5.41 pct)
 4-groups:         5.96 (0.00 pct)      4.87 (+18.28 pct)
 8-groups:         6.58 (0.00 pct)      5.44 (+17.32 pct)
16-groups:        11.48 (0.00 pct)      8.07 (+29.70 pct)

Processes:

Test:             With-mmcid-patch     Without-mmcid-patch
 1-groups:         5.19 (0.00 pct)      4.90 (+5.58 pct)
 2-groups:         5.44 (0.00 pct)      5.39 (+0.91 pct)
 4-groups:         5.69 (0.00 pct)      5.64 (+0.87 pct)
 8-groups:         6.08 (0.00 pct)      6.01 (+1.15 pct)
16-groups:        10.87 (0.00 pct)     10.83 (+0.36 pct)

When I did IBS profiling on a kernel with this patch, it showed that
most of the time is spent in finish_task_switch() and __switch_to_asm().

IBS profiling report with this patch:

# hackbench -g 32 -f 20 --threads --pipe -l 480000 -s 100

# Overhead  Command    Shared Object     Symbol
# ........  .........  ................  ..............................................
#
     9.41%  hackbench  [kernel.vmlinux]  [k] finish_task_switch.isra.0
     7.09%  hackbench  [kernel.vmlinux]  [k] __switch_to_asm
     4.97%  hackbench  [kernel.vmlinux]  [k] try_to_wake_up
     4.67%  hackbench  [kernel.vmlinux]  [k] native_queued_spin_lock_slowpath
     4.41%  hackbench  [kernel.vmlinux]  [k] finish_wait
     4.38%  hackbench  [kernel.vmlinux]  [k] __schedule
     4.06%  hackbench  [kernel.vmlinux]  [k] copyin
     4.00%  hackbench  [kernel.vmlinux]  [k] copyout
     3.46%  hackbench  [kernel.vmlinux]  [k] apparmor_file_permission
     2.58%  hackbench  [kernel.vmlinux]  [k] pipe_write
     2.56%  hackbench  [kernel.vmlinux]  [k] reweight_entity
Please let me know if you need any other data.

--
Thanks and Regards,
Swapnil

> + next_scan = now + msecs_to_jiffies(MM_CID_SCAN_DELAY);
> + if (!old_scan) {
> + unsigned long res;
> +
> + res = cmpxchg(&mm->mm_cid_next_scan, old_scan, next_scan);
> + if (res != old_scan)
> + old_scan = res;
> + else
> + old_scan = next_scan;
> + }
> + if (time_before(now, old_scan))
> + return;
> + if (!try_cmpxchg(&mm->mm_cid_next_scan, &old_scan, next_scan))
> + return;
> + cidmask = mm_cidmask(mm);
> + /* Clear cids that were not recently used. */
> + for_each_possible_cpu(cpu)
> + sched_mm_cid_remote_clear_old(mm, cpu);
> + weight = cpumask_weight(cidmask);
> + /*
> + * Clear cids that are greater or equal to the cidmask weight to
> + * recompact it.
> + */
> + for_each_possible_cpu(cpu)
> + sched_mm_cid_remote_clear_weight(mm, cpu, weight);
> +}
> +
> +void init_sched_mm_cid(struct task_struct *t)
> +{
> + struct mm_struct *mm = t->mm;
> + int mm_users = 0;
> +
> + if (mm) {
> + mm_users = atomic_read(&mm->mm_users);
> + if (mm_users == 1)
> + mm->mm_cid_next_scan = jiffies + msecs_to_jiffies(MM_CID_SCAN_DELAY);
> + }
> + t->cid_work.next = &t->cid_work; /* Protect against double add */
> + init_task_work(&t->cid_work, task_mm_cid_work);
> +}
> +
> +void task_tick_mm_cid(struct rq *rq, struct task_struct *curr)
> +{
> + struct callback_head *work = &curr->cid_work;
> + unsigned long now = jiffies;
> +
> + if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
> + work->next != work)
> + return;
> + if (time_before(now, READ_ONCE(curr->mm->mm_cid_next_scan)))
> + return;
> + task_work_add(curr, work, TWA_RESUME);
> +}
> +
> +void sched_mm_cid_exit_signals(struct task_struct *t)
> +{
> + struct mm_struct *mm = t->mm;
> + struct rq_flags rf;
> + struct rq *rq;
> +
> + if (!mm)
> + return;
> +
> + preempt_disable();
> + rq = this_rq();
> + rq_lock_irqsave(rq, &rf);
> + preempt_enable_no_resched(); /* holding spinlock */
> + WRITE_ONCE(t->mm_cid_active, 0);
> + /*
> + * Store t->mm_cid_active before loading per-mm/cpu cid.
> + * Matches barrier in sched_mm_cid_remote_clear_old().
> + */
> + smp_mb();
> + mm_cid_put(mm);
> + t->last_mm_cid = t->mm_cid = -1;
> + rq_unlock_irqrestore(rq, &rf);
> +}
> +
> void sched_mm_cid_before_execve(struct task_struct *t)
> {
> struct mm_struct *mm = t->mm;
> - unsigned long flags;
> + struct rq_flags rf;
> + struct rq *rq;
>
> if (!mm)
> return;
> - local_irq_save(flags);
> - mm_cid_put(mm, t->mm_cid);
> - t->mm_cid = -1;
> - t->mm_cid_active = 0;
> - local_irq_restore(flags);
> +
> + preempt_disable();
> + rq = this_rq();
> + rq_lock_irqsave(rq, &rf);
> + preempt_enable_no_resched(); /* holding spinlock */
> + WRITE_ONCE(t->mm_cid_active, 0);
> + /*
> + * Store t->mm_cid_active before loading per-mm/cpu cid.
> + * Matches barrier in sched_mm_cid_remote_clear_old().
> + */
> + smp_mb();
> + mm_cid_put(mm);
> + t->last_mm_cid = t->mm_cid = -1;
> + rq_unlock_irqrestore(rq, &rf);
> }
>
> void sched_mm_cid_after_execve(struct task_struct *t)
> {
> struct mm_struct *mm = t->mm;
> - unsigned long flags;
> + struct rq_flags rf;
> + struct rq *rq;
>
> if (!mm)
> return;
> - local_irq_save(flags);
> - t->mm_cid = mm_cid_get(mm);
> - t->mm_cid_active = 1;
> - local_irq_restore(flags);
> +
> + preempt_disable();
> + rq = this_rq();
> + rq_lock_irqsave(rq, &rf);
> + preempt_enable_no_resched(); /* holding spinlock */
> + WRITE_ONCE(t->mm_cid_active, 1);
> + /*
> + * Store t->mm_cid_active before loading per-mm/cpu cid.
> + * Matches barrier in sched_mm_cid_remote_clear_old().
> + */
> + smp_mb();
> + t->last_mm_cid = t->mm_cid = mm_cid_get(rq, mm);
> + rq_unlock_irqrestore(rq, &rf);
> rseq_set_notify_resume(t);
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 0606169..ec7b3e0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -3253,61 +3253,238 @@ static inline void update_current_exec_runtime(struct task_struct *curr,
> }
>
> #ifdef CONFIG_SCHED_MM_CID
> -static inline int __mm_cid_get(struct mm_struct *mm)
> +
> +#define SCHED_MM_CID_PERIOD_NS (100ULL * 1000000) /* 100ms */
> +#define MM_CID_SCAN_DELAY 100 /* 100ms */
> +
> +extern raw_spinlock_t cid_lock;
> +extern int use_cid_lock;
> +
> +extern void sched_mm_cid_migrate_from(struct task_struct *t);
> +extern void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t);
> +extern void task_tick_mm_cid(struct rq *rq, struct task_struct *curr);
> +extern void init_sched_mm_cid(struct task_struct *t);
> +
> +static inline void __mm_cid_put(struct mm_struct *mm, int cid)
> +{
> + if (cid < 0)
> + return;
> + cpumask_clear_cpu(cid, mm_cidmask(mm));
> +}
> +
> +/*
> + * The per-mm/cpu cid can have the MM_CID_LAZY_PUT flag set or transition to
> + * the MM_CID_UNSET state without holding the rq lock, but the rq lock needs to
> + * be held to transition to other states.
> + *
> + * State transitions synchronized with cmpxchg or try_cmpxchg need to be
> + * consistent across cpus, which prevents use of this_cpu_cmpxchg.
> + */
> +static inline void mm_cid_put_lazy(struct task_struct *t)
> +{
> + struct mm_struct *mm = t->mm;
> + struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
> + int cid;
> +
> + lockdep_assert_irqs_disabled();
> + cid = __this_cpu_read(pcpu_cid->cid);
> + if (!mm_cid_is_lazy_put(cid) ||
> + !try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
> + return;
> + __mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
> +}
> +
> +static inline int mm_cid_pcpu_unset(struct mm_struct *mm)
> +{
> + struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
> + int cid, res;
> +
> + lockdep_assert_irqs_disabled();
> + cid = __this_cpu_read(pcpu_cid->cid);
> + for (;;) {
> + if (mm_cid_is_unset(cid))
> + return MM_CID_UNSET;
> + /*
> + * Attempt transition from valid or lazy-put to unset.
> + */
> + res = cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, cid, MM_CID_UNSET);
> + if (res == cid)
> + break;
> + cid = res;
> + }
> + return cid;
> +}
> +
> +static inline void mm_cid_put(struct mm_struct *mm)
> +{
> + int cid;
> +
> + lockdep_assert_irqs_disabled();
> + cid = mm_cid_pcpu_unset(mm);
> + if (cid == MM_CID_UNSET)
> + return;
> + __mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
> +}
> +
> +static inline int __mm_cid_try_get(struct mm_struct *mm)
> {
> struct cpumask *cpumask;
> int cid;
>
> cpumask = mm_cidmask(mm);
> - cid = cpumask_first_zero(cpumask);
> - if (cid >= nr_cpu_ids)
> + /*
> + * Retry finding first zero bit if the mask is temporarily
> + * filled. This only happens during concurrent remote-clear
> + * which owns a cid without holding a rq lock.
> + */
> + for (;;) {
> + cid = cpumask_first_zero(cpumask);
> + if (cid < nr_cpu_ids)
> + break;
> + cpu_relax();
> + }
> + if (cpumask_test_and_set_cpu(cid, cpumask))
> return -1;
> - __cpumask_set_cpu(cid, cpumask);
> return cid;
> }
>
> -static inline void mm_cid_put(struct mm_struct *mm, int cid)
> +/*
> + * Save a snapshot of the current runqueue time of this cpu
> + * with the per-cpu cid value, allowing to estimate how recently it was used.
> + */
> +static inline void mm_cid_snapshot_time(struct rq *rq, struct mm_struct *mm)
> {
> - lockdep_assert_irqs_disabled();
> - if (cid < 0)
> - return;
> - raw_spin_lock(&mm->cid_lock);
> - __cpumask_clear_cpu(cid, mm_cidmask(mm));
> - raw_spin_unlock(&mm->cid_lock);
> + struct mm_cid *pcpu_cid = per_cpu_ptr(mm->pcpu_cid, cpu_of(rq));
> +
> + lockdep_assert_rq_held(rq);
> + WRITE_ONCE(pcpu_cid->time, rq->clock);
> }
>
> -static inline int mm_cid_get(struct mm_struct *mm)
> +static inline int __mm_cid_get(struct rq *rq, struct mm_struct *mm)
> {
> - int ret;
> + int cid;
>
> - lockdep_assert_irqs_disabled();
> - raw_spin_lock(&mm->cid_lock);
> - ret = __mm_cid_get(mm);
> - raw_spin_unlock(&mm->cid_lock);
> - return ret;
> + /*
> + * All allocations (even those using the cid_lock) are lock-free. If
> + * use_cid_lock is set, hold the cid_lock to perform cid allocation to
> + * guarantee forward progress.
> + */
> + if (!READ_ONCE(use_cid_lock)) {
> + cid = __mm_cid_try_get(mm);
> + if (cid >= 0)
> + goto end;
> + raw_spin_lock(&cid_lock);
> + } else {
> + raw_spin_lock(&cid_lock);
> + cid = __mm_cid_try_get(mm);
> + if (cid >= 0)
> + goto unlock;
> + }
> +
> + /*
> + * cid concurrently allocated. Retry while forcing following
> + * allocations to use the cid_lock to ensure forward progress.
> + */
> + WRITE_ONCE(use_cid_lock, 1);
> + /*
> + * Set use_cid_lock before allocation. Only care about program order
> + * because this is only required for forward progress.
> + */
> + barrier();
> + /*
> + * Retry until it succeeds. It is guaranteed to eventually succeed once
> + * all newcoming allocations observe the use_cid_lock flag set.
> + */
> + do {
> + cid = __mm_cid_try_get(mm);
> + cpu_relax();
> + } while (cid < 0);
> + /*
> + * Allocate before clearing use_cid_lock. Only care about
> + * program order because this is for forward progress.
> + */
> + barrier();
> + WRITE_ONCE(use_cid_lock, 0);
> +unlock:
> + raw_spin_unlock(&cid_lock);
> +end:
> + mm_cid_snapshot_time(rq, mm);
> + return cid;
> }
>
> -static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next)
> +static inline int mm_cid_get(struct rq *rq, struct mm_struct *mm)
> {
> + struct mm_cid __percpu *pcpu_cid = mm->pcpu_cid;
> + struct cpumask *cpumask;
> + int cid;
> +
> + lockdep_assert_rq_held(rq);
> + cpumask = mm_cidmask(mm);
> + cid = __this_cpu_read(pcpu_cid->cid);
> + if (mm_cid_is_valid(cid)) {
> + mm_cid_snapshot_time(rq, mm);
> + return cid;
> + }
> + if (mm_cid_is_lazy_put(cid)) {
> + if (try_cmpxchg(&this_cpu_ptr(pcpu_cid)->cid, &cid, MM_CID_UNSET))
> + __mm_cid_put(mm, mm_cid_clear_lazy_put(cid));
> + }
> + cid = __mm_cid_get(rq, mm);
> + __this_cpu_write(pcpu_cid->cid, cid);
> + return cid;
> +}
> +
> +static inline void switch_mm_cid(struct rq *rq,
> + struct task_struct *prev,
> + struct task_struct *next)
> +{
> + /*
> + * Provide a memory barrier between rq->curr store and load of
> + * {prev,next}->mm->pcpu_cid[cpu] on rq->curr->mm transition.
> + *
> + * Should be adapted if context_switch() is modified.
> + */
> + if (!next->mm) { // to kernel
> + /*
> + * user -> kernel transition does not guarantee a barrier, but
> + * we can use the fact that it performs an atomic operation in
> + * mmgrab().
> + */
> + if (prev->mm) // from user
> + smp_mb__after_mmgrab();
> + /*
> + * kernel -> kernel transition does not change rq->curr->mm
> + * state. It stays NULL.
> + */
> + } else { // to user
> + /*
> + * kernel -> user transition does not provide a barrier
> + * between rq->curr store and load of {prev,next}->mm->pcpu_cid[cpu].
> + * Provide it here.
> + */
> + if (!prev->mm) // from kernel
> + smp_mb();
> + /*
> + * user -> user transition guarantees a memory barrier through
> + * switch_mm() when current->mm changes. If current->mm is
> + * unchanged, no barrier is needed.
> + */
> + }
> if (prev->mm_cid_active) {
> - if (next->mm_cid_active && next->mm == prev->mm) {
> - /*
> - * Context switch between threads in same mm, hand over
> - * the mm_cid from prev to next.
> - */
> - next->mm_cid = prev->mm_cid;
> - prev->mm_cid = -1;
> - return;
> - }
> - mm_cid_put(prev->mm, prev->mm_cid);
> + mm_cid_snapshot_time(rq, prev->mm);
> + mm_cid_put_lazy(prev);
> prev->mm_cid = -1;
> }
> if (next->mm_cid_active)
> - next->mm_cid = mm_cid_get(next->mm);
> + next->last_mm_cid = next->mm_cid = mm_cid_get(rq, next->mm);
> }
>
> #else
> -static inline void switch_mm_cid(struct task_struct *prev, struct task_struct *next) { }
> +static inline void switch_mm_cid(struct rq *rq, struct task_struct *prev, struct task_struct *next) { }
> +static inline void sched_mm_cid_migrate_from(struct task_struct *t) { }
> +static inline void sched_mm_cid_migrate_to(struct rq *dst_rq, struct task_struct *t) { }
> +static inline void task_tick_mm_cid(struct rq *rq, struct task_struct *curr) { }
> +static inline void init_sched_mm_cid(struct task_struct *t) { }
> #endif
>
> #endif /* _KERNEL_SCHED_SCHED_H */
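
As an aside, the lazy-put protocol implemented above by sched_mm_cid_remote_clear(),
__sched_mm_cid_migrate_from_try_steal_cid() and mm_cid_put_lazy() can be modelled in a
few lines of ordinary C. The sketch below is illustrative only: the names and the cid
encoding are simplified and do not match the kernel's actual flag layout. It shows the
valid -> lazy-put -> unset transitions done with compare-and-exchange, so that either
the remote clearer or the cid's owner completes the release, but never both:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Simplified encoding (illustrative; not the kernel's actual layout):
 * a valid cid is a small non-negative integer, LAZY_PUT is a high flag
 * bit, UNSET is -1. */
#define CID_UNSET    (-1)
#define CID_LAZY_PUT (1 << 30)

static bool cid_is_valid(int cid)
{
	return cid >= 0 && !(cid & CID_LAZY_PUT);
}

/*
 * Remote clearer (models sched_mm_cid_remote_clear()): valid -> lazy-put
 * -> unset. It backs off whenever the value changed underneath it, in
 * which case the concurrent owner becomes responsible for the final step.
 */
static int remote_clear(atomic_int *pcpu_cid)
{
	int cid = atomic_load(pcpu_cid);
	int lazy;

	if (!cid_is_valid(cid))
		return -1;
	lazy = cid | CID_LAZY_PUT;
	if (!atomic_compare_exchange_strong(pcpu_cid, &cid, lazy))
		return -1;	/* raced with the owner */
	/*
	 * The kernel re-checks rq->curr here; if an active user of the mm
	 * is observed, that task performs the lazy-put -> unset step itself.
	 */
	if (!atomic_compare_exchange_strong(pcpu_cid, &lazy, CID_UNSET))
		return -1;	/* the owner reclaimed or released it */
	return cid;		/* cid released/stolen by the clearer */
}

/*
 * Owner side (models mm_cid_put_lazy()): completes the release only if
 * the lazy-put flag was set on its own per-cpu cid.
 */
static void owner_put_lazy(atomic_int *pcpu_cid)
{
	int cid = atomic_load(pcpu_cid);

	if (cid < 0 || !(cid & CID_LAZY_PUT))
		return;
	if (atomic_compare_exchange_strong(pcpu_cid, &cid, CID_UNSET))
		printf("owner released cid %d\n", cid & ~CID_LAZY_PUT);
}

int main(void)
{
	atomic_int pcpu_cid = 3;	/* pretend cid 3 is currently cached */

	printf("remote_clear() returned %d\n", remote_clear(&pcpu_cid));
	owner_put_lazy(&pcpu_cid);	/* no-op here: already CID_UNSET */
	return 0;
}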

2023-06-20 09:47:03

by Peter Zijlstra

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On Tue, Jun 20, 2023 at 01:44:32PM +0530, Swapnil Sapkal wrote:
> Hello Mathieu,
>
> On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote:
> > [ ... full tip-bot commit message trimmed ... ]
>
> I run standard benchmarks as part of kernel performance regression
> testing. When I ran these benchmarks against v6.3.0 and v6.4-rc1, I saw
> a performance regression in hackbench running with threads. A git bisect
> pointed to this commit, and reverting it regains the performance. This
> regression is not seen with hackbench processes.

Well, *this* commit was supposed to help fix the horrible contention on
cid_lock that was introduced with af7f588d8f73.

> Following are the results from a 1 Socket 4th generation EPYC
> processor (1 x 96C/192T) configured in NPS1 mode. This regression
> becomes more severe as the core count increases.
>
> The numbers on a 1 Socket Bergamo (1 x 128 cores/256 threads) are significantly worse.
>
> Threads:
>
> Test: With-mmcid-patch Without-mmcid-patch
> 1-groups: 5.23 (0.00 pct) 4.61 (+11.85 pct)
> 2-groups: 4.99 (0.00 pct) 4.72 (+5.41 pct)
> 4-groups: 5.96 (0.00 pct) 4.87 (+18.28 pct)
> 8-groups: 6.58 (0.00 pct) 5.44 (+17.32 pct)
> 16-groups: 11.48 (0.00 pct) 8.07 (+29.70 pct)

I'm really confused, so you're saying that having a process wide
spinlock is better than what this patch does? Or are you testing against
something without mm-cid entirely?

2023-06-20 10:40:12

by Swapnil Sapkal

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

Hello Peter,

On 6/20/2023 2:41 PM, Peter Zijlstra wrote:
> On Tue, Jun 20, 2023 at 01:44:32PM +0530, Swapnil Sapkal wrote:
>> Hello Mathieu,
>>
>> On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote:
>>> [ ... full tip-bot commit message trimmed ... ]
>>
>> I run standard benchmarks as part of kernel performance regression
>> testing. When I ran these benchmarks against v6.3.0 and v6.4-rc1, I saw
>> a performance regression in hackbench running with threads. A git bisect
>> pointed to this commit, and reverting it regains the performance. This
>> regression is not seen with hackbench processes.
>
> Well, *this* commit was supposed to help fix the horrible contention on
> cid_lock that was introduced with af7f588d8f73.

I went back and tested the commit that introduced mm_cid and I found that the
original implementation actually helped hackbench. Following are the numbers from
a 2 Socket Zen3 server (2 x 64C/128T):

Test: base (v6.2-rc1) base + orig_mm_cid
1-groups: 4.29 (0.00 pct) 4.32 (-0.69 pct)
2-groups: 4.96 (0.00 pct) 4.94 (0.40 pct)
4-groups: 5.21 (0.00 pct) 4.10 (21.30 pct)
8-groups: 5.44 (0.00 pct) 4.50 (17.27 pct)
16-groups: 7.09 (0.00 pct) 5.28 (25.52 pct)

I see the following IBS traces in this case:

Base:

6.69% sched-messaging [kernel.vmlinux] [k] copy_user_generic_string
5.38% sched-messaging [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
3.73% swapper [kernel.vmlinux] [k] __switch_to_asm
3.23% sched-messaging [kernel.vmlinux] [k] __calc_delta
2.93% sched-messaging [kernel.vmlinux] [k] try_to_wake_up
2.63% sched-messaging [kernel.vmlinux] [k] dequeue_task_fair
2.56% sched-messaging [kernel.vmlinux] [k] osq_lock

Base + orig_mm_cid:

13.70% sched-messaging [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
11.87% swapper [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
8.99% sched-messaging [kernel.vmlinux] [k] copy_user_generic_string
6.08% sched-messaging [kernel.vmlinux] [k] osq_lock
4.79% sched-messaging [kernel.vmlinux] [k] apparmor_file_permission
3.71% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
3.66% sched-messaging [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
3.11% sched-messaging [kernel.vmlinux] [k] _copy_from_iter

>
>> [ ... EPYC / Bergamo hackbench numbers trimmed; quoted above ... ]
>
> I'm really confused, so you're saying that having a process wide
> spinlock is better than what this patch does? Or are you testing against
> something without mm-cid entirely?

It does look like the lock contention introduced by the original mm_cid patch helped
hackbench in this case. With it, I see hackbench threads run for longer on average
(avg_atom) and total idle entries are down significantly. Even with C1 and C2
disabled, I see similar behavior. With the new mm_cid patch that gets rid of the
lock contention, we see a drop in hackbench performance.

I will keep digging into this; meanwhile, if you have any pointers, please do let
me know.

--
Thanks and Regards,
Swapnil

2023-06-20 11:42:21

by Mathieu Desnoyers

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 6/20/23 06:35, Swapnil Sapkal wrote:
> Hello Peter,
>
> On 6/20/2023 2:41 PM, Peter Zijlstra wrote:
>> On Tue, Jun 20, 2023 at 01:44:32PM +0530, Swapnil Sapkal wrote:
>>> Hello Mathieu,
>>>
>>> On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote:
>>>> [ ... full tip-bot commit message trimmed ... ]
>
> [ ... earlier replies and benchmark data trimmed; quoted in full in the
> messages above ... ]
>
> It does look like the lock contention introduced by the original mm_cid
> patch helped
> hackbench in this case. In that case, I see hackbench threads run for
> longer on average (avg_atom)
> and total idle entries are down significantly. Even on disabling C1 and
> C2, I see
> similar behavior. With the new mm_cid patch that gets rid of the lock
> contention, we see a drop
> in the hackbench performance.
>
> I will go dig into this further meanwhile if you have any pointers
> please do let me know.

I suspect the baseline doesn't have spinlock contention because the test case
schedules between threads belonging to the same process, for which the initial
mm_cid patch had an optimization that skipped the spinlock entirely.

This optimization for inter-thread scheduling had to be removed in the follow-up
patch to address the performance issue more generally, covering inter-process
scheduling as well.

I suspect the regression is caused by the mm_count cache line bouncing.

Please try with this additional patch applied:

https://lore.kernel.org/lkml/[email protected]/

This patch has recently been merged into the mm tree.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
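
The mm_count cache-line bouncing suspected above is a classic false-sharing pattern:
a counter that remote CPUs update atomically shares a cache line with read-mostly hot
fields, so every remote update invalidates that line everywhere. The sketch below is
purely illustrative and is not the patch linked above (the kernel has its own helpers
for this, such as ____cacheline_aligned_in_smp); it only shows the general padding
idea in standalone C11, assuming a 64-byte cache line:

#include <stdatomic.h>
#include <stdio.h>

/*
 * Layout prone to false sharing: refcount is atomically updated by other
 * CPUs, and every such update also invalidates the cache line holding
 * the read-mostly fields next to it.
 */
struct state_bouncing {
	long hot_a;		/* read-mostly */
	long hot_b;		/* read-mostly */
	atomic_long refcount;	/* frequently written from remote CPUs */
};

/*
 * Same data with the remotely written counter aligned to its own 64-byte
 * cache line, so bouncing the counter no longer evicts the line that
 * holds the read-mostly fields.
 */
struct state_padded {
	long hot_a;
	long hot_b;
	_Alignas(64) atomic_long refcount;
	/* the compiler pads the struct to a multiple of 64 bytes here */
};

int main(void)
{
	printf("bouncing layout: %zu bytes, padded layout: %zu bytes\n",
	       sizeof(struct state_bouncing), sizeof(struct state_padded));
	return 0;
}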


2023-06-21 16:51:39

by Swapnil Sapkal

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

Hello Mathieu,

On 6/20/2023 4:21 PM, Mathieu Desnoyers wrote:
> [ ... quoted commit message and earlier benchmark discussion trimmed; quoted
> in full in the messages above ... ]
>
> I suspect the regression is caused by the mm_count cache line bouncing.
>
> Please try with this additional patch applied:
>
> https://lore.kernel.org/lkml/[email protected]/

Thanks for the suggestion. I tried out the patch you suggested. I am seeing an
improvement in the hackbench numbers with the mm_count padding, but it does not
match what we achieved by reverting the new mm_cid patch.

Below are the results on the 1 Socket 4th Generation EPYC Processor (1 x 96C/192T):

Threads:

Test: Base (v6.4-rc1) Base + new_mmcid_reverted Base + mm_count_padding
1-groups: 5.23 (0.00 pct) 4.61 (11.85 pct) 5.11 (2.29 pct)
2-groups: 4.99 (0.00 pct) 4.72 (5.41 pct) 5.00 (-0.20 pct)
4-groups: 5.96 (0.00 pct) 4.87 (18.28 pct) 5.86 (1.67 pct)
8-groups: 6.58 (0.00 pct) 5.44 (17.32 pct) 6.20 (5.77 pct)
16-groups: 11.48 (0.00 pct) 8.07 (29.70 pct) 10.68 (6.96 pct)

Processes:

Test: Base (v6.4-rc1) Base + new_mmcid_reverted Base + mm_count_padding
1-groups: 5.19 (0.00 pct) 4.90 (5.58 pct) 5.19 (0.00 pct)
2-groups: 5.44 (0.00 pct) 5.39 (0.91 pct) 5.39 (0.91 pct)
4-groups: 5.69 (0.00 pct) 5.64 (0.87 pct) 5.64 (0.87 pct)
8-groups: 6.08 (0.00 pct) 6.01 (1.15 pct) 6.04 (0.65 pct)
16-groups: 10.87 (0.00 pct) 10.83 (0.36 pct) 10.93 (-0.55 pct)
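
(For readers following the tables: these are hackbench runtimes in seconds, so
lower is better, and the percentages appear to be the change relative to the first
column. For example, in the threads run above, 1-groups improves by
(5.23 - 4.61) / 5.23 = 11.85 pct with the revert and 16-groups by
(11.48 - 8.07) / 11.48 = 29.70 pct, while a negative value such as -0.55 pct is a
small slowdown.)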

The IBS profile shows that __switch_to_asm() is at the top in the baseline run
and is not seen with the mm_count padding patch. The full IBS profile data for
all three runs follows:

# Base (v6.4-rc1)
Threads:
Total time: 11.486 [sec]

5.15% sched-messaging [kernel.vmlinux] [k] __switch_to_asm
4.31% sched-messaging [kernel.vmlinux] [k] copyout
4.29% sched-messaging [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
4.22% sched-messaging [kernel.vmlinux] [k] copyin
3.92% sched-messaging [kernel.vmlinux] [k] apparmor_file_permission
2.91% sched-messaging [kernel.vmlinux] [k] __schedule
2.34% swapper [kernel.vmlinux] [k] __switch_to_asm
2.10% sched-messaging [kernel.vmlinux] [k] prepare_to_wait_event
2.10% sched-messaging [kernel.vmlinux] [k] try_to_wake_up
2.07% sched-messaging [kernel.vmlinux] [k] finish_task_switch.isra.0
2.00% sched-messaging [kernel.vmlinux] [k] pipe_write
1.82% sched-messaging [kernel.vmlinux] [k] check_preemption_disabled
1.73% sched-messaging [kernel.vmlinux] [k] exit_to_user_mode_prepare
1.52% sched-messaging [kernel.vmlinux] [k] __entry_text_start
1.49% sched-messaging [kernel.vmlinux] [k] osq_lock
1.45% sched-messaging libc.so.6 [.] write
1.44% swapper [kernel.vmlinux] [k] native_sched_clock
1.38% sched-messaging [kernel.vmlinux] [k] psi_group_change
1.38% sched-messaging [kernel.vmlinux] [k] pipe_read
1.37% sched-messaging libc.so.6 [.] read
1.06% sched-messaging [kernel.vmlinux] [k] vfs_read
1.01% swapper [kernel.vmlinux] [k] psi_group_change
1.00% sched-messaging [kernel.vmlinux] [k] update_curr

# Base + mm_count_padding
Threads:
Total time: 11.384 [sec]

4.43% sched-messaging [kernel.vmlinux] [k] copyin
4.39% sched-messaging [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
4.07% sched-messaging [kernel.vmlinux] [k] apparmor_file_permission
4.07% sched-messaging [kernel.vmlinux] [k] copyout
2.49% sched-messaging [kernel.vmlinux] [k] entry_SYSCALL_64
2.37% sched-messaging [kernel.vmlinux] [k] update_cfs_group
2.19% sched-messaging [kernel.vmlinux] [k] pipe_write
2.00% sched-messaging [kernel.vmlinux] [k] check_preemption_disabled
1.93% swapper [kernel.vmlinux] [k] update_load_avg
1.81% sched-messaging [kernel.vmlinux] [k] exit_to_user_mode_prepare
1.69% sched-messaging [kernel.vmlinux] [k] try_to_wake_up
1.58% sched-messaging libc.so.6 [.] write
1.53% sched-messaging [kernel.vmlinux] [k] psi_group_change
1.50% sched-messaging libc.so.6 [.] read
1.50% sched-messaging [kernel.vmlinux] [k] pipe_read
1.39% sched-messaging [kernel.vmlinux] [k] update_load_avg
1.39% sched-messaging [kernel.vmlinux] [k] osq_lock
1.30% sched-messaging [kernel.vmlinux] [k] update_curr
1.28% swapper [kernel.vmlinux] [k] psi_group_change
1.16% sched-messaging [kernel.vmlinux] [k] vfs_read
1.12% sched-messaging [kernel.vmlinux] [k] vfs_write
1.10% sched-messaging [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
1.09% sched-messaging [kernel.vmlinux] [k] __switch_to_asm
1.08% sched-messaging [kernel.vmlinux] [k] do_syscall_64
1.06% sched-messaging [kernel.vmlinux] [k] select_task_rq_fair
1.03% swapper [kernel.vmlinux] [k] update_cfs_group
1.00% swapper [kernel.vmlinux] [k] rb_insert_color

# Base + reverted_new_mm_cid
Threads:
Total time: 7.847 [sec]

12.14% sched-messaging [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
8.86% swapper [kernel.vmlinux] [k] native_queued_spin_lock_slowpath
6.13% sched-messaging [kernel.vmlinux] [k] copyin
5.54% sched-messaging [kernel.vmlinux] [k] apparmor_file_permission
3.59% sched-messaging [kernel.vmlinux] [k] copyout
2.61% sched-messaging [kernel.vmlinux] [k] osq_lock
2.48% sched-messaging [kernel.vmlinux] [k] pipe_write
2.33% sched-messaging [kernel.vmlinux] [k] exit_to_user_mode_prepare
2.01% sched-messaging [kernel.vmlinux] [k] check_preemption_disabled
1.96% sched-messaging [kernel.vmlinux] [k] __entry_text_start
1.91% sched-messaging libc.so.6 [.] write
1.77% sched-messaging libc.so.6 [.] read
1.64% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
1.58% sched-messaging [kernel.vmlinux] [k] pipe_read
1.52% sched-messaging [kernel.vmlinux] [k] try_to_wake_up
1.38% sched-messaging [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
1.35% sched-messaging [kernel.vmlinux] [k] vfs_write
1.28% sched-messaging [kernel.vmlinux] [k] entry_SYSRETQ_unsafe_stack
1.28% sched-messaging [kernel.vmlinux] [k] vfs_read
1.25% sched-messaging [kernel.vmlinux] [k] do_syscall_64
1.22% sched-messaging [kernel.vmlinux] [k] __fget_light
1.18% sched-messaging [kernel.vmlinux] [k] mutex_lock
1.12% sched-messaging [kernel.vmlinux] [k] file_update_time
1.04% sched-messaging [kernel.vmlinux] [k] _copy_from_iter
1.01% sched-messaging [kernel.vmlinux] [k] current_time

So with the reverted new_mm_cid patch, we are seeing a lot of time being spent in
native_queued_spin_lock_slowpath and yet hackbench finishes faster.

I will keep digging into this; please let me know if you have any pointers for me.

--
Thanks and Regards,
Swapnil

2023-06-21 19:17:21

by Mathieu Desnoyers

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 6/21/23 12:36, Swapnil Sapkal wrote:
> Hello Mathieu,
>
[...]
>>
>> I suspect the regression is caused by the mm_count cache line bouncing.
>>
>> Please try with this additional patch applied:
>>
>> https://lore.kernel.org/lkml/[email protected]/
>
> Thanks for the suggestion. I tried out the patch you suggested. I am seeing
> an improvement in the hackbench numbers with the mm_count padding, but it
> does not match what we achieved by reverting the new mm_cid patch.
>
> [ ... hackbench numbers and IBS profiles trimmed; quoted in full in the
> previous message ... ]
>
> So with the reverted new_mm_cid patch, we are seeing a lot of time being
> spent in native_queued_spin_lock_slowpath and yet hackbench finishes faster.
>
> I will keep digging into this; please let me know if you have any pointers
> for me.

Do you have CONFIG_SECURITY_APPARMOR=y? Can you try without?

I notice that apparmor_file_permission appears near the top of your
profiles, and apparmor uses an internal aa_buffers_lock spinlock,
which could possibly explain the top hits for
native_queued_spin_lock_slowpath. My current suspicion is that
the raw spinlock that was taken by "Base + reverted_new_mm_cid"
changed the contention pattern on the apparmor lock enough to
speed things up by pure accident.

Thanks,

Mathieu



--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2023-06-21 22:21:07

by Mathieu Desnoyers

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 6/21/23 14:51, Mathieu Desnoyers wrote:
> On 6/21/23 12:36, Swapnil Sapkal wrote:
>> Hello Mathieu,
>>
> [...]
>>>
>>> I suspect the regression is caused by the mm_count cache line bouncing.
>>>
>>> Please try with this additional patch applied:
>>>
>>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Thanks for the suggestion. I tried out with the patch you suggested. I am seeing
>> improvement in hackbench numbers with mm_count padding. But this is not matching
>> with what we achieved through reverting the new mm_cid patch.
>>
>> Below are the results on the 1 Socket 4th Generation EPYC Processor (1 x 96C/192T):
>>
>> Threads:
>>
>> Test:              Base (v6.4-rc1)   Base + new_mmcid_reverted  Base + mm_count_padding
>>   1-groups:         5.23 (0.00 pct)         4.61 (11.85 pct) 5.11 (2.29 pct)
>>   2-groups:         4.99 (0.00 pct)         4.72 (5.41 pct) 5.00 (-0.20 pct)
>>   4-groups:         5.96 (0.00 pct)         4.87 (18.28 pct) 5.86 (1.67 pct)
>>   8-groups:         6.58 (0.00 pct)         5.44 (17.32 pct) 6.20 (5.77 pct)
>> 16-groups:        11.48 (0.00 pct)         8.07 (29.70 pct) 10.68 (6.96 pct)
>>
>> Processes:
>>
>> Test:              Base (v6.4-rc1)  Base + new_mmcid_reverted   Base + mm_count_padding
>>   1-groups:         5.19 (0.00 pct)         4.90 (5.58 pct) 5.19 (0.00 pct)
>>   2-groups:         5.44 (0.00 pct)         5.39 (0.91 pct) 5.39 (0.91 pct)
>>   4-groups:         5.69 (0.00 pct)         5.64 (0.87 pct) 5.64 (0.87 pct)
>>   8-groups:         6.08 (0.00 pct)         6.01 (1.15 pct) 6.04 (0.65 pct)
>> 16-groups:        10.87 (0.00 pct)        10.83 (0.36 pct) 10.93 (-0.55 pct)
>>
>> The ibs profile shows that function __switch_to_asm() is coming at top in baseline
>> run and is not seen with mm_count padding patch. Will be attaching full ibs profile
>> data for all the 3 runs:
>>
>> # Base (v6.4-rc1)
>> Threads:
>> Total time: 11.486 [sec]
>>
>>     5.15%  sched-messaging  [kernel.vmlinux]      [k] __switch_to_asm
>>     4.31%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>     4.29%  sched-messaging  [kernel.vmlinux]      [k]
>> native_queued_spin_lock_slowpath
>>     4.22%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>     3.92%  sched-messaging  [kernel.vmlinux]      [k]
>> apparmor_file_permission
>>     2.91%  sched-messaging  [kernel.vmlinux]      [k] __schedule
>>     2.34%  swapper          [kernel.vmlinux]      [k] __switch_to_asm
>>     2.10%  sched-messaging  [kernel.vmlinux]      [k]
>> prepare_to_wait_event
>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>     2.07%  sched-messaging  [kernel.vmlinux]      [k]
>> finish_task_switch.isra.0
>>     2.00%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>     1.82%  sched-messaging  [kernel.vmlinux]      [k]
>> check_preemption_disabled
>>     1.73%  sched-messaging  [kernel.vmlinux]      [k]
>> exit_to_user_mode_prepare
>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>     1.49%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>     1.45%  sched-messaging  libc.so.6             [.] write
>>     1.44%  swapper          [kernel.vmlinux]      [k] native_sched_clock
>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] psi_group_change
>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>     1.37%  sched-messaging  libc.so.6             [.] read
>>     1.06%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>     1.01%  swapper          [kernel.vmlinux]      [k] psi_group_change
>>     1.00%  sched-messaging  [kernel.vmlinux]      [k] update_curr
>>
>> # Base + mm_count_padding
>> Threads:
>> Total time: 11.384 [sec]
>>
>>     4.43%  sched-messaging  [kernel.vmlinux]         [k] copyin
>>     4.39%  sched-messaging  [kernel.vmlinux]         [k]
>> native_queued_spin_lock_slowpath
>>     4.07%  sched-messaging  [kernel.vmlinux]         [k]
>> apparmor_file_permission
>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] copyout
>>     2.49%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSCALL_64
>>     2.37%  sched-messaging  [kernel.vmlinux]         [k] update_cfs_group
>>     2.19%  sched-messaging  [kernel.vmlinux]         [k] pipe_write
>>     2.00%  sched-messaging  [kernel.vmlinux]         [k]
>> check_preemption_disabled
>>     1.93%  swapper          [kernel.vmlinux]         [k] update_load_avg
>>     1.81%  sched-messaging  [kernel.vmlinux]         [k]
>> exit_to_user_mode_prepare
>>     1.69%  sched-messaging  [kernel.vmlinux]         [k] try_to_wake_up
>>     1.58%  sched-messaging  libc.so.6                [.] write
>>     1.53%  sched-messaging  [kernel.vmlinux]         [k] psi_group_change
>>     1.50%  sched-messaging  libc.so.6                [.] read
>>     1.50%  sched-messaging  [kernel.vmlinux]         [k] pipe_read
>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] update_load_avg
>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] osq_lock
>>     1.30%  sched-messaging  [kernel.vmlinux]         [k] update_curr
>>     1.28%  swapper          [kernel.vmlinux]         [k] psi_group_change
>>     1.16%  sched-messaging  [kernel.vmlinux]         [k] vfs_read
>>     1.12%  sched-messaging  [kernel.vmlinux]         [k] vfs_write
>>     1.10%  sched-messaging  [kernel.vmlinux]         [k]
>> entry_SYSRETQ_unsafe_stack
>>     1.09%  sched-messaging  [kernel.vmlinux]         [k] __switch_to_asm
>>     1.08%  sched-messaging  [kernel.vmlinux]         [k] do_syscall_64
>>     1.06%  sched-messaging  [kernel.vmlinux]         [k]
>> select_task_rq_fair
>>     1.03%  swapper          [kernel.vmlinux]         [k] update_cfs_group
>>     1.00%  swapper          [kernel.vmlinux]         [k] rb_insert_color
>>
>> # Base + reverted_new_mm_cid
>> Threads:
>> Total time: 7.847 [sec]
>>
>>    12.14%  sched-messaging  [kernel.vmlinux]      [k]
>> native_queued_spin_lock_slowpath
>>     8.86%  swapper          [kernel.vmlinux]      [k]
>> native_queued_spin_lock_slowpath
>>     6.13%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>     5.54%  sched-messaging  [kernel.vmlinux]      [k]
>> apparmor_file_permission
>>     3.59%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>     2.61%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>     2.48%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>     2.33%  sched-messaging  [kernel.vmlinux]      [k]
>> exit_to_user_mode_prepare
>>     2.01%  sched-messaging  [kernel.vmlinux]      [k]
>> check_preemption_disabled
>>     1.96%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>     1.91%  sched-messaging  libc.so.6             [.] write
>>     1.77%  sched-messaging  libc.so.6             [.] read
>>     1.64%  sched-messaging  [kernel.vmlinux]      [k] mutex_spin_on_owner
>>     1.58%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>     1.38%  sched-messaging  [kernel.vmlinux]      [k]
>> ktime_get_coarse_real_ts64
>>     1.35%  sched-messaging  [kernel.vmlinux]      [k] vfs_write
>>     1.28%  sched-messaging  [kernel.vmlinux]      [k]
>> entry_SYSRETQ_unsafe_stack
>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>     1.25%  sched-messaging  [kernel.vmlinux]      [k] do_syscall_64
>>     1.22%  sched-messaging  [kernel.vmlinux]      [k] __fget_light
>>     1.18%  sched-messaging  [kernel.vmlinux]      [k] mutex_lock
>>     1.12%  sched-messaging  [kernel.vmlinux]      [k] file_update_time
>>     1.04%  sched-messaging  [kernel.vmlinux]      [k] _copy_from_iter
>>     1.01%  sched-messaging  [kernel.vmlinux]      [k] current_time
>>
>> So with the reverted new_mm_cid patch, we are seeing a lot of time
>> being spent in
>> native_queued_spin_lock_slowpath and yet, hackbench finishes faster.
>>
>> I keep further digging into this please let me know if you have any
>> pointers for me.
>
> Do you have CONFIG_SECURITY_APPARMOR=y ? Can you try without ?
>
> I notice that apparmor_file_permission appears near the top of your
> profiles, and apparmor uses an internal aa_buffers_lock spinlock,
> which could possibly explain the top hits for
> native_queued_spin_lock_slowpath. My current suspicion is that
> the raw spinlock that was taken by "Base + reverted_new_mm_cid"
> changed the contention pattern on the apparmor lock enough to
> speed things up by pure accident.

If apparmor happens to be the culprit here, we should have a hard look
at this commit:

commit df323337e50 "apparmor: Use a memory pool instead per-CPU caches"

Which turned a per-cpu cache into a global memory pool protected by a
spinlock. It may benefit RT, but it does not appear to be so great at
scaling.
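
For reference, the put side after that commit is essentially just a global
list under that spinlock, so every aa_get_buffer()/aa_put_buffer() pair from
every cpu serializes on it (paraphrased from security/apparmor/lsm.c, error
handling trimmed):

void aa_put_buffer(char *buf)
{
        union aa_buffer *aa_buf = container_of(buf, union aa_buffer, buffer[0]);

        /* every free, from every cpu, funnels through one global lock */
        spin_lock(&aa_buffers_lock);
        list_add(&aa_buf->list, &aa_global_buffers);
        buffer_count++;
        spin_unlock(&aa_buffers_lock);
}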

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2023-06-22 00:15:47

by John Johansen

[permalink] [raw]
Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 6/21/23 14:41, Mathieu Desnoyers wrote:
> On 6/21/23 14:51, Mathieu Desnoyers wrote:
>> On 6/21/23 12:36, Swapnil Sapkal wrote:
>>> Hello Mathieu,
>>>
>> [...]
>>>>
>>>> I suspect the regression is caused by the mm_count cache line bouncing.
>>>>
>>>> Please try with this additional patch applied:
>>>>
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> Thanks for the suggestion. I tried out with the patch you suggested. I am seeing
>>> improvement in hackbench numbers with mm_count padding. But this is not matching
>>> with what we achieved through reverting the new mm_cid patch.
>>>
>>> Below are the results on the 1 Socket 4th Generation EPYC Processor (1 x 96C/192T):
>>>
>>> Threads:
>>>
>>> Test:              Base (v6.4-rc1)   Base + new_mmcid_reverted  Base + mm_count_padding
>>>   1-groups:         5.23 (0.00 pct)         4.61 (11.85 pct) 5.11 (2.29 pct)
>>>   2-groups:         4.99 (0.00 pct)         4.72 (5.41 pct) 5.00 (-0.20 pct)
>>>   4-groups:         5.96 (0.00 pct)         4.87 (18.28 pct) 5.86 (1.67 pct)
>>>   8-groups:         6.58 (0.00 pct)         5.44 (17.32 pct) 6.20 (5.77 pct)
>>> 16-groups:        11.48 (0.00 pct)         8.07 (29.70 pct) 10.68 (6.96 pct)
>>>
>>> Processes:
>>>
>>> Test:              Base (v6.4-rc1)  Base + new_mmcid_reverted   Base + mm_count_padding
>>>   1-groups:         5.19 (0.00 pct)         4.90 (5.58 pct) 5.19 (0.00 pct)
>>>   2-groups:         5.44 (0.00 pct)         5.39 (0.91 pct) 5.39 (0.91 pct)
>>>   4-groups:         5.69 (0.00 pct)         5.64 (0.87 pct) 5.64 (0.87 pct)
>>>   8-groups:         6.08 (0.00 pct)         6.01 (1.15 pct) 6.04 (0.65 pct)
>>> 16-groups:        10.87 (0.00 pct)        10.83 (0.36 pct) 10.93 (-0.55 pct)
>>>
>>> The ibs profile shows that function __switch_to_asm() is coming at top in baseline
>>> run and is not seen with mm_count padding patch. Will be attaching full ibs profile
>>> data for all the 3 runs:
>>>
>>> # Base (v6.4-rc1)
>>> Threads:
>>> Total time: 11.486 [sec]
>>>
>>>     5.15%  sched-messaging  [kernel.vmlinux]      [k] __switch_to_asm
>>>     4.31%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>>     4.29%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>>     4.22%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>>     3.92%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
>>>     2.91%  sched-messaging  [kernel.vmlinux]      [k] __schedule
>>>     2.34%  swapper          [kernel.vmlinux]      [k] __switch_to_asm
>>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] prepare_to_wait_event
>>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>>     2.07%  sched-messaging  [kernel.vmlinux]      [k] finish_task_switch.isra.0
>>>     2.00%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>>     1.82%  sched-messaging  [kernel.vmlinux]      [k] check_preemption_disabled
>>>     1.73%  sched-messaging  [kernel.vmlinux]      [k] exit_to_user_mode_prepare
>>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>>     1.49%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>>     1.45%  sched-messaging  libc.so.6             [.] write
>>>     1.44%  swapper          [kernel.vmlinux]      [k] native_sched_clock
>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] psi_group_change
>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>>     1.37%  sched-messaging  libc.so.6             [.] read
>>>     1.06%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>>     1.01%  swapper          [kernel.vmlinux]      [k] psi_group_change
>>>     1.00%  sched-messaging  [kernel.vmlinux]      [k] update_curr
>>>
>>> # Base + mm_count_padding
>>> Threads:
>>> Total time: 11.384 [sec]
>>>
>>>     4.43%  sched-messaging  [kernel.vmlinux]         [k] copyin
>>>     4.39%  sched-messaging  [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] apparmor_file_permission
>>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] copyout
>>>     2.49%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSCALL_64
>>>     2.37%  sched-messaging  [kernel.vmlinux]         [k] update_cfs_group
>>>     2.19%  sched-messaging  [kernel.vmlinux]         [k] pipe_write
>>>     2.00%  sched-messaging  [kernel.vmlinux]         [k] check_preemption_disabled
>>>     1.93%  swapper          [kernel.vmlinux]         [k] update_load_avg
>>>     1.81%  sched-messaging  [kernel.vmlinux]         [k] exit_to_user_mode_prepare
>>>     1.69%  sched-messaging  [kernel.vmlinux]         [k] try_to_wake_up
>>>     1.58%  sched-messaging  libc.so.6                [.] write
>>>     1.53%  sched-messaging  [kernel.vmlinux]         [k] psi_group_change
>>>     1.50%  sched-messaging  libc.so.6                [.] read
>>>     1.50%  sched-messaging  [kernel.vmlinux]         [k] pipe_read
>>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] update_load_avg
>>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] osq_lock
>>>     1.30%  sched-messaging  [kernel.vmlinux]         [k] update_curr
>>>     1.28%  swapper          [kernel.vmlinux]         [k] psi_group_change
>>>     1.16%  sched-messaging  [kernel.vmlinux]         [k] vfs_read
>>>     1.12%  sched-messaging  [kernel.vmlinux]         [k] vfs_write
>>>     1.10%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSRETQ_unsafe_stack
>>>     1.09%  sched-messaging  [kernel.vmlinux]         [k] __switch_to_asm
>>>     1.08%  sched-messaging  [kernel.vmlinux]         [k] do_syscall_64
>>>     1.06%  sched-messaging  [kernel.vmlinux]         [k] select_task_rq_fair
>>>     1.03%  swapper          [kernel.vmlinux]         [k] update_cfs_group
>>>     1.00%  swapper          [kernel.vmlinux]         [k] rb_insert_color
>>>
>>> # Base + reverted_new_mm_cid
>>> Threads:
>>> Total time: 7.847 [sec]
>>>
>>>    12.14%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>>     8.86%  swapper          [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>>     6.13%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>>     5.54%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
>>>     3.59%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>>     2.61%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>>     2.48%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>>     2.33%  sched-messaging  [kernel.vmlinux]      [k] exit_to_user_mode_prepare
>>>     2.01%  sched-messaging  [kernel.vmlinux]      [k] check_preemption_disabled
>>>     1.96%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>>     1.91%  sched-messaging  libc.so.6             [.] write
>>>     1.77%  sched-messaging  libc.so.6             [.] read
>>>     1.64%  sched-messaging  [kernel.vmlinux]      [k] mutex_spin_on_owner
>>>     1.58%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] ktime_get_coarse_real_ts64
>>>     1.35%  sched-messaging  [kernel.vmlinux]      [k] vfs_write
>>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] entry_SYSRETQ_unsafe_stack
>>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>>     1.25%  sched-messaging  [kernel.vmlinux]      [k] do_syscall_64
>>>     1.22%  sched-messaging  [kernel.vmlinux]      [k] __fget_light
>>>     1.18%  sched-messaging  [kernel.vmlinux]      [k] mutex_lock
>>>     1.12%  sched-messaging  [kernel.vmlinux]      [k] file_update_time
>>>     1.04%  sched-messaging  [kernel.vmlinux]      [k] _copy_from_iter
>>>     1.01%  sched-messaging  [kernel.vmlinux]      [k] current_time
>>>
>>> So with the reverted new_mm_cid patch, we are seeing a lot of time being spent in
>>> native_queued_spin_lock_slowpath and yet, hackbench finishes faster.
>>>
>>> I keep further digging into this please let me know if you have any pointers for me.
>>
>> Do you have CONFIG_SECURITY_APPARMOR=y ? Can you try without ?
>>
>> I notice that apparmor_file_permission appears near the top of your
>> profiles, and apparmor uses an internal aa_buffers_lock spinlock,
>> which could possibly explain the top hits for
>> native_queued_spin_lock_slowpath. My current suspicion is that
>> the raw spinlock that was taken by "Base + reverted_new_mm_cid"
>> changed the contention pattern on the apparmor lock enough to
>> speed things up by pure accident.
>
> If apparmor happens to be the culprit here, we should have a hard look at this commit:
>
> commit df323337e50 "apparmor: Use a memory pool instead per-CPU caches"
>
> Which turned a per-cpu cache into a global memory pool protected by a spinlock. It may benefit RT, but it does not appear to be so great at scaling.
>
it is not. And I have a patch that needs some more formal testing to gather stats.
Ubuntu pulled it in last cycle, so it has gotten a fair bit of use and is looking good
on that end. There are probably some tweaks that can be done to improve it; the
backoff in particular has mostly been adjusted in response to some basic benchmarking.
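
To give a sense of the backoff: each trylock failure bumps ->contention by 3
(capped at 9), so successive failures add 1 << 3 = 8, then 64, then 512 to
->hold; ->hold only drains by one each time a buffer is handed out from the
percpu cache, and a successful trylock walks ->contention back down by one.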

anyways patch below

commit e057e9b47f1749882ea0efb4427d6b9671c761ab
Author: John Johansen <[email protected]>
Date: Tue Oct 25 01:18:41 2022 -0700

apparmor: cache buffers on percpu list if there is lock contention

df323337e507 ("apparmor: Use a memory pool instead per-CPU caches")
changed buffer allocation to use a memory pool, however on a heavily
loaded machine there can be lock contention on the global buffers
lock. Add a percpu list to cache buffers on when lock contention is
encountered.

When allocating buffers attempt to use cached buffers first,
before taking the global buffers lock. When freeing buffers
try to put them back to the global list but if contention is
encountered, put the buffer on the percpu list.

The length of time a buffer is held on the percpu list is dynamically
adjusted based on lock contention. The amount of hold time is rapidly
increased and slowly ramped down.

v4:
- fix percpu ->count buffer count which had been spliced across a
debug patch.
- introduce define for MAX_LOCAL_COUNT
- rework count check and locking around it.
- update commit message to reference commit that introduced the
memory pool.
v3:
- limit number of buffers that can be pushed onto the percpu
list. This avoids a problem on some kernels where one percpu
list can inherit buffers from another cpu after a reschedule,
causing more kernel memory to be used than is necessary. Under
normal conditions this should eventually return to normal
but under pathological conditions the extra memory consumption
may have been unbounded
v2:
- dynamically adjust buffer hold time on percpu list based on
lock contention.
v1:
- cache buffers on percpu list on lock contention

Signed-off-by: John Johansen <[email protected]>

diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
index e6671a4a89c4..ea3af769af5a 100644
--- a/security/apparmor/lsm.c
+++ b/security/apparmor/lsm.c
@@ -55,12 +55,21 @@ union aa_buffer {
char buffer[1];
};

+struct aa_local_cache {
+ unsigned int contention;
+ unsigned int hold;
+ unsigned int count;
+ struct list_head head;
+};
+
+#define MAX_LOCAL_COUNT 2
#define RESERVE_COUNT 2
static int reserve_count = RESERVE_COUNT;
static int buffer_count;

static LIST_HEAD(aa_global_buffers);
static DEFINE_SPINLOCK(aa_buffers_lock);
+static DEFINE_PER_CPU(struct aa_local_cache, aa_local_buffers);

struct kmem_cache *aa_audit_slab;

@@ -2029,14 +2038,45 @@ static int param_set_mode(const char *val, const struct kernel_param *kp)
return 0;
}

+static void update_contention(struct aa_local_cache *cache)
+{
+ cache->contention += 3;
+ if (cache->contention > 9)
+ cache->contention = 9;
+ cache->hold += 1 << cache->contention; /* 8, 64, 512 */
+}
+
char *aa_get_buffer(bool in_atomic)
{
union aa_buffer *aa_buf;
+ struct aa_local_cache *cache;
bool try_again = true;
gfp_t flags = (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);

+ /* use per cpu cached buffers first */
+ cache = get_cpu_ptr(&aa_local_buffers);
+ if (!list_empty(&cache->head)) {
+ aa_buf = list_first_entry(&cache->head, union aa_buffer, list);
+ list_del(&aa_buf->list);
+ cache->hold--;
+ cache->count--;
+ put_cpu_ptr(&aa_local_buffers);
+ return &aa_buf->buffer[0];
+ }
+ put_cpu_ptr(&aa_local_buffers);
+
+ if (!spin_trylock(&aa_buffers_lock)) {
+ cache = get_cpu_ptr(&aa_local_buffers);
+ update_contention(cache);
+ put_cpu_ptr(&aa_local_buffers);
+ spin_lock(&aa_buffers_lock);
+ } else {
+ cache = get_cpu_ptr(&aa_local_buffers);
+ if (cache->contention)
+ cache->contention--;
+ put_cpu_ptr(&aa_local_buffers);
+ }
retry:
- spin_lock(&aa_buffers_lock);
if (buffer_count > reserve_count ||
(in_atomic && !list_empty(&aa_global_buffers))) {
aa_buf = list_first_entry(&aa_global_buffers, union aa_buffer,
@@ -2062,6 +2102,7 @@ char *aa_get_buffer(bool in_atomic)
if (!aa_buf) {
if (try_again) {
try_again = false;
+ spin_lock(&aa_buffers_lock);
goto retry;
}
pr_warn_once("AppArmor: Failed to allocate a memory buffer.\n");
@@ -2073,15 +2114,42 @@ char *aa_get_buffer(bool in_atomic)
void aa_put_buffer(char *buf)
{
union aa_buffer *aa_buf;
+ struct aa_local_cache *cache;

if (!buf)
return;
aa_buf = container_of(buf, union aa_buffer, buffer[0]);

- spin_lock(&aa_buffers_lock);
- list_add(&aa_buf->list, &aa_global_buffers);
- buffer_count++;
- spin_unlock(&aa_buffers_lock);
+ cache = get_cpu_ptr(&aa_local_buffers);
+ if (!cache->hold) {
+ bool must_lock = cache->count >= MAX_LOCAL_COUNT;
+
+ put_cpu_ptr(&aa_local_buffers);
+
+ if (must_lock) {
+ spin_lock(&aa_buffers_lock);
+ goto locked;
+ } else if (spin_trylock(&aa_buffers_lock)) {
+ locked:
+ /* put back on global list */
+ list_add(&aa_buf->list, &aa_global_buffers);
+ buffer_count++;
+ spin_unlock(&aa_buffers_lock);
+ cache = get_cpu_ptr(&aa_local_buffers);
+ if (cache->contention)
+ cache->contention--;
+ put_cpu_ptr(&aa_local_buffers);
+ return;
+ }
+ /* contention on global list, fallback to percpu */
+ cache = get_cpu_ptr(&aa_local_buffers);
+ update_contention(cache);
+ }
+
+ /* cache in percpu list */
+ list_add(&aa_buf->list, &cache->head);
+ cache->count++;
+ put_cpu_ptr(&aa_local_buffers);
}

/*
@@ -2123,6 +2191,16 @@ static int __init alloc_buffers(void)
union aa_buffer *aa_buf;
int i, num;

+ /*
+ * per cpu set of cached allocated buffers used to help reduce
+ * lock contention
+ */
+ for_each_possible_cpu(i) {
+ per_cpu(aa_local_buffers, i).contention = 0;
+ per_cpu(aa_local_buffers, i).hold = 0;
+ per_cpu(aa_local_buffers, i).count = 0;
+ INIT_LIST_HEAD(&per_cpu(aa_local_buffers, i).head);
+ }
/*
* A function may require two buffers at once. Usually the buffers are
* used for a short period of time and are shared. On UP kernel buffers





2023-06-22 14:40:11

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 6/21/23 19:59, John Johansen wrote:
> On 6/21/23 14:41, Mathieu Desnoyers wrote:
>> On 6/21/23 14:51, Mathieu Desnoyers wrote:
>>> On 6/21/23 12:36, Swapnil Sapkal wrote:
>>>> Hello Mathieu,
>>>>
>>> [...]
>>>>>
>>>>> I suspect the regression is caused by the mm_count cache line
>>>>> bouncing.
>>>>>
>>>>> Please try with this additional patch applied:
>>>>>
>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>> Thanks for the suggestion. I tried out with the patch you suggested.
>>>> I am seeing
>>>> improvement in hackbench numbers with mm_count padding. But this is
>>>> not matching
>>>> with what we achieved through reverting the new mm_cid patch.
>>>>
>>>> Below are the results on the 1 Socket 4th Generation EPYC Processor
>>>> (1 x 96C/192T):
>>>>
>>>> Threads:
>>>>
>>>> Test:              Base (v6.4-rc1)   Base + new_mmcid_reverted  Base
>>>> + mm_count_padding
>>>>   1-groups:         5.23 (0.00 pct)         4.61 (11.85 pct) 5.11
>>>> (2.29 pct)
>>>>   2-groups:         4.99 (0.00 pct)         4.72 (5.41 pct) 5.00
>>>> (-0.20 pct)
>>>>   4-groups:         5.96 (0.00 pct)         4.87 (18.28 pct) 5.86
>>>> (1.67 pct)
>>>>   8-groups:         6.58 (0.00 pct)         5.44 (17.32 pct) 6.20
>>>> (5.77 pct)
>>>> 16-groups:        11.48 (0.00 pct)         8.07 (29.70 pct) 10.68
>>>> (6.96 pct)
>>>>
>>>> Processes:
>>>>
>>>> Test:              Base (v6.4-rc1)  Base + new_mmcid_reverted   Base
>>>> + mm_count_padding
>>>>   1-groups:         5.19 (0.00 pct)         4.90 (5.58 pct) 5.19
>>>> (0.00 pct)
>>>>   2-groups:         5.44 (0.00 pct)         5.39 (0.91 pct) 5.39
>>>> (0.91 pct)
>>>>   4-groups:         5.69 (0.00 pct)         5.64 (0.87 pct) 5.64
>>>> (0.87 pct)
>>>>   8-groups:         6.08 (0.00 pct)         6.01 (1.15 pct) 6.04
>>>> (0.65 pct)
>>>> 16-groups:        10.87 (0.00 pct)        10.83 (0.36 pct) 10.93
>>>> (-0.55 pct)
>>>>
>>>> The ibs profile shows that function __switch_to_asm() is coming at
>>>> top in baseline
>>>> run and is not seen with mm_count padding patch. Will be attaching
>>>> full ibs profile
>>>> data for all the 3 runs:
>>>>
>>>> # Base (v6.4-rc1)
>>>> Threads:
>>>> Total time: 11.486 [sec]
>>>>
>>>>     5.15%  sched-messaging  [kernel.vmlinux]      [k] __switch_to_asm
>>>>     4.31%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>>>     4.29%  sched-messaging  [kernel.vmlinux]      [k]
>>>> native_queued_spin_lock_slowpath
>>>>     4.22%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>>>     3.92%  sched-messaging  [kernel.vmlinux]      [k]
>>>> apparmor_file_permission
>>>>     2.91%  sched-messaging  [kernel.vmlinux]      [k] __schedule
>>>>     2.34%  swapper          [kernel.vmlinux]      [k] __switch_to_asm
>>>>     2.10%  sched-messaging  [kernel.vmlinux]      [k]
>>>> prepare_to_wait_event
>>>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>>>     2.07%  sched-messaging  [kernel.vmlinux]      [k]
>>>> finish_task_switch.isra.0
>>>>     2.00%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>>>     1.82%  sched-messaging  [kernel.vmlinux]      [k]
>>>> check_preemption_disabled
>>>>     1.73%  sched-messaging  [kernel.vmlinux]      [k]
>>>> exit_to_user_mode_prepare
>>>>     1.52%  sched-messaging  [kernel.vmlinux]      [k]
>>>> __entry_text_start
>>>>     1.49%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>>>     1.45%  sched-messaging  libc.so.6             [.] write
>>>>     1.44%  swapper          [kernel.vmlinux]      [k]
>>>> native_sched_clock
>>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] psi_group_change
>>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>>>     1.37%  sched-messaging  libc.so.6             [.] read
>>>>     1.06%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>>>     1.01%  swapper          [kernel.vmlinux]      [k] psi_group_change
>>>>     1.00%  sched-messaging  [kernel.vmlinux]      [k] update_curr
>>>>
>>>> # Base + mm_count_padding
>>>> Threads:
>>>> Total time: 11.384 [sec]
>>>>
>>>>     4.43%  sched-messaging  [kernel.vmlinux]         [k] copyin
>>>>     4.39%  sched-messaging  [kernel.vmlinux]         [k]
>>>> native_queued_spin_lock_slowpath
>>>>     4.07%  sched-messaging  [kernel.vmlinux]         [k]
>>>> apparmor_file_permission
>>>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] copyout
>>>>     2.49%  sched-messaging  [kernel.vmlinux]         [k]
>>>> entry_SYSCALL_64
>>>>     2.37%  sched-messaging  [kernel.vmlinux]         [k]
>>>> update_cfs_group
>>>>     2.19%  sched-messaging  [kernel.vmlinux]         [k] pipe_write
>>>>     2.00%  sched-messaging  [kernel.vmlinux]         [k]
>>>> check_preemption_disabled
>>>>     1.93%  swapper          [kernel.vmlinux]         [k]
>>>> update_load_avg
>>>>     1.81%  sched-messaging  [kernel.vmlinux]         [k]
>>>> exit_to_user_mode_prepare
>>>>     1.69%  sched-messaging  [kernel.vmlinux]         [k] try_to_wake_up
>>>>     1.58%  sched-messaging  libc.so.6                [.] write
>>>>     1.53%  sched-messaging  [kernel.vmlinux]         [k]
>>>> psi_group_change
>>>>     1.50%  sched-messaging  libc.so.6                [.] read
>>>>     1.50%  sched-messaging  [kernel.vmlinux]         [k] pipe_read
>>>>     1.39%  sched-messaging  [kernel.vmlinux]         [k]
>>>> update_load_avg
>>>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] osq_lock
>>>>     1.30%  sched-messaging  [kernel.vmlinux]         [k] update_curr
>>>>     1.28%  swapper          [kernel.vmlinux]         [k]
>>>> psi_group_change
>>>>     1.16%  sched-messaging  [kernel.vmlinux]         [k] vfs_read
>>>>     1.12%  sched-messaging  [kernel.vmlinux]         [k] vfs_write
>>>>     1.10%  sched-messaging  [kernel.vmlinux]         [k]
>>>> entry_SYSRETQ_unsafe_stack
>>>>     1.09%  sched-messaging  [kernel.vmlinux]         [k]
>>>> __switch_to_asm
>>>>     1.08%  sched-messaging  [kernel.vmlinux]         [k] do_syscall_64
>>>>     1.06%  sched-messaging  [kernel.vmlinux]         [k]
>>>> select_task_rq_fair
>>>>     1.03%  swapper          [kernel.vmlinux]         [k]
>>>> update_cfs_group
>>>>     1.00%  swapper          [kernel.vmlinux]         [k]
>>>> rb_insert_color
>>>>
>>>> # Base + reverted_new_mm_cid
>>>> Threads:
>>>> Total time: 7.847 [sec]
>>>>
>>>>    12.14%  sched-messaging  [kernel.vmlinux]      [k]
>>>> native_queued_spin_lock_slowpath
>>>>     8.86%  swapper          [kernel.vmlinux]      [k]
>>>> native_queued_spin_lock_slowpath
>>>>     6.13%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>>>     5.54%  sched-messaging  [kernel.vmlinux]      [k]
>>>> apparmor_file_permission
>>>>     3.59%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>>>     2.61%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>>>     2.48%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>>>     2.33%  sched-messaging  [kernel.vmlinux]      [k]
>>>> exit_to_user_mode_prepare
>>>>     2.01%  sched-messaging  [kernel.vmlinux]      [k]
>>>> check_preemption_disabled
>>>>     1.96%  sched-messaging  [kernel.vmlinux]      [k]
>>>> __entry_text_start
>>>>     1.91%  sched-messaging  libc.so.6             [.] write
>>>>     1.77%  sched-messaging  libc.so.6             [.] read
>>>>     1.64%  sched-messaging  [kernel.vmlinux]      [k]
>>>> mutex_spin_on_owner
>>>>     1.58%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k]
>>>> ktime_get_coarse_real_ts64
>>>>     1.35%  sched-messaging  [kernel.vmlinux]      [k] vfs_write
>>>>     1.28%  sched-messaging  [kernel.vmlinux]      [k]
>>>> entry_SYSRETQ_unsafe_stack
>>>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>>>     1.25%  sched-messaging  [kernel.vmlinux]      [k] do_syscall_64
>>>>     1.22%  sched-messaging  [kernel.vmlinux]      [k] __fget_light
>>>>     1.18%  sched-messaging  [kernel.vmlinux]      [k] mutex_lock
>>>>     1.12%  sched-messaging  [kernel.vmlinux]      [k] file_update_time
>>>>     1.04%  sched-messaging  [kernel.vmlinux]      [k] _copy_from_iter
>>>>     1.01%  sched-messaging  [kernel.vmlinux]      [k] current_time
>>>>
>>>> So with the reverted new_mm_cid patch, we are seeing a lot of time
>>>> being spent in
>>>> native_queued_spin_lock_slowpath and yet, hackbench finishes faster.
>>>>
>>>> I keep further digging into this please let me know if you have any
>>>> pointers for me.
>>>
>>> Do you have CONFIG_SECURITY_APPARMOR=y ? Can you try without ?
>>>
>>> I notice that apparmor_file_permission appears near the top of your
>>> profiles, and apparmor uses an internal aa_buffers_lock spinlock,
>>> which could possibly explain the top hits for
>>> native_queued_spin_lock_slowpath. My current suspicion is that
>>> the raw spinlock that was taken by "Base + reverted_new_mm_cid"
>>> changed the contention pattern on the apparmor lock enough to
>>> speed things up by pure accident.
>>
>> If apparmor happens to be the culprit here, we should have a hard look
>> at this commit:
>>
>> commit df323337e50 "apparmor: Use a memory pool instead per-CPU caches"
>>
>> Which turned a per-cpu cache into a global memory pool protected by a
>> spinlock. It may benefit RT, but it does not appear to be so great at
>> scaling.
>>
> it is not. And I have a patch that needs some more formal testing for
> some stats.
> Ubuntu pulled it in last cycle so it has gotten a fair bit of use and is
> looking good
> on that end. There are probably some tweaks that can be done to improve
> it. The
> backoff in particular is something that has mostly been adjusted in
> response to some
> basic benchmarking.
>
> anyways patch below

I don't understand why all these heuristics are needed at all ?

What was fundamentally wrong with the per-cpu caches before commit
df323337e50 other than being non-RT friendly ? Was the only purpose of
that commit to reduce the duration of preempt-off critical sections, or
is there a bigger picture concern it was taking care of by introducing a
global pool ?

Introducing per-cpu memory pools, dealing with migration by giving
entries back to the right cpu's pool, taking into account the cpu the
entry belongs to, and using a per-cpu/lock-free data structure allowing
lock-free push to give back an entry on a remote cpu should do the trick
without locking, and without long preempt-off critical sections.
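
Something along these lines is what I have in mind (untested sketch, all
names made up, using the kernel llist API, assuming the owning cpu is the
only consumer of its own llist so llist_del_first() is safe, and ignoring
the in_atomic/reserve logic):

struct aa_pcpu_buf {
        struct llist_node node;
        int owner_cpu;                  /* cpu whose pool owns this buffer */
        char buffer[1];
};

struct aa_pcpu_pool {
        struct llist_head free;         /* remote cpus push freed buffers here */
};

static DEFINE_PER_CPU(struct aa_pcpu_pool, aa_pcpu_pool);

static char *aa_pcpu_get_buffer(void)
{
        struct aa_pcpu_pool *pool = get_cpu_ptr(&aa_pcpu_pool);
        struct llist_node *n = llist_del_first(&pool->free);

        put_cpu_ptr(&aa_pcpu_pool);
        if (!n)
                return NULL;            /* caller falls back to allocating */
        return llist_entry(n, struct aa_pcpu_buf, node)->buffer;
}

static void aa_pcpu_put_buffer(char *buf)
{
        struct aa_pcpu_buf *b = container_of(buf, struct aa_pcpu_buf, buffer[0]);

        /* lock-free push back to the owning cpu's pool, even from a remote cpu */
        llist_add(&b->node, &per_cpu_ptr(&aa_pcpu_pool, b->owner_cpu)->free);
}

The owner_cpu field would be set when the buffer is first added to a given
cpu's pool, so a buffer freed after a migration still goes back to the pool
it came from.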

The only downside I see for per-cpu memory pools is a slightly larger
memory overhead on large multi-core systems. But is that really a concern ?

What am I missing here ?

Thanks,

Mathieu

>
> commit e057e9b47f1749882ea0efb4427d6b9671c761ab
> Author: John Johansen <[email protected]>
> Date:   Tue Oct 25 01:18:41 2022 -0700
>
>     apparmor: cache buffers on percpu list if there is lock contention
>     df323337e507 ("apparmor: Use a memory pool instead per-CPU caches")
>     changed buffer allocation to use a memory pool, however on a heavily
>     loaded machine there can be lock contention on the global buffers
>     lock. Add a percpu list to cache buffers on when lock contention is
>     encountered.
>     When allocating buffers attempt to use cached buffers first,
>     before taking the global buffers lock. When freeing buffers
>     try to put them back to the global list but if contention is
>     encountered, put the buffer on the percpu list.
>     The length of time a buffer is held on the percpu list is dynamically
>     adjusted based on lock contention.  The amount of hold time is rapidly
>     increased and slow ramped down.
>     v4:
>     - fix percpu ->count buffer count which had been spliced across a
>       debug patch.
>     - introduce define for MAX_LOCAL_COUNT
>     - rework count check and locking around it.
>     - update commit message to reference commit that introduced the
>       memory pool.
>     v3:
>     - limit number of buffers that can be pushed onto the percpu
>       list. This avoids a problem on some kernels where one percpu
>       list can inherit buffers from another cpu after a reschedule,
>       causing more kernel memory to used than is necessary. Under
>       normal conditions this should eventually return to normal
>       but under pathelogical conditions the extra memory consumption
>       may have been unbouanded
>     v2:
>     - dynamically adjust buffer hold time on percpu list based on
>       lock contention.
>     v1:
>     - cache buffers on percpu list on lock contention
>     Signed-off-by: John Johansen <[email protected]>
>
> diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
> index e6671a4a89c4..ea3af769af5a 100644
> --- a/security/apparmor/lsm.c
> +++ b/security/apparmor/lsm.c
> @@ -55,12 +55,21 @@ union aa_buffer {
>      char buffer[1];
>  };
>
> +struct aa_local_cache {
> +    unsigned int contention;
> +    unsigned int hold;
> +    unsigned int count;
> +    struct list_head head;
> +};
> +
> +#define MAX_LOCAL_COUNT 2
>  #define RESERVE_COUNT 2
>  static int reserve_count = RESERVE_COUNT;
>  static int buffer_count;
>
>  static LIST_HEAD(aa_global_buffers);
>  static DEFINE_SPINLOCK(aa_buffers_lock);
> +static DEFINE_PER_CPU(struct aa_local_cache, aa_local_buffers);
>
>  struct kmem_cache *aa_audit_slab;
>
> @@ -2029,14 +2038,45 @@ static int param_set_mode(const char *val, const
> struct kernel_param *kp)
>      return 0;
>  }
>
> +static void update_contention(struct aa_local_cache *cache)
> +{
> +    cache->contention += 3;
> +    if (cache->contention > 9)
> +        cache->contention = 9;
> +    cache->hold += 1 << cache->contention;        /* 8, 64, 512 */
> +}
> +
>  char *aa_get_buffer(bool in_atomic)
>  {
>      union aa_buffer *aa_buf;
> +    struct aa_local_cache *cache;
>      bool try_again = true;
>      gfp_t flags = (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
>
> +    /* use per cpu cached buffers first */
> +    cache = get_cpu_ptr(&aa_local_buffers);
> +    if (!list_empty(&cache->head)) {
> +        aa_buf = list_first_entry(&cache->head, union aa_buffer, list);
> +        list_del(&aa_buf->list);
> +        cache->hold--;
> +        cache->count--;
> +        put_cpu_ptr(&aa_local_buffers);
> +        return &aa_buf->buffer[0];
> +    }
> +    put_cpu_ptr(&aa_local_buffers);
> +
> +    if (!spin_trylock(&aa_buffers_lock)) {
> +        cache = get_cpu_ptr(&aa_local_buffers);
> +        update_contention(cache);
> +        put_cpu_ptr(&aa_local_buffers);
> +        spin_lock(&aa_buffers_lock);
> +    } else {
> +        cache = get_cpu_ptr(&aa_local_buffers);
> +        if (cache->contention)
> +            cache->contention--;
> +        put_cpu_ptr(&aa_local_buffers);
> +    }
>  retry:
> -    spin_lock(&aa_buffers_lock);
>      if (buffer_count > reserve_count ||
>          (in_atomic && !list_empty(&aa_global_buffers))) {
>          aa_buf = list_first_entry(&aa_global_buffers, union aa_buffer,
> @@ -2062,6 +2102,7 @@ char *aa_get_buffer(bool in_atomic)
>      if (!aa_buf) {
>          if (try_again) {
>              try_again = false;
> +            spin_lock(&aa_buffers_lock);
>              goto retry;
>          }
>          pr_warn_once("AppArmor: Failed to allocate a memory buffer.\n");
> @@ -2073,15 +2114,42 @@ char *aa_get_buffer(bool in_atomic)
>  void aa_put_buffer(char *buf)
>  {
>      union aa_buffer *aa_buf;
> +    struct aa_local_cache *cache;
>
>      if (!buf)
>          return;
>      aa_buf = container_of(buf, union aa_buffer, buffer[0]);
>
> -    spin_lock(&aa_buffers_lock);
> -    list_add(&aa_buf->list, &aa_global_buffers);
> -    buffer_count++;
> -    spin_unlock(&aa_buffers_lock);
> +    cache = get_cpu_ptr(&aa_local_buffers);
> +    if (!cache->hold) {
> +        bool must_lock = cache->count >= MAX_LOCAL_COUNT;
> +
> +        put_cpu_ptr(&aa_local_buffers);
> +
> +        if (must_lock) {
> +            spin_lock(&aa_buffers_lock);
> +            goto locked;
> +        } else if (spin_trylock(&aa_buffers_lock)) {
> +        locked:
> +            /* put back on global list */
> +            list_add(&aa_buf->list, &aa_global_buffers);
> +            buffer_count++;
> +            spin_unlock(&aa_buffers_lock);
> +            cache = get_cpu_ptr(&aa_local_buffers);
> +            if (cache->contention)
> +                cache->contention--;
> +            put_cpu_ptr(&aa_local_buffers);
> +            return;
> +        }
> +        /* contention on global list, fallback to percpu */
> +        cache = get_cpu_ptr(&aa_local_buffers);
> +        update_contention(cache);
> +    }
> +
> +    /* cache in percpu list */
> +    list_add(&aa_buf->list, &cache->head);
> +    cache->count++;
> +    put_cpu_ptr(&aa_local_buffers);
>  }
>
>  /*
> @@ -2123,6 +2191,16 @@ static int __init alloc_buffers(void)
>      union aa_buffer *aa_buf;
>      int i, num;
>
> +    /*
> +     * per cpu set of cached allocated buffers used to help reduce
> +     * lock contention
> +     */
> +    for_each_possible_cpu(i) {
> +        per_cpu(aa_local_buffers, i).contention = 0;
> +        per_cpu(aa_local_buffers, i).hold = 0;
> +        per_cpu(aa_local_buffers, i).count = 0;
> +        INIT_LIST_HEAD(&per_cpu(aa_local_buffers, i).head);
> +    }
>      /*
>       * A function may require two buffers at once. Usually the buffers
> are
>       * used for a short period of time and are shared. On UP kernel
> buffers
>
>
>
>

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2023-06-22 16:31:46

by John Johansen

[permalink] [raw]
Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 6/22/23 07:33, Mathieu Desnoyers wrote:
> On 6/21/23 19:59, John Johansen wrote:
>> On 6/21/23 14:41, Mathieu Desnoyers wrote:
>>> On 6/21/23 14:51, Mathieu Desnoyers wrote:
>>>> On 6/21/23 12:36, Swapnil Sapkal wrote:
>>>>> Hello Mathieu,
>>>>>
>>>> [...]
>>>>>>
>>>>>> I suspect the regression is caused by the mm_count cache line bouncing.
>>>>>>
>>>>>> Please try with this additional patch applied:
>>>>>>
>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>
>>>>> Thanks for the suggestion. I tried out with the patch you suggested. I am seeing
>>>>> improvement in hackbench numbers with mm_count padding. But this is not matching
>>>>> with what we achieved through reverting the new mm_cid patch.
>>>>>
>>>>> Below are the results on the 1 Socket 4th Generation EPYC Processor (1 x 96C/192T):
>>>>>
>>>>> Threads:
>>>>>
>>>>> Test:              Base (v6.4-rc1)   Base + new_mmcid_reverted  Base + mm_count_padding
>>>>>   1-groups:         5.23 (0.00 pct)         4.61 (11.85 pct) 5.11 (2.29 pct)
>>>>>   2-groups:         4.99 (0.00 pct)         4.72 (5.41 pct) 5.00 (-0.20 pct)
>>>>>   4-groups:         5.96 (0.00 pct)         4.87 (18.28 pct) 5.86 (1.67 pct)
>>>>>   8-groups:         6.58 (0.00 pct)         5.44 (17.32 pct) 6.20 (5.77 pct)
>>>>> 16-groups:        11.48 (0.00 pct)         8.07 (29.70 pct) 10.68 (6.96 pct)
>>>>>
>>>>> Processes:
>>>>>
>>>>> Test:              Base (v6.4-rc1)  Base + new_mmcid_reverted   Base + mm_count_padding
>>>>>   1-groups:         5.19 (0.00 pct)         4.90 (5.58 pct) 5.19 (0.00 pct)
>>>>>   2-groups:         5.44 (0.00 pct)         5.39 (0.91 pct) 5.39 (0.91 pct)
>>>>>   4-groups:         5.69 (0.00 pct)         5.64 (0.87 pct) 5.64 (0.87 pct)
>>>>>   8-groups:         6.08 (0.00 pct)         6.01 (1.15 pct) 6.04 (0.65 pct)
>>>>> 16-groups:        10.87 (0.00 pct)        10.83 (0.36 pct) 10.93 (-0.55 pct)
>>>>>
>>>>> The ibs profile shows that function __switch_to_asm() is coming at top in baseline
>>>>> run and is not seen with mm_count padding patch. Will be attaching full ibs profile
>>>>> data for all the 3 runs:
>>>>>
>>>>> # Base (v6.4-rc1)
>>>>> Threads:
>>>>> Total time: 11.486 [sec]
>>>>>
>>>>>     5.15%  sched-messaging  [kernel.vmlinux]      [k] __switch_to_asm
>>>>>     4.31%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>>>>     4.29%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>>>>     4.22%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>>>>     3.92%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
>>>>>     2.91%  sched-messaging  [kernel.vmlinux]      [k] __schedule
>>>>>     2.34%  swapper          [kernel.vmlinux]      [k] __switch_to_asm
>>>>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] prepare_to_wait_event
>>>>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>>>>     2.07%  sched-messaging  [kernel.vmlinux]      [k] finish_task_switch.isra.0
>>>>>     2.00%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>>>>     1.82%  sched-messaging  [kernel.vmlinux]      [k] check_preemption_disabled
>>>>>     1.73%  sched-messaging  [kernel.vmlinux]      [k] exit_to_user_mode_prepare
>>>>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>>>>     1.49%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>>>>     1.45%  sched-messaging  libc.so.6             [.] write
>>>>>     1.44%  swapper          [kernel.vmlinux]      [k] native_sched_clock
>>>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] psi_group_change
>>>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>>>>     1.37%  sched-messaging  libc.so.6             [.] read
>>>>>     1.06%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>>>>     1.01%  swapper          [kernel.vmlinux]      [k] psi_group_change
>>>>>     1.00%  sched-messaging  [kernel.vmlinux]      [k] update_curr
>>>>>
>>>>> # Base + mm_count_padding
>>>>> Threads:
>>>>> Total time: 11.384 [sec]
>>>>>
>>>>>     4.43%  sched-messaging  [kernel.vmlinux]         [k] copyin
>>>>>     4.39%  sched-messaging  [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>>>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] apparmor_file_permission
>>>>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] copyout
>>>>>     2.49%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSCALL_64
>>>>>     2.37%  sched-messaging  [kernel.vmlinux]         [k] update_cfs_group
>>>>>     2.19%  sched-messaging  [kernel.vmlinux]         [k] pipe_write
>>>>>     2.00%  sched-messaging  [kernel.vmlinux]         [k] check_preemption_disabled
>>>>>     1.93%  swapper          [kernel.vmlinux]         [k] update_load_avg
>>>>>     1.81%  sched-messaging  [kernel.vmlinux]         [k] exit_to_user_mode_prepare
>>>>>     1.69%  sched-messaging  [kernel.vmlinux]         [k] try_to_wake_up
>>>>>     1.58%  sched-messaging  libc.so.6                [.] write
>>>>>     1.53%  sched-messaging  [kernel.vmlinux]         [k] psi_group_change
>>>>>     1.50%  sched-messaging  libc.so.6                [.] read
>>>>>     1.50%  sched-messaging  [kernel.vmlinux]         [k] pipe_read
>>>>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] update_load_avg
>>>>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] osq_lock
>>>>>     1.30%  sched-messaging  [kernel.vmlinux]         [k] update_curr
>>>>>     1.28%  swapper          [kernel.vmlinux]         [k] psi_group_change
>>>>>     1.16%  sched-messaging  [kernel.vmlinux]         [k] vfs_read
>>>>>     1.12%  sched-messaging  [kernel.vmlinux]         [k] vfs_write
>>>>>     1.10%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSRETQ_unsafe_stack
>>>>>     1.09%  sched-messaging  [kernel.vmlinux]         [k] __switch_to_asm
>>>>>     1.08%  sched-messaging  [kernel.vmlinux]         [k] do_syscall_64
>>>>>     1.06%  sched-messaging  [kernel.vmlinux]         [k] select_task_rq_fair
>>>>>     1.03%  swapper          [kernel.vmlinux]         [k] update_cfs_group
>>>>>     1.00%  swapper          [kernel.vmlinux]         [k] rb_insert_color
>>>>>
>>>>> # Base + reverted_new_mm_cid
>>>>> Threads:
>>>>> Total time: 7.847 [sec]
>>>>>
>>>>>    12.14%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>>>>     8.86%  swapper          [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>>>>     6.13%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>>>>     5.54%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
>>>>>     3.59%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>>>>     2.61%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>>>>     2.48%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>>>>     2.33%  sched-messaging  [kernel.vmlinux]      [k] exit_to_user_mode_prepare
>>>>>     2.01%  sched-messaging  [kernel.vmlinux]      [k] check_preemption_disabled
>>>>>     1.96%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>>>>     1.91%  sched-messaging  libc.so.6             [.] write
>>>>>     1.77%  sched-messaging  libc.so.6             [.] read
>>>>>     1.64%  sched-messaging  [kernel.vmlinux]      [k] mutex_spin_on_owner
>>>>>     1.58%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>>>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] ktime_get_coarse_real_ts64
>>>>>     1.35%  sched-messaging  [kernel.vmlinux]      [k] vfs_write
>>>>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] entry_SYSRETQ_unsafe_stack
>>>>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>>>>     1.25%  sched-messaging  [kernel.vmlinux]      [k] do_syscall_64
>>>>>     1.22%  sched-messaging  [kernel.vmlinux]      [k] __fget_light
>>>>>     1.18%  sched-messaging  [kernel.vmlinux]      [k] mutex_lock
>>>>>     1.12%  sched-messaging  [kernel.vmlinux]      [k] file_update_time
>>>>>     1.04%  sched-messaging  [kernel.vmlinux]      [k] _copy_from_iter
>>>>>     1.01%  sched-messaging  [kernel.vmlinux]      [k] current_time
>>>>>
>>>>> So with the reverted new_mm_cid patch, we are seeing a lot of time being spent in
>>>>> native_queued_spin_lock_slowpath and yet, hackbench finishes faster.
>>>>>
>>>>> I keep further digging into this please let me know if you have any pointers for me.
>>>>
>>>> Do you have CONFIG_SECURITY_APPARMOR=y ? Can you try without ?
>>>>
>>>> I notice that apparmor_file_permission appears near the top of your
>>>> profiles, and apparmor uses an internal aa_buffers_lock spinlock,
>>>> which could possibly explain the top hits for
>>>> native_queued_spin_lock_slowpath. My current suspicion is that
>>>> the raw spinlock that was taken by "Base + reverted_new_mm_cid"
>>>> changed the contention pattern on the apparmor lock enough to
>>>> speed things up by pure accident.
>>>
>>> If apparmor happens to be the culprit here, we should have a hard look at this commit:
>>>
>>> commit df323337e50 "apparmor: Use a memory pool instead per-CPU caches"
>>>
>>> Which turned a per-cpu cache into a global memory pool protected by a spinlock. It may benefit RT, but it does not appear to be so great at scaling.
>>>
>> it is not. And I have a patch that needs some more formal testing for some stats.
>> Ubuntu pulled it in last cycle so it has gotten a fair bit of use and is looking good
>> on that end. There are probably some tweaks that can be done to improve it. The
>> backoff in particular is something that has mostly been adjusted in response to some
>> basic benchmarking.
>>
>> anyways patch below
>
> I don't understand why all these heuristics are needed at all ?
>
it can be simplified. The scaling heuristic came out of playing with it and doing some testing to reduce lock contention. Like I said, it needs work; there is a reason it hasn't landed anywhere yet. And yes, it may not even be the right approach.

> What was fundamentally wrong with the per-cpu caches before commit df323337e50 other than being non-RT friendly ? Was the only purpose of that commit to reduce the duration of preempt-off critical sections, or is there a bigger picture concern it was taking care of by introducing a global pool ?
>
- reducing preempt-off critical sections for RT.
- the need to use GFP_ATOMIC within those large critical sections; there are audit allocations, etc.
- those large critical sections also introduced other locking constraints that we wanted/needed to get rid of. There are places now that might sleep within the bounds of the old large critical section.
- reducing memory overhead on large multi-core systems

> Introducing per-cpu memory pools, dealing with migration by giving entries back to the right cpu's pool, taking into account the cpu the entry belongs to, and use a per-cpu/lock-free data structure allowing lock-free push to give back an entry on a remote cpu should do the trick without locking, and without long preempt-off critical sections.
I am not opposed to doing that, but we will need to be able to deal with preemption, which can leave a given cpu needing additional buffers; that means either allocating them on demand or preallocating an even larger per-cpu pool.

The other issue is the memory overhead on large multi-core systems, which makes going with a large preallocated set unattractive.

>
> The only downside I see for per-cpu memory pools is a slightly larger memory overhead on large multi-core systems. But is that really a concern ?
>
I was told it is

> What am I missing here ?
>
Different people/groups have conflicting constraints:
- make it work with RT
- reduce large system overhead
- make it fast
...

> Thanks,
>
> Mathieu
>
>>
>> commit e057e9b47f1749882ea0efb4427d6b9671c761ab
>> Author: John Johansen <[email protected]>
>> Date:   Tue Oct 25 01:18:41 2022 -0700
>>
>>      apparmor: cache buffers on percpu list if there is lock contention
>>      df323337e507 ("apparmor: Use a memory pool instead per-CPU caches")
>>      changed buffer allocation to use a memory pool, however on a heavily
>>      loaded machine there can be lock contention on the global buffers
>>      lock. Add a percpu list to cache buffers on when lock contention is
>>      encountered.
>>      When allocating buffers attempt to use cached buffers first,
>>      before taking the global buffers lock. When freeing buffers
>>      try to put them back to the global list but if contention is
>>      encountered, put the buffer on the percpu list.
>>      The length of time a buffer is held on the percpu list is dynamically
>>      adjusted based on lock contention.  The amount of hold time is rapidly
>>      increased and slow ramped down.
>>      v4:
>>      - fix percpu ->count buffer count which had been spliced across a
>>        debug patch.
>>      - introduce define for MAX_LOCAL_COUNT
>>      - rework count check and locking around it.
>>      - update commit message to reference commit that introduced the
>>        memory pool.
>>      v3:
>>      - limit number of buffers that can be pushed onto the percpu
>>        list. This avoids a problem on some kernels where one percpu
>>        list can inherit buffers from another cpu after a reschedule,
>>        causing more kernel memory to used than is necessary. Under
>>        normal conditions this should eventually return to normal
>>        but under pathelogical conditions the extra memory consumption
>>        may have been unbouanded
>>      v2:
>>      - dynamically adjust buffer hold time on percpu list based on
>>        lock contention.
>>      v1:
>>      - cache buffers on percpu list on lock contention
>>      Signed-off-by: John Johansen <[email protected]>
>>
>> diff --git a/security/apparmor/lsm.c b/security/apparmor/lsm.c
>> index e6671a4a89c4..ea3af769af5a 100644
>> --- a/security/apparmor/lsm.c
>> +++ b/security/apparmor/lsm.c
>> @@ -55,12 +55,21 @@ union aa_buffer {
>>       char buffer[1];
>>   };
>>
>> +struct aa_local_cache {
>> +    unsigned int contention;
>> +    unsigned int hold;
>> +    unsigned int count;
>> +    struct list_head head;
>> +};
>> +
>> +#define MAX_LOCAL_COUNT 2
>>   #define RESERVE_COUNT 2
>>   static int reserve_count = RESERVE_COUNT;
>>   static int buffer_count;
>>
>>   static LIST_HEAD(aa_global_buffers);
>>   static DEFINE_SPINLOCK(aa_buffers_lock);
>> +static DEFINE_PER_CPU(struct aa_local_cache, aa_local_buffers);
>>
>>   struct kmem_cache *aa_audit_slab;
>>
>> @@ -2029,14 +2038,45 @@ static int param_set_mode(const char *val, const struct kernel_param *kp)
>>       return 0;
>>   }
>>
>> +static void update_contention(struct aa_local_cache *cache)
>> +{
>> +    cache->contention += 3;
>> +    if (cache->contention > 9)
>> +        cache->contention = 9;
>> +    cache->hold += 1 << cache->contention;        /* 8, 64, 512 */
>> +}
>> +
>>   char *aa_get_buffer(bool in_atomic)
>>   {
>>       union aa_buffer *aa_buf;
>> +    struct aa_local_cache *cache;
>>       bool try_again = true;
>>       gfp_t flags = (GFP_KERNEL | __GFP_RETRY_MAYFAIL | __GFP_NOWARN);
>>
>> +    /* use per cpu cached buffers first */
>> +    cache = get_cpu_ptr(&aa_local_buffers);
>> +    if (!list_empty(&cache->head)) {
>> +        aa_buf = list_first_entry(&cache->head, union aa_buffer, list);
>> +        list_del(&aa_buf->list);
>> +        cache->hold--;
>> +        cache->count--;
>> +        put_cpu_ptr(&aa_local_buffers);
>> +        return &aa_buf->buffer[0];
>> +    }
>> +    put_cpu_ptr(&aa_local_buffers);
>> +
>> +    if (!spin_trylock(&aa_buffers_lock)) {
>> +        cache = get_cpu_ptr(&aa_local_buffers);
>> +        update_contention(cache);
>> +        put_cpu_ptr(&aa_local_buffers);
>> +        spin_lock(&aa_buffers_lock);
>> +    } else {
>> +        cache = get_cpu_ptr(&aa_local_buffers);
>> +        if (cache->contention)
>> +            cache->contention--;
>> +        put_cpu_ptr(&aa_local_buffers);
>> +    }
>>   retry:
>> -    spin_lock(&aa_buffers_lock);
>>       if (buffer_count > reserve_count ||
>>           (in_atomic && !list_empty(&aa_global_buffers))) {
>>           aa_buf = list_first_entry(&aa_global_buffers, union aa_buffer,
>> @@ -2062,6 +2102,7 @@ char *aa_get_buffer(bool in_atomic)
>>       if (!aa_buf) {
>>           if (try_again) {
>>               try_again = false;
>> +            spin_lock(&aa_buffers_lock);
>>               goto retry;
>>           }
>>           pr_warn_once("AppArmor: Failed to allocate a memory buffer.\n");
>> @@ -2073,15 +2114,42 @@ char *aa_get_buffer(bool in_atomic)
>>   void aa_put_buffer(char *buf)
>>   {
>>       union aa_buffer *aa_buf;
>> +    struct aa_local_cache *cache;
>>
>>       if (!buf)
>>           return;
>>       aa_buf = container_of(buf, union aa_buffer, buffer[0]);
>>
>> -    spin_lock(&aa_buffers_lock);
>> -    list_add(&aa_buf->list, &aa_global_buffers);
>> -    buffer_count++;
>> -    spin_unlock(&aa_buffers_lock);
>> +    cache = get_cpu_ptr(&aa_local_buffers);
>> +    if (!cache->hold) {
>> +        bool must_lock = cache->count >= MAX_LOCAL_COUNT;
>> +
>> +        put_cpu_ptr(&aa_local_buffers);
>> +
>> +        if (must_lock) {
>> +            spin_lock(&aa_buffers_lock);
>> +            goto locked;
>> +        } else if (spin_trylock(&aa_buffers_lock)) {
>> +        locked:
>> +            /* put back on global list */
>> +            list_add(&aa_buf->list, &aa_global_buffers);
>> +            buffer_count++;
>> +            spin_unlock(&aa_buffers_lock);
>> +            cache = get_cpu_ptr(&aa_local_buffers);
>> +            if (cache->contention)
>> +                cache->contention--;
>> +            put_cpu_ptr(&aa_local_buffers);
>> +            return;
>> +        }
>> +        /* contention on global list, fallback to percpu */
>> +        cache = get_cpu_ptr(&aa_local_buffers);
>> +        update_contention(cache);
>> +    }
>> +
>> +    /* cache in percpu list */
>> +    list_add(&aa_buf->list, &cache->head);
>> +    cache->count++;
>> +    put_cpu_ptr(&aa_local_buffers);
>>   }
>>
>>   /*
>> @@ -2123,6 +2191,16 @@ static int __init alloc_buffers(void)
>>       union aa_buffer *aa_buf;
>>       int i, num;
>>
>> +    /*
>> +     * per cpu set of cached allocated buffers used to help reduce
>> +     * lock contention
>> +     */
>> +    for_each_possible_cpu(i) {
>> +        per_cpu(aa_local_buffers, i).contention = 0;
>> +        per_cpu(aa_local_buffers, i).hold = 0;
>> +        per_cpu(aa_local_buffers, i).count = 0;
>> +        INIT_LIST_HEAD(&per_cpu(aa_local_buffers, i).head);
>> +    }
>>       /*
>>        * A function may require two buffers at once. Usually the buffers are
>>        * used for a short period of time and are shared. On UP kernel buffers
>>
>>
>>
>>
>
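
As an aside, the backoff arithmetic in update_contention() above works out as
follows. This is a standalone userspace sketch that simply replays the
increments from the patch; it is not kernel code, and the loop count is
arbitrary:

/* Replay of the update_contention()/hold backoff from the patch above. */
#include <stdio.h>

int main(void)
{
	unsigned int contention = 0, hold = 0;
	int i;

	for (i = 0; i < 4; i++) {
		/* each contended lock attempt bumps contention by 3, capped at 9 */
		contention += 3;
		if (contention > 9)
			contention = 9;
		/* hold grows by 2^contention: +8, +64, +512, +512, ... */
		hold += 1u << contention;
		printf("attempt %d: contention=%u hold=%u\n",
		       i + 1, contention, hold);
	}
	/* hold is then decremented by one each time a cached buffer is reused */
	return 0;
}

In other words, a handful of contended acquisitions quickly push the hold
count into the hundreds, and it then drains one unit per cache hit, which is
the "increased rapidly and ramped down slowly" behaviour described in the
changelog.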


Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 2023-06-21 16:59:31 [-0700], John Johansen wrote:
> > Which turned a per-cpu cache into a global memory pool protected by a spinlock. It may benefit RT, but it does not appear to be so great at scaling.
> >
> it is not. And I have a patch that needs some more formal testing for some stats.
> Ubuntu pulled it in last cycle so it has gotten a fair bit of use and is looking good
> on that end. There are probably some tweaks that can be done to improve it. The
> backoff in particular is something that has mostly been adjusted in response to some
> basic benchmarking.
>
> anyways patch below
>
> commit e057e9b47f1749882ea0efb4427d6b9671c761ab

I think I've been looking at this patch, or a former version of it, and
it looked good.

Sebastian

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 2023-06-22 10:33:13 [-0400], Mathieu Desnoyers wrote:
>
> What was fundamentally wrong with the per-cpu caches before commit
> df323337e50 other than being non-RT friendly ? Was the only purpose of that
> commit to reduce the duration of preempt-off critical sections, or is there
> a bigger picture concern it was taking care of by introducing a global pool
> ?

There were memory allocations within preempt-disabled sections
introduced by get_cpu_ptr(). This didn't fly on PREEMPT_RT. After
looking at this on a 2 node, 64 CPU box I didn't get anywhere near
complete usage of the allocated per-CPU buffers. It looked wasteful.
Based on my testing back then, it looked sufficient to use a global
buffer.

> Introducing per-cpu memory pools, dealing with migration by giving entries
> back to the right cpu's pool, taking into account the cpu the entry belongs
> to, and use a per-cpu/lock-free data structure allowing lock-free push to
> give back an entry on a remote cpu should do the trick without locking, and
> without long preempt-off critical sections.
>
> The only downside I see for per-cpu memory pools is a slightly larger memory
> overhead on large multi-core systems. But is that really a concern ?

Yes, if the memory is left unused and can't be reclaimed if needed.

> What am I missing here ?

I added (tried to add) to this that people claimed the SLUB allocator got
better over the years. Also, on NUMA systems it might be better to use it
since the memory is NUMA local. The allocation is for 8KiB by default.
Is there a test case/benchmark with which I could try this vs kmalloc()?

> Thanks,
>
> Mathieu

Sebastian
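
For readers following along, the per-cpu pool with lock-free remote push that
Mathieu describes in the quoted text above would look roughly like the
following. This is an illustrative sketch only: the names, the owner_cpu
bookkeeping and the buffer layout are assumptions, not existing AppArmor code.

/*
 * Rough sketch of the suggested scheme: one pool per CPU, buffers taken
 * from the local pool, and frees pushed back to the owning CPU's pool via
 * a lock-free llist, so there is neither a global lock nor an allocation
 * inside a preempt-disabled region.
 */
#include <linux/llist.h>
#include <linux/percpu.h>
#include <linux/slab.h>
#include <linux/smp.h>

struct pcpu_buf {
	struct llist_node node;
	int owner_cpu;		/* CPU whose pool this buffer belongs to */
	char data[];
};

static DEFINE_PER_CPU(struct llist_head, buf_pool);

static struct pcpu_buf *buf_get(size_t size)
{
	struct llist_node *n;
	struct pcpu_buf *buf;
	int cpu;

	/*
	 * Only the local CPU pops from its own list, which satisfies the
	 * single-consumer requirement of llist_del_first().
	 */
	cpu = get_cpu();
	n = llist_del_first(this_cpu_ptr(&buf_pool));
	put_cpu();
	if (n)
		return llist_entry(n, struct pcpu_buf, node);

	/* Fall back to the allocator with preemption enabled. */
	buf = kmalloc(sizeof(*buf) + size, GFP_KERNEL);
	if (buf)
		buf->owner_cpu = cpu;
	return buf;
}

static void buf_put(struct pcpu_buf *buf)
{
	/* Lock-free push back to the owning CPU's pool, valid from any CPU. */
	llist_add(&buf->node, per_cpu_ptr(&buf_pool, buf->owner_cpu));
}

A real implementation would still need a trim/reclaim path for idle pools,
which is essentially Sebastian's concern below about per-CPU memory sitting
unused.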

2023-06-23 07:27:30

by John Johansen

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 6/22/23 23:37, Sebastian Andrzej Siewior wrote:
> On 2023-06-21 16:59:31 [-0700], John Johansen wrote:
>>> Which turned a per-cpu cache into a global memory pool protected by a spinlock. It may benefit RT, but it does not appear to be so great at scaling.
>>>
>> it is not. And I have a patch that needs some more formal testing for some stats.
>> Ubuntu pulled it in last cycle so it has gotten a fair bit of use and is looking good
>> on that end. There are probably some tweaks that can be done to improve it. The
>> backoff in particular is something that has mostly been adjusted in response to some
>> basic benchmarking.
>>
>> anyways patch below
>>
>> commit e057e9b47f1749882ea0efb4427d6b9671c761ab
>
> I think I've been looking at this patch, or a former version of it, and
> it looked good.
>

So, I am not satisfied with the way the scaling works; it feels like it
is more complicated than it needs to be. I also wanted to see, with the percpu
caching, whether it was worth dropping the global pool and just going to the allocator.

With that said, the patch does work, and seems to be stable in the broader
testing it has gotten. It's more complicated than I would like, and I wanted to
play with ideas around improving it. I also wanted to get some better benchmark data.
All things I just haven't had time for.

I am also not opposed to using some version of this patch now to solve the
immediate needs, and then working to improve it after the fact.


2023-06-23 07:42:25

by John Johansen

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 6/22/23 23:37, Sebastian Andrzej Siewior wrote:
> On 2023-06-21 16:59:31 [-0700], John Johansen wrote:
>>> Which turned a per-cpu cache into a global memory pool protected by a spinlock. It may benefit RT, but it does not appear to be so great at scaling.
>>>
>> it is not. And I have a patch that needs some more formal testing for some stats.
>> Ubuntu pulled it in last cycle so it has gotten a fair bit of use and is looking good
>> on that end. There are probably some tweaks that can be done to improve it. The
>> backoff in particular is something that has mostly been adjusted in response to some
>> basic benchmarking.
>>
>> anyways patch below
>>
>> commit e057e9b47f1749882ea0efb4427d6b9671c761ab
>
> I think I've been looking at this patch, or a former version of it, and
> it looked good.
>
IIRC the difference from the earlier version is in the put case. In the
earlier version, if there was lock contention the buffer would always
get pushed onto the percpu list. With some debug patches on top we
saw some degenerate cases where this would result in percpu lists that
had excessive buffers on them.

So this version added a condition to force putting the buffer back
into the global pool if the percpu list already has 2 buffers
cached on it.
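
Condensed, the put-path decision described here (and implemented by
aa_put_buffer() in the patch above) is the following. This is a standalone
restatement for clarity, not the kernel code itself:

#include <stdbool.h>

#define MAX_LOCAL_COUNT 2	/* cap on buffers parked on one percpu list */

enum put_target { PUT_GLOBAL_LOCKED, PUT_GLOBAL_TRYLOCK, PUT_PERCPU };

/* Mirrors the branch structure of aa_put_buffer() in the patch above. */
static enum put_target put_decision(unsigned int hold, unsigned int count,
				    bool trylock_succeeded)
{
	if (hold)
		return PUT_PERCPU;		/* hold timer active: keep the buffer local */
	if (count >= MAX_LOCAL_COUNT)
		return PUT_GLOBAL_LOCKED;	/* cap hit: take the global lock unconditionally */
	if (trylock_succeeded)
		return PUT_GLOBAL_TRYLOCK;	/* uncontended: hand back to the global pool */
	return PUT_PERCPU;			/* contended: cache locally and bump the backoff */
}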




Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 2023-06-23 00:35:29 [-0700], John Johansen wrote:
> >
> IIRC the difference from the earlier version is in the put case. In the
> earlier version, if there was lock contention the buffer would always
> get pushed onto the percpu list. With some debug patches on top we
> saw some degenerate cases where this would result in percpu lists that
> had excessive buffers on them.
>
> So this version added a condition to force putting the buffer back
> into the global pool if the percpu list already has 2 buffers
> cached on it.

So none of the versions perform memory allocation/deallocation in a
preempt-disabled section, so it is fine from a PREEMPT_RT point of view.

Sebastian

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 2023-06-23 00:16:35 [-0700], John Johansen wrote:
> With that said the patch does work, and seems to be stable in the broader
> testing it has gotten. Its more complicated than I would like, and wanted to
> play with ideas around improving it. I also wanted get some better benchmark data.
> All things I just haven't had time for

If you tell me what to do (as in "run this") I could try to run it on a
machine: the current version, your proposal, and kmalloc() instead.

> I am also not opposed to using some version of this patch now, to solve the
> immediate needs and working to improve it after the fact.

Sebastian

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 20.06.23 10:14, Swapnil Sapkal wrote:
>
> On 4/22/2023 1:13 PM, tip-bot2 for Mathieu Desnoyers wrote:
>> The following commit has been merged into the sched/core branch of tip:
>>
>> Commit-ID:     223baf9d17f25e2608dbdff7232c095c1e612268
>> Gitweb:       
>> https://git.kernel.org/tip/223baf9d17f25e2608dbdff7232c095c1e612268
>> Author:        Mathieu Desnoyers <[email protected]>
>> AuthorDate:    Thu, 20 Apr 2023 10:55:48 -04:00
>> Committer:     Peter Zijlstra <[email protected]>
>> CommitterDate: Fri, 21 Apr 2023 13:24:20 +02:00
>>
>> sched: Fix performance regression introduced by mm_cid
>>
>> Introduce per-mm/cpu current concurrency id (mm_cid) to fix a PostgreSQL
>> sysbench regression reported by Aaron Lu.
> [...]
> I run standard benchmarks as a part of kernel performance regression
> testing. When I ran these benchmarks against v6.3.0 to v6.4-rc1,
> I saw a performance regression in hackbench running with threads. When I
> did a git bisect it pointed to this commit, and reverting this commit
> helps regain the performance. This regression is not seen with hackbench
> processes. Following are the results from a 1 Socket 4th generation EPYC
> Processor (1 x 96C/192T) configured in NPS1 mode. This regression
> becomes more severe as the core count increases.
>
> The numbers on a 1 Socket Bergamo (1 x 128 cores/256 threads) are
> significantly worse.
> [...]

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 223baf9d17f
#regzbot title sched: performance regression in hackbench (partly solved
in -next by c1753fd02a00, partially caused by df323337e50)
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

2023-07-14 06:16:30

by Swapnil Sapkal

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

Hello Mathieu,

On 6/22/2023 12:21 AM, Mathieu Desnoyers wrote:
> On 6/21/23 12:36, Swapnil Sapkal wrote:
>> Hello Mathieu,
>>
> [...]
>>>
>>> I suspect the regression is caused by the mm_count cache line bouncing.
>>>
>>> Please try with this additional patch applied:
>>>
>>> https://lore.kernel.org/lkml/[email protected]/
>>
>> Thanks for the suggestion. I tried out with the patch you suggested. I am seeing
>> improvement in hackbench numbers with mm_count padding. But this is not matching
>> with what we achieved through reverting the new mm_cid patch.
>>
>> Below are the results on the 1 Socket 4th Generation EPYC Processor (1 x 96C/192T):
>>
>> Threads:
>>
>> Test:              Base (v6.4-rc1)   Base + new_mmcid_reverted  Base + mm_count_padding
>>   1-groups:         5.23 (0.00 pct)         4.61 (11.85 pct)        5.11 (2.29 pct)
>>   2-groups:         4.99 (0.00 pct)         4.72 (5.41 pct)         5.00 (-0.20 pct)
>>   4-groups:         5.96 (0.00 pct)         4.87 (18.28 pct)        5.86 (1.67 pct)
>>   8-groups:         6.58 (0.00 pct)         5.44 (17.32 pct)        6.20 (5.77 pct)
>> 16-groups:        11.48 (0.00 pct)         8.07 (29.70 pct)       10.68 (6.96 pct)
>>
>> Processes:
>>
>> Test:              Base (v6.4-rc1)  Base + new_mmcid_reverted   Base + mm_count_padding
>>   1-groups:         5.19 (0.00 pct)         4.90 (5.58 pct)         5.19 (0.00 pct)
>>   2-groups:         5.44 (0.00 pct)         5.39 (0.91 pct)         5.39 (0.91 pct)
>>   4-groups:         5.69 (0.00 pct)         5.64 (0.87 pct)         5.64 (0.87 pct)
>>   8-groups:         6.08 (0.00 pct)         6.01 (1.15 pct)         6.04 (0.65 pct)
>> 16-groups:        10.87 (0.00 pct)        10.83 (0.36 pct)        10.93 (-0.55 pct)
>>
>> The ibs profile shows that function __switch_to_asm() is coming at top in baseline
>> run and is not seen with mm_count padding patch. Will be attaching full ibs profile
>> data for all the 3 runs:
>>
>> # Base (v6.4-rc1)
>> Threads:
>> Total time: 11.486 [sec]
>>
>>     5.15%  sched-messaging  [kernel.vmlinux]      [k] __switch_to_asm
>>     4.31%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>     4.29%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>     4.22%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>     3.92%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
>>     2.91%  sched-messaging  [kernel.vmlinux]      [k] __schedule
>>     2.34%  swapper          [kernel.vmlinux]      [k] __switch_to_asm
>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] prepare_to_wait_event
>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>     2.07%  sched-messaging  [kernel.vmlinux]      [k] finish_task_switch.isra.0
>>     2.00%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>     1.82%  sched-messaging  [kernel.vmlinux]      [k] check_preemption_disabled
>>     1.73%  sched-messaging  [kernel.vmlinux]      [k] exit_to_user_mode_prepare
>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>     1.49%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>     1.45%  sched-messaging  libc.so.6             [.] write
>>     1.44%  swapper          [kernel.vmlinux]      [k] native_sched_clock
>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] psi_group_change
>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>     1.37%  sched-messaging  libc.so.6             [.] read
>>     1.06%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>     1.01%  swapper          [kernel.vmlinux]      [k] psi_group_change
>>     1.00%  sched-messaging  [kernel.vmlinux]      [k] update_curr
>>
>> # Base + mm_count_padding
>> Threads:
>> Total time: 11.384 [sec]
>>
>>     4.43%  sched-messaging  [kernel.vmlinux]         [k] copyin
>>     4.39%  sched-messaging  [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] apparmor_file_permission
>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] copyout
>>     2.49%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSCALL_64
>>     2.37%  sched-messaging  [kernel.vmlinux]         [k] update_cfs_group
>>     2.19%  sched-messaging  [kernel.vmlinux]         [k] pipe_write
>>     2.00%  sched-messaging  [kernel.vmlinux]         [k] check_preemption_disabled
>>     1.93%  swapper          [kernel.vmlinux]         [k] update_load_avg
>>     1.81%  sched-messaging  [kernel.vmlinux]         [k] exit_to_user_mode_prepare
>>     1.69%  sched-messaging  [kernel.vmlinux]         [k] try_to_wake_up
>>     1.58%  sched-messaging  libc.so.6                [.] write
>>     1.53%  sched-messaging  [kernel.vmlinux]         [k] psi_group_change
>>     1.50%  sched-messaging  libc.so.6                [.] read
>>     1.50%  sched-messaging  [kernel.vmlinux]         [k] pipe_read
>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] update_load_avg
>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] osq_lock
>>     1.30%  sched-messaging  [kernel.vmlinux]         [k] update_curr
>>     1.28%  swapper          [kernel.vmlinux]         [k] psi_group_change
>>     1.16%  sched-messaging  [kernel.vmlinux]         [k] vfs_read
>>     1.12%  sched-messaging  [kernel.vmlinux]         [k] vfs_write
>>     1.10%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSRETQ_unsafe_stack
>>     1.09%  sched-messaging  [kernel.vmlinux]         [k] __switch_to_asm
>>     1.08%  sched-messaging  [kernel.vmlinux]         [k] do_syscall_64
>>     1.06%  sched-messaging  [kernel.vmlinux]         [k] select_task_rq_fair
>>     1.03%  swapper          [kernel.vmlinux]         [k] update_cfs_group
>>     1.00%  swapper          [kernel.vmlinux]         [k] rb_insert_color
>>
>> # Base + reverted_new_mm_cid
>> Threads:
>> Total time: 7.847 [sec]
>>
>>    12.14%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>     8.86%  swapper          [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>     6.13%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>     5.54%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
>>     3.59%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>     2.61%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>     2.48%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>     2.33%  sched-messaging  [kernel.vmlinux]      [k] exit_to_user_mode_prepare
>>     2.01%  sched-messaging  [kernel.vmlinux]      [k] check_preemption_disabled
>>     1.96%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>     1.91%  sched-messaging  libc.so.6             [.] write
>>     1.77%  sched-messaging  libc.so.6             [.] read
>>     1.64%  sched-messaging  [kernel.vmlinux]      [k] mutex_spin_on_owner
>>     1.58%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] ktime_get_coarse_real_ts64
>>     1.35%  sched-messaging  [kernel.vmlinux]      [k] vfs_write
>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] entry_SYSRETQ_unsafe_stack
>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>     1.25%  sched-messaging  [kernel.vmlinux]      [k] do_syscall_64
>>     1.22%  sched-messaging  [kernel.vmlinux]      [k] __fget_light
>>     1.18%  sched-messaging  [kernel.vmlinux]      [k] mutex_lock
>>     1.12%  sched-messaging  [kernel.vmlinux]      [k] file_update_time
>>     1.04%  sched-messaging  [kernel.vmlinux]      [k] _copy_from_iter
>>     1.01%  sched-messaging  [kernel.vmlinux]      [k] current_time
>>
>> So with the reverted new_mm_cid patch, we are seeing a lot of time being spent in
>> native_queued_spin_lock_slowpath and yet, hackbench finishes faster.
>>
>> I keep further digging into this please let me know if you have any pointers for me.
>
> Do you have CONFIG_SECURITY_APPARMOR=y ? Can you try without ?
>
Sorry for the delay in response. My system was busy running some workloads. I tried
running hackbench with apparmor disabled; it looks like apparmor is not the culprit here.
Below are the results with apparmor disabled:

Test:                   Base            Base + Reverted_new_mmcid   Base+Apparmour_disabled
 1-groups:         2.81 (0.00 pct)         2.79 (0.71 pct)         2.79 (0.71 pct)
 2-groups:         3.25 (0.00 pct)         3.25 (0.00 pct)         3.20 (1.53 pct)
 4-groups:         3.44 (0.00 pct)         3.28 (4.65 pct)         3.43 (0.29 pct)
 8-groups:         3.52 (0.00 pct)         3.42 (2.84 pct)         3.53 (-0.28 pct)
16-groups:         5.65 (0.00 pct)         4.52 (20.00 pct)        5.67 (-0.35 pct)

Thanks,
Swapnil

> I notice that apparmor_file_permission appears near the top of your
> profiles, and apparmor uses an internal aa_buffers_lock spinlock,
> which could possibly explain the top hits for
> native_queued_spin_lock_slowpath. My current suspicion is that
> the raw spinlock that was taken by "Base + reverted_new_mm_cid"
> changed the contention pattern on the apparmor lock enough to
> speed things up by pure accident.
>
> Thanks,
>
> Mathieu
>
>
>>
>>>
>>> This patch has recently been merged into the mm tree.
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>> --
>> Thanks and Regards,
>> Swapnil
>

2023-07-14 15:11:19

by Mathieu Desnoyers

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

On 7/14/23 02:02, Swapnil Sapkal wrote:
> Hello Mathieu,
>
> On 6/22/2023 12:21 AM, Mathieu Desnoyers wrote:
>> On 6/21/23 12:36, Swapnil Sapkal wrote:
>>> Hello Mathieu,
>>>
>> [...]
>>>>
>>>> I suspect the regression is caused by the mm_count cache line bouncing.
>>>>
>>>> Please try with this additional patch applied:
>>>>
>>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> Thanks for the suggestion. I tried out with the patch you suggested.
>>> I am seeing
>>> improvement in hackbench numbers with mm_count padding. But this is
>>> not matching
>>> with what we achieved through reverting the new mm_cid patch.
>>>
>>> Below are the results on the 1 Socket 4th Generation EPYC Processor
>>> (1 x 96C/192T):
>>>
>>> Threads:
>>>
>>> Test:              Base (v6.4-rc1)   Base + new_mmcid_reverted  Base + mm_count_padding
>>>   1-groups:         5.23 (0.00 pct)         4.61 (11.85 pct)        5.11 (2.29 pct)
>>>   2-groups:         4.99 (0.00 pct)         4.72 (5.41 pct)         5.00 (-0.20 pct)
>>>   4-groups:         5.96 (0.00 pct)         4.87 (18.28 pct)        5.86 (1.67 pct)
>>>   8-groups:         6.58 (0.00 pct)         5.44 (17.32 pct)        6.20 (5.77 pct)
>>> 16-groups:        11.48 (0.00 pct)         8.07 (29.70 pct)       10.68 (6.96 pct)
>>>
>>> Processes:
>>>
>>> Test:              Base (v6.4-rc1)  Base + new_mmcid_reverted   Base + mm_count_padding
>>>   1-groups:         5.19 (0.00 pct)         4.90 (5.58 pct)         5.19 (0.00 pct)
>>>   2-groups:         5.44 (0.00 pct)         5.39 (0.91 pct)         5.39 (0.91 pct)
>>>   4-groups:         5.69 (0.00 pct)         5.64 (0.87 pct)         5.64 (0.87 pct)
>>>   8-groups:         6.08 (0.00 pct)         6.01 (1.15 pct)         6.04 (0.65 pct)
>>> 16-groups:        10.87 (0.00 pct)        10.83 (0.36 pct)        10.93 (-0.55 pct)
>>>
>>> The ibs profile shows that function __switch_to_asm() is coming at
>>> top in baseline
>>> run and is not seen with mm_count padding patch. Will be attaching
>>> full ibs profile
>>> data for all the 3 runs:
>>>
>>> # Base (v6.4-rc1)
>>> Threads:
>>> Total time: 11.486 [sec]
>>>
>>>     5.15%  sched-messaging  [kernel.vmlinux]      [k] __switch_to_asm
>>>     4.31%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>>     4.29%  sched-messaging  [kernel.vmlinux]      [k]
>>> native_queued_spin_lock_slowpath
>>>     4.22%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>>     3.92%  sched-messaging  [kernel.vmlinux]      [k]
>>> apparmor_file_permission
>>>     2.91%  sched-messaging  [kernel.vmlinux]      [k] __schedule
>>>     2.34%  swapper          [kernel.vmlinux]      [k] __switch_to_asm
>>>     2.10%  sched-messaging  [kernel.vmlinux]      [k]
>>> prepare_to_wait_event
>>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>>     2.07%  sched-messaging  [kernel.vmlinux]      [k]
>>> finish_task_switch.isra.0
>>>     2.00%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>>     1.82%  sched-messaging  [kernel.vmlinux]      [k]
>>> check_preemption_disabled
>>>     1.73%  sched-messaging  [kernel.vmlinux]      [k]
>>> exit_to_user_mode_prepare
>>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>>     1.49%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>>     1.45%  sched-messaging  libc.so.6             [.] write
>>>     1.44%  swapper          [kernel.vmlinux]      [k] native_sched_clock
>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] psi_group_change
>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>>     1.37%  sched-messaging  libc.so.6             [.] read
>>>     1.06%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>>     1.01%  swapper          [kernel.vmlinux]      [k] psi_group_change
>>>     1.00%  sched-messaging  [kernel.vmlinux]      [k] update_curr
>>>
>>> # Base + mm_count_padding
>>> Threads:
>>> Total time: 11.384 [sec]
>>>
>>>     4.43%  sched-messaging  [kernel.vmlinux]         [k] copyin
>>>     4.39%  sched-messaging  [kernel.vmlinux]         [k]
>>> native_queued_spin_lock_slowpath
>>>     4.07%  sched-messaging  [kernel.vmlinux]         [k]
>>> apparmor_file_permission
>>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] copyout
>>>     2.49%  sched-messaging  [kernel.vmlinux]         [k]
>>> entry_SYSCALL_64
>>>     2.37%  sched-messaging  [kernel.vmlinux]         [k]
>>> update_cfs_group
>>>     2.19%  sched-messaging  [kernel.vmlinux]         [k] pipe_write
>>>     2.00%  sched-messaging  [kernel.vmlinux]         [k]
>>> check_preemption_disabled
>>>     1.93%  swapper          [kernel.vmlinux]         [k] update_load_avg
>>>     1.81%  sched-messaging  [kernel.vmlinux]         [k]
>>> exit_to_user_mode_prepare
>>>     1.69%  sched-messaging  [kernel.vmlinux]         [k] try_to_wake_up
>>>     1.58%  sched-messaging  libc.so.6                [.] write
>>>     1.53%  sched-messaging  [kernel.vmlinux]         [k]
>>> psi_group_change
>>>     1.50%  sched-messaging  libc.so.6                [.] read
>>>     1.50%  sched-messaging  [kernel.vmlinux]         [k] pipe_read
>>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] update_load_avg
>>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] osq_lock
>>>     1.30%  sched-messaging  [kernel.vmlinux]         [k] update_curr
>>>     1.28%  swapper          [kernel.vmlinux]         [k]
>>> psi_group_change
>>>     1.16%  sched-messaging  [kernel.vmlinux]         [k] vfs_read
>>>     1.12%  sched-messaging  [kernel.vmlinux]         [k] vfs_write
>>>     1.10%  sched-messaging  [kernel.vmlinux]         [k]
>>> entry_SYSRETQ_unsafe_stack
>>>     1.09%  sched-messaging  [kernel.vmlinux]         [k] __switch_to_asm
>>>     1.08%  sched-messaging  [kernel.vmlinux]         [k] do_syscall_64
>>>     1.06%  sched-messaging  [kernel.vmlinux]         [k]
>>> select_task_rq_fair
>>>     1.03%  swapper          [kernel.vmlinux]         [k]
>>> update_cfs_group
>>>     1.00%  swapper          [kernel.vmlinux]         [k] rb_insert_color
>>>
>>> # Base + reverted_new_mm_cid
>>> Threads:
>>> Total time: 7.847 [sec]
>>>
>>>    12.14%  sched-messaging  [kernel.vmlinux]      [k]
>>> native_queued_spin_lock_slowpath
>>>     8.86%  swapper          [kernel.vmlinux]      [k]
>>> native_queued_spin_lock_slowpath
>>>     6.13%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>>     5.54%  sched-messaging  [kernel.vmlinux]      [k]
>>> apparmor_file_permission
>>>     3.59%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>>     2.61%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>>     2.48%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>>     2.33%  sched-messaging  [kernel.vmlinux]      [k]
>>> exit_to_user_mode_prepare
>>>     2.01%  sched-messaging  [kernel.vmlinux]      [k]
>>> check_preemption_disabled
>>>     1.96%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>>     1.91%  sched-messaging  libc.so.6             [.] write
>>>     1.77%  sched-messaging  libc.so.6             [.] read
>>>     1.64%  sched-messaging  [kernel.vmlinux]      [k]
>>> mutex_spin_on_owner
>>>     1.58%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k]
>>> ktime_get_coarse_real_ts64
>>>     1.35%  sched-messaging  [kernel.vmlinux]      [k] vfs_write
>>>     1.28%  sched-messaging  [kernel.vmlinux]      [k]
>>> entry_SYSRETQ_unsafe_stack
>>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>>     1.25%  sched-messaging  [kernel.vmlinux]      [k] do_syscall_64
>>>     1.22%  sched-messaging  [kernel.vmlinux]      [k] __fget_light
>>>     1.18%  sched-messaging  [kernel.vmlinux]      [k] mutex_lock
>>>     1.12%  sched-messaging  [kernel.vmlinux]      [k] file_update_time
>>>     1.04%  sched-messaging  [kernel.vmlinux]      [k] _copy_from_iter
>>>     1.01%  sched-messaging  [kernel.vmlinux]      [k] current_time
>>>
>>> So with the reverted new_mm_cid patch, we are seeing a lot of time
>>> being spent in
>>> native_queued_spin_lock_slowpath and yet, hackbench finishes faster.
>>>
>>> I keep further digging into this please let me know if you have any
>>> pointers for me.
>>
>> Do you have CONFIG_SECURITY_APPARMOR=y ? Can you try without ?
>>
> Sorry for the delay in response. My system was busy running some
> workloads. I tried
> running hackbench disabling apparmor, looks like apparmor is not the
> culprit here.
> Below are the results with apparmor disabled:
>
> Test:                   Base            Base + Reverted_new_mmcid   Base+Apparmour_disabled
>  1-groups:         2.81 (0.00 pct)         2.79 (0.71 pct)         2.79 (0.71 pct)
>  2-groups:         3.25 (0.00 pct)         3.25 (0.00 pct)         3.20 (1.53 pct)
>  4-groups:         3.44 (0.00 pct)         3.28 (4.65 pct)         3.43 (0.29 pct)
>  8-groups:         3.52 (0.00 pct)         3.42 (2.84 pct)         3.53 (-0.28 pct)
> 16-groups:         5.65 (0.00 pct)         4.52 (20.00 pct)        5.67 (-0.35 pct)

Can you provide the kernel config file associated with this
test ? I would also need to see ibs profiles showing the
functions using most cpu, especially spinlocks and their
callers.

My working hypothesis is that adding the rseq-mm-cid spinlock
in the scheduler improves performances of your benchmark because
it lessens the contention on _another_ lock somewhere else.

Note that we've just received a brand new 2 sockets,
96 cores/socket AMD machine at EfficiOS. We've bought it to
increase our coverage of scalability testing. With this I should
be able to reproduce those regressions on my end, which should
facilitate the investigation.

Thanks!

Mathieu


>
> Thanks,
> Swapnil
>
>> I notice that apparmor_file_permission appears near the top of your
>> profiles, and apparmor uses an internal aa_buffers_lock spinlock,
>> which could possibly explain the top hits for
>> native_queued_spin_lock_slowpath. My current suspicion is that
>> the raw spinlock that was taken by "Base + reverted_new_mm_cid"
>> changed the contention pattern on the apparmor lock enough to
>> speed things up by pure accident.
>>
>> Thanks,
>>
>> Mathieu
>>
>>
>>>
>>>>
>>>> This patch has recently been merged into the mm tree.
>>>>
>>>> Thanks,
>>>>
>>>> Mathieu
>>>>
>>> --
>>> Thanks and Regards,
>>> Swapnil
>>

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


2023-07-18 06:18:16

by Swapnil Sapkal

Subject: Re: [tip: sched/core] sched: Fix performance regression introduced by mm_cid

Hello Mathieu,

On 7/14/2023 8:25 PM, Mathieu Desnoyers wrote:
> On 7/14/23 02:02, Swapnil Sapkal wrote:
>> Hello Mathieu,
>>
>> On 6/22/2023 12:21 AM, Mathieu Desnoyers wrote:
>>> On 6/21/23 12:36, Swapnil Sapkal wrote:
>>>> Hello Mathieu,
>>>>
>>> [...]
>>>>>
>>>>> I suspect the regression is caused by the mm_count cache line bouncing.
>>>>>
>>>>> Please try with this additional patch applied:
>>>>>
>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>
>>>> Thanks for the suggestion. I tried out with the patch you suggested. I am seeing
>>>> improvement in hackbench numbers with mm_count padding. But this is not matching
>>>> with what we achieved through reverting the new mm_cid patch.
>>>>
>>>> Below are the results on the 1 Socket 4th Generation EPYC Processor (1 x 96C/192T):
>>>>
>>>> Threads:
>>>>
>>>> Test:              Base (v6.4-rc1)   Base + new_mmcid_reverted  Base + mm_count_padding
>>>>   1-groups:         5.23 (0.00 pct)         4.61 (11.85 pct) 5.11 (2.29 pct)
>>>>   2-groups:         4.99 (0.00 pct)         4.72 (5.41 pct) 5.00 (-0.20 pct)
>>>>   4-groups:         5.96 (0.00 pct)         4.87 (18.28 pct) 5.86 (1.67 pct)
>>>>   8-groups:         6.58 (0.00 pct)         5.44 (17.32 pct) 6.20 (5.77 pct)
>>>> 16-groups:        11.48 (0.00 pct)         8.07 (29.70 pct) 10.68 (6.96 pct)
>>>>
>>>> Processes:
>>>>
>>>> Test:              Base (v6.4-rc1)  Base + new_mmcid_reverted   Base + mm_count_padding
>>>>   1-groups:         5.19 (0.00 pct)         4.90 (5.58 pct) 5.19 (0.00 pct)
>>>>   2-groups:         5.44 (0.00 pct)         5.39 (0.91 pct) 5.39 (0.91 pct)
>>>>   4-groups:         5.69 (0.00 pct)         5.64 (0.87 pct) 5.64 (0.87 pct)
>>>>   8-groups:         6.08 (0.00 pct)         6.01 (1.15 pct) 6.04 (0.65 pct)
>>>> 16-groups:        10.87 (0.00 pct)        10.83 (0.36 pct) 10.93 (-0.55 pct)
>>>>
>>>> The ibs profile shows that function __switch_to_asm() is coming at top in baseline
>>>> run and is not seen with mm_count padding patch. Will be attaching full ibs profile
>>>> data for all the 3 runs:
>>>>
>>>> # Base (v6.4-rc1)
>>>> Threads:
>>>> Total time: 11.486 [sec]
>>>>
>>>>     5.15%  sched-messaging  [kernel.vmlinux]      [k] __switch_to_asm
>>>>     4.31%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>>>     4.29%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>>>     4.22%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>>>     3.92%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
>>>>     2.91%  sched-messaging  [kernel.vmlinux]      [k] __schedule
>>>>     2.34%  swapper          [kernel.vmlinux]      [k] __switch_to_asm
>>>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] prepare_to_wait_event
>>>>     2.10%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>>>     2.07%  sched-messaging  [kernel.vmlinux]      [k] finish_task_switch.isra.0
>>>>     2.00%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>>>     1.82%  sched-messaging  [kernel.vmlinux]      [k] check_preemption_disabled
>>>>     1.73%  sched-messaging  [kernel.vmlinux]      [k] exit_to_user_mode_prepare
>>>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>>>     1.49%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>>>     1.45%  sched-messaging  libc.so.6             [.] write
>>>>     1.44%  swapper          [kernel.vmlinux]      [k] native_sched_clock
>>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] psi_group_change
>>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>>>     1.37%  sched-messaging  libc.so.6             [.] read
>>>>     1.06%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>>>     1.01%  swapper          [kernel.vmlinux]      [k] psi_group_change
>>>>     1.00%  sched-messaging  [kernel.vmlinux]      [k] update_curr
>>>>
>>>> # Base + mm_count_padding
>>>> Threads:
>>>> Total time: 11.384 [sec]
>>>>
>>>>     4.43%  sched-messaging  [kernel.vmlinux]         [k] copyin
>>>>     4.39%  sched-messaging  [kernel.vmlinux]         [k] native_queued_spin_lock_slowpath
>>>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] apparmor_file_permission
>>>>     4.07%  sched-messaging  [kernel.vmlinux]         [k] copyout
>>>>     2.49%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSCALL_64
>>>>     2.37%  sched-messaging  [kernel.vmlinux]         [k] update_cfs_group
>>>>     2.19%  sched-messaging  [kernel.vmlinux]         [k] pipe_write
>>>>     2.00%  sched-messaging  [kernel.vmlinux]         [k] check_preemption_disabled
>>>>     1.93%  swapper          [kernel.vmlinux]         [k] update_load_avg
>>>>     1.81%  sched-messaging  [kernel.vmlinux]         [k] exit_to_user_mode_prepare
>>>>     1.69%  sched-messaging  [kernel.vmlinux]         [k] try_to_wake_up
>>>>     1.58%  sched-messaging  libc.so.6                [.] write
>>>>     1.53%  sched-messaging  [kernel.vmlinux]         [k] psi_group_change
>>>>     1.50%  sched-messaging  libc.so.6                [.] read
>>>>     1.50%  sched-messaging  [kernel.vmlinux]         [k] pipe_read
>>>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] update_load_avg
>>>>     1.39%  sched-messaging  [kernel.vmlinux]         [k] osq_lock
>>>>     1.30%  sched-messaging  [kernel.vmlinux]         [k] update_curr
>>>>     1.28%  swapper          [kernel.vmlinux]         [k] psi_group_change
>>>>     1.16%  sched-messaging  [kernel.vmlinux]         [k] vfs_read
>>>>     1.12%  sched-messaging  [kernel.vmlinux]         [k] vfs_write
>>>>     1.10%  sched-messaging  [kernel.vmlinux]         [k] entry_SYSRETQ_unsafe_stack
>>>>     1.09%  sched-messaging  [kernel.vmlinux]         [k] __switch_to_asm
>>>>     1.08%  sched-messaging  [kernel.vmlinux]         [k] do_syscall_64
>>>>     1.06%  sched-messaging  [kernel.vmlinux]         [k] select_task_rq_fair
>>>>     1.03%  swapper          [kernel.vmlinux]         [k] update_cfs_group
>>>>     1.00%  swapper          [kernel.vmlinux]         [k] rb_insert_color
>>>>
>>>> # Base + reverted_new_mm_cid
>>>> Threads:
>>>> Total time: 7.847 [sec]
>>>>
>>>>    12.14%  sched-messaging  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>>>     8.86%  swapper          [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
>>>>     6.13%  sched-messaging  [kernel.vmlinux]      [k] copyin
>>>>     5.54%  sched-messaging  [kernel.vmlinux]      [k] apparmor_file_permission
>>>>     3.59%  sched-messaging  [kernel.vmlinux]      [k] copyout
>>>>     2.61%  sched-messaging  [kernel.vmlinux]      [k] osq_lock
>>>>     2.48%  sched-messaging  [kernel.vmlinux]      [k] pipe_write
>>>>     2.33%  sched-messaging  [kernel.vmlinux]      [k] exit_to_user_mode_prepare
>>>>     2.01%  sched-messaging  [kernel.vmlinux]      [k] check_preemption_disabled
>>>>     1.96%  sched-messaging  [kernel.vmlinux]      [k] __entry_text_start
>>>>     1.91%  sched-messaging  libc.so.6             [.] write
>>>>     1.77%  sched-messaging  libc.so.6             [.] read
>>>>     1.64%  sched-messaging  [kernel.vmlinux]      [k] mutex_spin_on_owner
>>>>     1.58%  sched-messaging  [kernel.vmlinux]      [k] pipe_read
>>>>     1.52%  sched-messaging  [kernel.vmlinux]      [k] try_to_wake_up
>>>>     1.38%  sched-messaging  [kernel.vmlinux]      [k] ktime_get_coarse_real_ts64
>>>>     1.35%  sched-messaging  [kernel.vmlinux]      [k] vfs_write
>>>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] entry_SYSRETQ_unsafe_stack
>>>>     1.28%  sched-messaging  [kernel.vmlinux]      [k] vfs_read
>>>>     1.25%  sched-messaging  [kernel.vmlinux]      [k] do_syscall_64
>>>>     1.22%  sched-messaging  [kernel.vmlinux]      [k] __fget_light
>>>>     1.18%  sched-messaging  [kernel.vmlinux]      [k] mutex_lock
>>>>     1.12%  sched-messaging  [kernel.vmlinux]      [k] file_update_time
>>>>     1.04%  sched-messaging  [kernel.vmlinux]      [k] _copy_from_iter
>>>>     1.01%  sched-messaging  [kernel.vmlinux]      [k] current_time
>>>>
>>>> So with the reverted new_mm_cid patch, we are seeing a lot of time being spent in
>>>> native_queued_spin_lock_slowpath and yet, hackbench finishes faster.
>>>>
>>>> I keep further digging into this please let me know if you have any pointers for me.
>>>
>>> Do you have CONFIG_SECURITY_APPARMOR=y ? Can you try without ?
>>>
>> Sorry for the delay in response. My system was busy running some workloads. I tried
>> running hackbench disabling apparmor, looks like apparmor is not the culprit here.
>> Below are the results with apparmor disabled:
>>
>> Test:                   Base            Base + Reverted_new_mmcid Base+Apparmour_disabled
>>   1-groups:         2.81 (0.00 pct)         2.79 (0.71 pct) 2.79 (0.71 pct)
>>   2-groups:         3.25 (0.00 pct)         3.25 (0.00 pct) 3.20 (1.53 pct)
>>   4-groups:         3.44 (0.00 pct)         3.28 (4.65 pct) 3.43 (0.29 pct)
>>   8-groups:         3.52 (0.00 pct)         3.42 (2.84 pct) 3.53 (-0.28 pct)
>> 16-groups:         5.65 (0.00 pct)         4.52 (20.00 pct) 5.67 (-0.35 pct)
>
> Can you provide the kernel config file associated with this
> test ? I would also need to see ibs profiles showing the
> functions using most cpu, especially spinlocks and their
> callers.
>
I have attached the kernel config with this mail.

> My working hypothesis is that adding the rseq-mm-cid spinlock
> in the scheduler improves performances of your benchmark because
> it lessens the contention on _another_ lock somewhere else.
>
> Note that we've just received a brand new 2 sockets,
> 96 cores/socket AMD machine at EfficiOS. We've bought it to
> increase our coverage of scalability testing. With this I should
> be able to reproduce those regressions on my end, which should
> facilitate the investigation.
>
> Thanks!
>
> Mathieu
>
>
>>
>> Thanks,
>> Swapnil
>>
>>> I notice that apparmor_file_permission appears near the top of your
>>> profiles, and apparmor uses an internal aa_buffers_lock spinlock,
>>> which could possibly explain the top hits for
>>> native_queued_spin_lock_slowpath. My current suspicion is that
>>> the raw spinlock that was taken by "Base + reverted_new_mm_cid"
>>> changed the contention pattern on the apparmor lock enough to
>>> speed things up by pure accident.
>>>
>>> Thanks,
>>>
>>> Mathieu
>>>
>>>
>>>>
>>>>>
>>>>> This patch has recently been merged into the mm tree.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Mathieu
>>>>>
>>>> --
>>>> Thanks and Regards,
>>>> Swapnil
>>>
>
--
Thanks and regards,
Swapnil


Attachments:
mmcid.config (271.09 kB)