2024-04-16 17:51:38

by Jesper Dangaard Brouer

Subject: [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes

This patchset is focused on the global cgroup_rstat_lock.

Patch-1: Adds tracepoints to improve measuring lock behavior.
Patch-2: Converts the global lock into a mutex.
Patch-3: Limits userspace triggered pressure on the lock.

Background in discussion thread [1].
[1] https://lore.kernel.org/all/[email protected]/

---

Jesper Dangaard Brouer (3):
cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints
cgroup/rstat: convert cgroup_rstat_lock back to mutex
cgroup/rstat: introduce ratelimited rstat flushing


block/blk-cgroup.c | 2 +-
include/linux/cgroup-defs.h | 1 +
include/linux/cgroup.h | 5 +-
include/trace/events/cgroup.h | 48 +++++++++++++++
kernel/cgroup/rstat.c | 111 ++++++++++++++++++++++++++++++----
mm/memcontrol.c | 1 +
6 files changed, 153 insertions(+), 15 deletions(-)

--




2024-04-16 17:52:16

by Jesper Dangaard Brouer

Subject: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints

This commit enhances the ability to troubleshoot the global
cgroup_rstat_lock by introducing wrapper helper functions for the lock
along with associated tracepoints.

Although the lock is global, the cgroup_rstat_lock helper APIs and tracepoints
take arguments such as the cgroup pointer and a cpu_in_loop variable. This
adjustment is made because flushing occurs per cgroup despite the lock
being global. Hence, when troubleshooting, it's important to identify the
relevant cgroup. The cpu_in_loop variable is necessary because the global
lock may be released within the main flushing loop that traverses CPUs.
In the tracepoints, the cpu_in_loop value is set to -1 when acquiring the
main lock; otherwise, it denotes the CPU number processed last.
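
As an illustrative sketch of using these arguments (not something the
patch requires), contention events can be broken down per cgroup id and
per cpu_in_loop value with bpftrace:

bpftrace -e '
tracepoint:cgroup:cgroup_rstat_lock_contended {
  @contended_id[args->id] = count();
  @contended_cpu[args->cpu_in_loop] = count(); }'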

The new feature in this patchset is detecting when the lock is contended.
The tracepoints are implemented with production use in mind. For minimum
overhead, attach to cgroup:cgroup_rstat_lock_contended, which only fires
when trylock detects the lock is contended. A quick production check for
issues can be done via this perf command:

perf record -g -e cgroup:cgroup_rstat_lock_contended
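
The resulting recording can then be inspected with standard perf tooling
to see which code paths hit contention, e.g.:

perf report --stdio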

The next natural question is how long lock contenders wait to obtain the
lock. This can be answered by measuring the time between
cgroup:cgroup_rstat_lock_contended and cgroup:cgroup_rstat_locked when
args->contended is set. Like this bpftrace script:

bpftrace -e '
tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs}
tracepoint:cgroup:cgroup_rstat_locked {
if (args->contended) {
@wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}}
interval:s:1 {time("%H:%M:%S "); print(@wait_ns); }'

Extending this to also measure the time spent holding the lock is more
expensive, as it covers all the non-contended cases as well.
Like this bpftrace script:

bpftrace -e '
tracepoint:cgroup:cgroup_rstat_lock_contended {@start[tid]=nsecs}
tracepoint:cgroup:cgroup_rstat_locked { @locked[tid]=nsecs;
if (args->contended) {
@wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}}
tracepoint:cgroup:cgroup_rstat_unlock {
@locked_ns=hist(nsecs-@locked[tid]); delete(@locked[tid]);}
interval:s:1 {time("%H:%M:%S "); print(@wait_ns);print(@locked_ns); }'

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
---
include/linux/cgroup.h | 2 +-
include/trace/events/cgroup.h | 48 +++++++++++++++++++++++++++++++++++++++++
kernel/cgroup/rstat.c | 47 +++++++++++++++++++++++++++++++++-------
3 files changed, 88 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 34aaf0e87def..2150ca60394b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -690,7 +690,7 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
void cgroup_rstat_flush(struct cgroup *cgrp);
void cgroup_rstat_flush_hold(struct cgroup *cgrp);
-void cgroup_rstat_flush_release(void);
+void cgroup_rstat_flush_release(struct cgroup *cgrp);

/*
* Basic resource stats.
diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index dd7d7c9efecd..13f375800135 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -204,6 +204,54 @@ DEFINE_EVENT(cgroup_event, cgroup_notify_frozen,
TP_ARGS(cgrp, path, val)
);

+DECLARE_EVENT_CLASS(cgroup_rstat,
+
+ TP_PROTO(struct cgroup *cgrp, int cpu_in_loop, bool contended),
+
+ TP_ARGS(cgrp, cpu_in_loop, contended),
+
+ TP_STRUCT__entry(
+ __field( int, root )
+ __field( int, level )
+ __field( u64, id )
+ __field( int, cpu_in_loop )
+ __field( bool, contended )
+ ),
+
+ TP_fast_assign(
+ __entry->root = cgrp->root->hierarchy_id;
+ __entry->id = cgroup_id(cgrp);
+ __entry->level = cgrp->level;
+ __entry->cpu_in_loop = cpu_in_loop;
+ __entry->contended = contended;
+ ),
+
+ TP_printk("root=%d id=%llu level=%d cpu_in_loop=%d lock contended:%d",
+ __entry->root, __entry->id, __entry->level,
+ __entry->cpu_in_loop, __entry->contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_lock_contended,
+
+ TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+ TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_locked,
+
+ TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+ TP_ARGS(cgrp, cpu, contended)
+);
+
+DEFINE_EVENT(cgroup_rstat, cgroup_rstat_unlock,
+
+ TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
+
+ TP_ARGS(cgrp, cpu, contended)
+);
+
#endif /* _TRACE_CGROUP_H */

/* This part must be outside protection */
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 07e2284bb499..ff68c904e647 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -7,6 +7,8 @@
#include <linux/btf.h>
#include <linux/btf_ids.h>

+#include <trace/events/cgroup.h>
+
static DEFINE_SPINLOCK(cgroup_rstat_lock);
static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);

@@ -222,6 +224,35 @@ __weak noinline void bpf_rstat_flush(struct cgroup *cgrp,

__bpf_hook_end();

+/*
+ * Helper functions for locking cgroup_rstat_lock.
+ *
+ * This makes it easier to diagnose locking issues and contention in
+ * production environments. The parameter @cpu_in_loop indicates the lock
+ * was released and re-taken while collecting data from the CPUs. The
+ * value -1 is used when obtaining the main lock; otherwise it is the CPU
+ * number processed last.
+ */
+static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
+ __acquires(&cgroup_rstat_lock)
+{
+ bool contended;
+
+ contended = !spin_trylock_irq(&cgroup_rstat_lock);
+ if (contended) {
+ trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
+ spin_lock_irq(&cgroup_rstat_lock);
+ }
+ trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
+}
+
+static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
+ __releases(&cgroup_rstat_lock)
+{
+ trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
+ spin_unlock_irq(&cgroup_rstat_lock);
+}
+
/* see cgroup_rstat_flush() */
static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
__releases(&cgroup_rstat_lock) __acquires(&cgroup_rstat_lock)
@@ -248,10 +279,10 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)

/* play nice and yield if necessary */
if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
- spin_unlock_irq(&cgroup_rstat_lock);
+ __cgroup_rstat_unlock(cgrp, cpu);
if (!cond_resched())
cpu_relax();
- spin_lock_irq(&cgroup_rstat_lock);
+ __cgroup_rstat_lock(cgrp, cpu);
}
}
}
@@ -273,9 +304,9 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
{
might_sleep();

- spin_lock_irq(&cgroup_rstat_lock);
+ __cgroup_rstat_lock(cgrp, -1);
cgroup_rstat_flush_locked(cgrp);
- spin_unlock_irq(&cgroup_rstat_lock);
+ __cgroup_rstat_unlock(cgrp, -1);
}

/**
@@ -291,17 +322,17 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
__acquires(&cgroup_rstat_lock)
{
might_sleep();
- spin_lock_irq(&cgroup_rstat_lock);
+ __cgroup_rstat_lock(cgrp, -1);
cgroup_rstat_flush_locked(cgrp);
}

/**
* cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
*/
-void cgroup_rstat_flush_release(void)
+void cgroup_rstat_flush_release(struct cgroup *cgrp)
__releases(&cgroup_rstat_lock)
{
- spin_unlock_irq(&cgroup_rstat_lock);
+ __cgroup_rstat_unlock(cgrp, -1);
}

int cgroup_rstat_init(struct cgroup *cgrp)
@@ -533,7 +564,7 @@ void cgroup_base_stat_cputime_show(struct seq_file *seq)
#ifdef CONFIG_SCHED_CORE
forceidle_time = cgrp->bstat.forceidle_sum;
#endif
- cgroup_rstat_flush_release();
+ cgroup_rstat_flush_release(cgrp);
} else {
root_cgroup_cputime(&bstat);
usage = bstat.cputime.sum_exec_runtime;



2024-04-16 17:52:26

by Jesper Dangaard Brouer

Subject: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex

Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock,
as introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
with a spinlock").

Despite efforts in cgroup_rstat_flush_locked() to yield the lock when
necessary during the collection of per-CPU stats, this approach has led
to several scaling issues observed in production environments. Holding
this IRQ lock has caused starvation of other critical kernel functions,
such as softirq (e.g., timers and netstack). Although kernel v6.8
introduced optimizations in this area, we continue to observe instances
where the spin_lock is held for 64-128 ms in production.

This patch converts cgroup_rstat_lock back to being a mutex lock. This
change is made possible thanks to the significant effort by Yosry Ahmed
to eliminate all atomic context use-cases through multiple commits,
ending in 0a2dc6ac3329 ("cgroup: remove cgroup_rstat_flush_atomic()"),
included in kernel v6.5.

After this patch lock contention will be less obvious, as converting this
to a mutex avoids multiple CPUs spinning while waiting for the lock, but
it doesn't remove the lock contention. It is recommended to use the
tracepoints to diagnose this.
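
For example, contention events per second can be counted with the
tracepoints added in the previous patch (an illustrative sketch, not
something this patch depends on):

bpftrace -e '
tracepoint:cgroup:cgroup_rstat_lock_contended { @contended = count(); }
interval:s:1 { time("%H:%M:%S "); print(@contended); clear(@contended); }'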

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
---
kernel/cgroup/rstat.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index ff68c904e647..a90d68a7c27f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -9,7 +9,7 @@

#include <trace/events/cgroup.h>

-static DEFINE_SPINLOCK(cgroup_rstat_lock);
+static DEFINE_MUTEX(cgroup_rstat_lock);
static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);

static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -238,10 +238,10 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
{
bool contended;

- contended = !spin_trylock_irq(&cgroup_rstat_lock);
+ contended = !mutex_trylock(&cgroup_rstat_lock);
if (contended) {
trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
- spin_lock_irq(&cgroup_rstat_lock);
+ mutex_lock(&cgroup_rstat_lock);
}
trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
}
@@ -250,7 +250,7 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
__releases(&cgroup_rstat_lock)
{
trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
- spin_unlock_irq(&cgroup_rstat_lock);
+ mutex_unlock(&cgroup_rstat_lock);
}

/* see cgroup_rstat_flush() */
@@ -278,7 +278,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
}

/* play nice and yield if necessary */
- if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
+ if (need_resched()) {
__cgroup_rstat_unlock(cgrp, cpu);
if (!cond_resched())
cpu_relax();



2024-04-16 17:52:45

by Jesper Dangaard Brouer

Subject: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing

This patch aims to reduce userspace-triggered pressure on the global
cgroup_rstat_lock by introducing a mechanism to limit how often reading
stat files causes cgroup rstat flushing.

In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
mem_cgroup_flush_stats_ratelimited() already limits pressure on the
global lock (cgroup_rstat_lock). As a result, reading memory-related stat
files (such as memory.stat, memory.numa_stat, zswap.current) is already
less of a userspace-triggerable issue.

However, other userspace users of cgroup_rstat_flush(), such as when
reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
limit pressure on the global lock. Furthermore, userspace can easily
trigger this issue by reading those stat files.

Typical userspace stats tools (e.g., cadvisor, nomad, systemd) spawn
threads that read io.stat, cpu.stat, and memory.stat (even from the same
cgroup) without realizing that on the kernel side they share the same
global lock. The rate limiting also helps prevent malicious userspace
applications from harming the kernel by reading these stat files in a
tight loop.

To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
similar to memcg's mem_cgroup_flush_stats_ratelimited().

Flushing occurs per cgroup (even though the lock remains global), so a
variable named rstat_flush_last_time is introduced to track when a given
cgroup was last flushed. This variable, which holds the jiffies timestamp
of the flush, shares properties and a cache line with rstat_flush_next and
is updated at the same time.

For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
because other data is read under the lock, but we skip the expensive
flushing if it occurred recently.
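
As a simplified illustration (based on cgroup_base_stat_cputime_show();
other bstat fields and the root-cgroup path omitted), the cpu.stat read
side keeps the hold/release pattern while the flush itself may be skipped:

	cgroup_rstat_flush_hold(cgrp);	/* may skip flush if done within FLUSH_TIME */
	usage = cgrp->bstat.cputime.sum_exec_runtime; /* still read under the lock */
	cgroup_rstat_flush_release(cgrp);	/* drops cgroup_rstat_lock */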

Regarding io.stat, there is an opportunity outside the lock to skip the
flush, but inside the lock, we must recheck to handle races.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
---
block/blk-cgroup.c | 2 +
include/linux/cgroup-defs.h | 1 +
include/linux/cgroup.h | 3 +-
kernel/cgroup/rstat.c | 60 ++++++++++++++++++++++++++++++++++++++++++-
mm/memcontrol.c | 1 +
5 files changed, 63 insertions(+), 4 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bdbb557feb5a..4156fedbb910 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1162,7 +1162,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
if (!seq_css(sf)->parent)
blkcg_fill_root_iostats();
else
- cgroup_rstat_flush(blkcg->css.cgroup);
+ cgroup_rstat_flush_ratelimited(blkcg->css.cgroup);

rcu_read_lock();
hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ea48c861cd36..366dc644e075 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -509,6 +509,7 @@ struct cgroup {
* cgroup_rstat_flush_locked() and protected by cgroup_rstat_lock.
*/
struct cgroup *rstat_flush_next;
+ unsigned long rstat_flush_last_time;

/* cgroup basic resource statistics */
struct cgroup_base_stat last_bstat;
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 2150ca60394b..8622b222453e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -689,7 +689,8 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
*/
void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_hold(struct cgroup *cgrp);
+int cgroup_rstat_flush_hold(struct cgroup *cgrp);
+int cgroup_rstat_flush_ratelimited(struct cgroup *cgrp);
void cgroup_rstat_flush_release(struct cgroup *cgrp);

/*
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index a90d68a7c27f..8c71af67b197 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -193,6 +193,7 @@ static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
/* Push @root to the list first before pushing the children */
head = root;
root->rstat_flush_next = NULL;
+ root->rstat_flush_last_time = jiffies;
child = rstatc->updated_children;
rstatc->updated_children = root;
if (child != root)
@@ -261,12 +262,15 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)

lockdep_assert_held(&cgroup_rstat_lock);

+ cgrp->rstat_flush_last_time = jiffies;
+
for_each_possible_cpu(cpu) {
struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);

for (; pos; pos = pos->rstat_flush_next) {
struct cgroup_subsys_state *css;

+ pos->rstat_flush_last_time = jiffies;
cgroup_base_stat_flush(pos, cpu);
bpf_rstat_flush(pos, cgroup_parent(pos), cpu);

@@ -309,6 +313,49 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
__cgroup_rstat_unlock(cgrp, -1);
}

+#define FLUSH_TIME msecs_to_jiffies(50)
+
+/**
+ * cgroup_rstat_flush_ratelimited - limit pressure on global lock (cgroup_rstat_lock)
+ * @cgrp: target cgroup
+ *
+ * Trade accuracy for less pressure on the global lock. Only use this when
+ * the caller doesn't need 100% up-to-date stats.
+ *
+ * Userspace stats tools spawn threads reading io.stat, cpu.stat and
+ * memory.stat from the same cgroup, but on the kernel side they share the
+ * same global lock. This also limits malicious userspace from hurting the
+ * kernel by reading these stat files in a tight loop.
+ *
+ * This function exits early if a flush happened recently.
+ *
+ * Returns 1 if a flush happened
+ */
+int cgroup_rstat_flush_ratelimited(struct cgroup *cgrp)
+{
+ int res = 0;
+
+ might_sleep();
+
+ /* Outside lock: check if this flush and lock can be skipped */
+ if (time_before(jiffies,
+ READ_ONCE(cgrp->rstat_flush_last_time) + FLUSH_TIME))
+ return 0;
+
+ __cgroup_rstat_lock(cgrp, -1);
+
+ /* Recheck inside lock, do flush only after enough time has passed */
+ if (time_after(jiffies,
+ cgrp->rstat_flush_last_time + FLUSH_TIME)) {
+ cgroup_rstat_flush_locked(cgrp);
+ res = 1;
+ }
+
+ __cgroup_rstat_unlock(cgrp, -1);
+
+ return res;
+}
+
/**
* cgroup_rstat_flush_hold - flush stats in @cgrp's subtree and hold
* @cgrp: target cgroup
@@ -317,13 +364,22 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
* paired with cgroup_rstat_flush_release().
*
* This function may block.
+ *
+ * Returns 1 if a flush happened
*/
-void cgroup_rstat_flush_hold(struct cgroup *cgrp)
+int cgroup_rstat_flush_hold(struct cgroup *cgrp)
__acquires(&cgroup_rstat_lock)
{
might_sleep();
__cgroup_rstat_lock(cgrp, -1);
- cgroup_rstat_flush_locked(cgrp);
+
+ /* Only do the expensive flush after enough time has passed */
+ if (time_after(jiffies,
+ cgrp->rstat_flush_last_time + FLUSH_TIME)) {
+ cgroup_rstat_flush_locked(cgrp);
+ return 1;
+ }
+ return 0;
}

/**
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index fabce2b50c69..771261612ec1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -742,6 +742,7 @@ static void do_flush_stats(struct mem_cgroup *memcg)
if (mem_cgroup_is_root(memcg))
WRITE_ONCE(flush_last_time, jiffies_64);

+ /* memcg have own ratelimit layer */
cgroup_rstat_flush(memcg->css.cgroup);
}




2024-04-16 21:39:52

by Tejun Heo

Subject: Re: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints

On Tue, Apr 16, 2024 at 07:51:26PM +0200, Jesper Dangaard Brouer wrote:
> This commit enhances the ability to troubleshoot the global
> cgroup_rstat_lock by introducing wrapper helper functions for the lock
> along with associated tracepoints.

Applied to cgroup/for-6.10.

Thanks.

--
tejun

2024-04-16 21:53:04

by Tejun Heo

Subject: Re: [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes

On Tue, Apr 16, 2024 at 07:51:19PM +0200, Jesper Dangaard Brouer wrote:
> This patchset is focused on the global cgroup_rstat_lock.
>
> Patch-1: Adds tracepoints to improve measuring lock behavior.
> Patch-2: Converts the global lock into a mutex.
> Patch-3: Limits userspace triggered pressure on the lock.

Imma wait for people's inputs on patch 2 and 3. ISTR switching the lock to
mutex made some tail latencies really bad for some workloads at google?
Yosry, was that you?

Thanks.

--
tejun

2024-04-18 02:14:33

by Yosry Ahmed

Subject: Re: [PATCH v1 0/3] cgroup/rstat: global cgroup_rstat_lock changes

On Tue, Apr 16, 2024 at 2:38 PM Tejun Heo <[email protected]> wrote:
>
> On Tue, Apr 16, 2024 at 07:51:19PM +0200, Jesper Dangaard Brouer wrote:
> > This patchset is focused on the global cgroup_rstat_lock.
> >
> > Patch-1: Adds tracepoints to improve measuring lock behavior.
> > Patch-2: Converts the global lock into a mutex.
> > Patch-3: Limits userspace triggered pressure on the lock.
>
> Imma wait for people's inputs on patch 2 and 3. ISTR switching the lock to
> mutex made some tail latencies really bad for some workloads at google?
> Yosry, was that you?

I spent some time going through the history of my previous patchsets
to find context.

There were two separate instances where concerns were raised about
using a mutex.

(a) Converting the global rstat spinlock to a mutex:

Shakeel had concerns about priority inversion with a global sleepable
lock. So I never actually tested replacing the spinlock with a mutex
based on Shakeel's concerns as priority inversions would be difficult
to reproduce with synthetic tests.

Generally speaking, other than priority inversions, I was depending on
Wei's synthetic test to measure performance for userspace reads, and a
script I wrote with parallel reclaimers to measure performance for
in-kernel flushers.

(b) Adding a mutex on top of the global rstat spinlock for userspace
reads (to limit contention from userspace on the in-kernel lock):

Wei reported that this significantly affects userspace read latency
[2]. I then proceeded to add per-memcg thresholds for flushing, which
resulted in the regressions from that mutex going away. However, at
that point the mutex didn't really provide much value, so I removed it
[3].

[1]https://lore.kernel.org/lkml/CALvZod441xBoXzhqLWTZ+xnqDOFkHmvrzspr9NAr+nybqXgS-A@mail.gmail.com/
[2]https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
[3]https://lore.kernel.org/lkml/CAJD7tkZgP3m-VVPn+fF_YuvXeQYK=tZZjJHj=dzD=CcSSpp2qg@mail.gmail.com/

2024-04-18 02:20:18

by Yosry Ahmed

Subject: Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex

On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <[email protected]> wrote:
>
> Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock,
> as introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
> with a spinlock").
>
> Despite efforts in cgroup_rstat_flush_locked() to yield the lock when
> necessary during the collection of per-CPU stats, this approach has led
> to several scaling issues observed in production environments. Holding
> this IRQ lock has caused starvation of other critical kernel functions,
> such as softirq (e.g., timers and netstack). Although kernel v6.8
> introduced optimizations in this area, we continue to observe instances
> where the spin_lock is held for 64-128 ms in production.
>
> This patch converts cgroup_rstat_lock back to being a mutex lock. This
> change is made possible thanks to the significant effort by Yosry Ahmed
> to eliminate all atomic context use-cases through multiple commits,
> ending in 0a2dc6ac3329 ("cgroup: removecgroup_rstat_flush_atomic()"),
> included in kernel v6.5.
>
> After this patch lock contention will be less obvious, as converting this
> to a mutex avoids multiple CPUs spinning while waiting for the lock, but
> it doesn't remove the lock contention. It is recommended to use the
> tracepoints to diagnose this.

I will keep the high-level conversation about using the mutex here in
the cover letter thread, but I am wondering why we are keeping the
lock dropping logic here with the mutex?

If this is to reduce lock contention, why does it depend on
need_resched()? spin_needbreak() is a good indicator for lock
contention, but need_resched() isn't, right?

Also, how was this tested?

When I did previous changes to the flushing logic I used to make sure
that userspace read latency was not impacted, as well as in-kernel
flushers (e.g. reclaim). We should make sure there are no regressions
on both fronts.

>
> Signed-off-by: Jesper Dangaard Brouer <[email protected]>
> ---
> kernel/cgroup/rstat.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index ff68c904e647..a90d68a7c27f 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -9,7 +9,7 @@
>
> #include <trace/events/cgroup.h>
>
> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
> +static DEFINE_MUTEX(cgroup_rstat_lock);
> static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
>
> static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
> @@ -238,10 +238,10 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
> {
> bool contended;
>
> - contended = !spin_trylock_irq(&cgroup_rstat_lock);
> + contended = !mutex_trylock(&cgroup_rstat_lock);
> if (contended) {
> trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
> - spin_lock_irq(&cgroup_rstat_lock);
> + mutex_lock(&cgroup_rstat_lock);
> }
> trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
> }
> @@ -250,7 +250,7 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
> __releases(&cgroup_rstat_lock)
> {
> trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
> - spin_unlock_irq(&cgroup_rstat_lock);
> + mutex_unlock(&cgroup_rstat_lock);
> }
>
> /* see cgroup_rstat_flush() */
> @@ -278,7 +278,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
> }
>
> /* play nice and yield if necessary */
> - if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
> + if (need_resched()) {
> __cgroup_rstat_unlock(cgrp, cpu);
> if (!cond_resched())
> cpu_relax();

2024-04-18 02:22:25

by Yosry Ahmed

Subject: Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing

On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <[email protected]> wrote:
>
> This patch aims to reduce userspace-triggered pressure on the global
> cgroup_rstat_lock by introducing a mechanism to limit how often reading
> stat files causes cgroup rstat flushing.
>
> In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
> mem_cgroup_flush_stats_ratelimited() already limits pressure on the
> global lock (cgroup_rstat_lock). As a result, reading memory-related stat
> files (such as memory.stat, memory.numa_stat, zswap.current) is already
> a less userspace-triggerable issue.
>
> However, other userspace users of cgroup_rstat_flush(), such as when
> reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
> limit pressure on the global lock. Furthermore, userspace can easily
> trigger this issue by reading those stat files.
>
> Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
> spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
> same cgroup) without realizing that on the kernel side, they share the
> same global lock. This limitation also helps prevent malicious userspace
> applications from harming the kernel by reading these stat files in a
> tight loop.
>
> To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
> similar to memcg's mem_cgroup_flush_stats_ratelimited().
>
> Flushing occurs per cgroup (even though the lock remains global) a
> variable named rstat_flush_last_time is introduced to track when a given
> cgroup was last flushed. This variable, which contains the jiffies of the
> flush, shares properties and a cache line with rstat_flush_next and is
> updated simultaneously.
>
> For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
> because other data is read under the lock, but we skip the expensive
> flushing if it occurred recently.
>
> Regarding io.stat, there is an opportunity outside the lock to skip the
> flush, but inside the lock, we must recheck to handle races.
>
> Signed-off-by: Jesper Dangaard Brouer <[email protected]>

As I mentioned in another thread, I really don't like time-based
rate-limiting [1]. Would it be possible to generalize the
magnitude-based rate-limiting instead? Have something like
memcg_vmstats_needs_flush() in the core rstat code?

Also, why do we keep the memcg time rate-limiting with this patch? Is
it because we use a much larger window there (2s)? Having two layers
of time-based rate-limiting is not ideal imo.

[1]https://lore.kernel.org/lkml/CAJD7tkYnSRwJTpXxSnGgo-i3-OdD7cdT-e3_S_yf7dSknPoRKw@mail.gmail.com/

2024-04-18 08:01:10

by Jesper Dangaard Brouer

Subject: Re: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints


On 16/04/2024 23.36, Tejun Heo wrote:
> On Tue, Apr 16, 2024 at 07:51:26PM +0200, Jesper Dangaard Brouer wrote:
>> This commit enhances the ability to troubleshoot the global
>> cgroup_rstat_lock by introducing wrapper helper functions for the lock
>> along with associated tracepoints.
>
> Applied to cgroup/for-6.10.
>

Thanks for applying the tracepoint patch. I've backported this to our
main production kernels, v6.6 LTS (with the previously mentioned upstream
cgroup work from Yosry and Longman). I have it running in production on
two machines this morning. I'm doing manual bpftrace script inspection
now, but the plan is to monitor this continuously (ebpf_exporter[1]) and
even have alerts on excessive wait time under contention.

It makes sense to delay applying the next two patches until we have some
production experience with them and I have fleet monitoring in place.
I'll be offline next week (on a dive trip), so I'll resume work on this on
29 April, before I start doing prod experiments.

--Jesper
[1] https://github.com/cloudflare/ebpf_exporter

2024-04-18 09:02:49

by Jesper Dangaard Brouer

Subject: Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex



On 18/04/2024 04.19, Yosry Ahmed wrote:
> On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <[email protected]> wrote:
>>
>> Since kernel v4.18, cgroup_rstat_lock has been an IRQ-disabling spinlock,
>> as introduced by commit 0fa294fb1985 ("cgroup: Replace cgroup_rstat_mutex
>> with a spinlock").
>>
>> Despite efforts in cgroup_rstat_flush_locked() to yield the lock when
>> necessary during the collection of per-CPU stats, this approach has led
>> to several scaling issues observed in production environments. Holding
>> this IRQ lock has caused starvation of other critical kernel functions,
>> such as softirq (e.g., timers and netstack). Although kernel v6.8
>> introduced optimizations in this area, we continue to observe instances
>> where the spin_lock is held for 64-128 ms in production.
>>
>> This patch converts cgroup_rstat_lock back to being a mutex lock. This
>> change is made possible thanks to the significant effort by Yosry Ahmed
>> to eliminate all atomic context use-cases through multiple commits,
>> ending in 0a2dc6ac3329 ("cgroup: removecgroup_rstat_flush_atomic()"),
>> included in kernel v6.5.
>>
>> After this patch lock contention will be less obvious, as converting this
>> to a mutex avoids multiple CPUs spinning while waiting for the lock, but
>> it doesn't remove the lock contention. It is recommended to use the
>> tracepoints to diagnose this.
>
> I will keep the high-level conversation about using the mutex here in
> the cover letter thread, but I am wondering why we are keeping the
> lock dropping logic here with the mutex?
>

I agree that yielding the mutex in the loop makes less sense.
Especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
will be a preemption point for my softirq. But I kept it because, we
are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I still worried that
there was no sched point for other userspace processes while holding the
mutex, but I don't fully know the sched implication when holding a mutex.


> If this is to reduce lock contention, why does it depend on
> need_resched()? spin_needbreak() is a good indicator for lock
> contention, but need_resched() isn't, right?
>

As I said, I'm unsure of the semantics of holding a mutex.


> Also, how was this tested?
>

I tested this in a test lab, prior to posting upstream, with parallel
readers of the stat files. As I said in the other mail, I plan to
experiment with these patches (2+3) in production, as micro-benchmarking
will not reveal the corner cases we care about. With BPF-based
measurements of the lock congestion time, I hope we can catch production
issues at a time scale that happens prior to user-visible impact.


> When I did previous changes to the flushing logic I used to make sure
> that userspace read latency was not impacted, as well as in-kernel
> flushers (e.g. reclaim). We should make sure there are no regressions
> on both fronts.
>

Agreed, we should consider both userspace readers and in-kernel flushers.
Maybe these need separate handling, as they have separate needs.

>>
>> Signed-off-by: Jesper Dangaard Brouer <[email protected]>
>> ---
>> kernel/cgroup/rstat.c | 10 +++++-----
>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>> index ff68c904e647..a90d68a7c27f 100644
>> --- a/kernel/cgroup/rstat.c
>> +++ b/kernel/cgroup/rstat.c
>> @@ -9,7 +9,7 @@
>>
>> #include <trace/events/cgroup.h>
>>
>> -static DEFINE_SPINLOCK(cgroup_rstat_lock);
>> +static DEFINE_MUTEX(cgroup_rstat_lock);
>> static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
>>
>> static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
>> @@ -238,10 +238,10 @@ static inline void __cgroup_rstat_lock(struct cgroup *cgrp, int cpu_in_loop)
>> {
>> bool contended;
>>
>> - contended = !spin_trylock_irq(&cgroup_rstat_lock);
>> + contended = !mutex_trylock(&cgroup_rstat_lock);
>> if (contended) {
>> trace_cgroup_rstat_lock_contended(cgrp, cpu_in_loop, contended);
>> - spin_lock_irq(&cgroup_rstat_lock);
>> + mutex_lock(&cgroup_rstat_lock);
>> }
>> trace_cgroup_rstat_locked(cgrp, cpu_in_loop, contended);
>> }
>> @@ -250,7 +250,7 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
>> __releases(&cgroup_rstat_lock)
>> {
>> trace_cgroup_rstat_unlock(cgrp, cpu_in_loop, false);
>> - spin_unlock_irq(&cgroup_rstat_lock);
>> + mutex_unlock(&cgroup_rstat_lock);
>> }
>>
>> /* see cgroup_rstat_flush() */
>> @@ -278,7 +278,7 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
>> }
>>
>> /* play nice and yield if necessary */
>> - if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
>> + if (need_resched()) {
>> __cgroup_rstat_unlock(cgrp, cpu);
>> if (!cond_resched())
>> cpu_relax();

2024-04-18 11:03:28

by Jesper Dangaard Brouer

Subject: Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing



On 18/04/2024 04.21, Yosry Ahmed wrote:
> On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <[email protected]> wrote:
>>
>> This patch aims to reduce userspace-triggered pressure on the global
>> cgroup_rstat_lock by introducing a mechanism to limit how often reading
>> stat files causes cgroup rstat flushing.
>>
>> In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
>> mem_cgroup_flush_stats_ratelimited() already limits pressure on the
>> global lock (cgroup_rstat_lock). As a result, reading memory-related stat
>> files (such as memory.stat, memory.numa_stat, zswap.current) is already
>> a less userspace-triggerable issue.
>>
>> However, other userspace users of cgroup_rstat_flush(), such as when
>> reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
>> limit pressure on the global lock. Furthermore, userspace can easily
>> trigger this issue by reading those stat files.
>>
>> Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
>> spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
>> same cgroup) without realizing that on the kernel side, they share the
>> same global lock. This limitation also helps prevent malicious userspace
>> applications from harming the kernel by reading these stat files in a
>> tight loop.
>>
>> To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
>> similar to memcg's mem_cgroup_flush_stats_ratelimited().
>>
>> Flushing occurs per cgroup (even though the lock remains global) a
>> variable named rstat_flush_last_time is introduced to track when a given
>> cgroup was last flushed. This variable, which contains the jiffies of the
>> flush, shares properties and a cache line with rstat_flush_next and is
>> updated simultaneously.
>>
>> For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
>> because other data is read under the lock, but we skip the expensive
>> flushing if it occurred recently.
>>
>> Regarding io.stat, there is an opportunity outside the lock to skip the
>> flush, but inside the lock, we must recheck to handle races.
>>
>> Signed-off-by: Jesper Dangaard Brouer <[email protected]>
>
> As I mentioned in another thread, I really don't like time-based
> rate-limiting [1]. Would it be possible to generalize the
> magnitude-based rate-limiting instead? Have something like
> memcg_vmstats_needs_flush() in the core rstat code?
>

I've taken a closer look at memcg_vmstats_needs_flush(), and I'm concerned
about the overhead of maintaining the stats (which are used as a filter).

static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
{
return atomic64_read(&vmstats->stats_updates) >
MEMCG_CHARGE_BATCH * num_online_cpus();
}

I looked at `vmstats->stats_updates` to see how often this is getting
updated. It is updated in memcg_rstat_updated(), but it gets inlined
into a number of functions (__mod_memcg_state, __mod_memcg_lruvec_state,
__count_memcg_events), plus it calls cgroup_rstat_updated().
Counting invocations per sec (via funccount):

10:28:09
FUNC COUNT
__mod_memcg_state 377553
__count_memcg_events 393078
__mod_memcg_lruvec_state 1229673
cgroup_rstat_updated 2632389


I'm surprised to see how many times per sec this is getting invoked.
Originating from memcg_rstat_updated() = 2,000,304 times per sec.
(On a 128 CPU core machine with 39% idle CPU-load.)
Maintaining these stats seems excessive...

Then how often does the filter lower pressure on lock:

MEMCG_CHARGE_BATCH(64) * 128 CPU = 8192
2000304/(64*128) = 244 times per sec (every ~4ms)
(assuming memcg_rstat_updated val=1)


> Also, why do we keep the memcg time rate-limiting with this patch? Is
> it because we use a much larger window there (2s)? Having two layers
> of time-based rate-limiting is not ideal imo.
>

I'm also not a fan of having two layers of time-based rate-limiting, but
they do operate at different time scales *and* are not active at the same
time with the current patch. If you look at the details, I excluded memcg
from using this, as I commented "memcg have own ratelimit layer"
(in do_flush_stats).

I would prefer removing the memcg time rate-limiting and using this more
granular 50ms (20 times/sec) limit for memcg also, and I was planning to
do that in a follow-up patchset. The 50ms (20 times/sec) limit is per
cgroup in the system, so it "scales"/increases with the number of cgroups,
but that is still better than unbounded lock accesses per second.
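
A rough sketch of that follow-up idea (only an idea at this point, not a
posted patch) would be to let memcg's do_flush_stats() call the new
ratelimited variant directly:

	/* Sketch only: rely on the per-cgroup 50ms window from patch 3
	 * instead of memcg's own 2 sec ratelimit layer.
	 */
	static void do_flush_stats(struct mem_cgroup *memcg)
	{
		if (mem_cgroup_is_root(memcg))
			WRITE_ONCE(flush_last_time, jiffies_64);

		cgroup_rstat_flush_ratelimited(memcg->css.cgroup);
	}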

--Jesper


> [1]https://lore.kernel.org/lkml/CAJD7tkYnSRwJTpXxSnGgo-i3-OdD7cdT-e3_S_yf7dSknPoRKw@mail.gmail.com/


sudo ./bcc/tools/funccount -Ti 1 -d 10
'__mod_memcg_state|__mod_memcg_lruvec_state|__count_memcg_events|cgroup_rstat_updated'

2024-04-18 14:49:39

by Shakeel Butt

Subject: Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex

On Thu, Apr 18, 2024 at 11:02:06AM +0200, Jesper Dangaard Brouer wrote:
>
>
> On 18/04/2024 04.19, Yosry Ahmed wrote:
[...]
> >
> > I will keep the high-level conversation about using the mutex here in
> > the cover letter thread, but I am wondering why we are keeping the
> > lock dropping logic here with the mutex?
> >
>
> I agree that yielding the mutex in the loop makes less sense.
> Especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
> will be a preemption point for my softirq. But I kept it because, we
> are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I still worried that
> there was no sched point for other userspace processes while holding the
> mutex, but I don't fully know the sched implication when holding a mutex.
>

Are the softirqs you are interested in, raised from the same cpu or
remote cpu? What about local_softirq_pending() check in addition to
need_resched() and spin_needbreak() checks? If softirq can only be
raised on local cpu then convert the spin_lock to non-irq one (Please
correct me if I am wrong but on return from hard irq and not within bh
or irq disabled spin_lock, the kernel will run the pending softirqs,
right?). Did you get the chance to test these two changes or something
similar in your prod environment?
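
For reference, a minimal sketch of the local_softirq_pending() idea
(illustrative only, untested) applied to the yield condition in
cgroup_rstat_flush_locked() could look like:

	/* play nice and yield if necessary */
	if (need_resched() || spin_needbreak(&cgroup_rstat_lock) ||
	    local_softirq_pending()) {
		__cgroup_rstat_unlock(cgrp, cpu);
		if (!cond_resched())
			cpu_relax();
		__cgroup_rstat_lock(cgrp, cpu);
	}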


2024-04-18 15:50:52

by Shakeel Butt

Subject: Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing

On Thu, Apr 18, 2024 at 01:00:30PM +0200, Jesper Dangaard Brouer wrote:
>
>
> On 18/04/2024 04.21, Yosry Ahmed wrote:
> > On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer <[email protected]> wrote:
> > >
> > > This patch aims to reduce userspace-triggered pressure on the global
> > > cgroup_rstat_lock by introducing a mechanism to limit how often reading
> > > stat files causes cgroup rstat flushing.
> > >
> > > In the memory cgroup subsystem, memcg_vmstats_needs_flush() combined with
> > > mem_cgroup_flush_stats_ratelimited() already limits pressure on the
> > > global lock (cgroup_rstat_lock). As a result, reading memory-related stat
> > > files (such as memory.stat, memory.numa_stat, zswap.current) is already
> > > a less userspace-triggerable issue.
> > >
> > > However, other userspace users of cgroup_rstat_flush(), such as when
> > > reading io.stat (blk-cgroup.c) and cpu.stat, lack a similar system to
> > > limit pressure on the global lock. Furthermore, userspace can easily
> > > trigger this issue by reading those stat files.
> > >
> > > Typically, normal userspace stats tools (e.g., cadvisor, nomad, systemd)
> > > spawn threads that read io.stat, cpu.stat, and memory.stat (even from the
> > > same cgroup) without realizing that on the kernel side, they share the
> > > same global lock. This limitation also helps prevent malicious userspace
> > > applications from harming the kernel by reading these stat files in a
> > > tight loop.
> > >
> > > To address this, the patch introduces cgroup_rstat_flush_ratelimited(),
> > > similar to memcg's mem_cgroup_flush_stats_ratelimited().
> > >
> > > Flushing occurs per cgroup (even though the lock remains global) a
> > > variable named rstat_flush_last_time is introduced to track when a given
> > > cgroup was last flushed. This variable, which contains the jiffies of the
> > > flush, shares properties and a cache line with rstat_flush_next and is
> > > updated simultaneously.
> > >
> > > For cpu.stat, we need to acquire the lock (via cgroup_rstat_flush_hold)
> > > because other data is read under the lock, but we skip the expensive
> > > flushing if it occurred recently.
> > >
> > > Regarding io.stat, there is an opportunity outside the lock to skip the
> > > flush, but inside the lock, we must recheck to handle races.
> > >
> > > Signed-off-by: Jesper Dangaard Brouer <[email protected]>
> >
> > As I mentioned in another thread, I really don't like time-based
> > rate-limiting [1]. Would it be possible to generalize the
> > magnitude-based rate-limiting instead? Have something like
> > memcg_vmstats_needs_flush() in the core rstat code?
> >
>
> I've taken a closer look at memcg_vmstats_needs_flush(). And I'm
> concerned about overhead maintaining the stats (that is used as a filter).
>
> static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
> {
> return atomic64_read(&vmstats->stats_updates) >
> MEMCG_CHARGE_BATCH * num_online_cpus();
> }
>
> I looked at `vmstats->stats_updates` to see how often this is getting
> updated. It is updated in memcg_rstat_updated(), but it gets inlined into a
> number function (__mod_memcg_state, __mod_memcg_lruvec_state,
> __count_memcg_events), plus it calls cgroup_rstat_updated().
> Counting invocations per sec (via funccount):
>
> 10:28:09
> FUNC COUNT
> __mod_memcg_state 377553
> __count_memcg_events 393078
> __mod_memcg_lruvec_state 1229673
> cgroup_rstat_updated 2632389
>

Is it possible for you to also measure the frequency of the unique
callstacks calling these functions? In addition, the frequency of each
stat item update would be awesome.

>
> I'm surprised to see how many time per sec this is getting invoked.
> Originating from memcg_rstat_updated() = 2,000,304 times per sec.
> (On a 128 CPU core machine with 39% idle CPU-load.)
> Maintaining these stats seems excessive...
>
> Then how often does the filter lower pressure on lock:
>
> MEMCG_CHARGE_BATCH(64) * 128 CPU = 8192
> 2000304/(64*128) = 244 time per sec (every ~4ms)
> (assuming memcg_rstat_updated val=1)
>

It seems like we have opportunities to improve the stat update side and
we definitely need to improve the stat flush side. One issue from the
memcg side is that the kernel has to do a lot of work, so we should be
reducing that.

2024-04-18 20:40:34

by Yosry Ahmed

Subject: Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex

On Thu, Apr 18, 2024 at 7:49 AM Shakeel Butt <[email protected]> wrote:
>
> On Thu, Apr 18, 2024 at 11:02:06AM +0200, Jesper Dangaard Brouer wrote:
> >
> >
> > On 18/04/2024 04.19, Yosry Ahmed wrote:
> [...]
> > >
> > > I will keep the high-level conversation about using the mutex here in
> > > the cover letter thread, but I am wondering why we are keeping the
> > > lock dropping logic here with the mutex?
> > >
> >
> > I agree that yielding the mutex in the loop makes less sense.
> > Especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
> > will be a preemption point for my softirq. But I kept it because, we
> > are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I still worried that
> > there was no sched point for other userspace processes while holding the
> > mutex, but I don't fully know the sched implication when holding a mutex.
> >
>
> Are the softirqs you are interested in, raised from the same cpu or
> remote cpu? What about local_softirq_pending() check in addition to
> need_resched() and spin_needbreak() checks? If softirq can only be
> raised on local cpu then convert the spin_lock to non-irq one (Please
> correct me if I am wrong but on return from hard irq and not within bh
> or irq disabled spin_lock, the kernel will run the pending softirqs,
> right?). Did you get the chance to test these two changes or something
> similar in your prod environment?

I tried making the spinlock a non-irq lock before, but Tejun objected [1].

Perhaps we could experiment with always dropping the lock at CPU
boundaries instead?

[1]https://lore.kernel.org/lkml/ZBz%2FV5a7%[email protected]/
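
For concreteness, a rough sketch (untested, simplified; the per-subsystem
css_rstat_flush() calls are omitted) of dropping the lock at every CPU
boundary in cgroup_rstat_flush_locked() could look like:

	for_each_possible_cpu(cpu) {
		struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);

		for (; pos; pos = pos->rstat_flush_next) {
			cgroup_base_stat_flush(pos, cpu);
			bpf_rstat_flush(pos, cgroup_parent(pos), cpu);
		}

		/* always yield the global lock between CPUs */
		__cgroup_rstat_unlock(cgrp, cpu);
		if (!cond_resched())
			cpu_relax();
		__cgroup_rstat_lock(cgrp, cpu);
	}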

2024-04-18 21:15:48

by Tejun Heo

Subject: Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing

Hello, Yosry.

On Thu, Apr 18, 2024 at 02:00:28PM -0700, Yosry Ahmed wrote:
..
> I think this is an artifact of different subsystems sharing the same
> rstat tree for no specific reason. I think almost all flushing calls
> really need the stats from one subsystem after all.
>
> If we have separate trees, lock contention gets slightly better as
> different subsystems do not compete. We can also have different
> subsystems "customize" their trees, for e.g. by setting different
> time-based or magnitude-based rate-limiting thresholds.
>
> I know this is a bigger lift, just thinking out loud :)

I have no objection to separating out rstat trees so that it has
per-controller tracking. However, the high frequency source of updates are
cpu and memory, which tend to fire together, and the only really high
frequency consumer seems to be memory, so I'm not too sure how much benefit
separating the trees out would bring.

Thanks.

--
tejun

2024-04-18 21:23:50

by Yosry Ahmed

Subject: Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing

On Thu, Apr 18, 2024 at 2:15 PM Tejun Heo <[email protected]> wrote:
>
> Hello, Yosry.
>
> On Thu, Apr 18, 2024 at 02:00:28PM -0700, Yosry Ahmed wrote:
> ...
> > I think this is an artifact of different subsystems sharing the same
> > rstat tree for no specific reason. I think almost all flushing calls
> > really need the stats from one subsystem after all.
> >
> > If we have separate trees, lock contention gets slightly better as
> > different subsystems do not compete. We can also have different
> > subsystems "customize" their trees, for e.g. by setting different
> > time-based or magnitude-based rate-limiting thresholds.
> >
> > I know this is a bigger lift, just thinking out loud :)
>
> I have no objection to separating out rstat trees so that it has
> per-controller tracking. However, the high frequency source of updates are
> cpu and memory, which tend to fire together, and the only really high
> frequency consumer seems to be memory, so I'm not too sure how much benefit
> separating the trees out would bring.

Well, we could split the global lock into multiple ones, which isn't
really a solution, but it would help other controllers not to be
affected by the high frequency of flushing from the memory controller
(which has its own thresholding).

For updates, cpu and memory would use separate percpu locks as well,
which may help slightly.

Outside of this, I think it helps us add controller-specific
optimizations. For example, I tried to generalize the thresholding
code in the memory controller and put it in the rstat code, but I
couldn't really have a single value representing the "pending stats"
from all controllers. It's impossible to compare memory stats (mostly
in pages or bytes) to cpu time stats for instance.

Similarly, with this proposal from Jesper (which I am not saying I am
agreeing with :P), instead of having time-based ratelimiting in both
the rstat code and the memcg code to support different thresholds, we
could have the memory controller set a different threshold for itself.

So perhaps the lock breakdowns are not enough motivation, but if we
start generalizing optimizations in some controllers, we may want to
split the tree for flexibility.

2024-04-18 21:33:25

by Tejun Heo

Subject: Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing

Hello,

On Thu, Apr 18, 2024 at 02:22:58PM -0700, Yosry Ahmed wrote:
> Outside of this, I think it helps us add controller-specific
> optimizations. For example, I tried to generalize the thresholding
> code in the memory controller and put it in the rstat code, but I
> couldn't really have a single value representing the "pending stats"
> from all controllers. It's impossible to compare memory stats (mostly
> in pages or bytes) to cpu time stats for instance.
>
> Similarly, with this proposal from Jesper (which I am not saying I am
> agreeing with :P), instead of having time-based ratelimiting in both
> the rstat code and the memcg code to support different thresholds, we
> could have the memory controller set a different threshold for itself.
>
> So perhaps the lock breakdowns are not enough motivation, but if we
> start generalizing optimizations in some controllers, we may want to
> split the tree for flexibility.

I see. Yeah, that makes more sense to me.

Thanks.

--
tejun

2024-04-19 10:17:09

by Jesper Dangaard Brouer

Subject: Re: [PATCH v1 3/3] cgroup/rstat: introduce ratelimited rstat flushing


On 18/04/2024 23.00, Yosry Ahmed wrote:
> On Thu, Apr 18, 2024 at 4:00 AM Jesper Dangaard Brouer<[email protected]> wrote:
>> On 18/04/2024 04.21, Yosry Ahmed wrote:
>>> On Tue, Apr 16, 2024 at 10:51 AM Jesper Dangaard Brouer<[email protected]> wrote:
>>>> This patch aims to reduce userspace-triggered pressure on the global
>>>> cgroup_rstat_lock by introducing a mechanism to limit how often reading
>>>> stat files causes cgroup rstat flushing.
>>>>
[...]

> Taking a step back, I think this series is trying to address two
> issues in one go: interrupt handling latency and lock contention.

Yes, patches 2 and 3 are essentially independent and address two different
aspects.

> While both are related because reducing flushing reduces irq
> disablement, I think it would be better if we can fix that issue
> separately with a more fundamental solution (e.g. using a mutex or
> dropping the lock at each CPU boundary).
>
> After that, we can more clearly evaluate the lock contention problem
> with data purely about flushing latency, without taking into
> consideration the irq handling problem.
>
> Does this make sense to you?

Yes, make sense.

So, you are suggesting we start with the mutex change? (patch 2)
(which still needs some adjustments/tuning)

This makes sense to me, as there are likely many solutions to reducing
the pressure on the lock.

With the tracepoint patch in-place, I/we can measure the pressure on the
lock, and I plan to do this across our CF fleet. Then we can slowly
work on improving lock contention and evaluate this on our fleets.

--Jesper
p.s.
Setting expectations:
- Going on vacation today, so will resume work after 29th April.

2024-04-19 13:15:20

by Jesper Dangaard Brouer

Subject: Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex



On 18/04/2024 22.39, Yosry Ahmed wrote:
> On Thu, Apr 18, 2024 at 7:49 AM Shakeel Butt <[email protected]> wrote:
>>
>> On Thu, Apr 18, 2024 at 11:02:06AM +0200, Jesper Dangaard Brouer wrote:
>>>
>>>
>>> On 18/04/2024 04.19, Yosry Ahmed wrote:
>> [...]
>>>>
>>>> I will keep the high-level conversation about using the mutex here in
>>>> the cover letter thread, but I am wondering why we are keeping the
>>>> lock dropping logic here with the mutex?
>>>>
>>>
>>> I agree that yielding the mutex in the loop makes less sense.
>>> Especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
>>> will be a preemption point for my softirq. But I kept it because, we
>>> are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I still worried that
>>> there was no sched point for other userspace processes while holding the
>>> mutex, but I don't fully know the sched implication when holding a mutex.
>>>
>>
>> Are the softirqs you are interested in, raised from the same cpu or
>> remote cpu? What about local_softirq_pending() check in addition to
>> need_resched() and spin_needbreak() checks? If softirq can only be
>> raised on local cpu then convert the spin_lock to non-irq one (Please
>> correct me if I am wrong but on return from hard irq and not within bh
>> or irq disabled spin_lock, the kernel will run the pending softirqs,
>> right?). Did you get the chance to test these two changes or something
>> similar in your prod environment?
>
> I tried making the spinlock a non-irq lock before, but Tejun objected [1].
> [1] https://lore.kernel.org/lkml/ZBz%2FV5a7%[email protected]/
>

After reading [1], I think using a mutex is a better approach (than
non-irq spinlock).


> Perhaps we could experiment with always dropping the lock at CPU
> boundaries instead?
>

I don't think this will be enough (always dropping the lock at CPU
boundaries). My measured "lock-hold" times show that IRQs (and softirqs)
are blocked for too long, when looking at prod with my new cgroup
tracepoint script[2]. When contention occurs, I see many Yields happening,
with the same magnitude as Contended, but I still see events with long
"lock-hold" times, even though yields are high.

[2]
https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt

Example output:

12:46:56 High Lock-contention: wait: 739 usec (0 ms) on CPU:56
comm:kswapd7
12:46:56 Long lock-hold time: 6381 usec (6 ms) on CPU:27 comm:kswapd3
12:46:56 Long lock-hold time: 18905 usec (18 ms) on CPU:100
comm:kworker/u261:12

12:46:56 time elapsed: 36 sec (interval = 1 sec)
Flushes(2051) 15/interval (avg 56/sec)
Locks(44464) 1340/interval (avg 1235/sec)
Yields(42413) 1325/interval (avg 1178/sec)
Contended(42112) 1322/interval (avg 1169/sec)

There are 15 flushes/sec reported, but the lock is yielded quickly.

More problematically (for softirq latency) we see a Long lock-hold time
reaching 18 ms. For network RX softirq I need lower than 0.5ms latency,
to avoid RX-ring HW queue overflows.


--Jesper
p.s. I'm seeing a pattern with kswapdN contending on this lock.

@stack[697, kswapd3]:
__cgroup_rstat_lock+107
__cgroup_rstat_lock+107
cgroup_rstat_flush_locked+851
cgroup_rstat_flush+35
shrink_node+226
balance_pgdat+807
kswapd+521
kthread+228
ret_from_fork+48
ret_from_fork_asm+27

@stack[698, kswapd4]:
__cgroup_rstat_lock+107
__cgroup_rstat_lock+107
cgroup_rstat_flush_locked+851
cgroup_rstat_flush+35
shrink_node+226
balance_pgdat+807
kswapd+521
kthread+228
ret_from_fork+48
ret_from_fork_asm+27

@stack[699, kswapd5]:
__cgroup_rstat_lock+107
__cgroup_rstat_lock+107
cgroup_rstat_flush_locked+851
cgroup_rstat_flush+35
shrink_node+226
balance_pgdat+807
kswapd+521
kthread+228
ret_from_fork+48
ret_from_fork_asm+27


2024-04-19 16:12:40

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex

On Fri, Apr 19, 2024 at 03:15:01PM +0200, Jesper Dangaard Brouer wrote:
>
>
> On 18/04/2024 22.39, Yosry Ahmed wrote:
> > On Thu, Apr 18, 2024 at 7:49 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Thu, Apr 18, 2024 at 11:02:06AM +0200, Jesper Dangaard Brouer wrote:
> > > >
> > > >
> > > > On 18/04/2024 04.19, Yosry Ahmed wrote:
> > > [...]
> > > > >
> > > > > I will keep the high-level conversation about using the mutex here in
> > > > > the cover letter thread, but I am wondering why we are keeping the
> > > > > lock dropping logic here with the mutex?
> > > > >
> > > >
> > > > I agree that yielding the mutex in the loop makes less sense.
> > > > Especially since the raw_spin_unlock_irqrestore(cpu_lock, flags) call
> > > > will be a preemption point for my softirq. But I kept it because, we
> > > > are running a CONFIG_PREEMPT_VOLUNTARY kernel, so I still worried that
> > > > there was no sched point for other userspace processes while holding the
> > > > mutex, but I don't fully know the sched implication when holding a mutex.
> > > >
> > >
> > > Are the softirqs you are interested in, raised from the same cpu or
> > > remote cpu? What about local_softirq_pending() check in addition to
> > > need_resched() and spin_needbreak() checks? If softirq can only be
> > > raised on local cpu then convert the spin_lock to non-irq one (Please
> > > correct me if I am wrong but on return from hard irq and not within bh
> > > or irq disabled spin_lock, the kernel will run the pending softirqs,
> > > right?). Did you get the chance to test these two changes or something
> > > similar in your prod environment?
> >
> > I tried making the spinlock a non-irq lock before, but Tejun objected [1].
> > [1] https://lore.kernel.org/lkml/ZBz%2FV5a7%[email protected]/
> >
>
> After reading [1], I think using a mutex is a better approach (than non-irq
> spinlock).
>
>
> > Perhaps we could experiment with always dropping the lock at CPU
> > boundaries instead?
> >
>
> I don't think this will be enough (always dropping the lock at CPU
> boundaries). My measured "lock-hold" times show that IRQs (and
> softirqs) are blocked for too long, as seen in prod with my new cgroup
> tracepoint script[2]. When contention occurs, I see many Yields
> happening, with the same magnitude as Contended, but I still see events
> with long "lock-hold" times even though yields are high.
>
> [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
>
> Example output:
>
> 12:46:56 High Lock-contention: wait: 739 usec (0 ms) on CPU:56 comm:kswapd7
> 12:46:56 Long lock-hold time: 6381 usec (6 ms) on CPU:27 comm:kswapd3
> 12:46:56 Long lock-hold time: 18905 usec (18 ms) on CPU:100
> comm:kworker/u261:12
>
> 12:46:56 time elapsed: 36 sec (interval = 1 sec)
> Flushes(2051) 15/interval (avg 56/sec)
> Locks(44464) 1340/interval (avg 1235/sec)
> Yields(42413) 1325/interval (avg 1178/sec)
> Contended(42112) 1322/interval (avg 1169/sec)
>
> Only 15 flushes/sec are reported, but locks are yielded quickly.
>
> More problematically (for softirq latency), we see long lock-hold times
> reaching 18 ms. For the network RX softirq I need latency below 0.5 ms
> to avoid RX-ring HW queue overflows.
>
>
> --Jesper
> p.s. I'm seeing a pattern with kswapdN contending on this lock.
>
> @stack[697, kswapd3]:
> __cgroup_rstat_lock+107
> __cgroup_rstat_lock+107
> cgroup_rstat_flush_locked+851
> cgroup_rstat_flush+35
> shrink_node+226
> balance_pgdat+807
> kswapd+521
> kthread+228
> ret_from_fork+48
> ret_from_fork_asm+27
>
> @stack[698, kswapd4]:
> __cgroup_rstat_lock+107
> __cgroup_rstat_lock+107
> cgroup_rstat_flush_locked+851
> cgroup_rstat_flush+35
> shrink_node+226
> balance_pgdat+807
> kswapd+521
> kthread+228
> ret_from_fork+48
> ret_from_fork_asm+27
>
> @stack[699, kswapd5]:
> __cgroup_rstat_lock+107
> __cgroup_rstat_lock+107
> cgroup_rstat_flush_locked+851
> cgroup_rstat_flush+35
> shrink_node+226
> balance_pgdat+807
> kswapd+521
> kthread+228
> ret_from_fork+48
> ret_from_fork_asm+27
>

Can you simply replace mem_cgroup_flush_stats() in
prepare_scan_control() with the ratelimited version and see if the issue
still persists for your production traffic?
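
For concreteness, a sketch of that experiment (illustrative only;
whether the call site passes sc->target_mem_cgroup and the exact
surrounding context are from memory, not an exact diff):

  /* mm/vmscan.c: prepare_scan_control() */
  -	mem_cgroup_flush_stats(sc->target_mem_cgroup);
  +	mem_cgroup_flush_stats_ratelimited(sc->target_mem_cgroup);

IIRC the ratelimited variant only flushes when the periodic flusher is
running late, so kswapd would read slightly stale stats instead of
piling onto the global rstat lock.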

Also were you able to get which specific stats are getting the most
updates?

2024-04-19 19:22:35

by Yosry Ahmed

[permalink] [raw]
Subject: Re: [PATCH v1 2/3] cgroup/rstat: convert cgroup_rstat_lock back to mutex

[..]
> > > Perhaps we could experiment with always dropping the lock at CPU
> > > boundaries instead?
> > >
> >
> > I don't think this will be enough (always dropping the lock at CPU
> > boundaries). My measured "lock-hold" times show that IRQs (and
> > softirqs) are blocked for too long, as seen in prod with my new cgroup
> > tracepoint script[2]. When contention occurs, I see many Yields
> > happening, with the same magnitude as Contended, but I still see events
> > with long "lock-hold" times even though yields are high.
> >
> > [2] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
> >
> > Example output:
> >
> > 12:46:56 High Lock-contention: wait: 739 usec (0 ms) on CPU:56 comm:kswapd7
> > 12:46:56 Long lock-hold time: 6381 usec (6 ms) on CPU:27 comm:kswapd3
> > 12:46:56 Long lock-hold time: 18905 usec (18 ms) on CPU:100
> > comm:kworker/u261:12
> >
> > 12:46:56 time elapsed: 36 sec (interval = 1 sec)
> > Flushes(2051) 15/interval (avg 56/sec)
> > Locks(44464) 1340/interval (avg 1235/sec)
> > Yields(42413) 1325/interval (avg 1178/sec)
> > Contended(42112) 1322/interval (avg 1169/sec)
> >
> > Only 15 flushes/sec are reported, but locks are yielded quickly.
> >
> > More problematically (for softirq latency), we see long lock-hold times
> > reaching 18 ms. For the network RX softirq I need latency below 0.5 ms
> > to avoid RX-ring HW queue overflows.

Here we are measuring yields against contention, but the main problem
is IRQ serving latency, which doesn't have to correlate with contention,
right?

Perhaps contention is causing us to yield the lock every nth cpu
boundary, but apparently this is not enough for IRQ serving latency.
Dropping the lock on each boundary should improve IRQ serving latency,
regardless of the presence of contention.
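
Concretely, dropping the lock on each CPU boundary would mean making the
unlock/relock unconditional instead of gated on need_resched() /
spin_needbreak(). A rough, untested sketch on top of the helpers from
patch 1:

  	for_each_possible_cpu(cpu) {
  		/* ... flush the updated cgroups for this cpu ... */

  		/* unconditionally open an IRQ/preemption window per CPU */
  		__cgroup_rstat_unlock(cgrp, cpu);
  		if (!cond_resched())
  			cpu_relax();
  		__cgroup_rstat_lock(cgrp, cpu);
  	}

That would bound the IRQ-disabled window to a single per-CPU flush, at
the cost of extra lock traffic even when there is no contention.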

Let's focus on one problem at a time ;)

> >
> >
> > --Jesper
> > p.s. I'm seeing a pattern with kswapdN contending on this lock.
> >
> > @stack[697, kswapd3]:
> > __cgroup_rstat_lock+107
> > __cgroup_rstat_lock+107
> > cgroup_rstat_flush_locked+851
> > cgroup_rstat_flush+35
> > shrink_node+226
> > balance_pgdat+807
> > kswapd+521
> > kthread+228
> > ret_from_fork+48
> > ret_from_fork_asm+27
> >
> > @stack[698, kswapd4]:
> > __cgroup_rstat_lock+107
> > __cgroup_rstat_lock+107
> > cgroup_rstat_flush_locked+851
> > cgroup_rstat_flush+35
> > shrink_node+226
> > balance_pgdat+807
> > kswapd+521
> > kthread+228
> > ret_from_fork+48
> > ret_from_fork_asm+27
> >
> > @stack[699, kswapd5]:
> > __cgroup_rstat_lock+107
> > __cgroup_rstat_lock+107
> > cgroup_rstat_flush_locked+851
> > cgroup_rstat_flush+35
> > shrink_node+226
> > balance_pgdat+807
> > kswapd+521
> > kthread+228
> > ret_from_fork+48
> > ret_from_fork_asm+27
> >
>
> Can you simply replace mem_cgroup_flush_stats() in
> prepare_scan_control() with the ratelimited version and see if the issue
> still persists for your production traffic?

With thresholding, the fact that we reach cgroup_rstat_flush() means
that there is a high magnitude of pending updates. I think Jesper
mentioned 128 CPUs before; that means 128 * 64 (MEMCG_CHARGE_BATCH)
page-sized updates, which could be over 33 MBs with a 4K page size.
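
For reference, that threshold check looks roughly like this in 6.8+
mm/memcontrol.c (quoted from memory, minor details may differ):

  static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
  {
  	/* only flush once pending updates exceed a batch per online CPU */
  	return atomic64_read(&vmstats->stats_updates) >
  		MEMCG_CHARGE_BATCH * num_online_cpus();
  }

  /* e.g. 128 CPUs * 64 (MEMCG_CHARGE_BATCH) * 4 KiB pages = 32 MiB (~33.5 MB) */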

I am not sure if it's fine to ignore such updates in shrink_node(),
especially since it is sometimes called in a loop, so I imagine we may
want to see what changed after the last iteration.

>
> Also were you able to get which specific stats are getting the most
> updates?

This, on the other hand, would be very interesting. I think it is very
possible that we don't actually have 33 MBs of updates, but rather we
keep adding and subtracting from the same stat until we reach the
threshold. This could especially be true for hot stats like slab
allocations.
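
The add-and-subtract churn trips the threshold because the per-CPU
bookkeeping accumulates the magnitude of updates, not the net delta.
Trimmed sketch of memcg_rstat_updated() (hierarchy walk and the
needs-flush shortcut omitted, from memory):

  	statc->stats_updates += abs(val);	/* magnitude, not net delta */
  	if (statc->stats_updates >= MEMCG_CHARGE_BATCH) {
  		atomic64_add(statc->stats_updates,
  			     &statc->vmstats->stats_updates);
  		statc->stats_updates = 0;
  	}

So +N followed by -N on a hot stat (e.g. slab) contributes 2N towards
the flush threshold even though the value itself did not change.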

2024-04-23 16:55:37

by Simon Horman

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints

On Tue, Apr 16, 2024 at 07:51:26PM +0200, Jesper Dangaard Brouer wrote:

..

> /**
> * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
> */

Hi Jesper,

as a follow-up could you add an entry for cgrp to the kernel doc above?

> -void cgroup_rstat_flush_release(void)
> +void cgroup_rstat_flush_release(struct cgroup *cgrp)
> __releases(&cgroup_rstat_lock)
> {
> - spin_unlock_irq(&cgroup_rstat_lock);
> + __cgroup_rstat_unlock(cgrp, -1);
> }

..
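
For completeness, the requested kernel-doc addition amounts to something
like the following sketch (the wording that actually landed is in the
commit Jesper links in his reply below):

  /**
   * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
   * @cgrp: cgroup whose stats were flushed (also passed to the unlock tracepoint)
   */
  void cgroup_rstat_flush_release(struct cgroup *cgrp)
  	__releases(&cgroup_rstat_lock)
  {
  	__cgroup_rstat_unlock(cgrp, -1);
  }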

2024-04-29 11:47:07

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints



On 23/04/2024 18.53, Simon Horman wrote:
> On Tue, Apr 16, 2024 at 07:51:26PM +0200, Jesper Dangaard Brouer wrote:
>
> ...
>
>> /**
>> * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
>> */
>
> Hi Jesper,
>
> as a follow-up could you add an entry for cgrp to the kernel doc above?
>

Already fixed in TJ's tree for-6.10
- https://lore.kernel.org/cgroups/[email protected]/
- https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/commit/?h=for-6.10&id=97a46a66ad7d9126

But thanks Simon for your vigilance in noticing these things :-)
--Jesper


2024-04-29 18:02:04

by Simon Horman

[permalink] [raw]
Subject: Re: [PATCH v1 1/3] cgroup/rstat: add cgroup_rstat_lock helpers and tracepoints

On Mon, Apr 29, 2024 at 01:36:20PM +0200, Jesper Dangaard Brouer wrote:
>
>
> On 23/04/2024 18.53, Simon Horman wrote:
> > On Tue, Apr 16, 2024 at 07:51:26PM +0200, Jesper Dangaard Brouer wrote:
> >
> > ...
> >
> > > /**
> > > * cgroup_rstat_flush_release - release cgroup_rstat_flush_hold()
> > > */
> >
> > Hi Jesper,
> >
> > as a follow-up could you add an entry for cgrp to the kernel doc above?
> >
>
> Already fixed in TJ's tree for-6.10
> - https://lore.kernel.org/cgroups/[email protected]/
> - https://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git/commit/?h=for-6.10&id=97a46a66ad7d9126
>
> But thanks Simon for your vigilance in noticing these things :-)

Thanks Jesper,

I appreciate you letting me know :)