Currently, if rstat flushing is invoked using the irqsafe variant
cgroup_rstat_flush_irqsafe(), we keep interrupts disabled and do not
sleep for the entire flush operation, which is O(# cpus * # cgroups).
This can be rather dangerous.
Multiple code paths use cgroup_rstat_flush_irqsafe() even though they
can sleep, and among the code paths that actually cannot sleep, many are
still interruptible. This patch series breaks down the
O(# cpus * # cgroups) duration for which we disable interrupts into a
series of O(# cgroups) durations. Disabling interrupts is deferred to
the caller if needed.
Patch 1 mainly addresses this by not requiring interrupts to be
disabled for the global rstat lock to be acquired. As a side effect,
we disable rstat flushing in interrupt context. See patch 1 for more
details.
One thing I am not sure about is whether cgroup_base_stat_cputime_show()
-- the only caller of cgroup_rstat_flush_hold() -- currently has any
dependency on that call disabling interrupts.
Patch 2 follows suit for stats_flush_lock in the memcg code, allowing it
to be acquired without disabling interrupts.
Patch 3 removes cgroup_rstat_flush_irqsafe() and updates
cgroup_rstat_flush() to be more explicit about sleeping.
Patch 4 changes memcg code paths that invoke rstat flushing to sleep
where possible. The patch changes code paths where it is naturally safe
to sleep: userspace reads and the background periodic flusher.
Patches 5 & 6 allow sleeping during rstat flushing in the reclaim and
refault contexts. I am not sure if this is okay, especially the latter,
so I placed them in separate patches for ease of revert/drop.
Patch 7 is a slightly tangential optimization that limits the work done
by rstat flushing in some scenarios.
Yosry Ahmed (7):
cgroup: rstat: only disable interrupts for the percpu lock
memcg: do not disable interrupts when holding stats_flush_lock
cgroup: rstat: remove cgroup_rstat_flush_irqsafe()
memcg: sleep during flushing stats in safe contexts
vmscan: memcg: sleep when flushing stats during reclaim
workingset: memcg: sleep when flushing stats in workingset_refault()
memcg: do not modify rstat tree for zero updates
block/blk-cgroup.c | 2 +-
include/linux/cgroup.h | 3 +--
include/linux/memcontrol.h | 8 +++---
kernel/cgroup/cgroup.c | 4 +--
kernel/cgroup/rstat.c | 54 ++++++++++++++++++++------------------
mm/memcontrol.c | 52 ++++++++++++++++++++++--------------
mm/vmscan.c | 2 +-
mm/workingset.c | 4 +--
8 files changed, 73 insertions(+), 56 deletions(-)
--
2.40.0.rc1.284.g88254d51c5-goog
The rstat flushing code was modified so that we do not disable interrupts
when we hold the global rstat lock. Do the same for stats_flush_lock on
the memcg side to avoid unnecessarily disabling interrupts throughout
flushing.
Since the code exclusively uses trylock to acquire this lock, it should
be fine to hold it from interrupt or normal contexts without disabling
interrupts, as a deadlock cannot occur. For interrupt contexts we will
return immediately without flushing anyway.
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/memcontrol.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5abffe6f8389..e0e92b38fa51 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -636,15 +636,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
static void __mem_cgroup_flush_stats(void)
{
- unsigned long flag;
-
- if (!spin_trylock_irqsave(&stats_flush_lock, flag))
+ /*
+ * This lock can be acquired from interrupt context, but we only acquire
+ * using trylock so it should be fine as we cannot cause a deadlock.
+ */
+ if (!spin_trylock(&stats_flush_lock))
return;
flush_next_time = jiffies_64 + 2*FLUSH_TIME;
cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
atomic_set(&stats_flush_threshold, 0);
- spin_unlock_irqrestore(&stats_flush_lock, flag);
+ spin_unlock(&stats_flush_lock);
}
void mem_cgroup_flush_stats(void)
--
2.40.0.rc1.284.g88254d51c5-goog
Currently, when sleeping is not allowed during rstat flushing, we hold
the global rstat lock with interrupts disabled throughout the entire
flush operation. Flushing is an O(# cgroups * # cpus) operation, and
having interrupts disabled throughout is dangerous.
For some contexts, we may not want to sleep, but can be interrupted
(e.g. while holding a spinlock or RCU read lock). As such, do not
disable interrupts throughout rstat flushing, only when holding the
percpu lock. This breaks down the O(# cgroups * # cpus) duration with
interrupts disabled to a series of O(# cgroups) durations.
Furthermore, a cpu spinning while waiting for the global rstat lock no
longer needs to spin with interrupts disabled.
If the caller of rstat flushing needs interrupts to be disabled, it's up
to them to decide that, and it should be fine to hold the global rstat
lock with interrupts disabled. There is currently a single context that
may invoke rstat flushing with interrupts disabled, the
mem_cgroup_flush_stats() call in mem_cgroup_usage(), if called from
mem_cgroup_threshold().
To make it safe to hold the global rstat lock with interrupts enabled,
make sure we only flush from in_task() contexts. The side effect is that
we read stale stats in interrupt context, but this should be okay:
flushing in interrupt context is dangerous anyway as it is an expensive
operation, so reading stale stats is safer.
Signed-off-by: Yosry Ahmed <[email protected]>
---
kernel/cgroup/rstat.c | 40 +++++++++++++++++++++++++++++-----------
1 file changed, 29 insertions(+), 11 deletions(-)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 831f1f472bb8..af11e28bd055 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -7,6 +7,7 @@
#include <linux/btf.h>
#include <linux/btf_ids.h>
+/* This lock should only be held from task context */
static DEFINE_SPINLOCK(cgroup_rstat_lock);
static DEFINE_PER_CPU(raw_spinlock_t, cgroup_rstat_cpu_lock);
@@ -210,14 +211,24 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
/* if @may_sleep, play nice and yield if necessary */
if (may_sleep && (need_resched() ||
spin_needbreak(&cgroup_rstat_lock))) {
- spin_unlock_irq(&cgroup_rstat_lock);
+ spin_unlock(&cgroup_rstat_lock);
if (!cond_resched())
cpu_relax();
- spin_lock_irq(&cgroup_rstat_lock);
+ spin_lock(&cgroup_rstat_lock);
}
}
}
+static bool should_skip_flush(void)
+{
+ /*
+ * We acquire cgroup_rstat_lock without disabling interrupts, so we
+ * should not try to acquire from interrupt contexts to avoid deadlocks.
+ * It can be expensive to flush stats in interrupt context anyway.
+ */
+ return !in_task();
+}
+
/**
* cgroup_rstat_flush - flush stats in @cgrp's subtree
* @cgrp: target cgroup
@@ -229,15 +240,18 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
* This also gets all cgroups in the subtree including @cgrp off the
* ->updated_children lists.
*
- * This function may block.
+ * This function is safe to call from contexts that disable interrupts, but
+ * @may_sleep must be set to false, otherwise the function may block.
*/
__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
{
- might_sleep();
+ if (should_skip_flush())
+ return;
- spin_lock_irq(&cgroup_rstat_lock);
+ might_sleep();
+ spin_lock(&cgroup_rstat_lock);
cgroup_rstat_flush_locked(cgrp, true);
- spin_unlock_irq(&cgroup_rstat_lock);
+ spin_unlock(&cgroup_rstat_lock);
}
/**
@@ -248,11 +262,12 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
*/
void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
{
- unsigned long flags;
+ if (should_skip_flush())
+ return;
- spin_lock_irqsave(&cgroup_rstat_lock, flags);
+ spin_lock(&cgroup_rstat_lock);
cgroup_rstat_flush_locked(cgrp, false);
- spin_unlock_irqrestore(&cgroup_rstat_lock, flags);
+ spin_unlock(&cgroup_rstat_lock);
}
/**
@@ -267,8 +282,11 @@ void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
void cgroup_rstat_flush_hold(struct cgroup *cgrp)
__acquires(&cgroup_rstat_lock)
{
+ if (should_skip_flush())
+ return;
+
might_sleep();
- spin_lock_irq(&cgroup_rstat_lock);
+ spin_lock(&cgroup_rstat_lock);
cgroup_rstat_flush_locked(cgrp, true);
}
@@ -278,7 +296,7 @@ void cgroup_rstat_flush_hold(struct cgroup *cgrp)
void cgroup_rstat_flush_release(void)
__releases(&cgroup_rstat_lock)
{
- spin_unlock_irq(&cgroup_rstat_lock);
+ spin_unlock(&cgroup_rstat_lock);
}
int cgroup_rstat_init(struct cgroup *cgrp)
--
2.40.0.rc1.284.g88254d51c5-goog
The naming of cgroup_rstat_flush_irqsafe() can be confusing. It reads
like "irqsave", suggesting that it disables interrupts throughout, but
it actually doesn't. It is just "safe" to call from contexts with
interrupts disabled.
Furthermore, this is only used today by mem_cgroup_flush_stats(), which
will be changed by a later patch to optionally allow sleeping. Simplify
the code and make it easier to reason about by instead passing in an
explicit @may_sleep argument to cgroup_rstat_flush(), which gets passed
directly to cgroup_rstat_flush_locked().
Signed-off-by: Yosry Ahmed <[email protected]>
---
block/blk-cgroup.c | 2 +-
include/linux/cgroup.h | 3 +--
kernel/cgroup/cgroup.c | 4 ++--
kernel/cgroup/rstat.c | 24 +++++-------------------
mm/memcontrol.c | 6 +++---
5 files changed, 12 insertions(+), 27 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index bd50b55bdb61..3fe313ce5e6b 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1043,7 +1043,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
if (!seq_css(sf)->parent)
blkcg_fill_root_iostats();
else
- cgroup_rstat_flush(blkcg->css.cgroup);
+ cgroup_rstat_flush(blkcg->css.cgroup, true);
rcu_read_lock();
hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3410aecffdb4..743c345b6a3f 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -691,8 +691,7 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
* cgroup scalable recursive statistics.
*/
void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
-void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
+void cgroup_rstat_flush(struct cgroup *cgrp, bool may_sleep);
void cgroup_rstat_flush_hold(struct cgroup *cgrp);
void cgroup_rstat_flush_release(void);
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 935e8121b21e..58df0fc379f6 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5393,7 +5393,7 @@ static void css_release_work_fn(struct work_struct *work)
if (ss) {
/* css release path */
if (!list_empty(&css->rstat_css_node)) {
- cgroup_rstat_flush(cgrp);
+ cgroup_rstat_flush(cgrp, true);
list_del_rcu(&css->rstat_css_node);
}
@@ -5406,7 +5406,7 @@ static void css_release_work_fn(struct work_struct *work)
/* cgroup release path */
TRACE_CGROUP_PATH(release, cgrp);
- cgroup_rstat_flush(cgrp);
+ cgroup_rstat_flush(cgrp, true);
spin_lock_irq(&css_set_lock);
for (tcgrp = cgroup_parent(cgrp); tcgrp;
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index af11e28bd055..fe8690bb1e1c 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -243,30 +243,16 @@ static bool should_skip_flush(void)
* This function is safe to call from contexts that disable interrupts, but
* @may_sleep must be set to false, otherwise the function may block.
*/
-__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
+__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp, bool may_sleep)
{
if (should_skip_flush())
return;
- might_sleep();
- spin_lock(&cgroup_rstat_lock);
- cgroup_rstat_flush_locked(cgrp, true);
- spin_unlock(&cgroup_rstat_lock);
-}
-
-/**
- * cgroup_rstat_flush_irqsafe - irqsafe version of cgroup_rstat_flush()
- * @cgrp: target cgroup
- *
- * This function can be called from any context.
- */
-void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
-{
- if (should_skip_flush())
- return;
+ if (may_sleep)
+ might_sleep();
spin_lock(&cgroup_rstat_lock);
- cgroup_rstat_flush_locked(cgrp, false);
+ cgroup_rstat_flush_locked(cgrp, may_sleep);
spin_unlock(&cgroup_rstat_lock);
}
@@ -325,7 +311,7 @@ void cgroup_rstat_exit(struct cgroup *cgrp)
{
int cpu;
- cgroup_rstat_flush(cgrp);
+ cgroup_rstat_flush(cgrp, true);
/* sanity check */
for_each_possible_cpu(cpu) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e0e92b38fa51..72cd44f88d97 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -644,7 +644,7 @@ static void __mem_cgroup_flush_stats(void)
return;
flush_next_time = jiffies_64 + 2*FLUSH_TIME;
- cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
+ cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
atomic_set(&stats_flush_threshold, 0);
spin_unlock(&stats_flush_lock);
}
@@ -7745,7 +7745,7 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
break;
}
- cgroup_rstat_flush(memcg->css.cgroup);
+ cgroup_rstat_flush(memcg->css.cgroup, true);
pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
if (pages < max)
continue;
@@ -7810,7 +7810,7 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
static u64 zswap_current_read(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- cgroup_rstat_flush(css->cgroup);
+ cgroup_rstat_flush(css->cgroup, true);
return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
}
--
2.40.0.rc1.284.g88254d51c5-goog
In some situations, we may end up calling memcg_rstat_updated() with a
value of 0, which means the stat was not actually updated. An example is
if we fail to reclaim any pages in shrink_folio_list().
Do not add the cgroup to the rstat updated tree in this case, to avoid
unnecessarily flushing it.
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/memcontrol.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 39a9c7a978ae..7afd29399409 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -618,6 +618,9 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
unsigned int x;
+ if (!val)
+ return;
+
cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
x = __this_cpu_add_return(stats_updates, abs(val));
--
2.40.0.rc1.284.g88254d51c5-goog
Memory reclaim should be a sleepable context. Allow sleeping when
flushing memcg stats to avoid unnecessarily performing a lot of work
without sleeping. This can slow down reclaim code if flushing stats
takes too long, but there are already multiple cond_resched() calls in
the reclaim code.
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 59d1830d08ac..bae35cfb33c8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
* Flush the memory cgroup stats, so that we read accurate per-memcg
* lruvec stats for heuristics.
*/
- mem_cgroup_flush_stats(false);
+ mem_cgroup_flush_stats(true);
/*
* Determine the scan balance between anon and file LRUs.
--
2.40.0.rc1.284.g88254d51c5-goog
Currently, all contexts that flush memcg stats do so with sleeping not
allowed. Some of these contexts are perfectly safe to sleep in, such as
reading cgroup files from userspace or the background periodic flusher.
Enable choosing whether sleeping is allowed or not when flushing memcg
stats, and allow sleeping in safe contexts to avoid unnecessarily
performing a lot of work without sleeping.
Signed-off-by: Yosry Ahmed <[email protected]>
---
include/linux/memcontrol.h | 8 ++++----
mm/memcontrol.c | 35 ++++++++++++++++++++++-------------
mm/vmscan.c | 2 +-
mm/workingset.c | 3 ++-
4 files changed, 29 insertions(+), 19 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6eda2ab205d..0c7b286f2caf 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1036,8 +1036,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
return x;
}
-void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_delayed(void);
+void mem_cgroup_flush_stats(bool may_sleep);
+void mem_cgroup_flush_stats_delayed(bool may_sleep);
void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val);
@@ -1531,11 +1531,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
return node_page_state(lruvec_pgdat(lruvec), idx);
}
-static inline void mem_cgroup_flush_stats(void)
+static inline void mem_cgroup_flush_stats(bool may_sleep)
{
}
-static inline void mem_cgroup_flush_stats_delayed(void)
+static inline void mem_cgroup_flush_stats_delayed(bool may_sleep)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 72cd44f88d97..39a9c7a978ae 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -634,7 +634,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
}
}
-static void __mem_cgroup_flush_stats(void)
+static void __mem_cgroup_flush_stats(bool may_sleep)
{
/*
* This lock can be acquired from interrupt context, but we only acquire
@@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
return;
flush_next_time = jiffies_64 + 2*FLUSH_TIME;
- cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
+ cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
atomic_set(&stats_flush_threshold, 0);
spin_unlock(&stats_flush_lock);
}
-void mem_cgroup_flush_stats(void)
+void mem_cgroup_flush_stats(bool may_sleep)
{
if (atomic_read(&stats_flush_threshold) > num_online_cpus())
- __mem_cgroup_flush_stats();
+ __mem_cgroup_flush_stats(may_sleep);
}
-void mem_cgroup_flush_stats_delayed(void)
+void mem_cgroup_flush_stats_delayed(bool may_sleep)
{
if (time_after64(jiffies_64, flush_next_time))
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(may_sleep);
}
static void flush_memcg_stats_dwork(struct work_struct *w)
{
- __mem_cgroup_flush_stats();
+ __mem_cgroup_flush_stats(true);
queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
}
@@ -1570,7 +1570,7 @@ static void memory_stat_format(struct mem_cgroup *memcg, char *buf, int bufsize)
*
* Current memory state:
*/
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(true);
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
u64 size;
@@ -3671,7 +3671,11 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
unsigned long val;
if (mem_cgroup_is_root(memcg)) {
- mem_cgroup_flush_stats();
+ /*
+ * mem_cgroup_threshold() calls here from irqsafe context.
+ * Don't sleep.
+ */
+ mem_cgroup_flush_stats(false);
val = memcg_page_state(memcg, NR_FILE_PAGES) +
memcg_page_state(memcg, NR_ANON_MAPPED);
if (swap)
@@ -4014,7 +4018,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
int nid;
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(true);
for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
seq_printf(m, "%s=%lu", stat->name,
@@ -4090,7 +4094,7 @@ static int memcg_stat_show(struct seq_file *m, void *v)
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(true);
for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
unsigned long nr;
@@ -4594,7 +4598,12 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
struct mem_cgroup *parent;
- mem_cgroup_flush_stats();
+ /*
+ * wb_writeback() takes a spinlock and calls
+ * wb_over_bg_thresh()->mem_cgroup_wb_stats().
+ * Do not sleep.
+ */
+ mem_cgroup_flush_stats(false);
*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6596,7 +6605,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
int i;
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(true);
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
int nid;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..59d1830d08ac 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
* Flush the memory cgroup stats, so that we read accurate per-memcg
* lruvec stats for heuristics.
*/
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(false);
/*
* Determine the scan balance between anon and file LRUs.
diff --git a/mm/workingset.c b/mm/workingset.c
index 00c6f4d9d9be..042eabbb43f6 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -462,7 +462,8 @@ void workingset_refault(struct folio *folio, void *shadow)
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
- mem_cgroup_flush_stats_delayed();
+ /* Do not sleep with RCU lock held */
+ mem_cgroup_flush_stats_delayed(false);
/*
* Compare the distance to the existing workingset size. We
* don't activate pages that couldn't stay resident even if
--
2.40.0.rc1.284.g88254d51c5-goog
In workingset_refault(), we call mem_cgroup_flush_stats_delayed() to
flush stats within an RCU read section and with sleeping disallowed.
Move the call to mem_cgroup_flush_stats_delayed() above the RCU read
section and allow sleeping to avoid unnecessarily performing a lot of
work without sleeping.
Signed-off-by: Yosry Ahmed <[email protected]>
---
A lot of code paths call into workingset_refault(), so I am not at all
sure whether it's okay to sleep in every such context.
Feedback here would be very helpful.
---
mm/workingset.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index 042eabbb43f6..410bc6684ea7 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -406,6 +406,8 @@ void workingset_refault(struct folio *folio, void *shadow)
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
eviction <<= bucket_order;
+ /* Flush stats (and potentially sleep) before holding RCU read lock */
+ mem_cgroup_flush_stats_delayed(true);
rcu_read_lock();
/*
* Look up the memcg associated with the stored ID. It might
@@ -461,9 +463,6 @@ void workingset_refault(struct folio *folio, void *shadow)
lruvec = mem_cgroup_lruvec(memcg, pgdat);
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
-
- /* Do not sleep with RCU lock held */
- mem_cgroup_flush_stats_delayed(false);
/*
* Compare the distance to the existing workingset size. We
* don't activate pages that couldn't stay resident even if
--
2.40.0.rc1.284.g88254d51c5-goog
On Wed, Mar 22, 2023 at 9:00 PM Yosry Ahmed <[email protected]> wrote:
>
> Currently, if rstat flushing is invoked using the irqsafe variant
> cgroup_rstat_flush_irqsafe(), we keep interrupts disabled and do not
> sleep for the entire flush operation, which is O(# cpus * # cgroups).
> This can be rather dangerous.
>
> Not all contexts that use cgroup_rstat_flush_irqsafe() actually cannot
> sleep, and among those that cannot sleep, not all contexts require
> interrupts to be disabled.
Too many negations in the above sentence is making it very confusing.
On Wed, Mar 22, 2023 at 9:00 PM Yosry Ahmed <[email protected]> wrote:
>
> Currently, when sleeping is not allowed during rstat flushing, we hold
> the global rstat lock with interrupts disabled throughout the entire
> flush operation. Flushing in an O(# cgroups * # cpus) operation, and
> having interrupts disabled throughout is dangerous.
>
> For some contexts, we may not want to sleep, but can be interrupted
> (e.g. while holding a spinlock or RCU read lock). As such, do not
> disable interrupts throughout rstat flushing, only when holding the
> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> interrupts disabled to a series of O(# cgroups) durations.
>
> Furthermore, if a cpu spinning waiting for the global rstat lock, it
> doesn't need to spin with interrupts disabled anymore.
>
> If the caller of rstat flushing needs interrupts to be disabled, it's up
> to them to decide that, and it should be fine to hold the global rstat
> lock with interrupts disabled. There is currently a single context that
> may invoke rstat flushing with interrupts disabled, the
> mem_cgroup_flush_stats() call in mem_cgroup_usage(), if called from
> mem_cgroup_threshold().
>
> To make it safe to hold the global rstat lock with interrupts enabled,
> make sure we only flush from in_task() contexts. The side effect of that
> we read stale stats in interrupt context, but this should be okay, as
> flushing in interrupt context is dangerous anyway as it is an expensive
> operation, so reading stale stats is safer.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
Couple of questions:
1. What exactly is cgroup_rstat_lock protecting? Can we just remove it
altogether?
2. Are we really calling rstat flush in irq context?
3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
done for root memcg. Why is mem_cgroup_threshold() interested in root
memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
On Wed, Mar 22, 2023 at 9:00 PM Yosry Ahmed <[email protected]> wrote:
>
> The rstat flushing code was modified so that we do not disable interrupts
> when we hold the global rstat lock. Do the same for stats_flush_lock on
> the memcg side to avoid unnecessarily disabling interrupts throughout
> flushing.
>
> Since the code exclusively uses trylock to acquire this lock, it should
> be fine to hold from interrupt contexts or normal contexts without
> disabling interrupts as a deadlock cannot occur. For interrupt contexts
> we will return immediately without flushing anyway.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
> ---
> mm/memcontrol.c | 10 ++++++----
> 1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5abffe6f8389..e0e92b38fa51 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -636,15 +636,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>
> static void __mem_cgroup_flush_stats(void)
> {
> - unsigned long flag;
> -
> - if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> + /*
> + * This lock can be acquired from interrupt context,
How? What's the code path?
> but we only acquire
> + * using trylock so it should be fine as we cannot cause a deadlock.
> + */
> + if (!spin_trylock(&stats_flush_lock))
> return;
>
> flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
> atomic_set(&stats_flush_threshold, 0);
> - spin_unlock_irqrestore(&stats_flush_lock, flag);
> + spin_unlock(&stats_flush_lock);
> }
>
> void mem_cgroup_flush_stats(void)
> --
> 2.40.0.rc1.284.g88254d51c5-goog
>
On Wed, Mar 22, 2023 at 9:10 PM Shakeel Butt <[email protected]> wrote:
>
> On Wed, Mar 22, 2023 at 9:00 PM Yosry Ahmed <[email protected]> wrote:
> >
> > Currently, if rstat flushing is invoked using the irqsafe variant
> > cgroup_rstat_flush_irqsafe(), we keep interrupts disabled and do not
> > sleep for the entire flush operation, which is O(# cpus * # cgroups).
> > This can be rather dangerous.
> >
> > Not all contexts that use cgroup_rstat_flush_irqsafe() actually cannot
> > sleep, and among those that cannot sleep, not all contexts require
> > interrupts to be disabled.
>
> Too many negations in the above sentence is making it very confusing.
Sorry, this is indeed very confusing. I guess a better rephrasing is:
Multiple code paths use cgroup_rstat_flush_irqsafe(), but many of them
can sleep. Even among the code paths that actually cannot sleep,
multiple ones are interruptible.
On Wed, Mar 22, 2023 at 9:29 PM Shakeel Butt <[email protected]> wrote:
>
> On Wed, Mar 22, 2023 at 9:00 PM Yosry Ahmed <[email protected]> wrote:
> >
> > Currently, when sleeping is not allowed during rstat flushing, we hold
> > the global rstat lock with interrupts disabled throughout the entire
> > flush operation. Flushing in an O(# cgroups * # cpus) operation, and
> > having interrupts disabled throughout is dangerous.
> >
> > For some contexts, we may not want to sleep, but can be interrupted
> > (e.g. while holding a spinlock or RCU read lock). As such, do not
> > disable interrupts throughout rstat flushing, only when holding the
> > percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> > interrupts disabled to a series of O(# cgroups) durations.
> >
> > Furthermore, if a cpu spinning waiting for the global rstat lock, it
> > doesn't need to spin with interrupts disabled anymore.
> >
> > If the caller of rstat flushing needs interrupts to be disabled, it's up
> > to them to decide that, and it should be fine to hold the global rstat
> > lock with interrupts disabled. There is currently a single context that
> > may invoke rstat flushing with interrupts disabled, the
> > mem_cgroup_flush_stats() call in mem_cgroup_usage(), if called from
> > mem_cgroup_threshold().
> >
> > To make it safe to hold the global rstat lock with interrupts enabled,
> > make sure we only flush from in_task() contexts. The side effect of that
> > we read stale stats in interrupt context, but this should be okay, as
> > flushing in interrupt context is dangerous anyway as it is an expensive
> > operation, so reading stale stats is safer.
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
>
> Couple of questions:
>
> 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it
> altogether?
I believe it protects the global state variables that we flush into.
For example, for memcg, it protects mem_cgroup->vmstats.
I tried removing the lock and allowing concurrent flushing on
different cpus, by changing mem_cgroup->vmstats to use atomics
instead, but that turned out to be a little expensive. Also,
cgroup_rstat_lock is already contended by different flushers
(mitigated by stats_flush_lock on the memcg side). If we remove it,
concurrent flushers contend on every single percpu lock instead, which
also seems to be expensive.
> 2. Are we really calling rstat flush in irq context?
I think it is possible through the charge/uncharge path:
memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
added the protection against flushing in an interrupt context for
future callers as well, as it may cause a deadlock if we don't disable
interrupts when acquiring cgroup_rstat_lock.
> 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> done for root memcg. Why is mem_cgroup_threshold() interested in root
> memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
I am not sure, but it looks like event notifications may be set up on
the root memcg, which is why we need to check thresholds.
Even if mem_cgroup_threshold() does not flush memcg stats, the purpose
of this patch is to make sure the rstat flushing code itself does not
disable interrupts, which it currently does for any unsleepable
context, even an interruptible one.
On Wed, Mar 22, 2023 at 9:32 PM Shakeel Butt <[email protected]> wrote:
>
> On Wed, Mar 22, 2023 at 9:00 PM Yosry Ahmed <[email protected]> wrote:
> >
> > The rstat flushing code was modified so that we do not disable interrupts
> > when we hold the global rstat lock. Do the same for stats_flush_lock on
> > the memcg side to avoid unnecessarily disabling interrupts throughout
> > flushing.
> >
> > Since the code exclusively uses trylock to acquire this lock, it should
> > be fine to hold from interrupt contexts or normal contexts without
> > disabling interrupts as a deadlock cannot occur. For interrupt contexts
> > we will return immediately without flushing anyway.
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
> > ---
> > mm/memcontrol.c | 10 ++++++----
> > 1 file changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 5abffe6f8389..e0e92b38fa51 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -636,15 +636,17 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >
> > static void __mem_cgroup_flush_stats(void)
> > {
> > - unsigned long flag;
> > -
> > - if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> > + /*
> > + * This lock can be acquired from interrupt context,
>
> How? What's the code path?
I believe through the charge/uncharge path we can do
memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()->mem_cgroup_flush_stats(),
right? I am assuming we can charge/uncharge memory in an interrupt
context.
Also the current code always disables interrupts before calling
memcg_check_events(), which made me suspect the percpu variables that
are modified by that call can also be modified in interrupt context.
>
> > but we only acquire
> > + * using trylock so it should be fine as we cannot cause a deadlock.
> > + */
> > + if (!spin_trylock(&stats_flush_lock))
> > return;
> >
> > flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
> > atomic_set(&stats_flush_threshold, 0);
> > - spin_unlock_irqrestore(&stats_flush_lock, flag);
> > + spin_unlock(&stats_flush_lock);
> > }
> >
> > void mem_cgroup_flush_stats(void)
> > --
> > 2.40.0.rc1.284.g88254d51c5-goog
> >
On Wed, Mar 22, 2023 at 10:15 PM Yosry Ahmed <[email protected]> wrote:
>
[...]
> > Couple of questions:
> >
> > 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it
> > altogether?
>
> I believe it protects the global state variables that we flush into.
> For example, for memcg, it protects mem_cgroup->vmstats.
>
> I tried removing the lock and allowing concurrent flushing on
> different cpus, by changing mem_cgroup->vmstats to use atomics
> instead, but that turned out to be a little expensive. Also,
> cgroup_rstat_lock is already contended by different flushers
> (mitigated by stats_flush_lock on the memcg side). If we remove it,
> concurrent flushers contend on every single percpu lock instead, which
> also seems to be expensive.
We should add a comment on what it is protecting. I think block rstat
are fine but memcg and bpf would need this.
>
> > 2. Are we really calling rstat flush in irq context?
>
> I think it is possible through the charge/uncharge path:
> memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> added the protection against flushing in an interrupt context for
> future callers as well, as it may cause a deadlock if we don't disable
> interrupts when acquiring cgroup_rstat_lock.
>
> > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
>
> I am not sure, but the code looks like event notifications may be set
> up on root memcg, which is why we need to check thresholds.
This is something we should deprecate as root memcg's usage is ill defined.
>
> Even if mem_cgroup_threshold() does not flush memcg stats, the purpose
> of this patch is to make sure the rstat flushing code itself is not
> disabling interrupts; which it currently does for any unsleepable
> context, even if it is interruptible.
Basically I am saying we should aim for VM_BUG_ON(!in_task()) in the
flush function rather than adding should_skip_flush() which does not
stop potential new irq flushers.
On Wed, Mar 22, 2023 at 11:33 PM Shakeel Butt <[email protected]> wrote:
>
> On Wed, Mar 22, 2023 at 10:15 PM Yosry Ahmed <[email protected]> wrote:
> >
> [...]
> > > Couple of questions:
> > >
> > > 1. What exactly is cgroup_rstat_lock protecting? Can we just remove it
> > > altogether?
> >
> > I believe it protects the global state variables that we flush into.
> > For example, for memcg, it protects mem_cgroup->vmstats.
> >
> > I tried removing the lock and allowing concurrent flushing on
> > different cpus, by changing mem_cgroup->vmstats to use atomics
> > instead, but that turned out to be a little expensive. Also,
> > cgroup_rstat_lock is already contended by different flushers
> > (mitigated by stats_flush_lock on the memcg side). If we remove it,
> > concurrent flushers contend on every single percpu lock instead, which
> > also seems to be expensive.
>
> We should add a comment on what it is protecting. I think block rstat
> are fine but memcg and bpf would need this.
I think it also protects the cpu base stats flushed by cgroup_base_stat_flush().
I will add a comment in the next version.
>
> >
> > > 2. Are we really calling rstat flush in irq context?
> >
> > I think it is possible through the charge/uncharge path:
> > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > added the protection against flushing in an interrupt context for
> > future callers as well, as it may cause a deadlock if we don't disable
> > interrupts when acquiring cgroup_rstat_lock.
> >
> > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> >
> > I am not sure, but the code looks like event notifications may be set
> > up on root memcg, which is why we need to check thresholds.
>
> This is something we should deprecate as root memcg's usage is ill defined.
Right, but I think this would be orthogonal to this patch series.
>
> >
> > Even if mem_cgroup_threshold() does not flush memcg stats, the purpose
> > of this patch is to make sure the rstat flushing code itself is not
> > disabling interrupts; which it currently does for any unsleepable
> > context, even if it is interruptible.
>
> Basically I am saying we should aim for VM_BUG_ON(!in_task()) in the
> flush function rather than adding should_skip_flush() which does not
> stop potential new irq flushers.
I wanted to start with VM_BUG_ON(!in_task()) but I wasn't sure that
all contexts that call rstat flushing are not in irq contexts. I added
should_skip_flush() so that if there are existing flushers in irq
context, or new flushers are added, we are protected against a
deadlock.
We can change should_skip_flush() to have a WARN_ON_ONCE(!in_task())
to alert in this case. If you prefer removing should_skip_flush() and
just adding VM_BUG_ON(!in_task()) we can do that, but personally I was
not confident enough that we have no code paths today that may attempt
flushing from irq context.
On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <[email protected]> wrote:
>
[...]
> > >
> > > > 2. Are we really calling rstat flush in irq context?
> > >
> > > I think it is possible through the charge/uncharge path:
> > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > added the protection against flushing in an interrupt context for
> > > future callers as well, as it may cause a deadlock if we don't disable
> > > interrupts when acquiring cgroup_rstat_lock.
> > >
> > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > >
> > > I am not sure, but the code looks like event notifications may be set
> > > up on root memcg, which is why we need to check thresholds.
> >
> > This is something we should deprecate as root memcg's usage is ill defined.
>
> Right, but I think this would be orthogonal to this patch series.
>
I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
without either breaking the link between mem_cgroup_threshold and
cgroup_rstat_lock or making mem_cgroup_threshold work without
disabling irqs.
So, this patch cannot be applied before either of those two tasks is
done (and we may find more such scenarios).
On Thu, Mar 23, 2023 at 04:00:33AM +0000, Yosry Ahmed wrote:
> The naming of cgroup_rstat_flush_irqsafe() can be confusing.
> It can read like "irqsave", which means that it disables
> interrupts throughout, but it actually doesn't. It is just "safe" to
> call from contexts with interrupts disabled.
>
> Furthermore, this is only used today by mem_cgroup_flush_stats(), which
> will be changed by a later patch to optionally allow sleeping. Simplify
> the code and make it easier to reason about by instead passing in an
> explicit @may_sleep argument to cgroup_rstat_flush(), which gets passed
> directly to cgroup_rstat_flush_locked().
>
> Signed-off-by: Yosry Ahmed <[email protected]>
> ---
> block/blk-cgroup.c | 2 +-
> include/linux/cgroup.h | 3 +--
> kernel/cgroup/cgroup.c | 4 ++--
> kernel/cgroup/rstat.c | 24 +++++-------------------
> mm/memcontrol.c | 6 +++---
> 5 files changed, 12 insertions(+), 27 deletions(-)
>
> diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> index bd50b55bdb61..3fe313ce5e6b 100644
> --- a/block/blk-cgroup.c
> +++ b/block/blk-cgroup.c
> @@ -1043,7 +1043,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
> if (!seq_css(sf)->parent)
> blkcg_fill_root_iostats();
> else
> - cgroup_rstat_flush(blkcg->css.cgroup);
> + cgroup_rstat_flush(blkcg->css.cgroup, true);
>
> rcu_read_lock();
> hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index 3410aecffdb4..743c345b6a3f 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -691,8 +691,7 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
> * cgroup scalable recursive statistics.
> */
> void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
> -void cgroup_rstat_flush(struct cgroup *cgrp);
> -void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
> +void cgroup_rstat_flush(struct cgroup *cgrp, bool may_sleep);
> void cgroup_rstat_flush_hold(struct cgroup *cgrp);
> void cgroup_rstat_flush_release(void);
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 935e8121b21e..58df0fc379f6 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5393,7 +5393,7 @@ static void css_release_work_fn(struct work_struct *work)
> if (ss) {
> /* css release path */
> if (!list_empty(&css->rstat_css_node)) {
> - cgroup_rstat_flush(cgrp);
> + cgroup_rstat_flush(cgrp, true);
> list_del_rcu(&css->rstat_css_node);
> }
>
> @@ -5406,7 +5406,7 @@ static void css_release_work_fn(struct work_struct *work)
> /* cgroup release path */
> TRACE_CGROUP_PATH(release, cgrp);
>
> - cgroup_rstat_flush(cgrp);
> + cgroup_rstat_flush(cgrp, true);
This is an anti-pattern, please don't do this. Naked bool arguments
are a pain to comprehend at the callsite and an easy vector for bugs.
cgroup_rstat_flush_atomic() would be a better name for, well, atomic
callsites.
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index af11e28bd055..fe8690bb1e1c 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -243,30 +243,16 @@ static bool should_skip_flush(void)
> * This function is safe to call from contexts that disable interrupts, but
> * @may_sleep must be set to false, otherwise the function may block.
> */
> -__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
> +__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp, bool may_sleep)
> {
> if (should_skip_flush())
> return;
>
> - might_sleep();
> - spin_lock(&cgroup_rstat_lock);
> - cgroup_rstat_flush_locked(cgrp, true);
> - spin_unlock(&cgroup_rstat_lock);
> -}
> -
> -/**
> - * cgroup_rstat_flush_irqsafe - irqsafe version of cgroup_rstat_flush()
> - * @cgrp: target cgroup
> - *
> - * This function can be called from any context.
> - */
> -void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
> -{
> - if (should_skip_flush())
> - return;
> + if (may_sleep)
> + might_sleep();
might_sleep_if()
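(For reference, might_sleep_if(cond) collapses the open-coded "if (may_sleep) might_sleep();" into a single annotation. A minimal userspace sketch of the semantics, where might_sleep() is a counting stub rather than the kernel's debug check:)

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-in for the kernel's might_sleep() debug annotation,
 * which marks a point that may block. Here it just counts calls. */
static int might_sleep_calls;
static void might_sleep(void) { might_sleep_calls++; }

/* Same shape as the kernel macro: annotate only when cond is true. */
#define might_sleep_if(cond) do { if (cond) might_sleep(); } while (0)

/* Sketch of the reviewed function: the annotation fires only for
 * callers that declared they may sleep. */
static void flush(bool may_sleep)
{
	might_sleep_if(may_sleep);
	/* ... take the lock and flush ... */
}
```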
On Thu, Mar 23, 2023 at 8:43 AM Johannes Weiner <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 04:00:33AM +0000, Yosry Ahmed wrote:
> > The naming of cgroup_rstat_flush_irqsafe() can be confusing.
> > It can read like "irqsave", which means that it disables
> > interrupts throughout, but it actually doesn't. It is just "safe" to
> > call from contexts with interrupts disabled.
> >
> > Furthermore, this is only used today by mem_cgroup_flush_stats(), which
> > will be changed by a later patch to optionally allow sleeping. Simplify
> > the code and make it easier to reason about by instead passing in an
> > explicit @may_sleep argument to cgroup_rstat_flush(), which gets passed
> > directly to cgroup_rstat_flush_locked().
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
> > ---
> > block/blk-cgroup.c | 2 +-
> > include/linux/cgroup.h | 3 +--
> > kernel/cgroup/cgroup.c | 4 ++--
> > kernel/cgroup/rstat.c | 24 +++++-------------------
> > mm/memcontrol.c | 6 +++---
> > 5 files changed, 12 insertions(+), 27 deletions(-)
> >
> > diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
> > index bd50b55bdb61..3fe313ce5e6b 100644
> > --- a/block/blk-cgroup.c
> > +++ b/block/blk-cgroup.c
> > @@ -1043,7 +1043,7 @@ static int blkcg_print_stat(struct seq_file *sf, void *v)
> > if (!seq_css(sf)->parent)
> > blkcg_fill_root_iostats();
> > else
> > - cgroup_rstat_flush(blkcg->css.cgroup);
> > + cgroup_rstat_flush(blkcg->css.cgroup, true);
> >
> > rcu_read_lock();
> > hlist_for_each_entry_rcu(blkg, &blkcg->blkg_list, blkcg_node) {
> > diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> > index 3410aecffdb4..743c345b6a3f 100644
> > --- a/include/linux/cgroup.h
> > +++ b/include/linux/cgroup.h
> > @@ -691,8 +691,7 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
> > * cgroup scalable recursive statistics.
> > */
> > void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
> > -void cgroup_rstat_flush(struct cgroup *cgrp);
> > -void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
> > +void cgroup_rstat_flush(struct cgroup *cgrp, bool may_sleep);
> > void cgroup_rstat_flush_hold(struct cgroup *cgrp);
> > void cgroup_rstat_flush_release(void);
> >
> > diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> > index 935e8121b21e..58df0fc379f6 100644
> > --- a/kernel/cgroup/cgroup.c
> > +++ b/kernel/cgroup/cgroup.c
> > @@ -5393,7 +5393,7 @@ static void css_release_work_fn(struct work_struct *work)
> > if (ss) {
> > /* css release path */
> > if (!list_empty(&css->rstat_css_node)) {
> > - cgroup_rstat_flush(cgrp);
> > + cgroup_rstat_flush(cgrp, true);
> > list_del_rcu(&css->rstat_css_node);
> > }
> >
> > @@ -5406,7 +5406,7 @@ static void css_release_work_fn(struct work_struct *work)
> > /* cgroup release path */
> > TRACE_CGROUP_PATH(release, cgrp);
> >
> > - cgroup_rstat_flush(cgrp);
> > + cgroup_rstat_flush(cgrp, true);
>
> This is an anti-pattern, please don't do this. Naked bool arguments
> are a pain to comprehend at the callsite and an easy vector for bugs.
>
> cgroup_rstat_flush_atomic() would be a better name for, well, atomic
> callsites.
Thanks for pointing this out. I will rename it to
cgroup_rstat_flush_atomic() in upcoming versions instead. I will also
do the same for mem_cgroup_flush_stats() as I introduce a similar
boolean argument for it in the following patch.
>
> > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > index af11e28bd055..fe8690bb1e1c 100644
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> > @@ -243,30 +243,16 @@ static bool should_skip_flush(void)
> > * This function is safe to call from contexts that disable interrupts, but
> > * @may_sleep must be set to false, otherwise the function may block.
> > */
> > -__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
> > +__bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp, bool may_sleep)
> > {
> > if (should_skip_flush())
> > return;
> >
> > - might_sleep();
> > - spin_lock(&cgroup_rstat_lock);
> > - cgroup_rstat_flush_locked(cgrp, true);
> > - spin_unlock(&cgroup_rstat_lock);
> > -}
> > -
> > -/**
> > - * cgroup_rstat_flush_irqsafe - irqsafe version of cgroup_rstat_flush()
> > - * @cgrp: target cgroup
> > - *
> > - * This function can be called from any context.
> > - */
> > -void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
> > -{
> > - if (should_skip_flush())
> > - return;
> > + if (may_sleep)
> > + might_sleep();
>
> might_sleep_if()
Thanks for pointing this out. I don't think it will be needed if we
keep cgroup_rstat_flush_irqsafe() and only rename it to
cgroup_rstat_flush_atomic(), but it is useful to know that it exists
for future reference.
On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <[email protected]> wrote:
> > >
> > [...]
> > > > >
> > > > > > 2. Are we really calling rstat flush in irq context?
> > > > >
> > > > > I think it is possible through the charge/uncharge path:
> > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > added the protection against flushing in an interrupt context for
> > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > >
> > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > >
> > > > > I am not sure, but the code looks like event notifications may be set
> > > > > up on root memcg, which is why we need to check thresholds.
> > > >
> > > > This is something we should deprecate as root memcg's usage is ill defined.
> > >
> > > Right, but I think this would be orthogonal to this patch series.
> > >
> >
> > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > without either breaking a link between mem_cgroup_threshold and
> > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > irqs.
> >
> > So, this patch can not be applied before either of those two tasks are
> > done (and we may find more such scenarios).
>
>
> Could you elaborate why?
>
> My understanding is that with an in_task() check to make sure we only
> acquire cgroup_rstat_lock from non-irq context it should be fine to
> acquire cgroup_rstat_lock without disabling interrupts.
From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
with irq disabled while other code paths will take cgroup_rstat_lock
with irq enabled. This is a potential deadlock hazard unless
cgroup_rstat_lock is always taken with irq disabled.
On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <[email protected]> wrote:
> >
> [...]
> > > >
> > > > > 2. Are we really calling rstat flush in irq context?
> > > >
> > > > I think it is possible through the charge/uncharge path:
> > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > added the protection against flushing in an interrupt context for
> > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > interrupts when acquiring cgroup_rstat_lock.
> > > >
> > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > >
> > > > I am not sure, but the code looks like event notifications may be set
> > > > up on root memcg, which is why we need to check thresholds.
> > >
> > > This is something we should deprecate as root memcg's usage is ill defined.
> >
> > Right, but I think this would be orthogonal to this patch series.
> >
>
> I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> without either breaking a link between mem_cgroup_threshold and
> cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> irqs.
>
> So, this patch can not be applied before either of those two tasks are
> done (and we may find more such scenarios).
Could you elaborate why?
My understanding is that with an in_task() check to make sure we only
acquire cgroup_rstat_lock from non-irq context it should be fine to
acquire cgroup_rstat_lock without disabling interrupts.
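That understanding can be sketched in userspace (illustrative only: in_task() is simulated by a flag here, and the names mirror but do not reproduce the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the kernel's in_task(): true when not in irq context.
 * Simulated with a flag for this sketch. */
static bool in_task_flag = true;
static bool in_task(void) { return in_task_flag; }

/* Mirrors should_skip_flush(): never touch cgroup_rstat_lock from irq
 * context. If the lock is only ever taken with irqs enabled, taking it
 * without disabling interrupts cannot deadlock. */
static bool should_skip_flush(void)
{
	return !in_task();
}

static int flushes;
static void cgroup_rstat_flush(void)
{
	if (should_skip_flush())
		return;	/* stale stats are tolerated in irq context */
	flushes++;	/* spin_lock(&cgroup_rstat_lock); ... flush ... */
}
```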
On Thu, Mar 23, 2023 at 04:00:36AM +0000, Yosry Ahmed wrote:
> In workingset_refault(), we call mem_cgroup_flush_stats_delayed() to
> flush stats within an RCU read section and with sleeping disallowed.
> Move the call to mem_cgroup_flush_stats_delayed() above the RCU read
> section and allow sleeping to avoid unnecessarily performing a lot of
> work without sleeping.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
> ---
>
> A lot of code paths call into workingset_refault(), so I am not
> generally sure at all whether it's okay to sleep in all contexts or not.
> Feedback here would be very helpful.
Yes, it's safe.
On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
> return;
>
> flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> - cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> + cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
How is it safe to call this with may_sleep=true when it's holding the
stats_flush_lock?
> atomic_set(&stats_flush_threshold, 0);
> spin_unlock(&stats_flush_lock);
> }
On Thu, Mar 23, 2023 at 8:56 AM Johannes Weiner <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> > @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
> > return;
> >
> > flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > - cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> > + cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
>
> How is it safe to call this with may_sleep=true when it's holding the
> stats_flush_lock?
stats_flush_lock is always acquired with trylock; it is only used
today so that we can skip flushing if another cpu is already doing a
flush (which is not 100% correct as that cpu may not have finished
flushing yet, but that's orthogonal here). So I think it should be
safe to sleep, as no one can be blocked waiting on this spinlock.
Perhaps it would be better semantically to replace the spinlock with
an atomic test-and-set, instead of having a lock that can only be used
with trylock?
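A rough userspace sketch of that replacement, using C11 stdatomic in place of kernel atomics (names are illustrative, not the eventual kernel patch):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* One bit of state replaces stats_flush_lock: set while a flush is in
 * progress. Nothing ever spins on it, which makes the "no one can be
 * blocked waiting on this" property explicit in the type. */
static atomic_bool stats_flush_ongoing;
static int flush_count;

static void mem_cgroup_flush_stats(void)
{
	bool expected = false;

	/* Equivalent of the old spin_trylock(): claim the flush or
	 * skip if another context already owns it. */
	if (!atomic_compare_exchange_strong(&stats_flush_ongoing,
					    &expected, true))
		return;
	flush_count++;	/* cgroup_rstat_flush(...); sleeping is safe here */
	atomic_store(&stats_flush_ongoing, false);
}
```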
>
> > atomic_set(&stats_flush_threshold, 0);
> > spin_unlock(&stats_flush_lock);
> > }
On Thu, Mar 23, 2023 at 8:50 AM Johannes Weiner <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 04:00:36AM +0000, Yosry Ahmed wrote:
> > In workingset_refault(), we call mem_cgroup_flush_stats_delayed() to
> > flush stats within an RCU read section and with sleeping disallowed.
> > Move the call to mem_cgroup_flush_stats_delayed() above the RCU read
> > section and allow sleeping to avoid unnecessarily performing a lot of
> > work without sleeping.
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
> > ---
> >
> > A lot of code paths call into workingset_refault(), so I am not
> > generally sure at all whether it's okay to sleep in all contexts or not.
> > Feedback here would be very helpful.
>
> Yes, it's safe.
Thanks! That's very helpful!
On Thu, Mar 23, 2023 at 04:00:36AM +0000, Yosry Ahmed wrote:
> In workingset_refault(), we call mem_cgroup_flush_stats_delayed() to
> flush stats within an RCU read section and with sleeping disallowed.
> Move the call to mem_cgroup_flush_stats_delayed() above the RCU read
> section and allow sleeping to avoid unnecessarily performing a lot of
> work without sleeping.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
> ---
>
> A lot of code paths call into workingset_refault(), so I am not
> generally sure at all whether it's okay to sleep in all contexts or not.
> Feedback here would be very helpful.
>
> ---
> mm/workingset.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/mm/workingset.c b/mm/workingset.c
> index 042eabbb43f6..410bc6684ea7 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -406,6 +406,8 @@ void workingset_refault(struct folio *folio, void *shadow)
> unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
> eviction <<= bucket_order;
>
> + /* Flush stats (and potentially sleep) before holding RCU read lock */
> + mem_cgroup_flush_stats_delayed(true);
Btw, it might be a good time to rename this while you're in the
area. delayed suggests this is using a delayed_work, but this is
actually sometimes flushing directly from the callsite.
What it's doing is ratelimited calls. A better name would be:
mem_cgroup_flush_stats_ratelimited()
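The ratelimiting the rename describes can be sketched as follows (userspace sketch with a hypothetical threshold; the real helper gates on stats_flush_threshold plus a flush_next_time deadline of 2*FLUSH_TIME, and the time check is omitted here):

```c
#include <assert.h>
#include <stdatomic.h>

/* Illustrative threshold; the kernel's stats_flush_threshold plays
 * this role. */
#define STATS_FLUSH_THRESHOLD 64

static atomic_int stats_flush_threshold;
static int flushes;

static void do_flush(void)
{
	atomic_store(&stats_flush_threshold, 0);
	flushes++;	/* cgroup_rstat_flush(root, ...) */
}

/* Stat updaters tick the counter cheaply. */
static void memcg_rstat_updated(void)
{
	atomic_fetch_add(&stats_flush_threshold, 1);
}

/* Callers like workingset_refault() use this: cheap in the common
 * case, a full flush only once enough updates have accumulated. */
static void mem_cgroup_flush_stats_ratelimited(void)
{
	if (atomic_load(&stats_flush_threshold) > STATS_FLUSH_THRESHOLD)
		do_flush();
}
```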
On Thu, Mar 23, 2023 at 9:00 AM Johannes Weiner <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 04:00:36AM +0000, Yosry Ahmed wrote:
> > In workingset_refault(), we call mem_cgroup_flush_stats_delayed() to
> > flush stats within an RCU read section and with sleeping disallowed.
> > Move the call to mem_cgroup_flush_stats_delayed() above the RCU read
> > section and allow sleeping to avoid unnecessarily performing a lot of
> > work without sleeping.
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
> > ---
> >
> > A lot of code paths call into workingset_refault(), so I am not
> > generally sure at all whether it's okay to sleep in all contexts or not.
> > Feedback here would be very helpful.
> >
> > ---
> > mm/workingset.c | 5 ++---
> > 1 file changed, 2 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index 042eabbb43f6..410bc6684ea7 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -406,6 +406,8 @@ void workingset_refault(struct folio *folio, void *shadow)
> > unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
> > eviction <<= bucket_order;
> >
> > + /* Flush stats (and potentially sleep) before holding RCU read lock */
> > + mem_cgroup_flush_stats_delayed(true);
>
> Btw, it might be a good time to rename this while you're in the
> area. delayed suggests this is using a delayed_work, but this is
> actually sometimes flushing directly from the callsite.
>
> What it's doing is ratelimited calls. A better name would be:
>
> mem_cgroup_flush_stats_ratelimited()
Agreed. Will do in the next version.
On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <[email protected]> wrote:
> >
> > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <[email protected]> wrote:
> > > >
> > > [...]
> > > > > >
> > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > >
> > > > > > I think it is possible through the charge/uncharge path:
> > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > added the protection against flushing in an interrupt context for
> > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > >
> > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > >
> > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > up on root memcg, which is why we need to check thresholds.
> > > > >
> > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > >
> > > > Right, but I think this would be orthogonal to this patch series.
> > > >
> > >
> > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > without either breaking a link between mem_cgroup_threshold and
> > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > irqs.
> > >
> > > So, this patch can not be applied before either of those two tasks are
> > > done (and we may find more such scenarios).
> >
> >
> > Could you elaborate why?
> >
> > My understanding is that with an in_task() check to make sure we only
> > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > acquire cgroup_rstat_lock without disabling interrupts.
>
> From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> with irq disabled while other code paths will take cgroup_rstat_lock
> with irq enabled. This is a potential deadlock hazard unless
> cgroup_rstat_lock is always taken with irq disabled.
Oh you are making sure it is not taken in the irq context through
should_skip_flush(). Hmm seems like a hack. Normally it is recommended
to actually remove all such users instead of silently
ignoring/bypassing the functionality.
So, how about removing mem_cgroup_flush_stats() from
mem_cgroup_usage()? It will break the known chain that takes
cgroup_rstat_lock with irqs disabled, and you can add
WARN_ON_ONCE(!in_task()).
On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <[email protected]> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <[email protected]> wrote:
> > > > >
> > > > [...]
> > > > > > >
> > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > >
> > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > >
> > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > >
> > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > >
> > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > >
> > > > > Right, but I think this would be orthogonal to this patch series.
> > > > >
> > > >
> > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > without either breaking a link between mem_cgroup_threshold and
> > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > irqs.
> > > >
> > > > So, this patch can not be applied before either of those two tasks are
> > > > done (and we may find more such scenarios).
> > >
> > >
> > > Could you elaborate why?
> > >
> > > My understanding is that with an in_task() check to make sure we only
> > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > acquire cgroup_rstat_lock without disabling interrupts.
> >
> > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > with irq disabled while other code paths will take cgroup_rstat_lock
> > with irq enabled. This is a potential deadlock hazard unless
> > cgroup_rstat_lock is always taken with irq disabled.
>
> Oh you are making sure it is not taken in the irq context through
> should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> to actually remove all such users instead of silently
> ignoring/bypassing the functionality.
It is a workaround; we simply accept reading stale stats in irq
context instead of performing the expensive flush operation.
>
> So, how about removing mem_cgroup_flush_stats() from
> mem_cgroup_usage(). It will break the known chain which is taking
> cgroup_rstat_lock with irq disabled and you can add
> WARN_ON_ONCE(!in_task()).
This changes the behavior in a more obvious way because:
1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
path is also exercised in a lot of paths outside irq context, so this
will change the behavior for any event thresholds on the root memcg.
With the proposed skipping of flushing in irq context we only change
the behavior in a small subset of cases.
I think we can skip flushing in irq context for now, and separately
deprecate threshold events for the root memcg. When that is done we
can come back and remove should_skip_flush() and add a VM_BUG_ON or
WARN_ON_ONCE instead. WDYT?
2. mem_cgroup_usage() is also used when reading usage from userspace.
This should be an easy workaround though.
On Thu, Mar 23, 2023 at 9:18 AM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <[email protected]> wrote:
> > > > > >
> > > > > [...]
> > > > > > > >
> > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > >
> > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > >
> > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > >
> > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > >
> > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > >
> > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > >
> > > > >
> > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > irqs.
> > > > >
> > > > > So, this patch can not be applied before either of those two tasks are
> > > > > done (and we may find more such scenarios).
> > > >
> > > >
> > > > Could you elaborate why?
> > > >
> > > > My understanding is that with an in_task() check to make sure we only
> > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > acquire cgroup_rstat_lock without disabling interrupts.
> > >
> > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > with irq enabled. This is a potential deadlock hazard unless
> > > cgroup_rstat_lock is always taken with irq disabled.
> >
> > Oh you are making sure it is not taken in the irq context through
> > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > to actually remove all such users instead of silently
> > ignoring/bypassing the functionality.
>
> It is a workaround, we simply accept to read stale stats in irq
> context instead of the expensive flush operation.
>
> >
> > So, how about removing mem_cgroup_flush_stats() from
> > mem_cgroup_usage(). It will break the known chain which is taking
> > cgroup_rstat_lock with irq disabled and you can add
> > WARN_ON_ONCE(!in_task()).
>
> This changes the behavior in a more obvious way because:
> 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> path is also exercised in a lot of paths outside irq context, this
> will change the behavior for any event thresholds on the root memcg.
> With proposed skipped flushing in irq context we only change the
> behavior in a small subset of cases.
>
> I think we can skip flushing in irq context for now, and separately
> deprecate threshold events for the root memcg. When that is done we
> can come back and remove should_skip_flush() and add a VM_BUG_ON or
> WARN_ON_ONCE instead. WDYT?
>
> 2. mem_cgroup_usage() is also used when reading usage from userspace.
> This should be an easy workaround though.
This is a cgroup v1 behavior and to me it is totally reasonable to get
the 2 second stale root's usage. Even if you want to skip flushing in
irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the
rstat core code. This way we will know if other subsystems are doing
the same or not.
On Thu, Mar 23, 2023 at 9:29 AM Shakeel Butt <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 9:18 AM Yosry Ahmed <[email protected]> wrote:
> >
> > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <[email protected]> wrote:
> > > > > > >
> > > > > > [...]
> > > > > > > > >
> > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > >
> > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > >
> > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > >
> > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > >
> > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > >
> > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > >
> > > > > >
> > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > irqs.
> > > > > >
> > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > done (and we may find more such scenarios).
> > > > >
> > > > >
> > > > > Could you elaborate why?
> > > > >
> > > > > My understanding is that with an in_task() check to make sure we only
> > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > >
> > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > with irq enabled. This is a potential deadlock hazard unless
> > > > cgroup_rstat_lock is always taken with irq disabled.
> > >
> > > Oh you are making sure it is not taken in the irq context through
> > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > to actually remove all such users instead of silently
> > > ignoring/bypassing the functionality.
> >
> > It is a workaround, we simply accept to read stale stats in irq
> > context instead of the expensive flush operation.
> >
> > >
> > > So, how about removing mem_cgroup_flush_stats() from
> > > mem_cgroup_usage(). It will break the known chain which is taking
> > > cgroup_rstat_lock with irq disabled and you can add
> > > WARN_ON_ONCE(!in_task()).
> >
> > This changes the behavior in a more obvious way because:
> > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > path is also exercised in a lot of paths outside irq context, this
> > will change the behavior for any event thresholds on the root memcg.
> > With proposed skipped flushing in irq context we only change the
> > behavior in a small subset of cases.
> >
> > I think we can skip flushing in irq context for now, and separately
> > deprecate threshold events for the root memcg. When that is done we
> > can come back and remove should_skip_flush() and add a VM_BUG_ON or
> > WARN_ON_ONCE instead. WDYT?
> >
> > 2. mem_cgroup_usage() is also used when reading usage from userspace.
> > This should be an easy workaround though.
>
> This is a cgroup v1 behavior and to me it is totally reasonable to get
> the 2 second stale root's usage. Even if you want to skip flushing in
> irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the
> rstat core code. This way we will know if other subsystems are doing
> the same or not.
We can do that. Basically in mem_cgroup_usage() have:
    /* Some useful comment */
    if (in_task())
            mem_cgroup_flush_stats();
and in cgroup_rstat_flush() have:
    WARN_ON_ONCE(!in_task());
I am assuming VM_BUG_ON is not used outside mm code.
The only thing that worries me is that if there is another unlikely
path somewhere that flushes stats in irq context we may run into a
deadlock. I am a little bit nervous about not skipping flushing if
!in_task() in cgroup_rstat_flush().
On Thu, Mar 23, 2023 at 9:37 AM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 9:29 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Thu, Mar 23, 2023 at 9:18 AM Yosry Ahmed <[email protected]> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <[email protected]> wrote:
> > > > > > > >
> > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > > >
> > > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > > >
> > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > > >
> > > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > > >
> > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > > >
> > > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > > >
> > > > > > >
> > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > > irqs.
> > > > > > >
> > > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > > done (and we may find more such scenarios).
> > > > > >
> > > > > >
> > > > > > Could you elaborate why?
> > > > > >
> > > > > > My understanding is that with an in_task() check to make sure we only
> > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > > >
> > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > > with irq enabled. This is a potential deadlock hazard unless
> > > > > cgroup_rstat_lock is always taken with irq disabled.
> > > >
> > > > Oh you are making sure it is not taken in the irq context through
> > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > > to actually remove all such users instead of silently
> > > > ignoring/bypassing the functionality.
> > >
> > > It is a workaround, we simply accept to read stale stats in irq
> > > context instead of the expensive flush operation.
> > >
> > > >
> > > > So, how about removing mem_cgroup_flush_stats() from
> > > > mem_cgroup_usage(). It will break the known chain which is taking
> > > > cgroup_rstat_lock with irq disabled and you can add
> > > > WARN_ON_ONCE(!in_task()).
> > >
> > > This changes the behavior in a more obvious way because:
> > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > > path is also exercised in a lot of paths outside irq context, this
> > > will change the behavior for any event thresholds on the root memcg.
> > > With proposed skipped flushing in irq context we only change the
> > > behavior in a small subset of cases.
> > >
> > > I think we can skip flushing in irq context for now, and separately
> > > deprecate threshold events for the root memcg. When that is done we
> > > can come back and remove should_skip_flush() and add a VM_BUG_ON or
> > > WARN_ON_ONCE instead. WDYT?
> > >
> > > 2. mem_cgroup_usage() is also used when reading usage from userspace.
> > > This should be an easy workaround though.
> >
> > This is a cgroup v1 behavior and to me it is totally reasonable to get
> > the 2 second stale root's usage. Even if you want to skip flushing in
> > irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the
> > rstat core code. This way we will know if other subsystems are doing
> > the same or not.
>
> We can do that. Basically in mem_cgroup_usage() have:
>
> /* Some useful comment */
> if (in_task())
> mem_cgroup_flush_stats();
>
> and in cgroup_rstat_flush() have:
> WARN_ON_ONCE(!in_task());
>
> I am assuming VM_BUG_ON is not used outside mm code.
>
> The only thing that worries me is that if there is another unlikely
> path somewhere that flushes stats in irq context we may run into a
> deadlock. I am a little bit nervous about not skipping flushing if
> !in_task() in cgroup_rstat_flush().
I think it is a good thing. We will find such scenarios and fix those
instead of hiding them forever or keeping the door open for new such
scenarios.
On Thu, Mar 23, 2023 at 9:45 AM Shakeel Butt <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 9:37 AM Yosry Ahmed <[email protected]> wrote:
> >
> > On Thu, Mar 23, 2023 at 9:29 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 9:18 AM Yosry Ahmed <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <[email protected]> wrote:
> > > > > > > >
> > > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > [...]
> > > > > > > > > > >
> > > > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > > > >
> > > > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > > > >
> > > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > > > >
> > > > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > > > >
> > > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > > > >
> > > > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > > > irqs.
> > > > > > > >
> > > > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > > > done (and we may find more such scenarios).
> > > > > > >
> > > > > > >
> > > > > > > Could you elaborate why?
> > > > > > >
> > > > > > > My understanding is that with an in_task() check to make sure we only
> > > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > > > >
> > > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > > > with irq enabled. This is a potential deadlock hazard unless
> > > > > > cgroup_rstat_lock is always taken with irq disabled.
> > > > >
> > > > > Oh you are making sure it is not taken in the irq context through
> > > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > > > to actually remove all such users instead of silently
> > > > > ignoring/bypassing the functionality.
> > > >
> > > > It is a workaround, we simply accept to read stale stats in irq
> > > > context instead of the expensive flush operation.
> > > >
> > > > >
> > > > > So, how about removing mem_cgroup_flush_stats() from
> > > > > mem_cgroup_usage(). It will break the known chain which is taking
> > > > > cgroup_rstat_lock with irq disabled and you can add
> > > > > WARN_ON_ONCE(!in_task()).
> > > >
> > > > This changes the behavior in a more obvious way because:
> > > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > > > path is also exercised in a lot of paths outside irq context, this
> > > > will change the behavior for any event thresholds on the root memcg.
> > > > With proposed skipped flushing in irq context we only change the
> > > > behavior in a small subset of cases.
> > > >
> > > > I think we can skip flushing in irq context for now, and separately
> > > > deprecate threshold events for the root memcg. When that is done we
> > > > can come back and remove should_skip_flush() and add a VM_BUG_ON or
> > > > WARN_ON_ONCE instead. WDYT?
> > > >
> > > > 2. mem_cgroup_usage() is also used when reading usage from userspace.
> > > > This should be an easy workaround though.
> > >
> > > This is a cgroup v1 behavior and to me it is totally reasonable to get
> > > the 2 second stale root's usage. Even if you want to skip flushing in
> > > irq, do that in the memcg code and keep VM_BUG_ON/WARN_ON_ONCE in the
> > > rstat core code. This way we will know if other subsystems are doing
> > > the same or not.
> >
> > We can do that. Basically in mem_cgroup_usage() have:
> >
> > /* Some useful comment */
> > if (in_task())
> > mem_cgroup_flush_stats();
> >
> > and in cgroup_rstat_flush() have:
> > WARN_ON_ONCE(!in_task());
> >
> > I am assuming VM_BUG_ON is not used outside mm code.
> >
> > The only thing that worries me is that if there is another unlikely
> > path somewhere that flushes stats in irq context we may run into a
> > deadlock. I am a little bit nervous about not skipping flushing if
> > !in_task() in cgroup_rstat_flush().
>
> I think it is a good thing. We will find such scenarios and fix those
> instead of hiding them forever or keeping the door open for new such
> scenarios.
Sure, I can do that in the next version. I will include a patch that
adds an in_task() check to mem_cgroup_usage() before this one. Since
BUG_ON() is discouraged and VM_BUG_ON() is mm specific, I guess we are
left with WARN_ON_ONCE() for the rstat core code, right?
Thanks Shakeel. Any other thoughts I should address for the next version?
On Thu, Mar 23, 2023 at 09:01:12AM -0700, Yosry Ahmed wrote:
> On Thu, Mar 23, 2023 at 8:56 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> > > @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
> > > return;
> > >
> > > flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > > - cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> > > + cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
> >
> > How is it safe to call this with may_sleep=true when it's holding the
> > stats_flush_lock?
>
> stats_flush_lock is always called with trylock, it is only used today
> so that we can skip flushing if another cpu is already doing a flush
> (which is not 100% correct as they may have not finished flushing yet,
> but that's orthogonal here). So I think it should be safe to sleep as
> no one can be blocked waiting for this spinlock.
I see. It still cannot sleep while the lock is held, though, because
preemption is disabled. Make sure you have all lock debugging on while
testing this.
> Perhaps it would be better semantically to replace the spinlock with
> an atomic test and set, instead of having a lock that can only be used
> with trylock?
It could be helpful to clarify what stats_flush_lock is protecting
first. Keep in mind that locks should protect data, not code paths.
Right now it's doing multiple things:
1. It protects updates to stats_flush_threshold
2. It protects updates to flush_next_time
3. It serializes calls to cgroup_rstat_flush() based on those ratelimits
However,
1. stats_flush_threshold is already an atomic
2. flush_next_time is not atomic. The writer is locked, but the reader
is lockless. If the reader races with a flush, you could see this:
if (time_after(jiffies, flush_next_time))
                                        spin_trylock()
                                        flush_next_time = now + delay
                                        flush()
                                        spin_unlock()
spin_trylock()
flush_next_time = now + delay
flush()
spin_unlock()
which means we already can get flushes at a higher frequency than
FLUSH_TIME during races. But it isn't really a problem.
The reader could also see garbled partial updates, so it needs at
least READ_ONCE and WRITE_ONCE protection.
3. Serializing cgroup_rstat_flush() calls against the ratelimit
factors is currently broken because of the race in 2. But the race
is actually harmless, all we might get is the occasional earlier
flush. If there is no delta, the flush won't do much. And if there
is, the flush is justified.
In summary, it seems to me the lock can be ditched altogether. All the
code needs is READ_ONCE/WRITE_ONCE around flush_next_time.
On Thu, Mar 23, 2023 at 09:17:33AM -0700, Yosry Ahmed wrote:
> On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <[email protected]> wrote:
> > > > > >
> > > > > [...]
> > > > > > > >
> > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > >
> > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > >
> > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > >
> > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > >
> > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > >
> > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > >
> > > > >
> > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > irqs.
> > > > >
> > > > > So, this patch can not be applied before either of those two tasks are
> > > > > done (and we may find more such scenarios).
> > > >
> > > >
> > > > Could you elaborate why?
> > > >
> > > > My understanding is that with an in_task() check to make sure we only
> > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > acquire cgroup_rstat_lock without disabling interrupts.
> > >
> > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > with irq enabled. This is a potential deadlock hazard unless
> > > cgroup_rstat_lock is always taken with irq disabled.
> >
> > Oh you are making sure it is not taken in the irq context through
> > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > to actually remove all such users instead of silently
> > ignoring/bypassing the functionality.
+1
It shouldn't silently skip the requested operation, rather it
shouldn't be requested from an incompatible context.
> > So, how about removing mem_cgroup_flush_stats() from
> > mem_cgroup_usage(). It will break the known chain which is taking
> > cgroup_rstat_lock with irq disabled and you can add
> > WARN_ON_ONCE(!in_task()).
>
> This changes the behavior in a more obvious way because:
> 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> path is also exercised in a lot of paths outside irq context, this
> will change the behavior for any event thresholds on the root memcg.
> With proposed skipped flushing in irq context we only change the
> behavior in a small subset of cases.
Can you do
    /* Note: stale usage data when called from irq context!! */
    if (in_task())
            mem_cgroup_flush_stats()
directly in the callsite? Maybe even include the whole callchain in
the comment that's currently broken and needs fixing/removing.
On Thu, Mar 23, 2023 at 10:27 AM Johannes Weiner <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 09:01:12AM -0700, Yosry Ahmed wrote:
> > On Thu, Mar 23, 2023 at 8:56 AM Johannes Weiner <[email protected]> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> > > > @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
> > > > return;
> > > >
> > > > flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > > > - cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> > > > + cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
> > >
> > > How is it safe to call this with may_sleep=true when it's holding the
> > > stats_flush_lock?
> >
> > stats_flush_lock is always called with trylock, it is only used today
> > so that we can skip flushing if another cpu is already doing a flush
> > (which is not 100% correct as they may have not finished flushing yet,
> > but that's orthogonal here). So I think it should be safe to sleep as
> > no one can be blocked waiting for this spinlock.
>
> I see. It still cannot sleep while the lock is held, though, because
> preemption is disabled. Make sure you have all lock debugging on while
> testing this.
Thanks for pointing this out, will do.
>
> > Perhaps it would be better semantically to replace the spinlock with
> > an atomic test and set, instead of having a lock that can only be used
> > with trylock?
>
> It could be helpful to clarify what stats_flush_lock is protecting
> first. Keep in mind that locks should protect data, not code paths.
>
> Right now it's doing multiple things:
>
> 1. It protects updates to stats_flush_threshold
> 2. It protects updates to flush_next_time
> 3. It serializes calls to cgroup_rstat_flush() based on those ratelimits
>
> However,
>
> 1. stats_flush_threshold is already an atomic
>
> 2. flush_next_time is not atomic. The writer is locked, but the reader
> is lockless. If the reader races with a flush, you could see this:
>
> if (time_after(jiffies, flush_next_time))
> spin_trylock()
> flush_next_time = now + delay
> flush()
> spin_unlock()
> spin_trylock()
> flush_next_time = now + delay
> flush()
> spin_unlock()
>
> which means we already can get flushes at a higher frequency than
> FLUSH_TIME during races. But it isn't really a problem.
>
> The reader could also see garbled partial updates, so it needs at
> least READ_ONCE and WRITE_ONCE protection.
>
> 3. Serializing cgroup_rstat_flush() calls against the ratelimit
> factors is currently broken because of the race in 2. But the race
> is actually harmless, all we might get is the occasional earlier
> flush. If there is no delta, the flush won't do much. And if there
> is, the flush is justified.
>
> In summary, it seems to me the lock can be ditched altogether. All the
> code needs is READ_ONCE/WRITE_ONCE around flush_next_time.
Thanks a lot for this analysis. I agree that the lock can be removed
with proper READ_ONCE/WRITE_ONCE, but I think there is another purpose
of the lock that we are missing here.
I think one other purpose of the lock is avoiding a thundering herd
problem on cgroup_rstat_lock, particularly from reclaim context, as
mentioned by the log of commit aa48e47e3906 ("memcg: infrastructure
to flush memcg stats").
While testing, I did notice that removing this lock indeed causes a
thundering herd problem if we have a lot of concurrent reclaimers. The
trylock makes sure we abort immediately if someone else is flushing --
which is not ideal because that flusher might have just started, and
we may end up reading stale data anyway.
This is why I suggested replacing the lock by an atomic, and do
something like this if we want to maintain the current behavior:
static void __mem_cgroup_flush_stats(void)
{
        ...
        if (atomic_xchg(&ongoing_flush, 1))
                return;
        ...
        atomic_set(&ongoing_flush, 0);
}
Alternatively, if we want to change the behavior and wait for the
concurrent flusher to finish flushing, we can maybe spin until
ongoing_flush goes back to 0 and then return:
static void __mem_cgroup_flush_stats(void)
{
        ...
        if (atomic_xchg(&ongoing_flush, 1)) {
                /* wait until the ongoing flusher finishes to get updated stats */
                while (atomic_read(&ongoing_flush))
                        ;
                return;
        }
        /* flush the stats ourselves */
        ...
        atomic_set(&ongoing_flush, 0);
}
WDYT?
On Thu, Mar 23, 2023 at 10:33 AM Johannes Weiner <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 09:17:33AM -0700, Yosry Ahmed wrote:
> > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <[email protected]> wrote:
> > > > > > >
> > > > > > [...]
> > > > > > > > >
> > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > >
> > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > >
> > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > >
> > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > >
> > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > >
> > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > >
> > > > > >
> > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > irqs.
> > > > > >
> > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > done (and we may find more such scenarios).
> > > > >
> > > > >
> > > > > Could you elaborate why?
> > > > >
> > > > > My understanding is that with an in_task() check to make sure we only
> > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > >
> > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > with irq enabled. This is a potential deadlock hazard unless
> > > > cgroup_rstat_lock is always taken with irq disabled.
> > >
> > > Oh you are making sure it is not taken in the irq context through
> > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > to actually remove all such users instead of silently
> > > ignoring/bypassing the functionality.
>
> +1
>
> It shouldn't silently skip the requested operation, rather it
> shouldn't be requested from an incompatible context.
>
> > > So, how about removing mem_cgroup_flush_stats() from
> > > mem_cgroup_usage(). It will break the known chain which is taking
> > > cgroup_rstat_lock with irq disabled and you can add
> > > WARN_ON_ONCE(!in_task()).
> >
> > This changes the behavior in a more obvious way because:
> > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > path is also exercised in a lot of paths outside irq context, this
> > will change the behavior for any event thresholds on the root memcg.
> > With proposed skipped flushing in irq context we only change the
> > behavior in a small subset of cases.
>
> Can you do
>
> /* Note: stale usage data when called from irq context!! */
> if (in_task())
> mem_cgroup_flush_stats()
>
> directly in the callsite? Maybe even include the whole callchain in
> the comment that's currently broken and needs fixing/removing.
Yeah, we can do that in mem_cgroup_usage(), which is the only context
that I am aware of that may flush from irq context. We can also add
WARN_ON_ONCE(!in_task()) in the rstat core flushing code to catch any
other code paths that we are not aware of -- which may result in a
deadlock, but hopefully if there is a violation it will be caught soon
enough.
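[Editorial aside: put together, the plan discussed above amounts to roughly the following. Kernel-style pseudocode, not a tested patch; the placement inside mem_cgroup_usage() and the core flush path is as discussed in this thread.]

```c
/* In mem_cgroup_usage(), for the root memcg case: */
	/*
	 * Note: stale usage data when called from irq context!
	 * Known chain: memcg_check_events() -> mem_cgroup_threshold()
	 *              -> mem_cgroup_usage()
	 */
	if (in_task())
		mem_cgroup_flush_stats();

/* In the rstat core flushing code: */
	/*
	 * Catch any remaining irq-context flushers we are not aware of;
	 * they could deadlock once cgroup_rstat_lock no longer disables
	 * interrupts.
	 */
	WARN_ON_ONCE(!in_task());
```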
On Thu, Mar 23, 2023 at 11:09:30AM -0700, Yosry Ahmed wrote:
> On Thu, Mar 23, 2023 at 10:33 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Thu, Mar 23, 2023 at 09:17:33AM -0700, Yosry Ahmed wrote:
> > > On Thu, Mar 23, 2023 at 9:10 AM Shakeel Butt <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 8:46 AM Shakeel Butt <[email protected]> wrote:
> > > > >
> > > > > On Thu, Mar 23, 2023 at 8:43 AM Yosry Ahmed <[email protected]> wrote:
> > > > > >
> > > > > > On Thu, Mar 23, 2023 at 8:40 AM Shakeel Butt <[email protected]> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 23, 2023 at 6:36 AM Yosry Ahmed <[email protected]> wrote:
> > > > > > > >
> > > > > > > [...]
> > > > > > > > > >
> > > > > > > > > > > 2. Are we really calling rstat flush in irq context?
> > > > > > > > > >
> > > > > > > > > > I think it is possible through the charge/uncharge path:
> > > > > > > > > > memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage(). I
> > > > > > > > > > added the protection against flushing in an interrupt context for
> > > > > > > > > > future callers as well, as it may cause a deadlock if we don't disable
> > > > > > > > > > interrupts when acquiring cgroup_rstat_lock.
> > > > > > > > > >
> > > > > > > > > > > 3. The mem_cgroup_flush_stats() call in mem_cgroup_usage() is only
> > > > > > > > > > > done for root memcg. Why is mem_cgroup_threshold() interested in root
> > > > > > > > > > > memcg usage? Why not ignore root memcg in mem_cgroup_threshold() ?
> > > > > > > > > >
> > > > > > > > > > I am not sure, but the code looks like event notifications may be set
> > > > > > > > > > up on root memcg, which is why we need to check thresholds.
> > > > > > > > >
> > > > > > > > > This is something we should deprecate as root memcg's usage is ill defined.
> > > > > > > >
> > > > > > > > Right, but I think this would be orthogonal to this patch series.
> > > > > > > >
> > > > > > >
> > > > > > > I don't think we can make cgroup_rstat_lock a non-irq-disabling lock
> > > > > > > without either breaking a link between mem_cgroup_threshold and
> > > > > > > cgroup_rstat_lock or make mem_cgroup_threshold work without disabling
> > > > > > > irqs.
> > > > > > >
> > > > > > > So, this patch can not be applied before either of those two tasks are
> > > > > > > done (and we may find more such scenarios).
> > > > > >
> > > > > >
> > > > > > Could you elaborate why?
> > > > > >
> > > > > > My understanding is that with an in_task() check to make sure we only
> > > > > > acquire cgroup_rstat_lock from non-irq context it should be fine to
> > > > > > acquire cgroup_rstat_lock without disabling interrupts.
> > > > >
> > > > > From mem_cgroup_threshold() code path, cgroup_rstat_lock will be taken
> > > > > with irq disabled while other code paths will take cgroup_rstat_lock
> > > > > with irq enabled. This is a potential deadlock hazard unless
> > > > > cgroup_rstat_lock is always taken with irq disabled.
> > > >
> > > > Oh you are making sure it is not taken in the irq context through
> > > > should_skip_flush(). Hmm seems like a hack. Normally it is recommended
> > > > to actually remove all such users instead of silently
> > > > ignoring/bypassing the functionality.
> >
> > +1
> >
> > It shouldn't silently skip the requested operation, rather it
> > shouldn't be requested from an incompatible context.
> >
> > > > So, how about removing mem_cgroup_flush_stats() from
> > > > mem_cgroup_usage(). It will break the known chain which is taking
> > > > cgroup_rstat_lock with irq disabled and you can add
> > > > WARN_ON_ONCE(!in_task()).
> > >
> > > This changes the behavior in a more obvious way because:
> > > 1. The memcg_check_events()->mem_cgroup_threshold()->mem_cgroup_usage()
> > > path is also exercised in a lot of paths outside irq context, this
> > > will change the behavior for any event thresholds on the root memcg.
> > > With proposed skipped flushing in irq context we only change the
> > > behavior in a small subset of cases.
> >
> > Can you do
> >
> > /* Note: stale usage data when called from irq context!! */
> > if (in_task())
> > mem_cgroup_flush_stats()
> >
> > directly in the callsite? Maybe even include the whole callchain in
> > the comment that's currently broken and needs fixing/removing.
>
> Yeah, we can do that in mem_cgroup_usage(), which is the only context
> that I am aware of that may flush from irq context. We can also add
> WARN_ON_ONCE(!in_task()) in the rstat core flushing code to catch any
> other code paths that we are not aware of -- which may result in a
> deadlock, but hopefully if there is a violation it will be caught soon
> enough.
That sounds good to me, thanks!
On Thu, Mar 23, 2023 at 9:52 AM Yosry Ahmed <[email protected]> wrote:
>
[...]
>
> Sure, I can do that in the next version. I will include a patch that
> adds an in_task() check to mem_cgroup_usage() before this one. Since
> BUG_ON() is discouraged and VM_BUG_ON() is mm specific, I guess we are
> left with WARN_ON_ONCE() for the rstat core code, right?
>
> Thanks Shakeel. Any other thoughts I should address for the next version?
This is all for now (at least for this patch).
On Thu, Mar 23, 2023 at 11:08 AM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Mar 23, 2023 at 10:27 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Thu, Mar 23, 2023 at 09:01:12AM -0700, Yosry Ahmed wrote:
> > > On Thu, Mar 23, 2023 at 8:56 AM Johannes Weiner <[email protected]> wrote:
> > > >
> > > > On Thu, Mar 23, 2023 at 04:00:34AM +0000, Yosry Ahmed wrote:
> > > > > @@ -644,26 +644,26 @@ static void __mem_cgroup_flush_stats(void)
> > > > > return;
> > > > >
> > > > > flush_next_time = jiffies_64 + 2*FLUSH_TIME;
> > > > > - cgroup_rstat_flush(root_mem_cgroup->css.cgroup, false);
> > > > > + cgroup_rstat_flush(root_mem_cgroup->css.cgroup, may_sleep);
> > > >
> > > > How is it safe to call this with may_sleep=true when it's holding the
> > > > stats_flush_lock?
> > >
> > > stats_flush_lock is always taken with trylock; it is only used today
> > > so that we can skip flushing if another cpu is already doing a flush
> > > (which is not 100% correct as they may have not finished flushing yet,
> > > but that's orthogonal here). So I think it should be safe to sleep as
> > > no one can be blocked waiting for this spinlock.
> >
> > I see. It still cannot sleep while the lock is held, though, because
> > preemption is disabled. Make sure you have all lock debugging on while
> > testing this.
>
> Thanks for pointing this out, will do.
>
> >
> > > Perhaps it would be better semantically to replace the spinlock with
> > > an atomic test and set, instead of having a lock that can only be used
> > > with trylock?
> >
> > It could be helpful to clarify what stats_flush_lock is protecting
> > first. Keep in mind that locks should protect data, not code paths.
> >
> > Right now it's doing multiple things:
> >
> > 1. It protects updates to stats_flush_threshold
> > 2. It protects updates to flush_next_time
> > 3. It serializes calls to cgroup_rstat_flush() based on those ratelimits
> >
> > However,
> >
> > 1. stats_flush_threshold is already an atomic
> >
> > 2. flush_next_time is not atomic. The writer is locked, but the reader
> > is lockless. If the reader races with a flush, you could see this:
> >
> > if (time_after(jiffies, flush_next_time))
> > spin_trylock()
> > flush_next_time = now + delay
> > flush()
> > spin_unlock()
> > spin_trylock()
> > flush_next_time = now + delay
> > flush()
> > spin_unlock()
> >
> > which means we already can get flushes at a higher frequency than
> > FLUSH_TIME during races. But it isn't really a problem.
> >
> > The reader could also see garbled partial updates, so it needs at
> > least READ_ONCE and WRITE_ONCE protection.
> >
> > 3. Serializing cgroup_rstat_flush() calls against the ratelimit
> > factors is currently broken because of the race in 2. But the race
> > is actually harmless, all we might get is the occasional earlier
> > flush. If there is no delta, the flush won't do much. And if there
> > is, the flush is justified.
> >
> > In summary, it seems to me the lock can be ditched altogether. All the
> > code needs is READ_ONCE/WRITE_ONCE around flush_next_time.
>
> Thanks a lot for this analysis. I agree that the lock can be removed
> with proper READ_ONCE/WRITE_ONCE, but I think there is another purpose
> of the lock that we are missing here.
>
> I think one other purpose of the lock is avoiding a thundering herd
> problem on cgroup_rstat_lock, particularly from reclaim context, as
> mentioned by the log of commit aa48e47e3906 ("memcg: infrastructure
> to flush memcg stats").
>
> While testing, I did notice that removing this lock indeed causes a
> thundering herd problem if we have a lot of concurrent reclaimers. The
> trylock makes sure we abort immediately if someone else is flushing --
> which is not ideal because that flusher might have just started, and
> we may end up reading stale data anyway.
>
> This is why I suggested replacing the lock with an atomic, and doing
> something like this if we want to maintain the current behavior:
>
> static void __mem_cgroup_flush_stats(void)
> {
> ...
> if (atomic_xchg(&ongoing_flush, 1))
> return;
> ...
> atomic_set(&ongoing_flush, 0);
> }
>
> Alternatively, if we want to change the behavior and wait for the
> concurrent flusher to finish flushing, we can maybe spin until
> ongoing_flush goes back to 0 and then return:
>
> static void __mem_cgroup_flush_stats(void)
> {
> ...
> if (atomic_xchg(&ongoing_flush, 1)) {
> /* wait until the ongoing flusher finishes to get updated stats */
> while (atomic_read(&ongoing_flush)) {}
> return;
> }
> /* flush the stats ourselves */
> ...
> atomic_set(&ongoing_flush, 0);
> }
>
> WDYT?
I would go with your first approach, i.e. no spinning.
Hello,
On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
> Currently, when sleeping is not allowed during rstat flushing, we hold
> the global rstat lock with interrupts disabled throughout the entire
> flush operation. Flushing is an O(# cgroups * # cpus) operation, and
> having interrupts disabled throughout is dangerous.
>
> For some contexts, we may not want to sleep, but can be interrupted
> (e.g. while holding a spinlock or RCU read lock). As such, do not
> disable interrupts throughout rstat flushing, only when holding the
> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> interrupts disabled to a series of O(# cgroups) durations.
>
> Furthermore, if a cpu is spinning waiting for the global rstat lock, it
> doesn't need to spin with interrupts disabled anymore.
I'm generally not a fan of big spin locks w/o irq protection. They too often
become a source of unpredictable latency spikes. As you said, the global
rstat lock can be held for quite a while. Removing _irq makes irq latency
better on the CPU but on the other hand it makes a lot more likely that the
lock is gonna be held even longer, possibly significantly so depending on
the configuration and workload which will in turn stall other CPUs waiting
for the lock. Sure, irqs are being serviced quicker but if the cost is more
and longer !irq context multi-cpu stalls, what's the point?
I don't think there's anything which requires the global lock to be held
throughout the entire flushing sequence and irq needs to be disabled when
grabbing the percpu lock anyway, so why not just release the global lock on
CPU boundaries instead? We don't really lose anything significant that way.
The durations of irq disabled sections are still about the same as in the
currently proposed solution at O(# cgroups) and we avoid the risk of holding
the global lock for too long unexpectedly from getting hit repeatedly by
irqs while holding the global lock.
Thanks.
--
tejun
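[Editorial aside: a rough sketch of the rework Tejun is suggesting, i.e. dropping the global lock at each CPU boundary. This is hypothetical kernel-style pseudocode; the exact structure of cgroup_rstat_flush_locked() and the elided per-cpu flush body are assumptions, not actual kernel code.]

```c
/* hypothetical reshape of cgroup_rstat_flush_locked() */
static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
{
	int cpu;

	lockdep_assert_held(&cgroup_rstat_lock);

	for_each_possible_cpu(cpu) {
		/* percpu lock still taken with irqs disabled; both the
		 * percpu and global hold times are now O(# cgroups) */
		... flush this cpu's updated tree ...

		/* give irqs and other waiters a window at each cpu
		 * boundary instead of holding the global lock for the
		 * whole O(# cgroups * # cpus) walk */
		spin_unlock_irq(&cgroup_rstat_lock);
		if (may_sleep)
			cond_resched();
		spin_lock_irq(&cgroup_rstat_lock);
	}
}
```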
On Thu, Mar 23, 2023 at 6:39 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
> > Currently, when sleeping is not allowed during rstat flushing, we hold
> > the global rstat lock with interrupts disabled throughout the entire
> > flush operation. Flushing is an O(# cgroups * # cpus) operation, and
> > having interrupts disabled throughout is dangerous.
> >
> > For some contexts, we may not want to sleep, but can be interrupted
> > (e.g. while holding a spinlock or RCU read lock). As such, do not
> > disable interrupts throughout rstat flushing, only when holding the
> > percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> > interrupts disabled to a series of O(# cgroups) durations.
> >
> > Furthermore, if a cpu is spinning waiting for the global rstat lock, it
> > doesn't need to spin with interrupts disabled anymore.
>
> I'm generally not a fan of big spin locks w/o irq protection. They too often
> become a source of unpredictable latency spikes. As you said, the global
> rstat lock can be held for quite a while. Removing _irq makes irq latency
> better on the CPU but on the other hand it makes a lot more likely that the
> lock is gonna be held even longer, possibly significantly so depending on
> the configuration and workload which will in turn stall other CPUs waiting
> for the lock. Sure, irqs are being serviced quicker but if the cost is more
> and longer !irq context multi-cpu stalls, what's the point?
>
> I don't think there's anything which requires the global lock to be held
> throughout the entire flushing sequence and irq needs to be disabled when
> grabbing the percpu lock anyway, so why not just release the global lock on
> CPU boundaries instead? We don't really lose anything significant that way.
> The durations of irq disabled sections are still about the same as in the
> currently proposed solution at O(# cgroups) and we avoid the risk of holding
> the global lock for too long unexpectedly from getting hit repeatedly by
> irqs while holding the global lock.
Thanks for taking a look!
I think a problem with this approach is that we risk having to contend
for the global lock at every CPU boundary in atomic contexts. Right
now we contend for the global lock once, and once we have it we go
through all CPUs to flush, only having to contend with updates taking
the percpu locks at this point. If we unconditionally release &
reacquire the global lock at every CPU boundary then we may contend
for it much more frequently with concurrent flushers.
On the memory controller side, concurrent flushers are already held
back to avoid a thundering herd problem on the global rstat lock, but
flushers from outside the memory controller can still compete together
or with a flusher from the memory controller. In this case, we risk
contending the global lock more and concurrent flushers taking a
longer period of time, which may end up causing multi-CPU stalls
anyway, right? Also, if we keep _irq when spinning for the lock, then
concurrent flushers still need to spin with irq disabled -- another
problem that this series tries to fix.
This is particularly a problem for flushers in atomic contexts. There
is a flusher in mem_cgroup_wb_stats() that flushes while holding
another spinlock, and a flusher in mem_cgroup_usage() that flushes
with irqs disabled. If flushing takes a longer period of time due to
repeated lock contention, it affects such atomic context negatively.
I am not sure how all of this matters in practice, it depends heavily
on the workloads and the configuration like you mentioned. I am just
pointing out the potential disadvantages of reacquiring the lock at
every CPU boundary in atomic contexts.
>
> Thanks.
>
> --
> tejun
On 3/24/23 03:22, Yosry Ahmed wrote:
> On Thu, Mar 23, 2023 at 6:39 PM Tejun Heo <[email protected]> wrote:
>> Hello,
>>
>> On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
>>> Currently, when sleeping is not allowed during rstat flushing, we hold
>>> the global rstat lock with interrupts disabled throughout the entire
>>> flush operation. Flushing is an O(# cgroups * # cpus) operation, and
>>> having interrupts disabled throughout is dangerous.
>>>
>>> For some contexts, we may not want to sleep, but can be interrupted
>>> (e.g. while holding a spinlock or RCU read lock). As such, do not
>>> disable interrupts throughout rstat flushing, only when holding the
>>> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
>>> interrupts disabled to a series of O(# cgroups) durations.
>>>
>>> Furthermore, if a cpu is spinning waiting for the global rstat lock, it
>>> doesn't need to spin with interrupts disabled anymore.
>> I'm generally not a fan of big spin locks w/o irq protection. They too often
>> become a source of unpredictable latency spikes. As you said, the global
>> rstat lock can be held for quite a while. Removing _irq makes irq latency
>> better on the CPU but on the other hand it makes a lot more likely that the
>> lock is gonna be held even longer, possibly significantly so depending on
>> the configuration and workload which will in turn stall other CPUs waiting
>> for the lock. Sure, irqs are being serviced quicker but if the cost is more
>> and longer !irq context multi-cpu stalls, what's the point?
>>
>> I don't think there's anything which requires the global lock to be held
>> throughout the entire flushing sequence and irq needs to be disabled when
>> grabbing the percpu lock anyway, so why not just release the global lock on
>> CPU boundaries instead? We don't really lose anything significant that way.
>> The durations of irq disabled sections are still about the same as in the
>> currently proposed solution at O(# cgroups) and we avoid the risk of holding
>> the global lock for too long unexpectedly from getting hit repeatedly by
>> irqs while holding the global lock.
> Thanks for taking a look!
>
> I think a problem with this approach is that we risk having to contend
> for the global lock at every CPU boundary in atomic contexts. Right
Isn't it the plan to just do a trylock in atomic contexts so that it
won't get stuck spinning for the lock for an indeterminate amount of time?
> now we contend for the global lock once, and once we have it we go
> through all CPUs to flush, only having to contend with updates taking
> the percpu locks at this point. If we unconditionally release &
> reacquire the global lock at every CPU boundary then we may contend
> for it much more frequently with concurrent flushers.
Note that with the use of qspinlock in all the major arches, the impact
of thundering herds of lockers is much less serious than before. There
is certainly some overhead in doing multiple lock acquires and
releases, but it shouldn't be too excessive.
I am all in for reducing lock hold time as much as possible as it will
improve the response time.
Cheers,
Longman
On Fri, Mar 24, 2023 at 7:12 AM Waiman Long <[email protected]> wrote:
>
> On 3/24/23 03:22, Yosry Ahmed wrote:
> > On Thu, Mar 23, 2023 at 6:39 PM Tejun Heo <[email protected]> wrote:
> >> Hello,
> >>
> >> On Thu, Mar 23, 2023 at 04:00:31AM +0000, Yosry Ahmed wrote:
> >>> Currently, when sleeping is not allowed during rstat flushing, we hold
> >>> the global rstat lock with interrupts disabled throughout the entire
> >>> flush operation. Flushing is an O(# cgroups * # cpus) operation, and
> >>> having interrupts disabled throughout is dangerous.
> >>>
> >>> For some contexts, we may not want to sleep, but can be interrupted
> >>> (e.g. while holding a spinlock or RCU read lock). As such, do not
> >>> disable interrupts throughout rstat flushing, only when holding the
> >>> percpu lock. This breaks down the O(# cgroups * # cpus) duration with
> >>> interrupts disabled to a series of O(# cgroups) durations.
> >>>
> >>> Furthermore, if a cpu is spinning waiting for the global rstat lock, it
> >>> doesn't need to spin with interrupts disabled anymore.
> >> I'm generally not a fan of big spin locks w/o irq protection. They too often
> >> become a source of unpredictable latency spikes. As you said, the global
> >> rstat lock can be held for quite a while. Removing _irq makes irq latency
> >> better on the CPU but on the other hand it makes a lot more likely that the
> >> lock is gonna be held even longer, possibly significantly so depending on
> >> the configuration and workload which will in turn stall other CPUs waiting
> >> for the lock. Sure, irqs are being serviced quicker but if the cost is more
> >> and longer !irq context multi-cpu stalls, what's the point?
> >>
> >> I don't think there's anything which requires the global lock to be held
> >> throughout the entire flushing sequence and irq needs to be disabled when
> >> grabbing the percpu lock anyway, so why not just release the global lock on
> >> CPU boundaries instead? We don't really lose anything significant that way.
> >> The durations of irq disabled sections are still about the same as in the
> >> currently proposed solution at O(# cgroups) and we avoid the risk of holding
> >> the global lock for too long unexpectedly from getting hit repeatedly by
> >> irqs while holding the global lock.
> > Thanks for taking a look!
> >
> > I think a problem with this approach is that we risk having to contend
> > for the global lock at every CPU boundary in atomic contexts. Right
> Isn't it the plan to just do a trylock in atomic contexts so that it
> won't get stuck spinning for the lock for an indeterminate amount of time?
Not exactly. On the memory controller side, we currently only allow
one flusher at a time and force all flushers to flush the full
hierarchy, such that concurrent flushers can skip. This is done for
both atomic and non-atomic contexts.
For flushers outside the memory controller, they can still contend the
lock among themselves or with flushers in the memory controller. In
this case, instead of contending the lock once, they contend it at
each CPU boundary.
> > now we contend for the global lock once, and once we have it we go
> > through all CPUs to flush, only having to contend with updates taking
> > the percpu locks at this point. If we unconditionally release &
> > reacquire the global lock at every CPU boundary then we may contend
> > for it much more frequently with concurrent flushers.
>
> Note that with the use of qspinlock in all the major arches, the impact
> of thundering herds of lockers is much less serious than before. There
> is certainly some overhead in doing multiple lock acquires and
> releases, but it shouldn't be too excessive.
I ran some tests to measure this. Since I am using a cgroup v1
hierarchy, I cannot reproduce contention between memory controller
flushers and non-memory controller flushers, so I removed the "one
memory flusher only" restriction to have concurrent memory flushers
compete for the global rstat lock to measure the impact:
Before (only one flusher allowed to compete for the global rstat lock):
---cgroup_rstat_flush
|
--1.27%--cgroup_rstat_flush_locked
|
--0.94%--mem_cgroup_css_rstat_flush
After (concurrent flushers allowed to compete for the global rstat lock):
---cgroup_rstat_flush
|
|--4.94%--_raw_spin_lock
| |
| --4.94%--queued_spin_lock_slowpath
|
--0.92%--cgroup_rstat_flush_locked
|
--0.56%--mem_cgroup_css_rstat_flush
This was run with 20 processes trying to flush concurrently, so it may
be excessive, but it seems like in this case lock contention makes a
significant difference.
Again, this is not a regression for non-atomic flushers, as they
already compete for the lock at every CPU boundary, but for atomic
flushers that don't give up the lock at all today, it would be a
regression to start competing for the lock at every CPU boundary. This
patch series aims to minimize the number of atomic flushers (bringing
them down to two, one of which is not common), so this may be fine.
My main concern is that for some flushers that this series converts
from atomic to non-atomic, we may notice a regression later and revert
it (e.g. refault path), which is why I have them in separate patches.
If we regress the atomic flushing path, it would be a larger surgery
to restore the performance for these paths -- which is why I would
rather keep the atomic path without excessive lock contention.
Thoughts?
>
> I am all in for reducing lock hold time as much as possible as it will
> improve the response time.
>
> Cheers,
> Longman
>
Hello,
On Fri, Mar 24, 2023 at 12:22:09AM -0700, Yosry Ahmed wrote:
> I think a problem with this approach is that we risk having to contend
> for the global lock at every CPU boundary in atomic contexts. Right
> now we contend for the global lock once, and once we have it we go
> through all CPUs to flush, only having to contend with updates taking
> the percpu locks at this point. If we unconditionally release &
> reacquire the global lock at every CPU boundary then we may contend
> for it much more frequently with concurrent flushers.
>
> On the memory controller side, concurrent flushers are already held
> back to avoid a thundering herd problem on the global rstat lock, but
> flushers from outside the memory controller can still compete together
> or with a flusher from the memory controller. In this case, we risk
> contending the global lock more and concurrent flushers taking a
> longer period of time, which may end up causing multi-CPU stalls
> anyway, right? Also, if we keep _irq when spinning for the lock, then
> concurrent flushers still need to spin with irq disabled -- another
> problem that this series tries to fix.
>
> This is particularly a problem for flushers in atomic contexts. There
> is a flusher in mem_cgroup_wb_stats() that flushes while holding
> another spinlock, and a flusher in mem_cgroup_usage() that flushes
> with irqs disabled. If flushing takes a longer period of time due to
> repeated lock contention, it affects such atomic context negatively.
>
> I am not sure how all of this matters in practice, it depends heavily
> on the workloads and the configuration like you mentioned. I am just
> pointing out the potential disadvantages of reacquiring the lock at
> every CPU boundary in atomic contexts.
So, I'm not too convinced by the arguments for a couple reasons:
* It's not very difficult to create conditions where a contended non-irq
protected spinlock is held unnecessarily long due to heavy IRQ load on
the holding CPU. This can easily extend the amount of time the lock is
held by multiple times if not orders of magnitude. That is likely a
significantly worse problem than the contention on the lock cacheline
which will lead to a lot more gradual degradation.
* If concurrent flushing is an actual problem, we need and can implement a
better solution. There's quite a bit of maneuvering room here given that
the flushing operations are mostly idempotent in close time proximity and
there's no real atomicity requirement across different segments of
flushing operations.
Thanks.
--
tejun
On Fri, Mar 24, 2023 at 6:54 PM Tejun Heo <[email protected]> wrote:
>
> Hello,
>
> On Fri, Mar 24, 2023 at 12:22:09AM -0700, Yosry Ahmed wrote:
> > I think a problem with this approach is that we risk having to contend
> > for the global lock at every CPU boundary in atomic contexts. Right
> > now we contend for the global lock once, and once we have it we go
> > through all CPUs to flush, only having to contend with updates taking
> > the percpu locks at this point. If we unconditionally release &
> > reacquire the global lock at every CPU boundary then we may contend
> > for it much more frequently with concurrent flushers.
> >
> > On the memory controller side, concurrent flushers are already held
> > back to avoid a thundering herd problem on the global rstat lock, but
> > flushers from outside the memory controller can still compete together
> > or with a flusher from the memory controller. In this case, we risk
> > contending the global lock more and concurrent flushers taking a
> > longer period of time, which may end up causing multi-CPU stalls
> > anyway, right? Also, if we keep _irq when spinning for the lock, then
> > concurrent flushers still need to spin with irq disabled -- another
> > problem that this series tries to fix.
> >
> > This is particularly a problem for flushers in atomic contexts. There
> > is a flusher in mem_cgroup_wb_stats() that flushes while holding
> > another spinlock, and a flusher in mem_cgroup_usage() that flushes
> > with irqs disabled. If flushing takes a longer period of time due to
> > repeated lock contention, it affects such atomic context negatively.
> >
> > I am not sure how all of this matters in practice, it depends heavily
> > on the workloads and the configuration like you mentioned. I am just
> > pointing out the potential disadvantages of reacquiring the lock at
> > every CPU boundary in atomic contexts.
>
> So, I'm not too convinced by the arguments for a couple reasons:
>
> * It's not very difficult to create conditions where a contended non-irq
> protected spinlock is held unnecessarily long due to heavy IRQ load on
> the holding CPU. This can easily extend the amount of time the lock is
> held by multiple times if not orders of magnitude. That is likely a
> significantly worse problem than the contention on the lock cacheline
> which will lead to a lot more gradual degradation.
I agree that can be a problem; it depends on the specific workload and
configuration. The continuous lock contention at each CPU boundary
causes a regression (see my reply to Waiman), but I am not sure if
it's worse than the scenario you are describing.
>
> * If concurrent flushing is an actual problem, we need and can implement a
> better solution. There's quite a bit of maneuvering room here given that
> the flushing operations are mostly idempotent in close time proximity and
> there's no real atomicity requirement across different segments of
> flushing operations.
Concurrent flushing can be a problem for some workloads, especially in
the MM code we flush in the reclaim and refault paths. This is
currently mitigated by only allowing one flusher at a time from the
memcg side, but contention can still happen with flushing when a
cgroup is being freed or other flushers in other subsystems.
I tried allowing concurrent flushing by completely removing the global
rstat lock, and only depending on the percpu locks for
synchronization. For this to be correct, the global stat counters need
to be atomic, which introduced a slowdown for flushing in general. I
also noticed heavier lock contention on the percpu locks, since all
flushers try to acquire all locks in the same order. I even tried
implementing a simple retry scheme where we try to acquire the percpu
lock, and if we fail we queue the current cpu and move to the next
one. This ended up causing a little bit of slowness as well. Together
with the slowness introduced by atomic operations it seemed like a
significant regression in the simple flushing path.
Don't get me wrong, I am all for modifying the current approach, I
just want to make sure we are making the correct decision for *most*
workloads. Keep in mind that this series aims to minimize the number
of flushers from atomic contexts as well, and for non-atomic flushers
we allow giving up the lock at CPU boundaries anyway. The current
approach only keeps the lock held throughout for atomic flushers.
Any ideas here are welcome!
>
> Thanks.
>
> --
> tejun
On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <[email protected]> wrote:
>
[...]
> Any ideas here are welcome!
>
Let's move forward. It seems like we are not going to reach an
agreement on making cgroup_rstat_lock a non-irq lock. However there is
agreement on the memcg code of not flushing in irq context and the
cleanup Johannes has requested. Let's proceed with those for now. We
can come back to cgroup_rstat_lock later if we still see issues in
production.
Tejun, do you have any concerns on adding WARN_ON_ONCE(!in_task()) in
the rstat flushing code?
On Fri, Mar 24, 2023 at 9:37 PM Yosry Ahmed <[email protected]> wrote:
>
> On Fri, Mar 24, 2023 at 9:31 PM Shakeel Butt <[email protected]> wrote:
> >
> > On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <[email protected]> wrote:
> > >
> > [...]
> > > Any ideas here are welcome!
> > >
> >
> > Let's move forward. It seems like we are not going to reach an
> > agreement on making cgroup_rstat_lock a non-irq lock. However there is
> > agreement on the memcg code of not flushing in irq context and the
> > cleanup Johannes has requested. Let's proceed with those for now. We
> > can come back to cgroup_rstat_lock later if we still see issues in
> > production.
>
> Even if we do not flush from irq context, we still flush from atomic
> contexts that will currently hold the lock with irqs disabled
> throughout the entire flush sequence. A primary purpose of this series
> is to avoid that.
>
> We can either:
> (a) Proceed with the proposed approach of making cgroup_rstat_lock a
> non-irq lock.
> (b) Proceed with Tejun's suggestion of always releasing and
> reacquiring the lock at CPU boundaries, even for atomic flushes (if
> the spinlock needs a break ofc).
> (c) Something else.
(d) keep the status quo regarding cgroup_rstat_lock
(e) decouple the discussion of cgroup_rstat_lock from the agreed
improvements. Send the patches for the agreed ones and continue
discussing cgroup_rstat_lock.
On Fri, Mar 24, 2023 at 9:31 PM Shakeel Butt <[email protected]> wrote:
>
> On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <[email protected]> wrote:
> >
> [...]
> > Any ideas here are welcome!
> >
>
> Let's move forward. It seems like we are not going to reach an
> agreement on making cgroup_rstat_lock a non-irq lock. However there is
> agreement on the memcg code of not flushing in irq context and the
> cleanup Johannes has requested. Let's proceed with those for now. We
> can come back to cgroup_rstat_lock later if we still see issues in
> production.
Even if we do not flush from irq context, we still flush from atomic
contexts that will currently hold the lock with irqs disabled
throughout the entire flush sequence. A primary purpose of this series
is to avoid that.
We can either:
(a) Proceed with the proposed approach of making cgroup_rstat_lock a
non-irq lock.
(b) Proceed with Tejun's suggestion of always releasing and
reacquiring the lock at CPU boundaries, even for atomic flushes (if
the spinlock needs a break ofc).
(c) Something else.
I am happy to proceed with any solution, but we need to address the
fact that interrupts are always disabled throughout the flush. My main
concern about Tejun's suggestion is atomic contexts having to contend
cgroup_rstat_lock much more than they do now, but it's still better
than what we have today.
>
> Tejun, do you have any concerns on adding WARN_ON_ONCE(!in_task()) in
> the rstat flushing code?
On Fri, Mar 24, 2023 at 9:46 PM Shakeel Butt <[email protected]> wrote:
>
> On Fri, Mar 24, 2023 at 9:37 PM Yosry Ahmed <[email protected]> wrote:
> >
> > On Fri, Mar 24, 2023 at 9:31 PM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Fri, Mar 24, 2023 at 7:18 PM Yosry Ahmed <[email protected]> wrote:
> > > >
> > > [...]
> > > > Any ideas here are welcome!
> > > >
> > >
> > > Let's move forward. It seems like we are not going to reach an
> > > agreement on making cgroup_rstat_lock a non-irq lock. However there is
> > > agreement on the memcg code of not flushing in irq context and the
> > > cleanup Johannes has requested. Let's proceed with those for now. We
> > > can come back to cgroup_rstat_lock later if we still see issues in
> > > production.
> >
> > Even if we do not flush from irq context, we still flush from atomic
> > contexts that will currently hold the lock with irqs disabled
> > throughout the entire flush sequence. A primary purpose of this series
> > is to avoid that.
> >
> > We can either:
> > (a) Proceed with the proposed approach of making cgroup_rstat_lock a
> > non-irq lock.
> > (b) Proceed with Tejun's suggestion of always releasing and
> > reacquiring the lock at CPU boundaries, even for atomic flushes (if
> > the spinlock needs a break ofc).
> > (c) Something else.
>
> (d) keep the status quo regarding cgroup_rstat_lock
> (e) decouple the discussion of cgroup_rstat_lock from the agreed
> improvements. Send the patches for the agreed ones and continue
> discussing cgroup_rstat_lock.
Ah, I lost sight of the fact that the rest of the patch series does
not strictly depend on this patch. I will respin the rest of the patch
series separately. Thanks, Shakeel.
Meanwhile, it would be useful to reach an agreement here to stop
acquiring the cgroup_rstat_lock for a long time with irq disabled in
atomic contexts.
Tejun, if having the lock be non-irq is a non-starter for you, I can
send a patch that instead gives up the lock and reacquires it at every
CPU boundary unconditionally -- or perhaps every N CPU boundaries to
avoid excessively releasing and reacquiring the lock.
Something like:
static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
{
	...
	for_each_possible_cpu(cpu) {
		...
		/* Always yield at CPU boundaries to enable irqs */
		spin_unlock_irq(&cgroup_rstat_lock);
		/* if @may_sleep, play nice and yield if necessary */
		if (may_sleep)
			cond_resched();
		spin_lock_irq(&cgroup_rstat_lock);
	}
}
If you have other ideas to avoid disabling irq's for the entire flush
sequence I am also open to that.
Thanks!
Hello, Yosry.
On Mon, Mar 27, 2023 at 04:23:13PM -0700, Yosry Ahmed wrote:
> Tejun, if having the lock be non-irq is a non-starter for you, I can
This is an actual hazard. We see in prod these unprotected locks causing
very big spikes in tail latencies, and they can be tricky to root cause too.
Given the way the rstat lock is used, it's highly likely to be involved in
those scenarios with the proposed change, so it's gonna be a nack from my
end.
> send a patch that instead gives up the lock and reacquires it at every
> CPU boundary unconditionally -- or perhaps every N CPU boundaries to
> avoid excessively releasing and reacquiring the lock.
I'd just do the simple thing and see whether there's any perf penalty before
making it complicated. I'd be pretty surprised if unlocking and relocking
the same spinlock adds any noticeable overhead here.
Thanks.
--
tejun
Hi Tejun,
On Wed, 29 Mar 2023, Tejun Heo wrote:
> On Mon, Mar 27, 2023 at 04:23:13PM -0700, Yosry Ahmed wrote:
> > Tejun, if having the lock be non-irq is a non-starter for you, I can
>
> This is an actual hazard. We see in prod these unprotected locks causing
> very big spikes in tail latencies, and they can be tricky to root cause too.
> Given the way the rstat lock is used, it's highly likely to be involved in
> those scenarios with the proposed change, so it's gonna be a nack from my
> end.
Butting in here, I'm fascinated. This is certainly not my area, I know
nothing about rstat, but this is the first time I ever heard someone
arguing for more disabling of interrupts rather than less.
An interrupt coming in while holding a contended resource can certainly
add to latencies, that I accept of course. But until now, I thought it
was agreed best practice to disable irqs only regretfully, when strictly
necessary.
If that has changed, I for one want to know about it. How should we
now judge which spinlocks should disable interrupts and which should not?
Page table locks are currently my main interest - should those be changed?
Thanks,
Hugh
Hello, Hugh. How have you been?
On Wed, Mar 29, 2023 at 12:22:24PM -0700, Hugh Dickins wrote:
> Hi Tejun,
> Butting in here, I'm fascinated. This is certainly not my area, I know
> nothing about rstat, but this is the first time I ever heard someone
> arguing for more disabling of interrupts rather than less.
>
> An interrupt coming in while holding a contended resource can certainly
> add to latencies, that I accept of course. But until now, I thought it
> was agreed best practice to disable irqs only regretfully, when strictly
> necessary.
>
> If that has changed, I for one want to know about it. How should we
> now judge which spinlocks should disable interrupts and which should not?
> Page table locks are currently my main interest - should those be changed?
For rstat, it's a simple case because the global lock here wraps around
per-cpu locks which have to be irq-safe, so the only difference we get
between making the global irq-unsafe and keeping it so but releasing
inbetween is:
Global lock held: G
IRQ disabled: I
Percpu lock held: P
1. IRQ unsafe
GGGGGGGGGGGGGGG~~GGGGG
IIII IIII IIII ~~ IIII
PPPP PPPP PPPP ~~ PPPP
2. IRQ safe released inbetween cpus
GGGG GGGG GGGG ~~ GGGG
IIII IIII IIII ~~ IIII
PPPP PPPP PPPP ~~ PPPP
#2 seems like the obvious thing to do here given how the lock is used and
each P section may take a bit of time.
So, in the rstat case, the choice is, at least to me, obvious, but even for
more generic cases where the bulk of actual work isn't done w/ irq disabled,
I don't think the picture is as simple as "use the least protected variant
possible" anymore because the underlying hardware changed.
For an SMP kernel running on an UP system, "the least protected variant" is
the obvious choice to make because you don't lose anything by holding a
spinlock longer than necessary. However, as you increase the number of CPUs,
there rises a tradeoff between local irq servicing latency and global lock
contention.
Imagine a, say, 128 cpu system with a few cores servicing relatively high
frequency interrupts. Let's say there's a mildly hot lock. Usually, it shows
up in the system profile but only just. Let's say something happens and the
irq rate on those cores went up for some reason to the point where it
becomes a rather common occurrence when the lock is held on one of those
cpus, irqs are likely to intervene lengthening how long the lock is held,
sometimes, significantly. Now because the lock is on average held for much
longer, it becomes a lot hotter as more CPUs would stall on it and depending
on luck or lack thereof these stalls can span many CPUs on the system for
quite a while. This is actually something we saw in production.
So, in general, there's a trade off between local irq service latency and
inducing global lock contention when using unprotected locks. With more and
more CPUs, the balance keeps shifting. The balance still very much depends
on the specifics of a given lock but yeah I think it's something we need to
be a lot more careful about now.
Thanks.
--
tejun
On Wed, 29 Mar 2023, Tejun Heo wrote:
> Hello, Hugh. How have you been?
>
> On Wed, Mar 29, 2023 at 12:22:24PM -0700, Hugh Dickins wrote:
> > Hi Tejun,
> > Butting in here, I'm fascinated. This is certainly not my area, I know
> > nothing about rstat, but this is the first time I ever heard someone
> > arguing for more disabling of interrupts rather than less.
> >
> > An interrupt coming in while holding a contended resource can certainly
> > add to latencies, that I accept of course. But until now, I thought it
> > was agreed best practice to disable irqs only regretfully, when strictly
> > necessary.
> >
> > If that has changed, I for one want to know about it. How should we
> > now judge which spinlocks should disable interrupts and which should not?
> > Page table locks are currently my main interest - should those be changed?
>
> For rstat, it's a simple case because the global lock here wraps around
> per-cpu locks which have to be irq-safe, so the only difference we get
> between making the global irq-unsafe and keeping it so but releasing
> inbetween is:
>
> Global lock held: G
> IRQ disabled: I
> Percpu lock held: P
>
> 1. IRQ unsafe
>
> GGGGGGGGGGGGGGG~~GGGGG
> IIII IIII IIII ~~ IIII
> PPPP PPPP PPPP ~~ PPPP
>
> 2. IRQ safe released inbetween cpus
>
> GGGG GGGG GGGG ~~ GGGG
> IIII IIII IIII ~~ IIII
> PPPP PPPP PPPP ~~ PPPP
>
> #2 seems like the obvious thing to do here given how the lock is used and
> each P section may take a bit of time.
Many thanks for the detailed response. I'll leave it to the rstat folks,
to agree or disagree with your analysis there.
>
> So, in the rstat case, the choice is, at least to me, obvious, but even for
> more generic cases where the bulk of actual work isn't done w/ irq disabled,
> I don't think the picture is as simple as "use the least protected variant
> possible" anymore because the underlying hardware changed.
>
> For an SMP kernel running on an UP system, "the least protected variant" is
> the obvious choice to make because you don't lose anything by holding a
> spinlock longer than necessary. However, as you increase the number of CPUs,
> there rises a tradeoff between local irq servicing latency and global lock
> contention.
>
> Imagine a, say, 128 cpu system with a few cores servicing relatively high
> frequency interrupts. Let's say there's a mildly hot lock. Usually, it shows
> up in the system profile but only just. Let's say something happens and the
> irq rate on those cores went up for some reason to the point where it
> becomes a rather common occurrence when the lock is held on one of those
> cpus, irqs are likely to intervene lengthening how long the lock is held,
> sometimes, significantly. Now because the lock is on average held for much
> longer, it becomes a lot hotter as more CPUs would stall on it and depending
> on luck or lack thereof these stalls can span many CPUs on the system for
> quite a while. This is actually something we saw in production.
>
> So, in general, there's a trade off between local irq service latency and
> inducing global lock contention when using unprotected locks. With more and
> more CPUs, the balance keeps shifting. The balance still very much depends
> on the specifics of a given lock but yeah I think it's something we need to
> be a lot more careful about now.
And this looks a very plausible argument to me: I'll let it sink in.
But I hadn't heard that the RT folks were clamouring for more irq disabling:
perhaps they partition their machines with more care, and are not devotees
of high CPU counts.
What I hope is that others will chime in one way or the other -
it does sound as if a reappraisal of the balances is overdue.
Thanks,
Hugh (disabling interrupts for as long as he can)
Thanks for a great discussion, Tejun and Hugh.
On Wed, Mar 29, 2023 at 1:38 PM Hugh Dickins <[email protected]> wrote:
>
> On Wed, 29 Mar 2023, Tejun Heo wrote:
>
> > Hello, Hugh. How have you been?
> >
> > On Wed, Mar 29, 2023 at 12:22:24PM -0700, Hugh Dickins wrote:
> > > Hi Tejun,
> > > Butting in here, I'm fascinated. This is certainly not my area, I know
> > > nothing about rstat, but this is the first time I ever heard someone
> > > arguing for more disabling of interrupts rather than less.
> > >
> > > An interrupt coming in while holding a contended resource can certainly
> > > add to latencies, that I accept of course. But until now, I thought it
> > > was agreed best practice to disable irqs only regretfully, when strictly
> > > necessary.
> > >
> > > If that has changed, I for one want to know about it. How should we
> > > now judge which spinlocks should disable interrupts and which should not?
> > > Page table locks are currently my main interest - should those be changed?
> >
> > For rstat, it's a simple case because the global lock here wraps around
> > per-cpu locks which have to be irq-safe, so the only difference we get
> > between making the global irq-unsafe and keeping it so but releasing
> > inbetween is:
> >
> > Global lock held: G
> > IRQ disabled: I
> > Percpu lock held: P
> >
> > 1. IRQ unsafe
> >
> > GGGGGGGGGGGGGGG~~GGGGG
> > IIII IIII IIII ~~ IIII
> > PPPP PPPP PPPP ~~ PPPP
> >
> > 2. IRQ safe released inbetween cpus
> >
> > GGGG GGGG GGGG ~~ GGGG
> > IIII IIII IIII ~~ IIII
> > PPPP PPPP PPPP ~~ PPPP
> >
> > #2 seems like the obvious thing to do here given how the lock is used and
> > each P section may take a bit of time.
>
> Many thanks for the detailed response. I'll leave it to the rstat folks,
> to agree or disagree with your analysis there.
Thanks for the analysis, Tejun, it does indeed make sense. I perf'd
releasing and reacquiring the lock at each CPU boundary and the
overhead seems to be minimal. It would be higher with contention, but
all memcg flushers should be held back by the memcg code, and flushers
outside memcg are not frequent (reading blkcg and cpu base stats from
user space, and when a cgroup is being removed).
I realized that after v2 of this patch series [1], we would only end
up with two atomic flushing contexts, mem_cgroup_wb_stats() and
mem_cgroup_usage(). The latter is already disabling irqs for other
reasons, so anything we do within the rstat core code doesn't really
help, it needs to be addressed separately. So only the call site in
mem_cgroup_wb_stats() would benefit from not having irqs disabled
throughout the flush.
I will hold off on sending a patch until I observe that this call site
is causing us pain and/or other atomic call sites emerge (or we have
to revert one of the ones we made non-atomic), so that we don't hurt
other flushers unnecessarily. Does this make sense to you?
[1] https://lore.kernel.org/linux-mm/[email protected]/
>
> >
> > So, in the rstat case, the choice is, at least to me, obvious, but even for
> > more generic cases where the bulk of actual work isn't done w/ irq disabled,
> > I don't think the picture is as simple as "use the least protected variant
> > possible" anymore because the underlying hardware changed.
> >
> > For an SMP kernel running on an UP system, "the least protected variant" is
> > the obvious choice to make because you don't lose anything by holding a
> > spinlock longer than necessary. However, as you increase the number of CPUs,
> > there rises a tradeoff between local irq servicing latency and global lock
> > contention.
> >
> > Imagine a, say, 128 cpu system with a few cores servicing relatively high
> > frequency interrupts. Let's say there's a mildly hot lock. Usually, it shows
> > up in the system profile but only just. Let's say something happens and the
> > irq rate on those cores went up for some reason to the point where it
> > becomes a rather common occurrence when the lock is held on one of those
> > cpus, irqs are likely to intervene lengthening how long the lock is held,
> > sometimes, significantly. Now because the lock is on average held for much
> > longer, it becomes a lot hotter as more CPUs would stall on it and depending
> > on luck or lack thereof these stalls can span many CPUs on the system for
> > quite a while. This is actually something we saw in production.
> >
> > So, in general, there's a trade off between local irq service latency and
> > inducing global lock contention when using unprotected locks. With more and
> > more CPUs, the balance keeps shifting. The balance still very much depends
> > on the specifics of a given lock but yeah I think it's something we need to
> > be a lot more careful about now.
>
> And this looks a very plausible argument to me: I'll let it sink in.
>
> But I hadn't heard that the RT folks were clamouring for more irq disabling:
> perhaps they partition their machines with more care, and are not devotees
> of high CPU counts.
>
> What I hope is that others will chime in one way or the other -
> it does sound as if a reappraisal of the balances is overdue.
>
> Thanks,
> Hugh (disabling interrupts for as long as he can)
Hello, Hugh.
On Wed, Mar 29, 2023 at 01:38:48PM -0700, Hugh Dickins wrote:
> > So, in general, there's a trade off between local irq service latency and
> > inducing global lock contention when using unprotected locks. With more and
> > more CPUs, the balance keeps shifting. The balance still very much depends
> > on the specifics of a given lock but yeah I think it's something we need to
> > be a lot more careful about now.
>
> And this looks a very plausible argument to me: I'll let it sink in.
Another somewhat relevant change is that flipping irq on/off used to be
relatively expensive on older x86 cpus. I forget all details about when and
how much but they should be a lot cheaper now. No idea about !x86 cpus tho.
> But I hadn't heard that the RT folks were clamouring for more irq disabling:
> perhaps they partition their machines with more care, and are not devotees
> of high CPU counts.
I think RT folks care a lot more about raw IRQ disables. These shouldn't
actually disable IRQs on RT kernels.
Thanks.
--
tejun