Currently, all calls to flush memcg stats use the atomic variant for
rstat flushing, cgroup_rstat_flush_irqsafe(), which keeps interrupts
disabled throughout flushing and does not sleep. Flushing stats is an
expensive operation, and we should avoid doing it atomically where
possible. Otherwise, we may end up doing a lot of work without
rescheduling and with interrupts disabled unnecessarily.
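For context, the two existing rstat flush entry points look roughly like
this (a paraphrased sketch of kernel/cgroup/rstat.c, not a verbatim copy;
the second argument of cgroup_rstat_flush_locked() controls whether the
flush may drop the lock and reschedule between CPUs):

void cgroup_rstat_flush(struct cgroup *cgrp)
{
        might_sleep();

        spin_lock_irq(&cgroup_rstat_lock);
        cgroup_rstat_flush_locked(cgrp, true);          /* may sleep */
        spin_unlock_irq(&cgroup_rstat_lock);
}

void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
{
        unsigned long flags;

        spin_lock_irqsave(&cgroup_rstat_lock, flags);
        cgroup_rstat_flush_locked(cgrp, false);         /* irqs off, no sleeping */
        spin_unlock_irqrestore(&cgroup_rstat_lock, flags);
}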
Patches 1 and 2 are cleanups requested during reviews of prior versions
of this series.
Patch 3 makes sure we never try to flush from within an irq context, and
patch 4 adds a WARN_ON_ONCE() to make sure we catch any violations.
Patches 5 to 8 introduce separate variants of mem_cgroup_flush_stats()
for atomic and non-atomic flushing, and make sure we only flush the
stats atomically when necessary.
Patch 9 is a slightly tangential optimization that limits the work done
by rstat flushing in some scenarios.
RFC -> v1:
- Dropped patch 1 that attempted to make the global rstat lock a non-irq
lock, will follow up on that separately (Shakeel).
- Dropped stats_flush_lock entirely, replaced by an atomic (Johannes).
- Renamed cgroup_rstat_flush_irqsafe() to cgroup_rstat_flush_atomic()
instead of removing it (Johannes).
- Added a patch to rename mem_cgroup_flush_stats_delayed() to
mem_cgroup_flush_stats_ratelimited() (Johannes).
- Separate APIs for flushing memcg stats in atomic and non-atomic
contexts instead of a boolean argument (Johannes).
- Added patches 3 & 4 to make sure we never flush from irq context
(Shakeel & Johannes).
Yosry Ahmed (9):
cgroup: rename cgroup_rstat_flush_"irqsafe" to "atomic"
memcg: rename mem_cgroup_flush_stats_"delayed" to "ratelimited"
memcg: do not flush stats in irq context
cgroup: rstat: add WARN_ON_ONCE() if flushing outside task context
memcg: replace stats_flush_lock with an atomic
memcg: sleep during flushing stats in safe contexts
workingset: memcg: sleep when flushing stats in workingset_refault()
vmscan: memcg: sleep when flushing stats during reclaim
memcg: do not modify rstat tree for zero updates
include/linux/cgroup.h | 2 +-
include/linux/memcontrol.h | 9 +++-
kernel/cgroup/rstat.c | 6 ++-
mm/memcontrol.c | 86 ++++++++++++++++++++++++++++++++------
mm/workingset.c | 4 +-
5 files changed, 87 insertions(+), 20 deletions(-)
--
2.40.0.348.gf938b09366-goog
cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as
"irqs are disabled throughout", which is what the current implementation
does (currently under discussion [1]), but is not the intention. The
intention is that this function is safe to call from atomic contexts.
Name it as such.
Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Yosry Ahmed <[email protected]>
---
include/linux/cgroup.h | 2 +-
kernel/cgroup/rstat.c | 4 ++--
mm/memcontrol.c | 2 +-
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 3410aecffdb4..885f5395fcd0 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -692,7 +692,7 @@ static inline void cgroup_path_from_kernfs_id(u64 id, char *buf, size_t buflen)
*/
void cgroup_rstat_updated(struct cgroup *cgrp, int cpu);
void cgroup_rstat_flush(struct cgroup *cgrp);
-void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp);
+void cgroup_rstat_flush_atomic(struct cgroup *cgrp);
void cgroup_rstat_flush_hold(struct cgroup *cgrp);
void cgroup_rstat_flush_release(void);
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index 831f1f472bb8..d3252b0416b6 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -241,12 +241,12 @@ __bpf_kfunc void cgroup_rstat_flush(struct cgroup *cgrp)
}
/**
- * cgroup_rstat_flush_irqsafe - irqsafe version of cgroup_rstat_flush()
+ * cgroup_rstat_flush_atomic - atomic version of cgroup_rstat_flush()
* @cgrp: target cgroup
*
* This function can be called from any context.
*/
-void cgroup_rstat_flush_irqsafe(struct cgroup *cgrp)
+void cgroup_rstat_flush_atomic(struct cgroup *cgrp)
{
unsigned long flags;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5abffe6f8389..0205e58ea430 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -642,7 +642,7 @@ static void __mem_cgroup_flush_stats(void)
return;
flush_next_time = jiffies_64 + 2*FLUSH_TIME;
- cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
+ cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
atomic_set(&stats_flush_threshold, 0);
spin_unlock_irqrestore(&stats_flush_lock, flag);
}
--
2.40.0.348.gf938b09366-goog
mem_cgroup_flush_stats_delayed() suggests this is using a delayed_work,
but this is actually sometimes flushing directly from the callsite.
What it's doing is ratelimiting calls. A better name would be
mem_cgroup_flush_stats_ratelimited().
Suggested-by: Johannes Weiner <[email protected]>
Signed-off-by: Yosry Ahmed <[email protected]>
---
include/linux/memcontrol.h | 4 ++--
mm/memcontrol.c | 2 +-
mm/workingset.c | 2 +-
3 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b6eda2ab205d..ac3f3b3a45e2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1037,7 +1037,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
}
void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_delayed(void);
+void mem_cgroup_flush_stats_ratelimited(void);
void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val);
@@ -1535,7 +1535,7 @@ static inline void mem_cgroup_flush_stats(void)
{
}
-static inline void mem_cgroup_flush_stats_delayed(void)
+static inline void mem_cgroup_flush_stats_ratelimited(void)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0205e58ea430..c3b6aae78901 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -653,7 +653,7 @@ void mem_cgroup_flush_stats(void)
__mem_cgroup_flush_stats();
}
-void mem_cgroup_flush_stats_delayed(void)
+void mem_cgroup_flush_stats_ratelimited(void)
{
if (time_after64(jiffies_64, flush_next_time))
mem_cgroup_flush_stats();
diff --git a/mm/workingset.c b/mm/workingset.c
index 00c6f4d9d9be..af862c6738c3 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -462,7 +462,7 @@ void workingset_refault(struct folio *folio, void *shadow)
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
- mem_cgroup_flush_stats_delayed();
+ mem_cgroup_flush_stats_ratelimited();
/*
* Compare the distance to the existing workingset size. We
* don't activate pages that couldn't stay resident even if
--
2.40.0.348.gf938b09366-goog
Currently, the only context in which we can invoke an rstat flush from
irq context is through mem_cgroup_usage() on the root memcg when called
from memcg_check_events(). An rstat flush is an expensive operation that
should not be done in irq context, so do not flush stats and use the
stale stats in this case.
Arguably, usage threshold events are not reliable on the root memcg
anyway since its usage is ill-defined.
Suggested-by: Johannes Weiner <[email protected]>
Suggested-by: Shakeel Butt <[email protected]>
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/memcontrol.c | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c3b6aae78901..ff39f78f962e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3669,7 +3669,21 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
unsigned long val;
if (mem_cgroup_is_root(memcg)) {
- mem_cgroup_flush_stats();
+ /*
+ * We can reach here from irq context through:
+ * uncharge_batch()
+ * |--memcg_check_events()
+ * |--mem_cgroup_threshold()
+ * |--__mem_cgroup_threshold()
+ * |--mem_cgroup_usage
+ *
+ * rstat flushing is an expensive operation that should not be
+ * done from irq context; use stale stats in this case.
+ * Arguably, usage threshold events are not reliable on the root
+ * memcg anyway since its usage is ill-defined.
+ */
+ if (in_task())
+ mem_cgroup_flush_stats();
val = memcg_page_state(memcg, NR_FILE_PAGES) +
memcg_page_state(memcg, NR_ANON_MAPPED);
if (swap)
--
2.40.0.348.gf938b09366-goog
rstat flushing is too expensive to perform in irq context.
The previous patch removed the only context that may invoke an rstat
flush from irq context; add a WARN_ON_ONCE() to detect future
violations, or those that we are not aware of.
Signed-off-by: Yosry Ahmed <[email protected]>
---
kernel/cgroup/rstat.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d3252b0416b6..c2571939139f 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -176,6 +176,8 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
{
int cpu;
+ /* rstat flushing is too expensive for irq context */
+ WARN_ON_ONCE(!in_task());
lockdep_assert_held(&cgroup_rstat_lock);
for_each_possible_cpu(cpu) {
--
2.40.0.348.gf938b09366-goog
As Johannes notes in [1], stats_flush_lock is currently used to:
(a) Protect updates to stats_flush_threshold.
(b) Protect updates to flush_next_time.
(c) Serialize calls to cgroup_rstat_flush() based on those ratelimits.
However:
1. stats_flush_threshold is already an atomic
2. flush_next_time is not atomic. The writer is locked, but the reader
is lockless. If the reader races with a flush, you could see this:
if (time_after(jiffies, flush_next_time))
spin_trylock()
flush_next_time = now + delay
flush()
spin_unlock()
spin_trylock()
flush_next_time = now + delay
flush()
spin_unlock()
which means we already can get flushes at a higher frequency than
FLUSH_TIME during races. But it isn't really a problem.
The reader could also see garbled partial updates, so it needs at
least READ_ONCE and WRITE_ONCE protection.
3. Serializing cgroup_rstat_flush() calls against the ratelimit
factors is currently broken because of the race in 2. But the race
is actually harmless, all we might get is the occasional earlier
flush. If there is no delta, the flush won't do much. And if there
is, the flush is justified.
So the lock can be removed altogether. However, the lock also served
the purpose of preventing a thundering herd problem for concurrent
flushers, see [2]. Use an atomic instead to serve the purpose of
unifying concurrent flushers.
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/memcontrol.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ff39f78f962e..64ff33e02c96 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -585,8 +585,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
*/
static void flush_memcg_stats_dwork(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_SPINLOCK(stats_flush_lock);
static DEFINE_PER_CPU(unsigned int, stats_updates);
+static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
static u64 flush_next_time;
@@ -636,15 +636,18 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
static void __mem_cgroup_flush_stats(void)
{
- unsigned long flag;
-
- if (!spin_trylock_irqsave(&stats_flush_lock, flag))
+ /*
+ * We always flush the entire tree, so concurrent flushers can just
+ * skip. This avoids a thundering herd problem on the rstat global lock
+ * from memcg flushers (e.g. reclaim, refault, etc).
+ */
+ if (atomic_xchg(&stats_flush_ongoing, 1))
return;
- flush_next_time = jiffies_64 + 2*FLUSH_TIME;
+ WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
atomic_set(&stats_flush_threshold, 0);
- spin_unlock_irqrestore(&stats_flush_lock, flag);
+ atomic_set(&stats_flush_ongoing, 0);
}
void mem_cgroup_flush_stats(void)
@@ -655,7 +658,7 @@ void mem_cgroup_flush_stats(void)
void mem_cgroup_flush_stats_ratelimited(void)
{
- if (time_after64(jiffies_64, flush_next_time))
+ if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
mem_cgroup_flush_stats();
}
--
2.40.0.348.gf938b09366-goog
Currently, all contexts that flush memcg stats do so with sleeping not
allowed. Some of these contexts are perfectly safe to sleep in, such as
reading cgroup files from userspace or the background periodic flusher.
Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka
sleepable), and provide a separate atomic version. The atomic version is
used in reclaim, refault, writeback, and in mem_cgroup_usage(). All
other code paths are left to use the non-atomic version. This includes
callbacks for userspace reads and the periodic flusher.
Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(),
this function is changed to call the atomic version of
mem_cgroup_flush_stats(). Reclaim and refault code paths are modified
to do non-atomic flushing in separate later patches -- so
mem_cgroup_flush_stats_ratelimited() will eventually become non-atomic.
Signed-off-by: Yosry Ahmed <[email protected]>
---
include/linux/memcontrol.h | 5 ++++
mm/memcontrol.c | 58 ++++++++++++++++++++++++++++++++------
mm/vmscan.c | 2 +-
3 files changed, 55 insertions(+), 10 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ac3f3b3a45e2..a4bc3910a2eb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1037,6 +1037,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
}
void mem_cgroup_flush_stats(void);
+void mem_cgroup_flush_stats_atomic(void);
void mem_cgroup_flush_stats_ratelimited(void);
void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
@@ -1535,6 +1536,10 @@ static inline void mem_cgroup_flush_stats(void)
{
}
+static inline void mem_cgroup_flush_stats_atomic(void)
+{
+}
+
static inline void mem_cgroup_flush_stats_ratelimited(void)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 64ff33e02c96..57e8cbf701f3 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -634,7 +634,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
}
}
-static void __mem_cgroup_flush_stats(void)
+static bool mem_cgroup_pre_stats_flush(void)
{
/*
* We always flush the entire tree, so concurrent flushers can just
@@ -642,24 +642,57 @@ static void __mem_cgroup_flush_stats(void)
* from memcg flushers (e.g. reclaim, refault, etc).
*/
if (atomic_xchg(&stats_flush_ongoing, 1))
- return;
+ return false;
WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
- cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
+ return true;
+}
+
+static void mem_cgroup_post_stats_flush(void)
+{
atomic_set(&stats_flush_threshold, 0);
atomic_set(&stats_flush_ongoing, 0);
}
-void mem_cgroup_flush_stats(void)
+static bool mem_cgroup_should_flush_stats(void)
{
- if (atomic_read(&stats_flush_threshold) > num_online_cpus())
- __mem_cgroup_flush_stats();
+ return atomic_read(&stats_flush_threshold) > num_online_cpus();
+}
+
+/* atomic functions, safe to call from any context */
+static void __mem_cgroup_flush_stats_atomic(void)
+{
+ if (mem_cgroup_pre_stats_flush()) {
+ cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
+ mem_cgroup_post_stats_flush();
+ }
+}
+
+void mem_cgroup_flush_stats_atomic(void)
+{
+ if (mem_cgroup_should_flush_stats())
+ __mem_cgroup_flush_stats_atomic();
}
void mem_cgroup_flush_stats_ratelimited(void)
{
if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_atomic();
+}
+
+/* non-atomic functions, only safe from sleepable contexts */
+static void __mem_cgroup_flush_stats(void)
+{
+ if (mem_cgroup_pre_stats_flush()) {
+ cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+ mem_cgroup_post_stats_flush();
+ }
+}
+
+void mem_cgroup_flush_stats(void)
+{
+ if (mem_cgroup_should_flush_stats())
+ __mem_cgroup_flush_stats();
}
static void flush_memcg_stats_dwork(struct work_struct *w)
@@ -3684,9 +3717,12 @@ static unsigned long mem_cgroup_usage(struct mem_cgroup *memcg, bool swap)
* done from irq context; use stale stats in this case.
* Arguably, usage threshold events are not reliable on the root
* memcg anyway since its usage is ill-defined.
+ *
+ * Additionally, other call paths through memcg_check_events()
+ * disable irqs, so make sure we are flushing stats atomically.
*/
if (in_task())
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_atomic();
val = memcg_page_state(memcg, NR_FILE_PAGES) +
memcg_page_state(memcg, NR_ANON_MAPPED);
if (swap)
@@ -4609,7 +4645,11 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
struct mem_cgroup *parent;
- mem_cgroup_flush_stats();
+ /*
+ * wb_writeback() takes a spinlock and calls
+ * wb_over_bg_thresh()->mem_cgroup_wb_stats(). Do not sleep.
+ */
+ mem_cgroup_flush_stats_atomic();
*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9c1c5e8b24b8..a9511ccb936f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
* Flush the memory cgroup stats, so that we read accurate per-memcg
* lruvec stats for heuristics.
*/
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats_atomic();
/*
* Determine the scan balance between anon and file LRUs.
--
2.40.0.348.gf938b09366-goog
In some situations, we may end up calling memcg_rstat_updated() with a
value of 0, which means the stat was not actually updated. An example is
if we fail to reclaim any pages in shrink_folio_list().
Do not add the cgroup to the rstat updated tree in this case, to avoid
unnecessarily flushing it.
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/memcontrol.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0c0e74188e90..828e670e721a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -618,6 +618,9 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
unsigned int x;
+ if (!val)
+ return;
+
cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
x = __this_cpu_add_return(stats_updates, abs(val));
--
2.40.0.348.gf938b09366-goog
Memory reclaim is a sleepable context. Allow sleeping when flushing
memcg stats to avoid unnecessarily performing a lot of work without
sleeping. This can slow down reclaim code if flushing stats is taking
too long, but there are already multiple cond_resched() calls in reclaim
code.
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a9511ccb936f..9c1c5e8b24b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
* Flush the memory cgroup stats, so that we read accurate per-memcg
* lruvec stats for heuristics.
*/
- mem_cgroup_flush_stats_atomic();
+ mem_cgroup_flush_stats();
/*
* Determine the scan balance between anon and file LRUs.
--
2.40.0.348.gf938b09366-goog
In workingset_refault(), we call mem_cgroup_flush_stats_ratelimited()
to flush stats within an RCU read section and with sleeping disallowed.
Move the call to mem_cgroup_flush_stats_ratelimited() above the RCU read
section and allow sleeping to avoid unnecessarily performing a lot of
work without sleeping.
Since workingset_refault() is the only caller of
mem_cgroup_flush_stats_ratelimited(), just make it call the non-atomic
mem_cgroup_flush_stats().
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/memcontrol.c | 12 ++++++------
mm/workingset.c | 4 ++--
2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 57e8cbf701f3..0c0e74188e90 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -674,12 +674,6 @@ void mem_cgroup_flush_stats_atomic(void)
__mem_cgroup_flush_stats_atomic();
}
-void mem_cgroup_flush_stats_ratelimited(void)
-{
- if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
- mem_cgroup_flush_stats_atomic();
-}
-
/* non-atomic functions, only safe from sleepable contexts */
static void __mem_cgroup_flush_stats(void)
{
@@ -695,6 +689,12 @@ void mem_cgroup_flush_stats(void)
__mem_cgroup_flush_stats();
}
+void mem_cgroup_flush_stats_ratelimited(void)
+{
+ if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
+ mem_cgroup_flush_stats();
+}
+
static void flush_memcg_stats_dwork(struct work_struct *w)
{
__mem_cgroup_flush_stats();
diff --git a/mm/workingset.c b/mm/workingset.c
index af862c6738c3..7d7ecc46521c 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -406,6 +406,8 @@ void workingset_refault(struct folio *folio, void *shadow)
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
eviction <<= bucket_order;
+ /* Flush stats (and potentially sleep) before holding RCU read lock */
+ mem_cgroup_flush_stats_ratelimited();
rcu_read_lock();
/*
* Look up the memcg associated with the stored ID. It might
@@ -461,8 +463,6 @@ void workingset_refault(struct folio *folio, void *shadow)
lruvec = mem_cgroup_lruvec(memcg, pgdat);
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
-
- mem_cgroup_flush_stats_ratelimited();
/*
* Compare the distance to the existing workingset size. We
* don't activate pages that couldn't stay resident even if
--
2.40.0.348.gf938b09366-goog
On Mon, Mar 27, 2023 at 11:16 PM Yosry Ahmed <[email protected]> wrote:
>
> cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as
> "irqs are disabled throughout", which is what the current implementation
> does (currently under discussion [1]), but is not the intention. The
> intention is that this function is safe to call from atomic contexts.
> Name it as such.
>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
On Mon, Mar 27, 2023 at 11:16 PM Yosry Ahmed <[email protected]> wrote:
>
> mem_cgroup_flush_stats_delayed() suggests this is using a delayed_work,
> but this is actually sometimes flushing directly from the callsite.
>
> What it's doing is ratelimited calls. A better name would be
> mem_cgroup_flush_stats_ratelimited().
>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
On Mon, Mar 27, 2023 at 11:16 PM Yosry Ahmed <[email protected]> wrote:
>
> Currently, the only context in which we can invoke an rstat flush from
> irq context is through mem_cgroup_usage() on the root memcg when called
> from memcg_check_events(). An rstat flush is an expensive operation that
> should not be done in irq context, so do not flush stats and use the
> stale stats in this case.
>
> Arguably, usage threshold events are not reliable on the root memcg
> anyway since its usage is ill-defined.
>
> Suggested-by: Johannes Weiner <[email protected]>
> Suggested-by: Shakeel Butt <[email protected]>
> Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
On Tue, Mar 28, 2023 at 06:16:34AM +0000, Yosry Ahmed wrote:
[...]
> @@ -585,8 +585,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> */
> static void flush_memcg_stats_dwork(struct work_struct *w);
> static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> -static DEFINE_SPINLOCK(stats_flush_lock);
> static DEFINE_PER_CPU(unsigned int, stats_updates);
> +static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
> static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> static u64 flush_next_time;
>
> @@ -636,15 +636,18 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
>
> static void __mem_cgroup_flush_stats(void)
> {
> - unsigned long flag;
> -
> - if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> + /*
> + * We always flush the entire tree, so concurrent flushers can just
> + * skip. This avoids a thundering herd problem on the rstat global lock
> + * from memcg flushers (e.g. reclaim, refault, etc).
> + */
> + if (atomic_xchg(&stats_flush_ongoing, 1))
Have you profiled this? I wonder if we should replace the above with
if (atomic_read(&stats_flush_ongoing) || atomic_xchg(&stats_flush_ongoing, 1))
to not always dirty the cacheline. This would not be an issue if there
is no cacheline sharing but I suspect percpu stats_updates is sharing
the cacheline with it and may cause false sharing with the parallel stat
updaters (updaters only need to read the base percpu pointer).
Other than that the patch looks good.
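For illustration, the check-before-xchg variant Shakeel is suggesting
would look roughly like this in __mem_cgroup_flush_stats() (a sketch of
the suggestion, not the actual patch):

static void __mem_cgroup_flush_stats(void)
{
        /*
         * Plain read first: concurrent flushers that lose the race only
         * read the stats_flush_ongoing cacheline instead of dirtying it
         * with an unconditional xchg.
         */
        if (atomic_read(&stats_flush_ongoing) ||
            atomic_xchg(&stats_flush_ongoing, 1))
                return;

        WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
        cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
        atomic_set(&stats_flush_threshold, 0);
        atomic_set(&stats_flush_ongoing, 0);
}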
On Mon, Mar 27, 2023 at 11:16 PM Yosry Ahmed <[email protected]> wrote:
>
> rstat flushing is too expensive to perform in irq context.
> The previous patch removed the only context that may invoke an rstat
> flush from irq context, add a WARN_ON_ONCE() to detect future
> violations, or those that we are not aware of.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
Reviewed-by: Shakeel Butt <[email protected]>
On Mon, Mar 27, 2023 at 11:16 PM Yosry Ahmed <[email protected]> wrote:
>
> Currently, all contexts that flush memcg stats do so with sleeping not
> allowed. Some of these contexts are perfectly safe to sleep in, such as
> reading cgroup files from userspace or the background periodic flusher.
>
> Refactor the code to make mem_cgroup_flush_stats() non-atomic (aka
> sleepable), and provide a separate atomic version. The atomic version is
> used in reclaim, refault, writeback, and in mem_cgroup_usage(). All
> other code paths are left to use the non-atomic version. This includes
> callbacks for userspace reads and the periodic flusher.
>
> Since refault is the only caller of mem_cgroup_flush_stats_ratelimited(),
> this function is changed to call the atomic version of
> mem_cgroup_flush_stats(). Reclaim and refault code paths are modified
> to do non-atomic flushing in separate later patches -- so
> mem_cgroup_flush_stats_ratelimited() will eventually become non-atomic.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
On Mon, Mar 27, 2023 at 11:16 PM Yosry Ahmed <[email protected]> wrote:
>
> In workingset_refault(), we call mem_cgroup_flush_stats_ratelimited()
> to flush stats within an RCU read section and with sleeping disallowed.
> Move the call to mem_cgroup_flush_stats_ratelimited() above the RCU read
> section and allow sleeping to avoid unnecessarily performing a lot of
> work without sleeping.
>
> Since workingset_refault() is the only caller of
> mem_cgroup_flush_stats_ratelimited(), just make it call the non-atomic
> mem_cgroup_flush_stats().
>
> Signed-off-by: Yosry Ahmed <[email protected]>
A nit below:
Acked-by: Shakeel Butt <[email protected]>
> ---
> mm/memcontrol.c | 12 ++++++------
> mm/workingset.c | 4 ++--
> 2 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 57e8cbf701f3..0c0e74188e90 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -674,12 +674,6 @@ void mem_cgroup_flush_stats_atomic(void)
> __mem_cgroup_flush_stats_atomic();
> }
>
> -void mem_cgroup_flush_stats_ratelimited(void)
> -{
> - if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> - mem_cgroup_flush_stats_atomic();
> -}
> -
> /* non-atomic functions, only safe from sleepable contexts */
> static void __mem_cgroup_flush_stats(void)
> {
> @@ -695,6 +689,12 @@ void mem_cgroup_flush_stats(void)
> __mem_cgroup_flush_stats();
> }
>
> +void mem_cgroup_flush_stats_ratelimited(void)
> +{
> + if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> + mem_cgroup_flush_stats();
> +}
> +
> static void flush_memcg_stats_dwork(struct work_struct *w)
> {
> __mem_cgroup_flush_stats();
> diff --git a/mm/workingset.c b/mm/workingset.c
> index af862c6738c3..7d7ecc46521c 100644
> --- a/mm/workingset.c
> +++ b/mm/workingset.c
> @@ -406,6 +406,8 @@ void workingset_refault(struct folio *folio, void *shadow)
> unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
> eviction <<= bucket_order;
>
> + /* Flush stats (and potentially sleep) before holding RCU read lock */
I think the only reason we use rcu lock is due to
mem_cgroup_from_id(). Maybe we should add mem_cgroup_tryget_from_id().
The other caller of mem_cgroup_from_id() in vmscan is already doing
the same and could use mem_cgroup_tryget_from_id().
Though this can be done separately to this series (if we decide to do
it at all).
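A minimal sketch of such a helper (hypothetical -- neither the name nor
the exact semantics are settled in this thread) could be:

static struct mem_cgroup *mem_cgroup_tryget_from_id(unsigned short id)
{
        struct mem_cgroup *memcg;

        rcu_read_lock();
        memcg = mem_cgroup_from_id(id);
        if (memcg && !css_tryget(&memcg->css))
                memcg = NULL;
        rcu_read_unlock();

        return memcg;
}

workingset_refault() could then hold a reference for the lookup instead
of staying inside an RCU read section, dropping it with mem_cgroup_put()
once it is done with the memcg.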
On Mon, Mar 27, 2023 at 11:16 PM Yosry Ahmed <[email protected]> wrote:
>
> Memory reclaim is a sleepable context. Allow sleeping when flushing
> memcg stats to avoid unnecessarily performing a lot of work without
> sleeping. This can slow down reclaim code if flushing stats is taking
> too long, but there is already multiple cond_resched()'s in reclaim
> code.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
> ---
> mm/vmscan.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a9511ccb936f..9c1c5e8b24b8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> * Flush the memory cgroup stats, so that we read accurate per-memcg
> * lruvec stats for heuristics.
> */
> - mem_cgroup_flush_stats_atomic();
> + mem_cgroup_flush_stats();
I wonder if we should just replace this with
mem_cgroup_flush_stats_ratelimited().
On Mon, Mar 27, 2023 at 11:17 PM Yosry Ahmed <[email protected]> wrote:
>
> In some situations, we may end up calling memcg_rstat_updated() with a
> value of 0, which means the stat was not actually updated. An example is
> if we fail to reclaim any pages in shrink_folio_list().
>
> Do not add the cgroup to the rstat updated tree in this case, to avoid
> unnecessarily flushing it.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Shakeel Butt <[email protected]>
On Tue, Mar 28, 2023 at 06:16:31AM +0000, Yosry Ahmed wrote:
> mem_cgroup_flush_stats_delayed() suggests this is using a delayed_work,
> but this is actually sometimes flushing directly from the callsite.
>
> What it's doing is ratelimited calls. A better name would be
> mem_cgroup_flush_stats_ratelimited().
>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Tue, Mar 28, 2023 at 06:16:30AM +0000, Yosry Ahmed wrote:
> cgroup_rstat_flush_irqsafe() can be a confusing name. It may read as
> "irqs are disabled throughout", which is what the current implementation
> does (currently under discussion [1]), but is not the intention. The
> intention is that this function is safe to call from atomic contexts.
> Name it as such.
>
> Suggested-by: Johannes Weiner <[email protected]>
> Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Tue, Mar 28, 2023 at 06:16:32AM +0000, Yosry Ahmed wrote:
> Currently, the only context in which we can invoke an rstat flush from
> irq context is through mem_cgroup_usage() on the root memcg when called
> from memcg_check_events(). An rstat flush is an expensive operation that
> should not be done in irq context, so do not flush stats and use the
> stale stats in this case.
>
> Arguably, usage threshold events are not reliable on the root memcg
> anyway since its usage is ill-defined.
>
> Suggested-by: Johannes Weiner <[email protected]>
> Suggested-by: Shakeel Butt <[email protected]>
> Signed-off-by: Yosry Ahmed <[email protected]>
Excellent, thanks!
Acked-by: Johannes Weiner <[email protected]>
On Tue, Mar 28, 2023 at 06:16:33AM +0000, Yosry Ahmed wrote:
> rstat flushing is too expensive to perform in irq context.
> The previous patch removed the only context that may invoke an rstat
> flush from irq context, add a WARN_ON_ONCE() to detect future
> violations, or those that we are not aware of.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
> ---
> kernel/cgroup/rstat.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> index d3252b0416b6..c2571939139f 100644
> --- a/kernel/cgroup/rstat.c
> +++ b/kernel/cgroup/rstat.c
> @@ -176,6 +176,8 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
> {
> int cpu;
>
> + /* rstat flushing is too expensive for irq context */
> + WARN_ON_ONCE(!in_task());
> lockdep_assert_held(&cgroup_rstat_lock);
This seems a bit arbitrary. Why is an irq caller forbidden, but an
irq-disabled, non-preemptible section caller is allowed? The latency
impact on the system would be the same, right?
On Tue, Mar 28, 2023 at 06:16:34AM +0000, Yosry Ahmed wrote:
> As Johannes notes in [1], stats_flush_lock is currently used to:
> (a) Protect updates to stats_flush_threshold.
> (b) Protect updates to flush_next_time.
> (c) Serializes calls to cgroup_rstat_flush() based on those ratelimits.
>
> However:
>
> 1. stats_flush_threshold is already an atomic
>
> 2. flush_next_time is not atomic. The writer is locked, but the reader
> is lockless. If the reader races with a flush, you could see this:
>
> if (time_after(jiffies, flush_next_time))
> spin_trylock()
> flush_next_time = now + delay
> flush()
> spin_unlock()
> spin_trylock()
> flush_next_time = now + delay
> flush()
> spin_unlock()
>
> which means we already can get flushes at a higher frequency than
> FLUSH_TIME during races. But it isn't really a problem.
>
> The reader could also see garbled partial updates, so it needs at
> least READ_ONCE and WRITE_ONCE protection.
>
> 3. Serializing cgroup_rstat_flush() calls against the ratelimit
> factors is currently broken because of the race in 2. But the race
> is actually harmless, all we might get is the occasional earlier
> flush. If there is no delta, the flush won't do much. And if there
> is, the flush is justified.
>
> So the lock can be removed all together. However, the lock also served
> the purpose of preventing a thundering herd problem for concurrent
> flushers, see [2]. Use an atomic instead to serve the purpose of
> unifying concurrent flushers.
>
> [1]https://lore.kernel.org/lkml/[email protected]/
> [2]https://lore.kernel.org/lkml/[email protected]/
>
> Signed-off-by: Yosry Ahmed <[email protected]>
With Shakeel's suggestion:
Acked-by: Johannes Weiner <[email protected]>
On Tue, Mar 28, 2023 at 06:16:35AM +0000, Yosry Ahmed wrote:
> @@ -642,24 +642,57 @@ static void __mem_cgroup_flush_stats(void)
> * from memcg flushers (e.g. reclaim, refault, etc).
> */
> if (atomic_xchg(&stats_flush_ongoing, 1))
> - return;
> + return false;
>
> WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> - cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
> + return true;
> +}
> +
> +static void mem_cgroup_post_stats_flush(void)
> +{
> atomic_set(&stats_flush_threshold, 0);
> atomic_set(&stats_flush_ongoing, 0);
> }
>
> -void mem_cgroup_flush_stats(void)
> +static bool mem_cgroup_should_flush_stats(void)
> {
> - if (atomic_read(&stats_flush_threshold) > num_online_cpus())
> - __mem_cgroup_flush_stats();
> + return atomic_read(&stats_flush_threshold) > num_online_cpus();
> +}
> +
> +/* atomic functions, safe to call from any context */
> +static void __mem_cgroup_flush_stats_atomic(void)
> +{
> + if (mem_cgroup_pre_stats_flush()) {
> + cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
> + mem_cgroup_post_stats_flush();
> + }
> +}
I'm afraid I wasn't very nuanced with my complaint about the bool
parameter in the previous version. In this case, when you can do a
common helper for a couple of API functions defined right below it,
and the callers don't spread throughout the codebase, using bools
makes things simpler while still being easily understandable:
static void do_flush_stats(bool may_sleep)
{
if (atomic_xchg(&stats_flush_ongoing, 1))
return;
WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
atomic_set(&stats_flush_threshold, 0);
if (!may_sleep)
cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
else
cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
atomic_set(&stats_flush_ongoing, 0);
}
void mem_cgroup_flush_stats(void)
{
if (atomic_read(&stats_flush_threshold) > num_online_cpus())
do_flush_stats(true);
}
void mem_cgroup_flush_stats_atomic(void)
{
if (atomic_read(&stats_flush_threshold) > num_online_cpus())
do_flush_stats(false);
}
> void mem_cgroup_flush_stats_ratelimited(void)
> {
> if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> - mem_cgroup_flush_stats();
> + mem_cgroup_flush_stats_atomic();
> +}
This should probably be mem_cgroup_flush_stats_atomic_ratelimited().
(Whee, kinda long, but that's alright. Very specialized caller...)
Btw, can you guys think of a reason against moving the threshold check
into the common function? It would then apply to the time-limited
flushes as well, but that shouldn't hurt anything. This would make the
code even simpler:
static void do_flush_stats(bool may_sleep)
{
if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
return;
if (atomic_xchg(&stats_flush_ongoing, 1))
return;
WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
atomic_set(&stats_flush_threshold, 0);
if (!may_sleep)
cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
else
cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
atomic_set(&stats_flush_ongoing, 0);
}
void mem_cgroup_flush_stats(void)
{
do_flush_stats(true);
}
void mem_cgroup_flush_stats_atomic(void)
{
do_flush_stats(false);
}
void mem_cgroup_flush_stats_atomic_ratelimited(void)
{
if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
do_flush_stats(false);
}
> @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> * Flush the memory cgroup stats, so that we read accurate per-memcg
> * lruvec stats for heuristics.
> */
> - mem_cgroup_flush_stats();
> + mem_cgroup_flush_stats_atomic();
I'm thinking this one could be non-atomic as well. It's called fairly
high up in reclaim without any locks held.
On Tue, Mar 28, 2023 at 06:16:36AM +0000, Yosry Ahmed wrote:
> @@ -406,6 +406,8 @@ void workingset_refault(struct folio *folio, void *shadow)
> unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
> eviction <<= bucket_order;
>
> + /* Flush stats (and potentially sleep) before holding RCU read lock */
> + mem_cgroup_flush_stats_ratelimited();
> rcu_read_lock();
Minor nit, but please keep the lock section visually separated by an
empty line between the flush and the rcu lock.
Other than that,
Acked-by: Johannes Weiner <[email protected]>
On Tue, Mar 28, 2023 at 08:18:11AM -0700, Shakeel Butt wrote:
> > @@ -406,6 +406,8 @@ void workingset_refault(struct folio *folio, void *shadow)
> > unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
> > eviction <<= bucket_order;
> >
> > + /* Flush stats (and potentially sleep) before holding RCU read lock */
>
> I think the only reason we use rcu lock is due to
> mem_cgroup_from_id(). Maybe we should add mem_cgroup_tryget_from_id().
> The other caller of mem_cgroup_from_id() in vmscan is already doing
> the same and could use mem_cgroup_tryget_from_id().
Good catch. Nothing else in there is protected by RCU. We can just
hold the ref instead.
> Though this can be done separately to this series (if we decide to do
> it at all).
Agreed
On Tue, Mar 28, 2023 at 11:35 AM Johannes Weiner <[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 06:16:35AM +0000, Yosry Ahmed wrote:
> > @@ -642,24 +642,57 @@ static void __mem_cgroup_flush_stats(void)
> > * from memcg flushers (e.g. reclaim, refault, etc).
> > */
> > if (atomic_xchg(&stats_flush_ongoing, 1))
> > - return;
> > + return false;
> >
> > WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> > - cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
> > + return true;
> > +}
> > +
> > +static void mem_cgroup_post_stats_flush(void)
> > +{
> > atomic_set(&stats_flush_threshold, 0);
> > atomic_set(&stats_flush_ongoing, 0);
> > }
> >
> > -void mem_cgroup_flush_stats(void)
> > +static bool mem_cgroup_should_flush_stats(void)
> > {
> > - if (atomic_read(&stats_flush_threshold) > num_online_cpus())
> > - __mem_cgroup_flush_stats();
> > + return atomic_read(&stats_flush_threshold) > num_online_cpus();
> > +}
> > +
> > +/* atomic functions, safe to call from any context */
> > +static void __mem_cgroup_flush_stats_atomic(void)
> > +{
> > + if (mem_cgroup_pre_stats_flush()) {
> > + cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
> > + mem_cgroup_post_stats_flush();
> > + }
> > +}
>
> I'm afraid I wasn't very nuanced with my complaint about the bool
> parameter in the previous version. In this case, when you can do a
> common helper for a couple of API functions defined right below it,
> and the callers don't spread throughout the codebase, using bools
> makes things simpler while still being easily understandable:
Looking at your suggestion now, it seems fairly obvious that this is
what I should have gone for. Will do that for v2. Thanks!
>
> static void do_flush_stats(bool may_sleep)
> {
> if (atomic_xchg(&stats_flush_ongoing, 1))
> return;
>
> WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> atomic_set(&stats_flush_threshold, 0);
>
> if (!may_sleep)
> cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
> else
> cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
>
> atomic_set(&stats_flush_ongoing, 0);
> }
>
> void mem_cgroup_flush_stats(void)
> {
> if (atomic_read(&stats_flush_threshold) > num_online_cpus())
> do_flush_stats(true);
> }
>
> void mem_cgroup_flush_stats_atomic(void)
> {
> if (atomic_read(&stats_flush_threshold) > num_online_cpus())
> do_flush_stats(false);
> }
>
> > void mem_cgroup_flush_stats_ratelimited(void)
> > {
> > if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> > - mem_cgroup_flush_stats();
> > + mem_cgroup_flush_stats_atomic();
> > +}
>
> This should probably be mem_cgroup_flush_stats_atomic_ratelimited().
>
> (Whee, kinda long, but that's alright. Very specialized caller...)
It should, but the following patch makes it non-atomic anyway, so I
thought I wouldn't clutter the diff by renaming it here and then
reverting it back in the next patch.
There is an argument for maintaining a clean history tho in case the
next patch is reverted separately (which is the reason I put it in a
separate patch to begin with) -- so perhaps I should rename it here to
mem_cgroup_flush_stats_atomic_ratelimited() and back to
mem_cgroup_flush_stats_ratelimited() in the next patch, just for
consistency?
>
> Btw, can you guys think of a reason against moving the threshold check
> into the common function? It would then apply to the time-limited
> flushes as well, but that shouldn't hurt anything. This would make the
> code even simpler:
I think the point of having the threshold check outside the common
function is that the periodic flusher always flushes, regardless of
the threshold, to keep rstat flushing from critical contexts as cheap
as possible.
If you think it's not worth it, I can make that change. It is a
separate functional change tho, so maybe in a separate patch.
>
> static void do_flush_stats(bool may_sleep)
> {
> if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
> return;
>
> if (atomic_xchg(&stats_flush_ongoing, 1))
> return;
>
> WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
> atomic_set(&stats_flush_threshold, 0);
>
> if (!may_sleep)
> cgroup_rstat_flush_atomic(root_mem_cgroup->css.cgroup);
> else
> cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
>
> atomic_set(&stats_flush_ongoing, 0);
> }
>
> void mem_cgroup_flush_stats(void)
> {
> do_flush_stats(true);
> }
>
> void mem_cgroup_flush_stats_atomic(void)
> {
> do_flush_stats(false);
> }
>
> void mem_cgroup_flush_stats_atomic_ratelimited(void)
> {
> if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> do_flush_stats(false);
> }
>
> > @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> > * Flush the memory cgroup stats, so that we read accurate per-memcg
> > * lruvec stats for heuristics.
> > */
> > - mem_cgroup_flush_stats();
> > + mem_cgroup_flush_stats_atomic();
>
> I'm thinking this one could be non-atomic as well. It's called fairly
> high up in reclaim without any locks held.
A later patch does exactly that. I put making the reclaim and refault
paths non-atomic in separate patches to easily revert them if we see a
regression. Let me know if this is too defensive and if you'd rather
have them squashed.
Thanks!
On Tue, Mar 28, 2023 at 06:16:37AM +0000, Yosry Ahmed wrote:
> Memory reclaim is a sleepable context. Allow sleeping when flushing
> memcg stats to avoid unnecessarily performing a lot of work without
> sleeping. This can slow down reclaim code if flushing stats is taking
> too long, but there is already multiple cond_resched()'s in reclaim
> code.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
Forget what I said about this in the previous patch. :)
Acked-by: Johannes Weiner <[email protected]>
On Tue, Mar 28, 2023 at 06:16:38AM +0000, Yosry Ahmed wrote:
> In some situations, we may end up calling memcg_rstat_updated() with a
> value of 0, which means the stat was not actually updated. An example is
> if we fail to reclaim any pages in shrink_folio_list().
>
> Do not add the cgroup to the rstat updated tree in this case, to avoid
> unnecessarily flushing it.
>
> Signed-off-by: Yosry Ahmed <[email protected]>
Acked-by: Johannes Weiner <[email protected]>
On Tue, Mar 28, 2023 at 7:15 AM Shakeel Butt <[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 06:16:34AM +0000, Yosry Ahmed wrote:
> [...]
> > @@ -585,8 +585,8 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
> > */
> > static void flush_memcg_stats_dwork(struct work_struct *w);
> > static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
> > -static DEFINE_SPINLOCK(stats_flush_lock);
> > static DEFINE_PER_CPU(unsigned int, stats_updates);
> > +static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
> > static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
> > static u64 flush_next_time;
> >
> > @@ -636,15 +636,18 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> >
> > static void __mem_cgroup_flush_stats(void)
> > {
> > - unsigned long flag;
> > -
> > - if (!spin_trylock_irqsave(&stats_flush_lock, flag))
> > + /*
> > + * We always flush the entire tree, so concurrent flushers can just
> > + * skip. This avoids a thundering herd problem on the rstat global lock
> > + * from memcg flushers (e.g. reclaim, refault, etc).
> > + */
> > + if (atomic_xchg(&stats_flush_ongoing, 1))
>
> Have you profiled this? I wonder if we should replace the above with
>
> if (atomic_read(&stats_flush_ongoing) || atomic_xchg(&stats_flush_ongoing, 1))
I profiled the entire series with perf and I haven't noticed a notable
difference between before and after the patch series -- but maybe some
specific access patterns cause a regression, not sure.
Does an atomic_cmpxchg() satisfy the same purpose? it's easier to read
/ more concise I guess.
Something like
if (atomic_cmpxchg(&stats_flush_ongoing, 0, 1))
WDYT?
>
> to not always dirty the cacheline. This would not be an issue if there
> is no cacheline sharing but I suspect percpu stats_updates is sharing
> the cacheline with it and may cause false sharing with the parallel stat
> updaters (updaters only need to read the base percpu pointer).
>
> Other than that the patch looks good.
On Tue, Mar 28, 2023 at 10:49 AM Johannes Weiner <[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 06:16:33AM +0000, Yosry Ahmed wrote:
> > rstat flushing is too expensive to perform in irq context.
> > The previous patch removed the only context that may invoke an rstat
> > flush from irq context, add a WARN_ON_ONCE() to detect future
> > violations, or those that we are not aware of.
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
> > ---
> > kernel/cgroup/rstat.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > index d3252b0416b6..c2571939139f 100644
> > --- a/kernel/cgroup/rstat.c
> > +++ b/kernel/cgroup/rstat.c
> > @@ -176,6 +176,8 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
> > {
> > int cpu;
> >
> > + /* rstat flushing is too expensive for irq context */
> > + WARN_ON_ONCE(!in_task());
> > lockdep_assert_held(&cgroup_rstat_lock);
>
> This seems a bit arbitrary. Why is an irq caller forbidden, but an
> irq-disabled, non-preemptible section caller is allowed? The latency
> impact on the system would be the same, right?
Thanks for taking a look.
So in the first patch series the initial purpose was to make sure
cgroup_rstat_lock was never acquired in an irq context, so that we can
stop disabling irqs while holding it. Tejun disagreed with this
approach though.
We currently have one caller that calls flushing with irqs disabled
(mem_cgroup_usage()) -- so we cannot forbid such callers (yet), but I
thought we can at least forbid callers from irq context now (or catch
those that we are not aware of), and then maybe forbid irqs_disabled()
contexts as well once we can get rid of that callsite.
WDYT?
On Tue, Mar 28, 2023 at 11:45:19AM -0700, Yosry Ahmed wrote:
> On Tue, Mar 28, 2023 at 11:35 AM Johannes Weiner <[email protected]> wrote:
> > On Tue, Mar 28, 2023 at 06:16:35AM +0000, Yosry Ahmed wrote:
> > > void mem_cgroup_flush_stats_ratelimited(void)
> > > {
> > > if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> > > - mem_cgroup_flush_stats();
> > > + mem_cgroup_flush_stats_atomic();
> > > +}
> >
> > This should probably be mem_cgroup_flush_stats_atomic_ratelimited().
> >
> > (Whee, kinda long, but that's alright. Very specialized caller...)
>
> It should, but the following patch makes it non-atomic anyway, so I
> thought I wouldn't clutter the diff by renaming it here and then
> reverting it back in the next patch.
>
> There is an argument for maintaining a clean history tho in case the
> next patch is reverted separately (which is the reason I put it in a
> separate patch to begin with) -- so perhaps I should rename it here to
> mem_cgroup_flush_stats_atomic_ratelimited () and back to
> mem_cgroup_flush_stats_ratelimited() in the next patch, just for
> consistency?
Sounds good to me. It's pretty minor churn.
> > Btw, can you guys think of a reason against moving the threshold check
> > into the common function? It would then apply to the time-limited
> > flushes as well, but that shouldn't hurt anything. This would make the
> > code even simpler:
>
> I think the point of having the threshold check outside the common
> function is that the periodic flusher always flushes, regardless of
> the threshold, to keep rstat flushing from critical contexts as cheap
> as possible.
Good point. Yeah, let's keep it separate then.
> > > @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> > > * Flush the memory cgroup stats, so that we read accurate per-memcg
> > > * lruvec stats for heuristics.
> > > */
> > > - mem_cgroup_flush_stats();
> > > + mem_cgroup_flush_stats_atomic();
> >
> > I'm thinking this one could be non-atomic as well. It's called fairly
> > high up in reclaim without any locks held.
>
> A later patch does exactly that. I put making the reclaim and refault
> paths non-atomic in separate patches to easily revert them if we see a
> regression. Let me know if this is too defensive and if you'd rather
> have them squashed.
No, good call. I should have just looked ahead first :-)
On Tue, Mar 28, 2023 at 8:19 AM Shakeel Butt <[email protected]> wrote:
>
> On Mon, Mar 27, 2023 at 11:16 PM Yosry Ahmed <[email protected]> wrote:
> >
> > Memory reclaim is a sleepable context. Allow sleeping when flushing
> > memcg stats to avoid unnecessarily performing a lot of work without
> > sleeping. This can slow down reclaim code if flushing stats is taking
> > too long, but there is already multiple cond_resched()'s in reclaim
> > code.
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
>
> Acked-by: Shakeel Butt <[email protected]>
>
> > ---
> > mm/vmscan.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index a9511ccb936f..9c1c5e8b24b8 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> > * Flush the memory cgroup stats, so that we read accurate per-memcg
> > * lruvec stats for heuristics.
> > */
> > - mem_cgroup_flush_stats_atomic();
> > + mem_cgroup_flush_stats();
>
> I wonder if we should just replace this with
> mem_cgroup_flush_stats_ratelimited().
Thanks for taking a look!
I was hesitant about doing this because the flush call is inside the
retry loop, and it seems like we want to get fresh stats on each
retry. It seems very likely that we end up not flushing between
retries with mem_cgroup_flush_stats_ratelimited().
Maybe change it if we observe problems with non-atomic flushing?
On Tue, Mar 28, 2023 at 8:18 AM Shakeel Butt <[email protected]> wrote:
>
> On Mon, Mar 27, 2023 at 11:16 PM Yosry Ahmed <[email protected]> wrote:
> >
> > In workingset_refault(), we call mem_cgroup_flush_stats_ratelimited()
> > to flush stats within an RCU read section and with sleeping disallowed.
> > Move the call to mem_cgroup_flush_stats_ratelimited() above the RCU read
> > section and allow sleeping to avoid unnecessarily performing a lot of
> > work without sleeping.
> >
> > Since workingset_refault() is the only caller of
> > mem_cgroup_flush_stats_ratelimited(), just make it call the non-atomic
> > mem_cgroup_flush_stats().
> >
> > Signed-off-by: Yosry Ahmed <[email protected]>
>
> A nit below:
>
> Acked-by: Shakeel Butt <[email protected]>
>
> > ---
> > mm/memcontrol.c | 12 ++++++------
> > mm/workingset.c | 4 ++--
> > 2 files changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 57e8cbf701f3..0c0e74188e90 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -674,12 +674,6 @@ void mem_cgroup_flush_stats_atomic(void)
> > __mem_cgroup_flush_stats_atomic();
> > }
> >
> > -void mem_cgroup_flush_stats_ratelimited(void)
> > -{
> > - if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> > - mem_cgroup_flush_stats_atomic();
> > -}
> > -
> > /* non-atomic functions, only safe from sleepable contexts */
> > static void __mem_cgroup_flush_stats(void)
> > {
> > @@ -695,6 +689,12 @@ void mem_cgroup_flush_stats(void)
> > __mem_cgroup_flush_stats();
> > }
> >
> > +void mem_cgroup_flush_stats_ratelimited(void)
> > +{
> > + if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> > + mem_cgroup_flush_stats();
> > +}
> > +
> > static void flush_memcg_stats_dwork(struct work_struct *w)
> > {
> > __mem_cgroup_flush_stats();
> > diff --git a/mm/workingset.c b/mm/workingset.c
> > index af862c6738c3..7d7ecc46521c 100644
> > --- a/mm/workingset.c
> > +++ b/mm/workingset.c
> > @@ -406,6 +406,8 @@ void workingset_refault(struct folio *folio, void *shadow)
> > unpack_shadow(shadow, &memcgid, &pgdat, &eviction, &workingset);
> > eviction <<= bucket_order;
> >
> > + /* Flush stats (and potentially sleep) before holding RCU read lock */
>
> I think the only reason we use rcu lock is due to
> mem_cgroup_from_id(). Maybe we should add mem_cgroup_tryget_from_id().
> The other caller of mem_cgroup_from_id() in vmscan is already doing
> the same and could use mem_cgroup_tryget_from_id().
I think different callers of mem_cgroup_from_id() want different things.
(a) workingset_refault() reads the memcg from the id and doesn't
really care if the memcg is online or not.
(b) __mem_cgroup_uncharge_swap() reads the memcg from the id and drops
refs acquired on the swapout path. It doesn't need tryget as we should
know for a fact that we are holding refs from the swapout path. It
doesn't care if the memcg is online or not.
(c) mem_cgroup_swapin_charge_folio() reads the memcg from the id and
then gets a ref with css_tryget_online() -- so only if the refcount is
non-zero and the memcg is online.
So we would at least need mem_cgroup_tryget_from_id() and
mem_cgroup_tryget_online_from_id() to eliminate all direct calls of
mem_cgroup_from_id(). I am hesitant about (b) because if we use
mem_cgroup_tryget_from_id() the code will be getting a ref, then
dropping the ref we have been carrying from swapout, then dropping the
ref we just acquired.
WDYT?
>
> Though this can be done separately to this series (if we decide to do
> it at all).
On Tue, Mar 28, 2023 at 12:06 PM Johannes Weiner <[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 11:45:19AM -0700, Yosry Ahmed wrote:
> > On Tue, Mar 28, 2023 at 11:35 AM Johannes Weiner <[email protected]> wrote:
> > > On Tue, Mar 28, 2023 at 06:16:35AM +0000, Yosry Ahmed wrote:
> > > > void mem_cgroup_flush_stats_ratelimited(void)
> > > > {
> > > > if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
> > > > - mem_cgroup_flush_stats();
> > > > + mem_cgroup_flush_stats_atomic();
> > > > +}
> > >
> > > This should probably be mem_cgroup_flush_stats_atomic_ratelimited().
> > >
> > > (Whee, kinda long, but that's alright. Very specialized caller...)
> >
> > It should, but the following patch makes it non-atomic anyway, so I
> > thought I wouldn't clutter the diff by renaming it here and then
> > reverting it back in the next patch.
> >
> > There is an argument for maintaining a clean history, though, in case
> > the next patch is reverted separately (which is the reason I put it in
> > a separate patch to begin with) -- so perhaps I should rename it here
> > to mem_cgroup_flush_stats_atomic_ratelimited() and back to
> > mem_cgroup_flush_stats_ratelimited() in the next patch, just for
> > consistency?
>
> Sounds good to me. It's pretty minor churn.
Ack. Will do so for v2. Thanks!
>
> > > Btw, can you guys think of a reason against moving the threshold check
> > > into the common function? It would then apply to the time-limited
> > > flushes as well, but that shouldn't hurt anything. This would make the
> > > code even simpler:
> >
> > I think the point of having the threshold check outside the common
> > function is that the periodic flusher always flushes, regardless of
> > the threshold, to keep rstat flushing from critical contexts as cheap
> > as possible.
>
> Good point. Yeah, let's keep it separate then.
Agreed.
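For reference, keeping the check out of the common function means the two
entry points end up looking roughly like this (a sketch based on the diffs
in this series and the existing dwork code; the ratelimited entry point
honors the threshold, the periodic worker does not):

void mem_cgroup_flush_stats_ratelimited(void)
{
        /* Cheap for hot callers: only flush if enough time has passed. */
        if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
                mem_cgroup_flush_stats();
}

static void flush_memcg_stats_dwork(struct work_struct *w)
{
        /* The periodic worker always flushes, regardless of the threshold. */
        __mem_cgroup_flush_stats();
        queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
}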
>
> > > > @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> > > > * Flush the memory cgroup stats, so that we read accurate per-memcg
> > > > * lruvec stats for heuristics.
> > > > */
> > > > - mem_cgroup_flush_stats();
> > > > + mem_cgroup_flush_stats_atomic();
> > >
> > > I'm thinking this one could be non-atomic as well. It's called fairly
> > > high up in reclaim without any locks held.
> >
> > A later patch does exactly that. I put making the reclaim and refault
> > paths non-atomic in separate patches to easily revert them if we see a
> > regression. Let me know if this is too defensive and if you'd rather
> > have them squashed.
>
> No, good call. I should have just looked ahead first :-)
On Tue, Mar 28, 2023 at 11:53 AM Yosry Ahmed <[email protected]> wrote:
>
[...]
> > > + if (atomic_xchg(&stats_flush_ongoing, 1))
> >
> > Have you profiled this? I wonder if we should replace the above with
> >
> > if (atomic_read(&stats_flush_ongoing) || atomic_xchg(&stats_flush_ongoing, 1))
>
> I profiled the entire series with perf and I haven't noticed a notable
> difference between before and after the patch series -- but maybe some
> specific access patterns cause a regression, not sure.
>
> Does an atomic_cmpxchg() satisfy the same purpose? it's easier to read
> / more concise I guess.
>
> Something like
>
> if (atomic_cmpxchg(&stats_flush_ongoing, 0, 1))
>
> WDYT?
>
No, I don't think cmpxchg will be any different from xchg(). On x86,
the cmpxchg will always write to stats_flush_ongoing and depending on
the comparison result, it will either be 0 or 1 here.
If you see the implementation of queued_spin_trylock(), it does the
same as well.
On Tue, Mar 28, 2023 at 12:02 PM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 8:19 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Mon, Mar 27, 2023 at 11:16 PM Yosry Ahmed <[email protected]> wrote:
> > >
> > > Memory reclaim is a sleepable context. Allow sleeping when flushing
> > > memcg stats to avoid unnecessarily performing a lot of work without
> > > rescheduling. This can slow down reclaim if flushing the stats takes
> > > too long, but there are already multiple cond_resched() calls in the
> > > reclaim code.
> > >
> > > Signed-off-by: Yosry Ahmed <[email protected]>
> >
> > Acked-by: Shakeel Butt <[email protected]>
> >
> > > ---
> > > mm/vmscan.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > index a9511ccb936f..9c1c5e8b24b8 100644
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -2845,7 +2845,7 @@ static void prepare_scan_count(pg_data_t *pgdat, struct scan_control *sc)
> > > * Flush the memory cgroup stats, so that we read accurate per-memcg
> > > * lruvec stats for heuristics.
> > > */
> > > - mem_cgroup_flush_stats_atomic();
> > > + mem_cgroup_flush_stats();
> >
> > I wonder if we should just replace this with
> > mem_cgroup_flush_stats_ratelimited().
>
> Thanks for taking a look!
>
> I was hesitant about doing this because the flush call is inside the
> retry loop, and it seems like we want to get fresh stats on each
> retry. It seems very likely that we end up not flushing between
> retries with mem_cgroup_flush_stats_ratelimited().
>
> Maybe change it if we observe problems with non-atomic flushing?
Yeah, let's leave it for the future if we see the issue.
On Tue, Mar 28, 2023 at 12:28 PM Shakeel Butt <[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 11:53 AM Yosry Ahmed <[email protected]> wrote:
> >
> [...]
> > > > + if (atomic_xchg(&stats_flush_ongoing, 1))
> > >
> > > Have you profiled this? I wonder if we should replace the above with
> > >
> > > if (atomic_read(&stats_flush_ongoing) || atomic_xchg(&stats_flush_ongoing, 1))
> >
> > I profiled the entire series with perf and I haven't noticed a notable
> > difference between before and after the patch series -- but maybe some
> > specific access patterns cause a regression, not sure.
> >
> > Does an atomic_cmpxchg() satisfy the same purpose? it's easier to read
> > / more concise I guess.
> >
> > Something like
> >
> > if (atomic_cmpxchg(&stats_flush_ongoing, 0, 1))
> >
> > WDYT?
> >
>
> No, I don't think cmpxchg will be any different from xchg(). On x86,
> the cmpxchg will always write to stats_flush_ongoing and depending on
> the comparison result, it will either be 0 or 1 here.
>
> If you see the implementation of queued_spin_trylock(), it does the
> same as well.
Interesting. I thought cmpxchg by definition will compare first and
only do the write if stats_flush_ongoing == 0 in this case.
I thought queued_spin_trylock() was doing an atomic_read() first to
avoid the LOCK instruction unnecessarily when the lock is held by
someone else.
On Tue, Mar 28, 2023 at 12:34 PM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 12:28 PM Shakeel Butt <[email protected]> wrote:
> >
> > On Tue, Mar 28, 2023 at 11:53 AM Yosry Ahmed <[email protected]> wrote:
> > >
> > [...]
> > > > > + if (atomic_xchg(&stats_flush_ongoing, 1))
> > > >
> > > > Have you profiled this? I wonder if we should replace the above with
> > > >
> > > > if (atomic_read(&stats_flush_ongoing) || atomic_xchg(&stats_flush_ongoing, 1))
> > >
> > > I profiled the entire series with perf and I haven't noticed a notable
> > > difference between before and after the patch series -- but maybe some
> > > specific access patterns cause a regression, not sure.
> > >
> > > Does an atomic_cmpxchg() satisfy the same purpose? it's easier to read
> > > / more concise I guess.
> > >
> > > Something like
> > >
> > > if (atomic_cmpxchg(&stats_flush_ongoing, 0, 1))
> > >
> > > WDYT?
> > >
> >
> > No, I don't think cmpxchg will be any different from xchg(). On x86,
> > the cmpxchg will always write to stats_flush_ongoing and depending on
> > the comparison result, it will either be 0 or 1 here.
> >
> > If you see the implementation of queued_spin_trylock(), it does the
> > same as well.
>
> Interesting. I thought cmpxchg by definition will compare first and
> only do the write if stats_flush_ongoing == 0 in this case.
>
> I thought queued_spin_trylock() was doing an atomic_read() first to
> avoid the LOCK instruction unnecessarily when the lock is held by
> someone else.
Anyway, perhaps it's better to follow what queued_spin_trylock() is
doing, even if only to avoid locking the cache line unnecessarily.
(Although now that I think about it, I wonder why atomic_cmpxchg
doesn't do this by default, food for thought)
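Concretely, the trylock-style check being discussed would look something
like this (sketch only, following Shakeel's suggestion above):

        /*
         * Like queued_spin_trylock(): the plain atomic_read() lets
         * concurrent flushers bail out without a locked RMW, so only the
         * uncontended path pays for the atomic_xchg().
         */
        if (atomic_read(&stats_flush_ongoing) ||
            atomic_xchg(&stats_flush_ongoing, 1))
                return;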
On Tue, Mar 28, 2023 at 11:59 AM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Mar 28, 2023 at 10:49 AM Johannes Weiner <[email protected]> wrote:
> >
> > On Tue, Mar 28, 2023 at 06:16:33AM +0000, Yosry Ahmed wrote:
> > > rstat flushing is too expensive to perform in irq context.
> > > The previous patch removed the only context that may invoke an rstat
> > > flush from irq context. Add a WARN_ON_ONCE() to detect future
> > > violations, or any that we are not aware of.
> > >
> > > Signed-off-by: Yosry Ahmed <[email protected]>
> > > ---
> > > kernel/cgroup/rstat.c | 2 ++
> > > 1 file changed, 2 insertions(+)
> > >
> > > diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
> > > index d3252b0416b6..c2571939139f 100644
> > > --- a/kernel/cgroup/rstat.c
> > > +++ b/kernel/cgroup/rstat.c
> > > @@ -176,6 +176,8 @@ static void cgroup_rstat_flush_locked(struct cgroup *cgrp, bool may_sleep)
> > > {
> > > int cpu;
> > >
> > > + /* rstat flushing is too expensive for irq context */
> > > + WARN_ON_ONCE(!in_task());
> > > lockdep_assert_held(&cgroup_rstat_lock);
> >
> > This seems a bit arbitrary. Why is an irq caller forbidden, but an
> > irq-disabled, non-preemptible section caller is allowed? The latency
> > impact on the system would be the same, right?
>
> Thanks for taking a look.
>
> So in the first version of this patch series, the initial purpose was
> to make sure cgroup_rstat_lock was never acquired in irq context, so
> that we could stop disabling irqs while holding it. Tejun disagreed
> with this approach though.
>
> We currently have one caller that flushes with irqs disabled
> (mem_cgroup_usage()) -- so we cannot forbid such callers (yet), but I
> thought we could at least forbid callers from irq context now (or catch
> those that we are not aware of), and then maybe forbid irqs_disabled()
> contexts as well once we can get rid of that callsite.
>
> WDYT?
I added more context in the commit log in the v2 respin [1]. Let me
know if you want me to change something else, rephrase the comment, or
drop the patch entirely.
[1] https://lore.kernel.org/linux-mm/[email protected]/
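For completeness, the stricter check discussed above -- hypothetical, and
only possible once mem_cgroup_usage() stops flushing with irqs disabled --
would be something like:

        /* rstat flushing is too expensive for irq / irqs-disabled contexts */
        WARN_ON_ONCE(!in_task() || irqs_disabled());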