This series attempts to address shortcomings in the current approach to
memcg stats flushing, namely occasionally stale or expensive stat
reads. It does so by making the threshold that we use to decide whether
to trigger a flush per-memcg instead of global (patch 3), and then
making flushing itself per-memcg (i.e. subtree flushes) instead of
global (patch 5).
Patches 3 & 5 are the core of the series; their commit messages include
more details and testing results. The rest are either cleanups or prep work.
This series replaces the "memcg: more sophisticated stats flushing"
series [1], which itself replaced another series, in a long list of
attempts to improve memcg stats flushing. It is not a new version of
the same patchset; it is a completely different approach, based on the
feedback collected from lkml discussions of all previous attempts.
Hopefully, this is the final attempt.
[1]https://lore.kernel.org/lkml/[email protected]/
v1 -> v2:
- Fixed compilation error reported by the kernel robot in patch 4, also
added a missing rcu_read_unlock().
- More testing results in the commit message of patch 3.
Yosry Ahmed (5):
mm: memcg: change flush_next_time to flush_last_time
mm: memcg: move vmstats structs definition above flushing code
mm: memcg: make stats flushing threshold per-memcg
mm: workingset: move the stats flush into workingset_test_recent()
mm: memcg: restore subtree stats flushing
include/linux/memcontrol.h | 8 +-
mm/memcontrol.c | 269 +++++++++++++++++++++----------------
mm/vmscan.c | 2 +-
mm/workingset.c | 42 ++++--
4 files changed, 185 insertions(+), 136 deletions(-)
--
2.42.0.609.gbb76f46606-goog
A following patch will make use of the vmstats structs in the flushing
code, so move their definitions (and a few other dependencies) above
the flushing code to reduce the diff noise in that patch.
No functional change intended.
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/memcontrol.c | 146 ++++++++++++++++++++++++------------------------
1 file changed, 73 insertions(+), 73 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4a194fcc9533..a393f1399a2b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -570,6 +570,79 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz)
return mz;
}
+/* Subset of vm_event_item to report for memcg event stats */
+static const unsigned int memcg_vm_event_stat[] = {
+ PGPGIN,
+ PGPGOUT,
+ PGSCAN_KSWAPD,
+ PGSCAN_DIRECT,
+ PGSCAN_KHUGEPAGED,
+ PGSTEAL_KSWAPD,
+ PGSTEAL_DIRECT,
+ PGSTEAL_KHUGEPAGED,
+ PGFAULT,
+ PGMAJFAULT,
+ PGREFILL,
+ PGACTIVATE,
+ PGDEACTIVATE,
+ PGLAZYFREE,
+ PGLAZYFREED,
+#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
+ ZSWPIN,
+ ZSWPOUT,
+#endif
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+ THP_FAULT_ALLOC,
+ THP_COLLAPSE_ALLOC,
+ THP_SWPOUT,
+ THP_SWPOUT_FALLBACK,
+#endif
+};
+
+#define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_event_stat)
+static int mem_cgroup_events_index[NR_VM_EVENT_ITEMS] __read_mostly;
+
+static void init_memcg_events(void)
+{
+ int i;
+
+ for (i = 0; i < NR_MEMCG_EVENTS; ++i)
+ mem_cgroup_events_index[memcg_vm_event_stat[i]] = i + 1;
+}
+
+static inline int memcg_events_index(enum vm_event_item idx)
+{
+ return mem_cgroup_events_index[idx] - 1;
+}
+
+struct memcg_vmstats_percpu {
+ /* Local (CPU and cgroup) page state & events */
+ long state[MEMCG_NR_STAT];
+ unsigned long events[NR_MEMCG_EVENTS];
+
+ /* Delta calculation for lockless upward propagation */
+ long state_prev[MEMCG_NR_STAT];
+ unsigned long events_prev[NR_MEMCG_EVENTS];
+
+ /* Cgroup1: threshold notifications & softlimit tree updates */
+ unsigned long nr_page_events;
+ unsigned long targets[MEM_CGROUP_NTARGETS];
+};
+
+struct memcg_vmstats {
+ /* Aggregated (CPU and subtree) page state & events */
+ long state[MEMCG_NR_STAT];
+ unsigned long events[NR_MEMCG_EVENTS];
+
+ /* Non-hierarchical (CPU aggregated) page state & events */
+ long state_local[MEMCG_NR_STAT];
+ unsigned long events_local[NR_MEMCG_EVENTS];
+
+ /* Pending child counts during tree propagation */
+ long state_pending[MEMCG_NR_STAT];
+ unsigned long events_pending[NR_MEMCG_EVENTS];
+};
+
/*
* memcg and lruvec stats flushing
*
@@ -681,79 +754,6 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
}
-/* Subset of vm_event_item to report for memcg event stats */
-static const unsigned int memcg_vm_event_stat[] = {
- PGPGIN,
- PGPGOUT,
- PGSCAN_KSWAPD,
- PGSCAN_DIRECT,
- PGSCAN_KHUGEPAGED,
- PGSTEAL_KSWAPD,
- PGSTEAL_DIRECT,
- PGSTEAL_KHUGEPAGED,
- PGFAULT,
- PGMAJFAULT,
- PGREFILL,
- PGACTIVATE,
- PGDEACTIVATE,
- PGLAZYFREE,
- PGLAZYFREED,
-#if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP)
- ZSWPIN,
- ZSWPOUT,
-#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
- THP_FAULT_ALLOC,
- THP_COLLAPSE_ALLOC,
- THP_SWPOUT,
- THP_SWPOUT_FALLBACK,
-#endif
-};
-
-#define NR_MEMCG_EVENTS ARRAY_SIZE(memcg_vm_event_stat)
-static int mem_cgroup_events_index[NR_VM_EVENT_ITEMS] __read_mostly;
-
-static void init_memcg_events(void)
-{
- int i;
-
- for (i = 0; i < NR_MEMCG_EVENTS; ++i)
- mem_cgroup_events_index[memcg_vm_event_stat[i]] = i + 1;
-}
-
-static inline int memcg_events_index(enum vm_event_item idx)
-{
- return mem_cgroup_events_index[idx] - 1;
-}
-
-struct memcg_vmstats_percpu {
- /* Local (CPU and cgroup) page state & events */
- long state[MEMCG_NR_STAT];
- unsigned long events[NR_MEMCG_EVENTS];
-
- /* Delta calculation for lockless upward propagation */
- long state_prev[MEMCG_NR_STAT];
- unsigned long events_prev[NR_MEMCG_EVENTS];
-
- /* Cgroup1: threshold notifications & softlimit tree updates */
- unsigned long nr_page_events;
- unsigned long targets[MEM_CGROUP_NTARGETS];
-};
-
-struct memcg_vmstats {
- /* Aggregated (CPU and subtree) page state & events */
- long state[MEMCG_NR_STAT];
- unsigned long events[NR_MEMCG_EVENTS];
-
- /* Non-hierarchical (CPU aggregated) page state & events */
- long state_local[MEMCG_NR_STAT];
- unsigned long events_local[NR_MEMCG_EVENTS];
-
- /* Pending child counts during tree propagation */
- long state_pending[MEMCG_NR_STAT];
- unsigned long events_pending[NR_MEMCG_EVENTS];
-};
-
unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx)
{
long x = READ_ONCE(memcg->vmstats->state[idx]);
--
2.42.0.609.gbb76f46606-goog
The workingset code flushes the stats in workingset_refault() to get
accurate stats of the eviction memcg. In preparation for more scoped
flushing, where the eviction memcg will be passed to the flush call,
move the call to workingset_test_recent(), where we have a pointer to
the eviction memcg.
The flush call is sleepable and cannot be made in an RCU read section.
Hence, minimize the RCU read section by moving it into
workingset_test_recent() as well. Furthermore, instead of holding the
RCU read lock throughout workingset_test_recent(), only hold it briefly
to get a ref on the eviction memcg. This allows us to make the flush
call after we get the eviction memcg, as outlined below.
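In outline, the resulting ordering in workingset_test_recent() looks
roughly like this (see the diff below for the actual code):

	rcu_read_lock();
	eviction_memcg = mem_cgroup_from_id(memcgid);
	/* Grab a ref so the memcg cannot go away once the RCU lock is dropped */
	if (!mem_cgroup_disabled() &&
	    (!eviction_memcg || !mem_cgroup_tryget(eviction_memcg))) {
		rcu_read_unlock();
		return false;
	}
	rcu_read_unlock();

	/* The flush may sleep, so it must happen outside the RCU read section */
	mem_cgroup_flush_stats_ratelimited();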
As for workingset_refault(), nothing else there appears to be protected
by RCU. The memcg of the faulted folio (which is not necessarily the
same as the eviction memcg) is protected by the folio lock, which is
held in all callsites. Add a VM_BUG_ON_FOLIO() to make sure this
doesn't change from under us.
No functional change intended.
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/workingset.c | 36 ++++++++++++++++++++++++------------
1 file changed, 24 insertions(+), 12 deletions(-)
diff --git a/mm/workingset.c b/mm/workingset.c
index b192e44a0e7c..a573be6c59fd 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -425,8 +425,16 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
struct pglist_data *pgdat;
unsigned long eviction;
- if (lru_gen_enabled())
- return lru_gen_test_recent(shadow, file, &eviction_lruvec, &eviction, workingset);
+ rcu_read_lock();
+
+ if (lru_gen_enabled()) {
+ bool recent = lru_gen_test_recent(shadow, file,
+ &eviction_lruvec, &eviction, workingset);
+
+ rcu_read_unlock();
+ return recent;
+ }
+
unpack_shadow(shadow, &memcgid, &pgdat, &eviction, workingset);
eviction <<= bucket_order;
@@ -448,8 +456,16 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
* configurations instead.
*/
eviction_memcg = mem_cgroup_from_id(memcgid);
- if (!mem_cgroup_disabled() && !eviction_memcg)
+ if (!mem_cgroup_disabled() &&
+ (!eviction_memcg || !mem_cgroup_tryget(eviction_memcg))) {
+ rcu_read_unlock();
return false;
+ }
+
+ rcu_read_unlock();
+
+ /* Flush stats (and potentially sleep) outside the RCU read section */
+ mem_cgroup_flush_stats_ratelimited();
eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
refault = atomic_long_read(&eviction_lruvec->nonresident_age);
@@ -493,6 +509,7 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
}
}
+ mem_cgroup_put(eviction_memcg);
return refault_distance <= workingset_size;
}
@@ -519,19 +536,16 @@ void workingset_refault(struct folio *folio, void *shadow)
return;
}
- /* Flush stats (and potentially sleep) before holding RCU read lock */
- mem_cgroup_flush_stats_ratelimited();
-
- rcu_read_lock();
-
/*
* The activation decision for this folio is made at the level
* where the eviction occurred, as that is where the LRU order
* during folio reclaim is being determined.
*
* However, the cgroup that will own the folio is the one that
- * is actually experiencing the refault event.
+ * is actually experiencing the refault event. Make sure the folio is
+ * locked to guarantee folio_memcg() stability throughout.
*/
+ VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
nr = folio_nr_pages(folio);
memcg = folio_memcg(folio);
pgdat = folio_pgdat(folio);
@@ -540,7 +554,7 @@ void workingset_refault(struct folio *folio, void *shadow)
mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + file, nr);
if (!workingset_test_recent(shadow, file, &workingset))
- goto out;
+ return;
folio_set_active(folio);
workingset_age_nonresident(lruvec, nr);
@@ -556,8 +570,6 @@ void workingset_refault(struct folio *folio, void *shadow)
lru_note_cost_refault(folio);
mod_lruvec_state(lruvec, WORKINGSET_RESTORE_BASE + file, nr);
}
-out:
- rcu_read_unlock();
}
/**
--
2.42.0.609.gbb76f46606-goog
Memcg stats flushing currently follows these rules:
- Always flush the entire memcg hierarchy (i.e. flush the root).
- Only one flusher is allowed at a time. If someone else tries to flush
concurrently, they skip and return immediately.
- A periodic flusher flushes all the stats every 2 seconds.
This approach is followed because all flushes are serialized by a
global rstat spinlock. On the memcg side, flushing is invoked from
userspace reads as well as in-kernel flushers (e.g. reclaim, refault,
etc.). The approach aims to avoid serializing all flushers on the
global lock, which can cause a significant performance hit under high
concurrency.
This approach has the following problems:
- Occasionally a userspace read of the stats of a non-root cgroup will
be too expensive as it has to flush the entire hierarchy [1].
- Sometimes stats accuracy is compromised if there is an ongoing
flush, and we skip and return before the subtree of interest is
actually flushed, yielding stale stats (by up to 2s due to periodic
flushing). This is more visible when reading stats from userspace,
but can also affect in-kernel flushers.
The latter problem is particularly a concern when userspace reads stats
after an event occurs, but gets stats from before the event. Examples:
- When memory usage / pressure spikes, a userspace OOM handler may look
at the stats of different memcgs to select a victim based on various
heuristics (e.g. how much private memory will be freed by killing
this memcg). Reading stale stats from before the usage spike in this case
may cause a wrongful OOM kill.
- A proactive reclaimer may read the stats after writing to
memory.reclaim to measure the success of the reclaim operation. Stale
stats from before reclaim may give a false negative.
- Reading the stats of a parent and a child memcg may be inconsistent
(child larger than parent), if the flush doesn't happen when the
parent is read, but happens when the child is read.
As for in-kernel flushers, they will occasionally get stale stats. No
regressions are currently known from this, but any such regressions
would be very difficult to debug and link to the source of the problem.
This patch aims to fix these problems by restoring subtree flushing,
and removing the unified/coalesced flushing logic that skips flushing
if there is an ongoing flush. On its own, this change would introduce a
significant regression with the global stats flushing threshold. With
per-memcg stats flushing thresholds, it performs really well: the
thresholds protect the underlying lock from unnecessary contention.
Add a mutex to protect the underlying rstat lock from excessive memcg
flushing. The thresholds are re-checked after the mutex is grabbed to
make sure that a concurrent flush did not already get the subtree we are
trying to flush. A call to cgroup_rstat_flush() is not cheap, even if
there are no pending updates.
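For illustration, this is roughly what a stats reader looks like with
this patch applied (mirroring the callers updated in the diff below;
nr_dirty is just an illustrative local):

	/*
	 * Readers pass the memcg of interest, so only that subtree is
	 * flushed, and only if its pending updates exceed the per-memcg
	 * threshold. Passing NULL falls back to flushing the root.
	 */
	mem_cgroup_flush_stats(memcg);
	nr_dirty = memcg_page_state(memcg, NR_FILE_DIRTY);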
This patch was tested in two ways to ensure the latency of flushing is
up to par, on a machine with 384 CPUs:
- A synthetic test with 5000 concurrent workers in 500 cgroups doing
allocations and reclaim, as well as 1000 readers for memory.stat
(variation of [2]). No regressions were noticed in the total runtime.
Note that significant regressions in this test are observed with
global stats thresholds, but not with per-memcg thresholds.
- A synthetic stress test for concurrently reading memcg stats while
memory allocation/freeing workers are running in the background,
provided by Wei Xu [3]. With 250k threads reading the stats every
100ms in 50k cgroups, 99.9% of reads take <= 50us. Less than 0.01%
of reads take more than 1ms, and no reads take more than 100ms.
[1] https://lore.kernel.org/lkml/CABWYdi0c6__rh-K7dcM_pkf9BJdTRtAU08M43KO9ME4-dsgfoQ@mail.gmail.com/
[2] https://lore.kernel.org/lkml/CAJD7tka13M-zVZTyQJYL1iUAYvuQ1fcHbCjcOBZcz6POYTV-4g@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CAAPL-u9D2b=iF5Lf_cRnKxUfkiEe0AMDTu6yhrUAzX0b6a6rDg@mail.gmail.com/
Signed-off-by: Yosry Ahmed <[email protected]>
---
include/linux/memcontrol.h | 8 ++---
mm/memcontrol.c | 73 +++++++++++++++++++++++---------------
mm/vmscan.c | 2 +-
mm/workingset.c | 10 ++++--
4 files changed, 56 insertions(+), 37 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 41790e18bf3b..f64ac140083e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1057,8 +1057,8 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
return x;
}
-void mem_cgroup_flush_stats(void);
-void mem_cgroup_flush_stats_ratelimited(void);
+void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
+void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val);
@@ -1573,11 +1573,11 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
return node_page_state(lruvec_pgdat(lruvec), idx);
}
-static inline void mem_cgroup_flush_stats(void)
+static inline void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
{
}
-static inline void mem_cgroup_flush_stats_ratelimited(void)
+static inline void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
{
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9a586893bd3e..182b4f215fc6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -666,7 +666,6 @@ struct memcg_vmstats {
*/
static void flush_memcg_stats_dwork(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
static u64 flush_last_time;
#define FLUSH_TIME (2UL*HZ)
@@ -727,35 +726,45 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
}
}
-static void do_flush_stats(void)
+static void do_flush_stats(struct mem_cgroup *memcg)
{
- /*
- * We always flush the entire tree, so concurrent flushers can just
- * skip. This avoids a thundering herd problem on the rstat global lock
- * from memcg flushers (e.g. reclaim, refault, etc).
- */
- if (atomic_read(&stats_flush_ongoing) ||
- atomic_xchg(&stats_flush_ongoing, 1))
- return;
-
- WRITE_ONCE(flush_last_time, jiffies_64);
-
- cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+ if (mem_cgroup_is_root(memcg))
+ WRITE_ONCE(flush_last_time, jiffies_64);
- atomic_set(&stats_flush_ongoing, 0);
+ cgroup_rstat_flush(memcg->css.cgroup);
}
-void mem_cgroup_flush_stats(void)
+/*
+ * mem_cgroup_flush_stats - flush the stats of a memory cgroup subtree
+ * @memcg: root of the subtree to flush
+ *
+ * Flushing is serialized by the underlying global rstat lock. There is also a
+ * minimum amount of work to be done even if there are no stat updates to flush.
+ * Hence, we only flush the stats if the updates delta exceeds a threshold. This
+ * avoids unnecessary work and contention on the underlying lock.
+ */
+void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
{
- if (memcg_should_flush_stats(root_mem_cgroup))
- do_flush_stats();
+ static DEFINE_MUTEX(memcg_stats_flush_mutex);
+
+ if (!memcg)
+ memcg = root_mem_cgroup;
+
+ if (!memcg_should_flush_stats(memcg))
+ return;
+
+ mutex_lock(&memcg_stats_flush_mutex);
+ /* An overlapping flush may have occurred, check again after locking */
+ if (memcg_should_flush_stats(memcg))
+ do_flush_stats(memcg);
+ mutex_unlock(&memcg_stats_flush_mutex);
}
-void mem_cgroup_flush_stats_ratelimited(void)
+void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
{
/* Only flush if the periodic flusher is one full cycle late */
if (time_after64(jiffies_64, READ_ONCE(flush_last_time) + 2*FLUSH_TIME))
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(memcg);
}
static void flush_memcg_stats_dwork(struct work_struct *w)
@@ -764,7 +773,7 @@ static void flush_memcg_stats_dwork(struct work_struct *w)
* Deliberately ignore memcg_should_flush_stats() here so that flushing
* in latency-sensitive paths is as cheap as possible.
*/
- do_flush_stats();
+ do_flush_stats(root_mem_cgroup);
queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
}
@@ -1639,7 +1648,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
*
* Current memory state:
*/
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(memcg);
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
u64 size;
@@ -4208,7 +4217,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
int nid;
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(memcg);
for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
seq_printf(m, "%s=%lu", stat->name,
@@ -4289,7 +4298,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(memcg);
for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
unsigned long nr;
@@ -4785,7 +4794,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
struct mem_cgroup *parent;
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(memcg);
*pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
*pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
@@ -6861,7 +6870,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
int i;
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(memcg);
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
int nid;
@@ -8100,7 +8109,11 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg)
break;
}
- cgroup_rstat_flush(memcg->css.cgroup);
+ /*
+ * mem_cgroup_flush_stats() ignores small changes. Use
+ * do_flush_stats() directly to get accurate stats for charging.
+ */
+ do_flush_stats(memcg);
pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE;
if (pages < max)
continue;
@@ -8165,8 +8178,10 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size)
static u64 zswap_current_read(struct cgroup_subsys_state *css,
struct cftype *cft)
{
- cgroup_rstat_flush(css->cgroup);
- return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B);
+ struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+ mem_cgroup_flush_stats(memcg);
+ return memcg_page_state(memcg, MEMCG_ZSWAP_B);
}
static int zswap_max_show(struct seq_file *m, void *v)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c16e2b1ea8ae..2cc0cb41fb32 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2246,7 +2246,7 @@ static void prepare_scan_control(pg_data_t *pgdat, struct scan_control *sc)
* Flush the memory cgroup stats, so that we read accurate per-memcg
* lruvec stats for heuristics.
*/
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(sc->target_mem_cgroup);
/*
* Determine the scan balance between anon and file LRUs.
diff --git a/mm/workingset.c b/mm/workingset.c
index a573be6c59fd..11045febc383 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -464,8 +464,12 @@ bool workingset_test_recent(void *shadow, bool file, bool *workingset)
rcu_read_unlock();
- /* Flush stats (and potentially sleep) outside the RCU read section */
- mem_cgroup_flush_stats_ratelimited();
+ /*
+ * Flush stats (and potentially sleep) outside the RCU read section.
+ * XXX: With per-memcg flushing and thresholding, is ratelimiting
+ * still needed here?
+ */
+ mem_cgroup_flush_stats_ratelimited(eviction_memcg);
eviction_lruvec = mem_cgroup_lruvec(eviction_memcg, pgdat);
refault = atomic_long_read(&eviction_lruvec->nonresident_age);
@@ -676,7 +680,7 @@ static unsigned long count_shadow_nodes(struct shrinker *shrinker,
struct lruvec *lruvec;
int i;
- mem_cgroup_flush_stats();
+ mem_cgroup_flush_stats(sc->memcg);
lruvec = mem_cgroup_lruvec(sc->memcg, NODE_DATA(sc->nid));
for (pages = 0, i = 0; i < NR_LRU_LISTS; i++)
pages += lruvec_page_state_local(lruvec,
--
2.42.0.609.gbb76f46606-goog
flush_next_time is an inaccurate name. It is not the next time that
periodic flushing will happen, but rather the next time that
ratelimited flushing can happen if the periodic flusher is late.
Simplify its semantics by just storing the timestamp of the last flush
instead, flush_last_time. Move the 2*FLUSH_TIME addition to
mem_cgroup_flush_stats_ratelimited(), and add a comment explaining it.
This way, all the ratelimiting semantics live in one place.
No functional change intended.
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/memcontrol.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2fb30abaf267..4a194fcc9533 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -590,7 +590,7 @@ static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
static DEFINE_PER_CPU(unsigned int, stats_updates);
static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
-static u64 flush_next_time;
+static u64 flush_last_time;
#define FLUSH_TIME (2UL*HZ)
@@ -650,7 +650,7 @@ static void do_flush_stats(void)
atomic_xchg(&stats_flush_ongoing, 1))
return;
- WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME);
+ WRITE_ONCE(flush_last_time, jiffies_64);
cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
@@ -666,7 +666,8 @@ void mem_cgroup_flush_stats(void)
void mem_cgroup_flush_stats_ratelimited(void)
{
- if (time_after64(jiffies_64, READ_ONCE(flush_next_time)))
+ /* Only flush if the periodic flusher is one full cycle late */
+ if (time_after64(jiffies_64, READ_ONCE(flush_last_time) + 2*FLUSH_TIME))
mem_cgroup_flush_stats();
}
--
2.42.0.609.gbb76f46606-goog
A global counter for the magnitude of memcg stat updates is maintained
on the memcg side to avoid invoking rstat flushes when the pending
updates are not significant. This avoids unnecessary flushes, which are
not very cheap even if there aren't a lot of stats to flush. It also
avoids unnecessary lock contention on the underlying global rstat lock.
Make this threshold per-memcg. The same scheme is used: percpu (now
also per-memcg) counters are incremented in the update path, and only
propagated to per-memcg atomics when they exceed a certain threshold.
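For illustration, a simplified sketch of the update path implementing
this scheme (an outline, not the exact code from the diff below):

	unsigned int x;

	/* Walk up so every ancestor tracks the pending updates in its subtree */
	for (; memcg; memcg = parent_mem_cgroup(memcg)) {
		/* Cheap percpu accumulation in the common case */
		x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
					  abs(val));
		if (x < MEMCG_CHARGE_BATCH)
			continue;

		/*
		 * Fold into the per-memcg atomic and reset the percpu
		 * counter. (The actual patch also skips the atomic add
		 * when the memcg is already flush-able.)
		 */
		atomic64_add(x, &memcg->vmstats->stats_updates);
		__this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
	}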
This provides two benefits:
(a) On large machines with a lot of memcgs, the global threshold can be
reached relatively fast, so guarding the underlying lock becomes less
effective. Making the threshold per-memcg avoids this.
(b) Having a global threshold makes it hard to do subtree flushes, as
we cannot reset the global counter except on a full flush. Per-memcg
counters remove this blocker to doing subtree flushes, which helps
avoid unnecessary work when the stats of a small subtree are needed.
Nothing is free, of course. This comes at a cost:
(a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
bytes. The extra memory usage is insignificant.
(b) More work on the update side, although in the common case it will
only be percpu counter updates. The amount of work scales with the
number of ancestors (i.e. tree depth). This is not a new concept:
adding a cgroup to the rstat tree involves a parent loop, and so does
charging. Testing results below show no significant regressions.
(c) The error margin in the stats for the system as a whole increases
from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
NR_MEMCGS (rough numbers below). This is probably fine because we have
a similar per-memcg error in charges coming from percpu stocks, and we
have a periodic flusher that makes sure we always flush all the stats
every 2s anyway.
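To put rough numbers on the per-memcg bound (assuming MEMCG_CHARGE_BATCH
is 64, its value at the time of writing): on the 384-CPU test machine
below, each memcg can accumulate roughly up to 384 * 64 = 24576 pending
stat updates before it is considered flush-able. Previously that bound
applied to the system as a whole; now it applies to each memcg
independently.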
This patch was tested to make sure no significant regressions are
introduced on the update path as follows. The following benchmarks were
run in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is
deeper than a usual setup:
(a) neper [1] with 1000 flows and 100 threads (single machine). The
values in the table are the average of server and client throughputs in
mbps after 30 iterations, each running for 30s:
                       tcp_rr        tcp_stream
Base                   9504218.56    357366.84
Patched                9656205.68    356978.39
Delta                  +1.6%         -0.1%
Standard Deviation     0.95%         1.03%
An increase in the performance of tcp_rr doesn't really make sense, but
it's probably in the noise. The same tests were run with 1 flow and 1
thread, but the throughput was too noisy to draw any conclusions (the
averages did not show regressions nonetheless).
Looking at perf for one iteration of the above test, __mod_memcg_state()
(which is where memcg_rstat_updated() is called) does not show up at all
without this patch, but it shows up with this patch as 1.06% for tcp_rr
and 0.36% for tcp_stream.
(b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
stress-ng very well, so I am not sure that's the best way to test this,
but it spawns 384 workers and spits a lot of metrics which looks nice :)
I picked a few ones that seem to be relevant to the stats update path. I
also included cache misses as this patch introduce more atomics that may
bounce between cpu caches:
Metric                  Base          Patched       Delta
Cache Misses            3.394 B/sec   3.433 B/sec   +1.14%
Cache L1D Read          0.148 T/sec   0.154 T/sec   +4.05%
Cache L1D Read Miss     20.430 B/sec  21.820 B/sec  +6.8%
Page Faults Total       4.304 M/sec   4.535 M/sec   +5.4%
Page Faults Minor       4.304 M/sec   4.535 M/sec   +5.4%
Page Faults Major       18.794 /sec   0.000 /sec
Kmalloc                 0.153 M/sec   0.152 M/sec   -0.65%
Kfree                   0.152 M/sec   0.153 M/sec   +0.65%
MM Page Alloc           4.640 M/sec   4.898 M/sec   +5.56%
MM Page Free            4.639 M/sec   4.897 M/sec   +5.56%
Lock Contention Begin   0.362 M/sec   0.479 M/sec   +32.32%
Lock Contention End     0.362 M/sec   0.479 M/sec   +32.32%
page-cache add          238.057 /sec  0.000 /sec
page-cache del          6.265 /sec    6.267 /sec    -0.03%
This is only using a single run in each case. I am not sure what to
make of most of these numbers, but they mostly seem to be in the noise
(some better, some worse). The lock contention numbers are interesting.
I am not sure if higher is better or worse here. No new locks or lock
sections are introduced by this patch either way.
Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
this patch. This is suspicious, but I verified while stress-ng was
running that all the threads were in the right cgroup.
(c) will-it-scale page_fault tests. These tests (specifically
per_process_ops in the page_fault3 test) previously detected a 25.9%
regression for a change in the stats update path [2]. These are the
numbers from 30 runs (+ is good):
LABEL                         | MEAN        | MEDIAN      | STDDEV      |
------------------------------+-------------+-------------+-------------
page_fault1_per_process_ops   |             |             |             |
(A) base                      | 265207.738  | 262941.000  | 12112.379   |
(B) patched                   | 249249.191  | 248781.000  | 8767.457    |
                              | -6.02%      | -5.39%      |             |
page_fault1_per_thread_ops    |             |             |             |
(A) base                      | 241618.484  | 240209.000  | 10162.207   |
(B) patched                   | 229820.671  | 229108.000  | 7506.582    |
                              | -4.88%      | -4.62%      |             |
page_fault1_scalability       |             |             |             |
(A) base                      | 0.03545     | 0.035705    | 0.0015837   |
(B) patched                   | 0.029952    | 0.029957    | 0.0013551   |
                              | -9.29%      | -9.35%      |             |
page_fault2_per_process_ops   |             |             |             |
(A) base                      | 203916.148  | 203496.000  | 2908.331    |
(B) patched                   | 186975.419  | 187023.000  | 1991.100    |
                              | -6.85%      | -6.90%      |             |
page_fault2_per_thread_ops    |             |             |             |
(A) base                      | 170604.972  | 170532.000  | 1624.834    |
(B) patched                   | 163100.260  | 163263.000  | 1517.967    |
                              | -4.40%      | -4.26%      |             |
page_fault2_scalability       |             |             |             |
(A) base                      | 0.054603    | 0.054693    | 0.00080196  |
(B) patched                   | 0.044882    | 0.044957    | 0.0011766   |
                              | -0.05%      | +0.33%      |             |
page_fault3_per_process_ops   |             |             |             |
(A) base                      | 1299821.099 | 1297918.000 | 9882.872    |
(B) patched                   | 1248700.839 | 1247168.000 | 8454.891    |
                              | -3.93%      | -3.91%      |             |
page_fault3_per_thread_ops    |             |             |             |
(A) base                      | 387216.963  | 387115.000  | 1605.760    |
(B) patched                   | 368538.213  | 368826.000  | 1852.594    |
                              | -4.82%      | -4.72%      |             |
page_fault3_scalability       |             |             |             |
(A) base                      | 0.59909     | 0.59367     | 0.01256     |
(B) patched                   | 0.59995     | 0.59769     | 0.010088    |
                              | +0.14%      | +0.68%      |             |
There are some microbenchmark regressions (and some minute
improvements), but nothing outside the normal variance of this
benchmark between kernel versions. The fix for [2] assumed that 3% is
noise (and there were no further practical complaints), so hopefully
this means that such variations in these microbenchmarks do not reflect
on practical workloads.
[1]https://github.com/google/neper
[2]https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/memcontrol.c | 49 +++++++++++++++++++++++++++++++++----------------
1 file changed, 33 insertions(+), 16 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index a393f1399a2b..9a586893bd3e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -627,6 +627,9 @@ struct memcg_vmstats_percpu {
/* Cgroup1: threshold notifications & softlimit tree updates */
unsigned long nr_page_events;
unsigned long targets[MEM_CGROUP_NTARGETS];
+
+ /* Stats updates since the last flush */
+ unsigned int stats_updates;
};
struct memcg_vmstats {
@@ -641,6 +644,9 @@ struct memcg_vmstats {
/* Pending child counts during tree propagation */
long state_pending[MEMCG_NR_STAT];
unsigned long events_pending[NR_MEMCG_EVENTS];
+
+ /* Stats updates since the last flush */
+ atomic64_t stats_updates;
};
/*
@@ -660,9 +666,7 @@ struct memcg_vmstats {
*/
static void flush_memcg_stats_dwork(struct work_struct *w);
static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork);
-static DEFINE_PER_CPU(unsigned int, stats_updates);
static atomic_t stats_flush_ongoing = ATOMIC_INIT(0);
-static atomic_t stats_flush_threshold = ATOMIC_INIT(0);
static u64 flush_last_time;
#define FLUSH_TIME (2UL*HZ)
@@ -689,26 +693,37 @@ static void memcg_stats_unlock(void)
preempt_enable_nested();
}
+
+static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
+{
+ return atomic64_read(&memcg->vmstats->stats_updates) >
+ MEMCG_CHARGE_BATCH * num_online_cpus();
+}
+
static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
+ int cpu = smp_processor_id();
unsigned int x;
if (!val)
return;
- cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id());
+ cgroup_rstat_updated(memcg->css.cgroup, cpu);
+
+ for (; memcg; memcg = parent_mem_cgroup(memcg)) {
+ x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
+ abs(val));
+
+ if (x < MEMCG_CHARGE_BATCH)
+ continue;
- x = __this_cpu_add_return(stats_updates, abs(val));
- if (x > MEMCG_CHARGE_BATCH) {
/*
- * If stats_flush_threshold exceeds the threshold
- * (>num_online_cpus()), cgroup stats update will be triggered
- * in __mem_cgroup_flush_stats(). Increasing this var further
- * is redundant and simply adds overhead in atomic update.
+ * If @memcg is already flush-able, increasing stats_updates is
+ * redundant. Avoid the overhead of the atomic update.
*/
- if (atomic_read(&stats_flush_threshold) <= num_online_cpus())
- atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
- __this_cpu_write(stats_updates, 0);
+ if (!memcg_should_flush_stats(memcg))
+ atomic64_add(x, &memcg->vmstats->stats_updates);
+ __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
}
}
@@ -727,13 +742,12 @@ static void do_flush_stats(void)
cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
- atomic_set(&stats_flush_threshold, 0);
atomic_set(&stats_flush_ongoing, 0);
}
void mem_cgroup_flush_stats(void)
{
- if (atomic_read(&stats_flush_threshold) > num_online_cpus())
+ if (memcg_should_flush_stats(root_mem_cgroup))
do_flush_stats();
}
@@ -747,8 +761,8 @@ void mem_cgroup_flush_stats_ratelimited(void)
static void flush_memcg_stats_dwork(struct work_struct *w)
{
/*
- * Always flush here so that flushing in latency-sensitive paths is
- * as cheap as possible.
+ * Deliberately ignore memcg_should_flush_stats() here so that flushing
+ * in latency-sensitive paths is as cheap as possible.
*/
do_flush_stats();
queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME);
@@ -5803,6 +5817,9 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
}
}
}
+ /* We are in a per-cpu loop here, only do the atomic write once */
+ if (atomic64_read(&memcg->vmstats->stats_updates))
+ atomic64_set(&memcg->vmstats->stats_updates, 0);
}
#ifdef CONFIG_MMU
--
2.42.0.609.gbb76f46606-goog
On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <[email protected]> wrote:
>
> A global counter for the magnitude of memcg stats update is maintained
> on the memcg side to avoid invoking rstat flushes when the pending
> updates are not significant. This avoids unnecessary flushes, which are
> not very cheap even if there isn't a lot of stats to flush. It also
> avoids unnecessary lock contention on the underlying global rstat lock.
>
> Make this threshold per-memcg. The scheme is followed where percpu (now
> also per-memcg) counters are incremented in the update path, and only
> propagated to per-memcg atomics when they exceed a certain threshold.
>
> This provides two benefits:
> (a) On large machines with a lot of memcgs, the global threshold can be
> reached relatively fast, so guarding the underlying lock becomes less
> effective. Making the threshold per-memcg avoids this.
>
> (b) Having a global threshold makes it hard to do subtree flushes, as we
> cannot reset the global counter except for a full flush. Per-memcg
> counters removes this as a blocker from doing subtree flushes, which
> helps avoid unnecessary work when the stats of a small subtree are
> needed.
>
> Nothing is free, of course. This comes at a cost:
> (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> bytes. The extra memory usage is insigificant.
>
> (b) More work on the update side, although in the common case it will
> only be percpu counter updates. The amount of work scales with the
> number of ancestors (i.e. tree depth). This is not a new concept, adding
> a cgroup to the rstat tree involves a parent loop, so is charging.
> Testing results below show no significant regressions.
>
> (c) The error margin in the stats for the system as a whole increases
> from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> NR_MEMCGS. This is probably fine because we have a similar per-memcg
> error in charges coming from percpu stocks, and we have a periodic
> flusher that makes sure we always flush all the stats every 2s anyway.
>
> This patch was tested to make sure no significant regressions are
> introduced on the update path as follows. The following benchmarks were
> ran in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is
> deeper than a usual setup:
>
> (a) neper [1] with 1000 flows and 100 threads (single machine). The
> values in the table are the average of server and client throughputs in
> mbps after 30 iterations, each running for 30s:
>
> tcp_rr tcp_stream
> Base 9504218.56 357366.84
> Patched 9656205.68 356978.39
> Delta +1.6% -0.1%
> Standard Deviation 0.95% 1.03%
>
> An increase in the performance of tcp_rr doesn't really make sense, but
> it's probably in the noise. The same tests were ran with 1 flow and 1
> thread but the throughput was too noisy to make any conclusions (the
> averages did not show regressions nonetheless).
>
> Looking at perf for one iteration of the above test, __mod_memcg_state()
> (which is where memcg_rstat_updated() is called) does not show up at all
> without this patch, but it shows up with this patch as 1.06% for tcp_rr
> and 0.36% for tcp_stream.
>
> (b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
> stress-ng very well, so I am not sure that's the best way to test this,
> but it spawns 384 workers and spits a lot of metrics which looks nice :)
> I picked a few ones that seem to be relevant to the stats update path. I
> also included cache misses as this patch introduce more atomics that may
> bounce between cpu caches:
>
> Metric Base Patched Delta
> Cache Misses 3.394 B/sec 3.433 B/sec +1.14%
> Cache L1D Read 0.148 T/sec 0.154 T/sec +4.05%
> Cache L1D Read Miss 20.430 B/sec 21.820 B/sec +6.8%
> Page Faults Total 4.304 M/sec 4.535 M/sec +5.4%
> Page Faults Minor 4.304 M/sec 4.535 M/sec +5.4%
> Page Faults Major 18.794 /sec 0.000 /sec
> Kmalloc 0.153 M/sec 0.152 M/sec -0.65%
> Kfree 0.152 M/sec 0.153 M/sec +0.65%
> MM Page Alloc 4.640 M/sec 4.898 M/sec +5.56%
> MM Page Free 4.639 M/sec 4.897 M/sec +5.56%
> Lock Contention Begin 0.362 M/sec 0.479 M/sec +32.32%
> Lock Contention End 0.362 M/sec 0.479 M/sec +32.32%
> page-cache add 238.057 /sec 0.000 /sec
> page-cache del 6.265 /sec 6.267 /sec -0.03%
>
> This is only using a single run in each case. I am not sure what to
> make out of most of these numbers, but they mostly seem in the noise
> (some better, some worse). The lock contention numbers are interesting.
> I am not sure if higher is better or worse here. No new locks or lock
> sections are introduced by this patch either way.
>
> Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
> this patch. This is suspicious, but I verified while stress-ng is
> running that all the threads are in the right cgroup.
>
> (3) will-it-scale page_fault tests. These tests (specifically
> per_process_ops in page_fault3 test) detected a 25.9% regression before
> for a change in the stats update path [2]. These are the
> numbers from 30 runs (+ is good):
>
> LABEL | MEAN | MEDIAN | STDDEV |
> ------------------------------+-------------+-------------+-------------
> page_fault1_per_process_ops | | | |
> (A) base | 265207.738 | 262941.000 | 12112.379 |
> (B) patched | 249249.191 | 248781.000 | 8767.457 |
> | -6.02% | -5.39% | |
> page_fault1_per_thread_ops | | | |
> (A) base | 241618.484 | 240209.000 | 10162.207 |
> (B) patched | 229820.671 | 229108.000 | 7506.582 |
> | -4.88% | -4.62% | |
> page_fault1_scalability | | |
> (A) base | 0.03545 | 0.035705 | 0.0015837 |
> (B) patched | 0.029952 | 0.029957 | 0.0013551 |
> | -9.29% | -9.35% | |
> page_fault2_per_process_ops | | |
> (A) base | 203916.148 | 203496.000 | 2908.331 |
> (B) patched | 186975.419 | 187023.000 | 1991.100 |
> | -6.85% | -6.90% | |
> page_fault2_per_thread_ops | | |
> (A) base | 170604.972 | 170532.000 | 1624.834 |
> (B) patched | 163100.260 | 163263.000 | 1517.967 |
> | -4.40% | -4.26% | |
> page_fault2_scalability | | |
> (A) base | 0.054603 | 0.054693 | 0.00080196 |
> (B) patched | 0.044882 | 0.044957 | 0.0011766 |
> | -0.05% | +0.33% | |
> page_fault3_per_process_ops | | |
> (A) base | 1299821.099 | 1297918.000 | 9882.872 |
> (B) patched | 1248700.839 | 1247168.000 | 8454.891 |
> | -3.93% | -3.91% | |
> page_fault3_per_thread_ops | | |
> (A) base | 387216.963 | 387115.000 | 1605.760 |
> (B) patched | 368538.213 | 368826.000 | 1852.594 |
> | -4.82% | -4.72% | |
> page_fault3_scalability | | |
> (A) base | 0.59909 | 0.59367 | 0.01256 |
> (B) patched | 0.59995 | 0.59769 | 0.010088 |
> | +0.14% | +0.68% | |
>
> There is some microbenchmarks regressions (and some minute improvements),
> but nothing outside the normal variance of this benchmark between kernel
> versions. The fix for [2] assumes that 3% is noise -- and there were no
> further practical complaints), so hopefully this means that such variations
> in these microbenchmarks do not reflect on practical workloads.
>
> [1]https://github.com/google/neper
> [2]https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
>
> Signed-off-by: Yosry Ahmed <[email protected]>
Johannes, as I mentioned in a reply to v1, I think this might be what
you suggested in our previous discussion [1], but I am not sure this
is what you meant for the update path, so I did not add a
Suggested-by.
Please let me know if this is what you meant, and I will add the tag.
[1]https://lore.kernel.org/lkml/[email protected]/
On Tue, Oct 10, 2023 at 5:21 AM Yosry Ahmed
<[email protected]> wrote:
>
> This series attempts to address shortages in today's approach for memcg
> stats flushing, namely occasionally stale or expensive stat reads. The
> series does so by changing the threshold that we use to decide whether
> to trigger a flush to be per memcg instead of global (patch 3), and then
> changing flushing to be per memcg (i.e. subtree flushes) instead of
> global (patch 5).
>
> Patch 3 & 5 are the core of the series, and they include more details
> and testing results. The rest are either cleanups or prep work.
>
> This series replaces the "memcg: more sophisticated stats flushing"
> series [1], which also replaces another series, in a long list of
> attempts to improve memcg stats flushing. It is not a new version of
> the same patchset as it is a completely different approach. This is
> based on collected feedback from discussions on lkml in all previous
> attempts. Hopefully, this is the final attempt.
>
> [1]https://lore.kernel.org/lkml/[email protected]/
>
> v1 -> v2:
> - Fixed compilation error reported by the kernel robot in patch 4, also
> added a missing rcu_read_unlock().
> - More testing results in the commit message of patch 3.
>
> Yosry Ahmed (5):
> mm: memcg: change flush_next_time to flush_last_time
> mm: memcg: move vmstats structs definition above flushing code
> mm: memcg: make stats flushing threshold per-memcg
> mm: workingset: move the stats flush into workingset_test_recent()
> mm: memcg: restore subtree stats flushing
>
> include/linux/memcontrol.h | 8 +-
> mm/memcontrol.c | 269 +++++++++++++++++++++----------------
> mm/vmscan.c | 2 +-
> mm/workingset.c | 42 ++++--
> 4 files changed, 185 insertions(+), 136 deletions(-)
>
> --
> 2.42.0.609.gbb76f46606-goog
>
>
Hi Yosry,
thanks for this series! We backported it onto a 5.19-based kernel and
have been running it on a machine for almost a week now. The goal was
to fix a CPU utilization regression caused by memory stats readings;
it seems that this series was the last bit needed to completely fix it
and bring CPU utilization back to 5.12 levels.
FWIW,
Tested-by: Domenico Cerasuolo <[email protected]>
On Tue, Oct 10, 2023 at 9:48 AM domenico cerasuolo
<[email protected]> wrote:
>
> On Tue, Oct 10, 2023 at 5:21 AM Yosry Ahmed
> <[email protected]> wrote:
> >
> > This series attempts to address shortages in today's approach for memcg
> > stats flushing, namely occasionally stale or expensive stat reads. The
> > series does so by changing the threshold that we use to decide whether
> > to trigger a flush to be per memcg instead of global (patch 3), and then
> > changing flushing to be per memcg (i.e. subtree flushes) instead of
> > global (patch 5).
> >
> > Patch 3 & 5 are the core of the series, and they include more details
> > and testing results. The rest are either cleanups or prep work.
> >
> > This series replaces the "memcg: more sophisticated stats flushing"
> > series [1], which also replaces another series, in a long list of
> > attempts to improve memcg stats flushing. It is not a new version of
> > the same patchset as it is a completely different approach. This is
> > based on collected feedback from discussions on lkml in all previous
> > attempts. Hopefully, this is the final attempt.
> >
> > [1]https://lore.kernel.org/lkml/[email protected]/
> >
> > v1 -> v2:
> > - Fixed compilation error reported by the kernel robot in patch 4, also
> > added a missing rcu_read_unlock().
> > - More testing results in the commit message of patch 3.
> >
> > Yosry Ahmed (5):
> > mm: memcg: change flush_next_time to flush_last_time
> > mm: memcg: move vmstats structs definition above flushing code
> > mm: memcg: make stats flushing threshold per-memcg
> > mm: workingset: move the stats flush into workingset_test_recent()
> > mm: memcg: restore subtree stats flushing
> >
> > include/linux/memcontrol.h | 8 +-
> > mm/memcontrol.c | 269 +++++++++++++++++++++----------------
> > mm/vmscan.c | 2 +-
> > mm/workingset.c | 42 ++++--
> > 4 files changed, 185 insertions(+), 136 deletions(-)
> >
> > --
> > 2.42.0.609.gbb76f46606-goog
> >
> >
>
> Hi Yosry,
>
> thanks for this series! We backported it on a 5.19-based kernel and ran it on a
> machine for almost a week now. The goal was to fix a CPU utilization regression
> caused by memory stats readings, it seems that this series was the last bit
> needed to completely fix it and bring CPU utilization to 5.12 levels.
>
> FWIW,
>
> Tested-by: Domenico Cerasuolo <[email protected]>
That's awesome. Thanks for the testing!
On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <[email protected]> wrote:
>
> A global counter for the magnitude of memcg stats update is maintained
> on the memcg side to avoid invoking rstat flushes when the pending
> updates are not significant. This avoids unnecessary flushes, which are
> not very cheap even if there isn't a lot of stats to flush. It also
> avoids unnecessary lock contention on the underlying global rstat lock.
>
> Make this threshold per-memcg. The scheme is followed where percpu (now
> also per-memcg) counters are incremented in the update path, and only
> propagated to per-memcg atomics when they exceed a certain threshold.
>
> This provides two benefits:
> (a) On large machines with a lot of memcgs, the global threshold can be
> reached relatively fast, so guarding the underlying lock becomes less
> effective. Making the threshold per-memcg avoids this.
>
> (b) Having a global threshold makes it hard to do subtree flushes, as we
> cannot reset the global counter except for a full flush. Per-memcg
> counters removes this as a blocker from doing subtree flushes, which
> helps avoid unnecessary work when the stats of a small subtree are
> needed.
>
> Nothing is free, of course. This comes at a cost:
> (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> bytes. The extra memory usage is insigificant.
>
> (b) More work on the update side, although in the common case it will
> only be percpu counter updates. The amount of work scales with the
> number of ancestors (i.e. tree depth). This is not a new concept, adding
> a cgroup to the rstat tree involves a parent loop, so is charging.
> Testing results below show no significant regressions.
>
> (c) The error margin in the stats for the system as a whole increases
> from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> NR_MEMCGS. This is probably fine because we have a similar per-memcg
> error in charges coming from percpu stocks, and we have a periodic
> flusher that makes sure we always flush all the stats every 2s anyway.
>
> This patch was tested to make sure no significant regressions are
> introduced on the update path as follows. The following benchmarks were
> ran in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is
> deeper than a usual setup:
>
> (a) neper [1] with 1000 flows and 100 threads (single machine). The
> values in the table are the average of server and client throughputs in
> mbps after 30 iterations, each running for 30s:
>
> tcp_rr tcp_stream
> Base 9504218.56 357366.84
> Patched 9656205.68 356978.39
> Delta +1.6% -0.1%
> Standard Deviation 0.95% 1.03%
>
> An increase in the performance of tcp_rr doesn't really make sense, but
> it's probably in the noise. The same tests were ran with 1 flow and 1
> thread but the throughput was too noisy to make any conclusions (the
> averages did not show regressions nonetheless).
>
> Looking at perf for one iteration of the above test, __mod_memcg_state()
> (which is where memcg_rstat_updated() is called) does not show up at all
> without this patch, but it shows up with this patch as 1.06% for tcp_rr
> and 0.36% for tcp_stream.
>
> (b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
> stress-ng very well, so I am not sure that's the best way to test this,
> but it spawns 384 workers and spits a lot of metrics which looks nice :)
> I picked a few ones that seem to be relevant to the stats update path. I
> also included cache misses as this patch introduce more atomics that may
> bounce between cpu caches:
>
> Metric Base Patched Delta
> Cache Misses 3.394 B/sec 3.433 B/sec +1.14%
> Cache L1D Read 0.148 T/sec 0.154 T/sec +4.05%
> Cache L1D Read Miss 20.430 B/sec 21.820 B/sec +6.8%
> Page Faults Total 4.304 M/sec 4.535 M/sec +5.4%
> Page Faults Minor 4.304 M/sec 4.535 M/sec +5.4%
> Page Faults Major 18.794 /sec 0.000 /sec
> Kmalloc 0.153 M/sec 0.152 M/sec -0.65%
> Kfree 0.152 M/sec 0.153 M/sec +0.65%
> MM Page Alloc 4.640 M/sec 4.898 M/sec +5.56%
> MM Page Free 4.639 M/sec 4.897 M/sec +5.56%
> Lock Contention Begin 0.362 M/sec 0.479 M/sec +32.32%
> Lock Contention End 0.362 M/sec 0.479 M/sec +32.32%
> page-cache add 238.057 /sec 0.000 /sec
> page-cache del 6.265 /sec 6.267 /sec -0.03%
>
> This is only using a single run in each case. I am not sure what to
> make out of most of these numbers, but they mostly seem in the noise
> (some better, some worse). The lock contention numbers are interesting.
> I am not sure if higher is better or worse here. No new locks or lock
> sections are introduced by this patch either way.
>
> Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
> this patch. This is suspicious, but I verified while stress-ng is
> running that all the threads are in the right cgroup.
>
> (3) will-it-scale page_fault tests. These tests (specifically
> per_process_ops in page_fault3 test) detected a 25.9% regression before
> for a change in the stats update path [2]. These are the
> numbers from 30 runs (+ is good):
>
> LABEL | MEAN | MEDIAN | STDDEV |
> ------------------------------+-------------+-------------+-------------
> page_fault1_per_process_ops | | | |
> (A) base | 265207.738 | 262941.000 | 12112.379 |
> (B) patched | 249249.191 | 248781.000 | 8767.457 |
> | -6.02% | -5.39% | |
> page_fault1_per_thread_ops | | | |
> (A) base | 241618.484 | 240209.000 | 10162.207 |
> (B) patched | 229820.671 | 229108.000 | 7506.582 |
> | -4.88% | -4.62% | |
> page_fault1_scalability | | |
> (A) base | 0.03545 | 0.035705 | 0.0015837 |
> (B) patched | 0.029952 | 0.029957 | 0.0013551 |
> | -9.29% | -9.35% | |
This much regression is not acceptable.
In addition, I ran netperf with the same 4-level hierarchy as you used,
and I am seeing a ~11% regression.
More specifically, on a machine with 44 CPUs (an ixion machine with HT
disabled):
# for server
$ netserver -6
# 22 instances of netperf clients
$ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
(averaged over 4 runs)
base (next-20231009): 33081 MBPS
patched: 29267 MBPS
So, this series is not acceptable unless this regression is resolved.
On Tue, Oct 10, 2023 at 1:45 PM Shakeel Butt <[email protected]> wrote:
>
> On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <[email protected]> wrote:
> >
> > A global counter for the magnitude of memcg stats update is maintained
> > on the memcg side to avoid invoking rstat flushes when the pending
> > updates are not significant. This avoids unnecessary flushes, which are
> > not very cheap even if there isn't a lot of stats to flush. It also
> > avoids unnecessary lock contention on the underlying global rstat lock.
> >
> > Make this threshold per-memcg. The scheme is followed where percpu (now
> > also per-memcg) counters are incremented in the update path, and only
> > propagated to per-memcg atomics when they exceed a certain threshold.
> >
> > This provides two benefits:
> > (a) On large machines with a lot of memcgs, the global threshold can be
> > reached relatively fast, so guarding the underlying lock becomes less
> > effective. Making the threshold per-memcg avoids this.
> >
> > (b) Having a global threshold makes it hard to do subtree flushes, as we
> > cannot reset the global counter except for a full flush. Per-memcg
> > counters removes this as a blocker from doing subtree flushes, which
> > helps avoid unnecessary work when the stats of a small subtree are
> > needed.
> >
> > Nothing is free, of course. This comes at a cost:
> > (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> > bytes. The extra memory usage is insigificant.
> >
> > (b) More work on the update side, although in the common case it will
> > only be percpu counter updates. The amount of work scales with the
> > number of ancestors (i.e. tree depth). This is not a new concept, adding
> > a cgroup to the rstat tree involves a parent loop, so is charging.
> > Testing results below show no significant regressions.
> >
> > (c) The error margin in the stats for the system as a whole increases
> > from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> > NR_MEMCGS. This is probably fine because we have a similar per-memcg
> > error in charges coming from percpu stocks, and we have a periodic
> > flusher that makes sure we always flush all the stats every 2s anyway.
> >
> > This patch was tested to make sure no significant regressions are
> > introduced on the update path as follows. The following benchmarks were
> > ran in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is
> > deeper than a usual setup:
> >
> > (a) neper [1] with 1000 flows and 100 threads (single machine). The
> > values in the table are the average of server and client throughputs in
> > mbps after 30 iterations, each running for 30s:
> >
> > tcp_rr tcp_stream
> > Base 9504218.56 357366.84
> > Patched 9656205.68 356978.39
> > Delta +1.6% -0.1%
> > Standard Deviation 0.95% 1.03%
> >
> > An increase in the performance of tcp_rr doesn't really make sense, but
> > it's probably in the noise. The same tests were ran with 1 flow and 1
> > thread but the throughput was too noisy to make any conclusions (the
> > averages did not show regressions nonetheless).
> >
> > Looking at perf for one iteration of the above test, __mod_memcg_state()
> > (which is where memcg_rstat_updated() is called) does not show up at all
> > without this patch, but it shows up with this patch as 1.06% for tcp_rr
> > and 0.36% for tcp_stream.
> >
> > (b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
> > stress-ng very well, so I am not sure that's the best way to test this,
> > but it spawns 384 workers and spits a lot of metrics which looks nice :)
> > I picked a few ones that seem to be relevant to the stats update path. I
> > also included cache misses as this patch introduce more atomics that may
> > bounce between cpu caches:
> >
> > Metric Base Patched Delta
> > Cache Misses 3.394 B/sec 3.433 B/sec +1.14%
> > Cache L1D Read 0.148 T/sec 0.154 T/sec +4.05%
> > Cache L1D Read Miss 20.430 B/sec 21.820 B/sec +6.8%
> > Page Faults Total 4.304 M/sec 4.535 M/sec +5.4%
> > Page Faults Minor 4.304 M/sec 4.535 M/sec +5.4%
> > Page Faults Major 18.794 /sec 0.000 /sec
> > Kmalloc 0.153 M/sec 0.152 M/sec -0.65%
> > Kfree 0.152 M/sec 0.153 M/sec +0.65%
> > MM Page Alloc 4.640 M/sec 4.898 M/sec +5.56%
> > MM Page Free 4.639 M/sec 4.897 M/sec +5.56%
> > Lock Contention Begin 0.362 M/sec 0.479 M/sec +32.32%
> > Lock Contention End 0.362 M/sec 0.479 M/sec +32.32%
> > page-cache add 238.057 /sec 0.000 /sec
> > page-cache del 6.265 /sec 6.267 /sec -0.03%
> >
> > This is only using a single run in each case. I am not sure what to
> > make out of most of these numbers, but they mostly seem in the noise
> > (some better, some worse). The lock contention numbers are interesting.
> > I am not sure if higher is better or worse here. No new locks or lock
> > sections are introduced by this patch either way.
> >
> > Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
> > this patch. This is suspicious, but I verified while stress-ng is
> > running that all the threads are in the right cgroup.
> >
> > (3) will-it-scale page_fault tests. These tests (specifically
> > per_process_ops in page_fault3 test) detected a 25.9% regression before
> > for a change in the stats update path [2]. These are the
> > numbers from 30 runs (+ is good):
> >
> > LABEL | MEAN | MEDIAN | STDDEV |
> > ------------------------------+-------------+-------------+-------------
> > page_fault1_per_process_ops | | | |
> > (A) base | 265207.738 | 262941.000 | 12112.379 |
> > (B) patched | 249249.191 | 248781.000 | 8767.457 |
> > | -6.02% | -5.39% | |
> > page_fault1_per_thread_ops | | | |
> > (A) base | 241618.484 | 240209.000 | 10162.207 |
> > (B) patched | 229820.671 | 229108.000 | 7506.582 |
> > | -4.88% | -4.62% | |
> > page_fault1_scalability | | |
> > (A) base | 0.03545 | 0.035705 | 0.0015837 |
> > (B) patched | 0.029952 | 0.029957 | 0.0013551 |
> > | -9.29% | -9.35% | |
>
> This much regression is not acceptable.
>
> In addition, I ran netperf with the same 4 level hierarchy as you have
> run and I am seeing ~11% regression.
Interesting, I thought neper and netperf should be similar. Let me try
to reproduce this.
Thanks for testing!
>
> More specifically on a machine with 44 CPUs (HT disabled ixion machine):
>
> # for server
> $ netserver -6
>
> # 22 instances of netperf clients
> $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> (averaged over 4 runs)
>
> base (next-20231009): 33081 MBPS
> patched: 29267 MBPS
>
> So, this series is not acceptable unless this regression is resolved.
On Tue, Oct 10, 2023 at 2:02 PM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Oct 10, 2023 at 1:45 PM Shakeel Butt <[email protected]> wrote:
> >
> > On Mon, Oct 9, 2023 at 8:21 PM Yosry Ahmed <[email protected]> wrote:
> > >
> > > A global counter for the magnitude of memcg stats update is maintained
> > > on the memcg side to avoid invoking rstat flushes when the pending
> > > updates are not significant. This avoids unnecessary flushes, which are
> > > not very cheap even if there isn't a lot of stats to flush. It also
> > > avoids unnecessary lock contention on the underlying global rstat lock.
> > >
> > > Make this threshold per-memcg. The scheme is followed where percpu (now
> > > also per-memcg) counters are incremented in the update path, and only
> > > propagated to per-memcg atomics when they exceed a certain threshold.
> > >
> > > This provides two benefits:
> > > (a) On large machines with a lot of memcgs, the global threshold can be
> > > reached relatively fast, so guarding the underlying lock becomes less
> > > effective. Making the threshold per-memcg avoids this.
> > >
> > > (b) Having a global threshold makes it hard to do subtree flushes, as we
> > > cannot reset the global counter except for a full flush. Per-memcg
> > > counters removes this as a blocker from doing subtree flushes, which
> > > helps avoid unnecessary work when the stats of a small subtree are
> > > needed.
> > >
> > > Nothing is free, of course. This comes at a cost:
> > > (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> > > bytes. The extra memory usage is insigificant.
> > >
> > > (b) More work on the update side, although in the common case it will
> > > only be percpu counter updates. The amount of work scales with the
> > > number of ancestors (i.e. tree depth). This is not a new concept, adding
> > > a cgroup to the rstat tree involves a parent loop, so is charging.
> > > Testing results below show no significant regressions.
> > >
> > > (c) The error margin in the stats for the system as a whole increases
> > > from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> > > NR_MEMCGS. This is probably fine because we have a similar per-memcg
> > > error in charges coming from percpu stocks, and we have a periodic
> > > flusher that makes sure we always flush all the stats every 2s anyway.
> > >
> > > This patch was tested to make sure no significant regressions are
> > > introduced on the update path as follows. The following benchmarks were
> > > ran in a cgroup that is 4 levels deep (/sys/fs/cgroup/a/b/c/d), which is
> > > deeper than a usual setup:
> > >
> > > (a) neper [1] with 1000 flows and 100 threads (single machine). The
> > > values in the table are the average of server and client throughputs in
> > > mbps after 30 iterations, each running for 30s:
> > >
> > > tcp_rr tcp_stream
> > > Base 9504218.56 357366.84
> > > Patched 9656205.68 356978.39
> > > Delta +1.6% -0.1%
> > > Standard Deviation 0.95% 1.03%
> > >
> > > An increase in the performance of tcp_rr doesn't really make sense, but
> > > it's probably in the noise. The same tests were ran with 1 flow and 1
> > > thread but the throughput was too noisy to make any conclusions (the
> > > averages did not show regressions nonetheless).
> > >
> > > Looking at perf for one iteration of the above test, __mod_memcg_state()
> > > (which is where memcg_rstat_updated() is called) does not show up at all
> > > without this patch, but it shows up with this patch as 1.06% for tcp_rr
> > > and 0.36% for tcp_stream.
> > >
> > > (b) "stress-ng --vm 0 -t 1m --times --perf". I don't understand
> > > stress-ng very well, so I am not sure that's the best way to test this,
> > > but it spawns 384 workers and spits a lot of metrics which looks nice :)
> > > I picked a few ones that seem to be relevant to the stats update path. I
> > > also included cache misses as this patch introduce more atomics that may
> > > bounce between cpu caches:
> > >
> > > Metric Base Patched Delta
> > > Cache Misses 3.394 B/sec 3.433 B/sec +1.14%
> > > Cache L1D Read 0.148 T/sec 0.154 T/sec +4.05%
> > > Cache L1D Read Miss 20.430 B/sec 21.820 B/sec +6.8%
> > > Page Faults Total 4.304 M/sec 4.535 M/sec +5.4%
> > > Page Faults Minor 4.304 M/sec 4.535 M/sec +5.4%
> > > Page Faults Major 18.794 /sec 0.000 /sec
> > > Kmalloc 0.153 M/sec 0.152 M/sec -0.65%
> > > Kfree 0.152 M/sec 0.153 M/sec +0.65%
> > > MM Page Alloc 4.640 M/sec 4.898 M/sec +5.56%
> > > MM Page Free 4.639 M/sec 4.897 M/sec +5.56%
> > > Lock Contention Begin 0.362 M/sec 0.479 M/sec +32.32%
> > > Lock Contention End 0.362 M/sec 0.479 M/sec +32.32%
> > > page-cache add 238.057 /sec 0.000 /sec
> > > page-cache del 6.265 /sec 6.267 /sec -0.03%
> > >
> > > This is only using a single run in each case. I am not sure what to
> > > make out of most of these numbers, but they mostly seem in the noise
> > > (some better, some worse). The lock contention numbers are interesting.
> > > I am not sure if higher is better or worse here. No new locks or lock
> > > sections are introduced by this patch either way.
> > >
> > > Looking at perf, __mod_memcg_state() shows up as 0.00% with and without
> > > this patch. This is suspicious, but I verified while stress-ng is
> > > running that all the threads are in the right cgroup.
> > >
> > > (3) will-it-scale page_fault tests. These tests (specifically
> > > per_process_ops in page_fault3 test) detected a 25.9% regression before
> > > for a change in the stats update path [2]. These are the
> > > numbers from 30 runs (+ is good):
> > >
> > > LABEL | MEAN | MEDIAN | STDDEV |
> > > ------------------------------+-------------+-------------+-------------
> > > page_fault1_per_process_ops | | | |
> > > (A) base | 265207.738 | 262941.000 | 12112.379 |
> > > (B) patched | 249249.191 | 248781.000 | 8767.457 |
> > > | -6.02% | -5.39% | |
> > > page_fault1_per_thread_ops | | | |
> > > (A) base | 241618.484 | 240209.000 | 10162.207 |
> > > (B) patched | 229820.671 | 229108.000 | 7506.582 |
> > > | -4.88% | -4.62% | |
> > > page_fault1_scalability | | |
> > > (A) base | 0.03545 | 0.035705 | 0.0015837 |
> > > (B) patched | 0.029952 | 0.029957 | 0.0013551 |
> > > | -9.29% | -9.35% | |
> >
> > This much regression is not acceptable.
> >
> > In addition, I ran netperf with the same 4 level hierarchy as you have
> > run and I am seeing ~11% regression.
>
> Interesting, I thought neper and netperf should be similar. Let me try
> to reproduce this.
>
> Thanks for testing!
>
> >
> > More specifically on a machine with 44 CPUs (HT disabled ixion machine):
> >
> > # for server
> > $ netserver -6
> >
> > # 22 instances of netperf clients
> > $ netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
> >
> > (averaged over 4 runs)
> >
> > base (next-20231009): 33081 MBPS
> > patched: 29267 MBPS
> >
> > So, this series is not acceptable unless this regression is resolved.
I tried this on a machine with 72 cpus (also ixion), running both
netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
# echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
# mkdir /sys/fs/cgroup/a
# echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
# mkdir /sys/fs/cgroup/a/b
# echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
# mkdir /sys/fs/cgroup/a/b/c
# echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
# mkdir /sys/fs/cgroup/a/b/c/d
# echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
# ./netserver -6
# echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
# for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
-m 10K; done
Base:
540000 262144 10240 60.00 54613.89
540000 262144 10240 60.00 54940.52
540000 262144 10240 60.00 55168.86
540000 262144 10240 60.00 54800.15
540000 262144 10240 60.00 54452.55
540000 262144 10240 60.00 54501.60
540000 262144 10240 60.00 55036.11
540000 262144 10240 60.00 52018.91
540000 262144 10240 60.00 54877.78
540000 262144 10240 60.00 55342.38
Average: 54575.275
Patched:
540000 262144 10240 60.00 53694.86
540000 262144 10240 60.00 54807.68
540000 262144 10240 60.00 54782.89
540000 262144 10240 60.00 51404.91
540000 262144 10240 60.00 55024.00
540000 262144 10240 60.00 54725.84
540000 262144 10240 60.00 51400.40
540000 262144 10240 60.00 54212.63
540000 262144 10240 60.00 51951.47
540000 262144 10240 60.00 51978.27
Average: 53398.295
That's ~2% regression. Did I do anything incorrectly?
On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
[...]
>
> I tried this on a machine with 72 cpus (also ixion), running both
> netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> # mkdir /sys/fs/cgroup/a
> # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> # mkdir /sys/fs/cgroup/a/b
> # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> # mkdir /sys/fs/cgroup/a/b/c
> # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> # mkdir /sys/fs/cgroup/a/b/c/d
> # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> # ./netserver -6
>
> # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> -m 10K; done
You are missing '&' at the end. Use something like below:
#!/bin/bash
for i in {1..22}
do
/data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
done
wait
On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <[email protected]> wrote:
>
> On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> [...]
> >
> > I tried this on a machine with 72 cpus (also ixion), running both
> > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > # mkdir /sys/fs/cgroup/a
> > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > # mkdir /sys/fs/cgroup/a/b
> > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > # mkdir /sys/fs/cgroup/a/b/c
> > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > # mkdir /sys/fs/cgroup/a/b/c/d
> > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > # ./netserver -6
> >
> > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > -m 10K; done
>
> You are missing '&' at the end. Use something like below:
>
> #!/bin/bash
> for i in {1..22}
> do
> /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> done
> wait
>
Oh sorry I missed the fact that you are running instances in parallel, my bad.
So I ran 36 instances on a machine with 72 cpus. I did this 10 times
and got an average from all instances for all runs to reduce noise:
#!/bin/bash
ITER=10
NR_INSTANCES=36
for i in $(seq $ITER); do
echo "iteration $i"
for j in $(seq $NR_INSTANCES); do
echo "iteration $i" >> "out$j"
./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
done
wait
done
cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
Base: 22169 mbps
Patched: 21331.9 mbps
The difference is ~3.7% in my runs. I am not sure what's different.
Perhaps it's the number of runs?
On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <[email protected]> wrote:
>
> On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <[email protected]> wrote:
> >
> > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > [...]
> > >
> > > I tried this on a machine with 72 cpus (also ixion), running both
> > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > # mkdir /sys/fs/cgroup/a
> > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > # mkdir /sys/fs/cgroup/a/b
> > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > # mkdir /sys/fs/cgroup/a/b/c
> > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > # ./netserver -6
> > >
> > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > -m 10K; done
> >
> > You are missing '&' at the end. Use something like below:
> >
> > #!/bin/bash
> > for i in {1..22}
> > do
> > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > done
> > wait
> >
>
> Oh sorry I missed the fact that you are running instances in parallel, my bad.
>
> So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> and got an average from all instances for all runs to reduce noise:
>
> #!/bin/bash
>
> ITER=10
> NR_INSTANCES=36
>
> for i in $(seq $ITER); do
> echo "iteration $i"
> for j in $(seq $NR_INSTANCES); do
> echo "iteration $i" >> "out$j"
> ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> done
> wait
> done
>
> cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
>
> Base: 22169 mbps
> Patched: 21331.9 mbps
>
> The difference is ~3.7% in my runs. I am not sure what's different.
> Perhaps it's the number of runs?
My base kernel is next-20231009 and I am running experiments with
hyperthreading disabled.
On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <[email protected]> wrote:
>
> On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <[email protected]> wrote:
> >
> > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > [...]
> > > >
> > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > # mkdir /sys/fs/cgroup/a
> > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > # mkdir /sys/fs/cgroup/a/b
> > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > # ./netserver -6
> > > >
> > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > -m 10K; done
> > >
> > > You are missing '&' at the end. Use something like below:
> > >
> > > #!/bin/bash
> > > for i in {1..22}
> > > do
> > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > done
> > > wait
> > >
> >
> > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> >
> > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > and got an average from all instances for all runs to reduce noise:
> >
> > #!/bin/bash
> >
> > ITER=10
> > NR_INSTANCES=36
> >
> > for i in $(seq $ITER); do
> > echo "iteration $i"
> > for j in $(seq $NR_INSTANCES); do
> > echo "iteration $i" >> "out$j"
> > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > done
> > wait
> > done
> >
> > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> >
> > Base: 22169 mbps
> > Patched: 21331.9 mbps
> >
> > The difference is ~3.7% in my runs. I am not sure what's different.
> > Perhaps it's the number of runs?
>
> My base kernel is next-20231009 and I am running experiments with
> hyperthreading disabled.
Using next-20231009 and a similar 44 core machine with hyperthreading
disabled, I ran 22 instances of netperf in parallel and got the
following numbers from averaging 20 runs:
Base: 33076.5 mbps
Patched: 31410.1 mbps
That's about 5% diff. I guess the number of iterations helps reduce
the noise? I am not sure.
Please also keep in mind that in this case all netperf instances are
in the same cgroup and at a 4-level depth. I imagine in a practical
setup processes would be a little more spread out, which means less
common ancestors, so less contended atomic operations.
On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <[email protected]> wrote:
>
> On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <[email protected]> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <[email protected]> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > [...]
> > > > >
> > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a
> > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > # ./netserver -6
> > > > >
> > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > -m 10K; done
> > > >
> > > > You are missing '&' at the end. Use something like below:
> > > >
> > > > #!/bin/bash
> > > > for i in {1..22}
> > > > do
> > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > done
> > > > wait
> > > >
> > >
> > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > >
> > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > and got an average from all instances for all runs to reduce noise:
> > >
> > > #!/bin/bash
> > >
> > > ITER=10
> > > NR_INSTANCES=36
> > >
> > > for i in $(seq $ITER); do
> > > echo "iteration $i"
> > > for j in $(seq $NR_INSTANCES); do
> > > echo "iteration $i" >> "out$j"
> > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > > done
> > > wait
> > > done
> > >
> > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > >
> > > Base: 22169 mbps
> > > Patched: 21331.9 mbps
> > >
> > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > Perhaps it's the number of runs?
> >
> > My base kernel is next-20231009 and I am running experiments with
> > hyperthreading disabled.
>
> Using next-20231009 and a similar 44 core machine with hyperthreading
> disabled, I ran 22 instances of netperf in parallel and got the
> following numbers from averaging 20 runs:
>
> Base: 33076.5 mbps
> Patched: 31410.1 mbps
>
> That's about 5% diff. I guess the number of iterations helps reduce
> the noise? I am not sure.
>
> Please also keep in mind that in this case all netperf instances are
> in the same cgroup and at a 4-level depth. I imagine in a practical
> setup processes would be a little more spread out, which means less
> common ancestors, so less contended atomic operations.
(Resending the reply as I messed up the last one, was not in plain text)
I was curious, so I ran the same testing in a cgroup 2 levels deep
(i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
experience. Here are the numbers:
Base: 40198.0 mbps
Patched: 38629.7 mbps
The regression is reduced to ~3.9%.
What's more interesting is that going from a level 2 cgroup to a level
4 cgroup is already a big hit with or without this patch:
Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
Patched: 38629.7 -> 31410.1 (~18.7% regression)
So going from level 2 to 4 is already a significant regression for
other reasons (e.g. hierarchical charging). This patch only makes it
marginally worse. This puts the numbers more into perspective imo than
comparing values at level 4. What do you think?
On Thu, Oct 12, 2023 at 01:04:03AM -0700, Yosry Ahmed wrote:
> On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <[email protected]> wrote:
> >
> > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <[email protected]> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <[email protected]> wrote:
> > > > >
> > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > > [...]
> > > > > >
> > > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > # ./netserver -6
> > > > > >
> > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > > -m 10K; done
> > > > >
> > > > > You are missing '&' at the end. Use something like below:
> > > > >
> > > > > #!/bin/bash
> > > > > for i in {1..22}
> > > > > do
> > > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > > done
> > > > > wait
> > > > >
> > > >
> > > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > > >
> > > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > > and got an average from all instances for all runs to reduce noise:
> > > >
> > > > #!/bin/bash
> > > >
> > > > ITER=10
> > > > NR_INSTANCES=36
> > > >
> > > > for i in $(seq $ITER); do
> > > > echo "iteration $i"
> > > > for j in $(seq $NR_INSTANCES); do
> > > > echo "iteration $i" >> "out$j"
> > > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > > > done
> > > > wait
> > > > done
> > > >
> > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > > >
> > > > Base: 22169 mbps
> > > > Patched: 21331.9 mbps
> > > >
> > > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > > Perhaps it's the number of runs?
> > >
> > > My base kernel is next-20231009 and I am running experiments with
> > > hyperthreading disabled.
> >
> > Using next-20231009 and a similar 44 core machine with hyperthreading
> > disabled, I ran 22 instances of netperf in parallel and got the
> > following numbers from averaging 20 runs:
> >
> > Base: 33076.5 mbps
> > Patched: 31410.1 mbps
> >
> > That's about 5% diff. I guess the number of iterations helps reduce
> > the noise? I am not sure.
> >
> > Please also keep in mind that in this case all netperf instances are
> > in the same cgroup and at a 4-level depth. I imagine in a practical
> > setup processes would be a little more spread out, which means less
> > common ancestors, so less contended atomic operations.
>
>
> (Resending the reply as I messed up the last one, was not in plain text)
>
> I was curious, so I ran the same testing in a cgroup 2 levels deep
> (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> experience. Here are the numbers:
>
> Base: 40198.0 mbps
> Patched: 38629.7 mbps
>
> The regression is reduced to ~3.9%.
>
> What's more interesting is that going from a level 2 cgroup to a level
> 4 cgroup is already a big hit with or without this patch:
>
> Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> Patched: 38629.7 -> 31410.1 (~18.7% regression)
>
> So going from level 2 to 4 is already a significant regression for
> other reasons (e.g. hierarchical charging). This patch only makes it
> marginally worse. This puts the numbers more into perspective imo than
> comparing values at level 4. What do you think?
I think it's reasonable.
Especially comparing to how many cachelines we used to touch on the
write side when all flushing happened there. This looks like a good
trade-off to me.
On Thu, Oct 12, 2023 at 1:04 AM Yosry Ahmed <[email protected]> wrote:
>
> On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <[email protected]> wrote:
> >
> > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <[email protected]> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <[email protected]> wrote:
> > > > >
> > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > > [...]
> > > > > >
> > > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > # ./netserver -6
> > > > > >
> > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > > -m 10K; done
> > > > >
> > > > > You are missing '&' at the end. Use something like below:
> > > > >
> > > > > #!/bin/bash
> > > > > for i in {1..22}
> > > > > do
> > > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > > done
> > > > > wait
> > > > >
> > > >
> > > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > > >
> > > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > > and got an average from all instances for all runs to reduce noise:
> > > >
> > > > #!/bin/bash
> > > >
> > > > ITER=10
> > > > NR_INSTANCES=36
> > > >
> > > > for i in $(seq $ITER); do
> > > > echo "iteration $i"
> > > > for j in $(seq $NR_INSTANCES); do
> > > > echo "iteration $i" >> "out$j"
> > > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > > > done
> > > > wait
> > > > done
> > > >
> > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > > >
> > > > Base: 22169 mbps
> > > > Patched: 21331.9 mbps
> > > >
> > > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > > Perhaps it's the number of runs?
> > >
> > > My base kernel is next-20231009 and I am running experiments with
> > > hyperthreading disabled.
> >
> > Using next-20231009 and a similar 44 core machine with hyperthreading
> > disabled, I ran 22 instances of netperf in parallel and got the
> > following numbers from averaging 20 runs:
> >
> > Base: 33076.5 mbps
> > Patched: 31410.1 mbps
> >
> > That's about 5% diff. I guess the number of iterations helps reduce
> > the noise? I am not sure.
> >
> > Please also keep in mind that in this case all netperf instances are
> > in the same cgroup and at a 4-level depth. I imagine in a practical
> > setup processes would be a little more spread out, which means less
> > common ancestors, so less contended atomic operations.
>
>
> (Resending the reply as I messed up the last one, was not in plain text)
>
> I was curious, so I ran the same testing in a cgroup 2 levels deep
> (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> experience. Here are the numbers:
>
> Base: 40198.0 mbps
> Patched: 38629.7 mbps
>
> The regression is reduced to ~3.9%.
>
> What's more interesting is that going from a level 2 cgroup to a level
> 4 cgroup is already a big hit with or without this patch:
>
> Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> Patched: 38629.7 -> 31410.1 (~18.7% regression)
>
> So going from level 2 to 4 is already a significant regression for
> other reasons (e.g. hierarchical charging). This patch only makes it
> marginally worse. This puts the numbers more into perspective imo than
> comparing values at level 4. What do you think?
This is weird as we are running the experiments on the same machine. I
will rerun with 2 levels as well. Also can you rerun the page fault
benchmark as well which was showing 9% regression in your original
commit message?
On Thu, Oct 12, 2023 at 6:35 AM Shakeel Butt <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 1:04 AM Yosry Ahmed <[email protected]> wrote:
> >
> > On Wed, Oct 11, 2023 at 8:13 PM Yosry Ahmed <[email protected]> wrote:
> > >
> > > On Wed, Oct 11, 2023 at 5:46 AM Shakeel Butt <[email protected]> wrote:
> > > >
> > > > On Tue, Oct 10, 2023 at 6:48 PM Yosry Ahmed <[email protected]> wrote:
> > > > >
> > > > > On Tue, Oct 10, 2023 at 5:36 PM Shakeel Butt <[email protected]> wrote:
> > > > > >
> > > > > > On Tue, Oct 10, 2023 at 03:21:47PM -0700, Yosry Ahmed wrote:
> > > > > > [...]
> > > > > > >
> > > > > > > I tried this on a machine with 72 cpus (also ixion), running both
> > > > > > > netserver and netperf in /sys/fs/cgroup/a/b/c/d as follows:
> > > > > > > # echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control
> > > > > > > # mkdir /sys/fs/cgroup/a
> > > > > > > # echo "+memory" > /sys/fs/cgroup/a/cgroup.subtree_control
> > > > > > > # mkdir /sys/fs/cgroup/a/b
> > > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/cgroup.subtree_control
> > > > > > > # mkdir /sys/fs/cgroup/a/b/c
> > > > > > > # echo "+memory" > /sys/fs/cgroup/a/b/c/cgroup.subtree_control
> > > > > > > # mkdir /sys/fs/cgroup/a/b/c/d
> > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > > # ./netserver -6
> > > > > > >
> > > > > > > # echo 0 > /sys/fs/cgroup/a/b/c/d/cgroup.procs
> > > > > > > # for i in $(seq 10); do ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE --
> > > > > > > -m 10K; done
> > > > > >
> > > > > > You are missing '&' at the end. Use something like below:
> > > > > >
> > > > > > #!/bin/bash
> > > > > > for i in {1..22}
> > > > > > do
> > > > > > /data/tmp/netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K &
> > > > > > done
> > > > > > wait
> > > > > >
> > > > >
> > > > > Oh sorry I missed the fact that you are running instances in parallel, my bad.
> > > > >
> > > > > So I ran 36 instances on a machine with 72 cpus. I did this 10 times
> > > > > and got an average from all instances for all runs to reduce noise:
> > > > >
> > > > > #!/bin/bash
> > > > >
> > > > > ITER=10
> > > > > NR_INSTANCES=36
> > > > >
> > > > > for i in $(seq $ITER); do
> > > > > echo "iteration $i"
> > > > > for j in $(seq $NR_INSTANCES); do
> > > > > echo "iteration $i" >> "out$j"
> > > > > ./netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K >> "out$j" &
> > > > > done
> > > > > wait
> > > > > done
> > > > >
> > > > > cat out* | grep 540000 | awk '{sum += $5} END {print sum/NR}'
> > > > >
> > > > > Base: 22169 mbps
> > > > > Patched: 21331.9 mbps
> > > > >
> > > > > The difference is ~3.7% in my runs. I am not sure what's different.
> > > > > Perhaps it's the number of runs?
> > > >
> > > > My base kernel is next-20231009 and I am running experiments with
> > > > hyperthreading disabled.
> > >
> > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > disabled, I ran 22 instances of netperf in parallel and got the
> > > following numbers from averaging 20 runs:
> > >
> > > Base: 33076.5 mbps
> > > Patched: 31410.1 mbps
> > >
> > > That's about 5% diff. I guess the number of iterations helps reduce
> > > the noise? I am not sure.
> > >
> > > Please also keep in mind that in this case all netperf instances are
> > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > setup processes would be a little more spread out, which means less
> > > common ancestors, so less contended atomic operations.
> >
> >
> > (Resending the reply as I messed up the last one, was not in plain text)
> >
> > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > experience. Here are the numbers:
> >
> > Base: 40198.0 mbps
> > Patched: 38629.7 mbps
> >
> > The regression is reduced to ~3.9%.
> >
> > What's more interesting is that going from a level 2 cgroup to a level
> > 4 cgroup is already a big hit with or without this patch:
> >
> > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> >
> > So going from level 2 to 4 is already a significant regression for
> > other reasons (e.g. hierarchical charging). This patch only makes it
> > marginally worse. This puts the numbers more into perspective imo than
> > comparing values at level 4. What do you think?
>
> This is weird as we are running the experiments on the same machine. I
> will rerun with 2 levels as well. Also can you rerun the page fault
> benchmark as well which was showing 9% regression in your original
> commit message?
Thanks. I will re-run the page_fault tests, but keep in mind that the
page fault benchmarks in will-it-scale are highly variable. We run
them between kernel versions internally, and I think we ignore any
changes below 10% as the benchmark is naturally noisy.
I have a couple of runs for page_fault3_scalability showing a 2-3%
improvement with this patch :)
[..]
> > > >
> > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > following numbers from averaging 20 runs:
> > > >
> > > > Base: 33076.5 mbps
> > > > Patched: 31410.1 mbps
> > > >
> > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > the noise? I am not sure.
> > > >
> > > > Please also keep in mind that in this case all netperf instances are
> > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > setup processes would be a little more spread out, which means less
> > > > common ancestors, so less contended atomic operations.
> > >
> > >
> > > (Resending the reply as I messed up the last one, was not in plain text)
> > >
> > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > experience. Here are the numbers:
> > >
> > > Base: 40198.0 mbps
> > > Patched: 38629.7 mbps
> > >
> > > The regression is reduced to ~3.9%.
> > >
> > > What's more interesting is that going from a level 2 cgroup to a level
> > > 4 cgroup is already a big hit with or without this patch:
> > >
> > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> > >
> > > So going from level 2 to 4 is already a significant regression for
> > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > marginally worse. This puts the numbers more into perspective imo than
> > > comparing values at level 4. What do you think?
> >
> > This is weird as we are running the experiments on the same machine. I
> > will rerun with 2 levels as well. Also can you rerun the page fault
> > benchmark as well which was showing 9% regression in your original
> > commit message?
>
> Thanks. I will re-run the page_fault tests, but keep in mind that the
> page fault benchmarks in will-it-scale are highly variable. We run
> them between kernel versions internally, and I think we ignore any
> changes below 10% as the benchmark is naturally noisy.
>
> I have a couple of runs for page_fault3_scalability showing a 2-3%
> improvement with this patch :)
I ran the page_fault tests for 10 runs on a machine with 256 cpus in a
level 2 cgroup, here are the results (the results in the original
commit message are for 384 cpus in a level 4 cgroup):
LABEL | MEAN | MEDIAN | STDDEV |
------------------------------+-------------+-------------+-------------
page_fault1_per_process_ops | | | |
(A) base | 270249.164 | 265437.000 | 13451.836 |
(B) patched | 261368.709 | 255725.000 | 13394.767 |
| -3.29% | -3.66% | |
page_fault1_per_thread_ops | | | |
(A) base | 242111.345 | 239737.000 | 10026.031 |
(B) patched | 237057.109 | 235305.000 | 9769.687 |
| -2.09% | -1.85% | |
page_fault1_scalability | | |
(A) base | 0.034387 | 0.035168 | 0.0018283 |
(B) patched | 0.033988 | 0.034573 | 0.0018056 |
| -1.16% | -1.69% | |
page_fault2_per_process_ops | | |
(A) base | 203561.836 | 203301.000 | 2550.764 |
(B) patched | 197195.945 | 197746.000 | 2264.263 |
| -3.13% | -2.73% | |
page_fault2_per_thread_ops | | |
(A) base | 171046.473 | 170776.000 | 1509.679 |
(B) patched | 166626.327 | 166406.000 | 768.753 |
| -2.58% | -2.56% | |
page_fault2_scalability | | |
(A) base | 0.054026 | 0.053821 | 0.00062121 |
(B) patched | 0.053329 | 0.05306 | 0.00048394 |
| -1.29% | -1.41% | |
page_fault3_per_process_ops | | |
(A) base | 1295807.782 | 1297550.000 | 5907.585 |
(B) patched | 1275579.873 | 1273359.000 | 8759.160 |
| -1.56% | -1.86% | |
page_fault3_per_thread_ops | | |
(A) base | 391234.164 | 390860.000 | 1760.720 |
(B) patched | 377231.273 | 376369.000 | 1874.971 |
| -3.58% | -3.71% | |
page_fault3_scalability | | |
(A) base | 0.60369 | 0.60072 | 0.0083029 |
(B) patched | 0.61733 | 0.61544 | 0.009855 |
| +2.26% | +2.45% | |
The numbers are much better. I can modify the commit log to include
the testing in the replies instead of what's currently there if this
helps (22 netperf instances on 44 cpus and will-it-scale page_fault on
256 cpus -- all in a level 2 cgroup).
On Thu, Oct 12, 2023 at 2:06 PM Yosry Ahmed <[email protected]> wrote:
>
> [..]
> > > > >
> > > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > > following numbers from averaging 20 runs:
> > > > >
> > > > > Base: 33076.5 mbps
> > > > > Patched: 31410.1 mbps
> > > > >
> > > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > > the noise? I am not sure.
> > > > >
> > > > > Please also keep in mind that in this case all netperf instances are
> > > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > > setup processes would be a little more spread out, which means less
> > > > > common ancestors, so less contended atomic operations.
> > > >
> > > >
> > > > (Resending the reply as I messed up the last one, was not in plain text)
> > > >
> > > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > > experience. Here are the numbers:
> > > >
> > > > Base: 40198.0 mbps
> > > > Patched: 38629.7 mbps
> > > >
> > > > The regression is reduced to ~3.9%.
> > > >
> > > > What's more interesting is that going from a level 2 cgroup to a level
> > > > 4 cgroup is already a big hit with or without this patch:
> > > >
> > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> > > >
> > > > So going from level 2 to 4 is already a significant regression for
> > > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > > marginally worse. This puts the numbers more into perspective imo than
> > > > comparing values at level 4. What do you think?
> > >
> > > This is weird as we are running the experiments on the same machine. I
> > > will rerun with 2 levels as well. Also can you rerun the page fault
> > > benchmark as well which was showing 9% regression in your original
> > > commit message?
> >
> > Thanks. I will re-run the page_fault tests, but keep in mind that the
> > page fault benchmarks in will-it-scale are highly variable. We run
> > them between kernel versions internally, and I think we ignore any
> > changes below 10% as the benchmark is naturally noisy.
> >
> > I have a couple of runs for page_fault3_scalability showing a 2-3%
> > improvement with this patch :)
>
> I ran the page_fault tests for 10 runs on a machine with 256 cpus in a
> level 2 cgroup, here are the results (the results in the original
> commit message are for 384 cpus in a level 4 cgroup):
>
> LABEL | MEAN | MEDIAN | STDDEV |
> ------------------------------+-------------+-------------+-------------
> page_fault1_per_process_ops | | | |
> (A) base | 270249.164 | 265437.000 | 13451.836 |
> (B) patched | 261368.709 | 255725.000 | 13394.767 |
> | -3.29% | -3.66% | |
> page_fault1_per_thread_ops | | | |
> (A) base | 242111.345 | 239737.000 | 10026.031 |
> (B) patched | 237057.109 | 235305.000 | 9769.687 |
> | -2.09% | -1.85% | |
> page_fault1_scalability | | |
> (A) base | 0.034387 | 0.035168 | 0.0018283 |
> (B) patched | 0.033988 | 0.034573 | 0.0018056 |
> | -1.16% | -1.69% | |
> page_fault2_per_process_ops | | |
> (A) base | 203561.836 | 203301.000 | 2550.764 |
> (B) patched | 197195.945 | 197746.000 | 2264.263 |
> | -3.13% | -2.73% | |
> page_fault2_per_thread_ops | | |
> (A) base | 171046.473 | 170776.000 | 1509.679 |
> (B) patched | 166626.327 | 166406.000 | 768.753 |
> | -2.58% | -2.56% | |
> page_fault2_scalability | | |
> (A) base | 0.054026 | 0.053821 | 0.00062121 |
> (B) patched | 0.053329 | 0.05306 | 0.00048394 |
> | -1.29% | -1.41% | |
> page_fault3_per_process_ops | | |
> (A) base | 1295807.782 | 1297550.000 | 5907.585 |
> (B) patched | 1275579.873 | 1273359.000 | 8759.160 |
> | -1.56% | -1.86% | |
> page_fault3_per_thread_ops | | |
> (A) base | 391234.164 | 390860.000 | 1760.720 |
> (B) patched | 377231.273 | 376369.000 | 1874.971 |
> | -3.58% | -3.71% | |
> page_fault3_scalability | | |
> (A) base | 0.60369 | 0.60072 | 0.0083029 |
> (B) patched | 0.61733 | 0.61544 | 0.009855 |
> | +2.26% | +2.45% | |
>
> The numbers are much better. I can modify the commit log to include
> the testing in the replies instead of what's currently there if this
> helps (22 netperf instances on 44 cpus and will-it-scale page_fault on
> 256 cpus -- all in a level 2 cgroup).
Yes this looks better. I think we should also ask intel perf and
phoronix folks to run their benchmarks as well (but no need to block
on them).
On Thu, Oct 12, 2023 at 2:16 PM Shakeel Butt <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 2:06 PM Yosry Ahmed <[email protected]> wrote:
> >
> > [..]
> > > > > >
> > > > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > > > following numbers from averaging 20 runs:
> > > > > >
> > > > > > Base: 33076.5 mbps
> > > > > > Patched: 31410.1 mbps
> > > > > >
> > > > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > > > the noise? I am not sure.
> > > > > >
> > > > > > Please also keep in mind that in this case all netperf instances are
> > > > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > > > setup processes would be a little more spread out, which means less
> > > > > > common ancestors, so less contended atomic operations.
> > > > >
> > > > >
> > > > > (Resending the reply as I messed up the last one, was not in plain text)
> > > > >
> > > > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > > > experience. Here are the numbers:
> > > > >
> > > > > Base: 40198.0 mbps
> > > > > Patched: 38629.7 mbps
> > > > >
> > > > > The regression is reduced to ~3.9%.
> > > > >
> > > > > What's more interesting is that going from a level 2 cgroup to a level
> > > > > 4 cgroup is already a big hit with or without this patch:
> > > > >
> > > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > > > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> > > > >
> > > > > So going from level 2 to 4 is already a significant regression for
> > > > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > > > marginally worse. This puts the numbers more into perspective imo than
> > > > > comparing values at level 4. What do you think?
> > > >
> > > > This is weird as we are running the experiments on the same machine. I
> > > > will rerun with 2 levels as well. Also can you rerun the page fault
> > > > benchmark as well which was showing 9% regression in your original
> > > > commit message?
> > >
> > > Thanks. I will re-run the page_fault tests, but keep in mind that the
> > > page fault benchmarks in will-it-scale are highly variable. We run
> > > them between kernel versions internally, and I think we ignore any
> > > changes below 10% as the benchmark is naturally noisy.
> > >
> > > I have a couple of runs for page_fault3_scalability showing a 2-3%
> > > improvement with this patch :)
> >
> > I ran the page_fault tests for 10 runs on a machine with 256 cpus in a
> > level 2 cgroup, here are the results (the results in the original
> > commit message are for 384 cpus in a level 4 cgroup):
> >
> > LABEL | MEAN | MEDIAN | STDDEV |
> > ------------------------------+-------------+-------------+-------------
> > page_fault1_per_process_ops | | | |
> > (A) base | 270249.164 | 265437.000 | 13451.836 |
> > (B) patched | 261368.709 | 255725.000 | 13394.767 |
> > | -3.29% | -3.66% | |
> > page_fault1_per_thread_ops | | | |
> > (A) base | 242111.345 | 239737.000 | 10026.031 |
> > (B) patched | 237057.109 | 235305.000 | 9769.687 |
> > | -2.09% | -1.85% | |
> > page_fault1_scalability | | |
> > (A) base | 0.034387 | 0.035168 | 0.0018283 |
> > (B) patched | 0.033988 | 0.034573 | 0.0018056 |
> > | -1.16% | -1.69% | |
> > page_fault2_per_process_ops | | |
> > (A) base | 203561.836 | 203301.000 | 2550.764 |
> > (B) patched | 197195.945 | 197746.000 | 2264.263 |
> > | -3.13% | -2.73% | |
> > page_fault2_per_thread_ops | | |
> > (A) base | 171046.473 | 170776.000 | 1509.679 |
> > (B) patched | 166626.327 | 166406.000 | 768.753 |
> > | -2.58% | -2.56% | |
> > page_fault2_scalability | | |
> > (A) base | 0.054026 | 0.053821 | 0.00062121 |
> > (B) patched | 0.053329 | 0.05306 | 0.00048394 |
> > | -1.29% | -1.41% | |
> > page_fault3_per_process_ops | | |
> > (A) base | 1295807.782 | 1297550.000 | 5907.585 |
> > (B) patched | 1275579.873 | 1273359.000 | 8759.160 |
> > | -1.56% | -1.86% | |
> > page_fault3_per_thread_ops | | |
> > (A) base | 391234.164 | 390860.000 | 1760.720 |
> > (B) patched | 377231.273 | 376369.000 | 1874.971 |
> > | -3.58% | -3.71% | |
> > page_fault3_scalability | | |
> > (A) base | 0.60369 | 0.60072 | 0.0083029 |
> > (B) patched | 0.61733 | 0.61544 | 0.009855 |
> > | +2.26% | +2.45% | |
> >
> > The numbers are much better. I can modify the commit log to include
> > the testing in the replies instead of what's currently there if this
> > helps (22 netperf instances on 44 cpus and will-it-scale page_fault on
> > 256 cpus -- all in a level 2 cgroup).
>
> Yes this looks better. I think we should also ask intel perf and
> phoronix folks to run their benchmarks as well (but no need to block
> on them).
Anything I need to do for this to happen? (I thought such testing is
already done on linux-next)
Also, any further comments on the patch (or the series in general)? If
not, I can send a new commit message for this patch in-place.
On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <[email protected]> wrote:
>
[...]
> >
> > Yes this looks better. I think we should also ask intel perf and
> > phoronix folks to run their benchmarks as well (but no need to block
> > on them).
>
> Anything I need to do for this to happen? (I thought such testing is
> already done on linux-next)
Just Cced the relevant folks.
Michael, Oliver & Feng, if you have some time/resource available,
please do trigger your performance benchmarks on the following series
(but nothing urgent):
https://lore.kernel.org/all/[email protected]/
>
> Also, any further comments on the patch (or the series in general)? If
> not, I can send a new commit message for this patch in-place.
Sorry, I haven't taken a look yet but will try in a week or so.
On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <[email protected]> wrote:
> >
> [...]
> > >
> > > Yes this looks better. I think we should also ask intel perf and
> > > phoronix folks to run their benchmarks as well (but no need to block
> > > on them).
> >
> > Anything I need to do for this to happen? (I thought such testing is
> > already done on linux-next)
>
> Just Cced the relevant folks.
>
> Michael, Oliver & Feng, if you have some time/resource available,
> please do trigger your performance benchmarks on the following series
> (but nothing urgent):
>
> https://lore.kernel.org/all/[email protected]/
Thanks for that.
>
> >
> > Also, any further comments on the patch (or the series in general)? If
> > not, I can send a new commit message for this patch in-place.
>
> Sorry, I haven't taken a look yet but will try in a week or so.
Sounds good, thanks.
Meanwhile, Andrew, could you please replace the commit log of this
patch as follows for more updated testing info:
Subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
A global counter for the magnitude of memcg stats updates is maintained
on the memcg side to avoid invoking rstat flushes when the pending
updates are not significant. This avoids unnecessary flushes, which are
not very cheap even if there isn't a lot of stats to flush. It also
avoids unnecessary lock contention on the underlying global rstat lock.
Make this threshold per-memcg. The same scheme is followed, where percpu
(now also per-memcg) counters are incremented in the update path and only
propagated to per-memcg atomics when they exceed a certain threshold.
This provides two benefits:
(a) On large machines with a lot of memcgs, the global threshold can be
reached relatively fast, so guarding the underlying lock becomes less
effective. Making the threshold per-memcg avoids this.
(b) Having a global threshold makes it hard to do subtree flushes, as we
cannot reset the global counter except for a full flush. Per-memcg
counters removes this as a blocker from doing subtree flushes, which
helps avoid unnecessary work when the stats of a small subtree are
needed.
Nothing is free, of course. This comes at a cost:
(a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
bytes. The extra memory usage is insignificant.
(b) More work on the update side, although in the common case it will
only be percpu counter updates. The amount of work scales with the
number of ancestors (i.e. tree depth). This is not a new concept; adding
a cgroup to the rstat tree involves a parent loop, and so does charging.
Testing results below show no significant regressions.
(c) The error margin in the stats for the system as a whole increases
from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
NR_MEMCGS. This is probably fine because we have a similar per-memcg
error in charges coming from percpu stocks, and we have a periodic
flusher that makes sure we always flush all the stats every 2s anyway.
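As a rough illustration of the scheme described above, here is a minimal
userspace sketch (not the kernel code; the names, batch size, and flush
threshold are assumptions chosen only for the example):

#include <stdatomic.h>
#include <stdio.h>

#define NR_CPUS 4
#define BATCH   64	/* stand-in for a MEMCG_CHARGE_BATCH-like value */

struct memcg {
	struct memcg *parent;
	/* percpu pending magnitude of stat updates for this memcg */
	unsigned int pending[NR_CPUS];
	/* per-memcg atomic that readers compare against a threshold */
	atomic_uint stats_updates;
};

/* Update path: the cost scales with tree depth (one parent loop). */
static void memcg_stat_updated(struct memcg *memcg, int cpu, unsigned int nr)
{
	for (; memcg; memcg = memcg->parent) {
		memcg->pending[cpu] += nr;
		if (memcg->pending[cpu] >= BATCH) {
			atomic_fetch_add(&memcg->stats_updates,
					 memcg->pending[cpu]);
			memcg->pending[cpu] = 0;
		}
	}
}

/* Read path: flush only if this subtree accumulated significant updates. */
static int memcg_should_flush(struct memcg *memcg)
{
	return atomic_load(&memcg->stats_updates) >= NR_CPUS * BATCH;
}

int main(void)
{
	struct memcg root = { 0 };
	struct memcg a = { .parent = &root };
	struct memcg b = { .parent = &a };

	for (int i = 0; i < 1000; i++)
		memcg_stat_updated(&b, i % NR_CPUS, 1);

	printf("flush b? %d flush root? %d\n",
	       memcg_should_flush(&b), memcg_should_flush(&root));
	return 0;
}

The point of the sketch is that the "should we flush?" decision depends
only on the subtree being read, which is what later allows subtree
flushes instead of global ones.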
This patch was tested to make sure no significant regressions are
introduced on the update path as follows. The following benchmarks were
run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):
(1) Running 22 instances of netperf on a 44 cpu machine with
hyperthreading disabled. All instances are run in a level 2 cgroup, as
well as netserver:
# netserver -6
# netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
Averaging 20 runs, the numbers are as follows:
Base: 40198.0 mbps
Patched: 38629.7 mbps (-3.9%)
The regression is minimal, especially for 22 instances in the same
cgroup sharing all ancestors (so updating the same atomics).
(2) will-it-scale page_fault tests. These tests (specifically
per_process_ops in the page_fault3 test) previously detected a 25.9%
regression for a change in the stats update path [1]. These are the
numbers from 10 runs (+ is good) on a machine with 256 cpus:
LABEL | MEAN | MEDIAN | STDDEV |
------------------------------+-------------+-------------+-------------
page_fault1_per_process_ops | | | |
(A) base | 270249.164 | 265437.000 | 13451.836 |
(B) patched | 261368.709 | 255725.000 | 13394.767 |
| -3.29% | -3.66% | |
page_fault1_per_thread_ops | | | |
(A) base | 242111.345 | 239737.000 | 10026.031 |
(B) patched | 237057.109 | 235305.000 | 9769.687 |
| -2.09% | -1.85% | |
page_fault1_scalability | | |
(A) base | 0.034387 | 0.035168 | 0.0018283 |
(B) patched | 0.033988 | 0.034573 | 0.0018056 |
| -1.16% | -1.69% | |
page_fault2_per_process_ops | | |
(A) base | 203561.836 | 203301.000 | 2550.764 |
(B) patched | 197195.945 | 197746.000 | 2264.263 |
| -3.13% | -2.73% | |
page_fault2_per_thread_ops | | |
(A) base | 171046.473 | 170776.000 | 1509.679 |
(B) patched | 166626.327 | 166406.000 | 768.753 |
| -2.58% | -2.56% | |
page_fault2_scalability | | |
(A) base | 0.054026 | 0.053821 | 0.00062121 |
(B) patched | 0.053329 | 0.05306 | 0.00048394 |
| -1.29% | -1.41% | |
page_fault3_per_process_ops | | |
(A) base | 1295807.782 | 1297550.000 | 5907.585 |
(B) patched | 1275579.873 | 1273359.000 | 8759.160 |
| -1.56% | -1.86% | |
page_fault3_per_thread_ops | | |
(A) base | 391234.164 | 390860.000 | 1760.720 |
(B) patched | 377231.273 | 376369.000 | 1874.971 |
| -3.58% | -3.71% | |
page_fault3_scalability | | |
(A) base | 0.60369 | 0.60072 | 0.0083029 |
(B) patched | 0.61733 | 0.61544 | 0.009855 |
| +2.26% | +2.45% | |
All regressions seem to be minimal and within the normal variance for
the benchmark. The fix for [1] assumes that 3% is noise (and there were
no further practical complaints), so hopefully this means that such
variations in these microbenchmarks do not reflect on practical workloads.
(3) I also ran stress-ng in a nested cgroup and did not observe any
obvious regressions.
[1]https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
[..]
> > >
> > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > disabled, I ran 22 instances of netperf in parallel and got the
> > > following numbers from averaging 20 runs:
> > >
> > > Base: 33076.5 mbps
> > > Patched: 31410.1 mbps
> > >
> > > That's about 5% diff. I guess the number of iterations helps reduce
> > > the noise? I am not sure.
> > >
> > > Please also keep in mind that in this case all netperf instances are
> > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > setup processes would be a little more spread out, which means less
> > > common ancestors, so less contended atomic operations.
> >
> >
> > (Resending the reply as I messed up the last one, was not in plain text)
> >
> > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > experience. Here are the numbers:
> >
> > Base: 40198.0 mbps
> > Patched: 38629.7 mbps
> >
> > The regression is reduced to ~3.9%.
> >
> > What's more interesting is that going from a level 2 cgroup to a level
> > 4 cgroup is already a big hit with or without this patch:
> >
> > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> >
> > So going from level 2 to 4 is already a significant regression for
> > other reasons (e.g. hierarchical charging). This patch only makes it
> > marginally worse. This puts the numbers more into perspective imo than
> > comparing values at level 4. What do you think?
>
> I think it's reasonable.
>
> Especially comparing to how many cachelines we used to touch on the
> write side when all flushing happened there. This looks like a good
> trade-off to me.
Thanks.
Still wanting to figure out if this patch is what you suggested in our
previous discussion [1], to add a
Suggested-by if appropriate :)
[1]https://lore.kernel.org/lkml/[email protected]/
On Thu, Oct 12, 2023 at 04:28:49PM -0700, Yosry Ahmed wrote:
> [..]
> > > >
> > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > following numbers from averaging 20 runs:
> > > >
> > > > Base: 33076.5 mbps
> > > > Patched: 31410.1 mbps
> > > >
> > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > the noise? I am not sure.
> > > >
> > > > Please also keep in mind that in this case all netperf instances are
> > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > setup processes would be a little more spread out, which means less
> > > > common ancestors, so less contended atomic operations.
> > >
> > >
> > > (Resending the reply as I messed up the last one, was not in plain text)
> > >
> > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > experience. Here are the numbers:
> > >
> > > Base: 40198.0 mbps
> > > Patched: 38629.7 mbps
> > >
> > > The regression is reduced to ~3.9%.
> > >
> > > What's more interesting is that going from a level 2 cgroup to a level
> > > 4 cgroup is already a big hit with or without this patch:
> > >
> > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> > >
> > > So going from level 2 to 4 is already a significant regression for
> > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > marginally worse. This puts the numbers more into perspective imo than
> > > comparing values at level 4. What do you think?
> >
> > I think it's reasonable.
> >
> > Especially comparing to how many cachelines we used to touch on the
> > write side when all flushing happened there. This looks like a good
> > trade-off to me.
>
> Thanks.
>
> Still wanting to figure out if this patch is what you suggested in our
> previous discussion [1], to add a
> Suggested-by if appropriate :)
>
> [1]https://lore.kernel.org/lkml/[email protected]/
Haha, sort of. I suggested the cgroup-level flush-batching, but my
proposal was missing the clever upward propagation of the pending stat
updates that you added.
You can add the tag if you're feeling generous, but I wouldn't be mad
if you don't!
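As an aside for readers following the thread: below is a minimal userspace
model of the scheme being referred to here, i.e. counting stat updates per
cpu per memcg and folding the batched count into per-memcg atomics, walking
up the parent chain, only once it crosses a threshold. All of the names,
types, and the batch size are illustrative assumptions and do not match the
actual mm/memcontrol.c code.

/*
 * Userspace model of the idea only; names and the batch size are
 * illustrative and do not match the actual mm/memcontrol.c code.
 */
#include <stdatomic.h>
#include <stdio.h>

#define UPDATE_BATCH 64		/* assumed batch, in the spirit of MEMCG_CHARGE_BATCH */
#define NR_CPUS_MODEL 4		/* toy number of CPUs for the model */

struct memcg_model {
	struct memcg_model *parent;
	long percpu_pending[NR_CPUS_MODEL];	/* per-cpu, per-memcg pending count */
	atomic_long pending;			/* per-memcg magnitude of pending updates */
};

/* Update path: a cheap per-cpu add; touch atomics only once a batch is full. */
static void memcg_note_update(struct memcg_model *memcg, int cpu, long count)
{
	long batched = (memcg->percpu_pending[cpu] += count);
	struct memcg_model *pos;

	if (batched < UPDATE_BATCH)
		return;				/* common case: no atomics touched */

	memcg->percpu_pending[cpu] = 0;		/* the batch is being propagated */
	for (pos = memcg; pos; pos = pos->parent)
		atomic_fetch_add(&pos->pending, batched);
}

int main(void)
{
	struct memcg_model root = { 0 };
	struct memcg_model child = { .parent = &root };
	int i;

	for (i = 0; i < 1000; i++)
		memcg_note_update(&child, i % NR_CPUS_MODEL, 1);

	/* Both the child and its ancestor have seen the batched updates. */
	printf("child pending=%ld root pending=%ld\n",
	       atomic_load(&child.pending), atomic_load(&root.pending));
	return 0;
}

The common-case update path touches only a per-cpu counter; the contended
atomics are hit once per batch per cpu, and every ancestor learns about the
pending updates in its subtree, which is what the "upward propagation" above
refers to.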
On Thu, Oct 12, 2023 at 7:33 PM Johannes Weiner <[email protected]> wrote:
>
> On Thu, Oct 12, 2023 at 04:28:49PM -0700, Yosry Ahmed wrote:
> > [..]
> > > > >
> > > > > Using next-20231009 and a similar 44 core machine with hyperthreading
> > > > > disabled, I ran 22 instances of netperf in parallel and got the
> > > > > following numbers from averaging 20 runs:
> > > > >
> > > > > Base: 33076.5 mbps
> > > > > Patched: 31410.1 mbps
> > > > >
> > > > > That's about 5% diff. I guess the number of iterations helps reduce
> > > > > the noise? I am not sure.
> > > > >
> > > > > Please also keep in mind that in this case all netperf instances are
> > > > > in the same cgroup and at a 4-level depth. I imagine in a practical
> > > > > setup processes would be a little more spread out, which means less
> > > > > common ancestors, so less contended atomic operations.
> > > >
> > > >
> > > > (Resending the reply as I messed up the last one, was not in plain text)
> > > >
> > > > I was curious, so I ran the same testing in a cgroup 2 levels deep
> > > > (i.e /sys/fs/cgroup/a/b), which is a much more common setup in my
> > > > experience. Here are the numbers:
> > > >
> > > > Base: 40198.0 mbps
> > > > Patched: 38629.7 mbps
> > > >
> > > > The regression is reduced to ~3.9%.
> > > >
> > > > What's more interesting is that going from a level 2 cgroup to a level
> > > > 4 cgroup is already a big hit with or without this patch:
> > > >
> > > > Base: 40198.0 -> 33076.5 mbps (~17.7% regression)
> > > > Patched: 38629.7 -> 31410.1 (~18.7% regression)
> > > >
> > > > So going from level 2 to 4 is already a significant regression for
> > > > other reasons (e.g. hierarchical charging). This patch only makes it
> > > > marginally worse. This puts the numbers more into perspective imo than
> > > > comparing values at level 4. What do you think?
> > >
> > > I think it's reasonable.
> > >
> > > Especially comparing to how many cachelines we used to touch on the
> > > write side when all flushing happened there. This looks like a good
> > > trade-off to me.
> >
> > Thanks.
> >
> > Still wanting to figure out if this patch is what you suggested in our
> > previous discussion [1], to add a
> > Suggested-by if appropriate :)
> >
> > [1]https://lore.kernel.org/lkml/[email protected]/
>
> Haha, sort of. I suggested the cgroup-level flush-batching, but my
> proposal was missing the clever upward propagation of the pending stat
> updates that you added.
>
> You can add the tag if you're feeling generous, but I wouldn't be mad
> if you don't!
I like to think that I am a generous person :)
Will add it in the next respin.
On Thu, 12 Oct 2023 15:23:06 -0700 Yosry Ahmed <[email protected]> wrote:
> Meanwhile, Andrew, could you please replace the commit log of this
> patch as follows for more updated testing info:
Done.
On Sat, Oct 14, 2023 at 4:08 PM Andrew Morton <[email protected]> wrote:
>
> On Thu, 12 Oct 2023 15:23:06 -0700 Yosry Ahmed <[email protected]> wrote:
>
> > Meanwhile, Andrew, could you please replace the commit log of this
> > patch as follows for more updated testing info:
>
> Done.
Thanks!
On Sat, Oct 14, 2023 at 4:08 PM Andrew Morton <[email protected]> wrote:
>
> On Thu, 12 Oct 2023 15:23:06 -0700 Yosry Ahmed <[email protected]> wrote:
>
> > Meanwhile, Andrew, could you please replace the commit log of this
> > patch as follows for more updated testing info:
>
> Done.
Sorry Andrew, but could you please also take this fixlet?
From: Yosry Ahmed <[email protected]>
Date: Tue, 17 Oct 2023 23:07:59 +0000
Subject: [PATCH] mm: memcg: clear percpu stats_pending during stats flush
When flushing memcg stats, we clear the per-memcg count of pending stat
updates, as they are captured by the flush. Also clear the percpu count
for the cpu being flushed.
Suggested-by: Wei Xu <[email protected]>
Signed-off-by: Yosry Ahmed <[email protected]>
---
mm/memcontrol.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 0b1377b16b3e0..fa92de780ac89 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5653,6 +5653,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
}
}
}
+ statc->stats_updates = 0;
/* We are in a per-cpu loop here, only do the atomic write once */
if (atomic64_read(&memcg->vmstats->stats_updates))
atomic64_set(&memcg->vmstats->stats_updates, 0);
--
2.42.0.655.g421f12c284-goog
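For context, the hunk above lands roughly as follows in
mem_cgroup_css_rstat_flush(); this is only an approximate reconstruction
around the quoted lines (the statc lookup and the elided aggregation are
paraphrased, not the verbatim function):

static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
{
	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
	struct memcg_vmstats_percpu *statc;

	statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);

	/* ... fold this cpu's state/event deltas into memcg->vmstats ... */

	/* The flush captured this cpu's pending updates, so clear its count. */
	statc->stats_updates = 0;

	/* We are in a per-cpu loop here, only do the atomic write once */
	if (atomic64_read(&memcg->vmstats->stats_updates))
		atomic64_set(&memcg->vmstats->stats_updates, 0);
}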
hi, Yosry Ahmed, hi, Shakeel Butt,
On Thu, Oct 12, 2023 at 03:23:06PM -0700, Yosry Ahmed wrote:
> On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt <[email protected]> wrote:
> >
> > On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <[email protected]> wrote:
> > >
> > [...]
> > > >
> > > > Yes this looks better. I think we should also ask intel perf and
> > > > phoronix folks to run their benchmarks as well (but no need to block
> > > > on them).
> > >
> > > Anything I need to do for this to happen? (I thought such testing is
> > > already done on linux-next)
> >
> > Just Cced the relevant folks.
> >
> > Michael, Oliver & Feng, if you have some time/resource available,
> > please do trigger your performance benchmarks on the following series
> > (but nothing urgent):
> >
> > https://lore.kernel.org/all/[email protected]/
>
> Thanks for that.
we (0day team) have already applied the patch-set as:
c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything <---- the base our tool picked for the patch set
they're already in our so-called hourly kernel, which is under various
functional and performance tests.
our 0day test logic is: if we find any regression in these hourly kernels
compared to the base (e.g. a milestone release), an auto-bisect will be triggered.
then we only report once we capture a first bad commit for a regression.
based on this, if you don't receive any report in the following 2-3 weeks, you
can assume that 0day did not capture any regression from your patch-set.
*However*, please be aware that 0day is not a traditional CI system, and also,
due to resource constraints, we cannot guarantee coverage, nor can we trigger
specific tests for your patchset.
(sorry if this is not your expectation)
>
> >
> > >
> > > Also, any further comments on the patch (or the series in general)? If
> > > not, I can send a new commit message for this patch in-place.
> >
> > Sorry, I haven't taken a look yet but will try in a week or so.
>
> Sounds good, thanks.
>
> Meanwhile, Andrew, could you please replace the commit log of this
> patch as follows for more updated testing info:
>
> Subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
>
> A global counter for the magnitude of memcg stats update is maintained
> on the memcg side to avoid invoking rstat flushes when the pending
> updates are not significant. This avoids unnecessary flushes, which are
> not very cheap even if there isn't a lot of stats to flush. It also
> avoids unnecessary lock contention on the underlying global rstat lock.
>
> Make this threshold per-memcg. The same scheme is followed, where percpu
> (now also per-memcg) counters are incremented in the update path, and only
> propagated to per-memcg atomics when they exceed a certain threshold.
>
> This provides two benefits:
> (a) On large machines with a lot of memcgs, the global threshold can be
> reached relatively fast, so guarding the underlying lock becomes less
> effective. Making the threshold per-memcg avoids this.
>
> (b) Having a global threshold makes it hard to do subtree flushes, as we
> cannot reset the global counter except for a full flush. Per-memcg
> counters removes this as a blocker from doing subtree flushes, which
> helps avoid unnecessary work when the stats of a small subtree are
> needed.
>
> Nothing is free, of course. This comes at a cost:
> (a) A new per-cpu counter per memcg, consuming NR_CPUS * NR_MEMCGS * 4
> bytes. The extra memory usage is insignificant.
>
> (b) More work on the update side, although in the common case it will
> only be percpu counter updates. The amount of work scales with the
> number of ancestors (i.e. tree depth). This is not a new concept; adding
> a cgroup to the rstat tree involves a parent loop, and so does charging.
> Testing results below show no significant regressions.
>
> (c) The error margin in the stats for the system as a whole increases
> from NR_CPUS * MEMCG_CHARGE_BATCH to NR_CPUS * MEMCG_CHARGE_BATCH *
> NR_MEMCGS. This is probably fine because we have a similar per-memcg
> error in charges coming from percpu stocks, and we have a periodic
> flusher that makes sure we always flush all the stats every 2s anyway.
>
> This patch was tested to make sure no significant regressions are
> introduced on the update path as follows. The following benchmarks were
> run in a cgroup that is 2 levels deep (/sys/fs/cgroup/a/b/):
>
> (1) Running 22 instances of netperf on a 44 cpu machine with
> hyperthreading disabled. All instances are run in a level 2 cgroup, as
> well as netserver:
> # netserver -6
> # netperf -6 -H ::1 -l 60 -t TCP_SENDFILE -- -m 10K
>
> Averaging 20 runs, the numbers are as follows:
> Base: 40198.0 mbps
> Patched: 38629.7 mbps (-3.9%)
>
> The regression is minimal, especially for 22 instances in the same
> cgroup sharing all ancestors (so updating the same atomics).
>
> (2) will-it-scale page_fault tests. These tests (specifically
> per_process_ops in the page_fault3 test) previously detected a 25.9%
> regression for a change in the stats update path [1]. These are the
> numbers from 10 runs (+ is good) on a machine with 256 cpus:
>
> LABEL | MEAN | MEDIAN | STDDEV |
> ------------------------------+-------------+-------------+-------------
> page_fault1_per_process_ops | | | |
> (A) base | 270249.164 | 265437.000 | 13451.836 |
> (B) patched | 261368.709 | 255725.000 | 13394.767 |
> | -3.29% | -3.66% | |
> page_fault1_per_thread_ops | | | |
> (A) base | 242111.345 | 239737.000 | 10026.031 |
> (B) patched | 237057.109 | 235305.000 | 9769.687 |
> | -2.09% | -1.85% | |
> page_fault1_scalability | | |
> (A) base | 0.034387 | 0.035168 | 0.0018283 |
> (B) patched | 0.033988 | 0.034573 | 0.0018056 |
> | -1.16% | -1.69% | |
> page_fault2_per_process_ops | | |
> (A) base | 203561.836 | 203301.000 | 2550.764 |
> (B) patched | 197195.945 | 197746.000 | 2264.263 |
> | -3.13% | -2.73% | |
> page_fault2_per_thread_ops | | |
> (A) base | 171046.473 | 170776.000 | 1509.679 |
> (B) patched | 166626.327 | 166406.000 | 768.753 |
> | -2.58% | -2.56% | |
> page_fault2_scalability | | |
> (A) base | 0.054026 | 0.053821 | 0.00062121 |
> (B) patched | 0.053329 | 0.05306 | 0.00048394 |
> | -1.29% | -1.41% | |
> page_fault3_per_process_ops | | |
> (A) base | 1295807.782 | 1297550.000 | 5907.585 |
> (B) patched | 1275579.873 | 1273359.000 | 8759.160 |
> | -1.56% | -1.86% | |
> page_fault3_per_thread_ops | | |
> (A) base | 391234.164 | 390860.000 | 1760.720 |
> (B) patched | 377231.273 | 376369.000 | 1874.971 |
> | -3.58% | -3.71% | |
> page_fault3_scalability | | |
> (A) base | 0.60369 | 0.60072 | 0.0083029 |
> (B) patched | 0.61733 | 0.61544 | 0.009855 |
> | +2.26% | +2.45% | |
>
> All regressions seem to be minimal, and within the normal variance for
> the benchmark. The fix for [1] assumes that 3% is noise (and there were no
> further practical complaints), so hopefully this means that such variations
> in these microbenchmarks are not reflected in practical workloads.
>
> (3) I also ran stress-ng in a nested cgroup and did not observe any
> obvious regressions.
>
> [1]https://lore.kernel.org/all/20190520063534.GB19312@shao2-debian/
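To keep the read side in mind as well, here is a matching userspace sketch of
what the per-memcg counter buys at flush time: each memcg compares its own
pending magnitude against a threshold and only then flushes (and resets) its
own subtree. The names, the threshold formula, and the flush hook below are
illustrative assumptions, not the actual kernel code.

/*
 * Userspace sketch of the read side only; the names, threshold formula,
 * and flush hook are illustrative, not the actual mm/memcontrol.c code.
 */
#include <stdatomic.h>
#include <stdbool.h>

#define CHARGE_BATCH_MODEL 64		/* assumed per-cpu batch size */

struct memcg_model {
	atomic_long pending;		/* per-memcg magnitude of pending updates */
};

/* With a per-memcg counter, each memcg decides for its own subtree. */
bool memcg_needs_flush(struct memcg_model *memcg, long nr_cpus)
{
	return atomic_load(&memcg->pending) > nr_cpus * CHARGE_BATCH_MODEL;
}

void memcg_flush_subtree(struct memcg_model *memcg, long nr_cpus)
{
	if (!memcg_needs_flush(memcg, nr_cpus))
		return;			/* stats are fresh enough; skip the flush */

	/* ... an rstat flush of just this memcg's subtree would go here ... */

	/* The flush captures the pending updates, so the counter can be reset. */
	atomic_store(&memcg->pending, 0);
}

With a single global counter, neither the per-subtree decision nor the reset
after a partial flush is possible, which is exactly benefit (b) above.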
On Wed, Oct 18, 2023 at 1:22 AM Oliver Sang <[email protected]> wrote:
>
> hi, Yosry Ahmed, hi, Shakeel Butt,
>
> On Thu, Oct 12, 2023 at 03:23:06PM -0700, Yosry Ahmed wrote:
> > On Thu, Oct 12, 2023 at 2:39 PM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Thu, Oct 12, 2023 at 2:20 PM Yosry Ahmed <[email protected]> wrote:
> > > >
> > > [...]
> > > > >
> > > > > Yes this looks better. I think we should also ask intel perf and
> > > > > phoronix folks to run their benchmarks as well (but no need to block
> > > > > on them).
> > > >
> > > > Anything I need to do for this to happen? (I thought such testing is
> > > > already done on linux-next)
> > >
> > > Just Cced the relevant folks.
> > >
> > > Michael, Oliver & Feng, if you have some time/resource available,
> > > please do trigger your performance benchmarks on the following series
> > > (but nothing urgent):
> > >
> > > https://lore.kernel.org/all/[email protected]/
> >
> > Thanks for that.
>
> we (0day team) have already applied the patch-set as:
>
> c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything <---- the base our tool picked for the patch set
>
> they're already in our so-called hourly kernel, which is under various
> functional and performance tests.
>
> our 0day test logic is: if we find any regression in these hourly kernels
> compared to the base (e.g. a milestone release), an auto-bisect will be triggered.
> then we only report once we capture a first bad commit for a regression.
>
> based on this, if you don't receive any report in the following 2-3 weeks, you
> can assume that 0day did not capture any regression from your patch-set.
>
> *However*, please be aware that 0day is not a traditional CI system, and also,
> due to resource constraints, we cannot guarantee coverage, nor can we trigger
> specific tests for your patchset.
> (sorry if this is not your expectation)
>
Thanks for taking a look and clarifying this, much appreciated.
Fingers crossed for not getting any reports :)
On Tue, 10 Oct 2023 03:21:11 +0000 Yosry Ahmed <[email protected]> wrote:
> This series attempts to address shortages in today's approach for memcg
> stats flushing, namely occasionally stale or expensive stat reads. The
> series does so by changing the threshold that we use to decide whether
> to trigger a flush to be per memcg instead of global (patch 3), and then
> changing flushing to be per memcg (i.e. subtree flushes) instead of
> global (patch 5).
>
> Patch 3 & 5 are the core of the series, and they include more details
> and testing results. The rest are either cleanups or prep work.
>
> This series replaces the "memcg: more sophisticated stats flushing"
> series [1], which also replaces another series, in a long list of
> attempts to improve memcg stats flushing. It is not a new version of
> the same patchset as it is a completely different approach. This is
> based on collected feedback from discussions on lkml in all previous
> attempts. Hopefully, this is the final attempt.
Seems that Shakeel's performance concerns have largely been set aside.
It would be good to have some affirmative input on this patchset from
the memcg developers, please?
Hello,
kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/all/[email protected]/
patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
testcase: will-it-scale
test machine: 104 threads 2 sockets (Skylake) with 192G memory
parameters:
nr_task: 100%
mode: thread
test: fallocate1
cpufreq_governor: performance
In addition to that, the commit also has significant impact on the following tests:
+------------------+---------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
| test machine | 104 threads 2 sockets (Skylake) with 192G memory |
| test parameters | cpufreq_governor=performance |
| | mode=thread |
| | nr_task=50% |
| | test=fallocate1 |
+------------------+---------------------------------------------------------------+
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]
Details are as below:
-------------------------------------------------------------------------------------------------->
The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231020/[email protected]
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
commit:
130617edc1 ("mm: memcg: move vmstats structs definition above flushing code")
51d74c18a9 ("mm: memcg: make stats flushing threshold per-memcg")
130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522
---------------- ---------------------------
%stddev %change %stddev
\ | \
2.09 -0.5 1.61 ? 2% mpstat.cpu.all.usr%
27.58 +3.7% 28.59 turbostat.RAMWatt
3324 -10.0% 2993 vmstat.system.cs
1056 -100.0% 0.00 numa-meminfo.node0.Inactive(file)
6.67 ?141% +15799.3% 1059 numa-meminfo.node1.Inactive(file)
120.83 ? 11% +79.6% 217.00 ? 9% perf-c2c.DRAM.local
594.50 ? 6% +43.8% 854.83 ? 5% perf-c2c.DRAM.remote
3797041 -25.8% 2816352 will-it-scale.104.threads
36509 -25.8% 27079 will-it-scale.per_thread_ops
3797041 -25.8% 2816352 will-it-scale.workload
1.142e+09 -26.2% 8.437e+08 numa-numastat.node0.local_node
1.143e+09 -26.1% 8.439e+08 numa-numastat.node0.numa_hit
1.148e+09 -25.4% 8.563e+08 ? 2% numa-numastat.node1.local_node
1.149e+09 -25.4% 8.564e+08 ? 2% numa-numastat.node1.numa_hit
32933 -2.6% 32068 proc-vmstat.nr_slab_reclaimable
2.291e+09 -25.8% 1.7e+09 proc-vmstat.numa_hit
2.291e+09 -25.8% 1.7e+09 proc-vmstat.numa_local
2.29e+09 -25.8% 1.699e+09 proc-vmstat.pgalloc_normal
2.289e+09 -25.8% 1.699e+09 proc-vmstat.pgfree
1.00 ? 93% +154.2% 2.55 ? 16% perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
191.10 ? 2% +18.0% 225.55 ? 2% perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
385.50 ? 14% +39.6% 538.17 ? 12% perf-sched.wait_and_delay.count.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
118.67 ? 11% -62.6% 44.33 ?100% perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
5043 ? 2% -13.0% 4387 ? 6% perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
167.12 ?222% +200.1% 501.48 ? 99% perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
191.09 ? 2% +18.0% 225.53 ? 2% perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
293.46 ? 4% +12.8% 330.98 ? 6% perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
199.33 -100.0% 0.00 numa-vmstat.node0.nr_active_file
264.00 -100.0% 0.00 numa-vmstat.node0.nr_inactive_file
199.33 -100.0% 0.00 numa-vmstat.node0.nr_zone_active_file
264.00 -100.0% 0.00 numa-vmstat.node0.nr_zone_inactive_file
1.143e+09 -26.1% 8.439e+08 numa-vmstat.node0.numa_hit
1.142e+09 -26.2% 8.437e+08 numa-vmstat.node0.numa_local
1.67 ?141% +15799.3% 264.99 numa-vmstat.node1.nr_inactive_file
1.67 ?141% +15799.3% 264.99 numa-vmstat.node1.nr_zone_inactive_file
1.149e+09 -25.4% 8.564e+08 ? 2% numa-vmstat.node1.numa_hit
1.148e+09 -25.4% 8.563e+08 ? 2% numa-vmstat.node1.numa_local
0.59 ? 3% +125.2% 1.32 ? 2% perf-stat.i.MPKI
9.027e+09 -17.9% 7.408e+09 perf-stat.i.branch-instructions
0.64 -0.0 0.60 perf-stat.i.branch-miss-rate%
58102855 -23.3% 44580037 ? 2% perf-stat.i.branch-misses
15.28 +7.0 22.27 perf-stat.i.cache-miss-rate%
25155306 ? 2% +82.7% 45953601 ? 3% perf-stat.i.cache-misses
1.644e+08 +25.4% 2.062e+08 ? 2% perf-stat.i.cache-references
3258 -10.3% 2921 perf-stat.i.context-switches
6.73 +23.3% 8.30 perf-stat.i.cpi
145.97 -1.3% 144.13 perf-stat.i.cpu-migrations
11519 ? 3% -45.4% 6293 ? 3% perf-stat.i.cycles-between-cache-misses
0.04 -0.0 0.03 perf-stat.i.dTLB-load-miss-rate%
3921408 -25.3% 2929564 perf-stat.i.dTLB-load-misses
1.098e+10 -18.1% 8.993e+09 perf-stat.i.dTLB-loads
0.00 ? 2% +0.0 0.00 ? 4% perf-stat.i.dTLB-store-miss-rate%
5.606e+09 -23.2% 4.304e+09 perf-stat.i.dTLB-stores
95.65 -1.2 94.49 perf-stat.i.iTLB-load-miss-rate%
3876741 -25.0% 2905764 perf-stat.i.iTLB-load-misses
4.286e+10 -18.9% 3.477e+10 perf-stat.i.instructions
11061 +8.2% 11969 perf-stat.i.instructions-per-iTLB-miss
0.15 -18.9% 0.12 perf-stat.i.ipc
48.65 ? 2% +46.2% 71.11 ? 2% perf-stat.i.metric.K/sec
247.84 -18.9% 201.05 perf-stat.i.metric.M/sec
3138385 ? 2% +77.7% 5578401 ? 2% perf-stat.i.node-load-misses
375827 ? 3% +69.2% 635857 ? 11% perf-stat.i.node-loads
1343194 -26.8% 983668 perf-stat.i.node-store-misses
51550 ? 3% -19.0% 41748 ? 7% perf-stat.i.node-stores
0.59 ? 3% +125.1% 1.32 ? 2% perf-stat.overall.MPKI
0.64 -0.0 0.60 perf-stat.overall.branch-miss-rate%
15.30 +7.0 22.28 perf-stat.overall.cache-miss-rate%
6.73 +23.3% 8.29 perf-stat.overall.cpi
11470 ? 2% -45.3% 6279 ? 2% perf-stat.overall.cycles-between-cache-misses
0.04 -0.0 0.03 perf-stat.overall.dTLB-load-miss-rate%
0.00 ? 2% +0.0 0.00 ? 4% perf-stat.overall.dTLB-store-miss-rate%
95.56 -1.4 94.17 perf-stat.overall.iTLB-load-miss-rate%
11059 +8.2% 11967 perf-stat.overall.instructions-per-iTLB-miss
0.15 -18.9% 0.12 perf-stat.overall.ipc
3396437 +9.5% 3718021 perf-stat.overall.path-length
8.997e+09 -17.9% 7.383e+09 perf-stat.ps.branch-instructions
57910417 -23.3% 44426577 ? 2% perf-stat.ps.branch-misses
25075498 ? 2% +82.7% 45803186 ? 3% perf-stat.ps.cache-misses
1.639e+08 +25.4% 2.056e+08 ? 2% perf-stat.ps.cache-references
3247 -10.3% 2911 perf-stat.ps.context-switches
145.47 -1.3% 143.61 perf-stat.ps.cpu-migrations
3908900 -25.3% 2920218 perf-stat.ps.dTLB-load-misses
1.094e+10 -18.1% 8.963e+09 perf-stat.ps.dTLB-loads
5.587e+09 -23.2% 4.289e+09 perf-stat.ps.dTLB-stores
3863663 -25.0% 2895895 perf-stat.ps.iTLB-load-misses
4.272e+10 -18.9% 3.466e+10 perf-stat.ps.instructions
3128132 ? 2% +77.7% 5559939 ? 2% perf-stat.ps.node-load-misses
375403 ? 3% +69.0% 634300 ? 11% perf-stat.ps.node-loads
1338688 -26.8% 980311 perf-stat.ps.node-store-misses
51546 ? 3% -19.1% 41692 ? 7% perf-stat.ps.node-stores
1.29e+13 -18.8% 1.047e+13 perf-stat.total.instructions
0.96 -0.3 0.70 ? 2% perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.97 -0.3 0.72 perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
0.76 ? 2% -0.2 0.54 ? 3% perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.82 -0.2 0.60 ? 2% perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.91 -0.2 0.72 perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.68 +0.1 0.76 ? 2% perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.67 +0.1 1.77 perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.78 ? 2% +0.1 1.92 ? 2% perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
0.69 ? 5% +0.1 0.84 ? 4% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
1.56 ? 2% +0.2 1.76 ? 2% perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
0.85 ? 4% +0.4 1.23 ? 2% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.78 ? 4% +0.4 1.20 ? 3% perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
0.73 ? 4% +0.4 1.17 ? 3% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
48.39 +0.8 49.14 perf-profile.calltrace.cycles-pp.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.00 +0.8 0.77 ? 4% perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
40.24 +0.8 41.03 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
40.22 +0.8 41.01 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
0.00 +0.8 0.79 ? 3% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
40.19 +0.8 40.98 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru
1.33 ? 5% +0.8 2.13 ? 4% perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
48.16 +0.8 48.98 perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.00 +0.9 0.88 ? 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
47.92 +0.9 48.81 perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
47.07 +0.9 48.01 perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
46.59 +1.1 47.64 perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
0.99 -0.3 0.73 ? 2% perf-profile.children.cycles-pp.syscall_return_via_sysret
0.96 -0.3 0.70 ? 2% perf-profile.children.cycles-pp.shmem_alloc_folio
0.78 ? 2% -0.2 0.56 ? 3% perf-profile.children.cycles-pp.shmem_inode_acct_blocks
0.83 -0.2 0.61 ? 2% perf-profile.children.cycles-pp.alloc_pages_mpol
0.92 -0.2 0.73 perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.74 ? 2% -0.2 0.55 ? 2% perf-profile.children.cycles-pp.xas_store
0.67 -0.2 0.50 ? 3% perf-profile.children.cycles-pp.__alloc_pages
0.43 -0.1 0.31 ? 2% perf-profile.children.cycles-pp.__entry_text_start
0.41 ? 2% -0.1 0.30 ? 3% perf-profile.children.cycles-pp.free_unref_page_list
0.35 -0.1 0.25 ? 2% perf-profile.children.cycles-pp.xas_load
0.35 ? 2% -0.1 0.25 ? 4% perf-profile.children.cycles-pp.__mod_lruvec_state
0.39 -0.1 0.30 ? 2% perf-profile.children.cycles-pp.get_page_from_freelist
0.27 ? 2% -0.1 0.19 ? 4% perf-profile.children.cycles-pp.__mod_node_page_state
0.32 ? 3% -0.1 0.24 ? 3% perf-profile.children.cycles-pp.find_lock_entries
0.23 ? 2% -0.1 0.15 ? 4% perf-profile.children.cycles-pp.xas_descend
0.28 ? 3% -0.1 0.20 ? 3% perf-profile.children.cycles-pp._raw_spin_lock
0.25 ? 3% -0.1 0.18 ? 3% perf-profile.children.cycles-pp.__dquot_alloc_space
0.16 ? 3% -0.1 0.10 ? 5% perf-profile.children.cycles-pp.xas_find_conflict
0.26 ? 2% -0.1 0.20 ? 3% perf-profile.children.cycles-pp.filemap_get_entry
0.26 -0.1 0.20 ? 2% perf-profile.children.cycles-pp.rmqueue
0.20 ? 3% -0.1 0.14 ? 3% perf-profile.children.cycles-pp.truncate_cleanup_folio
0.19 ? 5% -0.1 0.14 ? 4% perf-profile.children.cycles-pp.xas_clear_mark
0.17 ? 5% -0.0 0.12 ? 4% perf-profile.children.cycles-pp.xas_init_marks
0.15 ? 4% -0.0 0.10 ? 4% perf-profile.children.cycles-pp.free_unref_page_commit
0.18 ? 3% -0.0 0.14 ? 3% perf-profile.children.cycles-pp.__cond_resched
0.07 ? 5% -0.0 0.02 ? 99% perf-profile.children.cycles-pp.xas_find
0.13 ? 2% -0.0 0.09 perf-profile.children.cycles-pp.security_vm_enough_memory_mm
0.14 ? 4% -0.0 0.10 ? 7% perf-profile.children.cycles-pp.__fget_light
0.06 ? 6% -0.0 0.02 ? 99% perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.12 ? 4% -0.0 0.08 ? 4% perf-profile.children.cycles-pp.xas_start
0.08 ? 5% -0.0 0.05 perf-profile.children.cycles-pp.__folio_throttle_swaprate
0.12 -0.0 0.08 ? 5% perf-profile.children.cycles-pp.folio_unlock
0.14 ? 3% -0.0 0.11 ? 3% perf-profile.children.cycles-pp.try_charge_memcg
0.12 ? 6% -0.0 0.08 ? 5% perf-profile.children.cycles-pp.free_unref_page_prepare
0.12 ? 3% -0.0 0.09 ? 4% perf-profile.children.cycles-pp.noop_dirty_folio
0.20 ? 2% -0.0 0.17 ? 5% perf-profile.children.cycles-pp.page_counter_uncharge
0.10 -0.0 0.07 ? 5% perf-profile.children.cycles-pp.cap_vm_enough_memory
0.09 ? 6% -0.0 0.06 ? 6% perf-profile.children.cycles-pp._raw_spin_trylock
0.09 ? 5% -0.0 0.06 ? 7% perf-profile.children.cycles-pp.inode_add_bytes
0.06 ? 6% -0.0 0.03 ? 70% perf-profile.children.cycles-pp.filemap_free_folio
0.06 ? 6% -0.0 0.03 ? 70% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.12 ? 3% -0.0 0.09 ? 5% perf-profile.children.cycles-pp.__folio_cancel_dirty
0.12 ? 3% -0.0 0.10 ? 5% perf-profile.children.cycles-pp.shmem_recalc_inode
0.09 ? 5% -0.0 0.07 ? 7% perf-profile.children.cycles-pp.__vm_enough_memory
0.08 ? 5% -0.0 0.06 perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.08 ? 5% -0.0 0.06 perf-profile.children.cycles-pp.security_file_permission
0.08 ? 6% -0.0 0.05 ? 7% perf-profile.children.cycles-pp.apparmor_file_permission
0.09 ? 4% -0.0 0.07 ? 8% perf-profile.children.cycles-pp.__percpu_counter_limited_add
0.08 ? 6% -0.0 0.06 ? 8% perf-profile.children.cycles-pp.__list_add_valid_or_report
0.07 ? 8% -0.0 0.05 perf-profile.children.cycles-pp.get_pfnblock_flags_mask
0.14 ? 3% -0.0 0.12 ? 6% perf-profile.children.cycles-pp.cgroup_rstat_updated
0.07 ? 5% -0.0 0.05 perf-profile.children.cycles-pp.policy_nodemask
0.24 ? 2% -0.0 0.22 ? 2% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.08 -0.0 0.07 ? 7% perf-profile.children.cycles-pp.xas_create
0.69 +0.1 0.78 perf-profile.children.cycles-pp.lru_add_fn
1.72 ? 2% +0.1 1.80 perf-profile.children.cycles-pp.shmem_add_to_page_cache
1.79 ? 2% +0.1 1.93 ? 2% perf-profile.children.cycles-pp.filemap_remove_folio
0.13 ? 5% +0.1 0.28 perf-profile.children.cycles-pp.file_modified
0.69 ? 5% +0.1 0.84 ? 3% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
0.09 ? 7% +0.2 0.24 ? 2% perf-profile.children.cycles-pp.inode_needs_update_time
1.58 ? 3% +0.2 1.77 ? 2% perf-profile.children.cycles-pp.__filemap_remove_folio
0.15 ? 3% +0.4 0.50 ? 3% perf-profile.children.cycles-pp.__count_memcg_events
0.79 ? 4% +0.4 1.20 ? 3% perf-profile.children.cycles-pp.filemap_unaccount_folio
0.36 ? 5% +0.4 0.77 ? 4% perf-profile.children.cycles-pp.mem_cgroup_commit_charge
98.33 +0.5 98.78 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
97.74 +0.6 98.34 perf-profile.children.cycles-pp.do_syscall_64
48.39 +0.8 49.15 perf-profile.children.cycles-pp.__x64_sys_fallocate
1.34 ? 5% +0.8 2.14 ? 4% perf-profile.children.cycles-pp.__mem_cgroup_charge
1.61 ? 4% +0.8 2.42 ? 2% perf-profile.children.cycles-pp.__mod_lruvec_page_state
48.17 +0.8 48.98 perf-profile.children.cycles-pp.vfs_fallocate
47.94 +0.9 48.82 perf-profile.children.cycles-pp.shmem_fallocate
47.10 +0.9 48.04 perf-profile.children.cycles-pp.shmem_get_folio_gfp
84.34 +0.9 85.28 perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
84.31 +0.9 85.26 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
84.24 +1.0 85.21 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
46.65 +1.1 47.70 perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
1.23 ? 4% +1.4 2.58 ? 2% perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
0.98 -0.3 0.73 ? 2% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.88 -0.2 0.70 perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.60 -0.2 0.45 perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.41 ? 3% -0.1 0.27 ? 3% perf-profile.self.cycles-pp.release_pages
0.41 -0.1 0.30 ? 3% perf-profile.self.cycles-pp.xas_store
0.41 ? 3% -0.1 0.29 ? 2% perf-profile.self.cycles-pp.folio_batch_move_lru
0.30 ? 3% -0.1 0.18 ? 5% perf-profile.self.cycles-pp.shmem_add_to_page_cache
0.38 ? 2% -0.1 0.27 ? 2% perf-profile.self.cycles-pp.__entry_text_start
0.30 ? 3% -0.1 0.20 ? 6% perf-profile.self.cycles-pp.lru_add_fn
0.28 ? 2% -0.1 0.20 ? 5% perf-profile.self.cycles-pp.shmem_fallocate
0.26 ? 2% -0.1 0.18 ? 5% perf-profile.self.cycles-pp.__mod_node_page_state
0.27 ? 3% -0.1 0.20 ? 2% perf-profile.self.cycles-pp._raw_spin_lock
0.21 ? 2% -0.1 0.15 ? 4% perf-profile.self.cycles-pp.__alloc_pages
0.20 ? 2% -0.1 0.14 ? 3% perf-profile.self.cycles-pp.xas_descend
0.26 ? 3% -0.1 0.20 ? 4% perf-profile.self.cycles-pp.find_lock_entries
0.18 ? 4% -0.0 0.13 ? 5% perf-profile.self.cycles-pp.xas_clear_mark
0.15 ? 7% -0.0 0.10 ? 11% perf-profile.self.cycles-pp.shmem_inode_acct_blocks
0.16 ? 4% -0.0 0.12 ? 4% perf-profile.self.cycles-pp.__dquot_alloc_space
0.13 ? 4% -0.0 0.09 ? 5% perf-profile.self.cycles-pp.free_unref_page_commit
0.13 -0.0 0.09 ? 5% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.16 ? 4% -0.0 0.12 ? 4% perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
0.13 ? 5% -0.0 0.09 ? 7% perf-profile.self.cycles-pp.__filemap_remove_folio
0.13 ? 2% -0.0 0.09 ? 5% perf-profile.self.cycles-pp.get_page_from_freelist
0.12 ? 4% -0.0 0.09 ? 5% perf-profile.self.cycles-pp.vfs_fallocate
0.06 ? 7% -0.0 0.02 ? 99% perf-profile.self.cycles-pp.apparmor_file_permission
0.13 ? 3% -0.0 0.10 ? 5% perf-profile.self.cycles-pp.fallocate64
0.11 ? 4% -0.0 0.07 perf-profile.self.cycles-pp.xas_start
0.07 ? 5% -0.0 0.03 ? 70% perf-profile.self.cycles-pp.shmem_alloc_folio
0.14 ? 4% -0.0 0.10 ? 7% perf-profile.self.cycles-pp.__fget_light
0.10 ? 4% -0.0 0.06 ? 7% perf-profile.self.cycles-pp.rmqueue
0.12 ? 3% -0.0 0.09 ? 4% perf-profile.self.cycles-pp.xas_load
0.11 ? 4% -0.0 0.08 ? 7% perf-profile.self.cycles-pp.folio_unlock
0.10 ? 4% -0.0 0.07 ? 8% perf-profile.self.cycles-pp.alloc_pages_mpol
0.15 ? 2% -0.0 0.12 ? 5% perf-profile.self.cycles-pp.shmem_get_folio_gfp
0.10 -0.0 0.07 perf-profile.self.cycles-pp.cap_vm_enough_memory
0.16 ? 2% -0.0 0.13 ? 6% perf-profile.self.cycles-pp.page_counter_uncharge
0.12 ? 5% -0.0 0.09 ? 4% perf-profile.self.cycles-pp.__cond_resched
0.06 ? 6% -0.0 0.03 ? 70% perf-profile.self.cycles-pp.filemap_free_folio
0.12 ? 3% -0.0 0.10 ? 5% perf-profile.self.cycles-pp.free_unref_page_list
0.12 -0.0 0.09 ? 4% perf-profile.self.cycles-pp.noop_dirty_folio
0.10 ? 3% -0.0 0.07 ? 5% perf-profile.self.cycles-pp.filemap_remove_folio
0.10 ? 5% -0.0 0.07 ? 5% perf-profile.self.cycles-pp.try_charge_memcg
0.12 ? 3% -0.0 0.10 ? 8% perf-profile.self.cycles-pp.cgroup_rstat_updated
0.09 ? 4% -0.0 0.07 ? 7% perf-profile.self.cycles-pp.__folio_cancel_dirty
0.08 ? 4% -0.0 0.06 ? 8% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.08 ? 5% -0.0 0.06 perf-profile.self.cycles-pp._raw_spin_trylock
0.08 -0.0 0.06 ? 6% perf-profile.self.cycles-pp.folio_add_lru
0.08 ? 8% -0.0 0.06 ? 6% perf-profile.self.cycles-pp.__mod_lruvec_state
0.07 ? 5% -0.0 0.05 perf-profile.self.cycles-pp.xas_find_conflict
0.08 ? 10% -0.0 0.06 ? 9% perf-profile.self.cycles-pp.truncate_cleanup_folio
0.07 ? 10% -0.0 0.05 perf-profile.self.cycles-pp.xas_init_marks
0.08 ? 4% -0.0 0.06 ? 7% perf-profile.self.cycles-pp.__percpu_counter_limited_add
0.07 ? 7% -0.0 0.05 perf-profile.self.cycles-pp.get_pfnblock_flags_mask
0.07 ? 5% -0.0 0.06 ? 8% perf-profile.self.cycles-pp.__list_add_valid_or_report
0.02 ?141% +0.0 0.06 ? 8% perf-profile.self.cycles-pp.uncharge_batch
0.21 ? 9% +0.1 0.31 ? 7% perf-profile.self.cycles-pp.mem_cgroup_commit_charge
0.69 ? 5% +0.1 0.83 ? 4% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.06 ? 6% +0.2 0.22 ? 2% perf-profile.self.cycles-pp.inode_needs_update_time
0.14 ? 8% +0.3 0.42 ? 7% perf-profile.self.cycles-pp.__mem_cgroup_charge
0.13 ? 7% +0.4 0.49 ? 3% perf-profile.self.cycles-pp.__count_memcg_events
84.24 +1.0 85.21 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
1.12 ? 5% +1.4 2.50 ? 2% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
***************************************************************************************************
lkp-skl-fpga01: 104 threads 2 sockets (Skylake) with 192G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
commit:
130617edc1 ("mm: memcg: move vmstats structs definition above flushing code")
51d74c18a9 ("mm: memcg: make stats flushing threshold per-memcg")
130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522
---------------- ---------------------------
%stddev %change %stddev
\ | \
1.87 -0.4 1.43 ? 3% mpstat.cpu.all.usr%
3171 -5.3% 3003 ? 2% vmstat.system.cs
84.83 ? 9% +55.8% 132.17 ? 16% perf-c2c.DRAM.local
484.17 ? 3% +37.1% 663.67 ? 10% perf-c2c.DRAM.remote
72763 ? 5% +14.4% 83212 ? 12% turbostat.C1
0.08 -25.0% 0.06 turbostat.IPC
27.90 +4.6% 29.18 turbostat.RAMWatt
3982212 -30.0% 2785941 will-it-scale.52.threads
76580 -30.0% 53575 will-it-scale.per_thread_ops
3982212 -30.0% 2785941 will-it-scale.workload
1.175e+09 ? 2% -28.6% 8.392e+08 ? 2% numa-numastat.node0.local_node
1.175e+09 ? 2% -28.6% 8.394e+08 ? 2% numa-numastat.node0.numa_hit
1.231e+09 ? 2% -31.3% 8.463e+08 ? 3% numa-numastat.node1.local_node
1.232e+09 ? 2% -31.3% 8.466e+08 ? 3% numa-numastat.node1.numa_hit
1.175e+09 ? 2% -28.6% 8.394e+08 ? 2% numa-vmstat.node0.numa_hit
1.175e+09 ? 2% -28.6% 8.392e+08 ? 2% numa-vmstat.node0.numa_local
1.232e+09 ? 2% -31.3% 8.466e+08 ? 3% numa-vmstat.node1.numa_hit
1.231e+09 ? 2% -31.3% 8.463e+08 ? 3% numa-vmstat.node1.numa_local
2.408e+09 -30.0% 1.686e+09 proc-vmstat.numa_hit
2.406e+09 -30.0% 1.685e+09 proc-vmstat.numa_local
2.404e+09 -29.9% 1.684e+09 proc-vmstat.pgalloc_normal
2.404e+09 -29.9% 1.684e+09 proc-vmstat.pgfree
0.04 ? 9% -19.3% 0.03 ? 6% perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.04 ? 8% -22.3% 0.03 ? 5% perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.91 ? 2% +11.3% 1.01 ? 5% perf-sched.wait_and_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.04 ? 13% -90.3% 0.00 ?223% perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
1.14 +15.1% 1.31 perf-sched.wait_and_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
189.94 ? 3% +18.3% 224.73 ? 4% perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1652 ? 4% -13.4% 1431 ? 4% perf-sched.wait_and_delay.count.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
83.67 ? 7% -87.6% 10.33 ?223% perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
3827 ? 4% -13.0% 3328 ? 3% perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.71 ?165% -83.4% 0.28 ? 21% perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.43 ? 17% -43.8% 0.24 ? 26% perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.46 ? 17% -36.7% 0.29 ? 12% perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.30 ? 34% -90.7% 0.03 ?223% perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
0.04 ? 9% -19.3% 0.03 ? 6% perf-sched.wait_time.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.04 ? 8% -22.3% 0.03 ? 5% perf-sched.wait_time.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.04 ? 11% -33.1% 0.03 ? 17% perf-sched.wait_time.avg.ms.__cond_resched.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.90 ? 2% +11.5% 1.00 ? 5% perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.04 ? 13% -26.6% 0.03 ? 12% perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
1.13 +15.2% 1.30 perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
189.93 ? 3% +18.3% 224.72 ? 4% perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.71 ?165% -83.4% 0.28 ? 21% perf-sched.wait_time.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.43 ? 17% -43.8% 0.24 ? 26% perf-sched.wait_time.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.46 ? 17% -36.7% 0.29 ? 12% perf-sched.wait_time.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.75 +142.0% 1.83 ? 2% perf-stat.i.MPKI
8.47e+09 -24.4% 6.407e+09 perf-stat.i.branch-instructions
0.66 -0.0 0.63 perf-stat.i.branch-miss-rate%
56364992 -28.3% 40421603 ? 3% perf-stat.i.branch-misses
14.64 +6.7 21.30 perf-stat.i.cache-miss-rate%
30868184 +81.3% 55977240 ? 3% perf-stat.i.cache-misses
2.107e+08 +24.7% 2.627e+08 ? 2% perf-stat.i.cache-references
3106 -5.5% 2934 ? 2% perf-stat.i.context-switches
3.55 +33.4% 4.74 perf-stat.i.cpi
4722 -44.8% 2605 ? 3% perf-stat.i.cycles-between-cache-misses
0.04 -0.0 0.04 perf-stat.i.dTLB-load-miss-rate%
4117232 -29.1% 2917107 perf-stat.i.dTLB-load-misses
1.051e+10 -24.1% 7.979e+09 perf-stat.i.dTLB-loads
0.00 ? 3% +0.0 0.00 ? 6% perf-stat.i.dTLB-store-miss-rate%
5.886e+09 -27.5% 4.269e+09 perf-stat.i.dTLB-stores
78.16 -6.6 71.51 perf-stat.i.iTLB-load-miss-rate%
4131074 ? 3% -30.0% 2891515 perf-stat.i.iTLB-load-misses
4.098e+10 -25.0% 3.072e+10 perf-stat.i.instructions
9929 ? 2% +7.0% 10627 perf-stat.i.instructions-per-iTLB-miss
0.28 -25.0% 0.21 perf-stat.i.ipc
63.49 +43.8% 91.27 ? 3% perf-stat.i.metric.K/sec
241.12 -24.6% 181.87 perf-stat.i.metric.M/sec
3735316 +78.6% 6669641 ? 3% perf-stat.i.node-load-misses
377465 ? 4% +86.1% 702512 ? 11% perf-stat.i.node-loads
1322217 -27.6% 957081 ? 5% perf-stat.i.node-store-misses
37459 ? 3% -23.0% 28826 ? 5% perf-stat.i.node-stores
0.75 +141.8% 1.82 ? 2% perf-stat.overall.MPKI
0.67 -0.0 0.63 perf-stat.overall.branch-miss-rate%
14.65 +6.7 21.30 perf-stat.overall.cache-miss-rate%
3.55 +33.4% 4.73 perf-stat.overall.cpi
4713 -44.8% 2601 ? 3% perf-stat.overall.cycles-between-cache-misses
0.04 -0.0 0.04 perf-stat.overall.dTLB-load-miss-rate%
0.00 ? 3% +0.0 0.00 ? 5% perf-stat.overall.dTLB-store-miss-rate%
78.14 -6.7 71.47 perf-stat.overall.iTLB-load-miss-rate%
9927 ? 2% +7.0% 10624 perf-stat.overall.instructions-per-iTLB-miss
0.28 -25.0% 0.21 perf-stat.overall.ipc
3098901 +7.1% 3318983 perf-stat.overall.path-length
8.441e+09 -24.4% 6.385e+09 perf-stat.ps.branch-instructions
56179581 -28.3% 40286337 ? 3% perf-stat.ps.branch-misses
30759982 +81.3% 55777812 ? 3% perf-stat.ps.cache-misses
2.1e+08 +24.6% 2.618e+08 ? 2% perf-stat.ps.cache-references
3095 -5.5% 2923 ? 2% perf-stat.ps.context-switches
4103292 -29.1% 2907270 perf-stat.ps.dTLB-load-misses
1.048e+10 -24.1% 7.952e+09 perf-stat.ps.dTLB-loads
5.866e+09 -27.5% 4.255e+09 perf-stat.ps.dTLB-stores
4117020 ? 3% -30.0% 2881750 perf-stat.ps.iTLB-load-misses
4.084e+10 -25.0% 3.062e+10 perf-stat.ps.instructions
3722149 +78.5% 6645867 ? 3% perf-stat.ps.node-load-misses
376240 ? 4% +86.1% 700053 ? 11% perf-stat.ps.node-loads
1317772 -27.6% 953773 ? 5% perf-stat.ps.node-store-misses
37408 ? 3% -23.2% 28748 ? 5% perf-stat.ps.node-stores
1.234e+13 -25.1% 9.246e+12 perf-stat.total.instructions
1.28 -0.4 0.90 ? 2% perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
1.26 ? 2% -0.4 0.90 ? 3% perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.08 ? 2% -0.3 0.77 ? 3% perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.92 ? 2% -0.3 0.62 ? 3% perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.84 ? 3% -0.2 0.61 ? 3% perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.26 -0.2 1.08 perf-profile.calltrace.cycles-pp.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr
1.26 -0.2 1.08 perf-profile.calltrace.cycles-pp.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change
1.24 -0.2 1.06 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range
1.24 -0.2 1.06 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release
1.23 -0.2 1.06 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu
1.20 -0.2 1.04 ? 2% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.68 ? 3% +0.0 0.72 ? 4% perf-profile.calltrace.cycles-pp.__mem_cgroup_uncharge_list.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
1.08 +0.1 1.20 perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
2.91 +0.3 3.18 ? 2% perf-profile.calltrace.cycles-pp.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change.do_truncate
2.56 +0.4 2.92 ? 2% perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
1.36 ? 3% +0.4 1.76 ? 9% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
2.22 +0.5 2.68 ? 2% perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
0.00 +0.6 0.60 ? 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
2.33 +0.6 2.94 perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.00 +0.7 0.72 ? 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
0.69 ? 4% +0.8 1.47 ? 3% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
1.24 ? 2% +0.8 2.04 ? 2% perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
0.00 +0.8 0.82 ? 4% perf-profile.calltrace.cycles-pp.__count_memcg_events.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.17 ? 2% +0.8 2.00 ? 2% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
0.59 ? 4% +0.9 1.53 perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.38 +1.0 2.33 ? 2% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.62 ? 3% +1.0 1.66 ? 5% perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
38.70 +1.2 39.90 perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
38.34 +1.3 39.65 perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
37.24 +1.6 38.86 perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
36.64 +1.8 38.40 perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
2.47 ? 2% +2.1 4.59 ? 8% perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.30 -0.4 0.92 ? 2% perf-profile.children.cycles-pp.syscall_return_via_sysret
1.28 ? 2% -0.4 0.90 ? 3% perf-profile.children.cycles-pp.shmem_alloc_folio
1.10 ? 2% -0.3 0.78 ? 3% perf-profile.children.cycles-pp.alloc_pages_mpol
0.96 ? 2% -0.3 0.64 ? 3% perf-profile.children.cycles-pp.shmem_inode_acct_blocks
0.88 -0.3 0.58 ? 2% perf-profile.children.cycles-pp.xas_store
0.88 ? 3% -0.2 0.64 ? 3% perf-profile.children.cycles-pp.__alloc_pages
0.61 ? 2% -0.2 0.43 ? 3% perf-profile.children.cycles-pp.__entry_text_start
1.26 -0.2 1.09 perf-profile.children.cycles-pp.lru_add_drain_cpu
0.56 -0.2 0.39 ? 4% perf-profile.children.cycles-pp.free_unref_page_list
1.22 -0.2 1.06 ? 2% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.46 -0.1 0.32 ? 3% perf-profile.children.cycles-pp.__mod_lruvec_state
0.41 ? 3% -0.1 0.28 ? 4% perf-profile.children.cycles-pp.xas_load
0.44 ? 4% -0.1 0.31 ? 4% perf-profile.children.cycles-pp.find_lock_entries
0.50 ? 3% -0.1 0.37 ? 2% perf-profile.children.cycles-pp.get_page_from_freelist
0.24 ? 7% -0.1 0.12 ? 5% perf-profile.children.cycles-pp.__list_add_valid_or_report
0.34 ? 2% -0.1 0.24 ? 4% perf-profile.children.cycles-pp.__mod_node_page_state
0.38 ? 3% -0.1 0.28 ? 4% perf-profile.children.cycles-pp._raw_spin_lock
0.32 ? 2% -0.1 0.22 ? 5% perf-profile.children.cycles-pp.__dquot_alloc_space
0.26 ? 2% -0.1 0.17 ? 2% perf-profile.children.cycles-pp.xas_descend
0.22 ? 3% -0.1 0.14 ? 4% perf-profile.children.cycles-pp.free_unref_page_commit
0.25 -0.1 0.17 ? 3% perf-profile.children.cycles-pp.xas_clear_mark
0.32 ? 4% -0.1 0.25 ? 3% perf-profile.children.cycles-pp.rmqueue
0.23 ? 2% -0.1 0.16 ? 2% perf-profile.children.cycles-pp.xas_init_marks
0.24 ? 2% -0.1 0.17 ? 5% perf-profile.children.cycles-pp.__cond_resched
0.25 ? 4% -0.1 0.18 ? 2% perf-profile.children.cycles-pp.truncate_cleanup_folio
0.30 ? 3% -0.1 0.23 ? 4% perf-profile.children.cycles-pp.filemap_get_entry
0.20 ? 2% -0.1 0.13 ? 5% perf-profile.children.cycles-pp.folio_unlock
0.16 ? 4% -0.1 0.10 ? 5% perf-profile.children.cycles-pp.xas_find_conflict
0.19 ? 3% -0.1 0.13 ? 5% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.17 ? 5% -0.1 0.12 ? 3% perf-profile.children.cycles-pp.noop_dirty_folio
0.13 ? 4% -0.1 0.08 ? 9% perf-profile.children.cycles-pp.security_vm_enough_memory_mm
0.18 ? 8% -0.1 0.13 ? 4% perf-profile.children.cycles-pp.shmem_recalc_inode
0.16 ? 2% -0.1 0.11 ? 3% perf-profile.children.cycles-pp.free_unref_page_prepare
0.09 ? 5% -0.1 0.04 ? 45% perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
0.10 ? 7% -0.0 0.05 ? 45% perf-profile.children.cycles-pp.cap_vm_enough_memory
0.14 ? 5% -0.0 0.10 perf-profile.children.cycles-pp.__folio_cancel_dirty
0.14 ? 5% -0.0 0.10 ? 4% perf-profile.children.cycles-pp.security_file_permission
0.10 ? 5% -0.0 0.06 ? 6% perf-profile.children.cycles-pp.xas_find
0.15 ? 4% -0.0 0.11 ? 3% perf-profile.children.cycles-pp.__fget_light
0.14 ? 5% -0.0 0.11 ? 3% perf-profile.children.cycles-pp.file_modified
0.12 ? 3% -0.0 0.09 ? 7% perf-profile.children.cycles-pp.__vm_enough_memory
0.12 ? 3% -0.0 0.09 ? 4% perf-profile.children.cycles-pp.apparmor_file_permission
0.12 ? 3% -0.0 0.08 ? 5% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.12 ? 4% -0.0 0.08 ? 4% perf-profile.children.cycles-pp.xas_start
0.09 -0.0 0.06 ? 8% perf-profile.children.cycles-pp.__folio_throttle_swaprate
0.12 ? 6% -0.0 0.08 ? 8% perf-profile.children.cycles-pp._raw_spin_trylock
0.12 ? 4% -0.0 0.08 ? 4% perf-profile.children.cycles-pp.__percpu_counter_limited_add
0.12 ? 4% -0.0 0.09 ? 4% perf-profile.children.cycles-pp.inode_add_bytes
0.20 ? 2% -0.0 0.17 ? 7% perf-profile.children.cycles-pp.try_charge_memcg
0.10 ? 5% -0.0 0.07 ? 7% perf-profile.children.cycles-pp.policy_nodemask
0.09 ? 6% -0.0 0.06 ? 6% perf-profile.children.cycles-pp.get_pfnblock_flags_mask
0.09 ? 6% -0.0 0.06 ? 7% perf-profile.children.cycles-pp.filemap_free_folio
0.07 ? 6% -0.0 0.05 ? 7% perf-profile.children.cycles-pp.down_write
0.08 ? 4% -0.0 0.06 ? 8% perf-profile.children.cycles-pp.get_task_policy
0.09 ? 5% -0.0 0.07 ? 5% perf-profile.children.cycles-pp.xas_create
0.09 ? 7% -0.0 0.07 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.09 ? 7% -0.0 0.07 perf-profile.children.cycles-pp.inode_needs_update_time
0.16 ? 2% -0.0 0.14 ? 5% perf-profile.children.cycles-pp.cgroup_rstat_updated
0.08 ? 7% -0.0 0.06 ? 9% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.07 ? 5% -0.0 0.05 ? 7% perf-profile.children.cycles-pp.folio_mark_dirty
0.08 ? 10% -0.0 0.06 ? 6% perf-profile.children.cycles-pp.shmem_is_huge
0.07 ? 6% +0.0 0.09 ? 10% perf-profile.children.cycles-pp.propagate_protected_usage
0.43 ? 3% +0.0 0.46 ? 5% perf-profile.children.cycles-pp.uncharge_batch
0.68 ? 3% +0.0 0.73 ? 4% perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
1.11 +0.1 1.22 perf-profile.children.cycles-pp.lru_add_fn
2.91 +0.3 3.18 ? 2% perf-profile.children.cycles-pp.truncate_inode_folio
2.56 +0.4 2.92 ? 2% perf-profile.children.cycles-pp.filemap_remove_folio
1.37 ? 3% +0.4 1.76 ? 9% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
2.24 +0.5 2.70 ? 2% perf-profile.children.cycles-pp.__filemap_remove_folio
2.38 +0.6 2.97 perf-profile.children.cycles-pp.shmem_add_to_page_cache
0.18 ? 4% +0.7 0.91 ? 4% perf-profile.children.cycles-pp.__count_memcg_events
1.26 +0.8 2.04 ? 2% perf-profile.children.cycles-pp.filemap_unaccount_folio
0.63 ? 2% +1.0 1.67 ? 5% perf-profile.children.cycles-pp.mem_cgroup_commit_charge
38.71 +1.2 39.91 perf-profile.children.cycles-pp.vfs_fallocate
38.37 +1.3 39.66 perf-profile.children.cycles-pp.shmem_fallocate
37.28 +1.6 38.89 perf-profile.children.cycles-pp.shmem_get_folio_gfp
36.71 +1.7 38.45 perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
2.58 +1.8 4.36 ? 2% perf-profile.children.cycles-pp.__mod_lruvec_page_state
2.48 ? 2% +2.1 4.60 ? 8% perf-profile.children.cycles-pp.__mem_cgroup_charge
1.93 ? 3% +2.4 4.36 ? 2% perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
1.30 -0.4 0.92 ? 2% perf-profile.self.cycles-pp.syscall_return_via_sysret
0.73 -0.2 0.52 ? 2% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.54 ? 2% -0.2 0.36 ? 3% perf-profile.self.cycles-pp.release_pages
0.48 -0.2 0.30 ? 3% perf-profile.self.cycles-pp.xas_store
0.54 ? 2% -0.2 0.38 ? 3% perf-profile.self.cycles-pp.__entry_text_start
1.17 -0.1 1.03 ? 2% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.36 ? 2% -0.1 0.22 ? 3% perf-profile.self.cycles-pp.shmem_add_to_page_cache
0.43 ? 5% -0.1 0.30 ? 7% perf-profile.self.cycles-pp.lru_add_fn
0.24 ? 7% -0.1 0.12 ? 6% perf-profile.self.cycles-pp.__list_add_valid_or_report
0.38 ? 4% -0.1 0.27 ? 4% perf-profile.self.cycles-pp._raw_spin_lock
0.52 ? 3% -0.1 0.41 perf-profile.self.cycles-pp.folio_batch_move_lru
0.32 ? 2% -0.1 0.22 ? 4% perf-profile.self.cycles-pp.__mod_node_page_state
0.36 ? 4% -0.1 0.26 ? 4% perf-profile.self.cycles-pp.find_lock_entries
0.36 ? 2% -0.1 0.26 ? 2% perf-profile.self.cycles-pp.shmem_fallocate
0.28 ? 3% -0.1 0.20 ? 5% perf-profile.self.cycles-pp.__alloc_pages
0.24 ? 2% -0.1 0.16 ? 4% perf-profile.self.cycles-pp.xas_descend
0.23 ? 2% -0.1 0.16 ? 3% perf-profile.self.cycles-pp.xas_clear_mark
0.18 ? 3% -0.1 0.11 ? 6% perf-profile.self.cycles-pp.free_unref_page_commit
0.18 ? 3% -0.1 0.12 ? 4% perf-profile.self.cycles-pp.shmem_inode_acct_blocks
0.21 ? 3% -0.1 0.15 ? 2% perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
0.18 ? 2% -0.1 0.12 ? 3% perf-profile.self.cycles-pp.__filemap_remove_folio
0.18 ? 7% -0.1 0.12 ? 7% perf-profile.self.cycles-pp.vfs_fallocate
0.20 ? 2% -0.1 0.14 ? 6% perf-profile.self.cycles-pp.__dquot_alloc_space
0.18 ? 2% -0.1 0.13 ? 3% perf-profile.self.cycles-pp.folio_unlock
0.18 ? 2% -0.1 0.12 ? 3% perf-profile.self.cycles-pp.get_page_from_freelist
0.15 ? 3% -0.1 0.10 ? 7% perf-profile.self.cycles-pp.xas_load
0.17 ? 3% -0.1 0.12 ? 8% perf-profile.self.cycles-pp.__cond_resched
0.17 ? 2% -0.1 0.12 ? 3% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.17 ? 5% -0.1 0.12 ? 3% perf-profile.self.cycles-pp.noop_dirty_folio
0.10 ? 7% -0.0 0.05 ? 45% perf-profile.self.cycles-pp.cap_vm_enough_memory
0.12 ? 3% -0.0 0.08 ? 4% perf-profile.self.cycles-pp.rmqueue
0.07 ? 5% -0.0 0.02 ? 99% perf-profile.self.cycles-pp.xas_find
0.13 ? 3% -0.0 0.09 ? 6% perf-profile.self.cycles-pp.alloc_pages_mpol
0.07 ? 6% -0.0 0.03 ? 70% perf-profile.self.cycles-pp.xas_find_conflict
0.16 ? 2% -0.0 0.12 ? 6% perf-profile.self.cycles-pp.free_unref_page_list
0.12 ? 5% -0.0 0.08 ? 4% perf-profile.self.cycles-pp.fallocate64
0.20 ? 4% -0.0 0.16 ? 3% perf-profile.self.cycles-pp.shmem_get_folio_gfp
0.06 ? 7% -0.0 0.02 ? 99% perf-profile.self.cycles-pp.shmem_recalc_inode
0.13 ? 3% -0.0 0.09 perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.22 ? 3% -0.0 0.19 ? 6% perf-profile.self.cycles-pp.page_counter_uncharge
0.14 ? 3% -0.0 0.10 ? 6% perf-profile.self.cycles-pp.filemap_remove_folio
0.15 ? 5% -0.0 0.11 ? 3% perf-profile.self.cycles-pp.__fget_light
0.12 ? 4% -0.0 0.08 perf-profile.self.cycles-pp.__folio_cancel_dirty
0.11 ? 4% -0.0 0.08 ? 7% perf-profile.self.cycles-pp._raw_spin_trylock
0.12 ? 3% -0.0 0.09 ? 5% perf-profile.self.cycles-pp.__mod_lruvec_state
0.11 ? 5% -0.0 0.08 ? 4% perf-profile.self.cycles-pp.truncate_cleanup_folio
0.11 ? 3% -0.0 0.08 ? 6% perf-profile.self.cycles-pp.__percpu_counter_limited_add
0.11 ? 3% -0.0 0.08 ? 6% perf-profile.self.cycles-pp.xas_start
0.10 ? 6% -0.0 0.07 ? 5% perf-profile.self.cycles-pp.xas_init_marks
0.09 ? 6% -0.0 0.06 ? 6% perf-profile.self.cycles-pp.get_pfnblock_flags_mask
0.11 -0.0 0.08 ? 5% perf-profile.self.cycles-pp.folio_add_lru
0.09 ? 6% -0.0 0.06 ? 7% perf-profile.self.cycles-pp.filemap_free_folio
0.09 ? 4% -0.0 0.06 ? 6% perf-profile.self.cycles-pp.shmem_alloc_folio
0.14 ? 5% -0.0 0.12 ? 5% perf-profile.self.cycles-pp.cgroup_rstat_updated
0.10 ? 4% -0.0 0.08 ? 4% perf-profile.self.cycles-pp.apparmor_file_permission
0.07 ? 7% -0.0 0.04 ? 44% perf-profile.self.cycles-pp.policy_nodemask
0.07 ? 11% -0.0 0.04 ? 45% perf-profile.self.cycles-pp.shmem_is_huge
0.08 ? 4% -0.0 0.06 ? 8% perf-profile.self.cycles-pp.get_task_policy
0.08 ? 6% -0.0 0.05 ? 8% perf-profile.self.cycles-pp.__x64_sys_fallocate
0.12 ? 3% -0.0 0.10 ? 6% perf-profile.self.cycles-pp.try_charge_memcg
0.07 -0.0 0.05 perf-profile.self.cycles-pp.free_unref_page_prepare
0.07 ? 6% -0.0 0.06 ? 9% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.08 ? 4% -0.0 0.06 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.09 ? 7% -0.0 0.07 ? 5% perf-profile.self.cycles-pp.filemap_get_entry
0.07 ? 9% +0.0 0.09 ? 10% perf-profile.self.cycles-pp.propagate_protected_usage
0.96 ? 2% +0.2 1.12 ? 7% perf-profile.self.cycles-pp.__mod_lruvec_page_state
0.45 ? 4% +0.4 0.82 ? 8% perf-profile.self.cycles-pp.mem_cgroup_commit_charge
1.36 ? 3% +0.4 1.75 ? 9% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.29 +0.7 1.00 ? 10% perf-profile.self.cycles-pp.__mem_cgroup_charge
0.16 ? 4% +0.7 0.90 ? 4% perf-profile.self.cycles-pp.__count_memcg_events
1.80 ? 2% +2.5 4.26 ? 2% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <[email protected]> wrote:
>
>
>
> Hello,
>
> kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
>
>
> commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
> url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
> base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> patch link: https://lore.kernel.org/all/[email protected]/
> patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
>
> testcase: will-it-scale
> test machine: 104 threads 2 sockets (Skylake) with 192G memory
> parameters:
>
> nr_task: 100%
> mode: thread
> test: fallocate1
> cpufreq_governor: performance
>
>
> In addition to that, the commit also has significant impact on the following tests:
>
> +------------------+---------------------------------------------------------------+
> | testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
> | test machine | 104 threads 2 sockets (Skylake) with 192G memory |
> | test parameters | cpufreq_governor=performance |
> | | mode=thread |
> | | nr_task=50% |
> | | test=fallocate1 |
> +------------------+---------------------------------------------------------------+
>
Yosry, I don't think a 25% to 30% regression can be ignored. Unless
there is a quick fix, IMO this series should be skipped for the
upcoming merge window.
On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <[email protected]> wrote:
>
> On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <[email protected]> wrote:
> >
> >
> >
> > Hello,
> >
> > kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
> >
> >
> > commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
> > url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
> > base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> > patch link: https://lore.kernel.org/all/[email protected]/
> > patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
> >
> > testcase: will-it-scale
> > test machine: 104 threads 2 sockets (Skylake) with 192G memory
> > parameters:
> >
> > nr_task: 100%
> > mode: thread
> > test: fallocate1
> > cpufreq_governor: performance
> >
> >
> > In addition to that, the commit also has significant impact on the following tests:
> >
> > +------------------+---------------------------------------------------------------+
> > | testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
> > | test machine | 104 threads 2 sockets (Skylake) with 192G memory |
> > | test parameters | cpufreq_governor=performance |
> > | | mode=thread |
> > | | nr_task=50% |
> > | | test=fallocate1 |
> > +------------------+---------------------------------------------------------------+
> >
>
> Yosry, I don't think 25% to 30% regression can be ignored. Unless
> there is a quick fix, IMO this series should be skipped for the
> upcoming kernel open window.
I am currently looking into it. It's reasonable to skip the next merge
window if a quick fix isn't found soon.
I am surprised by the size of the regression given the following:
1.12 ± 5% +1.4 2.50 ± 2% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
On Sat, Oct 21, 2023 at 01:42:58AM +0800, Yosry Ahmed wrote:
> On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <[email protected]> wrote:
> >
> > On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <[email protected]> wrote:
> > >
> > >
> > >
> > > Hello,
> > >
> > > kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
> > >
> > >
> > > commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
> > > url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
> > > base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> > > patch link: https://lore.kernel.org/all/[email protected]/
> > > patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
> > >
> > > testcase: will-it-scale
> > > test machine: 104 threads 2 sockets (Skylake) with 192G memory
> > > parameters:
> > >
> > > nr_task: 100%
> > > mode: thread
> > > test: fallocate1
> > > cpufreq_governor: performance
> > >
> > >
> > > In addition to that, the commit also has significant impact on the following tests:
> > >
> > > +------------------+---------------------------------------------------------------+
> > > | testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
> > > | test machine | 104 threads 2 sockets (Skylake) with 192G memory |
> > > | test parameters | cpufreq_governor=performance |
> > > | | mode=thread |
> > > | | nr_task=50% |
> > > | | test=fallocate1 |
> > > +------------------+---------------------------------------------------------------+
> > >
> >
> > Yosry, I don't think 25% to 30% regression can be ignored. Unless
> > there is a quick fix, IMO this series should be skipped for the
> > upcoming kernel open window.
>
> I am currently looking into it. It's reasonable to skip the next merge
> window if a quick fix isn't found soon.
>
> I am surprised by the size of the regression given the following:
> 1.12 ± 5% +1.4 2.50 ± 2% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
>
> IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
Yes, this is kind of confusing. We have seen similar cases before,
especially for microbenchmarks like will-it-scale, stress-ng, netperf,
etc.: changes to those hot-path functions get greatly amplified in the
final benchmark score.
In a netperf case, https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
the affected functions showed only around a 10% change in perf's cpu-cycles,
yet triggered a 69% regression. IIRC, microbenchmarks are very sensitive
to those statistics updates, like memcg's and vmstat's.
Thanks,
Feng
On Sun, Oct 22, 2023 at 6:34 PM Feng Tang <[email protected]> wrote:
>
> On Sat, Oct 21, 2023 at 01:42:58AM +0800, Yosry Ahmed wrote:
> > On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > Hello,
> > > >
> > > > kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
> > > >
> > > >
> > > > commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
> > > > url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
> > > > base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> > > > patch link: https://lore.kernel.org/all/[email protected]/
> > > > patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
> > > >
> > > > testcase: will-it-scale
> > > > test machine: 104 threads 2 sockets (Skylake) with 192G memory
> > > > parameters:
> > > >
> > > > nr_task: 100%
> > > > mode: thread
> > > > test: fallocate1
> > > > cpufreq_governor: performance
> > > >
> > > >
> > > > In addition to that, the commit also has significant impact on the following tests:
> > > >
> > > > +------------------+---------------------------------------------------------------+
> > > > | testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
> > > > | test machine | 104 threads 2 sockets (Skylake) with 192G memory |
> > > > | test parameters | cpufreq_governor=performance |
> > > > | | mode=thread |
> > > > | | nr_task=50% |
> > > > | | test=fallocate1 |
> > > > +------------------+---------------------------------------------------------------+
> > > >
> > >
> > > Yosry, I don't think 25% to 30% regression can be ignored. Unless
> > > there is a quick fix, IMO this series should be skipped for the
> > > upcoming kernel open window.
> >
> > I am currently looking into it. It's reasonable to skip the next merge
> > window if a quick fix isn't found soon.
> >
> > I am surprised by the size of the regression given the following:
> > 1.12 ± 5% +1.4 2.50 ± 2% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
> >
> > IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
>
> Yes, this is kind of confusing. And we have seen similar cases before,
> espcially for micro benchmark like will-it-scale, stressng, netperf
> etc, the change to those functions in hot path was greatly amplified
> in the final benchmark score.
>
> In a netperf case, https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
> the affected functions have around 10% change in perf's cpu-cycles,
> and trigger 69% regression. IIRC, micro benchmarks are very sensitive
> to those statistics update, like memcg's and vmstat.
>
Thanks for clarifying. I am still trying to reproduce locally but I am
running into some quirks with tooling. I may have to run a modified
version of the fallocate test manually. Meanwhile, I noticed that the
patch was tested without the fixlet that I posted [1] for it,
understandably. Would it be possible to get some numbers with that
fixlet? It should reduce the total number of contended atomic
operations, so it may help.
[1]https://lore.kernel.org/lkml/CAJD7tkZDarDn_38ntFg5bK2fAmFdSe+Rt6DKOZA7Sgs_kERoVA@mail.gmail.com/
I am also wondering if aligning the stats_updates atomic will help.
Right now it may share a cacheline with some items of the
events_pending array. The latter may be dirtied during a flush and
unnecessarily dirty the former, but the chances are slim to be honest.
If it's easy to test such a diff, that would be nice, but I don't
expect a lot of difference:
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7cbc7d94eb65..a35fce653262 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -646,7 +646,7 @@ struct memcg_vmstats {
unsigned long events_pending[NR_MEMCG_EVENTS];
/* Stats updates since the last flush */
- atomic64_t stats_updates;
+ atomic64_t stats_updates ____cacheline_aligned_in_smp;
};
/*
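(Purely as an illustration of the false-sharing concern above, and not kernel code: below is a minimal
userspace sketch in which two threads bump adjacent counters, first packed on one cache line and then
padded apart. The struct and field names are made up for the example; the kernel-side equivalent of the
padding is the ____cacheline_aligned_in_smp annotation in the diff.)

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <time.h>

#define ITERS 50000000L

struct packed_ctrs {                    /* both counters likely share a cache line */
        atomic_long a;                  /* think: stats_updates */
        atomic_long b;                  /* think: a neighbouring events_pending slot */
};

struct padded_ctrs {                    /* each counter gets its own cache line */
        atomic_long a __attribute__((aligned(64)));
        atomic_long b __attribute__((aligned(64)));
};

static void *bump(void *p)
{
        atomic_long *v = p;

        for (long i = 0; i < ITERS; i++)
                atomic_fetch_add_explicit(v, 1, memory_order_relaxed);
        return NULL;
}

static double run(atomic_long *x, atomic_long *y)
{
        struct timespec t0, t1;
        pthread_t th1, th2;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&th1, NULL, bump, x);
        pthread_create(&th2, NULL, bump, y);
        pthread_join(th1, NULL);
        pthread_join(th2, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
        static struct packed_ctrs p;
        static struct padded_ctrs q;

        printf("shared line: %.2fs\n", run(&p.a, &p.b));
        printf("padded:      %.2fs\n", run(&q.a, &q.b));
        return 0;
}

(Build with gcc -O2 -pthread. The padded variant is typically noticeably faster on x86; whether anything
similar shows up for the memcg structs is what the diff above is meant to test, and as noted the chances
are slim.)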
On Mon, Oct 23, 2023 at 11:25 AM Yosry Ahmed <[email protected]> wrote:
>
> On Sun, Oct 22, 2023 at 6:34 PM Feng Tang <[email protected]> wrote:
> >
> > On Sat, Oct 21, 2023 at 01:42:58AM +0800, Yosry Ahmed wrote:
> > > On Fri, Oct 20, 2023 at 10:23 AM Shakeel Butt <[email protected]> wrote:
> > > >
> > > > On Fri, Oct 20, 2023 at 9:18 AM kernel test robot <[email protected]> wrote:
> > > > >
> > > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > kernel test robot noticed a -25.8% regression of will-it-scale.per_thread_ops on:
> > > > >
> > > > >
> > > > > commit: 51d74c18a9c61e7ee33bc90b522dd7f6e5b80bb5 ("[PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg")
> > > > > url: https://github.com/intel-lab-lkp/linux/commits/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257
> > > > > base: https://git.kernel.org/cgit/linux/kernel/git/akpm/mm.git mm-everything
> > > > > patch link: https://lore.kernel.org/all/[email protected]/
> > > > > patch subject: [PATCH v2 3/5] mm: memcg: make stats flushing threshold per-memcg
> > > > >
> > > > > testcase: will-it-scale
> > > > > test machine: 104 threads 2 sockets (Skylake) with 192G memory
> > > > > parameters:
> > > > >
> > > > > nr_task: 100%
> > > > > mode: thread
> > > > > test: fallocate1
> > > > > cpufreq_governor: performance
> > > > >
> > > > >
> > > > > In addition to that, the commit also has significant impact on the following tests:
> > > > >
> > > > > +------------------+---------------------------------------------------------------+
> > > > > | testcase: change | will-it-scale: will-it-scale.per_thread_ops -30.0% regression |
> > > > > | test machine | 104 threads 2 sockets (Skylake) with 192G memory |
> > > > > | test parameters | cpufreq_governor=performance |
> > > > > | | mode=thread |
> > > > > | | nr_task=50% |
> > > > > | | test=fallocate1 |
> > > > > +------------------+---------------------------------------------------------------+
> > > > >
> > > >
> > > > Yosry, I don't think 25% to 30% regression can be ignored. Unless
> > > > there is a quick fix, IMO this series should be skipped for the
> > > > upcoming kernel open window.
> > >
> > > I am currently looking into it. It's reasonable to skip the next merge
> > > window if a quick fix isn't found soon.
> > >
> > > I am surprised by the size of the regression given the following:
> > > 1.12 ± 5% +1.4 2.50 ± 2% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
> > >
> > > IIUC we are only spending 1% more time in __mod_memcg_lruvec_state().
> >
> > Yes, this is kind of confusing. And we have seen similar cases before,
> > espcially for micro benchmark like will-it-scale, stressng, netperf
> > etc, the change to those functions in hot path was greatly amplified
> > in the final benchmark score.
> >
> > In a netperf case, https://lore.kernel.org/lkml/20220619150456.GB34471@xsang-OptiPlex-9020/
> > the affected functions have around 10% change in perf's cpu-cycles,
> > and trigger 69% regression. IIRC, micro benchmarks are very sensitive
> > to those statistics update, like memcg's and vmstat.
> >
>
> Thanks for clarifying. I am still trying to reproduce locally but I am
> running into some quirks with tooling. I may have to run a modified
> version of the fallocate test manually. Meanwhile, I noticed that the
> patch was tested without the fixlet that I posted [1] for it,
> understandably. Would it be possible to get some numbers with that
> fixlet? It should reduce the total number of contended atomic
> operations, so it may help.
>
> [1]https://lore.kernel.org/lkml/CAJD7tkZDarDn_38ntFg5bK2fAmFdSe+Rt6DKOZA7Sgs_kERoVA@mail.gmail.com/
>
> I am also wondering if aligning the stats_updates atomic will help.
> Right now it may share a cacheline with some items of the
> events_pending array. The latter may be dirtied during a flush and
> unnecessarily dirty the former, but the chances are slim to be honest.
> If it's easy to test such a diff, that would be nice, but I don't
> expect a lot of difference:
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7cbc7d94eb65..a35fce653262 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -646,7 +646,7 @@ struct memcg_vmstats {
> unsigned long events_pending[NR_MEMCG_EVENTS];
>
> /* Stats updates since the last flush */
> - atomic64_t stats_updates;
> + atomic64_t stats_updates ____cacheline_aligned_in_smp;
> };
>
> /*
I still could not run the benchmark, but I used a version of
fallocate1.c that does 1 million iterations. I ran 100 in parallel.
This showed ~13% regression with the patch, so not the same as the
will-it-scale version, but it could be an indicator.
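(For context, here is a rough standalone sketch of the kind of loop that was run: fallocate() a
tmpfs-backed file and immediately truncate it again, which keeps the page cache charge/uncharge and
memcg stat-update paths hot. The path, allocation size and iteration count are assumptions for
illustration, not the actual fallocate1.c source.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        /* tmpfs-backed file so shmem_fallocate()/shmem_undo_range() are exercised */
        char path[] = "/dev/shm/fallocate1-XXXXXX";
        int fd = mkstemp(path);
        long i;

        if (fd < 0) {
                perror("mkstemp");
                return 1;
        }
        unlink(path);

        for (i = 0; i < 1000000; i++) {
                /* allocate some pages, then punch them all back out */
                if (fallocate(fd, 0, 0, 128 * 1024))
                        perror("fallocate");
                if (ftruncate(fd, 0))
                        perror("ftruncate");
        }

        close(fd);
        return 0;
}

(Running ~100 of these in parallel from a shell loop approximates the contention pattern; will-it-scale
additionally pins workers and reports ops/s, so absolute numbers will differ.)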
With that, I did not see any improvement with the fixlet above or
____cacheline_aligned_in_smp. So you can scratch that.
I did, however, see some improvement with reducing the indirection
layers by moving stats_updates directly into struct mem_cgroup. The
regression in my manual testing went down to 9%. Still not great, but
I am wondering how this reflects on the benchmark. If you're able to
test it that would be great, the diff is below. Meanwhile I am still
looking for other improvements that can be made.
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f64ac140083e..b4dfcd8b9cc1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -270,6 +270,9 @@ struct mem_cgroup {
CACHELINE_PADDING(_pad1_);
+ /* Stats updates since the last flush */
+ atomic64_t stats_updates;
+
/* memory.stat */
struct memcg_vmstats *vmstats;
@@ -309,6 +312,7 @@ struct mem_cgroup {
atomic_t moving_account;
struct task_struct *move_lock_task;
+ unsigned int __percpu *stats_updates_percpu;
struct memcg_vmstats_percpu __percpu *vmstats_percpu;
#ifdef CONFIG_CGROUP_WRITEBACK
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7cbc7d94eb65..e5d2f3d4d874 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -627,9 +627,6 @@ struct memcg_vmstats_percpu {
/* Cgroup1: threshold notifications & softlimit tree updates */
unsigned long nr_page_events;
unsigned long targets[MEM_CGROUP_NTARGETS];
-
- /* Stats updates since the last flush */
- unsigned int stats_updates;
};
struct memcg_vmstats {
@@ -644,9 +641,6 @@ struct memcg_vmstats {
/* Pending child counts during tree propagation */
long state_pending[MEMCG_NR_STAT];
unsigned long events_pending[NR_MEMCG_EVENTS];
-
- /* Stats updates since the last flush */
- atomic64_t stats_updates;
};
/*
@@ -695,14 +689,14 @@ static void memcg_stats_unlock(void)
static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
{
- return atomic64_read(&memcg->vmstats->stats_updates) >
+ return atomic64_read(&memcg->stats_updates) >
MEMCG_CHARGE_BATCH * num_online_cpus();
}
static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
int cpu = smp_processor_id();
- unsigned int x;
+ unsigned int *stats_updates_percpu;
if (!val)
return;
@@ -710,10 +704,10 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
cgroup_rstat_updated(memcg->css.cgroup, cpu);
for (; memcg; memcg = parent_mem_cgroup(memcg)) {
- x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
- abs(val));
+ stats_updates_percpu = this_cpu_ptr(memcg->stats_updates_percpu);
- if (x < MEMCG_CHARGE_BATCH)
+ *stats_updates_percpu += abs(val);
+ if (*stats_updates_percpu < MEMCG_CHARGE_BATCH)
continue;
/*
@@ -721,8 +715,8 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
* redundant. Avoid the overhead of the atomic update.
*/
if (!memcg_should_flush_stats(memcg))
- atomic64_add(x, &memcg->vmstats->stats_updates);
- __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
+ atomic64_add(*stats_updates_percpu, &memcg->stats_updates);
+ *stats_updates_percpu = 0;
}
}
@@ -5467,6 +5461,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
free_mem_cgroup_per_node_info(memcg, node);
kfree(memcg->vmstats);
free_percpu(memcg->vmstats_percpu);
+ free_percpu(memcg->stats_updates_percpu);
kfree(memcg);
}
@@ -5504,6 +5499,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
if (!memcg->vmstats_percpu)
goto fail;
+ memcg->stats_updates_percpu = alloc_percpu_gfp(unsigned int,
+ GFP_KERNEL_ACCOUNT);
+ if (!memcg->stats_updates_percpu)
+ goto fail;
+
for_each_node(node)
if (alloc_mem_cgroup_per_node_info(memcg, node))
goto fail;
@@ -5735,10 +5735,12 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
struct mem_cgroup *parent = parent_mem_cgroup(memcg);
struct memcg_vmstats_percpu *statc;
+ int *stats_updates_percpu;
long delta, delta_cpu, v;
int i, nid;
statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
+ stats_updates_percpu = per_cpu_ptr(memcg->stats_updates_percpu, cpu);
for (i = 0; i < MEMCG_NR_STAT; i++) {
/*
@@ -5826,10 +5828,10 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
}
}
}
- statc->stats_updates = 0;
+ *stats_updates_percpu = 0;
/* We are in a per-cpu loop here, only do the atomic write once */
- if (atomic64_read(&memcg->vmstats->stats_updates))
- atomic64_set(&memcg->vmstats->stats_updates, 0);
+ if (atomic64_read(&memcg->stats_updates))
+ atomic64_set(&memcg->stats_updates, 0);
}
#ifdef CONFIG_MMU
hi, Yosry Ahmed,
On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
...
>
> I still could not run the benchmark, but I used a version of
> fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> This showed ~13% regression with the patch, so not the same as the
> will-it-scale version, but it could be an indicator.
>
> With that, I did not see any improvement with the fixlet above or
> ___cacheline_aligned_in_smp. So you can scratch that.
>
> I did, however, see some improvement with reducing the indirection
> layers by moving stats_updates directly into struct mem_cgroup. The
> regression in my manual testing went down to 9%. Still not great, but
> I am wondering how this reflects on the benchmark. If you're able to
> test it that would be great, the diff is below. Meanwhile I am still
> looking for other improvements that can be made.
We applied the previous patch set as below:
c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything <---- the base our tool picked for the patch set
I tried to apply the patch below to either 51d74c18a9c61 or c5f50d8b23c79,
but it failed. Could you guide me on how to apply this patch?
Thanks
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index f64ac140083e..b4dfcd8b9cc1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -270,6 +270,9 @@ struct mem_cgroup {
>
> CACHELINE_PADDING(_pad1_);
>
> + /* Stats updates since the last flush */
> + atomic64_t stats_updates;
> +
> /* memory.stat */
> struct memcg_vmstats *vmstats;
>
> @@ -309,6 +312,7 @@ struct mem_cgroup {
> atomic_t moving_account;
> struct task_struct *move_lock_task;
>
> + unsigned int __percpu *stats_updates_percpu;
> struct memcg_vmstats_percpu __percpu *vmstats_percpu;
>
> #ifdef CONFIG_CGROUP_WRITEBACK
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7cbc7d94eb65..e5d2f3d4d874 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -627,9 +627,6 @@ struct memcg_vmstats_percpu {
> /* Cgroup1: threshold notifications & softlimit tree updates */
> unsigned long nr_page_events;
> unsigned long targets[MEM_CGROUP_NTARGETS];
> -
> - /* Stats updates since the last flush */
> - unsigned int stats_updates;
> };
>
> struct memcg_vmstats {
> @@ -644,9 +641,6 @@ struct memcg_vmstats {
> /* Pending child counts during tree propagation */
> long state_pending[MEMCG_NR_STAT];
> unsigned long events_pending[NR_MEMCG_EVENTS];
> -
> - /* Stats updates since the last flush */
> - atomic64_t stats_updates;
> };
>
> /*
> @@ -695,14 +689,14 @@ static void memcg_stats_unlock(void)
>
> static bool memcg_should_flush_stats(struct mem_cgroup *memcg)
> {
> - return atomic64_read(&memcg->vmstats->stats_updates) >
> + return atomic64_read(&memcg->stats_updates) >
> MEMCG_CHARGE_BATCH * num_online_cpus();
> }
>
> static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
> {
> int cpu = smp_processor_id();
> - unsigned int x;
> + unsigned int *stats_updates_percpu;
>
> if (!val)
> return;
> @@ -710,10 +704,10 @@ static inline void memcg_rstat_updated(struct
> mem_cgroup *memcg, int val)
> cgroup_rstat_updated(memcg->css.cgroup, cpu);
>
> for (; memcg; memcg = parent_mem_cgroup(memcg)) {
> - x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates,
> - abs(val));
> + stats_updates_percpu =
> this_cpu_ptr(memcg->stats_updates_percpu);
>
> - if (x < MEMCG_CHARGE_BATCH)
> + *stats_updates_percpu += abs(val);
> + if (*stats_updates_percpu < MEMCG_CHARGE_BATCH)
> continue;
>
> /*
> @@ -721,8 +715,8 @@ static inline void memcg_rstat_updated(struct
> mem_cgroup *memcg, int val)
> * redundant. Avoid the overhead of the atomic update.
> */
> if (!memcg_should_flush_stats(memcg))
> - atomic64_add(x, &memcg->vmstats->stats_updates);
> - __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0);
> + atomic64_add(*stats_updates_percpu,
> &memcg->stats_updates);
> + *stats_updates_percpu = 0;
> }
> }
>
> @@ -5467,6 +5461,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
> free_mem_cgroup_per_node_info(memcg, node);
> kfree(memcg->vmstats);
> free_percpu(memcg->vmstats_percpu);
> + free_percpu(memcg->stats_updates_percpu);
> kfree(memcg);
> }
>
> @@ -5504,6 +5499,11 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
> if (!memcg->vmstats_percpu)
> goto fail;
>
> + memcg->stats_updates_percpu = alloc_percpu_gfp(unsigned int,
> + GFP_KERNEL_ACCOUNT);
> + if (!memcg->stats_updates_percpu)
> + goto fail;
> +
> for_each_node(node)
> if (alloc_mem_cgroup_per_node_info(memcg, node))
> goto fail;
> @@ -5735,10 +5735,12 @@ static void mem_cgroup_css_rstat_flush(struct
> cgroup_subsys_state *css, int cpu)
> struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> struct memcg_vmstats_percpu *statc;
> + int *stats_updates_percpu;
> long delta, delta_cpu, v;
> int i, nid;
>
> statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
> + stats_updates_percpu = per_cpu_ptr(memcg->stats_updates_percpu, cpu);
>
> for (i = 0; i < MEMCG_NR_STAT; i++) {
> /*
> @@ -5826,10 +5828,10 @@ static void mem_cgroup_css_rstat_flush(struct
> cgroup_subsys_state *css, int cpu)
> }
> }
> }
> - statc->stats_updates = 0;
> + *stats_updates_percpu = 0;
> /* We are in a per-cpu loop here, only do the atomic write once */
> - if (atomic64_read(&memcg->vmstats->stats_updates))
> - atomic64_set(&memcg->vmstats->stats_updates, 0);
> + if (atomic64_read(&memcg->stats_updates))
> + atomic64_set(&memcg->stats_updates, 0);
> }
>
> #ifdef CONFIG_MMU
>
On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <[email protected]> wrote:
>
> hi, Yosry Ahmed,
>
> On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
>
> ...
>
> >
> > I still could not run the benchmark, but I used a version of
> > fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> > This showed ~13% regression with the patch, so not the same as the
> > will-it-scale version, but it could be an indicator.
> >
> > With that, I did not see any improvement with the fixlet above or
> > ___cacheline_aligned_in_smp. So you can scratch that.
> >
> > I did, however, see some improvement with reducing the indirection
> > layers by moving stats_updates directly into struct mem_cgroup. The
> > regression in my manual testing went down to 9%. Still not great, but
> > I am wondering how this reflects on the benchmark. If you're able to
> > test it that would be great, the diff is below. Meanwhile I am still
> > looking for other improvements that can be made.
>
> we applied previous patch-set as below:
>
> c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything <---- the base our tool picked for the patch set
>
> I tried to apply below patch to either 51d74c18a9c61 or c5f50d8b23c79,
> but failed. could you guide how to apply this patch?
> Thanks
>
Thanks for looking into this. I rebased the diff on top of
c5f50d8b23c79. Please find it attached.
hi, Yosry Ahmed,
On Tue, Oct 24, 2023 at 12:14:42AM -0700, Yosry Ahmed wrote:
> On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <[email protected]> wrote:
> >
> > hi, Yosry Ahmed,
> >
> > On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
> >
> > ...
> >
> > >
> > > I still could not run the benchmark, but I used a version of
> > > fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> > > This showed ~13% regression with the patch, so not the same as the
> > > will-it-scale version, but it could be an indicator.
> > >
> > > With that, I did not see any improvement with the fixlet above or
> > > ___cacheline_aligned_in_smp. So you can scratch that.
> > >
> > > I did, however, see some improvement with reducing the indirection
> > > layers by moving stats_updates directly into struct mem_cgroup. The
> > > regression in my manual testing went down to 9%. Still not great, but
> > > I am wondering how this reflects on the benchmark. If you're able to
> > > test it that would be great, the diff is below. Meanwhile I am still
> > > looking for other improvements that can be made.
> >
> > we applied previous patch-set as below:
> >
> > c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> > ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> > 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> > 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> > 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> > 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything <---- the base our tool picked for the patch set
> >
> > I tried to apply below patch to either 51d74c18a9c61 or c5f50d8b23c79,
> > but failed. could you guide how to apply this patch?
> > Thanks
> >
>
> Thanks for looking into this. I rebased the diff on top of
> c5f50d8b23c79. Please find it attached.
From our tests, this patch has little impact.
It was applied as ac6a9444dec85 below:
ac6a9444dec85 (linux-devel/fixup-c5f50d8b23c79) memcg: move stats_updates to struct mem_cgroup
c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything
For the first regression reported in the original report, the data are very close
for 51d74c18a9c61, c5f50d8b23c79 (patch-set tip, parent of ac6a9444dec85),
and ac6a9444dec85.
The full comparison is in [1].
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \
36509 -25.8% 27079 -25.2% 27305 -25.0% 27383 will-it-scale.per_thread_ops
For the second regression reported in the original report, there seems to be a small
impact from ac6a9444dec85.
The full comparison is in [2].
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \
76580 -30.0% 53575 -28.9% 54415 -26.7% 56152 will-it-scale.per_thread_ops
[1]
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \
2.09 -0.5 1.61 ± 2% -0.5 1.61 -0.5 1.60 mpstat.cpu.all.usr%
3324 -10.0% 2993 +3.6% 3444 ± 20% -6.2% 3118 ± 4% vmstat.system.cs
120.83 ± 11% +79.6% 217.00 ± 9% +105.8% 248.67 ± 10% +115.2% 260.00 ± 10% perf-c2c.DRAM.local
594.50 ± 6% +43.8% 854.83 ± 5% +56.6% 931.17 ± 10% +21.2% 720.67 ± 7% perf-c2c.DRAM.remote
-16.64 +39.7% -23.25 +177.3% -46.14 +13.9% -18.94 sched_debug.cpu.nr_uninterruptible.min
6.59 ± 13% +6.5% 7.02 ± 11% +84.7% 12.18 ± 51% -6.6% 6.16 ± 10% sched_debug.cpu.nr_uninterruptible.stddev
0.04 -20.8% 0.03 ± 11% -20.8% 0.03 ± 11% -25.0% 0.03 turbostat.IPC
27.58 +3.7% 28.59 +4.2% 28.74 +3.8% 28.63 turbostat.RAMWatt
71000 ± 68% +66.4% 118174 ± 60% -49.8% 35634 ± 13% -59.9% 28485 ± 10% numa-meminfo.node0.AnonHugePages
1056 -100.0% 0.00 +1.9% 1076 -12.6% 923.33 ± 44% numa-meminfo.node0.Inactive(file)
6.67 ±141% +15799.3% 1059 -100.0% 0.00 +2669.8% 184.65 ±223% numa-meminfo.node1.Inactive(file)
3797041 -25.8% 2816352 -25.2% 2839803 -25.0% 2847955 will-it-scale.104.threads
36509 -25.8% 27079 -25.2% 27305 -25.0% 27383 will-it-scale.per_thread_ops
3797041 -25.8% 2816352 -25.2% 2839803 -25.0% 2847955 will-it-scale.workload
1.142e+09 -26.2% 8.437e+08 -26.6% 8.391e+08 -25.7% 8.489e+08 numa-numastat.node0.local_node
1.143e+09 -26.1% 8.439e+08 -26.6% 8.392e+08 -25.7% 8.491e+08 numa-numastat.node0.numa_hit
1.148e+09 -25.4% 8.563e+08 ± 2% -23.7% 8.756e+08 ± 2% -24.2% 8.702e+08 numa-numastat.node1.local_node
1.149e+09 -25.4% 8.564e+08 ± 2% -23.8% 8.758e+08 ± 2% -24.2% 8.707e+08 numa-numastat.node1.numa_hit
10842 +0.9% 10941 +2.9% 11153 ± 2% +0.3% 10873 proc-vmstat.nr_mapped
32933 -2.6% 32068 +0.1% 32956 ± 2% -1.5% 32450 ± 2% proc-vmstat.nr_slab_reclaimable
2.291e+09 -25.8% 1.7e+09 -25.1% 1.715e+09 -24.9% 1.72e+09 proc-vmstat.numa_hit
2.291e+09 -25.8% 1.7e+09 -25.1% 1.715e+09 -25.0% 1.719e+09 proc-vmstat.numa_local
2.29e+09 -25.8% 1.699e+09 -25.1% 1.714e+09 -24.9% 1.718e+09 proc-vmstat.pgalloc_normal
2.289e+09 -25.8% 1.699e+09 -25.1% 1.714e+09 -24.9% 1.718e+09 proc-vmstat.pgfree
199.33 -100.0% 0.00 -0.3% 198.66 -16.4% 166.67 ± 44% numa-vmstat.node0.nr_active_file
264.00 -100.0% 0.00 +1.9% 269.00 -12.6% 230.83 ± 44% numa-vmstat.node0.nr_inactive_file
199.33 -100.0% 0.00 -0.3% 198.66 -16.4% 166.67 ± 44% numa-vmstat.node0.nr_zone_active_file
264.00 -100.0% 0.00 +1.9% 269.00 -12.6% 230.83 ± 44% numa-vmstat.node0.nr_zone_inactive_file
1.143e+09 -26.1% 8.439e+08 -26.6% 8.392e+08 -25.7% 8.491e+08 numa-vmstat.node0.numa_hit
1.142e+09 -26.2% 8.437e+08 -26.6% 8.391e+08 -25.7% 8.489e+08 numa-vmstat.node0.numa_local
1.67 ±141% +15799.3% 264.99 -100.0% 0.00 +2669.8% 46.16 ±223% numa-vmstat.node1.nr_inactive_file
1.67 ±141% +15799.3% 264.99 -100.0% 0.00 +2669.8% 46.16 ±223% numa-vmstat.node1.nr_zone_inactive_file
1.149e+09 -25.4% 8.564e+08 ± 2% -23.8% 8.758e+08 ± 2% -24.2% 8.707e+08 numa-vmstat.node1.numa_hit
1.148e+09 -25.4% 8.563e+08 ± 2% -23.7% 8.756e+08 ± 2% -24.2% 8.702e+08 numa-vmstat.node1.numa_local
0.04 ±108% -76.2% 0.01 ± 23% +154.8% 0.10 ± 34% +110.0% 0.08 ± 88% perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
1.00 ± 93% +154.2% 2.55 ± 16% +133.4% 2.34 ± 39% +174.6% 2.76 ± 22% perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
0.71 ±131% -91.3% 0.06 ± 74% +184.4% 2.02 ± 40% +122.6% 1.58 ± 98% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
1.84 ± 45% +35.2% 2.48 ± 31% +66.1% 3.05 ± 25% +61.9% 2.98 ± 10% perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
191.10 ± 2% +18.0% 225.55 ± 2% +18.9% 227.22 ± 4% +19.8% 228.89 ± 4% perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
3484 -7.8% 3211 ± 6% -7.3% 3230 ± 7% -11.0% 3101 ± 3% perf-sched.wait_and_delay.count.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
385.50 ± 14% +39.6% 538.17 ± 12% +104.5% 788.17 ± 54% +30.9% 504.67 ± 41% perf-sched.wait_and_delay.count.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
3784 -7.5% 3500 ± 6% -7.1% 3516 ± 2% -10.6% 3383 ± 4% perf-sched.wait_and_delay.count.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
118.67 ± 11% -62.6% 44.33 ±100% -45.9% 64.17 ± 71% -64.9% 41.67 ±100% perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
5043 ± 2% -13.0% 4387 ± 6% -14.7% 4301 ± 3% -16.5% 4210 ± 4% perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
167.12 ±222% +200.1% 501.48 ± 99% +2.9% 171.99 ±215% +399.7% 835.05 ± 44% perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode.do_syscall_64
2.17 ± 21% +8.9% 2.36 ± 16% +94.3% 4.21 ± 36% +40.4% 3.04 ± 21% perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
191.09 ± 2% +18.0% 225.53 ± 2% +18.9% 227.21 ± 4% +19.8% 228.88 ± 4% perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
293.46 ± 4% +12.8% 330.98 ± 6% +21.0% 355.18 ± 16% +7.1% 314.31 ± 26% perf-sched.wait_time.max.ms.__cond_resched.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
30.33 ±105% -35.1% 19.69 ±138% +494.1% 180.18 ± 79% +135.5% 71.43 ± 76% perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
0.59 ± 3% +125.2% 1.32 ± 2% +139.3% 1.41 +128.6% 1.34 perf-stat.i.MPKI
9.027e+09 -17.9% 7.408e+09 -17.5% 7.446e+09 -17.3% 7.465e+09 perf-stat.i.branch-instructions
0.64 -0.0 0.60 -0.0 0.60 -0.0 0.60 perf-stat.i.branch-miss-rate%
58102855 -23.3% 44580037 ± 2% -23.4% 44524712 ± 2% -22.9% 44801374 perf-stat.i.branch-misses
15.28 +7.0 22.27 +7.9 23.14 +7.2 22.50 perf-stat.i.cache-miss-rate%
25155306 ± 2% +82.7% 45953601 ± 3% +95.2% 49105558 ± 2% +87.7% 47212483 perf-stat.i.cache-misses
1.644e+08 +25.4% 2.062e+08 ± 2% +29.0% 2.12e+08 +27.6% 2.098e+08 perf-stat.i.cache-references
3258 -10.3% 2921 +2.5% 3341 ± 19% -6.7% 3041 ± 4% perf-stat.i.context-switches
6.73 +23.3% 8.30 +22.7% 8.26 +21.8% 8.20 perf-stat.i.cpi
145.97 -1.3% 144.13 -1.4% 143.89 -1.2% 144.29 perf-stat.i.cpu-migrations
11519 ± 3% -45.4% 6293 ± 3% -48.9% 5892 ± 2% -46.9% 6118 perf-stat.i.cycles-between-cache-misses
0.04 -0.0 0.03 -0.0 0.03 -0.0 0.03 perf-stat.i.dTLB-load-miss-rate%
3921408 -25.3% 2929564 -24.6% 2957991 -24.5% 2961168 perf-stat.i.dTLB-load-misses
1.098e+10 -18.1% 8.993e+09 -17.6% 9.045e+09 -16.3% 9.185e+09 perf-stat.i.dTLB-loads
0.00 ± 2% +0.0 0.00 ± 4% +0.0 0.00 ± 5% +0.0 0.00 ± 3% perf-stat.i.dTLB-store-miss-rate%
5.606e+09 -23.2% 4.304e+09 -22.6% 4.338e+09 -22.4% 4.349e+09 perf-stat.i.dTLB-stores
95.65 -1.2 94.49 -0.9 94.74 -0.8 94.87 perf-stat.i.iTLB-load-miss-rate%
3876741 -25.0% 2905764 -24.8% 2915184 -25.0% 2909099 perf-stat.i.iTLB-load-misses
4.286e+10 -18.9% 3.477e+10 -18.4% 3.496e+10 -17.9% 3.517e+10 perf-stat.i.instructions
11061 +8.2% 11969 +8.4% 11996 +9.3% 12091 perf-stat.i.instructions-per-iTLB-miss
0.15 -18.9% 0.12 -18.5% 0.12 -17.9% 0.12 perf-stat.i.ipc
0.01 ± 96% -8.9% 0.01 ± 96% +72.3% 0.01 ± 73% +174.6% 0.02 ± 32% perf-stat.i.major-faults
48.65 ± 2% +46.2% 71.11 ± 2% +57.0% 76.37 ± 2% +45.4% 70.72 perf-stat.i.metric.K/sec
247.84 -18.9% 201.05 -18.4% 202.30 -17.7% 203.92 perf-stat.i.metric.M/sec
89.33 +0.5 89.79 -0.7 88.67 -2.1 87.23 perf-stat.i.node-load-miss-rate%
3138385 ± 2% +77.7% 5578401 ± 2% +89.9% 5958861 ± 2% +70.9% 5363943 perf-stat.i.node-load-misses
375827 ± 3% +69.2% 635857 ± 11% +102.6% 761334 ± 4% +109.3% 786773 ± 5% perf-stat.i.node-loads
1343194 -26.8% 983668 -22.6% 1039799 ± 2% -23.6% 1026076 perf-stat.i.node-store-misses
51550 ± 3% -19.0% 41748 ± 7% -22.5% 39954 ± 4% -20.6% 40921 ± 7% perf-stat.i.node-stores
0.59 ± 3% +125.1% 1.32 ± 2% +139.2% 1.40 +128.7% 1.34 perf-stat.overall.MPKI
0.64 -0.0 0.60 -0.0 0.60 -0.0 0.60 perf-stat.overall.branch-miss-rate%
15.30 +7.0 22.28 +7.9 23.16 +7.2 22.50 perf-stat.overall.cache-miss-rate%
6.73 +23.3% 8.29 +22.6% 8.25 +21.9% 8.20 perf-stat.overall.cpi
11470 ± 2% -45.3% 6279 ± 2% -48.8% 5875 ± 2% -46.7% 6108 perf-stat.overall.cycles-between-cache-misses
0.04 -0.0 0.03 -0.0 0.03 -0.0 0.03 perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 2% +0.0 0.00 ± 4% +0.0 0.00 ± 5% +0.0 0.00 ± 4% perf-stat.overall.dTLB-store-miss-rate%
95.56 -1.4 94.17 -1.0 94.56 -0.9 94.66 perf-stat.overall.iTLB-load-miss-rate%
11059 +8.2% 11967 +8.5% 11994 +9.3% 12091 perf-stat.overall.instructions-per-iTLB-miss
0.15 -18.9% 0.12 -18.4% 0.12 -17.9% 0.12 perf-stat.overall.ipc
89.29 +0.5 89.78 -0.6 88.67 -2.1 87.20 perf-stat.overall.node-load-miss-rate%
3396437 +9.5% 3718021 +9.1% 3705386 +9.6% 3721307 perf-stat.overall.path-length
8.997e+09 -17.9% 7.383e+09 -17.5% 7.421e+09 -17.3% 7.44e+09 perf-stat.ps.branch-instructions
57910417 -23.3% 44426577 ± 2% -23.4% 44376780 ± 2% -22.9% 44649215 perf-stat.ps.branch-misses
25075498 ± 2% +82.7% 45803186 ± 3% +95.2% 48942749 ± 2% +87.7% 47057228 perf-stat.ps.cache-misses
1.639e+08 +25.4% 2.056e+08 ± 2% +28.9% 2.113e+08 +27.6% 2.091e+08 perf-stat.ps.cache-references
3247 -10.3% 2911 +2.5% 3329 ± 19% -6.7% 3030 ± 4% perf-stat.ps.context-switches
145.47 -1.3% 143.61 -1.4% 143.38 -1.2% 143.70 perf-stat.ps.cpu-migrations
3908900 -25.3% 2920218 -24.6% 2949112 -24.5% 2951633 perf-stat.ps.dTLB-load-misses
1.094e+10 -18.1% 8.963e+09 -17.6% 9.014e+09 -16.3% 9.154e+09 perf-stat.ps.dTLB-loads
5.587e+09 -23.2% 4.289e+09 -22.6% 4.324e+09 -22.4% 4.335e+09 perf-stat.ps.dTLB-stores
3863663 -25.0% 2895895 -24.8% 2905355 -25.0% 2899323 perf-stat.ps.iTLB-load-misses
4.272e+10 -18.9% 3.466e+10 -18.4% 3.484e+10 -17.9% 3.505e+10 perf-stat.ps.instructions
3128132 ± 2% +77.7% 5559939 ± 2% +89.9% 5938929 ± 2% +70.9% 5346027 perf-stat.ps.node-load-misses
375403 ± 3% +69.0% 634300 ± 11% +102.3% 759484 ± 4% +109.1% 784913 ± 5% perf-stat.ps.node-loads
1338688 -26.8% 980311 -22.6% 1036279 ± 2% -23.6% 1022618 perf-stat.ps.node-store-misses
51546 ± 3% -19.1% 41692 ± 7% -22.6% 39921 ± 4% -20.7% 40875 ± 7% perf-stat.ps.node-stores
1.29e+13 -18.8% 1.047e+13 -18.4% 1.052e+13 -17.8% 1.06e+13 perf-stat.total.instructions
0.96 -0.3 0.70 ± 2% -0.3 0.70 ± 2% -0.3 0.70 perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.97 -0.3 0.72 -0.2 0.72 -0.2 0.72 perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
0.76 ± 2% -0.2 0.54 ± 3% -0.2 0.59 ± 3% -0.1 0.68 perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.82 -0.2 0.60 ± 2% -0.2 0.60 ± 2% -0.2 0.60 perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.91 -0.2 0.72 -0.2 0.72 -0.2 0.70 ± 2% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
51.50 -0.0 51.47 -0.5 50.99 -0.3 51.21 perf-profile.calltrace.cycles-pp.fallocate64
48.31 +0.0 48.35 +0.5 48.83 +0.3 48.61 perf-profile.calltrace.cycles-pp.ftruncate64
48.29 +0.0 48.34 +0.5 48.81 +0.3 48.60 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
48.28 +0.0 48.33 +0.5 48.80 +0.3 48.59 perf-profile.calltrace.cycles-pp.do_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
48.29 +0.1 48.34 +0.5 48.82 +0.3 48.60 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.ftruncate64
48.28 +0.1 48.33 +0.5 48.80 +0.3 48.58 perf-profile.calltrace.cycles-pp.do_truncate.do_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe.ftruncate64
48.27 +0.1 48.33 +0.5 48.80 +0.3 48.58 perf-profile.calltrace.cycles-pp.notify_change.do_truncate.do_sys_ftruncate.do_syscall_64.entry_SYSCALL_64_after_hwframe
48.27 +0.1 48.32 +0.5 48.80 +0.3 48.58 perf-profile.calltrace.cycles-pp.shmem_setattr.notify_change.do_truncate.do_sys_ftruncate.do_syscall_64
48.25 +0.1 48.31 +0.5 48.78 +0.3 48.57 perf-profile.calltrace.cycles-pp.shmem_undo_range.shmem_setattr.notify_change.do_truncate.do_sys_ftruncate
2.06 ± 2% +0.1 2.13 ± 2% +0.1 2.16 +0.0 2.09 perf-profile.calltrace.cycles-pp.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.68 +0.1 0.76 ± 2% +0.1 0.75 +0.1 0.74 perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.67 +0.1 1.77 +0.1 1.81 ± 2% +0.0 1.71 perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
45.76 +0.1 45.86 +0.5 46.29 +0.4 46.13 perf-profile.calltrace.cycles-pp.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change.do_truncate
1.78 ± 2% +0.1 1.92 ± 2% +0.2 1.95 +0.1 1.88 perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
0.69 ± 5% +0.1 0.84 ± 4% +0.2 0.86 ± 5% +0.1 0.79 ± 2% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
1.56 ± 2% +0.2 1.76 ± 2% +0.2 1.79 +0.2 1.71 perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
0.85 ± 4% +0.4 1.23 ± 2% +0.4 1.26 ± 3% +0.3 1.14 perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.78 ± 4% +0.4 1.20 ± 3% +0.4 1.22 +0.3 1.11 perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
0.73 ± 4% +0.4 1.17 ± 3% +0.5 1.19 ± 2% +0.4 1.08 perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
41.60 +0.7 42.30 +0.1 41.73 +0.5 42.06 perf-profile.calltrace.cycles-pp.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
41.50 +0.7 42.23 +0.2 41.66 +0.5 41.99 perf-profile.calltrace.cycles-pp.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
48.39 +0.8 49.14 +0.2 48.64 +0.5 48.89 perf-profile.calltrace.cycles-pp.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.00 +0.8 0.77 ± 4% +0.8 0.80 ± 2% +0.8 0.78 ± 2% perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
40.24 +0.8 41.03 +0.2 40.48 +0.6 40.80 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
40.22 +0.8 41.01 +0.2 40.47 +0.6 40.79 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
0.00 +0.8 0.79 ± 3% +0.8 0.82 ± 3% +0.8 0.79 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
40.19 +0.8 40.98 +0.3 40.44 +0.6 40.76 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru
1.33 ± 5% +0.8 2.13 ± 4% +0.9 2.21 ± 4% +0.8 2.09 ± 2% perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
48.16 +0.8 48.98 +0.3 48.48 +0.6 48.72 perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
0.00 +0.9 0.88 ± 2% +0.9 0.91 +0.9 0.86 perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
47.92 +0.9 48.81 +0.4 48.30 +0.6 48.56 perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
47.07 +0.9 48.01 +0.5 47.60 +0.7 47.79 perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
46.59 +1.1 47.64 +0.7 47.24 +0.8 47.44 perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
0.99 -0.3 0.73 ± 2% -0.3 0.74 -0.3 0.74 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.96 -0.3 0.70 ± 2% -0.3 0.70 ± 2% -0.3 0.71 perf-profile.children.cycles-pp.shmem_alloc_folio
0.78 ± 2% -0.2 0.56 ± 3% -0.2 0.61 ± 3% -0.1 0.69 ± 2% perf-profile.children.cycles-pp.shmem_inode_acct_blocks
0.83 -0.2 0.61 ± 2% -0.2 0.61 ± 2% -0.2 0.62 perf-profile.children.cycles-pp.alloc_pages_mpol
0.92 -0.2 0.73 -0.2 0.73 -0.2 0.71 ± 2% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.74 ± 2% -0.2 0.55 ± 2% -0.2 0.56 ± 2% -0.2 0.58 ± 3% perf-profile.children.cycles-pp.xas_store
0.67 -0.2 0.50 ± 3% -0.2 0.50 ± 2% -0.2 0.50 perf-profile.children.cycles-pp.__alloc_pages
0.43 -0.1 0.31 ± 2% -0.1 0.31 -0.1 0.31 perf-profile.children.cycles-pp.__entry_text_start
0.41 ± 2% -0.1 0.30 ± 3% -0.1 0.31 ± 2% -0.1 0.31 ± 2% perf-profile.children.cycles-pp.free_unref_page_list
0.35 -0.1 0.25 ± 2% -0.1 0.25 ± 2% -0.1 0.25 perf-profile.children.cycles-pp.xas_load
0.35 ± 2% -0.1 0.25 ± 4% -0.1 0.25 ± 2% -0.1 0.26 ± 2% perf-profile.children.cycles-pp.__mod_lruvec_state
0.39 -0.1 0.30 ± 2% -0.1 0.29 ± 3% -0.1 0.30 perf-profile.children.cycles-pp.get_page_from_freelist
0.27 ± 2% -0.1 0.19 ± 4% -0.1 0.19 ± 5% -0.1 0.19 ± 3% perf-profile.children.cycles-pp.__mod_node_page_state
0.32 ± 3% -0.1 0.24 ± 3% -0.1 0.25 -0.1 0.26 ± 4% perf-profile.children.cycles-pp.find_lock_entries
0.23 ± 2% -0.1 0.15 ± 4% -0.1 0.16 ± 3% -0.1 0.16 ± 5% perf-profile.children.cycles-pp.xas_descend
0.25 ± 3% -0.1 0.18 ± 3% -0.1 0.18 ± 3% -0.1 0.18 ± 2% perf-profile.children.cycles-pp.__dquot_alloc_space
0.28 ± 3% -0.1 0.20 ± 3% -0.1 0.21 ± 2% -0.1 0.20 ± 2% perf-profile.children.cycles-pp._raw_spin_lock
0.16 ± 3% -0.1 0.10 ± 5% -0.1 0.10 ± 4% -0.1 0.10 ± 4% perf-profile.children.cycles-pp.xas_find_conflict
0.26 ± 2% -0.1 0.20 ± 3% -0.1 0.19 ± 3% -0.1 0.19 perf-profile.children.cycles-pp.filemap_get_entry
0.26 -0.1 0.20 ± 2% -0.1 0.20 ± 4% -0.1 0.20 ± 2% perf-profile.children.cycles-pp.rmqueue
0.20 ± 3% -0.1 0.14 ± 3% -0.0 0.15 ± 3% -0.0 0.16 ± 3% perf-profile.children.cycles-pp.truncate_cleanup_folio
0.19 ± 5% -0.1 0.14 ± 4% -0.0 0.15 ± 5% -0.0 0.15 ± 4% perf-profile.children.cycles-pp.xas_clear_mark
0.17 ± 5% -0.0 0.12 ± 4% -0.0 0.12 ± 6% -0.0 0.13 ± 3% perf-profile.children.cycles-pp.xas_init_marks
0.15 ± 4% -0.0 0.10 ± 4% -0.0 0.10 ± 4% -0.0 0.11 ± 3% perf-profile.children.cycles-pp.free_unref_page_commit
0.15 ± 12% -0.0 0.10 ± 20% -0.1 0.10 ± 15% -0.1 0.10 ± 14% perf-profile.children.cycles-pp._raw_spin_lock_irq
51.56 -0.0 51.51 -0.5 51.03 -0.3 51.26 perf-profile.children.cycles-pp.fallocate64
0.18 ± 3% -0.0 0.14 ± 3% -0.0 0.13 ± 5% -0.0 0.14 ± 2% perf-profile.children.cycles-pp.__cond_resched
0.07 ± 5% -0.0 0.02 ± 99% -0.0 0.04 ± 44% -0.0 0.04 ± 44% perf-profile.children.cycles-pp.xas_find
0.13 ± 2% -0.0 0.09 -0.0 0.10 ± 5% -0.0 0.12 ± 4% perf-profile.children.cycles-pp.security_vm_enough_memory_mm
0.14 ± 4% -0.0 0.10 ± 7% -0.0 0.10 ± 6% -0.0 0.10 ± 3% perf-profile.children.cycles-pp.__fget_light
0.06 ± 6% -0.0 0.02 ± 99% -0.0 0.05 -0.0 0.05 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.12 ± 4% -0.0 0.08 ± 4% -0.0 0.08 ± 4% -0.0 0.08 perf-profile.children.cycles-pp.xas_start
0.08 ± 5% -0.0 0.05 -0.0 0.05 -0.0 0.05 ± 7% perf-profile.children.cycles-pp.__folio_throttle_swaprate
0.12 -0.0 0.08 ± 5% -0.0 0.08 ± 5% -0.0 0.08 ± 5% perf-profile.children.cycles-pp.folio_unlock
0.14 ± 3% -0.0 0.11 ± 3% -0.0 0.11 ± 4% -0.0 0.12 ± 3% perf-profile.children.cycles-pp.try_charge_memcg
0.12 ± 6% -0.0 0.08 ± 5% -0.0 0.09 ± 5% -0.0 0.09 ± 7% perf-profile.children.cycles-pp.free_unref_page_prepare
0.12 ± 3% -0.0 0.09 ± 4% -0.0 0.09 ± 7% -0.0 0.09 perf-profile.children.cycles-pp.noop_dirty_folio
0.20 ± 2% -0.0 0.17 ± 5% -0.0 0.18 -0.0 0.19 ± 2% perf-profile.children.cycles-pp.page_counter_uncharge
0.10 -0.0 0.07 ± 5% -0.0 0.08 ± 8% +0.0 0.10 ± 4% perf-profile.children.cycles-pp.cap_vm_enough_memory
0.09 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 7% perf-profile.children.cycles-pp._raw_spin_trylock
0.09 ± 5% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.07 ± 7% perf-profile.children.cycles-pp.inode_add_bytes
0.06 ± 6% -0.0 0.03 ± 70% -0.0 0.04 ± 44% -0.0 0.05 ± 7% perf-profile.children.cycles-pp.filemap_free_folio
0.06 ± 6% -0.0 0.03 ± 70% +0.0 0.07 ± 7% +0.1 0.14 ± 6% perf-profile.children.cycles-pp.percpu_counter_add_batch
0.12 ± 3% -0.0 0.10 ± 5% -0.0 0.09 ± 4% -0.0 0.09 perf-profile.children.cycles-pp.shmem_recalc_inode
0.12 ± 3% -0.0 0.09 ± 5% -0.0 0.09 ± 5% -0.0 0.10 ± 4% perf-profile.children.cycles-pp.__folio_cancel_dirty
0.09 ± 5% -0.0 0.07 ± 7% -0.0 0.09 ± 4% +0.1 0.16 ± 7% perf-profile.children.cycles-pp.__vm_enough_memory
0.08 ± 5% -0.0 0.06 -0.0 0.06 ± 6% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.security_file_permission
0.08 ± 5% -0.0 0.06 -0.0 0.06 ± 6% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.08 ± 6% -0.0 0.05 ± 7% -0.0 0.05 ± 8% -0.0 0.05 ± 7% perf-profile.children.cycles-pp.apparmor_file_permission
0.09 ± 4% -0.0 0.07 ± 8% -0.0 0.09 ± 8% -0.0 0.07 ± 6% perf-profile.children.cycles-pp.__percpu_counter_limited_add
0.08 ± 6% -0.0 0.06 ± 8% -0.0 0.06 -0.0 0.06 ± 6% perf-profile.children.cycles-pp.__list_add_valid_or_report
0.07 ± 8% -0.0 0.05 -0.0 0.05 ± 7% -0.0 0.06 ± 9% perf-profile.children.cycles-pp.get_pfnblock_flags_mask
0.14 ± 3% -0.0 0.12 ± 6% -0.0 0.12 ± 3% -0.0 0.13 ± 3% perf-profile.children.cycles-pp.cgroup_rstat_updated
0.07 ± 5% -0.0 0.05 -0.0 0.05 -0.0 0.05 perf-profile.children.cycles-pp.policy_nodemask
0.24 ± 2% -0.0 0.22 ± 2% -0.0 0.22 ± 2% -0.0 0.22 ± 2% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
0.08 -0.0 0.07 ± 7% -0.0 0.06 ± 6% -0.0 0.07 ± 6% perf-profile.children.cycles-pp.xas_create
0.08 ± 8% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.07 ± 7% perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
0.00 +0.0 0.00 +0.0 0.00 +0.1 0.08 ± 8% perf-profile.children.cycles-pp.__file_remove_privs
0.28 ± 2% +0.0 0.28 ± 4% +0.0 0.30 +0.0 0.30 perf-profile.children.cycles-pp.uncharge_batch
0.14 ± 5% +0.0 0.17 ± 4% +0.0 0.17 ± 2% +0.0 0.16 perf-profile.children.cycles-pp.uncharge_folio
0.43 +0.0 0.46 ± 4% +0.0 0.48 +0.0 0.47 perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
48.31 +0.0 48.35 +0.5 48.83 +0.3 48.61 perf-profile.children.cycles-pp.ftruncate64
48.28 +0.0 48.33 +0.5 48.80 +0.3 48.59 perf-profile.children.cycles-pp.do_sys_ftruncate
48.28 +0.1 48.33 +0.5 48.80 +0.3 48.58 perf-profile.children.cycles-pp.do_truncate
48.27 +0.1 48.33 +0.5 48.80 +0.3 48.58 perf-profile.children.cycles-pp.notify_change
48.27 +0.1 48.32 +0.5 48.80 +0.3 48.58 perf-profile.children.cycles-pp.shmem_setattr
48.26 +0.1 48.32 +0.5 48.79 +0.3 48.57 perf-profile.children.cycles-pp.shmem_undo_range
2.06 ± 2% +0.1 2.13 ± 2% +0.1 2.16 +0.0 2.10 perf-profile.children.cycles-pp.truncate_inode_folio
0.69 +0.1 0.78 +0.1 0.77 +0.1 0.76 perf-profile.children.cycles-pp.lru_add_fn
1.72 ± 2% +0.1 1.80 +0.1 1.83 ± 2% +0.0 1.74 perf-profile.children.cycles-pp.shmem_add_to_page_cache
45.77 +0.1 45.86 +0.5 46.29 +0.4 46.13 perf-profile.children.cycles-pp.__folio_batch_release
1.79 ± 2% +0.1 1.93 ± 2% +0.2 1.96 +0.1 1.88 perf-profile.children.cycles-pp.filemap_remove_folio
0.13 ± 5% +0.1 0.28 +0.1 0.19 ± 5% +0.1 0.24 ± 2% perf-profile.children.cycles-pp.file_modified
0.69 ± 5% +0.1 0.84 ± 3% +0.2 0.86 ± 5% +0.1 0.79 ± 2% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
0.09 ± 7% +0.2 0.24 ± 2% +0.1 0.15 ± 3% +0.0 0.14 ± 4% perf-profile.children.cycles-pp.inode_needs_update_time
1.58 ± 3% +0.2 1.77 ± 2% +0.2 1.80 +0.1 1.72 perf-profile.children.cycles-pp.__filemap_remove_folio
0.15 ± 3% +0.4 0.50 ± 3% +0.4 0.52 ± 2% +0.4 0.52 ± 2% perf-profile.children.cycles-pp.__count_memcg_events
0.79 ± 4% +0.4 1.20 ± 3% +0.4 1.22 +0.3 1.12 perf-profile.children.cycles-pp.filemap_unaccount_folio
0.36 ± 5% +0.4 0.77 ± 4% +0.4 0.81 ± 2% +0.4 0.78 ± 2% perf-profile.children.cycles-pp.mem_cgroup_commit_charge
98.33 +0.5 98.78 +0.4 98.77 +0.4 98.77 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
97.74 +0.6 98.34 +0.6 98.32 +0.6 98.33 perf-profile.children.cycles-pp.do_syscall_64
41.62 +0.7 42.33 +0.1 41.76 +0.5 42.08 perf-profile.children.cycles-pp.folio_add_lru
43.91 +0.7 44.64 +0.2 44.09 +0.5 44.40 perf-profile.children.cycles-pp.folio_batch_move_lru
48.39 +0.8 49.15 +0.2 48.64 +0.5 48.89 perf-profile.children.cycles-pp.__x64_sys_fallocate
1.34 ± 5% +0.8 2.14 ± 4% +0.9 2.22 ± 4% +0.8 2.10 ± 2% perf-profile.children.cycles-pp.__mem_cgroup_charge
1.61 ± 4% +0.8 2.42 ± 2% +0.9 2.47 ± 2% +0.6 2.24 perf-profile.children.cycles-pp.__mod_lruvec_page_state
48.17 +0.8 48.98 +0.3 48.48 +0.6 48.72 perf-profile.children.cycles-pp.vfs_fallocate
47.94 +0.9 48.82 +0.4 48.32 +0.6 48.56 perf-profile.children.cycles-pp.shmem_fallocate
47.10 +0.9 48.04 +0.5 47.64 +0.7 47.83 perf-profile.children.cycles-pp.shmem_get_folio_gfp
84.34 +0.9 85.28 +0.8 85.11 +0.9 85.28 perf-profile.children.cycles-pp.folio_lruvec_lock_irqsave
84.31 +0.9 85.26 +0.8 85.08 +0.9 85.26 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
84.24 +1.0 85.21 +0.8 85.04 +1.0 85.21 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
46.65 +1.1 47.70 +0.7 47.30 +0.8 47.48 perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
1.23 ± 4% +1.4 2.58 ± 2% +1.4 2.63 ± 2% +1.3 2.52 perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
0.98 -0.3 0.73 ± 2% -0.2 0.74 -0.2 0.74 perf-profile.self.cycles-pp.syscall_return_via_sysret
0.88 -0.2 0.70 -0.2 0.70 -0.2 0.69 ± 2% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.60 -0.2 0.45 -0.1 0.46 ± 2% -0.2 0.46 ± 3% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.41 ± 3% -0.1 0.27 ± 3% -0.1 0.27 ± 2% -0.1 0.28 ± 2% perf-profile.self.cycles-pp.release_pages
0.41 ± 3% -0.1 0.29 ± 2% -0.1 0.28 ± 3% -0.1 0.29 ± 2% perf-profile.self.cycles-pp.folio_batch_move_lru
0.41 -0.1 0.30 ± 3% -0.1 0.30 ± 2% -0.1 0.32 ± 4% perf-profile.self.cycles-pp.xas_store
0.30 ± 3% -0.1 0.18 ± 5% -0.1 0.19 ± 2% -0.1 0.19 ± 2% perf-profile.self.cycles-pp.shmem_add_to_page_cache
0.38 ± 2% -0.1 0.27 ± 2% -0.1 0.27 ± 2% -0.1 0.27 perf-profile.self.cycles-pp.__entry_text_start
0.30 ± 3% -0.1 0.20 ± 6% -0.1 0.20 ± 5% -0.1 0.21 ± 2% perf-profile.self.cycles-pp.lru_add_fn
0.28 ± 2% -0.1 0.20 ± 5% -0.1 0.20 ± 2% -0.1 0.20 ± 3% perf-profile.self.cycles-pp.shmem_fallocate
0.26 ± 2% -0.1 0.18 ± 5% -0.1 0.18 ± 4% -0.1 0.19 ± 3% perf-profile.self.cycles-pp.__mod_node_page_state
0.27 ± 3% -0.1 0.20 ± 2% -0.1 0.20 ± 3% -0.1 0.20 ± 3% perf-profile.self.cycles-pp._raw_spin_lock
0.21 ± 2% -0.1 0.15 ± 4% -0.1 0.15 ± 4% -0.1 0.16 ± 2% perf-profile.self.cycles-pp.__alloc_pages
0.20 ± 2% -0.1 0.14 ± 3% -0.1 0.14 ± 2% -0.1 0.14 ± 5% perf-profile.self.cycles-pp.xas_descend
0.26 ± 3% -0.1 0.20 ± 4% -0.1 0.21 ± 3% -0.0 0.22 ± 4% perf-profile.self.cycles-pp.find_lock_entries
0.06 ± 6% -0.1 0.00 +0.0 0.06 ± 7% +0.1 0.13 ± 6% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.18 ± 4% -0.0 0.13 ± 5% -0.0 0.13 ± 3% -0.0 0.14 ± 4% perf-profile.self.cycles-pp.xas_clear_mark
0.15 ± 7% -0.0 0.10 ± 11% -0.0 0.11 ± 8% -0.0 0.10 ± 6% perf-profile.self.cycles-pp.shmem_inode_acct_blocks
0.13 ± 4% -0.0 0.09 ± 5% -0.0 0.08 ± 5% -0.0 0.09 perf-profile.self.cycles-pp.free_unref_page_commit
0.13 -0.0 0.09 ± 5% -0.0 0.09 ± 5% -0.0 0.09 ± 6% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.16 ± 4% -0.0 0.12 ± 4% -0.0 0.12 ± 3% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.__dquot_alloc_space
0.16 ± 4% -0.0 0.12 ± 4% -0.0 0.11 ± 6% -0.0 0.11 perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
0.13 ± 5% -0.0 0.09 ± 7% -0.0 0.09 -0.0 0.10 ± 7% perf-profile.self.cycles-pp.__filemap_remove_folio
0.13 ± 2% -0.0 0.09 ± 5% -0.0 0.09 ± 4% -0.0 0.09 ± 4% perf-profile.self.cycles-pp.get_page_from_freelist
0.06 ± 7% -0.0 0.02 ± 99% -0.0 0.02 ± 99% -0.0 0.02 ±141% perf-profile.self.cycles-pp.apparmor_file_permission
0.12 ± 4% -0.0 0.09 ± 5% -0.0 0.09 ± 5% -0.0 0.08 ± 8% perf-profile.self.cycles-pp.vfs_fallocate
0.13 ± 3% -0.0 0.10 ± 5% -0.0 0.10 ± 4% -0.0 0.10 ± 4% perf-profile.self.cycles-pp.fallocate64
0.11 ± 4% -0.0 0.07 -0.0 0.08 ± 6% -0.0 0.08 ± 6% perf-profile.self.cycles-pp.xas_start
0.07 ± 5% -0.0 0.03 ± 70% -0.0 0.04 ± 44% -0.1 0.02 ±141% perf-profile.self.cycles-pp.shmem_alloc_folio
0.14 ± 4% -0.0 0.10 ± 7% -0.0 0.10 ± 5% -0.0 0.10 ± 3% perf-profile.self.cycles-pp.__fget_light
0.10 ± 4% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.rmqueue
0.10 ± 4% -0.0 0.07 ± 8% -0.0 0.07 ± 5% -0.0 0.07 ± 5% perf-profile.self.cycles-pp.alloc_pages_mpol
0.12 ± 3% -0.0 0.09 ± 4% -0.0 0.09 ± 4% -0.0 0.09 ± 4% perf-profile.self.cycles-pp.xas_load
0.11 ± 4% -0.0 0.08 ± 7% -0.0 0.08 ± 5% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.folio_unlock
0.15 ± 2% -0.0 0.12 ± 5% -0.0 0.12 ± 4% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.shmem_get_folio_gfp
0.10 -0.0 0.07 -0.0 0.08 ± 7% +0.0 0.10 ± 4% perf-profile.self.cycles-pp.cap_vm_enough_memory
0.16 ± 2% -0.0 0.13 ± 6% -0.0 0.14 -0.0 0.14 perf-profile.self.cycles-pp.page_counter_uncharge
0.12 ± 5% -0.0 0.09 ± 4% -0.0 0.09 ± 7% -0.0 0.09 ± 5% perf-profile.self.cycles-pp.__cond_resched
0.06 ± 6% -0.0 0.03 ± 70% -0.0 0.04 ± 44% -0.0 0.05 perf-profile.self.cycles-pp.filemap_free_folio
0.12 -0.0 0.09 ± 4% -0.0 0.09 ± 4% -0.0 0.09 perf-profile.self.cycles-pp.noop_dirty_folio
0.12 ± 3% -0.0 0.10 ± 5% -0.0 0.10 ± 7% -0.0 0.10 ± 5% perf-profile.self.cycles-pp.free_unref_page_list
0.10 ± 3% -0.0 0.07 ± 5% -0.0 0.07 ± 5% -0.0 0.08 ± 6% perf-profile.self.cycles-pp.filemap_remove_folio
0.10 ± 5% -0.0 0.07 ± 5% -0.0 0.07 -0.0 0.08 ± 4% perf-profile.self.cycles-pp.try_charge_memcg
0.12 ± 3% -0.0 0.10 ± 8% -0.0 0.10 -0.0 0.10 ± 4% perf-profile.self.cycles-pp.cgroup_rstat_updated
0.09 ± 4% -0.0 0.07 ± 7% -0.0 0.07 ± 5% -0.0 0.07 ± 7% perf-profile.self.cycles-pp.__folio_cancel_dirty
0.08 ± 4% -0.0 0.06 ± 8% -0.0 0.06 ± 6% -0.0 0.06 ± 8% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.08 ± 5% -0.0 0.06 -0.0 0.06 -0.0 0.06 perf-profile.self.cycles-pp._raw_spin_trylock
0.08 -0.0 0.06 ± 6% -0.0 0.06 ± 8% -0.0 0.06 perf-profile.self.cycles-pp.folio_add_lru
0.07 ± 5% -0.0 0.05 -0.0 0.05 -0.0 0.04 ± 44% perf-profile.self.cycles-pp.xas_find_conflict
0.08 ± 8% -0.0 0.06 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 7% perf-profile.self.cycles-pp.__mod_lruvec_state
0.56 ± 6% -0.0 0.54 ± 9% -0.0 0.55 ± 5% -0.2 0.40 ± 3% perf-profile.self.cycles-pp.__mod_lruvec_page_state
0.08 ± 10% -0.0 0.06 ± 9% -0.0 0.06 -0.0 0.06 perf-profile.self.cycles-pp.truncate_cleanup_folio
0.07 ± 10% -0.0 0.05 -0.0 0.05 ± 7% -0.0 0.05 ± 8% perf-profile.self.cycles-pp.xas_init_marks
0.08 ± 4% -0.0 0.06 ± 7% +0.0 0.08 ± 4% -0.0 0.07 ± 10% perf-profile.self.cycles-pp.__percpu_counter_limited_add
0.07 ± 7% -0.0 0.05 -0.0 0.05 ± 7% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.get_pfnblock_flags_mask
0.07 ± 5% -0.0 0.06 ± 8% -0.0 0.06 ± 6% -0.0 0.05 ± 7% perf-profile.self.cycles-pp.__list_add_valid_or_report
0.07 ± 5% -0.0 0.06 ± 9% -0.0 0.06 ± 7% -0.0 0.06 perf-profile.self.cycles-pp.mem_cgroup_update_lru_size
0.08 ± 4% -0.0 0.07 ± 5% -0.0 0.06 -0.0 0.06 ± 6% perf-profile.self.cycles-pp.filemap_get_entry
0.00 +0.0 0.00 +0.0 0.00 +0.1 0.08 ± 8% perf-profile.self.cycles-pp.__file_remove_privs
0.14 ± 2% +0.0 0.16 ± 6% +0.0 0.17 ± 3% +0.0 0.16 perf-profile.self.cycles-pp.uncharge_folio
0.02 ±141% +0.0 0.06 ± 8% +0.0 0.06 +0.0 0.06 ± 9% perf-profile.self.cycles-pp.uncharge_batch
0.21 ± 9% +0.1 0.31 ± 7% +0.1 0.32 ± 5% +0.1 0.30 ± 4% perf-profile.self.cycles-pp.mem_cgroup_commit_charge
0.69 ± 5% +0.1 0.83 ± 4% +0.2 0.86 ± 5% +0.1 0.79 ± 2% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.06 ± 6% +0.2 0.22 ± 2% +0.1 0.13 ± 5% +0.1 0.11 ± 4% perf-profile.self.cycles-pp.inode_needs_update_time
0.14 ± 8% +0.3 0.42 ± 7% +0.3 0.44 ± 6% +0.3 0.40 ± 3% perf-profile.self.cycles-pp.__mem_cgroup_charge
0.13 ± 7% +0.4 0.49 ± 3% +0.4 0.51 ± 2% +0.4 0.51 ± 2% perf-profile.self.cycles-pp.__count_memcg_events
84.24 +1.0 85.21 +0.8 85.04 +1.0 85.21 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
1.12 ± 5% +1.4 2.50 ± 2% +1.4 2.55 ± 2% +1.3 2.43 perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
[2]
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
---------------- --------------------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev %change %stddev
\ | \ | \ | \
10544810 ± 11% +1.7% 10720938 ± 4% +1.7% 10719232 ± 4% +24.8% 13160448 meminfo.DirectMap2M
1.87 -0.4 1.43 ± 3% -0.4 1.47 ± 2% -0.4 1.46 mpstat.cpu.all.usr%
3171 -5.3% 3003 ± 2% +17.4% 3725 ± 30% +2.6% 3255 ± 5% vmstat.system.cs
93.97 ±130% +360.8% 433.04 ± 83% +5204.4% 4984 ±150% +1540.1% 1541 ± 56% boot-time.boot
6762 ±101% +96.3% 13275 ± 75% +3212.0% 223971 ±150% +752.6% 57655 ± 60% boot-time.idle
84.83 ± 9% +55.8% 132.17 ± 16% +75.6% 149.00 ± 11% +98.0% 168.00 ± 6% perf-c2c.DRAM.local
484.17 ± 3% +37.1% 663.67 ± 10% +44.1% 697.67 ± 7% -0.2% 483.00 ± 5% perf-c2c.DRAM.remote
72763 ± 5% +14.4% 83212 ± 12% +141.5% 175744 ± 83% +55.7% 113321 ± 21% turbostat.C1
0.08 -25.0% 0.06 -27.1% 0.06 ± 6% -25.0% 0.06 turbostat.IPC
27.90 +4.6% 29.18 +4.9% 29.27 +3.9% 29.00 turbostat.RAMWatt
3982212 -30.0% 2785941 -28.9% 2829631 -26.7% 2919929 will-it-scale.52.threads
76580 -30.0% 53575 -28.9% 54415 -26.7% 56152 will-it-scale.per_thread_ops
3982212 -30.0% 2785941 -28.9% 2829631 -26.7% 2919929 will-it-scale.workload
1.175e+09 ± 2% -28.6% 8.392e+08 ± 2% -28.2% 8.433e+08 ± 2% -25.4% 8.762e+08 numa-numastat.node0.local_node
1.175e+09 ± 2% -28.6% 8.394e+08 ± 2% -28.3% 8.434e+08 ± 2% -25.4% 8.766e+08 numa-numastat.node0.numa_hit
1.231e+09 ± 2% -31.3% 8.463e+08 ± 3% -29.5% 8.683e+08 ± 3% -27.7% 8.901e+08 numa-numastat.node1.local_node
1.232e+09 ± 2% -31.3% 8.466e+08 ± 3% -29.5% 8.688e+08 ± 3% -27.7% 8.907e+08 numa-numastat.node1.numa_hit
2.408e+09 -30.0% 1.686e+09 -28.9% 1.712e+09 -26.6% 1.767e+09 proc-vmstat.numa_hit
2.406e+09 -30.0% 1.685e+09 -28.9% 1.712e+09 -26.6% 1.766e+09 proc-vmstat.numa_local
2.404e+09 -29.9% 1.684e+09 -28.8% 1.71e+09 -26.6% 1.765e+09 proc-vmstat.pgalloc_normal
2.404e+09 -29.9% 1.684e+09 -28.8% 1.71e+09 -26.6% 1.765e+09 proc-vmstat.pgfree
2302080 -0.9% 2280448 -0.5% 2290432 -1.2% 2274688 proc-vmstat.unevictable_pgs_scanned
83444 ± 71% +34.2% 111978 ± 65% -9.1% 75877 ± 86% -76.2% 19883 ± 12% numa-meminfo.node0.AnonHugePages
150484 ± 55% +9.3% 164434 ± 46% -9.3% 136435 ± 53% -62.4% 56548 ± 18% numa-meminfo.node0.AnonPages
167427 ± 50% +8.2% 181159 ± 41% -8.3% 153613 ± 47% -56.1% 73487 ± 14% numa-meminfo.node0.Inactive
166720 ± 50% +8.7% 181159 ± 41% -8.3% 152902 ± 48% -56.6% 72379 ± 14% numa-meminfo.node0.Inactive(anon)
111067 ± 62% -13.7% 95819 ± 59% +14.6% 127294 ± 60% +86.1% 206693 ± 8% numa-meminfo.node1.AnonHugePages
179594 ± 47% -4.2% 172027 ± 43% +9.3% 196294 ± 39% +55.8% 279767 ± 3% numa-meminfo.node1.AnonPages
257406 ± 30% -2.1% 251990 ± 32% +9.9% 282766 ± 26% +42.2% 366131 ± 8% numa-meminfo.node1.AnonPages.max
196741 ± 43% -3.6% 189753 ± 39% +8.1% 212645 ± 36% +50.9% 296827 ± 3% numa-meminfo.node1.Inactive
196385 ± 43% -3.9% 188693 ± 39% +8.1% 212288 ± 36% +51.1% 296827 ± 3% numa-meminfo.node1.Inactive(anon)
37621 ± 55% +9.3% 41115 ± 46% -9.3% 34116 ± 53% -62.4% 14141 ± 18% numa-vmstat.node0.nr_anon_pages
41664 ± 50% +8.6% 45233 ± 41% -8.2% 38240 ± 47% -56.6% 18079 ± 14% numa-vmstat.node0.nr_inactive_anon
41677 ± 50% +8.6% 45246 ± 41% -8.2% 38250 ± 47% -56.6% 18092 ± 14% numa-vmstat.node0.nr_zone_inactive_anon
1.175e+09 ± 2% -28.6% 8.394e+08 ± 2% -28.3% 8.434e+08 ± 2% -25.4% 8.766e+08 numa-vmstat.node0.numa_hit
1.175e+09 ± 2% -28.6% 8.392e+08 ± 2% -28.2% 8.433e+08 ± 2% -25.4% 8.762e+08 numa-vmstat.node0.numa_local
44903 ± 47% -4.2% 43015 ± 43% +9.3% 49079 ± 39% +55.8% 69957 ± 3% numa-vmstat.node1.nr_anon_pages
49030 ± 43% -3.9% 47139 ± 39% +8.3% 53095 ± 36% +51.4% 74210 ± 3% numa-vmstat.node1.nr_inactive_anon
49035 ± 43% -3.9% 47135 ± 39% +8.3% 53098 ± 36% +51.3% 74212 ± 3% numa-vmstat.node1.nr_zone_inactive_anon
1.232e+09 ± 2% -31.3% 8.466e+08 ± 3% -29.5% 8.688e+08 ± 3% -27.7% 8.907e+08 numa-vmstat.node1.numa_hit
1.231e+09 ± 2% -31.3% 8.463e+08 ± 3% -29.5% 8.683e+08 ± 3% -27.7% 8.901e+08 numa-vmstat.node1.numa_local
5256095 ± 59% +557.5% 34561019 ± 89% +4549.1% 2.444e+08 ±146% +1646.7% 91810708 ± 50% sched_debug.cfs_rq:/.avg_vruntime.avg
8288083 ± 52% +365.0% 38543329 ± 81% +3020.3% 2.586e+08 ±145% +1133.9% 1.023e+08 ± 49% sched_debug.cfs_rq:/.avg_vruntime.max
1364475 ± 40% +26.7% 1728262 ± 29% +346.8% 6096205 ±118% +180.4% 3826288 ± 41% sched_debug.cfs_rq:/.avg_vruntime.stddev
161.62 ± 99% -42.4% 93.09 ±144% -57.3% 69.01 ± 74% -86.6% 21.73 ± 10% sched_debug.cfs_rq:/.load_avg.avg
902.70 ±107% -46.8% 480.28 ±171% -57.3% 385.28 ±120% -94.8% 47.03 ± 8% sched_debug.cfs_rq:/.load_avg.stddev
5256095 ± 59% +557.5% 34561019 ± 89% +4549.1% 2.444e+08 ±146% +1646.7% 91810708 ± 50% sched_debug.cfs_rq:/.min_vruntime.avg
8288083 ± 52% +365.0% 38543329 ± 81% +3020.3% 2.586e+08 ±145% +1133.9% 1.023e+08 ± 49% sched_debug.cfs_rq:/.min_vruntime.max
1364475 ± 40% +26.7% 1728262 ± 29% +346.8% 6096205 ±118% +180.4% 3826288 ± 41% sched_debug.cfs_rq:/.min_vruntime.stddev
31.84 ±161% -71.8% 8.98 ± 44% -84.0% 5.10 ± 43% -79.0% 6.68 ± 24% sched_debug.cfs_rq:/.removed.load_avg.avg
272.14 ±192% -84.9% 41.10 ± 29% -89.7% 28.08 ± 21% -87.8% 33.19 ± 12% sched_debug.cfs_rq:/.removed.load_avg.stddev
334.70 ± 17% +32.4% 443.13 ± 19% +34.3% 449.66 ± 11% +14.6% 383.66 ± 24% sched_debug.cfs_rq:/.util_est_enqueued.avg
322.95 ± 23% +12.5% 363.30 ± 19% +27.9% 412.92 ± 6% +11.2% 359.17 ± 18% sched_debug.cfs_rq:/.util_est_enqueued.stddev
240924 ± 52% +136.5% 569868 ± 62% +2031.9% 5136297 ±145% +600.7% 1688103 ± 51% sched_debug.cpu.clock.avg
240930 ± 52% +136.5% 569874 ± 62% +2031.9% 5136304 ±145% +600.7% 1688109 ± 51% sched_debug.cpu.clock.max
240917 ± 52% +136.5% 569861 ± 62% +2032.0% 5136290 ±145% +600.7% 1688095 ± 51% sched_debug.cpu.clock.min
239307 ± 52% +136.6% 566140 ± 62% +2009.9% 5049095 ±145% +600.7% 1676912 ± 51% sched_debug.cpu.clock_task.avg
239479 ± 52% +136.5% 566334 ± 62% +2014.9% 5064818 ±145% +600.4% 1677208 ± 51% sched_debug.cpu.clock_task.max
232462 ± 53% +140.6% 559281 ± 63% +2064.0% 5030381 ±146% +617.9% 1668793 ± 52% sched_debug.cpu.clock_task.min
683.22 ± 3% +0.7% 688.14 ± 4% +1762.4% 12724 ±138% +19.2% 814.55 ± 8% sched_debug.cpu.clock_task.stddev
3267 ± 57% +146.0% 8040 ± 63% +2127.2% 72784 ±146% +652.5% 24591 ± 52% sched_debug.cpu.curr->pid.avg
10463 ± 39% +101.0% 21030 ± 54% +1450.9% 162275 ±143% +448.5% 57391 ± 49% sched_debug.cpu.curr->pid.max
3373 ± 57% +149.1% 8403 ± 64% +2141.6% 75621 ±146% +657.7% 25561 ± 52% sched_debug.cpu.curr->pid.stddev
58697 ± 14% +1.6% 59612 ± 7% +1.9e+05% 1.142e+08 ±156% +105.4% 120565 ± 32% sched_debug.cpu.nr_switches.max
6023 ± 10% +13.6% 6843 ± 11% +2.9e+05% 17701514 ±151% +124.8% 13541 ± 32% sched_debug.cpu.nr_switches.stddev
240917 ± 52% +136.5% 569862 ± 62% +2032.0% 5136291 ±145% +600.7% 1688096 ± 51% sched_debug.cpu_clk
240346 ± 52% +136.9% 569288 ± 62% +2036.8% 5135723 ±145% +602.1% 1687529 ± 51% sched_debug.ktime
241481 ± 51% +136.2% 570443 ± 62% +2027.2% 5136856 ±145% +599.3% 1688672 ± 51% sched_debug.sched_clk
0.04 ± 9% -19.3% 0.03 ± 6% -19.7% 0.03 ± 6% -14.3% 0.03 ± 8% perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.04 ± 11% -18.0% 0.03 ± 13% -22.8% 0.03 ± 10% -14.0% 0.04 ± 15% perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.04 ± 8% -22.3% 0.03 ± 5% -19.4% 0.03 ± 3% -12.6% 0.04 ± 9% perf-sched.wait_and_delay.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.91 ± 2% +11.3% 1.01 ± 5% +65.3% 1.51 ± 53% +28.8% 1.17 ± 11% perf-sched.wait_and_delay.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.04 ± 13% -90.3% 0.00 ±223% -66.4% 0.01 ±101% -83.8% 0.01 ±223% perf-sched.wait_and_delay.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
24.11 ± 3% -8.5% 22.08 ± 11% -25.2% 18.04 ± 50% -29.5% 17.01 ± 21% perf-sched.wait_and_delay.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
1.14 +15.1% 1.31 -24.1% 0.86 ± 70% +13.7% 1.29 perf-sched.wait_and_delay.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
189.94 ± 3% +18.3% 224.73 ± 4% +20.3% 228.52 ± 3% +22.1% 231.82 ± 3% perf-sched.wait_and_delay.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1652 ± 4% -13.4% 1431 ± 4% -13.4% 1431 ± 2% -14.3% 1416 ± 6% perf-sched.wait_and_delay.count.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
1628 ± 8% -15.0% 1383 ± 9% -16.6% 1357 ± 2% -16.6% 1358 ± 7% perf-sched.wait_and_delay.count.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
83.67 ± 7% -87.6% 10.33 ±223% -59.2% 34.17 ±100% -85.5% 12.17 ±223% perf-sched.wait_and_delay.count.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
2835 ± 3% +10.6% 3135 ± 10% +123.8% 6345 ± 80% +48.4% 4207 ± 19% perf-sched.wait_and_delay.count.pipe_read.vfs_read.ksys_read.do_syscall_64
3827 ± 4% -13.0% 3328 ± 3% -12.9% 3335 ± 2% -14.7% 3264 ± 2% perf-sched.wait_and_delay.count.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.71 ±165% -83.4% 0.28 ± 21% -82.3% 0.30 ± 16% -74.6% 0.43 ± 60% perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.43 ± 17% -43.8% 0.24 ± 26% -44.4% 0.24 ± 27% -32.9% 0.29 ± 23% perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.46 ± 17% -36.7% 0.29 ± 12% -35.7% 0.30 ± 19% -35.3% 0.30 ± 21% perf-sched.wait_and_delay.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
45.41 ± 4% +13.4% 51.51 ± 12% +148.6% 112.88 ± 86% +56.7% 71.18 ± 21% perf-sched.wait_and_delay.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.30 ± 34% -90.7% 0.03 ±223% -66.0% 0.10 ±110% -88.2% 0.04 ±223% perf-sched.wait_and_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
2.39 +10.7% 2.65 ± 2% -24.3% 1.81 ± 70% +12.1% 2.68 ± 2% perf-sched.wait_and_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
0.04 ± 9% -19.3% 0.03 ± 6% -19.7% 0.03 ± 6% -14.3% 0.03 ± 8% perf-sched.wait_time.avg.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.04 ± 11% -18.0% 0.03 ± 13% -22.8% 0.03 ± 10% -14.0% 0.04 ± 15% perf-sched.wait_time.avg.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.04 ± 8% -22.3% 0.03 ± 5% -19.4% 0.03 ± 3% -12.6% 0.04 ± 9% perf-sched.wait_time.avg.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.04 ± 11% -33.1% 0.03 ± 17% -32.3% 0.03 ± 22% -16.3% 0.04 ± 12% perf-sched.wait_time.avg.ms.__cond_resched.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.90 ± 2% +11.5% 1.00 ± 5% +66.1% 1.50 ± 53% +29.2% 1.16 ± 11% perf-sched.wait_time.avg.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
0.04 ± 13% -26.6% 0.03 ± 12% -33.6% 0.03 ± 11% -18.1% 0.04 ± 16% perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
24.05 ± 3% -9.0% 21.90 ± 10% -25.0% 18.04 ± 50% -29.4% 16.97 ± 21% perf-sched.wait_time.avg.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
1.13 +15.2% 1.30 +15.0% 1.30 +13.7% 1.29 perf-sched.wait_time.avg.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
189.93 ± 3% +18.3% 224.72 ± 4% +20.3% 228.50 ± 3% +22.1% 231.81 ± 3% perf-sched.wait_time.avg.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
1.71 ±165% -83.4% 0.28 ± 21% -82.3% 0.30 ± 16% -74.6% 0.43 ± 60% perf-sched.wait_time.max.ms.__cond_resched.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
0.43 ± 17% -43.8% 0.24 ± 26% -44.4% 0.24 ± 27% -32.9% 0.29 ± 23% perf-sched.wait_time.max.ms.__cond_resched.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.46 ± 17% -36.7% 0.29 ± 12% -35.7% 0.30 ± 19% -35.3% 0.30 ± 21% perf-sched.wait_time.max.ms.__cond_resched.shmem_undo_range.shmem_setattr.notify_change.do_truncate
0.31 ± 26% -42.1% 0.18 ± 58% -64.1% 0.11 ± 40% -28.5% 0.22 ± 30% perf-sched.wait_time.max.ms.__cond_resched.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
45.41 ± 4% +13.4% 51.50 ± 12% +148.6% 112.87 ± 86% +56.8% 71.18 ± 21% perf-sched.wait_time.max.ms.do_wait.kernel_wait4.__do_sys_wait4.do_syscall_64
2.39 +10.7% 2.64 ± 2% +12.9% 2.69 ± 2% +12.1% 2.68 ± 2% perf-sched.wait_time.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
0.75 +142.0% 1.83 ± 2% +146.9% 1.86 +124.8% 1.70 perf-stat.i.MPKI
8.47e+09 -24.4% 6.407e+09 -23.2% 6.503e+09 -21.2% 6.674e+09 perf-stat.i.branch-instructions
0.66 -0.0 0.63 -0.0 0.64 -0.0 0.63 perf-stat.i.branch-miss-rate%
56364992 -28.3% 40421603 ± 3% -26.0% 41734061 ± 2% -25.8% 41829975 perf-stat.i.branch-misses
14.64 +6.7 21.30 +6.9 21.54 +6.5 21.10 perf-stat.i.cache-miss-rate%
30868184 +81.3% 55977240 ± 3% +87.7% 57950237 +76.2% 54404466 perf-stat.i.cache-misses
2.107e+08 +24.7% 2.627e+08 ± 2% +27.6% 2.69e+08 +22.3% 2.578e+08 perf-stat.i.cache-references
3106 -5.5% 2934 ± 2% +16.4% 3615 ± 29% +2.4% 3181 ± 5% perf-stat.i.context-switches
3.55 +33.4% 4.74 +31.5% 4.67 +27.4% 4.52 perf-stat.i.cpi
4722 -44.8% 2605 ± 3% -46.7% 2515 -43.3% 2675 perf-stat.i.cycles-between-cache-misses
0.04 -0.0 0.04 -0.0 0.04 -0.0 0.04 perf-stat.i.dTLB-load-miss-rate%
4117232 -29.1% 2917107 -28.1% 2961876 -25.8% 3056956 perf-stat.i.dTLB-load-misses
1.051e+10 -24.1% 7.979e+09 -23.0% 8.1e+09 -19.7% 8.44e+09 perf-stat.i.dTLB-loads
0.00 ± 3% +0.0 0.00 ± 6% +0.0 0.00 ± 5% +0.0 0.00 ± 4% perf-stat.i.dTLB-store-miss-rate%
5.886e+09 -27.5% 4.269e+09 -26.3% 4.34e+09 -24.1% 4.467e+09 perf-stat.i.dTLB-stores
78.16 -6.6 71.51 -6.4 71.75 -5.9 72.23 perf-stat.i.iTLB-load-miss-rate%
4131074 ± 3% -30.0% 2891515 -29.2% 2922789 -26.2% 3048227 perf-stat.i.iTLB-load-misses
4.098e+10 -25.0% 3.072e+10 -23.9% 3.119e+10 -21.6% 3.214e+10 perf-stat.i.instructions
9929 ± 2% +7.0% 10627 +7.5% 10673 +6.2% 10547 perf-stat.i.instructions-per-iTLB-miss
0.28 -25.0% 0.21 -23.9% 0.21 -21.5% 0.22 perf-stat.i.ipc
63.49 +43.8% 91.27 ± 3% +48.2% 94.07 +38.6% 87.97 perf-stat.i.metric.K/sec
241.12 -24.6% 181.87 -23.4% 184.70 -20.9% 190.75 perf-stat.i.metric.M/sec
90.84 -0.4 90.49 -0.9 89.98 -2.9 87.93 perf-stat.i.node-load-miss-rate%
3735316 +78.6% 6669641 ± 3% +83.1% 6839047 +62.4% 6067727 perf-stat.i.node-load-misses
377465 ± 4% +86.1% 702512 ± 11% +101.7% 761510 ± 4% +120.8% 833359 perf-stat.i.node-loads
1322217 -27.6% 957081 ± 5% -22.9% 1019779 ± 2% -19.4% 1066178 perf-stat.i.node-store-misses
37459 ± 3% -23.0% 28826 ± 5% -19.2% 30253 ± 6% -23.4% 28682 ± 3% perf-stat.i.node-stores
0.75 +141.8% 1.82 ± 2% +146.6% 1.86 +124.7% 1.69 perf-stat.overall.MPKI
0.67 -0.0 0.63 -0.0 0.64 -0.0 0.63 perf-stat.overall.branch-miss-rate%
14.65 +6.7 21.30 +6.9 21.54 +6.5 21.11 perf-stat.overall.cache-miss-rate%
3.55 +33.4% 4.73 +31.4% 4.66 +27.4% 4.52 perf-stat.overall.cpi
4713 -44.8% 2601 ± 3% -46.7% 2511 -43.3% 2671 perf-stat.overall.cycles-between-cache-misses
0.04 -0.0 0.04 -0.0 0.04 -0.0 0.04 perf-stat.overall.dTLB-load-miss-rate%
0.00 ± 3% +0.0 0.00 ± 5% +0.0 0.00 ± 5% +0.0 0.00 perf-stat.overall.dTLB-store-miss-rate%
78.14 -6.7 71.47 -6.4 71.70 -5.9 72.20 perf-stat.overall.iTLB-load-miss-rate%
9927 ± 2% +7.0% 10624 +7.5% 10672 +6.2% 10547 perf-stat.overall.instructions-per-iTLB-miss
0.28 -25.0% 0.21 -23.9% 0.21 -21.5% 0.22 perf-stat.overall.ipc
90.82 -0.3 90.49 -0.8 89.98 -2.9 87.92 perf-stat.overall.node-load-miss-rate%
3098901 +7.1% 3318983 +6.9% 3313112 +7.0% 3316044 perf-stat.overall.path-length
8.441e+09 -24.4% 6.385e+09 -23.2% 6.48e+09 -21.2% 6.652e+09 perf-stat.ps.branch-instructions
56179581 -28.3% 40286337 ± 3% -26.0% 41593521 ± 2% -25.8% 41687151 perf-stat.ps.branch-misses
30759982 +81.3% 55777812 ± 3% +87.7% 57746279 +76.3% 54217757 perf-stat.ps.cache-misses
2.1e+08 +24.6% 2.618e+08 ± 2% +27.6% 2.68e+08 +22.3% 2.569e+08 perf-stat.ps.cache-references
3095 -5.5% 2923 ± 2% +16.2% 3597 ± 29% +2.3% 3167 ± 5% perf-stat.ps.context-switches
135.89 -0.8% 134.84 -0.7% 134.99 -1.0% 134.55 perf-stat.ps.cpu-migrations
4103292 -29.1% 2907270 -28.1% 2951746 -25.7% 3046739 perf-stat.ps.dTLB-load-misses
1.048e+10 -24.1% 7.952e+09 -23.0% 8.072e+09 -19.7% 8.412e+09 perf-stat.ps.dTLB-loads
5.866e+09 -27.5% 4.255e+09 -26.3% 4.325e+09 -24.1% 4.452e+09 perf-stat.ps.dTLB-stores
4117020 ± 3% -30.0% 2881750 -29.3% 2912744 -26.2% 3037970 perf-stat.ps.iTLB-load-misses
4.084e+10 -25.0% 3.062e+10 -23.9% 3.109e+10 -21.6% 3.203e+10 perf-stat.ps.instructions
3722149 +78.5% 6645867 ± 3% +83.1% 6814976 +62.5% 6046854 perf-stat.ps.node-load-misses
376240 ± 4% +86.1% 700053 ± 11% +101.7% 758898 ± 4% +120.8% 830575 perf-stat.ps.node-loads
1317772 -27.6% 953773 ± 5% -22.9% 1016183 ± 2% -19.4% 1062457 perf-stat.ps.node-store-misses
37408 ± 3% -23.2% 28748 ± 5% -19.3% 30192 ± 6% -23.5% 28607 ± 3% perf-stat.ps.node-stores
1.234e+13 -25.1% 9.246e+12 -24.0% 9.375e+12 -21.5% 9.683e+12 perf-stat.total.instructions
1.28 -0.4 0.90 ± 2% -0.4 0.91 -0.3 0.94 ± 2% perf-profile.calltrace.cycles-pp.syscall_return_via_sysret.fallocate64
1.26 ± 2% -0.4 0.90 ± 3% -0.3 0.92 ± 2% -0.3 0.94 ± 2% perf-profile.calltrace.cycles-pp.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.08 ± 2% -0.3 0.77 ± 3% -0.3 0.79 ± 2% -0.3 0.81 ± 2% perf-profile.calltrace.cycles-pp.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.92 ± 2% -0.3 0.62 ± 3% -0.3 0.63 -0.3 0.66 ± 2% perf-profile.calltrace.cycles-pp.shmem_inode_acct_blocks.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.84 ± 3% -0.2 0.61 ± 3% -0.2 0.63 ± 2% -0.2 0.65 ± 2% perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_pages_mpol.shmem_alloc_folio.shmem_alloc_and_add_folio.shmem_get_folio_gfp
29.27 -0.2 29.09 -1.0 28.32 -0.2 29.04 perf-profile.calltrace.cycles-pp.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.26 -0.2 1.08 -0.2 1.07 -0.2 1.10 perf-profile.calltrace.cycles-pp.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr
1.26 -0.2 1.08 -0.2 1.07 -0.2 1.10 perf-profile.calltrace.cycles-pp.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change
1.24 -0.2 1.06 -0.2 1.05 -0.2 1.08 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release.shmem_undo_range
1.23 -0.2 1.06 -0.2 1.05 -0.2 1.08 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu
1.24 -0.2 1.06 -0.2 1.05 -0.2 1.08 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.lru_add_drain_cpu.__folio_batch_release
29.15 -0.2 28.99 -0.9 28.23 -0.2 28.94 perf-profile.calltrace.cycles-pp.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
1.20 -0.2 1.04 ± 2% -0.2 1.05 -0.2 1.02 ± 2% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
27.34 -0.1 27.22 ± 2% -0.9 26.49 -0.1 27.20 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
27.36 -0.1 27.24 ± 2% -0.9 26.51 -0.1 27.22 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
27.28 -0.1 27.17 ± 2% -0.8 26.44 -0.1 27.16 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.folio_batch_move_lru.folio_add_lru
25.74 -0.1 25.67 ± 2% +0.2 25.98 +0.9 26.62 perf-profile.calltrace.cycles-pp.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr.notify_change
23.43 +0.0 23.43 ± 2% +0.3 23.70 +0.9 24.34 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release.shmem_undo_range
23.45 +0.0 23.45 ± 2% +0.3 23.73 +0.9 24.35 perf-profile.calltrace.cycles-pp.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
23.37 +0.0 23.39 ± 2% +0.3 23.67 +0.9 24.30 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.folio_lruvec_lock_irqsave.release_pages.__folio_batch_release
0.68 ± 3% +0.0 0.72 ± 4% +0.1 0.73 ± 3% +0.1 0.74 perf-profile.calltrace.cycles-pp.__mem_cgroup_uncharge_list.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
1.08 +0.1 1.20 +0.1 1.17 +0.1 1.15 ± 2% perf-profile.calltrace.cycles-pp.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio.shmem_get_folio_gfp
2.91 +0.3 3.18 ± 2% +0.3 3.23 +0.1 3.02 perf-profile.calltrace.cycles-pp.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change.do_truncate
2.56 +0.4 2.92 ± 2% +0.4 2.98 +0.2 2.75 perf-profile.calltrace.cycles-pp.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr.notify_change
1.36 ± 3% +0.4 1.76 ± 9% +0.4 1.75 ± 5% +0.3 1.68 ± 3% perf-profile.calltrace.cycles-pp.get_mem_cgroup_from_mm.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
2.22 +0.5 2.68 ± 2% +0.5 2.73 +0.3 2.50 perf-profile.calltrace.cycles-pp.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range.shmem_setattr
0.00 +0.6 0.60 ± 2% +0.6 0.61 ± 2% +0.6 0.61 perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.release_pages.__folio_batch_release.shmem_undo_range.shmem_setattr
2.33 +0.6 2.94 +0.6 2.96 ± 3% +0.3 2.59 perf-profile.calltrace.cycles-pp.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
0.00 +0.7 0.72 ± 2% +0.7 0.72 ± 2% +0.7 0.68 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.lru_add_fn.folio_batch_move_lru.folio_add_lru.shmem_alloc_and_add_folio
0.69 ± 4% +0.8 1.47 ± 3% +0.8 1.48 ± 2% +0.7 1.42 perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio
1.24 ± 2% +0.8 2.04 ± 2% +0.8 2.07 ± 2% +0.6 1.82 perf-profile.calltrace.cycles-pp.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio.shmem_undo_range
0.00 +0.8 0.82 ± 4% +0.8 0.85 ± 3% +0.8 0.78 ± 2% perf-profile.calltrace.cycles-pp.__count_memcg_events.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.17 ± 2% +0.8 2.00 ± 2% +0.9 2.04 ± 2% +0.6 1.77 perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.filemap_unaccount_folio.__filemap_remove_folio.filemap_remove_folio.truncate_inode_folio
0.59 ± 4% +0.9 1.53 +0.9 1.53 ± 4% +0.8 1.37 ± 2% perf-profile.calltrace.cycles-pp.__mod_memcg_lruvec_state.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp
1.38 +1.0 2.33 ± 2% +1.0 2.34 ± 3% +0.6 1.94 ± 2% perf-profile.calltrace.cycles-pp.__mod_lruvec_page_state.shmem_add_to_page_cache.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
0.62 ± 3% +1.0 1.66 ± 5% +1.1 1.68 ± 4% +1.0 1.57 ± 2% perf-profile.calltrace.cycles-pp.mem_cgroup_commit_charge.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate
38.70 +1.2 39.90 +0.5 39.23 +0.7 39.45 perf-profile.calltrace.cycles-pp.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe.fallocate64
38.34 +1.3 39.65 +0.6 38.97 +0.9 39.20 perf-profile.calltrace.cycles-pp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64.entry_SYSCALL_64_after_hwframe
37.24 +1.6 38.86 +0.9 38.17 +1.1 38.35 perf-profile.calltrace.cycles-pp.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate.do_syscall_64
36.64 +1.8 38.40 +1.1 37.72 +1.2 37.88 perf-profile.calltrace.cycles-pp.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate.__x64_sys_fallocate
2.47 ± 2% +2.1 4.59 ± 8% +2.1 4.61 ± 5% +1.9 4.37 ± 2% perf-profile.calltrace.cycles-pp.__mem_cgroup_charge.shmem_alloc_and_add_folio.shmem_get_folio_gfp.shmem_fallocate.vfs_fallocate
1.30 -0.4 0.92 ± 2% -0.4 0.93 -0.4 0.96 perf-profile.children.cycles-pp.syscall_return_via_sysret
1.28 ± 2% -0.4 0.90 ± 3% -0.3 0.93 ± 2% -0.3 0.95 ± 2% perf-profile.children.cycles-pp.shmem_alloc_folio
30.44 -0.3 30.11 -1.1 29.33 -0.4 30.07 perf-profile.children.cycles-pp.folio_batch_move_lru
1.10 ± 2% -0.3 0.78 ± 3% -0.3 0.81 ± 2% -0.3 0.82 ± 2% perf-profile.children.cycles-pp.alloc_pages_mpol
0.96 ± 2% -0.3 0.64 ± 3% -0.3 0.65 -0.3 0.68 ± 2% perf-profile.children.cycles-pp.shmem_inode_acct_blocks
0.88 -0.3 0.58 ± 2% -0.3 0.60 ± 2% -0.3 0.62 ± 2% perf-profile.children.cycles-pp.xas_store
0.88 ± 3% -0.2 0.64 ± 3% -0.2 0.66 ± 2% -0.2 0.67 ± 2% perf-profile.children.cycles-pp.__alloc_pages
29.29 -0.2 29.10 -1.0 28.33 -0.2 29.06 perf-profile.children.cycles-pp.folio_add_lru
0.61 ± 2% -0.2 0.43 ± 3% -0.2 0.44 ± 2% -0.2 0.45 ± 3% perf-profile.children.cycles-pp.__entry_text_start
1.26 -0.2 1.09 -0.2 1.08 -0.2 1.10 perf-profile.children.cycles-pp.lru_add_drain_cpu
0.56 -0.2 0.39 ± 4% -0.2 0.40 ± 3% -0.2 0.40 ± 3% perf-profile.children.cycles-pp.free_unref_page_list
1.22 -0.2 1.06 ± 2% -0.2 1.06 -0.2 1.04 ± 2% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.46 -0.1 0.32 ± 3% -0.1 0.32 -0.1 0.32 ± 3% perf-profile.children.cycles-pp.__mod_lruvec_state
0.41 ± 3% -0.1 0.28 ± 4% -0.1 0.28 ± 3% -0.1 0.29 ± 2% perf-profile.children.cycles-pp.xas_load
0.44 ± 4% -0.1 0.31 ± 4% -0.1 0.32 ± 2% -0.1 0.34 ± 3% perf-profile.children.cycles-pp.find_lock_entries
0.50 ± 3% -0.1 0.37 ± 2% -0.1 0.39 ± 4% -0.1 0.39 ± 2% perf-profile.children.cycles-pp.get_page_from_freelist
0.24 ± 7% -0.1 0.12 ± 5% -0.1 0.13 ± 2% -0.1 0.13 ± 3% perf-profile.children.cycles-pp.__list_add_valid_or_report
25.89 -0.1 25.78 ± 2% +0.2 26.08 +0.8 26.73 perf-profile.children.cycles-pp.release_pages
0.34 ± 2% -0.1 0.24 ± 4% -0.1 0.23 ± 2% -0.1 0.23 ± 4% perf-profile.children.cycles-pp.__mod_node_page_state
0.38 ± 3% -0.1 0.28 ± 4% -0.1 0.29 ± 3% -0.1 0.28 perf-profile.children.cycles-pp._raw_spin_lock
0.32 ± 2% -0.1 0.22 ± 5% -0.1 0.23 ± 2% -0.1 0.23 ± 2% perf-profile.children.cycles-pp.__dquot_alloc_space
0.26 ± 2% -0.1 0.17 ± 2% -0.1 0.18 ± 3% -0.1 0.18 ± 2% perf-profile.children.cycles-pp.xas_descend
0.22 ± 3% -0.1 0.14 ± 4% -0.1 0.14 ± 3% -0.1 0.14 ± 2% perf-profile.children.cycles-pp.free_unref_page_commit
0.25 -0.1 0.17 ± 3% -0.1 0.18 ± 4% -0.1 0.18 ± 4% perf-profile.children.cycles-pp.xas_clear_mark
0.32 ± 4% -0.1 0.25 ± 3% -0.1 0.26 ± 4% -0.1 0.26 ± 2% perf-profile.children.cycles-pp.rmqueue
0.23 ± 2% -0.1 0.16 ± 2% -0.1 0.16 ± 4% -0.1 0.16 ± 6% perf-profile.children.cycles-pp.xas_init_marks
0.24 ± 2% -0.1 0.17 ± 5% -0.1 0.17 ± 4% -0.1 0.18 ± 2% perf-profile.children.cycles-pp.__cond_resched
0.25 ± 4% -0.1 0.18 ± 2% -0.1 0.18 ± 2% -0.1 0.18 ± 4% perf-profile.children.cycles-pp.truncate_cleanup_folio
0.30 ± 3% -0.1 0.23 ± 4% -0.1 0.22 ± 3% -0.1 0.22 ± 2% perf-profile.children.cycles-pp.filemap_get_entry
0.20 ± 2% -0.1 0.13 ± 5% -0.1 0.13 ± 3% -0.1 0.14 ± 4% perf-profile.children.cycles-pp.folio_unlock
0.16 ± 4% -0.1 0.10 ± 5% -0.1 0.10 ± 7% -0.1 0.11 ± 6% perf-profile.children.cycles-pp.xas_find_conflict
0.19 ± 3% -0.1 0.13 ± 5% -0.0 0.14 ± 12% -0.1 0.14 ± 5% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.17 ± 5% -0.1 0.12 ± 3% -0.1 0.12 ± 4% -0.0 0.13 ± 3% perf-profile.children.cycles-pp.noop_dirty_folio
0.13 ± 4% -0.1 0.08 ± 9% -0.1 0.08 ± 8% -0.0 0.09 perf-profile.children.cycles-pp.security_vm_enough_memory_mm
0.18 ± 8% -0.1 0.13 ± 4% -0.0 0.13 ± 5% -0.0 0.13 ± 5% perf-profile.children.cycles-pp.shmem_recalc_inode
0.16 ± 2% -0.1 0.11 ± 3% -0.0 0.12 ± 4% -0.0 0.12 ± 6% perf-profile.children.cycles-pp.free_unref_page_prepare
0.09 ± 5% -0.1 0.04 ± 45% -0.0 0.05 -0.0 0.05 ± 7% perf-profile.children.cycles-pp.mem_cgroup_update_lru_size
0.10 ± 7% -0.0 0.05 ± 45% -0.0 0.06 ± 13% -0.0 0.06 ± 7% perf-profile.children.cycles-pp.cap_vm_enough_memory
0.14 ± 5% -0.0 0.10 -0.0 0.10 ± 4% -0.0 0.11 ± 5% perf-profile.children.cycles-pp.__folio_cancel_dirty
0.14 ± 5% -0.0 0.10 ± 4% -0.0 0.10 ± 3% -0.0 0.10 ± 6% perf-profile.children.cycles-pp.security_file_permission
0.10 ± 5% -0.0 0.06 ± 6% -0.0 0.06 ± 7% -0.0 0.07 ± 10% perf-profile.children.cycles-pp.xas_find
0.15 ± 4% -0.0 0.11 ± 3% -0.0 0.11 ± 6% -0.0 0.11 ± 3% perf-profile.children.cycles-pp.__fget_light
0.12 ± 3% -0.0 0.09 ± 7% -0.0 0.09 ± 7% -0.0 0.09 ± 6% perf-profile.children.cycles-pp.__vm_enough_memory
0.12 ± 3% -0.0 0.09 ± 4% -0.0 0.09 ± 4% -0.0 0.09 ± 6% perf-profile.children.cycles-pp.apparmor_file_permission
0.12 ± 3% -0.0 0.08 ± 5% -0.0 0.08 ± 5% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.14 ± 5% -0.0 0.11 ± 3% -0.0 0.11 ± 4% -0.0 0.12 ± 3% perf-profile.children.cycles-pp.file_modified
0.12 ± 4% -0.0 0.08 ± 4% -0.0 0.08 ± 7% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.xas_start
0.09 -0.0 0.06 ± 8% -0.0 0.04 ± 45% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.__folio_throttle_swaprate
0.12 ± 4% -0.0 0.08 ± 4% -0.0 0.08 ± 4% -0.0 0.09 ± 5% perf-profile.children.cycles-pp.__percpu_counter_limited_add
0.12 ± 6% -0.0 0.08 ± 8% -0.0 0.08 ± 8% -0.0 0.08 ± 4% perf-profile.children.cycles-pp._raw_spin_trylock
0.12 ± 4% -0.0 0.09 ± 4% -0.0 0.09 ± 4% -0.0 0.09 perf-profile.children.cycles-pp.inode_add_bytes
0.20 ± 2% -0.0 0.17 ± 7% -0.0 0.17 ± 4% -0.0 0.18 ± 3% perf-profile.children.cycles-pp.try_charge_memcg
0.10 ± 5% -0.0 0.07 ± 7% -0.0 0.07 ± 7% -0.0 0.06 ± 7% perf-profile.children.cycles-pp.policy_nodemask
0.09 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 6% perf-profile.children.cycles-pp.get_pfnblock_flags_mask
0.09 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.07 ± 5% perf-profile.children.cycles-pp.filemap_free_folio
0.07 ± 6% -0.0 0.05 ± 7% -0.0 0.06 ± 9% -0.0 0.06 ± 8% perf-profile.children.cycles-pp.down_write
0.08 ± 4% -0.0 0.06 ± 8% -0.0 0.06 ± 9% -0.0 0.06 ± 8% perf-profile.children.cycles-pp.get_task_policy
0.09 ± 7% -0.0 0.07 -0.0 0.07 ± 7% -0.0 0.07 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
0.09 ± 7% -0.0 0.07 -0.0 0.07 ± 5% -0.0 0.08 ± 6% perf-profile.children.cycles-pp.inode_needs_update_time
0.09 ± 5% -0.0 0.07 ± 5% -0.0 0.08 ± 4% -0.0 0.08 ± 4% perf-profile.children.cycles-pp.xas_create
0.16 ± 2% -0.0 0.14 ± 5% -0.0 0.14 ± 2% -0.0 0.15 ± 4% perf-profile.children.cycles-pp.cgroup_rstat_updated
0.08 ± 7% -0.0 0.06 ± 9% -0.0 0.06 ± 6% -0.0 0.06 perf-profile.children.cycles-pp.percpu_counter_add_batch
0.07 ± 5% -0.0 0.05 ± 7% -0.0 0.03 ± 70% -0.0 0.06 ± 14% perf-profile.children.cycles-pp.folio_mark_dirty
0.08 ± 10% -0.0 0.06 ± 6% -0.0 0.06 ± 13% -0.0 0.05 perf-profile.children.cycles-pp.shmem_is_huge
0.07 ± 6% +0.0 0.09 ± 10% +0.0 0.09 ± 5% +0.0 0.09 ± 6% perf-profile.children.cycles-pp.propagate_protected_usage
0.43 ± 3% +0.0 0.46 ± 5% +0.0 0.47 ± 3% +0.0 0.48 ± 2% perf-profile.children.cycles-pp.uncharge_batch
0.68 ± 3% +0.0 0.73 ± 4% +0.0 0.74 ± 3% +0.1 0.74 perf-profile.children.cycles-pp.__mem_cgroup_uncharge_list
1.11 +0.1 1.22 +0.1 1.19 +0.1 1.17 ± 2% perf-profile.children.cycles-pp.lru_add_fn
2.91 +0.3 3.18 ± 2% +0.3 3.23 +0.1 3.02 perf-profile.children.cycles-pp.truncate_inode_folio
2.56 +0.4 2.92 ± 2% +0.4 2.98 +0.2 2.75 perf-profile.children.cycles-pp.filemap_remove_folio
1.37 ± 3% +0.4 1.76 ± 9% +0.4 1.76 ± 5% +0.3 1.69 ± 2% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
2.24 +0.5 2.70 ± 2% +0.5 2.75 +0.3 2.51 perf-profile.children.cycles-pp.__filemap_remove_folio
2.38 +0.6 2.97 +0.6 2.99 ± 3% +0.2 2.63 perf-profile.children.cycles-pp.shmem_add_to_page_cache
0.18 ± 4% +0.7 0.91 ± 4% +0.8 0.94 ± 4% +0.7 0.87 ± 2% perf-profile.children.cycles-pp.__count_memcg_events
1.26 +0.8 2.04 ± 2% +0.8 2.08 ± 2% +0.6 1.82 perf-profile.children.cycles-pp.filemap_unaccount_folio
0.63 ± 2% +1.0 1.67 ± 5% +1.1 1.68 ± 5% +1.0 1.58 ± 2% perf-profile.children.cycles-pp.mem_cgroup_commit_charge
38.71 +1.2 39.91 +0.5 39.23 +0.7 39.46 perf-profile.children.cycles-pp.vfs_fallocate
38.37 +1.3 39.66 +0.6 38.99 +0.8 39.21 perf-profile.children.cycles-pp.shmem_fallocate
37.28 +1.6 38.89 +0.9 38.20 +1.1 38.39 perf-profile.children.cycles-pp.shmem_get_folio_gfp
36.71 +1.7 38.45 +1.1 37.77 +1.2 37.94 perf-profile.children.cycles-pp.shmem_alloc_and_add_folio
2.58 +1.8 4.36 ± 2% +1.8 4.40 ± 3% +1.2 3.74 perf-profile.children.cycles-pp.__mod_lruvec_page_state
2.48 ± 2% +2.1 4.60 ± 8% +2.1 4.62 ± 5% +1.9 4.38 ± 2% perf-profile.children.cycles-pp.__mem_cgroup_charge
1.93 ± 3% +2.4 4.36 ± 2% +2.5 4.38 ± 3% +2.2 4.09 perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
1.30 -0.4 0.92 ± 2% -0.4 0.93 -0.3 0.95 perf-profile.self.cycles-pp.syscall_return_via_sysret
0.73 -0.2 0.52 ± 2% -0.2 0.53 -0.2 0.54 ± 2% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.54 ± 2% -0.2 0.36 ± 3% -0.2 0.36 ± 3% -0.2 0.37 ± 2% perf-profile.self.cycles-pp.release_pages
0.48 -0.2 0.30 ± 3% -0.2 0.32 ± 3% -0.2 0.33 ± 2% perf-profile.self.cycles-pp.xas_store
0.54 ± 2% -0.2 0.38 ± 3% -0.1 0.39 ± 2% -0.1 0.39 ± 3% perf-profile.self.cycles-pp.__entry_text_start
1.17 -0.1 1.03 ± 2% -0.1 1.03 -0.2 1.00 ± 2% perf-profile.self.cycles-pp.syscall_exit_to_user_mode
0.36 ± 2% -0.1 0.22 ± 3% -0.1 0.22 ± 3% -0.1 0.24 ± 2% perf-profile.self.cycles-pp.shmem_add_to_page_cache
0.43 ± 5% -0.1 0.30 ± 7% -0.2 0.27 ± 7% -0.1 0.29 ± 2% perf-profile.self.cycles-pp.lru_add_fn
0.24 ± 7% -0.1 0.12 ± 6% -0.1 0.13 ± 2% -0.1 0.12 ± 6% perf-profile.self.cycles-pp.__list_add_valid_or_report
0.38 ± 4% -0.1 0.27 ± 4% -0.1 0.28 ± 3% -0.1 0.28 ± 2% perf-profile.self.cycles-pp._raw_spin_lock
0.52 ± 3% -0.1 0.41 -0.1 0.41 -0.1 0.43 ± 3% perf-profile.self.cycles-pp.folio_batch_move_lru
0.32 ± 2% -0.1 0.22 ± 4% -0.1 0.22 ± 3% -0.1 0.22 ± 5% perf-profile.self.cycles-pp.__mod_node_page_state
0.36 ± 2% -0.1 0.26 ± 2% -0.1 0.26 ± 2% -0.1 0.27 perf-profile.self.cycles-pp.shmem_fallocate
0.36 ± 4% -0.1 0.26 ± 4% -0.1 0.26 ± 3% -0.1 0.27 ± 3% perf-profile.self.cycles-pp.find_lock_entries
0.28 ± 3% -0.1 0.20 ± 5% -0.1 0.20 ± 2% -0.1 0.21 ± 3% perf-profile.self.cycles-pp.__alloc_pages
0.24 ± 2% -0.1 0.16 ± 4% -0.1 0.16 ± 4% -0.1 0.16 ± 3% perf-profile.self.cycles-pp.xas_descend
0.09 ± 5% -0.1 0.01 ±223% -0.1 0.03 ± 70% -0.1 0.03 ± 70% perf-profile.self.cycles-pp.mem_cgroup_update_lru_size
0.23 ± 2% -0.1 0.16 ± 3% -0.1 0.16 ± 2% -0.1 0.16 ± 4% perf-profile.self.cycles-pp.xas_clear_mark
0.18 ± 3% -0.1 0.11 ± 6% -0.1 0.12 ± 4% -0.1 0.11 ± 4% perf-profile.self.cycles-pp.free_unref_page_commit
0.18 ± 3% -0.1 0.12 ± 4% -0.1 0.12 ± 3% -0.0 0.13 ± 5% perf-profile.self.cycles-pp.shmem_inode_acct_blocks
0.21 ± 3% -0.1 0.15 ± 2% -0.1 0.15 ± 2% -0.1 0.16 ± 3% perf-profile.self.cycles-pp.shmem_alloc_and_add_folio
0.18 ± 2% -0.1 0.12 ± 3% -0.1 0.12 ± 4% -0.1 0.12 ± 3% perf-profile.self.cycles-pp.__filemap_remove_folio
0.18 ± 7% -0.1 0.12 ± 7% -0.0 0.13 ± 5% -0.1 0.12 ± 3% perf-profile.self.cycles-pp.vfs_fallocate
0.18 ± 2% -0.1 0.13 ± 3% -0.1 0.13 -0.1 0.13 ± 5% perf-profile.self.cycles-pp.folio_unlock
0.20 ± 2% -0.1 0.14 ± 6% -0.1 0.15 ± 3% -0.1 0.15 ± 6% perf-profile.self.cycles-pp.__dquot_alloc_space
0.18 ± 2% -0.1 0.12 ± 3% -0.1 0.13 ± 3% -0.0 0.13 ± 4% perf-profile.self.cycles-pp.get_page_from_freelist
0.15 ± 3% -0.1 0.10 ± 7% -0.0 0.10 ± 3% -0.0 0.10 ± 3% perf-profile.self.cycles-pp.xas_load
0.17 ± 3% -0.1 0.12 ± 8% -0.1 0.12 ± 3% -0.0 0.12 ± 4% perf-profile.self.cycles-pp.__cond_resched
0.17 ± 2% -0.1 0.12 ± 3% -0.1 0.12 ± 7% -0.0 0.13 ± 2% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.17 ± 5% -0.1 0.12 ± 3% -0.0 0.12 ± 4% -0.0 0.12 ± 6% perf-profile.self.cycles-pp.noop_dirty_folio
0.10 ± 7% -0.0 0.05 ± 45% -0.0 0.06 ± 13% -0.0 0.06 ± 7% perf-profile.self.cycles-pp.cap_vm_enough_memory
0.12 ± 3% -0.0 0.08 ± 4% -0.0 0.08 -0.0 0.08 ± 4% perf-profile.self.cycles-pp.rmqueue
0.06 -0.0 0.02 ±141% -0.0 0.03 ± 70% -0.0 0.04 ± 44% perf-profile.self.cycles-pp.inode_needs_update_time
0.07 ± 5% -0.0 0.02 ± 99% -0.0 0.05 -0.0 0.05 ± 7% perf-profile.self.cycles-pp.xas_find
0.13 ± 3% -0.0 0.09 ± 6% -0.0 0.10 ± 5% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.alloc_pages_mpol
0.07 ± 6% -0.0 0.03 ± 70% -0.0 0.04 ± 44% -0.0 0.05 perf-profile.self.cycles-pp.xas_find_conflict
0.16 ± 2% -0.0 0.12 ± 6% -0.0 0.12 ± 3% -0.0 0.13 ± 5% perf-profile.self.cycles-pp.free_unref_page_list
0.12 ± 5% -0.0 0.08 ± 4% -0.0 0.08 ± 4% -0.0 0.09 ± 7% perf-profile.self.cycles-pp.fallocate64
0.20 ± 4% -0.0 0.16 ± 3% -0.0 0.16 ± 3% -0.0 0.18 ± 4% perf-profile.self.cycles-pp.shmem_get_folio_gfp
0.06 ± 7% -0.0 0.02 ± 99% -0.0 0.02 ± 99% -0.0 0.03 ± 70% perf-profile.self.cycles-pp.shmem_recalc_inode
0.13 ± 3% -0.0 0.09 -0.0 0.09 ± 6% -0.0 0.09 ± 6% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.22 ± 3% -0.0 0.19 ± 6% -0.0 0.20 ± 3% -0.0 0.21 ± 4% perf-profile.self.cycles-pp.page_counter_uncharge
0.14 ± 3% -0.0 0.10 ± 6% -0.0 0.10 ± 8% -0.0 0.10 ± 4% perf-profile.self.cycles-pp.filemap_remove_folio
0.15 ± 5% -0.0 0.11 ± 3% -0.0 0.11 ± 6% -0.0 0.11 ± 3% perf-profile.self.cycles-pp.__fget_light
0.12 ± 4% -0.0 0.08 -0.0 0.08 ± 5% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.__folio_cancel_dirty
0.11 ± 4% -0.0 0.08 ± 7% -0.0 0.08 ± 8% -0.0 0.08 ± 4% perf-profile.self.cycles-pp._raw_spin_trylock
0.11 ± 3% -0.0 0.08 ± 6% -0.0 0.07 ± 9% -0.0 0.08 ± 6% perf-profile.self.cycles-pp.xas_start
0.11 ± 3% -0.0 0.08 ± 6% -0.0 0.08 ± 6% -0.0 0.08 ± 6% perf-profile.self.cycles-pp.__percpu_counter_limited_add
0.12 ± 3% -0.0 0.09 ± 5% -0.0 0.08 ± 5% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.__mod_lruvec_state
0.11 ± 5% -0.0 0.08 ± 4% -0.0 0.08 ± 6% -0.0 0.08 ± 4% perf-profile.self.cycles-pp.truncate_cleanup_folio
0.10 ± 6% -0.0 0.07 ± 5% -0.0 0.07 ± 7% -0.0 0.07 ± 11% perf-profile.self.cycles-pp.xas_init_marks
0.09 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.get_pfnblock_flags_mask
0.11 -0.0 0.08 ± 5% -0.0 0.08 -0.0 0.09 ± 5% perf-profile.self.cycles-pp.folio_add_lru
0.09 ± 6% -0.0 0.06 ± 7% -0.0 0.06 ± 7% -0.0 0.07 ± 5% perf-profile.self.cycles-pp.filemap_free_folio
0.09 ± 4% -0.0 0.06 ± 6% -0.0 0.06 ± 6% -0.0 0.06 ± 6% perf-profile.self.cycles-pp.shmem_alloc_folio
0.10 ± 4% -0.0 0.08 ± 4% -0.0 0.08 ± 6% -0.0 0.08 ± 7% perf-profile.self.cycles-pp.apparmor_file_permission
0.14 ± 5% -0.0 0.12 ± 5% -0.0 0.12 ± 3% -0.0 0.13 ± 4% perf-profile.self.cycles-pp.cgroup_rstat_updated
0.07 ± 7% -0.0 0.04 ± 44% -0.0 0.04 ± 44% -0.0 0.04 ± 71% perf-profile.self.cycles-pp.policy_nodemask
0.07 ± 11% -0.0 0.04 ± 45% -0.0 0.05 ± 7% -0.0 0.03 ± 70% perf-profile.self.cycles-pp.shmem_is_huge
0.08 ± 4% -0.0 0.06 ± 8% -0.0 0.06 ± 9% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.get_task_policy
0.08 ± 6% -0.0 0.05 ± 8% -0.0 0.06 ± 8% -0.0 0.05 ± 8% perf-profile.self.cycles-pp.__x64_sys_fallocate
0.12 ± 3% -0.0 0.10 ± 6% -0.0 0.10 ± 6% -0.0 0.10 ± 3% perf-profile.self.cycles-pp.try_charge_memcg
0.07 -0.0 0.05 -0.0 0.05 -0.0 0.04 ± 45% perf-profile.self.cycles-pp.free_unref_page_prepare
0.07 ± 6% -0.0 0.06 ± 9% -0.0 0.06 ± 8% -0.0 0.06 ± 9% perf-profile.self.cycles-pp.percpu_counter_add_batch
0.08 ± 4% -0.0 0.06 -0.0 0.06 ± 6% -0.0 0.06 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.09 ± 7% -0.0 0.07 ± 5% -0.0 0.07 ± 5% -0.0 0.07 ± 7% perf-profile.self.cycles-pp.filemap_get_entry
0.07 ± 9% +0.0 0.09 ± 10% +0.0 0.09 ± 5% +0.0 0.09 ± 6% perf-profile.self.cycles-pp.propagate_protected_usage
0.96 ± 2% +0.2 1.12 ± 7% +0.2 1.16 ± 4% -0.2 0.72 ± 2% perf-profile.self.cycles-pp.__mod_lruvec_page_state
0.45 ± 4% +0.4 0.82 ± 8% +0.4 0.81 ± 6% +0.3 0.77 ± 3% perf-profile.self.cycles-pp.mem_cgroup_commit_charge
1.36 ± 3% +0.4 1.75 ± 9% +0.4 1.75 ± 5% +0.3 1.68 ± 2% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.29 +0.7 1.00 ± 10% +0.7 1.01 ± 7% +0.6 0.93 ± 2% perf-profile.self.cycles-pp.__mem_cgroup_charge
0.16 ± 4% +0.7 0.90 ± 4% +0.8 0.92 ± 4% +0.7 0.85 ± 2% perf-profile.self.cycles-pp.__count_memcg_events
1.80 ± 2% +2.5 4.26 ± 2% +2.5 4.28 ± 3% +2.2 3.98 perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
On Tue, Oct 24, 2023 at 11:09 PM Oliver Sang <[email protected]> wrote:
>
> hi, Yosry Ahmed,
>
> On Tue, Oct 24, 2023 at 12:14:42AM -0700, Yosry Ahmed wrote:
> > On Mon, Oct 23, 2023 at 11:56 PM Oliver Sang <[email protected]> wrote:
> > >
> > > hi, Yosry Ahmed,
> > >
> > > On Mon, Oct 23, 2023 at 07:13:50PM -0700, Yosry Ahmed wrote:
> > >
> > > ...
> > >
> > > >
> > > > I still could not run the benchmark, but I used a version of
> > > > fallocate1.c that does 1 million iterations. I ran 100 in parallel.
> > > > This showed a ~13% regression with the patch, so not the same as the
> > > > will-it-scale version, but it could be an indicator.
> > > >
> > > > With that, I did not see any improvement with the fixlet above or
> > > > ___cacheline_aligned_in_smp. So you can scratch that.
> > > >
> > > > I did, however, see some improvement with reducing the indirection
> > > > layers by moving stats_updates directly into struct mem_cgroup. The
> > > > regression in my manual testing went down to 9%. Still not great, but
> > > > I am wondering how this reflects on the benchmark. If you're able to
> > > > test it that would be great, the diff is below. Meanwhile I am still
> > > > looking for other improvements that can be made.
> > >
> > > we applied previous patch-set as below:
> > >
> > > c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> > > ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> > > 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> > > 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> > > 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> > > 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything <---- the base our tool picked for the patch set
> > >
> > > I tried to apply below patch to either 51d74c18a9c61 or c5f50d8b23c79,
> > > but failed. could you guide how to apply this patch?
> > > Thanks
> > >
> >
> > Thanks for looking into this. I rebased the diff on top of
> > c5f50d8b23c79. Please find it attached.
>
> from our tests, this patch has little impact.
>
> it was applied as below ac6a9444dec85:
>
> ac6a9444dec85 (linux-devel/fixup-c5f50d8b23c79) memcg: move stats_updates to struct mem_cgroup
> c5f50d8b23c79 (linux-review/Yosry-Ahmed/mm-memcg-change-flush_next_time-to-flush_last_time/20231010-112257) mm: memcg: restore subtree stats flushing
> ac8a48ba9e1ca mm: workingset: move the stats flush into workingset_test_recent()
> 51d74c18a9c61 mm: memcg: make stats flushing threshold per-memcg
> 130617edc1cd1 mm: memcg: move vmstats structs definition above flushing code
> 26d0ee342efc6 mm: memcg: change flush_next_time to flush_last_time
> 25478183883e6 Merge branch 'mm-nonmm-unstable' into mm-everything
>
> for the first regression reported in the original report, the data are very close
> for 51d74c18a9c61, c5f50d8b23c79 (patch-set tip, parent of ac6a9444dec85),
> and ac6a9444dec85.
> full comparison is as [1]
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
> gcc-12/performance/x86_64-rhel-8.3/thread/100%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
>
> 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
> ---------------- --------------------------- --------------------------- ---------------------------
> %stddev %change %stddev %change %stddev %change %stddev
> \ | \ | \ | \
> 36509 -25.8% 27079 -25.2% 27305 -25.0% 27383 will-it-scale.per_thread_ops
>
> for the second regression reported in the original report, there seems to be a
> small impact from ac6a9444dec85.
> full comparison is as [2]
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
> gcc-12/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-skl-fpga01/fallocate1/will-it-scale
>
> 130617edc1cd1ba1 51d74c18a9c61e7ee33bc90b522 c5f50d8b23c7982ac875791755b ac6a9444dec85dc50c6bfbc4ee7
> ---------------- --------------------------- --------------------------- ---------------------------
> %stddev %change %stddev %change %stddev %change %stddev
> \ | \ | \ | \
> 76580 -30.0% 53575 -28.9% 54415 -26.7% 56152 will-it-scale.per_thread_ops
>
> [1]
>
Thanks, Oliver, for running the numbers. If I understand correctly, the
will-it-scale.fallocate1 microbenchmark is the only one showing a
significant regression here, is that correct?
In my runs, other more representative microbenchmarks like netperf and
will-it-scale.page_fault* show minimal regression. I would expect
practical workloads to have high concurrency of page faults or
networking, but maybe not fallocate/ftruncate.
Oliver, in your experience, how often does such a regression in such a
microbenchmark translate to a real regression that people care about?
(or how often do people dismiss it?)
I tried optimizing this further for the fallocate/ftruncate case but
without luck. I even tried moving stats_updates into cgroup core
(struct cgroup_rstat_cpu) to reuse the existing loop in
cgroup_rstat_updated() -- but it somehow made it worse.
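(Purely for illustration -- this is not the diff I actually tested -- a
rough sketch of the cgroup_rstat_cpu idea. The stats_updates field and
the simplified loop below are hypothetical, and most of the real
cgroup_rstat_updated() -- the speculative checks, locking, and
updated_children/updated_next linking -- is elided:)

/*
 * Hypothetical sketch: track pending update counts in the rstat
 * per-cpu struct so the ancestor walk that cgroup_rstat_updated()
 * already performs can be reused, instead of a separate per-memcg
 * loop in memcg_rstat_updated().
 */
struct cgroup_rstat_cpu {
        /* ... existing fields elided ... */

        /* hypothetical: pending update count since the last flush */
        unsigned long stats_updates;
};

void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
{
        struct cgroup_rstat_cpu *rstatc;

        for (; cgrp; cgrp = cgroup_parent(cgrp)) {
                rstatc = cgroup_rstat_cpu(cgrp, cpu);

                /* hypothetical accumulation piggybacking on the walk */
                rstatc->stats_updates++;

                /* ... existing early-exit and list linking elided ... */
        }
}

(As noted above, this direction did not help in my testing, so treat it
only as a sketch of the idea.)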
On the other hand, we do have some machines in production running this
series together with a previous optimization for non-hierarchical
stats [1] on an older kernel, and we do see a significant reduction in
CPU time spent reading the stats. Domenico ran a similar experiment
with only this series and reported similar results [2].
Shakeel, Johannes (and other memcg folks), I personally think the
benefits here outweigh a regression in this particular benchmark, but
I am obviously biased. What do you think?
[1]https://lore.kernel.org/lkml/[email protected]/
[2]https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@mail.gmail.com/
On Tue, Oct 24, 2023 at 11:23 PM Yosry Ahmed <[email protected]> wrote:
>
[...]
>
> Thanks Oliver for running the numbers. If I understand correctly the
> will-it-scale.fallocate1 microbenchmark is the only one showing
> significant regression here, is this correct?
>
> In my runs, other more representative microbenchmarks like
> netperf and will-it-scale.page_fault* show minimal regression. I would
> expect practical workloads to have high concurrency of page faults or
> networking, but maybe not fallocate/ftruncate.
>
> Oliver, in your experience, how often does such a regression in such a
> microbenchmark translate to a real regression that people care about?
> (or how often do people dismiss it?)
>
> I tried optimizing this further for the fallocate/ftruncate case but
> without luck. I even tried moving stats_updates into cgroup core
> (struct cgroup_rstat_cpu) to reuse the existing loop in
> cgroup_rstat_updated() -- but it somehow made it worse.
>
> On the other hand, we do have some machines in production running this
> series together with a previous optimization for non-hierarchical
> stats [1] on an older kernel, and we do see significant reduction in
> cpu time spent on reading the stats. Domenico did a similar experiment
> with only this series and reported similar results [2].
>
> Shakeel, Johannes, (and other memcg folks), I personally think the
> benefits here outweigh a regression in this particular benchmark, but
> I am obviously biased. What do you think?
>
> [1]https://lore.kernel.org/lkml/[email protected]/
> [2]https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@mail.gmail.com/
I am still not convinced that the benefits outweigh the regression,
but I would not block this. So let's do this: skip this open window,
get the patch series reviewed, and hopefully we can work together on
fixing that regression and make an informed decision about accepting
it for this series in the next cycle.
On Wed, Oct 25, 2023 at 10:06 AM Shakeel Butt <[email protected]> wrote:
>
> On Tue, Oct 24, 2023 at 11:23 PM Yosry Ahmed <[email protected]> wrote:
> >
> [...]
> >
> > Thanks Oliver for running the numbers. If I understand correctly the
> > will-it-scale.fallocate1 microbenchmark is the only one showing
> > significant regression here, is this correct?
> >
> > In my runs, other more representative microbenchmarks like
> > netperf and will-it-scale.page_fault* show minimal regression. I would
> > expect practical workloads to have high concurrency of page faults or
> > networking, but maybe not fallocate/ftruncate.
> >
> > Oliver, in your experience, how often does such a regression in such a
> > microbenchmark translate to a real regression that people care about?
> > (or how often do people dismiss it?)
> >
> > I tried optimizing this further for the fallocate/ftruncate case but
> > without luck. I even tried moving stats_updates into cgroup core
> > (struct cgroup_rstat_cpu) to reuse the existing loop in
> > cgroup_rstat_updated() -- but it somehow made it worse.
> >
> > On the other hand, we do have some machines in production running this
> > series together with a previous optimization for non-hierarchical
> > stats [1] on an older kernel, and we do see significant reduction in
> > cpu time spent on reading the stats. Domenico did a similar experiment
> > with only this series and reported similar results [2].
> >
> > Shakeel, Johannes, (and other memcg folks), I personally think the
> > benefits here outweigh a regression in this particular benchmark, but
> > I am obviously biased. What do you think?
> >
> > [1]https://lore.kernel.org/lkml/[email protected]/
> > [2]https://lore.kernel.org/lkml/CAFYChMv_kv_KXOMRkrmTN-7MrfgBHMcK3YXv0dPYEL7nK77e2A@mail.gmail.com/
>
> I still am not convinced of the benefits outweighing the regression
> but I would not block this. So, let's do this, skip this open window,
> get the patch series reviewed and hopefully we can work together on
> fixing that regression and we can make an informed decision of
> accepting the regression for this series for the next cycle.
Skipping this open window sounds okay to me.
FWIW, I think with this patch series we can roughly keep the old
behavior and hide the changes behind a tunable (a config option or a
sysfs file). I think the only changes that would need to be made to
the code to approximate the previous behavior are:
- Use root when updating the pending stats in memcg_rstat_updated()
instead of the passed memcg.
- Use root in mem_cgroup_flush_stats() instead of the passed memcg.
- Use mutex_trylock() instead of mutex_lock() in mem_cgroup_flush_stats().
So I think it should be doable to hide most changes behind a tunable,
but let's not do this unless necessary.
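(For illustration only, a minimal sketch of what such a tunable could
look like. The knob name, how it is exposed, and the mutex name below
are made up for the example and are not part of the series; the
numbered comments map to the three changes listed above:)

/*
 * Hypothetical knob; how it would be exposed (Kconfig, sysfs, boot
 * parameter) is left out of this sketch.
 */
static bool memcg_legacy_flushing __read_mostly;

static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val)
{
        /* (1) account pending updates against the root, as before */
        if (memcg_legacy_flushing)
                memcg = root_mem_cgroup;

        /* ... per-memcg stats_updates accounting as in patch 3 ... */
}

void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
{
        /* (2) check the threshold and flush at the root, as before */
        if (memcg_legacy_flushing)
                memcg = root_mem_cgroup;

        /* ... threshold check elided ... */

        if (memcg_legacy_flushing) {
                /* (3) best-effort: skip if another flusher is running */
                if (!mutex_trylock(&memcg_stats_flush_mutex))
                        return;
        } else {
                mutex_lock(&memcg_stats_flush_mutex);
        }

        cgroup_rstat_flush(memcg->css.cgroup);
        mutex_unlock(&memcg_stats_flush_mutex);
}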