2020-05-07 02:57:10

by Paul E. McKenney

Subject: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

This commit adds a shrinker so as to inform RCU when memory is scarce.
RCU responds by shifting into the same fast and inefficient mode that is
used in the presence of excessive numbers of RCU callbacks. RCU remains
in this state for one-tenth of a second, though this time window can be
extended by another call to the shrinker.

If it proves feasible, a later commit might add a function call directly
indicating the end of the period of scarce memory.

Suggested-by: Al Viro <[email protected]>
Signed-off-by: Paul E. McKenney <[email protected]>
Cc: Johannes Weiner <[email protected]>

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index b0fe32f..76d148d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -2368,8 +2368,15 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
         struct rcu_data *rdp;
         struct rcu_node *rnp;

-        rcu_state.cbovld = rcu_state.cbovldnext;
+        // Load .oomovld before .oomovldend, pairing with .oomovld set.
+        rcu_state.cbovld = smp_load_acquire(&rcu_state.oomovld) || // ^^^
+                           rcu_state.cbovldnext;
         rcu_state.cbovldnext = false;
+        if (READ_ONCE(rcu_state.oomovld) &&
+            time_after(jiffies, READ_ONCE(rcu_state.oomovldend))) {
+                WRITE_ONCE(rcu_state.oomovld, false);
+                pr_info("%s: Ending OOM-mode grace periods.\n", __func__);
+        }
         rcu_for_each_leaf_node(rnp) {
                 cond_resched_tasks_rcu_qs();
                 mask = 0;
@@ -2697,6 +2704,35 @@ static void check_cb_ovld(struct rcu_data *rdp)
         raw_spin_unlock_rcu_node(rnp);
 }

+/* Return a rough count of the RCU callbacks outstanding. */
+static unsigned long rcu_oom_count(struct shrinker *unused1,
+                                   struct shrink_control *unused2)
+{
+        int cpu;
+        unsigned long ncbs = 0;
+
+        for_each_possible_cpu(cpu)
+                ncbs += rcu_get_n_cbs_cpu(cpu);
+        return ncbs;
+}
+
+/* Start up an interval of fast high-overhead grace periods. */
+static unsigned long rcu_oom_scan(struct shrinker *unused1,
+                                  struct shrink_control *unused2)
+{
+        pr_info("%s: Starting OOM-mode grace periods.\n", __func__);
+        WRITE_ONCE(rcu_state.oomovldend, jiffies + HZ / 10);
+        smp_store_release(&rcu_state.oomovld, true); // After .oomovldend
+        rcu_force_quiescent_state(); // Kick grace period
+        return 0; // We haven't actually reclaimed anything yet.
+}
+
+static struct shrinker rcu_shrinker = {
+        .count_objects = rcu_oom_count,
+        .scan_objects = rcu_oom_scan,
+        .seeks = DEFAULT_SEEKS,
+};
+
 /* Helper function for call_rcu() and friends. */
 static void
 __call_rcu(struct rcu_head *head, rcu_callback_t func)
@@ -4146,6 +4182,7 @@ void __init rcu_init(void)
                 qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
         else
                 qovld_calc = qovld;
+        WARN_ON(register_shrinker(&rcu_shrinker));
 }

 #include "tree_stall.h"
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 2d7fcb9..c4d8e96 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -326,6 +326,8 @@ struct rcu_state {
         int ncpus_snap;                         /* # CPUs seen last time. */
         u8 cbovld;                              /* Callback overload now? */
         u8 cbovldnext;                          /* ^ ^ next time? */
+        u8 oomovld;                             /* OOM overload? */
+        unsigned long oomovldend;               /* OOM ovld end, jiffies. */

         unsigned long jiffies_force_qs;         /* Time at which to invoke */
                                                 /* force_quiescent_state(). */


2020-05-07 03:17:48

by Andrew Morton

Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <[email protected]> wrote:

> This commit adds a shrinker so as to inform RCU when memory is scarce.
> RCU responds by shifting into the same fast and inefficient mode that is
> used in the presence of excessive numbers of RCU callbacks. RCU remains
> in this state for one-tenth of a second, though this time window can be
> extended by another call to the shrinker.
>
> If it proves feasible, a later commit might add a function call directly
> indicating the end of the period of scarce memory.

(Cc David Chinner, who often has opinions on shrinkers ;))

It's a bit abusive of the intent of the slab shrinkers, but I don't
immediately see a problem with it. Always returning 0 from
->scan_objects might cause a problem in some situations(?).

Perhaps we should have a formal "system getting low on memory, please
do something" notification API.

How significant is this? How much memory can RCU consume?

> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -2368,8 +2368,15 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
> struct rcu_data *rdp;
> struct rcu_node *rnp;
>
> - rcu_state.cbovld = rcu_state.cbovldnext;
> + // Load .oomovld before .oomovldend, pairing with .oomovld set.
> + rcu_state.cbovld = smp_load_acquire(&rcu_state.oomovld) || // ^^^
> + rcu_state.cbovldnext;
> rcu_state.cbovldnext = false;
> + if (READ_ONCE(rcu_state.oomovld) &&
> + time_after(jiffies, READ_ONCE(rcu_state.oomovldend))) {
> + WRITE_ONCE(rcu_state.oomovld, false);
> + pr_info("%s: Ending OOM-mode grace periods.\n", __func__);
> + }
> rcu_for_each_leaf_node(rnp) {
> cond_resched_tasks_rcu_qs();
> mask = 0;
> @@ -2697,6 +2704,35 @@ static void check_cb_ovld(struct rcu_data *rdp)
> raw_spin_unlock_rcu_node(rnp);
> }
>
> +/* Return a rough count of the RCU callbacks outstanding. */
> +static unsigned long rcu_oom_count(struct shrinker *unused1,
> + struct shrink_control *unused2)
> +{
> + int cpu;
> + unsigned long ncbs = 0;
> +
> + for_each_possible_cpu(cpu)
> + ncbs += rcu_get_n_cbs_cpu(cpu);
> + return ncbs;
> +}
> +
> +/* Start up an interval of fast high-overhead grace periods. */
> +static unsigned long rcu_oom_scan(struct shrinker *unused1,
> + struct shrink_control *unused2)
> +{
> + pr_info("%s: Starting OOM-mode grace periods.\n", __func__);
> + WRITE_ONCE(rcu_state.oomovldend, jiffies + HZ / 10);
> + smp_store_release(&rcu_state.oomovld, true); // After .oomovldend
> + rcu_force_quiescent_state(); // Kick grace period
> + return 0; // We haven't actually reclaimed anything yet.
> +}
> +
> +static struct shrinker rcu_shrinker = {
> + .count_objects = rcu_oom_count,
> + .scan_objects = rcu_oom_scan,
> + .seeks = DEFAULT_SEEKS,
> +};
> +
> /* Helper function for call_rcu() and friends. */
> static void
> __call_rcu(struct rcu_head *head, rcu_callback_t func)
> @@ -4146,6 +4182,7 @@ void __init rcu_init(void)
> qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> else
> qovld_calc = qovld;
> + WARN_ON(register_shrinker(&rcu_shrinker));
> }
>
> #include "tree_stall.h"
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index 2d7fcb9..c4d8e96 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -326,6 +326,8 @@ struct rcu_state {
> int ncpus_snap; /* # CPUs seen last time. */
> u8 cbovld; /* Callback overload now? */
> u8 cbovldnext; /* ^ ^ next time? */
> + u8 oomovld; /* OOM overload? */
> + unsigned long oomovldend; /* OOM ovld end, jiffies. */
>
> unsigned long jiffies_force_qs; /* Time at which to invoke */
> /* force_quiescent_state(). */

2020-05-07 03:26:55

by Paul E. McKenney

Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote:
> On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <[email protected]> wrote:
>
> > This commit adds a shrinker so as to inform RCU when memory is scarce.
> > RCU responds by shifting into the same fast and inefficient mode that is
> > used in the presence of excessive numbers of RCU callbacks. RCU remains
> > in this state for one-tenth of a second, though this time window can be
> > extended by another call to the shrinker.
> >
> > If it proves feasible, a later commit might add a function call directly
> > indicating the end of the period of scarce memory.
>
> (Cc David Chinner, who often has opinions on shrinkers ;))
>
> It's a bit abusive of the intent of the slab shrinkers, but I don't
> immediately see a problem with it. Always returning 0 from
> ->scan_objects might cause a problem in some situations(?).

I could just divide the total number of callbacks by 16 or some such,
if that would work better.
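For concreteness, a sketch of that variant, reusing rcu_get_n_cbs_cpu() and the
.oomovld fields from the RFC patch; the divisor of 16 is an arbitrary
placeholder, and whether a return value like this keeps the shrinker core
happy is exactly the open question:

/* Sketch only: same as rcu_oom_scan() in the RFC patch, but instead of
 * returning 0 it reports a fraction of the outstanding callbacks as
 * having been dealt with. The divisor of 16 is purely illustrative. */
static unsigned long rcu_oom_scan(struct shrinker *unused1,
                                  struct shrink_control *unused2)
{
        int cpu;
        unsigned long ncbs = 0;

        pr_info("%s: Starting OOM-mode grace periods.\n", __func__);
        WRITE_ONCE(rcu_state.oomovldend, jiffies + HZ / 10);
        smp_store_release(&rcu_state.oomovld, true); // After .oomovldend
        rcu_force_quiescent_state(); // Kick grace period
        for_each_possible_cpu(cpu)
                ncbs += rcu_get_n_cbs_cpu(cpu);
        return ncbs / 16; // Rough claim, not objects actually freed yet.
}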

> Perhaps we should have a formal "system getting low on memory, please
> do something" notification API.

That would be a very good thing to have! But from what I can see, the
shrinker interface is currently the closest approximation to such an
interface.

> How significant is this? How much memory can RCU consume?

This depends on the configuration and workload. By default, RCU starts
getting concerned if any CPU exceeds 10,000 callbacks. It is not all
-that- hard to cause RCU to have tens of millions of callbacks queued,
though some would argue that workloads doing this are rather abusive.
But at 1KB per object, ten million queued callbacks map to 10GB of storage.

But in more normal workloads, I would expect the amount of storage
awaiting an RCU grace period to not even come close to a gigabyte.

Thoughts?

Thanx, Paul

> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -2368,8 +2368,15 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
> > struct rcu_data *rdp;
> > struct rcu_node *rnp;
> >
> > - rcu_state.cbovld = rcu_state.cbovldnext;
> > + // Load .oomovld before .oomovldend, pairing with .oomovld set.
> > + rcu_state.cbovld = smp_load_acquire(&rcu_state.oomovld) || // ^^^
> > + rcu_state.cbovldnext;
> > rcu_state.cbovldnext = false;
> > + if (READ_ONCE(rcu_state.oomovld) &&
> > + time_after(jiffies, READ_ONCE(rcu_state.oomovldend))) {
> > + WRITE_ONCE(rcu_state.oomovld, false);
> > + pr_info("%s: Ending OOM-mode grace periods.\n", __func__);
> > + }
> > rcu_for_each_leaf_node(rnp) {
> > cond_resched_tasks_rcu_qs();
> > mask = 0;
> > @@ -2697,6 +2704,35 @@ static void check_cb_ovld(struct rcu_data *rdp)
> > raw_spin_unlock_rcu_node(rnp);
> > }
> >
> > +/* Return a rough count of the RCU callbacks outstanding. */
> > +static unsigned long rcu_oom_count(struct shrinker *unused1,
> > + struct shrink_control *unused2)
> > +{
> > + int cpu;
> > + unsigned long ncbs = 0;
> > +
> > + for_each_possible_cpu(cpu)
> > + ncbs += rcu_get_n_cbs_cpu(cpu);
> > + return ncbs;
> > +}
> > +
> > +/* Start up an interval of fast high-overhead grace periods. */
> > +static unsigned long rcu_oom_scan(struct shrinker *unused1,
> > + struct shrink_control *unused2)
> > +{
> > + pr_info("%s: Starting OOM-mode grace periods.\n", __func__);
> > + WRITE_ONCE(rcu_state.oomovldend, jiffies + HZ / 10);
> > + smp_store_release(&rcu_state.oomovld, true); // After .oomovldend
> > + rcu_force_quiescent_state(); // Kick grace period
> > + return 0; // We haven't actually reclaimed anything yet.
> > +}
> > +
> > +static struct shrinker rcu_shrinker = {
> > + .count_objects = rcu_oom_count,
> > + .scan_objects = rcu_oom_scan,
> > + .seeks = DEFAULT_SEEKS,
> > +};
> > +
> > /* Helper function for call_rcu() and friends. */
> > static void
> > __call_rcu(struct rcu_head *head, rcu_callback_t func)
> > @@ -4146,6 +4182,7 @@ void __init rcu_init(void)
> > qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> > else
> > qovld_calc = qovld;
> > + WARN_ON(register_shrinker(&rcu_shrinker));
> > }
> >
> > #include "tree_stall.h"
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index 2d7fcb9..c4d8e96 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -326,6 +326,8 @@ struct rcu_state {
> > int ncpus_snap; /* # CPUs seen last time. */
> > u8 cbovld; /* Callback overload now? */
> > u8 cbovldnext; /* ^ ^ next time? */
> > + u8 oomovld; /* OOM overload? */
> > + unsigned long oomovldend; /* OOM ovld end, jiffies. */
> >
> > unsigned long jiffies_force_qs; /* Time at which to invoke */
> > /* force_quiescent_state(). */

2020-05-07 15:51:12

by Paul E. McKenney

Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Thu, May 07, 2020 at 05:36:47PM +0800, Hillf Danton wrote:
>
> Hello Paul
>
> On Wed, 6 May 2020 17:42:40 Paul E. McKenney wrote:
> >
> > This commit adds a shrinker so as to inform RCU when memory is scarce.
>
> A simpler hook is added in the kswapd logic for subscribing to the
> information that memory pressure is high, and then on top of that RCU is
> made a subscriber by copying your shrinker code; I hope this makes sense
> to you.
>
> What's not yet included is making the hook per-node, which would help
> convince every reviewer that memory really is becoming tight, without the
> cost of making subscribers node-aware.
>
> Hillf

I must defer to the MM folks on the MM portion of this patch, but early
warning of impending memory pressure would be extremely good. A few
RCU-related notes inline below, though.

Thanx, Paul

> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -49,6 +49,16 @@ static inline void set_max_mapnr(unsigne
> static inline void set_max_mapnr(unsigned long limit) { }
> #endif
>
> +/* subscriber of kswapd's memory_pressure_high signal */
> +struct mph_subscriber {
> + struct list_head node;
> + void (*info) (void *data);
> + void *data;
> +};
> +
> +int mph_subscribe(struct mph_subscriber *ms);
> +void mph_unsubscribe(struct mph_subscriber *ms);
> +
> extern atomic_long_t _totalram_pages;
> static inline unsigned long totalram_pages(void)
> {
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3536,6 +3536,40 @@ static bool kswapd_shrink_node(pg_data_t
> }
>
> /*
> + * subscribers of kswapd's signal that memory pressure is high
> + */
> +static LIST_HEAD(mph_subs);
> +static DEFINE_MUTEX(mph_lock);
> +
> +int mph_subscribe(struct mph_subscriber *ms)
> +{
> + if (!ms->info)
> + return -EAGAIN;
> +
> + mutex_lock(&mph_lock);
> + list_add_tail(&ms->node, &mph_subs);
> + mutex_unlock(&mph_lock);
> + return 0;
> +}
> +
> +void mph_unsubscribe(struct mph_subscriber *ms)
> +{
> + mutex_lock(&mph_lock);
> + list_del(&ms->node);
> + mutex_unlock(&mph_lock);
> +}
> +
> +static void kswapd_bbc_mph(void)
> +{
> + struct mph_subscriber *ms;
> +
> + mutex_lock(&mph_lock);
> + list_for_each_entry(ms, &mph_subs, node)
> + ms->info(ms->data);
> + mutex_unlock(&mph_lock);
> +}
> +
> +/*
> * For kswapd, balance_pgdat() will reclaim pages across a node from zones
> * that are eligible for use by the caller until at least one zone is
> * balanced.
> @@ -3663,8 +3697,11 @@ restart:
> * If we're getting trouble reclaiming, start doing writepage
> * even in laptop mode.
> */
> - if (sc.priority < DEF_PRIORITY - 2)
> + if (sc.priority < DEF_PRIORITY - 2) {
> sc.may_writepage = 1;
> + if (sc.priority == DEF_PRIORITY - 3)
> + kswapd_bbc_mph();
> + }
>
> /* Call soft limit reclaim before calling shrink_node. */
> sc.nr_scanned = 0;
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -325,6 +325,8 @@ struct rcu_state {
> int ncpus_snap; /* # CPUs seen last time. */
> u8 cbovld; /* Callback overload now? */
> u8 cbovldnext; /* ^ ^ next time? */
> + u8 mph; /* mm pressure high signal from kswapd */
> + unsigned long mph_end; /* time stamp in jiffies */
>
> unsigned long jiffies_force_qs; /* Time at which to invoke */
> /* force_quiescent_state(). */
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -52,6 +52,7 @@
> #include <linux/kprobes.h>
> #include <linux/gfp.h>
> #include <linux/oom.h>
> +#include <linux/mm.h>
> #include <linux/smpboot.h>
> #include <linux/jiffies.h>
> #include <linux/slab.h>
> @@ -2314,8 +2315,15 @@ static void force_qs_rnp(int (*f)(struct
> struct rcu_data *rdp;
> struct rcu_node *rnp;
>
> - rcu_state.cbovld = rcu_state.cbovldnext;
> + rcu_state.cbovld = smp_load_acquire(&rcu_state.mph) ||
> + rcu_state.cbovldnext;
> rcu_state.cbovldnext = false;
> +
> + if (READ_ONCE(rcu_state.mph) &&
> + time_after(jiffies, READ_ONCE(rcu_state.mph_end))) {
> + WRITE_ONCE(rcu_state.mph, false);
> + pr_info("%s: Ending OOM-mode grace periods.\n", __func__);
> + }
> rcu_for_each_leaf_node(rnp) {
> cond_resched_tasks_rcu_qs();
> mask = 0;
> @@ -2643,6 +2651,20 @@ static void check_cb_ovld(struct rcu_dat
> raw_spin_unlock_rcu_node(rnp);
> }
>
> +static void rcu_mph_info(void *data)

This pointer will always be &rcu_state, so why not ignore the pointer
and use "rcu_state" below?

RCU grace periods are inherently global, so I don't know of any way
for RCU to focus on a given NUMA node. All or nothing. But on the
other hand, speeding up RCU grace periods will also help specific
NUMA nodes, so I believe that it is all good.

> +{
> + struct rcu_state *state = data;
> +
> + WRITE_ONCE(state->mph_end, jiffies + HZ / 10);
> + smp_store_release(&state->mph, true);
> + rcu_force_quiescent_state();
> +}
> +
> +static struct mph_subscriber rcu_mph_subscriber = {
> + .info = rcu_mph_info,
> + .data = &rcu_state,

Then this ".data" entry can be omitted, correct?

> +};
> +
> /* Helper function for call_rcu() and friends. */
> static void
> __call_rcu(struct rcu_head *head, rcu_callback_t func)
> @@ -4036,6 +4058,8 @@ void __init rcu_init(void)
> qovld_calc = DEFAULT_RCU_QOVLD_MULT * qhimark;
> else
> qovld_calc = qovld;
> +
> + mph_subscribe(&rcu_mph_subscriber);
> }
>
> #include "tree_stall.h"
>
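For illustration, the shape suggested in the two review comments above would
be roughly the following (a sketch, not an actual respin of Hillf's patch):

/* Sketch: rcu_state is the single global instance, so the callback can
 * ignore its argument and the .data field can be omitted entirely. */
static void rcu_mph_info(void *unused)
{
        WRITE_ONCE(rcu_state.mph_end, jiffies + HZ / 10);
        smp_store_release(&rcu_state.mph, true); // After .mph_end.
        rcu_force_quiescent_state();
}

static struct mph_subscriber rcu_mph_subscriber = {
        .info = rcu_mph_info,
};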

2020-05-07 17:03:28

by Johannes Weiner

Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote:
> On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <[email protected]> wrote:
>
> > This commit adds a shrinker so as to inform RCU when memory is scarce.
> > RCU responds by shifting into the same fast and inefficient mode that is
> > used in the presence of excessive numbers of RCU callbacks. RCU remains
> > in this state for one-tenth of a second, though this time window can be
> > extended by another call to the shrinker.

We may be able to use shrinkers here, but merely being invoked does
not carry a reliable distress signal.

Shrinkers get invoked whenever vmscan runs. It's a useful indicator
for when to age an auxiliary LRU list - test references, clear and
rotate or reclaim stale entries. The urgency, and what can and cannot
be considered "stale", is encoded in the callback frequency and scan
counts, and meant to be relative to the VM's own rate of aging: "I've
tested X percent of mine for recent use, now you go and test the same
share of your pool." It doesn't translate well to other
interpretations of the callbacks, although people have tried.
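For reference, the shrinker core derives each callback's scan share from
->count_objects(), the current reclaim priority, and ->seeks; the snippet
below is a paraphrase of that calculation (details vary by kernel version,
and the helper name is invented here):

/* Paraphrase of the scan-share calculation in mm/vmscan.c: the freeable
 * count reported by ->count_objects() is scaled down by the reclaim
 * priority and by ->seeks, so a shrinker is asked to age a share of its
 * pool proportional to how hard the VM itself is currently working. */
static unsigned long shrink_share(unsigned long freeable, int priority,
                                  int seeks)
{
        unsigned long delta;

        delta = freeable >> priority;   /* smaller share under light pressure */
        delta *= 4;
        delta /= seeks;                 /* DEFAULT_SEEKS is 2 */
        return delta;
}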

> > If it proves feasible, a later commit might add a function call directly
> > indicating the end of the period of scarce memory.
>
> (Cc David Chinner, who often has opinions on shrinkers ;))
>
> It's a bit abusive of the intent of the slab shrinkers, but I don't
> immediately see a problem with it. Always returning 0 from
> ->scan_objects might cause a problem in some situations(?).
>
> Perhaps we should have a formal "system getting low on memory, please
> do something" notification API.

It's tricky to find a useful definition of what low on memory
means. In the past we've used sc->priority cutoffs, the vmpressure
interface (reclaimed/scanned - reclaim efficiency cutoffs), oom
notifiers (another reclaim efficiency cutoff). But none of these
reliably capture "distress", and they vary highly between different
hardware setups. It can be hard to trigger OOM itself on fast IO
devices, even when the machine is way past useful (where useful is
somewhat subjective to the user). Userspace OOM implementations that
consider userspace health (also subjective) are getting more common.

> How significant is this? How much memory can RCU consume?

I think if rcu can end up consuming a significant share of memory, one
way that may work would be to do proper shrinker integration and track
the age of its objects relative to the age of other allocations in the
system. I.e. toss them all on a clock list with "new" bits and shrink
them at VM velocity. If the shrinker sees objects with new bit set,
clear and rotate. If it sees objects without them, we know rcu_heads
outlive cache pages etc. and should probably cycle faster too.
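As a rough illustration of that clock-list idea, with every name and
structure below invented for the example rather than proposed API:

/* Sketch of second-chance ("clock") aging driven by a shrinker: entries
 * are enqueued with the "new" bit set; the first scan clears the bit and
 * rotates the entry, and an entry seen again without the bit has outlived
 * a full lap of VM aging, which is the signal to reclaim or escalate. */
struct aged_entry {
        struct list_head clock;
        bool new;
};

static LIST_HEAD(aged_list);
static DEFINE_SPINLOCK(aged_lock);
static unsigned long aged_count;

/* Enqueue side: new objects start life with the "new" bit set. */
static void aged_add(struct aged_entry *e)
{
        e->new = true;
        spin_lock(&aged_lock);
        list_add_tail(&e->clock, &aged_list);
        aged_count++;
        spin_unlock(&aged_lock);
}

/* ->scan_objects side: age at the VM's own velocity. */
static unsigned long aged_scan(struct shrinker *shrink,
                               struct shrink_control *sc)
{
        struct aged_entry *e, *tmp;
        unsigned long nr, old = 0;

        spin_lock(&aged_lock);
        nr = min(sc->nr_to_scan, aged_count);   /* at most one lap */
        list_for_each_entry_safe(e, tmp, &aged_list, clock) {
                if (!nr--)
                        break;
                if (e->new) {
                        /* Seen for the first time: clear and rotate. */
                        e->new = false;
                        list_move_tail(&e->clock, &aged_list);
                } else {
                        /* Older than the VM's working set: a real user
                         * would reclaim or escalate here. */
                        old++;
                }
        }
        spin_unlock(&aged_lock);
        return old;
}

For RCU, a nonzero "old" count from such a scan would be the cue that
rcu_heads are outliving cache pages and grace periods should cycle faster.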

2020-05-07 17:13:16

by Paul E. McKenney

Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote:
> On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote:
> > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <[email protected]> wrote:
> >
> > > This commit adds a shrinker so as to inform RCU when memory is scarce.
> > > RCU responds by shifting into the same fast and inefficient mode that is
> > > used in the presence of excessive numbers of RCU callbacks. RCU remains
> > > in this state for one-tenth of a second, though this time window can be
> > > extended by another call to the shrinker.
>
> We may be able to use shrinkers here, but merely being invoked does
> not carry a reliable distress signal.
>
> Shrinkers get invoked whenever vmscan runs. It's a useful indicator
> for when to age an auxiliary LRU list - test references, clear and
> rotate or reclaim stale entries. The urgency, and what can and cannot
> be considered "stale", is encoded in the callback frequency and scan
> counts, and meant to be relative to the VM's own rate of aging: "I've
> tested X percent of mine for recent use, now you go and test the same
> share of your pool." It doesn't translate well to other
> interpretations of the callbacks, although people have tried.

Would it make sense for RCU to interpret two invocations within (say)
100ms of each other as indicating urgency? (Hey, I had to ask!)
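Purely for illustration, such a heuristic might look like the following
inside rcu_oom_scan(); the 100ms window and the last-invocation timestamp
are invented for the sketch:

/* Hypothetical variant of rcu_oom_scan(): only treat the callback as a
 * distress signal when the shrinker fires twice within 100ms. */
static unsigned long rcu_oom_last_scan;  /* jiffies of previous invocation */

static unsigned long rcu_oom_scan(struct shrinker *unused1,
                                  struct shrink_control *unused2)
{
        unsigned long now = jiffies;

        if (time_before(now, READ_ONCE(rcu_oom_last_scan) + HZ / 10)) {
                /* Second invocation within 100ms: assume real pressure. */
                WRITE_ONCE(rcu_state.oomovldend, now + HZ / 10);
                smp_store_release(&rcu_state.oomovld, true);
                rcu_force_quiescent_state();
        }
        WRITE_ONCE(rcu_oom_last_scan, now);
        return 0;
}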

> > > If it proves feasible, a later commit might add a function call directly
> > > indicating the end of the period of scarce memory.
> >
> > (Cc David Chinner, who often has opinions on shrinkers ;))
> >
> > It's a bit abusive of the intent of the slab shrinkers, but I don't
> > immediately see a problem with it. Always returning 0 from
> > ->scan_objects might cause a problem in some situations(?).
> >
> > Perhaps we should have a formal "system getting low on memory, please
> > do something" notification API.
>
> It's tricky to find a useful definition of what low on memory
> means. In the past we've used sc->priority cutoffs, the vmpressure
> interface (reclaimed/scanned - reclaim efficiency cutoffs), oom
> notifiers (another reclaim efficiency cutoff). But none of these
> reliably capture "distress", and they vary highly between different
> hardware setups. It can be hard to trigger OOM itself on fast IO
> devices, even when the machine is way past useful (where useful is
> somewhat subjective to the user). Userspace OOM implementations that
> consider userspace health (also subjective) are getting more common.
>
> > How significant is this? How much memory can RCU consume?
>
> I think if rcu can end up consuming a significant share of memory, one
> way that may work would be to do proper shrinker integration and track
> the age of its objects relative to the age of other allocations in the
> system. I.e. toss them all on a clock list with "new" bits and shrink
> them at VM velocity. If the shrinker sees objects with new bit set,
> clear and rotate. If it sees objects without them, we know rcu_heads
> outlive cache pages etc. and should probably cycle faster too.

It would be easy for RCU to pass back (or otherwise use) the age of the
current grace period, if that would help.

Tracking the age of individual callbacks is out of the question due to
memory overhead, but RCU could approximate this via statistical sampling.
Comparing this to grace-period durations could give information as to
whether making grace periods go faster would be helpful.
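For instance, exposing the grace-period age could be as small as the sketch
below; this is not an existing API, and it assumes a field such as
rcu_state.gp_start recording, in jiffies, when the current grace period began:

/* Sketch: rough age of the current grace period, in jiffies, for the
 * benefit of the MM layer's aging calculations. */
unsigned long rcu_current_gp_age(void)
{
        return jiffies - READ_ONCE(rcu_state.gp_start);
}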

But, yes, it would be better to have an elusive unambiguous indication
of distress. ;-)

Thanx, Paul

2020-05-07 17:31:33

by Paul E. McKenney

Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote:
> On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote:
> > On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote:
> > > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > > This commit adds a shrinker so as to inform RCU when memory is scarce.
> > > > RCU responds by shifting into the same fast and inefficient mode that is
> > > > used in the presence of excessive numbers of RCU callbacks. RCU remains
> > > > in this state for one-tenth of a second, though this time window can be
> > > > extended by another call to the shrinker.
> >
> > We may be able to use shrinkers here, but merely being invoked does
> > not carry a reliable distress signal.
> >
> > Shrinkers get invoked whenever vmscan runs. It's a useful indicator
> > for when to age an auxiliary LRU list - test references, clear and
> > rotate or reclaim stale entries. The urgency, and what can and cannot
> > be considered "stale", is encoded in the callback frequency and scan
> > counts, and meant to be relative to the VM's own rate of aging: "I've
> > tested X percent of mine for recent use, now you go and test the same
> > share of your pool." It doesn't translate well to other
> > interpretations of the callbacks, although people have tried.
>
> Would it make sense for RCU to interpret two invocations within (say)
> 100ms of each other as indicating urgency? (Hey, I had to ask!)
>
> > > > If it proves feasible, a later commit might add a function call directly
> > > > indicating the end of the period of scarce memory.
> > >
> > > (Cc David Chinner, who often has opinions on shrinkers ;))
> > >
> > > It's a bit abusive of the intent of the slab shrinkers, but I don't
> > > immediately see a problem with it. Always returning 0 from
> > > ->scan_objects might cause a problem in some situations(?).
> > >
> > > Perhaps we should have a formal "system getting low on memory, please
> > > do something" notification API.
> >
> > It's tricky to find a useful definition of what low on memory
> > means. In the past we've used sc->priority cutoffs, the vmpressure
> > interface (reclaimed/scanned - reclaim efficiency cutoffs), oom
> > notifiers (another reclaim efficiency cutoff). But none of these
> > reliably capture "distress", and they vary highly between different
> > hardware setups. It can be hard to trigger OOM itself on fast IO
> > devices, even when the machine is way past useful (where useful is
> > somewhat subjective to the user). Userspace OOM implementations that
> > consider userspace health (also subjective) are getting more common.
> >
> > > How significant is this? How much memory can RCU consume?
> >
> > I think if rcu can end up consuming a significant share of memory, one
> > way that may work would be to do proper shrinker integration and track
> > the age of its objects relative to the age of other allocations in the
> > system. I.e. toss them all on a clock list with "new" bits and shrink
> > them at VM velocity. If the shrinker sees objects with new bit set,
> > clear and rotate. If it sees objects without them, we know rcu_heads
> > outlive cache pages etc. and should probably cycle faster too.
>
> It would be easy for RCU to pass back (or otherwise use) the age of the
> current grace period, if that would help.
>
> Tracking the age of individual callbacks is out of the question due to
> memory overhead, but RCU could approximate this via statistical sampling.
> Comparing this to grace-period durations could give information as to
> whether making grace periods go faster would be helpful.
>
> But, yes, it would be better to have an elusive unambiguous indication
> of distress. ;-)

And I have dropped this patch for the time being, but I do hope that
it served a purpose in illustrating that it is not difficult to put RCU
into a fast-but-inefficient mode when needed.

Thanx, Paul

2020-05-07 18:34:09

by Johannes Weiner

Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote:
> On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote:
> > On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote:
> > > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <[email protected]> wrote:
> > >
> > > > This commit adds a shrinker so as to inform RCU when memory is scarce.
> > > > RCU responds by shifting into the same fast and inefficient mode that is
> > > > used in the presence of excessive numbers of RCU callbacks. RCU remains
> > > > in this state for one-tenth of a second, though this time window can be
> > > > extended by another call to the shrinker.
> >
> > We may be able to use shrinkers here, but merely being invoked does
> > not carry a reliable distress signal.
> >
> > Shrinkers get invoked whenever vmscan runs. It's a useful indicator
> > for when to age an auxiliary LRU list - test references, clear and
> > rotate or reclaim stale entries. The urgency, and what can and cannot
> > be considered "stale", is encoded in the callback frequency and scan
> > counts, and meant to be relative to the VM's own rate of aging: "I've
> > tested X percent of mine for recent use, now you go and test the same
> > share of your pool." It doesn't translate well to other
> > interpretations of the callbacks, although people have tried.
>
> Would it make sense for RCU to interpret two invocations within (say)
> 100ms of each other as indicating urgency? (Hey, I had to ask!)

It's the perfect number for one combination of CPU, storage device,
and shrinker implementation :-)

> > > > If it proves feasible, a later commit might add a function call directly
> > > > indicating the end of the period of scarce memory.
> > >
> > > (Cc David Chinner, who often has opinions on shrinkers ;))
> > >
> > > It's a bit abusive of the intent of the slab shrinkers, but I don't
> > > immediately see a problem with it. Always returning 0 from
> > > ->scan_objects might cause a problem in some situations(?).
> > >
> > > Perhaps we should have a formal "system getting low on memory, please
> > > do something" notification API.
> >
> > It's tricky to find a useful definition of what low on memory
> > means. In the past we've used sc->priority cutoffs, the vmpressure
> > interface (reclaimed/scanned - reclaim efficiency cutoffs), oom
> > notifiers (another reclaim efficiency cutoff). But none of these
> > reliably capture "distress", and they vary highly between different
> > hardware setups. It can be hard to trigger OOM itself on fast IO
> > devices, even when the machine is way past useful (where useful is
> > somewhat subjective to the user). Userspace OOM implementations that
> > consider userspace health (also subjective) are getting more common.
> >
> > > How significant is this? How much memory can RCU consume?
> >
> > I think if rcu can end up consuming a significant share of memory, one
> > way that may work would be to do proper shrinker integration and track
> > the age of its objects relative to the age of other allocations in the
> > system. I.e. toss them all on a clock list with "new" bits and shrink
> > them at VM velocity. If the shrinker sees objects with new bit set,
> > clear and rotate. If it sees objects without them, we know rcu_heads
> > outlive cache pages etc. and should probably cycle faster too.
>
> It would be easy for RCU to pass back (or otherwise use) the age of the
> current grace period, if that would help.
>
> Tracking the age of individual callbacks is out of the question due to
> memory overhead, but RCU could approximate this via statistical sampling.
> Comparing this to grace-period durations could give information as to
> whether making grace periods go faster would be helpful.

That makes sense.

So RCU knows the time and the VM knows the amount of memory. Either
RCU needs to figure out its memory component to be able to translate
shrinker input to age, or the VM needs to learn about time to be able
to say: I'm currently scanning memory older than timestamp X.

The latter would also require sampling in the VM. Nose goes. :-)

There actually is prior art for teaching reclaim about time:
https://lore.kernel.org/linux-mm/20130430110214.22179.26139.stgit@zurg/

CCing Konstantin. I'm curious how widely this ended up being used and
how reliably it worked.

> But, yes, it would be better to have an elusive unambiguous indication
> of distress. ;-)

I agree. Preferably something more practical than a dialogue box
asking the user on how well things are going for them :-)

2020-05-07 19:10:52

by Paul E. McKenney

Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Thu, May 07, 2020 at 02:31:02PM -0400, Johannes Weiner wrote:
> On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote:
> > On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote:
> > > On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote:
> > > > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <[email protected]> wrote:
> > > >
> > > > > This commit adds a shrinker so as to inform RCU when memory is scarce.
> > > > > RCU responds by shifting into the same fast and inefficient mode that is
> > > > > used in the presence of excessive numbers of RCU callbacks. RCU remains
> > > > > in this state for one-tenth of a second, though this time window can be
> > > > > extended by another call to the shrinker.
> > >
> > > We may be able to use shrinkers here, but merely being invoked does
> > > not carry a reliable distress signal.
> > >
> > > Shrinkers get invoked whenever vmscan runs. It's a useful indicator
> > > for when to age an auxiliary LRU list - test references, clear and
> > > rotate or reclaim stale entries. The urgency, and what can and cannot
> > > be considered "stale", is encoded in the callback frequency and scan
> > > counts, and meant to be relative to the VM's own rate of aging: "I've
> > > tested X percent of mine for recent use, now you go and test the same
> > > share of your pool." It doesn't translate well to other
> > > interpretations of the callbacks, although people have tried.
> >
> > Would it make sense for RCU to interpret two invocations within (say)
> > 100ms of each other as indicating urgency? (Hey, I had to ask!)
>
> It's the perfect number for one combination of CPU, storage device,
> and shrinker implementation :-)

Woo-hoo!!!

But is that one combination actually in use anywhere? ;-)

> > > > > If it proves feasible, a later commit might add a function call directly
> > > > > indicating the end of the period of scarce memory.
> > > >
> > > > (Cc David Chinner, who often has opinions on shrinkers ;))
> > > >
> > > > It's a bit abusive of the intent of the slab shrinkers, but I don't
> > > > immediately see a problem with it. Always returning 0 from
> > > > ->scan_objects might cause a problem in some situations(?).
> > > >
> > > > Perhaps we should have a formal "system getting low on memory, please
> > > > do something" notification API.
> > >
> > > It's tricky to find a useful definition of what low on memory
> > > means. In the past we've used sc->priority cutoffs, the vmpressure
> > > interface (reclaimed/scanned - reclaim efficiency cutoffs), oom
> > > notifiers (another reclaim efficiency cutoff). But none of these
> > > reliably capture "distress", and they vary highly between different
> > > hardware setups. It can be hard to trigger OOM itself on fast IO
> > > devices, even when the machine is way past useful (where useful is
> > > somewhat subjective to the user). Userspace OOM implementations that
> > > consider userspace health (also subjective) are getting more common.
> > >
> > > > How significant is this? How much memory can RCU consume?
> > >
> > > I think if rcu can end up consuming a significant share of memory, one
> > > way that may work would be to do proper shrinker integration and track
> > > the age of its objects relative to the age of other allocations in the
> > > system. I.e. toss them all on a clock list with "new" bits and shrink
> > > them at VM velocity. If the shrinker sees objects with new bit set,
> > > clear and rotate. If it sees objects without them, we know rcu_heads
> > > outlive cache pages etc. and should probably cycle faster too.
> >
> > It would be easy for RCU to pass back (or otherwise use) the age of the
> > current grace period, if that would help.
> >
> > Tracking the age of individual callbacks is out of the question due to
> > memory overhead, but RCU could approximate this via statistical sampling.
> > Comparing this to grace-period durations could give information as to
> > whether making grace periods go faster would be helpful.
>
> That makes sense.
>
> So RCU knows the time and the VM knows the amount of memory. Either
> RCU needs to figure out its memory component to be able to translate
> shrinker input to age, or the VM needs to learn about time to be able
> to say: I'm currently scanning memory older than timestamp X.
>
> The latter would also require sampling in the VM. Nose goes. :-)

Sounds about right. ;-)

Does reclaim have any notion of having continuously scanned for
longer than some amount of time? Or could RCU reasonably deduce this?
For example, if RCU noticed that reclaim had been scanning for longer than
(say) five grace periods, RCU might decide to speed things up.

But on the other hand, with slow disks, reclaim might go on for tens of
seconds even without much in the way of memory pressure, mightn't it?

I suppose that another indicator would be recent NULL returns from
allocators. But that indicator flashes a bit later than one would like,
doesn't it? And has false positives when allocators are invoked from
atomic contexts, no doubt. And no doubt similar for sleeping more than
a certain length of time in an allocator.

> There actually is prior art for teaching reclaim about time:
> https://lore.kernel.org/linux-mm/20130430110214.22179.26139.stgit@zurg/
>
> CCing Konstantin. I'm curious how widely this ended up being used and
> how reliably it worked.

Looking forward to hearing of any results!

> > But, yes, it would be better to have an elusive unambiguous indication
> > of distress. ;-)
>
> I agree. Preferably something more practical than a dialogue box
> asking the user on how well things are going for them :-)

Indeed, that dialog box should be especially useful for things like
light bulbs running Linux. ;-)

Thanx, Paul

2020-05-08 09:05:29

by Konstantin Khlebnikov

Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On 07/05/2020 22.09, Paul E. McKenney wrote:
> On Thu, May 07, 2020 at 02:31:02PM -0400, Johannes Weiner wrote:
>> On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote:
>>> On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote:
>>>> On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote:
>>>>> On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <[email protected]> wrote:
>>>>>
>>>>>> This commit adds a shrinker so as to inform RCU when memory is scarce.
>>>>>> RCU responds by shifting into the same fast and inefficient mode that is
>>>>>> used in the presence of excessive numbers of RCU callbacks. RCU remains
>>>>>> in this state for one-tenth of a second, though this time window can be
>>>>>> extended by another call to the shrinker.
>>>>
>>>> We may be able to use shrinkers here, but merely being invoked does
>>>> not carry a reliable distress signal.
>>>>
>>>> Shrinkers get invoked whenever vmscan runs. It's a useful indicator
>>>> for when to age an auxiliary LRU list - test references, clear and
>>>> rotate or reclaim stale entries. The urgency, and what can and cannot
>>>> be considered "stale", is encoded in the callback frequency and scan
>>>> counts, and meant to be relative to the VM's own rate of aging: "I've
>>>> tested X percent of mine for recent use, now you go and test the same
>>>> share of your pool." It doesn't translate well to other
>>>> interpretations of the callbacks, although people have tried.
>>>
>>> Would it make sense for RCU to interpret two invocations within (say)
>>> 100ms of each other as indicating urgency? (Hey, I had to ask!)
>>
>> It's the perfect number for one combination of CPU, storage device,
>> and shrinker implementation :-)
>
> Woo-hoo!!!
>
> But is that one combination actually in use anywhere? ;-)
>
>>>>>> If it proves feasible, a later commit might add a function call directly
>>>>>> indicating the end of the period of scarce memory.
>>>>>
>>>>> (Cc David Chinner, who often has opinions on shrinkers ;))
>>>>>
>>>>> It's a bit abusive of the intent of the slab shrinkers, but I don't
>>>>> immediately see a problem with it. Always returning 0 from
>>>>> ->scan_objects might cause a problem in some situations(?).
>>>>>
>>>>> Perhaps we should have a formal "system getting low on memory, please
>>>>> do something" notification API.
>>>>
>>>> It's tricky to find a useful definition of what low on memory
>>>> means. In the past we've used sc->priority cutoffs, the vmpressure
>>>> interface (reclaimed/scanned - reclaim efficiency cutoffs), oom
>>>> notifiers (another reclaim efficiency cutoff). But none of these
>>>> reliably capture "distress", and they vary highly between different
>>>> hardware setups. It can be hard to trigger OOM itself on fast IO
>>>> devices, even when the machine is way past useful (where useful is
>>>> somewhat subjective to the user). Userspace OOM implementations that
>>>> consider userspace health (also subjective) are getting more common.
>>>>
>>>>> How significant is this? How much memory can RCU consume?
>>>>
>>>> I think if rcu can end up consuming a significant share of memory, one
>>>> way that may work would be to do proper shrinker integration and track
>>>> the age of its objects relative to the age of other allocations in the
>>>> system. I.e. toss them all on a clock list with "new" bits and shrink
>>>> them at VM velocity. If the shrinker sees objects with new bit set,
>>>> clear and rotate. If it sees objects without them, we know rcu_heads
>>>> outlive cache pages etc. and should probably cycle faster too.
>>>
>>> It would be easy for RCU to pass back (or otherwise use) the age of the
>>> current grace period, if that would help.
>>>
>>> Tracking the age of individual callbacks is out of the question due to
>>> memory overhead, but RCU could approximate this via statistical sampling.
>>> Comparing this to grace-period durations could give information as to
>>> whether making grace periods go faster would be helpful.
>>
>> That makes sense.
>>
>> So RCU knows the time and the VM knows the amount of memory. Either
>> RCU needs to figure out its memory component to be able to translate
>> shrinker input to age, or the VM needs to learn about time to be able
>> to say: I'm currently scanning memory older than timestamp X.
>>
>> The latter would also require sampling in the VM. Nose goes. :-)
>
> Sounds about right. ;-)
>
> Does reclaim have any notion of having continuously scanned for
> longer than some amount of time? Or could RCU reasonably deduce this?
> For example, if RCU noticed that reclaim had been scanning for longer than
> (say) five grace periods, RCU might decide to speed things up.
>
> But on the other hand, with slow disks, reclaim might go on for tens of
> seconds even without much in the way of memory pressure, mightn't it?
>
> I suppose that another indicator would be recent NULL returns from
> allocators. But that indicator flashes a bit later than one would like,
> doesn't it? And has false positives when allocators are invoked from
> atomic contexts, no doubt. And no doubt similar for sleeping more than
> a certain length of time in an allocator.
>
>> There actually is prior art for teaching reclaim about time:
>> https://lore.kernel.org/linux-mm/20130430110214.22179.26139.stgit@zurg/
>>
>> CCing Konstantin. I'm curious how widely this ended up being used and
>> how reliably it worked.
>
> Looking forward to hearing of any results!

Well, that was an experiment in automatically steering memory pressure
between containers. The LRU timings from milestones themselves worked pretty
well. The rest of the engine was more robust than mainline cgroups are these
days. Memory has become much cheaper - I hope nobody wants to overcommit it
that badly anymore.

It seems modern MM has plenty of signals about memory pressure.
Kswapd should have enough knowledge to switch gears in RCU.

>
>>> But, yes, it would be better to have an elusive unambiguous indication
>>> of distress. ;-)
>>
>> I agree. Preferably something more practical than a dialogue box
>> asking the user on how well things are going for them :-)
>
> Indeed, that dialog box should be especially useful for things like
> light bulbs running Linux. ;-)
>
> Thanx, Paul
>

2020-05-08 14:49:18

by Paul E. McKenney

Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Fri, May 08, 2020 at 12:00:28PM +0300, Konstantin Khlebnikov wrote:
> On 07/05/2020 22.09, Paul E. McKenney wrote:
> > On Thu, May 07, 2020 at 02:31:02PM -0400, Johannes Weiner wrote:
> > > On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote:
> > > > On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote:
> > > > > On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote:
> > > > > > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <[email protected]> wrote:
> > > > > >
> > > > > > > This commit adds a shrinker so as to inform RCU when memory is scarce.
> > > > > > > RCU responds by shifting into the same fast and inefficient mode that is
> > > > > > > used in the presence of excessive numbers of RCU callbacks. RCU remains
> > > > > > > in this state for one-tenth of a second, though this time window can be
> > > > > > > extended by another call to the shrinker.
> > > > >
> > > > > We may be able to use shrinkers here, but merely being invoked does
> > > > > not carry a reliable distress signal.
> > > > >
> > > > > Shrinkers get invoked whenever vmscan runs. It's a useful indicator
> > > > > for when to age an auxiliary LRU list - test references, clear and
> > > > > rotate or reclaim stale entries. The urgency, and what can and cannot
> > > > > be considered "stale", is encoded in the callback frequency and scan
> > > > > counts, and meant to be relative to the VM's own rate of aging: "I've
> > > > > tested X percent of mine for recent use, now you go and test the same
> > > > > share of your pool." It doesn't translate well to other
> > > > > interpretations of the callbacks, although people have tried.
> > > >
> > > > Would it make sense for RCU to interpret two invocations within (say)
> > > > 100ms of each other as indicating urgency? (Hey, I had to ask!)
> > >
> > > It's the perfect number for one combination of CPU, storage device,
> > > and shrinker implementation :-)
> >
> > Woo-hoo!!!
> >
> > But is that one combination actually in use anywhere? ;-)
> >
> > > > > > > If it proves feasible, a later commit might add a function call directly
> > > > > > > indicating the end of the period of scarce memory.
> > > > > >
> > > > > > (Cc David Chinner, who often has opinions on shrinkers ;))
> > > > > >
> > > > > > It's a bit abusive of the intent of the slab shrinkers, but I don't
> > > > > > immediately see a problem with it. Always returning 0 from
> > > > > > ->scan_objects might cause a problem in some situations(?).
> > > > > >
> > > > > > Perhaps we should have a formal "system getting low on memory, please
> > > > > > do something" notification API.
> > > > >
> > > > > It's tricky to find a useful definition of what low on memory
> > > > > means. In the past we've used sc->priority cutoffs, the vmpressure
> > > > > interface (reclaimed/scanned - reclaim efficiency cutoffs), oom
> > > > > notifiers (another reclaim efficiency cutoff). But none of these
> > > > > reliably capture "distress", and they vary highly between different
> > > > > hardware setups. It can be hard to trigger OOM itself on fast IO
> > > > > devices, even when the machine is way past useful (where useful is
> > > > > somewhat subjective to the user). Userspace OOM implementations that
> > > > > consider userspace health (also subjective) are getting more common.
> > > > >
> > > > > > How significant is this? How much memory can RCU consume?
> > > > >
> > > > > I think if rcu can end up consuming a significant share of memory, one
> > > > > way that may work would be to do proper shrinker integration and track
> > > > > the age of its objects relative to the age of other allocations in the
> > > > > system. I.e. toss them all on a clock list with "new" bits and shrink
> > > > > them at VM velocity. If the shrinker sees objects with new bit set,
> > > > > clear and rotate. If it sees objects without them, we know rcu_heads
> > > > > outlive cache pages etc. and should probably cycle faster too.
> > > >
> > > > It would be easy for RCU to pass back (or otherwise use) the age of the
> > > > current grace period, if that would help.
> > > >
> > > > Tracking the age of individual callbacks is out of the question due to
> > > > memory overhead, but RCU could approximate this via statistical sampling.
> > > > Comparing this to grace-period durations could give information as to
> > > > whether making grace periods go faster would be helpful.
> > >
> > > That makes sense.
> > >
> > > So RCU knows the time and the VM knows the amount of memory. Either
> > > RCU needs to figure out its memory component to be able to translate
> > > shrinker input to age, or the VM needs to learn about time to be able
> > > to say: I'm currently scanning memory older than timestamp X.
> > >
> > > The latter would also require sampling in the VM. Nose goes. :-)
> >
> > Sounds about right. ;-)
> >
> > Does reclaim have any notion of having continuously scanned for
> > longer than some amount of time? Or could RCU reasonably deduce this?
> > For example, if RCU noticed that reclaim had been scanning for longer than
> > (say) five grace periods, RCU might decide to speed things up.
> >
> > But on the other hand, with slow disks, reclaim might go on for tens of
> > seconds even without much in the way of memory pressure, mightn't it?
> >
> > I suppose that another indicator would be recent NULL returns from
> > allocators. But that indicator flashes a bit later than one would like,
> > doesn't it? And has false positives when allocators are invoked from
> > atomic contexts, no doubt. And no doubt similar for sleeping more than
> > a certain length of time in an allocator.
> >
> > > There actually is prior art for teaching reclaim about time:
> > > https://lore.kernel.org/linux-mm/20130430110214.22179.26139.stgit@zurg/
> > >
> > > CCing Konstantin. I'm curious how widely this ended up being used and
> > > how reliably it worked.
> >
> > Looking forward to hearing of any results!
>
> Well, that was an experiment in automatically steering memory pressure
> between containers. The LRU timings from milestones themselves worked pretty
> well. The rest of the engine was more robust than mainline cgroups are these
> days. Memory has become much cheaper - I hope nobody wants to overcommit it
> that badly anymore.
>
> It seems modern MM has plenty of signals about memory pressure.
> Kswapd should have enough knowledge to switch gears in RCU.

Easy for me to provide "start fast and inefficient mode" and "stop fast
and inefficient mode" APIs for MM to call!

How about rcu_mempressure_start() and rcu_mempressure_end()? I would
expect them not to nest (as in if you need them to nest, please let
me know). I would not expect these to be invoked all that often (as in
if you do need them to be fast and scalable, please let me know).

RCU would then be in fast/inefficient mode if either MM told it to be
or if RCU had detected callback overload on at least one CPU.
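A minimal sketch of that pair, reusing the .oomovld machinery from the RFC
patch; the names follow the proposal above, and none of this is actual
kernel code:

/* Sketch: MM-driven entry into fast/inefficient grace-period mode. */
void rcu_mempressure_start(void)
{
        /* Push the deadline far out; _end() pulls it back in. */
        WRITE_ONCE(rcu_state.oomovldend, jiffies + MAX_JIFFY_OFFSET);
        smp_store_release(&rcu_state.oomovld, true); // After .oomovldend.
        rcu_force_quiescent_state();
}

/* Sketch: MM-driven exit; the next force_qs_rnp() scan notices the
 * expired deadline and clears .oomovld. */
void rcu_mempressure_end(void)
{
        WRITE_ONCE(rcu_state.oomovldend, jiffies - 1);
}

Kswapd (or a hook like Hillf's kswapd_bbc_mph()) could then call the first
of these when reclaim starts struggling and the second once watermarks
recover.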

Seem reasonable?

Thanx, Paul

> > > > But, yes, it would be better to have an elusive unambiguous indication
> > > > of distress. ;-)
> > >
> > > I agree. Preferably something more practical than a dialogue box
> > > asking the user on how well things are going for them :-)
> >
> > Indeed, that dialog box should be especially useful for things like
> > light bulbs running Linux. ;-)
> >
> > Thanx, Paul
> >

2020-05-08 14:51:25

by Paul E. McKenney

Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Fri, May 08, 2020 at 09:37:43PM +0800, Hillf Danton wrote:
>
> On Thu, 7 May 2020 08:49:10 Paul E. McKenney wrote:
> >
> > > +static void rcu_mph_info(void *data)
> >
> > This pointer will always be &rcu_state, so why not ignore the pointer
> > and use "rcu_state" below?
> >
> Yes you're right.
>
> > RCU grace periods are inherently global, so I don't know of any way
> > for RCU to focus on a given NUMA node. All or nothing.
>
> Or is it feasible to expose some RCU mechanism to the VM, say, one with
> which kswapd can kick off a grace period every time the kthreads think it
> is needed? That way the work of gauging memory pressure can be taken off
> RCU's shoulders.

A pair of functions RCU provides is easy for me. ;-)

Thanx, Paul

> > But on the
> > other hand, speeding up RCU grace periods will also help specific
> > NUMA nodes, so I believe that it is all good.
> >
> > > +{
> > > + struct rcu_state *state = data;
> > > +
> > > + WRITE_ONCE(state->mph_end, jiffies + HZ / 10);
> > > + smp_store_release(&state->mph, true);
> > > + rcu_force_quiescent_state();
> > > +}
> > > +
> > > +static struct mph_subscriber rcu_mph_subscriber = {
> > > + .info = rcu_mph_info,
> > > + .data = &rcu_state,
> >
> > Then this ".data" entry can be omitted, correct?
>
> Yes :)
>
> Hillf
>

2020-05-09 08:56:47

by Konstantin Khlebnikov

Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On 08/05/2020 17.46, Paul E. McKenney wrote:
> On Fri, May 08, 2020 at 12:00:28PM +0300, Konstantin Khlebnikov wrote:
>> On 07/05/2020 22.09, Paul E. McKenney wrote:
>>> On Thu, May 07, 2020 at 02:31:02PM -0400, Johannes Weiner wrote:
>>>> On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote:
>>>>> On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote:
>>>>>> On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote:
>>>>>>> On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <[email protected]> wrote:
>>>>>>>
>>>>>>>> This commit adds a shrinker so as to inform RCU when memory is scarce.
>>>>>>>> RCU responds by shifting into the same fast and inefficient mode that is
>>>>>>>> used in the presence of excessive numbers of RCU callbacks. RCU remains
>>>>>>>> in this state for one-tenth of a second, though this time window can be
>>>>>>>> extended by another call to the shrinker.
>>>>>>
>>>>>> We may be able to use shrinkers here, but merely being invoked does
>>>>>> not carry a reliable distress signal.
>>>>>>
>>>>>> Shrinkers get invoked whenever vmscan runs. It's a useful indicator
>>>>>> for when to age an auxiliary LRU list - test references, clear and
>>>>>> rotate or reclaim stale entries. The urgency, and what can and cannot
>>>>>> be considered "stale", is encoded in the callback frequency and scan
>>>>>> counts, and meant to be relative to the VM's own rate of aging: "I've
>>>>>> tested X percent of mine for recent use, now you go and test the same
>>>>>> share of your pool." It doesn't translate well to other
>>>>>> interpretations of the callbacks, although people have tried.
>>>>>
>>>>> Would it make sense for RCU to interpret two invocations within (say)
>>>>> 100ms of each other as indicating urgency? (Hey, I had to ask!)
>>>>
>>>> It's the perfect number for one combination of CPU, storage device,
>>>> and shrinker implementation :-)
>>>
>>> Woo-hoo!!!
>>>
>>> But is that one combination actually in use anywhere? ;-)
>>>
>>>>>>>> If it proves feasible, a later commit might add a function call directly
>>>>>>>> indicating the end of the period of scarce memory.
>>>>>>>
>>>>>>> (Cc David Chinner, who often has opinions on shrinkers ;))
>>>>>>>
>>>>>>> It's a bit abusive of the intent of the slab shrinkers, but I don't
>>>>>>> immediately see a problem with it. Always returning 0 from
>>>>>>> ->scan_objects might cause a problem in some situations(?).
>>>>>>>
>>>>>>> Perhaps we should have a formal "system getting low on memory, please
>>>>>>> do something" notification API.
>>>>>>
>>>>>> It's tricky to find a useful definition of what low on memory
>>>>>> means. In the past we've used sc->priority cutoffs, the vmpressure
>>>>>> interface (reclaimed/scanned - reclaim efficiency cutoffs), oom
>>>>>> notifiers (another reclaim efficiency cutoff). But none of these
>>>>>> reliably capture "distress", and they vary highly between different
>>>>>> hardware setups. It can be hard to trigger OOM itself on fast IO
>>>>>> devices, even when the machine is way past useful (where useful is
>>>>>> somewhat subjective to the user). Userspace OOM implementations that
>>>>>> consider userspace health (also subjective) are getting more common.
>>>>>>
>>>>>>> How significant is this? How much memory can RCU consume?
>>>>>>
>>>>>> I think if rcu can end up consuming a significant share of memory, one
>>>>>> way that may work would be to do proper shrinker integration and track
>>>>>> the age of its objects relative to the age of other allocations in the
>>>>>> system. I.e. toss them all on a clock list with "new" bits and shrink
>>>>>> them at VM velocity. If the shrinker sees objects with new bit set,
>>>>>> clear and rotate. If it sees objects without them, we know rcu_heads
>>>>>> outlive cache pages etc. and should probably cycle faster too.
>>>>>
>>>>> It would be easy for RCU to pass back (or otherwise use) the age of the
>>>>> current grace period, if that would help.
>>>>>
>>>>> Tracking the age of individual callbacks is out of the question due to
>>>>> memory overhead, but RCU could approximate this via statistical sampling.
>>>>> Comparing this to grace-period durations could give information as to
>>>>> whether making grace periods go faster would be helpful.
>>>>
>>>> That makes sense.
>>>>
>>>> So RCU knows the time and the VM knows the amount of memory. Either
>>>> RCU needs to figure out its memory component to be able to translate
>>>> shrinker input to age, or the VM needs to learn about time to be able
>>>> to say: I'm currently scanning memory older than timestamp X.
>>>>
>>>> The latter would also require sampling in the VM. Nose goes. :-)
>>>
>>> Sounds about right. ;-)
>>>
>>> Does reclaim have any notion of having continuously scanned for
>>> longer than some amount of time? Or could RCU reasonably deduce this?
>>> For example, if RCU noticed that reclaim had been scanning for longer than
>>> (say) five grace periods, RCU might decide to speed things up.
>>>
>>> But on the other hand, with slow disks, reclaim might go on for tens of
>>> seconds even without much in the way of memory pressure, mightn't it?
>>>
>>> I suppose that another indicator would be recent NULL returns from
>>> allocators. But that indicator flashes a bit later than one would like,
>>> doesn't it? And has false positives when allocators are invoked from
>>> atomic contexts, no doubt. And no doubt similar for sleeping more than
>>> a certain length of time in an allocator.
>>>
>>>> There actually is prior art for teaching reclaim about time:
>>>> https://lore.kernel.org/linux-mm/20130430110214.22179.26139.stgit@zurg/
>>>>
>>>> CCing Konstantin. I'm curious how widely this ended up being used and
>>>> how reliably it worked.
>>>
>>> Looking forward to hearing of any results!
>>
>> Well, that was an experiment in automatically steering memory pressure
>> between containers. The LRU timings from milestones themselves worked pretty well.
>> The remaining engine was more robust than mainline cgroups are these days.
>> Memory has become much cheaper; I hope nobody wants to overcommit it that badly anymore.
>>
>> It seems modern MM has plenty of signals about memory pressure.
>> Kswapd should have enough knowledge to switch gears in RCU.
>
> Easy for me to provide "start fast and inefficient mode" and "stop fast
> and inefficient mode" APIs for MM to call!
>
> How about rcu_mempressure_start() and rcu_mempressure_end()? I would
> expect them not to nest (as in if you need them to nest, please let
> me know). I would not expect these to be invoked all that often (as in
> if you do need them to be fast and scalable, please let me know).
>
> RCU would then be in fast/inefficient mode if either MM told it to be
> or if RCU had detected callback overload on at least one CPU.
>
> Seem reasonable?

Not exactly nested calls, but kswapd threads are per NUMA node.
So, at some level, the nodes under pressure must be counted.

Also, forcing RCU calls only for the CPUs in one NUMA node might be useful.


I wonder if direct reclaim should at some stage simply wait for an RCU QS,
i.e. call rcu_barrier() or similar somewhere before invoking the OOM killer.

All GFP_NOFAIL users should allow direct reclaim, so this loop
in page_alloc shouldn't block RCU and doesn't need special care.

>
> Thanx, Paul
>
>>>>> But, yes, it would be better to have an elusive unambiguous indication
>>>>> of distress. ;-)
>>>>
>>>> I agree. Preferably something more practical than a dialogue box
>>>> asking the user how well things are going for them :-)
>>>
>>> Indeed, that dialog box should be especially useful for things like
>>> light bulbs running Linux. ;-)
>>>
>>> Thanx, Paul
>>>

2020-05-09 16:11:04

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote:
> On 08/05/2020 17.46, Paul E. McKenney wrote:
> > On Fri, May 08, 2020 at 12:00:28PM +0300, Konstantin Khlebnikov wrote:
> > > On 07/05/2020 22.09, Paul E. McKenney wrote:
> > > > On Thu, May 07, 2020 at 02:31:02PM -0400, Johannes Weiner wrote:
> > > > > On Thu, May 07, 2020 at 10:09:03AM -0700, Paul E. McKenney wrote:
> > > > > > On Thu, May 07, 2020 at 01:00:06PM -0400, Johannes Weiner wrote:
> > > > > > > On Wed, May 06, 2020 at 05:55:35PM -0700, Andrew Morton wrote:
> > > > > > > > On Wed, 6 May 2020 17:42:40 -0700 "Paul E. McKenney" <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > This commit adds a shrinker so as to inform RCU when memory is scarce.
> > > > > > > > > RCU responds by shifting into the same fast and inefficient mode that is
> > > > > > > > > used in the presence of excessive numbers of RCU callbacks. RCU remains
> > > > > > > > > in this state for one-tenth of a second, though this time window can be
> > > > > > > > > extended by another call to the shrinker.
> > > > > > >
> > > > > > > We may be able to use shrinkers here, but merely being invoked does
> > > > > > > not carry a reliable distress signal.
> > > > > > >
> > > > > > > Shrinkers get invoked whenever vmscan runs. It's a useful indicator
> > > > > > > for when to age an auxiliary LRU list - test references, clear and
> > > > > > > rotate or reclaim stale entries. The urgency, and what can and cannot
> > > > > > > be considered "stale", is encoded in the callback frequency and scan
> > > > > > > counts, and meant to be relative to the VM's own rate of aging: "I've
> > > > > > > tested X percent of mine for recent use, now you go and test the same
> > > > > > > share of your pool." It doesn't translate well to other
> > > > > > > interpretations of the callbacks, although people have tried.
> > > > > >
> > > > > > Would it make sense for RCU to interpret two invocations within (say)
> > > > > > 100ms of each other as indicating urgency? (Hey, I had to ask!)
> > > > >
> > > > > It's the perfect number for one combination of CPU, storage device,
> > > > > and shrinker implementation :-)
> > > >
> > > > Woo-hoo!!!
> > > >
> > > > But is that one combination actually in use anywhere? ;-)
> > > >
> > > > > > > > > If it proves feasible, a later commit might add a function call directly
> > > > > > > > > indicating the end of the period of scarce memory.
> > > > > > > >
> > > > > > > > (Cc David Chinner, who often has opinions on shrinkers ;))
> > > > > > > >
> > > > > > > > It's a bit abusive of the intent of the slab shrinkers, but I don't
> > > > > > > > immediately see a problem with it. Always returning 0 from
> > > > > > > > ->scan_objects might cause a problem in some situations(?).
> > > > > > > >
> > > > > > > > Perhaps we should have a formal "system getting low on memory, please
> > > > > > > > do something" notification API.
> > > > > > >
> > > > > > > It's tricky to find a useful definition of what low on memory
> > > > > > > means. In the past we've used sc->priority cutoffs, the vmpressure
> > > > > > > interface (reclaimed/scanned - reclaim efficiency cutoffs), oom
> > > > > > > notifiers (another reclaim efficiency cutoff). But none of these
> > > > > > > reliably capture "distress", and they vary highly between different
> > > > > > > hardware setups. It can be hard to trigger OOM itself on fast IO
> > > > > > > devices, even when the machine is way past useful (where useful is
> > > > > > > somewhat subjective to the user). Userspace OOM implementations that
> > > > > > > consider userspace health (also subjective) are getting more common.
> > > > > > >
> > > > > > > > How significant is this? How much memory can RCU consume?
> > > > > > >
> > > > > > > I think if rcu can end up consuming a significant share of memory, one
> > > > > > > way that may work would be to do proper shrinker integration and track
> > > > > > > the age of its objects relative to the age of other allocations in the
> > > > > > > system. I.e. toss them all on a clock list with "new" bits and shrink
> > > > > > > them at VM velocity. If the shrinker sees objects with new bit set,
> > > > > > > clear and rotate. If it sees objects without them, we know rcu_heads
> > > > > > > outlive cache pages etc. and should probably cycle faster too.
> > > > > >
> > > > > > It would be easy for RCU to pass back (or otherwise use) the age of the
> > > > > > current grace period, if that would help.
> > > > > >
> > > > > > Tracking the age of individual callbacks is out of the question due to
> > > > > > memory overhead, but RCU could approximate this via statistical sampling.
> > > > > > Comparing this to grace-period durations could give information as to
> > > > > > whether making grace periods go faster would be helpful.
> > > > >
> > > > > That makes sense.
> > > > >
> > > > > So RCU knows the time and the VM knows the amount of memory. Either
> > > > > RCU needs to figure out its memory component to be able to translate
> > > > > shrinker input to age, or the VM needs to learn about time to be able
> > > > > to say: I'm currently scanning memory older than timestamp X.
> > > > >
> > > > > The latter would also require sampling in the VM. Nose goes. :-)
> > > >
> > > > Sounds about right. ;-)
> > > >
> > > > Does reclaim have any notion of having continuously scanned for
> > > > longer than some amount of time? Or could RCU reasonably deduce this?
> > > > For example, if RCU noticed that reclaim had been scanning for longer than
> > > > (say) five grace periods, RCU might decide to speed things up.
> > > >
> > > > But on the other hand, with slow disks, reclaim might go on for tens of
> > > > seconds even without much in the way of memory pressure, mightn't it?
> > > >
> > > > I suppose that another indicator would be recent NULL returns from
> > > > allocators. But that indicator flashes a bit later than one would like,
> > > > doesn't it? And has false positives when allocators are invoked from
> > > > atomic contexts, no doubt. And no doubt similar for sleeping more than
> > > > a certain length of time in an allocator.
> > > >
> > > > > There actually is prior art for teaching reclaim about time:
> > > > > https://lore.kernel.org/linux-mm/20130430110214.22179.26139.stgit@zurg/
> > > > >
> > > > > CCing Konstantin. I'm curious how widely this ended up being used and
> > > > > how reliably it worked.
> > > >
> > > > Looking forward to hearing of any results!
> > >
> > > Well, that was an experiment in automatically steering memory pressure
> > > between containers. The LRU timings from milestones themselves worked pretty well.
> > > The remaining engine was more robust than mainline cgroups are these days.
> > > Memory has become much cheaper; I hope nobody wants to overcommit it that badly anymore.
> > >
> > > It seems modern MM has plenty of signals about memory pressure.
> > > Kswapd should have enough knowledge to switch gears in RCU.
> >
> > Easy for me to provide "start fast and inefficient mode" and "stop fast
> > and inefficient mode" APIs for MM to call!
> >
> > How about rcu_mempressure_start() and rcu_mempressure_end()? I would
> > expect them not to nest (as in if you need them to nest, please let
> > me know). I would not expect these to be invoked all that often (as in
> > if you do need them to be fast and scalable, please let me know).
> >
> > RCU would then be in fast/inefficient mode if either MM told it to be
> > or if RCU had detected callback overload on at least one CPU.
> >
> > Seem reasonable?
>
> Not exactly nested calls, but kswapd threads are per NUMA node.
> So, at some level, the nodes under pressure must be counted.

Easy enough, especially given that RCU already "counts" CPUs having
excessive numbers of callbacks. But assuming that the transitions to/from
OOM are rare, I would start by just counting them with a global counter.
If the counter is non-zero, RCU is in fast and inefficient mode.
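For concreteness, such a counter might look roughly like the sketch below.
This is not patch code: the entry-point names are the ones proposed earlier
in the thread, while the rcu_force_quiescent_state() kick and the
rcu_in_fast_mode() helper are assumptions added here for illustration.

static atomic_t rcu_mempressure_count = ATOMIC_INIT(0);

void rcu_mempressure_start(void)
{
	if (atomic_inc_return(&rcu_mempressure_count) == 1)
		rcu_force_quiescent_state(); // Kick the current grace period.
}

void rcu_mempressure_end(void)
{
	WARN_ON_ONCE(atomic_dec_return(&rcu_mempressure_count) < 0);
}

/* Hypothetical helper: fast/inefficient mode if MM asked for it or if
 * callbacks have piled up on at least one CPU. */
static bool rcu_in_fast_mode(void)
{
	return atomic_read(&rcu_mempressure_count) > 0 ||
	       READ_ONCE(rcu_state.cbovld);
}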

> Also, forcing RCU calls only for the CPUs in one NUMA node might be useful.

Interesting. RCU currently evaluates a given CPU by comparing the
number of callbacks against a fixed cutoff that can be set at boot using
rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded,
RCU becomes more aggressive about invoking callbacks on that CPU, for
example, by sacrificing some degree of real-time response. I believe
that this heuristic would also serve the OOM use case well.
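For reference, that heuristic is conceptually just a per-CPU comparison
against the cutoff; a simplified sketch follows (the real logic lives in
check_cb_ovld() in kernel/rcu/tree.c and also maintains per-node overload
masks):

/* Simplified sketch of the per-CPU overload test, not the exact tree.c code. */
static bool rcu_cpu_cb_overloaded(struct rcu_data *rdp)
{
	// qovld_calc defaults to DEFAULT_RCU_QOVLD_MULT * qhimark.
	return rcu_segcblist_n_cbs(&rdp->cblist) >= qovld_calc;
}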

> I wonder if direct reclaim should at some stage simply wait for an RCU QS,
> i.e. call rcu_barrier() or similar somewhere before invoking the OOM killer.

The rcu_oom_count() function in the patch starting this thread returns the
total number of outstanding callbacks queued on all CPUs. So one approach
would be to invoke this function, and if the return value was truly
huge (taking the size of memory and who knows what all else into account),
do the rcu_barrier() to wait for RCU to clear its current backlog.

On the NUMA point, it would be dead easy for me to supply a function
that returned the number of callbacks on a given CPU, which would allow
you to similarly evaluate a NUMA node, a cgroup, or whatever.
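A rough sketch of what the MM side might then do, purely for illustration
(rcu_get_n_cbs_cpu() exists in kernel/rcu/tree.c but is not exported to MM
today, and both the threshold and the call site here are made up):

/* Illustrative only: wait out a large per-node RCU backlog before OOM. */
static void maybe_wait_for_rcu_backlog(int nid)
{
	unsigned long ncbs = 0;
	int cpu;

	for_each_cpu(cpu, cpumask_of_node(nid))
		ncbs += rcu_get_n_cbs_cpu(cpu); // Would need an exported wrapper.

	if (ncbs > 100000) // Arbitrary "truly huge" cutoff for illustration.
		rcu_barrier(); // Wait for all currently queued callbacks.
}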

> All GFP_NOFAIL users should allow direct reclaim, so this loop
> in page_alloc shouldn't block RCU and doesn't need special care.

I must defer to you guys on this. The main caution is the duration of
direct reclaim. After all, if it is too long, the kfree_rcu() instance
would have been better off just invoking synchronize_rcu().

Thanx, Paul

> > > > > > But, yes, it would be better to have an elusive unambiguous indication
> > > > > > of distress. ;-)
> > > > >
> > > > > I agree. Preferably something more practical than a dialogue box
> > > > > asking the user how well things are going for them :-)
> > > >
> > > > Indeed, that dialog box should be especially useful for things like
> > > > light bulbs running Linux. ;-)
> > > >
> > > > Thanx, Paul
> > > >

2020-05-12 03:14:33

by Chen, Rong A

[permalink] [raw]
Subject: 0902bb3bb8: vm-scalability.median -86.2% regression

Greeting,

FYI, we noticed a -86.2% regression of vm-scalability.median due to commit:


commit: 0902bb3bb8fdb69f956f4c3ee8157fe5d1c1e44d ("[PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode")
url: https://github.com/0day-ci/linux/commits/Paul-E-McKenney/Add-shrinker-to-shift-to-fast-inefficient-GP-mode/20200507-084433


in testcase: vm-scalability
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 256G memory
with following parameters:

runtime: 300
thp_enabled: never
thp_defrag: always
nr_task: 8
nr_pmem: 1
test: swap-w-seq-mt
bp_memmap: 96G!18G
cpufreq_governor: performance
ucode: 0x500002c

test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
bp_memmap/compiler/cpufreq_governor/kconfig/nr_pmem/nr_task/rootfs/runtime/tbox_group/test/testcase/thp_defrag/thp_enabled/ucode:
96G!18G/gcc-7/performance/x86_64-rhel-7.6/1/8/debian-x86_64-20191114.cgz/300/lkp-csl-2sp6/swap-w-seq-mt/vm-scalability/always/never/0x500002c

commit:
baf5fe7618 ("Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu")
0902bb3bb8 ("Add shrinker to shift to fast/inefficient GP mode")

baf5fe7618468151 0902bb3bb8fdb69f956f4c3ee81
---------------- ---------------------------
fail:runs %reproduction fail:runs
| | |
1:4 -25% :4 dmesg.WARNING:at_ip__slab_free/0x
1:4 -33% 0:4 perf-profile.children.cycles-pp.error_entry
%stddev %change %stddev
\ | \
1.60 +11.1% 1.78 ± 4% vm-scalability.free_time
716935 ± 3% -86.2% 99160 ± 9% vm-scalability.median
46.71 ± 14% -35.3 11.43 ± 28% vm-scalability.stddev%
5619989 ± 2% -86.0% 788970 ± 9% vm-scalability.throughput
44.93 +635.3% 330.35 ± 10% vm-scalability.time.elapsed_time
44.93 +635.3% 330.35 ± 10% vm-scalability.time.elapsed_time.max
91727 ± 7% -49.8% 46077 ± 8% vm-scalability.time.involuntary_context_switches
198.06 ± 3% +927.4% 2034 ± 8% vm-scalability.time.system_time
102.53 +6.3% 108.96 ± 2% vm-scalability.time.user_time
9919 ± 10% -65.2% 3455 ± 9% vm-scalability.time.voluntary_context_switches
92.25 -3.7% 88.81 iostat.cpu.idle
5.24 ± 2% +106.5% 10.82 ± 2% iostat.cpu.system
2.51 -85.6% 0.36 ± 11% iostat.cpu.user
0.00 ±100% +0.0 0.00 ± 37% mpstat.cpu.all.iowait%
5.41 ± 2% +5.4 10.86 ± 2% mpstat.cpu.all.sys%
2.60 -2.2 0.36 ± 11% mpstat.cpu.all.usr%
918871 ± 24% +49.6% 1374492 ± 11% cpuidle.C1.time
21145 ± 18% +240.5% 72004 ± 9% cpuidle.C1.usage
1.541e+09 ± 95% +1137.7% 1.908e+10 ± 30% cpuidle.C1E.time
5241728 ± 34% +748.9% 44495355 ± 16% cpuidle.C1E.usage
34825 ± 7% +2744.5% 990603 ± 11% cpuidle.POLL.time
12508 ± 4% +3114.1% 402020 ± 9% cpuidle.POLL.usage
92.00 -3.6% 88.67 vmstat.cpu.id
1031 ± 3% -87.7% 126.33 ± 40% vmstat.memory.buff
61018140 ± 2% -48.5% 31447956 ± 11% vmstat.memory.free
6453338 ± 17% +53.4% 9897990 ± 4% vmstat.memory.swpd
7.25 ± 5% +83.9% 13.33 ± 3% vmstat.procs.r
909.25 ± 2% -86.3% 124.33 ± 3% vmstat.swap.si
611160 ± 9% -87.7% 74985 ± 2% vmstat.swap.so
7293 ± 7% -31.7% 4981 vmstat.system.cs
785179 ± 6% -70.1% 234556 vmstat.system.in
99837026 +18.5% 1.183e+08 ± 3% meminfo.Active
99835926 +18.5% 1.183e+08 ± 3% meminfo.Active(anon)
1100 ± 2% -67.1% 362.33 ± 12% meminfo.Active(file)
1.017e+08 ± 2% +20.0% 1.221e+08 ± 3% meminfo.AnonPages
1063 ± 2% -85.4% 155.33 ± 41% meminfo.Buffers
129506 ± 3% -56.6% 56267 ± 6% meminfo.CmaFree
1.704e+08 -7.7% 1.572e+08 ± 3% meminfo.Committed_AS
8189726 ± 10% -6.0% 7701277 ± 6% meminfo.DirectMap2M
1950380 ± 6% +95.5% 3813862 ± 3% meminfo.Inactive
1949183 ± 6% +95.6% 3813016 ± 3% meminfo.Inactive(anon)
1196 ± 3% -29.3% 845.33 ± 21% meminfo.Inactive(file)
83888 ± 2% +16.4% 97645 meminfo.KReclaimable
29604 +44.9% 42884 ± 4% meminfo.Mapped
58640993 ± 3% -34.5% 38418196 ± 11% meminfo.MemAvailable
59089646 ± 3% -34.3% 38806088 ± 11% meminfo.MemFree
1.056e+08 +19.2% 1.259e+08 ± 3% meminfo.Memused
218191 ± 2% +19.3% 260390 ± 3% meminfo.PageTables
83888 ± 2% +16.4% 97645 meminfo.SReclaimable
3557167 -82.4% 626370 ± 10% meminfo.max_used_kB
1041 ± 7% -82.7% 180.00 ± 69% numa-meminfo.node0.Active(file)
713269 ± 3% +73.5% 1237284 ± 2% numa-meminfo.node0.Inactive
712476 ± 3% +73.6% 1237076 ± 2% numa-meminfo.node0.Inactive(anon)
792.25 ± 27% -73.8% 207.33 ± 48% numa-meminfo.node0.Inactive(file)
8539 ± 2% -14.0% 7342 numa-meminfo.node0.KernelStack
15778 ± 13% +23.6% 19500 ± 8% numa-meminfo.node0.Mapped
61936 -11.1% 55075 ± 2% numa-meminfo.node0.PageTables
77142184 ± 3% +22.9% 94803620 ± 3% numa-meminfo.node1.Active
77142121 ± 3% +22.9% 94803431 ± 3% numa-meminfo.node1.Active(anon)
78351034 ± 3% +24.2% 97283455 ± 3% numa-meminfo.node1.AnonPages
1238824 ± 12% +104.0% 2526732 ± 2% numa-meminfo.node1.Inactive
1238414 ± 12% +104.0% 2526081 ± 2% numa-meminfo.node1.Inactive(anon)
39485 ± 5% +30.0% 51344 ± 7% numa-meminfo.node1.KReclaimable
6764 ± 4% +14.9% 7776 numa-meminfo.node1.KernelStack
13907 ± 13% +68.8% 23470 ± 15% numa-meminfo.node1.Mapped
52452027 ± 5% -36.1% 33505419 ± 9% numa-meminfo.node1.MemFree
79628231 ± 3% +23.8% 98574840 ± 3% numa-meminfo.node1.MemUsed
158507 ± 4% +28.1% 203008 ± 3% numa-meminfo.node1.PageTables
39485 ± 5% +30.0% 51344 ± 7% numa-meminfo.node1.SReclaimable
221.00 ± 15% +131.4% 511.33 ± 31% slabinfo.biovec-128.active_objs
221.00 ± 15% +131.4% 511.33 ± 31% slabinfo.biovec-128.num_objs
82252 ± 2% +12.3% 92376 slabinfo.dentry.active_objs
1993 ± 2% +13.5% 2262 slabinfo.dentry.active_slabs
1993 ± 2% +13.5% 2262 slabinfo.dentry.num_slabs
3972 ± 2% +11.4% 4425 ± 2% slabinfo.files_cache.active_objs
3972 ± 2% +11.4% 4425 ± 2% slabinfo.files_cache.num_objs
22678 ± 5% +14.3% 25913 slabinfo.filp.active_objs
717.25 ± 5% +17.6% 843.33 slabinfo.filp.active_slabs
22968 ± 5% +13.5% 26077 slabinfo.filp.num_objs
717.25 ± 5% +17.6% 843.33 slabinfo.filp.num_slabs
81.25 ± 9% +1136.1% 1004 ± 43% slabinfo.nfs_commit_data.active_objs
81.25 ± 9% +1136.1% 1004 ± 43% slabinfo.nfs_commit_data.num_objs
65.75 ± 19% +941.3% 684.67 ± 53% slabinfo.nfs_read_data.active_objs
65.75 ± 19% +941.3% 684.67 ± 53% slabinfo.nfs_read_data.num_objs
41710 ± 3% +62.0% 67563 slabinfo.radix_tree_node.active_objs
751.75 ± 3% +63.7% 1230 slabinfo.radix_tree_node.active_slabs
42119 ± 3% +63.6% 68894 slabinfo.radix_tree_node.num_objs
751.75 ± 3% +63.7% 1230 slabinfo.radix_tree_node.num_slabs
1587 ± 7% +14.6% 1819 ± 8% slabinfo.skbuff_ext_cache.active_objs
1587 ± 7% +14.7% 1821 ± 8% slabinfo.skbuff_ext_cache.num_objs
357.00 ± 9% +171.5% 969.33 ± 22% slabinfo.skbuff_fclone_cache.active_objs
357.00 ± 9% +171.5% 969.33 ± 22% slabinfo.skbuff_fclone_cache.num_objs
9323 +21.0% 11281 ± 3% slabinfo.vmap_area.active_objs
9339 +20.8% 11282 ± 3% slabinfo.vmap_area.num_objs
257.75 ± 6% -82.5% 45.00 ± 69% numa-vmstat.node0.nr_active_file
41.25 ± 39% +1416.8% 625.67 ± 48% numa-vmstat.node0.nr_dirtied
176027 ± 2% +73.7% 305747 ± 3% numa-vmstat.node0.nr_inactive_anon
194.75 ± 25% -73.6% 51.33 ± 49% numa-vmstat.node0.nr_inactive_file
8540 ± 2% -14.0% 7342 numa-vmstat.node0.nr_kernel_stack
3951 ± 13% +26.0% 4977 ± 8% numa-vmstat.node0.nr_mapped
15365 -11.3% 13635 ± 2% numa-vmstat.node0.nr_page_table_pages
257.75 ± 6% -82.7% 44.67 ± 69% numa-vmstat.node0.nr_zone_active_file
176136 ± 2% +73.7% 305881 ± 3% numa-vmstat.node0.nr_zone_inactive_anon
194.00 ± 25% -73.5% 51.33 ± 49% numa-vmstat.node0.nr_zone_inactive_file
7706557 ± 6% +59.4% 12280502 ± 3% numa-vmstat.node0.numa_foreign
7881976 +15.1% 9074037 ± 2% numa-vmstat.node0.numa_hit
7744537 +15.2% 8920222 ± 2% numa-vmstat.node0.numa_local
227000 ± 52% +85.0% 419979 ± 15% numa-vmstat.node0.numa_miss
364448 ± 32% +57.4% 573797 ± 12% numa-vmstat.node0.numa_other
19143467 ± 3% +22.7% 23496200 ± 3% numa-vmstat.node1.nr_active_anon
19443323 ± 3% +24.0% 24107902 ± 3% numa-vmstat.node1.nr_anon_pages
11.25 ±168% +6092.6% 696.67 ± 23% numa-vmstat.node1.nr_dirtied
32665 ± 3% -54.3% 14934 ± 7% numa-vmstat.node1.nr_free_cma
13257702 ± 4% -35.2% 8589733 ± 10% numa-vmstat.node1.nr_free_pages
307118 ± 12% +102.9% 623237 ± 3% numa-vmstat.node1.nr_inactive_anon
32.75 ± 23% +289.8% 127.67 ± 2% numa-vmstat.node1.nr_isolated_anon
6765 ± 4% +14.9% 7773 numa-vmstat.node1.nr_kernel_stack
3627 ± 14% +61.5% 5858 ± 15% numa-vmstat.node1.nr_mapped
39352 ± 4% +27.7% 50243 ± 3% numa-vmstat.node1.nr_page_table_pages
9945 ± 5% +28.7% 12796 ± 7% numa-vmstat.node1.nr_slab_reclaimable
1441846 ± 21% +56.3% 2254231 ± 9% numa-vmstat.node1.nr_vmscan_write
1441849 ± 21% +56.4% 2254915 ± 9% numa-vmstat.node1.nr_written
19143456 ± 3% +22.7% 23496195 ± 3% numa-vmstat.node1.nr_zone_active_anon
307119 ± 12% +102.9% 623237 ± 3% numa-vmstat.node1.nr_zone_inactive_anon
227115 ± 52% +85.0% 420194 ± 15% numa-vmstat.node1.numa_foreign
17315459 ± 5% +26.1% 21840232 ± 3% numa-vmstat.node1.numa_hit
17256929 ± 5% +26.3% 21794912 ± 3% numa-vmstat.node1.numa_local
7707341 ± 6% +59.3% 12281115 ± 3% numa-vmstat.node1.numa_miss
7765889 ± 6% +58.7% 12326444 ± 3% numa-vmstat.node1.numa_other
1.25 ± 87% +3126.7% 40.33 ±108% numa-vmstat.node1.workingset_nodes
10353 ±123% +219.7% 33104 ± 2% proc-vmstat.compact_daemon_migrate_scanned
1888 ± 49% +325.1% 8028 ± 8% proc-vmstat.compact_fail
10353 ±123% +219.7% 33104 ± 2% proc-vmstat.compact_migrate_scanned
1888 ± 49% +325.2% 8030 ± 8% proc-vmstat.compact_stall
25092321 ± 2% +16.8% 29316619 ± 3% proc-vmstat.nr_active_anon
276.00 ± 3% -66.9% 91.33 ± 11% proc-vmstat.nr_active_file
25568624 ± 2% +18.3% 30244462 ± 3% proc-vmstat.nr_anon_pages
65.00 ± 24% +2683.6% 1809 ± 12% proc-vmstat.nr_dirtied
1449552 ± 3% -32.0% 986067 ± 10% proc-vmstat.nr_dirty_background_threshold
2902650 ± 3% -32.0% 1974547 ± 10% proc-vmstat.nr_dirty_threshold
276121 +2.8% 283935 proc-vmstat.nr_file_pages
32040 ± 4% -54.9% 14439 ± 6% proc-vmstat.nr_free_cma
14635098 ± 3% -31.8% 9975596 ± 10% proc-vmstat.nr_free_pages
490377 ± 6% +92.1% 941776 ± 2% proc-vmstat.nr_inactive_anon
300.25 ± 3% -28.9% 213.33 ± 21% proc-vmstat.nr_inactive_file
55.25 ± 16% +265.6% 202.00 ± 3% proc-vmstat.nr_isolated_anon
15301 -1.1% 15126 proc-vmstat.nr_kernel_stack
7563 +43.4% 10848 ± 4% proc-vmstat.nr_mapped
55169 ± 3% +17.0% 64572 ± 3% proc-vmstat.nr_page_table_pages
20862 +17.7% 24545 proc-vmstat.nr_slab_reclaimable
40904 +3.1% 42163 proc-vmstat.nr_slab_unreclaimable
261601 +3.0% 269429 proc-vmstat.nr_unevictable
2405495 ± 10% +46.5% 3524586 ± 9% proc-vmstat.nr_vmscan_write
25092249 ± 2% +16.8% 29316482 ± 3% proc-vmstat.nr_zone_active_anon
275.75 ± 3% -67.0% 91.00 ± 12% proc-vmstat.nr_zone_active_file
490475 ± 6% +92.0% 941924 ± 2% proc-vmstat.nr_zone_inactive_anon
299.75 ± 2% -28.8% 213.33 ± 21% proc-vmstat.nr_zone_inactive_file
261601 +3.0% 269430 proc-vmstat.nr_zone_unevictable
13602156 ± 4% +7.3% 14596636 ± 3% proc-vmstat.numa_foreign
35948444 -3.5% 34680186 proc-vmstat.numa_hit
35917370 -3.5% 34648547 proc-vmstat.numa_local
13602156 ± 4% +7.3% 14596636 ± 3% proc-vmstat.numa_miss
13633230 ± 4% +7.3% 14628275 ± 3% proc-vmstat.numa_other
5588835 ± 27% +336.2% 24380925 ± 10% proc-vmstat.numa_pte_updates
5204 ± 4% +201.5% 15693 proc-vmstat.pgactivate
49648820 +3.5% 51363583 proc-vmstat.pgfree
2677 ± 4% +63.5% 4378 ± 44% proc-vmstat.pgmajfault
0.25 ±173% +4.8e+05% 1195 ±141% proc-vmstat.pgmigrate_fail
2939659 ± 10% -44.1% 1641880 ± 10% proc-vmstat.pgscan_kswapd
2937571 ± 10% -44.2% 1638222 ± 10% proc-vmstat.pgsteal_kswapd
36379 +26416.6% 9646476 ± 25% proc-vmstat.slabs_scanned
733.25 ± 7% +86.8% 1370 ± 17% proc-vmstat.swap_ra
576.25 ± 8% +64.7% 949.00 ± 15% proc-vmstat.swap_ra_hit
16.00 ± 4% +37393.8% 5999 ± 24% proc-vmstat.unevictable_pgs_culled
6.76 ± 3% -7.4% 6.26 ± 6% perf-stat.i.MPKI
6.754e+09 -73.1% 1.814e+09 ± 7% perf-stat.i.branch-instructions
23049236 -74.2% 5947053 ± 3% perf-stat.i.branch-misses
37.40 ± 3% -11.4 26.01 ± 3% perf-stat.i.cache-miss-rate%
55113848 ± 4% -82.2% 9792343 ± 9% perf-stat.i.cache-misses
1.419e+08 ± 3% -75.4% 34878738 ± 5% perf-stat.i.cache-references
7468 ± 8% -31.1% 5145 ± 2% perf-stat.i.context-switches
1.14 +621.9% 8.23 perf-stat.i.cpi
2.592e+10 +48.0% 3.835e+10 ± 3% perf-stat.i.cpu-cycles
131.22 +10.4% 144.87 perf-stat.i.cpu-migrations
568.39 ± 4% +880.0% 5570 ± 3% perf-stat.i.cycles-between-cache-misses
0.02 ± 12% -0.0 0.01 ± 45% perf-stat.i.dTLB-load-miss-rate%
619825 ± 7% -88.7% 69818 ± 17% perf-stat.i.dTLB-load-misses
6.232e+09 -71.8% 1.755e+09 ± 6% perf-stat.i.dTLB-loads
0.21 ± 2% -0.1 0.08 ± 6% perf-stat.i.dTLB-store-miss-rate%
5684367 -83.6% 929886 ± 11% perf-stat.i.dTLB-store-misses
2.382e+09 -79.1% 4.974e+08 ± 6% perf-stat.i.dTLB-stores
59.54 -9.7 49.81 perf-stat.i.iTLB-load-miss-rate%
4835144 ± 6% -65.4% 1674858 ± 2% perf-stat.i.iTLB-load-misses
2.449e+10 -71.3% 7.019e+09 ± 6% perf-stat.i.instructions
13139 -66.9% 4353 ± 7% perf-stat.i.instructions-per-iTLB-miss
0.98 -77.3% 0.22 ± 6% perf-stat.i.ipc
60.11 ± 5% -84.2% 9.47 ± 6% perf-stat.i.major-faults
0.27 +52.5% 0.41 ± 3% perf-stat.i.metric.GHz
0.19 ± 24% +301.6% 0.74 ± 3% perf-stat.i.metric.K/sec
161.46 -73.1% 43.49 ± 6% perf-stat.i.metric.M/sec
1102490 -83.6% 180444 ± 11% perf-stat.i.minor-faults
77.68 +4.9 82.55 perf-stat.i.node-load-miss-rate%
7350049 ± 5% -82.0% 1325602 ± 8% perf-stat.i.node-load-misses
1863257 ± 4% -84.3% 293219 ± 8% perf-stat.i.node-loads
59.91 ± 7% +10.2 70.12 ± 2% perf-stat.i.node-store-miss-rate%
4561753 ± 7% -82.0% 823146 ± 7% perf-stat.i.node-store-misses
2697430 ± 15% -79.2% 561007 ± 8% perf-stat.i.node-stores
1102550 -83.6% 180454 ± 11% perf-stat.i.page-faults
5.80 ± 3% -11.7% 5.12 ± 2% perf-stat.overall.MPKI
38.81 -11.3 27.49 ± 4% perf-stat.overall.cache-miss-rate%
1.06 +440.0% 5.72 ± 4% perf-stat.overall.cpi
471.06 ± 4% +763.4% 4067 ± 6% perf-stat.overall.cycles-between-cache-misses
0.01 ± 7% -0.0 0.01 ± 44% perf-stat.overall.dTLB-load-miss-rate%
0.24 -0.1 0.18 ± 5% perf-stat.overall.dTLB-store-miss-rate%
73.85 -23.8 50.03 perf-stat.overall.iTLB-load-miss-rate%
5082 ± 7% -22.4% 3945 ± 9% perf-stat.overall.instructions-per-iTLB-miss
0.94 -81.4% 0.18 ± 4% perf-stat.overall.ipc
79.75 +2.1 81.87 perf-stat.overall.node-load-miss-rate%
4953 +88.6% 9340 ± 6% perf-stat.overall.path-length
6.606e+09 -74.7% 1.668e+09 ± 10% perf-stat.ps.branch-instructions
22608775 -75.2% 5605292 ± 5% perf-stat.ps.branch-misses
53930981 ± 4% -83.0% 9162780 ± 11% perf-stat.ps.cache-misses
1.389e+08 ± 3% -76.1% 33217475 ± 7% perf-stat.ps.cache-references
7311 ± 8% -31.2% 5026 perf-stat.ps.context-switches
93863 +1.9% 95670 perf-stat.ps.cpu-clock
2.536e+10 +45.8% 3.697e+10 ± 4% perf-stat.ps.cpu-cycles
128.53 +7.9% 138.70 ± 4% perf-stat.ps.cpu-migrations
607037 ± 7% -86.5% 81773 ± 36% perf-stat.ps.dTLB-load-misses
6.096e+09 -73.3% 1.625e+09 ± 9% perf-stat.ps.dTLB-loads
5558677 -85.3% 817456 ± 15% perf-stat.ps.dTLB-store-misses
2.33e+09 -80.4% 4.572e+08 ± 9% perf-stat.ps.dTLB-stores
4734527 ± 6% -65.2% 1646334 perf-stat.ps.iTLB-load-misses
2.395e+10 -72.9% 6.498e+09 ± 9% perf-stat.ps.instructions
58.86 ± 5% -84.7% 9.02 ± 5% perf-stat.ps.major-faults
1078073 -85.3% 158082 ± 16% perf-stat.ps.minor-faults
7194716 ± 5% -82.5% 1257009 ± 10% perf-stat.ps.node-load-misses
1824407 ± 4% -84.7% 278752 ± 11% perf-stat.ps.node-loads
4460797 ± 7% -83.0% 756410 ± 11% perf-stat.ps.node-store-misses
2640492 ± 15% -80.9% 504536 ± 11% perf-stat.ps.node-stores
1078132 -85.3% 158091 ± 16% perf-stat.ps.page-faults
93863 +1.9% 95670 perf-stat.ps.task-clock
1.09e+12 +85.9% 2.026e+12 ± 4% perf-stat.total.instructions
0.02 ± 81% +1.2e+08% 21164 ± 16% sched_debug.cfs_rq:/.exec_clock.avg
0.92 ± 77% +1.4e+07% 127531 ± 18% sched_debug.cfs_rq:/.exec_clock.max
0.11 ± 76% +1.5e+07% 16164 ± 24% sched_debug.cfs_rq:/.exec_clock.stddev
17284 ± 23% +326.3% 73680 ± 14% sched_debug.cfs_rq:/.load.avg
135741 ± 33% +66.7% 226288 ± 9% sched_debug.cfs_rq:/.load.stddev
1028 +9.7% 1128 ± 4% sched_debug.cfs_rq:/.load_avg.max
14965 ± 17% +1150.1% 187086 ± 13% sched_debug.cfs_rq:/.min_vruntime.avg
31992 ± 13% +1045.6% 366502 ± 10% sched_debug.cfs_rq:/.min_vruntime.max
8981 ± 24% +1017.6% 100380 ± 22% sched_debug.cfs_rq:/.min_vruntime.min
3372 ± 4% +1501.5% 54013 ± 9% sched_debug.cfs_rq:/.min_vruntime.stddev
0.14 ± 7% +38.2% 0.19 ± 5% sched_debug.cfs_rq:/.nr_running.avg
24.12 ± 73% -92.3% 1.87 ± 11% sched_debug.cfs_rq:/.removed.load_avg.avg
1018 -83.2% 171.17 ± 12% sched_debug.cfs_rq:/.removed.load_avg.max
144.98 ± 35% -87.7% 17.78 ± 11% sched_debug.cfs_rq:/.removed.load_avg.stddev
1112 ± 73% -92.2% 86.36 ± 11% sched_debug.cfs_rq:/.removed.runnable_sum.avg
46957 -83.1% 7914 ± 12% sched_debug.cfs_rq:/.removed.runnable_sum.max
6679 ± 35% -87.7% 821.98 ± 11% sched_debug.cfs_rq:/.removed.runnable_sum.stddev
8.16 ± 83% -91.7% 0.68 ± 33% sched_debug.cfs_rq:/.removed.util_avg.avg
395.25 ± 40% -84.1% 62.96 ± 36% sched_debug.cfs_rq:/.removed.util_avg.max
50.76 ± 55% -87.2% 6.51 ± 34% sched_debug.cfs_rq:/.removed.util_avg.stddev
2.20 ± 8% +2562.8% 58.63 ± 18% sched_debug.cfs_rq:/.runnable_load_avg.avg
40.75 ± 68% +2053.4% 877.49 ± 6% sched_debug.cfs_rq:/.runnable_load_avg.max
6.71 ± 27% +2795.2% 194.19 ± 11% sched_debug.cfs_rq:/.runnable_load_avg.stddev
15922 ± 29% +361.8% 73525 ± 14% sched_debug.cfs_rq:/.runnable_weight.avg
134704 ± 34% +67.7% 225909 ± 9% sched_debug.cfs_rq:/.runnable_weight.stddev
2199 ± 64% -3699.3% -79153 sched_debug.cfs_rq:/.spread0.avg
19225 ± 27% +422.8% 100520 ± 49% sched_debug.cfs_rq:/.spread0.max
-3785 +4284.8% -165992 sched_debug.cfs_rq:/.spread0.min
3372 ± 4% +1503.5% 54080 ± 9% sched_debug.cfs_rq:/.spread0.stddev
266.83 ± 3% -22.3% 207.34 ± 7% sched_debug.cfs_rq:/.util_avg.avg
1452 ± 17% -23.1% 1116 ± 7% sched_debug.cfs_rq:/.util_avg.max
23.15 ± 24% +351.8% 104.60 ± 10% sched_debug.cfs_rq:/.util_est_enqueued.avg
101.20 ± 18% +158.9% 262.04 ± 7% sched_debug.cfs_rq:/.util_est_enqueued.stddev
723046 +28.8% 931633 sched_debug.cpu.avg_idle.avg
8921 ±136% +3367.2% 309320 ± 6% sched_debug.cpu.avg_idle.min
314848 ± 6% -58.0% 132096 ± 5% sched_debug.cpu.avg_idle.stddev
28760 +533.3% 182149 ± 13% sched_debug.cpu.clock.avg
28764 +533.6% 182260 ± 13% sched_debug.cpu.clock.max
28756 +533.1% 182049 ± 13% sched_debug.cpu.clock.min
2.36 ± 9% +2660.5% 65.27 ± 41% sched_debug.cpu.clock.stddev
28760 +533.3% 182149 ± 13% sched_debug.cpu.clock_task.avg
28764 +533.6% 182260 ± 13% sched_debug.cpu.clock_task.max
28756 +533.1% 182049 ± 13% sched_debug.cpu.clock_task.min
2.36 ± 9% +2660.4% 65.27 ± 41% sched_debug.cpu.clock_task.stddev
230.67 ± 12% +23.0% 283.61 ± 4% sched_debug.cpu.curr->pid.avg
2078 +142.3% 5035 ± 9% sched_debug.cpu.curr->pid.max
619.98 ± 6% +36.5% 846.49 ± 6% sched_debug.cpu.curr->pid.stddev
0.00 ± 5% +173.2% 0.00 ± 31% sched_debug.cpu.next_balance.stddev
1666 +492.9% 9880 ± 11% sched_debug.cpu.nr_switches.avg
8085 ± 22% +577.2% 54755 ± 9% sched_debug.cpu.nr_switches.max
470.25 ± 12% +448.2% 2577 ± 10% sched_debug.cpu.nr_switches.min
1060 ± 8% +813.3% 9687 ± 9% sched_debug.cpu.nr_switches.stddev
-58.50 -27.4% -42.47 sched_debug.cpu.nr_uninterruptible.min
0.81 ±100% +1e+06% 8263 ± 14% sched_debug.cpu.sched_count.avg
18.50 ±130% +2.8e+05% 52205 ± 9% sched_debug.cpu.sched_count.max
3.67 ±120% +2.6e+05% 9512 ± 9% sched_debug.cpu.sched_count.stddev
0.41 ± 97% +8.8e+05% 3601 ± 14% sched_debug.cpu.sched_goidle.avg
9.25 ±130% +2.8e+05% 25571 ± 9% sched_debug.cpu.sched_goidle.max
1.84 ±119% +2.6e+05% 4754 ± 9% sched_debug.cpu.sched_goidle.stddev
0.36 ±123% +1.1e+06% 3997 ± 14% sched_debug.cpu.ttwu_count.avg
7.25 ±134% +4.1e+05% 29797 ± 2% sched_debug.cpu.ttwu_count.max
1.44 ±125% +3.5e+05% 5022 ± 6% sched_debug.cpu.ttwu_count.stddev
0.01 ±173% +8.8e+06% 782.74 ± 17% sched_debug.cpu.ttwu_local.avg
0.25 ±173% +8.7e+05% 2176 ± 28% sched_debug.cpu.ttwu_local.max
0.05 ±173% +6.9e+05% 319.05 ± 22% sched_debug.cpu.ttwu_local.stddev
28757 +533.0% 182050 ± 13% sched_debug.cpu_clk
28251 +542.5% 181508 ± 13% sched_debug.ktime
29119 +526.5% 182429 ± 13% sched_debug.sched_clk
36688 ± 39% -60.6% 14451 ± 18% softirqs.CPU0.RCU
12193 ± 4% +305.1% 49394 ± 7% softirqs.CPU0.SCHED
22020 ± 6% +465.9% 124620 ± 12% softirqs.CPU0.TIMER
6532 ± 32% +458.7% 36495 ± 9% softirqs.CPU1.SCHED
22332 ± 7% +445.4% 121803 ± 11% softirqs.CPU1.TIMER
5556 ± 13% +601.2% 38958 ± 9% softirqs.CPU10.SCHED
21976 ± 5% +545.3% 141801 ± 18% softirqs.CPU10.TIMER
6146 ± 21% +513.9% 37735 ± 10% softirqs.CPU11.SCHED
21510 ± 2% +474.1% 123501 ± 11% softirqs.CPU11.TIMER
6910 ± 40% +442.5% 37486 ± 12% softirqs.CPU12.SCHED
21979 ± 3% +460.8% 123261 ± 10% softirqs.CPU12.TIMER
7494 ± 17% +400.1% 37479 ± 10% softirqs.CPU13.SCHED
21998 ± 6% +460.2% 123230 ± 11% softirqs.CPU13.TIMER
7486 ± 31% +69.0% 12649 ± 15% softirqs.CPU14.RCU
7354 ± 21% +415.3% 37902 ± 10% softirqs.CPU14.SCHED
22857 ± 4% +433.4% 121921 ± 10% softirqs.CPU14.TIMER
5572 ± 33% +584.7% 38156 ± 10% softirqs.CPU15.SCHED
22447 ± 4% +447.9% 122977 ± 10% softirqs.CPU15.TIMER
6945 ± 2% +430.3% 36828 ± 12% softirqs.CPU16.SCHED
22646 ± 7% +439.4% 122151 ± 11% softirqs.CPU16.TIMER
7728 ± 26% +74.9% 13518 ± 18% softirqs.CPU17.RCU
6935 +438.4% 37342 ± 13% softirqs.CPU17.SCHED
22983 ± 8% +434.9% 122933 ± 11% softirqs.CPU17.TIMER
6346 ± 11% +508.7% 38628 ± 10% softirqs.CPU18.SCHED
22324 ± 7% +451.8% 123194 ± 11% softirqs.CPU18.TIMER
20405 ± 27% -38.2% 12618 ± 16% softirqs.CPU19.RCU
6280 ± 8% +528.9% 39500 ± 6% softirqs.CPU19.SCHED
22052 ± 5% +453.7% 122092 ± 11% softirqs.CPU19.TIMER
5597 ± 19% +548.2% 36280 ± 7% softirqs.CPU2.SCHED
23466 ± 14% +419.7% 121952 ± 10% softirqs.CPU2.TIMER
38152 ± 29% -62.6% 14267 ± 11% softirqs.CPU20.RCU
8398 ± 35% +342.7% 37176 ± 12% softirqs.CPU20.SCHED
22273 ± 7% +443.2% 120993 ± 11% softirqs.CPU20.TIMER
6796 ± 3% +446.2% 37120 ± 11% softirqs.CPU21.SCHED
22092 ± 7% +453.1% 122183 ± 11% softirqs.CPU21.TIMER
6590 ± 11% +115.2% 14179 ± 22% softirqs.CPU22.RCU
8729 ± 18% +336.0% 38058 ± 11% softirqs.CPU22.SCHED
22540 ± 7% +448.8% 123706 ± 12% softirqs.CPU22.TIMER
6104 ± 17% +517.0% 37661 ± 9% softirqs.CPU23.SCHED
22902 ± 6% +446.1% 125080 ± 12% softirqs.CPU23.TIMER
5850 ± 14% +493.9% 34745 ± 13% softirqs.CPU24.SCHED
22197 +426.9% 116964 ± 10% softirqs.CPU24.TIMER
5649 ± 14% +564.8% 37554 ± 15% softirqs.CPU25.SCHED
22145 ± 5% +441.1% 119822 ± 10% softirqs.CPU25.TIMER
23123 ± 3% +511.5% 141395 ± 9% softirqs.CPU26.TIMER
4401 ± 16% +716.2% 35925 ± 14% softirqs.CPU27.SCHED
22894 ± 5% +414.3% 117743 ± 7% softirqs.CPU27.TIMER
6128 ± 15% +504.5% 37044 ± 11% softirqs.CPU28.SCHED
23322 ± 7% +409.5% 118819 ± 10% softirqs.CPU28.TIMER
6491 ± 10% +473.1% 37203 ± 10% softirqs.CPU29.SCHED
23701 ± 6% +406.3% 120008 ± 10% softirqs.CPU29.TIMER
5690 ± 20% +583.1% 38873 ± 8% softirqs.CPU3.SCHED
22621 ± 6% +441.8% 122560 ± 11% softirqs.CPU3.TIMER
6782 ± 4% +441.5% 36725 ± 15% softirqs.CPU30.SCHED
23426 ± 7% +399.1% 116914 ± 13% softirqs.CPU30.TIMER
6655 ± 4% +449.2% 36548 ± 9% softirqs.CPU31.SCHED
22760 ± 6% +427.1% 119968 ± 7% softirqs.CPU31.TIMER
6456 +498.8% 38661 ± 11% softirqs.CPU32.SCHED
23092 ± 5% +420.0% 120077 ± 10% softirqs.CPU32.TIMER
6304 ± 11% +485.0% 36879 ± 12% softirqs.CPU33.SCHED
22838 ± 6% +420.5% 118880 ± 10% softirqs.CPU33.TIMER
6685 ± 8% +440.3% 36119 ± 11% softirqs.CPU34.SCHED
23305 ± 7% +401.8% 116950 ± 10% softirqs.CPU34.TIMER
6730 ± 5% +471.1% 38439 ± 10% softirqs.CPU35.SCHED
23368 ± 7% +410.9% 119389 ± 9% softirqs.CPU35.TIMER
6157 +98.3% 12210 ± 18% softirqs.CPU36.RCU
7044 ± 4% +435.9% 37749 ± 11% softirqs.CPU36.SCHED
23157 ± 8% +411.8% 118523 ± 9% softirqs.CPU36.TIMER
6393 ± 13% +490.4% 37744 ± 11% softirqs.CPU37.SCHED
22945 ± 9% +415.9% 118379 ± 9% softirqs.CPU37.TIMER
6797 ± 4% +450.3% 37410 ± 11% softirqs.CPU38.SCHED
23095 ± 8% +421.7% 120485 ± 11% softirqs.CPU38.TIMER
6462 ± 4% +494.1% 38397 ± 12% softirqs.CPU39.SCHED
23003 ± 8% +416.3% 118775 ± 9% softirqs.CPU39.TIMER
6658 ± 13% +461.7% 37400 ± 8% softirqs.CPU4.SCHED
22898 ± 6% +436.0% 122724 ± 11% softirqs.CPU4.TIMER
6663 ± 5% +463.6% 37555 ± 11% softirqs.CPU40.SCHED
23428 ± 6% +405.9% 118523 ± 10% softirqs.CPU40.TIMER
23172 ± 7% +479.8% 134363 ± 22% softirqs.CPU41.TIMER
6681 ± 29% +95.0% 13027 ± 19% softirqs.CPU42.RCU
6761 ± 2% +425.8% 35552 ± 6% softirqs.CPU42.SCHED
22874 ± 8% +419.7% 118879 ± 8% softirqs.CPU42.TIMER
7112 ± 3% +422.5% 37157 ± 10% softirqs.CPU43.SCHED
25900 ± 13% +357.4% 118464 ± 9% softirqs.CPU43.TIMER
6067 ± 5% +111.6% 12839 ± 17% softirqs.CPU44.RCU
7078 ± 2% +432.4% 37685 ± 10% softirqs.CPU44.SCHED
23467 ± 7% +409.3% 119517 ± 10% softirqs.CPU44.TIMER
6417 ± 15% +481.3% 37303 ± 12% softirqs.CPU45.SCHED
23074 ± 8% +508.5% 140417 ± 26% softirqs.CPU45.TIMER
7519 ± 26% +65.2% 12423 ± 13% softirqs.CPU46.RCU
6956 ± 2% +437.2% 37373 ± 14% softirqs.CPU46.SCHED
23184 ± 7% +411.5% 118586 ± 9% softirqs.CPU46.TIMER
6439 ± 11% +101.4% 12965 ± 13% softirqs.CPU47.RCU
5486 ± 3% +572.2% 36874 ± 9% softirqs.CPU47.SCHED
23475 ± 6% +407.4% 119116 ± 10% softirqs.CPU47.TIMER
7008 ± 35% +88.7% 13222 ± 16% softirqs.CPU48.RCU
5881 ± 14% +545.4% 37959 ± 12% softirqs.CPU48.SCHED
21852 ± 6% +473.9% 125408 ± 12% softirqs.CPU48.TIMER
6708 ± 6% +471.3% 38329 ± 11% softirqs.CPU49.SCHED
21856 ± 6% +468.8% 124326 ± 12% softirqs.CPU49.TIMER
8618 ± 36% +343.5% 38219 ± 10% softirqs.CPU5.SCHED
24849 ± 12% +395.1% 123026 ± 11% softirqs.CPU5.TIMER
6985 ± 22% +91.7% 13393 ± 17% softirqs.CPU50.RCU
6749 ± 5% +468.1% 38343 ± 9% softirqs.CPU50.SCHED
21851 ± 6% +475.0% 125647 ± 10% softirqs.CPU50.TIMER
6478 ± 11% +502.4% 39026 ± 10% softirqs.CPU51.SCHED
22332 ± 5% +467.5% 126738 ± 11% softirqs.CPU51.TIMER
7628 ± 26% +65.0% 12585 ± 13% softirqs.CPU52.RCU
6744 ± 12% +489.4% 39748 ± 11% softirqs.CPU52.SCHED
22287 ± 6% +465.9% 126113 ± 12% softirqs.CPU52.TIMER
6361 ± 13% +93.1% 12283 ± 12% softirqs.CPU53.RCU
7273 +438.2% 39150 ± 11% softirqs.CPU53.SCHED
22745 ± 5% +457.9% 126889 ± 11% softirqs.CPU53.TIMER
6768 ± 10% +469.3% 38533 ± 11% softirqs.CPU54.SCHED
22198 ± 6% +470.5% 126633 ± 11% softirqs.CPU54.TIMER
6671 ± 8% +93.0% 12879 ± 17% softirqs.CPU55.RCU
6345 ± 15% +492.7% 37606 ± 14% softirqs.CPU55.SCHED
22178 ± 6% +465.2% 125361 ± 12% softirqs.CPU55.TIMER
6098 ± 13% +115.3% 13130 ± 11% softirqs.CPU56.RCU
6372 ± 14% +510.8% 38918 ± 8% softirqs.CPU56.SCHED
22829 ± 6% +447.5% 124995 ± 10% softirqs.CPU56.TIMER
6253 ± 20% +512.4% 38293 ± 9% softirqs.CPU57.SCHED
22136 ± 8% +472.0% 126610 ± 11% softirqs.CPU57.TIMER
6949 ± 31% +72.3% 11971 ± 15% softirqs.CPU58.RCU
6970 +451.0% 38404 ± 11% softirqs.CPU58.SCHED
22076 ± 6% +468.8% 125570 ± 11% softirqs.CPU58.TIMER
6546 ± 23% +103.9% 13346 ± 13% softirqs.CPU59.RCU
7124 +442.9% 38680 ± 12% softirqs.CPU59.SCHED
22292 ± 5% +469.6% 126980 ± 11% softirqs.CPU59.TIMER
6929 ± 6% +452.9% 38316 ± 14% softirqs.CPU6.SCHED
22690 ± 5% +442.9% 123184 ± 11% softirqs.CPU6.TIMER
6101 ± 5% +112.2% 12946 ± 13% softirqs.CPU60.RCU
6947 ± 4% +457.3% 38718 ± 11% softirqs.CPU60.SCHED
22785 ± 9% +451.7% 125714 ± 11% softirqs.CPU60.TIMER
6909 ± 2% +574.0% 46572 ± 25% softirqs.CPU61.SCHED
22245 ± 5% +464.5% 125576 ± 12% softirqs.CPU61.TIMER
6073 ± 15% +545.4% 39196 ± 9% softirqs.CPU62.SCHED
22429 ± 8% +458.0% 125143 ± 9% softirqs.CPU62.TIMER
6085 ± 24% +98.9% 12103 ± 16% softirqs.CPU63.RCU
6672 ± 7% +482.2% 38847 ± 9% softirqs.CPU63.SCHED
22255 ± 5% +469.1% 126660 ± 10% softirqs.CPU63.TIMER
6906 ± 3% +459.5% 38640 ± 11% softirqs.CPU64.SCHED
22023 ± 6% +472.1% 125995 ± 12% softirqs.CPU64.TIMER
6456 ± 5% +503.6% 38969 ± 10% softirqs.CPU65.SCHED
23126 ± 10% +448.6% 126869 ± 11% softirqs.CPU65.TIMER
6727 ± 4% +513.3% 41259 ± 17% softirqs.CPU66.SCHED
22465 ± 5% +467.5% 127482 ± 11% softirqs.CPU66.TIMER
6641 ± 10% +489.5% 39150 ± 12% softirqs.CPU67.SCHED
22547 ± 5% +446.2% 123143 ± 11% softirqs.CPU67.TIMER
7569 ± 25% +68.6% 12762 ± 13% softirqs.CPU68.RCU
6913 ± 3% +475.1% 39758 ± 11% softirqs.CPU68.SCHED
22058 ± 7% +480.0% 127942 ± 12% softirqs.CPU68.TIMER
7077 ± 9% +448.1% 38792 ± 11% softirqs.CPU69.SCHED
23097 ± 9% +444.0% 125637 ± 11% softirqs.CPU69.TIMER
6610 ± 14% +105.6% 13594 ± 18% softirqs.CPU7.RCU
7036 ± 4% +433.5% 37541 ± 11% softirqs.CPU7.SCHED
22654 ± 5% +441.4% 122652 ± 11% softirqs.CPU7.TIMER
6307 ± 8% +95.1% 12305 ± 17% softirqs.CPU70.RCU
7178 ± 4% +447.8% 39319 ± 10% softirqs.CPU70.SCHED
22725 ± 9% +459.8% 127222 ± 11% softirqs.CPU70.TIMER
6902 ± 4% +456.7% 38426 ± 9% softirqs.CPU71.SCHED
23173 ± 6% +439.5% 125013 ± 11% softirqs.CPU71.TIMER
6936 ± 2% +440.3% 37478 ± 10% softirqs.CPU72.SCHED
22718 ± 7% +427.3% 119795 ± 11% softirqs.CPU72.TIMER
6248 ± 14% +492.2% 37002 ± 9% softirqs.CPU73.SCHED
22736 ± 8% +415.1% 117122 ± 9% softirqs.CPU73.TIMER
22828 ± 7% +518.2% 141117 ± 7% softirqs.CPU74.TIMER
5932 ± 13% +552.1% 38681 ± 11% softirqs.CPU75.SCHED
22651 ± 6% +434.1% 120982 ± 10% softirqs.CPU75.TIMER
5989 ± 12% +551.0% 38989 ± 7% softirqs.CPU76.SCHED
22679 ± 4% +514.7% 139409 ± 5% softirqs.CPU76.TIMER
6342 ± 14% +504.5% 38341 ± 11% softirqs.CPU77.SCHED
22680 ± 7% +448.9% 124493 ± 11% softirqs.CPU77.TIMER
6180 ± 16% +541.6% 39656 ± 6% softirqs.CPU78.SCHED
22959 ± 9% +443.7% 124843 ± 9% softirqs.CPU78.TIMER
15836 ± 26% -24.8% 11911 ± 16% softirqs.CPU79.RCU
6431 ± 11% +493.9% 38195 ± 10% softirqs.CPU79.SCHED
22471 ± 8% +436.6% 120572 ± 9% softirqs.CPU79.TIMER
6116 ± 8% +123.0% 13639 ± 8% softirqs.CPU8.RCU
7066 ± 3% +422.6% 36930 ± 8% softirqs.CPU8.SCHED
23222 ± 7% +425.3% 121998 ± 11% softirqs.CPU8.TIMER
5679 ± 7% +102.9% 11523 ± 18% softirqs.CPU80.RCU
6918 ± 3% +470.6% 39480 ± 7% softirqs.CPU80.SCHED
22973 ± 6% +444.5% 125094 ± 10% softirqs.CPU80.TIMER
7395 ± 28% +64.4% 12155 ± 18% softirqs.CPU81.RCU
6876 ± 3% +456.0% 38229 ± 11% softirqs.CPU81.SCHED
23024 ± 6% +428.9% 121786 ± 12% softirqs.CPU81.TIMER
6278 ± 17% +518.7% 38843 ± 10% softirqs.CPU82.SCHED
23005 ± 6% +436.6% 123442 ± 11% softirqs.CPU82.TIMER
7106 ± 20% +66.5% 11833 ± 16% softirqs.CPU83.RCU
6996 ± 4% +456.4% 38930 ± 9% softirqs.CPU83.SCHED
22885 ± 7% +436.0% 122678 ± 9% softirqs.CPU83.TIMER
6469 ± 12% +499.7% 38794 ± 10% softirqs.CPU84.SCHED
22723 ± 8% +437.5% 122129 ± 11% softirqs.CPU84.TIMER
6769 ± 5% +467.8% 38437 ± 10% softirqs.CPU85.SCHED
22753 ± 10% +433.7% 121443 ± 10% softirqs.CPU85.TIMER
6908 +457.7% 38530 ± 10% softirqs.CPU86.SCHED
22687 ± 7% +440.0% 122516 ± 10% softirqs.CPU86.TIMER
6530 ± 6% +482.1% 38008 ± 11% softirqs.CPU87.SCHED
22574 ± 8% +435.7% 120927 ± 10% softirqs.CPU87.TIMER
6739 +471.2% 38491 ± 11% softirqs.CPU88.SCHED
22813 ± 6% +440.3% 123263 ± 10% softirqs.CPU88.TIMER
6861 ± 7% +303.4% 27680 ± 43% softirqs.CPU89.SCHED
22714 ± 7% +504.2% 137243 ± 21% softirqs.CPU89.TIMER
6086 ± 23% +515.1% 37437 ± 9% softirqs.CPU9.SCHED
27118 ± 20% +356.1% 123681 ± 10% softirqs.CPU9.TIMER
6827 ± 22% +68.7% 11520 ± 18% softirqs.CPU90.RCU
6846 ± 2% +448.7% 37560 ± 8% softirqs.CPU90.SCHED
22693 ± 8% +432.5% 120850 ± 9% softirqs.CPU90.TIMER
5636 ± 9% +98.8% 11207 ± 21% softirqs.CPU91.RCU
6865 ± 3% +454.3% 38054 ± 10% softirqs.CPU91.SCHED
22663 ± 7% +429.1% 119908 ± 10% softirqs.CPU91.TIMER
5713 ± 5% +111.0% 12055 ± 18% softirqs.CPU92.RCU
7080 ± 4% +436.3% 37976 ± 12% softirqs.CPU92.SCHED
23117 ± 6% +434.7% 123599 ± 10% softirqs.CPU92.TIMER
6754 ± 2% +446.7% 36928 ± 8% softirqs.CPU93.SCHED
22554 ± 8% +437.1% 121143 ± 10% softirqs.CPU93.TIMER
6007 ± 19% +86.4% 11197 ± 18% softirqs.CPU94.RCU
6982 +454.1% 38688 ± 10% softirqs.CPU94.SCHED
22658 ± 7% +441.2% 122629 ± 10% softirqs.CPU94.TIMER
6128 ± 18% +97.3% 12093 ± 24% softirqs.CPU95.RCU
6669 ± 6% +471.8% 38136 ± 8% softirqs.CPU95.SCHED
22538 ± 7% +442.8% 122343 ± 10% softirqs.CPU95.TIMER
641599 ± 2% +462.4% 3608346 ± 10% softirqs.SCHED
2187113 ± 5% +443.2% 11881220 ± 10% softirqs.TIMER
0.50 ±173% +40633.3% 203.67 ± 78% interrupts.113:PCI-MSI.31981646-edge.i40e-eth0-TxRx-77
0.00 +1.9e+104% 190.67 ± 99% interrupts.114:PCI-MSI.31981647-edge.i40e-eth0-TxRx-78
0.00 +1.8e+104% 183.00 ± 63% interrupts.115:PCI-MSI.31981648-edge.i40e-eth0-TxRx-79
0.00 +7.1e+103% 71.33 ± 62% interrupts.117:PCI-MSI.31981650-edge.i40e-eth0-TxRx-81
0.50 ±173% +10766.7% 54.33 ± 85% interrupts.120:PCI-MSI.31981653-edge.i40e-eth0-TxRx-84
36.75 ± 8% +672.8% 284.00 interrupts.35:PCI-MSI.31981568-edge.i40e-0000:3d:00.0:misc
28331569 ± 11% -48.9% 14483923 ± 7% interrupts.CAL:Function_call_interrupts
1597643 ± 37% -85.7% 228000 ± 5% interrupts.CPU0.CAL:Function_call_interrupts
91505 ± 2% +617.7% 656759 ± 10% interrupts.CPU0.LOC:Local_timer_interrupts
3990 ± 53% +271.5% 14822 ± 16% interrupts.CPU0.RES:Rescheduling_interrupts
3293773 ± 36% -76.5% 775632 ± 5% interrupts.CPU0.TLB:TLB_shootdowns
91444 ± 2% +618.8% 657273 ± 10% interrupts.CPU1.LOC:Local_timer_interrupts
2970 ± 53% -65.0% 1041 ± 25% interrupts.CPU1.RES:Rescheduling_interrupts
91427 ± 2% +619.1% 657486 ± 10% interrupts.CPU10.LOC:Local_timer_interrupts
91397 ± 2% +618.7% 656886 ± 10% interrupts.CPU11.LOC:Local_timer_interrupts
91424 ± 2% +618.1% 656560 ± 10% interrupts.CPU12.LOC:Local_timer_interrupts
91436 ± 2% +619.4% 657799 ± 10% interrupts.CPU13.LOC:Local_timer_interrupts
91438 ± 2% +619.9% 658306 ± 10% interrupts.CPU14.LOC:Local_timer_interrupts
295.00 ± 60% +360.9% 1359 ± 42% interrupts.CPU14.NMI:Non-maskable_interrupts
295.00 ± 60% +360.9% 1359 ± 42% interrupts.CPU14.PMI:Performance_monitoring_interrupts
91423 ± 2% +619.5% 657765 ± 10% interrupts.CPU15.LOC:Local_timer_interrupts
91473 ± 2% +616.9% 655776 ± 10% interrupts.CPU16.LOC:Local_timer_interrupts
91420 ± 2% +617.9% 656334 ± 9% interrupts.CPU17.LOC:Local_timer_interrupts
579.50 ±104% +176.3% 1601 ± 32% interrupts.CPU17.NMI:Non-maskable_interrupts
579.50 ±104% +176.3% 1601 ± 32% interrupts.CPU17.PMI:Performance_monitoring_interrupts
125.00 ±126% +515.7% 769.67 ± 22% interrupts.CPU17.RES:Rescheduling_interrupts
91432 ± 2% +618.8% 657187 ± 10% interrupts.CPU18.LOC:Local_timer_interrupts
91348 ± 2% +620.3% 658014 ± 10% interrupts.CPU19.LOC:Local_timer_interrupts
91442 ± 2% +617.7% 656305 ± 10% interrupts.CPU2.LOC:Local_timer_interrupts
1607994 ± 30% -88.3% 187593 ± 16% interrupts.CPU20.CAL:Function_call_interrupts
91411 ± 2% +606.9% 646203 ± 10% interrupts.CPU20.LOC:Local_timer_interrupts
2901 ± 19% -77.1% 665.33 ± 58% interrupts.CPU20.NMI:Non-maskable_interrupts
2901 ± 19% -77.1% 665.33 ± 58% interrupts.CPU20.PMI:Performance_monitoring_interrupts
7329 ± 93% -84.2% 1159 ± 67% interrupts.CPU20.RES:Rescheduling_interrupts
3241724 ± 32% -80.3% 640070 ± 13% interrupts.CPU20.TLB:TLB_shootdowns
91406 ± 2% +619.5% 657687 ± 10% interrupts.CPU21.LOC:Local_timer_interrupts
34527 ± 99% +368.6% 161789 ± 19% interrupts.CPU22.CAL:Function_call_interrupts
91424 ± 2% +619.3% 657647 ± 10% interrupts.CPU22.LOC:Local_timer_interrupts
172.75 ± 57% +799.2% 1553 ± 30% interrupts.CPU22.NMI:Non-maskable_interrupts
172.75 ± 57% +799.2% 1553 ± 30% interrupts.CPU22.PMI:Performance_monitoring_interrupts
67878 ±101% +672.9% 524610 ± 19% interrupts.CPU22.TLB:TLB_shootdowns
91444 ± 2% +619.1% 657560 ± 10% interrupts.CPU23.LOC:Local_timer_interrupts
91322 ± 2% +616.4% 654231 ± 10% interrupts.CPU24.LOC:Local_timer_interrupts
91360 ± 2% +609.2% 647901 ± 8% interrupts.CPU25.LOC:Local_timer_interrupts
2895 ± 91% -63.4% 1059 ±100% interrupts.CPU25.NMI:Non-maskable_interrupts
2895 ± 91% -63.4% 1059 ±100% interrupts.CPU25.PMI:Performance_monitoring_interrupts
91300 ± 2% +624.3% 661277 ± 9% interrupts.CPU26.LOC:Local_timer_interrupts
482.00 ± 80% +350.6% 2171 ± 61% interrupts.CPU26.NMI:Non-maskable_interrupts
482.00 ± 80% +350.6% 2171 ± 61% interrupts.CPU26.PMI:Performance_monitoring_interrupts
91312 ± 2% +604.8% 643563 ± 10% interrupts.CPU27.LOC:Local_timer_interrupts
5890 ±116% +2954.1% 179892 ± 20% interrupts.CPU28.CAL:Function_call_interrupts
91293 ± 2% +620.6% 657852 ± 10% interrupts.CPU28.LOC:Local_timer_interrupts
10660 ±122% +5430.5% 589551 ± 20% interrupts.CPU28.TLB:TLB_shootdowns
91201 ± 2% +619.7% 656371 ± 10% interrupts.CPU29.LOC:Local_timer_interrupts
91435 ± 2% +618.6% 657029 ± 10% interrupts.CPU3.LOC:Local_timer_interrupts
91315 ± 2% +599.6% 638802 ± 14% interrupts.CPU30.LOC:Local_timer_interrupts
91237 ± 2% +620.9% 657775 ± 10% interrupts.CPU31.LOC:Local_timer_interrupts
91320 ± 2% +620.7% 658106 ± 10% interrupts.CPU32.LOC:Local_timer_interrupts
91328 ± 2% +620.2% 657720 ± 10% interrupts.CPU33.LOC:Local_timer_interrupts
91329 ± 2% +604.7% 643577 ± 11% interrupts.CPU34.LOC:Local_timer_interrupts
489.50 ± 78% +304.9% 1982 ± 62% interrupts.CPU34.NMI:Non-maskable_interrupts
489.50 ± 78% +304.9% 1982 ± 62% interrupts.CPU34.PMI:Performance_monitoring_interrupts
91330 ± 2% +620.5% 658042 ± 10% interrupts.CPU35.LOC:Local_timer_interrupts
257.50 ± 34% +411.6% 1317 ± 53% interrupts.CPU35.NMI:Non-maskable_interrupts
257.50 ± 34% +411.6% 1317 ± 53% interrupts.CPU35.PMI:Performance_monitoring_interrupts
11062 ±166% +1103.0% 133081 ± 18% interrupts.CPU36.CAL:Function_call_interrupts
91325 ± 2% +620.2% 657759 ± 10% interrupts.CPU36.LOC:Local_timer_interrupts
73.25 ±101% +1184.2% 940.67 ± 65% interrupts.CPU36.RES:Rescheduling_interrupts
21201 ±173% +1910.5% 426249 ± 18% interrupts.CPU36.TLB:TLB_shootdowns
91322 ± 2% +621.6% 658973 ± 10% interrupts.CPU37.LOC:Local_timer_interrupts
91317 ± 2% +619.8% 657278 ± 10% interrupts.CPU38.LOC:Local_timer_interrupts
91311 ± 2% +620.3% 657721 ± 10% interrupts.CPU39.LOC:Local_timer_interrupts
91452 ± 2% +618.2% 656807 ± 10% interrupts.CPU4.LOC:Local_timer_interrupts
91287 ± 2% +619.9% 657174 ± 10% interrupts.CPU40.LOC:Local_timer_interrupts
323.50 ±100% +207.6% 995.00 ± 48% interrupts.CPU40.NMI:Non-maskable_interrupts
323.50 ±100% +207.6% 995.00 ± 48% interrupts.CPU40.PMI:Performance_monitoring_interrupts
115822 ±173% +373.0% 547848 ± 11% interrupts.CPU40.TLB:TLB_shootdowns
91314 ± 2% +621.4% 658707 ± 10% interrupts.CPU41.LOC:Local_timer_interrupts
91323 ± 2% +618.7% 656381 ± 10% interrupts.CPU42.LOC:Local_timer_interrupts
119.75 ± 21% +1207.2% 1565 ± 26% interrupts.CPU42.NMI:Non-maskable_interrupts
119.75 ± 21% +1207.2% 1565 ± 26% interrupts.CPU42.PMI:Performance_monitoring_interrupts
76.50 ±139% +2120.9% 1699 ± 81% interrupts.CPU42.RES:Rescheduling_interrupts
135052 ±159% +286.8% 522441 ± 23% interrupts.CPU42.TLB:TLB_shootdowns
91315 ± 2% +619.8% 657282 ± 10% interrupts.CPU43.LOC:Local_timer_interrupts
308.00 ± 53% +263.3% 1119 ± 28% interrupts.CPU43.NMI:Non-maskable_interrupts
308.00 ± 53% +263.3% 1119 ± 28% interrupts.CPU43.PMI:Performance_monitoring_interrupts
21303 ±160% +635.8% 156751 ± 7% interrupts.CPU44.CAL:Function_call_interrupts
91333 ± 2% +620.0% 657591 ± 10% interrupts.CPU44.LOC:Local_timer_interrupts
132.00 ± 19% +775.0% 1155 ± 17% interrupts.CPU44.NMI:Non-maskable_interrupts
132.00 ± 19% +775.0% 1155 ± 17% interrupts.CPU44.PMI:Performance_monitoring_interrupts
36.25 ± 97% +4102.3% 1523 ± 81% interrupts.CPU44.RES:Rescheduling_interrupts
35668 ±160% +1294.0% 497211 ± 9% interrupts.CPU44.TLB:TLB_shootdowns
91324 ± 2% +619.1% 656671 ± 10% interrupts.CPU45.LOC:Local_timer_interrupts
171.25 ± 47% +1160.5% 2158 ± 43% interrupts.CPU45.NMI:Non-maskable_interrupts
171.25 ± 47% +1160.5% 2158 ± 43% interrupts.CPU45.PMI:Performance_monitoring_interrupts
91319 ± 2% +620.5% 657981 ± 10% interrupts.CPU46.LOC:Local_timer_interrupts
259.25 ± 74% +386.8% 1262 ± 74% interrupts.CPU46.NMI:Non-maskable_interrupts
259.25 ± 74% +386.8% 1262 ± 74% interrupts.CPU46.PMI:Performance_monitoring_interrupts
152.25 ± 93% +605.2% 1073 ± 73% interrupts.CPU46.RES:Rescheduling_interrupts
179949 ±102% +138.9% 429973 ± 16% interrupts.CPU46.TLB:TLB_shootdowns
22158 ±103% +636.1% 163103 ± 15% interrupts.CPU47.CAL:Function_call_interrupts
91316 ± 2% +619.3% 656852 ± 10% interrupts.CPU47.LOC:Local_timer_interrupts
258.00 ± 41% +491.0% 1524 ± 14% interrupts.CPU47.NMI:Non-maskable_interrupts
258.00 ± 41% +491.0% 1524 ± 14% interrupts.CPU47.PMI:Performance_monitoring_interrupts
110.25 ±144% +501.7% 663.33 ± 14% interrupts.CPU47.RES:Rescheduling_interrupts
44623 ±106% +1072.9% 523397 ± 15% interrupts.CPU47.TLB:TLB_shootdowns
91426 ± 2% +618.7% 657099 ± 10% interrupts.CPU48.LOC:Local_timer_interrupts
91442 ± 2% +619.4% 657810 ± 10% interrupts.CPU49.LOC:Local_timer_interrupts
91475 ± 2% +619.8% 658437 ± 10% interrupts.CPU5.LOC:Local_timer_interrupts
91447 ± 2% +619.1% 657562 ± 10% interrupts.CPU50.LOC:Local_timer_interrupts
133055 ± 87% +274.6% 498477 ± 13% interrupts.CPU50.TLB:TLB_shootdowns
91446 ± 2% +619.6% 658011 ± 10% interrupts.CPU51.LOC:Local_timer_interrupts
91442 ± 2% +619.5% 657910 ± 10% interrupts.CPU52.LOC:Local_timer_interrupts
33202 ±137% +230.0% 109566 ± 5% interrupts.CPU53.CAL:Function_call_interrupts
91470 ± 2% +619.8% 658431 ± 10% interrupts.CPU53.LOC:Local_timer_interrupts
381.00 ± 73% +272.5% 1419 ± 27% interrupts.CPU53.NMI:Non-maskable_interrupts
381.00 ± 73% +272.5% 1419 ± 27% interrupts.CPU53.PMI:Performance_monitoring_interrupts
110.25 ±111% +383.7% 533.33 ± 2% interrupts.CPU53.RES:Rescheduling_interrupts
62130 ±139% +481.0% 361006 ± 6% interrupts.CPU53.TLB:TLB_shootdowns
7571 ±117% +1548.1% 124780 ± 24% interrupts.CPU54.CAL:Function_call_interrupts
91450 ± 2% +620.1% 658506 ± 10% interrupts.CPU54.LOC:Local_timer_interrupts
11528 ±117% +3376.7% 400807 ± 22% interrupts.CPU54.TLB:TLB_shootdowns
50740 ± 62% +228.9% 166883 ± 26% interrupts.CPU55.CAL:Function_call_interrupts
91421 ± 2% +620.2% 658453 ± 10% interrupts.CPU55.LOC:Local_timer_interrupts
270.75 ± 48% +372.4% 1279 ± 24% interrupts.CPU55.NMI:Non-maskable_interrupts
270.75 ± 48% +372.4% 1279 ± 24% interrupts.CPU55.PMI:Performance_monitoring_interrupts
180.75 ± 73% +267.5% 664.33 ± 6% interrupts.CPU55.RES:Rescheduling_interrupts
98857 ± 63% +440.1% 533878 ± 27% interrupts.CPU55.TLB:TLB_shootdowns
2726 ±145% +4959.8% 137968 ± 18% interrupts.CPU56.CAL:Function_call_interrupts
91596 ± 2% +618.4% 658045 ± 10% interrupts.CPU56.LOC:Local_timer_interrupts
162.50 ± 48% +668.2% 1248 ± 57% interrupts.CPU56.NMI:Non-maskable_interrupts
162.50 ± 48% +668.2% 1248 ± 57% interrupts.CPU56.PMI:Performance_monitoring_interrupts
30.75 ±146% +4917.9% 1543 ± 77% interrupts.CPU56.RES:Rescheduling_interrupts
4435 ±173% +10216.4% 457608 ± 20% interrupts.CPU56.TLB:TLB_shootdowns
37241 ±171% +310.1% 152741 ± 4% interrupts.CPU57.CAL:Function_call_interrupts
91394 ± 2% +618.1% 656291 ± 10% interrupts.CPU57.LOC:Local_timer_interrupts
70192 ±173% +624.1% 508288 ± 6% interrupts.CPU57.TLB:TLB_shootdowns
91415 ± 2% +619.3% 657592 ± 10% interrupts.CPU58.LOC:Local_timer_interrupts
91463 ± 2% +619.9% 658458 ± 10% interrupts.CPU59.LOC:Local_timer_interrupts
613.25 ± 85% +149.3% 1528 ± 24% interrupts.CPU59.NMI:Non-maskable_interrupts
613.25 ± 85% +149.3% 1528 ± 24% interrupts.CPU59.PMI:Performance_monitoring_interrupts
65.50 ±126% +865.9% 632.67 ± 30% interrupts.CPU59.RES:Rescheduling_interrupts
94936 ±173% +411.2% 485333 ± 31% interrupts.CPU59.TLB:TLB_shootdowns
41173 ±106% +324.2% 174664 ± 15% interrupts.CPU6.CAL:Function_call_interrupts
91443 ± 2% +619.7% 658131 ± 10% interrupts.CPU6.LOC:Local_timer_interrupts
84022 ±105% +596.4% 585147 ± 13% interrupts.CPU6.TLB:TLB_shootdowns
13559 ±163% +1052.6% 156280 ± 25% interrupts.CPU60.CAL:Function_call_interrupts
91424 ± 2% +619.8% 658027 ± 10% interrupts.CPU60.LOC:Local_timer_interrupts
82.50 ±103% +508.1% 501.67 ± 24% interrupts.CPU60.RES:Rescheduling_interrupts
24650 ±168% +1992.2% 515724 ± 26% interrupts.CPU60.TLB:TLB_shootdowns
91451 ± 2% +620.1% 658530 ± 10% interrupts.CPU61.LOC:Local_timer_interrupts
91276 ± 2% +622.2% 659231 ± 10% interrupts.CPU62.LOC:Local_timer_interrupts
91566 +619.4% 658761 ± 10% interrupts.CPU63.LOC:Local_timer_interrupts
79456 ±169% +433.2% 423698 ± 31% interrupts.CPU63.TLB:TLB_shootdowns
91248 +621.0% 657891 ± 10% interrupts.CPU64.LOC:Local_timer_interrupts
694.75 ± 52% +198.5% 2074 ± 16% interrupts.CPU64.NMI:Non-maskable_interrupts
694.75 ± 52% +198.5% 2074 ± 16% interrupts.CPU64.PMI:Performance_monitoring_interrupts
91417 ± 2% +619.6% 657844 ± 9% interrupts.CPU65.LOC:Local_timer_interrupts
329.00 ± 61% +273.0% 1227 ± 33% interrupts.CPU65.NMI:Non-maskable_interrupts
329.00 ± 61% +273.0% 1227 ± 33% interrupts.CPU65.PMI:Performance_monitoring_interrupts
23082 ± 67% +556.6% 151549 ± 41% interrupts.CPU66.CAL:Function_call_interrupts
91407 ± 2% +618.5% 656754 ± 10% interrupts.CPU66.LOC:Local_timer_interrupts
41149 ± 63% +1078.4% 484899 ± 39% interrupts.CPU66.TLB:TLB_shootdowns
91433 ± 2% +620.3% 658575 ± 10% interrupts.CPU67.LOC:Local_timer_interrupts
91411 ± 2% +620.5% 658627 ± 10% interrupts.CPU68.LOC:Local_timer_interrupts
91423 ± 2% +620.4% 658572 ± 10% interrupts.CPU69.LOC:Local_timer_interrupts
36.75 ± 8% +672.8% 284.00 interrupts.CPU7.35:PCI-MSI.31981568-edge.i40e-0000:3d:00.0:misc
659.50 ± 57% +22709.4% 150428 ± 33% interrupts.CPU7.CAL:Function_call_interrupts
91406 ± 2% +619.9% 658001 ± 10% interrupts.CPU7.LOC:Local_timer_interrupts
209.75 ± 30% +598.1% 1464 ± 37% interrupts.CPU7.NMI:Non-maskable_interrupts
209.75 ± 30% +598.1% 1464 ± 37% interrupts.CPU7.PMI:Performance_monitoring_interrupts
69.50 ± 57% +833.3% 648.67 ± 19% interrupts.CPU7.RES:Rescheduling_interrupts
480.00 ±170% +1e+05% 492449 ± 33% interrupts.CPU7.TLB:TLB_shootdowns
27310 ±112% +400.2% 136617 ± 15% interrupts.CPU70.CAL:Function_call_interrupts
91446 ± 2% +620.6% 658978 ± 10% interrupts.CPU70.LOC:Local_timer_interrupts
185.50 ± 15% +906.1% 1866 ± 19% interrupts.CPU70.NMI:Non-maskable_interrupts
185.50 ± 15% +906.1% 1866 ± 19% interrupts.CPU70.PMI:Performance_monitoring_interrupts
33.75 ± 82% +1382.5% 500.33 ± 18% interrupts.CPU70.RES:Rescheduling_interrupts
48996 ±120% +794.6% 438323 ± 15% interrupts.CPU70.TLB:TLB_shootdowns
91429 ± 2% +619.0% 657394 ± 10% interrupts.CPU71.LOC:Local_timer_interrupts
91329 ± 2% +618.8% 656467 ± 10% interrupts.CPU72.LOC:Local_timer_interrupts
1388 ± 28% +109.1% 2902 ± 33% interrupts.CPU72.NMI:Non-maskable_interrupts
1388 ± 28% +109.1% 2902 ± 33% interrupts.CPU72.PMI:Performance_monitoring_interrupts
91296 ± 2% +587.7% 627805 ± 6% interrupts.CPU73.LOC:Local_timer_interrupts
9516 ± 98% +629.3% 69405 ± 49% interrupts.CPU74.CAL:Function_call_interrupts
91352 ± 2% +623.3% 660713 ± 10% interrupts.CPU74.LOC:Local_timer_interrupts
353.25 ± 81% +350.4% 1591 ± 53% interrupts.CPU74.NMI:Non-maskable_interrupts
353.25 ± 81% +350.4% 1591 ± 53% interrupts.CPU74.PMI:Performance_monitoring_interrupts
79.25 ±148% +292.0% 310.67 ± 65% interrupts.CPU74.RES:Rescheduling_interrupts
14031 ± 98% +1540.9% 230243 ± 54% interrupts.CPU74.TLB:TLB_shootdowns
91317 ± 2% +618.0% 655625 ± 10% interrupts.CPU75.LOC:Local_timer_interrupts
91334 ± 2% +621.1% 658628 ± 10% interrupts.CPU76.LOC:Local_timer_interrupts
0.25 ±173% +81366.7% 203.67 ± 78% interrupts.CPU77.113:PCI-MSI.31981646-edge.i40e-eth0-TxRx-77
91337 ± 2% +619.9% 657559 ± 10% interrupts.CPU77.LOC:Local_timer_interrupts
0.00 +1.9e+104% 190.67 ± 99% interrupts.CPU78.114:PCI-MSI.31981647-edge.i40e-eth0-TxRx-78
91338 ± 2% +620.5% 658069 ± 10% interrupts.CPU78.LOC:Local_timer_interrupts
0.00 +1.8e+104% 182.67 ± 63% interrupts.CPU79.115:PCI-MSI.31981648-edge.i40e-eth0-TxRx-79
531436 ± 49% -73.6% 140073 ± 10% interrupts.CPU79.CAL:Function_call_interrupts
91338 ± 2% +620.6% 658180 ± 10% interrupts.CPU79.LOC:Local_timer_interrupts
1279 ± 57% -61.4% 494.00 ± 5% interrupts.CPU79.RES:Rescheduling_interrupts
1024465 ± 49% -56.6% 444652 ± 9% interrupts.CPU79.TLB:TLB_shootdowns
438.25 +37483.9% 164711 ± 17% interrupts.CPU8.CAL:Function_call_interrupts
91423 ± 2% +618.9% 657275 ± 10% interrupts.CPU8.LOC:Local_timer_interrupts
150.00 ± 12% +1021.3% 1682 ± 63% interrupts.CPU8.NMI:Non-maskable_interrupts
150.00 ± 12% +1021.3% 1682 ± 63% interrupts.CPU8.PMI:Performance_monitoring_interrupts
1.50 ±110% +3.8e+07% 565425 ± 17% interrupts.CPU8.TLB:TLB_shootdowns
8744 ± 98% +1181.2% 112026 ± 25% interrupts.CPU80.CAL:Function_call_interrupts
91335 ± 2% +621.2% 658668 ± 10% interrupts.CPU80.LOC:Local_timer_interrupts
70.75 ± 70% +546.9% 457.67 ± 11% interrupts.CPU80.RES:Rescheduling_interrupts
16253 ±104% +2070.0% 352696 ± 26% interrupts.CPU80.TLB:TLB_shootdowns
0.00 +7.1e+103% 71.00 ± 62% interrupts.CPU81.117:PCI-MSI.31981650-edge.i40e-eth0-TxRx-81
91337 ± 2% +620.6% 658169 ± 10% interrupts.CPU81.LOC:Local_timer_interrupts
193.75 ± 95% +234.5% 648.00 ± 16% interrupts.CPU81.RES:Rescheduling_interrupts
91342 ± 2% +616.5% 654464 ± 10% interrupts.CPU82.LOC:Local_timer_interrupts
180.00 ±107% +244.1% 619.33 ± 16% interrupts.CPU82.RES:Rescheduling_interrupts
91344 ± 2% +621.8% 659347 ± 10% interrupts.CPU83.LOC:Local_timer_interrupts
350.00 ± 75% +222.7% 1129 ± 19% interrupts.CPU83.NMI:Non-maskable_interrupts
350.00 ± 75% +222.7% 1129 ± 19% interrupts.CPU83.PMI:Performance_monitoring_interrupts
183.25 ±120% +486.6% 1075 ± 68% interrupts.CPU83.RES:Rescheduling_interrupts
0.25 ±173% +21233.3% 53.33 ± 87% interrupts.CPU84.120:PCI-MSI.31981653-edge.i40e-eth0-TxRx-84
91331 ± 2% +621.6% 659054 ± 10% interrupts.CPU84.LOC:Local_timer_interrupts
91309 ± 2% +621.7% 658964 ± 10% interrupts.CPU85.LOC:Local_timer_interrupts
91327 ± 2% +621.7% 659107 ± 10% interrupts.CPU86.LOC:Local_timer_interrupts
122.75 ± 68% +302.4% 494.00 ± 16% interrupts.CPU86.RES:Rescheduling_interrupts
464.75 +25290.8% 118003 ± 31% interrupts.CPU87.CAL:Function_call_interrupts
91350 ± 2% +620.7% 658353 ± 10% interrupts.CPU87.LOC:Local_timer_interrupts
3.75 ± 60% +1e+07% 376468 ± 32% interrupts.CPU87.TLB:TLB_shootdowns
510.75 ± 15% +29935.6% 153406 ± 11% interrupts.CPU88.CAL:Function_call_interrupts
91333 ± 2% +620.8% 658330 ± 10% interrupts.CPU88.LOC:Local_timer_interrupts
109.00 ±169% +4.4e+05% 481712 ± 8% interrupts.CPU88.TLB:TLB_shootdowns
91343 ± 2% +621.3% 658818 ± 10% interrupts.CPU89.LOC:Local_timer_interrupts
91400 ± 2% +618.3% 656547 ± 10% interrupts.CPU9.LOC:Local_timer_interrupts
20036 ± 84% +583.1% 136879 ± 3% interrupts.CPU90.CAL:Function_call_interrupts
91525 ± 2% +619.4% 658431 ± 10% interrupts.CPU90.LOC:Local_timer_interrupts
221.25 ± 60% +440.4% 1195 ± 25% interrupts.CPU90.NMI:Non-maskable_interrupts
221.25 ± 60% +440.4% 1195 ± 25% interrupts.CPU90.PMI:Performance_monitoring_interrupts
34753 ± 82% +1197.6% 450948 ± 3% interrupts.CPU90.TLB:TLB_shootdowns
20216 ±100% +513.0% 123921 ± 30% interrupts.CPU91.CAL:Function_call_interrupts
91058 ± 2% +622.4% 657853 ± 10% interrupts.CPU91.LOC:Local_timer_interrupts
266.00 ± 28% +289.2% 1035 ± 35% interrupts.CPU91.NMI:Non-maskable_interrupts
266.00 ± 28% +289.2% 1035 ± 35% interrupts.CPU91.PMI:Performance_monitoring_interrupts
78.25 ± 89% +666.8% 600.00 ± 14% interrupts.CPU91.RES:Rescheduling_interrupts
40088 ±103% +923.6% 410355 ± 30% interrupts.CPU91.TLB:TLB_shootdowns
14262 ±168% +809.1% 129653 ± 17% interrupts.CPU92.CAL:Function_call_interrupts
91323 ± 2% +621.5% 658884 ± 10% interrupts.CPU92.LOC:Local_timer_interrupts
168.25 ± 22% +541.9% 1080 ± 78% interrupts.CPU92.NMI:Non-maskable_interrupts
168.25 ± 22% +541.9% 1080 ± 78% interrupts.CPU92.PMI:Performance_monitoring_interrupts
22.00 ± 98% +2825.8% 643.67 ± 10% interrupts.CPU92.RES:Rescheduling_interrupts
26830 ±173% +1461.8% 419031 ± 14% interrupts.CPU92.TLB:TLB_shootdowns
91329 ± 2% +621.3% 658747 ± 10% interrupts.CPU93.LOC:Local_timer_interrupts
171.75 ± 18% +1043.3% 1963 ± 25% interrupts.CPU93.NMI:Non-maskable_interrupts
171.75 ± 18% +1043.3% 1963 ± 25% interrupts.CPU93.PMI:Performance_monitoring_interrupts
91336 ± 2% +621.9% 659365 ± 10% interrupts.CPU94.LOC:Local_timer_interrupts
190.25 ± 44% +329.1% 816.33 ± 86% interrupts.CPU94.NMI:Non-maskable_interrupts
190.25 ± 44% +329.1% 816.33 ± 86% interrupts.CPU94.PMI:Performance_monitoring_interrupts
46.75 ±165% +1341.0% 673.67 ± 35% interrupts.CPU94.RES:Rescheduling_interrupts
70182 ±173% +390.0% 343907 ± 20% interrupts.CPU94.TLB:TLB_shootdowns
91411 ± 2% +619.9% 658036 ± 10% interrupts.CPU95.LOC:Local_timer_interrupts
219.50 ± 25% +614.5% 1568 ± 9% interrupts.CPU95.NMI:Non-maskable_interrupts
219.50 ± 25% +614.5% 1568 ± 9% interrupts.CPU95.PMI:Performance_monitoring_interrupts
131.75 ±160% +745.0% 1113 ± 46% interrupts.CPU95.RES:Rescheduling_interrupts
64224 ±172% +623.4% 464571 ± 47% interrupts.CPU95.TLB:TLB_shootdowns
8772137 ± 2% +618.8% 63050381 ± 10% interrupts.LOC:Local_timer_interrupts
115013 ± 5% +25.5% 144352 ± 2% interrupts.NMI:Non-maskable_interrupts
115013 ± 5% +25.5% 144352 ± 2% interrupts.PMI:Performance_monitoring_interrupts
45.16 ± 7% -7.3 37.81 perf-profile.calltrace.cycles-pp.do_access
36.34 ± 16% -6.2 30.13 perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64
36.34 ± 16% -6.2 30.13 perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64
36.34 ± 16% -6.2 30.13 perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
36.67 ± 16% -6.0 30.63 perf-profile.calltrace.cycles-pp.secondary_startup_64
5.66 ± 10% -5.7 0.00 perf-profile.calltrace.cycles-pp.pageout.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node
33.76 ± 18% -5.3 28.51 perf-profile.calltrace.cycles-pp.cpuidle_enter.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
33.39 ± 18% -5.2 28.21 perf-profile.calltrace.cycles-pp.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry.start_secondary
4.78 ± 10% -4.8 0.00 perf-profile.calltrace.cycles-pp.__swap_writepage.pageout.shrink_page_list.shrink_inactive_list.shrink_lruvec
5.43 ± 8% -3.7 1.71 ± 11% perf-profile.calltrace.cycles-pp.do_rw_once
1.83 ± 9% -1.1 0.75 ± 9% perf-profile.calltrace.cycles-pp._raw_spin_lock.handle_pte_fault.__handle_mm_fault.handle_mm_fault.do_page_fault
2.42 ± 8% -0.9 1.49 ± 3% perf-profile.calltrace.cycles-pp.menu_select.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
2.74 ± 11% -0.8 1.89 ± 2% perf-profile.calltrace.cycles-pp.apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.do_idle.cpu_startup_entry
2.36 ± 12% -0.8 1.59 ± 2% perf-profile.calltrace.cycles-pp.smp_apic_timer_interrupt.apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter.do_idle
1.05 ± 9% -0.4 0.63 ± 9% perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages_nodemask.alloc_pages_vma.handle_pte_fault.__handle_mm_fault
1.26 ± 8% -0.3 0.92 perf-profile.calltrace.cycles-pp.hrtimer_interrupt.smp_apic_timer_interrupt.apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
0.00 +0.7 0.66 ± 4% perf-profile.calltrace.cycles-pp.__lookup_slow.walk_component.link_path_walk.path_parentat.filename_parentat
0.00 +0.7 0.66 ± 4% perf-profile.calltrace.cycles-pp.d_alloc_parallel.__lookup_slow.walk_component.link_path_walk.path_parentat
0.00 +0.7 0.66 ± 4% perf-profile.calltrace.cycles-pp.link_path_walk.path_parentat.filename_parentat.filename_create.do_mkdirat
0.00 +0.7 0.66 ± 4% perf-profile.calltrace.cycles-pp.walk_component.link_path_walk.path_parentat.filename_parentat.filename_create
0.00 +0.7 0.66 ± 4% perf-profile.calltrace.cycles-pp.path_parentat.filename_parentat.filename_create.do_mkdirat.do_syscall_64
0.00 +0.7 0.67 ± 4% perf-profile.calltrace.cycles-pp.filename_create.do_mkdirat.do_syscall_64.entry_SYSCALL_64_after_hwframe.mkdir
0.00 +0.7 0.67 ± 4% perf-profile.calltrace.cycles-pp.filename_parentat.filename_create.do_mkdirat.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +0.7 0.67 ± 4% perf-profile.calltrace.cycles-pp.do_mkdirat.do_syscall_64.entry_SYSCALL_64_after_hwframe.mkdir
0.00 +0.7 0.67 ± 4% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.mkdir
0.00 +0.7 0.67 ± 4% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.mkdir
0.00 +0.7 0.68 ± 5% perf-profile.calltrace.cycles-pp.mkdir
0.00 +0.8 0.77 ± 41% perf-profile.calltrace.cycles-pp.page_fault.__libc_fork.forkshell
0.00 +0.8 0.77 ± 41% perf-profile.calltrace.cycles-pp.do_page_fault.page_fault.__libc_fork.forkshell
0.00 +0.8 0.77 ± 41% perf-profile.calltrace.cycles-pp.handle_mm_fault.do_page_fault.page_fault.__libc_fork.forkshell
0.00 +0.8 0.77 ± 41% perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_page_fault.page_fault.__libc_fork
0.00 +1.1 1.07 ± 18% perf-profile.calltrace.cycles-pp.__slab_alloc.kmem_cache_alloc.__alloc_file.alloc_empty_file.path_openat
0.00 +1.1 1.07 ± 18% perf-profile.calltrace.cycles-pp.___slab_alloc.__slab_alloc.kmem_cache_alloc.__alloc_file.alloc_empty_file
0.00 +1.1 1.07 ± 18% perf-profile.calltrace.cycles-pp.new_slab.___slab_alloc.__slab_alloc.kmem_cache_alloc.__alloc_file
0.00 +1.1 1.07 ± 18% perf-profile.calltrace.cycles-pp.kmem_cache_alloc.__alloc_file.alloc_empty_file.path_openat.do_filp_open
0.00 +1.1 1.08 ± 19% perf-profile.calltrace.cycles-pp.alloc_empty_file.path_openat.do_filp_open.do_sys_openat2.do_sys_open
0.00 +1.1 1.08 ± 19% perf-profile.calltrace.cycles-pp.__alloc_file.alloc_empty_file.path_openat.do_filp_open.do_sys_openat2
0.00 +1.2 1.22 ± 37% perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.pipe_write.new_sync_write.vfs_write.ksys_write
0.00 +1.2 1.22 ± 37% perf-profile.calltrace.cycles-pp.__alloc_pages_slowpath.__alloc_pages_nodemask.pipe_write.new_sync_write.vfs_write
0.00 +1.2 1.22 ± 37% perf-profile.calltrace.cycles-pp.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.pipe_write.new_sync_write
0.00 +1.2 1.22 ± 37% perf-profile.calltrace.cycles-pp.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.pipe_write
0.00 +1.2 1.25 ± 36% perf-profile.calltrace.cycles-pp.pipe_write.new_sync_write.vfs_write.ksys_write.do_syscall_64
0.00 +1.4 1.37 ± 39% perf-profile.calltrace.cycles-pp.__slab_alloc.kmem_cache_alloc.__d_alloc.d_alloc.d_alloc_parallel
0.00 +1.4 1.37 ± 39% perf-profile.calltrace.cycles-pp.___slab_alloc.__slab_alloc.kmem_cache_alloc.__d_alloc.d_alloc
0.00 +1.4 1.37 ± 39% perf-profile.calltrace.cycles-pp.new_slab.___slab_alloc.__slab_alloc.kmem_cache_alloc.__d_alloc
0.00 +1.4 1.37 ± 39% perf-profile.calltrace.cycles-pp.kmem_cache_alloc.__d_alloc.d_alloc.d_alloc_parallel.__lookup_slow
0.00 +1.4 1.37 ± 39% perf-profile.calltrace.cycles-pp.d_alloc.d_alloc_parallel.__lookup_slow.walk_component.link_path_walk
0.00 +1.4 1.37 ± 39% perf-profile.calltrace.cycles-pp.__d_alloc.d_alloc.d_alloc_parallel.__lookup_slow.walk_component
0.00 +1.4 1.42 ± 38% perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.__vmalloc_node_range.copy_process._do_fork.__x64_sys_clone
0.00 +1.4 1.42 ± 38% perf-profile.calltrace.cycles-pp.__alloc_pages_slowpath.__alloc_pages_nodemask.__vmalloc_node_range.copy_process._do_fork
0.00 +1.4 1.42 ± 38% perf-profile.calltrace.cycles-pp.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.__vmalloc_node_range.copy_process
0.00 +1.4 1.42 ± 38% perf-profile.calltrace.cycles-pp.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.__vmalloc_node_range
0.00 +1.4 1.42 ± 38% perf-profile.calltrace.cycles-pp.__vmalloc_node_range.copy_process._do_fork.__x64_sys_clone.do_syscall_64
0.00 +1.4 1.42 ± 21% perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.pagecache_get_page.grab_cache_page_write_begin.nfs_write_begin.generic_perform_write
0.00 +1.4 1.42 ± 21% perf-profile.calltrace.cycles-pp.__alloc_pages_slowpath.__alloc_pages_nodemask.pagecache_get_page.grab_cache_page_write_begin.nfs_write_begin
0.00 +1.4 1.42 ± 20% perf-profile.calltrace.cycles-pp.grab_cache_page_write_begin.nfs_write_begin.generic_perform_write.nfs_file_write.new_sync_write
0.00 +1.4 1.42 ± 20% perf-profile.calltrace.cycles-pp.pagecache_get_page.grab_cache_page_write_begin.nfs_write_begin.generic_perform_write.nfs_file_write
0.00 +1.4 1.42 ± 20% perf-profile.calltrace.cycles-pp.nfs_write_begin.generic_perform_write.nfs_file_write.new_sync_write.vfs_write
0.00 +1.4 1.43 ± 21% perf-profile.calltrace.cycles-pp.generic_perform_write.nfs_file_write.new_sync_write.vfs_write.ksys_write
0.00 +1.4 1.43 ± 21% perf-profile.calltrace.cycles-pp.nfs_file_write.new_sync_write.vfs_write.ksys_write.do_syscall_64
0.00 +1.8 1.81 ± 27% perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.pagecache_get_page.grab_cache_page_write_begin.simple_write_begin.generic_perform_write
0.00 +1.8 1.81 ± 27% perf-profile.calltrace.cycles-pp.__alloc_pages_slowpath.__alloc_pages_nodemask.pagecache_get_page.grab_cache_page_write_begin.simple_write_begin
0.00 +1.8 1.81 ± 27% perf-profile.calltrace.cycles-pp.grab_cache_page_write_begin.simple_write_begin.generic_perform_write.__generic_file_write_iter.generic_file_write_iter
0.00 +1.8 1.81 ± 27% perf-profile.calltrace.cycles-pp.pagecache_get_page.grab_cache_page_write_begin.simple_write_begin.generic_perform_write.__generic_file_write_iter
0.00 +1.8 1.81 ± 27% perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.__GI___libc_write
0.00 +1.8 1.81 ± 27% perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.__GI___libc_write
0.00 +1.8 1.81 ± 27% perf-profile.calltrace.cycles-pp.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.new_sync_write.vfs_write
0.00 +1.8 1.81 ± 27% perf-profile.calltrace.cycles-pp.generic_file_write_iter.new_sync_write.vfs_write.ksys_write.do_syscall_64
0.00 +1.8 1.81 ± 27% perf-profile.calltrace.cycles-pp.__generic_file_write_iter.generic_file_write_iter.new_sync_write.vfs_write.ksys_write
0.00 +1.8 1.81 ± 27% perf-profile.calltrace.cycles-pp.simple_write_begin.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.new_sync_write
0.00 +1.8 1.82 ± 27% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__GI___libc_write
0.00 +1.8 1.82 ± 27% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__GI___libc_write
0.00 +1.8 1.82 ± 27% perf-profile.calltrace.cycles-pp.__GI___libc_write
0.00 +1.8 1.84 ± 18% perf-profile.calltrace.cycles-pp.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe.__GI___libc_open
0.00 +1.8 1.84 ± 18% perf-profile.calltrace.cycles-pp.do_sys_openat2.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe.__GI___libc_open
0.51 ±173% +1.8 2.35 ± 8% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__libc_fork.forkshell
0.51 ±173% +1.8 2.35 ± 8% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_fork.forkshell
0.51 ±173% +1.8 2.35 ± 8% perf-profile.calltrace.cycles-pp.__x64_sys_clone.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_fork.forkshell
0.51 ±173% +1.8 2.35 ± 8% perf-profile.calltrace.cycles-pp._do_fork.__x64_sys_clone.do_syscall_64.entry_SYSCALL_64_after_hwframe.__libc_fork
0.51 ±173% +1.8 2.35 ± 8% perf-profile.calltrace.cycles-pp.copy_process._do_fork.__x64_sys_clone.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +1.8 1.85 ± 18% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__GI___libc_open
0.00 +1.8 1.85 ± 18% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__GI___libc_open
0.00 +1.9 1.85 ± 18% perf-profile.calltrace.cycles-pp.__GI___libc_open
0.00 +1.9 1.88 ± 18% perf-profile.calltrace.cycles-pp.page_fault
0.00 +1.9 1.88 ± 18% perf-profile.calltrace.cycles-pp.do_page_fault.page_fault
0.00 +1.9 1.88 ± 18% perf-profile.calltrace.cycles-pp.handle_mm_fault.do_page_fault.page_fault
0.00 +1.9 1.88 ± 18% perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.do_page_fault.page_fault
0.00 +2.4 2.37 ± 24% perf-profile.calltrace.cycles-pp.do_filp_open.do_sys_openat2.do_sys_open.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +2.4 2.37 ± 24% perf-profile.calltrace.cycles-pp.path_openat.do_filp_open.do_sys_openat2.do_sys_open.do_syscall_64
0.00 +2.4 2.44 ± 15% perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.new_slab.___slab_alloc.__slab_alloc.kmem_cache_alloc
0.00 +2.4 2.44 ± 15% perf-profile.calltrace.cycles-pp.__alloc_pages_slowpath.__alloc_pages_nodemask.new_slab.___slab_alloc.__slab_alloc
0.00 +2.4 2.44 ± 15% perf-profile.calltrace.cycles-pp.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.new_slab.___slab_alloc
0.00 +2.4 2.44 ± 15% perf-profile.calltrace.cycles-pp.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.new_slab
0.00 +2.5 2.54 ± 8% perf-profile.calltrace.cycles-pp.io_serial_in.wait_for_xmitr.serial8250_console_putchar.uart_console_write.serial8250_console_write
0.00 +2.7 2.68 ± 27% perf-profile.calltrace.cycles-pp.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
0.00 +2.7 2.68 ± 27% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.write
0.00 +2.7 2.68 ± 27% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
0.00 +2.7 2.68 ± 27% perf-profile.calltrace.cycles-pp.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe.write
0.00 +2.7 2.68 ± 27% perf-profile.calltrace.cycles-pp.write
0.52 ±173% +2.8 3.32 ± 11% perf-profile.calltrace.cycles-pp.__libc_fork.forkshell
0.00 +2.8 2.82 ± 9% perf-profile.calltrace.cycles-pp.wait_for_xmitr.serial8250_console_putchar.uart_console_write.serial8250_console_write.console_unlock
0.00 +2.8 2.82 ± 9% perf-profile.calltrace.cycles-pp.serial8250_console_putchar.uart_console_write.serial8250_console_write.console_unlock.vprintk_emit
0.97 ± 17% +3.1 4.05 ± 25% perf-profile.calltrace.cycles-pp.worker_thread.kthread.ret_from_fork
0.90 ± 16% +3.1 4.03 ± 25% perf-profile.calltrace.cycles-pp.process_one_work.worker_thread.kthread.ret_from_fork
0.00 +3.2 3.23 ± 6% perf-profile.calltrace.cycles-pp.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.pagecache_get_page.grab_cache_page_write_begin
0.00 +3.2 3.23 ± 6% perf-profile.calltrace.cycles-pp.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.pagecache_get_page
0.00 +3.2 3.25 ± 9% perf-profile.calltrace.cycles-pp.uart_console_write.serial8250_console_write.console_unlock.vprintk_emit.printk
0.52 ±173% +3.3 3.85 ± 9% perf-profile.calltrace.cycles-pp.forkshell
0.00 +3.4 3.41 ± 9% perf-profile.calltrace.cycles-pp.serial8250_console_write.console_unlock.vprintk_emit.printk.rcu_oom_scan
0.00 +3.5 3.49 ± 8% perf-profile.calltrace.cycles-pp.console_unlock.vprintk_emit.printk.rcu_oom_scan.do_shrink_slab
0.49 ± 59% +3.5 4.03 ± 5% perf-profile.calltrace.cycles-pp.shrink_slab.shrink_node.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath
0.47 ± 59% +3.6 4.03 ± 5% perf-profile.calltrace.cycles-pp.do_shrink_slab.shrink_slab.shrink_node.do_try_to_free_pages.try_to_free_pages
0.00 +3.6 3.61 ± 25% perf-profile.calltrace.cycles-pp.memcpy_erms.drm_fb_helper_dirty_work.process_one_work.worker_thread.kthread
0.00 +3.7 3.73 ± 25% perf-profile.calltrace.cycles-pp.drm_fb_helper_dirty_work.process_one_work.worker_thread.kthread.ret_from_fork
33.79 ± 6% +3.9 37.64 ± 2% perf-profile.calltrace.cycles-pp.handle_pte_fault.__handle_mm_fault.handle_mm_fault.do_page_fault.page_fault
0.00 +3.9 3.92 ± 6% perf-profile.calltrace.cycles-pp.rcu_oom_scan.do_shrink_slab.shrink_slab.shrink_node.do_try_to_free_pages
0.00 +4.4 4.35 ± 3% perf-profile.calltrace.cycles-pp.printk.rcu_oom_scan.do_shrink_slab.shrink_slab.shrink_node
0.00 +4.4 4.35 ± 3% perf-profile.calltrace.cycles-pp.vprintk_emit.printk.rcu_oom_scan.do_shrink_slab.shrink_slab
0.00 +4.5 4.49 ± 8% perf-profile.calltrace.cycles-pp.new_sync_write.vfs_write.ksys_write.do_syscall_64.entry_SYSCALL_64_after_hwframe
19.77 ± 16% +4.9 24.63 ± 11% perf-profile.calltrace.cycles-pp.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.alloc_pages_vma.handle_pte_fault
19.60 ± 16% +5.4 25.02 ± 12% perf-profile.calltrace.cycles-pp.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.alloc_pages_vma
24.02 ± 7% +13.2 37.20 ± 3% perf-profile.calltrace.cycles-pp.shrink_lruvec.shrink_node.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath
22.00 ± 7% +14.7 36.68 ± 3% perf-profile.calltrace.cycles-pp.shrink_inactive_list.shrink_lruvec.shrink_node.do_try_to_free_pages.try_to_free_pages
21.63 ± 7% +15.0 36.64 ± 3% perf-profile.calltrace.cycles-pp.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node.do_try_to_free_pages
24.80 ± 6% +17.1 41.89 ± 3% perf-profile.calltrace.cycles-pp.shrink_node.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask
14.09 ± 10% +29.5 43.58 ± 4% perf-profile.calltrace.cycles-pp.try_to_unmap_flush_dirty.shrink_page_list.shrink_inactive_list.shrink_lruvec.shrink_node
14.05 ± 10% +29.5 43.57 ± 4% perf-profile.calltrace.cycles-pp.arch_tlbbatch_flush.try_to_unmap_flush_dirty.shrink_page_list.shrink_inactive_list.shrink_lruvec
13.55 ± 10% +30.0 43.55 ± 4% perf-profile.calltrace.cycles-pp.on_each_cpu_cond_mask.arch_tlbbatch_flush.try_to_unmap_flush_dirty.shrink_page_list.shrink_inactive_list
12.92 ± 10% +30.6 43.47 ± 4% perf-profile.calltrace.cycles-pp.smp_call_function_many_cond.on_each_cpu_cond_mask.arch_tlbbatch_flush.try_to_unmap_flush_dirty.shrink_page_list
9.99 ± 6% -9.5 0.47 ± 32% perf-profile.children.cycles-pp.call_function_interrupt
7.22 ± 6% -6.9 0.34 ± 33% perf-profile.children.cycles-pp.smp_call_function_interrupt
45.05 ± 7% -6.8 38.25 perf-profile.children.cycles-pp.do_access
7.03 ± 6% -6.7 0.33 ± 31% perf-profile.children.cycles-pp.flush_smp_call_function_queue
36.34 ± 16% -6.2 30.13 perf-profile.children.cycles-pp.start_secondary
36.69 ± 16% -6.1 30.63 perf-profile.children.cycles-pp.do_idle
36.67 ± 16% -6.0 30.63 perf-profile.children.cycles-pp.secondary_startup_64
36.67 ± 16% -6.0 30.63 perf-profile.children.cycles-pp.cpu_startup_entry
5.97 ± 10% -5.5 0.48 ± 26% perf-profile.children.cycles-pp.pageout
5.81 ± 6% -5.1 0.70 ± 23% perf-profile.children.cycles-pp.rmap_walk_anon
34.08 ± 18% -5.1 28.98 perf-profile.children.cycles-pp.cpuidle_enter_state
34.08 ± 18% -5.1 28.98 perf-profile.children.cycles-pp.cpuidle_enter
5.05 ± 10% -4.7 0.39 ± 25% perf-profile.children.cycles-pp.__swap_writepage
4.97 ± 10% -4.6 0.38 ± 26% perf-profile.children.cycles-pp.bdev_write_page
6.29 ± 9% -4.3 1.97 ± 12% perf-profile.children.cycles-pp.do_rw_once
4.42 ± 10% -4.1 0.33 ± 24% perf-profile.children.cycles-pp.pmem_rw_page
3.39 ± 8% -3.1 0.26 ± 36% perf-profile.children.cycles-pp.try_to_unmap
2.98 ± 9% -2.8 0.21 ± 37% perf-profile.children.cycles-pp.try_to_unmap_one
2.93 ± 11% -2.7 0.22 ± 21% perf-profile.children.cycles-pp.__remove_mapping
2.82 ± 14% -2.7 0.15 ± 25% perf-profile.children.cycles-pp.llist_add_batch
2.78 ± 6% -2.6 0.13 ± 30% perf-profile.children.cycles-pp.flush_tlb_func_common
3.78 ± 7% -2.6 1.13 ± 13% perf-profile.children.cycles-pp._raw_spin_lock
2.77 ± 11% -2.6 0.19 ± 23% perf-profile.children.cycles-pp.pmem_do_bvec
2.75 ± 11% -2.6 0.19 ± 23% perf-profile.children.cycles-pp.write_pmem
2.73 ± 11% -2.5 0.19 ± 23% perf-profile.children.cycles-pp.__memcpy_flushcache
2.58 ± 9% -2.4 0.19 ± 28% perf-profile.children.cycles-pp.add_to_swap
2.46 ± 6% -2.3 0.13 ± 29% perf-profile.children.cycles-pp.llist_reverse_order
2.72 ± 5% -2.2 0.49 ± 15% perf-profile.children.cycles-pp.page_referenced
2.30 ± 8% -2.1 0.19 ± 22% perf-profile.children.cycles-pp.default_send_IPI_mask_sequence_phys
2.85 ± 9% -2.1 0.75 ± 12% perf-profile.children.cycles-pp.get_page_from_freelist
2.12 ± 11% -1.7 0.41 ± 15% perf-profile.children.cycles-pp.__softirqentry_text_start
1.81 ± 8% -1.7 0.14 ± 28% perf-profile.children.cycles-pp.add_to_swap_cache
2.19 ± 5% -1.6 0.57 ± 10% perf-profile.children.cycles-pp.shrink_active_list
1.75 ± 12% -1.6 0.17 ± 38% perf-profile.children.cycles-pp.rcu_core
1.95 ± 8% -1.6 0.37 ± 11% perf-profile.children.cycles-pp.native_irq_return_iret
1.72 ± 9% -1.6 0.15 ± 21% perf-profile.children.cycles-pp.__default_send_IPI_dest_field
1.74 ± 8% -1.5 0.26 ± 23% perf-profile.children.cycles-pp.page_vma_mapped_walk
1.62 ± 9% -1.5 0.14 ± 29% perf-profile.children.cycles-pp.end_page_writeback
1.60 ± 11% -1.5 0.15 ± 44% perf-profile.children.cycles-pp.rcu_do_batch
3.21 ± 9% -1.3 1.86 perf-profile.children.cycles-pp.smp_apic_timer_interrupt
1.61 ± 6% -1.3 0.26 ± 19% perf-profile.children.cycles-pp.page_referenced_one
3.63 ± 9% -1.3 2.32 ± 2% perf-profile.children.cycles-pp.apic_timer_interrupt
1.23 ± 6% -1.2 0.05 ± 72% perf-profile.children.cycles-pp.native_flush_tlb
1.23 ± 11% -1.1 0.11 ± 42% perf-profile.children.cycles-pp.kmem_cache_free
1.16 ± 17% -1.1 0.06 ± 71% perf-profile.children.cycles-pp.smpboot_thread_fn
1.13 ± 17% -1.1 0.05 ± 71% perf-profile.children.cycles-pp.run_ksoftirqd
1.54 ± 10% -1.0 0.51 ± 12% perf-profile.children.cycles-pp.mem_cgroup_try_charge_delay
1.06 ± 11% -1.0 0.09 ± 41% perf-profile.children.cycles-pp.__slab_free
1.36 ± 14% -1.0 0.39 ± 13% perf-profile.children.cycles-pp.down_read_trylock
2.43 ± 8% -0.9 1.50 ± 3% perf-profile.children.cycles-pp.menu_select
0.99 ± 11% -0.9 0.09 ± 28% perf-profile.children.cycles-pp.__delete_from_swap_cache
1.22 ± 7% -0.9 0.32 ± 22% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
1.36 ± 7% -0.9 0.46 ± 13% perf-profile.children.cycles-pp.prep_new_page
0.99 ± 9% -0.9 0.09 ± 26% perf-profile.children.cycles-pp.xas_create
0.98 ± 9% -0.9 0.10 ± 29% perf-profile.children.cycles-pp.xas_create_range
1.25 ± 9% -0.8 0.43 ± 14% perf-profile.children.cycles-pp.mem_cgroup_try_charge
1.30 ± 3% -0.8 0.52 ± 6% perf-profile.children.cycles-pp.irq_exit
1.20 ± 8% -0.8 0.42 ± 12% perf-profile.children.cycles-pp.clear_page_erms
0.84 ± 7% -0.8 0.08 ± 12% perf-profile.children.cycles-pp._raw_spin_unlock_irqrestore
0.82 ± 14% -0.8 0.07 ± 23% perf-profile.children.cycles-pp.mem_cgroup_swapout
0.83 ± 16% -0.7 0.08 ± 72% perf-profile.children.cycles-pp.drain_local_pages_wq
0.83 ± 16% -0.7 0.08 ± 72% perf-profile.children.cycles-pp.drain_pages
0.82 ± 15% -0.7 0.08 ± 72% perf-profile.children.cycles-pp.drain_pages_zone
0.81 ± 8% -0.7 0.08 ± 29% perf-profile.children.cycles-pp.xas_alloc
0.83 ± 11% -0.7 0.11 ± 4% perf-profile.children.cycles-pp._find_next_bit
0.80 ± 16% -0.7 0.09 ± 71% perf-profile.children.cycles-pp.free_pcppages_bulk
0.81 ± 5% -0.7 0.16 ± 21% perf-profile.children.cycles-pp.page_lock_anon_vma_read
0.69 ± 12% -0.6 0.05 ± 72% perf-profile.children.cycles-pp.xas_store
0.64 ± 9% -0.6 0.04 ± 71% perf-profile.children.cycles-pp.swap_writepage
0.66 ± 7% -0.6 0.08 ± 17% perf-profile.children.cycles-pp.cpumask_next
0.75 ± 9% -0.5 0.25 ± 11% perf-profile.children.cycles-pp.__lru_cache_add
0.72 ± 8% -0.5 0.22 ± 11% perf-profile.children.cycles-pp.pagevec_lru_move_fn
0.74 ± 11% -0.5 0.25 ± 15% perf-profile.children.cycles-pp.lru_cache_add_active_or_unevictable
0.68 ± 13% -0.5 0.21 ± 7% perf-profile.children.cycles-pp.up_read
0.53 ± 15% -0.4 0.10 ± 16% perf-profile.children.cycles-pp.count_shadow_nodes
1.57 ± 8% -0.4 1.15 perf-profile.children.cycles-pp.hrtimer_interrupt
0.44 ± 11% -0.4 0.04 ± 73% perf-profile.children.cycles-pp.call_rcu
0.40 ± 8% -0.4 0.04 ± 71% perf-profile.children.cycles-pp.radix_tree_node_ctor
0.43 ± 10% -0.3 0.08 ± 17% perf-profile.children.cycles-pp.isolate_lru_pages
0.41 ± 15% -0.3 0.06 ± 13% perf-profile.children.cycles-pp.__list_del_entry_valid
0.46 ± 9% -0.3 0.13 ± 12% perf-profile.children.cycles-pp._raw_spin_lock_irqsave
0.99 ± 19% -0.3 0.66 ± 6% perf-profile.children.cycles-pp.ktime_get
0.37 ± 10% -0.3 0.04 ± 76% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
0.40 ± 9% -0.2 0.15 ± 12% perf-profile.children.cycles-pp.sync_regs
0.39 ± 6% -0.2 0.15 ± 11% perf-profile.children.cycles-pp.__pagevec_lru_add_fn
0.79 ± 5% -0.2 0.57 perf-profile.children.cycles-pp.__hrtimer_run_queues
0.28 ± 6% -0.2 0.07 ± 20% perf-profile.children.cycles-pp.__mod_lruvec_state
0.59 ± 4% -0.2 0.40 ± 2% perf-profile.children.cycles-pp.tick_sched_timer
0.31 ± 13% -0.2 0.13 ± 14% perf-profile.children.cycles-pp.try_charge
0.29 ± 16% -0.2 0.11 ± 19% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.24 ± 10% -0.2 0.07 ± 7% perf-profile.children.cycles-pp.move_pages_to_lru
0.59 ± 12% -0.2 0.44 ± 3% perf-profile.children.cycles-pp.clockevents_program_event
0.20 ± 7% -0.1 0.06 ± 8% perf-profile.children.cycles-pp.__perf_sw_event
0.20 ± 17% -0.1 0.06 ± 14% perf-profile.children.cycles-pp.mem_cgroup_throttle_swaprate
0.43 ± 6% -0.1 0.29 ± 5% perf-profile.children.cycles-pp.update_process_times
0.43 ± 6% -0.1 0.30 ± 5% perf-profile.children.cycles-pp.tick_sched_handle
0.51 ± 14% -0.1 0.39 ± 14% perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
0.15 ± 15% -0.1 0.04 ± 73% perf-profile.children.cycles-pp.mem_cgroup_commit_charge
0.17 ± 19% -0.1 0.07 ± 25% perf-profile.children.cycles-pp.__sched_text_start
0.20 ± 9% -0.1 0.12 ± 23% perf-profile.children.cycles-pp.irq_work_run_list
0.11 ± 28% -0.1 0.04 ± 73% perf-profile.children.cycles-pp.schedule
0.18 ± 6% -0.1 0.13 ± 3% perf-profile.children.cycles-pp.scheduler_tick
0.20 ± 4% -0.0 0.15 ± 3% perf-profile.children.cycles-pp.get_next_timer_interrupt
0.18 ± 21% -0.0 0.13 perf-profile.children.cycles-pp.rebalance_domains
0.15 ± 6% -0.0 0.10 ± 4% perf-profile.children.cycles-pp.native_write_msr
0.12 ± 10% -0.0 0.08 ± 10% perf-profile.children.cycles-pp.lapic_next_deadline
0.14 ± 5% -0.0 0.11 ± 8% perf-profile.children.cycles-pp.__next_timer_interrupt
0.09 ± 24% -0.0 0.06 ± 16% perf-profile.children.cycles-pp._raw_spin_trylock
0.09 ± 9% -0.0 0.06 ± 14% perf-profile.children.cycles-pp.read_tsc
0.08 ± 10% -0.0 0.05 ± 8% perf-profile.children.cycles-pp.sched_clock_cpu
0.07 ± 22% -0.0 0.06 ± 16% perf-profile.children.cycles-pp.run_local_timers
0.00 +0.1 0.09 ± 36% perf-profile.children.cycles-pp.fbcon_putcs
0.00 +0.1 0.09 ± 36% perf-profile.children.cycles-pp.bit_putcs
0.00 +0.1 0.09 ± 39% perf-profile.children.cycles-pp.fbcon_redraw
0.00 +0.1 0.10 ± 37% perf-profile.children.cycles-pp.lf
0.00 +0.1 0.10 ± 37% perf-profile.children.cycles-pp.con_scroll
0.00 +0.1 0.10 ± 37% perf-profile.children.cycles-pp.fbcon_scroll
0.00 +0.1 0.11 ± 34% perf-profile.children.cycles-pp.vt_console_print
0.00 +0.1 0.14 ± 40% perf-profile.children.cycles-pp.ksys_read
0.00 +0.1 0.14 ± 40% perf-profile.children.cycles-pp.vfs_read
0.00 +0.1 0.14 ± 42% perf-profile.children.cycles-pp.read
0.00 +0.2 0.16 ± 38% perf-profile.children.cycles-pp.sk_page_frag_refill
0.00 +0.2 0.16 ± 38% perf-profile.children.cycles-pp.skb_page_frag_refill
0.00 +0.2 0.16 ± 21% perf-profile.children.cycles-pp.force_qs_rnp
0.00 +0.2 0.17 ± 17% perf-profile.children.cycles-pp.rcu_gp_kthread
0.00 +0.2 0.18 ± 69% perf-profile.children.cycles-pp.setup_arg_pages
0.00 +0.2 0.18 ± 69% perf-profile.children.cycles-pp.shift_arg_pages
0.00 +0.2 0.18 ± 69% perf-profile.children.cycles-pp.move_page_tables
0.00 +0.2 0.19 ± 39% perf-profile.children.cycles-pp.schedule_tail
0.00 +0.2 0.19 ± 39% perf-profile.children.cycles-pp.__put_user_4
0.00 +0.2 0.19 ± 26% perf-profile.children.cycles-pp.call_transmit
0.00 +0.2 0.19 ± 26% perf-profile.children.cycles-pp.xprt_transmit
0.00 +0.2 0.19 ± 26% perf-profile.children.cycles-pp.xs_tcp_send_request
0.00 +0.2 0.19 ± 26% perf-profile.children.cycles-pp.xs_sendpages
0.00 +0.2 0.19 ± 26% perf-profile.children.cycles-pp.sock_sendmsg
0.00 +0.2 0.19 ± 26% perf-profile.children.cycles-pp.tcp_sendmsg
0.00 +0.2 0.19 ± 26% perf-profile.children.cycles-pp.tcp_sendmsg_locked
0.00 +0.2 0.19 ± 71% perf-profile.children.cycles-pp.__vmalloc_node
0.00 +0.2 0.19 ± 28% perf-profile.children.cycles-pp.rpc_async_schedule
0.00 +0.2 0.19 ± 28% perf-profile.children.cycles-pp.__rpc_execute
0.00 +0.2 0.20 ± 14% perf-profile.children.cycles-pp.copy_strings
0.00 +0.2 0.20 ± 14% perf-profile.children.cycles-pp.get_user_pages_remote
0.00 +0.2 0.20 ± 14% perf-profile.children.cycles-pp.__get_user_pages
0.01 ±173% +0.2 0.24 ± 48% perf-profile.children.cycles-pp.search_binary_handler
0.01 ±173% +0.2 0.24 ± 48% perf-profile.children.cycles-pp.load_elf_binary
0.00 +0.3 0.26 ± 53% perf-profile.children.cycles-pp.shmem_alloc_page
0.00 +0.3 0.26 ± 51% perf-profile.children.cycles-pp.posix_fallocate
0.00 +0.3 0.26 ± 51% perf-profile.children.cycles-pp.__x64_sys_fallocate
0.00 +0.3 0.26 ± 51% perf-profile.children.cycles-pp.ksys_fallocate
0.00 +0.3 0.26 ± 51% perf-profile.children.cycles-pp.vfs_fallocate
0.00 +0.3 0.26 ± 51% perf-profile.children.cycles-pp.shmem_fallocate
0.00 +0.3 0.26 ± 51% perf-profile.children.cycles-pp.shmem_alloc_and_acct_page
0.00 +0.3 0.27 ± 28% perf-profile.children.cycles-pp.kmem_cache_alloc_node
0.00 +0.3 0.28 ± 33% perf-profile.children.cycles-pp.__pmd_alloc
0.12 ± 13% +0.3 0.45 ± 9% perf-profile.children.cycles-pp.swapgs_restore_regs_and_return_to_usermode
0.00 +0.3 0.33 ± 37% perf-profile.children.cycles-pp.perf_poll
0.00 +0.3 0.33 ± 37% perf-profile.children.cycles-pp.start_thread
0.00 +0.3 0.33 ± 37% perf-profile.children.cycles-pp.__pollwait
0.00 +0.3 0.33 ± 36% perf-profile.children.cycles-pp.poll
0.00 +0.3 0.33 ± 36% perf-profile.children.cycles-pp.__x64_sys_poll
0.00 +0.3 0.33 ± 36% perf-profile.children.cycles-pp.do_sys_poll
0.08 ± 23% +0.4 0.44 ± 9% perf-profile.children.cycles-pp.prepare_exit_to_usermode
0.00 +0.4 0.39 ± 9% perf-profile.children.cycles-pp.delay_tsc
0.00 +0.4 0.41 ± 9% perf-profile.children.cycles-pp.flush_tlb_batched_pending
0.01 ±173% +0.4 0.43 ± 10% perf-profile.children.cycles-pp.task_numa_work
0.01 ±173% +0.4 0.43 ± 10% perf-profile.children.cycles-pp.change_prot_numa
0.01 ±173% +0.4 0.43 ± 10% perf-profile.children.cycles-pp.change_p4d_range
0.01 ±173% +0.4 0.43 ± 9% perf-profile.children.cycles-pp.change_protection
0.00 +0.4 0.42 ± 51% perf-profile.children.cycles-pp.kvmalloc_node
0.00 +0.4 0.42 ± 51% perf-profile.children.cycles-pp.__kmalloc_node
0.00 +0.4 0.42 ± 51% perf-profile.children.cycles-pp.kmalloc_large_node
0.01 ±173% +0.4 0.44 ± 10% perf-profile.children.cycles-pp.task_work_run
0.00 +0.4 0.43 ± 12% perf-profile.children.cycles-pp.flush_tlb_mm_range
0.00 +0.5 0.50 ± 39% perf-profile.children.cycles-pp.shmem_fault
0.00 +0.5 0.50 ± 39% perf-profile.children.cycles-pp.shmem_swapin_page
0.00 +0.5 0.50 ± 39% perf-profile.children.cycles-pp.shmem_swapin
0.00 +0.5 0.50 ± 39% perf-profile.children.cycles-pp.swap_cluster_readahead
0.00 +0.5 0.52 ± 41% perf-profile.children.cycles-pp.__read_swap_cache_async
0.00 +0.5 0.54 ± 6% perf-profile.children.cycles-pp.io_serial_out
0.00 +0.6 0.55 ± 47% perf-profile.children.cycles-pp.__do_fault
0.04 ± 59% +0.6 0.63 ± 29% perf-profile.children.cycles-pp.exit_to_usermode_loop
0.00 +0.6 0.59 ± 7% perf-profile.children.cycles-pp.do_swap_page
0.00 +0.6 0.61 ± 53% perf-profile.children.cycles-pp.do_dentry_open
0.00 +0.6 0.61 ± 53% perf-profile.children.cycles-pp.open64
0.00 +0.6 0.61 ± 53% perf-profile.children.cycles-pp.proc_reg_open
0.00 +0.6 0.61 ± 53% perf-profile.children.cycles-pp.single_open_size
0.00 +0.7 0.66 ± 4% perf-profile.children.cycles-pp.path_parentat
0.00 +0.7 0.67 ± 4% perf-profile.children.cycles-pp.filename_create
0.00 +0.7 0.67 ± 4% perf-profile.children.cycles-pp.filename_parentat
0.00 +0.7 0.67 ± 4% perf-profile.children.cycles-pp.do_mkdirat
0.00 +0.7 0.68 ± 5% perf-profile.children.cycles-pp.mkdir
0.00 +0.8 0.77 ± 28% perf-profile.children.cycles-pp.shmem_getpage_gfp
0.00 +1.1 1.08 ± 19% perf-profile.children.cycles-pp.alloc_empty_file
0.00 +1.1 1.08 ± 19% perf-profile.children.cycles-pp.__alloc_file
0.00 +1.2 1.16 ± 52% perf-profile.children.cycles-pp.do_wp_page
0.00 +1.2 1.16 ± 52% perf-profile.children.cycles-pp.wp_page_copy
0.00 +1.4 1.40 ± 35% perf-profile.children.cycles-pp.__lookup_slow
0.00 +1.4 1.40 ± 35% perf-profile.children.cycles-pp.d_alloc_parallel
0.00 +1.4 1.40 ± 35% perf-profile.children.cycles-pp.d_alloc
0.00 +1.4 1.40 ± 35% perf-profile.children.cycles-pp.__d_alloc
0.00 +1.4 1.41 ± 45% perf-profile.children.cycles-pp.pipe_write
0.00 +1.4 1.42 ± 35% perf-profile.children.cycles-pp.walk_component
0.00 +1.4 1.42 ± 20% perf-profile.children.cycles-pp.nfs_write_begin
0.00 +1.4 1.43 ± 34% perf-profile.children.cycles-pp.link_path_walk
0.00 +1.4 1.43 ± 21% perf-profile.children.cycles-pp.nfs_file_write
0.03 ±102% +1.6 1.61 ± 27% perf-profile.children.cycles-pp.__vmalloc_node_range
0.64 ±127% +1.7 2.35 ± 8% perf-profile.children.cycles-pp.__x64_sys_clone
0.64 ±127% +1.7 2.35 ± 8% perf-profile.children.cycles-pp._do_fork
0.64 ±127% +1.7 2.35 ± 8% perf-profile.children.cycles-pp.copy_process
0.00 +1.8 1.81 ± 27% perf-profile.children.cycles-pp.simple_write_begin
0.00 +1.8 1.82 ± 27% perf-profile.children.cycles-pp.generic_file_write_iter
0.00 +1.8 1.82 ± 27% perf-profile.children.cycles-pp.__generic_file_write_iter
0.00 +1.8 1.83 ± 27% perf-profile.children.cycles-pp.__GI___libc_write
0.00 +1.9 1.86 ± 18% perf-profile.children.cycles-pp.__GI___libc_open
0.77 ± 8% +1.9 2.66 ± 16% perf-profile.children.cycles-pp.kmem_cache_alloc
0.62 ± 7% +2.3 2.92 ± 17% perf-profile.children.cycles-pp.__slab_alloc
0.62 ± 7% +2.3 2.92 ± 17% perf-profile.children.cycles-pp.___slab_alloc
0.56 ± 7% +2.4 2.92 ± 17% perf-profile.children.cycles-pp.new_slab
0.00 +2.5 2.45 ± 22% perf-profile.children.cycles-pp.do_filp_open
0.00 +2.5 2.45 ± 22% perf-profile.children.cycles-pp.path_openat
0.00 +2.5 2.45 ± 22% perf-profile.children.cycles-pp.do_sys_open
0.00 +2.5 2.45 ± 22% perf-profile.children.cycles-pp.do_sys_openat2
0.68 ±119% +2.6 3.32 ± 11% perf-profile.children.cycles-pp.__libc_fork
0.00 +2.8 2.85 ± 30% perf-profile.children.cycles-pp.write
0.97 ± 17% +3.1 4.05 ± 25% perf-profile.children.cycles-pp.worker_thread
0.90 ± 16% +3.1 4.03 ± 25% perf-profile.children.cycles-pp.process_one_work
0.69 ±116% +3.2 3.85 ± 9% perf-profile.children.cycles-pp.forkshell
0.09 ± 7% +3.2 3.30 ± 4% perf-profile.children.cycles-pp.io_serial_in
0.00 +3.2 3.24 ± 6% perf-profile.children.cycles-pp.grab_cache_page_write_begin
0.00 +3.2 3.25 ± 6% perf-profile.children.cycles-pp.generic_perform_write
0.00 +3.3 3.27 ± 7% perf-profile.children.cycles-pp.pagecache_get_page
0.11 ± 12% +3.4 3.52 ± 5% perf-profile.children.cycles-pp.serial8250_console_putchar
0.12 ± 12% +3.6 3.68 ± 5% perf-profile.children.cycles-pp.wait_for_xmitr
0.00 +3.7 3.73 ± 25% perf-profile.children.cycles-pp.drm_fb_helper_dirty_work
0.00 +3.7 3.73 ± 25% perf-profile.children.cycles-pp.memcpy_erms
0.12 ± 12% +3.9 4.04 ± 5% perf-profile.children.cycles-pp.uart_console_write
0.12 ± 12% +4.1 4.23 ± 5% perf-profile.children.cycles-pp.serial8250_console_write
0.12 ± 14% +4.2 4.35 ± 4% perf-profile.children.cycles-pp.console_unlock
35.04 ± 6% +4.4 39.41 ± 2% perf-profile.children.cycles-pp.handle_mm_fault
0.78 ± 16% +4.6 5.38 ± 3% perf-profile.children.cycles-pp.shrink_slab
34.72 ± 6% +4.6 39.33 ± 2% perf-profile.children.cycles-pp.__handle_mm_fault
0.74 ± 16% +4.6 5.38 ± 3% perf-profile.children.cycles-pp.do_shrink_slab
0.00 +4.7 4.67 ± 12% perf-profile.children.cycles-pp.ksys_write
0.00 +4.7 4.67 ± 12% perf-profile.children.cycles-pp.vfs_write
0.00 +4.7 4.67 ± 12% perf-profile.children.cycles-pp.new_sync_write
34.00 ± 6% +5.0 39.02 ± 2% perf-profile.children.cycles-pp.handle_pte_fault
0.00 +5.2 5.21 ± 3% perf-profile.children.cycles-pp.rcu_oom_scan
0.08 ± 27% +5.3 5.43 ± 2% perf-profile.children.cycles-pp.printk
0.08 ± 27% +5.3 5.43 ± 2% perf-profile.children.cycles-pp.vprintk_emit
1.05 ±104% +10.7 11.71 ± 4% perf-profile.children.cycles-pp.do_syscall_64
1.05 ±104% +10.7 11.71 ± 4% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
34.02 ± 9% +16.9 50.93 ± 4% perf-profile.children.cycles-pp.shrink_lruvec
31.76 ± 9% +18.6 50.34 ± 4% perf-profile.children.cycles-pp.shrink_inactive_list
31.30 ± 9% +19.0 50.29 ± 4% perf-profile.children.cycles-pp.shrink_page_list
29.28 ± 7% +19.0 48.27 ± 3% perf-profile.children.cycles-pp.__alloc_pages_nodemask
28.07 ± 7% +19.5 47.59 ± 3% perf-profile.children.cycles-pp.__alloc_pages_slowpath
34.84 ± 9% +21.5 56.33 ± 3% perf-profile.children.cycles-pp.shrink_node
25.45 ± 7% +22.0 47.43 ± 4% perf-profile.children.cycles-pp.try_to_free_pages
25.21 ± 7% +22.2 47.41 ± 4% perf-profile.children.cycles-pp.do_try_to_free_pages
14.33 ± 10% +34.6 48.95 ± 5% perf-profile.children.cycles-pp.try_to_unmap_flush_dirty
14.31 ± 10% +34.6 48.95 ± 5% perf-profile.children.cycles-pp.arch_tlbbatch_flush
13.81 ± 10% +35.1 48.94 ± 5% perf-profile.children.cycles-pp.on_each_cpu_cond_mask
13.46 ± 11% +35.9 49.31 ± 5% perf-profile.children.cycles-pp.smp_call_function_many_cond
5.33 ± 9% -3.4 1.90 ± 10% perf-profile.self.cycles-pp.do_rw_once
4.88 ± 9% -3.1 1.75 ± 11% perf-profile.self.cycles-pp.do_access
2.46 ± 6% -2.3 0.13 ± 29% perf-profile.self.cycles-pp.llist_reverse_order
2.29 ± 15% -2.2 0.13 ± 26% perf-profile.self.cycles-pp.llist_add_batch
2.26 ± 12% -2.1 0.17 ± 23% perf-profile.self.cycles-pp.__memcpy_flushcache
2.15 ± 8% -2.1 0.09 ± 39% perf-profile.self.cycles-pp.flush_smp_call_function_queue
2.49 ± 9% -1.6 0.90 ± 10% perf-profile.self.cycles-pp._raw_spin_lock
1.95 ± 8% -1.6 0.37 ± 11% perf-profile.self.cycles-pp.native_irq_return_iret
1.72 ± 9% -1.6 0.15 ± 21% perf-profile.self.cycles-pp.__default_send_IPI_dest_field
1.55 ± 6% -1.5 0.08 ± 26% perf-profile.self.cycles-pp.flush_tlb_func_common
1.25 ± 10% -1.2 0.06 ± 75% perf-profile.self.cycles-pp.try_to_unmap_one
1.23 ± 6% -1.2 0.05 ± 72% perf-profile.self.cycles-pp.native_flush_tlb
1.10 ± 11% -1.0 0.12 ± 28% perf-profile.self.cycles-pp.end_page_writeback
1.02 ± 12% -0.9 0.09 ± 41% perf-profile.self.cycles-pp.__slab_free
1.15 ± 9% -0.8 0.31 ± 22% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
1.19 ± 15% -0.8 0.37 ± 14% perf-profile.self.cycles-pp.down_read_trylock
1.90 ± 13% -0.8 1.09 ± 4% perf-profile.self.cycles-pp.menu_select
0.84 ± 17% -0.8 0.07 ± 30% perf-profile.self.cycles-pp.page_vma_mapped_walk
0.85 ± 10% -0.6 0.21 ± 10% perf-profile.self.cycles-pp.get_page_from_freelist
1.05 ± 7% -0.6 0.42 ± 13% perf-profile.self.cycles-pp.clear_page_erms
0.67 ± 10% -0.6 0.10 ± 4% perf-profile.self.cycles-pp._find_next_bit
0.62 ± 15% -0.4 0.19 ± 18% perf-profile.self.cycles-pp.__handle_mm_fault
0.67 ± 11% -0.4 0.25 ± 15% perf-profile.self.cycles-pp.lru_cache_add_active_or_unevictable
0.60 ± 12% -0.4 0.20 ± 8% perf-profile.self.cycles-pp.up_read
1.03 ± 11% -0.4 0.64 ± 4% perf-profile.self.cycles-pp.cpuidle_enter_state
0.39 ± 8% -0.3 0.04 ± 71% perf-profile.self.cycles-pp.radix_tree_node_ctor
0.39 ± 16% -0.3 0.06 ± 13% perf-profile.self.cycles-pp.__list_del_entry_valid
0.92 ± 20% -0.3 0.61 ± 7% perf-profile.self.cycles-pp.ktime_get
0.31 ± 10% -0.3 0.04 ± 73% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.30 ± 6% -0.3 0.04 ± 73% perf-profile.self.cycles-pp.page_lock_anon_vma_read
0.49 ± 11% -0.3 0.24 ± 12% perf-profile.self.cycles-pp.mem_cgroup_try_charge
0.40 ± 9% -0.2 0.15 ± 12% perf-profile.self.cycles-pp.sync_regs
0.34 ± 17% -0.2 0.11 ± 22% perf-profile.self.cycles-pp.handle_pte_fault
0.27 ± 16% -0.2 0.04 ± 71% perf-profile.self.cycles-pp.count_shadow_nodes
0.32 ± 11% -0.2 0.09 ± 15% perf-profile.self.cycles-pp._raw_spin_lock_irqsave
0.24 ± 13% -0.2 0.04 ± 71% perf-profile.self.cycles-pp._raw_spin_unlock_irqrestore
0.22 ± 19% -0.2 0.04 ± 73% perf-profile.self.cycles-pp.handle_mm_fault
0.28 ± 12% -0.2 0.13 ± 13% perf-profile.self.cycles-pp.try_charge
0.23 ± 6% -0.2 0.08 ± 10% perf-profile.self.cycles-pp.__pagevec_lru_add_fn
0.15 ± 15% -0.1 0.06 ± 8% perf-profile.self.cycles-pp.mem_cgroup_throttle_swaprate
0.15 ± 6% -0.0 0.10 ± 4% perf-profile.self.cycles-pp.native_write_msr
0.07 ± 22% -0.0 0.04 ± 71% perf-profile.self.cycles-pp.run_local_timers
0.09 ± 24% -0.0 0.05 ± 8% perf-profile.self.cycles-pp._raw_spin_trylock
0.00 +0.4 0.39 ± 9% perf-profile.self.cycles-pp.delay_tsc
0.00 +0.5 0.54 ± 6% perf-profile.self.cycles-pp.io_serial_out
0.00 +1.1 1.09 ± 17% perf-profile.self.cycles-pp.vprintk_emit
0.09 ± 7% +3.2 3.30 ± 4% perf-profile.self.cycles-pp.io_serial_in
0.00 +3.7 3.71 ± 25% perf-profile.self.cycles-pp.memcpy_erms
6.37 ± 12% +42.3 48.62 ± 5% perf-profile.self.cycles-pp.smp_call_function_many_cond



vm-scalability.time.system_time

2500 +--------------------------------------------------------------------+
| O O O |
| O O O O O O |
2000 |-+ O O |
| O O |
| |
1500 |-+ |
| |
1000 |-+ |
| |
| |
500 |-+ |
| |
| .+.+.+..+.+. .+..+.+. .+.+.+.. .+.+.+..+.+.+..+.+.+..+.+.+.+..+.|
0 +--------------------------------------------------------------------+


vm-scalability.time.elapsed_time

400 +---------------------------------------------------------------------+
| O O O O |
350 |-+ O O O |
300 |-O O O |
| O O O |
250 |-+ |
| |
200 |-+ |
| |
150 |-+ |
100 |-+ |
| |
50 |-+ +.+.+..+.+ .+.. +.+.. +..+.+.+..+.+.+..+.+.+..+.+.+..+.|
| .. + .+.+ + +. + |
0 +---------------------------------------------------------------------+


vm-scalability.time.elapsed_time.max

400 +---------------------------------------------------------------------+
| O O O O |
350 |-+ O O O |
300 |-O O O |
| O O O |
250 |-+ |
| |
200 |-+ |
| |
150 |-+ |
100 |-+ |
| |
50 |-+ +.+.+..+.+ .+.. +.+.. +..+.+.+..+.+.+..+.+.+..+.+.+..+.|
| .. + .+.+ + +. + |
0 +---------------------------------------------------------------------+


vm-scalability.throughput

7e+06 +-------------------------------------------------------------------+
| |
6e+06 |-+ .+.+ +.+.. +..+ .+. |
| + + : + : : +..+.+.+.+..+.+.+.+..+.+ +..+.|
5e+06 |-+ : +..+ : : : : : |
| : : : : + : : |
4e+06 |-+ : : : : : : : |
| : : : : : : : |
3e+06 |-+ : : : : : : : |
| : : : : : : : |
2e+06 |-+: : : : : : : |
| : : : : : :: |
1e+06 |-O: O : O : O : O O O |
| : O : O O : O O : O |
0 +-------------------------------------------------------------------+


vm-scalability.median

800000 +------------------------------------------------------------------+
| .+ : +.+. +.+ +.+. .+. .+.. |
700000 |-+ +. : : + : : : +.+..+.+ +..+.+.+ +.+.|
| : +.+ : : : : : |
600000 |-+ : : : : + : : |
500000 |-+ : : : : : : : |
| : : : : : : : |
400000 |-+: : : : : : : |
| : : : : : : : |
300000 |-+: : : : : : : |
200000 |-+: : : : : : : |
| : :: : : :: |
100000 |-: O O :: O O O : O O O : O O O O |
| : : : : |
0 +------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen


Attachments:
(No filename) (122.85 kB)
config-5.6.0-rc7-00077-g0902bb3bb8fdb6 (208.52 kB)
job-script (8.23 kB)
job.yaml (5.93 kB)
reproduce (983.00 B)

2020-05-13 01:56:35

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote:
> On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote:
> > On 08/05/2020 17.46, Paul E. McKenney wrote:
> > > Easy for me to provide "start fast and inefficient mode" and "stop fast
> > > and inefficient mode" APIs for MM to call!
> > >
> > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would
> > > expect them not to nest (as in if you need them to nest, please let
> > > me know). I would not expect these to be invoked all that often (as in
> > > if you do need them to be fast and scalable, please let me know).
> > >
> > > RCU would then be in fast/inefficient mode if either MM told it to be
> > > or if RCU had detected callback overload on at least one CPU.
> > >
> > > Seem reasonable?
> >
> > Not exactly nested calls, but kswapd threads are per numa node.
> > So, at some level nodes under pressure must be counted.
>
> Easy enough, especially given that RCU already "counts" CPUs having
> excessive numbers of callbacks. But assuming that the transitions to/from
> OOM are rare, I would start by just counting them with a global counter.
> If the counter is non-zero, RCU is in fast and inefficient mode.
>
> > Also forcing rcu calls only for cpus in one numa node might be useful.
>
> Interesting. RCU currently evaluates a given CPU by comparing the
> number of callbacks against a fixed cutoff that can be set at boot using
> rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded,
> RCU becomes more aggressive about invoking callbacks on that CPU, for
> example, by sacrificing some degree of real-time response. I believe
> that this heuristic would also serve the OOM use case well.

So one of the things that I'm not sure people have connected here is
that memory reclaim done by shrinkers is one of the things that
drives huge numbers of call_rcu() callbacks to free memory via rcu.
If we are reclaiming dentries and inodes, then we can be pushing
thousands to hundreds of thousands of objects into kfree_rcu()
and/or direct call_rcu() calls to free these objects in a single
reclaim pass.
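
To make that concrete, here is a minimal toy sketch of the pattern being
described -- purely illustrative, not the actual dentry/inode code -- a
cache whose shrinker queues every object it "frees" through call_rcu():

#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/rcupdate.h>
#include <linux/shrinker.h>

struct toy_obj {
	struct list_head lru;
	struct rcu_head rcu;
};

static LIST_HEAD(toy_lru);
static DEFINE_SPINLOCK(toy_lock);
static unsigned long toy_nr;

static void toy_free_rcu(struct rcu_head *rcu)
{
	kfree(container_of(rcu, struct toy_obj, rcu));
}

static unsigned long toy_count(struct shrinker *s, struct shrink_control *sc)
{
	return toy_nr;
}

static unsigned long toy_scan(struct shrinker *s, struct shrink_control *sc)
{
	unsigned long freed = 0;
	struct toy_obj *obj;

	spin_lock(&toy_lock);
	while (freed < sc->nr_to_scan && !list_empty(&toy_lru)) {
		obj = list_first_entry(&toy_lru, struct toy_obj, lru);
		list_del(&obj->lru);
		toy_nr--;
		/* Only *queued* for freeing; memory comes back after a GP. */
		call_rcu(&obj->rcu, toy_free_rcu);
		freed++;
	}
	spin_unlock(&toy_lock);
	return freed;		/* reported as "freed", but not free yet */
}

/* Registered with register_shrinker() at init time (omitted). */
static struct shrinker toy_shrinker = {
	.count_objects	= toy_count,
	.scan_objects	= toy_scan,
	.seeks		= DEFAULT_SEEKS,
};

A single reclaim pass over such a cache can queue tens of thousands of
callbacks before a single page actually comes back.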

Hence the trigger for RCU going into "excessive callback" mode
might, in fact, be kswapd running a pass over the shrinkers. i.e.
memory reclaim itself can be responsible for pushing RCU into this "OOM
pressure" situation.

So perhaps we've missed a trick here by not having the memory
reclaim routines trigger RCU callbacks at the end of a priority
scan. The shrinkers have queued the objects for freeing, but they
haven't actually been freed yet and so things like slab pages
haven't actually been returned to the free pool even though the
shrinkers have said "freed this many objects"...

i.e. perhaps the right solution here is a "rcu_run_callbacks()"
function that memory reclaim calls before backing off and/or winding
up reclaim priority.
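
Something along these lines, say -- the name, the threshold, and the exact
call site are made up, but rcu_force_quiescent_state() and rcu_barrier()
are the existing primitives:

#include <linux/rcupdate.h>

/*
 * Hypothetical hook, called by memory reclaim after a shrinker pass and
 * before it backs off or raises the scan priority.
 */
void rcu_run_callbacks(int reclaim_priority)
{
	/* Get the grace-period machinery moving right away. */
	rcu_force_quiescent_state();

	/*
	 * If reclaim is down to its last few priority levels, actually
	 * wait: rcu_barrier() returns only once all callbacks queued
	 * before this call have been invoked, so the memory the
	 * shrinkers just handed to call_rcu()/kfree_rcu() really is
	 * back in the free pool afterwards.
	 */
	if (reclaim_priority <= 2)
		rcu_barrier();
}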

> > I wonder if direct-reclaim should at some stage simply wait for RCU QS.
> > I.e. call rcu_barrier() or similar somewhere before invoking OOM.
>
> The rcu_oom_count() function in the patch starting this thread returns the
> total number of outstanding callbacks queued on all CPUs. So one approach
> would be to invoke this function, and if the return value was truly
> huge (taking the size of memory and who knows what all else into account),
> do the rcu_barrier() to wait for RCU to clear its current backlog.

The shrinker scan control structure has a node mask in it to
indicate what node (and hence CPUs) it should be reclaiming from.
This information comes from the main reclaim scan routine, so it
would be trivial to feed straight into the RCU code to have it
act on just the CPUs/node that we are reclaiming memory from...
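
i.e. something like this on the reclaim side (a sketch of the call site
only; rcu_reclaim_nudge() is a made-up name, sketched properly a bit
further down):

	/*
	 * Sketch: in shrink_node()'s per-memcg loop in mm/vmscan.c,
	 * right after the existing per-node shrinker pass.
	 */
	shrink_slab(sc->gfp_mask, pgdat->node_id, memcg, sc->priority);
	rcu_reclaim_nudge(pgdat->node_id, sc->priority);	/* hypothetical */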

> On the NUMA point, it would be dead easy for me to supply a function
> that returned the number of callbacks on a given CPU, which would allow
> you to similarly evaluate a NUMA node, a cgroup, or whatever.

I'd think it runs the other way around - we optimistically call the
RCU layer to do cleanup, and the RCU layer decides if there are enough
queued callbacks on the cpus/node to run callbacks immediately. It
would even be provided with the scan priority to indicate the level
of desperation memory reclaim is under....
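
Such a nudge might look like this on the RCU side -- rcu_reclaim_nudge()
is hypothetical, but rcu_get_n_cbs_cpu() is the same per-CPU count the
patch at the head of this thread uses, and qhimark is the existing
rcutree.qhimark threshold (this sketch would live in kernel/rcu/tree.c):

#include <linux/cpumask.h>
#include <linux/topology.h>

void rcu_reclaim_nudge(int nid, int reclaim_priority)
{
	unsigned long ncbs = 0;
	unsigned long threshold;
	int cpu;

	/* Only the CPUs of the node reclaim is currently working on. */
	for_each_cpu(cpu, cpumask_of_node(nid))
		ncbs += rcu_get_n_cbs_cpu(cpu);

	/*
	 * The more desperate reclaim is (smaller priority value), the
	 * smaller the callback backlog we tolerate before kicking a
	 * grace period.  12 == DEF_PRIORITY, the starting priority.
	 */
	threshold = (qhimark * reclaim_priority) / 12;
	if (ncbs > threshold)
		rcu_force_quiescent_state();
}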

> > All GFP_NOFAIL users should allow direct-reclaim, thus this loop
> > in page_alloc shouldn't block RCU and doesn't need special care.
>
> I must defer to you guys on this. The main caution is the duration of
> direct reclaim. After all, if it is too long, the kfree_rcu() instance
> would have been better off just invoking synchronize_rcu().

Individual callers of kfree_rcu() have no idea of the load on RCU,
nor how long direct reclaim is taking. Calling synchronize_rcu()
incorrectly has pretty major downsides to it, so nobody should be
trying to expedite kfree_rcu() unless there is a good reason to do
so (e.g. at unmount to ensure everything allocated by a filesystem
has actually been freed). Hence I'd much prefer the decision to
expedite callbacks is made by the RCU subsystem based on its known
callback load and some indication of how close memory reclaim is to
declaring OOM...

Cheers,

Dave.

--
Dave Chinner
[email protected]

2020-05-13 03:20:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Wed, May 13, 2020 at 11:32:38AM +1000, Dave Chinner wrote:
> On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote:
> > On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote:
> > > On 08/05/2020 17.46, Paul E. McKenney wrote:
> > > > Easy for me to provide "start fast and inefficient mode" and "stop fast
> > > > and inefficient mode" APIs for MM to call!
> > > >
> > > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would
> > > > expect them not to nest (as in if you need them to nest, please let
> > > > me know). I would not expect these to be invoked all that often (as in
> > > > if you do need them to be fast and scalable, please let me know). >
> > > > RCU would then be in fast/inefficient mode if either MM told it to be
> > > > or if RCU had detected callback overload on at least one CPU.
> > > >
> > > > Seem reasonable?
> > >
> > > Not exactly nested calls, but kswapd threads are per numa node.
> > > So, at some level nodes under pressure must be counted.
> >
> > Easy enough, especially given that RCU already "counts" CPUs having
> > excessive numbers of callbacks. But assuming that the transitions to/from
> > OOM are rare, I would start by just counting them with a global counter.
> > If the counter is non-zero, RCU is in fast and inefficient mode.
> >
> > > Also forcing rcu calls only for cpus in one numa node might be useful.
> >
> > Interesting. RCU currently evaluates a given CPU by comparing the
> > number of callbacks against a fixed cutoff that can be set at boot using
> > rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded,
> > RCU becomes more aggressive about invoking callbacks on that CPU, for
> > example, by sacrificing some degree of real-time response. I believe
> > that this heuristic would also serve the OOM use case well.
>
> So one of the things that I'm not sure people have connected here is
> that memory reclaim done by shrinkers is one of the things that
> drives huge numbers of call_rcu() callbacks to free memory via rcu.
> If we are reclaiming dentries and inodes, then we can be pushing
> thousands to hundreds of thousands of objects into kfree_rcu()
> and/or direct call_rcu() calls to free these objects in a single
> reclaim pass.

Good point!

> Hence the trigger for RCU going into "excessive callback" mode
> might, in fact, be kswapd running a pass over the shrinkers. i.e.
> memory reclaim itself can be responsible for pushing RCU into this "OOM
> pressure" situation.
>
> So perhaps we've missed a trick here by not having the memory
> reclaim routines trigger RCU callbacks at the end of a priority
> scan. The shrinkers have queued the objects for freeing, but they
> haven't actually been freed yet and so things like slab pages
> haven't actually been returned to the free pool even though the
> shrinkers have said "freed this many objects"...
>
> i.e. perhaps the right solution here is a "rcu_run_callbacks()"
> function that memory reclaim calls before backing off and/or winding
> up reclaim priority.

It would not be hard to make something that put RCU into fast/inefficient
mode for a couple of grace periods. I will also look into the possibility
of speeding up callback invocation.

It might also make sense to put RCU grace periods into fast mode while
running the shrinkers that are freeing dentries and inodes. However,
kbuild test robot reports ugly regressions when putting RCU into
fast/inefficient mode too quickly and too often. As in 78.5% degradation
on one of the benchmarks.

> > > I wonder if direct-reclaim should at some stage simply wait for RCU QS.
> > > I.e. call rcu_barrier() or similar somewhere before invoking OOM.
> >
> > The rcu_oom_count() function in the patch starting this thread returns the
> > total number of outstanding callbacks queued on all CPUs. So one approach
> > would be to invoke this function, and if the return value was truly
> > huge (taking size of memory and who knows that all else into account),
> > do the rcu_barrier() to wait for RCU to clear its current backlog.
>
> The shrinker scan control structure has a node mask in it to
> indicate what node (and hence CPUs) it should be reclaiming from.
> This information comes from the main reclaim scan routine, so it
> would be trivial to feed straight into the RCU code to have it
> act on just the CPUs/node that we are reclaiming memory from...

For the callbacks, RCU can operate on CPUs, in theory anyway. The
grace period itself, however, is inherently global.

> > On the NUMA point, it would be dead easy for me to supply a function
> > that returned the number of callbacks on a given CPU, which would allow
> > you to similarly evaluate a NUMA node, a cgroup, or whatever.
>
> I'd think it runs the other way around - we optimisitically call the
> RCU layer to do cleanup, and the RCU layer decides if there's enough
> queued callbacks on the cpus/node to run callbacks immediately. It
> would even be provided with the scan priority to indicate the level
> of desperation memory reclaim is under....

Easy for RCU to count the number of callbacks. That said, it has no
idea which callbacks are which. Perhaps kfree_rcu() could gather that
information from the slab allocator, though.
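
For example, something like this untested sketch, living next to the
existing counting code in kernel/rcu/tree.c, would give a per-node view:

/* Untested: rough count of RCU callbacks queued on one NUMA node. */
static unsigned long rcu_n_cbs_node(int nid)
{
	unsigned long ncbs = 0;
	int cpu;

	for_each_cpu(cpu, cpumask_of_node(nid))
		ncbs += rcu_get_n_cbs_cpu(cpu);
	return ncbs;
}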

> > > All GFP_NOFAIL users should allow direct-reclaim, thus this loop
> > > in page_alloc shouldn't block RCU and doesn't need special care.
> >
> > I must defer to you guys on this. The main caution is the duration of
> > direct reclaim. After all, if it is too long, the kfree_rcu() instance
> > would have been better of just invoking synchronize_rcu().
>
> Individual callers of kfree_rcu() have no idea of the load on RCU,
> nor how long direct reclaim is taking. Calling synchronize_rcu()
> incorrectly has pretty major downsides to it, so nobody should be
> trying to expedite kfree_rcu() unless there is a good reason to do
> so (e.g. at unmount to ensure everything allocated by a filesystem
> has actually been freed). Hence I'd much prefer the decision to
> expedite callbacks is made by the RCU subsystem based on it's known
> callback load and some indication of how close memory reclaim is to
> declaring OOM...

Sorry, I was unclear. There is a new single-argument kfree_rcu() under
way that does not require an rcu_head in the structure being freed.
However, in this case, kfree_rcu() might either allocate the memory
that is needed to track the memory to be freed on the one hand or just
invoke synchronize_rcu() on the other. So this decision would be taken
inside kfree_rcu(), and not be visible to either core RCU or the caller
of kfree_rcu().

This decision is made based on whether or not the allocator provides
kfree_rcu() the memory needed. The tradeoff is what GFP flags are
supplied. So the question kfree_rcu() has to answer is "Would it be
better to give myself to reclaim as an additional task, or would it
instead be better to just invoke synchronize_rcu() and then immediately
free()?"

I am probably still unclear, but hopefully at least one step in the
right direction.

Thanx, Paul

2020-05-13 04:39:53

by Konstantin Khlebnikov

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On 13/05/2020 06.18, Paul E. McKenney wrote:
> On Wed, May 13, 2020 at 11:32:38AM +1000, Dave Chinner wrote:
>> On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote:
>>> On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote:
>>>> On 08/05/2020 17.46, Paul E. McKenney wrote:
>>>>> Easy for me to provide "start fast and inefficient mode" and "stop fast
>>>>> and inefficient mode" APIs for MM to call!
>>>>>
>>>>> How about rcu_mempressure_start() and rcu_mempressure_end()? I would
>>>>> expect them not to nest (as in if you need them to nest, please let
>>>>> me know). I would not expect these to be invoked all that often (as in
>>>>> if you do need them to be fast and scalable, please let me know). >
>>>>> RCU would then be in fast/inefficient mode if either MM told it to be
>>>>> or if RCU had detected callback overload on at least one CPU.
>>>>>
>>>>> Seem reasonable?
>>>>
>>>> Not exactly nested calls, but kswapd threads are per numa node.
>>>> So, at some level nodes under pressure must be counted.
>>>
>>> Easy enough, especially given that RCU already "counts" CPUs having
>>> excessive numbers of callbacks. But assuming that the transitions to/from
>>> OOM are rare, I would start by just counting them with a global counter.
>>> If the counter is non-zero, RCU is in fast and inefficient mode.
>>>
>>>> Also forcing rcu calls only for cpus in one numa node might be useful.
>>>
>>> Interesting. RCU currently evaluates a given CPU by comparing the
>>> number of callbacks against a fixed cutoff that can be set at boot using
>>> rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded,
>>> RCU becomes more aggressive about invoking callbacks on that CPU, for
>>> example, by sacrificing some degree of real-time response. I believe
>>> that this heuristic would also serve the OOM use case well.
>>
>> So one of the things that I'm not sure people have connected here is
>> that memory reclaim done by shrinkers is one of the things that
>> drives huge numbers of call_rcu() callbacks to free memory via rcu.
>> If we are reclaiming dentries and inodes, then we can be pushing
>> thousands to hundreds of thousands of objects into kfree_rcu()
>> and/or direct call_rcu() calls to free these objects in a single
>> reclaim pass.
>
> Good point!

Indeed

>
>> Hence the trigger for RCU going into "excessive callback" mode
>> might, in fact, be kswapd running a pass over the shrinkers. i.e.
>> memory reclaim itself can be responsible for pushing RCU into this "OOM
>> pressure" situation.
>>
>> So perhaps we've missed a trick here by not having the memory
>> reclaim routines trigger RCU callbacks at the end of a priority
>> scan. The shrinkers have queued the objects for freeing, but they
>> haven't actually been freed yet and so things like slab pages
>> haven't actually been returned to the free pool even though the
>> shrinkers have said "freed this many objects"...
>>
>> i.e. perhaps the right solution here is a "rcu_run_callbacks()"
>> function that memory reclaim calls before backing off and/or winding
>> up reclaim priority.
>
> It would not be hard to make something that put RCU into fast/inefficient
> mode for a couple of grace periods. I will also look into the possibility
> of speeding up callback invocation.
>
> It might also make sense to put RCU grace periods into fast mode while
> running the shrinkers that are freeing dentries and inodes. However,
> kbuild test robot reports ugly regressions when putting RCU into
> fast/inefficient mode to quickly and too often. As in 78.5% degradation
> on one of the benchmarks.

I think fast/inefficient mode here is just an optimization for freeing
memory faster. It doesn't solve the problem itself.

First we have to close the loop in the reclaimer and actually wait for or
run the RCU callbacks which might free memory, before increasing priority
and invoking the OOM killer.

>
>>>> I wonder if direct-reclaim should at some stage simply wait for RCU QS.
>>>> I.e. call rcu_barrier() or similar somewhere before invoking OOM.
>>>
>>> The rcu_oom_count() function in the patch starting this thread returns the
>>> total number of outstanding callbacks queued on all CPUs. So one approach
>>> would be to invoke this function, and if the return value was truly
>>> huge (taking size of memory and who knows that all else into account),
>>> do the rcu_barrier() to wait for RCU to clear its current backlog.
>>
>> The shrinker scan control structure has a node mask in it to
>> indicate what node (and hence CPUs) it should be reclaiming from.
>> This information comes from the main reclaim scan routine, so it
>> would be trivial to feed straight into the RCU code to have it
>> act on just the CPUs/node that we are reclaiming memory from...
>
> For the callbacks, RCU can operate on CPUs, in theory anyway. The
> grace period itself, however, is inherently global.
>
>>> On the NUMA point, it would be dead easy for me to supply a function
>>> that returned the number of callbacks on a given CPU, which would allow
>>> you to similarly evaluate a NUMA node, a cgroup, or whatever.
>>
>> I'd think it runs the other way around - we optimisitically call the
>> RCU layer to do cleanup, and the RCU layer decides if there's enough
>> queued callbacks on the cpus/node to run callbacks immediately. It
>> would even be provided with the scan priority to indicate the level
>> of desperation memory reclaim is under....
>
> Easy for RCU to count the number of callbacks. That said, it has no
> idea which callbacks are which. Perhaps kfree_rcu() could gather that
> information from the slab allocator, though.

It's simple to mark slab shrinkers that free objects through RCU and
count the freed objects in the reclaimer:

--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -536,6 +536,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
else
new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);

+ if (shrinker->flags & SHRINKER_KFREE_RCU)
+ shrinkctl->nr_kfree_rcu += freed;
+
trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
return freed;
}

And when enough has accumulated, do some synchronization.

Probably it's better to sum the freed objects in a per-cpu variable,
and to accumulate size rather than count.
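
Sketch of that part (the SHRINKER_KFREE_RCU flag and the per-object
size field are hypothetical, like the diff above):

static DEFINE_PER_CPU(unsigned long, kfree_rcu_bytes_pending);

static void account_rcu_freed(struct shrinker *shrinker,
			      unsigned long freed)
{
	if (shrinker->flags & SHRINKER_KFREE_RCU)
		this_cpu_add(kfree_rcu_bytes_pending,
			     freed * shrinker->rcu_obj_size);
}

static unsigned long kfree_rcu_bytes_total(void)
{
	unsigned long sum = 0;
	int cpu;

	for_each_online_cpu(cpu)
		sum += per_cpu(kfree_rcu_bytes_pending, cpu);
	return sum;
}

Reclaim could then do the synchronization once kfree_rcu_bytes_total()
crosses some fraction of reclaimable memory.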

>
>>>> All GFP_NOFAIL users should allow direct-reclaim, thus this loop
>>>> in page_alloc shouldn't block RCU and doesn't need special care.
>>>
>>> I must defer to you guys on this. The main caution is the duration of
>>> direct reclaim. After all, if it is too long, the kfree_rcu() instance
>>> would have been better of just invoking synchronize_rcu().
>>
>> Individual callers of kfree_rcu() have no idea of the load on RCU,
>> nor how long direct reclaim is taking. Calling synchronize_rcu()
>> incorrectly has pretty major downsides to it, so nobody should be
>> trying to expedite kfree_rcu() unless there is a good reason to do
>> so (e.g. at unmount to ensure everything allocated by a filesystem
>> has actually been freed). Hence I'd much prefer the decision to
>> expedite callbacks is made by the RCU subsystem based on it's known
>> callback load and some indication of how close memory reclaim is to
>> declaring OOM...
>
> Sorry, I was unclear. There is a new single-argument kfree_rcu() under
> way that does not require an rcu_head in the structure being freed.
> However, in this case, kfree_rcu() might either allocate the memory
> that is needed to track the memory to be freed on the one hand or just
> invoke synchronize_rcu() on the other. So this decision would be taken
> inside kfree_rcu(), and not be visible to either core RCU or the caller
> of kfree_rcu().
>
> This decision is made based on whether or not the allocator provides
> kfree_rcu() the memory needed. The tradeoff is what GFP flags are
> supplied. So the question kfree_rcu() has to answer is "Would it be
> better to give myself to reclaim as an additional task, or would it
> instead be better to just invoke synchronize_rcu() and then immediately
> free()?"
>
> I am probably still unclear, but hopefully at least one step in the
> right direction.
>
> Thanx, Paul
>

2020-05-13 05:10:18

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Tue, May 12, 2020 at 08:18:26PM -0700, Paul E. McKenney wrote:
> On Wed, May 13, 2020 at 11:32:38AM +1000, Dave Chinner wrote:
> > On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote:
> > > On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote:
> > > > On 08/05/2020 17.46, Paul E. McKenney wrote:
> > > > > Easy for me to provide "start fast and inefficient mode" and "stop fast
> > > > > and inefficient mode" APIs for MM to call!
> > > > >
> > > > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would
> > > > > expect them not to nest (as in if you need them to nest, please let
> > > > > me know). I would not expect these to be invoked all that often (as in
> > > > > if you do need them to be fast and scalable, please let me know). >
> > > > > RCU would then be in fast/inefficient mode if either MM told it to be
> > > > > or if RCU had detected callback overload on at least one CPU.
> > > > >
> > > > > Seem reasonable?
> > > >
> > > > Not exactly nested calls, but kswapd threads are per numa node.
> > > > So, at some level nodes under pressure must be counted.
> > >
> > > Easy enough, especially given that RCU already "counts" CPUs having
> > > excessive numbers of callbacks. But assuming that the transitions to/from
> > > OOM are rare, I would start by just counting them with a global counter.
> > > If the counter is non-zero, RCU is in fast and inefficient mode.
> > >
> > > > Also forcing rcu calls only for cpus in one numa node might be useful.
> > >
> > > Interesting. RCU currently evaluates a given CPU by comparing the
> > > number of callbacks against a fixed cutoff that can be set at boot using
> > > rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded,
> > > RCU becomes more aggressive about invoking callbacks on that CPU, for
> > > example, by sacrificing some degree of real-time response. I believe
> > > that this heuristic would also serve the OOM use case well.
> >
> > So one of the things that I'm not sure people have connected here is
> > that memory reclaim done by shrinkers is one of the things that
> > drives huge numbers of call_rcu() callbacks to free memory via rcu.
> > If we are reclaiming dentries and inodes, then we can be pushing
> > thousands to hundreds of thousands of objects into kfree_rcu()
> > and/or direct call_rcu() calls to free these objects in a single
> > reclaim pass.
>
> Good point!
>
> > Hence the trigger for RCU going into "excessive callback" mode
> > might, in fact, be kswapd running a pass over the shrinkers. i.e.
> > memory reclaim itself can be responsible for pushing RCU into this "OOM
> > pressure" situation.
> >
> > So perhaps we've missed a trick here by not having the memory
> > reclaim routines trigger RCU callbacks at the end of a priority
> > scan. The shrinkers have queued the objects for freeing, but they
> > haven't actually been freed yet and so things like slab pages
> > haven't actually been returned to the free pool even though the
> > shrinkers have said "freed this many objects"...
> >
> > i.e. perhaps the right solution here is a "rcu_run_callbacks()"
> > function that memory reclaim calls before backing off and/or winding
> > up reclaim priority.
>
> It would not be hard to make something that put RCU into fast/inefficient
> mode for a couple of grace periods. I will also look into the possibility
> of speeding up callback invocation.
>
> It might also make sense to put RCU grace periods into fast mode while
> running the shrinkers that are freeing dentries and inodes. However,
> kbuild test robot reports ugly regressions when putting RCU into
> fast/inefficient mode to quickly and too often. As in 78.5% degradation
> on one of the benchmarks.

I don't think it should be dependent on what specific shrinkers
free. There are other objects that may be RCU freed by shrinkers,
so it really shouldn't be applied just to specific shrinker
instances.

> > > > I wonder if direct-reclaim should at some stage simply wait for RCU QS.
> > > > I.e. call rcu_barrier() or similar somewhere before invoking OOM.
> > >
> > > The rcu_oom_count() function in the patch starting this thread returns the
> > > total number of outstanding callbacks queued on all CPUs. So one approach
> > > would be to invoke this function, and if the return value was truly
> > > huge (taking size of memory and who knows that all else into account),
> > > do the rcu_barrier() to wait for RCU to clear its current backlog.
> >
> > The shrinker scan control structure has a node mask in it to
> > indicate what node (and hence CPUs) it should be reclaiming from.
> > This information comes from the main reclaim scan routine, so it
> > would be trivial to feed straight into the RCU code to have it
> > act on just the CPUs/node that we are reclaiming memory from...
>
> For the callbacks, RCU can operate on CPUs, in theory anyway. The
> grace period itself, however, is inherently global.

*nod*

The memory reclaim backoffs tend to be in the order of 50-100
milliseconds, though, so we are talking multiple grace periods here,
right? In which case, triggering a grace period expiry before a
backoff takes place might make a lot of sense...
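
Sketch of what I mean by that - rcu_force_quiescent_state() already
exists, the speculative part is placing it just ahead of the backoff:

static void reclaim_backoff_with_rcu_kick(void)
{
	/* Kick the grace period so the backoff spans at least one GP. */
	rcu_force_quiescent_state();

	/* Stand-in for the real congestion/backoff wait in reclaim. */
	msleep(100);
}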

> > > On the NUMA point, it would be dead easy for me to supply a function
> > > that returned the number of callbacks on a given CPU, which would allow
> > > you to similarly evaluate a NUMA node, a cgroup, or whatever.
> >
> > I'd think it runs the other way around - we optimisitically call the
> > RCU layer to do cleanup, and the RCU layer decides if there's enough
> > queued callbacks on the cpus/node to run callbacks immediately. It
> > would even be provided with the scan priority to indicate the level
> > of desperation memory reclaim is under....
>
> Easy for RCU to count the number of callbacks. That said, it has no
> idea which callbacks are which. Perhaps kfree_rcu() could gather that
> information from the slab allocator, though.
>
> > > > All GFP_NOFAIL users should allow direct-reclaim, thus this loop
> > > > in page_alloc shouldn't block RCU and doesn't need special care.
> > >
> > > I must defer to you guys on this. The main caution is the duration of
> > > direct reclaim. After all, if it is too long, the kfree_rcu() instance
> > > would have been better of just invoking synchronize_rcu().
> >
> > Individual callers of kfree_rcu() have no idea of the load on RCU,
> > nor how long direct reclaim is taking. Calling synchronize_rcu()
> > incorrectly has pretty major downsides to it, so nobody should be
> > trying to expedite kfree_rcu() unless there is a good reason to do
> > so (e.g. at unmount to ensure everything allocated by a filesystem
> > has actually been freed). Hence I'd much prefer the decision to
> > expedite callbacks is made by the RCU subsystem based on it's known
> > callback load and some indication of how close memory reclaim is to
> > declaring OOM...
>
> Sorry, I was unclear. There is a new single-argument kfree_rcu() under
> way that does not require an rcu_head in the structure being freed.
> However, in this case, kfree_rcu() might either allocate the memory
> that is needed to track the memory to be freed on the one hand or just
> invoke synchronize_rcu() on the other. So this decision would be taken
> inside kfree_rcu(), and not be visible to either core RCU or the caller
> of kfree_rcu().

Ah. The need to allocate memory to free memory, and with that comes
the requirement of a forwards progress guarantee. It's mempools all
over again :P

Personally, though, designing functionality that specifically
requires memory allocation to free memory seems like an incredibly
fragile thing to be doing. I don't know the use case here, but just
the general description of what you are trying to do rings
alarm bells in my head...

> This decision is made based on whether or not the allocator provides
> kfree_rcu() the memory needed. The tradeoff is what GFP flags are
> supplied.

So there's a reclaim recursion problem here, too?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2020-05-13 13:05:59

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Wed, May 13, 2020 at 03:07:26PM +1000, Dave Chinner wrote:
> On Tue, May 12, 2020 at 08:18:26PM -0700, Paul E. McKenney wrote:
> > On Wed, May 13, 2020 at 11:32:38AM +1000, Dave Chinner wrote:
> > > On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote:
> > > > On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote:
> > > > > On 08/05/2020 17.46, Paul E. McKenney wrote:
> > > > > > Easy for me to provide "start fast and inefficient mode" and "stop fast
> > > > > > and inefficient mode" APIs for MM to call!
> > > > > >
> > > > > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would
> > > > > > expect them not to nest (as in if you need them to nest, please let
> > > > > > me know). I would not expect these to be invoked all that often (as in
> > > > > > if you do need them to be fast and scalable, please let me know). >
> > > > > > RCU would then be in fast/inefficient mode if either MM told it to be
> > > > > > or if RCU had detected callback overload on at least one CPU.
> > > > > >
> > > > > > Seem reasonable?
> > > > >
> > > > > Not exactly nested calls, but kswapd threads are per numa node.
> > > > > So, at some level nodes under pressure must be counted.
> > > >
> > > > Easy enough, especially given that RCU already "counts" CPUs having
> > > > excessive numbers of callbacks. But assuming that the transitions to/from
> > > > OOM are rare, I would start by just counting them with a global counter.
> > > > If the counter is non-zero, RCU is in fast and inefficient mode.
> > > >
> > > > > Also forcing rcu calls only for cpus in one numa node might be useful.
> > > >
> > > > Interesting. RCU currently evaluates a given CPU by comparing the
> > > > number of callbacks against a fixed cutoff that can be set at boot using
> > > > rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded,
> > > > RCU becomes more aggressive about invoking callbacks on that CPU, for
> > > > example, by sacrificing some degree of real-time response. I believe
> > > > that this heuristic would also serve the OOM use case well.
> > >
> > > So one of the things that I'm not sure people have connected here is
> > > that memory reclaim done by shrinkers is one of the things that
> > > drives huge numbers of call_rcu() callbacks to free memory via rcu.
> > > If we are reclaiming dentries and inodes, then we can be pushing
> > > thousands to hundreds of thousands of objects into kfree_rcu()
> > > and/or direct call_rcu() calls to free these objects in a single
> > > reclaim pass.
> >
> > Good point!
> >
> > > Hence the trigger for RCU going into "excessive callback" mode
> > > might, in fact, be kswapd running a pass over the shrinkers. i.e.
> > > memory reclaim itself can be responsible for pushing RCU into this "OOM
> > > pressure" situation.
> > >
> > > So perhaps we've missed a trick here by not having the memory
> > > reclaim routines trigger RCU callbacks at the end of a priority
> > > scan. The shrinkers have queued the objects for freeing, but they
> > > haven't actually been freed yet and so things like slab pages
> > > haven't actually been returned to the free pool even though the
> > > shrinkers have said "freed this many objects"...
> > >
> > > i.e. perhaps the right solution here is a "rcu_run_callbacks()"
> > > function that memory reclaim calls before backing off and/or winding
> > > up reclaim priority.
> >
> > It would not be hard to make something that put RCU into fast/inefficient
> > mode for a couple of grace periods. I will also look into the possibility
> > of speeding up callback invocation.
> >
> > It might also make sense to put RCU grace periods into fast mode while
> > running the shrinkers that are freeing dentries and inodes. However,
> > kbuild test robot reports ugly regressions when putting RCU into
> > fast/inefficient mode to quickly and too often. As in 78.5% degradation
> > on one of the benchmarks.
>
> I don't think it should be dependent on what specific shrinkers
> free. There are other objects that may be RCU freed by shrinkers,
> so it really shouldn't be applied just to specific shrinker
> instances.

Plus a call_rcu() might be freeing a linked structure, so counting the
size of the argument to call_rcu() would be understating the total amount
of memory being freed.

> > > > > I wonder if direct-reclaim should at some stage simply wait for RCU QS.
> > > > > I.e. call rcu_barrier() or similar somewhere before invoking OOM.
> > > >
> > > > The rcu_oom_count() function in the patch starting this thread returns the
> > > > total number of outstanding callbacks queued on all CPUs. So one approach
> > > > would be to invoke this function, and if the return value was truly
> > > > huge (taking size of memory and who knows that all else into account),
> > > > do the rcu_barrier() to wait for RCU to clear its current backlog.
> > >
> > > The shrinker scan control structure has a node mask in it to
> > > indicate what node (and hence CPUs) it should be reclaiming from.
> > > This information comes from the main reclaim scan routine, so it
> > > would be trivial to feed straight into the RCU code to have it
> > > act on just the CPUs/node that we are reclaiming memory from...
> >
> > For the callbacks, RCU can operate on CPUs, in theory anyway. The
> > grace period itself, however, is inherently global.
>
> *nod*
>
> The memory reclaim backoffs tend to be in the order of 50-100
> milliseconds, though, so we are talking multiple grace periods here,
> right? In which case, triggering a grace period expiry before a
> backoff takes place might make a lot sense...

Usually, yes, I would expect several grace periods to elapse during
a backoff.

> > > > On the NUMA point, it would be dead easy for me to supply a function
> > > > that returned the number of callbacks on a given CPU, which would allow
> > > > you to similarly evaluate a NUMA node, a cgroup, or whatever.
> > >
> > > I'd think it runs the other way around - we optimisitically call the
> > > RCU layer to do cleanup, and the RCU layer decides if there's enough
> > > queued callbacks on the cpus/node to run callbacks immediately. It
> > > would even be provided with the scan priority to indicate the level
> > > of desperation memory reclaim is under....
> >
> > Easy for RCU to count the number of callbacks. That said, it has no
> > idea which callbacks are which. Perhaps kfree_rcu() could gather that
> > information from the slab allocator, though.
> >
> > > > > All GFP_NOFAIL users should allow direct-reclaim, thus this loop
> > > > > in page_alloc shouldn't block RCU and doesn't need special care.
> > > >
> > > > I must defer to you guys on this. The main caution is the duration of
> > > > direct reclaim. After all, if it is too long, the kfree_rcu() instance
> > > > would have been better of just invoking synchronize_rcu().
> > >
> > > Individual callers of kfree_rcu() have no idea of the load on RCU,
> > > nor how long direct reclaim is taking. Calling synchronize_rcu()
> > > incorrectly has pretty major downsides to it, so nobody should be
> > > trying to expedite kfree_rcu() unless there is a good reason to do
> > > so (e.g. at unmount to ensure everything allocated by a filesystem
> > > has actually been freed). Hence I'd much prefer the decision to
> > > expedite callbacks is made by the RCU subsystem based on it's known
> > > callback load and some indication of how close memory reclaim is to
> > > declaring OOM...
> >
> > Sorry, I was unclear. There is a new single-argument kfree_rcu() under
> > way that does not require an rcu_head in the structure being freed.
> > However, in this case, kfree_rcu() might either allocate the memory
> > that is needed to track the memory to be freed on the one hand or just
> > invoke synchronize_rcu() on the other. So this decision would be taken
> > inside kfree_rcu(), and not be visible to either core RCU or the caller
> > of kfree_rcu().
>
> Ah. The need to allocate memory to free memory, and with that comes
> the requirement of a forwards progress guarantee. It's mempools all
> over again :P
>
> Personally, though, designing functionality that specifically
> requires memory allocation to free memory seems like an incredibly
> fragile thing to be doing. I don't know the use case here, though,
> but jsut the general description of what you are trying to do rings
> alarm bells in my head...

And mine as well. Hence my earlier insistence that kfree_rcu() never
block waiting for memory, but instead just invoke synchronize_rcu() and
then immediately free the memory. Others have since convinced me that
there are combinations of GFP flags that allow only limited sleeping so
as to avoid the OOM deadlocks that I fear.

> > This decision is made based on whether or not the allocator provides
> > kfree_rcu() the memory needed. The tradeoff is what GFP flags are
> > supplied.
>
> So there's a reclaim recursion problem here, too?

There was an earlier discussion as to what was safe, with one
recommendation being __GFP_NORETRY.

Thanx, Paul

2020-05-13 20:42:49

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH RFC tip/core/rcu] Add shrinker to shift to fast/inefficient GP mode

On Wed, May 13, 2020 at 07:35:25AM +0300, Konstantin Khlebnikov wrote:
> On 13/05/2020 06.18, Paul E. McKenney wrote:
> > On Wed, May 13, 2020 at 11:32:38AM +1000, Dave Chinner wrote:
> > > On Sat, May 09, 2020 at 09:09:00AM -0700, Paul E. McKenney wrote:
> > > > On Sat, May 09, 2020 at 11:54:40AM +0300, Konstantin Khlebnikov wrote:
> > > > > On 08/05/2020 17.46, Paul E. McKenney wrote:
> > > > > > Easy for me to provide "start fast and inefficient mode" and "stop fast
> > > > > > and inefficient mode" APIs for MM to call!
> > > > > >
> > > > > > How about rcu_mempressure_start() and rcu_mempressure_end()? I would
> > > > > > expect them not to nest (as in if you need them to nest, please let
> > > > > > me know). I would not expect these to be invoked all that often (as in
> > > > > > if you do need them to be fast and scalable, please let me know). >
> > > > > > RCU would then be in fast/inefficient mode if either MM told it to be
> > > > > > or if RCU had detected callback overload on at least one CPU.
> > > > > >
> > > > > > Seem reasonable?
> > > > >
> > > > > Not exactly nested calls, but kswapd threads are per numa node.
> > > > > So, at some level nodes under pressure must be counted.
> > > >
> > > > Easy enough, especially given that RCU already "counts" CPUs having
> > > > excessive numbers of callbacks. But assuming that the transitions to/from
> > > > OOM are rare, I would start by just counting them with a global counter.
> > > > If the counter is non-zero, RCU is in fast and inefficient mode.
> > > >
> > > > > Also forcing rcu calls only for cpus in one numa node might be useful.
> > > >
> > > > Interesting. RCU currently evaluates a given CPU by comparing the
> > > > number of callbacks against a fixed cutoff that can be set at boot using
> > > > rcutree.qhimark, which defaults to 10,000. When this cutoff is exceeded,
> > > > RCU becomes more aggressive about invoking callbacks on that CPU, for
> > > > example, by sacrificing some degree of real-time response. I believe
> > > > that this heuristic would also serve the OOM use case well.
> > >
> > > So one of the things that I'm not sure people have connected here is
> > > that memory reclaim done by shrinkers is one of the things that
> > > drives huge numbers of call_rcu() callbacks to free memory via rcu.
> > > If we are reclaiming dentries and inodes, then we can be pushing
> > > thousands to hundreds of thousands of objects into kfree_rcu()
> > > and/or direct call_rcu() calls to free these objects in a single
> > > reclaim pass.
> >
> > Good point!
>
> Indeed
>
> >
> > > Hence the trigger for RCU going into "excessive callback" mode
> > > might, in fact, be kswapd running a pass over the shrinkers. i.e.
> > > memory reclaim itself can be responsible for pushing RCU into this "OOM
> > > pressure" situation.
> > >
> > > So perhaps we've missed a trick here by not having the memory
> > > reclaim routines trigger RCU callbacks at the end of a priority
> > > scan. The shrinkers have queued the objects for freeing, but they
> > > haven't actually been freed yet and so things like slab pages
> > > haven't actually been returned to the free pool even though the
> > > shrinkers have said "freed this many objects"...
> > >
> > > i.e. perhaps the right solution here is a "rcu_run_callbacks()"
> > > function that memory reclaim calls before backing off and/or winding
> > > up reclaim priority.
> >
> > It would not be hard to make something that put RCU into fast/inefficient
> > mode for a couple of grace periods. I will also look into the possibility
> > of speeding up callback invocation.
> >
> > It might also make sense to put RCU grace periods into fast mode while
> > running the shrinkers that are freeing dentries and inodes. However,
> > kbuild test robot reports ugly regressions when putting RCU into
> > fast/inefficient mode to quickly and too often. As in 78.5% degradation
> > on one of the benchmarks.
>
> I think fast/inefficient mode here just an optimization for freeing
> memory faster. It doesn't solve the problem itself.
>
> At first we have to close the loop in reclaimer and actually wait or run
> rcu callbacks which might free memory before increasing priority and
> invoking OOM killer.

That is easy, just invoke rcu_barrier(), which will block until all
prior call_rcu()/kfree_rcu() callbacks have been invoked.
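
For example, reusing the rcu_oom_count() function from the RFC patch at
the start of this thread, and with an arbitrary threshold purely for
illustration:

static void reclaim_wait_for_rcu(void)
{
	/* Arbitrary threshold; a real one would scale with memory size. */
	if (rcu_oom_count(NULL, NULL) > 100000)
		rcu_barrier();	/* wait for all previously queued callbacks */
}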

> > > > > I wonder if direct-reclaim should at some stage simply wait for RCU QS.
> > > > > I.e. call rcu_barrier() or similar somewhere before invoking OOM.
> > > >
> > > > The rcu_oom_count() function in the patch starting this thread returns the
> > > > total number of outstanding callbacks queued on all CPUs. So one approach
> > > > would be to invoke this function, and if the return value was truly
> > > > huge (taking size of memory and who knows that all else into account),
> > > > do the rcu_barrier() to wait for RCU to clear its current backlog.
> > >
> > > The shrinker scan control structure has a node mask in it to
> > > indicate what node (and hence CPUs) it should be reclaiming from.
> > > This information comes from the main reclaim scan routine, so it
> > > would be trivial to feed straight into the RCU code to have it
> > > act on just the CPUs/node that we are reclaiming memory from...
> >
> > For the callbacks, RCU can operate on CPUs, in theory anyway. The
> > grace period itself, however, is inherently global.
> >
> > > > On the NUMA point, it would be dead easy for me to supply a function
> > > > that returned the number of callbacks on a given CPU, which would allow
> > > > you to similarly evaluate a NUMA node, a cgroup, or whatever.
> > >
> > > I'd think it runs the other way around - we optimisitically call the
> > > RCU layer to do cleanup, and the RCU layer decides if there's enough
> > > queued callbacks on the cpus/node to run callbacks immediately. It
> > > would even be provided with the scan priority to indicate the level
> > > of desperation memory reclaim is under....
> >
> > Easy for RCU to count the number of callbacks. That said, it has no
> > idea which callbacks are which. Perhaps kfree_rcu() could gather that
> > information from the slab allocator, though.
>
> It's simple to mark slab shrinkers that frees object through RCU and
> count freed objects in reclaimer:
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -536,6 +536,9 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> else
> new_nr = atomic_long_read(&shrinker->nr_deferred[nid]);
>
> + if (shrinker->flags & SHRINKER_KFREE_RCU)
> + shrinkctl->nr_kfree_rcu += freed;
> +
> trace_mm_shrink_slab_end(shrinker, nid, freed, nr, new_nr, total_scan);
> return freed;
> }
>
> And when accumulated enough do some synchronization.
>
> Probably it's better to sum freed objects at per-cpu variable,
> and accumulate size rather than count.

RCU currently has no notion of size outside of possibly kfree_rcu(),
so that would be new information to RCU.

Thanx, Paul

> > > > > All GFP_NOFAIL users should allow direct-reclaim, thus this loop
> > > > > in page_alloc shouldn't block RCU and doesn't need special care.
> > > >
> > > > I must defer to you guys on this. The main caution is the duration of
> > > > direct reclaim. After all, if it is too long, the kfree_rcu() instance
> > > > would have been better of just invoking synchronize_rcu().
> > >
> > > Individual callers of kfree_rcu() have no idea of the load on RCU,
> > > nor how long direct reclaim is taking. Calling synchronize_rcu()
> > > incorrectly has pretty major downsides to it, so nobody should be
> > > trying to expedite kfree_rcu() unless there is a good reason to do
> > > so (e.g. at unmount to ensure everything allocated by a filesystem
> > > has actually been freed). Hence I'd much prefer the decision to
> > > expedite callbacks is made by the RCU subsystem based on it's known
> > > callback load and some indication of how close memory reclaim is to
> > > declaring OOM...
> >
> > Sorry, I was unclear. There is a new single-argument kfree_rcu() under
> > way that does not require an rcu_head in the structure being freed.
> > However, in this case, kfree_rcu() might either allocate the memory
> > that is needed to track the memory to be freed on the one hand or just
> > invoke synchronize_rcu() on the other. So this decision would be taken
> > inside kfree_rcu(), and not be visible to either core RCU or the caller
> > of kfree_rcu().
> >
> > This decision is made based on whether or not the allocator provides
> > kfree_rcu() the memory needed. The tradeoff is what GFP flags are
> > supplied. So the question kfree_rcu() has to answer is "Would it be
> > better to give myself to reclaim as an additional task, or would it
> > instead be better to just invoke synchronize_rcu() and then immediately
> > free()?"
> >
> > I am probably still unclear, but hopefully at least one step in the
> > right direction.
> >
> > Thanx, Paul
> >