2018-11-24 08:25:42

by Rafael J. Wysocki

[permalink] [raw]
Subject: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

From: Rafael J. Wysocki <[email protected]>

The venerable menu governor does some thigns that are quite
questionable in my view.

First, it includes timer wakeups in the pattern detection data and
mixes them up with wakeups from other sources which in some cases
causes it to expect what essentially would be a timer wakeup in a
time frame in which no timer wakeups are possible (becuase it knows
the time until the next timer event and that is later than the
expected wakeup time).

Second, it uses the extra exit latency limit based on the predicted
idle duration and depending on the number of tasks waiting on I/O,
even though those tasks may run on a different CPU when they are
woken up. Moreover, the time ranges used by it for the sleep length
correction factors depend on whether or not there are tasks waiting
on I/O, which again doesn't imply anything in particular, and they
are not correlated to the list of available idle states in any way
whatever.

Also, the pattern detection code in menu may end up considering
values that are too large to matter at all, in which cases running
it is a waste of time.

A major rework of the menu governor would be required to address
these issues and the performance of at least some workloads (tuned
specifically to the current behavior of the menu governor) is likely
to suffer from that. It is thus better to introduce an entirely new
governor without them and let everybody use the governor that works
better with their actual workloads.

The new governor introduced here, the timer events oriented (TEO)
governor, uses the same basic strategy as menu: it always tries to
find the deepest idle state that can be used in the given conditions.
However, it applies a different approach to that problem.

First, it doesn't use "correction factors" for the time till the
closest timer, but instead it tries to correlate the measured idle
duration values with the available idle states and use that
information to pick up the idle state that is most likely to "match"
the upcoming CPU idle interval.

Second, it doesn't take the number of "I/O waiters" into account at
all and the pattern detection code in it avoids taking timer wakeups
into account. It also only uses idle duration values less than the
current time till the closest timer (with the tick excluded) for that
purpose.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---

v5 -> v6:
* Avoid applying poll_time_limit to non-polling idle states by mistake.
* Use idle duration measured by the governor for everything (as it likely is
more accurate than the one measured by the core).
* Rename SPIKE to PULSE.
* Do not run pattern detection upfront. Instead, use recent idle duration
values to refine the state selection after finding a candidate idle state.
* Do not use the expected idle duration as an extra latency constraint
(exit latency is less than the target residency for all of the idle states
known to me anyway, so this doesn't change anything in practice).

v4 -> v5:
* Avoid using shallow idle states when the tick has been stopped already.

v3 -> v4:
* Make the pattern detection avoid returning too early if the minimum
sample is too far from the average.
* Reformat the changelog (as requested by Peter).

v2 -> v3:
* Simplify the pattern detection code and make it return a value
lower than the time to the closest timer if the majority of recent
idle intervals are below it regardless of their variance (that should
cause it to be slightly more aggressive).
* Do not count wakeups from state 0 due to the time limit in poll_idle()
as non-timer.

---
drivers/cpuidle/Kconfig | 11
drivers/cpuidle/governors/Makefile | 1
drivers/cpuidle/governors/teo.c | 450 +++++++++++++++++++++++++++++++++++++
3 files changed, 462 insertions(+)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===================================================================
--- /dev/null
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -0,0 +1,450 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Timer events oriented CPU idle governor
+ *
+ * Copyright (C) 2018 Intel Corporation
+ * Author: Rafael J. Wysocki <[email protected]>
+ *
+ * The idea of this governor is based on the observation that on many systems
+ * timer events are two or more orders of magnitude more frequent than any
+ * other interrupts, so they are likely to be the most significant source of CPU
+ * wakeups from idle states. Moreover, information about what happened in the
+ * (relatively recent) past can be used to estimate whether or not the deepest
+ * idle state with target residency within the time to the closest timer is
+ * likely to be suitable for the upcoming idle time of the CPU and, if not, then
+ * which of the shallower idle states to choose.
+ *
+ * Of course, non-timer wakeup sources are more important in some use cases and
+ * they can be covered by taking a few most recent idle time intervals of the
+ * CPU into account. However, even in that case it is not necessary to consider
+ * idle duration values greater than the time till the closest timer, as the
+ * patterns that they may belong to produce average values close enough to
+ * the time till the closest timer (sleep length) anyway.
+ *
+ * Thus this governor estimates whether or not the upcoming idle time of the CPU
+ * is likely to be significantly shorter than the sleep length and selects an
+ * idle state for it in accordance with that, as follows:
+ *
+ * - Find an idle state on the basis of the sleep length and state statistics
+ * collected over time:
+ *
+ * o Find the deepest idle state whose target residency is less than or euqal
+ * to the sleep length.
+ *
+ * o Select it if it matched both the sleep length and the observed idle
+ * duration in the past more often than it matched the sleep length alone
+ * (i.e. the observed idle duration was significantly shorter than the sleep
+ * length matched by it).
+ *
+ * o Otherwise, select the shallower state with the greatest matched "early"
+ * wakeups metric.
+ *
+ * - If at least half of the most recent idle duration values is below the
+ * target residency of the idle state selected so far, use those values to
+ * compute the new expected idle duration and find an idle state matching it
+ * (which has to be shallower than the one selected so far).
+ */
+
+#include <linux/cpuidle.h>
+#include <linux/jiffies.h>
+#include <linux/kernel.h>
+#include <linux/sched/clock.h>
+#include <linux/tick.h>
+
+/*
+ * The PULSE value is added to metrics when they grow and the DECAY_SHIFT value
+ * is used for decreasing metrics on a regular basis.
+ */
+#define PULSE 1024
+#define DECAY_SHIFT 3
+
+/*
+ * Number of the most recent idle duration values to take into consideration for
+ * the detection of wakeup patterns.
+ */
+#define INTERVALS 8
+
+/**
+ * struct teo_idle_state - Idle state data used by the TEO cpuidle governor.
+ * @early_hits: "Early" CPU wakeups matched by this state.
+ * @hits: "On time" CPU wakeups matched by this state.
+ * @misses: CPU wakeups "missed" by this state.
+ *
+ * A CPU wakeup is "matched" by a given idle state if the idle duration measured
+ * after the wakeup is between the target residency of that state and the target
+ * residnecy of the next one (or if this is the deepest available idle state, it
+ * "matches" a CPU wakeup when the measured idle duration is at least equal to
+ * its target residency).
+ *
+ * Also, from the TEO governor prespective, a CPU wakeup from idle is "early" if
+ * it occurs significantly earlier than the closest expected timer event (that
+ * is, early enough to match an idle state shallower than the one matching the
+ * time till the closest timer event). Otherwise, the wakeup is "on time", or
+ * it is a "hit".
+ *
+ * A "miss" occurs when the given state doesn't match the wakeup, but it matches
+ * the time till the closest timer event used for idle state selection.
+ */
+struct teo_idle_state {
+ unsigned int early_hits;
+ unsigned int hits;
+ unsigned int misses;
+};
+
+/**
+ * struct teo_cpu - CPU data used by the TEO cpuidle governor.
+ * @time_span_ns: Time between idle state selection and post-wakeup update.
+ * @sleep_length_ns: Time till the closest timer event (at the selection time).
+ * @states: Idle states data corresponding to this CPU.
+ * @last_state: Idle state entered by the CPU last time.
+ * @interval_idx: Index of the most recent saved idle interval.
+ * @intervals: Saved idle duration values.
+ */
+struct teo_cpu {
+ u64 time_span_ns;
+ u64 sleep_length_ns;
+ struct teo_idle_state states[CPUIDLE_STATE_MAX];
+ int last_state;
+ int interval_idx;
+ unsigned int intervals[INTERVALS];
+};
+
+static DEFINE_PER_CPU(struct teo_cpu, teo_cpus);
+
+/**
+ * teo_update - Update CPU data after wakeup.
+ * @drv: cpuidle driver containing state data.
+ * @dev: Target CPU.
+ */
+static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
+{
+ struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+ unsigned int sleep_length_us = ktime_to_us(cpu_data->sleep_length_ns);
+ int i, idx_hit = -1, idx_timer = -1;
+ unsigned int measured_us;
+
+ if (cpu_data->time_span_ns == cpu_data->sleep_length_ns) {
+ /* One of the safety nets has triggered (most likely). */
+ measured_us = sleep_length_us;
+ } else {
+ measured_us = dev->last_residency;
+ i = cpu_data->last_state;
+ if (measured_us >= 2 * drv->states[i].exit_latency)
+ measured_us -= drv->states[i].exit_latency;
+ else
+ measured_us /= 2;
+ }
+
+ /*
+ * Decay the "early hits" metric for all of the states and find the
+ * states matching the sleep length and the measured idle duration.
+ */
+ for (i = 0; i < drv->state_count; i++) {
+ unsigned int early_hits = cpu_data->states[i].early_hits;
+
+ cpu_data->states[i].early_hits -= early_hits >> DECAY_SHIFT;
+
+ if (drv->states[i].target_residency <= measured_us)
+ idx_hit = i;
+
+ if (drv->states[i].target_residency <= sleep_length_us)
+ idx_timer = i;
+ }
+
+ /*
+ * Update the "hits" and "misses" data for the state matching the sleep
+ * length. If it matches the measured idle duration too, this is a hit,
+ * so increase the "hits" metric for it then. Otherwise, this is a
+ * miss, so increase the "misses" metric for it. In the latter case
+ * also increase the "early hits" metric for the state that actually
+ * matches the measured idle duration.
+ */
+ if (idx_timer >= 0) {
+ unsigned int hits = cpu_data->states[idx_timer].hits;
+ unsigned int misses = cpu_data->states[idx_timer].misses;
+
+ hits -= hits >> DECAY_SHIFT;
+ misses -= misses >> DECAY_SHIFT;
+
+ if (idx_timer > idx_hit) {
+ misses += PULSE;
+ if (idx_hit >= 0)
+ cpu_data->states[idx_hit].early_hits += PULSE;
+ } else {
+ hits += PULSE;
+ }
+
+ cpu_data->states[idx_timer].misses = misses;
+ cpu_data->states[idx_timer].hits = hits;
+ }
+
+ /*
+ * Save idle duration values corresponding to non-timer wakeups for
+ * pattern detection.
+ *
+ * If the total time span between idle state selection and the "reflect"
+ * callback is greater than or equal to the sleep length determined at
+ * the idle state selection time, the wakeup is likely to be due to a
+ * timer event.
+ */
+ if (cpu_data->time_span_ns >= cpu_data->sleep_length_ns)
+ measured_us = UINT_MAX;
+
+ cpu_data->intervals[cpu_data->interval_idx++] = measured_us;
+ if (cpu_data->interval_idx > INTERVALS)
+ cpu_data->interval_idx = 0;
+}
+
+/**
+ * teo_find_shallower_state - Find shallower idle state matching given duration.
+ * @drv: cpuidle driver containing state data.
+ * @dev: Target CPU.
+ * @state_idx: Index of the capping idle state.
+ * @duration_us: Idle duration value to match.
+ */
+static int teo_find_shallower_state(struct cpuidle_driver *drv,
+ struct cpuidle_device *dev, int state_idx,
+ unsigned int duration_us)
+{
+ int i;
+
+ for (i = state_idx - 1; i > 0; i--) {
+ if (drv->states[i].disabled || dev->states_usage[i].disable)
+ continue;
+
+ if (drv->states[i].target_residency <= duration_us)
+ break;
+ }
+ return i;
+}
+
+/**
+ * teo_select - Selects the next idle state to enter.
+ * @drv: cpuidle driver containing state data.
+ * @dev: Target CPU.
+ * @stop_tick: Indication on whether or not to stop the scheduler tick.
+ */
+static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
+ bool *stop_tick)
+{
+ struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+ int latency_req = cpuidle_governor_latency_req(dev->cpu);
+ unsigned int duration_us, count;
+ int max_early_idx, idx, i;
+ ktime_t delta_tick;
+
+ if (cpu_data->last_state >= 0) {
+ teo_update(drv, dev);
+ cpu_data->last_state = -1;
+ }
+
+ cpu_data->time_span_ns = local_clock();
+
+ cpu_data->sleep_length_ns = tick_nohz_get_sleep_length(&delta_tick);
+ duration_us = ktime_to_us(cpu_data->sleep_length_ns);
+
+ count = 0;
+ max_early_idx = -1;
+ idx = -1;
+
+ for (i = 0; i < drv->state_count; i++) {
+ struct cpuidle_state *s = &drv->states[i];
+ struct cpuidle_state_usage *su = &dev->states_usage[i];
+
+ if (s->disabled || su->disable) {
+ /*
+ * If the "early hits" metric of a disabled state is
+ * greater than the current maximum, it should be taken
+ * into account, because it would be a mistake to select
+ * a deeper state with lower "early hits" metric. The
+ * index cannot be changed to point to it, however, so
+ * just increase the max count alone and let the index
+ * still point to a shallower idle state.
+ */
+ if (max_early_idx >= 0 &&
+ count < cpu_data->states[i].early_hits)
+ count = cpu_data->states[i].early_hits;
+
+ continue;
+ }
+
+ if (idx < 0)
+ idx = i; /* first enabled state */
+
+ if (s->target_residency > duration_us) {
+ /*
+ * If the "hits" metric of the state matching the sleep
+ * length is greater than its "misses" metric, that is
+ * the one to use.
+ */
+ if (cpu_data->states[idx].hits > cpu_data->states[idx].misses)
+ break;
+
+ /*
+ * It is more likely that one of the shallower states
+ * will match the idle duration measured after wakeup,
+ * so take the one with the maximum "early hits" metric,
+ * but if that cannot be determined, just use the state
+ * selected so far.
+ */
+ if (max_early_idx >= 0) {
+ idx = max_early_idx;
+ duration_us = drv->states[idx].target_residency;
+ }
+ break;
+ }
+ if (s->exit_latency > latency_req) {
+ /*
+ * If we break out of the loop for latency reasons, use
+ * the target residency of the selected state as the
+ * expected idle duration to avoid stopping the tick
+ * as long as that target residency is low enough.
+ */
+ duration_us = drv->states[idx].target_residency;
+ break;
+ }
+
+ idx = i;
+
+ if (count < cpu_data->states[i].early_hits &&
+ !(tick_nohz_tick_stopped() &&
+ drv->states[i].target_residency < TICK_USEC)) {
+ count = cpu_data->states[i].early_hits;
+ max_early_idx = i;
+ }
+ }
+
+ if (idx < 0) {
+ idx = 0; /* No states enabled. Must use 0. */
+ } else if (idx > 0) {
+ unsigned int max = 0;
+ u64 sum = 0;
+
+ count = 0;
+
+ /*
+ * Count and sum the most recent idle duration values less than
+ * the target residency of the state selected so far, find the
+ * max.
+ */
+ for (i = 0; i < INTERVALS; i++) {
+ unsigned int val = cpu_data->intervals[i];
+
+ if (val >= drv->states[idx].target_residency)
+ continue;
+
+ count++;
+ sum += val;
+
+ if (val > max)
+ max = val;
+ }
+
+ /*
+ * Give up if the majority of the most recent idle duration
+ * values are not in the interesting range.
+ */
+ if (count >= INTERVALS / 2) {
+ unsigned int avg_us;
+
+ /*
+ * Discard the max to push the average towards shallower
+ * idle state somewhat (it is better to underestimate
+ * the idle duration than to overestimate it).
+ */
+ count--;
+ sum -= max;
+
+ avg_us = div64_u64(sum, count);
+
+ /*
+ * Avoid spending too much time in an idle state that
+ * would be too shallow.
+ */
+ if (!(tick_nohz_tick_stopped() && avg_us < TICK_USEC)) {
+ idx = teo_find_shallower_state(drv, dev, idx, avg_us);
+ duration_us = avg_us;
+ }
+ }
+ }
+
+ /*
+ * Don't stop the tick if the selected state is a polling one or if the
+ * expected idle duration is shorter than the tick period length.
+ */
+ if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
+ duration_us < TICK_USEC) && !tick_nohz_tick_stopped()) {
+ unsigned int delta_tick_us = ktime_to_us(delta_tick);
+
+ *stop_tick = false;
+
+ /*
+ * The tick is not going to be stopped, so if the target
+ * residency of the state to be returned is not within the time
+ * till the closest timer including the tick, try to correct
+ * that.
+ */
+ if (idx > 0 && drv->states[idx].target_residency > delta_tick_us)
+ idx = teo_find_shallower_state(drv, dev, idx, delta_tick_us);
+ }
+
+ return idx;
+}
+
+/**
+ * teo_reflect - Note that governor data for the CPU need to be updated.
+ * @dev: Target CPU.
+ * @state: Entered state.
+ */
+static void teo_reflect(struct cpuidle_device *dev, int state)
+{
+ struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+
+ cpu_data->last_state = state;
+ /*
+ * If the wakeup was not "natural", but triggered by one of the safety
+ * nets, assume that the CPU might have been idle for the entire sleep
+ * length time.
+ */
+ if (dev->poll_time_limit ||
+ (tick_nohz_idle_got_tick() && cpu_data->sleep_length_ns > TICK_NSEC)) {
+ dev->poll_time_limit = false;
+ cpu_data->time_span_ns = cpu_data->sleep_length_ns;
+ } else {
+ cpu_data->time_span_ns = local_clock() - cpu_data->time_span_ns;
+ }
+}
+
+/**
+ * teo_enable_device - Initialize the governor's data for the target CPU.
+ * @drv: cpuidle driver (not used).
+ * @dev: Target CPU.
+ */
+static int teo_enable_device(struct cpuidle_driver *drv,
+ struct cpuidle_device *dev)
+{
+ struct teo_cpu *cpu_data = per_cpu_ptr(&teo_cpus, dev->cpu);
+ int i;
+
+ memset(cpu_data, 0, sizeof(*cpu_data));
+
+ for (i = 0; i < INTERVALS; i++)
+ cpu_data->intervals[i] = UINT_MAX;
+
+ return 0;
+}
+
+static struct cpuidle_governor teo_governor = {
+ .name = "teo",
+ .rating = 22,
+ .enable = teo_enable_device,
+ .select = teo_select,
+ .reflect = teo_reflect,
+};
+
+static int __init teo_governor_init(void)
+{
+ return cpuidle_register_governor(&teo_governor);
+}
+
+postcore_initcall(teo_governor_init);
Index: linux-pm/drivers/cpuidle/Kconfig
===================================================================
--- linux-pm.orig/drivers/cpuidle/Kconfig
+++ linux-pm/drivers/cpuidle/Kconfig
@@ -23,6 +23,17 @@ config CPU_IDLE_GOV_LADDER
config CPU_IDLE_GOV_MENU
bool "Menu governor (for tickless system)"

+config CPU_IDLE_GOV_TEO
+ bool "Timer events oriented governor (for tickless systems)"
+ help
+ Menu governor derivative that uses a simplified idle state
+ selection method focused on timer events and does not do any
+ interactivity boosting.
+
+ Some workloads benefit from using this governor and it generally
+ should be safe to use. Say Y here if you are not happy with the
+ alternatives.
+
config DT_IDLE_STATES
bool

Index: linux-pm/drivers/cpuidle/governors/Makefile
===================================================================
--- linux-pm.orig/drivers/cpuidle/governors/Makefile
+++ linux-pm/drivers/cpuidle/governors/Makefile
@@ -4,3 +4,4 @@

obj-$(CONFIG_CPU_IDLE_GOV_LADDER) += ladder.o
obj-$(CONFIG_CPU_IDLE_GOV_MENU) += menu.o
+obj-$(CONFIG_CPU_IDLE_GOV_TEO) += teo.o



2018-11-24 08:26:08

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

On Friday, November 23, 2018 11:35:38 AM CET Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>
>
> The venerable menu governor does some thigns that are quite
> questionable in my view.
>
> First, it includes timer wakeups in the pattern detection data and
> mixes them up with wakeups from other sources which in some cases
> causes it to expect what essentially would be a timer wakeup in a
> time frame in which no timer wakeups are possible (becuase it knows
> the time until the next timer event and that is later than the
> expected wakeup time).
>
> Second, it uses the extra exit latency limit based on the predicted
> idle duration and depending on the number of tasks waiting on I/O,
> even though those tasks may run on a different CPU when they are
> woken up. Moreover, the time ranges used by it for the sleep length
> correction factors depend on whether or not there are tasks waiting
> on I/O, which again doesn't imply anything in particular, and they
> are not correlated to the list of available idle states in any way
> whatever.
>
> Also, the pattern detection code in menu may end up considering
> values that are too large to matter at all, in which cases running
> it is a waste of time.
>
> A major rework of the menu governor would be required to address
> these issues and the performance of at least some workloads (tuned
> specifically to the current behavior of the menu governor) is likely
> to suffer from that. It is thus better to introduce an entirely new
> governor without them and let everybody use the governor that works
> better with their actual workloads.
>
> The new governor introduced here, the timer events oriented (TEO)
> governor, uses the same basic strategy as menu: it always tries to
> find the deepest idle state that can be used in the given conditions.
> However, it applies a different approach to that problem.
>
> First, it doesn't use "correction factors" for the time till the
> closest timer, but instead it tries to correlate the measured idle
> duration values with the available idle states and use that
> information to pick up the idle state that is most likely to "match"
> the upcoming CPU idle interval.
>
> Second, it doesn't take the number of "I/O waiters" into account at
> all and the pattern detection code in it avoids taking timer wakeups
> into account. It also only uses idle duration values less than the
> current time till the closest timer (with the tick excluded) for that
> purpose.
>
> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---
>
> v5 -> v6:
> * Avoid applying poll_time_limit to non-polling idle states by mistake.
> * Use idle duration measured by the governor for everything (as it likely is
> more accurate than the one measured by the core).

This particular change is actually missing, sorry about that. It is not
essential, however, so the v6 should be good enough as is for evaluation
and review purposes.

> * Rename SPIKE to PULSE.
> * Do not run pattern detection upfront. Instead, use recent idle duration
> values to refine the state selection after finding a candidate idle state.
> * Do not use the expected idle duration as an extra latency constraint
> (exit latency is less than the target residency for all of the idle states
> known to me anyway, so this doesn't change anything in practice).

Thanks,
Rafael


2018-11-28 23:24:05

by Doug Smythies

[permalink] [raw]
Subject: RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

On 2018.11.23 02:36 Rafael J. Wysocki wrote:

v5 -> v6:
* Avoid applying poll_time_limit to non-polling idle states by mistake.
* Use idle duration measured by the governor for everything (as it likely is
more accurate than the one measured by the core).

-- above missing-- (see follow up e-mail from Rafael)

* Rename SPIKE to PULSE.
* Do not run pattern detection upfront. Instead, use recent idle duration
values to refine the state selection after finding a candidate idle state.
* Do not use the expected idle duration as an extra latency constraint
(exit latency is less than the target residency for all of the idle states
known to me anyway, so this doesn't change anything in practice).

Hi Rafael,

I did some minimal testing on teov6, using kernel 4.20-rc3 as my baseline
reference kernel.

Test 1: Phoronix bdench test, all options: 1, 6, 12, 48, 128, 256 clients.

Note: because it uses the disk, the dbench test is somewhat non-repeatable.
However, if particular attention is paid to not doing anything else with
the disk between tests, then it seems to be repeatable to within about 6%.

Anyway no significant difference observed between kernel 4.20-rc3 and the
same with the teov6 patch.

Test 2: Pipe test, non cross core. (And idle state 0 test, really)
I ran 4 pipe tests, 1 for each of my 4 cores, @2 CPUs per core.
Thus, pretty much only idle state 0 was ever used.
Processor package power was similar for both kernels.
teov6 entered/exited idle state 0 about 60,984 times/second/cpu.
-rc3 entered/exited idle state 0 about 62,806 times/second/cpu.
There was a difference in percentage time spent in idle state 0,
with kernel 4.20-rc3 spending 0.2441% in idle state 0 verses
teov6 at 0.0641%.

For throughput, teov6 was 1.4% faster.

Test 3: was an attempt to sweep through a preference for
all idle states.

40 threads were launched with nothing to do except sleep
for a variable duration of 1 to 500 uSec, each step was
run for 1 minute. With 1 minute idle before the test and a few
minutes idle after, the total test duration was about 505 minutes.
Recall that when one asks for a short sleep of 1 uSec, they actually
get about 50 uSec, due to overheads. So I use 40 threads in an attempt
to get the average time between wakeup events per CPU down somewhat.

The results are here:
http://fast.smythies.com/linux-pm/k420/k420-pn-sweep-teo6-2.htm

I might try to get some histogram information at a later date.

... Doug



2018-11-29 09:44:17

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

Hi Doug,

On Thu, Nov 29, 2018 at 12:20 AM Doug Smythies <[email protected]> wrote:
>
> On 2018.11.23 02:36 Rafael J. Wysocki wrote:
>
> v5 -> v6:
> * Avoid applying poll_time_limit to non-polling idle states by mistake.
> * Use idle duration measured by the governor for everything (as it likely is
> more accurate than the one measured by the core).
>
> -- above missing-- (see follow up e-mail from Rafael)
>
> * Rename SPIKE to PULSE.
> * Do not run pattern detection upfront. Instead, use recent idle duration
> values to refine the state selection after finding a candidate idle state.
> * Do not use the expected idle duration as an extra latency constraint
> (exit latency is less than the target residency for all of the idle states
> known to me anyway, so this doesn't change anything in practice).
>
> Hi Rafael,
>
> I did some minimal testing on teov6, using kernel 4.20-rc3 as my baseline
> reference kernel.
>
> Test 1: Phoronix bdench test, all options: 1, 6, 12, 48, 128, 256 clients.
>
> Note: because it uses the disk, the dbench test is somewhat non-repeatable.
> However, if particular attention is paid to not doing anything else with
> the disk between tests, then it seems to be repeatable to within about 6%.
>
> Anyway no significant difference observed between kernel 4.20-rc3 and the
> same with the teov6 patch.
>
> Test 2: Pipe test, non cross core. (And idle state 0 test, really)
> I ran 4 pipe tests, 1 for each of my 4 cores, @2 CPUs per core.
> Thus, pretty much only idle state 0 was ever used.
> Processor package power was similar for both kernels.
> teov6 entered/exited idle state 0 about 60,984 times/second/cpu.
> -rc3 entered/exited idle state 0 about 62,806 times/second/cpu.
> There was a difference in percentage time spent in idle state 0,
> with kernel 4.20-rc3 spending 0.2441% in idle state 0 verses
> teov6 at 0.0641%.
>
> For throughput, teov6 was 1.4% faster.
>
> Test 3: was an attempt to sweep through a preference for
> all idle states.
>
> 40 threads were launched with nothing to do except sleep
> for a variable duration of 1 to 500 uSec, each step was
> run for 1 minute. With 1 minute idle before the test and a few
> minutes idle after, the total test duration was about 505 minutes.
> Recall that when one asks for a short sleep of 1 uSec, they actually
> get about 50 uSec, due to overheads. So I use 40 threads in an attempt
> to get the average time between wakeup events per CPU down somewhat.
>
> The results are here:
> http://fast.smythies.com/linux-pm/k420/k420-pn-sweep-teo6-2.htm
>
> I might try to get some histogram information at a later date.

Thank you for the results, much appreciated!

2018-11-30 07:49:57

by Doug Smythies

[permalink] [raw]
Subject: RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

Hi Rafael,

On 2018.11.23 02:36 Rafael J. Wysocki wrote:

... [snip]...

> +/**
> + * teo_find_shallower_state - Find shallower idle state matching given duration.
> + * @drv: cpuidle driver containing state data.
> + * @dev: Target CPU.
> + * @state_idx: Index of the capping idle state.
> + * @duration_us: Idle duration value to match.
> + */
> +static int teo_find_shallower_state(struct cpuidle_driver *drv,
> + struct cpuidle_device *dev, int state_idx,
> + unsigned int duration_us)
> +{
> + int i;
> +
> + for (i = state_idx - 1; i > 0; i--) {
> + if (drv->states[i].disabled || dev->states_usage[i].disable)
> + continue;
> +
> + if (drv->states[i].target_residency <= duration_us)
> + break;
> + }
> + return i;
> +}

I think this subroutine has a problem when idle state 0
is disabled.

Perhaps something like this might help:

diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
index bc1c9a2..5b97639 100644
--- a/drivers/cpuidle/governors/teo.c
+++ b/drivers/cpuidle/governors/teo.c
@@ -196,7 +196,8 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
}

/**
- * teo_find_shallower_state - Find shallower idle state matching given duration.
+ * teo_find_shallower_state - Find shallower idle state matching given
+ * duration, if possible.
* @drv: cpuidle driver containing state data.
* @dev: Target CPU.
* @state_idx: Index of the capping idle state.
@@ -208,13 +209,15 @@ static int teo_find_shallower_state(struct cpuidle_driver *drv,
{
int i;

- for (i = state_idx - 1; i > 0; i--) {
+ for (i = state_idx - 1; i >= 0; i--) {
if (drv->states[i].disabled || dev->states_usage[i].disable)
continue;

if (drv->states[i].target_residency <= duration_us)
break;
}
+ if (i < 0)
+ i = state_idx;
return i;
}

@@ -264,7 +267,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
if (max_early_idx >= 0 &&
count < cpu_data->states[i].early_hits)
count = cpu_data->states[i].early_hits;
-
continue;
}

There is an additional issue where if idle state 0 is disabled (with the above suggested code patch),
idle state usage seems to fall to deeper states than idle state 1.
This is not the expected behaviour.
Kernel 4.20-rc3 works as expected.
I have not figured this issue out yet, in the code.

Example (1 minute per sample. Number of entries/exits per state):
State 0 State 1 State 2 State 3 State 4 Watts
28235143, 83, 26, 17, 837, 64.900
5583238, 657079, 5884941, 8498552, 30986831, 62.433 << Transition sample, after idle state 0 disabled
0, 793517, 7186099, 10559878, 38485721, 61.900 << ?? should have all gone into Idle state 1
0, 795414, 7340703, 10553117, 38513456, 62.050
0, 807028, 7288195, 10574113, 38523524, 62.167
0, 814983, 7403534, 10575108, 38571228, 62.167
0, 838302, 7747127, 10552289, 38556054, 62.183
9664999, 544473, 4914512, 6942037, 25295361, 63.633 << Transition sample, after idle state 0 enabled
27893504, 96, 40, 9, 912, 66.500
26556343, 83, 29, 7, 814, 66.683
27929227, 64, 20, 10, 931, 66.683

... Doug



2018-11-30 08:52:25

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

Hi Doug,

On Fri, Nov 30, 2018 at 8:49 AM Doug Smythies <[email protected]> wrote:
>
> Hi Rafael,
>
> On 2018.11.23 02:36 Rafael J. Wysocki wrote:
>
> ... [snip]...
>
> > +/**
> > + * teo_find_shallower_state - Find shallower idle state matching given duration.
> > + * @drv: cpuidle driver containing state data.
> > + * @dev: Target CPU.
> > + * @state_idx: Index of the capping idle state.
> > + * @duration_us: Idle duration value to match.
> > + */
> > +static int teo_find_shallower_state(struct cpuidle_driver *drv,
> > + struct cpuidle_device *dev, int state_idx,
> > + unsigned int duration_us)
> > +{
> > + int i;
> > +
> > + for (i = state_idx - 1; i > 0; i--) {
> > + if (drv->states[i].disabled || dev->states_usage[i].disable)
> > + continue;
> > +
> > + if (drv->states[i].target_residency <= duration_us)
> > + break;
> > + }
> > + return i;
> > +}
>
> I think this subroutine has a problem when idle state 0
> is disabled.

You are right, thanks!

> Perhaps something like this might help:
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index bc1c9a2..5b97639 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -196,7 +196,8 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> }
>
> /**
> - * teo_find_shallower_state - Find shallower idle state matching given duration.
> + * teo_find_shallower_state - Find shallower idle state matching given
> + * duration, if possible.
> * @drv: cpuidle driver containing state data.
> * @dev: Target CPU.
> * @state_idx: Index of the capping idle state.
> @@ -208,13 +209,15 @@ static int teo_find_shallower_state(struct cpuidle_driver *drv,
> {
> int i;
>
> - for (i = state_idx - 1; i > 0; i--) {
> + for (i = state_idx - 1; i >= 0; i--) {
> if (drv->states[i].disabled || dev->states_usage[i].disable)
> continue;
>
> if (drv->states[i].target_residency <= duration_us)
> break;
> }
> + if (i < 0)
> + i = state_idx;
> return i;
> }

I'll do something slightly similar, but equivalent.

>
> @@ -264,7 +267,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> if (max_early_idx >= 0 &&
> count < cpu_data->states[i].early_hits)
> count = cpu_data->states[i].early_hits;
> -
> continue;
> }
>
> There is an additional issue where if idle state 0 is disabled (with the above suggested code patch),
> idle state usage seems to fall to deeper states than idle state 1.
> This is not the expected behaviour.

No, it isn't.

> Kernel 4.20-rc3 works as expected.
> I have not figured this issue out yet, in the code.
>
> Example (1 minute per sample. Number of entries/exits per state):
> State 0 State 1 State 2 State 3 State 4 Watts
> 28235143, 83, 26, 17, 837, 64.900
> 5583238, 657079, 5884941, 8498552, 30986831, 62.433 << Transition sample, after idle state 0 disabled
> 0, 793517, 7186099, 10559878, 38485721, 61.900 << ?? should have all gone into Idle state 1
> 0, 795414, 7340703, 10553117, 38513456, 62.050
> 0, 807028, 7288195, 10574113, 38523524, 62.167
> 0, 814983, 7403534, 10575108, 38571228, 62.167
> 0, 838302, 7747127, 10552289, 38556054, 62.183
> 9664999, 544473, 4914512, 6942037, 25295361, 63.633 << Transition sample, after idle state 0 enabled
> 27893504, 96, 40, 9, 912, 66.500
> 26556343, 83, 29, 7, 814, 66.683
> 27929227, 64, 20, 10, 931, 66.683

I see.

OK, I'll look into this too, thanks!

2018-12-01 14:15:34

by Giovanni Gherdovich

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

On Fri, 2018-11-23 at 11:35 +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <[email protected]>

> The venerable menu governor does some thigns that are quite
> questionable in my view.

> First, it includes timer wakeups in the pattern detection data and
> mixes them up with wakeups from other sources which in some cases
> causes it to expect what essentially would be a timer wakeup in a
> time frame in which no timer wakeups are possible (becuase it knows
> the time until the next timer event and that is later than the
> expected wakeup time).

> Second, it uses the extra exit latency limit based on the predicted
> idle duration and depending on the number of tasks waiting on I/O,
> even though those tasks may run on a different CPU when they are
> woken up.  Moreover, the time ranges used by it for the sleep length
> correction factors depend on whether or not there are tasks waiting
> on I/O, which again doesn't imply anything in particular, and they
> are not correlated to the list of available idle states in any way
> whatever.

> Also, the pattern detection code in menu may end up considering
> values that are too large to matter at all, in which cases running
> it is a waste of time.

> A major rework of the menu governor would be required to address
> these issues and the performance of at least some workloads (tuned
> specifically to the current behavior of the menu governor) is likely
> to suffer from that.  It is thus better to introduce an entirely new
> governor without them and let everybody use the governor that works
> better with their actual workloads.

> The new governor introduced here, the timer events oriented (TEO)
> governor, uses the same basic strategy as menu: it always tries to
> find the deepest idle state that can be used in the given conditions.
> However, it applies a different approach to that problem.

> First, it doesn't use "correction factors" for the time till the
> closest timer, but instead it tries to correlate the measured idle
> duration values with the available idle states and use that
> information to pick up the idle state that is most likely to "match"
> the upcoming CPU idle interval.

> Second, it doesn't take the number of "I/O waiters" into account at
> all and the pattern detection code in it avoids taking timer wakeups
> into account.  It also only uses idle duration values less than the
> current time till the closest timer (with the tick excluded) for that
> purpose.

> Signed-off-by: Rafael J. Wysocki <[email protected]>
> ---

> v5 -> v6:
>  * Avoid applying poll_time_limit to non-polling idle states by mistake.
>  * Use idle duration measured by the governor for everything (as it likely is
>    more accurate than the one measured by the core).
>  * Rename SPIKE to PULSE.
>  * Do not run pattern detection upfront.  Instead, use recent idle duration
>    values to refine the state selection after finding a candidate idle state.
>  * Do not use the expected idle duration as an extra latency constraint
>    (exit latency is less than the target residency for all of the idle states
>    known to me anyway, so this doesn't change anything in practice).

> v4 -> v5:
>  * Avoid using shallow idle states when the tick has been stopped already.

> v3 -> v4:
>  * Make the pattern detection avoid returning too early if the minimum
>    sample is too far from the average.
>  * Reformat the changelog (as requested by Peter).

> v2 -> v3:
>  * Simplify the pattern detection code and make it return a value
>  lower than the time to the closest timer if the majority of recent
>  idle intervals are below it regardless of their variance (that should
>  cause it to be slightly more aggressive).
>  * Do not count wakeups from state 0 due to the time limit in poll_idle()
>    as non-timer.

> [snip]

[NOTE: the tables in this message are quite wide. If this doesn't get to you
properly formatted you can read a copy of this message at the URL
https://beta.suse.com/private/ggherdovich/teo-eval/teo-v6-eval.html ]

All performance concerns manifested in v5 are wiped out by v6. Not only v6
improves over v5, but is even better than the baseline (menu) in most
cases. The optimizations in v6 paid off!

The overview of the analysis for v5, from the message
https://lore.kernel.org/lkml/[email protected] , was:

> The quick summary is:

> ---> sockperf on loopback over UDP, mode "throughput":
>      this had a 12% regression in v2 on 48x-HASWELL-NUMA, which is completely
>      recovered in v3 and v5. Good stuff.

> ---> dbench on xfs:
>      this was down 16% in v2 on 48x-HASWELL-NUMA. On v5 we're at a 10%
>      regression. Slight improvement. What's really hurting here is the single
>      client scenario.

> ---> netperf-udp on loopback:
>      had 6% regression on v2 on 8x-SKYLAKE-UMA, which is the same as what
>      happens in v5.

> ---> tbench on loopback:
>      was down 10% in v2 on 8x-SKYLAKE-UMA, now slightly worse in v5 with a 12%
>      regression. As in dbench, it's at low number of clients that the results
>      are worst. Note that this machine is different from the one that has the
>      dbench regression.

now the situation is overturned:

---> sockperf on loopback over UDP, mode "throughput":
     No new problems from 48x-HASWELL-NUMA, which stays put at the level of
     the baseline. OTOH 80x-BROADWELL-NUMA and 8x-SKYLAKE-UMA improve over the
     baseline of 8% and 10% respectively.

---> dbench on xfs:
     48x-HASWELL-NUMA rebounds from the previous 10% degradation and it's now
     at 0, i.e. the baseline level. The 1-client case, responsible for the
     previous overall degradation (I average results from different number of
     clients), went from -40% to -20% and is compensated in my table by
     improvements with 4, 8, 16 and 32 clients (table below).

---> netperf-udp on loopback:
     8x-SKYLAKE-UMA now shows a 9% improvement over  baseline.
     80x-BROADWELL-NUMA, previously similar to baseline, now improves 7%.

---> tbench on loopback:
     Impressive change of color for 8x-SKYLAKE-UMA, from 12% regression in v5
     to 7% improvement in v6. The problematic 1- and 2-clients cases went from
     -25% and -33% to +13% and +10% respectively.

Details below.

Runs are compared against v4.18 with the Menu governor. I know v4.18 is a
little old now but that's where I measured my baseline. My machine pool didn't
change:

* single socket E3-1240 v5 (Skylake 8 cores, which I'll call 8x-SKYLAKE-UMA)
* two sockets E5-2698 v4 (Broadwell 80 cores, 80x-BROADWELL-NUMA from here onwards)
* two sockets E5-2670 v3 (Haswell 48 cores, 48x-HASWELL-NUMA from here onwards)


BENCHMARKS WITH NEUTRAL RESULTS
===============================

This is the list of neutral benchmarks, identical to the one for v5. What's
interesting is that the benchmarks showing a degradations in v5 and before
seems now repaired in v6 (and improving baseline!), but the list of neutral
benchmarks didn't move. My take on this is that the list below is not affected
by cpuidle at all, be the gorvernor good or bad. OTOH the benchmarks I discuss
in the next sections are really the ones to use when evaluating cpuidle, as
they are very sensitive to it (frequent idling and waking up, hard-to-predict
interrupt patterns etc).

* pgbench read-only on xfs, pgbench read/write on xfs
    * global-dhp__db-pgbench-timed-ro-small-xfs
    * global-dhp__db-pgbench-timed-rw-small-xfs
* siege
    * global-dhp__http-siege
* hackbench, pipetest
    * global-dhp__scheduler-unbound
* Linux kernel compilation
    * global-dhp__workload_kerndevel-xfs
* NASA Parallel Benchmarks, C-Class (linear algebra; run both with OpenMP
  and OpenMPI, over xfs)
    * global-dhp__nas-c-class-mpi-full-xfs
    * global-dhp__nas-c-class-omp-full
* FIO (Flexible IO) in several configurations
    * global-dhp__io-fio-randread-async-randwrite-xfs
    * global-dhp__io-fio-randread-async-seqwrite-xfs
    * global-dhp__io-fio-seqread-doublemem-32k-4t-xfs
    * global-dhp__io-fio-seqread-doublemem-4k-4t-xfs
* netperf on loopback over TCP
    * global-dhp__network-netperf-unbound
* xfsrepair
    * global-dhp__io-xfsrepair-xfs
* sqlite (insert operations on xfs)
    * global-dhp__db-sqlite-insert-medium-xfs
* schbench
    * global-dhp__workload_schbench
* gitsource on xfs (git unit tests, shell intensive)
    * global-dhp__workload_shellscripts-xfs

Note: global-dhp* are configuration file names for MMTests[1]


PREVIOUSLY REGRESSING BENCHMARKS: OVERVIEW
==========================================

* sockperf on loopback over UDP, mode "throughput"
    * global-dhp__network-sockperf-unbound
    48x-HASWELL-NUMA fixed since v2, the others greatly improved in v6.

                        teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
  -------------------------------------------------------------------------------
  8x-SKYLAKE-UMA        1% worse    1% worse    1% worse    1% worse    10% better
  80x-BROADWELL-NUMA    3% better   2% better   5% better   3% worse    8% better
  48x-HASWELL-NUMA      4% better   12% worse   no change   no change   no change

* dbench on xfs
    * global-dhp__io-dbench4-async-xfs
    48x-HASWELL-NUMA is fixed wrt v5 and earlier versions.

                        teo-v1      teo-v2      teo-v3     teo-v5       teo-v6   
  -------------------------------------------------------------------------------
  8x-SKYLAKE-UMA        3% better   4% better   6% better  4% better    5% better
  80x-BROADWELL-NUMA    no change   no change   1% worse   3% worse     2% better
  48x-HASWELL-NUMA      6% worse    16% worse   8% worse   10% worse    no change 

* netperf on loopback over UDP
    * global-dhp__network-netperf-unbound
    8x-SKYLAKE-UMA fixed.

                        teo-v1      teo-v2      teo-v3     teo-v5       teo-v6   
  -------------------------------------------------------------------------------
  8x-SKYLAKE-UMA        no change   6% worse    4% worse   6% worse     9% better
  80x-BROADWELL-NUMA    1% worse    4% worse    no change  no change    7% better
  48x-HASWELL-NUMA      3% better   5% worse    7% worse   5% worse     no change

* tbench on loopback
    * global-dhp__network-tbench
    Measurable improvements across all machines, especially 8x-SKYLAKE-UMA.

                        teo-v1      teo-v2      teo-v3     teo-v5       teo-v6
  -------------------------------------------------------------------------------
  8x-SKYLAKE-UMA        1% worse    10% worse   11% worse  12% worse    7% better
  80x-BROADWELL-NUMA    1% worse    1% worse    no cahnge  1% worse     4% better
  48x-HASWELL-NUMA      1% worse    2% worse    1% worse   1% worse     5% better


PREVIOUSLY REGRESSING BENCHMARKS: DETAIL
========================================

SOCKPERF-UDP-THROUGHPUT
=======================
NOTES: Test run in mode "throughput" over UDP. The varying parameter is the
    message size.
MEASURES: Throughput, in MBits/second
HIGHER is better

machine: 8x-SKYLAKE-UMA

                              4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0
                             vanilla                    teo        teo-v2+backport        teo-v3+backport        teo-v5+backport        teo-v6+backport
-------------------------------------------------------------------------------------------------------------------------------------------------------
Hmean     14        70.34 (   0.00%)       69.80 *  -0.76%*       69.11 *  -1.75%*       69.49 *  -1.20%*       69.71 *  -0.90%*       77.51 *  10.20%*
Hmean     100      499.24 (   0.00%)      494.26 *  -1.00%*      492.74 *  -1.30%*      494.90 *  -0.87%*      497.43 *  -0.36%*      549.93 *  10.15%*
Hmean     300     1489.13 (   0.00%)     1472.39 *  -1.12%*     1468.45 *  -1.39%*     1477.74 *  -0.76%*     1478.61 *  -0.71%*     1632.63 *   9.64%*
Hmean     500     2469.62 (   0.00%)     2444.41 *  -1.02%*     2434.61 *  -1.42%*     2454.15 *  -0.63%*     2454.76 *  -0.60%*     2698.70 *   9.28%*
Hmean     850     4165.12 (   0.00%)     4123.82 *  -0.99%*     4100.37 *  -1.55%*     4111.82 *  -1.28%*     4120.04 *  -1.08%*     4521.11 *   8.55%*

In the report I sent for v5 on this benchmark, I posted the table for
48x-HASWELL-NUMA; that one is now uninteresting (v5 fixed it and v6 didn't
change that), so the table above shows the detail for the improvement on
8x-SKYLAKE-UMA.

DBENCH4
=======
NOTES: asyncronous IO; varies the number of clients up to NUMCPUS*8.
MEASURES: latency (millisecs)
LOWER is better

machine: 48x-HASWELL-NUMA
                              4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0
                             vanilla                    teo        teo-v2+backport        teo-v3+backport        teo-v5+backport        teo-v6+backport
-------------------------------------------------------------------------------------------------------------------------------------------------------
Amean      1        37.15 (   0.00%)       50.10 ( -34.86%)       39.02 (  -5.03%)       52.24 ( -40.63%)       51.62 ( -38.96%)       45.24 ( -21.78%)
Amean      2        43.75 (   0.00%)       45.50 (  -4.01%)       44.36 (  -1.39%)       47.25 (  -8.00%)       44.20 (  -1.03%)       44.30 (  -1.26%)
Amean      4        54.42 (   0.00%)       58.85 (  -8.15%)       58.17 (  -6.89%)       55.12 (  -1.29%)       58.07 (  -6.70%)       52.91 (   2.77%)
Amean      8        75.72 (   0.00%)       74.25 (   1.94%)       82.76 (  -9.30%)       78.63 (  -3.84%)       85.33 ( -12.68%)       70.26 (   7.22%)
Amean      16      116.56 (   0.00%)      119.88 (  -2.85%)      164.14 ( -40.82%)      124.87 (  -7.13%)      124.54 (  -6.85%)      110.95 (   4.81%)
Amean      32      570.02 (   0.00%)      561.92 (   1.42%)      681.94 ( -19.63%)      568.93 (   0.19%)      571.23 (  -0.21%)      543.10 (   4.72%)
Amean      64     3185.20 (   0.00%)     3291.80 (  -3.35%)     4337.43 ( -36.17%)     3181.13 (   0.13%)     3382.48 (  -6.19%)     3186.58 (  -0.04%)

The -21% on 1-client may not look exciting but it's leaps and bounds better
than what was on v5, plus most other num-clients improve measurably.

NETPERF-UDP
===========
NOTES: Test run in mode "stream" over UDP. The varying parameter is the
    message size in bytes. Each measurement is taken 5 times and the
    harmonic mean is reported.
MEASURES: Throughput in MBits/second, both on the sender and on the receiver end.
HIGHER is better

machine: 8x-SKYLAKE-UMA
                                     4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0
                                    vanilla                    teo        teo-v2+backport        teo-v3+backport        teo-v5+backport        teo-v6+backport
--------------------------------------------------------------------------------------------------------------------------------------------------------------
Hmean     send-64         362.27 (   0.00%)      362.87 (   0.16%)      318.85 * -11.99%*      347.08 *  -4.19%*      333.48 *  -7.95%*      402.61 *  11.13%*
Hmean     send-128        723.17 (   0.00%)      723.66 (   0.07%)      660.96 *  -8.60%*      676.46 *  -6.46%*      650.71 * -10.02%*      796.78 *  10.18%*
Hmean     send-256       1435.24 (   0.00%)     1427.08 (  -0.57%)     1346.22 *  -6.20%*     1359.59 *  -5.27%*     1323.83 *  -7.76%*     1590.55 *  10.82%*
Hmean     send-1024      5563.78 (   0.00%)     5529.90 *  -0.61%*     5228.28 *  -6.03%*     5382.04 *  -3.27%*     5271.99 *  -5.24%*     6117.42 *   9.95%*
Hmean     send-2048     10935.42 (   0.00%)    10809.66 *  -1.15%*    10521.14 *  -3.79%*    10610.29 *  -2.97%*    10544.58 *  -3.57%*    11512.14 *   5.27%*
Hmean     send-3312     16898.66 (   0.00%)    16539.89 *  -2.12%*    16240.87 *  -3.89%*    16271.23 *  -3.71%*    15968.89 *  -5.50%*    17600.72 *   4.15%*
Hmean     send-4096     19354.33 (   0.00%)    19185.43 (  -0.87%)    18600.52 *  -3.89%*    18692.16 *  -3.42%*    18408.69 *  -4.89%*    20494.07 *   5.89%*
Hmean     send-8192     32238.80 (   0.00%)    32275.57 (   0.11%)    29850.62 *  -7.41%*    30066.83 *  -6.74%*    29824.62 *  -7.49%*    35225.60 *   9.26%*
Hmean     send-16384    48146.75 (   0.00%)    49297.23 *   2.39%*    48295.51 (   0.31%)    48800.37 *   1.36%*    48247.73 (   0.21%)    53000.20 *  10.08%*
Hmean     recv-64         362.16 (   0.00%)      362.87 (   0.19%)      318.82 * -11.97%*      347.07 *  -4.17%*      333.48 *  -7.92%*      402.60 *  11.17%*
Hmean     recv-128        723.01 (   0.00%)      723.66 (   0.09%)      660.89 *  -8.59%*      676.39 *  -6.45%*      650.63 * -10.01%*      796.70 *  10.19%*
Hmean     recv-256       1435.06 (   0.00%)     1426.94 (  -0.57%)     1346.07 *  -6.20%*     1359.45 *  -5.27%*     1323.81 *  -7.75%*     1590.55 *  10.84%*
Hmean     recv-1024      5562.68 (   0.00%)     5529.90 *  -0.59%*     5228.28 *  -6.01%*     5381.37 *  -3.26%*     5271.45 *  -5.24%*     6117.42 *   9.97%*
Hmean     recv-2048     10934.36 (   0.00%)    10809.66 *  -1.14%*    10519.89 *  -3.79%*    10610.28 *  -2.96%*    10544.58 *  -3.56%*    11512.14 *   5.28%*
Hmean     recv-3312     16898.65 (   0.00%)    16538.21 *  -2.13%*    16240.86 *  -3.89%*    16269.34 *  -3.72%*    15967.13 *  -5.51%*    17598.31 *   4.14%*
Hmean     recv-4096     19351.99 (   0.00%)    19183.17 (  -0.87%)    18598.33 *  -3.89%*    18690.13 *  -3.42%*    18407.45 *  -4.88%*    20489.99 *   5.88%*
Hmean     recv-8192     32238.74 (   0.00%)    32275.13 (   0.11%)    29850.39 *  -7.41%*    30062.78 *  -6.75%*    29824.30 *  -7.49%*    35221.61 *   9.25%*
Hmean     recv-16384    48146.59 (   0.00%)    49296.23 *   2.39%*    48295.03 (   0.31%)    48786.88 *   1.33%*    48246.71 (   0.21%)    52993.72 *  10.07%*

Recovered!

TBENCH4
=======
NOTES: networking counterpart of dbench. Varies the number of clients up to NUMCPUS*4
MEASURES: Throughput, MB/sec
HIGHER is better

machine: 8x-SKYLAKE-UMA
                                    4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0
                                   vanilla                    teo        teo-v2+backport        teo-v3+backport        teo-v5+backport        teo-v6+backport
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Hmean     mb/sec-1       620.52 (   0.00%)      613.98 *  -1.05%*      502.47 * -19.03%*      492.77 * -20.59%*      464.52 * -25.14%*      705.89 *  13.76%*
Hmean     mb/sec-2      1179.05 (   0.00%)     1112.84 *  -5.62%*      820.57 * -30.40%*      831.23 * -29.50%*      780.97 * -33.76%*     1303.87 *  10.59%*
Hmean     mb/sec-4      2072.29 (   0.00%)     2040.55 *  -1.53%*     2036.11 *  -1.75%*     2016.97 *  -2.67%*     2019.79 *  -2.53%*     2164.66 *   4.46%*
Hmean     mb/sec-8      4238.96 (   0.00%)     4205.01 *  -0.80%*     4124.59 *  -2.70%*     4098.06 *  -3.32%*     4171.64 *  -1.59%*     4354.18 *   2.72%*
Hmean     mb/sec-16     3515.96 (   0.00%)     3536.23 *   0.58%*     3500.02 *  -0.45%*     3438.60 *  -2.20%*     3456.89 *  -1.68%*     3688.76 *   4.91%*
Hmean     mb/sec-32     3452.92 (   0.00%)     3448.94 *  -0.12%*     3428.08 *  -0.72%*     3369.30 *  -2.42%*     3430.09 *  -0.66%*     3574.24 *   3.51%*

This one, too, not only is fixed but adds a solid improvement over the
baseline.


[1] https://github.com/gormanm/mmtests

Giovanni

2018-12-03 16:25:12

by Doug Smythies

[permalink] [raw]
Subject: RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

Hi Giovanni,

Perhaps I should go off-list for this, not sure.

I had the thought that I should be able to get similar
results as your "8x-SKYLAKE-UMA" on my test computer,
i7-2600K. Or that at least it was worth trying, just
to see. I couldn't find the same or similar test
on Phoronix, and my attempts to do similar, for example,
with iperf, didn't show differences between the baseline
kernel and one with the teov6 patch.

So I tried the test set you referenced [1]:

On 2018.12.01 06:18 Giovanni Gherdovich wrote:
...
> * netperf on loopback over TCP
> * global-dhp__network-netperf-unbound

I assume this means that I am supposed to do:

cp config-global-dhp__network-netperf-unbound config

from the configs directory. Anyway that config file
looks correct. Then:

./run-mmtests.sh --no-monitor 3.0-nomonitor

...

> * sockperf on loopback over UDP, mode "throughput"
> * global-dhp__network-sockperf-unbound

Similarly (from the appropriate directories):

cp config-global-dhp__network-sockperf-unbound config
./run-mmtests.sh --no-monitor 3.0-nomonitor

My issue is that I do not understand the output or how it
might correlate with your tables.

I get, for example:

3 1 1 0.13s 0.68s 0.80s 1003894.302 1003779.613
3 1 1 0.16s 0.64s 0.80s 1008900.053 1008215.336
3 1 1 0.14s 0.66s 0.80s 1009630.439 1008990.265
...

But I don't know what that means, nor have I been able to find
a description anywhere.

In the README file, I did see that for reporting I am
somehow supposed to use compare-kernels.sh, but
I couldn't figure that out.

By the way, I am running these tests as a regular user, but
they seem to want to modify:

/sys/kernel/mm/transparent_hugepage/enabled

which requires root privilege. I don't really want to mess
with that stuff for these tests.

> [1] https://github.com/gormanm/mmtests

Can you help me to produce meaningful results to compare with
your results?

... Doug



2018-12-03 23:39:32

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

On Saturday, December 1, 2018 3:18:24 PM CET Giovanni Gherdovich wrote:
> On Fri, 2018-11-23 at 11:35 +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki <[email protected]>
> >

[cut]

> >
> > [snip]
>
> [NOTE: the tables in this message are quite wide. If this doesn't get to you
> properly formatted you can read a copy of this message at the URL
> https://beta.suse.com/private/ggherdovich/teo-eval/teo-v6-eval.html ]
>
> All performance concerns manifested in v5 are wiped out by v6. Not only v6
> improves over v5, but is even better than the baseline (menu) in most
> cases. The optimizations in v6 paid off!

This is very encouraging, thank you!

> The overview of the analysis for v5, from the message
> https://lore.kernel.org/lkml/[email protected] , was:
>
> > The quick summary is:
> >
> > ---> sockperf on loopback over UDP, mode "throughput":
> > this had a 12% regression in v2 on 48x-HASWELL-NUMA, which is completely
> > recovered in v3 and v5. Good stuff.
> >
> > ---> dbench on xfs:
> > this was down 16% in v2 on 48x-HASWELL-NUMA. On v5 we're at a 10%
> > regression. Slight improvement. What's really hurting here is the single
> > client scenario.
> >
> > ---> netperf-udp on loopback:
> > had 6% regression on v2 on 8x-SKYLAKE-UMA, which is the same as what
> > happens in v5.
> >
> > ---> tbench on loopback:
> > was down 10% in v2 on 8x-SKYLAKE-UMA, now slightly worse in v5 with a 12%
> > regression. As in dbench, it's at low number of clients that the results
> > are worst. Note that this machine is different from the one that has the
> > dbench regression.
>
> now the situation is overturned:
>
> ---> sockperf on loopback over UDP, mode "throughput":
> No new problems from 48x-HASWELL-NUMA, which stays put at the level of
> the baseline. OTOH 80x-BROADWELL-NUMA and 8x-SKYLAKE-UMA improve over the
> baseline of 8% and 10% respectively.

Good.

> ---> dbench on xfs:
> 48x-HASWELL-NUMA rebounds from the previous 10% degradation and it's now
> at 0, i.e. the baseline level. The 1-client case, responsible for the
> previous overall degradation (I average results from different number of
> clients), went from -40% to -20% and is compensated in my table by
> improvements with 4, 8, 16 and 32 clients (table below).
>
> ---> netperf-udp on loopback:
> 8x-SKYLAKE-UMA now shows a 9% improvement over baseline.
> 80x-BROADWELL-NUMA, previously similar to baseline, now improves 7%.

Good.

> ---> tbench on loopback:
> Impressive change of color for 8x-SKYLAKE-UMA, from 12% regression in v5
> to 7% improvement in v6. The problematic 1- and 2-clients cases went from
> -25% and -33% to +13% and +10% respectively.

Awesome. :-)

> Details below.
>
> Runs are compared against v4.18 with the Menu governor. I know v4.18 is a
> little old now but that's where I measured my baseline. My machine pool didn't
> change:
>
> * single socket E3-1240 v5 (Skylake 8 cores, which I'll call 8x-SKYLAKE-UMA)
> * two sockets E5-2698 v4 (Broadwell 80 cores, 80x-BROADWELL-NUMA from here onwards)
> * two sockets E5-2670 v3 (Haswell 48 cores, 48x-HASWELL-NUMA from here onwards)
>

[cut]

>
>
> PREVIOUSLY REGRESSING BENCHMARKS: OVERVIEW
> ==========================================
>
> * sockperf on loopback over UDP, mode "throughput"
> * global-dhp__network-sockperf-unbound
> 48x-HASWELL-NUMA fixed since v2, the others greatly improved in v6.
>
> teo-v1 teo-v2 teo-v3 teo-v5 teo-v6
> -------------------------------------------------------------------------------
> 8x-SKYLAKE-UMA 1% worse 1% worse 1% worse 1% worse 10% better
> 80x-BROADWELL-NUMA 3% better 2% better 5% better 3% worse 8% better
> 48x-HASWELL-NUMA 4% better 12% worse no change no change no change
>
> * dbench on xfs
> * global-dhp__io-dbench4-async-xfs
> 48x-HASWELL-NUMA is fixed wrt v5 and earlier versions.
>
> teo-v1 teo-v2 teo-v3 teo-v5 teo-v6
> -------------------------------------------------------------------------------
> 8x-SKYLAKE-UMA 3% better 4% better 6% better 4% better 5% better
> 80x-BROADWELL-NUMA no change no change 1% worse 3% worse 2% better
> 48x-HASWELL-NUMA 6% worse 16% worse 8% worse 10% worse no change
>
> * netperf on loopback over UDP
> * global-dhp__network-netperf-unbound
> 8x-SKYLAKE-UMA fixed.
>
> teo-v1 teo-v2 teo-v3 teo-v5 teo-v6
> -------------------------------------------------------------------------------
> 8x-SKYLAKE-UMA no change 6% worse 4% worse 6% worse 9% better
> 80x-BROADWELL-NUMA 1% worse 4% worse no change no change 7% better
> 48x-HASWELL-NUMA 3% better 5% worse 7% worse 5% worse no change
>
> * tbench on loopback
> * global-dhp__network-tbench
> Measurable improvements across all machines, especially 8x-SKYLAKE-UMA.
>
> teo-v1 teo-v2 teo-v3 teo-v5 teo-v6
> -------------------------------------------------------------------------------
> 8x-SKYLAKE-UMA 1% worse 10% worse 11% worse 12% worse 7% better
> 80x-BROADWELL-NUMA 1% worse 1% worse no cahnge 1% worse 4% better
> 48x-HASWELL-NUMA 1% worse 2% worse 1% worse 1% worse 5% better

So I'm really happy with this, but I'm afraid that the v6 may be a little too
agressive. Also my testing (with the "low" and "high" counters introduced by
https://patchwork.kernel.org/patch/10709463/) shows that it generally is
a bit worse than menu with respect to matching the observed idle duration
as it tends to prefer shallower states. This appears to be in agreement with
the Doug's results too.

For this reason, I'm going to send a v7 with a few changes relative to v6 to
make it somewhat more energy-efficient. If it turns out to be much worse than
the v6 performance-wise, though, the v6 may be a winner. :-)

Thanks,
Rafael




2018-12-03 23:49:48

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

On Friday, November 30, 2018 9:51:19 AM CET Rafael J. Wysocki wrote:
> Hi Doug,
>
> On Fri, Nov 30, 2018 at 8:49 AM Doug Smythies <[email protected]> wrote:
> >
> > Hi Rafael,
> >
> > On 2018.11.23 02:36 Rafael J. Wysocki wrote:
> >
> > ... [snip]...
> >
> > > +/**
> > > + * teo_find_shallower_state - Find shallower idle state matching given duration.
> > > + * @drv: cpuidle driver containing state data.
> > > + * @dev: Target CPU.
> > > + * @state_idx: Index of the capping idle state.
> > > + * @duration_us: Idle duration value to match.
> > > + */
> > > +static int teo_find_shallower_state(struct cpuidle_driver *drv,
> > > + struct cpuidle_device *dev, int state_idx,
> > > + unsigned int duration_us)
> > > +{
> > > + int i;
> > > +
> > > + for (i = state_idx - 1; i > 0; i--) {
> > > + if (drv->states[i].disabled || dev->states_usage[i].disable)
> > > + continue;
> > > +
> > > + if (drv->states[i].target_residency <= duration_us)
> > > + break;
> > > + }
> > > + return i;
> > > +}
> >
> > I think this subroutine has a problem when idle state 0
> > is disabled.
>
> You are right, thanks!
>
> > Perhaps something like this might help:
> >
> > diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> > index bc1c9a2..5b97639 100644
> > --- a/drivers/cpuidle/governors/teo.c
> > +++ b/drivers/cpuidle/governors/teo.c
> > @@ -196,7 +196,8 @@ static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
> > }
> >
> > /**
> > - * teo_find_shallower_state - Find shallower idle state matching given duration.
> > + * teo_find_shallower_state - Find shallower idle state matching given
> > + * duration, if possible.
> > * @drv: cpuidle driver containing state data.
> > * @dev: Target CPU.
> > * @state_idx: Index of the capping idle state.
> > @@ -208,13 +209,15 @@ static int teo_find_shallower_state(struct cpuidle_driver *drv,
> > {
> > int i;
> >
> > - for (i = state_idx - 1; i > 0; i--) {
> > + for (i = state_idx - 1; i >= 0; i--) {
> > if (drv->states[i].disabled || dev->states_usage[i].disable)
> > continue;
> >
> > if (drv->states[i].target_residency <= duration_us)
> > break;
> > }
> > + if (i < 0)
> > + i = state_idx;
> > return i;
> > }
>
> I'll do something slightly similar, but equivalent.

I actually ended up fixing it differently, as the above will cause state_idx
to be returned even if some states shallower than state_idx are enabled, but
their target residencies are higher than duration_us. In that case, though,
it still is more correct to return the shallowest enabled state rather than
state_idx.

> >
> > @@ -264,7 +267,6 @@ static int teo_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
> > if (max_early_idx >= 0 &&
> > count < cpu_data->states[i].early_hits)
> > count = cpu_data->states[i].early_hits;
> > -
> > continue;
> > }
> >
> > There is an additional issue where if idle state 0 is disabled (with the above suggested code patch),
> > idle state usage seems to fall to deeper states than idle state 1.
> > This is not the expected behaviour.
>
> No, it isn't.
>
> > Kernel 4.20-rc3 works as expected.
> > I have not figured this issue out yet, in the code.
> >
> > Example (1 minute per sample. Number of entries/exits per state):
> > State 0 State 1 State 2 State 3 State 4 Watts
> > 28235143, 83, 26, 17, 837, 64.900
> > 5583238, 657079, 5884941, 8498552, 30986831, 62.433 << Transition sample, after idle state 0 disabled
> > 0, 793517, 7186099, 10559878, 38485721, 61.900 << ?? should have all gone into Idle state 1
> > 0, 795414, 7340703, 10553117, 38513456, 62.050
> > 0, 807028, 7288195, 10574113, 38523524, 62.167
> > 0, 814983, 7403534, 10575108, 38571228, 62.167
> > 0, 838302, 7747127, 10552289, 38556054, 62.183
> > 9664999, 544473, 4914512, 6942037, 25295361, 63.633 << Transition sample, after idle state 0 enabled
> > 27893504, 96, 40, 9, 912, 66.500
> > 26556343, 83, 29, 7, 814, 66.683
> > 27929227, 64, 20, 10, 931, 66.683
>
> I see.
>
> OK, I'll look into this too, thanks!

This probably is the artifact of the fix for the teo_find_shallower_state()
issue.

Anyway, I'm not able to reproduce this with the teo_find_shallower_state() issue
fixed differently.

Thanks,
Rafael


2018-12-03 23:53:39

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

On Thursday, November 29, 2018 12:20:07 AM CET Doug Smythies wrote:
> On 2018.11.23 02:36 Rafael J. Wysocki wrote:
>
> v5 -> v6:
> * Avoid applying poll_time_limit to non-polling idle states by mistake.
> * Use idle duration measured by the governor for everything (as it likely is
> more accurate than the one measured by the core).
>
> -- above missing-- (see follow up e-mail from Rafael)
>
> * Rename SPIKE to PULSE.
> * Do not run pattern detection upfront. Instead, use recent idle duration
> values to refine the state selection after finding a candidate idle state.
> * Do not use the expected idle duration as an extra latency constraint
> (exit latency is less than the target residency for all of the idle states
> known to me anyway, so this doesn't change anything in practice).
>
> Hi Rafael,
>
> I did some minimal testing on teov6, using kernel 4.20-rc3 as my baseline
> reference kernel.
>
> Test 1: Phoronix bdench test, all options: 1, 6, 12, 48, 128, 256 clients.
>
> Note: because it uses the disk, the dbench test is somewhat non-repeatable.
> However, if particular attention is paid to not doing anything else with
> the disk between tests, then it seems to be repeatable to within about 6%.
>
> Anyway no significant difference observed between kernel 4.20-rc3 and the
> same with the teov6 patch.
>
> Test 2: Pipe test, non cross core. (And idle state 0 test, really)
> I ran 4 pipe tests, 1 for each of my 4 cores, @2 CPUs per core.
> Thus, pretty much only idle state 0 was ever used.
> Processor package power was similar for both kernels.
> teov6 entered/exited idle state 0 about 60,984 times/second/cpu.
> -rc3 entered/exited idle state 0 about 62,806 times/second/cpu.
> There was a difference in percentage time spent in idle state 0,
> with kernel 4.20-rc3 spending 0.2441% in idle state 0 verses
> teov6 at 0.0641%.
>
> For throughput, teov6 was 1.4% faster.

This may indicate that teov6 is somewhat too aggressive.

> Test 3: was an attempt to sweep through a preference for
> all idle states.
>
> 40 threads were launched with nothing to do except sleep
> for a variable duration of 1 to 500 uSec, each step was
> run for 1 minute. With 1 minute idle before the test and a few
> minutes idle after, the total test duration was about 505 minutes.
> Recall that when one asks for a short sleep of 1 uSec, they actually
> get about 50 uSec, due to overheads. So I use 40 threads in an attempt
> to get the average time between wakeup events per CPU down somewhat.
>
> The results are here:
> http://fast.smythies.com/linux-pm/k420/k420-pn-sweep-teo6-2.htm

And, so long as my understanding of the graphs is correct, the results
here indicate that teov6 tends to prefer relatively shallow idle states
which is good for performance (at least with some workloads), but not
necessarily for energy-efficiency.

I will send a v7 of TEO with some changes to make it a bit more
energy-efficient with respect to the v6.

Thanks,
Rafael


2018-12-05 23:07:19

by Doug Smythies

[permalink] [raw]
Subject: RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

On 2018.12.03 03:48 Rafael J. Wysocki wrote:

>>> There is an additional issue where if idle state 0 is disabled (with the above suggested code patch),
>>> idle state usage seems to fall to deeper states than idle state 1.
>>> This is not the expected behaviour.
>>
>> No, it isn't.
>>
>>> Kernel 4.20-rc3 works as expected.
>>> I have not figured this issue out yet, in the code.
>>>
>>> Example (1 minute per sample. Number of entries/exits per state):
>>> State 0 State 1 State 2 State 3 State 4 Watts
>>> 28235143, 83, 26, 17, 837, 64.900
>>> 5583238, 657079, 5884941, 8498552, 30986831, 62.433 << Transition sample, after idle state 0 disabled
>>> 0, 793517, 7186099, 10559878, 38485721, 61.900 << ?? should have all gone into Idle state 1
>>> 0, 795414, 7340703, 10553117, 38513456, 62.050
>>> 0, 807028, 7288195, 10574113, 38523524, 62.167
>>> 0, 814983, 7403534, 10575108, 38571228, 62.167
>>> 0, 838302, 7747127, 10552289, 38556054, 62.183
>>> 9664999, 544473, 4914512, 6942037, 25295361, 63.633 << Transition sample, after idle state 0 enabled
>>> 27893504, 96, 40, 9, 912, 66.500
>>> 26556343, 83, 29, 7, 814, 66.683
>>> 27929227, 64, 20, 10, 931, 66.683
>>
>> I see.
>>
>> OK, I'll look into this too, thanks!
>
> This probably is the artifact of the fix for the teo_find_shallower_state()
> issue.
>
> Anyway, I'm not able to reproduce this with the teo_find_shallower_state() issue
> fixed differently.

I am not able to reproduce with your teo_find_shallower_state(), or teo V 7,
either. Everything is graceful now, as states are disabled:
(10 seconds per sample. Number of entries/exits per state):

State 0 State 1 State 2 State 3 State 4 Watts
0, 6, 4, 1, 414, 3.700
2, 4, 30, 3, 578, 3.700 << No load
168619, 37, 39, 4, 480, 5.600 << Transition sample
4643618, 45, 8, 1, 137, 61.200 << All idle states enabled
4736227, 40, 3, 5, 111, 61.800
1888417, 4369314, 25, 2, 89, 62.000 << Transition sample
0, 7266864, 9, 0, 0, 62.200 << state 0 disabled
0, 7193372, 9, 0, 0, 62.700
0, 5539898, 1744007, 0, 0, 63.500 << Transition sample
0, 0, 8152956, 0, 0, 63.700 << states 0,1 disabled
0, 0, 8015151, 0, 0, 63.900
0, 0, 4146806, 6349619, 0, 63.000 << Transition sample
0, 0, 0, 13252144, 0, 61.600 << states 0,1,2 disabled
0, 0, 0, 13258313, 0, 61.800
0, 0, 0, 10417428, 1984451, 61.200 << Transition sample
0, 0, 0, 0, 9247172, 58.500 << states 0,1,2,3 disabled
0, 0, 0, 0, 9242657, 58.500
0, 0, 0, 0, 9233749, 58.600
0, 0, 0, 0, 9238444, 58.700
0, 0, 0, 0, 9236345, 58.600

For reference, this is kernel 4.20-rc5 (with your other proposed patches):

State 0 State 1 State 2 State 3 State 4 Watts
0, 4, 8, 6, 426, 3.700
1592870, 279, 149, 96, 831, 21.800
5071279, 154, 25, 6, 105, 61.200
5095090, 78, 21, 1, 86, 61.800
5001493, 94, 30, 4, 101, 62.200
616019, 5446924, 5, 3, 38, 62.500
0, 6249752, 0, 0, 0, 63.300
0, 6293671, 0, 0, 0, 63.800
0, 3751035, 2529964, 0, 0, 64.100
0, 0, 6101167, 0, 0, 64.500
0, 0, 6172526, 0, 0, 64.700
0, 0, 6163797, 0, 0, 64.900
0, 0, 1724841, 9567528, 0, 63.300
0, 0, 0, 13349668, 0, 62.700
0, 0, 0, 13360471, 0, 62.700
0, 0, 0, 13355424, 0, 62.700
0, 0, 0, 8854491, 3132640, 61.600
0, 0, 0, 0, 9302824, 59.000
0, 0, 0, 0, 9303561, 58.900
0, 0, 0, 0, 9313397, 59.000
0, 0, 0, 0, 9333944, 59.000

Test kernel:
94a976a cpuidle: New timer events oriented governor for tickless systems <<< V7
935be4e cpuidle: poll_state: Disregard disable idle states
e3670df cpuidle: Add 'high' and 'low' idle state metrics
dfa672c Documentation: admin-guide: PM: Add cpuidle document
2595646 Linux 4.20-rc5

Reference kernel:
f418681 cpuidle: poll_state: Disregard disable idle states
1be0e87 cpuidle: Add 'high' and 'low' idle state metrics
279ec1d Documentation: admin-guide: PM: Add cpuidle document
2595646 Linux 4.20-rc5

... Doug



2018-12-06 09:13:24

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

On Thu, Dec 6, 2018 at 12:06 AM Doug Smythies <[email protected]> wrote:
>
> On 2018.12.03 03:48 Rafael J. Wysocki wrote:
>
> >>> There is an additional issue where if idle state 0 is disabled (with the above suggested code patch),
> >>> idle state usage seems to fall to deeper states than idle state 1.
> >>> This is not the expected behaviour.
> >>
> >> No, it isn't.
> >>
> >>> Kernel 4.20-rc3 works as expected.
> >>> I have not figured this issue out yet, in the code.
> >>>
> >>> Example (1 minute per sample. Number of entries/exits per state):
> >>> State 0 State 1 State 2 State 3 State 4 Watts
> >>> 28235143, 83, 26, 17, 837, 64.900
> >>> 5583238, 657079, 5884941, 8498552, 30986831, 62.433 << Transition sample, after idle state 0 disabled
> >>> 0, 793517, 7186099, 10559878, 38485721, 61.900 << ?? should have all gone into Idle state 1
> >>> 0, 795414, 7340703, 10553117, 38513456, 62.050
> >>> 0, 807028, 7288195, 10574113, 38523524, 62.167
> >>> 0, 814983, 7403534, 10575108, 38571228, 62.167
> >>> 0, 838302, 7747127, 10552289, 38556054, 62.183
> >>> 9664999, 544473, 4914512, 6942037, 25295361, 63.633 << Transition sample, after idle state 0 enabled
> >>> 27893504, 96, 40, 9, 912, 66.500
> >>> 26556343, 83, 29, 7, 814, 66.683
> >>> 27929227, 64, 20, 10, 931, 66.683
> >>
> >> I see.
> >>
> >> OK, I'll look into this too, thanks!
> >
> > This probably is the artifact of the fix for the teo_find_shallower_state()
> > issue.
> >
> > Anyway, I'm not able to reproduce this with the teo_find_shallower_state() issue
> > fixed differently.
>
> I am not able to reproduce with your teo_find_shallower_state(), or teo V 7,
> either. Everything is graceful now, as states are disabled:
> (10 seconds per sample. Number of entries/exits per state):
>
> State 0 State 1 State 2 State 3 State 4 Watts
> 0, 6, 4, 1, 414, 3.700
> 2, 4, 30, 3, 578, 3.700 << No load
> 168619, 37, 39, 4, 480, 5.600 << Transition sample
> 4643618, 45, 8, 1, 137, 61.200 << All idle states enabled
> 4736227, 40, 3, 5, 111, 61.800
> 1888417, 4369314, 25, 2, 89, 62.000 << Transition sample
> 0, 7266864, 9, 0, 0, 62.200 << state 0 disabled
> 0, 7193372, 9, 0, 0, 62.700
> 0, 5539898, 1744007, 0, 0, 63.500 << Transition sample
> 0, 0, 8152956, 0, 0, 63.700 << states 0,1 disabled
> 0, 0, 8015151, 0, 0, 63.900
> 0, 0, 4146806, 6349619, 0, 63.000 << Transition sample
> 0, 0, 0, 13252144, 0, 61.600 << states 0,1,2 disabled
> 0, 0, 0, 13258313, 0, 61.800
> 0, 0, 0, 10417428, 1984451, 61.200 << Transition sample
> 0, 0, 0, 0, 9247172, 58.500 << states 0,1,2,3 disabled
> 0, 0, 0, 0, 9242657, 58.500
> 0, 0, 0, 0, 9233749, 58.600
> 0, 0, 0, 0, 9238444, 58.700
> 0, 0, 0, 0, 9236345, 58.600
>
> For reference, this is kernel 4.20-rc5 (with your other proposed patches):
>
> State 0 State 1 State 2 State 3 State 4 Watts
> 0, 4, 8, 6, 426, 3.700
> 1592870, 279, 149, 96, 831, 21.800
> 5071279, 154, 25, 6, 105, 61.200
> 5095090, 78, 21, 1, 86, 61.800
> 5001493, 94, 30, 4, 101, 62.200
> 616019, 5446924, 5, 3, 38, 62.500
> 0, 6249752, 0, 0, 0, 63.300
> 0, 6293671, 0, 0, 0, 63.800
> 0, 3751035, 2529964, 0, 0, 64.100
> 0, 0, 6101167, 0, 0, 64.500
> 0, 0, 6172526, 0, 0, 64.700
> 0, 0, 6163797, 0, 0, 64.900
> 0, 0, 1724841, 9567528, 0, 63.300
> 0, 0, 0, 13349668, 0, 62.700
> 0, 0, 0, 13360471, 0, 62.700
> 0, 0, 0, 13355424, 0, 62.700
> 0, 0, 0, 8854491, 3132640, 61.600
> 0, 0, 0, 0, 9302824, 59.000
> 0, 0, 0, 0, 9303561, 58.900
> 0, 0, 0, 0, 9313397, 59.000
> 0, 0, 0, 0, 9333944, 59.000
>
> Test kernel:
> 94a976a cpuidle: New timer events oriented governor for tickless systems <<< V7
> 935be4e cpuidle: poll_state: Disregard disable idle states
> e3670df cpuidle: Add 'high' and 'low' idle state metrics
> dfa672c Documentation: admin-guide: PM: Add cpuidle document
> 2595646 Linux 4.20-rc5
>
> Reference kernel:
> f418681 cpuidle: poll_state: Disregard disable idle states
> 1be0e87 cpuidle: Add 'high' and 'low' idle state metrics
> 279ec1d Documentation: admin-guide: PM: Add cpuidle document
> 2595646 Linux 4.20-rc5

Thanks for the data!

I seem to see an improvement with teo as it draws less power in the
corresponding bins, roughly speaking.

2018-12-07 13:35:56

by Mel Gorman

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

On Mon, Dec 03, 2018 at 08:23:56AM -0800, Doug Smythies wrote:
> In the README file, I did see that for reporting I am
> somehow supposed to use compare-kernels.sh, but
> I couldn't figure that out.
>

cd work/log
../../compare-kernels.sh

> By the way, I am running these tests as a regular user, but
> they seem to want to modify:
>
> /sys/kernel/mm/transparent_hugepage/enabled
>

Red herring in this case. Even if transparent hugepages are left as the
default, it still tries to write it stupidly. An irritating, but
harmless bug.

--
Mel Gorman
SUSE Labs

2018-12-08 10:20:25

by Giovanni Gherdovich

[permalink] [raw]
Subject: Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

Hello Doug,

sorry for the late reply, this week I was traveling.

First off, thank you for trying out MMTests; I admit the documentation is
somewhat incomplete. I'm going to give you an overview of how I run benchmarks
with MMTests and how do I print comparisons, hoping this can address your
questions.

In the last report I posted the following two tables, for instance; I'll now
show the commands I used to produce them.

>  * sockperf on loopback over UDP, mode "throughput"
>     * global-dhp__network-sockperf-unbound
>     48x-HASWELL-NUMA fixed since v2, the others greatly improved in v6.

>                         teo-v1      teo-v2      teo-v3      teo-v5      teo-v6
>   -------------------------------------------------------------------------------
>   8x-SKYLAKE-UMA        1% worse    1% worse    1% worse    1% worse    10% better
>   80x-BROADWELL-NUMA    3% better   2% better   5% better   3% worse    8% better
>   48x-HASWELL-NUMA      4% better   12% worse   no change   no change   no change

>   SOCKPERF-UDP-THROUGHPUT
>   =======================
>   NOTES: Test run in mode "throughput" over UDP. The varying parameter is the
>       message size.
>   MEASURES: Throughput, in MBits/second
>   HIGHER is better

>   machine: 8x-SKYLAKE-UMA

>  4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0                 4.18.0
>         vanilla                    teo        teo-v2+backport        teo-v3+backport        teo-v5+backport        teo-v6+backport
>   -------------------------------------------------------------------------------------------------------------------------------------------------------
>   Hmean     14        70.34 (   0.00%)       69.80 *  -0.76%*       69.11 *  -1.75%*       69.49 *  -1.20%*       69.71 *  -0.90%*       77.51 *  10.20%*
>   Hmean     100      499.24 (   0.00%)      494.26 *  -1.00%*      492.74 *  -1.30%*      494.90 *  -0.87%*      497.43 *  -0.36%*      549.93 *  10.15%*
>   Hmean     300     1489.13 (   0.00%)     1472.39 *  -1.12%*     1468.45 *  -1.39%*     1477.74 *  -0.76%*     1478.61 *  -0.71%*     1632.63 *   9.64%*
>   Hmean     500     2469.62 (   0.00%)     2444.41 *  -1.02%*     2434.61 *  -1.42%*     2454.15 *  -0.63%*     2454.76 *  -0.60%*     2698.70 *   9.28%*
>   Hmean     850     4165.12 (   0.00%)     4123.82 *  -0.99%*     4100.37 *  -1.55%*     4111.82 *  -1.28%*     4120.04 *  -1.08%*     4521.11 *   8.55%*

The first table is a panoramic view of all machines, the second is a zoom into
the 8x-SKYLAKE-UMA machine where the overall benchmark score is broken down
into the various message sizes.

The first thing to do is, obviously, to gather data for each kernel. Once the
kernel is installed on the box, as you already figured out, you have to run:

  ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound SOME-MNEMONIC-NAME

In my case, what I did is to run:

  # build, install and boot 4.18.0-vanilla kernel
  ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound  4.18.0-vanilla

  # build, install and boot 4.18.0-teo kernel
  ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound  4.18.0-teo

  # build, install and boot 4.18.0-teo-v2+backport kernel
  ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound  4.18.0-teo-v2+backport

  ...

  # build, install and boot 4.18.0-teo-v6+backport kernel
  ./run-mmtests.sh --config configs/config-global-dhp__network-sockperf-unbound  4.18.0-teo-v6+backport

At this point in the work/log directory I've accumulated all the data I need
for a report. What's important to note here is that a single configuration
file (such as config-global-dhp__network-sockperf-unbound) often runs more than
a single  benchmark, according to the value of the MMTESTS variable in that
config. The config we're using has:

  export MMTESTS="sockperf-tcp-throughput sockperf-tcp-under-load sockperf-udp-throughput sockperf-udp-under-load"

which means it's running 4 different flavors of sockperf. The two tables above
are from the "sockperf-udp-throughput" variant.

Now that we've run the benchmarks for each kernel (every run takes around 75
minutes on my machines) we're ready to extract some comparison tables.
Exploring the work/log directory shows what we've got:

  $ find . -type d -name sockperf\* | sort 
  ./sockperf-tcp-throughput-4.18.0-teo
  ./sockperf-tcp-throughput-4.18.0-teo-v2+backport
  ./sockperf-tcp-throughput-4.18.0-teo-v3+backport
  ./sockperf-tcp-throughput-4.18.0-teo-v5+backport
  ./sockperf-tcp-throughput-4.18.0-teo-v6+backport
  ./sockperf-tcp-throughput-4.18.0-vanilla
  ./sockperf-tcp-under-load-4.18.0-teo
  ./sockperf-tcp-under-load-4.18.0-teo-v2+backport
  ./sockperf-tcp-under-load-4.18.0-teo-v3+backport
  ./sockperf-tcp-under-load-4.18.0-teo-v5+backport
  ./sockperf-tcp-under-load-4.18.0-teo-v6+backport
  ./sockperf-tcp-under-load-4.18.0-vanilla
  ./sockperf-udp-throughput-4.18.0-teo
  ./sockperf-udp-throughput-4.18.0-teo-v2+backport
  ./sockperf-udp-throughput-4.18.0-teo-v3+backport
  ./sockperf-udp-throughput-4.18.0-teo-v5+backport
  ./sockperf-udp-throughput-4.18.0-teo-v6+backport
  ./sockperf-udp-throughput-4.18.0-vanilla
  ./sockperf-udp-under-load-4.18.0-teo
  ./sockperf-udp-under-load-4.18.0-teo-v2+backport
  ./sockperf-udp-under-load-4.18.0-teo-v3+backport
  ./sockperf-udp-under-load-4.18.0-teo-v5+backport
  ./sockperf-udp-under-load-4.18.0-teo-v6+backport
  ./sockperf-udp-under-load-4.18.0-vanilla

Above you see a directory for each of the benchmarks in the MMTESTS variable
from before and for each kernel patch. The command to get the detailed table
at the top of this message is:

  $ ../../bin/compare-mmtests.pl --directory . \
                                 --benchmark sockperf-udp-throughput \
                                 --names 4.18.0-vanilla,4.18.0-teo,4.18.0-teo-v2+backport,4.18.0-teo-v3+backport,4.18.0-teo-v5+backport,4.18.0-teo-v6+backport \
   | grep '^.mean'

  Hmean     14        70.34 (   0.00%)       69.80 *  -0.76%*       69.11 *  -1.75%*       69.49 *  -1.20%*       69.71 *  -0.90%*       77.51 *  10.20%*
  Hmean     100      499.24 (   0.00%)      494.26 *  -1.00%*      492.74 *  -1.30%*      494.90 *  -0.87%*      497.43 *  -0.36%*      549.93 *  10.15%*
  Hmean     300     1489.13 (   0.00%)     1472.39 *  -1.12%*     1468.45 *  -1.39%*     1477.74 *  -0.76%*     1478.61 *  -0.71%*     1632.63 *   9.64%*
  Hmean     500     2469.62 (   0.00%)     2444.41 *  -1.02%*     2434.61 *  -1.42%*     2454.15 *  -0.63%*     2454.76 *  -0.60%*     2698.70 *   9.28%*
  Hmean     850     4165.12 (   0.00%)     4123.82 *  -0.99%*     4100.37 *  -1.55%*     4111.82 *  -1.28%*     4120.04 *  -1.08%*     4521.11 *   8.55%*

As you can see I'm grepping for a specific field in the table, i.e. the mean
values; some benchmarks use the harmonic mean (Hmean) and some use the
arithmetic mean (Amean). I also use to get a peek at the beginning of the table
without grepping to get nice headers to stick on top. See how the table above
only uses 1/4 of the data we collected, i.e. it only reads from the data
directories

  ./sockperf-udp-throughput-4.18.0-teo
  ./sockperf-udp-throughput-4.18.0-teo-v2+backport
  ./sockperf-udp-throughput-4.18.0-teo-v3+backport
  ./sockperf-udp-throughput-4.18.0-teo-v5+backport
  ./sockperf-udp-throughput-4.18.0-teo-v6+backport
  ./sockperf-udp-throughput-4.18.0-vanilla

note that the "--benchmark sockperf-udp-throughput" option flag gives the
first part of the directories name (the specific benchmark) while the option
flag "--names 4.18.0-vanilla,4.18.0-teo,4.18.0-teo-v2+backport,..." completes
the directory names (the kernel patches). One can use the script
compare-kernels.sh too, but I like bin/compare-mmtests.pl the most.

Now, the overview table. To get the overall score for, say, teo-v3, what the
script does is to compute ratios between v3 and the baseline for all message
sizes, and then taking the geometric mean of the results. The geometric mean is
chosen because it has the nice property that the geometric mean of ratios is
equal to the ratios of geometric means, which is:

  gmean(ratio(v3, baseline))

is the same as

  gmean(v3) / gmean(baseline)

where v3 and baseline are list of values (results for each message length) and
ratio() is a function that takes component-wise ratios of lists. It would be
awkward is our mean function didn't have that property, you would get
different results depending on the order of your operations (first ratio then
mean, or vice-versa).

I get overview tables running the same compare-mmtests.pl command above, but
adding the --print-ratio option flag.

  $ ../../bin/compare-mmtests.pl --print-ratio \
                                 --directory . \
                                 --benchmark sockperf-udp-throughput \
                                 --names 4.18.0-vanilla,4.18.0-teo,4.18.0-teo-v2+backport,4.18.0-teo-v3+backport,4.18.0-teo-v5+backport,4.18.0-teo-v6+backport

I then look at the very last line of the output, which in this case is

  Gmean Higher    1.00    0.99    0.99    0.99    0.99    1.10

"Gmean" in there reminds me that I'm looking at geometric means; "Higher" is
to say that "higher ratio is better" (remember that sockperf-udp-throughput
measures throughput), then the first ratio is always 1.00 because that's the
reference baseline. Then I have v1, v2, v3, v5 and v6 (I omitted the
headers). All versions have a 1% regression except v6 which is a 10% gain. If
it was a lower-is-better type of test, the interpretation of those number
would be reversed.


Some specific remarks you raise:

On Mon, 2018-12-03 at 08:23 -0800, Doug Smythies wrote:
> ...
> My issue is that I do not understand the output or how it
> might correlate with your tables.

> I get, for example:

>    3    1   1     0.13s     0.68s     0.80s  1003894.302 1003779.613
>    3    1   1     0.16s     0.64s     0.80s  1008900.053 1008215.336
>    3    1   1     0.14s     0.66s     0.80s  1009630.439 1008990.265
> ...

> But I don't know what that means, nor have I been able to find
> a description anywhere.

> In the README file, I did see that for reporting I am 
> somehow supposed to use compare-kernels.sh, but
> I couldn't figure that out.

I don't recognize this output. I hope the illustration above can clarify how
MMTests is used.

> By the way, I am running these tests as a regular user, but
> they seem to want to modify:

> /sys/kernel/mm/transparent_hugepage/enabled

> which requires root privilege. I don't really want to mess
> with that stuff for these tests.

As Mel said, in this case the warning is harmless but I understand that
requesting root permissions is annoying. At times the framework has to modify
some settings and that may require root; on one side we are careful to undo
all modifications applied to a machine for the purpose of testing after the
benchmark is completed, but is also true that MMTests has evolved over time to
be used on lab machines that can be redeployed to a clean known state at the
push of a button. The convenience of assuming root far outweighs the
limitations of this approach.

Another less then ideal characteristic of MMTests is that it downloads the
sources of benchmarks from external URLs which you, as a user, may or may not
trust. We have vetted all those URLs and determined that they represent the
canonical source for a given benchmark, but are not ultimately responsible for
their content. Inside our organization we have a mirror for the content of all
of them (the external URL is accessed only as a fallback), but when people
outside of SUSE use MMTests that's what they get.

> I had the thought that I should be able to get similar
> results as your "8x-SKYLAKE-UMA" on my test computer,
> i7-2600K. Or that at least it was worth trying, just
> to see.

Uhm. That may or may not happen. It's true that your i7-2600K and
8x-SKYLAKE-UMA have the same number of cores and threads, and that the size of
the L3 cache is the same, but the https://ark.intel.com website tells me that
they are from different microarchitecture generations: i7-2600K is a Sandy
Bridge and 8x-SKYLAKE-UMA is a Skylake. Given that the timeline of Intel
microarchitectures is Sandy Bridge (2011), Haswell (2013), Broadwell (2015)
and Skylake (2016), i7-2600K might have more in common with my 48x-HASWELL-NUMA
than it has 8x-SKYLAKE-UMA.

It might seems odd to compare a 48 threads machine with an 8 threads one, but
if you look at MMTests configurations you'll notice that we try to "normalize"
over available resources such as amount of memory or number of cores. We
generally try to explore the space of parameters from "light workload" to
machine saturation; then we average the results, as you see in the
illustration above. This might offer you an alternate viewpoint if what you
get if i7-2600K doesn't match your expectation.


Thanks,
Giovanni

2018-12-08 16:25:35

by Doug Smythies

[permalink] [raw]
Subject: RE: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

On 2018.12.08 02:23 Giovanni Gherdovich wrote:

> sorry for the late reply, this week I was traveling.

No Problem. Thanks very much for your very detailed reply,
which obviously took considerable time to write. While
I was making progress, your instructions really fill in
some gaps and mistakes I was making.

Eventually (probably several days) I'll report back my test
results.


> Some specific remarks you raise:
>
> On Mon, 2018-12-03 at 08:23 -0800, Doug Smythies wrote:
>> ...
>> My issue is that I do not understand the output or how it
>> might correlate with your tables.
>>
>> I get, for example:
>>
>> 3 1 1 0.13s 0.68s 0.80s 1003894.302 1003779.613
>> 3 1 1 0.16s 0.64s 0.80s 1008900.053 1008215.336
>> 3 1 1 0.14s 0.66s 0.80s 1009630.439 1008990.265
>> ...
>>
>> But I don't know what that means, nor have I been able to find
>> a description anywhere.
>
> I don't recognize this output. I hope the illustration above can clarify how
> MMTests is used.

Due to incompetence on my part, the config file being run for my tests was
always just the default config file from my original
git clone https://github.com/gormanm/mmtests.git
command. So regardless of what I thought I was doing, I was running "pft"
(Page Fault Test).

... Doug