2012-08-23 21:14:20

by Rik van Riel

[permalink] [raw]
Subject: [RFC][PATCH 0/3] c-state governor changes

It turns out that the c-state governor can be a performance issue
with certain workloads. The problem workloads seem to involve
mixed length pauses between activities, eg. an occasional long
idle period, with a burst of activity involving short idle periods.

One example of this would be a system with a web server and a
database, where the web server gets a request "every once in a
while" (compared to c-state timescales), and then quickly bounces
stuff back and forth between itself and the database, before
sending anything back to the http client.

On the face of it, the use of average sleep times does not make
a whole lot of sense, when we can have sleep patterns like this:

150 200 50 1000 30 180 10000 220

The bulk of the sleep time is spent in the one long sleep, but
planning for a 1500us sleep time (around the average) is pretty
much guaranteed to be wrong.

Instead, it might make more sense to plan for a sleep time just
under 200us. We may still need some kind of demotion scheme to
kick us into a deeper c-state when a truly long sleep period
comes along...

This patch set is mostly there to kick off a discussion in time
for Kernel Summit. When running it on my laptop, with acpi_idle,
I see a promising change in powertop.

Time spent in C3 has gone down from 99% of CPU time, to 97-98% CPU
time, but the average residency time in C3 has gone up. The other
1-2% of CPU time is spent in C2 instead.

I have not run any meaningful benchmarks against this code yet.
It has had a benchmark run where the workload presents regular
sleep intervals, and performs identically to the old code.

Please let me know what you think :)

--
All Rights Reversed


2012-08-23 21:13:45

by Rik van Riel

[permalink] [raw]
Subject: [RFC][PATCH 3/3] cpuidle: count double the exit latency

In some workloads, for example web servers talking to database servers,
a task on one CPU will wake up a task on another CPU, wait for that
other CPU to handle the job, and get the result back. Each of those
wakeups is likely to incur the c-state exit latency.

By only counting the exit latency once, even though the round trip
involves two c-state exits, we can end up going into too deep a
c-state for the workload.

Signed-off-by: Rik van Riel <[email protected]>
---
drivers/cpuidle/governors/menu.c | 9 ++++++---
1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 47b1150..ccdb348 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -378,10 +378,13 @@ static void menu_update(struct cpuidle_driver *drv, struct cpuidle_device *dev)
/*
* We correct for the exit latency; we are assuming here that the
* exit latency happens after the event that we're interested in.
+ * Pessimistically assume that we got woken up by another CPU,
+ * so we count its exit latency, too.
*/
- if (measured_us > data->exit_us)
- measured_us -= data->exit_us;
-
+ if (measured_us > 2 * data->exit_us)
+ measured_us -= 2 * data->exit_us;
+ else
+ measured_us = 0;

/* update our correction ratio */

2012-08-23 21:13:49

by Rik van Riel

[permalink] [raw]
Subject: [RFC][PATCH 1/3] cpuidle: fix underflow in stddev calculation

The calculation to determine the standard deviation used unsigned
integers. Since some of the values are guaranteed to be below the
average, this would always lead to large unsigned 32 bit numbers,
which would then be multiplied and added to a 64 bit integer,
potentially leading to a totally unpredictable result.

I am not sure if/why this code has ever worked.

Signed-off-by: Rik van Riel <[email protected]>
---
drivers/cpuidle/governors/menu.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index 5b1f2c3..f4fe5c3 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -212,9 +212,10 @@ static void detect_repeating_patterns(struct menu_device *data)
if (avg > data->expected_us)
return;

- for (i = 0; i < INTERVALS; i++)
- stddev += (data->intervals[i] - avg) *
- (data->intervals[i] - avg);
+ for (i = 0; i < INTERVALS; i++) {
+ int diff = (int)data->intervals[i] - avg;
+ stddev += diff * diff;
+ }

stddev = stddev / INTERVALS;

2012-08-23 21:13:51

by Rik van Riel

[permalink] [raw]
Subject: [RFC][PATCH 2/3] cpuidle: find a typical recent sleep interval

The function detect_repeating_patterns was not very useful for
workloads with alternating long and short pauses, for example
virtual machines handling network requests for each other (say
a web and database server).

Instead, try to find a recent sleep interval that is somewhere
between the median and the mode sleep time, by discarding outliers
to the up side and recalculating the average and standard deviation
until that is no longer required.

This should do something sane with a sleep interval series like:

200 180 210 10000 30 1000 170 200

The current code would simply discard such a series, while the
new code will guess a typical sleep interval just shy of 200.

Signed-off-by: Rik van Riel <[email protected]>

... should we apply this idea to the bucket code instead,
keeping 8 intervals per bucket?
---
drivers/cpuidle/governors/menu.c | 68 ++++++++++++++++++++++++-------------
1 files changed, 44 insertions(+), 24 deletions(-)

diff --git a/drivers/cpuidle/governors/menu.c b/drivers/cpuidle/governors/menu.c
index f4fe5c3..87772bb 100644
--- a/drivers/cpuidle/governors/menu.c
+++ b/drivers/cpuidle/governors/menu.c
@@ -192,39 +192,59 @@ static u64 div_round64(u64 dividend, u32 divisor)
}

/*
- * Try detecting repeating patterns by keeping track of the last 8
- * intervals, and checking if the standard deviation of that set
- * of points is below a threshold. If it is... then use the
- * average of these 8 points as the estimated value.
+ * Try to find a typical sleep interval. By discarding the outliers
+ * on the up side only, we should arrive at a figure somewhere
+ * between the median and the mean sleep time.
*/
-static void detect_repeating_patterns(struct menu_device *data)
+static void get_typical_interval(struct menu_device *data)
{
- int i;
- uint64_t avg = 0;
- uint64_t stddev = 0; /* contains the square of the std deviation */
+ int i, divisor;
+ int64_t max, avg, stddev;
+ int64_t thresh = LLONG_MAX; /* Discard outliers above this value. */

+ again:
/* first calculate average and standard deviation of the past */
- for (i = 0; i < INTERVALS; i++)
- avg += data->intervals[i];
- avg = avg / INTERVALS;
-
- /* if the avg is beyond the known next tick, it's worthless */
- if (avg > data->expected_us)
- return;
-
+ max = avg = divisor = stddev = 0;
for (i = 0; i < INTERVALS; i++) {
- int diff = (int)data->intervals[i] - avg;
- stddev += diff * diff;
+ int64_t value = data->intervals[i];
+ if (value <= thresh) {
+ avg += value;
+ divisor++;
+ if (value > max)
+ max = value;
+ }
}
+ avg = avg / divisor;

- stddev = stddev / INTERVALS;
+ for (i = 0; i < INTERVALS; i++) {
+ int64_t value = data->intervals[i];
+ if (value <= thresh) {
+ int64_t diff = value - avg;
+ stddev += diff * diff;
+ }
+ }
+ stddev = stddev / divisor;
+ stddev = int_sqrt(stddev);

/*
- * now.. if stddev is small.. then assume we have a
- * repeating pattern and predict we keep doing this.
+ * If we have outliers to the upside in our distribution, discard
+ * those by setting the threshold to exclude these outliers, then
+ * calculate the average and standard deviation again. Once we get
+ * down to the bottom 1/3 of our samples, stop excluding samples.
+ *
+ * This can deal with workloads that have long pauses interspersed
+ * with sporadic activity with a bunch of short pauses.
*/
-
- if (avg && stddev < STDDEV_THRESH)
+ if (avg && max > avg + (stddev * 2) && (divisor * 3) >= INTERVALS) {
+ thresh = avg + (stddev * 2);
+ goto again;
+ }
+
+ /*
+ * If the calculated interval is shorter than what the other code
+ * expected, use it.
+ */
+ if (avg && avg < data->expected_us)
data->predicted_us = avg;
}

@@ -275,7 +295,7 @@ static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev)
data->predicted_us = div_round64(data->expected_us * data->correction_factor[data->bucket],
RESOLUTION * DECAY);

- detect_repeating_patterns(data);
+ get_typical_interval(data);

/*
* We want to default to C1 (hlt), not to busy polling

2012-08-23 21:54:26

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] c-state governor changes

On 8/23/2012 2:11 PM, Rik van Riel wrote:
> This patch set is mostly there to kick off a discussion in time
> for Kernel Summit. When running it on my laptop, with acpi_idle,
> I see a promising change in powertop.

be careful with acpi_idle... that will remove most of the intermediate C states the platform has,
so the policy engine no longer has good things to chose from.

2012-08-23 21:55:03

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] c-state governor changes


> Please let me know what you think :)
>

btw on what kind of hardware are you seeing the issues ?

2012-08-24 02:57:44

by Rik van Riel

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] c-state governor changes

On 08/23/2012 05:54 PM, Arjan van de Ven wrote:
> On 8/23/2012 2:11 PM, Rik van Riel wrote:
>> This patch set is mostly there to kick off a discussion in time
>> for Kernel Summit. When running it on my laptop, with acpi_idle,
>> I see a promising change in powertop.
>
> be careful with acpi_idle... that will remove most of the intermediate C states the platform has,
> so the policy engine no longer has good things to chose from.

Interesting, I was not aware of that.

Another thing I did see is that the intel_idle driver selects
a target residency time around 4x larger than the exit latency
for most c-states, while acpi_idle sticks to 2x by default.

Aiming for a residency time only 2x the exit latency seems
excessively aggressive, especially for the deeper c states.

The latency issues with various KVM guests communicating with
each other have been observed on various Intel and AMD chipsets,
not unique to one vendor.

--
All rights reversed

2012-08-24 04:11:24

by Matthew Garrett

[permalink] [raw]
Subject: Re: [RFC][PATCH 0/3] c-state governor changes

On Thu, Aug 23, 2012 at 02:54:05PM -0700, Arjan van de Ven wrote:
> On 8/23/2012 2:11 PM, Rik van Riel wrote:
> > This patch set is mostly there to kick off a discussion in time
> > for Kernel Summit. When running it on my laptop, with acpi_idle,
> > I see a promising change in powertop.
>
> be careful with acpi_idle... that will remove most of the intermediate C states the platform has,
> so the policy engine no longer has good things to chose from.

Not inherently - intel_idle typically only exports 4 C states, and many
ACPI platforms will provide the same number, especially when on battery.

--
Matthew Garrett | [email protected]