2009-07-27 05:30:10

by Yanmin Zhang

[permalink] [raw]
Subject: Dynamic configure max_cstate

When running a fio workload, I found that CPU C-states sometimes have
a big impact on the results. fio is mostly a disk I/O workload which
doesn't spend much time on the CPU, so the CPU switches to C2/C3
frequently and the exit latency is significant.

If I start the kernel with idle=poll or processor.max_cstate=1,
the result is quite good. Consider a scenario where the machine is
busy during the day and idle at night. Could we add a dynamic
configuration interface for processor.max_cstate, or something
similar, via sysfs, so that user applications could change
max_cstate at runtime? For example, we could add a new parameter
to cpuidle_governor->select to mark the highest allowed C state.
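
Something like the small user-space sketch below is what I have in mind.
It assumes a writable knob (the sysfs path here is made up; today
processor.max_cstate is only a boot-time debug parameter) and simply
relaxes the limit at night:

#include <stdio.h>
#include <time.h>

/* Hypothetical writable knob -- it does not exist today. */
#define MAX_CSTATE_KNOB "/sys/module/processor/parameters/max_cstate"

int main(void)
{
	time_t now = time(NULL);
	struct tm tm;
	FILE *f;

	localtime_r(&now, &tm);

	f = fopen(MAX_CSTATE_KNOB, "w");
	if (!f) {
		perror("fopen");
		return 1;
	}
	/* daytime: keep latency low; night: allow deep C-states */
	fprintf(f, "%d\n", (tm.tm_hour >= 8 && tm.tm_hour < 20) ? 1 : 8);
	fclose(f);
	return 0;
}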

Any idea?

Yanmin


2009-07-27 07:33:39

by Andreas Mohr

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

Hi,

> When running a fio workload, I found sometimes cpu C state has
> big impact on the result. Mostly, fio is a disk I/O workload
> which doesn't spend much time with cpu, so cpu switch to C2/C3
> freqently and the latency is big.

Rather than inventing ways to limit ACPI Cx state usefulness, we should
perhaps be thinking of what's wrong here.

And your complaint might just fit into a thought I had recently:
are we actually taking ACPI Cx exit latency into account, for timers???

If we program a timer to fire at some point, then it is quite imaginable
that any ACPI Cx exit latency due to the CPU being idle at that moment
could add to actual timer trigger time significantly.

To combat this, one would need to tweak the timer expiration time
to include the exit latency. But of course once the CPU is running
again, one would need to re-add the latency amount (read: reprogram the
timer hardware, ugh...) to prevent the timer from firing too early.

Given that one would need to reprogram timer hardware quite often,
I don't know whether taking Cx exit latency into account is feasible.
OTOH analysis of the single next timer value and actual hardware reprogramming
would have to be done only once (in ACPI sleep and wake paths each),
thus it might just turn out to be very beneficial after all
(minus prolonging ACPI Cx path activity and thus reducing CPU power
savings, of course).
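
To make the idea concrete, a rough sketch (not real kernel code;
program_timer_hw() and cstate_exit_latency_ns() are made-up helpers)
might look like:

/* Hypothetical sketch only -- the helpers below do not exist. */
static u64 original_expiry_ns;

static void idle_entry_compensate(u64 next_timer_ns, int target_cstate)
{
	u64 lat = cstate_exit_latency_ns(target_cstate);

	original_expiry_ns = next_timer_ns;
	/* fire early so that wakeup + exit latency still lands on time */
	program_timer_hw(next_timer_ns > lat ? next_timer_ns - lat : 0);
}

static void idle_exit_compensate(u64 now_ns)
{
	/* woke before the (shortened) deadline: restore the real expiry */
	if (now_ns < original_expiry_ns)
		program_timer_hw(original_expiry_ns);
}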

Arjan mentioned examples of maybe 10us for C2 and 185us for C3/C4 in an
article.

OTOH even 185us is only 0.185ms, which, when compared to disk seek
latency (around 7ms still, except for SSD), doesn't seem to be all that much.
Or what kind of ballpark figure do you have for percentage of I/O
deterioration?
I'm wondering whether we might have an even bigger problem with disk I/O
related to this than just the raw ACPI exit latency value itself.

Andreas Mohr

2009-07-28 02:42:09

by Yanmin Zhang

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

On Mon, 2009-07-27 at 09:33 +0200, Andreas Mohr wrote:
> Hi,
>
> > When running a fio workload, I found sometimes cpu C state has
> > big impact on the result. Mostly, fio is a disk I/O workload
> > which doesn't spend much time with cpu, so cpu switch to C2/C3
> > freqently and the latency is big.
>
> Rather than inventing ways to limit ACPI Cx state usefulness, we should
> perhaps be thinking of what's wrong here.
Andreas,

Thanks for your kind comments.

>
> And your complaint might just fit into a thought I had recently:
> are we actually taking ACPI Cx exit latency into account, for timers???
I tried both tickless and non-tickless kernels. The results are similar.

Originally, I also thought it was related to timers. As you know, the block I/O layer
has many timers, but such timers don't normally expire. For example, an I/O request
is submitted to the driver, the driver delivers it to the disk, and the hardware raises
an interrupt when the I/O finishes. Mostly, the I/O submission and the interrupt, not
the timer, drive the I/O.

>
> If we program a timer to fire at some point, then it is quite imaginable
> that any ACPI Cx exit latency due to the CPU being idle at that moment
> could add to actual timer trigger time significantly.
>
> To combat this, one would need to tweak the timer expiration time
> to include the exit latency. But of course once the CPU is running
> again, one would need to re-add the latency amount (read: reprogram the
> timer hardware, ugh...) to prevent the timer from firing too early.
>
> Given that one would need to reprogram timer hardware quite often,
> I don't know whether taking Cx exit latency into account is feasible.
> OTOH analysis of the single next timer value and actual hardware reprogramming
> would have to be done only once (in ACPI sleep and wake paths each),
> thus it might just turn out to be very beneficial after all
> (minus prolonging ACPI Cx path activity and thus aggravating CPU power
> savings, of course).
>
> Arjan mentioned examples of maybe 10us for C2 and 185us for C3/C4 in an
> article.
>
> OTOH even 185us is only 0.185ms, which, when compared to disk seek
> latency (around 7ms still, except for SSD), doesn't seem to be all that much.
> Or what kind of ballpark figure do you have for percentage of I/O
> deterioration?
I have lots of fio sub test cases which test I/O on a single disk and on JBOD (a disk
box which mostly has 12~13 disks) on Nehalem machines. Your analysis of disk seek
is reasonable. I found sequential buffered read has the worst regression, while random
read is far better. For example, I start 12 processes per disk and every disk has 24
1-GB files. There are 12 disks. The sequential read fio result is about 593MB/s
with idle=poll, and about 375MB/s without idle=poll. Read block size is 4KB.

Another example is a single fio direct sequential read (block size 4KB) on a single
SATA disk. The result is about 28MB/s without idle=poll and about 32.5MB/s with
idle=poll.

How did I find that C-states have an impact on disk I/O results? Frankly, I found a regression
between kernels 2.6.27 and 2.6.28. Bisection located a non-stop TSC patch, but the patch
itself is quite good. I found the patch changes the default clocksource from hpet to
tsc. Then I tried all clocksources and got the best result with the acpi_pm clocksource,
but oprofile data shows acpi_pm has higher CPU utilization. The jiffies clocksource has the
worst result but the least CPU utilization. As you know, fio calls gettimeofday frequently.
Then I tried the boot parameters processor.max_cstate and idle=poll.
With processor.max_cstate=1 I get a result similar to the one with idle=poll.

I also ran the testing on 2 Stoakley machines and didn't find such issues.
/proc/acpi/processor/CPUXXX/power shows the Stoakley CPUs only have C1.

> I'm wondering whether we might have an even bigger problem with disk I/O
> related to this than just the raw ACPI exit latency value itself.
We might have. I'm still doing more testing. With Venki's tool (which reads/writes MSR registers),
I collected some C-state switch statistics.

The current cpuidle code takes CPU utilization into account well, but it doesn't
consider devices. So with an I/O submission and interrupt-driven model
with little CPU utilization, performance might be hurt if the C-state exit has a long
latency.

Yanmin

2009-07-28 07:20:35

by Corrado Zoccolo

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

Hi,
On Tue, Jul 28, 2009 at 4:42 AM, Zhang,
Yanmin<[email protected]> wrote:
> On Mon, 2009-07-27 at 09:33 +0200, Andreas Mohr wrote:
>> Hi,
>>
>> > When running a fio workload, I found sometimes cpu C state has
>> > big impact on the result. Mostly, fio is a disk I/O workload
>> > which doesn't spend much time with cpu, so cpu switch to C2/C3
>> > freqently and the latency is big.
>>
>> Rather than inventing ways to limit ACPI Cx state usefulness, we should
>> perhaps be thinking of what's wrong here.
> Andreas,
>
> Thanks for your kind comments.
>
>>
>> And your complaint might just fit into a thought I had recently:
>> are we actually taking ACPI Cx exit latency into account, for timers???
> I tried both tickless kernel and non-tickless kernels. The result is similiar.
>
> Originally, I also thought it's related to timer. As you know, I/O block layer
> has many timers. Such timers don't expire normally. For example, an I/O request
> is submitted to driver and driver delievers it to disk and hardware triggers
> an interrupt after finishing I/O. Mostly, the I/O submit and interrupt, not
> the timer, drive the I/O.
>
>>
>> If we program a timer to fire at some point, then it is quite imaginable
>> that any ACPI Cx exit latency due to the CPU being idle at that moment
>> could add to actual timer trigger time significantly.
>>
>> To combat this, one would need to tweak the timer expiration time
>> to include the exit latency. But of course once the CPU is running
>> again, one would need to re-add the latency amount (read: reprogram the
>> timer hardware, ugh...) to prevent the timer from firing too early.
>>
>> Given that one would need to reprogram timer hardware quite often,
>> I don't know whether taking Cx exit latency into account is feasible.
>> OTOH analysis of the single next timer value and actual hardware reprogramming
>> would have to be done only once (in ACPI sleep and wake paths each),
>> thus it might just turn out to be very beneficial after all
>> (minus prolonging ACPI Cx path activity and thus aggravating CPU power
>> savings, of course).
>>
>> Arjan mentioned examples of maybe 10us for C2 and 185us for C3/C4 in an
>> article.
>>
>> OTOH even 185us is only 0.185ms, which, when compared to disk seek
>> latency (around 7ms still, except for SSD), doesn't seem to be all that much.
>> Or what kind of ballpark figure do you have for percentage of I/O
>> deterioration?
> I have lots of FIO sub test cases which test I/O on single disk and JBOD (a disk
> bos which mostly has 12~13 disks) on nahelam machines. Your analysis on disk seek
> is reasonable. I found sequential buffered read has the worst regression while rand
> read is far better. For example, I start 12 processes per disk and every disk has 24
> 1-G files. There are 12 disks. The sequential read fio result is about 593MB/second
> with idle=poll, and about 375MB/s without idle=poll. Read block size is 4KB.
>
> Another exmaple is single fio direct seqential read (block size is 4K) on a single
> SATA disk. The result is about 28MB/s without idle=poll and about 32.5MB with
> idle=poll.
>
> How did I find C state has impact on disk I/O result? Frankly, I found a regression
> between kernel 2.6.27 and 2.6.28. Bisect located a nonstop tsc patch, but the patch
> is quite good. I found the patch changes the default clocksource from hpet to
> tsc. Then, I tried all clocksources and got the best result with acpi_pm clocksource.
> But oprofile data shows acpi_pm has more cpu utilization. clocksource jiffies has
> worst result but least cpu utilization. As you know, fio calls gettimeofday frequently.
> Then, I tried boot parameter processor.max_cstate and idle=poll.
> I get the similar result with processor.max_cstate=1 like the one with idle=poll.
>

Is it possible that the different bandwidth figures are due to
incorrect timing, instead of C-state latencies?
Entering a deep C state can do strange things to timers: some of
them, especially tsc, become unreliable.
Maybe the patch you found, which re-enables tsc, is actually wrong for
your machine, for which tsc is unreliable in deep C states.

> I also run the testing on 2 stoakley machines and don't find such issues.
> /proc/acpi/processor/CPUXXX/power shows stoakley cpu only has C1.
>
>> I'm wondering whether we might have an even bigger problem with disk I/O
>> related to this than just the raw ACPI exit latency value itself.
> We might have. I'm still doing more testing. With Venki's tool (write/read MSR registers),
> I collected some C state switch stat.
>
You can see the latencies (expressed in us) on your machine with:
[root@localhost corrado]# cat
/sys/devices/system/cpu/cpu0/cpuidle/state*/latency
0
0
1
133

Can you post your numbers, to see if they are unusually high?

> Current cpuidle has a good consideration on cpu utilization, but doesn't have
> consideration on devices. So with I/O delivery and interrupt drive model
> with little cpu utilization, performance might be hurt if C state exit has a long
> latency.
>
> Yanmin
>
>



--
__________________________________________________________________________

dott. Corrado Zoccolo mailto:[email protected]
PhD - Department of Computer Science - University of Pisa, Italy
--------------------------------------------------------------------------

2009-07-28 09:00:29

by Yanmin Zhang

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

On Tue, 2009-07-28 at 09:20 +0200, Corrado Zoccolo wrote:
> Hi,
> On Tue, Jul 28, 2009 at 4:42 AM, Zhang,
> Yanmin<[email protected]> wrote:
> > On Mon, 2009-07-27 at 09:33 +0200, Andreas Mohr wrote:
> >> Hi,
> >>
> >> > When running a fio workload, I found sometimes cpu C state has
> >> > big impact on the result. Mostly, fio is a disk I/O workload
> >> > which doesn't spend much time with cpu, so cpu switch to C2/C3
> >> > freqently and the latency is big.
> >>
> >> Rather than inventing ways to limit ACPI Cx state usefulness, we should
> >> perhaps be thinking of what's wrong here.
> > Andreas,
> >
> > Thanks for your kind comments.
> >
> >>
> >> And your complaint might just fit into a thought I had recently:
> >> are we actually taking ACPI Cx exit latency into account, for timers???
> > I tried both tickless kernel and non-tickless kernels. The result is similiar.
> >
> > Originally, I also thought it's related to timer. As you know, I/O block layer
> > has many timers. Such timers don't expire normally. For example, an I/O request
> > is submitted to driver and driver delievers it to disk and hardware triggers
> > an interrupt after finishing I/O. Mostly, the I/O submit and interrupt, not
> > the timer, drive the I/O.
> >
> >>
> >> If we program a timer to fire at some point, then it is quite imaginable
> >> that any ACPI Cx exit latency due to the CPU being idle at that moment
> >> could add to actual timer trigger time significantly.
> >>
> >> To combat this, one would need to tweak the timer expiration time
> >> to include the exit latency. But of course once the CPU is running
> >> again, one would need to re-add the latency amount (read: reprogram the
> >> timer hardware, ugh...) to prevent the timer from firing too early.
> >>
> >> Given that one would need to reprogram timer hardware quite often,
> >> I don't know whether taking Cx exit latency into account is feasible.
> >> OTOH analysis of the single next timer value and actual hardware reprogramming
> >> would have to be done only once (in ACPI sleep and wake paths each),
> >> thus it might just turn out to be very beneficial after all
> >> (minus prolonging ACPI Cx path activity and thus aggravating CPU power
> >> savings, of course).
> >>
> >> Arjan mentioned examples of maybe 10us for C2 and 185us for C3/C4 in an
> >> article.
> >>
> >> OTOH even 185us is only 0.185ms, which, when compared to disk seek
> >> latency (around 7ms still, except for SSD), doesn't seem to be all that much.
> >> Or what kind of ballpark figure do you have for percentage of I/O
> >> deterioration?
> > I have lots of FIO sub test cases which test I/O on single disk and JBOD (a disk
> > bos which mostly has 12~13 disks) on nahelam machines. Your analysis on disk seek
> > is reasonable. I found sequential buffered read has the worst regression while rand
> > read is far better. For example, I start 12 processes per disk and every disk has 24
> > 1-G files. There are 12 disks. The sequential read fio result is about 593MB/second
> > with idle=poll, and about 375MB/s without idle=poll. Read block size is 4KB.
> >
> > Another exmaple is single fio direct seqential read (block size is 4K) on a single
> > SATA disk. The result is about 28MB/s without idle=poll and about 32.5MB with
> > idle=poll.
> >
> > How did I find C state has impact on disk I/O result? Frankly, I found a regression
> > between kernel 2.6.27 and 2.6.28. Bisect located a nonstop tsc patch, but the patch
> > is quite good. I found the patch changes the default clocksource from hpet to
> > tsc. Then, I tried all clocksources and got the best result with acpi_pm clocksource.
> > But oprofile data shows acpi_pm has more cpu utilization. clocksource jiffies has
> > worst result but least cpu utilization. As you know, fio calls gettimeofday frequently.
> > Then, I tried boot parameter processor.max_cstate and idle=poll.
> > I get the similar result with processor.max_cstate=1 like the one with idle=poll.
> >
>
> Is it possible that the different bandwidths figures are due to
> incorrect timing, instead of C-state latencies?
I'm not sure.

> Entering a deep C state can cause strange things to timers: some of
> them, especially tsc, become unreliable.
> Maybe the patch you found that re-enables tsc is actually wrong for
> your machine, for which tsc is unreliable in deep C states.
I'm using an SDV machine, not an official product. But it would be rare for cpuid
to report the non-stop TSC feature while the hardware doesn't actually support it.

I tried different clocksources. For example, I could get a better (30%) result with
hpet. With hpet, CPU utilization is about 5~8%; the function hpet_read uses too much CPU
time. With tsc, CPU utilization is about 2~3%. I think higher CPU utilization causes fewer
C-state transitions.

With idle=poll, the result is about 10% better than the hpet one. When using idle=poll,
I didn't find any result difference among the different clocksources.

>
> > I also run the testing on 2 stoakley machines and don't find such issues.
> > /proc/acpi/processor/CPUXXX/power shows stoakley cpu only has C1.
> >
> >> I'm wondering whether we might have an even bigger problem with disk I/O
> >> related to this than just the raw ACPI exit latency value itself.
> > We might have. I'm still doing more testing. With Venki's tool (write/read MSR registers),
> > I collected some C state switch stat.
> >
> You can see the latencies (expressed in us) on your machine with:
> [root@localhost corrado]# cat
> /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
> 0
> 0
> 1
> 133
>
> Can you post your numbers, to see if they are unusually high?
[ymzhang@lkp-ne02 ~]$ cat /proc/acpi/processor/CPU0/power
active state: C0
max_cstate: C8
maximum allowed latency: 2000000000 usec
states:
C1: type[C1] promotion[--] demotion[--] latency[003] usage[00001661] duration[00000000000000000000]
C2: type[C3] promotion[--] demotion[--] latency[205] usage[00000687] duration[00000000000000732028]
C3: type[C3] promotion[--] demotion[--] latency[245] usage[00011509] duration[00000000000115186065]

[ymzhang@lkp-ne02 ~]$ cat /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
0
3
205
245

>
> > Current cpuidle has a good consideration on cpu utilization, but doesn't have
> > consideration on devices. So with I/O delivery and interrupt drive model
> > with little cpu utilization, performance might be hurt if C state exit has a long
> > latency.

2009-07-28 10:11:38

by Andreas Mohr

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

Hi,

On Tue, Jul 28, 2009 at 05:00:35PM +0800, Zhang, Yanmin wrote:
> I tried different clocksources. For exmaple, I could get a better (30%) result with
> hpet. With hpet, cpu utilization is about 5~8%. Function hpet_read uses too much cpu
> time. With tsc, cpu utilization is about 2~3%. I think more cpu utilization causes fewer
> C state transitions.
>
> With idle=poll, the result is about 10% better than the one of hpet. If using idle=poll,
> I didn't find result difference among different clocksources.

IOW, this seems to clearly point to ACPI Cx causing it.

Both Corrado and I have been thinking that one should try skipping all
bigger-latency ACPI Cx states whenever there's an ongoing I/O request where an
immediate reply interrupt is expected.

I've been investigating this a bit, and interesting parts would perhaps include
. kernel/pm_qos_params.c
. drivers/cpuidle/governors/menu.c (which acts on the ACPI _cx state
structs as configured by drivers/acpi/processor_idle.c)
. and e.g. the wait_for_completion_timeout() part in drivers/ata/libata-core.c
(or other sources in case of other disk I/O mechanisms)

One way to do some quick (and dirty!!) testing would be to set a flag
before calling wait_for_completion_timeout() and testing for this flag in
drivers/cpuidle/governors/menu.c and then skip deeper Cx states
conditionally.

As a very quick test, I ran a
	while :; do :; done
loop in a shell reniced to 19 (to keep my CPU out of ACPI idle),
but bonnie -s 100 results initially looked promising yet turned out to
be inconsistent. The real way to test this would be idle=poll.
My test system was an Athlon XP with /proc/acpi/processor/CPU0/power
latencies of 000 and 100 (the maximum allowed value, BTW) for C1/C2.

If the wait_for_completion_timeout() flag testing turns out to help,
then one might intend to use the pm_qos infrastructure to indicate
these conditions; however, it might be too bloated for such a
purpose, and a relatively simple (read: fast) boolean flag mechanism
could be better.

Plus one could then create a helper function which figures out a
"pretty fast" Cx state (independent of specific latency times!).
But when introducing this mechanism, take care to not ignore the
requirements defined by pm_qos settings!

Oh, and about the places which submit I/O requests where one would have to
flag this: are they in any way correlated with the scheduler I/O wait
value? Would the I/O wait mechanism be a place to more easily and centrally
indicate that we're waiting for a request to come back in "very soon"?
OTOH I/O requests may have vastly differing delay expectations,
thus specifically only short-term expected I/O replies should be flagged,
otherwise we're wasting lots of ACPI deep idle opportunities.

Andreas Mohr

2009-07-28 14:03:11

by Andreas Mohr

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

On Tue, Jul 28, 2009 at 12:11:35PM +0200, Andreas Mohr wrote:
> As a very quick test, I tried a
> while :; do :; done
> loop in shell and renicing shell to 19 (to keep my CPU out of ACPI idle),
> but bonnie -s 100 results initially looked promising yet turned out to
> be inconsistent. The real way to test this would be idle=poll.
> My test system was Athlon XP with /proc/acpi/processor/CPU0/power
> latencies of 000 and 100 (the maximum allowed value, BTW) for C1/C2.

OK, I just tested it properly.
Rebooted, did 5 bonnie -s 100 with ACPI idle, rebooted and did another 5
bonnie -s 100 with idle=poll, results:


$ cat bonnie_ACPI_* /tmp/line bonnie_poll_*
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1* 100 20084 95.3 19037 9.5 12286 4.7 18074 99.6 581752 96.6 28792.3 93.6
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1* 100 19235 93.5 24591 11.8 13916 4.3 17934 99.8 604429 100.3 27993.8 98.0
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1* 100 17221 86.3 30591 16.1 15404 5.4 18689 99.3 593296 92.7 28146.0 98.5
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1* 100 20254 99.3 110095 55.9 15722 6.1 17901 99.5 601185 99.8 28675.5 100.4
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1* 100 18274 88.5 106909 53.2 10614 4.1 18759 99.7 598833 99.4 28461.6 92.5
========
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1* 100 15274 98.2 20206 9.7 17286 7.3 18055 99.4 608112 101.0 28424.0 99.5
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1* 100 20545 99.1 25332 12.6 16392 6.1 17957 99.4 606706 100.7 27906.8 90.7
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1* 100 20482 99.2 30907 13.6 17585 6.2 17867 99.1 608090 101.0 27919.1 97.7
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1* 100 20863 99.4 138383 66.2 18945 7.6 17938 99.5 581421 96.5 27094.6 94.8
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
1* 100 20821 98.8 156821 70.4 11536 4.4 18747 99.0 603556 100.2 27677.8 96.9


And these values (cumulative) result in:

               ACPI        poll
Per Char      95068       97985     +3.06%
Block        291223      371649    +27.62%
Rewrite       67942       81744    +20.31%
Per Char      91357       90564     -0.87%
Block       2979495     3007885     +0.95%
RndSeek    142069.2    139022.3     -2.1%
                           average: +8.16%

Now the question is how much is due to idle state entry/exit latency
and how much is due to ACPI idle/wakeup code path execution.

Still, an average of +8.16% across 5 test runs each should be quite some incentive,
and once there's a proper "skip deep idle during expected I/O replies" mechanism,
even with the idle/wakeup code path reinstated we should hopefully be able to keep
some 5% improvement in disk access.

Andreas Mohr

2009-07-28 17:35:35

by Andreas Mohr

[permalink] [raw]
Subject: ok, now would this be useful? (Re: Dynamic configure max_cstate)

On Tue, Jul 28, 2009 at 04:03:08PM +0200, Andreas Mohr wrote:
> Still, an average of +8.16% during 5 test runs each should be quite some incentive,
> and once there's a proper "idle latency skipping during expected I/O replies"
> even with idle/wakeup code path reinstated we should hopefully be able to keep
> some 5% improvement in disk access.

I went ahead and created a small and VERY dirty test for this.

In kernel/pm_qos_params.c I added

static bool io_reply_is_expected;

bool io_reply_expected(void)
{
	return io_reply_is_expected;
}
EXPORT_SYMBOL_GPL(io_reply_expected);

void set_io_reply_expected(bool expected)
{
	io_reply_is_expected = expected;
}
EXPORT_SYMBOL_GPL(set_io_reply_expected);



Then in drivers/ata/libata-core.c I added

extern void set_io_reply_expected(bool expected);

and updated it to

	set_io_reply_expected(true);
	rc = wait_for_completion_timeout(&wait, msecs_to_jiffies(timeout));
	set_io_reply_expected(false);

	ata_port_flush_task(ap);


Then I changed ./drivers/cpuidle/governors/menu.c
(make sure you're using the menu governor!) to use

extern bool io_reply_expected(void);

and updated

	if (io_reply_expected())
		data->expected_us = 10;
	else {
		/* determine the expected residency time */
		data->expected_us =
			(u32) ktime_to_ns(tick_nohz_get_sleep_length()) / 1000;
	}

Rebuilt, rebootloadered ;), rebooted, and then booting and disk operation
_seemed_ to be snappier (I'm damn sure the hdd seek noise
is a bit higher-pitched ;).
And it's exactly seeks which should be shorter-intervalled now,
since the system triggers a hdd operation and then is forced to wait (idle)
until the seeking is done.

bonnie test results (of the patched kernel vs. the kernel with set_io_reply_expected() muted)
seem to support this, but then a "time make bzImage" (on a freshly rebooted box each time)
showed inconsistent results again, and a much higher sample count (with reboots each time)
would be needed to really confirm this.

I'd expect improvements to be in the 3% to 4% range, at most, but still,
compared to the yield of other kernel patches this ain't nothing.

Now the question becomes whether one should implement such an improvement and, especially, how.
Perhaps the I/O reply decision making should be folded into the tick_nohz_get_sleep_length()
function (or rather, a higher-level "expected sleep length" function should be created which
consults both tick_nohz_get_sleep_length() and the I/O reply mechanism); see the sketch below.
And another important detail is that my current hack completely ignores per-cpu operation
and thus causes suboptimal power savings of _all_ cpus,
not just the one waiting for the I/O reply (i.e., we should properly take into account
cpu affinity settings of the reply interrupt).
And of course it would probably be best to create a mechanism which stores a record of average
responsiveness delays of various block devices and then derives a maximum
idle wakeup latency value to request from that.
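
A rough, untested sketch of such a combined helper (reusing the
io_reply_expected() flag from the hack above, and the same arbitrary
10 us guess):

static unsigned int expected_sleep_us(void)
{
	/* an I/O completion interrupt is expected very soon */
	if (io_reply_expected())
		return 10;

	/* otherwise fall back to the next timer expiry */
	return (unsigned int)
		(ktime_to_ns(tick_nohz_get_sleep_length()) / 1000);
}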

Does anyone else have thoughts on this or benchmark numbers which would support this?

Andreas Mohr

2009-07-28 19:25:29

by Len Brown

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate


> Entering a deep C state can cause strange things to timers: some of
> them, especially tsc, become unreliable.

The Nehalem family CPU has a non-stop constant-frequency TSC.

The measurements that Yanmin quotes show that the TSC
is the lowest overhead timesource in the system.

thanks,
Len Brown, Intel Open Source Technology Center

2009-07-28 19:47:20

by Len Brown

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

> When running a fio workload, I found sometimes cpu C state has
> big impact on the result. Mostly, fio is a disk I/O workload
> which doesn't spend much time with cpu, so cpu switch to C2/C3
> freqently and the latency is big.
>
> If I start kernel with idle=poll or processor.max_cstate=1,
> the result is quite good. Consider a scenario that machine is
> busy at daytime and free at night. Could we add a dynamic
> configuration interface for processor.max_cstate or something
> similiar with sysfs? So user applications could change the
> max_cstate dynamically? For example, we could add a new
> parameter to function cpuidle_governor->select to mark the
> highest c state.

max_cstate is a debug param. It isn't a run-time API and never will be.
User-space shouldn't need to know or care about C-states,
and if it appears it needs to, then we have a bug we need to fix.

The interface in Documentation/power/pm_qos_interface.txt
is supposed to handle this. Though if the underlying code
is not noticing IO interrupts, then it can't help.
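
For reference, a minimal user-space sketch of using that interface via
/dev/cpu_dma_latency (as described in pm_qos_interface.txt): write the
desired worst-case latency in microseconds as a 32-bit value and keep
the file descriptor open for as long as the constraint should hold.

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int32_t max_latency_us = 10;	/* effectively disallows deep C-states */
	int fd = open("/dev/cpu_dma_latency", O_WRONLY);

	if (fd < 0) {
		perror("open /dev/cpu_dma_latency");
		return 1;
	}
	if (write(fd, &max_latency_us, sizeof(max_latency_us)) < 0) {
		perror("write");
		close(fd);
		return 1;
	}
	/* the request stays active only while the fd is held open */
	pause();
	close(fd);
	return 0;
}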

Another thing to look at is processor.latency_factor
which you can change at run-time in
/sys/module/processor/parameters/latency_factor

We multiply the advertised exit latency by this
before deciding to enter a C-state. The concept
is that ACPI reports a performance number, but what
we really want is a power break-even point. Anyway,
we know the default multiplier is too low, and will be
raising it shortly.

Of course if the current code is not predicting any
IO interrupts on your IO-only workload, this, like
pm_qos, will not help.

cheers,
-Len Brown, Intel Open Source Technology Center

2009-07-29 00:17:34

by Len Brown

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate



thanks,
Len Brown, Intel Open Source Technology Center

On Mon, 27 Jul 2009, Andreas Mohr wrote:

> Hi,
>
> > When running a fio workload, I found sometimes cpu C state has
> > big impact on the result. Mostly, fio is a disk I/O workload
> > which doesn't spend much time with cpu, so cpu switch to C2/C3
> > freqently and the latency is big.
>
> Rather than inventing ways to limit ACPI Cx state usefulness, we should
> perhaps be thinking of what's wrong here.
>
> And your complaint might just fit into a thought I had recently:
> are we actually taking ACPI Cx exit latency into account, for timers???

Yes.
menu_select() calls tick_nohz_get_sleep_length() specifically
to compare the expiration of the next timer vs. the expected sleep length.

The problem here is likely that the actual sleep length
is shorter than expected, for IO interrupts are not timers...
Thus we add the long deep C-state wakeup time to the IO interrupt latency...

-Len Brown, Intel Open Source Technology Center

2009-07-29 08:00:39

by Andreas Mohr

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

Hi,

On Tue, Jul 28, 2009 at 08:17:09PM -0400, Len Brown wrote:
> > And your complaint might just fit into a thought I had recently:
> > are we actually taking ACPI Cx exit latency into account, for timers???
>
> Yes.
> menu_select() calls tick_nohz_get_sleep_length() specifically
> to compare the expiration of the next timer vs. the expected sleep length.
>
> The problem here is likely that the expected sleep length
> is shorter than expected, for IO interrupts are not timers...
> Thus we add long deep C-state wakeup time to the IO interrupt latency...

Well, but... the code does not work according to my idea about this.
The code currently checks against the expected sleep length and throws away
any large exit latencies that don't fit.
What I was thinking of is an entirely different way to handle this (and,
frankly, I'm not sure whether it would have any advantage, but still):
actively _subtract_ the idle exit latency from the timer expiration
time (i.e., reprogram the timer on idle entry and again on idle exit if
not expired yet) to make sure that the timer fires on time
despite having to handle the idle exit, too.

OTOH while this might allow deeper Cx states, it's most likely a weaker
solution than the current implementation, since it requires up to two
additional timer reprogrammings.
And additionally taking into account I/O-inflicted idle exit can be
implemented pretty easily alongside the existing tick_nohz_get_sleep_length()
mechanism.

The code still causes some additional uneasiness, such as:
tick_nohz_get_sleep_length() returns dev->next_event - now,
but once pushed through all the ACPI latency on the hardware side,
the actual timer arrival after CPU wakeup can be
fairly random. There should be a feedback mechanism which
measures when a timer was expected and when it then _actually_ turned up,
to cancel out the delay effects of ACPI idle entry/exit.

== i.e. we seem to be calculating these things based on what we _think_ the
machine is doing, not on what we _know_ about its previous behaviour ==
- since we don't have a feedback loop...
IMHO this is an important missing element here; if such a feedback loop
were implemented, then timer wakeups would be much more precise,
which incidentally would result in improved machine performance.
(CC Thomas)


And spinning this a bit further - let me guess (I didn't check it)
that hard realtime users are always quick to disable ACPI Cx completely?
With such a mechanism they shouldn't need to, since the timer is
programmed according to the _actual_ CPU wakeup time, not when we _think_
it might wake up.
(CC Ingo)


I just realized that such a feedback loop (resulting in possibly
early-programmed timers) would then need my timer reprogramming
mechanism again (after ACPI idle exit), to avoid the timer triggering early.
However, ultimately I think it might turn out to be a much better solution
to precisely _determine_ timer firing than to simply statically, mechanically
(blindly!) pre-set the time around which a timer "might be expected to be fired".


An annoyingly simple sentence to phrase the current situation:
"With ACPI idle configured, high-res timers aren't."


Or am I wrong and the current implementation is already doing all this?
I didn't see that, though...

Andreas Mohr

2009-07-29 08:20:56

by Yanmin Zhang

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

On Tue, 2009-07-28 at 12:11 +0200, Andreas Mohr wrote:
> Hi,
>
> On Tue, Jul 28, 2009 at 05:00:35PM +0800, Zhang, Yanmin wrote:
> > I tried different clocksources. For exmaple, I could get a better (30%) result with
> > hpet. With hpet, cpu utilization is about 5~8%. Function hpet_read uses too much cpu
> > time. With tsc, cpu utilization is about 2~3%. I think more cpu utilization causes fewer
> > C state transitions.
> >
> > With idle=poll, the result is about 10% better than the one of hpet. If using idle=poll,
> > I didn't find result difference among different clocksources.
>
> IOW, this seems to clearly point to ACPI Cx causing it.
>
> Both Corrado and me have been thinking that one should try skipping all
> bigger-latency ACPI Cx states whenever there's an ongoing I/O request where an
> immediate reply interrupt is expected.
That's a good idea.

>
> I've been investigating this a bit, and interesting parts would perhaps include
> . kernel/pm_qos_params.c
> . drivers/cpuidle/governors/menu.c (which acts on the ACPI _cx state
> structs as configured by drivers/acpi/processor_idle.c)
> . and e.g. the wait_for_completion_timeout() part in drivers/ata/libata-core.c
> (or other sources in case of other disk I/O mechanisms)
>
> One way to do some quick (and dirty!!) testing would be to set a flag
> before calling wait_for_completion_timeout() and testing for this flag in
> drivers/cpuidle/governors/menu.c and then skip deeper Cx states
> conditionally.
>
> As a very quick test, I tried a
> while :; do :; done
> loop in shell and renicing shell to 19 (to keep my CPU out of ACPI idle),
> but bonnie -s 100 results initially looked promising yet turned out to
> be inconsistent. The real way to test this would be idle=poll.
> My test system was Athlon XP with /proc/acpi/processor/CPU0/power
> latencies of 000 and 100 (the maximum allowed value, BTW) for C1/C2.
>
> If the wait_for_completion_timeout() flag testing turns out to help,
> then one might intend to use the pm_qos infrastructure to indicate
> these conditions, however it might be too bloated for such a
> purpose, a relatively simple (read: fast) boolean flag mechanism
> could be better.
>
> Plus one could then create a helper function which figures out a
> "pretty fast" Cx state (independent of specific latency times!).
> But when introducing this mechanism, take care to not ignore the
> requirements defined by pm_qos settings!
>
> Oh, and about the places which submit I/O requests where one would have to
> flag this: are they in any way correlated with the scheduler I/O wait
> value? Would the I/O wait mechanism be a place to more easily and centrally
> indicate that we're waiting for a request to come back in "very soon"?
> OTOH I/O requests may have vastly differing delay expectations,
> thus specifically only short-term expected I/O replies should be flagged,
> otherwise we're wasting lots of ACPI deep idle opportunities.
Another issue is that we might submit an I/O request on CPU A, but the corresponding
interrupt is sent to CPU B. That's common. So the softirq on CPU B would send
an IPI to CPU A to schedule the process to run on CPU A and finish the I/O.

2009-07-30 06:27:56

by Yanmin Zhang

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

On Tue, 2009-07-28 at 17:00 +0800, Zhang, Yanmin wrote:
> On Tue, 2009-07-28 at 09:20 +0200, Corrado Zoccolo wrote:
> > Hi,
> > On Tue, Jul 28, 2009 at 4:42 AM, Zhang,
> > Yanmin<[email protected]> wrote:
> > > On Mon, 2009-07-27 at 09:33 +0200, Andreas Mohr wrote:
> > >> Hi,
> > >>
> > >> > When running a fio workload, I found sometimes cpu C state has
> > >> > big impact on the result. Mostly, fio is a disk I/O workload
> > >> > which doesn't spend much time with cpu, so cpu switch to C2/C3
> > >> > freqently and the latency is big.
> > >>
> > >> Rather than inventing ways to limit ACPI Cx state usefulness, we should
> > >> perhaps be thinking of what's wrong here.
> > > Andreas,
> > >
> > > Thanks for your kind comments.
> > >
> > >>
> > >> And your complaint might just fit into a thought I had recently:
> > >> are we actually taking ACPI Cx exit latency into account, for timers???
> > > I tried both tickless kernel and non-tickless kernels. The result is similiar.
> > >
> > > Originally, I also thought it's related to timer. As you know, I/O block layer
> > > has many timers. Such timers don't expire normally. For example, an I/O request
> > > is submitted to driver and driver delievers it to disk and hardware triggers
> > > an interrupt after finishing I/O. Mostly, the I/O submit and interrupt, not
> > > the timer, drive the I/O.
> > >
> > >>
> > >> If we program a timer to fire at some point, then it is quite imaginable
> > >> that any ACPI Cx exit latency due to the CPU being idle at that moment
> > >> could add to actual timer trigger time significantly.
> > >>
> > >> To combat this, one would need to tweak the timer expiration time
> > >> to include the exit latency. But of course once the CPU is running
> > >> again, one would need to re-add the latency amount (read: reprogram the
> > >> timer hardware, ugh...) to prevent the timer from firing too early.
> > >>
> > >> Given that one would need to reprogram timer hardware quite often,
> > >> I don't know whether taking Cx exit latency into account is feasible.
> > >> OTOH analysis of the single next timer value and actual hardware reprogramming
> > >> would have to be done only once (in ACPI sleep and wake paths each),
> > >> thus it might just turn out to be very beneficial after all
> > >> (minus prolonging ACPI Cx path activity and thus aggravating CPU power
> > >> savings, of course).
> > >>
> > >> Arjan mentioned examples of maybe 10us for C2 and 185us for C3/C4 in an
> > >> article.
> > >>
> > >> OTOH even 185us is only 0.185ms, which, when compared to disk seek
> > >> latency (around 7ms still, except for SSD), doesn't seem to be all that much.
> > >> Or what kind of ballpark figure do you have for percentage of I/O
> > >> deterioration?
> > > I have lots of FIO sub test cases which test I/O on single disk and JBOD (a disk
> > > bos which mostly has 12~13 disks) on nahelam machines. Your analysis on disk seek
> > > is reasonable. I found sequential buffered read has the worst regression while rand
> > > read is far better. For example, I start 12 processes per disk and every disk has 24
> > > 1-G files. There are 12 disks. The sequential read fio result is about 593MB/second
> > > with idle=poll, and about 375MB/s without idle=poll. Read block size is 4KB.
> > >
> > > Another exmaple is single fio direct seqential read (block size is 4K) on a single
> > > SATA disk. The result is about 28MB/s without idle=poll and about 32.5MB with
> > > idle=poll.
> > >
> > > How did I find C state has impact on disk I/O result? Frankly, I found a regression
> > > between kernel 2.6.27 and 2.6.28. Bisect located a nonstop tsc patch, but the patch
> > > is quite good. I found the patch changes the default clocksource from hpet to
> > > tsc. Then, I tried all clocksources and got the best result with acpi_pm clocksource.
> > > But oprofile data shows acpi_pm has more cpu utilization. clocksource jiffies has
> > > worst result but least cpu utilization. As you know, fio calls gettimeofday frequently.
> > > Then, I tried boot parameter processor.max_cstate and idle=poll.
> > > I get the similar result with processor.max_cstate=1 like the one with idle=poll.
> > >
> >
> > Is it possible that the different bandwidths figures are due to
> > incorrect timing, instead of C-state latencies?
> I'm not sure.
>
> > Entering a deep C state can cause strange things to timers: some of
> > them, especially tsc, become unreliable.
> > Maybe the patch you found that re-enables tsc is actually wrong for
> > your machine, for which tsc is unreliable in deep C states.
> I'm using a SDV machine, not an official product. But it's rare that cpuid
> reports non-stop tsc feature while it doesn't support it.
>
> I tried different clocksources. For exmaple, I could get a better (30%) result with
> hpet. With hpet, cpu utilization is about 5~8%. Function hpet_read uses too much cpu
> time. With tsc, cpu utilization is about 2~3%. I think more cpu utilization causes fewer
> C state transitions.
>
> With idle=poll, the result is about 10% better than the one of hpet. If using idle=poll,
> I didn't find result difference among different clocksources.
>
> >
> > > I also run the testing on 2 stoakley machines and don't find such issues.
> > > /proc/acpi/processor/CPUXXX/power shows stoakley cpu only has C1.
> > >
> > >> I'm wondering whether we might have an even bigger problem with disk I/O
> > >> related to this than just the raw ACPI exit latency value itself.
> > > We might have. I'm still doing more testing. With Venki's tool (write/read MSR registers),
> > > I collected some C state switch stat.
> > >
> > You can see the latencies (expressed in us) on your machine with:
> > [root@localhost corrado]# cat
> > /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
> > 0
> > 0
> > 1
> > 133
> >
> > Can you post your numbers, to see if they are unusually high?
> [ymzhang@lkp-ne02 ~]$ cat /proc/acpi/processor/CPU0/power
> active state: C0
> max_cstate: C8
> maximum allowed latency: 2000000000 usec
> states:
> C1: type[C1] promotion[--] demotion[--] latency[003] usage[00001661] duration[00000000000000000000]
> C2: type[C3] promotion[--] demotion[--] latency[205] usage[00000687] duration[00000000000000732028]
> C3: type[C3] promotion[--] demotion[--] latency[245] usage[00011509] duration[00000000000115186065]
>
> [ymzhang@lkp-ne02 ~]$ cat /sys/devices/system/cpu/cpu0/cpuidle/state*/latency
> 0
> 3
> 205
> 245
>
> >
> > > Current cpuidle has a good consideration on cpu utilization, but doesn't have
> > > consideration on devices. So with I/O delivery and interrupt drive model
> > > with little cpu utilization, performance might be hurt if C state exit has a long
> > > latency.
Another interesting test with netperf shows similar behavior. I start 1 netperf client
and bind the client and server to different physical CPUs to run a UDP-RR-1 loopback test.
The result is about 54000 without idle=poll, while it is about 88000 with idle=poll.

If I start CPU_NUM netperf clients, there is no such issue, because all CPUs are busy.

2009-07-31 03:43:07

by Robert Hancock

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

On 07/28/2009 04:11 AM, Andreas Mohr wrote:
> Hi,
>
> On Tue, Jul 28, 2009 at 05:00:35PM +0800, Zhang, Yanmin wrote:
>> I tried different clocksources. For exmaple, I could get a better (30%) result with
>> hpet. With hpet, cpu utilization is about 5~8%. Function hpet_read uses too much cpu
>> time. With tsc, cpu utilization is about 2~3%. I think more cpu utilization causes fewer
>> C state transitions.
>>
>> With idle=poll, the result is about 10% better than the one of hpet. If using idle=poll,
>> I didn't find result difference among different clocksources.
>
> IOW, this seems to clearly point to ACPI Cx causing it.
>
> Both Corrado and me have been thinking that one should try skipping all
> bigger-latency ACPI Cx states whenever there's an ongoing I/O request where an
> immediate reply interrupt is expected.
>
> I've been investigating this a bit, and interesting parts would perhaps include
> . kernel/pm_qos_params.c
> . drivers/cpuidle/governors/menu.c (which acts on the ACPI _cx state
> structs as configured by drivers/acpi/processor_idle.c)
> . and e.g. the wait_for_completion_timeout() part in drivers/ata/libata-core.c
> (or other sources in case of other disk I/O mechanisms)
>
> One way to do some quick (and dirty!!) testing would be to set a flag
> before calling wait_for_completion_timeout() and testing for this flag in
> drivers/cpuidle/governors/menu.c and then skip deeper Cx states
> conditionally.
>
> As a very quick test, I tried a
> while :; do :; done
> loop in shell and renicing shell to 19 (to keep my CPU out of ACPI idle),
> but bonnie -s 100 results initially looked promising yet turned out to
> be inconsistent. The real way to test this would be idle=poll.
> My test system was Athlon XP with /proc/acpi/processor/CPU0/power
> latencies of 000 and 100 (the maximum allowed value, BTW) for C1/C2.
>
> If the wait_for_completion_timeout() flag testing turns out to help,
> then one might intend to use the pm_qos infrastructure to indicate
> these conditions, however it might be too bloated for such a
> purpose, a relatively simple (read: fast) boolean flag mechanism
> could be better.
>
> Plus one could then create a helper function which figures out a
> "pretty fast" Cx state (independent of specific latency times!).
> But when introducing this mechanism, take care to not ignore the
> requirements defined by pm_qos settings!
>
> Oh, and about the places which submit I/O requests where one would have to
> flag this: are they in any way correlated with the scheduler I/O wait
> value? Would the I/O wait mechanism be a place to more easily and centrally
> indicate that we're waiting for a request to come back in "very soon"?
> OTOH I/O requests may have vastly differing delay expectations,
> thus specifically only short-term expected I/O replies should be flagged,
> otherwise we're wasting lots of ACPI deep idle opportunities.

Did the results show a big difference in performance between maximum C2
and maximum C3? The thing with C3 is that it will likely have some
interference with bus-master DMA activity, as the CPU has to wake up at
least partially before the SATA controller can complete DMA operations,
which will likely stall the controller for some period of time. There
would be an argument for avoiding deep C-states which can't
handle snooping while IO is in progress and DMA will shortly be occurring.

2009-07-31 07:06:41

by Yanmin Zhang

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

On Thu, 2009-07-30 at 21:43 -0600, Robert Hancock wrote:
> On 07/28/2009 04:11 AM, Andreas Mohr wrote:
> > Hi,
> >
> > On Tue, Jul 28, 2009 at 05:00:35PM +0800, Zhang, Yanmin wrote:
> >> I tried different clocksources. For exmaple, I could get a better (30%) result with
> >> hpet. With hpet, cpu utilization is about 5~8%. Function hpet_read uses too much cpu
> >> time. With tsc, cpu utilization is about 2~3%. I think more cpu utilization causes fewer
> >> C state transitions.
> >>
> >> With idle=poll, the result is about 10% better than the one of hpet. If using idle=poll,
> >> I didn't find result difference among different clocksources.
> >
> > IOW, this seems to clearly point to ACPI Cx causing it.
> >
> > Both Corrado and me have been thinking that one should try skipping all
> > bigger-latency ACPI Cx states whenever there's an ongoing I/O request where an
> > immediate reply interrupt is expected.
> >
> > I've been investigating this a bit, and interesting parts would perhaps include
> > . kernel/pm_qos_params.c
> > . drivers/cpuidle/governors/menu.c (which acts on the ACPI _cx state
> > structs as configured by drivers/acpi/processor_idle.c)
> > . and e.g. the wait_for_completion_timeout() part in drivers/ata/libata-core.c
> > (or other sources in case of other disk I/O mechanisms)
> >
> > One way to do some quick (and dirty!!) testing would be to set a flag
> > before calling wait_for_completion_timeout() and testing for this flag in
> > drivers/cpuidle/governors/menu.c and then skip deeper Cx states
> > conditionally.
> >
> > As a very quick test, I tried a
> > while :; do :; done
> > loop in shell and renicing shell to 19 (to keep my CPU out of ACPI idle),
> > but bonnie -s 100 results initially looked promising yet turned out to
> > be inconsistent. The real way to test this would be idle=poll.
> > My test system was Athlon XP with /proc/acpi/processor/CPU0/power
> > latencies of 000 and 100 (the maximum allowed value, BTW) for C1/C2.
> >
> > If the wait_for_completion_timeout() flag testing turns out to help,
> > then one might intend to use the pm_qos infrastructure to indicate
> > these conditions, however it might be too bloated for such a
> > purpose, a relatively simple (read: fast) boolean flag mechanism
> > could be better.
> >
> > Plus one could then create a helper function which figures out a
> > "pretty fast" Cx state (independent of specific latency times!).
> > But when introducing this mechanism, take care to not ignore the
> > requirements defined by pm_qos settings!
> >
> > Oh, and about the places which submit I/O requests where one would have to
> > flag this: are they in any way correlated with the scheduler I/O wait
> > value? Would the I/O wait mechanism be a place to more easily and centrally
> > indicate that we're waiting for a request to come back in "very soon"?
> > OTOH I/O requests may have vastly differing delay expectations,
> > thus specifically only short-term expected I/O replies should be flagged,
> > otherwise we're wasting lots of ACPI deep idle opportunities.
>
> Did the results show a big difference in performance between maximum C2
> and maximum C3?
No big difference. I tried different max C-states via processor.max_cstate.
Mostly, processor.max_cstate=1 gets a result similar to idle=poll.

> Thing with C3 is that it likely will have some
> interference with bus-master DMA activity as the CPU has to wake up at
> least partially before the SATA controller can complete DMA operations,
> which will likely stall the controller for some period of time. There
> would be an argument for avoiding going into deep C-states which can't
> handle snooping while IO is in progress and DMA will shortly be occurring..

2009-07-31 08:07:31

by Andreas Mohr

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

Hi,

On Fri, Jul 31, 2009 at 03:06:46PM +0800, Zhang, Yanmin wrote:
> On Thu, 2009-07-30 at 21:43 -0600, Robert Hancock wrote:
> > On 07/28/2009 04:11 AM, Andreas Mohr wrote:
> > > Oh, and about the places which submit I/O requests where one would have to
> > > flag this: are they in any way correlated with the scheduler I/O wait
> > > value? Would the I/O wait mechanism be a place to more easily and centrally
> > > indicate that we're waiting for a request to come back in "very soon"?
> > > OTOH I/O requests may have vastly differing delay expectations,
> > > thus specifically only short-term expected I/O replies should be flagged,
> > > otherwise we're wasting lots of ACPI deep idle opportunities.
> >
> > Did the results show a big difference in performance between maximum C2
> > and maximum C3?
> No big difference. I tried different max cstate by processor.max_cstate.
> Mostly, processor.max_cstate=1 could get the similiar result like idle=poll.

OK, but I'd say that this doesn't mean that we should implement a
hard-coded mechanism which simply says "in such cases, don't do anything > C1".
Instead we should strive for a far-reaching _generic_ mechanism
which gathers average latencies of various I/O activities/devices
and then uses some formula to determine the maximum (not necessarily ACPI)
idle latency that we're willing to endure (e.g. average device I/O reply latency
divided by 10 or so).
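
Roughly along these lines (a made-up, kernel-style sketch; neither the
struct nor the hooks exist today):

struct io_latency_stats {
	u64 avg_reply_ns;	/* exponentially weighted moving average */
};

/* called with each measured completion latency for a device */
static void io_latency_update(struct io_latency_stats *s, u64 sample_ns)
{
	/* EWMA with 1/8 weight for new samples */
	s->avg_reply_ns = s->avg_reply_ns - (s->avg_reply_ns >> 3)
			  + (sample_ns >> 3);
}

static u64 max_idle_exit_latency_ns(const struct io_latency_stats *s)
{
	/* willing to spend at most ~1/10th of the typical reply time */
	return s->avg_reply_ns / 10;
}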

And in addition to this, we should also take into account (read: skip)
any idle states which kill busmaster DMA completely
(in case of busmaster DMA I/O activities, that is).

_Lots_ of very nice opportunities for improvement here, I'd say...
(in the 5, 10 or even 40% range in the case of certain network I/O)

Andreas

2009-07-31 14:40:07

by Andi Kleen

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

Andreas Mohr <[email protected]> writes:

> Instead we should strive for a far-reaching _generic_ mechanism
> which gathers average latencies of various I/O activities/devices
> and then uses some formula to determine the maximum (not necessarily ACPI)
> idle latency that we're willing to endure (e.g. average device I/O reply latency
> divided by 10 or so).

The interrupt heuristics in the menu cpuidle governor are already
attempting this, based on interrupt rates (or rather
wakeup rates), which are supposed to roughly correspond to IO rates
and scheduling events together.

Apparently that doesn't work in this case. The challenge would
be to find out why and to improve the menu algorithm to deal with it.
I doubt a completely new mechanism is needed or makes sense.

> And in addition to this, we should also take into account (read: skip)
> any idle states which kill busmaster DMA completely
> (in case of busmaster DMA I/O activities, that is)

This is already done.

-Andi
--
[email protected] -- Speaking for myself only.

2009-07-31 15:00:00

by Michael S. Zick

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

On Fri July 31 2009, Andi Kleen wrote:
> Andreas Mohr <[email protected]> writes:
>
> > Instead we should strive for a far-reaching _generic_ mechanism
> > which gathers average latencies of various I/O activities/devices
> > and then uses some formula to determine the maximum (not necessarily ACPI)
> > idle latency that we're willing to endure (e.g. average device I/O reply latency
> > divided by 10 or so).
>
> The interrupt heuristics in the menu cpuidle governour are already
> attempting this, based on interrupt rates (or rather
> wakeup rates) which are supposed to roughly correspond with IO rates
> and scheduling events together.
>
> Apparently that doesn't work in this case. The challenge would
> be to find out why and improve the menu algorithm to deal with it.
> I doubt a completely new mechanism is needed or makes sense.
>
> > And in addition to this, we should also take into account (read: skip)
> > any idle states which kill busmaster DMA completely
> > (in case of busmaster DMA I/O activities, that is)
>
> This is already done.
>

Almost - the VIA C7-M needs a bit of kernel command line help,
but that should be easily fixable when I or one of the VIA support people get around to it.
(Bus snoops are only fully supported in C0 and C1, but idle=halt takes care of that.)

Mike
> -Andi

2009-07-31 15:14:37

by Len Brown

[permalink] [raw]
Subject: Re: Dynamic configure max_cstate

> And in addition to this, we should also take into account (read: skip)
> any idle states which kill busmaster DMA completely
> (in case of busmaster DMA I/O activities, that is).

It isn't so simple.
This is system specific.

In the old days, a c3-type C-state would lock down the bus
in order to assure no DMA could pass by to memory
before the processor could wake up to snoop.

Then a few years ago the hardware would allow us to
enter C3-type C-states, but transparently "pop-up"
into C2 to retire the snoop activity without ever
waking the processor to C0. This was good b/c
it was more efficient than waking to C0, but bad
b/c the OS could not easily tell if it actually
got the C3 time it requested, or if it was actually
spending a bunch of time demoted to C2...

In the most recent hardware, the core's cache is flushed
in deep C-states so that the core need not be woken
at all to snoop DMA activity.

Indeed, Yanmin's Nehalem box advertises two C3-type C-states,
but in reality, Nehalem doesn't have _any_ C3-type C-states,
only C2-type. The BIOS advertises C3-type C-states
to not break the installed base, which uses the presence
of C3-type C-states to work around the broken LAPIC timer.

I think the issue on the system at hand is waking
up the processor in response to IO interrupt break events.
I.e. Linux does a good job with timer interrupts, but
isn't so smart about using IO interrupts for demoting
C-states. Arjan is looking at fixing this.

cheers,
Len Brown, Intel Open Source Technology Center

2009-07-31 17:34:08

by Pallipadi, Venkatesh

[permalink] [raw]
Subject: RE: Dynamic configure max_cstate



>-----Original Message-----
>From: Andi Kleen [mailto:[email protected]]
>Sent: Friday, July 31, 2009 7:40 AM
>To: Zhang, Yanmin
>Cc: Robert Hancock; Corrado Zoccolo; LKML;
>[email protected]; Pallipadi, Venkatesh
>Subject: Re: Dynamic configure max_cstate
>
>Andreas Mohr <[email protected]> writes:
>
>> Instead we should strive for a far-reaching _generic_ mechanism
>> which gathers average latencies of various I/O activities/devices
>> and then uses some formula to determine the maximum (not
>necessarily ACPI)
>> idle latency that we're willing to endure (e.g. average
>device I/O reply latency
>> divided by 10 or so).
>
>The interrupt heuristics in the menu cpuidle governour are already
>attempting this, based on interrupt rates (or rather
>wakeup rates) which are supposed to roughly correspond with IO rates
>and scheduling events together.
>
>Apparently that doesn't work in this case. The challenge would
>be to find out why and improve the menu algorithm to deal with it.
>I doubt a completely new mechanism is needed or makes sense.
>

Yes. cpuidle's attempt at guessing the interrupt rate is not working here.
I got this test running on a test system here, and it looks like it's not just
cpuidle that causes problems here.
I am still collecting more data, but from what I have right now, this is what I see:
- cpuidle and deep C-state usage reduce the performance here, as has been
discussed in this thread.
- The cpufreq ondemand governor also has a problem with the workload, as it runs the
CPU mostly at a lower frequency (since CPU utilization is hardly over 20%), and switching the
CPUs to a high frequency increases the performance.
- It also depends on where fio and the ahci interrupt handler run. It looks like,
for maximum performance, they have to run on different CPUs sharing a cache.

So, getting this workload to give the best performance by default will be a
major challenge :-). Another thing that will be interesting to look at is
performance/power for this workload, but I haven't ventured into that
territory yet.

Thanks,
Venki-