LinuxLists.cc - schedutil issue with serial workloads

2020-06-04 21:56:10

Subject: schedutil issue with serial workloads

Hello,

this is a question/bugreport about behavior of schedutil on serial workloads
such as rsync, or './configure', or 'make install'. These workloads are
such that there's no single task that takes a substantial portion of CPU
time, but at any moment there's at least one runnable task, and overall
the workload is compute-bound. To run the workload efficiently, cpufreq
governor should select a high frequency.

Assume the system is idle except for the workload in question.

Sadly, schedutil will select the lowest frequency, unless the workload is
confined to one core with taskset (in which case it will select the
highest frequency, correctly though somewhat paradoxically).

This sounds like it should be a known problem, but I couldn't find any
mention of it in the documentation.

I was able to replicate the effect with a pair of 'ping-pong' programs
that get a token, burn some cycles to simulate work, and pass the token.
Thus, each program has 50% CPU utilization. To repeat my test:

gcc -O2 pingpong.c -o pingpong
mkfifo ping
mkfifo pong
taskset -c 0 ./pingpong 1000000 < ping > pong &
taskset -c 1 ./pingpong 1000000 < pong > ping &
echo > ping

#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
    unsigned i, n;
    sscanf(argv[1], "%u", &n);
    for (;;) {
        char c;
        read(0, &c, 1);
        for (i = n; i; i--)
            asm("" :: "r"(i));
        write(1, &c, 1);
    }
}

Alexander

2020-06-05 16:54:09

by Wysocki, Rafael J

Subject: Re: schedutil issue with serial workloads

On 6/4/2020 11:29 PM, Alexander Monakov wrote:
> Hello,

Hi,

Let's make more people see your report.

+Peter, Giovanni, Quentin, Juri, Valentin, Vincent, Doug, and linux-pm.

> this is a question/bugreport about behavior of schedutil on serial workloads
> such as rsync, or './configure', or 'make install'. These workloads are
> such that there's no single task that takes a substantial portion of CPU
> time, but at any moment there's at least one runnable task, and overall
> the workload is compute-bound. To run the workload efficiently, cpufreq
> governor should select a high frequency.
>
> Assume the system is idle except for the workload in question.
>
> Sadly, schedutil will select the lowest frequency, unless the workload is
> confined to one core with taskset (in which case it will select the
> highest frequency, correctly though somewhat paradoxically).

That's because the CPU utilization generated by the workload on all CPUs
is small.

Confining it to one CPU causes the utilization of this one to grow and
so schedutil selects a higher frequency for it.

> This sounds like it should be a known problem, but I couldn't find any
> mention of it in the documentation.

Well, what would you expect to happen instead of what you see?

> I was able to replicate the effect with a pair of 'ping-pong' programs
> that get a token, burn some cycles to simulate work, and pass the token.
> Thus, each program has 50% CPU utilization. To repeat my test:
>
> gcc -O2 pingpong.c -o pingpong
> mkfifo ping
> mkfifo pong
> taskset -c 0 ./pingpong 1000000 < ping > pong &
> taskset -c 1 ./pingpong 1000000 < pong > ping &
> echo > ping
>
> #include <stdio.h>
> #include <unistd.h>
> int main(int argc, char *argv[])
> {
>     unsigned i, n;
>     sscanf(argv[1], "%u", &n);
>     for (;;) {
>         char c;
>         read(0, &c, 1);
>         for (i = n; i; i--)
>             asm("" :: "r"(i));
>         write(1, &c, 1);
>     }
> }
>
> Alexander

2020-06-05 18:38:30

by Alexander Monakov

Subject: Re: schedutil issue with serial workloads

On Fri, 5 Jun 2020, Rafael J. Wysocki wrote:

> > This sounds like it should be a known problem, but I couldn't find any
> > mention of it in the documentation.
>
> Well, what would you expect to happen instead of what you see?

Not sure why you ask. The named workloads are pretty common, for example on
"build-bot" machines. If you don't work exclusively on the kernel you
probably recognize that on, let's say, more "traditionally engineered"
packages ./configure can take 10x more wall-clock time than subsequent
'make -j $(nproc)', and if schedutil makes the problem worse by
consistently choosing the lowest possible frequency for the configure
phase, that's a huge pitfall that's worth fixing or documenting.

To answer your question, assuming schedutil is intended to become a good
choice for a wide range of use-cases, I'd expect it to choose a high
frequency, ideally the highest, and definitely not the lowest. I think I
was pretty transparent about that in my initial mail. I understand there
is no obvious fix and inventing one may prove difficult.

Short term, better Kconfig help text to help people make a suitable choice
for their workloads would be nice.

I'd also like to point out that current Kconfig is confusingly worded
where it says "If the utilization is frequency-invariant, ...". It can
be interpreted as "the workload is producing the same utilization
irrespective of frequency", but what it actually refers to is the
implementation detail of how the "utilization" metric is interpreted.
Does that sentence need to be in Kconfig help at all?

Thanks.
Alexander

2020-06-05 20:36:45

by Peter Zijlstra

Subject: Re: schedutil issue with serial workloads

On Fri, Jun 05, 2020 at 06:51:12PM +0200, Rafael J. Wysocki wrote:
> On 6/4/2020 11:29 PM, Alexander Monakov wrote:

> > this is a question/bugreport about behavior of schedutil on serial workloads
> > such as rsync, or './configure', or 'make install'. These workloads are
> > such that there's no single task that takes a substantial portion of CPU
> > time, but at any moment there's at least one runnable task, and overall
> > the workload is compute-bound. To run the workload efficiently, cpufreq
> > governor should select a high frequency.
> >
> > Assume the system is idle except for the workload in question.
> >
> > Sadly, schedutil will select the lowest frequency, unless the workload is
> > confined to one core with taskset (in which case it will select the
> > highest frequency, correctly though somewhat paradoxically).
>
> That's because the CPU utilization generated by the workload on all CPUs is
> small.
>
> Confining it to one CPU causes the utilization of this one to grow and so
> schedutil selects a higher frequency for it.

My initial question was why doesn't io-boosting fix this up, but a quick
look at our pipe code shows me that it doesn't seem to use
io_schedule().

That is currently our only means to express 'someone is waiting on us'
to which we then say 'let's hurry up a bit'.

Because, as you've found, if the tasks do not queue up, there is nothing
to push the frequency up.

2020-06-07 17:27:52

by Doug Smythies

Subject: RE: schedutil issue with serial workloads

On 2020.06.05 Rafael J. Wysocki wrote:
> On 6/4/2020 11:29 PM, Alexander Monakov wrote:
> > Hello,
>
> Hi,
>
> Let's make more people see your report.
>
> +Peter, Giovanni, Quentin, Juri, Valentin, Vincent, Doug, and linux-pm.
>
>> this is a question/bugreport about behavior of schedutil on serial workloads
>> such as rsync, or './configure', or 'make install'. These workloads are
>> such that there's no single task that takes a substantial portion of CPU
>> time, but at any moment there's at least one runnable task, and overall
>> the workload is compute-bound. To run the workload efficiently, cpufreq
>> governor should select a high frequency.
>>
>> Assume the system is idle except for the workload in question.
>>
>> Sadly, schedutil will select the lowest frequency, unless the workload is
>> confined to one core with taskset (in which case it will select the
>> highest frequency, correctly though somewhat paradoxically).
>
> That's because the CPU utilization generated by the workload on all CPUs
> is small.
>
> Confining it to one CPU causes the utilization of this one to grow and
> so schedutil selects a higher frequency for it.
>
>> This sounds like it should be a known problem, but I couldn't find any
>> mention of it in the documentation.

Yes, this issue is very well known, and has been discussed on this list
several times, going back many years (and I likely missed some of the
discussions). In recent years Giovanni's git "make test" has
been the "go-to" example for this. From that test, which has run-to-run
variability due to disk I/O, I made some tests that vary PIDs per second
versus time. Giovanni's recent work on frequency invariance made a huge
difference for the schedutil response to this type of serialized workflow.

For my part of it:
I only ever focused on a new-PID-per-work-packet serialized workflow.
Since my last testing on this subject in January, I fell behind with
system issues and infrastructure updates.

Your workflow example is fascinating and rather revealing.
I will make use of it moving forward. Thank you.

Yes, schedutil basically responds poorly as it did for PIDs/second
based workflow before frequency invariance, but...(digression follows)...

Typically, I merely set the performance governor whenever I know
I will be doing serialized workflow, or whenever I just want the
job done the fastest (e.g. kernel compile).

If I use performance mode (hwp disabled, either active or passive,
doesn't matter), then I can not get the CPU frequency to max,
even if I set:

$ grep . /sys/devices/system/cpu/intel_pstate/m??_perf_pct
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:100

I have to increase EPB all the way to 1 to get to max CPU frequency.
There also is extreme hysteresis, as I have to go back to 9 for
the frequency to drop again.

The above was an i5-9600K. My much older i7-9600K, works fine
with default EPB of 6. I had not previously realized there was
so much difference between processors and EPB.

I don't have time to dig deeper right now, but will in future.

>> I was able to replicate the effect with a pair of 'ping-pong' programs
>> that get a token, burn some cycles to simulate work, and pass the token.
>> Thus, each program has 50% CPU utilization. To repeat my test:
>>
>> gcc -O2 pingpong.c -o pingpong
>> mkfifo ping
>> mkfifo pong
>> taskset -c 0 ./pingpong 1000000 < ping > pong &
>> taskset -c 1 ./pingpong 1000000 < pong > ping &
>> echo > ping
>>
>> #include <stdio.h>
>> #include <unistd.h>
>> int main(int argc, char *argv[])
>> {
>>     unsigned i, n;
>>     sscanf(argv[1], "%u", &n);
>>     for (;;) {
>>         char c;
>>         read(0, &c, 1);
>>         for (i = n; i; i--)
>>             asm("" :: "r"(i));
>>         write(1, &c, 1);
>>     }
>> }
>>
>> Alexander

It was not obvious to me what the approximate work/sleep frequency would be for
your workflow. For my version of it I made the loop time slower on purpose,
because I could merely adjust "N" to compensate. I measured 100 hertz work/sleep
frequency per CPU, but my pipeline is 6 stages instead of 2.

Just for the record, this is what I did:

doug@s18:~/c$ cat pingpong.c
#include <stdio.h>
#include <unistd.h>
int main(int argc, char *argv[])
{
    unsigned i, n, k;
    sscanf(argv[1], "%u", &n);
    while(1) {
        char c;
        read(0, &c, 1);
        for (i = n; i; i--){
            k = i;
            k = k++;
        }
        write(1, &c, 1);
    }
}

Compiled with:

cc pingpong.c -o pingpong

and run with (on purpose, I did not force CPU affinity,
as I wanted schedutil to decide (when it was the
governor, at least)):

#! /bin/dash
#
# ping-pong-test Smythies 2019.06.06
# serialized workflow, but same PID.
# from Alexander, but modified.
#

# because I always forget from last time
killall pingpong

rm --force pong1
rm --force pong2
rm --force pong3
rm --force pong4
rm --force pong5
rm --force pong6

mkfifo pong1
mkfifo pong2
mkfifo pong3
mkfifo pong4
mkfifo pong5
mkfifo pong6
~/c/pingpong 1000000 < pong1 > pong2 &
~/c/pingpong 1000000 < pong2 > pong3 &
~/c/pingpong 1000000 < pong3 > pong4 &
~/c/pingpong 1000000 < pong4 > pong5 &
~/c/pingpong 1000000 < pong5 > pong6 &
~/c/pingpong 1000000 < pong6 > pong1 &
echo > pong1

To measure work/sleep frequency, I made a
version that would only run, say, 10,000 times
and timed it.

... Doug