2010-01-30 23:46:22

by Shawn Bohrer

Subject: High scheduler wake up times

Hello,

Currently we have a workload that depends on around 50 processes that
wake up 1000 times a second, do a small amount of work, and go back to
sleep. This works great on RHEL 5 (2.6.18-164.6.1.el5), but on recent
kernels we are unable to achieve 1000 iterations per second. Using the
simple test application below on RHEL 5 (2.6.18-164.6.1.el5) I can run
500 of these processes and still achieve 999.99 iterations per second.
Running just 10 of these processes on the same machine with 2.6.32.6
produces results like:

...
Iterations Per Sec: 905.659667
Iterations Per Sec: 805.099068
Iterations Per Sec: 925.195578
Iterations Per Sec: 759.310773
Iterations Per Sec: 702.849261
Iterations Per Sec: 782.157292
Iterations Per Sec: 917.138031
Iterations Per Sec: 834.770391
Iterations Per Sec: 850.543755
...

I've tried playing with some of the CFS tunables in /proc/sys/kernel/
without success. Are there any suggestions on how to achieve the
results we are looking for using a recent kernel?

Thanks,
Shawn


#include <sys/epoll.h>
#include <sys/time.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create(1);
    int i;
    volatile int j; /* volatile so the "work" loop isn't optimized away */
    struct timeval tv;
    unsigned long start, end;
    const unsigned int count = 60000;

    while (1) {
        gettimeofday(&tv, NULL);
        start = tv.tv_sec * 1000000 + tv.tv_usec;

        for (i = 0; i < count; ++i) {
            /* No fds are registered, so this is just a ~1 ms sleep. */
            if (epoll_wait(epfd, 0, 1, 1) == -1)
                perror("epoll failed");

            for (j = 0; j < 10000; ++j)
                /* simulate work */;
        }
        gettimeofday(&tv, NULL);
        end = tv.tv_sec * 1000000 + tv.tv_usec;

        printf("Iterations Per Sec: %f\n",
               count / ((double)(end - start) / 1000000));
    }

    close(epfd);
}


2010-01-30 23:55:55

by Arjan van de Ven

Subject: Re: High scheduler wake up times

On Sat, 30 Jan 2010 17:45:51 -0600
Shawn Bohrer <[email protected]> wrote:

> Hello,
>
> Currently we have a workload that depends on around 50 processes that
> wake up 1000 times a second, do a small amount of work, and go back to
> sleep. This works great on RHEL 5 (2.6.18-164.6.1.el5), but on recent
> kernels we are unable to achieve 1000 iterations per second. Using the
> simple test application below on RHEL 5 (2.6.18-164.6.1.el5) I can run
> 500 of these processes and still achieve 999.99 iterations per second.
> Running just 10 of these processes on the same machine with 2.6.32.6
> produces results like:
>
> ...
> Iterations Per Sec: 905.659667
> Iterations Per Sec: 805.099068
> Iterations Per Sec: 925.195578
> Iterations Per Sec: 759.310773
> Iterations Per Sec: 702.849261
> Iterations Per Sec: 782.157292
> Iterations Per Sec: 917.138031
> Iterations Per Sec: 834.770391
> Iterations Per Sec: 850.543755
> ...
>
> I've tried playing with some of the CFS tunables in /proc/sys/kernel/
> without success. Are there any suggestions on how to achieve the
> results we are looking for using a recent kernel?

I'll play with it a bit, but I wonder idly what kind of machine this is
on? (number and types of CPUs)

2010-01-31 00:01:54

by Arjan van de Ven

Subject: Re: High scheduler wake up times

On Sat, 30 Jan 2010 17:45:51 -0600
Shawn Bohrer <[email protected]> wrote:
>
> int main(void)
> {
>     int epfd = epoll_create(1);
>     int i;
>     volatile int j; /* volatile so the "work" loop isn't optimized away */
>     struct timeval tv;
>     unsigned long start, end;
>     const unsigned int count = 60000;
>
>     while (1) {
>         gettimeofday(&tv, NULL);
>         start = tv.tv_sec * 1000000 + tv.tv_usec;
>
>         for (i = 0; i < count; ++i) {
>             /* No fds are registered, so this is just a ~1 ms sleep. */
>             if (epoll_wait(epfd, 0, 1, 1) == -1)
>                 perror("epoll failed");
>
>             for (j = 0; j < 10000; ++j)
>                 /* simulate work */;
>         }
>         gettimeofday(&tv, NULL);
>         end = tv.tv_sec * 1000000 + tv.tv_usec;
>
>         printf("Iterations Per Sec: %f\n",
>                count / ((double)(end - start) / 1000000));
>     }
>
>     close(epfd);
> }

Btw, do you have an equivalent program that uses poll instead of epoll,
by chance?
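
i.e. something along these lines (an untested sketch; poll() with no
fds serves purely as the 1 ms sleep, mirroring the epoll version):

#include <poll.h>
#include <sys/time.h>
#include <stdio.h>

int main(void)
{
    int i;
    volatile int j; /* volatile so the "work" loop isn't optimized away */
    struct timeval tv;
    unsigned long start, end;
    const unsigned int count = 60000;

    while (1) {
        gettimeofday(&tv, NULL);
        start = tv.tv_sec * 1000000 + tv.tv_usec;

        for (i = 0; i < count; ++i) {
            /* No fds to watch; poll() is used only for its 1 ms timeout. */
            if (poll(NULL, 0, 1) == -1)
                perror("poll failed");

            for (j = 0; j < 10000; ++j)
                /* simulate work */;
        }
        gettimeofday(&tv, NULL);
        end = tv.tv_sec * 1000000 + tv.tv_usec;

        printf("Iterations Per Sec: %f\n",
               count / ((double)(end - start) / 1000000));
    }
}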

2010-01-31 00:10:26

by Arjan van de Ven

Subject: Re: High scheduler wake up times

On Sat, 30 Jan 2010 17:45:51 -0600
Shawn Bohrer <[email protected]> wrote:

> Hello,
>
> Currently we have a workload that depends on around 50 processes that
> wake up 1000 times a second, do a small amount of work, and go back to
> sleep. This works great on RHEL 5 (2.6.18-164.6.1.el5), but on recent
> kernels we are unable to achieve 1000 iterations per second. Using the
> simple test application below on RHEL 5 (2.6.18-164.6.1.el5) I can run
> 500 of these processes and still achieve 999.99 iterations per second.
> Running just 10 of these processes on the same machine with 2.6.32.6
> produces results like:
> [...]

There's an issue with your expectation, btw. What your application
does in practice is:

<wait 1 millisecond>
<do a bunch of work>
<wait 1 millisecond>
<do a bunch of work>
etc.

You would only be able to get close to 1000 per second if "bunch of
work" is nothing... but it isn't. So let's assume "bunch of work" is
100 microseconds: the basic period of your program (ignoring any
costs/overhead in the implementation) is 1.1 milliseconds, which is
approximately 909 per second, not 1000!

I suspect that the 1000 you get on RHEL5 is a bug in the RHEL5 kernel
where it gives you a shorter delay than what you asked for, since it's
clearly not a correct number to get.

(And yes, older kernels had such rounding bugs; current kernels go to
great lengths to give applications *exactly* the delay they are
asking for....)



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-01-31 00:30:23

by Shawn Bohrer

Subject: Re: High scheduler wake up times

On Sat, Jan 30, 2010 at 03:56:40PM -0800, Arjan van de Ven wrote:
> On Sat, 30 Jan 2010 17:45:51 -0600
> Shawn Bohrer <[email protected]> wrote:
>
> > Hello,
> >
> > Currently we have a workload that depends on around 50 processes that
> > wake up 1000 times a second, do a small amount of work, and go back to
> > sleep. This works great on RHEL 5 (2.6.18-164.6.1.el5), but on recent
> > kernels we are unable to achieve 1000 iterations per second. Using the
> > simple test application below on RHEL 5 (2.6.18-164.6.1.el5) I can run
> > 500 of these processes and still achieve 999.99 iterations per second.
> > Running just 10 of these processes on the same machine with 2.6.32.6
> > produces results like:
> >
> > ...
> > Iterations Per Sec: 905.659667
> > Iterations Per Sec: 805.099068
> > Iterations Per Sec: 925.195578
> > Iterations Per Sec: 759.310773
> > Iterations Per Sec: 702.849261
> > Iterations Per Sec: 782.157292
> > Iterations Per Sec: 917.138031
> > Iterations Per Sec: 834.770391
> > Iterations Per Sec: 850.543755
> > ...
> >
> > I've tried playing with some of the CFS tunables in /proc/sys/kernel/
> > without success. Are there any suggestions on how to achieve the
> > results we are looking for using a recent kernel?
>
> I'll play with it a bit, but I wonder idly what kind of machine this
> is on? (number and types of CPUs)

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
stepping : 5
cpu MHz : 2926.203
cache size : 8192 KB
physical id : 1
siblings : 8
core id : 0
cpu cores : 4
apicid : 16
initial apicid : 16
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology tsc_reliable nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips : 5852.40
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

[... 15 further logical-CPU entries trimmed; all are identical X5570
cores: two physical sockets (physical id 0 and 1), 4 cores per socket,
HyperThreading enabled ("siblings : 8"), for 16 logical CPUs total ...]


2010-01-31 00:36:23

by Shawn Bohrer

Subject: Re: High scheduler wake up times

On Sat, Jan 30, 2010 at 04:11:14PM -0800, Arjan van de Ven wrote:
> On Sat, 30 Jan 2010 17:45:51 -0600
> Shawn Bohrer <[email protected]> wrote:
>
> > Hello,
> >
> > Currently we have a workload that depends on around 50 processes that
> > wake up 1000 times a second, do a small amount of work, and go back to
> > sleep. This works great on RHEL 5 (2.6.18-164.6.1.el5), but on recent
> > kernels we are unable to achieve 1000 iterations per second. Using the
> > simple test application below on RHEL 5 (2.6.18-164.6.1.el5) I can run
> > 500 of these processes and still achieve 999.99 iterations per second.
> > Running just 10 of these processes on the same machine with 2.6.32.6
> > produces results like:
> > [...]
>
> There's an issue with your expectation, btw. What your application
> does in practice is:
>
> <wait 1 millisecond>
> <do a bunch of work>
> <wait 1 millisecond>
> <do a bunch of work>
> etc.
>
> You would only be able to get close to 1000 per second if "bunch of
> work" is nothing... but it isn't. So let's assume "bunch of work" is
> 100 microseconds: the basic period of your program (ignoring any
> costs/overhead in the implementation) is 1.1 milliseconds, which is
> approximately 909 per second, not 1000!
>
> I suspect that the 1000 you get on RHEL5 is a bug in the RHEL5 kernel
> where it gives you a shorter delay than what you asked for, since it's
> clearly not a correct number to get.
>
> (And yes, older kernels had such rounding bugs; current kernels go to
> great lengths to give applications *exactly* the delay they are
> asking for....)

I agree that we are currently depending on a bug in epoll. The epoll
implementation currently rounds up to the next jiffy, so specifying a
timeout of 1 ms really just wakes the process up at the next timer tick.
I have a patch to fix epoll by converting it to use
schedule_hrtimeout_range() that I'll gladly send, but I still need a way
to achieve the same thing.
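
For reference, the rounding in question looks approximately like this
in ep_poll() in fs/eventpoll.c circa 2.6.32 (paraphrased from memory,
not a verbatim quote):

    /* A 1 ms timeout with HZ == 1000 becomes 1 jiffy, and
     * schedule_timeout(1) returns at the *next* timer tick,
     * which can be anywhere from 0 to 1 ms away. */
    jtimeout = (timeout < 0 || timeout >= EP_MAX_MSTIMEO) ?
               MAX_SCHEDULE_TIMEOUT : (timeout * HZ + 999) / 1000;
    ...
    jtimeout = schedule_timeout(jtimeout);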

--
Shawn

2010-01-31 00:46:26

by Shawn Bohrer

Subject: Re: High scheduler wake up times

On Sat, Jan 30, 2010 at 06:35:49PM -0600, Shawn Bohrer wrote:
> On Sat, Jan 30, 2010 at 04:11:14PM -0800, Arjan van de Ven wrote:
> > On Sat, 30 Jan 2010 17:45:51 -0600
> > Shawn Bohrer <[email protected]> wrote:
> >
> > > Hello,
> > >
> > > Currently we have a workload that depends on around 50 processes that
> > > wake up 1000 times a second, do a small amount of work, and go back to
> > > sleep. This works great on RHEL 5 (2.6.18-164.6.1.el5), but on recent
> > > kernels we are unable to achieve 1000 iterations per second. Using the
> > > simple test application below on RHEL 5 (2.6.18-164.6.1.el5) I can run
> > > 500 of these processes and still achieve 999.99 iterations per second.
> > > Running just 10 of these processes on the same machine with 2.6.32.6
> > > produces results like:
> > > [...]
> >
> > There's an issue with your expectation, btw. What your application
> > does in practice is:
> >
> > <wait 1 millisecond>
> > <do a bunch of work>
> > <wait 1 millisecond>
> > <do a bunch of work>
> > etc.
> >
> > You would only be able to get close to 1000 per second if "bunch of
> > work" is nothing... but it isn't. So let's assume "bunch of work" is
> > 100 microseconds: the basic period of your program (ignoring any
> > costs/overhead in the implementation) is 1.1 milliseconds, which is
> > approximately 909 per second, not 1000!
> >
> > I suspect that the 1000 you get on RHEL5 is a bug in the RHEL5 kernel
> > where it gives you a shorter delay than what you asked for, since it's
> > clearly not a correct number to get.
> >
> > (And yes, older kernels had such rounding bugs; current kernels go to
> > great lengths to give applications *exactly* the delay they are
> > asking for....)
>
> I agree that we are currently depending on a bug in epoll. The epoll
> implementation currently rounds up to the next jiffy, so specifying a
> timeout of 1 ms really just wakes the process up at the next timer tick.
> I have a patch to fix epoll by converting it to use
> schedule_hrtimeout_range() that I'll gladly send, but I still need a way
> to achieve the same thing.

I guess I should add that I think we could achieve the same effect by
adding a 1 ms (or less) periodic timerfd to our epoll set. However, it
still appears that newer kernels have a much larger scheduler delay, and
I still need a way to fix that in order for us to move to a newer
kernel.
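
Something along these lines is what I have in mind (an untested sketch;
error handling omitted):

#include <sys/epoll.h>
#include <sys/timerfd.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    int epfd = epoll_create(1);
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    struct itimerspec its = {
        .it_value    = { .tv_sec = 0, .tv_nsec = 1000000 }, /* first expiry: 1 ms */
        .it_interval = { .tv_sec = 0, .tv_nsec = 1000000 }, /* then every 1 ms */
    };
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = tfd };
    uint64_t expirations;

    timerfd_settime(tfd, 0, &its, NULL);
    epoll_ctl(epfd, EPOLL_CTL_ADD, tfd, &ev);

    while (1) {
        /* Block indefinitely; the timerfd wakes us every 1 ms. */
        if (epoll_wait(epfd, &ev, 1, -1) > 0)
            read(tfd, &expirations, sizeof(expirations)); /* clear the count */

        /* ... do the periodic work here ... */
    }

    close(tfd);
    close(epfd);
}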

--
Shawn

2010-01-31 00:46:31

by Arjan van de Ven

Subject: Re: High scheduler wake up times

On Sat, 30 Jan 2010 18:35:49 -0600
Shawn Bohrer <[email protected]> wrote:
>
> I agree that we are currently depending on a bug in epoll. The epoll
> implementation currently rounds up to the next jiffy, so specifying a
> timeout of 1 ms really just wakes the process up at the next timer
> tick. I have a patch to fix epoll by converting it to use
> schedule_hrtimeout_range() that I'll gladly send, but I still need a
> way to achieve the same thing.

It's not going to help you; your expectation is incorrect.
You CANNOT get 1000 iterations per second if you do

<wait 1 msec>
<do a bunch of work>
<wait 1 msec>
etc. in a loop

The more accurate (read: not rounding down) the implementation, the
further from 1000 you will get, because to hit 1000 the two actions

<wait 1 msec>
<do a bunch of work>

combined are not allowed to take more than 1000 microseconds of
wallclock time. Assuming "do a bunch of work" takes 100 microseconds,
for you to hit 1000 there would need to be 900 microseconds in a
millisecond... and sadly physics doesn't work that way.

(And that's even ignoring various OS, CPU wakeup and scheduler
contention overheads.)



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-01-31 03:47:44

by Shawn Bohrer

Subject: Re: High scheduler wake up times

On Sat, Jan 30, 2010 at 04:47:16PM -0800, Arjan van de Ven wrote:
> On Sat, 30 Jan 2010 18:35:49 -0600
> Shawn Bohrer <[email protected]> wrote:
> >
> > I agree that we are currently depending on a bug in epoll. The epoll
> > implementation currently rounds up to the next jiffy, so specifying a
> > timeout of 1 ms really just wakes the process up at the next timer
> > tick. I have a patch to fix epoll by converting it to use
> > schedule_hrtimeout_range() that I'll gladly send, but I still need a
> > way to achieve the same thing.
>
> It's not going to help you; your expectation is incorrect.
> You CANNOT get 1000 iterations per second if you do
>
> <wait 1 msec>
> <do a bunch of work>
> <wait 1 msec>
> etc. in a loop
>
> The more accurate (read: not rounding down) the implementation, the
> further from 1000 you will get, because to hit 1000 the two actions

Of course that patch makes my situation worse, which was my point. We
are depending on the _current_ epoll_wait() implementation, which calls
schedule_timeout(1). You do agree that the current epoll_wait()
implementation sleeps less than 1 msec with HZ == 1000, correct? So as
long as:

work + scheduling_overhead < 1 msec

we _should_ be able to achieve 1000 iterations per second. I also
realize that with multiple worker processes I need:

(total_work + scheduling_overhead)/number_cpus < 1 msec
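
(To illustrate with made-up numbers: if each of the 500 processes did
~20 microseconds of work per wakeup, that would be 500 * 20 usec / 16
logical CPUs = 625 usec of work per CPU per 1 msec period, which still
fits.)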

With the old kernel I can run 500 of these processes, and I'm hoping
that I'm simply missing the knob I need to tweak to achieve similar
performance on a recent kernel.

Thanks,
Shawn

2010-01-31 18:27:58

by Arjan van de Ven

Subject: Re: High scheduler wake up times

On Sat, 30 Jan 2010 21:47:18 -0600
Shawn Bohrer <[email protected]> wrote:
>
> Of course that patch makes my situation worse, which was my point. We
> are depending on the _current_ epoll_wait() implementation, which calls
> schedule_timeout(1).

> You do agree that the current epoll_wait()
> implementation sleeps less than 1 msec with HZ == 1000, correct?

I agree with your hypothesis, but I wouldn't call the behavior
correct ;-)

First of all, jiffies-based timeouts are supposed to round *up*, not
down, and second... it should really be just 1 msec.


> With the old kernel I can run 500 of these processes, and I'm hoping
> that I'm simply missing the knob I need to tweak to achieve similar
> performance on a recent kernel.

Can you run powertop during your workload? Maybe you're getting hit by
some C-state exit latencies tilting the rounding over the top just too
many times...


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2010-01-31 20:51:12

by Shawn Bohrer

Subject: Re: High scheduler wake up times

On Sun, Jan 31, 2010 at 10:28:46AM -0800, Arjan van de Ven wrote:
> Can you run powertop during your workload? Maybe you're getting hit by
> some C-state exit latencies tilting the rounding over the top just too
> many times...

Running 50 of the example processes and powertop on 2.6.32.7, with the
performance cpufreq governor:

Cn                Avg residency       P-states (frequencies)
C0 (cpu running)        (24.7%)
polling           0.0ms ( 0.0%)
C1 mwait          0.3ms ( 0.0%)
C3 mwait          0.8ms (75.3%)


Wakeups-from-idle per second : 980.1 interval: 10.0s
no ACPI power usage estimate available

Top causes for wakeups:
76.2% (45066.9) worker_process : sys_epoll_wait (process_timeout)
22.0% (13039.2) <kernel core> : hrtimer_start_range_ns (tick_sched_timer)
1.5% (892.7) kipmi0 : schedule_timeout_interruptible (process_timeout)
0.2% (105.0) <kernel core> : add_timer (smi_timeout)
0.0% ( 10.5) <interrupt> : ata_piix
0.0% ( 10.1) <kernel core> : ipmi_timeout (ipmi_timeout)


I also tried booting with processor.max_cstate=0, which causes powertop
to no longer show any C-state information, but I assumed that would keep
me fixed at C0. Booting with processor.max_cstate=0 didn't seem to make
any difference.

Thanks,
Shawn

2010-02-01 08:51:38

by Peter Zijlstra

Subject: Re: High scheduler wake up times

On Sat, 2010-01-30 at 16:47 -0800, Arjan van de Ven wrote:
> On Sat, 30 Jan 2010 18:35:49 -0600
> Shawn Bohrer <[email protected]> wrote:
> >
> > I agree that we are currently depending on a bug in epoll. The epoll
> > implementation currently rounds up to the next jiffy, so specifying a
> > timeout of 1 ms really just wakes the process up at the next timer
> > tick. I have a patch to fix epoll by converting it to use
> > schedule_hrtimeout_range() that I'll gladly send, but I still need a
> > way to achieve the same thing.
>
> It's not going to help you; your expectation is incorrect.
> You CANNOT get 1000 iterations per second if you do
>
> <wait 1 msec>
> <do a bunch of work>
> <wait 1 msec>
> etc. in a loop
>
> The more accurate (read: not rounding down) the implementation, the
> further from 1000 you will get, because to hit 1000 the two actions
>
> <wait 1 msec>
> <do a bunch of work>
>
> combined are not allowed to take more than 1000 microseconds of
> wallclock time. Assuming "do a bunch of work" takes 100 microseconds,
> for you to hit 1000 there would need to be 900 microseconds in a
> millisecond... and sadly physics doesn't work that way.
>
> (And that's even ignoring various OS, CPU wakeup and scheduler
> contention overheads.)

Right, aside from that, CFS will only (potentially) delay your wakeup if
there's someone else on the CPU at the moment of wakeup, and that's
fully by design; you don't want to fix that, it's bad for throughput.

If you want deterministic wakeup latencies, use an RT scheduling class
(and kernel).

Fwiw, your test proglet gives me:

peter@laptop:~/tmp$ ./epoll
Iterations Per Sec: 996.767947
Iterations Per Sec: 995.424135
Iterations Per Sec: 993.624936

and that's with full contemporary desktop bloat around.

As it stands, it appears you have at least two bugs in your application:
you rely on broken epoll behaviour, and you have incorrect assumptions
about what the regular scheduler class will guarantee you (which is in
fact nothing other than that your application will at some point in the
future receive some service, per POSIX).

Now CFS strives to give you more guarantees than that, but they're soft.
We try to schedule such that your application will receive a
proportional amount of service relative to every other runnable task of
the same nice level (and there's a weighted proportion between nice
levels as well); furthermore we try to service each task at least once
per nr_running*sysctl.kernel.sched_min_granularity_ns. If you see wakeup
latencies an order of magnitude over that, we clearly messed up, but
until that point we're doing ok-ish.
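
(As a made-up illustration of that bound: with 50 runnable tasks and a
hypothetical sched_min_granularity_ns of 2 ms, the worst-case service
interval would be 50 * 2 ms = 100 ms.)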

2010-02-01 19:44:50

by Shawn Bohrer

Subject: Re: High scheduler wake up times

On Mon, Feb 01, 2010 at 09:51:30AM +0100, Peter Zijlstra wrote:
> Right, aside from that, CFS will only (potentially) delay your wakeup if
> there's someone else on the CPU at the moment of wakeup, and that's
> fully by design; you don't want to fix that, it's bad for throughput.
>
> If you want deterministic wakeup latencies, use an RT scheduling class
> (and kernel).

I've confirmed that running my processes as SCHED_FIFO fixes the issue
and allows me to achieve ~999.99 iterations per second.
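
For the record, the change is just something like this minimal sketch
(requires root or CAP_SYS_NICE; the priority value of 1 is an arbitrary
choice):

#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* Move the calling process to the SCHED_FIFO real-time class. */
    struct sched_param sp = { .sched_priority = 1 };

    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
        perror("sched_setscheduler");

    /* ... then run the normal event loop ... */
    return 0;
}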

> As it stands, it appears you have at least two bugs in your application:
> you rely on broken epoll behaviour, and you have incorrect assumptions
> about what the regular scheduler class will guarantee you (which is in
> fact nothing other than that your application will at some point in the
> future receive some service, per POSIX).

Interestingly, I can also achieve ~999.99 iterations per second by
using an infinite epoll timeout and adding a 1 msec periodic timerfd to
the epoll set, while still using SCHED_OTHER.

So it seems I have two solutions when using a new kernel, so I'm
satisfied. I'll see if I can clean up my patch to fix the broken epoll
behavior and send it in.

Thanks,
Shawn