2004-11-12 05:52:30

by dean gaudet

Subject: CONFIG_X86_PM_TIMER is slow

when using CONFIG_X86_PM_TIMER i'm finding that gettimeofday() calls take
2.8us on a p-m 1.4GHz box... which is an order of magnitude slower than
TSC-based solutions.
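
(a trivial loop along these lines is enough to see the per-call cost --
just a sketch, not necessarily the exact benchmark:)

/* sketch: rough per-call cost of gettimeofday() */
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    const long n = 1000000;
    struct timeval t0, t1, tmp;
    long i;

    gettimeofday(&t0, 0);
    for (i = 0; i < n; i++)
        gettimeofday(&tmp, 0);
    gettimeofday(&t1, 0);

    printf("%.3f us per call\n",
           ((t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec)) / n);
    return 0;
}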

on one workload i'm seeing a 7% perf improvement by booting "acpi=off" to
force it to use tsc instead of the PM timer... (the workload calls
gettimeofday too frequently, but i can't change that).

i'm curious why other folks haven't run into this -- is it because most
systems have an HPET timer as well and that's not nearly as bad as the PM
timer?

-dean


2004-11-12 06:04:23

by Willy Tarreau

Subject: Re: CONFIG_X86_PM_TIMER is slow

Hi,

On Thu, Nov 11, 2004 at 09:52:27PM -0800, dean gaudet wrote:
> when using CONFIG_X86_PM_TIMER i'm finding that gettimeofday() calls take
> 2.8us on a p-m 1.4GHz box... which is an order of magnitude slower than
> TSC-based solutions.
>
> on one workload i'm seeing a 7% perf improvement by booting "acpi=off" to
> force it to use tsc instead of the PM timer... (the workload calls
> gettimeofday too frequently, but i can't change that).

I did not test, but this might be interesting.
In fact, what would be very good would be a sort of new select/poll/epoll
syscall with an additional argument pointing to a structure that the
syscall would fill on return with the time of day. This would greatly
reduce the number of calls to gettimeofday() in some programs, and make
use of the time already spent in the syscall itself.

For example, if I could call it like this, it would be really cool :

ret = select_absdate(&in, &out, &excp, &date_timeout, &return_date);

with <date_timeout> the date at which to timeout (instead of the number of
microseconds) and <return_date> the current date upon return.
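
Just to illustrate the idea, here is a rough userspace approximation
(hypothetical names, and with the usual nfds argument added) ; a real
syscall would of course return the time for free instead of paying for
another gettimeofday() :

/* sketch of the proposed interface, emulated in userspace */
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

static int select_absdate(int nfds, fd_set *in, fd_set *out, fd_set *excp,
                          const struct timeval *date_timeout,
                          struct timeval *return_date)
{
    struct timeval now, rel;
    int ret;

    /* turn the absolute deadline into a relative timeout */
    gettimeofday(&now, NULL);
    rel.tv_sec  = date_timeout->tv_sec  - now.tv_sec;
    rel.tv_usec = date_timeout->tv_usec - now.tv_usec;
    if (rel.tv_usec < 0) {
        rel.tv_usec += 1000000;
        rel.tv_sec--;
    }
    if (rel.tv_sec < 0)
        rel.tv_sec = rel.tv_usec = 0;

    ret = select(nfds, in, out, excp, &rel);

    /* in this emulation we still pay for one more call; the real
     * syscall already knows the time and would just copy it out */
    gettimeofday(return_date, NULL);
    return ret;
}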

Cheers,
Willy

2004-11-12 06:17:46

by dean gaudet

Subject: Re: CONFIG_X86_PM_TIMER is slow

On Fri, 12 Nov 2004, Willy Tarreau wrote:

> On Thu, Nov 11, 2004 at 09:52:27PM -0800, dean gaudet wrote:
> > when using CONFIG_X86_PM_TIMER i'm finding that gettimeofday() calls take
> > 2.8us on a p-m 1.4GHz box... which is an order of magnitude slower than
> > TSC-based solutions.
> >
> > on one workload i'm seeing a 7% perf improvement by booting "acpi=off" to
> > force it to use tsc instead of the PM timer... (the workload calls
> > gettimeofday too frequently, but i can't change that).
>
> I did not test, this might be interesting.
> In fact, what would be very good would be sort of a new select/poll/epoll
> syscalls with an additional argument, which would point to a structure
> that the syscall would fill in return with the time of day. This would
> greatly reduce the number of calls to gettimeofday() in some programs,
> and make use of the time that was used by the syscall itself.
>
> For example, if I could call it like this, it would be really cool :
>
> ret = select_absdate(&in, &out, &excp, &date_timeout, &return_date);

but the overhead isn't the syscall :) it's the pm timer itself... the
program below reads the pm timer the same way the kernel does and you can
see it takes 2.5us to read it...

# cc -O2 -g -Wall readport.c -o readport
# grep PM-Timer /var/log/dmesg
ACPI: PM-Timer IO Port: 0xd808
# time ./readport 0xd808 1000000
./readport 0xd808 1000000 2.54s user 0.00s system 100% cpu 2.526 total

the gettimeofday overhead is dominated by this i/o...

-dean

/* readport.c */

#include <stdint.h>
#include <sys/io.h>
#include <stdio.h>
#include <stdlib.h>

uint32_t pmtmr_ioport;

#define ACPI_PM_MASK 0xFFFFFF /* limit it to 24 bits */

static inline uint32_t read_pmtmr(void)
{
    uint32_t v1 = 0, v2 = 0, v3 = 0;
    /* It has been reported that because of various broken
     * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
     * source is not latched, so you must read it multiple
     * times to insure a safe value is read.
     */
    do {
        v1 = inl(pmtmr_ioport);
        v2 = inl(pmtmr_ioport);
        v3 = inl(pmtmr_ioport);
    } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
             || (v3 > v1 && v3 < v2));

    /* mask the output to 24 bits */
    return v2 & ACPI_PM_MASK;
}


int main(int argc, char **argv)
{
    unsigned port;
    size_t count;
    size_t i;

    if (argc != 3) {
        fprintf(stderr, "usage: %s port_num count\n", argv[0]);
        exit(1);
    }

    pmtmr_ioport = port = strtoul(argv[1], 0, 0);
    count = strtoul(argv[2], 0, 0);

    if (iopl(3) != 0) {
        perror("iopl");
        exit(1);
    }

    for (i = 0; i < count; ++i) {
        read_pmtmr();
    }
    return 0;
}

2004-11-12 07:06:23

by Willy Tarreau

Subject: Re: CONFIG_X86_PM_TIMER is slow

On Thu, Nov 11, 2004 at 10:16:27PM -0800, dean gaudet wrote:
> On Fri, 12 Nov 2004, Willy Tarreau wrote:
>
> > On Thu, Nov 11, 2004 at 09:52:27PM -0800, dean gaudet wrote:
> > > when using CONFIG_X86_PM_TIMER i'm finding that gettimeofday() calls take
> > > 2.8us on a p-m 1.4GHz box... which is an order of magnitude slower than
> > > TSC-based solutions.
> > >
> > > on one workload i'm seeing a 7% perf improvement by booting "acpi=off" to
> > > force it to use tsc instead of the PM timer... (the workload calls
> > > gettimeofday too frequently, but i can't change that).
> >
> > I did not test, this might be interesting.
> > In fact, what would be very good would be sort of a new select/poll/epoll
> > syscalls with an additional argument, which would point to a structure
> > that the syscall would fill in return with the time of day. This would
> > greatly reduce the number of calls to gettimeofday() in some programs,
> > and make use of the time that was used by the syscall itself.
> >
> > For example, if I could call it like this, it would be really cool :
> >
> > ret = select_absdate(&in, &out, &excp, &date_timeout, &return_date);
>
> but the overhead isn't the syscall :) it's the pm timer itself... the
> program below reads the pm timer the same way the kernel does and you can
> see it takes 2.5us to read it...

Sorry, what I meant is that if select() filled the structure itself,
it would avoid a supplementary call (syscall + hw access). And I'm certain
that select() accesses the same information at some point.

> # cc -O2 -g -Wall readport.c -o readport
> # grep PM-Timer /var/log/dmesg
> ACPI: PM-Timer IO Port: 0xd808
> # time ./readport 0xd808 1000000
> ./readport 0xd808 1000000 2.54s user 0.00s system 100% cpu 2.526 total
>
> the gettimeofday overhead is dominated by this i/o...

Indeed, that is a lot !

> /* It has been reported that because of various broken
> * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
> * source is not latched, so you must read it multiple
> * times to insure a safe value is read.
> */
> do {
> v1 = inl(pmtmr_ioport);
> v2 = inl(pmtmr_ioport);
> v3 = inl(pmtmr_ioport);
> } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
> || (v3 > v1 && v3 < v2));

Just a thought : have you tried to check whether it's the recovery time
after a read or the read itself which takes the time ? I mean, perhaps one
read would take, say, 50 ns, but two back-to-back reads would take 2 us.
If this is the case, having a separate function with only one read for
non-broken chipsets would be better, because there might be no particular
reason to check the counter that often.
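
(Something along the lines of dean's readport.c could show it -- only a
sketch, assuming the same iopl()/inl() setup, with the PM timer port as
argument -- timing one read per iteration against two back-to-back reads :)

/* sketch: is the cost in the inl() itself or in the recovery time after it? */
#include <stdio.h>
#include <stdlib.h>
#include <sys/io.h>
#include <sys/time.h>

static double us_elapsed(struct timeval a, struct timeval b)
{
    return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_usec - a.tv_usec);
}

int main(int argc, char **argv)
{
    unsigned port;
    long i, n = 1000000;
    struct timeval t0, t1;

    if (argc != 2) {
        fprintf(stderr, "usage: %s port_num\n", argv[0]);
        exit(1);
    }
    port = strtoul(argv[1], 0, 0);

    if (iopl(3) != 0) {
        perror("iopl");
        exit(1);
    }

    gettimeofday(&t0, 0);
    for (i = 0; i < n; i++)
        inl(port);
    gettimeofday(&t1, 0);
    printf("single read : %.3f us\n", us_elapsed(t0, t1) / n);

    gettimeofday(&t0, 0);
    for (i = 0; i < n; i++) {
        inl(port);
        inl(port);
    }
    gettimeofday(&t1, 0);
    printf("back-to-back pair : %.3f us\n", us_elapsed(t0, t1) / n);
    return 0;
}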

Other thought : is it possible to memory-map this timer to avoid the slow
inl() on x86 ?

Regards,
Willy

2004-11-12 07:47:26

by dean gaudet

Subject: Re: CONFIG_X86_PM_TIMER is slow

On Fri, 12 Nov 2004, Willy Tarreau wrote:

> > /* It has been reported that because of various broken
> > * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
> > * source is not latched, so you must read it multiple
> > * times to insure a safe value is read.
> > */
> > do {
> > v1 = inl(pmtmr_ioport);
> > v2 = inl(pmtmr_ioport);
> > v3 = inl(pmtmr_ioport);
> > } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
> > || (v3 > v1 && v3 < v2));
>
> Just a thought : have you tried to check whether it's the recovery time
> after a read or read itself which takes time ?

each read is ~0.8us ... the loop only runs once.

> I mean, perhaps one read
> would take, say 50 ns, but two back-to-back reads will take 2 us. If
> this is the case, having a separate function with only one read for
> non-broken chipsets will be better because there might be no particular
> reasons to check the counter that often.

yeah for the few chipsets i've looked at i haven't seen the problem the
loop is defending against yet. or the problem is pretty rare.

> Other thought : is it possible to memory-map this timer to avoid the slow
> inl() on x86 ?

that's how the even newer HPET works ... but not all systems have HPET.
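
(for illustration only, a rough sketch of what a memory-mapped counter read
looks like -- this assumes the common default hpet base of 0xfed00000 and
the main counter register at offset 0xf0; the real base comes from the ACPI
HPET table:)

/* sketch: read the memory-mapped HPET main counter via /dev/mem
 * note: on 32-bit, build with -D_FILE_OFFSET_BITS=64 so the high
 * physical offset fits in off_t */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HPET_BASE        0xfed00000UL   /* assumed default base */
#define HPET_COUNTER_OFF 0xf0           /* main counter value register */

int main(void)
{
    volatile uint8_t *hpet;
    uint32_t counter;
    int fd = open("/dev/mem", O_RDONLY);

    if (fd < 0) {
        perror("open /dev/mem");
        return 1;
    }
    hpet = mmap(0, 0x1000, PROT_READ, MAP_SHARED, fd, HPET_BASE);
    if (hpet == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* one plain (uncached) memory read -- no port i/o involved */
    counter = *(volatile uint32_t *)(hpet + HPET_COUNTER_OFF);
    printf("hpet counter: 0x%08x\n", counter);
    return 0;
}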

-dean

2004-11-12 10:07:46

by Willy Tarreau

Subject: Re: CONFIG_X86_PM_TIMER is slow

On Thu, Nov 11, 2004 at 11:43:14PM -0800, dean gaudet wrote:
> > > do {
> > > v1 = inl(pmtmr_ioport);
> > > v2 = inl(pmtmr_ioport);
> > > v3 = inl(pmtmr_ioport);
> > > } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
> > > || (v3 > v1 && v3 < v2));
> >
> > Just a thought : have you tried to check whether it's the recovery time
> > after a read or read itself which takes time ?
>
> each read is ~0.8us ... the loop only runs once.

OK, but here, 2 of these 3 reads are done right after one another.
For example, does it change anything to add a delay between them, like this :

do {
    volatile int slow;
    v1 = inl(pmtmr_ioport);
    for (slow = 500; slow--;);
    v2 = inl(pmtmr_ioport);
    for (slow = 500; slow--;);
    v3 = inl(pmtmr_ioport);
} while ...

If it does not change anything, then it means that there is a recovery
time after a read, which would imply that the 3 back-to-back reads are
more harmful than they seem.

> > Other thought : is it possible to memory-map this timer to avoid the slow
> > inl() on x86 ?
>
> that's how the even newer HPET works ... but not all systems have HPET.

OK, thanks for the info, I didn't know about that.

Regards,
Willy

2004-11-16 08:11:06

by George Anzinger

Subject: Re: CONFIG_X86_PM_TIMER is slow

dean gaudet wrote:
> On Fri, 12 Nov 2004, Willy Tarreau wrote:
>
>
>>> /* It has been reported that because of various broken
>>> * chipsets (ICH4, PIIX4 and PIIX4E) where the ACPI PM time
>>> * source is not latched, so you must read it multiple
>>> * times to insure a safe value is read.
>>> */
>>> do {
>>> v1 = inl(pmtmr_ioport);
>>> v2 = inl(pmtmr_ioport);
>>> v3 = inl(pmtmr_ioport);
>>> } while ((v1 > v2 && v1 < v3) || (v2 > v3 && v2 < v1)
>>> || (v3 > v1 && v3 < v2));
>>
>>Just a thought : have you tried to check whether it's the recovery time
>>after a read or read itself which takes time ?
>
>
> each read is ~0.8us ... the loop only runs once.
>
>
>>I mean, perhaps one read
>>would take, say 50 ns, but two back-to-back reads will take 2 us. If
>>this is the case, having a separate function with only one read for
>>non-broken chipsets will be better because there might be no particular
>>reasons to check the counter that often.
>
>
> yeah for the few chipsets i've looked at i haven't seen the problem the
> loop is defending against yet. or the problem is pretty rare.

I contend that the problem can be defended against with only one read the
majority of the time. A bit of logic is needed to see if the read is within
reason and, only when not, an additional read is needed. For example, the
following is the pm_timer read in the HRT patch.

static unsigned long quick_get_cpuctr(void)
{
    static unsigned long last_read = 0;
    static int qgc_max = 0;
    int i;

    unsigned long rd_delta, rd_ans, rd = inl(acpi_pm_tmr_address);

    /*
     * This will be REALLY big if ever we move backward in time...
     */
    rd_delta = (rd - last_read) & SIZE_MASK;
    last_read = rd;

    rd_ans = (rd - last_update) & SIZE_MASK;

    if (likely((rd_ans < (arch_cycles_per_jiffy << 1)) &&
               (rd_delta < (arch_cycles_per_jiffy << 1))))
        return rd_ans;

    for (i = 0; i < 10; i++) {
        rd = inl(acpi_pm_tmr_address);
        rd_delta = (rd - last_read) & SIZE_MASK;
        last_read = rd;
        if (unlikely(i > qgc_max))
            qgc_max = i;
        /*
         * On my test machine (800MHz dual PIII) this is always
         * seven. Seems long, but we will give it some slack...
         * We note that rd_delta (and all the vars) are unsigned, so
         * a backward movement will show as a really big number.
         */
        if (likely(rd_delta < 20))
            return (rd - last_update) & SIZE_MASK;
    }
    return (rd - last_update) & SIZE_MASK;
}
>
>
>>Other thought : is it possible to memory-map this timer to avoid the slow
>>inl() on x86 ?

I suspect not. The problem is that the timer is in I/O land, and the cpu and I/O
clocks need to sync to do the transfer; that is the cause of the delay.
--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/