2005-02-20 10:49:10

by Puneet Kaushik

Subject: Needed faster implementation of do_gettimeofday()

Hello all,

I am running oprofile on some program. Following is the oprofile output.

-----------------------------------------------------------------------

Counted GLOBAL_POWER_EVENTS events (time during which processor is not
stopped) with a unit mask of 0x01 (mandatory) count 100000
samples   %        app name         symbol name
985913    8.6083   vmlinux          mark_offset_tsc
584473    5.1032   libc-2.3.2.so    getc
295901    2.5836   vmlinux          ide_outb
270823    2.3646   vmlinux          _spin_lock
249791    2.1810   vmlinux          _spin_unlock
236140    2.0618   vmlinux          timer_interrupt
175249    1.5302   ld-2.3.2.so      do_lookup_versioned
140429    1.2261   sendmail         putc
138739    1.2114   sendmail         stabhash
134145    1.1713   sendmail         getc

-----------------------------------------------------------------------


From this output, my analysis is that mark_offset_tsc (which is
called from do_gettimeofday) and some other timer functions are taking
the most CPU time.

Is there a faster implementation of do_gettimeofday? I am using kernel
2.6.10 on a dual P4.

What I found from a Google search is http://lwn.net/Articles/9266/ , which
is only for kernel 2.4.

Thanks for help.


-Puneet




2005-02-20 15:48:29

by Parag Warudkar

Subject: Re: Needed faster implementation of do_gettimeofday()

On Sunday 20 February 2005 05:58 am, [email protected] wrote:
> 985913    8.6083   vmlinux          mark_offset_tsc
> 584473    5.1032   libc-2.3.2.so    getc

What makes you think mark_offset_tsc is slow? Do you have any comparative
numbers? It might just be that the workload you are throwing at it justifies
it. (For e.g. if your workload does a zillion system calls, system_call will
show up as a hot spot in oprofile - doesn't necessarily mean it is slow -
it's just overused.) Can you post the relevant code?

Parag

2005-02-22 03:07:27

by George Anzinger

Subject: Re: Needed faster implementation of do_gettimeofday()

Parag Warudkar wrote:
> On Sunday 20 February 2005 05:58 am, [email protected] wrote:
>
>>985913 8.6083 vmlinux mark_offset_tsc
>>584473 5.1032 libc-2.3.2.so getc
>
>
> What makes you think mark_offset_tsc is slow? Do you have any comparative
> numbers? It might just be that the workload you are throwing at it justifies
> it. (For e.g. if your workload does a zillion system calls, system_call will
> show up as a hot spot in oprofile - doesn't necessarily mean it is slow -
> it's just overused.) Can you post the relevant code?

He really is right. Mark offset is reading the PIT counter and that is not only
rather dumb but dog slow.

A suggestion, try the high res timers patch. Even if you don't use the timers
the mark offset there is MUCH faster. It does not read the PIT.

The difference is where we assume the jiffie bump is in time. If we assume it
is at the point that the PIT interrupts, well then the only way to get to that
is to read the PIT. If, on the other hand, we assume it is at the time after
the interrupt where we mark offset, we can observe the "best" time for this
event based on the TSC and avoid reading the PIT.

Try the HRT patch (see signature below) and see if it doesn't do better.


--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/

2005-02-22 13:55:52

by Puneet Kaushik

Subject: Re: Needed faster implementation of do_gettimeofday()

Hello Parag and George,

Thanks for the immediate reply.
The main problem is that I am working on an SMP system. I have written a
small program that just calls gettimeofday() one billion times. I ran it
under the time utility, and it takes almost double the time on SMP than on
UP.
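
The test program itself was not posted. A minimal sketch of what such a
benchmark could look like follows; the helper name hammer_gettimeofday and
the choice to report the last timestamp are assumptions for illustration,
not the original code.

```c
#include <stddef.h>
#include <sys/time.h>

/*
 * Hypothetical helper (not from the original post): call gettimeofday()
 * `calls` times in a tight loop. The last timestamp is written to *last
 * so the compiler cannot discard the calls.
 * Returns 0 on success, -1 as soon as one call fails.
 */
static int hammer_gettimeofday(unsigned long calls, struct timeval *last)
{
    for (unsigned long i = 0; i < calls; i++) {
        if (gettimeofday(last, NULL) != 0)
            return -1;
    }
    return 0;
}
```

Calling hammer_gettimeofday(1000000000UL, &tv) from main() and running the
resulting binary under time(1) should give real/user/sys figures of the
same kind as the ones reported in this message.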



with kernel 2.6.10 on UP

real 4m5.495s
user 1m17.088s
sys 2m48.046s


With Kernel 2.6.10 on SMP

real 6m24.485s
user 1m43.723s
sys 4m30.749s


In fact, this SMP machine is faster and has more memory than the UP
one. On SMP systems it takes a spinlock every time it is called,
synchronizes both processors, and then unlocks them. That's all I know
about it.

George, I am working on your suggestion now; let me know whether it will
work on SMP.

If there is a good implementation for SMP, please let me know.

Thanks,

- Puneet





2005-02-22 15:47:13

by Chris Friesen

Subject: Re: Needed faster implementation of do_gettimeofday()

Puneet Kaushik wrote:
> Hello Parag and George,
>
> Thanks for immediate reply.
> The main problem is I am working on a SMP system. I have written a small
> program that just calls the gettimeofday(), one billion times. I have
> run it with time utility and it takes almost double time on SMP than a
> UP.

If the hardware is known in advance, can you use some arch-specific
thing (like rdtsc on intel) to get a timestamp that can then be
calibrated by calling gettimeofday() at a lower frequency?

There will be issues (may have to use cpu affinity if the two don't run
at the same rate, may need to disable any frequency stepping), but it
might be possible to work around them.
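
A rough user-space sketch of that idea, assuming GCC or Clang on x86: read
the TSC with the __rdtsc() intrinsic and convert ticks to microseconds
using a rate measured against gettimeofday(). The names read_ticks,
tick_clock, tick_clock_calibrate and tick_clock_now_usec are invented for
this example, and the clock_gettime() fallback for non-x86 builds is an
addition so the sketch compiles elsewhere. It deliberately ignores the
affinity and frequency-stepping issues mentioned above.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() in the fallback */
#endif
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
/* Raw cycle counter; cheap, but only meaningful on one CPU at a fixed rate. */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
/* Assumption: off x86, use a monotonic nanosecond clock as the "tick" source. */
static inline uint64_t read_ticks(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}
#endif

struct tick_clock {
    uint64_t base_ticks;     /* counter value at the last calibration   */
    double   base_usec;      /* gettimeofday() value at the same moment */
    double   usec_per_tick;  /* measured conversion factor              */
};

/* Wall-clock time in microseconds, via the slow path. */
static double wall_usec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec * 1e6 + (double)tv.tv_usec;
}

/* Sample both clocks, spin for at least window_usec, sample again. */
static void tick_clock_calibrate(struct tick_clock *c, double window_usec)
{
    uint64_t t0 = read_ticks(), t1;
    double w0 = wall_usec(), w1;

    do {
        w1 = wall_usec();
        t1 = read_ticks();
    } while (w1 - w0 < window_usec);

    c->usec_per_tick = (w1 - w0) / (double)(t1 - t0);
    c->base_ticks = t1;
    c->base_usec = w1;
}

/* Fast path: no system call, just the counter and one multiply. */
static double tick_clock_now_usec(const struct tick_clock *c)
{
    return c->base_usec +
           (double)(read_ticks() - c->base_ticks) * c->usec_per_tick;
}
```

In use, one would call tick_clock_calibrate() occasionally (say once a
second) and tick_clock_now_usec() in the hot path; how often recalibration
is needed depends on how stable the counter is on the hardware in question.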

Chris

2005-02-22 16:45:16

by George Anzinger

Subject: Re: Needed faster implementation of do_gettimeofday()

Puneet Kaushik wrote:
> Hello Parag and George,
>
> Thanks for immediate reply.
> The main problem is I am working on a SMP system. I have written a small
> program that just calls the gettimeofday(), one billion times. I have
> run it with time utility and it takes almost double time on SMP than a
> UP.
>
>
>
> with kernel 2.6.10 on UP
>
> real 4m5.495s
> user 1m17.088s
> sys 2m48.046s
>
>
> With Kernel 2.6.10 on SMP
>
> real 6m24.485s
> user 1m43.723s
> sys 4m30.749s
>
>
> And the fact is this SMP machine is faster and with more memory than the
> UP one. In SMP systems it make a spinlock every time it got called,
> synchronizes both the processors, and unlock them. Thats all I know
> about it.

On 2.6 the lock is an r/w sequence lock. The CPUs are not synchronized or
locked, but some of the instructions in the sequence lock code are
"locked". I find it hard to believe that this would double the time, however.

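For readers unfamiliar with sequence locks, here is a minimal user-space
sketch of the pattern using C11 atomics. It is not the kernel's seqlock
code; the names seq_time, seq_time_write and seq_time_read are made up for
illustration, and it assumes a single writer (the kernel serializes writers
separately). Readers never block the writer; they retry if the sequence
count was odd or changed while they were reading.

```c
#include <stdatomic.h>

/* Hypothetical time record guarded by a sequence counter. */
struct seq_time {
    atomic_uint seq;   /* even: stable, odd: an update is in progress */
    atomic_long sec;
    atomic_long usec;
};

/* Single writer assumed: bump to odd, update the payload, bump to even. */
static void seq_time_write(struct seq_time *t, long sec, long usec)
{
    unsigned s = atomic_load_explicit(&t->seq, memory_order_relaxed);

    atomic_store_explicit(&t->seq, s + 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&t->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&t->usec, usec, memory_order_relaxed);
    atomic_store_explicit(&t->seq, s + 2, memory_order_release);
}

/* Reader: copy the payload, retry if a writer was active or intervened. */
static void seq_time_read(struct seq_time *t, long *sec, long *usec)
{
    unsigned before, after;

    do {
        before = atomic_load_explicit(&t->seq, memory_order_acquire);
        *sec = atomic_load_explicit(&t->sec, memory_order_relaxed);
        *usec = atomic_load_explicit(&t->usec, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
        after = atomic_load_explicit(&t->seq, memory_order_relaxed);
    } while ((before & 1u) || before != after);
}
```

The memory barriers are where the cost on the read side comes from; the
reader takes no lock and writes nothing shared.
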
Ah..., now I remember. On SMP x86 boxen, the accounting/run_timer interrupt
comes from the lapic timer. This fires once every 1/HZ seconds on each CPU and
means that there is an additional timekeeping interrupt. Actually, over the box,
you get (N+1)*HZ interrupts per second, where N is the number of CPUs. Assuming
that the PIT and the lapic interrupt take about the same amount of time and that
the PIT interrupt is evenly distributed across the CPUs, the per-CPU interrupt
load should go from 1 to 1.5 on a two-CPU box. This alone would take your
4.084 min UP time to 6.125 min on an SMP box (that is amazingly close to what
you are seeing, if you ask me).

Again, I recommend my HRT patch. There the accounting interrupt is generated by
an "all-but-self" IPI. This is generated by the PIT interrupt code, which also
does the accounting on the CPU handling the PIT interrupt. Result: total
timekeeping interrupts of N*HZ per second, where N is the number of CPUs.
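
The back-of-the-envelope numbers above can be reproduced with a few lines
of C. This only restates the rough model from this message (every
timekeeping interrupt costs the same and the PIT interrupt is spread evenly
over the CPUs); the function names are invented for the example.

```c
/*
 * Timekeeping interrupts handled per CPU per tick under the stock scheme
 * as described above: one PIT interrupt for the whole box plus one lapic
 * timer interrupt on each of the ncpus CPUs.
 */
static double stock_irqs_per_cpu(int ncpus)
{
    return (ncpus + 1.0) / ncpus;
}

/*
 * Same quantity with the HRT scheme as described above: the CPU taking
 * the PIT interrupt does its own accounting and sends an all-but-self
 * IPI, so there are ncpus interrupts per tick in total.
 */
static double hrt_irqs_per_cpu(int ncpus)
{
    return (double)ncpus / ncpus;
}

/* Scale a UP run time by the per-CPU interrupt load, as the estimate does. */
static double projected_minutes(double up_minutes, double load_factor)
{
    return up_minutes * load_factor;
}
```

For two CPUs, stock_irqs_per_cpu(2) is 1.5 and hrt_irqs_per_cpu(2) is 1.0,
and projected_minutes(4.084, 1.5) lands at roughly 6.13, in line with the
estimate above and with the measured 6m24s.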


>
> George I am just working on your suggestion, let me know if it will work
> for SMPs.

See above. Should solve your problem.

--
George Anzinger [email protected]
High-res-timers: http://sourceforge.net/projects/high-res-timers/