2007-06-12 03:13:43

by Hidetoshi Seto

Subject: [PATCH] ia64: Scalability improvement of gettimeofday with jitter compensation

Hi all,

This is a proposal with a patch to improve scalability
of time handling.

As described in comment at arch/ia64/kernel/time.c:

[arch/ia64/kernel/time.c]
> #ifdef CONFIG_SMP
> /* On IA64 in an SMP configuration ITCs are never accurately synchronized.
> * Jitter compensation requires a cmpxchg which may limit
> * the scalability of the syscalls for retrieving time.
> * The ITC synchronization is usually successful to within a few
> * ITC ticks but this is not a sure thing. If you need to improve
> * timer performance in SMP situations then boot the kernel with the
> * "nojitter" option. However, doing so may result in time fluctuating (maybe
> * even going backward) if the ITC offsets between the individual CPUs
> * are too large.
> */
> if (!nojitter) itc_interpolator.jitter = 1;
> #endif

ia64 uses jitter compensation to prevent time from going backward.

This jitter compensation logic, which keeps track of the cycle value
most recently returned, is provided as generic code (and is duplicated
in arch/ia64/kernel/fsys.S).
It seems that there is no user setting jitter = 1 other than ia64.

[kernel/timer.c]
> static inline u64 time_interpolator_get_counter(int writelock)
> {
> unsigned int src = time_interpolator->source;
>
> if (time_interpolator->jitter)
> {
> cycles_t lcycle;
> cycles_t now;
>
> do {
> lcycle = time_interpolator->last_cycle;
> now = time_interpolator_get_cycles(src);
> if (lcycle && time_after(lcycle, now))
> return lcycle;
>
> /* When holding the xtime write lock, there's no need
> * to add the overhead of the cmpxchg. Readers are
> * force to retry until the write lock is released.
> */
> if (writelock) {
> time_interpolator->last_cycle = now;
> return now;
> }
> /* Keep track of the last timer value returned. The use of cmpxchg here
> * will cause contention in an SMP environment.
> */
> } while (unlikely(cmpxchg(&time_interpolator->last_cycle, lcycle, now) != lcycle));
> return now;
> }
> else
> return time_interpolator_get_cycles(src);
> }

The current logic consists of a do-while loop with a cmpxchg.

The cmpxchg is known to take a long time in an SMP environment, but
it is an easy way to guarantee an atomic operation.
I think this is acceptable while there is no better alternative.

OTOH, the do-while forces a retry if the cmpxchg fails (no exchange
takes place). This means that if there are N threads trying to do the
cmpxchg at the same time, only 1 can get out of this loop while the
other N-1 stay trapped in it. It also means that a thread could loop
N times in the worst case.

Obviously this is a scalability issue.
To avoid this retry loop, I'd like to propose new logic that
removes the do-while here.

The basic idea is "use the winner's cycle instead of retrying."
Assuming that there are N threads trying to do the cmpxchg, it can
also be assumed that each is trying to update last_cycle with its own
new value while all the values are almost the same.
Therefore, it works to treat the threads as a group and decide the
group's return value by picking one value from within the group.

Fortunately, the cmpxchg mechanism can support this logic. Only the
first thread in the group can exchange last_cycle successfully, so
this "winner" gets the previous last_cycle as the return value of
cmpxchg. The rest of the group fail to exchange, since last_cycle has
already been updated by the winner, so these "losers" get the current
last_cycle as cmpxchg's return value. This means that every thread in
the group can learn the winner's cycle.

ret = cmpxchg(&last_cycle, last, new);
if (ret == last)
return new; /* you win! */
else
return ret; /* you lose. ret is winner's new */

I ran a test with gettimeofday() processes on a 1.5GHz 4-way box.
It shows the following:

- x1 process:
0.15us / 1 gettimeofday() call
0.15us / 1 gettimeofday() call with patch
- x2 process:
0.31us / 1 gettimeofday() call
0.24us / 1 gettimeofday() call with patch
- x3 process:
1.59us / 1 gettimeofday() call
1.11us / 1 gettimeofday() call with patch
- x4 process:
2.34us / 1 gettimeofday() call
1.29us / 1 gettimeofday() call with patch

I know that this patch cannot help really huge systems, since a
system with something like 1024 CPUs should have a better clocksource
instead of doing cmpxchg. Even so, this patch should work well on
middle-sized boxes (4~8 way, possibly 16~64 way?).

Thanks,
H.Seto

Signed-off-by: Hidetoshi Seto <[email protected]>

-----
arch/ia64/kernel/fsys.S | 10 +++++++---
kernel/timer.c | 45 +++++++++++++++++++++++++--------------------
2 files changed, 32 insertions(+), 23 deletions(-)

Index: linux-2.6.21/arch/ia64/kernel/fsys.S
===================================================================
--- linux-2.6.21.orig/arch/ia64/kernel/fsys.S
+++ linux-2.6.21/arch/ia64/kernel/fsys.S
@@ -271,18 +271,22 @@
(p6) sub r10 = r25,r26 // time we got was less than last_cycle
(p7) mov ar.ccv = r25 // more than last_cycle. Prep for cmpxchg
;;
+(p7) cmpxchg8.rel r3 = [r23],r2,ar.ccv
+ ;;
+(p7) cmp.ne p7,p0 = r25,r3 // if cmpxchg not successful, neighbor updated last_cycle.
+ ;;
+(p7) sub r10 = r3,r26 // use new last_cycle instead
+ ;;
and r10 = r10,r14 // Apply mask
;;
setf.sig f8 = r10
nop.i 123
;;
-(p7) cmpxchg8.rel r3 = [r23],r2,ar.ccv
EX(.fail_efault, probe.w.fault r31, 3) // This takes 5 cycles and we have spare time
xmpy.l f8 = f8,f7 // nsec_per_cyc*(counter-last_counter)
(p15) add r9 = r9,r17 // Add wall to monotonic.secs to result secs
;;
(p15) ld8 r17 = [r19],-IA64_TIMESPEC_TV_NSEC_OFFSET
-(p7) cmp.ne p7,p0 = r25,r3 // if cmpxchg not successful redo
// simulate tbit.nz.or p7,p0 = r28,0
and r28 = ~1,r28 // Make sequence even to force retry if odd
getf.sig r2 = f8
@@ -295,7 +299,7 @@
;; // overloaded 3 bundles!
// End critical section.
add r8 = r8,r2 // Add xtime.nsecs
- cmp4.ne.or p7,p0 = r28,r10
+ cmp4.ne p7,p0 = r28,r10
(p7) br.cond.dpnt.few .time_redo // sequence number changed ?
// Now r8=tv->tv_nsec and r9=tv->tv_sec
mov r10 = r0
Index: linux-2.6.21/kernel/timer.c
===================================================================
--- linux-2.6.21.orig/kernel/timer.c
+++ linux-2.6.21/kernel/timer.c
@@ -1762,27 +1762,32 @@

if (time_interpolator->jitter)
{
- cycles_t lcycle;
- cycles_t now;
+ cycles_t lcycle, now, ret;
+
+ lcycle = time_interpolator->last_cycle;
+ now = time_interpolator_get_cycles(src);
+ if (lcycle && time_after(lcycle, now))
+ return lcycle;
+
+ /* When holding the xtime write lock, there's no need
+ * to add the overhead of the cmpxchg. Readers are
+ * force to retry until the write lock is released.
+ */
+ if (writelock) {
+ time_interpolator->last_cycle = now;
+ return now;
+ }
+
+ /*
+ * Keep track of the last timer value returned.
+ * In an SMP environment, you could lose out in contention of
+ * cmpxchg. If so, your cmpxchg returns new value which the
+ * winner of contention updated to. Use the new value instead.
+ */
+ ret = cmpxchg(&time_interpolator->last_cycle, lcycle, now);
+ if (unlikely(ret != lcycle))
+ return ret;

- do {
- lcycle = time_interpolator->last_cycle;
- now = time_interpolator_get_cycles(src);
- if (lcycle && time_after(lcycle, now))
- return lcycle;
-
- /* When holding the xtime write lock, there's no need
- * to add the overhead of the cmpxchg. Readers are
- * force to retry until the write lock is released.
- */
- if (writelock) {
- time_interpolator->last_cycle = now;
- return now;
- }
- /* Keep track of the last timer value returned. The use of cmpxchg here
- * will cause contention in an SMP environment.
- */
- } while (unlikely(cmpxchg(&time_interpolator->last_cycle, lcycle, now) != lcycle));
return now;
}
else


2007-06-13 12:20:30

by Bob Picco

Subject: Re: [PATCH] ia64: Scalability improvement of gettimeofday with jitter compensation

Hidetoshi Seto wrote: [Mon Jun 11 2007, 11:14:44PM EDT]
[snip]
I didn't examine your patch closely but did notice modifications to the
time interpolator. I'll attempt to examine your patch with more scrutiny
in the next few days.

I will be pushing Peter Keilty's clocksource ia64 patches within the next
week or so. At that time I'll ask for inclusion into -mm. Please see:
http://marc.info/?t=117881585500001&r=1&w=2
You'll notice that the time interpolator is removed.

BTW. Peter is no longer with HP. He participated in the Employee Early
Retirement program and left May 31.

thanks,

bob

2007-06-14 02:06:20

by Hidetoshi Seto

Subject: Re: [PATCH] ia64: Scalability improvement of gettimeofday with jitter compensation

Bob Picco wrote:
> I will be pushing Peter Keilty's clocksource ia64 patches within the next
> week or so. At that time I'll ask for inclusion into -mm. Please see:
> http://marc.info/?t=117881585500001&r=1&w=2
> You'll notice that the time interpolator is removed.

I see. Your patch replaces the time interpolator with a clocksource.
This is a conversion of the structure that contains time-related data.
There will be no drastic change in logic, right?

What I want to do is remove the cmpxchg loop.
Your patch points out that there are 2 loops in fsys.S: an outer loop
on lock.seq and an inner loop on cmpxchg.

If your patch successfully removes the cmpxchg from the generic code,
then my patch becomes ia64 specific and compact, like the following.

Thanks,
H.Seto

---
arch/ia64/kernel/fsys.S | 18 ++++++++++--------
1 files changed, 10 insertions(+), 8 deletions(-)

Index: linux-2.6.21/arch/ia64/kernel/fsys.S
===================================================================
--- linux-2.6.21.orig/arch/ia64/kernel/fsys.S
+++ linux-2.6.21/arch/ia64/kernel/fsys.S
@@ -224,7 +224,7 @@
add r26 = IA64_CLKSRC_CYCLE_LAST_OFFSET,r20 // clksrc_cycle_last
cmp.ne p6, p0 = 0, r2 // Fallback if work is scheduled
(p6) br.cond.spnt.many fsys_fallback_syscall
- ;; // get lock.seq here new code, outer loop2!
+ ;;
.time_redo:
ld4.acq r28 = [r20] // gtod_lock.sequence, Must be first in struct
ld8 r30 = [r21] // clocksource->mmio_ptr
@@ -242,8 +242,7 @@
ld4 r23 = [r23] // clocksource shift value
ld8 r24 = [r26] // get clksrc_cycle_last value
(p9) cmp.eq p13,p0 = 0,r30 // if mmio_ptr, clear p13 jitter control
- ;; // old position for lock seq, new inner loop1!
-.cmpxchg_redo:
+ ;;
.pred.rel.mutex p8,p9
(p8) mov r2 = ar.itc // CPU_TIMER. 36 clocks latency!!!
(p9) ld8 r2 = [r30] // readq(ti->address). Could also have latency issues..
@@ -261,20 +260,23 @@
(p6) sub r10 = r25,r24 // time we got was less than last_cycle
(p7) mov ar.ccv = r25 // more than last_cycle. Prep for cmpxchg
;;
+(p7) cmpxchg8.rel r3 = [r19],r2,ar.ccv
+ ;;
+(p7) cmp.ne p7,p0 = r25,r3 // if cmpxchg not successful, neighbor updated last_cycle
+ ;;
+(p7) sub r10 = r3,r24 // use new last_cycle instead
+ ;;
and r10 = r10,r14 // Apply mask
;;
setf.sig f8 = r10
nop.i 123
;;
-(p7) cmpxchg8.rel r3 = [r19],r2,ar.ccv
EX(.fail_efault, probe.w.fault r31, 3) // This takes 5 cycles and we have spare time
xmpy.l f8 = f8,f7 // nsec_per_cyc*(counter-last_counter)
(p15) add r9 = r9,r17 // Add wall to monotonic.secs to result secs
;;
// End cmpxchg critical section loop1
(p15) ld8 r17 = [r22],-IA64_TIMESPEC_TV_NSEC_OFFSET
-(p7) cmp.ne p7,p0 = r25,r3 // if cmpxchg not successful redo
-(p7) br.cond.dpnt.few .cmpxchg_redo // inner loop1
// simulate tbit.nz.or p7,p0 = r28,0
and r28 = ~1,r28 // Make sequence even to force retry if odd
getf.sig r2 = f8
@@ -286,8 +288,8 @@
;; // overloaded 3 bundles!
// End critical section.
add r8 = r8,r2 // Add xtime.nsecs
- cmp4.ne.or p7,p0 = r28,r10
-(p7) br.cond.dpnt.few .time_redo // sequence number changed, outer loop2
+ cmp4.ne p7,p0 = r28,r10
+(p7) br.cond.dpnt.few .time_redo // sequence number changed ?
// Now r8=tv->tv_nsec and r9=tv->tv_sec
mov r10 = r0
movl r2 = 1000000000

2007-06-15 17:46:26

by Christoph Lameter

Subject: Re: [PATCH] ia64: Scalability improvement of gettimeofday with jitter compensation

On Tue, 12 Jun 2007, Hidetoshi Seto wrote:

>
> [arch/ia64/kernel/time.c]
> > #ifdef CONFIG_SMP
> > /* On IA64 in an SMP configuration ITCs are never accurately synchronized.
> > * Jitter compensation requires a cmpxchg which may limit
> > * the scalability of the syscalls for retrieving time.
> > * The ITC synchronization is usually successful to within a few
> > * ITC ticks but this is not a sure thing. If you need to improve
> > * timer performance in SMP situations then boot the kernel with the
> > * "nojitter" option. However, doing so may result in time fluctuating (maybe
> > * even going backward) if the ITC offsets between the individual CPUs
> > * are too large.
> > */
> > if (!nojitter) itc_interpolator.jitter = 1;
> > #endif
>
> ia64 uses jitter compensation to prevent time from going backward.
>
> This jitter compensation logic which keep track of cycle value
> recently returned is provided as generic code (and copied to
> arch/ia64/kernel/fsys.S).
> It seems that there is no user (setting jitter = 1) other than ia64.

Yes, there are only two users of time interpolators. The generic
gettimeofday work by John Stultz will make time interpolators obsolete
soon. I already saw a patch to remove them.

> The cmpxchg is known to take long time in an SMP environment but
> it is easy way to guarantee atomic operation.
> I think this is acceptable while there are no better alternatives.
>
> OTOH, the do-while forces retry if cmpxchg fails (no exchanges).
> This means that if there are N threads trying to do cmpxchg at
> same time, only 1 can out from this loop and N-1 others will be
> trapped in the loop. This also means that a thread could loop
> N times in worst case.

Right.

> Obviously this is a scalability issue.
> To avoid this retry loop, I'd like to propose new logic that
> removes do-while here.

Booting with nojitter is the easiest way to solve this if one is willing
to accept slightly inaccurate clocks.

> The basic idea is "use winner's cycle instead of retrying."
> Assuming that there are N threads trying to do cmpxchg, it also
> be assumed that they are trying to update last_cycle by its own
> new value while all values are almost same.
> Therefore, it will work that treating threads as a group and
> deciding a group's return value by picking up one from the group.
>
> Fortunately, cmpxchg mechanism can help this logic. Only first
> one in group can exchange the last_cycle successfully, so this
> "winner" gets previous last_cycle as the return value of cmpxchg.
> The rests in group will fail to exchange since last_cycle is
> already updated by winner, so these "loser" gets current
> last_cycle on cmpxchg's return. This means that all thread in
> the group can know the winner's cycle.
>
> ret = cmpxchg(&last_cycle, last, new);
> if (ret == last)
> return new; /* you win! */
> else
> return ret; /* you lose. ret is winner's new */

Interesting solution. But there may be multiple updates of last happening.
Which of the winners is the real winner?

>
> I had a test running gettimeofday() processes at 1.5GHz*4way box.
> It shows followings:
>
> - x1 process:
> 0.15us / 1 gettimeofday() call
> 0.15us / 1 gettimeofday() call with patch
> - x2 process:
> 0.31us / 1 gettimeofday() call
> 0.24us / 1 gettimeofday() call with patch
> - x3 process:
> 1.59us / 1 gettimeofday() call
> 1.11us / 1 gettimeofday() call with patch
> - x4 process:
> 2.34us / 1 gettimeofday() call
> 1.29us / 1 gettimeofday() call with patch
>
> I know that this patch could not help quite huge system since
> such system like having 1024CPUs should have better clocksource
> instead of doing cmpxchg. Even though this patch will work good
> on middle-sized box (4~8way, possibly 16~64way?).

Our SGI machines have their own clock that is hardware replicated to all
nodes.

This would certainly be a nice improvement for SMP machines. It
certainly works nicely for 2-way systems.

2007-06-15 20:03:21

by Tony Luck

Subject: RE: [PATCH] ia64: Scalability improvement of gettimeofday with jitter compensation

> > ret = cmpxchg(&last_cycle, last, new);
> > if (ret == last)
> > return new; /* you win! */
> > else
> > return ret; /* you lose. ret is winner's new */
>
> Interesting solution. But there may be multiple updates of last happening.
> Which of the winners is the real winner?

It might not matter. Let's look at how good a guarantee an
application can expect from gettimeofday(). First of all,
there may be arbitrary delays either before or after the
kernel comes up with the "current" time, since the process
could be context switched in the glibc prologue or epilogue
for the system call. So I think that the best an
application can expect is:

Given two calls to gettimeofday A and B, where through some
external synchronization it is known that call B is initiated
after A has completed, then B shall not return a time value
lower than that returned by A.

Now if that is the strongest ordering that an application can
expect (and I'm not completely convinced that I'm right, other
definitions gratefully accepted), then it would appear that
it is just over-engineering to worry too much about calls that
collide inside the kernel (since these don't satisfy the
"B initiated after A completes" clause of my definition).

-Tony

[For some 50 year old British humour on the subject of time,
read: http://www.hexmaster.com/goonscripts/what_time_is_it.html ]