Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754442AbYAPXjk (ORCPT ); Wed, 16 Jan 2008 18:39:40 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id S1751232AbYAPXjc (ORCPT ); Wed, 16 Jan 2008 18:39:32 -0500 Received: from tomts43.bellnexxia.net ([209.226.175.110]:54786 "EHLO tomts43-srv.bellnexxia.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751217AbYAPXja convert rfc822-to-8bit (ORCPT ); Wed, 16 Jan 2008 18:39:30 -0500 X-IronPort-Anti-Spam-Filtered: true X-IronPort-Anti-Spam-Result: Aj0KAAAkjkdMROHU/2dsb2JhbACBWJAPnB8 Date: Wed, 16 Jan 2008 18:39:28 -0500 From: Mathieu Desnoyers To: john stultz Cc: Steven Rostedt , LKML , Ingo Molnar , Linus Torvalds , Andrew Morton , Peter Zijlstra , Christoph Hellwig , Gregory Haskins , Arnaldo Carvalho de Melo , Thomas Gleixner , Tim Bird , Sam Ravnborg , "Frank Ch. Eigler" , Steven Rostedt , Paul Mackerras , Daniel Walker Subject: Re: [RFC PATCH 16/22 -v2] add get_monotonic_cycles Message-ID: <20080116233927.GB23895@Krystal> References: <20080109233044.777564395@goodmis.org> <20080115214636.GD17439@Krystal> <20080115220824.GB22242@Krystal> <20080116031730.GA2164@Krystal> <20080116145604.GB31329@Krystal> <1f1b08da0801161436k4a7ac1e3kd83590951e7bebb9@mail.gmail.com> <1200523867.6127.5.camel@localhost.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline Content-Transfer-Encoding: 8BIT In-Reply-To: <1200523867.6127.5.camel@localhost.localdomain> X-Editor: vi X-Info: http://krystal.dyndns.org:8080 X-Operating-System: Linux/2.6.21.3-grsec (i686) X-Uptime: 18:00:29 up 74 days, 4:05, 6 users, load average: 0.58, 2.46, 2.82 User-Agent: Mutt/1.5.16 (2007-06-11) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 8896 Lines: 243 * john stultz (johnstul@us.ibm.com) wrote: > > On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote: > > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers wrote: > > > If you really want an seqlock free algorithm (I _do_ want this for > > > tracing!) :) maybe going in the RCU direction could help (I refer to my > > > RCU-based 32-to-64 bits lockless timestamp counter extension, which > > > could be turned into the clocksource updater). > > > > Yea. After our earlier discussion and talking w/ Steven, I'm taking a > > swing at this now. The lock-free method still doesn't apply to the > > update_wall_time function, but does work fine for the monotonic cycle > > uses. I'll send a patch for review as soon as I get things building. > > So here's my first attempt at adding Mathieu's lock-free method to > Steven's get_monotonic_cycles() interface. > > Completely un-tested, but it builds, so I figured I'd send it out for > review. > > I'm not super sure the update or the read doesn't need something > additional to force a memory access, but as I didn't see anything > special in Mathieu's implementation, I'm going to guess this is ok. > > Mathieu, Let me know if this isn't what you're suggesting. > > Signed-off-by: John Stultz > > Index: monotonic-cleanup/include/linux/clocksource.h > =================================================================== > --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.000000000 -0800 > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 14:41:31.000000000 -0800 > @@ -87,9 +87,17 @@ > * more than one cache line. > */ > struct { > - cycle_t cycle_last, cycle_accumulated, cycle_raw; > - } ____cacheline_aligned_in_smp; Shouldn't the cycle_last and cycle_accumulated by in the array too ? > + cycle_t cycle_last, cycle_accumulated; > > + /* base structure provides lock-free read > + * access to a virtualized 64bit counter > + * Uses RCU-like update. > + */ > + struct { We had cycle_raw before, why do we need the following two ? > + cycle_t cycle_base_last, cycle_base; I'm not quite sure why you need both cycle_base_last and cycle_base... I think I'll need a bit of an explanation of what you are trying to achieve here to see what to expect from the clock source. Are you trying to deal with non-synchronized TSCs across CPUs in a way that will generate a monotonic (sometimes stalling) clock ? What I am trying to say is : I know you are trying to make a virtual clock source where time cannot go backward, but what are your assumptions about the "real" clock source ? Is the intent to deal with an HPET suddenly reset to 0 or something like this ? Basically, I wonder why you have to calculate the current cycle count from the previous update_wall_time event. Is is because you need to be consistent when a clocksource change occurs ? > + } base[2]; > + int base_num; > + } ____cacheline_aligned_in_smp; > u64 xtime_nsec; > s64 error; > > @@ -175,19 +183,21 @@ > } > > /** > - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value > + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value > * @cs: pointer to clocksource being read > * @now: current cycle value > * > * Uses the clocksource to return the current cycle_t value. > * NOTE!!!: This is different from clocksource_read, because it > - * returns the accumulated cycle value! Must hold xtime lock! > + * returns a 64bit wide accumulated value. > */ > static inline cycle_t > -clocksource_get_cycles(struct clocksource *cs, cycle_t now) > +clocksource_get_basecycles(struct clocksource *cs, cycle_t now) > { > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > - offset += cs->cycle_accumulated; I would disable preemption in clocksource_get_basecycles. We would not want to be scheduled out while we hold a pointer to the old array element. > + int num = cs->base_num; Since you deal with base_num in a shared manner (not per cpu), you will need a smp_read_barrier_depend() here after the cs->base_num read. You should think about reading the cs->base_num first, and _after_ that read the real clocksource. Here, the clocksource value is passed as parameter. It means that the read clocksource may have been read in the previous RCU window. > + cycle_t offset = (now - cs->base[num].cycle_base_last); > + offset &= cs->mask; > + offset += cs->base[num].cycle_base; > return offset; > } > > @@ -197,14 +207,25 @@ > * @now: current cycle value > * > * Used to avoids clocksource hardware overflow by periodically > - * accumulating the current cycle delta. Must hold xtime write lock! > + * accumulating the current cycle delta. Uses RCU-like update, but > + * ***still requires the xtime_lock is held for writing!*** > */ > static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now) > { Why do we still require xtime_lock here ? Can you tell exactly which contexts this function will be called from (periodical timer interrupt?) I guess it is called from one and only one CPU periodically. > - cycle_t offset = (now - cs->cycle_last) & cs->mask; > + /* First update the monotonic base portion. > + * The dual array update method allows for lock-free reading. > + */ > + int num = !cs->base_num; > + cycle_t offset = (now - cs->base[!num].cycle_base_last); !0 is not necessarily 1. This is why I use cpu_synth->index ? 0 : 1 in my code. The two previous lines seems buggy. (I did the same mistake in my first implementation) ;) > + offset &= cs->mask; > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset; Here too. > + cs->base[num].cycle_base_last = now; Since you deal with shared data (in my algo, I use per-cpu data), you have to add a wmb() before the base_num value update. Only then will you ensure that other CPUs will see consistent values. > + cs->base_num = num; > + > + /* Now update the cycle_accumulated portion */ > + offset = (now - cs->cycle_last) & cs->mask; The following two updates are racy. I think they should be in the array too. We want consistant cycle_raw, cycle_last and cycle_accumulated values; they should therefore be presented to the reader atomically with a pointer change. > cs->cycle_last = now; > cs->cycle_accumulated += offset; > - cs->cycle_raw += offset; > } > > /** > Index: monotonic-cleanup/kernel/time/timekeeping.c > =================================================================== > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.000000000 -0800 > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 14:15:31.000000000 -0800 > @@ -71,10 +71,12 @@ > */ > static inline s64 __get_nsec_offset(void) > { > - cycle_t cycle_delta; > + cycle_t now, cycle_delta; > s64 ns_offset; > > - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock)); > + now = clocksource_read(clock); Why is there a second level of cycle_last and cycle_accumulated here ? > + cycle_delta = (now - clock->cycle_last) & clock->mask; > + cycle_delta += clock->cycle_accumulated; > ns_offset = cyc2ns(clock, cycle_delta); > > return ns_offset; > @@ -105,35 +107,7 @@ > > cycle_t notrace get_monotonic_cycles(void) > { > - cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last; > - > - do { > - /* > - * cycle_raw and cycle_last can change on > - * another CPU and we need the delta calculation > - * of cycle_now and cycle_last happen atomic, as well > - * as the adding to cycle_raw. We don't need to grab > - * any locks, we just keep trying until get all the > - * calculations together in one state. > - * > - * In fact, we __cant__ grab any locks. This > - * function is called from the latency_tracer which can > - * be called anywhere. To grab any locks (including > - * seq_locks) we risk putting ourselves into a deadlock. > - */ > - cycle_raw = clock->cycle_raw; > - cycle_last = clock->cycle_last; > - > - /* read clocksource: */ > - cycle_now = clocksource_read(clock); > - > - /* calculate the delta since the last update_wall_time: */ > - cycle_delta = (cycle_now - cycle_last) & clock->mask; > - > - } while (cycle_raw != clock->cycle_raw || > - cycle_last != clock->cycle_last); > - > - return cycle_raw + cycle_delta; > + return clocksource_get_basecycles(clock, clocksource_read(clock)); > } > > unsigned long notrace cycles_to_usecs(cycle_t cycles) > > -- Mathieu Desnoyers Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/