Date: Thu, 19 Sep 2002 20:02:29 +0200
From: Andrea Arcangeli
To: James Cleverdon
Cc: Andi Kleen, "David S. Miller", johnstul@us.ibm.com, alan@lxorguk.ukuu.org.uk, linux-kernel@vger.kernel.org, anton.wilson@camotion.com
Subject: Re: do_gettimeofday vs. rdtsc in the scheduler
Message-ID: <20020919180229.GF1345@dualathlon.random>
In-Reply-To: <200209171804.33391.jamesclv@us.ibm.com>

On Tue, Sep 17, 2002 at 06:04:33PM -0700, James Cleverdon wrote:
> have a separate clock input for it that runs at 1 MHz so skew and

The clock input should be the same, or the counters can always drift out of sync if you leave them running forever. The timer generation is an analog thing, the reception is digital, so having a single timer source guarantees no counter skew.

If the precision we needed from the timer driving gettimeofday were 1Hz, i.e. one tick per second, you could make it scale perfectly without oscillations on a 256G box. You simply can't do that with a < 1 nanosecond tick period on more than a few cpus, because of physics, or you get what's been mentioned a number of times in this thread (oscillations generated by the latency of the signal delivery, or further slowdown in accessing the information because of overhead in the interconnects).

The best hardware solution to this problem is to have two cpu registers incremented by two timers. One is the regular cpu tick (TSC) that we have today, which could even go away with asynchronous cpus. The other would be a new "real time timer", a 10/100khz clock delivered to all the cpus that increments an in-cpu-core counter, so that it can be read from userspace too, inside vgettimeofday, with extremely low latency, exactly like the current TSC, but driven by this secondary low frequency timer that tells us how the time advances.

10/100usec should be much more than enough margin to deliver this timer to all the hundreds of cpus with a very small oscillation. And no software that I'm aware of needs time-of-day precision better than 10/100usec. An interrupt itself is going to take some usec. A context switch as well is going to take more than 10usec, and that's the important bit for guaranteeing that gettimeofday stays monotone: different threads can have a minor difference in their perception of the time, dominated by the speed-of-light delivery of the timer signal, and that's not a problem as long as it's monotone. The TSC, and also the system clock mentioned by Dave, are way too fast to be kept synchronized on a NUMA box without introducing significant drift and oscillations.
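To make that concrete, here is a minimal sketch of what a vsyscall gettimeofday could look like with such a counter. Everything in it is an assumption for illustration: read_rtc() stands in for the missing in-core register, vsyscall_time stands in for whatever read-only page the kernel would export to every process, and the locking against concurrent updates of that page is left out to keep it short.

#include <sys/time.h>

#define RTC_HZ	100000ULL		/* the proposed 100khz timer */

struct vsyscall_time {
	long tv_sec;			/* wall clock at the last kernel update */
	long tv_usec;
	unsigned long long rtc_last;	/* counter value at that update */
};

/* stand-in for the proposed in-core register: one counter, incremented
 * in lockstep on every cpu by the shared low frequency line */
static volatile unsigned long long rtc_counter;

static unsigned long long read_rtc(void)
{
	return rtc_counter;
}

static void vgettimeofday(struct timeval *tv, const struct vsyscall_time *vt)
{
	unsigned long long delta = read_rtc() - vt->rtc_last;

	tv->tv_sec  = vt->tv_sec;
	tv->tv_usec = vt->tv_usec + delta * 1000000ULL / RTC_HZ;
	while (tv->tv_usec >= 1000000) {
		tv->tv_usec -= 1000000;
		tv->tv_sec++;
	}
}

At 100khz the conversion from ticks to usec is just a multiply by 10, so the whole fast path is a handful of instructions and never leaves userspace.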
If somebody really needs 1usec resolution, he will first need vsyscalls to avoid the kernel enter/exit latency, and he will likely need to run with iopl and irqs disabled, so in that case it should be ok to use the TSC with a specialized, hacked kernel config option (with all the disclaimers that it would break if the cpu clock changes under you, etc.). All mere mortals will be perfectly fine with a 100khz clock for gettimeofday. If Sun used a 1mhz clock to implement the design suggested above, then they did the optimal thing IMHO.

Another approach would be to use separate timer sources per-cpu and to resynchronize them every once in a while, at regular intervals that guarantee the drift doesn't grow beyond half the time of the shortest context switch, but that would need tedious software support with knowledge of very low-level hardware details, so I'd definitely prefer the previously mentioned solution, which requires all hardware vendors to get it right or it won't work. Just like what's happening now with the TSC, with the difference that the 100khz timer would be doable, while the TSC at 2ghz isn't.

Of course the cyclone timer and the HPET are the very next best thing the hardware vendors could provide us on x86, and of course you cannot do better than the cyclone and HPET without upgrading the cpu too, because the cpu is simply missing the register that would avoid hitting the southbridge at every vgettimeofday. At least the good thing is that the HPET is mapped in an mmio region, so we don't need to enter the kernel, only to access the southbridge from userspace, and that saves a number of usec at every gettimeofday (a rough sketch of such a read is at the end of this mail).

All of this assumes gettimeofday is an important operation, and that an additional cpu counter register plus an additional numa-shared timer would pay off by making gettimeofday as efficient and as accurate as possible on all classes of machines.

It would also be an option to replace the TSC with such a new "real time counter" if adding a second counter is too expensive. The TSC is almost unusable in its current, too-high-frequency form; it is useful only for microbenchmarking, so it's more a debugging facility than a production feature, while the new counter would be a really useful feature, and not only for debugging/benchmarking purposes.

Andrea
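Here is a rough sketch of the kind of userspace HPET read referred to above. The base address, the raw /dev/mem mapping and the final scaling are only assumptions for the example; a real setup would take the base address from ACPI and would want a saner interface than /dev/mem, but the register offsets are the ones from the HPET spec.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define HPET_BASE	0xFED00000UL	/* typical base address, not guaranteed */
#define HPET_CAP	0x000		/* capability reg, upper 32 bits = fs per tick */
#define HPET_COUNTER	0x0F0		/* 64bit main counter */

int main(void)
{
	/* needs a 64bit off_t (or mmap64) on 32bit for the >2G physical offset */
	int fd = open("/dev/mem", O_RDONLY);
	volatile unsigned char *hpet;
	uint32_t fs_per_tick;
	uint64_t ticks;

	if (fd < 0)
		return 1;
	hpet = mmap(0, 4096, PROT_READ, MAP_SHARED, fd, HPET_BASE);
	if (hpet == MAP_FAILED)
		return 1;

	fs_per_tick = *(volatile uint32_t *)(hpet + HPET_CAP + 4);
	ticks = *(volatile uint64_t *)(hpet + HPET_COUNTER);

	/* femtoseconds per tick -> nanoseconds, just to show the scaling */
	printf("%llu ticks, %u fs/tick, ~%llu ns since the counter started\n",
	       (unsigned long long)ticks, fs_per_tick,
	       (unsigned long long)(ticks * fs_per_tick / 1000000));
	return 0;
}

Even read this way the counter is still an uncached load that has to cross to the southbridge on every call, which is why the in-core counter described earlier would still be the better long term answer.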