Date: Mon, 8 Jun 2015 09:44:50 +0200 (CEST)
From: Thomas Gleixner
To: John Stultz
Cc: Ingo Molnar, Jeremiah Mahler, Preeti U Murthy, Peter Zijlstra,
    Viresh Kumar, Marcelo Tosatti, Frederic Weisbecker, lkml
Subject: Re: [BUG, bisect] hrtimer: severe lag after suspend & resume
References: <20150604005624.GA1789@hudson.localdomain> <20150605100707.GB8995@gmail.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, 5 Jun 2015, John Stultz wrote:
> On Fri, Jun 5, 2015 at 3:07 AM, Ingo Molnar wrote:
> > * Thomas Gleixner wrote:
> >> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing.
> >> In situations where we run high frequency periodic timers on clock monotonic and
> >> nothing is going on in the other clock domains, which is a pretty common
> >> situation, this is measurable in terms of cache utilization. [...]
> >
> > It's not just about 'touching': it's about _dirtying_ cachelines from a globally
> > executed function (timekeeping), which is then accessed by per-CPU functionality
> > (hrtimers).
>
> Right, but part of that issue is that we're caching in the hrtimer cpu
> bases data that *should not be cached*.
> That was the core issue that
> caused the 2012 leapsecond issue, and I'd prefer to not reintroduce
> it.
>
> The offset data is only valid for the monotonic time it's read for. So
> dirtying three cache lines really is just due to the fact that the
> data is stored in the cpu_base structure where I'd argue it doesn't
> provide real value (other than convenience of indexing it cleanly).
>
> Reading the offset data into three values from the stack would be fine
> too, and (I think) would avoid dirtying much extra (we have to store
> the now value anyway).

Well, the problem is that we need to fetch that data on several
occasions:

 - hrtimer_start (if it is the first expiring timer of a clock)
 - hrtimer_reprogram (after canceling the first timer)
 - hrtimer_interrupt

So I really prefer to have cached values instead of a function call.

And even if we do not cache stuff, there is no guarantee that we won't
expire a timer too early:

CPU0                                    CPU1
hrtimer_interrupt()
-------------------------------------   leap second edge
  get_time_and_offsets_uncached()
                                        do_leap_second_adjustment()
  expire_timers()

So, if the do_leap_second_adjustment() happens a bit too late, then
clock monotonic + offset_realtime will have advanced over the leap
second and expire a timer, and the sleeper will then observe that it
expired too early because the leap second adjustment finished before
it returned to user space.

You do not even need two cpus for this. You can observe the same issue
on a UP machine. Assume we have two hrtimers programmed to go off at
the leap second edge:

 1) a user space timer
 2) the tick/leap second one

If the user space timer has been enqueued before the leap one, then it
will be expired first, and if the timer interrupt got delayed a bit it
again will see that it's over the programmed time and happily expire
too early.

So whatever we do vs. the hrtimer offsets, cached or not, it will not
prevent us from expiring timers early.

> BTW: Thomas, what are you using to do measurements here?
> I hesitate
> to argue in these sorts of performance discussions, since I really
> only have an embarrassing theoretical understanding of the issues and
> suspect myself a bit naive here. Additionally these sorts of
> constraints aren't always clearly documented, so being able to
> measure and compare would be helpful to ensure future changes don't
> impact things here.

Performance counters and tests which stress the particular subsystems.

> > That makes it far more expensive, it has similar scalability limiting effects as a
> > global lock - while if we do it smart it can perform as essentially lockless code
> > in most cases.
>
> Another reason why I don't like this approach of caching the data is
> that it also prevents fixing the leap-second adjustment to happen on
> the second edge, because we have to have an async update to the
> seqcounter in order to refresh the cached real_offset. Or we have to
> also export more ntp state data so we can duplicate the adjustment to
> the cached data in the hrtimer code, which is more of the complexity
> you've objected to.

There is no guarantee that it happens at the second edge. Timer might
be delayed, vcpu scheduled out .... All you will be able to do is to
narrow the window, but as I explained above it won't prevent early
expiry and it won't prevent VDSO seeing the time go over the leap
second and then jump back.

We just have to accept that timekeeping, time readout and hrtimers
have asynchronous behaviour. And there is no way around that unless
you want to kill performance completely for the sake of this leap
second nonsense.

Thanks,

	tglx
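
The UP scenario described in the mail above can be modelled in a few
lines of user space C. This is only a sketch under stated assumptions,
not kernel code: struct sim_clock and all sim_* functions are
hypothetical names invented for illustration, the inserted leap second
is modelled as a one second backward step of the realtime offset, and
the expiry check reads the offset freshly (uncached) to show that
avoiding the cache does not close the window.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user space model; none of these names exist in the kernel. */
#define SIM_NSEC_PER_SEC 1000000000LL

struct sim_clock {
	int64_t mono_ns;	/* models CLOCK_MONOTONIC */
	int64_t offs_real_ns;	/* realtime = mono_ns + offs_real_ns */
};

/* Uncached readout: always computed from the current offset. */
static int64_t sim_realtime(const struct sim_clock *c)
{
	return c->mono_ns + c->offs_real_ns;
}

/* Assumption: an inserted leap second steps realtime back by one second. */
static void sim_leap_adjust(struct sim_clock *c)
{
	c->offs_real_ns -= SIM_NSEC_PER_SEC;
}

/*
 * One timer interrupt on a UP machine, arriving irq_delay_ns after the
 * leap second edge. Two timers are due at the edge: a user space timer
 * and the one performing the leap adjustment. user_timer_first selects
 * which of them is handled first.
 *
 * Returns true if the user space timer was expired and the sleeper,
 * reading realtime afterwards, finds it is still before its deadline.
 */
static bool sim_sleeper_sees_early_expiry(int64_t edge_real_ns,
					  int64_t irq_delay_ns,
					  bool user_timer_first)
{
	/* Monotonic time is counted from the edge for simplicity. */
	struct sim_clock c = {
		.mono_ns = irq_delay_ns,
		.offs_real_ns = edge_real_ns,
	};
	bool expired;

	if (!user_timer_first)
		sim_leap_adjust(&c);

	/* Expiry decision for the user space timer. */
	expired = sim_realtime(&c) >= edge_real_ns;

	if (user_timer_first)
		sim_leap_adjust(&c);

	/* The sleeper returns to user space and reads the clock. */
	return expired && sim_realtime(&c) < edge_real_ns;
}
```

In this model, calling sim_sleeper_sees_early_expiry() with a small
interrupt delay and user_timer_first set reports an early expiry, while
handling the leap timer first does not. The ordering of the two timers,
not the caching of the offset, decides the outcome here.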