Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752879AbbFHRhr (ORCPT ); Mon, 8 Jun 2015 13:37:47 -0400 Received: from mail-ie0-f170.google.com ([209.85.223.170]:34218 "EHLO mail-ie0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751082AbbFHRhj (ORCPT ); Mon, 8 Jun 2015 13:37:39 -0400 MIME-Version: 1.0 In-Reply-To: References: <20150604005624.GA1789@hudson.localdomain> <20150605100707.GB8995@gmail.com> Date: Mon, 8 Jun 2015 10:37:38 -0700 Message-ID: Subject: Re: [BUG, bisect] hrtimer: severe lag after suspend & resume From: John Stultz To: Thomas Gleixner Cc: Ingo Molnar , Jeremiah Mahler , Preeti U Murthy , Peter Zijlstra , Viresh Kumar , Marcelo Tosatti , Frederic Weisbecker , lkml Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7376 Lines: 160 On Mon, Jun 8, 2015 at 12:44 AM, Thomas Gleixner wrote: > On Fri, 5 Jun 2015, John Stultz wrote: >> On Fri, Jun 5, 2015 at 3:07 AM, Ingo Molnar wrote: >> > * Thomas Gleixner wrote: >> >> It's not about copying 24 bytes. It's about touching 3 cache lines for nothing. >> >> In situations where we run high frequency periodic timers on clock monotonic and >> >> nothing is going on in the other clock domains, which is a pretty common >> >> situation, this is measurable in terms of cache utilization. [...] >> > >> > It's not just about 'touching': it's about _dirtying_ cachelines from a globally >> > executed function (timekeeping), which is then accessed by per-CPU functionality >> > (hrtimers). >> >> Right, but part of that issue is that we're caching in the hrtimer cpu >> bases data that *should not be cached*. That was the core issue that >> caused the 2012 leapsecond issue, and I'd prefer to not reintroduce >> it. >> >> The offset data is only valid for the monotonic time its read for. So >> dirtying three cache lines really is just due to the fact that the >> data is stored in the cpu_base structure where I'd argue it doesn't >> provide real value (other then convenience of indexing it cleanly). >> >> Reading the offset data into three values from the stack would be fine >> too, and (I think) would avoid dirtying much extra (we have to store >> the now value anyway). > > Well, the problem is that we need to fetch that data on several > occasions: > > - hrtimer_start (if it is the first expiring timer of a clock) > - hrtimer_reprogram (after canceling the first timer) > - hrtimer_interrupt hrtimer_interrupt is really the one I'm most concerned with, because that's what does timer expiration. For that usage, having duplicated time state has been repeatedly problematic. For start and reprogram, I guess I'm not as concerned because bad cache values there can only result in errors in programming the clockevent hardware, which can only cause timers to firing late. > So I really prefer to have cached values instead of a function > call. And even if we do not cache stuff, there is no guarantee that we > wont expire a timer too early: > > CPU0 CPU1 > hrtimer_interrupt() > ------------------------------------- leap second edge > get_time_and_offsets_uncached() > do_leap_second_adjustment() > expire_timers() > > So, if the do_leap_second_adjustment() happens a bit too late, then > clock monotonic + offset_realtime will have advanced over the leap > second and expire a timer and the sleeper will then observe that it is > expired too early because the leap second adjustment finished before > it returned to user space. Well, the patch I proposed (which does leapsecond adjustments in the read paths) avoids this because update_base_and_offsets_now() checks to see if the monotonic base + offset would cross the leapsecond edge, and corrects the offset appropriately. > > You do not even need two cpus for this. You can observe the same issue > on a UP machine. Assume we have two hrtimers programmed to go off at > the leap seconds edge: > > 1) a user space timer > 2) the tick/leap second one > > If the user space timer has been enqueued before the leap one, then it > will be expired first and if the timer interrupt got delayed a bit it > again will see that its over the programmed time and happily expire to > early. > > So what ever we do vs. the hrtimer offsets, cached or not will not > prevent that we expire timers early. Right, so without my proposed patch, this is an issue, but my proposed patch requires that the hrtimer_interrupt not use cached offsets in order to ensure the read-state is adjusted properly for the leapsecond. (Or it requries the hrtimer_interrupt path to also cache the leapsecond state so it can do the same adjustment to the cached data, but this seems terribly duplicative). >> BTW: Thomas, what are you using to do measurements here? I hesitate >> to argue in these sorts of performance discussions, since I really >> only have a embarrassing theoretical understanding of the issues and >> suspect myself a bit naive here. Additionally these sorts of >> constraints aren't always clearly documented, so being able to >> measure and compare would be helpful to ensure future changes don't >> impact things here. > > performance counters and tests which stress the particular subsystems. Sweet. Vague hand-waving results are simple to reproduce. [Gestures, points assertively a few times] Results validated! Science! :) But seriously, I'm earnestly looking for specifics (like which stress tests are you caring about) here, so I can try to also watch that patches I write or take don't undo your performance optimizations, so in the future you'll have to yell at fewer people. >> > That makes it far more expensive, it has similar scalability limiting effects as a >> > global lock - while if we do it smart it can perform as essentially lockless code >> > in most cases. >> >> Another reason why I don't like this approach of caching the data is >> that it also prevents fixing the leap-second adjustment to happen on >> the second edge, because we have to have an async update to the >> seqcounter in order to refresh the cached real_offset. Or we have to >> also export more ntp state data so we can duplicate the adjustment to >> the cached data in the hrtimer code, which is more of the complexity >> you've objected to. > > There is no guarantee that it happens at the seconds edge. Timer might > be delayed, vcpu scheduled out .... Again, I worry you've not looked closely at my patch. The only way to assure the change happens at second edge is to do the adjustment in the read path. > All you will be able to do is to narrow the window, but as I explained > above it wont prevent early expiry and it wont prevent VDSO seing the > time go over the leap second and then jump back. The vdso is fixable as well. I just avoided it because fixing it has less benefit then fixing the timer expiration issue, and expected the performance overhead (which is an extra comparison and subtraction, not huge, but I respect that the vdsos are very optimized) would raise more objections. (If we want to get fancy, there's even the possibility of switching the vdso execution page so we use the fixed adjustment function only near the leap-second and go back to the normal function otherwise - but regardless, if its to be done, have to fix the core first) > We just have to accept that timekeeping, time readout and hrtimers > have asynchronous behaviour. And there is no way around that unless > you want to kill performance completely for the sake of this leap > second nonsense. Even outside of my leapsecond correctness concern, I think caching time state like you're doing here is a maintenance issue. It has bit us a number of times already, and things are complicated enough that even fixing the issues that stem from it are non-trivial. thanks -john -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/