Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S934414AbcLAUYJ (ORCPT ); Thu, 1 Dec 2016 15:24:09 -0500 Received: from mail-io0-f169.google.com ([209.85.223.169]:35438 "EHLO mail-io0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S934001AbcLAUYH (ORCPT ); Thu, 1 Dec 2016 15:24:07 -0500 MIME-Version: 1.0 In-Reply-To: References: <1479531216-25361-1-git-send-email-john.stultz@linaro.org> <20161129235727.GA19891@umbus> <20161201021233.GI19891@umbus> From: John Stultz Date: Thu, 1 Dec 2016 12:23:41 -0800 Message-ID: Subject: Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in its calculation. To: Thomas Gleixner Cc: David Gibson , lkml , Liav Rehana , Chris Metcalf , Richard Cochran , Ingo Molnar , Prarit Bhargava , Laurent Vivier , "Christopher S . Hall" , "4.6+" , Peter Zijlstra Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4277 Lines: 95 On Thu, Dec 1, 2016 at 3:59 AM, Thomas Gleixner wrote: > On Thu, 1 Dec 2016, David Gibson wrote: >> But.. delta is a cycle_t, which is typedef'd to u64, so how could it >> be negative? > > Indeed. To be honest I did not bother to look that up and for some reason I > was assuming that it's a s64. :( > > So yes, we can make all this unsigned and all worries about negative deltas > are moot. Sorry for the slow response, and thanks David, for stepping in here. So apologies for not rewriting the commit message, but this is the reason I came around on this patch. I didn't see negative deltas as valid and so moving to u64 seemed proper. > But we really should get rid of that cycle_t typedef and simply use u64 and > be done with it. All this typedeffery for no value is just causing > confusion. I'm very well able to confuse myself, so I don't need extra > stimulus. Yea, it can be obscuring. However I worry if we just have a bunch of u64s around folks (or maybe just me :) will mix up which are cycles and which are nanoseconds more easily. >> > The proper solution is to figure out WHY we are running into that situation >> > at all. So far all I have seen are symptom reports and fairy tales about >> > ftp connections, but no real root cause analysis. >> >> In the case I hit, it was due to running in a VM that had been stopped >> for a substantial amount of time, so nothing that's actually under the >> guest kernel's control. The bug-as-reported was that if the VM was >> suspended for too long it would blow up immediately upon resume. > > Suspended as in suspend/resume? The timekeeping_resume() code path does not > use the conversion function, it has already it's own algorithm to protect > against the potential overflow. > > So I assume that you are talking about a VM which was not scheduled by the > host due to overcommitment (who ever thought that this is a good idea) or > whatever other reason (yes, people were complaining about wreckage caused > by stopping kernels with debuggers) for a long enough time to trigger that > overflow situation. If that's the case then the unsigned conversion will > just make it more unlikely but it still will happen. > > I agree that clamping the result would prevent the time going backwards > issue for clocksources which have a wide enough counter (x86 TSC, powerpc > incrementer, ...), but it won't prevent problems with clocksources which > wrap around due to a small bit width of the counter. > > We have two options to deal with the issue for wide enough clocksources: > > 1) Checking that delta is less than clocksource->max_cycles. > > I really hate this because we invested a lot of thoughts to squeeze > everything we need for the time getters hotpath into a single cache > line. Accessing clocksource->max_cycles forces us to touch another > one. Bah! > > Aside of that what guarantees that we never run into a situation where > something doing timekeeping updates (NTP, PTP, PPS ...) uses such a > clamped value and comes to completely bogus conclusions? Are we going to > analyze and fixup all of that in order to prevent such wreckage? > > I seriously doubt that given the fact, that nobody sat down and analyzed > the signed/unsigned issue proper, which is way less complex. > > 2) Use mul_u64_u32_shr() > > That works without an extra cache line, but it's more expensive in terms > of text size especially on architectures which do not support the mul64 > expansion to 128bit natively. > > But that seems like the most robust solution. We can be clever and make > this conditional on both a configuration switch and a static key which > can be turned on by guests. I'll send out a RFC series later today. > > Yet another proof that virtualization is creating more problems than it > solves. I would also suggest: 3) If the systems are halted for longer then the timekeeping core expects, the system will "miss" or "lose" some portion of that halted time, but otherwise the system will function properly. Which is the result with this patch. I'm not sure if its really worth trying to recover that time or be perfect in those situations. Especially since on narrow clocksources you'll have the same result. thanks -john