Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1758294AbcLBIjk (ORCPT ); Fri, 2 Dec 2016 03:39:40 -0500
Received: from Galois.linutronix.de ([146.0.238.70]:59376 "EHLO
	Galois.linutronix.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1750842AbcLBIji (ORCPT ); Fri, 2 Dec 2016 03:39:38 -0500
Date: Fri, 2 Dec 2016 09:36:42 +0100 (CET)
From: Thomas Gleixner
To: David Gibson
cc: John Stultz, lkml, Liav Rehana, Chris Metcalf, Richard Cochran,
	Ingo Molnar, Prarit Bhargava, Laurent Vivier, "Christopher S . Hall",
	"4.6+", Peter Zijlstra
Subject: Re: [PATCH] timekeeping: Change type of nsec variable to unsigned
	in its calculation.
In-Reply-To: <20161201233210.GB31412@umbus.fritz.box>
Message-ID:
References: <1479531216-25361-1-git-send-email-john.stultz@linaro.org>
	<20161129235727.GA19891@umbus> <20161201021233.GI19891@umbus>
	<20161201233210.GB31412@umbus.fritz.box>
User-Agent: Alpine 2.20 (DEB 67 2015-01-07)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 2523
Lines: 55

On Fri, 2 Dec 2016, David Gibson wrote:
> On Thu, Dec 01, 2016 at 12:59:51PM +0100, Thomas Gleixner wrote:
> > So I assume that you are talking about a VM which was not scheduled by the
> > host due to overcommitment (who ever thought that this is a good idea) or
> > whatever other reason (yes, people were complaining about wreckage caused
> > by stopping kernels with debuggers) for a long enough time to trigger that
> > overflow situation. If that's the case then the unsigned conversion will
> > just make it more unlikely but it still will happen.
>
> It was essentially the stopped by debugger case. I forget exactly
> why, but the guest was being explicitly stopped from outside, it
> wasn't just scheduling lag. I think it was something in the vicinity
> of 10 minutes stopped.

Ok.
Debuggers stopping stuff is one issue, but if I understood Liav correctly,
then he is seeing the issue on a heavily loaded machine.

Liav, can you please describe the scenario in detail? Are you observing
this on bare metal or in a VM which gets scheduled out long enough or was
there debugging/hypervisor intervention involved?

> It's long enough ago that I can't be sure, but I thought we'd tried
> various different stoppage periods, which should have also triggered
> the unsigned overflow you're describing, and didn't observe the crash
> once the change was applied. Note that there have been other changes
> to the timekeeping code since then, which might have made a
> difference.
>
> I agree that it's not reasonable for the guest to be entirely
> unaffected by such a large stoppage: I'd have no complaints if the
> guest time was messed up, and/or it spewed warnings. But complete
> guest death seems a rather more fragile response to the situation than
> we'd like.

Guest death? Is it really dead/crashed or just stuck in that endless loop
trying to add that huge negative value piecewise? That's at least what Liav
was describing as he mentioned __iter_div_u64_rem() explicitly.

While I'm less worried about debuggers, I worry about the real thing.

I agree that we should not starve after resume from a debug stop, but in
that case the least of my worries is time going backwards.

Though if the signed mult overrun is observable in a live system, then we
need to worry about time going backwards even with the unsigned
conversion. Simply because once we have fixed the starvation issue, people
with insane enough setups will trigger the unsigned overrun and complain
about time going backwards.

Thanks,

	tglx
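
The signed versus unsigned overrun discussed above can be illustrated with a
small userspace sketch. It is not the kernel code or the patch under
discussion: the function names, and the mult/shift values (2^22 and 22, i.e.
one cycle per nanosecond), are assumptions chosen only to make the arithmetic
easy to follow. It also relies on gcc/clang behaviour for converting an
out-of-range value to int64_t and for right-shifting a negative value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical clocksource parameters, for illustration only. */
#define EXAMPLE_MULT  (UINT32_C(1) << 22)
#define EXAMPLE_SHIFT 22u

/*
 * Signed variant: the 64-bit product is stored in a signed variable.
 * Once the product reaches 2^63 the result turns negative.
 */
int64_t delta_to_ns_signed(uint64_t delta, uint32_t mult, uint32_t shift)
{
	int64_t nsec = (int64_t)(delta * mult);

	return nsec >> shift;
}

/*
 * Unsigned variant: survives products up to 2^64 - 1, but the
 * multiplication itself still wraps once the product reaches 2^64.
 */
uint64_t delta_to_ns_unsigned(uint64_t delta, uint32_t mult, uint32_t shift)
{
	uint64_t nsec = delta * mult;

	return nsec >> shift;
}

/* Sanity checks for the two overrun cases described in the note below. */
void demo(void)
{
	uint64_t signed_limit = UINT64_C(1) << 41;         /* product == 2^63 */
	uint64_t wrap_limit = (UINT64_C(1) << 42) + 5;     /* product >= 2^64 */

	/* Signed: a huge negative nanosecond value. */
	assert(delta_to_ns_signed(signed_limit, EXAMPLE_MULT, EXAMPLE_SHIFT) < 0);
	/* Unsigned: still correct for the same delta. */
	assert(delta_to_ns_unsigned(signed_limit, EXAMPLE_MULT, EXAMPLE_SHIFT)
	       == signed_limit);
	/* Unsigned: the product wrapped, almost all elapsed time is lost. */
	assert(delta_to_ns_unsigned(wrap_limit, EXAMPLE_MULT, EXAMPLE_SHIFT) == 5);
}
```

With these assumed parameters a delta of 2^41 cycles makes the signed result
negative while the unsigned result is still correct. A delta of 2^42 + 5 cycles
wraps the product itself, so the unsigned variant reports 5 ns instead of the
real elapsed time. That is the sense in which the unsigned conversion only
moves the limit out by a factor of two rather than removing it.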