Date: Fri, 2 Dec 2016 10:32:10 +1100
From: David Gibson <david@gibson.dropbear.id.au>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: John Stultz <john.stultz@linaro.org>,
        lkml <linux-kernel@vger.kernel.org>, Liav Rehana <liavr@mellanox.com>,
        Chris Metcalf <cmetcalf@mellanox.com>,
        Richard Cochran <richardcochran@gmail.com>,
        Ingo Molnar <mingo@kernel.org>, Prarit Bhargava <prarit@redhat.com>,
        Laurent Vivier <lvivier@redhat.com>,
        "Christopher S . Hall" <christopher.s.hall@intel.com>,
        "4.6+" <stable@vger.kernel.org>, Peter Zijlstra <peterz@infradead.org>
Subject: Re: [PATCH] timekeeping: Change type of nsec variable to unsigned in
 its calculation.
Message-ID: <20161201233210.GB31412@umbus.fritz.box>
References: <1479531216-25361-1-git-send-email-john.stultz@linaro.org>
 <alpine.DEB.2.20.1611291520070.4358@nanos>
 <20161129235727.GA19891@umbus>
 <alpine.DEB.2.20.1611302355070.3619@nanos>
 <20161201021233.GI19891@umbus>
 <alpine.DEB.2.20.1612011110270.3453@nanos>
MIME-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
        protocol="application/pgp-signature"; boundary="mxv5cy4qt+RJ9ypb"
Content-Disposition: inline
In-Reply-To: <alpine.DEB.2.20.1612011110270.3453@nanos>
User-Agent: Mutt/1.7.1 (2016-10-04)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 9456
Lines: 236


--mxv5cy4qt+RJ9ypb
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Dec 01, 2016 at 12:59:51PM +0100, Thomas Gleixner wrote:
> On Thu, 1 Dec 2016, David Gibson wrote:
> > On Thu, Dec 01, 2016 at 12:21:02AM +0100, Thomas Gleixner wrote:
> > > On Wed, 30 Nov 2016, David Gibson wrote:
> > > > On Tue, Nov 29, 2016 at 03:22:17PM +0100, Thomas Gleixner wrote:
> > > > > If we have legitimate use cases with a negative delta, then this =
patch
> > > > > breaks them no matter what. See the basic C course section in the=
 second
> > > > > link.
> > > >=20
> > > > So, fwiw, when I first wrote a variant on this, I wasn't trying to =
fix
> > > > every case - just to make the consequences less bad if something go=
es
> > > > wrong.  An overflow here can still mess up timekeeping, it's true, =
but
> > > > time going backwards tends to cause things to go horribly, horribly
> > > > wrong - which was why I spotted this in the first place.
> > >=20
> > > I completely understand the intention.
> > >=20
> > > We _cannot_ make that whole thing unsigned when it is not 100% clear
> > > that there is no legitimate caller which hands in a negative delta and
> > > rightfully expects to get a negative nanoseconds value handed back.
> >=20
> > But.. delta is a cycle_t, which is typedef'd to u64, so how could it
> > be negative?
>=20
> Indeed. To be honest I did not bother to look that up and for some reason=
 I
> was assuming that it's a s64. :(
>=20
> So yes, we can make all this unsigned and all worries about negative delt=
as
> are moot.
>=20
> But we really should get rid of that cycle_t typedef and simply use u64 a=
nd
> be done with it. All this typedeffery for no value is just causing
> confusion. I'm very well able to confuse myself, so I don't need extra
> stimulus.
>=20
> > This is why I believed my original version (35a4933) to be safe - it
> > was merely removing a signed intermediate from what was essentially an
> > unsigned calculation (technically the output was signed, but the right
> > shift means that's not relevant).
> >=20
> > > If someone sits down and proves that this cannot happen there is no r=
eason
> > > to hold that off.
> > >=20
> > > But that still does not solve the underlying root cause. Assume the
> > > following:
> > >=20
> > >   T1 =3D base + to_nsec(delta1)
> > >=20
> > >        where delta1 is big, but the multiplication does not overflow =
64bit
> > >=20
> > >   Now wait a bit and do:
> > >      =20
> > >   T2 =3D base + to_nsec(delta2)
> > >=20
> > >        now delta2 is big enough, so the  multiplication does overflow=
 64bit
> > >        now delta2 is big enough to overflow 64bit with the multiplica=
tion.
> > >=20
> > >   The result is T2 < T1, i.e. time goes backwards.
> >=20
> > Hm, I see.  Do we ever actually update time that way (at least core
> > system time), rather than using the last result as a base?
>=20
> It's:
>=20
>     delta =3D read(clocksoure) - last_update;
>     T =3D base + to_nsec(delta)
>=20
> and in the update function we do:
>=20
>     now =3D read(clocksoure);
>     delta =3D now - last_update;
>     last_update =3D now;
>     base +=3D to_ns(delta);
>=20
> Usually the update happens once per tick and if all cpus are idle we have=
 a
> safe guard which makes sure that the update happens _before_ we run into
> the overflow situation or into a wraparound of the clocksource, which ever
> comes first.

Ah, so base won't go backwards, but T could.  I guess that's what John
means about going backwards only within one interval.  I don't know
the timekeeping code well enough to know how bad the consequences of
that will be.

> > It does seem like the safer approach might be to clamp the result in
> > case of overflow, though.
>=20
> Right, but clamping has its own issues. See below.

Doubtless :/.

> > > All what the unsigned conversion does is to procrastinate the problem=
 by a
> > > factor of 2. So instead of failing after 10 seconds we fail after 20
> > > seconds. And just because you never observed the 20 seconds problem i=
t does
> > > not go away magically.
> >=20
> > At least in the case I was observing I'm pretty sure we weren't
> > updating time that way - we always used a delta from the last value,
> > so to_nsec() returning always positive was enough to make time not go
> > backwards.
>=20
> See above. But in case of failure the delta to the last update value was
> big enough to overflow into negative space. Making it unsigned just
> happened to 'work' because the stop time of the VM was not large enough to
> trigger the unsigned mult overflow.
>=20
> > > The proper solution is to figure out WHY we are running into that sit=
uation
> > > at all. So far all I have seen are symptom reports and fairy tales ab=
out
> > > ftp connections, but no real root cause analysis.
> >=20
> > In the case I hit, it was due to running in a VM that had been stopped
> > for a substantial amount of time, so nothing that's actually under the
> > guest kernel's control.  The bug-as-reported was that if the VM was
> > suspended for too long it would blow up immediately upon resume.
>=20
> Suspended as in suspend/resume? The timekeeping_resume() code path does n=
ot
> use the conversion function, it has already it's own algorithm to protect
> against the potential overflow.

Alas, no.

> So I assume that you are talking about a VM which was not scheduled by the
> host due to overcommitment (who ever thought that this is a good idea) or
> whatever other reason (yes, people were complaining about wreckage caused
> by stopping kernels with debuggers) for a long enough time to trigger that
> overflow situation. If that's the case then the unsigned conversion will
> just make it more unlikely but it still will happen.

It was essentially the stopped by debugger case.  I forget exactly
why, but the guest was being explicitly stopped from outside, it
wasn't just scheduling lag.  I think it was something in the vicinity
of 10 minutes stopped.

It's long enough ago that I can't be sure, but I thought we'd tried
various different stoppage periods, which should have also triggered
the unsigned overflow you're describing, and didn't observe the crash
once the change was applied.  Note that there have been other changes
to the timekeeping code since then, which might have made a
difference.

I agree that it's not reasonable for the guest to be entirely
unaffected by such a large stoppage: I'd have no complaints if the
guest time was messed up, and/or it spewed warnings.  But complete
guest death seems a rather more fragile response to the situation than
we'd like.

> I agree that clamping the result would prevent the time going backwards
> issue for clocksources which have a wide enough counter (x86 TSC, powerpc
> incrementer, ...), but it won't prevent problems with clocksources which
> wrap around due to a small bit width of the counter.
>=20
> We have two options to deal with the issue for wide enough clocksources:
>=20
> 1) Checking that delta is less than clocksource->max_cycles.
>=20
>    I really hate this because we invested a lot of thoughts to squeeze
>    everything we need for the time getters hotpath into a single cache
>    line. Accessing clocksource->max_cycles forces us to touch another
>    one. Bah!
>=20
>    Aside of that what guarantees that we never run into a situation where
>    something doing timekeeping updates (NTP, PTP, PPS ...) uses such a
>    clamped value and comes to completely bogus conclusions? Are we going =
to
>    analyze and fixup all of that in order to prevent such wreckage?
>   =20
>    I seriously doubt that given the fact, that nobody sat down and analyz=
ed
>    the signed/unsigned issue proper, which is way less complex.
>=20
> 2) Use mul_u64_u32_shr()
>=20
>    That works without an extra cache line, but it's more expensive in ter=
ms
>    of text size especially on architectures which do not support the mul64
>    expansion to 128bit natively.
>=20
>    But that seems like the most robust solution. We can be clever and make
>    this conditional on both a configuration switch and a static key which
>    can be turned on by guests. I'll send out a RFC series later today.
>=20
> Yet another proof that virtualization is creating more problems than it
> solves.

--=20
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

--mxv5cy4qt+RJ9ypb
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQIcBAEBCAAGBQJYQLL4AAoJEGw4ysog2bOSdCsQAJmg9OIZKUssXdLlU/kx1kKh
DAPZWcHdVQOhsikOV4ULS7+HrbwkjPmPJEsH1mrdbRieF5BiuVEnFYh8m52kIplN
meIRwksSTXZsT2yQTh+sqQDoAT54R0x1qhyVpoVRjl59s4osZGNV4JbT3c3J70vs
IW/q/2sPjOShmXJnfu6MBtLuiGxIX9E+w8yUyF3Dp/NiJSKRe+8qQ/t7dXd3S+VJ
K0nnJouMUEBYHp1jdpKvUc+pYpGn2iSZYTC+n4Vnuyhu+NDqORCMFFnSSRWfMTW5
L9qnTWlMZWhDM5zNDNLW6u6sM3U6oRWH7p289I0YATdnIdLxN2+4h+D8tZfgUIFb
R2StdYeqqt85hUZWB4s7Hyz2FjaEIc4SN0JA44bjeJ9dQwq6wWQ6l4WETm3RFWgM
GLqnL8Xst04h52+Y9IW8+0zwjVmuN7eJjiGxpJxnIZXH8gR95VD2fWZdRfTiLEqn
VzBzFAjwrQP+xztMsPEcwuohirX+I6rB06tCgU7LOsIkx0vRkCiVL5vq69V9QWAv
ejxgeEDxjOKlVyQMGDVoHXbUlEThgbRTudcq0k75qi/RYQtalP5dBbd0oyGV23YT
kSg4WsOl+4SGyMuQhUWzWpV6L49N7sfsufSwKoKZK7kiVWXD9Z6SBlZXT8eUTXzT
yatj5KWqPdJUPd8oCulA
=AL37
-----END PGP SIGNATURE-----

--mxv5cy4qt+RJ9ypb--