2016-12-08 20:52:24

by Thomas Gleixner

Subject: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

If the timekeeping CPU is scheduled out long enough by a hypervisor the
clocksource delta multiplication can overflow and as a result time can go
backwards. That's insane to begin with, but people already triggered a
signed multiplication overflow, so an unsigned overflow is not necessarily
impossible.

Implement optional 128bit math which can be selected by a config option.

Signed-off-by: Thomas Gleixner <[email protected]>
---
 kernel/time/Kconfig       |   15 +++++++++++++++
 kernel/time/timekeeping.c |   38 ++++++++++++++++++++++++++++++++++++--
 2 files changed, 51 insertions(+), 2 deletions(-)

--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -51,6 +51,21 @@ config GENERIC_CLOCKEVENTS_MIN_ADJUST
 config GENERIC_CMOS_UPDATE
         bool

+config TIMEKEEPING_USE_128BIT_MATH
+        bool "Enable 128 bit math in the timekeeping hotpath"
+        default n
+        depends on !ARCH_USES_GETTIMEOFFSET && EXPERT
+        help
+
+          If VMs get scheduled out for a long time then the clocksource
+          delta to nanoseconds conversion in timekeeping can overflow the
+          64bit multiplication. As a result, time going backwards might be
+          observed.
+
+          Enable this only if you want to support insane setups with
+          massive overcommitment as this introduces overhead into the
+          timekeeping hotpath.
+
 if GENERIC_CLOCKEVENTS
 menu "Timers subsystem"

--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -298,8 +298,41 @@ u32 (*arch_gettimeoffset)(void) = defaul
 static inline u32 arch_gettimeoffset(void) { return 0; }
 #endif

-static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr,
-                                          cycle_t delta)
+/*
+ * Enabled when timekeeping is supposed to deal with virtualization keeping
+ * VMs long enough scheduled out that the 64 * 32 bit multiplication in
+ * timekeeping_delta_to_ns() overflows 64bit.
+ */
+#ifdef CONFIG_TIMEKEEPING_USE_128BIT_MATH
+
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
+{
+        unsigned __int128 nsec;
+
+        nsec = ((unsigned __int128)delta * tkr->mult) + tkr->xtime_nsec;
+        return (u64) (nsec >> tkr->shift);
+}
+#else
+static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
+{
+        u32 dh, dl;
+        u64 nsec;
+
+        dl = delta;
+        dh = delta >> 32;
+
+        nsec = ((u64)dl * tkr->mult) + tkr->xtime_nsec;
+        nsec >>= tkr->shift;
+        if (unlikely(dh))
+                nsec += ((u64)dh * tkr->mult) << (32 - tkr->shift);
+        return nsec;
+}
+#endif
+
+#else /* CONFIG_TIMEKEEPING_USE_128BIT_MATH */
+
+static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
 {
         u64 nsec;

@@ -309,6 +342,7 @@ static inline u64 timekeeping_delta_to_n
         /* If arch requires, add in get_arch_timeoffset() */
         return nsec + arch_gettimeoffset();
 }
+#endif

static inline u64 timekeeping_get_ns(struct tk_read_base *tkr)
{



2016-12-09 04:08:32

by Ingo Molnar

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math


* Thomas Gleixner <[email protected]> wrote:

> If the timekeeping CPU is scheduled out long enough by a hypervisor the
> clocksource delta multiplication can overflow and as a result time can go
> backwards. That's insane to begin with, but people already triggered a
> signed multiplication overflow, so an unsigned overflow is not necessarily
> impossible.
>
> Implement optional 128bit math which can be selected by a config option.

What's the rough VM interruption time that would trigger an overflow? Given that
the clock shift tk_read_base::mult is often 1, isn't it 32-bit nsecs, i.e. 4
seconds?

That doesn't sound 'insanely long'.

Or some other value?

> +#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
> +static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
> +{
> +        unsigned __int128 nsec;
> +
> +        nsec = ((unsigned __int128)delta * tkr->mult) + tkr->xtime_nsec;
> +        return (u64) (nsec >> tkr->shift);
> +}
> +#else
> +static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
> +{
> +        u32 dh, dl;
> +        u64 nsec;
> +
> +        dl = delta;
> +        dh = delta >> 32;
> +
> +        nsec = ((u64)dl * tkr->mult) + tkr->xtime_nsec;
> +        nsec >>= tkr->shift;
> +        if (unlikely(dh))
> +                nsec += ((u64)dh * tkr->mult) << (32 - tkr->shift);
> +        return nsec;
> +}
> +#endif

Actually, 128-bit multiplication shouldn't be too horrible - at least on 64-bit
architectures. (128-bit division is another matter, but there's no division here.)

So we might as well use this by default on 64-bit architectures that have 64-bit
cycle counters - which covers the vast majority of hypervisor setups. Assuming
I'm correct that just 4 seconds of VM delay would make the whole logic unrobust.

Thanks,

Ingo

2016-12-09 04:29:12

by Ingo Molnar

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math


* Ingo Molnar <[email protected]> wrote:

>
> * Thomas Gleixner <[email protected]> wrote:
>
> > If the timekeeping CPU is scheduled out long enough by a hypervisor the
> > clocksource delta multiplication can overflow and as a result time can go
> > backwards. That's insane to begin with, but people already triggered a
> > signed multiplication overflow, so an unsigned overflow is not necessarily
> > impossible.
> >
> > Implement optional 128bit math which can be selected by a config option.
>
> What's the rough VM interruption time that would trigger an overflow? Given that
> the clock shift tk_read_base::mult is often 1, isn't it 32-bit nsecs, i.e. 4
> seconds?
>
> That doesn't sound 'insanely long'.
>
> Or some other value?

Ok, wasn't fully awake yet: more realistic values of the scaling factor on x86
would allow cycle input values of up to ~70 billion, which would allow deltas of
up to about 1 minute with 64-bit math.
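(For concreteness: the overflow horizon is easy to estimate with a trivial
standalone C sketch. The mult/shift pair below is an assumed example for a
2.5 GHz clocksource, not a value taken from this thread.)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* assumed pair for a 2.5 GHz clocksource with shift = 27:
         * mult ~= (1e9 / 2.5e9) * 2^27
         */
        const uint32_t mult = 53687091;
        const uint64_t freq = 2500000000ULL;

        /* largest delta for which delta * mult still fits in 64 bit */
        uint64_t max_cycles = UINT64_MAX / mult;

        printf("max delta: %llu cycles (~%llu s)\n",
               (unsigned long long)max_cycles,
               (unsigned long long)(max_cycles / freq));
        return 0;
}

(With those assumed numbers the product overflows after roughly two minutes of
accumulated delta - the same order of magnitude as the estimate above.)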

I think we should at least detect (and report?) the overflow and sanitize the
effects to the max offset instead of generating random overflown values.

That would also allow the 128-bit multiplication to be done only in the rare
case when we overflow. Which in turn could then be made unconditional. Am I
missing something?
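(A minimal sketch of that sanitizing idea - hypothetical, but tk_read_base
carries a clocksource pointer, and struct clocksource already has a max_cycles
field, which the debug checks mentioned below use:)

        /* clamp the delta before converting it, instead of letting the
         * multiplication wrap to a random value
         */
        if (unlikely(delta > tkr->clock->max_cycles))
                delta = tkr->clock->max_cycles;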

Thanks,

Ingo

2016-12-09 04:39:52

by John Stultz

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On Thu, Dec 8, 2016 at 8:29 PM, Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>>
>> * Thomas Gleixner <[email protected]> wrote:
>>
>> > If the timekeeping CPU is scheduled out long enough by a hypervisor the
>> > clocksource delta multiplication can overflow and as a result time can go
>> > backwards. That's insane to begin with, but people already triggered a
>> > signed multiplication overflow, so an unsigned overflow is not necessarily
>> > impossible.
>> >
>> > Implement optional 128bit math which can be selected by a config option.
>>
>> What's the rough VM interruption time that would trigger an overflow? Given that
>> the clock shift tk_read_base::mult is often 1, isn't it 32-bit nsecs, i.e. 4
>> seconds?
>>
>> That doesn't sound 'insanely long'.
>>
>> Or some other value?
>
> Ok, wasn't fully awake yet: more realistic values of the scaling factor on x86
> would allow cycle input values of up to ~70 billion, which would allow deltas of
> up to about 1 minute with 64-bit math.

So if I'm remembering properly, we pick mult/shift pairs such that the
mult shouldn't overflow from ~10 minutes' worth of cycles.
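(Those pairs come out of clocks_calc_mult_shift() in kernel/time/clocksource.c,
which takes the guaranteed conversion range in seconds; a sketch with an
assumed 2.5 GHz counter:)

#include <linux/clocksource.h>

/* derive a cycles->ns mult/shift pair that will not overflow the 64bit
 * product for ~600 seconds' worth of delta (assumed example frequency)
 */
static void example_mult_shift(void)
{
        u32 mult, shift;

        clocks_calc_mult_shift(&mult, &shift,
                               2500000000u,     /* from: counter freq (Hz) */
                               NSEC_PER_SEC,    /* to: nanoseconds */
                               600);            /* maxsec: conversion range */
}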

> I think we should at least detect (and report?) the overflow and sanitize the
> effects to the max offset instead of generating random overflown values.

So with CONFIG_DEBUG_TIMEKEEPING, we do check to see if the cycle
value is larger than the max_cycles and will report a warning. But
this is done at interrupt time and not in the hotpath.

thanks
-john

2016-12-09 04:49:03

by Peter Zijlstra

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On Fri, Dec 09, 2016 at 05:08:26AM +0100, Ingo Molnar wrote:
> > +#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
> > +static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
> > +{
> > +        unsigned __int128 nsec;
> > +
> > +        nsec = ((unsigned __int128)delta * tkr->mult) + tkr->xtime_nsec;
> > +        return (u64) (nsec >> tkr->shift);
> > +}
>
> Actually, 128-bit multiplication shouldn't be too horrible - at least on 64-bit
> architectures. (128-bit division is another matter, but there's no division here.)

IIRC there are 64bit architectures that do not have a 64x64->128 mult,
only a 64x64->64 mult instruction. It's not immediately apparent that
using __int128 will generate optimal code for those, nor is it a given
that GCC will not require libgcc functions for them.


2016-12-09 05:11:28

by Peter Zijlstra

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On Thu, Dec 08, 2016 at 08:49:39PM -0000, Thomas Gleixner wrote:

> +/*
> + * Enabled when timekeeping is supposed to deal with virtualization keeping
> + * VMs long enough scheduled out that the 64 * 32 bit multiplication in
> + * timekeeping_delta_to_ns() overflows 64bit.
> + */
> +#ifdef CONFIG_TIMEKEEPING_USE_128BIT_MATH
> +
> +#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
> +static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
> +{
> +        unsigned __int128 nsec;
> +
> +        nsec = ((unsigned __int128)delta * tkr->mult) + tkr->xtime_nsec;
> +        return (u64) (nsec >> tkr->shift);
> +}
> +#else
> +static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
> +{
> +        u32 dh, dl;
> +        u64 nsec;
> +
> +        dl = delta;
> +        dh = delta >> 32;
> +
> +        nsec = ((u64)dl * tkr->mult) + tkr->xtime_nsec;
> +        nsec >>= tkr->shift;
> +        if (unlikely(dh))
> +                nsec += ((u64)dh * tkr->mult) << (32 - tkr->shift);
> +        return nsec;
> +}
> +#endif
> +
> +#else /* CONFIG_TIMEKEEPING_USE_128BIT_MATH */

xtime_nsec confuses me: contrary to its name, it's not actually in nsec,
it's in shifted nsec units for some reason (and that might well be a good
reason, but I don't know).

In any case, it needing to be inside the shift is somewhat unfortunate
in that it doesn't allow you to use the existing mul_u64_u32_shr().


2016-12-09 05:22:09

by Ingo Molnar

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math


* Peter Zijlstra <[email protected]> wrote:

> On Fri, Dec 09, 2016 at 05:08:26AM +0100, Ingo Molnar wrote:
> > > +#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
> > > +static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
> > > +{
> > > +        unsigned __int128 nsec;
> > > +
> > > +        nsec = ((unsigned __int128)delta * tkr->mult) + tkr->xtime_nsec;
> > > +        return (u64) (nsec >> tkr->shift);
> > > +}
> >
> > Actually, 128-bit multiplication shouldn't be too horrible - at least on 64-bit
> > architectures. (128-bit division is another matter, but there's no division here.)
>
> IIRC there are 64bit architectures that do not have a 64x64->128 mult,
> only a 64x64->64 mult instruction. It's not immediately apparent that
> using __int128 will generate optimal code for those, nor is it a given
> that GCC will not require libgcc functions for them.

Well, if the overflow case is rare (which it is in this case) then it should still
be relatively straightforward, something like:

X and Y are 64-bit:

X = Xh*2^32 + Xl
Y = Yh*2^32 + Yl

X*Y = (Xh*2^32 + Xl)*(Yh*2^32 + Yl)

= Xh*2^32*(Yh*2^32 + Yl)
+ Xl*(Yh*2^32 + Yl)

= Xh*Yh*2^64
+ Xh*Yl*2^32
+ Xl*Yh*2^32
+ Xl*Yl

Which is four 32x32->64 multiplications in the worst case.

Where a valid overflow threshold is relatively easy to determine in a hot path
compatible fashion:

if (Xh != 0 || Yh != 0)
slow_path();

And this simple and fast overflow check should still cover the overwhelming
majority of 'sane' systems. (A more involved 'could it overflow' check of counting
the high bits with 8 bit granularity by looking at the high bytes not at the words
could be done in the slow path - to still avoid the 4 multiplications in most
cases.)
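(Transcribed into C, the decomposition above would look something like the
sketch below; the kernel's mul_u64_u64_shr() in include/linux/math64.h is
built the same way.)

#include <stdint.h>

/* 64x64->128 from four 32x32->64 multiplies, as derived above */
static void mul_64x64_128(uint64_t x, uint64_t y, uint64_t *lo, uint64_t *hi)
{
        uint32_t xl = x, xh = x >> 32;
        uint32_t yl = y, yh = y >> 32;

        uint64_t ll = (uint64_t)xl * yl;        /* Xl*Yl        */
        uint64_t lh = (uint64_t)xl * yh;        /* Xl*Yh * 2^32 */
        uint64_t hl = (uint64_t)xh * yl;        /* Xh*Yl * 2^32 */
        uint64_t hh = (uint64_t)xh * yh;        /* Xh*Yh * 2^64 */

        /* middle column plus the carry out of the low 32 bits */
        uint64_t mid = (ll >> 32) + (uint32_t)lh + (uint32_t)hl;

        *lo = (mid << 32) | (uint32_t)ll;
        *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
}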

Am I missing something?

Thanks,

Ingo

2016-12-09 05:26:49

by Peter Zijlstra

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On Thu, Dec 08, 2016 at 08:49:39PM -0000, Thomas Gleixner wrote:

> +static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
> +{
> +        u32 dh, dl;
> +        u64 nsec;
> +
> +        dl = delta;
> +        dh = delta >> 32;
> +
> +        nsec = ((u64)dl * tkr->mult) + tkr->xtime_nsec;
> +        nsec >>= tkr->shift;
> +        if (unlikely(dh))
> +                nsec += ((u64)dh * tkr->mult) << (32 - tkr->shift);
> +        return nsec;
> +}

Just for giggles, on tilegx the branch is actually slower than doing the
mult unconditionally.

The problem is that the two multiplies would otherwise completely
pipeline, whereas with the conditional you serialize them.

(came to light while talking about why the mul_u64_u32_shr() fallback
didn't work right for them, which was a combination of the above issue
and the fact that their compiler 'lost' the fact that these are
32x32->64 mults and did 64x64 ones instead).

2016-12-09 05:41:55

by Peter Zijlstra

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On Fri, Dec 09, 2016 at 06:22:03AM +0100, Ingo Molnar wrote:
>
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Fri, Dec 09, 2016 at 05:08:26AM +0100, Ingo Molnar wrote:
> > > > +#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
> > > > +static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
> > > > +{
> > > > +        unsigned __int128 nsec;
> > > > +
> > > > +        nsec = ((unsigned __int128)delta * tkr->mult) + tkr->xtime_nsec;
> > > > +        return (u64) (nsec >> tkr->shift);
> > > > +}
> > >
> > > Actually, 128-bit multiplication shouldn't be too horrible - at least on 64-bit
> > > architectures. (128-bit division is another matter, but there's no division here.)
> >
> > IIRC there are 64bit architectures that do not have a 64x64->128 mult,
> > only a 64x64->64 mult instruction. It's not immediately apparent that
> > using __int128 will generate optimal code for those, nor is it a given
> > that GCC will not require libgcc functions for them.
>
> Well, if the overflow case is rare (which it is in this case) then it should still
> be relatively straightforward, something like:
>
> X and Y are 64-bit:
>
> X = Xh*2^32 + Xl
> Y = Yh*2^32 + Yl
>
> X*Y = (Xh*2^32 + Xl)*(Yh*2^32 + Yl)
>
> = Xh*2^32*(Yh*2^32 + Yl)
> + Xl*(Yh*2^32 + Yl)
>
> = Xh*Yh*2^64
> + Xh*Yl*2^32
> + Xl*Yh*2^32
> + Xl*Yl
>
> Which is four 32x32->64 multiplications in the worst case.

Yeah, that's the full 64x64->128 mult on 32bit. Luckily we only need
64x32->96, which reduces to 2 32x32->64 mults.

But my point was that unconditionally using __int128 might not be the
right thing.

> Where a valid overflow threshold is relatively easy to determine in a hot path
> compatible fashion:
>
> if (Xh != 0 || Yh != 0)
> slow_path();
>
> And this simple and fast overflow check should still cover the overwhelming
> majority of 'sane' systems. (A more involved 'could it overflow' check of counting
> the high bits with 8 bit granularity by looking at the high bytes not at the words
> could be done in the slow path - to still avoid the 4 multiplications in most
> cases.)
>
> Am I missing something?

Yeah, the fact that we only need the 2 mults and that the fallback
already does the second multiply conditionally :-) But then look at the
email where I said that that condition actually makes the thing vastly
more expensive on some archs (like tilegx).

2016-12-09 06:08:15

by Peter Zijlstra

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On Fri, Dec 09, 2016 at 06:11:17AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 08, 2016 at 08:49:39PM -0000, Thomas Gleixner wrote:
>
> > +/*
> > + * Enabled when timekeeping is supposed to deal with virtualization keeping
> > + * VMs long enough scheduled out that the 64 * 32 bit multiplication in
> > + * timekeeping_delta_to_ns() overflows 64bit.
> > + */
> > +#ifdef CONFIG_TIMEKEEPING_USE_128BIT_MATH
> > +
> > +#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
> > +static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
> > +{
> > +        unsigned __int128 nsec;
> > +
> > +        nsec = ((unsigned __int128)delta * tkr->mult) + tkr->xtime_nsec;
> > +        return (u64) (nsec >> tkr->shift);
> > +}
> > +#else
> > +static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
> > +{
> > +        u32 dh, dl;
> > +        u64 nsec;
> > +
> > +        dl = delta;
> > +        dh = delta >> 32;
> > +
> > +        nsec = ((u64)dl * tkr->mult) + tkr->xtime_nsec;
> > +        nsec >>= tkr->shift;
> > +        if (unlikely(dh))
> > +                nsec += ((u64)dh * tkr->mult) << (32 - tkr->shift);
> > +        return nsec;
> > +}
> > +#endif
> > +
> > +#else /* CONFIG_TIMEKEEPING_USE_128BIT_MATH */
>
> xtime_nsec confuses me: contrary to its name, it's not actually in nsec,
> it's in shifted nsec units for some reason (and that might well be a good
> reason, but I don't know).
>
> In any case, it needing to be inside the shift is somewhat unfortunate
> in that it doesn't allow you to use the existing mul_u64_u32_shr().

Wouldn't something like:

        nsec = mul_u64_u32_shr(delta, tkr->mult, tkr->shift);
        nsec += tkr->xtime_nsec >> tkr->shift;

be good enough? Sure you have a slight rounding error, which results in
a few jaggies in the actual timeline, but it would still be monotonic.

That is, we'll observe the ns rollover 'late', but given it's ns, does
anybody really care?
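(Folded into the helper, that suggestion would read something like:)

static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
{
        /* shift xtime_nsec down separately: slightly coarser rounding,
         * but still monotonic, and the wide multiply becomes
         * mul_u64_u32_shr()'s problem
         */
        return mul_u64_u32_shr(delta, tkr->mult, tkr->shift) +
               (tkr->xtime_nsec >> tkr->shift);
}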

2016-12-09 06:38:59

by Peter Zijlstra

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On Fri, Dec 09, 2016 at 06:26:38AM +0100, Peter Zijlstra wrote:
> On Thu, Dec 08, 2016 at 08:49:39PM -0000, Thomas Gleixner wrote:
>
> > +static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
> > +{
> > +        u32 dh, dl;
> > +        u64 nsec;
> > +
> > +        dl = delta;
> > +        dh = delta >> 32;
> > +
> > +        nsec = ((u64)dl * tkr->mult) + tkr->xtime_nsec;
> > +        nsec >>= tkr->shift;
> > +        if (unlikely(dh))
> > +                nsec += ((u64)dh * tkr->mult) << (32 - tkr->shift);
> > +        return nsec;
> > +}
>
> Just for giggles, on tilegx the branch is actually slower than doing the
> mult unconditionally.
>
> The problem is that the two multiplies would otherwise completely
> pipeline, whereas with the conditional you serialize them.

On my Haswell laptop the unconditional version is faster too.

> (came to light while talking about why the mul_u64_u32_shr() fallback
> didn't work right for them, which was a combination of the above issue
> and the fact that their compiler 'lost' the fact that these are
> 32x32->64 mults and did 64x64 ones instead).

Turns out using GCC-6.2.1 we have the same problem on i386, GCC doesn't
recognise the 32x32 mults and generates crap.

This used to work :/

2016-12-09 08:30:54

by Peter Zijlstra

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On Fri, Dec 09, 2016 at 07:38:47AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 09, 2016 at 06:26:38AM +0100, Peter Zijlstra wrote:

> > Just for giggles, on tilegx the branch is actually slower than doing the
> > mult unconditionally.
> >
> > The problem is that the two multiplies would otherwise completely
> > pipeline, whereas with the conditional you serialize them.
>
> On my Haswell laptop the unconditional version is faster too.

Only when using x86_64 instructions; once I fixed the i386 variant it
was slower, probably due to register pressure and the like.

> > (came to light while talking about why the mul_u64_u32_shr() fallback
> > didn't work right for them, which was a combination of the above issue
> > and the fact that their compiler 'lost' the fact that these are
> > 32x32->64 mults and did 64x64 ones instead).
>
> Turns out using GCC-6.2.1 we have the same problem on i386, GCC doesn't
> recognise the 32x32 mults and generates crap.
>
> This used to work :/

Do we want something like so?

---
 arch/tile/include/asm/Kbuild  |  1 -
 arch/tile/include/asm/div64.h | 14 ++++++++++++++
 arch/x86/include/asm/div64.h  | 10 ++++++++++
 include/linux/math64.h        | 26 ++++++++++++++++++--------
 4 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/arch/tile/include/asm/Kbuild b/arch/tile/include/asm/Kbuild
index 2d1f5638974c..20f2ba6d79be 100644
--- a/arch/tile/include/asm/Kbuild
+++ b/arch/tile/include/asm/Kbuild
@@ -5,7 +5,6 @@ generic-y += bug.h
 generic-y += bugs.h
 generic-y += clkdev.h
 generic-y += cputime.h
-generic-y += div64.h
 generic-y += emergency-restart.h
 generic-y += errno.h
 generic-y += exec.h
diff --git a/arch/tile/include/asm/div64.h b/arch/tile/include/asm/div64.h
index e69de29bb2d1..bf6161966dfa 100644
--- a/arch/tile/include/asm/div64.h
+++ b/arch/tile/include/asm/div64.h
@@ -0,0 +1,14 @@
+#ifndef _ASM_TILE_DIV64_H
+#define _ASM_TILE_DIV64_H
+
+#ifdef __tilegx__
+static inline u64 mul_u32_u32(u32 a, u32 b)
+{
+        return __insn_mul_lu_lu(a, b);
+}
+#define mul_u32_u32 mul_u32_u32
+#endif
+
+#include <asm-generic/div64.h>
+
+#endif /* _ASM_TILE_DIV64_H */
diff --git a/arch/x86/include/asm/div64.h b/arch/x86/include/asm/div64.h
index ced283ac79df..68f4ae5e8976 100644
--- a/arch/x86/include/asm/div64.h
+++ b/arch/x86/include/asm/div64.h
@@ -59,6 +59,16 @@ static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
 }
 #define div_u64_rem div_u64_rem
 
+static inline u64 mul_u32_u32(u32 a, u32 b)
+{
+        u64 ret;
+
+        asm ("mull %[b]" : "=A" (ret) : [a] "a" (a), [b] "g" (b) );
+
+        return ret;
+}
+#define mul_u32_u32 mul_u32_u32
+
 #else
 # include <asm-generic/div64.h>
 #endif /* CONFIG_X86_32 */
diff --git a/include/linux/math64.h b/include/linux/math64.h
index 6e8b5b270ffe..80690c96c734 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -133,6 +133,16 @@ static inline s64 div_s64(s64 dividend, s32 divisor)
         return ret;
 }
 
+#ifndef mul_u32_u32
+/*
+ * Many a GCC version messes this up and generates a 64x64 mult :-(
+ */
+static inline u64 mul_u32_u32(u32 a, u32 b)
+{
+        return (u64)a * b;
+}
+#endif
+
 #if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
 
 #ifndef mul_u64_u32_shr
@@ -160,9 +170,9 @@ static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
         al = a;
         ah = a >> 32;
 
-        ret = ((u64)al * mul) >> shift;
+        ret = mul_u32_u32(al, mul) >> shift;
         if (ah)
-                ret += ((u64)ah * mul) << (32 - shift);
+                ret += mul_u32_u32(ah, mul) << (32 - shift);
 
         return ret;
 }
@@ -186,10 +196,10 @@ static inline u64 mul_u64_u64_shr(u64 a, u64 b, unsigned int shift)
         a0.ll = a;
         b0.ll = b;
 
-        rl.ll = (u64)a0.l.low * b0.l.low;
-        rm.ll = (u64)a0.l.low * b0.l.high;
-        rn.ll = (u64)a0.l.high * b0.l.low;
-        rh.ll = (u64)a0.l.high * b0.l.high;
+        rl.ll = mul_u32_u32(a0.l.low, b0.l.low);
+        rm.ll = mul_u32_u32(a0.l.low, b0.l.high);
+        rn.ll = mul_u32_u32(a0.l.high, b0.l.low);
+        rh.ll = mul_u32_u32(a0.l.high, b0.l.high);
 
         /*
          * Each of these lines computes a 64-bit intermediate result into "c",
@@ -229,8 +239,8 @@ static inline u64 mul_u64_u32_div(u64 a, u32 mul, u32 divisor)
         } u, rl, rh;
 
         u.ll = a;
-        rl.ll = (u64)u.l.low * mul;
-        rh.ll = (u64)u.l.high * mul + rl.l.high;
+        rl.ll = mul_u32_u32(u.l.low, mul);
+        rh.ll = mul_u32_u32(u.l.high, mul) + rl.l.high;
 
         /* Bits 32-63 of the result will be in rh.l.low. */
         rl.l.high = do_div(rh.ll, divisor);

2016-12-09 09:12:05

by Peter Zijlstra

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On Fri, Dec 09, 2016 at 09:30:11AM +0100, Peter Zijlstra wrote:

> > > Just for giggles, on tilegx the branch is actually slower than doing the
> > > mult unconditionally.
> > >
> > > The problem is that the two multiplies would otherwise completely
> > > pipeline, whereas with the conditional you serialize them.

> Only when using x86_64 instructions; once I fixed the i386 variant it
> was slower, probably due to register pressure and the like.

OK, maybe I messed up on i386, although I've yet to try running that on
an actual 32bit machine. I also need to dig up a small core; who knows
what Atoms do.

Results are in cycles, averaged over 1e6 loops. I think the 128 results
are around 1 cycle; measurements are maybe a tad wobbly because I
compare against an empty loop to correct for measurement overhead.
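(The mult.c source itself wasn't posted; a minimal userspace harness in the
same spirit - with an assumed mult/shift pair - might look like this:)

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

#define MULT    53687091u       /* assumed mult/shift pair */
#define SHIFT   27
#define LOOPS   1000000L

static volatile uint64_t sink;

static uint64_t to_ns_cond(uint64_t d)
{
        uint32_t dl = d, dh = d >> 32;
        uint64_t ns = ((uint64_t)dl * MULT) >> SHIFT;

        if (dh)                 /* the branch under discussion */
                ns += ((uint64_t)dh * MULT) << (32 - SHIFT);
        return ns;
}

static uint64_t to_ns_uncond(uint64_t d)
{
        uint32_t dl = d, dh = d >> 32;

        return (((uint64_t)dl * MULT) >> SHIFT) +
               (((uint64_t)dh * MULT) << (32 - SHIFT));
}

static uint64_t ident(uint64_t d)       /* empty-loop baseline */
{
        return d;
}

static double bench(uint64_t (*fn)(uint64_t))
{
        uint64_t t0 = __rdtsc();

        for (long i = 0; i < LOOPS; i++)
                sink = fn((uint64_t)i);
        return (double)(__rdtsc() - t0) / LOOPS;
}

int main(void)
{
        double base = bench(ident);

        printf("cond:   avg: %f\n", bench(to_ns_cond) - base);
        printf("uncond: avg: %f\n", bench(to_ns_uncond) - base);
        return 0;
}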


root@ivb-ep:~/tmp# for i in -m64 -m32 -mx32; do echo $i ; gcc -O3 $i -o mult mult.c -lm; ./mult ; done

-m64
cond: avg: 5.487738 +- 0.004152
uncond: avg: 4.495690 +- 0.006009
128: avg: 0.634496 +- 0.004795

-m32
cond: avg: 14.807630 +- 0.006890
uncond: avg: 11.601985 +- 0.009722

-mx32
cond: avg: 5.027696 +- 0.005766
uncond: avg: 4.038013 +- 0.008069
128: avg: 0.009928 +- 0.005730


root@hsw:~/tmp# for i in -m64 -m32 -mx32; do echo $i ; gcc -O3 $i -o mult mult.c -lm; ./mult ; done

-m64
cond: avg: 1.998718 +- 0.008775
uncond: avg: 2.004795 +- 0.009865
128: avg: 0.991947 +- 0.007607

-m32
cond: avg: 12.981868 +- 0.011239
uncond: avg: 13.000566 +- 0.011668

-mx32
cond: avg: 2.005437 +- 0.006840
uncond: avg: 3.001631 +- 0.004786
128: avg: 1.990425 +- 0.003880


2016-12-09 10:01:45

by Peter Zijlstra

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On Fri, Dec 09, 2016 at 09:30:11AM +0100, Peter Zijlstra wrote:
> +static inline u64 mul_u32_u32(u32 a, u32 b)
> +{
> +        u64 ret;
> +
> +        asm ("mull %[b]" : "=A" (ret) : [a] "a" (a), [b] "g" (b) );
> +
> +        return ret;
> +}

ARGH, that's broken on x86_64, it needs to be:

        u32 high, low;

        asm ("mull %[b]" : "=a" (low), "=d" (high)
             : [a] "a" (a), [b] "g" (b) );

        return low | ((u64)high) << 32;

The 'A' constraint doesn't work right.
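Folded back into the helper, the corrected version reads:

static inline u64 mul_u32_u32(u32 a, u32 b)
{
        u32 high, low;

        asm ("mull %[b]" : "=a" (low), "=d" (high)
             : [a] "a" (a), [b] "g" (b) );

        return low | ((u64)high) << 32;
}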

And with that all the benchmark results are borken too.



root@ivb-ep:~/spinlocks# for i in -m64 -m32 -mx32 ; do echo $i; gcc -O3 $i -o mult mult.c -lm; ./mult; done

-m64
cond: avg: 7.474872 +- 0.008302
uncond: avg: 9.116401 +- 0.008468
128: avg: 0.826584 +- 0.005514

-m32
cond: avg: 16.604030 +- 0.009808
uncond: avg: 13.115470 +- 0.004452

-mx32
cond: avg: 6.168156 +- 0.006650
uncond: avg: 7.202092 +- 0.006813
128: avg: 0.081809 +- 0.008440


2016-12-09 10:18:08

by Peter Zijlstra

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On Fri, Dec 09, 2016 at 07:38:47AM +0100, Peter Zijlstra wrote:

> Turns out using GCC-6.2.1 we have the same problem on i386, GCC doesn't
> recognise the 32x32 mults and generates crap.
>
> This used to work :/

I tried:

gcc-4.4: good
gcc-4.6, gcc-4.8, gcc-5.4, gcc-6.2: bad


2016-12-09 17:32:26

by Chris Metcalf

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On 12/9/2016 3:30 AM, Peter Zijlstra wrote:
> On Fri, Dec 09, 2016 at 07:38:47AM +0100, Peter Zijlstra wrote:
>> On Fri, Dec 09, 2016 at 06:26:38AM +0100, Peter Zijlstra wrote:
>>> Just for giggles, on tilegx the branch is actually slower than doing the
>>> mult unconditionally.
>>>
>>> The problem is that the two multiplies would otherwise completely
>>> pipeline, whereas with the conditional you serialize them.
>> On my Haswell laptop the unconditional version is faster too.
> Only when using x86_64 instructions; once I fixed the i386 variant it
> was slower, probably due to register pressure and the like.
>
>>> (came to light while talking about why the mul_u64_u32_shr() fallback
>>> didn't work right for them, which was a combination of the above issue
>>> and the fact that their compiler 'lost' the fact that these are
>>> 32x32->64 mults and did 64x64 ones instead).
>> Turns out using GCC-6.2.1 we have the same problem on i386, GCC doesn't
>> recognise the 32x32 mults and generates crap.
>>
>> This used to work :/
> Do we want something like so?
>
> ---
>  arch/tile/include/asm/Kbuild  |  1 -
>  arch/tile/include/asm/div64.h | 14 ++++++++++++++
>  arch/x86/include/asm/div64.h  | 10 ++++++++++
>  include/linux/math64.h        | 26 ++++++++++++++++++--------
>  4 files changed, 42 insertions(+), 9 deletions(-)

Untested, but I looked at it closely, and it seems like a decent idea.

Acked-by: Chris Metcalf <[email protected]> [for tile]

Of course if this is pushed up, it will then probably be too tempting for me not
to add the tilegx-specific mul_u64_u32_shr() to take advantage of pipelining
the two 32x32->64 multiplies :-)
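(A hypothetical sketch of what that might look like, reusing the
__insn_mul_lu_lu() intrinsic from the div64.h hunk above - both multiplies
issue unconditionally so they can pipeline:)

static inline u64 mul_u64_u32_shr(u64 a, u32 mul, unsigned int shift)
{
        return (__insn_mul_lu_lu((u32)a, mul) >> shift) +
               (__insn_mul_lu_lu((u32)(a >> 32), mul) << (32 - shift));
}
#define mul_u64_u32_shr mul_u64_u32_shr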

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com

2016-12-09 17:37:04

by Chris Metcalf

Subject: Re: [patch 5/6] [RFD] timekeeping: Provide optional 128bit math

On 12/9/2016 5:18 AM, Peter Zijlstra wrote:
> On Fri, Dec 09, 2016 at 07:38:47AM +0100, Peter Zijlstra wrote:
>
>> Turns out using GCC-6.2.1 we have the same problem on i386, GCC doesn't
>> recognise the 32x32 mults and generates crap.
>>
>> This used to work :/
> I tried:
>
> gcc-4.4: good
> gcc-4.6, gcc-4.8, gcc-5.4, gcc-6.2: bad

I also found 4.4 was good on tilegx at recognizing the 32x32, and bad on
the later versions I tested; I don't recall which specific later versions I
tried, though.

--
Chris Metcalf, Mellanox Technologies
http://www.mellanox.com