2021-11-14 07:48:22

by Huacai Chen

[permalink] [raw]
Subject: [PATCH] time/sched_clock: Allow architecture to override cyc_to_ns()

The current cyc_to_ns() implementation is like this:

static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
{
return (cyc * mult) >> shift;
}

But u64*u32 maybe overflow, so introduce ARCH_HAS_CYC_TO_NS to allow
architecture to override it.

Signed-off-by: Huacai Chen <[email protected]>
---
kernel/time/Kconfig | 4 ++++
kernel/time/sched_clock.c | 2 ++
2 files changed, 6 insertions(+)

diff --git a/kernel/time/Kconfig b/kernel/time/Kconfig
index 04bfd62f5e5c..5093e67115e8 100644
--- a/kernel/time/Kconfig
+++ b/kernel/time/Kconfig
@@ -30,6 +30,10 @@ config GENERIC_TIME_VSYSCALL
config GENERIC_CLOCKEVENTS
def_bool !LEGACY_TIMER_TICK

+# Architecture has its own cyc_to_ns() implementation
+config ARCH_HAS_CYC_TO_NS
+ bool
+
# Architecture can handle broadcast in a driver-agnostic way
config ARCH_HAS_TICK_BROADCAST
bool
diff --git a/kernel/time/sched_clock.c b/kernel/time/sched_clock.c
index b1b9b12899f5..4496962e8e9f 100644
--- a/kernel/time/sched_clock.c
+++ b/kernel/time/sched_clock.c
@@ -63,10 +63,12 @@ static struct clock_data cd ____cacheline_aligned = {
.actual_read_sched_clock = jiffy_sched_clock_read,
};

+#ifndef CONFIG_ARCH_HAS_CYC_TO_NS
static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
{
return (cyc * mult) >> shift;
}
+#endif

notrace struct clock_read_data *sched_clock_read_begin(unsigned int *seq)
{
--
2.27.0



2021-11-16 02:12:16

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH] time/sched_clock: Allow architecture to override cyc_to_ns()

On Sat, Nov 13, 2021 at 11:47 PM Huacai Chen <[email protected]> wrote:
>
> The current cyc_to_ns() implementation is like this:
>
> static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
> {
> return (cyc * mult) >> shift;
> }
>
> But u64*u32 maybe overflow, so introduce ARCH_HAS_CYC_TO_NS to allow
> architecture to override it.
>

If that's the case, it would seem too large a mult/shift pair had been selected.

What sort of cycle range are you considering to be valid here? Can you
provide more rationale as to why this needs the ability to be
overridden?

And what sort of arch-specific logic do you envision, rather than
having a common implementation to avoid the overflow?

thanks
-john

2021-11-16 04:39:55

by Huacai Chen

[permalink] [raw]
Subject: Re: [PATCH] time/sched_clock: Allow architecture to override cyc_to_ns()

Hi, John,

On Tue, Nov 16, 2021 at 1:27 AM John Stultz <[email protected]> wrote:
>
> On Sat, Nov 13, 2021 at 11:47 PM Huacai Chen <[email protected]> wrote:
> >
> > The current cyc_to_ns() implementation is like this:
> >
> > static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
> > {
> > return (cyc * mult) >> shift;
> > }
> >
> > But u64*u32 maybe overflow, so introduce ARCH_HAS_CYC_TO_NS to allow
> > architecture to override it.
> >
>
> If that's the case, it would seem too large a mult/shift pair had been selected.
We use a 100MHz clock and the counter is 64bit, the mult is ~160M. But
even if we use a smaller mult, cyc*mult, it can also overflow.

>
> What sort of cycle range are you considering to be valid here? Can you
> provide more rationale as to why this needs the ability to be
> overridden?
>
> And what sort of arch-specific logic do you envision, rather than
> having a common implementation to avoid the overflow?
u64*u64 can be handled by hardware (store the high bits and low bits
of result in two registers). So, if we use assembly, we can handle the
overflow correctly. E.g., LoongArch (and MIPS) can override
cyc_to_ns() like this:

static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
{
u64 t1, t2, t3;
unsigned long long rv;

/* 64-bit arithmetic can overflow, so use 128-bit. */
__asm__ (
"nor %[t1], $r0, %[shift] \n\t"
"mulh.du %[t2], %[cyc], %[mult] \n\t"
"mul.d %[t3], %[cyc], %[mult] \n\t"
"slli.d %[t2], %[t2], 1 \n\t"
"srl.d %[rv], %[t3], %[shift] \n\t"
"sll.d %[t1], %[t2], %[t1] \n\t"
"or %[rv], %[t1], %[rv] \n\t"
: [rv] "=&r" (rv), [t1] "=&r" (t1), [t2] "=&r" (t2),
[t3] "=&r" (t3)
: [cyc] "r" (cyc), [mult] "r" (mult), [shift] "r" (shift)
: );
return rv;
}

Huacai
>
> thanks
> -john

2021-11-16 05:02:24

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH] time/sched_clock: Allow architecture to override cyc_to_ns()

On Mon, Nov 15, 2021 at 5:41 PM Huacai Chen <[email protected]> wrote:
>
> Hi, John,
>
> On Tue, Nov 16, 2021 at 1:27 AM John Stultz <[email protected]> wrote:
> >
> > On Sat, Nov 13, 2021 at 11:47 PM Huacai Chen <[email protected]> wrote:
> > >
> > > The current cyc_to_ns() implementation is like this:
> > >
> > > static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
> > > {
> > > return (cyc * mult) >> shift;
> > > }
> > >
> > > But u64*u32 maybe overflow, so introduce ARCH_HAS_CYC_TO_NS to allow
> > > architecture to override it.
> > >
> >
> > If that's the case, it would seem too large a mult/shift pair had been selected.
> We use a 100MHz clock and the counter is 64bit, the mult is ~160M. But
> even if we use a smaller mult, cyc*mult, it can also overflow.

Well, yes, any simple multiplication could overflow. My point is that
the mult/shift pair is usually calculated for an expected interval
range via clocks_calc_mult_shift(), where the max interval for
sched_clock is set to an hour. So any interval length under an hour
should not overflow in a simple multiplication.

What I'm trying to understand is what is the case you have where your
interval length is longer than an hour?
As that might break other assumptions going on in the code.

> > What sort of cycle range are you considering to be valid here? Can you
> > provide more rationale as to why this needs the ability to be
> > overridden?
> >
> > And what sort of arch-specific logic do you envision, rather than
> > having a common implementation to avoid the overflow?
> u64*u64 can be handled by hardware (store the high bits and low bits
> of result in two registers). So, if we use assembly, we can handle the
> overflow correctly. E.g., LoongArch (and MIPS) can override
> cyc_to_ns() like this:
>
> static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
> {
> u64 t1, t2, t3;
> unsigned long long rv;
>
> /* 64-bit arithmetic can overflow, so use 128-bit. */
> __asm__ (
> "nor %[t1], $r0, %[shift] \n\t"
> "mulh.du %[t2], %[cyc], %[mult] \n\t"
> "mul.d %[t3], %[cyc], %[mult] \n\t"
> "slli.d %[t2], %[t2], 1 \n\t"
> "srl.d %[rv], %[t3], %[shift] \n\t"
> "sll.d %[t1], %[t2], %[t1] \n\t"
> "or %[rv], %[t1], %[rv] \n\t"
> : [rv] "=&r" (rv), [t1] "=&r" (t1), [t2] "=&r" (t2),
> [t3] "=&r" (t3)
> : [cyc] "r" (cyc), [mult] "r" (mult), [shift] "r" (shift)
> : );
> return rv;
> }

But then isn't the mul_u64_u32_shr() the right abstraction for such a
custom implementation?

Then potentially implement a generic cyc_to_ns() implementation that
uses that instead?

thanks
-john

2021-11-16 11:39:42

by Huacai Chen

[permalink] [raw]
Subject: Re: [PATCH] time/sched_clock: Allow architecture to override cyc_to_ns()

Hi, John,

On Tue, Nov 16, 2021 at 10:05 AM John Stultz <[email protected]> wrote:
>
> On Mon, Nov 15, 2021 at 5:41 PM Huacai Chen <[email protected]> wrote:
> >
> > Hi, John,
> >
> > On Tue, Nov 16, 2021 at 1:27 AM John Stultz <[email protected]> wrote:
> > >
> > > On Sat, Nov 13, 2021 at 11:47 PM Huacai Chen <[email protected]> wrote:
> > > >
> > > > The current cyc_to_ns() implementation is like this:
> > > >
> > > > static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
> > > > {
> > > > return (cyc * mult) >> shift;
> > > > }
> > > >
> > > > But u64*u32 maybe overflow, so introduce ARCH_HAS_CYC_TO_NS to allow
> > > > architecture to override it.
> > > >
> > >
> > > If that's the case, it would seem too large a mult/shift pair had been selected.
> > We use a 100MHz clock and the counter is 64bit, the mult is ~160M. But
> > even if we use a smaller mult, cyc*mult, it can also overflow.
>
> Well, yes, any simple multiplication could overflow. My point is that
> the mult/shift pair is usually calculated for an expected interval
> range via clocks_calc_mult_shift(), where the max interval for
> sched_clock is set to an hour. So any interval length under an hour
> should not overflow in a simple multiplication.
>
> What I'm trying to understand is what is the case you have where your
> interval length is longer than an hour?
> As that might break other assumptions going on in the code.
We found that the mult is "reasonable" if we use 100MHz, 50MHz or
25MHz clocks, but if we use some others, such as 33MHz, the calculated
mult is dramatically large.

>
> > > What sort of cycle range are you considering to be valid here? Can you
> > > provide more rationale as to why this needs the ability to be
> > > overridden?
> > >
> > > And what sort of arch-specific logic do you envision, rather than
> > > having a common implementation to avoid the overflow?
> > u64*u64 can be handled by hardware (store the high bits and low bits
> > of result in two registers). So, if we use assembly, we can handle the
> > overflow correctly. E.g., LoongArch (and MIPS) can override
> > cyc_to_ns() like this:
> >
> > static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
> > {
> > u64 t1, t2, t3;
> > unsigned long long rv;
> >
> > /* 64-bit arithmetic can overflow, so use 128-bit. */
> > __asm__ (
> > "nor %[t1], $r0, %[shift] \n\t"
> > "mulh.du %[t2], %[cyc], %[mult] \n\t"
> > "mul.d %[t3], %[cyc], %[mult] \n\t"
> > "slli.d %[t2], %[t2], 1 \n\t"
> > "srl.d %[rv], %[t3], %[shift] \n\t"
> > "sll.d %[t1], %[t2], %[t1] \n\t"
> > "or %[rv], %[t1], %[rv] \n\t"
> > : [rv] "=&r" (rv), [t1] "=&r" (t1), [t2] "=&r" (t2),
> > [t3] "=&r" (t3)
> > : [cyc] "r" (cyc), [mult] "r" (mult), [shift] "r" (shift)
> > : );
> > return rv;
> > }
>
> But then isn't the mul_u64_u32_shr() the right abstraction for such a
> custom implementation?
>
> Then potentially implement a generic cyc_to_ns() implementation that
> uses that instead?
OK, we will try to override mul_u64_u32_shr() rather than cyc_to_ns().

Huacai
>
> thanks
> -john

2021-11-16 20:31:56

by John Stultz

[permalink] [raw]
Subject: Re: [PATCH] time/sched_clock: Allow architecture to override cyc_to_ns()

On Tue, Nov 16, 2021 at 3:37 AM Huacai Chen <[email protected]> wrote:
> On Tue, Nov 16, 2021 at 10:05 AM John Stultz <[email protected]> wrote:
> > On Mon, Nov 15, 2021 at 5:41 PM Huacai Chen <[email protected]> wrote:
> > > On Tue, Nov 16, 2021 at 1:27 AM John Stultz <[email protected]> wrote:
> > > > On Sat, Nov 13, 2021 at 11:47 PM Huacai Chen <[email protected]> wrote:
> > > > >
> > > > > The current cyc_to_ns() implementation is like this:
> > > > >
> > > > > static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
> > > > > {
> > > > > return (cyc * mult) >> shift;
> > > > > }
> > > > >
> > > > > But u64*u32 maybe overflow, so introduce ARCH_HAS_CYC_TO_NS to allow
> > > > > architecture to override it.
> > > > >
> > > >
> > > > If that's the case, it would seem too large a mult/shift pair had been selected.
> > > We use a 100MHz clock and the counter is 64bit, the mult is ~160M. But
> > > even if we use a smaller mult, cyc*mult, it can also overflow.
> >
> > Well, yes, any simple multiplication could overflow. My point is that
> > the mult/shift pair is usually calculated for an expected interval
> > range via clocks_calc_mult_shift(), where the max interval for
> > sched_clock is set to an hour. So any interval length under an hour
> > should not overflow in a simple multiplication.
> >
> > What I'm trying to understand is what is the case you have where your
> > interval length is longer than an hour?
> > As that might break other assumptions going on in the code.
> We found that the mult is "reasonable" if we use 100MHz, 50MHz or
> 25MHz clocks, but if we use some others, such as 33MHz, the calculated
> mult is dramatically large.

So that sounds a little concerning, but in clocks_calc_mult_shift() we
try to find an accurate mult/shift pair that fits the expected range.
As we want to be as accurate as possible, we need higher shift values
for freqs that don't evenly divide NSEC_PER_SEC, in order to minimize
the error. So larger mult/shift values isn't necessarily a problem.

For the set you gave, the code should give
100000000 -> m:20971520 s:21
50000000 -> m:41943040 s:21
25000000 -> m:83886080 s:21
33333333 -> m:125829121 s:22

And yes, the 33Mhz does have a larger mult/shift pair, but that's
because we want to be as accurate as possible in the conversion.

And calculating that max interval multiplication (one hour of cycles):
100Mhz: (180000000000cyc * 41943040 = 7549747200000000000) >> 21 =
3600000000000 ns
50Mhz: (90000000000cyc * 83886080 = 7549747200000000000) >> 21 =
3600000000000 ns
25Mhz: (360000000000cyc * 20971520 = 7549747200000000000) >> 21
= 3600000000000 ns
33Mhz: (119999998800cyc * 125829121 = 15099494369005054800) >>
22 = 3599999992610 ns

So assuming you're seeing the same mult/shift from above, we should be
able to handle an hour of cycles without overflowing the
multiplication.
(Though it definitely gets close with the 33Mhz case, as we would
overflow with an hour + 7 seconds or so of cycles.)

If you're seeing something else, let us know, as then it's probably a bug.

But again, it would be good to understand the use case where the
sched_clock epoch isn't being updated for over an hour. I'm guessing
hrtimers are being deferred for a super long time?

> > > > What sort of cycle range are you considering to be valid here? Can you
> > > > provide more rationale as to why this needs the ability to be
> > > > overridden?
> > > >
> > > > And what sort of arch-specific logic do you envision, rather than
> > > > having a common implementation to avoid the overflow?
> > > u64*u64 can be handled by hardware (store the high bits and low bits
> > > of result in two registers). So, if we use assembly, we can handle the
> > > overflow correctly. E.g., LoongArch (and MIPS) can override
> > > cyc_to_ns() like this:
> > >
> > > static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
> > > {
> > > u64 t1, t2, t3;
> > > unsigned long long rv;
> > >
> > > /* 64-bit arithmetic can overflow, so use 128-bit. */
> > > __asm__ (
> > > "nor %[t1], $r0, %[shift] \n\t"
> > > "mulh.du %[t2], %[cyc], %[mult] \n\t"
> > > "mul.d %[t3], %[cyc], %[mult] \n\t"
> > > "slli.d %[t2], %[t2], 1 \n\t"
> > > "srl.d %[rv], %[t3], %[shift] \n\t"
> > > "sll.d %[t1], %[t2], %[t1] \n\t"
> > > "or %[rv], %[t1], %[rv] \n\t"
> > > : [rv] "=&r" (rv), [t1] "=&r" (t1), [t2] "=&r" (t2),
> > > [t3] "=&r" (t3)
> > > : [cyc] "r" (cyc), [mult] "r" (mult), [shift] "r" (shift)
> > > : );
> > > return rv;
> > > }
> >
> > But then isn't the mul_u64_u32_shr() the right abstraction for such a
> > custom implementation?
> >
> > Then potentially implement a generic cyc_to_ns() implementation that
> > uses that instead?
> OK, we will try to override mul_u64_u32_shr() rather than cyc_to_ns().

If the use case is considered reasonable, there likely still needs to
be a cyc_to_ns() implementation that uses mult_u64_u32_shr(), but
hopefully it can be generic (maybe under a config option for the
deferred hrtimer use case?).

thanks
-john

2021-11-18 08:34:33

by Huacai Chen

[permalink] [raw]
Subject: Re: [PATCH] time/sched_clock: Allow architecture to override cyc_to_ns()

Hi, John,

On Wed, Nov 17, 2021 at 4:31 AM John Stultz <[email protected]> wrote:
>
> On Tue, Nov 16, 2021 at 3:37 AM Huacai Chen <[email protected]> wrote:
> > On Tue, Nov 16, 2021 at 10:05 AM John Stultz <[email protected]> wrote:
> > > On Mon, Nov 15, 2021 at 5:41 PM Huacai Chen <[email protected]> wrote:
> > > > On Tue, Nov 16, 2021 at 1:27 AM John Stultz <[email protected]> wrote:
> > > > > On Sat, Nov 13, 2021 at 11:47 PM Huacai Chen <[email protected]> wrote:
> > > > > >
> > > > > > The current cyc_to_ns() implementation is like this:
> > > > > >
> > > > > > static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
> > > > > > {
> > > > > > return (cyc * mult) >> shift;
> > > > > > }
> > > > > >
> > > > > > But u64*u32 maybe overflow, so introduce ARCH_HAS_CYC_TO_NS to allow
> > > > > > architecture to override it.
> > > > > >
> > > > >
> > > > > If that's the case, it would seem too large a mult/shift pair had been selected.
> > > > We use a 100MHz clock and the counter is 64bit, the mult is ~160M. But
> > > > even if we use a smaller mult, cyc*mult, it can also overflow.
> > >
> > > Well, yes, any simple multiplication could overflow. My point is that
> > > the mult/shift pair is usually calculated for an expected interval
> > > range via clocks_calc_mult_shift(), where the max interval for
> > > sched_clock is set to an hour. So any interval length under an hour
> > > should not overflow in a simple multiplication.
> > >
> > > What I'm trying to understand is what is the case you have where your
> > > interval length is longer than an hour?
> > > As that might break other assumptions going on in the code.
> > We found that the mult is "reasonable" if we use 100MHz, 50MHz or
> > 25MHz clocks, but if we use some others, such as 33MHz, the calculated
> > mult is dramatically large.
>
> So that sounds a little concerning, but in clocks_calc_mult_shift() we
> try to find an accurate mult/shift pair that fits the expected range.
> As we want to be as accurate as possible, we need higher shift values
> for freqs that don't evenly divide NSEC_PER_SEC, in order to minimize
> the error. So larger mult/shift values isn't necessarily a problem.
>
> For the set you gave, the code should give
> 100000000 -> m:20971520 s:21
> 50000000 -> m:41943040 s:21
> 25000000 -> m:83886080 s:21
> 33333333 -> m:125829121 s:22
>
> And yes, the 33Mhz does have a larger mult/shift pair, but that's
> because we want to be as accurate as possible in the conversion.
Maybe we have made some mistakes, let's investigate it deeper. Thanks.

Huacai

>
> And calculating that max interval multiplication (one hour of cycles):
> 100Mhz: (180000000000cyc * 41943040 = 7549747200000000000) >> 21 =
> 3600000000000 ns
> 50Mhz: (90000000000cyc * 83886080 = 7549747200000000000) >> 21 =
> 3600000000000 ns
> 25Mhz: (360000000000cyc * 20971520 = 7549747200000000000) >> 21
> = 3600000000000 ns
> 33Mhz: (119999998800cyc * 125829121 = 15099494369005054800) >>
> 22 = 3599999992610 ns
>
> So assuming you're seeing the same mult/shift from above, we should be
> able to handle an hour of cycles without overflowing the
> multiplication.
> (Though it definitely gets close with the 33Mhz case, as we would
> overflow with an hour + 7 seconds or so of cycles.)
>
> If you're seeing something else, let us know, as then it's probably a bug.
>
> But again, it would be good to understand the use case where the
> sched_clock epoch isn't being updated for over an hour. I'm guessing
> hrtimers are being deferred for a super long time?
>
> > > > > What sort of cycle range are you considering to be valid here? Can you
> > > > > provide more rationale as to why this needs the ability to be
> > > > > overridden?
> > > > >
> > > > > And what sort of arch-specific logic do you envision, rather than
> > > > > having a common implementation to avoid the overflow?
> > > > u64*u64 can be handled by hardware (store the high bits and low bits
> > > > of result in two registers). So, if we use assembly, we can handle the
> > > > overflow correctly. E.g., LoongArch (and MIPS) can override
> > > > cyc_to_ns() like this:
> > > >
> > > > static inline u64 notrace cyc_to_ns(u64 cyc, u32 mult, u32 shift)
> > > > {
> > > > u64 t1, t2, t3;
> > > > unsigned long long rv;
> > > >
> > > > /* 64-bit arithmetic can overflow, so use 128-bit. */
> > > > __asm__ (
> > > > "nor %[t1], $r0, %[shift] \n\t"
> > > > "mulh.du %[t2], %[cyc], %[mult] \n\t"
> > > > "mul.d %[t3], %[cyc], %[mult] \n\t"
> > > > "slli.d %[t2], %[t2], 1 \n\t"
> > > > "srl.d %[rv], %[t3], %[shift] \n\t"
> > > > "sll.d %[t1], %[t2], %[t1] \n\t"
> > > > "or %[rv], %[t1], %[rv] \n\t"
> > > > : [rv] "=&r" (rv), [t1] "=&r" (t1), [t2] "=&r" (t2),
> > > > [t3] "=&r" (t3)
> > > > : [cyc] "r" (cyc), [mult] "r" (mult), [shift] "r" (shift)
> > > > : );
> > > > return rv;
> > > > }
> > >
> > > But then isn't the mul_u64_u32_shr() the right abstraction for such a
> > > custom implementation?
> > >
> > > Then potentially implement a generic cyc_to_ns() implementation that
> > > uses that instead?
> > OK, we will try to override mul_u64_u32_shr() rather than cyc_to_ns().
>
> If the use case is considered reasonable, there likely still needs to
> be a cyc_to_ns() implementation that uses mult_u64_u32_shr(), but
> hopefully it can be generic (maybe under a config option for the
> deferred hrtimer use case?).
>
> thanks
> -john