2013-03-06 07:19:03

by Feng Tang

[permalink] [raw]
Subject: [PATCH v3 0/5] Add support for S3 non-stop TSC support.

Hi All,

On some new Intel Atom processors (Penwell and Cloverview), there is
a feature that the TSC won't stop in S3, say the TSC value won't be
reset to 0 after resume. This feature makes TSC a more reliable
clocksource and could benefit the timekeeping code during system
suspend/resume cycles.

The enabling efforts include adding new flags for this feature,
modifying clocksource.h and timekeeping.c to support and utilizing
it.

The major change to timekeeping is the way to count suspended time,
current way is trying to use the persistent clock first, and then
try the rtc if persistent clock can't do it. This patch will change
the trying order to:
suspend-nonstop clocksource -> persistent clock -> rtc

Please help to review them, thanks a lot!

Changelog:

v3:
* Adopt Jason Gunthorpe's way to convert large cycles to
nsec. And put it into clocksource_cyc2ns()
* Small change in flag name
v2:
* Dump the code changing a clocksource's mult and shit,
as suggested by John to not hurt accuracy
* Modify the CPU feature flag name to be consistent with
other flags
* Solve the problem of judging S3/S4, as the clocksource
counter will be reset after coming out S4.

- Feng

-------------
Feng Tang (5):
x86: Add cpu capability flag X86_FEATURE_NONSTOP_TSC_S3
clocksource: Add new feature flag CLOCK_SOURCE_SUSPEND_NONSTOP
x86: tsc: Add support for new S3_NONSTOP feature
clocksource: Enable clocksource_cyc2ns() to cover big cycles
timekeeping: utilize the suspend-nonstop clocksource to count
suspended time

arch/x86/include/asm/cpufeature.h | 1 +
arch/x86/kernel/cpu/intel.c | 12 +++++++++++
arch/x86/kernel/tsc.c | 6 +++++-
include/linux/clocksource.h | 12 ++++++++++-
kernel/time/timekeeping.c | 42 ++++++++++++++++++++++++++++++-------
5 files changed, 63 insertions(+), 10 deletions(-)

--
1.7.9.5


2013-03-06 07:19:11

by Feng Tang

[permalink] [raw]
Subject: [PATCH v3 5/5] timekeeping: utilize the suspend-nonstop clocksource to count suspended time

There are some new processors whose TSC clocksource won't stop during
suspend. Currently, after system resumes, kernel will use persistent
clock or RTC to compensate the sleep time, but for those new types of
clocksources, we could skip the special compensation from external
sources, and just use current clocksource for time recounting.

This can solve some time drift bugs caused by some not-so-accurate or
error-prone RTC devices.

The current way to count suspened time is first try to use the persistent
clock, and then try the rtc if persistent clock can't be used. This
patch will change the trying order to:
suspend-nonstop clocksource -> persistent clock -> rtc

Signed-off-by: Feng Tang <[email protected]>
---
kernel/time/timekeeping.c | 42 ++++++++++++++++++++++++++++++++++--------
1 file changed, 34 insertions(+), 8 deletions(-)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 9a0bc98..bd7c27b 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -788,22 +788,48 @@ void timekeeping_inject_sleeptime(struct timespec *delta)
static void timekeeping_resume(void)
{
struct timekeeper *tk = &timekeeper;
+ struct clocksource *clock = tk->clock;
unsigned long flags;
- struct timespec ts;
+ struct timespec ts_new, ts_delta;
+ cycle_t cycle_now, cycle_delta;
+ s64 nsec;

- read_persistent_clock(&ts);
+ ts_delta.tv_sec = 0;
+ read_persistent_clock(&ts_new);

clockevents_resume();
clocksource_resume();

write_seqlock_irqsave(&tk->lock, flags);

- if (timespec_compare(&ts, &timekeeping_suspend_time) > 0) {
- ts = timespec_sub(ts, timekeeping_suspend_time);
- __timekeeping_inject_sleeptime(tk, &ts);
- }
- /* re-base the last cycle value */
- tk->clock->cycle_last = tk->clock->read(tk->clock);
+ /*
+ * After system resumes, we need to calculate the suspended time and
+ * compensate it for the OS time. There are 3 sources that could be
+ * used: Nonstop clocksource during suspend, persistent clock and rtc
+ * device.
+ *
+ * One specific platform may have 1 or 2 or all of them, and the
+ * preference will be:
+ * suspend-nonstop clocksource -> persistent clock -> rtc
+ * The less preferred source will only be tried if there is no better
+ * usable source. The rtc part is handled separately in rtc core code.
+ */
+ cycle_now = clock->read(clock);
+ if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
+ cycle_now > clock->cycle_last) {
+
+ cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;
+ nsec = clocksource_cyc2ns(cycle_delta, clock->mult,
+ clock->shift);
+ ts_delta = ns_to_timespec(nsec);
+ } else if (timespec_compare(&ts_new, &timekeeping_suspend_time) > 0)
+ ts_delta = timespec_sub(ts_new, timekeeping_suspend_time);
+
+ if (ts_delta.tv_sec >= 1)
+ __timekeeping_inject_sleeptime(tk, &ts_delta);
+
+ /* Re-base the last cycle value */
+ clock->cycle_last = clock->read(clock);
tk->ntp_error = 0;
timekeeping_suspended = 0;
timekeeping_update(tk, false);
--
1.7.9.5

2013-03-06 07:19:09

by Feng Tang

[permalink] [raw]
Subject: [PATCH v3 4/5] clocksource: Enable clocksource_cyc2ns() to cover big cycles

Current clocksource_cyc2ns() has a implicit limit that the (cycles * mult)
can not exceed 64 bits limit. Jason Gunthorpe proposed a way to
handle this big cycles case, and this patch put the handling into
clocksource_cyc2ns() so that it could be used unconditionally.

Suggested-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Feng Tang <[email protected]>
---
include/linux/clocksource.h | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index aa7032c..1ecc872 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -274,7 +274,16 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant)
*/
static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
{
- return ((u64) cycles * mult) >> shift;
+ u64 max = ULLONG_MAX / mult;
+ s64 nsec = 0;
+
+ /* The (mult * cycles) may overflow 64 bits, so add a max check */
+ if (cycles > max) {
+ nsec = ((max * mult) >> shift) * (cycles / max);
+ cycles %= max;
+ }
+ nsec += ((u64) cycles * mult) >> shift;
+ return nsec;
}


--
1.7.9.5

2013-03-06 07:19:07

by Feng Tang

[permalink] [raw]
Subject: [PATCH v3 3/5] x86: tsc: Add support for new S3_NONSTOP feature

Signed-off-by: Feng Tang <[email protected]>
---
arch/x86/kernel/tsc.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 4b9ea10..098b3cf 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -768,7 +768,8 @@ static cycle_t read_tsc(struct clocksource *cs)

static void resume_tsc(struct clocksource *cs)
{
- clocksource_tsc.cycle_last = 0;
+ if (!boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3))
+ clocksource_tsc.cycle_last = 0;
}

static struct clocksource clocksource_tsc = {
@@ -939,6 +940,9 @@ static int __init init_tsc_clocksource(void)
clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
}

+ if (boot_cpu_has(X86_FEATURE_NONSTOP_TSC_S3))
+ clocksource_tsc.flags |= CLOCK_SOURCE_SUSPEND_NONSTOP;
+
/*
* Trust the results of the earlier calibration on systems
* exporting a reliable TSC.
--
1.7.9.5

2013-03-06 07:19:05

by Feng Tang

[permalink] [raw]
Subject: [PATCH v3 1/5] x86: Add cpu capability flag X86_FEATURE_NONSTOP_TSC_S3

On some new Intel Atom processors (Penwell and Cloverview), there is
a feature that the TSC won't stop in S3 state, say the TSC value
won't be reset to 0 after resume. This feature makes TSC a more reliable
clocksource and could benefit the timekeeping code during system
suspend/resume cycle, so add a flag for it.

Signed-off-by: Feng Tang <[email protected]>
---
arch/x86/include/asm/cpufeature.h | 1 +
arch/x86/kernel/cpu/intel.c | 12 ++++++++++++
2 files changed, 13 insertions(+)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 93fe929..a8466f2 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -100,6 +100,7 @@
#define X86_FEATURE_AMD_DCM (3*32+27) /* multi-node processor */
#define X86_FEATURE_APERFMPERF (3*32+28) /* APERFMPERF */
#define X86_FEATURE_EAGER_FPU (3*32+29) /* "eagerfpu" Non lazy FPU restore */
+#define X86_FEATURE_NONSTOP_TSC_S3 (3*32+30) /* TSC doesn't stop in S3 state */

/* Intel-defined CPU features, CPUID level 0x00000001 (ecx), word 4 */
#define X86_FEATURE_XMM3 (4*32+ 0) /* "pni" SSE-3 */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 1905ce9..fe57544 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -96,6 +96,18 @@ static void __cpuinit early_init_intel(struct cpuinfo_x86 *c)
sched_clock_stable = 1;
}

+ /* Penwell and Cloverview have the TSC which doesn't sleep on S3 */
+ if (c->x86 == 6) {
+ switch (c->x86_model) {
+ case 0x27: /* Penwell */
+ case 0x35: /* Cloverview */
+ set_cpu_cap(c, X86_FEATURE_NONSTOP_TSC_S3);
+ break;
+ default:
+ ;
+ }
+ }
+
/*
* There is a known erratum on Pentium III and Core Solo
* and Core Duo CPUs.
--
1.7.9.5

2013-03-06 07:20:07

by Feng Tang

[permalink] [raw]
Subject: [PATCH v3 2/5] clocksource: Add new feature flag CLOCK_SOURCE_SUSPEND_NONSTOP

Some x86 processors have a TSC clocksource, which continues to run
even when system is suspended. Also most OMAP platforms have a
32 KHz timer which has similar capability. Add a feature flag so that
it could be utilized.

Signed-off-by: Feng Tang <[email protected]>
---
include/linux/clocksource.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 27cfda4..aa7032c 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -206,6 +206,7 @@ struct clocksource {
#define CLOCK_SOURCE_WATCHDOG 0x10
#define CLOCK_SOURCE_VALID_FOR_HRES 0x20
#define CLOCK_SOURCE_UNSTABLE 0x40
+#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80

/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) (cycle_t)((bits) < 64 ? ((1ULL<<(bits))-1) : -1)
--
1.7.9.5

2013-03-06 14:09:44

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 4/5] clocksource: Enable clocksource_cyc2ns() to cover big cycles

On Wed, 6 Mar 2013, Feng Tang wrote:

> Current clocksource_cyc2ns() has a implicit limit that the (cycles * mult)
> can not exceed 64 bits limit. Jason Gunthorpe proposed a way to
> handle this big cycles case, and this patch put the handling into
> clocksource_cyc2ns() so that it could be used unconditionally.

Could be used if it wouldn't break the world and some more.

> Suggested-by: Jason Gunthorpe <[email protected]>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> include/linux/clocksource.h | 11 ++++++++++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> index aa7032c..1ecc872 100644
> --- a/include/linux/clocksource.h
> +++ b/include/linux/clocksource.h
> @@ -274,7 +274,16 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant)
> */
> static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
> {
> - return ((u64) cycles * mult) >> shift;
> + u64 max = ULLONG_MAX / mult;

This breaks everything which does not have a 64/32bit divide
instruction. And you can't replace it with do_div() as that would
impose massive overhead on those architectures in the fast path.

The max value can be precalculated and stored in the timekeeper
struct. We really do not want expensive calculations in the fast path.

> + s64 nsec = 0;
> +
> + /* The (mult * cycles) may overflow 64 bits, so add a max check */
> + if (cycles > max) {
> + nsec = ((max * mult) >> shift) * (cycles / max);

This breaks everything which does not have a 64/64bit divide instruction.

> + cycles %= max;

Ditto.

As this is the slow path on resume you can use the 64bit functions
declared in math64.h. And you want to put the slow path out of line.

There is a world outside of x86!

Thanks,

tglx

2013-03-06 14:16:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] timekeeping: utilize the suspend-nonstop clocksource to count suspended time

On Wed, 6 Mar 2013, Feng Tang wrote:

> There are some new processors whose TSC clocksource won't stop during
> suspend. Currently, after system resumes, kernel will use persistent
> clock or RTC to compensate the sleep time, but for those new types of
> clocksources, we could skip the special compensation from external
> sources, and just use current clocksource for time recounting.
>
> This can solve some time drift bugs caused by some not-so-accurate or
> error-prone RTC devices.
>
> The current way to count suspened time is first try to use the persistent
> clock, and then try the rtc if persistent clock can't be used. This
> patch will change the trying order to:
> suspend-nonstop clocksource -> persistent clock -> rtc
>
> Signed-off-by: Feng Tang <[email protected]>
> ---
> kernel/time/timekeeping.c | 42 ++++++++++++++++++++++++++++++++++--------
> 1 file changed, 34 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 9a0bc98..bd7c27b 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -788,22 +788,48 @@ void timekeeping_inject_sleeptime(struct timespec *delta)
> static void timekeeping_resume(void)
> {
> struct timekeeper *tk = &timekeeper;
> + struct clocksource *clock = tk->clock;
> unsigned long flags;
> - struct timespec ts;
> + struct timespec ts_new, ts_delta;
> + cycle_t cycle_now, cycle_delta;
> + s64 nsec;
>
> - read_persistent_clock(&ts);
> + ts_delta.tv_sec = 0;
> + read_persistent_clock(&ts_new);
>
> clockevents_resume();
> clocksource_resume();
>
> write_seqlock_irqsave(&tk->lock, flags);
>
> - if (timespec_compare(&ts, &timekeeping_suspend_time) > 0) {
> - ts = timespec_sub(ts, timekeeping_suspend_time);
> - __timekeeping_inject_sleeptime(tk, &ts);
> - }
> - /* re-base the last cycle value */
> - tk->clock->cycle_last = tk->clock->read(tk->clock);
> + /*
> + * After system resumes, we need to calculate the suspended time and
> + * compensate it for the OS time. There are 3 sources that could be
> + * used: Nonstop clocksource during suspend, persistent clock and rtc
> + * device.
> + *
> + * One specific platform may have 1 or 2 or all of them, and the
> + * preference will be:
> + * suspend-nonstop clocksource -> persistent clock -> rtc
> + * The less preferred source will only be tried if there is no better
> + * usable source. The rtc part is handled separately in rtc core code.
> + */
> + cycle_now = clock->read(clock);
> + if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
> + cycle_now > clock->cycle_last) {
> +
> + cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;
> + nsec = clocksource_cyc2ns(cycle_delta, clock->mult,
> + clock->shift);
> + ts_delta = ns_to_timespec(nsec);
> + } else if (timespec_compare(&ts_new, &timekeeping_suspend_time) > 0)
> + ts_delta = timespec_sub(ts_new, timekeeping_suspend_time);
> +
> + if (ts_delta.tv_sec >= 1)
> + __timekeeping_inject_sleeptime(tk, &ts_delta);

If the suspend time measured by the nonstop clocksource is 0.999 sec
then we throw it away and then let the RTC code inject inaccurate
sleep time? Brilliant design, really.

Thanks,

tglx

2013-03-06 14:30:22

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] timekeeping: utilize the suspend-nonstop clocksource to count suspended time

On Wed, Mar 06, 2013 at 03:15:49PM +0100, Thomas Gleixner wrote:
> On Wed, 6 Mar 2013, Feng Tang wrote:
>
> > There are some new processors whose TSC clocksource won't stop during
> > suspend. Currently, after system resumes, kernel will use persistent
> > clock or RTC to compensate the sleep time, but for those new types of
> > clocksources, we could skip the special compensation from external
> > sources, and just use current clocksource for time recounting.
> >
> > This can solve some time drift bugs caused by some not-so-accurate or
> > error-prone RTC devices.
> >
> > The current way to count suspened time is first try to use the persistent
> > clock, and then try the rtc if persistent clock can't be used. This
> > patch will change the trying order to:
> > suspend-nonstop clocksource -> persistent clock -> rtc
> >
> > Signed-off-by: Feng Tang <[email protected]>
> > ---
> > kernel/time/timekeeping.c | 42 ++++++++++++++++++++++++++++++++++--------
> > 1 file changed, 34 insertions(+), 8 deletions(-)
> > - read_persistent_clock(&ts);
> > + ts_delta.tv_sec = 0;
> > + read_persistent_clock(&ts_new);
> >
> > clockevents_resume();
> > clocksource_resume();
> >
> > write_seqlock_irqsave(&tk->lock, flags);
> >
> > - if (timespec_compare(&ts, &timekeeping_suspend_time) > 0) {
> > - ts = timespec_sub(ts, timekeeping_suspend_time);
> > - __timekeeping_inject_sleeptime(tk, &ts);
> > - }
> > - /* re-base the last cycle value */
> > - tk->clock->cycle_last = tk->clock->read(tk->clock);
> > + /*
> > + * After system resumes, we need to calculate the suspended time and
> > + * compensate it for the OS time. There are 3 sources that could be
> > + * used: Nonstop clocksource during suspend, persistent clock and rtc
> > + * device.
> > + *
> > + * One specific platform may have 1 or 2 or all of them, and the
> > + * preference will be:
> > + * suspend-nonstop clocksource -> persistent clock -> rtc
> > + * The less preferred source will only be tried if there is no better
> > + * usable source. The rtc part is handled separately in rtc core code.
> > + */
> > + cycle_now = clock->read(clock);
> > + if ((clock->flags & CLOCK_SOURCE_SUSPEND_NONSTOP) &&
> > + cycle_now > clock->cycle_last) {
> > +
> > + cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;
> > + nsec = clocksource_cyc2ns(cycle_delta, clock->mult,
> > + clock->shift);
> > + ts_delta = ns_to_timespec(nsec);
> > + } else if (timespec_compare(&ts_new, &timekeeping_suspend_time) > 0)
> > + ts_delta = timespec_sub(ts_new, timekeeping_suspend_time);
> > +
> > + if (ts_delta.tv_sec >= 1)
> > + __timekeeping_inject_sleeptime(tk, &ts_delta);
>
> If the suspend time measured by the nonstop clocksource is 0.999 sec
> then we throw it away and then let the RTC code inject inaccurate
> sleep time? Brilliant design, really.

Emm, I wrote the code with an assumption that the sleep itself and the
enter/exit processes will be longer than 1 second.

I can initialize the ts_delta to (0, 0} and change the check condition
to
if (ts_delta.tv_sec || ts_delta.tv_nsec)

2013-03-06 14:38:50

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v3 4/5] clocksource: Enable clocksource_cyc2ns() to cover big cycles

Hi Thomas,

Thanks for the reviews.

On Wed, Mar 06, 2013 at 03:09:26PM +0100, Thomas Gleixner wrote:
> On Wed, 6 Mar 2013, Feng Tang wrote:
>
> > Current clocksource_cyc2ns() has a implicit limit that the (cycles * mult)
> > can not exceed 64 bits limit. Jason Gunthorpe proposed a way to
> > handle this big cycles case, and this patch put the handling into
> > clocksource_cyc2ns() so that it could be used unconditionally.
>
> Could be used if it wouldn't break the world and some more.

Exactly.

One excuse I can think of is usually the clocksource_cyc2ns() will be called
for cycles less than 600 seconds, based on which the "mult" and "shift" are
calculated for a clocksource.

>
> > Suggested-by: Jason Gunthorpe <[email protected]>
> > Signed-off-by: Feng Tang <[email protected]>
> > ---
> > include/linux/clocksource.h | 11 ++++++++++-
> > 1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
> > index aa7032c..1ecc872 100644
> > --- a/include/linux/clocksource.h
> > +++ b/include/linux/clocksource.h
> > @@ -274,7 +274,16 @@ static inline u32 clocksource_hz2mult(u32 hz, u32 shift_constant)
> > */
> > static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
> > {
> > - return ((u64) cycles * mult) >> shift;
> > + u64 max = ULLONG_MAX / mult;
>
> This breaks everything which does not have a 64/32bit divide
> instruction. And you can't replace it with do_div() as that would
> impose massive overhead on those architectures in the fast path.

I thought about this once. And in my v2 patch, I used some code like

+ /*
+ * The system suspended time and the delta cycles may be very
+ * long, so we can't call clocksource_cyc2ns() directly with
+ * clocksource's default mult and shift to avoid overflow.
+ */
+ max_cycles = 1ULL << (63 - (ilog2(mult) + 1));
+ while (cycle_delta > max_cycles) {
+ max_cycles <<= 1;
+ mult >>= 1;
+ shift--;
+ }
+

trying to avoid expensieve maths. But as Jason pointed, there is some accuracy
lost.

>
> The max value can be precalculated and stored in the timekeeper
> struct. We really do not want expensive calculations in the fast path.

Yeah, just like the max_idle_ns.

> > + s64 nsec = 0;
> > +
> > + /* The (mult * cycles) may overflow 64 bits, so add a max check */
> > + if (cycles > max) {
> > + nsec = ((max * mult) >> shift) * (cycles / max);
>
> This breaks everything which does not have a 64/64bit divide instruction.
>
> > + cycles %= max;
>
> Ditto.
>
> As this is the slow path on resume you can use the 64bit functions
> declared in math64.h. And you want to put the slow path out of line.

So should I leave the clocksource_cyl2ns() untouched and only add these
do_div() 64 bit operation inside the timekeeping_resume() slow path?

Thanks,
Feng



2013-03-06 16:11:12

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 4/5] clocksource: Enable clocksource_cyc2ns() to cover big cycles

On Wed, 6 Mar 2013, Feng Tang wrote:
> Hi Thomas,
>
> Thanks for the reviews.
>
> On Wed, Mar 06, 2013 at 03:09:26PM +0100, Thomas Gleixner wrote:
> > On Wed, 6 Mar 2013, Feng Tang wrote:
> >
> > > Current clocksource_cyc2ns() has a implicit limit that the (cycles * mult)
> > > can not exceed 64 bits limit. Jason Gunthorpe proposed a way to
> > > handle this big cycles case, and this patch put the handling into
> > > clocksource_cyc2ns() so that it could be used unconditionally.
> >
> > Could be used if it wouldn't break the world and some more.
>
> Exactly.
>
> One excuse I can think of is usually the clocksource_cyc2ns() will be called
> for cycles less than 600 seconds, based on which the "mult" and "shift" are
> calculated for a clocksource.

That's not an excuse for making even the build fail on ARM and other
32bit archs.

> > > static inline s64 clocksource_cyc2ns(cycle_t cycles, u32 mult, u32 shift)
> > > {
> > > - return ((u64) cycles * mult) >> shift;
> > > + u64 max = ULLONG_MAX / mult;
> >
> > This breaks everything which does not have a 64/32bit divide
> > instruction. And you can't replace it with do_div() as that would
> > impose massive overhead on those architectures in the fast path.
>
> I thought about this once. And in my v2 patch, I used some code like
>
> + /*
> + * The system suspended time and the delta cycles may be very
> + * long, so we can't call clocksource_cyc2ns() directly with
> + * clocksource's default mult and shift to avoid overflow.
> + */
> + max_cycles = 1ULL << (63 - (ilog2(mult) + 1));
> + while (cycle_delta > max_cycles) {
> + max_cycles <<= 1;
> + mult >>= 1;
> + shift--;
> + }
> +
>
> trying to avoid expensieve maths. But as Jason pointed, there is some accuracy
> lost.

Right, but if you precalculate the max_fast_cycles value you can avoid
at least the division in the fast path and then do

if (cycles > max_fast_cycles)
return clocksource_cyc2ns_slow();
return ((u64) cycles * mult) >> shift;

clocksource_cyc2ns_slow() should be out of line and there you can do
all the slow 64 bit operations. That keeps the fast path sane and we
don't need extra magic for the large cycle values.

Thanks,

tglx

2013-03-06 16:27:42

by Feng Tang

[permalink] [raw]
Subject: Re: [PATCH v3 4/5] clocksource: Enable clocksource_cyc2ns() to cover big cycles

On Wed, Mar 06, 2013 at 05:10:53PM +0100, Thomas Gleixner wrote:
> On Wed, 6 Mar 2013, Feng Tang wrote:
> > Hi Thomas,
> >
> > Thanks for the reviews.
> >
> > On Wed, Mar 06, 2013 at 03:09:26PM +0100, Thomas Gleixner wrote:
> > > On Wed, 6 Mar 2013, Feng Tang wrote:
> > >
> > > > Current clocksource_cyc2ns() has a implicit limit that the (cycles * mult)
> > > > can not exceed 64 bits limit. Jason Gunthorpe proposed a way to
> > > > handle this big cycles case, and this patch put the handling into
> > > > clocksource_cyc2ns() so that it could be used unconditionally.
> > >
> > > Could be used if it wouldn't break the world and some more.
> >
> > Exactly.
> >
> > One excuse I can think of is usually the clocksource_cyc2ns() will be called
> > for cycles less than 600 seconds, based on which the "mult" and "shift" are
> > calculated for a clocksource.
>
> That's not an excuse for making even the build fail on ARM and other
> 32bit archs.

That's a huge mistake I made in my patch, and I didn't meant to excuse for it :)

> >
> > trying to avoid expensieve maths. But as Jason pointed, there is some accuracy
> > lost.
>
> Right, but if you precalculate the max_fast_cycles value you can avoid
> at least the division in the fast path and then do
>
> if (cycles > max_fast_cycles)
> return clocksource_cyc2ns_slow();
> return ((u64) cycles * mult) >> shift;
>
> clocksource_cyc2ns_slow() should be out of line and there you can do
> all the slow 64 bit operations. That keeps the fast path sane and we
> don't need extra magic for the large cycle values.

Yeah! This should well cover all possilbe cycles and solve the fast/slow
problem. Thanks. Will try to make a new patch.

- Feng

2013-03-06 21:05:32

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v3 4/5] clocksource: Enable clocksource_cyc2ns() to cover big cycles

On 03/06/2013 06:09 AM, Thomas Gleixner wrote:
>
> This breaks everything which does not have a 64/32bit divide
> instruction. And you can't replace it with do_div() as that would
> impose massive overhead on those architectures in the fast path.
>

Could we do the same kind of scaling-by-multiplication that we do in
kernel/time.c for this?

-hpa

2013-03-06 21:16:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 4/5] clocksource: Enable clocksource_cyc2ns() to cover big cycles

On Wed, 6 Mar 2013, H. Peter Anvin wrote:

> On 03/06/2013 06:09 AM, Thomas Gleixner wrote:
> >
> > This breaks everything which does not have a 64/32bit divide
> > instruction. And you can't replace it with do_div() as that would
> > impose massive overhead on those architectures in the fast path.
> >
>
> Could we do the same kind of scaling-by-multiplication that we do in
> kernel/time.c for this?

Not sure what you are referring to. kernel/time.c contains a lot of stuff :)

Thanks,

tglx

2013-03-06 21:20:17

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v3 4/5] clocksource: Enable clocksource_cyc2ns() to cover big cycles

On 03/06/2013 01:15 PM, Thomas Gleixner wrote:
> On Wed, 6 Mar 2013, H. Peter Anvin wrote:
>
>> On 03/06/2013 06:09 AM, Thomas Gleixner wrote:
>>>
>>> This breaks everything which does not have a 64/32bit divide
>>> instruction. And you can't replace it with do_div() as that would
>>> impose massive overhead on those architectures in the fast path.
>>>
>>
>> Could we do the same kind of scaling-by-multiplication that we do in
>> kernel/time.c for this?
>
> Not sure what you are referring to. kernel/time.c contains a lot of stuff :)
>

This stuff, specifically the third clause (which incidentally could be
extended to the fourth clause without much trouble... I have
experimented with it already.)

It uses a N*N->2N multiply and a shift to do overflowless scaling; it is
?1 LSB in the upper half of the value range with can be remedied with an
additional 2N add.

-hpa

/*
* Convert jiffies to milliseconds and back.
*
* Avoid unnecessary multiplications/divisions in the
* two most common HZ cases:
*/
unsigned int jiffies_to_msecs(const unsigned long j)
{
#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
return (MSEC_PER_SEC / HZ) * j;
#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
return (j + (HZ / MSEC_PER_SEC) - 1)/(HZ / MSEC_PER_SEC);
#else
# if BITS_PER_LONG == 32
return (HZ_TO_MSEC_MUL32 * j) >> HZ_TO_MSEC_SHR32; <---
# else
return (j * HZ_TO_MSEC_NUM) / HZ_TO_MSEC_DEN;
# endif
#endif
}

2013-03-06 21:29:36

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 4/5] clocksource: Enable clocksource_cyc2ns() to cover big cycles

On Wed, 6 Mar 2013, H. Peter Anvin wrote:
> On 03/06/2013 01:15 PM, Thomas Gleixner wrote:
> > On Wed, 6 Mar 2013, H. Peter Anvin wrote:
> >
> >> On 03/06/2013 06:09 AM, Thomas Gleixner wrote:
> >>>
> >>> This breaks everything which does not have a 64/32bit divide
> >>> instruction. And you can't replace it with do_div() as that would
> >>> impose massive overhead on those architectures in the fast path.
> >>>
> >>
> >> Could we do the same kind of scaling-by-multiplication that we do in
> >> kernel/time.c for this?
> >
> > Not sure what you are referring to. kernel/time.c contains a lot of stuff :)
> >
>
> This stuff, specifically the third clause (which incidentally could be
> extended to the fourth clause without much trouble... I have
> experimented with it already.)
>
> It uses a N*N->2N multiply and a shift to do overflowless scaling; it is
> ?1 LSB in the upper half of the value range with can be remedied with an
> additional 2N add.

That's compile time. We need this at runtime and we do not want to
reduce the accuracy. That's precision timekeeping not jiffies :)

Thanks,

tglx

2013-03-06 21:33:16

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH v3 5/5] timekeeping: utilize the suspend-nonstop clocksource to count suspended time

On Wed, 6 Mar 2013, Feng Tang wrote:
> On Wed, Mar 06, 2013 at 03:15:49PM +0100, Thomas Gleixner wrote:
> > If the suspend time measured by the nonstop clocksource is 0.999 sec
> > then we throw it away and then let the RTC code inject inaccurate
> > sleep time? Brilliant design, really.
>
> Emm, I wrote the code with an assumption that the sleep itself and the
> enter/exit processes will be longer than 1 second.
>
> I can initialize the ts_delta to (0, 0} and change the check condition
> to
> if (ts_delta.tv_sec || ts_delta.tv_nsec)
>

To be honest, I hate this abuse of ts_delta as a condition. Use a
separate condition variable to make clear what the hell that code is
doing. This is not a fast path and I really prefer understandable code
over magic microoptimizations for no value.

Thanks,

tglx