2023-06-06 13:22:54

by Chris Bainbridge

[permalink] [raw]
Subject: Re: PROBLEM: skew message does not handle negative ns skew

On Tue, 6 Jun 2023 at 13:50, Feng Tang <[email protected]> wrote:
>
> And I'm have no idea if there is a real hardware/firmware issue
> or just a false alarm.

Is a negative reported skew valid? I don't know, I had assumed so, so
the problem was the conversion from -878159 ns to 18446744073708 ms.


2023-06-06 14:24:19

by Feng Tang

[permalink] [raw]
Subject: Re: PROBLEM: skew message does not handle negative ns skew

On Tue, Jun 06, 2023 at 02:09:08PM +0100, Chris Bainbridge wrote:
> On Tue, 6 Jun 2023 at 13:50, Feng Tang <[email protected]> wrote:
> >
> > And I'm have no idea if there is a real hardware/firmware issue
> > or just a false alarm.
>
> Is a negative reported skew valid? I don't know, I had assumed so, so
> the problem was the conversion from -878159 ns to 18446744073708 ms.

I think it's valid. The related code is from kernel/time/clocksource.c:

"
cs_wd_msec = div_u64_rem(cs_nsec - wd_nsec, 1000U * 1000U, &wd_rem);
wd_msec = div_u64_rem(wd_nsec, 1000U * 1000U, &wd_rem);
pr_warn(" Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
"

The negative value just means the watchdog is running faster than
TSC in the 512 ms checking interval. The 18446744073708 ms is just
a conversion from s64 value in ns (-878159) to a u64 ns, then a
u64 ms.

Thanks,
Feng

2023-06-07 19:16:03

by Paul E. McKenney

[permalink] [raw]
Subject: Re: PROBLEM: skew message does not handle negative ns skew

On Tue, Jun 06, 2023 at 09:52:11PM +0800, Feng Tang wrote:
> On Tue, Jun 06, 2023 at 02:09:08PM +0100, Chris Bainbridge wrote:
> > On Tue, 6 Jun 2023 at 13:50, Feng Tang <[email protected]> wrote:
> > >
> > > And I'm have no idea if there is a real hardware/firmware issue
> > > or just a false alarm.
> >
> > Is a negative reported skew valid? I don't know, I had assumed so, so
> > the problem was the conversion from -878159 ns to 18446744073708 ms.
>
> I think it's valid. The related code is from kernel/time/clocksource.c:
>
> "
> cs_wd_msec = div_u64_rem(cs_nsec - wd_nsec, 1000U * 1000U, &wd_rem);
> wd_msec = div_u64_rem(wd_nsec, 1000U * 1000U, &wd_rem);
> pr_warn(" Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
> cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
> "
>
> The negative value just means the watchdog is running faster than
> TSC in the 512 ms checking interval. The 18446744073708 ms is just
> a conversion from s64 value in ns (-878159) to a u64 ns, then a
> u64 ms.

That is a bit user-unfriendly. Does the following fix address this
issue at your end?

Thanx, Paul

------------------------------------------------------------------------

commit 8eb836f2dd44cb1e80dfc603cf47c03603dadcdb
Author: Paul E. McKenney <[email protected]>
Date: Wed Jun 7 11:59:49 2023 -0700

clocksource: Handle negative skews in "skew is too large" messages

The nanosecond-to-millisecond skew computation uses unsigned arithmetic,
which produces user-unfriendly large positive numbers for negative skews.
Therefore, use signed arithmetic for this computation in order to preserve
the negativity.

Reported-by: Chris Bainbridge <[email protected]>
Reported-by: Feng Tang <[email protected]>
Fixes: dd029269947a ("clocksource: Improve "skew is too large" messages")
Signed-off-by: Paul E. McKenney <[email protected]>

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 91836b727cef..0600e16dbafe 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -473,8 +473,8 @@ static void clocksource_watchdog(struct timer_list *unused)
/* Check the deviation from the watchdog clocksource. */
md = cs->uncertainty_margin + watchdog->uncertainty_margin;
if (abs(cs_nsec - wd_nsec) > md) {
- u64 cs_wd_msec;
- u64 wd_msec;
+ s64 cs_wd_msec;
+ s64 wd_msec;
u32 wd_rem;

pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
@@ -483,8 +483,8 @@ static void clocksource_watchdog(struct timer_list *unused)
watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
pr_warn(" '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
cs->name, cs_nsec, csnow, cslast, cs->mask);
- cs_wd_msec = div_u64_rem(cs_nsec - wd_nsec, 1000U * 1000U, &wd_rem);
- wd_msec = div_u64_rem(wd_nsec, 1000U * 1000U, &wd_rem);
+ cs_wd_msec = div_s64_rem(cs_nsec - wd_nsec, 1000 * 1000, &wd_rem);
+ wd_msec = div_s64_rem(wd_nsec, 1000 * 1000, &wd_rem);
pr_warn(" Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
if (curr_clocksource == cs)

2023-06-08 06:44:21

by Feng Tang

[permalink] [raw]
Subject: Re: PROBLEM: skew message does not handle negative ns skew

On Wed, Jun 07, 2023 at 12:04:49PM -0700, Paul E. McKenney wrote:
> On Tue, Jun 06, 2023 at 09:52:11PM +0800, Feng Tang wrote:
> > On Tue, Jun 06, 2023 at 02:09:08PM +0100, Chris Bainbridge wrote:
> > > On Tue, 6 Jun 2023 at 13:50, Feng Tang <[email protected]> wrote:
> > > >
> > > > And I'm have no idea if there is a real hardware/firmware issue
> > > > or just a false alarm.
> > >
> > > Is a negative reported skew valid? I don't know, I had assumed so, so
> > > the problem was the conversion from -878159 ns to 18446744073708 ms.
> >
> > I think it's valid. The related code is from kernel/time/clocksource.c:
> >
> > "
> > cs_wd_msec = div_u64_rem(cs_nsec - wd_nsec, 1000U * 1000U, &wd_rem);
> > wd_msec = div_u64_rem(wd_nsec, 1000U * 1000U, &wd_rem);
> > pr_warn(" Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
> > cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
> > "
> >
> > The negative value just means the watchdog is running faster than
> > TSC in the 512 ms checking interval. The 18446744073708 ms is just
> > a conversion from s64 value in ns (-878159) to a u64 ns, then a
> > u64 ms.
>
> That is a bit user-unfriendly. Does the following fix address this
> issue at your end?
>
> Thanx, Paul
>
> ------------------------------------------------------------------------
>
> commit 8eb836f2dd44cb1e80dfc603cf47c03603dadcdb
> Author: Paul E. McKenney <[email protected]>
> Date: Wed Jun 7 11:59:49 2023 -0700
>
> clocksource: Handle negative skews in "skew is too large" messages
>
> The nanosecond-to-millisecond skew computation uses unsigned arithmetic,
> which produces user-unfriendly large positive numbers for negative skews.
> Therefore, use signed arithmetic for this computation in order to preserve
> the negativity.

It does make the error message more consistent and less confusing. Thanks.

Reviewed-by: Feng Tang <[email protected]>

>
> Reported-by: Chris Bainbridge <[email protected]>
> Reported-by: Feng Tang <[email protected]>
> Fixes: dd029269947a ("clocksource: Improve "skew is too large" messages")
> Signed-off-by: Paul E. McKenney <[email protected]>
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 91836b727cef..0600e16dbafe 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -473,8 +473,8 @@ static void clocksource_watchdog(struct timer_list *unused)
> /* Check the deviation from the watchdog clocksource. */
> md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> if (abs(cs_nsec - wd_nsec) > md) {
> - u64 cs_wd_msec;
> - u64 wd_msec;
> + s64 cs_wd_msec;
> + s64 wd_msec;
> u32 wd_rem;
>
> pr_warn("timekeeping watchdog on CPU%d: Marking clocksource '%s' as unstable because the skew is too large:\n",
> @@ -483,8 +483,8 @@ static void clocksource_watchdog(struct timer_list *unused)
> watchdog->name, wd_nsec, wdnow, wdlast, watchdog->mask);
> pr_warn(" '%s' cs_nsec: %lld cs_now: %llx cs_last: %llx mask: %llx\n",
> cs->name, cs_nsec, csnow, cslast, cs->mask);
> - cs_wd_msec = div_u64_rem(cs_nsec - wd_nsec, 1000U * 1000U, &wd_rem);
> - wd_msec = div_u64_rem(wd_nsec, 1000U * 1000U, &wd_rem);
> + cs_wd_msec = div_s64_rem(cs_nsec - wd_nsec, 1000 * 1000, &wd_rem);
> + wd_msec = div_s64_rem(wd_nsec, 1000 * 1000, &wd_rem);
> pr_warn(" Clocksource '%s' skewed %lld ns (%lld ms) over watchdog '%s' interval of %lld ns (%lld ms)\n",
> cs->name, cs_nsec - wd_nsec, cs_wd_msec, watchdog->name, wd_nsec, wd_msec);
> if (curr_clocksource == cs)

2023-06-08 10:01:25

by Chris Bainbridge

[permalink] [raw]
Subject: Re: PROBLEM: skew message does not handle negative ns skew

On Wed, 7 Jun 2023 at 20:04, Paul E. McKenney <[email protected]> wrote:
>
> That is a bit user-unfriendly. Does the following fix address this
> issue at your end?

[ 2.095149] clocksource: timekeeping watchdog on CPU3: Marking
clocksource 'tsc' as unstable because the skew is too large:
[ 2.095152] clocksource: 'hpet' wd_nsec:
515998611 wd_now: 1c29fb9 wd_last: 151e3b8 mask: ffffffff
[ 2.095154] clocksource: 'tsc' cs_nsec:
515124524 cs_now: 8af4c89f9 cs_last: 874f8e80b mask: ffffffffffffffff
[ 2.095155] clocksource: Clocksource 'tsc'
skewed -874087 ns (0 ms) over watchdog 'hpet' interval of 515998611 ns
(515 ms)

2023-06-08 16:35:57

by Paul E. McKenney

[permalink] [raw]
Subject: Re: PROBLEM: skew message does not handle negative ns skew

On Thu, Jun 08, 2023 at 10:41:51AM +0100, Chris Bainbridge wrote:
> On Wed, 7 Jun 2023 at 20:04, Paul E. McKenney <[email protected]> wrote:
> >
> > That is a bit user-unfriendly. Does the following fix address this
> > issue at your end?
>
> [ 2.095149] clocksource: timekeeping watchdog on CPU3: Marking
> clocksource 'tsc' as unstable because the skew is too large:
> [ 2.095152] clocksource: 'hpet' wd_nsec:
> 515998611 wd_now: 1c29fb9 wd_last: 151e3b8 mask: ffffffff
> [ 2.095154] clocksource: 'tsc' cs_nsec:
> 515124524 cs_now: 8af4c89f9 cs_last: 874f8e80b mask: ffffffffffffffff
> [ 2.095155] clocksource: Clocksource 'tsc'
> skewed -874087 ns (0 ms) over watchdog 'hpet' interval of 515998611 ns
> (515 ms)

Very good, thank you!

May I please apply your Tested-by?

Thanx, Paul