2016-10-03 22:10:51

by Rich Felker

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> On 26/09/2016 23:07, Rich Felker wrote:
> > Ping. Is there anything that still needs to be changed for this driver
> > to be acceptable?
>
> It is on my radar. I'm reviewing it.
>
> Can you elaborate the workaround mentioned in the changelog. I have been
> digging into the lkml@ thread but it is not clear if the issue is
> related to the time framework, the driver itself or whatever else. Can
> you clarify that ?

Do you have comments on any remaining issues other than this
workaround? If not I don't mind removing the workaround and trying to
solve the issue separately later. Let me know and either way I'll
submit a v8.

Rich


2016-10-04 10:27:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > On 26/09/2016 23:07, Rich Felker wrote:
> > > Ping. Is there anything that still needs to be changed for this driver
> > > to be acceptable?
> >
> > It is on my radar. I'm reviewing it.
> >
> > Can you elaborate the workaround mentioned in the changelog. I have been
> > digging into the lkml@ thread but it is not clear if the issue is
> > related to the time framework, the driver itself or whatever else. Can
> > you clarify that ?
>
> Do you have comments on any remaining issues other than this
> workaround? If not I don't mind removing the workaround and trying to
> solve the issue separately later. Let me know and either way I'll
> submit a v8.

One question of interest to me is whether this patchset prevents the
RCU CPU stall warnings that you are seeing.

Thanx, Paul

2016-10-04 20:58:43

by Rich Felker

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > Ping. Is there anything that still needs to be changed for this driver
> > > > to be acceptable?
> > >
> > > It is on my radar. I'm reviewing it.
> > >
> > > Can you elaborate the workaround mentioned in the changelog. I have been
> > > digging into the lkml@ thread but it is not clear if the issue is
> > > related to the time framework, the driver itself or whatever else. Can
> > > you clarify that ?
> >
> > Do you have comments on any remaining issues other than this
> > workaround? If not I don't mind removing the workaround and trying to
> > solve the issue separately later. Let me know and either way I'll
> > submit a v8.
>
> One question of interest to me is whether this patchset prevents the
> RCU CPU stall warnings that you are seeing.

With the 5ms minimum delta, I didn't observe any rcu_sched stall
warnings. At 2.5ms I thought it was gone but eventually saw one. With
the previous mindelta = 1, i.e. 1 hardware bus period, I get the
stalls regularly.

Rich

2016-10-04 21:14:54

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

On Tue, Oct 04, 2016 at 04:58:37PM -0400, Rich Felker wrote:
> On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > > Ping. Is there anything that still needs to be changed for this driver
> > > > > to be acceptable?
> > > >
> > > > It is on my radar. I'm reviewing it.
> > > >
> > > > Can you elaborate the workaround mentioned in the changelog. I have been
> > > > digging into the lkml@ thread but it is not clear if the issue is
> > > > related to the time framework, the driver itself or whatever else. Can
> > > > you clarify that ?
> > >
> > > Do you have comments on any remaining issues other than this
> > > workaround? If not I don't mind removing the workaround and trying to
> > > solve the issue separately later. Let me know and either way I'll
> > > submit a v8.
> >
> > One question of interest to me is whether this patchset prevents the
> > RCU CPU stall warnings that you are seeing.
>
> With the 5ms minimum delta, I didn't observe any rcu_sched stall
> warnings. At 2.5ms I thought it was gone but eventually saw one. With
> the previous mindelta = 1, i.e. 1 hardware bus period, I get the
> stalls regularly.

Sounds to me like your low-level clock drivers or your clock hardware is
having trouble dealing with short timeouts. I suggest writing a focused
test. It is of course quite possible that the failure could occur for
any time period, but simply becomes less probable with longer time
periods.

Or perhaps better, do the tracing that Thomas Gleixner suggested. Or both,
in order to get the most information in the shortest period of time.

Thanx, Paul

2016-10-04 21:48:23

by Rich Felker

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

On Tue, Oct 04, 2016 at 02:14:51PM -0700, Paul E. McKenney wrote:
> On Tue, Oct 04, 2016 at 04:58:37PM -0400, Rich Felker wrote:
> > On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> > > On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > > > Ping. Is there anything that still needs to be changed for this driver
> > > > > > to be acceptable?
> > > > >
> > > > > It is on my radar. I'm reviewing it.
> > > > >
> > > > > Can you elaborate the workaround mentioned in the changelog. I have been
> > > > > digging into the lkml@ thread but it is not clear if the issue is
> > > > > related to the time framework, the driver itself or whatever else. Can
> > > > > you clarify that ?
> > > >
> > > > Do you have comments on any remaining issues other than this
> > > > workaround? If not I don't mind removing the workaround and trying to
> > > > solve the issue separately later. Let me know and either way I'll
> > > > submit a v8.
> > >
> > > One question of interest to me is whether this patchset prevents the
> > > RCU CPU stall warnings that you are seeing.
> >
> > With the 5ms minimum delta, I didn't observe any rcu_sched stall
> > warnings. At 2.5ms I thought it was gone but eventually saw one. With
> > the previous mindelta = 1, i.e. 1 hardware bus period, I get the
> > stalls regularly.
>
> Sounds to me like your low-level clock drivers or your clock hardware is
> having trouble dealing with short timeouts. I suggest writing a focused
> test. It is of course quite possible that the failure could occur for
> any time period, but simply becomes less probable with longer time
> periods.

I don't see any indication of such a problem on the hardware side, and
it didn't happen in earlier kernels, though of course it's possible
that they were programming different intervals. And all of the
intervals logged when I added trace logging were way above the scale
where it would be plausible for the hardware to be an issue.

> Or perhaps better, do the tracing that Thomas Gleixner suggested. Or both,
> in order to get the most information in the shortest period of time.

I'm starting with new traces with "nohlt" on the kernel command line,
since many of the stalls I've observed up to now may have a different
mechanism -- since the arch_cpu_idle code doesn't work (returns
immediately) the rapid rcu idle enter/exit in the cpu idle loop might
be part of what was interfering with rcu_sched.

With my new traces and LED debugging (which I still need to tune with
more logging, then I'll post some more results) what I'm seeing when
the stall happens is the timer interrupt firing as expected and making
it into __hrtimer_run_queues, but tick_sched_timer not getting called
and thus rcu_sched never waking up from its 1s schedule_timeout (via
swait).

Rich

2016-10-05 13:36:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH v7 2/2] clocksource: add J-Core timer/clocksource driver

On Tue, Oct 04, 2016 at 05:48:18PM -0400, Rich Felker wrote:
> On Tue, Oct 04, 2016 at 02:14:51PM -0700, Paul E. McKenney wrote:
> > On Tue, Oct 04, 2016 at 04:58:37PM -0400, Rich Felker wrote:
> > > On Tue, Oct 04, 2016 at 12:06:23AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Oct 03, 2016 at 06:10:39PM -0400, Rich Felker wrote:
> > > > > On Mon, Sep 26, 2016 at 11:27:14PM +0200, Daniel Lezcano wrote:
> > > > > > On 26/09/2016 23:07, Rich Felker wrote:
> > > > > > > Ping. Is there anything that still needs to be changed for this driver
> > > > > > > to be acceptable?
> > > > > >
> > > > > > It is on my radar. I'm reviewing it.
> > > > > >
> > > > > > Can you elaborate the workaround mentioned in the changelog. I have been
> > > > > > digging into the lkml@ thread but it is not clear if the issue is
> > > > > > related to the time framework, the driver itself or whatever else. Can
> > > > > > you clarify that ?
> > > > >
> > > > > Do you have comments on any remaining issues other than this
> > > > > workaround? If not I don't mind removing the workaround and trying to
> > > > > solve the issue separately later. Let me know and either way I'll
> > > > > submit a v8.
> > > >
> > > > One question of interest to me is whether this patchset prevents the
> > > > RCU CPU stall warnings that you are seeing.
> > >
> > > With the 5ms minimum delta, I didn't observe any rcu_sched stall
> > > warnings. At 2.5ms I thought it was gone but eventually saw one. With
> > > the previous mindelta = 1, i.e. 1 hardware bus period, I get the
> > > stalls regularly.
> >
> > Sounds to me like your low-level clock drivers or your clock hardware is
> > having trouble dealing with short timeouts. I suggest writing a focused
> > test. It is of course quite possible that the failure could occur for
> > any time period, but simply becomes less probable with longer time
> > periods.
>
> I don't see any indication of such a problem on the hardware side, and
> it didn't happen in earlier kernels, though of course it's possible
> that they were programming different intervals. And all of the
> intervals logged when I added trace logging were way above the scale
> where it would be plausible for the hardware to be an issue.

Is it possible that the problem occurs when the process setting up the
timeout is delayed for longer than the timeout duration, so that the
time has already expired at the time that the hardware is programmed?

> > Or perhaps better, do the tracing that Thomas Gleixner suggested. Or both,
> > in order to get the most information in the shortest period of time.
>
> I'm starting with new traces with "nohlt" on the kernel command line,
> since many of the stalls I've observed up to now may have a different
> mechanism -- since the arch_cpu_idle code doesn't work (returns
> immediately) the rapid rcu idle enter/exit in the cpu idle loop might
> be part of what was interfering with rcu_sched.

That doesn't sound good, though I cannot immediately think of a reason
why that would cause RCU much trouble.

> With my new traces and LED debugging (which I still need to tune with
> more logging, then I'll post some more results) what I'm seeing when
> the stall happens is the timer interrupt firing as expected and making
> it into __hrtimer_run_queues, but tick_sched_timer not getting called
> and thus rcu_sched never waking up from its 1s schedule_timeout (via
> swait).

It would be very interesting to track down exactly the point at which
the system decides not to fire the timeout.

Thanx, Paul