2020-01-23 20:43:47

by Steven Rostedt

[permalink] [raw]
Subject: [PATCH RT 10/30] hrtimer: Prevent using hrtimer_grab_expiry_lock() on migration_base

4.19.94-rt39-rc2 stable review patch.
If anyone has any objections, please let me know.

------------------

From: Julien Grall <[email protected]>

[ Upstream commit cef1b87f98823af923a386f3f69149acb212d4a1 ]

As tglx puts it:
|If base == migration_base then there is no point to lock soft_expiry_lock
|simply because the timer is not executing the callback in soft irq context
|and the whole lock/unlock dance can be avoided.

Furthermore, all the path leading to hrtimer_grab_expiry_lock() assumes
timer->base and timer->base->cpu_base are always non-NULL. So it is safe
to remove the NULL checks here.

Signed-off-by: Julien Grall <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]
Signed-off-by: Steven Rostedt (VMware) <[email protected]>
[bigeasy: rewrite changelog]
Signed-off-by: Sebastian Andrzej Siewior <[email protected]>
---
kernel/time/hrtimer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index 49d20fe8570f..1a5167c68310 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -943,7 +943,7 @@ void hrtimer_grab_expiry_lock(const struct hrtimer *timer)
{
struct hrtimer_clock_base *base = READ_ONCE(timer->base);

- if (timer->is_soft && base && base->cpu_base) {
+ if (timer->is_soft && base != &migration_base) {
spin_lock(&base->cpu_base->softirq_expiry_lock);
spin_unlock(&base->cpu_base->softirq_expiry_lock);
}
--
2.24.1



2020-04-27 13:12:58

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [PATCH RT 10/30] hrtimer: Prevent using hrtimer_grab_expiry_lock() on migration_base

On 23/01/2020 21.39, Steven Rostedt wrote:
> 4.19.94-rt39-rc2 stable review patch.
> If anyone has any objections, please let me know.
>
> ------------------
>
> From: Julien Grall <[email protected]>
>
> [ Upstream commit cef1b87f98823af923a386f3f69149acb212d4a1 ]
>
> As tglx puts it:
> |If base == migration_base then there is no point to lock soft_expiry_lock
> |simply because the timer is not executing the callback in soft irq context
> |and the whole lock/unlock dance can be avoided.
>
> Furthermore, all the path leading to hrtimer_grab_expiry_lock() assumes
> timer->base and timer->base->cpu_base are always non-NULL. So it is safe
> to remove the NULL checks here.
>
> Signed-off-by: Julien Grall <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
> Signed-off-by: Steven Rostedt (VMware) <[email protected]>
> [bigeasy: rewrite changelog]
> Signed-off-by: Sebastian Andrzej Siewior <[email protected]>

Pretty late to the party, but I think I've bisected a problem to this
patch (and its required fixup for !SMP, "hrtimer: Add a missing bracket
and hide `migration_base on !SMP").

Originally, a customer reported that upgrading from 4.19.82-rt30 to
v4.19.106-rt45 failed to boot, stalling around the time the network gets
initialized (this is a board with an embedded Marvell switch). Perhaps 1
in 10 times, the board would come up successfully. I haven't been able
to reproduce that particular problem (or, perhaps I've seen it once or
twice, but not nearly often enough to use that as a basis for bisection).

However, building with their rescue initrd and booting that, the board
would consistently hang during reboot. Sometimes I would get lines like

[ 72.956630] sched: RT throttling activated
[ 72.973769] lanx: port 1(lan1) entered disabled state
[ 73.000401] lanx: port 2(lan2) entered disabled state
[ 73.974951] lanx: port 3(lan3) entered disabled state
[ 73.997473] lanx: port 4(lan4) entered disabled state
[ 74.968006] lanx: port 5(lan5) entered disabled state

other times there would be no output, but the board was still hanging.
Reverting

b1a471ec4df1 - hrtimer: Prevent using hrtimer_grab_expiry_lock() on
migration_base
40aae5708e7a - hrtimer: Add a missing bracket and hide `migration_base'
on !SMP

on top of v4.19.94-rt39 makes that problem go away, i.e. the board
reboots as expected.

The board is a 32 bit powerpc (mpc8309) !SMP. Any ideas what I can do to
debug this further?

Thanks,
Rasmus

2020-04-27 19:08:57

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH RT 10/30] hrtimer: Prevent using hrtimer_grab_expiry_lock() on migration_base

On Mon, 27 Apr 2020 15:10:00 +0200
Rasmus Villemoes <[email protected]> wrote:

> However, building with their rescue initrd and booting that, the board
> would consistently hang during reboot. Sometimes I would get lines like
>
> [ 72.956630] sched: RT throttling activated
> [ 72.973769] lanx: port 1(lan1) entered disabled state
> [ 73.000401] lanx: port 2(lan2) entered disabled state
> [ 73.974951] lanx: port 3(lan3) entered disabled state
> [ 73.997473] lanx: port 4(lan4) entered disabled state
> [ 74.968006] lanx: port 5(lan5) entered disabled state
>
> other times there would be no output, but the board was still hanging.
> Reverting
>
> b1a471ec4df1 - hrtimer: Prevent using hrtimer_grab_expiry_lock() on
> migration_base
> 40aae5708e7a - hrtimer: Add a missing bracket and hide `migration_base'
> on !SMP
>
> on top of v4.19.94-rt39 makes that problem go away, i.e. the board
> reboots as expected.
>
> The board is a 32 bit powerpc (mpc8309) !SMP. Any ideas what I can do to
> debug this further?

Thanks Rasmus for looking into this. Tom now maintains 4.19-rt.

Tom, care to pull in these patches on top of 4.19-rt?

-- Steve

2020-04-27 19:29:35

by Tom Zanussi

[permalink] [raw]
Subject: Re: [PATCH RT 10/30] hrtimer: Prevent using hrtimer_grab_expiry_lock() on migration_base

On Mon, 2020-04-27 at 15:06 -0400, Steven Rostedt wrote:
> On Mon, 27 Apr 2020 15:10:00 +0200
> Rasmus Villemoes <[email protected]> wrote:
>
> > However, building with their rescue initrd and booting that, the
> > board
> > would consistently hang during reboot. Sometimes I would get lines
> > like
> >
> > [ 72.956630] sched: RT throttling activated
> > [ 72.973769] lanx: port 1(lan1) entered disabled state
> > [ 73.000401] lanx: port 2(lan2) entered disabled state
> > [ 73.974951] lanx: port 3(lan3) entered disabled state
> > [ 73.997473] lanx: port 4(lan4) entered disabled state
> > [ 74.968006] lanx: port 5(lan5) entered disabled state
> >
> > other times there would be no output, but the board was still
> > hanging.
> > Reverting
> >
> > b1a471ec4df1 - hrtimer: Prevent using hrtimer_grab_expiry_lock() on
> > migration_base
> > 40aae5708e7a - hrtimer: Add a missing bracket and hide
> > `migration_base'
> > on !SMP
> >
> > on top of v4.19.94-rt39 makes that problem go away, i.e. the board
> > reboots as expected.
> >
> > The board is a 32 bit powerpc (mpc8309) !SMP. Any ideas what I can
> > do to
> > debug this further?
>
> Thanks Rasmus for looking into this. Tom now maintains 4.19-rt.
>
> Tom, care to pull in these patches on top of 4.19-rt?
>

Those patches are already in 4.19-rt - he's saying that reverting them
fixes the problem.

I'm guessing that the assumption of base or base->cpu_base always being
non-NULL in those patches might be wrong. If so, the below patch
should fix the problem:

Subject: [PATCH] hrtimer: Add back base and base->cpu_base checks in
hrtimer_grab_expiry_lock()

4.19 commit b1a471ec4df1 [hrtimer: Prevent using
hrtimer_grab_expiry_lock() on migration_base] removed the NULL checks
for timer->base and timer->base->cpu_base on the assumption that
they're always non-NULL. That assumption is apparently not to be
true, so add the checks back.

Signed-off-by: Tom Zanussi <[email protected]>
---
kernel/time/hrtimer.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
index e54a95de8b79..6f20cf23008b 100644
--- a/kernel/time/hrtimer.c
+++ b/kernel/time/hrtimer.c
@@ -953,7 +953,7 @@ void hrtimer_grab_expiry_lock(const struct hrtimer *timer)
{
struct hrtimer_clock_base *base = READ_ONCE(timer->base);

- if (timer->is_soft && is_migration_base(base)) {
+ if (timer->is_soft && base && base->cpu_base && is_migration_base(base)) {
spin_lock(&base->cpu_base->softirq_expiry_lock);
spin_unlock(&base->cpu_base->softirq_expiry_lock);
}
--
2.17.1


> -- Steve

2020-04-28 06:55:32

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [PATCH RT 10/30] hrtimer: Prevent using hrtimer_grab_expiry_lock() on migration_base

On 27/04/2020 21.26, Tom Zanussi wrote:
> On Mon, 2020-04-27 at 15:06 -0400, Steven Rostedt wrote:
>> On Mon, 27 Apr 2020 15:10:00 +0200
>> Rasmus Villemoes <[email protected]> wrote:
>>
>>> Reverting
>>>
>>> b1a471ec4df1 - hrtimer: Prevent using hrtimer_grab_expiry_lock() on
>>> migration_base
>>> 40aae5708e7a - hrtimer: Add a missing bracket and hide
>>> `migration_base'
>>> on !SMP
>>>
>>> on top of v4.19.94-rt39 makes that problem go away, i.e. the board
>>> reboots as expected.
>>>
>> Thanks Rasmus for looking into this. Tom now maintains 4.19-rt.
>>
>> Tom, care to pull in these patches on top of 4.19-rt?
>>
>
> Those patches are already in 4.19-rt - he's saying that reverting them
> fixes the problem.
>
> I'm guessing that the assumption of base or base->cpu_base always being
> non-NULL in those patches might be wrong. If so, the below patch
> should fix the problem:
>
> Subject: [PATCH] hrtimer: Add back base and base->cpu_base checks in
> hrtimer_grab_expiry_lock()
>
> 4.19 commit b1a471ec4df1 [hrtimer: Prevent using
> hrtimer_grab_expiry_lock() on migration_base] removed the NULL checks
> for timer->base and timer->base->cpu_base on the assumption that
> they're always non-NULL. That assumption is apparently not to be
> true, so add the checks back.
>
> Signed-off-by: Tom Zanussi <[email protected]>
> ---
> kernel/time/hrtimer.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/time/hrtimer.c b/kernel/time/hrtimer.c
> index e54a95de8b79..6f20cf23008b 100644
> --- a/kernel/time/hrtimer.c
> +++ b/kernel/time/hrtimer.c
> @@ -953,7 +953,7 @@ void hrtimer_grab_expiry_lock(const struct hrtimer *timer)
> {
> struct hrtimer_clock_base *base = READ_ONCE(timer->base);
>
> - if (timer->is_soft && is_migration_base(base)) {
> + if (timer->is_soft && base && base->cpu_base && is_migration_base(base)) {

I'm sorry, but no, I don't think that can be it. For !SMP (my case),
is_migration_base() is always false, so with or without the above, the
whole if() is false. Also, the symptoms do not look like a NULL pointer
deref, but more like a dead (or live) lock - so I'm guessing (and that's
just a wild guess) that the lock/unlock sequence is needed to give some
other thread a priority boost to make the whole machine make forward
progress.

Rasmus

2020-04-28 07:07:07

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [PATCH RT 10/30] hrtimer: Prevent using hrtimer_grab_expiry_lock() on migration_base

On 23/01/2020 21.39, Steven Rostedt wrote:
> 4.19.94-rt39-rc2 stable review patch.
> If anyone has any objections, please let me know.
>
> ------------------
>
> From: Julien Grall <[email protected]>
>
> [ Upstream commit cef1b87f98823af923a386f3f69149acb212d4a1 ]
>
> As tglx puts it:
> |If base == migration_base then there is no point to lock soft_expiry_lock
> |simply because the timer is not executing the callback in soft irq context
> |and the whole lock/unlock dance can be avoided.

Hold on a second. This patch (hrtimer: Prevent using
hrtimer_grab_expiry_lock() on migration_base) indeed seems to implement
the optimization implied by the above, namely avoid the lock/unlock in
case base == migration_base:

> - if (timer->is_soft && base && base->cpu_base) {
> + if (timer->is_soft && base != &migration_base) {

But the followup patch (hrtimer: Add a missing bracket and hide
`migration_base on !SMP) to fix the build on !SMP [the missing bracket
part seems to have been fixed when backporting the above to 4.19-rt]
replaces that logic by

+static inline bool is_migration_base(struct hrtimer_clock_base *base)
+{
+ return base == &migration_base;
+}
+
...
- if (timer->is_soft && base != &migration_base) {
+ if (timer->is_soft && is_migration_base(base)) {

in the SMP case, i.e. the exact opposite condition. One of these can't
be correct.

Assuming the followup patch was wrong and the condition should have read

timer->is_soft && !is_migration_base(base)

while keeping is_migration_base() false on !SMP might explain the
problem I see. But I'd like someone who knows this code to chime in.

Thanks,
Rasmus

2020-04-28 13:04:00

by Tom Zanussi

[permalink] [raw]
Subject: Re: [PATCH RT 10/30] hrtimer: Prevent using hrtimer_grab_expiry_lock() on migration_base

On Tue, 2020-04-28 at 09:03 +0200, Rasmus Villemoes wrote:
> On 23/01/2020 21.39, Steven Rostedt wrote:
> > 4.19.94-rt39-rc2 stable review patch.
> > If anyone has any objections, please let me know.
> >
> > ------------------
> >
> > From: Julien Grall <[email protected]>
> >
> > [ Upstream commit cef1b87f98823af923a386f3f69149acb212d4a1 ]
> >
> > As tglx puts it:
> > > If base == migration_base then there is no point to lock
> > > soft_expiry_lock
> > > simply because the timer is not executing the callback in soft
> > > irq context
> > > and the whole lock/unlock dance can be avoided.
>
> Hold on a second. This patch (hrtimer: Prevent using
> hrtimer_grab_expiry_lock() on migration_base) indeed seems to
> implement
> the optimization implied by the above, namely avoid the lock/unlock
> in
> case base == migration_base:
>
> > - if (timer->is_soft && base && base->cpu_base) {
> > + if (timer->is_soft && base != &migration_base) {
>
> But the followup patch (hrtimer: Add a missing bracket and hide
> `migration_base on !SMP) to fix the build on !SMP [the missing
> bracket
> part seems to have been fixed when backporting the above to 4.19-rt]
> replaces that logic by
>
> +static inline bool is_migration_base(struct hrtimer_clock_base
> *base)
> +{
> + return base == &migration_base;
> +}
> +
> ...
> - if (timer->is_soft && base != &migration_base) {
> + if (timer->is_soft && is_migration_base(base)) {
>
> in the SMP case, i.e. the exact opposite condition. One of these
> can't
> be correct.
>
> Assuming the followup patch was wrong and the condition should have
> read
>
> timer->is_soft && !is_migration_base(base)
>
> while keeping is_migration_base() false on !SMP might explain the
> problem I see. But I'd like someone who knows this code to chime in.
>

I don't know this code, but I think you're correct - the followup patch
reversed the condition by forgetting the !.

So, does your problem go away when you make that change?

Tom

> Thanks,
> Rasmus

2020-04-28 13:11:58

by Rasmus Villemoes

[permalink] [raw]
Subject: Re: [PATCH RT 10/30] hrtimer: Prevent using hrtimer_grab_expiry_lock() on migration_base

On 28/04/2020 14.59, Tom Zanussi wrote:
> On Tue, 2020-04-28 at 09:03 +0200, Rasmus Villemoes wrote:

>> Hold on a second. This patch (hrtimer: Prevent using
>> hrtimer_grab_expiry_lock() on migration_base) indeed seems to
>> implement
>> the optimization implied by the above, namely avoid the lock/unlock
>> in
>> case base == migration_base:
>>
>>> - if (timer->is_soft && base && base->cpu_base) {
>>> + if (timer->is_soft && base != &migration_base) {
>>
>> But the followup patch (hrtimer: Add a missing bracket and hide
>> `migration_base on !SMP) to fix the build on !SMP [the missing
>> bracket
>> part seems to have been fixed when backporting the above to 4.19-rt]
>> replaces that logic by
>>
>> +static inline bool is_migration_base(struct hrtimer_clock_base
>> *base)
>> +{
>> + return base == &migration_base;
>> +}
>> +
>> ...
>> - if (timer->is_soft && base != &migration_base) {
>> + if (timer->is_soft && is_migration_base(base)) {
>>
>> in the SMP case, i.e. the exact opposite condition. One of these
>> can't
>> be correct.
>>
>> Assuming the followup patch was wrong and the condition should have
>> read
>>
>> timer->is_soft && !is_migration_base(base)
>>
>> while keeping is_migration_base() false on !SMP might explain the
>> problem I see. But I'd like someone who knows this code to chime in.
>>
>
> I don't know this code, but I think you're correct - the followup patch
> reversed the condition by forgetting the !.
>
> So, does your problem go away when you make that change?

Yes, it does. (I'll have to ask the customer to check in their setup
whether the boot hang also vanishes).

Essentially, adding that ! is equivalent to reverting the two patches on
!SMP (which I also tested): Before, the condition was

timer->is_soft && base && base->cpu_base

and, assuming the NULL pointer checks are indeed redundant, that's the
same as "timer->is_soft". Appending " && !is_migration_base()" to that,
with is_migration_base() always false as on !SMP, doesn't change anything.

Thanks,
Rasmus

2020-04-28 13:44:57

by Tom Zanussi

[permalink] [raw]
Subject: Re: [PATCH RT 10/30] hrtimer: Prevent using hrtimer_grab_expiry_lock() on migration_base

On Tue, 2020-04-28 at 15:07 +0200, Rasmus Villemoes wrote:
> On 28/04/2020 14.59, Tom Zanussi wrote:
> > On Tue, 2020-04-28 at 09:03 +0200, Rasmus Villemoes wrote:
> > > Hold on a second. This patch (hrtimer: Prevent using
> > > hrtimer_grab_expiry_lock() on migration_base) indeed seems to
> > > implement
> > > the optimization implied by the above, namely avoid the
> > > lock/unlock
> > > in
> > > case base == migration_base:
> > >
> > > > - if (timer->is_soft && base && base->cpu_base) {
> > > > + if (timer->is_soft && base != &migration_base) {
> > >
> > > But the followup patch (hrtimer: Add a missing bracket and hide
> > > `migration_base on !SMP) to fix the build on !SMP [the missing
> > > bracket
> > > part seems to have been fixed when backporting the above to 4.19-
> > > rt]
> > > replaces that logic by
> > >
> > > +static inline bool is_migration_base(struct hrtimer_clock_base
> > > *base)
> > > +{
> > > + return base == &migration_base;
> > > +}
> > > +
> > > ...
> > > - if (timer->is_soft && base != &migration_base) {
> > > + if (timer->is_soft && is_migration_base(base)) {
> > >
> > > in the SMP case, i.e. the exact opposite condition. One of these
> > > can't
> > > be correct.
> > >
> > > Assuming the followup patch was wrong and the condition should
> > > have
> > > read
> > >
> > > timer->is_soft && !is_migration_base(base)
> > >
> > > while keeping is_migration_base() false on !SMP might explain the
> > > problem I see. But I'd like someone who knows this code to chime
> > > in.
> > >
> >
> > I don't know this code, but I think you're correct - the followup
> > patch
> > reversed the condition by forgetting the !.
> >
> > So, does your problem go away when you make that change?
>
> Yes, it does. (I'll have to ask the customer to check in their setup
> whether the boot hang also vanishes).
>
> Essentially, adding that ! is equivalent to reverting the two patches
> on
> !SMP (which I also tested): Before, the condition was
>
> timer->is_soft && base && base->cpu_base
>
> and, assuming the NULL pointer checks are indeed redundant, that's
> the
> same as "timer->is_soft". Appending " && !is_migration_base()" to
> that,
> with is_migration_base() always false as on !SMP, doesn't change
> anything.
>

OK, great, thanks for tracking this down.

If you post a patch that makes that change and mention that it's a fix
for commit "40aae5708e7a hrtimer: Add a missing bracket and hide
`migration_base' on !SMP", I can pull it into a new update release.

Thanks,

Tom

> Thanks,
> Rasmus