2021-09-03 13:44:16

by Jens Axboe

[permalink] [raw]
Subject: Bug: d0e936adbd22 crashes at boot

Hi,

Booting Linus's tree causes a crash on my laptop, an x1 gen9. This was a bit
difficult to pin down as it crashes before the display is up, but I managed
to narrow it down to:

commit d0e936adbd2250cb03f2e840c6651d18edc22ace
Author: Srinivas Pandruvada <[email protected]>
Date: Thu Aug 19 19:40:06 2021 -0700

cpufreq: intel_pstate: Process HWP Guaranteed change notification

which crashes with a NULL pointer deref in notify_hwp_interrupt() ->
queue_delayed_work_on().

Reverting this change makes the laptop boot fine again.

--
Jens Axboe


2021-09-03 14:16:42

by srinivas pandruvada

[permalink] [raw]
Subject: Re: Bug: d0e936adbd22 crashes at boot

Hi Axboe,

Thanks for reporting.
On Fri, 2021-09-03 at 07:36 -0600, Jens Axboe wrote:
> Hi,
>
> Booting Linus's tree causes a crash on my laptop, an x1 gen9. This was
> a bit
> difficult to pin down as it crashes before the display is up, but I
> managed
> to narrow it down to:
>
> commit d0e936adbd2250cb03f2e840c6651d18edc22ace
> Author: Srinivas Pandruvada <[email protected]>
> Date:   Thu Aug 19 19:40:06 2021 -0700
>
>     cpufreq: intel_pstate: Process HWP Guaranteed change notification
>
> which crashes with a NULL pointer deref in notify_hwp_interrupt() ->
> queue_delayed_work_on().
>
> Reverting this change makes the laptop boot fine again.
>
Does this change fixes your issue?

diff --git a/drivers/cpufreq/intel_pstate.c
b/drivers/cpufreq/intel_pstate.c
index b4ffe6c8a0d0..6a3c6f60ad12 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1650,7 +1650,10 @@ void notify_hwp_interrupt(void)
return;

cpudata = all_cpu_data[this_cpu];
- schedule_delayed_work_on(this_cpu, &cpudata->hwp_notify_work,
msecs_to_jiffies(10));
+ if (cpudata)
+ schedule_delayed_work_on(this_cpu, &cpudata-
>hwp_notify_work, msecs_to_jiffies(10));
+ else
+ wrmsrl(MSR_HWP_STATUS, 0);
}

static void intel_pstate_enable_hwp_interrupt(struct cpudata *cpudata)


Thanks,
Srinivas

2021-09-03 14:20:05

by Jens Axboe

[permalink] [raw]
Subject: Re: Bug: d0e936adbd22 crashes at boot

On 9/3/21 8:13 AM, Srinivas Pandruvada wrote:
> Hi Axboe,
>
> Thanks for reporting.
> On Fri, 2021-09-03 at 07:36 -0600, Jens Axboe wrote:
>> Hi,
>>
>> Booting Linus's tree causes a crash on my laptop, an x1 gen9. This was
>> a bit
>> difficult to pin down as it crashes before the display is up, but I
>> managed
>> to narrow it down to:
>>
>> commit d0e936adbd2250cb03f2e840c6651d18edc22ace
>> Author: Srinivas Pandruvada <[email protected]>
>> Date:   Thu Aug 19 19:40:06 2021 -0700
>>
>>     cpufreq: intel_pstate: Process HWP Guaranteed change notification
>>
>> which crashes with a NULL pointer deref in notify_hwp_interrupt() ->
>> queue_delayed_work_on().
>>
>> Reverting this change makes the laptop boot fine again.
>>
> Does this change fixes your issue?

I would assume so, as it's crashing on cpudata == NULL :-)

But why is it NULL? Happy to test patches, but the below doesn't look like
a real fix and more of a work-around.

--
Jens Axboe

2021-09-03 15:13:25

by srinivas pandruvada

[permalink] [raw]
Subject: Re: Bug: d0e936adbd22 crashes at boot

On Fri, 2021-09-03 at 09:00 -0600, Jens Axboe wrote:
> On 9/3/21 8:38 AM, Srinivas Pandruvada wrote:
> > On Fri, 2021-09-03 at 08:15 -0600, Jens Axboe wrote:
> > > On 9/3/21 8:13 AM, Srinivas Pandruvada wrote:
> > > > Hi Axboe,
> > > >
> > > > Thanks for reporting.
> > > > On Fri, 2021-09-03 at 07:36 -0600, Jens Axboe wrote:
> > > > > Hi,
> > > > >
> > > > > Booting Linus's tree causes a crash on my laptop, an x1 gen9.
> > > > > This
> > > > > was
> > > > > a bit
> > > > > difficult to pin down as it crashes before the display is up,
> > > > > but I
> > > > > managed
> > > > > to narrow it down to:
> > > > >
> > > > > commit d0e936adbd2250cb03f2e840c6651d18edc22ace
> > > > > Author: Srinivas Pandruvada <
> > > > > [email protected]>
> > > > > Date:   Thu Aug 19 19:40:06 2021 -0700
> > > > >
> > > > >     cpufreq: intel_pstate: Process HWP Guaranteed change
> > > > > notification
> > > > >
> > > > > which crashes with a NULL pointer deref in
> > > > > notify_hwp_interrupt() -
> > > > > >
> > > > > queue_delayed_work_on().
> > > > >
> > > > > Reverting this change makes the laptop boot fine again.
> > > > >
> > > > Does this change fixes your issue?
> > >
> > > I would assume so, as it's crashing on cpudata == NULL :-)
> > >
> > > But why is it NULL? Happy to test patches, but the below doesn't
> > > look
> > > like
> > > a real fix and more of a work-around.
> >
> > This platform is sending an HWP interrupt on a CPU which we didn't
> > yet
> > bring it up for pstate control. So somehow firmware decided to send
> > very early during boot, which previously we would have ignored it
> >
> > Actually try this, with more prevention
>
> I can give this a whirl.
>
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index b4ffe6c8a0d0..6ee88d7640ea 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -1645,12 +1645,24 @@ void notify_hwp_interrupt(void)
> >         if (!hwp_active || !boot_cpu_has(X86_FEATURE_HWP_NOTIFY))
> >                 return;
> >  
> > -       rdmsrl(MSR_HWP_STATUS, value);
> > +       rdmsrl_safe(MSR_HWP_STATUS, &value);
> >         if (!(value & 0x01))
> >                 return;
> >  
> > +       /*
> > +        * After hwp_active is set and all_cpu_data is allocated,
> > there
> > +        * is small window.
> > +        */
> > +       if (!all_cpu_data) {
> > +               wrmsrl_safe(MSR_HWP_STATUS, 0);
> > +               return;
> > +       }
>
> What synchronizes the all_cpu_data setup and the interrupt? Can the
> interrupt come in while it's still being setup?
Yes.
I am working on a change to simulate that case and fix.

Thanks,
Srinivas
>


2021-09-03 16:14:50

by srinivas pandruvada

[permalink] [raw]
Subject: Re: Bug: d0e936adbd22 crashes at boot

On Fri, 2021-09-03 at 08:15 -0600, Jens Axboe wrote:
> On 9/3/21 8:13 AM, Srinivas Pandruvada wrote:
> > Hi Axboe,
> >
> > Thanks for reporting.
> > On Fri, 2021-09-03 at 07:36 -0600, Jens Axboe wrote:
> > > Hi,
> > >
> > > Booting Linus's tree causes a crash on my laptop, an x1 gen9. This
> > > was
> > > a bit
> > > difficult to pin down as it crashes before the display is up, but I
> > > managed
> > > to narrow it down to:
> > >
> > > commit d0e936adbd2250cb03f2e840c6651d18edc22ace
> > > Author: Srinivas Pandruvada <[email protected]>
> > > Date:   Thu Aug 19 19:40:06 2021 -0700
> > >
> > >     cpufreq: intel_pstate: Process HWP Guaranteed change
> > > notification
> > >
> > > which crashes with a NULL pointer deref in notify_hwp_interrupt() -
> > > >
> > > queue_delayed_work_on().
> > >
> > > Reverting this change makes the laptop boot fine again.
> > >
> > Does this change fixes your issue?
>
> I would assume so, as it's crashing on cpudata == NULL :-)
>
> But why is it NULL? Happy to test patches, but the below doesn't look
> like
> a real fix and more of a work-around.

This platform is sending an HWP interrupt on a CPU which we didn't yet
bring it up for pstate control. So somehow firmware decided to send
very early during boot, which previously we would have ignored it

Actually try this, with more prevention

diff --git a/drivers/cpufreq/intel_pstate.c
b/drivers/cpufreq/intel_pstate.c
index b4ffe6c8a0d0..6ee88d7640ea 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1645,12 +1645,24 @@ void notify_hwp_interrupt(void)
if (!hwp_active || !boot_cpu_has(X86_FEATURE_HWP_NOTIFY))
return;

- rdmsrl(MSR_HWP_STATUS, value);
+ rdmsrl_safe(MSR_HWP_STATUS, &value);
if (!(value & 0x01))
return;

+ /*
+ * After hwp_active is set and all_cpu_data is allocated, there
+ * is small window.
+ */
+ if (!all_cpu_data) {
+ wrmsrl_safe(MSR_HWP_STATUS, 0);
+ return;
+ }
+
cpudata = all_cpu_data[this_cpu];
- schedule_delayed_work_on(this_cpu, &cpudata->hwp_notify_work,
msecs_to_jiffies(10));
+ if (cpudata)
+ schedule_delayed_work_on(this_cpu, &cpudata-
>hwp_notify_work, msecs_to_jiffies(10));
+ else
+ wrmsrl_safe(MSR_HWP_STATUS, 0);
}



>


2021-09-03 16:17:41

by Jens Axboe

[permalink] [raw]
Subject: Re: Bug: d0e936adbd22 crashes at boot

On 9/3/21 8:38 AM, Srinivas Pandruvada wrote:
> On Fri, 2021-09-03 at 08:15 -0600, Jens Axboe wrote:
>> On 9/3/21 8:13 AM, Srinivas Pandruvada wrote:
>>> Hi Axboe,
>>>
>>> Thanks for reporting.
>>> On Fri, 2021-09-03 at 07:36 -0600, Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> Booting Linus's tree causes a crash on my laptop, an x1 gen9. This
>>>> was
>>>> a bit
>>>> difficult to pin down as it crashes before the display is up, but I
>>>> managed
>>>> to narrow it down to:
>>>>
>>>> commit d0e936adbd2250cb03f2e840c6651d18edc22ace
>>>> Author: Srinivas Pandruvada <[email protected]>
>>>> Date: Thu Aug 19 19:40:06 2021 -0700
>>>>
>>>> cpufreq: intel_pstate: Process HWP Guaranteed change
>>>> notification
>>>>
>>>> which crashes with a NULL pointer deref in notify_hwp_interrupt() -
>>>>>
>>>> queue_delayed_work_on().
>>>>
>>>> Reverting this change makes the laptop boot fine again.
>>>>
>>> Does this change fixes your issue?
>>
>> I would assume so, as it's crashing on cpudata == NULL :-)
>>
>> But why is it NULL? Happy to test patches, but the below doesn't look
>> like
>> a real fix and more of a work-around.
>
> This platform is sending an HWP interrupt on a CPU which we didn't yet
> bring it up for pstate control. So somehow firmware decided to send
> very early during boot, which previously we would have ignored it
>
> Actually try this, with more prevention

I can give this a whirl.

> diff --git a/drivers/cpufreq/intel_pstate.c
> b/drivers/cpufreq/intel_pstate.c
> index b4ffe6c8a0d0..6ee88d7640ea 100644
> --- a/drivers/cpufreq/intel_pstate.c
> +++ b/drivers/cpufreq/intel_pstate.c
> @@ -1645,12 +1645,24 @@ void notify_hwp_interrupt(void)
> if (!hwp_active || !boot_cpu_has(X86_FEATURE_HWP_NOTIFY))
> return;
>
> - rdmsrl(MSR_HWP_STATUS, value);
> + rdmsrl_safe(MSR_HWP_STATUS, &value);
> if (!(value & 0x01))
> return;
>
> + /*
> + * After hwp_active is set and all_cpu_data is allocated, there
> + * is small window.
> + */
> + if (!all_cpu_data) {
> + wrmsrl_safe(MSR_HWP_STATUS, 0);
> + return;
> + }

What synchronizes the all_cpu_data setup and the interrupt? Can the
interrupt come in while it's still being setup?

--
Jens Axboe

2021-09-03 19:17:56

by srinivas pandruvada

[permalink] [raw]
Subject: Re: Bug: d0e936adbd22 crashes at boot

Hi Axobe,

On Fri, 2021-09-03 at 09:00 -0600, Jens Axboe wrote:
> On 9/3/21 8:38 AM, Srinivas Pandruvada wrote:
> > On Fri, 2021-09-03 at 08:15 -0600, Jens Axboe wrote:
> > > On 9/3/21 8:13 AM, Srinivas Pandruvada wrote:
> > > > Hi Axboe,
> > > >
> > > > Thanks for reporting.
> > > > On Fri, 2021-09-03 at 07:36 -0600, Jens Axboe wrote:
> > > > > Hi,
> > > > >
> > > > > Booting Linus's tree causes a crash on my laptop, an x1 gen9.
> > > > > This
> > > > > was
> > > > > a bit
> > > > > difficult to pin down as it crashes before the display is up,
> > > > > but I
> > > > > managed
> > > > > to narrow it down to:
> > > > >
> > > > > commit d0e936adbd2250cb03f2e840c6651d18edc22ace
> > > > > Author: Srinivas Pandruvada <
> > > > > [email protected]>
> > > > > Date:   Thu Aug 19 19:40:06 2021 -0700
> > > > >
> > > > >     cpufreq: intel_pstate: Process HWP Guaranteed change
> > > > > notification
> > > > >
> > > > > which crashes with a NULL pointer deref in
> > > > > notify_hwp_interrupt() -
> > > > > >
> > > > > queue_delayed_work_on().
> > > > >
> > > > > Reverting this change makes the laptop boot fine again.
> > > > >
> > > > Does this change fixes your issue?
> > >
> > > I would assume so, as it's crashing on cpudata == NULL :-)
> > >
> > > But why is it NULL? Happy to test patches, but the below doesn't
> > > look
> > > like
> > > a real fix and more of a work-around.
> >

Please try the attached.

Thanks,
Srinivas

> > This platform is sending an HWP interrupt on a CPU which we didn't
> > yet
> > bring it up for pstate control. So somehow firmware decided to send
> > very early during boot, which previously we would have ignored it
> >
> > Actually try this, with more prevention
>
> I can give this a whirl.
>
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index b4ffe6c8a0d0..6ee88d7640ea 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -1645,12 +1645,24 @@ void notify_hwp_interrupt(void)
> >         if (!hwp_active || !boot_cpu_has(X86_FEATURE_HWP_NOTIFY))
> >                 return;
> >  
> > -       rdmsrl(MSR_HWP_STATUS, value);
> > +       rdmsrl_safe(MSR_HWP_STATUS, &value);
> >         if (!(value & 0x01))
> >                 return;
> >  
> > +       /*
> > +        * After hwp_active is set and all_cpu_data is allocated,
> > there
> > +        * is small window.
> > +        */
> > +       if (!all_cpu_data) {
> > +               wrmsrl_safe(MSR_HWP_STATUS, 0);
> > +               return;
> > +       }
>
> What synchronizes the all_cpu_data setup and the interrupt? Can the
> interrupt come in while it's still being setup?
>


Attachments:
0001-cpufreq-intel_pstate-Fix-for-HWP-interrupt-before-dr.patch (2.98 kB)

2021-09-03 21:00:13

by Jens Axboe

[permalink] [raw]
Subject: Re: Bug: d0e936adbd22 crashes at boot

On 9/3/21 2:41 PM, Jens Axboe wrote:
> On 9/3/21 12:00 PM, Srinivas Pandruvada wrote:
>> Hi Axobe,
>>
>> On Fri, 2021-09-03 at 09:00 -0600, Jens Axboe wrote:
>>> On 9/3/21 8:38 AM, Srinivas Pandruvada wrote:
>>>> On Fri, 2021-09-03 at 08:15 -0600, Jens Axboe wrote:
>>>>> On 9/3/21 8:13 AM, Srinivas Pandruvada wrote:
>>>>>> Hi Axboe,
>>>>>>
>>>>>> Thanks for reporting.
>>>>>> On Fri, 2021-09-03 at 07:36 -0600, Jens Axboe wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Booting Linus's tree causes a crash on my laptop, an x1 gen9.
>>>>>>> This
>>>>>>> was
>>>>>>> a bit
>>>>>>> difficult to pin down as it crashes before the display is up,
>>>>>>> but I
>>>>>>> managed
>>>>>>> to narrow it down to:
>>>>>>>
>>>>>>> commit d0e936adbd2250cb03f2e840c6651d18edc22ace
>>>>>>> Author: Srinivas Pandruvada <
>>>>>>> [email protected]>
>>>>>>> Date: Thu Aug 19 19:40:06 2021 -0700
>>>>>>>
>>>>>>> cpufreq: intel_pstate: Process HWP Guaranteed change
>>>>>>> notification
>>>>>>>
>>>>>>> which crashes with a NULL pointer deref in
>>>>>>> notify_hwp_interrupt() -
>>>>>>>>
>>>>>>> queue_delayed_work_on().
>>>>>>>
>>>>>>> Reverting this change makes the laptop boot fine again.
>>>>>>>
>>>>>> Does this change fixes your issue?
>>>>>
>>>>> I would assume so, as it's crashing on cpudata == NULL :-)
>>>>>
>>>>> But why is it NULL? Happy to test patches, but the below doesn't
>>>>> look
>>>>> like
>>>>> a real fix and more of a work-around.
>>>>
>>
>> Please try the attached.
>
> I'll give it a test spin right now. Please do add a Reported-by tag,
> though. That's always prudent.

And you can add Tested-by as well, it works for me.

--
Jens Axboe

2021-09-03 21:09:46

by Jens Axboe

[permalink] [raw]
Subject: Re: Bug: d0e936adbd22 crashes at boot

On 9/3/21 12:00 PM, Srinivas Pandruvada wrote:
> Hi Axobe,
>
> On Fri, 2021-09-03 at 09:00 -0600, Jens Axboe wrote:
>> On 9/3/21 8:38 AM, Srinivas Pandruvada wrote:
>>> On Fri, 2021-09-03 at 08:15 -0600, Jens Axboe wrote:
>>>> On 9/3/21 8:13 AM, Srinivas Pandruvada wrote:
>>>>> Hi Axboe,
>>>>>
>>>>> Thanks for reporting.
>>>>> On Fri, 2021-09-03 at 07:36 -0600, Jens Axboe wrote:
>>>>>> Hi,
>>>>>>
>>>>>> Booting Linus's tree causes a crash on my laptop, an x1 gen9.
>>>>>> This
>>>>>> was
>>>>>> a bit
>>>>>> difficult to pin down as it crashes before the display is up,
>>>>>> but I
>>>>>> managed
>>>>>> to narrow it down to:
>>>>>>
>>>>>> commit d0e936adbd2250cb03f2e840c6651d18edc22ace
>>>>>> Author: Srinivas Pandruvada <
>>>>>> [email protected]>
>>>>>> Date:   Thu Aug 19 19:40:06 2021 -0700
>>>>>>
>>>>>>     cpufreq: intel_pstate: Process HWP Guaranteed change
>>>>>> notification
>>>>>>
>>>>>> which crashes with a NULL pointer deref in
>>>>>> notify_hwp_interrupt() -
>>>>>>>
>>>>>> queue_delayed_work_on().
>>>>>>
>>>>>> Reverting this change makes the laptop boot fine again.
>>>>>>
>>>>> Does this change fixes your issue?
>>>>
>>>> I would assume so, as it's crashing on cpudata == NULL :-)
>>>>
>>>> But why is it NULL? Happy to test patches, but the below doesn't
>>>> look
>>>> like
>>>> a real fix and more of a work-around.
>>>
>
> Please try the attached.

I'll give it a test spin right now. Please do add a Reported-by tag,
though. That's always prudent.

--
Jens Axboe

2021-09-03 22:41:25

by srinivas pandruvada

[permalink] [raw]
Subject: Re: Bug: d0e936adbd22 crashes at boot

On Fri, 2021-09-03 at 14:57 -0600, Jens Axboe wrote:
> On 9/3/21 2:41 PM, Jens Axboe wrote:
> > On 9/3/21 12:00 PM, Srinivas Pandruvada wrote:
> > > Hi Axobe,
> > >
> > > On Fri, 2021-09-03 at 09:00 -0600, Jens Axboe wrote:
> > > > On 9/3/21 8:38 AM, Srinivas Pandruvada wrote:
> > > > > On Fri, 2021-09-03 at 08:15 -0600, Jens Axboe wrote:
> > > > > > On 9/3/21 8:13 AM, Srinivas Pandruvada wrote:
> > > > > > > Hi Axboe,
> > > > > > >
> > > > > > > Thanks for reporting.
> > > > > > > On Fri, 2021-09-03 at 07:36 -0600, Jens Axboe wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > Booting Linus's tree causes a crash on my laptop, an x1
> > > > > > > > gen9.
> > > > > > > > This
> > > > > > > > was
> > > > > > > > a bit
> > > > > > > > difficult to pin down as it crashes before the display
> > > > > > > > is up,
> > > > > > > > but I
> > > > > > > > managed
> > > > > > > > to narrow it down to:
> > > > > > > >
> > > > > > > > commit d0e936adbd2250cb03f2e840c6651d18edc22ace
> > > > > > > > Author: Srinivas Pandruvada <
> > > > > > > > [email protected]>
> > > > > > > > Date:   Thu Aug 19 19:40:06 2021 -0700
> > > > > > > >
> > > > > > > >     cpufreq: intel_pstate: Process HWP Guaranteed
> > > > > > > > change
> > > > > > > > notification
> > > > > > > >
> > > > > > > > which crashes with a NULL pointer deref in
> > > > > > > > notify_hwp_interrupt() -
> > > > > > > > >
> > > > > > > > queue_delayed_work_on().
> > > > > > > >
> > > > > > > > Reverting this change makes the laptop boot fine again.
> > > > > > > >
> > > > > > > Does this change fixes your issue?
> > > > > >
> > > > > > I would assume so, as it's crashing on cpudata == NULL :-)
> > > > > >
> > > > > > But why is it NULL? Happy to test patches, but the below
> > > > > > doesn't
> > > > > > look
> > > > > > like
> > > > > > a real fix and more of a work-around.
> > > > >
> > >
> > > Please try the attached.
> >
> > I'll give it a test spin right now. Please do add a Reported-by
> > tag,
> > though. That's always prudent.
>
> And you can add Tested-by as well, it works for me.

Thanks. I will add both.

-Srinivas
>