2021-05-03 18:08:16

by Andy Shevchenko

Subject: Re: Null pointer dereference in mcp251x driver when resuming from sleep

On Mon, May 03, 2021 at 03:11:40PM +0200, Frieder Schrempf wrote:
> Hi,
>
> with kernel 5.10.x and 5.12.x I'm getting a null pointer dereference
> exception from the mcp251x driver when I resume from sleep (see trace
> below).
>
> As far as I can tell this was working fine with 5.4. As I currently don't
> have the time to do further debugging/bisecting, for now I want to at least
> report this here.
>
> Maybe there is someone around who could already give a wild guess for what
> might cause this just by looking at the trace/code!?

Does revert of c7299fea6769 ("spi: Fix spi device unregister flow") help?

--
With Best Regards,
Andy Shevchenko



2021-05-03 18:08:31

by Frieder Schrempf

Subject: Re: Null pointer dereference in mcp251x driver when resuming from sleep

On 03.05.21 15:44, Andy Shevchenko wrote:
> On Mon, May 03, 2021 at 03:11:40PM +0200, Frieder Schrempf wrote:
>> Hi,
>>
>> with kernel 5.10.x and 5.12.x I'm getting a null pointer dereference
>> exception from the mcp251x driver when I resume from sleep (see trace
>> below).
>>
>> As far as I can tell this was working fine with 5.4. As I currently don't
>> have the time to do further debugging/bisecting, for now I want to at least
>> report this here.
>>
>> Maybe there is someone around who could already give a wild guess for what
>> might cause this just by looking at the trace/code!?
>
> Does revert of c7299fea6769 ("spi: Fix spi device unregister flow") help?

This commit is so new that it is in neither 5.10.x nor 5.12.1, so it
can't be the cause.

2021-05-03 18:08:38

by Andy Shevchenko

Subject: Re: Null pointer dereference in mcp251x driver when resuming from sleep

On Mon, May 03, 2021 at 04:44:24PM +0300, Andy Shevchenko wrote:
> On Mon, May 03, 2021 at 03:11:40PM +0200, Frieder Schrempf wrote:
> > Hi,
> >
> > with kernel 5.10.x and 5.12.x I'm getting a null pointer dereference
> > exception from the mcp251x driver when I resume from sleep (see trace
> > below).
> >
> > As far as I can tell this was working fine with 5.4. As I currently don't
> > have the time to do further debugging/bisecting, for now I want to at least
> > report this here.
> >
> > Maybe there is someone around who could already give a wild guess for what
> > might cause this just by looking at the trace/code!?
>
> Does revert of c7299fea6769 ("spi: Fix spi device unregister flow") help?

Other than that, bisecting should take no more than 3-4 iterations:
% git log --oneline v5.4..v5.10.34 -- drivers/net/can/spi/mcp251x.c
3292c4fc9ce2 can: mcp251x: fix support for half duplex SPI host controllers
e0e25001d088 can: mcp251x: add support for half duplex controllers
74fa565b63dc can: mcp251x: Use readx_poll_timeout() helper
2d52dabbef60 can: mcp251x: add GPIO support
cfc24a0aa7a1 can: mcp251x: sort include files alphabetically
df561f6688fe treewide: Use fallthrough pseudo-keyword
8ce8c0abcba3 can: mcp251x: only reset hardware as required
877a902103fd can: mcp251x: add mcp251x_write_2regs() and make use of it
50ec88120ea1 can: mcp251x: get rid of legacy platform data
14684b93019a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

--
With Best Regards,
Andy Shevchenko


2021-05-03 18:08:39

by Andy Shevchenko

Subject: Re: Null pointer dereference in mcp251x driver when resuming from sleep

On Mon, May 03, 2021 at 04:48:10PM +0300, Andy Shevchenko wrote:
> On Mon, May 03, 2021 at 04:44:24PM +0300, Andy Shevchenko wrote:
> > On Mon, May 03, 2021 at 03:11:40PM +0200, Frieder Schrempf wrote:
> > > Hi,
> > >
> > > with kernel 5.10.x and 5.12.x I'm getting a null pointer dereference
> > > exception from the mcp251x driver when I resume from sleep (see trace
> > > below).
> > >
> > > As far as I can tell this was working fine with 5.4. As I currently don't
> > > have the time to do further debugging/bisecting, for now I want to at least
> > > report this here.
> > >
> > > Maybe there is someone around who could already give a wild guess for what
> > > might cause this just by looking at the trace/code!?
> >
> > Does revert of c7299fea6769 ("spi: Fix spi device unregister flow") help?
>
> Other than that, bisecting should take no more than 3-4 iterations:
> % git log --oneline v5.4..v5.10.34 -- drivers/net/can/spi/mcp251x.c
> 3292c4fc9ce2 can: mcp251x: fix support for half duplex SPI host controllers
> e0e25001d088 can: mcp251x: add support for half duplex controllers
> 74fa565b63dc can: mcp251x: Use readx_poll_timeout() helper
> 2d52dabbef60 can: mcp251x: add GPIO support
> cfc24a0aa7a1 can: mcp251x: sort include files alphabetically
> df561f6688fe treewide: Use fallthrough pseudo-keyword

> 8ce8c0abcba3 can: mcp251x: only reset hardware as required

From analyzing the code, the only smoking gun is the commit above. So, as a
first step, I would simply test the tree just before that commit and
immediately after it (15-30 minutes of work). (I would do it myself if I had
the hardware at hand...)

> 877a902103fd can: mcp251x: add mcp251x_write_2regs() and make use of it
> 50ec88120ea1 can: mcp251x: get rid of legacy platform data
> 14684b93019a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

--
With Best Regards,
Andy Shevchenko


2021-05-04 13:58:21

by Frieder Schrempf

Subject: Re: Null pointer dereference in mcp251x driver when resuming from sleep

On 03.05.21 15:54, Andy Shevchenko wrote:
> On Mon, May 03, 2021 at 04:48:10PM +0300, Andy Shevchenko wrote:
>> On Mon, May 03, 2021 at 04:44:24PM +0300, Andy Shevchenko wrote:
>>> On Mon, May 03, 2021 at 03:11:40PM +0200, Frieder Schrempf wrote:
>>>> Hi,
>>>>
>>>> with kernel 5.10.x and 5.12.x I'm getting a null pointer dereference
>>>> exception from the mcp251x driver when I resume from sleep (see trace
>>>> below).
>>>>
>>>> As far as I can tell this was working fine with 5.4. As I currently don't
>>>> have the time to do further debugging/bisecting, for now I want to at least
>>>> report this here.
>>>>
>>>> Maybe there is someone around who could already give a wild guess for what
>>>> might cause this just by looking at the trace/code!?
>>>
>>> Does revert of c7299fea6769 ("spi: Fix spi device unregister flow") help?
>>
>> Other than that, bisecting should take no more than 3-4 iterations:
>> % git log --oneline v5.4..v5.10.34 -- drivers/net/can/spi/mcp251x.c
>> 3292c4fc9ce2 can: mcp251x: fix support for half duplex SPI host controllers
>> e0e25001d088 can: mcp251x: add support for half duplex controllers
>> 74fa565b63dc can: mcp251x: Use readx_poll_timeout() helper
>> 2d52dabbef60 can: mcp251x: add GPIO support
>> cfc24a0aa7a1 can: mcp251x: sort include files alphabetically
>> df561f6688fe treewide: Use fallthrough pseudo-keyword
>
>> 8ce8c0abcba3 can: mcp251x: only reset hardware as required
>
> From analyzing the code, the only smoking gun is the commit above. So, as a
> first step, I would simply test the tree just before that commit and
> immediately after it (15-30 minutes of work). (I would do it myself if I had
> the hardware at hand...)

Thanks for pointing that out. Indeed when I revert this commit it works
fine again.

When I look at the change I see that queue_work(priv->wq,
&priv->restart_work) is called in two cases, when the interface is
brought up after resume and now also when the device is only powered up
after resume but the interface stays down.

The latter is a problem if the device was never brought up before, as
the workqueue is only allocated and initialized in mcp251x_open().

To me it looks like a proper fix would be to just move the workqueue
init to the probe function to make sure it is available when resuming
even if the interface was never up before.

I will try this and send a patch if it looks good.

2021-05-04 14:20:58

by Andy Shevchenko

Subject: Re: Null pointer dereference in mcp251x driver when resuming from sleep

On Tue, May 04, 2021 at 03:54:00PM +0200, Frieder Schrempf wrote:
> On 03.05.21 15:54, Andy Shevchenko wrote:
> > On Mon, May 03, 2021 at 04:48:10PM +0300, Andy Shevchenko wrote:
> > > On Mon, May 03, 2021 at 04:44:24PM +0300, Andy Shevchenko wrote:
> > > > On Mon, May 03, 2021 at 03:11:40PM +0200, Frieder Schrempf wrote:
> > > > > Hi,
> > > > >
> > > > > with kernel 5.10.x and 5.12.x I'm getting a null pointer dereference
> > > > > exception from the mcp251x driver when I resume from sleep (see trace
> > > > > below).
> > > > >
> > > > > As far as I can tell this was working fine with 5.4. As I currently don't
> > > > > have the time to do further debugging/bisecting, for now I want to at least
> > > > > report this here.
> > > > >
> > > > > Maybe there is someone around who could already give a wild guess for what
> > > > > might cause this just by looking at the trace/code!?
> > > >
> > > > Does revert of c7299fea6769 ("spi: Fix spi device unregister flow") help?
> > >
> > > Other than that, bisecting should take no more than 3-4 iterations:
> > > % git log --oneline v5.4..v5.10.34 -- drivers/net/can/spi/mcp251x.c
> > > 3292c4fc9ce2 can: mcp251x: fix support for half duplex SPI host controllers
> > > e0e25001d088 can: mcp251x: add support for half duplex controllers
> > > 74fa565b63dc can: mcp251x: Use readx_poll_timeout() helper
> > > 2d52dabbef60 can: mcp251x: add GPIO support
> > > cfc24a0aa7a1 can: mcp251x: sort include files alphabetically
> > > df561f6688fe treewide: Use fallthrough pseudo-keyword
> >
> > > 8ce8c0abcba3 can: mcp251x: only reset hardware as required
> >
> > From analyzing the code, the only smoking gun is the commit above. So, as a
> > first step, I would simply test the tree just before that commit and
> > immediately after it (15-30 minutes of work). (I would do it myself if I had
> > the hardware at hand...)
>
> Thanks for pointing that out. Indeed when I revert this commit it works fine
> again.
>
> When I look at the change I see that queue_work(priv->wq,
> &priv->restart_work) is called in two cases, when the interface is brought
> up after resume and now also when the device is only powered up after resume
> but the interface stays down.
>
> The latter is a problem if the device was never brought up before, as the
> workqueue is only allocated and initialized in mcp251x_open().
>
> To me it looks like a proper fix would be to just move the workqueue init to
> the probe function to make sure it is available when resuming even if the
> interface was never up before.
>
> I will try this and send a patch if it looks good.

Sounds like a plan!

--
With Best Regards,
Andy Shevchenko