2012-08-15 06:40:24

by Hans de Goede

[permalink] [raw]
Subject: Re: [REGRESION] Suspend hangs with 3.6-rc1 on Lenovo T60 notebook

Hi,

On 08/15/2012 07:13 AM, Miklos Szeredi wrote:
> Suspend oopses in generic_ide_suspend() because dev_get_drvdata()
> returns NULL (dev->p->driver_data == NULL) and this function is not
> prepared for this.
>
> I bisected it to 0998d063 (device-core: Ensure drvdata = NULL when no
> driver is bound). Reverting it fixes suspend.
>

First of all, thanks for reporting and bisecting this. With that said,
I must say that this is very weird. The patch in question:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=0998d063

Only makes dev-drvdata NULL in 2 cases:
1) The probe method of the driver fails
2) The driver has been detached from the device by calling one of:
device_release_driver() or driver_detach()

Note that in both code paths dev->driver also gets set to NULL, and
other generic ide driver callbacks very much depend on that not being
NULL, ie:

static int generic_ide_remove(struct device *dev)
{
ide_drive_t *drive = to_ide_device(dev);
struct ide_driver *drv = to_ide_driver(dev->driver);

if (drv->remove)
drv->remove(drive);

return 0;
}

Also how can a drivers suspend callback get called if dev->driver is NULL,
since that callback would normally be "reached" through dev->driver, so
something weird is going on here ...

I hope one of the ide guys can shed some light on this.

Regards,

Hans


2012-08-15 19:53:59

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [REGRESION] Suspend hangs with 3.6-rc1 on Lenovo T60 notebook

On Wednesday, August 15, 2012, Hans de Goede wrote:
> Hi,
>
> On 08/15/2012 07:13 AM, Miklos Szeredi wrote:
> > Suspend oopses in generic_ide_suspend() because dev_get_drvdata()
> > returns NULL (dev->p->driver_data == NULL) and this function is not
> > prepared for this.
> >
> > I bisected it to 0998d063 (device-core: Ensure drvdata = NULL when no
> > driver is bound). Reverting it fixes suspend.
> >
>
> First of all, thanks for reporting and bisecting this. With that said,
> I must say that this is very weird. The patch in question:
> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=0998d063
>
> Only makes dev-drvdata NULL in 2 cases:
> 1) The probe method of the driver fails
> 2) The driver has been detached from the device by calling one of:
> device_release_driver() or driver_detach()
>
> Note that in both code paths dev->driver also gets set to NULL, and
> other generic ide driver callbacks very much depend on that not being
> NULL, ie:
>
> static int generic_ide_remove(struct device *dev)
> {
> ide_drive_t *drive = to_ide_device(dev);
> struct ide_driver *drv = to_ide_driver(dev->driver);
>
> if (drv->remove)
> drv->remove(drive);
>
> return 0;
> }
>
> Also how can a drivers suspend callback get called if dev->driver is NULL,
> since that callback would normally be "reached" through dev->driver, so
> something weird is going on here ...

No, it wouldn't, because it is a bus type callback and it is invoked for
all devices whose bus type is ide_bus_type, regardless of whether or not
their driver field is NULL.

It clearly should check if drive is not NULL before using that pointer.

Thanks,
Rafael

2012-08-16 11:33:56

by Hans de Goede

[permalink] [raw]
Subject: Re: [REGRESION] Suspend hangs with 3.6-rc1 on Lenovo T60 notebook

Hi,

On 08/15/2012 09:59 PM, Rafael J. Wysocki wrote:
> On Wednesday, August 15, 2012, Hans de Goede wrote:
>> Hi,
>>
>> On 08/15/2012 07:13 AM, Miklos Szeredi wrote:
>>> Suspend oopses in generic_ide_suspend() because dev_get_drvdata()
>>> returns NULL (dev->p->driver_data == NULL) and this function is not
>>> prepared for this.
>>>
>>> I bisected it to 0998d063 (device-core: Ensure drvdata = NULL when no
>>> driver is bound). Reverting it fixes suspend.
>>>
>>
>> First of all, thanks for reporting and bisecting this. With that said,
>> I must say that this is very weird. The patch in question:
>> http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=0998d063
>>
>> Only makes dev-drvdata NULL in 2 cases:
>> 1) The probe method of the driver fails
>> 2) The driver has been detached from the device by calling one of:
>> device_release_driver() or driver_detach()
>>
>> Note that in both code paths dev->driver also gets set to NULL, and
>> other generic ide driver callbacks very much depend on that not being
>> NULL, ie:
>>
>> static int generic_ide_remove(struct device *dev)
>> {
>> ide_drive_t *drive = to_ide_device(dev);
>> struct ide_driver *drv = to_ide_driver(dev->driver);
>>
>> if (drv->remove)
>> drv->remove(drive);
>>
>> return 0;
>> }
>>
>> Also how can a drivers suspend callback get called if dev->driver is NULL,
>> since that callback would normally be "reached" through dev->driver, so
>> something weird is going on here ...
>
> No, it wouldn't, because it is a bus type callback and it is invoked for
> all devices whose bus type is ide_bus_type, regardless of whether or not
> their driver field is NULL.

Ah right, these are bus_driver operations. That explains some things, so I've
done some more research asking myself: "Why does generic_ide_suspend(), which
is a *bus* op, call dev_get_drvdata?", the answer to that seems to be that
the ide subsystem is abusing (IMHO) drvdata to store per device bus_driver
data. Which I believe is not how drvdata is intended to be used.

With that said, the above knowledge has allowed me to write an (ugly) fix for
the regression Miklos is seeing. Miklos can you give the attached patch a
try please?

> It clearly should check if drive is not NULL before using that pointer.

I assume you mean drive*r*, yes I agree that generic_ide_remove should
check for that. So who is going to write a patch for that?

Regards,

Hans


Attachments:
0001-ide-Restore-drvdata-after-device_attach-failure.patch (1.49 kB)

2012-08-16 15:13:36

by Alan Stern

[permalink] [raw]
Subject: Re: [REGRESION] Suspend hangs with 3.6-rc1 on Lenovo T60 notebook

On Thu, 16 Aug 2012, Hans de Goede wrote:

> Ah right, these are bus_driver operations. That explains some things, so I've
> done some more research asking myself: "Why does generic_ide_suspend(), which
> is a *bus* op, call dev_get_drvdata?", the answer to that seems to be that
> the ide subsystem is abusing (IMHO) drvdata to store per device bus_driver
> data. Which I believe is not how drvdata is intended to be used.
>
> With that said, the above knowledge has allowed me to write an (ugly) fix for
> the regression Miklos is seeing. Miklos can you give the attached patch a
> try please?
>
> > It clearly should check if drive is not NULL before using that pointer.
>
> I assume you mean drive*r*, yes I agree that generic_ide_remove should
> check for that. So who is going to write a patch for that?

The existing code could certainly be improved. Your patch does:

> + /*
> + * device_register() will have cleared drvdata on
> + * device_attach failure, but we use drvdata to store per
> + * device bus info, rather then for driver info, so restore it.
> + */
> + dev_set_drvdata(dev, drive);

But at this point, dev is defined by:

struct device *dev = &drive->gendev;

So why bother setting anything? It seems to me that
generic_ide_suspend() and generic_ide_resume() could easily replace

ide_drive_t *drive = dev_get_drvdata(dev);

with

ide_drive_t *drive = dev_to_ide_drive(dev);

where dev_to_ide_drive is defined as "container_of(dev, ide_drive_t,
gendev)" (if this isn't defined already).

Alan Stern

2012-08-16 16:28:47

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [REGRESION] Suspend hangs with 3.6-rc1 on Lenovo T60 notebook

Alan Stern <[email protected]> writes:

> On Thu, 16 Aug 2012, Hans de Goede wrote:
>
>> Ah right, these are bus_driver operations. That explains some things, so I've
>> done some more research asking myself: "Why does generic_ide_suspend(), which
>> is a *bus* op, call dev_get_drvdata?", the answer to that seems to be that
>> the ide subsystem is abusing (IMHO) drvdata to store per device bus_driver
>> data. Which I believe is not how drvdata is intended to be used.
>>
>> With that said, the above knowledge has allowed me to write an (ugly) fix for
>> the regression Miklos is seeing. Miklos can you give the attached patch a
>> try please?
>>
>> > It clearly should check if drive is not NULL before using that pointer.
>>
>> I assume you mean drive*r*, yes I agree that generic_ide_remove should
>> check for that. So who is going to write a patch for that?
>
> The existing code could certainly be improved. Your patch does:
>
>> + /*
>> + * device_register() will have cleared drvdata on
>> + * device_attach failure, but we use drvdata to store per
>> + * device bus info, rather then for driver info, so restore it.
>> + */
>> + dev_set_drvdata(dev, drive);
>
> But at this point, dev is defined by:
>
> struct device *dev = &drive->gendev;
>
> So why bother setting anything? It seems to me that
> generic_ide_suspend() and generic_ide_resume() could easily replace
>
> ide_drive_t *drive = dev_get_drvdata(dev);
>
> with
>
> ide_drive_t *drive = dev_to_ide_drive(dev);
>
> where dev_to_ide_drive is defined as "container_of(dev, ide_drive_t,
> gendev)" (if this isn't defined already).

Yes, this appears to work. Following patch fixes the suspend oops.

Thanks,
Miklos

---
drivers/ide/ide-pm.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
index 9240609..8d1e32d 100644
--- a/drivers/ide/ide-pm.c
+++ b/drivers/ide/ide-pm.c
@@ -4,7 +4,7 @@

int generic_ide_suspend(struct device *dev, pm_message_t mesg)
{
- ide_drive_t *drive = dev_get_drvdata(dev);
+ ide_drive_t *drive = to_ide_device(dev);
ide_drive_t *pair = ide_get_pair_dev(drive);
ide_hwif_t *hwif = drive->hwif;
struct request *rq;
@@ -40,7 +40,7 @@ int generic_ide_suspend(struct device *dev, pm_message_t mesg)

int generic_ide_resume(struct device *dev)
{
- ide_drive_t *drive = dev_get_drvdata(dev);
+ ide_drive_t *drive = to_ide_device(dev);
ide_drive_t *pair = ide_get_pair_dev(drive);
ide_hwif_t *hwif = drive->hwif;
struct request *rq;
--
1.7.7

2012-08-16 16:32:54

by Alan Stern

[permalink] [raw]
Subject: Re: [REGRESION] Suspend hangs with 3.6-rc1 on Lenovo T60 notebook

On Thu, 16 Aug 2012, Miklos Szeredi wrote:

> Yes, this appears to work. Following patch fixes the suspend oops.
>
> Thanks,
> Miklos
>
> ---
> drivers/ide/ide-pm.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
> index 9240609..8d1e32d 100644
> --- a/drivers/ide/ide-pm.c
> +++ b/drivers/ide/ide-pm.c
> @@ -4,7 +4,7 @@
>
> int generic_ide_suspend(struct device *dev, pm_message_t mesg)
> {
> - ide_drive_t *drive = dev_get_drvdata(dev);
> + ide_drive_t *drive = to_ide_device(dev);
> ide_drive_t *pair = ide_get_pair_dev(drive);
> ide_hwif_t *hwif = drive->hwif;
> struct request *rq;
> @@ -40,7 +40,7 @@ int generic_ide_suspend(struct device *dev, pm_message_t mesg)
>
> int generic_ide_resume(struct device *dev)
> {
> - ide_drive_t *drive = dev_get_drvdata(dev);
> + ide_drive_t *drive = to_ide_device(dev);
> ide_drive_t *pair = ide_get_pair_dev(drive);
> ide_hwif_t *hwif = drive->hwif;
> struct request *rq;

And now you can get rid of the useless dev_set_drvdata() call.

Alan Stern

2012-08-16 19:56:46

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [REGRESION] Suspend hangs with 3.6-rc1 on Lenovo T60 notebook

On Thursday, August 16, 2012, Alan Stern wrote:
> On Thu, 16 Aug 2012, Miklos Szeredi wrote:
>
> > Yes, this appears to work. Following patch fixes the suspend oops.
> >
> > Thanks,
> > Miklos

OK

Miklos, can you please send that to Dave with a proper changelog and
sign-off (please add my ACK too)? Please make it clear that this is a
regression fix and which commit has introduced the regression.

> > ---
> > drivers/ide/ide-pm.c | 4 ++--
> > 1 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/ide/ide-pm.c b/drivers/ide/ide-pm.c
> > index 9240609..8d1e32d 100644
> > --- a/drivers/ide/ide-pm.c
> > +++ b/drivers/ide/ide-pm.c
> > @@ -4,7 +4,7 @@
> >
> > int generic_ide_suspend(struct device *dev, pm_message_t mesg)
> > {
> > - ide_drive_t *drive = dev_get_drvdata(dev);
> > + ide_drive_t *drive = to_ide_device(dev);
> > ide_drive_t *pair = ide_get_pair_dev(drive);
> > ide_hwif_t *hwif = drive->hwif;
> > struct request *rq;
> > @@ -40,7 +40,7 @@ int generic_ide_suspend(struct device *dev, pm_message_t mesg)
> >
> > int generic_ide_resume(struct device *dev)
> > {
> > - ide_drive_t *drive = dev_get_drvdata(dev);
> > + ide_drive_t *drive = to_ide_device(dev);
> > ide_drive_t *pair = ide_get_pair_dev(drive);
> > ide_hwif_t *hwif = drive->hwif;
> > struct request *rq;
>
> And now you can get rid of the useless dev_set_drvdata() call.

That was in the other patch I think.

Thanks,
Rafael

2012-08-17 09:22:56

by Hans de Goede

[permalink] [raw]
Subject: Re: [REGRESION] Suspend hangs with 3.6-rc1 on Lenovo T60 notebook

Hi all,

On 08/16/2012 10:02 PM, Rafael J. Wysocki wrote:
> On Thursday, August 16, 2012, Alan Stern wrote:
>> On Thu, 16 Aug 2012, Miklos Szeredi wrote:
>>
>>> Yes, this appears to work. Following patch fixes the suspend oops.
>>>
>>> Thanks,
>>> Miklos
>
> OK
>
> Miklos, can you please send that to Dave with a proper changelog and
> sign-off (please add my ACK too)? Please make it clear that this is a
> regression fix and which commit has introduced the regression.

I think that Alan's suggest fix, as implemented by Milos is great!, but
before moving forward with this someone should audit all the other
(generic) ide code for other places using the drvdata, a simple grep
for dev_get_drvdata should show these.

<snip>

>> And now you can get rid of the useless dev_set_drvdata() call.
>
> That was in the other patch I think.

No my patch was a hack to undo the results of the commit causing
the regression in the IDE case. But Alan's approach clearly is
much better! Once we are sure drvdata is not used anywhere the
dev_set_drvdata call could be removed in the place where my
hack added a second call to it. Note that there are likely
actual ide drivers using it, without setting it themselves since
the ide core was setting it. So removing it will require even more
auditing / checking.

A 3th thing which should be audited is the generic ide code assuming
that dev->driver != NULL, this is at least true for
generic_ide_remove, where:

if (drv->remove)

Should be changed to:

if (dev->driver && drv->remove)

Following the way things are done in generic_ide_shutdown, a similar
change could be needed for generic_ide_probe(), although I guess
that may never get called with dev->driver == NULL ?

Unfortunately I'm very busy with other stuff atm, so I cannot help
here other then pointing out that such audits should be done :|

Regards,

Hans

2012-08-17 14:22:21

by Alan Stern

[permalink] [raw]
Subject: Re: [REGRESION] Suspend hangs with 3.6-rc1 on Lenovo T60 notebook

On Fri, 17 Aug 2012, Hans de Goede wrote:

> No my patch was a hack to undo the results of the commit causing
> the regression in the IDE case. But Alan's approach clearly is
> much better! Once we are sure drvdata is not used anywhere the
> dev_set_drvdata call could be removed in the place where my
> hack added a second call to it. Note that there are likely
> actual ide drivers using it, without setting it themselves since
> the ide core was setting it. So removing it will require even more
> auditing / checking.

I did search for dev_get_drvdata() calls in drivers/ide; there were no
other calls that retrieved an ide_drive_t value. But I didn't check
anywhere else in the kernel. In fact, I'm not sure where else to look.

Alan Stern

2012-08-17 14:27:29

by Alan Stern

[permalink] [raw]
Subject: Re: [REGRESION] Suspend hangs with 3.6-rc1 on Lenovo T60 notebook

On Fri, 17 Aug 2012, Alan Stern wrote:

> On Fri, 17 Aug 2012, Hans de Goede wrote:
>
> > No my patch was a hack to undo the results of the commit causing
> > the regression in the IDE case. But Alan's approach clearly is
> > much better! Once we are sure drvdata is not used anywhere the
> > dev_set_drvdata call could be removed in the place where my
> > hack added a second call to it. Note that there are likely
> > actual ide drivers using it, without setting it themselves since
> > the ide core was setting it. So removing it will require even more
> > auditing / checking.
>
> I did search for dev_get_drvdata() calls in drivers/ide; there were no
> other calls that retrieved an ide_drive_t value. But I didn't check
> anywhere else in the kernel. In fact, I'm not sure where else to look.

After a little more checking: Among all the .c files in the kernel
which match the pattern '[^a-z]ide_', the only occurrences of
dev_get_drvdata() which retrieve an ide_drive_t value are the two in
ide-pm.c.

Alan Stern

2012-08-17 15:31:20

by Hans de Goede

[permalink] [raw]
Subject: Re: [REGRESION] Suspend hangs with 3.6-rc1 on Lenovo T60 notebook

Hi,

On 08/17/2012 04:27 PM, Alan Stern wrote:
> On Fri, 17 Aug 2012, Alan Stern wrote:
>
>> On Fri, 17 Aug 2012, Hans de Goede wrote:
>>
>>> No my patch was a hack to undo the results of the commit causing
>>> the regression in the IDE case. But Alan's approach clearly is
>>> much better! Once we are sure drvdata is not used anywhere the
>>> dev_set_drvdata call could be removed in the place where my
>>> hack added a second call to it. Note that there are likely
>>> actual ide drivers using it, without setting it themselves since
>>> the ide core was setting it. So removing it will require even more
>>> auditing / checking.
>>
>> I did search for dev_get_drvdata() calls in drivers/ide; there were no
>> other calls that retrieved an ide_drive_t value. But I didn't check
>> anywhere else in the kernel. In fact, I'm not sure where else to look.
>
> After a little more checking: Among all the .c files in the kernel
> which match the pattern '[^a-z]ide_', the only occurrences of
> dev_get_drvdata() which retrieve an ide_drive_t value are the two in
> ide-pm.c.

Thanks, good work! That means we should move forward with submitting
Miklos' patch to the ide subsystem maintainer asap.

Also if someone feels like it we still could do the 2 follow up / clean
up patches:

1) Removin the set_drvdata call from hwif_register_devices()
in ide-prove.c
2) Fixup ide-generic.c ide_remove to not dereference a NULL dev->driver

Regards,

Hans