2021-01-11 04:47:35

by Hugh Dickins

[permalink] [raw]
Subject: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

Hi Rafael,

Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
fails to suspend when running 5.11-rc kernels: bisected to
5b6164d3465f ("driver core: Reorder devices on successful probe"),
and reverting that fixes it. dmesg.xz attached, but go ahead and ask
me to switch on a debug option to extract further info if that may help.

Thanks,
Hugh


Attachments:
dmesg.xz (14.12 kB)

2021-01-11 13:46:27

by Thierry Reding

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Sun, Jan 10, 2021 at 08:44:13PM -0800, Hugh Dickins wrote:
> Hi Rafael,
>
> Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> fails to suspend when running 5.11-rc kernels: bisected to
> 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> me to switch on a debug option to extract further info if that may help.

Hi Hugh,

Quoting what I think are the relevant parts of that log:

[ 34.373742] printk: Suspending console(s) (use no_console_suspend to debug)
[ 34.429015] rmi4_physical rmi4-00: Failed to read irqs, code=-6
[ 34.474973] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
[ 34.474994] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
[ 34.475001] rmi4_physical rmi4-00: Failed to suspend functions: -6
[ 34.475105] rmi4_smbus 6-002c: Failed to suspend device: -6
[ 34.475113] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
[ 34.475130] PM: Device 6-002c failed to suspend: error -6
[ 34.475187] PM: Some devices failed to suspend, or early wake event detected
[ 34.480324] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
[ 34.480748] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
[ 34.481558] rmi4_physical rmi4-00: rmi_driver_clear_irq_bits: Failed to change enabled interrupts!
[ 34.487935] acpi LNXPOWER:02: Turning OFF
[ 34.488707] acpi LNXPOWER:01: Turning OFF
[ 34.489554] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
[ 34.489669] psmouse: probe of serio2 failed with error -1
[ 34.489882] OOM killer enabled.
[ 34.489891] Restarting tasks ... done.
[ 34.589183] PM: suspend exit
[ 34.589839] PM: suspend entry (s2idle)
[ 34.605884] Filesystems sync: 0.017 seconds
[ 34.607594] Freezing user space processes ... (elapsed 0.006 seconds) done.
[ 34.613645] OOM killer disabled.
[ 34.613650] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 34.615482] printk: Suspending console(s) (use no_console_suspend to debug)
[ 34.653097] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
[ 34.653108] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
[ 34.653115] rmi4_physical rmi4-00: Failed to suspend functions: -6
[ 34.653123] rmi4_smbus 6-002c: Failed to suspend device: -6
[ 34.653129] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
[ 34.653160] PM: Device 6-002c failed to suspend: error -6
[ 34.653174] PM: Some devices failed to suspend, or early wake event detected
[ 34.660515] OOM killer enabled.
[ 34.660524] Restarting tasks ...
[ 34.661456] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
[ 34.661591] psmouse: probe of serio2 failed with error -1
[ 34.669469] done.
[ 34.748386] PM: suspend exit

I think what might be happening here is that the offending patch causes
some devices to be reordered in a way different to how they were ordered
originally and the rmi4 driver currently depends on that implicit order.

Interestingly one of the bugs that the offending patch fixes is similar
in the failure mode but for the reverse reason: the implicit order
causes suspend/resume to fail.

I suspect that the underlying reason here is that rmi4 needs something
in order to successfully suspend (i.e. read the IRQ status registers)
that has already been suspended where it hadn't prior to the offending
patch. It can't be the I2C controller itself that has been suspended,
because the parent/child relationship should prevent that from
happening.

I'm not familiar with how exactly rmi4 works, so I'll have to do
some digging to hopefully pinpoint exactly what's going wrong here.

In the meantime, it would be useful to know what exactly the I2C
hierarchy looks like. For example, what's the I2C controller that the
RMI4 device is hooked up to. According to the above, that's I2C bus 6,
so you should be able to find out some details about it by inspecting
the corresponding sysfs nodes:

$ ls -l /sys/class/i2c-adapter/i2c-6/
$ cat /sys/class/i2c-adapter/i2c-6/name
$ ls -l /sys/class/i2c-adapter/i2c-6/device/

Thierry


Attachments:
(No filename) (4.23 kB)
signature.asc (849.00 B)
Download all attachments

2021-01-11 15:00:18

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, Jan 11, 2021 at 2:43 PM Thierry Reding <[email protected]> wrote:
>
> On Sun, Jan 10, 2021 at 08:44:13PM -0800, Hugh Dickins wrote:
> > Hi Rafael,
> >
> > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > fails to suspend when running 5.11-rc kernels: bisected to
> > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> > me to switch on a debug option to extract further info if that may help.
>
> Hi Hugh,
>
> Quoting what I think are the relevant parts of that log:
>
> [ 34.373742] printk: Suspending console(s) (use no_console_suspend to debug)
> [ 34.429015] rmi4_physical rmi4-00: Failed to read irqs, code=-6
> [ 34.474973] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> [ 34.474994] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> [ 34.475001] rmi4_physical rmi4-00: Failed to suspend functions: -6
> [ 34.475105] rmi4_smbus 6-002c: Failed to suspend device: -6
> [ 34.475113] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> [ 34.475130] PM: Device 6-002c failed to suspend: error -6
> [ 34.475187] PM: Some devices failed to suspend, or early wake event detected
> [ 34.480324] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> [ 34.480748] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> [ 34.481558] rmi4_physical rmi4-00: rmi_driver_clear_irq_bits: Failed to change enabled interrupts!
> [ 34.487935] acpi LNXPOWER:02: Turning OFF
> [ 34.488707] acpi LNXPOWER:01: Turning OFF
> [ 34.489554] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> [ 34.489669] psmouse: probe of serio2 failed with error -1
> [ 34.489882] OOM killer enabled.
> [ 34.489891] Restarting tasks ... done.
> [ 34.589183] PM: suspend exit
> [ 34.589839] PM: suspend entry (s2idle)
> [ 34.605884] Filesystems sync: 0.017 seconds
> [ 34.607594] Freezing user space processes ... (elapsed 0.006 seconds) done.
> [ 34.613645] OOM killer disabled.
> [ 34.613650] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> [ 34.615482] printk: Suspending console(s) (use no_console_suspend to debug)
> [ 34.653097] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> [ 34.653108] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> [ 34.653115] rmi4_physical rmi4-00: Failed to suspend functions: -6
> [ 34.653123] rmi4_smbus 6-002c: Failed to suspend device: -6
> [ 34.653129] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> [ 34.653160] PM: Device 6-002c failed to suspend: error -6
> [ 34.653174] PM: Some devices failed to suspend, or early wake event detected
> [ 34.660515] OOM killer enabled.
> [ 34.660524] Restarting tasks ...
> [ 34.661456] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> [ 34.661591] psmouse: probe of serio2 failed with error -1
> [ 34.669469] done.
> [ 34.748386] PM: suspend exit
>
> I think what might be happening here is that the offending patch causes
> some devices to be reordered in a way different to how they were ordered
> originally and the rmi4 driver currently depends on that implicit order.

Actually, the only possible case in which the commit in question can
introduce suspend failures like this is when some dependency
information is missing and so the reordering causes the ordering to
change from the (working) implicit one.

> Interestingly one of the bugs that the offending patch fixes is similar
> in the failure mode but for the reverse reason: the implicit order
> causes suspend/resume to fail.

And that happens because some dependency information is missing.

So we have failing cases when dependency information is missing, so
instead of fixing those we have tried to make the core change the
ordering after every successful probe in the hope that this will take
care of the problem without introducing new breakage.

However, it evidently has introduced new breakage and in order to fix
it we need to figure out what dependency information is missing in the
failing cases and put that information in, but we may as well do the
same for the cases that are failing without the offending change.

So why don't we revert the commit in question and do just that?

2021-01-11 16:17:52

by Thierry Reding

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, Jan 11, 2021 at 03:57:37PM +0100, Rafael J. Wysocki wrote:
> On Mon, Jan 11, 2021 at 2:43 PM Thierry Reding <[email protected]> wrote:
> >
> > On Sun, Jan 10, 2021 at 08:44:13PM -0800, Hugh Dickins wrote:
> > > Hi Rafael,
> > >
> > > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > > fails to suspend when running 5.11-rc kernels: bisected to
> > > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > > and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> > > me to switch on a debug option to extract further info if that may help.
> >
> > Hi Hugh,
> >
> > Quoting what I think are the relevant parts of that log:
> >
> > [ 34.373742] printk: Suspending console(s) (use no_console_suspend to debug)
> > [ 34.429015] rmi4_physical rmi4-00: Failed to read irqs, code=-6
> > [ 34.474973] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > [ 34.474994] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > [ 34.475001] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > [ 34.475105] rmi4_smbus 6-002c: Failed to suspend device: -6
> > [ 34.475113] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > [ 34.475130] PM: Device 6-002c failed to suspend: error -6
> > [ 34.475187] PM: Some devices failed to suspend, or early wake event detected
> > [ 34.480324] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > [ 34.480748] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > [ 34.481558] rmi4_physical rmi4-00: rmi_driver_clear_irq_bits: Failed to change enabled interrupts!
> > [ 34.487935] acpi LNXPOWER:02: Turning OFF
> > [ 34.488707] acpi LNXPOWER:01: Turning OFF
> > [ 34.489554] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > [ 34.489669] psmouse: probe of serio2 failed with error -1
> > [ 34.489882] OOM killer enabled.
> > [ 34.489891] Restarting tasks ... done.
> > [ 34.589183] PM: suspend exit
> > [ 34.589839] PM: suspend entry (s2idle)
> > [ 34.605884] Filesystems sync: 0.017 seconds
> > [ 34.607594] Freezing user space processes ... (elapsed 0.006 seconds) done.
> > [ 34.613645] OOM killer disabled.
> > [ 34.613650] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> > [ 34.615482] printk: Suspending console(s) (use no_console_suspend to debug)
> > [ 34.653097] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > [ 34.653108] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > [ 34.653115] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > [ 34.653123] rmi4_smbus 6-002c: Failed to suspend device: -6
> > [ 34.653129] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > [ 34.653160] PM: Device 6-002c failed to suspend: error -6
> > [ 34.653174] PM: Some devices failed to suspend, or early wake event detected
> > [ 34.660515] OOM killer enabled.
> > [ 34.660524] Restarting tasks ...
> > [ 34.661456] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > [ 34.661591] psmouse: probe of serio2 failed with error -1
> > [ 34.669469] done.
> > [ 34.748386] PM: suspend exit
> >
> > I think what might be happening here is that the offending patch causes
> > some devices to be reordered in a way different to how they were ordered
> > originally and the rmi4 driver currently depends on that implicit order.
>
> Actually, the only possible case in which the commit in question can
> introduce suspend failures like this is when some dependency
> information is missing and so the reordering causes the ordering to
> change from the (working) implicit one.
>
> > Interestingly one of the bugs that the offending patch fixes is similar
> > in the failure mode but for the reverse reason: the implicit order
> > causes suspend/resume to fail.
>
> And that happens because some dependency information is missing.
>
> So we have failing cases when dependency information is missing, so
> instead of fixing those we have tried to make the core change the
> ordering after every successful probe in the hope that this will take
> care of the problem without introducing new breakage.
>
> However, it evidently has introduced new breakage and in order to fix
> it we need to figure out what dependency information is missing in the
> failing cases and put that information in, but we may as well do the
> same for the cases that are failing without the offending change.
>
> So why don't we revert the commit in question and do just that?

Unfortunately it isn't that easy. In fact, all the dependency
information already exists in the case that I cited in 5b6164d3465f
("driver core: Reorder devices on successful probe"), but it's the
driver core that suspends/resumes the devices in the wrong order.

The reason is because the ACONNECT device depends on the BPMP device
(via a power-domains property), but it's also instantiated before the
BPMP device (because it is listed earlier in device tree, which is
sorted by unit-address first, then alphabetically). BPMP being a CPU
non-addressable device it doesn't have a unit-address and hence is
listed very late in device tree (by convention). Normally this is would
not be a problem because deferred probe would take care of it. But there
is one corner-case which happens when the BPMP is built into the kernel
(which it usually is, as it provides access to resources necessary for
booting, such as clocks and resets) and ACONNECT is built as a loadable
module. In that case, BPMP gets probed before ACONNECT and hence when
ACONNECT does eventually get loaded, the BPMP is already there, meaning
ACONNECT won't defer probe and hence the DPM suspend/resume order is not
fixed up by the deferred probe code.

And that's precisely what the offending commit addresses. However, the
downside is, and we did discuss this during review, that it operates
under the (somewhat optimistic) assumption that all the dependency
information exists. This is because reordering on successful probe can
potentially introduce regressions for dependencies that were previously
implicit. So if a system has component B that depends on component A but
doesn't model that dependency via some child/parent relationship or an
explicit relationship that would be flagged by deferred probe, then this
implicit dependency can break by the new reordering on successful probe.

I very much suspect that that's exactly what's going on here. This RMI4
device very likely implicitly depends on some other resource getting
enabled but doesn't properly model that dependency. If we find out what
that dependency is and return -EPROBE_DEFER when that dependency has not
probed yet, then deferred probe will automatically take care of ordering
everything correctly again (or, in fact, ordering by successful probe
will take care of it already because RMI4 would initially fail with
-EPROBE_DEFER).

Adding Vincent, Jason, Andrew and Lucas (who have recently worked on
this driver), perhaps they have some better understanding of what
missing dependencies might be causing the above errors.

Thierry


Attachments:
(No filename) (7.16 kB)
signature.asc (849.00 B)
Download all attachments

2021-01-11 17:01:18

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, Jan 11, 2021 at 5:12 PM Thierry Reding <[email protected]> wrote:
>
> On Mon, Jan 11, 2021 at 03:57:37PM +0100, Rafael J. Wysocki wrote:
> > On Mon, Jan 11, 2021 at 2:43 PM Thierry Reding <[email protected]> wrote:
> > >
> > > On Sun, Jan 10, 2021 at 08:44:13PM -0800, Hugh Dickins wrote:
> > > > Hi Rafael,
> > > >
> > > > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > > > fails to suspend when running 5.11-rc kernels: bisected to
> > > > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > > > and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> > > > me to switch on a debug option to extract further info if that may help.
> > >
> > > Hi Hugh,
> > >
> > > Quoting what I think are the relevant parts of that log:
> > >
> > > [ 34.373742] printk: Suspending console(s) (use no_console_suspend to debug)
> > > [ 34.429015] rmi4_physical rmi4-00: Failed to read irqs, code=-6
> > > [ 34.474973] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > > [ 34.474994] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > > [ 34.475001] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > > [ 34.475105] rmi4_smbus 6-002c: Failed to suspend device: -6
> > > [ 34.475113] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > > [ 34.475130] PM: Device 6-002c failed to suspend: error -6
> > > [ 34.475187] PM: Some devices failed to suspend, or early wake event detected
> > > [ 34.480324] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > > [ 34.480748] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > > [ 34.481558] rmi4_physical rmi4-00: rmi_driver_clear_irq_bits: Failed to change enabled interrupts!
> > > [ 34.487935] acpi LNXPOWER:02: Turning OFF
> > > [ 34.488707] acpi LNXPOWER:01: Turning OFF
> > > [ 34.489554] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > > [ 34.489669] psmouse: probe of serio2 failed with error -1
> > > [ 34.489882] OOM killer enabled.
> > > [ 34.489891] Restarting tasks ... done.
> > > [ 34.589183] PM: suspend exit
> > > [ 34.589839] PM: suspend entry (s2idle)
> > > [ 34.605884] Filesystems sync: 0.017 seconds
> > > [ 34.607594] Freezing user space processes ... (elapsed 0.006 seconds) done.
> > > [ 34.613645] OOM killer disabled.
> > > [ 34.613650] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> > > [ 34.615482] printk: Suspending console(s) (use no_console_suspend to debug)
> > > [ 34.653097] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > > [ 34.653108] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > > [ 34.653115] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > > [ 34.653123] rmi4_smbus 6-002c: Failed to suspend device: -6
> > > [ 34.653129] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > > [ 34.653160] PM: Device 6-002c failed to suspend: error -6
> > > [ 34.653174] PM: Some devices failed to suspend, or early wake event detected
> > > [ 34.660515] OOM killer enabled.
> > > [ 34.660524] Restarting tasks ...
> > > [ 34.661456] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > > [ 34.661591] psmouse: probe of serio2 failed with error -1
> > > [ 34.669469] done.
> > > [ 34.748386] PM: suspend exit
> > >
> > > I think what might be happening here is that the offending patch causes
> > > some devices to be reordered in a way different to how they were ordered
> > > originally and the rmi4 driver currently depends on that implicit order.
> >
> > Actually, the only possible case in which the commit in question can
> > introduce suspend failures like this is when some dependency
> > information is missing and so the reordering causes the ordering to
> > change from the (working) implicit one.
> >
> > > Interestingly one of the bugs that the offending patch fixes is similar
> > > in the failure mode but for the reverse reason: the implicit order
> > > causes suspend/resume to fail.
> >
> > And that happens because some dependency information is missing.
> >
> > So we have failing cases when dependency information is missing, so
> > instead of fixing those we have tried to make the core change the
> > ordering after every successful probe in the hope that this will take
> > care of the problem without introducing new breakage.
> >
> > However, it evidently has introduced new breakage and in order to fix
> > it we need to figure out what dependency information is missing in the
> > failing cases and put that information in, but we may as well do the
> > same for the cases that are failing without the offending change.
> >
> > So why don't we revert the commit in question and do just that?
>
> Unfortunately it isn't that easy. In fact, all the dependency
> information already exists in the case that I cited in 5b6164d3465f
> ("driver core: Reorder devices on successful probe"), but it's the
> driver core that suspends/resumes the devices in the wrong order.
>
> The reason is because the ACONNECT device depends on the BPMP device
> (via a power-domains property), but it's also instantiated before the
> BPMP device (because it is listed earlier in device tree, which is
> sorted by unit-address first, then alphabetically). BPMP being a CPU
> non-addressable device it doesn't have a unit-address and hence is
> listed very late in device tree (by convention). Normally this is would
> not be a problem because deferred probe would take care of it. But there
> is one corner-case which happens when the BPMP is built into the kernel
> (which it usually is, as it provides access to resources necessary for
> booting, such as clocks and resets) and ACONNECT is built as a loadable
> module. In that case, BPMP gets probed before ACONNECT and hence when
> ACONNECT does eventually get loaded, the BPMP is already there, meaning
> ACONNECT won't defer probe and hence the DPM suspend/resume order is not
> fixed up by the deferred probe code.

What about using a device link to enforce the right ordering, then?

Deferred probing is not a way to ensure the suitable suspend/resume ordering.

> And that's precisely what the offending commit addresses. However, the
> downside is, and we did discuss this during review, that it operates
> under the (somewhat optimistic) assumption that all the dependency
> information exists. This is because reordering on successful probe can
> potentially introduce regressions for dependencies that were previously
> implicit. So if a system has component B that depends on component A but
> doesn't model that dependency via some child/parent relationship or an
> explicit relationship that would be flagged by deferred probe,

Again, deferred probing may not help here.

> then this implicit dependency can break by the new reordering on successful probe.
>
> I very much suspect that that's exactly what's going on here. This RMI4
> device very likely implicitly depends on some other resource getting
> enabled but doesn't properly model that dependency. If we find out what
> that dependency is and return -EPROBE_DEFER when that dependency has not
> probed yet, then deferred probe will automatically take care of ordering
> everything correctly again (or, in fact, ordering by successful probe
> will take care of it already because RMI4 would initially fail with
> -EPROBE_DEFER).
>
> Adding Vincent, Jason, Andrew and Lucas (who have recently worked on
> this driver), perhaps they have some better understanding of what
> missing dependencies might be causing the above errors.

IMV it is a mistake to believe that deferred probing can get
everything right for you in every case, with or without the offending
commit. Sometimes you need to tell the core what the right ordering
is and that's what device links are for.

As it stands today, that commit doesn't improve the situation and it
adds overhead and complexity.

2021-01-11 17:24:56

by Hugh Dickins

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, 11 Jan 2021, Rafael J. Wysocki wrote:
> On Mon, Jan 11, 2021 at 5:44 AM Hugh Dickins <[email protected]> wrote:
> >
> > Hi Rafael,
> >
> > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > fails to suspend when running 5.11-rc kernels: bisected to
> > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> > me to switch on a debug option to extract further info if that may help.
>
> Does the driver abort the suspend transition by returning an error or
> does something else happen?

Both. Thierry has pointed to the lines showing failed suspend transition;
and I forgot to mention that the touchpad is unresponsive from then on
(I might not have noticed the failed suspend without that). But I don't
suppose that unresponsiveness is worth worrying about: things went wrong
in suspend, so it's not surprising if the driver does not recover well.

Thank you both for getting on to this so quickly - but don't worry about
getting my touchpad working: I'm glad to see you discussing the wider
issues of ordering that this has brought up.

Hugh

2021-01-12 00:44:36

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

Hi Hugh,

Thanks for the report!

On Mon, Jan 11, 2021 at 5:44 AM Hugh Dickins <[email protected]> wrote:
>
> Hi Rafael,
>
> Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> fails to suspend when running 5.11-rc kernels: bisected to
> 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> me to switch on a debug option to extract further info if that may help.

Does the driver abort the suspend transition by returning an error or
does something else happen?

2021-01-12 07:17:14

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, Jan 11, 2021 at 2:43 PM Thierry Reding <[email protected]> wrote:
>
> On Sun, Jan 10, 2021 at 08:44:13PM -0800, Hugh Dickins wrote:
> > Hi Rafael,
> >
> > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > fails to suspend when running 5.11-rc kernels: bisected to
> > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> > me to switch on a debug option to extract further info if that may help.
>
> Hi Hugh,
>
> Quoting what I think are the relevant parts of that log:

I'm not sure how I overlooked that part of the log. Oh well.

> [ 34.373742] printk: Suspending console(s) (use no_console_suspend to debug)
> [ 34.429015] rmi4_physical rmi4-00: Failed to read irqs, code=-6

This is a transport device read operation failing, but I'm not sure
how it is related to suspend.

> [ 34.474973] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.

And this is the rmi_write() in rmi_f01_suspend() failing AFAICS.

> [ 34.474994] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> [ 34.475001] rmi4_physical rmi4-00: Failed to suspend functions: -6
> [ 34.475105] rmi4_smbus 6-002c: Failed to suspend device: -6
> [ 34.475113] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6

So the call chain is
rmi_smb_suspend()->rmi_driver_suspend()->rmi_suspend_functions()->suspend_one_function()->rmi_f01_suspend().

> [ 34.475130] PM: Device 6-002c failed to suspend: error -6
> [ 34.475187] PM: Some devices failed to suspend, or early wake event detected
> [ 34.480324] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> [ 34.480748] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> [ 34.481558] rmi4_physical rmi4-00: rmi_driver_clear_irq_bits: Failed to change enabled interrupts!
> [ 34.487935] acpi LNXPOWER:02: Turning OFF
> [ 34.488707] acpi LNXPOWER:01: Turning OFF
> [ 34.489554] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> [ 34.489669] psmouse: probe of serio2 failed with error -1
> [ 34.489882] OOM killer enabled.
> [ 34.489891] Restarting tasks ... done.
> [ 34.589183] PM: suspend exit
> [ 34.589839] PM: suspend entry (s2idle)
> [ 34.605884] Filesystems sync: 0.017 seconds
> [ 34.607594] Freezing user space processes ... (elapsed 0.006 seconds) done.
> [ 34.613645] OOM killer disabled.
> [ 34.613650] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> [ 34.615482] printk: Suspending console(s) (use no_console_suspend to debug)
> [ 34.653097] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> [ 34.653108] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> [ 34.653115] rmi4_physical rmi4-00: Failed to suspend functions: -6
> [ 34.653123] rmi4_smbus 6-002c: Failed to suspend device: -6
> [ 34.653129] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> [ 34.653160] PM: Device 6-002c failed to suspend: error -6
> [ 34.653174] PM: Some devices failed to suspend, or early wake event detected
> [ 34.660515] OOM killer enabled.
> [ 34.660524] Restarting tasks ...
> [ 34.661456] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> [ 34.661591] psmouse: probe of serio2 failed with error -1
> [ 34.669469] done.
> [ 34.748386] PM: suspend exit
>
> I think what might be happening here is that the offending patch causes
> some devices to be reordered in a way different to how they were ordered
> originally and the rmi4 driver currently depends on that implicit order.

Yes, that's what appears to be happening.

> Interestingly one of the bugs that the offending patch fixes is similar
> in the failure mode but for the reverse reason: the implicit order
> causes suspend/resume to fail.
>
> I suspect that the underlying reason here is that rmi4 needs something
> in order to successfully suspend (i.e. read the IRQ status registers)
> that has already been suspended where it hadn't prior to the offending
> patch.

Definitely, something has been suspended prematurely.

> It can't be the I2C controller itself that has been suspended,
> because the parent/child relationship should prevent that from
> happening.

Well, assuming that there is such a parent-child dependency.

It looks like there is at least one level of indirection between i2c
and the affected device.

> I'm not familiar with how exactly rmi4 works, so I'll have to do
> some digging to hopefully pinpoint exactly what's going wrong here.
>
> In the meantime, it would be useful to know what exactly the I2C
> hierarchy looks like. For example, what's the I2C controller that the
> RMI4 device is hooked up to. According to the above, that's I2C bus 6,
> so you should be able to find out some details about it by inspecting
> the corresponding sysfs nodes:
>
> $ ls -l /sys/class/i2c-adapter/i2c-6/
> $ cat /sys/class/i2c-adapter/i2c-6/name
> $ ls -l /sys/class/i2c-adapter/i2c-6/device/
>
> Thierry

2021-01-12 09:06:06

by Hugh Dickins

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, 11 Jan 2021, Thierry Reding wrote:
> On Sun, Jan 10, 2021 at 08:44:13PM -0800, Hugh Dickins wrote:
> >
> > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > fails to suspend when running 5.11-rc kernels: bisected to
> > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> > me to switch on a debug option to extract further info if that may help.
...
>
> I think what might be happening here is that the offending patch causes
> some devices to be reordered in a way different to how they were ordered
> originally and the rmi4 driver currently depends on that implicit order.

Yes, all that you explained makes good sense, thanks.

> I'm not familiar with how exactly rmi4 works, so I'll have to do
> some digging to hopefully pinpoint exactly what's going wrong here.
>
> In the meantime, it would be useful to know what exactly the I2C
> hierarchy looks like. For example, what's the I2C controller that the
> RMI4 device is hooked up to. According to the above, that's I2C bus 6,
> so you should be able to find out some details about it by inspecting
> the corresponding sysfs nodes:
>
> $ ls -l /sys/class/i2c-adapter/i2c-6/
> $ cat /sys/class/i2c-adapter/i2c-6/name
> $ ls -l /sys/class/i2c-adapter/i2c-6/device/

That's curious: I don't even have a /sys/class/i2c-adapter directory.

(And I did wonder if you meant to say "smbus" rather than "i2c",
though I don't have any /sys/class/smbus* either: I have no notion
of the relationship between i2c and smbus, but I thought the failing
write_block calls were the ones in rmi_smbus.c rather than rmi_i2c.c.)

I've attached compressed output of "find /sys/bus /sys/class | sort":
/sys/bus looked more relevant than /sys/class, maybe it will help
point in the right direction?

And in case it's relevant, maybe I should mention that this is a
non-modular, all-built-in kernel.

But as I said to Rafael, my touchpad can wait: the wider ordering
discussion is much more important.

Hugh


Attachments:
sysbusclass.xz (4.14 kB)

2021-01-12 09:58:08

by Saravana Kannan

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, Jan 11, 2021 at 8:57 AM Rafael J. Wysocki <[email protected]> wrote:
>
> On Mon, Jan 11, 2021 at 5:12 PM Thierry Reding <[email protected]> wrote:
> >
> > On Mon, Jan 11, 2021 at 03:57:37PM +0100, Rafael J. Wysocki wrote:
> > > On Mon, Jan 11, 2021 at 2:43 PM Thierry Reding <[email protected]> wrote:
> > > >
> > > > On Sun, Jan 10, 2021 at 08:44:13PM -0800, Hugh Dickins wrote:
> > > > > Hi Rafael,
> > > > >
> > > > > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > > > > fails to suspend when running 5.11-rc kernels: bisected to
> > > > > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > > > > and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> > > > > me to switch on a debug option to extract further info if that may help.
> > > >
> > > > Hi Hugh,
> > > >
> > > > Quoting what I think are the relevant parts of that log:
> > > >
> > > > [ 34.373742] printk: Suspending console(s) (use no_console_suspend to debug)
> > > > [ 34.429015] rmi4_physical rmi4-00: Failed to read irqs, code=-6
> > > > [ 34.474973] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > > > [ 34.474994] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > > > [ 34.475001] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > > > [ 34.475105] rmi4_smbus 6-002c: Failed to suspend device: -6
> > > > [ 34.475113] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > > > [ 34.475130] PM: Device 6-002c failed to suspend: error -6
> > > > [ 34.475187] PM: Some devices failed to suspend, or early wake event detected
> > > > [ 34.480324] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > > > [ 34.480748] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > > > [ 34.481558] rmi4_physical rmi4-00: rmi_driver_clear_irq_bits: Failed to change enabled interrupts!
> > > > [ 34.487935] acpi LNXPOWER:02: Turning OFF
> > > > [ 34.488707] acpi LNXPOWER:01: Turning OFF
> > > > [ 34.489554] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > > > [ 34.489669] psmouse: probe of serio2 failed with error -1
> > > > [ 34.489882] OOM killer enabled.
> > > > [ 34.489891] Restarting tasks ... done.
> > > > [ 34.589183] PM: suspend exit
> > > > [ 34.589839] PM: suspend entry (s2idle)
> > > > [ 34.605884] Filesystems sync: 0.017 seconds
> > > > [ 34.607594] Freezing user space processes ... (elapsed 0.006 seconds) done.
> > > > [ 34.613645] OOM killer disabled.
> > > > [ 34.613650] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> > > > [ 34.615482] printk: Suspending console(s) (use no_console_suspend to debug)
> > > > [ 34.653097] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > > > [ 34.653108] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > > > [ 34.653115] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > > > [ 34.653123] rmi4_smbus 6-002c: Failed to suspend device: -6
> > > > [ 34.653129] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > > > [ 34.653160] PM: Device 6-002c failed to suspend: error -6
> > > > [ 34.653174] PM: Some devices failed to suspend, or early wake event detected
> > > > [ 34.660515] OOM killer enabled.
> > > > [ 34.660524] Restarting tasks ...
> > > > [ 34.661456] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > > > [ 34.661591] psmouse: probe of serio2 failed with error -1
> > > > [ 34.669469] done.
> > > > [ 34.748386] PM: suspend exit
> > > >
> > > > I think what might be happening here is that the offending patch causes
> > > > some devices to be reordered in a way different to how they were ordered
> > > > originally and the rmi4 driver currently depends on that implicit order.
> > >
> > > Actually, the only possible case in which the commit in question can
> > > introduce suspend failures like this is when some dependency
> > > information is missing and so the reordering causes the ordering to
> > > change from the (working) implicit one.
> > >
> > > > Interestingly one of the bugs that the offending patch fixes is similar
> > > > in the failure mode but for the reverse reason: the implicit order
> > > > causes suspend/resume to fail.
> > >
> > > And that happens because some dependency information is missing.
> > >
> > > So we have failing cases when dependency information is missing, so
> > > instead of fixing those we have tried to make the core change the
> > > ordering after every successful probe in the hope that this will take
> > > care of the problem without introducing new breakage.
> > >
> > > However, it evidently has introduced new breakage and in order to fix
> > > it we need to figure out what dependency information is missing in the
> > > failing cases and put that information in, but we may as well do the
> > > same for the cases that are failing without the offending change.
> > >
> > > So why don't we revert the commit in question and do just that?
> >
> > Unfortunately it isn't that easy. In fact, all the dependency
> > information already exists in the case that I cited in 5b6164d3465f
> > ("driver core: Reorder devices on successful probe"), but it's the
> > driver core that suspends/resumes the devices in the wrong order.
> >
> > The reason is because the ACONNECT device depends on the BPMP device
> > (via a power-domains property), but it's also instantiated before the
> > BPMP device (because it is listed earlier in device tree, which is
> > sorted by unit-address first, then alphabetically). BPMP being a CPU
> > non-addressable device it doesn't have a unit-address and hence is
> > listed very late in device tree (by convention). Normally this is would
> > not be a problem because deferred probe would take care of it. But there
> > is one corner-case which happens when the BPMP is built into the kernel
> > (which it usually is, as it provides access to resources necessary for
> > booting, such as clocks and resets) and ACONNECT is built as a loadable
> > module. In that case, BPMP gets probed before ACONNECT and hence when
> > ACONNECT does eventually get loaded, the BPMP is already there, meaning
> > ACONNECT won't defer probe and hence the DPM suspend/resume order is not
> > fixed up by the deferred probe code.
>
> What about using a device link to enforce the right ordering, then?
>
> Deferred probing is not a way to ensure the suitable suspend/resume ordering.

Thierry,

Can you try booting with fw_devlink=on with this series? It's queued
up for 5.12-rc1
https://lore.kernel.org/lkml/[email protected]/

It might solve your issue, but I think your patch still addresses a real issue.

> > And that's precisely what the offending commit addresses. However, the
> > downside is, and we did discuss this during review, that it operates
> > under the (somewhat optimistic) assumption that all the dependency
> > information exists. This is because reordering on successful probe can
> > potentially introduce regressions for dependencies that were previously
> > implicit. So if a system has component B that depends on component A but
> > doesn't model that dependency via some child/parent relationship or an
> > explicit relationship that would be flagged by deferred probe,
>
> Again, deferred probing may not help here.
>
> > then this implicit dependency can break by the new reordering on successful probe.
> >
> > I very much suspect that that's exactly what's going on here. This RMI4
> > device very likely implicitly depends on some other resource getting
> > enabled but doesn't properly model that dependency. If we find out what
> > that dependency is and return -EPROBE_DEFER when that dependency has not
> > probed yet, then deferred probe will automatically take care of ordering
> > everything correctly again (or, in fact, ordering by successful probe
> > will take care of it already because RMI4 would initially fail with
> > -EPROBE_DEFER).
> >
> > Adding Vincent, Jason, Andrew and Lucas (who have recently worked on
> > this driver), perhaps they have some better understanding of what
> > missing dependencies might be causing the above errors.
>
> IMV it is a mistake to believe that deferred probing can get
> everything right for you in every case, with or without the offending
> commit. Sometimes you need to tell the core what the right ordering
> is and that's what device links are for.

IMHO, Thierry's patch is the right way to imply dependencies when
device links aren't explicitly calling out dependencies. It's not
really depending on deferred probe to imply dependency order. Rather,
it's saying that the order in which devices probe is a better way to
imply dependency than relying on the order in which devices are added.

For Thierry's case, fw_devlink=on might solve his problem, but that's
solving the problem by explicitly calling out the dependency (by
getting it from DT where the dependency is explicitly called out). For
implicit cases, we still need his patch. I wonder how

> As it stands today, that commit doesn't improve the situation and it
> adds overhead and complexity.

I'm okay if we revert it for now, but that doesn't solve the
overarching ordering issues though.

I happen to have an X1 Carbon (different gen though) lying around and
I poked at its /sys folders. None of the devices in the rmi4_smbus are
considered the grandchildren of the i2c device. I think the real
problem is rmi_register_transport_device() [1] not setting up the
parent for any of the new devices it's adding.

Hugh, can you try this patch?

diff --git a/drivers/input/rmi4/rmi_bus.c b/drivers/input/rmi4/rmi_bus.c
index 24f31a5c0e04..50a0134b6901 100644
--- a/drivers/input/rmi4/rmi_bus.c
+++ b/drivers/input/rmi4/rmi_bus.c
@@ -90,6 +90,7 @@ int rmi_register_transport_device(struct
rmi_transport_dev *xport)

rmi_dev->dev.bus = &rmi_bus_type;
rmi_dev->dev.type = &rmi_device_type;
+ rmi_dev->dev.parent = xport->dev;

xport->rmi_dev = rmi_dev;

-Saravana

[1] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/input/rmi4/rmi_bus.c#n74

2021-01-12 10:24:06

by Hugh Dickins

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, 11 Jan 2021, Saravana Kannan wrote:
>
> I happen to have an X1 Carbon (different gen though) lying around and
> I poked at its /sys folders. None of the devices in the rmi4_smbus are
> considered the grandchildren of the i2c device. I think the real
> problem is rmi_register_transport_device() [1] not setting up the
> parent for any of the new devices it's adding.
>
> Hugh, can you try this patch?

Just tried, but no, this patch does not help; but I bet
you're along the right lines, and something as simple will do it.

>
> diff --git a/drivers/input/rmi4/rmi_bus.c b/drivers/input/rmi4/rmi_bus.c
> index 24f31a5c0e04..50a0134b6901 100644
> --- a/drivers/input/rmi4/rmi_bus.c
> +++ b/drivers/input/rmi4/rmi_bus.c
> @@ -90,6 +90,7 @@ int rmi_register_transport_device(struct
> rmi_transport_dev *xport)
>
> rmi_dev->dev.bus = &rmi_bus_type;
> rmi_dev->dev.type = &rmi_device_type;
> + rmi_dev->dev.parent = xport->dev;
>
> xport->rmi_dev = rmi_dev;
>
> -Saravana
>
> [1] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/input/rmi4/rmi_bus.c#n74

2021-01-12 10:29:23

by Saravana Kannan

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, Jan 11, 2021 at 3:42 PM Hugh Dickins <[email protected]> wrote:
>
> On Mon, 11 Jan 2021, Saravana Kannan wrote:
> >
> > I happen to have an X1 Carbon (different gen though) lying around and
> > I poked at its /sys folders. None of the devices in the rmi4_smbus are
> > considered the grandchildren of the i2c device. I think the real
> > problem is rmi_register_transport_device() [1] not setting up the
> > parent for any of the new devices it's adding.
> >
> > Hugh, can you try this patch?
>
> Just tried, but no, this patch does not help; but I bet
> you're along the right lines, and something as simple will do it.

Did you see this patch change the organization of devices under /sys/devices/?
The rmi* devices need to be under one of the i2c devices after this
patch. Is that not the case? Or is that the case, but you are still
seeing suspend/resume issues?

-Saravana

>
> >
> > diff --git a/drivers/input/rmi4/rmi_bus.c b/drivers/input/rmi4/rmi_bus.c
> > index 24f31a5c0e04..50a0134b6901 100644
> > --- a/drivers/input/rmi4/rmi_bus.c
> > +++ b/drivers/input/rmi4/rmi_bus.c
> > @@ -90,6 +90,7 @@ int rmi_register_transport_device(struct
> > rmi_transport_dev *xport)
> >
> > rmi_dev->dev.bus = &rmi_bus_type;
> > rmi_dev->dev.type = &rmi_device_type;
> > + rmi_dev->dev.parent = xport->dev;
> >
> > xport->rmi_dev = rmi_dev;
> >
> > -Saravana
> >
> > [1] - https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/input/rmi4/rmi_bus.c#n74

2021-01-12 10:39:35

by Hugh Dickins

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, 11 Jan 2021, Saravana Kannan wrote:
> On Mon, Jan 11, 2021 at 3:42 PM Hugh Dickins <[email protected]> wrote:
> > On Mon, 11 Jan 2021, Saravana Kannan wrote:
> > >
> > > I happen to have an X1 Carbon (different gen though) lying around and
> > > I poked at its /sys folders. None of the devices in the rmi4_smbus are
> > > considered the grandchildren of the i2c device. I think the real
> > > problem is rmi_register_transport_device() [1] not setting up the
> > > parent for any of the new devices it's adding.
> > >
> > > Hugh, can you try this patch?
> >
> > Just tried, but no, this patch does not help; but I bet
> > you're along the right lines, and something as simple will do it.
>
> Did you see this patch change the organization of devices under /sys/devices/?
> The rmi* devices need to be under one of the i2c devices after this
> patch. Is that not the case? Or is that the case, but you are still
> seeing suspend/resume issues?

Now that I look, yes, that patch has moved the directory
/sys/devices/rmi4-00
to
/sys/devices/pci0000:00/0000:00:1f.4/i2c-6/6-002c/rmi4-00

But I still see the same suspend issues despite that.

Hugh

2021-01-12 11:11:10

by Saravana Kannan

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, Jan 11, 2021 at 4:44 PM Hugh Dickins <[email protected]> wrote:
>
> On Mon, 11 Jan 2021, Saravana Kannan wrote:
> > On Mon, Jan 11, 2021 at 3:42 PM Hugh Dickins <[email protected]> wrote:
> > > On Mon, 11 Jan 2021, Saravana Kannan wrote:
> > > >
> > > > I happen to have an X1 Carbon (different gen though) lying around and
> > > > I poked at its /sys folders. None of the devices in the rmi4_smbus are
> > > > considered the grandchildren of the i2c device. I think the real
> > > > problem is rmi_register_transport_device() [1] not setting up the
> > > > parent for any of the new devices it's adding.
> > > >
> > > > Hugh, can you try this patch?
> > >
> > > Just tried, but no, this patch does not help; but I bet
> > > you're along the right lines, and something as simple will do it.
> >
> > Did you see this patch change the organization of devices under /sys/devices/?
> > The rmi* devices need to be under one of the i2c devices after this
> > patch. Is that not the case? Or is that the case, but you are still
> > seeing suspend/resume issues?
>
> Now that I look, yes, that patch has moved the directory
> /sys/devices/rmi4-00
> to
> /sys/devices/pci0000:00/0000:00:1f.4/i2c-6/6-002c/rmi4-00

What about child devices of rmi4-00? Does it still have the
rmi4-00.fn* devices as children? I'd think so, but just double
checking.

>
> But I still see the same suspend issues despite that.

Can you please get new logs to see if the failure reasons are still
the same? I'd think this parent/child relationship would at least
avoid the "Failed to read irqs" errors that seem to be due to I2C
dependency.

-Saravana

2021-01-12 11:15:21

by Hugh Dickins

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, 11 Jan 2021, Saravana Kannan wrote:
> On Mon, Jan 11, 2021 at 4:44 PM Hugh Dickins <[email protected]> wrote:
> > On Mon, 11 Jan 2021, Saravana Kannan wrote:
> > >
> > > Did you see this patch change the organization of devices under /sys/devices/?
> > > The rmi* devices need to be under one of the i2c devices after this
> > > patch. Is that not the case? Or is that the case, but you are still
> > > seeing suspend/resume issues?
> >
> > Now that I look, yes, that patch has moved the directory
> > /sys/devices/rmi4-00
> > to
> > /sys/devices/pci0000:00/0000:00:1f.4/i2c-6/6-002c/rmi4-00
>
> What about child devices of rmi4-00? Does it still have the
> rmi4-00.fn* devices as children? I'd think so, but just double
> checking.

Yes, the patch moved the rmi4-00 directory and its contents.

>
> >
> > But I still see the same suspend issues despite that.
>
> Can you please get new logs to see if the failure reasons are still
> the same? I'd think this parent/child relationship would at least
> avoid the "Failed to read irqs" errors that seem to be due to I2C
> dependency.

No, it did not avoid the "Failed to read irqs" error (though my
recollection from earlier failures before I mailed out, is that
that particular error is intermittent: sometimes it showed up,
other times not; but always the "Failed to write sleep mode").

I configured CONFIG_PM_DEBUG=y and booted with pm_debug_messages
this time, dmesgsys.tar attached, contents:

dmesg.rc3 # dmesg of boot and attempt to suspend on 5.11-rc3
sys.rc3 # find /sys | sort | grep -v /sys/fs/cgroup afterwards
dmesg.saravana # dmesg of boot and attempt to suspend with your patch
sys.saravana # find /sys | sort | grep -v /sys/fs/cgroup afterwards
dmesg.revert # dmesg of boot+suspend+resume, rc3 without 5b6164d3465f
sys.revert # find /sys | sort | grep -v /sys/fs/cgroup afterwards

Not as many debug messages as I was expecting: perhaps you can point
me to something else to tune to get more info out? And perhaps it was
a mistake to snapshot the /sys hierarchy after rather than before:
I see now that it does make some difference. I filtered out
/sys/fs/cgroup because that enlarged the diffs with no relevance.

Hugh


Attachments:
dmesgsys.tar (97.57 kB)

2021-01-12 13:08:16

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, Jan 11, 2021 at 11:44 PM Saravana Kannan <[email protected]> wrote:
>
> On Mon, Jan 11, 2021 at 8:57 AM Rafael J. Wysocki <[email protected]> wrote:
> >
> > On Mon, Jan 11, 2021 at 5:12 PM Thierry Reding <[email protected]> wrote:
> > >
> > > On Mon, Jan 11, 2021 at 03:57:37PM +0100, Rafael J. Wysocki wrote:
> > > > On Mon, Jan 11, 2021 at 2:43 PM Thierry Reding <[email protected]> wrote:
> > > > >
> > > > > On Sun, Jan 10, 2021 at 08:44:13PM -0800, Hugh Dickins wrote:
> > > > > > Hi Rafael,
> > > > > >
> > > > > > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > > > > > fails to suspend when running 5.11-rc kernels: bisected to
> > > > > > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > > > > > and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> > > > > > me to switch on a debug option to extract further info if that may help.
> > > > >
> > > > > Hi Hugh,
> > > > >
> > > > > Quoting what I think are the relevant parts of that log:
> > > > >
> > > > > [ 34.373742] printk: Suspending console(s) (use no_console_suspend to debug)
> > > > > [ 34.429015] rmi4_physical rmi4-00: Failed to read irqs, code=-6
> > > > > [ 34.474973] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > > > > [ 34.474994] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > > > > [ 34.475001] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > > > > [ 34.475105] rmi4_smbus 6-002c: Failed to suspend device: -6
> > > > > [ 34.475113] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > > > > [ 34.475130] PM: Device 6-002c failed to suspend: error -6
> > > > > [ 34.475187] PM: Some devices failed to suspend, or early wake event detected
> > > > > [ 34.480324] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > > > > [ 34.480748] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > > > > [ 34.481558] rmi4_physical rmi4-00: rmi_driver_clear_irq_bits: Failed to change enabled interrupts!
> > > > > [ 34.487935] acpi LNXPOWER:02: Turning OFF
> > > > > [ 34.488707] acpi LNXPOWER:01: Turning OFF
> > > > > [ 34.489554] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > > > > [ 34.489669] psmouse: probe of serio2 failed with error -1
> > > > > [ 34.489882] OOM killer enabled.
> > > > > [ 34.489891] Restarting tasks ... done.
> > > > > [ 34.589183] PM: suspend exit
> > > > > [ 34.589839] PM: suspend entry (s2idle)
> > > > > [ 34.605884] Filesystems sync: 0.017 seconds
> > > > > [ 34.607594] Freezing user space processes ... (elapsed 0.006 seconds) done.
> > > > > [ 34.613645] OOM killer disabled.
> > > > > [ 34.613650] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> > > > > [ 34.615482] printk: Suspending console(s) (use no_console_suspend to debug)
> > > > > [ 34.653097] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > > > > [ 34.653108] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > > > > [ 34.653115] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > > > > [ 34.653123] rmi4_smbus 6-002c: Failed to suspend device: -6
> > > > > [ 34.653129] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > > > > [ 34.653160] PM: Device 6-002c failed to suspend: error -6
> > > > > [ 34.653174] PM: Some devices failed to suspend, or early wake event detected
> > > > > [ 34.660515] OOM killer enabled.
> > > > > [ 34.660524] Restarting tasks ...
> > > > > [ 34.661456] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > > > > [ 34.661591] psmouse: probe of serio2 failed with error -1
> > > > > [ 34.669469] done.
> > > > > [ 34.748386] PM: suspend exit
> > > > >
> > > > > I think what might be happening here is that the offending patch causes
> > > > > some devices to be reordered in a way different to how they were ordered
> > > > > originally and the rmi4 driver currently depends on that implicit order.
> > > >
> > > > Actually, the only possible case in which the commit in question can
> > > > introduce suspend failures like this is when some dependency
> > > > information is missing and so the reordering causes the ordering to
> > > > change from the (working) implicit one.
> > > >
> > > > > Interestingly one of the bugs that the offending patch fixes is similar
> > > > > in the failure mode but for the reverse reason: the implicit order
> > > > > causes suspend/resume to fail.
> > > >
> > > > And that happens because some dependency information is missing.
> > > >
> > > > So we have failing cases when dependency information is missing, so
> > > > instead of fixing those we have tried to make the core change the
> > > > ordering after every successful probe in the hope that this will take
> > > > care of the problem without introducing new breakage.
> > > >
> > > > However, it evidently has introduced new breakage and in order to fix
> > > > it we need to figure out what dependency information is missing in the
> > > > failing cases and put that information in, but we may as well do the
> > > > same for the cases that are failing without the offending change.
> > > >
> > > > So why don't we revert the commit in question and do just that?
> > >
> > > Unfortunately it isn't that easy. In fact, all the dependency
> > > information already exists in the case that I cited in 5b6164d3465f
> > > ("driver core: Reorder devices on successful probe"), but it's the
> > > driver core that suspends/resumes the devices in the wrong order.
> > >
> > > The reason is because the ACONNECT device depends on the BPMP device
> > > (via a power-domains property), but it's also instantiated before the
> > > BPMP device (because it is listed earlier in device tree, which is
> > > sorted by unit-address first, then alphabetically). BPMP being a CPU
> > > non-addressable device it doesn't have a unit-address and hence is
> > > listed very late in device tree (by convention). Normally this is would
> > > not be a problem because deferred probe would take care of it. But there
> > > is one corner-case which happens when the BPMP is built into the kernel
> > > (which it usually is, as it provides access to resources necessary for
> > > booting, such as clocks and resets) and ACONNECT is built as a loadable
> > > module. In that case, BPMP gets probed before ACONNECT and hence when
> > > ACONNECT does eventually get loaded, the BPMP is already there, meaning
> > > ACONNECT won't defer probe and hence the DPM suspend/resume order is not
> > > fixed up by the deferred probe code.
> >
> > What about using a device link to enforce the right ordering, then?
> >
> > Deferred probing is not a way to ensure the suitable suspend/resume ordering.
>
> Thierry,
>
> Can you try booting with fw_devlink=on with this series? It's queued
> up for 5.12-rc1
> https://lore.kernel.org/lkml/[email protected]/
>
> It might solve your issue, but I think your patch still addresses a real issue.
>
> > > And that's precisely what the offending commit addresses. However, the
> > > downside is, and we did discuss this during review, that it operates
> > > under the (somewhat optimistic) assumption that all the dependency
> > > information exists. This is because reordering on successful probe can
> > > potentially introduce regressions for dependencies that were previously
> > > implicit. So if a system has component B that depends on component A but
> > > doesn't model that dependency via some child/parent relationship or an
> > > explicit relationship that would be flagged by deferred probe,
> >
> > Again, deferred probing may not help here.
> >
> > > then this implicit dependency can break by the new reordering on successful probe.
> > >
> > > I very much suspect that that's exactly what's going on here. This RMI4
> > > device very likely implicitly depends on some other resource getting
> > > enabled but doesn't properly model that dependency. If we find out what
> > > that dependency is and return -EPROBE_DEFER when that dependency has not
> > > probed yet, then deferred probe will automatically take care of ordering
> > > everything correctly again (or, in fact, ordering by successful probe
> > > will take care of it already because RMI4 would initially fail with
> > > -EPROBE_DEFER).
> > >
> > > Adding Vincent, Jason, Andrew and Lucas (who have recently worked on
> > > this driver), perhaps they have some better understanding of what
> > > missing dependencies might be causing the above errors.
> >
> > IMV it is a mistake to believe that deferred probing can get
> > everything right for you in every case, with or without the offending
> > commit. Sometimes you need to tell the core what the right ordering
> > is and that's what device links are for.
>
> IMHO, Thierry's patch is the right way to imply dependencies when
> device links aren't explicitly calling out dependencies. It's not
> really depending on deferred probe to imply dependency order. Rather,
> it's saying that the order in which devices probe is a better way to
> imply dependency than relying on the order in which devices are added.

Well, it breaks existing setups.

Moreover, I'm not really convinced that it is a better way to "imply
dependencies", it is just different.

> For Thierry's case, fw_devlink=on might solve his problem, but that's
> solving the problem by explicitly calling out the dependency (by
> getting it from DT where the dependency is explicitly called out). For
> implicit cases, we still need his patch. I wonder how

The "implicit" cases are all broken potentially, because the dpm_list
ordering only matters for the order in which PM sleep callbacks are
invoked, and that doesn't help if they are async and it is meaningless
for runtime PM.

> > As it stands today, that commit doesn't improve the situation and it
> > adds overhead and complexity.
>
> I'm okay if we revert it for now, but that doesn't solve the
> overarching ordering issues though.

No, it doesn't and the commit doesn't solve it either which is my point.

The situation before and after the commit is generally the same, even
though the set of affected systems is different in each case, so
because the general strategy of the kernel development is to avoid new
breakage, it should be reverted.

2021-01-12 17:40:50

by Thierry Reding

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, Jan 11, 2021 at 02:44:03PM -0800, Saravana Kannan wrote:
> On Mon, Jan 11, 2021 at 8:57 AM Rafael J. Wysocki <[email protected]> wrote:
> >
> > On Mon, Jan 11, 2021 at 5:12 PM Thierry Reding <[email protected]> wrote:
> > >
> > > On Mon, Jan 11, 2021 at 03:57:37PM +0100, Rafael J. Wysocki wrote:
> > > > On Mon, Jan 11, 2021 at 2:43 PM Thierry Reding <[email protected]> wrote:
> > > > >
> > > > > On Sun, Jan 10, 2021 at 08:44:13PM -0800, Hugh Dickins wrote:
> > > > > > Hi Rafael,
> > > > > >
> > > > > > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > > > > > fails to suspend when running 5.11-rc kernels: bisected to
> > > > > > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > > > > > and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> > > > > > me to switch on a debug option to extract further info if that may help.
> > > > >
> > > > > Hi Hugh,
> > > > >
> > > > > Quoting what I think are the relevant parts of that log:
> > > > >
> > > > > [ 34.373742] printk: Suspending console(s) (use no_console_suspend to debug)
> > > > > [ 34.429015] rmi4_physical rmi4-00: Failed to read irqs, code=-6
> > > > > [ 34.474973] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > > > > [ 34.474994] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > > > > [ 34.475001] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > > > > [ 34.475105] rmi4_smbus 6-002c: Failed to suspend device: -6
> > > > > [ 34.475113] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > > > > [ 34.475130] PM: Device 6-002c failed to suspend: error -6
> > > > > [ 34.475187] PM: Some devices failed to suspend, or early wake event detected
> > > > > [ 34.480324] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > > > > [ 34.480748] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > > > > [ 34.481558] rmi4_physical rmi4-00: rmi_driver_clear_irq_bits: Failed to change enabled interrupts!
> > > > > [ 34.487935] acpi LNXPOWER:02: Turning OFF
> > > > > [ 34.488707] acpi LNXPOWER:01: Turning OFF
> > > > > [ 34.489554] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > > > > [ 34.489669] psmouse: probe of serio2 failed with error -1
> > > > > [ 34.489882] OOM killer enabled.
> > > > > [ 34.489891] Restarting tasks ... done.
> > > > > [ 34.589183] PM: suspend exit
> > > > > [ 34.589839] PM: suspend entry (s2idle)
> > > > > [ 34.605884] Filesystems sync: 0.017 seconds
> > > > > [ 34.607594] Freezing user space processes ... (elapsed 0.006 seconds) done.
> > > > > [ 34.613645] OOM killer disabled.
> > > > > [ 34.613650] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> > > > > [ 34.615482] printk: Suspending console(s) (use no_console_suspend to debug)
> > > > > [ 34.653097] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > > > > [ 34.653108] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > > > > [ 34.653115] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > > > > [ 34.653123] rmi4_smbus 6-002c: Failed to suspend device: -6
> > > > > [ 34.653129] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > > > > [ 34.653160] PM: Device 6-002c failed to suspend: error -6
> > > > > [ 34.653174] PM: Some devices failed to suspend, or early wake event detected
> > > > > [ 34.660515] OOM killer enabled.
> > > > > [ 34.660524] Restarting tasks ...
> > > > > [ 34.661456] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > > > > [ 34.661591] psmouse: probe of serio2 failed with error -1
> > > > > [ 34.669469] done.
> > > > > [ 34.748386] PM: suspend exit
> > > > >
> > > > > I think what might be happening here is that the offending patch causes
> > > > > some devices to be reordered in a way different to how they were ordered
> > > > > originally and the rmi4 driver currently depends on that implicit order.
> > > >
> > > > Actually, the only possible case in which the commit in question can
> > > > introduce suspend failures like this is when some dependency
> > > > information is missing and so the reordering causes the ordering to
> > > > change from the (working) implicit one.
> > > >
> > > > > Interestingly one of the bugs that the offending patch fixes is similar
> > > > > in the failure mode but for the reverse reason: the implicit order
> > > > > causes suspend/resume to fail.
> > > >
> > > > And that happens because some dependency information is missing.
> > > >
> > > > So we have failing cases when dependency information is missing, so
> > > > instead of fixing those we have tried to make the core change the
> > > > ordering after every successful probe in the hope that this will take
> > > > care of the problem without introducing new breakage.
> > > >
> > > > However, it evidently has introduced new breakage and in order to fix
> > > > it we need to figure out what dependency information is missing in the
> > > > failing cases and put that information in, but we may as well do the
> > > > same for the cases that are failing without the offending change.
> > > >
> > > > So why don't we revert the commit in question and do just that?
> > >
> > > Unfortunately it isn't that easy. In fact, all the dependency
> > > information already exists in the case that I cited in 5b6164d3465f
> > > ("driver core: Reorder devices on successful probe"), but it's the
> > > driver core that suspends/resumes the devices in the wrong order.
> > >
> > > The reason is because the ACONNECT device depends on the BPMP device
> > > (via a power-domains property), but it's also instantiated before the
> > > BPMP device (because it is listed earlier in device tree, which is
> > > sorted by unit-address first, then alphabetically). BPMP being a CPU
> > > non-addressable device it doesn't have a unit-address and hence is
> > > listed very late in device tree (by convention). Normally this is would
> > > not be a problem because deferred probe would take care of it. But there
> > > is one corner-case which happens when the BPMP is built into the kernel
> > > (which it usually is, as it provides access to resources necessary for
> > > booting, such as clocks and resets) and ACONNECT is built as a loadable
> > > module. In that case, BPMP gets probed before ACONNECT and hence when
> > > ACONNECT does eventually get loaded, the BPMP is already there, meaning
> > > ACONNECT won't defer probe and hence the DPM suspend/resume order is not
> > > fixed up by the deferred probe code.
> >
> > What about using a device link to enforce the right ordering, then?
> >
> > Deferred probing is not a way to ensure the suitable suspend/resume ordering.
>
> Thierry,
>
> Can you try booting with fw_devlink=on with this series? It's queued
> up for 5.12-rc1
> https://lore.kernel.org/lkml/[email protected]/
>
> It might solve your issue, but I think your patch still addresses a real issue.

I've tried booting next-20210112 on the Jetson AGX Xavier device where
the suspend/resume problem with ACONNECT was happening and even if I
revert the reordering patch, ACONNECT resumes correctly.

Interestingly, and I don't know if that's expected or not, even setting
fw_devlink=off on today's linux-next no longer exhibits the problem. My
impression from going through the fw_devlink code was that switching it
off would disable any reordering that was going on, so I'm not sure I
understand why the initial problem is fixed even in that case.

> > > And that's precisely what the offending commit addresses. However, the
> > > downside is, and we did discuss this during review, that it operates
> > > under the (somewhat optimistic) assumption that all the dependency
> > > information exists. This is because reordering on successful probe can
> > > potentially introduce regressions for dependencies that were previously
> > > implicit. So if a system has component B that depends on component A but
> > > doesn't model that dependency via some child/parent relationship or an
> > > explicit relationship that would be flagged by deferred probe,
> >
> > Again, deferred probing may not help here.
> >
> > > then this implicit dependency can break by the new reordering on successful probe.
> > >
> > > I very much suspect that that's exactly what's going on here. This RMI4
> > > device very likely implicitly depends on some other resource getting
> > > enabled but doesn't properly model that dependency. If we find out what
> > > that dependency is and return -EPROBE_DEFER when that dependency has not
> > > probed yet, then deferred probe will automatically take care of ordering
> > > everything correctly again (or, in fact, ordering by successful probe
> > > will take care of it already because RMI4 would initially fail with
> > > -EPROBE_DEFER).
> > >
> > > Adding Vincent, Jason, Andrew and Lucas (who have recently worked on
> > > this driver), perhaps they have some better understanding of what
> > > missing dependencies might be causing the above errors.
> >
> > IMV it is a mistake to believe that deferred probing can get
> > everything right for you in every case, with or without the offending
> > commit. Sometimes you need to tell the core what the right ordering
> > is and that's what device links are for.
>
> IMHO, Thierry's patch is the right way to imply dependencies when
> device links aren't explicitly calling out dependencies. It's not
> really depending on deferred probe to imply dependency order. Rather,
> it's saying that the order in which devices probe is a better way to
> imply dependency than relying on the order in which devices are added.
>
> For Thierry's case, fw_devlink=on might solve his problem, but that's
> solving the problem by explicitly calling out the dependency (by
> getting it from DT where the dependency is explicitly called out). For
> implicit cases, we still need his patch. I wonder how

Yes, I think reordering on probe is logically the right thing to do.
However, I also understand Rafael's argument that this can now cause
failures that were not there before.

I do think that any failures exposed by this patch are in fact bugs
caused by missing explicit dependencies that are hidden right now by
mere chance because devices are instantiated in an order that makes
them work. If for any reason the instantiation order changes these
can easily break again.

> > As it stands today, that commit doesn't improve the situation and it
> > adds overhead and complexity.
>
> I'm okay if we revert it for now, but that doesn't solve the
> overarching ordering issues though.

Given that the ACONNECT problem seems to have been solved I don't have
any objections to reverting this either. On the other hand, perhaps we
can stage this in by tying it to a command-line option? I'm not sure how
helpful that would be, because if it isn't a default, people are
unlikely to go and test with it enabled and report issues.

I suppose it's also not a perfect solution because it doesn't weed out
all of the missing dependencies. Besides the implied dependencies that
can now get jumbled, I'm sure there are lots of implied dependencies
that won't get jumbled and therefore will still go unnoticed.

And perhaps that's fine. We only need to fix the bugs that are actually
causing problems, right?

Thierry


Attachments:
(No filename) (11.41 kB)
signature.asc (849.00 B)
Download all attachments

2021-01-12 18:01:39

by Thierry Reding

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

On Mon, Jan 11, 2021 at 05:57:17PM +0100, Rafael J. Wysocki wrote:
> On Mon, Jan 11, 2021 at 5:12 PM Thierry Reding <[email protected]> wrote:
> >
> > On Mon, Jan 11, 2021 at 03:57:37PM +0100, Rafael J. Wysocki wrote:
> > > On Mon, Jan 11, 2021 at 2:43 PM Thierry Reding <[email protected]> wrote:
> > > >
> > > > On Sun, Jan 10, 2021 at 08:44:13PM -0800, Hugh Dickins wrote:
> > > > > Hi Rafael,
> > > > >
> > > > > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > > > > fails to suspend when running 5.11-rc kernels: bisected to
> > > > > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > > > > and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> > > > > me to switch on a debug option to extract further info if that may help.
> > > >
> > > > Hi Hugh,
> > > >
> > > > Quoting what I think are the relevant parts of that log:
> > > >
> > > > [ 34.373742] printk: Suspending console(s) (use no_console_suspend to debug)
> > > > [ 34.429015] rmi4_physical rmi4-00: Failed to read irqs, code=-6
> > > > [ 34.474973] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > > > [ 34.474994] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > > > [ 34.475001] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > > > [ 34.475105] rmi4_smbus 6-002c: Failed to suspend device: -6
> > > > [ 34.475113] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > > > [ 34.475130] PM: Device 6-002c failed to suspend: error -6
> > > > [ 34.475187] PM: Some devices failed to suspend, or early wake event detected
> > > > [ 34.480324] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > > > [ 34.480748] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > > > [ 34.481558] rmi4_physical rmi4-00: rmi_driver_clear_irq_bits: Failed to change enabled interrupts!
> > > > [ 34.487935] acpi LNXPOWER:02: Turning OFF
> > > > [ 34.488707] acpi LNXPOWER:01: Turning OFF
> > > > [ 34.489554] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > > > [ 34.489669] psmouse: probe of serio2 failed with error -1
> > > > [ 34.489882] OOM killer enabled.
> > > > [ 34.489891] Restarting tasks ... done.
> > > > [ 34.589183] PM: suspend exit
> > > > [ 34.589839] PM: suspend entry (s2idle)
> > > > [ 34.605884] Filesystems sync: 0.017 seconds
> > > > [ 34.607594] Freezing user space processes ... (elapsed 0.006 seconds) done.
> > > > [ 34.613645] OOM killer disabled.
> > > > [ 34.613650] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> > > > [ 34.615482] printk: Suspending console(s) (use no_console_suspend to debug)
> > > > [ 34.653097] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > > > [ 34.653108] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > > > [ 34.653115] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > > > [ 34.653123] rmi4_smbus 6-002c: Failed to suspend device: -6
> > > > [ 34.653129] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > > > [ 34.653160] PM: Device 6-002c failed to suspend: error -6
> > > > [ 34.653174] PM: Some devices failed to suspend, or early wake event detected
> > > > [ 34.660515] OOM killer enabled.
> > > > [ 34.660524] Restarting tasks ...
> > > > [ 34.661456] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > > > [ 34.661591] psmouse: probe of serio2 failed with error -1
> > > > [ 34.669469] done.
> > > > [ 34.748386] PM: suspend exit
> > > >
> > > > I think what might be happening here is that the offending patch causes
> > > > some devices to be reordered in a way different to how they were ordered
> > > > originally and the rmi4 driver currently depends on that implicit order.
> > >
> > > Actually, the only possible case in which the commit in question can
> > > introduce suspend failures like this is when some dependency
> > > information is missing and so the reordering causes the ordering to
> > > change from the (working) implicit one.
> > >
> > > > Interestingly one of the bugs that the offending patch fixes is similar
> > > > in the failure mode but for the reverse reason: the implicit order
> > > > causes suspend/resume to fail.
> > >
> > > And that happens because some dependency information is missing.
> > >
> > > So we have failing cases when dependency information is missing, so
> > > instead of fixing those we have tried to make the core change the
> > > ordering after every successful probe in the hope that this will take
> > > care of the problem without introducing new breakage.
> > >
> > > However, it evidently has introduced new breakage and in order to fix
> > > it we need to figure out what dependency information is missing in the
> > > failing cases and put that information in, but we may as well do the
> > > same for the cases that are failing without the offending change.
> > >
> > > So why don't we revert the commit in question and do just that?
> >
> > Unfortunately it isn't that easy. In fact, all the dependency
> > information already exists in the case that I cited in 5b6164d3465f
> > ("driver core: Reorder devices on successful probe"), but it's the
> > driver core that suspends/resumes the devices in the wrong order.
> >
> > The reason is because the ACONNECT device depends on the BPMP device
> > (via a power-domains property), but it's also instantiated before the
> > BPMP device (because it is listed earlier in device tree, which is
> > sorted by unit-address first, then alphabetically). BPMP being a CPU
> > non-addressable device it doesn't have a unit-address and hence is
> > listed very late in device tree (by convention). Normally this is would
> > not be a problem because deferred probe would take care of it. But there
> > is one corner-case which happens when the BPMP is built into the kernel
> > (which it usually is, as it provides access to resources necessary for
> > booting, such as clocks and resets) and ACONNECT is built as a loadable
> > module. In that case, BPMP gets probed before ACONNECT and hence when
> > ACONNECT does eventually get loaded, the BPMP is already there, meaning
> > ACONNECT won't defer probe and hence the DPM suspend/resume order is not
> > fixed up by the deferred probe code.
>
> What about using a device link to enforce the right ordering, then?

I was going to implement that, but then I realized that the specific
problem I was facing with ACONNECT had been solved differently in the
meantime. I wasn't able to exactly pinpoint what fixed it, but I suspect
it might have been some of Saravana's fw_devlink code. It's a bit
difficult to find out what exactly changed, because it happened after
the offending commit was already merged, so I would have to go through
all linux-next releases since early December and revert my patch to find
out when the change happened and then bisect which change exactly did
it.

But yes, using a device link would've done the trick as well. However
the idea had been to potentially fix many more subtle cases like the one
we faced in ACONNECT at the same time. It's unfortunate that it breaks a
bunch of other cases that apparently are also missing dependency
information and just happen to work fine with the status quo.

> Deferred probing is not a way to ensure the suitable suspend/resume ordering.

Well, it is in the majority of cases because deferred probe causes the
reordering of the suspend/resume queue. And that all makes sense because
suppliers should always be suspended after all their consumers.

But that's not what the offending patch was doing. The purpose was to
ensure that the default suspend/resume ordering matches the probe order,
which is both more logical (though I suppose that can be subjective) and
ensures that deferred probing can work properly for all cases.

Anyway, this is ultimately just switching out one set of broken cases
for another, so might as well revert the offending patch and concentrate
on fixing the broken cases one by one as we find them.

Like I said, I'm slightly concerned about drivers like rmi4 breaking
unexpectedly down the road because some other patch caused the ordering
to change, so reverting now may just be putting off the inevitable. But
you're obviously right that we shouldn't randomly break working setups,
so I'm fine with the revert.

Thierry


Attachments:
(No filename) (8.42 kB)
signature.asc (849.00 B)
Download all attachments

2021-01-13 02:33:30

by Saravana Kannan

[permalink] [raw]
Subject: Re: 5.11-rc device reordering breaks ThinkPad rmi4 suspend

I'm just going to combine my response to the 2-3 emails in this one response.

On Tue, Jan 12, 2021 at 9:57 AM Thierry Reding <[email protected]> wrote:
>
> On Mon, Jan 11, 2021 at 05:57:17PM +0100, Rafael J. Wysocki wrote:
> > On Mon, Jan 11, 2021 at 5:12 PM Thierry Reding <[email protected]> wrote:
> > >
> > > On Mon, Jan 11, 2021 at 03:57:37PM +0100, Rafael J. Wysocki wrote:
> > > > On Mon, Jan 11, 2021 at 2:43 PM Thierry Reding <[email protected]> wrote:
> > > > >
> > > > > On Sun, Jan 10, 2021 at 08:44:13PM -0800, Hugh Dickins wrote:
> > > > > > Hi Rafael,
> > > > > >
> > > > > > Synaptics RMI4 SMBus touchpad on ThinkPad X1 Carbon (5th generation)
> > > > > > fails to suspend when running 5.11-rc kernels: bisected to
> > > > > > 5b6164d3465f ("driver core: Reorder devices on successful probe"),
> > > > > > and reverting that fixes it. dmesg.xz attached, but go ahead and ask
> > > > > > me to switch on a debug option to extract further info if that may help.
> > > > >
> > > > > Hi Hugh,
> > > > >
> > > > > Quoting what I think are the relevant parts of that log:
> > > > >
> > > > > [ 34.373742] printk: Suspending console(s) (use no_console_suspend to debug)
> > > > > [ 34.429015] rmi4_physical rmi4-00: Failed to read irqs, code=-6
> > > > > [ 34.474973] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > > > > [ 34.474994] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > > > > [ 34.475001] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > > > > [ 34.475105] rmi4_smbus 6-002c: Failed to suspend device: -6
> > > > > [ 34.475113] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > > > > [ 34.475130] PM: Device 6-002c failed to suspend: error -6
> > > > > [ 34.475187] PM: Some devices failed to suspend, or early wake event detected
> > > > > [ 34.480324] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > > > > [ 34.480748] rmi4_f03 rmi4-00.fn03: rmi_f03_pt_write: Failed to write to F03 TX register (-6).
> > > > > [ 34.481558] rmi4_physical rmi4-00: rmi_driver_clear_irq_bits: Failed to change enabled interrupts!
> > > > > [ 34.487935] acpi LNXPOWER:02: Turning OFF
> > > > > [ 34.488707] acpi LNXPOWER:01: Turning OFF
> > > > > [ 34.489554] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > > > > [ 34.489669] psmouse: probe of serio2 failed with error -1
> > > > > [ 34.489882] OOM killer enabled.
> > > > > [ 34.489891] Restarting tasks ... done.
> > > > > [ 34.589183] PM: suspend exit
> > > > > [ 34.589839] PM: suspend entry (s2idle)
> > > > > [ 34.605884] Filesystems sync: 0.017 seconds
> > > > > [ 34.607594] Freezing user space processes ... (elapsed 0.006 seconds) done.
> > > > > [ 34.613645] OOM killer disabled.
> > > > > [ 34.613650] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> > > > > [ 34.615482] printk: Suspending console(s) (use no_console_suspend to debug)
> > > > > [ 34.653097] rmi4_f01 rmi4-00.fn01: Failed to write sleep mode: -6.
> > > > > [ 34.653108] rmi4_f01 rmi4-00.fn01: Suspend failed with code -6.
> > > > > [ 34.653115] rmi4_physical rmi4-00: Failed to suspend functions: -6
> > > > > [ 34.653123] rmi4_smbus 6-002c: Failed to suspend device: -6
> > > > > [ 34.653129] PM: dpm_run_callback(): rmi_smb_suspend+0x0/0x3c returns -6
> > > > > [ 34.653160] PM: Device 6-002c failed to suspend: error -6
> > > > > [ 34.653174] PM: Some devices failed to suspend, or early wake event detected
> > > > > [ 34.660515] OOM killer enabled.
> > > > > [ 34.660524] Restarting tasks ...
> > > > > [ 34.661456] rmi4_physical rmi4-00: rmi_driver_set_irq_bits: Failed to change enabled interrupts!
> > > > > [ 34.661591] psmouse: probe of serio2 failed with error -1
> > > > > [ 34.669469] done.
> > > > > [ 34.748386] PM: suspend exit
> > > > >
> > > > > I think what might be happening here is that the offending patch causes
> > > > > some devices to be reordered in a way different to how they were ordered
> > > > > originally and the rmi4 driver currently depends on that implicit order.
> > > >
> > > > Actually, the only possible case in which the commit in question can
> > > > introduce suspend failures like this is when some dependency
> > > > information is missing and so the reordering causes the ordering to
> > > > change from the (working) implicit one.
> > > >
> > > > > Interestingly one of the bugs that the offending patch fixes is similar
> > > > > in the failure mode but for the reverse reason: the implicit order
> > > > > causes suspend/resume to fail.
> > > >
> > > > And that happens because some dependency information is missing.
> > > >
> > > > So we have failing cases when dependency information is missing, so
> > > > instead of fixing those we have tried to make the core change the
> > > > ordering after every successful probe in the hope that this will take
> > > > care of the problem without introducing new breakage.
> > > >
> > > > However, it evidently has introduced new breakage and in order to fix
> > > > it we need to figure out what dependency information is missing in the
> > > > failing cases and put that information in, but we may as well do the
> > > > same for the cases that are failing without the offending change.
> > > >
> > > > So why don't we revert the commit in question and do just that?
> > >
> > > Unfortunately it isn't that easy. In fact, all the dependency
> > > information already exists in the case that I cited in 5b6164d3465f
> > > ("driver core: Reorder devices on successful probe"), but it's the
> > > driver core that suspends/resumes the devices in the wrong order.
> > >
> > > The reason is because the ACONNECT device depends on the BPMP device
> > > (via a power-domains property), but it's also instantiated before the
> > > BPMP device (because it is listed earlier in device tree, which is
> > > sorted by unit-address first, then alphabetically). BPMP being a CPU
> > > non-addressable device it doesn't have a unit-address and hence is
> > > listed very late in device tree (by convention). Normally this is would
> > > not be a problem because deferred probe would take care of it. But there
> > > is one corner-case which happens when the BPMP is built into the kernel
> > > (which it usually is, as it provides access to resources necessary for
> > > booting, such as clocks and resets) and ACONNECT is built as a loadable
> > > module. In that case, BPMP gets probed before ACONNECT and hence when
> > > ACONNECT does eventually get loaded, the BPMP is already there, meaning
> > > ACONNECT won't defer probe and hence the DPM suspend/resume order is not
> > > fixed up by the deferred probe code.
> >
> > What about using a device link to enforce the right ordering, then?
>
> I was going to implement that, but then I realized that the specific
> problem I was facing with ACONNECT had been solved differently in the
> meantime. I wasn't able to exactly pinpoint what fixed it, but I suspect
> it might have been some of Saravana's fw_devlink code.

I'd be very surprised if fw_devlink code fixed anything for you when
you don't see the issue with fw_devlink=off/permissive. In those two
modes, fw_devlink is pretty much a NOP when it comes to probe/suspend
ordering.

> It's a bit
> difficult to find out what exactly changed, because it happened after
> the offending commit was already merged, so I would have to go through
> all linux-next releases since early December and revert my patch to find
> out when the change happened and then bisect which change exactly did
> it.
>
> But yes, using a device link would've done the trick as well. However
> the idea had been to potentially fix many more subtle cases like the one
> we faced in ACONNECT at the same time. It's unfortunate that it breaks a
> bunch of other cases that apparently are also missing dependency
> information and just happen to work fine with the status quo.
>
> > Deferred probing is not a way to ensure the suitable suspend/resume ordering.
>
> Well, it is in the majority of cases because deferred probe causes the
> reordering of the suspend/resume queue. And that all makes sense because
> suppliers should always be suspended after all their consumers.
>
> But that's not what the offending patch was doing. The purpose was to
> ensure that the default suspend/resume ordering matches the probe order,
> which is both more logical (though I suppose that can be subjective)

I agree with you too that the suspend order matching the probe order
is more logical than device add order. However, I'm now completely
changing my position on the patch (the one that reorder on probe). I
think we should revert it because reordering on probe has a bunch of
problems. Take this example:

1. Device-A probe() starts running.
a. Device-A probe adds two devices - Device-B and Device-C. But
incorrectly doesn't mark Device-A as their parents.
b. Say the driver for B has already been loaded, so then
Device-B's probe() runs.
c. Device-B is reordered to the end of list dpm_list (and so are
it's children).
4. Device-A probe() completes.
5. Device-A is reordered to the end of dpm_list but device-B and
device-C are incorrectly NOT reorder to come after device-A.

And this gets worse when async probing is enabled or if a driver-B is
registered by another thread during Device-A probe. The same problem
also exists for supplier-consumer relationships. A consumer's probe
can finish before the supplier's probe. This is because the supplier
would have registered with the frameworks before the supplier's probe
completes. And that's all that's needed for the consumer's probe to
finish.

There's no way to fix this without the complete dependency info
(parent-child _and_ supplier-consumer info) being shared with the
driver core. So, even the reordering of dpm_list after a deferred
probe is broken/wrong and it's pretty racy. It just happens to work
for what's already working with the current boot timing in those
devices. If the complete dependency info was provided, we could delete
the reorder after the deferred probe. But that's obviously not going
to happen for existing devices and we just have to live with the code
being there.

So for future issues related to ordering, we should just focus on
fixing/adding parent-child relationships (Eg: Hugh's rmi4 driver
issue) or adding more device links (Eg: Thierry's issue).

As for adding device links, hopefully fw_devlink=on as default will
happen soon and at least take care of this for DT based systems.
Hopefully someone can add fw_devlink support for ACPI (it's closer to
the bottom of my long TODO list).


> and
> ensures that deferred probing can work properly for all cases.
>
> Anyway, this is ultimately just switching out one set of broken cases
> for another, so might as well revert the offending patch and concentrate
> on fixing the broken cases one by one as we find them.

Can you see if fw_devlink=on boots for you? If so, that's the fix. If
it doesn't, let's fix it to make sure it works for you. In the
linux-next tree, it should be limited to driver fixes (no changes
needed to the fw_devlink code). Can you make it the default for your
test platform (add it to your command line) so that it doesn't break
in the future if it's already working?

> Like I said, I'm slightly concerned about drivers like rmi4 breaking
> unexpectedly down the road

I think we should fix these too as we find them. At a minimum the rmi4
driver isn't capturing parent/child relationship correctly.

-Saravana

> because some other patch caused the ordering
> to change, so reverting now may just be putting off the inevitable. But
> you're obviously right that we shouldn't randomly break working setups,
> so I'm fine with the revert.
>
> Thierry