Hi,
Since 6.8-rc1 the internal eDP display on the Lenovo ThinkPad X13s does
not always show up on boot.
The logs indicate problems with the runtime PM and eDP rework that went
into 6.8-rc1:
[ 6.006236] Console: switching to colour dummy device 80x25
[ 6.007542] [drm:dpu_kms_hw_init:1048] dpu hardware revision:0x80000000
[ 6.007872] [drm:drm_bridge_attach [drm]] *ERROR* failed to attach bridge /soc@0/phy@88eb000 to encoder TMDS-31: -16
[ 6.007934] [drm:dp_bridge_init [msm]] *ERROR* failed to attach panel bridge: -16
[ 6.007983] msm_dpu ae01000.display-controller: [drm:msm_dp_modeset_init [msm]] *ERROR* failed to create dp bridge: -16
[ 6.008030] [drm:_dpu_kms_initialize_displayport:588] [dpu error]modeset_init failed for DP, rc = -16
[ 6.008050] [drm:_dpu_kms_setup_displays:681] [dpu error]initialize_DP failed, rc = -16
[ 6.008068] [drm:dpu_kms_hw_init:1153] [dpu error]modeset init failed: -16
[ 6.008388] msm_dpu ae01000.display-controller: [drm:msm_drm_kms_init [msm]] *ERROR* kms hw init failed: -16
and this can also manifest itself as a NULL-pointer dereference:
[ 7.339447] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[ 7.643705] pc : drm_bridge_attach+0x70/0x1a8 [drm]
[ 7.686415] lr : drm_aux_bridge_attach+0x24/0x38 [aux_bridge]
[ 7.769039] Call trace:
[ 7.771564] drm_bridge_attach+0x70/0x1a8 [drm]
[ 7.776234] drm_aux_bridge_attach+0x24/0x38 [aux_bridge]
[ 7.781782] drm_bridge_attach+0x80/0x1a8 [drm]
[ 7.786454] dp_bridge_init+0xa8/0x15c [msm]
[ 7.790856] msm_dp_modeset_init+0x28/0xc4 [msm]
[ 7.795617] _dpu_kms_drm_obj_init+0x19c/0x680 [msm]
[ 7.800731] dpu_kms_hw_init+0x348/0x4c4 [msm]
[ 7.805306] msm_drm_kms_init+0x84/0x324 [msm]
[ 7.809891] msm_drm_bind+0x1d8/0x3a8 [msm]
[ 7.814196] try_to_bring_up_aggregate_device+0x1f0/0x2f8
[ 7.819747] __component_add+0xa4/0x18c
[ 7.823703] component_add+0x14/0x20
[ 7.827389] dp_display_probe+0x47c/0x568 [msm]
[ 7.832052] platform_probe+0x68/0xd8
Users have also reported random crashes at boot since 6.8-rc1, and I've
been able to trigger hard crashes twice when testing an external display
(USB-C/DP), which may also be related to the DP regressions.
I've opened an issue here:
https://gitlab.freedesktop.org/drm/msm/-/issues/51
but I also want Thorsten's help to track this so that it gets fixed
before 6.8 is released.
#regzbot introduced: v6.7..v6.8-rc1
The following series is likely the culprit:
https://lore.kernel.org/all/[email protected]/
Johan
Hi Johan
Thanks for the report.
I do agree that pm runtime eDP driver got merged that time but I think
the issue is either a combination of that along with DRM aux bridge
https://patchwork.freedesktop.org/series/122584/ OR just the latter as
even that went in around the same time.
That's perhaps why this issue was not seen on the Chromebooks we tested,
as they do not use pmic_glink (aux bridge).
So we will need to debug this on sc8280xp specifically or an equivalent
device which uses aux bridge.
Thanks
Abhinav
On 13.02.24 19:00, Abhinav Kumar wrote:
>
> Thanks for the report.
>
> I do agree that pm runtime eDP driver got merged that time but I think
> the issue is either a combination of that along with DRM aux bridge
> https://patchwork.freedesktop.org/series/122584/ OR just the latter as
> even that went in around the same time.
In that case allow me a stupid question from the cheap seats:
Is there anything affected users can do to help get us closer to the
real problem? Like testing a specific commit or two before or after the
merge of one of those features for example? That might help to rule out
a few things.
Ciao, Thorsten
On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:
> I do agree that pm runtime eDP driver got merged that time but I think
> the issue is either a combination of that along with DRM aux bridge
> https://patchwork.freedesktop.org/series/122584/ OR just the latter as
> even that went in around the same time.
Yes, indeed there were a lot of changes that went into the MSM DRM driver
in 6.8-rc1, and since I have not tried to debug this myself I can't say
for sure which change or changes triggered this regression (or
possibly regressions).
The fact that the USB-C/DP PHY appears to be involved
(/soc@0/phy@88eb000) could indeed point to the series you mentioned.
> That's perhaps why this issue was not seen on the Chromebooks we tested,
> as they do not use pmic_glink (aux bridge).
>
> So we will need to debug this on sc8280xp specifically or an equivalent
> device which uses aux bridge.
I've hit the NULL-pointer dereference three times now in the last few days
on the sc8280xp CRD. But since it doesn't trigger on every boot it seems
you need to go back to the series that could potentially have caused
this regression and review them again. There's clearly something quite
broken here.
Johan
On Wed, Feb 14, 2024 at 02:52:06PM +0100, Johan Hovold wrote:
> On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:
>
> > I do agree that pm runtime eDP driver got merged that time but I think
> > the issue is either a combination of that along with DRM aux bridge
> > https://patchwork.freedesktop.org/series/122584/ OR just the latter as
> > even that went in around the same time.
>
> Yes, indeed there were a lot of changes that went into the MSM DRM driver
> in 6.8-rc1, and since I have not tried to debug this myself I can't say
> for sure which change or changes triggered this regression (or
> possibly regressions).
>
> The fact that the USB-C/DP PHY appears to be involved
> (/soc@0/phy@88eb000) could indeed point to the series you mentioned.
>
> > That's perhaps why this issue was not seen on the Chromebooks we tested,
> > as they do not use pmic_glink (aux bridge).
> >
> > So we will need to debug this on sc8280xp specifically or an equivalent
> > device which uses aux bridge.
>
> I've hit the NULL-pointer dereference three times now in the last few days
> on the sc8280xp CRD. But since it doesn't trigger on every boot it seems
> you need to go back to the series that could potentially have caused
> this regression and review them again. There's clearly something quite
> broken here.
Since Dmitry had trouble reproducing this issue I took a closer look at
the DRM aux bridge series that Abhinav pointed to and was able to track
down the bridge regressions and come up with a reproducer. I just posted
a series fixing this here:
https://lore.kernel.org/lkml/[email protected]/
As I mentioned in the cover letter, I am still seeing intermittent hard
resets around the time that the DRM subsystem is initialising, which
suggests that we may be dealing with two separate DRM regressions here
however.
If the hard resets are triggered by something like unclocked hardware,
perhaps that bit could be related to the runtime PM rework?
Johan
On Tue, Feb 13, 2024 at 12:42:17PM +0100, Johan Hovold wrote:
> Since 6.8-rc1 the internal eDP display on the Lenovo ThinkPad X13s does
> not always show up on boot.
>
> The logs indicate problems with the runtime PM and eDP rework that went
> into 6.8-rc1:
>
> [ 6.007872] [drm:drm_bridge_attach [drm]] *ERROR* failed to attach bridge /soc@0/phy@88eb000 to encoder TMDS-31: -16
> and this can also manifest itself as a NULL-pointer dereference:
>
> [ 7.339447] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
>
> [ 7.643705] pc : drm_bridge_attach+0x70/0x1a8 [drm]
#regzbot ^introduced: 2bcca96abfbf
It looks like it may have been possible to hit this also before commit
2bcca96abfbf ("soc: qcom: pmic-glink: switch to DRM_AUX_HPD_BRIDGE") and
the transparent bridge rework in 6.8-rc1 even if that has not yet been
confirmed.
The above is what made this trigger since 6.8-rc1 however.
Johan
On Sat, Feb 17, 2024 at 04:14:58PM +0100, Johan Hovold wrote:
> On Wed, Feb 14, 2024 at 02:52:06PM +0100, Johan Hovold wrote:
> > On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:
> Since Dmitry had trouble reproducing this issue I took a closer look at
> the DRM aux bridge series that Abhinav pointed to and was able to track
> down the bridge regressions and come up with a reproducer. I just posted
> a series fixing this here:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> As I mentioned in the cover letter, I am still seeing intermittent hard
> resets around the time that the DRM subsystem is initialising, which
> suggests that we may be dealing with two separate DRM regressions here
> however.
>
> If the hard resets are triggered by something like unclocked hardware,
> perhaps that bit could be related to the runtime PM rework?
It seems my initial suspicion that at least some of these regressions
were related to the runtime PM work was correct. The hard resets happen
when the DP controller is runtime suspended after being probed:
[ 16.748475] bus: 'platform': __driver_probe_device: matched device ae00000.display-subsystem with driver msm-mdss
[ 16.759444] msm-mdss ae00000.display-subsystem: Adding to iommu group 21
[ 16.795226] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu
[ 16.807542] probe of ae01000.display-controller returned -517 after 3 usecs
[ 16.821552] bus: 'platform': __driver_probe_device: matched device ae90000.displayport-controller with driver msm-dp-display
[ 16.837749] probe of ae90000.displayport-controller returned -517 after 1 usecs
[ OK ] Listening on Load/Save RF Kill Swit[ 16.854659] bus: 'platform': __dch Status /dev/rfkill Watch.
[ 16.868458] probe of ae98000.displayport-controller returned -517 after 2 usecs
[ 16.880012] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
[ 16.891856] probe of aea0000.displayport-controller returned -517 after 2 usecs
[ 16.903825] probe of ae00000.display-subsystem returned 0 after 144497 usecs
[ 16.911636] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu
[ 16.942092] probe of ae01000.display-controller returned 0 after 19593 usecs
Starting Load/Save Screen Backligh…rightness[ 16.959146] bus: 'platform': _ of backlight:backlight...
[ 16.995355] msm-dp-display ae90000.displayport-controller: dp_display_probe - probe tail
[ 17.004032] probe of ae90000.displayport-controller returned 0 after 30225 usecs
[ 17.012308] bus: 'platform': __driver_probe_device: matched device ae98000.displayport-controller with driver msm-dp-display
[ 17.050193] msm-dp-display ae98000.displayport-controller: dp_display_probe - probe tail
Starting Network Name Resolution...
[ 17.058925] probe of ae98000.displayport-controller returned 0 after 34774 usecs
[ 17.074925] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
[ Starting Network Time Synchronization...
[ 17.112000] msm-dp-display aea0000.displayport-controller: dp_display_probe - populate aux bus
[ 17.125208] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_resume
Starting Record System Boot/Shutdown in UTMP...
Starting Virtual Console Setup...
[ OK ] Finished Load/Save Screen Backlight Brightness of backlight:backlight.
[ 17.197909] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_suspend
[ 17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - Optional Info
Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic
S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
S - IMAGE_VARIANT_STRING=SocMakenaWP
S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92
< machine is reset by hypervisor >
Presumably the reset happens when the controller is being shut down while
still being used by the EFI framebuffer.
In the cases where the machine survives boot, the controller is never
suspended.
When investigating this I've also seen intermittent:
[drm:dp_display_probe [msm]] *ERROR* device tree parsing failed
which also appears to be related to the runtime PM rework:
https://lore.kernel.org/lkml/[email protected]/
I believe this is enough evidence to conclude that this second
regression is introduced by commit 5814b8bf086a ("drm/msm/dp:
incorporate pm_runtime framework into DP driver"):
#regzbot introduced: 5814b8bf086a
Has anyone given some thought to how the framebuffer handover is
supposed to work? It seems we're currently just relying on luck with
timing.
Johan
On Mon, Feb 19, 2024 at 11:41:41AM +0100, Johan Hovold wrote:
> It seems my initial suspicion that at least some of these regressions
> were related to the runtime PM work was correct. The hard resets happen
> when the DP controller is runtime suspended after being probed:
> [ 17.074925] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
> [ 17.112000] msm-dp-display aea0000.displayport-controller: dp_display_probe - populate aux bus
> [ 17.125208] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_resume
> [ 17.197909] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_suspend
> [ 17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - Optional Info
> Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic
> S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
> S - IMAGE_VARIANT_STRING=SocMakenaWP
> S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92
>
> < machine is reset by hypervisor >
>
> Presumably the reset happens when the controller is being shut down while
> still being used by the EFI framebuffer.
>
> In the cases where the machine survives boot, the controller is never
> suspended.
>
> When investigating this I've also seen intermittent:
>
> [drm:dp_display_probe [msm]] *ERROR* device tree parsing failed
Note that there are further indications there may be more than one bug
here too.
I definitely see hard resets when dp_pm_runtime_suspend() is shutting
down the eDP PHY, but there are occasional resets also if I instrument
DP controller probe() to resume and then prevent the controller from
suspending until after a timeout (e.g. to be used as a temporary
workaround):
[ 15.676495] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
[ 15.769392] msm-dp-display aea0000.displayport-controller: dp_display_probe - populate aux bus
[ 15.778808] msm-dp-display aea0000.displayport-controller: dp_display_probe - scheduling handover
[ 15.789931] probe of aea0000.displayport-controller returned 0 after 91121 usecs
[ 15.790460] bus: 'dp-aux': __driver_probe_device: matched device aux-aea0000.displayport-controller with driver panel-simple-dp-aux
Format: Log Type - Time(microsec) - Message - Optional Info
Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic
S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
I'll wait for the maintainers and authors of this code to comment, but
it seems the runtime PM work is broken in multiple ways.
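To make the workaround idea above concrete, here is a toy userspace
model of a runtime PM usage counter. It is only a sketch and not kernel
code: all of the model_* names are hypothetical, the real runtime PM
core also has autosuspend delays and callbacks that are ignored here,
and "handover" simply stands for whatever point the firmware
framebuffer is no longer in use.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of a runtime PM usage counter; not kernel code. */
struct model_dev {
	int usage;	/* outstanding "get" references */
	bool active;	/* true while the device is resumed */
	int suspends;	/* number of times the device has been suspended */
};

/* Take a reference, resuming the device if it was suspended. */
static void model_pm_get(struct model_dev *d)
{
	d->usage++;
	d->active = true;
}

/* Drop a reference; in this model the last put suspends immediately. */
static void model_pm_put(struct model_dev *d)
{
	assert(d->usage > 0);	/* catch unbalanced puts */
	if (--d->usage == 0) {
		d->active = false;
		d->suspends++;
	}
}

/*
 * Model of a probe that resumes the device only while it populates the
 * aux bus. With hold_for_handover set, probe keeps one extra reference
 * that the caller has to drop once handover is done.
 */
static void model_probe(struct model_dev *d, bool hold_for_handover)
{
	if (hold_for_handover)
		model_pm_get(d);
	model_pm_get(d);	/* resume for aux bus population */
	model_pm_put(d);	/* done with the aux bus */
}
```

In this model, model_probe(&dev, false) leaves the device suspended as
soon as probe returns, which is the resume/suspend pattern seen in the
logs, while model_probe(&dev, true) keeps it active until the caller
issues the final model_pm_put().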
Johan
Hi Johan
On 2/19/2024 2:41 AM, Johan Hovold wrote:
> On Sat, Feb 17, 2024 at 04:14:58PM +0100, Johan Hovold wrote:
>> On Wed, Feb 14, 2024 at 02:52:06PM +0100, Johan Hovold wrote:
>>> On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote:
>
>> Since Dmitry had trouble reproducing this issue I took a closer look at
>> the DRM aux bridge series that Abhinav pointed to and was able to track
>> down the bridge regressions and come up with a reproducer. I just posted
>> a series fixing this here:
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> As I mentioned in the cover letter, I am still seeing intermittent hard
>> resets around the time that the DRM subsystem is initialising, which
>> suggests that we may be dealing with two separate DRM regressions here
>> however.
>>
>> If the hard resets are triggered by something like unclocked hardware,
>> perhaps that bit could be related to the runtime PM rework?
>
> It seems my initial suspicion that at least some of these regressions
> were related to the runtime PM work was correct. The hard resets happen
> when the DP controller is runtime suspended after being probed:
>
> [ 16.748475] bus: 'platform': __driver_probe_device: matched device ae00000.display-subsystem with driver msm-mdss
> [ 16.759444] msm-mdss ae00000.display-subsystem: Adding to iommu group 21
> [ 16.795226] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu
> [ 16.807542] probe of ae01000.display-controller returned -517 after 3 usecs
> [ 16.821552] bus: 'platform': __driver_probe_device: matched device ae90000.displayport-controller with driver msm-dp-display
> [ 16.837749] probe of ae90000.displayport-controller returned -517 after 1 usecs
> [ OK ] Listening on Load/Save RF Kill Swit[ 16.854659] bus: 'platform': __dch Status /dev/rfkill Watch.
> [ 16.868458] probe of ae98000.displayport-controller returned -517 after 2 usecs
> [ 16.880012] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
> [ 16.891856] probe of aea0000.displayport-controller returned -517 after 2 usecs
> [ 16.903825] probe of ae00000.display-subsystem returned 0 after 144497 usecs
> [ 16.911636] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu
> [ 16.942092] probe of ae01000.display-controller returned 0 after 19593 usecs
> Starting Load/Save Screen Backligh…rightness[ 16.959146] bus: 'platform': _ of backlight:backlight...
> [ 16.995355] msm-dp-display ae90000.displayport-controller: dp_display_probe - probe tail
> [ 17.004032] probe of ae90000.displayport-controller returned 0 after 30225 usecs
> [ 17.012308] bus: 'platform': __driver_probe_device: matched device ae98000.displayport-controller with driver msm-dp-display
> [ 17.050193] msm-dp-display ae98000.displayport-controller: dp_display_probe - probe tail
> Starting Network Name Resolution...
> [ 17.058925] probe of ae98000.displayport-controller returned 0 after 34774 usecs
> [ 17.074925] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
> [ Starting Network Time Synchronization...
> [ 17.112000] msm-dp-display aea0000.displayport-controller: dp_display_probe - populate aux bus
> [ 17.125208] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_resume
> Starting Record System Boot/Shutdown in UTMP...
> Starting Virtual Console Setup...
> [ OK ] Finished Load/Save Screen Backlight Brightness of backlight:backlight.
> [ 17.197909] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_suspend
> [ 17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - Optional Info
> Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic
> S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
> S - IMAGE_VARIANT_STRING=SocMakenaWP
> S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92
>
> < machine is reset by hypervisor >
>
> Presumably the reset happens when the controller is being shut down while
> still being used by the EFI framebuffer.
>
I am not sure we can conclude that. Even if we shut off the
controller while the framebuffer was still being fetched, that should only
cause a blank screen and not a reset, because we don't trigger a new
register write or read while it's fetching, so there is no new
hardware access.
One thing I must accept is that there are two differences between
sc8280xp where we are hitting these resets and sc7180/sc7280 chromebooks
where we tested it more thoroughly without any such issues:
1) with the chromebooks we have depthcharge and not the QC UEFI.
If we are suspecting a hand-off issue here, will it help if we try to
disable the display in EFI by using "fastboot oem select-display-panel
none" (assuming this is a fastboot enabled device) and see if you still
hit the reset issue?
2) chromebooks used "internal_hpd" whereas the pmic_glink method is used on
the sc8280xp.
I am still checking if there are any code paths in the eDP/DP driver
left exposed due to this difference with pm_runtime which can cause
this. I am wondering if some sort of drm tracing will help to narrow
down the reset point.
> In the cases where the machine survives boot, the controller is never
> suspended.
>
> When investigating this I've also seen intermittent:
>
> [drm:dp_display_probe [msm]] *ERROR* device tree parsing failed
>
So this error I think is because in dp_parser_parse() --->
dp_parser_ctrl_res(), we also have a devm_phy_get().
This can return -EPROBE_DEFER if the phy driver has not yet probed.
I checked the other things inside dp_parser_parse(); the other calls seem
to be purely DT parsing except this one. I think to avoid the confusion,
we should move devm_phy_get() outside of DT parsing into a separate call
or at least add an error log in the devm_phy_get() failure path below to
indicate that it deferred:
	io->phy = devm_phy_get(&pdev->dev, "dp");
	if (IS_ERR(io->phy))
		return PTR_ERR(io->phy);
If my hypothesis is correct on this, then this error log (even though
misleading) should be harmless for this issue because if we hit
DRM_ERROR("device tree parsing failed\n"); we will skip the
devm_pm_runtime_enable().
On Tue, Feb 20, 2024 at 01:19:54PM -0800, Abhinav Kumar wrote:
> On 2/19/2024 2:41 AM, Johan Hovold wrote:
> > It seems my initial suspicion that at least some of these regressions
> > were related to the runtime PM work was correct. The hard resets happen
> > when the DP controller is runtime suspended after being probed:
> > [ 17.074925] bus: 'platform': __driver_probe_device: matched device aea0000.displayport-controller with driver msm-dp-display
> > [ Starting Network Time Synchronization...
> > [ 17.112000] msm-dp-display aea0000.displayport-controller: dp_display_probe - populate aux bus
> > [ 17.125208] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_resume
> > Starting Record System Boot/Shutdown in UTMP...
> > Starting Virtual Console Setup...
> > [ OK ] Finished Load/Save Screen Backlight Brightness of backlight:backlight.
> > [ 17.197909] msm-dp-display aea0000.displayport-controller: dp_pm_runtime_suspend
> > [ 17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - Optional Info
> > Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic
> > S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1
> > S - IMAGE_VARIANT_STRING=SocMakenaWP
> > S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92
> >
> > < machine is reset by hypervisor >
> >
> > Presumably the reset happens when the controller is being shut down while
> > still being used by the EFI framebuffer.
>
> I am not sure we can conclude that. Even if we shut off the
> controller while the framebuffer was still being fetched, that should only
> cause a blank screen and not a reset, because we don't trigger a new
> register write or read while it's fetching, so there is no new
> hardware access.
It specifically looks like the reset happens when shutting down the PHY,
that is, the call to dp_display_host_phy_exit(dp) in
dp_pm_runtime_suspend() never returns.
That seems like more than a coincidence to me.
> One thing I must accept is that there are two differences between
> sc8280xp where we are hitting these resets and sc7180/sc7280 chromebooks
> where we tested it more thoroughly without any such issues:
>
> 1) with the chromebooks we have depthcharge and not the QC UEFI.
>
> If we are suspecting a hand-off issue here, will it help if we try to
> disable the display in EFI by using "fastboot oem select-display-panel
> none" (assuming this is a fastboot enabled device) and see if you still
> hit the reset issue?
No, we don't have fastboot.
But as I mentioned I still do see resets when I instrument the code to
not shut down the display, which could indicate more than one issue
here.
> 2) chromebooks used "internal_hpd" whereas the pmic_glink method is used on
> the sc8280xp.
>
> I am still checking if there are any code paths in the eDP/DP driver
> left exposed due to this difference with pm_runtime which can cause
> this. I am wondering if some sort of drm tracing will help to narrow
> down the reset point.
>
> > In the cases where the machine survives boot, the controller is never
> > suspended.
> >
> > When investigating this I've also seen intermittent:
> >
> > [drm:dp_display_probe [msm]] *ERROR* device tree parsing failed
>
> So this error I think is because in dp_parser_parse() --->
> dp_parser_ctrl_res(), we also have a devm_phy_get().
>
> This can return -EPROBE_DEFER if the phy driver has not yet probed.
>
> I checked the other things inside dp_parser_parse(); the other calls seem
> to be purely DT parsing except this one. I think to avoid the confusion,
> we should move devm_phy_get() outside of DT parsing into a separate call
> or at least add an error log in the devm_phy_get() failure path below to
> indicate that it deferred:
>
> 	io->phy = devm_phy_get(&pdev->dev, "dp");
> 	if (IS_ERR(io->phy))
> 		return PTR_ERR(io->phy);
>
> If my hypothesis is correct on this, then this error log (even though
> misleading) should be harmless for this issue because if we hit
> DRM_ERROR("device tree parsing failed\n"); we will skip the
> devm_pm_runtime_enable().
Yeah, this seems to be the case as boot appears to recover from this, so
this may indeed be a probe deferral.
Probe deferrals should not be logged as errors however, so the fix is
not to add another error message but rather to suppress the current one
(e.g. using dev_err_probe()).
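As a rough illustration of that behaviour, the following is a small
userspace model, not the kernel helper itself. The model_* names are
hypothetical, EPROBE_DEFER is defined locally because it is a
kernel-internal errno, and the real dev_err_probe() additionally logs
deferrals at debug level and records the deferral reason, which is left
out here.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>

/* Kernel-internal errno for probe deferral; not in userspace <errno.h>. */
#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517
#endif

/* Number of error lines printed so far by the model. */
static int model_err_lines;

/*
 * Hypothetical userspace model of the idea behind dev_err_probe():
 * print an error line unless the error is a probe deferral, and always
 * return err so callers can write "return model_err_probe(...)".
 */
static int model_err_probe(const char *dev, int err, const char *fmt, ...)
{
	va_list args;

	assert(err < 0);	/* only meant for negative errno values */
	if (err == -EPROBE_DEFER)
		return err;	/* deferral: stay quiet, probe is retried */

	fprintf(stderr, "%s: error %d: ", dev, err);
	va_start(args, fmt);
	vfprintf(stderr, fmt, args);
	va_end(args);
	model_err_lines++;
	return err;
}

/* Model of a probe step that looks up a PHY and may be deferred. */
static int model_get_phy(const char *dev, int phy_status)
{
	if (phy_status < 0)
		return model_err_probe(dev, phy_status, "failed to get phy\n");
	return 0;
}
```

With this, a lookup that fails with -EPROBE_DEFER returns the error
without printing anything, while any other failure (for example
-ENODEV) is still reported once.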
> > Has anyone given some thought to how the framebuffer handover is
> > supposed to work? It seems we're currently just relying on luck with
> > timing.
Any comments on this? It seems we should not be shutting down (runtime
suspend) the display during boot as can currently happen.
Johan