2024-02-27 13:33:51

by Johan Hovold

Subject: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

Hi,

Since 6.8-rc1 I have seen (and received reports) of hard resets of the
Lenovo ThinkPad X13s after connecting and disconnecting an external
display.

I have triggered this on a simple disconnect while in a VT console, but
also when stopping Xorg after having repeatedly connected and
disconnected an external display and tried to enable it using xrandr.

In the former case, the last (custom debug) messages printed over an SSH
session were once:

[ 948.416358] usb 5-1: USB disconnect, device number 3
[ 948.443496] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
[ 948.443723] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 1, enable = 0
[ 948.443872] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit
[ 948.445117] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit - done

and then the hypervisor resets the machine.

Hotplug in Xorg seems to work worse than it did with 6.7, which also had
some issues. Connecting a display once seems to work fine, but trying to
re-enable a reconnected display using xrandr sometimes does not work at
all, while with 6.7 it usually worked on the second xrandr execution.

xrandr reports the reconnected display as disconnected:

Screen 0: minimum 320 x 200, current 1920 x 1200, maximum 5120 x 4096
eDP-1 connected primary 1920x1200+0+0 (normal left inverted right x axis y axis) 286mm x 178mm
1920x1200 60.03*+
1600x1200 60.00
DP-1 disconnected (normal left inverted right x axis y axis)
DP-2 disconnected 1920x1200+0+0 (normal left inverted right x axis y axis) 0mm x 0mm
1920x1200 (0x40c) 154.000MHz +HSync -VSync
h: width 1920 start 1968 end 2000 total 2080 skew 0 clock 74.04KHz
v: height 1200 start 1203 end 1209 total 1235 clock 59.95Hz

Running 'xrandr --output DP-2 --auto' 2-3 times makes xrandr report the
display as connected, but the display is still blank (unlike with 6.7).

A few times after having exercised hotplug this way, the machine hard
resets when Xorg is later stopped. Once I saw the following log messages
on an SSH session but they may not have been printed directly before
the hard reset:

[ 214.555781] [drm:dpu_encoder_phys_vid_wait_for_commit_done:492] [dpu error]vblank timeout
[ 214.555843] [drm:dpu_kms_wait_for_commit_done:483] [dpu error]wait for commit done returned -110

Note that this appears to be unrelated to the recently fixed Qualcomm
power domain driver bug which can trigger similar resets when
initialising the display subsystem on boot. Specifically, I have
triggered the hotplug resets described above also with the fix applied.
[1]

Reverting commit e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()
to report HPD status changes") which fixes the related VT console
regression does not seem to make any difference. [2]

Daniel Thompson reports that reverting the whole runtime PM series
appears to make the hard resets he has seen with DisplayPort hotplug go
away, however:

https://lore.kernel.org/lkml/[email protected]/

So for now, let's assume that these regressions were also introduced (or
triggered) by commit 5814b8bf086a ("drm/msm/dp: incorporate pm_runtime
framework into DP driver").

Johan


[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/lkml/[email protected]/


#regzbot introduced: 5814b8bf086a


2024-03-08 12:43:10

by Johan Hovold

Subject: Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

Hi Abhinav, Rob, Dmitry and Kuogee,

On Tue, Feb 27, 2024 at 02:33:48PM +0100, Johan Hovold wrote:

> Since 6.8-rc1 I have seen (and received reports) of hard resets of the
> Lenovo ThinkPad X13s after connecting and disconnecting an external
> display.
>
> I have triggered this on a simple disconnect while in a VT console, but
> also when stopping Xorg after having repeatedly connected and
> disconnected an external display and tried to enable it using xrandr.
>
> In the former case, the last (custom debug) messages printed over an SSH
> session were once:
>
> [ 948.416358] usb 5-1: USB disconnect, device number 3
> [ 948.443496] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
> [ 948.443723] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 1, enable = 0
> [ 948.443872] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit
> [ 948.445117] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit - done
>
> and then the hypervisor resets the machine.

Has there been any progress on tracking down the reset on disconnect
issue? I was expecting you to share your findings here so that we can
determine if the rest of the runtime PM series needs to be reverted or
not before 6.8 is released (possibly on Sunday).

I really did not want to spend more time on this driver than I already
have this cycle, but the lack of (visible) progress has again forced me
to do so.

It's quite likely that the resets are indeed a regression caused by the
runtime PM series as the bus clocks were not disabled on disconnect
before that one was merged in 6.8-rc1.

In a VT console, the device is now runtime suspended immediately on
disconnect, while in X, it currently remains active until X is killed,
which is consistent with what I reported above.

We now also have Bjorn's call trace from a different Qualcomm platform
triggering an unclocked access on disconnect, something which would
trigger a reset by the hypervisor on sc8280xp platforms like the X13s:

[ 2030.379417] SError Interrupt on CPU0, code 0x00000000be000000 -- SError
[ 2030.379425] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
[ 2030.379430] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
[ 2030.379435] Workqueue: events output_poll_execute [drm_kms_helper]
[ 2030.379495] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 2030.379501] pc : el1_interrupt+0x28/0xc0
[ 2030.379514] lr : el1h_64_irq_handler+0x18/0x24
[ 2030.379520] sp : ffff80008129b700
[ 2030.379523] x29: ffff80008129b700 x28: ffff7a0a031aa200 x27: ffff7a0af768c3c0
[ 2030.379530] x26: 0000000000000000 x25: ffff7a0a024bb568 x24: ffff7a0a031aa200
[ 2030.379537] x23: 0000000060400005 x22: ffffd2a4a10c871c x21: ffff80008129b8a0
[ 2030.379544] x20: ffffd2a4a00002f0 x19: ffff80008129b750 x18: 0000000000000000
[ 2030.379549] x17: 0000000000000000 x16: ffffd2a4a10c21d4 x15: 0000000000000000
[ 2030.379555] x14: 0000000000000000 x13: 000000000001acf6 x12: 0000000000000000
[ 2030.379560] x11: 0000000000000001 x10: ffff7a0af623f680 x9 : 0000000100000001
[ 2030.379565] x8 : 00000000000000c0 x7 : 0000000000000000 x6 : 000000000000003f
[ 2030.379570] x5 : ffff7a0a003c6a70 x4 : 000000000000001f x3 : ffffd2a48d6722dc
[ 2030.379576] x2 : 0000000000000002 x1 : ffffd2a4a00002f0 x0 : ffff80008129b750
[ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
[ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
[ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
[ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
[ 2030.379642] Call trace:
[ 2030.379644] dump_backtrace+0xec/0x108
[ 2030.379654] show_stack+0x18/0x24
[ 2030.379659] dump_stack_lvl+0x40/0x84
[ 2030.379664] dump_stack+0x18/0x24
[ 2030.379668] panic+0x130/0x34c
[ 2030.379673] nmi_panic+0x44/0x90
[ 2030.379679] arm64_serror_panic+0x68/0x74
[ 2030.379683] do_serror+0xc4/0xcc
[ 2030.379686] el1h_64_error_handler+0x34/0x4c
[ 2030.379692] el1h_64_error+0x64/0x68
[ 2030.379696] el1_interrupt+0x28/0xc0
[ 2030.379700] el1h_64_irq_handler+0x18/0x24
[ 2030.379706] el1h_64_irq+0x64/0x68
[ 2030.379710] _raw_spin_unlock_irq+0x20/0x48
[ 2030.379718] wait_for_common+0xb4/0x16c
[ 2030.379722] wait_for_completion_timeout+0x14/0x20
[ 2030.379727] dp_ctrl_push_idle+0x34/0x8c [msm]
[ 2030.379844] dp_bridge_atomic_disable+0x18/0x24 [msm]
[ 2030.379959] drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
[ 2030.380150] drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
[ 2030.380200] msm_atomic_commit_tail+0x1b4/0x474 [msm]
[ 2030.380316] commit_tail+0xa4/0x158 [drm_kms_helper]
[ 2030.380369] drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
[ 2030.380418] drm_atomic_commit+0xa8/0xd4 [drm]
[ 2030.380529] drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
[ 2030.380641] drm_client_modeset_commit_locked+0x50/0x168 [drm]
[ 2030.380753] drm_client_modeset_commit+0x2c/0x54 [drm]
[ 2030.380865] __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
[ 2030.380915] drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
[ 2030.380965] msm_fbdev_client_hotplug+0x28/0xc8 [msm]
[ 2030.381081] drm_client_dev_hotplug+0x94/0x118 [drm]
[ 2030.381192] output_poll_execute+0x214/0x26c [drm_kms_helper]
[ 2030.381241] process_scheduled_works+0x19c/0x2cc
[ 2030.381249] worker_thread+0x290/0x3cc
[ 2030.381255] kthread+0xfc/0x184
[ 2030.381260] ret_from_fork+0x10/0x20

The above could happen if the convoluted hotplug state machine breaks
down so that the device is runtime suspended before
dp_bridge_atomic_disable() is called.
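
Schematically, the invariant that breaks here is something like the
following (a minimal sketch, not the actual msm code; the helper and its
arguments are made up):

#include <linux/pm_runtime.h>
#include <linux/io.h>

/*
 * Link register accesses are only safe while the controller is runtime
 * active. If a racing event has already dropped the last usage count,
 * the readl() below hits an unclocked register and triggers an SError
 * (or a hypervisor reset on sc8280xp).
 */
static int dp_link_access_safely(struct device *dev, void __iomem *link_base)
{
	int ret;

	ret = pm_runtime_resume_and_get(dev);	/* keep the clocks on */
	if (ret < 0)
		return ret;

	(void)readl(link_base);		/* device is runtime active here */

	pm_runtime_put(dev);
	return 0;
}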

For some reason, possibly due to unrelated changes in timing, possibly
after the hotplug revert, I am no longer able to reproduce the reset
with 6.8-rc7 on the X13s.

I am, however, able to get the hotplug state machine to leak runtime
PM reference counts, which prevents the device from ever being
suspended (e.g. by disconnecting slowly so that we get multiple connect
and disconnect events). This can manifest itself as a hotplug event
which is processed after disconnecting the display:

[drm:dp_panel_read_sink_caps [msm]] *ERROR* read dpcd failed -110

I would not be surprised at all if there's a sequence of events that
leads to an unbalanced put instead (and the stack trace and observations
above make this appear likely).

Were you able to determine what events led to the premature disabling
of the bus clocks on the RB3?

Have you got any reason to believe that the revert of the hotplug
notification patch may in any way prevent that? Or is it just papering
over the issue?

> Hotplug in Xorg seems to work worse than it did with 6.7, which also had
> some issues. Connecting a display once seems to work fine, but trying to
> re-enable a reconnected display using xrandr sometimes does not work at
> all, while with 6.7 it usually worked on the second xrandr execution.
>
> xrandr reports the reconnected display as disconnected:

> Running 'xrandr --output DP-2 --auto' 2-3 times makes xrandr report the
> display as connected, but the display is still blank (unlike with 6.7).

As I mentioned elsewhere, the revert of e467e0bde881 ("drm/msm/dp: use
drm_bridge_hpd_notify() to report HPD status changes") in rc7 does seem
to help with the hotplug detect issues that I could reproduce (in VT
console, X and wayland).

The question is whether we should revert the whole runtime PM series so
that the bus clock is left on, which should prevent any resets on
disconnect.

Without any analysis from you or reason to believe the issue has been
resolved, I'm inclined to just go ahead and revert it. It clearly had
not been tested enough before being merged, and I'm quite frustrated
with how this has been handled.

Johan

2024-03-08 17:50:47

by Abhinav Kumar

Subject: Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

Hi Johan

On 3/8/2024 4:43 AM, Johan Hovold wrote:
> Hi Abhinav, Rob, Dmitry and Kuogee,
>
> On Tue, Feb 27, 2024 at 02:33:48PM +0100, Johan Hovold wrote:
>
>> Since 6.8-rc1 I have seen (and received reports) of hard resets of the
>> Lenovo ThinkPad X13s after connecting and disconnecting an external
>> display.
>>
>> I have triggered this on a simple disconnect while in a VT console, but
>> also when stopping Xorg after having repeatedly connected and
>> disconnected an external display and tried to enable it using xrandr.
>>
>> In the former case, the last (custom debug) messages printed over an SSH
>> session were once:
>>
>> [ 948.416358] usb 5-1: USB disconnect, device number 3
>> [ 948.443496] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug
>> [ 948.443723] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 1, enable = 0
>> [ 948.443872] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit
>> [ 948.445117] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit - done
>>
>> and then the hypervisor resets the machine.
>
> Has there been any progress on tracking down the reset on disconnect
> issue? I was expecting you to share your findings here so that we can
> determine if the rest of the runtime PM series needs to be reverted or
> not before 6.8 is released (possibly on Sunday).
>
> I really did not want to spend more time on this driver than I already
> have this cycle, but the lack of (visible) progress has again forced me
> to do so.
>
> It's quite likely that the resets are indeed a regression caused by the
> runtime PM series as the bus clocks were not disabled on disconnect
> before that one was merged in 6.8-rc1.
>

For this last remaining reset, with the stacktrace you have mentioned
below, I do not think it was introduced by the PM runtime series. We
have had this report earlier, with the same stacktrace, on earlier
kernels which didn't have PM runtime support:

https://gitlab.freedesktop.org/drm/msm/-/issues/17

We had fixed the issue reported by the user at that time with the
patch below, which the user confirmed as fixed:

https://lore.kernel.org/all/[email protected]/

But, it seems like there is another race condition in this code
resulting in this issue again.

I have posted my analysis with the patch here as an RFC yesterday:

https://patchwork.freedesktop.org/patch/581758/

I missed CCing you and Bjorn on the RFC, but when I post it as a patch
either today or tomorrow, I will CC you both.

Sorry, much of the discussion of this issue was happening on #freedreno
over the last few days.

> In a VT console, the device is now runtime suspended immediately on
> disconnect, while in X, it currently remains active until X is killed,
> which is consistent with what I reported above.
>
> We now also have Bjorn's call trace from a different Qualcomm platform
> triggering an unclocked access on disconnect, something which would
> trigger a reset by the hypervisor on sc8280xp platforms like the X13s:
>
> [ 2030.379417] SError Interrupt on CPU0, code 0x00000000be000000 -- SError
> [ 2030.379425] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
> [ 2030.379430] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
> [ 2030.379435] Workqueue: events output_poll_execute [drm_kms_helper]
> [ 2030.379495] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 2030.379501] pc : el1_interrupt+0x28/0xc0
> [ 2030.379514] lr : el1h_64_irq_handler+0x18/0x24
> [ 2030.379520] sp : ffff80008129b700
> [ 2030.379523] x29: ffff80008129b700 x28: ffff7a0a031aa200 x27: ffff7a0af768c3c0
> [ 2030.379530] x26: 0000000000000000 x25: ffff7a0a024bb568 x24: ffff7a0a031aa200
> [ 2030.379537] x23: 0000000060400005 x22: ffffd2a4a10c871c x21: ffff80008129b8a0
> [ 2030.379544] x20: ffffd2a4a00002f0 x19: ffff80008129b750 x18: 0000000000000000
> [ 2030.379549] x17: 0000000000000000 x16: ffffd2a4a10c21d4 x15: 0000000000000000
> [ 2030.379555] x14: 0000000000000000 x13: 000000000001acf6 x12: 0000000000000000
> [ 2030.379560] x11: 0000000000000001 x10: ffff7a0af623f680 x9 : 0000000100000001
> [ 2030.379565] x8 : 00000000000000c0 x7 : 0000000000000000 x6 : 000000000000003f
> [ 2030.379570] x5 : ffff7a0a003c6a70 x4 : 000000000000001f x3 : ffffd2a48d6722dc
> [ 2030.379576] x2 : 0000000000000002 x1 : ffffd2a4a00002f0 x0 : ffff80008129b750
> [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
> [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
> [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
> [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
> [ 2030.379642] Call trace:
> [ 2030.379644] dump_backtrace+0xec/0x108
> [ 2030.379654] show_stack+0x18/0x24
> [ 2030.379659] dump_stack_lvl+0x40/0x84
> [ 2030.379664] dump_stack+0x18/0x24
> [ 2030.379668] panic+0x130/0x34c
> [ 2030.379673] nmi_panic+0x44/0x90
> [ 2030.379679] arm64_serror_panic+0x68/0x74
> [ 2030.379683] do_serror+0xc4/0xcc
> [ 2030.379686] el1h_64_error_handler+0x34/0x4c
> [ 2030.379692] el1h_64_error+0x64/0x68
> [ 2030.379696] el1_interrupt+0x28/0xc0
> [ 2030.379700] el1h_64_irq_handler+0x18/0x24
> [ 2030.379706] el1h_64_irq+0x64/0x68
> [ 2030.379710] _raw_spin_unlock_irq+0x20/0x48
> [ 2030.379718] wait_for_common+0xb4/0x16c
> [ 2030.379722] wait_for_completion_timeout+0x14/0x20
> [ 2030.379727] dp_ctrl_push_idle+0x34/0x8c [msm]
> [ 2030.379844] dp_bridge_atomic_disable+0x18/0x24 [msm]
> [ 2030.379959] drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
> [ 2030.380150] drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
> [ 2030.380200] msm_atomic_commit_tail+0x1b4/0x474 [msm]
> [ 2030.380316] commit_tail+0xa4/0x158 [drm_kms_helper]
> [ 2030.380369] drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
> [ 2030.380418] drm_atomic_commit+0xa8/0xd4 [drm]
> [ 2030.380529] drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
> [ 2030.380641] drm_client_modeset_commit_locked+0x50/0x168 [drm]
> [ 2030.380753] drm_client_modeset_commit+0x2c/0x54 [drm]
> [ 2030.380865] __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
> [ 2030.380915] drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
> [ 2030.380965] msm_fbdev_client_hotplug+0x28/0xc8 [msm]
> [ 2030.381081] drm_client_dev_hotplug+0x94/0x118 [drm]
> [ 2030.381192] output_poll_execute+0x214/0x26c [drm_kms_helper]
> [ 2030.381241] process_scheduled_works+0x19c/0x2cc
> [ 2030.381249] worker_thread+0x290/0x3cc
> [ 2030.381255] kthread+0xfc/0x184
> [ 2030.381260] ret_from_fork+0x10/0x20
>
> The above could happen if the convoluted hotplug state machine breaks
> down so that the device is runtime suspended before
> dp_bridge_atomic_disable() is called.
>

Yes, the state machine got broken, and I have explained how in the
commit text of the RFC. But it's not necessarily due to runtime PM;
rather, a sequence of events from userspace exposes this breakage.

> For some reason, possibly due to unrelated changes in timing, possibly
> after the hotplug revert, I am no longer able to reproduce the reset
> with 6.8-rc7 on the X13s.
>

I do not know how the hotplug revert fixed this stack because I think
this can still happen.

> I am, however, able to get the hotplug state machine to leak runtime
> PM reference counts, which prevents the device from ever being
> suspended (e.g. by disconnecting slowly so that we get multiple connect
> and disconnect events). This can manifest itself as a hotplug event
> which is processed after disconnecting the display:
>
> [drm:dp_panel_read_sink_caps [msm]] *ERROR* read dpcd failed -110
>

Hmm, this is another new report. I was not aware of this until now. We
would like to debug this issue too, as we have not seen this one
before.

But, I am suspicious about how or why a refcount leak leads to a DPCD
read failure.

> I would not be surprised at all if there's a sequence of events that
> leads to an unbalanced put instead (and the stack trace and observations
> above make this appear likely).
>

The stacktrace points to a different analysis.

> Were you able to determine what events led to the premature disabling
> of the bus clocks on the RB3?
>

Yes, please check my RFC. I am waiting for Kuogee to review it too,
and then I shall post it formally as a patch.

> Have you got any reason to believe that the revert of the hotplug
> notification patch may in any way prevent that? Or is it just papering
> over the issue?
>

No, I do not think reverting the hotplug notification patch can fix that
reset.

>> Hotplug in Xorg seems to work worse than it did with 6.7, which also had
>> some issues. Connecting a display once seems to work fine, but trying to
>> re-enable a reconnected display using xrandr sometimes does not work at
>> all, while with 6.7 it usually worked on the second xrandr execution.
>>
>> xrandr reports the reconnected display as disconnected:
>
>> Running 'xrandr --output DP-2 --auto' 2-3 times makes xrandr report the
>> display as connected, but the display is still blank (unlike with 6.7).
>
> As I mentioned elsewhere, the revert of e467e0bde881 ("drm/msm/dp: use
> drm_bridge_hpd_notify() to report HPD status changes") in rc7 does seem
> to help with the hotplug detect issues that I could reproduce (in VT
> console, X and wayland).
>

Yes, it makes sense that the revert of e467e0bde881 ("drm/msm/dp: use
drm_bridge_hpd_notify() to report HPD status changes") affects the
hotplug status and the associated connection issues. But I think the
reset with the above-mentioned stacktrace can still happen.

> The question is whether we should revert the whole runtime PM series so
> that the bus clock is left on, which should prevent any resets on
> disconnect.
>

At this point, I would say that if the RFC resolves the reset with the
above stack for you and Bjorn, that should be the way to go, and not a
full revert.

Unless you are facing issues other than this last reset crash.

> Without any analysis from you or reason to believe the issue has been
> resolved, I'm inclined to just go ahead and revert it. It clearly had
> not been tested enough before being merged, and I'm quite frustrated
> with how this has been handled.
>

I must admit that the number of issues reported on sc8280xp was
surprising for us as well, because the series was tested on both
sc7180 and sc7280 by us and other users, but not on sc8280xp, as we
did not have access to this device at that time.

sc7280 is the only other device we supported upstream which supported
eDP natively before Qualcomm X1E80100 landed this cycle. Surprisingly
enough, perhaps because it uses a chrome defconfig or chrome userspace,
no issues were seen at that time, or even now for that matter.

To avoid breakage on sc8280xp moving forward, as it has more
visibility, our team here has decided to set up sc8280xp devices to
test all msm display-related changes, especially eDP/DP, as it shares
the eDP/DP driver. We can also CC you on these, in case you want to
test them as well and give your Tested-by.

> Johan

2024-03-09 16:30:40

by Johan Hovold

Subject: Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:
> On 3/8/2024 4:43 AM, Johan Hovold wrote:

> For this last remaining reset, with the stacktrace you have mentioned
> below, I do not think it was introduced by the PM runtime series. We
> have had this report earlier, with the same stacktrace, on earlier
> kernels which didn't have PM runtime support:

> But, it seems like there is another race condition in this code
> resulting in this issue again.
>
> I have posted my analysis with the patch here as an RFC yesterday:
>
> https://patchwork.freedesktop.org/patch/581758/
>
> I missed CCing you and Bjorn on the RFC, but when I post it as a patch
> either today or tomorrow, I will CC you both.

Ok, thanks. I'll take a closer look at that next week. It's not clear to
me what that race looks like after reading the commit message. It would
be good to have some more details in there (e.g. the sequence of events
that breaks the state machine) and some way to reproduce the issue
reliably.

> > We now also have Bjorn's call trace from a different Qualcomm platform
> > triggering an unclocked access on disconnect, something which would
> > trigger a reset by the hypervisor on sc8280xp platforms like the X13s:

> > [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
> > [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
> > [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
> > [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
> > [ 2030.379642] Call trace:

> > [ 2030.379722] wait_for_completion_timeout+0x14/0x20
> > [ 2030.379727] dp_ctrl_push_idle+0x34/0x8c [msm]
> > [ 2030.379844] dp_bridge_atomic_disable+0x18/0x24 [msm]
> > [ 2030.379959] drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
> > [ 2030.380150] drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
> > [ 2030.380200] msm_atomic_commit_tail+0x1b4/0x474 [msm]
> > [ 2030.380316] commit_tail+0xa4/0x158 [drm_kms_helper]
> > [ 2030.380369] drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
> > [ 2030.380418] drm_atomic_commit+0xa8/0xd4 [drm]
> > [ 2030.380529] drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
> > [ 2030.380641] drm_client_modeset_commit_locked+0x50/0x168 [drm]
> > [ 2030.380753] drm_client_modeset_commit+0x2c/0x54 [drm]
> > [ 2030.380865] __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
> > [ 2030.380915] drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
> > [ 2030.380965] msm_fbdev_client_hotplug+0x28/0xc8 [msm]
> > [ 2030.381081] drm_client_dev_hotplug+0x94/0x118 [drm]
> > [ 2030.381192] output_poll_execute+0x214/0x26c [drm_kms_helper]
> > [ 2030.381241] process_scheduled_works+0x19c/0x2cc
> > [ 2030.381249] worker_thread+0x290/0x3cc
> > [ 2030.381255] kthread+0xfc/0x184
> > [ 2030.381260] ret_from_fork+0x10/0x20
> >
> > The above could happen if the convoluted hotplug state machine breaks
> > down so that the device is runtime suspended before
> > dp_bridge_atomic_disable() is called.

> Yes, the state machine got broken, and I have explained how in the
> commit text of the RFC. But it's not necessarily due to runtime PM;
> rather, a sequence of events from userspace exposes this breakage.

After looking at this some more today, I agree with you. The
observations I've reported are consistent with what would happen if the
link clock is disabled when dp_bridge_atomic_disable() is called.

That clock is disabled in dp_bridge_atomic_post_disable() before runtime
suspending, but perhaps there are some further paths that can end up
disabling it.
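
For reference, the expected teardown ordering is roughly the following
(a sketch with made-up helper names, not the driver source):

#include <linux/pm_runtime.h>

/* Made-up stand-ins for the respective driver steps. */
static void push_link_to_idle(struct device *dev) { }	/* atomic_disable */
static void disable_link_clock(struct device *dev) { }	/* atomic_post_disable */

/*
 * The link clock must stay on until the link has been pushed to idle;
 * only then may the clock be dropped and the last runtime PM reference
 * released so that the device can suspend.
 */
static void dp_teardown_order(struct device *dev)
{
	push_link_to_idle(dev);
	disable_link_clock(dev);
	pm_runtime_put_sync(dev);
}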

> > For some reason, possibly due to unrelated changes in timing, possibly
> > after the hotplug revert, I am no longer able to reproduce the reset
> > with 6.8-rc7 on the X13s.

> I do not know how the hotplug revert fixed this stack because I think
> this can still happen.

Thanks for confirming.

> > I am, however, able to get the hotplug state machine to leak runtime
> > PM reference counts, which prevents the device from ever being
> > suspended (e.g. by disconnecting slowly so that we get multiple connect
> > and disconnect events). This can manifest itself as a hotplug event
> > which is processed after disconnecting the display:
> >
> > [drm:dp_panel_read_sink_caps [msm]] *ERROR* read dpcd failed -110

> Hmm, this is another new report. I was not aware of this until now. We
> would like to debug this issue too, as we have not seen this one
> before.
>
> But, I am suspicious about how or why a refcount leak leads to a DPCD
> read failure.

It's the other way round: the hotplug event is processed after
disconnect and that leaks a runtime PM usage count:

< usb-c disconnect >

[ 158.167845] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2

Here's the disconnect event.

[ 158.172234] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
[ 158.175164] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1

Here's an unexpected connect event which is added to the event queue if
the link state is updated by a racing thread (after this printk).

[ 158.175696] msm-dp-display ae98000.displayport-controller: __pm_runtime_idle - usage count = 1
[ 158.175704] msm-dp-display ae98000.displayport-controller: dp_display_send_hpd_notification - hpd = 0

Such as here.

[ 158.180972] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
[ 158.184146] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
[ 158.186565] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_disable
[ 158.235903] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_post_disable
[ 158.237759] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 2, enable = 0
[ 158.238484] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 1, enable = 0
[ 158.240350] msm-dp-display ae98000.displayport-controller: __pm_runtime_idle - usage count = 0
[ 158.240966] msm-dp-display ae98000.displayport-controller: dp_pm_runtime_suspend
[ 158.242674] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 0, enable = 0
[ 158.259286] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
[ 158.275869] msm-dp-display ae98000.displayport-controller: dp_hpd_plug_handle

Here the spurious connect event is processed but the external display is
already disconnected.

[ 158.277070] msm-dp-display ae98000.displayport-controller: dp_hpd_plug_handle - rpm get
[ 158.278234] msm-dp-display ae98000.displayport-controller: __pm_runtime_resume - usage count = 1
[ 158.279322] msm-dp-display ae98000.displayport-controller: dp_pm_runtime_resume
[ 158.280386] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 0, enable = 1
[ 158.283766] msm-dp-display ae98000.displayport-controller: __pm_runtime_resume - usage count = 2
[ 158.340829] msm-dp-display ae98000.displayport-controller: __pm_runtime_idle - usage count = 1
..
[ 160.101867] msm-dp-display ae98000.displayport-controller: __pm_runtime_resume - usage count = 2
[ 160.159553] msm-dp-display ae98000.displayport-controller: __pm_runtime_idle - usage count = 1
[ 160.160643] [drm:dp_panel_read_sink_caps [msm]] *ERROR* read dpcd failed -110

Which leads to the dpcd read failure and leaves the runtime PM usage
count incremented (the next hotplug event bumps the count to 2).
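
In other words, the imbalance boils down to something like this (a
simplified sketch of the plug/unplug handlers, not the actual driver
code):

#include <linux/pm_runtime.h>

/*
 * The plug handler takes a runtime PM reference that only a matching
 * unplug drops again. A spurious plug event processed after the final
 * disconnect therefore leaves the usage count elevated and the device
 * can never runtime suspend.
 */
static void hpd_plug_handle(struct device *dev)
{
	pm_runtime_get_sync(dev);	/* the "rpm get" in the log above */
	/* ... link training, sink caps read (fails with -110 here) ... */
}

static void hpd_unplug_handle(struct device *dev)
{
	/* ... tear down the link ... */
	pm_runtime_put_sync(dev);	/* never runs for the spurious plug */
}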

I've also hit one other scenario where a runtime PM reference count is
leaked (without any dpcd errors), but the above one is easily reproduced
by pulling out the plug slowly (the printks may possibly help make this
more reproducible by altering the timing slightly).

> At this point, I would say that if the RFC resolves the reset with the
> above stack for you and Bjorn, that should be the way to go, and not a
> full revert.
>
> Unless you are facing issues other than this last reset crash.

We only have Bjorn's stack trace as the hypervisor just resets the
machine when I hit this on sc8280xp, but I think it's reasonable to
assume that we are hitting the same or at least a related issue.

Also note that I've hit the resets in two different ways; disconnect in
VT console and when stopping X some time after a disconnect (Bjorn's
stack trace is from the former case I think).

Both resets occur around the time when dp_bridge_atomic_disable() is
normally called, but I'm still missing an explanation for how the link
clock got disabled prematurely (if that's what happened).

> To avoid breakage on sc8280xp moving forward, as it has more
> visibility, our team here has decided to set up sc8280xp devices to
> test all msm display-related changes, especially eDP/DP, as it shares
> the eDP/DP driver. We can also CC you on these, in case you want to
> test them as well and give your Tested-by.

That's good to hear. I'm sure your testing will be sufficient to catch
any obvious regressions. Once it hits mainline we have a larger user
base of people running the X13s which will provide further testing, but
it's not very efficient to rely on third parties with these kinds of
issues that just trigger resets on this platform. It gives us an
indication that something is broken, but someone still needs to spend
time to try to narrow it down.

Johan

2024-03-11 13:43:42

by Johan Hovold

Subject: Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

On Sat, Mar 09, 2024 at 05:30:17PM +0100, Johan Hovold wrote:
> On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:
> > On 3/8/2024 4:43 AM, Johan Hovold wrote:
>
> > For this last remaining reset, with the stacktrace you have mentioned
> > below, I do not think it was introduced by the PM runtime series. We
> > have had this report earlier, with the same stacktrace, on earlier
> > kernels which didn't have PM runtime support:
>
> > But, it seems like there is another race condition in this code
> > resulting in this issue again.
> >
> > I have posted my analysis with the patch here as an RFC yesterday:
> >
> > https://patchwork.freedesktop.org/patch/581758/
> >
> > I missed CCing you and Bjorn on the RFC, but when I post it as a patch
> > either today or tomorrow, I will CC you both.
>
> Ok, thanks. I'll take a closer look at that next week. It's not clear to
> me what that race looks like after reading the commit message. It would
> be good to have some more details in there (e.g. the sequence of events
> that breaks the state machine) and some way to reproduce the issue
> reliably.

I was able to reproduce the reset with some of my debug printks in place
after reapplying the reverted hpd notify change, so I have an explanation
for (one of) the ways we can end up in this state now.

This does not match the description of the problem in the fix linked to
above, and I don't think that patch solves the underlying issue even if
it may make the race window somewhat smaller. More details below.

> > > We now also have Bjorn's call trace from a different Qualcomm platform
> > > triggering an unclocked access on disconnect, something which would
> > > trigger a reset by the hypervisor on sc8280xp platforms like the X13s:
>
> > > [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
> > > [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
> > > [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
> > > [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
> > > [ 2030.379642] Call trace:
>
> > > [ 2030.379722] wait_for_completion_timeout+0x14/0x20
> > > [ 2030.379727] dp_ctrl_push_idle+0x34/0x8c [msm]
> > > [ 2030.379844] dp_bridge_atomic_disable+0x18/0x24 [msm]
> > > [ 2030.379959] drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
> > > [ 2030.380150] drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
> > > [ 2030.380200] msm_atomic_commit_tail+0x1b4/0x474 [msm]
> > > [ 2030.380316] commit_tail+0xa4/0x158 [drm_kms_helper]
> > > [ 2030.380369] drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
> > > [ 2030.380418] drm_atomic_commit+0xa8/0xd4 [drm]
> > > [ 2030.380529] drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
> > > [ 2030.380641] drm_client_modeset_commit_locked+0x50/0x168 [drm]
> > > [ 2030.380753] drm_client_modeset_commit+0x2c/0x54 [drm]
> > > [ 2030.380865] __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
> > > [ 2030.380915] drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
> > > [ 2030.380965] msm_fbdev_client_hotplug+0x28/0xc8 [msm]
> > > [ 2030.381081] drm_client_dev_hotplug+0x94/0x118 [drm]
> > > [ 2030.381192] output_poll_execute+0x214/0x26c [drm_kms_helper]
> > > [ 2030.381241] process_scheduled_works+0x19c/0x2cc
> > > [ 2030.381249] worker_thread+0x290/0x3cc
> > > [ 2030.381255] kthread+0xfc/0x184
> > > [ 2030.381260] ret_from_fork+0x10/0x20
> > >
> > > The above could happen if the convoluted hotplug state machine breaks
> > > down so that the device is runtime suspended before
> > > dp_bridge_atomic_disable() is called.
>
> > Yes, the state machine got broken, and I have explained how in the
> > commit text of the RFC. But it's not necessarily due to runtime PM;
> > rather, a sequence of events from userspace exposes this breakage.
>
> After looking at this some more today, I agree with you. The
> observations I've reported are consistent with what would happen if the
> link clock is disabled when dp_bridge_atomic_disable() is called.
>
> That clock is disabled in dp_bridge_atomic_post_disable() before runtime
> suspending, but perhaps there are some further paths that can end up
> disabling it.

It turns out there are paths like that, and we hit those more often
before reverting e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()
to report HPD status changes").

Specifically, when a previous connect attempt did not enable the bridge
fully, so that it is still in the ST_MAINLINK_READY state when we receive
a disconnect event, dp_hpd_unplug_handle() will turn off the link clock.

[ 204.527625] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2
[ 204.531553] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
[ 204.533261] msm-dp-display ae98000.displayport-controller: dp_ctrl_off_link

A racing connect event, such as the one I described earlier, can then
try to enable the bridge again but dp_bridge_atomic_enable() just bails
out early (and leaks an rpm reference) because we're now in
ST_DISCONNECTED:

[ 204.535773] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1
[ 204.536187] [CONNECTOR:35:DP-2] status updated from disconnected to connected
[ 204.536905] msm-dp-display ae98000.displayport-controller: dp_display_notify_disconnect - would clear link ready (1), state = 0
[ 204.537821] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_check - link_ready = 1
[ 204.538063] msm-dp-display ae98000.displayport-controller: dp_display_send_hpd_notification - hpd = 0, link_ready = 1
[ 204.542778] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable
[ 204.586547] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable - state = 0 (rpm leak?)
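
The leak itself amounts to an early-return pattern along these lines
(hypothetical code; only the state names match the driver):

#include <linux/pm_runtime.h>

enum dp_state { ST_DISCONNECTED, ST_MAINLINK_READY };

/*
 * The reference taken on entry is never dropped when the enable path
 * gives up because a racing unplug has meanwhile moved the state
 * machine to ST_DISCONNECTED.
 */
static void bridge_atomic_enable(struct device *dev, enum dp_state state)
{
	pm_runtime_get_sync(dev);

	if (state == ST_DISCONNECTED)
		return;		/* bails out without pm_runtime_put() */

	/* ... train the link and enable the bridge ... */
}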

Clearing link_ready already in dp_display_notify_disconnect() would make
the race window slightly smaller, but it would essentially just paper
over the bug as the events are still not serialised. Notably, there is
no user space interaction involved here and it's the spurious connect
event that triggers the bridge enable.

When the fbdev hotplug code later disables the never-enabled bridge,
things go boom:

[ 204.649072] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
[ 204.650378] [CONNECTOR:35:DP-2] status updated from connected to disconnected
[ 204.651111] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_disable

as the link clock has already been disabled when accessing the link
registers.

The stack trace for the bridge enable above is:

[ 204.553922] dp_bridge_atomic_enable+0x40/0x2f0 [msm]
[ 204.555241] drm_atomic_bridge_chain_enable+0x54/0xc8 [drm]
[ 204.556557] drm_atomic_helper_commit_modeset_enables+0x194/0x26c [drm_kms_helper]
[ 204.557853] msm_atomic_commit_tail+0x204/0x804 [msm]
[ 204.559173] commit_tail+0xa4/0x18c [drm_kms_helper]
[ 204.560450] drm_atomic_helper_commit+0x19c/0x1b0 [drm_kms_helper]
[ 204.561743] drm_atomic_commit+0xa4/0xdc [drm]
[ 204.563065] drm_client_modeset_commit_atomic+0x22c/0x298 [drm]
[ 204.564402] drm_client_modeset_commit_locked+0x60/0x1c0 [drm]
[ 204.565733] drm_client_modeset_commit+0x30/0x58 [drm]
[ 204.567055] __drm_fb_helper_restore_fbdev_mode_unlocked+0xbc/0xfc [drm_kms_helper]
[ 204.568381] drm_fb_helper_hotplug_event.part.0+0xd4/0x110 [drm_kms_helper]
[ 204.569708] drm_fb_helper_hotplug_event+0x38/0x44 [drm_kms_helper]
[ 204.571032] msm_fbdev_client_hotplug+0x28/0xd4 [msm]
[ 204.572395] drm_client_dev_hotplug+0xcc/0x130 [drm]
[ 204.573755] drm_kms_helper_connector_hotplug_event+0x34/0x44 [drm_kms_helper]
[ 204.575114] drm_bridge_connector_hpd_cb+0x90/0xa4 [drm_kms_helper]
[ 204.576465] drm_bridge_hpd_notify+0x40/0x5c [drm]
[ 204.577842] drm_aux_hpd_bridge_notify+0x18/0x28 [aux_hpd_bridge]
[ 204.579184] pmic_glink_altmode_worker+0xc0/0x23c [pmic_glink_altmode]

> > > For some reason, possibly due to unrelated changes in timing, possibly
> > > after the hotplug revert, I am no longer able to reproduce the reset
> > > with 6.8-rc7 on the X13s.
>
> > I do not know how the hotplug revert fixed this stack because I think
> > this can still happen.

So, while it may still be theoretically possible to hit the resets after
the revert, the HPD notify revert effectively "fixed" the regression in
6.8-rc1 by removing the preconditions that now made us hit it (i.e. the
half-initialised bridge).

It seems the hotplug state machine needs to be reworked completely, but
at least we're roughly back where we were with 6.7 (including that the
bus clocks will never be turned off because of the rpm leaks on
disconnect).

Johan

2024-03-11 15:31:29

by Johan Hovold

Subject: Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

On Mon, Mar 11, 2024 at 02:43:24PM +0100, Johan Hovold wrote:

> So, while it may still be theoretically possible to hit the resets after
> the revert, the HPD notify revert effectively "fixed" the regression in
> 6.8-rc1 by removing the preconditions that now made us hit it (i.e. the
> half-initialised bridge).
>
> It seems the hotplug state machine needs to be reworked completely, but
> at least we're roughly back where we were with 6.7 (including that the
> bus clocks will never be turned of because of the rpm leaks on
> disconnect).

#regzbot introduced: e467e0bde881
#regzbot fix: 664bad6af3cb

2024-03-11 16:52:37

by Abhinav Kumar

Subject: Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1



On 3/11/2024 6:43 AM, Johan Hovold wrote:
> On Sat, Mar 09, 2024 at 05:30:17PM +0100, Johan Hovold wrote:
>> On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:
>>> On 3/8/2024 4:43 AM, Johan Hovold wrote:
>>
>>> For this last remaining reset, with the stacktrace you have mentioned
>>> below, I do not think it was introduced by the PM runtime series. We
>>> have had this report earlier, with the same stacktrace, on earlier
>>> kernels which didn't have PM runtime support:
>>
>>> But, it seems like there is another race condition in this code
>>> resulting in this issue again.
>>>
>>> I have posted my analysis with the patch here as an RFC yesterday:
>>>
>>> https://patchwork.freedesktop.org/patch/581758/
>>>
>>> I missed CCing you and Bjorn on the RFC, but when I post it as a patch
>>> either today or tomorrow, I will CC you both.
>>
>> Ok, thanks. I'll take a closer look at that next week. It's not clear to
>> me what that race looks like after reading the commit message. It would
>> be good to have some more details in there (e.g. the sequence of events
>> that breaks the state machine) and some way to reproduce the issue
>> reliably.
>
> I was able to reproduce the reset with some of my debug printks in place
> after reapplying the reverted hpd notify change, so I have an explanation
> for (one of) the ways we can end up in this state now.
>
> This does not match the description of the problem in the fix linked to
> above, and I don't think that patch solves the underlying issue even if
> it may make the race window somewhat smaller. More details below.
>

It's the same condition you described below: the enable does not go
through and we bail out as we are in ST_DISCONNECTED.

>>>> We now also have Bjorn's call trace from a different Qualcomm platform
>>>> triggering an unclocked access on disconnect, something which would
>>>> trigger a reset by the hypervisor on sc8280xp platforms like the X13s:
>>
>>>> [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt
>>>> [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219
>>>> [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT)
>>>> [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper]
>>>> [ 2030.379642] Call trace:
>>
>>>> [ 2030.379722] wait_for_completion_timeout+0x14/0x20
>>>> [ 2030.379727] dp_ctrl_push_idle+0x34/0x8c [msm]
>>>> [ 2030.379844] dp_bridge_atomic_disable+0x18/0x24 [msm]
>>>> [ 2030.379959] drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm]
>>>> [ 2030.380150] drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper]
>>>> [ 2030.380200] msm_atomic_commit_tail+0x1b4/0x474 [msm]
>>>> [ 2030.380316] commit_tail+0xa4/0x158 [drm_kms_helper]
>>>> [ 2030.380369] drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper]
>>>> [ 2030.380418] drm_atomic_commit+0xa8/0xd4 [drm]
>>>> [ 2030.380529] drm_client_modeset_commit_atomic+0x16c/0x244 [drm]
>>>> [ 2030.380641] drm_client_modeset_commit_locked+0x50/0x168 [drm]
>>>> [ 2030.380753] drm_client_modeset_commit+0x2c/0x54 [drm]
>>>> [ 2030.380865] __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 [drm_kms_helper]
>>>> [ 2030.380915] drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper]
>>>> [ 2030.380965] msm_fbdev_client_hotplug+0x28/0xc8 [msm]
>>>> [ 2030.381081] drm_client_dev_hotplug+0x94/0x118 [drm]
>>>> [ 2030.381192] output_poll_execute+0x214/0x26c [drm_kms_helper]
>>>> [ 2030.381241] process_scheduled_works+0x19c/0x2cc
>>>> [ 2030.381249] worker_thread+0x290/0x3cc
>>>> [ 2030.381255] kthread+0xfc/0x184
>>>> [ 2030.381260] ret_from_fork+0x10/0x20
>>>>
>>>> The above could happen if the convoluted hotplug state machine breaks
>>>> down so that the device is runtime suspended before
>>>> dp_bridge_atomic_disable() is called.
>>
>>> Yes, the state machine got broken, and I have explained how in the
>>> commit text of the RFC. But it's not necessarily due to runtime PM;
>>> rather, a sequence of events from userspace exposes this breakage.
>>
>> After looking at this some more today, I agree with you. The
>> observations I've reported are consistent with what would happen if the
>> link clock is disabled when dp_bridge_atomic_disable() is called.
>>
>> That clock is disabled in dp_bridge_atomic_post_disable() before runtime
>> suspending, but perhaps there are some further paths that can end up
>> disabling it.
>
> It turns out there are paths like that, and we hit those more often
> before reverting e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()
> to report HPD status changes").
>
> Specifically, when a previous connect attempt did not enable the bridge
> fully, so that it is still in the ST_MAINLINK_READY state when we receive
> a disconnect event, dp_hpd_unplug_handle() will turn off the link clock.
>
> [ 204.527625] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2
> [ 204.531553] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
> [ 204.533261] msm-dp-display ae98000.displayport-controller: dp_ctrl_off_link
>
> A racing connect event, such as the one I described earlier, can then
> try to enable the bridge again but dp_bridge_atomic_enable() just bails
> out early (and leaks an rpm reference) because we're now in
> ST_DISCONNECTED:
>
> [ 204.535773] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1
> [ 204.536187] [CONNECTOR:35:DP-2] status updated from disconnected to connected
> [ 204.536905] msm-dp-display ae98000.displayport-controller: dp_display_notify_disconnect - would clear link ready (1), state = 0
> [ 204.537821] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_check - link_ready = 1
> [ 204.538063] msm-dp-display ae98000.displayport-controller: dp_display_send_hpd_notification - hpd = 0, link_ready = 1
> [ 204.542778] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable
> [ 204.586547] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable - state = 0 (rpm leak?)
>
> Clearing link_ready already in dp_display_notify_disconnect() would make
> the race window slightly smaller, but it would essentially just paper
> over the bug as the events are still not serialised. Notably, there is
> no user space interaction involved here and it's the spurious connect
> event that triggers the bridge enable.
>

Yes, it only narrows the race condition window. The issue can still
happen if the commit/modeset was issued before we marked link_ready as
false.

And yes, I was only targeting a short-term fix until we rework the HPD
handling. That will happen only incrementally, as it's a delicate piece
of code.

> When the fbdev hotplug code later disables the never-enabled bridge,
> things go boom:
>
> [ 204.649072] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2
> [ 204.650378] [CONNECTOR:35:DP-2] status updated from connected to disconnected
> [ 204.651111] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_disable
>
> as the link clock has already been disabled when accessing the link
> registers.
>
> The stack trace for the bridge enable above is:
>
> [ 204.553922] dp_bridge_atomic_enable+0x40/0x2f0 [msm]
> [ 204.555241] drm_atomic_bridge_chain_enable+0x54/0xc8 [drm]
> [ 204.556557] drm_atomic_helper_commit_modeset_enables+0x194/0x26c [drm_kms_helper]
> [ 204.557853] msm_atomic_commit_tail+0x204/0x804 [msm]
> [ 204.559173] commit_tail+0xa4/0x18c [drm_kms_helper]
> [ 204.560450] drm_atomic_helper_commit+0x19c/0x1b0 [drm_kms_helper]
> [ 204.561743] drm_atomic_commit+0xa4/0xdc [drm]
> [ 204.563065] drm_client_modeset_commit_atomic+0x22c/0x298 [drm]
> [ 204.564402] drm_client_modeset_commit_locked+0x60/0x1c0 [drm]
> [ 204.565733] drm_client_modeset_commit+0x30/0x58 [drm]
> [ 204.567055] __drm_fb_helper_restore_fbdev_mode_unlocked+0xbc/0xfc [drm_kms_helper]
> [ 204.568381] drm_fb_helper_hotplug_event.part.0+0xd4/0x110 [drm_kms_helper]
> [ 204.569708] drm_fb_helper_hotplug_event+0x38/0x44 [drm_kms_helper]
> [ 204.571032] msm_fbdev_client_hotplug+0x28/0xd4 [msm]
> [ 204.572395] drm_client_dev_hotplug+0xcc/0x130 [drm]
> [ 204.573755] drm_kms_helper_connector_hotplug_event+0x34/0x44 [drm_kms_helper]
> [ 204.575114] drm_bridge_connector_hpd_cb+0x90/0xa4 [drm_kms_helper]
> [ 204.576465] drm_bridge_hpd_notify+0x40/0x5c [drm]
> [ 204.577842] drm_aux_hpd_bridge_notify+0x18/0x28 [aux_hpd_bridge]
> [ 204.579184] pmic_glink_altmode_worker+0xc0/0x23c [pmic_glink_altmode]
>
>>>> For some reason, possibly due to unrelated changes in timing, possibly
>>>> after the hotplug revert, I am no longer able to reproduce the reset
>>>> with 6.8-rc7 on the X13s.
>>
>>> I do not know how the hotplug revert fixed this stack because I think
>>> this can still happen.
>
> So, while it may still be theoretically possible to hit the resets after
> the revert, the HPD notify revert effectively "fixed" the regression in
> 6.8-rc1 by removing the preconditions that now made us hit it (i.e. the
> half-initialised bridge).
>

Not entirely. The bug which was reported before the patch e467e0bde881
("drm/msm/dp: use drm_bridge_hpd_notify() to report HPD status
changes") landed is a classic example of how this issue can happen with
userspace involvement, and not just with the fbdev client, which was
your case.

> It seems the hotplug state machine needs to be reworked completely, but
> at least we're roughly back where we were with 6.7 (including that the
> bus clocks will never be turned off because of the rpm leaks on
> disconnect).
>

Yes, we already landed that revert, but I am also planning to land the
patch I had posted until we rework HPD.

> Johan

2024-03-12 08:31:34

by Johan Hovold

Subject: Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1

On Mon, Mar 11, 2024 at 09:51:29AM -0700, Abhinav Kumar wrote:
> On 3/11/2024 6:43 AM, Johan Hovold wrote:
> > On Sat, Mar 09, 2024 at 05:30:17PM +0100, Johan Hovold wrote:
> >> On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote:

> >>> I have posted my analysis with the patch here as an RFC yesterday:
> >>>
> >>> https://patchwork.freedesktop.org/patch/581758/

> > I was able to reproduce the reset with some of my debug printks in place
> > after reapplying the reverted hpd notify change, so I have an explanation
> > for (one of) the ways we can end up in this state now.
> >
> > This does not match the description of the problem in the fix linked to
> > above, and I don't think that patch solves the underlying issue even if
> > it may make the race window somewhat smaller. More details below.

> It's the same condition you described below: the enable does not go
> through and we bail out as we are in ST_DISCONNECTED.

It's closely related but clearly not the same, as user space is not
involved in the reset I see.

> > It turns out there are paths like that, and we hit those more often
> > before reverting e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify()
> > to report HPD status changes").
> >
> > Specifically, when a previous connect attempt did not enable the bridge
> > fully, so that it is still in the ST_MAINLINK_READY state when we receive
> > a disconnect event, dp_hpd_unplug_handle() will turn off the link clock.
> >
> > [ 204.527625] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2
> > [ 204.531553] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle
> > [ 204.533261] msm-dp-display ae98000.displayport-controller: dp_ctrl_off_link
> >
> > A racing connect event, such as the one I described earlier, can then
> > try to enable the bridge again but dp_bridge_atomic_enable() just bails
> > out early (and leaks an rpm reference) because we're now in
> > ST_DISCONNECTED:
> >
> > [ 204.535773] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1
> > [ 204.536187] [CONNECTOR:35:DP-2] status updated from disconnected to connected
> > [ 204.536905] msm-dp-display ae98000.displayport-controller: dp_display_notify_disconnect - would clear link ready (1), state = 0
> > [ 204.537821] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_check - link_ready = 1
> > [ 204.538063] msm-dp-display ae98000.displayport-controller: dp_display_send_hpd_notification - hpd = 0, link_ready = 1
> > [ 204.542778] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable
> > [ 204.586547] msm-dp-display ae98000.displayport-controller: dp_bridge_atomic_enable - state = 0 (rpm leak?)
> >
> > Clearing link_ready already in dp_display_notify_disconnect() would make
> > the race window slightly smaller, but it would essentially just paper
> > over the bug as the events are still not serialised. Notably, there is
> > no user space interaction involved here and it's the spurious connect
> > event that triggers the bridge enable.

> Yes, it only narrows the race condition window. The issue can still
> happen if the commit/modeset was issued before we marked link_ready as
> false.
>
> And yes, I was only targeting a short-term fix until we rework the HPD
> handling. That will happen only incrementally, as it's a delicate piece
> of code.

Ok, thanks for confirming. Please also make that clear in the commit
message of any patch.

I am, however, not sure that your patch (RFC) is needed at this point,
as the HPD revert fixes the 6.8-rc1 regression, and moving the clearing
of link_ready can actually make things worse as it causes any spurious
hotplug event to always be processed (not just those that happen after
dp_display_send_hpd_notification()).

I'll reply to your patch as well.

> > So, while it may still be theoretically possible to hit the resets after
> > the revert, the HPD notify revert effectively "fixed" the regression in
> > 6.8-rc1 by removing the preconditions that now made us hit it (i.e. the
> > half-initialised bridge).

> Not entirely. The bug which was reported before the patch e467e0bde881
> ("drm/msm/dp: use drm_bridge_hpd_notify() to report HPD status
> changes") landed is a classic example of how this issue can happen with
> userspace involvement, and not just with the fbdev client, which was
> your case.

Sure, but you already added some kind of band-aid for that issue, right?

https://lore.kernel.org/all/[email protected]/

How likely is it that you'd still hit that? Have you had any reports of
anyone actually hitting that issue since the above workaround was
merged?

Note that I, and other users of the X13s, only started hitting the
resets with 6.8-rc1, which broke hotplug notification and resulted in
the half-initialised bridges.

I'm not saying it's impossible to hit the unclocked access still, just
that that does not seem to be relevant for the regression.

> > It seems the hotplug state machine needs to be reworked completely, but
> > at least we're roughly back where we were with 6.7 (including that the
> > bus clocks will never be turned off because of the rpm leaks on
> > disconnect).

> Yes, we already landed that revert, but I am also planning to land the
> patch I had posted until we rework HPD.

Ok, but please consider the implications that would have for any
spurious connect events first.

Johan