2019-06-10 17:24:40

by Jorge Ramirez-Ortiz

[permalink] [raw]
Subject: [PATCH v3] tty: serial: msm_serial: avoid system lockup condition

The function msm_wait_for_xmitr can be taken with interrupts
disabled. In order to avoid a potential system lockup - demonstrated
under stress testing conditions on SoC QCS404/5 - make sure we wait
for a bounded amount of time.

Tested on SoC QCS404.

Signed-off-by: Jorge Ramirez-Ortiz <[email protected]>
---
v2: fix exit condition (timeout --> !timeout)
v3: add these clarification messages

drivers/tty/serial/msm_serial.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/drivers/tty/serial/msm_serial.c b/drivers/tty/serial/msm_serial.c
index 23833ad952ba..3657a24913fc 100644
--- a/drivers/tty/serial/msm_serial.c
+++ b/drivers/tty/serial/msm_serial.c
@@ -383,10 +383,14 @@ static void msm_request_rx_dma(struct msm_port *msm_port, resource_size_t base)

static inline void msm_wait_for_xmitr(struct uart_port *port)
{
+ unsigned int timeout = 500000;
+
while (!(msm_read(port, UART_SR) & UART_SR_TX_EMPTY)) {
if (msm_read(port, UART_ISR) & UART_ISR_TX_READY)
break;
udelay(1);
+ if (!timeout--)
+ break;
}
msm_write(port, UART_CR_CMD_RESET_TX_READY, UART_CR);
}
--
2.21.0


2019-06-10 17:53:58

by Rob Clark

[permalink] [raw]
Subject: Re: [PATCH v3] tty: serial: msm_serial: avoid system lockup condition

On Mon, Jun 10, 2019 at 10:23 AM Jorge Ramirez-Ortiz
<[email protected]> wrote:
>
> The function msm_wait_for_xmitr can be taken with interrupts
> disabled. In order to avoid a potential system lockup - demonstrated
> under stress testing conditions on SoC QCS404/5 - make sure we wait
> for a bounded amount of time.
>
> Tested on SoC QCS404.
>
> Signed-off-by: Jorge Ramirez-Ortiz <[email protected]>


I had observed that heavy UART traffic would lockup the system (on
sdm845, but I guess same serial driver)?

But a comment from the peanut gallary: wouldn't this fix lead to TX
corruption, ie. writing more into TX fifo before hw is ready? I
haven't looked closely at the driver, but a way to wait without irqs
disabled would seem nicer..

BR,
-R

> ---
> v2: fix exit condition (timeout --> !timeout)
> v3: add these clarification messages
>
> drivers/tty/serial/msm_serial.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/tty/serial/msm_serial.c b/drivers/tty/serial/msm_serial.c
> index 23833ad952ba..3657a24913fc 100644
> --- a/drivers/tty/serial/msm_serial.c
> +++ b/drivers/tty/serial/msm_serial.c
> @@ -383,10 +383,14 @@ static void msm_request_rx_dma(struct msm_port *msm_port, resource_size_t base)
>
> static inline void msm_wait_for_xmitr(struct uart_port *port)
> {
> + unsigned int timeout = 500000;
> +
> while (!(msm_read(port, UART_SR) & UART_SR_TX_EMPTY)) {
> if (msm_read(port, UART_ISR) & UART_ISR_TX_READY)
> break;
> udelay(1);
> + if (!timeout--)
> + break;
> }
> msm_write(port, UART_CR_CMD_RESET_TX_READY, UART_CR);
> }
> --
> 2.21.0
>

2019-06-10 19:11:45

by Jorge Ramirez-Ortiz

[permalink] [raw]
Subject: Re: [PATCH v3] tty: serial: msm_serial: avoid system lockup condition

On 6/10/19 19:53, Rob Clark wrote:
> On Mon, Jun 10, 2019 at 10:23 AM Jorge Ramirez-Ortiz
> <[email protected]> wrote:
>> The function msm_wait_for_xmitr can be taken with interrupts
>> disabled. In order to avoid a potential system lockup - demonstrated
>> under stress testing conditions on SoC QCS404/5 - make sure we wait
>> for a bounded amount of time.
>>
>> Tested on SoC QCS404.
>>
>> Signed-off-by: Jorge Ramirez-Ortiz <[email protected]>
>
> I had observed that heavy UART traffic would lockup the system (on
> sdm845, but I guess same serial driver)?
>
> But a comment from the peanut gallary: wouldn't this fix lead to TX
> corruption, ie. writing more into TX fifo before hw is ready? I
> haven't looked closely at the driver, but a way to wait without irqs
> disabled would seem nicer..
>
> BR,
> -R
>

I think sdm845 uses a different driver (qcom_geni_serial.c) but yes in
any case we need to determine the sequence leading to the lockup. In our
internal releases we are adding additional debug information to try to
capture this info.

But also I dont think this means that the safety net should not be used

btw, do you think that perhaps we should add a WARN_ONCE() on timeout?.

2019-06-10 19:40:29

by Rob Clark

[permalink] [raw]
Subject: Re: [PATCH v3] tty: serial: msm_serial: avoid system lockup condition

On Mon, Jun 10, 2019 at 12:11 PM Jorge Ramirez
<[email protected]> wrote:
>
> On 6/10/19 19:53, Rob Clark wrote:
> > On Mon, Jun 10, 2019 at 10:23 AM Jorge Ramirez-Ortiz
> > <[email protected]> wrote:
> >> The function msm_wait_for_xmitr can be taken with interrupts
> >> disabled. In order to avoid a potential system lockup - demonstrated
> >> under stress testing conditions on SoC QCS404/5 - make sure we wait
> >> for a bounded amount of time.
> >>
> >> Tested on SoC QCS404.
> >>
> >> Signed-off-by: Jorge Ramirez-Ortiz <[email protected]>
> >
> > I had observed that heavy UART traffic would lockup the system (on
> > sdm845, but I guess same serial driver)?
> >
> > But a comment from the peanut gallary: wouldn't this fix lead to TX
> > corruption, ie. writing more into TX fifo before hw is ready? I
> > haven't looked closely at the driver, but a way to wait without irqs
> > disabled would seem nicer..
> >
> > BR,
> > -R
> >
>
> I think sdm845 uses a different driver (qcom_geni_serial.c) but yes in
> any case we need to determine the sequence leading to the lockup. In our
> internal releases we are adding additional debug information to try to
> capture this info.

ahh, ok.. perhaps qcom_geni_serial has a similar issue.. fwiw where I
tend to hit it is debugging mesa, bugs that can trigger GPU lockups
can tricker a lot of them, and a lot of dmesg spew. Which in turn
seems to freeze usb (? I think.. I'm using a usb-c ethernet adapter)
making it hard to ctrl-c the thing that is causing the GPU lockups in
the first place.

> But also I dont think this means that the safety net should not be used

yeah, probably not worse than the current state.. although a proper
solution would be nice

> btw, do you think that perhaps we should add a WARN_ONCE() on timeout?.

not sure if backtrace adds much value here.. but perhaps a (very)
ratelimited warning msg? You don't want to make the underlying
problem too much worse with too much debug msg but some hint about
what is happening could be useful.

BR,
-R