2024-05-23 23:22:50

by Doug Anderson

[permalink] [raw]
Subject: [PATCH 0/2] serial: Fix problems when serial transfer is happening at suspend time


This is a set of two patches that fix problems related to suspending
while a serial transfer is going on. The two patches are independent
from each other and can land in any order. The only thing tying them
together is that I used the same test to reproduce both of them.
Specifically, I could reproduce my problemes by logging in via an
agetty on the debug serial port (which was _not_ used for kernel
console) and running:
cat /var/log/messages
..and then (via an SSH session) forcing a few suspend/resume cycles.

The first patch solves a problem that is probably more major. It was
introduced recently and has even shown up in stable trees.
Suspend/resume testing in ChromeOS test labs are hitting the problem
fixed by this patch. The fix hasn't been tested in labs, but when I
reproduced the problem locally I could see that the fix worked. IMO it
should land ASAP.

The second patch fixes an ancient problem that I only found because I
was trying to reproduce the first problem. Given how long it's been
around it's probably not urgent but it would be nice to get fixed.


Douglas Anderson (2):
serial: port: Don't block system suspend even if bytes are left to
xmit
serial: qcom-geni: Fix qcom_geni_serial_stop_tx_fifo() while xfer

drivers/tty/serial/qcom_geni_serial.c | 45 +++++++++++++++++++++++++--
drivers/tty/serial/serial_port.c | 10 ++++++
2 files changed, 52 insertions(+), 3 deletions(-)

--
2.45.1.288.g0e0cd299f1-goog



2024-05-23 23:23:18

by Doug Anderson

[permalink] [raw]
Subject: [PATCH 2/2] serial: qcom-geni: Fix qcom_geni_serial_stop_tx_fifo() while xfer

If qcom_geni_serial_stop_tx_fifo() was called while a transfer was
happening then the UART would be effectively stuck and wouldn't
transmit again. Specifically, I could reproduce these problem by
logging in via an agetty on the debug serial port (which was _not_
used for kernel console) and running:
cat /var/log/messages
..and then (via an SSH session) forcing a few suspend/resume cycles.

Digging into this showed a number of problems. One problem was that we
cancelled the current "tx" command but we forgot to zero out
"tx_remaining". Another problem was that we didn't drain the buffer
before cancelling the "tx" command and thus those bytes were
lost. Another problem was that failing to drain the buffer meant that
the hardware still reported something in the FIFO and that caused
qcom_geni_serial_start_tx_fifo() to be a no-op, since it doesn't do
anything if the TX FIFO is not empty.

Though I haven't gone back and validated on ancient code, it appears
from code inspection that many of these problems have existed since
the start of the driver. In the very least, I could reproduce the
problems on vanilla v5.15. The problems don't seem to reproduce when
using the serial port for kernel console output and also don't seem to
reproduce if nothing is being printed to the console at suspend time,
so this is presumably why they were not noticed until now.

Fixes: c4f528795d1a ("tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP")
Signed-off-by: Douglas Anderson <[email protected]>
---

drivers/tty/serial/qcom_geni_serial.c | 45 +++++++++++++++++++++++++--
1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 2bd25afe0d92..9110ac4bdbbf 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -265,8 +265,8 @@ static bool qcom_geni_serial_secondary_active(struct uart_port *uport)
return readl(uport->membase + SE_GENI_STATUS) & S_GENI_CMD_ACTIVE;
}

-static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
- int offset, int field, bool set)
+static bool qcom_geni_serial_poll_bitfield(struct uart_port *uport,
+ int offset, int field, u32 val)
{
u32 reg;
struct qcom_geni_serial_port *port;
@@ -295,7 +295,7 @@ static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
timeout_us = DIV_ROUND_UP(timeout_us, 10) * 10;
while (timeout_us) {
reg = readl(uport->membase + offset);
- if ((bool)(reg & field) == set)
+ if ((reg & field) == val)
return true;
udelay(10);
timeout_us -= 10;
@@ -303,6 +303,12 @@ static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
return false;
}

+static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
+ int offset, int field, bool set)
+{
+ return qcom_geni_serial_poll_bitfield(uport, offset, field, set ? field : 0);
+}
+
static void qcom_geni_serial_setup_tx(struct uart_port *uport, u32 xmit_size)
{
u32 m_cmd;
@@ -675,6 +681,31 @@ static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
if (!qcom_geni_serial_main_active(uport))
return;

+ /*
+ * Wait until the FIFO has been drained. We've already taken bytes out
+ * of the higher level queue in qcom_geni_serial_send_chunk_fifo() so
+ * if we don't drain the FIFO but send the "cancel" below they seem to
+ * get lost.
+ */
+ qcom_geni_serial_poll_bitfield(uport, SE_GENI_TX_FIFO_STATUS, TX_FIFO_WC, 0);
+
+ /*
+ * If we send the cancel immediately after the FIFO reports that it's
+ * empty then bytes still seem to get lost. From trial and error, it
+ * appears that a small delay here keeps bytes from being lost and
+ * there is (apparently) no bit that we can poll instead of this.
+ * Specifically it can be noted that the sequencer is still "active"
+ * if it's waiting for us to send it more bytes from the current
+ * transfer.
+ */
+ mdelay(1);
+
+ /*
+ * Cancel the current command. After this the main sequencer will
+ * stop reporting that it's active and we'll have to start a new
+ * transfer command. If the cancel doesn't take, we'll also send an
+ * abort.
+ */
geni_se_cancel_m_cmd(&port->se);
if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
M_CMD_CANCEL_EN, true)) {
@@ -684,6 +715,14 @@ static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
writel(M_CMD_ABORT_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
}
writel(M_CMD_CANCEL_EN, uport->membase + SE_GENI_M_IRQ_CLEAR);
+
+ /*
+ * We've cancelled the current command. "tx_remaining" stores how
+ * many bytes are left to finish in the current command so we know
+ * when to start a new command. Since the command was cancelled we
+ * need to zero "tx_remaining".
+ */
+ port->tx_remaining = 0;
}

static void qcom_geni_serial_handle_rx_fifo(struct uart_port *uport, bool drop)
--
2.45.1.288.g0e0cd299f1-goog


2024-05-23 23:24:23

by Doug Anderson

[permalink] [raw]
Subject: [PATCH 1/2] serial: port: Don't block system suspend even if bytes are left to xmit

Recently, suspend testing on sc7180-trogdor based devices has started
to sometimes fail with messages like this:

port a88000.serial:0.0: PM: calling pm_runtime_force_suspend+0x0/0xf8 @ 28934, parent: a88000.serial:0
port a88000.serial:0.0: PM: dpm_run_callback(): pm_runtime_force_suspend+0x0/0xf8 returns -16
port a88000.serial:0.0: PM: pm_runtime_force_suspend+0x0/0xf8 returned -16 after 33 usecs
port a88000.serial:0.0: PM: failed to suspend: error -16

I could reproduce these problem by logging in via an agetty on the
debug serial port (which was _not_ used for kernel console) and
running:
cat /var/log/messages
..and then (via an SSH session) forcing a few suspend/resume cycles.

Tracing through the code and doing some printf debugging shows that
the -16 (-EBUSY) comes from the recently added
serial_port_runtime_suspend().

The idea of the serial_port_runtime_suspend() function is to prevent
the port from being _runtime_ suspended if it still has bytes left to
transmit. Having bytes left to transmit isn't a reason to block
_system_ suspend, though. The DEFINE_RUNTIME_DEV_PM_OPS() used by the
serial_port code means that the system suspend function will be
pm_runtime_force_suspend(). In pm_runtime_force_suspend() we can see
that before calling the runtime suspend function we'll call
pm_runtime_disable(). This should be a reliable way to detect that
we're called from system suspend and that we shouldn't look for
busyness.

Fixes: 43066e32227e ("serial: port: Don't suspend if the port is still busy")
Signed-off-by: Douglas Anderson <[email protected]>
---

drivers/tty/serial/serial_port.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/drivers/tty/serial/serial_port.c b/drivers/tty/serial/serial_port.c
index 91a338d3cb34..b781227cc996 100644
--- a/drivers/tty/serial/serial_port.c
+++ b/drivers/tty/serial/serial_port.c
@@ -64,6 +64,16 @@ static int serial_port_runtime_suspend(struct device *dev)
if (port->flags & UPF_DEAD)
return 0;

+ /*
+ * We only want to check the busyness of the port if PM Runtime is
+ * enabled. Specifically PM Runtime will be disabled by
+ * pm_runtime_force_suspend() during system suspend and we don't want
+ * to block system suspend even if there is data still left to
+ * transmit. We only want to block regulator PM Runtime transitions.
+ */
+ if (!pm_runtime_enabled(dev))
+ return 0;
+
uart_port_lock_irqsave(port, &flags);
if (!port_dev->tx_enabled) {
uart_port_unlock_irqrestore(port, flags);
--
2.45.1.288.g0e0cd299f1-goog


2024-05-24 00:29:57

by Stephen Boyd

[permalink] [raw]
Subject: Re: [PATCH 1/2] serial: port: Don't block system suspend even if bytes are left to xmit

Quoting Douglas Anderson (2024-05-23 16:22:12)
> Recently, suspend testing on sc7180-trogdor based devices has started
> to sometimes fail with messages like this:
>
> port a88000.serial:0.0: PM: calling pm_runtime_force_suspend+0x0/0xf8 @ 28934, parent: a88000.serial:0
> port a88000.serial:0.0: PM: dpm_run_callback(): pm_runtime_force_suspend+0x0/0xf8 returns -16
> port a88000.serial:0.0: PM: pm_runtime_force_suspend+0x0/0xf8 returned -16 after 33 usecs
> port a88000.serial:0.0: PM: failed to suspend: error -16
>
> I could reproduce these problem by logging in via an agetty on the
> debug serial port (which was _not_ used for kernel console) and
> running:
> cat /var/log/messages
> ...and then (via an SSH session) forcing a few suspend/resume cycles.
>
> Tracing through the code and doing some printf debugging shows that
> the -16 (-EBUSY) comes from the recently added
> serial_port_runtime_suspend().
>
> The idea of the serial_port_runtime_suspend() function is to prevent
> the port from being _runtime_ suspended if it still has bytes left to
> transmit. Having bytes left to transmit isn't a reason to block
> _system_ suspend, though.

Can you elaborate? I paused to think that maybe we would want to make
sure that everything that was transmitted had been transmitted but that
doesn't seem right because it's a problem for higher layers to solve,
e.g. serdev would want to make sure some sleep command sent over the
wire actually got sent.

> The DEFINE_RUNTIME_DEV_PM_OPS() used by the
> serial_port code means that the system suspend function will be
> pm_runtime_force_suspend(). In pm_runtime_force_suspend() we can see
> that before calling the runtime suspend function we'll call
> pm_runtime_disable(). This should be a reliable way to detect that
> we're called from system suspend and that we shouldn't look for
> busyness.
>
> Fixes: 43066e32227e ("serial: port: Don't suspend if the port is still busy")
> Signed-off-by: Douglas Anderson <[email protected]>
> ---
>
> drivers/tty/serial/serial_port.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/drivers/tty/serial/serial_port.c b/drivers/tty/serial/serial_port.c
> index 91a338d3cb34..b781227cc996 100644
> --- a/drivers/tty/serial/serial_port.c
> +++ b/drivers/tty/serial/serial_port.c
> @@ -64,6 +64,16 @@ static int serial_port_runtime_suspend(struct device *dev)
> if (port->flags & UPF_DEAD)
> return 0;
>
> + /*
> + * We only want to check the busyness of the port if PM Runtime is
> + * enabled. Specifically PM Runtime will be disabled by
> + * pm_runtime_force_suspend() during system suspend and we don't want
> + * to block system suspend even if there is data still left to
> + * transmit. We only want to block regulator PM Runtime transitions.

s/regulator/regular/

Is this a typo? Also, why is "runtime" capitalized?

> + */
> + if (!pm_runtime_enabled(dev))
> + return 0;

2024-05-24 00:38:48

by Stephen Boyd

[permalink] [raw]
Subject: Re: [PATCH 2/2] serial: qcom-geni: Fix qcom_geni_serial_stop_tx_fifo() while xfer

Quoting Douglas Anderson (2024-05-23 16:22:13)
> diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
> index 2bd25afe0d92..9110ac4bdbbf 100644
> --- a/drivers/tty/serial/qcom_geni_serial.c
> +++ b/drivers/tty/serial/qcom_geni_serial.c
> @@ -265,8 +265,8 @@ static bool qcom_geni_serial_secondary_active(struct uart_port *uport)
> return readl(uport->membase + SE_GENI_STATUS) & S_GENI_CMD_ACTIVE;
> }
>
> -static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
> - int offset, int field, bool set)
> +static bool qcom_geni_serial_poll_bitfield(struct uart_port *uport,
> + int offset, int field, u32 val)

Can these be unsigned offset and field?

> {
> u32 reg;
> struct qcom_geni_serial_port *port;
> @@ -295,7 +295,7 @@ static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
> timeout_us = DIV_ROUND_UP(timeout_us, 10) * 10;
> while (timeout_us) {
> reg = readl(uport->membase + offset);
> - if ((bool)(reg & field) == set)
> + if ((reg & field) == val)
> return true;
> udelay(10);
> timeout_us -= 10;
> @@ -303,6 +303,12 @@ static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
> return false;
> }
>
> +static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
> + int offset, int field, bool set)

Can these be unsigned offset and field?

> +{
> + return qcom_geni_serial_poll_bitfield(uport, offset, field, set ? field : 0);
> +}
> +
> static void qcom_geni_serial_setup_tx(struct uart_port *uport, u32 xmit_size)
> {
> u32 m_cmd;
> @@ -675,6 +681,31 @@ static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
> if (!qcom_geni_serial_main_active(uport))
> return;
>
> + /*
> + * Wait until the FIFO has been drained. We've already taken bytes out
> + * of the higher level queue in qcom_geni_serial_send_chunk_fifo() so
> + * if we don't drain the FIFO but send the "cancel" below they seem to
> + * get lost.
> + */
> + qcom_geni_serial_poll_bitfield(uport, SE_GENI_TX_FIFO_STATUS, TX_FIFO_WC, 0);
> +
> + /*
> + * If we send the cancel immediately after the FIFO reports that it's
> + * empty then bytes still seem to get lost. From trial and error, it
> + * appears that a small delay here keeps bytes from being lost and
> + * there is (apparently) no bit that we can poll instead of this.
> + * Specifically it can be noted that the sequencer is still "active"
> + * if it's waiting for us to send it more bytes from the current
> + * transfer.
> + */
> + mdelay(1);

I wonder if the FIFO is in a different 1kb chunk of device memory and so
this needs to be an instruction barrier (isb()) to prevent the cancel
from being executed before or in parallel to the FIFO polling. Hopefully
someone at qcom can confirm this. It looks like SE_GENI_TX_FIFO_STATUS
is 0x800 offset and the cancel is at 0x600 so it looks like it may be
this problem. Device memory doesn't save us even if that has ordered
accesses :(

> +
> + /*
> + * Cancel the current command. After this the main sequencer will
> + * stop reporting that it's active and we'll have to start a new
> + * transfer command. If the cancel doesn't take, we'll also send an
> + * abort.
> + */
> geni_se_cancel_m_cmd(&port->se);
> if (!qcom_geni_serial_poll_bit(uport, SE_GENI_M_IRQ_STATUS,
> M_CMD_CANCEL_EN, true)) {

2024-05-24 15:24:25

by Doug Anderson

[permalink] [raw]
Subject: Re: [PATCH 1/2] serial: port: Don't block system suspend even if bytes are left to xmit

Hi,

On Thu, May 23, 2024 at 5:25 PM Stephen Boyd <[email protected]> wrote:
>
> Quoting Douglas Anderson (2024-05-23 16:22:12)
> > Recently, suspend testing on sc7180-trogdor based devices has started
> > to sometimes fail with messages like this:
> >
> > port a88000.serial:0.0: PM: calling pm_runtime_force_suspend+0x0/0xf8 @ 28934, parent: a88000.serial:0
> > port a88000.serial:0.0: PM: dpm_run_callback(): pm_runtime_force_suspend+0x0/0xf8 returns -16
> > port a88000.serial:0.0: PM: pm_runtime_force_suspend+0x0/0xf8 returned -16 after 33 usecs
> > port a88000.serial:0.0: PM: failed to suspend: error -16
> >
> > I could reproduce these problem by logging in via an agetty on the
> > debug serial port (which was _not_ used for kernel console) and
> > running:
> > cat /var/log/messages
> > ...and then (via an SSH session) forcing a few suspend/resume cycles.
> >
> > Tracing through the code and doing some printf debugging shows that
> > the -16 (-EBUSY) comes from the recently added
> > serial_port_runtime_suspend().
> >
> > The idea of the serial_port_runtime_suspend() function is to prevent
> > the port from being _runtime_ suspended if it still has bytes left to
> > transmit. Having bytes left to transmit isn't a reason to block
> > _system_ suspend, though.
>
> Can you elaborate? I paused to think that maybe we would want to make
> sure that everything that was transmitted had been transmitted but that
> doesn't seem right because it's a problem for higher layers to solve,
> e.g. serdev would want to make sure some sleep command sent over the
> wire actually got sent.

I don't have a ton of intuition about how the new "serial_port" code
is designed to work. The patch that added it mentioned that it was
based on suggestions by a whole bunch of people but there were no
links to the previous discussions so it wasn't easy to get tons of
context. I would certainly love it if people who have been involved
could chime in and say whether my patch is correct. If anyone has any
suggestions for something better to say in the commit message I'm
happy to use different wording as well.

In any case, looking at it I guess a serdev device would use
serdev_device_wait_until_sent() to ensure its command was sent. That
eventually seems to trickle down to the UART function tx_empty(). If a
serdev device needs to block suspend while waiting then that would be
up to the serdev device. This could be implicit if the
serdev_device_wait_until_sent() was being called as part of the serdev
device's suspend() function or perhaps could be through some sort of
locking.

..so really I think the case we're running into is if userspace has a
whole bunch of bytes queued up to write out the UART. That shouldn't
block suspend. If userspace needs to block suspend they should use
another method.


> > The DEFINE_RUNTIME_DEV_PM_OPS() used by the
> > serial_port code means that the system suspend function will be
> > pm_runtime_force_suspend(). In pm_runtime_force_suspend() we can see
> > that before calling the runtime suspend function we'll call
> > pm_runtime_disable(). This should be a reliable way to detect that
> > we're called from system suspend and that we shouldn't look for
> > busyness.
> >
> > Fixes: 43066e32227e ("serial: port: Don't suspend if the port is still busy")
> > Signed-off-by: Douglas Anderson <[email protected]>
> > ---
> >
> > drivers/tty/serial/serial_port.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/drivers/tty/serial/serial_port.c b/drivers/tty/serial/serial_port.c
> > index 91a338d3cb34..b781227cc996 100644
> > --- a/drivers/tty/serial/serial_port.c
> > +++ b/drivers/tty/serial/serial_port.c
> > @@ -64,6 +64,16 @@ static int serial_port_runtime_suspend(struct device *dev)
> > if (port->flags & UPF_DEAD)
> > return 0;
> >
> > + /*
> > + * We only want to check the busyness of the port if PM Runtime is
> > + * enabled. Specifically PM Runtime will be disabled by
> > + * pm_runtime_force_suspend() during system suspend and we don't want
> > + * to block system suspend even if there is data still left to
> > + * transmit. We only want to block regulator PM Runtime transitions.
>
> s/regulator/regular/
>
> Is this a typo?

Yeah, I'll fix that to "regular". I'll wait (probably till Tuesday)
before sending a v2.

> Also, why is "runtime" capitalized?

Somehow I thought it was supposed to be, but you're right that it
looks weird. I guess the docs call it "runtime PM" so I'll change all
instances to that in v2.

-Doug

2024-05-25 00:18:27

by Doug Anderson

[permalink] [raw]
Subject: Re: [PATCH 2/2] serial: qcom-geni: Fix qcom_geni_serial_stop_tx_fifo() while xfer

Hi,

On Thu, May 23, 2024 at 5:38 PM Stephen Boyd <[email protected]> wrote:
>
> Quoting Douglas Anderson (2024-05-23 16:22:13)
> > diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
> > index 2bd25afe0d92..9110ac4bdbbf 100644
> > --- a/drivers/tty/serial/qcom_geni_serial.c
> > +++ b/drivers/tty/serial/qcom_geni_serial.c
> > @@ -265,8 +265,8 @@ static bool qcom_geni_serial_secondary_active(struct uart_port *uport)
> > return readl(uport->membase + SE_GENI_STATUS) & S_GENI_CMD_ACTIVE;
> > }
> >
> > -static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
> > - int offset, int field, bool set)
> > +static bool qcom_geni_serial_poll_bitfield(struct uart_port *uport,
> > + int offset, int field, u32 val)
>
> Can these be unsigned offset and field?

Not new for this patch, but sure. Field should almost certainly be
u32, right? I guess offset could be "unsigned int"? If you want to get
technical it could be "size_t", but that feels like a waste when we
know the offset is tiny.


> > {
> > u32 reg;
> > struct qcom_geni_serial_port *port;
> > @@ -295,7 +295,7 @@ static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
> > timeout_us = DIV_ROUND_UP(timeout_us, 10) * 10;
> > while (timeout_us) {
> > reg = readl(uport->membase + offset);
> > - if ((bool)(reg & field) == set)
> > + if ((reg & field) == val)
> > return true;
> > udelay(10);
> > timeout_us -= 10;
> > @@ -303,6 +303,12 @@ static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
> > return false;
> > }
> >
> > +static bool qcom_geni_serial_poll_bit(struct uart_port *uport,
> > + int offset, int field, bool set)
>
> Can these be unsigned offset and field?

Sure. Same as above.



> > +{
> > + return qcom_geni_serial_poll_bitfield(uport, offset, field, set ? field : 0);
> > +}
> > +
> > static void qcom_geni_serial_setup_tx(struct uart_port *uport, u32 xmit_size)
> > {
> > u32 m_cmd;
> > @@ -675,6 +681,31 @@ static void qcom_geni_serial_stop_tx_fifo(struct uart_port *uport)
> > if (!qcom_geni_serial_main_active(uport))
> > return;
> >
> > + /*
> > + * Wait until the FIFO has been drained. We've already taken bytes out
> > + * of the higher level queue in qcom_geni_serial_send_chunk_fifo() so
> > + * if we don't drain the FIFO but send the "cancel" below they seem to
> > + * get lost.
> > + */
> > + qcom_geni_serial_poll_bitfield(uport, SE_GENI_TX_FIFO_STATUS, TX_FIFO_WC, 0);
> > +
> > + /*
> > + * If we send the cancel immediately after the FIFO reports that it's
> > + * empty then bytes still seem to get lost. From trial and error, it
> > + * appears that a small delay here keeps bytes from being lost and
> > + * there is (apparently) no bit that we can poll instead of this.
> > + * Specifically it can be noted that the sequencer is still "active"
> > + * if it's waiting for us to send it more bytes from the current
> > + * transfer.
> > + */
> > + mdelay(1);
>
> I wonder if the FIFO is in a different 1kb chunk of device memory and so
> this needs to be an instruction barrier (isb()) to prevent the cancel
> from being executed before or in parallel to the FIFO polling. Hopefully
> someone at qcom can confirm this. It looks like SE_GENI_TX_FIFO_STATUS
> is 0x800 offset and the cancel is at 0x600 so it looks like it may be
> this problem. Device memory doesn't save us even if that has ordered
> accesses :(

I spent a bunch of time digging into this today. isb() didn't help,
nor did isb() plus mb().

I searched the docs and also did a brute force attempt to figure out
what to do. I finally found two answers:

1. It appears that M_GP_LENGTH can still advance after the FIFO
becomes 0, which is extra proof that the transfer is still happening
even though the FIFO says it's done. Presumably we could keep track of
how many bytes we have enqueued into the FIFO for this command and
then compare. As I was trying to do this, though, I noticed another
option...

2. It appears that instead of "cancelling" the current command we can
just issue a new 0-byte transfer and wait for the 0-byte transfer to
be "done". This causes geni to give us back a "M_CMD_OVERRUN"
interrupt, but that's fine and we can ignore it. That interrupt just
says "hey, you gave me a command before the previous one was done" but
it does seem to properly accept the new command and it doesn't drop
any bytes.

..it turns out that we (apparently) already have been using option #2
to interrupt a transfer without dropping bytes. When the UART is
shared between an agetty and the kernel console this happens all the
time. In qcom_geni_serial_console_write() we'll issue a new command
before finishing a current one and then re-issue the current command
with any remaining bytes. So not only should this be safe but it's
already tested to work.

I'll need to spend a little more time on this to really confirm it
works as I expect and I'll send up a v2 using approach #2.

Also note that while spending more time on this I found _yet another_
bug, this one more serious. My original testing was done on kernel 6.6
(with stable backports) and I just did confirmation on mainline.
That's why I didn't see this new bug originally. ...but this time I
spent more time testing on mainline. It turns out that the recent
patches for kfifo on mainline have badly broken geni serial.
Specifically, if you just do "cat /var/log/messages" and then "Ctrl-C"
the machine will hard lockup! Yikes! This is yet another side effect
of the geni "packet"-based protocol biting us (so related to the
problems in ${SUBJECT}, but not identical). Whenever we setup a TX
transfer we set it up for the number of bytes in the queue at the
time. If that number goes down then we're in trouble. Specifically, it
can be noted that:
* When we start transmitting we look at the current queue size, setup
a transfer, and store "tx_remaining".
* Whenever there's space in the FIFO we add bytes and remove them from
the queue and "tx_remaining".
* We don't ever expect bytes to disappear from the queue. I think in
the old code if this happened we're just transfer some bogus bytes.
Now we'll loop in qcom_geni_serial_send_chunk_fifo() because
uart_fifo_out() will keep returning 0.

I'll try to take a gander at that, too...

-Doug

2024-05-27 18:08:31

by Andy Shevchenko

[permalink] [raw]
Subject: Re: [PATCH 1/2] serial: port: Don't block system suspend even if bytes are left to xmit

On Thu, May 23, 2024 at 04:22:12PM -0700, Douglas Anderson wrote:
> Recently, suspend testing on sc7180-trogdor based devices has started
> to sometimes fail with messages like this:
>
> port a88000.serial:0.0: PM: calling pm_runtime_force_suspend+0x0/0xf8 @ 28934, parent: a88000.serial:0
> port a88000.serial:0.0: PM: dpm_run_callback(): pm_runtime_force_suspend+0x0/0xf8 returns -16
> port a88000.serial:0.0: PM: pm_runtime_force_suspend+0x0/0xf8 returned -16 after 33 usecs
> port a88000.serial:0.0: PM: failed to suspend: error -16
>
> I could reproduce these problem by logging in via an agetty on the
> debug serial port (which was _not_ used for kernel console) and
> running:
> cat /var/log/messages
> ...and then (via an SSH session) forcing a few suspend/resume cycles.
>
> Tracing through the code and doing some printf debugging shows that

printf()

..or...

printf()-based

> the -16 (-EBUSY) comes from the recently added
> serial_port_runtime_suspend().
>
> The idea of the serial_port_runtime_suspend() function is to prevent
> the port from being _runtime_ suspended if it still has bytes left to
> transmit. Having bytes left to transmit isn't a reason to block
> _system_ suspend, though. The DEFINE_RUNTIME_DEV_PM_OPS() used by the
> serial_port code means that the system suspend function will be
> pm_runtime_force_suspend(). In pm_runtime_force_suspend() we can see
> that before calling the runtime suspend function we'll call
> pm_runtime_disable(). This should be a reliable way to detect that
> we're called from system suspend and that we shouldn't look for
> busyness.

..

> + /*
> + * We only want to check the busyness of the port if PM Runtime is
> + * enabled. Specifically PM Runtime will be disabled by
> + * pm_runtime_force_suspend() during system suspend and we don't want
> + * to block system suspend even if there is data still left to
> + * transmit. We only want to block regulator PM Runtime transitions.

regular

> + */
> + if (!pm_runtime_enabled(dev))
> + return 0;

--
With Best Regards,
Andy Shevchenko



2024-05-30 22:51:53

by Doug Anderson

[permalink] [raw]
Subject: Re: [PATCH 2/2] serial: qcom-geni: Fix qcom_geni_serial_stop_tx_fifo() while xfer

Hi,

On Fri, May 24, 2024 at 5:17 PM Doug Anderson <[email protected]> wrote:
>
> 1. It appears that M_GP_LENGTH can still advance after the FIFO
> becomes 0, which is extra proof that the transfer is still happening
> even though the FIFO says it's done. Presumably we could keep track of
> how many bytes we have enqueued into the FIFO for this command and
> then compare. As I was trying to do this, though, I noticed another
> option...
>
> 2. It appears that instead of "cancelling" the current command we can
> just issue a new 0-byte transfer and wait for the 0-byte transfer to
> be "done". This causes geni to give us back a "M_CMD_OVERRUN"
> interrupt, but that's fine and we can ignore it. That interrupt just
> says "hey, you gave me a command before the previous one was done" but
> it does seem to properly accept the new command and it doesn't drop
> any bytes.
>
> ...it turns out that we (apparently) already have been using option #2
> to interrupt a transfer without dropping bytes. When the UART is
> shared between an agetty and the kernel console this happens all the
> time. In qcom_geni_serial_console_write() we'll issue a new command
> before finishing a current one and then re-issue the current command
> with any remaining bytes. So not only should this be safe but it's
> already tested to work.
>
> I'll need to spend a little more time on this to really confirm it
> works as I expect and I'll send up a v2 using approach #2.

Just to connect the dots more, I did more testing and option #2
totally didn't work. The console/kdb stuff was working mostly through
ugly fragile hacks. Polling GP_LENGTH seemed like the only option,
though when I dug more I found that this wasn't the right place to do
it anyway...


> Also note that while spending more time on this I found _yet another_
> bug, this one more serious. My original testing was done on kernel 6.6
> (with stable backports) and I just did confirmation on mainline.
> That's why I didn't see this new bug originally. ...but this time I
> spent more time testing on mainline. It turns out that the recent
> patches for kfifo on mainline have badly broken geni serial.
> Specifically, if you just do "cat /var/log/messages" and then "Ctrl-C"
> the machine will hard lockup! Yikes! This is yet another side effect
> of the geni "packet"-based protocol biting us (so related to the
> problems in ${SUBJECT}, but not identical). Whenever we setup a TX
> transfer we set it up for the number of bytes in the queue at the
> time. If that number goes down then we're in trouble. Specifically, it
> can be noted that:
> * When we start transmitting we look at the current queue size, setup
> a transfer, and store "tx_remaining".
> * Whenever there's space in the FIFO we add bytes and remove them from
> the queue and "tx_remaining".
> * We don't ever expect bytes to disappear from the queue. I think in
> the old code if this happened we're just transfer some bogus bytes.
> Now we'll loop in qcom_geni_serial_send_chunk_fifo() because
> uart_fifo_out() will keep returning 0.
>
> I'll try to take a gander at that, too...

I spent _far_ too long poking at this and I've finally come up with a
v2 that _I think_ fixes everything. For reference:

https://lore.kernel.org/r/[email protected]