LinuxLists.cc - [PATCH] spi: fsi: Implement a timeout for polling status

2022-03-17 21:47:41

Subject: [PATCH] spi: fsi: Implement a timeout for polling status

The data transfer routines must poll the status register to
determine when more data can be shifted in or out. If the hardware
gets into a bad state, these polling loops may never exit. Prevent
this by returning an error if a timeout is exceeded.

Signed-off-by: Eddie James <[email protected]>
---
drivers/spi/spi-fsi.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/drivers/spi/spi-fsi.c b/drivers/spi/spi-fsi.c
index b6c7467f0b59..d403a7a3021d 100644
--- a/drivers/spi/spi-fsi.c
+++ b/drivers/spi/spi-fsi.c
@@ -25,6 +25,7 @@

#define SPI_FSI_BASE 0x70000
#define SPI_FSI_INIT_TIMEOUT_MS 1000
+#define SPI_FSI_STATUS_TIMEOUT_MS 100
#define SPI_FSI_MAX_RX_SIZE 8
#define SPI_FSI_MAX_TX_SIZE 40

@@ -299,6 +300,7 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
struct spi_transfer *transfer)
{
int rc = 0;
+ unsigned long end;
u64 status = 0ULL;

if (transfer->tx_buf) {
@@ -315,10 +317,14 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
if (rc)
return rc;

+ end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
do {
rc = fsi_spi_status(ctx, &status, "TX");
if (rc)
return rc;
+
+ if (time_after(jiffies, end))
+ return -ETIMEDOUT;
} while (status & SPI_FSI_STATUS_TDR_FULL);

sent += nb;
@@ -329,10 +335,14 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
u8 *rx = transfer->rx_buf;

while (transfer->len > recv) {
+ end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
do {
rc = fsi_spi_status(ctx, &status, "RX");
if (rc)
return rc;
+
+ if (time_after(jiffies, end))
+ return -ETIMEDOUT;
} while (!(status & SPI_FSI_STATUS_RDR_FULL));

rc = fsi_spi_read_reg(ctx, SPI_FSI_DATA_RX, &in);
--
2.27.0

2022-03-18 17:47:57

by Eddie James

[permalink] [raw]

Subject: Re: [PATCH] spi: fsi: Implement a timeout for polling status

On 3/17/22 23:19, Joel Stanley wrote:
> On Thu, 17 Mar 2022 at 21:14, Eddie James <[email protected]> wrote:
>> The data transfer routines must poll the status register to
>> determine when more data can be shifted in or out. If the hardware
>> gets into a bad state, these polling loops may never exit. Prevent
>> this by returning an error if a timeout is exceeded.
> This makes sense. We may even want to put this code in regardless.
>
> However, I'm wondering why the code in fsi_spi_status didn't catch this.

Same, which is why I thought the problem couldn't be happening here for
a long time. See below with what I think is going on.

>
>> static int fsi_spi_status(struct fsi_spi *ctx, u64 *status, const char *dir)
>> {
>> int rc = fsi_spi_read_reg(ctx, SPI_FSI_STATUS, status);
> You mentioned the error condition is we get back 0xff. That means that
> status will be 0xffff_ffff_ffff_ffff ?
>
> Did you observe status being this value?

No, I think my observation of 0xff is not universal. I suspect that
while the CFAM is IN reset, 0xff is returned, but once it's been reset,
valid (though maybe uninitialized) data is returned. I observed a status
of 0x0001100000000000, which means the controller is idle, which makes
sense since it's been reset. So the issue occurs if we start an
operation before a CFAM reset and are waiting for it to complete during
the CFAM reset, but we also don't get any failed/invalid data FSI
operations during/after the reset (very timing dependent - the FSI
master does lock during the reset but doesn't wait afterwards for the
hardware to initialize).

>
>> if (rc)
>> return rc;
>>
>> if (*status & SPI_FSI_STATUS_ANY_ERROR) {
> I think that we're checking against 0xffe0f000.
>
>> dev_err(ctx->dev, "%s error: %016llx\n", dir, *status);
>>
>> rc = fsi_spi_reset(ctx);
>> if (rc)
>> return rc;
> Is the problem here? fsi_spi_reset writes to the clock config
> registers, but doesn't read the status.
>
> Obviously doing the writes causes a call to fsi_spi_check_status, but
> that checks the FSI2SPI bridge, not the SPI master.
>
> ...but it doesn't matter, because we're either going to return an
> error from the reset, or return EREMOTEIO, so there's no masking of
> the error.

Not sure I follow. I don't think we were hitting this path in this error
scenario. Do you think we need to check the status after a reset? It
should always be the same.

>
>> return -EREMOTEIO;
>> }
>>
>> return 0;
>> }
>
>> Signed-off-by: Eddie James <[email protected]>
>> ---
>> drivers/spi/spi-fsi.c | 10 ++++++++++
>> 1 file changed, 10 insertions(+)
>>
>> diff --git a/drivers/spi/spi-fsi.c b/drivers/spi/spi-fsi.c
>> index b6c7467f0b59..d403a7a3021d 100644
>> --- a/drivers/spi/spi-fsi.c
>> +++ b/drivers/spi/spi-fsi.c
>> @@ -25,6 +25,7 @@
>>
>> #define SPI_FSI_BASE 0x70000
>> #define SPI_FSI_INIT_TIMEOUT_MS 1000
>> +#define SPI_FSI_STATUS_TIMEOUT_MS 100
> Can you add a comment (or put something in the commit message) about
> why you chose 100ms.

Hm, sure, but I chose it pretty arbitrarily. I'm not sure how to choose
something like this.

>
>> #define SPI_FSI_MAX_RX_SIZE 8
>> #define SPI_FSI_MAX_TX_SIZE 40
>>
>> @@ -299,6 +300,7 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
>> struct spi_transfer *transfer)
>> {
>> int rc = 0;
>> + unsigned long end;
>> u64 status = 0ULL;
>>
>> if (transfer->tx_buf) {
>> @@ -315,10 +317,14 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
>> if (rc)
>> return rc;
>>
>> + end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
>> do {
>> rc = fsi_spi_status(ctx, &status, "TX");
>> if (rc)
>> return rc;
>> +
>> + if (time_after(jiffies, end))
>> + return -ETIMEDOUT;
>> } while (status & SPI_FSI_STATUS_TDR_FULL);
>>
>> sent += nb;
>> @@ -329,10 +335,14 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
>> u8 *rx = transfer->rx_buf;
>>
>> while (transfer->len > recv) {
>> + end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
>> do {
>> rc = fsi_spi_status(ctx, &status, "RX");
>> if (rc)
>> return rc;
>> +
>> + if (time_after(jiffies, end))
>> + return -ETIMEDOUT;
>> } while (!(status & SPI_FSI_STATUS_RDR_FULL));
>>
>> rc = fsi_spi_read_reg(ctx, SPI_FSI_DATA_RX, &in);
>> --
>> 2.27.0
>>

2022-03-20 22:11:30

by Joel Stanley

[permalink] [raw]

Subject: Re: [PATCH] spi: fsi: Implement a timeout for polling status

On Thu, 17 Mar 2022 at 21:14, Eddie James <[email protected]> wrote:
>
> The data transfer routines must poll the status register to
> determine when more data can be shifted in or out. If the hardware
> gets into a bad state, these polling loops may never exit. Prevent
> this by returning an error if a timeout is exceeded.

This makes sense. We may even want to put this code in regardless.

However, I'm wondering why the code in fsi_spi_status didn't catch this.

> static int fsi_spi_status(struct fsi_spi *ctx, u64 *status, const char *dir)
> {
> int rc = fsi_spi_read_reg(ctx, SPI_FSI_STATUS, status);

You mentioned the error condition is we get back 0xff. That means that
status will be 0xffff_ffff_ffff_ffff ?

Did you observe status being this value?

> if (rc)
> return rc;
>
> if (*status & SPI_FSI_STATUS_ANY_ERROR) {

I think that we're checking against 0xffe0f000.

> dev_err(ctx->dev, "%s error: %016llx\n", dir, *status);
>
> rc = fsi_spi_reset(ctx);
> if (rc)
> return rc;

Is the problem here? fsi_spi_reset writes to the clock config
registers, but doesn't read the status.

Obviously doing the writes causes a call to fsi_spi_check_status, but
that checks the FSI2SPI bridge, not the SPI master.

...but it doesn't matter, because we're either going to return an
error from the reset, or return EREMOTEIO, so there's no masking of
the error.

>
> return -EREMOTEIO;
> }
>
> return 0;
> }

>
> Signed-off-by: Eddie James <[email protected]>
> ---
> drivers/spi/spi-fsi.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/drivers/spi/spi-fsi.c b/drivers/spi/spi-fsi.c
> index b6c7467f0b59..d403a7a3021d 100644
> --- a/drivers/spi/spi-fsi.c
> +++ b/drivers/spi/spi-fsi.c
> @@ -25,6 +25,7 @@
>
> #define SPI_FSI_BASE 0x70000
> #define SPI_FSI_INIT_TIMEOUT_MS 1000
> +#define SPI_FSI_STATUS_TIMEOUT_MS 100

Can you add a comment (or put something in the commit message) about
why you chose 100ms.

> #define SPI_FSI_MAX_RX_SIZE 8
> #define SPI_FSI_MAX_TX_SIZE 40
>
> @@ -299,6 +300,7 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
> struct spi_transfer *transfer)
> {
> int rc = 0;
> + unsigned long end;
> u64 status = 0ULL;
>
> if (transfer->tx_buf) {
> @@ -315,10 +317,14 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
> if (rc)
> return rc;
>
> + end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
> do {
> rc = fsi_spi_status(ctx, &status, "TX");
> if (rc)
> return rc;
> +
> + if (time_after(jiffies, end))
> + return -ETIMEDOUT;
> } while (status & SPI_FSI_STATUS_TDR_FULL);
>
> sent += nb;
> @@ -329,10 +335,14 @@ static int fsi_spi_transfer_data(struct fsi_spi *ctx,
> u8 *rx = transfer->rx_buf;
>
> while (transfer->len > recv) {
> + end = jiffies + msecs_to_jiffies(SPI_FSI_STATUS_TIMEOUT_MS);
> do {
> rc = fsi_spi_status(ctx, &status, "RX");
> if (rc)
> return rc;
> +
> + if (time_after(jiffies, end))
> + return -ETIMEDOUT;
> } while (!(status & SPI_FSI_STATUS_RDR_FULL));
>
> rc = fsi_spi_read_reg(ctx, SPI_FSI_DATA_RX, &in);
> --
> 2.27.0
>

2022-03-21 12:37:54

by Mark Brown

[permalink] [raw]

Subject: Re: [PATCH] spi: fsi: Implement a timeout for polling status

On Thu, 17 Mar 2022 16:14:26 -0500, Eddie James wrote:
> The data transfer routines must poll the status register to
> determine when more data can be shifted in or out. If the hardware
> gets into a bad state, these polling loops may never exit. Prevent
> this by returning an error if a timeout is exceeded.
>
>

Applied to

https://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi.git for-next

Thanks!

[1/1] spi: fsi: Implement a timeout for polling status
commit: 89b35e3f28514087d3f1e28e8f5634fbfd07c554

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark