2019-09-25 14:33:14

by Philipp Puschmann

[permalink] [raw]
Subject: [PATCH v5 0/3] Fix UART DMA freezes for i.MX SOCs

For some years and since many kernel versions there are reports that
RX UART DMA channel stops working at one point. So far the usual
workaround was to disable RX DMA. This patches fix the underlying
problem.

When a running sdma script does not find any usable destination buffer
to put its data into it just leads to stopping the channel being
scheduled again. As solution we manually retrigger the sdma script for
this channel and by this dissolve the freeze.

While this seems to work fine so far, it may come to buffer overruns
when the channel - even temporary - is stopped. This case has to be
addressed by device drivers by increasing the number of DMA periods.

This patch series was tested with the current kernel and backported to
kernel 4.15 with a special use case using a WL1837MOD via UART and
provoking the hanging of UART RX DMA within seconds after starting a
test application. It resulted in well known
"Bluetooth: hci0: command 0x0408 tx timeout"
errors and complete stop of UART data reception. Our Bluetooth traffic
consists of many independent small packets, mostly only a few bytes,
causing high usage of periods.

Signed-off-by: Philipp Puschmann <[email protected]>
Reviewed-by: Fugang Duan <[email protected]>

---

Changelog v5:
- join with patch version from Jan Luebbe
- adapt comments and patch descriptions
- add Reviewed-by

Changelog v4:
- fixed the fixes tags

Changelog v3:
- fixes typo in dma_wmb
- add fixes tags

Changelog v2:
- adapt title (this patches are not only for i.MX6)
- improve some comments and patch descriptions
- add a dma_wb() around BD_DONE flag
- add Reviewed-by tags
- split off "serial: imx: adapt rx buffer and dma periods"

Philipp Puschmann (3):
dmaengine: imx-sdma: fix buffer ownership
dmaengine: imx-sdma: fix dma freezes
dmaengine: imx-sdma: drop redundant variable

drivers/dma/imx-sdma.c | 37 +++++++++++++++++++++++++++----------
1 file changed, 27 insertions(+), 10 deletions(-)

--
2.23.0


2019-09-25 14:34:45

by Philipp Puschmann

[permalink] [raw]
Subject: [PATCH v5 1/3] dmaengine: imx-sdma: fix buffer ownership

BD_DONE flag marks ownership of the buffer. When 1 SDMA owns the
buffer, when 0 ARM owns it. When processing the buffers in
sdma_update_channel_loop the ownership of the currently processed
buffer was set to SDMA again before running the callback function of
the buffer and while the sdma script may be running in parallel. So
there was the possibility to get the buffer overwritten by SDMA before
it has been processed by kernel leading to kind of random errors in the
upper layers, e.g. bluetooth.

Fixes: 1ec1e82f2510 ("dmaengine: Add Freescale i.MX SDMA support")
Signed-off-by: Philipp Puschmann <[email protected]>
---

Changelog v5:
- no changes

Changelog v4:
- fixed the fixes tag

Changelog v3:
- use correct dma_wmb() instead of dma_wb()
- add fixes tag

Changelog v2:
- add dma_wb()

drivers/dma/imx-sdma.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c
index 9ba74ab7e912..b42281604e54 100644
--- a/drivers/dma/imx-sdma.c
+++ b/drivers/dma/imx-sdma.c
@@ -802,7 +802,6 @@ static void sdma_update_channel_loop(struct sdma_channel *sdmac)
*/

desc->chn_real_count = bd->mode.count;
- bd->mode.status |= BD_DONE;
bd->mode.count = desc->period_len;
desc->buf_ptail = desc->buf_tail;
desc->buf_tail = (desc->buf_tail + 1) % desc->num_bd;
@@ -817,6 +816,9 @@ static void sdma_update_channel_loop(struct sdma_channel *sdmac)
dmaengine_desc_get_callback_invoke(&desc->vd.tx, NULL);
spin_lock(&sdmac->vc.lock);

+ dma_wmb();
+ bd->mode.status |= BD_DONE;
+
if (error)
sdmac->status = old_status;
}
--
2.23.0

2019-09-25 15:02:15

by Adam Ford

[permalink] [raw]
Subject: Re: [PATCH v5 0/3] Fix UART DMA freezes for i.MX SOCs

On Mon, Sep 23, 2019 at 8:58 AM Philipp Puschmann
<[email protected]> wrote:
>
> For some years and since many kernel versions there are reports that
> RX UART DMA channel stops working at one point. So far the usual
> workaround was to disable RX DMA. This patches fix the underlying
> problem.
>
> When a running sdma script does not find any usable destination buffer
> to put its data into it just leads to stopping the channel being
> scheduled again. As solution we manually retrigger the sdma script for
> this channel and by this dissolve the freeze.
>
> While this seems to work fine so far, it may come to buffer overruns
> when the channel - even temporary - is stopped. This case has to be
> addressed by device drivers by increasing the number of DMA periods.
>
> This patch series was tested with the current kernel and backported to
> kernel 4.15 with a special use case using a WL1837MOD via UART and
> provoking the hanging of UART RX DMA within seconds after starting a
> test application. It resulted in well known
> "Bluetooth: hci0: command 0x0408 tx timeout"
> errors and complete stop of UART data reception. Our Bluetooth traffic
> consists of many independent small packets, mostly only a few bytes,
> causing high usage of periods.
>

Using the 4.19.y branch, this seems to working just fine for me with an i.MX6Q
with WL1837MOD Bluetooth connected to UART2. I am still seeing some
timeouts with 5.3, but I'm going to continue to run some tests.

Tested-by: Adam Ford <[email protected]> #imx6q w/ 4.19 Kernel

> Signed-off-by: Philipp Puschmann <[email protected]>
> Reviewed-by: Fugang Duan <[email protected]>
>
> ---
>
> Changelog v5:
> - join with patch version from Jan Luebbe
> - adapt comments and patch descriptions
> - add Reviewed-by
>
> Changelog v4:
> - fixed the fixes tags
>
> Changelog v3:
> - fixes typo in dma_wmb
> - add fixes tags
>
> Changelog v2:
> - adapt title (this patches are not only for i.MX6)
> - improve some comments and patch descriptions
> - add a dma_wb() around BD_DONE flag
> - add Reviewed-by tags
> - split off "serial: imx: adapt rx buffer and dma periods"
>
> Philipp Puschmann (3):
> dmaengine: imx-sdma: fix buffer ownership
> dmaengine: imx-sdma: fix dma freezes
> dmaengine: imx-sdma: drop redundant variable
>
> drivers/dma/imx-sdma.c | 37 +++++++++++++++++++++++++++----------
> 1 file changed, 27 insertions(+), 10 deletions(-)
>
> --
> 2.23.0
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

2019-09-25 15:12:54

by Philipp Puschmann

[permalink] [raw]
Subject: Re: [PATCH v5 0/3] Fix UART DMA freezes for i.MX SOCs

Hi Adam,

Am 23.09.19 um 16:55 schrieb Adam Ford:
> On Mon, Sep 23, 2019 at 8:58 AM Philipp Puschmann
> <[email protected]> wrote:
>>
>> For some years and since many kernel versions there are reports that
>> RX UART DMA channel stops working at one point. So far the usual
>> workaround was to disable RX DMA. This patches fix the underlying
>> problem.
>>
>> When a running sdma script does not find any usable destination buffer
>> to put its data into it just leads to stopping the channel being
>> scheduled again. As solution we manually retrigger the sdma script for
>> this channel and by this dissolve the freeze.
>>
>> While this seems to work fine so far, it may come to buffer overruns
>> when the channel - even temporary - is stopped. This case has to be
>> addressed by device drivers by increasing the number of DMA periods.
>>
>> This patch series was tested with the current kernel and backported to
>> kernel 4.15 with a special use case using a WL1837MOD via UART and
>> provoking the hanging of UART RX DMA within seconds after starting a
>> test application. It resulted in well known
>> "Bluetooth: hci0: command 0x0408 tx timeout"
>> errors and complete stop of UART data reception. Our Bluetooth traffic
>> consists of many independent small packets, mostly only a few bytes,
>> causing high usage of periods.
>>
>
> Using the 4.19.y branch, this seems to working just fine for me with an i.MX6Q
> with WL1837MOD Bluetooth connected to UART2. I am still seeing some
> timeouts with 5.3, but I'm going to continue to run some tests.

Thanks for testing.
With my local setup i still have very few tx timeouts too. But i think they have a different
cause and especially different consequences. When the problem addressed by this series
appear you get a whole bunch of tx timeouts (and maybe errors from Bluetooth
layer) and monitoring received Bluetooth packets with hciconfig shows a
complete freeze of rx counter. Only resetting the hci_uart driver and the wl1837mon then helps.
With these patches applied the rx data shold still coming in even if a single or
multiple tx timeout error happen. I'm not sure where the error comes from and what the
consequences for the Bluetooth layer are.

Regards,
Philipp
>
> Tested-by: Adam Ford <[email protected]> #imx6q w/ 4.19 Kernel
>
>> Signed-off-by: Philipp Puschmann <[email protected]>
>> Reviewed-by: Fugang Duan <[email protected]>
>>
>> ---
>>
>> Changelog v5:
>> - join with patch version from Jan Luebbe
>> - adapt comments and patch descriptions
>> - add Reviewed-by
>>
>> Changelog v4:
>> - fixed the fixes tags
>>
>> Changelog v3:
>> - fixes typo in dma_wmb
>> - add fixes tags
>>
>> Changelog v2:
>> - adapt title (this patches are not only for i.MX6)
>> - improve some comments and patch descriptions
>> - add a dma_wb() around BD_DONE flag
>> - add Reviewed-by tags
>> - split off "serial: imx: adapt rx buffer and dma periods"
>>
>> Philipp Puschmann (3):
>> dmaengine: imx-sdma: fix buffer ownership
>> dmaengine: imx-sdma: fix dma freezes
>> dmaengine: imx-sdma: drop redundant variable
>>
>> drivers/dma/imx-sdma.c | 37 +++++++++++++++++++++++++++----------
>> 1 file changed, 27 insertions(+), 10 deletions(-)
>>
>> --
>> 2.23.0
>>
>>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> [email protected]
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

2019-09-25 18:49:48

by Jan Lübbe

[permalink] [raw]
Subject: Re: [PATCH v5 0/3] Fix UART DMA freezes for i.MX SOCs

On Mon, 2019-09-23 at 17:06 +0200, Philipp Puschmann wrote:
> Thanks for testing.
> With my local setup i still have very few tx timeouts too. But i think they have a different
> cause and especially different consequences. When the problem addressed by this series
> appear you get a whole bunch of tx timeouts (and maybe errors from Bluetooth
> layer) and monitoring received Bluetooth packets with hciconfig shows a
> complete freeze of rx counter. Only resetting the hci_uart driver and the wl1837mon then helps.
> With these patches applied the rx data shold still coming in even if a single or
> multiple tx timeout error happen. I'm not sure where the error comes from and what the
> consequences for the Bluetooth layer are.

For testing, I've used a UART connected to my development host and
configured *mismatching* baud rates. Sending /dev/urandom from the host
to the i.MX6 then triggered the DMA hang (because each character
triggers and error indication, which "uses" a full buffer).

Regards,
Jan
--
Pengutronix e.K. | |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |

2019-12-03 09:49:02

by Lucas Stach

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] dmaengine: imx-sdma: fix buffer ownership

On Mo, 2019-09-23 at 15:58 +0200, Philipp Puschmann wrote:
> BD_DONE flag marks ownership of the buffer. When 1 SDMA owns the
> buffer, when 0 ARM owns it. When processing the buffers in
> sdma_update_channel_loop the ownership of the currently processed
> buffer was set to SDMA again before running the callback function of
> the buffer and while the sdma script may be running in parallel. So
> there was the possibility to get the buffer overwritten by SDMA
> before
> it has been processed by kernel leading to kind of random errors in
> the
> upper layers, e.g. bluetooth.
>
> Fixes: 1ec1e82f2510 ("dmaengine: Add Freescale i.MX SDMA support")
> Signed-off-by: Philipp Puschmann <[email protected]>

Reviewed-by: Lucas Stach <[email protected]>

> ---
>
> Changelog v5:
> - no changes
>
> Changelog v4:
> - fixed the fixes tag
>
> Changelog v3:
> - use correct dma_wmb() instead of dma_wb()
> - add fixes tag
>
> Changelog v2:
> - add dma_wb()
>
> drivers/dma/imx-sdma.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c
> index 9ba74ab7e912..b42281604e54 100644
> --- a/drivers/dma/imx-sdma.c
> +++ b/drivers/dma/imx-sdma.c
> @@ -802,7 +802,6 @@ static void sdma_update_channel_loop(struct
> sdma_channel *sdmac)
> */
>
> desc->chn_real_count = bd->mode.count;
> - bd->mode.status |= BD_DONE;
> bd->mode.count = desc->period_len;
> desc->buf_ptail = desc->buf_tail;
> desc->buf_tail = (desc->buf_tail + 1) % desc->num_bd;
> @@ -817,6 +816,9 @@ static void sdma_update_channel_loop(struct
> sdma_channel *sdmac)
> dmaengine_desc_get_callback_invoke(&desc->vd.tx, NULL);
> spin_lock(&sdmac->vc.lock);
>
> + dma_wmb();
> + bd->mode.status |= BD_DONE;
> +
> if (error)
> sdmac->status = old_status;
> }

2019-12-04 09:21:52

by Robin Gong

[permalink] [raw]
Subject: RE: [PATCH v5 1/3] dmaengine: imx-sdma: fix buffer ownership

On 2019-9-23 Philipp Puschmann <[email protected]> wrote:
> BD_DONE flag marks ownership of the buffer. When 1 SDMA owns the buffer,
> when 0 ARM owns it. When processing the buffers in
> sdma_update_channel_loop the ownership of the currently processed buffer
> was set to SDMA again before running the callback function of the buffer and
> while the sdma script may be running in parallel. So there was the possibility to
> get the buffer overwritten by SDMA before it has been processed by kernel
Does this patch need indeed? I don't think any difference here move done flag
before callback or after callback, because callback never care this flag and actually
done flag is setup for next time rather than this time. Basically, this flag should be
set to 1 quickly asap so that sdma could use this bd asap. If delay the flag may cause
sdma channel stop since all BDs consumed. Could you try again your case without
this patch?
> leading to kind of random errors in the upper layers, e.g. bluetooth.
>
> Fixes: 1ec1e82f2510 ("dmaengine: Add Freescale i.MX SDMA support")
> Signed-off-by: Philipp Puschmann <[email protected]>
> ---
>
> Changelog v5:
> - no changes
>
> Changelog v4:
> - fixed the fixes tag
>
> Changelog v3:
> - use correct dma_wmb() instead of dma_wb()
> - add fixes tag
>
> Changelog v2:
> - add dma_wb()
>
> drivers/dma/imx-sdma.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c index
> 9ba74ab7e912..b42281604e54 100644
> --- a/drivers/dma/imx-sdma.c
> +++ b/drivers/dma/imx-sdma.c
> @@ -802,7 +802,6 @@ static void sdma_update_channel_loop(struct
> sdma_channel *sdmac)
> */
>
> desc->chn_real_count = bd->mode.count;
> - bd->mode.status |= BD_DONE;
> bd->mode.count = desc->period_len;
> desc->buf_ptail = desc->buf_tail;
> desc->buf_tail = (desc->buf_tail + 1) % desc->num_bd; @@ -817,6
> +816,9 @@ static void sdma_update_channel_loop(struct sdma_channel
> *sdmac)
> dmaengine_desc_get_callback_invoke(&desc->vd.tx, NULL);
> spin_lock(&sdmac->vc.lock);
>
> + dma_wmb();
> + bd->mode.status |= BD_DONE;
> +
> if (error)
> sdmac->status = old_status;
> }
> --
> 2.23.0

2019-12-10 09:53:11

by Philipp Puschmann

[permalink] [raw]
Subject: Re: [PATCH v5 1/3] dmaengine: imx-sdma: fix buffer ownership



Am 04.12.19 um 10:19 schrieb Robin Gong:
> On 2019-9-23 Philipp Puschmann <[email protected]> wrote:
>> BD_DONE flag marks ownership of the buffer. When 1 SDMA owns the buffer,
>> when 0 ARM owns it. When processing the buffers in
>> sdma_update_channel_loop the ownership of the currently processed buffer
>> was set to SDMA again before running the callback function of the buffer and
>> while the sdma script may be running in parallel. So there was the possibility to
>> get the buffer overwritten by SDMA before it has been processed by kernel
> Does this patch need indeed? I don't think any difference here move done flag
> before callback or after callback, because callback never care this flag and actually
> done flag is setup for next time rather than this time.
The callback doesn't care, but the DMA controller cares about this flag. I see a possible race
condition here. If i set the DONE flag for a specific buffer descriptor before handling the
data belonging to this buffer descriptor (aka running the callback function) the DMA script running
at the same time could corrupt that data while being processed.
Or is there are mechanism that prevents this case, that i havn't considered here.

> Basically, this flag should be
> set to 1 quickly asap so that sdma could use this bd asap. If delay the flag may cause
> sdma channel stop since all BDs consumed.

> Could you try again your case without this patch?
I don't have the hw to reproduce this available at the moment but as i remember i did run it without
this patch successfully already. The problem i have described above was more a logical or theoretical
one than a problem that really occured with my setup.

>> leading to kind of random errors in the upper layers, e.g. bluetooth.
>>
>> Fixes: 1ec1e82f2510 ("dmaengine: Add Freescale i.MX SDMA support")
>> Signed-off-by: Philipp Puschmann <[email protected]>
>> ---
>>
>> Changelog v5:
>> - no changes
>>
>> Changelog v4:
>> - fixed the fixes tag
>>
>> Changelog v3:
>> - use correct dma_wmb() instead of dma_wb()
>> - add fixes tag
>>
>> Changelog v2:
>> - add dma_wb()
>>
>> drivers/dma/imx-sdma.c | 4 +++-
>> 1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c index
>> 9ba74ab7e912..b42281604e54 100644
>> --- a/drivers/dma/imx-sdma.c
>> +++ b/drivers/dma/imx-sdma.c
>> @@ -802,7 +802,6 @@ static void sdma_update_channel_loop(struct
>> sdma_channel *sdmac)
>> */
>>
>> desc->chn_real_count = bd->mode.count;
>> - bd->mode.status |= BD_DONE;
>> bd->mode.count = desc->period_len;
>> desc->buf_ptail = desc->buf_tail;
>> desc->buf_tail = (desc->buf_tail + 1) % desc->num_bd; @@ -817,6
>> +816,9 @@ static void sdma_update_channel_loop(struct sdma_channel
>> *sdmac)
>> dmaengine_desc_get_callback_invoke(&desc->vd.tx, NULL);
>> spin_lock(&sdmac->vc.lock);
>>
>> + dma_wmb();
>> + bd->mode.status |= BD_DONE;
>> +
>> if (error)
>> sdmac->status = old_status;
>> }
>> --
>> 2.23.0
>

2019-12-10 13:03:16

by Robin Gong

[permalink] [raw]
Subject: RE: [PATCH v5 1/3] dmaengine: imx-sdma: fix buffer ownership

On 2019/12/10 17:45 Philipp Puschmann <[email protected]> wrote:
> Am 04.12.19 um 10:19 schrieb Robin Gong:
> > On 2019-9-23 Philipp Puschmann <[email protected]> wrote:
> >> BD_DONE flag marks ownership of the buffer. When 1 SDMA owns the
> >> buffer, when 0 ARM owns it. When processing the buffers in
> >> sdma_update_channel_loop the ownership of the currently processed
> >> buffer was set to SDMA again before running the callback function of
> >> the buffer and while the sdma script may be running in parallel. So
> >> there was the possibility to get the buffer overwritten by SDMA
> >> before it has been processed by kernel
> > Does this patch need indeed? I don't think any difference here move
> > done flag before callback or after callback, because callback never
> > care this flag and actually done flag is setup for next time rather than this
> time.
> The callback doesn't care, but the DMA controller cares about this flag. I see a
> possible race condition here. If i set the DONE flag for a specific buffer
> descriptor before handling the data belonging to this buffer descriptor (aka
> running the callback function) the DMA script running at the same time could
> corrupt that data while being processed.
> Or is there are mechanism that prevents this case, that i havn't considered
> here.
In theory that may happen, but in real world that's not the case:
1. SDMA is running much slower than CPU, for example, on i.MX6Q SDMA is running
at 66MHz while CPU is running at 1GHz. Besides, SDMA transfer data depends on fifo
data output frequency, such as UART 4Mhz. So your case may not be caught unless
time-consuming flow involved in callback which is not right.
2. There are multi descriptors(BDs) setup for cyclic mode, so that SDMA controller and CPU could handle data in parallel without interactions by using BD_DONE. Client driver should choose proper BD number and transfer size of BD to make sure cyclic transfer running smoothly without stop. In your case, all BDs consumed by SDMA during the narrow timing window which is between BD_DONE set and callback done at CPU side(all in interrupt handler). That never happen unless very small BD size set wrongly, such as only 32 byte or 64 byte for one BD, but generally BD size is in KB unit.
>
> > Basically, this flag should be
> > set to 1 quickly asap so that sdma could use this bd asap. If delay
> > the flag may cause sdma channel stop since all BDs consumed.
>
> > Could you try again your case without this patch?
> I don't have the hw to reproduce this available at the moment but as i
> remember i did run it without this patch successfully already. The problem i
> have described above was more a logical or theoretical one than a problem
> that really occured with my setup.
>
> >> leading to kind of random errors in the upper layers, e.g. bluetooth.
> >>
> >> Fixes: 1ec1e82f2510 ("dmaengine: Add Freescale i.MX SDMA support")
> >> Signed-off-by: Philipp Puschmann <[email protected]>
> >> ---
> >>
> >> Changelog v5:
> >> - no changes
> >>
> >> Changelog v4:
> >> - fixed the fixes tag
> >>
> >> Changelog v3:
> >> - use correct dma_wmb() instead of dma_wb()
> >> - add fixes tag
> >>
> >> Changelog v2:
> >> - add dma_wb()
> >>
> >> drivers/dma/imx-sdma.c | 4 +++-
> >> 1 file changed, 3 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/dma/imx-sdma.c b/drivers/dma/imx-sdma.c index
> >> 9ba74ab7e912..b42281604e54 100644
> >> --- a/drivers/dma/imx-sdma.c
> >> +++ b/drivers/dma/imx-sdma.c
> >> @@ -802,7 +802,6 @@ static void sdma_update_channel_loop(struct
> >> sdma_channel *sdmac)
> >> */
> >>
> >> desc->chn_real_count = bd->mode.count;
> >> - bd->mode.status |= BD_DONE;
> >> bd->mode.count = desc->period_len;
> >> desc->buf_ptail = desc->buf_tail;
> >> desc->buf_tail = (desc->buf_tail + 1) % desc->num_bd; @@
> -817,6
> >> +816,9 @@ static void sdma_update_channel_loop(struct sdma_channel
> >> *sdmac)
> >> dmaengine_desc_get_callback_invoke(&desc->vd.tx, NULL);
> >> spin_lock(&sdmac->vc.lock);
> >>
> >> + dma_wmb();
> >> + bd->mode.status |= BD_DONE;
> >> +
> >> if (error)
> >> sdmac->status = old_status;
> >> }
> >> --
> >> 2.23.0
> >