2020-01-27 13:22:36

by Peter Ujfalusi

Subject: [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next

Hi Vinod,

Based on customer reports we have identified two issues with the UDMA driver:

TX completion (1st patch):
The scheduled-work-based workaround for checking for completion worked well for
UART, but it had a significant impact on SPI performance.
The underlying issue comes from the fact that we have a split data movement
architecture.
In order to know that the transfer is really done we need to check the remote
end's (PDMA) byte counter.

RX channel teardown with stale data in PDMA (2nd patch):
If we try to stop the RX DMA channel (teardown) then PDMA tries to flush the
data it might have received from a peripheral, but if UDMA does not have a
packet to use for this draining then it is going to push back on the PDMA and
the flush will never complete.
The workaround is to use a dummy descriptor for flush purposes when the channel
is terminated and we did not have an active transfer (no descriptor for UDMA).
This allows UDMA to drain the data and the teardown can complete.

The last two patches use common code to set up the TR parameters for
slave_sg, cyclic and memcpy. The setup code is the same as we used for memcpy;
with the change we can handle 4.2GB sg elements and periods in case of cyclic.
It is also nice that we have a single function to do the configuration.

Regards,
Peter
---
Peter Ujfalusi (3):
dmaengine: ti: k3-udma: Workaround for RX teardown with stale data in
peer
dmaengine: ti: k3-udma: Move the TR counter calculation to helper
function
dmaengine: ti: k3-udma: Use the TR counter helper for slave_sg and
cyclic

Vignesh Raghavendra (1):
dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion
check

drivers/dma/ti/k3-udma.c | 452 +++++++++++++++++++++++++++++----------
1 file changed, 343 insertions(+), 109 deletions(-)

--
Peter

Texas Instruments Finland Oy, Porkkalankatu 22, 00180 Helsinki.
Y-tunnus/Business ID: 0615521-4. Kotipaikka/Domicile: Helsinki


2020-01-27 13:23:16

by Peter Ujfalusi

Subject: [PATCH for-next 4/4] dmaengine: ti: k3-udma: Use the TR counter helper for slave_sg and cyclic

Use the generic TR setup function to get the TR counters for both cyclic
and slave_sg transfers.
This way the period_size for cyclic and sg_dma_len() for slave_sg can be
as large as (SZ_64K - 1) * (SZ_64K - 1) and we can handle cases when the
length is >SZ_64K and a prime number.

Signed-off-by: Peter Ujfalusi <[email protected]>
---
drivers/dma/ti/k3-udma.c | 130 ++++++++++++++++++++++++++-------------
1 file changed, 88 insertions(+), 42 deletions(-)
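
For readers following along, here is a minimal userspace sketch of the kind
of split the commit message describes: a length below SZ_64K fits into one TR,
anything larger is expressed as rows of slightly less than SZ_64K bytes plus a
second TR for the remainder. The names split_len() and struct tr_split are
hypothetical, and the alignment handling is an assumption modelled on the
__ffs(sg_addr) argument in the diff; the driver's own udma_get_tr_counters()
(added in patch 3/4, not shown in this mail) may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_64K 0x10000u

/* Hypothetical result type for this sketch; not part of the driver. */
struct tr_split {
	int num_tr;        /* 1 or 2 TRs, -1 if the length cannot be encoded */
	uint16_t tr0_cnt0; /* bytes per row of the first TR */
	uint16_t tr0_cnt1; /* number of rows of the first TR */
	uint16_t tr1_cnt0; /* bytes of the second TR (remainder, may be 0) */
};

/*
 * Split len into 16-bit counters. align_to is the bit index of the lowest
 * set bit of the buffer address (what __ffs() would return).
 */
static struct tr_split split_len(uint64_t len, unsigned int align_to)
{
	struct tr_split s = { 0 };

	assert(len > 0);

	if (len < SZ_64K) {
		/* Small transfers: one row holding the whole length. */
		s.num_tr = 1;
		s.tr0_cnt0 = (uint16_t)len;
		s.tr0_cnt1 = 1;
		return s;
	}

	/* Assumption: alignment beyond 8 bytes brings no further benefit. */
	if (align_to > 3)
		align_to = 3;

	for (;;) {
		/* Largest row size below 64K that keeps the alignment. */
		uint64_t row = SZ_64K - (1u << align_to);

		if (len / row < SZ_64K) {
			s.num_tr = 2;
			s.tr0_cnt0 = (uint16_t)row;
			s.tr0_cnt1 = (uint16_t)(len / row);
			s.tr1_cnt0 = (uint16_t)(len % row);
			return s;
		}

		/* Row count overflows 16 bits: relax the alignment, retry. */
		if (align_to == 0) {
			s.num_tr = -1;
			return s;
		}
		align_to--;
	}
}
```

With this sketch a prime length such as 65537 and align_to = 3 becomes one row
of 65528 bytes plus a 9 byte remainder, so a length above SZ_64K no longer
needs a convenient divisor.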

diff --git a/drivers/dma/ti/k3-udma.c b/drivers/dma/ti/k3-udma.c
index 9b00013d6f63..1dba47c662c4 100644
--- a/drivers/dma/ti/k3-udma.c
+++ b/drivers/dma/ti/k3-udma.c
@@ -2079,31 +2079,31 @@ udma_prep_slave_sg_tr(struct udma_chan *uc, struct scatterlist *sgl,
unsigned int sglen, enum dma_transfer_direction dir,
unsigned long tx_flags, void *context)
{
- enum dma_slave_buswidth dev_width;
struct scatterlist *sgent;
struct udma_desc *d;
- size_t tr_size;
struct cppi5_tr_type1_t *tr_req = NULL;
+ u16 tr0_cnt0, tr0_cnt1, tr1_cnt0;
unsigned int i;
- u32 burst;
+ size_t tr_size;
+ int num_tr = 0;
+ int tr_idx = 0;

- if (dir == DMA_DEV_TO_MEM) {
- dev_width = uc->cfg.src_addr_width;
- burst = uc->cfg.src_maxburst;
- } else if (dir == DMA_MEM_TO_DEV) {
- dev_width = uc->cfg.dst_addr_width;
- burst = uc->cfg.dst_maxburst;
- } else {
- dev_err(uc->ud->dev, "%s: bad direction?\n", __func__);
+ if (!is_slave_direction(dir)) {
+ dev_err(uc->ud->dev, "Only slave cyclic is supported\n");
return NULL;
}

- if (!burst)
- burst = 1;
+ /* estimate the number of TRs we will need */
+ for_each_sg(sgl, sgent, sglen, i) {
+ if (sg_dma_len(sgent) < SZ_64K)
+ num_tr++;
+ else
+ num_tr += 2;
+ }

/* Now allocate and setup the descriptor. */
tr_size = sizeof(struct cppi5_tr_type1_t);
- d = udma_alloc_tr_desc(uc, tr_size, sglen, dir);
+ d = udma_alloc_tr_desc(uc, tr_size, num_tr, dir);
if (!d)
return NULL;

@@ -2111,19 +2111,46 @@ udma_prep_slave_sg_tr(struct udma_chan *uc, struct scatterlist *sgl,

tr_req = d->hwdesc[0].tr_req_base;
for_each_sg(sgl, sgent, sglen, i) {
- d->residue += sg_dma_len(sgent);
+ dma_addr_t sg_addr = sg_dma_address(sgent);
+
+ num_tr = udma_get_tr_counters(sg_dma_len(sgent), __ffs(sg_addr),
+ &tr0_cnt0, &tr0_cnt1, &tr1_cnt0);
+ if (num_tr < 0) {
+ dev_err(uc->ud->dev, "size %u is not supported\n",
+ sg_dma_len(sgent));
+ udma_free_hwdesc(uc, d);
+ kfree(d);
+ return NULL;
+ }

cppi5_tr_init(&tr_req[i].flags, CPPI5_TR_TYPE1, false, false,
CPPI5_TR_EVENT_SIZE_COMPLETION, 0);
cppi5_tr_csf_set(&tr_req[i].flags, CPPI5_TR_CSF_SUPR_EVT);

- tr_req[i].addr = sg_dma_address(sgent);
- tr_req[i].icnt0 = burst * dev_width;
- tr_req[i].dim1 = burst * dev_width;
- tr_req[i].icnt1 = sg_dma_len(sgent) / tr_req[i].icnt0;
+ tr_req[tr_idx].addr = sg_addr;
+ tr_req[tr_idx].icnt0 = tr0_cnt0;
+ tr_req[tr_idx].icnt1 = tr0_cnt1;
+ tr_req[tr_idx].dim1 = tr0_cnt0;
+ tr_idx++;
+
+ if (num_tr == 2) {
+ cppi5_tr_init(&tr_req[tr_idx].flags, CPPI5_TR_TYPE1,
+ false, false,
+ CPPI5_TR_EVENT_SIZE_COMPLETION, 0);
+ cppi5_tr_csf_set(&tr_req[tr_idx].flags,
+ CPPI5_TR_CSF_SUPR_EVT);
+
+ tr_req[tr_idx].addr = sg_addr + tr0_cnt1 * tr0_cnt0;
+ tr_req[tr_idx].icnt0 = tr1_cnt0;
+ tr_req[tr_idx].icnt1 = 1;
+ tr_req[tr_idx].dim1 = tr1_cnt0;
+ tr_idx++;
+ }
+
+ d->residue += sg_dma_len(sgent);
}

- cppi5_tr_csf_set(&tr_req[i - 1].flags, CPPI5_TR_CSF_EOP);
+ cppi5_tr_csf_set(&tr_req[tr_idx - 1].flags, CPPI5_TR_CSF_EOP);

return d;
}
@@ -2428,47 +2455,66 @@ udma_prep_dma_cyclic_tr(struct udma_chan *uc, dma_addr_t buf_addr,
size_t buf_len, size_t period_len,
enum dma_transfer_direction dir, unsigned long flags)
{
- enum dma_slave_buswidth dev_width;
struct udma_desc *d;
- size_t tr_size;
+ size_t tr_size, period_addr;
struct cppi5_tr_type1_t *tr_req;
- unsigned int i;
unsigned int periods = buf_len / period_len;
- u32 burst;
+ u16 tr0_cnt0, tr0_cnt1, tr1_cnt0;
+ unsigned int i;
+ int num_tr;

- if (dir == DMA_DEV_TO_MEM) {
- dev_width = uc->cfg.src_addr_width;
- burst = uc->cfg.src_maxburst;
- } else if (dir == DMA_MEM_TO_DEV) {
- dev_width = uc->cfg.dst_addr_width;
- burst = uc->cfg.dst_maxburst;
- } else {
- dev_err(uc->ud->dev, "%s: bad direction?\n", __func__);
+ if (!is_slave_direction(dir)) {
+ dev_err(uc->ud->dev, "Only slave cyclic is supported\n");
return NULL;
}

- if (!burst)
- burst = 1;
+ num_tr = udma_get_tr_counters(period_len, __ffs(buf_addr), &tr0_cnt0,
+ &tr0_cnt1, &tr1_cnt0);
+ if (num_tr < 0) {
+ dev_err(uc->ud->dev, "size %zu is not supported\n",
+ period_len);
+ return NULL;
+ }

/* Now allocate and setup the descriptor. */
tr_size = sizeof(struct cppi5_tr_type1_t);
- d = udma_alloc_tr_desc(uc, tr_size, periods, dir);
+ d = udma_alloc_tr_desc(uc, tr_size, periods * num_tr, dir);
if (!d)
return NULL;

tr_req = d->hwdesc[0].tr_req_base;
+ period_addr = buf_addr;
for (i = 0; i < periods; i++) {
- cppi5_tr_init(&tr_req[i].flags, CPPI5_TR_TYPE1, false, false,
- CPPI5_TR_EVENT_SIZE_COMPLETION, 0);
+ int tr_idx = i * num_tr;

- tr_req[i].addr = buf_addr + period_len * i;
- tr_req[i].icnt0 = dev_width;
- tr_req[i].icnt1 = period_len / dev_width;
- tr_req[i].dim1 = dev_width;
+ cppi5_tr_init(&tr_req[tr_idx].flags, CPPI5_TR_TYPE1, false,
+ false, CPPI5_TR_EVENT_SIZE_COMPLETION, 0);
+
+ tr_req[tr_idx].addr = period_addr;
+ tr_req[tr_idx].icnt0 = tr0_cnt0;
+ tr_req[tr_idx].icnt1 = tr0_cnt1;
+ tr_req[tr_idx].dim1 = tr0_cnt0;
+
+ if (num_tr == 2) {
+ cppi5_tr_csf_set(&tr_req[tr_idx].flags,
+ CPPI5_TR_CSF_SUPR_EVT);
+ tr_idx++;
+
+ cppi5_tr_init(&tr_req[tr_idx].flags, CPPI5_TR_TYPE1,
+ false, false,
+ CPPI5_TR_EVENT_SIZE_COMPLETION, 0);
+
+ tr_req[tr_idx].addr = period_addr + tr0_cnt1 * tr0_cnt0;
+ tr_req[tr_idx].icnt0 = tr1_cnt0;
+ tr_req[tr_idx].icnt1 = 1;
+ tr_req[tr_idx].dim1 = tr1_cnt0;
+ }

if (!(flags & DMA_PREP_INTERRUPT))
- cppi5_tr_csf_set(&tr_req[i].flags,
+ cppi5_tr_csf_set(&tr_req[tr_idx].flags,
CPPI5_TR_CSF_SUPR_EVT);
+
+ period_addr += period_len;
}

return d;
--
Peter


2020-01-28 10:16:23

by Peter Ujfalusi

Subject: Re: [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next

Vinod,

On 27/01/2020 15.21, Peter Ujfalusi wrote:
> Hi Vinod,
>
> Based on customer reports we have identified two issues with the UDMA driver:
>
> TX completion (1st patch):
> The scheduled work based workaround for checking for completion worked well for
> UART, but it had significant impact on SPI performance.
> The underlying issue is coming from the fact that we have split data movement
> architecture.
> In order to know that the transfer is really done we need to check the remote
> end's (PDMA) byte counter.
>
> RX channel teardown with stale data in PDMA (2nd patch):
> If we try to stop the RX DMA channel (teardown) then PDMA is trying to flush the
> data is might received from a peripheral, but if UDMA does not have a packet to
> use for this draining than it is going to push back on the PDMA and the flush
> will never completes.
> The workaround is to use a dummy descriptor for flush purposes when the channel
> is terminated and we did not have active transfer (no descriptor for UDMA).
> This allows UDMA to drain the data and the teardown can complete.
>
> The last two patch is to use common code to set up the TR parameters for
> slave_sg, cyclic and memcpy. The setup code is the same as we used for memcpy
> with the change we can handle 4.2GB sg elements and periods in case of cyclic.
> It is also nice that we have single function to do the configuration.

I have marked these patches as for-next as 5.5 was not released yet.
Would it be possible to have these as fixes for 5.6?

Thanks,
- Péter

>
> Regards,
> Peter
> ---
> Peter Ujfalusi (3):
> dmaengine: ti: k3-udma: Workaround for RX teardown with stale data in
> peer
> dmaengine: ti: k3-udma: Move the TR counter calculation to helper
> function
> dmaengine: ti: k3-udma: Use the TR counter helper for slave_sg and
> cyclic
>
> Vignesh Raghavendra (1):
> dmaengine: ti: k3-udma: Use ktime/usleep_range based TX completion
> check
>
> drivers/dma/ti/k3-udma.c | 452 +++++++++++++++++++++++++++++----------
> 1 file changed, 343 insertions(+), 109 deletions(-)
>


2020-01-28 11:51:19

by Vinod Koul

Subject: Re: [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next

On 28-01-20, 12:15, Peter Ujfalusi wrote:
> Vinod,
>
> On 27/01/2020 15.21, Peter Ujfalusi wrote:
> > Hi Vinod,
> >
> > Based on customer reports we have identified two issues with the UDMA driver:
> >
> > TX completion (1st patch):
> > The scheduled work based workaround for checking for completion worked well for
> > UART, but it had significant impact on SPI performance.
> > The underlying issue is coming from the fact that we have split data movement
> > architecture.
> > In order to know that the transfer is really done we need to check the remote
> > end's (PDMA) byte counter.
> >
> > RX channel teardown with stale data in PDMA (2nd patch):
> > If we try to stop the RX DMA channel (teardown) then PDMA is trying to flush the
> > data is might received from a peripheral, but if UDMA does not have a packet to
> > use for this draining than it is going to push back on the PDMA and the flush
> > will never completes.
> > The workaround is to use a dummy descriptor for flush purposes when the channel
> > is terminated and we did not have active transfer (no descriptor for UDMA).
> > This allows UDMA to drain the data and the teardown can complete.
> >
> > The last two patch is to use common code to set up the TR parameters for
> > slave_sg, cyclic and memcpy. The setup code is the same as we used for memcpy
> > with the change we can handle 4.2GB sg elements and periods in case of cyclic.
> > It is also nice that we have single function to do the configuration.
>
> I have marked these patches as for-next as 5.5 was not released yet.
> Would it be possible to have these as fixes for 5.6?

Sure, but are they really fixes? Why can't they go in the next release? :)

They seem to improve things for sure, but do we want to call them
fixes..?

--
~Vinod

2020-01-28 12:38:13

by Peter Ujfalusi

Subject: Re: [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next

Hi Vinod,

On 28/01/2020 13.50, Vinod Koul wrote:
> On 28-01-20, 12:15, Peter Ujfalusi wrote:
>> Vinod,
>>
>> On 27/01/2020 15.21, Peter Ujfalusi wrote:
>>> Hi Vinod,
>>>
>>> Based on customer reports we have identified two issues with the UDMA driver:
>>>
>>> TX completion (1st patch):
>>> The scheduled work based workaround for checking for completion worked well for
>>> UART, but it had significant impact on SPI performance.
>>> The underlying issue is coming from the fact that we have split data movement
>>> architecture.
>>> In order to know that the transfer is really done we need to check the remote
>>> end's (PDMA) byte counter.
>>>
>>> RX channel teardown with stale data in PDMA (2nd patch):
>>> If we try to stop the RX DMA channel (teardown) then PDMA is trying to flush the
>>> data is might received from a peripheral, but if UDMA does not have a packet to
>>> use for this draining than it is going to push back on the PDMA and the flush
>>> will never completes.
>>> The workaround is to use a dummy descriptor for flush purposes when the channel
>>> is terminated and we did not have active transfer (no descriptor for UDMA).
>>> This allows UDMA to drain the data and the teardown can complete.
>>>
>>> The last two patch is to use common code to set up the TR parameters for
>>> slave_sg, cyclic and memcpy. The setup code is the same as we used for memcpy
>>> with the change we can handle 4.2GB sg elements and periods in case of cyclic.
>>> It is also nice that we have single function to do the configuration.
>>
>> I have marked these patches as for-next as 5.5 was not released yet.
>> Would it be possible to have these as fixes for 5.6?
>
> Sure but are they really fixes, why cant they go for next release :)
>
> They seem to improve things for sure, but do we want to call them as
> fixes..?

I would say that the first two patches are fixes:
The TX completion check fixes the performance hit caused by the early TX
completion workaround which used jiffies+work.

The second patch fixes a case when we have stale data during RX and
no active transfer. For example when UART reads 1000 bytes, but the
other end is 'streaming' the data and after the 1000 bytes the UART+PDMA
still receives data.
Recovering from this state is not easy and it might not even succeed at
the HW level.

For the last two I agree, they are not fixing much, although they do correct
the slave_sg TR setup (and improve the cyclic as well).
With that I could send the ASoC platform wrapper for UDMA with
period_bytes_max = 4.2GB ;)
I have SZ_512K in there atm; with the old calculation SZ_64K is the
maximum, not a big issue.

I think the first two patches are fix candidates as they fix a regression
(albeit a regression between the series) and a real world channel lockup
discovered too late for the initial driver.

- Péter


2020-01-30 13:19:46

by Peter Ujfalusi

Subject: Re: [PATCH for-next 0/4] dmaengine: ti: k3-udma: Updates for next

Hi Vinod,

On 28/01/2020 14.37, Peter Ujfalusi wrote:
> Hi Vinod,
>
> On 28/01/2020 13.50, Vinod Koul wrote:
>> On 28-01-20, 12:15, Peter Ujfalusi wrote:
>>> Vinod,
>>>
>>> On 27/01/2020 15.21, Peter Ujfalusi wrote:
>>>> Hi Vinod,
>>>>
>>>> Based on customer reports we have identified two issues with the UDMA driver:
>>>>
>>>> TX completion (1st patch):
>>>> The scheduled work based workaround for checking for completion worked well for
>>>> UART, but it had significant impact on SPI performance.
>>>> The underlying issue is coming from the fact that we have split data movement
>>>> architecture.
>>>> In order to know that the transfer is really done we need to check the remote
>>>> end's (PDMA) byte counter.
>>>>
>>>> RX channel teardown with stale data in PDMA (2nd patch):
>>>> If we try to stop the RX DMA channel (teardown) then PDMA is trying to flush the
>>>> data is might received from a peripheral, but if UDMA does not have a packet to
>>>> use for this draining than it is going to push back on the PDMA and the flush
>>>> will never completes.
>>>> The workaround is to use a dummy descriptor for flush purposes when the channel
>>>> is terminated and we did not have active transfer (no descriptor for UDMA).
>>>> This allows UDMA to drain the data and the teardown can complete.
>>>>
>>>> The last two patch is to use common code to set up the TR parameters for
>>>> slave_sg, cyclic and memcpy. The setup code is the same as we used for memcpy
>>>> with the change we can handle 4.2GB sg elements and periods in case of cyclic.
>>>> It is also nice that we have single function to do the configuration.
>>>
>>> I have marked these patches as for-next as 5.5 was not released yet.
>>> Would it be possible to have these as fixes for 5.6?
>>
>> Sure but are they really fixes, why cant they go for next release :)
>>
>> They seem to improve things for sure, but do we want to call them as
>> fixes..?
>
> I would say that the first two patch is a fix:
> TX completion check is fixing the performance hit by the early TX
> completion workaround which used jiffies+work.
>
> The second patch is fixing a case when we have stale data during RX and
> no active transfer. For example when UART reads 1000 bytes, but the
> other end is 'streaming' the data and after the 1000 bytes the UART+PDMA
> receives data.
> Recovering from this state is not easy and it might not even succeed in
> HW level.
>
> The last two is I agree, it is not fixing much, it does corrects the
> slave_sg TR setup (and improves the cyclic as well).
> With that I could send the ASoC platform wrapper for UDMA with
> period_bytes_max = 4.2GB ;)
> I have SZ_512K in there atm, with the old calculation SZ_64K is the
> maximum, not a big issue.

Actually this also fixes a real bug in the driver for the slave_sg_tr case:
if sg_dma_len(sgent) is not a multiple of (burst * dev_width) then we
end up with missing data as the counters are not set up correctly.
The client driver with which we tested slave_sg_tr was always giving
sg_len == 1 and the buffer was aligned, but when I tuned the client to
pass a list, things broke.
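
To illustrate the arithmetic with made-up numbers (a sketch, not driver code;
old_bytes_covered() is a hypothetical helper that mirrors the removed
icnt0/icnt1 setup from the diff and assumes dev_width is non-zero):

```c
#include <stdint.h>

/*
 * Bytes described by the old slave_sg counters:
 *   icnt0 = burst * dev_width
 *   icnt1 = len / icnt0
 * The integer division silently drops any remainder.
 */
static uint32_t old_bytes_covered(uint32_t len, uint32_t burst,
				  uint32_t dev_width)
{
	uint32_t icnt0, icnt1;

	/* The removed code treated a zero burst as 1. */
	if (!burst)
		burst = 1;

	icnt0 = burst * dev_width;
	icnt1 = len / icnt0;

	return icnt0 * icnt1;
}
```

For example, len = 1000 with burst = 8 and dev_width = 4 gives icnt0 = 32 and
icnt1 = 31, so only 992 of the 1000 bytes are described by the TR.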

>
> I think the first two patch is a fix candidate as they fix regression
> (albeit regression between the series's) and a real world channel lockup
> discovered too late for the initial driver.
>
> - Péter
>
> Texas Instruments Finland Oy, Porkkalankatu 22, 00180 Helsinki.
> Y-tunnus/Business ID: 0615521-4. Kotipaikka/Domicile: Helsinki
>

- Péter
