2022-10-24 04:52:21

by Yuji Ishikawa

Subject: Question for an accepted patch: use of DMA-BUF based videobuf2 capture buffer with no-HW-cache-coherent HW

Hi,
I'm porting a V4L2 capture driver from 4.19.y to 5.10.y [1].
When I test the ported driver, I sometimes see corruption in a captured image.
Because the corruption is exactly aligned to cache lines, I started my investigation with the map/unmap of DMA-BUF.

The capture driver uses DMA-BUF for videobuf2.
The capture hardware does not have HW-maintained cache coherency with the CPU; that is, explicit map/unmap is essential on QBUF/DQBUF.
After some hours of struggle, I found a patch that removes cache synchronization on QBUF/DQBUF.

https://patchwork.kernel.org/project/linux-media/patch/[email protected]/

When I removed this patch from my 5.10.y working-tree, the driver yielded images without any defects.

***************
Sorry for bringing up a patch released 4 years ago.
The patch removes map/unmap on QBUF/DQBUF to improve the performance of V4L2 decoder devices, which reuse previously decoded frames.
However, there seems to be no consideration of, or compensation for, the changed DMA-BUF lifecycle, especially on video capture devices.

Could you give me your thoughts on this patch:
* Do well-implemented capture drivers work well even if this patch is applied?
* How should a video capture driver call V4L2/videobuf2 APIs, especially when the hardware does not support cache coherency?

***************
[1] FYI: the capture driver is not on mainline yet; the candidate is,
https://lore.kernel.org/all/[email protected]/

***************
Sorry for sending the same message again. I mistakenly posted the previous one in HTML format.

Regards,
Yuji Ishikawa


2022-10-24 08:13:09

by Hans Verkuil

Subject: Re: Question for an accepted patch: use of DMA-BUF based videobuf2 capture buffer with no-HW-cache-coherent HW

Hi Yuji,

On 10/24/22 06:02, [email protected] wrote:
> Hi,
>
> I'm porting a V4L2 capture driver from 4.19.y to 5.10.y [1].
> When I test the ported driver, I sometimes find a corruption on a captured image.
> Because the corruption is exactly aligned with cacheline, I started investigation from map/unmap of DMA-BUF.
>
> The capture driver uses DMA-BUF for videobuf2.
> The capture hardware does not have HW-mantained cache coherency with CPU, that is, explicit map/unmap is essential on QBUF/DQBUF.
> After some hours of struggle, I found a patch removing cache synchronizations on QBUF/DQBUF.
>
> https://patchwork.kernel.org/project/linux-media/patch/[email protected]/
>
> When I removed this patch from my 5.10.y working-tree, the driver yielded images without any defects.
>
> ***************
> Sorry for a mention to a patch released 4 years ago.
> The patch removes map/unmap on QBUF/DQBUF to improve the performance of V4L2 decoder device, by reusing previously decoded frames.
> However, there seems no cares nor compensations for modifying lifecycle of DMA-BUF, especially on video capture devices.

I'm not entirely sure what you mean exactly.

>
> Would you tell me some idea on this patch:
>
> * Do well-implemented capture drivers work well even if this patch is applied?

Yes, dmabuf is used extensively and I have not had any reports of issues.

>
> * How should a video capture driver call V4L2/videobuf2 APIs, especially when the hardware does not support cache coherency?

It should all be handled correctly by the core frameworks.

I think you need to debug more inside videobuf2-core.c. Some printk's that show the
dmabuf fd when the buffer is mapped and when it is unmapped + the length it is
mapping should hopefully help a bit.

Regards,

Hans


2022-10-26 09:52:23

by Yuji Ishikawa

Subject: RE: Question for an accepted patch: use of DMA-BUF based videobuf2 capture buffer with no-HW-cache-coherent HW

Hi Hans,

> -----Original Message-----
> From: Hans Verkuil <[email protected]>
> Sent: Monday, October 24, 2022 4:49 PM
> To: ishikawa yuji(石川 悠司 ○RDC□AITC○EA開)
> <[email protected]>; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: Question for an accepted patch: use of DMA-BUF based videobuf2
> capture buffer with no-HW-cache-coherent HW
>
> Hi Yuji,
>
> On 10/24/22 06:02, [email protected] wrote:
> > Sorry for a mention to a patch released 4 years ago.
> >
> > The patch removes map/unmap on QBUF/DQBUF to improve the
> performance of V4L2 decoder device, by reusing previously decoded frames.
> >
> > However, there seems no cares nor compensations for modifying lifecycle of
> DMA-BUF, especially on video capture devices.
>
> I'm not entirely sure what you mean exactly.
>
My concern is the consistency between the ioctls and the state transitions of capture buffers.
Generally, streaming I/O buffers (imported DMA-BUF) are handled by userland as follows:

ioctl(VIDIOC_QBUF) -> /* DMA transfer from HW */ -> ioctl(VIDIOC_DQBUF) -> /* access from CPU */ -> ioctl(VIDIOC_QBUF) -> ...

Therefore, the expected semantics are that a buffer is owned by the HW after QBUF and by the CPU after DQBUF.
Before the patch was applied, ioctl(QBUF) called vb2_dc_map_dmabuf() and ioctl(DQBUF) called vb2_dc_unmap_dmabuf().
That implementation kept the caches consistent, because a cache clean is done in vb2_dc_map_dmabuf().

With the patch applied, ioctl(DQBUF) no longer calls unmap_dma(), and likewise ioctl(QBUF) no longer calls map_dma().
Therefore, in practice, a buffer is not owned by the CPU just after ioctl(DQBUF).
To keep buffer operations compatible, there should be a delayed map_dma()/unmap_dma() call just before the DMA transfer or the CPU access.
However, no one mentioned such a function in the v4l2 framework during the review of the patch.
There is also no guidance for individual video device drivers, such as adding map_dma()/unmap_dma() calls explicitly.

> >
> >
> >
> > Would you tell me some idea on this patch:
> >
> > * Do well-implemented capture drivers work well even if this patch is applied?
>
> Yes, dmabuf is used extensively and I have not had any reports of issues.

Many architectures are not affected by this problem.
The problem occurs intermittently, and only if the video capture HW does not have HW-maintained cache coherency with the CPU.
Does this patch consider such a case?

> >
> > * How should a video capture driver call V4L2/videobuf2 APIs, especially
> when the hardware does not support cache coherency?
>
> It should all be handled correctly by the core frameworks.
>
> I think you need to debug more inside videobuf2-core.c. Some printk's that show
> the dmabuf fd when the buffer is mapped and when it is unmapped + the length
> it is mapping should hopefully help a bit.

I added printk() and dump_stack() calls to several functions.
The patched function __prepare_dmabuf() is called on every ioctl(QBUF).
vb2_dc_map_dmabuf() is called only on the first ioctl(QBUF) for a buffer instance.
After that, vb2_dc_map_dmabuf() is never called again, as the patch intended.

Regards,
Yuji


2022-12-01 12:16:13

by Hans Verkuil

Subject: Re: Question for an accepted patch: use of DMA-BUF based videobuf2 capture buffer with no-HW-cache-coherent HW

Hi Yuji,

On 26/10/2022 11:16, [email protected] wrote:
> Hi Hans,
>
>>> Sorry for a mention to a patch released 4 years ago.
>>>
>>> The patch removes map/unmap on QBUF/DQBUF to improve the
>> performance of V4L2 decoder device, by reusing previously decoded frames.
>>>
>>> However, there seems no cares nor compensations for modifying lifecycle of
>> DMA-BUF, especially on video capture devices.
>>
>> I'm not entirely sure what you mean exactly.
>>
> My concern is consistency between ioctls and the state transition of capture buffers.
> Generally, streaming I/O (DMA-BUF importing) buffers are handled following by userland.
>
> Ioctl(VIDIOC_QBUF) -> /* DMA transfer from HW*/ -> ioctl(VIDIOC_DQBUF) -> /* access from CPU */ -> ioctl(VIDIOC_QBUF) -> ...
>
> Therefore, expected semantics is that a buffer is owned by HW after QBUF, and owned by CPU after DQBUF.
> In practice, ioctl(QBUF) kicks vb2_dc_map_dma_buf() and ioctl(DQBUF) kicks vb2_dc_unmap_dma_buf() before applying the patch.
> This implementation keeps consistency in terms of cache coherency as cache-clean is done in vb2_dc_map_dma_buf().
>
> By applying the patch, ioctl(DQBUF) does not kick unmap_dma() anymore. The similar for ioctl(QBUF).
> Therefore, in practice, a buffer is not owned by CPU just after ioctl(DQBUF).
> To keep compatibility of buffer operations, there should be delayed map_dma()/unmap_dma() call just before DMA-transfer/CPU-access.
> However, no one referred to such function in the v4l2 framework in the examination of the patch.
> Also, there is no advice for individual video device drivers; such that adding map_dma()/unmap_dma() explicitly.

The cache syncing is supposed to happen in __vb2_buf_mem_finish() where the
'finish' memop is called.

But for DMABUF it notes that:

/*
 * DMA exporter should take care of cache syncs, so we can avoid
 * explicit ->prepare()/->finish() syncs. For other ->memory types
 * we always need ->prepare() or/and ->finish() cache sync.
 */

And here https://docs.kernel.org/driver-api/dma-buf.html I read that userspace
must call DMA_BUF_IOCTL_SYNC to ensure the caches are synced before using the
buffer.

Are you calling DMA_BUF_IOCTL_SYNC?

I suspect that vb2_dc_unmap_dmabuf() caused a cache sync, so you never noticed
issues.

Regards,

Hans


2022-12-02 09:25:12

by Yuji Ishikawa

Subject: RE: Question for an accepted patch: use of DMA-BUF based videobuf2 capture buffer with no-HW-cache-coherent HW

Hi Hans,

> -----Original Message-----
> From: Hans Verkuil <[email protected]>
> Sent: Thursday, December 1, 2022 8:33 PM
> To: ishikawa yuji(石川 悠司 ○RDC□AITC○EA開)
> <[email protected]>; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; Tomasz Figa
> <[email protected]>
> Subject: Re: Question for an accepted patch: use of DMA-BUF based videobuf2
> capture buffer with no-HW-cache-coherent HW
>
> Hi Yuji,
>
> The cache syncing is supposed to happen in __vb2_buf_mem_finish() where the
> 'finish' memop is called.
>
> But for DMABUF it notes that:
>
> /*
> * DMA exporter should take care of cache syncs, so we can avoid
> * explicit ->prepare()/->finish() syncs. For other ->memory types
> * we always need ->prepare() or/and ->finish() cache sync.
> */

It seems I misunderstood how DMA-BUF cache syncs are handled across the videobuf2 API calls.
I now understand that cache syncs are expected to be handled before prepare() and after finish().
The "ownership" transition on QBUF/DQBUF came from my misunderstanding; please disregard it.

> And here https://docs.kernel.org/driver-api/dma-buf.html I read that userspace
> must call DMA_BUF_IOCTL_SYNC to ensure the caches are synced before
> using the buffer.
>
> Are you calling DMA_BUF_IOCTL_SYNC?

The missing ioctl(DMA_BUF_IOCTL_SYNC) call in userland was exactly the cause.
I read the document, ran experiments, and confirmed that it works correctly.
Very sorry to have bothered you.
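
For reference, a minimal sketch of the kind of call that was missing, assuming the DMA-BUF is mmap()ed in userland and read by the CPU after VIDIOC_DQBUF. The helper names are hypothetical; DMA_BUF_IOCTL_SYNC, struct dma_buf_sync and the DMA_BUF_SYNC_* flags come from <linux/dma-buf.h>. Retrying on EINTR/EAGAIN is a common userland pattern and an assumption here, not something discussed in this thread.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

/*
 * Issue DMA_BUF_IOCTL_SYNC with the given flags on a DMA-BUF fd.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int dmabuf_sync(int dmabuf_fd, __u64 flags)
{
    struct dma_buf_sync sync = { .flags = flags };
    int ret;

    /* Assumption: retry if the call was interrupted. */
    do {
        ret = ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
    } while (ret == -1 && (errno == EAGAIN || errno == EINTR));

    return ret;
}

/* Call after VIDIOC_DQBUF, before the CPU reads the captured frame. */
static int dmabuf_begin_cpu_read(int dmabuf_fd)
{
    return dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ);
}

/* Call when the CPU is done reading, before the next VIDIOC_QBUF. */
static int dmabuf_end_cpu_read(int dmabuf_fd)
{
    return dmabuf_sync(dmabuf_fd, DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ);
}
```

In a capture loop this would bracket the CPU access: VIDIOC_DQBUF, dmabuf_begin_cpu_read(), read the frame, dmabuf_end_cpu_read(), VIDIOC_QBUF.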

Regards,
Yuji
