Hi guys,
we are currently working on FreeSync and direct scan-out from system
memory on AMD APUs in A+A laptops.
One problem we stumbled over is that our display hardware needs to scan
out from uncached system memory and we currently don't have a way to
communicate that through DMA-buf.
For the specific use case at hand we are going to implement something
driver-specific, but the question is: should we have something more
generic for this?
After all, the system memory access pattern is a PCIe extension and as
such something generic.
Regards,
Christian.
On Monday, February 15th, 2021 at 9:58 AM, Christian König <[email protected]> wrote:
> we are currently working an Freesync and direct scan out from system
> memory on AMD APUs in A+A laptops.
>
> On problem we stumbled over is that our display hardware needs to scan
> out from uncached system memory and we currently don't have a way to
> communicate that through DMA-buf.
>
> For our specific use case at hand we are going to implement something
> driver specific, but the question is should we have something more
> generic for this?
>
> After all the system memory access pattern is a PCIe extension and as
> such something generic.
Intel also needs uncached system memory if I'm not mistaken?
Where are the buffers allocated? If GBM, then it needs to allocate memory that
can be scanned out if the USE_SCANOUT flag is set or if a scanout-capable
modifier is picked.
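For reference, a minimal GBM sketch of that allocation path (the gbm device, width, height and the format are assumed to come from the surrounding code; GBM_BO_USE_SCANOUT is what requests scanout-capable memory from the driver):

#include <gbm.h>

/* hypothetical allocation call, sketch only */
struct gbm_bo *bo = gbm_bo_create(gbm, width, height,
                                  GBM_FORMAT_XRGB8888,
                                  GBM_BO_USE_SCANOUT | GBM_BO_USE_RENDERING);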
If this is about communicating buffer constraints between different components
of the stack, there were a few proposals about it. The most recent one is [1].
Simon
[1]: https://xdc2020.x.org/event/9/contributions/615/
On 15.02.21 10:06, Simon Ser wrote:
> On Monday, February 15th, 2021 at 9:58 AM, Christian König <[email protected]> wrote:
>
>> we are currently working an Freesync and direct scan out from system
>> memory on AMD APUs in A+A laptops.
>>
>> On problem we stumbled over is that our display hardware needs to scan
>> out from uncached system memory and we currently don't have a way to
>> communicate that through DMA-buf.
>>
>> For our specific use case at hand we are going to implement something
>> driver specific, but the question is should we have something more
>> generic for this?
>>
>> After all the system memory access pattern is a PCIe extension and as
>> such something generic.
> Intel also needs uncached system memory if I'm not mistaken?
No idea, that's why I'm asking. Could be that this is also interesting
for I+A systems.
> Where are the buffers allocated? If GBM, then it needs to allocate memory that
> can be scanned out if the USE_SCANOUT flag is set or if a scanout-capable
> modifier is picked.
>
> If this is about communicating buffer constraints between different components
> of the stack, there were a few proposals about it. The most recent one is [1].
Well, the problem here is on a different level of the stack.
See, resolution, pitch etc. can easily be communicated in userspace
without involving the kernel. The worst thing that can happen is
that you draw garbage into your own application window.
But if you get the caching attributes in the page tables (CPU as
well as IOMMU, device etc.) wrong, then ARM for example has a
tendency to just spontaneously reboot.
x86 is fortunately a bit more graceful and you only end up with random
data corruption, but that is only marginally better.
So to sum it up, that is not something we can leave in the hands of
userspace.
I think that exporters in the DMA-buf framework should have the ability
to tell importers whether system memory snooping is necessary or not.
Userspace components can then of course tell the exporter what the
importer needs, but validation that this is correct and doesn't
crash the system must happen in the kernel.
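Purely as an illustration of the kind of hint I mean, it could be something as simple as the sketch below; none of these names exist today, they are hypothetical:

/* hypothetical, not an existing dma-buf interface */
struct dma_buf_coherency_hint {
        bool snoop_required;    /* importer must issue cache-snooping transactions */
};

/* hypothetical query an importer could do before mapping the buffer;
 * validating what userspace asked for would still happen in the kernel
 */
int dma_buf_get_coherency_hint(struct dma_buf *buf,
                               struct dma_buf_coherency_hint *hint);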
Regards,
Christian.
>
> Simon
>
> [1]: https://xdc2020.x.org/event/9/contributions/615/
Hi
On 15.02.21 09:58, Christian König wrote:
> Hi guys,
>
> we are currently working an Freesync and direct scan out from system
> memory on AMD APUs in A+A laptops.
>
> On problem we stumbled over is that our display hardware needs to scan
> out from uncached system memory and we currently don't have a way to
> communicate that through DMA-buf.
>
> For our specific use case at hand we are going to implement something
> driver specific, but the question is should we have something more
> generic for this?
For vmap operations, we return the address as struct dma_buf_map, which
contains additional information about the memory buffer. In vram
helpers, we have the interface drm_gem_vram_offset() that returns the
offset of the GPU device memory.
Would it be feasible to combine both concepts into a dma-buf interface
that returns the device-memory offset plus the additional caching flag?
There'd be a structure and a getter function returning the structure.
struct dma_buf_offset {
        bool cached;
        u64 address;
};
// return the offset in *off
int dma_buf_offset(struct dma_buf *buf, struct dma_buf_offset *off);
Whatever settings are returned by dma_buf_offset() are valid while the
dma_buf is pinned.
Best regards
Thomas
>
> After all the system memory access pattern is a PCIe extension and as
> such something generic.
>
> Regards,
> Christian.
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
On Monday, 15.02.2021 at 10:34 +0100, Christian König wrote:
>
> Am 15.02.21 um 10:06 schrieb Simon Ser:
> > On Monday, February 15th, 2021 at 9:58 AM, Christian König <[email protected]> wrote:
> >
> > > we are currently working an Freesync and direct scan out from system
> > > memory on AMD APUs in A+A laptops.
> > >
> > > On problem we stumbled over is that our display hardware needs to scan
> > > out from uncached system memory and we currently don't have a way to
> > > communicate that through DMA-buf.
> > >
> > > For our specific use case at hand we are going to implement something
> > > driver specific, but the question is should we have something more
> > > generic for this?
> > >
> > > After all the system memory access pattern is a PCIe extension and as
> > > such something generic.
> > Intel also needs uncached system memory if I'm not mistaken?
>
> No idea, that's why I'm asking. Could be that this is also interesting
> for I+A systems.
>
> > Where are the buffers allocated? If GBM, then it needs to allocate memory that
> > can be scanned out if the USE_SCANOUT flag is set or if a scanout-capable
> > modifier is picked.
> >
> > If this is about communicating buffer constraints between different components
> > of the stack, there were a few proposals about it. The most recent one is [1].
>
> Well the problem here is on a different level of the stack.
>
> See resolution, pitch etc:.. can easily communicated in userspace
> without involvement of the kernel. The worst thing which can happen is
> that you draw garbage into your own application window.
>
> But if you get the caching attributes in the page tables (both CPU as
> well as IOMMU, device etc...) wrong then ARM for example has the
> tendency to just spontaneously reboot
>
> X86 is fortunately a bit more gracefully and you only end up with random
> data corruption, but that is only marginally better.
>
> So to sum it up that is not something which we can leave in the hands of
> userspace.
>
> I think that exporters in the DMA-buf framework should have the ability
> to tell importers if the system memory snooping is necessary or not.
There is already a coarse-grained way to do so: the dma_coherent
property in struct device, which you can check at dma-buf attach time.
However, it may not be enough for the requirements of a GPU where the
engines could differ in their DMA coherency requirements. For that you
need to either have fake struct devices for the individual engines or
come up with a more fine-grained way to communicate those requirements.
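A rough sketch of that coarse-grained check with a hypothetical exporter; dev_is_dma_coherent() is what reads the dma_coherent property of the importing struct device:

#include <linux/dma-buf.h>
#include <linux/dma-map-ops.h>

/* hypothetical exporter attach callback, sketch only */
static int my_exporter_attach(struct dma_buf *dmabuf,
                              struct dma_buf_attachment *attach)
{
        if (!dev_is_dma_coherent(attach->dev)) {
                /* importer is not cache coherent: either flush CPU caches
                 * before device access or reject the attachment
                 */
                return -EINVAL;
        }
        return 0;
}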
> Userspace components can then of course tell the exporter what the
> importer needs, but validation if that stuff is correct and doesn't
> crash the system must happen in the kernel.
What exactly do you mean by "scanout requires non-coherent memory"?
Does the scanout requestor always set the no-snoop PCI flag, so you get
garbage if some writes to memory are still stuck in the caches, or is
it some other requirement?
Regards,
Lucas
Hi
On 15.02.21 10:49, Thomas Zimmermann wrote:
> Hi
>
> Am 15.02.21 um 09:58 schrieb Christian König:
>> Hi guys,
>>
>> we are currently working an Freesync and direct scan out from system
>> memory on AMD APUs in A+A laptops.
>>
>> On problem we stumbled over is that our display hardware needs to scan
>> out from uncached system memory and we currently don't have a way to
>> communicate that through DMA-buf.
Re-reading this paragraph, it sounds more as if you want to let the
exporter know where to move the buffer. Is this another case of the
missing-pin-flag problem?
Best regards
Thomas
>>
>> For our specific use case at hand we are going to implement something
>> driver specific, but the question is should we have something more
>> generic for this?
>
> For vmap operations, we return the address as struct dma_buf_map, which
> contains additional information about the memory buffer. In vram
> helpers, we have the interface drm_gem_vram_offset() that returns the
> offset of the GPU device memory.
>
> Would it be feasible to combine both concepts into a dma-buf interface
> that returns the device-memory offset plus the additional caching flag?
>
> There'd be a structure and a getter function returning the structure.
>
> struct dma_buf_offset {
> bool cached;
> u64 address;
> };
>
> // return offset in *off
> int dma_buf_offset(struct dma_buf *buf, struct dma_buf_off *off);
>
> Whatever settings are returned by dma_buf_offset() are valid while the
> dma_buf is pinned.
>
> Best regards
> Thomas
>
>>
>> After all the system memory access pattern is a PCIe extension and as
>> such something generic.
>>
>> Regards,
>> Christian.
>
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Felix Imendörffer
On 15.02.21 12:53, Lucas Stach wrote:
> Am Montag, dem 15.02.2021 um 10:34 +0100 schrieb Christian König:
>> Am 15.02.21 um 10:06 schrieb Simon Ser:
>>> On Monday, February 15th, 2021 at 9:58 AM, Christian König <[email protected]> wrote:
>>>
>>>> we are currently working an Freesync and direct scan out from system
>>>> memory on AMD APUs in A+A laptops.
>>>>
>>>> On problem we stumbled over is that our display hardware needs to scan
>>>> out from uncached system memory and we currently don't have a way to
>>>> communicate that through DMA-buf.
>>>>
>>>> For our specific use case at hand we are going to implement something
>>>> driver specific, but the question is should we have something more
>>>> generic for this?
>>>>
>>>> After all the system memory access pattern is a PCIe extension and as
>>>> such something generic.
>>> Intel also needs uncached system memory if I'm not mistaken?
>> No idea, that's why I'm asking. Could be that this is also interesting
>> for I+A systems.
>>
>>> Where are the buffers allocated? If GBM, then it needs to allocate memory that
>>> can be scanned out if the USE_SCANOUT flag is set or if a scanout-capable
>>> modifier is picked.
>>>
>>> If this is about communicating buffer constraints between different components
>>> of the stack, there were a few proposals about it. The most recent one is [1].
>> Well the problem here is on a different level of the stack.
>>
>> See resolution, pitch etc:.. can easily communicated in userspace
>> without involvement of the kernel. The worst thing which can happen is
>> that you draw garbage into your own application window.
>>
>> But if you get the caching attributes in the page tables (both CPU as
>> well as IOMMU, device etc...) wrong then ARM for example has the
>> tendency to just spontaneously reboot
>>
>> X86 is fortunately a bit more gracefully and you only end up with random
>> data corruption, but that is only marginally better.
>>
>> So to sum it up that is not something which we can leave in the hands of
>> userspace.
>>
>> I think that exporters in the DMA-buf framework should have the ability
>> to tell importers if the system memory snooping is necessary or not.
> There is already a coarse-grained way to do so: the dma_coherent
> property in struct device, which you can check at dmabuf attach time.
>
> However it may not be enough for the requirements of a GPU where the
> engines could differ in their dma coherency requirements. For that you
> need to either have fake struct devices for the individual engines or
> come up with a more fine-grained way to communicate those requirements.
Yeah, that won't work. We need this on a per-buffer level.
>> Userspace components can then of course tell the exporter what the
>> importer needs, but validation if that stuff is correct and doesn't
>> crash the system must happen in the kernel.
> What exactly do you mean by "scanout requires non-coherent memory"?
> Does the scanout requestor always set the no-snoop PCI flag, so you get
> garbage if some writes to memory are still stuck in the caches, or is
> it some other requirement?
Snooping the CPU caches introduces some extra latency, so what can
happen is that the response to the PCIe read comes too late for the
scanout. The result is an underflow and flickering whenever something is
in the cache that needs to be flushed first.
On the other hand, when we don't snoop the CPU caches we at least get
garbage/stale data on the screen. That wouldn't be so bad, but the
big problem is that we have also seen machine check exceptions when we
don't snoop and the cache is dirty.
So this had better be coherent or you can crash the box. ARM seems to
be really susceptible to this, x86 is fortunately much more graceful,
and I'm not sure about other architectures.
Regards,
Christian.
>
> Regards,
> Lucas
>
On 15.02.21 13:00, Thomas Zimmermann wrote:
> Hi
>
> Am 15.02.21 um 10:49 schrieb Thomas Zimmermann:
>> Hi
>>
>> Am 15.02.21 um 09:58 schrieb Christian König:
>>> Hi guys,
>>>
>>> we are currently working an Freesync and direct scan out from system
>>> memory on AMD APUs in A+A laptops.
>>>
>>> On problem we stumbled over is that our display hardware needs to
>>> scan out from uncached system memory and we currently don't have a
>>> way to communicate that through DMA-buf.
>
> Re-reading this paragrah, it sounds more as if you want to let the
> exporter know where to move the buffer. Is this another case of the
> missing-pin-flag problem?
No, your original interpretation was correct. Maybe my writing is a bit
unspecific.
The real underlying issue is that our display hardware has a problem
with latency when accessing system memory.
So the question is whether that also applies to, for example, Intel hardware or
other devices as well, or whether it is just something AMD-specific.
Regards,
Christian.
>
> Best regards
> Thomas
>
>>>
>>> For our specific use case at hand we are going to implement
>>> something driver specific, but the question is should we have
>>> something more generic for this?
>>
>> For vmap operations, we return the address as struct dma_buf_map,
>> which contains additional information about the memory buffer. In
>> vram helpers, we have the interface drm_gem_vram_offset() that
>> returns the offset of the GPU device memory.
>>
>> Would it be feasible to combine both concepts into a dma-buf
>> interface that returns the device-memory offset plus the additional
>> caching flag?
>>
>> There'd be a structure and a getter function returning the structure.
>>
>> struct dma_buf_offset {
>> bool cached;
>> u64 address;
>> };
>>
>> // return offset in *off
>> int dma_buf_offset(struct dma_buf *buf, struct dma_buf_off *off);
>>
>> Whatever settings are returned by dma_buf_offset() are valid while
>> the dma_buf is pinned.
>>
>> Best regards
>> Thomas
>>
>>>
>>> After all the system memory access pattern is a PCIe extension and
>>> as such something generic.
>>>
>>> Regards,
>>> Christian.
>>
>
On Monday, 15.02.2021 at 13:04 +0100, Christian König wrote:
> Am 15.02.21 um 12:53 schrieb Lucas Stach:
> > Am Montag, dem 15.02.2021 um 10:34 +0100 schrieb Christian König:
> > > Am 15.02.21 um 10:06 schrieb Simon Ser:
> > > > On Monday, February 15th, 2021 at 9:58 AM, Christian König <[email protected]> wrote:
> > > >
> > > > > we are currently working an Freesync and direct scan out from system
> > > > > memory on AMD APUs in A+A laptops.
> > > > >
> > > > > On problem we stumbled over is that our display hardware needs to scan
> > > > > out from uncached system memory and we currently don't have a way to
> > > > > communicate that through DMA-buf.
> > > > >
> > > > > For our specific use case at hand we are going to implement something
> > > > > driver specific, but the question is should we have something more
> > > > > generic for this?
> > > > >
> > > > > After all the system memory access pattern is a PCIe extension and as
> > > > > such something generic.
> > > > Intel also needs uncached system memory if I'm not mistaken?
> > > No idea, that's why I'm asking. Could be that this is also interesting
> > > for I+A systems.
> > >
> > > > Where are the buffers allocated? If GBM, then it needs to allocate memory that
> > > > can be scanned out if the USE_SCANOUT flag is set or if a scanout-capable
> > > > modifier is picked.
> > > >
> > > > If this is about communicating buffer constraints between different components
> > > > of the stack, there were a few proposals about it. The most recent one is [1].
> > > Well the problem here is on a different level of the stack.
> > >
> > > See resolution, pitch etc:.. can easily communicated in userspace
> > > without involvement of the kernel. The worst thing which can happen is
> > > that you draw garbage into your own application window.
> > >
> > > But if you get the caching attributes in the page tables (both CPU as
> > > well as IOMMU, device etc...) wrong then ARM for example has the
> > > tendency to just spontaneously reboot
> > >
> > > X86 is fortunately a bit more gracefully and you only end up with random
> > > data corruption, but that is only marginally better.
> > >
> > > So to sum it up that is not something which we can leave in the hands of
> > > userspace.
> > >
> > > I think that exporters in the DMA-buf framework should have the ability
> > > to tell importers if the system memory snooping is necessary or not.
> > There is already a coarse-grained way to do so: the dma_coherent
> > property in struct device, which you can check at dmabuf attach time.
> >
> > However it may not be enough for the requirements of a GPU where the
> > engines could differ in their dma coherency requirements. For that you
> > need to either have fake struct devices for the individual engines or
> > come up with a more fine-grained way to communicate those requirements.
>
> Yeah, that won't work. We need this on a per buffer level.
>
> > > Userspace components can then of course tell the exporter what the
> > > importer needs, but validation if that stuff is correct and doesn't
> > > crash the system must happen in the kernel.
> > What exactly do you mean by "scanout requires non-coherent memory"?
> > Does the scanout requestor always set the no-snoop PCI flag, so you get
> > garbage if some writes to memory are still stuck in the caches, or is
> > it some other requirement?
>
> Snooping the CPU caches introduces some extra latency, so what can
> happen is that the response to the PCIe read comes to late for the
> scanout. The result is an underflow and flickering whenever something is
> in the cache which needs to be flushed first.
Okay, that confirms my theory on why this is needed. So things don't
totally explode if you don't do it, but in order to guarantee access
latency you need to take the no-snoop path, which means your device
effectively becomes dma-noncoherent.
> On the other hand when the don't snoop the CPU caches we at least get
> garbage/stale data on the screen. That wouldn't be that worse, but the
> big problem is that we have also seen machine check exceptions when
> don't snoop and the cache is dirty.
If you attach to the dma-buf with a struct device which is non-coherent,
it's the exporter's job to flush any dirty caches. Unfortunately the
caching of dma-buf attachments in the DRM framework will get a bit
in the way here, so a DRM-specific flush might be needed. :/ Maybe
moving the whole buffer to an uncached sysmem location on first attach of
a non-coherent importer would be enough?
> So this should better be coherent or you can crash the box. ARM seems to
> be really susceptible for this, x86 is fortunately much more graceful
> and I'm not sure about other architectures.
ARM really dislikes page-table setups with different attributes pointing
to the same physical page; however, you should be fine as long as all
cached aliases are properly flushed from the cache before access via a
different alias.
Regards,
Lucas
On 15.02.21 13:16, Lucas Stach wrote:
> [SNIP]
>>>> Userspace components can then of course tell the exporter what the
>>>> importer needs, but validation if that stuff is correct and doesn't
>>>> crash the system must happen in the kernel.
>>> What exactly do you mean by "scanout requires non-coherent memory"?
>>> Does the scanout requestor always set the no-snoop PCI flag, so you get
>>> garbage if some writes to memory are still stuck in the caches, or is
>>> it some other requirement?
>> Snooping the CPU caches introduces some extra latency, so what can
>> happen is that the response to the PCIe read comes to late for the
>> scanout. The result is an underflow and flickering whenever something is
>> in the cache which needs to be flushed first.
> Okay, that confirms my theory on why this is needed. So things don't
> totally explode if you don't do it, but to in order to guarantee access
> latency you need to take the no-snoop path, which means your device
> effectively gets dma-noncoherent.
Exactly. My big question at the moment is whether this is something
AMD-specific or whether we have the same issue on other devices as well.
>> On the other hand when the don't snoop the CPU caches we at least get
>> garbage/stale data on the screen. That wouldn't be that worse, but the
>> big problem is that we have also seen machine check exceptions when
>> don't snoop and the cache is dirty.
> If you attach to the dma-buf with a struct device which is non-coherent
> it's the exporters job to flush any dirty caches. Unfortunately the DRM
> caching of the dma-buf attachments in the DRM framework will get a bit
> in the way here, so a DRM specific flush might be be needed. :/ Maybe
> moving the whole buffer to uncached sysmem location on first attach of
> a non-coherent importer would be enough?
Could work in theory, but the problem is that to do this I would have to
tear down all CPU mappings and attachments of other devices.
Apart from the problem that we don't have the infrastructure for that, we
don't know at import time that a buffer might be used for scan-out. I
would need to re-import it during framebuffer creation or something like that.
Our current concept for AMD GPUs is rather that we try to use uncached
memory as much as possible. So for the specific use case, just checking
whether the exporter is amdgpu and has the flag set should be enough for now.
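Roughly, that driver-specific check could look like the sketch below; treating amdgpu's write-combine flag as the "uncached" indicator is an assumption on my side, not necessarily what we will end up using:

/* sketch only: identify an amdgpu-exported buffer and look at a per-BO flag */
static bool amdgpu_dmabuf_is_uncached(struct dma_buf *dma_buf)
{
        struct drm_gem_object *obj;

        if (dma_buf->ops != &amdgpu_dmabuf_ops) /* not exported by amdgpu */
                return false;

        obj = dma_buf->priv;
        return gem_to_amdgpu_bo(obj)->flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC;
}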
>> So this should better be coherent or you can crash the box. ARM seems to
>> be really susceptible for this, x86 is fortunately much more graceful
>> and I'm not sure about other architectures.
> ARM really dislikes pagetable setups with different attributes pointing
> to the same physical page, however you should be fine as long as all
> cached aliases are properly flushed from the cache before access via a
> different alias.
Yeah, I can totally confirm that and had to learn it the hard way.
Regards,
Christian.
>
> Regards,
> Lucas
>
From: Christian König
> Sent: 15 February 2021 12:05
...
> Snooping the CPU caches introduces some extra latency, so what can
> happen is that the response to the PCIe read comes to late for the
> scanout. The result is an underflow and flickering whenever something is
> in the cache which needs to be flushed first.
Aren't you going to get the same problem if any other endpoints are
doing memory reads?
Possibly even ones that don't require a cache snoop and flush.
What about just the cpu doing a real memory transfer?
Or a combination of the two above happening just before your request.
If you don't have a big enough fifo you'll lose.
I did 'fix' a similar(ish) issue with video DMA latency on an embedded
system based on the SA1100/SA1101 by significantly reducing the clock
to the VGA panel whenever the CPU was doing 'slow io'.
(Interleaving an uncached cpu DRAM write between the slow io cycles
also fixed it.)
But the video was the only DMA device and that was an embedded system.
Given the application note about video latency didn't mention what was
actually happening, I'm not sure how many people actually got it working!
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 15.02.21 15:41, David Laight wrote:
> From: Christian König
>> Sent: 15 February 2021 12:05
> ...
>> Snooping the CPU caches introduces some extra latency, so what can
>> happen is that the response to the PCIe read comes to late for the
>> scanout. The result is an underflow and flickering whenever something is
>> in the cache which needs to be flushed first.
> Aren't you going to get the same problem if any other endpoints are
> doing memory reads?
The PCIe device in this case is part of the SoC, so we have a high-priority
channel to memory.
Because of this the hardware designers assumed they have a guaranteed
memory latency.
> Possibly even ones that don't require a cache snoop and flush.
>
> What about just the cpu doing a real memory transfer?
>
> Or a combination of the two above happening just before your request.
>
> If you don't have a big enough fifo you'll lose.
>
> I did 'fix' a similar(ish) issue with video DMA latency on an embedded
> system based the on SA1100/SA1101 by significantly reducing the clock
> to the VGA panel whenever the cpu was doing 'slow io'.
> (Interleaving an uncached cpu DRAM write between the slow io cycles
> also fixed it.)
> But the video was the only DMA device and that was an embedded system.
> Given the application note about video latency didn't mention what was
> actually happening, I'm not sure how many people actually got it working!
Yeah, I'm also not sure whether AMD will simply solve this with deeper FIFOs or
more prefetching in future designs.
But you gave me at least one example where somebody had similar problems.
Thanks for the feedback,
Christian.
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
On Monday, February 15, 2021 at 09:58 +0100, Christian König wrote:
> Hi guys,
>
> we are currently working an Freesync and direct scan out from system
> memory on AMD APUs in A+A laptops.
>
> On problem we stumbled over is that our display hardware needs to scan
> out from uncached system memory and we currently don't have a way to
> communicate that through DMA-buf.
>
> For our specific use case at hand we are going to implement something
> driver specific, but the question is should we have something more
> generic for this?
Hopefully I'm getting this right, but this makes me think of a long-standing
issue I've hit with the Intel DRM and UVC drivers. If I let the UVC driver allocate
the buffer and import the resulting dma-buf (cacheable memory written with a CPU
copy in the kernel) into DRM, we can see cache artifacts being displayed, while
if I use the DRM driver's memory (a dumb buffer in that case) it's clean because
there is a driver-specific solution to that.
There is no obvious way for a userspace application to know which is the right/wrong
way, and in fact it feels like the kernel could solve this somehow without having
to inform userspace (perhaps).
>
> After all the system memory access pattern is a PCIe extension and as
> such something generic.
>
> Regards,
> Christian.
On Monday, February 15, 2021 at 13:10 +0100, Christian König wrote:
>
>
> Am 15.02.21 um 13:00 schrieb Thomas Zimmermann:
> > Hi
> >
> > Am 15.02.21 um 10:49 schrieb Thomas Zimmermann:
> > > Hi
> > >
> > > Am 15.02.21 um 09:58 schrieb Christian König:
> > > > Hi guys,
> > > >
> > > > we are currently working an Freesync and direct scan out from system
> > > > memory on AMD APUs in A+A laptops.
> > > >
> > > > On problem we stumbled over is that our display hardware needs to
> > > > scan out from uncached system memory and we currently don't have a
> > > > way to communicate that through DMA-buf.
> >
> > Re-reading this paragrah, it sounds more as if you want to let the
> > exporter know where to move the buffer. Is this another case of the
> > missing-pin-flag problem?
>
> No, your original interpretation was correct. Maybe my writing is a bit
> unspecific.
>
> The real underlying issue is that our display hardware has a problem
> with latency when accessing system memory.
>
> So the question is if that also applies to for example Intel hardware or
> other devices as well or if it is just something AMD specific?
I do believe that the answer is yes: Intel display has a similar issue with
latency and hence requires uncached memory.
>
> Regards,
> Christian.
>
> >
> > Best regards
> > Thomas
> >
> > > >
> > > > For our specific use case at hand we are going to implement
> > > > something driver specific, but the question is should we have
> > > > something more generic for this?
> > >
> > > For vmap operations, we return the address as struct dma_buf_map,
> > > which contains additional information about the memory buffer. In
> > > vram helpers, we have the interface drm_gem_vram_offset() that
> > > returns the offset of the GPU device memory.
> > >
> > > Would it be feasible to combine both concepts into a dma-buf
> > > interface that returns the device-memory offset plus the additional
> > > caching flag?
> > >
> > > There'd be a structure and a getter function returning the structure.
> > >
> > > struct dma_buf_offset {
> > > bool cached;
> > > u64 address;
> > > };
> > >
> > > // return offset in *off
> > > int dma_buf_offset(struct dma_buf *buf, struct dma_buf_off *off);
> > >
> > > Whatever settings are returned by dma_buf_offset() are valid while
> > > the dma_buf is pinned.
> > >
> > > Best regards
> > > Thomas
> > >
> > > >
> > > > After all the system memory access pattern is a PCIe extension and
> > > > as such something generic.
> > > >
> > > > Regards,
> > > > Christian.
> > >
> >
>
On Mon, Feb 15, 2021 at 09:58:08AM +0100, Christian König wrote:
> Hi guys,
>
> we are currently working an Freesync and direct scan out from system memory
> on AMD APUs in A+A laptops.
>
> On problem we stumbled over is that our display hardware needs to scan out
> from uncached system memory and we currently don't have a way to communicate
> that through DMA-buf.
>
> For our specific use case at hand we are going to implement something driver
> specific, but the question is should we have something more generic for
> this?
>
> After all the system memory access pattern is a PCIe extension and as such
> something generic.
Yes, it's a problem, and it's a complete mess. So the de facto rules are:
1. The importer has to do coherent transactions to snoop CPU caches.
This way both modes work:
- if the buffer is cached, we're fine
- if the buffer is not cached, but the exporter has flushed all the
caches, you're mostly just wasting time on inefficient bus cycles. Also
this doesn't work if your CPU doesn't just drop clean cachelines, like I
think some ARM cores are prone to do; no idea about AMD. Intel is guaranteed
to drop them, which is why the uncached scanout for integrated Intel +
discrete AMD "works".
2. The exporter picks the mode freely, and can even change it at runtime
(i915 does this, since we don't have an "allocate for scanout" flag wired
through consistently). This doesn't work on ARM; there the rule is "all
devices in the same system must use the same mode".
3. This should be solved at the dma-buf layer, but the DMA API refuses to
tell you this information (at least for dma_alloc_coherent). And I'm not
going to deal with the bikeshed that would bring into my inbox. Or at
least there's always been screaming that drivers shouldn't peek behind the
abstraction.
So I think if AMD also guarantees to drop clean cachelines, just do the
same thing we do right now for Intel integrated + discrete AMD, but in
reverse. It's fragile, but it does work.
What we imo shouldn't do is driver-private interfaces here; that's just
going to make the problem worse long term. Or at least driver-private
interfaces that span across drivers behind dma-buf, because imo this is
really a problem that dma-buf should solve.
If you do want to solve this at the dma-buf level I can try and point you
at the respective i915 and amdgpu code that makes the magic work - I've
had to fix it a few times in the past. I'm not sure whether we'd need to
pass the dynamic nature through though, i.e. whether we want to be able to
scan out imported dma-bufs and hence request they be used in uncached mode.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Hi Christian and Andy,
On Tuesday, June 21, 2022 at 12:34 +0200, Christian König wrote:
> Hi Andy,
>
> Am 21.06.22 um 12:17 schrieb Andy.Hsieh:
>
> > On 2/16/21 4:39 AM, Nicolas Dufresne wrote:
> > > Le lundi 15 février 2021 à 09:58 +0100, Christian König a écrit :
> > > > Hi guys,
> > > >
> > > > we are currently working an Freesync and direct scan out from system
> > > > memory on AMD APUs in A+A laptops.
> > > >
> > > > On problem we stumbled over is that our display hardware needs to scan
> > > > out from uncached system memory and we currently don't have a way to
> > > > communicate that through DMA-buf.
> > > >
> > > > For our specific use case at hand we are going to implement something
> > > > driver specific, but the question is should we have something more
> > > > generic for this?
> > >
> > > Hopefully I'm getting this right, but this makes me think of a long
> > > standing
> > > issue I've met with Intel DRM and UVC driver. If I let the UVC driver
> > > allocate
> > > the buffer, and import the resulting DMABuf (cacheable memory written with
> > > a cpu
> > > copy in the kernel) into DRM, we can see cache artifact being displayed.
> > > While
> > > if I use the DRM driver memory (dumb buffer in that case) it's clean
> > > because
> > > there is a driver specific solution to that.
> > >
> > > There is no obvious way for userspace application to know what's is
> > > right/wrong
> > > way and in fact it feels like the kernel could solve this somehow without
> > > having
> > > to inform userspace (perhaps).
> > >
> > > >
> > > > After all the system memory access pattern is a PCIe extension and as
> > > > such something generic.
> > > >
> > > > Regards,
> > > > Christian.
> > >
> > >
> >
> > Hi All,
> >
> > We also encountered the UVC cache issue on ARMv8 CPU in Mediatek SoC when
> > using UVC dmabuf-export and feeding the dmabuf to the DRM display by the
> > following GStreamer command:
> >
> > # gst-launch-1.0 v4l2src device=/dev/video0 io-mode=dmabuf ! kmssink
> >
> > UVC driver uses videobuf2-vmalloc to allocate buffers and is able to export
> > them as dmabuf. But UVC uses memcpy() to fill the frame buffer by CPU
> > without
> > flushing the cache. So if the display hardware directly uses the buffer, the
> > image shown on the screen will be dirty.
> >
> > Here are some experiments:
> >
> > 1. By doing some memory operations (e.g. devmem) when streaming the UVC,
> > the issue is mitigated. I guess the cache is swapped rapidly.
> > 2. By replacing the memcpy() with memcpy_flushcache() in the UVC driver,
> > the issue disappears.
> > 3. By adding .finish callback in videobuf2-vmalloc.c to flush the cache
> > before returning the buffer, the issue disappears.
> >
> > It seems to lack a cache flush stage in either UVC or Display. We may also
> > need communication between the producer and consumer. Then, they can decide
> > who is responsible for the flushing to avoid flushing cache unconditionally
> > leading to the performance impact.
>
> Well, that's not what this mail thread was all about.
>
> The issue you are facing is that somebody is forgetting to flush caches, but
> the issue discussed in this thread here is that we have hardware which
> bypasses caches altogether.
>
> As far as I can see in your case UVC just allocates normal cached system
> memory through videobuf2-vmalloc() and it is perfectly valid to fill that
> using memcpy().
>
> If some hardware then accesses those buffers bypassing CPU caches then it is
> the responsibility of the importing driver and/or DMA subsystem to flush the
> caches accordingly.
I've tracked this down to videobuf2-vmalloc.c failing to look for coherency
during "attach()". It is also missing a begin_/end access implementation for the
case where it gets attached to a non-coherent device. Seems fixable, though; but "I'm
far from an expert", more some random person reading code and comments.
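To make Andy's experiment 3 a bit more concrete, a minimal sketch of such a cache-maintenance hook in videobuf2-vmalloc.c could look like the following (reusing the existing struct vb2_vmalloc_buf fields; whether .finish is really the right place is exactly the open question):

#include <linux/highmem.h>

/* sketch of a .finish vb2_mem_ops hook: write dirty CPU cache lines for the
 * vmalloc'ed frame back to memory before a non-coherent device reads it
 */
static void vb2_vmalloc_finish(void *buf_priv)
{
        struct vb2_vmalloc_buf *buf = buf_priv;

        flush_kernel_vmap_range(buf->vaddr, buf->size);
}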
regards,
Nicolas
>
> Regards,
> Christian.
>
>
> >
> > Regards,
> > Andy Hsieh
> >
>
>
On 21.06.22 17:42, Nicolas Dufresne wrote:
> Hi Christian and Andy,
>
> Le mardi 21 juin 2022 à 12:34 +0200, Christian König a écrit :
>> Hi Andy,
>>
>> Am 21.06.22 um 12:17 schrieb Andy.Hsieh:
>>
>>> On 2/16/21 4:39 AM, Nicolas Dufresne wrote:
>>>> Le lundi 15 février 2021 à 09:58 +0100, Christian König a écrit :
>>>>> Hi guys,
>>>>>
>>>>> we are currently working an Freesync and direct scan out from system
>>>>> memory on AMD APUs in A+A laptops.
>>>>>
>>>>> On problem we stumbled over is that our display hardware needs to scan
>>>>> out from uncached system memory and we currently don't have a way to
>>>>> communicate that through DMA-buf.
>>>>>
>>>>> For our specific use case at hand we are going to implement something
>>>>> driver specific, but the question is should we have something more
>>>>> generic for this?
>>>> Hopefully I'm getting this right, but this makes me think of a long
>>>> standing
>>>> issue I've met with Intel DRM and UVC driver. If I let the UVC driver
>>>> allocate
>>>> the buffer, and import the resulting DMABuf (cacheable memory written with
>>>> a cpu
>>>> copy in the kernel) into DRM, we can see cache artifact being displayed.
>>>> While
>>>> if I use the DRM driver memory (dumb buffer in that case) it's clean
>>>> because
>>>> there is a driver specific solution to that.
>>>>
>>>> There is no obvious way for userspace application to know what's is
>>>> right/wrong
>>>> way and in fact it feels like the kernel could solve this somehow without
>>>> having
>>>> to inform userspace (perhaps).
>>>>
>>>>> After all the system memory access pattern is a PCIe extension and as
>>>>> such something generic.
>>>>>
>>>>> Regards,
>>>>> Christian.
>>>>
>>> Hi All,
>>>
>>> We also encountered the UVC cache issue on ARMv8 CPU in Mediatek SoC when
>>> using UVC dmabuf-export and feeding the dmabuf to the DRM display by the
>>> following GStreamer command:
>>>
>>> # gst-launch-1.0 v4l2src device=/dev/video0 io-mode=dmabuf ! kmssink
>>>
>>> UVC driver uses videobuf2-vmalloc to allocate buffers and is able to export
>>> them as dmabuf. But UVC uses memcpy() to fill the frame buffer by CPU
>>> without
>>> flushing the cache. So if the display hardware directly uses the buffer, the
>>> image shown on the screen will be dirty.
>>>
>>> Here are some experiments:
>>>
>>> 1. By doing some memory operations (e.g. devmem) when streaming the UVC,
>>> the issue is mitigated. I guess the cache is swapped rapidly.
>>> 2. By replacing the memcpy() with memcpy_flushcache() in the UVC driver,
>>> the issue disappears.
>>> 3. By adding .finish callback in videobuf2-vmalloc.c to flush the cache
>>> before returning the buffer, the issue disappears.
>>>
>>> It seems to lack a cache flush stage in either UVC or Display. We may also
>>> need communication between the producer and consumer. Then, they can decide
>>> who is responsible for the flushing to avoid flushing cache unconditionally
>>> leading to the performance impact.
>>
>> Well, that's not what this mail thread was all about.
>>
>> The issue you are facing is that somebody is forgetting to flush caches, but
>> the issue discussed in this thread here is that we have hardware which
>> bypasses caches altogether.
>>
>> As far as I can see in your case UVC just allocates normal cached system
>> memory through videobuf2-vmalloc() and it is perfectly valid to fill that
>> using memcpy().
>>
>> If some hardware then accesses those buffers bypassing CPU caches then it is
>> the responsibility of the importing driver and/or DMA subsystem to flush the
>> caches accordingly.
> I've tracked this down to videobuf2-vmalloc.c failing to look for coherency
> during "attach()". It is also missing begin_/end access implementation for the
> case it get attached to a non-coherent device. Seems fixable though, but "I'm
> far from an expert", but more some random person reading code and comments.
Well, that is perfectly expected behavior: videobuf2-vmalloc returns
normal cached system memory, so it doesn't care about the coherency of the buffer.
What should happen instead is that the display device needs to make sure
that it can coherently access the data, and that's not the case here.
Regards,
Christian.
>
> regards,
> Nicolas
>
>>
>> Regards,
>> Christian.
>>
>>
>>> Regards,
>>> Andy Hsieh
>>>
>>
>>
On Tuesday, February 16, 2021 at 10:25 +0100, Daniel Vetter wrote:
> So I think if AMD also guarantees to drop clean cachelines just do the
> same thing we do right now for intel integrated + discrete amd, but in
> reserve. It's fragile, but it does work.
Sorry to disrupt, but if you pass V4L2 vmalloc data to the Intel display driver, you
also get nice dirt on the screen. If you have a UVC webcam that produces a pixel
format compatible with your display, you can reproduce the issue quite easily
with:
gst-launch-1.0 v4l2src device=/dev/video0 ! kmssink
P.S. some frame rates are less likely to exhibit the issue; make sure you create
movement to see it.
The only solution I could think of (not implemented) was to detect in the
attach() call what the importer can do (with dev->coherent_dma_mask if I
recall), and otherwise flush the cache immediately and keep flushing the cache
from then on, signalling it for DQBUF (in the vb2 workqueue or the dqbuf ioctl, I don't
have an idea yet). I bet this idea is inapplicable where you have fences; we
don't have that in V4L2.
This idea was hinted at by Robert Becket (now in CC), but perhaps I picked it up
wrong, explained it wrong, etc. I'm no expert, just noticed there wasn't really
a good plan for that, so one needs to make one up. I'm not aware of how an importer
could know how the memory was allocated by the exporter, and worse, how an
importer could figure out that the exporter is going to produce a buffer with a hot
CPU cache (the UVC driver does memcpy from USB chunks of variable size to produce a
fixed-size image).
Nicolas
Hi Nicolas,
On Wed, 22 Jun 2022 at 20:39, Nicolas Dufresne <[email protected]> wrote:
> Le mardi 16 février 2021 à 10:25 +0100, Daniel Vetter a écrit :
> > So I think if AMD also guarantees to drop clean cachelines just do the
> > same thing we do right now for intel integrated + discrete amd, but in
> > reserve. It's fragile, but it does work.
>
> Sorry to disrupt, but if you pass V4L2 vmalloc data to Intel display driver, you
> also get nice dirt on the screen. If you have a UVC webcam that produces a pixel
> format compatible with your display, you can reproduce the issue quite easily
> with:
>
> gst-launch-1.0 v4l2src device=/dev/video0 ! kmssink
>
> p.s. some frame-rate are less likely to exhibit the issue, make sure you create
> movement to see it.
Right, this is because the UVC data in a vmalloc() area is not
necessarily flushed from the CPU cache, and the importer expects it
will be.
> The only solution I could think of (not implemented) was to detect in the
> attach() call what the importers can do (with dev->coherent_dma_mask if I
> recall), and otherwise flush the cache immediately and start flushing the cache
> from now on signalling it for DQBUF (in vb2 workqueue or dqbuf ioctl, I don't
> have an idea yet). I bet this idea is inapplicable to were you have fences, we
> don't have that in v4l2.
>
> This idea was hinted by Robert Becket (now in CC), but perhaps I picked it up
> wrong, explaining it wrong, etc. I'm no expert, just noticed there wasn't really
> a good plan for that, so one needs to make one up. I'm not aware oh an importer
> could know how the memory was allocated by the exporter, and worst, how an
> importer could figure-out that the export is going to produce buffer with hot
> CPU cache (UVC driver does memcpy from USB chunks of variable size to produce a
> fixed size image).
This is exactly what Christian was saying above.
Cheers,
Daniel
On 23.06.22 01:34, Daniel Stone wrote:
> Hi Nicolas,
>
> On Wed, 22 Jun 2022 at 20:39, Nicolas Dufresne <[email protected]> wrote:
>> Le mardi 16 février 2021 à 10:25 +0100, Daniel Vetter a écrit :
>>> So I think if AMD also guarantees to drop clean cachelines just do the
>>> same thing we do right now for intel integrated + discrete amd, but in
>>> reserve. It's fragile, but it does work.
>> Sorry to disrupt, but if you pass V4L2 vmalloc data to Intel display driver, you
>> also get nice dirt on the screen. If you have a UVC webcam that produces a pixel
>> format compatible with your display, you can reproduce the issue quite easily
>> with:
>>
>> gst-launch-1.0 v4l2src device=/dev/video0 ! kmssink
>>
>> p.s. some frame-rate are less likely to exhibit the issue, make sure you create
>> movement to see it.
> Right, this is because the UVC data in a vmalloc() area is not
> necessarily flushed from the CPU cache, and the importer expects it
> will be.
Yeah, but that is something perfectly valid for an exporter to do. So
the bug is not in UVC.
>> The only solution I could think of (not implemented) was to detect in the
>> attach() call what the importers can do (with dev->coherent_dma_mask if I
>> recall), and otherwise flush the cache immediately and start flushing the cache
>> from now on signalling it for DQBUF (in vb2 workqueue or dqbuf ioctl, I don't
>> have an idea yet). I bet this idea is inapplicable to were you have fences, we
>> don't have that in v4l2.
>>
>> This idea was hinted by Robert Becket (now in CC), but perhaps I picked it up
>> wrong, explaining it wrong, etc. I'm no expert, just noticed there wasn't really
>> a good plan for that, so one needs to make one up. I'm not aware oh an importer
>> could know how the memory was allocated by the exporter, and worst, how an
>> importer could figure-out that the export is going to produce buffer with hot
>> CPU cache (UVC driver does memcpy from USB chunks of variable size to produce a
>> fixed size image).
> This is exactly what Christian was saying above.
Well, more or less.
The exporter isn't doing anything wrong here. DMA-bufs are supposed to be
CPU cached and can also be cache hot.
The importer needs to be able to deal with that, either by flushing the
CPU cache manually (awkward), rejecting the DMA-buf for this use case
(display scanout), or working around it inside its driver (extra copy,
different hw settings, etc.).
Regards,
Christian.
>
> Cheers,
> Daniel
On Thu, 23 Jun 2022 08:59:41 +0200
Christian König <[email protected]> wrote:
> The exporter isn't doing anything wrong here. DMA-buf are supposed to be
> CPU cached and can also be cache hot.
Hi,
what is that statement based on?
Were the (mandatory for CPU access) cpu_access_begin/end functions &
ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache
is fully flushed out?
Thanks,
pq
On 23.06.22 09:13, Pekka Paalanen wrote:
> On Thu, 23 Jun 2022 08:59:41 +0200
> Christian König <[email protected]> wrote:
>
>> The exporter isn't doing anything wrong here. DMA-buf are supposed to be
>> CPU cached and can also be cache hot.
> Hi,
>
> what is that statement based on?
On the design documentation of DMA-buf and the actual driver
implementations.
Coherency and snooping of the CPU cache is mandatory for devices and
root complexes in the PCI specification. Incoherent access is just an
extension.
We inherited that by basing DMA-buf on the Linux kernel DMA-API which in
turn is largely based on the PCI specification.
> Were the (mandatory for CPU access) cpu_access_begin/end functions &
> ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache
> is fully flushed out?
No, those functions are to inform the exporter that the importer has
started and finished accessing the buffer using the CPU.
There is no signaling in the other direction. In other words the
exporter doesn't inform the importer about CPU accesses because it is
the owner of the buffer.
It's the responsibility of the importer to make sure that it can
actually access the data in the buffer. If it can't guarantee that, it
shouldn't import the buffer in the first place.
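For completeness, this is roughly what that CPU-access bracketing looks like from userspace with the existing DMA_BUF_IOCTL_SYNC ioctl (the dmabuf_fd and the mmap'ed pointer are assumed); as said, it informs the exporter about the CPU access, it doesn't signal anything towards importers:

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

struct dma_buf_sync sync = { .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE };

ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);    /* begin CPU access */
/* ... CPU writes through the mmap'ed pointer ... */
sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);    /* end CPU access */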
Regards,
Christian.
>
>
> Thanks,
> pq
Hi Christian
On 15.02.21 09:58, Christian König wrote:
> Hi guys,
>
> we are currently working an Freesync and direct scan out from system
> memory on AMD APUs in A+A laptops.
>
> On problem we stumbled over is that our display hardware needs to scan
> out from uncached system memory and we currently don't have a way to
> communicate that through DMA-buf.
>
> For our specific use case at hand we are going to implement something
> driver specific, but the question is should we have something more
> generic for this?
I had a patchset here that extends iosys-map (former dma-buf-map) with
caching information. I'll post a copy.
Sorry for being late to reply.
Best regards
Thomas
>
> After all the system memory access pattern is a PCIe extension and as
> such something generic.
>
> Regards,
> Christian.
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev
On Thursday, 23.06.2022 at 09:26 +0200, Christian König wrote:
> Am 23.06.22 um 09:13 schrieb Pekka Paalanen:
> > On Thu, 23 Jun 2022 08:59:41 +0200
> > Christian König <[email protected]> wrote:
> >
> > > The exporter isn't doing anything wrong here. DMA-buf are supposed to be
> > > CPU cached and can also be cache hot.
> > Hi,
> >
> > what is that statement based on?
>
> On the design documentation of DMA-buf and the actual driver
> implementations.
>
> Coherency and snooping of the CPU cache is mandatory for devices and
> root complexes in the PCI specification. Incoherent access is just an
> extension.
>
> We inherited that by basing DMA-buf on the Linux kernel DMA-API which in
> turn is largely based on the PCI specification.
>
> > Were the (mandatory for CPU access) cpu_access_begin/end functions &
> > ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache
> > is fully flushed out?
>
> No, those functions are to inform the exporter that the importer has
> started and finished accessing the buffer using the CPU.
>
> There is no signaling in the other direction. In other words the
> exporter doesn't inform the importer about CPU accesses because it is
> the owner of the buffer.
>
> It's the responsibility of the importer to make sure that it can
> actually access the data in the buffer. If it can't guarantee that the
> importer shouldn't import the buffer in the first place.
This is not really correct. DMA-buf inherited the map/unmap part
from the DMA API, which on cache-coherent architectures is mostly a no-op
or ties into the IOMMU implementation to set up the page tables for
the translation. On non-cache-coherent architectures this is the point
where any necessary cache maintenance happens. DRM breaks this
model by caching the DMA-buf mapping for performance reasons.
In the DMA API keeping things mapped is also a valid use case, but then
you need to do explicit domain transfers via the dma_sync_* family,
which DMA-buf has not inherited. Again, those syncs are no-ops on
cache-coherent architectures, but do any necessary cache maintenance on
non-coherent arches.
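As an illustration of those explicit domain transfers (a generic DMA API sketch, not dma-buf code; dev, vaddr and size are assumed to exist):

#include <linux/dma-mapping.h>

/* map once, then bounce ownership between device and CPU with dma_sync_*;
 * on coherent systems these are no-ops, on non-coherent ones they do the
 * cache maintenance
 */
dma_addr_t dma = dma_map_single(dev, vaddr, size, DMA_FROM_DEVICE);

/* ... device writes into the buffer ... */

dma_sync_single_for_cpu(dev, dma, size, DMA_FROM_DEVICE);    /* CPU may read now */
/* ... CPU consumes the data ... */
dma_sync_single_for_device(dev, dma, size, DMA_FROM_DEVICE); /* hand it back */

dma_unmap_single(dev, dma, size, DMA_FROM_DEVICE);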
Regards,
Lucas
On 23.06.22 10:04, Lucas Stach wrote:
> Am Donnerstag, dem 23.06.2022 um 09:26 +0200 schrieb Christian König:
>> Am 23.06.22 um 09:13 schrieb Pekka Paalanen:
>>> On Thu, 23 Jun 2022 08:59:41 +0200
>>> Christian König <[email protected]> wrote:
>>>
>>>> The exporter isn't doing anything wrong here. DMA-buf are supposed to be
>>>> CPU cached and can also be cache hot.
>>> Hi,
>>>
>>> what is that statement based on?
>> On the design documentation of DMA-buf and the actual driver
>> implementations.
>>
>> Coherency and snooping of the CPU cache is mandatory for devices and
>> root complexes in the PCI specification. Incoherent access is just an
>> extension.
>>
>> We inherited that by basing DMA-buf on the Linux kernel DMA-API which in
>> turn is largely based on the PCI specification.
>>
>>> Were the (mandatory for CPU access) cpu_access_begin/end functions &
>>> ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache
>>> is fully flushed out?
>> No, those functions are to inform the exporter that the importer has
>> started and finished accessing the buffer using the CPU.
>>
>> There is no signaling in the other direction. In other words the
>> exporter doesn't inform the importer about CPU accesses because it is
>> the owner of the buffer.
>>
>> It's the responsibility of the importer to make sure that it can
>> actually access the data in the buffer. If it can't guarantee that the
>> importer shouldn't import the buffer in the first place.
> This is not really correct. DMA-buf inherited the map/unmap part
> from the DMA API, which on cache coherent architectures is mostly a no-
> op or ties into the IOMMU implementation to set up the pagetables for
> the translation. On non cache coherent architectures this is the point
> where any necessary cache maintenance happens. DRM breaks this
> model by caching the DMA-buf mapping for performance reasons.
That's not only because of performance reasons, but also because of
correctness.
At least the Vulkan API and a bunch of OpenGL extensions make it
mandatory for the buffer to be cache coherent. The kernel is simply not
informed about domain transfers.
For example you can just do a CPU copy to a ring buffer and the
expectation is that an already running shader sees that.
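To illustrate, that is the kind of mapping userspace gets with
GL_ARB_buffer_storage (just a sketch, "size" is a placeholder):

GLuint buf;
glGenBuffers(1, &buf);
glBindBuffer(GL_SHADER_STORAGE_BUFFER, buf);
glBufferStorage(GL_SHADER_STORAGE_BUFFER, size, NULL,
                GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                GL_MAP_COHERENT_BIT);
void *ptr = glMapBufferRange(GL_SHADER_STORAGE_BUFFER, 0, size,
                             GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                             GL_MAP_COHERENT_BIT);
/* CPU writes through "ptr" must become visible to an already running
 * shader without any explicit flush by the application. */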
> In the DMA API keeping things mapped is also a valid use-case, but then
> you need to do explicit domain transfers via the dma_sync_* family,
> which DMA-buf has not inherited. Again those sync are no-ops on cache
> coherent architectures, but do any necessary cache maintenance on non
> coherent arches.
Correct, yes. Coherency is mandatory for DMA-buf, you can't use
dma_sync_* on it when you are the importer.
The exporter could of course make use of that because he is the owner of
the buffer.
Regards,
Christian.
>
> Regards,
> Lucas
>
Am Donnerstag, dem 23.06.2022 um 10:14 +0200 schrieb Christian König:
> Am 23.06.22 um 10:04 schrieb Lucas Stach:
> > Am Donnerstag, dem 23.06.2022 um 09:26 +0200 schrieb Christian König:
> > > Am 23.06.22 um 09:13 schrieb Pekka Paalanen:
> > > > On Thu, 23 Jun 2022 08:59:41 +0200
> > > > Christian König <[email protected]> wrote:
> > > >
> > > > > The exporter isn't doing anything wrong here. DMA-buf are supposed to be
> > > > > CPU cached and can also be cache hot.
> > > > Hi,
> > > >
> > > > what is that statement based on?
> > > On the design documentation of DMA-buf and the actual driver
> > > implementations.
> > >
> > > Coherency and snooping of the CPU cache is mandatory for devices and
> > > root complexes in the PCI specification. Incoherent access is just an
> > > extension.
> > >
> > > We inherited that by basing DMA-buf on the Linux kernel DMA-API which in
> > > turn is largely based on the PCI specification.
> > >
> > > > Were the (mandatory for CPU access) cpu_access_begin/end functions &
> > > > ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache
> > > > is fully flushed out?
> > > No, those functions are to inform the exporter that the importer has
> > > started and finished accessing the buffer using the CPU.
> > >
> > > There is no signaling in the other direction. In other words the
> > > exporter doesn't inform the importer about CPU accesses because it is
> > > the owner of the buffer.
> > >
> > > It's the responsibility of the importer to make sure that it can
> > > actually access the data in the buffer. If it can't guarantee that the
> > > importer shouldn't import the buffer in the first place.
> > This is not really correct. DMA-buf inherited the map/unmap part
> > from the DMA API, which on cache coherent architectures is mostly a no-
> > op or ties into the IOMMU implementation to set up the pagetables for
> > the translation. On non cache coherent architectures this is the point
> > where any necessary cache maintenance happens. DRM breaks this
> > model by caching the DMA-buf mapping for performance reasons.
>
> That's not only because of performance reasons, but also because of
> correctness.
>
> At least the Vulkan API and a bunch of OpenGL extensions make it
> mandatory for the buffer to be cache coherent. The kernel is simply not
> informed about domain transfers.
>
> For example you can just do a CPU copy to a ring buffer and the
> expectation is that an already running shader sees that.
Yes, that one is not really an issue, as you know that at buffer
creation time and can make sure to map those buffers uncached on non
coherent arches. If there are no explicit domain transfer points, non
coherent arches must bite the bullet and bypass the CPU caches, running
performance into the ground.
>
> > In the DMA API keeping things mapped is also a valid use-case, but then
> > you need to do explicit domain transfers via the dma_sync_* family,
> > which DMA-buf has not inherited. Again those sync are no-ops on cache
> > coherent architectures, but do any necessary cache maintenance on non
> > coherent arches.
>
> Correct, yes. Coherency is mandatory for DMA-buf, you can't use
> dma_sync_* on it when you are the importer.
>
> The exporter could of course make use of that because he is the owner of
> the buffer.
In the example given here with UVC video, you don't know that the
buffer will be exported and needs to be coherent without
synchronization points, due to the mapping cache at the DRM side. So
V4L2 naturally allocates the buffers from CPU cached memory. If the
expectation is that those buffers are device coherent without relying
on the map/unmap_attachment calls, then V4L2 needs to always
synchronize caches on DQBUF when the buffer is allocated from CPU
cached memory and a single DMA-buf attachment exists. And while writing
this I realize that this is probably exactly what V4L2 should do...
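As a rough sketch of that idea (purely hypothetical, not existing V4L2
behaviour; locking and error handling omitted, and the sg_table handling
assumes the attachment keeps a cached mapping):

static void sync_exported_buf_for_importers(struct dma_buf *dmabuf)
{
        struct dma_buf_attachment *attach;

        /* Write back dirty CPU cache lines so that a non-snooping
         * importer sees the frame in the CPU cached capture buffer. */
        list_for_each_entry(attach, &dmabuf->attachments, node)
                if (attach->sgt)
                        dma_sync_sgtable_for_device(attach->dev, attach->sgt,
                                                    DMA_TO_DEVICE);
}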
Regards,
Lucas
Am 23.06.22 um 10:13 schrieb Thomas Zimmermann:
> Hi Christian
>
> Am 15.02.21 um 09:58 schrieb Christian König:
>> Hi guys,
>>
>> we are currently working on Freesync and direct scan out from system
>> memory on AMD APUs in A+A laptops.
>>
>> One problem we stumbled over is that our display hardware needs to
>> scan out from uncached system memory and we currently don't have a
>> way to communicate that through DMA-buf.
>>
>> For our specific use case at hand we are going to implement something
>> driver specific, but the question is should we have something more
>> generic for this?
>
> I had a patchset here that extends iosys-map (former dma-buf-map) with
> caching information. I'll post a copy.
Oh, nice. But why on iosys-map? We need that per DMA-buf.
Thanks,
Christian.
>
> Sorry for being late to reply.
>
> Best regards
> Thomas
>
>>
>> After all the system memory access pattern is a PCIe extension and as
>> such something generic.
>>
>> Regards,
>> Christian.
>
Am 23.06.22 um 10:58 schrieb Lucas Stach:
> Am Donnerstag, dem 23.06.2022 um 10:14 +0200 schrieb Christian König:
>> Am 23.06.22 um 10:04 schrieb Lucas Stach:
>>> Am Donnerstag, dem 23.06.2022 um 09:26 +0200 schrieb Christian König:
>>>> Am 23.06.22 um 09:13 schrieb Pekka Paalanen:
>>>>> On Thu, 23 Jun 2022 08:59:41 +0200
>>>>> Christian König <[email protected]> wrote:
>>>>>
>>>>>> The exporter isn't doing anything wrong here. DMA-buf are supposed to be
>>>>>> CPU cached and can also be cache hot.
>>>>> Hi,
>>>>>
>>>>> what is that statement based on?
>>>> On the design documentation of DMA-buf and the actual driver
>>>> implementations.
>>>>
>>>> Coherency and snooping of the CPU cache is mandatory for devices and
>>>> root complexes in the PCI specification. Incoherent access is just an
>>>> extension.
>>>>
>>>> We inherited that by basing DMA-buf on the Linux kernel DMA-API which in
>>>> turn is largely based on the PCI specification.
>>>>
>>>>> Were the (mandatory for CPU access) cpu_access_begin/end functions &
>>>>> ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache
>>>>> is fully flushed out?
>>>> No, those functions are to inform the exporter that the importer has
>>>> started and finished accessing the buffer using the CPU.
>>>>
>>>> There is no signaling in the other direction. In other words the
>>>> exporter doesn't inform the importer about CPU accesses because it is
>>>> the owner of the buffer.
>>>>
>>>> It's the responsibility of the importer to make sure that it can
>>>> actually access the data in the buffer. If it can't guarantee that the
>>>> importer shouldn't import the buffer in the first place.
>>> This is not really correct. DMA-buf inherited the map/unmap part
>>> from the DMA API, which on cache coherent architectures is mostly a no-
>>> op or ties into the IOMMU implementation to set up the pagetables for
>>> the translation. On non cache coherent architectures this is the point
>>> where any necessary cache maintenance happens. DRM breaks this
>>> model by caching the DMA-buf mapping for performance reasons.
>> That's not only because of performance reasons, but also because of
>> correctness.
>>
>> At least the Vulkan API and a bunch of OpenGL extensions make it
>> mandatory for the buffer to be cache coherent. The kernel is simply not
>> informed about domain transfers.
>>
>> For example you can just do a CPU copy to a ring buffer and the
>> expectation is that an already running shader sees that.
> Yes, that one is not really an issue, as you know that at buffer
> creation time and can make sure to map those buffers uncached on non
> coherent arches. If there are no explicit domain transfer points, non
> coherent arches must bite the bullet and bypass the CPU caches, running
> performance into the ground.
Yes, exactly that was what this mail thread was about. But this case is
currently not supported by DMA-buf.
In other words, cache coherency is currently mandatory for everybody
involved.
>>> In the DMA API keeping things mapped is also a valid use-case, but then
>>> you need to do explicit domain transfers via the dma_sync_* family,
>>> which DMA-buf has not inherited. Again those sync are no-ops on cache
>>> coherent architectures, but do any necessary cache maintenance on non
>>> coherent arches.
>> Correct, yes. Coherency is mandatory for DMA-buf, you can't use
>> dma_sync_* on it when you are the importer.
>>
>> The exporter could of course make use of that because he is the owner of
>> the buffer.
> In the example given here with UVC video, you don't know that the
> buffer will be exported and needs to be coherent without
> synchronization points, due to the mapping cache at the DRM side. So
> V4L2 naturally allocates the buffers from CPU cached memory. If the
> expectation is that those buffers are device coherent without relying
> on the map/unmap_attachment calls, then V4L2 needs to always
> synchronize caches on DQBUF when the buffer is allocated from CPU
> cached memory and a single DMA-buf attachment exists. And while writing
> this I realize that this is probably exactly what V4L2 should do...
No, the expectation is that the importer can deal with whatever the
exporter provides.
If the importer can't access the DMA-buf coherently it's his job to
handle that gracefully.
See for example on AMD/Intel hardware most of the engines can perfectly
deal with cache coherent memory accesses. Only the display engines can't.
So on import time we can't even say if the access can be coherent and
snoop the CPU cache or not because we don't know how the imported
DMA-buf will be used later on.
Regards,
Christian.
>
> Regards,
> Lucas
>
Hi
Am 23.06.22 um 10:26 schrieb Christian König:
> Am 23.06.22 um 10:13 schrieb Thomas Zimmermann:
>> Hi Christian
>>
>> Am 15.02.21 um 09:58 schrieb Christian König:
>>> Hi guys,
>>>
>>> we are currently working on Freesync and direct scan out from system
>>> memory on AMD APUs in A+A laptops.
>>>
>>> One problem we stumbled over is that our display hardware needs to
>>> scan out from uncached system memory and we currently don't have a
>>> way to communicate that through DMA-buf.
>>>
>>> For our specific use case at hand we are going to implement something
>>> driver specific, but the question is should we have something more
>>> generic for this?
>>
>> I had a patchset here that extends iosys-map (former dma-buf-map) with
>> caching information. I'll post a copy.
>
> Oh, nice. But why on iosys-map? We need that per DMA-buf.
It's returned via the dma-buf's vmap call within the iosys-map
structure. If the dma-buf moves, the following vmap calls always return
updated caching information. Maybe it's not quite what you need for
Freesync?
I'll use this for format-conversion helpers, which do some optimizations
for uncached memory.
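Roughly, the idea is something like this (sketch only, the new field name
is made up; today struct iosys_map only carries vaddr/vaddr_iomem and
is_iomem):

struct iosys_map {
        union {
                void __iomem *vaddr_iomem;
                void *vaddr;
        };
        bool is_iomem;
        bool is_uncached;       /* hypothetical: set by the exporter's vmap */
};

The format-conversion helpers would then check is_uncached and switch to a
write-combine friendly copy path.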
Anyway, I have to look for that patch...
Best regards
Thomas
>
> Thanks,
> Christian.
>
>>
>> Sorry for being late to reply.
>>
>> Best regards
>> Thomas
>>
>>>
>>> After all the system memory access pattern is a PCIe extension and as
>>> such something generic.
>>>
>>> Regards,
>>> Christian.
>>
>
--
Thomas Zimmermann
Graphics Driver Developer
SUSE Software Solutions Germany GmbH
Maxfeldstr. 5, 90409 Nürnberg, Germany
(HRB 36809, AG Nürnberg)
Geschäftsführer: Ivo Totev
Am Donnerstag, dem 23.06.2022 um 11:09 +0200 schrieb Christian König:
> Am 23.06.22 um 10:58 schrieb Lucas Stach:
> > Am Donnerstag, dem 23.06.2022 um 10:14 +0200 schrieb Christian König:
> > > Am 23.06.22 um 10:04 schrieb Lucas Stach:
> > > > Am Donnerstag, dem 23.06.2022 um 09:26 +0200 schrieb Christian König:
> > > > > Am 23.06.22 um 09:13 schrieb Pekka Paalanen:
> > > > > > On Thu, 23 Jun 2022 08:59:41 +0200
> > > > > > Christian König <[email protected]> wrote:
> > > > > >
> > > > > > > The exporter isn't doing anything wrong here. DMA-buf are supposed to be
> > > > > > > CPU cached and can also be cache hot.
> > > > > > Hi,
> > > > > >
> > > > > > what is that statement based on?
> > > > > On the design documentation of DMA-buf and the actual driver
> > > > > implementations.
> > > > >
> > > > > Coherency and snooping of the CPU cache is mandatory for devices and
> > > > > root complexes in the PCI specification. Incoherent access is just an
> > > > > extension.
> > > > >
> > > > > We inherited that by basing DMA-buf on the Linux kernel DMA-API which in
> > > > > turn is largely based on the PCI specification.
> > > > >
> > > > > > Were the (mandatory for CPU access) cpu_access_begin/end functions &
> > > > > > ioctls not supposed to ensure that CPU cache is up-to-date / CPU cache
> > > > > > is fully flushed out?
> > > > > No, those functions are to inform the exporter that the importer has
> > > > > started and finished accessing the buffer using the CPU.
> > > > >
> > > > > There is no signaling in the other direction. In other words the
> > > > > exporter doesn't inform the importer about CPU accesses because it is
> > > > > the owner of the buffer.
> > > > >
> > > > > It's the responsibility of the importer to make sure that it can
> > > > > actually access the data in the buffer. If it can't guarantee that the
> > > > > importer shouldn't import the buffer in the first place.
> > > > This is not really correct. DMA-buf inherited the map/unmap part
> > > > from the DMA API, which on cache coherent architectures is mostly a no-
> > > > op or ties into the IOMMU implementation to set up the pagetables for
> > > > the translation. On non cache coherent architectures this is the point
> > > > where any necessary cache maintenance happens. DRM breaks this
> > > > model by caching the DMA-buf mapping for performance reasons.
> > > That's not only because of performance reasons, but also because of
> > > correctness.
> > >
> > > At least the Vulkan API and a bunch of OpenGL extensions make it
> > > mandatory for the buffer to be cache coherent. The kernel is simply not
> > > informed about domain transfers.
> > >
> > > For example you can just do a CPU copy to a ring buffer and the
> > > expectation is that an already running shader sees that.
> > Yes, that one is not really an issue, as you know that at buffer
> > creation time and can make sure to map those buffers uncached on non
> > coherent arches. If there are no explicit domain transfer points, non
> > coherent arches must bite the bullet and bypass the CPU caches, running
> > performance into the ground.
>
> Yes, exactly that was what this mail thread was about. But this case is
> currently not supported by DMA-buf.
>
> In other words, cache coherency is currently mandatory for everybody
> involved.
>
> > > > In the DMA API keeping things mapped is also a valid use-case, but then
> > > > you need to do explicit domain transfers via the dma_sync_* family,
> > > > which DMA-buf has not inherited. Again those sync are no-ops on cache
> > > > coherent architectures, but do any necessary cache maintenance on non
> > > > coherent arches.
> > > Correct, yes. Coherency is mandatory for DMA-buf, you can't use
> > > dma_sync_* on it when you are the importer.
> > >
> > > The exporter could of course make use of that because he is the owner of
> > > the buffer.
> > In the example given here with UVC video, you don't know that the
> > buffer will be exported and needs to be coherent without
> > synchronization points, due to the mapping cache at the DRM side. So
> > V4L2 naturally allocates the buffers from CPU cached memory. If the
> > expectation is that those buffers are device coherent without relying
> > on the map/unmap_attachment calls, then V4L2 needs to always
> > synchronize caches on DQBUF when the buffer is allocated from CPU
> > cached memory and a single DMA-buf attachment exists. And while writing
> > this I realize that this is probably exactly what V4L2 should do...
>
> No, the expectation is that the importer can deal with whatever the
> exporter provides.
>
> If the importer can't access the DMA-buf coherently it's his job to
> handle that gracefully.
How does the importer know that the memory behind the DMA-buf is in CPU
cached memory?
If you now tell me that an importer always needs to assume this and
reject the import if it can't do snooping, then any DMA-buf usage on
most ARM SoCs is currently invalid usage. On most of the multimedia
targeted ARM SoCs being unable to snoop the cache is the norm, not an
exception.
>
> See for example on AMD/Intel hardware most of the engines can perfectly
> deal with cache coherent memory accesses. Only the display engines can't.
>
> So on import time we can't even say if the access can be coherent and
> snoop the CPU cache or not because we don't know how the imported
> DMA-buf will be used later on.
>
So for those mixed use cases, wouldn't it help to have something
similar to the dma_sync in the DMA-buf API, so your scanout usage can
tell the exporter that it's going to do non-snoop access and any dirty
cache lines must be cleaned? Signaling this to the exporter would allow
to skip the cache maintenance if the buffer is in CPU uncached memory,
which again is a default case for the ARM SoC world.
Regards,
Lucas
Am 23.06.22 um 11:33 schrieb Lucas Stach:
> [SNIP]
>>>>> In the DMA API keeping things mapped is also a valid use-case, but then
>>>>> you need to do explicit domain transfers via the dma_sync_* family,
>>>>> which DMA-buf has not inherited. Again those sync are no-ops on cache
>>>>> coherent architectures, but do any necessary cache maintenance on non
>>>>> coherent arches.
>>>> Correct, yes. Coherency is mandatory for DMA-buf, you can't use
>>>> dma_sync_* on it when you are the importer.
>>>>
>>>> The exporter could of course make use of that because he is the owner of
>>>> the buffer.
>>> In the example given here with UVC video, you don't know that the
>>> buffer will be exported and needs to be coherent without
>>> synchronization points, due to the mapping cache at the DRM side. So
>>> V4L2 naturally allocates the buffers from CPU cached memory. If the
>>> expectation is that those buffers are device coherent without relying
>>> on the map/unmap_attachment calls, then V4L2 needs to always
>>> synchronize caches on DQBUF when the buffer is allocated from CPU
>>> cached memory and a single DMA-buf attachment exists. And while writing
>>> this I realize that this is probably exactly what V4L2 should do...
>> No, the expectation is that the importer can deal with whatever the
>> exporter provides.
>>
>> If the importer can't access the DMA-buf coherently it's his job to
>> handle that gracefully.
> How does the importer know that the memory behind the DMA-buf is in CPU
> cached memory?
>
> If you now tell me that an importer always needs to assume this and
> reject the import if it can't do snooping, then any DMA-buf usage on
> most ARM SoCs is currently invalid usage.
Yes, exactly that. I've pointed out a couple of times now that a lot of
ARM SoCs don't implement that the way we need it.
We already had tons of bug reports because somebody attached a random
PCI root complex to an ARM SoC and expected it to work with for example
an AMD GPU.
Non-cache coherent applications are currently not really supported by
the DMA-buf framework in any way.
> On most of the multimedia
> targeted ARM SoCs being unable to snoop the cache is the norm, not an
> exception.
>
>> See for example on AMD/Intel hardware most of the engines can perfectly
>> deal with cache coherent memory accesses. Only the display engines can't.
>>
>> So on import time we can't even say if the access can be coherent and
>> snoop the CPU cache or not because we don't know how the imported
>> DMA-buf will be used later on.
>>
> So for those mixed use cases, wouldn't it help to have something
> similar to the dma_sync in the DMA-buf API, so your scanout usage can
> tell the exporter that it's going to do non-snoop access and any dirty
> cache lines must be cleaned? Signaling this to the exporter would allow
> to skip the cache maintenance if the buffer is in CPU uncached memory,
> which again is a default case for the ARM SoC world.
Well for the AMD and Intel use cases we at least have the opportunity to
signal cache flushing, but I'm not sure if that counts for everybody.
What we would rather do for those use cases is an indicator on the
DMA-buf if the underlying backing store is CPU cached or not. The
importer can then cleanly reject the use cases where it can't support
CPU cache snooping.
This then results in the normal fallback paths which we have anyway for
those use cases because DMA-buf sharing is not always possible.
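To make that concrete, such an indicator could look roughly like this
(hypothetical sketch, neither the flag nor the check exist upstream; the
"cpu_cached" field would be set by the exporter at export time):

static int scanout_attach(struct dma_buf *dmabuf,
                          struct dma_buf_attachment *attach)
{
        /* Non-snooping display engine: cleanly reject CPU cached
         * backing store and let the driver fall back to its own path. */
        if (dmabuf->cpu_cached)
                return -EOPNOTSUPP;
        return 0;
}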
Regards,
Christian.
>
> Regards,
> Lucas
>
Am Donnerstag, dem 23.06.2022 um 11:46 +0200 schrieb Christian König:
> Am 23.06.22 um 11:33 schrieb Lucas Stach:
> > [SNIP]
> > > > > > In the DMA API keeping things mapped is also a valid use-case, but then
> > > > > > you need to do explicit domain transfers via the dma_sync_* family,
> > > > > > which DMA-buf has not inherited. Again those sync are no-ops on cache
> > > > > > coherent architectures, but do any necessary cache maintenance on non
> > > > > > coherent arches.
> > > > > Correct, yes. Coherency is mandatory for DMA-buf, you can't use
> > > > > dma_sync_* on it when you are the importer.
> > > > >
> > > > > The exporter could of course make use of that because he is the owner of
> > > > > the buffer.
> > > > In the example given here with UVC video, you don't know that the
> > > > buffer will be exported and needs to be coherent without
> > > > synchronization points, due to the mapping cache at the DRM side. So
> > > > V4L2 naturally allocates the buffers from CPU cached memory. If the
> > > > expectation is that those buffers are device coherent without relying
> > > > on the map/unmap_attachment calls, then V4L2 needs to always
> > > > synchronize caches on DQBUF when the buffer is allocated from CPU
> > > > cached memory and a single DMA-buf attachment exists. And while writing
> > > > this I realize that this is probably exactly what V4L2 should do...
> > > No, the expectation is that the importer can deal with whatever the
> > > exporter provides.
> > >
> > > If the importer can't access the DMA-buf coherently it's his job to
> > > handle that gracefully.
> > How does the importer know that the memory behind the DMA-buf is in CPU
> > cached memory?
> >
> > If you now tell me that an importer always needs to assume this and
> > reject the import if it can't do snooping, then any DMA-buf usage on
> > most ARM SoCs is currently invalid usage.
>
> Yes, exactly that. I've pointed out a couple of times now that a lot of
> ARM SoCs don't implement that the way we need it.
>
> We already had tons of bug reports because somebody attached a random
> PCI root complex to an ARM SoC and expected it to work with for example
> an AMD GPU.
>
> Non-cache coherent applications are currently not really supported by
> the DMA-buf framework in any way.
>
I'm not talking about bolting a PCIe root complex, with its implicitly
inherited "PCI is cache coherent" expectations, onto an ARM SoC, but about
the standard VPU/GPU/display engines, which don't snoop on most ARM SoCs.
> > On most of the multimedia
> > targeted ARM SoCs being unable to snoop the cache is the norm, not an
> > exception.
> >
> > > See for example on AMD/Intel hardware most of the engines can perfectly
> > > deal with cache coherent memory accesses. Only the display engines can't.
> > >
> > > So on import time we can't even say if the access can be coherent and
> > > snoop the CPU cache or not because we don't know how the imported
> > > DMA-buf will be used later on.
> > >
> > So for those mixed use cases, wouldn't it help to have something
> > similar to the dma_sync in the DMA-buf API, so your scanout usage can
> > tell the exporter that it's going to do non-snoop access and any dirty
> > cache lines must be cleaned? Signaling this to the exporter would allow
> > to skip the cache maintenance if the buffer is in CPU uncached memory,
> > which again is a default case for the ARM SoC world.
>
> Well for the AMD and Intel use cases we at least have the opportunity to
> signal cache flushing, but I'm not sure if that counts for everybody.
>
Sure, all the non-coherent arches have some way to do the cache
maintenance in some explicit way. Non coherent and no cache maintenance
instruction would be a recipe for disaster. ;)
> What we would rather do for those use cases is an indicator on the
> DMA-buf if the underlying backing store is CPU cached or not. The
> importer can then cleanly reject the use cases where it can't support
> CPU cache snooping.
>
> This then results in the normal fallback paths which we have anyway for
> those use cases because DMA-buf sharing is not always possible.
>
That's a very x86 centric world view you have there. 99% of DMA-buf
uses on those cheap ARM SoCs is non-snooping. We can not do any
fallbacks here, as the whole graphics world on those SoCs with their
different IP cores mixed together depends on DMA-buf sharing working
efficiently even when the SoC is mostly non coherent.
In fact DMA-buf sharing works fine on most of those SoCs because
everyone just assumes that all the accelerators don't snoop, so the
memory shared via DMA-buf is mostly CPU uncached. It only falls apart
for uses like the UVC cameras, where the shared buffer ends up being
CPU cached.
Non-coherent without explicit domain transfer points is just not going
to work. So why can't we solve the issue for DMA-buf in the same way as
the DMA API already solved it years ago: by adding the equivalent of
the dma_sync calls that do cache maintenance when necessary? On x86 (or
any system where things are mostly coherent) you could still no-op them
for the common case and only trigger cache cleaning if the importer
explicitly says that is going to do a non-snooping access.
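Something along these lines, i.e. an importer-visible equivalent of the
dma_sync_* calls (purely hypothetical, nothing like this exists in DMA-buf
today):

/* No-ops when exporter and importer are cache coherent, cache
 * maintenance otherwise, mirroring dma_sync_sgtable_for_{device,cpu}(). */
int dma_buf_sync_for_device(struct dma_buf_attachment *attach,
                            enum dma_data_direction dir);
int dma_buf_sync_for_cpu(struct dma_buf_attachment *attach,
                         enum dma_data_direction dir);

/* A non-snooping scanout importer would bracket its access with these,
 * e.g. dma_buf_sync_for_device(attach, DMA_TO_DEVICE) before the flip. */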
Regards,
Lucas
Am 23.06.22 um 12:13 schrieb Lucas Stach:
> [SNIP]
>>> On most of the multimedia
>>> targeted ARM SoCs being unable to snoop the cache is the norm, not an
>>> exception.
>>>
>>>> See for example on AMD/Intel hardware most of the engines can perfectly
>>>> deal with cache coherent memory accesses. Only the display engines can't.
>>>>
>>>> So on import time we can't even say if the access can be coherent and
>>>> snoop the CPU cache or not because we don't know how the imported
>>>> DMA-buf will be used later on.
>>>>
>>> So for those mixed use cases, wouldn't it help to have something
>>> similar to the dma_sync in the DMA-buf API, so your scanout usage can
>>> tell the exporter that it's going to do non-snoop access and any dirty
>>> cache lines must be cleaned? Signaling this to the exporter would allow
>>> to skip the cache maintenance if the buffer is in CPU uncached memory,
>>> which again is a default case for the ARM SoC world.
>> Well for the AMD and Intel use cases we at least have the opportunity to
>> signal cache flushing, but I'm not sure if that counts for everybody.
>>
> Sure, all the non-coherent arches have some way to do the cache
> maintenance in some explicit way. Non coherent and no cache maintenance
> instruction would be a recipe for disaster. ;)
>
>> What we would rather do for those use cases is an indicator on the
>> DMA-buf if the underlying backing store is CPU cached or not. The
>> importer can then cleanly reject the use cases where it can't support
>> CPU cache snooping.
>>
>> This then results in the normal fallback paths which we have anyway for
>> those use cases because DMA-buf sharing is not always possible.
>>
> That's a very x86 centric world view you have there. 99% of DMA-buf
> uses on those cheap ARM SoCs is non-snooping. We can not do any
> fallbacks here, as the whole graphics world on those SoCs with their
> different IP cores mixed together depends on DMA-buf sharing working
> efficiently even when the SoC is mostly non coherent.
>
> In fact DMA-buf sharing works fine on most of those SoCs because
> everyone just assumes that all the accelerators don't snoop, so the
> memory shared via DMA-buf is mostly CPU uncached. It only falls apart
> for uses like the UVC cameras, where the shared buffer ends up being
> CPU cached.
Well then the existing DMA-buf framework is not what you want to use for
this.
> Non-coherent without explicit domain transfer points is just not going
> to work. So why can't we solve the issue for DMA-buf in the same way as
> the DMA API already solved it years ago: by adding the equivalent of
> the dma_sync calls that do cache maintenance when necessary? On x86 (or
> any system where things are mostly coherent) you could still no-op them
> for the common case and only trigger cache cleaning if the importer
> explicitly says that is going to do a non-snooping access.
Because DMA-buf is a framework for buffer sharing between cache coherent
devices which don't signal transitions.
We intentionally didn't implement any of the dma_sync_* functions
because that would break the intended use case.
You can of course use DMA-buf in an incoherent environment, but then you
can't expect that this works all the time.
This is documented behavior and so far we have bluntly rejected any of
the complaints that it doesn't work on most ARM SoCs and I don't really
see a way to do this differently.
Regards,
Christian.
>
> Regards,
> Lucas
>
Am 23.06.22 um 13:27 schrieb Daniel Stone:
> Hi Christian,
>
> On Thu, 23 Jun 2022 at 12:11, Christian König <[email protected]> wrote:
>>> In fact DMA-buf sharing works fine on most of those SoCs because
>>> everyone just assumes that all the accelerators don't snoop, so the
>>> memory shared via DMA-buf is mostly CPU uncached. It only falls apart
>>> for uses like the UVC cameras, where the shared buffer ends up being
>>> CPU cached.
>> Well then the existing DMA-buf framework is not what you want to use for
>> this.
>>
>>> Non-coherent without explicit domain transfer points is just not going
>>> to work. So why can't we solve the issue for DMA-buf in the same way as
>>> the DMA API already solved it years ago: by adding the equivalent of
>>> the dma_sync calls that do cache maintenance when necessary? On x86 (or
>>> any system where things are mostly coherent) you could still no-op them
>>> for the common case and only trigger cache cleaning if the importer
>>> explicitly says that is going to do a non-snooping access.
>> Because DMA-buf is a framework for buffer sharing between cache coherent
>> devices which don't signal transitions.
>>
>> We intentionally didn't implement any of the dma_sync_* functions
>> because that would break the intended use case.
>>
>> You can of course use DMA-buf in an incoherent environment, but then you
>> can't expect that this works all the time.
>>
>> This is documented behavior and so far we have bluntly rejected any of
>> the complaints that it doesn't work on most ARM SoCs and I don't really
>> see a way to do this differently.
> For some strange reason, 'let's do buffer sharing but make sure it
> doesn't work on Arm' wasn't exactly the intention of the groups who
> came together under the Linaro umbrella to create dmabuf.
>
> If it's really your belief that dmabuf requires universal snooping, I
> recommend you send the patch to update the documentation, as well as
> to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
Well, to be honest I think that would indeed be necessary.
What we have created are essentially two different worlds, one for PCI
devices and one for the rest.
This was indeed not the intention, but it's a fact that basically all
DMA-buf based PCI drivers assume coherent access.
Regards,
Christian.
>
> Cheers,
> Daniel
Hi Christian,
On Thu, 23 Jun 2022 at 12:11, Christian König <[email protected]> wrote:
> > In fact DMA-buf sharing works fine on most of those SoCs because
> > everyone just assumes that all the accelerators don't snoop, so the
> > memory shared via DMA-buf is mostly CPU uncached. It only falls apart
> > for uses like the UVC cameras, where the shared buffer ends up being
> > CPU cached.
>
> Well then the existing DMA-buf framework is not what you want to use for
> this.
>
> > Non-coherent without explicit domain transfer points is just not going
> > to work. So why can't we solve the issue for DMA-buf in the same way as
> > the DMA API already solved it years ago: by adding the equivalent of
> > the dma_sync calls that do cache maintenance when necessary? On x86 (or
> > any system where things are mostly coherent) you could still no-op them
> > for the common case and only trigger cache cleaning if the importer
> > explicitly says that is going to do a non-snooping access.
>
> Because DMA-buf is a framework for buffer sharing between cache coherent
> devices which don't signal transitions.
>
> We intentionally didn't implement any of the dma_sync_* functions
> because that would break the intended use case.
>
> You can of course use DMA-buf in an incoherent environment, but then you
> can't expect that this works all the time.
>
> This is documented behavior and so far we have bluntly rejected any of
> the complaints that it doesn't work on most ARM SoCs and I don't really
> see a way to do this differently.
For some strange reason, 'let's do buffer sharing but make sure it
doesn't work on Arm' wasn't exactly the intention of the groups who
came together under the Linaro umbrella to create dmabuf.
If it's really your belief that dmabuf requires universal snooping, I
recommend you send the patch to update the documentation, as well as
to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
Cheers,
Daniel
Am Donnerstag, dem 23.06.2022 um 13:10 +0200 schrieb Christian König:
> Am 23.06.22 um 12:13 schrieb Lucas Stach:
> > [SNIP]
> > > > On most of the multimedia
> > > > targeted ARM SoCs being unable to snoop the cache is the norm, not an
> > > > exception.
> > > >
> > > > > See for example on AMD/Intel hardware most of the engines can perfectly
> > > > > deal with cache coherent memory accesses. Only the display engines can't.
> > > > >
> > > > > So on import time we can't even say if the access can be coherent and
> > > > > snoop the CPU cache or not because we don't know how the imported
> > > > > DMA-buf will be used later on.
> > > > >
> > > > So for those mixed use cases, wouldn't it help to have something
> > > > similar to the dma_sync in the DMA-buf API, so your scanout usage can
> > > > tell the exporter that it's going to do non-snoop access and any dirty
> > > > cache lines must be cleaned? Signaling this to the exporter would allow
> > > > to skip the cache maintenance if the buffer is in CPU uncached memory,
> > > > which again is a default case for the ARM SoC world.
> > > Well for the AMD and Intel use cases we at least have the opportunity to
> > > signal cache flushing, but I'm not sure if that counts for everybody.
> > >
> > Sure, all the non-coherent arches have some way to do the cache
> > maintenance in some explicit way. Non coherent and no cache maintenance
> > instruction would be a recipe for disaster. ;)
> >
> > > What we would rather do for those use cases is an indicator on the
> > > DMA-buf if the underlying backing store is CPU cached or not. The
> > > importer can then cleanly reject the use cases where it can't support
> > > CPU cache snooping.
> > >
> > > This then results in the normal fallback paths which we have anyway for
> > > those use cases because DMA-buf sharing is not always possible.
> > >
> > That's a very x86 centric world view you have there. 99% of DMA-buf
> > uses on those cheap ARM SoCs is non-snooping. We can not do any
> > fallbacks here, as the whole graphics world on those SoCs with their
> > different IP cores mixed together depends on DMA-buf sharing working
> > efficiently even when the SoC is mostly non coherent.
> >
> > In fact DMA-buf sharing works fine on most of those SoCs because
> > everyone just assumes that all the accelerators don't snoop, so the
> > memory shared via DMA-buf is mostly CPU uncached. It only falls apart
> > for uses like the UVC cameras, where the shared buffer ends up being
> > CPU cached.
>
> Well then the existing DMA-buf framework is not what you want to use for
> this.
>
Sorry, but this is just ignoring reality. You try to flag 8+ years of
DMA-buf usage on non-coherent arches as "you shouldn't do this". At
this point there are probably a lot more users (drivers) of DMA-buf in
the kernel for devices, which are used on non-coherent arches, than
there are on coherent arches.
> > Non-coherent without explicit domain transfer points is just not going
> > to work. So why can't we solve the issue for DMA-buf in the same way as
> > the DMA API already solved it years ago: by adding the equivalent of
> > the dma_sync calls that do cache maintenance when necessary? On x86 (or
> > any system where things are mostly coherent) you could still no-op them
> > for the common case and only trigger cache cleaning if the importer
> > explicitly says that is going to do a non-snooping access.
>
> Because DMA-buf is a framework for buffer sharing between cache coherent
> devices which don't signal transitions.
>
> We intentionally didn't implement any of the dma_sync_* functions
> because that would break the intended use case.
>
Non coherent access, including your non-snoop scanout, and no domain
transition signal just doesn't go together when you want to solve
things in a generic way.
Remember that in a fully (not only IO) coherent system the CPU isn't
the only agent that may cache the content you are trying to access
here. The dirty cacheline could reasonably still be sitting in a GPU or
VPU cache, so you need some way to clean those cachelines, which isn't
a magic "importer knows how to call CPU cache clean instructions".
> You can of course use DMA-buf in an incoherent environment, but then you
> can't expect that this works all the time.
>
> This is documented behavior and so far we have bluntly rejected any of
> the complaints that it doesn't work on most ARM SoCs and I don't really
> see a way to do this differently.
Can you point me to that part of the documentation? A quick grep for
"coherent" didn't immediately turn something up within the DMA-buf
dirs.
Regards,
Lucas
Am 23.06.22 um 13:29 schrieb Lucas Stach:
> [SNIP]
>> Well then the existing DMA-buf framework is not what you want to use for
>> this.
>>
> Sorry, but this is just ignoring reality. You try to flag 8+ years of
> DMA-buf usage on non-coherent arches as "you shouldn't do this". At
> this point there are probably a lot more users (drivers) of DMA-buf in
> the kernel for devices, which are used on non-coherent arches, than
> there are on coherent arches.
Well, it's my reality that people come up with bug reports about that
and we have been pushing back on this with the explanation "Hey this is
not supported" as long as I can think about it.
I mean I even had somebody from ARM who told me that this is not going
to work with our GPUs on a specific SoC. That there are ARM internal use
cases which just seem to work because all the devices are non-coherent
is completely new to me.
I'm as much surprised as you are about this lack of agreement about such
fundamental stuff.
>>> Non-coherent without explicit domain transfer points is just not going
>>> to work. So why can't we solve the issue for DMA-buf in the same way as
>>> the DMA API already solved it years ago: by adding the equivalent of
>>> the dma_sync calls that do cache maintenance when necessary? On x86 (or
>>> any system where things are mostly coherent) you could still no-op them
>>> for the common case and only trigger cache cleaning if the importer
>>> explicitly says that is going to do a non-snooping access.
>> Because DMA-buf is a framework for buffer sharing between cache coherent
>> devices which don't signal transitions.
>>
>> We intentionally didn't implement any of the dma_sync_* functions
>> because that would break the intended use case.
>>
> Non coherent access, including your non-snoop scanout, and no domain
> transition signal just doesn't go together when you want to solve
> things in a generic way.
Yeah, that's the stuff I totally agree on.
See we absolutely do have the requirement of implementing coherent
access without domain transitions for Vulkan and OpenGL+extensions.
When we now have to introduce domain transitions to get non coherent
access working we are essentially splitting all the drivers into
coherent and non-coherent ones.
That doesn't sound like it would improve interop.
> Remember that in a fully (not only IO) coherent system the CPU isn't
> the only agent that may cache the content you are trying to access
> here. The dirty cacheline could reasonably still be sitting in a GPU or
> VPU cache, so you need some way to clean those cachelines, which isn't
> a magic "importer knows how to call CPU cache clean instructions".
IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I
need to double check as well, that's way too long ago) this was kicked
out because of the requirements above.
>> You can of course use DMA-buf in an incoherent environment, but then you
>> can't expect that this works all the time.
>>
>> This is documented behavior and so far we have bluntly rejected any of
>> the complaints that it doesn't work on most ARM SoCs and I don't really
>> see a way to do this differently.
> Can you point me to that part of the documentation? A quick grep for
> "coherent" didn't immediately turn something up within the DMA-buf
> dirs.
Search for "cache coherency management". It's quite a while ago, but I
do remember helping to review that stuff.
Regards,
Christian.
>
> Regards,
> Lucas
>
Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
> Am 23.06.22 um 13:29 schrieb Lucas Stach:
> > [SNIP]
> > > Well then the existing DMA-buf framework is not what you want to use for
> > > this.
> > >
> > Sorry, but this is just ignoring reality. You try to flag 8+ years of
> > DMA-buf usage on non-coherent arches as "you shouldn't do this". At
> > this point there are probably a lot more users (drivers) of DMA-buf in
> > the kernel for devices, which are used on non-coherent arches, than
> > there are on coherent arches.
>
> Well, it's my reality that people come up with bug reports about that
> and we have been pushing back on this with the explanation "Hey this is
> not supported" as long as I can think about it.
>
> I mean I even had somebody from ARM who told me that this is not going
> to work with our GPUs on a specific SoC. That there are ARM internal use
> cases which just seem to work because all the devices are non-coherent
> is completely new to me.
>
Yes, trying to hook up a peripheral that assumes cache snooping in some
design details to a non coherent SoC may end up exploding in various
ways. On the other hand you can work around most of those assumptions
by marking the memory as uncached to the CPU, which may tank
performance, but will work from a correctness PoV.
> I'm as much surprised as you are about this lack of agreement about such
> fundamental stuff.
>
> > > > Non-coherent without explicit domain transfer points is just not going
> > > > to work. So why can't we solve the issue for DMA-buf in the same way as
> > > > the DMA API already solved it years ago: by adding the equivalent of
> > > > the dma_sync calls that do cache maintenance when necessary? On x86 (or
> > > > any system where things are mostly coherent) you could still no-op them
> > > > for the common case and only trigger cache cleaning if the importer
> > > > explicitly says that is going to do a non-snooping access.
> > > Because DMA-buf is a framework for buffer sharing between cache coherent
> > > devices which don't signal transitions.
> > >
> > > We intentionally didn't implement any of the dma_sync_* functions
> > > because that would break the intended use case.
> > >
> > Non coherent access, including your non-snoop scanout, and no domain
> > transition signal just doesn't go together when you want to solve
> > things in a generic way.
>
> Yeah, that's the stuff I totally agree on.
>
> See we absolutely do have the requirement of implementing coherent
> access without domain transitions for Vulkan and OpenGL+extensions.
>
Coherent can mean 2 different things:
1. CPU cached with snooping from the IO device
2. CPU uncached
The Vulkan and GL "coherent" uses are really coherent without explicit
domain transitions, so on non coherent arches that require the
transitions the only way to implement this is by making the memory CPU
uncached. Which from a performance PoV will probably not be what app
developers expect, but will still expose the correct behavior.
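For reference, the userspace side of that "coherent" is just a memory type
flag; a sketch assuming <vulkan/vulkan.h>:

uint32_t find_host_coherent_type(const VkPhysicalDeviceMemoryProperties *props)
{
        const VkMemoryPropertyFlags want =
                VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |
                VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;

        /* On a non-snooping SoC the matching type is typically backed
         * by CPU uncached (or write-combined) memory. */
        for (uint32_t i = 0; i < props->memoryTypeCount; i++)
                if ((props->memoryTypes[i].propertyFlags & want) == want)
                        return i;
        return UINT32_MAX;
}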
> When we now have to introduce domain transitions to get non coherent
> access working we are essentially splitting all the drivers into
> coherent and non-coherent ones.
>
> That doesn't sound like it would improve interop.
>
> > Remember that in a fully (not only IO) coherent system the CPU isn't
> > the only agent that may cache the content you are trying to access
> > here. The dirty cacheline could reasonably still be sitting in a GPU or
> > VPU cache, so you need some way to clean those cachelines, which isn't
> > a magic "importer knows how to call CPU cache clean instructions".
>
> IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I
> need to double check as well, that's way too long ago) this was kicked
> out because of the requirements above.
>
The DMA_BUF_IOCTL_SYNC is available in upstream, with the explicit
documentation that "userspace can not rely on coherent access".
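For completeness, that is the uapi in question (<linux/dma-buf.h>); it
brackets CPU access, it does not make the buffer coherent:

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

static int cpu_access(int dmabuf_fd)
{
        struct dma_buf_sync sync = {
                .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_RW,
        };

        if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync))
                return -1;

        /* ... CPU reads/writes through the mmap()ed buffer ... */

        sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_RW;
        return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}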
> > > You can of course use DMA-buf in an incoherent environment, but then you
> > > can't expect that this works all the time.
> > >
> > > This is documented behavior and so far we have bluntly rejected any of
> > > > > the complaints that it doesn't work on most ARM SoCs and I don't really
> > > see a way to do this differently.
> > Can you point me to that part of the documentation? A quick grep for
> > "coherent" didn't immediately turn something up within the DMA-buf
> > dirs.
>
> Search for "cache coherency management". It's quite a while ago, but I
> do remember helping to review that stuff.
>
That only turns up the lines in the DMA_BUF_IOCTL_SYNC doc, which say
the exact opposite of "DMA-buf is always coherent".
I also don't see why you think that both world views are so totally
different. We could just require explicit domain transitions for non-
snoop access, which would probably solve your scanout issue and would
not be a problem for most ARM systems, where we could no-op this if the
buffer is already in uncached memory and at the same time keep the "x86
assumes cached + snooped access by default" semantics.
Regards,
Lucas
Am 23.06.22 um 14:14 schrieb Lucas Stach:
> Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
>> Am 23.06.22 um 13:29 schrieb Lucas Stach:
>> [SNIP]
>> I mean I even had somebody from ARM who told me that this is not going
>> to work with our GPUs on a specific SoC. That there are ARM internal use
>> cases which just seem to work because all the devices are non-coherent
>> is completely new to me.
>>
> Yes, trying to hook up a peripheral that assumes cache snooping in some
> design details to a non coherent SoC may end up exploding in various
> ways. On the other hand you can work around most of those assumptions
> by marking the memory as uncached to the CPU, which may tank
> performance, but will work from a correctness PoV.
Yeah, and that's exactly what I meant with "DMA-buf is not the framework
for this".
See we do support using uncached/not snooped memory in DMA-buf, but only
for the exporter side.
For example the AMD and Intel GPUs have a per buffer flag for this.
The importer on the other hand needs to be able to handle whatever the
exporter provides.
>> [SNIP]
>>> Non coherent access, including your non-snoop scanout, and no domain
>>> transition signal just doesn't go together when you want to solve
>>> things in a generic way.
>> Yeah, that's the stuff I totally agree on.
>>
>> See we absolutely do have the requirement of implementing coherent
>> access without domain transitions for Vulkan and OpenGL+extensions.
>>
> Coherent can mean 2 different things:
> 1. CPU cached with snooping from the IO device
> 2. CPU uncached
>
> The Vulkan and GL "coherent" uses are really coherent without explicit
> domain transitions, so on non coherent arches that require the
> transitions the only way to implement this is by making the memory CPU
> uncached. Which from a performance PoV will probably not be what app
> developers expect, but will still expose the correct behavior.
Quite a bummer for performance, but yes, that should work.
>>> Remember that in a fully (not only IO) coherent system the CPU isn't
>>> the only agent that may cache the content you are trying to access
>>> here. The dirty cacheline could reasonably still be sitting in a GPU or
>>> VPU cache, so you need some way to clean those cachelines, which isn't
>>> a magic "importer knows how to call CPU cache clean instructions".
>> IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I
>> need to double check as well, that's way too long ago) this was kicked
>> out because of the requirements above.
>>
> The DMA_BUF_IOCTL_SYNC is available in upstream, with the explicit
> documentation that "userspace can not rely on coherent access".
Yeah, double checked that as well. This is for the coherency case on the
exporter side.
>>>> You can of course use DMA-buf in an incoherent environment, but then you
>>>> can't expect that this works all the time.
>>>>
>>>> This is documented behavior and so far we have bluntly rejected any of
>>>> the complaints that it doesn't work on most ARM SoCs and I don't really
>>>> see a way to do this differently.
>>> Can you point me to that part of the documentation? A quick grep for
>>> "coherent" didn't immediately turn something up within the DMA-buf
>>> dirs.
>> Search for "cache coherency management". It's quite a while ago, but I
>> do remember helping to review that stuff.
>>
> That only turns up the lines in the DMA_BUF_IOCTL_SYNC doc, which say
> the exact opposite of "DMA-buf is always coherent".
Sounds like I'm not making clear what I want to say here: For the
exporter using cache coherent memory is optional, for the importer it isn't.
For the exporter it is perfectly valid to use kmalloc, get_free_page
etc... on his buffers as long as it uses the DMA API to give the
importer access to it.
The importer on the other hand needs to be able to deal with that. When
this is not the case then the importer somehow needs to work around that.
Either by flushing the CPU caches or by rejecting using the imported
buffer for this specific use case (like AMD and Intel drivers should be
doing).
If the Intel or ARM display drivers need non-cached memory and don't
reject buffers where they don't know this, then that's certainly a bug in
those drivers.
Otherwise we would need to change all DMA-buf exporters to use a special
function for allocating non-coherent memory, and that is certainly not
going to fly.
> I also don't see why you think that both world views are so totally
> different. We could just require explicit domain transitions for non-
> snoop access, which would probably solve your scanout issue and would
> not be a problem for most ARM systems, where we could no-op this if the
> buffer is already in uncached memory and at the same time keep the "x86
> assumes cached + snooped access by default" semantics.
Well the key point is we intentionally rejected that design previously
because it created all kinds of trouble as well.
For this limited use case of doing a domain transition right before
scanout it might make sense, but that's just one use case.
Regards,
Christian.
>
> Regards,
> Lucas
>
>
Am Donnerstag, dem 23.06.2022 um 14:52 +0200 schrieb Christian König:
> Am 23.06.22 um 14:14 schrieb Lucas Stach:
> > Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
> > > Am 23.06.22 um 13:29 schrieb Lucas Stach:
> > > [SNIP]
> > > I mean I even had somebody from ARM which told me that this is not going
> > > to work with our GPUs on a specific SoC. That there are ARM internal use
> > > cases which just seem to work because all the devices are non-coherent
> > > is completely new to me.
> > >
> > Yes, trying to hook up a peripheral that assumes cache snooping in some
> > design details to a non coherent SoC may end up exploding in various
> > ways. On the other hand you can work around most of those assumptions
> > by marking the memory as uncached to the CPU, which may tank
> > performance, but will work from a correctness PoV.
>
> Yeah, and that's exactly what I meant with "DMA-buf is not the framework
> for this".
>
> See we do support using uncached/not snooped memory in DMA-buf, but only
> for the exporter side.
>
> For example the AMD and Intel GPUs have a per buffer flag for this.
>
> The importer on the other hand needs to be able to handle whatever the
> exporter provides.
>
I fail to construct a case where you want the Vulkan/GL "no domain
transition" coherent semantic without the allocator knowing about this.
If you need this and the system is non-snooping, surely the allocator
will choose uncached memory.
I agree that you absolutely need to fail the usage when someone imports
a CPU cached buffer and then tries to use it as GL coherent on a non-
snooping system. That simply will not work.
> > > [SNIP]
> > > > Non coherent access, including your non-snoop scanout, and no domain
> > > > transition signal just doesn't go together when you want to solve
> > > > things in a generic way.
> > > Yeah, that's the stuff I totally agree on.
> > >
> > > See we absolutely do have the requirement of implementing coherent
> > > access without domain transitions for Vulkan and OpenGL+extensions.
> > >
> > Coherent can mean 2 different things:
> > 1. CPU cached with snooping from the IO device
> > 2. CPU uncached
> >
> > The Vulkan and GL "coherent" uses are really coherent without explicit
> > domain transitions, so on non coherent arches that require the
> > transitions the only way to implement this is by making the memory CPU
> > uncached. Which from a performance PoV will probably not be what app
> > developers expect, but will still expose the correct behavior.
>
> Quite a bummer for performance, but yes that should work.
>
> > > > Remember that in a fully (not only IO) coherent system the CPU isn't
> > > > the only agent that may cache the content you are trying to access
> > > > here. The dirty cacheline could reasonably still be sitting in a GPU or
> > > > VPU cache, so you need some way to clean those cachelines, which isn't
> > > > a magic "importer knows how to call CPU cache clean instructions".
> > > IIRC we do already have/had a SYNC_IOCTL for cases like this, but (I
> > > need to double check as well, that's way to long ago) this was kicked
> > > out because of the requirements above.
> > >
> > The DMA_BUF_IOCTL_SYNC is available in upstream, with the explicit
> > documentation that "userspace can not rely on coherent access".
>
> Yeah, double checked that as well. This is for the coherency case on the
> exporter side.
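For completeness, all that ioctl gives userspace is bracketing of CPU access,
roughly like this (minimal sketch, error handling omitted, dmabuf_fd and the
callback are placeholders):

#include <linux/dma-buf.h>
#include <sys/ioctl.h>

/* Bracket CPU access to an exported dma-buf fd. The exporter may do cache
 * maintenance in these calls, but only for CPU access; other devices and
 * importers are not covered. */
static void cpu_access_bracketed(int dmabuf_fd, void (*do_cpu_access)(void))
{
        struct dma_buf_sync sync = {
                .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_RW,
        };

        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
        do_cpu_access();        /* CPU touches the mmap'ed buffer */
        sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_RW;
        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}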
>
> > > > > You can of course use DMA-buf in an incoherent environment, but then you
> > > > > can't expect that this works all the time.
> > > > >
> > > > > This is documented behavior and so far we have bluntly rejected any of
> > > > > the complains that it doesn't work on most ARM SoCs and I don't really
> > > > > see a way to do this differently.
> > > > Can you point me to that part of the documentation? A quick grep for
> > > > "coherent" didn't immediately turn something up within the DMA-buf
> > > > dirs.
> > > Search for "cache coherency management". It's quite a while ago, but I
> > > do remember helping to review that stuff.
> > >
> > That only turns up the lines in DMA_BUF_IOCTL_SYNC doc, which are
> > saying the exact opposite of the DMA-buf is always coherent.
>
> Sounds like I'm not making clear what I want to say here: For the
> exporter using cache coherent memory is optional, for the importer it isn't.
>
> For the exporter it is perfectly valid to use kmalloc, get_free_page
> etc... on his buffers as long as it uses the DMA API to give the
> importer access to it.
>
And here is where our line of thought diverges: the DMA API allows
snooping and non-snooping devices to work together just fine, as it has
explicit domain transitions, which are no-ops if both devices are
snooping, but will do the necessary cache maintenance when one of them
is non-snooping but the memory is CPU cached.
I don't see why DMA-buf should be any different here. Yes, you cannot
support the "no domain transition" sharing when the memory is CPU
cached and one of the devices is non-snooping, but you can support 99%
of real use-cases like the non-snooped scanout or the UVC video import.
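To spell out the streaming DMA API pattern I mean (minimal sketch, dev, buf
and len are placeholders, no error handling):

#include <linux/dma-mapping.h>

/* The sync calls are no-ops when the device snoops the CPU caches and do
 * the required cache maintenance when it doesn't. */
static void example_streaming_dma(struct device *dev, void *buf, size_t len)
{
        dma_addr_t addr = dma_map_single(dev, buf, len, DMA_FROM_DEVICE);

        /* ... device writes into the buffer ... */

        dma_sync_single_for_cpu(dev, addr, len, DMA_FROM_DEVICE);
        /* ... CPU reads the result ... */
        dma_sync_single_for_device(dev, addr, len, DMA_FROM_DEVICE);

        /* ... more device access ... */

        dma_unmap_single(dev, addr, len, DMA_FROM_DEVICE);
}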
> The importer on the other hand needs to be able to deal with that. When
> this is not the case then the importer somehow needs to work around that.
>
Why? The importer maps the dma-buf via dma_buf_map_attachment, which in
most cases triggers a map via the DMA API on the exporter side. This
map via the DMA API will already do the right thing in terms of cache
management; it's just that we explicitly disable it via
DMA_ATTR_SKIP_CPU_SYNC in DRM because we know that the mapping will be
cached, which violates the DMA API's explicit domain transition model anyway.
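Roughly what the common DRM PRIME helpers do when an importer maps the buffer
(paraphrased sketch from memory, not a verbatim quote; the function name is
made up):

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>

/* Map the pages for the importer's device, but skip the CPU cache
 * maintenance the DMA API would normally do at map time, because the
 * mapping is created once and then kept around. */
static int example_map_for_importer(struct dma_buf_attachment *attach,
                                    struct sg_table *sgt,
                                    enum dma_data_direction dir)
{
        return dma_map_sgtable(attach->dev, sgt, dir,
                               DMA_ATTR_SKIP_CPU_SYNC);
}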
> Either by flushing the CPU caches or by rejecting using the imported
> buffer for this specific use case (like AMD and Intel drivers should be
> doing).
>
> If the Intel or ARM display drivers need non-cached memory and don't
> reject buffer where they don't know this then that's certainly a bug in
> those drivers.
It's not just display drivers: video codec accelerators and most GPUs
in this space are also non-snooping. In the ARM SoC world everyone just
assumes you are non-snooping, which is why things work for most cases
and only a handful of cases like the UVC video import are broken.
>
> Otherwise we would need to change all DMA-buf exporters to use a special
> function for allocation non-coherent memory and that is certainly not
> going to fly.
>
> > I also don't see why you think that both world views are so totally
> > different. We could just require explicit domain transitions for non-
> > snoop access, which would probably solve your scanout issue and would
> > not be a problem for most ARM systems, where we could no-op this if the
> > buffer is already in uncached memory and at the same time keep the "x86
> > assumes cached + snooped access by default" semantics.
>
> Well the key point is we intentionally rejected that design previously
> because it created all kind of trouble as well.
>
I would really like to know what issues popped up there. Moving the
dma-buf attachment to work more like a buffer used with the DMA API
seems like a good thing to me.
> For this limited use case of doing a domain transition right before
> scanout it might make sense, but that's just one use case.
>
The only case I see that we still couldn't support with a change in
that direction is the GL coherent access to an imported buffer that has
been allocated from CPU cached memory on a system with non-snooping
agents. Which to me sounds like a pretty niche use-case, but I would be
happy to be proven wrong.
Regards,
Lucas
Am 23.06.22 um 17:26 schrieb Lucas Stach:
> Am Donnerstag, dem 23.06.2022 um 14:52 +0200 schrieb Christian König:
>> Am 23.06.22 um 14:14 schrieb Lucas Stach:
>>> Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
>>>> Am 23.06.22 um 13:29 schrieb Lucas Stach:
>>>> [SNIP]
>>>> I mean I even had somebody from ARM which told me that this is not going
>>>> to work with our GPUs on a specific SoC. That there are ARM internal use
>>>> cases which just seem to work because all the devices are non-coherent
>>>> is completely new to me.
>>>>
>>> Yes, trying to hook up a peripheral that assumes cache snooping in some
>>> design details to a non coherent SoC may end up exploding in various
>>> ways. On the other hand you can work around most of those assumptions
>>> by marking the memory as uncached to the CPU, which may tank
>>> performance, but will work from a correctness PoV.
>> Yeah, and exactly that's what I meant with "DMA-buf is not the framework
>> for this".
>>
>> See we do support using uncached/not snooped memory in DMA-buf, but only
>> for the exporter side.
>>
>> For example the AMD and Intel GPUs have a per buffer flag for this.
>>
>> The importer on the other hand needs to be able to handle whatever the
>> exporter provides.
>>
> I fail to construct a case where you want the Vulkan/GL "no domain
> transition" coherent semantic without the allocator knowing about this.
> If you need this and the system is non-snooping, surely the allocator
> will choose uncached memory.
No it won't. The allocator in the exporter is independent of the importer.
That is an important and intentional design decision, because otherwise
you wouldn't have exporters/importers in the first place, but rather a
centralized allocation pool like what dma-heap implements.
See the purpose of DMA-buf is to expose the buffers in the way the
exporter wants to expose them. So when the exporting driver wants to
allocate normal cached system memory then that is perfectly fine and
completely fits into this design.
Otherwise we would need to adjust all exporters to the importers, which
is potentially not even possible.
> I agree that you absolutely need to fail the usage when someone imports
> a CPU cached buffer and then tries to use it as GL coherent on a non-
> snooping system. That simply will not work.
Exactly that, yes. That's what the attach callback is good for.
See we already have tons of cases where buffers can't be shared because
they weren't initially allocated in a way the importer can deal with.
But that's perfectly ok and intentional.
For example just take a configuration where a dedicated GPU clones the
display with an integrated GPU. The dedicated GPU needs the image in
local memory for scanout which is usually not accessible to the
integrated GPU.
So either attaching the DMA-buf or creating the KMS framebuffer config
will fail and we are running into the fallback path which involves an
extra copy. And that is perfectly fine and intentional since this
configuration is not supported by the hardware.
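To sketch what I mean with the attach callback (the bo type and the helper
are made up stand-ins for driver-private knowledge, the mechanism is just the
existing attach op):

#include <linux/dma-buf.h>

/* Sketch only: an exporter attach hook that refuses importers which the
 * current buffer placement cannot serve, so userspace takes the copy
 * fallback path. example_bo and example_importer_compatible() are
 * hypothetical. */
static int example_attach(struct dma_buf *dmabuf,
                          struct dma_buf_attachment *attach)
{
        struct example_bo *bo = dmabuf->priv;

        if (!example_importer_compatible(bo, attach->dev))
                return -EOPNOTSUPP;

        return 0;
}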
>>>> [SNIP]
>>>>>> You can of course use DMA-buf in an incoherent environment, but then you
>>>>>> can't expect that this works all the time.
>>>>>>
>>>>>> This is documented behavior and so far we have bluntly rejected any of
>>>>>> the complains that it doesn't work on most ARM SoCs and I don't really
>>>>>> see a way to do this differently.
>>>>> Can you point me to that part of the documentation? A quick grep for
>>>>> "coherent" didn't immediately turn something up within the DMA-buf
>>>>> dirs.
>>>> Search for "cache coherency management". It's quite a while ago, but I
>>>> do remember helping to review that stuff.
>>>>
>>> That only turns up the lines in DMA_BUF_IOCTL_SYNC doc, which are
>>> saying the exact opposite of the DMA-buf is always coherent.
>> Sounds like I'm not making clear what I want to say here: For the
>> exporter using cache coherent memory is optional, for the importer it isn't.
>>
>> For the exporter it is perfectly valid to use kmalloc, get_free_page
>> etc... on his buffers as long as it uses the DMA API to give the
>> importer access to it.
>>
> And here is where our line of thought diverges: the DMA API allows
> snooping and non-snooping devices to work together just fine, as it has
> explicit domain transitions, which are no-ops if both devices are
> snooping, but will do the necessary cache maintenance when one of them
> is non-snooping but the memory is CPU cached.
>
> I don't see why DMA-buf should be any different here. Yes, you can not
> support the "no domain transition" sharing when the memory is CPU
> cached and one of the devices in non-snooping, but you can support 99%
> of real use-cases like the non-snooped scanout or the UVC video import.
Well I didn't say we couldn't do it that way. What I'm saying is that it
was intentionally decided against.
We could re-iterate that decision, but this would mean that all existing
exporters would now need to provide additional functionality.
>> The importer on the other hand needs to be able to deal with that. When
>> this is not the case then the importer somehow needs to work around that.
>>
> Why? The importer maps the dma-buf via dma_buf_map_attachment, which in
> most cases triggers a map via the DMA API on the exporter side. This
> map via the DMA API will already do the right thing in terms of cache
> management, it's just that we explicitly disable it via
> DMA_ATTR_SKIP_CPU_SYNC in DRM because we know that the mapping will be
> cached, which violates the DMA API explicit domain transition anyway.
Why doesn't the importer simply call dma_sync_sg_for_device() as
necessary? See, the importer already knows when it needs to access
the buffer and, as far as I can see, has all the necessary variables to do
the sync.
The exporter on the other hand doesn't know that. So we would need to
transport this information.
Another fundamental problem is that the DMA API isn't designed for
device-to-device transitions. In other words you have CPU->device and
device->CPU transitions, but not device->device. As far as I can see the
DMA API should already have the necessary information on whether things like
cache flushes are necessary or not.
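Something like this on the importer side, right before the non-snooped
access, is all I mean (sketch; the function name and variables are
placeholders, the sg_table is the one the importer got from
dma_buf_map_attachment):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Clean the CPU caches for the mapped pages before handing the buffer to
 * a non-snooping engine such as scanout. dma_sync_sgtable_for_device() is
 * just the sg_table wrapper around dma_sync_sg_for_device(). */
static void example_prepare_scanout(struct device *importer_dev,
                                    struct sg_table *sgt)
{
        dma_sync_sgtable_for_device(importer_dev, sgt, DMA_TO_DEVICE);
}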
>> Either by flushing the CPU caches or by rejecting using the imported
>> buffer for this specific use case (like AMD and Intel drivers should be
>> doing).
>>
>> If the Intel or ARM display drivers need non-cached memory and don't
>> reject buffer where they don't know this then that's certainly a bug in
>> those drivers.
> It's not just display drivers, video codec accelerators and most GPUs
> in this space are also non-snooping. In the ARM SoC world everyone just
> assumes you are non-snooping, which is why things work for most cases
> and only a handful like the UVC video import is broken.
That is really interesting to know, but I still think that DMA-buf was
absolutely not designed for this use case.
From my point of view the primary reason for this was laptops with
both dedicated and integrated GPUs, webcams etc...
That you have a huge number of ARM-specific devices which can interop
with each other, but not with devices outside of their domain, is not
something foreseen here.
Regards,
Christian.
>> Otherwise we would need to change all DMA-buf exporters to use a special
>> function for allocation non-coherent memory and that is certainly not
>> going to fly.
>>
>>> I also don't see why you think that both world views are so totally
>>> different. We could just require explicit domain transitions for non-
>>> snoop access, which would probably solve your scanout issue and would
>>> not be a problem for most ARM systems, where we could no-op this if the
>>> buffer is already in uncached memory and at the same time keep the "x86
>>> assumes cached + snooped access by default" semantics.
>> Well the key point is we intentionally rejected that design previously
>> because it created all kind of trouble as well.
>>
> I would really like to know what issues popped up there. Moving the
> dma-buf attachment to work more like a buffer used with the DMA API
> seems like a good thing to me.
>
>> For this limited use case of doing a domain transition right before
>> scanout it might make sense, but that's just one use case.
>>
> The only case I see that we still couldn't support with a change in
> that direction is the GL coherent access to a imported buffer that has
> been allocated from CPU cached memory on a system with non-snooping
> agents. Which to me sounds like a pretty niche use-case, but I would be
> happy to be proven wrong.
>
> Regards,
> Lucas
>
Am Freitag, dem 24.06.2022 um 08:54 +0200 schrieb Christian König:
> Am 23.06.22 um 17:26 schrieb Lucas Stach:
> > Am Donnerstag, dem 23.06.2022 um 14:52 +0200 schrieb Christian König:
> > > Am 23.06.22 um 14:14 schrieb Lucas Stach:
> > > > Am Donnerstag, dem 23.06.2022 um 13:54 +0200 schrieb Christian König:
> > > > > Am 23.06.22 um 13:29 schrieb Lucas Stach:
> > > > > [SNIP]
> > > > > I mean I even had somebody from ARM which told me that this is not going
> > > > > to work with our GPUs on a specific SoC. That there are ARM internal use
> > > > > cases which just seem to work because all the devices are non-coherent
> > > > > is completely new to me.
> > > > >
> > > > Yes, trying to hook up a peripheral that assumes cache snooping in some
> > > > design details to a non coherent SoC may end up exploding in various
> > > > ways. On the other hand you can work around most of those assumptions
> > > > by marking the memory as uncached to the CPU, which may tank
> > > > performance, but will work from a correctness PoV.
> > > Yeah, and exactly that's what I meant with "DMA-buf is not the framework
> > > for this".
> > >
> > > See we do support using uncached/not snooped memory in DMA-buf, but only
> > > for the exporter side.
> > >
> > > For example the AMD and Intel GPUs have a per buffer flag for this.
> > >
> > > The importer on the other hand needs to be able to handle whatever the
> > > exporter provides.
> > >
> > I fail to construct a case where you want the Vulkan/GL "no domain
> > transition" coherent semantic without the allocator knowing about this.
> > If you need this and the system is non-snooping, surely the allocator
> > will choose uncached memory.
>
> No it won't. The allocator in the exporter is independent of the importer.
>
> That is an important and intentional design decision, cause otherwise
> you wouldn't have exporters/importers in the first place and rather a
> centralized allocation pool like what dma-heap implements.
>
> See the purpose of DMA-buf is to expose the buffers in the way the
> exporter wants to expose them. So when the exporting driver wants to
> allocate normal cached system memory then that is perfectly fine and
> completely fits into this design.
>
I'm specifically talking about the case where a snooping exporter would
allocate the GL coherent buffer and a non-snooping importer would need
to access that buffer with the same "no domain transition needed"
semantic. That is the thing which we can not make work at all and need
to fail the attach. If both the exporter and importer are non-snooping
you would probably get uncached memory, as long as the exporter knows
how the buffer will be used. Is there a real use-case where the
exporter doesn't know that the buffer will be used as GL/Vulkan
coherent and we can't do fallback on the importer side?
> Otherwise we would need to adjust all exporters to the importers, which
> is potentially not even possible.
>
> > I agree that you absolutely need to fail the usage when someone imports
> > a CPU cached buffer and then tries to use it as GL coherent on a non-
> > snooping system. That simply will not work.
>
> Exactly that, yes. That's what the attach callback is good for.
>
> See we already have tons of cases where buffers can't be shared because
> they wasn't initially allocated in a way the importer can deal with
> them. But that's perfectly ok and intentional.
>
> For example just take a configuration where a dedicated GPU clones the
> display with an integrated GPU. The dedicated GPU needs the image in
> local memory for scanout which is usually not accessible to the
> integrated GPU.
>
> So either attaching the DMA-buf or creating the KMS framebuffer config
> will fail and we are running into the fallback path which involves an
> extra copy. And that is perfectly fine and intentional since this
> configuration is not supported by the hardware.
>
> > > > > [SNIP]
> > And here is where our line of thought diverges: the DMA API allows
> > snooping and non-snooping devices to work together just fine, as it has
> > explicit domain transitions, which are no-ops if both devices are
> > snooping, but will do the necessary cache maintenance when one of them
> > is non-snooping but the memory is CPU cached.
> >
> > I don't see why DMA-buf should be any different here. Yes, you can not
> > support the "no domain transition" sharing when the memory is CPU
> > cached and one of the devices in non-snooping, but you can support 99%
> > of real use-cases like the non-snooped scanout or the UVC video import.
>
> Well I didn't say we couldn't do it that way. What I'm saying that it
> was intentionally decided against it.
>
> We could re-iterate that decision, but this would mean that all existing
> exporters would now need to provide additional functionality.
>
The way I see it we would only need this for exporters that potentially
export CPU cached memory, but need to interop with non-snooping
importers. I guess that can be done on a case-by-case basis and
wouldn't be a big flag day operation.
> > > The importer on the other hand needs to be able to deal with that. When
> > > this is not the case then the importer somehow needs to work around that.
> > >
> > Why? The importer maps the dma-buf via dma_buf_map_attachment, which in
> > most cases triggers a map via the DMA API on the exporter side. This
> > map via the DMA API will already do the right thing in terms of cache
> > management, it's just that we explicitly disable it via
> > DMA_ATTR_SKIP_CPU_SYNC in DRM because we know that the mapping will be
> > cached, which violates the DMA API explicit domain transition anyway.
>
> Why doesn't the importer simply calls dma_sync_sg_for_device() as
> necessary? See the importer does already know when it needs to access
> the buffer and as far as I can see has all the necessary variable to do
> the sync.
>
First, it wouldn't be symmetric with the dma_buf_map_attachment, where
the actual dma_map_sg also happens on the exporter side.
Second, that is again a very x86 with PCI centric view. The importer
flushing CPU caches by calling dma_sync_sg_for_device will only suffice
in a world where devices are IO coherent, i.e. they snoop the CPU cache
but don't participate fully in the system coherency due to never
keeping dirty cache lines for buffers in system memory.
On fully coherent systems like ARM with AMBA CHI or x86 with CXL.cache
all devices with access to the buffer can keep dirty cache lines in
their device private caches, so any access from a non-snooping agent
will require a cache clean on all those devices, which would basically
require the dma_buf_sync to be a broadcast operation to the
exporter and all attached fully coherent importers.
> The exporter on the other hand doesn't know that. So we would need to
> transport this information.
>
> Another fundamental problem is that the DMA API isn't designed for
> device to device transitions. In other words you have CPU->device and
> device->CPU transition, but not device->device. As far as I can see the
> DMA API should already have the necessary information if things like
> cache flushes are necessary or not.
>
Don't you contradict the second part here with the first? The DMA API
doesn't have the necessary information about needed cache cleaning on
the exporters or other attached importers side, when you only call the
dma_sync on the importer, which is exactly why I'm arguing for putting
it in the dma_buf ops, so we can do the necessary operations on other
attached clients to make a device->device transition work reliably.
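As a strawman, the kind of ops I'm thinking of (invented names, nothing like
this exists in dma_buf_ops today):

/* Strawman only: let an importer announce a non-snooped access window
 * through the dma-buf, so the exporter and every other attached, fully
 * coherent importer get a chance to clean their caches before the access
 * and invalidate afterwards. */
int (*begin_device_access)(struct dma_buf_attachment *attach,
                           enum dma_data_direction dir);
void (*end_device_access)(struct dma_buf_attachment *attach,
                          enum dma_data_direction dir);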
> > > Either by flushing the CPU caches or by rejecting using the imported
> > > buffer for this specific use case (like AMD and Intel drivers should be
> > > doing).
> > >
> > > If the Intel or ARM display drivers need non-cached memory and don't
> > > reject buffer where they don't know this then that's certainly a bug in
> > > those drivers.
> > It's not just display drivers, video codec accelerators and most GPUs
> > in this space are also non-snooping. In the ARM SoC world everyone just
> > assumes you are non-snooping, which is why things work for most cases
> > and only a handful like the UVC video import is broken.
>
> That is really interesting to know, but I still think that DMA-buf was
> absolutely not designed for this use case.
>
> From the point of view the primary reason for this was laptops with
> both dedicated and integrated GPUs, webcams etc...
>
> That you have a huge number of ARM specific devices which can interop
> with themselves, but not with devices outside of their domain is not
> something foreseen here.
>
Our recollection of history might differ here, but as Daniel remarked
kind of snarkily, most of the initial contributors to dma-buf were from
Linaro and TI, both of which were focused on getting device interop
working on ARM devices, which at the time were overwhelmingly non-
snooping. So I kind of doubt that dma-buf wasn't designed for this use-
case.
Regards,
Lucas
On Thu, Jun 23, 2022 at 01:32:18PM +0200, Christian König wrote:
> Am 23.06.22 um 13:27 schrieb Daniel Stone:
> > Hi Christian,
> >
> > On Thu, 23 Jun 2022 at 12:11, Christian König <[email protected]> wrote:
> > > > In fact DMA-buf sharing works fine on most of those SoCs because
> > > > everyone just assumes that all the accelerators don't snoop, so the
> > > > memory shared via DMA-buf is mostly CPU uncached. It only falls apart
> > > > for uses like the UVC cameras, where the shared buffer ends up being
> > > > CPU cached.
> > > Well then the existing DMA-buf framework is not what you want to use for
> > > this.
> > >
> > > > Non-coherent without explicit domain transfer points is just not going
> > > > to work. So why can't we solve the issue for DMA-buf in the same way as
> > > > the DMA API already solved it years ago: by adding the equivalent of
> > > > the dma_sync calls that do cache maintenance when necessary? On x86 (or
> > > > any system where things are mostly coherent) you could still no-op them
> > > > for the common case and only trigger cache cleaning if the importer
> > > > explicitly says that is going to do a non-snooping access.
> > > Because DMA-buf is a framework for buffer sharing between cache coherent
> > > devices which don't signal transitions.
> > >
> > > We intentionally didn't implemented any of the dma_sync_* functions
> > > because that would break the intended use case.
> > >
> > > You can of course use DMA-buf in an incoherent environment, but then you
> > > can't expect that this works all the time.
> > >
> > > This is documented behavior and so far we have bluntly rejected any of
> > > the complains that it doesn't work on most ARM SoCs and I don't really
> > > see a way to do this differently.
> > For some strange reason, 'let's do buffer sharing but make sure it
> > doesn't work on Arm' wasn't exactly the intention of the groups who
> > came together under the Linaro umbrella to create dmabuf.
> >
> > If it's really your belief that dmabuf requires universal snooping, I
> > recommend you send the patch to update the documentation, as well as
> > to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
>
> Well, to be honest I think that would indeed be necessary.
>
> What we have created are essentially two different worlds, one for PCI
> devices and one for the rest.
>
> This was indeed not the intention, but it's a fact that basically all
> DMA-buf based PCI drivers assume coherent access.
dma-buf does not require universal snooping.
It does de facto require that all device access is coherent with all other
device access, and consistent with the exporter's notion of how cpu
coherency is achieved. Note that coherent does not mean snooping: as long
as all devices do unsnooped access and the exporter either does wc/uc or
flushes caches, that's perfectly fine, and that is how all the arm soc
dma-buf sharing works.
We did originally have the wording in there that you have to map/unmap
around every device access, but that got dropped because no one was doing
that anyway.
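(For the record, that wording meant an importer pattern roughly like this;
sketch only, no error handling beyond the bare minimum, "attach" comes from
a prior dma_buf_attach():)

#include <linux/dma-buf.h>

/* Bracket every single device access with map/unmap, so the exporter gets
 * a chance to do cache maintenance or migrate the buffer in between. */
static void example_one_access(struct dma_buf_attachment *attach)
{
        struct sg_table *sgt;

        sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
        if (IS_ERR(sgt))
                return;

        /* ... program the device with the addresses in sgt, run the job ... */

        dma_buf_unmap_attachment(attach, sgt, DMA_BIDIRECTIONAL);
}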
Now where this totally breaks down is how we make this work, because the
idea was that dma_buf_attach validates this all. Where this means all the
hilarious reasons buffer sharing might not work:
- wrong coherency mode (cpu cached or not)
- not contiguous (we do check that, but only once we get the sg from
dma_buf_attachment_map, which strictly speaking is a bit too late but
most drivers do attach&map as one step so not that bad in practice)
- whether the dma api will throw in bounce buffers or not
- random shit like "oh this is in the wrong memory bank", which I think
never landed in upstream
p2p connectivity is about the only one that gets this right, yay. And the
only reason we can even get it right is because all the information is
exposed to drivers fully.
The issue is that the device dma api refuses to share this information
because it would "leak". Which sucks, because we have de facto built every
single cross-device use-case of dma-buf on the assumption that we can check
this (up to gl/vk specs), but oh well.
So in practice this gets sorted out by endless piles of hacks to make
individual use-cases work.
Oh and: This is definitely not limited to arm socs. x86 socs with intel
at least have exactly the same issues, and they get solved by adding
various shitty hacks to the involved drivers (like i915+amdgpu). Luckily
the intel camera driver isn't in upstream yet, since that would break a
bunch of the hacks, since suddenly there will now be 2 cpu cache
incoherent devices in an x86 system.
Ideally someone fixes this, but I'm not hopeful.
I recommend pouring more drinks.
What is definitely not correct is claiming that dma-buf wasn't meant for
this. We discussed cache coherency issues endlessly in Budapest 12 or so
years ago, I was there. It's just that the reality of the current
implementation is falling short, and every time someone tries to fix it we
get shouted down by dma api maintainers for looking behind their curtain.
tldr; You have to magically know to not use cpu cached allocators on these
machines.
Aside: This is also why vgem allocates wc memory on x86. It's the least
common denominator that works. arm unfortunately doesn't allow you to
allocate wc memory, so there, stuff is simply somewhat broken.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
Hi,
Le jeudi 23 juin 2022 à 10:58 +0200, Lucas Stach a écrit :
> > > In the DMA API keeping things mapped is also a valid use-case, but then
> > > you need to do explicit domain transfers via the dma_sync_* family,
> > > which DMA-buf has not inherited. Again those sync are no-ops on cache
> > > coherent architectures, but do any necessary cache maintenance on non
> > > coherent arches.
> >
> > Correct, yes. Coherency is mandatory for DMA-buf, you can't use
> > dma_sync_* on it when you are the importer.
> >
> > The exporter could of course make use of that because he is the owner of
> > the buffer.
>
> In the example given here with UVC video, you don't know that the
> buffer will be exported and needs to be coherent without
> synchronization points, due to the mapping cache at the DRM side. So
> V4L2 naturally allocates the buffers from CPU cached memory. If the
> expectation is that those buffers are device coherent without relying
> on the map/unmap_attachment calls, then V4L2 needs to always
> synchronize caches on DQBUF when the buffer is allocated from CPU
> cached memory and a single DMA-buf attachment exists. And while writing
> this I realize that this is probably exactly what V4L2 should do...
I'm not sure we are making any progress here. Doing so will just regress
performance of coherent devices used to render UVC video feeds. In fact, they
are all coherent except the display controller (on Intel). What my colleague
suggested I try (with the expectation that some adaptation will be needed,
perhaps new signalling flags) is to read the dma_coherency_mask values on the
devices that call attach() and adapt the V4L2 exporter accordingly.
It's likely wrong as-is, not intended to be used for that, but the value is
that it tries to fix the problem, unlike what I'm reading here.
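Very roughly the kind of per-attachment check I understood him to mean
(illustrative only: dev_is_dma_coherent() is a DMA-core internal and not
really meant for drivers, and the function name here is made up):

#include <linux/dma-buf.h>
#include <linux/dma-map-ops.h>

/* Hypothetical V4L2/vb2 exporter attach hook: decide per importer whether
 * uncached allocation or a flush before every DQBUF is needed. */
static int vb2_example_attach(struct dma_buf *dmabuf,
                              struct dma_buf_attachment *attach)
{
        if (!dev_is_dma_coherent(attach->dev)) {
                /* importer won't snoop: mark the queue so buffers get
                 * allocated uncached, or flush before every DQBUF */
        }

        return 0;
}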
Nicolas
Le jeudi 23 juin 2022 à 11:33 +0200, Lucas Stach a écrit :
> >
> > See for example on AMD/Intel hardware most of the engines can perfectly
> > deal with cache coherent memory accesses. Only the display engines can't.
> >
> > So on import time we can't even say if the access can be coherent and
> > snoop the CPU cache or not because we don't know how the imported
> > DMA-buf will be used later on.
> >
> So for those mixed use cases, wouldn't it help to have something
> similar to the dma_sync in the DMA-buf API, so your scanout usage can
> tell the exporter that it's going to do non-snoop access and any dirty
> cache lines must be cleaned? Signaling this to the exporter would allow
> to skip the cache maintenance if the buffer is in CPU uncached memory,
> which again is a default case for the ARM SoC world.
Telling the exporter for every scan is unneeded overhead. If that information is
made available "properly", then tracking it in attach/detach is sufficient and
lightweight.
Nicolas
Am Montag, dem 27.06.2022 um 09:54 -0400 schrieb Nicolas Dufresne:
> Le jeudi 23 juin 2022 à 11:33 +0200, Lucas Stach a écrit :
> > >
> > > See for example on AMD/Intel hardware most of the engines can perfectly
> > > deal with cache coherent memory accesses. Only the display engines can't.
> > >
> > > So on import time we can't even say if the access can be coherent and
> > > snoop the CPU cache or not because we don't know how the imported
> > > DMA-buf will be used later on.
> > >
> > So for those mixed use cases, wouldn't it help to have something
> > similar to the dma_sync in the DMA-buf API, so your scanout usage can
> > tell the exporter that it's going to do non-snoop access and any dirty
> > cache lines must be cleaned? Signaling this to the exporter would allow
> > to skip the cache maintenance if the buffer is in CPU uncached memory,
> > which again is a default case for the ARM SoC world.
>
> Telling the exporter for every scan is unneeded overhead. If that information is
> made available "properly", then tracking it in attach/detach is sufficient and
> lightweight.
That isn't sufficient. The AMD GPU is a single device, but internally
has different engines that have different capabilities with regard to
snooping the caches. So you will likely end up with needing the cache
clean if the V4L2 buffer is going directly to scanout, which doesn't
snoop, but if the usage changes to sampling you don't need any cache
flushes.
Also I don't see a big overhead when comparing a kernel-internal call
that tells the exporter that the importer is going to access the buffer
without snooping and thus needs a cache clean once every frame, versus
the need to always clean the cache before DQBUF when a potentially non-
snooping importer is attached.
Regards,
Lucas
Le lundi 27 juin 2022 à 16:06 +0200, Lucas Stach a écrit :
> Am Montag, dem 27.06.2022 um 09:54 -0400 schrieb Nicolas Dufresne:
> > Le jeudi 23 juin 2022 à 11:33 +0200, Lucas Stach a écrit :
> > > >
> > > > See for example on AMD/Intel hardware most of the engines can perfectly
> > > > deal with cache coherent memory accesses. Only the display engines can't.
> > > >
> > > > So on import time we can't even say if the access can be coherent and
> > > > snoop the CPU cache or not because we don't know how the imported
> > > > DMA-buf will be used later on.
> > > >
> > > So for those mixed use cases, wouldn't it help to have something
> > > similar to the dma_sync in the DMA-buf API, so your scanout usage can
> > > tell the exporter that it's going to do non-snoop access and any dirty
> > > cache lines must be cleaned? Signaling this to the exporter would allow
> > > to skip the cache maintenance if the buffer is in CPU uncached memory,
> > > which again is a default case for the ARM SoC world.
> >
> > Telling the exporter for every scan is unneeded overhead. If that information is
> > made available "properly", then tracking it in attach/detach is sufficient and
> > lightweight.
>
> That isn't sufficient. The AMD GPU is a single device, but internally
> has different engines that have different capabilities with regard to
> snooping the caches. So you will likely end up with needing the cache
> clean if the V4L2 buffer is going directly to scanout, which doesn't
> snoop, but if the usage changes to sampling you don't need any cache
> flushes.
>
> Also I don't see a big overhead when comparing a kernel internal call
> that tells the exporter that the importer is going to access the buffer
> without snooping and thus needs the cache clean once every frame and
> the need to always clean the cache before DQBUF when a potentially non-
> snooping importer is attached.
Ack, thanks for the information.
>
> Regards,
> Lucas
>
Hi Daniel,
Am 25.06.22 um 00:02 schrieb Daniel Vetter:
> On Thu, Jun 23, 2022 at 01:32:18PM +0200, Christian König wrote:
>> Am 23.06.22 um 13:27 schrieb Daniel Stone:
>>> [SNIP]
>>> If it's really your belief that dmabuf requires universal snooping, I
>>> recommend you send the patch to update the documentation, as well as
>>> to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
>> Well, to be honest I think that would indeed be necessary.
>>
>> What we have created are essentially two different worlds, one for PCI
>> devices and one for the rest.
>>
>> This was indeed not the intention, but it's a fact that basically all
>> DMA-buf based PCI drivers assume coherent access.
> dma-buf does not require universal snooping.
>
> It does defacto require that all device access is coherent with all other
> device access, and consistent with the exporters notion of how cpu
> coherency is achieved. Not that coherent does not mean snooping, as long
> as all devices do unsnooped access and the exporter either does wc/uc or
> flushes caches that's perfectly fine, and how all the arm soc dma-buf
> sharing works.
We should probably start documenting that better.
> We did originally have the wording in there that you have to map/unamp
> around every device access, but that got dropped because no one was doing
> that anyway.
>
> Now where this totally breaks down is how we make this work, because the
> idea was that dma_buf_attach validates this all. Where this means all the
> hilarious reasons buffer sharing might not work:
> - wrong coherency mode (cpu cached or not)
> - not contiguous (we do check that, but only once we get the sg from
> dma_buf_attachment_map, which strictly speaking is a bit too late but
> most drivers do attach&map as one step so not that bad in practice)
> - whether the dma api will throw in bounce buffers or not
> - random shit like "oh this is in the wrong memory bank", which I think
> never landed in upstream
>
> p2p connectivity is about the only one that gets this right, yay. And the
> only reason we can even get it right is because all the information is
> exposed to drivers fully.
Yeah, that's why I designed P2P that way :)
I also don't think it's that bad, at least for radeon, nouveau and
amdgpu all the migration restrictions are actually handled correctly.
In other words when a DMA-buf is about to be used by another device we
use TTM to move the buffer around so that it can actually be accessed by
that device.
What I hadn't foreseen here is that we need to deal with different
caching behaviors between exporter and importer.
> The issue is that the device dma api refuses to share this information
> because it would "leak". Which sucks, because we have defacto build every
> single cross-device use-case of dma-buf on the assumption we can check
> this (up to gl/vk specs), but oh well.
>
> So in practice this gets sorted out by endless piles of hacks to make
> individual use-cases work.
>
> Oh and: This is definitely not limited to arm socs. x86 socs with intel
> at least have exactly all the same issues, and they get solved by adding
> various shitty hacks to the involved drivers (like i915+amdgpu). Luckily
> the intel camera driver isn't in upstream yet, since that would break a
> bunch of the hacks since suddently there will be now 2 cpu cache
> incoherent devices in an x86 system.
>
> Ideally someone fixes this, but I'm not hopeful.
>
> I recommend pouring more drinks.
>
> What is definitely not correct is claiming that dma-buf wasn't meant for
> this. We discussed cache coherency issues endless in budapest 12 or so
> years ago, I was there. It's just that the reality of the current
> implementation is falling short, and every time someone tries to fix it we
> get shouted down by dma api maintainers for looking behind their current.
Well that explains this, I've joined the party a year later and haven't
witnessed all of this.
> tldr; You have to magically know to not use cpu cached allocators on these
> machines.
Or reject the attachment. As far as I can see that is still the cleanest
option.
Regards,
Christian.
>
> Aside: This is also why vgem alloates wc memory on x86. It's the least
> common denominator that works. arm unfortunately doesn't allow you to
> allocate wc memory, so there stuff is simply somewhat broken.
> -Daniel
On Mon, Jul 04, 2022 at 03:48:03PM +0200, Christian König wrote:
> Hi Daniel,
>
> Am 25.06.22 um 00:02 schrieb Daniel Vetter:
> On Thu, Jun 23, 2022 at 01:32:18PM +0200, Christian König wrote:
> > > Am 23.06.22 um 13:27 schrieb Daniel Stone:
> > > > [SNIP]
> > > > If it's really your belief that dmabuf requires universal snooping, I
> > > > recommend you send the patch to update the documentation, as well as
> > > > to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
> > > Well, to be honest I think that would indeed be necessary.
> > >
> > > What we have created are essentially two different worlds, one for PCI
> > > devices and one for the rest.
> > >
> > > This was indeed not the intention, but it's a fact that basically all
> > > DMA-buf based PCI drivers assume coherent access.
> > dma-buf does not require universal snooping.
> >
> > It does defacto require that all device access is coherent with all other
> > device access, and consistent with the exporters notion of how cpu
> > coherency is achieved. Not that coherent does not mean snooping, as long
> > as all devices do unsnooped access and the exporter either does wc/uc or
> > flushes caches that's perfectly fine, and how all the arm soc dma-buf
> > sharing works.
>
> We should probably start documenting that better.
Agreed :-)
Are you volunteering to type up something that reflects the current sorry
state of affairs? I'm not sure I'm the best since I guess I've been too
badly involved in this ...
> > We did originally have the wording in there that you have to map/unamp
> > around every device access, but that got dropped because no one was doing
> > that anyway.
> >
> > Now where this totally breaks down is how we make this work, because the
> > idea was that dma_buf_attach validates this all. Where this means all the
> > hilarious reasons buffer sharing might not work:
> > - wrong coherency mode (cpu cached or not)
> > - not contiguous (we do check that, but only once we get the sg from
> > dma_buf_attachment_map, which strictly speaking is a bit too late but
> > most drivers do attach&map as one step so not that bad in practice)
> > - whether the dma api will throw in bounce buffers or not
> > - random shit like "oh this is in the wrong memory bank", which I think
> > never landed in upstream
> >
> > p2p connectivity is about the only one that gets this right, yay. And the
> > only reason we can even get it right is because all the information is
> > exposed to drivers fully.
>
> Yeah, that's why I designed P2P that way :)
>
> I also don't think it's that bad, at least for radeon, nouveau and amdgpu
> all the migration restrictions are actually handled correctly.
>
> In other words when a DMA-buf is about to be used by another device we use
> TTM to move the buffer around so that it can actually be accessed by that
> device.
>
> What I haven't foreseen in here is that we need to deal with different
> caching behaviors between exporter and importer.
Yeah we should have done caching explicitly and full opt-in like with p2p.
The trouble is that this would have been a multi-year fight with dma api
folks, who insist it must be all transparent. So the politically clever
thing was to just ignore the problem and land dma-buf, but it comes back
to bite us now :-/
> > The issue is that the device dma api refuses to share this information
> > because it would "leak". Which sucks, because we have defacto build every
> > single cross-device use-case of dma-buf on the assumption we can check
> > this (up to gl/vk specs), but oh well.
> >
> > So in practice this gets sorted out by endless piles of hacks to make
> > individual use-cases work.
> >
> > Oh and: This is definitely not limited to arm socs. x86 socs with intel
> > at least have exactly all the same issues, and they get solved by adding
> > various shitty hacks to the involved drivers (like i915+amdgpu). Luckily
> > the intel camera driver isn't in upstream yet, since that would break a
> > bunch of the hacks since suddently there will be now 2 cpu cache
> > incoherent devices in an x86 system.
> >
> > Ideally someone fixes this, but I'm not hopeful.
> >
> > I recommend pouring more drinks.
> >
> > What is definitely not correct is claiming that dma-buf wasn't meant for
> > this. We discussed cache coherency issues endless in budapest 12 or so
> > years ago, I was there. It's just that the reality of the current
> > implementation is falling short, and every time someone tries to fix it we
> > get shouted down by dma api maintainers for looking behind their current.
>
> Well that explains this, I've joined the party a year later and haven't
> witnessed all of this.
Yay, cleared up another confusion!
> > tldr; You have to magically know to not use cpu cached allocators on these
> > machines.
>
> Or reject the attachment. As far as I can see that is still the cleanest
> option.
Yeah rejecting is always an ok thing if it just doesn't work.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
On Mon, Feb 15, 2021 at 12:58 AM Christian König
<[email protected]> wrote:
>
> Hi guys,
>
> we are currently working an Freesync and direct scan out from system
> memory on AMD APUs in A+A laptops.
>
> On problem we stumbled over is that our display hardware needs to scan
> out from uncached system memory and we currently don't have a way to
> communicate that through DMA-buf.
>
> For our specific use case at hand we are going to implement something
> driver specific, but the question is should we have something more
> generic for this?
I'm a bit late to this party (and sorry, I didn't read the entire
thread), but it occurs to me that dmabuf mmap_info[1] would also get
you what you need, i.e. the display driver importing the dma-buf could
check whether the exporter is mapping cached or not, and reject the
import if needed?
[1] https://patchwork.freedesktop.org/patch/496069/?series=106847&rev=2
> After all the system memory access pattern is a PCIe extension and as
> such something generic.
>
> Regards,
> Christian.
Am 09.08.22 um 16:46 schrieb Daniel Vetter:
> On Mon, Jul 04, 2022 at 03:48:03PM +0200, Christian König wrote:
>> Hi Daniel,
>>
>> Am 25.06.22 um 00:02 schrieb Daniel Vetter:
>>> On Thu, Jun 23, 2022 at 01:32:18PM +0200, Christian König wrote:
>>>> Am 23.06.22 um 13:27 schrieb Daniel Stone:
>>>>> [SNIP]
>>>>> If it's really your belief that dmabuf requires universal snooping, I
>>>>> recommend you send the patch to update the documentation, as well as
>>>>> to remove DRIVER_PRIME from, realistically, most non-PCIE drivers.
>>>> Well, to be honest I think that would indeed be necessary.
>>>>
>>>> What we have created are essentially two different worlds, one for PCI
>>>> devices and one for the rest.
>>>>
>>>> This was indeed not the intention, but it's a fact that basically all
>>>> DMA-buf based PCI drivers assume coherent access.
>>> dma-buf does not require universal snooping.
>>>
>>> It does defacto require that all device access is coherent with all other
>>> device access, and consistent with the exporters notion of how cpu
>>> coherency is achieved. Not that coherent does not mean snooping, as long
>>> as all devices do unsnooped access and the exporter either does wc/uc or
>>> flushes caches that's perfectly fine, and how all the arm soc dma-buf
>>> sharing works.
>> We should probably start documenting that better.
> Agreed :-)
>
> Are you volunteering to type up something that reflects the current sorry
> state of affairs? I'm not sure I'm the best since I guess I've been too
> badly involved in this ...
Yeah, already working on this. But you know, normal human being with two
hands and one head.
With all the workload I'm pretty sure people would have cloned me by now
if tech were just a bit more advanced.
Christian.
>
>>> We did originally have the wording in there that you have to map/unamp
>>> around every device access, but that got dropped because no one was doing
>>> that anyway.
>>>
>>> Now where this totally breaks down is how we make this work, because the
>>> idea was that dma_buf_attach validates this all. Where this means all the
>>> hilarious reasons buffer sharing might not work:
>>> - wrong coherency mode (cpu cached or not)
>>> - not contiguous (we do check that, but only once we get the sg from
>>> dma_buf_attachment_map, which strictly speaking is a bit too late but
>>> most drivers do attach&map as one step so not that bad in practice)
>>> - whether the dma api will throw in bounce buffers or not
>>> - random shit like "oh this is in the wrong memory bank", which I think
>>> never landed in upstream
>>>
>>> p2p connectivity is about the only one that gets this right, yay. And the
>>> only reason we can even get it right is because all the information is
>>> exposed to drivers fully.
>> Yeah, that's why I designed P2P that way :)
>>
>> I also don't think it's that bad, at least for radeon, nouveau and amdgpu
>> all the migration restrictions are actually handled correctly.
>>
>> In other words when a DMA-buf is about to be used by another device we use
>> TTM to move the buffer around so that it can actually be accessed by that
>> device.
>>
>> What I haven't foreseen in here is that we need to deal with different
>> caching behaviors between exporter and importer.
> Yeah we should have done caching explicitly and full opt-in like with p2p.
> The trouble is that this would have been a multi-year fight with dma api
> folks, who insist it must be all transparent. So the politically clever
> thing was to just ignore the problem and land dma-buf, but it comes back
> to bite us now :-/
>
>>> The issue is that the device dma api refuses to share this information
>>> because it would "leak". Which sucks, because we have defacto build every
>>> single cross-device use-case of dma-buf on the assumption we can check
>>> this (up to gl/vk specs), but oh well.
>>>
>>> So in practice this gets sorted out by endless piles of hacks to make
>>> individual use-cases work.
>>>
>>> Oh and: This is definitely not limited to arm socs. x86 socs with intel
>>> at least have exactly all the same issues, and they get solved by adding
>>> various shitty hacks to the involved drivers (like i915+amdgpu). Luckily
>>> the intel camera driver isn't in upstream yet, since that would break a
>>> bunch of the hacks since suddently there will be now 2 cpu cache
>>> incoherent devices in an x86 system.
>>>
>>> Ideally someone fixes this, but I'm not hopeful.
>>>
>>> I recommend pouring more drinks.
>>>
>>> What is definitely not correct is claiming that dma-buf wasn't meant for
>>> this. We discussed cache coherency issues endless in budapest 12 or so
>>> years ago, I was there. It's just that the reality of the current
>>> implementation is falling short, and every time someone tries to fix it we
>>> get shouted down by dma api maintainers for looking behind their current.
>> Well that explains this, I've joined the party a year later and haven't
>> witnessed all of this.
> Yay, cleared up another confusion!
>
>>> tldr; You have to magically know to not use cpu cached allocators on these
>>> machines.
>> Or reject the attachment. As far as I can see that is still the cleanest
>> option.
> Yeah rejecting is always an ok thing if it just doesn't work.
> -Daniel