2021-08-28 15:40:08

by Sven Peter

[permalink] [raw]
Subject: [PATCH v2 3/8] iommu/dma: Disable get_sgtable for granule > PAGE_SIZE

Pretend that iommu_dma_get_sgtable is not implemented when
granule > PAGE_SIZE since I can neither test this function right now
nor do I fully understand how it is used.

Signed-off-by: Sven Peter <[email protected]>
---
drivers/iommu/dma-iommu.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d6e273ec3de6..64fbd9236820 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -1315,9 +1315,15 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
void *cpu_addr, dma_addr_t dma_addr, size_t size,
unsigned long attrs)
{
+ struct iommu_domain *domain = iommu_get_dma_domain(dev);
+ struct iommu_dma_cookie *cookie = domain->iova_cookie;
+ struct iova_domain *iovad = &cookie->iovad;
struct page *page;
int ret;

+ if (iovad->granule > PAGE_SIZE)
+ return -ENXIO;
+
if (IS_ENABLED(CONFIG_DMA_REMAP) && is_vmalloc_addr(cpu_addr)) {
struct page **pages = dma_common_find_pages(cpu_addr);

--
2.25.1


2021-08-31 23:05:58

by Alyssa Rosenzweig

[permalink] [raw]
Subject: Re: [PATCH v2 3/8] iommu/dma: Disable get_sgtable for granule > PAGE_SIZE

I use this function for cross-device sharing on the M1 display driver.
Arguably this is unsafe but it works on 16k kernels and if you want to
test the function on 4k, you know where my code is.

On Sat, Aug 28, 2021 at 05:36:37PM +0200, Sven Peter wrote:
> Pretend that iommu_dma_get_sgtable is not implemented when
> granule > PAGE_SIZE since I can neither test this function right now
> nor do I fully understand how it is used.
>
> Signed-off-by: Sven Peter <[email protected]>
> ---
> drivers/iommu/dma-iommu.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index d6e273ec3de6..64fbd9236820 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -1315,9 +1315,15 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
> void *cpu_addr, dma_addr_t dma_addr, size_t size,
> unsigned long attrs)
> {
> + struct iommu_domain *domain = iommu_get_dma_domain(dev);
> + struct iommu_dma_cookie *cookie = domain->iova_cookie;
> + struct iova_domain *iovad = &cookie->iovad;
> struct page *page;
> int ret;
>
> + if (iovad->granule > PAGE_SIZE)
> + return -ENXIO;
> +
> if (IS_ENABLED(CONFIG_DMA_REMAP) && is_vmalloc_addr(cpu_addr)) {
> struct page **pages = dma_common_find_pages(cpu_addr);
>
> --
> 2.25.1
>

2021-09-01 20:18:39

by Sven Peter

[permalink] [raw]
Subject: Re: [PATCH v2 3/8] iommu/dma: Disable get_sgtable for granule > PAGE_SIZE



On Tue, Aug 31, 2021, at 23:30, Alyssa Rosenzweig wrote:
> I use this function for cross-device sharing on the M1 display driver.
> Arguably this is unsafe but it works on 16k kernels and if you want to
> test the function on 4k, you know where my code is.
>

My biggest issue is that I do not understand how this function is supposed
to be used correctly. It would work fine as-is if it only ever gets passed buffers
allocated by the coherent API but there's no way to check or guarantee that.
There may also be callers making assumptions that no longer hold when
iovad->granule > PAGE_SIZE.


Regarding your case: I'm not convinced the function is meant to be used there.
If I understand it correctly, your code first allocates memory with dma_alloc_coherent
(which possibly creates a sgt internally and then maps it with iommu_map_sg),
then coerces that back into a sgt with dma_get_sgtable, and then maps that sgt to
another iommu domain with dma_map_sg while assuming that the result will be contiguous
in IOVA space. It'll work out because dma_alloc_coherent is the very thing
meant to allocate pages that can be mapped into kernel and device VA space
as a single contiguous block and because both of your IOMMUs are different
instances of the same HW block. Anything allocated by dma_alloc_coherent for the
first IOMMU will have the right shape that will allow it to be mapped as
a single contiguous block for the second IOMMU.

What could be done in your case is to instead use the IOMMU API,
allocate the pages yourself (while ensuring the sgt you create is made up
of blocks with size and physaddr aligned to max(domain_a->granule, domain_b->granule))
and then just use iommu_map_sg for both domains which actually comes with the
guarantee that the result will be a single contiguous block in IOVA space and
doesn't require the sgt roundtrip.



Sven


> On Sat, Aug 28, 2021 at 05:36:37PM +0200, Sven Peter wrote:
> > Pretend that iommu_dma_get_sgtable is not implemented when
> > granule > PAGE_SIZE since I can neither test this function right now
> > nor do I fully understand how it is used.
> >
> > Signed-off-by: Sven Peter <[email protected]>
> > ---
> > drivers/iommu/dma-iommu.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> > index d6e273ec3de6..64fbd9236820 100644
> > --- a/drivers/iommu/dma-iommu.c
> > +++ b/drivers/iommu/dma-iommu.c
> > @@ -1315,9 +1315,15 @@ static int iommu_dma_get_sgtable(struct device *dev, struct sg_table *sgt,
> > void *cpu_addr, dma_addr_t dma_addr, size_t size,
> > unsigned long attrs)
> > {
> > + struct iommu_domain *domain = iommu_get_dma_domain(dev);
> > + struct iommu_dma_cookie *cookie = domain->iova_cookie;
> > + struct iova_domain *iovad = &cookie->iovad;
> > struct page *page;
> > int ret;
> >
> > + if (iovad->granule > PAGE_SIZE)
> > + return -ENXIO;
> > +
> > if (IS_ENABLED(CONFIG_DMA_REMAP) && is_vmalloc_addr(cpu_addr)) {
> > struct page **pages = dma_common_find_pages(cpu_addr);
> >
> > --
> > 2.25.1
> >
>


--
Sven Peter

2021-09-02 01:37:09

by Alyssa Rosenzweig

[permalink] [raw]
Subject: Re: [PATCH v2 3/8] iommu/dma: Disable get_sgtable for granule > PAGE_SIZE

> My biggest issue is that I do not understand how this function is supposed
> to be used correctly. It would work fine as-is if it only ever gets passed buffers
> allocated by the coherent API but there's no way to check or guarantee that.
> There may also be callers making assumptions that no longer hold when
> iovad->granule > PAGE_SIZE.
>
> Regarding your case: I'm not convinced the function is meant to be used there.
> If I understand it correctly, your code first allocates memory with dma_alloc_coherent
> (which possibly creates a sgt internally and then maps it with iommu_map_sg),
> then coerces that back into a sgt with dma_get_sgtable, and then maps that sgt to
> another iommu domain with dma_map_sg while assuming that the result will be contiguous
> in IOVA space. It'll work out because dma_alloc_coherent is the very thing
> meant to allocate pages that can be mapped into kernel and device VA space
> as a single contiguous block and because both of your IOMMUs are different
> instances of the same HW block. Anything allocated by dma_alloc_coherent for the
> first IOMMU will have the right shape that will allow it to be mapped as
> a single contiguous block for the second IOMMU.
>
> What could be done in your case is to instead use the IOMMU API,
> allocate the pages yourself (while ensuring the sgt you create is made up
> of blocks with size and physaddr aligned to max(domain_a->granule, domain_b->granule))
> and then just use iommu_map_sg for both domains which actually comes with the
> guarantee that the result will be a single contiguous block in IOVA space and
> doesn't require the sgt roundtrip.

In principle I agree. I am getting the sense this function can't be used
correctly in general, and yet is the function that's meant to be used.
If my interpretation of prior LKML discussion holds, the problems are
far deeper than my code or indeed page size problems...

If the right way to handle this is with the IOMMU and IOVA APIs, I really wish
that dance were wrapped up in a safe helper function instead of open
coding it in every driver that does cross device sharing.

We might even call that helper... hmm... dma_map_sg.... *ducks*

For context for people-other-than-Sven, the code in question from my
tree appears below the break.

---------------------------------------------------------------------------------

/*
* Allocate an IOVA contiguous buffer mapped to the DCP. The buffer need not be
* physically contiguous, however we should save the sgtable in case the
* buffer needs to be later mapped for PIODMA.
*/
static bool dcpep_cb_allocate_buffer(struct apple_dcp *dcp, void *out, void *in)
{
struct dcp_allocate_buffer_resp *resp = out;
struct dcp_allocate_buffer_req *req = in;
void *buf;

resp->dva_size = ALIGN(req->size, 4096);
resp->mem_desc_id = ++dcp->nr_mappings;

if (resp->mem_desc_id >= ARRAY_SIZE(dcp->mappings)) {
dev_warn(dcp->dev, "DCP overflowed mapping table, ignoring");
return true;
}

buf = dma_alloc_coherent(dcp->dev, resp->dva_size, &resp->dva,
GFP_KERNEL);

dma_get_sgtable(dcp->dev, &dcp->mappings[resp->mem_desc_id], buf,
resp->dva, resp->dva_size);

WARN_ON(resp->mem_desc_id == 0);
return true;
}

/*
* Callback to map a buffer allocated with allocate_buf for PIODMA usage.
* PIODMA is separate from the main DCP and uses its own IOVA space on a dedicated
* stream of the display DART, rather than the expected DCP DART.
*
* XXX: This relies on dma_get_sgtable in concert with dma_map_sgtable, which
* is a "fundamentally unsafe" operation according to the docs. And yet
* everyone does it...
*/
static bool dcpep_cb_map_piodma(struct apple_dcp *dcp, void *out, void *in)
{
struct dcp_map_buf_resp *resp = out;
struct dcp_map_buf_req *req = in;
struct sg_table *map;

if (req->buffer >= ARRAY_SIZE(dcp->mappings))
goto reject;

map = &dcp->mappings[req->buffer];

if (!map->sgl)
goto reject;

/* XNU leaks a kernel VA here, breaking kASLR. Don't do that. */
resp->vaddr = 0;

/* Use PIODMA device instead of DCP to map against the right IOMMU. */
resp->ret = dma_map_sgtable(dcp->piodma, map, DMA_BIDIRECTIONAL, 0);

if (resp->ret)
dev_warn(dcp->dev, "failed to map for piodma %d\n", resp->ret);
else
resp->dva = sg_dma_address(map->sgl);

resp->ret = 0;
return true;

reject:
dev_warn(dcp->dev, "denying map of invalid buffer %llx for piodma\n",
req->buffer);
resp->ret = EINVAL;
return true;
}

2021-09-02 18:21:26

by Sven Peter

[permalink] [raw]
Subject: Re: [PATCH v2 3/8] iommu/dma: Disable get_sgtable for granule > PAGE_SIZE



On Wed, Sep 1, 2021, at 23:10, Alyssa Rosenzweig wrote:
> > My biggest issue is that I do not understand how this function is supposed
> > to be used correctly. It would work fine as-is if it only ever gets passed buffers
> > allocated by the coherent API but there's no way to check or guarantee that.
> > There may also be callers making assumptions that no longer hold when
> > iovad->granule > PAGE_SIZE.
> >
> > Regarding your case: I'm not convinced the function is meant to be used there.
> > If I understand it correctly, your code first allocates memory with dma_alloc_coherent
> > (which possibly creates a sgt internally and then maps it with iommu_map_sg),
> > then coerces that back into a sgt with dma_get_sgtable, and then maps that sgt to
> > another iommu domain with dma_map_sg while assuming that the result will be contiguous
> > in IOVA space. It'll work out because dma_alloc_coherent is the very thing
> > meant to allocate pages that can be mapped into kernel and device VA space
> > as a single contiguous block and because both of your IOMMUs are different
> > instances of the same HW block. Anything allocated by dma_alloc_coherent for the
> > first IOMMU will have the right shape that will allow it to be mapped as
> > a single contiguous block for the second IOMMU.
> >
> > What could be done in your case is to instead use the IOMMU API,
> > allocate the pages yourself (while ensuring the sgt you create is made up
> > of blocks with size and physaddr aligned to max(domain_a->granule, domain_b->granule))
> > and then just use iommu_map_sg for both domains which actually comes with the
> > guarantee that the result will be a single contiguous block in IOVA space and
> > doesn't require the sgt roundtrip.
>
> In principle I agree. I am getting the sense this function can't be used
> correctly in general, and yet is the function that's meant to be used.
> If my interpretation of prior LKML discussion holds, the problems are
> far deeper than my code or indeed page size problems...

Right, which makes reasoning about this function and its behavior if the
IOMMU page size is unexpected very hard for me. I'm not opposed to just
keeping this function as-is when there's a mismatch between PAGE_SIZE and
the IOMMU page size (and it will probably work that way) but I'd like to
be sure that won't introduce unexpected behavior.

>
> If the right way to handle this is with the IOMMU and IOVA APIs, I really wish
> that dance were wrapped up in a safe helper function instead of open
> coding it in every driver that does cross device sharing.
>
> We might even call that helper... hmm... dma_map_sg.... *ducks*
>

There might be another way to do this correctly. I'm likely just a little
bit biased because I've spent the past weeks wrapping my head around the
IOMMU and DMA APIs and when all you have is a hammer everything looks like
a nail.

But dma_map_sg operates at the DMA API level and at that point the dma-ops
for two different devices could be vastly different.
In the worst case one of them could be behind an IOMMU that can easily map
non-contiguous pages while the other one is directly connected to the bus and
can't even access >4G pages without swiotlb support.
It's really only possible to guarantee that it will map N buffers to <= N
DMA-addressable buffers (possibly by using an IOMMU or swiotlb internally) at
that point.

On the IOMMU API level you have much more information available about the actual
hardware and can prepare the buffers in a way that makes both devices happy.
That's why iommu_map_sgtable combined with iovad->granule aligned sgt entries
can actually guarantee to map the entire list to a single contiguous IOVA block.
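
A rough sketch of that guarantee in use (an editorial illustration, not from
the thread: 'domain_a', 'domain_b', 'iova' and 'sgt' are hypothetical, IOVA
allocation is left out, and the return convention assumed is the ~5.14
iommu_map_sgtable, which returns the number of bytes mapped or 0 on failure):

```c
#include <linux/iommu.h>
#include <linux/scatterlist.h>

/*
 * Map the same granule-aligned sg_table into two IOMMU domains at the
 * same IOVA. Both mappings come out as one contiguous IOVA block as
 * long as every sgt entry is aligned to the larger of the two granules.
 */
static int map_into_both(struct iommu_domain *domain_a,
			 struct iommu_domain *domain_b,
			 unsigned long iova, struct sg_table *sgt)
{
	size_t mapped;

	mapped = iommu_map_sgtable(domain_a, iova, sgt,
				   IOMMU_READ | IOMMU_WRITE);
	if (!mapped)
		return -EINVAL;

	if (!iommu_map_sgtable(domain_b, iova, sgt,
			       IOMMU_READ | IOMMU_WRITE)) {
		iommu_unmap(domain_a, iova, mapped);
		return -EINVAL;
	}

	return 0;
}
```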


Sven

2021-09-02 19:46:02

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v2 3/8] iommu/dma: Disable get_sgtable for granule > PAGE_SIZE

On 2021-09-02 19:19, Sven Peter wrote:
>
>
> On Wed, Sep 1, 2021, at 23:10, Alyssa Rosenzweig wrote:
>>> My biggest issue is that I do not understand how this function is supposed
>>> to be used correctly. It would work fine as-is if it only ever gets passed buffers
>>> allocated by the coherent API but there's no way to check or guarantee that.
>>> There may also be callers making assumptions that no longer hold when
>>> iovad->granule > PAGE_SIZE.
>>>
>>> Regarding your case: I'm not convinced the function is meant to be used there.
>>> If I understand it correctly, your code first allocates memory with dma_alloc_coherent
>>> (which possibly creates a sgt internally and then maps it with iommu_map_sg),
>>> then coerces that back into a sgt with dma_get_sgtable, and then maps that sgt to
>>> another iommu domain with dma_map_sg while assuming that the result will be contiguous
>>> in IOVA space. It'll work out because dma_alloc_coherent is the very thing
>>> meant to allocate pages that can be mapped into kernel and device VA space
>>> as a single contiguous block and because both of your IOMMUs are different
>>> instances of the same HW block. Anything allocated by dma_alloc_coherent for the
>>> first IOMMU will have the right shape that will allow it to be mapped as
>>> a single contiguous block for the second IOMMU.
>>>
>>> What could be done in your case is to instead use the IOMMU API,
>>> allocate the pages yourself (while ensuring the sgt you create is made up
>>> of blocks with size and physaddr aligned to max(domain_a->granule, domain_b->granule))
>>> and then just use iommu_map_sg for both domains which actually comes with the
>>> guarantee that the result will be a single contiguous block in IOVA space and
>>> doesn't require the sgt roundtrip.
>>
>> In principle I agree. I am getting the sense this function can't be used
>> correctly in general, and yet is the function that's meant to be used.
>> If my interpretation of prior LKML discussion holds, the problems are
>> far deeper than my code or indeed page size problems...
>
> Right, which makes reasoning about this function and its behavior if the
> IOMMU page size is unexpected very hard for me. I'm not opposed to just
> keeping this function as-is when there's a mismatch between PAGE_SIZE and
> the IOMMU page size (and it will probably work that way) but I'd like to
> be sure that won't introduce unexpected behavior.
>
>>
>> If the right way to handle this is with the IOMMU and IOVA APIs, I really wish
>> that dance were wrapped up in a safe helper function instead of open
>> coding it in every driver that does cross device sharing.
>>
>> We might even call that helper... hmm... dma_map_sg.... *ducks*
>>
>
> There might be another way to do this correctly. I'm likely just a little
> bit biased because I've spent the past weeks wrapping my head around the
> IOMMU and DMA APIs and when all you have is a hammer everything looks like
> a nail.
>
> But dma_map_sg operates at the DMA API level and at that point the dma-ops
> for two different devices could be vastly different.
> In the worst case one of them could be behind an IOMMU that can easily map
> non-contiguous pages while the other one is directly connected to the bus and
> can't even access >4G pages without swiotlb support.
> It's really only possible to guarantee that it will map N buffers to <= N
> DMA-addressable buffers (possibly by using an IOMMU or swiotlb internally) at
> that point.
>
> On the IOMMU API level you have much more information available about the actual
> hardware and can prepare the buffers in a way that makes both devices happy.
> That's why iommu_map_sgtable combined with iovad->granule aligned sgt entries
> can actually guarantee to map the entire list to a single contiguous IOVA block.

Essentially there are two reasonable options, and doing pretend dma-buf
export/import between two devices effectively owned by the same driver
is neither of them. Handily, DRM happens to be exactly where all the
precedent is, too; unsurprisingly this is not a new concern.

One is to go full IOMMU API, like rockchip or tegra, attaching the
relevant devices to your own unmanaged domain(s) and mapping pages
exactly where you choose. You still make dma_map/dma_unmap calls for the
sake of cache maintenance and other housekeeping on the underlying
memory, but you ignore the provided DMA addresses in favour of your own
IOVAs when it comes to programming the devices.
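
For illustration (an editorial sketch, not from the thread; the device names
are hypothetical and error paths are abbreviated), the first option might
look like this, with iommu_map() then used at driver-chosen IOVAs while
dma_map_*() still handles cache maintenance:

```c
#include <linux/err.h>
#include <linux/iommu.h>

/*
 * Put both devices into one unmanaged domain that the driver fully
 * controls. DMA addresses from the DMA API are ignored when
 * programming the hardware; the driver's own IOVAs are used instead.
 */
static struct iommu_domain *attach_own_domain(struct device *dev_a,
					      struct device *dev_b)
{
	struct iommu_domain *dom;
	int ret;

	dom = iommu_domain_alloc(dev_a->bus);
	if (!dom)
		return ERR_PTR(-ENOMEM);

	ret = iommu_attach_device(dom, dev_a);
	if (ret)
		goto err_free;
	ret = iommu_attach_device(dom, dev_b);
	if (ret)
		goto err_detach;

	return dom;

err_detach:
	iommu_detach_device(dom, dev_a);
err_free:
	iommu_domain_free(dom);
	return ERR_PTR(ret);
}
```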

The lazier option if you can rely on all relevant devices having equal
DMA and IOMMU capabilities is to follow exynos, and herd the devices
into a common default domain. Instead of allocating your own domain, you
grab the current domain for one device (which will be its default
domain) and manually attach the other devices to that. Then you forget
all about IOMMUs but make sure to do all your regular DMA API calls
using that first device, and the DMA addresses which come back should be
magically valid for the other devices too. It was a bit of a cheeky hack
TBH, but I'd still much prefer more of that over any usage of
get_sgtable outside of actual dma-buf...
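
The second option could be sketched as follows (an editorial illustration
with hypothetical device names; whether attaching a device to another
device's default domain succeeds depends on the IOMMU driver and the group
topology):

```c
#include <linux/iommu.h>

/*
 * Herd dev_b into dev_a's (default) domain. Afterwards all regular
 * DMA API calls go through dev_a, and the DMA addresses that come
 * back are valid for dev_b too, since both translate through the
 * same domain.
 */
static int share_default_domain(struct device *dev_a, struct device *dev_b)
{
	struct iommu_domain *dom = iommu_get_domain_for_dev(dev_a);

	if (!dom)
		return -ENODEV;

	return iommu_attach_device(dom, dev_b);
}
```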

Note that where multiple IOMMU instances are involved, the latter
approach does depend on the IOMMU driver being able to support sharing a
single domain across them; I think that might sort-of-work for DART
already, but may need a little more attention.

Robin.

2021-09-03 15:48:20

by Robin Murphy

[permalink] [raw]
Subject: Re: [PATCH v2 3/8] iommu/dma: Disable get_sgtable for granule > PAGE_SIZE

On 2021-09-03 16:16, Sven Peter wrote:
>
>
> On Thu, Sep 2, 2021, at 21:42, Robin Murphy wrote:
>> On 2021-09-02 19:19, Sven Peter wrote:
>>>
>>>
>>> On Wed, Sep 1, 2021, at 23:10, Alyssa Rosenzweig wrote:
>>>>> My biggest issue is that I do not understand how this function is supposed
>>>>> to be used correctly. It would work fine as-is if it only ever gets passed buffers
>>>>> allocated by the coherent API but there's no way to check or guarantee that.
>>>>> There may also be callers making assumptions that no longer hold when
>>>>> iovad->granule > PAGE_SIZE.
>>>>>
>>>>> Regarding your case: I'm not convinced the function is meant to be used there.
>>>>> If I understand it correctly, your code first allocates memory with dma_alloc_coherent
>>>>> (which possibly creates a sgt internally and then maps it with iommu_map_sg),
>>>>> then coerces that back into a sgt with dma_get_sgtable, and then maps that sgt to
>>>>> another iommu domain with dma_map_sg while assuming that the result will be contiguous
>>>>> in IOVA space. It'll work out because dma_alloc_coherent is the very thing
>>>>> meant to allocate pages that can be mapped into kernel and device VA space
>>>>> as a single contiguous block and because both of your IOMMUs are different
>>>>> instances of the same HW block. Anything allocated by dma_alloc_coherent for the
>>>>> first IOMMU will have the right shape that will allow it to be mapped as
>>>>> a single contiguous block for the second IOMMU.
>>>>>
>>>>> What could be done in your case is to instead use the IOMMU API,
>>>>> allocate the pages yourself (while ensuring the sgt you create is made up
>>>>> of blocks with size and physaddr aligned to max(domain_a->granule, domain_b->granule))
>>>>> and then just use iommu_map_sg for both domains which actually comes with the
>>>>> guarantee that the result will be a single contiguous block in IOVA space and
>>>>> doesn't require the sgt roundtrip.
>>>>
>>>> In principle I agree. I am getting the sense this function can't be used
>>>> correctly in general, and yet is the function that's meant to be used.
>>>> If my interpretation of prior LKML discussion holds, the problems are
>>>> far deeper than my code or indeed page size problems...
>>>
>>> Right, which makes reasoning about this function and its behavior if the
>>> IOMMU page size is unexpected very hard for me. I'm not opposed to just
>>> keeping this function as-is when there's a mismatch between PAGE_SIZE and
>>> the IOMMU page size (and it will probably work that way) but I'd like to
>>> be sure that won't introduce unexpected behavior.
>>>
>>>>
>>>> If the right way to handle this is with the IOMMU and IOVA APIs, I really wish
>>>> that dance were wrapped up in a safe helper function instead of open
>>>> coding it in every driver that does cross device sharing.
>>>>
>>>> We might even call that helper... hmm... dma_map_sg.... *ducks*
>>>>
>>>
>>> There might be another way to do this correctly. I'm likely just a little
>>> bit biased because I've spent the past weeks wrapping my head around the
>>> IOMMU and DMA APIs and when all you have is a hammer everything looks like
>>> a nail.
>>>
>>> But dma_map_sg operates at the DMA API level and at that point the dma-ops
>>> for two different devices could be vastly different.
>>> In the worst case one of them could be behind an IOMMU that can easily map
>>> non-contiguous pages while the other one is directly connected to the bus and
>>> can't even access >4G pages without swiotlb support.
>>> It's really only possible to guarantee that it will map N buffers to <= N
>>> DMA-addressable buffers (possibly by using an IOMMU or swiotlb internally) at
>>> that point.
>>>
>>> On the IOMMU API level you have much more information available about the actual
>>> hardware and can prepare the buffers in a way that makes both devices happy.
>>> That's why iommu_map_sgtable combined with iovad->granule aligned sgt entries
>>> can actually guarantee to map the entire list to a single contiguous IOVA block.
>>
>> Essentially there are two reasonable options, and doing pretend dma-buf
>> export/import between two devices effectively owned by the same driver
>> is neither of them. Handily, DRM happens to be exactly where all the
>> precedent is, too; unsurprisingly this is not a new concern.
>>
>> One is to go full IOMMU API, like rockchip or tegra, attaching the
>> relevant devices to your own unmanaged domain(s) and mapping pages
>> exactly where you choose. You still make dma_map/dma_unmap calls for the
>> sake of cache maintenance and other housekeeping on the underlying
>> memory, but you ignore the provided DMA addresses in favour of your own
>> IOVAs when it comes to programming the devices.
>>
>> The lazier option if you can rely on all relevant devices having equal
>> DMA and IOMMU capabilities is to follow exynos, and herd the devices
>> into a common default domain. Instead of allocating your own domain, you
>> grab the current domain for one device (which will be its default
>> domain) and manually attach the other devices to that. Then you forget
>> all about IOMMUs but make sure to do all your regular DMA API calls
>> using that first device, and the DMA addresses which come back should be
>> magically valid for the other devices too. It was a bit of a cheeky hack
>> TBH, but I'd still much prefer more of that over any usage of
>> get_sgtable outside of actual dma-buf...
>>
>> Note that where multiple IOMMU instances are involved, the latter
>> approach does depend on the IOMMU driver being able to support sharing a
>> single domain across them; I think that might sort-of-work for DART
>> already, but may need a little more attention.
>
> It'll work for two streams inside the same DART but needs some
> attention for streams from two separate DARTs.
>
> Then there's also this amazing "feature" that the display controller DART
> pagetable pointer register is read-only so that we have to reuse the memory
> Apple configured for first level table. That needs some changes anyway
> but might make adding multiple devices from different groups more complex.

OK, I was thinking the dual-DART accommodation is already at least some
of the way there, but I guess it's still tied to a single device's cfg.
One upside to generalising further might be that the dual-DART case
stops being particularly special :)

Not being able to physically share pagetables shouldn't be too big a
deal, just a bit more work to sync iommu_map/iommu_unmap calls across
all the relevant instances for the given domain.

Robin.

2021-09-03 17:39:59

by Sven Peter

[permalink] [raw]
Subject: Re: [PATCH v2 3/8] iommu/dma: Disable get_sgtable for granule > PAGE_SIZE



On Thu, Sep 2, 2021, at 21:42, Robin Murphy wrote:
> On 2021-09-02 19:19, Sven Peter wrote:
> >
> >
> > On Wed, Sep 1, 2021, at 23:10, Alyssa Rosenzweig wrote:
> >>> My biggest issue is that I do not understand how this function is supposed
> >>> to be used correctly. It would work fine as-is if it only ever gets passed buffers
> >>> allocated by the coherent API but there's no way to check or guarantee that.
> >>> There may also be callers making assumptions that no longer hold when
> >>> iovad->granule > PAGE_SIZE.
> >>>
> >>> Regarding your case: I'm not convinced the function is meant to be used there.
> >>> If I understand it correctly, your code first allocates memory with dma_alloc_coherent
> >>> (which possibly creates a sgt internally and then maps it with iommu_map_sg),
> >>> then coerces that back into a sgt with dma_get_sgtable, and then maps that sgt to
> >>> another iommu domain with dma_map_sg while assuming that the result will be contiguous
> >>> in IOVA space. It'll work out because dma_alloc_coherent is the very thing
> >>> meant to allocate pages that can be mapped into kernel and device VA space
> >>> as a single contiguous block and because both of your IOMMUs are different
> >>> instances of the same HW block. Anything allocated by dma_alloc_coherent for the
> >>> first IOMMU will have the right shape that will allow it to be mapped as
> >>> a single contiguous block for the second IOMMU.
> >>>
> >>> What could be done in your case is to instead use the IOMMU API,
> >>> allocate the pages yourself (while ensuring the sgt you create is made up
> >>> of blocks with size and physaddr aligned to max(domain_a->granule, domain_b->granule))
> >>> and then just use iommu_map_sg for both domains which actually comes with the
> >>> guarantee that the result will be a single contiguous block in IOVA space and
> >>> doesn't require the sgt roundtrip.
> >>
> >> In principle I agree. I am getting the sense this function can't be used
> >> correctly in general, and yet is the function that's meant to be used.
> >> If my interpretation of prior LKML discussion holds, the problems are
> >> far deeper than my code or indeed page size problems...
> >
> > Right, which makes reasoning about this function and its behavior if the
> > IOMMU page size is unexpected very hard for me. I'm not opposed to just
> > keeping this function as-is when there's a mismatch between PAGE_SIZE and
> > the IOMMU page size (and it will probably work that way) but I'd like to
> > be sure that won't introduce unexpected behavior.
> >
> >>
> >> If the right way to handle this is with the IOMMU and IOVA APIs, I really wish
> >> that dance were wrapped up in a safe helper function instead of open
> >> coding it in every driver that does cross device sharing.
> >>
> >> We might even call that helper... hmm... dma_map_sg.... *ducks*
> >>
> >
> > There might be another way to do this correctly. I'm likely just a little
> > bit biased because I've spent the past weeks wrapping my head around the
> > IOMMU and DMA APIs and when all you have is a hammer everything looks like
> > a nail.
> >
> > But dma_map_sg operates at the DMA API level and at that point the dma-ops
> > for two different devices could be vastly different.
> > In the worst case one of them could be behind an IOMMU that can easily map
> > non-contiguous pages while the other one is directly connected to the bus and
> > can't even access >4G pages without swiotlb support.
> > It's really only possible to guarantee that it will map N buffers to <= N
> > DMA-addressable buffers (possibly by using an IOMMU or swiotlb internally) at
> > that point.
> >
> > On the IOMMU API level you have much more information available about the actual
> > hardware and can prepare the buffers in a way that makes both devices happy.
> > That's why iommu_map_sgtable combined with iovad->granule aligned sgt entries
> > can actually guarantee to map the entire list to a single contiguous IOVA block.
>
> Essentially there are two reasonable options, and doing pretend dma-buf
> export/import between two devices effectively owned by the same driver
> is neither of them. Handily, DRM happens to be exactly where all the
> precedent is, too; unsurprisingly this is not a new concern.
>
> One is to go full IOMMU API, like rockchip or tegra, attaching the
> relevant devices to your own unmanaged domain(s) and mapping pages
> exactly where you choose. You still make dma_map/dma_unmap calls for the
> sake of cache maintenance and other housekeeping on the underlying
> memory, but you ignore the provided DMA addresses in favour of your own
> IOVAs when it comes to programming the devices.
>
> The lazier option if you can rely on all relevant devices having equal
> DMA and IOMMU capabilities is to follow exynos, and herd the devices
> into a common default domain. Instead of allocating your own domain, you
> grab the current domain for one device (which will be its default
> domain) and manually attach the other devices to that. Then you forget
> all about IOMMUs but make sure to do all your regular DMA API calls
> using that first device, and the DMA addresses which come back should be
> magically valid for the other devices too. It was a bit of a cheeky hack
> TBH, but I'd still much prefer more of that over any usage of
> get_sgtable outside of actual dma-buf...
>
> Note that where multiple IOMMU instances are involved, the latter
> approach does depend on the IOMMU driver being able to support sharing a
> single domain across them; I think that might sort-of-work for DART
> already, but may need a little more attention.

It'll work for two streams inside the same DART but needs some
attention for streams from two separate DARTs.

Then there's also this amazing "feature" that the display controller DART
pagetable pointer register is read-only so that we have to reuse the memory
Apple configured for the first-level table. That needs some changes anyway
but might make adding multiple devices from different groups more complex.



Sven
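Robin's first option above (a driver-owned unmanaged domain with driver-chosen IOVAs) can be outlined roughly as follows. This is a non-buildable sketch of the kernel API calls involved, not a tested implementation: `dev_a`, `dev_b`, `page`, `size` and `MY_IOVA` are placeholder names, and all error handling is elided.

```c
/* Sketch only: attach both devices to one unmanaged domain and place
 * the pages at an IOVA of the driver's choosing. Error handling elided. */
struct iommu_domain *dom = iommu_domain_alloc(dev_a->bus);

iommu_attach_device(dom, dev_a);
iommu_attach_device(dom, dev_b);

/* Still go through the DMA API for cache maintenance and other
 * housekeeping on the underlying memory... */
dma_addr_t ignored = dma_map_page(dev_a, page, 0, size, DMA_TO_DEVICE);

/* ...but ignore the address it hands back and program an IOVA we
 * picked ourselves into both devices. */
iommu_map(dom, MY_IOVA, page_to_phys(page), size,
	  IOMMU_READ | IOMMU_WRITE);
```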

2021-09-03 17:41:40

by Alyssa Rosenzweig

[permalink] [raw]
Subject: Re: [PATCH v2 3/8] iommu/dma: Disable get_sgtable for granule > PAGE_SIZE

> > On the IOMMU API level you have much more information available about the actual
> > hardware and can prepare the buffers in a way that makes both devices happy.
> > That's why iommu_map_sgtable combined with iovad->granule aligned sgt entries
> > can actually guarantee to map the entire list to a single contiguous IOVA block.
>
> Essentially there are two reasonable options, and doing pretend dma-buf
> export/import between two devices effectively owned by the same driver is
> neither of them. Handily, DRM happens to be exactly where all the precedent
> is, too; unsurprisingly this is not a new concern.
>
> One is to go full IOMMU API, like rockchip or tegra, attaching the relevant
> devices to your own unmanaged domain(s) and mapping pages exactly where you
> choose. You still make dma_map/dma_unmap calls for the sake of cache
> maintenance and other housekeeping on the underlying memory, but you ignore
> the provided DMA addresses in favour of your own IOVAs when it comes to
> programming the devices.

I guess this is the way to go for DCP.

> The lazier option if you can rely on all relevant devices having equal DMA
> and IOMMU capabilities is to follow exynos, and herd the devices into a
> common default domain. Instead of allocating your own domain, you grab the
> current domain for one device (which will be its default domain) and
> manually attach the other devices to that. Then you forget all about IOMMUs
> but make sure to do all your regular DMA API calls using that first device,
> and the DMA addresses which come back should be magically valid for the
> other devices too. It was a bit of a cheeky hack TBH, but I'd still much
> prefer more of that over any usage of get_sgtable outside of actual
> dma-buf...

It'd probably be *possible* to get away with this for DCP but it'd
likely involve more hacks, since the DARTs are not 100% symmetric and
there are some constraints on the different DARTs involved.

It'd also be less desirable -- there is no reason for the display
coprocessor to know the actual *contents* of the framebuffer, only an
IOVA that is valid solely for the actual display hardware. These are two
devices in hardware with two independent DARTs; by modeling them as such
we reduce how much we need to trust the coprocessor firmware blob.

> Note that where multiple IOMMU instances are involved, the latter approach
> does depend on the IOMMU driver being able to support sharing a single
> domain across them; I think that might sort-of-work for DART already, but
> may need a little more attention.

I think this already works (for USB-C).
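The exynos-style shared default domain discussed above could be outlined roughly like this. Again a non-buildable sketch of the kernel API calls, with placeholder names (`dev_a`, `dev_b`, `buf`, `size`) and no error handling:

```c
/* Sketch only: herd dev_b into dev_a's default domain, then do all
 * regular DMA API calls through dev_a only. */
struct iommu_domain *def_dom = iommu_get_domain_for_dev(dev_a);

iommu_attach_device(def_dom, dev_b);

/* The DMA address that comes back should now be valid for dev_b too,
 * provided both devices have equal DMA and IOMMU capabilities. */
dma_addr_t iova = dma_map_single(dev_a, buf, size, DMA_BIDIRECTIONAL);
```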

2021-09-03 19:16:39

by Sven Peter

[permalink] [raw]
Subject: Re: [PATCH v2 3/8] iommu/dma: Disable get_sgtable for granule > PAGE_SIZE



On Fri, Sep 3, 2021, at 17:45, Robin Murphy wrote:
> On 2021-09-03 16:16, Sven Peter wrote:
> >
> >
> > On Thu, Sep 2, 2021, at 21:42, Robin Murphy wrote:
> >> On 2021-09-02 19:19, Sven Peter wrote:
> >>>
> >>>
> >>> On Wed, Sep 1, 2021, at 23:10, Alyssa Rosenzweig wrote:
> >>>>> My biggest issue is that I do not understand how this function is supposed
> >>>>> to be used correctly. It would work fine as-is if it only ever gets passed buffers
> >>>>> allocated by the coherent API but there's no way to check or guarantee that.
> >>>>> There may also be callers making assumptions that no longer hold when
> >>>>> iovad->granule > PAGE_SIZE.
> >>>>>
> >>>>> Regarding your case: I'm not convinced the function is meant to be used there.
> >>>>> If I understand it correctly, your code first allocates memory with dma_alloc_coherent
> >>>>> (which possibly creates a sgt internally and then maps it with iommu_map_sg),
> >>>>> then coerces that back into a sgt with dma_get_sgtable, and then maps that sgt to
> >>>>> another iommu domain with dma_map_sg while assuming that the result will be contiguous
> >>>>> in IOVA space. It'll work out because dma_alloc_coherent is the very thing
> >>>>> meant to allocate pages that can be mapped into kernel and device VA space
> >>>>> as a single contiguous block and because both of your IOMMUs are different
> >>>>> instances of the same HW block. Anything allocated by dma_alloc_coherent for the
> >>>>> first IOMMU will have the right shape that will allow it to be mapped as
> >>>>> a single contiguous block for the second IOMMU.
> >>>>>
> >>>>> What could be done in your case is to instead use the IOMMU API,
> >>>>> allocate the pages yourself (while ensuring the sgt you create is made up
> >>>>> of blocks with size and physaddr aligned to max(domain_a->granule, domain_b->granule))
> >>>>> and then just use iommu_map_sg for both domains which actually comes with the
> >>>>> guarantee that the result will be a single contiguous block in IOVA space and
> >>>>> doesn't require the sgt roundtrip.
> >>>>
> >>>> In principle I agree. I am getting the sense this function can't be used
> >>>> correctly in general, and yet is the function that's meant to be used.
> >>>> If my interpretation of prior LKML discussion holds, the problems are
> >>>> far deeper than my code or indeed page size problems...
> >>>
> >>> Right, which makes reasoning about this function and its behavior if the
> >>> IOMMU page size is unexpected very hard for me. I'm not opposed to just
> >>> keeping this function as-is when there's a mismatch between PAGE_SIZE and
> >>> the IOMMU page size (and it will probably work that way) but I'd like to
> >>> be sure that won't introduce unexpected behavior.
> >>>
> >>>>
> >>>> If the right way to handle this is with the IOMMU and IOVA APIs, I really wish
> >>>> that dance were wrapped up in a safe helper function instead of open
> >>>> coding it in every driver that does cross device sharing.
> >>>>
> >>>> We might even call that helper... hmm... dma_map_sg.... *ducks*
> >>>>
> >>>
> >>> There might be another way to do this correctly. I'm likely just a little
> >>> bit biased because I've spent the past weeks wrapping my head around the
> >>> IOMMU and DMA APIs and when all you have is a hammer everything looks like
> >>> a nail.
> >>>
> >>> But dma_map_sg operates at the DMA API level and at that point the dma-ops
> >>> for two different devices could be vastly different.
> >>> In the worst case one of them could be behind an IOMMU that can easily map
> >>> non-contiguous pages while the other one is directly connected to the bus and
> >>> can't even access >4G pages without swiotlb support.
> >>> It's really only possible to guarantee that it will map N buffers to <= N
> >>> DMA-addressable buffers (possibly by using an IOMMU or swiotlb internally) at
> >>> that point.
> >>>
> >>> On the IOMMU API level you have much more information available about the actual
> >>> hardware and can prepare the buffers in a way that makes both devices happy.
> >>> That's why iommu_map_sgtable combined with iovad->granule aligned sgt entries
> >>> can actually guarantee to map the entire list to a single contiguous IOVA block.
> >>
> >> Essentially there are two reasonable options, and doing pretend dma-buf
> >> export/import between two devices effectively owned by the same driver
> >> is neither of them. Handily, DRM happens to be exactly where all the
> >> precedent is, too; unsurprisingly this is not a new concern.
> >>
> >> One is to go full IOMMU API, like rockchip or tegra, attaching the
> >> relevant devices to your own unmanaged domain(s) and mapping pages
> >> exactly where you choose. You still make dma_map/dma_unmap calls for the
> >> sake of cache maintenance and other housekeeping on the underlying
> >> memory, but you ignore the provided DMA addresses in favour of your own
> >> IOVAs when it comes to programming the devices.
> >>
> >> The lazier option if you can rely on all relevant devices having equal
> >> DMA and IOMMU capabilities is to follow exynos, and herd the devices
> >> into a common default domain. Instead of allocating your own domain, you
> >> grab the current domain for one device (which will be its default
> >> domain) and manually attach the other devices to that. Then you forget
> >> all about IOMMUs but make sure to do all your regular DMA API calls
> >> using that first device, and the DMA addresses which come back should be
> >> magically valid for the other devices too. It was a bit of a cheeky hack
> >> TBH, but I'd still much prefer more of that over any usage of
> >> get_sgtable outside of actual dma-buf...
> >>
> >> Note that where multiple IOMMU instances are involved, the latter
> >> approach does depend on the IOMMU driver being able to support sharing a
> >> single domain across them; I think that might sort-of-work for DART
> >> already, but may need a little more attention.
> >
> > It'll work for two streams inside the same DART but needs some
> > attention for streams from two separate DARTs.
> >
> > Then there's also this amazing "feature" that the display controller DART
> > pagetable pointer register is read-only so that we have to reuse the memory
> > Apple configured for the first-level table. That needs some changes anyway
> > but might make adding multiple devices from different groups more complex.
>
> OK, I was thinking the dual-DART accommodation is already at least some
> of the way there, but I guess it's still tied to a single device's cfg.

Pretty much. I think "needing a little more attention" describes it pretty
well :)


> One upside to generalising further might be that the dual-DART case
> stops being particularly special :)
>
> Not being able to physically share pagetables shouldn't be too big a
> deal, just a bit more work to sync iommu_map/iommu_unmap calls across
> all the relevant instances for the given domain.

True, it's just a bit more bookkeeping in the end.



Sven
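The alignment precondition Sven keeps coming back to in this thread -- every sgt entry's size and physical address must be a multiple of max(domain_a->granule, domain_b->granule) before iommu_map_sgtable can guarantee a single contiguous IOVA block -- can be sketched as a plain check. This is illustrative userspace C, not kernel code; `struct seg` is a made-up stand-in for a scatterlist entry:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical segment descriptor standing in for one scatterlist
 * entry: just a physical address and a length. */
struct seg {
	uint64_t phys;
	size_t len;
};

/* Return true if every segment's physical address and length are
 * multiples of max(granule_a, granule_b), i.e. the shape that lets
 * iommu_map_sgtable() map the whole list as one contiguous IOVA
 * block in both domains. */
static bool segs_mappable_contiguously(const struct seg *segs, size_t n,
				       size_t granule_a, size_t granule_b)
{
	size_t g = granule_a > granule_b ? granule_a : granule_b;
	size_t i;

	for (i = 0; i < n; i++) {
		if (segs[i].phys % g || segs[i].len % g)
			return false;
	}
	return true;
}
```

With a 4k granule on one side and 16k on the other, a segment starting at a merely 4k-aligned physical address fails the check, which is exactly the case the patch guards against.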