Subject: Re: [PATCH v7 06/10] iommu/dma-reserved-iommu: iommu_get/put_reserved_iova
To: Robin Murphy, eric.auger@st.com, alex.williamson@redhat.com,
 will.deacon@arm.com, joro@8bytes.org, tglx@linutronix.de,
 jason@lakedaemon.net, marc.zyngier@arm.com, christoffer.dall@linaro.org,
 linux-arm-kernel@lists.infradead.org
References: <1461084994-2355-1-git-send-email-eric.auger@linaro.org>
 <1461084994-2355-7-git-send-email-eric.auger@linaro.org>
 <5717B541.7090003@arm.com>
Cc: patches@linaro.org, linux-kernel@vger.kernel.org,
 Bharat.Bhushan@freescale.com, pranav.sawargaonkar@gmail.com,
 p.fedin@samsung.com, iommu@lists.linux-foundation.org,
 Jean-Philippe.Brucker@arm.com, julien.grall@arm.com
From: Eric Auger
Message-ID: <5718929A.5050703@linaro.org>
Date: Thu, 21 Apr 2016 10:43:06 +0200
In-Reply-To: <5717B541.7090003@arm.com>

Hi Robin,

On 04/20/2016 06:58 PM, Robin Murphy wrote:
> On 19/04/16 17:56, Eric Auger wrote:
>> This patch introduces iommu_get/put_reserved_iova.
>>
>> iommu_get_reserved_iova allows to iommu map a contiguous physical region
>> onto a reserved contiguous IOVA region. The physical region base address
>> does not need to be iommu page size aligned. iova pages are allocated and
>> mapped so that they cover all the physical region. This mapping is
>> tracked as a whole (and cannot be split) in an RB tree indexed by PA.
>>
>> In case a mapping already exists for the physical pages, the IOVA mapped
>> to the PA base is directly returned.
>>
>> Each time the get succeeds a binding ref count is incremented.
>>
>> iommu_put_reserved_iova decrements the ref count and when this latter
>> is null, the mapping is destroyed and the iovas are released.
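
For reference, the call pattern I have in mind looks roughly like the
sketch below (the caller, the doorbell address and the prot flags are
made up for illustration; error handling is trimmed):

    /* hypothetical caller: bind a 4kB MSI doorbell at doorbell_pa and
     * return the IOVA to use in the MSI message (sketch, not real code) */
    static int bind_doorbell(struct iommu_domain *domain,
                             phys_addr_t doorbell_pa, dma_addr_t *msi_iova)
    {
            int ret;

            ret = iommu_get_reserved_iova(domain, doorbell_pa, 0x1000,
                                          IOMMU_WRITE, msi_iova);
            if (ret)
                    return ret;

            /* ... compose the MSI message using *msi_iova ... */
            return 0;
    }

    /* later, on teardown: the last put destroys the mapping and
     * releases the iovas */
    iommu_put_reserved_iova(domain, doorbell_pa);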
>>
>> Signed-off-by: Eric Auger
>>
>> ---
>> v7:
>> - change title and rework commit message with new name of the functions
>> and size parameter
>> - fix locking
>> - rework header doc comments
>> - put now takes a phys_addr_t
>> - check prot argument against reserved_iova_domain prot flags
>>
>> v5 -> v6:
>> - revisit locking with spin_lock instead of mutex
>> - do not kref_get on 1st get
>> - add size parameter to the get function following Marc's request
>> - use the iova domain shift instead of using the smallest supported
>> page size
>>
>> v3 -> v4:
>> - formerly in iommu: iommu_get/put_single_reserved &
>> iommu/arm-smmu: implement iommu_get/put_single_reserved
>> - Attempted to address Marc's doubts about missing size/alignment
>> at VFIO level (user-space knows the IOMMU page size and the number
>> of IOVA pages to provision)
>>
>> v2 -> v3:
>> - remove static implementation of iommu_get_single_reserved &
>> iommu_put_single_reserved when CONFIG_IOMMU_API is not set
>>
>> v1 -> v2:
>> - previously a VFIO API, named vfio_alloc_map/unmap_free_reserved_iova
>> ---
>> drivers/iommu/dma-reserved-iommu.c | 150
>> +++++++++++++++++++++++++++++++++++++
>> include/linux/dma-reserved-iommu.h | 38 ++++++++++
>> 2 files changed, 188 insertions(+)
>>
>> diff --git a/drivers/iommu/dma-reserved-iommu.c
>> b/drivers/iommu/dma-reserved-iommu.c
>> index f6fa18e..426d339 100644
>> --- a/drivers/iommu/dma-reserved-iommu.c
>> +++ b/drivers/iommu/dma-reserved-iommu.c
>> @@ -135,6 +135,22 @@ unlock:
>> }
>> EXPORT_SYMBOL_GPL(iommu_alloc_reserved_iova_domain);
>>
>> +/* called with domain's reserved_lock held */
>> +static void reserved_binding_release(struct kref *kref)
>> +{
>> +        struct iommu_reserved_binding *b =
>> +                container_of(kref, struct iommu_reserved_binding, kref);
>> +        struct iommu_domain *d = b->domain;
>> +        struct reserved_iova_domain *rid =
>> +                (struct reserved_iova_domain *)d->reserved_iova_cookie;
>
> Either it's a void *, in which case you don't need to cast it, or it
> should be the appropriate type as I mentioned earlier, in which case you
> still wouldn't need to cast it.

ok
>
>> +        unsigned long order;
>> +
>> +        order = iova_shift(rid->iovad);
>> +        free_iova(rid->iovad, b->iova >> order);
>
> iova_pfn() ?

ok
>
>> +        unlink_reserved_binding(d, b);
>> +        kfree(b);
>> +}
>> +
>> void iommu_free_reserved_iova_domain(struct iommu_domain *domain)
>> {
>> struct reserved_iova_domain *rid;
>> @@ -160,3 +176,137 @@ unlock:
>> }
>> }
>> EXPORT_SYMBOL_GPL(iommu_free_reserved_iova_domain);
>> +
>> +int iommu_get_reserved_iova(struct iommu_domain *domain,
>> +                            phys_addr_t addr, size_t size, int prot,
>> +                            dma_addr_t *iova)
>> +{
>> +        unsigned long base_pfn, end_pfn, nb_iommu_pages, order, flags;
>> +        struct iommu_reserved_binding *b, *newb;
>> +        size_t iommu_page_size, binding_size;
>> +        phys_addr_t aligned_base, offset;
>> +        struct reserved_iova_domain *rid;
>> +        struct iova_domain *iovad;
>> +        struct iova *p_iova;
>> +        int ret = -EINVAL;
>> +
>> +        newb = kzalloc(sizeof(*newb), GFP_KERNEL);
>> +        if (!newb)
>> +                return -ENOMEM;
>> +
>> +        spin_lock_irqsave(&domain->reserved_lock, flags);
>> +
>> +        rid = (struct reserved_iova_domain *)domain->reserved_iova_cookie;
>> +        if (!rid)
>> +                goto free_newb;
>> +
>> +        if ((prot & IOMMU_READ & !(rid->prot & IOMMU_READ)) ||
>> +            (prot & IOMMU_WRITE & !(rid->prot & IOMMU_WRITE)))
>
> Are devices wanting to read from MSI doorbells really a thing?

The rationale is that Alex asked for propagating the VFIO DMA MAP prot
flag down to this API (for genericity). The latter is stored in
rid->prot, and here I am just checking that the iova gets mapped
according to the direction userspace expects.
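
Concretely, the prot recorded in rid->prot would be derived from the
user's VFIO_IOMMU_MAP_DMA flags the same way vfio type1 already does it
today (sketch below; the helper name is made up, and "flags" stands for
the usual struct vfio_iommu_type1_dma_map flags field):

    static int reserved_prot_from_vfio(u32 flags)
    {
            int prot = 0;

            if (flags & VFIO_DMA_MAP_FLAG_READ)
                    prot |= IOMMU_READ;
            if (flags & VFIO_DMA_MAP_FLAG_WRITE)
                    prot |= IOMMU_WRITE;

            /* stored in rid->prot; iommu_get_reserved_iova() then
             * rejects requests asking for a direction the reserved
             * domain was not created with */
            return prot;
    }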
>
>> +                goto free_newb;
>> +
>> +        iovad = rid->iovad;
>> +        order = iova_shift(iovad);
>> +        base_pfn = addr >> order;
>> +        end_pfn = (addr + size - 1) >> order;
>> +        aligned_base = base_pfn << order;
>> +        offset = addr - aligned_base;
>> +        nb_iommu_pages = end_pfn - base_pfn + 1;
>> +        iommu_page_size = 1 << order;
>> +        binding_size = nb_iommu_pages * iommu_page_size;
>
> offset = iova_offset(iovad, addr);
> aligned_base = addr - offset;
> binding_size = iova_align(iovad, size + offset);
>
> Am I right?

Looks so. Will further test it. Thanks
>
>> +
>> +        b = find_reserved_binding(domain, aligned_base, binding_size);
>> +        if (b) {
>> +                *iova = b->iova + offset + aligned_base - b->addr;
>> +                kref_get(&b->kref);
>> +                ret = 0;
>> +                goto free_newb;
>> +        }
>> +
>> +        p_iova = alloc_iova(iovad, nb_iommu_pages,
>> +                            iovad->dma_32bit_pfn, true);
>> +        if (!p_iova) {
>> +                ret = -ENOMEM;
>> +                goto free_newb;
>> +        }
>> +
>> +        *iova = iova_dma_addr(iovad, p_iova);
>> +
>> +        /* unlock to call iommu_map which is not guaranteed to be atomic */
>
> Hmm, that's concerning, because the ARM DMA mapping code, and
> consequently the iommu-dma layer, has always relied on it being so. On
> brief inspection, it looks to be only the AMD IOMMU doing something
> obviously non-atomic (taking a mutex) in its map callback, but then that
> has a separate DMA ops implementation. It doesn't look like it would be
> too intrusive to change, either, but that's an idea for its own thread.

Yes. Making no assumption about the atomicity of the iommu_map/unmap ops
brought some extra complexity here. It also forced me to separate the
alloc/map step from the iommu "binding" lookup. But now that it is done,
I think it brings some added value. Typically, the fact that we
introduced an irq-chip op to retrieve the doorbell characteristics is
valuable for enumerating their number.
>
>> +        spin_unlock_irqrestore(&domain->reserved_lock, flags);
>> +
>> +        ret = iommu_map(domain, *iova, aligned_base, binding_size, prot);
>> +
>> +        spin_lock_irqsave(&domain->reserved_lock, flags);
>> +
>> +        rid = (struct reserved_iova_domain *) domain->reserved_iova_cookie;
>> +        if (!rid || (rid->iovad != iovad)) {
>> +                /* reserved iova domain was destroyed in our back */
>
> That that could happen at all is terrifying! Surely the reserved domain
> should be set up immediately after iommu_domain_alloc() and torn down
> immediately before iommu_domain_free(). Things going missing while a
> domain is live smacks of horrible brokenness.

The VFIO user client creates the "reserved iova domain" with the
VFIO_IOMMU_MAP_DMA ioctl. This can happen anytime after the iommu domain
creation (on the VFIO_SET_IOMMU ioctl). User-space is currently allowed
to unregister this iova domain at any time too. I think this is wrong: I
should have two reserved iova domain destroy functions, one used by
user-space and one used by the kernel. In the user-space path I should
reject any attempt to destroy the reserved iova domain as long as
bindings still exist.
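
Something along these lines (a rough sketch only; the function and
helper names below are made up and do not exist in the series yet):

    /* user-space triggered path: refuse to tear the reserved iova
     * domain down while bindings are still referenced */
    int iommu_put_reserved_iova_domain(struct iommu_domain *domain)
    {
            unsigned long flags;
            int ret = 0;

            spin_lock_irqsave(&domain->reserved_lock, flags);
            if (reserved_bindings_exist(domain))        /* assumed helper */
                    ret = -EBUSY;
            else
                    __free_reserved_iova_domain(domain); /* assumed helper */
            spin_unlock_irqrestore(&domain->reserved_lock, flags);

            return ret;
    }

    /* kernel path, called just before iommu_domain_free(): tears the
     * reserved iova domain down unconditionally */
    void iommu_free_reserved_iova_domain(struct iommu_domain *domain);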
>
>> +                ret = -EBUSY;
>> +                goto free_newb; /* iova already released */
>> +        }
>> +
>> +        /* no change in iova reserved domain but iommu_map failed */
>> +        if (ret)
>> +                goto free_iova;
>> +
>> +        /* everything is fine, add in the new node in the rb tree */
>> +        kref_init(&newb->kref);
>> +        newb->domain = domain;
>> +        newb->addr = aligned_base;
>> +        newb->iova = *iova;
>> +        newb->size = binding_size;
>> +
>> +        link_reserved_binding(domain, newb);
>> +
>> +        *iova += offset;
>> +        goto unlock;
>> +
>> +free_iova:
>> +        free_iova(rid->iovad, p_iova->pfn_lo);
>> +free_newb:
>> +        kfree(newb);
>> +unlock:
>> +        spin_unlock_irqrestore(&domain->reserved_lock, flags);
>> +        return ret;
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_get_reserved_iova);
>> +
>> +void iommu_put_reserved_iova(struct iommu_domain *domain, phys_addr_t
>> addr)
>> +{
>> +        phys_addr_t aligned_addr, page_size, mask;
>> +        struct iommu_reserved_binding *b;
>> +        struct reserved_iova_domain *rid;
>> +        unsigned long order, flags;
>> +        struct iommu_domain *d;
>> +        dma_addr_t iova;
>> +        size_t size;
>> +        int ret = 0;
>> +
>> +        spin_lock_irqsave(&domain->reserved_lock, flags);
>> +
>> +        rid = (struct reserved_iova_domain *)domain->reserved_iova_cookie;
>> +        if (!rid)
>> +                goto unlock;
>> +
>> +        order = iova_shift(rid->iovad);
>> +        page_size = (uint64_t)1 << order;
>> +        mask = page_size - 1;
>> +        aligned_addr = addr & ~mask;
>
> addr & ~iova_mask(rid->iovad)

OK
>
>> +
>> +        b = find_reserved_binding(domain, aligned_addr, page_size);
>> +        if (!b)
>> +                goto unlock;
>> +
>> +        iova = b->iova;
>> +        size = b->size;
>> +        d = b->domain;
>> +
>> +        ret = kref_put(&b->kref, reserved_binding_release);
>> +
>> +unlock:
>> +        spin_unlock_irqrestore(&domain->reserved_lock, flags);
>> +        if (ret)
>> +                iommu_unmap(d, iova, size);
>> +}
>> +EXPORT_SYMBOL_GPL(iommu_put_reserved_iova);
>> +
>> diff --git a/include/linux/dma-reserved-iommu.h
>> b/include/linux/dma-reserved-iommu.h
>> index 01ec385..8722131 100644
>> --- a/include/linux/dma-reserved-iommu.h
>> +++ b/include/linux/dma-reserved-iommu.h
>> @@ -42,6 +42,34 @@ int iommu_alloc_reserved_iova_domain(struct
>> iommu_domain *domain,
>> */
>> void iommu_free_reserved_iova_domain(struct iommu_domain *domain);
>>
>> +/**
>> + * iommu_get_reserved_iova: allocate a contiguous set of iova pages and
>> + * map them to the physical range defined by @addr and @size.
>> + *
>> + * @domain: iommu domain handle
>> + * @addr: physical address to bind
>> + * @size: size of the binding
>> + * @prot: mapping protection attribute
>> + * @iova: returned iova
>> + *
>> + * Mapped physical pfns are within [@addr >> order, (@addr + size -1) >>
>> order]
>> + * where order corresponds to the reserved iova domain order.
>> + * This mapping is tracked and reference counted with the minimal
>> granularity
>> + * of @size.
>> + */
>> +int iommu_get_reserved_iova(struct iommu_domain *domain,
>> +                            phys_addr_t addr, size_t size, int prot,
>> +                            dma_addr_t *iova);
>> +
>> +/**
>> + * iommu_put_reserved_iova: decrement a ref count of the reserved
>> mapping
>> + *
>> + * @domain: iommu domain handle
>> + * @addr: physical address whose binding ref count is decremented
>> + *
>> + * if the binding ref count is null, destroy the reserved mapping
>> + */
>> +void iommu_put_reserved_iova(struct iommu_domain *domain, phys_addr_t
>> addr);
>> #else
>>
>> static inline int
>> @@ -55,5 +83,15 @@ iommu_alloc_reserved_iova_domain(struct
>> iommu_domain *domain,
>> static inline void
>> iommu_free_reserved_iova_domain(struct iommu_domain *domain) {}
>>
>> +static inline int iommu_get_reserved_iova(struct iommu_domain *domain,
>> +                                          phys_addr_t addr, size_t size,
>> +                                          int prot, dma_addr_t *iova)
>> +{
>> +        return -ENOENT;
>> +}
>> +
>> +static inline void iommu_put_reserved_iova(struct iommu_domain *domain,
>> +                                           phys_addr_t addr) {}
>> +
>> #endif /* CONFIG_IOMMU_DMA_RESERVED */
>> #endif /* __DMA_RESERVED_IOMMU_H */
>>
>
> I worry that this all falls into the trap of trying too hard to abstract
> something which doesn't need abstracting. AFAICS all we need is
> something for VFIO to keep track of its own IOVA usage vs. userspace's,
> plus a list of MSI descriptors (with IOVAs) wrapped in refcounts hanging
> off the iommu_domain, with a handful of functions to manage them. The
> former is as good as solved already - stick an iova_domain or even just
> a bitmap in the iova_cookie and use it directly - and the latter would
> actually be reusable elsewhere (e.g. for iommu-dma domains). What I'm
> seeing here is layers upon layers of complexity with no immediate
> justification, that's 'generic' enough to not directly solve the problem
> at hand, but in a way that still makes it more or less unusable for
> solving equivalent problems elsewhere.
>
> Since I don't like that everything I have to say about this series so
> far seems negative, I'll plan to spend some time next week having a go
> at hardening my 50-line proof-of-concept for stage 1 MSIs, and see if I
> can offer code instead of criticism :)

No worries. I really appreciate the time you've already spent reading
this code ;-) I acknowledge it is a lot of trouble for mapping a single
page, in my case! Anyway I will take your comments into account and
simplify things accordingly. Let's see how we can converge...

Best Regards

Eric

>
> Robin.