2010-07-03 05:38:35

by Zach Pfeffer

Subject: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

This patch contains the documentation for the API, termed the Virtual
Contiguous Memory Manager. Its use would allow all of the IOMMU to VM,
VM to device and device to IOMMU interoperation code to be refactored
into platform independent code.

Comments, suggestions and criticisms are welcome and wanted.

Signed-off-by: Zach Pfeffer <[email protected]>
---
Documentation/vcm.txt | 587 +++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 587 insertions(+), 0 deletions(-)
create mode 100644 Documentation/vcm.txt

diff --git a/Documentation/vcm.txt b/Documentation/vcm.txt
new file mode 100644
index 0000000..1c6a8be
--- /dev/null
+++ b/Documentation/vcm.txt
@@ -0,0 +1,587 @@
+What is this document about?
+============================
+
+This document covers how to use the Virtual Contiguous Memory Manager
+(VCMM), how the first implementation works with a specific low-level
+Input/Output Memory Management Unit (IOMMU) and the way the VCMM is used
+from user-space. It also contains a section that describes why something
+like the VCMM is needed in the kernel.
+
+If anything in this document is wrong, please send patches to the
+maintainer of this file, listed at the bottom of the document.
+
+
+The Virtual Contiguous Memory Manager
+=====================================
+
+The VCMM was built to solve the system-wide memory mapping issues that
+occur when many bus-masters have IOMMUs.
+
+An IOMMU maps device addresses to physical addresses. It also insulates
+the system from spurious or malicious device bus transactions and allows
+fine-grained mapping attribute control. The Linux kernel core does not
+contain a generic API to handle IOMMU mapped memory; device driver writers
+must implement device specific code to interoperate with the Linux kernel
+core. As the number of IOMMUs increases, coordinating the many address
+spaces mapped by all discrete IOMMUs becomes difficult without in-kernel
+support.
+
+The VCMM API enables device independent IOMMU control, virtual memory
+manager (VMM) interoperation and non-IOMMU enabled device interoperation
+by treating devices with or without IOMMUs and all CPUs with or without
+MMUs, their mapping contexts and their mappings using common
+abstractions. Physical hardware is given a generic device type and mapping
+contexts are abstracted into Virtual Contiguous Memory (VCM)
+regions. Users "reserve" memory from VCMs and "back" their reservations
+with physical memory.
+
+Why the VCMM is Needed
+----------------------
+
+Driver writers who control devices with IOMMUs must contend with device
+control and memory management. Driver writers have a large device driver
+API that they can leverage to control their devices, but they are lacking
+a unified API to help them program mappings into IOMMUs and share those
+mappings with other devices and CPUs in the system.
+
+Sharing is complicated by Linux's CPU-centric VMM. The CPU-centric model
+generally makes sense because average hardware only contains an MMU for the
+CPU and possibly a graphics MMU. If every device in the system has one or
+more MMUs the CPU-centric memory management (MM) programming model breaks
+down.
+
+Abstracting IOMMU device programming into a common API has already begun
+in the Linux kernel. It was built to abstract the difference between AMD
+and Intel IOMMUs to support x86 virtualization on both platforms. The
+interface is listed in include/linux/iommu.h. It contains
+interfaces for mapping and unmapping as well as domain management. This
+interface has not gained widespread use outside the x86; PA-RISC, Alpha
+and SPARC architectures and ARM and PowerPC platforms all use their own
+mapping modules to control their IOMMUs. The VCMM contains an IOMMU
+programming layer, but since its abstraction supports map management
+independent of device control, the layer is not used directly. This
+higher-level view enables a new kernel service, not just an IOMMU
+interoperation layer.
+
+The General Idea: Map Management using Graphs
+---------------------------------------------
+
+Looking at mapping from a system-wide perspective reveals a general graph
+problem. The VCMM's API is built to manage the general mapping graph. Each
+node that talks to memory, either through an MMU or directly (physically
+mapped) can be thought of as the device end of a mapping edge. The other
+end is the physical memory (or intermediate virtual space) that is
+mapped.
+
+In the direct-mapped case the device is assigned a one-to-one MMU. This
+scheme allows direct mapped devices to participate in general graph
+management.
+
+The CPU nodes can also be brought under the same mapping abstraction with
+the use of a light overlay on the existing VMM. This light overlay allows
+VMM-managed mappings to interoperate with the common API. The light
+overlay enables this without substantial modifications to the existing
+VMM.
+
+In addition to CPU nodes that are running Linux (and the VMM), remote CPU
+nodes that may be running other operating systems can be brought into the
+general abstraction. Routing all memory management requests from a remote
+node through the central memory management framework enables new features
+like system-wide memory migration. This feature may only be feasible for
+large buffers that are managed outside of the fast-path, but having remote
+allocation in a system enables features that are impossible to build
+without it.
+
+The fundamental objects that support graph-based map management are:
+
+1) Virtual Contiguous Memory Regions
+
+2) Reservations
+
+3) Associated Virtual Contiguous Memory Regions
+
+4) Memory Targets
+
+5) Physical Memory Allocations
+
+Usage Overview
+--------------
+
+In a nutshell, users allocate Virtual Contiguous Memory Regions and
+associate those regions with one or more devices by creating an Associated
+Virtual Contiguous Memory Region. Users then create Reservations from the
+Virtual Contiguous Memory Region. At this point no physical memory has
+been committed to the reservation. To associate physical memory with a
+reservation a Physical Memory Allocation is created and the Reservation is
+backed with this allocation.
+
+include/linux/vcm.h includes comments documenting each API.
+
+Virtual Contiguous Memory Regions
+---------------------------------
+
+A Virtual Contiguous Memory Region (VCM) abstracts the memory space a
+device sees. The addresses of the region are only used by the devices
+which are associated with the region. This address space would normally be
+implemented as a device page table.
+
+A VCM is created and destroyed with three functions:
+
+ struct vcm *vcm_create(unsigned long start_addr, unsigned long len);
+
+ struct vcm *vcm_create_from_prebuilt(size_t ext_vcm_id);
+
+ int vcm_free(struct vcm *vcm);
+
+start_addr is an offset into the address space where allocations will
+start from. len is the length from start_addr of the VCM. Both creation
+functions return an instance of a VCM.
+
+ext_vcm_id is used to pass a request to the VMM to generate a VCM
+instance. In the current implementation the call simply makes a note that
+the VCM instance is a VMM VCM instance for use by other interfaces. This
+muxing is seen throughout the implementation.
+
+vcm_create() and vcm_create_from_prebuilt() produce VCM instances for
+virtually mapped devices (IOMMUs and CPUs). To create a one-to-one mapped
+VCM, users pass the start_addr and len of the physical region. The VCMM
+matches this and records that the VCM instance is a one-to-one VCM.
+
+The newly created VCM instance can be passed to any function that needs to
+operate on or with a virtual contiguous memory region. Its main attributes
+are a start_addr and a len as well as an internal setting that allows the
+implementation to mux between true virtual spaces, one-to-one mapped
+spaces and VMM managed spaces.
+
+The current implementation uses the genalloc library to manage the VCM for
+IOMMU devices. Return values and more in-depth per-function documentation
+for these and the ones listed below are in include/linux/vcm.h.
+
+Reservations
+------------
+
+A Reservation is a contiguous region allocated from a VCM. There is no
+physical memory associated with it.
+
+A Reservation is created and destroyed with:
+
+ struct res *vcm_reserve(struct vcm *vcm, size_t len, u32 attr);
+
+ int vcm_unreserve(struct res *res);
+
+A vcm is a VCM created above. len is the length of the request. It can be
+up to the length of the VCM region the reservation is being created
+from. attr are mapping attributes: read, write, execute, user, supervisor,
+secure, not-cached, write-back/write-allocate, write-back/no
+write-allocate, write-through. These attrs are appropriate for ARM but can
+be changed to match any architecture.
+
+The implementation calls gen_pool_alloc() for IOMMU devices,
+alloc_vm_area() for VMM areas and is a pass-through for one-to-one mapped
+areas.
+
+Associated Virtual Contiguous Memory Regions and Activation
+-----------------------------------------------------------
+
+An Associated Virtual Contiguous Memory Region (AVCM) is a mapping of a
+VCM to a device. The mapping can be active or inactive.
+
+An AVCM is managed with:
+
+ struct avcm *vcm_assoc(struct vcm *vcm, struct device *dev, u32 attr);
+
+ int vcm_deassoc(struct avcm *avcm);
+
+ int vcm_activate(struct avcm *avcm);
+
+ int vcm_deactivate(struct avcm *avcm);
+
+A VCM instance is a VCM created above. A dev is an opaque device handle
+that is passed down to the device driver the VCMM muxes in to handle a
+request. attr are association attributes: split, use-high or
+use-low. split controls which transactions hit a high-address page-table
+and which transactions hit a low-address page-table. For instance, all
+transactions whose most significant address bit is one would use the
+high-address page-table; any other transaction would use the low-address
+page-table. This scheme is ARM-specific and could be changed in other
+architectures. One VCM instance can be associated with many devices and
+many VCM instances can be associated with one device.
+
+An AVCM is only a link. To program and deprogram a device with a VCM the
+user calls vcm_activate() and vcm_deactivate(). For IOMMU devices,
+activating a mapping programs the base address of a page table into an
+IOMMU. For VMM and one-to-one based devices, mappings are active
+immediately but the API does require an activation call for them for
+internal reference counting.
+
+Memory Targets
+--------------
+
+A Memory Target is a platform independent way of specifying a physical
+pool; it abstracts a pool of physical memory. The physical memory pool may
+be physically discontiguous, need to be allocated from in a unique way or
+have other user-defined attributes.
+
+Physical Memory Allocation and Reservation Backing
+--------------------------------------------------
+
+Physical memory is allocated as a separate step from reserving
+memory. This allows multiple reservations to be backed by the same
+physical memory.
+
+A Physical Memory Allocation is managed using the following functions:
+
+ struct physmem *vcm_phys_alloc(enum memtype_t memtype,
+ size_t len, u32 attr);
+
+ int vcm_phys_free(struct physmem *physmem);
+
+ int vcm_back(struct res *res, struct physmem *physmem);
+
+ int vcm_unback(struct res *res);
+
+attr can include an alignment request, a specification to map memory using
+various block sizes and/or to use physically contiguous memory. memtype is
+one of the memory types listed in Memory Targets.
+
+The current implementation manages two pools of memory. One pool is a
+contiguous block of memory and the other is a set of contiguous block
+pools. In the current implementation the block pools contain 4K, 64K and
+1M blocks. The physical allocator does not try to split blocks from the
+contiguous block pools to satisfy requests.
+
+The use of 4K, 64K and 1M blocks solves a problem with some IOMMU
+hardware. IOMMUs are placed in front of multimedia engines to provide a
+contiguous address space to the device. Multimedia devices need large
+buffers and large buffers may map to a large number of physical
+blocks. IOMMUs tend to have small translation lookaside buffers
+(TLBs). Since the TLB is small the number of physical blocks that map a
+given range needs to be small or else the IOMMU will continually fetch new
+translations during a typical streamed multimedia flow. By using a 1 MB
+mapping (or 64K mapping) instead of a 4K mapping the number of misses can
+be minimized, allowing the multimedia block to meet its performance goals.
+
+Low Level Control
+-----------------
+
+It is necessary in some instances to access attributes and provide
+higher-level control of the low-level hardware abstraction. The API
+contains many members and functions for this task but the two that are
+typically used are:
+
+ res->dev_addr;
+
+ int vcm_hook(struct device *dev, vcm_handler handler, void *data);
+
+res->dev_addr is the device address given a reservation. This device
+address is a virtual IOMMU address for reservations on IOMMU VCMs, a
+virtual VMM address for reservations on VMM VCMs and a virtual (really
+physical since it is one-to-one mapped) address for one-to-one devices.
+
+The function vcm_hook() allows a caller in the kernel to register a
+fault handler. During a fault, the handler is passed the data pointer
+that was given to vcm_hook(). The handler can return 1 to indicate that
+the underlying driver should handle the fault and retry the transaction,
+or it can return 0 to halt the transaction. If no handler is registered,
+the low-level driver will print a warning and terminate the
+transaction.
+
+A Detailed Walk Through
+-----------------------
+
+The following call sequence walks through a typical allocation
+sequence. In the first stage the memory for a device is reserved and
+backed. This occurs without mapping the memory into a VMM VCM region. The
+second stage maps the first VCM region into a VMM VCM region so the kernel
+can read or write it. The second stage is not necessary if the VMM does
+not need to read or modify the contents of the original mapping.
+
+ Stage 1: Map and Allocate Memory for a Device
+
+ The call sequence starts by creating a VCM region:
+
+ vcm = vcm_create(start_addr, len);
+
+ The next call associates a VCM region with a device:
+
+ avcm = vcm_assoc(vcm, dev, attr);
+
+ To activate the association, users call vcm_activate() on the avcm from
+ the associate call. This programs the underlying device with the
+ mappings.
+
+ ret = vcm_activate(avcm);
+
+ Once a VCM region is created and associated it can be reserved from
+ with:
+
+ res = vcm_reserve(vcm, res_len, res_attr);
+
+ A user then allocates physical memory with:
+
+ physmem = vcm_phys_alloc(memtype, len, phys_attr);
+
+ To back the reservation with the physical memory allocation the user
+ calls:
+
+ vcm_back(res, physmem);
+
+
+ Stage 2: Map the Device's Memory into the VMM's VCM region
+
+ If the VMM needs to read and/or write the region that was just created,
+ the following calls are made.
+
+ The first call creates a prebuilt VCM with:
+
+ vcm_vmm = vcm_create_from_prebuilt(ext_vcm_id);
+
+ The prebuilt VCM is associated with the CPU device and activated with:
+
+ avcm_vmm = vcm_assoc(vcm_vmm, dev_cpu, attr);
+ vcm_activate(avcm_vmm);
+
+ A reservation is made on the VMM VCM with:
+
+ res_vmm = vcm_reserve(vcm_vmm, res_len, attr);
+
+ Finally, once the topology has been set up a vcm_back() allows the VMM
+ to read the memory using the physmem generated in stage 1:
+
+ vcm_back(res_vmm, physmem);
+
+Mapping IOMMU, one-to-one and VMM Reservations
+----------------------------------------------
+
+The following example demonstrates mapping IOMMU, one-to-one and VMM
+reservations to the same physical memory. It shows the use of phys_addr
+and phys_size to create a contiguous VCM for one-to-one mapped devices.
+
+ The user allocates physical memory:
+
+ physmem = vcm_phys_alloc(memtype, SZ_2M + SZ_4K, CONTIGUOUS);
+
+ Creates an IOMMU VCM:
+
+ vcm_iommu = vcm_create(SZ_1K, SZ_16M);
+
+ Creates a one-to-one VCM:
+
+ vcm_onetoone = vcm_create(phys_addr, phys_size);
+
+ Creates a prebuilt VCM:
+
+ vcm_vmm = vcm_create_from_prebuilt(ext_vcm_id);
+
+ Associates all three with their respective devices and activates them:
+
+ avcm_iommu = vcm_assoc(vcm_iommu, dev_iommu, attr0);
+ avcm_onetoone = vcm_assoc(vcm_onetoone, dev_onetoone, attr1);
+ avcm_vmm = vcm_assoc(vcm_vmm, dev_cpu, attr2);
+ vcm_activate(avcm_iommu);
+ vcm_activate(avcm_onetoone);
+ vcm_activate(avcm_vmm);
+
+ Associations that fail return 0.
+
+ And finally, creates and backs reservations on all 3 such that they
+ all point to the same memory:
+
+ res_iommu = vcm_reserve(vcm_iommu, SZ_2M + SZ_4K, attr);
+ res_onetoone = vcm_reserve(vcm_onetoone, SZ_2M + SZ_4K, attr);
+ res_vmm = vcm_reserve(vcm_vmm, SZ_2M + SZ_4K, attr);
+ vcm_back(res_iommu, physmem);
+ vcm_back(res_onetoone, physmem);
+ vcm_back(res_vmm, physmem);
+
+ Like associations, reservations that fail return 0.
+
+VCM Summary
+-----------
+
+The VCMM is an attempt to abstract attributes of three distinct classes of
+mappings into one API. The VCMM allows users to reason about mappings as
+first class objects. It also allows memory mappings to flow from the
+traditional 4K mappings prevalent on systems today to more efficient block
+sizes. Finally, it allows users to manage mapping interoperation without
+becoming VMM experts. These features will allow future systems with many
+MMU mapped devices to interoperate simply and therefore correctly.
+
+
+IOMMU Hardware Control
+======================
+
+The VCM currently supports a single type of IOMMU, a Qualcomm System MMU
+(SMMU). The SMMU interface contains functions to map and unmap virtual
+addresses, perform address translations and initialize hardware. A
+Qualcomm SMMU can contain multiple MMU contexts. Each context can
+translate in parallel. All contexts in a SMMU share one global translation
+look-aside buffer (TLB).
+
+To support context muxing the SMMU module creates and manages device
+independent virtual contexts. These context abstractions are bound to
+actual contexts at run-time. Once bound, a context can be activated. This
+activation programs the underlying context with the virtual context,
+effecting a context switch.
+
+The following functions are all documented in:
+
+ arch/arm/mach-msm/include/mach/smmu_driver.h.
+
+Mapping
+-------
+
+To map and unmap a virtual page into physical space the VCM calls:
+
+ int smmu_map(struct smmu_dev *dev, unsigned long pa,
+ unsigned long va, unsigned long len, unsigned int attr);
+
+ int smmu_unmap(struct smmu_dev *dev, unsigned long va,
+ unsigned long len);
+
+ int smmu_update_start(struct smmu_dev *dev);
+
+ int smmu_update_done(struct smmu_dev *dev);
+
+The size given to map must be 4K, 64K, 1M or 16M and the VA and PA must be
+aligned to the given size. smmu_update_start() and smmu_update_done()
+should be called before and after each map or unmap.
+
+Translation
+-----------
+
+To request a hardware VA to PA translation on a single address the VCM
+calls:
+
+ unsigned long smmu_translate(struct smmu_dev *dev,
+ unsigned long va);
+
+Fault Handling
+--------------
+
+To register an interrupt handler for a context the VCM calls:
+
+ int smmu_hook_interrupt(struct smmu_dev *dev, vcm_handler handler,
+ void *data);
+
+The registered interrupt handler should return 1 if it wants the SMMU
+driver to retry the transaction and 0 if it wants the SMMU driver to
+terminate the transaction.
+
+Managing SMMU Initialization and Contexts
+-----------------------------------------
+
+SMMU hardware initialization and management happens in 2 steps. The first
+step initializes global SMMU devices and abstract device contexts. The
+second step binds contexts and devices.
+
+An SMMU hardware instance is built with:
+
+ int smmu_drvdata_init(struct smmu_driver *drv, unsigned long base,
+ int irq);
+
+An SMMU context is initialized and deinitialized with:
+
+ struct smmu_dev *smmu_ctx_init(int ctx);
+ int smmu_ctx_deinit(struct smmu_dev *dev);
+
+An abstract SMMU context is bound to a particular SMMU with:
+
+ int smmu_ctx_bind(struct smmu_dev *ctx, struct smmu_driver *drv);
+
+Activation
+----------
+
+Activation effects a context switch.
+
+Activation, deactivation and activation state testing are done with:
+
+ int smmu_activate(struct smmu_dev *dev);
+ int smmu_deactivate(struct smmu_dev *dev);
+ int smmu_is_active(struct smmu_dev *dev);
+
+
+Userspace Access to Devices with IOMMUs
+=======================================
+
+A device that issues transactions through an IOMMU must work with two
+APIs. The first API is the VCM. The VCM API is device independent. Users
+pass the VCM a dev_id and the VCM makes calls on the hardware device it
+has been configured with using this dev_id. The second API is whatever
+device topology has been created to organize the particular IOMMUs in a
+system. The only constraint on this second API is that it must give the
+user a single dev_id that it can pass through the VCM.
+
+For the Qualcomm SMMUs the second API consists of a tree of platform
+devices and two platform drivers as well as a context lookup function that
+traverses the device tree and returns a dev_id given a context name.
+
+Qualcomm SMMU Device Tree
+-------------------------
+
+The current implementation organizes the devices into a tree that looks
+like the following:
+
+smmu/
+ smmu0/
+ ctx0
+ ctx1
+ ctx2
+ smmu1/
+ ctx3
+
+
+Each context, ctx[n], and each SMMU, smmu[n], is given a name. Since users
+are interested in contexts, not SMMUs, the context name is passed to a
+function to find the dev_id associated with that name. The functions to
+find, free and get the base address (since the device probe function calls
+ioremap to map the SMMU's configuration registers into the kernel) are
+listed here:
+
+ struct smmu_dev *smmu_get_ctx_instance(char *ctx_name);
+ int smmu_free_ctx_instance(struct smmu_dev *dev);
+ unsigned long smmu_get_base_addr(struct smmu_dev *dev);
+
+Documentation for these functions is in:
+
+ arch/arm/mach-msm/include/mach/smmu_device.h
+
+Each context is given a dev node named after the context. For example:
+
+ /dev/vcodec_a_mm1
+ /dev/vcodec_b_mm2
+ /dev/vcodec_stream
+ etc...
+
+Users open, close and mmap these nodes to access VCM buffers from
+userspace in the same way that they used to open, close and mmap /dev
+nodes that represented large physically contiguous buffers (called PMEM
+buffers on Android).
+
+Example
+-------
+
+An abbreviated example is shown here:
+
+Users get the dev_id associated with their target context, create a VCM
+topology appropriate for their device and finally associate the VCMs of
+the topology with the contexts that will take the VCMs:
+
+ dev_id = smmu_get_ctx_instance("vcodec_a_stream");
+
+create vcm and needed topology
+
+ avcm = vcm_assoc(vcm, dev_id, attr);
+
+Tying it all Together
+---------------------
+
+VCMs, IOMMUs and the device tree all work to support system-wide memory
+mappings. The use of each API in this system allows users to concentrate
+on the relevant details without needing to worry about low-level
+details. The API's clear separation of memory spaces and the devices that
+support those memory spaces continues the Linux tradition of abstracting the
+what from the how.
+
+
+Maintainer: Zach Pfeffer <[email protected]>
--
1.7.0.2
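
As a reading aid for the "Usage Overview" and "Physical Memory Allocation
and Reservation Backing" sections of the patch above, here is a minimal
userspace sketch of the reserve/back relationship. It assumes (the patch
does not state this) that a physical allocation counts the reservations
backed by it and cannot be freed while any of them still points at it. The
toy_* names are hypothetical stand-ins, not the structures or functions
declared in include/linux/vcm.h.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for illustration only; not the kernel's struct physmem/res. */
struct toy_physmem {
	size_t len;	/* size of the physical allocation in bytes */
	int backers;	/* number of reservations currently backed by it */
};

struct toy_res {
	size_t len;			/* size of the reserved virtual range */
	struct toy_physmem *backing;	/* NULL until the reservation is backed */
};

/* Back a reservation; several reservations may share one allocation. */
static int toy_back(struct toy_res *res, struct toy_physmem *pm)
{
	if (!res || !pm || res->backing || res->len > pm->len)
		return -1;
	assert(pm->backers >= 0);
	res->backing = pm;
	pm->backers++;
	return 0;
}

/* Drop the link between a reservation and its physical memory. */
static int toy_unback(struct toy_res *res)
{
	if (!res || !res->backing)
		return -1;
	res->backing->backers--;
	res->backing = NULL;
	return 0;
}

/* Assumed policy: refuse to free memory that is still backing something. */
static int toy_phys_free(struct toy_physmem *pm)
{
	return (pm && pm->backers == 0) ? 0 : -1;
}
```

In this model, backing an IOMMU reservation and a VMM reservation with the
same allocation leaves backers at 2, and the allocation can only be freed
once both reservations have been unbacked.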

--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
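
The TLB argument in the patch's discussion of 4K, 64K and 1M blocks can be
made concrete with a little arithmetic. The sketch below counts how many
mappings a buffer needs when it is covered with the largest block size
first. The greedy policy, count_blocks() and struct block_counts are
assumptions made for illustration and are not taken from the patch; the
SZ_* values are defined locally so the code builds in userspace, and
address alignment is assumed rather than checked.

```c
#include <assert.h>
#include <stddef.h>

#define SZ_4K	0x00001000UL
#define SZ_64K	0x00010000UL
#define SZ_1M	0x00100000UL

/* Number of blocks of each size used to cover a buffer. */
struct block_counts {
	size_t n_1m;
	size_t n_64k;
	size_t n_4k;
};

/*
 * Cover len bytes using 1M blocks first, then 64K, then 4K.
 * len must be a multiple of 4K.
 */
static struct block_counts count_blocks(size_t len)
{
	struct block_counts c;

	assert(len % SZ_4K == 0);
	c.n_1m = len / SZ_1M;
	len %= SZ_1M;
	c.n_64k = len / SZ_64K;
	len %= SZ_64K;
	c.n_4k = len / SZ_4K;
	return c;
}

/* Each block needs one translation, so this is the mapping count. */
static size_t total_mappings(struct block_counts c)
{
	return c.n_1m + c.n_64k + c.n_4k;
}
```

For the 2M + 4K buffer used in the patch's example, this gives two 1M
blocks plus one 4K block, 3 mappings in total, where covering the same
buffer with 4K blocks alone would take 513.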


2010-07-03 19:06:52

by Eric W. Biederman

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

Zach Pfeffer <[email protected]> writes:

> This patch contains the documentation for the API, termed the Virtual
> Contiguous Memory Manager. Its use would allow all of the IOMMU to VM,
> VM to device and device to IOMMU interoperation code to be refactored
> into platform independent code.
>
> Comments, suggestions and criticisms are welcome and wanted.

How does this differ from the dma api?

You probably want to copy linux-arch on something that is aimed at
affecting multiple architectures like this proposal is.

Eric


>
> Signed-off-by: Zach Pfeffer <[email protected]>
> ---
> Documentation/vcm.txt | 587 +++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 587 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/vcm.txt
>
> diff --git a/Documentation/vcm.txt b/Documentation/vcm.txt
> new file mode 100644
> index 0000000..1c6a8be
> --- /dev/null
> +++ b/Documentation/vcm.txt
> @@ -0,0 +1,587 @@
> +What is this document about?
> +============================
> +
> +This document covers how to use the Virtual Contiguous Memory Manager
> +(VCMM), how the first implementation works with a specific low-level
> +Input/Output Memory Management Unit (IOMMU) and the way the VCMM is used
> +from user-space. It also contains a section that describes why something
> +like the VCMM is needed in the kernel.
> +
> +If anything in this document is wrong, please send patches to the
> +maintainer of this file, listed at the bottom of the document.
> +
> +
> +The Virtual Contiguous Memory Manager
> +=====================================
> +
> +The VCMM was built to solve the system-wide memory mapping issues that
> +occur when many bus-masters have IOMMUs.
> +
> +An IOMMU maps device addresses to physical addresses. It also insulates
> +the system from spurious or malicious device bus transactions and allows
> +fine-grained mapping attribute control. The Linux kernel core does not
> +contain a generic API to handle IOMMU mapped memory; device driver writers
> +must implement device specific code to interoperate with the Linux kernel
> +core. As the number of IOMMUs increases, coordinating the many address
> +spaces mapped by all discrete IOMMUs becomes difficult without in-kernel
> +support.
> +
> +The VCMM API enables device independent IOMMU control, virtual memory
> +manager (VMM) interoperation and non-IOMMU enabled device interoperation
> +by treating devices with or without IOMMUs and all CPUs with or without
> +MMUs, their mapping contexts and their mappings using common
> +abstractions. Physical hardware is given a generic device type and mapping
> +contexts are abstracted into Virtual Contiguous Memory (VCM)
> +regions. Users "reserve" memory from VCMs and "back" their reservations
> +with physical memory.
> +
> +Why the VCMM is Needed
> +----------------------
> +
> +Driver writers who control devices with IOMMUs must contend with device
> +control and memory management. Driver writers have a large device driver
> +API that they can leverage to control their devices, but they are lacking
> +a unified API to help them program mappings into IOMMUs and share those
> +mappings with other devices and CPUs in the system.
> +
> +Sharing is complicated by Linux's CPU-centric VMM. The CPU-centric model
> +generally makes sense because average hardware only contains an MMU for the
> +CPU and possibly a graphics MMU. If every device in the system has one or
> +more MMUs the CPU-centric memory management (MM) programming model breaks
> +down.
> +
> +Abstracting IOMMU device programming into a common API has already begun
> +in the Linux kernel. It was built to abstract the difference between AMD
> +and Intel IOMMUs to support x86 virtualization on both platforms. The
> +interface is listed in include/linux/iommu.h. It contains
> +interfaces for mapping and unmapping as well as domain management. This
> +interface has not gained widespread use outside the x86; PA-RISC, Alpha
> +and SPARC architectures and ARM and PowerPC platforms all use their own
> +mapping modules to control their IOMMUs. The VCMM contains an IOMMU
> +programming layer, but since its abstraction supports map management
> +independent of device control, the layer is not used directly. This
> +higher-level view enables a new kernel service, not just an IOMMU
> +interoperation layer.
> +
> +The General Idea: Map Management using Graphs
> +---------------------------------------------
> +
> +Looking at mapping from a system-wide perspective reveals a general graph
> +problem. The VCMM's API is built to manage the general mapping graph. Each
> +node that talks to memory, either through an MMU or directly (physically
> +mapped) can be thought of as the device end of a mapping edge. The other
> +end is the physical memory (or intermediate virtual space) that is
> +mapped.
> +
> +In the direct-mapped case the device is assigned a one-to-one MMU. This
> +scheme allows direct mapped devices to participate in general graph
> +management.
> +
> +The CPU nodes can also be brought under the same mapping abstraction with
> +the use of a light overlay on the existing VMM. This light overlay allows
> +VMM-managed mappings to interoperate with the common API. The light
> +overlay enables this without substantial modifications to the existing
> +VMM.
> +
> +In addition to CPU nodes that are running Linux (and the VMM), remote CPU
> +nodes that may be running other operating systems can be brought into the
> +general abstraction. Routing all memory management requests from a remote
> +node through the central memory management framework enables new features
> +like system-wide memory migration. This feature may only be feasible for
> +large buffers that are managed outside of the fast-path, but having remote
> +allocation in a system enables features that are impossible to build
> +without it.
> +
> +The fundamental objects that support graph-based map management are:
> +
> +1) Virtual Contiguous Memory Regions
> +
> +2) Reservations
> +
> +3) Associated Virtual Contiguous Memory Regions
> +
> +4) Memory Targets
> +
> +5) Physical Memory Allocations
> +
> +Usage Overview
> +--------------
> +
> +In a nutshell, users allocate Virtual Contiguous Memory Regions and
> +associate those regions with one or more devices by creating an Associated
> +Virtual Contiguous Memory Region. Users then create Reservations from the
> +Virtual Contiguous Memory Region. At this point no physical memory has
> +been committed to the reservation. To associate physical memory with a
> +reservation a Physical Memory Allocation is created and the Reservation is
> +backed with this allocation.
> +
> +include/linux/vcm.h includes comments documenting each API.
> +
> +Virtual Contiguous Memory Regions
> +---------------------------------
> +
> +A Virtual Contiguous Memory Region (VCM) abstracts the memory space a
> +device sees. The addresses of the region are only used by the devices
> +which are associated with the region. This address space would normally be
> +implemented as a device page table.
> +
> +A VCM is created and destroyed with three functions:
> +
> + struct vcm *vcm_create(unsigned long start_addr, unsigned long len);
> +
> + struct vcm *vcm_create_from_prebuilt(size_t ext_vcm_id);
> +
> + int vcm_free(struct vcm *vcm);
> +
> +start_addr is an offset into the address space where allocations will
> +start from. len is the length from start_addr of the VCM. Both creation
> +functions return an instance of a VCM.
> +
> +ext_vcm_id is used to pass a request to the VMM to generate a VCM
> +instance. In the current implementation the call simply makes a note that
> +the VCM instance is a VMM VCM instance for use by other interfaces. This
> +muxing is seen throughout the implementation.
> +
> +vcm_create() and vcm_create_from_prebuilt() produce VCM instances for
> +virtually mapped devices (IOMMUs and CPUs). To create a one-to-one mapped
> +VCM, users pass the start_addr and len of the physical region. The VCMM
> +matches this and records that the VCM instance is a one-to-one VCM.
> +
> +The newly created VCM instance can be passed to any function that needs to
> +operate on or with a virtual contiguous memory region. Its main attributes
> +are a start_addr and a len as well as an internal setting that allows the
> +implementation to mux between true virtual spaces, one-to-one mapped
> +spaces and VMM managed spaces.
> +
> +The current implementation uses the genalloc library to manage the VCM for
> +IOMMU devices. Return values and more in-depth per-function documentation
> +for these and the ones listed below are in include/linux/vcm.h.
> +
> +Reservations
> +------------
> +
> +A Reservation is a contiguous region allocated from a VCM. There is no
> +physical memory associated with it.
> +
> +A Reservation is created and destroyed with:
> +
> + struct res *vcm_reserve(struct vcm *vcm, size_t len, u32 attr);
> +
> + int vcm_unreserve(struct res *res);
> +
> +A vcm is a VCM created above. len is the length of the request. It can be
> +up to the length of the VCM region the reservation is being created
> +from. attr are mapping attributes: read, write, execute, user, supervisor,
> +secure, not-cached, write-back/write-allocate, write-back/no
> +write-allocate, write-through. These attrs are appropriate for ARM but can
> +be changed to match any architecture.
> +
> +The implementation calls gen_pool_alloc() for IOMMU devices,
> +alloc_vm_area() for VMM areas and is a pass-through for one-to-one mapped
> +areas.
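
A reservation carves a contiguous range out of a VCM without touching physical memory. The following userspace sketch models that with a simple bump allocator. The toy_vcm and toy_res types and the toy_reserve() function are hypothetical stand-ins, not the VCMM's own structures, and the bump allocator is a simplification of genalloc that never frees.

```c
#include <stddef.h>

/* Toy model of a VCM: a window of device addresses. */
struct toy_vcm {
	unsigned long start_addr;	/* first device address in the VCM */
	unsigned long len;		/* size of the VCM in bytes */
	unsigned long used;		/* bytes already handed out */
};

/* Toy model of a reservation: a range with no physical backing. */
struct toy_res {
	unsigned long dev_addr;		/* device address of the range */
	unsigned long len;		/* size of the range in bytes */
	int valid;			/* 0 if the reservation failed */
};

/*
 * Reserve len bytes from the VCM. The request fails if it is empty or
 * larger than the space that is still free in the VCM.
 */
static struct toy_res toy_reserve(struct toy_vcm *vcm, unsigned long len)
{
	struct toy_res res = { 0, 0, 0 };

	if (len == 0 || len > vcm->len - vcm->used)
		return res;

	res.dev_addr = vcm->start_addr + vcm->used;
	res.len = len;
	res.valid = 1;
	vcm->used += len;
	return res;
}
```

In this model a request larger than the remaining space simply fails, which mirrors the statement above that len can be at most the length of the VCM.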
> +
> +Associated Virtual Contiguous Memory Regions and Activation
> +-----------------------------------------------------------
> +
> +An Associated Virtual Contiguous Memory Region (AVCM) is a mapping of a
> +VCM to a device. The mapping can be active or inactive.
> +
> +An AVCM is managed with:
> +
> + struct avcm *vcm_assoc(struct vcm *vcm, struct device *dev, u32 attr);
> +
> + int vcm_deassoc(struct avcm *avcm);
> +
> + int vcm_activate(struct avcm *avcm);
> +
> + int vcm_deactivate(struct avcm *avcm);
> +
> +A VCM instance is a VCM created above. A dev is an opaque device handle
> +that is passed down to the device driver the VCMM muxes in to handle a
> +request. attr are association attributes: split, use-high or
> +use-low. split controls which transactions hit a high-address page-table
> +and which transactions hit a low-address page-table. For instance, all
> +transactions whose most significant address bit is one would use the
> +high-address page-table, any other transaction would use the low address
> +page-table. This scheme is ARM-specific and could be changed in other
> +architectures. One VCM instance can be associated with many devices and
> +many VCM instances can be associated with one device.
> +
> +An AVCM is only a link. To program and deprogram a device with a VCM the
> +user calls vcm_activate() and vcm_deactivate(). For IOMMU devices,
> +activating a mapping programs the base address of a page table into an
> +IOMMU. For VMM and one-to-one based devices, mappings are active
> +immediately but the API does require an activation call for them for
> +internal reference counting.
> +
> +Memory Targets
> +--------------
> +
> +A Memory Target is a platform independent way of specifying a physical
> +pool; it abstracts a pool of physical memory. The physical memory pool may
> +be physically discontiguous, need to be allocated from in a unique way or
> +have other user-defined attributes.
> +
> +Physical Memory Allocation and Reservation Backing
> +--------------------------------------------------
> +
> +Physical memory is allocated as a separate step from reserving
> +memory. This allows multiple reservations to be backed by the same
> +physical memory.
> +
> +A Physical Memory Allocation is managed using the following functions:
> +
> + struct physmem *vcm_phys_alloc(enum memtype_t memtype,
> + size_t len, u32 attr);
> +
> + int vcm_phys_free(struct physmem *physmem);
> +
> + int vcm_back(struct res *res, struct physmem *physmem);
> +
> + int vcm_unback(struct res *res);
> +
> +attr can include an alignment request, a specification to map memory using
> +various block sizes and/or to use physically contiguous memory. memtype is
> +one of the memory types listed in Memory Targets.
> +
> +The current implementation manages two pools of memory. One pool is a
> +contiguous block of memory and the other is a set of contiguous block
> +pools. In the current implementation the block pools contain 4K, 64K and
> +1M blocks. The physical allocator does not try to split blocks from the
> +contiguous block pools to satisfy requests.
> +
> +The use of 4K, 64K and 1M blocks solves a problem with some IOMMU
> +hardware. IOMMUs are placed in front of multimedia engines to provide a
> +contiguous address space to the device. Multimedia devices need large
> +buffers and large buffers may map to a large number of physical
> +blocks. IOMMUs tend to have small translation lookaside buffers
> +(TLBs). Since the TLB is small the number of physical blocks that map a
> +given range needs to be small or else the IOMMU will continually fetch new
> +translations during a typical streamed multimedia flow. By using a 1 MB
> +mapping (or 64K mapping) instead of a 4K mapping the number of misses can
> +be minimized, allowing the multimedia block to meet its performance goals.
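
To make the TLB argument concrete, the sketch below counts how many mappings are needed to cover a buffer when it is split greedily into 1M, 64K and 4K blocks, compared with 4K blocks only. The count_blocks() helper, the BLK_* macros and the greedy largest-first policy are assumptions for illustration and are not taken from the VCMM implementation.

```c
#include <stddef.h>

#define BLK_4K	(4UL * 1024)
#define BLK_64K	(64UL * 1024)
#define BLK_1M	(1024UL * 1024)

/* Block sizes, largest first, as used by the two cases compared here. */
static const unsigned long mixed_pools[] = { BLK_1M, BLK_64K, BLK_4K };
static const unsigned long four_k_only[] = { BLK_4K };

/*
 * Count the blocks needed to cover len bytes, always taking the largest
 * block size that still fits. sizes[] must be sorted from largest to
 * smallest. A remainder smaller than the smallest block is rounded up
 * to one more block.
 */
static size_t count_blocks(unsigned long len, const unsigned long *sizes,
			   size_t nsizes)
{
	size_t i, count = 0;

	for (i = 0; i < nsizes; i++) {
		count += len / sizes[i];
		len %= sizes[i];
	}
	if (len)
		count++;
	return count;
}
```

For a 2 MB + 4 KB buffer, count_blocks() with mixed_pools gives 3 mappings (two 1M blocks and one 4K block), while four_k_only gives 513. Each mapping is a candidate TLB entry, so the smaller count is what keeps a streaming device from refetching translations.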
> +
> +Low Level Control
> +-----------------
> +
> +It is necessary in some instances to access attributes and provide
> +higher-level control of the low-level hardware abstraction. The API
> +contains many members and functions for this task but the two that are
> +typically used are:
> +
> + res->dev_addr;
> +
> + int vcm_hook(struct device *dev, vcm_handler handler, void *data);
> +
> +res->dev_addr is the device address given a reservation. This device
> +address is a virtual IOMMU address for reservations on IOMMU VCMs, a
> +virtual VMM address for reservations on VMM VCMs and a virtual (really
> +physical since it is one-to-one mapped) address for one-to-one devices.
> +
> +The function, vcm_hook, allows a caller in the kernel to register a
> +user_handler. The handler is passed the data member passed to vcm_hook
> +during a fault. The user can return 1 to indicate that the underlying
> +driver should handle the fault and retry the transaction or they can
> +return 0 to halt the transaction. If the user doesn't register a
> +handler the low-level driver will print a warning and terminate the
> +transaction.
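
The return convention of the fault handler can be modelled in userspace as follows. The fault_handler_t typedef is an assumed shape, since the real vcm_handler typedef lives in include/linux/vcm.h and is not shown here. dispatch_fault(), retry_limited_handler() and struct fault_stats are hypothetical names used only for this sketch.

```c
#include <stddef.h>

/* Assumed handler shape: receives the data pointer given at hook time. */
typedef int (*fault_handler_t)(void *data);

/* Example private data: allow a bounded number of retries. */
struct fault_stats {
	unsigned int faults;		/* faults seen so far */
	unsigned int max_retries;	/* retries allowed before halting */
};

/* Return 1 to ask for a retry, 0 to halt the transaction. */
static int retry_limited_handler(void *data)
{
	struct fault_stats *stats = data;

	if (stats->faults++ < stats->max_retries)
		return 1;
	return 0;
}

/*
 * Model of the low-level driver's decision: with no handler registered
 * the transaction is terminated (the real driver would also warn).
 */
static int dispatch_fault(fault_handler_t handler, void *data)
{
	if (!handler)
		return 0;
	return handler(data);
}
```

A handler that always returns 1 would retry forever on a persistent fault, which is why this sketch bounds the retries.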
> +
> +A Detailed Walk Through
> +-----------------------
> +
> +The following call sequence walks through a typical allocation
> +sequence. In the first stage the memory for a device is reserved and
> +backed. This occurs without mapping the memory into a VMM VCM region. The
> +second stage maps the first VCM region into a VMM VCM region so the kernel
> +can read or write it. The second stage is not necessary if the VMM does
> +not need to read or modify the contents of the original mapping.
> +
> + Stage 1: Map and Allocate Memory for a Device
> +
> + The call sequence starts by creating a VCM region:
> +
> + vcm = vcm_create(start_addr, len);
> +
> + The next call associates a VCM region with a device:
> +
> + avcm = vcm_assoc(vcm, dev, attr);
> +
> + To activate the association, users call vcm_activate() on the avcm from
> + the associate call. This programs the underlying device with the
> + mappings.
> +
> + ret = vcm_activate(avcm);
> +
> + Once a VCM region is created and associated it can be reserved from
> + with:
> +
> + res = vcm_reserve(vcm, res_len, res_attr);
> +
> + A user then allocates physical memory with:
> +
> + physmem = vcm_phys_alloc(memtype, len, phys_attr);
> +
> + To back the reservation with the physical memory allocation the user
> + calls:
> +
> + vcm_back(res, physmem);
> +
> +
> + Stage 2: Map the Device's Memory into the VMM's VCM region
> +
> + If the VMM needs to read and/or write the region that was just created,
> + the following calls are made.
> +
> + The first call creates a prebuilt VCM with:
> +
> + vcm_vmm = vcm_create_from_prebuilt(ext_vcm_id);
> +
> + The prebuilt VCM is associated with the CPU device and activated with:
> +
> + avcm_vmm = vcm_assoc(vcm_vmm, dev_cpu, attr);
> + vcm_activate(avcm_vmm);
> +
> + A reservation is made on the VMM VCM with:
> +
> + res_vmm = vcm_reserve(vcm_vmm, res_len, attr);
> +
> + Finally, once the topology has been set up a vcm_back() allows the VMM
> + to read the memory using the physmem generated in stage 1:
> +
> + vcm_back(res_vmm, physmem);
> +
> +Mapping IOMMU, one-to-one and VMM Reservations
> +----------------------------------------------
> +
> +The following example demonstrates mapping IOMMU, one-to-one and VMM
> +reservations to the same physical memory. It shows the use of phys_addr
> +and phys_size to create a contiguous VCM for one-to-one mapped devices.
> +
> + The user allocates physical memory:
> +
> + physmem = vcm_phys_alloc(memtype, SZ_2M + SZ_4K, CONTIGUOUS);
> +
> + Creates an IOMMU VCM:
> +
> + vcm_iommu = vcm_create(SZ_1K, SZ_16M);
> +
> + Creates a one-to-one VCM:
> +
> + vcm_onetoone = vcm_create(phys_addr, phys_size);
> +
> + Creates a prebuilt VCM:
> +
> + vcm_vmm = vcm_create_from_prebuilt(ext_vcm_id);
> +
> + Associates all three with their respective devices and activates them:
> +
> + avcm_iommu = vcm_assoc(vcm_iommu, dev_iommu, attr0);
> + avcm_onetoone = vcm_assoc(vcm_onetoone, dev_onetoone, attr1);
> + avcm_vmm = vcm_assoc(vcm_vmm, dev_cpu, attr2);
> + vcm_activate(avcm_iommu);
> + vcm_activate(avcm_onetoone);
> + vcm_activate(avcm_vmm);
> +
> + Associations that fail return 0.
> +
> + And finally, creates and backs reservations on all 3 such that they
> + all point to the same memory:
> +
> + res_iommu = vcm_reserve(vcm_iommu, SZ_2M + SZ_4K, attr);
> + res_onetoone = vcm_reserve(vcm_onetoone, SZ_2M + SZ_4K, attr);
> + res_vmm = vcm_reserve(vcm_vmm, SZ_2M + SZ_4K, attr);
> + vcm_back(res_iommu, physmem);
> + vcm_back(res_onetoone, physmem);
> + vcm_back(res_vmm, physmem);
> +
> + Like associations, reservations that fail return 0.
> +
> +VCM Summary
> +-----------
> +
> +The VCMM is an attempt to abstract attributes of three distinct classes of
> +mappings into one API. The VCMM allows users to reason about mappings as
> +first class objects. It also allows memory mappings to flow from the
> +traditional 4K mappings prevalent on systems today to more efficient block
> +sizes. Finally, it allows users to manage mapping interoperation without
> +becoming VMM experts. These features will allow future systems with many
> +MMU mapped devices to interoperate simply and therefore correctly.
> +
> +
> +IOMMU Hardware Control
> +======================
> +
> +The VCM currently supports a single type of IOMMU, a Qualcomm System MMU
> +(SMMU). The SMMU interface contains functions to map and unmap virtual
> +addresses, perform address translations and initialize hardware. A
> +Qualcomm SMMU can contain multiple MMU contexts. Each context can
> +translate in parallel. All contexts in a SMMU share one global translation
> +look-aside buffer (TLB).
> +
> +To support context muxing the SMMU module creates and manages device
> +independent virtual contexts. These context abstractions are bound to
> +actual contexts at run-time. Once bound, a context can be activated. This
> +activation programs the underlying context with the virtual context,
> +effecting a context switch.
> +
> +The following functions are all documented in:
> +
> + arch/arm/mach-msm/include/mach/smmu_driver.h.
> +
> +Mapping
> +-------
> +
> +To map and unmap a virtual page into physical space the VCM calls:
> +
> + int smmu_map(struct smmu_dev *dev, unsigned long pa,
> + unsigned long va, unsigned long len, unsigned int attr);
> +
> + int smmu_unmap(struct smmu_dev *dev, unsigned long va,
> + unsigned long len);
> +
> + int smmu_update_start(struct smmu_dev *dev);
> +
> + int smmu_update_done(struct smmu_dev *dev);
> +
> +The size given to map must be 4K, 64K, 1M or 16M and the VA and PA must be
> +aligned to the given size. smmu_update_start() and smmu_update_done()
> +should be called before and after each map or unmap.
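
The size and alignment rule for smmu_map() can be checked before the call. The helper below is a hypothetical userspace sketch of such a check and is not part of the SMMU driver. It assumes only the constraint stated above: len is one of 4K, 64K, 1M or 16M, and both addresses are aligned to len.

```c
#include <stdbool.h>

/*
 * Return true if (pa, va, len) satisfies the mapping constraint:
 * len is a supported block size and both addresses are len-aligned.
 */
static bool smmu_map_args_ok(unsigned long pa, unsigned long va,
			     unsigned long len)
{
	switch (len) {
	case 0x1000UL:		/* 4K */
	case 0x10000UL:		/* 64K */
	case 0x100000UL:	/* 1M */
	case 0x1000000UL:	/* 16M */
		break;
	default:
		return false;
	}

	/* len is a power of two, so len - 1 masks the offset bits. */
	return ((pa | va) & (len - 1)) == 0;
}
```

A caller would run this check first and only then bracket the smmu_map() call with smmu_update_start() and smmu_update_done().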
> +
> +Translation
> +-----------
> +
> +To request a hardware VA to PA translation on a single address the VCM
> +calls:
> +
> + unsigned long smmu_translate(struct smmu_dev *dev,
> + unsigned long va);
> +
> +Fault Handling
> +--------------
> +
> +To register an interrupt handler for a context the VCM calls:
> +
> + int smmu_hook_interrupt(struct smmu_dev *dev, vcm_handler handler,
> + void *data);
> +
> +The registered interrupt handler should return 1 if it wants the SMMU
> +driver to retry the transaction and 0 if it wants the SMMU driver to
> +terminate the transaction.
> +
> +Managing SMMU Initialization and Contexts
> +-----------------------------------------
> +
> +SMMU hardware initialization and management happens in 2 steps. The first
> +step initializes global SMMU devices and abstract device contexts. The
> +second step binds contexts and devices.
> +
> +An SMMU hardware instance is built with:
> +
> + int smmu_drvdata_init(struct smmu_driver *drv, unsigned long base,
> + int irq);
> +
> +An SMMU context is initialized and deinitialized with:
> +
> + struct smmu_dev *smmu_ctx_init(int ctx);
> + int smmu_ctx_deinit(struct smmu_dev *dev);
> +
> +An abstract SMMU context is bound to a particular SMMU with:
> +
> + int smmu_ctx_bind(struct smmu_dev *ctx, struct smmu_driver *drv);
> +
> +Activation
> +----------
> +
> +Activation effects a context switch.
> +
> +Activation, deactivation and activation state testing are done with:
> +
> + int smmu_activate(struct smmu_dev *dev);
> + int smmu_deactivate(struct smmu_dev *dev);
> + int smmu_is_active(struct smmu_dev *dev);
> +
> +
> +Userspace Access to Devices with IOMMUs
> +=======================================
> +
> +A device that issues transactions through an IOMMU must work with two
> +APIs. The first API is the VCM. The VCM API is device independent. Users
> +pass the VCM a dev_id and the VCM makes calls on the hardware device it
> +has been configured with using this dev_id. The second API is whatever
> +device topology has been created to organize the particular IOMMUs in a
> +system. The only constraint on this second API is that it must give the
> +user a single dev_id that it can pass through the VCM.
> +
> +For the Qualcomm SMMUs the second API consists of a tree of platform
> +devices and two platform drivers as well as a context lookup function that
> +traverses the device tree and returns a dev_id given a context name.
> +
> +Qualcomm SMMU Device Tree
> +-------------------------
> +
> +The current implementation organizes the devices into a tree that looks
> +like the following:
> +
> +smmu/
> + smmu0/
> + ctx0
> + ctx1
> + ctx2
> + smmu1/
> + ctx3
> +
> +
> +Each context, ctx[n] and each smmu, smmu[n] is given a name. Since users
> +are interested in contexts not smmus, the context name is passed to a
> +function to find the dev_id associated with that name. The functions to
> +find, free and get the base address (since the device probe function calls
> +ioremap to map the SMMU's configuration registers into the kernel) are
> +listed here:
> +
> + struct smmu_dev *smmu_get_ctx_instance(char *ctx_name);
> + int smmu_free_ctx_instance(struct smmu_dev *dev);
> + unsigned long smmu_get_base_addr(struct smmu_dev *dev);
> +
> +Documentation for these functions is in:
> +
> + arch/arm/mach-msm/include/mach/smmu_device.h
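
The lookup by context name can be pictured with a flat table that mirrors the tree above. This is a userspace sketch only. The toy_ctx type, the toy_tree table and toy_get_ctx() are hypothetical, and the real smmu_get_ctx_instance() walks platform devices rather than an array.

```c
#include <stddef.h>
#include <string.h>

/* One context and the index of the SMMU that owns it. */
struct toy_ctx {
	const char *name;
	int smmu;
};

/* Flattened copy of the example tree: smmu0 has ctx0-2, smmu1 has ctx3. */
static const struct toy_ctx toy_tree[] = {
	{ "ctx0", 0 },
	{ "ctx1", 0 },
	{ "ctx2", 0 },
	{ "ctx3", 1 },
};

/* Return the context with the given name, or NULL if there is none. */
static const struct toy_ctx *toy_get_ctx(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(toy_tree) / sizeof(toy_tree[0]); i++) {
		if (strcmp(toy_tree[i].name, name) == 0)
			return &toy_tree[i];
	}
	return NULL;
}
```

The caller never names an SMMU. It asks for a context and receives a handle, which is the property the dev_id scheme relies on.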
> +
> +Each context is given a dev node named after the context. For example:
> +
> + /dev/vcodec_a_mm1
> + /dev/vcodec_b_mm2
> + /dev/vcodec_stream
> + etc...
> +
> +Users open, close and mmap these nodes to access VCM buffers from
> +userspace in the same way that they used to open, close and mmap /dev
> +nodes that represented large physically contiguous buffers (called PMEM
> +buffers on Android).
> +
> +Example
> +-------
> +
> +An abbreviated example is shown here:
> +
> +Users get the dev_id associated with their target context, create a VCM
> +topology appropriate for their device and finally associate the VCMs of
> +the topology with the contexts that will take the VCMs:
> +
> + dev_id = smmu_get_ctx_instance("vcodec_a_stream");
> +
> +create vcm and needed topology
> +
> + avcm = vcm_assoc(vcm, dev_id, attr);
> +
> +Tying it all Together
> +---------------------
> +
> +VCMs, IOMMUs and the device tree all work to support system-wide memory
> +mappings. The use of each API in this system allows users to concentrate
> +on the relevant details without needing to worry about low-level
> +details. The API's clear separation of memory spaces and the devices that
> +support those memory spaces continues the Linux tradition of abstracting the
> +what from the how.
> +
> +
> +Maintainer: Zach Pfeffer <[email protected]>
> --
> 1.7.0.2
>
> --
> Sent by an employee of the Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

2010-07-07 22:44:32

by Zach Pfeffer

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

Eric W. Biederman wrote:
> Zach Pfeffer <[email protected]> writes:
>
>> This patch contains the documentation for the API, termed the Virtual
>> Contiguous Memory Manager. Its use would allow all of the IOMMU to VM,
>> VM to device and device to IOMMU interoperation code to be refactored
>> into platform independent code.
>>
>> Comments, suggestions and criticisms are welcome and wanted.
>
> How does this differ from the dma api?

The DMA API handles the allocation and use of DMA channels. It can
configure physical transfer settings, manage scatter-gather lists,
etc.

The VCM is a different thing. The VCM allows a Virtual Contiguous
Memory region to be created and associated with a device that
addresses the bus virtually or physically. If the bus is addressed
physically the Virtual Contiguous Memory is one-to-one mapped. If the
bus is virtually mapped then a contiguous virtual reservation may be
backed by a discontiguous list of physical blocks. This discontiguous
list could be a SG list or just a list of physical blocks that would
back the entire virtual reservation.

The VCM allows all device buffers to be passed between all devices in
the system without passing those buffers through each domain's
API. This means that instead of writing code to interoperate between
DMA engines, IOMMU mapped spaces, CPUs and physically addressed
devices the user can simply target a device with a buffer using the
same API regardless of how that device maps or otherwise accesses the
buffer.

2010-07-07 23:08:20

by Russell King - ARM Linux

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, Jul 07, 2010 at 03:44:27PM -0700, Zach Pfeffer wrote:
> The DMA API handles the allocation and use of DMA channels. It can
> configure physical transfer settings, manage scatter-gather lists,
> etc.

You're confused about what the DMA API is. You're talking about
the DMA engine subsystem (drivers/dma) not the DMA API (see
Documentation/DMA-API.txt, include/linux/dma-mapping.h, and
arch/arm/include/asm/dma-mapping.h)

> The VCM allows all device buffers to be passed between all devices in
> the system without passing those buffers through each domain's
> API. This means that instead of writing code to interoperate between
> DMA engines, IOMMU mapped spaces, CPUs and physically addressed
> devices the user can simply target a device with a buffer using the
> same API regardless of how that device maps or otherwise accesses the
> buffer.

With the DMA API, if we have a SG list which refers to the physical
pages (as a struct page, offset, length tuple), the DMA API takes
care of dealing with CPU caches and IOMMUs to make the data in the
buffer visible to the target device. It provides you with a set of
cookies referring to the SG lists, which may be coalesced if the
IOMMU can do so.

If you have a kernel virtual address, the DMA API has single buffer
mapping/unmapping functions to do the same thing, and provide you
with a cookie to pass to the device to refer to that buffer.

These cookies are whatever the device needs to be able to access
the buffer - for instance, if system SDRAM is located at 0xc0000000
virtual, 0x80000000 physical and 0x40000000 as far as the DMA device
is concerned, then the cookie for a buffer at 0xc0000000 virtual will
be 0x40000000 and not 0x80000000.
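
The arithmetic in that example can be written out as a small userspace sketch. It assumes a fixed linear offset between the three views of SDRAM, which matches the numbers above but does not hold once an IOMMU sits between the device and memory. The function and macro names are hypothetical.

```c
#include <stdint.h>

/* Base of system SDRAM as seen from each side, from the example. */
#define SDRAM_VIRT_BASE	UINT32_C(0xc0000000)	/* kernel virtual */
#define SDRAM_PHYS_BASE	UINT32_C(0x80000000)	/* CPU physical */
#define SDRAM_BUS_BASE	UINT32_C(0x40000000)	/* as seen by the device */

/* Kernel virtual address to CPU physical address. */
static uint32_t toy_virt_to_phys(uint32_t vaddr)
{
	return vaddr - SDRAM_VIRT_BASE + SDRAM_PHYS_BASE;
}

/* Kernel virtual address to the cookie handed to the device. */
static uint32_t toy_virt_to_dma_cookie(uint32_t vaddr)
{
	return vaddr - SDRAM_VIRT_BASE + SDRAM_BUS_BASE;
}
```

For a buffer at 0xc0000000 virtual, the physical address is 0x80000000 but the cookie is 0x40000000, which is the value the device must be given.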

2010-07-08 23:59:58

by Zach Pfeffer

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

Russell King - ARM Linux wrote:
> On Wed, Jul 07, 2010 at 03:44:27PM -0700, Zach Pfeffer wrote:
>> The DMA API handles the allocation and use of DMA channels. It can
>> configure physical transfer settings, manage scatter-gather lists,
>> etc.
>
> You're confused about what the DMA API is. You're talking about
> the DMA engine subsystem (drivers/dma) not the DMA API (see
> Documentation/DMA-API.txt, include/linux/dma-mapping.h, and
> arch/arm/include/asm/dma-mapping.h)

Thanks for the clarification.

>
>> The VCM allows all device buffers to be passed between all devices in
>> the system without passing those buffers through each domain's
>> API. This means that instead of writing code to interoperate between
>> DMA engines, IOMMU mapped spaces, CPUs and physically addressed
>> devices the user can simply target a device with a buffer using the
>> same API regardless of how that device maps or otherwise accesses the
>> buffer.
>
> With the DMA API, if we have a SG list which refers to the physical
> pages (as a struct page, offset, length tuple), the DMA API takes
> care of dealing with CPU caches and IOMMUs to make the data in the
> buffer visible to the target device. It provides you with a set of
> cookies referring to the SG lists, which may be coalesced if the
> IOMMU can do so.
>
> If you have a kernel virtual address, the DMA API has single buffer
> mapping/unmapping functions to do the same thing, and provide you
> with a cookie to pass to the device to refer to that buffer.
>
> These cookies are whatever the device needs to be able to access
> the buffer - for instance, if system SDRAM is located at 0xc0000000
> virtual, 0x80000000 physical and 0x40000000 as far as the DMA device
> is concerned, then the cookie for a buffer at 0xc0000000 virtual will
> be 0x40000000 and not 0x80000000.

It sounds like I've got some work to do. I appreciate the feedback.

The problem I'm trying to solve boils down to this: map a set of
contiguous physical buffers to an aligned IOMMU address. I need to
allocate the set of physical buffers in a particular way: use 1 MB
contiguous physical memory, then 64 KB, then 4 KB, etc. and I need to
align the IOMMU address in a particular way. I also need to swap out the
IOMMU address spaces and map the buffers into the kernel.

I have this all solved, but it sounds like I'll need to migrate to the DMA
API to upstream it.


2010-07-12 01:27:01

by FUJITA Tomonori

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Thu, 08 Jul 2010 16:59:52 -0700
Zach Pfeffer <[email protected]> wrote:

> The problem I'm trying to solve boils down to this: map a set of
> contiguous physical buffers to an aligned IOMMU address. I need to
> allocate the set of physical buffers in a particular way: use 1 MB
> contiguous physical memory, then 64 KB, then 4 KB, etc. and I need to
> align the IOMMU address in a particular way.

Sounds like the DMA API already supports what you want.

You can set segment_boundary_mask in struct device_dma_parameters if
you want to align the IOMMU address. See IOMMU implementations that
support dma_get_seg_boundary() properly.
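
As a rough illustration of what a segment boundary mask constrains, the userspace sketch below tests whether a mapping would cross a boundary. It assumes the usual reading of the mask, where a mapping must not straddle a multiple of mask + 1. The crosses_seg_boundary() helper is hypothetical. In the kernel a driver would set the mask with dma_set_seg_boundary() and the IOMMU allocator would apply the check.

```c
#include <stdbool.h>

/*
 * Return true if the range [addr, addr + len) crosses a boundary that
 * is a multiple of boundary_mask + 1. boundary_mask is expected to be
 * one less than a power of two, for example 0xfffff for 1M.
 */
static bool crosses_seg_boundary(unsigned long addr, unsigned long len,
				 unsigned long boundary_mask)
{
	if (len == 0)
		return false;

	/* Compare against the room left before the next boundary. */
	return len - 1 > boundary_mask - (addr & boundary_mask);
}
```

With a 1M boundary (mask 0xfffff), an 8K mapping at 0xff000 crosses the boundary, while a 1M mapping at 0x100000 does not.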

2010-07-13 05:57:11

by Zach Pfeffer

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

FUJITA Tomonori wrote:
> On Thu, 08 Jul 2010 16:59:52 -0700
> Zach Pfeffer <[email protected]> wrote:
>
>> The problem I'm trying to solve boils down to this: map a set of
>> contiguous physical buffers to an aligned IOMMU address. I need to
>> allocate the set of physical buffers in a particular way: use 1 MB
>> contiguous physical memory, then 64 KB, then 4 KB, etc. and I need to
>> align the IOMMU address in a particular way.
>
> Sounds like the DMA API already supports what you want.
>
> You can set segment_boundary_mask in struct device_dma_parameters if
> you want to align the IOMMU address. See IOMMU implementations that
> support dma_get_seg_boundary() properly.

That function takes the wrong argument in a VCM world:

unsigned long dma_get_seg_boundary(struct device *dev);

The boundary should be an attribute of the device side mapping,
independent of the device. This would allow better code reuse.


2010-07-13 06:04:23

by FUJITA Tomonori

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Mon, 12 Jul 2010 22:57:06 -0700
Zach Pfeffer <[email protected]> wrote:

> FUJITA Tomonori wrote:
> > On Thu, 08 Jul 2010 16:59:52 -0700
> > Zach Pfeffer <[email protected]> wrote:
> >
> >> The problem I'm trying to solve boils down to this: map a set of
> >> contiguous physical buffers to an aligned IOMMU address. I need to
> >> allocate the set of physical buffers in a particular way: use 1 MB
> >> contiguous physical memory, then 64 KB, then 4 KB, etc. and I need to
> >> align the IOMMU address in a particular way.
> >
> > Sounds like the DMA API already supports what you want.
> >
> > You can set segment_boundary_mask in struct device_dma_parameters if
> > you want to align the IOMMU address. See IOMMU implementations that
> > support dma_get_seg_boundary() properly.
>
> That function takes the wrong argument in a VCM world:
>
> unsigned long dma_get_seg_boundary(struct device *dev);
>
> The boundary should be an attribute of the device side mapping,
> independent of the device. This would allow better code reuse.

You mean that you want to specify this alignment attribute every time
you create an IOMMU mapping? Then you can set segment_boundary_mask
every time you create an IOMMU mapping. It's odd but it should work.

Another possible solution is extending struct dma_attrs. We could add
the alignment attribute to it.

2010-07-13 12:14:26

by Zach Pfeffer

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Tue, Jul 13, 2010 at 03:03:25PM +0900, FUJITA Tomonori wrote:
> On Mon, 12 Jul 2010 22:57:06 -0700
> Zach Pfeffer <[email protected]> wrote:
>
> > FUJITA Tomonori wrote:
> > > On Thu, 08 Jul 2010 16:59:52 -0700
> > > Zach Pfeffer <[email protected]> wrote:
> > >
> > >> The problem I'm trying to solve boils down to this: map a set of
> > >> contiguous physical buffers to an aligned IOMMU address. I need to
> > >> allocate the set of physical buffers in a particular way: use 1 MB
> > >> contiguous physical memory, then 64 KB, then 4 KB, etc. and I need to
> > >> align the IOMMU address in a particular way.
> > >
> > > Sounds like the DMA API already supports what you want.
> > >
> > > You can set segment_boundary_mask in struct device_dma_parameters if
> > > you want to align the IOMMU address. See IOMMU implementations that
> > > support dma_get_seg_boundary() properly.
> >
> > That function takes the wrong argument in a VCM world:
> >
> > unsigned long dma_get_seg_boundary(struct device *dev);
> >
> > The boundary should be an attribute of the device side mapping,
> > independent of the device. This would allow better code reuse.
>
> You mean that you want to specify this alignment attribute every time
> you create an IOMMU mapping? Then you can set segment_boundary_mask
> every time you create an IOMMU mapping. It's odd but it should work.

Kinda. I want to forget about IOMMUs, devices and CPUs. I just want to
create a mapping that has the alignment I specify, regardless of the
mapper. The mapping is created on a VCM and the VCM is associated with
a mapper: a CPU, an IOMMU'd device or a direct mapped device.

>
> Another possible solution is extending struct dma_attrs. We could add
> the alignment attribute to it.

That may be useful, but in the current DMA-API it may be seen as
redundant info.


2010-07-14 02:00:58

by FUJITA Tomonori

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Tue, 13 Jul 2010 05:14:21 -0700
Zach Pfeffer <[email protected]> wrote:

> > You mean that you want to specify this alignment attribute every time
> > you create an IOMMU mapping? Then you can set segment_boundary_mask
> > every time you create an IOMMU mapping. It's odd but it should work.
>
> Kinda. I want to forget about IOMMUs, devices and CPUs. I just want to
> create a mapping that has the alignment I specify, regardless of the
> mapper. The mapping is created on a VCM and the VCM is associated with
> a mapper: a CPU, an IOMMU'd device or a direct mapped device.

Sounds like you can do the above with the combination of the current
APIs, create a virtual address and then an I/O address.

The above can't be a reason to add a new infrastructure that includes
more than 3,000 lines.


> > Another possible solution is extending struct dma_attrs. We could add
> > the alignment attribute to it.
>
> That may be useful, but in the current DMA-API it may be seen as
> redundant info.

If there is a real requirement, we can extend the DMA-API.

2010-07-14 20:12:03

by Zach Pfeffer

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, Jul 14, 2010 at 10:59:38AM +0900, FUJITA Tomonori wrote:
> On Tue, 13 Jul 2010 05:14:21 -0700
> Zach Pfeffer <[email protected]> wrote:
>
> > > You mean that you want to specify this alignment attribute every time
> > > you create an IOMMU mapping? Then you can set segment_boundary_mask
> > > every time you create an IOMMU mapping. It's odd but it should work.
> >
> > Kinda. I want to forget about IOMMUs, devices and CPUs. I just want to
> > create a mapping that has the alignment I specify, regardless of the
> > mapper. The mapping is created on a VCM and the VCM is associated with
> > a mapper: a CPU, an IOMMU'd device or a direct mapped device.
>
> Sounds like you can do the above with the combination of the current
> APIs, create a virtual address and then an I/O address.
>

Yes, and that's what the implementation does - and all the other
implementations that need to do this same thing. Why not solve the
problem once?

> The above can't be a reason to add a new infrastructure that includes
> more than 3,000 lines.

Right now it's 3000 lines because I haven't converted to a function
pointer based implementation. Once I do that the size of the
implementation will shrink and the code will act as a lib. Users pass
buffer mappers and the lib will ease the management of those
buffers.

>
>
> > > Another possible solution is extending struct dma_attrs. We could add
> > > the alignment attribute to it.
> >
> > That may be useful, but in the current DMA-API it may be seen as
> > redundant info.
>
> If there is a real requirement, we can extend the DMA-API.

If the DMA-API contained functions to allocate virtual space separate
from physical space and reworked how chained buffers functioned it
would probably work - but then things start to look like the VCM API
which does graph based map management.


2010-07-14 22:07:06

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, Jul 14, 2010 at 01:11:49PM -0700, Zach Pfeffer wrote:
> If the DMA-API contained functions to allocate virtual space separate
> from physical space and reworked how chained buffers functioned it
> would probably work - but then things start to look like the VCM API
> which does graph based map management.

Every additional virtual mapping of a physical buffer results in
additional cache aliases on aliasing caches, and more workload for
developers to sort out the cache aliasing issues.

What does VCM do to mitigate that?

2010-07-14 23:08:57

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, 14 Jul 2010 13:11:49 -0700
Zach Pfeffer <[email protected]> wrote:

> On Wed, Jul 14, 2010 at 10:59:38AM +0900, FUJITA Tomonori wrote:
> > On Tue, 13 Jul 2010 05:14:21 -0700
> > Zach Pfeffer <[email protected]> wrote:
> >
> > > > You mean that you want to specify this alignment attribute every time
> > > > you create an IOMMU mapping? Then you can set segment_boundary_mask
> > > > every time you create an IOMMU mapping. It's odd but it should work.
> > >
> > > Kinda. I want to forget about IOMMUs, devices and CPUs. I just want to
> > > create a mapping that has the alignment I specify, regardless of the
> > > mapper. The mapping is created on a VCM and the VCM is associated with
> > > a mapper: a CPU, an IOMMU'd device or a direct mapped device.
> >
> > Sounds like you can do the above with the combination of the current
> > APIs, create a virtual address and then an I/O address.
> >
>
> Yes, and that's what the implementation does - and all the other
> implementations that need to do this same thing. Why not solve the
> problem once?

Why do we need a new abstraction layer to solve the problem that the
current API can handle?

The above two operations don't sound too complicated. The combination
of the current API sounds much simpler than your new abstraction.

Please show how the combination of the current APIs doesn't
work. Otherwise, we can't see what's the benefit of your new
abstraction.

2010-07-15 01:30:05

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, Jul 14, 2010 at 11:05:36PM +0100, Russell King - ARM Linux wrote:
> On Wed, Jul 14, 2010 at 01:11:49PM -0700, Zach Pfeffer wrote:
> > If the DMA-API contained functions to allocate virtual space separate
> > from physical space and reworked how chained buffers functioned it
> > would probably work - but then things start to look like the VCM API
> > which does graph based map management.
>
> Every additional virtual mapping of a physical buffer results in
> additional cache aliases on aliasing caches, and more workload for
> developers to sort out the cache aliasing issues.
>
> What does VCM do to mitigate that?

The VCM ensures that all mappings that map a given physical buffer:
IOMMU mappings, CPU mappings and one-to-one device mappings all map
that buffer using the same (or compatible) attributes. At this point
the only attribute that users can pass is CACHED. In the absence of
CACHED all accesses go straight through to the physical memory.

The architecture of the VCM allows these sorts of consistency checks
to be made since all mappers of a given physical resource are
tracked. This is feasible because the physical resources we're
tracking are typically large.
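
The consistency check described above can be sketched in plain C. This is
only a userspace model, not the VCM implementation: every name in it
(ex_phys_buf, ex_back, EX_CACHED) is made up for illustration, and it
assumes "compatible" simply means "identical".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical attribute set: the thread only names CACHED. */
enum ex_attr { EX_UNCACHED = 0, EX_CACHED = 1 };

#define EX_MAX_MAPPERS 8

/* One physical buffer and the attribute each current mapper used. */
struct ex_phys_buf {
	size_t nr_mappers;
	enum ex_attr attr[EX_MAX_MAPPERS];
};

/*
 * Record a new mapping of buf. Succeeds only if the requested attribute
 * matches every mapping already tracked for this buffer.
 */
static bool ex_back(struct ex_phys_buf *buf, enum ex_attr attr)
{
	size_t i;

	if (buf->nr_mappers == EX_MAX_MAPPERS)
		return false;

	for (i = 0; i < buf->nr_mappers; i++)
		if (buf->attr[i] != attr)
			return false;	/* mismatch with an existing mapper */

	buf->attr[buf->nr_mappers++] = attr;
	return true;
}
```

In this model a second CACHED mapping of the same buffer is accepted,
while a later uncached mapping is refused because it would disagree with
the mappings already tracked.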

2010-07-15 01:41:57

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Thu, Jul 15, 2010 at 08:07:28AM +0900, FUJITA Tomonori wrote:
> On Wed, 14 Jul 2010 13:11:49 -0700
> Zach Pfeffer <[email protected]> wrote:
>
> > On Wed, Jul 14, 2010 at 10:59:38AM +0900, FUJITA Tomonori wrote:
> > > On Tue, 13 Jul 2010 05:14:21 -0700
> > > Zach Pfeffer <[email protected]> wrote:
> > >
> > > > > You mean that you want to specify this alignment attribute every time
> > > > > you create an IOMMU mapping? Then you can set segment_boundary_mask
> > > > > every time you create an IOMMU mapping. It's odd but it should work.
> > > >
> > > > Kinda. I want to forget about IOMMUs, devices and CPUs. I just want to
> > > > create a mapping that has the alignment I specify, regardless of the
> > > > mapper. The mapping is created on a VCM and the VCM is associated with
> > > > a mapper: a CPU, an IOMMU'd device or a direct mapped device.
> > >
> > > Sounds like you can do the above with the combination of the current
> > > APIs, create a virtual address and then an I/O address.
> > >
> >
> > Yes, and that's what the implementation does - and all the other
> > implementations that need to do this same thing. Why not solve the
> > problem once?
>
> Why do we need a new abstraction layer to solve the problem that the
> current API can handle?

The current API can't really handle it because the DMA API doesn't
separate buffer allocation from buffer mapping. To use the DMA API a
scatterlist would need to be synthesized from the physical buffers
that we've allocated.

For instance: I need 10, 1 MB physical buffers and a 64 KB physical
buffer. With the DMA API I need to allocate 10*1MB/PAGE_SIZE + 64
KB/PAGE_SIZE scatterlist elements, fix them all up to follow the
chaining specification and then go through all of them again to fix up
their virtual mappings for the mapper that's mapping the physical
buffer. If I want to share the buffer with another device I have to
make a copy of the entire thing then fix up the virtual mappings for
the other device I'm sharing with. The VCM splits the two things up so
that I do a physical allocation, then 2 virtual allocations and then
map both.
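
To put a number on the example above, here is a small userspace sketch.
It assumes 4 KB pages and one scatterlist element per page, which is the
premise of the paragraph above rather than a property of the DMA API;
the ex_* names are invented for illustration.

```c
#include <stddef.h>

/* Assumed page size; real kernels define PAGE_SIZE per architecture. */
#define EX_PAGE_SIZE 4096u

/* Page-sized elements needed to describe `count` buffers of `bytes` each. */
static size_t ex_sg_entries(size_t count, size_t bytes)
{
	return count * ((bytes + EX_PAGE_SIZE - 1) / EX_PAGE_SIZE);
}

/* Ten 1 MB buffers plus one 64 KB buffer, as in the example above. */
static size_t ex_total_entries(void)
{
	return ex_sg_entries(10, 1024u * 1024u) + ex_sg_entries(1, 64u * 1024u);
}
```

Under those assumptions that is 2560 + 16 = 2576 elements for one
device, and the same again for each copy made to share the buffers.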

>
> The above two operations don't sound too complicated. The combination
> of the current API sounds much simpler than your new abstraction.
>
> Please show how the combination of the current APIs doesn't
> work. Otherwise, we can't see what's the benefit of your new
> abstraction.

See above.

2010-07-15 01:48:09

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

Zach Pfeffer <[email protected]> writes:

> On Wed, Jul 14, 2010 at 11:05:36PM +0100, Russell King - ARM Linux wrote:
>> On Wed, Jul 14, 2010 at 01:11:49PM -0700, Zach Pfeffer wrote:
>> > If the DMA-API contained functions to allocate virtual space separate
>> > from physical space and reworked how chained buffers functioned it
>> > would probably work - but then things start to look like the VCM API
>> > which does graph based map management.
>>
>> Every additional virtual mapping of a physical buffer results in
>> additional cache aliases on aliasing caches, and more workload for
>> developers to sort out the cache aliasing issues.
>>
>> What does VCM do to mitigate that?
>
> The VCM ensures that all mappings that map a given physical buffer:
> IOMMU mappings, CPU mappings and one-to-one device mappings all map
> that buffer using the same (or compatible) attributes. At this point
> the only attribute that users can pass is CACHED. In the absence of
> CACHED all accesses go straight through to the physical memory.
>
> The architecture of the VCM allows these sorts of consistency checks
> to be made since all mappers of a given physical resource are
> tracked. This is feasible because the physical resources we're
> tracking are typically large.

On x86 this is implemented in the pat code, and could reasonably be
generalized to be cross platform.

This is controlled by HAVE_PFNMAP_TRACKING and with entry points
like track_pfn_vma_new.

Given that we already have an implementation that tracks the cached
vs non-cached attribute using the dma api, I don't see that the
API has to change. An implementation of the cached vs non-cached
status for arm and other architectures is probably appropriate.

It is definitely true that getting your mapping caching attributes
out of sync can be a problem.

Eric

2010-07-15 05:35:58

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, Jul 14, 2010 at 06:29:58PM -0700, Zach Pfeffer wrote:
> On Wed, Jul 14, 2010 at 11:05:36PM +0100, Russell King - ARM Linux wrote:
> > On Wed, Jul 14, 2010 at 01:11:49PM -0700, Zach Pfeffer wrote:
> > > If the DMA-API contained functions to allocate virtual space separate
> > > from physical space and reworked how chained buffers functioned it
> > > would probably work - but then things start to look like the VCM API
> > > which does graph based map management.
> >
> > Every additional virtual mapping of a physical buffer results in
> > additional cache aliases on aliasing caches, and more workload for
> > developers to sort out the cache aliasing issues.
> >
> > What does VCM do to mitigate that?
>
> The VCM ensures that all mappings that map a given physical buffer:
> IOMMU mappings, CPU mappings and one-to-one device mappings all map
> that buffer using the same (or compatible) attributes. At this point
> the only attribute that users can pass is CACHED. In the absence of
> CACHED all accesses go straight through to the physical memory.
>
> The architecture of the VCM allows these sorts of consistency checks
> to be made since all mappers of a given physical resource are
> tracked. This is feasible because the physical resources we're
> tracking are typically large.

A few more things...

In addition to CACHED, the VCMM can support different cache policies
as long as the architecture can support it - they get passed down
through the device map call.

In addition, handling physical mappings in the VCMM enables it to
perform refcounting on the physical chunks (ie, to see how many
virtual spaces it's been mapped to, including the kernel's). This
allows it to turn on any coherency protocols that are available in
hardware (ie, setting the shareable bit on something that is mapped to
more than one virtual space). That same attribute can be left off on a
buffer that has only one virtual mapping (ie, scratch buffers used by
one device only). It is then up to the underlying system to deal with
that shared attribute - to enable redirection if it's supported, or to
force something to be non-cacheable, etc. Doing it all through the
VCMM allows all these mechanisms to be worked out once per architecture
and then reused.
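
A rough userspace model of the refcounting idea, assuming a single
boolean stands in for whatever coherency control the hardware offers.
The struct and function names are hypothetical and are not part of the
VCMM patch.

```c
#include <stdbool.h>

/* One physical chunk and how many virtual spaces currently map it. */
struct ex_chunk {
	unsigned int map_count;
	bool shareable;		/* stand-in for a hardware shareable bit */
};

/* A new virtual space maps the chunk. */
static void ex_chunk_map(struct ex_chunk *c)
{
	c->map_count++;
	/* Coherency is only needed once a second mapping exists. */
	c->shareable = c->map_count > 1;
}

/* A virtual space drops its mapping of the chunk. */
static void ex_chunk_unmap(struct ex_chunk *c)
{
	if (c->map_count)
		c->map_count--;
	c->shareable = c->map_count > 1;
}
```

A scratch buffer mapped by one device never sets the attribute; it is
turned on when a second space maps the chunk and off again when the
count drops back to one.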

2010-07-15 05:40:59

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, Jul 14, 2010 at 06:47:34PM -0700, Eric W. Biederman wrote:
> Zach Pfeffer <[email protected]> writes:
>
> > On Wed, Jul 14, 2010 at 11:05:36PM +0100, Russell King - ARM Linux wrote:
> >> On Wed, Jul 14, 2010 at 01:11:49PM -0700, Zach Pfeffer wrote:
> >> > If the DMA-API contained functions to allocate virtual space separate
> >> > from physical space and reworked how chained buffers functioned it
> >> > would probably work - but then things start to look like the VCM API
> >> > which does graph based map management.
> >>
> >> Every additional virtual mapping of a physical buffer results in
> >> additional cache aliases on aliasing caches, and more workload for
> >> developers to sort out the cache aliasing issues.
> >>
> >> What does VCM do to mitigate that?
> >
> > The VCM ensures that all mappings that map a given physical buffer:
> > IOMMU mappings, CPU mappings and one-to-one device mappings all map
> > that buffer using the same (or compatible) attributes. At this point
> > the only attribute that users can pass is CACHED. In the absence of
> > CACHED all accesses go straight through to the physical memory.
> >
> > The architecture of the VCM allows these sorts of consistency checks
> > to be made since all mappers of a given physical resource are
> > tracked. This is feasible because the physical resources we're
> > tracking are typically large.
>
> On x86 this is implemented in the pat code, and could reasonably be
> generalized to be cross platform.
>
> This is controlled by HAVE_PFNMAP_TRACKING and with entry points
> like track_pfn_vma_new.
>
> Given that we already have an implementation that tracks the cached
> vs non-cached attribute using the dma api, I don't see that the
> API has to change. An implementation of the cached vs non-cached
> status for arm and other architectures is probably appropriate.
>
> It is definitely true that getting your mapping caching attributes
> out of sync can be a problem.

Sure, but we're still stuck with needing lots of scatterlist
elements and needing to copy them to share physical buffers.

2010-07-15 08:56:51

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, Jul 14, 2010 at 06:29:58PM -0700, Zach Pfeffer wrote:
> The VCM ensures that all mappings that map a given physical buffer:
> IOMMU mappings, CPU mappings and one-to-one device mappings all map
> that buffer using the same (or compatible) attributes. At this point
> the only attribute that users can pass is CACHED. In the absence of
> CACHED all accesses go straight through to the physical memory.

So what you're saying is that if I have a buffer in kernel space
for which I already have a virtual address, I can pass this to VCM and
tell it !CACHED, and it'll set up another mapping which is not cached
for me?

You are aware that multiple V:P mappings for the same physical page
with different attributes are being outlawed with ARMv6 and ARMv7
due to speculative prefetching. The cache can be searched even for
a mapping specified as 'normal, uncached' and you can get cache hits
because the data has been speculatively loaded through a separate
cached mapping of the same physical page.

FYI, during the next merge window, I will be pushing a patch which makes
ioremap() of system RAM fail, which should be the last core code creator
of mappings with different memory types. This behaviour has been outlawed
(as unpredictable) in the architecture specification and does cause
problems on some CPUs.

We've also the issue of multiple mappings with differing cache attributes
which needs addressing too...

2010-07-16 00:48:40

by Timothy Meade

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Thu, Jul 15, 2010 at 4:55 AM, Russell King - ARM Linux
<[email protected]> wrote:
> On Wed, Jul 14, 2010 at 06:29:58PM -0700, Zach Pfeffer wrote:
>> The VCM ensures that all mappings that map a given physical buffer:
>> IOMMU mappings, CPU mappings and one-to-one device mappings all map
>> that buffer using the same (or compatible) attributes. At this point
>> the only attribute that users can pass is CACHED. In the absence of
>> CACHED all accesses go straight through to the physical memory.
>
> So what you're saying is that if I have a buffer in kernel space
> which I already have its virtual address, I can pass this to VCM and
> tell it !CACHED, and it'll setup another mapping which is not cached
> for me?
>
> You are aware that multiple V:P mappings for the same physical page
> with different attributes are being outlawed with ARMv6 and ARMv7
> due to speculative prefetching. The cache can be searched even for
> a mapping specified as 'normal, uncached' and you can get cache hits
> because the data has been speculatively loaded through a separate
> cached mapping of the same physical page.
>
> FYI, during the next merge window, I will be pushing a patch which makes
> ioremap() of system RAM fail, which should be the last core code creator
> of mappings with different memory types. This behaviour has been outlawed
> (as unpredictable) in the architecture specification and does cause
> problems on some CPUs.
>
> We've also the issue of multiple mappings with differing cache attributes
> which needs addressing too...

Interesting, since I seem to remember the MSM devices mostly conduct
IO through regions of normal RAM, largely accomplished through
ioremap() calls.

Without more public domain documentation of the MSM chips and AMSS
interfaces I wouldn't know how to avoid this, but I can imagine it
creates a bit of urgency for Qualcomm developers as they attempt to
upstream support for this most interesting SoC.

--
Timothy Meade
tmzt #htc-linux

2010-07-16 08:00:23

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Thu, Jul 15, 2010 at 08:48:36PM -0400, Tim HRM wrote:
> Interesting, since I seem to remember the MSM devices mostly conduct
> IO through regions of normal RAM, largely accomplished through
> ioremap() calls.
>
> Without more public domain documentation of the MSM chips and AMSS
> interfaces I wouldn't know how to avoid this, but I can imagine it
> creates a bit of urgency for Qualcomm developers as they attempt to
> upstream support for this most interesting SoC.

The patch has been out for RFC since early April on the linux-arm-kernel
mailing list (Subject: [RFC] Prohibit ioremap() on kernel managed RAM),
and no comments have come back from Qualcomm folk.

The restriction on creation of multiple V:P mappings with differing
attributes is also fairly hard to miss in the ARM architecture
specification when reading the sections about caches.

2010-07-17 00:01:23

by Larry Bassel

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On 16 Jul 10 08:58, Russell King - ARM Linux wrote:
> On Thu, Jul 15, 2010 at 08:48:36PM -0400, Tim HRM wrote:
> > Interesting, since I seem to remember the MSM devices mostly conduct
> > IO through regions of normal RAM, largely accomplished through
> > ioremap() calls.
> >
> > Without more public domain documentation of the MSM chips and AMSS
> > interfaces I wouldn't know how to avoid this, but I can imagine it
> > creates a bit of urgency for Qualcomm developers as they attempt to
> > upstream support for this most interesting SoC.
>
> As the patch has been out for RFC since early April on the linux-arm-kernel
> mailing list (Subject: [RFC] Prohibit ioremap() on kernel managed RAM),
> and no comments have come back from Qualcomm folk.

We are investigating the impact of this change on us, and I
will send out more detailed comments next week.

>
> The restriction on creation of multiple V:P mappings with differing
> attributes is also fairly hard to miss in the ARM architecture
> specification when reading the sections about caches.
>

Larry Bassel

--
Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.

2010-07-19 06:52:38

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Thu, Jul 15, 2010 at 09:55:35AM +0100, Russell King - ARM Linux wrote:
> On Wed, Jul 14, 2010 at 06:29:58PM -0700, Zach Pfeffer wrote:
> > The VCM ensures that all mappings that map a given physical buffer:
> > IOMMU mappings, CPU mappings and one-to-one device mappings all map
> > that buffer using the same (or compatible) attributes. At this point
> > the only attribute that users can pass is CACHED. In the absence of
> > CACHED all accesses go straight through to the physical memory.
>
> So what you're saying is that if I have a buffer in kernel space
> which I already have its virtual address, I can pass this to VCM and
> tell it !CACHED, and it'll setup another mapping which is not cached
> for me?

Not quite. The existing mapping will be represented by a reservation
from the prebuilt VCM of the VM. This reservation has been marked
non-cached. Another reservation on an IOMMU VCM, also marked non-cached,
will be backed with the same physical memory. This is legal in ARM,
allowing the vcm_back call to succeed. If you instead passed cached on
the second mapping, the first mapping would be non-cached and the
second would be cached. If the underlying architecture supported this
then the vcm_back would go through.

>
> You are aware that multiple V:P mappings for the same physical page
> with different attributes are being outlawed with ARMv6 and ARMv7
> due to speculative prefetching. The cache can be searched even for
> a mapping specified as 'normal, uncached' and you can get cache hits
> because the data has been speculatively loaded through a separate
> cached mapping of the same physical page.

I didn't know that. Thanks for the heads up.

> FYI, during the next merge window, I will be pushing a patch which makes
> ioremap() of system RAM fail, which should be the last core code creator
> of mappings with different memory types. This behaviour has been outlawed
> (as unpredictable) in the architecture specification and does cause
> problems on some CPUs.

That's fair enough, but it seems like it should only be outlawed for
those processors on which it breaks.

>
> We've also the issue of multiple mappings with differing cache attributes
> which needs addressing too...

The VCM has been architected to handle these things.

2010-07-19 07:45:27

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

Zach Pfeffer <[email protected]> writes:

> On Thu, Jul 15, 2010 at 09:55:35AM +0100, Russell King - ARM Linux wrote:
>> On Wed, Jul 14, 2010 at 06:29:58PM -0700, Zach Pfeffer wrote:
>> > The VCM ensures that all mappings that map a given physical buffer:
>> > IOMMU mappings, CPU mappings and one-to-one device mappings all map
>> > that buffer using the same (or compatible) attributes. At this point
>> > the only attribute that users can pass is CACHED. In the absence of
>> > CACHED all accesses go straight through to the physical memory.
>>
>> So what you're saying is that if I have a buffer in kernel space
>> which I already have its virtual address, I can pass this to VCM and
>> tell it !CACHED, and it'll setup another mapping which is not cached
>> for me?
>
> Not quite. The existing mapping will be represented by a reservation
> from the prebuilt VCM of the VM. This reservation has been marked
> non-cached. Another reservation on an IOMMU VCM, also marked non-cached,
> will be backed with the same physical memory. This is legal in ARM,
> allowing the vcm_back call to succeed. If you instead passed cached on
> the second mapping, the first mapping would be non-cached and the
> second would be cached. If the underlying architecture supported this
> then the vcm_back would go through.

How does this compare with the x86 pat code?

>> You are aware that multiple V:P mappings for the same physical page
>> with different attributes are being outlawed with ARMv6 and ARMv7
>> due to speculative prefetching. The cache can be searched even for
>> a mapping specified as 'normal, uncached' and you can get cache hits
>> because the data has been speculatively loaded through a separate
>> cached mapping of the same physical page.
>
> I didn't know that. Thanks for the heads up.
>
>> FYI, during the next merge window, I will be pushing a patch which makes
>> ioremap() of system RAM fail, which should be the last core code creator
>> of mappings with different memory types. This behaviour has been outlawed
>> (as unpredictable) in the architecture specification and does cause
>> problems on some CPUs.
>
> That's fair enough, but it seems like it should only be outlawed for
> those processors on which it breaks.

To my knowledge mismatch of mapping attributes is a problem on most
cpus on every architecture. I don't see it making sense to encourage
coding constructs that will fail in the strangest, most difficult to
debug ways.

Eric

2010-07-19 08:23:42

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, Jul 14, 2010 at 06:41:48PM -0700, Zach Pfeffer wrote:
> On Thu, Jul 15, 2010 at 08:07:28AM +0900, FUJITA Tomonori wrote:
> > Why do we need a new abstraction layer to solve the problem that the
> > current API can handle?
>
> The current API can't really handle it because the DMA API doesn't
> separate buffer allocation from buffer mapping.

That's not entirely correct. The DMA API provides two things:

1. An API for allocating DMA coherent buffers
2. An API for mapping streaming buffers

Some implementations of (2) end up using (1) to work around broken
hardware - but that's a separate problem (and causes its own set of
problems.)

> For instance: I need 10, 1 MB physical buffers and a 64 KB physical
> buffer. With the DMA API I need to allocate 10*1MB/PAGE_SIZE + 64
> KB/PAGE_SIZE scatterlist elements, fix them all up to follow the
> chaining specification and then go through all of them again to fix up
> their virtual mappings for the mapper that's mapping the physical
> buffer.

You're making it sound like extremely hard work.

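	struct scatterlist *sg;
	unsigned int len;
	void *buf;
	int i, nents = 11;

	sg = kmalloc(sizeof(*sg) * nents, GFP_KERNEL);
	if (!sg)
		return -ENOMEM;

	sg_init_table(sg, nents);
	for (i = 0; i < nents; i++) {
		/* ten 1MB buffers followed by one 64KB buffer */
		if (i != nents - 1)
			len = 1048576;
		else
			len = 64*1024;
		/* alloc_buffer() stands in for whatever allocator is used */
		buf = alloc_buffer(len);
		sg_set_buf(&sg[i], buf, len);
	}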

There's no need to split the scatterlist elements up into individual
pages - the block layer doesn't do that when it passes scatterlists
down to block device drivers.

I'm not saying that it's reasonable to pass (or even allocate) a 1MB
buffer via the DMA API.

> If I want to share the buffer with another device I have to
> make a copy of the entire thing then fix up the virtual mappings for
> the other device I'm sharing with.

This is something the DMA API doesn't do - probably because there hasn't
been a requirement for it.

One of the issues for drivers is that by separating the mapped scatterlist
from the input buffer scatterlist, it creates something else for them to
allocate, which causes an additional failure point - and as all users sit
well with the current API, there's little reason to change especially
given the number of drivers which would need to be updated.

What you can do is:

struct map {
	dma_addr_t addr;
	size_t len;
};

/* Maps each scatterlist entry for dev; returns the number mapped. */
int map_sg(struct device *dev, struct scatterlist *list,
	   unsigned int nents, struct map *map, enum dma_data_direction dir)
{
	struct scatterlist *sg;
	unsigned int i, j = 0;

	for_each_sg(list, sg, nents, i) {
		map[j].addr = dma_map_page(dev, sg_page(sg), sg->offset,
					   sg->length, dir);
		map[j].len = sg->length;
		if (dma_mapping_error(dev, map[j].addr))
			break;
		j++;
	}

	return j;
}

/* Undoes map_sg() for the first nents entries of map. */
void unmap(struct device *dev, struct map *map, unsigned int nents,
	   enum dma_data_direction dir)
{
	while (nents) {
		dma_unmap_page(dev, map->addr, map->len, dir);
		map++;
		nents--;
	}
}

Note: this may not be portable to all architectures. It may also break
if there's something like the dmabounce or swiotlb code remapping buffers
which don't fit the DMA mask for the device - that's a different problem.

You can then map the same scatterlist into multiple different 'map'
arrays for several devices simultaneously. What you can't do is access
the buffers from the CPU while they're mapped to any device.

I'm not saying that you should do the above - I'm just proving that it's
not as hard as you seem to be making out.

2010-07-19 09:21:39

by Timothy Meade

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Fri, Jul 16, 2010 at 8:01 PM, Larry Bassel <[email protected]> wrote:
> On 16 Jul 10 08:58, Russell King - ARM Linux wrote:
>> On Thu, Jul 15, 2010 at 08:48:36PM -0400, Tim HRM wrote:
>> > Interesting, since I seem to remember the MSM devices mostly conduct
>> > IO through regions of normal RAM, largely accomplished through
>> > ioremap() calls.
>> >
>> > Without more public domain documentation of the MSM chips and AMSS
>> > interfaces I wouldn't know how to avoid this, but I can imagine it
>> > creates a bit of urgency for Qualcomm developers as they attempt to
>> > upstream support for this most interesting SoC.
>>
>> As the patch has been out for RFC since early April on the linux-arm-kernel
>> mailing list (Subject: [RFC] Prohibit ioremap() on kernel managed RAM),
>> and no comments have come back from Qualcomm folk.
>
> We are investigating the impact of this change on us, and I
> will send out more detailed comments next week.
>
>>
>> The restriction on creation of multiple V:P mappings with differing
>> attributes is also fairly hard to miss in the ARM architecture
>> specification when reading the sections about caches.
>>
>
> Larry Bassel
>
> --
> Sent by an employee of the Qualcomm Innovation Center, Inc.
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>

Hi Larry and Qualcomm people.
I'm curious what your reason for introducing this new api (or adding
to dma) is. Specifically how this would be used to make the memory
mapping of the MSM chip dynamic in contrast to the fixed _PHYS defines
in the Android and Codeaurora trees.

I'm also interested in how this ability to map memory regions as files
for devices like KGSL/DRI or PMEM might work and why this is better
suited to that purpose than existing methods, where this fits into
camera preview and other issues that have been dealt with in these
trees in novel ways (from my perspective).

Thanks,
Timothy Meade
tmzt #htc-linux

2010-07-19 17:55:20

by Michael Bohan

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management


On 7/16/2010 12:58 AM, Russell King - ARM Linux wrote:

> As the patch has been out for RFC since early April on the linux-arm-kernel
> mailing list (Subject: [RFC] Prohibit ioremap() on kernel managed RAM),
> and no comments have come back from Qualcomm folk.

Would it be unreasonable to allow a map request to succeed if the
requested attributes matched that of the preexisting mapping?

Michael

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum

2010-07-19 18:41:48

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Mon, Jul 19, 2010 at 10:55:15AM -0700, Michael Bohan wrote:
>
> On 7/16/2010 12:58 AM, Russell King - ARM Linux wrote:
>
>> As the patch has been out for RFC since early April on the linux-arm-kernel
>> mailing list (Subject: [RFC] Prohibit ioremap() on kernel managed RAM),
>> and no comments have come back from Qualcomm folk.
>
> Would it be unreasonable to allow a map request to succeed if the
> requested attributes matched that of the preexisting mapping?

What would be the point of creating such a mapping?

2010-07-20 10:11:07

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Mon, 19 Jul 2010 09:22:13 +0100
Russell King - ARM Linux <[email protected]> wrote:

> > If I want to share the buffer with another device I have to
> > make a copy of the entire thing then fix up the virtual mappings for
> > the other device I'm sharing with.
>
> This is something the DMA API doesn't do - probably because there hasn't
> been a requirement for it.
>
> One of the issues for drivers is that by separating the mapped scatterlist
> from the input buffer scatterlist, it creates something else for them to
> allocate, which causes an additional failure point - and as all users sit
> well with the current API, there's little reason to change especially
> given the number of drivers which would need to be updated.

Agreed. There was a discussion about separating 'dma_addr and dma_len' from
the scatterlist struct, but I don't think it's worth doing.


> I'm just proving that it's not as hard as you seem to be making out.

Agreed again.

2010-07-20 20:46:38

by Zach Pfeffer

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Fri, Jul 16, 2010 at 08:58:56AM +0100, Russell King - ARM Linux wrote:
> On Thu, Jul 15, 2010 at 08:48:36PM -0400, Tim HRM wrote:
> > Interesting, since I seem to remember the MSM devices mostly conduct
> > IO through regions of normal RAM, largely accomplished through
> > ioremap() calls.
> >
> > Without more public domain documentation of the MSM chips and AMSS
> > interfaces I wouldn't know how to avoid this, but I can imagine it
> > creates a bit of urgency for Qualcomm developers as they attempt to
> > upstream support for this most interesting SoC.
>
> As the patch has been out for RFC since early April on the linux-arm-kernel
> mailing list (Subject: [RFC] Prohibit ioremap() on kernel managed RAM),
> and no comments have come back from Qualcomm folk.
>
> The restriction on creation of multiple V:P mappings with differing
> attributes is also fairly hard to miss in the ARM architecture
> specification when reading the sections about caches.

As you mention in your patch, the things that can't conflict are memory
type (strongly-ordered/device/normal), cache policy
(cacheable/non-cacheable, copy-back/write-through), and coherency
realm (non-shareable/inner-shareable/outer-shareable). You can
conflict in allocation preferences (write-allocate/write-no-allocate),
as those are just "hints".

You can also conflict in access permissions, which can and do differ
between mappings (that is what multiple mappings are all about: one
mapping of a buffer gets some access, while others get different access).

The VCM API allows the same memory to be mapped as long as it makes
sense and allows those attributes that can change to be specified. It
could be the alternative, globally applicable approach you're looking
for and request in your patch.

Without the VCM API (or something like it) there will just be a bunch
of duplicated code that's basically doing ioremap. This code will
probably fail to configure its mappings correctly, in which case your
patch is a bad idea because it'll spawn bugs all over the place
instead of at a known location. We could instead change ioremap to
match the attributes of System RAM if that's what it's mapping.


2010-07-20 20:55:53

by Russell King - ARM Linux

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Tue, Jul 20, 2010 at 01:45:17PM -0700, Zach Pfeffer wrote:
> As you mention in your patch, the things that can't conflict are memory
> type (strongly-ordered/device/normal), cache policy
> (cacheable/non-cacheable, copy-back/write-through), and coherency
> realm (non-shareable/inner-shareable/outer-shareable). You can
> conflict in allocation preferences (write-allocate/write-no-allocate),
> as those are just "hints".

What you refer to as "hints" I refer to as cache policy - practically
on ARM they're all tied up into the same set of bits.

> You can also conflict in access permissions which can and do conflict
> (which are what multiple mappings are all about...some buffer can get
> some access, while others get different access).

Access permissions don't conflict between mappings - each mapping has
unique access permissions.

> The VCM API allows the same memory to be mapped as long as it makes
> sense and allows those attributes that can change to be specified. It
> could be the alternative, globally applicable approach you're looking
> for and request in your patch.

I very much doubt it - there's virtually no call for creating an
additional mapping of existing kernel memory with different permissions.
The only time kernel memory gets remapped is with vmalloc(), where we
want to create a virtually contiguous mapping from a collection of
(possibly) non-contiguous pages. Such allocations are always created
with R/W permissions.

There are some cases where the vmalloc APIs are used to create mappings
with different memory properties, but as already covered, this has become
illegal with ARMv6 and v7 architectures.

So no, VCM doesn't help because there's nothing that could be solved here.
Creating read-only mappings is pointless, and creating mappings with
different memory type, sharability or cache attributes is illegal.

> Without the VCM API (or something like it) there will just be a bunch
> of duplicated code that's basically doing ioremap. This code will
> probably fail to configure its mappings correctly, in which case your
> patch is a bad idea because it'll spawn bugs all over the place
> instead of at a known location. We could instead change ioremap to
> match the attributes of System RAM if that's what it's mapping.

And as I say, what is the point of creating another identical mapping to
the one we already have?

2010-07-20 21:56:14

by Zach Pfeffer

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Tue, Jul 20, 2010 at 09:54:33PM +0100, Russell King - ARM Linux wrote:
> On Tue, Jul 20, 2010 at 01:45:17PM -0700, Zach Pfeffer wrote:
> > You can also conflict in access permissions which can and do conflict
> > (which are what multiple mappings are all about...some buffer can get
> > some access, while others get different access).
>
> Access permissions don't conflict between mappings - each mapping has
> unique access permissions.

Yes. Bad choice of words.

> > The VCM API allows the same memory to be mapped as long as it makes
> > sense and allows those attributes that can change to be specified. It
> > could be the alternative, globally applicable approach you're looking
> > for and request in your patch.
>
> I very much doubt it - there's virtually no call for creating an
> additional mapping of existing kernel memory with different permissions.
> The only time kernel memory gets remapped is with vmalloc(), where we
> want to create a virtually contiguous mapping from a collection of
> (possibly) non-contiguous pages. Such allocations are always created
> with R/W permissions.
>
> There are some cases where the vmalloc APIs are used to create mappings
> with different memory properties, but as already covered, this has become
> illegal with ARMv6 and v7 architectures.
>
> So no, VCM doesn't help because there's nothing that could be solved here.
> Creating read-only mappings is pointless, and creating mappings with
> different memory type, sharability or cache attributes is illegal.

I don't think it's pointless; it may have limited utility, but things
like read-only mappings can be useful.

> > Without the VCM API (or something like it) there will just be a bunch
> > of duplicated code that's basically doing ioremap. This code will
> > probably fail to configure its mappings correctly, in which case your
> > patch is a bad idea because it'll spawn bugs all over the place
> > instead of at a known location. We could instead change ioremap to
> > match the attributes of System RAM if that's what it's mapping.
>
> And as I say, what is the point of creating another identical mapping to
> the one we already have?

As you say, probably not much. We do still have a problem (and other
people have it as well): we need to map in large contiguous buffers
with various attributes and point the kernel and various engines at
them. This seems like something that would be globally useful. The
feedback I've gotten is that we should just keep our usage private to
our mach-msm branch.

I've got a couple of questions:

Do you think a global solution to this problem is appropriate?

What would that solution need to look like, transparent huge pages?

How should people change various mapping attributes for these large
sections of memory?

2010-07-20 22:02:41

by Stepan Moskovchenko

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

Russell-

If a driver wants to allow a device to access memory (and cache coherency
is off/not present for device addresses), the driver needs to remap that
memory as non-cacheable. Suppose there exists a chunk of
physically-contiguous memory (say, memory reserved for device use) that
happened to be already mapped into the kernel as normal memory (cacheable,
etc). One way to remap this memory is to use ioremap (and then never touch
the original virtual mapping, which would now have conflicting
attributes). I feel as if there should be a better way to remap memory for
device access, either by altering the attributes on the original mapping,
or removing the original mapping and creating a new one with attributes
set to non-cacheable. Is there a better way to do this than calling
ioremap() on that memory? Please advise.

Thanks
Steve


Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.


> On Mon, Jul 19, 2010 at 10:55:15AM -0700, Michael Bohan wrote:
>>
>> On 7/16/2010 12:58 AM, Russell King - ARM Linux wrote:
>>
>>> As the patch has been out for RFC since early April on the
>>> linux-arm-kernel
>>> mailing list (Subject: [RFC] Prohibit ioremap() on kernel managed RAM),
>>> and no comments have come back from Qualcomm folk.
>>
>> Would it be unreasonable to allow a map request to succeed if the
>> requested attributes matched that of the preexisting mapping?
>
> What would be the point of creating such a mapping?

2010-07-20 22:20:07

by Zach Pfeffer

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Mon, Jul 19, 2010 at 09:22:13AM +0100, Russell King - ARM Linux wrote:
> On Wed, Jul 14, 2010 at 06:41:48PM -0700, Zach Pfeffer wrote:
> > On Thu, Jul 15, 2010 at 08:07:28AM +0900, FUJITA Tomonori wrote:
> > > Why do we need a new abstraction layer to solve the problem that the
> > > current API can handle?
> >
> > The current API can't really handle it because the DMA API doesn't
> > separate buffer allocation from buffer mapping.
>
> That's not entirely correct. The DMA API provides two things:
>
> 1. An API for allocating DMA coherent buffers
> 2. An API for mapping streaming buffers
>
> Some implementations of (2) end up using (1) to work around broken
> hardware - but that's a separate problem (and causes its own set of
> problems.)
>
> > For instance: I need 10, 1 MB physical buffers and a 64 KB physical
> > buffer. With the DMA API I need to allocate 10*1MB/PAGE_SIZE + 64
> > KB/PAGE_SIZE scatterlist elements, fix them all up to follow the
> > chaining specification and then go through all of them again to fix up
> > their virtual mappings for the mapper that's mapping the physical
> > buffer.
>
> You're making it sound like extremely hard work.
>
>         struct scatterlist *sg;
>         void *buf;
>         size_t len;
>         int i, nents = 11;
>
>         sg = kmalloc(sizeof(*sg) * nents, GFP_KERNEL);
>         if (!sg)
>                 return -ENOMEM;
>
>         sg_init_table(sg, nents);
>         for (i = 0; i < nents; i++) {
>                 /* ten 1 MB buffers followed by one 64 KB buffer */
>                 if (i != nents - 1)
>                         len = 1048576;
>                 else
>                         len = 64*1024;
>                 buf = alloc_buffer(len);
>                 sg_set_buf(&sg[i], buf, len);
>         }
>
> There's no need to split the scatterlist elements up into individual
> pages - the block layer doesn't do that when it passes scatterlists
> down to block device drivers.

Okay. Thank you for the example.
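
For scale, the two element counts being compared can be worked out in a
few lines of user-space C. This is only an illustrative sketch: the helper
names and the 4 KB page size are assumptions, not anything taken from the
kernel or from the patch.

```c
#include <stddef.h>

#define EX_PAGE_SIZE 4096u      /* assumed page size: 4 KB */

/* Entries needed if every buffer is split into page-sized pieces. */
static size_t entries_per_page(const size_t *lens, size_t n)
{
    size_t total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += (lens[i] + EX_PAGE_SIZE - 1) / EX_PAGE_SIZE;
    return total;
}

/* Entries needed with one entry per physically contiguous buffer. */
static size_t entries_per_buffer(size_t n)
{
    return n;
}
```

With ten 1 MB buffers and one 64 KB buffer, the per-page split comes to
10 * 256 + 16 = 2576 entries, against 11 entries when each buffer gets a
single entry.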

>
> I'm not saying that it's reasonable to pass (or even allocate) a 1MB
> buffer via the DMA API.

But given a bunch of large chunks of memory, is there any API that can
manage them (asked this on the other thread as well)?

> > If I want to share the buffer with another device I have to
> > make a copy of the entire thing then fix up the virtual mappings for
> > the other device I'm sharing with.
>
> This is something the DMA API doesn't do - probably because there hasn't
> been a requirement for it.
>
> One of the issues for drivers is that by separating the mapped scatterlist
> from the input buffer scatterlist, it creates something else for them to
> allocate, which causes an additional failure point - and as all users sit
> well with the current API, there's little reason to change especially
> given the number of drivers which would need to be updated.
>
> What you can do is:
>
> struct map {
>         dma_addr_t addr;
>         size_t len;
> };
>
> int map_sg(struct device *dev, struct scatterlist *list,
>            unsigned int nents, struct map *map, enum dma_data_direction dir)
> {
>         struct scatterlist *sg;
>         unsigned int i, j = 0;
>
>         for_each_sg(list, sg, nents, i) {
>                 map[j].addr = dma_map_page(dev, sg_page(sg), sg->offset,
>                                            sg->length, dir);
>                 map[j].len = sg->length;
>                 if (dma_mapping_error(dev, map[j].addr))
>                         break;
>                 j++;
>         }
>
>         return j;
> }
>
> void unmap(struct device *dev, struct map *map, unsigned int nents,
>            enum dma_data_direction dir)
> {
>         while (nents) {
>                 dma_unmap_page(dev, map->addr, map->len, dir);
>                 map++;
>                 nents--;
>         }
> }
>
> Note: this may not be portable to all architectures. It may also break
> if there's something like the dmabounce or swiotlb code remapping buffers
> which don't fit the DMA mask for the device - that's a different problem.

True, but a higher-level "map(virtual_range, physical_chunks)" call
wouldn't break on any architecture.

> You can then map the same scatterlist into multiple different 'map'
> arrays for several devices simultaneously. What you can't do is access
> the buffers from the CPU while they're mapped to any device.

Which is considered a feature ;)

> I'm not saying that you should do the above - I'm just proving that it's
> not as hard as you seem to be making out.

That's fair. I didn't mean to say things were hard, just that using
the DMA API for big buffer management and mapping was not ideal since
our goals are to allocate big buffers using a device specific
algorithm, give them various attributes and share them. What we
created looked generally useful.

2010-07-20 22:31:22

by Russell King - ARM Linux

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Tue, Jul 20, 2010 at 03:02:34PM -0700, [email protected] wrote:
> Russell-
>
> If a driver wants to allow a device to access memory (and cache coherency
> is off/not present for device addresses), the driver needs to remap that
> memory as non-cacheable.

If that memory is not part of the kernel's managed memory, then that's
fine. But if it _is_ part of the kernel's managed memory, then it is
not permitted by the ARM architecture specification to allow maps of
the memory with differing [memory type, sharability, cache] attributes.

Basically, if a driver wants to create these kinds of mappings, then
they should expect the system to become unreliable and unpredictable.
That's not something any sane person should be aiming to do.
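
The rule above can be modelled in a few lines of user-space C. This is
only a sketch under stated assumptions: the enum, struct and function
names are made up for illustration and are not kernel or ARM architecture
identifiers. It treats memory type, sharability and cache policy as the
attributes that must be identical across all mappings of the same RAM,
and leaves access permissions out of the comparison.

```c
#include <stdbool.h>

/* Hypothetical attribute model, for illustration only. */
enum ex_mem_type { EX_STRONGLY_ORDERED, EX_DEVICE, EX_NORMAL };
enum ex_share    { EX_NON_SHAREABLE, EX_INNER_SHAREABLE, EX_OUTER_SHAREABLE };
enum ex_cache    { EX_NONCACHEABLE, EX_WRITE_THROUGH, EX_WRITE_BACK };

struct ex_map_attrs {
    enum ex_mem_type type;
    enum ex_share    share;
    enum ex_cache    cache;
    bool             writable;  /* access permission, not compared below */
};

/*
 * Returns true if two mappings of the same physical memory differ in an
 * attribute that has to match (memory type, sharability, cache policy).
 */
static bool ex_attrs_conflict(const struct ex_map_attrs *a,
                              const struct ex_map_attrs *b)
{
    return a->type != b->type ||
           a->share != b->share ||
           a->cache != b->cache;
}
```

Under this model a read-only alias of a cacheable normal-memory mapping
does not conflict, while a non-cacheable alias of the same memory does.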

> Suppose there exists a chunk of
> physically-contiguous memory (say, memory reserved for device use) that
> happened to be already mapped into the kernel as normal memory (cacheable,
> etc). One way to remap this memory is to use ioremap (and then never touch
> the original virtual mapping, which would now have conflicting
> attributes).

This doesn't work, and is unpredictable on ARMv6 and ARMv7. Not touching
the original mapping is _not_ _sufficient_ to guarantee that the mapping
is not used. (We've seen problems on OMAP as a result of this.)

Any mapping which exists can be speculatively prefetched by such CPUs
at any time, which can lead it to be read into the cache. Then, your
different attributes for your "other" mapping can cause problems if you
hit one of these cache lines - and then you can have (possibly silent)
data corruption.

> I feel as if there should be a better way to remap memory for
> device access, either by altering the attributes on the original mapping,
> or removing the original mapping and creating a new one with attributes
> set to non-cacheable.

This is difficult to achieve without remapping kernel memory using L2
page tables, so we can unmap pages on 4K page granularity. That's
going to increase TLB overhead and result in lower system performance
as there'll be a greater number of MMU misses.

However, one obvious case would be to use highmem-only pages for
remapping - but you then have to ensure that those pages are never
kmapped in any way, because those mappings will fall into the same
unpredictable category that we're already trying to avoid. This
may be possible, but you'll have to ensure that most of the system
RAM is in highmem - which poses other problems (eg, if lowmem gets
low.)

2010-07-21 00:44:18

by Zach Pfeffer

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Mon, Jul 19, 2010 at 05:21:35AM -0400, Tim HRM wrote:
> On Fri, Jul 16, 2010 at 8:01 PM, Larry Bassel <[email protected]> wrote:
> > On 16 Jul 10 08:58, Russell King - ARM Linux wrote:
> >> On Thu, Jul 15, 2010 at 08:48:36PM -0400, Tim HRM wrote:
> >> > Interesting, since I seem to remember the MSM devices mostly conduct
> >> > IO through regions of normal RAM, largely accomplished through
> >> > ioremap() calls.
> >> >
> >> > Without more public domain documentation of the MSM chips and AMSS
> >> > interfaces I wouldn't know how to avoid this, but I can imagine it
> >> > creates a bit of urgency for Qualcomm developers as they attempt to
> >> > upstream support for this most interesting SoC.
> >>
> >> As the patch has been out for RFC since early April on the linux-arm-kernel
> >> mailing list (Subject: [RFC] Prohibit ioremap() on kernel managed RAM),
> >> and no comments have come back from Qualcomm folk.
> >
> > We are investigating the impact of this change on us, and I
> > will send out more detailed comments next week.
> >
> >>
> >> The restriction on creation of multiple V:P mappings with differing
> >> attributes is also fairly hard to miss in the ARM architecture
> >> specification when reading the sections about caches.
> >>
> >
> > Larry Bassel
> >
> > --
> > Sent by an employee of the Qualcomm Innovation Center, Inc.
> > The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
> >
>
> Hi Larry and Qualcomm people.
> I'm curious what your reason for introducing this new api (or adding
> to dma) is. Specifically how this would be used to make the memory
> mapping of the MSM chip dynamic in contrast to the fixed _PHYS defines
> in the Android and Codeaurora trees.

The MSM has many integrated engines that allow offloading a variety of
workloads. These engines have always addressed memory using physical
addresses; because of this we had to reserve large (tens of MB) buffers
at boot. These buffers are never freed, regardless of whether an engine
is actually using them. As you can imagine, needing to reserve memory
for all time on a device that doesn't have a lot of memory in the
first place is not ideal, because that memory could be used for other
things, such as running apps.

To solve this problem we put IOMMUs in front of a lot of the
engines. IOMMUs allow us to map physically discontiguous memory into a
virtually contiguous address range. This means that we could ask the
OS for 10 MB of pages and map all of these into our IOMMU space and
the engine would still see a contiguous range.
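
The idea can be sketched with a toy, single-level translation table in
user-space C. Everything here (the toy_* names, the 16-entry table, the
4 KB page size) is an assumption made for illustration; it is not the MSM
IOMMU programming model or the VCM API.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define TOY_PAGE_SIZE  (1u << TOY_PAGE_SHIFT)
#define TOY_NR_PAGES   16

struct toy_iommu {
    uint64_t phys[TOY_NR_PAGES];   /* physical page base per device page */
    uint8_t  valid[TOY_NR_PAGES];
};

/*
 * Map n scattered physical pages at consecutive device addresses starting
 * at iova. Returns 0 on success, -1 on misaligned or out-of-range input.
 */
static int toy_map(struct toy_iommu *mmu, uint32_t iova,
                   const uint64_t *pages, size_t n)
{
    size_t first = iova >> TOY_PAGE_SHIFT;
    size_t i;

    if ((iova & (TOY_PAGE_SIZE - 1)) || first >= TOY_NR_PAGES ||
        n > TOY_NR_PAGES - first)
        return -1;
    for (i = 0; i < n; i++)
        if (pages[i] & (TOY_PAGE_SIZE - 1))
            return -1;
    for (i = 0; i < n; i++) {
        mmu->phys[first + i] = pages[i];
        mmu->valid[first + i] = 1;
    }
    return 0;
}

/* Translate a device address; returns -1 if the page is not mapped. */
static int toy_translate(const struct toy_iommu *mmu, uint32_t iova,
                         uint64_t *phys)
{
    size_t idx = iova >> TOY_PAGE_SHIFT;

    if (idx >= TOY_NR_PAGES || !mmu->valid[idx])
        return -1;
    *phys = mmu->phys[idx] | (iova & (TOY_PAGE_SIZE - 1));
    return 0;
}
```

The engine walks device addresses linearly, while each page of that range
can resolve to an unrelated physical page.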

In reality, limitations in the hardware meant that we needed to map
memory using larger mappings to minimize the number of TLB
misses. This, plus the number of IOMMUs and the extreme use cases we
needed to design for led us to a generic design.
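
A rough way to see the effect of larger mappings is to count how many
translation entries a range needs. The sketch below is user-space C with
hypothetical names; the set of sizes (16 MB, 1 MB, 64 KB, 4 KB) is an
assumption borrowed from the ARM short-descriptor page table format, and
the model assumes the device address and the physical address share the
same alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed mapping sizes, largest first: 16 MB, 1 MB, 64 KB, 4 KB. */
static const uint64_t ex_sizes[] = {
    16u << 20, 1u << 20, 64u << 10, 4u << 10
};

/*
 * Count the mappings needed to cover [addr, addr + len) when each mapping
 * must be naturally aligned, picking the largest size that fits each time.
 * Returns 0 if addr or len is not a multiple of 4 KB.
 */
static size_t ex_count_mappings(uint64_t addr, uint64_t len)
{
    size_t n = 0;
    size_t i;

    if ((addr | len) & 0xfff)
        return 0;
    while (len) {
        for (i = 0; i < sizeof(ex_sizes) / sizeof(ex_sizes[0]); i++) {
            uint64_t sz = ex_sizes[i];

            if ((addr & (sz - 1)) == 0 && len >= sz) {
                addr += sz;
                len -= sz;
                n++;
                break;
            }
        }
    }
    return n;
}
```

In this model an aligned 10 MB range takes ten 1 MB mappings, where 4 KB
pages alone would take 2560. That only holds when the backing memory is
physically contiguous in chunks of the larger size.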

This generic design solved our problem and the general mapping
problem. We thought other people who had this same big-buffer
interoperation problem would also appreciate a common API that was
built with their needs in mind, so we pushed our idea up.

>
> I'm also interested in how this ability to map memory regions as files
> for devices like KGSL/DRI or PMEM might work and why this is better
> suited to that purpose than existing methods, where this fits into
> camera preview and other issues that have been dealt with in these
> trees in novel ways (from my perspective).

The file based approach was driven by Android's buffer passing scheme
and the need to write userspace drivers for multimedia, etc...

2010-07-21 01:44:19

by Timothy Meade

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Tue, Jul 20, 2010 at 8:44 PM, Zach Pfeffer <[email protected]> wrote:
> On Mon, Jul 19, 2010 at 05:21:35AM -0400, Tim HRM wrote:
>> On Fri, Jul 16, 2010 at 8:01 PM, Larry Bassel <[email protected]> wrote:
>> > On 16 Jul 10 08:58, Russell King - ARM Linux wrote:
>> >> On Thu, Jul 15, 2010 at 08:48:36PM -0400, Tim HRM wrote:
>> >> > Interesting, since I seem to remember the MSM devices mostly conduct
>> >> > IO through regions of normal RAM, largely accomplished through
>> >> > ioremap() calls.
>> >> >
>> >> > Without more public domain documentation of the MSM chips and AMSS
>> >> > interfaces I wouldn't know how to avoid this, but I can imagine it
>> >> > creates a bit of urgency for Qualcomm developers as they attempt to
>> >> > upstream support for this most interesting SoC.
>> >>
>> >> As the patch has been out for RFC since early April on the linux-arm-kernel
>> >> mailing list (Subject: [RFC] Prohibit ioremap() on kernel managed RAM),
>> >> and no comments have come back from Qualcomm folk.
>> >
>> > We are investigating the impact of this change on us, and I
>> > will send out more detailed comments next week.
>> >
>> >>
>> >> The restriction on creation of multiple V:P mappings with differing
>> >> attributes is also fairly hard to miss in the ARM architecture
>> >> specification when reading the sections about caches.
>> >>
>> >
>> > Larry Bassel
>> >
>> > --
>> > Sent by an employee of the Qualcomm Innovation Center, Inc.
>> > The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
>> >
>>
>> Hi Larry and Qualcomm people.
>> I'm curious what your reason for introducing this new api (or adding
>> to dma) is. Specifically how this would be used to make the memory
>> mapping of the MSM chip dynamic in contrast to the fixed _PHYS defines
>> in the Android and Codeaurora trees.
>
> The MSM has many integrated engines that allow offloading a variety of
> workloads. These engines have always addressed memory using physical
> addresses; because of this we had to reserve large (tens of MB) buffers
> at boot. These buffers are never freed regardless of whether an engine
> is actually using them. As you can imagine, needing to reserve memory
> for all time on a device that doesn't have a lot of memory in the
> first place is not ideal because that memory could be used for other
> things, running apps, etc.
>
> To solve this problem we put IOMMUs in front of a lot of the
> engines. IOMMUs allow us to map physically discontiguous memory into a
> virtually contiguous address range. This means that we could ask the
> OS for 10 MB of pages and map all of these into our IOMMU space and
> the engine would still see a contiguous range.
>


I see. Much like I suspected, this is used to replace the static
regime of the earliest Android kernel. You mention placing IOMMUs in
front of the A11 engines; are you involved in this architecture as an
engineer or similar? Is there a reason a cooperative approach using
RPC or another mechanism is not used for memory reservation? Is this
something that can be accomplished fully on the APPS side?

> In reality, limitations in the hardware meant that we needed to map
> memory using larger mappings to minimize the number of TLB
> misses. This, plus the number of IOMMUs and the extreme use cases we
> needed to design for led us to a generic design.
>
> This generic design solved our problem and the general mapping
> problem. We thought other people, who had this same big-buffer
> interoperation problem would also appreciate a common API that was
> built with their needs in mind so we pushed our idea up.
>
>>
>> I'm also interested in how this ability to map memory regions as files
>> for devices like KGSL/DRI or PMEM might work and why this is better
>> suited to that purpose than existing methods, where this fits into
>> camera preview and other issues that have been dealt with in these
>> trees in novel ways (from my perspective).
>
> The file based approach was driven by Android's buffer passing scheme
> and the need to write userspace drivers for multimedia, etc...
>
>
So the Android file-backed approach is obviated by GEM and other mechanisms?

Thank you for your help,
Timothy Meade
-tmzt #htc-linux (facebook.com/HTCLinux)

2010-07-21 01:45:48

by FUJITA Tomonori

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Tue, 20 Jul 2010 15:20:01 -0700
Zach Pfeffer <[email protected]> wrote:

> > I'm not saying that it's reasonable to pass (or even allocate) a 1MB
> > buffer via the DMA API.
>
> But given a bunch of large chunks of memory, is there any API that can
> manage them (asked this on the other thread as well)?

What is the problem with mapping a 1MB buffer with the DMA API?

Possibly an IOMMU can't find space for 1MB, but that's not a problem
with the DMA API.

2010-07-21 05:51:10

by Santosh Shilimkar

Subject: RE: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

> -----Original Message-----
> From: [email protected] [mailto:linux-arm-
> [email protected]] On Behalf Of Russell King - ARM Linux
> Sent: Wednesday, July 21, 2010 4:00 AM
> To: [email protected]
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; FUJITA
> Tomonori; [email protected]; [email protected]; Zach Pfeffer; Michael
> Bohan; Tim HRM; [email protected]; linux-arm-
> [email protected]; [email protected]
> Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device
> memory management
>
> On Tue, Jul 20, 2010 at 03:02:34PM -0700, [email protected] wrote:
> > Russell-
> >
> > If a driver wants to allow a device to access memory (and cache coherency
> > is off/not present for device addresses), the driver needs to remap that
> > memory as non-cacheable.
>
> If that memory is not part of the kernel's managed memory, then that's
> fine. But if it _is_ part of the kernel's managed memory, then it is
> not permitted by the ARM architecture specification to allow maps of
> the memory with differing [memory type, sharability, cache] attributes.
>
> Basically, if a driver wants to create these kinds of mappings, then
> they should expect the system to become unreliable and unpredictable.
> That's not something any sane person should be aiming to do.
>
> > Suppose there exists a chunk of
> > physically-contiguous memory (say, memory reserved for device use) that
> > happened to be already mapped into the kernel as normal memory (cacheable,
> > etc). One way to remap this memory is to use ioremap (and then never touch
> > the original virtual mapping, which would now have conflicting
> > attributes).
>
> This doesn't work, and is unpredictable on ARMv6 and ARMv7. Not touching
> the original mapping is _not_ _sufficient_ to guarantee that the mapping
> is not used. (We've seen problems on OMAP as a result of this.)
>
> Any mapping which exists can be speculatively prefetched by such CPUs
> at any time, which can lead it to be read into the cache. Then, your
> different attributes for your "other" mapping can cause problems if you
> hit one of these cache lines - and then you can have (possibly silent)
> data corruption.
>
> > I feel as if there should be a better way to remap memory for
> > device access, either by altering the attributes on the original mapping,
> > or removing the original mapping and creating a new one with attributes
> > set to non-cacheable.
>
> This is difficult to achieve without remapping kernel memory using L2
> page tables, so we can unmap pages on 4K page granularity. That's
> going to increase TLB overhead and result in lower system performance
> as there'll be a greater number of MMU misses.
>
> However, one obvious case would be to use highmem-only pages for
> remapping - but you then have to ensure that those pages are never
> kmapped in any way, because those mappings will fall into the same
> unpredictable category that we're already trying to avoid. This
> may be possible, but you'll have to ensure that most of the system
> RAM is in highmem - which poses other problems (eg, if lowmem gets
> low.)
>
Why can't we consider the option of removing the old mappings when
we need to create new ones with different attributes, as suggested
by Catalin on a similar thread previously? This would avoid the issue
of duplicate mappings with different attributes on newer ARMs.

Is this something that can't be worked out?

Regards,
Santosh

2010-07-21 07:29:38

by Russell King - ARM Linux

Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, Jul 21, 2010 at 11:19:58AM +0530, Shilimkar, Santosh wrote:
> > -----Original Message-----
> > From: [email protected] [mailto:linux-arm-
> > [email protected]] On Behalf Of Russell King - ARM Linux
> > Sent: Wednesday, July 21, 2010 4:00 AM
> > To: [email protected]
> > Cc: [email protected]; [email protected]; [email protected];
> > [email protected]; [email protected]; FUJITA
> > Tomonori; [email protected]; [email protected]; Zach Pfeffer; Michael
> > Bohan; Tim HRM; [email protected]; linux-arm-
> > [email protected]; [email protected]
> > Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device
> > memory management

*************************************************************************
> > This is difficult to achieve without remapping kernel memory using L2
> > page tables, so we can unmap pages on 4K page granularity. That's
> > going to increase TLB overhead and result in lower system performance
> > as there'll be a greater number of MMU misses.
*************************************************************************

> > However, one obvious case would be to use highmem-only pages for
> > remapping - but you then have to ensure that those pages are never
> > kmapped in any way, because those mappings will fall into the same
> > unpredictable category that we're already trying to avoid. This
> > may be possible, but you'll have to ensure that most of the system
> > RAM is in highmem - which poses other problems (eg, if lowmem gets
> > low.)
>
> Why can't we consider the option of removing the old mappings when
> we need to create new ones with different attributes, as suggested
> by Catalin on a similar thread previously? This would avoid the issue
> of duplicate mappings with different attributes on newer ARMs.

See the first paragraph which I've highlighted above.

2010-07-21 07:46:10

by Santosh Shilimkar

Subject: RE: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

> -----Original Message-----
> From: Russell King - ARM Linux [mailto:[email protected]]
> Sent: Wednesday, July 21, 2010 12:59 PM
> To: Shilimkar, Santosh
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; [email protected];
> [email protected]; FUJITA Tomonori; [email protected];
> [email protected]; Zach Pfeffer; Michael Bohan; Tim HRM; linux-
> [email protected]; [email protected];
> [email protected]
> Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device
> memory management
>
> On Wed, Jul 21, 2010 at 11:19:58AM +0530, Shilimkar, Santosh wrote:
> > > -----Original Message-----
> > > From: [email protected] [mailto:linux-arm-
> > > [email protected]] On Behalf Of Russell King - ARM
> Linux
> > > Sent: Wednesday, July 21, 2010 4:00 AM
> > > To: [email protected]
> > > Cc: [email protected]; [email protected]; [email protected];
> > > [email protected]; [email protected]; FUJITA
> > > Tomonori; [email protected]; [email protected]; Zach Pfeffer;
> Michael
> > > Bohan; Tim HRM; [email protected]; linux-arm-
> > > [email protected]; [email protected]
> > > Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and
> device
> > > memory management
>
> *************************************************************************
> > > This is difficult to achieve without remapping kernel memory using L2
> > > page tables, so we can unmap pages on 4K page granularity. That's
> > > going to increase TLB overhead and result in lower system performance
> > > as there'll be a greater number of MMU misses.
> *************************************************************************
>
> > > However, one obvious case would be to use highmem-only pages for
> > > remapping - but you then have to ensure that those pages are never
> > > kmapped in any way, because those mappings will fall into the same
> > > unpredictable category that we're already trying to avoid. This
> > > may be possible, but you'll have to ensure that most of the system
> > > RAM is in highmem - which poses other problems (eg, if lowmem gets
> > > low.)
> >
> > Why can't we consider an option of removing the old mappings when
> > we need to create new ones with different attributes as suggested
> > by Catalin on similar thread previously. This will avoid the duplicate
> > mapping with different attributes issue on newer ARMs.
>
> See the first paragraph which I've highlighted above.
>
Sorry about missing that para Russell.

2010-07-21 18:05:03

by Stepan Moskovchenko

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

> *************************************************************************
>> > This is difficult to achieve without remapping kernel memory using L2
>> > page tables, so we can unmap pages on 4K page granularity. That's
>> > going to increase TLB overhead and result in lower system performance
>> > as there'll be a greater number of MMU misses.
> *************************************************************************

Given that the buffers in question can be on the order of tens of MB (and
I don't think they will ever be less than 1MB), would we be able to get
the desired effect by unmapping and then remapping on a 1MB granularity
(ie, L1 sections)? It seems to me like this should be sufficient, and
would not require using L2 mappings. Thoughts?
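
To put rough numbers on the granularity question, here is a small
sketch (not code from the patch set; the helper names are made up for
illustration). It only counts how many mapping entries a buffer needs
at 1 MB section versus 4 KB page granularity, and checks whether a
buffer could be handled with sections alone, assuming that requires
both the start address and the length to be 1 MB aligned.

```c
#include <stddef.h>

#define SZ_4K ((size_t)4 * 1024)
#define SZ_1M ((size_t)1024 * 1024)

/* Number of mapping entries needed to cover len bytes, rounding up. */
static size_t entries_needed(size_t len, size_t granule)
{
	return (len + granule - 1) / granule;
}

/*
 * Assumption for this sketch: a buffer can be unmapped and remapped
 * using sections only if its start and length are both 1 MB aligned.
 */
static int section_mappable(unsigned long phys, size_t len)
{
	return len != 0 && (phys % SZ_1M) == 0 && (len % SZ_1M) == 0;
}
```

For a 12 MB buffer this gives 12 section entries against 3072 small-page
entries, which is the difference in TLB pressure being discussed.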

Thanks
Steve

Sent by an employee of the Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.





2010-07-22 04:06:58

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Tue, Jul 20, 2010 at 09:44:12PM -0400, Timothy Meade wrote:
> On Tue, Jul 20, 2010 at 8:44 PM, Zach Pfeffer <[email protected]> wrote:
> > On Mon, Jul 19, 2010 at 05:21:35AM -0400, Tim HRM wrote:
> >> On Fri, Jul 16, 2010 at 8:01 PM, Larry Bassel <[email protected]> wrote:
> >> > On 16 Jul 10 08:58, Russell King - ARM Linux wrote:
> >> >> On Thu, Jul 15, 2010 at 08:48:36PM -0400, Tim HRM wrote:
> >> >> > Interesting, since I seem to remember the MSM devices mostly conduct
> >> >> > IO through regions of normal RAM, largely accomplished through
> >> >> > ioremap() calls.
> >> >> >
> >> >> > Without more public domain documentation of the MSM chips and AMSS
> >> >> > interfaces I wouldn't know how to avoid this, but I can imagine it
> >> >> > creates a bit of urgency for Qualcomm developers as they attempt to
> >> >> > upstream support for this most interesting SoC.
> >> >>
> >> >> As the patch has been out for RFC since early April on the linux-arm-kernel
> >> >> mailing list (Subject: [RFC] Prohibit ioremap() on kernel managed RAM),
> >> >> and no comments have come back from Qualcomm folk.
> >> >
> >> > We are investigating the impact of this change on us, and I
> >> > will send out more detailed comments next week.
> >> >
> >> >>
> >> >> The restriction on creation of multiple V:P mappings with differing
> >> >> attributes is also fairly hard to miss in the ARM architecture
> >> >> specification when reading the sections about caches.
> >> >>
> >> >
> >> > Larry Bassel
> >> >
> >> > --
> >> > Sent by an employee of the Qualcomm Innovation Center, Inc.
> >> > The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum.
> >> >
> >>
> >> Hi Larry and Qualcomm people.
> >> I'm curious what your reason for introducing this new api (or adding
> >> to dma) is. Specifically how this would be used to make the memory
> >> mapping of the MSM chip dynamic in contrast to the fixed _PHYS defines
> >> in the Android and Codeaurora trees.
> >
> > The MSM has many integrated engines that allow offloading a variety of
> > workloads. These engines have always addressed memory using physical
> > addresses, because of this we had to reserve large (10's MB) buffers
> > at boot. These buffers are never freed regardless of whether an engine
> > is actually using them. As you can imagine, needing to reserve memory
> > for all time on a device that doesn't have a lot of memory in the
> > first place is not ideal because that memory could be used for other
> > things, running apps, etc.
> >
> > To solve this problem we put IOMMUs in front of a lot of the
> > engines. IOMMUs allow us to map physically discontiguous memory into a
> > virtually contiguous address range. This means that we could ask the
> > OS for 10 MB of pages and map all of these into our IOMMU space and
> > the engine would still see a contiguous range.
> >
>
>
> I see. Much like I suspected, this is used to replace the static
> regime of the earliest Android kernel. You mention placing IOMMUs in
> front of the A11 engines, you are involved in this architecture as an
> engineer or similar?

I'm involved to the extent of designing and implementing VCM and,
finding it useful for this class of problems, trying to push it upstream.

> Is there a reason a cooperative approach using
> RPC or another mechanism is not used for memory reservation, this is
> something that can be accomplished fully on APPS side?

It can be accomplished a few ways. At this point we let the
application processor manage the buffers. Other cooperative approaches
have been talked about. As you can see in the short but voluminous
canon of MSM Linux support, there is a degree of RPC used to
communicate with other nodes in the system. As time progresses, the
canon of code shows this usage going down.

>
> > In reality, limitations in the hardware meant that we needed to map
> > memory using larger mappings to minimize the number of TLB
> > misses. This, plus the number of IOMMUs and the extreme use cases we
> > needed to design for led us to a generic design.
> >
> > This generic design solved our problem and the general mapping
> > problem. We thought other people, who had this same big-buffer
> > interoperation problem would also appreciate a common API that was
> > built with their needs in mind so we pushed our idea up.
> >
> >>
> >> I'm also interested in how this ability to map memory regions as files
> >> for devices like KGSL/DRI or PMEM might work and why this is better
> >> suited to that purpose than existing methods, where this fits into
> >> camera preview and other issues that have been dealt with in these
> >> trees in novel ways (from my perspective).
> >
> > The file based approach was driven by Android's buffer passing scheme
> > and the need to write userspace drivers for multimedia, etc...
> >
> >
> So the Android file backed approach is obviated by GEM and other mechanisms?

Aye.

>
> Thank you for your help,
> Timothy Meade
> -tmzt #htc-linux (facebook.com/HTCLinux)

2010-07-22 04:25:32

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Mon, Jul 19, 2010 at 12:44:49AM -0700, Eric W. Biederman wrote:
> Zach Pfeffer <[email protected]> writes:
>
> > On Thu, Jul 15, 2010 at 09:55:35AM +0100, Russell King - ARM Linux wrote:
> >> On Wed, Jul 14, 2010 at 06:29:58PM -0700, Zach Pfeffer wrote:
> >> > The VCM ensures that all mappings that map a given physical buffer:
> >> > IOMMU mappings, CPU mappings and one-to-one device mappings all map
> >> > that buffer using the same (or compatible) attributes. At this point
> >> > the only attribute that users can pass is CACHED. In the absence of
> >> > CACHED all accesses go straight through to the physical memory.
> >>
> >> So what you're saying is that if I have a buffer in kernel space
> >> which I already have its virtual address, I can pass this to VCM and
> >> tell it !CACHED, and it'll setup another mapping which is not cached
> >> for me?
> >
> > Not quite. The existing mapping will be represented by a reservation
> > from the prebuilt VCM of the VM. This reservation has been marked
> > non-cached. Another reservation on a IOMMU VCM, also marked non-cached
> > will be backed with the same physical memory. This is legal in ARM,
> > allowing the vcm_back call to succeed. If you instead passed cached on
> > the second mapping, the first mapping would be non-cached and the
> > second would be cached. If the underlying architecture supported this
> > then the vcm_back would go through.
>
> How does this compare with the x86 pat code?

First, thanks for asking this question. I wasn't aware of the x86 PAT
code and I got to read about it. From my initial read the VCM differs
in two ways:

1. The attributes are explicitly set on virtual address ranges. These
reservations can then map physical memory with these attributes.

2. We explicitly allow multiple mappings (as long as the attributes are
compatible). One such mapping may come from an IOMMU's virtual address
space while another comes from the CPU's virtual address space. These
mappings may exist at the same time.
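
As a rough illustration of point 2, the check below decides whether a
new mapping of a physical buffer may coexist with the mappings that
already exist. This is only a sketch, not the VCM implementation: the
struct, the function name and the flag value are hypothetical, and
"compatible" is assumed here to mean "same cacheability", since CACHED
is the only attribute discussed so far.

```c
#include <stddef.h>

#define ATTR_CACHED 0x1u /* hypothetical flag value for this sketch */

struct mapping {
	const char *owner;  /* e.g. "cpu" or "iommu0"; illustrative only */
	unsigned int attr;
};

/*
 * Return 1 if a mapping with new_attr may coexist with every existing
 * mapping of the same physical buffer, 0 otherwise.
 * Assumed rule: the cacheability of all mappings must match.
 */
static int mapping_allowed(const struct mapping *existing, size_t n,
			   unsigned int new_attr)
{
	size_t i;

	for (i = 0; i < n; i++)
		if ((existing[i].attr & ATTR_CACHED) !=
		    (new_attr & ATTR_CACHED))
			return 0;
	return 1;
}
```

With a non-cached CPU mapping and a non-cached IOMMU mapping in place, a
third non-cached mapping is accepted and a cached one is refused.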

>
> >> You are aware that multiple V:P mappings for the same physical page
> >> with different attributes are being outlawed with ARMv6 and ARMv7
> >> due to speculative prefetching. The cache can be searched even for
> >> a mapping specified as 'normal, uncached' and you can get cache hits
> >> because the data has been speculatively loaded through a separate
> >> cached mapping of the same physical page.
> >
> > I didn't know that. Thanks for the heads up.
> >
> >> FYI, during the next merge window, I will be pushing a patch which makes
> >> ioremap() of system RAM fail, which should be the last core code creator
> >> of mappings with different memory types. This behaviour has been outlawed
> >> (as unpredictable) in the architecture specification and does cause
> >> problems on some CPUs.
> >
> > That's fair enough, but it seems like it should only be outlawed for
> > those processors on which it breaks.
>
> To my knowledge mismatch of mapping attributes is a problem on most
> cpus on every architecture. I don't see it making sense to encourage
> coding constructs that will fail in the strangest most difficult to
> debug ways.

Yes it is a problem, as Russell has brought up, but there's something
I probably haven't communicated well. I'll use the following example:

There are 3 devices: A CPU, a decoder and a video output device. All 3
devices need to map the same 12 MB buffer at the same time. Once this
buffer has served its purpose it gets freed and goes back into the
pool of big buffers. When the same usage case exists again the buffer
needs to get reallocated and the same devices need to map to it.

This usage case does exist, not only for Qualcomm but for all of these
SoC media engines that have started running Linux. The VCM API
attempts to cover this case for the Linux kernel.

2010-07-22 04:30:41

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, Jul 21, 2010 at 10:44:37AM +0900, FUJITA Tomonori wrote:
> On Tue, 20 Jul 2010 15:20:01 -0700
> Zach Pfeffer <[email protected]> wrote:
>
> > > I'm not saying that it's reasonable to pass (or even allocate) a 1MB
> > > buffer via the DMA API.
> >
> > But given a bunch of large chunks of memory, is there any API that can
> > manage them (asked this on the other thread as well)?
>
> What is the problem about mapping a 1MB buffer with the DMA API?
>
> Possibly, an IOMMU can't find space for 1MB but it's not the problem
> of the DMA API.

This goes to the nub of the issue. We need a lot of 1 MB physically
contiguous chunks. The system is going to fragment, and we'll never get
the twelve 1 MB chunks that we need: since the DMA API allocator uses
the system pool, it will never succeed. For this reason we reserve a
pool of 1 MB chunks (and 16 MB, 64 KB, etc.) to satisfy our
requests. This same use case is seen on most embedded "media" engines
that are getting built today.
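
To make the reserved-pool idea concrete, here is a minimal sketch of a
pool of fixed-size 1 MB chunks carved out of a region set aside at
boot. It is not the allocator from the patch set; the names, the pool
size of 16 chunks and the linear scan are all assumptions chosen to
keep the example short. The base address is assumed to be non-zero so
that 0 can signal exhaustion.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE  ((uintptr_t)1 << 20) /* 1 MB */
#define POOL_CHUNKS 16                   /* arbitrary for this sketch */

struct chunk_pool {
	uintptr_t base;                  /* start of the reserved region */
	unsigned char used[POOL_CHUNKS]; /* 1 = chunk handed out */
};

static void pool_init(struct chunk_pool *p, uintptr_t base)
{
	size_t i;

	p->base = base;
	for (i = 0; i < POOL_CHUNKS; i++)
		p->used[i] = 0;
}

/* Return the address of a free 1 MB chunk, or 0 if the pool is empty. */
static uintptr_t pool_alloc(struct chunk_pool *p)
{
	size_t i;

	for (i = 0; i < POOL_CHUNKS; i++) {
		if (!p->used[i]) {
			p->used[i] = 1;
			return p->base + (uintptr_t)i * CHUNK_SIZE;
		}
	}
	return 0;
}

/* Return 0 on success, -1 if addr is not an allocated chunk. */
static int pool_free(struct chunk_pool *p, uintptr_t addr)
{
	uintptr_t off;
	size_t idx;

	if (addr < p->base)
		return -1;
	off = addr - p->base;
	if (off % CHUNK_SIZE != 0)
		return -1;
	idx = (size_t)(off / CHUNK_SIZE);
	if (idx >= POOL_CHUNKS || !p->used[idx])
		return -1;
	p->used[idx] = 0;
	return 0;
}
```

Because every chunk is the same size and the region is never handed to
the general page allocator, fragmentation elsewhere in the system cannot
make a 1 MB request fail while free chunks remain.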

2010-07-22 04:45:11

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, 21 Jul 2010 21:30:34 -0700
Zach Pfeffer <[email protected]> wrote:

> On Wed, Jul 21, 2010 at 10:44:37AM +0900, FUJITA Tomonori wrote:
> > On Tue, 20 Jul 2010 15:20:01 -0700
> > Zach Pfeffer <[email protected]> wrote:
> >
> > > > I'm not saying that it's reasonable to pass (or even allocate) a 1MB
> > > > buffer via the DMA API.
> > >
> > > But given a bunch of large chunks of memory, is there any API that can
> > > manage them (asked this on the other thread as well)?
> >
> > What is the problem about mapping a 1MB buffer with the DMA API?
> >
> > Possibly, an IOMMU can't find space for 1MB but it's not the problem
> > of the DMA API.
>
> This goes to the nub of the issue. We need a lot of 1 MB physically
> contiguous chunks. The system is going to fragment and we'll never get
> our 12 1 MB chunks that we'll need, since the DMA API allocator uses
> the system pool it will never succeed. For this reason we reserve a
> pool of 1 MB chunks (and 16 MB, 64 KB etc...) to satisfy our
> requests. This same use case is seen on most embedded "media" engines
> that are getting built today.

We don't need a new abstraction to reserve some memory.

If you want a pre-allocated memory pool per device (and to share it
with some others), the DMA API can do that for coherent memory (see
dma_alloc_from_coherent). You can extend the DMA API if necessary.

2010-07-22 07:36:14

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, Jul 21, 2010 at 09:25:28PM -0700, Zach Pfeffer wrote:
> Yes it is a problem, as Russell has brought up, but there's something
> I probably haven't communicated well. I'll use the following example:
>
> There are 3 devices: A CPU, a decoder and a video output device. All 3
> devices need to map the same 12 MB buffer at the same time.

Why do you need the same buffer mapped by the CPU?

Let's take your example of a video decoder and video output device.
Surely the CPU doesn't want to be writing to the same memory region
used for the output picture as the decoder is writing to. So what's
the point of mapping that memory into the CPU's address space?

Surely the video output device doesn't need to see the input data to
the decoder either?

Surely, all you need is:

1. a mapping for the CPU for a chunk of memory to pass data to the
decoder.
2. a mapping for the decoder to see the chunk of memory to receive data
from the CPU.
3. a mapping for the decoder to see a chunk of memory used for the output
video buffer.
4. a mapping for the output device to see the video buffer.

So I don't see why everything needs to be mapped by everything else.

2010-07-22 07:40:22

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Wed, Jul 21, 2010 at 09:30:34PM -0700, Zach Pfeffer wrote:
> This goes to the nub of the issue. We need a lot of 1 MB physically
> contiguous chunks. The system is going to fragment and we'll never get
> our 12 1 MB chunks that we'll need, since the DMA API allocator uses
> the system pool it will never succeed.

By the "DMA API allocator" I assume you mean the coherent DMA interface.
The DMA coherent API and DMA streaming APIs are two separate sub-interfaces
of the DMA API and are not dependent on each other.

2010-07-22 16:26:04

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Thu, Jul 22, 2010 at 08:34:55AM +0100, Russell King - ARM Linux wrote:
> On Wed, Jul 21, 2010 at 09:25:28PM -0700, Zach Pfeffer wrote:
> > Yes it is a problem, as Russell has brought up, but there's something
> > I probably haven't communicated well. I'll use the following example:
> >
> > There are 3 devices: A CPU, a decoder and a video output device. All 3
> > devices need to map the same 12 MB buffer at the same time.
>
> Why do you need the same buffer mapped by the CPU?
>
> Let's take your example of a video decoder and video output device.
> Surely the CPU doesn't want to be writing to the same memory region
> used for the output picture as the decoder is writing to. So what's
> the point of mapping that memory into the CPU's address space?

It may, especially if you're doing some software post-processing. Also,
by mapping all the buffers it's extremely fast to "pass the buffers"
around in this scenario - the buffer passing becomes a simple signal.

>
> Surely the video output device doesn't need to see the input data to
> the decoder either?

No, but other devices may (like the CPU).

>
> Surely, all you need is:
>
> 1. a mapping for the CPU for a chunk of memory to pass data to the
> decoder.
> 2. a mapping for the decoder to see the chunk of memory to receive data
> from the CPU.
> 3. a mapping for the decoder to see a chunk of memory used for the output
> video buffer.
> 4. a mapping for the output device to see the video buffer.
>
> So I don't see why everything needs to be mapped by everything else.

That's fair, but we do share buffers and we do have many, very large
mappings, and we do need to pull these from separate pools because
they need to exhibit a particular allocation profile. I agree with you
that things should work like you've listed, but with Qualcomm's ARM
multimedia engines we're seeing some different usage scenarios. It's
the giant buffers, needing to use our own buffer allocator, the need
to share and the need to swap out virtual IOMMU space (which we
haven't talked about much) which make the DMA API seem like a
mismatch. (We haven't even talked about graphics usage ;) ).

2010-07-22 16:28:27

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Thu, Jul 22, 2010 at 08:39:17AM +0100, Russell King - ARM Linux wrote:
> On Wed, Jul 21, 2010 at 09:30:34PM -0700, Zach Pfeffer wrote:
> > This goes to the nub of the issue. We need a lot of 1 MB physically
> > contiguous chunks. The system is going to fragment and we'll never get
> > our 12 1 MB chunks that we'll need, since the DMA API allocator uses
> > the system pool it will never succeed.
>
> By the "DMA API allocator" I assume you mean the coherent DMA interface,
> The DMA coherent API and DMA streaming APIs are two separate sub-interfaces
> of the DMA API and are not dependent on each other.

I didn't know that, but yes. As far as I can tell they both allocate
memory from the VM. We'd need a way to hook in our own minimized
mapping allocator.

2010-07-22 16:45:01

by Zach Pfeffer

[permalink] [raw]
Subject: Re: [RFC 1/3 v3] mm: iommu: An API to unify IOMMU, CPU and device memory management

On Thu, Jul 22, 2010 at 01:43:26PM +0900, FUJITA Tomonori wrote:
> On Wed, 21 Jul 2010 21:30:34 -0700
> Zach Pfeffer <[email protected]> wrote:
>
> > On Wed, Jul 21, 2010 at 10:44:37AM +0900, FUJITA Tomonori wrote:
> > > On Tue, 20 Jul 2010 15:20:01 -0700
> > > Zach Pfeffer <[email protected]> wrote:
> > >
> > > > > I'm not saying that it's reasonable to pass (or even allocate) a 1MB
> > > > > buffer via the DMA API.
> > > >
> > > > But given a bunch of large chunks of memory, is there any API that can
> > > > manage them (asked this on the other thread as well)?
> > >
> > > What is the problem about mapping a 1MB buffer with the DMA API?
> > >
> > > Possibly, an IOMMU can't find space for 1MB but it's not the problem
> > > of the DMA API.
> >
> > This goes to the nub of the issue. We need a lot of 1 MB physically
> > contiguous chunks. The system is going to fragment and we'll never get
> > our 12 1 MB chunks that we'll need, since the DMA API allocator uses
> > the system pool it will never succeed. For this reason we reserve a
> > pool of 1 MB chunks (and 16 MB, 64 KB etc...) to satisfy our
> > requests. This same use case is seen on most embedded "media" engines
> > that are getting built today.
>
> We don't need a new abstraction to reserve some memory.
>
> If you want pre-allocated memory pool per device (and share them with
> some), the DMA API can for coherent memory (see
> dma_alloc_from_coherent). You can extend the DMA API if necessary.

That function won't work for us. We can't use
bitmap_find_free_region(); we need to use our own allocator. If
anything we need a dma_alloc_from_custom(my_allocator). Take a look
at:

mm: iommu: A physical allocator for the VCMM
vcm_alloc_max_munch()
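
To show the shape of the hook being asked for, here is a minimal
sketch of an allocation entry point that delegates to a caller-supplied
allocator. Nothing here is an existing kernel interface:
dma_alloc_from_custom() above is a hypothetical name, and the struct,
the alloc_from_custom() helper and the bump allocator used as the
example backend are all invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook: the caller supplies the allocation policy. */
struct custom_allocator {
	uintptr_t (*alloc)(void *priv, size_t size);
	void *priv;
};

/* Delegate to the supplied allocator; 0 means failure. */
static uintptr_t alloc_from_custom(const struct custom_allocator *a,
				   size_t size)
{
	if (!a || !a->alloc || size == 0)
		return 0;
	return a->alloc(a->priv, size);
}

/* Example backend: a bump allocator over a reserved region. */
struct bump_region {
	uintptr_t next;
	uintptr_t end;
};

static uintptr_t bump_alloc(void *priv, size_t size)
{
	struct bump_region *r = priv;
	uintptr_t addr;

	if (size > r->end - r->next)
		return 0;
	addr = r->next;
	r->next += size;
	return addr;
}
```

The point of the indirection is that the generic layer never needs to
know how the backing memory is chosen, so a platform could plug in a
policy other than a bitmap search.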