2013-09-19 07:38:43

by Bharat Bhushan

[permalink] [raw]
Subject: [PATCH 0/7] vfio-pci: add support for Freescale IOMMU (PAMU)

From: Bharat Bhushan <[email protected]>

This patchset adds support for vfio-pci with Freescale
IOMMU (PAMU- Peripheral Access Management Unit)

The Freescale PAMU is an aperture-based IOMMU with the following
characteristics. Each device has an entry in a table in memory
describing the iova->phys mapping. The mapping has:
-an overall aperture that is power of 2 sized, and has a start iova that
is naturally aligned
-has 1 or more windows within the aperture
-number of windows must be power of 2, max is 256
-size of each window is determined by aperture size / # of windows
-iova of each window is determined by aperture start iova / # of windows
-the mapped region in each window can be different than
the window size...mapping must power of 2
-physical address of the mapping must be naturally aligned
with the mapping size

Because of some of above said limitations we need to set limited aperture
window which will have space for MSI address mapping. So we create space
for MSI windows just after the IOVA (guest memory).
First 4 patches in this patchset are for setting up MSI window and MSI address
at device accordingly.

Fifth patch resolves compilation error.
Sixth patch moves some common functions in a separate file so that they can be
used by FSL_PAMU implementation (next patch uses this). These will be used later for
iommu-none implementation. I believe we can do more of this but will take step by step.

Finally the seventh patch actually adds the support for FSL-PAMU :)

Bharat Bhushan (7):
powerpc: Add interface to get msi region information
iommu: add api to get iommu_domain of a device
fsl iommu: add get_dev_iommu_domain
powerpc: translate msi addr to iova if iommu is in use
iommu: supress loff_t compilation error on powerpc
vfio: moving some functions in common file
vfio pci: Add vfio iommu implementation for FSL_PAMU

arch/powerpc/include/asm/machdep.h | 8 +
arch/powerpc/include/asm/pci.h | 2 +
arch/powerpc/kernel/msi.c | 18 +
arch/powerpc/sysdev/fsl_msi.c | 95 ++++-
arch/powerpc/sysdev/fsl_msi.h | 11 +-
drivers/iommu/fsl_pamu_domain.c | 30 ++
drivers/iommu/iommu.c | 10 +
drivers/pci/msi.c | 26 +
drivers/vfio/Kconfig | 6 +
drivers/vfio/Makefile | 5 +-
drivers/vfio/pci/vfio_pci_rdwr.c | 3 +-
drivers/vfio/vfio_iommu_common.c | 235 +++++++++
drivers/vfio/vfio_iommu_common.h | 30 ++
drivers/vfio/vfio_iommu_fsl_pamu.c | 952 ++++++++++++++++++++++++++++++++++++
drivers/vfio/vfio_iommu_type1.c | 206 +--------
include/linux/iommu.h | 7 +
include/linux/msi.h | 8 +
include/linux/pci.h | 13 +
include/uapi/linux/vfio.h | 100 ++++
19 files changed, 1550 insertions(+), 215 deletions(-)
create mode 100644 drivers/vfio/vfio_iommu_common.c
create mode 100644 drivers/vfio/vfio_iommu_common.h
create mode 100644 drivers/vfio/vfio_iommu_fsl_pamu.c


2013-09-19 07:36:59

by Bharat Bhushan

[permalink] [raw]
Subject: [PATCH 1/7] powerpc: Add interface to get msi region information

This patch adds interface to get following information
- Number of MSI regions (which is number of MSI banks for powerpc).
- Get the region address range: Physical page which have the
address/addresses used for generating MSI interrupt
and size of the page.

These are required to create IOMMU (Freescale PAMU) mapping for
devices which are directly assigned using VFIO.

Signed-off-by: Bharat Bhushan <[email protected]>
---
arch/powerpc/include/asm/machdep.h | 8 +++++++
arch/powerpc/include/asm/pci.h | 2 +
arch/powerpc/kernel/msi.c | 18 ++++++++++++++++
arch/powerpc/sysdev/fsl_msi.c | 39 +++++++++++++++++++++++++++++++++--
arch/powerpc/sysdev/fsl_msi.h | 11 ++++++++-
drivers/pci/msi.c | 26 ++++++++++++++++++++++++
include/linux/msi.h | 8 +++++++
include/linux/pci.h | 13 ++++++++++++
8 files changed, 120 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 8b48090..8d1b787 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -30,6 +30,7 @@ struct file;
struct pci_controller;
struct kimage;
struct pci_host_bridge;
+struct msi_region;

struct machdep_calls {
char *name;
@@ -124,6 +125,13 @@ struct machdep_calls {
int (*setup_msi_irqs)(struct pci_dev *dev,
int nvec, int type);
void (*teardown_msi_irqs)(struct pci_dev *dev);
+
+ /* Returns the number of MSI regions (banks) */
+ int (*msi_get_region_count)(void);
+
+ /* Returns the requested region's address and size */
+ int (*msi_get_region)(int region_num,
+ struct msi_region *region);
#endif

void (*restart)(char *cmd);
diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
index 6653f27..e575349 100644
--- a/arch/powerpc/include/asm/pci.h
+++ b/arch/powerpc/include/asm/pci.h
@@ -117,6 +117,8 @@ extern int pci_proc_domain(struct pci_bus *bus);
#define arch_setup_msi_irqs arch_setup_msi_irqs
#define arch_teardown_msi_irqs arch_teardown_msi_irqs
#define arch_msi_check_device arch_msi_check_device
+#define arch_msi_get_region_count arch_msi_get_region_count
+#define arch_msi_get_region arch_msi_get_region

struct vm_area_struct;
/* Map a range of PCI memory or I/O space for a device into user space */
diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index 8bbc12d..1a67787 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -13,6 +13,24 @@

#include <asm/machdep.h>

+int arch_msi_get_region_count(void)
+{
+ if (ppc_md.msi_get_region_count) {
+ pr_debug("msi: Using platform get_region_count routine.\n");
+ return ppc_md.msi_get_region_count();
+ }
+ return 0;
+}
+
+int arch_msi_get_region(int region_num, struct msi_region *region)
+{
+ if (ppc_md.msi_get_region) {
+ pr_debug("msi: Using platform get_region routine.\n");
+ return ppc_md.msi_get_region(region_num, region);
+ }
+ return 0;
+}
+
int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
{
if (!ppc_md.setup_msi_irqs || !ppc_md.teardown_msi_irqs) {
diff --git a/arch/powerpc/sysdev/fsl_msi.c b/arch/powerpc/sysdev/fsl_msi.c
index ab02db3..ed045cb 100644
--- a/arch/powerpc/sysdev/fsl_msi.c
+++ b/arch/powerpc/sysdev/fsl_msi.c
@@ -96,6 +96,34 @@ static int fsl_msi_init_allocator(struct fsl_msi *msi_data)
return 0;
}

+static int fsl_msi_get_region_count(void)
+{
+ int count = 0;
+ struct fsl_msi *msi_data;
+
+ list_for_each_entry(msi_data, &msi_head, list)
+ count++;
+
+ return count;
+}
+
+static int fsl_msi_get_region(int region_num, struct msi_region *region)
+{
+ struct fsl_msi *msi_data;
+
+ list_for_each_entry(msi_data, &msi_head, list) {
+ if (msi_data->bank_index == region_num) {
+ region->region_num = msi_data->bank_index;
+ /* Setting PAGE_SIZE as MSIIR is a 4 byte register */
+ region->size = PAGE_SIZE;
+ region->addr = msi_data->msiir & ~(region->size - 1);
+ return 0;
+ }
+ }
+
+ return -ENODEV;
+}
+
static int fsl_msi_check_device(struct pci_dev *pdev, int nvec, int type)
{
if (type == PCI_CAP_ID_MSIX)
@@ -137,7 +165,8 @@ static void fsl_compose_msi_msg(struct pci_dev *pdev, int hwirq,
if (reg && (len == sizeof(u64)))
address = be64_to_cpup(reg);
else
- address = fsl_pci_immrbar_base(hose) + msi_data->msiir_offset;
+ address = fsl_pci_immrbar_base(hose) +
+ (msi_data->msiir & 0xfffff);

msg->address_lo = lower_32_bits(address);
msg->address_hi = upper_32_bits(address);
@@ -376,6 +405,7 @@ static int fsl_of_msi_probe(struct platform_device *dev)
int len;
u32 offset;
static const u32 all_avail[] = { 0, NR_MSI_IRQS };
+ static int bank_index;

match = of_match_device(fsl_of_msi_ids, &dev->dev);
if (!match)
@@ -419,8 +449,8 @@ static int fsl_of_msi_probe(struct platform_device *dev)
dev->dev.of_node->full_name);
goto error_out;
}
- msi->msiir_offset =
- features->msiir_offset + (res.start & 0xfffff);
+ msi->msiir = res.start + features->msiir_offset;
+ printk("msi->msiir = %llx\n", msi->msiir);
}

msi->feature = features->fsl_pic_ip;
@@ -470,6 +500,7 @@ static int fsl_of_msi_probe(struct platform_device *dev)
}
}

+ msi->bank_index = bank_index++;
list_add_tail(&msi->list, &msi_head);

/* The multiple setting ppc_md.setup_msi_irqs will not harm things */
@@ -477,6 +508,8 @@ static int fsl_of_msi_probe(struct platform_device *dev)
ppc_md.setup_msi_irqs = fsl_setup_msi_irqs;
ppc_md.teardown_msi_irqs = fsl_teardown_msi_irqs;
ppc_md.msi_check_device = fsl_msi_check_device;
+ ppc_md.msi_get_region_count = fsl_msi_get_region_count;
+ ppc_md.msi_get_region = fsl_msi_get_region;
} else if (ppc_md.setup_msi_irqs != fsl_setup_msi_irqs) {
dev_err(&dev->dev, "Different MSI driver already installed!\n");
err = -ENODEV;
diff --git a/arch/powerpc/sysdev/fsl_msi.h b/arch/powerpc/sysdev/fsl_msi.h
index 8225f86..6bd5cfc 100644
--- a/arch/powerpc/sysdev/fsl_msi.h
+++ b/arch/powerpc/sysdev/fsl_msi.h
@@ -29,12 +29,19 @@ struct fsl_msi {
struct irq_domain *irqhost;

unsigned long cascade_irq;
-
- u32 msiir_offset; /* Offset of MSIIR, relative to start of CCSR */
+ dma_addr_t msiir; /* MSIIR Address in CCSR */
void __iomem *msi_regs;
u32 feature;
int msi_virqs[NR_MSI_REG];

+ /*
+ * During probe each bank is assigned a index number.
+ * index number ranges from 0 to 2^32.
+ * Example MSI bank 1 = 0
+ * MSI bank 2 = 1, and so on.
+ */
+ int bank_index;
+
struct msi_bitmap bitmap;

struct list_head list; /* support multiple MSI banks */
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index aca7578..6d85c15 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -30,6 +30,20 @@ static int pci_msi_enable = 1;

/* Arch hooks */

+#ifndef arch_msi_get_region_count
+int arch_msi_get_region_count(void)
+{
+ return 0;
+}
+#endif
+
+#ifndef arch_msi_get_region
+int arch_msi_get_region(int region_num, struct msi_region *region)
+{
+ return 0;
+}
+#endif
+
#ifndef arch_msi_check_device
int arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
{
@@ -903,6 +917,18 @@ void pci_disable_msi(struct pci_dev *dev)
}
EXPORT_SYMBOL(pci_disable_msi);

+int msi_get_region_count(void)
+{
+ return arch_msi_get_region_count();
+}
+EXPORT_SYMBOL(msi_get_region_count);
+
+int msi_get_region(int region_num, struct msi_region *region)
+{
+ return arch_msi_get_region(region_num, region);
+}
+EXPORT_SYMBOL(msi_get_region);
+
/**
* pci_msix_table_size - return the number of device's MSI-X table entries
* @dev: pointer to the pci_dev data structure of MSI-X device function
diff --git a/include/linux/msi.h b/include/linux/msi.h
index ee66f3a..ae32601 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -50,6 +50,12 @@ struct msi_desc {
struct kobject kobj;
};

+struct msi_region {
+ int region_num;
+ dma_addr_t addr;
+ size_t size;
+};
+
/*
* The arch hook for setup up msi irqs
*/
@@ -58,5 +64,7 @@ void arch_teardown_msi_irq(unsigned int irq);
int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
void arch_teardown_msi_irqs(struct pci_dev *dev);
int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
+int arch_msi_get_region_count(void);
+int arch_msi_get_region(int region_num, struct msi_region *region);

#endif /* LINUX_MSI_H */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 186540d..2b26a59 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1126,6 +1126,7 @@ struct msix_entry {
u16 entry; /* driver uses to specify entry, OS writes */
};

+struct msi_region;

#ifndef CONFIG_PCI_MSI
static inline int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
@@ -1168,6 +1169,16 @@ static inline int pci_msi_enabled(void)
{
return 0;
}
+
+static inline int msi_get_region_count(void)
+{
+ return 0;
+}
+
+static inline int msi_get_region(int region_num, struct msi_region *region)
+{
+ return 0;
+}
#else
int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec);
int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *maxvec);
@@ -1180,6 +1191,8 @@ void pci_disable_msix(struct pci_dev *dev);
void msi_remove_pci_irq_vectors(struct pci_dev *dev);
void pci_restore_msi_state(struct pci_dev *dev);
int pci_msi_enabled(void);
+int msi_get_region_count(void);
+int msi_get_region(int region_num, struct msi_region *region);
#endif

#ifdef CONFIG_PCIEPORTBUS
--
1.7.0.4

2013-09-19 07:37:12

by Bharat Bhushan

[permalink] [raw]
Subject: [PATCH 2/7] iommu: add api to get iommu_domain of a device

This api return the iommu domain to which the device is attached.
The iommu_domain is required for making API calls related to iommu.
Follow up patches which use this API to know iommu maping.

Signed-off-by: Bharat Bhushan <[email protected]>
---
drivers/iommu/iommu.c | 10 ++++++++++
include/linux/iommu.h | 7 +++++++
2 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index fbe9ca7..6ac5f50 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -696,6 +696,16 @@ void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
}
EXPORT_SYMBOL_GPL(iommu_detach_device);

+struct iommu_domain *iommu_get_dev_domain(struct device *dev)
+{
+ struct iommu_ops *ops = dev->bus->iommu_ops;
+
+ if (unlikely(ops == NULL || ops->get_dev_iommu_domain == NULL))
+ return NULL;
+
+ return ops->get_dev_iommu_domain(dev);
+}
+EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
/*
* IOMMU groups are really the natrual working unit of the IOMMU, but
* the IOMMU API works on domains and devices. Bridge that gap by
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 7ea319e..fa046bd 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -127,6 +127,7 @@ struct iommu_ops {
int (*domain_set_windows)(struct iommu_domain *domain, u32 w_count);
/* Get the numer of window per domain */
u32 (*domain_get_windows)(struct iommu_domain *domain);
+ struct iommu_domain *(*get_dev_iommu_domain)(struct device *dev);

unsigned long pgsize_bitmap;
};
@@ -190,6 +191,7 @@ extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
phys_addr_t offset, u64 size,
int prot);
extern void iommu_domain_window_disable(struct iommu_domain *domain, u32 wnd_nr);
+extern struct iommu_domain *iommu_get_dev_domain(struct device *dev);
/**
* report_iommu_fault() - report about an IOMMU fault to the IOMMU framework
* @domain: the iommu domain where the fault has happened
@@ -284,6 +286,11 @@ static inline void iommu_domain_window_disable(struct iommu_domain *domain,
{
}

+static inline struct iommu_domain *iommu_get_dev_domain(struct device *dev)
+{
+ return NULL;
+}
+
static inline phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
{
return 0;
--
1.7.0.4

2013-09-19 07:37:23

by Bharat Bhushan

[permalink] [raw]
Subject: [PATCH 4/7] powerpc: translate msi addr to iova if iommu is in use

If the device is attached with iommu domain then set MSI address
to the iova configured in PAMU.

Signed-off-by: Bharat Bhushan <[email protected]>
---
arch/powerpc/sysdev/fsl_msi.c | 56 +++++++++++++++++++++++++++++++++++++++-
1 files changed, 54 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/fsl_msi.c b/arch/powerpc/sysdev/fsl_msi.c
index ed045cb..c7cf018 100644
--- a/arch/powerpc/sysdev/fsl_msi.c
+++ b/arch/powerpc/sysdev/fsl_msi.c
@@ -18,6 +18,7 @@
#include <linux/pci.h>
#include <linux/slab.h>
#include <linux/of_platform.h>
+#include <linux/iommu.h>
#include <sysdev/fsl_soc.h>
#include <asm/prom.h>
#include <asm/hw_irq.h>
@@ -150,7 +151,40 @@ static void fsl_teardown_msi_irqs(struct pci_dev *pdev)
return;
}

-static void fsl_compose_msi_msg(struct pci_dev *pdev, int hwirq,
+static uint64_t fsl_iommu_get_iova(struct pci_dev *pdev, dma_addr_t msi_phys)
+{
+ struct iommu_domain *domain;
+ struct iommu_domain_geometry geometry;
+ u32 wins = 0;
+ uint64_t iova, size;
+ int ret, i;
+
+ domain = iommu_get_dev_domain(&pdev->dev);
+ if (!domain)
+ return 0;
+
+ ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_WINDOWS, &wins);
+ if (ret)
+ return 0;
+
+ ret = iommu_domain_get_attr(domain, DOMAIN_ATTR_GEOMETRY, &geometry);
+ if (ret)
+ return 0;
+
+ iova = geometry.aperture_start;
+ size = geometry.aperture_end - geometry.aperture_start + 1;
+ do_div(size, wins);
+ for (i = 0; i < wins; i++) {
+ phys_addr_t phys;
+ phys = iommu_iova_to_phys(domain, iova);
+ if (phys == (msi_phys & ~(PAGE_SIZE - 1)))
+ return (iova + (msi_phys & (PAGE_SIZE - 1)));
+ iova += size;
+ }
+ return 0;
+}
+
+static int fsl_compose_msi_msg(struct pci_dev *pdev, int hwirq,
struct msi_msg *msg,
struct fsl_msi *fsl_msi_data)
{
@@ -168,6 +202,16 @@ static void fsl_compose_msi_msg(struct pci_dev *pdev, int hwirq,
address = fsl_pci_immrbar_base(hose) +
(msi_data->msiir & 0xfffff);

+ /*
+ * If the device is attached with iommu domain then set MSI address
+ * to the iova configured in PAMU.
+ */
+ if (iommu_get_dev_domain(&pdev->dev)) {
+ address = fsl_iommu_get_iova(pdev, msi_data->msiir);
+ if (!address)
+ return -ENODEV;
+ }
+
msg->address_lo = lower_32_bits(address);
msg->address_hi = upper_32_bits(address);

@@ -175,6 +219,8 @@ static void fsl_compose_msi_msg(struct pci_dev *pdev, int hwirq,

pr_debug("%s: allocated srs: %d, ibs: %d\n",
__func__, hwirq / IRQS_PER_MSI_REG, hwirq % IRQS_PER_MSI_REG);
+
+ return 0;
}

static int fsl_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
@@ -244,7 +290,13 @@ static int fsl_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
/* chip_data is msi_data via host->hostdata in host->map() */
irq_set_msi_desc(virq, entry);

- fsl_compose_msi_msg(pdev, hwirq, &msg, msi_data);
+ if (fsl_compose_msi_msg(pdev, hwirq, &msg, msi_data)) {
+ dev_err(&pdev->dev, "Fail to set MSI for hwirq %i\n",
+ hwirq);
+ msi_bitmap_free_hwirqs(&msi_data->bitmap, hwirq, 1);
+ rc = -ENODEV;
+ goto out_free;
+ }
write_msi_msg(virq, &msg);
}
return 0;
--
1.7.0.4

2013-09-19 07:37:29

by Bharat Bhushan

[permalink] [raw]
Subject: [PATCH 5/7] iommu: supress loff_t compilation error on powerpc

Signed-off-by: Bharat Bhushan <[email protected]>
---
drivers/vfio/pci/vfio_pci_rdwr.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 210db24..8a8156a 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -181,7 +181,8 @@ ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
size_t count, loff_t *ppos, bool iswrite)
{
int ret;
- loff_t off, pos = *ppos & VFIO_PCI_OFFSET_MASK;
+ loff_t off;
+ u64 pos = (u64 )(*ppos & VFIO_PCI_OFFSET_MASK);
void __iomem *iomem = NULL;
unsigned int rsrc;
bool is_ioport;
--
1.7.0.4

2013-09-19 07:37:40

by Bharat Bhushan

[permalink] [raw]
Subject: [PATCH 6/7] vfio: moving some functions in common file

Some function defined in vfio_iommu_type1.c were common and
we want to use these for FSL IOMMU (PAMU) and iommu-none driver.
So some of them are moved to vfio_iommu_common.c

I think we can do more of that but we will take this step by step.

Signed-off-by: Bharat Bhushan <[email protected]>
---
drivers/vfio/Makefile | 4 +-
drivers/vfio/vfio_iommu_common.c | 235 ++++++++++++++++++++++++++++++++++++++
drivers/vfio/vfio_iommu_common.h | 30 +++++
drivers/vfio/vfio_iommu_type1.c | 206 +---------------------------------
4 files changed, 268 insertions(+), 207 deletions(-)
create mode 100644 drivers/vfio/vfio_iommu_common.c
create mode 100644 drivers/vfio/vfio_iommu_common.h

diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index 72bfabc..c5792ec 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,4 +1,4 @@
obj-$(CONFIG_VFIO) += vfio.o
-obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
-obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
+obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_common.o vfio_iommu_type1.o
+obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_common.o vfio_iommu_spapr_tce.o
obj-$(CONFIG_VFIO_PCI) += pci/
diff --git a/drivers/vfio/vfio_iommu_common.c b/drivers/vfio/vfio_iommu_common.c
new file mode 100644
index 0000000..8bdc0ea
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_common.c
@@ -0,0 +1,235 @@
+/*
+ * VFIO: Common code for vfio IOMMU support
+ *
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Author: Alex Williamson <[email protected]>
+ * Author: Bharat Bhushan <[email protected]>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ *
+ * Derived from original vfio:
+ * Copyright 2010 Cisco Systems, Inc. All rights reserved.
+ * Author: Tom Lyon, [email protected]
+ */
+
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/pci.h> /* pci_bus_type */
+#include <linux/rbtree.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/workqueue.h>
+
+static bool disable_hugepages;
+module_param_named(disable_hugepages,
+ disable_hugepages, bool, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable_hugepages,
+ "Disable VFIO IOMMU support for IOMMU hugepages.");
+
+struct vwork {
+ struct mm_struct *mm;
+ long npage;
+ struct work_struct work;
+};
+
+/* delayed decrement/increment for locked_vm */
+void vfio_lock_acct_bg(struct work_struct *work)
+{
+ struct vwork *vwork = container_of(work, struct vwork, work);
+ struct mm_struct *mm;
+
+ mm = vwork->mm;
+ down_write(&mm->mmap_sem);
+ mm->locked_vm += vwork->npage;
+ up_write(&mm->mmap_sem);
+ mmput(mm);
+ kfree(vwork);
+}
+
+void vfio_lock_acct(long npage)
+{
+ struct vwork *vwork;
+ struct mm_struct *mm;
+
+ if (!current->mm || !npage)
+ return; /* process exited or nothing to do */
+
+ if (down_write_trylock(&current->mm->mmap_sem)) {
+ current->mm->locked_vm += npage;
+ up_write(&current->mm->mmap_sem);
+ return;
+ }
+
+ /*
+ * Couldn't get mmap_sem lock, so must setup to update
+ * mm->locked_vm later. If locked_vm were atomic, we
+ * wouldn't need this silliness
+ */
+ vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
+ if (!vwork)
+ return;
+ mm = get_task_mm(current);
+ if (!mm) {
+ kfree(vwork);
+ return;
+ }
+ INIT_WORK(&vwork->work, vfio_lock_acct_bg);
+ vwork->mm = mm;
+ vwork->npage = npage;
+ schedule_work(&vwork->work);
+}
+
+/*
+ * Some mappings aren't backed by a struct page, for example an mmap'd
+ * MMIO range for our own or another device. These use a different
+ * pfn conversion and shouldn't be tracked as locked pages.
+ */
+bool is_invalid_reserved_pfn(unsigned long pfn)
+{
+ if (pfn_valid(pfn)) {
+ bool reserved;
+ struct page *tail = pfn_to_page(pfn);
+ struct page *head = compound_trans_head(tail);
+ reserved = !!(PageReserved(head));
+ if (head != tail) {
+ /*
+ * "head" is not a dangling pointer
+ * (compound_trans_head takes care of that)
+ * but the hugepage may have been split
+ * from under us (and we may not hold a
+ * reference count on the head page so it can
+ * be reused before we run PageReferenced), so
+ * we've to check PageTail before returning
+ * what we just read.
+ */
+ smp_rmb();
+ if (PageTail(tail))
+ return reserved;
+ }
+ return PageReserved(tail);
+ }
+
+ return true;
+}
+
+int put_pfn(unsigned long pfn, int prot)
+{
+ if (!is_invalid_reserved_pfn(pfn)) {
+ struct page *page = pfn_to_page(pfn);
+ if (prot & IOMMU_WRITE)
+ SetPageDirty(page);
+ put_page(page);
+ return 1;
+ }
+ return 0;
+}
+
+static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
+{
+ struct page *page[1];
+ struct vm_area_struct *vma;
+ int ret = -EFAULT;
+
+ if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
+ *pfn = page_to_pfn(page[0]);
+ return 0;
+ }
+
+ printk("via vma\n");
+ down_read(&current->mm->mmap_sem);
+
+ vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
+
+ if (vma && vma->vm_flags & VM_PFNMAP) {
+ *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
+ if (is_invalid_reserved_pfn(*pfn))
+ ret = 0;
+ }
+
+ up_read(&current->mm->mmap_sem);
+
+ return ret;
+}
+
+/*
+ * Attempt to pin pages. We really don't want to track all the pfns and
+ * the iommu can only map chunks of consecutive pfns anyway, so get the
+ * first page and all consecutive pages with the same locking.
+ */
+long vfio_pin_pages(unsigned long vaddr, long npage,
+ int prot, unsigned long *pfn_base)
+{
+ unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+ bool lock_cap = capable(CAP_IPC_LOCK);
+ long ret, i;
+
+ if (!current->mm)
+ return -ENODEV;
+
+ ret = vaddr_get_pfn(vaddr, prot, pfn_base);
+ if (ret)
+ return ret;
+
+ if (is_invalid_reserved_pfn(*pfn_base))
+ return 1;
+
+ if (!lock_cap && current->mm->locked_vm + 1 > limit) {
+ put_pfn(*pfn_base, prot);
+ pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
+ limit << PAGE_SHIFT);
+ return -ENOMEM;
+ }
+
+ if (unlikely(disable_hugepages)) {
+ vfio_lock_acct(1);
+ return 1;
+ }
+
+ /* Lock all the consecutive pages from pfn_base */
+ for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
+ unsigned long pfn = 0;
+
+ ret = vaddr_get_pfn(vaddr, prot, &pfn);
+ if (ret)
+ break;
+
+ if (pfn != *pfn_base + i || is_invalid_reserved_pfn(pfn)) {
+ put_pfn(pfn, prot);
+ break;
+ }
+
+ if (!lock_cap && current->mm->locked_vm + i + 1 > limit) {
+ put_pfn(pfn, prot);
+ pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
+ __func__, limit << PAGE_SHIFT);
+ break;
+ }
+ }
+
+ vfio_lock_acct(i);
+
+ return i;
+}
+
+long vfio_unpin_pages(unsigned long pfn, long npage,
+ int prot, bool do_accounting)
+{
+ unsigned long unlocked = 0;
+ long i;
+
+ for (i = 0; i < npage; i++)
+ unlocked += put_pfn(pfn++, prot);
+
+ if (do_accounting)
+ vfio_lock_acct(-unlocked);
+
+ return unlocked;
+}
diff --git a/drivers/vfio/vfio_iommu_common.h b/drivers/vfio/vfio_iommu_common.h
new file mode 100644
index 0000000..4738391
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_common.h
@@ -0,0 +1,30 @@
+/*
+ * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
+ * Copyright (C) 2013 Freescale Semiconductor, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef _VFIO_IOMMU_COMMON_H
+#define _VFIO_IOMMU_COMMON_H
+
+void vfio_lock_acct_bg(struct work_struct *work);
+void vfio_lock_acct(long npage);
+bool is_invalid_reserved_pfn(unsigned long pfn);
+int put_pfn(unsigned long pfn, int prot);
+long vfio_pin_pages(unsigned long vaddr, long npage, int prot,
+ unsigned long *pfn_base);
+long vfio_unpin_pages(unsigned long pfn, long npage,
+ int prot, bool do_accounting);
+#endif
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index a9807de..e9a58fa 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -37,6 +37,7 @@
#include <linux/uaccess.h>
#include <linux/vfio.h>
#include <linux/workqueue.h>
+#include "vfio_iommu_common.h"

#define DRIVER_VERSION "0.2"
#define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
@@ -48,12 +49,6 @@ module_param_named(allow_unsafe_interrupts,
MODULE_PARM_DESC(allow_unsafe_interrupts,
"Enable VFIO IOMMU support for on platforms without interrupt remapping support.");

-static bool disable_hugepages;
-module_param_named(disable_hugepages,
- disable_hugepages, bool, S_IRUGO | S_IWUSR);
-MODULE_PARM_DESC(disable_hugepages,
- "Disable VFIO IOMMU support for IOMMU hugepages.");
-
struct vfio_iommu {
struct iommu_domain *domain;
struct mutex lock;
@@ -123,205 +118,6 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
rb_erase(&old->node, &iommu->dma_list);
}

-struct vwork {
- struct mm_struct *mm;
- long npage;
- struct work_struct work;
-};
-
-/* delayed decrement/increment for locked_vm */
-static void vfio_lock_acct_bg(struct work_struct *work)
-{
- struct vwork *vwork = container_of(work, struct vwork, work);
- struct mm_struct *mm;
-
- mm = vwork->mm;
- down_write(&mm->mmap_sem);
- mm->locked_vm += vwork->npage;
- up_write(&mm->mmap_sem);
- mmput(mm);
- kfree(vwork);
-}
-
-static void vfio_lock_acct(long npage)
-{
- struct vwork *vwork;
- struct mm_struct *mm;
-
- if (!current->mm || !npage)
- return; /* process exited or nothing to do */
-
- if (down_write_trylock(&current->mm->mmap_sem)) {
- current->mm->locked_vm += npage;
- up_write(&current->mm->mmap_sem);
- return;
- }
-
- /*
- * Couldn't get mmap_sem lock, so must setup to update
- * mm->locked_vm later. If locked_vm were atomic, we
- * wouldn't need this silliness
- */
- vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
- if (!vwork)
- return;
- mm = get_task_mm(current);
- if (!mm) {
- kfree(vwork);
- return;
- }
- INIT_WORK(&vwork->work, vfio_lock_acct_bg);
- vwork->mm = mm;
- vwork->npage = npage;
- schedule_work(&vwork->work);
-}
-
-/*
- * Some mappings aren't backed by a struct page, for example an mmap'd
- * MMIO range for our own or another device. These use a different
- * pfn conversion and shouldn't be tracked as locked pages.
- */
-static bool is_invalid_reserved_pfn(unsigned long pfn)
-{
- if (pfn_valid(pfn)) {
- bool reserved;
- struct page *tail = pfn_to_page(pfn);
- struct page *head = compound_trans_head(tail);
- reserved = !!(PageReserved(head));
- if (head != tail) {
- /*
- * "head" is not a dangling pointer
- * (compound_trans_head takes care of that)
- * but the hugepage may have been split
- * from under us (and we may not hold a
- * reference count on the head page so it can
- * be reused before we run PageReferenced), so
- * we've to check PageTail before returning
- * what we just read.
- */
- smp_rmb();
- if (PageTail(tail))
- return reserved;
- }
- return PageReserved(tail);
- }
-
- return true;
-}
-
-static int put_pfn(unsigned long pfn, int prot)
-{
- if (!is_invalid_reserved_pfn(pfn)) {
- struct page *page = pfn_to_page(pfn);
- if (prot & IOMMU_WRITE)
- SetPageDirty(page);
- put_page(page);
- return 1;
- }
- return 0;
-}
-
-static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
-{
- struct page *page[1];
- struct vm_area_struct *vma;
- int ret = -EFAULT;
-
- if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
- *pfn = page_to_pfn(page[0]);
- return 0;
- }
-
- down_read(&current->mm->mmap_sem);
-
- vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
-
- if (vma && vma->vm_flags & VM_PFNMAP) {
- *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
- if (is_invalid_reserved_pfn(*pfn))
- ret = 0;
- }
-
- up_read(&current->mm->mmap_sem);
-
- return ret;
-}
-
-/*
- * Attempt to pin pages. We really don't want to track all the pfns and
- * the iommu can only map chunks of consecutive pfns anyway, so get the
- * first page and all consecutive pages with the same locking.
- */
-static long vfio_pin_pages(unsigned long vaddr, long npage,
- int prot, unsigned long *pfn_base)
-{
- unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
- bool lock_cap = capable(CAP_IPC_LOCK);
- long ret, i;
-
- if (!current->mm)
- return -ENODEV;
-
- ret = vaddr_get_pfn(vaddr, prot, pfn_base);
- if (ret)
- return ret;
-
- if (is_invalid_reserved_pfn(*pfn_base))
- return 1;
-
- if (!lock_cap && current->mm->locked_vm + 1 > limit) {
- put_pfn(*pfn_base, prot);
- pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
- limit << PAGE_SHIFT);
- return -ENOMEM;
- }
-
- if (unlikely(disable_hugepages)) {
- vfio_lock_acct(1);
- return 1;
- }
-
- /* Lock all the consecutive pages from pfn_base */
- for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
- unsigned long pfn = 0;
-
- ret = vaddr_get_pfn(vaddr, prot, &pfn);
- if (ret)
- break;
-
- if (pfn != *pfn_base + i || is_invalid_reserved_pfn(pfn)) {
- put_pfn(pfn, prot);
- break;
- }
-
- if (!lock_cap && current->mm->locked_vm + i + 1 > limit) {
- put_pfn(pfn, prot);
- pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
- __func__, limit << PAGE_SHIFT);
- break;
- }
- }
-
- vfio_lock_acct(i);
-
- return i;
-}
-
-static long vfio_unpin_pages(unsigned long pfn, long npage,
- int prot, bool do_accounting)
-{
- unsigned long unlocked = 0;
- long i;
-
- for (i = 0; i < npage; i++)
- unlocked += put_pfn(pfn++, prot);
-
- if (do_accounting)
- vfio_lock_acct(-unlocked);
-
- return unlocked;
-}
-
static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
dma_addr_t iova, size_t *size)
{
--
1.7.0.4

2013-09-19 07:37:47

by Bharat Bhushan

[permalink] [raw]
Subject: [PATCH 7/7] vfio pci: Add vfio iommu implementation for FSL_PAMU

This patch adds vfio iommu support for Freescale IOMMU
(PAMU - Peripheral Access Management Unit).

The Freescale PAMU is an aperture-based IOMMU with the following
characteristics. Each device has an entry in a table in memory
describing the iova->phys mapping. The mapping has:
-an overall aperture that is power of 2 sized, and has a start iova that
is naturally aligned
-has 1 or more windows within the aperture
-number of windows must be power of 2, max is 256
-size of each window is determined by aperture size / # of windows
-iova of each window is determined by aperture start iova / # of windows
-the mapped region in each window can be different than
the window size...mapping must power of 2
-physical address of the mapping must be naturally aligned
with the mapping size

Some of the code is derived from TYPE1 iommu (driver/vfio/vfio_iommu_type1.c).

Signed-off-by: Bharat Bhushan <[email protected]>
---
drivers/vfio/Kconfig | 6 +
drivers/vfio/Makefile | 1 +
drivers/vfio/vfio_iommu_fsl_pamu.c | 952 ++++++++++++++++++++++++++++++++++++
include/uapi/linux/vfio.h | 100 ++++
4 files changed, 1059 insertions(+), 0 deletions(-)
create mode 100644 drivers/vfio/vfio_iommu_fsl_pamu.c

diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
index 26b3d9d..7d1da26 100644
--- a/drivers/vfio/Kconfig
+++ b/drivers/vfio/Kconfig
@@ -8,11 +8,17 @@ config VFIO_IOMMU_SPAPR_TCE
depends on VFIO && SPAPR_TCE_IOMMU
default n

+config VFIO_IOMMU_FSL_PAMU
+ tristate
+ depends on VFIO
+ default n
+
menuconfig VFIO
tristate "VFIO Non-Privileged userspace driver framework"
depends on IOMMU_API
select VFIO_IOMMU_TYPE1 if X86
select VFIO_IOMMU_SPAPR_TCE if (PPC_POWERNV || PPC_PSERIES)
+ select VFIO_IOMMU_FSL_PAMU if FSL_PAMU
help
VFIO provides a framework for secure userspace device drivers.
See Documentation/vfio.txt for more details.
diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
index c5792ec..7461350 100644
--- a/drivers/vfio/Makefile
+++ b/drivers/vfio/Makefile
@@ -1,4 +1,5 @@
obj-$(CONFIG_VFIO) += vfio.o
obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_common.o vfio_iommu_type1.o
obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_common.o vfio_iommu_spapr_tce.o
+obj-$(CONFIG_VFIO_IOMMU_FSL_PAMU) += vfio_iommu_common.o vfio_iommu_fsl_pamu.o
obj-$(CONFIG_VFIO_PCI) += pci/
diff --git a/drivers/vfio/vfio_iommu_fsl_pamu.c b/drivers/vfio/vfio_iommu_fsl_pamu.c
new file mode 100644
index 0000000..b29365f
--- /dev/null
+++ b/drivers/vfio/vfio_iommu_fsl_pamu.c
@@ -0,0 +1,952 @@
+/*
+ * VFIO: IOMMU DMA mapping support for FSL PAMU IOMMU
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License, version 2, as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
+ *
+ * Copyright (C) 2013 Freescale Semiconductor, Inc.
+ *
+ * Author: Bharat Bhushan <[email protected]>
+ *
+ * This file is derived from driver/vfio/vfio_iommu_type1.c
+ *
+ * The Freescale PAMU is an aperture-based IOMMU with the following
+ * characteristics. Each device has an entry in a table in memory
+ * describing the iova->phys mapping. The mapping has:
+ * -an overall aperture that is power of 2 sized, and has a start iova that
+ * is naturally aligned
+ * -has 1 or more windows within the aperture
+ * -number of windows must be power of 2, max is 256
+ * -size of each window is determined by aperture size / # of windows
+ * -iova of each window is determined by aperture start iova / # of windows
+ * -the mapped region in each window can be different than
+ * the window size...mapping must power of 2
+ * -physical address of the mapping must be naturally aligned
+ * with the mapping size
+ */
+
+#include <linux/compat.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/iommu.h>
+#include <linux/module.h>
+#include <linux/mm.h>
+#include <linux/pci.h> /* pci_bus_type */
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+#include <linux/vfio.h>
+#include <linux/workqueue.h>
+#include <linux/hugetlb.h>
+#include <linux/msi.h>
+#include <asm/fsl_pamu_stash.h>
+
+#include "vfio_iommu_common.h"
+
+#define DRIVER_VERSION "0.1"
+#define DRIVER_AUTHOR "Bharat Bhushan <[email protected]>"
+#define DRIVER_DESC "FSL PAMU IOMMU driver for VFIO"
+
+struct vfio_iommu {
+ struct iommu_domain *domain;
+ struct mutex lock;
+ dma_addr_t aperture_start;
+ dma_addr_t aperture_end;
+ dma_addr_t page_size; /* Maximum mapped Page size */
+ int nsubwindows; /* Number of subwindows */
+ struct rb_root dma_list;
+ struct list_head msi_dma_list;
+ struct list_head group_list;
+};
+
+struct vfio_dma {
+ struct rb_node node;
+ dma_addr_t iova; /* Device address */
+ unsigned long vaddr; /* Process virtual addr */
+ size_t size; /* Number of pages */
+ int prot; /* IOMMU_READ/WRITE */
+};
+
+struct vfio_msi_dma {
+ struct list_head next;
+ dma_addr_t iova; /* Device address */
+ int bank_id;
+ int prot; /* IOMMU_READ/WRITE */
+};
+
+struct vfio_group {
+ struct iommu_group *iommu_group;
+ struct list_head next;
+};
+
+static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
+ dma_addr_t start, size_t size)
+{
+ struct rb_node *node = iommu->dma_list.rb_node;
+
+ while (node) {
+ struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
+
+ if (start + size <= dma->iova)
+ node = node->rb_left;
+ else if (start >= dma->iova + dma->size)
+ node = node->rb_right;
+ else
+ return dma;
+ }
+
+ return NULL;
+}
+
+static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
+{
+ struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
+ struct vfio_dma *dma;
+
+ while (*link) {
+ parent = *link;
+ dma = rb_entry(parent, struct vfio_dma, node);
+
+ if (new->iova + new->size <= dma->iova)
+ link = &(*link)->rb_left;
+ else
+ link = &(*link)->rb_right;
+ }
+
+ rb_link_node(&new->node, parent, link);
+ rb_insert_color(&new->node, &iommu->dma_list);
+}
+
+static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
+{
+ rb_erase(&old->node, &iommu->dma_list);
+}
+
+static int iova_to_win(struct vfio_iommu *iommu, dma_addr_t iova)
+{
+ u64 offset = iova - iommu->aperture_start;
+ do_div(offset, iommu->page_size);
+ return (int) offset;
+}
+
+static int vfio_disable_iommu_domain(struct vfio_iommu *iommu)
+{
+ int enable = 0;
+ return iommu_domain_set_attr(iommu->domain,
+ DOMAIN_ATTR_FSL_PAMU_ENABLE, &enable);
+}
+
+static int vfio_enable_iommu_domain(struct vfio_iommu *iommu)
+{
+ int enable = 1;
+ return iommu_domain_set_attr(iommu->domain,
+ DOMAIN_ATTR_FSL_PAMU_ENABLE, &enable);
+}
+
+/* Unmap DMA region */
+static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
+ dma_addr_t iova, size_t *size)
+{
+ dma_addr_t start = iova;
+ int win, win_start, win_end;
+ long unlocked = 0;
+ unsigned int nr_pages;
+
+ nr_pages = iommu->page_size / PAGE_SIZE;
+ win_start = iova_to_win(iommu, iova);
+ win_end = iova_to_win(iommu, iova + *size - 1);
+
+ /* Release the pinned pages */
+ for (win = win_start; win <= win_end; iova += iommu->page_size, win++) {
+ unsigned long pfn;
+
+ pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
+ if (!pfn)
+ continue;
+
+ iommu_domain_window_disable(iommu->domain, win);
+
+ unlocked += vfio_unpin_pages(pfn, nr_pages, dma->prot, 1);
+ }
+
+ vfio_lock_acct(-unlocked);
+ *size = iova - start;
+ return 0;
+}
+
+static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
+ size_t *size, struct vfio_dma *dma)
+{
+ size_t offset, overlap, tmp;
+ struct vfio_dma *split;
+ int ret;
+
+ if (!*size)
+ return 0;
+
+ /*
+ * Existing dma region is completely covered, unmap all. This is
+ * the likely case since userspace tends to map and unmap buffers
+ * in one shot rather than multiple mappings within a buffer.
+ */
+ if (likely(start <= dma->iova &&
+ start + *size >= dma->iova + dma->size)) {
+ *size = dma->size;
+ ret = vfio_unmap_unpin(iommu, dma, dma->iova, size);
+ if (ret)
+ return ret;
+
+ /*
+ * Did we remove more than we have? Should never happen
+ * since a vfio_dma is contiguous in iova and vaddr.
+ */
+ WARN_ON(*size != dma->size);
+
+ vfio_remove_dma(iommu, dma);
+ kfree(dma);
+ return 0;
+ }
+
+ /* Overlap low address of existing range */
+ if (start <= dma->iova) {
+ overlap = start + *size - dma->iova;
+ ret = vfio_unmap_unpin(iommu, dma, dma->iova, &overlap);
+ if (ret)
+ return ret;
+
+ vfio_remove_dma(iommu, dma);
+
+ /*
+ * Check, we may have removed to whole vfio_dma. If not
+ * fixup and re-insert.
+ */
+ if (overlap < dma->size) {
+ dma->iova += overlap;
+ dma->vaddr += overlap;
+ dma->size -= overlap;
+ vfio_insert_dma(iommu, dma);
+ } else
+ kfree(dma);
+
+ *size = overlap;
+ return 0;
+ }
+
+ /* Overlap high address of existing range */
+ if (start + *size >= dma->iova + dma->size) {
+ offset = start - dma->iova;
+ overlap = dma->size - offset;
+
+ ret = vfio_unmap_unpin(iommu, dma, start, &overlap);
+ if (ret)
+ return ret;
+
+ dma->size -= overlap;
+ *size = overlap;
+ return 0;
+ }
+
+ /* Split existing */
+
+ /*
+ * Allocate our tracking structure early even though it may not
+ * be used. An Allocation failure later loses track of pages and
+ * is more difficult to unwind.
+ */
+ split = kzalloc(sizeof(*split), GFP_KERNEL);
+ if (!split)
+ return -ENOMEM;
+
+ offset = start - dma->iova;
+
+ ret = vfio_unmap_unpin(iommu, dma, start, size);
+ if (ret || !*size) {
+ kfree(split);
+ return ret;
+ }
+
+ tmp = dma->size;
+
+ /* Resize the lower vfio_dma in place, before the below insert */
+ dma->size = offset;
+
+ /* Insert new for remainder, assuming it didn't all get unmapped */
+ if (likely(offset + *size < tmp)) {
+ split->size = tmp - offset - *size;
+ split->iova = dma->iova + offset + *size;
+ split->vaddr = dma->vaddr + offset + *size;
+ split->prot = dma->prot;
+ vfio_insert_dma(iommu, split);
+ } else
+ kfree(split);
+
+ return 0;
+}
+
+/* Map DMA region */
+static int vfio_dma_map(struct vfio_iommu *iommu, dma_addr_t iova,
+ unsigned long vaddr, long npage, int prot)
+{
+ int ret = 0, i;
+ size_t size;
+ unsigned int win, nr_subwindows;
+ dma_addr_t iovamap;
+
+ /* total size to be mapped */
+ size = npage << PAGE_SHIFT;
+ do_div(size, iommu->page_size);
+ nr_subwindows = size;
+ size = npage << PAGE_SHIFT;
+ iovamap = iova;
+ for (i = 0; i < nr_subwindows; i++) {
+ unsigned long pfn;
+ unsigned long nr_pages;
+ dma_addr_t mapsize;
+ struct vfio_dma *dma = NULL;
+
+ win = iova_to_win(iommu, iovamap);
+ if (iovamap != iommu->aperture_start + iommu->page_size * win) {
+ pr_err("%s iova(%llx) unalligned to window size %llx\n",
+ __func__, iovamap, iommu->page_size);
+ ret = -EINVAL;
+ break;
+ }
+
+ mapsize = min(iova + size - iovamap, iommu->page_size);
+ /*
+ * FIXME: Currently we only support mapping page-size
+ * of subwindow-size.
+ */
+ if (mapsize < iommu->page_size) {
+ pr_err("%s iova (%llx) not alligned to window size %llx\n",
+ __func__, iovamap, iommu->page_size);
+ ret = -EINVAL;
+ break;
+ }
+
+ nr_pages = mapsize >> PAGE_SHIFT;
+
+ /* Pin a contiguous chunk of memory */
+ ret = vfio_pin_pages(vaddr, nr_pages, prot, &pfn);
+ if (ret != nr_pages) {
+ pr_err("%s unable to pin pages = %lx, pinned(%lx/%lx)\n",
+ __func__, vaddr, npage, nr_pages);
+ ret = -EINVAL;
+ break;
+ }
+
+ ret = iommu_domain_window_enable(iommu->domain, win,
+ (phys_addr_t)pfn << PAGE_SHIFT,
+ mapsize, prot);
+ if (ret) {
+ pr_err("%s unable to iommu_map()\n", __func__);
+ ret = -EINVAL;
+ break;
+ }
+
+ /*
+ * Check if we abut a region below - nothing below 0.
+ * This is the most likely case when mapping chunks of
+ * physically contiguous regions within a virtual address
+ * range. Update the abutting entry in place since iova
+ * doesn't change.
+ */
+ if (likely(iovamap)) {
+ struct vfio_dma *tmp;
+ tmp = vfio_find_dma(iommu, iovamap - 1, 1);
+ if (tmp && tmp->prot == prot &&
+ tmp->vaddr + tmp->size == vaddr) {
+ tmp->size += mapsize;
+ dma = tmp;
+ }
+ }
+
+ /*
+ * Check if we abut a region above - nothing above ~0 + 1.
+ * If we abut above and below, remove and free. If only
+ * abut above, remove, modify, reinsert.
+ */
+ if (likely(iovamap + mapsize)) {
+ struct vfio_dma *tmp;
+ tmp = vfio_find_dma(iommu, iovamap + mapsize, 1);
+ if (tmp && tmp->prot == prot &&
+ tmp->vaddr == vaddr + mapsize) {
+ vfio_remove_dma(iommu, tmp);
+ if (dma) {
+ dma->size += tmp->size;
+ kfree(tmp);
+ } else {
+ tmp->size += mapsize;
+ tmp->iova = iovamap;
+ tmp->vaddr = vaddr;
+ vfio_insert_dma(iommu, tmp);
+ dma = tmp;
+ }
+ }
+ }
+
+ if (!dma) {
+ dma = kzalloc(sizeof(*dma), GFP_KERNEL);
+ if (!dma) {
+ iommu_unmap(iommu->domain, iovamap, mapsize);
+ vfio_unpin_pages(pfn, npage, prot, true);
+ ret = -ENOMEM;
+ break;
+ }
+
+ dma->size = mapsize;
+ dma->iova = iovamap;
+ dma->vaddr = vaddr;
+ dma->prot = prot;
+ vfio_insert_dma(iommu, dma);
+ }
+
+ iovamap += mapsize;
+ vaddr += mapsize;
+ }
+
+ if (ret) {
+ struct vfio_dma *tmp;
+ while ((tmp = vfio_find_dma(iommu, iova, size))) {
+ int r = vfio_remove_dma_overlap(iommu, iova,
+ &size, tmp);
+ if (WARN_ON(r || !size))
+ break;
+ }
+ }
+
+ vfio_enable_iommu_domain(iommu);
+ return 0;
+}
+
+static int vfio_dma_do_map(struct vfio_iommu *iommu,
+ struct vfio_iommu_type1_dma_map *map)
+{
+ dma_addr_t iova = map->iova;
+ size_t size = map->size;
+ unsigned long vaddr = map->vaddr;
+ int ret = 0, prot = 0;
+ long npage;
+
+ /* READ/WRITE from device perspective */
+ if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
+ prot |= IOMMU_WRITE;
+ if (map->flags & VFIO_DMA_MAP_FLAG_READ)
+ prot |= IOMMU_READ;
+
+ if (!prot)
+ return -EINVAL; /* No READ/WRITE? */
+
+ /* Don't allow IOVA wrap */
+ if (iova + size && iova + size < iova)
+ return -EINVAL;
+
+ /* Don't allow virtual address wrap */
+ if (vaddr + size && vaddr + size < vaddr)
+ return -EINVAL;
+
+ /*
+ * FIXME: Currently we only support mapping page-size
+ * of subwindow-size.
+ */
+ if (size < iommu->page_size)
+ return -EINVAL;
+
+ npage = size >> PAGE_SHIFT;
+ if (!npage)
+ return -EINVAL;
+
+ mutex_lock(&iommu->lock);
+
+ if (vfio_find_dma(iommu, iova, size)) {
+ ret = -EEXIST;
+ goto out_lock;
+ }
+
+ vfio_dma_map(iommu, iova, vaddr, npage, prot);
+
+out_lock:
+ mutex_unlock(&iommu->lock);
+ return ret;
+}
+
+static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
+ struct vfio_iommu_type1_dma_unmap *unmap)
+{
+ struct vfio_dma *dma;
+ size_t unmapped = 0, size;
+ int ret = 0;
+
+ mutex_lock(&iommu->lock);
+
+ while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
+ size = unmap->size;
+ ret = vfio_remove_dma_overlap(iommu, unmap->iova, &size, dma);
+ if (ret || !size)
+ break;
+ unmapped += size;
+ }
+
+ mutex_unlock(&iommu->lock);
+
+ /*
+ * We may unmap more than requested, update the unmap struct so
+ * userspace can know.
+ */
+ unmap->size = unmapped;
+
+ return ret;
+}
+
+static int vfio_handle_get_attr(struct vfio_iommu *iommu,
+ struct vfio_pamu_attr *pamu_attr)
+{
+ switch (pamu_attr->attribute) {
+ case VFIO_ATTR_GEOMETRY: {
+ struct iommu_domain_geometry geom;
+ if (iommu_domain_get_attr(iommu->domain,
+ DOMAIN_ATTR_GEOMETRY, &geom)) {
+ pr_err("%s Error getting domain geometry\n",
+ __func__);
+ return -EFAULT;
+ }
+
+ pamu_attr->attr_info.attr.aperture_start = geom.aperture_start;
+ pamu_attr->attr_info.attr.aperture_end = geom.aperture_end;
+ break;
+ }
+ case VFIO_ATTR_WINDOWS: {
+ u32 count;
+ if (iommu_domain_get_attr(iommu->domain,
+ DOMAIN_ATTR_WINDOWS, &count)) {
+ pr_err("%s Error getting domain windows\n",
+ __func__);
+ return -EFAULT;
+ }
+
+ pamu_attr->attr_info.windows = count;
+ break;
+ }
+ case VFIO_ATTR_PAMU_STASH: {
+ struct pamu_stash_attribute stash;
+ if (iommu_domain_get_attr(iommu->domain,
+ DOMAIN_ATTR_FSL_PAMU_STASH, &stash)) {
+ pr_err("%s Error getting domain windows\n",
+ __func__);
+ return -EFAULT;
+ }
+
+ pamu_attr->attr_info.stash.cpu = stash.cpu;
+ pamu_attr->attr_info.stash.cache = stash.cache;
+ break;
+ }
+
+ default:
+ pr_err("%s Error: Invalid attribute (%d)\n",
+ __func__, pamu_attr->attribute);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int vfio_handle_set_attr(struct vfio_iommu *iommu,
+ struct vfio_pamu_attr *pamu_attr)
+{
+ switch (pamu_attr->attribute) {
+ case VFIO_ATTR_GEOMETRY: {
+ struct iommu_domain_geometry geom;
+
+ geom.aperture_start = pamu_attr->attr_info.attr.aperture_start;
+ geom.aperture_end = pamu_attr->attr_info.attr.aperture_end;
+ iommu->aperture_start = geom.aperture_start;
+ iommu->aperture_end = geom.aperture_end;
+ geom.force_aperture = 1;
+ if (iommu_domain_set_attr(iommu->domain,
+ DOMAIN_ATTR_GEOMETRY, &geom)) {
+ pr_err("%s Error setting domain geometry\n", __func__);
+ return -EFAULT;
+ }
+
+ break;
+ }
+ case VFIO_ATTR_WINDOWS: {
+ u32 count = pamu_attr->attr_info.windows;
+ u64 size;
+ if (count > 256) {
+ pr_err("Number of subwindows requested (%d) is 256\n",
+ count);
+ return -EINVAL;
+ }
+ iommu->nsubwindows = pamu_attr->attr_info.windows;
+ size = iommu->aperture_end - iommu->aperture_start + 1;
+ do_div(size, count);
+ iommu->page_size = size;
+ if (iommu_domain_set_attr(iommu->domain,
+ DOMAIN_ATTR_WINDOWS, &count)) {
+ pr_err("%s Error getting domain windows\n",
+ __func__);
+ return -EFAULT;
+ }
+
+ break;
+ }
+ case VFIO_ATTR_PAMU_STASH: {
+ struct pamu_stash_attribute stash;
+
+ stash.cpu = pamu_attr->attr_info.stash.cpu;
+ stash.cache = pamu_attr->attr_info.stash.cache;
+ if (iommu_domain_set_attr(iommu->domain,
+ DOMAIN_ATTR_FSL_PAMU_STASH, &stash)) {
+ pr_err("%s Error getting domain windows\n",
+ __func__);
+ return -EFAULT;
+ }
+ break;
+ }
+
+ default:
+ pr_err("%s Error: Invalid attribute (%d)\n",
+ __func__, pamu_attr->attribute);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+static int vfio_msi_map(struct vfio_iommu *iommu,
+ struct vfio_pamu_msi_bank_map *msi_map, int prot)
+{
+ struct msi_region region;
+ int window;
+ int ret;
+
+ ret = msi_get_region(msi_map->msi_bank_index, &region);
+ if (ret) {
+ pr_err("%s MSI region (%d) not found\n", __func__,
+ msi_map->msi_bank_index);
+ return ret;
+ }
+
+ window = iova_to_win(iommu, msi_map->iova);
+ ret = iommu_domain_window_enable(iommu->domain, window, region.addr,
+ region.size, prot);
+ if (ret) {
+ pr_err("%s Error: unable to map msi region\n", __func__);
+ return ret;
+ }
+
+ return 0;
+}
+
+static int vfio_do_msi_map(struct vfio_iommu *iommu,
+ struct vfio_pamu_msi_bank_map *msi_map)
+{
+ struct vfio_msi_dma *msi_dma;
+ int ret, prot = 0;
+
+ /* READ/WRITE from device perspective */
+ if (msi_map->flags & VFIO_DMA_MAP_FLAG_WRITE)
+ prot |= IOMMU_WRITE;
+ if (msi_map->flags & VFIO_DMA_MAP_FLAG_READ)
+ prot |= IOMMU_READ;
+
+ if (!prot)
+ return -EINVAL; /* No READ/WRITE? */
+
+ ret = vfio_msi_map(iommu, msi_map, prot);
+ if (ret)
+ return ret;
+
+ msi_dma = kzalloc(sizeof(*msi_dma), GFP_KERNEL);
+ if (!msi_dma)
+ return -ENOMEM;
+
+ msi_dma->iova = msi_map->iova;
+ msi_dma->bank_id = msi_map->msi_bank_index;
+ list_add(&msi_dma->next, &iommu->msi_dma_list);
+ return 0;
+}
+
+static void vfio_msi_unmap(struct vfio_iommu *iommu, dma_addr_t iova)
+{
+ int window;
+ window = iova_to_win(iommu, iova);
+ iommu_domain_window_disable(iommu->domain, window);
+}
+
+static int vfio_do_msi_unmap(struct vfio_iommu *iommu,
+ struct vfio_pamu_msi_bank_unmap *msi_unmap)
+{
+ struct vfio_msi_dma *mdma, *mdma_tmp;
+
+ list_for_each_entry_safe(mdma, mdma_tmp, &iommu->msi_dma_list, next) {
+ if (mdma->iova == msi_unmap->iova) {
+ vfio_msi_unmap(iommu, mdma->iova);
+ list_del(&mdma->next);
+ kfree(mdma);
+ return 0;
+ }
+ }
+
+ return -EINVAL;
+}
+static void *vfio_iommu_fsl_pamu_open(unsigned long arg)
+{
+ struct vfio_iommu *iommu;
+
+ if (arg != VFIO_FSL_PAMU_IOMMU)
+ return ERR_PTR(-EINVAL);
+
+ iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
+ if (!iommu)
+ return ERR_PTR(-ENOMEM);
+
+ INIT_LIST_HEAD(&iommu->group_list);
+ iommu->dma_list = RB_ROOT;
+ INIT_LIST_HEAD(&iommu->msi_dma_list);
+ mutex_init(&iommu->lock);
+
+ /*
+ * Wish we didn't have to know about bus_type here.
+ */
+ iommu->domain = iommu_domain_alloc(&pci_bus_type);
+ if (!iommu->domain) {
+ kfree(iommu);
+ return ERR_PTR(-EIO);
+ }
+
+ return iommu;
+}
+
+static void vfio_iommu_fsl_pamu_release(void *iommu_data)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ struct vfio_group *group, *group_tmp;
+ struct vfio_msi_dma *mdma, *mdma_tmp;
+ struct rb_node *node;
+
+ list_for_each_entry_safe(group, group_tmp, &iommu->group_list, next) {
+ iommu_detach_group(iommu->domain, group->iommu_group);
+ list_del(&group->next);
+ kfree(group);
+ }
+
+ while ((node = rb_first(&iommu->dma_list))) {
+ struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
+ size_t size = dma->size;
+ vfio_remove_dma_overlap(iommu, dma->iova, &size, dma);
+ if (WARN_ON(!size))
+ break;
+ }
+
+ list_for_each_entry_safe(mdma, mdma_tmp, &iommu->msi_dma_list, next) {
+ vfio_msi_unmap(iommu, mdma->iova);
+ list_del(&mdma->next);
+ kfree(mdma);
+ }
+
+ iommu_domain_free(iommu->domain);
+ iommu->domain = NULL;
+ kfree(iommu);
+}
+
+static long vfio_iommu_fsl_pamu_ioctl(void *iommu_data,
+ unsigned int cmd, unsigned long arg)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ unsigned long minsz;
+
+ if (cmd == VFIO_CHECK_EXTENSION) {
+ switch (arg) {
+ case VFIO_FSL_PAMU_IOMMU:
+ return 1;
+ default:
+ return 0;
+ }
+ } else if (cmd == VFIO_IOMMU_MAP_DMA) {
+ struct vfio_iommu_type1_dma_map map;
+ uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
+ VFIO_DMA_MAP_FLAG_WRITE;
+
+ minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
+
+ if (copy_from_user(&map, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (map.argsz < minsz || map.flags & ~mask)
+ return -EINVAL;
+
+ return vfio_dma_do_map(iommu, &map);
+
+ } else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
+ struct vfio_iommu_type1_dma_unmap unmap;
+ long ret;
+
+ minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
+
+ if (copy_from_user(&unmap, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (unmap.argsz < minsz || unmap.flags)
+ return -EINVAL;
+
+ ret = vfio_dma_do_unmap(iommu, &unmap);
+ if (ret)
+ return ret;
+
+ return copy_to_user((void __user *)arg, &unmap, minsz);
+ } else if (cmd == VFIO_IOMMU_PAMU_GET_ATTR) {
+ struct vfio_pamu_attr pamu_attr;
+
+ minsz = offsetofend(struct vfio_pamu_attr, attr_info);
+ if (copy_from_user(&pamu_attr, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (pamu_attr.argsz < minsz)
+ return -EINVAL;
+
+ vfio_handle_get_attr(iommu, &pamu_attr);
+
+ copy_to_user((void __user *)arg, &pamu_attr, minsz);
+ return 0;
+ } else if (cmd == VFIO_IOMMU_PAMU_SET_ATTR) {
+ struct vfio_pamu_attr pamu_attr;
+
+ minsz = offsetofend(struct vfio_pamu_attr, attr_info);
+ if (copy_from_user(&pamu_attr, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (pamu_attr.argsz < minsz)
+ return -EINVAL;
+
+ vfio_handle_set_attr(iommu, &pamu_attr);
+ return 0;
+ } else if (cmd == VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT) {
+ return msi_get_region_count();
+ } else if (cmd == VFIO_IOMMU_PAMU_MAP_MSI_BANK) {
+ struct vfio_pamu_msi_bank_map msi_map;
+
+ minsz = offsetofend(struct vfio_pamu_msi_bank_map, iova);
+ if (copy_from_user(&msi_map, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (msi_map.argsz < minsz)
+ return -EINVAL;
+
+ vfio_do_msi_map(iommu, &msi_map);
+ return 0;
+ } else if (cmd == VFIO_IOMMU_PAMU_UNMAP_MSI_BANK) {
+ struct vfio_pamu_msi_bank_unmap msi_unmap;
+
+ minsz = offsetofend(struct vfio_pamu_msi_bank_unmap, iova);
+ if (copy_from_user(&msi_unmap, (void __user *)arg, minsz))
+ return -EFAULT;
+
+ if (msi_unmap.argsz < minsz)
+ return -EINVAL;
+
+ vfio_do_msi_unmap(iommu, &msi_unmap);
+ return 0;
+
+ }
+
+ return -ENOTTY;
+}
+
+static int vfio_iommu_fsl_pamu_attach_group(void *iommu_data,
+ struct iommu_group *iommu_group)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ struct vfio_group *group, *tmp;
+ int ret;
+
+ group = kzalloc(sizeof(*group), GFP_KERNEL);
+ if (!group)
+ return -ENOMEM;
+
+ mutex_lock(&iommu->lock);
+
+ list_for_each_entry(tmp, &iommu->group_list, next) {
+ if (tmp->iommu_group == iommu_group) {
+ mutex_unlock(&iommu->lock);
+ kfree(group);
+ return -EINVAL;
+ }
+ }
+
+ ret = iommu_attach_group(iommu->domain, iommu_group);
+ if (ret) {
+ mutex_unlock(&iommu->lock);
+ kfree(group);
+ return ret;
+ }
+
+ group->iommu_group = iommu_group;
+ list_add(&group->next, &iommu->group_list);
+
+ mutex_unlock(&iommu->lock);
+
+ return 0;
+}
+
+static void vfio_iommu_fsl_pamu_detach_group(void *iommu_data,
+ struct iommu_group *iommu_group)
+{
+ struct vfio_iommu *iommu = iommu_data;
+ struct vfio_group *group;
+
+ mutex_lock(&iommu->lock);
+
+ list_for_each_entry(group, &iommu->group_list, next) {
+ if (group->iommu_group == iommu_group) {
+ iommu_detach_group(iommu->domain, iommu_group);
+ list_del(&group->next);
+ kfree(group);
+ break;
+ }
+ }
+
+ mutex_unlock(&iommu->lock);
+}
+
+static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_fsl_pamu = {
+ .name = "vfio-iommu-fsl_pamu",
+ .owner = THIS_MODULE,
+ .open = vfio_iommu_fsl_pamu_open,
+ .release = vfio_iommu_fsl_pamu_release,
+ .ioctl = vfio_iommu_fsl_pamu_ioctl,
+ .attach_group = vfio_iommu_fsl_pamu_attach_group,
+ .detach_group = vfio_iommu_fsl_pamu_detach_group,
+};
+
+static int __init vfio_iommu_fsl_pamu_init(void)
+{
+ if (!iommu_present(&pci_bus_type))
+ return -ENODEV;
+
+ return vfio_register_iommu_driver(&vfio_iommu_driver_ops_fsl_pamu);
+}
+
+static void __exit vfio_iommu_fsl_pamu_cleanup(void)
+{
+ vfio_unregister_iommu_driver(&vfio_iommu_driver_ops_fsl_pamu);
+}
+
+module_init(vfio_iommu_fsl_pamu_init);
+module_exit(vfio_iommu_fsl_pamu_cleanup);
+
+MODULE_VERSION(DRIVER_VERSION);
+MODULE_LICENSE("GPL v2");
+MODULE_AUTHOR(DRIVER_AUTHOR);
+MODULE_DESCRIPTION(DRIVER_DESC);
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 0fd47f5..d359055 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -23,6 +23,7 @@

#define VFIO_TYPE1_IOMMU 1
#define VFIO_SPAPR_TCE_IOMMU 2
+#define VFIO_FSL_PAMU_IOMMU 3

/*
* The IOCTL interface is designed for extensibility by embedding the
@@ -451,4 +452,103 @@ struct vfio_iommu_spapr_tce_info {

/* ***************************************************************** */

+/*********** APIs for VFIO_PAMU type only ****************/
+/*
+ * VFIO_IOMMU_PAMU_GET_ATTR - _IO(VFIO_TYPE, VFIO_BASE + 17,
+ * struct vfio_pamu_attr)
+ *
+ * Gets the iommu attributes for the current vfio container.
+ * Caller sets argsz and attribute. The ioctl fills in
+ * the provided struct vfio_pamu_attr based on the attribute
+ * value that was set.
+ * Return: 0 on success, -errno on failure
+ */
+struct vfio_pamu_attr {
+ __u32 argsz;
+ __u32 flags; /* no flags currently */
+#define VFIO_ATTR_GEOMETRY 0
+#define VFIO_ATTR_WINDOWS 1
+#define VFIO_ATTR_PAMU_STASH 2
+ __u32 attribute;
+
+ union {
+ /* VFIO_ATTR_GEOMETRY */
+ struct {
+ /* first addr that can be mapped */
+ __u64 aperture_start;
+ /* last addr that can be mapped */
+ __u64 aperture_end;
+ } attr;
+
+ /* VFIO_ATTR_WINDOWS */
+ __u32 windows; /* number of windows in the aperture
+ * initially this will be the max number
+ * of windows that can be set
+ */
+ /* VFIO_ATTR_PAMU_STASH */
+ struct {
+ __u32 cpu; /* CPU number for stashing */
+ __u32 cache; /* cache ID for stashing */
+ } stash;
+ } attr_info;
+};
+#define VFIO_IOMMU_PAMU_GET_ATTR _IO(VFIO_TYPE, VFIO_BASE + 17)
+
+/*
+ * VFIO_IOMMU_PAMU_SET_ATTR - _IO(VFIO_TYPE, VFIO_BASE + 18,
+ * struct vfio_pamu_attr)
+ *
+ * Sets the iommu attributes for the current vfio container.
+ * Caller sets struct vfio_pamu attr, including argsz and attribute and
+ * setting any fields that are valid for the attribute.
+ * Return: 0 on success, -errno on failure
+ */
+#define VFIO_IOMMU_PAMU_SET_ATTR _IO(VFIO_TYPE, VFIO_BASE + 18)
+
+/*
+ * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT - _IO(VFIO_TYPE, VFIO_BASE + 19, __u32)
+ *
+ * Returns the number of MSI banks for this platform. This tells user space
+ * how many aperture windows should be reserved for MSI banks when setting
+ * the PAMU geometry and window count.
+ * Return: __u32 bank count on success, -errno on failure
+ */
+#define VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT _IO(VFIO_TYPE, VFIO_BASE + 19)
+
+/*
+ * VFIO_IOMMU_PAMU_MAP_MSI_BANK - _IO(VFIO_TYPE, VFIO_BASE + 20,
+ * struct vfio_pamu_msi_bank_map)
+ *
+ * Maps the MSI bank at the specified index and iova. User space must
+ * call this ioctl once for each MSI bank (count of banks is returned by
+ * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT).
+ * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
+ * Return: 0 on success, -errno on failure
+ */
+
+struct vfio_pamu_msi_bank_map {
+ __u32 argsz;
+ __u32 flags; /* no flags currently */
+ __u32 msi_bank_index; /* the index of the MSI bank */
+ __u64 iova; /* the iova the bank is to be mapped to */
+};
+#define VFIO_IOMMU_PAMU_MAP_MSI_BANK _IO(VFIO_TYPE, VFIO_BASE + 20)
+
+/*
+ * VFIO_IOMMU_PAMU_UNMAP_MSI_BANK - _IO(VFIO_TYPE, VFIO_BASE + 21,
+ * struct vfio_pamu_msi_bank_unmap)
+ *
+ * Unmaps the MSI bank at the specified iova.
+ * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
+ * Operates on VFIO file descriptor (/dev/vfio/vfio).
+ * Return: 0 on success, -errno on failure
+ */
+
+struct vfio_pamu_msi_bank_unmap {
+ __u32 argsz;
+ __u32 flags; /* no flags currently */
+ __u64 iova; /* the iova to be unmapped to */
+};
+#define VFIO_IOMMU_PAMU_UNMAP_MSI_BANK _IO(VFIO_TYPE, VFIO_BASE + 21)
+
#endif /* _UAPIVFIO_H */
--
1.7.0.4

2013-09-19 07:38:59

by Bharat Bhushan

[permalink] [raw]
Subject: [PATCH 3/7] fsl iommu: add get_dev_iommu_domain

From: Bharat Bhushan <[email protected]>

returns the iommu_domain of the requested device for fsl pamu.

Use PCI controller dev struct for pci devices as current LIODN schema
assign LIODN to PCI controller not PCI device. This will be corrected
with proper LIODN schema.

Signed-off-by: Bharat Bhushan <[email protected]>
---
drivers/iommu/fsl_pamu_domain.c | 30 ++++++++++++++++++++++++++++++
1 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/drivers/iommu/fsl_pamu_domain.c b/drivers/iommu/fsl_pamu_domain.c
index 14d803a..1d0dfe3 100644
--- a/drivers/iommu/fsl_pamu_domain.c
+++ b/drivers/iommu/fsl_pamu_domain.c
@@ -1140,6 +1140,35 @@ static u32 fsl_pamu_get_windows(struct iommu_domain *domain)
return dma_domain->win_cnt;
}

+static struct iommu_domain *fsl_get_dev_domain(struct device *dev)
+{
+ struct pci_controller *pci_ctl;
+ struct device_domain_info *info;
+ struct pci_dev *pdev;
+
+ /*
+ * Use PCI controller dev struct for pci devices as current
+ * LIODN schema assign LIODN to PCI controller not PCI device
+ * This should get corrected with proper LIODN schema.
+ */
+ if (dev->bus == &pci_bus_type) {
+ pdev = to_pci_dev(dev);
+ pci_ctl = pci_bus_to_host(pdev->bus);
+ /*
+ * make dev point to pci controller device
+ * so we can get the LIODN programmed by
+ * u-boot.
+ */
+ dev = pci_ctl->parent;
+ }
+
+ info = dev->archdata.iommu_domain;
+ if (info && info->domain)
+ return info->domain->iommu_domain;
+
+ return NULL;
+}
+
static struct iommu_ops fsl_pamu_ops = {
.domain_init = fsl_pamu_domain_init,
.domain_destroy = fsl_pamu_domain_destroy,
@@ -1155,6 +1184,7 @@ static struct iommu_ops fsl_pamu_ops = {
.domain_get_attr = fsl_pamu_get_domain_attr,
.add_device = fsl_pamu_add_device,
.remove_device = fsl_pamu_remove_device,
+ .get_dev_iommu_domain = fsl_get_dev_domain,
};

int pamu_domain_init()
--
1.7.0.4

2013-09-24 23:58:19

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 1/7] powerpc: Add interface to get msi region information

On Thu, Sep 19, 2013 at 12:59:17PM +0530, Bharat Bhushan wrote:
> This patch adds interface to get following information
> - Number of MSI regions (which is number of MSI banks for powerpc).
> - Get the region address range: Physical page which have the
> address/addresses used for generating MSI interrupt
> and size of the page.
>
> These are required to create IOMMU (Freescale PAMU) mapping for
> devices which are directly assigned using VFIO.
>
> Signed-off-by: Bharat Bhushan <[email protected]>
> ---
> arch/powerpc/include/asm/machdep.h | 8 +++++++
> arch/powerpc/include/asm/pci.h | 2 +
> arch/powerpc/kernel/msi.c | 18 ++++++++++++++++
> arch/powerpc/sysdev/fsl_msi.c | 39 +++++++++++++++++++++++++++++++++--
> arch/powerpc/sysdev/fsl_msi.h | 11 ++++++++-
> drivers/pci/msi.c | 26 ++++++++++++++++++++++++
> include/linux/msi.h | 8 +++++++
> include/linux/pci.h | 13 ++++++++++++
> 8 files changed, 120 insertions(+), 5 deletions(-)
>
> ...

> diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
> index aca7578..6d85c15 100644
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -30,6 +30,20 @@ static int pci_msi_enable = 1;
>
> /* Arch hooks */
>
> +#ifndef arch_msi_get_region_count
> +int arch_msi_get_region_count(void)
> +{
> + return 0;
> +}
> +#endif
> +
> +#ifndef arch_msi_get_region
> +int arch_msi_get_region(int region_num, struct msi_region *region)
> +{
> + return 0;
> +}
> +#endif

This #define strategy is gone; see 4287d824 ("PCI: use weak functions for
MSI arch-specific functions"). Please use the weak function strategy
for your new MSI region functions.

> +
> #ifndef arch_msi_check_device
> int arch_msi_check_device(struct pci_dev *dev, int nvec, int type)
> {
> @@ -903,6 +917,18 @@ void pci_disable_msi(struct pci_dev *dev)
> }
> EXPORT_SYMBOL(pci_disable_msi);
>
> +int msi_get_region_count(void)
> +{
> + return arch_msi_get_region_count();
> +}
> +EXPORT_SYMBOL(msi_get_region_count);
> +
> +int msi_get_region(int region_num, struct msi_region *region)
> +{
> + return arch_msi_get_region(region_num, region);
> +}
> +EXPORT_SYMBOL(msi_get_region);

Please split these interface additions, i.e., the drivers/pci/msi.c,
include/linux/msi.h, and include/linux/pci.h changes, into a separate
patch.

I don't know enough about VFIO to understand why these new interfaces
are needed. Is this the first VFIO IOMMU driver? I see
vfio_iommu_spapr_tce.c and vfio_iommu_type1.c but I don't know if
they're comparable to the Freescale PAMU. Do other VFIO IOMMU
implementations support MSI? If so, do they handle the problem of
mapping the MSI regions in a different way?

> /**
> * pci_msix_table_size - return the number of device's MSI-X table entries
> * @dev: pointer to the pci_dev data structure of MSI-X device function
> diff --git a/include/linux/msi.h b/include/linux/msi.h
> index ee66f3a..ae32601 100644
> --- a/include/linux/msi.h
> +++ b/include/linux/msi.h
> @@ -50,6 +50,12 @@ struct msi_desc {
> struct kobject kobj;
> };
>
> +struct msi_region {
> + int region_num;
> + dma_addr_t addr;
> + size_t size;
> +};

This needs some sort of explanatory comment.

> /*
> * The arch hook for setup up msi irqs
> */
> @@ -58,5 +64,7 @@ void arch_teardown_msi_irq(unsigned int irq);
> int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type);
> void arch_teardown_msi_irqs(struct pci_dev *dev);
> int arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
> +int arch_msi_get_region_count(void);
> +int arch_msi_get_region(int region_num, struct msi_region *region);
>
> #endif /* LINUX_MSI_H */
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 186540d..2b26a59 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1126,6 +1126,7 @@ struct msix_entry {
> u16 entry; /* driver uses to specify entry, OS writes */
> };
>
> +struct msi_region;
>
> #ifndef CONFIG_PCI_MSI
> static inline int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec)
> @@ -1168,6 +1169,16 @@ static inline int pci_msi_enabled(void)
> {
> return 0;
> }
> +
> +static inline int msi_get_region_count(void)
> +{
> + return 0;
> +}
> +
> +static inline int msi_get_region(int region_num, struct msi_region *region)
> +{
> + return 0;
> +}
> #else
> int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec);
> int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int *maxvec);
> @@ -1180,6 +1191,8 @@ void pci_disable_msix(struct pci_dev *dev);
> void msi_remove_pci_irq_vectors(struct pci_dev *dev);
> void pci_restore_msi_state(struct pci_dev *dev);
> int pci_msi_enabled(void);
> +int msi_get_region_count(void);
> +int msi_get_region(int region_num, struct msi_region *region);
> #endif
>
> #ifdef CONFIG_PCIEPORTBUS
> --
> 1.7.0.4
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2013-09-25 16:40:42

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH 5/7] iommu: supress loff_t compilation error on powerpc

On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> Signed-off-by: Bharat Bhushan <[email protected]>
> ---
> drivers/vfio/pci/vfio_pci_rdwr.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
> index 210db24..8a8156a 100644
> --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> @@ -181,7 +181,8 @@ ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char __user *buf,
> size_t count, loff_t *ppos, bool iswrite)
> {
> int ret;
> - loff_t off, pos = *ppos & VFIO_PCI_OFFSET_MASK;
> + loff_t off;
> + u64 pos = (u64 )(*ppos & VFIO_PCI_OFFSET_MASK);
> void __iomem *iomem = NULL;
> unsigned int rsrc;
> bool is_ioport;

What's the compile error that this fixes?

2013-09-25 16:45:49

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device

On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> This api return the iommu domain to which the device is attached.
> The iommu_domain is required for making API calls related to iommu.
> Follow up patches which use this API to know iommu maping.
>
> Signed-off-by: Bharat Bhushan <[email protected]>
> ---
> drivers/iommu/iommu.c | 10 ++++++++++
> include/linux/iommu.h | 7 +++++++
> 2 files changed, 17 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index fbe9ca7..6ac5f50 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -696,6 +696,16 @@ void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
> }
> EXPORT_SYMBOL_GPL(iommu_detach_device);
>
> +struct iommu_domain *iommu_get_dev_domain(struct device *dev)
> +{
> + struct iommu_ops *ops = dev->bus->iommu_ops;
> +
> + if (unlikely(ops == NULL || ops->get_dev_iommu_domain == NULL))
> + return NULL;
> +
> + return ops->get_dev_iommu_domain(dev);
> +}
> +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);

What prevents this from racing iommu_domain_free()? There's no
references acquired, so there's no reason for the caller to assume the
pointer is valid.

> /*
> * IOMMU groups are really the natrual working unit of the IOMMU, but
> * the IOMMU API works on domains and devices. Bridge that gap by
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 7ea319e..fa046bd 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -127,6 +127,7 @@ struct iommu_ops {
> int (*domain_set_windows)(struct iommu_domain *domain, u32 w_count);
> /* Get the numer of window per domain */
> u32 (*domain_get_windows)(struct iommu_domain *domain);
> + struct iommu_domain *(*get_dev_iommu_domain)(struct device *dev);
>
> unsigned long pgsize_bitmap;
> };
> @@ -190,6 +191,7 @@ extern int iommu_domain_window_enable(struct iommu_domain *domain, u32 wnd_nr,
> phys_addr_t offset, u64 size,
> int prot);
> extern void iommu_domain_window_disable(struct iommu_domain *domain, u32 wnd_nr);
> +extern struct iommu_domain *iommu_get_dev_domain(struct device *dev);
> /**
> * report_iommu_fault() - report about an IOMMU fault to the IOMMU framework
> * @domain: the iommu domain where the fault has happened
> @@ -284,6 +286,11 @@ static inline void iommu_domain_window_disable(struct iommu_domain *domain,
> {
> }
>
> +static inline struct iommu_domain *iommu_get_dev_domain(struct device *dev)
> +{
> + return NULL;
> +}
> +
> static inline phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t iova)
> {
> return 0;


2013-09-25 17:03:07

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH 6/7] vfio: moving some functions in common file

On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> Some function defined in vfio_iommu_type1.c were common and
> we want to use these for FSL IOMMU (PAMU) and iommu-none driver.
> So some of them are moved to vfio_iommu_common.c
>
> I think we can do more of that but we will take this step by step.
>
> Signed-off-by: Bharat Bhushan <[email protected]>
> ---
> drivers/vfio/Makefile | 4 +-
> drivers/vfio/vfio_iommu_common.c | 235 ++++++++++++++++++++++++++++++++++++++
> drivers/vfio/vfio_iommu_common.h | 30 +++++
> drivers/vfio/vfio_iommu_type1.c | 206 +---------------------------------
> 4 files changed, 268 insertions(+), 207 deletions(-)
> create mode 100644 drivers/vfio/vfio_iommu_common.c
> create mode 100644 drivers/vfio/vfio_iommu_common.h
>
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 72bfabc..c5792ec 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1,4 +1,4 @@
> obj-$(CONFIG_VFIO) += vfio.o
> -obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> -obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> +obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_common.o vfio_iommu_type1.o
> +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_common.o vfio_iommu_spapr_tce.o
> obj-$(CONFIG_VFIO_PCI) += pci/
> diff --git a/drivers/vfio/vfio_iommu_common.c b/drivers/vfio/vfio_iommu_common.c
> new file mode 100644
> index 0000000..8bdc0ea
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu_common.c
> @@ -0,0 +1,235 @@
> +/*
> + * VFIO: Common code for vfio IOMMU support
> + *
> + * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
> + * Author: Alex Williamson <[email protected]>
> + * Author: Bharat Bhushan <[email protected]>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> + * Author: Tom Lyon, [email protected]
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/pci.h> /* pci_bus_type */
> +#include <linux/rbtree.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/workqueue.h>

Please cleanup includes on both the source and target files. You
obviously don't need linux/pci.h here for one.

> +
> +static bool disable_hugepages;
> +module_param_named(disable_hugepages,
> + disable_hugepages, bool, S_IRUGO | S_IWUSR);
> +MODULE_PARM_DESC(disable_hugepages,
> + "Disable VFIO IOMMU support for IOMMU hugepages.");
> +
> +struct vwork {
> + struct mm_struct *mm;
> + long npage;
> + struct work_struct work;
> +};
> +
> +/* delayed decrement/increment for locked_vm */
> +void vfio_lock_acct_bg(struct work_struct *work)
> +{
> + struct vwork *vwork = container_of(work, struct vwork, work);
> + struct mm_struct *mm;
> +
> + mm = vwork->mm;
> + down_write(&mm->mmap_sem);
> + mm->locked_vm += vwork->npage;
> + up_write(&mm->mmap_sem);
> + mmput(mm);
> + kfree(vwork);
> +}
> +
> +void vfio_lock_acct(long npage)
> +{
> + struct vwork *vwork;
> + struct mm_struct *mm;
> +
> + if (!current->mm || !npage)
> + return; /* process exited or nothing to do */
> +
> + if (down_write_trylock(&current->mm->mmap_sem)) {
> + current->mm->locked_vm += npage;
> + up_write(&current->mm->mmap_sem);
> + return;
> + }
> +
> + /*
> + * Couldn't get mmap_sem lock, so must setup to update
> + * mm->locked_vm later. If locked_vm were atomic, we
> + * wouldn't need this silliness
> + */
> + vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> + if (!vwork)
> + return;
> + mm = get_task_mm(current);
> + if (!mm) {
> + kfree(vwork);
> + return;
> + }
> + INIT_WORK(&vwork->work, vfio_lock_acct_bg);
> + vwork->mm = mm;
> + vwork->npage = npage;
> + schedule_work(&vwork->work);
> +}
> +
> +/*
> + * Some mappings aren't backed by a struct page, for example an mmap'd
> + * MMIO range for our own or another device. These use a different
> + * pfn conversion and shouldn't be tracked as locked pages.
> + */
> +bool is_invalid_reserved_pfn(unsigned long pfn)
> +{
> + if (pfn_valid(pfn)) {
> + bool reserved;
> + struct page *tail = pfn_to_page(pfn);
> + struct page *head = compound_trans_head(tail);
> + reserved = !!(PageReserved(head));
> + if (head != tail) {
> + /*
> + * "head" is not a dangling pointer
> + * (compound_trans_head takes care of that)
> + * but the hugepage may have been split
> + * from under us (and we may not hold a
> + * reference count on the head page so it can
> + * be reused before we run PageReferenced), so
> + * we've to check PageTail before returning
> + * what we just read.
> + */
> + smp_rmb();
> + if (PageTail(tail))
> + return reserved;
> + }
> + return PageReserved(tail);
> + }
> +
> + return true;
> +}
> +
> +int put_pfn(unsigned long pfn, int prot)
> +{
> + if (!is_invalid_reserved_pfn(pfn)) {
> + struct page *page = pfn_to_page(pfn);
> + if (prot & IOMMU_WRITE)
> + SetPageDirty(page);
> + put_page(page);
> + return 1;
> + }
> + return 0;
> +}
> +
> +static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> +{
> + struct page *page[1];
> + struct vm_area_struct *vma;
> + int ret = -EFAULT;
> +
> + if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> + *pfn = page_to_pfn(page[0]);
> + return 0;
> + }
> +
> + printk("via vma\n");
> + down_read(&current->mm->mmap_sem);
> +
> + vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +
> + if (vma && vma->vm_flags & VM_PFNMAP) {
> + *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> + if (is_invalid_reserved_pfn(*pfn))
> + ret = 0;
> + }
> +
> + up_read(&current->mm->mmap_sem);
> +
> + return ret;
> +}
> +
> +/*
> + * Attempt to pin pages. We really don't want to track all the pfns and
> + * the iommu can only map chunks of consecutive pfns anyway, so get the
> + * first page and all consecutive pages with the same locking.
> + */
> +long vfio_pin_pages(unsigned long vaddr, long npage,
> + int prot, unsigned long *pfn_base)
> +{
> + unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + bool lock_cap = capable(CAP_IPC_LOCK);
> + long ret, i;
> +
> + if (!current->mm)
> + return -ENODEV;
> +
> + ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> + if (ret)
> + return ret;
> +
> + if (is_invalid_reserved_pfn(*pfn_base))
> + return 1;
> +
> + if (!lock_cap && current->mm->locked_vm + 1 > limit) {
> + put_pfn(*pfn_base, prot);
> + pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
> + limit << PAGE_SHIFT);
> + return -ENOMEM;
> + }
> +
> + if (unlikely(disable_hugepages)) {
> + vfio_lock_acct(1);
> + return 1;
> + }
> +
> + /* Lock all the consecutive pages from pfn_base */
> + for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
> + unsigned long pfn = 0;
> +
> + ret = vaddr_get_pfn(vaddr, prot, &pfn);
> + if (ret)
> + break;
> +
> + if (pfn != *pfn_base + i || is_invalid_reserved_pfn(pfn)) {
> + put_pfn(pfn, prot);
> + break;
> + }
> +
> + if (!lock_cap && current->mm->locked_vm + i + 1 > limit) {
> + put_pfn(pfn, prot);
> + pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> + __func__, limit << PAGE_SHIFT);
> + break;
> + }
> + }
> +
> + vfio_lock_acct(i);
> +
> + return i;
> +}
> +
> +long vfio_unpin_pages(unsigned long pfn, long npage,
> + int prot, bool do_accounting)
> +{
> + unsigned long unlocked = 0;
> + long i;
> +
> + for (i = 0; i < npage; i++)
> + unlocked += put_pfn(pfn++, prot);
> +
> + if (do_accounting)
> + vfio_lock_acct(-unlocked);
> +
> + return unlocked;
> +}
> diff --git a/drivers/vfio/vfio_iommu_common.h b/drivers/vfio/vfio_iommu_common.h
> new file mode 100644
> index 0000000..4738391
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu_common.h
> @@ -0,0 +1,30 @@
> +/*
> + * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
> + * Copyright (C) 2013 Freescale Semiconductor, Inc.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
> + */
> +
> +#ifndef _VFIO_IOMMU_COMMON_H
> +#define _VFIO_IOMMU_COMMON_H
> +
> +void vfio_lock_acct_bg(struct work_struct *work);

Does this need to be exposed?

> +void vfio_lock_acct(long npage);
> +bool is_invalid_reserved_pfn(unsigned long pfn);
> +int put_pfn(unsigned long pfn, int prot);

Why are these exposed, they only seem to be used by functions in the new
common file.

> +long vfio_pin_pages(unsigned long vaddr, long npage, int prot,
> + unsigned long *pfn_base);
> +long vfio_unpin_pages(unsigned long pfn, long npage,
> + int prot, bool do_accounting);

Can we get by with just these two and vfio_lock_acct()?

> +#endif
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index a9807de..e9a58fa 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -37,6 +37,7 @@
> #include <linux/uaccess.h>
> #include <linux/vfio.h>
> #include <linux/workqueue.h>
> +#include "vfio_iommu_common.h"
>
> #define DRIVER_VERSION "0.2"
> #define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
> @@ -48,12 +49,6 @@ module_param_named(allow_unsafe_interrupts,
> MODULE_PARM_DESC(allow_unsafe_interrupts,
> "Enable VFIO IOMMU support for on platforms without interrupt remapping support.");
>
> -static bool disable_hugepages;
> -module_param_named(disable_hugepages,
> - disable_hugepages, bool, S_IRUGO | S_IWUSR);
> -MODULE_PARM_DESC(disable_hugepages,
> - "Disable VFIO IOMMU support for IOMMU hugepages.");
> -
> struct vfio_iommu {
> struct iommu_domain *domain;
> struct mutex lock;
> @@ -123,205 +118,6 @@ static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
> rb_erase(&old->node, &iommu->dma_list);
> }
>
> -struct vwork {
> - struct mm_struct *mm;
> - long npage;
> - struct work_struct work;
> -};
> -
> -/* delayed decrement/increment for locked_vm */
> -static void vfio_lock_acct_bg(struct work_struct *work)
> -{
> - struct vwork *vwork = container_of(work, struct vwork, work);
> - struct mm_struct *mm;
> -
> - mm = vwork->mm;
> - down_write(&mm->mmap_sem);
> - mm->locked_vm += vwork->npage;
> - up_write(&mm->mmap_sem);
> - mmput(mm);
> - kfree(vwork);
> -}
> -
> -static void vfio_lock_acct(long npage)
> -{
> - struct vwork *vwork;
> - struct mm_struct *mm;
> -
> - if (!current->mm || !npage)
> - return; /* process exited or nothing to do */
> -
> - if (down_write_trylock(&current->mm->mmap_sem)) {
> - current->mm->locked_vm += npage;
> - up_write(&current->mm->mmap_sem);
> - return;
> - }
> -
> - /*
> - * Couldn't get mmap_sem lock, so must setup to update
> - * mm->locked_vm later. If locked_vm were atomic, we
> - * wouldn't need this silliness
> - */
> - vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> - if (!vwork)
> - return;
> - mm = get_task_mm(current);
> - if (!mm) {
> - kfree(vwork);
> - return;
> - }
> - INIT_WORK(&vwork->work, vfio_lock_acct_bg);
> - vwork->mm = mm;
> - vwork->npage = npage;
> - schedule_work(&vwork->work);
> -}
> -
> -/*
> - * Some mappings aren't backed by a struct page, for example an mmap'd
> - * MMIO range for our own or another device. These use a different
> - * pfn conversion and shouldn't be tracked as locked pages.
> - */
> -static bool is_invalid_reserved_pfn(unsigned long pfn)
> -{
> - if (pfn_valid(pfn)) {
> - bool reserved;
> - struct page *tail = pfn_to_page(pfn);
> - struct page *head = compound_trans_head(tail);
> - reserved = !!(PageReserved(head));
> - if (head != tail) {
> - /*
> - * "head" is not a dangling pointer
> - * (compound_trans_head takes care of that)
> - * but the hugepage may have been split
> - * from under us (and we may not hold a
> - * reference count on the head page so it can
> - * be reused before we run PageReferenced), so
> - * we've to check PageTail before returning
> - * what we just read.
> - */
> - smp_rmb();
> - if (PageTail(tail))
> - return reserved;
> - }
> - return PageReserved(tail);
> - }
> -
> - return true;
> -}
> -
> -static int put_pfn(unsigned long pfn, int prot)
> -{
> - if (!is_invalid_reserved_pfn(pfn)) {
> - struct page *page = pfn_to_page(pfn);
> - if (prot & IOMMU_WRITE)
> - SetPageDirty(page);
> - put_page(page);
> - return 1;
> - }
> - return 0;
> -}
> -
> -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long *pfn)
> -{
> - struct page *page[1];
> - struct vm_area_struct *vma;
> - int ret = -EFAULT;
> -
> - if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> - *pfn = page_to_pfn(page[0]);
> - return 0;
> - }
> -
> - down_read(&current->mm->mmap_sem);
> -
> - vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> -
> - if (vma && vma->vm_flags & VM_PFNMAP) {
> - *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> - if (is_invalid_reserved_pfn(*pfn))
> - ret = 0;
> - }
> -
> - up_read(&current->mm->mmap_sem);
> -
> - return ret;
> -}
> -
> -/*
> - * Attempt to pin pages. We really don't want to track all the pfns and
> - * the iommu can only map chunks of consecutive pfns anyway, so get the
> - * first page and all consecutive pages with the same locking.
> - */
> -static long vfio_pin_pages(unsigned long vaddr, long npage,
> - int prot, unsigned long *pfn_base)
> -{
> - unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> - bool lock_cap = capable(CAP_IPC_LOCK);
> - long ret, i;
> -
> - if (!current->mm)
> - return -ENODEV;
> -
> - ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> - if (ret)
> - return ret;
> -
> - if (is_invalid_reserved_pfn(*pfn_base))
> - return 1;
> -
> - if (!lock_cap && current->mm->locked_vm + 1 > limit) {
> - put_pfn(*pfn_base, prot);
> - pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
> - limit << PAGE_SHIFT);
> - return -ENOMEM;
> - }
> -
> - if (unlikely(disable_hugepages)) {
> - vfio_lock_acct(1);
> - return 1;
> - }
> -
> - /* Lock all the consecutive pages from pfn_base */
> - for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
> - unsigned long pfn = 0;
> -
> - ret = vaddr_get_pfn(vaddr, prot, &pfn);
> - if (ret)
> - break;
> -
> - if (pfn != *pfn_base + i || is_invalid_reserved_pfn(pfn)) {
> - put_pfn(pfn, prot);
> - break;
> - }
> -
> - if (!lock_cap && current->mm->locked_vm + i + 1 > limit) {
> - put_pfn(pfn, prot);
> - pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> - __func__, limit << PAGE_SHIFT);
> - break;
> - }
> - }
> -
> - vfio_lock_acct(i);
> -
> - return i;
> -}
> -
> -static long vfio_unpin_pages(unsigned long pfn, long npage,
> - int prot, bool do_accounting)
> -{
> - unsigned long unlocked = 0;
> - long i;
> -
> - for (i = 0; i < npage; i++)
> - unlocked += put_pfn(pfn++, prot);
> -
> - if (do_accounting)
> - vfio_lock_acct(-unlocked);
> -
> - return unlocked;
> -}
> -
> static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> dma_addr_t iova, size_t *size)
> {


2013-09-25 19:07:07

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH 7/7] vfio pci: Add vfio iommu implementation for FSL_PAMU

On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> This patch adds vfio iommu support for Freescale IOMMU
> (PAMU - Peripheral Access Management Unit).
>
> The Freescale PAMU is an aperture-based IOMMU with the following
> characteristics. Each device has an entry in a table in memory
> describing the iova->phys mapping. The mapping has:
> -an overall aperture that is power of 2 sized, and has a start iova that
> is naturally aligned
> -has 1 or more windows within the aperture
> -number of windows must be power of 2, max is 256
> -size of each window is determined by aperture size / # of windows
> -iova of each window is determined by aperture start iova / # of windows
> -the mapped region in each window can be different than
> the window size...mapping must power of 2
> -physical address of the mapping must be naturally aligned
> with the mapping size
>
> Some of the code is derived from TYPE1 iommu (driver/vfio/vfio_iommu_type1.c).
>
> Signed-off-by: Bharat Bhushan <[email protected]>
> ---
> drivers/vfio/Kconfig | 6 +
> drivers/vfio/Makefile | 1 +
> drivers/vfio/vfio_iommu_fsl_pamu.c | 952 ++++++++++++++++++++++++++++++++++++
> include/uapi/linux/vfio.h | 100 ++++
> 4 files changed, 1059 insertions(+), 0 deletions(-)
> create mode 100644 drivers/vfio/vfio_iommu_fsl_pamu.c
>
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 26b3d9d..7d1da26 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -8,11 +8,17 @@ config VFIO_IOMMU_SPAPR_TCE
> depends on VFIO && SPAPR_TCE_IOMMU
> default n
>
> +config VFIO_IOMMU_FSL_PAMU
> + tristate
> + depends on VFIO
> + default n
> +
> menuconfig VFIO
> tristate "VFIO Non-Privileged userspace driver framework"
> depends on IOMMU_API
> select VFIO_IOMMU_TYPE1 if X86
> select VFIO_IOMMU_SPAPR_TCE if (PPC_POWERNV || PPC_PSERIES)
> + select VFIO_IOMMU_FSL_PAMU if FSL_PAMU
> help
> VFIO provides a framework for secure userspace device drivers.
> See Documentation/vfio.txt for more details.
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index c5792ec..7461350 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1,4 +1,5 @@
> obj-$(CONFIG_VFIO) += vfio.o
> obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_common.o vfio_iommu_type1.o
> obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_common.o vfio_iommu_spapr_tce.o
> +obj-$(CONFIG_VFIO_IOMMU_FSL_PAMU) += vfio_iommu_common.o vfio_iommu_fsl_pamu.o
> obj-$(CONFIG_VFIO_PCI) += pci/
> diff --git a/drivers/vfio/vfio_iommu_fsl_pamu.c b/drivers/vfio/vfio_iommu_fsl_pamu.c
> new file mode 100644
> index 0000000..b29365f
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu_fsl_pamu.c
> @@ -0,0 +1,952 @@
> +/*
> + * VFIO: IOMMU DMA mapping support for FSL PAMU IOMMU
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License, version 2, as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
> + *
> + * Copyright (C) 2013 Freescale Semiconductor, Inc.
> + *
> + * Author: Bharat Bhushan <[email protected]>
> + *
> + * This file is derived from driver/vfio/vfio_iommu_type1.c
> + *
> + * The Freescale PAMU is an aperture-based IOMMU with the following
> + * characteristics. Each device has an entry in a table in memory
> + * describing the iova->phys mapping. The mapping has:
> + * -an overall aperture that is power of 2 sized, and has a start iova that
> + * is naturally aligned
> + * -has 1 or more windows within the aperture
> + * -number of windows must be power of 2, max is 256
> + * -size of each window is determined by aperture size / # of windows
> + * -iova of each window is determined by aperture start iova / # of windows
> + * -the mapped region in each window can be different than
> + * the window size...mapping must power of 2
> + * -physical address of the mapping must be naturally aligned
> + * with the mapping size
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/pci.h> /* pci_bus_type */
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/workqueue.h>
> +#include <linux/hugetlb.h>
> +#include <linux/msi.h>
> +#include <asm/fsl_pamu_stash.h>
> +
> +#include "vfio_iommu_common.h"
> +
> +#define DRIVER_VERSION "0.1"
> +#define DRIVER_AUTHOR "Bharat Bhushan <[email protected]>"
> +#define DRIVER_DESC "FSL PAMU IOMMU driver for VFIO"
> +
> +struct vfio_iommu {
> + struct iommu_domain *domain;
> + struct mutex lock;
> + dma_addr_t aperture_start;
> + dma_addr_t aperture_end;
> + dma_addr_t page_size; /* Maximum mapped Page size */
> + int nsubwindows; /* Number of subwindows */
> + struct rb_root dma_list;
> + struct list_head msi_dma_list;
> + struct list_head group_list;
> +};
> +
> +struct vfio_dma {
> + struct rb_node node;
> + dma_addr_t iova; /* Device address */
> + unsigned long vaddr; /* Process virtual addr */
> + size_t size; /* Number of pages */

Is this really pages?

> + int prot; /* IOMMU_READ/WRITE */
> +};
> +
> +struct vfio_msi_dma {
> + struct list_head next;
> + dma_addr_t iova; /* Device address */
> + int bank_id;
> + int prot; /* IOMMU_READ/WRITE */
> +};
> +
> +struct vfio_group {
> + struct iommu_group *iommu_group;
> + struct list_head next;
> +};
> +
> +static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
> + dma_addr_t start, size_t size)
> +{
> + struct rb_node *node = iommu->dma_list.rb_node;
> +
> + while (node) {
> + struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> +
> + if (start + size <= dma->iova)
> + node = node->rb_left;
> + else if (start >= dma->iova + dma->size)

because this looks more like it's bytes...

> + node = node->rb_right;
> + else
> + return dma;
> + }
> +
> + return NULL;
> +}
> +
> +static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma *new)
> +{
> + struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
> + struct vfio_dma *dma;
> +
> + while (*link) {
> + parent = *link;
> + dma = rb_entry(parent, struct vfio_dma, node);
> +
> + if (new->iova + new->size <= dma->iova)

so does this...

> + link = &(*link)->rb_left;
> + else
> + link = &(*link)->rb_right;
> + }
> +
> + rb_link_node(&new->node, parent, link);
> + rb_insert_color(&new->node, &iommu->dma_list);
> +}
> +
> +static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma *old)
> +{
> + rb_erase(&old->node, &iommu->dma_list);
> +}


So if your vfio_dma.size is actually in bytes, why isn't all this code
in common?

> +
> +static int iova_to_win(struct vfio_iommu *iommu, dma_addr_t iova)
> +{
> + u64 offset = iova - iommu->aperture_start;
> + do_div(offset, iommu->page_size);
> + return (int) offset;
> +}
> +
> +static int vfio_disable_iommu_domain(struct vfio_iommu *iommu)
> +{
> + int enable = 0;
> + return iommu_domain_set_attr(iommu->domain,
> + DOMAIN_ATTR_FSL_PAMU_ENABLE, &enable);
> +}

This is never called?!

> +
> +static int vfio_enable_iommu_domain(struct vfio_iommu *iommu)
> +{
> + int enable = 1;
> + return iommu_domain_set_attr(iommu->domain,
> + DOMAIN_ATTR_FSL_PAMU_ENABLE, &enable);
> +}
> +
> +/* Unmap DMA region */
> +static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> + dma_addr_t iova, size_t *size)
> +{
> + dma_addr_t start = iova;
> + int win, win_start, win_end;
> + long unlocked = 0;
> + unsigned int nr_pages;
> +
> + nr_pages = iommu->page_size / PAGE_SIZE;
> + win_start = iova_to_win(iommu, iova);
> + win_end = iova_to_win(iommu, iova + *size - 1);
> +
> + /* Release the pinned pages */
> + for (win = win_start; win <= win_end; iova += iommu->page_size, win++) {
> + unsigned long pfn;
> +
> + pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> + if (!pfn)
> + continue;
> +
> + iommu_domain_window_disable(iommu->domain, win);
> +
> + unlocked += vfio_unpin_pages(pfn, nr_pages, dma->prot, 1);
> + }
> +
> + vfio_lock_acct(-unlocked);
> + *size = iova - start;
> + return 0;
> +}
> +
> +static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> + size_t *size, struct vfio_dma *dma)
> +{
> + size_t offset, overlap, tmp;
> + struct vfio_dma *split;
> + int ret;
> +
> + if (!*size)
> + return 0;
> +
> + /*
> + * Existing dma region is completely covered, unmap all. This is
> + * the likely case since userspace tends to map and unmap buffers
> + * in one shot rather than multiple mappings within a buffer.
> + */
> + if (likely(start <= dma->iova &&
> + start + *size >= dma->iova + dma->size)) {
> + *size = dma->size;
> + ret = vfio_unmap_unpin(iommu, dma, dma->iova, size);
> + if (ret)
> + return ret;
> +
> + /*
> + * Did we remove more than we have? Should never happen
> + * since a vfio_dma is contiguous in iova and vaddr.
> + */
> + WARN_ON(*size != dma->size);
> +
> + vfio_remove_dma(iommu, dma);
> + kfree(dma);
> + return 0;
> + }
> +
> + /* Overlap low address of existing range */
> + if (start <= dma->iova) {
> + overlap = start + *size - dma->iova;
> + ret = vfio_unmap_unpin(iommu, dma, dma->iova, &overlap);
> + if (ret)
> + return ret;
> +
> + vfio_remove_dma(iommu, dma);
> +
> + /*
> + * Check, we may have removed to whole vfio_dma. If not
> + * fixup and re-insert.
> + */
> + if (overlap < dma->size) {
> + dma->iova += overlap;
> + dma->vaddr += overlap;
> + dma->size -= overlap;
> + vfio_insert_dma(iommu, dma);
> + } else
> + kfree(dma);
> +
> + *size = overlap;
> + return 0;
> + }
> +
> + /* Overlap high address of existing range */
> + if (start + *size >= dma->iova + dma->size) {
> + offset = start - dma->iova;
> + overlap = dma->size - offset;
> +
> + ret = vfio_unmap_unpin(iommu, dma, start, &overlap);
> + if (ret)
> + return ret;
> +
> + dma->size -= overlap;
> + *size = overlap;
> + return 0;
> + }
> +
> + /* Split existing */
> +
> + /*
> + * Allocate our tracking structure early even though it may not
> + * be used. An Allocation failure later loses track of pages and
> + * is more difficult to unwind.
> + */
> + split = kzalloc(sizeof(*split), GFP_KERNEL);
> + if (!split)
> + return -ENOMEM;
> +
> + offset = start - dma->iova;
> +
> + ret = vfio_unmap_unpin(iommu, dma, start, size);
> + if (ret || !*size) {
> + kfree(split);
> + return ret;
> + }
> +
> + tmp = dma->size;
> +
> + /* Resize the lower vfio_dma in place, before the below insert */
> + dma->size = offset;
> +
> + /* Insert new for remainder, assuming it didn't all get unmapped */
> + if (likely(offset + *size < tmp)) {
> + split->size = tmp - offset - *size;
> + split->iova = dma->iova + offset + *size;
> + split->vaddr = dma->vaddr + offset + *size;
> + split->prot = dma->prot;
> + vfio_insert_dma(iommu, split);
> + } else
> + kfree(split);
> +
> + return 0;
> +}

Hmm, this looks identical to type1, can we share more?

> +
> +/* Map DMA region */
> +static int vfio_dma_map(struct vfio_iommu *iommu, dma_addr_t iova,
> + unsigned long vaddr, long npage, int prot)
> +{
> + int ret = 0, i;
> + size_t size;
> + unsigned int win, nr_subwindows;
> + dma_addr_t iovamap;
> +
> + /* total size to be mapped */
> + size = npage << PAGE_SHIFT;
> + do_div(size, iommu->page_size);
> + nr_subwindows = size;
> + size = npage << PAGE_SHIFT;

Is all this do_div() stuff necessary? If page_size is a power of two,
just shift it.

> + iovamap = iova;
> + for (i = 0; i < nr_subwindows; i++) {
> + unsigned long pfn;
> + unsigned long nr_pages;
> + dma_addr_t mapsize;
> + struct vfio_dma *dma = NULL;
> +
> + win = iova_to_win(iommu, iovamap);

Aren't these consecutive, why can't we just increment?

> + if (iovamap != iommu->aperture_start + iommu->page_size * win) {
> + pr_err("%s iova(%llx) unalligned to window size %llx\n",
> + __func__, iovamap, iommu->page_size);
> + ret = -EINVAL;
> + break;
> + }

Can't this only happen on the first one? Seems like it should be
outside of the loop. What about alignment with the end of the window,
do you care? Check spelling in your warning, but better yet, get rid of
it, this doesn't seem like something we need to error on.

> +
> + mapsize = min(iova + size - iovamap, iommu->page_size);
> + /*
> + * FIXME: Currently we only support mapping page-size
> + * of subwindow-size.
> + */
> + if (mapsize < iommu->page_size) {
> + pr_err("%s iova (%llx) not alligned to window size %llx\n",
> + __func__, iovamap, iommu->page_size);
> + ret = -EINVAL;
> + break;
> + }

So you do care about the end alignment, but why can't we error for both
of these in advance?

> +
> + nr_pages = mapsize >> PAGE_SHIFT;
> +
> + /* Pin a contiguous chunk of memory */
> + ret = vfio_pin_pages(vaddr, nr_pages, prot, &pfn);
> + if (ret != nr_pages) {
> + pr_err("%s unable to pin pages = %lx, pinned(%lx/%lx)\n",
> + __func__, vaddr, npage, nr_pages);
> + ret = -EINVAL;
> + break;
> + }

How likely is this to succeed? It seems like we're relying on userspace
to use hugepages to make this work.

> +
> + ret = iommu_domain_window_enable(iommu->domain, win,
> + (phys_addr_t)pfn << PAGE_SHIFT,
> + mapsize, prot);
> + if (ret) {
> + pr_err("%s unable to iommu_map()\n", __func__);
> + ret = -EINVAL;
> + break;
> + }

You might consider how many cases you're returning EINVAL and think
about how difficult this will be to debug. I don't think we can leave
all these pr_err()s since it gives userspace a trivial way to spam log
files.

> +
> + /*
> + * Check if we abut a region below - nothing below 0.
> + * This is the most likely case when mapping chunks of
> + * physically contiguous regions within a virtual address
> + * range. Update the abutting entry in place since iova
> + * doesn't change.
> + */
> + if (likely(iovamap)) {
> + struct vfio_dma *tmp;
> + tmp = vfio_find_dma(iommu, iovamap - 1, 1);
> + if (tmp && tmp->prot == prot &&
> + tmp->vaddr + tmp->size == vaddr) {
> + tmp->size += mapsize;
> + dma = tmp;
> + }
> + }
> +
> + /*
> + * Check if we abut a region above - nothing above ~0 + 1.
> + * If we abut above and below, remove and free. If only
> + * abut above, remove, modify, reinsert.
> + */
> + if (likely(iovamap + mapsize)) {
> + struct vfio_dma *tmp;
> + tmp = vfio_find_dma(iommu, iovamap + mapsize, 1);
> + if (tmp && tmp->prot == prot &&
> + tmp->vaddr == vaddr + mapsize) {
> + vfio_remove_dma(iommu, tmp);
> + if (dma) {
> + dma->size += tmp->size;
> + kfree(tmp);
> + } else {
> + tmp->size += mapsize;
> + tmp->iova = iovamap;
> + tmp->vaddr = vaddr;
> + vfio_insert_dma(iommu, tmp);
> + dma = tmp;
> + }
> + }
> + }
> +
> + if (!dma) {
> + dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> + if (!dma) {
> + iommu_unmap(iommu->domain, iovamap, mapsize);
> + vfio_unpin_pages(pfn, npage, prot, true);
> + ret = -ENOMEM;
> + break;
> + }
> +
> + dma->size = mapsize;
> + dma->iova = iovamap;
> + dma->vaddr = vaddr;
> + dma->prot = prot;
> + vfio_insert_dma(iommu, dma);
> + }
> +
> + iovamap += mapsize;
> + vaddr += mapsize;

Another chunk that looks like it's probably identical to type1. Can we
rip this out to another function and add it to common?

> + }
> +
> + if (ret) {
> + struct vfio_dma *tmp;
> + while ((tmp = vfio_find_dma(iommu, iova, size))) {
> + int r = vfio_remove_dma_overlap(iommu, iova,
> + &size, tmp);
> + if (WARN_ON(r || !size))
> + break;
> + }
> + }


Broken whitespace, please run scripts/checkpatch.pl before posting.

> +
> + vfio_enable_iommu_domain(iommu);

I don't quite understand your semantics here since you never use the
disable version, is this just redundant after the first mapping? When
dma_list is empty should it be disabled? Is there a bug here that an
error will enable the iommu domain even if there are no entries?

> + return 0;
> +}
> +
> +static int vfio_dma_do_map(struct vfio_iommu *iommu,
> + struct vfio_iommu_type1_dma_map *map)
> +{
> + dma_addr_t iova = map->iova;
> + size_t size = map->size;
> + unsigned long vaddr = map->vaddr;
> + int ret = 0, prot = 0;
> + long npage;
> +
> + /* READ/WRITE from device perspective */
> + if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
> + prot |= IOMMU_WRITE;
> + if (map->flags & VFIO_DMA_MAP_FLAG_READ)
> + prot |= IOMMU_READ;
> +
> + if (!prot)
> + return -EINVAL; /* No READ/WRITE? */
> +
> + /* Don't allow IOVA wrap */
> + if (iova + size && iova + size < iova)
> + return -EINVAL;
> +
> + /* Don't allow virtual address wrap */
> + if (vaddr + size && vaddr + size < vaddr)
> + return -EINVAL;
> +
> + /*
> + * FIXME: Currently we only support mapping page-size
> + * of subwindow-size.
> + */
> + if (size < iommu->page_size)
> + return -EINVAL;
> +

I'd think the start and end alignment could be tested here.

> + npage = size >> PAGE_SHIFT;
> + if (!npage)
> + return -EINVAL;
> +
> + mutex_lock(&iommu->lock);
> +
> + if (vfio_find_dma(iommu, iova, size)) {
> + ret = -EEXIST;
> + goto out_lock;
> + }
> +
> + vfio_dma_map(iommu, iova, vaddr, npage, prot);
> +
> +out_lock:
> + mutex_unlock(&iommu->lock);
> + return ret;
> +}
> +
> +static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> + struct vfio_iommu_type1_dma_unmap *unmap)
> +{
> + struct vfio_dma *dma;
> + size_t unmapped = 0, size;
> + int ret = 0;
> +
> + mutex_lock(&iommu->lock);
> +
> + while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
> + size = unmap->size;
> + ret = vfio_remove_dma_overlap(iommu, unmap->iova, &size, dma);
> + if (ret || !size)
> + break;
> + unmapped += size;
> + }
> +
> + mutex_unlock(&iommu->lock);
> +
> + /*
> + * We may unmap more than requested, update the unmap struct so
> + * userspace can know.
> + */
> + unmap->size = unmapped;
> +
> + return ret;
> +}
> +
> +static int vfio_handle_get_attr(struct vfio_iommu *iommu,
> + struct vfio_pamu_attr *pamu_attr)
> +{
> + switch (pamu_attr->attribute) {
> + case VFIO_ATTR_GEOMETRY: {
> + struct iommu_domain_geometry geom;
> + if (iommu_domain_get_attr(iommu->domain,
> + DOMAIN_ATTR_GEOMETRY, &geom)) {
> + pr_err("%s Error getting domain geometry\n",
> + __func__);
> + return -EFAULT;
> + }
> +
> + pamu_attr->attr_info.attr.aperture_start = geom.aperture_start;
> + pamu_attr->attr_info.attr.aperture_end = geom.aperture_end;
> + break;
> + }
> + case VFIO_ATTR_WINDOWS: {
> + u32 count;
> + if (iommu_domain_get_attr(iommu->domain,
> + DOMAIN_ATTR_WINDOWS, &count)) {
> + pr_err("%s Error getting domain windows\n",
> + __func__);
> + return -EFAULT;
> + }
> +
> + pamu_attr->attr_info.windows = count;
> + break;
> + }
> + case VFIO_ATTR_PAMU_STASH: {
> + struct pamu_stash_attribute stash;
> + if (iommu_domain_get_attr(iommu->domain,
> + DOMAIN_ATTR_FSL_PAMU_STASH, &stash)) {
> + pr_err("%s Error getting domain windows\n",
> + __func__);
> + return -EFAULT;
> + }
> +
> + pamu_attr->attr_info.stash.cpu = stash.cpu;
> + pamu_attr->attr_info.stash.cache = stash.cache;
> + break;
> + }
> +
> + default:
> + pr_err("%s Error: Invalid attribute (%d)\n",
> + __func__, pamu_attr->attribute);
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> +static int vfio_handle_set_attr(struct vfio_iommu *iommu,
> + struct vfio_pamu_attr *pamu_attr)
> +{
> + switch (pamu_attr->attribute) {
> + case VFIO_ATTR_GEOMETRY: {
> + struct iommu_domain_geometry geom;
> +
> + geom.aperture_start = pamu_attr->attr_info.attr.aperture_start;
> + geom.aperture_end = pamu_attr->attr_info.attr.aperture_end;
> + iommu->aperture_start = geom.aperture_start;
> + iommu->aperture_end = geom.aperture_end;
> + geom.force_aperture = 1;
> + if (iommu_domain_set_attr(iommu->domain,
> + DOMAIN_ATTR_GEOMETRY, &geom)) {
> + pr_err("%s Error setting domain geometry\n", __func__);
> + return -EFAULT;
> + }
> +
> + break;
> + }
> + case VFIO_ATTR_WINDOWS: {
> + u32 count = pamu_attr->attr_info.windows;
> + u64 size;
> + if (count > 256) {
> + pr_err("Number of subwindows requested (%d) is 256\n",
> + count);
> + return -EINVAL;
> + }
> + iommu->nsubwindows = pamu_attr->attr_info.windows;
> + size = iommu->aperture_end - iommu->aperture_start + 1;
> + do_div(size, count);
> + iommu->page_size = size;
> + if (iommu_domain_set_attr(iommu->domain,
> + DOMAIN_ATTR_WINDOWS, &count)) {
> + pr_err("%s Error getting domain windows\n",
> + __func__);
> + return -EFAULT;
> + }
> +
> + break;
> + }
> + case VFIO_ATTR_PAMU_STASH: {
> + struct pamu_stash_attribute stash;
> +
> + stash.cpu = pamu_attr->attr_info.stash.cpu;
> + stash.cache = pamu_attr->attr_info.stash.cache;
> + if (iommu_domain_set_attr(iommu->domain,
> + DOMAIN_ATTR_FSL_PAMU_STASH, &stash)) {
> + pr_err("%s Error getting domain windows\n",
> + __func__);
> + return -EFAULT;
> + }
> + break;

Why do we throw away the return value of iommu_domain_set_attr and
replace it with EFAULT in all these cases? I assume all these pr_err()s
are leftover debug. Can the user do anything they shouldn't through
these? How do we guarantee that?

> + }
> +
> + default:
> + pr_err("%s Error: Invalid attribute (%d)\n",
> + __func__, pamu_attr->attribute);
> + return -EINVAL;
> + }
> +
> + return 0;
> +}
> +
> +static int vfio_msi_map(struct vfio_iommu *iommu,
> + struct vfio_pamu_msi_bank_map *msi_map, int prot)
> +{
> + struct msi_region region;
> + int window;
> + int ret;
> +
> + ret = msi_get_region(msi_map->msi_bank_index, &region);
> + if (ret) {
> + pr_err("%s MSI region (%d) not found\n", __func__,
> + msi_map->msi_bank_index);
> + return ret;
> + }
> +
> + window = iova_to_win(iommu, msi_map->iova);
> + ret = iommu_domain_window_enable(iommu->domain, window, region.addr,
> + region.size, prot);
> + if (ret) {
> + pr_err("%s Error: unable to map msi region\n", __func__);
> + return ret;
> + }
> +
> + return 0;
> +}
> +
> +static int vfio_do_msi_map(struct vfio_iommu *iommu,
> + struct vfio_pamu_msi_bank_map *msi_map)
> +{
> + struct vfio_msi_dma *msi_dma;
> + int ret, prot = 0;
> +
> + /* READ/WRITE from device perspective */
> + if (msi_map->flags & VFIO_DMA_MAP_FLAG_WRITE)
> + prot |= IOMMU_WRITE;
> + if (msi_map->flags & VFIO_DMA_MAP_FLAG_READ)
> + prot |= IOMMU_READ;
> +
> + if (!prot)
> + return -EINVAL; /* No READ/WRITE? */
> +
> + ret = vfio_msi_map(iommu, msi_map, prot);
> + if (ret)
> + return ret;
> +
> + msi_dma = kzalloc(sizeof(*msi_dma), GFP_KERNEL);
> + if (!msi_dma)
> + return -ENOMEM;
> +
> + msi_dma->iova = msi_map->iova;
> + msi_dma->bank_id = msi_map->msi_bank_index;
> + list_add(&msi_dma->next, &iommu->msi_dma_list);
> + return 0;

What happens when the user creates multiple MSI mappings at the same
iova? What happens when DMA mappings overlap MSI mappings? Shouldn't
there be some locking around list manipulation?

> +}
> +
> +static void vfio_msi_unmap(struct vfio_iommu *iommu, dma_addr_t iova)
> +{
> + int window;
> + window = iova_to_win(iommu, iova);
> + iommu_domain_window_disable(iommu->domain, window);
> +}
> +
> +static int vfio_do_msi_unmap(struct vfio_iommu *iommu,
> + struct vfio_pamu_msi_bank_unmap *msi_unmap)
> +{
> + struct vfio_msi_dma *mdma, *mdma_tmp;
> +
> + list_for_each_entry_safe(mdma, mdma_tmp, &iommu->msi_dma_list, next) {
> + if (mdma->iova == msi_unmap->iova) {
> + vfio_msi_unmap(iommu, mdma->iova);
> + list_del(&mdma->next);
> + kfree(mdma);
> + return 0;
> + }
> + }
> +
> + return -EINVAL;
> +}
> +static void *vfio_iommu_fsl_pamu_open(unsigned long arg)
> +{
> + struct vfio_iommu *iommu;
> +
> + if (arg != VFIO_FSL_PAMU_IOMMU)
> + return ERR_PTR(-EINVAL);
> +
> + iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> + if (!iommu)
> + return ERR_PTR(-ENOMEM);
> +
> + INIT_LIST_HEAD(&iommu->group_list);
> + iommu->dma_list = RB_ROOT;
> + INIT_LIST_HEAD(&iommu->msi_dma_list);
> + mutex_init(&iommu->lock);
> +
> + /*
> + * Wish we didn't have to know about bus_type here.
> + */
> + iommu->domain = iommu_domain_alloc(&pci_bus_type);
> + if (!iommu->domain) {
> + kfree(iommu);
> + return ERR_PTR(-EIO);
> + }
> +
> + return iommu;
> +}
> +
> +static void vfio_iommu_fsl_pamu_release(void *iommu_data)
> +{
> + struct vfio_iommu *iommu = iommu_data;
> + struct vfio_group *group, *group_tmp;
> + struct vfio_msi_dma *mdma, *mdma_tmp;
> + struct rb_node *node;
> +
> + list_for_each_entry_safe(group, group_tmp, &iommu->group_list, next) {
> + iommu_detach_group(iommu->domain, group->iommu_group);
> + list_del(&group->next);
> + kfree(group);
> + }
> +
> + while ((node = rb_first(&iommu->dma_list))) {
> + struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> + size_t size = dma->size;
> + vfio_remove_dma_overlap(iommu, dma->iova, &size, dma);
> + if (WARN_ON(!size))
> + break;
> + }
> +
> + list_for_each_entry_safe(mdma, mdma_tmp, &iommu->msi_dma_list, next) {
> + vfio_msi_unmap(iommu, mdma->iova);
> + list_del(&mdma->next);
> + kfree(mdma);
> + }
> +
> + iommu_domain_free(iommu->domain);
> + iommu->domain = NULL;
> + kfree(iommu);
> +}
> +
> +static long vfio_iommu_fsl_pamu_ioctl(void *iommu_data,
> + unsigned int cmd, unsigned long arg)
> +{
> + struct vfio_iommu *iommu = iommu_data;
> + unsigned long minsz;
> +
> + if (cmd == VFIO_CHECK_EXTENSION) {
> + switch (arg) {
> + case VFIO_FSL_PAMU_IOMMU:
> + return 1;
> + default:
> + return 0;
> + }
> + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> + struct vfio_iommu_type1_dma_map map;
> + uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
> + VFIO_DMA_MAP_FLAG_WRITE;
> +
> + minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
> +
> + if (copy_from_user(&map, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (map.argsz < minsz || map.flags & ~mask)
> + return -EINVAL;
> +
> + return vfio_dma_do_map(iommu, &map);
> +
> + } else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> + struct vfio_iommu_type1_dma_unmap unmap;
> + long ret;
> +
> + minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
> +
> + if (copy_from_user(&unmap, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (unmap.argsz < minsz || unmap.flags)
> + return -EINVAL;
> +
> + ret = vfio_dma_do_unmap(iommu, &unmap);
> + if (ret)
> + return ret;
> +
> + return copy_to_user((void __user *)arg, &unmap, minsz);
> + } else if (cmd == VFIO_IOMMU_PAMU_GET_ATTR) {
> + struct vfio_pamu_attr pamu_attr;
> +
> + minsz = offsetofend(struct vfio_pamu_attr, attr_info);
> + if (copy_from_user(&pamu_attr, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (pamu_attr.argsz < minsz)
> + return -EINVAL;
> +
> + vfio_handle_get_attr(iommu, &pamu_attr);
> +
> + copy_to_user((void __user *)arg, &pamu_attr, minsz);
> + return 0;
> + } else if (cmd == VFIO_IOMMU_PAMU_SET_ATTR) {
> + struct vfio_pamu_attr pamu_attr;
> +
> + minsz = offsetofend(struct vfio_pamu_attr, attr_info);
> + if (copy_from_user(&pamu_attr, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (pamu_attr.argsz < minsz)
> + return -EINVAL;
> +
> + vfio_handle_set_attr(iommu, &pamu_attr);
> + return 0;
> + } else if (cmd == VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT) {
> + return msi_get_region_count();
> + } else if (cmd == VFIO_IOMMU_PAMU_MAP_MSI_BANK) {
> + struct vfio_pamu_msi_bank_map msi_map;
> +
> + minsz = offsetofend(struct vfio_pamu_msi_bank_map, iova);
> + if (copy_from_user(&msi_map, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (msi_map.argsz < minsz)
> + return -EINVAL;
> +
> + vfio_do_msi_map(iommu, &msi_map);
> + return 0;
> + } else if (cmd == VFIO_IOMMU_PAMU_UNMAP_MSI_BANK) {
> + struct vfio_pamu_msi_bank_unmap msi_unmap;
> +
> + minsz = offsetofend(struct vfio_pamu_msi_bank_unmap, iova);
> + if (copy_from_user(&msi_unmap, (void __user *)arg, minsz))
> + return -EFAULT;
> +
> + if (msi_unmap.argsz < minsz)
> + return -EINVAL;
> +
> + vfio_do_msi_unmap(iommu, &msi_unmap);
> + return 0;
> +
> + }
> +
> + return -ENOTTY;
> +}
> +
> +static int vfio_iommu_fsl_pamu_attach_group(void *iommu_data,
> + struct iommu_group *iommu_group)
> +{
> + struct vfio_iommu *iommu = iommu_data;
> + struct vfio_group *group, *tmp;
> + int ret;
> +
> + group = kzalloc(sizeof(*group), GFP_KERNEL);
> + if (!group)
> + return -ENOMEM;
> +
> + mutex_lock(&iommu->lock);
> +
> + list_for_each_entry(tmp, &iommu->group_list, next) {
> + if (tmp->iommu_group == iommu_group) {
> + mutex_unlock(&iommu->lock);
> + kfree(group);
> + return -EINVAL;
> + }
> + }
> +
> + ret = iommu_attach_group(iommu->domain, iommu_group);
> + if (ret) {
> + mutex_unlock(&iommu->lock);
> + kfree(group);
> + return ret;
> + }
> +
> + group->iommu_group = iommu_group;
> + list_add(&group->next, &iommu->group_list);
> +
> + mutex_unlock(&iommu->lock);
> +
> + return 0;
> +}
> +
> +static void vfio_iommu_fsl_pamu_detach_group(void *iommu_data,
> + struct iommu_group *iommu_group)
> +{
> + struct vfio_iommu *iommu = iommu_data;
> + struct vfio_group *group;
> +
> + mutex_lock(&iommu->lock);
> +
> + list_for_each_entry(group, &iommu->group_list, next) {
> + if (group->iommu_group == iommu_group) {
> + iommu_detach_group(iommu->domain, iommu_group);
> + list_del(&group->next);
> + kfree(group);
> + break;
> + }
> + }
> +
> + mutex_unlock(&iommu->lock);
> +}
> +
> +static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_fsl_pamu = {
> + .name = "vfio-iommu-fsl_pamu",
> + .owner = THIS_MODULE,
> + .open = vfio_iommu_fsl_pamu_open,
> + .release = vfio_iommu_fsl_pamu_release,
> + .ioctl = vfio_iommu_fsl_pamu_ioctl,
> + .attach_group = vfio_iommu_fsl_pamu_attach_group,
> + .detach_group = vfio_iommu_fsl_pamu_detach_group,
> +};
> +
> +static int __init vfio_iommu_fsl_pamu_init(void)
> +{
> + if (!iommu_present(&pci_bus_type))
> + return -ENODEV;
> +
> + return vfio_register_iommu_driver(&vfio_iommu_driver_ops_fsl_pamu);
> +}
> +
> +static void __exit vfio_iommu_fsl_pamu_cleanup(void)
> +{
> + vfio_unregister_iommu_driver(&vfio_iommu_driver_ops_fsl_pamu);
> +}
> +
> +module_init(vfio_iommu_fsl_pamu_init);
> +module_exit(vfio_iommu_fsl_pamu_cleanup);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 0fd47f5..d359055 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -23,6 +23,7 @@
>
> #define VFIO_TYPE1_IOMMU 1
> #define VFIO_SPAPR_TCE_IOMMU 2
> +#define VFIO_FSL_PAMU_IOMMU 3
>
> /*
> * The IOCTL interface is designed for extensibility by embedding the
> @@ -451,4 +452,103 @@ struct vfio_iommu_spapr_tce_info {
>
> /* ***************************************************************** */
>
> +/*********** APIs for VFIO_PAMU type only ****************/
> +/*
> + * VFIO_IOMMU_PAMU_GET_ATTR - _IO(VFIO_TYPE, VFIO_BASE + 17,
> + * struct vfio_pamu_attr)
> + *
> + * Gets the iommu attributes for the current vfio container.
> + * Caller sets argsz and attribute. The ioctl fills in
> + * the provided struct vfio_pamu_attr based on the attribute
> + * value that was set.
> + * Return: 0 on success, -errno on failure
> + */
> +struct vfio_pamu_attr {
> + __u32 argsz;
> + __u32 flags; /* no flags currently */
> +#define VFIO_ATTR_GEOMETRY 0
> +#define VFIO_ATTR_WINDOWS 1
> +#define VFIO_ATTR_PAMU_STASH 2
> + __u32 attribute;
> +
> + union {
> + /* VFIO_ATTR_GEOMETRY */
> + struct {
> + /* first addr that can be mapped */
> + __u64 aperture_start;
> + /* last addr that can be mapped */
> + __u64 aperture_end;
> + } attr;
> +
> + /* VFIO_ATTR_WINDOWS */
> + __u32 windows; /* number of windows in the aperture
> + * initially this will be the max number
> + * of windows that can be set
> + */
> + /* VFIO_ATTR_PAMU_STASH */
> + struct {
> + __u32 cpu; /* CPU number for stashing */
> + __u32 cache; /* cache ID for stashing */
> + } stash;
> + } attr_info;
> +};
> +#define VFIO_IOMMU_PAMU_GET_ATTR _IO(VFIO_TYPE, VFIO_BASE + 17)
> +
> +/*
> + * VFIO_IOMMU_PAMU_SET_ATTR - _IO(VFIO_TYPE, VFIO_BASE + 18,
> + * struct vfio_pamu_attr)
> + *
> + * Sets the iommu attributes for the current vfio container.
> + * Caller sets struct vfio_pamu attr, including argsz and attribute and
> + * setting any fields that are valid for the attribute.
> + * Return: 0 on success, -errno on failure
> + */
> +#define VFIO_IOMMU_PAMU_SET_ATTR _IO(VFIO_TYPE, VFIO_BASE + 18)
> +
> +/*
> + * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT - _IO(VFIO_TYPE, VFIO_BASE + 19, __u32)
> + *
> + * Returns the number of MSI banks for this platform. This tells user space
> + * how many aperture windows should be reserved for MSI banks when setting
> + * the PAMU geometry and window count.
> + * Return: __u32 bank count on success, -errno on failure
> + */
> +#define VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT _IO(VFIO_TYPE, VFIO_BASE + 19)
> +
> +/*
> + * VFIO_IOMMU_PAMU_MAP_MSI_BANK - _IO(VFIO_TYPE, VFIO_BASE + 20,
> + * struct vfio_pamu_msi_bank_map)
> + *
> + * Maps the MSI bank at the specified index and iova. User space must
> + * call this ioctl once for each MSI bank (count of banks is returned by
> + * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT).
> + * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
> + * Return: 0 on success, -errno on failure
> + */
> +
> +struct vfio_pamu_msi_bank_map {
> + __u32 argsz;
> + __u32 flags; /* no flags currently */
> + __u32 msi_bank_index; /* the index of the MSI bank */
> + __u64 iova; /* the iova the bank is to be mapped to */
> +};
> +#define VFIO_IOMMU_PAMU_MAP_MSI_BANK _IO(VFIO_TYPE, VFIO_BASE + 20)
> +
> +/*
> + * VFIO_IOMMU_PAMU_UNMAP_MSI_BANK - _IO(VFIO_TYPE, VFIO_BASE + 21,
> + * struct vfio_pamu_msi_bank_unmap)
> + *
> + * Unmaps the MSI bank at the specified iova.
> + * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
> + * Operates on VFIO file descriptor (/dev/vfio/vfio).
> + * Return: 0 on success, -errno on failure
> + */
> +
> +struct vfio_pamu_msi_bank_unmap {
> + __u32 argsz;
> + __u32 flags; /* no flags currently */
> + __u64 iova; /* the iova to be unmapped to */
> +};
> +#define VFIO_IOMMU_PAMU_UNMAP_MSI_BANK _IO(VFIO_TYPE, VFIO_BASE + 21)
> +
> #endif /* _UAPIVFIO_H */


2013-09-26 03:53:25

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH 5/7] iommu: supress loff_t compilation error on powerpc



> -----Original Message-----
> From: Alex Williamson [mailto:[email protected]]
> Sent: Wednesday, September 25, 2013 10:10 PM
> To: Bhushan Bharat-R65777
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> foundation.org; Bhushan Bharat-R65777
> Subject: Re: [PATCH 5/7] iommu: supress loff_t compilation error on powerpc
>
> On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > Signed-off-by: Bharat Bhushan <[email protected]>
> > ---
> > drivers/vfio/pci/vfio_pci_rdwr.c | 3 ++-
> > 1 files changed, 2 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c
> > b/drivers/vfio/pci/vfio_pci_rdwr.c
> > index 210db24..8a8156a 100644
> > --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> > +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> > @@ -181,7 +181,8 @@ ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char
> __user *buf,
> > size_t count, loff_t *ppos, bool iswrite) {
> > int ret;
> > - loff_t off, pos = *ppos & VFIO_PCI_OFFSET_MASK;
> > + loff_t off;
> > + u64 pos = (u64 )(*ppos & VFIO_PCI_OFFSET_MASK);
> > void __iomem *iomem = NULL;
> > unsigned int rsrc;
> > bool is_ioport;
>
> What's the compile error that this fixes?

I was getting below error; and after some googling I came to know that this is how it is fixed by other guys.

/home/r65777/linux-vfio/drivers/vfio/pci/vfio_pci_rdwr.c:193: undefined reference to `__cmpdi2'
/home/r65777/linux-vfio/drivers/vfio/pci/vfio_pci_rdwr.c:193: undefined reference to `__cmpdi2'

Thanks
-Bharat
>

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-09-26 03:57:45

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH 6/7] vfio: moving some functions in common file



> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Alex Williamson
> Sent: Wednesday, September 25, 2013 10:33 PM
> To: Bhushan Bharat-R65777
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> foundation.org; Bhushan Bharat-R65777
> Subject: Re: [PATCH 6/7] vfio: moving some functions in common file
>
> On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > Some function defined in vfio_iommu_type1.c were common and we want to
> > use these for FSL IOMMU (PAMU) and iommu-none driver.
> > So some of them are moved to vfio_iommu_common.c
> >
> > I think we can do more of that but we will take this step by step.
> >
> > Signed-off-by: Bharat Bhushan <[email protected]>
> > ---
> > drivers/vfio/Makefile | 4 +-
> > drivers/vfio/vfio_iommu_common.c | 235
> ++++++++++++++++++++++++++++++++++++++
> > drivers/vfio/vfio_iommu_common.h | 30 +++++
> > drivers/vfio/vfio_iommu_type1.c | 206
> > +---------------------------------
> > 4 files changed, 268 insertions(+), 207 deletions(-) create mode
> > 100644 drivers/vfio/vfio_iommu_common.c create mode 100644
> > drivers/vfio/vfio_iommu_common.h
> >
> > diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index
> > 72bfabc..c5792ec 100644
> > --- a/drivers/vfio/Makefile
> > +++ b/drivers/vfio/Makefile
> > @@ -1,4 +1,4 @@
> > obj-$(CONFIG_VFIO) += vfio.o
> > -obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> > -obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
> > +obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_common.o
> > +vfio_iommu_type1.o
> > +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_common.o
> > +vfio_iommu_spapr_tce.o
> > obj-$(CONFIG_VFIO_PCI) += pci/
> > diff --git a/drivers/vfio/vfio_iommu_common.c
> > b/drivers/vfio/vfio_iommu_common.c
> > new file mode 100644
> > index 0000000..8bdc0ea
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_iommu_common.c
> > @@ -0,0 +1,235 @@
> > +/*
> > + * VFIO: Common code for vfio IOMMU support
> > + *
> > + * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
> > + * Author: Alex Williamson <[email protected]>
> > + * Author: Bharat Bhushan <[email protected]>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > +modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Derived from original vfio:
> > + * Copyright 2010 Cisco Systems, Inc. All rights reserved.
> > + * Author: Tom Lyon, [email protected]
> > + */
> > +
> > +#include <linux/compat.h>
> > +#include <linux/device.h>
> > +#include <linux/fs.h>
> > +#include <linux/iommu.h>
> > +#include <linux/module.h>
> > +#include <linux/mm.h>
> > +#include <linux/pci.h> /* pci_bus_type */
> > +#include <linux/rbtree.h>
> > +#include <linux/sched.h>
> > +#include <linux/slab.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +#include <linux/workqueue.h>
>
> Please cleanup includes on both the source and target files. You obviously
> don't need linux/pci.h here for one.

Will do.

>
> > +
> > +static bool disable_hugepages;
> > +module_param_named(disable_hugepages,
> > + disable_hugepages, bool, S_IRUGO | S_IWUSR);
> > +MODULE_PARM_DESC(disable_hugepages,
> > + "Disable VFIO IOMMU support for IOMMU hugepages.");
> > +
> > +struct vwork {
> > + struct mm_struct *mm;
> > + long npage;
> > + struct work_struct work;
> > +};
> > +
> > +/* delayed decrement/increment for locked_vm */ void
> > +vfio_lock_acct_bg(struct work_struct *work) {
> > + struct vwork *vwork = container_of(work, struct vwork, work);
> > + struct mm_struct *mm;
> > +
> > + mm = vwork->mm;
> > + down_write(&mm->mmap_sem);
> > + mm->locked_vm += vwork->npage;
> > + up_write(&mm->mmap_sem);
> > + mmput(mm);
> > + kfree(vwork);
> > +}
> > +
> > +void vfio_lock_acct(long npage)
> > +{
> > + struct vwork *vwork;
> > + struct mm_struct *mm;
> > +
> > + if (!current->mm || !npage)
> > + return; /* process exited or nothing to do */
> > +
> > + if (down_write_trylock(&current->mm->mmap_sem)) {
> > + current->mm->locked_vm += npage;
> > + up_write(&current->mm->mmap_sem);
> > + return;
> > + }
> > +
> > + /*
> > + * Couldn't get mmap_sem lock, so must setup to update
> > + * mm->locked_vm later. If locked_vm were atomic, we
> > + * wouldn't need this silliness
> > + */
> > + vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> > + if (!vwork)
> > + return;
> > + mm = get_task_mm(current);
> > + if (!mm) {
> > + kfree(vwork);
> > + return;
> > + }
> > + INIT_WORK(&vwork->work, vfio_lock_acct_bg);
> > + vwork->mm = mm;
> > + vwork->npage = npage;
> > + schedule_work(&vwork->work);
> > +}
> > +
> > +/*
> > + * Some mappings aren't backed by a struct page, for example an
> > +mmap'd
> > + * MMIO range for our own or another device. These use a different
> > + * pfn conversion and shouldn't be tracked as locked pages.
> > + */
> > +bool is_invalid_reserved_pfn(unsigned long pfn) {
> > + if (pfn_valid(pfn)) {
> > + bool reserved;
> > + struct page *tail = pfn_to_page(pfn);
> > + struct page *head = compound_trans_head(tail);
> > + reserved = !!(PageReserved(head));
> > + if (head != tail) {
> > + /*
> > + * "head" is not a dangling pointer
> > + * (compound_trans_head takes care of that)
> > + * but the hugepage may have been split
> > + * from under us (and we may not hold a
> > + * reference count on the head page so it can
> > + * be reused before we run PageReferenced), so
> > + * we've to check PageTail before returning
> > + * what we just read.
> > + */
> > + smp_rmb();
> > + if (PageTail(tail))
> > + return reserved;
> > + }
> > + return PageReserved(tail);
> > + }
> > +
> > + return true;
> > +}
> > +
> > +int put_pfn(unsigned long pfn, int prot) {
> > + if (!is_invalid_reserved_pfn(pfn)) {
> > + struct page *page = pfn_to_page(pfn);
> > + if (prot & IOMMU_WRITE)
> > + SetPageDirty(page);
> > + put_page(page);
> > + return 1;
> > + }
> > + return 0;
> > +}
> > +
> > +static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long
> > +*pfn) {
> > + struct page *page[1];
> > + struct vm_area_struct *vma;
> > + int ret = -EFAULT;
> > +
> > + if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> > + *pfn = page_to_pfn(page[0]);
> > + return 0;
> > + }
> > +
> > + printk("via vma\n");
> > + down_read(&current->mm->mmap_sem);
> > +
> > + vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> > +
> > + if (vma && vma->vm_flags & VM_PFNMAP) {
> > + *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > + if (is_invalid_reserved_pfn(*pfn))
> > + ret = 0;
> > + }
> > +
> > + up_read(&current->mm->mmap_sem);
> > +
> > + return ret;
> > +}
> > +
> > +/*
> > + * Attempt to pin pages. We really don't want to track all the pfns
> > +and
> > + * the iommu can only map chunks of consecutive pfns anyway, so get
> > +the
> > + * first page and all consecutive pages with the same locking.
> > + */
> > +long vfio_pin_pages(unsigned long vaddr, long npage,
> > + int prot, unsigned long *pfn_base) {
> > + unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > + bool lock_cap = capable(CAP_IPC_LOCK);
> > + long ret, i;
> > +
> > + if (!current->mm)
> > + return -ENODEV;
> > +
> > + ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> > + if (ret)
> > + return ret;
> > +
> > + if (is_invalid_reserved_pfn(*pfn_base))
> > + return 1;
> > +
> > + if (!lock_cap && current->mm->locked_vm + 1 > limit) {
> > + put_pfn(*pfn_base, prot);
> > + pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
> > + limit << PAGE_SHIFT);
> > + return -ENOMEM;
> > + }
> > +
> > + if (unlikely(disable_hugepages)) {
> > + vfio_lock_acct(1);
> > + return 1;
> > + }
> > +
> > + /* Lock all the consecutive pages from pfn_base */
> > + for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
> > + unsigned long pfn = 0;
> > +
> > + ret = vaddr_get_pfn(vaddr, prot, &pfn);
> > + if (ret)
> > + break;
> > +
> > + if (pfn != *pfn_base + i || is_invalid_reserved_pfn(pfn)) {
> > + put_pfn(pfn, prot);
> > + break;
> > + }
> > +
> > + if (!lock_cap && current->mm->locked_vm + i + 1 > limit) {
> > + put_pfn(pfn, prot);
> > + pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > + __func__, limit << PAGE_SHIFT);
> > + break;
> > + }
> > + }
> > +
> > + vfio_lock_acct(i);
> > +
> > + return i;
> > +}
> > +
> > +long vfio_unpin_pages(unsigned long pfn, long npage,
> > + int prot, bool do_accounting) {
> > + unsigned long unlocked = 0;
> > + long i;
> > +
> > + for (i = 0; i < npage; i++)
> > + unlocked += put_pfn(pfn++, prot);
> > +
> > + if (do_accounting)
> > + vfio_lock_acct(-unlocked);
> > +
> > + return unlocked;
> > +}
> > diff --git a/drivers/vfio/vfio_iommu_common.h
> > b/drivers/vfio/vfio_iommu_common.h
> > new file mode 100644
> > index 0000000..4738391
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_iommu_common.h
> > @@ -0,0 +1,30 @@
> > +/*
> > + * Copyright (C) 2012 Red Hat, Inc. All rights reserved.
> > + * Copyright (C) 2013 Freescale Semiconductor, Inc.
> > + *
> > + * This program is free software; you can redistribute it and/or
> > +modify it
> > + * under the terms of the GNU General Public License version 2 as
> > +published
> > + * by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA
> > +02111-1307 USA */
> > +
> > +#ifndef _VFIO_IOMMU_COMMON_H
> > +#define _VFIO_IOMMU_COMMON_H
> > +
> > +void vfio_lock_acct_bg(struct work_struct *work);
>
> Does this need to be exposed?

No, will make this and other which does not need to be exposed as static to this file

Thanks
-Bharat
>
> > +void vfio_lock_acct(long npage);
> > +bool is_invalid_reserved_pfn(unsigned long pfn); int put_pfn(unsigned
> > +long pfn, int prot);
>
> Why are these exposed, they only seem to be used by functions in the new common
> file.
>
> > +long vfio_pin_pages(unsigned long vaddr, long npage, int prot,
> > + unsigned long *pfn_base);
> > +long vfio_unpin_pages(unsigned long pfn, long npage,
> > + int prot, bool do_accounting);
>
> Can we get by with just these two and vfio_lock_acct()?
>
> > +#endif
> > diff --git a/drivers/vfio/vfio_iommu_type1.c
> > b/drivers/vfio/vfio_iommu_type1.c index a9807de..e9a58fa 100644
> > --- a/drivers/vfio/vfio_iommu_type1.c
> > +++ b/drivers/vfio/vfio_iommu_type1.c
> > @@ -37,6 +37,7 @@
> > #include <linux/uaccess.h>
> > #include <linux/vfio.h>
> > #include <linux/workqueue.h>
> > +#include "vfio_iommu_common.h"
> >
> > #define DRIVER_VERSION "0.2"
> > #define DRIVER_AUTHOR "Alex Williamson <[email protected]>"
> > @@ -48,12 +49,6 @@ module_param_named(allow_unsafe_interrupts,
> > MODULE_PARM_DESC(allow_unsafe_interrupts,
> > "Enable VFIO IOMMU support for on platforms without interrupt
> > remapping support.");
> >
> > -static bool disable_hugepages;
> > -module_param_named(disable_hugepages,
> > - disable_hugepages, bool, S_IRUGO | S_IWUSR);
> > -MODULE_PARM_DESC(disable_hugepages,
> > - "Disable VFIO IOMMU support for IOMMU hugepages.");
> > -
> > struct vfio_iommu {
> > struct iommu_domain *domain;
> > struct mutex lock;
> > @@ -123,205 +118,6 @@ static void vfio_remove_dma(struct vfio_iommu *iommu,
> struct vfio_dma *old)
> > rb_erase(&old->node, &iommu->dma_list); }
> >
> > -struct vwork {
> > - struct mm_struct *mm;
> > - long npage;
> > - struct work_struct work;
> > -};
> > -
> > -/* delayed decrement/increment for locked_vm */ -static void
> > vfio_lock_acct_bg(struct work_struct *work) -{
> > - struct vwork *vwork = container_of(work, struct vwork, work);
> > - struct mm_struct *mm;
> > -
> > - mm = vwork->mm;
> > - down_write(&mm->mmap_sem);
> > - mm->locked_vm += vwork->npage;
> > - up_write(&mm->mmap_sem);
> > - mmput(mm);
> > - kfree(vwork);
> > -}
> > -
> > -static void vfio_lock_acct(long npage) -{
> > - struct vwork *vwork;
> > - struct mm_struct *mm;
> > -
> > - if (!current->mm || !npage)
> > - return; /* process exited or nothing to do */
> > -
> > - if (down_write_trylock(&current->mm->mmap_sem)) {
> > - current->mm->locked_vm += npage;
> > - up_write(&current->mm->mmap_sem);
> > - return;
> > - }
> > -
> > - /*
> > - * Couldn't get mmap_sem lock, so must setup to update
> > - * mm->locked_vm later. If locked_vm were atomic, we
> > - * wouldn't need this silliness
> > - */
> > - vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> > - if (!vwork)
> > - return;
> > - mm = get_task_mm(current);
> > - if (!mm) {
> > - kfree(vwork);
> > - return;
> > - }
> > - INIT_WORK(&vwork->work, vfio_lock_acct_bg);
> > - vwork->mm = mm;
> > - vwork->npage = npage;
> > - schedule_work(&vwork->work);
> > -}
> > -
> > -/*
> > - * Some mappings aren't backed by a struct page, for example an
> > mmap'd
> > - * MMIO range for our own or another device. These use a different
> > - * pfn conversion and shouldn't be tracked as locked pages.
> > - */
> > -static bool is_invalid_reserved_pfn(unsigned long pfn) -{
> > - if (pfn_valid(pfn)) {
> > - bool reserved;
> > - struct page *tail = pfn_to_page(pfn);
> > - struct page *head = compound_trans_head(tail);
> > - reserved = !!(PageReserved(head));
> > - if (head != tail) {
> > - /*
> > - * "head" is not a dangling pointer
> > - * (compound_trans_head takes care of that)
> > - * but the hugepage may have been split
> > - * from under us (and we may not hold a
> > - * reference count on the head page so it can
> > - * be reused before we run PageReferenced), so
> > - * we've to check PageTail before returning
> > - * what we just read.
> > - */
> > - smp_rmb();
> > - if (PageTail(tail))
> > - return reserved;
> > - }
> > - return PageReserved(tail);
> > - }
> > -
> > - return true;
> > -}
> > -
> > -static int put_pfn(unsigned long pfn, int prot) -{
> > - if (!is_invalid_reserved_pfn(pfn)) {
> > - struct page *page = pfn_to_page(pfn);
> > - if (prot & IOMMU_WRITE)
> > - SetPageDirty(page);
> > - put_page(page);
> > - return 1;
> > - }
> > - return 0;
> > -}
> > -
> > -static int vaddr_get_pfn(unsigned long vaddr, int prot, unsigned long
> > *pfn) -{
> > - struct page *page[1];
> > - struct vm_area_struct *vma;
> > - int ret = -EFAULT;
> > -
> > - if (get_user_pages_fast(vaddr, 1, !!(prot & IOMMU_WRITE), page) == 1) {
> > - *pfn = page_to_pfn(page[0]);
> > - return 0;
> > - }
> > -
> > - down_read(&current->mm->mmap_sem);
> > -
> > - vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> > -
> > - if (vma && vma->vm_flags & VM_PFNMAP) {
> > - *pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > - if (is_invalid_reserved_pfn(*pfn))
> > - ret = 0;
> > - }
> > -
> > - up_read(&current->mm->mmap_sem);
> > -
> > - return ret;
> > -}
> > -
> > -/*
> > - * Attempt to pin pages. We really don't want to track all the pfns
> > and
> > - * the iommu can only map chunks of consecutive pfns anyway, so get
> > the
> > - * first page and all consecutive pages with the same locking.
> > - */
> > -static long vfio_pin_pages(unsigned long vaddr, long npage,
> > - int prot, unsigned long *pfn_base)
> > -{
> > - unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > - bool lock_cap = capable(CAP_IPC_LOCK);
> > - long ret, i;
> > -
> > - if (!current->mm)
> > - return -ENODEV;
> > -
> > - ret = vaddr_get_pfn(vaddr, prot, pfn_base);
> > - if (ret)
> > - return ret;
> > -
> > - if (is_invalid_reserved_pfn(*pfn_base))
> > - return 1;
> > -
> > - if (!lock_cap && current->mm->locked_vm + 1 > limit) {
> > - put_pfn(*pfn_base, prot);
> > - pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n", __func__,
> > - limit << PAGE_SHIFT);
> > - return -ENOMEM;
> > - }
> > -
> > - if (unlikely(disable_hugepages)) {
> > - vfio_lock_acct(1);
> > - return 1;
> > - }
> > -
> > - /* Lock all the consecutive pages from pfn_base */
> > - for (i = 1, vaddr += PAGE_SIZE; i < npage; i++, vaddr += PAGE_SIZE) {
> > - unsigned long pfn = 0;
> > -
> > - ret = vaddr_get_pfn(vaddr, prot, &pfn);
> > - if (ret)
> > - break;
> > -
> > - if (pfn != *pfn_base + i || is_invalid_reserved_pfn(pfn)) {
> > - put_pfn(pfn, prot);
> > - break;
> > - }
> > -
> > - if (!lock_cap && current->mm->locked_vm + i + 1 > limit) {
> > - put_pfn(pfn, prot);
> > - pr_warn("%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > - __func__, limit << PAGE_SHIFT);
> > - break;
> > - }
> > - }
> > -
> > - vfio_lock_acct(i);
> > -
> > - return i;
> > -}
> > -
> > -static long vfio_unpin_pages(unsigned long pfn, long npage,
> > - int prot, bool do_accounting)
> > -{
> > - unsigned long unlocked = 0;
> > - long i;
> > -
> > - for (i = 0; i < npage; i++)
> > - unlocked += put_pfn(pfn++, prot);
> > -
> > - if (do_accounting)
> > - vfio_lock_acct(-unlocked);
> > -
> > - return unlocked;
> > -}
> > -
> > static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > dma_addr_t iova, size_t *size) {
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body
> of a message to [email protected] More majordomo info at
> http://vger.kernel.org/majordomo-info.html

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-09-26 05:27:37

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH 7/7] vfio pci: Add vfio iommu implementation for FSL_PAMU



> -----Original Message-----
> From: Alex Williamson [mailto:[email protected]]
> Sent: Thursday, September 26, 2013 12:37 AM
> To: Bhushan Bharat-R65777
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> foundation.org; Bhushan Bharat-R65777
> Subject: Re: [PATCH 7/7] vfio pci: Add vfio iommu implementation for FSL_PAMU
>
> On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > This patch adds vfio iommu support for Freescale IOMMU (PAMU -
> > Peripheral Access Management Unit).
> >
> > The Freescale PAMU is an aperture-based IOMMU with the following
> > characteristics. Each device has an entry in a table in memory
> > describing the iova->phys mapping. The mapping has:
> > -an overall aperture that is power of 2 sized, and has a start iova that
> > is naturally aligned
> > -has 1 or more windows within the aperture
> > -number of windows must be power of 2, max is 256
> > -size of each window is determined by aperture size / # of windows
> > -iova of each window is determined by aperture start iova / # of windows
> > -the mapped region in each window can be different than
> > the window size...mapping must power of 2
> > -physical address of the mapping must be naturally aligned
> > with the mapping size
> >
> > Some of the code is derived from TYPE1 iommu (driver/vfio/vfio_iommu_type1.c).
> >
> > Signed-off-by: Bharat Bhushan <[email protected]>
> > ---
> > drivers/vfio/Kconfig | 6 +
> > drivers/vfio/Makefile | 1 +
> > drivers/vfio/vfio_iommu_fsl_pamu.c | 952
> ++++++++++++++++++++++++++++++++++++
> > include/uapi/linux/vfio.h | 100 ++++
> > 4 files changed, 1059 insertions(+), 0 deletions(-) create mode
> > 100644 drivers/vfio/vfio_iommu_fsl_pamu.c
> >
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig index
> > 26b3d9d..7d1da26 100644
> > --- a/drivers/vfio/Kconfig
> > +++ b/drivers/vfio/Kconfig
> > @@ -8,11 +8,17 @@ config VFIO_IOMMU_SPAPR_TCE
> > depends on VFIO && SPAPR_TCE_IOMMU
> > default n
> >
> > +config VFIO_IOMMU_FSL_PAMU
> > + tristate
> > + depends on VFIO
> > + default n
> > +
> > menuconfig VFIO
> > tristate "VFIO Non-Privileged userspace driver framework"
> > depends on IOMMU_API
> > select VFIO_IOMMU_TYPE1 if X86
> > select VFIO_IOMMU_SPAPR_TCE if (PPC_POWERNV || PPC_PSERIES)
> > + select VFIO_IOMMU_FSL_PAMU if FSL_PAMU
> > help
> > VFIO provides a framework for secure userspace device drivers.
> > See Documentation/vfio.txt for more details.
> > diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile index
> > c5792ec..7461350 100644
> > --- a/drivers/vfio/Makefile
> > +++ b/drivers/vfio/Makefile
> > @@ -1,4 +1,5 @@
> > obj-$(CONFIG_VFIO) += vfio.o
> > obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_common.o
> > vfio_iommu_type1.o
> > obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_common.o
> > vfio_iommu_spapr_tce.o
> > +obj-$(CONFIG_VFIO_IOMMU_FSL_PAMU) += vfio_iommu_common.o
> > +vfio_iommu_fsl_pamu.o
> > obj-$(CONFIG_VFIO_PCI) += pci/
> > diff --git a/drivers/vfio/vfio_iommu_fsl_pamu.c
> > b/drivers/vfio/vfio_iommu_fsl_pamu.c
> > new file mode 100644
> > index 0000000..b29365f
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_iommu_fsl_pamu.c
> > @@ -0,0 +1,952 @@
> > +/*
> > + * VFIO: IOMMU DMA mapping support for FSL PAMU IOMMU
> > + *
> > + * This program is free software; you can redistribute it and/or
> > +modify
> > + * it under the terms of the GNU General Public License, version 2,
> > +as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful,
> > + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> > + * GNU General Public License for more details.
> > + *
> > + * You should have received a copy of the GNU General Public License
> > + * along with this program; if not, write to the Free Software
> > + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
> > + *
> > + * Copyright (C) 2013 Freescale Semiconductor, Inc.
> > + *
> > + * Author: Bharat Bhushan <[email protected]>
> > + *
> > + * This file is derived from driver/vfio/vfio_iommu_type1.c
> > + *
> > + * The Freescale PAMU is an aperture-based IOMMU with the following
> > + * characteristics. Each device has an entry in a table in memory
> > + * describing the iova->phys mapping. The mapping has:
> > + * -an overall aperture that is power of 2 sized, and has a start iova that
> > + * is naturally aligned
> > + * -has 1 or more windows within the aperture
> > + * -number of windows must be power of 2, max is 256
> > + * -size of each window is determined by aperture size / # of windows
> > + * -iova of each window is determined by aperture start iova / # of
> windows
> > + * -the mapped region in each window can be different than
> > + * the window size...mapping must power of 2
> > + * -physical address of the mapping must be naturally aligned
> > + * with the mapping size
> > + */
> > +
> > +#include <linux/compat.h>
> > +#include <linux/device.h>
> > +#include <linux/fs.h>
> > +#include <linux/iommu.h>
> > +#include <linux/module.h>
> > +#include <linux/mm.h>
> > +#include <linux/pci.h> /* pci_bus_type */
> > +#include <linux/sched.h>
> > +#include <linux/slab.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +#include <linux/workqueue.h>
> > +#include <linux/hugetlb.h>
> > +#include <linux/msi.h>
> > +#include <asm/fsl_pamu_stash.h>
> > +
> > +#include "vfio_iommu_common.h"
> > +
> > +#define DRIVER_VERSION "0.1"
> > +#define DRIVER_AUTHOR "Bharat Bhushan <[email protected]>"
> > +#define DRIVER_DESC "FSL PAMU IOMMU driver for VFIO"
> > +
> > +struct vfio_iommu {
> > + struct iommu_domain *domain;
> > + struct mutex lock;
> > + dma_addr_t aperture_start;
> > + dma_addr_t aperture_end;
> > + dma_addr_t page_size; /* Maximum mapped Page size */
> > + int nsubwindows; /* Number of subwindows */
> > + struct rb_root dma_list;
> > + struct list_head msi_dma_list;
> > + struct list_head group_list;
> > +};
> > +
> > +struct vfio_dma {
> > + struct rb_node node;
> > + dma_addr_t iova; /* Device address */
> > + unsigned long vaddr; /* Process virtual addr */
> > + size_t size; /* Number of pages */
>
> Is this really pages?

Comment is leftover of previous implementation.

>
> > + int prot; /* IOMMU_READ/WRITE */
> > +};
> > +
> > +struct vfio_msi_dma {
> > + struct list_head next;
> > + dma_addr_t iova; /* Device address */
> > + int bank_id;
> > + int prot; /* IOMMU_READ/WRITE */
> > +};
> > +
> > +struct vfio_group {
> > + struct iommu_group *iommu_group;
> > + struct list_head next;
> > +};
> > +
> > +static struct vfio_dma *vfio_find_dma(struct vfio_iommu *iommu,
> > + dma_addr_t start, size_t size) {
> > + struct rb_node *node = iommu->dma_list.rb_node;
> > +
> > + while (node) {
> > + struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> > +
> > + if (start + size <= dma->iova)
> > + node = node->rb_left;
> > + else if (start >= dma->iova + dma->size)
>
> because this looks more like it's bytes...
>
> > + node = node->rb_right;
> > + else
> > + return dma;
> > + }
> > +
> > + return NULL;
> > +}
> > +
> > +static void vfio_insert_dma(struct vfio_iommu *iommu, struct vfio_dma
> > +*new) {
> > + struct rb_node **link = &iommu->dma_list.rb_node, *parent = NULL;
> > + struct vfio_dma *dma;
> > +
> > + while (*link) {
> > + parent = *link;
> > + dma = rb_entry(parent, struct vfio_dma, node);
> > +
> > + if (new->iova + new->size <= dma->iova)
>
> so does this...
>
> > + link = &(*link)->rb_left;
> > + else
> > + link = &(*link)->rb_right;
> > + }
> > +
> > + rb_link_node(&new->node, parent, link);
> > + rb_insert_color(&new->node, &iommu->dma_list); }
> > +
> > +static void vfio_remove_dma(struct vfio_iommu *iommu, struct vfio_dma
> > +*old) {
> > + rb_erase(&old->node, &iommu->dma_list); }
>
>
> So if your vfio_dma.size is actually in bytes, why isn't all this code in
> common?

This takes "struct vfio_iommu " which is not same on type1 and fsl_pamu, so I have not done that in first phase.
But yes I completely agree that much more consolidation can be done. I would like to do that in next step.

>
> > +
> > +static int iova_to_win(struct vfio_iommu *iommu, dma_addr_t iova) {
> > + u64 offset = iova - iommu->aperture_start;
> > + do_div(offset, iommu->page_size);
> > + return (int) offset;
> > +}
> > +
> > +static int vfio_disable_iommu_domain(struct vfio_iommu *iommu) {
> > + int enable = 0;
> > + return iommu_domain_set_attr(iommu->domain,
> > + DOMAIN_ATTR_FSL_PAMU_ENABLE, &enable); }
>
> This is never called?!
>
> > +
> > +static int vfio_enable_iommu_domain(struct vfio_iommu *iommu) {
> > + int enable = 1;
> > + return iommu_domain_set_attr(iommu->domain,
> > + DOMAIN_ATTR_FSL_PAMU_ENABLE, &enable); }
> > +
> > +/* Unmap DMA region */
> > +static int vfio_unmap_unpin(struct vfio_iommu *iommu, struct vfio_dma *dma,
> > + dma_addr_t iova, size_t *size) {
> > + dma_addr_t start = iova;
> > + int win, win_start, win_end;
> > + long unlocked = 0;
> > + unsigned int nr_pages;
> > +
> > + nr_pages = iommu->page_size / PAGE_SIZE;
> > + win_start = iova_to_win(iommu, iova);
> > + win_end = iova_to_win(iommu, iova + *size - 1);
> > +
> > + /* Release the pinned pages */
> > + for (win = win_start; win <= win_end; iova += iommu->page_size, win++) {
> > + unsigned long pfn;
> > +
> > + pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> > + if (!pfn)
> > + continue;
> > +
> > + iommu_domain_window_disable(iommu->domain, win);
> > +
> > + unlocked += vfio_unpin_pages(pfn, nr_pages, dma->prot, 1);
> > + }
> > +
> > + vfio_lock_acct(-unlocked);
> > + *size = iova - start;
> > + return 0;
> > +}
> > +
> > +static int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> start,
> > + size_t *size, struct vfio_dma *dma) {
> > + size_t offset, overlap, tmp;
> > + struct vfio_dma *split;
> > + int ret;
> > +
> > + if (!*size)
> > + return 0;
> > +
> > + /*
> > + * Existing dma region is completely covered, unmap all. This is
> > + * the likely case since userspace tends to map and unmap buffers
> > + * in one shot rather than multiple mappings within a buffer.
> > + */
> > + if (likely(start <= dma->iova &&
> > + start + *size >= dma->iova + dma->size)) {
> > + *size = dma->size;
> > + ret = vfio_unmap_unpin(iommu, dma, dma->iova, size);
> > + if (ret)
> > + return ret;
> > +
> > + /*
> > + * Did we remove more than we have? Should never happen
> > + * since a vfio_dma is contiguous in iova and vaddr.
> > + */
> > + WARN_ON(*size != dma->size);
> > +
> > + vfio_remove_dma(iommu, dma);
> > + kfree(dma);
> > + return 0;
> > + }
> > +
> > + /* Overlap low address of existing range */
> > + if (start <= dma->iova) {
> > + overlap = start + *size - dma->iova;
> > + ret = vfio_unmap_unpin(iommu, dma, dma->iova, &overlap);
> > + if (ret)
> > + return ret;
> > +
> > + vfio_remove_dma(iommu, dma);
> > +
> > + /*
> > + * Check, we may have removed to whole vfio_dma. If not
> > + * fixup and re-insert.
> > + */
> > + if (overlap < dma->size) {
> > + dma->iova += overlap;
> > + dma->vaddr += overlap;
> > + dma->size -= overlap;
> > + vfio_insert_dma(iommu, dma);
> > + } else
> > + kfree(dma);
> > +
> > + *size = overlap;
> > + return 0;
> > + }
> > +
> > + /* Overlap high address of existing range */
> > + if (start + *size >= dma->iova + dma->size) {
> > + offset = start - dma->iova;
> > + overlap = dma->size - offset;
> > +
> > + ret = vfio_unmap_unpin(iommu, dma, start, &overlap);
> > + if (ret)
> > + return ret;
> > +
> > + dma->size -= overlap;
> > + *size = overlap;
> > + return 0;
> > + }
> > +
> > + /* Split existing */
> > +
> > + /*
> > + * Allocate our tracking structure early even though it may not
> > + * be used. An Allocation failure later loses track of pages and
> > + * is more difficult to unwind.
> > + */
> > + split = kzalloc(sizeof(*split), GFP_KERNEL);
> > + if (!split)
> > + return -ENOMEM;
> > +
> > + offset = start - dma->iova;
> > +
> > + ret = vfio_unmap_unpin(iommu, dma, start, size);
> > + if (ret || !*size) {
> > + kfree(split);
> > + return ret;
> > + }
> > +
> > + tmp = dma->size;
> > +
> > + /* Resize the lower vfio_dma in place, before the below insert */
> > + dma->size = offset;
> > +
> > + /* Insert new for remainder, assuming it didn't all get unmapped */
> > + if (likely(offset + *size < tmp)) {
> > + split->size = tmp - offset - *size;
> > + split->iova = dma->iova + offset + *size;
> > + split->vaddr = dma->vaddr + offset + *size;
> > + split->prot = dma->prot;
> > + vfio_insert_dma(iommu, split);
> > + } else
> > + kfree(split);
> > +
> > + return 0;
> > +}
>
> Hmm, this looks identical to type1, can we share more?

Yes, as I said this uses " struct vfio_iommu", vfio_unmap_unpin() etc which are different for type1 and fsl_pamu.
In this patchset I only moved the function which are straight forward. But yes I am working on doing more consolidation patches.

>
> > +
> > +/* Map DMA region */
> > +static int vfio_dma_map(struct vfio_iommu *iommu, dma_addr_t iova,
> > + unsigned long vaddr, long npage, int prot) {
> > + int ret = 0, i;
> > + size_t size;
> > + unsigned int win, nr_subwindows;
> > + dma_addr_t iovamap;
> > +
> > + /* total size to be mapped */
> > + size = npage << PAGE_SHIFT;
> > + do_div(size, iommu->page_size);
> > + nr_subwindows = size;
> > + size = npage << PAGE_SHIFT;
>
> Is all this do_div() stuff necessary? If page_size is a power of two, just
> shift it.

Will do

>
> > + iovamap = iova;
> > + for (i = 0; i < nr_subwindows; i++) {
> > + unsigned long pfn;
> > + unsigned long nr_pages;
> > + dma_addr_t mapsize;
> > + struct vfio_dma *dma = NULL;
> > +
> > + win = iova_to_win(iommu, iovamap);
>
> Aren't these consecutive, why can't we just increment?

Yes, will do

>
> > + if (iovamap != iommu->aperture_start + iommu->page_size * win) {
> > + pr_err("%s iova(%llx) unalligned to window size %llx\n",
> > + __func__, iovamap, iommu->page_size);
> > + ret = -EINVAL;
> > + break;
> > + }
>
> Can't this only happen on the first one?

iova to be mapped in a window must be (iommu->aperture_start + iommu->page_size * win)
But as you pointed out that "win" can incremented iovamap is always incremented by page_size then checking this outside the look can be done. But this is the requirement of our iommu and we should error out if this is not met.

> Seems like it should be outside of the
> loop. What about alignment with the end of the window, do you care? Check
> spelling in your warning, but better yet, get rid of it, this doesn't seem like
> something we need to error on.
>
> > +
> > + mapsize = min(iova + size - iovamap, iommu->page_size);
> > + /*
> > + * FIXME: Currently we only support mapping page-size
> > + * of subwindow-size.
> > + */
> > + if (mapsize < iommu->page_size) {
> > + pr_err("%s iova (%llx) not alligned to window size %llx\n",
> > + __func__, iovamap, iommu->page_size);
> > + ret = -EINVAL;
> > + break;
> > + }
>
> So you do care about the end alignment, but why can't we error for both of these
> in advance?

Eventually this should go away, I will remove this :)

>
> > +
> > + nr_pages = mapsize >> PAGE_SHIFT;
> > +
> > + /* Pin a contiguous chunk of memory */
> > + ret = vfio_pin_pages(vaddr, nr_pages, prot, &pfn);
> > + if (ret != nr_pages) {
> > + pr_err("%s unable to pin pages = %lx, pinned(%lx/%lx)\n",
> > + __func__, vaddr, npage, nr_pages);
> > + ret = -EINVAL;
> > + break;
> > + }
>
> How likely is this to succeed? It seems like we're relying on userspace to use
> hugepages to make this work.

Yes, userspace will first have hugepages and than calls DMA_MAP()

>
> > +
> > + ret = iommu_domain_window_enable(iommu->domain, win,
> > + (phys_addr_t)pfn << PAGE_SHIFT,
> > + mapsize, prot);
> > + if (ret) {
> > + pr_err("%s unable to iommu_map()\n", __func__);
> > + ret = -EINVAL;
> > + break;
> > + }
>
> You might consider how many cases you're returning EINVAL and think about how
> difficult this will be to debug. I don't think we can leave all these pr_err()s
> since it gives userspace a trivial way to spam log files.
>
> > +
> > + /*
> > + * Check if we abut a region below - nothing below 0.
> > + * This is the most likely case when mapping chunks of
> > + * physically contiguous regions within a virtual address
> > + * range. Update the abutting entry in place since iova
> > + * doesn't change.
> > + */
> > + if (likely(iovamap)) {
> > + struct vfio_dma *tmp;
> > + tmp = vfio_find_dma(iommu, iovamap - 1, 1);
> > + if (tmp && tmp->prot == prot &&
> > + tmp->vaddr + tmp->size == vaddr) {
> > + tmp->size += mapsize;
> > + dma = tmp;
> > + }
> > + }
> > +
> > + /*
> > + * Check if we abut a region above - nothing above ~0 + 1.
> > + * If we abut above and below, remove and free. If only
> > + * abut above, remove, modify, reinsert.
> > + */
> > + if (likely(iovamap + mapsize)) {
> > + struct vfio_dma *tmp;
> > + tmp = vfio_find_dma(iommu, iovamap + mapsize, 1);
> > + if (tmp && tmp->prot == prot &&
> > + tmp->vaddr == vaddr + mapsize) {
> > + vfio_remove_dma(iommu, tmp);
> > + if (dma) {
> > + dma->size += tmp->size;
> > + kfree(tmp);
> > + } else {
> > + tmp->size += mapsize;
> > + tmp->iova = iovamap;
> > + tmp->vaddr = vaddr;
> > + vfio_insert_dma(iommu, tmp);
> > + dma = tmp;
> > + }
> > + }
> > + }
> > +
> > + if (!dma) {
> > + dma = kzalloc(sizeof(*dma), GFP_KERNEL);
> > + if (!dma) {
> > + iommu_unmap(iommu->domain, iovamap, mapsize);
> > + vfio_unpin_pages(pfn, npage, prot, true);
> > + ret = -ENOMEM;
> > + break;
> > + }
> > +
> > + dma->size = mapsize;
> > + dma->iova = iovamap;
> > + dma->vaddr = vaddr;
> > + dma->prot = prot;
> > + vfio_insert_dma(iommu, dma);
> > + }
> > +
> > + iovamap += mapsize;
> > + vaddr += mapsize;
>
> Another chunk that looks like it's probably identical to type1. Can we rip this
> out to another function and add it to common?

Yes, and same answer :)

>
> > + }
> > +
> > + if (ret) {
> > + struct vfio_dma *tmp;
> > + while ((tmp = vfio_find_dma(iommu, iova, size))) {
> > + int r = vfio_remove_dma_overlap(iommu, iova,
> > + &size, tmp);
> > + if (WARN_ON(r || !size))
> > + break;
> > + }
> > + }
>
>
> Broken whitespace, please run scripts/checkpatch.pl before posting.
>
> > +
> > + vfio_enable_iommu_domain(iommu);
>
> I don't quite understand your semantics here since you never use the disable
> version, is this just redundant after the first mapping? When dma_list is empty
> should it be disabled?

Yes, I intended to do that but somehow in final version it is not there :(

> Is there a bug here that an error will enable the iommu
> domain even if there are no entries?

Will correct this.

>
> > + return 0;
> > +}
> > +
> > +static int vfio_dma_do_map(struct vfio_iommu *iommu,
> > + struct vfio_iommu_type1_dma_map *map) {
> > + dma_addr_t iova = map->iova;
> > + size_t size = map->size;
> > + unsigned long vaddr = map->vaddr;
> > + int ret = 0, prot = 0;
> > + long npage;
> > +
> > + /* READ/WRITE from device perspective */
> > + if (map->flags & VFIO_DMA_MAP_FLAG_WRITE)
> > + prot |= IOMMU_WRITE;
> > + if (map->flags & VFIO_DMA_MAP_FLAG_READ)
> > + prot |= IOMMU_READ;
> > +
> > + if (!prot)
> > + return -EINVAL; /* No READ/WRITE? */
> > +
> > + /* Don't allow IOVA wrap */
> > + if (iova + size && iova + size < iova)
> > + return -EINVAL;
> > +
> > + /* Don't allow virtual address wrap */
> > + if (vaddr + size && vaddr + size < vaddr)
> > + return -EINVAL;
> > +
> > + /*
> > + * FIXME: Currently we only support mapping page-size
> > + * of subwindow-size.
> > + */
> > + if (size < iommu->page_size)
> > + return -EINVAL;
> > +
>
> I'd think the start and end alignment could be tested here.
>
> > + npage = size >> PAGE_SHIFT;
> > + if (!npage)
> > + return -EINVAL;
> > +
> > + mutex_lock(&iommu->lock);
> > +
> > + if (vfio_find_dma(iommu, iova, size)) {
> > + ret = -EEXIST;
> > + goto out_lock;
> > + }
> > +
> > + vfio_dma_map(iommu, iova, vaddr, npage, prot);
> > +
> > +out_lock:
> > + mutex_unlock(&iommu->lock);
> > + return ret;
> > +}
> > +
> > +static int vfio_dma_do_unmap(struct vfio_iommu *iommu,
> > + struct vfio_iommu_type1_dma_unmap *unmap) {
> > + struct vfio_dma *dma;
> > + size_t unmapped = 0, size;
> > + int ret = 0;
> > +
> > + mutex_lock(&iommu->lock);
> > +
> > + while ((dma = vfio_find_dma(iommu, unmap->iova, unmap->size))) {
> > + size = unmap->size;
> > + ret = vfio_remove_dma_overlap(iommu, unmap->iova, &size, dma);
> > + if (ret || !size)
> > + break;
> > + unmapped += size;
> > + }
> > +
> > + mutex_unlock(&iommu->lock);
> > +
> > + /*
> > + * We may unmap more than requested, update the unmap struct so
> > + * userspace can know.
> > + */
> > + unmap->size = unmapped;
> > +
> > + return ret;
> > +}
> > +
> > +static int vfio_handle_get_attr(struct vfio_iommu *iommu,
> > + struct vfio_pamu_attr *pamu_attr) {
> > + switch (pamu_attr->attribute) {
> > + case VFIO_ATTR_GEOMETRY: {
> > + struct iommu_domain_geometry geom;
> > + if (iommu_domain_get_attr(iommu->domain,
> > + DOMAIN_ATTR_GEOMETRY, &geom)) {
> > + pr_err("%s Error getting domain geometry\n",
> > + __func__);
> > + return -EFAULT;
> > + }
> > +
> > + pamu_attr->attr_info.attr.aperture_start = geom.aperture_start;
> > + pamu_attr->attr_info.attr.aperture_end = geom.aperture_end;
> > + break;
> > + }
> > + case VFIO_ATTR_WINDOWS: {
> > + u32 count;
> > + if (iommu_domain_get_attr(iommu->domain,
> > + DOMAIN_ATTR_WINDOWS, &count)) {
> > + pr_err("%s Error getting domain windows\n",
> > + __func__);
> > + return -EFAULT;
> > + }
> > +
> > + pamu_attr->attr_info.windows = count;
> > + break;
> > + }
> > + case VFIO_ATTR_PAMU_STASH: {
> > + struct pamu_stash_attribute stash;
> > + if (iommu_domain_get_attr(iommu->domain,
> > + DOMAIN_ATTR_FSL_PAMU_STASH, &stash)) {
> > + pr_err("%s Error getting domain windows\n",
> > + __func__);
> > + return -EFAULT;
> > + }
> > +
> > + pamu_attr->attr_info.stash.cpu = stash.cpu;
> > + pamu_attr->attr_info.stash.cache = stash.cache;
> > + break;
> > + }
> > +
> > + default:
> > + pr_err("%s Error: Invalid attribute (%d)\n",
> > + __func__, pamu_attr->attribute);
> > + return -EINVAL;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int vfio_handle_set_attr(struct vfio_iommu *iommu,
> > + struct vfio_pamu_attr *pamu_attr) {
> > + switch (pamu_attr->attribute) {
> > + case VFIO_ATTR_GEOMETRY: {
> > + struct iommu_domain_geometry geom;
> > +
> > + geom.aperture_start = pamu_attr->attr_info.attr.aperture_start;
> > + geom.aperture_end = pamu_attr->attr_info.attr.aperture_end;
> > + iommu->aperture_start = geom.aperture_start;
> > + iommu->aperture_end = geom.aperture_end;
> > + geom.force_aperture = 1;
> > + if (iommu_domain_set_attr(iommu->domain,
> > + DOMAIN_ATTR_GEOMETRY, &geom)) {
> > + pr_err("%s Error setting domain geometry\n", __func__);
> > + return -EFAULT;
> > + }
> > +
> > + break;
> > + }
> > + case VFIO_ATTR_WINDOWS: {
> > + u32 count = pamu_attr->attr_info.windows;
> > + u64 size;
> > + if (count > 256) {
> > + pr_err("Number of subwindows requested (%d) is 256\n",
> > + count);
> > + return -EINVAL;
> > + }
> > + iommu->nsubwindows = pamu_attr->attr_info.windows;
> > + size = iommu->aperture_end - iommu->aperture_start + 1;
> > + do_div(size, count);
> > + iommu->page_size = size;
> > + if (iommu_domain_set_attr(iommu->domain,
> > + DOMAIN_ATTR_WINDOWS, &count)) {
> > + pr_err("%s Error getting domain windows\n",
> > + __func__);
> > + return -EFAULT;
> > + }
> > +
> > + break;
> > + }
> > + case VFIO_ATTR_PAMU_STASH: {
> > + struct pamu_stash_attribute stash;
> > +
> > + stash.cpu = pamu_attr->attr_info.stash.cpu;
> > + stash.cache = pamu_attr->attr_info.stash.cache;
> > + if (iommu_domain_set_attr(iommu->domain,
> > + DOMAIN_ATTR_FSL_PAMU_STASH, &stash)) {
> > + pr_err("%s Error getting domain windows\n",
> > + __func__);
> > + return -EFAULT;
> > + }
> > + break;
>
> Why do we throw away the return value of iommu_domain_set_attr and replace it
> with EFAULT in all these cases?

Will use the return of iommu_domain_set_attr().

> I assume all these pr_err()s are leftover
> debug. Can the user do anything they shouldn't through these? How do we
> guarantee that?

I will move these to pr_debug()

>
> > + }
> > +
> > + default:
> > + pr_err("%s Error: Invalid attribute (%d)\n",
> > + __func__, pamu_attr->attribute);
> > + return -EINVAL;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int vfio_msi_map(struct vfio_iommu *iommu,
> > + struct vfio_pamu_msi_bank_map *msi_map, int prot) {
> > + struct msi_region region;
> > + int window;
> > + int ret;
> > +
> > + ret = msi_get_region(msi_map->msi_bank_index, &region);
> > + if (ret) {
> > + pr_err("%s MSI region (%d) not found\n", __func__,
> > + msi_map->msi_bank_index);
> > + return ret;
> > + }
> > +
> > + window = iova_to_win(iommu, msi_map->iova);
> > + ret = iommu_domain_window_enable(iommu->domain, window, region.addr,
> > + region.size, prot);
> > + if (ret) {
> > + pr_err("%s Error: unable to map msi region\n", __func__);
> > + return ret;
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static int vfio_do_msi_map(struct vfio_iommu *iommu,
> > + struct vfio_pamu_msi_bank_map *msi_map) {
> > + struct vfio_msi_dma *msi_dma;
> > + int ret, prot = 0;
> > +
> > + /* READ/WRITE from device perspective */
> > + if (msi_map->flags & VFIO_DMA_MAP_FLAG_WRITE)
> > + prot |= IOMMU_WRITE;
> > + if (msi_map->flags & VFIO_DMA_MAP_FLAG_READ)
> > + prot |= IOMMU_READ;
> > +
> > + if (!prot)
> > + return -EINVAL; /* No READ/WRITE? */
> > +
> > + ret = vfio_msi_map(iommu, msi_map, prot);
> > + if (ret)
> > + return ret;
> > +
> > + msi_dma = kzalloc(sizeof(*msi_dma), GFP_KERNEL);
> > + if (!msi_dma)
> > + return -ENOMEM;
> > +
> > + msi_dma->iova = msi_map->iova;
> > + msi_dma->bank_id = msi_map->msi_bank_index;
> > + list_add(&msi_dma->next, &iommu->msi_dma_list);
> > + return 0;
>
> What happens when the user creates multiple MSI mappings at the same iova? What
> happens when DMA mappings overlap MSI mappings?

Good point, will correct this.

> Shouldn't there be some locking
> around list manipulation?

Yes, will correct this as well.

Thanks
-Bharat

>
> > +}
> > +
> > +static void vfio_msi_unmap(struct vfio_iommu *iommu, dma_addr_t iova)
> > +{
> > + int window;
> > + window = iova_to_win(iommu, iova);
> > + iommu_domain_window_disable(iommu->domain, window); }
> > +
> > +static int vfio_do_msi_unmap(struct vfio_iommu *iommu,
> > + struct vfio_pamu_msi_bank_unmap *msi_unmap) {
> > + struct vfio_msi_dma *mdma, *mdma_tmp;
> > +
> > + list_for_each_entry_safe(mdma, mdma_tmp, &iommu->msi_dma_list, next) {
> > + if (mdma->iova == msi_unmap->iova) {
> > + vfio_msi_unmap(iommu, mdma->iova);
> > + list_del(&mdma->next);
> > + kfree(mdma);
> > + return 0;
> > + }
> > + }
> > +
> > + return -EINVAL;
> > +}
> > +static void *vfio_iommu_fsl_pamu_open(unsigned long arg) {
> > + struct vfio_iommu *iommu;
> > +
> > + if (arg != VFIO_FSL_PAMU_IOMMU)
> > + return ERR_PTR(-EINVAL);
> > +
> > + iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > + if (!iommu)
> > + return ERR_PTR(-ENOMEM);
> > +
> > + INIT_LIST_HEAD(&iommu->group_list);
> > + iommu->dma_list = RB_ROOT;
> > + INIT_LIST_HEAD(&iommu->msi_dma_list);
> > + mutex_init(&iommu->lock);
> > +
> > + /*
> > + * Wish we didn't have to know about bus_type here.
> > + */
> > + iommu->domain = iommu_domain_alloc(&pci_bus_type);
> > + if (!iommu->domain) {
> > + kfree(iommu);
> > + return ERR_PTR(-EIO);
> > + }
> > +
> > + return iommu;
> > +}
> > +
> > +static void vfio_iommu_fsl_pamu_release(void *iommu_data) {
> > + struct vfio_iommu *iommu = iommu_data;
> > + struct vfio_group *group, *group_tmp;
> > + struct vfio_msi_dma *mdma, *mdma_tmp;
> > + struct rb_node *node;
> > +
> > + list_for_each_entry_safe(group, group_tmp, &iommu->group_list, next) {
> > + iommu_detach_group(iommu->domain, group->iommu_group);
> > + list_del(&group->next);
> > + kfree(group);
> > + }
> > +
> > + while ((node = rb_first(&iommu->dma_list))) {
> > + struct vfio_dma *dma = rb_entry(node, struct vfio_dma, node);
> > + size_t size = dma->size;
> > + vfio_remove_dma_overlap(iommu, dma->iova, &size, dma);
> > + if (WARN_ON(!size))
> > + break;
> > + }
> > +
> > + list_for_each_entry_safe(mdma, mdma_tmp, &iommu->msi_dma_list, next) {
> > + vfio_msi_unmap(iommu, mdma->iova);
> > + list_del(&mdma->next);
> > + kfree(mdma);
> > + }
> > +
> > + iommu_domain_free(iommu->domain);
> > + iommu->domain = NULL;
> > + kfree(iommu);
> > +}
> > +
> > +static long vfio_iommu_fsl_pamu_ioctl(void *iommu_data,
> > + unsigned int cmd, unsigned long arg) {
> > + struct vfio_iommu *iommu = iommu_data;
> > + unsigned long minsz;
> > +
> > + if (cmd == VFIO_CHECK_EXTENSION) {
> > + switch (arg) {
> > + case VFIO_FSL_PAMU_IOMMU:
> > + return 1;
> > + default:
> > + return 0;
> > + }
> > + } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > + struct vfio_iommu_type1_dma_map map;
> > + uint32_t mask = VFIO_DMA_MAP_FLAG_READ |
> > + VFIO_DMA_MAP_FLAG_WRITE;
> > +
> > + minsz = offsetofend(struct vfio_iommu_type1_dma_map, size);
> > +
> > + if (copy_from_user(&map, (void __user *)arg, minsz))
> > + return -EFAULT;
> > +
> > + if (map.argsz < minsz || map.flags & ~mask)
> > + return -EINVAL;
> > +
> > + return vfio_dma_do_map(iommu, &map);
> > +
> > + } else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> > + struct vfio_iommu_type1_dma_unmap unmap;
> > + long ret;
> > +
> > + minsz = offsetofend(struct vfio_iommu_type1_dma_unmap, size);
> > +
> > + if (copy_from_user(&unmap, (void __user *)arg, minsz))
> > + return -EFAULT;
> > +
> > + if (unmap.argsz < minsz || unmap.flags)
> > + return -EINVAL;
> > +
> > + ret = vfio_dma_do_unmap(iommu, &unmap);
> > + if (ret)
> > + return ret;
> > +
> > + return copy_to_user((void __user *)arg, &unmap, minsz);
> > + } else if (cmd == VFIO_IOMMU_PAMU_GET_ATTR) {
> > + struct vfio_pamu_attr pamu_attr;
> > +
> > + minsz = offsetofend(struct vfio_pamu_attr, attr_info);
> > + if (copy_from_user(&pamu_attr, (void __user *)arg, minsz))
> > + return -EFAULT;
> > +
> > + if (pamu_attr.argsz < minsz)
> > + return -EINVAL;
> > +
> > + vfio_handle_get_attr(iommu, &pamu_attr);
> > +
> > + copy_to_user((void __user *)arg, &pamu_attr, minsz);
> > + return 0;
> > + } else if (cmd == VFIO_IOMMU_PAMU_SET_ATTR) {
> > + struct vfio_pamu_attr pamu_attr;
> > +
> > + minsz = offsetofend(struct vfio_pamu_attr, attr_info);
> > + if (copy_from_user(&pamu_attr, (void __user *)arg, minsz))
> > + return -EFAULT;
> > +
> > + if (pamu_attr.argsz < minsz)
> > + return -EINVAL;
> > +
> > + vfio_handle_set_attr(iommu, &pamu_attr);
> > + return 0;
> > + } else if (cmd == VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT) {
> > + return msi_get_region_count();
> > + } else if (cmd == VFIO_IOMMU_PAMU_MAP_MSI_BANK) {
> > + struct vfio_pamu_msi_bank_map msi_map;
> > +
> > + minsz = offsetofend(struct vfio_pamu_msi_bank_map, iova);
> > + if (copy_from_user(&msi_map, (void __user *)arg, minsz))
> > + return -EFAULT;
> > +
> > + if (msi_map.argsz < minsz)
> > + return -EINVAL;
> > +
> > + vfio_do_msi_map(iommu, &msi_map);
> > + return 0;
> > + } else if (cmd == VFIO_IOMMU_PAMU_UNMAP_MSI_BANK) {
> > + struct vfio_pamu_msi_bank_unmap msi_unmap;
> > +
> > + minsz = offsetofend(struct vfio_pamu_msi_bank_unmap, iova);
> > + if (copy_from_user(&msi_unmap, (void __user *)arg, minsz))
> > + return -EFAULT;
> > +
> > + if (msi_unmap.argsz < minsz)
> > + return -EINVAL;
> > +
> > + vfio_do_msi_unmap(iommu, &msi_unmap);
> > + return 0;
> > +
> > + }
> > +
> > + return -ENOTTY;
> > +}
> > +
> > +static int vfio_iommu_fsl_pamu_attach_group(void *iommu_data,
> > + struct iommu_group *iommu_group) {
> > + struct vfio_iommu *iommu = iommu_data;
> > + struct vfio_group *group, *tmp;
> > + int ret;
> > +
> > + group = kzalloc(sizeof(*group), GFP_KERNEL);
> > + if (!group)
> > + return -ENOMEM;
> > +
> > + mutex_lock(&iommu->lock);
> > +
> > + list_for_each_entry(tmp, &iommu->group_list, next) {
> > + if (tmp->iommu_group == iommu_group) {
> > + mutex_unlock(&iommu->lock);
> > + kfree(group);
> > + return -EINVAL;
> > + }
> > + }
> > +
> > + ret = iommu_attach_group(iommu->domain, iommu_group);
> > + if (ret) {
> > + mutex_unlock(&iommu->lock);
> > + kfree(group);
> > + return ret;
> > + }
> > +
> > + group->iommu_group = iommu_group;
> > + list_add(&group->next, &iommu->group_list);
> > +
> > + mutex_unlock(&iommu->lock);
> > +
> > + return 0;
> > +}
> > +
> > +static void vfio_iommu_fsl_pamu_detach_group(void *iommu_data,
> > + struct iommu_group *iommu_group) {
> > + struct vfio_iommu *iommu = iommu_data;
> > + struct vfio_group *group;
> > +
> > + mutex_lock(&iommu->lock);
> > +
> > + list_for_each_entry(group, &iommu->group_list, next) {
> > + if (group->iommu_group == iommu_group) {
> > + iommu_detach_group(iommu->domain, iommu_group);
> > + list_del(&group->next);
> > + kfree(group);
> > + break;
> > + }
> > + }
> > +
> > + mutex_unlock(&iommu->lock);
> > +}
> > +
> > +static const struct vfio_iommu_driver_ops vfio_iommu_driver_ops_fsl_pamu = {
> > + .name = "vfio-iommu-fsl_pamu",
> > + .owner = THIS_MODULE,
> > + .open = vfio_iommu_fsl_pamu_open,
> > + .release = vfio_iommu_fsl_pamu_release,
> > + .ioctl = vfio_iommu_fsl_pamu_ioctl,
> > + .attach_group = vfio_iommu_fsl_pamu_attach_group,
> > + .detach_group = vfio_iommu_fsl_pamu_detach_group,
> > +};
> > +
> > +static int __init vfio_iommu_fsl_pamu_init(void) {
> > + if (!iommu_present(&pci_bus_type))
> > + return -ENODEV;
> > +
> > + return vfio_register_iommu_driver(&vfio_iommu_driver_ops_fsl_pamu);
> > +}
> > +
> > +static void __exit vfio_iommu_fsl_pamu_cleanup(void) {
> > + vfio_unregister_iommu_driver(&vfio_iommu_driver_ops_fsl_pamu);
> > +}
> > +
> > +module_init(vfio_iommu_fsl_pamu_init);
> > +module_exit(vfio_iommu_fsl_pamu_cleanup);
> > +
> > +MODULE_VERSION(DRIVER_VERSION);
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_AUTHOR(DRIVER_AUTHOR);
> > +MODULE_DESCRIPTION(DRIVER_DESC);
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 0fd47f5..d359055 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -23,6 +23,7 @@
> >
> > #define VFIO_TYPE1_IOMMU 1
> > #define VFIO_SPAPR_TCE_IOMMU 2
> > +#define VFIO_FSL_PAMU_IOMMU 3
> >
> > /*
> > * The IOCTL interface is designed for extensibility by embedding the
> > @@ -451,4 +452,103 @@ struct vfio_iommu_spapr_tce_info {
> >
> > /* *****************************************************************
> > */
> >
> > +/*********** APIs for VFIO_PAMU type only ****************/
> > +/*
> > + * VFIO_IOMMU_PAMU_GET_ATTR - _IO(VFIO_TYPE, VFIO_BASE + 17,
> > + * struct vfio_pamu_attr)
> > + *
> > + * Gets the iommu attributes for the current vfio container.
> > + * Caller sets argsz and attribute. The ioctl fills in
> > + * the provided struct vfio_pamu_attr based on the attribute
> > + * value that was set.
> > + * Return: 0 on success, -errno on failure */ struct vfio_pamu_attr
> > +{
> > + __u32 argsz;
> > + __u32 flags; /* no flags currently */
> > +#define VFIO_ATTR_GEOMETRY 0
> > +#define VFIO_ATTR_WINDOWS 1
> > +#define VFIO_ATTR_PAMU_STASH 2
> > + __u32 attribute;
> > +
> > + union {
> > + /* VFIO_ATTR_GEOMETRY */
> > + struct {
> > + /* first addr that can be mapped */
> > + __u64 aperture_start;
> > + /* last addr that can be mapped */
> > + __u64 aperture_end;
> > + } attr;
> > +
> > + /* VFIO_ATTR_WINDOWS */
> > + __u32 windows; /* number of windows in the aperture
> > + * initially this will be the max number
> > + * of windows that can be set
> > + */
> > + /* VFIO_ATTR_PAMU_STASH */
> > + struct {
> > + __u32 cpu; /* CPU number for stashing */
> > + __u32 cache; /* cache ID for stashing */
> > + } stash;
> > + } attr_info;
> > +};
> > +#define VFIO_IOMMU_PAMU_GET_ATTR _IO(VFIO_TYPE, VFIO_BASE + 17)
> > +
> > +/*
> > + * VFIO_IOMMU_PAMU_SET_ATTR - _IO(VFIO_TYPE, VFIO_BASE + 18,
> > + * struct vfio_pamu_attr)
> > + *
> > + * Sets the iommu attributes for the current vfio container.
> > + * Caller sets struct vfio_pamu attr, including argsz and attribute
> > +and
> > + * setting any fields that are valid for the attribute.
> > + * Return: 0 on success, -errno on failure */ #define
> > +VFIO_IOMMU_PAMU_SET_ATTR _IO(VFIO_TYPE, VFIO_BASE + 18)
> > +
> > +/*
> > + * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT - _IO(VFIO_TYPE, VFIO_BASE +
> > +19, __u32)
> > + *
> > + * Returns the number of MSI banks for this platform. This tells
> > +user space
> > + * how many aperture windows should be reserved for MSI banks when
> > +setting
> > + * the PAMU geometry and window count.
> > + * Return: __u32 bank count on success, -errno on failure */ #define
> > +VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT _IO(VFIO_TYPE, VFIO_BASE + 19)
> > +
> > +/*
> > + * VFIO_IOMMU_PAMU_MAP_MSI_BANK - _IO(VFIO_TYPE, VFIO_BASE + 20,
> > + * struct vfio_pamu_msi_bank_map)
> > + *
> > + * Maps the MSI bank at the specified index and iova. User space
> > +must
> > + * call this ioctl once for each MSI bank (count of banks is returned
> > +by
> > + * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT).
> > + * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
> > + * Return: 0 on success, -errno on failure */
> > +
> > +struct vfio_pamu_msi_bank_map {
> > + __u32 argsz;
> > + __u32 flags; /* no flags currently */
> > + __u32 msi_bank_index; /* the index of the MSI bank */
> > + __u64 iova; /* the iova the bank is to be mapped to */
> > +};
> > +#define VFIO_IOMMU_PAMU_MAP_MSI_BANK _IO(VFIO_TYPE, VFIO_BASE + 20)
> > +
> > +/*
> > + * VFIO_IOMMU_PAMU_UNMAP_MSI_BANK - _IO(VFIO_TYPE, VFIO_BASE + 21,
> > + * struct vfio_pamu_msi_bank_unmap)
> > + *
> > + * Unmaps the MSI bank at the specified iova.
> > + * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
> > + * Operates on VFIO file descriptor (/dev/vfio/vfio).
> > + * Return: 0 on success, -errno on failure */
> > +
> > +struct vfio_pamu_msi_bank_unmap {
> > + __u32 argsz;
> > + __u32 flags; /* no flags currently */
> > + __u64 iova; /* the iova to be unmapped to */
> > +};
> > +#define VFIO_IOMMU_PAMU_UNMAP_MSI_BANK _IO(VFIO_TYPE, VFIO_BASE +
> > +21)
> > +
> > #endif /* _UAPIVFIO_H */
>
>
>

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-09-26 22:20:55

by Scott Wood

[permalink] [raw]
Subject: Re: [PATCH 5/7] iommu: supress loff_t compilation error on powerpc

On Wed, 2013-09-25 at 22:53 -0500, Bhushan Bharat-R65777 wrote:
>
> > -----Original Message-----
> > From: Alex Williamson [mailto:[email protected]]
> > Sent: Wednesday, September 25, 2013 10:10 PM
> > To: Bhushan Bharat-R65777
> > Cc: [email protected]; [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> > foundation.org; Bhushan Bharat-R65777
> > Subject: Re: [PATCH 5/7] iommu: supress loff_t compilation error on powerpc
> >
> > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > > Signed-off-by: Bharat Bhushan <[email protected]>
> > > ---
> > > drivers/vfio/pci/vfio_pci_rdwr.c | 3 ++-
> > > 1 files changed, 2 insertions(+), 1 deletions(-)
> > >
> > > diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c
> > > b/drivers/vfio/pci/vfio_pci_rdwr.c
> > > index 210db24..8a8156a 100644
> > > --- a/drivers/vfio/pci/vfio_pci_rdwr.c
> > > +++ b/drivers/vfio/pci/vfio_pci_rdwr.c
> > > @@ -181,7 +181,8 @@ ssize_t vfio_pci_vga_rw(struct vfio_pci_device *vdev, char
> > __user *buf,
> > > size_t count, loff_t *ppos, bool iswrite) {
> > > int ret;
> > > - loff_t off, pos = *ppos & VFIO_PCI_OFFSET_MASK;
> > > + loff_t off;
> > > + u64 pos = (u64 )(*ppos & VFIO_PCI_OFFSET_MASK);
> > > void __iomem *iomem = NULL;
> > > unsigned int rsrc;
> > > bool is_ioport;
> >
> > What's the compile error that this fixes?
>
> I was getting below error; and after some googling I came to know that this is how it is fixed by other guys.
>
> /home/r65777/linux-vfio/drivers/vfio/pci/vfio_pci_rdwr.c:193: undefined reference to `__cmpdi2'
> /home/r65777/linux-vfio/drivers/vfio/pci/vfio_pci_rdwr.c:193: undefined reference to `__cmpdi2'

It looks like PPC Linux implements __ucmpdi2, but not the signed
version. That should be fixed, rather than hacking up random code to
avoid it.

-Scott


2013-10-04 05:21:09

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH 1/7] powerpc: Add interface to get msi region information



> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Bjorn Helgaas
> Sent: Wednesday, September 25, 2013 5:28 AM
> To: Bhushan Bharat-R65777
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; linuxppc-
> [email protected]; [email protected]; [email protected]; Wood Scott-
> B07421; [email protected]; Bhushan Bharat-R65777
> Subject: Re: [PATCH 1/7] powerpc: Add interface to get msi region information
>
> On Thu, Sep 19, 2013 at 12:59:17PM +0530, Bharat Bhushan wrote:
> > This patch adds interface to get following information
> > - Number of MSI regions (which is number of MSI banks for powerpc).
> > - Get the region address range: Physical page which have the
> > address/addresses used for generating MSI interrupt
> > and size of the page.
> >
> > These are required to create IOMMU (Freescale PAMU) mapping for
> > devices which are directly assigned using VFIO.
> >
> > Signed-off-by: Bharat Bhushan <[email protected]>
> > ---
> > arch/powerpc/include/asm/machdep.h | 8 +++++++
> > arch/powerpc/include/asm/pci.h | 2 +
> > arch/powerpc/kernel/msi.c | 18 ++++++++++++++++
> > arch/powerpc/sysdev/fsl_msi.c | 39 +++++++++++++++++++++++++++++++++--
> > arch/powerpc/sysdev/fsl_msi.h | 11 ++++++++-
> > drivers/pci/msi.c | 26 ++++++++++++++++++++++++
> > include/linux/msi.h | 8 +++++++
> > include/linux/pci.h | 13 ++++++++++++
> > 8 files changed, 120 insertions(+), 5 deletions(-)
> >
> > ...
>
> > diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c index
> > aca7578..6d85c15 100644
> > --- a/drivers/pci/msi.c
> > +++ b/drivers/pci/msi.c
> > @@ -30,6 +30,20 @@ static int pci_msi_enable = 1;
> >
> > /* Arch hooks */
> >
> > +#ifndef arch_msi_get_region_count
> > +int arch_msi_get_region_count(void)
> > +{
> > + return 0;
> > +}
> > +#endif
> > +
> > +#ifndef arch_msi_get_region
> > +int arch_msi_get_region(int region_num, struct msi_region *region) {
> > + return 0;
> > +}
> > +#endif
>
> This #define strategy is gone; see 4287d824 ("PCI: use weak functions for MSI
> arch-specific functions"). Please use the weak function strategy for your new
> MSI region functions.

ok

>
> > +
> > #ifndef arch_msi_check_device
> > int arch_msi_check_device(struct pci_dev *dev, int nvec, int type) {
> > @@ -903,6 +917,18 @@ void pci_disable_msi(struct pci_dev *dev) }
> > EXPORT_SYMBOL(pci_disable_msi);
> >
> > +int msi_get_region_count(void)
> > +{
> > + return arch_msi_get_region_count();
> > +}
> > +EXPORT_SYMBOL(msi_get_region_count);
> > +
> > +int msi_get_region(int region_num, struct msi_region *region) {
> > + return arch_msi_get_region(region_num, region); }
> > +EXPORT_SYMBOL(msi_get_region);
>
> Please split these interface additions, i.e., the drivers/pci/msi.c,
> include/linux/msi.h, and include/linux/pci.h changes, into a separate patch.

ok

>
> I don't know enough about VFIO to understand why these new interfaces are
> needed. Is this the first VFIO IOMMU driver? I see vfio_iommu_spapr_tce.c and
> vfio_iommu_type1.c but I don't know if they're comparable to the Freescale PAMU.
> Do other VFIO IOMMU implementations support MSI? If so, do they handle the
> problem of mapping the MSI regions in a different way?

PAMU is an aperture type of IOMMU while other are paging type, So they are completely different from what PAMU is and handle that differently.

>
> > /**
> > * pci_msix_table_size - return the number of device's MSI-X table entries
> > * @dev: pointer to the pci_dev data structure of MSI-X device
> > function diff --git a/include/linux/msi.h b/include/linux/msi.h index
> > ee66f3a..ae32601 100644
> > --- a/include/linux/msi.h
> > +++ b/include/linux/msi.h
> > @@ -50,6 +50,12 @@ struct msi_desc {
> > struct kobject kobj;
> > };
> >
> > +struct msi_region {
> > + int region_num;
> > + dma_addr_t addr;
> > + size_t size;
> > +};
>
> This needs some sort of explanatory comment.

Ok

-Bharat

>
> > /*
> > * The arch hook for setup up msi irqs
> > */
> > @@ -58,5 +64,7 @@ void arch_teardown_msi_irq(unsigned int irq); int
> > arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type); void
> > arch_teardown_msi_irqs(struct pci_dev *dev); int
> > arch_msi_check_device(struct pci_dev* dev, int nvec, int type);
> > +int arch_msi_get_region_count(void);
> > +int arch_msi_get_region(int region_num, struct msi_region *region);
> >
> > #endif /* LINUX_MSI_H */
> > diff --git a/include/linux/pci.h b/include/linux/pci.h index
> > 186540d..2b26a59 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -1126,6 +1126,7 @@ struct msix_entry {
> > u16 entry; /* driver uses to specify entry, OS writes */
> > };
> >
> > +struct msi_region;
> >
> > #ifndef CONFIG_PCI_MSI
> > static inline int pci_enable_msi_block(struct pci_dev *dev, unsigned
> > int nvec) @@ -1168,6 +1169,16 @@ static inline int
> > pci_msi_enabled(void) {
> > return 0;
> > }
> > +
> > +static inline int msi_get_region_count(void) {
> > + return 0;
> > +}
> > +
> > +static inline int msi_get_region(int region_num, struct msi_region
> > +*region) {
> > + return 0;
> > +}
> > #else
> > int pci_enable_msi_block(struct pci_dev *dev, unsigned int nvec);
> > int pci_enable_msi_block_auto(struct pci_dev *dev, unsigned int
> > *maxvec); @@ -1180,6 +1191,8 @@ void pci_disable_msix(struct pci_dev
> > *dev); void msi_remove_pci_irq_vectors(struct pci_dev *dev); void
> > pci_restore_msi_state(struct pci_dev *dev); int
> > pci_msi_enabled(void);
> > +int msi_get_region_count(void);
> > +int msi_get_region(int region_num, struct msi_region *region);
> > #endif
> >
> > #ifdef CONFIG_PCIEPORTBUS
> > --
> > 1.7.0.4
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pci"
> > in the body of a message to [email protected] More majordomo
> > info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body
> of a message to [email protected] More majordomo info at
> http://vger.kernel.org/majordomo-info.html

2013-10-04 09:54:32

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH 2/7] iommu: add api to get iommu_domain of a device



> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> On Behalf Of Alex Williamson
> Sent: Wednesday, September 25, 2013 10:16 PM
> To: Bhushan Bharat-R65777
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> foundation.org; Bhushan Bharat-R65777
> Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device
>
> On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > This api return the iommu domain to which the device is attached.
> > The iommu_domain is required for making API calls related to iommu.
> > Follow up patches which use this API to know iommu maping.
> >
> > Signed-off-by: Bharat Bhushan <[email protected]>
> > ---
> > drivers/iommu/iommu.c | 10 ++++++++++
> > include/linux/iommu.h | 7 +++++++
> > 2 files changed, 17 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index
> > fbe9ca7..6ac5f50 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -696,6 +696,16 @@ void iommu_detach_device(struct iommu_domain
> > *domain, struct device *dev) }
> > EXPORT_SYMBOL_GPL(iommu_detach_device);
> >
> > +struct iommu_domain *iommu_get_dev_domain(struct device *dev) {
> > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > +
> > + if (unlikely(ops == NULL || ops->get_dev_iommu_domain == NULL))
> > + return NULL;
> > +
> > + return ops->get_dev_iommu_domain(dev); }
> > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
>
> What prevents this from racing iommu_domain_free()? There's no references
> acquired, so there's no reason for the caller to assume the pointer is valid.

Sorry for late query, somehow this email went into a folder and escaped;

Just to be sure, there is not lock at generic "struct iommu_domain", but IP specific structure (link FSL domain) linked in iommu_domain->priv have a lock, so we need to ensure this race in FSL iommu code (say drivers/iommu/fsl_pamu_domain.c), right?

Thanks
-Bharat

>
> > /*
> > * IOMMU groups are really the natrual working unit of the IOMMU, but
> > * the IOMMU API works on domains and devices. Bridge that gap by
> > diff --git a/include/linux/iommu.h b/include/linux/iommu.h index
> > 7ea319e..fa046bd 100644
> > --- a/include/linux/iommu.h
> > +++ b/include/linux/iommu.h
> > @@ -127,6 +127,7 @@ struct iommu_ops {
> > int (*domain_set_windows)(struct iommu_domain *domain, u32 w_count);
> > /* Get the numer of window per domain */
> > u32 (*domain_get_windows)(struct iommu_domain *domain);
> > + struct iommu_domain *(*get_dev_iommu_domain)(struct device *dev);
> >
> > unsigned long pgsize_bitmap;
> > };
> > @@ -190,6 +191,7 @@ extern int iommu_domain_window_enable(struct iommu_domain
> *domain, u32 wnd_nr,
> > phys_addr_t offset, u64 size,
> > int prot);
> > extern void iommu_domain_window_disable(struct iommu_domain *domain,
> > u32 wnd_nr);
> > +extern struct iommu_domain *iommu_get_dev_domain(struct device *dev);
> > /**
> > * report_iommu_fault() - report about an IOMMU fault to the IOMMU framework
> > * @domain: the iommu domain where the fault has happened @@ -284,6
> > +286,11 @@ static inline void iommu_domain_window_disable(struct
> > iommu_domain *domain, { }
> >
> > +static inline struct iommu_domain *iommu_get_dev_domain(struct device
> > +*dev) {
> > + return NULL;
> > +}
> > +
> > static inline phys_addr_t iommu_iova_to_phys(struct iommu_domain
> > *domain, dma_addr_t iova) {
> > return 0;
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body
> of a message to [email protected] More majordomo info at
> http://vger.kernel.org/majordomo-info.html

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-10-04 10:42:30

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH 2/7] iommu: add api to get iommu_domain of a device



> -----Original Message-----
> From: Bhushan Bharat-R65777
> Sent: Friday, October 04, 2013 3:24 PM
> To: 'Alex Williamson'
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> foundation.org
> Subject: RE: [PATCH 2/7] iommu: add api to get iommu_domain of a device
>
>
>
> > -----Original Message-----
> > From: [email protected]
> > [mailto:[email protected]]
> > On Behalf Of Alex Williamson
> > Sent: Wednesday, September 25, 2013 10:16 PM
> > To: Bhushan Bharat-R65777
> > Cc: [email protected]; [email protected];
> > [email protected]; linux- [email protected];
> > [email protected]; linux- [email protected];
> > [email protected]; Wood Scott-B07421; [email protected] foundation.org;
> > Bhushan Bharat-R65777
> > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > device
> >
> > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > > This api return the iommu domain to which the device is attached.
> > > The iommu_domain is required for making API calls related to iommu.
> > > Follow up patches which use this API to know iommu maping.
> > >
> > > Signed-off-by: Bharat Bhushan <[email protected]>
> > > ---
> > > drivers/iommu/iommu.c | 10 ++++++++++
> > > include/linux/iommu.h | 7 +++++++
> > > 2 files changed, 17 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index
> > > fbe9ca7..6ac5f50 100644
> > > --- a/drivers/iommu/iommu.c
> > > +++ b/drivers/iommu/iommu.c
> > > @@ -696,6 +696,16 @@ void iommu_detach_device(struct iommu_domain
> > > *domain, struct device *dev) }
> > > EXPORT_SYMBOL_GPL(iommu_detach_device);
> > >
> > > +struct iommu_domain *iommu_get_dev_domain(struct device *dev) {
> > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > +
> > > + if (unlikely(ops == NULL || ops->get_dev_iommu_domain == NULL))
> > > + return NULL;
> > > +
> > > + return ops->get_dev_iommu_domain(dev); }
> > > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
> >
> > What prevents this from racing iommu_domain_free()? There's no
> > references acquired, so there's no reason for the caller to assume the pointer
> is valid.
>
> Sorry for late query, somehow this email went into a folder and escaped;
>
> Just to be sure, there is not lock at generic "struct iommu_domain", but IP
> specific structure (link FSL domain) linked in iommu_domain->priv have a lock,
> so we need to ensure this race in FSL iommu code (say
> drivers/iommu/fsl_pamu_domain.c), right?

Further thinking of this, there are more problems here:
- Like MSI subsystem will call iommu_get_dev_domain(), which will take a lock, find the domain pointer, release the lock, and return the domain
- Now if domain in freed up
- While MSI subsystem tries to do work on domain (like get_attribute/set_attribute etc) ???

So can we do like iommu_get_dev_domain() will return domain with the lock held, and iommu_put_dev_domain() will release the lock? And iommu_get_dev_domain() must always be followed by iommu_get_dev_domain()

Thanks
-Bharat

>
> Thanks
> -Bharat
>
> >
> > > /*
> > > * IOMMU groups are really the natrual working unit of the IOMMU, but
> > > * the IOMMU API works on domains and devices. Bridge that gap by
> > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h index
> > > 7ea319e..fa046bd 100644
> > > --- a/include/linux/iommu.h
> > > +++ b/include/linux/iommu.h
> > > @@ -127,6 +127,7 @@ struct iommu_ops {
> > > int (*domain_set_windows)(struct iommu_domain *domain, u32 w_count);
> > > /* Get the numer of window per domain */
> > > u32 (*domain_get_windows)(struct iommu_domain *domain);
> > > + struct iommu_domain *(*get_dev_iommu_domain)(struct device *dev);
> > >
> > > unsigned long pgsize_bitmap;
> > > };
> > > @@ -190,6 +191,7 @@ extern int iommu_domain_window_enable(struct
> > > iommu_domain
> > *domain, u32 wnd_nr,
> > > phys_addr_t offset, u64 size,
> > > int prot);
> > > extern void iommu_domain_window_disable(struct iommu_domain
> > > *domain,
> > > u32 wnd_nr);
> > > +extern struct iommu_domain *iommu_get_dev_domain(struct device
> > > +*dev);
> > > /**
> > > * report_iommu_fault() - report about an IOMMU fault to the IOMMU
> framework
> > > * @domain: the iommu domain where the fault has happened @@ -284,6
> > > +286,11 @@ static inline void iommu_domain_window_disable(struct
> > > iommu_domain *domain, { }
> > >
> > > +static inline struct iommu_domain *iommu_get_dev_domain(struct
> > > +device
> > > +*dev) {
> > > + return NULL;
> > > +}
> > > +
> > > static inline phys_addr_t iommu_iova_to_phys(struct iommu_domain
> > > *domain, dma_addr_t iova) {
> > > return 0;
> >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pci"
> > in the body of a message to [email protected] More majordomo
> > info at http://vger.kernel.org/majordomo-info.html

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-10-04 15:45:23

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device

On Fri, 2013-10-04 at 09:54 +0000, Bhushan Bharat-R65777 wrote:
>
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > On Behalf Of Alex Williamson
> > Sent: Wednesday, September 25, 2013 10:16 PM
> > To: Bhushan Bharat-R65777
> > Cc: [email protected]; [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> > foundation.org; Bhushan Bharat-R65777
> > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device
> >
> > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > > This api return the iommu domain to which the device is attached.
> > > The iommu_domain is required for making API calls related to iommu.
> > > Follow up patches which use this API to know iommu maping.
> > >
> > > Signed-off-by: Bharat Bhushan <[email protected]>
> > > ---
> > > drivers/iommu/iommu.c | 10 ++++++++++
> > > include/linux/iommu.h | 7 +++++++
> > > 2 files changed, 17 insertions(+), 0 deletions(-)
> > >
> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index
> > > fbe9ca7..6ac5f50 100644
> > > --- a/drivers/iommu/iommu.c
> > > +++ b/drivers/iommu/iommu.c
> > > @@ -696,6 +696,16 @@ void iommu_detach_device(struct iommu_domain
> > > *domain, struct device *dev) }
> > > EXPORT_SYMBOL_GPL(iommu_detach_device);
> > >
> > > +struct iommu_domain *iommu_get_dev_domain(struct device *dev) {
> > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > +
> > > + if (unlikely(ops == NULL || ops->get_dev_iommu_domain == NULL))
> > > + return NULL;
> > > +
> > > + return ops->get_dev_iommu_domain(dev); }
> > > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
> >
> > What prevents this from racing iommu_domain_free()? There's no references
> > acquired, so there's no reason for the caller to assume the pointer is valid.
>
> Sorry for late query, somehow this email went into a folder and escaped;
>
> Just to be sure, there is not lock at generic "struct iommu_domain", but IP specific structure (link FSL domain) linked in iommu_domain->priv have a lock, so we need to ensure this race in FSL iommu code (say drivers/iommu/fsl_pamu_domain.c), right?

No, it's not sufficient to make sure that your use of the interface is
race free. The interface itself needs to be designed so that it's
difficult to use incorrectly. That's not the case here. This is a
backdoor to get the iommu domain from the iommu driver regardless of who
is using it or how. The iommu domain is created and managed by vfio, so
shouldn't we be looking at how to do this through vfio? It seems like
you'd want to use your device to get a vfio group reference, from which
you could do something with the vfio external user interface and get the
iommu domain reference. Thanks,

Alex

> > > /*
> > > * IOMMU groups are really the natrual working unit of the IOMMU, but
> > > * the IOMMU API works on domains and devices. Bridge that gap by
> > > diff --git a/include/linux/iommu.h b/include/linux/iommu.h index
> > > 7ea319e..fa046bd 100644
> > > --- a/include/linux/iommu.h
> > > +++ b/include/linux/iommu.h
> > > @@ -127,6 +127,7 @@ struct iommu_ops {
> > > int (*domain_set_windows)(struct iommu_domain *domain, u32 w_count);
> > > /* Get the numer of window per domain */
> > > u32 (*domain_get_windows)(struct iommu_domain *domain);
> > > + struct iommu_domain *(*get_dev_iommu_domain)(struct device *dev);
> > >
> > > unsigned long pgsize_bitmap;
> > > };
> > > @@ -190,6 +191,7 @@ extern int iommu_domain_window_enable(struct iommu_domain
> > *domain, u32 wnd_nr,
> > > phys_addr_t offset, u64 size,
> > > int prot);
> > > extern void iommu_domain_window_disable(struct iommu_domain *domain,
> > > u32 wnd_nr);
> > > +extern struct iommu_domain *iommu_get_dev_domain(struct device *dev);
> > > /**
> > > * report_iommu_fault() - report about an IOMMU fault to the IOMMU framework
> > > * @domain: the iommu domain where the fault has happened @@ -284,6
> > > +286,11 @@ static inline void iommu_domain_window_disable(struct
> > > iommu_domain *domain, { }
> > >
> > > +static inline struct iommu_domain *iommu_get_dev_domain(struct device
> > > +*dev) {
> > > + return NULL;
> > > +}
> > > +
> > > static inline phys_addr_t iommu_iova_to_phys(struct iommu_domain
> > > *domain, dma_addr_t iova) {
> > > return 0;
> >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-pci" in the body
> > of a message to [email protected] More majordomo info at
> > http://vger.kernel.org/majordomo-info.html
>


2013-10-04 16:49:42

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH 2/7] iommu: add api to get iommu_domain of a device



> -----Original Message-----
> From: Alex Williamson [mailto:[email protected]]
> Sent: Friday, October 04, 2013 9:15 PM
> To: Bhushan Bharat-R65777
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> foundation.org
> Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device
>
> On Fri, 2013-10-04 at 09:54 +0000, Bhushan Bharat-R65777 wrote:
> >
> > > -----Original Message-----
> > > From: [email protected]
> > > [mailto:[email protected]]
> > > On Behalf Of Alex Williamson
> > > Sent: Wednesday, September 25, 2013 10:16 PM
> > > To: Bhushan Bharat-R65777
> > > Cc: [email protected]; [email protected];
> > > [email protected]; linux- [email protected];
> > > [email protected]; linux- [email protected];
> > > [email protected]; Wood Scott-B07421; [email protected] foundation.org;
> > > Bhushan Bharat-R65777
> > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > device
> > >
> > > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > > > This api return the iommu domain to which the device is attached.
> > > > The iommu_domain is required for making API calls related to iommu.
> > > > Follow up patches which use this API to know iommu maping.
> > > >
> > > > Signed-off-by: Bharat Bhushan <[email protected]>
> > > > ---
> > > > drivers/iommu/iommu.c | 10 ++++++++++
> > > > include/linux/iommu.h | 7 +++++++
> > > > 2 files changed, 17 insertions(+), 0 deletions(-)
> > > >
> > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index
> > > > fbe9ca7..6ac5f50 100644
> > > > --- a/drivers/iommu/iommu.c
> > > > +++ b/drivers/iommu/iommu.c
> > > > @@ -696,6 +696,16 @@ void iommu_detach_device(struct iommu_domain
> > > > *domain, struct device *dev) }
> > > > EXPORT_SYMBOL_GPL(iommu_detach_device);
> > > >
> > > > +struct iommu_domain *iommu_get_dev_domain(struct device *dev) {
> > > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > +
> > > > + if (unlikely(ops == NULL || ops->get_dev_iommu_domain == NULL))
> > > > + return NULL;
> > > > +
> > > > + return ops->get_dev_iommu_domain(dev); }
> > > > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
> > >
> > > What prevents this from racing iommu_domain_free()? There's no
> > > references acquired, so there's no reason for the caller to assume the
> pointer is valid.
> >
> > Sorry for late query, somehow this email went into a folder and
> > escaped;
> >
> > Just to be sure, there is not lock at generic "struct iommu_domain", but IP
> specific structure (link FSL domain) linked in iommu_domain->priv have a lock,
> so we need to ensure this race in FSL iommu code (say
> drivers/iommu/fsl_pamu_domain.c), right?
>
> No, it's not sufficient to make sure that your use of the interface is race
> free. The interface itself needs to be designed so that it's difficult to use
> incorrectly.

So we can define iommu_get_dev_domain()/iommu_put_dev_domain(); iommu_get_dev_domain() will return domain with the lock held, and iommu_put_dev_domain() will release the lock? And iommu_get_dev_domain() must always be followed by iommu_get_dev_domain().


> That's not the case here. This is a backdoor to get the iommu
> domain from the iommu driver regardless of who is using it or how. The iommu
> domain is created and managed by vfio, so shouldn't we be looking at how to do
> this through vfio?

Let me first describe what we are doing here:
During initialization:-
- vfio talks to MSI system to know the MSI-page and size
- vfio then interacts with iommu to map the MSI-page in iommu (IOVA is decided by userspace and physical address is the MSI-page)
- So the IOVA subwindow mapping is created in iommu and yes VFIO know about this mapping.

Now do SET_IRQ(MSI/MSIX) ioctl:
- calls pci_enable_msix()/pci_enable_msi_block(): which is supposed to set MSI address/data in device.
- So in current implementation (this patchset) msi-subsystem gets the IOVA from iommu via this defined interface.
- Are you saying that rather than getting this from iommu, we should get this from vfio? What difference does this make?

Thanks
-Bharat

> It seems like you'd want to use your device to get a vfio
> group reference, from which you could do something with the vfio external user
> interface and get the iommu domain reference. Thanks,
>
> Alex
>
> > > > /*
> > > > * IOMMU groups are really the natrual working unit of the IOMMU, but
> > > > * the IOMMU API works on domains and devices. Bridge that gap
> > > > by diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > > index 7ea319e..fa046bd 100644
> > > > --- a/include/linux/iommu.h
> > > > +++ b/include/linux/iommu.h
> > > > @@ -127,6 +127,7 @@ struct iommu_ops {
> > > > int (*domain_set_windows)(struct iommu_domain *domain, u32
> w_count);
> > > > /* Get the numer of window per domain */
> > > > u32 (*domain_get_windows)(struct iommu_domain *domain);
> > > > + struct iommu_domain *(*get_dev_iommu_domain)(struct device
> > > > +*dev);
> > > >
> > > > unsigned long pgsize_bitmap;
> > > > };
> > > > @@ -190,6 +191,7 @@ extern int iommu_domain_window_enable(struct
> > > > iommu_domain
> > > *domain, u32 wnd_nr,
> > > > phys_addr_t offset, u64 size,
> > > > int prot);
> > > > extern void iommu_domain_window_disable(struct iommu_domain
> > > > *domain,
> > > > u32 wnd_nr);
> > > > +extern struct iommu_domain *iommu_get_dev_domain(struct device
> > > > +*dev);
> > > > /**
> > > > * report_iommu_fault() - report about an IOMMU fault to the IOMMU
> framework
> > > > * @domain: the iommu domain where the fault has happened @@
> > > > -284,6
> > > > +286,11 @@ static inline void iommu_domain_window_disable(struct
> > > > iommu_domain *domain, { }
> > > >
> > > > +static inline struct iommu_domain *iommu_get_dev_domain(struct
> > > > +device
> > > > +*dev) {
> > > > + return NULL;
> > > > +}
> > > > +
> > > > static inline phys_addr_t iommu_iova_to_phys(struct iommu_domain
> > > > *domain, dma_addr_t iova) {
> > > > return 0;
> > >
> > >
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-pci"
> > > in the body of a message to [email protected] More majordomo
> > > info at http://vger.kernel.org/majordomo-info.html
> >
>
>
>

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-10-04 17:13:08

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device

On Fri, 2013-10-04 at 16:47 +0000, Bhushan Bharat-R65777 wrote:
>
> > -----Original Message-----
> > From: Alex Williamson [mailto:[email protected]]
> > Sent: Friday, October 04, 2013 9:15 PM
> > To: Bhushan Bharat-R65777
> > Cc: [email protected]; [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> > foundation.org
> > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device
> >
> > On Fri, 2013-10-04 at 09:54 +0000, Bhushan Bharat-R65777 wrote:
> > >
> > > > -----Original Message-----
> > > > From: [email protected]
> > > > [mailto:[email protected]]
> > > > On Behalf Of Alex Williamson
> > > > Sent: Wednesday, September 25, 2013 10:16 PM
> > > > To: Bhushan Bharat-R65777
> > > > Cc: [email protected]; [email protected];
> > > > [email protected]; linux- [email protected];
> > > > [email protected]; linux- [email protected];
> > > > [email protected]; Wood Scott-B07421; [email protected] foundation.org;
> > > > Bhushan Bharat-R65777
> > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > device
> > > >
> > > > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > > > > This api return the iommu domain to which the device is attached.
> > > > > The iommu_domain is required for making API calls related to iommu.
> > > > > Follow up patches which use this API to know iommu maping.
> > > > >
> > > > > Signed-off-by: Bharat Bhushan <[email protected]>
> > > > > ---
> > > > > drivers/iommu/iommu.c | 10 ++++++++++
> > > > > include/linux/iommu.h | 7 +++++++
> > > > > 2 files changed, 17 insertions(+), 0 deletions(-)
> > > > >
> > > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index
> > > > > fbe9ca7..6ac5f50 100644
> > > > > --- a/drivers/iommu/iommu.c
> > > > > +++ b/drivers/iommu/iommu.c
> > > > > @@ -696,6 +696,16 @@ void iommu_detach_device(struct iommu_domain
> > > > > *domain, struct device *dev) }
> > > > > EXPORT_SYMBOL_GPL(iommu_detach_device);
> > > > >
> > > > > +struct iommu_domain *iommu_get_dev_domain(struct device *dev) {
> > > > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > > +
> > > > > + if (unlikely(ops == NULL || ops->get_dev_iommu_domain == NULL))
> > > > > + return NULL;
> > > > > +
> > > > > + return ops->get_dev_iommu_domain(dev); }
> > > > > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
> > > >
> > > > What prevents this from racing iommu_domain_free()? There's no
> > > > references acquired, so there's no reason for the caller to assume the
> > pointer is valid.
> > >
> > > Sorry for late query, somehow this email went into a folder and
> > > escaped;
> > >
> > > Just to be sure, there is not lock at generic "struct iommu_domain", but IP
> > specific structure (link FSL domain) linked in iommu_domain->priv have a lock,
> > so we need to ensure this race in FSL iommu code (say
> > drivers/iommu/fsl_pamu_domain.c), right?
> >
> > No, it's not sufficient to make sure that your use of the interface is race
> > free. The interface itself needs to be designed so that it's difficult to use
> > incorrectly.
>
> So we can define iommu_get_dev_domain()/iommu_put_dev_domain();
> iommu_get_dev_domain() will return domain with the lock held, and
> iommu_put_dev_domain() will release the lock? And
> iommu_get_dev_domain() must always be followed by
> iommu_get_dev_domain().

What lock? get/put are generally used for reference counting, not
locking in the kernel.

> > That's not the case here. This is a backdoor to get the iommu
> > domain from the iommu driver regardless of who is using it or how. The iommu
> > domain is created and managed by vfio, so shouldn't we be looking at how to do
> > this through vfio?
>
> Let me first describe what we are doing here:
> During initialization:-
> - vfio talks to MSI system to know the MSI-page and size
> - vfio then interacts with iommu to map the MSI-page in iommu (IOVA is decided by userspace and physical address is the MSI-page)
> - So the IOVA subwindow mapping is created in iommu and yes VFIO know about this mapping.
>
> Now do SET_IRQ(MSI/MSIX) ioctl:
> - calls pci_enable_msix()/pci_enable_msi_block(): which is supposed to set MSI address/data in device.
> - So in current implementation (this patchset) msi-subsystem gets the IOVA from iommu via this defined interface.
> - Are you saying that rather than getting this from iommu, we should get this from vfio? What difference does this make?

Yes, you just said above that vfio knows the msi to iova mapping, so why
go outside of vfio to find it later? The difference is one case you can
have a proper reference to data structures to make sure the pointer you
get back actually has meaning at the time you're using it vs the code
here where you're defining an API that returns a meaningless value
because you can't check or enforce that an arbitrary caller is using it
correctly. It's not maintainable. Thanks,

Alex

> > It seems like you'd want to use your device to get a vfio
> > group reference, from which you could do something with the vfio external user
> > interface and get the iommu domain reference. Thanks,
> >
> > Alex
> >
> > > > > /*
> > > > > * IOMMU groups are really the natrual working unit of the IOMMU, but
> > > > > * the IOMMU API works on domains and devices. Bridge that gap
> > > > > by diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> > > > > index 7ea319e..fa046bd 100644
> > > > > --- a/include/linux/iommu.h
> > > > > +++ b/include/linux/iommu.h
> > > > > @@ -127,6 +127,7 @@ struct iommu_ops {
> > > > > int (*domain_set_windows)(struct iommu_domain *domain, u32
> > w_count);
> > > > > /* Get the numer of window per domain */
> > > > > u32 (*domain_get_windows)(struct iommu_domain *domain);
> > > > > + struct iommu_domain *(*get_dev_iommu_domain)(struct device
> > > > > +*dev);
> > > > >
> > > > > unsigned long pgsize_bitmap;
> > > > > };
> > > > > @@ -190,6 +191,7 @@ extern int iommu_domain_window_enable(struct
> > > > > iommu_domain
> > > > *domain, u32 wnd_nr,
> > > > > phys_addr_t offset, u64 size,
> > > > > int prot);
> > > > > extern void iommu_domain_window_disable(struct iommu_domain
> > > > > *domain,
> > > > > u32 wnd_nr);
> > > > > +extern struct iommu_domain *iommu_get_dev_domain(struct device
> > > > > +*dev);
> > > > > /**
> > > > > * report_iommu_fault() - report about an IOMMU fault to the IOMMU
> > framework
> > > > > * @domain: the iommu domain where the fault has happened @@
> > > > > -284,6
> > > > > +286,11 @@ static inline void iommu_domain_window_disable(struct
> > > > > iommu_domain *domain, { }
> > > > >
> > > > > +static inline struct iommu_domain *iommu_get_dev_domain(struct
> > > > > +device
> > > > > +*dev) {
> > > > > + return NULL;
> > > > > +}
> > > > > +
> > > > > static inline phys_addr_t iommu_iova_to_phys(struct iommu_domain
> > > > > *domain, dma_addr_t iova) {
> > > > > return 0;
> > > >
> > > >
> > > >
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe linux-pci"
> > > > in the body of a message to [email protected] More majordomo
> > > > info at http://vger.kernel.org/majordomo-info.html
> > >
> >
> >
> >
>


2013-10-04 17:23:38

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH 2/7] iommu: add api to get iommu_domain of a device



> -----Original Message-----
> From: Alex Williamson [mailto:[email protected]]
> Sent: Friday, October 04, 2013 10:43 PM
> To: Bhushan Bharat-R65777
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> foundation.org
> Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device
>
> On Fri, 2013-10-04 at 16:47 +0000, Bhushan Bharat-R65777 wrote:
> >
> > > -----Original Message-----
> > > From: Alex Williamson [mailto:[email protected]]
> > > Sent: Friday, October 04, 2013 9:15 PM
> > > To: Bhushan Bharat-R65777
> > > Cc: [email protected]; [email protected];
> > > [email protected]; linux- [email protected];
> > > [email protected]; linux- [email protected];
> > > [email protected]; Wood Scott-B07421; [email protected] foundation.org
> > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > device
> > >
> > > On Fri, 2013-10-04 at 09:54 +0000, Bhushan Bharat-R65777 wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: [email protected]
> > > > > [mailto:[email protected]]
> > > > > On Behalf Of Alex Williamson
> > > > > Sent: Wednesday, September 25, 2013 10:16 PM
> > > > > To: Bhushan Bharat-R65777
> > > > > Cc: [email protected]; [email protected];
> > > > > [email protected]; linux- [email protected];
> > > > > [email protected]; linux- [email protected];
> > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > foundation.org; Bhushan Bharat-R65777
> > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > > device
> > > > >
> > > > > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > > > > > This api return the iommu domain to which the device is attached.
> > > > > > The iommu_domain is required for making API calls related to iommu.
> > > > > > Follow up patches which use this API to know iommu maping.
> > > > > >
> > > > > > Signed-off-by: Bharat Bhushan <[email protected]>
> > > > > > ---
> > > > > > drivers/iommu/iommu.c | 10 ++++++++++
> > > > > > include/linux/iommu.h | 7 +++++++
> > > > > > 2 files changed, 17 insertions(+), 0 deletions(-)
> > > > > >
> > > > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > > > > index
> > > > > > fbe9ca7..6ac5f50 100644
> > > > > > --- a/drivers/iommu/iommu.c
> > > > > > +++ b/drivers/iommu/iommu.c
> > > > > > @@ -696,6 +696,16 @@ void iommu_detach_device(struct
> > > > > > iommu_domain *domain, struct device *dev) }
> > > > > > EXPORT_SYMBOL_GPL(iommu_detach_device);
> > > > > >
> > > > > > +struct iommu_domain *iommu_get_dev_domain(struct device *dev) {
> > > > > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > > > +
> > > > > > + if (unlikely(ops == NULL || ops->get_dev_iommu_domain == NULL))
> > > > > > + return NULL;
> > > > > > +
> > > > > > + return ops->get_dev_iommu_domain(dev); }
> > > > > > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
> > > > >
> > > > > What prevents this from racing iommu_domain_free()? There's no
> > > > > references acquired, so there's no reason for the caller to
> > > > > assume the
> > > pointer is valid.
> > > >
> > > > Sorry for late query, somehow this email went into a folder and
> > > > escaped;
> > > >
> > > > Just to be sure, there is not lock at generic "struct
> > > > iommu_domain", but IP
> > > specific structure (link FSL domain) linked in iommu_domain->priv
> > > have a lock, so we need to ensure this race in FSL iommu code (say
> > > drivers/iommu/fsl_pamu_domain.c), right?
> > >
> > > No, it's not sufficient to make sure that your use of the interface
> > > is race free. The interface itself needs to be designed so that
> > > it's difficult to use incorrectly.
> >
> > So we can define iommu_get_dev_domain()/iommu_put_dev_domain();
> > iommu_get_dev_domain() will return domain with the lock held, and
> > iommu_put_dev_domain() will release the lock? And
> > iommu_get_dev_domain() must always be followed by
> > iommu_get_dev_domain().
>
> What lock? get/put are generally used for reference counting, not locking in
> the kernel.
>
> > > That's not the case here. This is a backdoor to get the iommu
> > > domain from the iommu driver regardless of who is using it or how.
> > > The iommu domain is created and managed by vfio, so shouldn't we be
> > > looking at how to do this through vfio?
> >
> > Let me first describe what we are doing here:
> > During initialization:-
> > - vfio talks to MSI system to know the MSI-page and size
> > - vfio then interacts with iommu to map the MSI-page in iommu (IOVA
> > is decided by userspace and physical address is the MSI-page)
> > - So the IOVA subwindow mapping is created in iommu and yes VFIO know about
> this mapping.
> >
> > Now do SET_IRQ(MSI/MSIX) ioctl:
> > - calls pci_enable_msix()/pci_enable_msi_block(): which is supposed to set
> MSI address/data in device.
> > - So in current implementation (this patchset) msi-subsystem gets the IOVA
> from iommu via this defined interface.
> > - Are you saying that rather than getting this from iommu, we should get this
> from vfio? What difference does this make?
>
> Yes, you just said above that vfio knows the msi to iova mapping, so why go
> outside of vfio to find it later? The difference is one case you can have a
> proper reference to data structures to make sure the pointer you get back
> actually has meaning at the time you're using it vs the code here where you're
> defining an API that returns a meaningless value

With FSL-PAMU we will always get consistant data from iommu or vfio-data structure.

> because you can't check or
> enforce that an arbitrary caller is using it correctly.

I am not sure what is arbitrary caller? pdev is known to vfio, so vfio will only make pci_enable_msix()/pci_enable_msi_block() for this pdev. If anyother code makes then it is some other unexpectedly thing happening in system, no?


> It's not maintainable.
> Thanks,

I do not have any issue with this as well, can you also describe the type of API you are envisioning;
I can think of defining some function in vfio.c/vfio_iommu*.c, make them global and declare then in include/Linux/vfio.h
And include <Linux/vfio.h> in caller file (arch/powerpc/kernel/msi.c)

Thanks
-Bharat
>
> Alex
>
> > > It seems like you'd want to use your device to get a vfio group
> > > reference, from which you could do something with the vfio external
> > > user interface and get the iommu domain reference. Thanks,
> > >
> > > Alex
> > >
> > > > > > /*
> > > > > > * IOMMU groups are really the natrual working unit of the IOMMU, but
> > > > > > * the IOMMU API works on domains and devices. Bridge that
> > > > > > gap by diff --git a/include/linux/iommu.h
> > > > > > b/include/linux/iommu.h index 7ea319e..fa046bd 100644
> > > > > > --- a/include/linux/iommu.h
> > > > > > +++ b/include/linux/iommu.h
> > > > > > @@ -127,6 +127,7 @@ struct iommu_ops {
> > > > > > int (*domain_set_windows)(struct iommu_domain *domain, u32
> > > w_count);
> > > > > > /* Get the numer of window per domain */
> > > > > > u32 (*domain_get_windows)(struct iommu_domain *domain);
> > > > > > + struct iommu_domain *(*get_dev_iommu_domain)(struct device
> > > > > > +*dev);
> > > > > >
> > > > > > unsigned long pgsize_bitmap; }; @@ -190,6 +191,7 @@ extern
> > > > > > int iommu_domain_window_enable(struct iommu_domain
> > > > > *domain, u32 wnd_nr,
> > > > > > phys_addr_t offset, u64 size,
> > > > > > int prot);
> > > > > > extern void iommu_domain_window_disable(struct iommu_domain
> > > > > > *domain,
> > > > > > u32 wnd_nr);
> > > > > > +extern struct iommu_domain *iommu_get_dev_domain(struct
> > > > > > +device *dev);
> > > > > > /**
> > > > > > * report_iommu_fault() - report about an IOMMU fault to the
> > > > > > IOMMU
> > > framework
> > > > > > * @domain: the iommu domain where the fault has happened @@
> > > > > > -284,6
> > > > > > +286,11 @@ static inline void
> > > > > > +iommu_domain_window_disable(struct
> > > > > > iommu_domain *domain, { }
> > > > > >
> > > > > > +static inline struct iommu_domain
> > > > > > +*iommu_get_dev_domain(struct device
> > > > > > +*dev) {
> > > > > > + return NULL;
> > > > > > +}
> > > > > > +
> > > > > > static inline phys_addr_t iommu_iova_to_phys(struct
> > > > > > iommu_domain *domain, dma_addr_t iova) {
> > > > > > return 0;
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe linux-pci"
> > > > > in the body of a message to [email protected] More
> > > > > majordomo info at http://vger.kernel.org/majordomo-info.html
> > > >
> > >
> > >
> > >
> >
>
>
>

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-10-04 18:12:17

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device

On Fri, 2013-10-04 at 17:23 +0000, Bhushan Bharat-R65777 wrote:
>
> > -----Original Message-----
> > From: Alex Williamson [mailto:[email protected]]
> > Sent: Friday, October 04, 2013 10:43 PM
> > To: Bhushan Bharat-R65777
> > Cc: [email protected]; [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> > foundation.org
> > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device
> >
> > On Fri, 2013-10-04 at 16:47 +0000, Bhushan Bharat-R65777 wrote:
> > >
> > > > -----Original Message-----
> > > > From: Alex Williamson [mailto:[email protected]]
> > > > Sent: Friday, October 04, 2013 9:15 PM
> > > > To: Bhushan Bharat-R65777
> > > > Cc: [email protected]; [email protected];
> > > > [email protected]; linux- [email protected];
> > > > [email protected]; linux- [email protected];
> > > > [email protected]; Wood Scott-B07421; [email protected] foundation.org
> > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > device
> > > >
> > > > On Fri, 2013-10-04 at 09:54 +0000, Bhushan Bharat-R65777 wrote:
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: [email protected]
> > > > > > [mailto:[email protected]]
> > > > > > On Behalf Of Alex Williamson
> > > > > > Sent: Wednesday, September 25, 2013 10:16 PM
> > > > > > To: Bhushan Bharat-R65777
> > > > > > Cc: [email protected]; [email protected];
> > > > > > [email protected]; linux- [email protected];
> > > > > > [email protected]; linux- [email protected];
> > > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > > foundation.org; Bhushan Bharat-R65777
> > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > > > device
> > > > > >
> > > > > > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > > > > > > This api return the iommu domain to which the device is attached.
> > > > > > > The iommu_domain is required for making API calls related to iommu.
> > > > > > > Follow up patches which use this API to know iommu maping.
> > > > > > >
> > > > > > > Signed-off-by: Bharat Bhushan <[email protected]>
> > > > > > > ---
> > > > > > > drivers/iommu/iommu.c | 10 ++++++++++
> > > > > > > include/linux/iommu.h | 7 +++++++
> > > > > > > 2 files changed, 17 insertions(+), 0 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > > > > > index
> > > > > > > fbe9ca7..6ac5f50 100644
> > > > > > > --- a/drivers/iommu/iommu.c
> > > > > > > +++ b/drivers/iommu/iommu.c
> > > > > > > @@ -696,6 +696,16 @@ void iommu_detach_device(struct
> > > > > > > iommu_domain *domain, struct device *dev) }
> > > > > > > EXPORT_SYMBOL_GPL(iommu_detach_device);
> > > > > > >
> > > > > > > +struct iommu_domain *iommu_get_dev_domain(struct device *dev) {
> > > > > > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > > > > +
> > > > > > > + if (unlikely(ops == NULL || ops->get_dev_iommu_domain == NULL))
> > > > > > > + return NULL;
> > > > > > > +
> > > > > > > + return ops->get_dev_iommu_domain(dev); }
> > > > > > > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
> > > > > >
> > > > > > What prevents this from racing iommu_domain_free()? There's no
> > > > > > references acquired, so there's no reason for the caller to
> > > > > > assume the
> > > > pointer is valid.
> > > > >
> > > > > Sorry for late query, somehow this email went into a folder and
> > > > > escaped;
> > > > >
> > > > > Just to be sure, there is not lock at generic "struct
> > > > > iommu_domain", but IP
> > > > specific structure (link FSL domain) linked in iommu_domain->priv
> > > > have a lock, so we need to ensure this race in FSL iommu code (say
> > > > drivers/iommu/fsl_pamu_domain.c), right?
> > > >
> > > > No, it's not sufficient to make sure that your use of the interface
> > > > is race free. The interface itself needs to be designed so that
> > > > it's difficult to use incorrectly.
> > >
> > > So we can define iommu_get_dev_domain()/iommu_put_dev_domain();
> > > iommu_get_dev_domain() will return domain with the lock held, and
> > > iommu_put_dev_domain() will release the lock? And
> > > iommu_get_dev_domain() must always be followed by
> > > iommu_get_dev_domain().
> >
> > What lock? get/put are generally used for reference counting, not locking in
> > the kernel.
> >
> > > > That's not the case here. This is a backdoor to get the iommu
> > > > domain from the iommu driver regardless of who is using it or how.
> > > > The iommu domain is created and managed by vfio, so shouldn't we be
> > > > looking at how to do this through vfio?
> > >
> > > Let me first describe what we are doing here:
> > > During initialization:-
> > > - vfio talks to MSI system to know the MSI-page and size
> > > - vfio then interacts with iommu to map the MSI-page in iommu (IOVA
> > > is decided by userspace and physical address is the MSI-page)
> > > - So the IOVA subwindow mapping is created in iommu and yes VFIO know about
> > this mapping.
> > >
> > > Now do SET_IRQ(MSI/MSIX) ioctl:
> > > - calls pci_enable_msix()/pci_enable_msi_block(): which is supposed to set
> > MSI address/data in device.
> > > - So in current implementation (this patchset) msi-subsystem gets the IOVA
> > from iommu via this defined interface.
> > > - Are you saying that rather than getting this from iommu, we should get this
> > from vfio? What difference does this make?
> >
> > Yes, you just said above that vfio knows the msi to iova mapping, so why go
> > outside of vfio to find it later? The difference is one case you can have a
> > proper reference to data structures to make sure the pointer you get back
> > actually has meaning at the time you're using it vs the code here where you're
> > defining an API that returns a meaningless value
>
> With FSL-PAMU we will always get consistant data from iommu or vfio-data structure.

Great, but you're trying to add a generic API to the IOMMU subsystem
that's difficult to use correctly. The fact that you use it correctly
does not justify the API.

> > because you can't check or
> > enforce that an arbitrary caller is using it correctly.
>
> I am not sure what is arbitrary caller? pdev is known to vfio, so vfio
> will only make pci_enable_msix()/pci_enable_msi_block() for this pdev.
> If anyother code makes then it is some other unexpectedly thing
> happening in system, no?

What's proposed here is a generic IOMMU API. Anybody can call this.
What if the host SCSI driver decides to go get the iommu domain for it's
device (or any other device)? Does that fit your usage model?

> > It's not maintainable.
> > Thanks,
>
> I do not have any issue with this as well, can you also describe the type of API you are envisioning;
> I can think of defining some function in vfio.c/vfio_iommu*.c, make them global and declare then in include/Linux/vfio.h
> And include <Linux/vfio.h> in caller file (arch/powerpc/kernel/msi.c)

Do you really want module dependencies between vfio and your core kernel
MSI setup? Look at the vfio external user interface that we've already
defined. That allows other components of the kernel to get a proper
reference to a vfio group. From there you can work out how to get what
you want. Another alternative is that vfio could register an MSI to
IOVA mapping with architecture code when the mapping is created. The
MSI setup path could then do a lookup in architecture code for the
mapping. You could even store the MSI to IOVA mapping in VFIO and
create an interface where SET_IRQ passes that mapping into setup code.
Thanks,

Alex

2013-10-07 05:46:29

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH 2/7] iommu: add api to get iommu_domain of a device



> -----Original Message-----
> From: Alex Williamson [mailto:[email protected]]
> Sent: Friday, October 04, 2013 11:42 PM
> To: Bhushan Bharat-R65777
> Cc: [email protected]; [email protected]; [email protected]; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> foundation.org
> Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device
>
> On Fri, 2013-10-04 at 17:23 +0000, Bhushan Bharat-R65777 wrote:
> >
> > > -----Original Message-----
> > > From: Alex Williamson [mailto:[email protected]]
> > > Sent: Friday, October 04, 2013 10:43 PM
> > > To: Bhushan Bharat-R65777
> > > Cc: [email protected]; [email protected];
> > > [email protected]; linux- [email protected];
> > > [email protected]; linux- [email protected];
> > > [email protected]; Wood Scott-B07421; [email protected] foundation.org
> > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > device
> > >
> > > On Fri, 2013-10-04 at 16:47 +0000, Bhushan Bharat-R65777 wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Alex Williamson [mailto:[email protected]]
> > > > > Sent: Friday, October 04, 2013 9:15 PM
> > > > > To: Bhushan Bharat-R65777
> > > > > Cc: [email protected]; [email protected];
> > > > > [email protected]; linux- [email protected];
> > > > > [email protected]; linux- [email protected];
> > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > foundation.org
> > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > > device
> > > > >
> > > > > On Fri, 2013-10-04 at 09:54 +0000, Bhushan Bharat-R65777 wrote:
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: [email protected]
> > > > > > > [mailto:[email protected]]
> > > > > > > On Behalf Of Alex Williamson
> > > > > > > Sent: Wednesday, September 25, 2013 10:16 PM
> > > > > > > To: Bhushan Bharat-R65777
> > > > > > > Cc: [email protected]; [email protected];
> > > > > > > [email protected]; linux- [email protected];
> > > > > > > [email protected]; linux- [email protected];
> > > > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > > > foundation.org; Bhushan Bharat-R65777
> > > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain
> > > > > > > of a device
> > > > > > >
> > > > > > > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > > > > > > > This api return the iommu domain to which the device is attached.
> > > > > > > > The iommu_domain is required for making API calls related to
> iommu.
> > > > > > > > Follow up patches which use this API to know iommu maping.
> > > > > > > >
> > > > > > > > Signed-off-by: Bharat Bhushan
> > > > > > > > <[email protected]>
> > > > > > > > ---
> > > > > > > > drivers/iommu/iommu.c | 10 ++++++++++
> > > > > > > > include/linux/iommu.h | 7 +++++++
> > > > > > > > 2 files changed, 17 insertions(+), 0 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > > > > > > index
> > > > > > > > fbe9ca7..6ac5f50 100644
> > > > > > > > --- a/drivers/iommu/iommu.c
> > > > > > > > +++ b/drivers/iommu/iommu.c
> > > > > > > > @@ -696,6 +696,16 @@ void iommu_detach_device(struct
> > > > > > > > iommu_domain *domain, struct device *dev) }
> > > > > > > > EXPORT_SYMBOL_GPL(iommu_detach_device);
> > > > > > > >
> > > > > > > > +struct iommu_domain *iommu_get_dev_domain(struct device *dev) {
> > > > > > > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > > > > > +
> > > > > > > > + if (unlikely(ops == NULL || ops->get_dev_iommu_domain ==
> NULL))
> > > > > > > > + return NULL;
> > > > > > > > +
> > > > > > > > + return ops->get_dev_iommu_domain(dev); }
> > > > > > > > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
> > > > > > >
> > > > > > > What prevents this from racing iommu_domain_free()? There's
> > > > > > > no references acquired, so there's no reason for the caller
> > > > > > > to assume the
> > > > > pointer is valid.
> > > > > >
> > > > > > Sorry for late query, somehow this email went into a folder
> > > > > > and escaped;
> > > > > >
> > > > > > Just to be sure, there is not lock at generic "struct
> > > > > > iommu_domain", but IP
> > > > > specific structure (link FSL domain) linked in
> > > > > iommu_domain->priv have a lock, so we need to ensure this race
> > > > > in FSL iommu code (say drivers/iommu/fsl_pamu_domain.c), right?
> > > > >
> > > > > No, it's not sufficient to make sure that your use of the
> > > > > interface is race free. The interface itself needs to be
> > > > > designed so that it's difficult to use incorrectly.
> > > >
> > > > So we can define iommu_get_dev_domain()/iommu_put_dev_domain();
> > > > iommu_get_dev_domain() will return domain with the lock held, and
> > > > iommu_put_dev_domain() will release the lock? And
> > > > iommu_get_dev_domain() must always be followed by
> > > > iommu_get_dev_domain().
> > >
> > > What lock? get/put are generally used for reference counting, not
> > > locking in the kernel.
> > >
> > > > > That's not the case here. This is a backdoor to get the iommu
> > > > > domain from the iommu driver regardless of who is using it or how.
> > > > > The iommu domain is created and managed by vfio, so shouldn't we
> > > > > be looking at how to do this through vfio?
> > > >
> > > > Let me first describe what we are doing here:
> > > > During initialization:-
> > > > - vfio talks to MSI system to know the MSI-page and size
> > > > - vfio then interacts with iommu to map the MSI-page in iommu
> > > > (IOVA is decided by userspace and physical address is the
> > > > MSI-page)
> > > > - So the IOVA subwindow mapping is created in iommu and yes VFIO
> > > > know about
> > > this mapping.
> > > >
> > > > Now do SET_IRQ(MSI/MSIX) ioctl:
> > > > - calls pci_enable_msix()/pci_enable_msi_block(): which is
> > > > supposed to set
> > > MSI address/data in device.
> > > > - So in current implementation (this patchset) msi-subsystem gets
> > > > the IOVA
> > > from iommu via this defined interface.
> > > > - Are you saying that rather than getting this from iommu, we
> > > > should get this
> > > from vfio? What difference does this make?
> > >
> > > Yes, you just said above that vfio knows the msi to iova mapping, so
> > > why go outside of vfio to find it later? The difference is one case
> > > you can have a proper reference to data structures to make sure the
> > > pointer you get back actually has meaning at the time you're using
> > > it vs the code here where you're defining an API that returns a
> > > meaningless value
> >
> > With FSL-PAMU we will always get consistant data from iommu or vfio-data
> structure.
>
> Great, but you're trying to add a generic API to the IOMMU subsystem that's
> difficult to use correctly. The fact that you use it correctly does not justify
> the API.
>
> > > because you can't check or
> > > enforce that an arbitrary caller is using it correctly.
> >
> > I am not sure what is arbitrary caller? pdev is known to vfio, so vfio
> > will only make pci_enable_msix()/pci_enable_msi_block() for this pdev.
> > If anyother code makes then it is some other unexpectedly thing
> > happening in system, no?
>
> What's proposed here is a generic IOMMU API. Anybody can call this.
> What if the host SCSI driver decides to go get the iommu domain for it's device
> (or any other device)? Does that fit your usage model?
>
> > > It's not maintainable.
> > > Thanks,
> >
> > I do not have any issue with this as well, can you also describe the
> > type of API you are envisioning; I can think of defining some function
> > in vfio.c/vfio_iommu*.c, make them global and declare then in
> > include/Linux/vfio.h And include <Linux/vfio.h> in caller file
> > (arch/powerpc/kernel/msi.c)
>
> Do you really want module dependencies between vfio and your core kernel MSI
> setup? Look at the vfio external user interface that we've already defined.
> That allows other components of the kernel to get a proper reference to a vfio
> group. From there you can work out how to get what you want. Another
> alternative is that vfio could register an MSI to IOVA mapping with architecture
> code when the mapping is created. The MSI setup path could then do a lookup in
> architecture code for the mapping. You could even store the MSI to IOVA mapping
> in VFIO and create an interface where SET_IRQ passes that mapping into setup
> code.

Ok, What I want is to get IOVA associated with a physical address (physical address of MSI-bank).
And currently I do not see a way to know IOVA of a physical address and doing all this domain get and then search through all of iommu-windows of that domain.

What if we add an iommu-API which can return the IOVA mapping of a physical address. Current use case is setting up MSI's for aperture type of IOMMU also getting a phys_to_iova() mapping is independent of VFIO, your thought?

Thanks
-Bharat

> Thanks,
>
> Alex
>

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-10-08 03:13:21

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device

On Mon, 2013-10-07 at 05:46 +0000, Bhushan Bharat-R65777 wrote:
>
> > -----Original Message-----
> > From: Alex Williamson [mailto:[email protected]]
> > Sent: Friday, October 04, 2013 11:42 PM
> > To: Bhushan Bharat-R65777
> > Cc: [email protected]; [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; linux-
> > [email protected]; [email protected]; Wood Scott-B07421; [email protected]
> > foundation.org
> > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device
> >
> > On Fri, 2013-10-04 at 17:23 +0000, Bhushan Bharat-R65777 wrote:
> > >
> > > > -----Original Message-----
> > > > From: Alex Williamson [mailto:[email protected]]
> > > > Sent: Friday, October 04, 2013 10:43 PM
> > > > To: Bhushan Bharat-R65777
> > > > Cc: [email protected]; [email protected];
> > > > [email protected]; linux- [email protected];
> > > > [email protected]; linux- [email protected];
> > > > [email protected]; Wood Scott-B07421; [email protected] foundation.org
> > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > device
> > > >
> > > > On Fri, 2013-10-04 at 16:47 +0000, Bhushan Bharat-R65777 wrote:
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Alex Williamson [mailto:[email protected]]
> > > > > > Sent: Friday, October 04, 2013 9:15 PM
> > > > > > To: Bhushan Bharat-R65777
> > > > > > Cc: [email protected]; [email protected];
> > > > > > [email protected]; linux- [email protected];
> > > > > > [email protected]; linux- [email protected];
> > > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > > foundation.org
> > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > > > device
> > > > > >
> > > > > > On Fri, 2013-10-04 at 09:54 +0000, Bhushan Bharat-R65777 wrote:
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: [email protected]
> > > > > > > > [mailto:[email protected]]
> > > > > > > > On Behalf Of Alex Williamson
> > > > > > > > Sent: Wednesday, September 25, 2013 10:16 PM
> > > > > > > > To: Bhushan Bharat-R65777
> > > > > > > > Cc: [email protected]; [email protected];
> > > > > > > > [email protected]; linux- [email protected];
> > > > > > > > [email protected]; linux- [email protected];
> > > > > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > > > > foundation.org; Bhushan Bharat-R65777
> > > > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain
> > > > > > > > of a device
> > > > > > > >
> > > > > > > > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > > > > > > > > This api return the iommu domain to which the device is attached.
> > > > > > > > > The iommu_domain is required for making API calls related to
> > iommu.
> > > > > > > > > Follow up patches which use this API to know iommu maping.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Bharat Bhushan
> > > > > > > > > <[email protected]>
> > > > > > > > > ---
> > > > > > > > > drivers/iommu/iommu.c | 10 ++++++++++
> > > > > > > > > include/linux/iommu.h | 7 +++++++
> > > > > > > > > 2 files changed, 17 insertions(+), 0 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > > > > > > > index
> > > > > > > > > fbe9ca7..6ac5f50 100644
> > > > > > > > > --- a/drivers/iommu/iommu.c
> > > > > > > > > +++ b/drivers/iommu/iommu.c
> > > > > > > > > @@ -696,6 +696,16 @@ void iommu_detach_device(struct
> > > > > > > > > iommu_domain *domain, struct device *dev) }
> > > > > > > > > EXPORT_SYMBOL_GPL(iommu_detach_device);
> > > > > > > > >
> > > > > > > > > +struct iommu_domain *iommu_get_dev_domain(struct device *dev) {
> > > > > > > > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > > > > > > +
> > > > > > > > > + if (unlikely(ops == NULL || ops->get_dev_iommu_domain ==
> > NULL))
> > > > > > > > > + return NULL;
> > > > > > > > > +
> > > > > > > > > + return ops->get_dev_iommu_domain(dev); }
> > > > > > > > > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
> > > > > > > >
> > > > > > > > What prevents this from racing iommu_domain_free()? There's
> > > > > > > > no references acquired, so there's no reason for the caller
> > > > > > > > to assume the
> > > > > > pointer is valid.
> > > > > > >
> > > > > > > Sorry for late query, somehow this email went into a folder
> > > > > > > and escaped;
> > > > > > >
> > > > > > > Just to be sure, there is not lock at generic "struct
> > > > > > > iommu_domain", but IP
> > > > > > specific structure (link FSL domain) linked in
> > > > > > iommu_domain->priv have a lock, so we need to ensure this race
> > > > > > in FSL iommu code (say drivers/iommu/fsl_pamu_domain.c), right?
> > > > > >
> > > > > > No, it's not sufficient to make sure that your use of the
> > > > > > interface is race free. The interface itself needs to be
> > > > > > designed so that it's difficult to use incorrectly.
> > > > >
> > > > > So we can define iommu_get_dev_domain()/iommu_put_dev_domain();
> > > > > iommu_get_dev_domain() will return domain with the lock held, and
> > > > > iommu_put_dev_domain() will release the lock? And
> > > > > iommu_get_dev_domain() must always be followed by
> > > > > iommu_get_dev_domain().
> > > >
> > > > What lock? get/put are generally used for reference counting, not
> > > > locking in the kernel.
> > > >
> > > > > > That's not the case here. This is a backdoor to get the iommu
> > > > > > domain from the iommu driver regardless of who is using it or how.
> > > > > > The iommu domain is created and managed by vfio, so shouldn't we
> > > > > > be looking at how to do this through vfio?
> > > > >
> > > > > Let me first describe what we are doing here:
> > > > > During initialization:-
> > > > > - vfio talks to MSI system to know the MSI-page and size
> > > > > - vfio then interacts with iommu to map the MSI-page in iommu
> > > > > (IOVA is decided by userspace and physical address is the
> > > > > MSI-page)
> > > > > - So the IOVA subwindow mapping is created in iommu and yes VFIO
> > > > > know about
> > > > this mapping.
> > > > >
> > > > > Now do SET_IRQ(MSI/MSIX) ioctl:
> > > > > - calls pci_enable_msix()/pci_enable_msi_block(): which is
> > > > > supposed to set
> > > > MSI address/data in device.
> > > > > - So in current implementation (this patchset) msi-subsystem gets
> > > > > the IOVA
> > > > from iommu via this defined interface.
> > > > > - Are you saying that rather than getting this from iommu, we
> > > > > should get this
> > > > from vfio? What difference does this make?
> > > >
> > > > Yes, you just said above that vfio knows the msi to iova mapping, so
> > > > why go outside of vfio to find it later? The difference is one case
> > > > you can have a proper reference to data structures to make sure the
> > > > pointer you get back actually has meaning at the time you're using
> > > > it vs the code here where you're defining an API that returns a
> > > > meaningless value
> > >
> > > With FSL-PAMU we will always get consistant data from iommu or vfio-data
> > structure.
> >
> > Great, but you're trying to add a generic API to the IOMMU subsystem that's
> > difficult to use correctly. The fact that you use it correctly does not justify
> > the API.
> >
> > > > because you can't check or
> > > > enforce that an arbitrary caller is using it correctly.
> > >
> > > I am not sure what is arbitrary caller? pdev is known to vfio, so vfio
> > > will only make pci_enable_msix()/pci_enable_msi_block() for this pdev.
> > > If anyother code makes then it is some other unexpectedly thing
> > > happening in system, no?
> >
> > What's proposed here is a generic IOMMU API. Anybody can call this.
> > What if the host SCSI driver decides to go get the iommu domain for it's device
> > (or any other device)? Does that fit your usage model?
> >
> > > > It's not maintainable.
> > > > Thanks,
> > >
> > > I do not have any issue with this as well, can you also describe the
> > > type of API you are envisioning; I can think of defining some function
> > > in vfio.c/vfio_iommu*.c, make them global and declare then in
> > > include/Linux/vfio.h And include <Linux/vfio.h> in caller file
> > > (arch/powerpc/kernel/msi.c)
> >
> > Do you really want module dependencies between vfio and your core kernel MSI
> > setup? Look at the vfio external user interface that we've already defined.
> > That allows other components of the kernel to get a proper reference to a vfio
> > group. From there you can work out how to get what you want. Another
> > alternative is that vfio could register an MSI to IOVA mapping with architecture
> > code when the mapping is created. The MSI setup path could then do a lookup in
> > architecture code for the mapping. You could even store the MSI to IOVA mapping
> > in VFIO and create an interface where SET_IRQ passes that mapping into setup
> > code.
>
> Ok, What I want is to get IOVA associated with a physical address
> (physical address of MSI-bank).
> And currently I do not see a way to know IOVA of a physical address
> and doing all this domain get and then search through all of
> iommu-windows of that domain.
>
> What if we add an iommu-API which can return the IOVA mapping of a
> physical address. Current use case is setting up MSI's for aperture
> type of IOMMU also getting a phys_to_iova() mapping is independent of
> VFIO, your thought?

A physical address can be mapped to multiple IOVAs, so the interface
seems flawed by design. It also has the same problem as above, it's a
backdoor that can be called asynchronous to the owner of the domain, so
what reason is there to believe the result? It just replaces an
iommu_domain pointer with an IOVA. VFIO knows this mapping, so why are
we trying to go behind its back and ask the IOMMU? Thanks,

Alex

2013-10-08 03:42:53

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH 2/7] iommu: add api to get iommu_domain of a device

> > > Do you really want module dependencies between vfio and your core
> > > kernel MSI setup? Look at the vfio external user interface that we've
> already defined.
> > > That allows other components of the kernel to get a proper reference
> > > to a vfio group. From there you can work out how to get what you
> > > want. Another alternative is that vfio could register an MSI to
> > > IOVA mapping with architecture code when the mapping is created.
> > > The MSI setup path could then do a lookup in architecture code for
> > > the mapping. You could even store the MSI to IOVA mapping in VFIO
> > > and create an interface where SET_IRQ passes that mapping into setup code.
> >
> > Ok, What I want is to get IOVA associated with a physical address
> > (physical address of MSI-bank).
> > And currently I do not see a way to know IOVA of a physical address
> > and doing all this domain get and then search through all of
> > iommu-windows of that domain.
> >
> > What if we add an iommu-API which can return the IOVA mapping of a
> > physical address. Current use case is setting up MSI's for aperture
> > type of IOMMU also getting a phys_to_iova() mapping is independent of
> > VFIO, your thought?
>
> A physical address can be mapped to multiple IOVAs, so the interface seems
> flawed by design. It also has the same problem as above, it's a backdoor that
> can be called asynchronous to the owner of the domain, so what reason is there
> to believe the result? It just replaces an iommu_domain pointer with an IOVA.
> VFIO knows this mapping, so why are we trying to go behind its back and ask the
> IOMMU?
IOMMU is the final place where mapping is created, so may be today it is calling on behalf of VFIO, tomorrow it can be for normal Linux or some other interface. But I am fine to directly talk to vfio and will not try to solve a problem which does not exists today.

MSI subsystem knows pdev (pci device) and physical address, then what interface it will use to get the IOVA from VFIO?

Thanks
-Bharat

> Thanks,
>
> Alex
>

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-10-08 16:48:13

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 1/7] powerpc: Add interface to get msi region information

On Thu, Oct 3, 2013 at 11:19 PM, Bhushan Bharat-R65777
<[email protected]> wrote:

>> I don't know enough about VFIO to understand why these new interfaces are
>> needed. Is this the first VFIO IOMMU driver? I see vfio_iommu_spapr_tce.c and
>> vfio_iommu_type1.c but I don't know if they're comparable to the Freescale PAMU.
>> Do other VFIO IOMMU implementations support MSI? If so, do they handle the
>> problem of mapping the MSI regions in a different way?
>
> PAMU is an aperture type of IOMMU while other are paging type, So they are completely different from what PAMU is and handle that differently.

This is not an explanation or a justification for adding new
interfaces. I still have no idea what an "aperture type IOMMU" is,
other than that it is "different." But I see that Alex is working on
this issue with you in a different thread, so I'm sure you guys will
sort it out.

Bjorn

2013-10-08 17:02:36

by Joerg Roedel

[permalink] [raw]
Subject: Re: [PATCH 1/7] powerpc: Add interface to get msi region information

On Tue, Oct 08, 2013 at 10:47:49AM -0600, Bjorn Helgaas wrote:
> I still have no idea what an "aperture type IOMMU" is,
> other than that it is "different."

An aperture based IOMMU is basically any GART-like IOMMU which can only
remap a small window (the aperture) of the DMA address space. DMA
outside of that window is either blocked completly or passed through
untranslated.


Joerg

2013-10-08 17:09:33

by Scott Wood

[permalink] [raw]
Subject: Re: [PATCH 1/7] powerpc: Add interface to get msi region information

On Tue, 2013-10-08 at 10:47 -0600, Bjorn Helgaas wrote:
> On Thu, Oct 3, 2013 at 11:19 PM, Bhushan Bharat-R65777
> <[email protected]> wrote:
>
> >> I don't know enough about VFIO to understand why these new interfaces are
> >> needed. Is this the first VFIO IOMMU driver? I see vfio_iommu_spapr_tce.c and
> >> vfio_iommu_type1.c but I don't know if they're comparable to the Freescale PAMU.
> >> Do other VFIO IOMMU implementations support MSI? If so, do they handle the
> >> problem of mapping the MSI regions in a different way?
> >
> > PAMU is an aperture type of IOMMU while other are paging type, So they are completely different from what PAMU is and handle that differently.
>
> This is not an explanation or a justification for adding new
> interfaces. I still have no idea what an "aperture type IOMMU" is,
> other than that it is "different." But I see that Alex is working on
> this issue with you in a different thread, so I'm sure you guys will
> sort it out.

PAMU is a very constrained IOMMU that cannot do arbitrary page mappings.
Due to these constraints, we cannot map the MSI I/O page at its normal
address while also mapping RAM at the address we want. The address we
can map it at depends on the addresses of other mappings, so it can't be
hidden in the IOMMU driver -- the user needs to be in control.

Another difference is that (if I understand correctly) PCs handle MSIs
specially, via interrupt remapping, rather than being translated as a
normal memory access through the IOMMU.

-Scott


2013-10-08 17:10:03

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH 1/7] powerpc: Add interface to get msi region information



> -----Original Message-----
> From: [email protected] [mailto:[email protected]]
> Sent: Tuesday, October 08, 2013 10:32 PM
> To: Bjorn Helgaas
> Cc: Bhushan Bharat-R65777; [email protected]; [email protected];
> [email protected]; [email protected]; linuxppc-
> [email protected]; [email protected]; [email protected]; Wood Scott-
> B07421; [email protected]
> Subject: Re: [PATCH 1/7] powerpc: Add interface to get msi region information
>
> On Tue, Oct 08, 2013 at 10:47:49AM -0600, Bjorn Helgaas wrote:
> > I still have no idea what an "aperture type IOMMU" is, other than that
> > it is "different."
>
> An aperture based IOMMU is basically any GART-like IOMMU which can only remap a
> small window (the aperture) of the DMA address space. DMA outside of that window
> is either blocked completly or passed through untranslated.

It is completely blocked for Freescale PAMU.
So for this type of iommu what we have to do is to create a MSI mapping just after guest physical address, Example: guest have a 512M of memory then we create window of 1G (because of power of 2 requirement), then we have to FIT MSI just after 512M of guest.
And for that we need
1) to know the physical address of MSI's in interrupt controller (for that this patch was all about of).

2) When guest enable MSI interrupt then we write MSI-address and MSI-DATA in device. The discussion with Alex Williamson is about that interface.

Thanks
-Bharat

>
>
> Joerg
>
>

2013-10-08 22:57:29

by Scott Wood

[permalink] [raw]
Subject: Re: [PATCH 1/7] powerpc: Add interface to get msi region information

On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> @@ -376,6 +405,7 @@ static int fsl_of_msi_probe(struct platform_device *dev)
> int len;
> u32 offset;
> static const u32 all_avail[] = { 0, NR_MSI_IRQS };
> + static int bank_index;
>
> match = of_match_device(fsl_of_msi_ids, &dev->dev);
> if (!match)
> @@ -419,8 +449,8 @@ static int fsl_of_msi_probe(struct platform_device *dev)
> dev->dev.of_node->full_name);
> goto error_out;
> }
> - msi->msiir_offset =
> - features->msiir_offset + (res.start & 0xfffff);
> + msi->msiir = res.start + features->msiir_offset;
> + printk("msi->msiir = %llx\n", msi->msiir);

dev_dbg or remove

> }
>
> msi->feature = features->fsl_pic_ip;
> @@ -470,6 +500,7 @@ static int fsl_of_msi_probe(struct platform_device *dev)
> }
> }
>
> + msi->bank_index = bank_index++;

What if multiple MSIs are boing probed in parallel? bank_index is not
atomic.

> diff --git a/arch/powerpc/sysdev/fsl_msi.h b/arch/powerpc/sysdev/fsl_msi.h
> index 8225f86..6bd5cfc 100644
> --- a/arch/powerpc/sysdev/fsl_msi.h
> +++ b/arch/powerpc/sysdev/fsl_msi.h
> @@ -29,12 +29,19 @@ struct fsl_msi {
> struct irq_domain *irqhost;
>
> unsigned long cascade_irq;
> -
> - u32 msiir_offset; /* Offset of MSIIR, relative to start of CCSR */
> + dma_addr_t msiir; /* MSIIR Address in CCSR */

Are you sure dma_addr_t is right here, versus phys_addr_t? It implies
that it's the output of the DMA API, but I don't think the DMA API is
used in the MSI driver. Perhaps it should be, but we still want the raw
physical address to pass on to VFIO.

> void __iomem *msi_regs;
> u32 feature;
> int msi_virqs[NR_MSI_REG];
>
> + /*
> + * During probe each bank is assigned a index number.
> + * index number ranges from 0 to 2^32.
> + * Example MSI bank 1 = 0
> + * MSI bank 2 = 1, and so on.
> + */
> + int bank_index;

2^32 doesn't fit in "int" (nor does 2^32 - 1).

Just say that indices start at 0.

-Scott


2013-10-08 23:26:04

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH 1/7] powerpc: Add interface to get msi region information

>> - u32 msiir_offset; /* Offset of MSIIR, relative to start of CCSR */
>> + dma_addr_t msiir; /* MSIIR Address in CCSR */
>
> Are you sure dma_addr_t is right here, versus phys_addr_t? It implies
> that it's the output of the DMA API, but I don't think the DMA API is
> used in the MSI driver. Perhaps it should be, but we still want the raw
> physical address to pass on to VFIO.

I don't know what "msiir" is used for, but if it's an address you
program into a PCI device, then it's a dma_addr_t even if you didn't
get it from the DMA API. Maybe "bus_addr_t" would have been a more
suggestive name than "dma_addr_t". That said, I have no idea how this
relates to VFIO.

Bjorn

2013-10-08 23:35:35

by Scott Wood

[permalink] [raw]
Subject: Re: [PATCH 1/7] powerpc: Add interface to get msi region information

On Tue, 2013-10-08 at 17:25 -0600, Bjorn Helgaas wrote:
> >> - u32 msiir_offset; /* Offset of MSIIR, relative to start of CCSR */
> >> + dma_addr_t msiir; /* MSIIR Address in CCSR */
> >
> > Are you sure dma_addr_t is right here, versus phys_addr_t? It implies
> > that it's the output of the DMA API, but I don't think the DMA API is
> > used in the MSI driver. Perhaps it should be, but we still want the raw
> > physical address to pass on to VFIO.
>
> I don't know what "msiir" is used for, but if it's an address you
> program into a PCI device, then it's a dma_addr_t even if you didn't
> get it from the DMA API. Maybe "bus_addr_t" would have been a more
> suggestive name than "dma_addr_t". That said, I have no idea how this
> relates to VFIO.

It's a bit awkward because it gets used both as something to program
into a PCI device (and it's probably a bug that the DMA API doesn't get
used), and also (if I understand the current plans correctly) as a
physical address to give to VFIO to be a destination address in an IOMMU
mapping. So I think the value we keep here should be a phys_addr_t (it
comes straight from the MMIO address in the device tree), which gets
trivially turned into a dma_addr_t by the non-VFIO code path because
there's currently no translation there.

-Scott


2013-10-09 04:48:05

by Bharat Bhushan

[permalink] [raw]
Subject: RE: [PATCH 1/7] powerpc: Add interface to get msi region information



> -----Original Message-----
> From: Wood Scott-B07421
> Sent: Wednesday, October 09, 2013 4:27 AM
> To: Bhushan Bharat-R65777
> Cc: [email protected]; [email protected]; [email protected];
> [email protected]; [email protected]; linuxppc-
> [email protected]; [email protected]; [email protected];
> [email protected]; Bhushan Bharat-R65777
> Subject: Re: [PATCH 1/7] powerpc: Add interface to get msi region information
>
> On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > @@ -376,6 +405,7 @@ static int fsl_of_msi_probe(struct platform_device *dev)
> > int len;
> > u32 offset;
> > static const u32 all_avail[] = { 0, NR_MSI_IRQS };
> > + static int bank_index;
> >
> > match = of_match_device(fsl_of_msi_ids, &dev->dev);
> > if (!match)
> > @@ -419,8 +449,8 @@ static int fsl_of_msi_probe(struct platform_device *dev)
> > dev->dev.of_node->full_name);
> > goto error_out;
> > }
> > - msi->msiir_offset =
> > - features->msiir_offset + (res.start & 0xfffff);
> > + msi->msiir = res.start + features->msiir_offset;
> > + printk("msi->msiir = %llx\n", msi->msiir);
>
> dev_dbg or remove

Oops, sorry it was leftover of debugging :(

>
> > }
> >
> > msi->feature = features->fsl_pic_ip; @@ -470,6 +500,7 @@ static int
> > fsl_of_msi_probe(struct platform_device *dev)
> > }
> > }
> >
> > + msi->bank_index = bank_index++;
>
> What if multiple MSIs are boing probed in parallel?

Ohh, I have not thought that it can be called in parallel

> bank_index is not atomic.

Will declare bank_intex as atomic_t and use atomic_inc_return(&bank_index)

>
> > diff --git a/arch/powerpc/sysdev/fsl_msi.h
> > b/arch/powerpc/sysdev/fsl_msi.h index 8225f86..6bd5cfc 100644
> > --- a/arch/powerpc/sysdev/fsl_msi.h
> > +++ b/arch/powerpc/sysdev/fsl_msi.h
> > @@ -29,12 +29,19 @@ struct fsl_msi {
> > struct irq_domain *irqhost;
> >
> > unsigned long cascade_irq;
> > -
> > - u32 msiir_offset; /* Offset of MSIIR, relative to start of CCSR */
> > + dma_addr_t msiir; /* MSIIR Address in CCSR */
>
> Are you sure dma_addr_t is right here, versus phys_addr_t? It implies that it's
> the output of the DMA API, but I don't think the DMA API is used in the MSI
> driver. Perhaps it should be, but we still want the raw physical address to
> pass on to VFIO.

Looking through the conversation I will make this phys_addr_t

>
> > void __iomem *msi_regs;
> > u32 feature;
> > int msi_virqs[NR_MSI_REG];
> >
> > + /*
> > + * During probe each bank is assigned a index number.
> > + * index number ranges from 0 to 2^32.
> > + * Example MSI bank 1 = 0
> > + * MSI bank 2 = 1, and so on.
> > + */
> > + int bank_index;
>
> 2^32 doesn't fit in "int" (nor does 2^32 - 1).

Right :(

>
> Just say that indices start at 0.

Will correct this

Thanks
-Bharat

>
> -Scott
>

????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-10-10 20:09:11

by Sethi Varun-B16395

[permalink] [raw]
Subject: RE: [PATCH 2/7] iommu: add api to get iommu_domain of a device



> -----Original Message-----
> From: [email protected] [mailto:iommu-
> [email protected]] On Behalf Of Alex Williamson
> Sent: Tuesday, October 08, 2013 8:43 AM
> To: Bhushan Bharat-R65777
> Cc: [email protected]; Wood Scott-B07421; [email protected];
> [email protected]; [email protected];
> [email protected]; [email protected]; linuxppc-
> [email protected]
> Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device
>
> On Mon, 2013-10-07 at 05:46 +0000, Bhushan Bharat-R65777 wrote:
> >
> > > -----Original Message-----
> > > From: Alex Williamson [mailto:[email protected]]
> > > Sent: Friday, October 04, 2013 11:42 PM
> > > To: Bhushan Bharat-R65777
> > > Cc: [email protected]; [email protected];
> > > [email protected]; linux- [email protected];
> > > [email protected]; linux- [email protected];
> > > [email protected]; Wood Scott-B07421; [email protected] foundation.org
> > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > device
> > >
> > > On Fri, 2013-10-04 at 17:23 +0000, Bhushan Bharat-R65777 wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Alex Williamson [mailto:[email protected]]
> > > > > Sent: Friday, October 04, 2013 10:43 PM
> > > > > To: Bhushan Bharat-R65777
> > > > > Cc: [email protected]; [email protected];
> > > > > [email protected]; linux- [email protected];
> > > > > [email protected]; linux- [email protected];
> > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > foundation.org
> > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > > device
> > > > >
> > > > > On Fri, 2013-10-04 at 16:47 +0000, Bhushan Bharat-R65777 wrote:
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Alex Williamson [mailto:[email protected]]
> > > > > > > Sent: Friday, October 04, 2013 9:15 PM
> > > > > > > To: Bhushan Bharat-R65777
> > > > > > > Cc: [email protected]; [email protected];
> > > > > > > [email protected]; linux- [email protected];
> > > > > > > [email protected]; linux- [email protected];
> > > > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > > > foundation.org
> > > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain
> > > > > > > of a device
> > > > > > >
> > > > > > > On Fri, 2013-10-04 at 09:54 +0000, Bhushan Bharat-R65777
> wrote:
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: [email protected]
> > > > > > > > > [mailto:[email protected]]
> > > > > > > > > On Behalf Of Alex Williamson
> > > > > > > > > Sent: Wednesday, September 25, 2013 10:16 PM
> > > > > > > > > To: Bhushan Bharat-R65777
> > > > > > > > > Cc: [email protected]; [email protected];
> > > > > > > > > [email protected]; linux-
> > > > > > > > > [email protected]; [email protected];
> > > > > > > > > linux- [email protected]; [email protected]; Wood
> > > > > > > > > Scott-B07421; [email protected] foundation.org; Bhushan
> > > > > > > > > Bharat-R65777
> > > > > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get
> > > > > > > > > iommu_domain of a device
> > > > > > > > >
> > > > > > > > > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > > > > > > > > > This api return the iommu domain to which the device is
> attached.
> > > > > > > > > > The iommu_domain is required for making API calls
> > > > > > > > > > related to
> > > iommu.
> > > > > > > > > > Follow up patches which use this API to know iommu
> maping.
> > > > > > > > > >
> > > > > > > > > > Signed-off-by: Bharat Bhushan
> > > > > > > > > > <[email protected]>
> > > > > > > > > > ---
> > > > > > > > > > drivers/iommu/iommu.c | 10 ++++++++++
> > > > > > > > > > include/linux/iommu.h | 7 +++++++
> > > > > > > > > > 2 files changed, 17 insertions(+), 0 deletions(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/drivers/iommu/iommu.c
> > > > > > > > > > b/drivers/iommu/iommu.c index
> > > > > > > > > > fbe9ca7..6ac5f50 100644
> > > > > > > > > > --- a/drivers/iommu/iommu.c
> > > > > > > > > > +++ b/drivers/iommu/iommu.c
> > > > > > > > > > @@ -696,6 +696,16 @@ void iommu_detach_device(struct
> > > > > > > > > > iommu_domain *domain, struct device *dev) }
> > > > > > > > > > EXPORT_SYMBOL_GPL(iommu_detach_device);
> > > > > > > > > >
> > > > > > > > > > +struct iommu_domain *iommu_get_dev_domain(struct
> device *dev) {
> > > > > > > > > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > > > > > > > +
> > > > > > > > > > + if (unlikely(ops == NULL ||
> > > > > > > > > > +ops->get_dev_iommu_domain ==
> > > NULL))
> > > > > > > > > > + return NULL;
> > > > > > > > > > +
> > > > > > > > > > + return ops->get_dev_iommu_domain(dev); }
> > > > > > > > > > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
> > > > > > > > >
> > > > > > > > > What prevents this from racing iommu_domain_free()?
> > > > > > > > > There's no references acquired, so there's no reason for
> > > > > > > > > the caller to assume the
> > > > > > > pointer is valid.
> > > > > > > >
> > > > > > > > Sorry for late query, somehow this email went into a
> > > > > > > > folder and escaped;
> > > > > > > >
> > > > > > > > Just to be sure, there is not lock at generic "struct
> > > > > > > > iommu_domain", but IP
> > > > > > > specific structure (link FSL domain) linked in
> > > > > > > iommu_domain->priv have a lock, so we need to ensure this
> > > > > > > race in FSL iommu code (say drivers/iommu/fsl_pamu_domain.c),
> right?
> > > > > > >
> > > > > > > No, it's not sufficient to make sure that your use of the
> > > > > > > interface is race free. The interface itself needs to be
> > > > > > > designed so that it's difficult to use incorrectly.
> > > > > >
> > > > > > So we can define
> > > > > > iommu_get_dev_domain()/iommu_put_dev_domain();
> > > > > > iommu_get_dev_domain() will return domain with the lock held,
> > > > > > and
> > > > > > iommu_put_dev_domain() will release the lock? And
> > > > > > iommu_get_dev_domain() must always be followed by
> > > > > > iommu_get_dev_domain().
> > > > >
> > > > > What lock? get/put are generally used for reference counting,
> > > > > not locking in the kernel.
> > > > >
> > > > > > > That's not the case here. This is a backdoor to get the
> > > > > > > iommu domain from the iommu driver regardless of who is using
> it or how.
> > > > > > > The iommu domain is created and managed by vfio, so
> > > > > > > shouldn't we be looking at how to do this through vfio?
> > > > > >
> > > > > > Let me first describe what we are doing here:
> > > > > > During initialization:-
> > > > > > - vfio talks to MSI system to know the MSI-page and size
> > > > > > - vfio then interacts with iommu to map the MSI-page in iommu
> > > > > > (IOVA is decided by userspace and physical address is the
> > > > > > MSI-page)
> > > > > > - So the IOVA subwindow mapping is created in iommu and yes
> > > > > > VFIO know about
> > > > > this mapping.
> > > > > >
> > > > > > Now do SET_IRQ(MSI/MSIX) ioctl:
> > > > > > - calls pci_enable_msix()/pci_enable_msi_block(): which is
> > > > > > supposed to set
> > > > > MSI address/data in device.
> > > > > > - So in current implementation (this patchset) msi-subsystem
> > > > > > gets the IOVA
> > > > > from iommu via this defined interface.
> > > > > > - Are you saying that rather than getting this from iommu, we
> > > > > > should get this
> > > > > from vfio? What difference does this make?
> > > > >
> > > > > Yes, you just said above that vfio knows the msi to iova
> > > > > mapping, so why go outside of vfio to find it later? The
> > > > > difference is one case you can have a proper reference to data
> > > > > structures to make sure the pointer you get back actually has
> > > > > meaning at the time you're using it vs the code here where
> > > > > you're defining an API that returns a meaningless value
> > > >
> > > > With FSL-PAMU we will always get consistant data from iommu or
> > > > vfio-data
> > > structure.
> > >
> > > Great, but you're trying to add a generic API to the IOMMU subsystem
> > > that's difficult to use correctly. The fact that you use it
> > > correctly does not justify the API.
> > >
> > > > > because you can't check or
> > > > > enforce that an arbitrary caller is using it correctly.
> > > >
> > > > I am not sure what is arbitrary caller? pdev is known to vfio, so
> > > > vfio will only make pci_enable_msix()/pci_enable_msi_block() for
> this pdev.
> > > > If anyother code makes then it is some other unexpectedly thing
> > > > happening in system, no?
> > >
> > > What's proposed here is a generic IOMMU API. Anybody can call this.
> > > What if the host SCSI driver decides to go get the iommu domain for
> > > it's device (or any other device)? Does that fit your usage model?
> > >
> > > > > It's not maintainable.
> > > > > Thanks,
> > > >
> > > > I do not have any issue with this as well, can you also describe
> > > > the type of API you are envisioning; I can think of defining some
> > > > function in vfio.c/vfio_iommu*.c, make them global and declare
> > > > then in include/Linux/vfio.h And include <Linux/vfio.h> in caller
> > > > file
> > > > (arch/powerpc/kernel/msi.c)
> > >
> > > Do you really want module dependencies between vfio and your core
> > > kernel MSI setup? Look at the vfio external user interface that
> we've already defined.
> > > That allows other components of the kernel to get a proper reference
> > > to a vfio group. From there you can work out how to get what you
> > > want. Another alternative is that vfio could register an MSI to
> > > IOVA mapping with architecture code when the mapping is created.
> > > The MSI setup path could then do a lookup in architecture code for
> > > the mapping. You could even store the MSI to IOVA mapping in VFIO
> > > and create an interface where SET_IRQ passes that mapping into setup
> code.
[Sethi Varun-B16395] What you are suggesting is that the MSI setup path queries the vfio subsystem for the mapping, rather than directly querying the iommu subsystem. So, say if we add an interface for getting MSI to IOVA mapping in the msi setup path, wouldn't this again be specific to vfio? I reckon that this interface would again ppc machine specific interface.

-Varun

2013-10-10 20:41:55

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device

On Thu, 2013-10-10 at 20:09 +0000, Sethi Varun-B16395 wrote:
>
> > -----Original Message-----
> > From: [email protected] [mailto:iommu-
> > [email protected]] On Behalf Of Alex Williamson
> > Sent: Tuesday, October 08, 2013 8:43 AM
> > To: Bhushan Bharat-R65777
> > Cc: [email protected]; Wood Scott-B07421; [email protected];
> > [email protected]; [email protected];
> > [email protected]; [email protected]; linuxppc-
> > [email protected]
> > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device
> >
> > On Mon, 2013-10-07 at 05:46 +0000, Bhushan Bharat-R65777 wrote:
> > >
> > > > -----Original Message-----
> > > > From: Alex Williamson [mailto:[email protected]]
> > > > Sent: Friday, October 04, 2013 11:42 PM
> > > > To: Bhushan Bharat-R65777
> > > > Cc: [email protected]; [email protected];
> > > > [email protected]; linux- [email protected];
> > > > [email protected]; linux- [email protected];
> > > > [email protected]; Wood Scott-B07421; [email protected] foundation.org
> > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > device
> > > >
> > > > On Fri, 2013-10-04 at 17:23 +0000, Bhushan Bharat-R65777 wrote:
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Alex Williamson [mailto:[email protected]]
> > > > > > Sent: Friday, October 04, 2013 10:43 PM
> > > > > > To: Bhushan Bharat-R65777
> > > > > > Cc: [email protected]; [email protected];
> > > > > > [email protected]; linux- [email protected];
> > > > > > [email protected]; linux- [email protected];
> > > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > > foundation.org
> > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > > > device
> > > > > >
> > > > > > On Fri, 2013-10-04 at 16:47 +0000, Bhushan Bharat-R65777 wrote:
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Alex Williamson [mailto:[email protected]]
> > > > > > > > Sent: Friday, October 04, 2013 9:15 PM
> > > > > > > > To: Bhushan Bharat-R65777
> > > > > > > > Cc: [email protected]; [email protected];
> > > > > > > > [email protected]; linux- [email protected];
> > > > > > > > [email protected]; linux- [email protected];
> > > > > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > > > > foundation.org
> > > > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain
> > > > > > > > of a device
> > > > > > > >
> > > > > > > > On Fri, 2013-10-04 at 09:54 +0000, Bhushan Bharat-R65777
> > wrote:
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: [email protected]
> > > > > > > > > > [mailto:[email protected]]
> > > > > > > > > > On Behalf Of Alex Williamson
> > > > > > > > > > Sent: Wednesday, September 25, 2013 10:16 PM
> > > > > > > > > > To: Bhushan Bharat-R65777
> > > > > > > > > > Cc: [email protected]; [email protected];
> > > > > > > > > > [email protected]; linux-
> > > > > > > > > > [email protected]; [email protected];
> > > > > > > > > > linux- [email protected]; [email protected]; Wood
> > > > > > > > > > Scott-B07421; [email protected] foundation.org; Bhushan
> > > > > > > > > > Bharat-R65777
> > > > > > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get
> > > > > > > > > > iommu_domain of a device
> > > > > > > > > >
> > > > > > > > > > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan wrote:
> > > > > > > > > > > This api return the iommu domain to which the device is
> > attached.
> > > > > > > > > > > The iommu_domain is required for making API calls
> > > > > > > > > > > related to
> > > > iommu.
> > > > > > > > > > > Follow up patches which use this API to know iommu
> > maping.
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Bharat Bhushan
> > > > > > > > > > > <[email protected]>
> > > > > > > > > > > ---
> > > > > > > > > > > drivers/iommu/iommu.c | 10 ++++++++++
> > > > > > > > > > > include/linux/iommu.h | 7 +++++++
> > > > > > > > > > > 2 files changed, 17 insertions(+), 0 deletions(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/drivers/iommu/iommu.c
> > > > > > > > > > > b/drivers/iommu/iommu.c index
> > > > > > > > > > > fbe9ca7..6ac5f50 100644
> > > > > > > > > > > --- a/drivers/iommu/iommu.c
> > > > > > > > > > > +++ b/drivers/iommu/iommu.c
> > > > > > > > > > > @@ -696,6 +696,16 @@ void iommu_detach_device(struct
> > > > > > > > > > > iommu_domain *domain, struct device *dev) }
> > > > > > > > > > > EXPORT_SYMBOL_GPL(iommu_detach_device);
> > > > > > > > > > >
> > > > > > > > > > > +struct iommu_domain *iommu_get_dev_domain(struct
> > device *dev) {
> > > > > > > > > > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > > > > > > > > +
> > > > > > > > > > > + if (unlikely(ops == NULL ||
> > > > > > > > > > > +ops->get_dev_iommu_domain ==
> > > > NULL))
> > > > > > > > > > > + return NULL;
> > > > > > > > > > > +
> > > > > > > > > > > + return ops->get_dev_iommu_domain(dev); }
> > > > > > > > > > > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
> > > > > > > > > >
> > > > > > > > > > What prevents this from racing iommu_domain_free()?
> > > > > > > > > > There's no references acquired, so there's no reason for
> > > > > > > > > > the caller to assume the
> > > > > > > > pointer is valid.
> > > > > > > > >
> > > > > > > > > Sorry for late query, somehow this email went into a
> > > > > > > > > folder and escaped;
> > > > > > > > >
> > > > > > > > > Just to be sure, there is not lock at generic "struct
> > > > > > > > > iommu_domain", but IP
> > > > > > > > specific structure (link FSL domain) linked in
> > > > > > > > iommu_domain->priv have a lock, so we need to ensure this
> > > > > > > > race in FSL iommu code (say drivers/iommu/fsl_pamu_domain.c),
> > right?
> > > > > > > >
> > > > > > > > No, it's not sufficient to make sure that your use of the
> > > > > > > > interface is race free. The interface itself needs to be
> > > > > > > > designed so that it's difficult to use incorrectly.
> > > > > > >
> > > > > > > So we can define
> > > > > > > iommu_get_dev_domain()/iommu_put_dev_domain();
> > > > > > > iommu_get_dev_domain() will return domain with the lock held,
> > > > > > > and
> > > > > > > iommu_put_dev_domain() will release the lock? And
> > > > > > > iommu_get_dev_domain() must always be followed by
> > > > > > > iommu_get_dev_domain().
> > > > > >
> > > > > > What lock? get/put are generally used for reference counting,
> > > > > > not locking in the kernel.
> > > > > >
> > > > > > > > That's not the case here. This is a backdoor to get the
> > > > > > > > iommu domain from the iommu driver regardless of who is using
> > it or how.
> > > > > > > > The iommu domain is created and managed by vfio, so
> > > > > > > > shouldn't we be looking at how to do this through vfio?
> > > > > > >
> > > > > > > Let me first describe what we are doing here:
> > > > > > > During initialization:-
> > > > > > > - vfio talks to MSI system to know the MSI-page and size
> > > > > > > - vfio then interacts with iommu to map the MSI-page in iommu
> > > > > > > (IOVA is decided by userspace and physical address is the
> > > > > > > MSI-page)
> > > > > > > - So the IOVA subwindow mapping is created in iommu and yes
> > > > > > > VFIO know about
> > > > > > this mapping.
> > > > > > >
> > > > > > > Now do SET_IRQ(MSI/MSIX) ioctl:
> > > > > > > - calls pci_enable_msix()/pci_enable_msi_block(): which is
> > > > > > > supposed to set
> > > > > > MSI address/data in device.
> > > > > > > - So in current implementation (this patchset) msi-subsystem
> > > > > > > gets the IOVA
> > > > > > from iommu via this defined interface.
> > > > > > > - Are you saying that rather than getting this from iommu, we
> > > > > > > should get this
> > > > > > from vfio? What difference does this make?
> > > > > >
> > > > > > Yes, you just said above that vfio knows the msi to iova
> > > > > > mapping, so why go outside of vfio to find it later? The
> > > > > > difference is one case you can have a proper reference to data
> > > > > > structures to make sure the pointer you get back actually has
> > > > > > meaning at the time you're using it vs the code here where
> > > > > > you're defining an API that returns a meaningless value
> > > > >
> > > > > With FSL-PAMU we will always get consistant data from iommu or
> > > > > vfio-data
> > > > structure.
> > > >
> > > > Great, but you're trying to add a generic API to the IOMMU subsystem
> > > > that's difficult to use correctly. The fact that you use it
> > > > correctly does not justify the API.
> > > >
> > > > > > because you can't check or
> > > > > > enforce that an arbitrary caller is using it correctly.
> > > > >
> > > > > I am not sure what is arbitrary caller? pdev is known to vfio, so
> > > > > vfio will only make pci_enable_msix()/pci_enable_msi_block() for
> > this pdev.
> > > > > If anyother code makes then it is some other unexpectedly thing
> > > > > happening in system, no?
> > > >
> > > > What's proposed here is a generic IOMMU API. Anybody can call this.
> > > > What if the host SCSI driver decides to go get the iommu domain for
> > > > it's device (or any other device)? Does that fit your usage model?
> > > >
> > > > > > It's not maintainable.
> > > > > > Thanks,
> > > > >
> > > > > I do not have any issue with this as well, can you also describe
> > > > > the type of API you are envisioning; I can think of defining some
> > > > > function in vfio.c/vfio_iommu*.c, make them global and declare
> > > > > then in include/Linux/vfio.h And include <Linux/vfio.h> in caller
> > > > > file
> > > > > (arch/powerpc/kernel/msi.c)
> > > >
> > > > Do you really want module dependencies between vfio and your core
> > > > kernel MSI setup? Look at the vfio external user interface that
> > we've already defined.
> > > > That allows other components of the kernel to get a proper reference
> > > > to a vfio group. From there you can work out how to get what you
> > > > want. Another alternative is that vfio could register an MSI to
> > > > IOVA mapping with architecture code when the mapping is created.
> > > > The MSI setup path could then do a lookup in architecture code for
> > > > the mapping. You could even store the MSI to IOVA mapping in VFIO
> > > > and create an interface where SET_IRQ passes that mapping into setup
> > code.
> [Sethi Varun-B16395] What you are suggesting is that the MSI setup
> path queries the vfio subsystem for the mapping, rather than directly
> querying the iommu subsystem. So, say if we add an interface for
> getting MSI to IOVA mapping in the msi setup path, wouldn't this again
> be specific to vfio? I reckon that this interface would again ppc
> machine specific interface.

Sure, if this is a generic MSI setup path then clearly vfio should not
be involved. But by that same notion, if this is a generic MSI setup
path, how can the proposed solutions guarantee that the iommu_domain or
iova returned is still valid in all cases? It's racy. If the caller
trying to setup MSI has the information needed, why doesn't it pass it
in as part of the setup? Thanks,

Alex


2013-10-10 21:13:27

by Sethi Varun-B16395

[permalink] [raw]
Subject: RE: [PATCH 1/7] powerpc: Add interface to get msi region information



> -----Original Message-----
> From: [email protected] [mailto:linux-kernel-
> [email protected]] On Behalf Of Bhushan Bharat-R65777
> Sent: Tuesday, October 08, 2013 10:40 PM
> To: [email protected]; Bjorn Helgaas
> Cc: [email protected]; [email protected];
> [email protected]; [email protected]; linuxppc-
> [email protected]; [email protected]; [email protected]; Wood
> Scott-B07421; [email protected]
> Subject: RE: [PATCH 1/7] powerpc: Add interface to get msi region
> information
>
>
>
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > Sent: Tuesday, October 08, 2013 10:32 PM
> > To: Bjorn Helgaas
> > Cc: Bhushan Bharat-R65777; [email protected];
> > [email protected]; [email protected];
> > [email protected]; linuxppc- [email protected];
> > [email protected]; [email protected]; Wood Scott- B07421;
> > [email protected]
> > Subject: Re: [PATCH 1/7] powerpc: Add interface to get msi region
> > information
> >
> > On Tue, Oct 08, 2013 at 10:47:49AM -0600, Bjorn Helgaas wrote:
> > > I still have no idea what an "aperture type IOMMU" is, other than
> > > that it is "different."
> >
> > An aperture based IOMMU is basically any GART-like IOMMU which can
> > only remap a small window (the aperture) of the DMA address space. DMA
> > outside of that window is either blocked completly or passed through
> untranslated.
>
> It is completely blocked for Freescale PAMU.
> So for this type of iommu what we have to do is to create a MSI mapping
> just after guest physical address, Example: guest have a 512M of memory
> then we create window of 1G (because of power of 2 requirement), then we
> have to FIT MSI just after 512M of guest.

[Sethi Varun-B16395] PAMU (FSL IOMMU) has a concept of primary window and subwindows. Primary window corresponds to the complete guest iova address space (including MSI space), with respect to IOMMU_API this is termed as geometry . IOVA Base of subwindow is determined from the number of subwindows (configurable using iommu API). Subwindows allow for handling physically discontiguous memory. PAMU translates device iova accesses to actual physical address. MSI mapping would be addressed by a subwindow, with iova base starting at the end of the guest iova space.

VFIO code creates a PAMU window (also defines number of subwindow) to map the guest iova space + msi space. The interface defined by this patch queries the PAMU driver to get the iova mapping for the msi region assigned to the PCIe device (assigned to the guest).

-Varun



2013-10-14 12:58:50

by Sethi Varun-B16395

[permalink] [raw]
Subject: RE: [PATCH 2/7] iommu: add api to get iommu_domain of a device



> -----Original Message-----
> From: Alex Williamson [mailto:[email protected]]
> Sent: Friday, October 11, 2013 2:12 AM
> To: Sethi Varun-B16395
> Cc: Bhushan Bharat-R65777; [email protected]; Wood Scott-B07421; linux-
> [email protected]; [email protected]; linux-
> [email protected]; [email protected];
> [email protected]; [email protected]
> Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device
>
> On Thu, 2013-10-10 at 20:09 +0000, Sethi Varun-B16395 wrote:
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:iommu-
> > > [email protected]] On Behalf Of Alex Williamson
> > > Sent: Tuesday, October 08, 2013 8:43 AM
> > > To: Bhushan Bharat-R65777
> > > Cc: [email protected]; Wood Scott-B07421; [email protected];
> > > [email protected]; [email protected];
> > > [email protected]; [email protected];
> > > linuxppc- [email protected]
> > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > device
> > >
> > > On Mon, 2013-10-07 at 05:46 +0000, Bhushan Bharat-R65777 wrote:
> > > >
> > > > > -----Original Message-----
> > > > > From: Alex Williamson [mailto:[email protected]]
> > > > > Sent: Friday, October 04, 2013 11:42 PM
> > > > > To: Bhushan Bharat-R65777
> > > > > Cc: [email protected]; [email protected];
> > > > > [email protected]; linux- [email protected];
> > > > > [email protected]; linux- [email protected];
> > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > foundation.org
> > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > > device
> > > > >
> > > > > On Fri, 2013-10-04 at 17:23 +0000, Bhushan Bharat-R65777 wrote:
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Alex Williamson [mailto:[email protected]]
> > > > > > > Sent: Friday, October 04, 2013 10:43 PM
> > > > > > > To: Bhushan Bharat-R65777
> > > > > > > Cc: [email protected]; [email protected];
> > > > > > > [email protected]; linux- [email protected];
> > > > > > > [email protected]; linux- [email protected];
> > > > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > > > foundation.org
> > > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain
> > > > > > > of a device
> > > > > > >
> > > > > > > On Fri, 2013-10-04 at 16:47 +0000, Bhushan Bharat-R65777
> wrote:
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Alex Williamson
> > > > > > > > > [mailto:[email protected]]
> > > > > > > > > Sent: Friday, October 04, 2013 9:15 PM
> > > > > > > > > To: Bhushan Bharat-R65777
> > > > > > > > > Cc: [email protected]; [email protected];
> > > > > > > > > [email protected]; linux-
> > > > > > > > > [email protected]; [email protected];
> > > > > > > > > linux- [email protected]; [email protected]; Wood
> > > > > > > > > Scott-B07421; [email protected] foundation.org
> > > > > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get
> > > > > > > > > iommu_domain of a device
> > > > > > > > >
> > > > > > > > > On Fri, 2013-10-04 at 09:54 +0000, Bhushan Bharat-R65777
> > > wrote:
> > > > > > > > > >
> > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > From: [email protected]
> > > > > > > > > > > [mailto:[email protected]]
> > > > > > > > > > > On Behalf Of Alex Williamson
> > > > > > > > > > > Sent: Wednesday, September 25, 2013 10:16 PM
> > > > > > > > > > > To: Bhushan Bharat-R65777
> > > > > > > > > > > Cc: [email protected]; [email protected];
> > > > > > > > > > > [email protected]; linux-
> > > > > > > > > > > [email protected];
> > > > > > > > > > > [email protected];
> > > > > > > > > > > linux- [email protected]; [email protected]; Wood
> > > > > > > > > > > Scott-B07421; [email protected] foundation.org;
> > > > > > > > > > > Bhushan
> > > > > > > > > > > Bharat-R65777
> > > > > > > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get
> > > > > > > > > > > iommu_domain of a device
> > > > > > > > > > >
> > > > > > > > > > > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan
> wrote:
> > > > > > > > > > > > This api return the iommu domain to which the
> > > > > > > > > > > > device is
> > > attached.
> > > > > > > > > > > > The iommu_domain is required for making API calls
> > > > > > > > > > > > related to
> > > > > iommu.
> > > > > > > > > > > > Follow up patches which use this API to know iommu
> > > maping.
> > > > > > > > > > > >
> > > > > > > > > > > > Signed-off-by: Bharat Bhushan
> > > > > > > > > > > > <[email protected]>
> > > > > > > > > > > > ---
> > > > > > > > > > > > drivers/iommu/iommu.c | 10 ++++++++++
> > > > > > > > > > > > include/linux/iommu.h | 7 +++++++
> > > > > > > > > > > > 2 files changed, 17 insertions(+), 0 deletions(-)
> > > > > > > > > > > >
> > > > > > > > > > > > diff --git a/drivers/iommu/iommu.c
> > > > > > > > > > > > b/drivers/iommu/iommu.c index
> > > > > > > > > > > > fbe9ca7..6ac5f50 100644
> > > > > > > > > > > > --- a/drivers/iommu/iommu.c
> > > > > > > > > > > > +++ b/drivers/iommu/iommu.c
> > > > > > > > > > > > @@ -696,6 +696,16 @@ void
> > > > > > > > > > > > iommu_detach_device(struct iommu_domain *domain,
> > > > > > > > > > > > struct device *dev) }
> > > > > > > > > > > > EXPORT_SYMBOL_GPL(iommu_detach_device);
> > > > > > > > > > > >
> > > > > > > > > > > > +struct iommu_domain *iommu_get_dev_domain(struct
> > > device *dev) {
> > > > > > > > > > > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > > > > > > > > > +
> > > > > > > > > > > > + if (unlikely(ops == NULL ||
> > > > > > > > > > > > +ops->get_dev_iommu_domain ==
> > > > > NULL))
> > > > > > > > > > > > + return NULL;
> > > > > > > > > > > > +
> > > > > > > > > > > > + return ops->get_dev_iommu_domain(dev); }
> > > > > > > > > > > > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
> > > > > > > > > > >
> > > > > > > > > > > What prevents this from racing iommu_domain_free()?
> > > > > > > > > > > There's no references acquired, so there's no reason
> > > > > > > > > > > for the caller to assume the
> > > > > > > > > pointer is valid.
> > > > > > > > > >
> > > > > > > > > > Sorry for late query, somehow this email went into a
> > > > > > > > > > folder and escaped;
> > > > > > > > > >
> > > > > > > > > > Just to be sure, there is not lock at generic "struct
> > > > > > > > > > iommu_domain", but IP
> > > > > > > > > specific structure (link FSL domain) linked in
> > > > > > > > > iommu_domain->priv have a lock, so we need to ensure
> > > > > > > > > this race in FSL iommu code (say
> > > > > > > > > drivers/iommu/fsl_pamu_domain.c),
> > > right?
> > > > > > > > >
> > > > > > > > > No, it's not sufficient to make sure that your use of
> > > > > > > > > the interface is race free. The interface itself needs
> > > > > > > > > to be designed so that it's difficult to use incorrectly.
> > > > > > > >
> > > > > > > > So we can define
> > > > > > > > iommu_get_dev_domain()/iommu_put_dev_domain();
> > > > > > > > iommu_get_dev_domain() will return domain with the lock
> > > > > > > > held, and
> > > > > > > > iommu_put_dev_domain() will release the lock? And
> > > > > > > > iommu_get_dev_domain() must always be followed by
> > > > > > > > iommu_get_dev_domain().
> > > > > > >
> > > > > > > What lock? get/put are generally used for reference
> > > > > > > counting, not locking in the kernel.
> > > > > > >
> > > > > > > > > That's not the case here. This is a backdoor to get the
> > > > > > > > > iommu domain from the iommu driver regardless of who is
> > > > > > > > > using
> > > it or how.
> > > > > > > > > The iommu domain is created and managed by vfio, so
> > > > > > > > > shouldn't we be looking at how to do this through vfio?
> > > > > > > >
> > > > > > > > Let me first describe what we are doing here:
> > > > > > > > During initialization:-
> > > > > > > > - vfio talks to MSI system to know the MSI-page and size
> > > > > > > > - vfio then interacts with iommu to map the MSI-page in
> > > > > > > > iommu (IOVA is decided by userspace and physical address
> > > > > > > > is the
> > > > > > > > MSI-page)
> > > > > > > > - So the IOVA subwindow mapping is created in iommu and
> > > > > > > > yes VFIO know about
> > > > > > > this mapping.
> > > > > > > >
> > > > > > > > Now do SET_IRQ(MSI/MSIX) ioctl:
> > > > > > > > - calls pci_enable_msix()/pci_enable_msi_block(): which
> > > > > > > > is supposed to set
> > > > > > > MSI address/data in device.
> > > > > > > > - So in current implementation (this patchset)
> > > > > > > > msi-subsystem gets the IOVA
> > > > > > > from iommu via this defined interface.
> > > > > > > > - Are you saying that rather than getting this from
> > > > > > > > iommu, we should get this
> > > > > > > from vfio? What difference does this make?
> > > > > > >
> > > > > > > Yes, you just said above that vfio knows the msi to iova
> > > > > > > mapping, so why go outside of vfio to find it later? The
> > > > > > > difference is one case you can have a proper reference to
> > > > > > > data structures to make sure the pointer you get back
> > > > > > > actually has meaning at the time you're using it vs the code
> > > > > > > here where you're defining an API that returns a meaningless
> > > > > > > value
> > > > > >
> > > > > > With FSL-PAMU we will always get consistant data from iommu or
> > > > > > vfio-data
> > > > > structure.
> > > > >
> > > > > Great, but you're trying to add a generic API to the IOMMU
> > > > > subsystem that's difficult to use correctly. The fact that you
> > > > > use it correctly does not justify the API.
> > > > >
> > > > > > > because you can't check or
> > > > > > > enforce that an arbitrary caller is using it correctly.
> > > > > >
> > > > > > I am not sure what is arbitrary caller? pdev is known to vfio,
> > > > > > so vfio will only make
> > > > > > pci_enable_msix()/pci_enable_msi_block() for
> > > this pdev.
> > > > > > If anyother code makes then it is some other unexpectedly
> > > > > > thing happening in system, no?
> > > > >
> > > > > What's proposed here is a generic IOMMU API. Anybody can call
> this.
> > > > > What if the host SCSI driver decides to go get the iommu domain
> > > > > for it's device (or any other device)? Does that fit your usage
> model?
> > > > >
> > > > > > > It's not maintainable.
> > > > > > > Thanks,
> > > > > >
> > > > > > I do not have any issue with this as well, can you also
> > > > > > describe the type of API you are envisioning; I can think of
> > > > > > defining some function in vfio.c/vfio_iommu*.c, make them
> > > > > > global and declare then in include/Linux/vfio.h And include
> > > > > > <Linux/vfio.h> in caller file
> > > > > > (arch/powerpc/kernel/msi.c)
> > > > >
> > > > > Do you really want module dependencies between vfio and your
> > > > > core kernel MSI setup? Look at the vfio external user interface
> > > > > that
> > > we've already defined.
> > > > > That allows other components of the kernel to get a proper
> > > > > reference to a vfio group. From there you can work out how to
> > > > > get what you want. Another alternative is that vfio could
> > > > > register an MSI to IOVA mapping with architecture code when the
> mapping is created.
> > > > > The MSI setup path could then do a lookup in architecture code
> > > > > for the mapping. You could even store the MSI to IOVA mapping
> > > > > in VFIO and create an interface where SET_IRQ passes that
> > > > > mapping into setup
> > > code.
> > [Sethi Varun-B16395] What you are suggesting is that the MSI setup
> > path queries the vfio subsystem for the mapping, rather than directly
> > querying the iommu subsystem. So, say if we add an interface for
> > getting MSI to IOVA mapping in the msi setup path, wouldn't this again
> > be specific to vfio? I reckon that this interface would again ppc
> > machine specific interface.
>
> Sure, if this is a generic MSI setup path then clearly vfio should not be
> involved. But by that same notion, if this is a generic MSI setup path,
> how can the proposed solutions guarantee that the iommu_domain or iova
> returned is still valid in all cases? It's racy. If the caller trying
> to setup MSI has the information needed, why doesn't it pass it in as
> part of the setup? Thanks,
[Sethi Varun-B16395] Agreed, the proposed interface is not clean. But, we still need an interface through which MSI driver can call in to the vfio layer.

-Varun


????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2013-10-14 14:20:34

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device

On Mon, 2013-10-14 at 12:58 +0000, Sethi Varun-B16395 wrote:
>
> > -----Original Message-----
> > From: Alex Williamson [mailto:[email protected]]
> > Sent: Friday, October 11, 2013 2:12 AM
> > To: Sethi Varun-B16395
> > Cc: Bhushan Bharat-R65777; [email protected]; Wood Scott-B07421; linux-
> > [email protected]; [email protected]; linux-
> > [email protected]; [email protected];
> > [email protected]; [email protected]
> > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a device
> >
> > On Thu, 2013-10-10 at 20:09 +0000, Sethi Varun-B16395 wrote:
> > >
> > > > -----Original Message-----
> > > > From: [email protected] [mailto:iommu-
> > > > [email protected]] On Behalf Of Alex Williamson
> > > > Sent: Tuesday, October 08, 2013 8:43 AM
> > > > To: Bhushan Bharat-R65777
> > > > Cc: [email protected]; Wood Scott-B07421; [email protected];
> > > > [email protected]; [email protected];
> > > > [email protected]; [email protected];
> > > > linuxppc- [email protected]
> > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > device
> > > >
> > > > On Mon, 2013-10-07 at 05:46 +0000, Bhushan Bharat-R65777 wrote:
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Alex Williamson [mailto:[email protected]]
> > > > > > Sent: Friday, October 04, 2013 11:42 PM
> > > > > > To: Bhushan Bharat-R65777
> > > > > > Cc: [email protected]; [email protected];
> > > > > > [email protected]; linux- [email protected];
> > > > > > [email protected]; linux- [email protected];
> > > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > > foundation.org
> > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain of a
> > > > > > device
> > > > > >
> > > > > > On Fri, 2013-10-04 at 17:23 +0000, Bhushan Bharat-R65777 wrote:
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Alex Williamson [mailto:[email protected]]
> > > > > > > > Sent: Friday, October 04, 2013 10:43 PM
> > > > > > > > To: Bhushan Bharat-R65777
> > > > > > > > Cc: [email protected]; [email protected];
> > > > > > > > [email protected]; linux- [email protected];
> > > > > > > > [email protected]; linux- [email protected];
> > > > > > > > [email protected]; Wood Scott-B07421; [email protected]
> > > > > > > > foundation.org
> > > > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get iommu_domain
> > > > > > > > of a device
> > > > > > > >
> > > > > > > > On Fri, 2013-10-04 at 16:47 +0000, Bhushan Bharat-R65777
> > wrote:
> > > > > > > > >
> > > > > > > > > > -----Original Message-----
> > > > > > > > > > From: Alex Williamson
> > > > > > > > > > [mailto:[email protected]]
> > > > > > > > > > Sent: Friday, October 04, 2013 9:15 PM
> > > > > > > > > > To: Bhushan Bharat-R65777
> > > > > > > > > > Cc: [email protected]; [email protected];
> > > > > > > > > > [email protected]; linux-
> > > > > > > > > > [email protected]; [email protected];
> > > > > > > > > > linux- [email protected]; [email protected]; Wood
> > > > > > > > > > Scott-B07421; [email protected] foundation.org
> > > > > > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get
> > > > > > > > > > iommu_domain of a device
> > > > > > > > > >
> > > > > > > > > > On Fri, 2013-10-04 at 09:54 +0000, Bhushan Bharat-R65777
> > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > > -----Original Message-----
> > > > > > > > > > > > From: [email protected]
> > > > > > > > > > > > [mailto:[email protected]]
> > > > > > > > > > > > On Behalf Of Alex Williamson
> > > > > > > > > > > > Sent: Wednesday, September 25, 2013 10:16 PM
> > > > > > > > > > > > To: Bhushan Bharat-R65777
> > > > > > > > > > > > Cc: [email protected]; [email protected];
> > > > > > > > > > > > [email protected]; linux-
> > > > > > > > > > > > [email protected];
> > > > > > > > > > > > [email protected];
> > > > > > > > > > > > linux- [email protected]; [email protected]; Wood
> > > > > > > > > > > > Scott-B07421; [email protected] foundation.org;
> > > > > > > > > > > > Bhushan
> > > > > > > > > > > > Bharat-R65777
> > > > > > > > > > > > Subject: Re: [PATCH 2/7] iommu: add api to get
> > > > > > > > > > > > iommu_domain of a device
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, 2013-09-19 at 12:59 +0530, Bharat Bhushan
> > wrote:
> > > > > > > > > > > > > This api return the iommu domain to which the
> > > > > > > > > > > > > device is
> > > > attached.
> > > > > > > > > > > > > The iommu_domain is required for making API calls
> > > > > > > > > > > > > related to
> > > > > > iommu.
> > > > > > > > > > > > > Follow up patches which use this API to know iommu
> > > > maping.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Bharat Bhushan
> > > > > > > > > > > > > <[email protected]>
> > > > > > > > > > > > > ---
> > > > > > > > > > > > > drivers/iommu/iommu.c | 10 ++++++++++
> > > > > > > > > > > > > include/linux/iommu.h | 7 +++++++
> > > > > > > > > > > > > 2 files changed, 17 insertions(+), 0 deletions(-)
> > > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/drivers/iommu/iommu.c
> > > > > > > > > > > > > b/drivers/iommu/iommu.c index
> > > > > > > > > > > > > fbe9ca7..6ac5f50 100644
> > > > > > > > > > > > > --- a/drivers/iommu/iommu.c
> > > > > > > > > > > > > +++ b/drivers/iommu/iommu.c
> > > > > > > > > > > > > @@ -696,6 +696,16 @@ void
> > > > > > > > > > > > > iommu_detach_device(struct iommu_domain *domain,
> > > > > > > > > > > > > struct device *dev) }
> > > > > > > > > > > > > EXPORT_SYMBOL_GPL(iommu_detach_device);
> > > > > > > > > > > > >
> > > > > > > > > > > > > +struct iommu_domain *iommu_get_dev_domain(struct
> > > > device *dev) {
> > > > > > > > > > > > > + struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > + if (unlikely(ops == NULL ||
> > > > > > > > > > > > > +ops->get_dev_iommu_domain ==
> > > > > > NULL))
> > > > > > > > > > > > > + return NULL;
> > > > > > > > > > > > > +
> > > > > > > > > > > > > + return ops->get_dev_iommu_domain(dev); }
> > > > > > > > > > > > > +EXPORT_SYMBOL_GPL(iommu_get_dev_domain);
> > > > > > > > > > > >
> > > > > > > > > > > > What prevents this from racing iommu_domain_free()?
> > > > > > > > > > > > There's no references acquired, so there's no reason
> > > > > > > > > > > > for the caller to assume the
> > > > > > > > > > pointer is valid.
> > > > > > > > > > >
> > > > > > > > > > > Sorry for late query, somehow this email went into a
> > > > > > > > > > > folder and escaped;
> > > > > > > > > > >
> > > > > > > > > > > Just to be sure, there is not lock at generic "struct
> > > > > > > > > > > iommu_domain", but IP
> > > > > > > > > > specific structure (link FSL domain) linked in
> > > > > > > > > > iommu_domain->priv have a lock, so we need to ensure
> > > > > > > > > > this race in FSL iommu code (say
> > > > > > > > > > drivers/iommu/fsl_pamu_domain.c),
> > > > right?
> > > > > > > > > >
> > > > > > > > > > No, it's not sufficient to make sure that your use of
> > > > > > > > > > the interface is race free. The interface itself needs
> > > > > > > > > > to be designed so that it's difficult to use incorrectly.
> > > > > > > > >
> > > > > > > > > So we can define
> > > > > > > > > iommu_get_dev_domain()/iommu_put_dev_domain();
> > > > > > > > > iommu_get_dev_domain() will return domain with the lock
> > > > > > > > > held, and
> > > > > > > > > iommu_put_dev_domain() will release the lock? And
> > > > > > > > > iommu_get_dev_domain() must always be followed by
> > > > > > > > > iommu_get_dev_domain().
> > > > > > > >
> > > > > > > > What lock? get/put are generally used for reference
> > > > > > > > counting, not locking in the kernel.
> > > > > > > >
> > > > > > > > > > That's not the case here. This is a backdoor to get the
> > > > > > > > > > iommu domain from the iommu driver regardless of who is
> > > > > > > > > > using
> > > > it or how.
> > > > > > > > > > The iommu domain is created and managed by vfio, so
> > > > > > > > > > shouldn't we be looking at how to do this through vfio?
> > > > > > > > >
> > > > > > > > > Let me first describe what we are doing here:
> > > > > > > > > During initialization:-
> > > > > > > > > - vfio talks to MSI system to know the MSI-page and size
> > > > > > > > > - vfio then interacts with iommu to map the MSI-page in
> > > > > > > > > iommu (IOVA is decided by userspace and physical address
> > > > > > > > > is the
> > > > > > > > > MSI-page)
> > > > > > > > > - So the IOVA subwindow mapping is created in iommu and
> > > > > > > > > yes VFIO know about
> > > > > > > > this mapping.
> > > > > > > > >
> > > > > > > > > Now do SET_IRQ(MSI/MSIX) ioctl:
> > > > > > > > > - calls pci_enable_msix()/pci_enable_msi_block(): which
> > > > > > > > > is supposed to set
> > > > > > > > MSI address/data in device.
> > > > > > > > > - So in current implementation (this patchset)
> > > > > > > > > msi-subsystem gets the IOVA
> > > > > > > > from iommu via this defined interface.
> > > > > > > > > - Are you saying that rather than getting this from
> > > > > > > > > iommu, we should get this
> > > > > > > > from vfio? What difference does this make?
> > > > > > > >
> > > > > > > > Yes, you just said above that vfio knows the msi to iova
> > > > > > > > mapping, so why go outside of vfio to find it later? The
> > > > > > > > difference is one case you can have a proper reference to
> > > > > > > > data structures to make sure the pointer you get back
> > > > > > > > actually has meaning at the time you're using it vs the code
> > > > > > > > here where you're defining an API that returns a meaningless
> > > > > > > > value
> > > > > > >
> > > > > > > With FSL-PAMU we will always get consistant data from iommu or
> > > > > > > vfio-data
> > > > > > structure.
> > > > > >
> > > > > > Great, but you're trying to add a generic API to the IOMMU
> > > > > > subsystem that's difficult to use correctly. The fact that you
> > > > > > use it correctly does not justify the API.
> > > > > >
> > > > > > > > because you can't check or
> > > > > > > > enforce that an arbitrary caller is using it correctly.
> > > > > > >
> > > > > > > I am not sure what is arbitrary caller? pdev is known to vfio,
> > > > > > > so vfio will only make
> > > > > > > pci_enable_msix()/pci_enable_msi_block() for
> > > > this pdev.
> > > > > > > If anyother code makes then it is some other unexpectedly
> > > > > > > thing happening in system, no?
> > > > > >
> > > > > > What's proposed here is a generic IOMMU API. Anybody can call
> > this.
> > > > > > What if the host SCSI driver decides to go get the iommu domain
> > > > > > for it's device (or any other device)? Does that fit your usage
> > model?
> > > > > >
> > > > > > > > It's not maintainable.
> > > > > > > > Thanks,
> > > > > > >
> > > > > > > I do not have any issue with this as well, can you also
> > > > > > > describe the type of API you are envisioning; I can think of
> > > > > > > defining some function in vfio.c/vfio_iommu*.c, make them
> > > > > > > global and declare then in include/Linux/vfio.h And include
> > > > > > > <Linux/vfio.h> in caller file
> > > > > > > (arch/powerpc/kernel/msi.c)
> > > > > >
> > > > > > Do you really want module dependencies between vfio and your
> > > > > > core kernel MSI setup? Look at the vfio external user interface
> > > > > > that
> > > > we've already defined.
> > > > > > That allows other components of the kernel to get a proper
> > > > > > reference to a vfio group. From there you can work out how to
> > > > > > get what you want. Another alternative is that vfio could
> > > > > > register an MSI to IOVA mapping with architecture code when the
> > mapping is created.
> > > > > > The MSI setup path could then do a lookup in architecture code
> > > > > > for the mapping. You could even store the MSI to IOVA mapping
> > > > > > in VFIO and create an interface where SET_IRQ passes that
> > > > > > mapping into setup
> > > > code.
> > > [Sethi Varun-B16395] What you are suggesting is that the MSI setup
> > > path queries the vfio subsystem for the mapping, rather than directly
> > > querying the iommu subsystem. So, say if we add an interface for
> > > getting MSI to IOVA mapping in the msi setup path, wouldn't this again
> > > be specific to vfio? I reckon that this interface would again ppc
> > > machine specific interface.
> >
> > Sure, if this is a generic MSI setup path then clearly vfio should not be
> > involved. But by that same notion, if this is a generic MSI setup path,
> > how can the proposed solutions guarantee that the iommu_domain or iova
> > returned is still valid in all cases? It's racy. If the caller trying
> > to setup MSI has the information needed, why doesn't it pass it in as
> > part of the setup? Thanks,
> [Sethi Varun-B16395] Agreed, the proposed interface is not clean. But, we still need an interface through which MSI driver can call in to the vfio layer.

Or one that allows vfio to pass in the iova when the interrupt is being
setup. I'm afraid any kind of reverse interface where MSI calls back
into vfio is going to be racy. Thanks,

Alex