2010-07-27 17:20:12

by Konrad Rzeszutek Wilk

Subject: [PATCH] Xen-SWIOTLB v0.8.6 used for Xen PCI pass through for PV guests.

Changes since last posting [v0.8.3 (http://lkml.org/lkml/2010/6/22/310)]:
- Coalesced the Xen-SWIOTLB set of patches in one patch.
- Moved the swiotlb-xen.c to drivers/xen.
- Added Ack's, Cc's, as appropriate.

This patchset:

These nine patches lay the groundwork for Xen Paravirtualized (PV)
domains to access PCI pass-through devices (with and without a hardware
IOMMU). These patches utilize the SWIOTLB library modifications
(http://lkml.org/lkml/2010/6/4/272).

The end users of this are:
- The Xen PCI frontend and Xen PCI [1], which require a DMA API "backend"
that understands Xen's MMU. This allows PV domains to use PCI devices.
The use case is for machines with and without a hardware IOMMU. Without a
hardware IOMMU you have a potential security hole wherein a guest domain
can use the hardware to map pages outside its memory range and slurp
pages up. As such, this is restricted to a privileged PV domain,
aka a device driver domain (similar to Qubes, but a poor man's mechanism [2]).
- Xen PV domain 0 support. Without this, domain 0 is incapable of using any
PCI devices.

This patch-set is split into two groups. The first alters the Xen components,
while the second introduces SWIOTLB-Xen.

The Xen components patches consist of:

xen: Allow unprivileged Xen domains to create iomap pages
xen: Rename the balloon lock
xen: Add xen_create_contiguous_region
xen: use _PAGE_IOMAP in ioremap to do machine mappings
vmap: add flag to allow lazy unmap to be disabled at runtime
xen/mmu: inhibit vmap aliases rather than trying to clear them out

These alter the Xen MMU, which by default utilizes a layer of indirection
wherein the Page Frame Number (PFN) is translated to the Machine Frame Number
(MFN) and vice-versa. This is required to "fool" the guest into thinking its
memory starts at PFN 0 and goes up to the available amount, while in the
background PFN 0 might just as well be MFN 1048576 (the frame at 4GB).
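
For illustration, a minimal sketch of that translation (a hedged example, not
part of the series; it assumes the get_phys_to_machine() and xen_feature()
helpers from <asm/xen/page.h> and <xen/features.h>, and omits handling of
invalid P2M entries):

#include <linux/io.h>		/* virt_to_phys() */
#include <asm/xen/page.h>	/* get_phys_to_machine() */
#include <xen/features.h>	/* xen_feature(), XENFEAT_auto_translated_physmap */

/* Sketch only: translate a guest PFN to the machine frame backing it.
 * (The reverse, MFN->PFN, goes through the machine-to-phys table.) */
static unsigned long sketch_pfn_to_mfn(unsigned long pfn)
{
	/* Auto-translated guests see PFN == MFN; no lookup needed. */
	if (xen_feature(XENFEAT_auto_translated_physmap))
		return pfn;

	/* PV guests consult the phys-to-machine (P2M) table. */
	return get_phys_to_machine(pfn);
}

/* Sketch only: turn a kernel virtual address into a machine (bus) address. */
static u64 sketch_virt_to_machine(void *vaddr)
{
	unsigned long pfn = virt_to_phys(vaddr) >> PAGE_SHIFT;
	unsigned long offset = (unsigned long)vaddr & ~PAGE_MASK;

	return ((u64)sketch_pfn_to_mfn(pfn) << PAGE_SHIFT) | offset;
}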

For PCI/DMA API calls (ioremap, pci_map_page, etc.) having PFN != MFN is
a problem, so two new mechanisms are introduced:
a). For PTEs which map the PCI/DMA pages, PFN == MFN. This is done by
utilizing the _PAGE_IOMAP flag to distinguish the PTE as an IO type.
b). A mechanism to "swizzle" the MFNs backing a PFN range so that they fall
under a certain bus width (say 32 bits). This provides a way for
SWIOTLB-Xen to allocate memory for its pool that is guaranteed to be
under the 4GB mark (see the sketch below).
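
As a rough illustration of mechanism (b), a caller could swizzle an
already-allocated buffer like this (a hedged sketch only; the GFP flags and
pool order are arbitrary assumptions, xen_create_contiguous_region() is
introduced in patch 4/9, and the real consumer is the xen_swiotlb_fixup() /
xen_swiotlb_alloc_coherent() code in patch 7/9):

#include <linux/gfp.h>
#include <xen/xen-ops.h>	/* xen_create_contiguous_region(), patch 4/9 */

/* Sketch only: get an order-N buffer whose backing MFNs are machine-
 * contiguous and addressable with 32 bits (i.e. below the 4GB mark). */
static void *sketch_alloc_dma_pool(unsigned int order)
{
	unsigned long vstart = __get_free_pages(GFP_KERNEL, order);

	if (!vstart)
		return NULL;

	/* Exchange the backing MFNs with Xen for contiguous, <4GB ones. */
	if (xen_create_contiguous_region(vstart, order, 32)) {
		free_pages(vstart, order);
		return NULL;
	}
	return (void *)vstart;
}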

We also introduce a mechanism to inhibit vmap aliases altogether, as otherwise
we would need to clear them while possibly running in atomic context.

The SWIOTLB-Xen adds a library that is only used if the machine is detected
to be running under Xen. It utilizes the SWIOTLB library bookkeeping functions
(swiotlb_tbl_*) and only deals with the virtual to [physical, bus] (and vice-versa)
address translations.
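
The heart of that translation is small; a sketch is below for reference (the
full versions appear in drivers/xen/swiotlb-xen.c in patch 7/9 as
xen_phys_to_bus() and friends):

#include <linux/types.h>	/* dma_addr_t, phys_addr_t */
#include <linux/io.h>		/* virt_to_phys() */
#include <asm/xen/page.h>	/* phys_to_machine(), machine_to_phys(), XPADDR, XMADDR */

/* Sketch: physical (pseudo-phys) address to bus (machine) address. */
static dma_addr_t sketch_phys_to_bus(phys_addr_t paddr)
{
	return phys_to_machine(XPADDR(paddr)).maddr;
}

/* Sketch: bus (machine) address back to physical (pseudo-phys) address. */
static phys_addr_t sketch_bus_to_phys(dma_addr_t baddr)
{
	return machine_to_phys(XMADDR(baddr)).paddr;
}

/* Sketch: kernel virtual address straight to bus address. */
static dma_addr_t sketch_virt_to_bus(void *address)
{
	return sketch_phys_to_bus(virt_to_phys(address));
}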

The diffstat:

arch/x86/include/asm/xen/page.h | 8 +-
arch/x86/include/asm/xen/swiotlb-xen.h | 14 +
arch/x86/kernel/pci-dma.c | 7 +-
arch/x86/xen/Makefile | 1 +
arch/x86/xen/enlighten.c | 4 +
arch/x86/xen/mmu.c | 293 ++++++++++++++++++-
arch/x86/xen/pci-swiotlb-xen.c | 58 ++++
drivers/xen/Kconfig | 4 +
drivers/xen/Makefile | 3 +-
drivers/xen/balloon.c | 15 +-
drivers/xen/swiotlb-xen.c | 515 ++++++++++++++++++++++++++++++++
include/linux/vmalloc.h | 2 +
include/xen/interface/memory.h | 50 +++
include/xen/swiotlb-xen.h | 65 ++++
include/xen/xen-ops.h | 6 +
mm/vmalloc.c | 4 +
16 files changed, 1024 insertions(+), 25 deletions(-)

P.S.
The patchset that this depends on (which is in linux-next), is:
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb-2.6.git stable/swiotlb-0.8.3
This patchset (Xen-SWIOTLB) is available at:
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb-2.6.git devel/xen-swiotlb-0.8.6
And the consumer of this (the Xen PCI frontend):
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/xen-pcifront-0.3
The tree with all of this on top of v2.6.35-rc6:
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/merge.2.6.35-rc6.t2


2010-07-27 17:19:49

by Konrad Rzeszutek Wilk

Subject: [PATCH 2/9] xen: Allow unprivileged Xen domains to create iomap pages

From: Alex Nixon <[email protected]>

PV DomU domains are allowed to map hardware MFNs for PCI passthrough,
but are not generally allowed to map raw machine pages. In particular,
various pieces of code try to map DMI and ACPI tables in the ISA ROM
range. We disallow _PAGE_IOMAP for those mappings, so that they are
redirected to a set of local zeroed pages we reserve for that purpose.

[ Impact: prevent passthrough of ISA space, as we only allow PCI ]

Signed-off-by: Alex Nixon <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/xen/enlighten.c | 4 ++++
arch/x86/xen/mmu.c | 18 +++++++++++++++---
2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 65d8d79..3254e8b 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1145,6 +1145,10 @@ asmlinkage void __init xen_start_kernel(void)

pgd = (pgd_t *)xen_start_info->pt_base;

+ if (!xen_initial_domain())
+ __supported_pte_mask &= ~(_PAGE_PWT | _PAGE_PCD);
+
+ __supported_pte_mask |= _PAGE_IOMAP;
/* Don't do the full vcpu_info placement stuff until we have a
possible map and a non-dummy shared_info. */
per_cpu(xen_vcpu, 0) = &HYPERVISOR_shared_info->vcpu_info[0];
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index a4dea9d..a5577f5 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -51,6 +51,7 @@
#include <asm/mmu_context.h>
#include <asm/setup.h>
#include <asm/paravirt.h>
+#include <asm/e820.h>
#include <asm/linkage.h>

#include <asm/xen/hypercall.h>
@@ -381,7 +382,7 @@ static bool xen_page_pinned(void *ptr)

static bool xen_iomap_pte(pte_t pte)
{
- return xen_initial_domain() && (pte_flags(pte) & _PAGE_IOMAP);
+ return pte_flags(pte) & _PAGE_IOMAP;
}

static void xen_set_iomap_pte(pte_t *ptep, pte_t pteval)
@@ -583,10 +584,21 @@ PV_CALLEE_SAVE_REGS_THUNK(xen_pgd_val);

pte_t xen_make_pte(pteval_t pte)
{
- if (unlikely(xen_initial_domain() && (pte & _PAGE_IOMAP)))
+ phys_addr_t addr = (pte & PTE_PFN_MASK);
+
+ /*
+ * Unprivileged domains are allowed to do IOMAPpings for
+ * PCI passthrough, but not map ISA space. The ISA
+ * mappings are just dummy local mappings to keep other
+ * parts of the kernel happy.
+ */
+ if (unlikely(pte & _PAGE_IOMAP) &&
+ (xen_initial_domain() || addr >= ISA_END_ADDRESS)) {
pte = iomap_pte(pte);
- else
+ } else {
+ pte &= ~_PAGE_IOMAP;
pte = pte_pfn_to_mfn(pte);
+ }

return native_make_pte(pte);
}
--
1.7.0.1

2010-07-27 17:19:53

by Konrad Rzeszutek Wilk

Subject: [PATCH 4/9] xen: Add xen_create_contiguous_region

From: Alex Nixon <[email protected]>

A memory region must be physically contiguous in order to be accessed
through DMA. This patch adds xen_create_contiguous_region, which
ensures a region of contiguous virtual memory is also physically
contiguous.

Based on Stephen Tweedie's port of the 2.6.18-xen version.

Remove contiguous_bitmap[] as it's no longer needed.

Ported from linux-2.6.18-xen.hg 707:e410857fd83c

[ Impact: add Xen-internal API to make pages phys-contig ]

Signed-off-by: Alex Nixon <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Ian Campbell <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/xen/mmu.c | 201 ++++++++++++++++++++++++++++++++++++++++
include/xen/interface/memory.h | 42 ++++++++
include/xen/xen-ops.h | 6 +
3 files changed, 249 insertions(+), 0 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 9e0d82f..eb51402 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -53,6 +53,7 @@
#include <asm/paravirt.h>
#include <asm/e820.h>
#include <asm/linkage.h>
+#include <asm/page.h>

#include <asm/xen/hypercall.h>
#include <asm/xen/hypervisor.h>
@@ -2027,6 +2028,206 @@ void __init xen_init_mmu_ops(void)
pv_mmu_ops = xen_mmu_ops;
}

+/* Protected by xen_reservation_lock. */
+#define MAX_CONTIG_ORDER 9 /* 2MB */
+static unsigned long discontig_frames[1<<MAX_CONTIG_ORDER];
+
+#define VOID_PTE (mfn_pte(0, __pgprot(0)))
+static void xen_zap_pfn_range(unsigned long vaddr, unsigned int order,
+ unsigned long *in_frames,
+ unsigned long *out_frames)
+{
+ int i;
+ struct multicall_space mcs;
+
+ xen_mc_batch();
+ for (i = 0; i < (1UL<<order); i++, vaddr += PAGE_SIZE) {
+ mcs = __xen_mc_entry(0);
+
+ if (in_frames)
+ in_frames[i] = virt_to_mfn(vaddr);
+
+ MULTI_update_va_mapping(mcs.mc, vaddr, VOID_PTE, 0);
+ set_phys_to_machine(virt_to_pfn(vaddr), INVALID_P2M_ENTRY);
+
+ if (out_frames)
+ out_frames[i] = virt_to_pfn(vaddr);
+ }
+ xen_mc_issue(0);
+}
+
+/*
+ * Update the pfn-to-mfn mappings for a virtual address range, either to
+ * point to an array of mfns, or contiguously from a single starting
+ * mfn.
+ */
+static void xen_remap_exchanged_ptes(unsigned long vaddr, int order,
+ unsigned long *mfns,
+ unsigned long first_mfn)
+{
+ unsigned i, limit;
+ unsigned long mfn;
+
+ xen_mc_batch();
+
+ limit = 1u << order;
+ for (i = 0; i < limit; i++, vaddr += PAGE_SIZE) {
+ struct multicall_space mcs;
+ unsigned flags;
+
+ mcs = __xen_mc_entry(0);
+ if (mfns)
+ mfn = mfns[i];
+ else
+ mfn = first_mfn + i;
+
+ if (i < (limit - 1))
+ flags = 0;
+ else {
+ if (order == 0)
+ flags = UVMF_INVLPG | UVMF_ALL;
+ else
+ flags = UVMF_TLB_FLUSH | UVMF_ALL;
+ }
+
+ MULTI_update_va_mapping(mcs.mc, vaddr,
+ mfn_pte(mfn, PAGE_KERNEL), flags);
+
+ set_phys_to_machine(virt_to_pfn(vaddr), mfn);
+ }
+
+ xen_mc_issue(0);
+}
+
+/*
+ * Perform the hypercall to exchange a region of our pfns to point to
+ * memory with the required contiguous alignment. Takes the pfns as
+ * input, and populates mfns as output.
+ *
+ * Returns a success code indicating whether the hypervisor was able to
+ * satisfy the request or not.
+ */
+static int xen_exchange_memory(unsigned long extents_in, unsigned int order_in,
+ unsigned long *pfns_in,
+ unsigned long extents_out,
+ unsigned int order_out,
+ unsigned long *mfns_out,
+ unsigned int address_bits)
+{
+ long rc;
+ int success;
+
+ struct xen_memory_exchange exchange = {
+ .in = {
+ .nr_extents = extents_in,
+ .extent_order = order_in,
+ .extent_start = pfns_in,
+ .domid = DOMID_SELF
+ },
+ .out = {
+ .nr_extents = extents_out,
+ .extent_order = order_out,
+ .extent_start = mfns_out,
+ .address_bits = address_bits,
+ .domid = DOMID_SELF
+ }
+ };
+
+ BUG_ON(extents_in << order_in != extents_out << order_out);
+
+ rc = HYPERVISOR_memory_op(XENMEM_exchange, &exchange);
+ success = (exchange.nr_exchanged == extents_in);
+
+ BUG_ON(!success && ((exchange.nr_exchanged != 0) || (rc == 0)));
+ BUG_ON(success && (rc != 0));
+
+ return success;
+}
+
+int xen_create_contiguous_region(unsigned long vstart, unsigned int order,
+ unsigned int address_bits)
+{
+ unsigned long *in_frames = discontig_frames, out_frame;
+ unsigned long flags;
+ int success;
+
+ /*
+ * Currently an auto-translated guest will not perform I/O, nor will
+ * it require PAE page directories below 4GB. Therefore any calls to
+ * this function are redundant and can be ignored.
+ */
+
+ if (xen_feature(XENFEAT_auto_translated_physmap))
+ return 0;
+
+ if (unlikely(order > MAX_CONTIG_ORDER))
+ return -ENOMEM;
+
+ memset((void *) vstart, 0, PAGE_SIZE << order);
+
+ vm_unmap_aliases();
+
+ spin_lock_irqsave(&xen_reservation_lock, flags);
+
+ /* 1. Zap current PTEs, remembering MFNs. */
+ xen_zap_pfn_range(vstart, order, in_frames, NULL);
+
+ /* 2. Get a new contiguous memory extent. */
+ out_frame = virt_to_pfn(vstart);
+ success = xen_exchange_memory(1UL << order, 0, in_frames,
+ 1, order, &out_frame,
+ address_bits);
+
+ /* 3. Map the new extent in place of old pages. */
+ if (success)
+ xen_remap_exchanged_ptes(vstart, order, NULL, out_frame);
+ else
+ xen_remap_exchanged_ptes(vstart, order, in_frames, 0);
+
+ spin_unlock_irqrestore(&xen_reservation_lock, flags);
+
+ return success ? 0 : -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(xen_create_contiguous_region);
+
+void xen_destroy_contiguous_region(unsigned long vstart, unsigned int order)
+{
+ unsigned long *out_frames = discontig_frames, in_frame;
+ unsigned long flags;
+ int success;
+
+ if (xen_feature(XENFEAT_auto_translated_physmap))
+ return;
+
+ if (unlikely(order > MAX_CONTIG_ORDER))
+ return;
+
+ memset((void *) vstart, 0, PAGE_SIZE << order);
+
+ vm_unmap_aliases();
+
+ spin_lock_irqsave(&xen_reservation_lock, flags);
+
+ /* 1. Find start MFN of contiguous extent. */
+ in_frame = virt_to_mfn(vstart);
+
+ /* 2. Zap current PTEs. */
+ xen_zap_pfn_range(vstart, order, NULL, out_frames);
+
+ /* 3. Do the exchange for non-contiguous MFNs. */
+ success = xen_exchange_memory(1, order, &in_frame, 1UL << order,
+ 0, out_frames, 0);
+
+ /* 4. Map new pages in place of old pages. */
+ if (success)
+ xen_remap_exchanged_ptes(vstart, order, out_frames, 0);
+ else
+ xen_remap_exchanged_ptes(vstart, order, NULL, in_frame);
+
+ spin_unlock_irqrestore(&xen_reservation_lock, flags);
+}
+EXPORT_SYMBOL_GPL(xen_destroy_contiguous_region);
+
#ifdef CONFIG_XEN_DEBUG_FS

static struct dentry *d_mmu_debug;
diff --git a/include/xen/interface/memory.h b/include/xen/interface/memory.h
index e6adce6..d3938d3 100644
--- a/include/xen/interface/memory.h
+++ b/include/xen/interface/memory.h
@@ -55,6 +55,48 @@ struct xen_memory_reservation {
DEFINE_GUEST_HANDLE_STRUCT(xen_memory_reservation);

/*
+ * An atomic exchange of memory pages. If return code is zero then
+ * @out.extent_list provides GMFNs of the newly-allocated memory.
+ * Returns zero on complete success, otherwise a negative error code.
+ * On complete success then always @nr_exchanged == @in.nr_extents.
+ * On partial success @nr_exchanged indicates how much work was done.
+ */
+#define XENMEM_exchange 11
+struct xen_memory_exchange {
+ /*
+ * [IN] Details of memory extents to be exchanged (GMFN bases).
+ * Note that @in.address_bits is ignored and unused.
+ */
+ struct xen_memory_reservation in;
+
+ /*
+ * [IN/OUT] Details of new memory extents.
+ * We require that:
+ * 1. @in.domid == @out.domid
+ * 2. @in.nr_extents << @in.extent_order ==
+ * @out.nr_extents << @out.extent_order
+ * 3. @in.extent_start and @out.extent_start lists must not overlap
+ * 4. @out.extent_start lists GPFN bases to be populated
+ * 5. @out.extent_start is overwritten with allocated GMFN bases
+ */
+ struct xen_memory_reservation out;
+
+ /*
+ * [OUT] Number of input extents that were successfully exchanged:
+ * 1. The first @nr_exchanged input extents were successfully
+ * deallocated.
+ * 2. The corresponding first entries in the output extent list correctly
+ * indicate the GMFNs that were successfully exchanged.
+ * 3. All other input and output extents are untouched.
+ * 4. If not all input extents are exchanged then the return code of this
+ * command will be non-zero.
+ * 5. THIS FIELD MUST BE INITIALISED TO ZERO BY THE CALLER!
+ */
+ unsigned long nr_exchanged;
+};
+
+DEFINE_GUEST_HANDLE_STRUCT(xen_memory_exchange);
+/*
* Returns the maximum machine frame number of mapped RAM in this system.
* This command always succeeds (it never returns an error code).
* arg == NULL.
diff --git a/include/xen/xen-ops.h b/include/xen/xen-ops.h
index 883a21b..d789c93 100644
--- a/include/xen/xen-ops.h
+++ b/include/xen/xen-ops.h
@@ -14,4 +14,10 @@ void xen_mm_unpin_all(void);
void xen_timer_resume(void);
void xen_arch_resume(void);

+extern unsigned long *xen_contiguous_bitmap;
+int xen_create_contiguous_region(unsigned long vstart, unsigned int order,
+ unsigned int address_bits);
+
+void xen_destroy_contiguous_region(unsigned long vstart, unsigned int order);
+
#endif /* INCLUDE_XEN_OPS_H */
--
1.7.0.1

2010-07-27 17:19:55

by Konrad Rzeszutek Wilk

Subject: [PATCH 8/9] pci-swiotlb-xen: Add glue code to setup dma_ops utilizing xen_swiotlb_* functions.

We add the glue code that sets up a dma_ops structure with the
xen_swiotlb_* functions. The code turns on the xen_swiotlb flag
when it detects that it is running under Xen and is either
in privileged mode or the iommu=soft flag was passed in.

It also disables the bare-metal SWIOTLB if the Xen-SWIOTLB has
been enabled.

Note: The Xen-SWIOTLB is only built when CONFIG_XEN is enabled.

Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
Cc: FUJITA Tomonori <[email protected]>
Cc: Albert Herranz <[email protected]>
Cc: Ian Campbell <[email protected]>
---
arch/x86/include/asm/xen/swiotlb-xen.h | 14 ++++++++
arch/x86/xen/Makefile | 1 +
arch/x86/xen/pci-swiotlb-xen.c | 58 ++++++++++++++++++++++++++++++++
3 files changed, 73 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/include/asm/xen/swiotlb-xen.h
create mode 100644 arch/x86/xen/pci-swiotlb-xen.c

diff --git a/arch/x86/include/asm/xen/swiotlb-xen.h b/arch/x86/include/asm/xen/swiotlb-xen.h
new file mode 100644
index 0000000..1be1ab7
--- /dev/null
+++ b/arch/x86/include/asm/xen/swiotlb-xen.h
@@ -0,0 +1,14 @@
+#ifndef _ASM_X86_SWIOTLB_XEN_H
+#define _ASM_X86_SWIOTLB_XEN_H
+
+#ifdef CONFIG_SWIOTLB_XEN
+extern int xen_swiotlb;
+extern int __init pci_xen_swiotlb_detect(void);
+extern void __init pci_xen_swiotlb_init(void);
+#else
+#define xen_swiotlb (0)
+static inline int __init pci_xen_swiotlb_detect(void) { return 0; }
+static inline void __init pci_xen_swiotlb_init(void) { }
+#endif
+
+#endif /* _ASM_X86_SWIOTLB_XEN_H */
diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
index 3bb4fc2..32af238 100644
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -18,3 +18,4 @@ obj-$(CONFIG_SMP) += smp.o
obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= spinlock.o
obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o

+obj-$(CONFIG_SWIOTLB_XEN) += pci-swiotlb-xen.o
diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
new file mode 100644
index 0000000..a013ec9
--- /dev/null
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -0,0 +1,58 @@
+/* Glue code to lib/swiotlb-xen.c */
+
+#include <linux/dma-mapping.h>
+#include <xen/swiotlb-xen.h>
+
+#include <asm/xen/hypervisor.h>
+#include <xen/xen.h>
+
+int xen_swiotlb __read_mostly;
+
+static struct dma_map_ops xen_swiotlb_dma_ops = {
+ .mapping_error = xen_swiotlb_dma_mapping_error,
+ .alloc_coherent = xen_swiotlb_alloc_coherent,
+ .free_coherent = xen_swiotlb_free_coherent,
+ .sync_single_for_cpu = xen_swiotlb_sync_single_for_cpu,
+ .sync_single_for_device = xen_swiotlb_sync_single_for_device,
+ .sync_sg_for_cpu = xen_swiotlb_sync_sg_for_cpu,
+ .sync_sg_for_device = xen_swiotlb_sync_sg_for_device,
+ .map_sg = xen_swiotlb_map_sg_attrs,
+ .unmap_sg = xen_swiotlb_unmap_sg_attrs,
+ .map_page = xen_swiotlb_map_page,
+ .unmap_page = xen_swiotlb_unmap_page,
+ .dma_supported = xen_swiotlb_dma_supported,
+};
+
+/*
+ * pci_xen_swiotlb_detect - set xen_swiotlb to 1 if necessary
+ *
+ * This returns non-zero if we are forced to use xen_swiotlb (by the boot
+ * option).
+ */
+int __init pci_xen_swiotlb_detect(void)
+{
+
+ /* If running as a PV guest, either iommu=soft or swiotlb=force will
+ * activate this IOMMU. If running as a privileged PV domain, activate it
+ * regardless.
+ */
+ if ((xen_initial_domain() || swiotlb || swiotlb_force) &&
+ (xen_pv_domain()))
+ xen_swiotlb = 1;
+
+ /* If we are running under Xen, we MUST disable the native SWIOTLB.
+ * Don't worry about swiotlb_force flag activating the native, as
+ * the 'swiotlb' flag is the only one turning it on. */
+ if (xen_pv_domain())
+ swiotlb = 0;
+
+ return xen_swiotlb;
+}
+
+void __init pci_xen_swiotlb_init(void)
+{
+ if (xen_swiotlb) {
+ xen_swiotlb_init(1);
+ dma_ops = &xen_swiotlb_dma_ops;
+ }
+}
--
1.7.0.1

2010-07-27 17:20:27

by Konrad Rzeszutek Wilk

Subject: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

It is paramount that we call pci_xen_swiotlb_detect before
pci_swiotlb_detect as both implementations use the 'swiotlb'
and 'swiotlb_force' flags. pci_xen_swiotlb_detect() inhibits
the swiotlb_force and swiotlb flag so that the native SWIOTLB
implementation is not enabled when running under Xen.

Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
Cc: FUJITA Tomonori <[email protected]>
Cc: Albert Herranz <[email protected]>
Cc: Ian Campbell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
Cc: [email protected]
Cc: Jesse Barnes <[email protected]>
---
arch/x86/kernel/pci-dma.c | 7 +++++--
1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 4b7e3d8..9f07cfc 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -13,6 +13,7 @@
#include <asm/calgary.h>
#include <asm/amd_iommu.h>
#include <asm/x86_init.h>
+#include <asm/xen/swiotlb-xen.h>

static int forbid_dac __read_mostly;

@@ -132,7 +133,7 @@ void __init pci_iommu_alloc(void)
/* free the range so iommu could get some range less than 4G */
dma32_free_bootmem();

- if (pci_swiotlb_detect())
+ if (pci_xen_swiotlb_detect() || pci_swiotlb_detect())
goto out;

gart_iommu_hole_init();
@@ -144,6 +145,8 @@ void __init pci_iommu_alloc(void)
/* needs to be called after gart_iommu_hole_init */
amd_iommu_detect();
out:
+ pci_xen_swiotlb_init();
+
pci_swiotlb_init();
}

@@ -296,7 +299,7 @@ static int __init pci_iommu_init(void)
#endif
x86_init.iommu.iommu_init();

- if (swiotlb) {
+ if (swiotlb || xen_swiotlb) {
printk(KERN_INFO "PCI-DMA: "
"Using software bounce buffering for IO (SWIOTLB)\n");
swiotlb_print_info();
--
1.7.0.1

2010-07-27 17:20:42

by Konrad Rzeszutek Wilk

Subject: [PATCH 6/9] xen/mmu: inhibit vmap aliases rather than trying to clear them out

From: Jeremy Fitzhardinge <[email protected]>

Rather than trying to deal with aliases once they appear, just completely
inhibit them. Mostly the removal of aliases was managable, but it comes
unstuck in xen_create_contiguous_region() because it gets executed at
interrupt time (as a result of dma_alloc_coherent()), which causes all
sorts of confusion in the vmap code, as it was never intended to be run
in interrupt context.

This has the unfortunate side effect of removing all the unmap batching
that the vmap code so carefully added, but that can't be helped.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/xen/mmu.c | 10 +++-------
1 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index eb51402..ef5728d 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -42,6 +42,7 @@
#include <linux/highmem.h>
#include <linux/debugfs.h>
#include <linux/bug.h>
+#include <linux/vmalloc.h>
#include <linux/module.h>
#include <linux/gfp.h>

@@ -1015,8 +1016,6 @@ static int xen_pin_page(struct mm_struct *mm, struct page *page,
read-only, and can be pinned. */
static void __xen_pgd_pin(struct mm_struct *mm, pgd_t *pgd)
{
- vm_unmap_aliases();
-
xen_mc_batch();

if (__xen_pgd_walk(mm, pgd, xen_pin_page, USER_LIMIT)) {
@@ -1580,7 +1579,6 @@ static void xen_alloc_ptpage(struct mm_struct *mm, unsigned long pfn, unsigned l
if (PagePinned(virt_to_page(mm->pgd))) {
SetPagePinned(page);

- vm_unmap_aliases();
if (!PageHighMem(page)) {
make_lowmem_page_readonly(__va(PFN_PHYS((unsigned long)pfn)));
if (level == PT_PTE && USE_SPLIT_PTLOCKS)
@@ -2026,6 +2024,8 @@ void __init xen_init_mmu_ops(void)
x86_init.paging.pagetable_setup_start = xen_pagetable_setup_start;
x86_init.paging.pagetable_setup_done = xen_pagetable_setup_done;
pv_mmu_ops = xen_mmu_ops;
+
+ vmap_lazy_unmap = false;
}

/* Protected by xen_reservation_lock. */
@@ -2165,8 +2165,6 @@ int xen_create_contiguous_region(unsigned long vstart, unsigned int order,

memset((void *) vstart, 0, PAGE_SIZE << order);

- vm_unmap_aliases();
-
spin_lock_irqsave(&xen_reservation_lock, flags);

/* 1. Zap current PTEs, remembering MFNs. */
@@ -2204,8 +2202,6 @@ void xen_destroy_contiguous_region(unsigned long vstart, unsigned int order)

memset((void *) vstart, 0, PAGE_SIZE << order);

- vm_unmap_aliases();
-
spin_lock_irqsave(&xen_reservation_lock, flags);

/* 1. Find start MFN of contiguous extent. */
--
1.7.0.1

2010-07-27 17:20:45

by Konrad Rzeszutek Wilk

Subject: [PATCH 7/9] swiotlb-xen: SWIOTLB library for Xen PV guest with PCI passthrough.

This patchset:

PV guests under Xen run in a non-contiguous memory architecture.

When PCI pass-through is utilized, this necessitates an IOMMU for
translating bus (DMA) to virtual and vice-versa and also providing a
mechanism to have contiguous pages for device drivers operations (say DMA
operations).

Specifically, under Xen the Linux idea of pages is an illusion. It
assumes that pages start at zero and go up to the available memory. To
help with that, the Linux Xen MMU provides a lookup mechanism to
translate the page frame numbers (PFN) to machine frame numbers (MFN)
and vice-versa. The MFNs are the "real" frame numbers. Furthermore,
memory is not contiguous. The Xen hypervisor stitches memory for guests
from different pools, which means there is no guarantee that PFN==MFN
and PFN+1==MFN+1. Lastly, with Xen 4.0, pages (in debug mode) are
allocated in descending order (high to low), meaning the guest might
never get any MFNs under the 4GB mark.

Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Acked-by: Jeremy Fitzhardinge <[email protected]>
Cc: FUJITA Tomonori <[email protected]>
Cc: Albert Herranz <[email protected]>
Cc: Ian Campbell <[email protected]>
---
drivers/xen/Kconfig | 4 +
drivers/xen/Makefile | 3 +-
drivers/xen/swiotlb-xen.c | 515 +++++++++++++++++++++++++++++++++++++++++++++
include/xen/swiotlb-xen.h | 65 ++++++
4 files changed, 586 insertions(+), 1 deletions(-)
create mode 100644 drivers/xen/swiotlb-xen.c
create mode 100644 include/xen/swiotlb-xen.h

diff --git a/drivers/xen/Kconfig b/drivers/xen/Kconfig
index fad3df2..97199c2 100644
--- a/drivers/xen/Kconfig
+++ b/drivers/xen/Kconfig
@@ -62,4 +62,8 @@ config XEN_SYS_HYPERVISOR
virtual environment, /sys/hypervisor will still be present,
but will have no xen contents.

+config SWIOTLB_XEN
+ def_bool y
+ depends on SWIOTLB
+
endmenu
diff --git a/drivers/xen/Makefile b/drivers/xen/Makefile
index 7c28434..85f84cf 100644
--- a/drivers/xen/Makefile
+++ b/drivers/xen/Makefile
@@ -9,4 +9,5 @@ obj-$(CONFIG_XEN_XENCOMM) += xencomm.o
obj-$(CONFIG_XEN_BALLOON) += balloon.o
obj-$(CONFIG_XEN_DEV_EVTCHN) += evtchn.o
obj-$(CONFIG_XENFS) += xenfs/
-obj-$(CONFIG_XEN_SYS_HYPERVISOR) += sys-hypervisor.o
\ No newline at end of file
+obj-$(CONFIG_XEN_SYS_HYPERVISOR) += sys-hypervisor.o
+obj-$(CONFIG_SWIOTLB_XEN) += swiotlb-xen.o
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
new file mode 100644
index 0000000..54469c3
--- /dev/null
+++ b/drivers/xen/swiotlb-xen.c
@@ -0,0 +1,515 @@
+/*
+ * Copyright 2010
+ * by Konrad Rzeszutek Wilk <[email protected]>
+ *
+ * This code provides an IOMMU for Xen PV guests with PCI passthrough.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License v2.0 as published by
+ * the Free Software Foundation
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * PV guests under Xen run in a non-contiguous memory architecture.
+ *
+ * When PCI pass-through is utilized, this necessitates an IOMMU for
+ * translating bus (DMA) to virtual and vice-versa and also providing a
+ * mechanism to have contiguous pages for device drivers operations (say DMA
+ * operations).
+ *
+ * Specifically, under Xen the Linux idea of pages is an illusion. It
+ * assumes that pages start at zero and go up to the available memory. To
+ * help with that, the Linux Xen MMU provides a lookup mechanism to
+ * translate the page frame numbers (PFN) to machine frame numbers (MFN)
+ * and vice-versa. The MFNs are the "real" frame numbers. Furthermore,
+ * memory is not contiguous. The Xen hypervisor stitches memory for guests
+ * from different pools, which means there is no guarantee that PFN==MFN
+ * and PFN+1==MFN+1. Lastly, with Xen 4.0, pages (in debug mode) are
+ * allocated in descending order (high to low), meaning the guest might
+ * never get any MFNs under the 4GB mark.
+ *
+ */
+
+#include <linux/bootmem.h>
+#include <linux/dma-mapping.h>
+#include <xen/swiotlb-xen.h>
+#include <xen/page.h>
+#include <xen/xen-ops.h>
+/*
+ * Used to do a quick range check in swiotlb_tbl_unmap_single and
+ * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
+ * API.
+ */
+
+static char *xen_io_tlb_start, *xen_io_tlb_end;
+static unsigned long xen_io_tlb_nslabs;
+/*
+ * Quick lookup value of the bus address of the IOTLB.
+ */
+
+u64 start_dma_addr;
+
+static dma_addr_t xen_phys_to_bus(phys_addr_t paddr)
+{
+ return phys_to_machine(XPADDR(paddr)).maddr;
+}
+
+static phys_addr_t xen_bus_to_phys(dma_addr_t baddr)
+{
+ return machine_to_phys(XMADDR(baddr)).paddr;
+}
+
+static dma_addr_t xen_virt_to_bus(void *address)
+{
+ return xen_phys_to_bus(virt_to_phys(address));
+}
+
+static int check_pages_physically_contiguous(unsigned long pfn,
+ unsigned int offset,
+ size_t length)
+{
+ unsigned long next_mfn;
+ int i;
+ int nr_pages;
+
+ next_mfn = pfn_to_mfn(pfn);
+ nr_pages = (offset + length + PAGE_SIZE-1) >> PAGE_SHIFT;
+
+ for (i = 1; i < nr_pages; i++) {
+ if (pfn_to_mfn(++pfn) != ++next_mfn)
+ return 0;
+ }
+ return 1;
+}
+
+static int range_straddles_page_boundary(phys_addr_t p, size_t size)
+{
+ unsigned long pfn = PFN_DOWN(p);
+ unsigned int offset = p & ~PAGE_MASK;
+
+ if (offset + size <= PAGE_SIZE)
+ return 0;
+ if (check_pages_physically_contiguous(pfn, offset, size))
+ return 0;
+ return 1;
+}
+
+static int is_xen_swiotlb_buffer(dma_addr_t dma_addr)
+{
+ unsigned long mfn = PFN_DOWN(dma_addr);
+ unsigned long pfn = mfn_to_local_pfn(mfn);
+ phys_addr_t paddr;
+
+ /* If the address is outside our domain, it CAN
+ * have the same virtual address as another address
+ * in our domain. Therefore _only_ check address within our domain.
+ */
+ if (pfn_valid(pfn)) {
+ paddr = PFN_PHYS(pfn);
+ return paddr >= virt_to_phys(xen_io_tlb_start) &&
+ paddr < virt_to_phys(xen_io_tlb_end);
+ }
+ return 0;
+}
+
+static int max_dma_bits = 32;
+
+static int
+xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
+{
+ int i, rc;
+ int dma_bits;
+
+ dma_bits = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT) + PAGE_SHIFT;
+
+ i = 0;
+ do {
+ int slabs = min(nslabs - i, (unsigned long)IO_TLB_SEGSIZE);
+
+ do {
+ rc = xen_create_contiguous_region(
+ (unsigned long)buf + (i << IO_TLB_SHIFT),
+ get_order(slabs << IO_TLB_SHIFT),
+ dma_bits);
+ } while (rc && dma_bits++ < max_dma_bits);
+ if (rc)
+ return rc;
+
+ i += slabs;
+ } while (i < nslabs);
+ return 0;
+}
+
+void __init xen_swiotlb_init(int verbose)
+{
+ unsigned long bytes;
+ int rc;
+
+ xen_io_tlb_nslabs = (64 * 1024 * 1024 >> IO_TLB_SHIFT);
+ xen_io_tlb_nslabs = ALIGN(xen_io_tlb_nslabs, IO_TLB_SEGSIZE);
+
+ bytes = xen_io_tlb_nslabs << IO_TLB_SHIFT;
+
+ /*
+ * Get IO TLB memory from any location.
+ */
+ xen_io_tlb_start = alloc_bootmem(bytes);
+ if (!xen_io_tlb_start)
+ panic("Cannot allocate SWIOTLB buffer");
+
+ xen_io_tlb_end = xen_io_tlb_start + bytes;
+ /*
+ * And replace that memory with pages under 4GB.
+ */
+ rc = xen_swiotlb_fixup(xen_io_tlb_start,
+ bytes,
+ xen_io_tlb_nslabs);
+ if (rc)
+ goto error;
+
+ start_dma_addr = xen_virt_to_bus(xen_io_tlb_start);
+ swiotlb_init_with_tbl(xen_io_tlb_start, xen_io_tlb_nslabs, verbose);
+
+ return;
+error:
+ panic("DMA(%d): Failed to exchange pages allocated for DMA with Xen! "\
+ "We either don't have the permission or you do not have enough"\
+ "free memory under 4GB!\n", rc);
+}
+
+void *
+xen_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
+ dma_addr_t *dma_handle, gfp_t flags)
+{
+ void *ret;
+ int order = get_order(size);
+ u64 dma_mask = DMA_BIT_MASK(32);
+ unsigned long vstart;
+
+ /*
+ * Ignore region specifiers - the kernel's ideas of
+ * pseudo-phys memory layout has nothing to do with the
+ * machine physical layout. We can't allocate highmem
+ * because we can't return a pointer to it.
+ */
+ flags &= ~(__GFP_DMA | __GFP_HIGHMEM);
+
+ if (dma_alloc_from_coherent(hwdev, size, dma_handle, &ret))
+ return ret;
+
+ vstart = __get_free_pages(flags, order);
+ ret = (void *)vstart;
+
+ if (hwdev && hwdev->coherent_dma_mask)
+ dma_mask = dma_alloc_coherent_mask(hwdev, flags);
+
+ if (ret) {
+ if (xen_create_contiguous_region(vstart, order,
+ fls64(dma_mask)) != 0) {
+ free_pages(vstart, order);
+ return NULL;
+ }
+ memset(ret, 0, size);
+ *dma_handle = virt_to_machine(ret).maddr;
+ }
+ return ret;
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_alloc_coherent);
+
+void
+xen_swiotlb_free_coherent(struct device *hwdev, size_t size, void *vaddr,
+ dma_addr_t dev_addr)
+{
+ int order = get_order(size);
+
+ if (dma_release_from_coherent(hwdev, order, vaddr))
+ return;
+
+ xen_destroy_contiguous_region((unsigned long)vaddr, order);
+ free_pages((unsigned long)vaddr, order);
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_free_coherent);
+
+
+/*
+ * Map a single buffer of the indicated size for DMA in streaming mode. The
+ * physical address to use is returned.
+ *
+ * Once the device is given the dma address, the device owns this memory until
+ * either xen_swiotlb_unmap_page or xen_swiotlb_dma_sync_single is performed.
+ */
+dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs)
+{
+ phys_addr_t phys = page_to_phys(page) + offset;
+ dma_addr_t dev_addr = xen_phys_to_bus(phys);
+ void *map;
+
+ BUG_ON(dir == DMA_NONE);
+ /*
+ * If the address happens to be in the device's DMA window,
+ * we can safely return the device addr and not worry about bounce
+ * buffering it.
+ */
+ if (dma_capable(dev, dev_addr, size) &&
+ !range_straddles_page_boundary(phys, size) && !swiotlb_force)
+ return dev_addr;
+
+ /*
+ * Oh well, have to allocate and map a bounce buffer.
+ */
+ map = swiotlb_tbl_map_single(dev, start_dma_addr, phys, size, dir);
+ if (!map)
+ return DMA_ERROR_CODE;
+
+ dev_addr = xen_virt_to_bus(map);
+
+ /*
+ * Ensure that the address returned is DMA'ble
+ */
+ if (!dma_capable(dev, dev_addr, size))
+ panic("map_single: bounce buffer is not DMA'ble");
+
+ return dev_addr;
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_map_page);
+
+/*
+ * Unmap a single streaming mode DMA translation. The dma_addr and size must
+ * match what was provided for in a previous xen_swiotlb_map_page call. All
+ * other usages are undefined.
+ *
+ * After this call, reads by the cpu to the buffer are guaranteed to see
+ * whatever the device wrote there.
+ */
+static void xen_unmap_single(struct device *hwdev, dma_addr_t dev_addr,
+ size_t size, enum dma_data_direction dir)
+{
+ phys_addr_t paddr = xen_bus_to_phys(dev_addr);
+
+ BUG_ON(dir == DMA_NONE);
+
+ /* NOTE: We use dev_addr here, not paddr! */
+ if (is_xen_swiotlb_buffer(dev_addr)) {
+ swiotlb_tbl_unmap_single(hwdev, phys_to_virt(paddr), size, dir);
+ return;
+ }
+
+ if (dir != DMA_FROM_DEVICE)
+ return;
+
+ /*
+ * phys_to_virt doesn't work with a highmem page but we could
+ * call dma_mark_clean() with a highmem page here. However, we
+ * are fine since dma_mark_clean() is null on POWERPC. We can
+ * make dma_mark_clean() take a physical address if necessary.
+ */
+ dma_mark_clean(phys_to_virt(paddr), size);
+}
+
+void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
+ size_t size, enum dma_data_direction dir,
+ struct dma_attrs *attrs)
+{
+ xen_unmap_single(hwdev, dev_addr, size, dir);
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_page);
+
+/*
+ * Make physical memory consistent for a single streaming mode DMA translation
+ * after a transfer.
+ *
+ * If you perform a xen_swiotlb_map_page() but wish to interrogate the buffer
+ * using the cpu, yet do not wish to tear down the dma mapping, you must
+ * call this function before doing so. At the next point you give the dma
+ * address back to the card, you must first perform a
+ * xen_swiotlb_dma_sync_for_device, and then the device again owns the buffer
+ */
+static void
+xen_swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
+ size_t size, enum dma_data_direction dir,
+ enum dma_sync_target target)
+{
+ phys_addr_t paddr = xen_bus_to_phys(dev_addr);
+
+ BUG_ON(dir == DMA_NONE);
+
+ /* NOTE: We use dev_addr here, not paddr! */
+ if (is_xen_swiotlb_buffer(dev_addr)) {
+ swiotlb_tbl_sync_single(hwdev, phys_to_virt(paddr), size, dir,
+ target);
+ return;
+ }
+
+ if (dir != DMA_FROM_DEVICE)
+ return;
+
+ dma_mark_clean(phys_to_virt(paddr), size);
+}
+
+void
+xen_swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
+ size_t size, enum dma_data_direction dir)
+{
+ xen_swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_CPU);
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_sync_single_for_cpu);
+
+void
+xen_swiotlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr,
+ size_t size, enum dma_data_direction dir)
+{
+ xen_swiotlb_sync_single(hwdev, dev_addr, size, dir, SYNC_FOR_DEVICE);
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_sync_single_for_device);
+
+/*
+ * Map a set of buffers described by scatterlist in streaming mode for DMA.
+ * This is the scatter-gather version of the above xen_swiotlb_map_page
+ * interface. Here the scatter gather list elements are each tagged with the
+ * appropriate dma address and length. They are obtained via
+ * sg_dma_{address,length}(SG).
+ *
+ * NOTE: An implementation may be able to use a smaller number of
+ * DMA address/length pairs than there are SG table elements.
+ * (for example via virtual mapping capabilities)
+ * The routine returns the number of addr/length pairs actually
+ * used, at most nents.
+ *
+ * Device ownership issues as mentioned above for xen_swiotlb_map_page are the
+ * same here.
+ */
+int
+xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
+ int nelems, enum dma_data_direction dir,
+ struct dma_attrs *attrs)
+{
+ struct scatterlist *sg;
+ int i;
+
+ BUG_ON(dir == DMA_NONE);
+
+ for_each_sg(sgl, sg, nelems, i) {
+ phys_addr_t paddr = sg_phys(sg);
+ dma_addr_t dev_addr = xen_phys_to_bus(paddr);
+
+ if (swiotlb_force ||
+ !dma_capable(hwdev, dev_addr, sg->length) ||
+ range_straddles_page_boundary(paddr, sg->length)) {
+ void *map = swiotlb_tbl_map_single(hwdev,
+ start_dma_addr,
+ sg_phys(sg),
+ sg->length, dir);
+ if (!map) {
+ /* Don't panic here, we expect map_sg users
+ to do proper error handling. */
+ xen_swiotlb_unmap_sg_attrs(hwdev, sgl, i, dir,
+ attrs);
+ sgl[0].dma_length = 0;
+ return DMA_ERROR_CODE;
+ }
+ sg->dma_address = xen_virt_to_bus(map);
+ } else
+ sg->dma_address = dev_addr;
+ sg->dma_length = sg->length;
+ }
+ return nelems;
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_map_sg_attrs);
+
+int
+xen_swiotlb_map_sg(struct device *hwdev, struct scatterlist *sgl, int nelems,
+ enum dma_data_direction dir)
+{
+ return xen_swiotlb_map_sg_attrs(hwdev, sgl, nelems, dir, NULL);
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_map_sg);
+
+/*
+ * Unmap a set of streaming mode DMA translations. Again, cpu read rules
+ * concerning calls here are the same as for swiotlb_unmap_page() above.
+ */
+void
+xen_swiotlb_unmap_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
+ int nelems, enum dma_data_direction dir,
+ struct dma_attrs *attrs)
+{
+ struct scatterlist *sg;
+ int i;
+
+ BUG_ON(dir == DMA_NONE);
+
+ for_each_sg(sgl, sg, nelems, i)
+ xen_unmap_single(hwdev, sg->dma_address, sg->dma_length, dir);
+
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_sg_attrs);
+
+void
+xen_swiotlb_unmap_sg(struct device *hwdev, struct scatterlist *sgl, int nelems,
+ enum dma_data_direction dir)
+{
+ return xen_swiotlb_unmap_sg_attrs(hwdev, sgl, nelems, dir, NULL);
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_unmap_sg);
+
+/*
+ * Make physical memory consistent for a set of streaming mode DMA translations
+ * after a transfer.
+ *
+ * The same as swiotlb_sync_single_* but for a scatter-gather list, same rules
+ * and usage.
+ */
+static void
+xen_swiotlb_sync_sg(struct device *hwdev, struct scatterlist *sgl,
+ int nelems, enum dma_data_direction dir,
+ enum dma_sync_target target)
+{
+ struct scatterlist *sg;
+ int i;
+
+ for_each_sg(sgl, sg, nelems, i)
+ xen_swiotlb_sync_single(hwdev, sg->dma_address,
+ sg->dma_length, dir, target);
+}
+
+void
+xen_swiotlb_sync_sg_for_cpu(struct device *hwdev, struct scatterlist *sg,
+ int nelems, enum dma_data_direction dir)
+{
+ xen_swiotlb_sync_sg(hwdev, sg, nelems, dir, SYNC_FOR_CPU);
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_sync_sg_for_cpu);
+
+void
+xen_swiotlb_sync_sg_for_device(struct device *hwdev, struct scatterlist *sg,
+ int nelems, enum dma_data_direction dir)
+{
+ xen_swiotlb_sync_sg(hwdev, sg, nelems, dir, SYNC_FOR_DEVICE);
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_sync_sg_for_device);
+
+int
+xen_swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr)
+{
+ return !dma_addr;
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_dma_mapping_error);
+
+/*
+ * Return whether the given device DMA address mask can be supported
+ * properly. For example, if your device can only drive the low 24-bits
+ * during bus mastering, then you would pass 0x00ffffff as the mask to
+ * this function.
+ */
+int
+xen_swiotlb_dma_supported(struct device *hwdev, u64 mask)
+{
+ return xen_virt_to_bus(xen_io_tlb_end - 1) <= mask;
+}
+EXPORT_SYMBOL_GPL(xen_swiotlb_dma_supported);
diff --git a/include/xen/swiotlb-xen.h b/include/xen/swiotlb-xen.h
new file mode 100644
index 0000000..2ea2fdc
--- /dev/null
+++ b/include/xen/swiotlb-xen.h
@@ -0,0 +1,65 @@
+#ifndef __LINUX_SWIOTLB_XEN_H
+#define __LINUX_SWIOTLB_XEN_H
+
+#include <linux/swiotlb.h>
+
+extern void xen_swiotlb_init(int verbose);
+
+extern void
+*xen_swiotlb_alloc_coherent(struct device *hwdev, size_t size,
+ dma_addr_t *dma_handle, gfp_t flags);
+
+extern void
+xen_swiotlb_free_coherent(struct device *hwdev, size_t size,
+ void *vaddr, dma_addr_t dma_handle);
+
+extern dma_addr_t xen_swiotlb_map_page(struct device *dev, struct page *page,
+ unsigned long offset, size_t size,
+ enum dma_data_direction dir,
+ struct dma_attrs *attrs);
+
+extern void xen_swiotlb_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
+ size_t size, enum dma_data_direction dir,
+ struct dma_attrs *attrs);
+/*
+extern int
+xen_swiotlb_map_sg(struct device *hwdev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir);
+
+extern void
+xen_swiotlb_unmap_sg(struct device *hwdev, struct scatterlist *sg, int nents,
+ enum dma_data_direction dir);
+*/
+extern int
+xen_swiotlb_map_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
+ int nelems, enum dma_data_direction dir,
+ struct dma_attrs *attrs);
+
+extern void
+xen_swiotlb_unmap_sg_attrs(struct device *hwdev, struct scatterlist *sgl,
+ int nelems, enum dma_data_direction dir,
+ struct dma_attrs *attrs);
+
+extern void
+xen_swiotlb_sync_single_for_cpu(struct device *hwdev, dma_addr_t dev_addr,
+ size_t size, enum dma_data_direction dir);
+
+extern void
+xen_swiotlb_sync_sg_for_cpu(struct device *hwdev, struct scatterlist *sg,
+ int nelems, enum dma_data_direction dir);
+
+extern void
+xen_swiotlb_sync_single_for_device(struct device *hwdev, dma_addr_t dev_addr,
+ size_t size, enum dma_data_direction dir);
+
+extern void
+xen_swiotlb_sync_sg_for_device(struct device *hwdev, struct scatterlist *sg,
+ int nelems, enum dma_data_direction dir);
+
+extern int
+xen_swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr);
+
+extern int
+xen_swiotlb_dma_supported(struct device *hwdev, u64 mask);
+
+#endif /* __LINUX_SWIOTLB_XEN_H */
--
1.7.0.1

2010-07-27 17:20:50

by Konrad Rzeszutek Wilk

Subject: [PATCH 1/9] xen: use _PAGE_IOMAP in ioremap to do machine mappings

From: Jeremy Fitzhardinge <[email protected]>

In a Xen domain, ioremap operates on machine addresses, not
pseudo-physical addresses. We use _PAGE_IOMAP to determine whether a
mapping is intended for machine addresses.

[ Impact: allow Xen domain to map real hardware ]

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/include/asm/xen/page.h | 8 +---
arch/x86/xen/mmu.c | 71 +++++++++++++++++++++++++++++++++++++-
2 files changed, 71 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
index 018a0a4..bf5f7d3 100644
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -112,13 +112,9 @@ static inline xpaddr_t machine_to_phys(xmaddr_t machine)
*/
static inline unsigned long mfn_to_local_pfn(unsigned long mfn)
{
- extern unsigned long max_mapnr;
unsigned long pfn = mfn_to_pfn(mfn);
- if ((pfn < max_mapnr)
- && !xen_feature(XENFEAT_auto_translated_physmap)
- && (get_phys_to_machine(pfn) != mfn))
- return max_mapnr; /* force !pfn_valid() */
- /* XXX fixme; not true with sparsemem */
+ if (get_phys_to_machine(pfn) != mfn)
+ return -1; /* force !pfn_valid() */
return pfn;
}

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index 914f046..a4dea9d 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -56,9 +56,11 @@
#include <asm/xen/hypercall.h>
#include <asm/xen/hypervisor.h>

+#include <xen/xen.h>
#include <xen/page.h>
#include <xen/interface/xen.h>
#include <xen/interface/version.h>
+#include <xen/interface/memory.h>
#include <xen/hvc-console.h>

#include "multicalls.h"
@@ -377,6 +379,28 @@ static bool xen_page_pinned(void *ptr)
return PagePinned(page);
}

+static bool xen_iomap_pte(pte_t pte)
+{
+ return xen_initial_domain() && (pte_flags(pte) & _PAGE_IOMAP);
+}
+
+static void xen_set_iomap_pte(pte_t *ptep, pte_t pteval)
+{
+ struct multicall_space mcs;
+ struct mmu_update *u;
+
+ mcs = xen_mc_entry(sizeof(*u));
+ u = mcs.args;
+
+ /* ptep might be kmapped when using 32-bit HIGHPTE */
+ u->ptr = arbitrary_virt_to_machine(ptep).maddr;
+ u->val = pte_val_ma(pteval);
+
+ MULTI_mmu_update(mcs.mc, mcs.args, 1, NULL, DOMID_IO);
+
+ xen_mc_issue(PARAVIRT_LAZY_MMU);
+}
+
static void xen_extend_mmu_update(const struct mmu_update *update)
{
struct multicall_space mcs;
@@ -453,6 +477,11 @@ void set_pte_mfn(unsigned long vaddr, unsigned long mfn, pgprot_t flags)
void xen_set_pte_at(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, pte_t pteval)
{
+ if (xen_iomap_pte(pteval)) {
+ xen_set_iomap_pte(ptep, pteval);
+ goto out;
+ }
+
ADD_STATS(set_pte_at, 1);
// ADD_STATS(set_pte_at_pinned, xen_page_pinned(ptep));
ADD_STATS(set_pte_at_current, mm == current->mm);
@@ -523,8 +552,25 @@ static pteval_t pte_pfn_to_mfn(pteval_t val)
return val;
}

+static pteval_t iomap_pte(pteval_t val)
+{
+ if (val & _PAGE_PRESENT) {
+ unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
+ pteval_t flags = val & PTE_FLAGS_MASK;
+
+ /* We assume the pte frame number is a MFN, so
+ just use it as-is. */
+ val = ((pteval_t)pfn << PAGE_SHIFT) | flags;
+ }
+
+ return val;
+}
+
pteval_t xen_pte_val(pte_t pte)
{
+ if (xen_initial_domain() && (pte.pte & _PAGE_IOMAP))
+ return pte.pte;
+
return pte_mfn_to_pfn(pte.pte);
}
PV_CALLEE_SAVE_REGS_THUNK(xen_pte_val);
@@ -537,7 +583,11 @@ PV_CALLEE_SAVE_REGS_THUNK(xen_pgd_val);

pte_t xen_make_pte(pteval_t pte)
{
- pte = pte_pfn_to_mfn(pte);
+ if (unlikely(xen_initial_domain() && (pte & _PAGE_IOMAP)))
+ pte = iomap_pte(pte);
+ else
+ pte = pte_pfn_to_mfn(pte);
+
return native_make_pte(pte);
}
PV_CALLEE_SAVE_REGS_THUNK(xen_make_pte);
@@ -593,6 +643,11 @@ void xen_set_pud(pud_t *ptr, pud_t val)

void xen_set_pte(pte_t *ptep, pte_t pte)
{
+ if (xen_iomap_pte(pte)) {
+ xen_set_iomap_pte(ptep, pte);
+ return;
+ }
+
ADD_STATS(pte_update, 1);
// ADD_STATS(pte_update_pinned, xen_page_pinned(ptep));
ADD_STATS(pte_update_batched, paravirt_get_lazy_mode() == PARAVIRT_LAZY_MMU);
@@ -609,6 +664,11 @@ void xen_set_pte(pte_t *ptep, pte_t pte)
#ifdef CONFIG_X86_PAE
void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
{
+ if (xen_iomap_pte(pte)) {
+ xen_set_iomap_pte(ptep, pte);
+ return;
+ }
+
set_64bit((u64 *)ptep, native_pte_val(pte));
}

@@ -1811,9 +1871,16 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
pte = pfn_pte(phys, prot);
break;

- default:
+ case FIX_PARAVIRT_BOOTMAP:
+ /* This is an MFN, but it isn't an IO mapping from the
+ IO domain */
pte = mfn_pte(phys, prot);
break;
+
+ default:
+ /* By default, set_fixmap is used for hardware mappings */
+ pte = mfn_pte(phys, __pgprot(pgprot_val(prot) | _PAGE_IOMAP));
+ break;
}

__native_set_fixmap(idx, pte);
--
1.7.0.1

2010-07-27 17:21:03

by Konrad Rzeszutek Wilk

Subject: [PATCH 5/9] vmap: add flag to allow lazy unmap to be disabled at runtime

From: Jeremy Fitzhardinge <[email protected]>

Add a flag to force lazy_max_pages() to zero to prevent any outstanding
mapped pages. We'll need this for Xen.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
Acked-by: Nick Piggin <[email protected]>
---
include/linux/vmalloc.h | 2 ++
mm/vmalloc.c | 4 ++++
2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 227c2a5..b840fda 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -7,6 +7,8 @@

struct vm_area_struct; /* vma defining user mapping in mm_types.h */

+extern bool vmap_lazy_unmap;
+
/* bits in flags of vmalloc's vm_struct below */
#define VM_IOREMAP 0x00000001 /* ioremap() and friends */
#define VM_ALLOC 0x00000002 /* vmalloc() */
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ae00746..7f35fe2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -31,6 +31,7 @@
#include <asm/tlbflush.h>
#include <asm/shmparam.h>

+bool vmap_lazy_unmap __read_mostly = true;

/*** Page table manipulation functions ***/

@@ -502,6 +503,9 @@ static unsigned long lazy_max_pages(void)
{
unsigned int log;

+ if (!vmap_lazy_unmap)
+ return 0;
+
log = fls(num_online_cpus());

return log * (32UL * 1024 * 1024 / PAGE_SIZE);
--
1.7.0.1

2010-07-27 17:21:49

by Konrad Rzeszutek Wilk

Subject: [PATCH 3/9] xen: Rename the balloon lock

From: Alex Nixon <[email protected]>

* xen_create_contiguous_region needs access to the balloon lock to
ensure memory doesn't change under its feet, so expose the balloon
lock
* Change the name of the lock to xen_reservation_lock, to reflect its
now less-specific usage.

[ Impact: cleanup ]

Signed-off-by: Alex Nixon <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/xen/mmu.c | 7 +++++++
drivers/xen/balloon.c | 15 ++++-----------
include/xen/interface/memory.h | 8 ++++++++
3 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index a5577f5..9e0d82f 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -70,6 +70,13 @@

#define MMU_UPDATE_HISTO 30

+/*
+ * Protects atomic reservation decrease/increase against concurrent increases.
+ * Also protects non-atomic updates of current_pages and driver_pages, and
+ * balloon lists.
+ */
+DEFINE_SPINLOCK(xen_reservation_lock);
+
#ifdef CONFIG_XEN_DEBUG_FS

static struct {
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
index 1a0d8c2..500290b 100644
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -85,13 +85,6 @@ static struct sys_device balloon_sysdev;

static int register_balloon(struct sys_device *sysdev);

-/*
- * Protects atomic reservation decrease/increase against concurrent increases.
- * Also protects non-atomic updates of current_pages and driver_pages, and
- * balloon lists.
- */
-static DEFINE_SPINLOCK(balloon_lock);
-
static struct balloon_stats balloon_stats;

/* We increase/decrease in batches which fit in a page */
@@ -210,7 +203,7 @@ static int increase_reservation(unsigned long nr_pages)
if (nr_pages > ARRAY_SIZE(frame_list))
nr_pages = ARRAY_SIZE(frame_list);

- spin_lock_irqsave(&balloon_lock, flags);
+ spin_lock_irqsave(&xen_reservation_lock, flags);

page = balloon_first_page();
for (i = 0; i < nr_pages; i++) {
@@ -254,7 +247,7 @@ static int increase_reservation(unsigned long nr_pages)
balloon_stats.current_pages += rc;

out:
- spin_unlock_irqrestore(&balloon_lock, flags);
+ spin_unlock_irqrestore(&xen_reservation_lock, flags);

return rc < 0 ? rc : rc != nr_pages;
}
@@ -299,7 +292,7 @@ static int decrease_reservation(unsigned long nr_pages)
kmap_flush_unused();
flush_tlb_all();

- spin_lock_irqsave(&balloon_lock, flags);
+ spin_lock_irqsave(&xen_reservation_lock, flags);

/* No more mappings: invalidate P2M and add to balloon. */
for (i = 0; i < nr_pages; i++) {
@@ -315,7 +308,7 @@ static int decrease_reservation(unsigned long nr_pages)

balloon_stats.current_pages -= nr_pages;

- spin_unlock_irqrestore(&balloon_lock, flags);
+ spin_unlock_irqrestore(&xen_reservation_lock, flags);

return need_sleep;
}
diff --git a/include/xen/interface/memory.h b/include/xen/interface/memory.h
index af36ead..e6adce6 100644
--- a/include/xen/interface/memory.h
+++ b/include/xen/interface/memory.h
@@ -9,6 +9,8 @@
#ifndef __XEN_PUBLIC_MEMORY_H__
#define __XEN_PUBLIC_MEMORY_H__

+#include <linux/spinlock.h>
+
/*
* Increase or decrease the specified domain's memory reservation. Returns a
* -ve errcode on failure, or the # extents successfully allocated or freed.
@@ -142,4 +144,10 @@ struct xen_translate_gpfn_list {
};
DEFINE_GUEST_HANDLE_STRUCT(xen_translate_gpfn_list);

+
+/*
+ * Prevent the balloon driver from changing the memory reservation
+ * during a driver critical region.
+ */
+extern spinlock_t xen_reservation_lock;
#endif /* __XEN_PUBLIC_MEMORY_H__ */
--
1.7.0.1

2010-07-27 19:05:04

by H. Peter Anvin

Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On 07/27/2010 10:00 AM, Konrad Rzeszutek Wilk wrote:
> It is paramount that we call pci_xen_swiotlb_detect before
> pci_swiotlb_detect as both implementations use the 'swiotlb'
> and 'swiotlb_force' flags. The pci-xen_swiotlb_detect inhibits
> the swiotlb_force and swiotlb flag so that the native SWIOTLB
> implementation is not enabled when running under Xen.
>
> Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
> Acked-by: Jeremy Fitzhardinge <[email protected]>
> Cc: FUJITA Tomonori <[email protected]>
> Cc: Albert Herranz <[email protected]>
> Cc: Ian Campbell <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: "H. Peter Anvin" <[email protected]>
> Cc: [email protected]
> Cc: Jesse Barnes <[email protected]>
> ---
> arch/x86/kernel/pci-dma.c | 7 +++++--
> 1 files changed, 5 insertions(+), 2 deletions(-)
>

Is there any way we can abstract this out a bit more instead of crapping
on generic code?

-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-07-27 19:41:31

by Konrad Rzeszutek Wilk

Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On Tue, Jul 27, 2010 at 12:03:56PM -0700, H. Peter Anvin wrote:
> On 07/27/2010 10:00 AM, Konrad Rzeszutek Wilk wrote:
> > It is paramount that we call pci_xen_swiotlb_detect before
> > pci_swiotlb_detect as both implementations use the 'swiotlb'
> > and 'swiotlb_force' flags. The pci-xen_swiotlb_detect inhibits
> > the swiotlb_force and swiotlb flag so that the native SWIOTLB
> > implementation is not enabled when running under Xen.
> >
> > Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
> > Acked-by: Jeremy Fitzhardinge <[email protected]>
> > Cc: FUJITA Tomonori <[email protected]>
> > Cc: Albert Herranz <[email protected]>
> > Cc: Ian Campbell <[email protected]>
> > Cc: Thomas Gleixner <[email protected]>
> > Cc: "H. Peter Anvin" <[email protected]>
> > Cc: [email protected]
> > Cc: Jesse Barnes <[email protected]>
> > ---
> > arch/x86/kernel/pci-dma.c | 7 +++++--
> > 1 files changed, 5 insertions(+), 2 deletions(-)
> >
>
> Is there any way we can abstract this out a bit more instead of crapping
> on generic code?

I was toying with something like this:

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 9f07cfc..e0cd388 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -45,6 +45,25 @@ int iommu_detected __read_mostly = 0;
*/
int iommu_pass_through __read_mostly;

+initcall_t __swiotlb_initcall_detect[] =
+ {pci_xen_swiotlb_detect,
+ pci_swiotlb_detect,
+ NULL};
+
+initcall_t __swiotlb_initcall_init[] = {
+ pci_xen_swiotlb_init,
+ pci_swiotlb_init,
+ NULL};
+
+
+initcall_t __iommu_initcall_detect[] = {
+ gart_iommu_hole_init,
+ detect_calgary,
+ detect_intel_iommu,
+ /* needs to be called after gart_iommu_hole_init */
+ amd_iommu_detect,
+ NULL};
+
/* Dummy device used for NULL arguments (normally ISA). */
struct device x86_dma_fallback_dev = {
.init_name = "fallback device",
@@ -130,24 +149,22 @@ static void __init dma32_free_bootmem(void)

void __init pci_iommu_alloc(void)
{
+ initcall_t *fn;
+
/* free the range so iommu could get some range less than 4G */
dma32_free_bootmem();

- if (pci_xen_swiotlb_detect() || pci_swiotlb_detect())
- goto out;
-
- gart_iommu_hole_init();
-
- detect_calgary();
+ /* First do the SWIOTLB - if they work, skip the IOMMUs. */
+ for (fn = __swiotlb_initcall_detect; *fn != NULL; fn++)
+ if ((*fn)())
+ goto swiotlb_init;

- detect_intel_iommu();
-
- /* needs to be called after gart_iommu_hole_init */
- amd_iommu_detect();
-out:
- pci_xen_swiotlb_init();
+ for (fn = __iommu_initcall_detect; *fn != NULL; fn++)
+ (*fn)();

- pci_swiotlb_init();
+swiotlb_init:
+ for (fn = __swiotlb_initcall_init; *fn != NULL; fn++)
+ (*fn)();
}

void *dma_generic_alloc_coherent(struct device *dev, size_t size,


(it compiles with warnings and has not yet been completely fleshed out), but
Fujita mentioned that it might be the right choice to use this
once we have more than just these two swiotlb implementors.


>
> -hpa
> --
> H. Peter Anvin, Intel Open Source Technology Center
> I work for Intel. I don't speak on their behalf.
>
> _______________________________________________
> iommu mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/iommu

2010-07-27 23:36:55

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On Tue, 27 Jul 2010 15:41:05 -0400
Konrad Rzeszutek Wilk <[email protected]> wrote:

> On Tue, Jul 27, 2010 at 12:03:56PM -0700, H. Peter Anvin wrote:
> > On 07/27/2010 10:00 AM, Konrad Rzeszutek Wilk wrote:
> > > It is paramount that we call pci_xen_swiotlb_detect before
> > > pci_swiotlb_detect as both implementations use the 'swiotlb'
> > > and 'swiotlb_force' flags. The pci-xen_swiotlb_detect inhibits
> > > the swiotlb_force and swiotlb flag so that the native SWIOTLB
> > > implementation is not enabled when running under Xen.
> > >
> > > Signed-off-by: Konrad Rzeszutek Wilk <[email protected]>
> > > Acked-by: Jeremy Fitzhardinge <[email protected]>
> > > Cc: FUJITA Tomonori <[email protected]>
> > > Cc: Albert Herranz <[email protected]>
> > > Cc: Ian Campbell <[email protected]>
> > > Cc: Thomas Gleixner <[email protected]>
> > > Cc: "H. Peter Anvin" <[email protected]>
> > > Cc: [email protected]
> > > Cc: Jesse Barnes <[email protected]>
> > > ---
> > > arch/x86/kernel/pci-dma.c | 7 +++++--
> > > 1 files changed, 5 insertions(+), 2 deletions(-)
> > >
> >
> > Is there any way we can abstract this out a bit more instead of crapping
> > on generic code?

I don't like this change much either; however, I think that this is the
simplest and most straightforward approach.

Basically, Xen's swiotlb works like a new IOMMU implementation so we
need to initialize it like other IOMMU implementations (call the
detect and init functions in order).
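
As a rough, stand-alone illustration of that detect-then-init pattern (a
minimal sketch only; the names below are invented and are not the real
kernel symbols):

#include <stdio.h>

typedef int (*initcall_t)(void);

static initcall_t iommu_init_hook;      /* whichever detect() claims it */

static int fake_xen_swiotlb_init(void)
{
        puts("initialize the Xen swiotlb and install its dma_ops");
        return 0;
}

static int fake_xen_swiotlb_detect(int running_under_xen)
{
        if (running_under_xen)
                iommu_init_hook = fake_xen_swiotlb_init; /* claim the init slot */
        return running_under_xen;
}

int main(void)
{
        fake_xen_swiotlb_detect(1);     /* detect pass runs first ...         */
        if (iommu_init_hook)
                iommu_init_hook();      /* ... init pass runs after detection */
        return 0;
}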


> I was toying with something like this:
>
> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> index 9f07cfc..e0cd388 100644
> --- a/arch/x86/kernel/pci-dma.c
> +++ b/arch/x86/kernel/pci-dma.c
> @@ -45,6 +45,25 @@ int iommu_detected __read_mostly = 0;
> */
> int iommu_pass_through __read_mostly;
>
> +initcall_t __swiotlb_initcall_detect[] =
> + {pci_xen_swiotlb_detect,
> + pci_swiotlb_detect,
> + NULL};
> +
> +initcall_t __swiotlb_initcall_init[] = {
> + pci_xen_swiotlb_init,
> + pci_swiotlb_init,
> + NULL};
> +
> +
> +initcall_t __iommu_initcall_detect[] = {
> + gart_iommu_hole_init,
> + detect_calgary,
> + detect_intel_iommu,
> + /* needs to be called after gart_iommu_hole_init */
> + amd_iommu_detect,
> + NULL};

I really don't think that this makes the code better. I prefer the
current simple (dumb) code.


> /* Dummy device used for NULL arguments (normally ISA). */
> struct device x86_dma_fallback_dev = {
> .init_name = "fallback device",
> @@ -130,24 +149,22 @@ static void __init dma32_free_bootmem(void)
>
> void __init pci_iommu_alloc(void)
> {
> + initcall_t *fn;
> +
> /* free the range so iommu could get some range less than 4G */
> dma32_free_bootmem();
>
> - if (pci_xen_swiotlb_detect() || pci_swiotlb_detect())
> - goto out;
> -
> - gart_iommu_hole_init();
> -
> - detect_calgary();
> + /* First do the SWIOTLB - if they work, skip the IOMMUs. */

btw, this comment is wrong. We check if we are forced to use SWIOTLB
by kernel command line here.

Even if SWIOTLB works, we see if hardware IOMMU is available. SWIOTLB
is a last resort. We prefer hardware IOMMU.

2010-07-28 00:20:36

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On 07/27/2010 04:36 PM, FUJITA Tomonori wrote:
>>>
>>> Is there any way we can abstract this out a bit more instead of crapping
>>> on generic code?
>
> I don't like this change much too, however I think that this is the
> most simple and straightforward.
>
> Basically, Xen's swiotlb works like a new IOMMU implementation so we
> need to initialize it like other IOMMU implementations (call the
> detect and init functions in order).
>

Even mentioning "xen" in generic code should be considered a bug. I
think we *do* need to driverize the iommu stuff, and yes, Xen's swiotlb
should just be handled like one in the list.

>
>> I was toying with something like this:
>>
>> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
>> index 9f07cfc..e0cd388 100644
>> --- a/arch/x86/kernel/pci-dma.c
>> +++ b/arch/x86/kernel/pci-dma.c
>> @@ -45,6 +45,25 @@ int iommu_detected __read_mostly = 0;
>> */
>> int iommu_pass_through __read_mostly;
>>
>> +initcall_t __swiotlb_initcall_detect[] =
>> + {pci_xen_swiotlb_detect,
>> + pci_swiotlb_detect,
>> + NULL};
>> +
>> +initcall_t __swiotlb_initcall_init[] = {
>> + pci_xen_swiotlb_init,
>> + pci_swiotlb_init,
>> + NULL};
>> +
>> +
>> +initcall_t __iommu_initcall_detect[] = {
>> + gart_iommu_hole_init,
>> + detect_calgary,
>> + detect_intel_iommu,
>> + /* needs to be called after gart_iommu_hole_init */
>> + amd_iommu_detect,
>> + NULL};
>
> I really don't think that this makes the code better. I prefer the
> current simple (dumb) code.
>

The special handling of swiotlb here really looks wrong, but otherwise I
think it's the right idea.

> btw, this comment is wrong. We check if we are forced to use SWIOTLB
> by kernel command line here.
>
> Even if SWIOTLB works, we see if hardware IOMMU is available. SWIOTLB
> is a last resort. We prefer hardware IOMMU.

Any reason to not just handle swiotlb like any of the other iommus, at
the bottom of the list?

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-07-28 00:53:07

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On Tue, 27 Jul 2010 17:19:41 -0700
"H. Peter Anvin" <[email protected]> wrote:

> On 07/27/2010 04:36 PM, FUJITA Tomonori wrote:
> >>>
> >>> Is there any way we can abstract this out a bit more instead of crapping
> >>> on generic code?
> >
> > I don't like this change much too, however I think that this is the
> > most simple and straightforward.
> >
> > Basically, Xen's swiotlb works like a new IOMMU implementation so we
> > need to initialize it like other IOMMU implementations (call the
> > detect and init functions in order).
> >
>
> Even mentioning "xen" in generic code should be considered a bug. I
> think we *do* need to driverize the iommu stuff, and yes, Xen's swiotlb
> should just be handled like one in the list.
>
> >
> >> I was toying with something like this:
> >>
> >> diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
> >> index 9f07cfc..e0cd388 100644
> >> --- a/arch/x86/kernel/pci-dma.c
> >> +++ b/arch/x86/kernel/pci-dma.c
> >> @@ -45,6 +45,25 @@ int iommu_detected __read_mostly = 0;
> >> */
> >> int iommu_pass_through __read_mostly;
> >>
> >> +initcall_t __swiotlb_initcall_detect[] =
> >> + {pci_xen_swiotlb_detect,
> >> + pci_swiotlb_detect,
> >> + NULL};
> >> +
> >> +initcall_t __swiotlb_initcall_init[] = {
> >> + pci_xen_swiotlb_init,
> >> + pci_swiotlb_init,
> >> + NULL};
> >> +
> >> +
> >> +initcall_t __iommu_initcall_detect[] = {
> >> + gart_iommu_hole_init,
> >> + detect_calgary,
> >> + detect_intel_iommu,
> >> + /* needs to be called after gart_iommu_hole_init */
> >> + amd_iommu_detect,
> >> + NULL};
> >
> > I really don't think that this makes the code better. I prefer the
> > current simple (dumb) code.
> >
>
> The special handling of swiotlb here really looks wrong, but otherwise I
> think it's the right idea.
>
> > btw, this comment is wrong. We check if we are forced to use SWIOTLB
> > by kernel command line here.
> >
> > Even if SWIOTLB works, we see if hardware IOMMU is available. SWIOTLB
> > is a last resort. We prefer hardware IOMMU.
>
> Any reason to not just handle swiotlb like any of the other iommus, at
> the bottom of the list?

we need to check if swiotlb usage is forced by the command line since:

- we skip hardware IOMMU initialization if so.


We also need swiotlb initialization after all hardware IOMMU
initialization since:

- if all hardware IOMMU initialization fails, we might need to
initialize swiotlb.

- even if hardware IOMMU initialization is successful, we might need
to initialize swiotlb (even if a system has hardware IOMMU, some
devices are not connected to hardware IOMMU).

- swiotlb initialization must come after GART initialization. We reserve
some DMA32 memory for broken BIOSes with GART. The order must be: free
the memory, initialize GART, then initialize swiotlb.
Initializing swiotlb before GART would steal the reserved memory and
break GART.
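
A minimal stand-alone sketch of those ordering constraints (placeholder
names only, not the real kernel functions):

#include <stdbool.h>
#include <stdio.h>

static bool swiotlb_forced;             /* e.g. swiotlb=force on the command line */

static bool fake_hw_iommu_detect(void)
{
        puts("GART / Calgary / VT-d / AMD-IOMMU detection");
        return false;                   /* pretend nothing volunteered */
}

int main(void)
{
        bool hw_iommu_found = false;

        puts("free the reserved DMA32 bootmem");        /* must happen before GART init */

        if (!swiotlb_forced)                            /* forced swiotlb skips the hw IOMMUs */
                hw_iommu_found = fake_hw_iommu_detect();

        /* swiotlb goes last: the fallback when no hw IOMMU was found, and
         * possibly still needed next to one for devices outside the IOMMU. */
        if (swiotlb_forced || !hw_iommu_found)
                puts("initialize swiotlb");

        return 0;
}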

2010-07-28 22:39:37

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

> > >>> Is there any way we can abstract this out a bit more instead of crapping
> > >>> on generic code?
> > >
> > > I don't like this change much too, however I think that this is the
> > > most simple and straightforward.
> > >
> > > Basically, Xen's swiotlb works like a new IOMMU implementation so we
> > > need to initialize it like other IOMMU implementations (call the
> > > detect and init functions in order).
> > >
> >
> > Even mentioning "xen" in generic code should be considered a bug. I
> > think we *do* need to driverize the iommu stuff, and yes, Xen's swiotlb
> > should just be handled like one in the list.

I think none of us likes the way 'pci_iommu_alloc' does it. But it does
the job pretty well right now, and the code looks, well, OK. Adding in
the extra '_detect' and '_init' does not detract from it all that much.

Long term I think the driverization is the way to go, and..

> > > I really don't think that this makes the code better. I prefer the
> > > current simple (dumb) code.
> > >
> >
> > The special handling of swiotlb here really looks wrong, but otherwise I
> > think it's the right idea.
> >
> > > Even if SWIOTLB works, we see if hardware IOMMU is available. SWIOTLB
> > > is a last resort. We prefer hardware IOMMU.
> >
> > Any reason to not just handle swiotlb like any of the other iommus, at
> > the bottom of the list?
>
> we need to check if swiotlb usage is forced by the command line since:
>
> - we skip hardware IOMMU initialization if so.

I think the flow of a) check if we need SWIOTLB, b) check all IOMMUs, c)
recheck SWIOTLB in case no IOMMUs volunteered MUST be preserved
regardless of whether we driverize the IOMMUs/SWIOTLB or not.

Perhaps we should get together at one of these Linux conferences and
think this one through? Beers on me.

2010-07-28 22:53:41

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On 07/28/2010 03:38 PM, Konrad Rzeszutek Wilk wrote:
>
> Long term I think the driverization is the way to go, and..
>
> I think the flow a). check if we need SWIOTLB b), check all IOMMUs, c).
> recheck SWIOTLB in case no IOMMUs volunteered MUST be preserved
> irregardless if we driverize the IOMMUs/SWIOTLB or not.
>
> Perhaps we should get together at one of these Linux conferences and
> think this one through? Beers on me.
>

I don't understand point (a) here. (c) simply seems like the fallback
case, and in the case we are actively forcing swiotlb we simply skip
step (b).

-hpa

2010-07-29 07:18:44

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On Wed, 28 Jul 2010 15:52:50 -0700
"H. Peter Anvin" <[email protected]> wrote:

> On 07/28/2010 03:38 PM, Konrad Rzeszutek Wilk wrote:
> >
> > Long term I think the driverization is the way to go, and..
> >
> > I think the flow a). check if we need SWIOTLB b), check all IOMMUs, c).
> > recheck SWIOTLB in case no IOMMUs volunteered MUST be preserved
> > irregardless if we driverize the IOMMUs/SWIOTLB or not.
> >
> > Perhaps we should get together at one of these Linux conferences and
> > think this one through? Beers on me.
> >
>
> I don't understand point (a) here. (c) simply seems like the fallback
> case, and in the case we are actively forcing swiotlb we simply skip
> step (b).

Looks like (a) is too simplified. The actual XEN code (a) is:

+int __init pci_xen_swiotlb_detect(void)
+{
+
+ /* If running as PV guest, either iommu=soft, or swiotlb=force will
+ * activate this IOMMU. If running as PV privileged, activate it
+ * irregardlesss.
+ */
+ if ((xen_initial_domain() || swiotlb || swiotlb_force) &&
+ (xen_pv_domain()))
+ xen_swiotlb = 1;
+
+ /* If we are running under Xen, we MUST disable the native SWIOTLB.
+ * Don't worry about swiotlb_force flag activating the native, as
+ * the 'swiotlb' flag is the only one turning it on. */
+ if (xen_pv_domain())
+ swiotlb = 0;
+
+ return xen_swiotlb;

It does more than just check whether swiotlb usage is forced.

Looks like we need to call Xen-specific code twice, at (a) and (c); I
dislike that, though.


btw, (c) is not the fallback case (i.e. if we can't find hardware
IOMMU, we enable swiotlb). We use both hardware IOMMU and swiotlb on
some systems.

2010-07-29 13:45:26

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On 07/29/2010 12:17 AM, FUJITA Tomonori wrote:
> btw, (c) is not the fallback case (i.e. if we can't find hardware
> IOMMU, we enable swiotlb). We use both hardware IOMMU and swiotlb on
> some systems.

Right, which is why it can't be folded into (b).

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-07-29 16:07:47

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

> > > Long term I think the driverization is the way to go, and..
> > >
> > > I think the flow a). check if we need SWIOTLB b), check all IOMMUs, c).
> > > recheck SWIOTLB in case no IOMMUs volunteered MUST be preserved
> > > irregardless if we driverize the IOMMUs/SWIOTLB or not.
.. snip..
> > I don't understand point (a) here. (c) simply seems like the fallback

The way it works right now is that if the user specifies swiotlb=force we
use _only_ SWIOTLB and nothing else.

> > case, and in the case we are actively forcing swiotlb we simply skip
> > step (b).
>
> Looks like (a) is too simplified. The actual XEN code (a) is:
>
> +int __init pci_xen_swiotlb_detect(void)
> +{
> +
> + /* If running as PV guest, either iommu=soft, or swiotlb=force will
> + * activate this IOMMU. If running as PV privileged, activate it
> + * irregardlesss.
> + */
> + if ((xen_initial_domain() || swiotlb || swiotlb_force) &&
> + (xen_pv_domain()))
> + xen_swiotlb = 1;
> +
> + /* If we are running under Xen, we MUST disable the native SWIOTLB.
> + * Don't worry about swiotlb_force flag activating the native, as
> + * the 'swiotlb' flag is the only one turning it on. */
> + if (xen_pv_domain())
> + swiotlb = 0;
> +
> + return xen_swiotlb;
>
> It does things more complicated than checking if swiotlb usage is
> forced.
>
> Looks like we need to call Xen specific code twice, (a) and (c), I
> dislike it though.

I can eliminate step c) by making a), 'pci_xen_swiotlb_detect', do
what it does now and also utilize x86_init.iommu.iommu_init,
in essence making it behave like another IOMMU type.

The patch is on top of the other patches. The only reason I am calling
'pci_xen_swiotlb_detect' before 'pci_swiotlb_detect' in 'pci_iommu_alloc'
is that a user could specify 'swiotlb=force', which would bypass the
Xen SWIOTLB detection code and end up using the wrong dma_ops (under Xen,
of course). Oh, and I added a check in gart_iommu_hole_init() to stop it
from setting iommu_init to its own.

What do you guys think?

Another option would be to coalesce 'pci_xen_swiotlb_detect' with
'pci_xen_swiotlb_init' and do the deed right there.

diff --git a/arch/x86/include/asm/xen/swiotlb-xen.h b/arch/x86/include/asm/xen/swiotlb-xen.h
index 1be1ab7..07ed055 100644
--- a/arch/x86/include/asm/xen/swiotlb-xen.h
+++ b/arch/x86/include/asm/xen/swiotlb-xen.h
@@ -4,11 +4,9 @@
#ifdef CONFIG_SWIOTLB_XEN
extern int xen_swiotlb;
extern int __init pci_xen_swiotlb_detect(void);
-extern void __init pci_xen_swiotlb_init(void);
#else
#define xen_swiotlb (0)
static inline int __init pci_xen_swiotlb_detect(void) { return 0; }
-static inline void __init pci_xen_swiotlb_init(void) { }
#endif

#endif /* _ASM_X86_SWIOTLB_XEN_H */
diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
index b5d8b0b..7a3ea9a 100644
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -379,6 +379,9 @@ void __init gart_iommu_hole_init(void)
int fix, slot, valid_agp = 0;
int i, node;

+ if (iommu_detected)
+ return;
+
if (gart_iommu_aperture_disabled || !fix_aperture ||
!early_pci_allowed())
return;
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 9f07cfc..ef1de8e 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -133,7 +133,9 @@ void __init pci_iommu_alloc(void)
/* free the range so iommu could get some range less than 4G */
dma32_free_bootmem();

- if (pci_xen_swiotlb_detect() || pci_swiotlb_detect())
+ pci_xen_swiotlb_detect();
+
+ if (pci_swiotlb_detect())
goto out;

gart_iommu_hole_init();
@@ -145,8 +147,6 @@ void __init pci_iommu_alloc(void)
/* needs to be called after gart_iommu_hole_init */
amd_iommu_detect();
out:
- pci_xen_swiotlb_init();
-
pci_swiotlb_init();
}

@@ -299,7 +299,7 @@ static int __init pci_iommu_init(void)
#endif
x86_init.iommu.iommu_init();

- if (swiotlb || xen_swiotlb) {
+ if (swiotlb) {
printk(KERN_INFO "PCI-DMA: "
"Using software bounce buffering for IO (SWIOTLB)\n");
swiotlb_print_info();
diff --git a/arch/x86/xen/pci-swiotlb-xen.c b/arch/x86/xen/pci-swiotlb-xen.c
index a013ec9..8722a37 100644
--- a/arch/x86/xen/pci-swiotlb-xen.c
+++ b/arch/x86/xen/pci-swiotlb-xen.c
@@ -1,30 +1,21 @@
/* Glue code to lib/swiotlb-xen.c */

#include <linux/dma-mapping.h>
-#include <xen/swiotlb-xen.h>

+#include <asm/iommu.h>
+#include <asm/x86_init.h>
#include <asm/xen/hypervisor.h>
+#include <asm/dma.h>
+
+#include <xen/swiotlb-xen.h>
#include <xen/xen.h>

int xen_swiotlb __read_mostly;

-static struct dma_map_ops xen_swiotlb_dma_ops = {
- .mapping_error = xen_swiotlb_dma_mapping_error,
- .alloc_coherent = xen_swiotlb_alloc_coherent,
- .free_coherent = xen_swiotlb_free_coherent,
- .sync_single_for_cpu = xen_swiotlb_sync_single_for_cpu,
- .sync_single_for_device = xen_swiotlb_sync_single_for_device,
- .sync_sg_for_cpu = xen_swiotlb_sync_sg_for_cpu,
- .sync_sg_for_device = xen_swiotlb_sync_sg_for_device,
- .map_sg = xen_swiotlb_map_sg_attrs,
- .unmap_sg = xen_swiotlb_unmap_sg_attrs,
- .map_page = xen_swiotlb_map_page,
- .unmap_page = xen_swiotlb_unmap_page,
- .dma_supported = xen_swiotlb_dma_supported,
-};
-
/*
- * pci_xen_swiotlb_detect - set xen_swiotlb to 1 if necessary
+ * xen_swiotlb_init - detect if we are running under Xen and set
+ * the dma_ops appropriately. Also terminate the baremetal swiotlb if
+ * it has been enabled.
*
* This returns non-zero if we are forced to use xen_swiotlb (by the boot
* option).
@@ -37,22 +28,26 @@ int __init pci_xen_swiotlb_detect(void)
* irregardlesss.
*/
if ((xen_initial_domain() || swiotlb || swiotlb_force) &&
- (xen_pv_domain()))
+ (xen_pv_domain())) {
xen_swiotlb = 1;
+ /* We MUST turn off other IOMMU detection engines. */
+ iommu_detected = 1;
+ x86_init.iommu.iommu_init = xen_swiotlb_init;
+ }

- /* If we are running under Xen, we MUST disable the native SWIOTLB.
- * Don't worry about swiotlb_force flag activating the native, as
- * the 'swiotlb' flag is the only one turning it on. */
- if (xen_pv_domain())
+ /* If we are running under PV Xen, we MUST disable the native SWIOTLB
+ * and the command line overrides - otherwise baremetal SWIOTLB might
+ * get turned on and BOOM! */
+ if (xen_pv_domain()) {
swiotlb = 0;
+ swiotlb_force = 0;
+#ifdef CONFIG_X86_64
+ /* SWIOTLB has a check like this. MUST disable it. */
+ if (!no_iommu && max_pfn > MAX_DMA32_PFN)
+ no_iommu = 1;
+#endif
+ }

- return xen_swiotlb;
-}

-void __init pci_xen_swiotlb_init(void)
-{
- if (xen_swiotlb) {
- xen_swiotlb_init(1);
- dma_ops = &xen_swiotlb_dma_ops;
- }
+ return xen_swiotlb;
}
diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 54469c3..93a0d58 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -38,6 +38,23 @@
#include <xen/swiotlb-xen.h>
#include <xen/page.h>
#include <xen/xen-ops.h>
+
+
+static struct dma_map_ops xen_swiotlb_dma_ops = {
+ .mapping_error = xen_swiotlb_dma_mapping_error,
+ .alloc_coherent = xen_swiotlb_alloc_coherent,
+ .free_coherent = xen_swiotlb_free_coherent,
+ .sync_single_for_cpu = xen_swiotlb_sync_single_for_cpu,
+ .sync_single_for_device = xen_swiotlb_sync_single_for_device,
+ .sync_sg_for_cpu = xen_swiotlb_sync_sg_for_cpu,
+ .sync_sg_for_device = xen_swiotlb_sync_sg_for_device,
+ .map_sg = xen_swiotlb_map_sg_attrs,
+ .unmap_sg = xen_swiotlb_unmap_sg_attrs,
+ .map_page = xen_swiotlb_map_page,
+ .unmap_page = xen_swiotlb_unmap_page,
+ .dma_supported = xen_swiotlb_dma_supported,
+};
+
/*
* Used to do a quick range check in swiotlb_tbl_unmap_single and
* swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this
@@ -143,7 +160,7 @@ xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
return 0;
}

-void __init xen_swiotlb_init(int verbose)
+int __init xen_swiotlb_init(void)
{
unsigned long bytes;
int rc;
@@ -153,13 +170,15 @@ void __init xen_swiotlb_init(int verbose)

bytes = xen_io_tlb_nslabs << IO_TLB_SHIFT;

+
/*
* Get IO TLB memory from any location.
*/
xen_io_tlb_start = alloc_bootmem(bytes);
- if (!xen_io_tlb_start)
- panic("Cannot allocate SWIOTLB buffer");
-
+ if (!xen_io_tlb_start) {
+ printk(KERN_ERR "PCI-DMA: Cannot allocate Xen-SWIOTLB buffer!");
+ return -ENOMEM;
+ }
xen_io_tlb_end = xen_io_tlb_start + bytes;
/*
* And replace that memory with pages under 4GB.
@@ -171,13 +190,22 @@ void __init xen_swiotlb_init(int verbose)
goto error;

start_dma_addr = xen_virt_to_bus(xen_io_tlb_start);
- swiotlb_init_with_tbl(xen_io_tlb_start, xen_io_tlb_nslabs, verbose);
+ swiotlb_init_with_tbl(xen_io_tlb_start, xen_io_tlb_nslabs, 1);
+
+ printk(KERN_INFO "PCI-DMA: Xen-SWIOTLB has %luMbytes.\n", bytes);

- return;
+ /* Yup, we are re-enabling it.. WHY? Because in pci-dma.c where the
+ * code gets called we have if (swiotlb) { .. } else { swiotlb_free();}
+ * and the swiotlb_free() would effectivly demolish us. */
+ swiotlb = 1;
+ dma_ops = &xen_swiotlb_dma_ops;
+
+ return 0;
error:
- panic("DMA(%d): Failed to exchange pages allocated for DMA with Xen! "\
- "We either don't have the permission or you do not have enough"\
- "free memory under 4GB!\n", rc);
+ printk(KERN_ERR "PCI-DMA: %d: Failed to exchange pages allocated for "\
+ "DMA with Xen! We either don't have the permission or you do "\
+ "not have enough free memory under 4GB!\n", rc);
+ return rc;
}

void *
diff --git a/include/xen/swiotlb-xen.h b/include/xen/swiotlb-xen.h
index 2ea2fdc..57ec7ef 100644
--- a/include/xen/swiotlb-xen.h
+++ b/include/xen/swiotlb-xen.h
@@ -3,7 +3,7 @@

#include <linux/swiotlb.h>

-extern void xen_swiotlb_init(int verbose);
+extern int __init xen_swiotlb_init(void);

extern void
*xen_swiotlb_alloc_coherent(struct device *hwdev, size_t size,


Making Xen-SWIOTLB an IOMMU means no more alloc_bootmem. So
this also precipitates:
- Splitting swiotlb_late_init_with_default_size into two functions: one
that allocates the io_tlb and the other ('swiotlb_late_init_with_tbl')
for allocating the other structs.
- Exporting swiotlb_late_init_with_tbl, IO_TLB_MIN_SLABS and
SLABS_PER_PAGE.
- Using swiotlb_late_init_with_tbl in Xen-SWIOTLB instead of
'swiotlb_init_with_tbl'.

Here is the patch for that:

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 93a0d58..367fda2 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -163,6 +163,7 @@ xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
int __init xen_swiotlb_init(void)
{
unsigned long bytes;
+ unsigned int order;
int rc;

xen_io_tlb_nslabs = (64 * 1024 * 1024 >> IO_TLB_SHIFT);
@@ -170,15 +171,32 @@ int __init xen_swiotlb_init(void)

bytes = xen_io_tlb_nslabs << IO_TLB_SHIFT;

+ order = get_order(xen_io_tlb_nslabs << IO_TLB_SHIFT);

/*
* Get IO TLB memory from any location.
*/
- xen_io_tlb_start = alloc_bootmem(bytes);
+ while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
+ xen_io_tlb_start = (void *)__get_free_pages(GFP_DMA |
+ __GFP_NOWARN,
+ order);
+ if (xen_io_tlb_start)
+ break;
+ order--;
+ }
+
if (!xen_io_tlb_start) {
printk(KERN_ERR "PCI-DMA: Cannot allocate Xen-SWIOTLB buffer!");
return -ENOMEM;
}
+
+ if (order != get_order(bytes)) {
+ printk(KERN_WARNING "Warning: only able to allocate %ld MB "
+ "for software IO TLB\n", (PAGE_SIZE << order) >> 20);
+ xen_io_tlb_nslabs = SLABS_PER_PAGE << order;
+ bytes = xen_io_tlb_nslabs << IO_TLB_SHIFT;
+ }
+
xen_io_tlb_end = xen_io_tlb_start + bytes;
/*
* And replace that memory with pages under 4GB.
@@ -190,9 +208,7 @@ int __init xen_swiotlb_init(void)
goto error;

start_dma_addr = xen_virt_to_bus(xen_io_tlb_start);
- swiotlb_init_with_tbl(xen_io_tlb_start, xen_io_tlb_nslabs, 1);
-
- printk(KERN_INFO "PCI-DMA: Xen-SWIOTLB has %luMbytes.\n", bytes);
+ swiotlb_late_init_with_tbl(xen_io_tlb_start, xen_io_tlb_nslabs);

/* Yup, we are re-enabling it.. WHY? Because in pci-dma.c where the
* code gets called we have if (swiotlb) { .. } else { swiotlb_free();}
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
index 8c0e349..3d263c6 100644
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -22,9 +22,18 @@ extern int swiotlb_force;
*/
#define IO_TLB_SHIFT 11

+#define SLABS_PER_PAGE (1 << (PAGE_SHIFT - IO_TLB_SHIFT))
+
+/*
+ * Minimum IO TLB size to bother booting with. Systems with mainly
+ * 64bit capable cards will only lightly use the swiotlb. If we can't
+ * allocate a contiguous 1MB, we're probably in trouble anyway.
+ */
+#define IO_TLB_MIN_SLABS ((1<<20) >> IO_TLB_SHIFT)
+
extern void swiotlb_init(int verbose);
extern void swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose);
-
+extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs);
/*
* Enumeration for sync targets
*/
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
index 34e3082..3b6f23b 100644
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -41,15 +41,6 @@
#define OFFSET(val,align) ((unsigned long) \
( (val) & ( (align) - 1)))

-#define SLABS_PER_PAGE (1 << (PAGE_SHIFT - IO_TLB_SHIFT))
-
-/*
- * Minimum IO TLB size to bother booting with. Systems with mainly
- * 64bit capable cards will only lightly use the swiotlb. If we can't
- * allocate a contiguous 1MB, we're probably in trouble anyway.
- */
-#define IO_TLB_MIN_SLABS ((1<<20) >> IO_TLB_SHIFT)
-
int swiotlb_force;

/*
@@ -195,46 +186,15 @@ swiotlb_init(int verbose)
swiotlb_init_with_default_size(64 * (1<<20), verbose); /* default to 64MB */
}

-/*
- * Systems with larger DMA zones (those that don't support ISA) can
- * initialize the swiotlb later using the slab allocator if needed.
- * This should be just like above, but with some error catching.
- */
-int
-swiotlb_late_init_with_default_size(size_t default_size)
+int __init swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs)
{
- unsigned long i, bytes, req_nslabs = io_tlb_nslabs;
+ unsigned long i, bytes;
unsigned int order;

- if (!io_tlb_nslabs) {
- io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
- io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
- }
-
- /*
- * Get IO TLB memory from the low pages
- */
+ io_tlb_nslabs = nslabs;
order = get_order(io_tlb_nslabs << IO_TLB_SHIFT);
- io_tlb_nslabs = SLABS_PER_PAGE << order;
+ io_tlb_start = tlb;
bytes = io_tlb_nslabs << IO_TLB_SHIFT;
-
- while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
- io_tlb_start = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
- order);
- if (io_tlb_start)
- break;
- order--;
- }
-
- if (!io_tlb_start)
- goto cleanup1;
-
- if (order != get_order(bytes)) {
- printk(KERN_WARNING "Warning: only able to allocate %ld MB "
- "for software IO TLB\n", (PAGE_SIZE << order) >> 20);
- io_tlb_nslabs = SLABS_PER_PAGE << order;
- bytes = io_tlb_nslabs << IO_TLB_SHIFT;
- }
io_tlb_end = io_tlb_start + bytes;
memset(io_tlb_start, 0, bytes);

@@ -287,6 +247,49 @@ cleanup2:
io_tlb_end = NULL;
free_pages((unsigned long)io_tlb_start, order);
io_tlb_start = NULL;
+ return -ENOMEM;
+}
+/*
+ * Systems with larger DMA zones (those that don't support ISA) can
+ * initialize the swiotlb later using the slab allocator if needed.
+ * This should be just like above, but with some error catching.
+ */
+int
+swiotlb_late_init_with_default_size(size_t default_size)
+{
+ unsigned long bytes, req_nslabs = io_tlb_nslabs;
+ unsigned int order;
+
+ if (!io_tlb_nslabs) {
+ io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
+ io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
+ }
+
+ /*
+ * Get IO TLB memory from the low pages
+ */
+ order = get_order(io_tlb_nslabs << IO_TLB_SHIFT);
+ io_tlb_nslabs = SLABS_PER_PAGE << order;
+ bytes = io_tlb_nslabs << IO_TLB_SHIFT;
+
+ while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
+ io_tlb_start = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
+ order);
+ if (io_tlb_start)
+ break;
+ order--;
+ }
+
+ if (!io_tlb_start)
+ goto cleanup1;
+
+ if (order != get_order(bytes)) {
+ printk(KERN_WARNING "Warning: only able to allocate %ld MB "
+ "for software IO TLB\n", (PAGE_SIZE << order) >> 20);
+ io_tlb_nslabs = SLABS_PER_PAGE << order;
+ }
+
+ return swiotlb_late_init_with_tbl(io_tlb_start, io_tlb_nslabs);
cleanup1:
io_tlb_nslabs = req_nslabs;
return -ENOMEM;


I don't have an IA64 to actually test that patch for regression. Under
Xen I keep on getting:


[ 0.431357] Warning: only able to allocate 4 MB for software IO TLB
[ 0.435386] Placing 4MB software IO TLB between ffff880000c00000 - ffff880001000000
[ 0.435404] software IO TLB at phys 0xc00000 - 0x1000000


Adding in a few printks shows we try order 14, 13, 12, 11, and 10
pages and succeed at the last order. Is that just because I am asking for
such huge pages? (The guest, BTW, has 2GB allocated to it.)
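
For reference, the arithmetic behind those orders, assuming 4 KiB pages (an
order-n allocation is 2^n contiguous pages):

  order 14 = 2^14 * 4 KiB = 64 MiB   (the full 64 MB Xen-SWIOTLB request)
  order 13 / 12 / 11      = 32 / 16 / 8 MiB
  order 10 = 2^10 * 4 KiB =  4 MiB   (the 4 MB that finally succeeded above)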


>
>
> btw, (c) is not the fallback case (i.e. if we can't find hardware
> IOMMU, we enable swiotlb). We use both hardware IOMMU and swiotlb on
> some systems.

Ooops. Yes, that was an oversimplification.

2010-08-02 15:26:35

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

> I can eliminate step c) by making a) 'pci_xen_swiotlb_detect' do
> what it does now and also utilize the x86_init.iommu.iommu_init.
> In essence making it an IOMMU-type-ish.
>
> The patch is on top of the other patches and the only reason I am calling
> in 'pci_iommu_alloc' the 'pci_xen_swiotlb_detect' before 'pci_swiotlb_detect'
> is because a user could specify 'swiotlb=force' and that would bypass the
> Xen SWIOTLB detection code and end up using the wrong dma_ops (under Xen
> of course). Oh, and I added a check in gart_iommu_hole_init() to stop it
> from setting the iommu_init to its own.
>
> What do you guys think?

And silence ensues. Let me back up a bit as I think I am heading the
wrong way.

hpa, are your concerns a) that inserting a sub-system call in the
generic code is not good? Or b) that we have five IOMMUs (counting SWIOTLB in
that category) and that we don't jettison from memory the ones we don't need
(that would be the primary goal of driverizing those IOMMUs, right)? Or c)
that we should remove all sub-system detect calls (Calgary, AMD, Intel, AGP)
altogether from pci-dma.c and depend more on the x86_init.iommu structure
(perhaps expand it)?

2010-08-02 15:31:06

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On 08/02/2010 08:25 AM, Konrad Rzeszutek Wilk wrote:
> hpa, are your concerns that a) inserting a sub-system call in the
> generic code is not good. Or b) that we have five IOMMUs (counting SWIOTLB in that
> category) and that we don't jettison from memory the ones we don't need
> (that would be the primary goal of driverization of those IOMMUs,
> right?). Or c) we should remove all sub-system detect calls (Calgary, AMD,
> Intel, AGP) altogether from pci-dma.c and depend more on
> x86_init.iommu structure (perhaps expend it?)

Sorry, had to deal with other stuff.

Basically, a) and c) are the issues, with a) being the more immediate;
the amount of code left in memory is relatively small and as such I'm
not too concerned with that aspect specifically.

With five IOMMUs we're well past the point where we need to have a clean
and generic interface instead of having everything be ad hoc and
interdependent.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-08-02 15:44:25

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On Mon, 02 Aug 2010 08:30:38 -0700
"H. Peter Anvin" <[email protected]> wrote:

> On 08/02/2010 08:25 AM, Konrad Rzeszutek Wilk wrote:
> > hpa, are your concerns that a) inserting a sub-system call in the
> > generic code is not good. Or b) that we have five IOMMUs (counting SWIOTLB in that
> > category) and that we don't jettison from memory the ones we don't need
> > (that would be the primary goal of driverization of those IOMMUs,
> > right?). Or c) we should remove all sub-system detect calls (Calgary, AMD,
> > Intel, AGP) altogether from pci-dma.c and depend more on
> > x86_init.iommu structure (perhaps expend it?)
>
> Sorry, had to deal with other stuff.
>
> Basically, a) and c) are the issues, with a) being the more immediate;
> the amount of code left in memory is relatively small and as such I'm
> not too concerned with that aspect specifically.
>
> With five IOMMUs we're well past the point where we need to have a clean
> and generic interface instead of having everything be ad hoc and
> interdependent.

That's the difficult part, because the IOMMUs are not
independent. Hardware IOMMUs are related to swiotlb, and GART and
AMD-IOMMU are related to each other.

We could invent some sort of IOMMU register interface and driver-ize the
IOMMUs, but they can't be made completely independent.

2010-08-02 15:48:40

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On 08/02/2010 08:43 AM, FUJITA Tomonori wrote:
>
> That's the difficult part, because the IOMMUs are not
> independent. Hardware IOMMUs are related to swiotlb, and GART and
> AMD-IOMMU are related to each other.
>
> We could invent some sort of IOMMU register interface and driver-ize the
> IOMMUs, but they can't be made completely independent.

Of course. However, we need there to be as much structure to it as
there can be.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-08-02 16:02:08

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On Mon, 02 Aug 2010 08:47:53 -0700
"H. Peter Anvin" <[email protected]> wrote:

> On 08/02/2010 08:43 AM, FUJITA Tomonori wrote:
> >
> > That's the difficult part, because the IOMMUs are not
> > independent. Hardware IOMMUs are related to swiotlb, and GART and
> > AMD-IOMMU are related to each other.
> >
> > We could invent some sort of IOMMU register interface and driver-ize the
> > IOMMUs, but they can't be made completely independent.
>
> Of course. However, we need there to be as much structure to it as
> there can be.

Ok, let's see if Konrad can invent something clean.

But his attempt to create a "swiotlb iommu function array" and a "hardware
iommu function array" looks like it makes the code more unreadable.

2010-08-02 16:44:26

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On Tue, Aug 03, 2010 at 01:01:08AM +0900, FUJITA Tomonori wrote:
> On Mon, 02 Aug 2010 08:47:53 -0700
> "H. Peter Anvin" <[email protected]> wrote:
>
> > On 08/02/2010 08:43 AM, FUJITA Tomonori wrote:
> > >
> > > That's the difficult part, because the IOMMUs are not
> > > independent. Hardware IOMMUs are related to swiotlb, and GART and
> > > AMD-IOMMU are related to each other.
> > >
> > > We could invent some sort of IOMMU register interface and driver-ize the
> > > IOMMUs, but they can't be made completely independent.
> >
> > Of course. However, we need there to be as much structure to it as
> > there can be.
>
> Ok, let's see if Konrad can invent something clean.

<chuckles> Thank you for your vote of confidence.
>
> But his attempt to create "swiotlb iommu function array" and "hardware
> iommu function array" looks like to makes the code more unreadable.

Let me go to my favorite coffee shop and think this one through.
Can I get a concession for putting the original patch in (the simple, dumb one),
and then:
- start working on the IOMMU register interface without having to try
to get it done for 2.6.36, and
- do the driverization as a separate cleanup.

2010-08-02 16:53:41

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On 08/02/2010 09:42 AM, Konrad Rzeszutek Wilk wrote:
>
> Let me go to my favorite coffee shop and think this one through.
> Can I get concession for putting the original patch in (the simple, dumb one),
> and then:
> - start working on the IOMMU register interface without having to try
> to get it done for 2.6.36, and
> - do the driverization as a seperate cleanup.
>

That makes sense.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2010-08-03 05:35:44

by FUJITA Tomonori

[permalink] [raw]
Subject: Re: [PATCH 9/9] x86: Detect whether we should use Xen SWIOTLB.

On Mon, 2 Aug 2010 12:42:34 -0400
Konrad Rzeszutek Wilk <[email protected]> wrote:

> > But his attempt to create "swiotlb iommu function array" and "hardware
> > iommu function array" looks like to makes the code more unreadable.
>
> Let me go to my favorite coffee shop and think this one through.
> Can I get concession for putting the original patch in (the simple, dumb one),
> and then:
> - start working on the IOMMU register interface without having to try
> to get it done for 2.6.36, and
> - do the driverization as a seperate cleanup.

We could simplify swiotlb initialization after 2.6.36. If we merge my
patches to expand swiotlb memory dynamically, we could initialize
swiotlb before any hw IOMMUs (see the commit
186a25026c44d1bfa97671110ff14dcd0c99678e).

If you can make the Xen swiotlb be initialized like a HW IOMMU, the IOMMU
initialization could be cleaner.

As I wrote, I also want to simplify swiotlb's init memory
allocation. I'll see what I can do on the whole issue.