Hi Ingo,
Here's the chunk of patches to add Xen Dom0 support (it's probably
worth creating a new xen/dom0 topic branch for it).
A dom0 Xen domain is basically the same as a normal domU domain, but
it has extra privileges to directly access hardware. There are two
issues to deal with:
- translating between the domain's pseudo-physical addresses and the
  real machine addresses (for ioremap and setting up DMA)
- routing hardware interrupts into the domain
ioremap is relatively easy to deal with. An earlier patch introduced
the _PAGE_IOMAP pte flag, which we use to distinguish between a
regular pseudo-physical mapping and a real machine mapping.
Everything falls out pretty cleanly. A consequence is that the
various pieces of table-parsing code - DMI, ACPI, MP, etc - work out
of the box.
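To illustrate the idea, here's a standalone toy in plain C; the bit
position and the p2m() helper below are made up for illustration and
are not the kernel's actual pte layout:

#include <stdint.h>
#include <stdio.h>

#define _PAGE_IOMAP (1u << 9)   /* illustrative bit position only */

/* toy pseudo-physical -> machine translation */
static uint64_t p2m(uint64_t pfn)
{
        return pfn + 0x100000;
}

static uint64_t pte_machine_frame(uint64_t pte)
{
        uint64_t pfn = pte >> 12;

        if (pte & _PAGE_IOMAP)
                return pfn;     /* ioremap: already a machine frame */

        return p2m(pfn);        /* normal pseudo-physical page */
}

int main(void)
{
        printf("ram  frame %#llx\n",
               (unsigned long long)pte_machine_frame(0x42ULL << 12));
        printf("mmio frame %#llx\n",
               (unsigned long long)
               pte_machine_frame((0xfee00ULL << 12) | _PAGE_IOMAP));
        return 0;
}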
Similarly, the series adds hooks into swiotlb so that architectures
can allocate the swiotlb memory appropriately; on the x86/xen side,
Xen hooks these allocation functions to make special hypercalls to
guarantee that the allocated memory is contiguous in machine memory.
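In rough shape, the Xen-side hook is just a strong definition of the
weak allocation function; xen_swiotlb_fixup() below is a hypothetical
stand-in for the hypercall-based frame exchange, not a function these
patches define:

/* Sketch only: a Xen override of the __weak hook added later in the
 * series.  xen_swiotlb_fixup() is a placeholder name for the hypercall
 * that swaps the backing frames for machine-contiguous ones. */
void * __init swiotlb_alloc_boot(size_t bytes, unsigned long nslabs)
{
        void *buf = alloc_bootmem_low_pages(bytes);

        if (buf)
                xen_swiotlb_fixup(buf, bytes, nslabs);

        return buf;
}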
Interrupts are a very different affair. The descriptions in each
patch describe how it all fits together in detail, but the overview
is:
1. Xen owns the local APICs; the dom0 kernel controls the IO APICs
2. Hardware interrupts are delivered on event channels like everything else
3. To set this up, we intercept at pcibios_enable_irq:
   - given a dev+pin, we use ACPI to get a gsi
   - hook acpi_register_gsi to call xen_register_gsi, which:
     - allocates an irq (generally not 1:1 with the gsi)
     - asks Xen for a vector and event channel for the irq
     - programs the IO APIC to deliver the hardware interrupt to
       the allocated vector
The upshot is that the device driver gets an irq, and when the
hardware raises an interrupt, it gets delivered on that irq.
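In rough code form (the helper names here are illustrative stand-ins,
not the actual functions in the patches):

static int xen_register_gsi(u32 gsi, int triggering, int polarity)
{
        /* allocate from our own irq space; generally not 1:1 with gsi */
        int irq = xen_allocate_irq(gsi);

        /* ask Xen for a vector and an event channel for this irq */
        int vector = xen_bind_pirq_to_vector(irq);

        /* we still own the IO APIC: route the hardware interrupt to
         * the vector Xen handed back; delivery then arrives on the
         * event channel bound to irq */
        xen_ioapic_route_gsi(gsi, vector, triggering, polarity);

        return irq;
}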
We maintain our own irq allocation space, since the hardware-bound
event channel irqs are intermixed with all the other normal Xen event
channel irqs (inter-domain, timers, IPIs, etc). For compatibility,
irqs 0-15 are reserved for legacy device interrupts, but the rest of
the range is dynamically allocated.
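A toy model of that allocation policy in plain C, just to show the
reserved legacy range (the real allocator lives in the event channel
code):

#include <stdbool.h>
#include <stdio.h>

#define NR_IRQS   256
#define NR_LEGACY 16            /* irqs 0-15: legacy device interrupts */

static bool irq_in_use[NR_IRQS];

/* hand out dynamic irqs from the top down, never touching 0-15 */
static int alloc_dynamic_irq(void)
{
        int irq;

        for (irq = NR_IRQS - 1; irq >= NR_LEGACY; irq--)
                if (!irq_in_use[irq]) {
                        irq_in_use[irq] = true;
                        return irq;
                }

        return -1;              /* irq space exhausted */
}

int main(void)
{
        printf("%d %d\n", alloc_dynamic_irq(), alloc_dynamic_irq());
        return 0;
}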
Initialization also requires care. The dom0 kernel parses the ACPI
tables as usual, in order to discover the local and IO APICs, and all
the rest of the ACPI-provided data the kernel requires. However,
because the kernel doesn't own the local APICs and can't directly map
the IO APICs, we must be sure to avoid actually touching the hardware
when running under Xen.
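Concretely, the guards look like the one below, lifted from the
acpi/boot.c patch later in this series:

static void __init acpi_register_lapic_address(unsigned long address)
{
        /* Xen dom0 doesn't have usable lapics */
        if (xen_initial_domain())
                return;

        mp_lapic_addr = address;
        set_fixmap_nocache(FIX_APIC_BASE, address);
}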
TODO: work out how to fit MSI into all this.
So, in summary, this series contains:
- dom0 console support
- dom0 xenbus support
- CPU features and IO access for a privileged domain
- mtrrs
- making ioremap work on machine addresses
- swiotlb allocation hooks
- interrupts
  - introduce PV io_apic operations
  - add Xen-specific IRQ allocator
  - switch to using all-Xen event delivery
  - add pirq Xen interrupt type
  - table parsing and setup
  - intercept driver interrupt registration
All this code will compile away to nothing when CONFIG_XEN_DOM0 is not
enabled. If it is enabled, it will only have an effect if booted as a
dom0 kernel; normal native execution and domU execution should be
unaffected.
Thanks,
J
From: Tej <[email protected]>
Signed-off-by: Tej <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/xen/enlighten.c | 18 ++++++++++--------
arch/x86/xen/mmu.c | 17 ++++++++++-------
arch/x86/xen/multicalls.c | 2 +-
arch/x86/xen/setup.c | 9 +++++----
4 files changed, 26 insertions(+), 20 deletions(-)
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -793,7 +793,7 @@
ret = 0;
- switch(msr) {
+ switch (msr) {
#ifdef CONFIG_X86_64
unsigned which;
u64 base;
@@ -1453,7 +1453,7 @@
ident_pte = 0;
pfn = 0;
- for(pmdidx = 0; pmdidx < PTRS_PER_PMD && pfn < max_pfn; pmdidx++) {
+ for (pmdidx = 0; pmdidx < PTRS_PER_PMD && pfn < max_pfn; pmdidx++) {
pte_t *pte_page;
/* Reuse or allocate a page of ptes */
@@ -1471,7 +1471,7 @@
}
/* Install mappings */
- for(pteidx = 0; pteidx < PTRS_PER_PTE; pteidx++, pfn++) {
+ for (pteidx = 0; pteidx < PTRS_PER_PTE; pteidx++, pfn++) {
pte_t pte;
if (pfn > max_pfn_mapped)
@@ -1485,7 +1485,7 @@
}
}
- for(pteidx = 0; pteidx < ident_pte; pteidx += PTRS_PER_PTE)
+ for (pteidx = 0; pteidx < ident_pte; pteidx += PTRS_PER_PTE)
set_page_prot(&level1_ident_pgt[pteidx], PAGE_KERNEL_RO);
set_page_prot(pmd, PAGE_KERNEL_RO);
@@ -1499,7 +1499,7 @@
/* All levels are converted the same way, so just treat them
as ptes. */
- for(i = 0; i < PTRS_PER_PTE; i++)
+ for (i = 0; i < PTRS_PER_PTE; i++)
pte[i] = xen_make_pte(pte[i].pte);
}
@@ -1514,7 +1514,8 @@
* of the physical mapping once some sort of allocator has been set
* up.
*/
-static __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
+static __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd,
+ unsigned long max_pfn)
{
pud_t *l3;
pmd_t *l2;
@@ -1577,7 +1578,8 @@
#else /* !CONFIG_X86_64 */
static pmd_t level2_kernel_pgt[PTRS_PER_PMD] __page_aligned_bss;
-static __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
+static __init pgd_t *xen_setup_kernel_pagetable(pgd_t *pgd,
+ unsigned long max_pfn)
{
pmd_t *kernel_pmd;
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -154,13 +154,13 @@
{
unsigned pfn, idx;
- for(pfn = 0; pfn < MAX_DOMAIN_PAGES; pfn += P2M_ENTRIES_PER_PAGE) {
+ for (pfn = 0; pfn < MAX_DOMAIN_PAGES; pfn += P2M_ENTRIES_PER_PAGE) {
unsigned topidx = p2m_top_index(pfn);
p2m_top_mfn[topidx] = virt_to_mfn(p2m_top[topidx]);
}
- for(idx = 0; idx < ARRAY_SIZE(p2m_top_mfn_list); idx++) {
+ for (idx = 0; idx < ARRAY_SIZE(p2m_top_mfn_list); idx++) {
unsigned topidx = idx * P2M_ENTRIES_PER_PAGE;
p2m_top_mfn_list[idx] = virt_to_mfn(&p2m_top_mfn[topidx]);
}
@@ -179,7 +179,7 @@
unsigned long max_pfn = min(MAX_DOMAIN_PAGES, xen_start_info->nr_pages);
unsigned pfn;
- for(pfn = 0; pfn < max_pfn; pfn += P2M_ENTRIES_PER_PAGE) {
+ for (pfn = 0; pfn < max_pfn; pfn += P2M_ENTRIES_PER_PAGE) {
unsigned topidx = p2m_top_index(pfn);
p2m_top[topidx] = &mfn_list[pfn];
@@ -207,7 +207,7 @@
p = (void *)__get_free_page(GFP_KERNEL | __GFP_NOFAIL);
BUG_ON(p == NULL);
- for(i = 0; i < P2M_ENTRIES_PER_PAGE; i++)
+ for (i = 0; i < P2M_ENTRIES_PER_PAGE; i++)
p[i] = INVALID_P2M_ENTRY;
if (cmpxchg(pp, p2m_missing, p) != p2m_missing)
@@ -407,7 +407,8 @@
preempt_enable();
}
-pte_t xen_ptep_modify_prot_start(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
+pte_t xen_ptep_modify_prot_start(struct mm_struct *mm,
+ unsigned long addr, pte_t *ptep)
{
/* Just return the pte as-is. We preserve the bits on commit */
return *ptep;
@@ -871,7 +872,8 @@
if (user_pgd) {
xen_pin_page(mm, virt_to_page(user_pgd), PT_PGD);
- xen_do_pin(MMUEXT_PIN_L4_TABLE, PFN_DOWN(__pa(user_pgd)));
+ xen_do_pin(MMUEXT_PIN_L4_TABLE,
+ PFN_DOWN(__pa(user_pgd)));
}
}
#else /* CONFIG_X86_32 */
@@ -986,7 +988,8 @@
pgd_t *user_pgd = xen_get_user_pgd(pgd);
if (user_pgd) {
- xen_do_pin(MMUEXT_UNPIN_TABLE, PFN_DOWN(__pa(user_pgd)));
+ xen_do_pin(MMUEXT_UNPIN_TABLE,
+ PFN_DOWN(__pa(user_pgd)));
xen_unpin_page(mm, virt_to_page(user_pgd), PT_PGD);
}
}
diff --git a/arch/x86/xen/multicalls.c b/arch/x86/xen/multicalls.c
--- a/arch/x86/xen/multicalls.c
+++ b/arch/x86/xen/multicalls.c
@@ -154,7 +154,7 @@
ret, smp_processor_id());
dump_stack();
for (i = 0; i < b->mcidx; i++) {
- printk(" call %2d/%d: op=%lu arg=[%lx] result=%ld\n",
+ printk(KERN_DEBUG " call %2d/%d: op=%lu arg=[%lx] result=%ld\n",
i+1, b->mcidx,
b->debug[i].op,
b->debug[i].args[0],
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -28,6 +28,9 @@
/* These are code, but not functions. Defined in entry.S */
extern const char xen_hypervisor_callback[];
extern const char xen_failsafe_callback[];
+extern void xen_sysenter_target(void);
+extern void xen_syscall_target(void);
+extern void xen_syscall32_target(void);
/**
@@ -110,7 +113,6 @@
void __cpuinit xen_enable_sysenter(void)
{
- extern void xen_sysenter_target(void);
int ret;
unsigned sysenter_feature;
@@ -132,8 +134,6 @@
{
#ifdef CONFIG_X86_64
int ret;
- extern void xen_syscall_target(void);
- extern void xen_syscall32_target(void);
ret = register_callback(CALLBACKTYPE_syscall, xen_syscall_target);
if (ret != 0) {
@@ -160,7 +160,8 @@
HYPERVISOR_vm_assist(VMASST_CMD_enable, VMASST_TYPE_writable_pagetables);
if (!xen_feature(XENFEAT_auto_translated_physmap))
- HYPERVISOR_vm_assist(VMASST_CMD_enable, VMASST_TYPE_pae_extended_cr3);
+ HYPERVISOR_vm_assist(VMASST_CMD_enable,
+ VMASST_TYPE_pae_extended_cr3);
if (register_callback(CALLBACKTYPE_event, xen_hypervisor_callback) ||
register_callback(CALLBACKTYPE_failsafe, xen_failsafe_callback))
The last usage of iommu_nr_pages() was removed by the patch set culminating in
commit e3c449f526cebb8d287241c7e82faafd9709668b
Author: Joerg Roedel <[email protected]>
Date: Wed Oct 15 22:02:11 2008 -0700
x86, AMD IOMMU: convert driver to generic iommu_num_pages function
Signed-off-by: Ian Campbell <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
---
arch/x86/include/asm/iommu.h | 2 --
arch/x86/kernel/pci-dma.c | 7 -------
2 files changed, 9 deletions(-)
diff --git a/arch/x86/include/asm/iommu.h b/arch/x86/include/asm/iommu.h
--- a/arch/x86/include/asm/iommu.h
+++ b/arch/x86/include/asm/iommu.h
@@ -8,8 +8,6 @@
extern int iommu_detected;
extern int dmar_disabled;
-extern unsigned long iommu_nr_pages(unsigned long addr, unsigned long len);
-
/* 10 seconds */
#define DMAR_OPERATION_TIMEOUT ((cycles_t) tsc_khz*10*1000)
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -125,13 +125,6 @@
pci_swiotlb_init();
}
-unsigned long iommu_nr_pages(unsigned long addr, unsigned long len)
-{
- unsigned long size = roundup((addr & ~PAGE_MASK) + len, PAGE_SIZE);
-
- return size >> PAGE_SHIFT;
-}
-EXPORT_SYMBOL(iommu_nr_pages);
#endif
void *dma_generic_alloc_coherent(struct device *dev, size_t size,
Architectures may need to allocate memory specially for use with
the swiotlb. Create the weak functions swiotlb_alloc_boot() and
swiotlb_alloc(), defaulting to the current behaviour.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Ian Campbell <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Jan Beulich <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: FUJITA Tomonori <[email protected]>
---
include/linux/swiotlb.h | 3 +++
lib/swiotlb.c | 16 +++++++++++++---
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -10,6 +10,9 @@
extern void
swiotlb_init(void);
+extern void *swiotlb_alloc_boot(size_t bytes, unsigned long nslabs);
+extern void *swiotlb_alloc(unsigned order, unsigned long nslabs);
+
extern void
*swiotlb_alloc_coherent(struct device *hwdev, size_t size,
dma_addr_t *dma_handle, gfp_t flags);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -21,6 +21,7 @@
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/spinlock.h>
+#include <linux/swiotlb.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/ctype.h>
@@ -126,6 +127,16 @@
__setup("swiotlb=", setup_io_tlb_npages);
/* make io_tlb_overflow tunable too? */
+void * __weak swiotlb_alloc_boot(size_t size, unsigned long nslabs)
+{
+ return alloc_bootmem_low_pages(size);
+}
+
+void * __weak swiotlb_alloc(unsigned order, unsigned long nslabs)
+{
+ return (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN, order);
+}
+
/*
* Statically reserve bounce buffer space and initialize bounce buffer data
* structures for the software IO TLB used to implement the DMA API.
@@ -145,7 +156,7 @@
/*
* Get IO TLB memory from the low pages
*/
- io_tlb_start = alloc_bootmem_low_pages(bytes);
+ io_tlb_start = swiotlb_alloc_boot(bytes, io_tlb_nslabs);
if (!io_tlb_start)
panic("Cannot allocate SWIOTLB buffer");
io_tlb_end = io_tlb_start + bytes;
@@ -202,8 +213,7 @@
bytes = io_tlb_nslabs << IO_TLB_SHIFT;
while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
- io_tlb_start = (char *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
- order);
+ io_tlb_start = swiotlb_alloc(order, io_tlb_nslabs);
if (io_tlb_start)
break;
order--;
From: Ian Campbell <[email protected]>
Signed-off-by: Ian Campbell <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/linux/swiotlb.h | 14 ++++++++++++++
lib/swiotlb.c | 13 -------------
2 files changed, 14 insertions(+), 13 deletions(-)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -7,6 +7,20 @@
struct dma_attrs;
struct scatterlist;
+/*
+ * Maximum allowable number of contiguous slabs to map,
+ * must be a power of 2. What is the appropriate value ?
+ * The complexity of {map,unmap}_single is linearly dependent on this value.
+ */
+#define IO_TLB_SEGSIZE 128
+
+
+/*
+ * log of the size of each IO TLB slab. The number of slabs is command line
+ * controllable.
+ */
+#define IO_TLB_SHIFT 11
+
extern void
swiotlb_init(void);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -40,19 +40,6 @@
#define SG_ENT_VIRT_ADDRESS(sg) (sg_virt((sg)))
#define SG_ENT_PHYS_ADDRESS(sg) virt_to_bus(SG_ENT_VIRT_ADDRESS(sg))
-/*
- * Maximum allowable number of contiguous slabs to map,
- * must be a power of 2. What is the appropriate value ?
- * The complexity of {map,unmap}_single is linearly dependent on this value.
- */
-#define IO_TLB_SEGSIZE 128
-
-/*
- * log of the size of each IO TLB slab. The number of slabs is command line
- * controllable.
- */
-#define IO_TLB_SHIFT 11
-
#define SLABS_PER_PAGE (1 << (PAGE_SHIFT - IO_TLB_SHIFT))
/*
From: Ian Campbell <[email protected]>
Architectures may need to override the virt_to_bus()/bus_to_virt()
conversions. Implement a __weak hook point containing the default
implementation.
Signed-off-by: Ian Campbell <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/linux/swiotlb.h | 3 +++
lib/swiotlb.c | 42 ++++++++++++++++++++++++++----------------
2 files changed, 29 insertions(+), 16 deletions(-)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -27,6 +27,9 @@
extern void *swiotlb_alloc_boot(size_t bytes, unsigned long nslabs);
extern void *swiotlb_alloc(unsigned order, unsigned long nslabs);
+extern unsigned long __weak swiotlb_virt_to_bus(volatile void *address);
+extern void * __weak swiotlb_bus_to_virt(unsigned long address);
+
extern void
*swiotlb_alloc_coherent(struct device *hwdev, size_t size,
dma_addr_t *dma_handle, gfp_t flags);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -125,6 +125,16 @@
return (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN, order);
}
+unsigned long __weak swiotlb_virt_to_bus(volatile void *address)
+{
+ return virt_to_bus(address);
+}
+
+void * __weak swiotlb_bus_to_virt(unsigned long address)
+{
+ return bus_to_virt(address);
+}
+
/*
* Statically reserve bounce buffer space and initialize bounce buffer data
* structures for the software IO TLB used to implement the DMA API.
@@ -168,7 +178,7 @@
panic("Cannot allocate SWIOTLB overflow buffer!\n");
printk(KERN_INFO "Placing software IO TLB between 0x%lx - 0x%lx\n",
- virt_to_bus(io_tlb_start), virt_to_bus(io_tlb_end));
+ swiotlb_virt_to_bus(io_tlb_start), swiotlb_virt_to_bus(io_tlb_end));
}
void __init
@@ -250,7 +260,7 @@
printk(KERN_INFO "Placing %luMB software IO TLB between 0x%lx - "
"0x%lx\n", bytes >> 20,
- virt_to_bus(io_tlb_start), virt_to_bus(io_tlb_end));
+ swiotlb_virt_to_bus(io_tlb_start), swiotlb_virt_to_bus(io_tlb_end));
return 0;
@@ -298,7 +308,7 @@
unsigned long max_slots;
mask = dma_get_seg_boundary(hwdev);
- start_dma_addr = virt_to_bus(io_tlb_start) & mask;
+ start_dma_addr = swiotlb_virt_to_bus(io_tlb_start) & mask;
offset_slots = ALIGN(start_dma_addr, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
max_slots = mask + 1
@@ -467,7 +477,7 @@
int order = get_order(size);
ret = (void *)__get_free_pages(flags, order);
- if (ret && address_needs_mapping(hwdev, virt_to_bus(ret), size)) {
+ if (ret && address_needs_mapping(hwdev, swiotlb_virt_to_bus(ret), size)) {
/*
* The allocated memory isn't reachable by the device.
* Fall back on swiotlb_map_single().
@@ -488,7 +498,7 @@
}
memset(ret, 0, size);
- dev_addr = virt_to_bus(ret);
+ dev_addr = swiotlb_virt_to_bus(ret);
/* Confirm address can be DMA'd by device */
if (address_needs_mapping(hwdev, dev_addr, size)) {
@@ -548,7 +558,7 @@
swiotlb_map_single_attrs(struct device *hwdev, void *ptr, size_t size,
int dir, struct dma_attrs *attrs)
{
- dma_addr_t dev_addr = virt_to_bus(ptr);
+ dma_addr_t dev_addr = swiotlb_virt_to_bus(ptr);
void *map;
BUG_ON(dir == DMA_NONE);
@@ -569,7 +579,7 @@
map = io_tlb_overflow_buffer;
}
- dev_addr = virt_to_bus(map);
+ dev_addr = swiotlb_virt_to_bus(map);
/*
* Ensure that the address returned is DMA'ble
@@ -599,7 +609,7 @@
swiotlb_unmap_single_attrs(struct device *hwdev, dma_addr_t dev_addr,
size_t size, int dir, struct dma_attrs *attrs)
{
- char *dma_addr = bus_to_virt(dev_addr);
+ char *dma_addr = swiotlb_bus_to_virt(dev_addr);
BUG_ON(dir == DMA_NONE);
if (is_swiotlb_buffer(dma_addr))
@@ -629,7 +639,7 @@
swiotlb_sync_single(struct device *hwdev, dma_addr_t dev_addr,
size_t size, int dir, int target)
{
- char *dma_addr = bus_to_virt(dev_addr);
+ char *dma_addr = swiotlb_bus_to_virt(dev_addr);
BUG_ON(dir == DMA_NONE);
if (is_swiotlb_buffer(dma_addr))
@@ -660,7 +670,7 @@
unsigned long offset, size_t size,
int dir, int target)
{
- char *dma_addr = bus_to_virt(dev_addr) + offset;
+ char *dma_addr = swiotlb_bus_to_virt(dev_addr) + offset;
BUG_ON(dir == DMA_NONE);
if (is_swiotlb_buffer(dma_addr))
@@ -716,7 +726,7 @@
for_each_sg(sgl, sg, nelems, i) {
addr = SG_ENT_VIRT_ADDRESS(sg);
- dev_addr = virt_to_bus(addr);
+ dev_addr = swiotlb_virt_to_bus(addr);
if (swiotlb_force ||
address_needs_mapping(hwdev, dev_addr, sg->length)) {
void *map = map_single(hwdev, addr, sg->length, dir);
@@ -729,7 +739,7 @@
sgl[0].dma_length = 0;
return 0;
}
- sg->dma_address = virt_to_bus(map);
+ sg->dma_address = swiotlb_virt_to_bus(map);
} else
sg->dma_address = dev_addr;
sg->dma_length = sg->length;
@@ -760,7 +770,7 @@
for_each_sg(sgl, sg, nelems, i) {
if (sg->dma_address != SG_ENT_PHYS_ADDRESS(sg))
- unmap_single(hwdev, bus_to_virt(sg->dma_address),
+ unmap_single(hwdev, swiotlb_bus_to_virt(sg->dma_address),
sg->dma_length, dir);
else if (dir == DMA_FROM_DEVICE)
dma_mark_clean(SG_ENT_VIRT_ADDRESS(sg), sg->dma_length);
@@ -793,7 +803,7 @@
for_each_sg(sgl, sg, nelems, i) {
if (sg->dma_address != SG_ENT_PHYS_ADDRESS(sg))
- sync_single(hwdev, bus_to_virt(sg->dma_address),
+ sync_single(hwdev, swiotlb_bus_to_virt(sg->dma_address),
sg->dma_length, dir, target);
else if (dir == DMA_FROM_DEVICE)
dma_mark_clean(SG_ENT_VIRT_ADDRESS(sg), sg->dma_length);
@@ -817,7 +827,7 @@
int
swiotlb_dma_mapping_error(struct device *hwdev, dma_addr_t dma_addr)
{
- return (dma_addr == virt_to_bus(io_tlb_overflow_buffer));
+ return (dma_addr == swiotlb_virt_to_bus(io_tlb_overflow_buffer));
}
/*
@@ -829,7 +839,7 @@
int
swiotlb_dma_supported(struct device *hwdev, u64 mask)
{
- return virt_to_bus(io_tlb_end - 1) <= mask;
+ return swiotlb_virt_to_bus(io_tlb_end - 1) <= mask;
}
EXPORT_SYMBOL(swiotlb_map_single);
hypervisor.h had accumulated a lot of crud, including lots of spurious
#includes. Clean it all up, and go around fixing up everything else
accordingly.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/xen/hypercall.h | 6 +++++
arch/x86/include/asm/xen/hypervisor.h | 39 ++++++---------------------------
arch/x86/include/asm/xen/page.h | 5 ++++
arch/x86/xen/enlighten.c | 1
drivers/xen/balloon.c | 4 ++-
drivers/xen/features.c | 6 ++++-
drivers/xen/grant-table.c | 1
include/xen/interface/event_channel.h | 2 +
8 files changed, 31 insertions(+), 33 deletions(-)
diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -33,8 +33,14 @@
#ifndef _ASM_X86_XEN_HYPERCALL_H
#define _ASM_X86_XEN_HYPERCALL_H
+#include <linux/kernel.h>
+#include <linux/spinlock.h>
#include <linux/errno.h>
#include <linux/string.h>
+#include <linux/types.h>
+
+#include <asm/page.h>
+#include <asm/pgtable.h>
#include <xen/interface/xen.h>
#include <xen/interface/sched.h>
diff --git a/arch/x86/include/asm/xen/hypervisor.h b/arch/x86/include/asm/xen/hypervisor.h
--- a/arch/x86/include/asm/xen/hypervisor.h
+++ b/arch/x86/include/asm/xen/hypervisor.h
@@ -33,39 +33,10 @@
#ifndef _ASM_X86_XEN_HYPERVISOR_H
#define _ASM_X86_XEN_HYPERVISOR_H
-#include <linux/types.h>
-#include <linux/kernel.h>
-
-#include <xen/interface/xen.h>
-#include <xen/interface/version.h>
-
-#include <asm/ptrace.h>
-#include <asm/page.h>
-#include <asm/desc.h>
-#if defined(__i386__)
-# ifdef CONFIG_X86_PAE
-# include <asm-generic/pgtable-nopud.h>
-# else
-# include <asm-generic/pgtable-nopmd.h>
-# endif
-#endif
-#include <asm/xen/hypercall.h>
-
/* arch/i386/kernel/setup.c */
extern struct shared_info *HYPERVISOR_shared_info;
extern struct start_info *xen_start_info;
-/* arch/i386/mach-xen/evtchn.c */
-/* Force a proper event-channel callback from Xen. */
-extern void force_evtchn_callback(void);
-
-/* Turn jiffies into Xen system time. */
-u64 jiffies_to_st(unsigned long jiffies);
-
-
-#define MULTI_UVMFLAGS_INDEX 3
-#define MULTI_UVMDOMID_INDEX 4
-
enum xen_domain_type {
XEN_NATIVE,
XEN_PV_DOMAIN,
@@ -74,9 +45,15 @@
extern enum xen_domain_type xen_domain_type;
+#ifdef CONFIG_XEN
#define xen_domain() (xen_domain_type != XEN_NATIVE)
-#define xen_pv_domain() (xen_domain_type == XEN_PV_DOMAIN)
+#else
+#define xen_domain() (0)
+#endif
+
+#define xen_pv_domain() (xen_domain() && xen_domain_type == XEN_PV_DOMAIN)
+#define xen_hvm_domain() (xen_domain() && xen_domain_type == XEN_HVM_DOMAIN)
+
#define xen_initial_domain() (xen_pv_domain() && xen_start_info->flags & SIF_INITDOMAIN)
-#define xen_hvm_domain() (xen_domain_type == XEN_HVM_DOMAIN)
#endif /* _ASM_X86_XEN_HYPERVISOR_H */
diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -1,11 +1,16 @@
#ifndef _ASM_X86_XEN_PAGE_H
#define _ASM_X86_XEN_PAGE_H
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/spinlock.h>
#include <linux/pfn.h>
#include <asm/uaccess.h>
+#include <asm/page.h>
#include <asm/pgtable.h>
+#include <xen/interface/xen.h>
#include <xen/features.h>
/* Xen machine address */
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -28,6 +28,7 @@
#include <linux/console.h>
#include <xen/interface/xen.h>
+#include <xen/interface/version.h>
#include <xen/interface/physdev.h>
#include <xen/interface/vcpu.h>
#include <xen/features.h>
diff --git a/drivers/xen/balloon.c b/drivers/xen/balloon.c
--- a/drivers/xen/balloon.c
+++ b/drivers/xen/balloon.c
@@ -44,13 +44,15 @@
#include <linux/list.h>
#include <linux/sysdev.h>
-#include <asm/xen/hypervisor.h>
#include <asm/page.h>
#include <asm/pgalloc.h>
#include <asm/pgtable.h>
#include <asm/uaccess.h>
#include <asm/tlb.h>
+#include <asm/xen/hypervisor.h>
+#include <asm/xen/hypercall.h>
+#include <xen/interface/xen.h>
#include <xen/interface/memory.h>
#include <xen/xenbus.h>
#include <xen/features.h>
diff --git a/drivers/xen/features.c b/drivers/xen/features.c
--- a/drivers/xen/features.c
+++ b/drivers/xen/features.c
@@ -8,7 +8,11 @@
#include <linux/types.h>
#include <linux/cache.h>
#include <linux/module.h>
-#include <asm/xen/hypervisor.h>
+
+#include <asm/xen/hypercall.h>
+
+#include <xen/interface/xen.h>
+#include <xen/interface/version.h>
#include <xen/features.h>
u8 xen_features[XENFEAT_NR_SUBMAPS * 32] __read_mostly;
diff --git a/drivers/xen/grant-table.c b/drivers/xen/grant-table.c
--- a/drivers/xen/grant-table.c
+++ b/drivers/xen/grant-table.c
@@ -40,6 +40,7 @@
#include <xen/interface/xen.h>
#include <xen/page.h>
#include <xen/grant_table.h>
+#include <asm/xen/hypercall.h>
#include <asm/pgtable.h>
#include <asm/sync_bitops.h>
diff --git a/include/xen/interface/event_channel.h b/include/xen/interface/event_channel.h
--- a/include/xen/interface/event_channel.h
+++ b/include/xen/interface/event_channel.h
@@ -9,6 +9,8 @@
#ifndef __XEN_PUBLIC_EVENT_CHANNEL_H__
#define __XEN_PUBLIC_EVENT_CHANNEL_H__
+#include <xen/interface/xen.h>
+
typedef uint32_t evtchn_port_t;
DEFINE_GUEST_HANDLE(evtchn_port_t);
When booting in Xen dom0, the hpet isn't really accessible, so make
sure the mapping is non-NULL before use.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/hpet.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
--- a/arch/x86/kernel/hpet.c
+++ b/arch/x86/kernel/hpet.c
@@ -834,10 +834,11 @@
hpet_address = force_hpet_address;
hpet_enable();
- if (!hpet_virt_address)
- return -ENODEV;
}
+ if (!hpet_virt_address)
+ return -ENODEV;
+
hpet_reserve_platform_timers(hpet_readl(HPET_ID));
for_each_online_cpu(cpu) {
Convert the apic_ops macros into inline functions, mainly to get
proper type-checking and consistency.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/apic.h | 35 +++++++++++++++++++++++++++++------
1 file changed, 29 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -125,12 +125,35 @@
extern struct apic_ops *apic_ops;
-#define apic_read (apic_ops->read)
-#define apic_write (apic_ops->write)
-#define apic_icr_read (apic_ops->icr_read)
-#define apic_icr_write (apic_ops->icr_write)
-#define apic_wait_icr_idle (apic_ops->wait_icr_idle)
-#define safe_apic_wait_icr_idle (apic_ops->safe_wait_icr_idle)
+static inline u32 apic_read(u32 reg)
+{
+ return apic_ops->read(reg);
+}
+
+static inline void apic_write(u32 reg, u32 val)
+{
+ apic_ops->write(reg, val);
+}
+
+static inline u64 apic_icr_read(void)
+{
+ return apic_ops->icr_read();
+}
+
+static inline void apic_icr_write(u32 low, u32 high)
+{
+ apic_ops->icr_write(low, high);
+}
+
+static inline void apic_wait_icr_idle(void)
+{
+ apic_ops->wait_icr_idle();
+}
+
+static inline u32 safe_apic_wait_icr_idle(void)
+{
+ return apic_ops->safe_wait_icr_idle();
+}
extern int get_physical_broadcast(void);
Xen dom0 needs to paravirtualize IO operations to the IO APIC, so add
an io_apic_ops structure for it to intercept. Do this as an ops
structure because there's at least some chance that another
paravirtualized environment may want to intercept these.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/io_apic.h | 9 +++++++
arch/x86/kernel/io_apic.c | 50 +++++++++++++++++++++++++++++++++++++---
2 files changed, 56 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
--- a/arch/x86/include/asm/io_apic.h
+++ b/arch/x86/include/asm/io_apic.h
@@ -21,6 +21,15 @@
#define IO_APIC_REDIR_LEVEL_TRIGGER (1 << 15)
#define IO_APIC_REDIR_MASKED (1 << 16)
+struct io_apic_ops {
+ void (*init)(void);
+ unsigned int (*read)(unsigned int apic, unsigned int reg);
+ void (*write)(unsigned int apic, unsigned int reg, unsigned int value);
+ void (*modify)(unsigned int apic, unsigned int reg, unsigned int value);
+};
+
+void __init set_io_apic_ops(const struct io_apic_ops *);
+
/*
* The structure of the IO-APIC:
*/
diff --git a/arch/x86/kernel/io_apic.c b/arch/x86/kernel/io_apic.c
--- a/arch/x86/kernel/io_apic.c
+++ b/arch/x86/kernel/io_apic.c
@@ -67,6 +67,25 @@
#define __apicdebuginit(type) static type __init
+static void __init __ioapic_init_mappings(void);
+static unsigned int __io_apic_read(unsigned int apic, unsigned int reg);
+static void __io_apic_write(unsigned int apic, unsigned int reg,
+ unsigned int val);
+static void __io_apic_modify(unsigned int apic, unsigned int reg,
+ unsigned int val);
+
+static struct io_apic_ops io_apic_ops = {
+ .init = __ioapic_init_mappings,
+ .read = __io_apic_read,
+ .write = __io_apic_write,
+ .modify = __io_apic_modify,
+};
+
+void __init set_io_apic_ops(const struct io_apic_ops *ops)
+{
+ io_apic_ops = *ops;
+}
+
/*
* Is the SiS APIC rmw bug present ?
* -1 = don't know, 0 = no, 1 = yes
@@ -196,6 +215,24 @@
return pin;
}
+static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
+{
+ return io_apic_ops.read(apic, reg);
+}
+
+static inline void io_apic_write(unsigned int apic, unsigned int reg,
+ unsigned int value)
+{
+ io_apic_ops.write(apic, reg, value);
+}
+
+static inline void io_apic_modify(unsigned int apic, unsigned int reg,
+ unsigned int value)
+{
+ io_apic_ops.modify(apic, reg, value);
+}
+
+
struct io_apic {
unsigned int index;
unsigned int unused[3];
@@ -208,14 +245,15 @@
+ (mp_ioapics[idx].mp_apicaddr & ~PAGE_MASK);
}
-static inline unsigned int io_apic_read(unsigned int apic, unsigned int reg)
+static unsigned int __io_apic_read(unsigned int apic, unsigned int reg)
{
struct io_apic __iomem *io_apic = io_apic_base(apic);
writel(reg, &io_apic->index);
return readl(&io_apic->data);
}
-static inline void io_apic_write(unsigned int apic, unsigned int reg, unsigned int value)
+static void __io_apic_write(unsigned int apic, unsigned int reg,
+ unsigned int value)
{
struct io_apic __iomem *io_apic = io_apic_base(apic);
writel(reg, &io_apic->index);
@@ -228,7 +266,8 @@
*
* Older SiS APIC requires we rewrite the index register
*/
-static inline void io_apic_modify(unsigned int apic, unsigned int reg, unsigned int value)
+static void __io_apic_modify(unsigned int apic, unsigned int reg,
+ unsigned int value)
{
struct io_apic __iomem *io_apic = io_apic_base(apic);
@@ -3848,6 +3887,11 @@
void __init ioapic_init_mappings(void)
{
+ io_apic_ops.init();
+}
+
+static void __init __ioapic_init_mappings(void)
+{
unsigned long ioapic_phys, idx = FIX_IO_APIC_BASE_0;
struct resource *ioapic_res;
int i;
exit_idle() is only used on 64-bit, but define a noop version on
32-bit to allow subsequent unification.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/idle.h | 10 ++++++++++
arch/x86/kernel/apic.c | 7 +------
arch/x86/kernel/io_apic.c | 2 --
3 files changed, 11 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/asm/idle.h b/arch/x86/include/asm/idle.h
--- a/arch/x86/include/asm/idle.h
+++ b/arch/x86/include/asm/idle.h
@@ -8,8 +8,18 @@
void idle_notifier_register(struct notifier_block *n);
void idle_notifier_unregister(struct notifier_block *n);
+#ifdef CONFIG_X86_64
void enter_idle(void);
void exit_idle(void);
+#else
+static inline void enter_idle(void)
+{
+}
+
+static inline void exit_idle(void)
+{
+}
+#endif /* CONFIG_X86_64 */
void c1e_remove_cpu(int cpu);
diff --git a/arch/x86/kernel/apic.c b/arch/x86/kernel/apic.c
--- a/arch/x86/kernel/apic.c
+++ b/arch/x86/kernel/apic.c
@@ -808,9 +808,8 @@
* Besides, if we don't timer interrupts ignore the global
* interrupt lock, which is the WrongThing (tm) to do.
*/
-#ifdef CONFIG_X86_64
exit_idle();
-#endif
+
irq_enter();
local_apic_timer_interrupt();
irq_exit();
@@ -1668,9 +1667,7 @@
{
u32 v;
-#ifdef CONFIG_X86_64
exit_idle();
-#endif
irq_enter();
/*
* Check if this really is a spurious interrupt and ACK it
@@ -1699,9 +1696,7 @@
{
u32 v, v1;
-#ifdef CONFIG_X86_64
exit_idle();
-#endif
irq_enter();
/* First tickle the hardware, only then report what went on. -- REW */
v = apic_read(APIC_ESR);
diff --git a/arch/x86/kernel/io_apic.c b/arch/x86/kernel/io_apic.c
--- a/arch/x86/kernel/io_apic.c
+++ b/arch/x86/kernel/io_apic.c
@@ -2242,9 +2242,7 @@
{
unsigned vector, me;
ack_APIC_irq();
-#ifdef CONFIG_X86_64
exit_idle();
-#endif
irq_enter();
me = smp_processor_id();
Xen uses a different interrupt path, so introduce handle_irq() to
allow interrupts to be inserted into the normal interrupt path. This
is handled slightly differently on 32 and 64-bit.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/irq.h | 1 +
arch/x86/kernel/irq_32.c | 34 +++++++++++++++++++++-------------
arch/x86/kernel/irq_64.c | 27 ++++++++++++++++++---------
3 files changed, 40 insertions(+), 22 deletions(-)
diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -39,6 +39,7 @@
extern unsigned int do_IRQ(struct pt_regs *regs);
extern void init_IRQ(void);
extern void native_init_IRQ(void);
+extern bool handle_irq(unsigned irq, struct pt_regs *regs);
/* Interrupt vector management */
extern DECLARE_BITMAP(used_vectors, NR_VECTORS);
diff --git a/arch/x86/kernel/irq_32.c b/arch/x86/kernel/irq_32.c
--- a/arch/x86/kernel/irq_32.c
+++ b/arch/x86/kernel/irq_32.c
@@ -191,6 +191,26 @@
execute_on_irq_stack(int overflow, struct irq_desc *desc, int irq) { return 0; }
#endif
+bool handle_irq(unsigned irq, struct pt_regs *regs)
+{
+ struct irq_desc *desc;
+ int overflow;
+
+ overflow = check_stack_overflow();
+
+ desc = irq_to_desc(irq);
+ if (unlikely(!desc))
+ return false;
+
+ if (!execute_on_irq_stack(overflow, desc, irq)) {
+ if (unlikely(overflow))
+ print_stack_overflow();
+ desc->handle_irq(irq, desc);
+ }
+
+ return true;
+}
+
/*
* do_IRQ handles all normal device IRQ's (the special
* SMP cross-CPU interrupts have their own specific
@@ -200,31 +220,19 @@
{
struct pt_regs *old_regs;
/* high bit used in ret_from_ code */
- int overflow;
unsigned vector = ~regs->orig_ax;
- struct irq_desc *desc;
unsigned irq;
-
old_regs = set_irq_regs(regs);
irq_enter();
irq = __get_cpu_var(vector_irq)[vector];
- overflow = check_stack_overflow();
-
- desc = irq_to_desc(irq);
- if (unlikely(!desc)) {
+ if (!handle_irq(irq, regs)) {
printk(KERN_EMERG "%s: cannot handle IRQ %d vector %#x cpu %d\n",
__func__, irq, vector, smp_processor_id());
BUG();
}
- if (!execute_on_irq_stack(overflow, desc, irq)) {
- if (unlikely(overflow))
- print_stack_overflow();
- desc->handle_irq(irq, desc);
- }
-
irq_exit();
set_irq_regs(old_regs);
return 1;
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -42,6 +42,22 @@
}
#endif
+bool handle_irq(unsigned irq, struct pt_regs *regs)
+{
+ struct irq_desc *desc;
+
+#ifdef CONFIG_DEBUG_STACKOVERFLOW
+ stack_overflow_check(regs);
+#endif
+
+ desc = irq_to_desc(irq);
+ if (unlikely(!desc))
+ return false;
+
+ generic_handle_irq_desc(irq, desc);
+ return true;
+}
+
/*
* do_IRQ handles all normal device IRQ's (the special
* SMP cross-CPU interrupts have their own specific
@@ -50,7 +66,6 @@
asmlinkage unsigned int do_IRQ(struct pt_regs *regs)
{
struct pt_regs *old_regs = set_irq_regs(regs);
- struct irq_desc *desc;
/* high bit used in ret_from_ code */
unsigned vector = ~regs->orig_ax;
@@ -58,16 +73,10 @@
exit_idle();
irq_enter();
+
irq = __get_cpu_var(vector_irq)[vector];
-#ifdef CONFIG_DEBUG_STACKOVERFLOW
- stack_overflow_check(regs);
-#endif
-
- desc = irq_to_desc(irq);
- if (likely(desc))
- generic_handle_irq_desc(irq, desc);
- else {
+ if (!handle_irq(irq, regs)) {
if (!disable_apic)
ack_APIC_irq();
When running in Xen dom0, we still want to parse the ACPI tables to
find out about local and IO apics, but we don't want to actually use
the lapics.
Put a couple of tests for Xen to prevent lapics from being mapped or
accessed. This is very Xen-specific behaviour, so there didn't seem to
be any point in adding more indirection.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/xen/hypervisor.h | 2 ++
arch/x86/kernel/acpi/boot.c | 10 ++++++++++
2 files changed, 12 insertions(+)
diff --git a/arch/x86/include/asm/xen/hypervisor.h b/arch/x86/include/asm/xen/hypervisor.h
--- a/arch/x86/include/asm/xen/hypervisor.h
+++ b/arch/x86/include/asm/xen/hypervisor.h
@@ -46,6 +46,8 @@
extern enum xen_domain_type xen_domain_type;
#ifdef CONFIG_XEN
+#include <xen/interface/xen.h>
+
#define xen_domain() (xen_domain_type != XEN_NATIVE)
#else
#define xen_domain() (0)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -42,6 +42,8 @@
#include <asm/mpspec.h>
#include <asm/smp.h>
+#include <asm/xen/hypervisor.h>
+
#ifdef CONFIG_X86_LOCAL_APIC
# include <mach_apic.h>
#endif
@@ -234,6 +236,10 @@
{
unsigned int ver = 0;
+ /* We don't want to register lapics when in Xen dom0 */
+ if (xen_initial_domain())
+ return;
+
if (!enabled) {
++disabled_cpus;
return;
@@ -755,6 +761,10 @@
static void __init acpi_register_lapic_address(unsigned long address)
{
+ /* Xen dom0 doesn't have usable lapics */
+ if (xen_initial_domain())
+ return;
+
mp_lapic_addr = address;
set_fixmap_nocache(FIX_APIC_BASE, address);
Unstatic ioapic_write_entry and setup_ioapic_entry functions so that
the Xen code can do its own ioapic routing setup.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/io_apic.h | 6 ++++++
arch/x86/kernel/io_apic.c | 10 +++++-----
2 files changed, 11 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/io_apic.h b/arch/x86/include/asm/io_apic.h
--- a/arch/x86/include/asm/io_apic.h
+++ b/arch/x86/include/asm/io_apic.h
@@ -209,6 +209,12 @@
extern int probe_nr_irqs(void);
+extern int setup_ioapic_entry(int apic, int irq,
+ struct IO_APIC_route_entry *entry,
+ unsigned int destination, int trigger,
+ int polarity, int vector);
+extern void ioapic_write_entry(int apic, int pin,
+ struct IO_APIC_route_entry e);
#else /* !CONFIG_X86_IO_APIC */
#define io_apic_assign_pci_irqs 0
static const int timer_through_8259 = 0;
diff --git a/arch/x86/kernel/io_apic.c b/arch/x86/kernel/io_apic.c
--- a/arch/x86/kernel/io_apic.c
+++ b/arch/x86/kernel/io_apic.c
@@ -337,7 +337,7 @@
io_apic_write(apic, 0x10 + 2*pin, eu.w1);
}
-static void ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e)
+void ioapic_write_entry(int apic, int pin, struct IO_APIC_route_entry e)
{
unsigned long flags;
spin_lock_irqsave(&ioapic_lock, flags);
@@ -1275,10 +1275,10 @@
handle_edge_irq, "edge");
}
-static int setup_ioapic_entry(int apic, int irq,
- struct IO_APIC_route_entry *entry,
- unsigned int destination, int trigger,
- int polarity, int vector)
+int setup_ioapic_entry(int apic, int irq,
+ struct IO_APIC_route_entry *entry,
+ unsigned int destination, int trigger,
+ int polarity, int vector)
{
/*
* add it to the IO-APIC irq-routing table:
Add x86-specific swiotlb allocation functions. These are purely
default for the moment.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/pci-swiotlb_64.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/x86/kernel/pci-swiotlb_64.c b/arch/x86/kernel/pci-swiotlb_64.c
--- a/arch/x86/kernel/pci-swiotlb_64.c
+++ b/arch/x86/kernel/pci-swiotlb_64.c
@@ -3,6 +3,8 @@
#include <linux/pci.h>
#include <linux/cache.h>
#include <linux/module.h>
+#include <linux/swiotlb.h>
+#include <linux/bootmem.h>
#include <linux/dma-mapping.h>
#include <asm/iommu.h>
@@ -11,6 +13,16 @@
int swiotlb __read_mostly;
+void *swiotlb_alloc_boot(size_t size, unsigned long nslabs)
+{
+ return alloc_bootmem_low_pages(size);
+}
+
+void *swiotlb_alloc(unsigned order, unsigned long nslabs)
+{
+ return (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN, order);
+}
+
static dma_addr_t
swiotlb_map_single_phys(struct device *hwdev, phys_addr_t paddr, size_t size,
int direction)
From: Juan Quintela <[email protected]>
Do initial xenbus/xenstore setup in dom0. In dom0 we need to actually
allocate the xenstore resources, rather than being given them from
outside.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Juan Quintela <[email protected]>
---
drivers/xen/xenbus/xenbus_probe.c | 30 +++++++++++++++++++++++++++++-
1 file changed, 29 insertions(+), 1 deletion(-)
diff --git a/drivers/xen/xenbus/xenbus_probe.c b/drivers/xen/xenbus/xenbus_probe.c
--- a/drivers/xen/xenbus/xenbus_probe.c
+++ b/drivers/xen/xenbus/xenbus_probe.c
@@ -810,6 +810,7 @@
static int __init xenbus_probe_init(void)
{
int err = 0;
+ unsigned long page = 0;
DPRINTK("");
@@ -830,7 +831,31 @@
* Domain0 doesn't have a store_evtchn or store_mfn yet.
*/
if (xen_initial_domain()) {
- /* dom0 not yet supported */
+ struct evtchn_alloc_unbound alloc_unbound;
+
+ /* Allocate Xenstore page */
+ page = get_zeroed_page(GFP_KERNEL);
+ if (!page)
+ return -ENOMEM;
+
+ xen_store_mfn = xen_start_info->store_mfn =
+ pfn_to_mfn(virt_to_phys((void *)page) >>
+ PAGE_SHIFT);
+
+ /* Next allocate a local port which xenstored can bind to */
+ alloc_unbound.dom = DOMID_SELF;
+ alloc_unbound.remote_dom = 0;
+
+ err = HYPERVISOR_event_channel_op(EVTCHNOP_alloc_unbound,
+ &alloc_unbound);
+ if (err == -ENOSYS)
+ goto out_unreg_front;
+
+ BUG_ON(err);
+ xen_store_evtchn = xen_start_info->store_evtchn =
+ alloc_unbound.port;
+
+ xen_store_interface = mfn_to_virt(xen_store_mfn);
} else {
xenstored_ready = 1;
xen_store_evtchn = xen_start_info->store_evtchn;
@@ -858,6 +883,9 @@
bus_unregister(&xenbus_frontend.bus);
out_error:
+ if (page != 0)
+ free_page(page);
+
return err;
}
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/mpspec.h | 1 +
arch/x86/kernel/acpi/boot.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -63,6 +63,7 @@
#ifdef CONFIG_X86_IO_APIC
extern int mp_config_acpi_gsi(unsigned char number, unsigned int devfn, u8 pin,
u32 gsi, int triggering, int polarity);
+extern int mp_find_ioapic(int gsi);
#else
static inline int
mp_config_acpi_gsi(unsigned char number, unsigned int devfn, u8 pin,
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -865,7 +865,7 @@
DECLARE_BITMAP(pin_programmed, MP_MAX_IOAPIC_PIN + 1);
} mp_ioapic_routing[MAX_IO_APICS];
-static int mp_find_ioapic(int gsi)
+int mp_find_ioapic(int gsi)
{
int i = 0;
Dom0 kernels actually want most of the CPU features to be enabled.
Some, like MCA/MCE, are still handled by Xen itself.
We leave APIC enabled even though we don't really have a functional
local apic so that the ACPI code will parse the corresponding tables
properly.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/xen/enlighten.c | 21 +++++++++++++--------
1 file changed, 13 insertions(+), 8 deletions(-)
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -205,18 +205,23 @@
static void xen_cpuid(unsigned int *ax, unsigned int *bx,
unsigned int *cx, unsigned int *dx)
{
- unsigned maskedx = ~0;
+ unsigned maskedx = 0;
/*
* Mask out inconvenient features, to try and disable as many
* unsupported kernel subsystems as possible.
*/
- if (*ax == 1)
- maskedx = ~((1 << X86_FEATURE_APIC) | /* disable APIC */
- (1 << X86_FEATURE_ACPI) | /* disable ACPI */
- (1 << X86_FEATURE_MCE) | /* disable MCE */
- (1 << X86_FEATURE_MCA) | /* disable MCA */
- (1 << X86_FEATURE_ACC)); /* thermal monitoring */
+ if (*ax == 1) {
+ maskedx =
+ (1 << X86_FEATURE_MCE) | /* disable MCE */
+ (1 << X86_FEATURE_MCA) | /* disable MCA */
+ (1 << X86_FEATURE_ACC); /* thermal monitoring */
+
+ if (!xen_initial_domain())
+ maskedx |=
+ (1 << X86_FEATURE_APIC) | /* disable local APIC */
+ (1 << X86_FEATURE_ACPI); /* disable ACPI */
+ }
asm(XEN_EMULATE_PREFIX "cpuid"
: "=a" (*ax),
@@ -224,7 +229,7 @@
"=c" (*cx),
"=d" (*dx)
: "0" (*ax), "2" (*cx));
- *dx &= maskedx;
+ *dx &= ~maskedx;
}
static void xen_set_debugreg(int reg, unsigned long val)
Add mp_find_ioapic_pin() to find an IO APIC's specific pin from a GSI,
and use this function within acpi/boot. Make it non-static so other
code can use it too.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/mpspec.h | 1 +
arch/x86/kernel/acpi/boot.c | 16 +++++++++++++---
2 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/mpspec.h b/arch/x86/include/asm/mpspec.h
--- a/arch/x86/include/asm/mpspec.h
+++ b/arch/x86/include/asm/mpspec.h
@@ -64,6 +64,7 @@
extern int mp_config_acpi_gsi(unsigned char number, unsigned int devfn, u8 pin,
u32 gsi, int triggering, int polarity);
extern int mp_find_ioapic(int gsi);
+extern int mp_find_ioapic_pin(int ioapic, int gsi);
#else
static inline int
mp_config_acpi_gsi(unsigned char number, unsigned int devfn, u8 pin,
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -880,6 +880,16 @@
return -1;
}
+int mp_find_ioapic_pin(int ioapic, int gsi)
+{
+ if (WARN_ON(ioapic == -1))
+ return -1;
+ if (WARN_ON(gsi > mp_ioapic_routing[ioapic].gsi_end))
+ return -1;
+
+ return gsi - mp_ioapic_routing[ioapic].gsi_base;
+}
+
static u8 __init uniq_ioapic_id(u8 id)
{
#ifdef CONFIG_X86_32
@@ -992,7 +1002,7 @@
ioapic = mp_find_ioapic(gsi);
if (ioapic < 0)
return;
- pin = gsi - mp_ioapic_routing[ioapic].gsi_base;
+ pin = mp_find_ioapic_pin(ioapic, gsi);
/*
* TBD: This check is for faulty timer entries, where the override
@@ -1114,7 +1124,7 @@
return gsi;
}
- ioapic_pin = gsi - mp_ioapic_routing[ioapic].gsi_base;
+ ioapic_pin = mp_find_ioapic_pin(ioapic, gsi);
#ifdef CONFIG_X86_32
if (ioapic_renumber_irq)
@@ -1203,7 +1213,7 @@
mp_irq.mp_srcbusirq = (((devfn >> 3) & 0x1f) << 2) | ((pin - 1) & 3);
ioapic = mp_find_ioapic(gsi);
mp_irq.mp_dstapic = mp_ioapic_routing[ioapic].apic_id;
- mp_irq.mp_dstirq = gsi - mp_ioapic_routing[ioapic].gsi_base;
+ mp_irq.mp_dstirq = mp_find_ioapic_pin(ioapic, gsi);
save_mp_irq(&mp_irq);
#endif
Allow swiotlb to build on 32-bit, since it will be used by Xen domain 0
support.
Signed-off-by: Ian Campbell <[email protected]>
---
arch/x86/include/asm/dma-mapping.h | 6 +-----
arch/x86/include/asm/pci.h | 2 ++
arch/x86/include/asm/pci_64.h | 1 -
arch/x86/kernel/Makefile | 3 ++-
arch/x86/kernel/pci-dma.c | 6 ++++--
arch/x86/kernel/pci-swiotlb_64.c | 2 ++
arch/x86/mm/init_32.c | 3 +++
7 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/dma-mapping.h b/arch/x86/include/asm/dma-mapping.h
--- a/arch/x86/include/asm/dma-mapping.h
+++ b/arch/x86/include/asm/dma-mapping.h
@@ -66,21 +66,17 @@
return dma_ops;
else
return dev->archdata.dma_ops;
-#endif /* _ASM_X86_DMA_MAPPING_H */
+#endif
}
/* Make sure we keep the same behaviour */
static inline int dma_mapping_error(struct device *dev, dma_addr_t dma_addr)
{
-#ifdef CONFIG_X86_32
- return 0;
-#else
struct dma_mapping_ops *ops = get_dma_ops(dev);
if (ops->mapping_error)
return ops->mapping_error(dev, dma_addr);
return (dma_addr == bad_dma_address);
-#endif
}
#define dma_alloc_noncoherent(d, s, h, f) dma_alloc_coherent(d, s, h, f)
diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -84,6 +84,8 @@
static inline void early_quirks(void) { }
#endif
+extern void pci_iommu_alloc(void);
+
#endif /* __KERNEL__ */
#ifdef CONFIG_X86_32
diff --git a/arch/x86/include/asm/pci_64.h b/arch/x86/include/asm/pci_64.h
--- a/arch/x86/include/asm/pci_64.h
+++ b/arch/x86/include/asm/pci_64.h
@@ -23,7 +23,6 @@
int reg, int len, u32 value);
extern void dma32_reserve_bootmem(void);
-extern void pci_iommu_alloc(void);
/* The PCI address space does equal the physical memory
* address space. The networking and block device layers use
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -115,6 +115,8 @@
obj-$(CONFIG_X86_CHECK_BIOS_CORRUPTION) += check.o
+obj-$(CONFIG_SWIOTLB) += pci-swiotlb_64.o # NB rename without _64
+
###
# 64 bit specific files
ifeq ($(CONFIG_X86_64),y)
@@ -128,7 +130,6 @@
obj-$(CONFIG_GART_IOMMU) += pci-gart_64.o aperture_64.o
obj-$(CONFIG_CALGARY_IOMMU) += pci-calgary_64.o tce_64.o
obj-$(CONFIG_AMD_IOMMU) += amd_iommu_init.o amd_iommu.o
- obj-$(CONFIG_SWIOTLB) += pci-swiotlb_64.o
obj-$(CONFIG_PCI_MMCONFIG) += mmconf-fam10h_64.o
endif
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -105,11 +105,15 @@
dma32_bootmem_ptr = NULL;
dma32_bootmem_size = 0;
}
+#endif
void __init pci_iommu_alloc(void)
{
+#ifdef CONFIG_X86_64
/* free the range so iommu could get some range less than 4G */
dma32_free_bootmem();
+#endif
+
/*
* The order of these functions is important for
* fall-back/fail-over reasons
@@ -125,8 +129,6 @@
pci_swiotlb_init();
}
-#endif
-
void *dma_generic_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_addr, gfp_t flag)
{
diff --git a/arch/x86/kernel/pci-swiotlb_64.c b/arch/x86/kernel/pci-swiotlb_64.c
--- a/arch/x86/kernel/pci-swiotlb_64.c
+++ b/arch/x86/kernel/pci-swiotlb_64.c
@@ -62,8 +62,10 @@
void __init pci_swiotlb_init(void)
{
/* don't initialize swiotlb if iommu=off (no_iommu=1) */
+#ifdef CONFIG_X86_64
if (!iommu_detected && !no_iommu && max_pfn > MAX_DMA32_PFN)
swiotlb = 1;
+#endif
if (swiotlb_force)
swiotlb = 1;
if (swiotlb) {
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -21,6 +21,7 @@
#include <linux/init.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>
+#include <linux/pci.h>
#include <linux/pfn.h>
#include <linux/poison.h>
#include <linux/bootmem.h>
@@ -971,6 +972,8 @@
int codesize, reservedpages, datasize, initsize;
int tmp;
+ pci_iommu_alloc();
+
#ifdef CONFIG_FLATMEM
BUG_ON(!mem_map);
#endif
From: Ian Campbell <[email protected]>
Add x86 versions of swiotlb_virt_to_bus() and swiotlb_bus_to_virt();
these are to be used later and contain the default implementations.
Signed-off-by: Ian Campbell <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/pci-swiotlb_64.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/kernel/pci-swiotlb_64.c b/arch/x86/kernel/pci-swiotlb_64.c
--- a/arch/x86/kernel/pci-swiotlb_64.c
+++ b/arch/x86/kernel/pci-swiotlb_64.c
@@ -23,6 +23,16 @@
return (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN, order);
}
+unsigned long swiotlb_virt_to_bus(volatile void *address)
+{
+ return virt_to_bus(address);
+}
+
+void * swiotlb_bus_to_virt(unsigned long address)
+{
+ return bus_to_virt(address);
+}
+
static dma_addr_t
swiotlb_map_single_phys(struct device *hwdev, phys_addr_t paddr, size_t size,
int direction)
From: Juan Quintela <[email protected]>
Add the direct mapping area for ISA bus access, and enable IO space
access for the guest when running as dom0.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Juan Quintela <[email protected]>
---
arch/x86/xen/enlighten.c | 32 ++++++++++++++++++++++++++++++++
arch/x86/xen/setup.c | 6 +++++-
2 files changed, 37 insertions(+), 1 deletion(-)
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1437,6 +1437,7 @@
return __ka(m2p(maddr));
}
+/* Set the page permissions on identity-mapped pages */
static void set_page_prot(void *addr, pgprot_t prot)
{
unsigned long pfn = __pa(addr) >> PAGE_SHIFT;
@@ -1492,6 +1493,29 @@
set_page_prot(pmd, PAGE_KERNEL_RO);
}
+static __init void xen_ident_map_ISA(void)
+{
+ unsigned long pa;
+
+ /*
+ * If we're dom0, then linear map the ISA machine addresses into
+ * the kernel's address space.
+ */
+ if (!xen_initial_domain())
+ return;
+
+ xen_raw_printk("Xen: setup ISA identity maps\n");
+
+ for (pa = ISA_START_ADDRESS; pa < ISA_END_ADDRESS; pa += PAGE_SIZE) {
+ pte_t pte = mfn_pte(PFN_DOWN(pa), PAGE_KERNEL_IO);
+
+ if (HYPERVISOR_update_va_mapping(PAGE_OFFSET + pa, pte, 0))
+ BUG();
+ }
+
+ xen_flush_tlb();
+}
+
#ifdef CONFIG_X86_64
static void convert_pfn_mfn(void *v)
{
@@ -1674,6 +1698,7 @@
xen_raw_console_write("mapping kernel into physical memory\n");
pgd = xen_setup_kernel_pagetable(pgd, xen_start_info->nr_pages);
+ xen_ident_map_ISA();
init_mm.pgd = pgd;
@@ -1683,6 +1708,13 @@
if (xen_feature(XENFEAT_supervisor_mode_kernel))
pv_info.kernel_rpl = 0;
+ if (xen_initial_domain()) {
+ struct physdev_set_iopl set_iopl;
+ set_iopl.iopl = 1;
+ if (HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl) == -1)
+ BUG();
+ }
+
/* set the limit of our address space */
xen_reserve_top();
diff --git a/arch/x86/xen/setup.c b/arch/x86/xen/setup.c
--- a/arch/x86/xen/setup.c
+++ b/arch/x86/xen/setup.c
@@ -51,6 +51,9 @@
* Even though this is normal, usable memory under Xen, reserve
* ISA memory anyway because too many things think they can poke
* about in there.
+ *
+ * In a dom0 kernel, this region is identity mapped with the
+ * hardware ISA area, so it really is out of bounds.
*/
e820_add_region(ISA_START_ADDRESS, ISA_END_ADDRESS - ISA_START_ADDRESS,
E820_RESERVED);
@@ -188,7 +191,8 @@
pm_idle = xen_idle;
- paravirt_disable_iospace();
+ if (!xen_initial_domain())
+ paravirt_disable_iospace();
fiddle_vdso();
}
We don't allow direct access to the IO apic, so make sure that any
request to map it just "maps" non-present pages. We should see any
attempts at direct access explode nicely.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/xen/enlighten.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1166,6 +1166,16 @@
pte = pfn_pte(phys, prot);
break;
+#ifdef CONFIG_X86_IO_APIC
+ case FIX_IO_APIC_BASE_0 ... FIX_IO_APIC_BASE_END:
+ /*
+ * We just don't map the IO APIC - all access is via
+ * hypercalls. Keep the address in the pte for reference.
+ */
+ pte = pfn_pte(phys, PAGE_NONE);
+ break;
+#endif
+
case FIX_PARAVIRT_BOOTMAP:
/* This is an MFN, but it isn't an IO mapping from the
IO domain */
numa_64.h uses __init/__cpuinit, which are not defined otherwise.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/numa_64.h | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/x86/include/asm/numa_64.h b/arch/x86/include/asm/numa_64.h
--- a/arch/x86/include/asm/numa_64.h
+++ b/arch/x86/include/asm/numa_64.h
@@ -1,6 +1,7 @@
#ifndef _ASM_X86_NUMA_64_H
#define _ASM_X86_NUMA_64_H
+#include <linux/init.h>
#include <linux/nodemask.h>
#include <asm/apicdef.h>
Use the console hypercalls for dom0 console.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Juan Quintela <[email protected]>
---
drivers/char/hvc_xen.c | 100 +++++++++++++++++++++++++++++++-----------------
drivers/xen/events.c | 2
include/xen/events.h | 2
3 files changed, 69 insertions(+), 35 deletions(-)
diff --git a/drivers/char/hvc_xen.c b/drivers/char/hvc_xen.c
--- a/drivers/char/hvc_xen.c
+++ b/drivers/char/hvc_xen.c
@@ -55,7 +55,7 @@
notify_remote_via_evtchn(xen_start_info->console.domU.evtchn);
}
-static int write_console(uint32_t vtermno, const char *data, int len)
+static int domU_write_console(uint32_t vtermno, const char *data, int len)
{
struct xencons_interface *intf = xencons_interface();
XENCONS_RING_IDX cons, prod;
@@ -76,7 +76,7 @@
return sent;
}
-static int read_console(uint32_t vtermno, char *buf, int len)
+static int domU_read_console(uint32_t vtermno, char *buf, int len)
{
struct xencons_interface *intf = xencons_interface();
XENCONS_RING_IDX cons, prod;
@@ -97,28 +97,62 @@
return recv;
}
-static struct hv_ops hvc_ops = {
- .get_chars = read_console,
- .put_chars = write_console,
+static struct hv_ops domU_hvc_ops = {
+ .get_chars = domU_read_console,
+ .put_chars = domU_write_console,
.notifier_add = notifier_add_irq,
.notifier_del = notifier_del_irq,
.notifier_hangup = notifier_hangup_irq,
};
-static int __init xen_init(void)
+static int dom0_read_console(uint32_t vtermno, char *buf, int len)
+{
+ return HYPERVISOR_console_io(CONSOLEIO_read, len, buf);
+}
+
+/*
+ * Either for a dom0 to write to the system console, or a domU with a
+ * debug version of Xen
+ */
+static int dom0_write_console(uint32_t vtermno, const char *str, int len)
+{
+ int rc = HYPERVISOR_console_io(CONSOLEIO_write, len, (char *)str);
+ if (rc < 0)
+ return 0;
+
+ return len;
+}
+
+static struct hv_ops dom0_hvc_ops = {
+ .get_chars = dom0_read_console,
+ .put_chars = dom0_write_console,
+ .notifier_add = notifier_add_irq,
+ .notifier_del = notifier_del_irq,
+};
+
+static int __init xen_hvc_init(void)
{
struct hvc_struct *hp;
+ struct hv_ops *ops;
- if (!xen_pv_domain() ||
- xen_initial_domain() ||
- !xen_start_info->console.domU.evtchn)
- return -ENODEV;
+ if (!xen_pv_domain())
+ return -ENODEV;
- xencons_irq = bind_evtchn_to_irq(xen_start_info->console.domU.evtchn);
+ if (xen_initial_domain()) {
+ ops = &dom0_hvc_ops;
+ xencons_irq = bind_virq_to_irq(VIRQ_CONSOLE, 0);
+ } else {
+ if (!xen_start_info->console.domU.evtchn)
+ return -ENODEV;
+
+ ops = &domU_hvc_ops;
+ xencons_irq = bind_evtchn_to_irq(xen_start_info->console.domU.evtchn);
+ }
+
if (xencons_irq < 0)
xencons_irq = 0; /* NO_IRQ */
- hp = hvc_alloc(HVC_COOKIE, xencons_irq, &hvc_ops, 256);
+ hp = hvc_alloc(HVC_COOKIE, xencons_irq, ops, 256);
if (IS_ERR(hp))
return PTR_ERR(hp);
@@ -135,7 +169,7 @@
rebind_evtchn_irq(xen_start_info->console.domU.evtchn, xencons_irq);
}
-static void __exit xen_fini(void)
+static void __exit xen_hvc_fini(void)
{
if (hvc)
hvc_remove(hvc);
@@ -143,29 +177,24 @@
static int xen_cons_init(void)
{
+ struct hv_ops *ops;
+
if (!xen_pv_domain())
return 0;
- hvc_instantiate(HVC_COOKIE, 0, &hvc_ops);
+ ops = &domU_hvc_ops;
+ if (xen_initial_domain())
+ ops = &dom0_hvc_ops;
+
+ hvc_instantiate(HVC_COOKIE, 0, ops);
+
return 0;
}
-module_init(xen_init);
-module_exit(xen_fini);
+module_init(xen_hvc_init);
+module_exit(xen_hvc_fini);
console_initcall(xen_cons_init);
-static void raw_console_write(const char *str, int len)
-{
- while(len > 0) {
- int rc = HYPERVISOR_console_io(CONSOLEIO_write, len, (char *)str);
- if (rc <= 0)
- break;
-
- str += rc;
- len -= rc;
- }
-}
-
#ifdef CONFIG_EARLY_PRINTK
static void xenboot_write_console(struct console *console, const char *string,
unsigned len)
@@ -173,19 +202,22 @@
unsigned int linelen, off = 0;
const char *pos;
- raw_console_write(string, len);
+ dom0_write_console(0, string, len);
- write_console(0, "(early) ", 8);
+ if (xen_initial_domain())
+ return;
+
+ domU_write_console(0, "(early) ", 8);
while (off < len && NULL != (pos = strchr(string+off, '\n'))) {
linelen = pos-string+off;
if (off + linelen > len)
break;
- write_console(0, string+off, linelen);
- write_console(0, "\r\n", 2);
+ domU_write_console(0, string+off, linelen);
+ domU_write_console(0, "\r\n", 2);
off += linelen + 1;
}
if (off < len)
- write_console(0, string+off, len-off);
+ domU_write_console(0, string+off, len-off);
}
struct console xenboot_console = {
@@ -197,7 +229,7 @@
void xen_raw_console_write(const char *str)
{
- raw_console_write(str, strlen(str));
+ dom0_write_console(0, str, strlen(str));
}
void xen_raw_printk(const char *fmt, ...)
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -307,7 +307,7 @@
}
-static int bind_virq_to_irq(unsigned int virq, unsigned int cpu)
+int bind_virq_to_irq(unsigned int virq, unsigned int cpu)
{
struct evtchn_bind_virq bind_virq;
int evtchn, irq;
diff --git a/include/xen/events.h b/include/xen/events.h
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -12,6 +12,8 @@
irq_handler_t handler,
unsigned long irqflags, const char *devname,
void *dev_id);
+int bind_virq_to_irq(unsigned int virq, unsigned int cpu);
+
int bind_virq_to_irqhandler(unsigned int virq, unsigned int cpu,
irq_handler_t handler,
unsigned long irqflags, const char *devname,
By default, the irq_chip.disable operation is a no-op. Explicitly set
it to disable the Xen event channel.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
drivers/xen/events.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -806,8 +806,11 @@
static struct irq_chip xen_dynamic_chip __read_mostly = {
.name = "xen-dyn",
+
+ .disable = disable_dynirq,
.mask = disable_dynirq,
.unmask = enable_dynirq,
+
.ack = ack_dynirq,
.set_affinity = set_affinity_irq,
.retrigger = retrigger_dynirq,
A guest domain may have external pages mapped into its address space,
in order to share memory with other domains. These shared pages are
more akin to IO mappings than real RAM, and should not pass the
page_is_ram test. Add a paravirt op for this so that a hypervisor
backend can validate whether a page should be considered RAM or not.
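The check itself is a one-liner (sketch): a grant-mapped foreign page has
no identity pfn<->mfn round trip, so

    mfn_to_local_pfn(pfn_to_mfn(pfn)) != pfn

identifies it as not-RAM before falling back to native_page_is_ram().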
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/page.h | 9 ++++++++-
arch/x86/include/asm/paravirt.h | 7 +++++++
arch/x86/kernel/paravirt.c | 1 +
arch/x86/mm/ioremap.c | 2 +-
arch/x86/xen/enlighten.c | 11 +++++++++++
5 files changed, 28 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/page.h b/arch/x86/include/asm/page.h
--- a/arch/x86/include/asm/page.h
+++ b/arch/x86/include/asm/page.h
@@ -56,7 +56,14 @@
typedef struct { pgdval_t pgd; } pgd_t;
typedef struct { pgprotval_t pgprot; } pgprot_t;
-extern int page_is_ram(unsigned long pagenr);
+extern int native_page_is_ram(unsigned long pagenr);
+#ifndef CONFIG_PARAVIRT
+static inline int page_is_ram(unsigned long pagenr)
+{
+ return native_page_is_ram(pagenr);
+}
+#endif
+
extern int pagerange_is_ram(unsigned long start, unsigned long end);
extern int devmem_is_allowed(unsigned long pagenr);
extern void map_devmem(unsigned long pfn, unsigned long size,
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -321,6 +321,8 @@
an mfn. We can tell which is which from the index. */
void (*set_fixmap)(unsigned /* enum fixed_addresses */ idx,
unsigned long phys, pgprot_t flags);
+
+ int (*page_is_ram)(unsigned long pfn);
};
struct raw_spinlock;
@@ -1386,6 +1388,11 @@
pv_mmu_ops.set_fixmap(idx, phys, flags);
}
+static inline int page_is_ram(unsigned long pfn)
+{
+ return PVOP_CALL1(int, pv_mmu_ops.page_is_ram, pfn);
+}
+
void _paravirt_nop(void);
#define paravirt_nop ((void *)_paravirt_nop)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -451,6 +451,7 @@
},
.set_fixmap = native_set_fixmap,
+ .page_is_ram = native_page_is_ram,
};
EXPORT_SYMBOL_GPL(pv_time_ops);
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -97,7 +97,7 @@
#endif
-int page_is_ram(unsigned long pagenr)
+int native_page_is_ram(unsigned long pagenr)
{
resource_size_t addr, end;
int i;
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1190,6 +1190,16 @@
#endif
}
+static int xen_page_is_ram(unsigned long pfn)
+{
+ /* Granted pages are not RAM. They will not have a proper
+ identity pfn<->mfn translation. */
+ if (mfn_to_local_pfn(pfn_to_mfn(pfn)) != pfn)
+ return 0;
+
+ return native_page_is_ram(pfn);
+}
+
static const struct pv_info xen_info __initdata = {
.paravirt_enabled = 1,
.shared_kernel_pmd = 0,
@@ -1364,6 +1374,7 @@
},
.set_fixmap = xen_set_fixmap,
+ .page_is_ram = xen_page_is_ram,
};
static void xen_reboot(int reason)
In a Xen domain, ioremap operates on machine addresses, not
pseudo-physical addresses. We use _PAGE_IOMAP to determine whether a
mapping is intended for machine addresses.
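Roughly, the intended flow for a machine-address mapping is (a sketch;
the pte-level details are in the diff below):

    pte carries _PAGE_IOMAP               /* marks a machine-address mapping */
      -> xen_make_pte(): iomap_pte()      /* keep the frame number as an MFN */
      -> xen_set_pte()/xen_set_pte_at()   /* detect _PAGE_IOMAP and install */
         -> xen_set_iomap_pte()           /* via an mmu_update against DOMID_IO */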
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/xen/page.h | 8 +----
arch/x86/xen/enlighten.c | 14 +++++++--
arch/x86/xen/mmu.c | 60 ++++++++++++++++++++++++++++++++++++++-
3 files changed, 73 insertions(+), 9 deletions(-)
diff --git a/arch/x86/include/asm/xen/page.h b/arch/x86/include/asm/xen/page.h
--- a/arch/x86/include/asm/xen/page.h
+++ b/arch/x86/include/asm/xen/page.h
@@ -112,13 +112,9 @@
*/
static inline unsigned long mfn_to_local_pfn(unsigned long mfn)
{
- extern unsigned long max_mapnr;
unsigned long pfn = mfn_to_pfn(mfn);
- if ((pfn < max_mapnr)
- && !xen_feature(XENFEAT_auto_translated_physmap)
- && (get_phys_to_machine(pfn) != mfn))
- return max_mapnr; /* force !pfn_valid() */
- /* XXX fixme; not true with sparsemem */
+ if (get_phys_to_machine(pfn) != mfn)
+ return -1; /* force !pfn_valid() */
return pfn;
}
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1162,11 +1162,19 @@
#ifdef CONFIG_X86_LOCAL_APIC
case FIX_APIC_BASE: /* maps dummy local APIC */
#endif
+ /* All local page mappings */
pte = pfn_pte(phys, prot);
break;
+ case FIX_PARAVIRT_BOOTMAP:
+ /* This is an MFN, but it isn't an IO mapping from the
+ IO domain */
+ pte = mfn_pte(phys, prot);
+ break;
+
default:
- pte = mfn_pte(phys, prot);
+ /* By default, set_fixmap is used for hardware mappings */
+ pte = mfn_pte(phys, __pgprot(pgprot_val(prot) | _PAGE_IOMAP));
break;
}
@@ -1695,7 +1703,9 @@
/* Prevent unwanted bits from being set in PTEs. */
__supported_pte_mask &= ~_PAGE_GLOBAL;
- if (!xen_initial_domain())
+ if (xen_initial_domain())
+ __supported_pte_mask |= _PAGE_IOMAP;
+ else
__supported_pte_mask &= ~(_PAGE_PWT | _PAGE_PCD);
/* Don't do the full vcpu_info placement stuff until we have a
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -302,6 +302,28 @@
return PagePinned(page);
}
+static bool xen_iomap_pte(pte_t pte)
+{
+ return xen_initial_domain() && (pte_flags(pte) & _PAGE_IOMAP);
+}
+
+static void xen_set_iomap_pte(pte_t *ptep, pte_t pteval)
+{
+ struct multicall_space mcs;
+ struct mmu_update *u;
+
+ mcs = xen_mc_entry(sizeof(*u));
+ u = mcs.args;
+
+ /* ptep might be kmapped when using 32-bit HIGHPTE */
+ u->ptr = arbitrary_virt_to_machine(ptep).maddr;
+ u->val = pte_val_ma(pteval);
+
+ MULTI_mmu_update(mcs.mc, mcs.args, 1, NULL, DOMID_IO);
+
+ xen_mc_issue(PARAVIRT_LAZY_MMU);
+}
+
static void xen_extend_mmu_update(const struct mmu_update *update)
{
struct multicall_space mcs;
@@ -382,6 +404,11 @@
if (mm == &init_mm)
preempt_disable();
+ if (xen_iomap_pte(pteval)) {
+ xen_set_iomap_pte(ptep, pteval);
+ goto out;
+ }
+
ADD_STATS(set_pte_at, 1);
// ADD_STATS(set_pte_at_pinned, xen_page_pinned(ptep));
ADD_STATS(set_pte_at_current, mm == current->mm);
@@ -454,8 +481,25 @@
return val;
}
+static pteval_t iomap_pte(pteval_t val)
+{
+ if (val & _PAGE_PRESENT) {
+ unsigned long pfn = (val & PTE_PFN_MASK) >> PAGE_SHIFT;
+ pteval_t flags = val & PTE_FLAGS_MASK;
+
+ /* We assume the pte frame number is a MFN, so
+ just use it as-is. */
+ val = ((pteval_t)pfn << PAGE_SHIFT) | flags;
+ }
+
+ return val;
+}
+
pteval_t xen_pte_val(pte_t pte)
{
+ if (xen_initial_domain() && (pte.pte & _PAGE_IOMAP))
+ return pte.pte;
+
return pte_mfn_to_pfn(pte.pte);
}
@@ -466,7 +510,11 @@
pte_t xen_make_pte(pteval_t pte)
{
- pte = pte_pfn_to_mfn(pte);
+ if (unlikely(xen_initial_domain() && (pte & _PAGE_IOMAP)))
+ pte = iomap_pte(pte);
+ else
+ pte = pte_pfn_to_mfn(pte);
+
return native_make_pte(pte);
}
@@ -519,6 +567,11 @@
void xen_set_pte(pte_t *ptep, pte_t pte)
{
+ if (xen_iomap_pte(pte)) {
+ xen_set_iomap_pte(ptep, pte);
+ return;
+ }
+
ADD_STATS(pte_update, 1);
// ADD_STATS(pte_update_pinned, xen_page_pinned(ptep));
ADD_STATS(pte_update_batched, paravirt_get_lazy_mode() == PARAVIRT_LAZY_MMU);
@@ -535,6 +588,11 @@
#ifdef CONFIG_X86_PAE
void xen_set_pte_atomic(pte_t *ptep, pte_t pte)
{
+ if (xen_iomap_pte(pte)) {
+ xen_set_iomap_pte(ptep, pte);
+ return;
+ }
+
set_64bit((u64 *)ptep, native_pte_val(pte));
}
Writes to the IO APIC are paravirtualized via hypercalls, so implement
the appropriate operations.
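Both directions become physdev hypercalls keyed by the IO APIC's physical
base address (sketch):

    apic_op.apic_physbase = mp_ioapics[apic].mp_apicaddr;
    apic_op.reg = reg;
    apic_op.value = value;
    HYPERVISOR_physdev_op(PHYSDEVOP_apic_write, &apic_op);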
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/xen/Makefile | 3 +-
arch/x86/xen/apic.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/xen/enlighten.c | 2 +
arch/x86/xen/xen-ops.h | 2 +
4 files changed, 72 insertions(+), 1 deletion(-)
diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -9,4 +9,5 @@
time.o xen-asm_$(BITS).o grant-table.o suspend.o
obj-$(CONFIG_SMP) += smp.o spinlock.o
-obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o
\ No newline at end of file
+obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o
+obj-$(CONFIG_XEN_DOM0) += apic.o
diff --git a/arch/x86/xen/apic.c b/arch/x86/xen/apic.c
new file mode 100644
--- /dev/null
+++ b/arch/x86/xen/apic.c
@@ -0,0 +1,66 @@
+#include <linux/kernel.h>
+#include <linux/threads.h>
+#include <linux/bitmap.h>
+
+#include <asm/io_apic.h>
+#include <asm/acpi.h>
+
+#include <asm/xen/hypervisor.h>
+#include <asm/xen/hypercall.h>
+
+#include <xen/interface/xen.h>
+#include <xen/interface/physdev.h>
+
+static void __init xen_io_apic_init(void)
+{
+ printk("xen apic init\n");
+ dump_stack();
+}
+
+static unsigned int xen_io_apic_read(unsigned apic, unsigned reg)
+{
+ struct physdev_apic apic_op;
+ int ret;
+
+ apic_op.apic_physbase = mp_ioapics[apic].mp_apicaddr;
+ apic_op.reg = reg;
+ ret = HYPERVISOR_physdev_op(PHYSDEVOP_apic_read, &apic_op);
+ if (ret)
+ BUG();
+ return apic_op.value;
+}
+
+
+static void xen_io_apic_write(unsigned int apic, unsigned int reg, unsigned int value)
+{
+ struct physdev_apic apic_op;
+
+ apic_op.apic_physbase = mp_ioapics[apic].mp_apicaddr;
+ apic_op.reg = reg;
+ apic_op.value = value;
+ if (HYPERVISOR_physdev_op(PHYSDEVOP_apic_write, &apic_op))
+ BUG();
+}
+
+static struct io_apic_ops __initdata xen_ioapic_ops = {
+ .init = xen_io_apic_init,
+ .read = xen_io_apic_read,
+ .write = xen_io_apic_write,
+ .modify = xen_io_apic_write,
+};
+
+void xen_init_apic(void)
+{
+ if (!xen_initial_domain())
+ return;
+
+ set_io_apic_ops(&xen_ioapic_ops);
+
+#ifdef CONFIG_ACPI
+ /*
+ * Pretend ACPI found our lapic even though we've disabled it,
+ * to prevent MP tables from setting up lapics.
+ */
+ acpi_lapic = 1;
+#endif
+}
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1750,6 +1750,8 @@
set_iopl.iopl = 1;
if (HYPERVISOR_physdev_op(PHYSDEVOP_set_iopl, &set_iopl) == -1)
BUG();
+
+ xen_init_apic();
}
/* set the limit of our address space */
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -64,6 +64,8 @@
#endif
+void xen_init_apic(void);
+
/* Declare an asm function, along with symbols needed to make it
inlineable */
#define DECL_ASM(ret, name, ...) \
Put all irq info into one struct. Also, use a union to keep
event channel type-specific information, rather than overloading the
index field.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
drivers/xen/events.c | 184 ++++++++++++++++++++++++++++++++++++--------------
1 file changed, 135 insertions(+), 49 deletions(-)
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -51,18 +51,8 @@
/* IRQ <-> IPI mapping */
static DEFINE_PER_CPU(int, ipi_to_irq[XEN_NR_IPIS]) = {[0 ... XEN_NR_IPIS-1] = -1};
-/* Packed IRQ information: binding type, sub-type index, and event channel. */
-struct packed_irq
-{
- unsigned short evtchn;
- unsigned char index;
- unsigned char type;
-};
-
-static struct packed_irq irq_info[NR_IRQS];
-
-/* Binding types. */
-enum {
+/* Interrupt types. */
+enum xen_irq_type {
IRQT_UNBOUND,
IRQT_PIRQ,
IRQT_VIRQ,
@@ -70,14 +60,39 @@
IRQT_EVTCHN
};
-/* Convenient shorthand for packed representation of an unbound IRQ. */
-#define IRQ_UNBOUND mk_irq_info(IRQT_UNBOUND, 0, 0)
+/*
+ * Packed IRQ information:
+ * type - enum xen_irq_type
+ * event channel - irq->event channel mapping
+ * cpu - cpu this event channel is bound to
+ * index - type-specific information:
+ * PIRQ - vector, with MSB being "needs EIO"
+ * VIRQ - virq number
+ * IPI - IPI vector
+ * EVTCHN -
+ */
+struct irq_info
+{
+ enum xen_irq_type type; /* type */
+ unsigned short evtchn; /* event channel */
+ unsigned short cpu; /* cpu bound */
+
+ union {
+ unsigned short virq;
+ enum ipi_vector ipi;
+ struct {
+ unsigned short gsi;
+ unsigned short vector;
+ } pirq;
+ } u;
+};
+
+static struct irq_info irq_info[NR_IRQS];
static int evtchn_to_irq[NR_EVENT_CHANNELS] = {
[0 ... NR_EVENT_CHANNELS-1] = -1
};
static unsigned long cpu_evtchn_mask[NR_CPUS][NR_EVENT_CHANNELS/BITS_PER_LONG];
-static u8 cpu_evtchn[NR_EVENT_CHANNELS];
/* Reference counts for bindings to IRQs. */
static int irq_bindcount[NR_IRQS];
@@ -88,27 +103,107 @@
static struct irq_chip xen_dynamic_chip;
/* Constructor for packed IRQ information. */
-static inline struct packed_irq mk_irq_info(u32 type, u32 index, u32 evtchn)
+static struct irq_info mk_unbound_info(void)
{
- return (struct packed_irq) { evtchn, index, type };
+ return (struct irq_info) { .type = IRQT_UNBOUND };
+}
+
+static struct irq_info mk_evtchn_info(unsigned short evtchn)
+{
+ return (struct irq_info) { .type = IRQT_EVTCHN, .evtchn = evtchn };
+}
+
+static struct irq_info mk_ipi_info(unsigned short evtchn, enum ipi_vector ipi)
+{
+ return (struct irq_info) { .type = IRQT_IPI, .evtchn = evtchn,
+ .u.ipi = ipi };
+}
+
+static struct irq_info mk_virq_info(unsigned short evtchn, unsigned short virq)
+{
+ return (struct irq_info) { .type = IRQT_VIRQ, .evtchn = evtchn,
+ .u.virq = virq };
+}
+
+static struct irq_info mk_pirq_info(unsigned short evtchn,
+ unsigned short gsi, unsigned short vector)
+{
+ return (struct irq_info) { .type = IRQT_PIRQ, .evtchn = evtchn,
+ .u.pirq = { .gsi = gsi, .vector = vector } };
}
/*
* Accessors for packed IRQ information.
*/
-static inline unsigned int evtchn_from_irq(int irq)
+static struct irq_info *info_for_irq(unsigned irq)
{
- return irq_info[irq].evtchn;
+ return &irq_info[irq];
}
-static inline unsigned int index_from_irq(int irq)
+static unsigned int evtchn_from_irq(unsigned irq)
{
- return irq_info[irq].index;
+ return info_for_irq(irq)->evtchn;
}
-static inline unsigned int type_from_irq(int irq)
+static enum ipi_vector ipi_from_irq(unsigned irq)
{
- return irq_info[irq].type;
+ struct irq_info *info = info_for_irq(irq);
+
+ BUG_ON(info == NULL);
+ BUG_ON(info->type != IRQT_IPI);
+
+ return info->u.ipi;
+}
+
+static unsigned virq_from_irq(unsigned irq)
+{
+ struct irq_info *info = info_for_irq(irq);
+
+ BUG_ON(info == NULL);
+ BUG_ON(info->type != IRQT_VIRQ);
+
+ return info->u.virq;
+}
+
+static unsigned gsi_from_irq(unsigned irq)
+{
+ struct irq_info *info = info_for_irq(irq);
+
+ BUG_ON(info == NULL);
+ BUG_ON(info->type != IRQT_PIRQ);
+
+ return info->u.pirq.gsi;
+}
+
+static unsigned vector_from_irq(unsigned irq)
+{
+ struct irq_info *info = info_for_irq(irq);
+
+ BUG_ON(info == NULL);
+ BUG_ON(info->type != IRQT_PIRQ);
+
+ return info->u.pirq.vector;
+}
+
+static enum xen_irq_type type_from_irq(unsigned irq)
+{
+ return info_for_irq(irq)->type;
+}
+
+static unsigned cpu_from_irq(unsigned irq)
+{
+ return info_for_irq(irq)->cpu;
+}
+
+static unsigned int cpu_from_evtchn(unsigned int evtchn)
+{
+ int irq = evtchn_to_irq[evtchn];
+ unsigned ret = 0;
+
+ if (irq != -1)
+ ret = cpu_from_irq(irq);
+
+ return ret;
}
static inline unsigned long active_evtchns(unsigned int cpu,
@@ -129,10 +224,10 @@
irq_to_desc(irq)->affinity = cpumask_of_cpu(cpu);
#endif
- __clear_bit(chn, cpu_evtchn_mask[cpu_evtchn[chn]]);
+ __clear_bit(chn, cpu_evtchn_mask[cpu_from_irq(irq)]);
__set_bit(chn, cpu_evtchn_mask[cpu]);
- cpu_evtchn[chn] = cpu;
+ irq_info[irq].cpu = cpu;
}
static void init_evtchn_cpu_bindings(void)
@@ -146,15 +241,9 @@
desc->affinity = cpumask_of_cpu(0);
#endif
- memset(cpu_evtchn, 0, sizeof(cpu_evtchn));
memset(cpu_evtchn_mask[0], ~0, sizeof(cpu_evtchn_mask[0]));
}
-static inline unsigned int cpu_from_evtchn(unsigned int evtchn)
-{
- return cpu_evtchn[evtchn];
-}
-
static inline void clear_evtchn(int port)
{
struct shared_info *s = HYPERVISOR_shared_info;
@@ -239,6 +328,8 @@
if (irq == nr_irqs)
panic("No available IRQ to bind to: increase nr_irqs!\n");
+ dynamic_irq_init(irq);
+
return irq;
}
@@ -253,12 +344,11 @@
if (irq == -1) {
irq = find_unbound_irq();
- dynamic_irq_init(irq);
set_irq_chip_and_handler_name(irq, &xen_dynamic_chip,
handle_level_irq, "event");
evtchn_to_irq[evtchn] = irq;
- irq_info[irq] = mk_irq_info(IRQT_EVTCHN, 0, evtchn);
+ irq_info[irq] = mk_evtchn_info(evtchn);
}
irq_bindcount[irq]++;
@@ -282,7 +372,6 @@
if (irq < 0)
goto out;
- dynamic_irq_init(irq);
set_irq_chip_and_handler_name(irq, &xen_dynamic_chip,
handle_level_irq, "ipi");
@@ -293,7 +382,7 @@
evtchn = bind_ipi.port;
evtchn_to_irq[evtchn] = irq;
- irq_info[irq] = mk_irq_info(IRQT_IPI, ipi, evtchn);
+ irq_info[irq] = mk_ipi_info(evtchn, ipi);
per_cpu(ipi_to_irq, cpu)[ipi] = irq;
@@ -327,12 +416,11 @@
irq = find_unbound_irq();
- dynamic_irq_init(irq);
set_irq_chip_and_handler_name(irq, &xen_dynamic_chip,
handle_level_irq, "virq");
evtchn_to_irq[evtchn] = irq;
- irq_info[irq] = mk_irq_info(IRQT_VIRQ, virq, evtchn);
+ irq_info[irq] = mk_virq_info(evtchn, virq);
per_cpu(virq_to_irq, cpu)[virq] = irq;
@@ -361,11 +449,11 @@
switch (type_from_irq(irq)) {
case IRQT_VIRQ:
per_cpu(virq_to_irq, cpu_from_evtchn(evtchn))
- [index_from_irq(irq)] = -1;
+ [virq_from_irq(irq)] = -1;
break;
case IRQT_IPI:
per_cpu(ipi_to_irq, cpu_from_evtchn(evtchn))
- [index_from_irq(irq)] = -1;
+ [ipi_from_irq(irq)] = -1;
break;
default:
break;
@@ -375,7 +463,7 @@
bind_evtchn_to_cpu(evtchn, 0);
evtchn_to_irq[evtchn] = -1;
- irq_info[irq] = IRQ_UNBOUND;
+ irq_info[irq] = mk_unbound_info();
dynamic_irq_cleanup(irq);
}
@@ -493,8 +581,8 @@
for(i = 0; i < NR_EVENT_CHANNELS; i++) {
if (sync_test_bit(i, sh->evtchn_pending)) {
printk(" %d: event %d -> irq %d\n",
- cpu_evtchn[i], i,
- evtchn_to_irq[i]);
+ cpu_from_evtchn(i), i,
+ evtchn_to_irq[i]);
}
}
@@ -592,7 +680,7 @@
BUG_ON(irq_bindcount[irq] == 0);
evtchn_to_irq[evtchn] = irq;
- irq_info[irq] = mk_irq_info(IRQT_EVTCHN, 0, evtchn);
+ irq_info[irq] = mk_evtchn_info(evtchn);
spin_unlock(&irq_mapping_update_lock);
@@ -702,8 +790,7 @@
if ((irq = per_cpu(virq_to_irq, cpu)[virq]) == -1)
continue;
- BUG_ON(irq_info[irq].type != IRQT_VIRQ);
- BUG_ON(irq_info[irq].index != virq);
+ BUG_ON(virq_from_irq(irq) != virq);
/* Get a new binding from Xen. */
bind_virq.virq = virq;
@@ -715,7 +802,7 @@
/* Record the new mapping. */
evtchn_to_irq[evtchn] = irq;
- irq_info[irq] = mk_irq_info(IRQT_VIRQ, virq, evtchn);
+ irq_info[irq] = mk_virq_info(evtchn, virq);
bind_evtchn_to_cpu(evtchn, cpu);
/* Ready for use. */
@@ -732,8 +819,7 @@
if ((irq = per_cpu(ipi_to_irq, cpu)[ipi]) == -1)
continue;
- BUG_ON(irq_info[irq].type != IRQT_IPI);
- BUG_ON(irq_info[irq].index != ipi);
+ BUG_ON(ipi_from_irq(irq) != ipi);
/* Get a new binding from Xen. */
bind_ipi.vcpu = cpu;
@@ -744,7 +830,7 @@
/* Record the new mapping. */
evtchn_to_irq[evtchn] = irq;
- irq_info[irq] = mk_irq_info(IRQT_IPI, ipi, evtchn);
+ irq_info[irq] = mk_ipi_info(evtchn, ipi);
bind_evtchn_to_cpu(evtchn, cpu);
/* Ready for use. */
Make sure that irq_enter()/irq_exit() wrap the entire event processing
loop, rather than each individual event invocation. This makes sure
that softirq processing is deferred until the end of event processing,
rather than running in the middle with interrupts disabled.
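Schematically, the upcall loop goes from

    for each pending event:
        irq_enter(); handle_irq(); irq_exit();  /* softirqs after each event */

to

    irq_enter();
    for each pending event:
        handle_irq();
    irq_exit();                                 /* softirqs run once, at the end */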
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
drivers/xen/events.c | 29 +++++++++--------------------
1 file changed, 9 insertions(+), 20 deletions(-)
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -806,25 +806,6 @@
return IRQ_HANDLED;
}
-
-static void xen_do_irq(unsigned irq, struct pt_regs *regs)
-{
- struct pt_regs *old_regs = set_irq_regs(regs);
-
- if (WARN_ON(irq == -1))
- return;
-
- exit_idle();
- irq_enter();
-
- //printk("cpu %d handling irq %d\n", smp_processor_id(), info->irq);
- handle_irq(irq, regs);
-
- irq_exit();
-
- set_irq_regs(old_regs);
-}
-
/*
* Search the CPUs pending events bitmasks. For each one found, map
* the event number to an irq, and feed it into do_IRQ() for
@@ -837,11 +818,15 @@
void xen_evtchn_do_upcall(struct pt_regs *regs)
{
int cpu = get_cpu();
+ struct pt_regs *old_regs = set_irq_regs(regs);
struct shared_info *s = HYPERVISOR_shared_info;
struct vcpu_info *vcpu_info = __get_cpu_var(xen_vcpu);
static DEFINE_PER_CPU(unsigned, nesting_count);
unsigned count;
+ exit_idle();
+ irq_enter();
+
do {
unsigned long pending_words;
@@ -865,7 +850,8 @@
int port = (word_idx * BITS_PER_LONG) + bit_idx;
int irq = evtchn_to_irq[port];
- xen_do_irq(irq, regs);
+ if (irq != -1)
+ handle_irq(irq, regs);
}
}
@@ -876,6 +862,9 @@
} while(count != 1);
out:
+ irq_exit();
+ set_irq_regs(old_regs);
+
put_cpu();
}
This patch puts the hooks into place so that when the interrupt
subsystem registers an irq, it gets routed via Xen (if we're running
under Xen).
The first step is to get a gsi for a particular device+pin. We use
the normal acpi interrupt routing to do the mapping.
Normally the gsi number is used directly as the irq number. We can't
do that since we also have irqs for non-hardware event channels, and
so we must share the irq space between them. A given gsi is only
allocated a single irq, so re-registering a gsi will simply return the
same irq.
We therefore allocate an irq for a given gsi, and return that. As a
special case, we reserve the first 16 irqs for identity-mapping legacy
irqs, since there's a fair amount of code which assumes that.
Having allocated an irq, we ask Xen to allocate a vector, and then
bind that pirq/vector to an event channel. When the hardware raises
an interrupt on a vector, Xen signals us on the corresponding event
channel, which gets routed to the irq and delivered to the appropriate
device driver.
This patch does everything except set up the IO APIC pin routing to
the vector.
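The resulting call chain, roughly (a sketch; error handling omitted):

    pcibios_enable_irq(dev)              /* overridden to xen_pci_pirq_enable */
      -> acpi_pci_irq_enable(dev)        /* dev+pin -> gsi via normal ACPI routing */
        -> acpi_register_gsi(gsi, ...)
          -> xen_register_gsi(gsi, ...)  /* gsi -> irq */
            -> xen_allocate_pirq(gsi)    /* irq plus a Xen-allocated vector */

The pirq is only bound to an event channel later, when the irq is
started up.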
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/acpi/boot.c | 8 +++
arch/x86/pci/legacy.c | 4 +
arch/x86/xen/Makefile | 1
arch/x86/xen/pci.c | 98 +++++++++++++++++++++++++++++++++++++++++++
arch/x86/xen/xen-ops.h | 1
drivers/xen/events.c | 9 ++-
include/asm-x86/xen/pci.h | 7 +++
include/xen/events.h | 8 +++
8 files changed, 132 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -42,6 +42,8 @@
#include <asm/mpspec.h>
#include <asm/smp.h>
+#include <asm/xen/pci.h>
+
#include <asm/xen/hypervisor.h>
#ifdef CONFIG_X86_LOCAL_APIC
@@ -501,6 +503,12 @@
unsigned int irq;
unsigned int plat_gsi = gsi;
+#ifdef CONFIG_XEN_PCI
+ irq = xen_register_gsi(gsi, triggering, polarity);
+ if ((int)irq >= 0)
+ return irq;
+#endif
+
#ifdef CONFIG_PCI
/*
* Make sure all (legacy) PCI IRQs are set as level-triggered.
diff --git a/arch/x86/pci/legacy.c b/arch/x86/pci/legacy.c
--- a/arch/x86/pci/legacy.c
+++ b/arch/x86/pci/legacy.c
@@ -3,6 +3,7 @@
*/
#include <linux/init.h>
#include <linux/pci.h>
+#include <asm/xen/pci.h>
#include "pci.h"
/*
@@ -66,6 +67,9 @@
#ifdef CONFIG_X86_VISWS
pci_visws_init();
#endif
+#ifdef CONFIG_XEN_PCI
+ xen_pci_init();
+#endif
pci_legacy_init();
pcibios_irq_init();
pcibios_init();
diff --git a/arch/x86/xen/Makefile b/arch/x86/xen/Makefile
--- a/arch/x86/xen/Makefile
+++ b/arch/x86/xen/Makefile
@@ -11,3 +11,4 @@
obj-$(CONFIG_SMP) += smp.o spinlock.o
obj-$(CONFIG_XEN_DEBUG_FS) += debugfs.o
obj-$(CONFIG_XEN_DOM0) += apic.o
+obj-$(CONFIG_XEN_PCI) += pci.o
\ No newline at end of file
diff --git a/arch/x86/xen/pci.c b/arch/x86/xen/pci.c
new file mode 100644
--- /dev/null
+++ b/arch/x86/xen/pci.c
@@ -0,0 +1,98 @@
+#include <linux/kernel.h>
+#include <linux/acpi.h>
+#include <linux/pci.h>
+
+#include <asm/xen/hypervisor.h>
+
+#include <xen/interface/xen.h>
+#include <xen/events.h>
+
+#include "../pci/pci.h"
+
+#include "xen-ops.h"
+
+int xen_register_gsi(u32 gsi, int triggering, int polarity)
+{
+ int irq;
+
+ if (!xen_domain())
+ return -1;
+
+ printk(KERN_DEBUG "xen: registering gsi %u triggering %d polarity %d\n",
+ gsi, triggering, polarity);
+
+ irq = xen_allocate_pirq(gsi);
+
+ printk(KERN_DEBUG "xen: --> irq=%d\n", irq);
+
+ return irq;
+}
+
+static int xen_pci_pirq_enable(struct pci_dev *dev)
+{
+ int rc;
+
+ printk(KERN_DEBUG "xen: enabling pci device %s pin %d\n",
+ pci_name(dev), dev->pin);
+
+ if (dev->pin == 0)
+ return 0; /* no pin, nothing to do */
+
+ rc = acpi_pci_irq_enable(dev);
+
+ if (rc >= 0 && dev->irq != 0) {
+ int irq = dev->irq;
+
+ printk(KERN_INFO "xen: PCI device %s pin %d -> irq %d\n",
+ pci_name(dev), dev->pin, irq);
+
+ /* install vector in ioapic? */
+ } else
+ printk(KERN_INFO "xen: irq enable for %s failed: rc=%d pin=%d irq=%d\n",
+ pci_name(dev), rc, dev->pin, dev->irq);
+
+ return rc;
+}
+
+static void xen_pci_pirq_disable(struct pci_dev *dev)
+{
+ printk(KERN_INFO "xen: disable pci device %s\n",
+ pci_name(dev));
+
+ dump_stack();
+}
+
+void __init xen_pci_init(void)
+{
+ if (!xen_domain())
+ return;
+
+ /*
+ * In either dom0 or when using pcifront we need to take
+ * control of physical interrupts from pci devices.
+ * Overriding these two put us in charge of interrupt routing
+ * akin to ACPI.
+ *
+ * This overrides any previous settings.
+ */
+ pcibios_enable_irq = xen_pci_pirq_enable;
+ pcibios_disable_irq = xen_pci_pirq_disable;
+}
+
+void __init xen_setup_pirqs(void)
+{
+#ifdef CONFIG_ACPI
+ int irq;
+
+ /*
+ * Set up acpi interrupt in acpi_gbl_FADT.sci_interrupt.
+ */
+ irq = xen_allocate_pirq(acpi_gbl_FADT.sci_interrupt);
+
+ printk(KERN_INFO "xen: allocated irq %d for acpi %d\n",
+ irq, acpi_gbl_FADT.sci_interrupt);
+
+ /* Blerk. */
+ acpi_gbl_FADT.sci_interrupt = irq;
+#endif
+}
diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
--- a/arch/x86/xen/xen-ops.h
+++ b/arch/x86/xen/xen-ops.h
@@ -63,7 +63,6 @@
static inline void xen_smp_init(void) {}
#endif
-
void xen_init_apic(void);
/* Declare an asm function, along with symbols needed to make it
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -391,6 +391,7 @@
struct evtchn_bind_pirq bind_pirq;
struct irq_info *info = info_for_irq(irq);
int evtchn = evtchn_from_irq(irq);
+ int rc;
BUG_ON(info->type != IRQT_PIRQ);
@@ -400,10 +401,12 @@
bind_pirq.pirq = irq;
/* NB. We are happy to share unless we are probing. */
bind_pirq.flags = probing_irq(irq) ? 0 : BIND_PIRQ__WILL_SHARE;
- if (HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind_pirq) != 0) {
+ rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind_pirq);
+ if (rc != 0) {
if (!probing_irq(irq))
printk(KERN_INFO "Failed to obtain physical IRQ %d\n",
irq);
+ dump_stack();
return 0;
}
evtchn = bind_pirq.port;
@@ -523,8 +526,6 @@
} else
irq = find_unbound_irq();
- spin_unlock(&irq_mapping_update_lock);
-
set_irq_chip_and_handler_name(irq, &xen_pirq_chip,
handle_level_irq, "pirq");
@@ -1167,4 +1168,6 @@
mask_evtchn(i);
irq_ctx_init(smp_processor_id());
+
+ xen_setup_pirqs();
}
diff --git a/include/asm-x86/xen/pci.h b/include/asm-x86/xen/pci.h
new file mode 100644
--- /dev/null
+++ b/include/asm-x86/xen/pci.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_X86_XEN_PCI_H
+#define _ASM_X86_XEN_PCI_H
+
+void xen_pci_init(void);
+int xen_register_gsi(u32 gsi, int triggering, int polarity);
+
+#endif /* _ASM_X86_XEN_PCI_H */
diff --git a/include/xen/events.h b/include/xen/events.h
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -66,4 +66,12 @@
/* Return gsi allocated to pirq */
int xen_gsi_from_irq(unsigned pirq);
+#ifdef CONFIG_XEN_PCI
+void xen_setup_pirqs(void);
+#else
+static inline void xen_setup_pirqs(void)
+{
+}
+#endif
+
#endif /* _XEN_EVENTS_H */
From: Stephen Tweedie <[email protected]>
Add a Xen mtrr type, and reorganise mtrr initialisation slightly to allow the
mtrr driver to set up num_var_ranges (Xen needs to do this by querying the
hypervisor itself).
Only the boot path is handled for now: we install a Xen-specific mtrr_if
and populate the mtrr tables from hypervisor information, but we don't
yet handle mtrr entry add/delete.
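The boot-path hook, in outline:

    mtrr_bp_init()
        mtrr_if = &generic_mtrr_ops;
        xen_init_mtrr();        /* replaces mtrr_if when running as dom0 */
        ...
        num_var_ranges = mtrr_if->num_var_ranges();

where the Xen implementation counts ranges by issuing XENPF_read_memtype
until the hypervisor returns an error.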
Signed-off-by: Stephen Tweedie <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/cpu/mtrr/Makefile | 1
arch/x86/kernel/cpu/mtrr/amd.c | 1
arch/x86/kernel/cpu/mtrr/centaur.c | 1
arch/x86/kernel/cpu/mtrr/cyrix.c | 1
arch/x86/kernel/cpu/mtrr/generic.c | 1
arch/x86/kernel/cpu/mtrr/main.c | 11 ++++--
arch/x86/kernel/cpu/mtrr/mtrr.h | 5 +++
arch/x86/kernel/cpu/mtrr/xen.c | 59 ++++++++++++++++++++++++++++++++++++
8 files changed, 77 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/mtrr/Makefile b/arch/x86/kernel/cpu/mtrr/Makefile
--- a/arch/x86/kernel/cpu/mtrr/Makefile
+++ b/arch/x86/kernel/cpu/mtrr/Makefile
@@ -1,3 +1,4 @@
obj-y := main.o if.o generic.o state.o
obj-$(CONFIG_X86_32) += amd.o cyrix.o centaur.o
+obj-$(CONFIG_XEN_DOM0) += xen.o
diff --git a/arch/x86/kernel/cpu/mtrr/amd.c b/arch/x86/kernel/cpu/mtrr/amd.c
--- a/arch/x86/kernel/cpu/mtrr/amd.c
+++ b/arch/x86/kernel/cpu/mtrr/amd.c
@@ -108,6 +108,7 @@
.get_free_region = generic_get_free_region,
.validate_add_page = amd_validate_add_page,
.have_wrcomb = positive_have_wrcomb,
+ .num_var_ranges = common_num_var_ranges,
};
int __init amd_init_mtrr(void)
diff --git a/arch/x86/kernel/cpu/mtrr/centaur.c b/arch/x86/kernel/cpu/mtrr/centaur.c
--- a/arch/x86/kernel/cpu/mtrr/centaur.c
+++ b/arch/x86/kernel/cpu/mtrr/centaur.c
@@ -213,6 +213,7 @@
.get_free_region = centaur_get_free_region,
.validate_add_page = centaur_validate_add_page,
.have_wrcomb = positive_have_wrcomb,
+ .num_var_ranges = common_num_var_ranges,
};
int __init centaur_init_mtrr(void)
diff --git a/arch/x86/kernel/cpu/mtrr/cyrix.c b/arch/x86/kernel/cpu/mtrr/cyrix.c
--- a/arch/x86/kernel/cpu/mtrr/cyrix.c
+++ b/arch/x86/kernel/cpu/mtrr/cyrix.c
@@ -263,6 +263,7 @@
.get_free_region = cyrix_get_free_region,
.validate_add_page = generic_validate_add_page,
.have_wrcomb = positive_have_wrcomb,
+ .num_var_ranges = common_num_var_ranges,
};
int __init cyrix_init_mtrr(void)
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -667,4 +667,5 @@
.set = generic_set_mtrr,
.validate_add_page = generic_validate_add_page,
.have_wrcomb = generic_have_wrcomb,
+ .num_var_ranges = common_num_var_ranges,
};
diff --git a/arch/x86/kernel/cpu/mtrr/main.c b/arch/x86/kernel/cpu/mtrr/main.c
--- a/arch/x86/kernel/cpu/mtrr/main.c
+++ b/arch/x86/kernel/cpu/mtrr/main.c
@@ -99,7 +99,7 @@
}
/* This function returns the number of variable MTRRs */
-static void __init set_num_var_ranges(void)
+int __init common_num_var_ranges(void)
{
unsigned long config = 0, dummy;
@@ -109,7 +109,7 @@
config = 2;
else if (is_cpu(CYRIX) || is_cpu(CENTAUR))
config = 8;
- num_var_ranges = config & 0xff;
+ return config & 0xff;
}
static void __init init_table(void)
@@ -1676,12 +1676,17 @@
void __init mtrr_bp_init(void)
{
u32 phys_addr;
+
init_ifs();
phys_addr = 32;
if (cpu_has_mtrr) {
mtrr_if = &generic_mtrr_ops;
+#ifdef CONFIG_XEN_DOM0
+ xen_init_mtrr();
+#endif
+
size_or_mask = 0xff000000; /* 36 bits */
size_and_mask = 0x00f00000;
phys_addr = 36;
@@ -1739,7 +1744,7 @@
}
if (mtrr_if) {
- set_num_var_ranges();
+ num_var_ranges = mtrr_if->num_var_ranges();
init_table();
if (use_intel()) {
get_mtrr_state();
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.h b/arch/x86/kernel/cpu/mtrr/mtrr.h
--- a/arch/x86/kernel/cpu/mtrr/mtrr.h
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.h
@@ -50,6 +50,8 @@
int (*validate_add_page)(unsigned long base, unsigned long size,
unsigned int type);
int (*have_wrcomb)(void);
+
+ int (*num_var_ranges)(void);
};
extern int generic_get_free_region(unsigned long base, unsigned long size,
@@ -61,6 +63,8 @@
extern int positive_have_wrcomb(void);
+extern int __init common_num_var_ranges(void);
+
/* library functions for processor-specific routines */
struct set_mtrr_context {
unsigned long flags;
@@ -104,3 +108,4 @@
int amd_init_mtrr(void);
int cyrix_init_mtrr(void);
int centaur_init_mtrr(void);
+void xen_init_mtrr(void);
diff --git a/arch/x86/kernel/cpu/mtrr/xen.c b/arch/x86/kernel/cpu/mtrr/xen.c
new file mode 100644
--- /dev/null
+++ b/arch/x86/kernel/cpu/mtrr/xen.c
@@ -0,0 +1,59 @@
+#include <linux/init.h>
+#include <linux/proc_fs.h>
+#include <linux/ctype.h>
+#include <linux/module.h>
+#include <linux/seq_file.h>
+#include <asm/uaccess.h>
+#include <linux/mutex.h>
+
+#include <asm/mtrr.h>
+#include "mtrr.h"
+
+#include <xen/interface/platform.h>
+#include <asm/xen/hypervisor.h>
+#include <asm/xen/hypercall.h>
+
+static int __init xen_num_var_ranges(void);
+
+/* DOM0 TODO: Need to fill in the remaining mtrr methods to have full
+ * working userland mtrr support. */
+static struct mtrr_ops xen_mtrr_ops = {
+ .vendor = X86_VENDOR_UNKNOWN,
+// .set = xen_set_mtrr,
+// .get = xen_get_mtrr,
+ .get_free_region = generic_get_free_region,
+// .validate_add_page = xen_validate_add_page,
+ .have_wrcomb = positive_have_wrcomb,
+ .use_intel_if = 0,
+ .num_var_ranges = xen_num_var_ranges,
+};
+
+static int __init xen_num_var_ranges(void)
+{
+ int ranges;
+ struct xen_platform_op op;
+
+ for (ranges = 0; ; ranges++) {
+ op.cmd = XENPF_read_memtype;
+ op.u.read_memtype.reg = ranges;
+ if (HYPERVISOR_dom0_op(&op) != 0)
+ break;
+ }
+ return ranges;
+}
+
+void __init xen_init_mtrr(void)
+{
+ struct cpuinfo_x86 *c = &boot_cpu_data;
+
+ if (!xen_initial_domain())
+ return;
+
+ if ((!cpu_has(c, X86_FEATURE_MTRR)) &&
+ (!cpu_has(c, X86_FEATURE_K6_MTRR)) &&
+ (!cpu_has(c, X86_FEATURE_CYRIX_ARR)) &&
+ (!cpu_has(c, X86_FEATURE_CENTAUR_MCR)))
+ return;
+
+ mtrr_if = &xen_mtrr_ops;
+}
From: Stephen Tweedie <[email protected]>
Minimal changes to get platform ops (the renamed dom0_ops) working on
pv_ops builds. Pulls in platform.h from the upstream linux-2.6.18-xen.hg
tree.
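A minimal usage sketch (the mtrr patch in this series does essentially this):

    struct xen_platform_op op;

    op.cmd = XENPF_read_memtype;
    op.u.read_memtype.reg = 0;
    if (HYPERVISOR_dom0_op(&op) == 0)
        /* op.u.read_memtype.{mfn,nr_mfns,type} are now valid */;

HYPERVISOR_dom0_op() fills in interface_version itself, so callers only
set cmd and the relevant union member.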
Signed-off-by: Stephen Tweedie <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/xen/hypercall.h | 8 +
include/xen/interface/platform.h | 232 ++++++++++++++++++++++++++++++++++
include/xen/interface/xen.h | 2
3 files changed, 242 insertions(+)
diff --git a/arch/x86/include/asm/xen/hypercall.h b/arch/x86/include/asm/xen/hypercall.h
--- a/arch/x86/include/asm/xen/hypercall.h
+++ b/arch/x86/include/asm/xen/hypercall.h
@@ -45,6 +45,7 @@
#include <xen/interface/xen.h>
#include <xen/interface/sched.h>
#include <xen/interface/physdev.h>
+#include <xen/interface/platform.h>
/*
* The hypercall asms have to meet several constraints:
@@ -282,6 +283,13 @@
}
static inline int
+HYPERVISOR_dom0_op(struct xen_platform_op *platform_op)
+{
+ platform_op->interface_version = XENPF_INTERFACE_VERSION;
+ return _hypercall1(int, dom0_op, platform_op);
+}
+
+static inline int
HYPERVISOR_set_debugreg(int reg, unsigned long value)
{
return _hypercall2(int, set_debugreg, reg, value);
diff --git a/include/xen/interface/platform.h b/include/xen/interface/platform.h
new file mode 100644
--- /dev/null
+++ b/include/xen/interface/platform.h
@@ -0,0 +1,232 @@
+/******************************************************************************
+ * platform.h
+ *
+ * Hardware platform operations. Intended for use by domain-0 kernel.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Copyright (c) 2002-2006, K Fraser
+ */
+
+#ifndef __XEN_PUBLIC_PLATFORM_H__
+#define __XEN_PUBLIC_PLATFORM_H__
+
+#include "xen.h"
+
+#define XENPF_INTERFACE_VERSION 0x03000001
+
+/*
+ * Set clock such that it would read <secs,nsecs> after 00:00:00 UTC,
+ * 1 January, 1970 if the current system time was <system_time>.
+ */
+#define XENPF_settime 17
+struct xenpf_settime {
+ /* IN variables. */
+ uint32_t secs;
+ uint32_t nsecs;
+ uint64_t system_time;
+};
+typedef struct xenpf_settime xenpf_settime_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_settime_t);
+
+/*
+ * Request memory range (@mfn, @mfn+@nr_mfns-1) to have type @type.
+ * On x86, @type is an architecture-defined MTRR memory type.
+ * On success, returns the MTRR that was used (@reg) and a handle that can
+ * be passed to XENPF_DEL_MEMTYPE to accurately tear down the new setting.
+ * (x86-specific).
+ */
+#define XENPF_add_memtype 31
+struct xenpf_add_memtype {
+ /* IN variables. */
+ unsigned long mfn;
+ uint64_t nr_mfns;
+ uint32_t type;
+ /* OUT variables. */
+ uint32_t handle;
+ uint32_t reg;
+};
+typedef struct xenpf_add_memtype xenpf_add_memtype_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_add_memtype_t);
+
+/*
+ * Tear down an existing memory-range type. If @handle is remembered then it
+ * should be passed in to accurately tear down the correct setting (in case
+ * of overlapping memory regions with differing types). If it is not known
+ * then @handle should be set to zero. In all cases @reg must be set.
+ * (x86-specific).
+ */
+#define XENPF_del_memtype 32
+struct xenpf_del_memtype {
+ /* IN variables. */
+ uint32_t handle;
+ uint32_t reg;
+};
+typedef struct xenpf_del_memtype xenpf_del_memtype_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_del_memtype_t);
+
+/* Read current type of an MTRR (x86-specific). */
+#define XENPF_read_memtype 33
+struct xenpf_read_memtype {
+ /* IN variables. */
+ uint32_t reg;
+ /* OUT variables. */
+ unsigned long mfn;
+ uint64_t nr_mfns;
+ uint32_t type;
+};
+typedef struct xenpf_read_memtype xenpf_read_memtype_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_read_memtype_t);
+
+#define XENPF_microcode_update 35
+struct xenpf_microcode_update {
+ /* IN variables. */
+ GUEST_HANDLE(void) data; /* Pointer to microcode data */
+ uint32_t length; /* Length of microcode data. */
+};
+typedef struct xenpf_microcode_update xenpf_microcode_update_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_microcode_update_t);
+
+#define XENPF_platform_quirk 39
+#define QUIRK_NOIRQBALANCING 1 /* Do not restrict IO-APIC RTE targets */
+#define QUIRK_IOAPIC_BAD_REGSEL 2 /* IO-APIC REGSEL forgets its value */
+#define QUIRK_IOAPIC_GOOD_REGSEL 3 /* IO-APIC REGSEL behaves properly */
+struct xenpf_platform_quirk {
+ /* IN variables. */
+ uint32_t quirk_id;
+};
+typedef struct xenpf_platform_quirk xenpf_platform_quirk_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_platform_quirk_t);
+
+#define XENPF_firmware_info 50
+#define XEN_FW_DISK_INFO 1 /* from int 13 AH=08/41/48 */
+#define XEN_FW_DISK_MBR_SIGNATURE 2 /* from MBR offset 0x1b8 */
+#define XEN_FW_VBEDDC_INFO 3 /* from int 10 AX=4f15 */
+struct xenpf_firmware_info {
+ /* IN variables. */
+ uint32_t type;
+ uint32_t index;
+ /* OUT variables. */
+ union {
+ struct {
+ /* Int13, Fn48: Check Extensions Present. */
+ uint8_t device; /* %dl: bios device number */
+ uint8_t version; /* %ah: major version */
+ uint16_t interface_support; /* %cx: support bitmap */
+ /* Int13, Fn08: Legacy Get Device Parameters. */
+ uint16_t legacy_max_cylinder; /* %cl[7:6]:%ch: max cyl # */
+ uint8_t legacy_max_head; /* %dh: max head # */
+ uint8_t legacy_sectors_per_track; /* %cl[5:0]: max sector # */
+ /* Int13, Fn41: Get Device Parameters (as filled into %ds:%esi). */
+ /* NB. First uint16_t of buffer must be set to buffer size. */
+ GUEST_HANDLE(void) edd_params;
+ } disk_info; /* XEN_FW_DISK_INFO */
+ struct {
+ uint8_t device; /* bios device number */
+ uint32_t mbr_signature; /* offset 0x1b8 in mbr */
+ } disk_mbr_signature; /* XEN_FW_DISK_MBR_SIGNATURE */
+ struct {
+ /* Int10, AX=4F15: Get EDID info. */
+ uint8_t capabilities;
+ uint8_t edid_transfer_time;
+ /* must refer to 128-byte buffer */
+ GUEST_HANDLE(uchar) edid;
+ } vbeddc_info; /* XEN_FW_VBEDDC_INFO */
+ } u;
+};
+typedef struct xenpf_firmware_info xenpf_firmware_info_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_firmware_info_t);
+
+#define XENPF_enter_acpi_sleep 51
+struct xenpf_enter_acpi_sleep {
+ /* IN variables */
+ uint16_t pm1a_cnt_val; /* PM1a control value. */
+ uint16_t pm1b_cnt_val; /* PM1b control value. */
+ uint32_t sleep_state; /* Which state to enter (Sn). */
+ uint32_t flags; /* Must be zero. */
+};
+typedef struct xenpf_enter_acpi_sleep xenpf_enter_acpi_sleep_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_enter_acpi_sleep_t);
+
+#define XENPF_change_freq 52
+struct xenpf_change_freq {
+ /* IN variables */
+ uint32_t flags; /* Must be zero. */
+ uint32_t cpu; /* Physical cpu. */
+ uint64_t freq; /* New frequency (Hz). */
+};
+typedef struct xenpf_change_freq xenpf_change_freq_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_change_freq_t);
+
+/*
+ * Get idle times (nanoseconds since boot) for physical CPUs specified in the
+ * @cpumap_bitmap with range [0..@cpumap_nr_cpus-1]. The @idletime array is
+ * indexed by CPU number; only entries with the corresponding @cpumap_bitmap
+ * bit set are written to. On return, @cpumap_bitmap is modified so that any
+ * non-existent CPUs are cleared. Such CPUs have their @idletime array entry
+ * cleared.
+ */
+#define XENPF_getidletime 53
+struct xenpf_getidletime {
+ /* IN/OUT variables */
+ /* IN: CPUs to interrogate; OUT: subset of IN which are present */
+ GUEST_HANDLE(uchar) cpumap_bitmap;
+ /* IN variables */
+ /* Size of cpumap bitmap. */
+ uint32_t cpumap_nr_cpus;
+ /* Must be indexable for every cpu in cpumap_bitmap. */
+ GUEST_HANDLE(uint64_t) idletime;
+ /* OUT variables */
+ /* System time when the idletime snapshots were taken. */
+ uint64_t now;
+};
+typedef struct xenpf_getidletime xenpf_getidletime_t;
+DEFINE_GUEST_HANDLE_STRUCT(xenpf_getidletime_t);
+
+struct xen_platform_op {
+ uint32_t cmd;
+ uint32_t interface_version; /* XENPF_INTERFACE_VERSION */
+ union {
+ struct xenpf_settime settime;
+ struct xenpf_add_memtype add_memtype;
+ struct xenpf_del_memtype del_memtype;
+ struct xenpf_read_memtype read_memtype;
+ struct xenpf_microcode_update microcode;
+ struct xenpf_platform_quirk platform_quirk;
+ struct xenpf_firmware_info firmware_info;
+ struct xenpf_enter_acpi_sleep enter_acpi_sleep;
+ struct xenpf_change_freq change_freq;
+ struct xenpf_getidletime getidletime;
+ uint8_t pad[128];
+ } u;
+};
+typedef struct xen_platform_op xen_platform_op_t;
+DEFINE_GUEST_HANDLE_STRUCT(xen_platform_op_t);
+
+#endif /* __XEN_PUBLIC_PLATFORM_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-set-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/include/xen/interface/xen.h b/include/xen/interface/xen.h
--- a/include/xen/interface/xen.h
+++ b/include/xen/interface/xen.h
@@ -461,6 +461,8 @@
#define __mk_unsigned_long(x) x ## UL
#define mk_unsigned_long(x) __mk_unsigned_long(x)
+DEFINE_GUEST_HANDLE(uint64_t);
+
#else /* __ASSEMBLY__ */
/* In assembly code we cannot use C numeric constant suffixes. */
Having converted a dev+pin to a gsi, mapped that gsi to an irq, and
allocated a vector for the irq, we must program the IO APIC to deliver
an interrupt on a pin to the vector, so Xen can deliver it as an event
channel.
Given the pirq, we can get the gsi and vector. We map the gsi to a
specific IO APIC's pin, and set the routing entry.
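In short, the mapping being programmed is (sketch of xen_set_io_apic_routing):

    gsi    = xen_gsi_from_irq(irq);
    vector = xen_vector_from_irq(irq);
    ioapic = mp_find_ioapic(gsi);             /* which IO APIC serves this gsi */
    pin    = mp_find_ioapic_pin(ioapic, gsi);
    /* build an RTE with (trigger, polarity, vector) and write it to the pin */
    setup_ioapic_entry(...);
    ioapic_write_entry(ioapic, pin, entry);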
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/xen/apic.c | 4 ++--
arch/x86/xen/pci.c | 34 +++++++++++++++++++++++++++++++---
2 files changed, 33 insertions(+), 5 deletions(-)
diff --git a/arch/x86/xen/apic.c b/arch/x86/xen/apic.c
--- a/arch/x86/xen/apic.c
+++ b/arch/x86/xen/apic.c
@@ -4,6 +4,7 @@
#include <asm/io_apic.h>
#include <asm/acpi.h>
+#include <asm/hw_irq.h>
#include <asm/xen/hypervisor.h>
#include <asm/xen/hypercall.h>
@@ -13,8 +14,7 @@
static void __init xen_io_apic_init(void)
{
- printk("xen apic init\n");
- dump_stack();
+ enable_IO_APIC();
}
static unsigned int xen_io_apic_read(unsigned apic, unsigned reg)
diff --git a/arch/x86/xen/pci.c b/arch/x86/xen/pci.c
--- a/arch/x86/xen/pci.c
+++ b/arch/x86/xen/pci.c
@@ -2,8 +2,10 @@
#include <linux/acpi.h>
#include <linux/pci.h>
+#include <asm/mpspec.h>
+#include <asm/io_apic.h>
+
#include <asm/xen/hypervisor.h>
-
#include <xen/interface/xen.h>
#include <xen/events.h>
@@ -11,6 +13,31 @@
#include "xen-ops.h"
+static void xen_set_io_apic_routing(int irq, int trigger, int polarity)
+{
+ int ioapic, ioapic_pin;
+ int vector, gsi;
+ struct IO_APIC_route_entry entry;
+
+ gsi = xen_gsi_from_irq(irq);
+ vector = xen_vector_from_irq(irq);
+
+ ioapic = mp_find_ioapic(gsi);
+ if (ioapic == -1) {
+ printk(KERN_WARNING "xen_set_ioapic_routing: irq %d gsi %d ioapic %d\n",
+ irq, gsi, ioapic);
+ return;
+ }
+
+ ioapic_pin = mp_find_ioapic_pin(ioapic, gsi);
+
+ printk(KERN_INFO "xen_set_ioapic_routing: irq %d gsi %d vector %d ioapic %d pin %d triggering %d polarity %d\n",
+ irq, gsi, vector, ioapic, ioapic_pin, trigger, polarity);
+
+ setup_ioapic_entry(ioapic, -1, &entry, ~0, trigger, polarity, vector);
+ ioapic_write_entry(ioapic, ioapic_pin, entry);
+}
+
int xen_register_gsi(u32 gsi, int triggering, int polarity)
{
int irq;
@@ -25,6 +52,9 @@
printk(KERN_DEBUG "xen: --> irq=%d\n", irq);
+ if (irq > 0)
+ xen_set_io_apic_routing(irq, triggering, polarity);
+
return irq;
}
@@ -45,8 +75,6 @@
printk(KERN_INFO "xen: PCI device %s pin %d -> irq %d\n",
pci_name(dev), dev->pin, irq);
-
- /* install vector in ioapic? */
} else
printk(KERN_INFO "xen: irq enable for %s failed: rc=%d pin=%d irq=%d\n",
pci_name(dev), rc, dev->pin, dev->irq);
A privileged PV Xen domain can get direct access to hardware. In
order for this to be useful, it must be able to get hardware
interrupts.
Being a PV Xen domain, all interrupts are delivered as event channels.
PIRQ event channels are bound to a pirq number and an interrupt
vector. When an IO APIC raises a hardware interrupt on that vector, it
is delivered as an event channel, which we can route to the
appropriate device driver(s).
This patch simply implements the infrastructure for dealing with pirq
event channels.
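Lifecycle sketch (as implemented below; error paths omitted):

    xen_allocate_pirq(gsi)     /* allocate irq + PHYSDEVOP_alloc_irq_vector */
    startup_pirq(irq)          /* EVTCHNOP_bind_pirq -> event channel */
      ... hardware interrupt -> event channel -> handled via xen_pirq_chip ...
    end_pirq(irq)              /* unmask, plus PHYSDEVOP_eoi if needed */
    shutdown_pirq(irq)         /* EVTCHNOP_close */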
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
drivers/xen/events.c | 250 +++++++++++++++++++++++++++++++++++++++++++++++++-
include/xen/events.h | 11 ++
2 files changed, 258 insertions(+), 3 deletions(-)
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -16,7 +16,7 @@
* (typically dom0).
* 2. VIRQs, typically used for timers. These are per-cpu events.
* 3. IPIs.
- * 4. Hardware interrupts. Not supported at present.
+ * 4. PIRQs - Hardware interrupts.
*
* Jeremy Fitzhardinge <[email protected]>, XenSource Inc, 2007
*/
@@ -39,6 +39,9 @@
#include <xen/interface/xen.h>
#include <xen/interface/event_channel.h>
+/* Leave low irqs free for identity mapping */
+#define LEGACY_IRQS 16
+
/*
* This lock protects updates to the following mapping and reference-count
* arrays. The lock does not need to be acquired to read the mapping tables.
@@ -82,10 +85,12 @@
enum ipi_vector ipi;
struct {
unsigned short gsi;
- unsigned short vector;
+ unsigned char vector;
+ unsigned char flags;
} pirq;
} u;
};
+#define PIRQ_NEEDS_EOI (1 << 0)
static struct irq_info irq_info[NR_IRQS];
@@ -98,6 +103,7 @@
#define VALID_EVTCHN(chn) ((chn) != 0)
static struct irq_chip xen_dynamic_chip;
+static struct irq_chip xen_pirq_chip;
/* Constructor for packed IRQ information. */
static struct irq_info mk_unbound_info(void)
@@ -203,6 +209,15 @@
return ret;
}
+static bool pirq_needs_eoi(unsigned irq)
+{
+ struct irq_info *info = info_for_irq(irq);
+
+ BUG_ON(info->type != IRQT_PIRQ);
+
+ return info->u.pirq.flags & PIRQ_NEEDS_EOI;
+}
+
static inline unsigned long active_evtchns(unsigned int cpu,
struct shared_info *sh,
unsigned int idx)
@@ -317,9 +332,12 @@
{
int irq;
- for_each_irq_nr(irq)
+ for_each_irq_nr(irq) {
+ if (irq < LEGACY_IRQS)
+ continue;
if (irq_info[irq].type == IRQT_UNBOUND)
break;
+ }
if (irq == nr_irqs)
panic("No available IRQ to bind to: increase nr_irqs!\n");
@@ -329,6 +347,212 @@
return irq;
}
+static bool identity_mapped_irq(unsigned irq)
+{
+ /* only identity map legacy irqs */
+ return irq < LEGACY_IRQS;
+}
+
+static void pirq_unmask_notify(int irq)
+{
+ struct physdev_eoi eoi = { .irq = irq };
+
+ if (unlikely(pirq_needs_eoi(irq))) {
+ int rc = HYPERVISOR_physdev_op(PHYSDEVOP_eoi, &eoi);
+ WARN_ON(rc);
+ }
+}
+
+static void pirq_query_unmask(int irq)
+{
+ struct physdev_irq_status_query irq_status;
+ struct irq_info *info = info_for_irq(irq);
+
+ BUG_ON(info->type != IRQT_PIRQ);
+
+ irq_status.irq = irq;
+ if (HYPERVISOR_physdev_op(PHYSDEVOP_irq_status_query, &irq_status))
+ irq_status.flags = 0;
+
+ info->u.pirq.flags &= ~PIRQ_NEEDS_EOI;
+ if (irq_status.flags & XENIRQSTAT_needs_eoi)
+ info->u.pirq.flags |= PIRQ_NEEDS_EOI;
+}
+
+static bool probing_irq(int irq)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+
+ return desc && desc->action == NULL;
+}
+
+static unsigned int startup_pirq(unsigned int irq)
+{
+ struct evtchn_bind_pirq bind_pirq;
+ struct irq_info *info = info_for_irq(irq);
+ int evtchn = evtchn_from_irq(irq);
+
+ BUG_ON(info->type != IRQT_PIRQ);
+
+ if (VALID_EVTCHN(evtchn))
+ goto out;
+
+ bind_pirq.pirq = irq;
+ /* NB. We are happy to share unless we are probing. */
+ bind_pirq.flags = probing_irq(irq) ? 0 : BIND_PIRQ__WILL_SHARE;
+ if (HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind_pirq) != 0) {
+ if (!probing_irq(irq))
+ printk(KERN_INFO "Failed to obtain physical IRQ %d\n",
+ irq);
+ return 0;
+ }
+ evtchn = bind_pirq.port;
+
+ pirq_query_unmask(irq);
+
+ evtchn_to_irq[evtchn] = irq;
+ bind_evtchn_to_cpu(evtchn, 0);
+ info->evtchn = evtchn;
+
+ out:
+ unmask_evtchn(evtchn);
+ pirq_unmask_notify(irq);
+
+ return 0;
+}
+
+static void shutdown_pirq(unsigned int irq)
+{
+ struct evtchn_close close;
+ struct irq_info *info = info_for_irq(irq);
+ int evtchn = evtchn_from_irq(irq);
+
+ BUG_ON(info->type != IRQT_PIRQ);
+
+ if (!VALID_EVTCHN(evtchn))
+ return;
+
+ mask_evtchn(evtchn);
+
+ close.port = evtchn;
+ if (HYPERVISOR_event_channel_op(EVTCHNOP_close, &close) != 0)
+ BUG();
+
+ bind_evtchn_to_cpu(evtchn, 0);
+ evtchn_to_irq[evtchn] = -1;
+ info->evtchn = 0;
+}
+
+static void enable_pirq(unsigned int irq)
+{
+ startup_pirq(irq);
+}
+
+static void disable_pirq(unsigned int irq)
+{
+}
+
+static void ack_pirq(unsigned int irq)
+{
+ int evtchn = evtchn_from_irq(irq);
+
+ move_native_irq(irq);
+
+ if (VALID_EVTCHN(evtchn)) {
+ mask_evtchn(evtchn);
+ clear_evtchn(evtchn);
+ }
+}
+
+static void end_pirq(unsigned int irq)
+{
+ int evtchn = evtchn_from_irq(irq);
+ struct irq_desc *desc = irq_to_desc(irq);
+
+ if (WARN_ON(!desc))
+ return;
+
+ if ((desc->status & (IRQ_DISABLED|IRQ_PENDING)) ==
+ (IRQ_DISABLED|IRQ_PENDING)) {
+ shutdown_pirq(irq);
+ } else if (VALID_EVTCHN(evtchn)) {
+ unmask_evtchn(evtchn);
+ pirq_unmask_notify(irq);
+ }
+}
+
+static int find_irq_by_gsi(unsigned gsi)
+{
+ int irq;
+
+ for(irq = 0; irq < NR_IRQS; irq++) {
+ struct irq_info *info = info_for_irq(irq);
+
+ if (info == NULL || info->type != IRQT_PIRQ)
+ continue;
+
+ if (gsi_from_irq(irq) == gsi)
+ return irq;
+ }
+
+ return -1;
+}
+
+/*
+ * Allocate a physical irq, along with a vector. We don't assign an
+ * event channel until the irq actually started up. Return an
+ * existing irq if we've already got one for the gsi.
+ */
+int xen_allocate_pirq(unsigned gsi)
+{
+ int irq;
+ struct physdev_irq irq_op;
+
+ spin_lock(&irq_mapping_update_lock);
+
+ irq = find_irq_by_gsi(gsi);
+ if (irq != -1) {
+ printk(KERN_INFO "xen_allocate_pirq: returning irq %d for gsi %u\n",
+ irq, gsi);
+ goto out; /* XXX need refcount? */
+ }
+
+ if (identity_mapped_irq(gsi)) {
+ irq = gsi;
+ dynamic_irq_init(irq);
+ } else
+ irq = find_unbound_irq();
+
+ spin_unlock(&irq_mapping_update_lock);
+
+ set_irq_chip_and_handler_name(irq, &xen_pirq_chip,
+ handle_level_irq, "pirq");
+
+ irq_op.irq = irq;
+ if (HYPERVISOR_physdev_op(PHYSDEVOP_alloc_irq_vector, &irq_op)) {
+ dynamic_irq_cleanup(irq);
+ irq = -ENOSPC;
+ goto out;
+ }
+
+ irq_info[irq] = mk_pirq_info(0, gsi, irq_op.vector);
+
+out:
+ spin_unlock(&irq_mapping_update_lock);
+
+ return irq;
+}
+
+int xen_vector_from_irq(unsigned irq)
+{
+ return vector_from_irq(irq);
+}
+
+int xen_gsi_from_irq(unsigned irq)
+{
+ return gsi_from_irq(irq);
+}
+
int bind_evtchn_to_irq(unsigned int evtchn)
{
int irq;
@@ -912,6 +1136,26 @@
.retrigger = retrigger_dynirq,
};
+static struct irq_chip xen_pirq_chip __read_mostly = {
+ .name = "xen-pirq",
+
+ .startup = startup_pirq,
+ .shutdown = shutdown_pirq,
+
+ .enable = enable_pirq,
+ .unmask = enable_pirq,
+
+ .disable = disable_pirq,
+ .mask = disable_pirq,
+
+ .ack = ack_pirq,
+ .end = end_pirq,
+
+ .set_affinity = set_affinity_irq,
+
+ .retrigger = retrigger_dynirq,
+};
+
void __init xen_init_IRQ(void)
{
int i;
diff --git a/include/xen/events.h b/include/xen/events.h
--- a/include/xen/events.h
+++ b/include/xen/events.h
@@ -55,4 +55,15 @@
irq will be disabled so it won't deliver an interrupt. */
void xen_poll_irq(int irq);
+/* Allocate an irq for a physical interrupt, given a gsi. "Legacy"
+ GSIs are identity mapped; others are dynamically allocated as
+ usual. */
+int xen_allocate_pirq(unsigned gsi);
+
+/* Return vector allocated to pirq */
+int xen_vector_from_irq(unsigned pirq);
+
+/* Return gsi allocated to pirq */
+int xen_gsi_from_irq(unsigned pirq);
+
#endif /* _XEN_EVENTS_H */
Rather than overloading vectors for event channels, take full
responsibility for mapping an event channel to an irq directly. With
this patch Xen has its own irq allocator.
When the kernel gets an event channel upcall, it maps the event
channel number to an irq and injects it into the normal interrupt
path.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/include/asm/xen/events.h | 6 ------
arch/x86/xen/irq.c | 17 +----------------
drivers/xen/events.c | 22 ++++++++++++++++++++--
3 files changed, 21 insertions(+), 24 deletions(-)
diff --git a/arch/x86/include/asm/xen/events.h b/arch/x86/include/asm/xen/events.h
--- a/arch/x86/include/asm/xen/events.h
+++ b/arch/x86/include/asm/xen/events.h
@@ -15,10 +15,4 @@
return raw_irqs_disabled_flags(regs->flags);
}
-static inline void xen_do_IRQ(int irq, struct pt_regs *regs)
-{
- regs->orig_ax = ~irq;
- do_IRQ(regs);
-}
-
#endif /* _ASM_X86_XEN_EVENTS_H */
diff --git a/arch/x86/xen/irq.c b/arch/x86/xen/irq.c
--- a/arch/x86/xen/irq.c
+++ b/arch/x86/xen/irq.c
@@ -19,21 +19,6 @@
(void)HYPERVISOR_xen_version(0, NULL);
}
-static void __init __xen_init_IRQ(void)
-{
- int i;
-
- /* Create identity vector->irq map */
- for(i = 0; i < NR_VECTORS; i++) {
- int cpu;
-
- for_each_possible_cpu(cpu)
- per_cpu(vector_irq, cpu)[i] = i;
- }
-
- xen_init_IRQ();
-}
-
static unsigned long xen_save_fl(void)
{
struct vcpu_info *vcpu;
@@ -123,7 +108,7 @@
}
static const struct pv_irq_ops xen_irq_ops __initdata = {
- .init_IRQ = __xen_init_IRQ,
+ .init_IRQ = xen_init_IRQ,
.save_fl = xen_save_fl,
.restore_fl = xen_restore_fl,
.irq_disable = xen_irq_disable,
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -29,6 +29,7 @@
#include <asm/ptrace.h>
#include <asm/irq.h>
+#include <asm/idle.h>
#include <asm/sync_bitops.h>
#include <asm/xen/hypercall.h>
#include <asm/xen/hypervisor.h>
@@ -503,6 +504,24 @@
}
+static void xen_do_irq(unsigned irq, struct pt_regs *regs)
+{
+ struct pt_regs *old_regs = set_irq_regs(regs);
+
+ if (WARN_ON(irq == -1))
+ return;
+
+ exit_idle();
+ irq_enter();
+
+ //printk("cpu %d handling irq %d\n", smp_processor_id(), info->irq);
+ handle_irq(irq, regs);
+
+ irq_exit();
+
+ set_irq_regs(old_regs);
+}
+
/*
* Search the CPUs pending events bitmasks. For each one found, map
* the event number to an irq, and feed it into do_IRQ() for
@@ -543,8 +562,7 @@
int port = (word_idx * BITS_PER_LONG) + bit_idx;
int irq = evtchn_to_irq[port];
- if (irq != -1)
- xen_do_IRQ(irq, regs);
+ xen_do_irq(irq, regs);
}
}
From: Ian Campbell <[email protected]>
Xen imposes a particular PAT layout on all paravirtual guests which
does not match the layout Linux would like to use.
Force PAT to be disabled until this is resolved.
Signed-off-by: Ian Campbell <[email protected]>
---
arch/x86/include/asm/pat.h | 4 ++--
arch/x86/xen/enlighten.c | 3 +++
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/pat.h b/arch/x86/include/asm/pat.h
--- a/arch/x86/include/asm/pat.h
+++ b/arch/x86/include/asm/pat.h
@@ -6,9 +6,11 @@
#ifdef CONFIG_X86_PAT
extern int pat_enabled;
extern void validate_pat_support(struct cpuinfo_x86 *c);
+extern void pat_disable(char *reason);
#else
static const int pat_enabled;
static inline void validate_pat_support(struct cpuinfo_x86 *c) { }
+static inline void pat_disable(char *reason) { }
#endif
extern void pat_init(void);
@@ -17,6 +19,4 @@
unsigned long req_type, unsigned long *ret_type);
extern int free_memtype(u64 start, u64 end);
-extern void pat_disable(char *reason);
-
#endif /* _ASM_X86_PAT_H */
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -48,6 +48,7 @@
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <asm/reboot.h>
+#include <asm/pat.h>
#include "xen-ops.h"
#include "mmu.h"
@@ -1743,6 +1744,8 @@
add_preferred_console("hvc", 0, NULL);
}
+ pat_disable("PAT disabled on Xen");
+
xen_raw_console_write("about to get started...\n");
/* Start the world */
There should be no need for us to maintain our own bind count for
irqs, since the surrounding irq system should keep track of shared
irqs for us.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
drivers/xen/events.c | 26 +++++++-------------------
1 file changed, 7 insertions(+), 19 deletions(-)
diff --git a/drivers/xen/events.c b/drivers/xen/events.c
--- a/drivers/xen/events.c
+++ b/drivers/xen/events.c
@@ -53,7 +53,7 @@
/* Interrupt types. */
enum xen_irq_type {
- IRQT_UNBOUND,
+ IRQT_UNBOUND = 0,
IRQT_PIRQ,
IRQT_VIRQ,
IRQT_IPI,
@@ -94,9 +94,6 @@
};
static unsigned long cpu_evtchn_mask[NR_CPUS][NR_EVENT_CHANNELS/BITS_PER_LONG];
-/* Reference counts for bindings to IRQs. */
-static int irq_bindcount[NR_IRQS];
-
/* Xen will never allocate port zero for any purpose. */
#define VALID_EVTCHN(chn) ((chn) != 0)
@@ -320,9 +317,8 @@
{
int irq;
- /* Only allocate from dynirq range */
for_each_irq_nr(irq)
- if (irq_bindcount[irq] == 0)
+ if (irq_info[irq].type == IRQT_UNBOUND)
break;
if (irq == nr_irqs)
@@ -351,8 +347,6 @@
irq_info[irq] = mk_evtchn_info(evtchn);
}
- irq_bindcount[irq]++;
-
spin_unlock(&irq_mapping_update_lock);
return irq;
@@ -389,8 +383,6 @@
bind_evtchn_to_cpu(evtchn, cpu);
}
- irq_bindcount[irq]++;
-
out:
spin_unlock(&irq_mapping_update_lock);
return irq;
@@ -427,8 +419,6 @@
bind_evtchn_to_cpu(evtchn, cpu);
}
- irq_bindcount[irq]++;
-
spin_unlock(&irq_mapping_update_lock);
return irq;
@@ -441,7 +431,7 @@
spin_lock(&irq_mapping_update_lock);
- if ((--irq_bindcount[irq] == 0) && VALID_EVTCHN(evtchn)) {
+ if (VALID_EVTCHN(evtchn)) {
close.port = evtchn;
if (HYPERVISOR_event_channel_op(EVTCHNOP_close, &close) != 0)
BUG();
@@ -667,6 +657,8 @@
/* Rebind a new event channel to an existing irq. */
void rebind_evtchn_irq(int evtchn, int irq)
{
+ struct irq_info *info = info_for_irq(irq);
+
/* Make sure the irq is masked, since the new event channel
will also be masked. */
disable_irq(irq);
@@ -676,8 +668,8 @@
/* After resume the irq<->evtchn mappings are all cleared out */
BUG_ON(evtchn_to_irq[evtchn] != -1);
/* Expect irq to have been bound before,
- so the bindcount should be non-0 */
- BUG_ON(irq_bindcount[irq] == 0);
+ so there should be a proper type */
+ BUG_ON(info->type == IRQT_UNBOUND);
evtchn_to_irq[evtchn] = irq;
irq_info[irq] = mk_evtchn_info(evtchn);
@@ -930,9 +922,5 @@
for (i = 0; i < NR_EVENT_CHANNELS; i++)
mask_evtchn(i);
- /* Dynamic IRQ space is currently unbound. Zero the refcnts. */
- for_each_irq_nr(i)
- irq_bindcount[i] = 0;
-
irq_ctx_init(smp_processor_id());
}
On Thu, 2008-11-13 at 11:10 -0800, Jeremy Fitzhardinge wrote:
> diff --git a/arch/x86/kernel/cpu/mtrr/xen.c b/arch/x86/kernel/cpu/mtrr/xen.c
> new file mode 100644
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/mtrr/xen.c
> @@ -0,0 +1,59 @@
...
> +
> +/* DOM0 TODO: Need to fill in the remaining mtrr methods to have full
> + * working userland mtrr support. */
> +static struct mtrr_ops xen_mtrr_ops = {
> + .vendor = X86_VENDOR_UNKNOWN,
> +// .set = xen_set_mtrr,
> +// .get = xen_get_mtrr,
> + .get_free_region = generic_get_free_region,
> +// .validate_add_page = xen_validate_add_page,
> + .have_wrcomb = positive_have_wrcomb,
> + .use_intel_if = 0,
> + .num_var_ranges = xen_num_var_ranges,
> +};
...
I'm vague on the details now, but looking back at the dom0 patch set
here:
http://git.et.redhat.com/?p=linux-2.6-dom0-pvops.git;a=shortlog;h=55abc194080b5cf31cd66f5e35e8e5c5af2aa927
I see we did have a bunch more mtrr work e.g. fixing the TODO above:
http://git.et.redhat.com/?p=linux-2.6-dom0-pvops.git;a=commitdiff;h=93f779bf3d79f28d0933bfbc53f7b8c5b6496081
Cheers,
Mark.
Mark McLoughlin wrote:
> On Thu, 2008-11-13 at 11:10 -0800, Jeremy Fitzhardinge wrote:
>
>
>> diff --git a/arch/x86/kernel/cpu/mtrr/xen.c b/arch/x86/kernel/cpu/mtrr/xen.c
>> new file mode 100644
>> --- /dev/null
>> +++ b/arch/x86/kernel/cpu/mtrr/xen.c
>> @@ -0,0 +1,59 @@
>>
> ...
>
>> +
>> +/* DOM0 TODO: Need to fill in the remaining mtrr methods to have full
>> + * working userland mtrr support. */
>> +static struct mtrr_ops xen_mtrr_ops = {
>> + .vendor = X86_VENDOR_UNKNOWN,
>> +// .set = xen_set_mtrr,
>> +// .get = xen_get_mtrr,
>> + .get_free_region = generic_get_free_region,
>> +// .validate_add_page = xen_validate_add_page,
>> + .have_wrcomb = positive_have_wrcomb,
>> + .use_intel_if = 0,
>> + .num_var_ranges = xen_num_var_ranges,
>> +};
>>
>
> ...
>
> I'm vague on the details now, but looking back at the dom0 patch set
> here:
>
> http://git.et.redhat.com/?p=linux-2.6-dom0-pvops.git;a=shortlog;h=55abc194080b5cf31cd66f5e35e8e5c5af2aa927
>
> I see we did have a bunch more mtrr work e.g. fixing the TODO above:
>
> http://git.et.redhat.com/?p=linux-2.6-dom0-pvops.git;a=commitdiff;h=93f779bf3d79f28d0933bfbc53f7b8c5b6496081
>
Yes, the mtrr changes are incomplete. I worked on them only as much as
necessary to get things booting, and then left the rest to revisit.
It's not a particularly pretty part of the kernel, and so I was hoping
some magic beautification fairy would visit it before I needed to touch
it more...
J
Not directly related to this patch alone, but to the combined set of changes
to swiotlb: I don't see any handling of CONFIG_HIGHMEM here (or at least
a note that this is a known limitation needing work). I mention this because
this was the largest part of the changes I had posted long ago to make
lib/swiotlb.c Xen-ready, and which got rejected due to their ugliness.
While perhaps less intrusive to take care of, I also didn't see an equivalent
of the range_straddles_page_boundary() logic, without which I can't see
how this would work in the common case.
Jan
>>> Jeremy Fitzhardinge <[email protected]> 13.11.08 20:10 >>>
Architectures may need to allocate memory specially for use with
the swiotlb. Create the weak function swiotlb_alloc_boot() and
swiotlb_alloc() defaulting to the current behaviour.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Signed-off-by: Ian Campbell <[email protected]>
Cc: Joerg Roedel <[email protected]>
Cc: Jan Beulich <[email protected]>
Cc: Tony Luck <[email protected]>
Cc: FUJITA Tomonori <[email protected]>
---
include/linux/swiotlb.h | 3 +++
lib/swiotlb.c | 16 +++++++++++++---
2 files changed, 16 insertions(+), 3 deletions(-)
diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
--- a/include/linux/swiotlb.h
+++ b/include/linux/swiotlb.h
@@ -10,6 +10,9 @@
extern void
swiotlb_init(void);
+extern void *swiotlb_alloc_boot(size_t bytes, unsigned long nslabs);
+extern void *swiotlb_alloc(unsigned order, unsigned long nslabs);
+
extern void
*swiotlb_alloc_coherent(struct device *hwdev, size_t size,
dma_addr_t *dma_handle, gfp_t flags);
diff --git a/lib/swiotlb.c b/lib/swiotlb.c
--- a/lib/swiotlb.c
+++ b/lib/swiotlb.c
@@ -21,6 +21,7 @@
#include <linux/mm.h>
#include <linux/module.h>
#include <linux/spinlock.h>
+#include <linux/swiotlb.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/ctype.h>
@@ -126,6 +127,16 @@
__setup("swiotlb=", setup_io_tlb_npages);
/* make io_tlb_overflow tunable too? */
+void * __weak swiotlb_alloc_boot(size_t size, unsigned long nslabs)
+{
+ return alloc_bootmem_low_pages(size);
+}
+
+void * __weak swiotlb_alloc(unsigned order, unsigned long nslabs)
+{
+ return (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN, order);
+}
+
/*
* Statically reserve bounce buffer space and initialize bounce buffer data
* structures for the software IO TLB used to implement the DMA API.
@@ -145,7 +156,7 @@
/*
* Get IO TLB memory from the low pages
*/
- io_tlb_start = alloc_bootmem_low_pages(bytes);
+ io_tlb_start = swiotlb_alloc_boot(bytes, io_tlb_nslabs);
if (!io_tlb_start)
panic("Cannot allocate SWIOTLB buffer");
io_tlb_end = io_tlb_start + bytes;
@@ -202,8 +213,7 @@
bytes = io_tlb_nslabs << IO_TLB_SHIFT;
while ((SLABS_PER_PAGE << order) > IO_TLB_MIN_SLABS) {
- io_tlb_start = (char *)__get_free_pages(GFP_DMA | __GFP_NOWARN,
- order);
+ io_tlb_start = swiotlb_alloc(order, io_tlb_nslabs);
if (io_tlb_start)
break;
order--;
Jan Beulich wrote:
> Not directly related to this patch alone, but to the combined set of changes
> to swiotlb: I don't see any handling of CONFIG_HIGHMEM here (or at least
> a note that this is a known limitation needing work). I mention this because
> this was the largest part of the changes I had posted long ago to make
> lib/swiotlb.c Xen-ready, and which got rejected due to their ugliness.
>
Was that Andi's objection on the grounds that he didn't think that Xen
should need swiotlb at all?
I have to admit I didn't follow that thread very closely (or threads, as
I seem to remember). Do you have a pointer to the pertinent bits?
> While perhaps less intrusive to take care of, I also didn't see an equivalent
> of the range_straddles_page_boundary() logic, without which I can't see
> how this would work in the common case.
>
Could you be more specific? The swiotlb allocation should be machine
contiguous and so there's no straddling required, but I think I'm missing
your point.
J
On Thu, 13 Nov 2008 11:10:02 -0800
Jeremy Fitzhardinge <[email protected]> wrote:
> From: Ian Campbell <[email protected]>
>
> Signed-off-by: Ian Campbell <[email protected]>
> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
> ---
> include/linux/swiotlb.h | 14 ++++++++++++++
> lib/swiotlb.c | 14 +-------------
> 2 files changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> --- a/include/linux/swiotlb.h
> +++ b/include/linux/swiotlb.h
> @@ -7,6 +7,20 @@
> struct dma_attrs;
> struct scatterlist;
>
> +/*
> + * Maximum allowable number of contiguous slabs to map,
> + * must be a power of 2. What is the appropriate value ?
> + * The complexity of {map,unmap}_single is linearly dependent on this value.
> + */
> +#define IO_TLB_SEGSIZE 128
> +
> +
> +/*
> + * log of the size of each IO TLB slab. The number of slabs is command line
> + * controllable.
> + */
> +#define IO_TLB_SHIFT 11
> +
Why do we need to export IO_TLB_SEGSIZE and IO_TLB_SHIFT to everyone
in include/linux?
On Thu, 13 Nov 2008 11:10:16 -0800
Jeremy Fitzhardinge <[email protected]> wrote:
> swiotlb on 32 bit will be used by Xen domain 0 support.
If you want swiotlb on 32 bit, you need more modifications, I think.
For example, the following code assumes that the mask needs to be
64 bits.
static void *
map_single(struct device *hwdev, char *buffer, size_t size, int dir)
{
unsigned long flags;
char *dma_addr;
unsigned int nslots, stride, index, wrap;
int i;
unsigned long start_dma_addr;
unsigned long mask;
unsigned long offset_slots;
unsigned long max_slots;
mask = dma_get_seg_boundary(hwdev);
start_dma_addr = virt_to_bus(io_tlb_start) & mask;
offset_slots = ALIGN(start_dma_addr, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
max_slots = mask + 1
? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
: 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
Jeremy Fitzhardinge wrote:
>> While perhaps less intrusive to take care of, I also didn't see an equivalent
>> of the range_straddles_page_boundary() logic, without which I can't see
>> how this would work in the common case.
>>
> Could you be more specific? The swiotlb allocation should be machine
> contiguous and so there's no straddling required, but I think I'm missing
> your point.
In general, I think you are right; swiotlb should be machine contiguous, so it
works in the normal case. The range_straddles_page_boundary function takes care
of a corner case, where you can run into swiotlb exhaustion when you really
shouldn't. As I understand it, it comes about because it is possible to get a
swiotlb request with two pages that just happen to be machine contiguous, but
were *not* allocated through xen_create_contiguous_region (and hence weren't
marked in the contiguous_bitmap as such). In this case, you split the request
into two separate requests, and this can more easily lead to exhaustion.
range_straddles_page_boundary works around this by checking whether any two
pages coming through the swiotlb layer are machine contiguous, and if they are,
not splitting the request.
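For concreteness, the check being discussed looks roughly like this (a
sketch: the single-page early-out is assumed, and the contiguity test
matches the check_pages_physically_contiguous() call visible in the
xen-iommu.c hunk quoted elsewhere in this thread):

static int range_straddles_page_boundary(phys_addr_t p, size_t size)
{
	unsigned long pfn = p >> PAGE_SHIFT;
	unsigned int offset = p & ~PAGE_MASK;

	/* Entirely within one page: never needs splitting. */
	if (offset + size <= PAGE_SIZE)
		return 0;

	/* Crosses page boundaries: only safe if the underlying pages
	   are machine-contiguous as well as pseudo-physically so. */
	if (check_pages_physically_contiguous(pfn, offset, size))
		return 0;

	return 1;
}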
--
Chris Lalancette
On 17/11/08 08:04, "Chris Lalancette" <[email protected]> wrote:
>> Could you be more specific? The swiotlb allocation should be machine
>> contiguous and so there's no straddling required, but I think I'm missing
>> your point.
>
> In general, I think you are right; swiotlb should be machine contiguous, so it
> works in the normal case. The range_straddles_page_boundary function takes
> care
> of a corner case, where you can run into swiotlb exhaustion when you really
> shouldn't. As I understand it, it comes about because it is possible to get a
> swiotlb request with two pages that just happen to be machine contiguous, but
> were *not* allocated through xen_create_contiguous_region (and hence weren't
> marked in the contiguous_bitmap as such). In this case, you split the request
> into two separate requests, and this can more easily lead to exhaustion.
> range_straddles_page_boundary works around this by checking whether any two
> pages coming through the swiotlb layer are machine contiguous, and if they
> are,
> not splitting the request.
A more specific problem solved by range_straddles_page_boundary() in our
2.6.18 kernel was that the block layer would do bio merging because it
checked that pages really were contiguous, and then swiotlb (without
r_s_p_b) would decide that the pages weren't contiguous (because the
contiguity was random luck rather than explicitly requested) and hence do
bounce buffering. Result was that sufficiently aggressive I/O would exhaust
swiotlb resources and crash the kernel.
In the 2.6.18 port we actually got rid of contiguous_bitmap[] entirely.
-- Keir
>>> Jeremy Fitzhardinge <[email protected]> 14.11.08 20:33 >>>
>Jan Beulich wrote:
>> Not directly related to this patch alone, but to the combined set of changes
>> to swiotlb: I don't see any handling of CONFIG_HIGHMEM here (or at least
>> a note that this is a known limitation needing work). I mention this because
>> this was the largest part of the changes I had posted long ago to make
>> lib/swiotlb.c Xen-ready, and which got rejected due to their ugliness.
>>
>
>Was that Andi's objection on the grounds that he didn't think that Xen
>should need swiotlb at all?
No, Tony Luck actually merged it, but someone else (I don't recall who it
was) requested it to be reverted again.
>I have to admit I didn't follow that thread very closely (or threads, as
>I seem to remember). Do you have a pointer to the pertinent bits?
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=51099005ab8e09d68a13fea8d55bc739c1040ca6
>> While perhaps less intrusive to take care of, I also didn't see an equivalent
>> of the range_straddles_page_boundary() logic, without which I can't see
>> how this would work in the common case.
>>
>Could you be more specific? The swiotlb allocation should be machine
>contiguous and so there's no straddling required, but I think I'm missing
>your point.
The question is whether a multi-page piece of memory must be funneled
through the swiotlb in the first place. In native code, checking whether
the first/last byte satisfies the address_needs_mapping() check is
sufficient, but in Xen you also need to check whether the known to be
physically contiguous pages are also machine-contiguous.
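Concretely, the native-code test amounts to something like this (a
sketch; address_needs_mapping() is the lib/swiotlb.c helper of this
era, taking a bus address):

	/* Native: a physically contiguous buffer must be bounced only
	 * if either end falls outside the device's DMA mask. Under
	 * Xen this is insufficient, because the pages in between may
	 * not be machine-contiguous even though they are physically
	 * contiguous from the guest's point of view. */
	needs_bounce = address_needs_mapping(hwdev, virt_to_bus(buf)) ||
		       address_needs_mapping(hwdev, virt_to_bus(buf + size - 1));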
Jan
On Mon, 2008-11-17 at 12:48 +0900, FUJITA Tomonori wrote:
> Why do we need to export IO_TLB_SEGSIZE and IO_TLB_SHIFT to everyone
> in include/linux?
A subsequent Xen patch needs to make use of them, although I can't see
it in the patchset Jeremy posted, so here it is (not fully baked yet)
Subject: xen swiotlb: fixup swiotlb in chunks smaller than MAX_CONTIG_ORDER
From: Ian Campbell <[email protected]>
Signed-off-by: Ian Campbell <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/pci-swiotlb_64.c | 7 +------
drivers/pci/xen-iommu.c | 30 ++++++++++++++++++++----------
2 files changed, 21 insertions(+), 16 deletions(-)
===================================================================
--- a/arch/x86/kernel/pci-swiotlb_64.c
+++ b/arch/x86/kernel/pci-swiotlb_64.c
@@ -28,12 +28,7 @@
void *swiotlb_alloc(unsigned order, unsigned long nslabs)
{
- void *ret = (void *)__get_free_pages(GFP_DMA | __GFP_NOWARN, order);
-
- if (ret && xen_pv_domain())
- xen_swiotlb_fixup(ret, 1u << order, nslabs);
-
- return ret;
+ BUG();
}
static dma_addr_t
===================================================================
--- a/drivers/pci/xen-iommu.c
+++ b/drivers/pci/xen-iommu.c
@@ -6,6 +6,7 @@
#include <linux/version.h>
#include <linux/scatterlist.h>
#include <linux/bio.h>
+#include <linux/swiotlb.h>
#include <linux/io.h>
#include <linux/bug.h>
@@ -43,19 +44,27 @@
unsigned long *bitmap;
};
+static int max_dma_bits = 32;
+
void xen_swiotlb_fixup(void *buf, size_t size, unsigned long nslabs)
{
- unsigned order = get_order(size);
+ int i, rc;
+ int dma_bits;
- printk(KERN_DEBUG "xen_swiotlb_fixup: buf=%p size=%zu order=%u\n",
- buf, size, order);
+ printk(KERN_DEBUG "xen_swiotlb_fixup: buf=%p size=%zu\n",
+ buf, size);
- if (WARN_ON(size != (PAGE_SIZE << order)))
- return;
-
- if (xen_create_contiguous_region((unsigned long)buf,
- order, 0xffffffff))
- printk(KERN_ERR "xen_create_contiguous_region failed\n");
+ dma_bits = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT) + PAGE_SHIFT;
+ for (i = 0; i < nslabs; i += IO_TLB_SEGSIZE) {
+ do {
+ rc = xen_create_contiguous_region(
+ (unsigned long)buf + (i << IO_TLB_SHIFT),
+ get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT),
+ dma_bits);
+ } while (rc && dma_bits++ < max_dma_bits);
+ if (rc)
+ panic(KERN_ERR "xen_create_contiguous_region failed\n");
+ }
}
static inline int address_needs_mapping(struct device *hwdev,
@@ -117,7 +126,8 @@
if (check_pages_physically_contiguous(pfn, offset, size))
return 0;
- printk("range_straddles_page_boundary: p=%Lx size=%d pfn=%lx\n",
+ printk(KERN_WARNING "range_straddles_page_boundary: "
+ "p=%Lx size=%zd pfn=%lx\n",
p, size, pfn);
return 1;
}
On Mon, 2008-11-17 at 12:48 +0900, FUJITA Tomonori wrote:
> On Thu, 13 Nov 2008 11:10:16 -0800
> Jeremy Fitzhardinge <[email protected]> wrote:
>
> > swiotlb on 32 bit will be used by Xen domain 0 support.
>
> If you want swiotlb on 32 bit, you need more modifications, I think.
Possibly. It currently "Works For Me(tm)", but I should check it over.
> For example, the following code assumes that the mask needs to be
> 64 bits.
The use of unsigned long for the mask is throughout the API and not
simply limited to swiotlb.c. All the callers of dma_set_seg_boundary
(PCI and SCSI subsys it seems) do not use a value >4G anywhere I can
see. Presumably if something did, we would see "warning: overflow in
implicit constant conversion" somewhere along the line. If no value is
set then the default is 0xffffffff which is safe on 32 bit.
I suspect that even with PAE addresses above 4G aren't seen very often
due to pre-existing subsystem specific bounce buffers or other existing
limitations (like network buffers being in lowmem).
Perhaps dma_addr_t should be used though?
Ian.
>
> static void *
> map_single(struct device *hwdev, char *buffer, size_t size, int dir)
> {
> unsigned long flags;
> char *dma_addr;
> unsigned int nslots, stride, index, wrap;
> int i;
> unsigned long start_dma_addr;
> unsigned long mask;
> unsigned long offset_slots;
> unsigned long max_slots;
>
> mask = dma_get_seg_boundary(hwdev);
> start_dma_addr = virt_to_bus(io_tlb_start) & mask;
>
> offset_slots = ALIGN(start_dma_addr, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT;
> max_slots = mask + 1
> ? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
> : 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
>
On Mon, 17 Nov 2008 16:16:06 +0000
Ian Campbell <[email protected]> wrote:
> > For example, the following code assumes that the mask needs to be
> > 64 bits.
>
> The use of unsigned long for the mask is throughout the API and not
> simply limited to swiotlb.c. All the callers of dma_set_seg_boundary
> (PCI and SCSI subsys it seems) do not use a value >4G anywhere I can
> see.
32bit is large enough for dma segment boundary mask, I think.
The problem that I talked about in the previous mail:
> max_slots = mask + 1
> ? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
> : 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
Since the popular value of the mask is 0xffffffff. So the above code
(mask + 1 ?) works wrongly if the size of mask is 32bit (well,
accidentally the result of max_slots is identical though).
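(To spell out the coincidence: with IO_TLB_SHIFT == 11 on a 32-bit
build, the intended value is ALIGN(mask + 1, 1 << 11) >> 11
= 2^32 >> 11 = 2^21 slots; because mask + 1 truncates to 0, the
ternary instead takes the fallback 1UL << (32 - 11), which is also
2^21.)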
> Presumably if something did, we would see "warning: overflow in
> implicit constant conversion" somewhere along the line. If no value is
> set then the default is 0xffffffff which is safe on 32 bit.
>
> I suspect that even with PAE addresses above 4G aren't seen very often
> due to pre-existing subsystem specific bounce buffers or other existing
> limitations (like network buffers being in lowmem).
I guess that you talk about the dma_mask (and coherent_dma_mask) in
struct device. The dma segment boundary mask represents the different
dma limitation of a device.
> Perhaps dma_addr_t should be used though?
I think that 'unsigned long' is better for the dma segment boundary
mask since it represents the hardware limitation. The size of the
value is not related to kernel configurations at all.
On Wed, 2008-11-19 at 11:19 +0900, FUJITA Tomonori wrote:
> On Mon, 17 Nov 2008 16:16:06 +0000
> Ian Campbell <[email protected]> wrote:
>
> > > For example, the following code assumes that the mask needs to be
> > > 64 bits.
> >
> > The use of unsigned long for the mask is throughout the API and not
> > simply limited to swiotlb.c. All the callers of dma_set_seg_boundary
> > (PCI and SCSI subsys it seems) do not use a value >4G anywhere I can
> > see.
>
> 32bit is large enough for dma segment boundary mask, I think.
>
> The problem that I talked about in the previous mail:
>
> > max_slots = mask + 1
> > ? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
> > : 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
>
> Since the popular value of the mask is 0xffffffff. So the above code
> (mask + 1 ?) works wrongly if the size of mask is 32bit (well,
> accidentally the result of max_slots is identical though).
Ah, I hadn't spotted this, you are right it probably works but just by
chance. Thanks for pointing it out.
> > Presumably if something did, we would see "warning: overflow in
> > implicit constant conversion" somewhere along the line. If no value is
> > set then the default is 0xffffffff which is safe on 32 bit.
> >
> > I suspect that even with PAE addresses above 4G aren't seen very often
> > due to pre-existing subsystem specific bounce buffers or other existing
> > limitations (like network buffers being in lowmem).
>
> I guess that you talk about the dma_mask (and coherent_dma_mask) in
> struct device. The dma segment boundary mask represents the different
> dma limitation of a device.
I was talking about the segment_boundary_mask in struct
device_dma_parameters which is the source of the "mask" value in the
code you quoted.
> > Perhaps dma_addr_t should be used though?
>
> I think that 'unsigned long' is better for the dma segment boundary
> mask since it represents the hardware limitation. The size of the
> value are not related with kernel configurations at all.
Right, it's just that on occasion we have to cope with slightly larger
values while manipulating things.
Ian.
> +#ifdef CONFIG_XEN_PCI
> + irq = xen_register_gsi(gsi, triggering, polarity);
> + if ((int)irq >= 0)
> + return irq;
> +#endif
why not change irq to 'int' and avoid the cast?
also, please eliminate the #ifdef by turning xen_register_gsi() into a
'return -1' inline on !CONFIG_XEN_PCI.
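Something like this in the header would do (a sketch; the parameter
types are assumed, with the return type changed to int as suggested
above):

#ifdef CONFIG_XEN_PCI
int xen_register_gsi(u32 gsi, int triggering, int polarity);
#else
static inline int xen_register_gsi(u32 gsi, int triggering, int polarity)
{
	return -1;
}
#endif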
> +#ifdef CONFIG_XEN_PCI
> + xen_pci_init();
> +#endif
hide the #ifdef in a header please. (like you already properly do for
xen_setup_pirqs())
> + if (rc != 0) {
> if (!probing_irq(irq))
> printk(KERN_INFO "Failed to obtain physical IRQ %d\n",
> irq);
> + dump_stack();
generally it's better to use WARN() or WARN_ONCE() to get good debug
feedback and stackdumps. (they also document the reason for the dump)
> @@ -523,8 +526,6 @@
> } else
> irq = find_unbound_irq();
>
> - spin_unlock(&irq_mapping_update_lock);
> -
> set_irq_chip_and_handler_name(irq, &xen_pirq_chip,
> handle_level_irq, "pirq");
hm, looks like a stray bugfix?
Ingo
* Jeremy Fitzhardinge <[email protected]> wrote:
> Writes to the IO APIC are paravirtualized via hypercalls, so implement
> the appropriate operations.
>
> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
> ---
> arch/x86/xen/Makefile | 3 +-
> arch/x86/xen/apic.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
> arch/x86/xen/enlighten.c | 2 +
> arch/x86/xen/xen-ops.h | 2 +
> 4 files changed, 72 insertions(+), 1 deletion(-)
hm, why is the ioapic used as the API here, and not an irqchip?
Ingo
Hi,
it seems that if CONFIG_XEN is set but CONFIG_XEN_DOM0 is not set,
then the call to xen_init_apic() in xen_start_kernel() causes the
build to fail.
One possible solution to this is to provide a dummy
version of xen_init_apic() in the !CONFIG_XEN_DOM0 case.
Another possible solution would be to add #ifdef CONFIG_XEN_DOM0
inside xen_start_kernel()
#make gcc --version
gcc (Debian 4.3.2-1) 4.3.2
Copyright (C) 2008 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
# make
[snip]
UPD include/linux/compile.h
CC init/version.o
LD init/built-in.o
LD .tmp_vmlinux1
arch/x86/xen/built-in.o: In function `xen_start_kernel':
(.init.text+0x8ef): undefined reference to `xen_init_apic'
Index: linux-2.6/arch/x86/xen/xen-ops.h
===================================================================
--- linux-2.6.orig/arch/x86/xen/xen-ops.h 2008-11-20 17:23:14.000000000 +0900
+++ linux-2.6/arch/x86/xen/xen-ops.h 2008-11-20 17:24:28.000000000 +0900
@@ -64,7 +64,11 @@ static inline void xen_smp_init(void) {}
#endif
+#ifdef CONFIG_XEN_DOM0
void xen_init_apic(void);
+#else
+static inline void xen_init_apic(void) { ; }
+#endif
/* Declare an asm function, along with symbols needed to make it
inlineable */
Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:
>
>
>> Writes to the IO APIC are paravirtualized via hypercalls, so implement
>> the appropriate operations.
>>
>> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
>> ---
>> arch/x86/xen/Makefile | 3 +-
>> arch/x86/xen/apic.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
>> arch/x86/xen/enlighten.c | 2 +
>> arch/x86/xen/xen-ops.h | 2 +
>> 4 files changed, 72 insertions(+), 1 deletion(-)
>>
>
> hm, why is the ioapic used as the API here, and not an irqchip?
>
In essence, the purpose of the series is to break the 1:1 relationship
between Linux irqs and hardware GSIs. This allows me to have my own irq
allocator, which in turn allows me to intermix "physical" irqs (ie, a
Linux irq number bound to a real hardware interrupt source) with the
various software/virtual irqs the Xen system needs.
Once a physical irq has been mapped onto a gsi interrupt source, the
mechanisms for handling the ioapic side of things are more or less the
same. There's the same procedure of finding the ioapic/pin for a gsi
and programming the appropriate vector.
(Presumably once I implement MSI support, all references to "gsi" will
become "gsi/msi/etc".)
So, there's an awkward tradeoff. I could just completely duplicate the
whole irq/vector/ioapic management code and hide it under my own
irqchip, but it would end up duplicating a lot of the existing code. My
alternative was to try to open out the existing code into something like
a thin ioapic library, which I can call into as needed. The only
low-level difference is that the Xen ioapics need to be programmed via a
hypercall rather than register writes.
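For illustration, the hypercall path amounts to something like this (a
sketch; struct physdev_apic and PHYSDEVOP_apic_write come from the Xen
public headers, and the mp_ioapics[] lookup is assumed from the io_apic
code of this era):

static void xen_io_apic_write(unsigned int apic, unsigned int reg,
			      unsigned int value)
{
	struct physdev_apic apic_op;

	apic_op.apic_physbase = mp_ioapics[apic].mp_apicaddr;
	apic_op.reg = reg;
	apic_op.value = value;
	/* The hypervisor does the actual RTE write on our behalf. */
	WARN_ON(HYPERVISOR_physdev_op(PHYSDEVOP_apic_write, &apic_op));
}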
If the x86 interrupt layer in general decouples irqs from GSIs, then I
can probably make use of that to clean things up. A general irq
allocator along with some way of attaching interrupt-source-specific
information to each irq would get me a long way, I think. I'd still
need hooks to paravirtualize the actual ioapic writes, but at least I
wouldn't need to have quite so much delicate hooking.
Thanks,
J
Ingo Molnar wrote:
>> +#ifdef CONFIG_XEN_PCI
>> + irq = xen_register_gsi(gsi, triggering, polarity);
>> + if ((int)irq >= 0)
>> + return irq;
>> +#endif
>>
>
> why not change irq to 'int' and avoid the cast?
>
Yeah, OK.
> also, please eliminate the #ifdef by turning xen_register_gsi() into a
> 'return -1' inline on !CONFIG_XEN_PCI.
>
OK.
>> +#ifdef CONFIG_XEN_PCI
>> + xen_pci_init();
>> +#endif
>>
>
> hide the #ifdef in a header please. (like you already properly do for
> xen_setup_pirqs())
>
OK.
>> + if (rc != 0) {
>> if (!probing_irq(irq))
>> printk(KERN_INFO "Failed to obtain physical IRQ %d\n",
>> irq);
>> + dump_stack();
>>
>
> generally it's better to use WARN() or WARN_ONCE() to get good debug
> feedback and stackdumps. (they also document the reason for the dump)
>
That was really just a temp debug hack; it shouldn't be dumping stack
for probes anyway.
>> @@ -523,8 +526,6 @@
>> } else
>> irq = find_unbound_irq();
>>
>> - spin_unlock(&irq_mapping_update_lock);
>> -
>> set_irq_chip_and_handler_name(irq, &xen_pirq_chip,
>> handle_level_irq, "pirq");
>>
>
> hm, looks like a stray bugfix?
>
Erm, yep.
J
Simon Horman wrote:
> Hi,
>
> it seems that if CONFIG_XEN is set but CONFIG_XEN_DOM0 is not set,
> then the call to xen_init_apic() in xen_start_kernel() causes the
> build to fail.
>
> One possible solution to this is to provide a dummy
> version of xen_init_apic() in the !CONFIG_XEN_DOM0 case.
>
> Another possible solution would be to add #ifdef CONFIG_XEN_DOM0
> inside xen_start_kernel()
>
Ah, OK, will fix that up.
J
* Jeremy Fitzhardinge <[email protected]> wrote:
> Ingo Molnar wrote:
>> * Jeremy Fitzhardinge <[email protected]> wrote:
>>
>>
>>> Writes to the IO APIC are paravirtualized via hypercalls, so implement
>>> the appropriate operations.
>>>
>>> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
>>> ---
>>> arch/x86/xen/Makefile | 3 +-
>>> arch/x86/xen/apic.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
>>> arch/x86/xen/enlighten.c | 2 +
>>> arch/x86/xen/xen-ops.h | 2 +
>>> 4 files changed, 72 insertions(+), 1 deletion(-)
>>>
>>
>> hm, why is the ioapic used as the API here, and not an irqchip?
>>
>
> In essence, the purpose of the series is to break the 1:1
> relationship between Linux irqs and hardware GSIs. This allows me
> to have my own irq allocator, which in turn allows me to intermix
> "physical" irqs (ie, a Linux irq number bound to a real hardware
> interrupt source) with the various software/virtual irqs the Xen
> system needs.
>
> Once a physical irq has been mapped onto a gsi interrupt source, the
> mechanisms for handing the ioapic side of things are more or less
> the same. There's the same procedure of finding the ioapic/pin for
> a gsi and programming the appropriate vector.
>
> (Presumably once I implement MSI support, all references to "gsi"
> will become "gsi/msi/etc".)
>
> So, there's an awkward tradeoff. I could just completely duplicate
> the whole irq/vector/ioapic management code and hide it under my own
> irqchip, but it would end up duplicating a lot of the existing code.
> My alternative was to try to open out the existing code into
> something like a thin ioapic library, which I can call into as
> needed. The only low-level difference is that the Xen ioapics need
> to be programmed via a hypercall rather than register writes.
>
> If the x86 interrupt layer in general decouples irqs from GSIs, then
> I can probably make use of that to clean things up. A general irq
> allocator along with some way of attaching interrupt-source-specific
> information to each irq would get me a long way, I think. I'd still
> need hooks to paravirtualize the actual ioapic writes, but at least
> I wouldn't need to have quite so much delicate hooking.
it certainly looks thin enough to me although i'm really not sure we
want to virtualize at the IO-APIC level. Peter, what's your
opinion/preference?
Ingo
Ingo Molnar wrote:
>
> it certainly looks thin enough to me although i'm really not sure we
> want to virtualize at the IO-APIC level. Peter, what's your
> opinion/preference?
>
relative fixed mapping, always make it simple and tracked easily.
like irq nr == gsi nr...
YH
Ingo Molnar wrote:
> it certainly looks thin enough to me although i'm really not sure we
> want to virtualize at the IO-APIC level. Peter, what's your
> opinion/preference?
>
Given that Xen's requirements here are pretty Xen-specific (I don't
imagine that any other virtualization system would work in the same
way), I didn't bother trying to come up with a general "virtualization"
layer at this level - that's why I was pretty blunt about putting xen_*
calls in without any indirection. But the code could certainly be
restructured in a way which would make it simpler to hook Xen as a
side-effect, so long as it was achieving some other goal as well
(general cleanup, big iron architecture support, better msi handling,
whatever...).
J
Yinghai Lu wrote:
> Ingo Molnar wrote:
>
>
>> it certainly looks thin enough to me although i'm really not sure we
>> want to virtualize at the IO-APIC level. Peter, what's your
>> opinion/preference?
>>
>>
>
> relative fixed mapping, always make it simple and tracked easily.
> like irq nr == gsi nr...
That was a bit terse for me to understand your point. The code already
has irq==gsi. Are you proposing that it stay that way, or something else?
J
Jeremy Fitzhardinge wrote:
> Yinghai Lu wrote:
>> Ingo Molnar wrote:
>>
>>
>>> it certainly looks thin enough to me although i'm really not sure we
>>> want to virtualize at the IO-APIC level. Peter, what's your
>>> opinion/preference?
>>>
>>>
>>
>> relative fixed mapping, always make it simple and tracked easily.
>> like irq nr == gsi nr...
>
> That was a bit terse for me to understand your point. The code already
> has irq==gsi. Are you proposing that it stay that way
yes
YH
Jeremy Fitzhardinge <[email protected]> writes:
> Ingo Molnar wrote:
>> * Jeremy Fitzhardinge <[email protected]> wrote:
>>
>>
>>> Writes to the IO APIC are paravirtualized via hypercalls, so implement
>>> the appropriate operations.
>>>
>>> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
>>> ---
>>> arch/x86/xen/Makefile | 3 +-
>>> arch/x86/xen/apic.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
>>> arch/x86/xen/enlighten.c | 2 +
>>> arch/x86/xen/xen-ops.h | 2 +
>>> 4 files changed, 72 insertions(+), 1 deletion(-)
>>>
>>
>> hm, why is the ioapic used as the API here, and not an irqchip?
>>
>
> In essence, the purpose of the series is to break the 1:1 relationship between
> Linux irqs and hardware GSIs.
Bad idea (I think). We have a 1:1 relationship between the linux irq number and
the GSI because it makes the code dramatically simpler, and it took significant
work to get there. The concept of an intermediate mapping layer sounds nasty.
But I haven't yet read the patch.
> If the x86 interrupt layer in general decouples irqs from GSIs, then I can
> probably make use of that to clean things up. A general irq allocator along
> with some way of attaching interrupt-source-specific information to each irq
> would get me a long way, I think. I'd still need hooks to paravirtualize the
> actual ioapic writes, but at least I wouldn't need to have quite so much
> delicate hooking.
I know that is where I thought the sparse irq code was going.
Eric
Eric W. Biederman wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>
>> Ingo Molnar wrote:
>>
>>> * Jeremy Fitzhardinge <[email protected]> wrote:
>>>
>>>
>>>
>>>> Writes to the IO APIC are paravirtualized via hypercalls, so implement
>>>> the appropriate operations.
>>>>
>>>> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
>>>> ---
>>>> arch/x86/xen/Makefile | 3 +-
>>>> arch/x86/xen/apic.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++
>>>> arch/x86/xen/enlighten.c | 2 +
>>>> arch/x86/xen/xen-ops.h | 2 +
>>>> 4 files changed, 72 insertions(+), 1 deletion(-)
>>>>
>>>>
>>> hm, why is the ioapic used as the API here, and not an irqchip?
>>>
>>>
>> In essence, the purpose of the series is to break the 1:1 relationship between
>> Linux irqs and hardware GSIs.
>>
>
> Bad idea (I think). We have a 1:1 relationship between the linux irq number and
> the GSI because it makes the code dramatically simpler, and it took significant
> work to get there. The concept of an intermediate mapping layer sounds nasty.
> But I haven't yet read the patch.
>
The changes are spread over a number of patches, but the meat of it is
in "xen: route hardware irqs via Xen". It turns out fairly simply, but
perhaps it's because I've made a number of simplifying assumptions:
interrupts are always IOAPIC based, only using ACPI for routing, no MSI
support yet.
But it seems to me that the only time you really care that the irq isn't
a gsi is when programming a vector into the ioapics - you need to do an
irq -> ioapic/pin mapping anyway, so adding an irq -> gsi -> ioapic/pin
map isn't all that complex. And conversely, when probing devices you
need to map gsi->irq to see whether the interrupt is shared, though you
could do that on a pure gsi level anyway.
And of course the current code isn't purely irq == gsi anyway, since
msis are allocated irqs as well, and there's no underlying gsi. In a
sense you can think of the other Xen interrupt sources as being a bit
like MSI, at least in as much as they're not sourced from a GSI (but
they go further and are not sourced from an IOAPIC at all).
J
Jeremy Fitzhardinge <[email protected]> writes:
> The changes are spread over a number of patches, but the meat of it is in "xen:
> route hardware irqs via Xen". It turns out fairly simply, but perhaps it's
> because I've made a number of simplifying assumptions: interrupts are always
> IOAPIC based, only using ACPI for routing, no MSI support yet.
>
> But it seems to me that the only time you really care that the irq isn't a gsi
> is when programming a vector into the ioapics - you need to do an irq ->
> ioapic/pin mapping anyway, so adding an irq -> gsi -> ioapic/pin map isn't all
> that complex.
It is hideous. Been there and ripped out hundreds of lines of useless and problem
causing code to get here. It is especially bad when you do not identity map the first
16 gsi to linux irqs (the legacy isa irqs).
It is very useful to keep the linux irq number and the gsi irq number the same.
It facilitates debugging and keeps the code simple.
> And conversely, when probing devices you need to map gsi->irq to
> see whether the interrupt is shared, though you could do that on a pure gsi
> level anyway.
What I care about is that the GSI number == the linux irq number.
> And of course the current code isn't purely irq == gsi anyway, since msis are
> allocated irqs as well, and there's no underlying gsi.
Yep. But the numbers we use should be beyond the range of the gsi's so there
is no conflict. Think of it as an extension of how we identity map the low 16 linux
irqs.
> In a sense you can think
> of the other Xen interrupt sources as being a bit like MSI, at least in as much
> as they're not sourced from a GSI (but they go further and are not sourced from
> an IOAPIC at all).
MSI isn't sourced from an IOAPIC either.
The difference is that the xen sources are not delivered using vectors. The cpu
vector numbers we do hide and treat as an implementation detail. And I am totally
happy not going through the vector allocation path.
My gut feel says that you just want to use a different set of irq operations when
doing Xen native and working with hardware interrupts. I haven't seen the code so
I don't know how you interact there. Except in dom0 this is not a consideration so
I don't know how it is handled.
Eric
Eric W. Biederman wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>
>
>> The changes are spread over a number of patches, but the meat of it is in "xen:
>> route hardware irqs via Xen". It turns out fairly simply, but perhaps it's
>> because I've made a number of simplifying assumptions: interrupts are always
>> IOAPIC based, only using ACPI for routing, no MSI support yet.
>>
>> But it seems to me that the only time you really care that the irq isn't a gsi
>> is when programming a vector into the ioapics - you need to do an irq ->
>> ioapic/pin mapping anyway, so adding an irq -> gsi -> ioapic/pin map isn't all
>> that complex.
>>
>
> It is hideous. Been there and ripped out hundreds of lines of useless and problem
> causing code to get here. It is especially bad when you do not identity map the first
> 16 gsi to linux irqs (the legacy isa irqs).
>
Yes. I made that concession too, and just reserved them as identity
mapped legacy irqs.
> Yep. But the numbers we use should be beyond the range of the gsi's so there
> is no conflict. Think of it as an extension of how we identity map the low 16 linux
> irqs.
>
Yes, I suppose we can statically partition the irq space. In fact the
original 2.6.18-xen dom0 kernel does precisely that, but runs into
limitations because of the compile-time limit on NR_IRQS in that
kernel. If we move to a purely dynamically allocated irq space, then
having a sparse allocation if irqs becomes reasonable again, for msis
and vectorless Xen interrupts.
>> In a sense you can think
>> of the other Xen interrupt sources as being a bit like MSI, at least in as much
>> as they're not sourced from a GSI (but they go further and are not sourced from
>> an IOAPIC at all).
>>
>
> MSI isn't sourced from an IOAPIC either.
>
Right.
> The difference is that the xen sources are not delivered using vectors. The cpu
> vector numbers we do hide and treat as an implementation detail. And I am totally
> happy not going through the vector allocation path.
>
Right. And in the physical irq event channel case, the vector space is
managed by Xen, so we need to use Xen to allocate the vector, then
program that into the appropriate place in the ioapic.
> My gut feel says that you just want to use a different set of irq operations when
> doing Xen native and working with hardware interrupts. I haven't seen the code so
> I don't know how you interact there. Except in dom0 this is not a consideration so
> I don't know how it is handled.
>
Yeah. In the domU case, where there's no physical interrupts, the Xen
code completely avoids the ioapic/vector stuff, and directly converts an
event channel into an irq. Indeed, physical irq delivery is handled the
same way; it's just that the setup requires touching the ioapics to
program the appropriate vector and bind it to an event channel.
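(Sketch of that setup sequence, using only hypercalls already visible
in the pirq patch earlier in the thread:)

	struct physdev_irq irq_op = { .irq = irq };
	struct evtchn_bind_pirq bind_pirq = {
		.pirq = irq,
		.flags = BIND_PIRQ__WILL_SHARE,
	};

	/* 1. Ask Xen to allocate a vector for this pirq... */
	HYPERVISOR_physdev_op(PHYSDEVOP_alloc_irq_vector, &irq_op);
	/* 2. ...program irq_op.vector into the ioapic RTE via the
	 *    paravirtualized io_apic ops... */
	/* 3. ...and bind the pirq to an event channel, which is what
	 *    actually fires when the hardware interrupt arrives. */
	HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind_pirq);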
J
Ingo Molnar wrote:
>
> it certainly looks thin enough to me although i'm really not sure we
> want to virtualize at the IO-APIC level. Peter, what's your
> opinion/preference?
>
Not having studied the Xen code in detail, but my assumption would be
that we should allocate this at the IRQ chip level rather than violating
the IO-APIC code. I also (as previously discussed) really want to see a
dynamic allocator for IRQ numbers like PowerPC already has.
-hpa
Jeremy Fitzhardinge <[email protected]> writes:
>
> Yes, I suppose we can statically partition the irq space. In fact the original
> 2.6.18-xen dom0 kernel does precisely that, but runs into limitations because of
> the compile-time limit on NR_IRQS in that kernel. If we move to a purely
> dynamically allocated irq space, then having a sparse allocation if irqs becomes
> reasonable again, for msis and vectorless Xen interrupts.
>
>> The difference is that the xen sources are not delivered using vectors. The cpu
>> vector numbers we do hide and treat as an implementation detail. And I am totally
>> happy not going through the vector allocation path.
>>
>
> Right. And in the physical irq event channel case, the vector space is managed
> by Xen, so we need to use Xen to allocate the vector, then program that into the
> appropriate place in the ioapic.
We should be able to share code with iommu for irq handling; at first glance you
are describing a pretty similar problem. Now I don't think the interrupt
remapping code is any kind of beauty but that seems to be roughly what you
are doing with Xen domU. I certainly think with some careful factoring
we can share the ioapic munging code. And the code to pick how we program
the ioapics.
>> My gut feel says that you just want to use a different set of irq operations when
>> doing Xen native and working with hardware interrupts. I haven't seen the code so
>> I don't know how you interact there. Except in dom0 this is not a consideration so
>> I don't know how it is handled.
>>
>
> Yeah. In the domU case, where there's no physical interrupts, the Xen code
> completely avoids the ioapic/vector stuff, and directly converts an event
> channel into an irq. Indeed, physical irq delivery is handled the same way; it's
> just that the setup requires touching the ioapics to program the appropriate
> vector and bind it to an event channel.
Reasonable. A lot like the intel interrupt remapping code in that respect. The
message we program in has little to do with the vector the interrupt arrives on.
So I don't quite know where to hook it but if we are careful we should be able
to get a good interrupt mapping.
Eric
On Thu, Nov 13, 2008 at 11:10:34AM -0800, Jeremy Fitzhardinge wrote:
> This patch puts the hooks into place so that when the interrupt
> subsystem registers an irq, it gets routed via Xen (if we're running
> under Xen).
>
> The first step is to get a gsi for a particular device+pin. We use
> the normal acpi interrupt routing to do the mapping.
>
> Normally the gsi number is used directly as the irq number. We can't
> do that since we also have irqs for non-hardware event channels, and
> so we must share the irq space between them. A given gsi is only
> allocated a single irq, so re-registering a gsi will simply return the
> same irq.
>
> We therefore allocate an irq for a given gsi, and return that. As a
> special case, we reserve the first 16 irqs for identity-mapping legacy
> irqs, since there's a fair amount of code which assumes that.
>
> Having allocated an irq, we ask Xen to allocate a vector, and then
> bind that pirq/vector to an event channel. When the hardware raises
> an interrupt on a vector, Xen signals us on the corresponding event
> channel, which gets routed to the irq and delivered to the appropriate
> device driver.
>
> This patch does everything except set up the IO APIC pin routing to
> the vector.
>
> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
> ---
> arch/x86/kernel/acpi/boot.c | 8 +++
> arch/x86/pci/legacy.c | 4 +
> arch/x86/xen/Makefile | 1
> arch/x86/xen/pci.c | 98 +++++++++++++++++++++++++++++++++++++++++++
> arch/x86/xen/xen-ops.h | 1
> drivers/xen/events.c | 9 ++-
> include/asm-x86/xen/pci.h | 7 +++
> include/xen/events.h | 8 +++
> 8 files changed, 132 insertions(+), 4 deletions(-)
>
[snip]
> diff --git a/arch/x86/xen/xen-ops.h b/arch/x86/xen/xen-ops.h
> --- a/arch/x86/xen/xen-ops.h
> +++ b/arch/x86/xen/xen-ops.h
> @@ -63,7 +63,6 @@
> static inline void xen_smp_init(void) {}
> #endif
>
> -
> void xen_init_apic(void);
>
> /* Declare an asm function, along with symbols needed to make it
Hi Jeremy,
This seems like a spurious whitespace change that could be
merged into "[PATCH 30 of 38] xen: implement io_apic_ops"
[snip]
--
Simon Horman
VA Linux Systems Japan K.K., Sydney, Australia Satellite Office
H: http://www.vergenet.net/~horms/ W: http://www.valinux.co.jp/en
On Wed, 2008-11-19 at 11:19 +0900, FUJITA Tomonori wrote:
>
> The problem that I talked about in the previous mail:
>
> > max_slots = mask + 1
> > ? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
> > : 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
>
> Since the popular value of the mask is 0xffffffff. So the above code
> (mask + 1 ?) works wrongly if the size of mask is 32bit (well,
> accidentally the result of max_slots is identical though).
I've just been looking at this again and I don't think it is an accident
that this evaluates to the correct value when mask + 1 == 0.
The patch which adds the "mask + 1 ? ... : 1UL << ..." stuff is:
commit b15a3891c916f32a29832886a053a48be2741d4d
Author: Jan Beulich <[email protected]>
Date: Thu Mar 13 09:13:30 2008 +0000
avoid endless loops in lib/swiotlb.c
Commit 681cc5cd3efbeafca6386114070e0bfb5012e249 ("iommu sg merging:
swiotlb: respect the segment boundary limits") introduced two
possibilities for entering an endless loop in lib/swiotlb.c:
- if max_slots is zero (possible if mask is ~0UL)
[...]
I think the existing code is the nicest way to handle this corner case
and it is necessary anyway to handle the ~0UL case on 64 bit.
Ian.
On Fri, 21 Nov 2008 14:21:32 +0000
Ian Campbell <[email protected]> wrote:
> On Wed, 2008-11-19 at 11:19 +0900, FUJITA Tomonori wrote:
> >
> > The problem that I talked about in the previous mail:
> >
> > > max_slots = mask + 1
> > > ? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
> > > : 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
> >
> > Since the popular value of the mask is 0xffffffff. So the above code
> > (mask + 1 ?) works wrongly if the size of mask is 32bit (well,
> > accidentally the result of max_slots is identical though).
>
> I've just been looking at this again and I don't think it is an accident
> that this evaluates to the correct value when mask + 1 == 0.
>
> The patch which adds the "mask + 1 ? ... : 1UL << ..." stuff is:
>
> commit b15a3891c916f32a29832886a053a48be2741d4d
> Author: Jan Beulich <[email protected]>
> Date: Thu Mar 13 09:13:30 2008 +0000
>
> avoid endless loops in lib/swiotlb.c
>
> Commit 681cc5cd3efbeafca6386114070e0bfb5012e249 ("iommu sg merging:
> swiotlb: respect the segment boundary limits") introduced two
> possibilities for entering an endless loop in lib/swiotlb.c:
>
> - if max_slots is zero (possible if mask is ~0UL)
> [...]
>
> I think the existing code is the nicest way to handle this corner case
> and it is necessary anyway to handle the ~0UL case on 64 bit.
Ah, I vaguely remember this patch. The ~0ULL mask didn't happen here
(nobody uses it) so the possibility was false. IMHO, if we use this
code on 32bit architectures, the mask should be u64 and the overflow
should be handled explicitly. But as you pointed out, it looks like
this patch takes account of the overflow.
Thanks,
On Sat, 2008-11-22 at 10:49 +0900, FUJITA Tomonori wrote:
> On Fri, 21 Nov 2008 14:21:32 +0000
> Ian Campbell <[email protected]> wrote:
>
> > On Wed, 2008-11-19 at 11:19 +0900, FUJITA Tomonori wrote:
> > >
> > > The problem that I talked about in the previous mail:
> > >
> > > > max_slots = mask + 1
> > > > ? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
> > > > : 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
> > >
> > > Since the popular value of the mask is 0xffffffff. So the above code
> > > (mask + 1 ?) works wrongly if the size of mask is 32bit (well,
> > > accidentally the result of max_slots is identical though).
> >
> > I've just been looking at this again and I don't think it is an accident
> > that this evaluates to the correct value when mask + 1 == 0.
> >
> > The patch which adds the "mask + 1 ? ... : 1UL << ..." stuff is:
> >
> > commit b15a3891c916f32a29832886a053a48be2741d4d
> > Author: Jan Beulich <[email protected]>
> > Date: Thu Mar 13 09:13:30 2008 +0000
> >
> > avoid endless loops in lib/swiotlb.c
> >
> > Commit 681cc5cd3efbeafca6386114070e0bfb5012e249 ("iommu sg merging:
> > swiotlb: respect the segment boundary limits") introduced two
> > possibilities for entering an endless loop in lib/swiotlb.c:
> >
> > - if max_slots is zero (possible if mask is ~0UL)
> > [...]
> >
> > I think the existing code is the nicest way to handle this corner case
> > and it is necessary anyway to handle the ~0UL case on 64 bit.
>
> Ah, I vaguely remember this patch. The ~0ULL mask didn't happen here
> (nobody uses it) so the possibility was false. IMHO, if we use this
> code on 32bit architectures, the mask should be u64 and the overflow
> should be handled explicitly. But as you pointed out, it looks like
> this patch takes account of the overflow.
Something like this?
Ian.
---
swiotlb: explicitly handle segment boundary mask overflow.
When swiotlb is used on 32 bit we can overflow mask + 1 in the common
case where mask is 0xffffffffUL. This overflow was previously caught
by the case which attempts to handle a mask of ~0UL on 64 bit.
Signed-off-by: Ian Campbell <[email protected]>
diff -r 5fa30e5284dd lib/swiotlb.c
--- a/lib/swiotlb.c Mon Nov 24 09:39:50 2008 +0000
+++ b/lib/swiotlb.c Mon Nov 24 11:37:39 2008 +0000
@@ -303,7 +303,7 @@
unsigned int nslots, stride, index, wrap;
int i;
unsigned long start_dma_addr;
- unsigned long mask;
+ u64 mask;
unsigned long offset_slots;
unsigned long max_slots;
@@ -314,6 +314,7 @@
max_slots = mask + 1
? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
: 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
+ BUG_ON(max_slots > 1UL << (BITS_PER_LONG - IO_TLB_SHIFT));
/*
* For mappings greater than a page, we limit the stride (and
Eric W. Biederman wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>
>> Yes, I suppose we can statically partition the irq space. In fact the original
>> 2.6.18-xen dom0 kernel does precisely that, but runs into limitations because of
>> the compile-time limit on NR_IRQS in that kernel. If we move to a purely
>> dynamically allocated irq space, then having a sparse allocation of irqs becomes
>> reasonable again, for msis and vectorless Xen interrupts.
>>
>>
>>> The difference is that the xen sources are not delivered using vectors. The cpu
>>> vector numbers we do hide and treat as an implementation detail. And I am totally
>>> happy not going through the vector allocation path.
>>>
>>>
>> Right. And in the physical irq event channel case, the vector space is managed
>> by Xen, so we need to use Xen to allocate the vector, then program that into the
>> appropriate place in the ioapic.
>>
>
> We should be able to share code with the iommu for irq handling; at first
> glance you are describing a pretty similar problem. Now, I don't think the
> interrupt remapping code is any kind of beauty, but that seems to be roughly
> what you are doing with Xen domU. I certainly think that with some careful
> factoring we can share the ioapic munging code, and the code to pick how we
> program the ioapics.
>
Notwithstanding the possibility that there'll be general changes to x86
interrupt handling in the future, do you have any objection to my patches
as they stand? Ingo would like to see your and/or hpa's ack before
accepting them.
Should I repost them?
Thanks,
J
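As a purely illustrative aside, the sparse, dynamically allocated irq
space Jeremy describes could be as simple as a used-flag array; none of
the names or sizes below come from the patches, they are hypothetical:

#define XEN_MAX_IRQS	1024	/* hypothetical upper bound */

static unsigned char xen_irq_used[XEN_MAX_IRQS];

/* Hand out the lowest free irq, or -1 when the space is exhausted. */
static int xen_alloc_irq_sketch(void)
{
	int irq;

	for (irq = 0; irq < XEN_MAX_IRQS; irq++) {
		if (!xen_irq_used[irq]) {
			xen_irq_used[irq] = 1;
			return irq;
		}
	}
	return -1;
}

/* Freeing leaves holes behind, which is what keeps the space sparse. */
static void xen_free_irq_sketch(int irq)
{
	xen_irq_used[irq] = 0;
}

The Xen-specific bookkeeping (asking Xen for a vector, binding the
event channel) would sit on top of whichever allocation discipline is
chosen.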
Jeremy Fitzhardinge <[email protected]> writes:
> Notwithstanding the possibility that there'll be general changes to x86
> interrupt handling in the future, do you have any objection to my patches as they
> stand? Ingo would like to see your and/or hpa's ack before accepting them.
>
> Should I repost them?
I couldn't track down the first set. So if you could repost them with me on
the cc, I can review them and tell you if I have specific objections.
My apologies for the trouble.
Eric
On Mon, 24 Nov 2008 11:41:37 +0000
Ian Campbell <[email protected]> wrote:
> On Sat, 2008-11-22 at 10:49 +0900, FUJITA Tomonori wrote:
> > On Fri, 21 Nov 2008 14:21:32 +0000
> > Ian Campbell <[email protected]> wrote:
> >
> > > On Wed, 2008-11-19 at 11:19 +0900, FUJITA Tomonori wrote:
> > > >
> > > > The problem that I talked about in the previous mail:
> > > >
> > > > > max_slots = mask + 1
> > > > > ? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
> > > > > : 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
> > > >
> > > > Since the popular value of the mask is 0xffffffff. So the above code
> > > > (mask + 1 ?) works wrongly if the size of mask is 32bit (well,
> > > > accidentally the result of max_slots is identical though).
> > >
> > > I've just been looking at this again and I don't think it is an accident
> > > that this evaluates to the correct value when mask + 1 == 0.
> > >
> > > The patch which adds the "mask + 1 ? ... : 1UL << ..." stuff is:
> > >
> > > commit b15a3891c916f32a29832886a053a48be2741d4d
> > > Author: Jan Beulich <[email protected]>
> > > Date: Thu Mar 13 09:13:30 2008 +0000
> > >
> > > avoid endless loops in lib/swiotlb.c
> > >
> > > Commit 681cc5cd3efbeafca6386114070e0bfb5012e249 ("iommu sg merging:
> > > swiotlb: respect the segment boundary limits") introduced two
> > > possibilities for entering an endless loop in lib/swiotlb.c:
> > >
> > > - if max_slots is zero (possible if mask is ~0UL)
> > > [...]
> > >
> > > I think the existing code is the nicest way to handle this corner case
> > > and it is necessary anyway to handle the ~0UL case on 64 bit.
> >
> > Ah, I vaguely remember this patch. The ~0ULL mask never actually
> > occurred here (nobody uses it), so that possibility was theoretical.
> > IMHO, if we use this code on 32-bit architectures, the mask should be
> > u64 and the overflow should be handled explicitly. But as you pointed
> > out, it looks like this patch does take the overflow into account.
>
> Something like this?
>
> Ian.
> ---
>
> swiotlb: explicitly handle segment boundary mask overflow.
>
> When swiotlb is used on 32 bit we can overflow mask + 1 in the common
> case where mask is 0xffffffffUL. This overflow was previously caught
> by the case which attempts to handle a mask of ~0UL on 64 bit.
>
> Signed-off-by: Ian Campbell <[email protected]>
>
> diff -r 5fa30e5284dd lib/swiotlb.c
> --- a/lib/swiotlb.c Mon Nov 24 09:39:50 2008 +0000
> +++ b/lib/swiotlb.c Mon Nov 24 11:37:39 2008 +0000
> @@ -303,7 +303,7 @@
> unsigned int nslots, stride, index, wrap;
> int i;
> unsigned long start_dma_addr;
> - unsigned long mask;
> + u64 mask;
> unsigned long offset_slots;
> unsigned long max_slots;
>
> @@ -314,6 +314,7 @@
> max_slots = mask + 1
> ? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
> : 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);
> + BUG_ON(max_slots > 1UL << (BITS_PER_LONG - IO_TLB_SHIFT));
How can this BUG_ON happen? Using u64 for the mask is fine though.
Thanks,
On Wed, 2008-11-26 at 11:53 +0900, FUJITA Tomonori wrote:
>
> > + BUG_ON(max_slots > 1UL << (BITS_PER_LONG - IO_TLB_SHIFT));
>
> How can this BUG_ON happen? Using u64 for the mask is fine though.
It covers the cases where the previous code would have overflowed. It
can't happen right now because, although mask is now 64 bits, the value
assigned to it is still only an unsigned long. If someone changes
the type of that field then we would start seeing unexpected values.
Ian.
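To make that hypothetical concrete: if the value assigned to mask ever
carried more than 32 significant bits on a 32-bit build, the first
branch would compute a max_slots larger than the slot arithmetic can
index, and the BUG_ON would catch it. A sketch under the same
assumptions as before (IO_TLB_SHIFT == 11, 32-bit build):

#include <stdio.h>
#include <stdint.h>

#define IO_TLB_SHIFT	11
#define ALIGN_U64(x, a)	(((x) + (a) - 1) & ~((uint64_t)(a) - 1))

int main(void)
{
	/* Hypothetical: a boundary mask wider than 32 bits showing up
	 * once mask itself is a u64. */
	uint64_t mask = 0x1ffffffffULL;

	uint64_t max_slots = (mask + 1)	/* no longer wraps */
		? ALIGN_U64(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
		: 1u << (32 - IO_TLB_SHIFT);

	/* max_slots is 0x400000, above the 0x200000 limit, so the
	 * patch's BUG_ON would fire here. */
	printf("max_slots = %#llx, limit = %#x\n",
	       (unsigned long long)max_slots, 1u << (32 - IO_TLB_SHIFT));
	return 0;
}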
On Wed, 26 Nov 2008 09:36:49 +0000
Ian Campbell <[email protected]> wrote:
> On Wed, 2008-11-26 at 11:53 +0900, FUJITA Tomonori wrote:
> >
> > > + BUG_ON(max_slots > 1UL << (BITS_PER_LONG - IO_TLB_SHIFT));
> >
> > How can this BUG_ON happen? Using u64 for the mask is fine though.
>
> It covers the cases where the previous code would have overflowed. It
> can't happen right now because, although mask is now 64 bits, the value
> assigned to it is still only an unsigned long. If someone changes
> the type of that field then we would start seeing unexpected values.
If someone changes dma_get_seg_boundary to return a u64 value instead
of unsigned long, this BUG_ON could fire on 32-bit architectures. But
you don't need to trigger a BUG_ON for that case; a max_slots larger
than 1UL << (BITS_PER_LONG - IO_TLB_SHIFT) is still fine for
iommu_is_span_boundary().
Anyway, this is minor, but wouldn't it be nice to make sure that anyone
can easily understand the code without digging into the git log? Either by:

a) dropping this patch and adding some comments on how the code works
(especially about the overflow on 32-bit architectures), or

b) removing the BUG_ON in this patch and adding some comments.
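For reference, the reason an over-large max_slots stays harmless is
that the boundary check only ever compares a candidate span against it.
The function below is illustrative only, showing the shape of such a
test; it is not a verbatim copy of iommu_is_span_boundary() from
lib/iommu-helper.c:

/* Does [index, index + nr) cross a boundary_size boundary?
 * boundary_size is a power of two here, so masking works as modulo.
 * A larger boundary_size can only make crossings rarer; it cannot
 * cause a loop or an out-of-range access by itself. */
static int span_crosses_boundary_sketch(unsigned int index, unsigned int nr,
					unsigned long offset_slots,
					unsigned long boundary_size)
{
	unsigned long start = (offset_slots + index) & (boundary_size - 1);

	return start + nr > boundary_size;
}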
On Thu, 2008-11-27 at 12:43 +0900, FUJITA Tomonori wrote:
> On Wed, 26 Nov 2008 09:36:49 +0000
> Ian Campbell <[email protected]> wrote:
>
> > On Wed, 2008-11-26 at 11:53 +0900, FUJITA Tomonori wrote:
> > >
> > > > + BUG_ON(max_slots > 1UL << (BITS_PER_LONG - IO_TLB_SHIFT));
> > >
> > > How can this BUG_ON happen? Using u64 for the mask is fine though.
> >
> > It covers the cases where the previous code would have overflowed. It
> > can't happen right now because, although mask is now 64 bits, the value
> > assigned to it is still only an unsigned long. If someone changes
> > the type of that field then we would start seeing unexpected values.
>
> If someone changes dma_get_seg_boundary to return a u64 value instead
> of unsigned long, this BUG_ON could fire on 32-bit architectures. But
> you don't need to trigger a BUG_ON for that case; a max_slots larger
> than 1UL << (BITS_PER_LONG - IO_TLB_SHIFT) is still fine for
> iommu_is_span_boundary().
>
> Anyway, this is minor, but wouldn't it be nice to make sure that anyone
> can easily understand the code without digging into the git log? Either by:
>
> a) dropping this patch and adding some comments on how the code works
> (especially about the overflow on 32-bit architectures), or
>
> b) removing the BUG_ON in this patch and adding some comments.
Yes, I think adding a comment to the existing code (option a) would be
best. I actually have a small queue of other fixes which make swiotlb
work properly for x86 PAE and HighMem but they are not particularly well
baked at the moment. I'll include a patch to add a comment in that
series.
Ian.
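For what it's worth, a sketch of the kind of comment option (a) calls
for (this is not Ian's eventual patch), placed above the existing
expression:

/*
 * mask + 1 can wrap to 0: on 32 bit when mask is the common
 * 0xffffffffUL, and on 64 bit when mask is ~0UL.  In both cases the
 * fallback of 1UL << (BITS_PER_LONG - IO_TLB_SHIFT) is exactly the
 * value the first branch would have produced without the overflow,
 * so the ternary is deliberate, not just a safety net.
 */
max_slots = mask + 1
	? ALIGN(mask + 1, 1 << IO_TLB_SHIFT) >> IO_TLB_SHIFT
	: 1UL << (BITS_PER_LONG - IO_TLB_SHIFT);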
On Thu, 27 Nov 2008 17:14:47 +0000
Ian Campbell <[email protected]> wrote:
> On Thu, 2008-11-27 at 12:43 +0900, FUJITA Tomonori wrote:
> > On Wed, 26 Nov 2008 09:36:49 +0000
> > Ian Campbell <[email protected]> wrote:
> >
> > > On Wed, 2008-11-26 at 11:53 +0900, FUJITA Tomonori wrote:
> > > >
> > > > > + BUG_ON(max_slots > 1UL << (BITS_PER_LONG - IO_TLB_SHIFT));
> > > >
> > > > How can this BUG_ON happen? Using u64 for the mask is fine though.
> > >
> > > It covers the cases where the previous code would have overflowed. It
> > > can't happen right now because, although mask is now 64 bits, the value
> > > assigned to it is still only an unsigned long. If someone changes
> > > the type of that field then we would start seeing unexpected values.
> >
> > If someone changes dma_get_seg_boundary to return a u64 value instead
> > of unsigned long, this BUG_ON could fire on 32-bit architectures. But
> > you don't need to trigger a BUG_ON for that case; a max_slots larger
> > than 1UL << (BITS_PER_LONG - IO_TLB_SHIFT) is still fine for
> > iommu_is_span_boundary().
> >
> > Anyway, this is minor, but wouldn't it be nice to make sure that anyone
> > can easily understand the code without digging into the git log? Either by:
> >
> > a) dropping this patch and adding some comments on how the code works
> > (especially about the overflow on 32-bit architectures), or
> >
> > b) removing the BUG_ON in this patch and adding some comments.
>
> Yes, I think adding a comment to the existing code (option a) would be
> best.
Yeah, that should be fine.
> I actually have a small queue of other fixes which make swiotlb
> work properly for x86 PAE and HighMem but they are not particularly well
> baked at the moment. I'll include a patch to add a comment in that
> series.
I see.
Thanks,