LinuxLists.cc - Xen VMM patch set

2004-11-19 23:19:54

Subject: Xen VMM patch set - take 2

OK folks, this is my second attempt to post the arch xen patches,
this time against 2.6.10-rc2.

We need 6 core patches and a bug fix:

1. add ptep_establish_new to make va available
2. return code for arch_free_page
3. runtime disable of VT console
4. /dev/mem io_remap_page_range for CONFIG_XEN
5. split free_irq into teardown_irq
6. alloc_skb_from_cache

Bug fix:
7. handle fragemented skbs correctly in icmp_filter

The actual new architecture, arch xen, is too big to post to the list,
so here's a link:
http://www.cl.cam.ac.uk/netos/xen/downloads/arch-xen.patch

Likewise for the virtual block, network, and console drivers:
http://www.cl.cam.ac.uk/netos/xen/downloads/drivers-xen.patch

Applying the above 9 patches should give you everything you need to
build full-featured arch xen kernels.

Arch xen will be maintained by myself, Keir Fraser, Christian Limpach
and Steve hand.

Cheers,
Ian

2004-11-19 23:26:41

by Ian Pratt

[permalink] [raw]

Subject: [5/7] Xen VMM patch set : split free_irq into teardown_irq

This patch moves the `unregister the irqaction' part of free_irq into
a new function teardown_irq, leaving only the mapping from dev_id to
irqaction and freeing the irqaction in free_irq. free_irq
calls teardown_irq to unregister the irqaction. This is similar
to how setup_irq and request_irq work for registering irq's.
We need teardown_irq to allow us to unregister irq's which were
registered early during boot when memory management wasn't ready
yet, i.e. irq's which were registered using setup_irq and use a static
irqaction which cannot be kfree'd.

Signed-off-by: [email protected]

---
diff -Nurp pristine-linux-2.6.10-rc2/include/linux/irq.h tmp-linux-2.6.10-rc2-xen.patch/include/linux/irq.h
--- pristine-linux-2.6.10-rc2/include/linux/irq.h 2004-11-15 01:28:57.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/include/linux/irq.h 2004-11-19 14:00:39.000000000 +0000
@@ -73,6 +73,7 @@ extern irq_desc_t irq_desc [NR_IRQS];
#include <asm/hw_irq.h> /* the arch dependent stuff */

extern int setup_irq(unsigned int irq, struct irqaction * new);
+extern int teardown_irq(unsigned int irq, struct irqaction * old);

#ifdef CONFIG_GENERIC_HARDIRQS
extern cpumask_t irq_affinity[NR_IRQS];
diff -Nurp pristine-linux-2.6.10-rc2/kernel/irq/manage.c tmp-linux-2.6.10-rc2-xen.patch/kernel/irq/manage.c
--- pristine-linux-2.6.10-rc2/kernel/irq/manage.c 2004-11-15 01:27:20.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/kernel/irq/manage.c 2004-11-19 14:00:30.000000000 +0000
@@ -215,28 +215,18 @@ int setup_irq(unsigned int irq, struct i
return 0;
}

-/**
- * free_irq - free an interrupt
- * @irq: Interrupt line to free
- * @dev_id: Device identity to free
- *
- * Remove an interrupt handler. The handler is removed and if the
- * interrupt line is no longer in use by any driver it is disabled.
- * On a shared IRQ the caller must ensure the interrupt is disabled
- * on the card it drives before calling this function. The function
- * does not return until any executing interrupts for this IRQ
- * have completed.
- *
- * This function must not be called from interrupt context.
+/*
+ * Internal function to unregister an irqaction - typically used to
+ * deallocate special interrupts that are part of the architecture.
*/
-void free_irq(unsigned int irq, void *dev_id)
+int teardown_irq(unsigned int irq, struct irqaction * old)
{
struct irq_desc *desc;
struct irqaction **p;
unsigned long flags;

if (irq >= NR_IRQS)
- return;
+ return -ENOENT;

desc = irq_desc + irq;
spin_lock_irqsave(&desc->lock,flags);
@@ -248,7 +238,7 @@ void free_irq(unsigned int irq, void *de
struct irqaction **pp = p;

p = &action->next;
- if (action->dev_id != dev_id)
+ if (action != old)
continue;

/* Found it - now remove it from the list of entries */
@@ -265,13 +255,52 @@ void free_irq(unsigned int irq, void *de

/* Make sure it's not being used on another CPU */
synchronize_irq(irq);
- kfree(action);
- return;
+ return 0;
}
- printk(KERN_ERR "Trying to free free IRQ%d\n",irq);
+ printk(KERN_ERR "Trying to teardown free IRQ%d\n",irq);
spin_unlock_irqrestore(&desc->lock,flags);
+ return -ENOENT;
+ }
+}
+
+/**
+ * free_irq - free an interrupt
+ * @irq: Interrupt line to free
+ * @dev_id: Device identity to free
+ *
+ * Remove an interrupt handler. The handler is removed and if the
+ * interrupt line is no longer in use by any driver it is disabled.
+ * On a shared IRQ the caller must ensure the interrupt is disabled
+ * on the card it drives before calling this function. The function
+ * does not return until any executing interrupts for this IRQ
+ * have completed.
+ *
+ * This function must not be called from interrupt context.
+ */
+void free_irq(unsigned int irq, void *dev_id)
+{
+ struct irq_desc *desc;
+ struct irqaction *action;
+ unsigned long flags;
+
+ if (irq >= NR_IRQS)
+ return;
+
+ desc = irq_desc + irq;
+ spin_lock_irqsave(&desc->lock,flags);
+ for (action = desc->action; action != NULL; action = action->next) {
+ if (action->dev_id != dev_id)
+ continue;
+
+ spin_unlock_irqrestore(&desc->lock,flags);
+
+ if (teardown_irq(irq, action) == 0)
+ kfree(action);
return;
}
+ printk(KERN_ERR "Trying to free free IRQ%d\n",irq);
+ spin_unlock_irqrestore(&desc->lock,flags);
+ return;
}

EXPORT_SYMBOL(free_irq);

2004-11-19 23:30:18

by Ian Pratt

[permalink] [raw]

Subject: [6/7] Xen VMM patch set : add alloc_skb_from_cache

This patch adds a new alloc_skb_from_cache function. This serves two
purposes: firstly, we like to allocate skb's with page-sized data
fragements as this means we can do zero-copy transfer of network
buffers between guest operating systems. Secondly, it enables us to
have a cache of pages that have been used for network buffers that we
can be more lax about scrubbing when they change VM ownership (since
they could be sniffed on the wire).

Signed-off-by: [email protected]

---
diff -Nurp pristine-linux-2.6.10-rc2/net/core/skbuff.c tmp-linux-2.6.10-rc2-xen.patch/net/core/skbuff.c
--- pristine-linux-2.6.10-rc2/net/core/skbuff.c 2004-11-15 01:28:14.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/net/core/skbuff.c 2004-11-19 16:42:55.000000000 +0000
@@ -163,6 +163,59 @@ nodata:
goto out;
}

+/**
+ * alloc_skb_from_cache - allocate a network buffer
+ * @cp: kmem_cache from which to allocate the data area
+ * (object size must be big enough for @size bytes + skb overheads)
+ * @size: size to allocate
+ * @gfp_mask: allocation mask
+ *
+ * Allocate a new &sk_buff. The returned buffer has no headroom and a
+ * tail room of size bytes. The object has a reference count of one.
+ * The return is the buffer. On a failure the return is %NULL.
+ *
+ * Buffers may only be allocated from interrupts using a @gfp_mask of
+ * %GFP_ATOMIC.
+ */
+struct sk_buff *alloc_skb_from_cache(kmem_cache_t *cp,
+ unsigned int size, int gfp_mask)
+{
+ struct sk_buff *skb;
+ u8 *data;
+
+ /* Get the HEAD */
+ skb = kmem_cache_alloc(skbuff_head_cache,
+ gfp_mask & ~__GFP_DMA);
+ if (!skb)
+ goto out;
+
+ /* Get the DATA. */
+ size = SKB_DATA_ALIGN(size);
+ data = kmem_cache_alloc(cp, gfp_mask);
+ if (!data)
+ goto nodata;
+
+ memset(skb, 0, offsetof(struct sk_buff, truesize));
+ skb->truesize = size + sizeof(struct sk_buff);
+ atomic_set(&skb->users, 1);
+ skb->head = data;
+ skb->data = data;
+ skb->tail = data;
+ skb->end = data + size;
+
+ atomic_set(&(skb_shinfo(skb)->dataref), 1);
+ skb_shinfo(skb)->nr_frags = 0;
+ skb_shinfo(skb)->tso_size = 0;
+ skb_shinfo(skb)->tso_segs = 0;
+ skb_shinfo(skb)->frag_list = NULL;
+out:
+ return skb;
+nodata:
+ kmem_cache_free(skbuff_head_cache, skb);
+ skb = NULL;
+ goto out;
+}
+

static void skb_drop_fraglist(struct sk_buff *skb)
{
diff -Nurp pristine-linux-2.6.10-rc2/include/linux/skbuff.h tmp-linux-2.6.10-rc2-xen.patch/include/linux/skbuff.h
--- pristine-linux-2.6.10-rc2/include/linux/skbuff.h 2004-11-15 01:29:01.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/include/linux/skbuff.h 2004-11-18 19:58:13.000000000 +0000
@@ -292,6 +292,8 @@ struct sk_buff {

extern void __kfree_skb(struct sk_buff *skb);
extern struct sk_buff *alloc_skb(unsigned int size, int priority);
+extern struct sk_buff *alloc_skb_from_cache(kmem_cache_t *cp,
+ unsigned int size, int priority);
extern void kfree_skbmem(struct sk_buff *skb);
extern struct sk_buff *skb_clone(struct sk_buff *skb, int priority);
extern struct sk_buff *skb_copy(const struct sk_buff *skb, int priority);
@@ -935,6 +937,7 @@ static inline void __skb_queue_purge(str
*
* %NULL is returned in there is no free memory.
*/
+#ifndef CONFIG_HAVE_ARCH_DEV_ALLOC_SKB
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
int gfp_mask)
{
@@ -943,6 +946,9 @@ static inline struct sk_buff *__dev_allo
skb_reserve(skb, 16);
return skb;
}
+#else
+extern struct sk_buff *__dev_alloc_skb(unsigned int length, int gfp_mask);
+#endif

/**
* dev_alloc_skb - allocate an skbuff for sending

2004-11-19 23:31:32

by Ian Pratt

[permalink] [raw]

Subject: [7/7] Xen VMM patch set : handle fragemented skbs correctly in icmp_filter

Simple bug fix to icmp_filter -- handle fragemented skbs correctly.

Signed-off-by: [email protected]

---
diff -Nurp pristine-linux-2.6.10-rc2/net/ipv4/raw.c tmp-linux-2.6.10-rc2-xen.patch/net/ipv4/raw.c
--- pristine-linux-2.6.10-rc2/net/ipv4/raw.c 2004-11-15 01:27:42.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/net/ipv4/raw.c 2004-11-18 20:08:06.000000000 +0000
@@ -130,6 +130,9 @@ static __inline__ int icmp_filter(struct
{
int type;

+ if (!pskb_may_pull(skb, sizeof(struct icmphdr)))
+ return 1;
+
type = skb->h.icmph->type;
if (type < 32) {
__u32 data = raw4_sk(sk)->filter.data;

2004-11-19 23:30:16

by Ian Pratt

[permalink] [raw]

Subject: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

This patch modifies /dev/mem to call io_remap_page_range rather than
remap_pfn_range under CONFIG_XEN. This is required because in arch
xen we need to use a different function for mapping MMIO space vs
mapping psueudo-physical memory. This allows the X server and other
programs that use /dev/mem for MMIO to work under Xen.

Signed-off-by: [email protected]

---

diff -Nurp pristine-linux-2.6.10-rc2/drivers/char/mem.c tmp-linux-2.6.10-rc2-xen.patch/drivers/char/mem.c
--- pristine-linux-2.6.10-rc2/drivers/char/mem.c 2004-11-15 01:27:41.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/drivers/char/mem.c 2004-11-19 16:33:04.000000000 +0000
@@ -42,7 +42,12 @@ extern void tapechar_init(void);
*/
static inline int uncached_access(struct file *file, unsigned long addr)
{
-#if defined(__i386__)
+#ifdef CONFIG_XEN
+ if (file->f_flags & O_SYNC)
+ return 1;
+ /* Xen sets correct MTRR type on non-RAM for us. */
+ return 0;
+#elif defined(__i386__)
/*
* On the PPro and successors, the MTRRs are used to set
* memory types for physical addresses outside main memory,
@@ -201,6 +206,14 @@ static int mmap_mem(struct file * file,
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
#endif

+#if defined(CONFIG_XEN)
+ if (io_remap_page_range(vma,
+ vma->vm_start,
+ vma->vm_pgoff << PAGE_SHIFT,
+ vma->vm_end-vma->vm_start,
+ vma->vm_page_prot))
+ return -EAGAIN;
+#else
/* Remap-pfn-range will mark the range VM_IO and VM_RESERVED */
if (remap_pfn_range(vma,
vma->vm_start,
@@ -208,6 +221,7 @@ static int mmap_mem(struct file * file,
vma->vm_end-vma->vm_start,
vma->vm_page_prot))
return -EAGAIN;
+#endif
return 0;
}

2004-11-19 23:24:53

by Ian Pratt

[permalink] [raw]

Subject: [3/7] Xen VMM patch set : runtime disable of VT console

This patch enables the VT console to be disabled at runtime even if it
is built into the kernel. Arch xen needs this to avoid trying to
initialise a VT in virtual machine that doesn't have access to the
console hardware.

Signed-off-by: [email protected]

---

diff -Nurp pristine-linux-2.6.10-rc2/drivers/char/tty_io.c tmp-linux-2.6.10-rc2-xen.patch/drivers/char/tty_io.c
--- pristine-linux-2.6.10-rc2/drivers/char/tty_io.c 2004-11-15 01:27:52.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/drivers/char/tty_io.c 2004-11-18 20:10:26.000000000 +0000
@@ -131,6 +131,8 @@ LIST_HEAD(tty_drivers); /* linked list
vt.c for deeply disgusting hack reasons */
DECLARE_MUTEX(tty_sem);

+int console_use_vt = 1;
+
#ifdef CONFIG_UNIX98_PTYS
extern struct tty_driver *ptm_driver; /* Unix98 pty masters; for /dev/ptmx */
extern int pty_limit; /* Config limit on Unix98 ptys */
@@ -2964,14 +2966,19 @@ static int __init tty_init(void)
#endif

#ifdef CONFIG_VT
- cdev_init(&vc0_cdev, &console_fops);
- if (cdev_add(&vc0_cdev, MKDEV(TTY_MAJOR, 0), 1) ||
- register_chrdev_region(MKDEV(TTY_MAJOR, 0), 1, "/dev/vc/0") < 0)
- panic("Couldn't register /dev/tty0 driver\n");
- devfs_mk_cdev(MKDEV(TTY_MAJOR, 0), S_IFCHR|S_IRUSR|S_IWUSR, "vc/0");
- class_simple_device_add(tty_class, MKDEV(TTY_MAJOR, 0), NULL, "tty0");
+ if (console_use_vt) {
+ cdev_init(&vc0_cdev, &console_fops);
+ if (cdev_add(&vc0_cdev, MKDEV(TTY_MAJOR, 0), 1) ||
+ register_chrdev_region(MKDEV(TTY_MAJOR, 0), 1,
+ "/dev/vc/0") < 0)
+ panic("Couldn't register /dev/tty0 driver\n");
+ devfs_mk_cdev(MKDEV(TTY_MAJOR, 0), S_IFCHR|S_IRUSR|S_IWUSR,
+ "vc/0");
+ class_simple_device_add(tty_class, MKDEV(TTY_MAJOR, 0), NULL,
+ "tty0");

- vty_init();
+ vty_init();
+ }
#endif
return 0;
}

2004-11-19 23:24:53

by Ian Pratt

[permalink] [raw]

Subject: [1/7] Xen VMM patch set : add ptep_establish_new to make va available

This patch adds 'ptep_establish_new', in keeping with the
existing 'ptep_establish', but for use where a mapping is being
established where there was previously none present. This
function is useful (rather than just using set_pte) because
having the virtual address available enables a very important
optimisation for arch-xen. We introduce
HAVE_ARCH_PTEP_ESTABLISH_NEW and define a generic implementation
in asm-generic/pgtable.h, following the pattern of the existing
ptep_establish.

Signed-off-by: [email protected]

---

diff -Nurp pristine-linux-2.6.10-rc2/include/asm-generic/pgtable.h tmp-linux-2.6.10-rc2-xen.patch/include/asm-generic/pgtable.h
--- pristine-linux-2.6.10-rc2/include/asm-generic/pgtable.h 2004-11-15 01:27:18.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/include/asm-generic/pgtable.h 2004-11-18 19:58:13.000000000 +0000
@@ -42,6 +42,16 @@ do { \
} while (0)
#endif

+#ifndef __HAVE_ARCH_PTEP_ESTABLISH_NEW
+/*
+ * Establish a mapping where none previously existed
+ */
+#define ptep_establish_new(__vma, __address, __ptep, __entry) \
+do { \
+ set_pte(__ptep, __entry); \
+} while (0)
+#endif
+
#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int ptep_test_and_clear_young(pte_t *ptep)
{
diff -Nurp pristine-linux-2.6.10-rc2/mm/memory.c tmp-linux-2.6.10-rc2-xen.patch/mm/memory.c
--- pristine-linux-2.6.10-rc2/mm/memory.c 2004-11-15 01:27:26.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/mm/memory.c 2004-11-18 20:07:39.000000000 +0000
@@ -1472,7 +1472,7 @@ do_anonymous_page(struct mm_struct *mm,
page_add_anon_rmap(page, vma, addr);
}

- set_pte(page_table, entry);
+ ptep_establish_new(vma, addr, page_table, entry);
pte_unmap(page_table);

/* No need to invalidate - it was non-present before */
@@ -1577,7 +1577,7 @@ retry:
entry = mk_pte(new_page, vma->vm_page_prot);
if (write_access)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
- set_pte(page_table, entry);
+ ptep_establish_new(vma, address, page_table, entry);
if (anon) {
lru_cache_add_active(new_page);
page_add_anon_rmap(new_page, vma, address);

2004-11-20 02:05:46

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

On Fri, Nov 19, 2004 at 11:22:51PM +0000, Ian Pratt wrote:
> This patch modifies /dev/mem to call io_remap_page_range rather than
> remap_pfn_range under CONFIG_XEN. This is required because in arch
> xen we need to use a different function for mapping MMIO space vs
> mapping psueudo-physical memory. This allows the X server and other
> programs that use /dev/mem for MMIO to work under Xen.
> Signed-off-by: [email protected]

This is safe from the io_remap_page_range() calling convention skew.
All good.

-- wli

2004-11-20 02:46:04

by James Morris

[permalink] [raw]

Subject: Re: [6/7] Xen VMM patch set : add alloc_skb_from_cache

On Fri, 19 Nov 2004, Ian Pratt wrote:

> + /* Get the DATA. */
> + size = SKB_DATA_ALIGN(size);
> + data = kmem_cache_alloc(cp, gfp_mask);
> + if (!data)
> + goto nodata;
> +
> + memset(skb, 0, offsetof(struct sk_buff, truesize));
> + skb->truesize = size + sizeof(struct sk_buff);
> + atomic_set(&skb->users, 1);
> + skb->head = data;
> + skb->data = data;
> + skb->tail = data;
> + skb->end = data + size;
> +
> + atomic_set(&(skb_shinfo(skb)->dataref), 1);
> + skb_shinfo(skb)->nr_frags = 0;
> + skb_shinfo(skb)->tso_size = 0;
> + skb_shinfo(skb)->tso_segs = 0;
> + skb_shinfo(skb)->frag_list = NULL;
> +out:
> + return skb;
> +nodata:
> + kmem_cache_free(skbuff_head_cache, skb);
> + skb = NULL;
> + goto out;
> +}
> +

Most of this is duplicated code with alloc_skb(), perhaps make a function:

__alloc_skb(size, gfp_mask, alloc_func)

Then alloc_skb() and alloc_skb_from_cache() can just be wrappers which
pass in different alloc_funcs. I'm not sure what peformance impact this
might have though.

Thoughts?

- James
--
James Morris
<[email protected]>

2004-11-19 23:24:52

by Ian Pratt

[permalink] [raw]

Subject: [2/7] Xen VMM patch set : return code for arch_free_page

This patch adds a return value to the existing arch_free_page function
that indicates whether the normal free routine still has work to
do. The only architecture that currently uses arch_free_page is arch
'um'. arch xen needs this for 'foreign pages' - pages that don't
belong to the page allocator but are instead managed by custom
allocators. Such pages are marked using PG_arch_1.

Signed-off-by: [email protected]

---

diff -Nurp pristine-linux-2.6.10-rc2/include/linux/gfp.h tmp-linux-2.6.10-rc2-xen.patch/include/linux/gfp.h
--- pristine-linux-2.6.10-rc2/include/linux/gfp.h 2004-11-15 01:27:15.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/include/linux/gfp.h 2004-11-18 19:58:13.000000000 +0000
@@ -74,8 +74,12 @@ struct vm_area_struct;
* optimized to &contig_page_data at compile-time.
*/

+/*
+ * If arch_free_page returns non-zero then the generic free_page code can
+ * immediately bail: the arch-specific function has done all the work.
+ */
#ifndef HAVE_ARCH_FREE_PAGE
-static inline void arch_free_page(struct page *page, int order) { }
+#define arch_free_page(page, order) 0
#endif

extern struct page *
diff -Nurp pristine-linux-2.6.10-rc2/mm/page_alloc.c tmp-linux-2.6.10-rc2-xen.patch/mm/page_alloc.c
--- pristine-linux-2.6.10-rc2/mm/page_alloc.c 2004-11-15 01:26:42.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/mm/page_alloc.c 2004-11-18 20:07:40.000000000 +0000
@@ -278,7 +278,8 @@ void __free_pages_ok(struct page *page,
LIST_HEAD(list);
int i;

- arch_free_page(page, order);
+ if (arch_free_page(page, order))
+ return;

mod_page_state(pgfree, 1 << order);
for (i = 0 ; i < (1 << order) ; ++i)
@@ -508,7 +509,8 @@ static void fastcall free_hot_cold_page(
struct per_cpu_pages *pcp;
unsigned long flags;

- arch_free_page(page, 0);
+ if (arch_free_page(page, 0))
+ return;

kernel_map_pages(page, 1, 0);
inc_page_state(pgfree);
diff -Nurp pristine-linux-2.6.10-rc2/arch/um/kernel/physmem.c tmp-linux-2.6.10-rc2-xen.patch/arch/um/kernel/physmem.c
--- pristine-linux-2.6.10-rc2/arch/um/kernel/physmem.c 2004-11-19 20:04:30.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/arch/um/kernel/physmem.c 2004-11-19 20:05:33.000000000 +0000
@@ -225,7 +225,7 @@ EXPORT_SYMBOL(physmem_forget_descriptor)
EXPORT_SYMBOL(physmem_remove_mapping);
EXPORT_SYMBOL(physmem_subst_mapping);

-void arch_free_page(struct page *page, int order)
+void __arch_free_page(struct page *page, int order)
{
void *virt;
int i;
diff -Nurp pristine-linux-2.6.10-rc2/include/asm-um/page.h tmp-linux-2.6.10-rc2-xen.patch/include/asm-um/page.h
--- pristine-linux-2.6.10-rc2/include/asm-um/page.h 2004-11-19 20:04:52.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/include/asm-um/page.h 2004-11-19 20:05:33.000000000 +0000
@@ -46,7 +46,8 @@ extern void *to_virt(unsigned long phys)
extern struct page *arch_validate(struct page *page, int mask, int order);
#define HAVE_ARCH_VALIDATE

-extern void arch_free_page(struct page *page, int order);
+extern void __arch_free_page(struct page *page, int order);
+#define arch_free_page(page, order) (__arch_free_page((page), (order)), 0)
#define HAVE_ARCH_FREE_PAGE

#endif

2004-11-20 06:03:41

by Chris Wedgwood

[permalink] [raw]

Subject: Re: [6/7] Xen VMM patch set : add alloc_skb_from_cache

On Fri, Nov 19, 2004 at 09:11:04PM -0500, James Morris wrote:

> Most of this is duplicated code with alloc_skb(), perhaps make a
> function:
>
> __alloc_skb(size, gfp_mask, alloc_func)
>
> Then alloc_skb() and alloc_skb_from_cache() can just be wrappers
> which pass in different alloc_funcs. I'm not sure what peformance
> impact this might have though.

I wonder if this would have a measurable performance hit on some
platforms where the additional call/indirection could hurt?

2004-11-20 07:04:03

by David Miller

[permalink] [raw]

Subject: Re: [6/7] Xen VMM patch set : add alloc_skb_from_cache

BTW I already have Ian's patches in my 2.6.11 tree. I have no
problem with the new interfaces, and the only reason it's not
going into 2.6.10 is that Linus wants only bug fixes.

2004-11-20 10:28:28

by Keir Fraser

[permalink] [raw]

Subject: Re: [6/7] Xen VMM patch set : add alloc_skb_from_cache

> On Fri, Nov 19, 2004 at 09:11:04PM -0500, James Morris wrote:
>
> > Most of this is duplicated code with alloc_skb(), perhaps make a
> > function:
> >
> > __alloc_skb(size, gfp_mask, alloc_func)
> >
> > Then alloc_skb() and alloc_skb_from_cache() can just be wrappers
> > which pass in different alloc_funcs. I'm not sure what peformance
> > impact this might have though.
>
> I wonder if this would have a measurable performance hit on some
> platforms where the additional call/indirection could hurt?

Could make __alloc_skb 'static inline'?

-- Keir

2004-11-20 10:36:09

by Chris Wedgwood

[permalink] [raw]

Subject: Re: [6/7] Xen VMM patch set : add alloc_skb_from_cache

On Sat, Nov 20, 2004 at 10:28:10AM +0000, Keir Fraser wrote:

> Could make __alloc_skb 'static inline'?

i don't think that will make much difference, i was more wondering
about the cost of an (additional) pointer dereference in a hot-path.
i think most modern x86 CPUs have pretty decent speculation logic so
the load won't usually cause horrible stalls whereas for some other
some other CPUs this isn't the case

of course w/o measurements this is just wild speculation

2004-11-20 16:32:40

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

> @@ -42,7 +42,12 @@ extern void tapechar_init(void);
> */
> static inline int uncached_access(struct file *file, unsigned long addr)

Any chance you could just move uncached_access() to some asm/ header for
all arches instead of making the ifdef mess even worse?

2004-11-20 16:35:13

by Christoph Hellwig

[permalink] [raw]

Subject: Re: [5/7] Xen VMM patch set : split free_irq into teardown_irq

> +/*
> + * Internal function to unregister an irqaction - typically used to
> + * deallocate special interrupts that are part of the architecture.
> */

It's not static so the internal is kinda wrong. Please provide a full
kerneldoc comment, like the other functions in this file.

2004-11-20 18:03:42

by Ian Pratt

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

> > @@ -42,7 +42,12 @@ extern void tapechar_init(void);
> > */
> > static inline int uncached_access(struct file *file, unsigned long addr)
>
> Any chance you could just move uncached_access() to some asm/ header for
> all arches instead of making the ifdef mess even worse?

I suppose a generic definition could go in asm-generic/iomap.h
with per-architecture definitions in asm/io.h. However, I think
it would make sense to wait until PAT support gets added and then
think through exactly what needs doing rather than reorganising
things now.

Ian

2004-11-20 19:21:24

by Ian Pratt

[permalink] [raw]

Subject: Re: [5/7] Xen VMM patch set : split free_irq into teardown_irq

> > +/*
> > + * Internal function to unregister an irqaction - typically used to
> > + * deallocate special interrupts that are part of the architecture.
> > */
>
> It's not static so the internal is kinda wrong. Please provide a full
> kerneldoc comment, like the other functions in this file.

teardown_irq is to free_irq what setup_irq is to request_irq.

The comment we included in the original patch was taken from
setup_irq. However, I've written the attached patch that adds
kerneldoc style comments to both setup_irq and teardown_irq if
it's preferred.

Best,
Ian

---

This patch moves the `unregister the irqaction' part of free_irq into
a new function teardown_irq, leaving only the mapping from dev_id to
irqaction and freeing the irqaction in free_irq. free_irq
calls teardown_irq to unregister the irqaction. This is similar
to how setup_irq and request_irq work for registering irq's.
We need teardown_irq to allow us to unregister irq's which were
registered early during boot when memory management wasn't ready
yet, i.e. irq's which were registered using setup_irq and use a static
irqaction which cannot be kfree'd.

Signed-off-by: [email protected]

---

diff -Nurp pristine-linux-2.6.10-rc2/include/linux/irq.h tmp-linux-2.6.10-rc2-xen.patch/include/linux/irq.h
--- pristine-linux-2.6.10-rc2/include/linux/irq.h 2004-11-15 01:28:57.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/include/linux/irq.h 2004-11-19 14:00:39.000000000 +0000
@@ -73,6 +73,7 @@ extern irq_desc_t irq_desc [NR_IRQS];
#include <asm/hw_irq.h> /* the arch dependent stuff */

extern int setup_irq(unsigned int irq, struct irqaction * new);
+extern int teardown_irq(unsigned int irq, struct irqaction * old);

#ifdef CONFIG_GENERIC_HARDIRQS
extern cpumask_t irq_affinity[NR_IRQS];
diff -Nurp pristine-linux-2.6.10-rc2/kernel/irq/manage.c tmp-linux-2.6.10-rc2-xen.patch/kernel/irq/manage.c
--- pristine-linux-2.6.10-rc2/kernel/irq/manage.c 2004-11-18 16:00:21.000000000 +0000
+++ linux-2.6.10-rc2-xen0/kernel/irq/manage.c 2004-11-20 19:18:03.000000000 +0000
@@ -144,9 +144,14 @@ int can_request_irq(unsigned int irq, un
return !action;
}

-/*
- * Internal function to register an irqaction - typically used to
- * allocate special interrupts that are part of the architecture.
+/**
+ * setup_irq - register an irqaction structure
+ * @irq: Interrupt to register
+ * @irqaction: The irqaction structure to be registered
+ *
+ * Normally called by request_irq, this function can be used
+ * directly to allocate special interrupts that are part of the
+ * architecture.
*/
int setup_irq(unsigned int irq, struct irqaction * new)
{
@@ -215,28 +220,27 @@ int setup_irq(unsigned int irq, struct i
return 0;
}

-/**
- * free_irq - free an interrupt
- * @irq: Interrupt line to free
- * @dev_id: Device identity to free
- *
- * Remove an interrupt handler. The handler is removed and if the
- * interrupt line is no longer in use by any driver it is disabled.
- * On a shared IRQ the caller must ensure the interrupt is disabled
- * on the card it drives before calling this function. The function
- * does not return until any executing interrupts for this IRQ
- * have completed.
+/*
+ * teardown_irq - unregister an irqaction
+ * @irq: Interrupt line being freed
+ * @old: Pointer to the irqaction that is to be unregistered
+ *
+ * This function is called by free_irq and does the actual
+ * business of unregistering the handler. It exists as a
+ * seperate function to enable handlers to be unregistered
+ * for irqactions that have been allocated statically at
+ * boot time.
*
* This function must not be called from interrupt context.
*/
-void free_irq(unsigned int irq, void *dev_id)
+int teardown_irq(unsigned int irq, struct irqaction * old)
{
struct irq_desc *desc;
struct irqaction **p;
unsigned long flags;

if (irq >= NR_IRQS)
- return;
+ return -ENOENT;

desc = irq_desc + irq;
spin_lock_irqsave(&desc->lock,flags);
@@ -248,7 +252,7 @@ void free_irq(unsigned int irq, void *de
struct irqaction **pp = p;

p = &action->next;
- if (action->dev_id != dev_id)
+ if (action != old)
continue;

/* Found it - now remove it from the list of entries */
@@ -265,13 +269,52 @@ void free_irq(unsigned int irq, void *de

/* Make sure it's not being used on another CPU */
synchronize_irq(irq);
- kfree(action);
- return;
+ return 0;
}
- printk(KERN_ERR "Trying to free free IRQ%d\n",irq);
+ printk(KERN_ERR "Trying to teardown free IRQ%d\n",irq);
spin_unlock_irqrestore(&desc->lock,flags);
+ return -ENOENT;
+ }
+}
+
+/**
+ * free_irq - free an interrupt
+ * @irq: Interrupt line to free
+ * @dev_id: Device identity to free
+ *
+ * Remove an interrupt handler. The handler is removed and if the
+ * interrupt line is no longer in use by any driver it is disabled.
+ * On a shared IRQ the caller must ensure the interrupt is disabled
+ * on the card it drives before calling this function. The function
+ * does not return until any executing interrupts for this IRQ
+ * have completed.
+ *
+ * This function must not be called from interrupt context.
+ */
+void free_irq(unsigned int irq, void *dev_id)
+{
+ struct irq_desc *desc;
+ struct irqaction *action;
+ unsigned long flags;
+
+ if (irq >= NR_IRQS)
+ return;
+
+ desc = irq_desc + irq;
+ spin_lock_irqsave(&desc->lock,flags);
+ for (action = desc->action; action != NULL; action = action->next) {
+ if (action->dev_id != dev_id)
+ continue;
+
+ spin_unlock_irqrestore(&desc->lock,flags);
+
+ if (teardown_irq(irq, action) == 0)
+ kfree(action);
return;
}
+ printk(KERN_ERR "Trying to free free IRQ%d\n",irq);
+ spin_unlock_irqrestore(&desc->lock,flags);
+ return;
}

EXPORT_SYMBOL(free_irq);

2004-11-27 01:51:56

by Ian Pratt

[permalink] [raw]

Subject: Re: Xen VMM patch set - take 2

After an initial flourish of comments which I believe we've
managed to address, it's all gone very quiet.

Is that stunned silence or universal agreement? ;-)

Thanks,
Ian

http://xen.sf.net

> OK folks, this is my second attempt to post the arch xen patches,
> this time against 2.6.10-rc2.
>
> We need 6 core patches and a bug fix:
>
> 1. add ptep_establish_new to make va available
> 2. return code for arch_free_page
> 3. runtime disable of VT console
> 4. /dev/mem io_remap_page_range for CONFIG_XEN
> 5. split free_irq into teardown_irq
> 6. alloc_skb_from_cache
>
> Bug fix:
> 7. handle fragemented skbs correctly in icmp_filter
>
> The actual new architecture, arch xen, is too big to post to the list,
> so here's a link:
> http://www.cl.cam.ac.uk/netos/xen/downloads/arch-xen.patch
>
> Likewise for the virtual block, network, and console drivers:
> http://www.cl.cam.ac.uk/netos/xen/downloads/drivers-xen.patch
>
> Applying the above 9 patches should give you everything you need to
> build full-featured arch xen kernels.
>
> Arch xen will be maintained by myself, Keir Fraser, Christian Limpach
> and Steve hand.
>
> Cheers,
> Ian
>

2004-11-27 12:39:33

by Ian Pratt

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

> > > @@ -42,7 +42,12 @@ extern void tapechar_init(void);
> > > */
> > > static inline int uncached_access(struct file *file, unsigned long addr)
> >
> > Any chance you could just move uncached_access() to some asm/ header for
> > all arches instead of making the ifdef mess even worse?
>
> I suppose a generic definition could go in asm-generic/iomap.h
> with per-architecture definitions in asm/io.h. However, I think
> it would make sense to wait until PAT support gets added and then
> think through exactly what needs doing rather than reorganising
> things now.

What do people think about this? Should I stick with the current
patch that adds another #ifdef to uncached_access, or should I
try pulling it out into asm-generic/iomap.h with per-arch
definitions in asm/io.h ?

Is there anyone working on PAT support? It would be good to have
their input.

Thanks,
Ian

2004-11-27 12:46:01

by Ian Pratt

[permalink] [raw]

Subject: Re: [5/7] Xen VMM patch set : split free_irq into teardown_irq

> > > +/*
> > > + * Internal function to unregister an irqaction - typically used to
> > > + * deallocate special interrupts that are part of the architecture.
> > > */
> >
> > It's not static so the internal is kinda wrong. Please provide a full
> > kerneldoc comment, like the other functions in this file.
>
> teardown_irq is to free_irq what setup_irq is to request_irq.
>
> The comment we included in the original patch was taken from
> setup_irq. However, I've written the attached patch that adds
> kerneldoc style comments to both setup_irq and teardown_irq if
> it's preferred.

Is everyone happy with the second version of the patch that
adds kerneldoc comments to setup_irq and our new teardown_irq?

Thanks,
Ian

---

This patch moves the `unregister the irqaction' part of free_irq into
a new function teardown_irq, leaving only the mapping from dev_id to
irqaction and freeing the irqaction in free_irq. free_irq
calls teardown_irq to unregister the irqaction. This is similar
to how setup_irq and request_irq work for registering irq's.
We need teardown_irq to allow us to unregister irq's which were
registered early during boot when memory management wasn't ready
yet, i.e. irq's which were registered using setup_irq and use a static
irqaction which cannot be kfree'd.

Signed-off-by: [email protected]

---

diff -Nurp pristine-linux-2.6.10-rc2/include/linux/irq.h tmp-linux-2.6.10-rc2-xen.patch/include/linux/irq.h
--- pristine-linux-2.6.10-rc2/include/linux/irq.h 2004-11-15 01:28:57.000000000 +0000
+++ tmp-linux-2.6.10-rc2-xen.patch/include/linux/irq.h 2004-11-19 14:00:39.000000000 +0000
@@ -73,6 +73,7 @@ extern irq_desc_t irq_desc [NR_IRQS];
#include <asm/hw_irq.h> /* the arch dependent stuff */

extern int setup_irq(unsigned int irq, struct irqaction * new);
+extern int teardown_irq(unsigned int irq, struct irqaction * old);

#ifdef CONFIG_GENERIC_HARDIRQS
extern cpumask_t irq_affinity[NR_IRQS];
diff -Nurp pristine-linux-2.6.10-rc2/kernel/irq/manage.c tmp-linux-2.6.10-rc2-xen.patch/kernel/irq/manage.c
--- pristine-linux-2.6.10-rc2/kernel/irq/manage.c 2004-11-18 16:00:21.000000000 +0000
+++ linux-2.6.10-rc2-xen0/kernel/irq/manage.c 2004-11-20 19:18:03.000000000 +0000
@@ -144,9 +144,14 @@ int can_request_irq(unsigned int irq, un
return !action;
}

-/*
- * Internal function to register an irqaction - typically used to
- * allocate special interrupts that are part of the architecture.
+/**
+ * setup_irq - register an irqaction structure
+ * @irq: Interrupt to register
+ * @irqaction: The irqaction structure to be registered
+ *
+ * Normally called by request_irq, this function can be used
+ * directly to allocate special interrupts that are part of the
+ * architecture.
*/
int setup_irq(unsigned int irq, struct irqaction * new)
{
@@ -215,28 +220,27 @@ int setup_irq(unsigned int irq, struct i
return 0;
}

-/**
- * free_irq - free an interrupt
- * @irq: Interrupt line to free
- * @dev_id: Device identity to free
- *
- * Remove an interrupt handler. The handler is removed and if the
- * interrupt line is no longer in use by any driver it is disabled.
- * On a shared IRQ the caller must ensure the interrupt is disabled
- * on the card it drives before calling this function. The function
- * does not return until any executing interrupts for this IRQ
- * have completed.
+/*
+ * teardown_irq - unregister an irqaction
+ * @irq: Interrupt line being freed
+ * @old: Pointer to the irqaction that is to be unregistered
+ *
+ * This function is called by free_irq and does the actual
+ * business of unregistering the handler. It exists as a
+ * seperate function to enable handlers to be unregistered
+ * for irqactions that have been allocated statically at
+ * boot time.
*
* This function must not be called from interrupt context.
*/
-void free_irq(unsigned int irq, void *dev_id)
+int teardown_irq(unsigned int irq, struct irqaction * old)
{
struct irq_desc *desc;
struct irqaction **p;
unsigned long flags;

if (irq >= NR_IRQS)
- return;
+ return -ENOENT;

desc = irq_desc + irq;
spin_lock_irqsave(&desc->lock,flags);
@@ -248,7 +252,7 @@ void free_irq(unsigned int irq, void *de
struct irqaction **pp = p;

p = &action->next;
- if (action->dev_id != dev_id)
+ if (action != old)
continue;

/* Found it - now remove it from the list of entries */
@@ -265,13 +269,52 @@ void free_irq(unsigned int irq, void *de

/* Make sure it's not being used on another CPU */
synchronize_irq(irq);
- kfree(action);
- return;
+ return 0;
}
- printk(KERN_ERR "Trying to free free IRQ%d\n",irq);
+ printk(KERN_ERR "Trying to teardown free IRQ%d\n",irq);
spin_unlock_irqrestore(&desc->lock,flags);
+ return -ENOENT;
+ }
+}
+
+/**
+ * free_irq - free an interrupt
+ * @irq: Interrupt line to free
+ * @dev_id: Device identity to free
+ *
+ * Remove an interrupt handler. The handler is removed and if the
+ * interrupt line is no longer in use by any driver it is disabled.
+ * On a shared IRQ the caller must ensure the interrupt is disabled
+ * on the card it drives before calling this function. The function
+ * does not return until any executing interrupts for this IRQ
+ * have completed.
+ *
+ * This function must not be called from interrupt context.
+ */
+void free_irq(unsigned int irq, void *dev_id)
+{
+ struct irq_desc *desc;
+ struct irqaction *action;
+ unsigned long flags;
+
+ if (irq >= NR_IRQS)
+ return;
+
+ desc = irq_desc + irq;
+ spin_lock_irqsave(&desc->lock,flags);
+ for (action = desc->action; action != NULL; action = action->next) {
+ if (action->dev_id != dev_id)
+ continue;
+
+ spin_unlock_irqrestore(&desc->lock,flags);
+
+ if (teardown_irq(irq, action) == 0)
+ kfree(action);
return;
}
+ printk(KERN_ERR "Trying to free free IRQ%d\n",irq);
+ spin_unlock_irqrestore(&desc->lock,flags);
+ return;
}

EXPORT_SYMBOL(free_irq);

2004-11-27 16:55:11

by Randy.Dunlap

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

Ian Pratt wrote:
>>>>@@ -42,7 +42,12 @@ extern void tapechar_init(void);
>>>> */
>>>> static inline int uncached_access(struct file *file, unsigned long addr)
>>>
>>>Any chance you could just move uncached_access() to some asm/ header for
>>>all arches instead of making the ifdef mess even worse?
>>
>>I suppose a generic definition could go in asm-generic/iomap.h
>>with per-architecture definitions in asm/io.h. However, I think
>>it would make sense to wait until PAT support gets added and then
>>think through exactly what needs doing rather than reorganising
>>things now.
>
>
> What do people think about this? Should I stick with the current
> patch that adds another #ifdef to uncached_access, or should I
> try pulling it out into asm-generic/iomap.h with per-arch
> definitions in asm/io.h ?
>
> Is there anyone working on PAT support? It would be good to have
> their input.

Someone from NVidia has posted some PAT patches a few times.
(Terence Ripperda) There's a thread beginning here:
http://marc.theaimsgroup.com/?l=linux-kernel&m=108180930118848&w=2

--
~Randy

2004-11-28 17:11:44

by Rik van Riel

[permalink] [raw]

Subject: Re: Xen VMM patch set - take 2

On Thu, 25 Nov 2004, Ian Pratt wrote:

> After an initial flourish of comments which I believe we've
> managed to address, it's all gone very quiet.
>
> Is that stunned silence or universal agreement? ;-)

Stunned silence I guess - merging an architecture is
usually much more controversial ;)

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-11-30 02:29:44

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [2/7] Xen VMM patch set : return code for arch_free_page

On Fri, Nov 19, 2004 at 11:20:54PM +0000, Ian Pratt wrote:
>
> This patch adds a return value to the existing arch_free_page function
> that indicates whether the normal free routine still has work to
> do. The only architecture that currently uses arch_free_page is arch
> 'um'. arch xen needs this for 'foreign pages' - pages that don't
> belong to the page allocator but are instead managed by custom
> allocators. Such pages are marked using PG_arch_1.

This sure looks good too.

> @@ -508,7 +509,8 @@ static void fastcall free_hot_cold_page(
> struct per_cpu_pages *pcp;
> unsigned long flags;
>
> - arch_free_page(page, 0);
> + if (arch_free_page(page, 0))
> + return;

If you want you can microoptimize the guest placing zone =
page_zone(page) after arch_free_page. Just a side note (it cannot be
measurable anyways).

2004-11-30 03:08:46

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

On Fri, Nov 19, 2004 at 11:22:51PM +0000, Ian Pratt wrote:
>
> This patch modifies /dev/mem to call io_remap_page_range rather than
> remap_pfn_range under CONFIG_XEN. This is required because in arch

Why don't we change /dev/mem to use io_remap_page_range unconditionally
for ranges above high_memory? Clearly io_remap_page_range can map device
space, and I guess that's what io_remap_page_range is there for. sparc
and sparc64 are the only two ones implementing io_remap_page_range, so
maybe Dave or Wli can tell us if there's any penalty in using
io_remap_page_range in mmap(/dev/mmap) for phys ranges above
high_memory. I don't know the sparc architectural details of mk_pte_io
invoked by io_remap_page_range of the sparc arch.

There's also an issue with io_remap_page_range where sparc has 6 args
while everyone else has 5 args. It's perfectly fine that sparc will be
the only one parsing the last value, but we should pass that last value
to all archs, so that people can avoid writing code like the below
(drivers/char/drm):

#ifdef __sparc__
if (io_remap_page_range(DRM_RPR_ARG(vma) vma->vm_start,
VM_OFFSET(vma) + offset,
vma->vm_end - vma->vm_start,
vma->vm_page_prot, 0))
#else
if (remap_pfn_range(DRM_RPR_ARG(vma) vma->vm_start,
(VM_OFFSET(vma) + offset) >> PAGE_SHIFT,
vma->vm_end - vma->vm_start,
vma->vm_page_prot))
#endif

Thanks.

2004-11-30 03:17:13

by David Miller

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

On Tue, 30 Nov 2004 04:08:12 +0100
Andrea Arcangeli <[email protected]> wrote:

> There's also an issue with io_remap_page_range where sparc has 6 args
> while everyone else has 5 args.

It's there to handle larger than 32-bit I/O physical addresses
on sparc32.

WLI's work on remap_pfn_range() was meant to eventually turn
the various io_remap_page_range() routines over the be pfn
based as well. Then sparc32 wouldn't need that extra arg
any longer.

2004-11-30 03:38:47

by William Lee Irwin III

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

On Fri, Nov 19, 2004 at 11:22:51PM +0000, Ian Pratt wrote:
>> This patch modifies /dev/mem to call io_remap_page_range rather than
>> remap_pfn_range under CONFIG_XEN. This is required because in arch

On Tue, Nov 30, 2004 at 04:08:12AM +0100, Andrea Arcangeli wrote:
> Why don't we change /dev/mem to use io_remap_page_range unconditionally
> for ranges above high_memory? Clearly io_remap_page_range can map device
> space, and I guess that's what io_remap_page_range is there for. sparc
> and sparc64 are the only two ones implementing io_remap_page_range, so
> maybe Dave or Wli can tell us if there's any penalty in using
> io_remap_page_range in mmap(/dev/mmap) for phys ranges above
> high_memory. I don't know the sparc architectural details of mk_pte_io
> invoked by io_remap_page_range of the sparc arch.
> There's also an issue with io_remap_page_range where sparc has 6 args
> while everyone else has 5 args. It's perfectly fine that sparc will be
> the only one parsing the last value, but we should pass that last value
> to all archs, so that people can avoid writing code like the below
> (drivers/char/drm):

On sparc32, all IO memory is above the 32-bit boundary. So it's
generally okay for that. The general ongoing work in the
io_remap_page_range() area to unify the sparc32/sparc64 case with other
architectures is based in part on the remap_pfn_range() work (as noted
by davem in another followup).

Unfortunately the effort to debug the effects of pending changes in
2.6.10-rc2-mm3 is blocking the io_remap_page_range() work.

-- wli

2004-11-30 08:56:50

by Ian Pratt

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

> On Fri, Nov 19, 2004 at 11:22:51PM +0000, Ian Pratt wrote:
> >
> > This patch modifies /dev/mem to call io_remap_page_range rather than
> > remap_pfn_range under CONFIG_XEN. This is required because in arch
>
> Why don't we change /dev/mem to use io_remap_page_range unconditionally
> for ranges above high_memory? Clearly io_remap_page_range can map device
> space, and I guess that's what io_remap_page_range is there for.

In the Xen case, we actually need to use io_remap_page_range for
all /dev/mem accesses, so as to be able to map the BIOS area, DMI
tables etc.

Are people generally happy with the introduction of
ARCH_HAS_DEV_MEM as a way of migrating away from the nest of
#ifdef's in mem.c ?

I wasn't sure how best to handle the fact that /dev/kmem shared
its mmap implementation with /dev/mem. BTW: Does anyone know of
any programs that make use of mmap'ing /dev/kmem?

Thanks,
Ian

2004-11-30 09:06:19

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

On Tue, 2004-11-30 at 08:56 +0000, Ian Pratt wrote:

> In the Xen case, we actually need to use io_remap_page_range for
> all /dev/mem accesses, so as to be able to map the BIOS area, DMI
> tables etc.
>

look at the /dev/mem patches in the -mm tree... there might be
infrastructure there that is useful to you

> I wasn't sure how best to handle the fact that /dev/kmem shared
> its mmap implementation with /dev/mem. BTW: Does anyone know of
> any programs that make use of mmap'ing /dev/kmem?

effectively nothing uses /dev/kmem; you might as well just remove it
entirely and/or not provide it in Xen.

2004-11-30 12:09:34

by Ian Pratt

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

> On Tue, 2004-11-30 at 08:56 +0000, Ian Pratt wrote:
>
> > In the Xen case, we actually need to use io_remap_page_range for
> > all /dev/mem accesses, so as to be able to map the BIOS area, DMI
> > tables etc.
>
> look at the /dev/mem patches in the -mm tree... there might be
> infrastructure there that is useful to you

Thanks for the pointer. Having looked through, it's orthogonal
and can't help us, though doesn't conflict with our patch either
(fuzz 2).

Thanks,
Ian

2004-11-30 12:14:35

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

On Tue, 2004-11-30 at 12:09 +0000, Ian Pratt wrote:
> > On Tue, 2004-11-30 at 08:56 +0000, Ian Pratt wrote:
> >
> > > In the Xen case, we actually need to use io_remap_page_range for
> > > all /dev/mem accesses, so as to be able to map the BIOS area, DMI
> > > tables etc.
> >
> > look at the /dev/mem patches in the -mm tree... there might be
> > infrastructure there that is useful to you
>
> Thanks for the pointer. Having looked through, it's orthogonal
> and can't help us, though doesn't conflict with our patch either
> (fuzz 2).

well.. it makes all non-ram unaccessible... so what's left is only the
stuff Xen wants to make accessible anyway :)

2004-11-30 12:19:59

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

On Tue, 2004-11-30 at 13:14 +0100, Arjan van de Ven wrote:
> On Tue, 2004-11-30 at 12:09 +0000, Ian Pratt wrote:
> > > On Tue, 2004-11-30 at 08:56 +0000, Ian Pratt wrote:
> > >
> > > > In the Xen case, we actually need to use io_remap_page_range for
> > > > all /dev/mem accesses, so as to be able to map the BIOS area, DMI
> > > > tables etc.
> > >
> > > look at the /dev/mem patches in the -mm tree... there might be
> > > infrastructure there that is useful to you
> >
> > Thanks for the pointer. Having looked through, it's orthogonal
> > and can't help us, though doesn't conflict with our patch either
> > (fuzz 2).
>
> well.. it makes all non-ram unaccessible... so what's left is only the

eh too many negatives. it makes all kernel ram unaccessible.

2004-11-30 13:17:20

by Ian Pratt

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

> On Tue, 2004-11-30 at 12:09 +0000, Ian Pratt wrote:
> > > On Tue, 2004-11-30 at 08:56 +0000, Ian Pratt wrote:
> > >
> > > > In the Xen case, we actually need to use io_remap_page_range for
> > > > all /dev/mem accesses, so as to be able to map the BIOS area, DMI
> > > > tables etc.
> > >
> > > look at the /dev/mem patches in the -mm tree... there might be
> > > infrastructure there that is useful to you
> >
> > Thanks for the pointer. Having looked through, it's orthogonal
> > and can't help us, though doesn't conflict with our patch either
> > (fuzz 2).
>
> well.. it makes all non-ram unaccessible... so what's left is only the
> stuff Xen wants to make accessible anyway :)

Ha -- I'd missed that because I hadn't spotted the addition of
devmem_is_allowed to /arch/i386/mm/init.c

I'm really rather pleased to see this attempt at disambiguating
what /dev/mem can be used for. What's nice from our POV is it now
explicitly prevents the cases that we couldn't easily support in
Xen. We'd observed that nothing seemed to break, so presumed
they weren't very common.

The next logical step with that patch would be to have mmap_mem
call io_remap_page_range rather than remap_pfn_range, at which
point it would solve our problem. I'm not sure what this would
mean for architectures other than i386.

Ian

2004-11-30 18:10:44

by Andrea Arcangeli

[permalink] [raw]

Subject: Re: [4/7] Xen VMM patch set : /dev/mem io_remap_page_range for CONFIG_XEN

On Tue, Nov 30, 2004 at 01:16:56PM +0000, Ian Pratt wrote:
> point it would solve our problem. I'm not sure what this would
> mean for architectures other than i386.

Only sparc implements io_remap_page_range differently from
remap_pte_range and from Wli answer I understand it's probably ok for
sparc to use io_remap_page_range outside ram.