2006-03-16 16:55:53

by Jes Sorensen

Subject: [patch] mspec - special memory driver and do_no_pfn handler

Hi,

This is an updated version of the mspec driver (special memory
support), formerly known as fetchop.

With this version I have implemented a do_no_pfn() handler, similar to
the do_no_page() handler but for pages which are not backed by a
struct page. This avoids the trick used in earlier versions where the
driver was allocating a dummy page and returning it back to the
do_no_page() handler, which would then free it immediately. Hopefully
this addresses the main concern there was with this driver in the
past.
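
For readers following the thread, the contract between the generic fault code and a ->nopfn handler can be sketched in userspace C. This is only an illustration of the error translation that do_no_pfn() in the patch performs on the handler's return value; the FAULT_* names and their numeric values are placeholders assumed for the sketch, not the kernel's definitions.

```c
#include <errno.h>

/* Placeholder fault codes; names and values are assumptions for this sketch. */
enum { FAULT_SIGBUS = 0, FAULT_MINOR = 1, FAULT_OOM = 2 };

/*
 * Translate a handler result the way do_no_pfn() in the patch does:
 * the handler returns either a pfn (>= 0) or a negative errno.
 * -ENOMEM becomes an OOM fault, any other negative value a SIGBUS.
 */
int fault_code_for(long pfn)
{
	if (pfn == -ENOMEM)
		return FAULT_OOM;
	if (pfn < 0)
		return FAULT_SIGBUS;
	return FAULT_MINOR;	/* the caller would now install the pte */
}
```

Note that the patch checks -EFAULT separately before the generic "pfn < 0" test, but both branches return the same SIGBUS result, so the sketch folds them together.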

The reason for taking the do_no_pfn() approach rather than
remap_pfn_range() is that it needs to benefit from node locality of
the pages on NUMA systems.
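
A rough way to picture the locality argument: with a fault handler, each page is allocated on first touch from the node of the CPU that faulted, whereas mapping everything up front at mmap time would fix the placement before anyone knows who will use which page. The userspace sketch below simulates that first-touch behaviour. fake_alloc_page() and page_node() are hypothetical stand-ins (the driver itself calls uncached_alloc_page(numa_node_id())), and the table plays the role of vdata->maddr[]. Locking is left out of the sketch.

```c
#include <assert.h>

#define NPAGES 4

/* Per-mapping table of backing pages; 0 means "not yet allocated". */
static unsigned long long maddr[NPAGES];
static unsigned int pages_handed_out;

/*
 * Hypothetical stand-in for uncached_alloc_page(nid): the node id is
 * kept in the high bits so page_node() can recover it. Never returns 0.
 */
unsigned long long fake_alloc_page(int nid)
{
	pages_handed_out++;
	return ((unsigned long long)(nid + 1) << 32) | pages_handed_out;
}

/* Recover the node a fake page was allocated from. */
int page_node(unsigned long long page)
{
	return (int)(page >> 32) - 1;
}

/*
 * Simulated fault: allocate on first touch from the faulting node,
 * then keep returning the same page for that index.
 */
unsigned long long fault_in(unsigned int index, int faulting_node)
{
	assert(index < NPAGES);
	if (maddr[index] == 0)
		maddr[index] = fake_alloc_page(faulting_node);
	return maddr[index];
}
```

A page first touched from node 1 stays on node 1 even if node 0 faults on it later, which matches the expectation stated in the do_no_pfn() comment that the handler always returns the same pfn for a given virtual mapping.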

While the driver is currently only used on SN2 hardware, it is placed
in drivers/char as it should be possible and beneficial for other
architectures to implement and use the uncached mode.

Please let me know if there are any objections or comments etc. to
this approach. If preferred I can split out the do_no_pfn part into a
separate patch.

Cheers,
Jes

Signed-off-by: Jes Sorensen <[email protected]>

drivers/char/Kconfig | 8
drivers/char/Makefile | 1
drivers/char/mspec.c | 442 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 1
mm/memory.c | 51 +++++
5 files changed, 502 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/char/Kconfig
===================================================================
--- linux-2.6.orig/drivers/char/Kconfig
+++ linux-2.6/drivers/char/Kconfig
@@ -421,6 +421,14 @@
If you have an SGI Altix with an attached SABrick
say Y or M here, otherwise say N.

+config MSPEC
+ tristate " Memory special operations driver"
+ depends on IA64
+ help
+ If you have an ia64 and you want to enable memory special
+ operations support (formerly known as fetchop), say Y here,
+ otherwise say N.
+
source "drivers/serial/Kconfig"

config UNIX98_PTYS
Index: linux-2.6/drivers/char/Makefile
===================================================================
--- linux-2.6.orig/drivers/char/Makefile
+++ linux-2.6/drivers/char/Makefile
@@ -49,6 +49,7 @@
obj-$(CONFIG_VIOTAPE) += viotape.o
obj-$(CONFIG_HVCS) += hvcs.o
obj-$(CONFIG_SGI_MBCS) += mbcs.o
+obj-$(CONFIG_MSPEC) += mspec.o

obj-$(CONFIG_PRINTER) += lp.o
obj-$(CONFIG_TIPAR) += tipar.o
Index: linux-2.6/drivers/char/mspec.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/char/mspec.c
@@ -0,0 +1,442 @@
+/*
+ * Copyright (C) 2001-2006 Silicon Graphics, Inc. All rights
+ * reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ */
+
+/*
+ * SN Platform Special Memory (mspec) Support
+ *
+ * This driver exports the SN special memory (mspec) facility to user
+ * processes.
+ * There are three types of memory made available thru this driver:
+ * fetchops, uncached and cached.
+ *
+ * Fetchops are atomic memory operations that are implemented in the
+ * memory controller on SGI SN hardware.
+ *
+ * Uncached are used for memory write combining feature of the ia64
+ * cpu.
+ *
+ * Cached are used for areas of memory that are used as cached addresses
+ * on our partition and used as uncached addresses from other partitions.
+ * Due to a design constraint of the SN2 Shub, you can not have processors
+ * on the same FSB perform both a cached and uncached reference to the
+ * same cache line. These special memory cached regions prevent the
+ * kernel from ever dropping in a TLB entry and therefore prevent the
+ * processor from ever speculating a cache line from this page.
+ */
+
+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/miscdevice.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/proc_fs.h>
+#include <linux/vmalloc.h>
+#include <linux/bitops.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/efi.h>
+#include <linux/numa.h>
+#include <asm/page.h>
+#include <asm/pal.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/atomic.h>
+#include <asm/tlbflush.h>
+#include <asm/uncached.h>
+#include <asm/sn/addrs.h>
+#include <asm/sn/arch.h>
+#include <asm/sn/mspec.h>
+#include <asm/sn/sn_cpuid.h>
+#include <asm/sn/io.h>
+#include <asm/sn/bte.h>
+#include <asm/sn/shubio.h>
+
+
+#define FETCHOP_ID "SGI Fetchop,"
+#define CACHED_ID "Cached,"
+#define UNCACHED_ID "Uncached"
+#define REVISION "4.0"
+#define MSPEC_BASENAME "mspec"
+
+/*
+ * Page types allocated by the device.
+ */
+enum {
+ MSPEC_FETCHOP = 1,
+ MSPEC_CACHED,
+ MSPEC_UNCACHED
+};
+
+/*
+ * One of these structures is allocated when an mspec region is mmaped. The
+ * structure is pointed to by the vma->vm_private_data field in the vma struct.
+ * This structure is used to record the addresses of the mspec pages.
+ */
+struct vma_data {
+ atomic_t refcnt; /* Number of vmas sharing the data. */
+ spinlock_t lock; /* Serialize access to the vma. */
+ int count; /* Number of pages allocated. */
+ int type; /* Type of pages allocated. */
+ unsigned long maddr[0]; /* Array of MSPEC addresses. */
+};
+
+/* used on shub2 to clear FOP cache in the HUB */
+static unsigned long scratch_page[MAX_NUMNODES];
+#define SH2_AMO_CACHE_ENTRIES 4
+
+static inline int
+mspec_zero_block(unsigned long addr, int len)
+{
+ int status;
+
+ if (ia64_platform_is("sn2")) {
+ if (is_shub2()) {
+ int nid;
+ void *p;
+ int i;
+
+ nid = nasid_to_cnodeid(get_node_number(__pa(addr)));
+ p = (void *)TO_AMO(scratch_page[nid]);
+
+ for (i=0; i < SH2_AMO_CACHE_ENTRIES; i++) {
+ FETCHOP_LOAD_OP(p, FETCHOP_LOAD);
+ p += FETCHOP_VAR_SIZE;
+ }
+ }
+
+ status = bte_copy(0, addr & ~__IA64_UNCACHED_OFFSET, len,
+ BTE_WACQUIRE | BTE_ZERO_FILL, NULL);
+ } else {
+ memset((char *) addr, 0, len);
+ status = 0;
+ }
+ return status;
+}
+
+/*
+ * mspec_open
+ *
+ * Called when a device mapping is created by a means other than mmap
+ * (via fork, etc.). Increments the reference count on the underlying
+ * mspec data so it is not freed prematurely.
+ */
+static void
+mspec_open(struct vm_area_struct *vma)
+{
+ struct vma_data *vdata;
+
+ vdata = vma->vm_private_data;
+ atomic_inc(&vdata->refcnt);
+}
+
+/*
+ * mspec_close
+ *
+ * Called when unmapping a device mapping. Frees all mspec pages
+ * belonging to the vma.
+ */
+static void
+mspec_close(struct vm_area_struct *vma)
+{
+ struct vma_data *vdata;
+ int i, pages, result, vdata_size;
+
+ vdata = vma->vm_private_data;
+ if (!atomic_dec_and_test(&vdata->refcnt))
+ return;
+
+ pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ vdata_size = sizeof(struct vma_data) + pages * sizeof(long);
+ for (i = 0; i < pages; i++) {
+ if (vdata->maddr[i] == 0)
+ continue;
+ /*
+ * Clear the page before sticking it back
+ * into the pool.
+ */
+ result = mspec_zero_block(vdata->maddr[i], PAGE_SIZE);
+ if (!result)
+ uncached_free_page(vdata->maddr[i]);
+ else
+ printk(KERN_WARNING "mspec_close(): "
+ "failed to zero page %i\n",
+ result);
+ }
+
+ if (vdata_size <= PAGE_SIZE)
+ kfree(vdata);
+ else
+ vfree(vdata);
+}
+
+
+/*
+ * mspec_get_pmd
+ *
+ * Return the pmd for a given mm and address.
+ */
+static __inline__ pmd_t *
+mspec_get_pmd(struct mm_struct *mm, u64 address)
+{
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd = NULL;
+
+ pgd = pgd_offset(mm, address);
+ if (pgd_present(*pgd)) {
+ pud = pud_offset(pgd, address);
+
+ if (pud_present(*pud))
+ pmd = pmd_offset(pud, address);
+ }
+
+ return pmd;
+}
+
+/*
+ * mspec_nopfn
+ *
+ * Creates a mspec page and maps it to user space.
+ */
+static long
+mspec_nopfn(struct vm_area_struct *vma, unsigned long address, int *unused)
+{
+ unsigned long paddr, maddr;
+ unsigned long pfn;
+ int index;
+ struct vma_data *vdata = vma->vm_private_data;
+
+ index = (address - vma->vm_start) >> PAGE_SHIFT;
+ maddr = (volatile unsigned long) vdata->maddr[index];
+ if (maddr == 0) {
+ maddr = uncached_alloc_page(numa_node_id());
+ if (maddr == 0)
+ return -ENOMEM;
+
+ spin_lock(&vdata->lock);
+ if (vdata->maddr[index] == 0) {
+ vdata->count++;
+ vdata->maddr[index] = maddr;
+ } else {
+ uncached_free_page(maddr);
+ maddr = vdata->maddr[index];
+ }
+ spin_unlock(&vdata->lock);
+ }
+
+ if (vdata->type == MSPEC_FETCHOP)
+ paddr = TO_AMO(maddr);
+ else
+ paddr = __pa(TO_CAC(maddr));
+
+ pfn = paddr >> PAGE_SHIFT;
+
+ return pfn;
+}
+
+static struct vm_operations_struct mspec_vm_ops = {
+ .open = mspec_open,
+ .close = mspec_close,
+ .nopfn = mspec_nopfn
+};
+
+/*
+ * mspec_mmap
+ *
+ * Called when mmaping the device. Initializes the vma with a fault handler
+ * and private data structure necessary to allocate, track, and free the
+ * underlying pages.
+ */
+static int
+mspec_mmap(struct file *file, struct vm_area_struct *vma, int type)
+{
+ struct vma_data *vdata;
+ int pages, vdata_size;
+
+ if (vma->vm_pgoff != 0)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_SHARED) == 0)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_WRITE) == 0)
+ return -EPERM;
+
+ pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ vdata_size = sizeof(struct vma_data) + pages * sizeof(long);
+ if (vdata_size <= PAGE_SIZE)
+ vdata = kmalloc(vdata_size, GFP_KERNEL);
+ else
+ vdata = vmalloc(vdata_size);
+ if (!vdata)
+ return -ENOMEM;
+ memset(vdata, 0, vdata_size);
+
+ vdata->type = type;
+ spin_lock_init(&vdata->lock);
+ vdata->refcnt = ATOMIC_INIT(1);
+ vma->vm_private_data = vdata;
+
+ vma->vm_flags |= (VM_IO | VM_LOCKED | VM_RESERVED | VM_PFNMAP);
+ if (vdata->type == MSPEC_FETCHOP || vdata->type == MSPEC_UNCACHED)
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_ops = &mspec_vm_ops;
+
+ return 0;
+}
+
+static int
+fetchop_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_FETCHOP);
+}
+
+static int
+cached_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_CACHED);
+}
+
+static int
+uncached_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_UNCACHED);
+}
+
+static struct file_operations fetchop_fops = {
+ .owner = THIS_MODULE,
+ .mmap = fetchop_mmap
+};
+
+static struct miscdevice fetchop_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgi_fetchop",
+ .fops = &fetchop_fops
+};
+
+static struct file_operations cached_fops = {
+ .owner = THIS_MODULE,
+ .mmap = cached_mmap
+};
+
+static struct miscdevice cached_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "mspec_cached",
+ .fops = &cached_fops
+};
+
+static struct file_operations uncached_fops = {
+ .owner = THIS_MODULE,
+ .mmap = uncached_mmap
+};
+
+static struct miscdevice uncached_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "mspec_uncached",
+ .fops = &uncached_fops
+};
+
+/*
+ * mspec_init
+ *
+ * Called at boot time to initialize the mspec facility.
+ */
+static int __init
+mspec_init(void)
+{
+ int ret;
+ int nid;
+
+ /*
+ * The fetchop device only works on SN2 hardware, uncached and cached
+ * memory drivers should both be valid on all ia64 hardware
+ */
+ if (ia64_platform_is("sn2")) {
+ if (is_shub2()) {
+ ret = -ENOMEM;
+ for_each_online_node(nid) {
+ int actual_nid;
+
+ scratch_page[nid] = uncached_alloc_page(nid);
+ if (scratch_page[nid] == 0)
+ goto free_scratch_pages;
+ actual_nid = nasid_to_cnodeid(get_node_number(__pa(scratch_page[nid])));
+ if (actual_nid != nid)
+ goto free_scratch_pages;
+ }
+ }
+
+ ret = misc_register(&fetchop_miscdev);
+ if (ret) {
+ printk(KERN_ERR
+ "%s: failed to register device %i\n",
+ FETCHOP_ID, ret);
+ goto free_scratch_pages;
+ }
+ }
+ ret = misc_register(&cached_miscdev);
+ if (ret) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ CACHED_ID, ret);
+ misc_deregister(&fetchop_miscdev);
+ goto free_scratch_pages;
+ }
+ ret = misc_register(&uncached_miscdev);
+ if (ret) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ UNCACHED_ID, ret);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+ goto free_scratch_pages;
+ }
+
+ printk(KERN_INFO "%s %s initialized devices: %s %s %s\n",
+ MSPEC_BASENAME, REVISION,
+ ia64_platform_is("sn2") ? FETCHOP_ID : "",
+ CACHED_ID, UNCACHED_ID);
+
+ return 0;
+
+free_scratch_pages:
+ for_each_node(nid) {
+ if (scratch_page[nid] != 0)
+ uncached_free_page(scratch_page[nid]);
+ }
+ return ret;
+}
+
+static void __exit
+mspec_exit(void)
+{
+ int nid;
+
+ misc_deregister(&uncached_miscdev);
+ misc_deregister(&cached_miscdev);
+ if (ia64_platform_is("sn2")) {
+ misc_deregister(&fetchop_miscdev);
+
+ for_each_node(nid) {
+ if (scratch_page[nid] != 0)
+ uncached_free_page(scratch_page[nid]);
+ }
+ }
+}
+
+module_init(mspec_init);
+module_exit(mspec_exit);
+
+MODULE_AUTHOR("Silicon Graphics, Inc.");
+MODULE_DESCRIPTION("Driver for SGI SN special memory operations");
+MODULE_LICENSE("GPL");
+MODULE_INFO(supported, "external");
Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -199,6 +199,7 @@
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
+ long (*nopfn)(struct vm_area_struct * area, unsigned long address, int *type);
int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
#ifdef CONFIG_NUMA
int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2148,6 +2148,51 @@
}

/*
+ * do_no_pfn() tries to create a new page mapping for a page without
+ * a struct page backing it
+ *
+ * As this is called only for pages that do not currently exist, we
+ * do not need to flush old virtual caches or the TLB.
+ *
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
+ *
+ * It is expected that the ->nopfn handler always returns the same pfn
+ * for a given virtual mapping.
+ */
+static int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ int write_access)
+{
+ spinlock_t *ptl;
+ pte_t entry;
+ long pfn;
+ int ret = VM_FAULT_MINOR;
+
+ pte_unmap(page_table);
+ BUG_ON(!(vma->vm_flags & VM_PFNMAP));
+
+ pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK, &ret);
+ if (pfn == -ENOMEM)
+ return VM_FAULT_OOM;
+ if (pfn == -EFAULT)
+ return VM_FAULT_SIGBUS;
+ if (pfn < 0)
+ return VM_FAULT_SIGBUS;
+
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+ entry = pfn_pte(pfn, vma->vm_page_prot);
+ if (write_access)
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ set_pte_at(mm, address, page_table, entry);
+
+ pte_unmap_unlock(page_table, ptl);
+ return ret;
+}
+
+/*
* Fault of a previously existing named mapping. Repopulate the pte
* from the encoded file_pte if possible. This enables swappable
* nonlinear vmas.
@@ -2209,9 +2254,13 @@
old_entry = entry = *pte;
if (!pte_present(entry)) {
if (pte_none(entry)) {
- if (!vma->vm_ops || !vma->vm_ops->nopage)
+ if (!vma->vm_ops ||
+ (!vma->vm_ops->nopage && !vma->vm_ops->nopfn))
return do_anonymous_page(mm, vma, address,
pte, pmd, write_access);
+ if (vma->vm_ops->nopfn)
+ return do_no_pfn(mm, vma, address,
+ pte, pmd, write_access);
return do_no_page(mm, vma, address,
pte, pmd, write_access);
}


2006-03-17 00:35:24

by Andrew Morton

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler

Jes Sorensen <[email protected]> wrote:
>
> Hi,
>
> This is an updated version of the mspec driver (special memory
> support), formerly known as fetchop.
>
> With this version I have implemented a do_no_pfn() handler, similar to
> the do_no_page() handler but for pages which are not backed by a
> struct page.

hm. Is that a superset of ->nopage? Should we be looking at
migrating over to ->nopfn, retire ->nopage?

<looks at the ghastly stuff in do_no_page>

Maybe not...

> This avoids the trick used in earlier versions where the
> driver was allocating a dummy page and returning it back to the
> do_no_page() handler, which would then free it immediately. Hopefully
> this addresses the main concern there was with this driver in the
> past.
>
> The reason for taking the do_no_pfn() approach rather than
> remap_pfn_range() is that it needs to benefit from node locality of
> the pages on NUMA systems.
>
> While the driver is currently only used on SN2 hardware, it is placed
> in drivers/char as it should be possible and beneficial for other
> architectures to implement and use the uncached mode.
>
> Please let me know if there are any objections or comments etc. to
> this approach. If preferred I can split out the do_no_pfn part into a
> separate patch.

That would probably be best.

> +#include <linux/config.h>
> +#include <linux/types.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/init.h>
> +#include <linux/errno.h>
> +#include <linux/miscdevice.h>
> +#include <linux/spinlock.h>
> +#include <linux/mm.h>
> +#include <linux/proc_fs.h>
> +#include <linux/vmalloc.h>
> +#include <linux/bitops.h>
> +#include <linux/string.h>
> +#include <linux/slab.h>
> +#include <linux/seq_file.h>
> +#include <linux/efi.h>
> +#include <linux/numa.h>
> +#include <asm/page.h>
> +#include <asm/pal.h>
> +#include <asm/system.h>
> +#include <asm/pgtable.h>
> +#include <asm/atomic.h>
> +#include <asm/tlbflush.h>
> +#include <asm/uncached.h>
> +#include <asm/sn/addrs.h>
> +#include <asm/sn/arch.h>
> +#include <asm/sn/mspec.h>
> +#include <asm/sn/sn_cpuid.h>
> +#include <asm/sn/io.h>
> +#include <asm/sn/bte.h>
> +#include <asm/sn/shubio.h>

Wow.

> +static inline int
> +mspec_zero_block(unsigned long addr, int len)
> +{
> + int status;
> +
> + if (ia64_platform_is("sn2")) {

ia64 uses strcmp() in hotpaths to work out what sort of platform it's
running on? Surely someone has cached this info in a __read_mostly integer
somewhere?

> +static __inline__ pmd_t *

`inline', please.

> +mspec_get_pmd(struct mm_struct *mm, u64 address)

This function has no callers.

> +static int __init
> +mspec_init(void)
> +{
> + int ret;
> + int nid;
> +
> + /*
> + * The fetchop device only works on SN2 hardware, uncached and cached
> + * memory drivers should both be valid on all ia64 hardware
> + */
> + if (ia64_platform_is("sn2")) {
> + if (is_shub2()) {
> + ret = -ENOMEM;
> + for_each_online_node(nid) {
> + int actual_nid;
> +
> + scratch_page[nid] = uncached_alloc_page(nid);
> + if (scratch_page[nid] == 0)
> + goto free_scratch_pages;
> + actual_nid = nasid_to_cnodeid(get_node_number(__pa(scratch_page[nid])));
> + if (actual_nid != nid)
> + goto free_scratch_pages;
> + }
> + }
> +
> + ret = misc_register(&fetchop_miscdev);
> + if (ret) {
> + printk(KERN_ERR
> + "%s: failed to register device %i\n",
> + FETCHOP_ID, ret);
> + goto free_scratch_pages;
> + }
> + }
> + ret = misc_register(&cached_miscdev);
> + if (ret) {
> + printk(KERN_ERR "%s: failed to register device %i\n",
> + CACHED_ID, ret);
> + misc_deregister(&fetchop_miscdev);

You don't know that fetchop_miscdev was registered.
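
The problem is that the error path deregisters fetchop_miscdev unconditionally, although it is only registered on sn2. One way to structure the unwind so that only registered devices are torn down is sketched below in userspace C. fake_register() and fake_deregister() are hypothetical stand-ins for misc_register()/misc_deregister() that merely track state, and have_fetchop stands in for the platform check. This is an illustration of the pattern, not the code from the updated patch.

```c
#include <assert.h>

/* Hypothetical device record: tracks registration state only. */
struct fake_dev {
	int registered;
	int fail;	/* force fake_register() to fail, for testing */
};

int fake_register(struct fake_dev *d)
{
	if (d->fail)
		return -1;
	d->registered = 1;
	return 0;
}

void fake_deregister(struct fake_dev *d)
{
	/* Deregistering a device that was never registered is the bug. */
	assert(d->registered);
	d->registered = 0;
}

/* have_fetchop plays the role of the "is this sn2?" check. */
int init_devices(int have_fetchop, struct fake_dev *fetchop,
		 struct fake_dev *cached, struct fake_dev *uncached)
{
	int ret;

	if (have_fetchop) {
		ret = fake_register(fetchop);
		if (ret)
			return ret;
	}
	ret = fake_register(cached);
	if (ret)
		goto undo_fetchop;
	ret = fake_register(uncached);
	if (ret)
		goto undo_cached;
	return 0;

undo_cached:
	fake_deregister(cached);
undo_fetchop:
	if (have_fetchop)
		fake_deregister(fetchop);
	return ret;
}
```

Stacking the labels in reverse registration order keeps each failure point from touching anything that was set up after it, and the have_fetchop test guards the one device that is conditional.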


2006-03-17 01:04:29

by Linus Torvalds

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler



On Thu, 16 Mar 2006, Andrew Morton wrote:
>
> hm. Is that a superset of ->nopage? Should we be looking at
> migrating over to ->nopfn, retire ->nopage?
>
> <looks at the ghastly stuff in do_no_page>
>
> Maybe not...

Yeah, absolutely _not_.

If we wouldn't pass the "struct page" around, we wouldn't have anything to
synchronize with, and each nopage() function would have to do rmap stuff.

That's actually how nopage() worked a long time ago (not rmap, but it was
up to the low-level function to do all the page table logic etc).
Switching to returning a structured return value and letting the generic
VM code handle all the locking and the races was a _huge_ improvement.

So yes, the modern "->nopage()" interface is less flexible, but it's less
flexible for a very good reason.

Quite frankly, I don't think nopfn() is a good interface. It's only usable
for one single thing, so trying to claim that it's a generic VM op is
really not valid. If (and that's a big if) we need this interface, we
should just do it inside mm/memory.c instead of playing games as if it was
generic.

Linus

2006-03-17 02:13:25

by Robin Holt

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler

On Thu, Mar 16, 2006 at 05:04:14PM -0800, Linus Torvalds wrote:
>
>
> On Thu, 16 Mar 2006, Andrew Morton wrote:
> >
> > hm. Is that a superset of ->nopage? Should we be looking at
> > migrating over to ->nopfn, retire ->nopage?
> >
> > <looks at the ghastly stuff in do_no_page>
> >
> > Maybe not...
>
> Yeah, absolutely _not_.
>
> If we wouldn't pass the "struct page" around, we wouldn't have anything to
> synchronize with, and each nopage() function would have to do rmap stuff.
>
> That's actually how nopage() worked a long time ago (not rmap, but it was
> up to the low-level function to do all the page table logic etc).
> Switching to returning a structured return value and letting the generic
> VM code handle all the locking and the races was a _huge_ improvement.
>
> So yes, the modern "->nopage()" interface is less flexible, but it's less
> flexible for a very good reason.
>
> Quite frankly, I don't think nopfn() is a good interface. It's only usable
> for one single thing, so trying to claim that it's a generic VM op is
> really not valid. If (and that's a big if) we need this interface, we
> should just do it inside mm/memory.c instead of playing games as if it was
> generic.

My understanding was Carsten Otte was also interested in a do_no_pfn() for
execute-in-place.

Carsten, is that still your intention?

Thanks,
Robin

2006-03-17 04:58:47

by Benjamin Herrenschmidt

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler


> Quite frankly, I don't think nopfn() is a good interface. It's only usable
> for one single thing, so trying to claim that it's a generic VM op is
> really not valid. If (and that's a big if) we need this interface, we
> should just do it inside mm/memory.c instead of playing games as if it was
> generic.

Or just use sparsemem and create struct pages for your hw :) we do that
for SPUs on Cell, works like a charm.

Ben.


2006-03-17 09:15:59

by Jes Sorensen

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler

Benjamin Herrenschmidt wrote:
>> Quite frankly, I don't think nopfn() is a good interface. It's only usable
>> for one single thing, so trying to claim that it's a generic VM op is
>> really not valid. If (and that's a big if) we need this interface, we
>> should just do it inside mm/memory.c instead of playing games as if it was
>> generic.
>
> Or just use sparsemem and create struct pages for your hw :) we do that
> for SPUs on Cell, works like a charm.

Well then the question is, would it simplify the code using no_pfn in
this case? Hacking up fake struct page entries seems even more of a
hack to me.

Cheers,
Jes

2006-03-17 09:42:18

by Jes Sorensen

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler

Linus Torvalds wrote:
> Quite frankly, I don't think nopfn() is a good interface. It's only usable
> for one single thing, so trying to claim that it's a generic VM op is
> really not valid. If (and that's a big if) we need this interface, we
> should just do it inside mm/memory.c instead of playing games as if it was
> generic.

Hi Linus,

As Robin mentioned I believe Carsten was also looking for this interface
and I received an email from Bjorn Helgaas after posting this stating
that he was also looking for it, so there may be several users for it.

I believe it was originally Christoph who suggested we took this
approach to avoid playing tricks on do_no_page. However, if you have a
suggestion for how to do it in a better way, I shall be happy to try
and implement it that way instead, if you'll share the details.

Cheers,
Jes

2006-03-17 12:28:22

by Jes Sorensen

Subject: [patch 1/2] do_no_pfn handler (was: Re: [patch] mspec - special memory driver and do_no_pfn handler)

>>>>> "Andrew" == Andrew Morton <[email protected]> writes:

Andrew> Jes Sorensen <[email protected]> wrote:
>> Hi,
>>
>> This is an updated version of the mspec driver (special memory
>> support), formerly known as fetchop.
>>
>> With this version I have implemented a do_no_pfn() handler, similar
>> to the do_no_page() handler but for pages which are not backed by a
>> struct page.

Andrew> hm. Is that a superset of ->nopage? Should we be looking at
Andrew> migrating over to ->nopfn, retire ->nopage?

Andrew> <looks at the ghastly stuff in do_no_page>

It wasn't designed to handle all possible cases as a do_no_page
replacement :) My initial thought was that adding an extra op to the
vm_operations_struct wouldn't be very expensive, since we don't
allocate all that many of them.

>> Please let me know if there are any objections or comments etc. to
>> this approach. If preferred I can split out the do_no_pfn part into
>> a separate patch.

Andrew> That would probably be best.

Here goes - if Linus comes back with a suggestion for how to do it in
a different fashion it may obsolete this patch, but at least until
then.

The cleaned up version of the actual mspec driver will be in a
separate mail.

Cheers,
Jes

Implement do_no_pfn() for handling mapping of memory without a struct
page backing it.

Signed-off-by: Jes Sorensen <[email protected]>

---
include/linux/mm.h | 1 +
mm/memory.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 51 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -199,6 +199,7 @@
void (*open)(struct vm_area_struct * area);
void (*close)(struct vm_area_struct * area);
struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
+ long (*nopfn)(struct vm_area_struct * area, unsigned long address, int *type);
int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
#ifdef CONFIG_NUMA
int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -2148,6 +2148,51 @@
}

/*
+ * do_no_pfn() tries to create a new page mapping for a page without
+ * a struct page backing it
+ *
+ * As this is called only for pages that do not currently exist, we
+ * do not need to flush old virtual caches or the TLB.
+ *
+ * We enter with non-exclusive mmap_sem (to exclude vma changes,
+ * but allow concurrent faults), and pte mapped but not yet locked.
+ * We return with mmap_sem still held, but pte unmapped and unlocked.
+ *
+ * It is expected that the ->nopfn handler always returns the same pfn
+ * for a given virtual mapping.
+ */
+static int do_no_pfn(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, pte_t *page_table, pmd_t *pmd,
+ int write_access)
+{
+ spinlock_t *ptl;
+ pte_t entry;
+ long pfn;
+ int ret = VM_FAULT_MINOR;
+
+ pte_unmap(page_table);
+ BUG_ON(!(vma->vm_flags & VM_PFNMAP));
+
+ pfn = vma->vm_ops->nopfn(vma, address & PAGE_MASK, &ret);
+ if (pfn == -ENOMEM)
+ return VM_FAULT_OOM;
+ if (pfn == -EFAULT)
+ return VM_FAULT_SIGBUS;
+ if (pfn < 0)
+ return VM_FAULT_SIGBUS;
+
+ page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
+
+ entry = pfn_pte(pfn, vma->vm_page_prot);
+ if (write_access)
+ entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+ set_pte_at(mm, address, page_table, entry);
+
+ pte_unmap_unlock(page_table, ptl);
+ return ret;
+}
+
+/*
* Fault of a previously existing named mapping. Repopulate the pte
* from the encoded file_pte if possible. This enables swappable
* nonlinear vmas.
@@ -2209,9 +2254,13 @@
old_entry = entry = *pte;
if (!pte_present(entry)) {
if (pte_none(entry)) {
- if (!vma->vm_ops || !vma->vm_ops->nopage)
+ if (!vma->vm_ops ||
+ (!vma->vm_ops->nopage && !vma->vm_ops->nopfn))
return do_anonymous_page(mm, vma, address,
pte, pmd, write_access);
+ if (vma->vm_ops->nopfn)
+ return do_no_pfn(mm, vma, address,
+ pte, pmd, write_access);
return do_no_page(mm, vma, address,
pte, pmd, write_access);
}

2006-03-17 12:29:30

by Carsten Otte

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler

Linus Torvalds wrote:
> Quite frankly, I don't think nopfn() is a good interface. It's only usable
> for one single thing, so trying to claim that it's a generic VM op is
> really not valid. If (and that's a big if) we need this interface, we
> should just do it inside mm/memory.c instead of playing games as if it was
> generic.
Execute in place would be the second single thing we'll need it for. Also,
I remember a statement from Arnd that it has value for SPUfs on Cell which
does count as single thing #3. With three single things at hand, I believe
there is some sense in making it "a generic thing".

Carsten

2006-03-17 12:38:55

by Jes Sorensen

Subject: [patch 2/2] mspec driver (was: Re: [patch] mspec - special memory driver and do_no_pfn handler)

>>>>> "Andrew" == Andrew Morton <[email protected]> writes:

Andrew> Jes Sorensen <[email protected]> wrote:
> +#include <asm/sn/io.h>
> +#include <asm/sn/bte.h>
> +#include <asm/sn/shubio.h>

Andrew> Wow.

Guess that's what happens when you have a thing sitting for such a long
time and move stuff out of it bit by bit. My bad. I trimmed it down a
bit removing all the /proc and EFI related stuff.

>> +static inline int
>> +mspec_zero_block(unsigned long addr, int len)
>> +{
>> + int status;
>> +
>> + if (ia64_platform_is("sn2")) {

Andrew> ia64 uses strcmp() in hotpaths to work out what sort of
Andrew> platform it's running on? Surely someone has cached this info
Andrew> in a __read_mostly integer somewhere?

For some reason I had a memory that this had been done at the higher
level, but it doesn't seem to be as you noted. I've changed it to
cache it locally in the driver since I don't think we want to grow the
code in the other places it's used.
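
As an illustration of what caching the platform type locally could look like, here is a minimal userspace sketch. It is not taken from the updated patch; platform_name, cache_platform_type() and on_sn2() are hypothetical names assumed for the example, with platform_name standing in for whatever string ia64_platform_is() compares against.

```c
#include <string.h>

/* Hypothetical stand-in for the platform name the kernel would report. */
static const char *platform_name = "sn2";

/* Cached result, filled in once at init time; 0 until then. */
static int is_sn2;

/* The string compare happens once here, e.g. from the init function ... */
void cache_platform_type(void)
{
	is_sn2 = (strcmp(platform_name, "sn2") == 0);
}

/* ... so hot paths such as the page-zeroing helper only test an integer. */
int on_sn2(void)
{
	return is_sn2;
}
```

The same cached flag can then guard the fetchop deregistration in the init error path and in the exit function, so the decision is made in exactly one place.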

>> +static __inline__ pmd_t *

Andrew> `inline', please.

Done

>> +mspec_get_pmd(struct mm_struct *mm, u64 address)

Andrew> This function has no callers.

Gone

Andrew> You don't know that fetchop_miscdev was registered.

You're right. Fixed using the cached system type.

Updated patch below, thanks for the comments.

Cheers,
Jes

----

This patch implements the special memory driver (mspec) based on the
do_no_pfn approach. The driver is currently used only on SN2 hardware
with special fetchop support but could be beneficial on other
architectures using the uncached mode.

Signed-off-by: Jes Sorensen <[email protected]>

drivers/char/Kconfig | 8
drivers/char/Makefile | 1
drivers/char/mspec.c | 422 ++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 431 insertions(+)

Index: linux-2.6/drivers/char/Kconfig
===================================================================
--- linux-2.6.orig/drivers/char/Kconfig
+++ linux-2.6/drivers/char/Kconfig
@@ -421,6 +421,14 @@
If you have an SGI Altix with an attached SABrick
say Y or M here, otherwise say N.

+config MSPEC
+ tristate " Memory special operations driver"
+ depends on IA64
+ help
+ If you have an ia64 and you want to enable memory special
+ operations support (formerly known as fetchop), say Y here,
+ otherwise say N.
+
source "drivers/serial/Kconfig"

config UNIX98_PTYS
Index: linux-2.6/drivers/char/Makefile
===================================================================
--- linux-2.6.orig/drivers/char/Makefile
+++ linux-2.6/drivers/char/Makefile
@@ -49,6 +49,7 @@
obj-$(CONFIG_VIOTAPE) += viotape.o
obj-$(CONFIG_HVCS) += hvcs.o
obj-$(CONFIG_SGI_MBCS) += mbcs.o
+obj-$(CONFIG_MSPEC) += mspec.o

obj-$(CONFIG_PRINTER) += lp.o
obj-$(CONFIG_TIPAR) += tipar.o
Index: linux-2.6/drivers/char/mspec.c
===================================================================
--- /dev/null
+++ linux-2.6/drivers/char/mspec.c
@@ -0,0 +1,422 @@
+/*
+ * Copyright (C) 2001-2006 Silicon Graphics, Inc. All rights
+ * reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ */
+
+/*
+ * SN Platform Special Memory (mspec) Support
+ *
+ * This driver exports the SN special memory (mspec) facility to user
+ * processes.
+ * There are three types of memory made available through this driver:
+ * fetchops, uncached and cached.
+ *
+ * Fetchops are atomic memory operations that are implemented in the
+ * memory controller on SGI SN hardware.
+ *
+ * Uncached pages are used for the memory write-combining feature of
+ * the ia64 cpu.
+ *
+ * Cached pages are used for areas of memory that are used as cached addresses
+ * on our partition and used as uncached addresses from other partitions.
+ * Due to a design constraint of the SN2 Shub, you can not have processors
+ * on the same FSB perform both a cached and uncached reference to the
+ * same cache line. These special memory cached regions prevent the
+ * kernel from ever dropping in a TLB entry and therefore prevent the
+ * processor from ever speculating a cache line from this page.
+ */
+
+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/miscdevice.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/vmalloc.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/numa.h>
+#include <asm/page.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/atomic.h>
+#include <asm/tlbflush.h>
+#include <asm/uncached.h>
+#include <asm/sn/addrs.h>
+#include <asm/sn/arch.h>
+#include <asm/sn/mspec.h>
+#include <asm/sn/sn_cpuid.h>
+#include <asm/sn/io.h>
+#include <asm/sn/bte.h>
+#include <asm/sn/shubio.h>
+
+
+#define FETCHOP_ID "SGI Fetchop,"
+#define CACHED_ID "Cached,"
+#define UNCACHED_ID "Uncached"
+#define REVISION "4.0"
+#define MSPEC_BASENAME "mspec"
+
+/*
+ * Page types allocated by the device.
+ */
+enum {
+ MSPEC_FETCHOP = 1,
+ MSPEC_CACHED,
+ MSPEC_UNCACHED
+};
+
+static int is_sn2;
+
+/*
+ * One of these structures is allocated when an mspec region is mmaped. The
+ * structure is pointed to by the vma->vm_private_data field in the vma struct.
+ * This structure is used to record the addresses of the mspec pages.
+ */
+struct vma_data {
+ atomic_t refcnt; /* Number of vmas sharing the data. */
+ spinlock_t lock; /* Serialize access to the vma. */
+ int count; /* Number of pages allocated. */
+ int type; /* Type of pages allocated. */
+ unsigned long maddr[0]; /* Array of MSPEC addresses. */
+};
+
+/* used on shub2 to clear FOP cache in the HUB */
+static unsigned long scratch_page[MAX_NUMNODES];
+#define SH2_AMO_CACHE_ENTRIES 4
+
+static inline int
+mspec_zero_block(unsigned long addr, int len)
+{
+ int status;
+
+ if (is_sn2) {
+ if (is_shub2()) {
+ int nid;
+ void *p;
+ int i;
+
+ nid = nasid_to_cnodeid(get_node_number(__pa(addr)));
+ p = (void *)TO_AMO(scratch_page[nid]);
+
+ for (i = 0; i < SH2_AMO_CACHE_ENTRIES; i++) {
+ FETCHOP_LOAD_OP(p, FETCHOP_LOAD);
+ p += FETCHOP_VAR_SIZE;
+ }
+ }
+
+ status = bte_copy(0, addr & ~__IA64_UNCACHED_OFFSET, len,
+ BTE_WACQUIRE | BTE_ZERO_FILL, NULL);
+ } else {
+ memset((char *) addr, 0, len);
+ status = 0;
+ }
+ return status;
+}
+
+/*
+ * mspec_open
+ *
+ * Called when a device mapping is created by a means other than mmap
+ * (via fork, etc.). Increments the reference count on the underlying
+ * mspec data so it is not freed prematurely.
+ */
+static void
+mspec_open(struct vm_area_struct *vma)
+{
+ struct vma_data *vdata;
+
+ vdata = vma->vm_private_data;
+ atomic_inc(&vdata->refcnt);
+}
+
+/*
+ * mspec_close
+ *
+ * Called when unmapping a device mapping. Frees all mspec pages
+ * belonging to the vma.
+ */
+static void
+mspec_close(struct vm_area_struct *vma)
+{
+ struct vma_data *vdata;
+ int i, pages, result, vdata_size;
+
+ vdata = vma->vm_private_data;
+ if (!atomic_dec_and_test(&vdata->refcnt))
+ return;
+
+ pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ vdata_size = sizeof(struct vma_data) + pages * sizeof(long);
+ for (i = 0; i < pages; i++) {
+ if (vdata->maddr[i] == 0)
+ continue;
+ /*
+ * Clear the page before sticking it back
+ * into the pool.
+ */
+ result = mspec_zero_block(vdata->maddr[i], PAGE_SIZE);
+ if (!result)
+ uncached_free_page(vdata->maddr[i]);
+ else
+ printk(KERN_WARNING "mspec_close(): "
+ "failed to zero page, status %i\n",
+ result);
+ }
+
+ if (vdata_size <= PAGE_SIZE)
+ kfree(vdata);
+ else
+ vfree(vdata);
+}
+
+
+/*
+ * mspec_nopfn
+ *
+ * Creates a mspec page and maps it to user space.
+ */
+static long
+mspec_nopfn(struct vm_area_struct *vma, unsigned long address, int *unused)
+{
+ unsigned long paddr, maddr;
+ unsigned long pfn;
+ int index;
+ struct vma_data *vdata = vma->vm_private_data;
+
+ index = (address - vma->vm_start) >> PAGE_SHIFT;
+ maddr = (volatile unsigned long) vdata->maddr[index];
+ if (maddr == 0) {
+ maddr = uncached_alloc_page(numa_node_id());
+ if (maddr == 0)
+ return -ENOMEM;
+
+ spin_lock(&vdata->lock);
+ if (vdata->maddr[index] == 0) {
+ vdata->count++;
+ vdata->maddr[index] = maddr;
+ } else {
+ uncached_free_page(maddr);
+ maddr = vdata->maddr[index];
+ }
+ spin_unlock(&vdata->lock);
+ }
+
+ if (vdata->type == MSPEC_FETCHOP)
+ paddr = TO_AMO(maddr);
+ else
+ paddr = __pa(TO_CAC(maddr));
+
+ pfn = paddr >> PAGE_SHIFT;
+
+ return pfn;
+}
+
+static struct vm_operations_struct mspec_vm_ops = {
+ .open = mspec_open,
+ .close = mspec_close,
+ .nopfn = mspec_nopfn
+};
+
+/*
+ * mspec_mmap
+ *
+ * Called when mmaping the device. Initializes the vma with a fault handler
+ * and private data structure necessary to allocate, track, and free the
+ * underlying pages.
+ */
+static int
+mspec_mmap(struct file *file, struct vm_area_struct *vma, int type)
+{
+ struct vma_data *vdata;
+ int pages, vdata_size;
+
+ if (vma->vm_pgoff != 0)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_SHARED) == 0)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_WRITE) == 0)
+ return -EPERM;
+
+ pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ vdata_size = sizeof(struct vma_data) + pages * sizeof(long);
+ if (vdata_size <= PAGE_SIZE)
+ vdata = kmalloc(vdata_size, GFP_KERNEL);
+ else
+ vdata = vmalloc(vdata_size);
+ if (!vdata)
+ return -ENOMEM;
+ memset(vdata, 0, vdata_size);
+
+ vdata->type = type;
+ spin_lock_init(&vdata->lock);
+ atomic_set(&vdata->refcnt, 1);
+ vma->vm_private_data = vdata;
+
+ vma->vm_flags |= (VM_IO | VM_LOCKED | VM_RESERVED | VM_PFNMAP);
+ if (vdata->type == MSPEC_FETCHOP || vdata->type == MSPEC_UNCACHED)
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_ops = &mspec_vm_ops;
+
+ return 0;
+}
+
+static int
+fetchop_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_FETCHOP);
+}
+
+static int
+cached_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_CACHED);
+}
+
+static int
+uncached_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_UNCACHED);
+}
+
+static struct file_operations fetchop_fops = {
+ .owner = THIS_MODULE,
+ .mmap = fetchop_mmap
+};
+
+static struct miscdevice fetchop_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgi_fetchop",
+ .fops = &fetchop_fops
+};
+
+static struct file_operations cached_fops = {
+ .owner = THIS_MODULE,
+ .mmap = cached_mmap
+};
+
+static struct miscdevice cached_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "mspec_cached",
+ .fops = &cached_fops
+};
+
+static struct file_operations uncached_fops = {
+ .owner = THIS_MODULE,
+ .mmap = uncached_mmap
+};
+
+static struct miscdevice uncached_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "mspec_uncached",
+ .fops = &uncached_fops
+};
+
+/*
+ * mspec_init
+ *
+ * Called at boot time to initialize the mspec facility.
+ */
+static int __init
+mspec_init(void)
+{
+ int ret;
+ int nid;
+
+ /*
+ * The fetchop device only works on SN2 hardware; the uncached and
+ * cached memory drivers should both be valid on all ia64 hardware.
+ */
+ if (ia64_platform_is("sn2")) {
+ is_sn2 = 1;
+ if (is_shub2()) {
+ ret = -ENOMEM;
+ for_each_online_node(nid) {
+ int actual_nid;
+ int nasid;
+ unsigned long phys;
+
+ scratch_page[nid] = uncached_alloc_page(nid);
+ if (scratch_page[nid] == 0)
+ goto free_scratch_pages;
+ phys = __pa(scratch_page[nid]);
+ nasid = get_node_number(phys);
+ actual_nid = nasid_to_cnodeid(nasid);
+ if (actual_nid != nid)
+ goto free_scratch_pages;
+ }
+ }
+
+ ret = misc_register(&fetchop_miscdev);
+ if (ret) {
+ printk(KERN_ERR
+ "%s: failed to register device %i\n",
+ FETCHOP_ID, ret);
+ goto free_scratch_pages;
+ }
+ }
+ ret = misc_register(&cached_miscdev);
+ if (ret) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ CACHED_ID, ret);
+ if (is_sn2)
+ misc_deregister(&fetchop_miscdev);
+ goto free_scratch_pages;
+ }
+ ret = misc_register(&uncached_miscdev);
+ if (ret) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ UNCACHED_ID, ret);
+ misc_deregister(&cached_miscdev);
+ if (is_sn2)
+ misc_deregister(&fetchop_miscdev);
+ goto free_scratch_pages;
+ }
+
+ printk(KERN_INFO "%s %s initialized devices: %s %s %s\n",
+ MSPEC_BASENAME, REVISION, is_sn2 ? FETCHOP_ID : "",
+ CACHED_ID, UNCACHED_ID);
+
+ return 0;
+
+free_scratch_pages:
+ for_each_node(nid) {
+ if (scratch_page[nid] != 0)
+ uncached_free_page(scratch_page[nid]);
+ }
+ return ret;
+}
+
+static void __exit
+mspec_exit(void)
+{
+ int nid;
+
+ misc_deregister(&uncached_miscdev);
+ misc_deregister(&cached_miscdev);
+ if (is_sn2) {
+ misc_deregister(&fetchop_miscdev);
+
+ for_each_node(nid) {
+ if (scratch_page[nid] != 0)
+ uncached_free_page(scratch_page[nid]);
+ }
+ }
+}
+
+module_init(mspec_init);
+module_exit(mspec_exit);
+
+MODULE_AUTHOR("Silicon Graphics, Inc.");
+MODULE_DESCRIPTION("Driver for SGI SN special memory operations");
+MODULE_LICENSE("GPL");
+MODULE_INFO(supported, "external");
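
For readers following mspec_nopfn() in the patch above, here is a minimal user-space sketch of the pattern it uses: allocate outside the lock, then recheck the slot under the lock and discard the new page if another caller got there first. Everything here is a hypothetical analogue (`struct region`, `region_init()`, `install_or_discard()`, `fault_in()`, `malloc()` in place of uncached_alloc_page(), a pthread mutex in place of the spinlock); it is not the driver code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NPAGES 4

/* Hypothetical user-space analogue of struct vma_data. */
struct region {
	pthread_mutex_t lock;	/* protects slot[] and count */
	int count;		/* number of slots filled in */
	void *slot[NPAGES];	/* lazily allocated "pages" */
};

static int discarded;	/* candidates freed because another caller won */

static void region_init(struct region *r)
{
	memset(r, 0, sizeof(*r));
	pthread_mutex_init(&r->lock, NULL);
}

/*
 * Install candidate in slot[index] unless the slot was filled in the
 * meantime; in that case free the candidate and return the winner.
 */
static void *install_or_discard(struct region *r, int index, void *candidate)
{
	void *page;

	pthread_mutex_lock(&r->lock);
	if (r->slot[index] == NULL) {
		r->slot[index] = candidate;
		r->count++;
	} else {
		free(candidate);
		discarded++;
	}
	page = r->slot[index];
	pthread_mutex_unlock(&r->lock);
	return page;
}

/* Fault-style lookup: allocate outside the lock, recheck under it. */
static void *fault_in(struct region *r, int index)
{
	void *page;

	assert(index >= 0 && index < NPAGES);
	/*
	 * Unlocked peek, mirroring the driver; a portable multi-threaded
	 * program would use an atomic load here.
	 */
	page = r->slot[index];
	if (page == NULL) {
		page = malloc(64);
		if (page == NULL)
			return NULL;
		page = install_or_discard(r, index, page);
	}
	return page;
}
```

Two calls to `fault_in()` for the same index return the same pointer and bump `count` once; a late candidate handed to `install_or_discard()` is freed and the existing page is returned.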

2006-03-17 13:28:56

by Carsten Otte

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler

Jes Sorensen wrote:
> Well then the question is, would it simplify the code using no_pfn in
> this case? Hacking up fake struct page entries seems even more of a
> hack to me.
I second that. That's where we are with our dcss xip thing today.
It _is_ a hack to have a struct page that you don't need.

Carsten

2006-03-17 13:36:46

by Nick Piggin

Subject: Re: [patch 2/2] mspec driver

Jes Sorensen wrote:

> +static int
> +mspec_mmap(struct file *file, struct vm_area_struct *vma, int type)

...

> + vma->vm_flags |= (VM_IO | VM_LOCKED | VM_RESERVED | VM_PFNMAP);

VM_PFNMAP actually has a fairly specific meaning [unlike the rest of
them :)] so you should be careful with it. Actually if you set vm_pgoff
in the right way, then that should enable you to do COWs on these areas
if that is what you want.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-17 13:51:07

by Carsten Otte

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler

Jes Sorensen wrote:
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -199,6 +199,7 @@
> void (*open)(struct vm_area_struct * area);
> void (*close)(struct vm_area_struct * area);
> struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
> + long (*nopfn)(struct vm_area_struct * area, unsigned long address, int *type);
> int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
> #ifdef CONFIG_NUMA
> int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
If you use address as parameter to nopfn, it won't work with highmem
on 32bit systems. Alternative would be to use (unsigned long) phys. page
frame number.

Your work in memory.c looks like the right thing to do.
Afaics it will work for xip as well once I figure how to
do COW. Cool stuff :-).

Carsten

2006-03-17 13:52:46

by Carsten Otte

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler

Carsten Otte wrote:
> If you use address as parameter to nopfn, it won't work with highmem
> on 32bit systems. Alternative would be to use (unsigned long) phys. page
> frame number.
Me stupid. I mean return value not address parameter.

Carsten

2006-03-17 13:56:08

by Nick Piggin

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler

Carsten Otte wrote:
> Jes Sorensen wrote:
>
>>Index: linux-2.6/include/linux/mm.h
>>===================================================================
>>--- linux-2.6.orig/include/linux/mm.h
>>+++ linux-2.6/include/linux/mm.h
>>@@ -199,6 +199,7 @@
>> void (*open)(struct vm_area_struct * area);
>> void (*close)(struct vm_area_struct * area);
>> struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
>>+ long (*nopfn)(struct vm_area_struct * area, unsigned long address, int *type);
>> int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
>> #ifdef CONFIG_NUMA
>> int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
>
> If you use address as parameter to nopfn, it won't work with highmem
> on 32bit systems. Alternative would be to use (unsigned long) phys. page
> frame number.
>

It is vaddr, so that should be OK. Return is pfn, which is the important one.
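
A small worked example of why the return type matters, under the assumption of a 32-bit system with 4 KB pages and physical memory above 4 GB (as with highmem/PAE): the physical address no longer fits in a 32-bit unsigned long, but the page frame number does. The helper names are made up for the illustration and `uint32_t` models the 32-bit unsigned long.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12	/* assumption: 4 KB pages, as on 32-bit x86 */

/* A 32-bit "unsigned long" is modelled as uint32_t. */
static uint32_t phys_to_pfn(uint64_t phys)
{
	uint64_t pfn = phys >> PAGE_SHIFT;

	assert(pfn <= UINT32_MAX);	/* holds for up to 44 physical bits */
	return (uint32_t)pfn;
}

static uint64_t pfn_to_phys(uint32_t pfn)
{
	return (uint64_t)pfn << PAGE_SHIFT;
}

/* What is left if the physical address itself is squeezed into 32 bits. */
static uint32_t truncate_phys(uint64_t phys)
{
	return (uint32_t)phys;
}
```

For a page at physical address 0x240000000 (9 GB), `truncate_phys()` yields 0x40000000, which names a different page, while `phys_to_pfn()` yields 0x240000 and converts back to the original address.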

> Your work in memory.c looks like the right thing to do.
> Afaics it will work for xip as well once I figure how to
> do COW. Cool stuff :-).
>

I think you may be able to use VM_PFNMAP in much the same way as remap_pfn_range
does. You won't be able to support get_user_pages, of course.

--
SUSE Labs, Novell Inc.

2006-03-17 13:59:30

by Jes Sorensen

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler

Carsten Otte wrote:
> Jes Sorensen wrote:
>> Index: linux-2.6/include/linux/mm.h
>> ===================================================================
>> --- linux-2.6.orig/include/linux/mm.h
>> +++ linux-2.6/include/linux/mm.h
>> @@ -199,6 +199,7 @@
>> void (*open)(struct vm_area_struct * area);
>> void (*close)(struct vm_area_struct * area);
>> struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
>> + long (*nopfn)(struct vm_area_struct * area, unsigned long address, int *type);
>> int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
>> #ifdef CONFIG_NUMA
>> int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
> If you use address as parameter to nopfn, it won't work with highmem
> on 32bit systems. Alternative would be to use (unsigned long) phys. page
> frame number.

Hi Carsten,

The address comes from handle_pte_fault() passing it to do_no_pfn()
passing it to ->nopfn(), ie. it's the faulted address, not the physical
one.
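
To make that concrete, this is roughly what a ->nopfn() handler does with the faulting virtual address before it looks up or allocates a page, mirroring the index calculation in mspec_nopfn(). It is a stand-alone sketch: `struct vma_range` and `fault_index()` are hypothetical, and the 16 KB page size is an assumption for the example.

```c
#include <assert.h>

#define PAGE_SHIFT 14	/* assumption: 16 KB pages */

/* Hypothetical cut-down stand-in for struct vm_area_struct. */
struct vma_range {
	unsigned long vm_start;	/* first virtual address of the mapping */
	unsigned long vm_end;	/* one past the last virtual address */
};

/* Page index of a faulting virtual address within its mapping. */
static unsigned long fault_index(const struct vma_range *vma,
				 unsigned long address)
{
	assert(address >= vma->vm_start && address < vma->vm_end);
	return (address - vma->vm_start) >> PAGE_SHIFT;
}
```

The index only selects a slot in the per-vma array; the physical page, and hence the pfn that is returned, comes from that slot.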

> Your work in memory.c looks like the right thing to do.
> Afaics it will work for xip as well once I figure how to
> do COW. Cool stuff :-).

:-)

Cheers,
Jes


2006-03-17 14:04:14

by Jes Sorensen

Subject: Re: [patch 2/2] mspec driver

>>>>> "Nick" == Nick Piggin <[email protected]> writes:

Nick> Jes Sorensen wrote:
>> + vma->vm_flags |= (VM_IO | VM_LOCKED | VM_RESERVED | VM_PFNMAP);

Nick> VM_PFNMAP actually has a fairly specific meaning [unlike the
Nick> rest of them :)] so you should be careful with it. Actually if
Nick> you set vm_pgoff in the right way, then that should enable you
Nick> to do COWs on these areas if that is what you want.

Yup, I went through that when I started using it. I think you guided
me through it :-)

We don't want COW here as the access is backed by special behavior in
the memory controller. We only allow shared mappings for that reason.

Cheers,
Jes

2006-03-17 14:09:20

by Nick Piggin

Subject: Re: [patch 2/2] mspec driver

Jes Sorensen wrote:
>>>>>>"Nick" == Nick Piggin <[email protected]> writes:
>
>
> Nick> Jes Sorensen wrote:
>
>>>+ vma->vm_flags |= (VM_IO | VM_LOCKED | VM_RESERVED | VM_PFNMAP);
>
>
> Nick> VM_PFNMAP actually has a fairly specific meaning [unlike the
> Nick> rest of them :)] so you should be careful with it. Actually if
> Nick> you set vm_pgoff in the right way, then that should enable you
> Nick> to do COWs on these areas if that is what you want.
>
> Yup, I went through that when I started using it. I think you guided
> me through it :-)
>
> We don't want COW here as the access is backed by special behavior in
> the memory controller. We only allow shared mappings for that reason.
>

No problem, I think you should just stop using the VM_PFNMAP flag then.
[Linus should jump in here if I'm wrong ;)]

--
SUSE Labs, Novell Inc.

2006-03-17 14:12:20

by Jes Sorensen

Subject: Re: [patch 2/2] mspec driver

Nick Piggin wrote:
> Jes Sorensen wrote:
>> We don't want COW here as the access is backed by special behavior in
>> the memory controller. We only allow shared mappings for that reason.
>>
> No problem, I think you should just stop using the VM_PFNMAP flag then.
> [Linus should jump in here if I'm wrong ;)]

I'd have to go back and find the discussion to verify, but if I
remember correctly the conclusion was that I needed to use it in
order to make sure that vm_normal_page() didn't start thinking it was
in fact a real page, ie. VM_PFNMAP + never a COW mapping..

Cheers,
Jes


2006-03-17 14:19:31

by Nick Piggin

[permalink] [raw]
Subject: Re: [patch 2/2] mspec driver

Jes Sorensen wrote:
> Nick Piggin wrote:
>
>>No problem, I think you should just stop using the VM_PFNMAP flag then.
>>[Linus should jump in here if I'm wrong ;)]
>
>
> I'd have to go back and find the discussion to verify, but if I
> remember correctly the conclusion was that I needed to use it in
> order to make sure that vm_normal_page() didn't start thinking it was
> in fact a real page, ie. VM_PFNMAP + never a COW mapping..
>

Oh of course: the primary purpose for VM_PFNMAP is to signal a pfn
mapping, strangely enough. The COW facility is additional to that.
Sorry.
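
A simplified model of the decision being discussed, written as stand-alone C and only loosely following how this thread describes vm_normal_page(); it is not the kernel's code. The flag values, `struct vma_model`, `is_cow()` and `has_normal_page()` are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12	/* assumption for the example */

/* Hypothetical flag values; the real VM_* constants differ. */
#define F_SHARED   0x1u
#define F_MAYWRITE 0x2u
#define F_PFNMAP   0x4u

struct vma_model {
	unsigned long vm_start;
	unsigned long vm_pgoff;	/* first pfn for a linear remap */
	unsigned int flags;
};

/* Private and possibly writable: the only case where COW can happen. */
static bool is_cow(const struct vma_model *vma)
{
	return (vma->flags & (F_SHARED | F_MAYWRITE)) == F_MAYWRITE;
}

/* True if the pfn mapped at addr should be treated as a normal struct page. */
static bool has_normal_page(const struct vma_model *vma, unsigned long addr,
			    unsigned long pfn)
{
	if (vma->flags & F_PFNMAP) {
		unsigned long off = (addr - vma->vm_start) >> PAGE_SHIFT;

		if (pfn == vma->vm_pgoff + off)
			return false;	/* part of the linear pfn mapping */
		if (!is_cow(vma))
			return false;	/* shared pfn mapping: never a COW copy */
	}
	return true;
}
```

In this model a shared mapping with the pfn-map flag set never reports a normal page, whatever vm_pgoff holds, which is the combination the driver relies on.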

--
SUSE Labs, Novell Inc.

2006-03-17 18:04:20

by Christoph Hellwig

Subject: Re: [patch] mspec - special memory driver and do_no_pfn handler

On Fri, Mar 17, 2006 at 02:29:13PM +0100, Carsten Otte wrote:
> Jes Sorensen wrote:
> > Well then the question is, would it simplify the code using no_pfn in
> > this case? Hacking up fake struct page entries seems even more of a
> > hack to me.
> I second that. That's where we are with our dcss xip thing today.
> It _is_ a hack to have a struct page that you don't need.

The same is true for the SPU support. The way it's currently done works,
which is great, but it is anything but nice.