2005-04-12 10:01:47

by Jes Sorensen

Subject: [patch] mspec driver for 2.6.12-rc2-mm3

Hi Andrew,

This patch includes the mspec driver for the ia64 port. Any chance
you'll take it for the mm tree?

It requires the genalloc patch posted earlier, but otherwise shouldn't
cause any harm.

Cheers,
Jes


Memory special driver for cached, uncached and 'fetchop' (SGI SN2 specific)
memory mappings, formerly known as the fetchop driver. Mostly used by
parallel applications.
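
For illustration (not part of the patch itself): a rough user-space
sketch of how the fetchop device is meant to be used, assuming the
FETCHOP_* macros from the new <asm/sn/mspec.h> header; error handling
omitted.

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <asm/sn/mspec.h>	/* FETCHOP_LOAD_OP, FETCHOP_INCREMENT */

static long fetchop_inc_example(void)
{
	/* the driver registers itself as the sgi_fetchop misc device;
	 * the mapping must be writable or mmap() returns -EPERM */
	int fd = open("/dev/sgi_fetchop", O_RDWR);
	void *amo = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);

	/* a load at offset FETCHOP_INCREMENT is turned into an atomic
	 * fetch-and-increment by the SN2 memory controller */
	return FETCHOP_LOAD_OP(amo, FETCHOP_INCREMENT);
}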

This patch relies on the PG_uncached support patch and the generic
allocator patch (genalloc).

Signed-off-by: Jes Sorensen <[email protected]>


diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/Kconfig linux-2.6.12-rc2-mm3/arch/ia64/Kconfig
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/Kconfig 2005-04-12 02:09:02 -07:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/Kconfig 2005-04-12 02:14:06 -07:00
@@ -217,6 +217,16 @@
If you are compiling a kernel that will run under SGI's IA-64
simulator (Medusa) then say Y, otherwise say N.

+config MSPEC
+ tristate "Special Memory support"
+ select GENERIC_ALLOCATOR
+ help
+ This driver allows for cached and uncached mappings of memory
+ to user processes. On SGI SN hardware it will also export the
+ special fetchop memory facility.
+ Fetchops are atomic memory operations that are implemented in the
+ memory controller on SGI SN hardware.
+
config FORCE_MAX_ZONEORDER
int
default "18"
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/configs/sn2_defconfig linux-2.6.12-rc2-mm3/arch/ia64/configs/sn2_defconfig
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/configs/sn2_defconfig 2005-04-12 02:09:02 -07:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/configs/sn2_defconfig 2005-04-12 02:14:06 -07:00
@@ -82,6 +82,7 @@
# CONFIG_IA64_CYCLONE is not set
CONFIG_IOSAPIC=y
CONFIG_IA64_SGI_SN_SIM=y
+CONFIG_MSPEC=m
CONFIG_FORCE_MAX_ZONEORDER=18
CONFIG_SMP=y
CONFIG_NR_CPUS=512
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/defconfig linux-2.6.12-rc2-mm3/arch/ia64/defconfig
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/defconfig 2005-04-12 02:09:02 -07:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/defconfig 2005-04-12 02:14:06 -07:00
@@ -80,6 +80,7 @@
CONFIG_ARCH_DISCONTIGMEM_ENABLE=y
CONFIG_IA64_CYCLONE=y
CONFIG_IOSAPIC=y
+CONFIG_MSPEC=m
CONFIG_FORCE_MAX_ZONEORDER=18
CONFIG_SMP=y
CONFIG_NR_CPUS=512
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/kernel/Makefile linux-2.6.12-rc2-mm3/arch/ia64/kernel/Makefile
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/kernel/Makefile 2005-03-01 23:38:33 -08:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/kernel/Makefile 2005-04-12 02:14:06 -07:00
@@ -20,6 +20,7 @@
obj-$(CONFIG_PERFMON) += perfmon_default_smpl.o
obj-$(CONFIG_IA64_CYCLONE) += cyclone.o
obj-$(CONFIG_IA64_MCA_RECOVERY) += mca_recovery.o
+obj-$(CONFIG_MSPEC) += mspec.o
mca_recovery-y += mca_drv.o mca_drv_asm.o

# The gate DSO image is built using a special linker script.
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/kernel/mspec.c linux-2.6.12-rc2-mm3/arch/ia64/kernel/mspec.c
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/kernel/mspec.c 1969-12-31 16:00:00 -08:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/kernel/mspec.c 2005-04-12 02:14:06 -07:00
@@ -0,0 +1,799 @@
+/*
+ * Copyright (C) 2001-2005 Silicon Graphics, Inc. All rights
+ * reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ */
+
+/*
+ * SN Platform Special Memory (mspec) Support
+ *
+ * This driver exports the SN special memory (mspec) facility to user processes.
+ * There are three types of memory made available thru this driver:
+ * fetchops, uncached and cached.
+ *
+ * Fetchops are atomic memory operations that are implemented in the
+ * memory controller on SGI SN hardware.
+ *
+ * Uncached is used for the memory write-combining feature of the
+ * ia64 cpu.
+ *
+ * Cached are used for areas of memory that are used as cached addresses
+ * on our partition and used as uncached addresses from other partitions.
+ * Due to a design constraint of the SN2 Shub, you can not have processors
+ * on the same FSB perform both a cached and uncached reference to the
+ * same cache line. These special memory cached regions prevent the
+ * kernel from ever dropping in a TLB entry and therefore prevent the
+ * processor from ever speculating a cache line from this page.
+ */
+
+
+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/miscdevice.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/proc_fs.h>
+#include <linux/vmalloc.h>
+#include <linux/bitops.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/efi.h>
+#include <linux/genalloc.h>
+#include <asm/page.h>
+#include <asm/pal.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/atomic.h>
+#include <asm/tlbflush.h>
+#include <asm/sn/addrs.h>
+#include <asm/sn/arch.h>
+#include <asm/sn/mspec.h>
+#include <asm/sn/sn_cpuid.h>
+#include <asm/sn/io.h>
+#include <asm/sn/bte.h>
+#include <asm/sn/shubio.h>
+
+
+#define DEBUG 0
+
+#define FETCHOP_DRIVER_ID_STR "MSPEC Fetchop Device Driver"
+#define CACHED_DRIVER_ID_STR "MSPEC Cached Device Driver"
+#define UNCACHED_DRIVER_ID_STR "MSPEC Uncached Device Driver"
+#define REVISION "3.0"
+#define MSPEC_BASENAME "mspec"
+
+
+#define BTE_ZERO_BLOCK(_maddr, _len) \
+ bte_copy(0, _maddr - __IA64_UNCACHED_OFFSET, _len, BTE_WACQUIRE | BTE_ZERO_FILL, NULL)
+
+static int fetchop_mmap(struct file *file, struct vm_area_struct *vma);
+static int cached_mmap(struct file *file, struct vm_area_struct *vma);
+static int uncached_mmap(struct file *file, struct vm_area_struct *vma);
+static void mspec_open(struct vm_area_struct *vma);
+static void mspec_close(struct vm_area_struct *vma);
+static struct page * mspec_nopage(struct vm_area_struct *vma,
+ unsigned long address, int *unused);
+
+/*
+ * Page types allocated by the device.
+ */
+enum {
+ MSPEC_FETCHOP = 1,
+ MSPEC_CACHED,
+ MSPEC_UNCACHED
+};
+
+static struct file_operations fetchop_fops = {
+ .owner = THIS_MODULE,
+ .mmap = fetchop_mmap
+};
+static struct miscdevice fetchop_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgi_fetchop",
+ .fops = &fetchop_fops
+};
+
+
+static struct file_operations cached_fops = {
+ .owner = THIS_MODULE,
+ .mmap = cached_mmap
+};
+static struct miscdevice cached_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgi_cached",
+ .fops = &cached_fops
+};
+
+
+static struct file_operations uncached_fops = {
+ .owner = THIS_MODULE,
+ .mmap = uncached_mmap
+};
+static struct miscdevice uncached_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgi_uncached",
+ .fops = &uncached_fops
+};
+
+
+static struct vm_operations_struct mspec_vm_ops = {
+ .open = mspec_open,
+ .close = mspec_close,
+ .nopage = mspec_nopage
+};
+
+/*
+ * There is one of these structs per node. It is used to manage the mspec
+ * space that is available on the node. Current assumption is that there is
+ * only 1 mspec block of memory per node.
+ */
+struct node_mspecs {
+ long maddr; /* phys addr of start of mspecs. */
+ int count; /* Total number of mspec pages. */
+ atomic_t free; /* Number of pages currently free. */
+ unsigned long bits[1]; /* Bitmap for managing pages. */
+};
+
+
+/*
+ * One of these structures is allocated when an mspec region is mmaped. The
+ * structure is pointed to by the vma->vm_private_data field in the vma struct.
+ * This structure is used to record the addresses of the mspec pages.
+ */
+struct vma_data {
+ atomic_t refcnt; /* Number of vmas sharing the data. */
+ spinlock_t lock; /* Serialize access to the vma. */
+ int count; /* Number of pages allocated. */
+ int type; /* Type of pages allocated. */
+ unsigned long maddr[1]; /* Array of MSPEC addresses. */
+};
+
+
+/*
+ * Memory Special statistics.
+ */
+struct mspec_stats {
+ atomic_t map_count; /* Number of active mmap's */
+ atomic_t pages_in_use; /* Number of mspec pages in use */
+ unsigned long pages_total; /* Total number of mspec pages */
+};
+
+static struct mspec_stats mspec_stats;
+static struct node_mspecs *node_mspecs[MAX_NUMNODES];
+
+#define MAX_UNCACHED_GRANULES 5
+static int allocated_granules;
+
+struct gen_pool *mspec_pool[MAX_NUMNODES];
+
+static void mspec_ipi_visibility(void *data)
+{
+ int status;
+
+ status = ia64_pal_prefetch_visibility(PAL_VISIBILITY_PHYSICAL);
+ if ((status != PAL_VISIBILITY_OK) &&
+ (status != PAL_VISIBILITY_OK_REMOTE_NEEDED))
+ printk(KERN_DEBUG "pal_prefetch_visibility() returns %i on "
+ "CPU %i\n", status, get_cpu());
+}
+
+
+static void mspec_ipi_mc_drain(void *data)
+{
+ int status;
+ status = ia64_pal_mc_drain();
+ if (status)
+ printk(KERN_WARNING "ia64_pal_mc_drain() failed with %i on "
+ "CPU %i\n", status, get_cpu());
+}
+
+
+static unsigned long
+mspec_get_new_chunk(struct gen_pool *poolp)
+{
+ struct page *page;
+ void *tmp;
+ int status, node, i;
+ unsigned long addr;
+
+ if (allocated_granules >= MAX_UNCACHED_GRANULES)
+ return 0;
+
+ node = (int)poolp->private;
+ page = alloc_pages_node(node, GFP_KERNEL,
+ IA64_GRANULE_SHIFT-PAGE_SHIFT);
+
+#if DEBUG
+ printk(KERN_INFO "get_new_chunk page %p, addr %lx\n",
+ page, (unsigned long)(page-vmem_map) << PAGE_SHIFT);
+#endif
+
+ /*
+ * Do magic if no mem on local node! XXX
+ */
+ if (!page)
+ return 0;
+ tmp = page_address(page);
+ memset(tmp, 0, IA64_GRANULE_SIZE);
+
+ /*
+ * There's a small race here where it's possible for someone to
+ * access the page through /dev/mem halfway through the conversion
+ * to uncached - not sure it's really worth bothering about
+ */
+ for (i = 0; i < (IA64_GRANULE_SIZE / PAGE_SIZE); i++)
+ SetPageUncached(&page[i]);
+
+ flush_tlb_kernel_range(tmp, tmp + IA64_GRANULE_SIZE);
+
+ status = ia64_pal_prefetch_visibility(PAL_VISIBILITY_PHYSICAL);
+#if DEBUG
+ printk(KERN_INFO "pal_prefetch_visibility() returns %i on cpu %i\n",
+ status, get_cpu());
+#endif
+ if (!status) {
+ status = smp_call_function(mspec_ipi_visibility, NULL, 0, 1);
+ if (status)
+ printk(KERN_WARNING "smp_call_function failed for "
+ "mspec_ipi_visibility! (%i)\n", status);
+ }
+
+ sn_flush_all_caches((unsigned long)tmp, IA64_GRANULE_SIZE);
+ ia64_pal_mc_drain();
+ status = smp_call_function(mspec_ipi_mc_drain, NULL, 0, 1);
+ if (status)
+ printk(KERN_WARNING "smp_call_function failed for "
+ "mspec_ipi_mc_drain! (%i)\n", status);
+
+ addr = (unsigned long)tmp - PAGE_OFFSET + __IA64_UNCACHED_OFFSET;
+
+ allocated_granules++;
+ return addr;
+}
+
+
+/*
+ * mspec_alloc_page
+ *
+ * Allocate 1 mspec page. Allocates on the requested node. If no
+ * mspec pages are available on the requested node, roundrobin starting
+ * with higher nodes.
+ */
+static unsigned long
+mspec_alloc_page(int nid, int type)
+{
+ unsigned long maddr;
+
+ maddr = gen_pool_alloc(mspec_pool[nid], PAGE_SIZE);
+#if DEBUG
+ printk(KERN_DEBUG "mspec_alloc_page returns %lx on node %i\n",
+ maddr, nid);
+#endif
+
+ /*
+ * If no memory is available on our local node, try the
+ * remaining nodes in the system.
+ */
+ if (!maddr) {
+ int i;
+
+ for (i = MAX_NUMNODES - 1; i >= 0; i--) {
+ if (i == nid || !node_online(i))
+ continue;
+ maddr = gen_pool_alloc(mspec_pool[i], PAGE_SIZE);
+#if DEBUG
+ printk(KERN_DEBUG "mspec_alloc_page alternate search "
+ "returns %lx on node %i\n", maddr, i);
+#endif
+ if (maddr) {
+ break;
+ }
+ }
+ }
+
+ if (maddr)
+ atomic_inc(&mspec_stats.pages_in_use);
+
+ return maddr;
+}
+
+
+/*
+ * mspec_free_page
+ *
+ * Free a single mspec page.
+ */
+static void
+mspec_free_page(unsigned long maddr)
+{
+ int node;
+
+ node = nasid_to_cnodeid(NASID_GET(maddr));
+#if DEBUG
+ printk(KERN_DEBUG "mspec_free_page(%lx) on node %i\n", maddr, node);
+#endif
+ if ((maddr & (0XFUL << 60)) != __IA64_UNCACHED_OFFSET)
+ panic("mspec_free_page invalid address %lx\n", maddr);
+
+ atomic_dec(&mspec_stats.pages_in_use);
+ gen_pool_free(mspec_pool[node], maddr, PAGE_SIZE);
+}
+
+
+/*
+ * mspec_mmap
+ *
+ * Called when mmaping the device. Initializes the vma with a fault handler
+ * and private data structure necessary to allocate, track, and free the
+ * underlying pages.
+ */
+static int
+mspec_mmap(struct file *file, struct vm_area_struct *vma, int type)
+{
+ struct vma_data *vdata;
+ int pages;
+
+ if (vma->vm_pgoff != 0)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_WRITE) == 0)
+ return -EPERM;
+
+ pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ if (!(vdata = vmalloc(sizeof(struct vma_data)+(pages-1)*sizeof(long))))
+ return -ENOMEM;
+ memset(vdata, 0, sizeof(struct vma_data)+(pages-1)*sizeof(long));
+
+ vdata->type = type;
+ vdata->lock = SPIN_LOCK_UNLOCKED;
+ vdata->refcnt = ATOMIC_INIT(1);
+ vma->vm_private_data = vdata;
+
+ vma->vm_flags |= (VM_IO | VM_SHM | VM_LOCKED);
+ if (vdata->type == MSPEC_FETCHOP || vdata->type == MSPEC_UNCACHED)
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_ops = &mspec_vm_ops;
+
+ atomic_inc(&mspec_stats.map_count);
+ return 0;
+}
+
+static int
+fetchop_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_FETCHOP);
+}
+
+static int
+cached_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_CACHED);
+}
+
+static int
+uncached_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_UNCACHED);
+}
+
+/*
+ * mspec_open
+ *
+ * Called when a device mapping is created by a means other than mmap
+ * (via fork, etc.). Increments the reference count on the underlying
+ * mspec data so it is not freed prematurely.
+ */
+static void
+mspec_open(struct vm_area_struct *vma)
+{
+ struct vma_data *vdata;
+
+ vdata = vma->vm_private_data;
+ atomic_inc(&vdata->refcnt);
+}
+
+/*
+ * mspec_close
+ *
+ * Called when unmapping a device mapping. Frees all mspec pages
+ * belonging to the vma.
+ */
+static void
+mspec_close(struct vm_area_struct *vma)
+{
+ struct vma_data *vdata;
+ int i, pages;
+ bte_result_t br;
+
+ vdata = vma->vm_private_data;
+ if (atomic_dec(&vdata->refcnt) == 0) {
+ pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ for (i = 0; i < pages; i++) {
+ if (vdata->maddr[i] != 0) {
+ /*
+ * Clear the page before sticking it back
+ * into the pool.
+ */
+ br = BTE_ZERO_BLOCK(vdata->maddr[i], PAGE_SIZE);
+ if (br == BTE_SUCCESS)
+ mspec_free_page(vdata->maddr[i]);
+ else
+ printk(KERN_WARNING "mspec_close(): BTE failed to zero page\n");
+ }
+ }
+ if (vdata->count)
+ atomic_dec(&mspec_stats.map_count);
+ vfree(vdata);
+ }
+}
+
+/*
+ * mspec_get_one_pte
+ *
+ * Return the pte for a given mm and address.
+ */
+static __inline__ int
+mspec_get_one_pte(struct mm_struct *mm, u64 address, pte_t **pte)
+{
+ pgd_t *pgd;
+ pmd_t *pmd;
+ pud_t *pud;
+
+ pgd = pgd_offset(mm, address);
+ if (pgd_present(*pgd)) {
+ pud = pud_offset(pgd, address);
+ if (pud_present(*pud)) {
+ pmd = pmd_offset(pud, address);
+ if (pmd_present(*pmd)) {
+ *pte = pte_offset_map(pmd, address);
+ if (pte_present(**pte)) {
+ return 0;
+ }
+ }
+ }
+ }
+
+ return -1;
+}
+
+/*
+ * mspec_nopage
+ *
+ * Creates a mspec page and maps it to user space.
+ */
+static struct page *
+mspec_nopage(struct vm_area_struct *vma, unsigned long address, int *unused)
+{
+ unsigned long paddr, maddr = 0;
+ unsigned long pfn;
+ int index;
+ pte_t *pte;
+ struct page *page;
+ struct vma_data *vdata = vma->vm_private_data;
+
+ spin_lock(&vdata->lock);
+
+ index = (address - vma->vm_start) >> PAGE_SHIFT;
+ if (vdata->maddr[index] == 0) {
+ vdata->count++;
+ maddr = mspec_alloc_page(numa_node_id(), vdata->type);
+ if (maddr == 0)
+ BUG();
+ vdata->maddr[index] = maddr;
+ } else if (mspec_get_one_pte(vma->vm_mm, address, &pte) == 0) {
+ printk(KERN_ERR "page already mapped\n");
+ /*
+ * The page may have already been faulted by another
+ * pthread. If so, we need to avoid remapping the
+ * page or we will trip a BUG check in the
+ * remap_page_range() path.
+ */
+ goto getpage;
+ }
+
+ if (vdata->type == MSPEC_FETCHOP)
+ paddr = TO_AMO(vdata->maddr[index]);
+ else
+ paddr = __pa(TO_CAC(vdata->maddr[index]));
+
+ /*
+ * XXX - is this correct?
+ */
+ pfn = paddr >> PAGE_SHIFT;
+ if (remap_pfn_range(vma, address, pfn, PAGE_SIZE, vma->vm_page_prot)) {
+ printk(KERN_ERR "remap_pfn_range failed!\n");
+ goto error;
+ }
+
+ /*
+ * The kernel requires a page structure to be returned upon
+ * success, but there are no page structures for low granule pages.
+ * remap_page_range() creates the pte for us and we return a
+ * bogus page back to the kernel fault handler to keep it happy
+ * (the page is freed immediately there).
+ */
+ if (mspec_get_one_pte(vma->vm_mm, address, &pte) == 0) {
+ spin_lock(&vma->vm_mm->page_table_lock);
+ inc_mm_counter(vma->vm_mm, rss);
+ spin_unlock(&vma->vm_mm->page_table_lock);
+
+ set_pte(pte, pte_mkwrite(pte_mkdirty(*pte)));
+ }
+ getpage:
+ /*
+ * Is this really correct?
+ */
+ page = alloc_pages(GFP_USER, 0);
+ spin_unlock(&vdata->lock);
+ return page;
+
+ error:
+ if (maddr) {
+ mspec_free_page(vdata->maddr[index]);
+ vdata->maddr[index] = 0;
+ vdata->count--;
+ }
+ spin_unlock(&vdata->lock);
+ return NOPAGE_SIGBUS;
+}
+
+
+#ifdef CONFIG_PROC_FS
+static void *
+mspec_seq_start(struct seq_file *file, loff_t *offset)
+{
+ if (*offset < MAX_NUMNODES)
+ return offset;
+ return NULL;
+}
+
+static void *
+mspec_seq_next(struct seq_file *file, void *data, loff_t *offset)
+{
+ (*offset)++;
+ if (*offset < MAX_NUMNODES)
+ return offset;
+ return NULL;
+}
+
+static void
+mspec_seq_stop(struct seq_file *file, void *data)
+{
+}
+
+static int
+mspec_seq_show(struct seq_file *file, void *data)
+{
+ struct node_mspecs *mspecs;
+ int i;
+
+ i = *(loff_t *)data;
+
+ if (!i) {
+ seq_printf(file, "mappings : %i\n",
+ atomic_read(&mspec_stats.map_count));
+ seq_printf(file, "current mspec pages : %i\n",
+ atomic_read(&mspec_stats.pages_in_use));
+ seq_printf(file, "%4s %7s %7s\n", "node", "total", "free");
+ }
+
+ if (i < MAX_NUMNODES) {
+ int free, count;
+ mspecs = node_mspecs[i];
+ if (mspecs) {
+ free = atomic_read(&mspecs->free);
+ count = mspecs->count;
+ seq_printf(file, "%4d %7d %7d\n", i, count, free);
+ }
+ }
+
+ return 0;
+}
+
+
+static struct seq_operations mspec_seq_ops = {
+ .start = mspec_seq_start,
+ .next = mspec_seq_next,
+ .stop = mspec_seq_stop,
+ .show = mspec_seq_show
+};
+
+int
+mspec_proc_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &mspec_seq_ops);
+}
+
+static struct file_operations proc_mspec_operations = {
+ .open = mspec_proc_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+
+static struct proc_dir_entry *proc_mspec;
+
+#endif /* CONFIG_PROC_FS */
+
+/*
+ * mspec_build_memmap,
+ *
+ * Called at boot time to build a map of pages that can be used for
+ * memory special operations.
+ */
+static int __init
+mspec_build_memmap(unsigned long start, unsigned long end)
+{
+ long length;
+ bte_result_t br;
+ unsigned long vstart, vend;
+ int node;
+
+ length = end - start;
+ vstart = start + __IA64_UNCACHED_OFFSET;
+ vend = end + __IA64_UNCACHED_OFFSET;
+
+#if DEBUG
+ printk(KERN_ERR "mspec_build_memmap(%lx %lx)\n", start, end);
+#endif
+
+ br = BTE_ZERO_BLOCK(vstart, length);
+ if (br != BTE_SUCCESS)
+ panic("BTE Failed while trying to zero mspec page. bte_result_t = %d\n", (int) br);
+
+ node = nasid_to_cnodeid(NASID_GET(start));
+
+ for (; vstart < vend ; vstart += PAGE_SIZE) {
+#if DEBUG
+ printk(KERN_INFO "sticking %lx into the pool!\n", vstart);
+#endif
+ gen_pool_free(mspec_pool[node], vstart, PAGE_SIZE);
+ }
+
+ return 0;
+}
+
+/*
+ * Walk the EFI memory map to pull out leftover pages in the lower
+ * memory regions which do not end up in the regular memory map and
+ * stick them into the uncached allocator
+ */
+static void __init
+mspec_walk_efi_memmap_uc (void)
+{
+ void *efi_map_start, *efi_map_end, *p;
+ efi_memory_desc_t *md;
+ u64 efi_desc_size, start, end;
+
+ efi_map_start = __va(ia64_boot_param->efi_memmap);
+ efi_map_end = efi_map_start + ia64_boot_param->efi_memmap_size;
+ efi_desc_size = ia64_boot_param->efi_memdesc_size;
+
+ for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
+ md = p;
+ if (md->attribute == EFI_MEMORY_UC) {
+ start = PAGE_ALIGN(md->phys_addr);
+ end = PAGE_ALIGN((md->phys_addr+(md->num_pages << EFI_PAGE_SHIFT)) & PAGE_MASK);
+ if (mspec_build_memmap(start, end) < 0)
+ return;
+ }
+ }
+}
+
+
+
+/*
+ * mspec_init
+ *
+ * Called at boot time to initialize the mspec facility.
+ */
+static int __init
+mspec_init(void)
+{
+ int i, ret;
+
+ /*
+ * The fetchop device only works on SN2 hardware, uncached and cached
+ * memory drivers should both be valid on all ia64 hardware
+ */
+ if (ia64_platform_is("sn2")) {
+ if ((ret = misc_register(&fetchop_miscdev))) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ FETCHOP_DRIVER_ID_STR, ret);
+ return ret;
+ }
+ }
+ if ((ret = misc_register(&cached_miscdev))) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ CACHED_DRIVER_ID_STR, ret);
+ misc_deregister(&fetchop_miscdev);
+ return ret;
+ }
+ if ((ret = misc_register(&uncached_miscdev))) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ UNCACHED_DRIVER_ID_STR, ret);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+ return ret;
+ }
+
+ /*
+ * /proc code needs to be updated to work with the new
+ * allocation scheme
+ */
+#ifdef CONFIG_PROC_FS
+ if (!(proc_mspec = create_proc_entry(MSPEC_BASENAME, 0444, NULL))){
+ printk(KERN_ERR "%s: unable to create proc entry",
+ FETCHOP_DRIVER_ID_STR);
+ misc_deregister(&uncached_miscdev);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+ return -EINVAL;
+ }
+ proc_mspec->proc_fops = &proc_mspec_operations;
+#endif /* CONFIG_PROC_FS */
+
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ if (!node_online(i))
+ continue;
+ printk(KERN_DEBUG "Setting up pool for node %i\n", i);
+ mspec_pool[i] = alloc_gen_pool(0, IA64_GRANULE_SHIFT,
+ &mspec_get_new_chunk, i);
+ }
+
+ mspec_walk_efi_memmap_uc();
+
+ printk(KERN_INFO "%s: v%s\n", FETCHOP_DRIVER_ID_STR, REVISION);
+ printk(KERN_INFO "%s: v%s\n", CACHED_DRIVER_ID_STR, REVISION);
+ printk(KERN_INFO "%s: v%s\n", UNCACHED_DRIVER_ID_STR, REVISION);
+
+ return 0;
+}
+
+
+static void __exit
+mspec_exit(void)
+{
+ BUG_ON(atomic_read(&mspec_stats.pages_in_use) > 0);
+
+#ifdef CONFIG_PROC_FS
+ remove_proc_entry(MSPEC_BASENAME, NULL);
+#endif
+ misc_deregister(&uncached_miscdev);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+}
+
+
+unsigned long
+mspec_kalloc_page(int nid)
+{
+ return TO_AMO(mspec_alloc_page(nid, MSPEC_FETCHOP));
+}
+
+
+void
+mspec_kfree_page(unsigned long maddr)
+{
+ mspec_free_page(TO_PHYS(maddr) + __IA64_UNCACHED_OFFSET);
+}
+EXPORT_SYMBOL(mspec_kalloc_page);
+EXPORT_SYMBOL(mspec_kfree_page);
+
+
+module_init(mspec_init);
+module_exit(mspec_exit);
+
+
+MODULE_AUTHOR("Silicon Graphics, Inc.");
+MODULE_DESCRIPTION("Driver for SGI SN special memory operations");
+MODULE_LICENSE("GPL");
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/fetchop.h linux-2.6.12-rc2-mm3/include/asm-ia64/sn/fetchop.h
--- linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/fetchop.h 2005-03-01 23:38:12 -08:00
+++ linux-2.6.12-rc2-mm3/include/asm-ia64/sn/fetchop.h 1969-12-31 16:00:00 -08:00
@@ -1,85 +0,0 @@
-/*
- *
- * This file is subject to the terms and conditions of the GNU General Public
- * License. See the file "COPYING" in the main directory of this archive
- * for more details.
- *
- * Copyright (c) 2001-2004 Silicon Graphics, Inc. All rights reserved.
- */
-
-#ifndef _ASM_IA64_SN_FETCHOP_H
-#define _ASM_IA64_SN_FETCHOP_H
-
-#include <linux/config.h>
-
-#define FETCHOP_BASENAME "sgi_fetchop"
-#define FETCHOP_FULLNAME "/dev/sgi_fetchop"
-
-
-
-#define FETCHOP_VAR_SIZE 64 /* 64 byte per fetchop variable */
-
-#define FETCHOP_LOAD 0
-#define FETCHOP_INCREMENT 8
-#define FETCHOP_DECREMENT 16
-#define FETCHOP_CLEAR 24
-
-#define FETCHOP_STORE 0
-#define FETCHOP_AND 24
-#define FETCHOP_OR 32
-
-#define FETCHOP_CLEAR_CACHE 56
-
-#define FETCHOP_LOAD_OP(addr, op) ( \
- *(volatile long *)((char*) (addr) + (op)))
-
-#define FETCHOP_STORE_OP(addr, op, x) ( \
- *(volatile long *)((char*) (addr) + (op)) = (long) (x))
-
-#ifdef __KERNEL__
-
-/*
- * Convert a region 6 (kaddr) address to the address of the fetchop variable
- */
-#define FETCHOP_KADDR_TO_MSPEC_ADDR(kaddr) TO_MSPEC(kaddr)
-
-
-/*
- * Each Atomic Memory Operation (AMO formerly known as fetchop)
- * variable is 64 bytes long. The first 8 bytes are used. The
- * remaining 56 bytes are unaddressable due to the operation taking
- * that portion of the address.
- *
- * NOTE: The AMO_t _MUST_ be placed in either the first or second half
- * of the cache line. The cache line _MUST NOT_ be used for anything
- * other than additional AMO_t entries. This is because there are two
- * addresses which reference the same physical cache line. One will
- * be a cached entry with the memory type bits all set. This address
- * may be loaded into processor cache. The AMO_t will be referenced
- * uncached via the memory special memory type. If any portion of the
- * cached cache-line is modified, when that line is flushed, it will
- * overwrite the uncached value in physical memory and lead to
- * inconsistency.
- */
-typedef struct {
- u64 variable;
- u64 unused[7];
-} AMO_t;
-
-
-/*
- * The following APIs are externalized to the kernel to allocate/free pages of
- * fetchop variables.
- * fetchop_kalloc_page - Allocate/initialize 1 fetchop page on the
- * specified cnode.
- * fetchop_kfree_page - Free a previously allocated fetchop page
- */
-
-unsigned long fetchop_kalloc_page(int nid);
-void fetchop_kfree_page(unsigned long maddr);
-
-
-#endif /* __KERNEL__ */
-
-#endif /* _ASM_IA64_SN_FETCHOP_H */
-
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/mspec.h linux-2.6.12-rc2-mm3/include/asm-ia64/sn/mspec.h
--- linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/mspec.h 1969-12-31 16:00:00 -08:00
+++ linux-2.6.12-rc2-mm3/include/asm-ia64/sn/mspec.h 2005-04-12 02:14:06 -07:00
@@ -0,0 +1,72 @@
+/*
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2001-2004 Silicon Graphics, Inc. All rights reserved.
+ */
+
+#ifndef _ASM_IA64_SN_MSPEC_H
+#define _ASM_IA64_SN_MSPEC_H
+
+#define FETCHOP_VAR_SIZE 64 /* 64 byte per fetchop variable */
+
+#define FETCHOP_LOAD 0
+#define FETCHOP_INCREMENT 8
+#define FETCHOP_DECREMENT 16
+#define FETCHOP_CLEAR 24
+
+#define FETCHOP_STORE 0
+#define FETCHOP_AND 24
+#define FETCHOP_OR 32
+
+#define FETCHOP_CLEAR_CACHE 56
+
+#define FETCHOP_LOAD_OP(addr, op) ( \
+ *(volatile long *)((char*) (addr) + (op)))
+
+#define FETCHOP_STORE_OP(addr, op, x) ( \
+ *(volatile long *)((char*) (addr) + (op)) = (long) (x))
+
+#ifdef __KERNEL__
+
+/*
+ * Each Atomic Memory Operation (AMO formerly known as fetchop)
+ * variable is 64 bytes long. The first 8 bytes are used. The
+ * remaining 56 bytes are unaddressable due to the operation taking
+ * that portion of the address.
+ *
+ * NOTE: The AMO_t _MUST_ be placed in either the first or second half
+ * of the cache line. The cache line _MUST NOT_ be used for anything
+ * other than additional AMO_t entries. This is because there are two
+ * addresses which reference the same physical cache line. One will
+ * be a cached entry with the memory type bits all set. This address
+ * may be loaded into processor cache. The AMO_t will be referenced
+ * uncached via the memory special memory type. If any portion of the
+ * cached cache-line is modified, when that line is flushed, it will
+ * overwrite the uncached value in physical memory and lead to
+ * inconsistency.
+ */
+typedef struct {
+ u64 variable;
+ u64 unused[7];
+} AMO_t;
+
+
+/*
+ * The following APIs are externalized to the kernel to allocate/free pages of
+ * fetchop variables.
+ * mspec_kalloc_page - Allocate/initialize 1 fetchop page on the
+ * specified cnode.
+ * mspec_kfree_page - Free a previously allocated fetchop page
+ */
+
+extern unsigned long mspec_kalloc_page(int);
+extern void mspec_kfree_page(unsigned long);
+
+
+#endif /* __KERNEL__ */
+
+#endif /* _ASM_IA64_SN_MSPEC_H */
+


2005-04-12 10:21:20

by Andrew Morton

Subject: Re: [patch] mspec driver for 2.6.12-rc2-mm3

[email protected] (Jes Sorensen) wrote:
>
> + if (atomic_dec(&vdata->refcnt) == 0) {

atomic_dec() normally returns void. ia64's returns int, which is a bit
risky for cross-arch development.

atomic_dec_and_test() would be more conventional.
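
I.e., roughly (a sketch; atomic_dec_and_test() returns true once the
counter has reached zero):

	if (atomic_dec_and_test(&vdata->refcnt)) {
		/* last reference dropped: zero and free the vma's
		 * mspec pages, then vfree(vdata) as before */
	}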

2005-04-12 10:22:32

by Andrew Morton

Subject: Re: [patch] mspec driver for 2.6.12-rc2-mm3

[email protected] (Jes Sorensen) wrote:
>
> + getpage:
> + /*
> + * Is this really correct?
> + */
> + page = alloc_pages(GFP_USER, 0);
> + spin_unlock(&vdata->lock);
> + return page;
> +

sleeping allocation inside a spinlock.
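
One way out (sketch): drop vdata->lock before calling into the page
allocator, since a GFP_USER allocation may sleep:

 getpage:
	spin_unlock(&vdata->lock);
	page = alloc_pages(GFP_USER, 0);
	return page;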

2005-04-12 10:28:07

by Andrew Morton

Subject: Re: [patch] mspec driver for 2.6.12-rc2-mm3

> +/*
> + * Walk the EFI memory map to pull out leftover pages in the lower
> + * memory regions which do not end up in the regular memory map and
> + * stick them into the uncached allocator
> + */
> +static void __init
> +mspec_walk_efi_memmap_uc (void)
> +{
> + void *efi_map_start, *efi_map_end, *p;
> + efi_memory_desc_t *md;
> + u64 efi_desc_size, start, end;
> +
> + efi_map_start = __va(ia64_boot_param->efi_memmap);
> + efi_map_end = efi_map_start + ia64_boot_param->efi_memmap_size;
> + efi_desc_size = ia64_boot_param->efi_memdesc_size;
> +
> + for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
> + md = p;
> + if (md->attribute == EFI_MEMORY_UC) {
> + start = PAGE_ALIGN(md->phys_addr);
> + end = PAGE_ALIGN((md->phys_addr+(md->num_pages << EFI_PAGE_SHIFT)) & PAGE_MASK);
> + if (mspec_build_memmap(start, end) < 0)
> + return;
> + }
> + }
> +}
> +

Does this compile without CONFIG_EFI?

(It seems that ia64 Kconfig tries to turn on EFI always, but I think
allnoconfig will turn it off)

> --- linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/fetchop.h 2005-03-01 23:38:12 -08:00
> +++ linux-2.6.12-rc2-mm3/include/asm-ia64/sn/fetchop.h 1969-12-31 16:00:00 -08:00

Did you mean to remove fetchop.h?

> --- linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/mspec.h 1969-12-31 16:00:00 -08:00
> +++ linux-2.6.12-rc2-mm3/include/asm-ia64/sn/mspec.h 2005-04-12 02:14:06 -07:00

I guess so.


2005-04-12 15:01:47

by Jes Sorensen

Subject: Re: [patch] mspec driver for 2.6.12-rc2-mm3

>>>>> "Andrew" == Andrew Morton <[email protected]> writes:

Andrew> Does this compile without CONFIG_EFI?

Andrew> (It seems that ia64 Kconfig tries to turn on EFI always, but I
Andrew> thing allnoconfig will turn it off)

Nope, it doesn't. I don't think the ia64 port can operate without EFI
at all.

>> --- linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/fetchop.h 2005-03-01 23:38:12 -08:00
>> +++ linux-2.6.12-rc2-mm3/include/asm-ia64/sn/fetchop.h 1969-12-31 16:00:00 -08:00

Andrew> Did you mean to remove fetchop.h?

Yep, it's obsoleted by mspec.h

Fixed the allocation-inside-spinlock bug and the atomic_dec issue as well.

Thanks for the comments!

Cheers,
Jes

Memory special driver for cached, uncached and 'fetchop' (SGI SN2 specific)
memory mappings, formerly known as the fetchop driver. Mostly used by
parallel applications.

This patch relies on the PG_uncached support patch and the generic
allocator patch (genalloc).

Signed-off-by: Jes Sorensen <[email protected]>


diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/Kconfig linux-2.6.12-rc2-mm3/arch/ia64/Kconfig
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/Kconfig 2005-04-12 02:09:02 -07:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/Kconfig 2005-04-12 02:14:06 -07:00
@@ -217,6 +217,16 @@
If you are compiling a kernel that will run under SGI's IA-64
simulator (Medusa) then say Y, otherwise say N.

+config MSPEC
+ tristate "Special Memory support"
+ select GENERIC_ALLOCATOR
+ help
+ This driver allows for cached and uncached mappings of memory
+ to user processes. On SGI SN hardware it will also export the
+ special fetchop memory facility.
+ Fetchops are atomic memory operations that are implemented in the
+ memory controller on SGI SN hardware.
+
config FORCE_MAX_ZONEORDER
int
default "18"
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/configs/sn2_defconfig linux-2.6.12-rc2-mm3/arch/ia64/configs/sn2_defconfig
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/configs/sn2_defconfig 2005-04-12 02:09:02 -07:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/configs/sn2_defconfig 2005-04-12 02:14:06 -07:00
@@ -82,6 +82,7 @@
# CONFIG_IA64_CYCLONE is not set
CONFIG_IOSAPIC=y
CONFIG_IA64_SGI_SN_SIM=y
+CONFIG_MSPEC=m
CONFIG_FORCE_MAX_ZONEORDER=18
CONFIG_SMP=y
CONFIG_NR_CPUS=512
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/defconfig linux-2.6.12-rc2-mm3/arch/ia64/defconfig
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/defconfig 2005-04-12 02:09:02 -07:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/defconfig 2005-04-12 02:14:06 -07:00
@@ -80,6 +80,7 @@
CONFIG_ARCH_DISCONTIGMEM_ENABLE=y
CONFIG_IA64_CYCLONE=y
CONFIG_IOSAPIC=y
+CONFIG_MSPEC=m
CONFIG_FORCE_MAX_ZONEORDER=18
CONFIG_SMP=y
CONFIG_NR_CPUS=512
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/kernel/Makefile linux-2.6.12-rc2-mm3/arch/ia64/kernel/Makefile
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/kernel/Makefile 2005-03-01 23:38:33 -08:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/kernel/Makefile 2005-04-12 02:14:06 -07:00
@@ -20,6 +20,7 @@
obj-$(CONFIG_PERFMON) += perfmon_default_smpl.o
obj-$(CONFIG_IA64_CYCLONE) += cyclone.o
obj-$(CONFIG_IA64_MCA_RECOVERY) += mca_recovery.o
+obj-$(CONFIG_MSPEC) += mspec.o
mca_recovery-y += mca_drv.o mca_drv_asm.o

# The gate DSO image is built using a special linker script.
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/kernel/mspec.c linux-2.6.12-rc2-mm3/arch/ia64/kernel/mspec.c
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/kernel/mspec.c 1969-12-31 16:00:00 -08:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/kernel/mspec.c 2005-04-12 07:39:19 -07:00
@@ -0,0 +1,798 @@
+/*
+ * Copyright (C) 2001-2005 Silicon Graphics, Inc. All rights
+ * reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ */
+
+/*
+ * SN Platform Special Memory (mspec) Support
+ *
+ * This driver exports the SN special memory (mspec) facility to user processes.
+ * There are three types of memory made available thru this driver:
+ * fetchops, uncached and cached.
+ *
+ * Fetchops are atomic memory operations that are implemented in the
+ * memory controller on SGI SN hardware.
+ *
+ * Uncached is used for the memory write-combining feature of the
+ * ia64 cpu.
+ *
+ * Cached are used for areas of memory that are used as cached addresses
+ * on our partition and used as uncached addresses from other partitions.
+ * Due to a design constraint of the SN2 Shub, you can not have processors
+ * on the same FSB perform both a cached and uncached reference to the
+ * same cache line. These special memory cached regions prevent the
+ * kernel from ever dropping in a TLB entry and therefore prevent the
+ * processor from ever speculating a cache line from this page.
+ */
+
+
+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/miscdevice.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/proc_fs.h>
+#include <linux/vmalloc.h>
+#include <linux/bitops.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/efi.h>
+#include <linux/genalloc.h>
+#include <asm/page.h>
+#include <asm/pal.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/atomic.h>
+#include <asm/tlbflush.h>
+#include <asm/sn/addrs.h>
+#include <asm/sn/arch.h>
+#include <asm/sn/mspec.h>
+#include <asm/sn/sn_cpuid.h>
+#include <asm/sn/io.h>
+#include <asm/sn/bte.h>
+#include <asm/sn/shubio.h>
+
+
+#define DEBUG 0
+
+#define FETCHOP_DRIVER_ID_STR "MSPEC Fetchop Device Driver"
+#define CACHED_DRIVER_ID_STR "MSPEC Cached Device Driver"
+#define UNCACHED_DRIVER_ID_STR "MSPEC Uncached Device Driver"
+#define REVISION "3.0"
+#define MSPEC_BASENAME "mspec"
+
+
+#define BTE_ZERO_BLOCK(_maddr, _len) \
+ bte_copy(0, _maddr - __IA64_UNCACHED_OFFSET, _len, BTE_WACQUIRE | BTE_ZERO_FILL, NULL)
+
+static int fetchop_mmap(struct file *file, struct vm_area_struct *vma);
+static int cached_mmap(struct file *file, struct vm_area_struct *vma);
+static int uncached_mmap(struct file *file, struct vm_area_struct *vma);
+static void mspec_open(struct vm_area_struct *vma);
+static void mspec_close(struct vm_area_struct *vma);
+static struct page * mspec_nopage(struct vm_area_struct *vma,
+ unsigned long address, int *unused);
+
+/*
+ * Page types allocated by the device.
+ */
+enum {
+ MSPEC_FETCHOP = 1,
+ MSPEC_CACHED,
+ MSPEC_UNCACHED
+};
+
+static struct file_operations fetchop_fops = {
+ .owner = THIS_MODULE,
+ .mmap = fetchop_mmap
+};
+static struct miscdevice fetchop_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgi_fetchop",
+ .fops = &fetchop_fops
+};
+
+
+static struct file_operations cached_fops = {
+ .owner = THIS_MODULE,
+ .mmap = cached_mmap
+};
+static struct miscdevice cached_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgi_cached",
+ .fops = &cached_fops
+};
+
+
+static struct file_operations uncached_fops = {
+ .owner = THIS_MODULE,
+ .mmap = uncached_mmap
+};
+static struct miscdevice uncached_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgi_uncached",
+ .fops = &uncached_fops
+};
+
+
+static struct vm_operations_struct mspec_vm_ops = {
+ .open = mspec_open,
+ .close = mspec_close,
+ .nopage = mspec_nopage
+};
+
+/*
+ * There is one of these structs per node. It is used to manage the mspec
+ * space that is available on the node. Current assumption is that there is
+ * only 1 mspec block of memory per node.
+ */
+struct node_mspecs {
+ long maddr; /* phys addr of start of mspecs. */
+ int count; /* Total number of mspec pages. */
+ atomic_t free; /* Number of pages currently free. */
+ unsigned long bits[1]; /* Bitmap for managing pages. */
+};
+
+
+/*
+ * One of these structures is allocated when an mspec region is mmaped. The
+ * structure is pointed to by the vma->vm_private_data field in the vma struct.
+ * This structure is used to record the addresses of the mspec pages.
+ */
+struct vma_data {
+ atomic_t refcnt; /* Number of vmas sharing the data. */
+ spinlock_t lock; /* Serialize access to the vma. */
+ int count; /* Number of pages allocated. */
+ int type; /* Type of pages allocated. */
+ unsigned long maddr[1]; /* Array of MSPEC addresses. */
+};
+
+
+/*
+ * Memory Special statistics.
+ */
+struct mspec_stats {
+ atomic_t map_count; /* Number of active mmap's */
+ atomic_t pages_in_use; /* Number of mspec pages in use */
+ unsigned long pages_total; /* Total number of mspec pages */
+};
+
+static struct mspec_stats mspec_stats;
+static struct node_mspecs *node_mspecs[MAX_NUMNODES];
+
+#define MAX_UNCACHED_GRANULES 5
+static int allocated_granules;
+
+struct gen_pool *mspec_pool[MAX_NUMNODES];
+
+static void mspec_ipi_visibility(void *data)
+{
+ int status;
+
+ status = ia64_pal_prefetch_visibility(PAL_VISIBILITY_PHYSICAL);
+ if ((status != PAL_VISIBILITY_OK) &&
+ (status != PAL_VISIBILITY_OK_REMOTE_NEEDED))
+ printk(KERN_DEBUG "pal_prefetch_visibility() returns %i on "
+ "CPU %i\n", status, get_cpu());
+}
+
+
+static void mspec_ipi_mc_drain(void *data)
+{
+ int status;
+ status = ia64_pal_mc_drain();
+ if (status)
+ printk(KERN_WARNING "ia64_pal_mc_drain() failed with %i on "
+ "CPU %i\n", status, get_cpu());
+}
+
+
+static unsigned long
+mspec_get_new_chunk(struct gen_pool *poolp)
+{
+ struct page *page;
+ void *tmp;
+ int status, node, i;
+ unsigned long addr;
+
+ if (allocated_granules >= MAX_UNCACHED_GRANULES)
+ return 0;
+
+ node = (int)poolp->private;
+ page = alloc_pages_node(node, GFP_KERNEL,
+ IA64_GRANULE_SHIFT-PAGE_SHIFT);
+
+#if DEBUG
+ printk(KERN_INFO "get_new_chunk page %p, addr %lx\n",
+ page, (unsigned long)(page-vmem_map) << PAGE_SHIFT);
+#endif
+
+ /*
+ * Do magic if no mem on local node! XXX
+ */
+ if (!page)
+ return 0;
+ tmp = page_address(page);
+ memset(tmp, 0, IA64_GRANULE_SIZE);
+
+ /*
+ * There's a small race here where it's possible for someone to
+ * access the page through /dev/mem halfway through the conversion
+ * to uncached - not sure it's really worth bothering about
+ */
+ for (i = 0; i < (IA64_GRANULE_SIZE / PAGE_SIZE); i++)
+ SetPageUncached(&page[i]);
+
+ flush_tlb_kernel_range(tmp, tmp + IA64_GRANULE_SIZE);
+
+ status = ia64_pal_prefetch_visibility(PAL_VISIBILITY_PHYSICAL);
+#if DEBUG
+ printk(KERN_INFO "pal_prefetch_visibility() returns %i on cpu %i\n",
+ status, get_cpu());
+#endif
+ if (!status) {
+ status = smp_call_function(mspec_ipi_visibility, NULL, 0, 1);
+ if (status)
+ printk(KERN_WARNING "smp_call_function failed for "
+ "mspec_ipi_visibility! (%i)\n", status);
+ }
+
+ sn_flush_all_caches((unsigned long)tmp, IA64_GRANULE_SIZE);
+ ia64_pal_mc_drain();
+ status = smp_call_function(mspec_ipi_mc_drain, NULL, 0, 1);
+ if (status)
+ printk(KERN_WARNING "smp_call_function failed for "
+ "mspec_ipi_mc_drain! (%i)\n", status);
+
+ addr = (unsigned long)tmp - PAGE_OFFSET + __IA64_UNCACHED_OFFSET;
+
+ allocated_granules++;
+ return addr;
+}
+
+
+/*
+ * mspec_alloc_page
+ *
+ * Allocate 1 mspec page. Allocates on the requested node. If no
+ * mspec pages are available on the requested node, roundrobin starting
+ * with higher nodes.
+ */
+static unsigned long
+mspec_alloc_page(int nid, int type)
+{
+ unsigned long maddr;
+
+ maddr = gen_pool_alloc(mspec_pool[nid], PAGE_SIZE);
+#if DEBUG
+ printk(KERN_DEBUG "mspec_alloc_page returns %lx on node %i\n",
+ maddr, nid);
+#endif
+
+ /*
+ * If no memory is available on our local node, try the
+ * remaining nodes in the system.
+ */
+ if (!maddr) {
+ int i;
+
+ for (i = MAX_NUMNODES - 1; i >= 0; i--) {
+ if (i == nid || !node_online(i))
+ continue;
+ maddr = gen_pool_alloc(mspec_pool[i], PAGE_SIZE);
+#if DEBUG
+ printk(KERN_DEBUG "mspec_alloc_page alternate search "
+ "returns %lx on node %i\n", maddr, i);
+#endif
+ if (maddr) {
+ break;
+ }
+ }
+ }
+
+ if (maddr)
+ atomic_inc(&mspec_stats.pages_in_use);
+
+ return maddr;
+}
+
+
+/*
+ * mspec_free_page
+ *
+ * Free a single mspec page.
+ */
+static void
+mspec_free_page(unsigned long maddr)
+{
+ int node;
+
+ node = nasid_to_cnodeid(NASID_GET(maddr));
+#if DEBUG
+ printk(KERN_DEBUG "mspec_free_page(%lx) on node %i\n", maddr, node);
+#endif
+ if ((maddr & (0XFUL << 60)) != __IA64_UNCACHED_OFFSET)
+ panic("mspec_free_page invalid address %lx\n", maddr);
+
+ atomic_dec(&mspec_stats.pages_in_use);
+ gen_pool_free(mspec_pool[node], maddr, PAGE_SIZE);
+}
+
+
+/*
+ * mspec_mmap
+ *
+ * Called when mmaping the device. Initializes the vma with a fault handler
+ * and private data structure necessary to allocate, track, and free the
+ * underlying pages.
+ */
+static int
+mspec_mmap(struct file *file, struct vm_area_struct *vma, int type)
+{
+ struct vma_data *vdata;
+ int pages;
+
+ if (vma->vm_pgoff != 0)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_WRITE) == 0)
+ return -EPERM;
+
+ pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ if (!(vdata = vmalloc(sizeof(struct vma_data)+(pages-1)*sizeof(long))))
+ return -ENOMEM;
+ memset(vdata, 0, sizeof(struct vma_data)+(pages-1)*sizeof(long));
+
+ vdata->type = type;
+ vdata->lock = SPIN_LOCK_UNLOCKED;
+ vdata->refcnt = ATOMIC_INIT(1);
+ vma->vm_private_data = vdata;
+
+ vma->vm_flags |= (VM_IO | VM_SHM | VM_LOCKED);
+ if (vdata->type == MSPEC_FETCHOP || vdata->type == MSPEC_UNCACHED)
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_ops = &mspec_vm_ops;
+
+ atomic_inc(&mspec_stats.map_count);
+ return 0;
+}
+
+static int
+fetchop_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_FETCHOP);
+}
+
+static int
+cached_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_CACHED);
+}
+
+static int
+uncached_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_UNCACHED);
+}
+
+/*
+ * mspec_open
+ *
+ * Called when a device mapping is created by a means other than mmap
+ * (via fork, etc.). Increments the reference count on the underlying
+ * mspec data so it is not freed prematurely.
+ */
+static void
+mspec_open(struct vm_area_struct *vma)
+{
+ struct vma_data *vdata;
+
+ vdata = vma->vm_private_data;
+ atomic_inc(&vdata->refcnt);
+}
+
+/*
+ * mspec_close
+ *
+ * Called when unmapping a device mapping. Frees all mspec pages
+ * belonging to the vma.
+ */
+static void
+mspec_close(struct vm_area_struct *vma)
+{
+ struct vma_data *vdata;
+ int i, pages;
+ bte_result_t br;
+
+ vdata = vma->vm_private_data;
+ if (atomic_dec_and_test(&vdata->refcnt)) {
+ pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ for (i = 0; i < pages; i++) {
+ if (vdata->maddr[i] != 0) {
+ /*
+ * Clear the page before sticking it back
+ * into the pool.
+ */
+ br = BTE_ZERO_BLOCK(vdata->maddr[i], PAGE_SIZE);
+ if (br == BTE_SUCCESS)
+ mspec_free_page(vdata->maddr[i]);
+ else
+ printk(KERN_WARNING "mspec_close(): BTE failed to zero page\n");
+ }
+ }
+ if (vdata->count)
+ atomic_dec(&mspec_stats.map_count);
+ vfree(vdata);
+ }
+}
+
+/*
+ * mspec_get_one_pte
+ *
+ * Return the pte for a given mm and address.
+ */
+static __inline__ int
+mspec_get_one_pte(struct mm_struct *mm, u64 address, pte_t **pte)
+{
+ pgd_t *pgd;
+ pmd_t *pmd;
+ pud_t *pud;
+
+ pgd = pgd_offset(mm, address);
+ if (pgd_present(*pgd)) {
+ pud = pud_offset(pgd, address);
+ if (pud_present(*pud)) {
+ pmd = pmd_offset(pud, address);
+ if (pmd_present(*pmd)) {
+ *pte = pte_offset_map(pmd, address);
+ if (pte_present(**pte)) {
+ return 0;
+ }
+ }
+ }
+ }
+
+ return -1;
+}
+
+/*
+ * mspec_nopage
+ *
+ * Creates a mspec page and maps it to user space.
+ */
+static struct page *
+mspec_nopage(struct vm_area_struct *vma, unsigned long address, int *unused)
+{
+ unsigned long paddr, maddr = 0;
+ unsigned long pfn;
+ int index;
+ pte_t *pte;
+ struct page *page;
+ struct vma_data *vdata = vma->vm_private_data;
+
+ spin_lock(&vdata->lock);
+
+ index = (address - vma->vm_start) >> PAGE_SHIFT;
+ if (vdata->maddr[index] == 0) {
+ vdata->count++;
+ maddr = mspec_alloc_page(numa_node_id(), vdata->type);
+ if (maddr == 0)
+ BUG();
+ vdata->maddr[index] = maddr;
+ } else if (mspec_get_one_pte(vma->vm_mm, address, &pte) == 0) {
+ printk(KERN_ERR "page already mapped\n");
+ /*
+ * The page may have already been faulted by another
+ * pthread. If so, we need to avoid remapping the
+ * page or we will trip a BUG check in the
+ * remap_page_range() path.
+ */
+ goto getpage;
+ }
+
+ if (vdata->type == MSPEC_FETCHOP)
+ paddr = TO_AMO(vdata->maddr[index]);
+ else
+ paddr = __pa(TO_CAC(vdata->maddr[index]));
+
+ /*
+ * XXX - is this correct?
+ */
+ pfn = paddr >> PAGE_SHIFT;
+ if (remap_pfn_range(vma, address, pfn, PAGE_SIZE, vma->vm_page_prot)) {
+ printk(KERN_ERR "remap_pfn_range failed!\n");
+ goto error;
+ }
+
+ /*
+ * The kernel requires a page structure to be returned upon
+ * success, but there are no page structures for low granule pages.
+ * remap_page_range() creates the pte for us and we return a
+ * bogus page back to the kernel fault handler to keep it happy
+ * (the page is freed immediately there).
+ */
+ if (mspec_get_one_pte(vma->vm_mm, address, &pte) == 0) {
+ spin_lock(&vma->vm_mm->page_table_lock);
+ inc_mm_counter(vma->vm_mm, rss);
+ spin_unlock(&vma->vm_mm->page_table_lock);
+
+ set_pte(pte, pte_mkwrite(pte_mkdirty(*pte)));
+ }
+ getpage:
+ /*
+ * Is this really correct?
+ */
+ spin_unlock(&vdata->lock);
+ page = alloc_pages(GFP_USER, 0);
+ return page;
+
+ error:
+ if (maddr) {
+ mspec_free_page(vdata->maddr[index]);
+ vdata->maddr[index] = 0;
+ vdata->count--;
+ }
+ spin_unlock(&vdata->lock);
+ return NOPAGE_SIGBUS;
+}
+
+
+#ifdef CONFIG_PROC_FS
+static void *
+mspec_seq_start(struct seq_file *file, loff_t *offset)
+{
+ if (*offset < MAX_NUMNODES)
+ return offset;
+ return NULL;
+}
+
+static void *
+mspec_seq_next(struct seq_file *file, void *data, loff_t *offset)
+{
+ (*offset)++;
+ if (*offset < MAX_NUMNODES)
+ return offset;
+ return NULL;
+}
+
+static void
+mspec_seq_stop(struct seq_file *file, void *data)
+{
+}
+
+static int
+mspec_seq_show(struct seq_file *file, void *data)
+{
+ struct node_mspecs *mspecs;
+ int i;
+
+ i = *(loff_t *)data;
+
+ if (!i) {
+ seq_printf(file, "mappings : %i\n",
+ atomic_read(&mspec_stats.map_count));
+ seq_printf(file, "current mspec pages : %i\n",
+ atomic_read(&mspec_stats.pages_in_use));
+ seq_printf(file, "%4s %7s %7s\n", "node", "total", "free");
+ }
+
+ if (i < MAX_NUMNODES) {
+ int free, count;
+ mspecs = node_mspecs[i];
+ if (mspecs) {
+ free = atomic_read(&mspecs->free);
+ count = mspecs->count;
+ seq_printf(file, "%4d %7d %7d\n", i, count, free);
+ }
+ }
+
+ return 0;
+}
+
+
+static struct seq_operations mspec_seq_ops = {
+ .start = mspec_seq_start,
+ .next = mspec_seq_next,
+ .stop = mspec_seq_stop,
+ .show = mspec_seq_show
+};
+
+int
+mspec_proc_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &mspec_seq_ops);
+}
+
+static struct file_operations proc_mspec_operations = {
+ .open = mspec_proc_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+
+static struct proc_dir_entry *proc_mspec;
+
+#endif /* CONFIG_PROC_FS */
+
+/*
+ * mspec_build_memmap,
+ *
+ * Called at boot time to build a map of pages that can be used for
+ * memory special operations.
+ */
+static int __init
+mspec_build_memmap(unsigned long start, unsigned long end)
+{
+ long length;
+ bte_result_t br;
+ unsigned long vstart, vend;
+ int node;
+
+ length = end - start;
+ vstart = start + __IA64_UNCACHED_OFFSET;
+ vend = end + __IA64_UNCACHED_OFFSET;
+
+#if DEBUG
+ printk(KERN_ERR "mspec_build_memmap(%lx %lx)\n", start, end);
+#endif
+
+ br = BTE_ZERO_BLOCK(vstart, length);
+ if (br != BTE_SUCCESS)
+ panic("BTE Failed while trying to zero mspec page. bte_result_t = %d\n", (int) br);
+
+ node = nasid_to_cnodeid(NASID_GET(start));
+
+ for (; vstart < vend ; vstart += PAGE_SIZE) {
+#if DEBUG
+ printk(KERN_INFO "sticking %lx into the pool!\n", vstart);
+#endif
+ gen_pool_free(mspec_pool[node], vstart, PAGE_SIZE);
+ }
+
+ return 0;
+}
+
+/*
+ * Walk the EFI memory map to pull out leftover pages in the lower
+ * memory regions which do not end up in the regular memory map and
+ * stick them into the uncached allocator
+ */
+static void __init
+mspec_walk_efi_memmap_uc (void)
+{
+ void *efi_map_start, *efi_map_end, *p;
+ efi_memory_desc_t *md;
+ u64 efi_desc_size, start, end;
+
+ efi_map_start = __va(ia64_boot_param->efi_memmap);
+ efi_map_end = efi_map_start + ia64_boot_param->efi_memmap_size;
+ efi_desc_size = ia64_boot_param->efi_memdesc_size;
+
+ for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
+ md = p;
+ if (md->attribute == EFI_MEMORY_UC) {
+ start = PAGE_ALIGN(md->phys_addr);
+ end = PAGE_ALIGN((md->phys_addr+(md->num_pages << EFI_PAGE_SHIFT)) & PAGE_MASK);
+ if (mspec_build_memmap(start, end) < 0)
+ return;
+ }
+ }
+}
+
+
+
+/*
+ * mspec_init
+ *
+ * Called at boot time to initialize the mspec facility.
+ */
+static int __init
+mspec_init(void)
+{
+ int i, ret;
+
+ /*
+ * The fetchop device only works on SN2 hardware, uncached and cached
+ * memory drivers should both be valid on all ia64 hardware
+ */
+ if (ia64_platform_is("sn2")) {
+ if ((ret = misc_register(&fetchop_miscdev))) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ FETCHOP_DRIVER_ID_STR, ret);
+ return ret;
+ }
+ }
+ if ((ret = misc_register(&cached_miscdev))) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ CACHED_DRIVER_ID_STR, ret);
+ misc_deregister(&fetchop_miscdev);
+ return ret;
+ }
+ if ((ret = misc_register(&uncached_miscdev))) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ UNCACHED_DRIVER_ID_STR, ret);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+ return ret;
+ }
+
+ /*
+ * /proc code needs to be updated to work with the new
+ * allocation scheme
+ */
+#ifdef CONFIG_PROC_FS
+ if (!(proc_mspec = create_proc_entry(MSPEC_BASENAME, 0444, NULL))){
+ printk(KERN_ERR "%s: unable to create proc entry",
+ FETCHOP_DRIVER_ID_STR);
+ misc_deregister(&uncached_miscdev);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+ return -EINVAL;
+ }
+ proc_mspec->proc_fops = &proc_mspec_operations;
+#endif /* CONFIG_PROC_FS */
+
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ if (!node_online(i))
+ continue;
+ mspec_pool[i] = gen_pool_create(0, IA64_GRANULE_SHIFT,
+ &mspec_get_new_chunk, i);
+ }
+
+ mspec_walk_efi_memmap_uc();
+
+ printk(KERN_INFO "%s: v%s\n", FETCHOP_DRIVER_ID_STR, REVISION);
+ printk(KERN_INFO "%s: v%s\n", CACHED_DRIVER_ID_STR, REVISION);
+ printk(KERN_INFO "%s: v%s\n", UNCACHED_DRIVER_ID_STR, REVISION);
+
+ return 0;
+}
+
+
+static void __exit
+mspec_exit(void)
+{
+ BUG_ON(atomic_read(&mspec_stats.pages_in_use) > 0);
+
+#ifdef CONFIG_PROC_FS
+ remove_proc_entry(MSPEC_BASENAME, NULL);
+#endif
+ misc_deregister(&uncached_miscdev);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+}
+
+
+unsigned long
+mspec_kalloc_page(int nid)
+{
+ return TO_AMO(mspec_alloc_page(nid, MSPEC_FETCHOP));
+}
+
+
+void
+mspec_kfree_page(unsigned long maddr)
+{
+ mspec_free_page(TO_PHYS(maddr) + __IA64_UNCACHED_OFFSET);
+}
+EXPORT_SYMBOL(mspec_kalloc_page);
+EXPORT_SYMBOL(mspec_kfree_page);
+
+
+module_init(mspec_init);
+module_exit(mspec_exit);
+
+
+MODULE_AUTHOR("Silicon Graphics, Inc.");
+MODULE_DESCRIPTION("Driver for SGI SN special memory operations");
+MODULE_LICENSE("GPL");
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/fetchop.h linux-2.6.12-rc2-mm3/include/asm-ia64/sn/fetchop.h
--- linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/fetchop.h 2005-03-01 23:38:12 -08:00
+++ linux-2.6.12-rc2-mm3/include/asm-ia64/sn/fetchop.h 1969-12-31 16:00:00 -08:00
@@ -1,85 +0,0 @@
-/*
- *
- * This file is subject to the terms and conditions of the GNU General Public
- * License. See the file "COPYING" in the main directory of this archive
- * for more details.
- *
- * Copyright (c) 2001-2004 Silicon Graphics, Inc. All rights reserved.
- */
-
-#ifndef _ASM_IA64_SN_FETCHOP_H
-#define _ASM_IA64_SN_FETCHOP_H
-
-#include <linux/config.h>
-
-#define FETCHOP_BASENAME "sgi_fetchop"
-#define FETCHOP_FULLNAME "/dev/sgi_fetchop"
-
-
-
-#define FETCHOP_VAR_SIZE 64 /* 64 byte per fetchop variable */
-
-#define FETCHOP_LOAD 0
-#define FETCHOP_INCREMENT 8
-#define FETCHOP_DECREMENT 16
-#define FETCHOP_CLEAR 24
-
-#define FETCHOP_STORE 0
-#define FETCHOP_AND 24
-#define FETCHOP_OR 32
-
-#define FETCHOP_CLEAR_CACHE 56
-
-#define FETCHOP_LOAD_OP(addr, op) ( \
- *(volatile long *)((char*) (addr) + (op)))
-
-#define FETCHOP_STORE_OP(addr, op, x) ( \
- *(volatile long *)((char*) (addr) + (op)) = (long) (x))
-
-#ifdef __KERNEL__
-
-/*
- * Convert a region 6 (kaddr) address to the address of the fetchop variable
- */
-#define FETCHOP_KADDR_TO_MSPEC_ADDR(kaddr) TO_MSPEC(kaddr)
-
-
-/*
- * Each Atomic Memory Operation (AMO formerly known as fetchop)
- * variable is 64 bytes long. The first 8 bytes are used. The
- * remaining 56 bytes are unaddressable due to the operation taking
- * that portion of the address.
- *
- * NOTE: The AMO_t _MUST_ be placed in either the first or second half
- * of the cache line. The cache line _MUST NOT_ be used for anything
- * other than additional AMO_t entries. This is because there are two
- * addresses which reference the same physical cache line. One will
- * be a cached entry with the memory type bits all set. This address
- * may be loaded into processor cache. The AMO_t will be referenced
- * uncached via the memory special memory type. If any portion of the
- * cached cache-line is modified, when that line is flushed, it will
- * overwrite the uncached value in physical memory and lead to
- * inconsistency.
- */
-typedef struct {
- u64 variable;
- u64 unused[7];
-} AMO_t;
-
-
-/*
- * The following APIs are externalized to the kernel to allocate/free pages of
- * fetchop variables.
- * fetchop_kalloc_page - Allocate/initialize 1 fetchop page on the
- * specified cnode.
- * fetchop_kfree_page - Free a previously allocated fetchop page
- */
-
-unsigned long fetchop_kalloc_page(int nid);
-void fetchop_kfree_page(unsigned long maddr);
-
-
-#endif /* __KERNEL__ */
-
-#endif /* _ASM_IA64_SN_FETCHOP_H */
-
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/mspec.h linux-2.6.12-rc2-mm3/include/asm-ia64/sn/mspec.h
--- linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/mspec.h 1969-12-31 16:00:00 -08:00
+++ linux-2.6.12-rc2-mm3/include/asm-ia64/sn/mspec.h 2005-04-12 02:14:06 -07:00
@@ -0,0 +1,72 @@
+/*
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2001-2004 Silicon Graphics, Inc. All rights reserved.
+ */
+
+#ifndef _ASM_IA64_SN_MSPEC_H
+#define _ASM_IA64_SN_MSPEC_H
+
+#define FETCHOP_VAR_SIZE 64 /* 64 bytes per fetchop variable */
+
+#define FETCHOP_LOAD 0
+#define FETCHOP_INCREMENT 8
+#define FETCHOP_DECREMENT 16
+#define FETCHOP_CLEAR 24
+
+#define FETCHOP_STORE 0
+#define FETCHOP_AND 24
+#define FETCHOP_OR 32
+
+#define FETCHOP_CLEAR_CACHE 56
+
+#define FETCHOP_LOAD_OP(addr, op) ( \
+ *(volatile long *)((char*) (addr) + (op)))
+
+#define FETCHOP_STORE_OP(addr, op, x) ( \
+ *(volatile long *)((char*) (addr) + (op)) = (long) (x))
+
+#ifdef __KERNEL__
+
+/*
+ * Each Atomic Memory Operation (AMO formerly known as fetchop)
+ * variable is 64 bytes long. The first 8 bytes are used. The
+ * remaining 56 bytes are unaddressable due to the operation taking
+ * that portion of the address.
+ *
+ * NOTE: The AMO_t _MUST_ be placed in either the first or second half
+ * of the cache line. The cache line _MUST NOT_ be used for anything
+ * other than additional AMO_t entries. This is because there are two
+ * addresses which reference the same physical cache line. One will
+ * be a cached entry with the memory type bits all set. This address
+ * may be loaded into processor cache. The AMO_t will be referenced
+ * uncached via the memory special memory type. If any portion of the
+ * cached cache-line is modified, when that line is flushed, it will
+ * overwrite the uncached value in physical memory and lead to
+ * inconsistency.
+ */
+typedef struct {
+ u64 variable;
+ u64 unused[7];
+} AMO_t;
+
+
+/*
+ * The following APIs are externalized to the kernel to allocate/free pages of
+ * fetchop variables.
+ * mspec_kalloc_page - Allocate/initialize 1 fetchop page on the
+ * specified cnode.
+ * mspec_kfree_page - Free a previously allocated fetchop page
+ */
+
+extern unsigned long mspec_kalloc_page(int);
+extern void mspec_kfree_page(unsigned long);
+
+
+#endif /* __KERNEL__ */
+
+#endif /* _ASM_IA64_SN_MSPEC_H */
+

2005-04-13 20:43:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch] mspec driver for 2.6.12-rc2-mm3

On Tue, Apr 12, 2005 at 10:50:08AM -0400, Jes Sorensen wrote:
> +config MSPEC
> + tristate "Special Memory support"
> + select GENERIC_ALLOCATOR

should depend on IA64_GENERIC || IA64_SGI_SN2 because it's using sn2
functions like bte_copy
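
Something along these lines, say (just a sketch of the suggested
dependency, not tested):

config MSPEC
	tristate "Special Memory support"
	depends on IA64_GENERIC || IA64_SGI_SN2
	select GENERIC_ALLOCATOR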

> +#define BTE_ZERO_BLOCK(_maddr, _len) \
> + bte_copy(0, _maddr - __IA64_UNCACHED_OFFSET, _len, BTE_WACQUIRE | BTE_ZERO_FILL, NULL)

should become an inline function.

> +static int fetchop_mmap(struct file *file, struct vm_area_struct *vma);
> +static int cached_mmap(struct file *file, struct vm_area_struct *vma);
> +static int uncached_mmap(struct file *file, struct vm_area_struct *vma);
> +static void mspec_open(struct vm_area_struct *vma);
> +static void mspec_close(struct vm_area_struct *vma);
> +static struct page * mspec_nopage(struct vm_area_struct *vma,
> + unsigned long address, int *unused);

please try to reorder the code to avoid forward-prototypes where easily
possible.

> +/*
> + * There is one of these structs per node. It is used to manage the mspec
> + * space that is available on the node. Current assumption is that there is
> + * only 1 mspec block of memory per node.
> + */
> +struct node_mspecs {
> + long maddr; /* phys addr of start of mspecs. */
> + int count; /* Total number of mspec pages. */
> + atomic_t free; /* Number of pages currently free. */
> + unsigned long bits[1]; /* Bitmap for managing pages. */

Please use the bits[0] gcc extension so the size calculations for the
variable-sized array are obviously correct.
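
I.e. something like this (sketch; assumes BITS_TO_LONGS from
linux/bitops.h):

struct node_mspecs {
	long maddr;		/* phys addr of start of mspecs. */
	int count;		/* Total number of mspec pages. */
	atomic_t free;		/* Number of pages currently free. */
	unsigned long bits[0];	/* Bitmap for managing pages. */
};

/* the allocation then reads naturally: */
ms = kmalloc(sizeof(*ms) + BITS_TO_LONGS(count) * sizeof(long),
	     GFP_KERNEL);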

> +/*
> + * One of these structures is allocated when an mspec region is mmaped. The
> + * structure is pointed to by the vma->vm_private_data field in the vma struct.
> + * This structure is used to record the addresses of the mspec pages.
> + */
> +struct vma_data {
> + atomic_t refcnt; /* Number of vmas sharing the data. */
> + spinlock_t lock; /* Serialize access to the vma. */
> + int count; /* Number of pages allocated. */
> + int type; /* Type of pages allocated. */
> + unsigned long maddr[1]; /* Array of MSPEC addresses. */

ditto

> +};
> +
> +
> +/*
> + * Memory Special statistics.
> + */
> +struct mspec_stats {
> + atomic_t map_count; /* Number of active mmap's */
> + atomic_t pages_in_use; /* Number of mspec pages in use */
> + unsigned long pages_total; /* Total number of mspec pages */
> +};
> +
> +static struct mspec_stats mspec_stats;
> +static struct node_mspecs *node_mspecs[MAX_NUMNODES];
> +
> +#define MAX_UNCACHED_GRANULES 5
> +static int allocated_granules;
> +
> +struct gen_pool *mspec_pool[MAX_NUMNODES];

> +static unsigned long
> +mspec_get_new_chunk(struct gen_pool *poolp)
> +{
> + struct page *page;
> + void *tmp;
> + int status, node, i;
> + unsigned long addr;
> +
> + if (allocated_granules >= MAX_UNCACHED_GRANULES)
> + return 0;
> +
> + node = (int)poolp->private;

maybe the private data in the genpool should be a union of
void * and unsigned long so we can avoid all those casts?
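
I.e. roughly (a sketch against the genalloc patch; the field names here
are invented):

struct gen_pool {
	...
	union {
		void *ptr;
		unsigned long val;
	} private;
};

so the driver could just do "node = poolp->private.val;" with no cast.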

> + page = alloc_pages_node(node, GFP_KERNEL,
> + IA64_GRANULE_SHIFT-PAGE_SHIFT);

> + tmp = page_address(page);
> + memset(tmp, 0, IA64_GRANULE_SIZE);

shouldn't you just use __GFP_ZERO?

> +#if DEBUG
> + printk(KERN_INFO "pal_prefetch_visibility() returns %i on cpu %i\n",
> + status, get_cpu());
> +#endif

same dprintk comment as for genalloc.

> + vdata->lock = SPIN_LOCK_UNLOCKED;

you're supposed to use spin_lock_init() these days.

> +/*
> + * mspec_get_one_pte
> + *
> + * Return the pte for a given mm and address.
> + */
> +static __inline__ int
> +mspec_get_one_pte(struct mm_struct *mm, u64 address, pte_t **pte)
> +{
> + pgd_t *pgd;
> + pmd_t *pmd;
> + pud_t *pud;
> +
> + pgd = pgd_offset(mm, address);
> + if (pgd_present(*pgd)) {
> + pud = pud_offset(pgd, address);
> + if (pud_present(*pud)) {
> + pmd = pmd_offset(pud, address);
> + if (pmd_present(*pmd)) {
> + *pte = pte_offset_map(pmd, address);
> + if (pte_present(**pte)) {
> + return 0;
> + }
> + }
> + }
> + }
> +
> + return -1;
> +}

> + spin_lock(&vdata->lock);
> +
> + index = (address - vma->vm_start) >> PAGE_SHIFT;
> + if (vdata->maddr[index] == 0) {
> + vdata->count++;
> + maddr = mspec_alloc_page(numa_node_id(), vdata->type);

this looks like a page allocation under a spinlock.

> + /*
> + * The kernel requires a page structure to be returned upon
> + * success, but there are no page structures for low granule pages.
> + * remap_page_range() creates the pte for us and we return a
> + * bogus page back to the kernel fault handler to keep it happy
> + * (the page is freed immediately there).
> + */

Please don't use the ->nopage approach then, but do remap_pfn_range
beforehand. ->nopage is very nice if the region is actually backed by
normal RAM, but what you're doing doesn't make much sense.
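
I.e. roughly this, done once at mmap time (only a sketch; it reuses the
driver's own mspec_alloc_page/TO_CAC helpers and omits the error
unwinding):

static int mspec_mmap(struct file *file, struct vm_area_struct *vma)
{
	unsigned long addr, maddr;

	/* allocate and map every page up front instead of faulting */
	for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
		maddr = mspec_alloc_page(numa_node_id(), MSPEC_CACHED);
		if (!maddr)
			return -ENOMEM;
		if (remap_pfn_range(vma, addr,
				    __pa(TO_CAC(maddr)) >> PAGE_SHIFT,
				    PAGE_SIZE, vma->vm_page_prot))
			return -EAGAIN;
	}
	return 0;
}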

> +/*
> + * Walk the EFI memory map to pull out leftover pages in the lower
> + * memory regions which do not end up in the regular memory map and
> + * stick them into the uncached allocator
> + */
> +static void __init
> +mspec_walk_efi_memmap_uc (void)

I'm pretty sure this was NACKed on the ia64 list, and SGI was told to do
a more generic efi memmap walk.

> + /*
> + * The fetchop device only works on SN2 hardware, uncached and cached
> + * memory drivers should both be valid on all ia64 hardware
> + */

In which case my above comment about the dependency doesn't make sense, but
you'll have to split the driver into separate files or add ifdefs. Please
test it on some non-sgi hardware with a non-generic kernel build.

> + printk(KERN_INFO "%s: v%s\n", FETCHOP_DRIVER_ID_STR, REVISION);
> + printk(KERN_INFO "%s: v%s\n", CACHED_DRIVER_ID_STR, REVISION);
> + printk(KERN_INFO "%s: v%s\n", UNCACHED_DRIVER_ID_STR, REVISION);

Please keep the noise level down and remove these.

> +unsigned long
> +mspec_kalloc_page(int nid)
> +{
> + return TO_AMO(mspec_alloc_page(nid, MSPEC_FETCHOP));
> +}
> +
> +
> +void
> +mspec_kfree_page(unsigned long maddr)
> +{
> + mspec_free_page(TO_PHYS(maddr) + __IA64_UNCACHED_OFFSET);
> +}
> +EXPORT_SYMBOL(mspec_kalloc_page);
> +EXPORT_SYMBOL(mspec_kfree_page);

What is this for? These look like really odd APIs.

2005-04-22 11:24:58

by Jes Sorensen

[permalink] [raw]
Subject: Re: [patch] mspec driver for 2.6.12-rc2-mm3

Hi,

Here's an updated version of the mspec driver. I have tried to address
pretty much all the issues Christoph reported. I have trimmed most of the
quoted comments, but respond to a few specific items below.

Patch at the end - I hope it's ready for the next -mm release now.

Cheers,
Jes


>>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:

Christoph> On Tue, Apr 12, 2005 at 10:50:08AM -0400, Jes Sorensen
Christoph> wrote:
>> + atomic_t refcnt; /* Number of vmas sharing the data. */
>> + spinlock_t lock; /* Serialize access to the vma. */
>> + int count; /* Number of pages allocated. */
>> + int type; /* Type of pages allocated. */
>> + unsigned long maddr[1]; /* Array of MSPEC addresses. */

Christoph> dito

The code uses the size in its calculations; it could be changed either
way, but I don't think it's worth making the change.

>> + /*
>> + * The kernel requires a page structure to be returned upon
>> + * success, but there are no page structures for low granule pages.
>> + * remap_page_range() creates the pte for us and we return a
>> + * bogus page back to the kernel fault handler to keep it happy
>> + * (the page is freed immediately there).
>> + */

Christoph> Please don't use the ->nopage approach then, but do
Christoph> remap_pfn_range beforehand. ->nopage is very nice if the
Christoph> region is actually backed by normal RAM, but what you're
Christoph> doing doesn't make much sense.

That's what I used to think; however, you want the node-local setup for
performance reasons. Otherwise I would have switched to remap_pfn_range.

>> +/*
>> + * Walk the EFI memory map to pull out leftover pages in the lower
>> + * memory regions which do not end up in the regular memory map and
>> + * stick them into the uncached allocator
>> + */
>> +static void __init
>> +mspec_walk_efi_memmap_uc (void)

Christoph> I'm pretty sure this was NACKed on the ia64 list, and SGI
Christoph> was told to do a more generic efi memmap walk.

No, the issue back then was that the driver just took the memory and
kept it to itself. The new approach exports it to other users.

...............................


Memory special driver for cached, uncached and 'fetchop' (SGI SN2 specific)
memory mappings, formerly known as fetchop. Mostly used by parallel
applications.

This patch relies on the PG_uncached support patch and the generic
allocator patch (genalloc).

Signed-off-by: Jes Sorensen <[email protected]>


diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/Kconfig linux-2.6.12-rc2-mm3/arch/ia64/Kconfig
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/Kconfig 2005-04-12 02:09:02 -07:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/Kconfig 2005-04-12 02:14:06 -07:00
@@ -217,6 +217,16 @@
If you are compiling a kernel that will run under SGI's IA-64
simulator (Medusa) then say Y, otherwise say N.

+config MSPEC
+ tristate "Special Memory support"
+ select GENERIC_ALLOCATOR
+ help
+ This driver allows for cached and uncached mappings of memory
+ to user processes. On SGI SN hardware it will also export the
+ special fetchop memory facility.
+ Fetchops are atomic memory operations that are implemented in the
+ memory controller on SGI SN hardware.
+
config FORCE_MAX_ZONEORDER
int
default "18"
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/configs/sn2_defconfig linux-2.6.12-rc2-mm3/arch/ia64/configs/sn2_defconfig
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/configs/sn2_defconfig 2005-04-12 02:09:02 -07:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/configs/sn2_defconfig 2005-04-12 02:14:06 -07:00
@@ -82,6 +82,7 @@
# CONFIG_IA64_CYCLONE is not set
CONFIG_IOSAPIC=y
CONFIG_IA64_SGI_SN_SIM=y
+CONFIG_MSPEC=m
CONFIG_FORCE_MAX_ZONEORDER=18
CONFIG_SMP=y
CONFIG_NR_CPUS=512
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/defconfig linux-2.6.12-rc2-mm3/arch/ia64/defconfig
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/defconfig 2005-04-12 02:09:02 -07:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/defconfig 2005-04-12 02:14:06 -07:00
@@ -80,6 +80,7 @@
CONFIG_ARCH_DISCONTIGMEM_ENABLE=y
CONFIG_IA64_CYCLONE=y
CONFIG_IOSAPIC=y
+CONFIG_MSPEC=m
CONFIG_FORCE_MAX_ZONEORDER=18
CONFIG_SMP=y
CONFIG_NR_CPUS=512
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/kernel/Makefile linux-2.6.12-rc2-mm3/arch/ia64/kernel/Makefile
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/kernel/Makefile 2005-03-01 23:38:33 -08:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/kernel/Makefile 2005-04-12 02:14:06 -07:00
@@ -20,6 +20,7 @@
obj-$(CONFIG_PERFMON) += perfmon_default_smpl.o
obj-$(CONFIG_IA64_CYCLONE) += cyclone.o
obj-$(CONFIG_IA64_MCA_RECOVERY) += mca_recovery.o
+obj-$(CONFIG_MSPEC) += mspec.o
mca_recovery-y += mca_drv.o mca_drv_asm.o

# The gate DSO image is built using a special linker script.
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/arch/ia64/kernel/mspec.c linux-2.6.12-rc2-mm3/arch/ia64/kernel/mspec.c
--- linux-2.6.12-rc2-mm3-vanilla/arch/ia64/kernel/mspec.c 1969-12-31 16:00:00 -08:00
+++ linux-2.6.12-rc2-mm3/arch/ia64/kernel/mspec.c 2005-04-22 04:10:18 -07:00
@@ -0,0 +1,771 @@
+/*
+ * Copyright (C) 2001-2005 Silicon Graphics, Inc. All rights
+ * reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of version 2 of the GNU General Public License
+ * as published by the Free Software Foundation.
+ */
+
+/*
+ * SN Platform Special Memory (mspec) Support
+ *
+ * This driver exports the SN special memory (mspec) facility to user processes.
+ * There are three types of memory made available through this driver:
+ * fetchops, uncached and cached.
+ *
+ * Fetchops are atomic memory operations that are implemented in the
+ * memory controller on SGI SN hardware.
+ *
+ * Uncached mappings are used for the memory write-combining feature
+ * of the ia64 cpu.
+ *
+ * Cached mappings are used for memory accessed as cached addresses on
+ * our partition and as uncached addresses from other partitions.
+ * Due to a design constraint of the SN2 Shub, processors on the same
+ * FSB must not perform both a cached and an uncached reference to the
+ * same cache line. These special memory cached regions prevent the
+ * kernel from ever dropping in a TLB entry and therefore prevent the
+ * processor from ever speculating a cache line from this page.
+ */
+
+
+#include <linux/config.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/errno.h>
+#include <linux/miscdevice.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/proc_fs.h>
+#include <linux/vmalloc.h>
+#include <linux/bitops.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <linux/seq_file.h>
+#include <linux/efi.h>
+#include <linux/genalloc.h>
+#include <asm/page.h>
+#include <asm/pal.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/atomic.h>
+#include <asm/tlbflush.h>
+#include <asm/sn/addrs.h>
+#include <asm/sn/arch.h>
+#include <asm/sn/mspec.h>
+#include <asm/sn/sn_cpuid.h>
+#include <asm/sn/io.h>
+#include <asm/sn/bte.h>
+#include <asm/sn/shubio.h>
+
+
+#define DEBUG 0
+
+#if DEBUG
+#define dprintk printk
+#else
+#define dprintk(x...) do { } while (0)
+#endif
+
+
+#define FETCHOP_ID "Fetchop,"
+#define CACHED_ID "Cached,"
+#define UNCACHED_ID "Uncached"
+#define REVISION "3.0"
+#define MSPEC_BASENAME "mspec"
+
+
+/*
+ * Page types allocated by the device.
+ */
+enum {
+ MSPEC_FETCHOP = 1,
+ MSPEC_CACHED,
+ MSPEC_UNCACHED
+};
+
+
+/*
+ * One of these structures is allocated when an mspec region is mmaped. The
+ * structure is pointed to by the vma->vm_private_data field in the vma struct.
+ * This structure is used to record the addresses of the mspec pages.
+ */
+struct vma_data {
+ atomic_t refcnt; /* Number of vmas sharing the data. */
+ spinlock_t lock; /* Serialize access to the vma. */
+ int count; /* Number of pages allocated. */
+ int type; /* Type of pages allocated. */
+ unsigned long maddr[1]; /* Array of MSPEC addresses. */
+};
+
+
+/*
+ * Memory Special statistics.
+ */
+struct mspec_stats {
+ atomic_t map_count; /* Number of active mmap's */
+ atomic_t pages_in_use; /* Number of mspec pages in use */
+ unsigned long pages_total; /* Total number of mspec pages */
+};
+
+static struct mspec_stats mspec_stats;
+
+#define MAX_UNCACHED_GRANULES 5
+static int allocated_granules;
+
+struct gen_pool *mspec_pool[MAX_NUMNODES];
+
+
+static inline int zero_block(unsigned long addr, int len)
+{
+ int status;
+
+ if (ia64_platform_is("sn2"))
+ status = bte_copy(0, addr - __IA64_UNCACHED_OFFSET, len,
+ BTE_WACQUIRE | BTE_ZERO_FILL, NULL);
+ else {
+ memset((char *)addr, 0, len);
+ status = 0;
+ }
+ return status;
+}
+
+
+static void mspec_ipi_visibility(void *data)
+{
+ int status;
+
+ status = ia64_pal_prefetch_visibility(PAL_VISIBILITY_PHYSICAL);
+ if ((status != PAL_VISIBILITY_OK) &&
+ (status != PAL_VISIBILITY_OK_REMOTE_NEEDED))
+ printk(KERN_DEBUG "pal_prefetch_visibility() returns %i on "
+ "CPU %i\n", status, get_cpu());
+}
+
+
+static void mspec_ipi_mc_drain(void *data)
+{
+ int status;
+ status = ia64_pal_mc_drain();
+ if (status)
+ printk(KERN_WARNING "ia64_pal_mc_drain() failed with %i on "
+ "CPU %i\n", status, get_cpu());
+}
+
+
+static unsigned long
+mspec_get_new_chunk(struct gen_pool *poolp)
+{
+ struct page *page;
+ void *tmp;
+ int status, i;
+ unsigned long addr, node;
+
+ if (allocated_granules >= MAX_UNCACHED_GRANULES)
+ return 0;
+
+ node = poolp->private;
+ page = alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
+ IA64_GRANULE_SHIFT-PAGE_SHIFT);
+
+ dprintk(KERN_INFO "get_new_chunk page %p, addr %lx\n",
+ page, (unsigned long)(page-vmem_map) << PAGE_SHIFT);
+
+ /*
+ * Do magic if no mem on local node! XXX
+ */
+ if (!page)
+ return 0;
+ tmp = page_address(page);
+
+ /*
+ * There's a small race here where it's possible for someone to
+ * access the page through /dev/mem halfway through the conversion
+ * to uncached - not sure it's really worth bothering about
+ */
+ for (i = 0; i < (IA64_GRANULE_SIZE / PAGE_SIZE); i++)
+ SetPageUncached(&page[i]);
+
+ flush_tlb_kernel_range((unsigned long)tmp, (unsigned long)tmp + IA64_GRANULE_SIZE);
+
+ status = ia64_pal_prefetch_visibility(PAL_VISIBILITY_PHYSICAL);
+
+ dprintk(KERN_INFO "pal_prefetch_visibility() returns %i on cpu %i\n",
+ status, get_cpu());
+
+ if (!status) {
+ status = smp_call_function(mspec_ipi_visibility, NULL, 0, 1);
+ if (status)
+ printk(KERN_WARNING "smp_call_function failed for "
+ "mspec_ipi_visibility! (%i)\n", status);
+ }
+
+ sn_flush_all_caches((unsigned long)tmp, IA64_GRANULE_SIZE);
+ ia64_pal_mc_drain();
+ status = smp_call_function(mspec_ipi_mc_drain, NULL, 0, 1);
+ if (status)
+ printk(KERN_WARNING "smp_call_function failed for "
+ "mspec_ipi_mc_drain! (%i)\n", status);
+
+ addr = (unsigned long)tmp - PAGE_OFFSET + __IA64_UNCACHED_OFFSET;
+
+ allocated_granules++;
+ return addr;
+}
+
+
+/*
+ * mspec_alloc_page
+ *
+ * Allocate 1 mspec page. Allocates on the requested node. If no
+ * mspec pages are available on the requested node, fall back to the
+ * remaining nodes, starting with the highest.
+ */
+static unsigned long
+mspec_alloc_page(int nid)
+{
+ unsigned long maddr;
+
+ maddr = gen_pool_alloc(mspec_pool[nid], PAGE_SIZE);
+
+ dprintk(KERN_DEBUG "mspec_alloc_page returns %lx on node %i\n",
+ maddr, nid);
+
+ /*
+ * If no memory is available on our local node, try the
+ * remaining nodes in the system.
+ */
+ if (!maddr) {
+ int i;
+
+ for (i = MAX_NUMNODES - 1; i >= 0; i--) {
+ if (i == nid || !node_online(i))
+ continue;
+ maddr = gen_pool_alloc(mspec_pool[i], PAGE_SIZE);
+ dprintk(KERN_DEBUG "mspec_alloc_page alternate search "
+ "returns %lx on node %i\n", maddr, i);
+ if (maddr) {
+ break;
+ }
+ }
+ }
+
+ if (maddr)
+ atomic_inc(&mspec_stats.pages_in_use);
+
+ return maddr;
+}
+EXPORT_SYMBOL(mspec_alloc_page);
+
+
+/*
+ * mspec_free_page
+ *
+ * Free a single mspec page.
+ */
+static void
+mspec_free_page(unsigned long maddr)
+{
+ int node;
+
+ node = nasid_to_cnodeid(NASID_GET(maddr));
+
+ dprintk(KERN_DEBUG "mspec_free_page(%lx) on node %i\n", maddr, node);
+
+ if ((maddr & (0xFUL << 60)) != __IA64_UNCACHED_OFFSET)
+ panic("mspec_free_page invalid address %lx\n", maddr);
+
+ atomic_dec(&mspec_stats.pages_in_use);
+ gen_pool_free(mspec_pool[node], maddr, PAGE_SIZE);
+}
+EXPORT_SYMBOL(mspec_free_page);
+
+
+/*
+ * mspec_open
+ *
+ * Called when a device mapping is created by a means other than mmap
+ * (via fork, etc.). Increments the reference count on the underlying
+ * mspec data so it is not freed prematurely.
+ */
+static void
+mspec_open(struct vm_area_struct *vma)
+{
+ struct vma_data *vdata;
+
+ vdata = vma->vm_private_data;
+ atomic_inc(&vdata->refcnt);
+}
+
+
+/*
+ * mspec_close
+ *
+ * Called when unmapping a device mapping. Frees all mspec pages
+ * belonging to the vma.
+ */
+static void
+mspec_close(struct vm_area_struct *vma)
+{
+ struct vma_data *vdata;
+ int i, pages, result;
+
+ vdata = vma->vm_private_data;
+ if (atomic_dec_and_test(&vdata->refcnt)) {
+ pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ for (i = 0; i < pages; i++) {
+ if (vdata->maddr[i] != 0) {
+ /*
+ * Clear the page before sticking it back
+ * into the pool.
+ */
+ result = zero_block(vdata->maddr[i], PAGE_SIZE);
+ if (!result)
+ mspec_free_page(vdata->maddr[i]);
+ else
+ printk(KERN_WARNING "mspec_close(): "
+ "failed to zero page %i\n",
+ result);
+ }
+ }
+ if (vdata->count)
+ atomic_dec(&mspec_stats.map_count);
+ vfree(vdata);
+ }
+}
+
+
+/*
+ * mspec_get_one_pte
+ *
+ * Return the pte for a given mm and address.
+ */
+static __inline__ int
+mspec_get_one_pte(struct mm_struct *mm, u64 address, pte_t **pte)
+{
+ pgd_t *pgd;
+ pmd_t *pmd;
+ pud_t *pud;
+
+ pgd = pgd_offset(mm, address);
+ if (pgd_present(*pgd)) {
+ pud = pud_offset(pgd, address);
+ if (pud_present(*pud)) {
+ pmd = pmd_offset(pud, address);
+ if (pmd_present(*pmd)) {
+ *pte = pte_offset_map(pmd, address);
+ if (pte_present(**pte)) {
+ return 0;
+ }
+ }
+ }
+ }
+
+ return -1;
+}
+
+
+/*
+ * mspec_nopage
+ *
+ * Creates a mspec page and maps it to user space.
+ */
+static struct page *
+mspec_nopage(struct vm_area_struct *vma, unsigned long address, int *unused)
+{
+ unsigned long paddr, maddr = 0;
+ unsigned long pfn;
+ int index;
+ pte_t *pte;
+ struct page *page;
+ struct vma_data *vdata = vma->vm_private_data;
+
+ spin_lock(&vdata->lock);
+
+ index = (address - vma->vm_start) >> PAGE_SHIFT;
+ if (vdata->maddr[index] == 0) {
+ vdata->count++;
+ spin_unlock(&vdata->lock);
+ maddr = mspec_alloc_page(numa_node_id());
+ if (maddr == 0)
+ BUG();
+ spin_lock(&vdata->lock);
+ vdata->maddr[index] = maddr;
+ } else if (mspec_get_one_pte(vma->vm_mm, address, &pte) == 0) {
+ printk(KERN_ERR "page already mapped\n");
+ /*
+ * The page may have already been faulted by another
+ * pthread. If so, we need to avoid remapping the
+ * page or we will trip a BUG check in the
+ * remap_pfn_range() path.
+ */
+ goto getpage;
+ }
+
+ if (vdata->type == MSPEC_FETCHOP)
+ paddr = TO_AMO(vdata->maddr[index]);
+ else
+ paddr = __pa(TO_CAC(vdata->maddr[index]));
+
+ /*
+ * XXX - is this correct?
+ */
+ pfn = paddr >> PAGE_SHIFT;
+ if (remap_pfn_range(vma, address, pfn, PAGE_SIZE, vma->vm_page_prot)) {
+ printk(KERN_ERR "remap_pfn_range failed!\n");
+ goto error;
+ }
+
+ /*
+ * The kernel requires a page structure to be returned upon
+ * success, but there are no page structures for low granule pages.
+ * remap_pfn_range() creates the pte for us and we return a
+ * bogus page back to the kernel fault handler to keep it happy
+ * (the page is freed immediately there).
+ */
+ if (mspec_get_one_pte(vma->vm_mm, address, &pte) == 0) {
+ spin_lock(&vma->vm_mm->page_table_lock);
+ inc_mm_counter(vma->vm_mm, rss);
+ spin_unlock(&vma->vm_mm->page_table_lock);
+
+ set_pte(pte, pte_mkwrite(pte_mkdirty(*pte)));
+ }
+ getpage:
+ /*
+ * Is this really correct?
+ */
+ spin_unlock(&vdata->lock);
+ page = alloc_pages(GFP_USER, 0);
+ return page;
+
+ error:
+ if (maddr) {
+ mspec_free_page(vdata->maddr[index]);
+ vdata->maddr[index] = 0;
+ vdata->count--;
+ }
+ spin_unlock(&vdata->lock);
+ return NOPAGE_SIGBUS;
+}
+
+
+static struct vm_operations_struct mspec_vm_ops = {
+ .open = mspec_open,
+ .close = mspec_close,
+ .nopage = mspec_nopage
+};
+
+
+/*
+ * mspec_mmap
+ *
+ * Called when mmaping the device. Initializes the vma with a fault handler
+ * and private data structure necessary to allocate, track, and free the
+ * underlying pages.
+ */
+static int
+mspec_mmap(struct file *file, struct vm_area_struct *vma, int type)
+{
+ struct vma_data *vdata;
+ int pages;
+
+ if (vma->vm_pgoff != 0)
+ return -EINVAL;
+
+ if ((vma->vm_flags & VM_WRITE) == 0)
+ return -EPERM;
+
+ pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+ if (!(vdata = vmalloc(sizeof(struct vma_data)+(pages-1)*sizeof(long))))
+ return -ENOMEM;
+ memset(vdata, 0, sizeof(struct vma_data)+(pages-1)*sizeof(long));
+
+ vdata->type = type;
+ spin_lock_init(&vdata->lock);
+ vdata->refcnt = ATOMIC_INIT(1);
+ vma->vm_private_data = vdata;
+
+ vma->vm_flags |= (VM_IO | VM_SHM | VM_LOCKED);
+ if (vdata->type == MSPEC_FETCHOP || vdata->type == MSPEC_UNCACHED)
+ vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+ vma->vm_ops = &mspec_vm_ops;
+
+ atomic_inc(&mspec_stats.map_count);
+ return 0;
+}
+
+
+static int
+fetchop_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_FETCHOP);
+}
+
+
+static int
+cached_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_CACHED);
+}
+
+
+static int
+uncached_mmap(struct file *file, struct vm_area_struct *vma)
+{
+ return mspec_mmap(file, vma, MSPEC_UNCACHED);
+}
+
+
+#ifdef CONFIG_PROC_FS
+static void *
+mspec_seq_start(struct seq_file *file, loff_t *offset)
+{
+ if (*offset < MAX_NUMNODES)
+ return offset;
+ return NULL;
+}
+
+static void *
+mspec_seq_next(struct seq_file *file, void *data, loff_t *offset)
+{
+ (*offset)++;
+ if (*offset < MAX_NUMNODES)
+ return offset;
+ return NULL;
+}
+
+static void
+mspec_seq_stop(struct seq_file *file, void *data)
+{
+}
+
+static int
+mspec_seq_show(struct seq_file *file, void *data)
+{
+ int i;
+
+ i = *(loff_t *)data;
+
+ if (!i) {
+ seq_printf(file, "mappings : %i\n",
+ atomic_read(&mspec_stats.map_count));
+ seq_printf(file, "current mspec pages : %i\n",
+ atomic_read(&mspec_stats.pages_in_use));
+ seq_printf(file, "%4s %7s %7s\n", "node", "total", "free");
+ }
+ return 0;
+}
+
+
+static struct seq_operations mspec_seq_ops = {
+ .start = mspec_seq_start,
+ .next = mspec_seq_next,
+ .stop = mspec_seq_stop,
+ .show = mspec_seq_show
+};
+
+int
+mspec_proc_open(struct inode *inode, struct file *file)
+{
+ return seq_open(file, &mspec_seq_ops);
+}
+
+static struct file_operations proc_mspec_operations = {
+ .open = mspec_proc_open,
+ .read = seq_read,
+ .llseek = seq_lseek,
+ .release = seq_release,
+};
+
+
+static struct proc_dir_entry *proc_mspec;
+
+#endif /* CONFIG_PROC_FS */
+
+/*
+ * mspec_build_memmap,
+ *
+ * Called at boot time to build a map of pages that can be used for
+ * memory special operations.
+ */
+static int __init
+mspec_build_memmap(unsigned long start, unsigned long end)
+{
+ long length;
+ unsigned long vstart, vend;
+ int node, result;
+
+ length = end - start;
+ vstart = start + __IA64_UNCACHED_OFFSET;
+ vend = end + __IA64_UNCACHED_OFFSET;
+
+ dprintk(KERN_ERR "mspec_build_memmap(%lx %lx)\n", start, end);
+
+ result = zero_block(vstart, length);
+ if (result)
+ panic("Failed while trying to zero mspec page: %i\n", result);
+
+ node = nasid_to_cnodeid(NASID_GET(start));
+
+ for (; vstart < vend ; vstart += PAGE_SIZE) {
+ dprintk(KERN_INFO "sticking %lx into the pool!\n", vstart);
+ gen_pool_free(mspec_pool[node], vstart, PAGE_SIZE);
+ }
+
+ return 0;
+}
+
+/*
+ * Walk the EFI memory map to pull out leftover pages in the lower
+ * memory regions which do not end up in the regular memory map and
+ * stick them into the uncached allocator
+ */
+static void __init
+mspec_walk_efi_memmap_uc (void)
+{
+ void *efi_map_start, *efi_map_end, *p;
+ efi_memory_desc_t *md;
+ u64 efi_desc_size, start, end;
+
+ efi_map_start = __va(ia64_boot_param->efi_memmap);
+ efi_map_end = efi_map_start + ia64_boot_param->efi_memmap_size;
+ efi_desc_size = ia64_boot_param->efi_memdesc_size;
+
+ for (p = efi_map_start; p < efi_map_end; p += efi_desc_size) {
+ md = p;
+ if (md->attribute == EFI_MEMORY_UC) {
+ start = PAGE_ALIGN(md->phys_addr);
+ end = PAGE_ALIGN((md->phys_addr+(md->num_pages << EFI_PAGE_SHIFT)) & PAGE_MASK);
+ if (mspec_build_memmap(start, end) < 0)
+ return;
+ }
+ }
+}
+
+
+static struct file_operations fetchop_fops = {
+ .owner = THIS_MODULE,
+ .mmap = fetchop_mmap
+};
+static struct miscdevice fetchop_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgi_fetchop",
+ .fops = &fetchop_fops
+};
+
+
+static struct file_operations cached_fops = {
+ .owner = THIS_MODULE,
+ .mmap = cached_mmap
+};
+static struct miscdevice cached_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgi_cached",
+ .fops = &cached_fops
+};
+
+
+static struct file_operations uncached_fops = {
+ .owner = THIS_MODULE,
+ .mmap = uncached_mmap
+};
+static struct miscdevice uncached_miscdev = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "sgi_uncached",
+ .fops = &uncached_fops
+};
+
+
+/*
+ * mspec_init
+ *
+ * Called at boot time to initialize the mspec facility.
+ */
+static int __init
+mspec_init(void)
+{
+ int i, ret;
+
+ /*
+ * The fetchop device only works on SN2 hardware; the uncached and
+ * cached memory drivers should both be valid on all ia64 hardware.
+ */
+ if (ia64_platform_is("sn2")) {
+ if ((ret = misc_register(&fetchop_miscdev))) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ FETCHOP_ID, ret);
+ return ret;
+ }
+ }
+ if ((ret = misc_register(&cached_miscdev))) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ CACHED_ID, ret);
+ misc_deregister(&fetchop_miscdev);
+ return ret;
+ }
+ if ((ret = misc_register(&uncached_miscdev))) {
+ printk(KERN_ERR "%s: failed to register device %i\n",
+ UNCACHED_ID, ret);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+ return ret;
+ }
+
+ /*
+ * /proc code needs to be updated to work with the new
+ * allocation scheme
+ */
+#ifdef CONFIG_PROC_FS
+ if (!(proc_mspec = create_proc_entry(MSPEC_BASENAME, 0444, NULL))){
+ printk(KERN_ERR "%s: unable to create proc entry",
+ MSPEC_BASENAME);
+ misc_deregister(&uncached_miscdev);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+ return -EINVAL;
+ }
+ proc_mspec->proc_fops = &proc_mspec_operations;
+#endif
+
+ for (i = 0; i < MAX_NUMNODES; i++) {
+ if (!node_online(i))
+ continue;
+ mspec_pool[i] = gen_pool_create(0, IA64_GRANULE_SHIFT,
+ &mspec_get_new_chunk, i);
+ }
+
+ mspec_walk_efi_memmap_uc();
+
+ printk(KERN_INFO "%s %s initialized devices: %s %s %s\n",
+ MSPEC_BASENAME, REVISION,
+ ia64_platform_is("sn2") ? FETCHOP_ID : "",
+ CACHED_ID, UNCACHED_ID);
+
+ return 0;
+}
+
+
+static void __exit
+mspec_exit(void)
+{
+ BUG_ON(atomic_read(&mspec_stats.pages_in_use) > 0);
+
+#ifdef CONFIG_PROC_FS
+ remove_proc_entry(MSPEC_BASENAME, NULL);
+#endif
+ misc_deregister(&uncached_miscdev);
+ misc_deregister(&cached_miscdev);
+ misc_deregister(&fetchop_miscdev);
+}
+
+
+module_init(mspec_init);
+module_exit(mspec_exit);
+
+
+MODULE_AUTHOR("Silicon Graphics, Inc.");
+MODULE_DESCRIPTION("Driver for SGI SN special memory operations");
+MODULE_LICENSE("GPL");
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/fetchop.h linux-2.6.12-rc2-mm3/include/asm-ia64/sn/fetchop.h
--- linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/fetchop.h 2005-03-01 23:38:12 -08:00
+++ linux-2.6.12-rc2-mm3/include/asm-ia64/sn/fetchop.h 1969-12-31 16:00:00 -08:00
@@ -1,85 +0,0 @@
-/*
- *
- * This file is subject to the terms and conditions of the GNU General Public
- * License. See the file "COPYING" in the main directory of this archive
- * for more details.
- *
- * Copyright (c) 2001-2004 Silicon Graphics, Inc. All rights reserved.
- */
-
-#ifndef _ASM_IA64_SN_FETCHOP_H
-#define _ASM_IA64_SN_FETCHOP_H
-
-#include <linux/config.h>
-
-#define FETCHOP_BASENAME "sgi_fetchop"
-#define FETCHOP_FULLNAME "/dev/sgi_fetchop"
-
-
-
-#define FETCHOP_VAR_SIZE 64 /* 64 byte per fetchop variable */
-
-#define FETCHOP_LOAD 0
-#define FETCHOP_INCREMENT 8
-#define FETCHOP_DECREMENT 16
-#define FETCHOP_CLEAR 24
-
-#define FETCHOP_STORE 0
-#define FETCHOP_AND 24
-#define FETCHOP_OR 32
-
-#define FETCHOP_CLEAR_CACHE 56
-
-#define FETCHOP_LOAD_OP(addr, op) ( \
- *(volatile long *)((char*) (addr) + (op)))
-
-#define FETCHOP_STORE_OP(addr, op, x) ( \
- *(volatile long *)((char*) (addr) + (op)) = (long) (x))
-
-#ifdef __KERNEL__
-
-/*
- * Convert a region 6 (kaddr) address to the address of the fetchop variable
- */
-#define FETCHOP_KADDR_TO_MSPEC_ADDR(kaddr) TO_MSPEC(kaddr)
-
-
-/*
- * Each Atomic Memory Operation (AMO formerly known as fetchop)
- * variable is 64 bytes long. The first 8 bytes are used. The
- * remaining 56 bytes are unaddressable due to the operation taking
- * that portion of the address.
- *
- * NOTE: The AMO_t _MUST_ be placed in either the first or second half
- * of the cache line. The cache line _MUST NOT_ be used for anything
- * other than additional AMO_t entries. This is because there are two
- * addresses which reference the same physical cache line. One will
- * be a cached entry with the memory type bits all set. This address
- * may be loaded into processor cache. The AMO_t will be referenced
- * uncached via the memory special memory type. If any portion of the
- * cached cache-line is modified, when that line is flushed, it will
- * overwrite the uncached value in physical memory and lead to
- * inconsistency.
- */
-typedef struct {
- u64 variable;
- u64 unused[7];
-} AMO_t;
-
-
-/*
- * The following APIs are externalized to the kernel to allocate/free pages of
- * fetchop variables.
- * fetchop_kalloc_page - Allocate/initialize 1 fetchop page on the
- * specified cnode.
- * fetchop_kfree_page - Free a previously allocated fetchop page
- */
-
-unsigned long fetchop_kalloc_page(int nid);
-void fetchop_kfree_page(unsigned long maddr);
-
-
-#endif /* __KERNEL__ */
-
-#endif /* _ASM_IA64_SN_FETCHOP_H */
-
diff -urN -X /usr/people/jes/exclude-linux linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/mspec.h linux-2.6.12-rc2-mm3/include/asm-ia64/sn/mspec.h
--- linux-2.6.12-rc2-mm3-vanilla/include/asm-ia64/sn/mspec.h 1969-12-31 16:00:00 -08:00
+++ linux-2.6.12-rc2-mm3/include/asm-ia64/sn/mspec.h 2005-04-12 02:14:06 -07:00
@@ -0,0 +1,72 @@
+/*
+ *
+ * This file is subject to the terms and conditions of the GNU General Public
+ * License. See the file "COPYING" in the main directory of this archive
+ * for more details.
+ *
+ * Copyright (c) 2001-2004 Silicon Graphics, Inc. All rights reserved.
+ */
+
+#ifndef _ASM_IA64_SN_MSPEC_H
+#define _ASM_IA64_SN_MSPEC_H
+
+#define FETCHOP_VAR_SIZE 64 /* 64 bytes per fetchop variable */
+
+#define FETCHOP_LOAD 0
+#define FETCHOP_INCREMENT 8
+#define FETCHOP_DECREMENT 16
+#define FETCHOP_CLEAR 24
+
+#define FETCHOP_STORE 0
+#define FETCHOP_AND 24
+#define FETCHOP_OR 32
+
+#define FETCHOP_CLEAR_CACHE 56
+
+#define FETCHOP_LOAD_OP(addr, op) ( \
+ *(volatile long *)((char*) (addr) + (op)))
+
+#define FETCHOP_STORE_OP(addr, op, x) ( \
+ *(volatile long *)((char*) (addr) + (op)) = (long) (x))
+
+#ifdef __KERNEL__
+
+/*
+ * Each Atomic Memory Operation (AMO formerly known as fetchop)
+ * variable is 64 bytes long. The first 8 bytes are used. The
+ * remaining 56 bytes are unaddressable due to the operation taking
+ * that portion of the address.
+ *
+ * NOTE: The AMO_t _MUST_ be placed in either the first or second half
+ * of the cache line. The cache line _MUST NOT_ be used for anything
+ * other than additional AMO_t entries. This is because there are two
+ * addresses which reference the same physical cache line. One will
+ * be a cached entry with the memory type bits all set. This address
+ * may be loaded into processor cache. The AMO_t will be referenced
+ * uncached via the memory special memory type. If any portion of the
+ * cached cache-line is modified, when that line is flushed, it will
+ * overwrite the uncached value in physical memory and lead to
+ * inconsistency.
+ */
+typedef struct {
+ u64 variable;
+ u64 unused[7];
+} AMO_t;
+
+
+/*
+ * The following APIs are externalized to the kernel to allocate/free pages of
+ * fetchop variables.
+ * mspec_kalloc_page - Allocate/initialize 1 fetchop page on the
+ * specified cnode.
+ * mspec_kfree_page - Free a previously allocated fetchop page
+ */
+
+extern unsigned long mspec_kalloc_page(int);
+extern void mspec_kfree_page(unsigned long);
+
+
+#endif /* __KERNEL__ */
+
+#endif /* _ASM_IA64_SN_MSPEC_H */
+

2005-04-24 10:16:25

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch] mspec driver for 2.6.12-rc2-mm3

> >> + int count; /* Number of pages allocated. */
> >> + int type; /* Type of pages allocated. */
> >> + unsigned long maddr[1]; /* Array of MSPEC addresses. */
>
> Christoph> dito
>
> The code uses the size in its calculations; it could be changed either
> way, but I don't think it's worth making the change.

The current code is obfuscated, see the pages-1 stuff and co.
Please change it.
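
With a proper zero-sized array the sizing reads naturally, e.g. (sketch):

struct vma_data {
	...
	unsigned long maddr[0];	/* Array of MSPEC addresses. */
};

pages = (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
vdata = vmalloc(sizeof(struct vma_data) + pages * sizeof(long));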

> >> + /*
> >> + * The kernel requires a page structure to be returned upon
> >> + * success, but there are no page structures for low granule pages.
> >> + * remap_page_range() creates the pte for us and we return a
> >> + * bogus page back to the kernel fault handler to keep it happy
> >> + * (the page is freed immediately there).
> >> + */
>
> Christoph> Please don't use the ->nopage approach then, but do
> Christoph> remap_pfn_range beforehand. ->nopage is very nice if the
> Christoph> region is actually backed by normal RAM, but what you're
> Christoph> doing doesn't make much sense.
>
> That's what I used to think; however, you want the node-local setup for
> performance reasons. Otherwise I would have switched to remap_pfn_range.

Then fix up remap_pfn_range (or rather add a new _node variant). The
current code relies on deep magic to work and could easily be broken by
a new kernel release.

> >> +/*
> >> + * Walk the EFI memory map to pull out leftover pages in the lower
> >> + * memory regions which do not end up in the regular memory map and
> >> + * stick them into the uncached allocator
> >> + */
> >> +static void __init
> >> +mspec_walk_efi_memmap_uc (void)
>
> Christoph> I'm pretty sure this was NACKed on the ia64 list, and SGI
> Christoph> was told to do a more generic efi memmap walk.
>
> No, the issue back then was that the driver just took the memory and
> kept it to itself. The new approach exports it to other users.

That comment doesn't make sense at all to me: exports what, to which
other users, and through what interface? Please bring this issue up on
the ia64 list again (and please post this patch to linux-ia64, too).


Jes, is it just me or are you trying to chicken out on all the real
problems? :-)

2005-04-25 10:13:13

by Jes Sorensen

[permalink] [raw]
Subject: Re: [patch] mspec driver for 2.6.12-rc2-mm3

>>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:

>> The code uses the size in its calculations; it could be changed
>> either way, but I don't think it's worth making the change.

Christoph> The current code is obfuscated, see the pages-1 stuff and
Christoph> co. Please change it.

Both ways work; this is down to nitpicking for the sake of
nitpicking. Whatever, I'll change it.

Christoph> Please don't use the ->nopage approach then, but do
Christoph> remap_pfn_range beforehand. ->nopage is very nice if the
Christoph> region is actually backed by normal RAM, but what you're
Christoph> doing doesn't make much sense.
>> That's what I used to think; however, you want the node-local setup
>> for performance reasons. Otherwise I would have switched to
>> remap_pfn_range.

Christoph> Then fixup remap_pfn_range (or rather add a new _node
Christoph> variant). The current code relies on deep magic to work
Christoph> and could be broken by a new kernel release easily.

Your approach doesn't work. This relies on first-touch to get
performance; remap_pfn_range_node wouldn't work.

Christoph> I'm pretty sure this was NACKed on the ia64 list, and SGI
Christoph> was told to do a more generic efi memmap walk.
>> No, the issue back then was that the driver just took the memory
>> and kept it to itself. The new approach exports it to other users.

Christoph> That comment doesn't make sense at all to me: exports what,
Christoph> to which other users, and through what interface? Please bring
Christoph> this issue up on the ia64 list again (and please post
Christoph> this patch to linux-ia64, too).

mspec_alloc_page can be called from anywhere by anyone who wants to
allocate an uncached page. The old fetchop driver just took the
uncached memory and kept it to itself. That's what I am talking about!
Earlier versions of the patch have already been through the ia64 list;
we're down to details here.

Christoph> Jes, is it just me or are you trying to chicken out on all
Christoph> the real problems? :-)

It's you! It seems you're forgetting to do the real research before
trying to shoot something down ;-)

Cheers,
Jes

2005-04-25 14:32:18

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [patch] mspec driver for 2.6.12-rc2-mm3

On Mon, Apr 25, 2005 at 06:13:01AM -0400, Jes Sorensen wrote:
> >>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:
>
> >> The code uses the size in its calculations; it could be changed
> >> either way, but I don't think it's worth making the change.
>
> Christoph> The current code is obfuscated, see the pages-1 stuff and
> Christoph> co. Please change it.
>
> Both ways work; this is down to nitpicking for the sake of
> nitpicking. Whatever, I'll change it.

It's nitpicking that keeps the code readable. Same thing as always using
named initializers for method vectors or similar things.

2005-04-25 14:41:57

by Christoph Hellwig

[permalink] [raw]
Subject: efi_memmap_walk_uc, was Re: [patch] mspec driver for 2.6.12-rc2-mm3

On Mon, Apr 25, 2005 at 06:13:01AM -0400, Jes Sorensen wrote:
> Your approach doesn't work. This relies on first-touch to get
> performance; remap_pfn_range_node wouldn't work.
>
> Christoph> I'm pretty sure this was NACKed on the ia64 list, and SGI
> Christoph> was told to do a more generic efi memmap walk.
>> No, the issue back then was that the driver just took the memory
>> and kept it to itself. The new approach exports it to other users.
>
> Christoph> That comment doesn't make sense at all to me: exports what,
> Christoph> to which other users, and through what interface? Please bring
> Christoph> this issue up on the ia64 list again (and please post
> Christoph> this patch to linux-ia64, too).
>
> mspec_alloc_page can be called from anywhere by anyone who wants to
> allocate an uncached page. The old fetchop driver just took the
> uncached memory and kept it to itself. That's what I am talking about!
> Earlier versions of the patch have already been through the ia64 list;
> we're down to details here.

See the thread starting at
http://marc.theaimsgroup.com/?l=linux-ia64&m=105883467032028&w=2

My reading is that it requests two things:

- not duplicating the EFI memmap walk in a new function, but rather
having a generic EFI memmap walk replace the current efi_memmap_walk
- an uncached memory allocator below the driver (not in the driver!).
Your allocator design also doesn't seem to take many of the suggestions
and recommendations in that thread into account.

2005-04-25 14:47:53

by Christoph Hellwig

[permalink] [raw]
Subject: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3

Jes has this shiny new IA64 uncached foo bar whizbang driver (see the patch
at http://marc.theaimsgroup.com/?l=linux-kernel&m=111416930927092&w=2),
which has a nopage routine that calls remap_pfn_range from ->nopage for
uncached memory that's not part of the mem map. Because ->nopage wants
to return a struct page *, he's allocating a normal kernel page and actually
returns that one; to get the page he wants into the pagetables he does
all the pagetable manipulation himself beforehand (see the gory details of
pagetable walks and modification inside a driver in the patch above).

I don't think these hacks are acceptable for a driver, especially as the
problem could easily be solved by calling remap_pfn_range in ->mmap -
except SGI also wants node locality.

2005-04-26 22:14:12

by Jes Sorensen

[permalink] [raw]
Subject: Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3

>>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:

Christoph> http://marc.theaimsgroup.com/?l=linux-kernel&m=111416930927092&w=2),
Christoph> which has a nopage routine that calls remap_pfn_range from
Christoph> ->nopage for uncached memory that's not part of the mem
Christoph> map. Because ->nopage wants to return a struct page * he's
Christoph> allocating a normal kernel page and actually returns that
Christoph> one - to get the page he wants into the pagetables his does
Christoph> all the pagetable manipulation himself before (See the
Christoph> glory details of pagetable walks and modification inside a
Christoph> driver in the patch above).

Christoph> I don't think these hacks are acceptable for a driver,
Christoph> especially as the problem can easily be solved by calling
Christoph> remap_pfn_range in ->mmap - except SGI also wants node
Christoph> locality..

Christoph,

Let me try and provide some more background then.

Simply doing remap_pfn_range in the mmap call doesn't work for large
systems.

Take the example of a 2048 CPU system (512 CPUs per partition/machine
- each machine running its own OS) running an MPI application
across all 2048 CPUs using cross coherency domain traffic.

A standard application will allocate 56 DDQs per thread (the DDQs are
used for synchronization and allocated through the mspec driver) which
translates to having 126976 uncached cache lines reserved or 992 pages
per worker thread. The controlling thread on each partition will mmap
the entire DDQ space up front and then fork off the workers who will
then go and touch their pages. With the current approach by the driver
this means that if you have two threads per node you will end up with
~32MB of uncached memory allocated per node.

Alternatively, doing this at mmap time with 512 worker threads per
partition, the result is ~8GB (992 * 16K * 512) of uncached memory, all
allocated by the master thread on each machine.

A typical system configuration is 4GB or 8GB of RAM per node. This
means that by using the remap_pfn_range at mmap time approach and the
kernel's standard overhead you end up completely starving the first
couple of nodes of memory on each partition.
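
(For reference, the arithmetic behind those numbers, assuming 16KB pages
and 128-byte SN2 cache lines:

	992 pages * 128 lines/page = 126976 uncached lines per thread
	992 pages * 16KB = ~15.5MB per worker thread
	2 threads/node * ~15.5MB = ~32MB per node under first touch
	992 * 16KB * 512 threads = ~8GB on the mmap'ing node otherwise)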

Combine this with the effect of all synchronization traffic hitting
the same node, you effectively end up with 512 CPUs all constantly
hammering the same memory controller to death.

FWIW, an initial implementation of the driver was done by someone
within SGI, prior to me having anything to do with it. It used the
remap_pfn_range at mmap time approach, and it was noticed then that
16 worker threads were pretty much enough to overwhelm a node.

Having the page allocations and drop-ins on a first-touch basis is
consistent with what is done for cached memory and seems a pretty
reasonable approach to me. Sure, it isn't particularly pretty to use
the ->nopage approach, nobody disagrees with you there, but what is
the alternative?

Is the problem more an issue of the ugliness of allocating a page
just to return it to the nopage handler, or the fact that we're trying
to make the allocations node-local?

If you have any suggestions for how to do this differently, then I'm
all ears.

Cheers,
Jes

PS: Thanks to Robin Holt for providing more info on MPI application
behavior than I ever wanted to know ;-)

2005-04-27 15:53:50

by Jeff Garzik

[permalink] [raw]
Subject: Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3

Jes Sorensen wrote:
>>>>>>"Christoph" == Christoph Hellwig <[email protected]> writes:
>
>
> Christoph> http://marc.theaimsgroup.com/?l=linux-kernel&m=111416930927092&w=2),
> Christoph> which has a nopage routine that calls remap_pfn_range from
> Christoph> ->nopage for uncached memory that's not part of the mem
> Christoph> map. Because ->nopage wants to return a struct page * he's
> Christoph> allocating a normal kernel page and actually returns that
> Christoph> one - to get the page he wants into the pagetables his does
> Christoph> all the pagetable manipulation himself before (See the
> Christoph> glory details of pagetable walks and modification inside a
> Christoph> driver in the patch above).
>
> Christoph> I don't think these hacks are acceptable for a driver,
> Christoph> especially as the problem can easily be solved by calling
> Christoph> remap_pfn_range in ->mmap - except SGI also wants node
> Christoph> locality..
>
> Christoph,
>
> Let me try and provide some more background then.
>
> Simply doing remap_pfn_range in the mmap call doesn't work for large
> systems.
>
> Take the example of a 2048 CPU system (512 CPUs per partition/machine
> - each machine running it's own OS) running an MPI application
> across all 2048 CPUs using cross coherency domain traffic.
>
> A standard application will allocate 56 DDQs per thread (the DDQs are
> used for synchronization and allocated through the mspec driver) which
> translates to having 126976 uncached cache lines reserved or 992 pages
> per worker thread. The controlling thread on each partition will mmap
> the entire DDQ space up front and then fork off the workers who will
> then go and touch their pages. With the current approach by the driver
> this means that if you have two threads per node you will end up with
> ~32MB of uncached memory allocated per node.
>
> Alternatively doing this at mmap time having 512 worker threads per
> partition, the result is ~8GB (992 * 16K * 512) of uncached memory all
> allocated by the master thread on each machine.
>
> A typical system configuration is 4GB or 8GB of RAM per node. This
> means that by using the remap_pfn_range at mmap time approach and the
> kernel's standard overhead you end up completely starving the first
> couple of nodes of memory on each partition.
>
> Combine this with the effect of all synchronization traffic hitting
> the same node, you effectively end up with 512 CPUs all constantly
> hammering the same memory controller to death.
>
> FWIW, an initial implementation of the driver was done by someone
> within SGI, prior to me having anything to do with it. It was using
> the remap_pfn_range at mmap time approach and it was noticed then that
> 16 worker threads was pretty much enough to overwhelm a node.
>
> Having the page allocations and drop ins on a first touch basis is
> consistent with what is done for cached memory and seems a pretty
> reasonable approach to me. Sure it isn't particularly pretty to use
> the ->nopage approach, nobody disagrees with you there, but what is
> the alternative?

I don't see anything wrong with a ->nopage approach.

At Linus's suggestion, I used ->nopage in the implementation of
sound/oss/via82cxxx_audio.c.

Jeff



2005-04-27 15:55:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3

On Wed, Apr 27, 2005 at 11:53:15AM -0400, Jeff Garzik wrote:
> I don't see anything wrong with a ->nopage approach.
>
> At Linus's suggestion, I used ->nopage in the implementation of
> sound/oss/via82cxxx_audio.c.

The difference is that you return kernel memory (actually pci_alloc_consistent
memory that has its own set of problems), while this is memory not in mem_map,
so he allocates some regular kernel memory too to have a struct page and
just leaks it.

2005-04-27 18:04:36

by Jes Sorensen

[permalink] [raw]
Subject: Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3

>>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:

Christoph> On Wed, Apr 27, 2005 at 11:53:15AM -0400, Jeff Garzik
Christoph> wrote:
>> I don't see anything wrong with a ->nopage approach.
>>
>> At Linus's suggestion, I used ->nopage in the implementation of
>> sound/oss/via82cxxx_audio.c.

Christoph> The difference is that you return kernel memory (actually
Christoph> pci_alloc_consistent memory that has its own set of
Christoph> problems), while this is memory not in mem_map, so he
Christoph> allocates some regular kernel memory too to have a struct
Christoph> page and just leaks it.

Are you suggesting, then, that we change do_no_page to handle this as a
special return value?

Jes

2005-04-27 18:58:37

by Russell King

[permalink] [raw]
Subject: Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3

On Wed, Apr 27, 2005 at 02:03:50PM -0400, Jes Sorensen wrote:
> >>>>> "Christoph" == Christoph Hellwig <[email protected]> writes:
>
> Christoph> On Wed, Apr 27, 2005 at 11:53:15AM -0400, Jeff Garzik
> Christoph> wrote:
> >> I don't see anything wrong with a ->nopage approach.
> >>
> >> At Linus's suggestion, I used ->nopage in the implementation of
> >> sound/oss/via82cxxx_audio.c.
>
> Christoph> The difference is that you return kernel memory (actually
> Christoph> pci_alloc_consistent memory that has its own set of
> Christoph> problems), while this is memory not in mem_map, so he
> Christoph> allocates some regular kernel memory too to have a struct
> Christoph> page and just leaks it.
>
> Are you suggesting, then, that we change do_no_page to handle this as a
> special return value?

If you're looking to mmap dma memory, ARM already supports the API
which was discussed (although not properly imho) on linux-arch.
I previously posted a potential patch for x86, but it has the
problem that remap_pfn_range() will not work on such memory because
it isn't marked reserved.

In addition, if you're mmapping dma memory on x86 as-is, you're
providing a potential security hole - the x86 DMA memory allocator
does not extend its zeroing to cover the entire last page of the
allocation.

--
Russell King
Linux kernel 2.6 ARM Linux - http://www.arm.linux.org.uk/
maintainer of: 2.6 Serial core

2005-05-03 20:41:05

by William Lee Irwin III

[permalink] [raw]
Subject: Re: returning non-ram via ->nopage, was Re: [patch] mspec driver for 2.6.12-rc2-mm3

On Tue, Apr 26, 2005 at 06:14:02PM -0400, Jes Sorensen wrote:
> Having the page allocations and drop-ins on a first-touch basis is
> consistent with what is done for cached memory and seems a pretty
> reasonable approach to me. Sure, it isn't particularly pretty to use
> the ->nopage approach, nobody disagrees with you there, but what is
> the alternative?
> Is the problem more an issue of the ugliness of allocating a page
> just to return it to the nopage handler, or the fact that we're trying
> to make the allocations node-local?
> If you have any suggestions for how to do this differently, then I'm
> all ears.

This and several other issues all fall away when, instead of ->nopage(),
the vma's fault-handling method takes a vma, a virtual address, and
an access type, and returns a VM_FAULT_* code. Yes, I remember how I
got heavily criticized the last time I wrote/suggested/whatever this.
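
Roughly, instead of ->nopage returning a struct page *, something like
this (a hypothetical signature, nothing more):

struct vm_operations_struct {
	...
	int (*fault)(struct vm_area_struct *vma, unsigned long address,
		     int write_access);	/* returns a VM_FAULT_* code */
};

where the handler installs the pte itself and returns VM_FAULT_MINOR or
friends, so a driver like mspec would never need to hand back a bogus
struct page.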


-- wli