Andi,
I must be missing something here, but did you not include mempolicy.h
and policy.c in these patches? I can't seem to find them anywhere?!?
It's really hard to evaluate your patches if the core of them is
missing!
Andrew already mentioned your mistake on the i386 syscalls which needs
to be fixed.
Also, this snippet of code is in 2 of your patches (#1 and #6) causing
rejects:
@@ -435,6 +445,8 @@
struct page *shmem_nopage(struct vm_area_struct * vma,
unsigned long address, int *type);
+int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr);
struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags);
void shmem_lock(struct file * file, int lock);
int shmem_zero_setup(struct vm_area_struct *);
Just from the patches you posted, I would really disagree that these are
ready for merging into -mm.
-Matt
On Wed, 07 Apr 2004 14:24:19 -0700
Matthew Dobson <[email protected]> wrote:
> I must be missing something here, but did you not include mempolicy.h
> and policy.c in these patches? I can't seem to find them anywhere?!?
> It's really hard to evaluate your patches if the core of them is
> missing!
It was in the core patch and also in the last patch I sent Andrew.
See ftp://ftp.suse.com/pub/people/ak/numa/* for the full patches
>
> Andrew already mentioned your mistake on the i386 syscalls which needs
> to be fixed.
That's already fixed
> Also, this snippet of code is in 2 of your patches (#1 and #6) causing
> rejects:
>
> @@ -435,6 +445,8 @@
>
> struct page *shmem_nopage(struct vm_area_struct * vma,
> unsigned long address, int *type);
> +int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
> +struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr);
> struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags);
> void shmem_lock(struct file * file, int lock);
> int shmem_zero_setup(struct vm_area_struct *);
It didn't reject for me.
> Just from the patches you posted, I would really disagree that these are
> ready for merging into -mm.
Why so?
-Andi
Replying to myself, but...
On Wed, 2004-04-07 at 14:24, Matthew Dobson wrote:
> Andi,
> I must be missing something here, but did you not include mempolicy.h
> and policy.c in these patches? I can't seem to find them anywhere?!?
> It's really hard to evaluate your patches if the core of them is
> missing!
Running make -j24 bzImage
In file included from arch/i386/kernel/asm-offsets.c:7:
include/linux/sched.h:32: linux/mempolicy.h: No such file or directory
make[1]: *** [arch/i386/kernel/asm-offsets.s] Error 1
make: *** [arch/i386/kernel/asm-offsets.s] Error 2
I'm guessing you just forgot the -N option to diff. You might want to
add the -p option when you rediff and repost because it makes your
patches an order of magnitude easier to read when you can tell what
function the patch is modifying.
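For reference, a rediff along these lines would pick up newly added files and
put the enclosing C function in each hunk header (the tree names here are
placeholders, not your actual directories):

  diff -urNp linux-2.6.5-vanilla linux-2.6.5-numa > numa-api-core.patch

-u is unified output, -r recurses, -N treats files present in only one tree
as empty (so new files like mempolicy.h get included), and -p prints the
function name in each hunk header.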
-Matt
On Wed, 2004-04-07 at 14:27, Andi Kleen wrote:
> On Wed, 07 Apr 2004 14:24:19 -0700
> Matthew Dobson <[email protected]> wrote:
>
> > I must be missing something here, but did you not include mempolicy.h
> > and policy.c in these patches? I can't seem to find them anywhere?!?
> > It's really hard to evaluate your patches if the core of them is
> > missing!
>
> It was in the core patch and also in the last patch I sent Andrew.
> See ftp://ftp.suse.com/pub/people/ak/numa/* for the full patches
Ok.. I'll check that link, but what you posted didn't have the files
(mempolicy.h & policy.c) in the patch:
[mcd@arrakis numa_api]$ diffstat numa_api-01-core.patch
include/linux/gfp.h | 25 +++++++++++++++++++++++--
include/linux/mm.h | 17 +++++++++++++++++
include/linux/sched.h | 4 ++++
kernel/sys.c | 3 +++
mm/Makefile | 1 +
5 files changed, 48 insertions(+), 2 deletions(-)
Maybe it got lost somewhere between your mailer and mine? The patch you
posted to LKML yesterday was clearly done without the -N option to diff:
diff -u linux-2.6.5-numa/kernel/sys.c-o linux-2.6.5-numa/kernel/sys.c
> >
> > Andrew already mentioned your mistake on the i386 syscalls which needs
> > to be fixed.
>
> That's already fixed
Good.
> > Also, this snippet of code is in 2 of your patches (#1 and #6) causing
> > rejects:
> >
> > @@ -435,6 +445,8 @@
> >
> > struct page *shmem_nopage(struct vm_area_struct * vma,
> > unsigned long address, int *type);
> > +int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
> > +struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr);
> > struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags);
> > void shmem_lock(struct file * file, int lock);
> > int shmem_zero_setup(struct vm_area_struct *);
>
>
> It didn't reject for me.
I don't know why. The same code addition is in both the 'core' patch
and the 'shm' patch. Adding it twice causes patch to throw a reject.
> > Just from the patches you posted, I would really disagree that these are
> > ready for merging into -mm.
>
> Why so?
>
> -Andi
Well, if for no other reason than all the code isn't posted!
-Matt
On Wed, 07 Apr 2004 14:41:02 -0700
Matthew Dobson <[email protected]> wrote:
> On Wed, 2004-04-07 at 14:27, Andi Kleen wrote:
> > On Wed, 07 Apr 2004 14:24:19 -0700
> > Matthew Dobson <[email protected]> wrote:
> >
> > > I must be missing something here, but did you not include mempolicy.h
> > > and policy.c in these patches? I can't seem to find them anywhere?!?
> > > It's really hard to evaluate your patches if the core of them is
> > > missing!
> >
> > It was in the core patch and also in the last patch I sent Andrew.
> > See ftp://ftp.suse.com/pub/people/ak/numa/* for the full patches
>
> Ok.. I'll check that link, but what you posted didn't have the files
> (mempolicy.h & policy.c) in the patch:
Indeed. Must have gone missing. Here are the files for reference.
The full current broken out patchkit is in
ftp.suse.com:/pub/people/ak/numa/2.6.5mc2/
Core NUMA API code
This is the core NUMA API code. This includes NUMA-policy-aware
wrappers for get_free_pages and alloc_page_vma(). On non-NUMA kernels
these are defined away.
The system calls mbind (see http://www.firstfloor.org/~andi/mbind.html),
get_mempolicy (http://www.firstfloor.org/~andi/get_mempolicy.html) and
set_mempolicy (http://www.firstfloor.org/~andi/set_mempolicy.html) are
implemented here.
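As a rough illustration of the intended usage from user space (the syscall
number below is a made-up placeholder; the real numbers come from the
per-arch patches):

  #include <unistd.h>
  #include <sys/syscall.h>

  #define MPOL_INTERLEAVE    3      /* matches mempolicy.h below */
  #define __NR_set_mempolicy 9999   /* placeholder, arch specific in reality */

  int main(void)
  {
          unsigned long nodes = 0x3;        /* interleave over nodes 0 and 1 */
          /* sys_set_mempolicy(int mode, unsigned long *nmask,
             unsigned long maxnode) */
          if (syscall(__NR_set_mempolicy, MPOL_INTERLEAVE, &nodes,
                      sizeof(nodes) * 8) < 0)
                  return 1;
          /* further allocations of this process now interleave over 0,1 */
          return 0;
  }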
Adds a vm_policy field to the VMA and to the process. The process
also has a field for interleaving. VMA interleaving uses the offset
into the VMA, but that's not possible for process allocations.
diff -u linux-2.6.5-mc2-numa/include/linux/gfp.h-o linux-2.6.5-mc2-numa/include/linux/gfp.h
--- linux-2.6.5-mc2-numa/include/linux/gfp.h-o 2004-04-07 11:42:19.000000000 +0200
+++ linux-2.6.5-mc2-numa/include/linux/gfp.h 2004-04-07 11:45:42.000000000 +0200
@@ -4,6 +4,8 @@
#include <linux/mmzone.h>
#include <linux/stddef.h>
#include <linux/linkage.h>
+#include <linux/config.h>
+
/*
* GFP bitmasks..
*/
@@ -73,10 +75,29 @@
return __alloc_pages(gfp_mask, order, NODE_DATA(nid)->node_zonelists + (gfp_mask & GFP_ZONEMASK));
}
+extern struct page *alloc_pages_current(unsigned gfp_mask, unsigned order);
+struct vm_area_struct;
+
+#ifdef CONFIG_NUMA
+static inline struct page * alloc_pages(unsigned int gfp_mask, unsigned int order)
+{
+ if (unlikely(order >= MAX_ORDER))
+ return NULL;
+
+ return alloc_pages_current(gfp_mask, order);
+}
+extern struct page *__alloc_page_vma(unsigned gfp_mask, struct vm_area_struct *vma,
+ unsigned long off);
+
+extern struct page *alloc_page_vma(unsigned gfp_mask, struct vm_area_struct *vma,
+ unsigned long addr);
+#else
#define alloc_pages(gfp_mask, order) \
alloc_pages_node(numa_node_id(), gfp_mask, order)
-#define alloc_page(gfp_mask) \
- alloc_pages_node(numa_node_id(), gfp_mask, 0)
+#define alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#define __alloc_page_vma(gfp_mask, vma, addr) alloc_pages(gfp_mask, 0)
+#endif
+#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
extern unsigned long FASTCALL(__get_free_pages(unsigned int gfp_mask, unsigned int order));
extern unsigned long FASTCALL(get_zeroed_page(unsigned int gfp_mask));
diff -u linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o linux-2.6.5-mc2-numa/include/linux/mempolicy.h
--- linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o 2004-04-07 12:07:18.000000000 +0200
+++ linux-2.6.5-mc2-numa/include/linux/mempolicy.h 2004-04-07 12:07:13.000000000 +0200
@@ -0,0 +1,219 @@
+#ifndef _LINUX_MEMPOLICY_H
+#define _LINUX_MEMPOLICY_H 1
+
+#include <linux/errno.h>
+
+/*
+ * NUMA memory policies for Linux.
+ * Copyright 2003,2004 Andi Kleen SuSE Labs
+ */
+
+/* Policies */
+#define MPOL_DEFAULT 0
+#define MPOL_PREFERRED 1
+#define MPOL_BIND 2
+#define MPOL_INTERLEAVE 3
+
+#define MPOL_MAX MPOL_INTERLEAVE
+
+/* Flags for get_mem_policy */
+#define MPOL_F_NODE (1<<0) /* return next IL mode instead of node mask */
+#define MPOL_F_ADDR (1<<1) /* look up vma using address */
+
+/* Flags for mbind */
+#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
+
+#ifdef __KERNEL__
+
+#include <linux/config.h>
+#include <linux/mmzone.h>
+#include <linux/bitmap.h>
+#include <linux/slab.h>
+#include <linux/rbtree.h>
+#include <asm/semaphore.h>
+
+struct vm_area_struct;
+
+#ifdef CONFIG_NUMA
+
+/*
+ * Describe a memory policy.
+ *
+ * A mempolicy can be either associated with a process or with a VMA.
+ * For VMA related allocations the VMA policy is preferred, otherwise
+ * the process policy is used. Interrupts ignore the memory policy
+ * of the current process.
+ *
+ * Locking policy for interleave:
+ * In process context there is no locking because only the process accesses
+ * its own state. All vma manipulation is somewhat protected by a down_read on
+ * mmap_sem. For allocating in the interleave policy the page_table_lock
+ * must also be acquired to protect il_next.
+ *
+ * Freeing policy:
+ * When policy is MPOL_BIND v.zonelist is kmalloc'ed and must be kfree'd.
+ * All other policies don't have any external state. mpol_free() handles this.
+ *
+ * Copying policy objects:
+ * For MPOL_BIND the zonelist must be always duplicated. mpol_clone() does this.
+ */
+struct mempolicy {
+ atomic_t refcnt;
+ short policy; /* See MPOL_* above */
+ union {
+ struct zonelist *zonelist; /* bind */
+ short preferred_node; /* preferred */
+ DECLARE_BITMAP(nodes, MAX_NUMNODES); /* interleave */
+ /* undefined for default */
+ } v;
+};
+
+/* A NULL mempolicy pointer is a synonym of &default_policy. */
+extern struct mempolicy default_policy;
+
+/*
+ * Support for managing mempolicy data objects (clone, copy, destroy)
+ * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
+ */
+
+extern void __mpol_free(struct mempolicy *pol);
+static inline void mpol_free(struct mempolicy *pol)
+{
+ if (pol)
+ __mpol_free(pol);
+}
+
+extern struct mempolicy *__mpol_copy(struct mempolicy *pol);
+static inline struct mempolicy *mpol_copy(struct mempolicy *pol)
+{
+ if (pol)
+ pol = __mpol_copy(pol);
+ return pol;
+}
+
+#define vma_policy(vma) ((vma)->vm_policy)
+#define vma_set_policy(vma, pol) ((vma)->vm_policy = (pol))
+
+static inline void mpol_get(struct mempolicy *pol)
+{
+ if (pol)
+ atomic_inc(&pol->refcnt);
+}
+
+extern int __mpol_equal(struct mempolicy *a, struct mempolicy *b);
+static inline int mpol_equal(struct mempolicy *a, struct mempolicy *b)
+{
+ if (a == b)
+ return 1;
+ return __mpol_equal(a, b);
+}
+#define vma_mpol_equal(a,b) mpol_equal(vma_policy(a), vma_policy(b))
+
+/* Could later add inheritance of the process policy here. */
+
+#define mpol_set_vma_default(vma) ((vma)->vm_policy = NULL)
+
+/*
+ * Hugetlb policy. i386 hugetlb so far works with node numbers
+ * instead of zone lists, so give it special interfaces for now.
+ */
+extern int mpol_first_node(struct vm_area_struct *vma, unsigned long addr);
+extern int mpol_node_valid(int nid, struct vm_area_struct *vma, unsigned long addr);
+
+/*
+ * Tree of shared policies for a shared memory region.
+ * Maintain the policies in a pseudo mm that contains vmas. The vmas
+ * carry the policy. As a special twist the pseudo mm is indexed in pages, not
+ * bytes, so that we can work with shared memory segments bigger than
+ * unsigned long.
+ */
+
+struct sp_node {
+ struct rb_node nd;
+ unsigned long start, end;
+ struct mempolicy *policy;
+};
+
+struct shared_policy {
+ struct rb_root root;
+ struct semaphore sem;
+};
+
+static inline void mpol_shared_policy_init(struct shared_policy *info)
+{
+ info->root = RB_ROOT;
+ init_MUTEX(&info->sem);
+}
+
+int mpol_set_shared_policy(struct shared_policy *info,
+ struct vm_area_struct *vma,
+ struct mempolicy *new);
+void mpol_free_shared_policy(struct shared_policy *p);
+struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
+ unsigned long idx);
+
+#else
+
+struct mempolicy {};
+
+static inline int mpol_equal(struct mempolicy *a, struct mempolicy *b)
+{
+ return 1;
+}
+#define vma_mpol_equal(a,b) 1
+
+#define mpol_set_vma_default(vma) do {} while(0)
+
+static inline void mpol_free(struct mempolicy *p)
+{
+}
+
+static inline void mpol_get(struct mempolicy *pol)
+{
+}
+
+static inline struct mempolicy *mpol_copy(struct mempolicy *old)
+{
+ return NULL;
+}
+
+static inline int mpol_first_node(struct vm_area_struct *vma, unsigned long a)
+{
+ return numa_node_id();
+}
+
+static inline int mpol_node_valid(int nid, struct vm_area_struct *vma, unsigned long a)
+{
+ return 1;
+}
+
+struct shared_policy {};
+
+static inline int mpol_set_shared_policy(struct shared_policy *info,
+ struct vm_area_struct *vma,
+ struct mempolicy *new)
+{
+ return -EINVAL;
+}
+
+static inline void mpol_shared_policy_init(struct shared_policy *info)
+{
+}
+
+static inline void mpol_free_shared_policy(struct shared_policy *p)
+{
+}
+
+static inline struct mempolicy *
+mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+{
+ return NULL;
+}
+
+#define vma_policy(vma) NULL
+#define vma_set_policy(vma, pol) do {} while(0)
+
+#endif /* CONFIG_NUMA */
+#endif /* __KERNEL__ */
+
+#endif
diff -u linux-2.6.5-mc2-numa/include/linux/mm.h-o linux-2.6.5-mc2-numa/include/linux/mm.h
--- linux-2.6.5-mc2-numa/include/linux/mm.h-o 2004-04-07 11:42:19.000000000 +0200
+++ linux-2.6.5-mc2-numa/include/linux/mm.h 2004-04-07 11:45:42.000000000 +0200
@@ -12,6 +12,7 @@
#include <linux/mmzone.h>
#include <linux/rbtree.h>
#include <linux/fs.h>
+#include <linux/mempolicy.h>
#ifndef CONFIG_DISCONTIGMEM /* Don't use mapnrs, do it properly */
extern unsigned long max_mapnr;
@@ -47,6 +48,9 @@
*
* This structure is exactly 64 bytes on ia32. Please think very, very hard
* before adding anything to it.
+ * [Now 4 bytes more on 32bit NUMA machines. Sorry. -AK.
+ * But if you want to recover the 4 bytes just remove vm_next. It is redundant
+ * with vm_rb. Will be a lot of editing work though. vm_rb.color is redundant too.]
*/
struct vm_area_struct {
struct mm_struct * vm_mm; /* The address space we belong to. */
@@ -77,6 +81,10 @@
units, *not* PAGE_CACHE_SIZE */
struct file * vm_file; /* File we map to (can be NULL). */
void * vm_private_data; /* was vm_pte (shared mem) */
+
+#ifdef CONFIG_NUMA
+ struct mempolicy *vm_policy; /* NUMA policy for the VMA */
+#endif
};
/*
@@ -148,6 +156,8 @@
void (*close)(struct vm_area_struct * area);
struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
+ int (*set_policy)(struct vm_area_struct *vma, struct mempolicy *new);
+ struct mempolicy *(*get_policy)(struct vm_area_struct *vma, unsigned long addr);
};
/* forward declaration; pte_chain is meant to be internal to rmap.c */
@@ -434,6 +444,8 @@
struct page *shmem_nopage(struct vm_area_struct * vma,
unsigned long address, int *type);
+int shmem_set_policy(struct vm_area_struct *vma, struct mempolicy *new);
+struct mempolicy *shmem_get_policy(struct vm_area_struct *vma, unsigned long addr);
struct file *shmem_file_setup(char * name, loff_t size, unsigned long flags);
void shmem_lock(struct file * file, int lock);
int shmem_zero_setup(struct vm_area_struct *);
@@ -636,6 +648,11 @@
return vma;
}
+static inline unsigned long vma_pages(struct vm_area_struct *vma)
+{
+ return (vma->vm_end - vma->vm_start) >> PAGE_SHIFT;
+}
+
extern struct vm_area_struct *find_extend_vma(struct mm_struct *mm, unsigned long addr);
extern unsigned int nr_used_zone_pages(void);
diff -u linux-2.6.5-mc2-numa/include/linux/sched.h-o linux-2.6.5-mc2-numa/include/linux/sched.h
--- linux-2.6.5-mc2-numa/include/linux/sched.h-o 2004-04-07 11:42:19.000000000 +0200
+++ linux-2.6.5-mc2-numa/include/linux/sched.h 2004-04-07 11:45:42.000000000 +0200
@@ -29,6 +29,7 @@
#include <linux/completion.h>
#include <linux/pid.h>
#include <linux/percpu.h>
+#include <linux/mempolicy.h>
struct exec_domain;
@@ -501,6 +502,9 @@
unsigned long ptrace_message;
siginfo_t *last_siginfo; /* For ptrace use. */
+
+ struct mempolicy *mempolicy;
+ short il_next; /* could be shared with used_math */
};
static inline pid_t process_group(struct task_struct *tsk)
diff -u linux-2.6.5-mc2-numa/kernel/sys.c-o linux-2.6.5-mc2-numa/kernel/sys.c
--- linux-2.6.5-mc2-numa/kernel/sys.c-o 2004-04-07 11:42:20.000000000 +0200
+++ linux-2.6.5-mc2-numa/kernel/sys.c 2004-04-07 11:48:39.000000000 +0200
@@ -266,6 +266,9 @@
cond_syscall(sys_mq_timedreceive)
cond_syscall(sys_mq_notify)
cond_syscall(sys_mq_getsetattr)
+cond_syscall(sys_mbind)
+cond_syscall(sys_get_mempolicy)
+cond_syscall(sys_set_mempolicy)
/* arch-specific weak syscall entries */
cond_syscall(sys_pciconfig_read)
diff -u linux-2.6.5-mc2-numa/mm/Makefile-o linux-2.6.5-mc2-numa/mm/Makefile
--- linux-2.6.5-mc2-numa/mm/Makefile-o 2004-03-21 21:12:13.000000000 +0100
+++ linux-2.6.5-mc2-numa/mm/Makefile 2004-04-07 12:07:49.000000000 +0200
@@ -12,3 +12,4 @@
slab.o swap.o truncate.o vmscan.o $(mmu-y)
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o
+obj-$(CONFIG_NUMA) += mempolicy.o
diff -u linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o linux-2.6.5-mc2-numa/include/linux/mempolicy.h
--- linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o 2004-04-07 12:07:18.000000000 +0200
+++ linux-2.6.5-mc2-numa/include/linux/mempolicy.h 2004-04-07 12:07:13.000000000 +0200
@@ -0,0 +1,219 @@
+#ifndef _LINUX_MEMPOLICY_H
+#define _LINUX_MEMPOLICY_H 1
+
+#include <linux/errno.h>
+
+/*
+ * NUMA memory policies for Linux.
+ * Copyright 2003,2004 Andi Kleen SuSE Labs
+ */
+
+/* Policies */
+#define MPOL_DEFAULT 0
+#define MPOL_PREFERRED 1
+#define MPOL_BIND 2
+#define MPOL_INTERLEAVE 3
+
+#define MPOL_MAX MPOL_INTERLEAVE
+
+/* Flags for get_mem_policy */
+#define MPOL_F_NODE (1<<0) /* return next IL mode instead of node mask */
+#define MPOL_F_ADDR (1<<1) /* look up vma using address */
+
+/* Flags for mbind */
+#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */
+
+#ifdef __KERNEL__
+
+#include <linux/config.h>
+#include <linux/mmzone.h>
+#include <linux/bitmap.h>
+#include <linux/slab.h>
+#include <linux/rbtree.h>
+#include <asm/semaphore.h>
+
+struct vm_area_struct;
+
+#ifdef CONFIG_NUMA
+
+/*
+ * Describe a memory policy.
+ *
+ * A mempolicy can be either associated with a process or with a VMA.
+ * For VMA related allocations the VMA policy is preferred, otherwise
+ * the process policy is used. Interrupts ignore the memory policy
+ * of the current process.
+ *
+ * Locking policy for interleave:
+ * In process context there is no locking because only the process accesses
+ * its own state. All vma manipulation is somewhat protected by a down_read on
+ * mmap_sem. For allocating in the interleave policy the page_table_lock
+ * must also be acquired to protect il_next.
+ *
+ * Freeing policy:
+ * When policy is MPOL_BIND v.zonelist is kmalloc'ed and must be kfree'd.
+ * All other policies don't have any external state. mpol_free() handles this.
+ *
+ * Copying policy objects:
+ * For MPOL_BIND the zonelist must be always duplicated. mpol_clone() does this.
+ */
+struct mempolicy {
+ atomic_t refcnt;
+ short policy; /* See MPOL_* above */
+ union {
+ struct zonelist *zonelist; /* bind */
+ short preferred_node; /* preferred */
+ DECLARE_BITMAP(nodes, MAX_NUMNODES); /* interleave */
+ /* undefined for default */
+ } v;
+};
+
+/* A NULL mempolicy pointer is a synonym of &default_policy. */
+extern struct mempolicy default_policy;
+
+/*
+ * Support for managing mempolicy data objects (clone, copy, destroy)
+ * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
+ */
+
+extern void __mpol_free(struct mempolicy *pol);
+static inline void mpol_free(struct mempolicy *pol)
+{
+ if (pol)
+ __mpol_free(pol);
+}
+
+extern struct mempolicy *__mpol_copy(struct mempolicy *pol);
+static inline struct mempolicy *mpol_copy(struct mempolicy *pol)
+{
+ if (pol)
+ pol = __mpol_copy(pol);
+ return pol;
+}
+
+#define vma_policy(vma) ((vma)->vm_policy)
+#define vma_set_policy(vma, pol) ((vma)->vm_policy = (pol))
+
+static inline void mpol_get(struct mempolicy *pol)
+{
+ if (pol)
+ atomic_inc(&pol->refcnt);
+}
+
+extern int __mpol_equal(struct mempolicy *a, struct mempolicy *b);
+static inline int mpol_equal(struct mempolicy *a, struct mempolicy *b)
+{
+ if (a == b)
+ return 1;
+ return __mpol_equal(a, b);
+}
+#define vma_mpol_equal(a,b) mpol_equal(vma_policy(a), vma_policy(b))
+
+/* Could later add inheritance of the process policy here. */
+
+#define mpol_set_vma_default(vma) ((vma)->vm_policy = NULL)
+
+/*
+ * Hugetlb policy. i386 hugetlb so far works with node numbers
+ * instead of zone lists, so give it special interfaces for now.
+ */
+extern int mpol_first_node(struct vm_area_struct *vma, unsigned long addr);
+extern int mpol_node_valid(int nid, struct vm_area_struct *vma, unsigned long addr);
+
+/*
+ * Tree of shared policies for a shared memory region.
+ * Maintain the policies in a pseudo mm that contains vmas. The vmas
+ * carry the policy. As a special twist the pseudo mm is indexed in pages, not
+ * bytes, so that we can work with shared memory segments bigger than
+ * unsigned long.
+ */
+
+struct sp_node {
+ struct rb_node nd;
+ unsigned long start, end;
+ struct mempolicy *policy;
+};
+
+struct shared_policy {
+ struct rb_root root;
+ struct semaphore sem;
+};
+
+static inline void mpol_shared_policy_init(struct shared_policy *info)
+{
+ info->root = RB_ROOT;
+ init_MUTEX(&info->sem);
+}
+
+int mpol_set_shared_policy(struct shared_policy *info,
+ struct vm_area_struct *vma,
+ struct mempolicy *new);
+void mpol_free_shared_policy(struct shared_policy *p);
+struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
+ unsigned long idx);
+
+#else
+
+struct mempolicy {};
+
+static inline int mpol_equal(struct mempolicy *a, struct mempolicy *b)
+{
+ return 1;
+}
+#define vma_mpol_equal(a,b) 1
+
+#define mpol_set_vma_default(vma) do {} while(0)
+
+static inline void mpol_free(struct mempolicy *p)
+{
+}
+
+static inline void mpol_get(struct mempolicy *pol)
+{
+}
+
+static inline struct mempolicy *mpol_copy(struct mempolicy *old)
+{
+ return NULL;
+}
+
+static inline int mpol_first_node(struct vm_area_struct *vma, unsigned long a)
+{
+ return numa_node_id();
+}
+
+static inline int mpol_node_valid(int nid, struct vm_area_struct *vma, unsigned long a)
+{
+ return 1;
+}
+
+struct shared_policy {};
+
+static inline int mpol_set_shared_policy(struct shared_policy *info,
+ struct vm_area_struct *vma,
+ struct mempolicy *new)
+{
+ return -EINVAL;
+}
+
+static inline void mpol_shared_policy_init(struct shared_policy *info)
+{
+}
+
+static inline void mpol_free_shared_policy(struct shared_policy *p)
+{
+}
+
+static inline struct mempolicy *
+mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+{
+ return NULL;
+}
+
+#define vma_policy(vma) NULL
+#define vma_set_policy(vma, pol) do {} while(0)
+
+#endif /* CONFIG_NUMA */
+#endif /* __KERNEL__ */
+
+#endif
diff -u linux-2.6.5-mc2-numa/mm/mempolicy.c-o linux-2.6.5-mc2-numa/mm/mempolicy.c
--- linux-2.6.5-mc2-numa/mm/mempolicy.c-o 2004-04-07 12:07:41.000000000 +0200
+++ linux-2.6.5-mc2-numa/mm/mempolicy.c 2004-04-07 13:07:02.000000000 +0200
@@ -0,0 +1,981 @@
+/*
+ * Simple NUMA memory policy for the Linux kernel.
+ *
+ * Copyright 2003,2004 Andi Kleen, SuSE Labs.
+ * Subject to the GNU Public License, version 2.
+ *
+ * NUMA policy allows the user to give hints in which node(s) memory should
+ * be allocated.
+ *
+ * Support four policies per VMA and per process:
+ *
+ * The VMA policy has priority over the process policy for a page fault.
+ *
+ * interleave Allocate memory interleaved over a set of nodes,
+ * with normal fallback if it fails.
+ * For VMA based allocations this interleaves based on the
+ * offset into the backing object or offset into the mapping
+ * for anonymous memory. For process policy a process counter
+ * is used.
+ * bind Only allocate memory on a specific set of nodes,
+ * no fallback.
+ * preferred Try a specific node first before normal fallback.
+ * As a special case node -1 here means do the allocation
+ * on the local CPU. This is normally identical to default,
+ * but useful to set in a VMA when you have a non default
+ * process policy.
+ * default Allocate on the local node first, or when on a VMA
+ * use the process policy. This is what Linux always did
+ * in a NUMA aware kernel and still does by, ahem, default.
+ *
+ * The process policy is applied for most non interrupt memory allocations
+ * in that process' context. Interrupts ignore the policies and always
+ * try to allocate on the local CPU. The VMA policy is only applied for memory
+ * allocations for a VMA in the VM.
+ *
+ * Currently there are a few corner cases in swapping where the policy
+ * is not applied, but the majority should be handled. When process policy
+ * is used it is not remembered over swap outs/swap ins.
+ *
+ * Only the highest zone in the zone hierarchy gets policied. Allocations
+ * requesting a lower zone just use default policy. This implies that
+ * on systems with highmem, kernel lowmem allocations don't get policied.
+ * Same with GFP_DMA allocations.
+ *
+ * For shmfs/tmpfs/hugetlbfs shared memory the policy is shared between
+ * all users and remembered even when nobody has memory mapped.
+ */
+
+/* Notebook:
+ fix mmap readahead to honour policy and enable policy for any page cache
+ object
+ statistics for bigpages
+ global policy for page cache? currently it uses process policy. Requires
+ first item above.
+ handle mremap for shared memory (currently ignored for the policy)
+ grows down?
+ make bind policy root only? It can trigger oom much faster and the
+ kernel is not always grateful with that.
+ could replace all the switch()es with a mempolicy_ops structure.
+*/
+
+#include <linux/mempolicy.h>
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <linux/kernel.h>
+#include <linux/sched.h>
+#include <linux/mm.h>
+#include <linux/gfp.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/module.h>
+#include <linux/interrupt.h>
+#include <linux/init.h>
+#include <asm/uaccess.h>
+
+static kmem_cache_t *policy_cache;
+static kmem_cache_t *sn_cache;
+
+#define round_up(x,y) (((x) + (y) - 1) & ~((y)-1))
+#define PDprintk(fmt...)
+
+/* Highest zone. A specific allocation for a zone below that is not
+ policied. */
+static int policy_zone;
+
+static struct mempolicy default_policy = {
+ .refcnt = ATOMIC_INIT(1), /* never free it */
+ .policy = MPOL_DEFAULT,
+};
+
+/* Check if all specified nodes are online */
+static int check_online(unsigned long *nodes)
+{
+ DECLARE_BITMAP(offline, MAX_NUMNODES);
+ bitmap_copy(offline, node_online_map, MAX_NUMNODES);
+ if (bitmap_empty(offline, MAX_NUMNODES))
+ set_bit(0, offline);
+ bitmap_complement(offline, MAX_NUMNODES);
+ bitmap_and(offline, offline, nodes, MAX_NUMNODES);
+ if (!bitmap_empty(offline, MAX_NUMNODES))
+ return -EINVAL;
+ return 0;
+}
+
+/* Do sanity checking on a policy */
+static int check_policy(int mode, unsigned long *nodes)
+{
+ int empty = bitmap_empty(nodes, MAX_NUMNODES);
+ switch (mode) {
+ case MPOL_DEFAULT:
+ if (!empty)
+ return -EINVAL;
+ break;
+ case MPOL_BIND:
+ case MPOL_INTERLEAVE:
+ /* Preferred will only use the first bit, but allow
+ more for now. */
+ if (empty)
+ return -EINVAL;
+ break;
+ }
+ return check_online(nodes);
+}
+
+/* Copy a node mask from user space. */
+static int get_nodes(unsigned long *nodes, unsigned long *nmask,
+ unsigned long maxnode, int mode)
+{
+ unsigned long k;
+ unsigned long nlongs;
+ unsigned long endmask;
+
+ --maxnode;
+ nlongs = BITS_TO_LONGS(maxnode);
+ if ((maxnode % BITS_PER_LONG) == 0)
+ endmask = ~0UL;
+ else
+ endmask = (1UL << (maxnode % BITS_PER_LONG)) - 1;
+
+ /* When the user specified more nodes than supported just check
+ if the non supported part is all zero. */
+ if (nmask && nlongs > BITS_TO_LONGS(MAX_NUMNODES)) {
+ for (k = BITS_TO_LONGS(MAX_NUMNODES); k < nlongs; k++) {
+ unsigned long t;
+ if (get_user(t, nmask + k))
+ return -EFAULT;
+ if (k == nlongs - 1) {
+ if (t & endmask)
+ return -EINVAL;
+ } else if (t)
+ return -EINVAL;
+ }
+ nlongs = BITS_TO_LONGS(MAX_NUMNODES);
+ endmask = ~0UL;
+ }
+
+ bitmap_clear(nodes, MAX_NUMNODES);
+ if (nmask && copy_from_user(nodes, nmask, nlongs*sizeof(unsigned long)))
+ return -EFAULT;
+ nodes[nlongs-1] &= endmask;
+ return check_policy(mode, nodes);
+}
+
+/* Generate a custom zonelist for the BIND policy. */
+static struct zonelist *bind_zonelist(unsigned long *nodes)
+{
+ struct zonelist *zl;
+ int num, max, nd;
+
+ max = 1 + MAX_NR_ZONES * bitmap_weight(nodes, MAX_NUMNODES);
+ zl = kmalloc(sizeof(void *) * max, GFP_KERNEL);
+ if (!zl)
+ return NULL;
+ num = 0;
+ for (nd = find_first_bit(nodes, MAX_NUMNODES);
+ nd < MAX_NUMNODES;
+ nd = find_next_bit(nodes, MAX_NUMNODES, 1+nd)) {
+ int k;
+ for (k = MAX_NR_ZONES-1; k >= 0; k--) {
+ struct zone *z = &NODE_DATA(nd)->node_zones[k];
+ if (!z->present_pages)
+ continue;
+ zl->zones[num++] = z;
+ if (k > policy_zone)
+ policy_zone = k;
+ }
+ }
+ BUG_ON(num >= max);
+ zl->zones[num] = NULL;
+ return zl;
+}
+
+/* Create a new policy */
+static struct mempolicy *new_policy(int mode, unsigned long *nodes)
+{
+ struct mempolicy *policy;
+ PDprintk("setting mode %d nodes[0] %lx\n", mode, nodes[0]);
+ if (mode == MPOL_DEFAULT)
+ return NULL;
+ policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+ if (!policy)
+ return ERR_PTR(-ENOMEM);
+ atomic_set(&policy->refcnt, 1);
+ switch (mode) {
+ case MPOL_INTERLEAVE:
+ bitmap_copy(policy->v.nodes, nodes, MAX_NUMNODES);
+ break;
+ case MPOL_PREFERRED:
+ policy->v.preferred_node = find_first_bit(nodes, MAX_NUMNODES);
+ if (policy->v.preferred_node >= MAX_NUMNODES)
+ policy->v.preferred_node = -1;
+ break;
+ case MPOL_BIND:
+ policy->v.zonelist = bind_zonelist(nodes);
+ if (policy->v.zonelist == NULL) {
+ kmem_cache_free(policy_cache, policy);
+ return ERR_PTR(-ENOMEM);
+ }
+ break;
+ }
+ policy->policy = mode;
+ return policy;
+}
+
+/* Ensure all existing pages follow the policy. */
+static int
+verify_pages(unsigned long addr, unsigned long end, unsigned long *nodes)
+{
+ while (addr < end) {
+ struct page *p;
+ pte_t *pte;
+ pmd_t *pmd;
+ pgd_t *pgd = pgd_offset_k(addr);
+ if (pgd_none(*pgd)) {
+ addr = (addr + PGDIR_SIZE) & PGDIR_MASK;
+ continue;
+ }
+ pmd = pmd_offset(pgd, addr);
+ if (pmd_none(*pmd)) {
+ addr = (addr + PMD_SIZE) & PMD_MASK;
+ continue;
+ }
+ p = NULL;
+ pte = pte_offset_map(pmd, addr);
+ if (pte_present(*pte))
+ p = pte_page(*pte);
+ pte_unmap(pte);
+ if (p) {
+ unsigned nid = page_zone(p)->zone_pgdat->node_id;
+ if (!test_bit(nid, nodes))
+ return -EIO;
+ }
+ addr += PAGE_SIZE;
+ }
+ return 0;
+}
+
+/* Step 1: check the range */
+static struct vm_area_struct *
+check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
+ unsigned long *nodes, unsigned long flags)
+{
+ int err;
+ struct vm_area_struct *first, *vma, *prev;
+
+ first = find_vma(mm, start);
+ if (!first)
+ return ERR_PTR(-EFAULT);
+ prev = NULL;
+ for (vma = first; vma->vm_start < end; vma = vma->vm_next) {
+ if (!vma->vm_next && vma->vm_end < end)
+ return ERR_PTR(-EFAULT);
+ if (prev && prev->vm_end < vma->vm_start)
+ return ERR_PTR(-EFAULT);
+ if ((flags & MPOL_MF_STRICT) && !is_vm_hugetlb_page(vma)) {
+ err = verify_pages(vma->vm_start, vma->vm_end, nodes);
+ if (err) {
+ first = ERR_PTR(err);
+ break;
+ }
+ }
+ prev = vma;
+ }
+ return first;
+}
+
+/* Apply policy to a single VMA */
+static int policy_vma(struct vm_area_struct *vma, struct mempolicy *new)
+{
+ int err = 0;
+ struct mempolicy *old = vma->vm_policy;
+
+ PDprintk("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
+ vma->vm_start, vma->vm_end, vma->vm_pgoff,
+ vma->vm_ops, vma->vm_file,
+ vma->vm_ops ? vma->vm_ops->set_policy : NULL);
+
+ if (vma->vm_file)
+ down(&vma->vm_file->f_mapping->i_shared_sem);
+ if (vma->vm_ops && vma->vm_ops->set_policy)
+ err = vma->vm_ops->set_policy(vma, new);
+ if (!err) {
+ mpol_get(new);
+ vma->vm_policy = new;
+ mpol_free(old);
+ }
+ if (vma->vm_file)
+ up(&vma->vm_file->f_mapping->i_shared_sem);
+ return err;
+}
+
+/* Step 2: apply policy to a range and do splits. */
+static int mbind_range(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end, struct mempolicy *new)
+{
+ struct vm_area_struct *next;
+ int err;
+
+ err = 0;
+ for (; vma->vm_start < end; vma = next) {
+ next = vma->vm_next;
+ if (vma->vm_start < start)
+ err = split_vma(vma->vm_mm, vma, start, 1);
+ if (!err && vma->vm_end > end)
+ err = split_vma(vma->vm_mm, vma, end, 0);
+ if (!err)
+ err = policy_vma(vma, new);
+ if (err)
+ break;
+ }
+ return err;
+}
+
+/* Change policy for a memory range */
+asmlinkage long sys_mbind(unsigned long start, unsigned long len,
+ unsigned long mode,
+ unsigned long *nmask, unsigned long maxnode,
+ unsigned flags)
+{
+ struct vm_area_struct *vma;
+ struct mm_struct *mm = current->mm;
+ struct mempolicy *new;
+ unsigned long end;
+ DECLARE_BITMAP(nodes, MAX_NUMNODES);
+ int err;
+
+ if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
+ return -EINVAL;
+ if (start & ~PAGE_MASK)
+ return -EINVAL;
+ if (mode == MPOL_DEFAULT)
+ flags &= ~MPOL_MF_STRICT;
+ len = (len + PAGE_SIZE - 1) & PAGE_MASK;
+ end = start + len;
+ if (end < start)
+ return -EINVAL;
+ if (end == start)
+ return 0;
+
+ err = get_nodes(nodes, nmask, maxnode, mode);
+ if (err)
+ return err;
+
+ new = new_policy(mode, nodes);
+ if (IS_ERR(new))
+ return PTR_ERR(new);
+
+ PDprintk("mbind %lx-%lx mode:%ld nodes:%lx\n",start,start+len,
+ mode,nodes[0]);
+
+ down_write(&mm->mmap_sem);
+ vma = check_range(mm, start, end, nodes, flags);
+ err = PTR_ERR(vma);
+ if (!IS_ERR(vma))
+ err = mbind_range(vma, start, end, new);
+ up_write(&mm->mmap_sem);
+ mpol_free(new);
+ return err;
+}
+
+/* Set the process memory policy */
+asmlinkage long sys_set_mempolicy(int mode, unsigned long *nmask,
+ unsigned long maxnode)
+{
+ int err;
+ struct mempolicy *new;
+ DECLARE_BITMAP(nodes, MAX_NUMNODES);
+
+ if (mode > MPOL_MAX)
+ return -EINVAL;
+ err = get_nodes(nodes, nmask, maxnode, mode);
+ if (err)
+ return err;
+ new = new_policy(mode, nodes);
+ if (IS_ERR(new))
+ return PTR_ERR(new);
+ mpol_free(current->mempolicy);
+ current->mempolicy = new;
+ if (new && new->policy == MPOL_INTERLEAVE)
+ current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
+ return 0;
+}
+
+/* Fill a zone bitmap for a policy */
+static void get_zonemask(struct mempolicy *p, unsigned long *nodes)
+{
+ int i;
+ bitmap_clear(nodes, MAX_NUMNODES);
+ switch (p->policy) {
+ case MPOL_BIND:
+ for (i = 0; p->v.zonelist->zones[i]; i++)
+ __set_bit(p->v.zonelist->zones[i]->zone_pgdat->node_id, nodes);
+ break;
+ case MPOL_DEFAULT:
+ break;
+ case MPOL_INTERLEAVE:
+ bitmap_copy(nodes, p->v.nodes, MAX_NUMNODES);
+ break;
+ case MPOL_PREFERRED:
+ /* or use current node instead of online map? */
+ if (p->v.preferred_node < 0)
+ bitmap_copy(nodes, node_online_map, MAX_NUMNODES);
+ else
+ __set_bit(p->v.preferred_node, nodes);
+ break;
+ default:
+ BUG();
+ }
+}
+
+static int lookup_node(struct mm_struct *mm, unsigned long addr)
+{
+ struct page *p;
+ int err;
+ err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
+ if (err >= 0) {
+ err = page_zone(p)->zone_pgdat->node_id;
+ put_page(p);
+ }
+ return err;
+}
+
+/* Copy a kernel node mask to user space */
+static int copy_nodes_to_user(unsigned long *user_mask, unsigned long maxnode,
+ unsigned long *nodes)
+{
+ unsigned long copy = round_up(maxnode-1, BITS_PER_LONG) / 8;
+ if (copy > sizeof(nodes)) {
+ if (copy > PAGE_SIZE)
+ return -EINVAL;
+ if (clear_user((char*)user_mask + sizeof(nodes), copy - sizeof(nodes)))
+ return -EFAULT;
+ copy = sizeof(nodes);
+ }
+ return copy_to_user(user_mask, nodes, copy) ? -EFAULT : 0;
+}
+
+/* Retrieve NUMA policy */
+asmlinkage long sys_get_mempolicy(int *policy,
+ unsigned long *nmask, unsigned long maxnode,
+ unsigned long addr, unsigned long flags)
+{
+ int err, pval;
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma = NULL;
+ struct mempolicy *pol = current->mempolicy;
+
+ if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
+ return -EINVAL;
+ if (nmask != NULL && maxnode < numnodes)
+ return -EINVAL;
+ if (flags & MPOL_F_ADDR) {
+ down_read(&mm->mmap_sem);
+ vma = find_vma_intersection(mm, addr, addr+1);
+ if (!vma) {
+ up_read(&mm->mmap_sem);
+ return -EFAULT;
+ }
+ if (vma->vm_ops && vma->vm_ops->get_policy)
+ pol = vma->vm_ops->get_policy(vma, addr);
+ else
+ pol = vma->vm_policy;
+ } else if (addr)
+ return -EINVAL;
+
+ if (!pol)
+ pol = &default_policy;
+
+ if (flags & MPOL_F_NODE) {
+ if (flags & MPOL_F_ADDR) {
+ err = lookup_node(mm, addr);
+ if (err < 0)
+ goto out;
+ pval = err;
+ } else if (pol == current->mempolicy && pol->policy == MPOL_INTERLEAVE)
+ pval = current->il_next;
+ else {
+ err = -EINVAL;
+ goto out;
+ }
+ } else
+ pval = pol->policy;
+
+ err = -EFAULT;
+ if (policy && put_user(pval, policy))
+ goto out;
+
+ err = 0;
+ if (nmask) {
+ DECLARE_BITMAP(nodes, MAX_NUMNODES);
+ get_zonemask(pol, nodes);
+ err = copy_nodes_to_user(nmask, maxnode, nodes);
+ }
+
+ out:
+ if (vma)
+ up_read(&current->mm->mmap_sem);
+ return err;
+}
+
+/* Return effective policy for a VMA */
+static struct mempolicy *
+get_vma_policy(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mempolicy *pol = current->mempolicy;
+ if (vma) {
+ if (vma->vm_ops && vma->vm_ops->get_policy)
+ pol = vma->vm_ops->get_policy(vma, addr);
+ else if (vma->vm_policy && vma->vm_policy->policy != MPOL_DEFAULT)
+ pol = vma->vm_policy;
+ }
+ if (!pol)
+ pol = &default_policy;
+ return pol;
+}
+
+/* Return a zonelist representing a mempolicy */
+static struct zonelist *zonelist_policy(unsigned gfp, struct mempolicy *policy)
+{
+ int nd;
+ switch (policy->policy) {
+ case MPOL_PREFERRED:
+ nd = policy->v.preferred_node;
+ if (nd < 0)
+ nd = numa_node_id();
+ break;
+ case MPOL_BIND:
+ /* Lower zones don't get a policy applied */
+ if (gfp >= policy_zone)
+ return policy->v.zonelist;
+ /*FALL THROUGH*/
+ case MPOL_INTERLEAVE: /* should not happen */
+ case MPOL_DEFAULT:
+ nd = numa_node_id();
+ break;
+ default:
+ nd = 0;
+ BUG();
+ }
+ return NODE_DATA(nd)->node_zonelists + (gfp & GFP_ZONEMASK);
+}
+
+/* Do dynamic interleaving for a process */
+static unsigned interleave_nodes(struct mempolicy *policy)
+{
+ unsigned nid, next;
+ struct task_struct *me = current;
+ nid = me->il_next;
+ BUG_ON(nid >= MAX_NUMNODES);
+ next = find_next_bit(policy->v.nodes, MAX_NUMNODES, 1+nid);
+ if (next >= MAX_NUMNODES)
+ next = find_first_bit(policy->v.nodes, MAX_NUMNODES);
+ me->il_next = next;
+ return nid;
+}
+
+/* Do static interleaving for a VMA with known offset. */
+static unsigned
+offset_il_node(struct mempolicy *pol, struct vm_area_struct *vma, unsigned long off)
+{
+ unsigned target = (unsigned)off % (unsigned)numnodes;
+ int c;
+ int nid = -1;
+ c = 0;
+ do {
+ nid = find_next_bit(pol->v.nodes, MAX_NUMNODES, nid+1);
+ if (nid >= MAX_NUMNODES) {
+ nid = -1;
+ continue;
+ }
+ c++;
+ } while (c <= target);
+ BUG_ON(nid >= MAX_NUMNODES);
+ return nid;
+}
+
+/* Allocate a page in interleaved policy for a VMA. Use the offset
+ into the VMA as key. Own path because it needs to do special accounting. */
+static struct page *alloc_page_interleave(unsigned gfp, unsigned nid)
+{
+ struct zonelist *zl;
+ struct page *page;
+ BUG_ON(!test_bit(nid, node_online_map));
+ zl = NODE_DATA(nid)->node_zonelists + (gfp & GFP_ZONEMASK);
+ page = __alloc_pages(gfp, 0, zl);
+ if (page && page_zone(page) == zl->zones[0]) {
+ zl->zones[0]->pageset[get_cpu()].interleave_hit++;
+ put_cpu();
+ }
+ return page;
+}
+
+/**
+ * alloc_page_vma - Allocate a page for a VMA.
+ *
+ * @gfp:
+ * %GFP_USER user allocation.
+ * %GFP_KERNEL kernel allocations,
+ * %GFP_HIGHMEM highmem/user allocations,
+ * %GFP_FS allocation should not call back into a file system.
+ * %GFP_ATOMIC don't sleep.
+ *
+ * @vma: Pointer to VMA or NULL if not available.
+ * @addr: Virtual Address of the allocation. Must be inside the VMA.
+ *
+ * This function allocates a page from the kernel page pool and applies
+ * a NUMA policy associated with the VMA or the current process.
+ * When VMA is not NULL caller must hold down_read on the mmap_sem of the
+ * mm_struct of the VMA to prevent it from going away. Should be used for
+ * all allocations for pages that will be mapped into
+ * user space. Returns NULL when no page can be allocated.
+ *
+ * Should be called with the mm_sem of the vma held.
+ */
+struct page *
+alloc_page_vma(unsigned gfp, struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mempolicy *pol = get_vma_policy(vma, addr);
+ if (unlikely(pol->policy == MPOL_INTERLEAVE)) {
+ unsigned nid;
+ if (vma) {
+ unsigned long off;
+ BUG_ON(addr >= vma->vm_end);
+ BUG_ON(addr < vma->vm_start);
+ off = vma->vm_pgoff;
+ off += (addr - vma->vm_start) >> PAGE_SHIFT;
+ nid = offset_il_node(pol, vma, off);
+ } else {
+ /* fall back to process interleaving */
+ nid = interleave_nodes(pol);
+ }
+ return alloc_page_interleave(gfp, nid);
+ }
+ return __alloc_pages(gfp, 0, zonelist_policy(gfp, pol));
+}
+
+/**
+ * alloc_pages_current - Allocate pages.
+ *
+ * @gfp: %GFP_USER user allocation,
+ * %GFP_KERNEL kernel allocation,
+ * %GFP_HIGHMEM highmem allocation,
+ * %GFP_FS don't call back into a file system.
+ * %GFP_ATOMIC don't sleep.
+ * @order: Power of two of allocation size in pages. 0 is a single page.
+ *
+ * Allocate a page from the kernel page pool. When not in
+ * interrupt context and apply the current process NUMA policy.
+ * Returns NULL when no page can be allocated.
+ */
+struct page *alloc_pages_current(unsigned gfp, unsigned order)
+{
+ struct mempolicy *pol = current->mempolicy;
+ if (!pol || in_interrupt())
+ pol = &default_policy;
+ if (pol->policy == MPOL_INTERLEAVE && order == 0)
+ return alloc_page_interleave(gfp, interleave_nodes(pol));
+ return __alloc_pages(gfp, order, zonelist_policy(gfp, pol));
+}
+
+EXPORT_SYMBOL(alloc_pages_current);
+
+/* Slow path of a mempolicy copy */
+struct mempolicy *__mpol_copy(struct mempolicy *old)
+{
+ struct mempolicy *new = kmem_cache_alloc(policy_cache, GFP_KERNEL);
+ if (!new)
+ return ERR_PTR(-ENOMEM);
+ *new = *old;
+ atomic_set(&new->refcnt, 1);
+ if (new->policy == MPOL_BIND) {
+ int sz = ksize(old->v.zonelist);
+ new->v.zonelist = kmalloc(sz, SLAB_KERNEL);
+ if (!new->v.zonelist) {
+ kmem_cache_free(policy_cache, new);
+ return ERR_PTR(-ENOMEM);
+ }
+ memcpy(new->v.zonelist, old->v.zonelist, sz);
+ }
+ return new;
+}
+
+/* Slow path of a mempolicy comparison */
+int __mpol_equal(struct mempolicy *a, struct mempolicy *b)
+{
+ if (!a || !b)
+ return 0;
+ if (a->policy != b->policy)
+ return 0;
+ switch (a->policy) {
+ case MPOL_DEFAULT:
+ return 1;
+ case MPOL_INTERLEAVE:
+ return bitmap_equal(a->v.nodes, b->v.nodes, MAX_NUMNODES);
+ case MPOL_PREFERRED:
+ return a->v.preferred_node == b->v.preferred_node;
+ case MPOL_BIND: {
+ int i;
+ for (i = 0; a->v.zonelist->zones[i]; i++)
+ if (a->v.zonelist->zones[i] != b->v.zonelist->zones[i])
+ return 0;
+ return b->v.zonelist->zones[i] == NULL;
+ }
+ default:
+ BUG();
+ return 0;
+ }
+}
+
+/* Slow path of a mpol destructor. */
+extern void __mpol_free(struct mempolicy *p)
+{
+ if (!atomic_dec_and_test(&p->refcnt))
+ return;
+ if (p->policy == MPOL_BIND)
+ kfree(p->v.zonelist);
+ p->policy = MPOL_DEFAULT;
+ kmem_cache_free(policy_cache, p);
+}
+
+/*
+ * Hugetlb policy. Same as above, just works with node numbers instead of
+ * zonelists.
+ */
+
+/* Find first node suitable for an allocation */
+int mpol_first_node(struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mempolicy *pol = get_vma_policy(vma, addr);
+ switch (pol->policy) {
+ case MPOL_DEFAULT:
+ return numa_node_id();
+ case MPOL_BIND:
+ return pol->v.zonelist->zones[0]->zone_pgdat->node_id;
+ case MPOL_INTERLEAVE:
+ return interleave_nodes(pol);
+ case MPOL_PREFERRED:
+ return pol->v.preferred_node >= 0 ? pol->v.preferred_node:numa_node_id();
+ }
+ BUG();
+ return 0;
+}
+
+/* Find secondary valid nodes for an allocation */
+int mpol_node_valid(int nid, struct vm_area_struct *vma, unsigned long addr)
+{
+ struct mempolicy *pol = get_vma_policy(vma, addr);
+ switch (pol->policy) {
+ case MPOL_PREFERRED:
+ case MPOL_DEFAULT:
+ case MPOL_INTERLEAVE:
+ return 1;
+ case MPOL_BIND: {
+ struct zone **z;
+ for (z = pol->v.zonelist->zones; *z; z++)
+ if ((*z)->zone_pgdat->node_id == nid)
+ return 1;
+ return 0;
+ }
+ default:
+ BUG();
+ return 0;
+ }
+}
+
+/*
+ * Shared memory backing store policy support.
+ *
+ * Remember policies even when nobody has shared memory mapped.
+ * The policies are kept in Red-Black tree linked from the inode.
+ * They are protected by the sp->sem semaphore, which should be held
+ * for any accesses to the tree.
+ */
+
+/* lookup first element intersecting start-end */
+/* Caller holds sp->sem */
+static struct sp_node *
+sp_lookup(struct shared_policy *sp, unsigned long start, unsigned long end)
+{
+ struct rb_node *n = sp->root.rb_node;
+ while (n) {
+ struct sp_node *p = rb_entry(n, struct sp_node, nd);
+ if (start >= p->end) {
+ n = n->rb_right;
+ } else if (end < p->start) {
+ n = n->rb_left;
+ } else {
+ break;
+ }
+ }
+ if (!n)
+ return NULL;
+ for (;;) {
+ struct sp_node *w = NULL;
+ struct rb_node *prev = rb_prev(n);
+ if (!prev)
+ break;
+ w = rb_entry(prev, struct sp_node, nd);
+ if (w->end <= start)
+ break;
+ n = prev;
+ }
+ return rb_entry(n, struct sp_node, nd);
+}
+
+/* Insert a new shared policy into the list. */
+/* Caller holds sp->sem */
+static void sp_insert(struct shared_policy *sp, struct sp_node *new)
+{
+ struct rb_node **p = &sp->root.rb_node;
+ struct rb_node *parent = NULL;
+ struct sp_node *nd;
+ while (*p) {
+ parent = *p;
+ nd = rb_entry(parent, struct sp_node, nd);
+ if (new->start < nd->start)
+ p = &(*p)->rb_left;
+ else if (new->end > nd->end)
+ p = &(*p)->rb_right;
+ else
+ BUG();
+ }
+ rb_link_node(&new->nd, parent, p);
+ rb_insert_color(&new->nd, &sp->root);
+ PDprintk("inserting %lx-%lx: %d\n", new->start, new->end,
+ new->policy ? new->policy->policy : 0);
+}
+
+/* Find shared policy intersecting idx */
+struct mempolicy *
+mpol_shared_policy_lookup(struct shared_policy *sp, unsigned long idx)
+{
+ struct mempolicy *pol = NULL;
+ struct sp_node *sn;
+ down(&sp->sem);
+ sn = sp_lookup(sp, idx, idx+1);
+ if (sn) {
+ mpol_get(sn->policy);
+ pol = sn->policy;
+ }
+ up(&sp->sem);
+ return pol;
+}
+
+static void sp_delete(struct shared_policy *sp, struct sp_node *n)
+{
+ PDprintk("deleting %lx-l%x\n", n->start, n->end);
+ rb_erase(&n->nd, &sp->root);
+ mpol_free(n->policy);
+ kmem_cache_free(sn_cache, n);
+}
+
+struct sp_node *
+sp_alloc(unsigned long start, unsigned long end, struct mempolicy *pol)
+{
+ struct sp_node *n = kmem_cache_alloc(sn_cache, GFP_KERNEL);
+ if (!n)
+ return NULL;
+ n->start = start;
+ n->end = end;
+ mpol_get(pol);
+ n->policy = pol;
+ return n;
+}
+
+/* Replace a policy range. */
+static int shared_policy_replace(struct shared_policy *sp, unsigned long start,
+ unsigned long end, struct sp_node *new)
+{
+ struct sp_node *n, *new2;
+
+ down(&sp->sem);
+ n = sp_lookup(sp, start, end);
+ /* Take care of old policies in the same range. */
+ while (n && n->start < end) {
+ struct rb_node *next = rb_next(&n->nd);
+ if (n->start >= start) {
+ if (n->end <= end)
+ sp_delete(sp, n);
+ else
+ n->start = end;
+ } else {
+ /* Old policy spanning whole new range. */
+ if (n->end > end) {
+ new2 = sp_alloc(end, n->end, n->policy);
+ if (!new2) {
+ up(&sp->sem);
+ return -ENOMEM;
+ }
+ n->end = end;
+ sp_insert(sp, new2);
+ }
+ /* Old crossing beginning, but not end (easy) */
+ if (n->start < start && n->end > start)
+ n->end = start;
+ }
+ if (!next)
+ break;
+ n = rb_entry(next, struct sp_node, nd);
+ }
+ if (new)
+ sp_insert(sp, new);
+ up(&sp->sem);
+ return 0;
+}
+
+int mpol_set_shared_policy(struct shared_policy *info, struct vm_area_struct *vma,
+ struct mempolicy *npol)
+{
+ int err;
+ struct sp_node *new = NULL;
+ unsigned long sz = vma_pages(vma);
+
+ PDprintk("set_shared_policy %lx sz %lu %d %lx\n",
+ vma->vm_pgoff,
+ sz, npol? npol->policy : -1,
+ npol ? npol->v.nodes[0] : -1);
+
+ if (npol) {
+ new = sp_alloc(vma->vm_pgoff, vma->vm_pgoff + sz, npol);
+ if (!new)
+ return -ENOMEM;
+ }
+ err = shared_policy_replace(info, vma->vm_pgoff, vma->vm_pgoff+sz, new);
+ if (err && new)
+ kmem_cache_free(sn_cache, new);
+ return err;
+}
+
+/* Free a backing policy store on inode delete. */
+void mpol_free_shared_policy(struct shared_policy *p)
+{
+ struct sp_node *n;
+ struct rb_node *next;
+ down(&p->sem);
+ next = rb_first(&p->root);
+ while (next) {
+ n = rb_entry(next, struct sp_node, nd);
+ next = rb_next(&n->nd);
+ rb_erase(&n->nd, &p->root);
+ mpol_free(n->policy);
+ kmem_cache_free(sn_cache, n);
+ }
+ up(&p->sem);
+}
+
+static __init int numa_policy_init(void)
+{
+ policy_cache = kmem_cache_create("numa_policy",
+ sizeof(struct mempolicy),
+ 0, 0, NULL, NULL);
+
+ sn_cache = kmem_cache_create("shared_policy_node",
+ sizeof(struct sp_node),
+ 0, 0, NULL, NULL);
+
+ if (!policy_cache || !sn_cache)
+ panic("Cannot create NUMA policy cache");
+ return 0;
+}
+__initcall(numa_policy_init);
Matthew Dobson <[email protected]> wrote:
>
> Just from the patches you posted, I would really disagree that these are
> ready for merging into -mm.
I have them all merged up here. I made a number of small changes -
additional CONFIG_NUMA ifdefs, whitespace improvements, remove unneeded
arch_hugetlb_fault() implementation. The core patch created two copies of
the same file in mempolicy.h, compile fix in mmap.c and a few other things.
It builds OK for NUMAQ, although NUMAQ does have a problem:
drivers/built-in.o: In function `acpi_pci_root_add':
drivers/built-in.o(.text+0x22015): undefined reference to `pci_acpi_scan_root'
ppc64+CONFIG_NUMA compiles OK.
On Wed, 7 Apr 2004 14:51:30 -0700
Andrew Morton <[email protected]> wrote:
> Matthew Dobson <[email protected]> wrote:
> >
> > Just from the patches you posted, I would really disagree that these are
> > ready for merging into -mm.
>
> I have them all merged up here. I made a number of small changes -
> additional CONFIG_NUMA ifdefs, whitespace improvements, remove unneeded
> arch_hugetlb_fault() implementation. The core patch created two copies of
> the same file in mempolicy.h, compile fix in mmap.c and a few other things.
Sorry about the bad patches. I will try to be more careful in the future.
What was the problem in mmap.c ? I compiled in various combinations (with
and without NUMA on i386 and x86-64) and it worked.
And why was arch_hugetlb_fault() unneeded?
> It builds OK for NUMAQ, although NUMAQ does have a problem:
>
> drivers/built-in.o: In function `acpi_pci_root_add':
> drivers/built-in.o(.text+0x22015): undefined reference to `pci_acpi_scan_root'
>
> ppc64+CONFIG_NUMA compiles OK.
ppc64 doesn't have the system calls hooked up, but I'm not sure how useful
it would be for these boxes anyways (afaik they are pretty uniform)
-Andi
On Wed, 2004-04-07 at 14:45, Andi Kleen wrote:
> On Wed, 07 Apr 2004 14:41:02 -0700
> Matthew Dobson <[email protected]> wrote:
>
> > On Wed, 2004-04-07 at 14:27, Andi Kleen wrote:
> > > On Wed, 07 Apr 2004 14:24:19 -0700
> > > Matthew Dobson <[email protected]> wrote:
> > >
> > > > I must be missing something here, but did you not include mempolicy.h
> > > > and policy.c in these patches? I can't seem to find them anywhere?!?
> > > > It's really hard to evaluate your patches if the core of them is
> > > > missing!
> > >
> > > It was in the core patch and also in the last patch I sent Andrew.
> > > See ftp://ftp.suse.com/pub/people/ak/numa/* for the full patches
> >
> > Ok.. I'll check that link, but what you posted didn't have the files
> > (mempolicy.h & policy.c) in the patch:
>
> Indeed. Must have gone missing. Here are the files for reference.
>
> The full current broken out patchkit is in
> ftp.suse.com:/pub/people/ak/numa/2.6.5mc2/
Server isn't taking connections right now. At least for me... :(
Your patch still looks broken. It includes some files twice:
> diff -u linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o linux-2.6.5-mc2-numa/include/linux/mempolicy.h
> --- linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o 2004-04-07 12:07:18.000000000 +0200
> +++ linux-2.6.5-mc2-numa/include/linux/mempolicy.h 2004-04-07 12:07:13.000000000 +0200
> @@ -0,0 +1,219 @@
<snip>
> diff -u linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o linux-2.6.5-mc2-numa/include/linux/mempolicy.h
> --- linux-2.6.5-mc2-numa/include/linux/mempolicy.h-o 2004-04-07 12:07:18.000000000 +0200
> +++ linux-2.6.5-mc2-numa/include/linux/mempolicy.h 2004-04-07 12:07:13.000000000 +0200
> @@ -0,0 +1,219 @@
-Matt
--On Wednesday, April 07, 2004 14:51:30 -0700 Andrew Morton <[email protected]> wrote:
> Matthew Dobson <[email protected]> wrote:
>>
>> Just from the patches you posted, I would really disagree that these are
>> ready for merging into -mm.
>
> I have them all merged up here. I made a number of small changes -
> additional CONFIG_NUMA ifdefs, whitespace improvements, remove unneeded
> arch_hugetlb_fault() implementation. The core patch created two copies of
> the same file in mempolicy.h, compile fix in mmap.c and a few other things.
I think there are some design issues that still aren't resolved - we've
been over this a bit before, but I still don't think they're fixed.
It seems you're still making a copy of the binding structure for every
VMA, which seems ... extravagant. Can we share them? IIRC, the only
justification was the striping ... and I thought we agreed that was
better fixed by using the mod of the offset as a decider?
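(A mod-of-the-offset decider could be as simple as the sketch below -- names
made up, nnodes being the number of nodes in the policy's mask. The point is
that the node choice depends only on the offset and the mask, not on any
per-VMA counter, which is what would make sharing the policy object possible.)

  static unsigned pick_interleave_node(const unsigned long *nodes,
                                       unsigned nnodes, unsigned long pgoff)
  {
          unsigned i, nid = find_first_bit(nodes, MAX_NUMNODES);

          for (i = pgoff % nnodes; i > 0; i--)
                  nid = find_next_bit(nodes, MAX_NUMNODES, nid + 1);
          return nid;
  }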
Maybe I'm just misreading your code, in which case, feel free to spit at me ;-)
M.
>> ppc64+CONFIG_NUMA compiles OK.
>
> ppc64 doesn't have the system calls hooked up, but I'm not sure how useful
> it would be for these boxes anyways (afaik they are pretty uniform)
They actually are fairly keen on doing NUMA for HPC stuff - it makes
a significant performance improvement ...
M.
On Wed, 07 Apr 2004 15:39:17 -0700
"Martin J. Bligh" <[email protected]> wrote:
> >> ppc64+CONFIG_NUMA compiles OK.
> >
> > ppc64 doesn't have the system calls hooked up, but I'm not sure how useful
> > it would be for these boxes anyways (afaik they are pretty uniform)
>
> They actually are fairly keen on doing NUMA for HPC stuff - it makes
> a significant performance improvement ...
Ok. Should be straightforward to add the system calls for them if they're
interested.
The hugetlbpages may need some work though, but that's non-essential for
the beginning.
The new calls are all emulation clean too; you can just hand them through.
[I didn't add that yet, but it would be pretty simple.]
-Andi
Andi Kleen <[email protected]> wrote:
>
> What was the problem in mmap.c ? I compiled in various combinations (with
> and without NUMA on i386 and x86-64) and it worked.
mm/mmap.c: In function `copy_vma':
mm/mmap.c:1531: structure has no member named `vm_policy'
--- 25/mm/mmap.c~numa-api-vma-policy-hooks-fix Wed Apr 7 12:28:53 2004
+++ 25-akpm/mm/mmap.c Wed Apr 7 12:29:09 2004
@@ -1528,7 +1528,7 @@ struct vm_area_struct *copy_vma(struct v
find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
new_vma = vma_merge(mm, prev, rb_parent, addr, addr + len,
- vma->vm_flags, vma->vm_file, pgoff, vma->vm_policy);
+ vma->vm_flags, vma->vm_file, pgoff, vma_policy(vma));
if (!new_vma) {
new_vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
if (new_vma) {
_
> And why was arch_hugetlb_fault() unneeded?
Well we do:
if (condition-which-evaluates-to-constant-zero-on-CONFIG_HUGETLB=n)
arch_hugetlb_fault();
so the compiler will never emit a call to the stub function anyway.
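A minimal sketch of that pattern (all names made up, not the real kernel
interfaces):

  extern void special_fault(void);    /* declared, never defined anywhere */

  #ifdef CONFIG_SPECIAL_DEMO
  extern int addr_is_special(unsigned long addr);
  #define maybe_special(addr) addr_is_special(addr)
  #else
  #define maybe_special(addr) 0       /* constant zero when the option is off */
  #endif

  void demo_fault(unsigned long addr)
  {
          if (maybe_special(addr))    /* dead branch when the macro is 0 */
                  special_fault();    /* call is optimized away, so the missing
                                         definition never shows up as an
                                         undefined reference at link time */
  }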
But it turns out the stub is needed for now, because !X86 doesn't implement
arch_hugetlb_fault(). So I put it back.
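For illustration, the pattern boils down to roughly this (names here are made up, not the actual code):
#ifdef CONFIG_HUGETLB_PAGE
extern int addr_in_hugetlb_range(unsigned long addr);
#else
#define addr_in_hugetlb_range(addr)	0
#endif
int arch_hugetlb_fault(unsigned long addr);	/* stub definition still needed on !x86 */
static int fault_dispatch(unsigned long addr)
{
	/* With CONFIG_HUGETLB_PAGE=n the condition is a compile-time 0,
	   so gcc drops the branch and never emits a call to the stub. */
	if (addr_in_hugetlb_range(addr))
		return arch_hugetlb_fault(addr);
	return 0;
}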
> > It builds OK for NUMAQ, although NUMAQ does have a problem:
> >
> > drivers/built-in.o: In function `acpi_pci_root_add':
> > drivers/built-in.o(.text+0x22015): undefined reference to `pci_acpi_scan_root'
> >
> > ppc64+CONFIG_NUMA compiles OK.
>
> ppc64 doesn't have the system calls hooked up, but I'm not sure how useful
> it would be for these boxes anyways (afaik they are pretty uniform)
Well, it's best that the kernel actually compiles still..
On Wed, 07 Apr 2004 15:38:24 -0700
"Martin J. Bligh" <[email protected]> wrote:
> I think there are some design issues that still aren't resolved - we've
> been over this a bit before, but I still don't think they're fixed.
> It seems you're still making a copy of the binding structure for every
> VMA, which seems ... extravagant. Can we share them? IIRC, the only
Sharing is only an optimization that adds more code and potential
for more bugs (hash tables, locking etc.).
We can discuss changes when someone shows numbers that additional
optimizations are needed. I haven't seen such numbers and I'm not convinced
sharing is even a good idea from a design standpoint. For the first version
I just aimed to get something working with straightforward code.
To put it all in perspective: a policy is 12 bytes on a 32bit machine
(assuming MAX_NUMNODES <= 32) and 16 bytes on a 64bit machine
(with MAX_NUMNODES <= 64)
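For reference, those numbers come out of a layout roughly like this (just a sketch; see the core patch for the real definition):
struct mempolicy {
	atomic_t refcnt;
	short policy;				/* MPOL_* mode */
	union {
		struct zonelist *zonelist;		/* MPOL_BIND */
		short preferred_node;			/* MPOL_PREFERRED */
		DECLARE_BITMAP(nodes, MAX_NUMNODES);	/* MPOL_INTERLEAVE */
	} v;
};
/* 32-bit, MAX_NUMNODES <= 32: 4 + 2 + 2 padding + 4 = 12 bytes
   64-bit, MAX_NUMNODES <= 64: 4 + 2 + 2 padding + 8 = 16 bytes */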
-Andi
Andi Kleen <[email protected]> wrote:
>
> We can discuss changes when someone shows numbers that additional
> optimizations are needed. I haven't seen such numbers and I'm not convinced
> sharing is even a good idea from a design standpoint. For the first version
> I just aimed to get something working with straightforward code.
>
> To put it all in perspective: a policy is 12 bytes on a 32bit machine
> (assuming MAX_NUMNODES <= 32) and 16 bytes on a 64bit machine
> (with MAX_NUMNODES <= 64)
sizeof(vm_area_struct) is a very sensitive thing on ia32. If you expect
that anyone is likely to actually use the numa API on 32-bit, sharing
will be important.
It should be useful for SMT, yes?
> sizeof(vm_area_struct) is a very sensitive thing on ia32. If you expect
> that anyone is likely to actually use the numa API on 32-bit, sharing
Me please ;-)
M.
On Wed, 7 Apr 2004 15:52:25 -0700
Andrew Morton <[email protected]> wrote:
> Andi Kleen <[email protected]> wrote:
> >
> > We can discuss changes when someone shows numbers that additional
> > optimizations are needed. I haven't seen such numbers and I'm not convinced
> > sharing is even a good idea from a design standpoint. For the first version
> > I just aimed to get something working with straightforward code.
> >
> > To put it all in perspective: a policy is 12 bytes on a 32bit machine
> > (assuming MAX_NUMNODES <= 32) and 16 bytes on a 64bit machine
> > (with MAX_NUMNODES <= 64)
>
> sizeof(vm_area_struct) is a very sensitive thing on ia32. If you expect
> that anyone is likely to actually use the numa API on 32-bit, sharing
> will be important.
I don't really believe that. If it was that way someone would have already
done all the obvious space optimizations left on the table...
(like using rb_next or merging the rb color into flags)
NUMA API adds a new pointer, but all sharing in the world couldn't fix that.
When you set a policy != default you will also pay the 12 or 16 bytes overhead
for the object for each "policy region"
> It should be useful for SMT, yes?
Nope. Only for real NUMA.
-Andi
Andi Kleen <[email protected]> wrote:
>
> On Wed, 7 Apr 2004 15:52:25 -0700
> Andrew Morton <[email protected]> wrote:
>
> > Andi Kleen <[email protected]> wrote:
> > >
> > > We can discuss changes when someone shows numbers that additional
> > > optimizations are needed. I haven't seen such numbers and I'm not convinced
> > > sharing is even a good idea from a design standpoint. For the first version
> > > I just aimed to get something working with straightforward code.
> > >
> > > To put it all in perspective: a policy is 12 bytes on a 32bit machine
> > > (assuming MAX_NUMNODES <= 32) and 16 bytes on a 64bit machine
> > > (with MAX_NUMNODES <= 64)
> >
> > sizeof(vm_area_struct) is a very sensitive thing on ia32. If you expect
> > that anyone is likely to actually use the numa API on 32-bit, sharing
> > will be important.
>
> I don't really believe that.
You better. VMA space exhaustion is one of the reasons for introducing
remap_file_pages(). It's an oracle-killer. Like everything else ;)
> If it was that way someone would have already
> done all the obvious space optimizations left on the table...
> (like using rb_next or merging the rb color into flags)
Nope, we're slack.
> NUMA API adds a new pointer, but all sharing in the world couldn't fix that.
> When you set a policy != default you will also pay the 12 or 16 bytes overhead
> for the object for each "policy region"
OK, that's not so bad. So if you don't use the feature the overhead is 4
bytes/vma.
If you _do_ use the feature, what is the overhead? 12 bytes for each and
every vma? Or just for the vma's which have a non-default policy?
Your patch takes the CONFIG_NUMA vma from 64 bytes to 68. It would be nice
to pull those 4 bytes back somehow.
On Wed, 7 Apr 2004 16:56:39 -0700
Andrew Morton <[email protected]> wrote:
> If you _do_ use the feature, what is the overhead? 12 bytes for each and
> every vma? Or just for the vma's which have a non-default policy?
Just for VMAs with a non default policy. Default policy is always NULL
> Your patch takes the CONFIG_NUMA vma from 64 bytes to 68. It would be nice
> to pull those 4 bytes back somehow.
Eliminate the RB color field or use rb_next() instead of vm_next. First
alternative is cheaper.
-Andi
On Thu, Apr 08, 2004 at 01:35:22AM +0200, Andi Kleen wrote:
> NUMA API adds a new pointer, but all sharing in the world couldn't fix that.
that's fine. As we already discussed from my part I only care that the
4bytes for the pointer go away if you compile with CONFIG_NUMA=n, it
doesn't really matter much but it'd like if it would be optimized for
the long term.
On Thu, Apr 08, 2004 at 02:14:48AM +0200, Andi Kleen wrote:
> Eliminate the RB color field or use rb_next() instead of vm_next. First
> alternative is cheaper.
with eliminate I assume you mean to reuse a bit in the vma for that
(like vma->vm_flags), somewhere the color bit info is needed to make the
rebalanacing in mmap quick but to still guarantee the max height <= 2 *
min height.
On Thu, 8 Apr 2004 02:26:26 +0200
Andrea Arcangeli <[email protected]> wrote:
> On Thu, Apr 08, 2004 at 02:14:48AM +0200, Andi Kleen wrote:
> > Eliminate the RB color field or use rb_next() instead of vm_next. First
> > alternative is cheaper.
>
> with eliminate I assume you mean to reuse a bit in the vma for that
> (like vma->vm_flags), somewhere the color bit info is needed to make the
> rebalanacing in mmap quick but to still guarantee the max height <= 2 *
> min height.
Yes. Put it either into flags or into the low order bits of a pointer.
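E.g. something like this (sketch only, not a patch; the rebalancing code would need to be adapted):
/* vm_area_structs are at least word aligned, so the low bit of a parent
   pointer is free to hold the colour. */
struct rbp_node {
	unsigned long rb_parent_color;		/* parent pointer | colour */
	struct rbp_node *rb_left, *rb_right;
};
#define RBP_RED		0UL
#define RBP_BLACK	1UL
static inline struct rbp_node *rbp_parent(struct rbp_node *n)
{
	return (struct rbp_node *)(n->rb_parent_color & ~1UL);
}
static inline unsigned long rbp_color(struct rbp_node *n)
{
	return n->rb_parent_color & 1UL;
}
static inline void rbp_set_parent(struct rbp_node *n, struct rbp_node *parent)
{
	n->rb_parent_color = (unsigned long)parent | rbp_color(n);
}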
-Andi
On Wed, 2004-04-07 at 14:45, Andi Kleen wrote:
> diff -u linux-2.6.5-mc2-numa/mm/mempolicy.c-o linux-2.6.5-mc2-numa/mm/mempolicy.c
> --- linux-2.6.5-mc2-numa/mm/mempolicy.c-o 2004-04-07 12:07:41.000000000 +0200
> +++ linux-2.6.5-mc2-numa/mm/mempolicy.c 2004-04-07 13:07:02.000000000 +0200
<snip>
> +/* Do sanity checking on a policy */
> +static int check_policy(int mode, unsigned long *nodes)
> +{
> + int empty = bitmap_empty(nodes, MAX_NUMNODES);
> + switch (mode) {
> + case MPOL_DEFAULT:
> + if (!empty)
> + return -EINVAL;
> + break;
> + case MPOL_BIND:
> + case MPOL_INTERLEAVE:
> + /* Preferred will only use the first bit, but allow
> + more for now. */
> + if (empty)
> + return -EINVAL;
> + break;
> + }
> + return check_online(nodes);
> +}
Is there a reason you don't have a case for MPOL_PREFERRED? You have a
comment about it in the function, but you don't check the nodemask isn't
empty...
> +/* Copy a node mask from user space. */
> +static int get_nodes(unsigned long *nodes, unsigned long *nmask,
> + unsigned long maxnode, int mode)
> +{
> + unsigned long k;
> + unsigned long nlongs;
> + unsigned long endmask;
> +
> + --maxnode;
> + nlongs = BITS_TO_LONGS(maxnode);
> + if ((maxnode % BITS_PER_LONG) == 0)
> + endmask = ~0UL;
> + else
> + endmask = (1UL << (maxnode % BITS_PER_LONG)) - 1;
> +
> + /* When the user specified more nodes than supported just check
> + if the non supported part is all zero. */
> + if (nmask && nlongs > BITS_TO_LONGS(MAX_NUMNODES)) {
> + for (k = BITS_TO_LONGS(MAX_NUMNODES); k < nlongs; k++) {
> + unsigned long t;
> + if (get_user(t, nmask + k))
> + return -EFAULT;
> + if (k == nlongs - 1) {
> + if (t & endmask)
> + return -EINVAL;
> + } else if (t)
> + return -EINVAL;
> + }
> + nlongs = BITS_TO_LONGS(MAX_NUMNODES);
> + endmask = ~0UL;
> + }
> +
> + bitmap_clear(nodes, MAX_NUMNODES);
> + if (nmask && copy_from_user(nodes, nmask, nlongs*sizeof(unsigned long)))
> + return -EFAULT;
> + nodes[nlongs-1] &= endmask;
> + return check_policy(mode, nodes);
> +}
In this function, why do we care what bits the user set past
MAX_NUMNODES? Why shouldn't we just silently ignore the bits like we do
in sys_sched_setaffinity? If a user tries to hand us an 8k bitmask, my
opinion is we should just grab as much as we care about (MAX_NUMNODES
bits rounded up to the nearest UL).
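Something like this, maybe (just a sketch, ignoring the maxnode off-by-one handling; this is not the posted get_nodes()):
static int get_nodes_truncated(unsigned long *nodes, unsigned long *nmask,
			       unsigned long maxnode)
{
	unsigned long nlongs = BITS_TO_LONGS(maxnode);
	/* Only copy as many words as the kernel cares about. */
	if (nlongs > BITS_TO_LONGS(MAX_NUMNODES))
		nlongs = BITS_TO_LONGS(MAX_NUMNODES);
	bitmap_clear(nodes, MAX_NUMNODES);
	if (nmask && copy_from_user(nodes, nmask, nlongs * sizeof(unsigned long)))
		return -EFAULT;
	/* Silently drop any bits above MAX_NUMNODES in the last word. */
	if (MAX_NUMNODES % BITS_PER_LONG)
		nodes[BITS_TO_LONGS(MAX_NUMNODES) - 1] &=
			(1UL << (MAX_NUMNODES % BITS_PER_LONG)) - 1;
	return 0;
}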
> +/* Generate a custom zonelist for the BIND policy. */
> +static struct zonelist *bind_zonelist(unsigned long *nodes)
> +{
> + struct zonelist *zl;
> + int num, max, nd;
> +
> + max = 1 + MAX_NR_ZONES * bitmap_weight(nodes, MAX_NUMNODES);
> + zl = kmalloc(sizeof(void *) * max, GFP_KERNEL);
> + if (!zl)
> + return NULL;
> + num = 0;
> + for (nd = find_first_bit(nodes, MAX_NUMNODES);
> + nd < MAX_NUMNODES;
> + nd = find_next_bit(nodes, MAX_NUMNODES, 1+nd)) {
> + int k;
> + for (k = MAX_NR_ZONES-1; k >= 0; k--) {
> + struct zone *z = &NODE_DATA(nd)->node_zones[k];
> + if (!z->present_pages)
> + continue;
> + zl->zones[num++] = z;
> + if (k > policy_zone)
> + policy_zone = k;
> + }
> + }
> + BUG_ON(num >= max);
> + zl->zones[num] = NULL;
> + return zl;
> +}
This seems a bit strange to me. Instead of just allocating a whole
struct zonelist, you're allocating part of one? I guess it's safe,
since the array is meant to be NULL terminated, but we should put a note
in any code using these zonelists that they *aren't* regular zonelists,
they will be smaller, and dereferencing arbitrary array elements in the
struct could be dangerous. I think we'd be better off creating a
kmem_cache_t for these and using *whole* zonelist structures.
Allocating part of a well-defined structure makes me a bit nervous...
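Roughly what I mean (a sketch against the 2.6.5 slab API; names are illustrative, and the policy_zone bookkeeping from the original is omitted):
static kmem_cache_t *mpol_zonelist_cache;
static int __init mpol_zonelist_init(void)
{
	mpol_zonelist_cache = kmem_cache_create("mpol_zonelist",
					sizeof(struct zonelist), 0,
					SLAB_HWCACHE_ALIGN, NULL, NULL);
	return mpol_zonelist_cache ? 0 : -ENOMEM;
}
static struct zonelist *bind_zonelist_full(unsigned long *nodes)
{
	struct zonelist *zl = kmem_cache_alloc(mpol_zonelist_cache, GFP_KERNEL);
	int num = 0, nd, k;
	if (!zl)
		return NULL;
	for (nd = find_first_bit(nodes, MAX_NUMNODES); nd < MAX_NUMNODES;
	     nd = find_next_bit(nodes, MAX_NUMNODES, 1+nd)) {
		for (k = MAX_NR_ZONES-1; k >= 0; k--) {
			struct zone *z = &NODE_DATA(nd)->node_zones[k];
			if (z->present_pages)
				zl->zones[num++] = z;
		}
	}
	/* A full struct zonelist always has room for the terminator. */
	zl->zones[num] = NULL;
	return zl;
}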
> +/* Create a new policy */
> +static struct mempolicy *new_policy(int mode, unsigned long *nodes)
> +{
> + struct mempolicy *policy;
> + PDprintk("setting mode %d nodes[0] %lx\n", mode, nodes[0]);
> + if (mode == MPOL_DEFAULT)
> + return NULL;
> + policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
> + if (!policy)
> + return ERR_PTR(-ENOMEM);
> + atomic_set(&policy->refcnt, 1);
> + switch (mode) {
> + case MPOL_INTERLEAVE:
> + bitmap_copy(policy->v.nodes, nodes, MAX_NUMNODES);
> + break;
> + case MPOL_PREFERRED:
> + policy->v.preferred_node = find_first_bit(nodes, MAX_NUMNODES);
> + if (policy->v.preferred_node >= MAX_NUMNODES)
> + policy->v.preferred_node = -1;
> + break;
> + case MPOL_BIND:
> + policy->v.zonelist = bind_zonelist(nodes);
> + if (policy->v.zonelist == NULL) {
> + kmem_cache_free(policy_cache, policy);
> + return ERR_PTR(-ENOMEM);
> + }
> + break;
> + }
> + policy->policy = mode;
> + return policy;
> +}
I'm guessing this is why you aren't checking MPOL_PREFERRED in
check_policy()? So the user can call mbind() with MPOL_PREFERRED and an
empty nodes bitmap and get the default behavior you mentioned in the
comments?
I've got to get some dinner now... I'll keep reading and send more
comments as I come up with them.
Cheers!
-Matt
On Wed, 07 Apr 2004 17:58:23 -0700
Matthew Dobson <[email protected]> wrote:
> Is there a reason you don't have a case for MPOL_PREFERRED? You have a
> comment about it in the function, but you don't check the nodemask isn't
> empty...
Empty preferred is a special case. It means DEFAULT. This is useful
when you have a process policy != DEFAULT, but want to set a specific
VMA to default. Normally default in a VMA would mean use process policy.
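In other words, the effective lookup order is roughly this (illustrative sketch, not the patch's actual code):
static struct mempolicy *effective_policy(struct vm_area_struct *vma)
{
	/* A non-NULL VMA policy overrides the process policy.  Setting an
	   empty MPOL_PREFERRED on a VMA yields a policy that behaves like
	   DEFAULT, which is how a VMA can be reset even when the process
	   policy is something else. */
	if (vma && vma->vm_policy)
		return vma->vm_policy;
	if (current->mempolicy)
		return current->mempolicy;
	return NULL;		/* NULL means system default behaviour */
}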
> In this function, why do we care what bits the user set past
> MAX_NUMNODES? Why shouldn't we just silently ignore the bits like we do
> in sys_sched_setaffinity? If a user tries to hand us an 8k bitmask, my
> opinion is we should just grab as much as we care about (MAX_NUMNODES
> bits rounded up to the nearest UL).
This is to catch uninitialized bits. Otherwise it could work on a kernel
with small MAX_NUMNODES, and then suddenly fail on a kernel with bigger
MAX_NUMNODES when a node isn't online.
> This seems a bit strange to me. Instead of just allocating a whole
> struct zonelist, you're allocating part of one? I guess it's safe,
> since the array is meant to be NULL terminated, but we should put a note
> in any code using these zonelists that they *aren't* regular zonelists,
> they will be smaller, and dereferencing arbitrary array elements in the
> struct could be dangerous. I think we'd be better off creating a
> kmem_cache_t for these and using *whole* zonelist structures.
> Allocating part of a well-defined structure makes me a bit nervous...
And that after all the whining about sharing policies? ;-) (a BIND policy will
always carry a zonelist). As far as I can see all existing zonelist code
just walks it until NULL.
I would not be opposed to always using a full one, but it would use considerably
more memory in many cases.
> I'm guessing this is why you aren't checking MPOL_PREFERRED in
> check_policy()? So the user can call mbind() with MPOL_PREFERRED and an
> empty nodes bitmap and get the default behavior you mentioned in the
> comments?
Yep.
-Andi
On Wed, 7 Apr 2004, Andrew Morton wrote:
>
> Your patch takes the CONFIG_NUMA vma from 64 bytes to 68. It would be nice
> to pull those 4 bytes back somehow.
How significant is this vma size issue?
anon_vma objrmap will add 20 bytes to each vma (on 32-bit arches):
8 for prio_tree, 12 for anon_vma linkage in vma,
sometimes another 12 for the anon_vma head itself.
anonmm objrmap adds just the 8 bytes for prio_tree,
remaining overhead 28 bytes per mm.
Seems hard on Andi to begrudge him 4.
Hugh
> On Wed, 7 Apr 2004, Andrew Morton wrote:
>>
>> Your patch takes the CONFIG_NUMA vma from 64 bytes to 68. It would be nice
>> to pull those 4 bytes back somehow.
>
> How significant is this vma size issue?
>
> anon_vma objrmap will add 20 bytes to each vma (on 32-bit arches):
> 8 for prio_tree, 12 for anon_vma linkage in vma,
> sometimes another 12 for the anon_vma head itself.
Ewwww. Isn't some of that shared most of the time though?
> anonmm objrmap adds just the 8 bytes for prio_tree,
> remaining overhead 28 bytes per mm.
28 bytes per *mm* is nothing, and I still think the prio_tree is
completely unnecessary. Nobody has ever demonstrated a real benchmark
that needs it, as far as I recall.
> Seems hard on Andi to begrudge him 4.
I don't care about the 4 bytes much (other than that the current 64 happens
to be a nice size). I just don't see the point in making copies of the
binding structure all the time ;-) Refcounts aren't that hard ... didn't
Greg do a kref just recently? ...
M.
On Thu, 8 Apr 2004, Martin J. Bligh wrote:
> > On Wed, 7 Apr 2004, Andrew Morton wrote:
> >>
> >> Your patch takes the CONFIG_NUMA vma from 64 bytes to 68. It would be nice
> >> to pull those 4 bytes back somehow.
> >
> > How significant is this vma size issue?
> >
> > anon_vma objrmap will add 20 bytes to each vma (on 32-bit arches):
> > 8 for prio_tree, 12 for anon_vma linkage in vma,
> > sometimes another 12 for the anon_vma head itself.
>
> Ewwww. Isn't some of that shared most of the time though?
The anon_vma head may well be shared with other vmas of the fork group.
But the anon_vma linkage is a list_head and a pointer within the vma.
prio_tree is already using a union as much as it can (and a pointer
where a list_head would simplify the code); Rajesh was thinking of
reusing vm_private_data for one pointer, but I've gone and used it
for nonlinear swapout.
> > anonmm objrmap adds just the 8 bytes for prio_tree,
> > remaining overhead 28 bytes per mm.
>
> 28 bytes per *mm* is nothing, and I still think the prio_tree is
> completely unneccesary. Nobody has ever demonstrated a real benchmark
> that needs it, as far as I recall.
I'm sure an Ingobench will shortly follow that observation.
Hugh
On Wed, 2004-04-07 at 18:31, Andi Kleen wrote:
> On Wed, 07 Apr 2004 17:58:23 -0700
> Matthew Dobson <[email protected]> wrote:
>
>
> > Is there a reason you don't have a case for MPOL_PREFERRED? You have a
> > comment about it in the function, but you don't check the nodemask isn't
> > empty...
>
> Empty preferred is a special case. It means DEFAULT. This is useful
> when you have a process policy != DEFAULT, but want to set a specific
> VMA to default. Normally default in a VMA would mean use process policy.
Ok.. That makes sense.
> > In this function, why do we care what bits the user set past
> > MAX_NUMNODES? Why shouldn't we just silently ignore the bits like we do
> > in sys_sched_setaffinity? If a user tries to hand us an 8k bitmask, my
> > opinion is we should just grab as much as we care about (MAX_NUMNODES
> > bits rounded up to the nearest UL).
>
> This is to catch uninitialized bits. Otherwise it could work on a kernel
> with small MAX_NUMNODES, and then suddenly fail on a kernel with bigger
> MAX_NUMNODES when a node isn't online.
I am of the opinion that we should allow currently offline nodes in the
user's mask. Those nodes may come online later on, and we should
respect the user's request to allocate from those nodes if possible.
Just like in sched_setaffinity() we take in the user's mask, and when we
actually use the mask to make a decision, we check it against
cpu_online_map. Just because a node isn't online at the time of the
mbind() call doesn't mean it won't be soon. Besides, we should be
checking against node_online_map anyway, because nodes could go away.
Well, maybe not right now, but in the near future. Hotpluggable memory
is a reality, even if we don't support it just yet.
> > This seems a bit strange to me. Instead of just allocating a whole
> > struct zonelist, you're allocating part of one? I guess it's safe,
> > since the array is meant to be NULL terminated, but we should put a note
> > in any code using these zonelists that they *aren't* regular zonelists,
> > they will be smaller, and dereferencing arbitrary array elements in the
> > struct could be dangerous. I think we'd be better off creating a
> > kmem_cache_t for these and using *whole* zonelist structures.
> > Allocating part of a well-defined structure makes me a bit nervous...
>
> And that after all the whining about sharing policies? ;-) (a BIND policy will
> always carry a zonelist). As far as I can see all existing zonelist code
> just walks it until NULL.
>
> I would not be opposed to always using a full one, but it would use considerably
> more memory in many cases.
I'm not whining about sharing policies because of the space usage,
although that is a small side issue. I'm whining about sharing policies
because it just makes sense. You've got a data structure that is always
dynamically allocated and referenced by pointers, that has no instance
specific data in it, and that *already has* an atomic reference counter
in it. And you decided not to share this data structure?! In my
opinion, it's harder and more code to *not* share it... Instead of
copying the structure in mpol_copy(), just atomic_inc(policy->refcnt)
and we're pretty much done. You already do an atomic_dec_and_test() in
mpol_free()...
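Roughly (a sketch; the helper name is made up):
/* Share by reference instead of copying; mpol_free() already drops the
   count with atomic_dec_and_test(). */
static inline struct mempolicy *mpol_share(struct mempolicy *pol)
{
	if (pol)
		atomic_inc(&pol->refcnt);
	return pol;
}
/* e.g. in dup_mmap()/split_vma():
	new->vm_policy = mpol_share(old->vm_policy);	*/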
-Matt
On Thu, 8 Apr 2004, Hugh Dickins wrote:
> On Thu, 8 Apr 2004, Martin J. Bligh wrote:
> > > On Wed, 7 Apr 2004, Andrew Morton wrote:
> > >>
> > >> Your patch takes the CONFIG_NUMA vma from 64 bytes to 68. It would be nice
> > >> to pull those 4 bytes back somehow.
> > >
> > > How significant is this vma size issue?
> > >
> > > anon_vma objrmap will add 20 bytes to each vma (on 32-bit arches):
> > > 8 for prio_tree, 12 for anon_vma linkage in vma,
> > > sometimes another 12 for the anon_vma head itself.
> >
> > Ewwww. Isn't some of that shared most of the time though?
>
> The anon_vma head may well be shared with other vmas of the fork group.
> But the anon_vma linkage is a list_head and a pointer within the vma.
>
> prio_tree is already using a union as much as it can (and a pointer
> where a list_head would simplify the code); Rajesh was thinking of
> reusing vm_private_data for one pointer, but I've gone and used it
> for nonlinear swapout.
I guess using vm_private_data for nonlinear is not a problem because
we use list i_mmap_nonlinear for nonlinear vmas.
As you have found out vm_private_data is only used if vm_file != NULL
or VM_RESERVED or VM_DONTEXPAND is set. I think we can revert back to
i_mmap{_shared} list for such special cases and use prio_tree for
others. I may be missing something. Please teach me.
If anonmm is merged then I plan to seriously consider removing that
8 extra bytes for prio_tree. If anon_vma is merged, then I can easily
point my finger at 12 more bytes added by anon_vma and be happy :)
I still think removing the 8 extra bytes used by prio_tree from
vm_area_struct is possible.
> > > anonmm objrmap adds just the 8 bytes for prio_tree,
> > > remaining overhead 28 bytes per mm.
> >
> > 28 bytes per *mm* is nothing, and I still think the prio_tree is
> > completely unnecessary. Nobody has ever demonstrated a real benchmark
> > that needs it, as far as I recall.
>
> I'm sure an Ingobench will shortly follow that observation.
Yeap. If Andrew didn't write his rmap-test.c and Ingo didn't write his
test-mmap3.c, I wouldn't even have considered developing prio_tree.
Thanks,
Rajesh
Hugh Dickins <[email protected]> wrote:
>
> On Wed, 7 Apr 2004, Andrew Morton wrote:
> >
> > Your patch takes the CONFIG_NUMA vma from 64 bytes to 68. It would be nice
> > to pull those 4 bytes back somehow.
>
> How significant is this vma size issue?
For some workloads/machines it will simply cause an
approximately-proportional reduction in the size of the workload which we
can handle.
ie: there are some (oracle) workloads where the kernel craps out due to
lowmem vma exhaustion. If they're now using remap_file_pages() for this then
it may not be a problem any more. Ingo would know better than I.
> anon_vma objrmap will add 20 bytes to each vma (on 32-bit arches):
> 8 for prio_tree, 12 for anon_vma linkage in vma,
> sometimes another 12 for the anon_vma head itself.
>
> anonmm objrmap adds just the 8 bytes for prio_tree,
> remaining overhead 28 bytes per mm.
>
> Seems hard on Andi to begrudge him 4.
On Thu, 8 Apr 2004, Rajesh Venkatasubramanian wrote:
>
> I guess using vm_private_data for nonlinear is not a problem because
> we use list i_mmap_nonlinear for nonlinear vmas.
>
> As you have found out vm_private_data is only used if vm_file != NULL
> or VM_RESERVED or VM_DONTEXPAND is set. I think we can revert back to
> i_mmap{_shared} list for such special cases and use prio_tree for
> others. I may be missing something. Please teach me.
Sorry, I don't understand what you're proposing here, and why?
Oh, to save 4 bytes of vma by making the special cases use a list,
no need for vm_set_head, and vm_private_data for the driver itself;
but regular vmas use prio_tree, with vm_set_head in vm_private_data.
Hmm, right now it's getting too much for me: can we keep it simplish
for now, and come back to this later?
Hugh
On Thu, 8 Apr 2004, Hugh Dickins wrote:
> On Thu, 8 Apr 2004, Rajesh Venkatasubramanian wrote:
> >
> > I guess using vm_private_data for nonlinear is not a problem because
> > we use list i_mmap_nonlinear for nonlinear vmas.
> >
> > As you have found out vm_private_data is only used if vm_file != NULL
> > or VM_RESERVED or VM_DONTEXPAND is set. I think we can revert back to
> > i_mmap{_shared} list for such special cases and use prio_tree for
> > others. I may be missing something. Please teach me.
>
> Sorry, I don't understand what you're proposing here, and why?
> Oh, to save 4 bytes of vma by making the special cases use a list,
> no need for vm_set_head, and vm_private_data for the driver itself;
> but regular vmas use prio_tree, with vm_set_head in vm_private_data.
Yeah. You are right.
> Hmm, right now it's getting too much for me: can we keep it simplish
> for now, and come back to this later?
Yeah. This complicates the code further. That's why I didn't touch
it now. If things settle down and if we really worry about sizeof
vm_area_struct in the future, then we can remove the 8 bytes used
by prio_tree.
Rajesh
On Thu, Apr 08, 2004 at 03:20:22PM -0400, Rajesh Venkatasubramanian wrote:
> 8 extra bytes for prio_tree. If anon_vma is merged, then I can easily
> point my finger at 12 more bytes added by anon_vma and be happy :)
those 12 more bytes pay off by providing better swapping performance,
handling mremap transparently, and making the code scale better, even
allowing us to _trivially_ move the page_table_lock into the vma without
any downside (I mean performance downside; a per-vma lock will cost one
more bit of info in every vma).
There's no way to remove those 12 bytes without falling into the downsides
of anonmm.
I don't see a real need to shrink the size of the prio_tree right now,
though; if something is useless and can be removed, that's fine with me.
What I'm trying to say is that pointing the finger at the 12 bytes of
anon_vma doesn't sound like a good argument, since there's no way to remove
them without falling back into the anonmm downsides.
On Wed, 2004-04-07 at 14:45, Andi Kleen wrote:
<snip>
> diff -u linux-2.6.5-mc2-numa/mm/mempolicy.c-o linux-2.6.5-mc2-numa/mm/mempolicy.c
> --- linux-2.6.5-mc2-numa/mm/mempolicy.c-o 2004-04-07 12:07:41.000000000 +0200
> +++ linux-2.6.5-mc2-numa/mm/mempolicy.c 2004-04-07 13:07:02.000000000 +0200
<snip more>
> +/* Ensure all existing pages follow the policy. */
> +static int
> +verify_pages(unsigned long addr, unsigned long end, unsigned long *nodes)
> +{
> + while (addr < end) {
> + struct page *p;
> + pte_t *pte;
> + pmd_t *pmd;
> + pgd_t *pgd = pgd_offset_k(addr);
> + if (pgd_none(*pgd)) {
> + addr = (addr + PGDIR_SIZE) & PGDIR_MASK;
> + continue;
> + }
> + pmd = pmd_offset(pgd, addr);
> + if (pmd_none(*pmd)) {
> + addr = (addr + PMD_SIZE) & PMD_MASK;
> + continue;
> + }
> + p = NULL;
> + pte = pte_offset_map(pmd, addr);
> + if (pte_present(*pte))
> + p = pte_page(*pte);
> + pte_unmap(pte);
> + if (p) {
> + unsigned nid = page_zone(p)->zone_pgdat->node_id;
> + if (!test_bit(nid, nodes))
> + return -EIO;
> + }
> + addr += PAGE_SIZE;
> + }
> + return 0;
> +}
Instead of looking up a page's node number by
page_zone(p)->zone_pgdat->node_id, you can get the same information much
more efficiently by doing some bit-twiddling on page->flags. Use
page_nodenum(struct page *) from include/linux/mm.h.
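Concretely, the hunk above would become something like this (sketch only):
		if (p) {
			/* Decode the node straight from page->flags instead
			   of chasing page_zone()->zone_pgdat->node_id. */
			unsigned nid = page_nodenum(p);
			if (!test_bit(nid, nodes))
				return -EIO;
		}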
> +/* Apply policy to a single VMA */
> +static int policy_vma(struct vm_area_struct *vma, struct mempolicy *new)
> +{
> + int err = 0;
> + struct mempolicy *old = vma->vm_policy;
> +
> + PDprintk("vma %lx-%lx/%lx vm_ops %p vm_file %p set_policy %p\n",
> + vma->vm_start, vma->vm_end, vma->vm_pgoff,
> + vma->vm_ops, vma->vm_file,
> + vma->vm_ops ? vma->vm_ops->set_policy : NULL);
> +
> + if (vma->vm_file)
> + down(&vma->vm_file->f_mapping->i_shared_sem);
> + if (vma->vm_ops && vma->vm_ops->set_policy)
> + err = vma->vm_ops->set_policy(vma, new);
> + if (!err) {
> + mpol_get(new);
> + vma->vm_policy = new;
> + mpol_free(old);
> + }
> + if (vma->vm_file)
> + up(&vma->vm_file->f_mapping->i_shared_sem);
> + return err;
> +}
So it looks like you *are* sharing policies, at least for VMA's in the
range of a single mbind() call? This is a good start! ;) Looking
further ahead, I'm a bit confused. It seems despite *sharing* VMA's
belonging to a single mbind() call, you *copy* VMA's during dup_mmap(),
copy_process(), split_vma(), and move_vma(). So in the majority of
cases you duplicate policies instead of sharing them, but you *do* share
them in some instances? Why the inconsistency?
> +/* Change policy for a memory range */
> +asmlinkage long sys_mbind(unsigned long start, unsigned long len,
> + unsigned long mode,
> + unsigned long *nmask, unsigned long maxnode,
> + unsigned flags)
What would you think about giving sys_mbind() a pid argument, like
sys_sched_setaffinity()? It would make it much easier for sysadmins to
use the API if they didn't have to rewrite applications to make these
calls on their own. There's already a plethora of arguments, so one
more might be overkill.... Just a thought.
> +{
> + struct vm_area_struct *vma;
> + struct mm_struct *mm = current->mm;
> + struct mempolicy *new;
> + unsigned long end;
> + DECLARE_BITMAP(nodes, MAX_NUMNODES);
> + int err;
> +
> + if ((flags & ~(unsigned long)(MPOL_MF_STRICT)) || mode > MPOL_MAX)
> + return -EINVAL;
> + if (start & ~PAGE_MASK)
> + return -EINVAL;
> + if (mode == MPOL_DEFAULT)
> + flags &= ~MPOL_MF_STRICT;
> + len = (len + PAGE_SIZE - 1) & PAGE_MASK;
> + end = start + len;
> + if (end < start)
> + return -EINVAL;
> + if (end == start)
> + return 0;
> +
> + err = get_nodes(nodes, nmask, maxnode, mode);
> + if (err)
> + return err;
> +
> + new = new_policy(mode, nodes);
> + if (IS_ERR(new))
> + return PTR_ERR(new);
> +
> + PDprintk("mbind %lx-%lx mode:%ld nodes:%lx\n",start,start+len,
> + mode,nodes[0]);
> +
> + down_write(&mm->mmap_sem);
> + vma = check_range(mm, start, end, nodes, flags);
> + err = PTR_ERR(vma);
> + if (!IS_ERR(vma))
> + err = mbind_range(vma, start, end, new);
> + up_write(&mm->mmap_sem);
> + mpol_free(new);
> + return err;
> +}
> +
> +/* Set the process memory policy */
> +asmlinkage long sys_set_mempolicy(int mode, unsigned long *nmask,
> + unsigned long maxnode)
> +{
> + int err;
> + struct mempolicy *new;
> + DECLARE_BITMAP(nodes, MAX_NUMNODES);
> +
> + if (mode > MPOL_MAX)
> + return -EINVAL;
> + err = get_nodes(nodes, nmask, maxnode, mode);
> + if (err)
> + return err;
> + new = new_policy(mode, nodes);
> + if (IS_ERR(new))
> + return PTR_ERR(new);
> + mpol_free(current->mempolicy);
> + current->mempolicy = new;
> + if (new && new->policy == MPOL_INTERLEAVE)
> + current->il_next = find_first_bit(new->v.nodes, MAX_NUMNODES);
> + return 0;
> +}
Why not condense both the sys_mbind() & sys_set_mempolicy() into a
single call? The functionality of these calls (and even their code) is
very similar. The main difference is there is no need for looking up
VMA's and doing locking in the sys_set_mempolicy() case. You overload
the output side (sys_get_mempolicy()) to handle both whole process and
memory range options, but you don't do the same on the input
(sys_mbind() and sys_set_mempolicy()). Saving one syscall and having
them behave more symmetrically would be a nice addition...
> +/* Fill a zone bitmap for a policy */
> +static void get_zonemask(struct mempolicy *p, unsigned long *nodes)
> +{
> + int i;
> + bitmap_clear(nodes, MAX_NUMNODES);
> + switch (p->policy) {
> + case MPOL_BIND:
> + for (i = 0; p->v.zonelist->zones[i]; i++)
> + __set_bit(p->v.zonelist->zones[i]->zone_pgdat->node_id, nodes);
> + break;
> + case MPOL_DEFAULT:
> + break;
> + case MPOL_INTERLEAVE:
> + bitmap_copy(nodes, p->v.nodes, MAX_NUMNODES);
> + break;
> + case MPOL_PREFERRED:
> + /* or use current node instead of online map? */
> + if (p->v.preferred_node < 0)
> + bitmap_copy(nodes, node_online_map, MAX_NUMNODES);
> + else
> + __set_bit(p->v.preferred_node, nodes);
> + break;
> + default:
> + BUG();
> + }
> +}
Shouldn't this be called get_nodemask()? You're setting a bit in a
bitmask called 'nodes' for each node the policy is using, so you're
getting a nodemask, not a zonemask...
> +static int lookup_node(struct mm_struct *mm, unsigned long addr)
> +{
> + struct page *p;
> + int err;
> + err = get_user_pages(current, mm, addr & PAGE_MASK, 1, 0, 0, &p, NULL);
> + if (err >= 0) {
> + err = page_zone(p)->zone_pgdat->node_id;
> + put_page(p);
> + }
> + return err;
> +}
Again, you can save some pointer dereferences if you call
page_nodenum(p) instead of looking it up through the pgdat.
> +/* Retrieve NUMA policy */
> +asmlinkage long sys_get_mempolicy(int *policy,
> + unsigned long *nmask, unsigned long maxnode,
> + unsigned long addr, unsigned long flags)
I had a thought... Shouldn't all your user pointers be marked as such
with __user? Ie:
asmlinkage long sys_get_mempolicy(int __user *policy,
unsigned long __user *nmask,
unsigned long maxnode, unsigned long addr,
unsigned long flags)
This would apply to the other 2 syscalls as well.
> +{
> + int err, pval;
> + struct mm_struct *mm = current->mm;
> + struct vm_area_struct *vma = NULL;
> + struct mempolicy *pol = current->mempolicy;
> +
> + if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
> + return -EINVAL;
> + if (nmask != NULL && maxnode < numnodes)
> + return -EINVAL;
Did you mean: if (nmask == NULL || maxnode < numnodes) ?
<snip>
> +/* Do static interleaving for a VMA with known offset. */
> +static unsigned
> +offset_il_node(struct mempolicy *pol, struct vm_area_struct *vma, unsigned long off)
> +{
> + unsigned target = (unsigned)off % (unsigned)numnodes;
> + int c;
> + int nid = -1;
> + c = 0;
> + do {
> + nid = find_next_bit(pol->v.nodes, MAX_NUMNODES, nid+1);
> + if (nid >= MAX_NUMNODES) {
> + nid = -1;
> + continue;
> + }
> + c++;
> + } while (c <= target);
> + BUG_ON(nid >= MAX_NUMNODES);
> + return nid;
> +}
I think that BUG_ON should be BUG_ON(nid < 0), since it *shouldn't* be
possible to get through that loop with nid >= MAX_NUMNODES. The only
line that could set nid >= MAX_NUMNODES is nid = find_next_bit(...); and
the very next line ensures that if nid is in fact >= MAX_NUMNODES it
will be set to -1. It actually looks pretty redundant altogether,
but...
More comments to follow...
-Matt
> ie: there are some (oracle) workloads where the kernel craps out due to
> lowmem vma exhaustion. If they're now using remap_file_pages() for this then
> it may not be a problem any more. Ingo would know better than I.
remap_file_pages() from now on yes.
> Instead of looking up a page's node number by
> page_zone(p)->zone_pgdat->node_id, you can get the same information much
> more efficiently by doing some bit-twiddling on page->flags. Use
> page_nodenum(struct page *) from include/linux/mm.h.
Never noticed that before - I'd prefer we renamed this to page_to_nid
before anyone starts using it ... fits with the naming convention of
everything else (pfn_to_nid, etc). Nobody uses it right now - I grepped
the whole tree.
M.
diff -aurpN -X /home/fletch/.diff.exclude virgin/include/linux/mm.h name_nids/include/linux/mm.h
--- virgin/include/linux/mm.h Wed Mar 17 07:33:09 2004
+++ name_nids/include/linux/mm.h Thu Apr 8 22:27:24 2004
@@ -340,7 +340,7 @@ static inline unsigned long page_zonenum
{
return (page->flags >> NODEZONE_SHIFT) & (~(~0UL << ZONES_SHIFT));
}
-static inline unsigned long page_nodenum(struct page *page)
+static inline unsigned long page_to_nid(struct page *page)
{
return (page->flags >> (NODEZONE_SHIFT + ZONES_SHIFT));
}
Matthew Dobson <[email protected]> writes:
>
> Instead of looking up a page's node number by
> page_zone(p)->zone_pgdat->node_id, you can get the same information much
> more efficiently by doing some bit-twiddling on page->flags. Use
> page_nodenum(struct page *) from include/linux/mm.h.
That wasn't there when I wrote the code. I will do that change, thanks.
> So it looks like you *are* sharing policies, at least for VMA's in the
> range of a single mbind() call? This is a good start! ;) Looking
> further ahead, I'm a bit confused. It seems despite *sharing* VMA's
> belonging to a single mbind() call, you *copy* VMA's during dup_mmap(),
> copy_process(), split_vma(), and move_vma(). So in the majority of
> cases you duplicate policies instead of sharing them, but you *do* share
> them in some instances? Why the inconsistency?
Locking. policies get locked by their VM.
(the sharing you're advocating would require adding a new lock to
mempolicy)
>> +/* Change policy for a memory range */
>> +asmlinkage long sys_mbind(unsigned long start, unsigned long len,
>> + unsigned long mode,
>> + unsigned long *nmask, unsigned long maxnode,
>> + unsigned flags)
>
> What would you think about giving sys_mbind() a pid argument, like
> sys_sched_setaffinity()? It would make it much easier for sysadmins to
> use the API if they didn't have to rewrite applications to make these
> calls on their own. There's already a plethora of arguments, so one
> more might be overkill.... Just a thought.
It already has 6 arguments - 7 are not allowed. Also playing with a different
process' VM remotely is just a recipe for disaster IMHO (remember
all the security holes that used to be in /proc/pid/mem)
> Why not condense both the sys_mbind() & sys_set_mempolicy() into a
> single call? The functionality of these calls (and even their code) is
> very similar. The main difference is there is no need for looking up
> VMA's and doing locking in the sys_set_mempolicy() case. You overload
> the output side (sys_get_mempolicy()) to handle both whole process and
> memory range options, but you don't do the same on the input
> (sys_mbind() and sys_set_mempolicy()). Saving one syscall and having
> them behave more symmetrically would be a nice addition...
I think it's cleaner to have them separated.
>> +/* Retrieve NUMA policy */
>> +asmlinkage long sys_get_mempolicy(int *policy,
>> + unsigned long *nmask, unsigned long maxnode,
>> + unsigned long addr, unsigned long flags)
>
> I had a thought... Shouldn't all your user pointers be marked as such
> with __user? Ie:
I figure that the people who use the tools that need such ugly annotations
will add them themselves.
>
>> +{
>> + int err, pval;
>> + struct mm_struct *mm = current->mm;
>> + struct vm_area_struct *vma = NULL;
>> + struct mempolicy *pol = current->mempolicy;
>> +
>> + if (flags & ~(unsigned long)(MPOL_F_NODE|MPOL_F_ADDR))
>> + return -EINVAL;
>> + if (nmask != NULL && maxnode < numnodes)
>> + return -EINVAL;
>
> Did you mean: if (nmask == NULL || maxnode < numnodes) ?
No, the original test is correct.
> I think that BUG_ON should be BUG_ON(nid < 0), since it *shouldn't* be
> possible to get through that loop with nid >= MAX_NUMNODES. The only
> line that could set nid >= MAX_NUMNODES is nid = find_next_bit(...); and
> the very next line ensures that if nid is in fact >= MAX_NUMNODES it
> will be set to -1. It actually looks pretty redundant altogether,
> but...
Yes, it could be removed.
-Andi
Sounds good to me. I named it page_nodenum() to match
page_zonenum(), which I named to differentiate it from page_zone(). I
have no attachment to the names whatsoever, though.
-Matt
On Thu, 2004-04-08 at 22:29, Martin J. Bligh wrote:
> > Instead of looking up a page's node number by
> > page_zone(p)->zone_pgdat->node_id, you can get the same information much
> > more efficiently by doing some bit-twiddling on page->flags. Use
> > page_nodenum(struct page *) from include/linux/mm.h.
>
> Never noticed that before - I'd prefer we renamed this to page_to_nid
> before anyone starts using it ... fits with the naming convention of
> everything else (pfn_to_nid, etc). Nobody uses it right now - I grepped
> the whole tree.
>
> M.
>
> diff -aurpN -X /home/fletch/.diff.exclude virgin/include/linux/mm.h name_nids/include/linux/mm.h
> --- virgin/include/linux/mm.h Wed Mar 17 07:33:09 2004
> +++ name_nids/include/linux/mm.h Thu Apr 8 22:27:24 2004
> @@ -340,7 +340,7 @@ static inline unsigned long page_zonenum
> {
> return (page->flags >> NODEZONE_SHIFT) & (~(~0UL << ZONES_SHIFT));
> }
> -static inline unsigned long page_nodenum(struct page *page)
> +static inline unsigned long page_to_nid(struct page *page)
> {
> return (page->flags >> (NODEZONE_SHIFT + ZONES_SHIFT));
> }
>
Andi,
I'm sure you're sick of me commenting on your patches without "showing
you the money". I've attached a patch with some of the changes I think
would be beneficial. Feel free to let me know which changes you think
are crap and which you think are not.
Changes include:
1) Redefine the value of some of the MPOL_* flags
2) Rename check_* to mpol_check_*
3) Remove get_nodes(). This should be done in the same manner as
sys_sched_setaffinity(). We shouldn't care about unused high bits.
4) Create mpol_check_flags() to, well, check the flags. As the number
of flags and modes grows, it will be easier to do this check in its own
function.
5) In the syscalls (sys_mbind() & sys_set_mempolicy()), change 'len' to
a size_t, add __user to the declaration of 'nmask', change 'maxnode' to
'nmask_len', and condense 'flags' and 'mode' into 'flags'. The
motivation here is to make this syscall similar to
sys_sched_setaffinity(). These calls are basically the memory
equivalent of set/getaffinity, and should look & behave that way. Also,
dropping an argument leaves an opening for a pid argument, which I
believe would be good. We should allow processes (with appropriate
permissions, of course) to mbind other processes.
6) Change how end is calculated as follows:
end = PAGE_ALIGN(start+len);
start &= PAGE_MASK;
Basically, this allows users to pass in a non-page aligned 'start', and
makes sure we mbind all pages from the page containing 'start' to the
page containing 'start'+'len'.
This patch also shows that sys_mbind() and sys_set_mempolicy() have more
commonalities than differences. I believe these two syscalls should be
combined into one with the call signature of sys_mbind(). If the user
passes a start address and length of 0 (or maybe even a flag?), we bind
the whole process, otherwise we bind just a region. This would shrink
the patch even more than the measly 3 lines the current patch saves, and
save a syscall.
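Something along these lines (hypothetical signature and helper names, just to show the shape):
asmlinkage long sys_mbind(unsigned long start, unsigned long len,
			  unsigned long mode, unsigned long __user *nmask,
			  unsigned long maxnode, unsigned flags)
{
	/* len == 0: apply the policy to the whole process, as
	   sys_set_mempolicy() does today; otherwise bind just the range. */
	if (len == 0)
		return do_set_mempolicy(mode, nmask, maxnode);
	return do_mbind(start, len, mode, nmask, maxnode, flags);
}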
[mcd@arrakis source]$ diffstat ~/linux/patches/265-mm4/mcd_mods.patch
include/linux/mempolicy.h | 12 ++--
mm/mempolicy.c | 119 ++++++++++++++++++++++------------------------
2 files changed, 64 insertions(+), 67 deletions(-)
-Matt
On Wed, 14 Apr 2004 17:38:37 -0700
Matthew Dobson <[email protected]> wrote:
> 1) Redefine the value of some of the MPOL_* flags
I don't want to merge the flags and the mode argument. It's ugly.
> 2) Rename check_* to mpol_check_*
I really don't understand why you insist on renaming all my functions?
I like the current naming, thank you.
> 3) Remove get_nodes(). This should be done in the same manner as
> sys_sched_setaffinity(). We shouldn't care about unused high bits.
I disagree on that. This would break programs that are first tested
on a small machine and then later run on a big machine (common case)
> 4) Create mpol_check_flags() to, well, check the flags. As the number
> of flags and modes grows, it will be easier to do this check in its own
> function.
> 5) In the syscalls (sys_mbind() & sys_set_mempolicy()), change 'len' to
> a size_t, add __user to the declaration of 'nmask', change 'maxnode' to
unsigned long is the standard for system calls. Check some others.
> 'nmask_len', and condense 'flags' and 'mode' into 'flags'. The
> motivation here is to make this syscall similar to
> sys_sched_setaffinity(). These calls are basically the memory
> equivalent of set/getaffinity, and should look & behave that way. Also,
> dropping an argument leaves an opening for a pid argument, which I
> believe would be good. We should allow processes (with appropriate
> permissions, of course) to mbind other processes.
Messing with other process' VM is a recipe for disaster. There
used to be tons of exploitable races in /proc/pid/mem, I don't want to repeat that.
Adding pid to set_mem_policy would be a bit easier, but it would require
adding a lock to the task struct for this. Currently it is nice and lockless
because it relies on the fact that only the current process can change
its own policy. I prefer to keep it lockless, because that keeps the memory
allocation fast paths faster.
> 6) Change how end is calculated as follows:
> end = PAGE_ALIGN(start+len);
> start &= PAGE_MASK;
> Basically, this allows users to pass in a non-page aligned 'start', and
> makes sure we mbind all pages from the page containing 'start' to the
> page containing 'start'+'len'.
mprotect() does the EINVAL check on unalignment. I think it's better
to follow mprotect here.
-Andi
On Thu, Apr 15, 2004 at 12:39:15PM +0200, Andi Kleen wrote:
> On Wed, 14 Apr 2004 17:38:37 -0700
> Matthew Dobson <[email protected]> wrote:
> > 2) Rename check_* to mpol_check_*
>
> I really don't understand why you insist on renaming all my functions?
> I like the current naming, thank you.
>
I like the mpol_ flavors because there is no namespace collision
on these. When I look at cscope or any other code analysis tool,
I know that the mpol_... functions belong together. Makes
investigating other people's code easier.
Just my 2 cents,
Robin Holt
On Thu, 2004-04-15 at 04:48, Robin Holt wrote:
> On Thu, Apr 15, 2004 at 12:39:15PM +0200, Andi Kleen wrote:
> > On Wed, 14 Apr 2004 17:38:37 -0700
> > Matthew Dobson <[email protected]> wrote:
> > > 2) Rename check_* to mpol_check_*
> >
> > I really don't understand why you insist on renaming all my functions?
> > I like the current naming, thank you.
> >
>
> I like the mpol_ flavors because there is no namespace collision
> on these. When I look at cscope or any other code analysis tool,
> I know that the mpol_... functions belong together. Makes
> investigating other people's code easier.
>
> Just my 2 cents,
> Robin Holt
That was my motivation. It's not particularly important, especially
since the functions are all declared static. I just thought they were a
little more readable when prefixed by mpol_.
-Matt
On Thu, 2004-04-15 at 03:39, Andi Kleen wrote:
> On Wed, 14 Apr 2004 17:38:37 -0700
> Matthew Dobson <[email protected]> wrote:
>
>
>
> > 1) Redefine the value of some of the MPOL_* flags
>
> I don't want to merge the flags and the mode argument. It's ugly.
Well, if you're against adding a pid argument, this change is useless
anyway.
> > 2) Rename check_* to mpol_check_*
>
> I really don't understand why you insist on renaming all my functions?
> I like the current naming, thank you.
I don't want to rename them *all*! ;) There's a million functions named
check_foo(), so why not a few more? :)
> > 3) Remove get_nodes(). This should be done in the same manner as
> > sys_sched_setaffinity(). We shouldn't care about unused high bits.
>
> I disagree on that. This would break programs that are first tested
> on a small machine and then later run on a big machine (common case)
I don't follow this argument... How does this code work on a small
machine, but break on a big one:
DECLARE_BITMAP(nodes, MAX_NUMNODES);
bitmap_clear(nodes, MAX_NUMNODES);
set_bit(4, nodes);
mbind(startaddr, 8 * PAGE_SIZE, MPOL_PREFERRED, nodes, MAX_NUMNODES,
flags);
??
Userspace should be declaring an array of unsigned longs based on the
size of the machine they're running on. If userspace programs are just
declaring arbitrarily sized arrays and hoping that they'll work, they're
being stupid. If they're declaring oversized arrays and not zeroing the
array before passing it in, then they're being lazy. Either way, we're
going to throw the bits away, so do we care if they're being lazy or
stupid? If the bits we care about are sane, then we do what userspace
tells us. If the bits we care about aren't, we throw an error.
Besides, we don't return an error when they pass us too little data, why
do we return an error when they pass too much?
I remember one of your concerns was checking that all the nodes with set
bits need to be online. I think this is wrong. We need to be checking
node_online_map at fault time, not at binding time. Sooner or later
memory hotplug will go into the kernel, and then you (or someone) will have
to rewrite how this is handled. I'd rather do it right
the first time.
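For illustration, the fault-time masking I have in mind would look something like this (hypothetical helper, not in the posted patch):
static int first_usable_node(unsigned long *policy_nodes)
{
	DECLARE_BITMAP(usable, MAX_NUMNODES);
	/* Honour the stored policy, but only against nodes that are online
	   right now; offline nodes simply drop out until they (re)appear. */
	bitmap_and(usable, policy_nodes, node_online_map, MAX_NUMNODES);
	if (bitmap_empty(usable, MAX_NUMNODES))
		return numa_node_id();
	return find_first_bit(usable, MAX_NUMNODES);
}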
> > 4) Create mpol_check_flags() to, well, check the flags. As the number
> > of flags and modes grows, it will be easier to do this check in its own
> > function.
> > 5) In the syscalls (sys_mbind() & sys_set_mempolicy()), change 'len' to
> > a size_t, add __user to the declaration of 'nmask', change 'maxnode' to
>
> unsigned long is the standard for system calls. Check some others.
Ok. I see this set of system calls as sort of the illegitimate
love-child of mlock()/mprotect() and sched_set/getaffinity().
sys_mlock() was using a size_t, so I copied that...
> > 'nmask_len', and condense 'flags' and 'mode' into 'flags'. The
> > motivation here is to make this syscall similar to
> > sys_sched_setaffinity(). These calls are basically the memory
> > equivalent of set/getaffinity, and should look & behave that way. Also,
> > dropping an argument leaves an opening for a pid argument, which I
> > believe would be good. We should allow processes (with appropriate
> > permissions, of course) to mbind other processes.
>
> Messing with other process' VM is a recipe for disaster. There
> used to be tons of exploitable races in /proc/pid/mem, I don't want to repeat that.
> Adding pid to set_mem_policy would be a bit easier, but it would require
> adding a lock to the task struct for this. Currently it is nice and lockless
> because it relies on the fact that only the current process can change
> its own policy. I prefer to keep it lockless, because that keeps the memory
> allocation fast paths faster.
We're already grabbing the mmap_sem for writing when we modify the vma's
in sys_mbind. I haven't looked at what kind of locking would be
necessary if we were modifying a *different* processes vma's, but I
figured that would at least be a good start. If it's significantly more
complicated than that, you're probably right that it's not worth the
effort.
> > 6) Change how end is calculated as follows:
> > end = PAGE_ALIGN(start+len);
> > start &= PAGE_MASK;
> > Basically, this allows users to pass in a non-page aligned 'start', and
> > makes sure we mbind all pages from the page containing 'start' to the
> > page containing 'start'+'len'.
>
> mprotect() does the EINVAL check on unalignment. I think it's better
> to follow mprotect here.
Ok. I'm usually a fan of not failing if we don't *have* to, and an
unaligned start address is easily fixable. On the other hand, if other
syscalls throw errors for it, then it's fine with me.
BTW, any response to the idea of combining sys_mbind() &
sys_set_mempolicy()?
-Matt