Subject: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix


This patch is an attempt at fixing the much-discussed search complexity issues
in the objrmap design. The key idea is to replace the i_mmap{_shared} lists
with a new tree data structure.

The radix priority search tree (prio_tree), first proposed by Edward M.
McCreight, is used as the new data structure. A prio_tree is indexed by two
indices, which we call radix_index and heap_index. The tree is a simple binary
radix tree on the radix_index, with the additional heap-like property that a
parent node's heap_index is always greater than or equal to the heap_indices
of its children. A small sketch of this invariant follows.
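
To make the heap condition concrete, here is a minimal sketch (illustration
only, not part of the patch) of the invariant, written against the GET_INDEX
and prio_tree_{left,right}_empty helpers introduced further down:

static int heap_property_holds(struct prio_tree_node *node)
{
	unsigned long r_index, h_index, r_child, h_child;

	GET_INDEX(node, r_index, h_index);

	if (!prio_tree_left_empty(node)) {
		GET_INDEX(node->left, r_child, h_child);
		if (h_child > h_index || !heap_property_holds(node->left))
			return 0;
	}
	if (!prio_tree_right_empty(node)) {
		GET_INDEX(node->right, r_child, h_child);
		if (h_child > h_index || !heap_property_holds(node->right))
			return 0;
	}
	return 1;
}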

An interesting property of the prio_tree is that it is useful for storing and
querying intervals; for example, a file-mapped vm_area_struct can be
considered as an interval of file pages. If we store all vmas that map a file
in a prio_tree, then we can execute a stabbing query, i.e., select the set of
vmas that map a single file page or a set of contiguous file pages, in
O(log n + m) time, where "log n" indicates the height of the tree (maximum 64
on a 32 bit machine) and "m" represents the number of vmas that map the
page(s). A sketch of such a query appears just below.
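
As an illustration only (not part of the patch), a stabbing query over a
mapping's i_mmap_shared tree could look as follows, using the
__vma_prio_tree_first/__vma_prio_tree_next helpers added below; the caller is
assumed to hold mapping->i_shared_sem:

static void visit_vmas_mapping_range(struct address_space *mapping,
				     unsigned long first_pgoff,
				     unsigned long last_pgoff)
{
	struct prio_tree_iter iter;
	struct vm_area_struct *vma;

	vma = __vma_prio_tree_first(&mapping->i_mmap_shared, &iter,
				    first_pgoff, last_pgoff);
	while (vma) {
		/* vma maps at least one page in [first_pgoff, last_pgoff] */
		vma = __vma_prio_tree_next(vma, &mapping->i_mmap_shared,
					   &iter, first_pgoff, last_pgoff);
	}
}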

The test results below show that the prio_tree effectively solves the objrmap
i_mmap{_shared} search complexity problems. The tests were done on a PII
200 MHz machine with 256MB RAM using UP kernels.

This patch is for 2.6.5-rc2. The patch boots and works on both SMP and UP.
Further testing will help. If you prefer broken-out patches, please check:

http://www-personal.engin.umich.edu/~vrajesh/linux/prio_tree/

Some Results:

Kernel compile - 2.6.2 defconfig - make
2.6.5-rc2 3313.97 user 261.08 system 1:00:11 elapsed 98% CPU
rc2+prio+objrmap 3315.30 user 258.59 system 1:00:14 elapsed 98% CPU
rc2+objrmap 3316.41 user 257.77 system 1:00:10 elapsed 98% CPU

rmap-test 1 - ./rmap-test -l -i 10 -n 100 -s 600 -t 100 foo
2.6.5-rc2 67.57 user 277.14 system 0:13:12 elapsed 43% CPU
rc2+prio+objrmap 71.99 user 203.90 system 0:13:30 elapsed 34% CPU
rc2+objrmap 70.45 user 19834.38 system 7:28:04 elapsed 74% CPU
- I killed the process after 7 hours. The system was responsive
after killing the process. Based on previous results, the program
should not lock up or take this long. Maybe it is due to this problem
identified by Andrea:
http://marc.theaimsgroup.com/?l=linux-kernel&m=107966438414248

Andrea says the system may hang; in this case, however, the
system did not hang.

rmap-test 2 - ./rmap-test -vv -V 2 -r -i 1 -n 100 -s 600 -t 100 foo
2.6.5-rc2 0.58 user 212.50 system 0:07:32 elapsed 47% CPU
rc2+prio+objrmap 0.63 user 101.77 system 0:04:44 elapsed 36% CPU
rc2+objrmap 0.60 user 605.97 system 0:14:35 elapsed 69% CPU


rmap-test 3 - ./rmap-test -v -l -i 10 -n 10000 -s 7 -t 1 foo
2.6.5-rc2 1.07 user 31.08 system 0:16:06 elapsed 3% CPU
rc2+prio+objrmap 1.03 user 31.41 system 0:16:38 elapsed 3% CPU
rc2+objrmap 0.53 user 1588.40 system 2:25:27 elapsed 18% CPU
- I killed the process after around 2 1/2 hours.
The system was responsive after killing the process.

test-mmap2 (elapsed time format H:M:Sec)
2.6.5-rc2 0.00 user 0.34 system 0:0:01.55 elapsed 22% CPU
rc2+prio+objrmap 0.00 user 0.35 system 0:0:01.49 elapsed 23% CPU
rc2+objrmap - didn't try - already known to lock the system


test-mmap3 (elapsed time format H:M:Sec)
2.6.5-rc2 0.06 user 3.38 system 0:0:14.62 elapsed 23% CPU
rc2+prio+objrmap 0.09 user 3.65 system 0:0:13.99 elapsed 26% CPU
rc2+objrmap - didn't try - known to lock the system


Lowlights of the patch:

* Adds a new tree data structure (around 500 lines of code + bugs?)
- code seems reasonably stable. More testing will help a lot.

* Breaks compilation of hugetlbfs, xfs, and a few archs.
- easily fixable.

* Adds 2 extra pointers to vm_area_struct
- both of these pointers can be removed later.

- Plan:

* Shove vm_set_head into vm_private_data.

* I need a single bit, protected by i_shared_sem, to
mark prio_tree nodes. Once I am convinced that I can use
the least significant bit of vm_set_head for marking
nodes (vm_area_struct alignment?), I plan to remove the parent
field and use a percpu array in prio_tree_insert and
prio_tree_remove. In invalidate_mmap_range_list, we can try to
allocate an array from slab (helps to avoid high latency in
truncate); if that fails, we can fall back to the percpu array.
A sketch of the bit-marking idea follows this list.
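
For illustration, here is a minimal sketch (purely hypothetical, not part of
this patch) of the least-significant-bit marking idea, assuming
vm_area_struct is at least 2-byte aligned so that bit 0 of any pointer to it
is always clear:

#define VMA_TREE_NODE_MARK	1UL

/* mark a vma as a prio_tree node by setting bit 0 of vm_set_head */
static inline void vma_mark_tree_node(struct vm_area_struct *vma)
{
	vma->vm_set_head = (struct vm_area_struct *)
		((unsigned long)vma->vm_set_head | VMA_TREE_NODE_MARK);
}

static inline int vma_is_tree_node(struct vm_area_struct *vma)
{
	return (unsigned long)vma->vm_set_head & VMA_TREE_NODE_MARK;
}

/* strip the mark bit to recover the real vm_set_head pointer */
static inline struct vm_area_struct *vma_get_set_head(struct vm_area_struct *vma)
{
	return (struct vm_area_struct *)
		((unsigned long)vma->vm_set_head & ~VMA_TREE_NODE_MARK);
}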

Useful Links:

[1] Andrew Morton's rmap-test.c
http://marc.theaimsgroup.com/?l=linux-kernel&m=104954444204356

[2] Ingo's test-mmap2.c
http://marc.theaimsgroup.com/?l=linux-kernel&m=107883030601436

[3] Ingo's test-mmap3.c
http://marc.theaimsgroup.com/?l=linux-kernel&m=107886160312419



fs/exec.c | 2
fs/inode.c | 4
fs/locks.c | 6
fs/proc/task_mmu.c | 2
include/linux/fs.h | 5
include/linux/mm.h | 168 +++++++++++++
include/linux/prio_tree.h | 78 ++++++
init/main.c | 2
kernel/fork.c | 4
mm/Makefile | 3
mm/filemap.c | 2
mm/memory.c | 15 -
mm/mmap.c | 66 +++--
mm/mremap.c | 21 +
mm/prio_tree.c | 574 ++++++++++++++++++++++++++++++++++++++++++++++
mm/shmem.c | 2
mm/swap_state.c | 4
mm/vmscan.c | 4
18 files changed, 910 insertions(+), 52 deletions(-)

diff -puN fs/exec.c~prio_tree_core fs/exec.c
--- mmlinux-2.6/fs/exec.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/fs/exec.c 2004-03-21 16:25:34.000000000 -0500
@@ -430,7 +430,7 @@ int setup_arg_pages(struct linux_binprm
mpnt->vm_ops = NULL;
mpnt->vm_pgoff = 0;
mpnt->vm_file = NULL;
- INIT_LIST_HEAD(&mpnt->shared);
+ INIT_VMA_SHARED(mpnt);
mpnt->vm_private_data = (void *) 0;
insert_vm_struct(mm, mpnt);
mm->total_vm = (mpnt->vm_end - mpnt->vm_start) >> PAGE_SHIFT;
diff -puN fs/inode.c~prio_tree_core fs/inode.c
--- mmlinux-2.6/fs/inode.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/fs/inode.c 2004-03-21 16:25:01.000000000 -0500
@@ -189,8 +189,8 @@ void inode_init_once(struct inode *inode
atomic_set(&inode->i_data.truncate_count, 0);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
- INIT_LIST_HEAD(&inode->i_data.i_mmap);
- INIT_LIST_HEAD(&inode->i_data.i_mmap_shared);
+ INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
+ INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap_shared);
spin_lock_init(&inode->i_lock);
i_size_ordered_init(inode);
}
diff -puN fs/locks.c~prio_tree_core fs/locks.c
--- mmlinux-2.6/fs/locks.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/fs/locks.c 2004-03-21 16:25:01.000000000 -0500
@@ -1455,8 +1455,7 @@ int fcntl_setlk(struct file *filp, unsig
if (IS_MANDLOCK(inode) &&
(inode->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID) {
struct address_space *mapping = filp->f_mapping;
-
- if (!list_empty(&mapping->i_mmap_shared)) {
+ if (!prio_tree_empty(&mapping->i_mmap_shared)) {
error = -EAGAIN;
goto out;
}
@@ -1593,8 +1592,7 @@ int fcntl_setlk64(struct file *filp, uns
if (IS_MANDLOCK(inode) &&
(inode->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID) {
struct address_space *mapping = filp->f_mapping;
-
- if (!list_empty(&mapping->i_mmap_shared)) {
+ if (!prio_tree_empty(&mapping->i_mmap_shared)) {
error = -EAGAIN;
goto out;
}
diff -puN fs/proc/task_mmu.c~prio_tree_core fs/proc/task_mmu.c
--- mmlinux-2.6/fs/proc/task_mmu.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/fs/proc/task_mmu.c 2004-03-21 16:25:01.000000000 -0500
@@ -65,7 +65,7 @@ int task_statm(struct mm_struct *mm, int
*shared += pages;
continue;
}
- if (vma->vm_flags & VM_SHARED || !list_empty(&vma->shared))
+ if (vma->vm_flags & VM_SHARED || !vma_shared_empty(vma))
*shared += pages;
if (vma->vm_flags & VM_EXECUTABLE)
*text += pages;
diff -puN include/linux/fs.h~prio_tree_core include/linux/fs.h
--- mmlinux-2.6/include/linux/fs.h~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/include/linux/fs.h 2004-03-21 16:25:01.000000000 -0500
@@ -18,6 +18,7 @@
#include <linux/stat.h>
#include <linux/cache.h>
#include <linux/radix-tree.h>
+#include <linux/prio_tree.h>
#include <linux/kobject.h>
#include <asm/atomic.h>

@@ -329,8 +330,8 @@ struct address_space {
struct list_head io_pages; /* being prepared for I/O */
unsigned long nrpages; /* number of total pages */
struct address_space_operations *a_ops; /* methods */
- struct list_head i_mmap; /* list of private mappings */
- struct list_head i_mmap_shared; /* list of shared mappings */
+ struct prio_tree_root i_mmap; /* tree of private mappings */
+ struct prio_tree_root i_mmap_shared; /* tree of shared mappings */
struct semaphore i_shared_sem; /* protect both above lists */
atomic_t truncate_count; /* Cover race condition with truncate */
unsigned long dirtied_when; /* jiffies of first page dirtying */
diff -puN include/linux/mm.h~prio_tree_core include/linux/mm.h
--- mmlinux-2.6/include/linux/mm.h~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/include/linux/mm.h 2004-03-21 16:25:34.000000000 -0500
@@ -11,6 +11,7 @@
#include <linux/list.h>
#include <linux/mmzone.h>
#include <linux/rbtree.h>
+#include <linux/prio_tree.h>
#include <linux/fs.h>

#ifndef CONFIG_DISCONTIGMEM /* Don't use mapnrs, do it properly */
@@ -67,8 +68,29 @@ struct vm_area_struct {
* one of the address_space->i_mmap{,shared} lists,
* for shm areas, the list of attaches, otherwise unused.
*/
- struct list_head shared;
+ union {
+ struct {
+ struct list_head list;
+ void *parent;
+ } vm_set;
+
+ struct prio_tree_node prio_tree_node;
+
+ struct {
+ void *first;
+ void *second;
+ void *parent;
+ } both;
+ } shared;

+ /*
+ * shared.vm_set : list of vmas that map exactly the same set of pages
+ * vm_set_head : head of the vm_set list
+ *
+ * TODO: try to shove the following field into vm_private_data ??
+ */
+ struct vm_area_struct *vm_set_head;
+
/* Function pointers to deal with this struct. */
struct vm_operations_struct * vm_ops;

@@ -129,6 +151,150 @@ struct vm_area_struct {
#define VM_RandomReadHint(v) ((v)->vm_flags & VM_RAND_READ)

/*
+ * The following macros are used for implementing prio_tree for i_mmap{_shared}
+ */
+
+#define RADIX_INDEX(vma) ((vma)->vm_pgoff)
+#define VMA_SIZE(vma) (((vma)->vm_end - (vma)->vm_start) >> PAGE_SHIFT)
+/* avoid overflow */
+#define HEAP_INDEX(vma) ((vma)->vm_pgoff + (VMA_SIZE(vma) - 1))
+
+#define GET_INDEX_VMA(vma, radix, heap) \
+do { \
+ radix = RADIX_INDEX(vma); \
+ heap = HEAP_INDEX(vma); \
+} while (0)
+
+#define GET_INDEX(node, radix, heap) \
+do { \
+ struct vm_area_struct *__tmp = \
+ prio_tree_entry(node, struct vm_area_struct, shared.prio_tree_node);\
+ GET_INDEX_VMA(__tmp, radix, heap); \
+} while (0)
+
+#define INIT_VMA_SHARED_LIST(vma) \
+do { \
+ INIT_LIST_HEAD(&(vma)->shared.vm_set.list); \
+ (vma)->shared.vm_set.parent = NULL; \
+ (vma)->vm_set_head = NULL; \
+} while (0)
+
+#define INIT_VMA_SHARED(vma) \
+do { \
+ (vma)->shared.both.first = NULL; \
+ (vma)->shared.both.second = NULL; \
+ (vma)->shared.both.parent = NULL; \
+ (vma)->vm_set_head = NULL; \
+} while (0)
+
+extern void __vma_prio_tree_insert(struct prio_tree_root *,
+ struct vm_area_struct *);
+
+extern void __vma_prio_tree_remove(struct prio_tree_root *,
+ struct vm_area_struct *);
+
+static inline int vma_shared_empty(struct vm_area_struct *vma)
+{
+ return vma->shared.both.first == NULL;
+}
+
+/*
+ * Helps to add a new vma that maps the same (identical) set of pages as the
+ * old vma to an i_mmap tree.
+ */
+static inline void __vma_prio_tree_add(struct vm_area_struct *vma,
+ struct vm_area_struct *old)
+{
+ INIT_VMA_SHARED_LIST(vma);
+
+ /* Leave these BUG_ONs till prio_tree patch stabilizes */
+ BUG_ON(RADIX_INDEX(vma) != RADIX_INDEX(old));
+ BUG_ON(HEAP_INDEX(vma) != HEAP_INDEX(old));
+
+ if (old->shared.both.parent) {
+ if (old->vm_set_head) {
+ list_add_tail(&vma->shared.vm_set.list,
+ &old->vm_set_head->shared.vm_set.list);
+ return;
+ }
+ else {
+ old->vm_set_head = vma;
+ vma->vm_set_head = old;
+ }
+ }
+ else
+ list_add(&vma->shared.vm_set.list, &old->shared.vm_set.list);
+}
+
+/*
+ * We cannot modify vm_start, vm_end, vm_pgoff fields of a vma that has been
+ * already present in an i_mmap{_shared} tree without modifying the tree. The
+ * following helper function should be used when such modifications are
+ * necessary. We should hold the mapping's i_shared_sem.
+ *
+ * This function can be (micro)optimized for some special cases (maybe later).
+ */
+static inline void __vma_modify(struct prio_tree_root *root,
+ struct vm_area_struct *vma, unsigned long start, unsigned long end,
+ unsigned long pgoff)
+{
+ if (root)
+ __vma_prio_tree_remove(root, vma);
+ vma->vm_start = start;
+ vma->vm_end = end;
+ vma->vm_pgoff = pgoff;
+ if (root)
+ __vma_prio_tree_insert(root, vma);
+}
+
+/*
+ * Helper functions to enumerate vmas that map a given file page or a set of
+ * contiguous file pages. The functions return vmas that at least map a single
+ * page in the given range of contiguous file pages.
+ */
+static inline struct vm_area_struct *__vma_prio_tree_first(
+ struct prio_tree_root *root, struct prio_tree_iter *iter,
+ unsigned long begin, unsigned long end)
+{
+ struct prio_tree_node *ptr;
+
+ ptr = prio_tree_first(root, iter, begin, end);
+
+ if (ptr)
+ return prio_tree_entry(ptr, struct vm_area_struct,
+ shared.prio_tree_node);
+ else
+ return NULL;
+}
+
+static inline struct vm_area_struct *__vma_prio_tree_next(
+ struct vm_area_struct *vma, struct prio_tree_root *root,
+ struct prio_tree_iter *iter, unsigned long begin, unsigned long end)
+{
+ struct prio_tree_node *ptr;
+ struct vm_area_struct *next;
+
+ if (vma->shared.both.parent) {
+ if (vma->vm_set_head)
+ return vma->vm_set_head;
+ }
+ else {
+ next = list_entry(vma->shared.vm_set.list.next,
+ struct vm_area_struct, shared.vm_set.list);
+ if (!(next->vm_set_head))
+ return next;
+ }
+
+ ptr = prio_tree_next(root, iter, begin, end);
+
+ if (ptr)
+ return prio_tree_entry(ptr, struct vm_area_struct,
+ shared.prio_tree_node);
+ else
+ return NULL;
+}
+
+/*
* mapping from the currently active vm_flags protection bits (the
* low four bits) to a page protection mask..
*/
diff -puN /dev/null include/linux/prio_tree.h
--- /dev/null 2003-01-30 05:24:37.000000000 -0500
+++ mmlinux-2.6-jaya/include/linux/prio_tree.h 2004-03-21 16:25:01.000000000 -0500
@@ -0,0 +1,78 @@
+#ifndef _LINUX_PRIO_TREE_H
+#define _LINUX_PRIO_TREE_H
+
+struct prio_tree_node {
+ struct prio_tree_node *left;
+ struct prio_tree_node *right;
+ struct prio_tree_node *parent;
+};
+
+struct prio_tree_root {
+ struct prio_tree_node *prio_tree_node;
+ unsigned int index_bits;
+};
+
+struct prio_tree_iter {
+ struct prio_tree_node *cur;
+ unsigned long mask;
+ unsigned long value;
+ int size_level;
+};
+
+#define PRIO_TREE_ROOT (struct prio_tree_root) {NULL, 1}
+
+#define PRIO_TREE_ROOT_INIT {NULL, 1}
+
+#define INIT_PRIO_TREE_ROOT(ptr) \
+do { \
+ (ptr)->prio_tree_node = NULL; \
+ (ptr)->index_bits = 1; \
+} while (0)
+
+#define PRIO_TREE_NODE_INIT(name) {&(name), &(name), &(name)}
+
+#define PRIO_TREE_NODE(name) \
+ struct prio_tree_node name = PRIO_TREE_NODE_INIT(name)
+
+#define INIT_PRIO_TREE_NODE(ptr) \
+do { \
+ (ptr)->left = (ptr)->right = (ptr)->parent = (ptr); \
+} while (0)
+
+#define prio_tree_entry(ptr, type, member) \
+ ((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))
+
+#define PRIO_TREE_ITER (struct prio_tree_iter) {NULL, 0UL, 0UL, 0}
+
+static inline int prio_tree_empty(const struct prio_tree_root *root)
+{
+ return root->prio_tree_node == NULL;
+}
+
+static inline int prio_tree_root(const struct prio_tree_node *node)
+{
+ return node->parent == node;
+}
+
+static inline int prio_tree_left_empty(const struct prio_tree_node *node)
+{
+ return node->left == node;
+}
+
+static inline int prio_tree_right_empty(const struct prio_tree_node *node)
+{
+ return node->right == node;
+}
+
+extern struct prio_tree_node *prio_tree_insert(struct prio_tree_root *,
+ struct prio_tree_node *);
+
+extern void prio_tree_remove(struct prio_tree_root *, struct prio_tree_node *);
+
+extern struct prio_tree_node *prio_tree_first(struct prio_tree_root *,
+ struct prio_tree_iter *, unsigned long, unsigned long);
+
+extern struct prio_tree_node *prio_tree_next(struct prio_tree_root *,
+ struct prio_tree_iter *, unsigned long, unsigned long);
+
+#endif
diff -puN init/main.c~prio_tree_core init/main.c
--- mmlinux-2.6/init/main.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/init/main.c 2004-03-21 16:25:01.000000000 -0500
@@ -86,6 +86,7 @@ extern void pidhash_init(void);
extern void pidmap_init(void);
extern void pte_chain_init(void);
extern void radix_tree_init(void);
+extern void prio_tree_init(void);
extern void free_initmem(void);
extern void populate_rootfs(void);
extern void driver_init(void);
@@ -460,6 +461,7 @@ asmlinkage void __init start_kernel(void
calibrate_delay();
pidmap_init();
pgtable_cache_init();
+ prio_tree_init();
pte_chain_init();
#ifdef CONFIG_X86
if (efi_enabled)
diff -puN kernel/fork.c~prio_tree_core kernel/fork.c
--- mmlinux-2.6/kernel/fork.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/kernel/fork.c 2004-03-21 16:25:01.000000000 -0500
@@ -313,7 +313,7 @@ static inline int dup_mmap(struct mm_str
tmp->vm_mm = mm;
tmp->vm_next = NULL;
file = tmp->vm_file;
- INIT_LIST_HEAD(&tmp->shared);
+ INIT_VMA_SHARED(tmp);
if (file) {
struct inode *inode = file->f_dentry->d_inode;
get_file(file);
@@ -322,7 +322,7 @@ static inline int dup_mmap(struct mm_str

/* insert tmp into the share list, just after mpnt */
down(&file->f_mapping->i_shared_sem);
- list_add_tail(&tmp->shared, &mpnt->shared);
+ __vma_prio_tree_add(tmp, mpnt);
up(&file->f_mapping->i_shared_sem);
}

diff -puN mm/filemap.c~prio_tree_core mm/filemap.c
--- mmlinux-2.6/mm/filemap.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/filemap.c 2004-03-21 16:25:01.000000000 -0500
@@ -630,7 +630,7 @@ page_ok:
* virtual addresses, take care about potential aliasing
* before reading the page on the kernel side.
*/
- if (!list_empty(&mapping->i_mmap_shared))
+ if (!prio_tree_empty(&mapping->i_mmap_shared))
flush_dcache_page(page);

/*
diff -puN mm/Makefile~prio_tree_core mm/Makefile
--- mmlinux-2.6/mm/Makefile~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/Makefile 2004-03-21 16:25:01.000000000 -0500
@@ -9,6 +9,7 @@ mmu-$(CONFIG_MMU) := fremap.o highmem.o

obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
page_alloc.o page-writeback.o pdflush.o readahead.o \
- slab.o swap.o truncate.o vmscan.o $(mmu-y)
+ slab.o swap.o truncate.o vmscan.o prio_tree.o \
+ $(mmu-y)

obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o
diff -puN mm/memory.c~prio_tree_core mm/memory.c
--- mmlinux-2.6/mm/memory.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/memory.c 2004-03-21 16:25:34.000000000 -0500
@@ -1097,11 +1097,11 @@ no_pte_chain:
* An hlen of zero blows away the entire portion file after hba.
*/
static void
-invalidate_mmap_range_list(struct list_head *head,
+invalidate_mmap_range_list(struct prio_tree_root *root,
unsigned long const hba,
unsigned long const hlen)
{
- struct list_head *curr;
+ struct prio_tree_iter iter;
unsigned long hea; /* last page of hole. */
unsigned long vba;
unsigned long vea; /* last page of corresponding uva hole. */
@@ -1112,17 +1112,16 @@ invalidate_mmap_range_list(struct list_h
hea = hba + hlen - 1; /* avoid overflow. */
if (hea < hba)
hea = ULONG_MAX;
- list_for_each(curr, head) {
- vp = list_entry(curr, struct vm_area_struct, shared);
+ vp = __vma_prio_tree_first(root, &iter, hba, hea);
+	while (vp) {
vba = vp->vm_pgoff;
vea = vba + ((vp->vm_end - vp->vm_start) >> PAGE_SHIFT) - 1;
- if (hea < vba || vea < hba)
- continue; /* Mapping disjoint from hole. */
zba = (hba <= vba) ? vba : hba;
zea = (vea <= hea) ? vea : hea;
zap_page_range(vp,
((zba - vba) << PAGE_SHIFT) + vp->vm_start,
(zea - zba + 1) << PAGE_SHIFT);
+ vp = __vma_prio_tree_next(vp, root, &iter, hba, hea);
}
}

@@ -1157,9 +1156,9 @@ void invalidate_mmap_range(struct addres
down(&mapping->i_shared_sem);
/* Protect against page fault */
atomic_inc(&mapping->truncate_count);
- if (unlikely(!list_empty(&mapping->i_mmap)))
+ if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
invalidate_mmap_range_list(&mapping->i_mmap, hba, hlen);
- if (unlikely(!list_empty(&mapping->i_mmap_shared)))
+ if (unlikely(!prio_tree_empty(&mapping->i_mmap_shared)))
invalidate_mmap_range_list(&mapping->i_mmap_shared, hba, hlen);
up(&mapping->i_shared_sem);
}
diff -puN mm/mmap.c~prio_tree_core mm/mmap.c
--- mmlinux-2.6/mm/mmap.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/mmap.c 2004-03-21 16:25:01.000000000 -0500
@@ -64,12 +64,16 @@ EXPORT_SYMBOL(vm_committed_space);
* Requires inode->i_mapping->i_shared_sem
*/
static inline void
-__remove_shared_vm_struct(struct vm_area_struct *vma, struct inode *inode)
+__remove_shared_vm_struct(struct vm_area_struct *vma, struct inode *inode,
+ struct address_space *mapping)
{
if (inode) {
if (vma->vm_flags & VM_DENYWRITE)
atomic_inc(&inode->i_writecount);
- list_del_init(&vma->shared);
+ if (vma->vm_flags & VM_SHARED)
+ __vma_prio_tree_remove(&mapping->i_mmap_shared, vma);
+ else
+ __vma_prio_tree_remove(&mapping->i_mmap, vma);
}
}

@@ -83,7 +87,8 @@ static void remove_shared_vm_struct(stru
if (file) {
struct address_space *mapping = file->f_mapping;
down(&mapping->i_shared_sem);
- __remove_shared_vm_struct(vma, file->f_dentry->d_inode);
+ __remove_shared_vm_struct(vma, file->f_dentry->d_inode,
+ mapping);
up(&mapping->i_shared_sem);
}
}
@@ -257,10 +262,10 @@ static inline void __vma_link_file(struc
if (vma->vm_flags & VM_DENYWRITE)
atomic_dec(&file->f_dentry->d_inode->i_writecount);

- if (vma->vm_flags & VM_SHARED)
- list_add_tail(&vma->shared, &mapping->i_mmap_shared);
- else
- list_add_tail(&vma->shared, &mapping->i_mmap);
+ if (vma->vm_flags & VM_SHARED)
+ __vma_prio_tree_insert(&mapping->i_mmap_shared, vma);
+ else
+ __vma_prio_tree_insert(&mapping->i_mmap, vma);
}
}

@@ -390,7 +395,9 @@ static int vma_merge(struct mm_struct *m
{
spinlock_t *lock = &mm->page_table_lock;
struct inode *inode = file ? file->f_dentry->d_inode : NULL;
+ struct address_space *mapping = file ? file->f_mapping : NULL;
struct semaphore *i_shared_sem;
+ struct prio_tree_root *root = NULL;

/*
* We later require that vma->vm_flags == vm_flags, so this tests
@@ -401,6 +408,13 @@ static int vma_merge(struct mm_struct *m

i_shared_sem = file ? &file->f_mapping->i_shared_sem : NULL;

+ if (mapping) {
+ if (vm_flags & VM_SHARED)
+ root = &mapping->i_mmap_shared;
+ else
+ root = &mapping->i_mmap;
+ }
+
if (!prev) {
prev = rb_entry(rb_parent, struct vm_area_struct, vm_rb);
goto merge_next;
@@ -421,18 +435,18 @@ static int vma_merge(struct mm_struct *m
need_up = 1;
}
spin_lock(lock);
- prev->vm_end = end;

/*
* OK, it did. Can we now merge in the successor as well?
*/
next = prev->vm_next;
- if (next && prev->vm_end == next->vm_start &&
+ if (next && end == next->vm_start &&
can_vma_merge_before(next, vm_flags, file,
pgoff, (end - addr) >> PAGE_SHIFT)) {
- prev->vm_end = next->vm_end;
+ __vma_modify(root, prev, prev->vm_start,
+ next->vm_end, prev->vm_pgoff);
__vma_unlink(mm, next, prev);
- __remove_shared_vm_struct(next, inode);
+ __remove_shared_vm_struct(next, inode, mapping);
spin_unlock(lock);
if (need_up)
up(i_shared_sem);
@@ -443,6 +457,7 @@ static int vma_merge(struct mm_struct *m
kmem_cache_free(vm_area_cachep, next);
return 1;
}
+ __vma_modify(root, prev, prev->vm_start, end, prev->vm_pgoff);
spin_unlock(lock);
if (need_up)
up(i_shared_sem);
@@ -462,8 +477,8 @@ static int vma_merge(struct mm_struct *m
if (file)
down(i_shared_sem);
spin_lock(lock);
- prev->vm_start = addr;
- prev->vm_pgoff -= (end - addr) >> PAGE_SHIFT;
+ __vma_modify(root, prev, addr, prev->vm_end,
+ prev->vm_pgoff - ((end - addr) >> PAGE_SHIFT));
spin_unlock(lock);
if (file)
up(i_shared_sem);
@@ -649,7 +664,7 @@ munmap_back:
vma->vm_file = NULL;
vma->vm_private_data = NULL;
vma->vm_next = NULL;
- INIT_LIST_HEAD(&vma->shared);
+ INIT_VMA_SHARED(vma);

if (file) {
error = -EINVAL;
@@ -1196,6 +1211,7 @@ int split_vma(struct mm_struct * mm, str
{
struct vm_area_struct *new;
struct address_space *mapping = NULL;
+ struct prio_tree_root *root = NULL;

if (mm->map_count >= MAX_MAP_COUNT)
return -ENOMEM;
@@ -1207,7 +1223,7 @@ int split_vma(struct mm_struct * mm, str
/* most fields are the same, copy all, and then fixup */
*new = *vma;

- INIT_LIST_HEAD(&new->shared);
+ INIT_VMA_SHARED(new);

if (new_below)
new->vm_end = addr;
@@ -1222,18 +1238,24 @@ int split_vma(struct mm_struct * mm, str
if (new->vm_ops && new->vm_ops->open)
new->vm_ops->open(new);

- if (vma->vm_file)
+ if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;

+ if (vma->vm_flags & VM_SHARED)
+ root = &mapping->i_mmap_shared;
+ else
+ root = &mapping->i_mmap;
+ }
+
if (mapping)
down(&mapping->i_shared_sem);
spin_lock(&mm->page_table_lock);

- if (new_below) {
- vma->vm_start = addr;
- vma->vm_pgoff += ((addr - new->vm_start) >> PAGE_SHIFT);
- } else
- vma->vm_end = addr;
+ if (new_below)
+ __vma_modify(root, vma, addr, vma->vm_end,
+ vma->vm_pgoff + ((addr - new->vm_start) >> PAGE_SHIFT));
+ else
+ __vma_modify(root, vma, vma->vm_start, addr, vma->vm_pgoff);

__insert_vm_struct(mm, new);

@@ -1406,7 +1428,7 @@ unsigned long do_brk(unsigned long addr,
vma->vm_pgoff = 0;
vma->vm_file = NULL;
vma->vm_private_data = NULL;
- INIT_LIST_HEAD(&vma->shared);
+ INIT_VMA_SHARED(vma);

vma_link(mm, vma, prev, rb_link, rb_parent);

diff -puN mm/mremap.c~prio_tree_core mm/mremap.c
--- mmlinux-2.6/mm/mremap.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/mremap.c 2004-03-21 16:25:01.000000000 -0500
@@ -251,7 +251,7 @@ static unsigned long move_vma(struct vm_

if (allocated_vma) {
*new_vma = *vma;
- INIT_LIST_HEAD(&new_vma->shared);
+ INIT_VMA_SHARED(new_vma);
new_vma->vm_start = new_addr;
new_vma->vm_end = new_addr+new_len;
new_vma->vm_pgoff += (addr-vma->vm_start) >> PAGE_SHIFT;
@@ -309,6 +309,8 @@ unsigned long do_mremap(unsigned long ad
unsigned long flags, unsigned long new_addr)
{
struct vm_area_struct *vma;
+ struct address_space *mapping = NULL;
+ struct prio_tree_root *root = NULL;
unsigned long ret = -EINVAL;
unsigned long charged = 0;

@@ -416,9 +418,24 @@ unsigned long do_mremap(unsigned long ad
/* can we just expand the current mapping? */
if (max_addr - addr >= new_len) {
int pages = (new_len - old_len) >> PAGE_SHIFT;
+
+ if (vma->vm_file) {
+ mapping = vma->vm_file->f_mapping;
+ if (vma->vm_flags & VM_SHARED)
+ root = &mapping->i_mmap_shared;
+ else
+ root = &mapping->i_mmap;
+ down(&mapping->i_shared_sem);
+ }
+
spin_lock(&vma->vm_mm->page_table_lock);
- vma->vm_end = addr + new_len;
+ __vma_modify(root, vma, vma->vm_start,
+ addr + new_len, vma->vm_pgoff);
spin_unlock(&vma->vm_mm->page_table_lock);
+
+	if (mapping)
+ up(&mapping->i_shared_sem);
+
current->mm->total_vm += pages;
if (vma->vm_flags & VM_LOCKED) {
current->mm->locked_vm += pages;
diff -puN /dev/null mm/prio_tree.c
--- /dev/null 2003-01-30 05:24:37.000000000 -0500
+++ mmlinux-2.6-jaya/mm/prio_tree.c 2004-03-21 16:25:01.000000000 -0500
@@ -0,0 +1,574 @@
+/*
+ * mm/prio_tree.c - priority search tree for mapping->i_mmap{,_shared}
+ *
+ * Copyright (C) 2004, Rajesh Venkatasubramanian <[email protected]>
+ *
+ * Based on the radix priority search tree proposed by Edward M. McCreight
+ * in SIAM Journal on Computing, vol. 14, no. 2, pages 257-276, May 1985
+ *
+ * 02Feb2004 Initial version
+ */
+
+#include <linux/init.h>
+#include <linux/mm.h>
+#include <linux/prio_tree.h>
+
+/*
+ * A clever mix of heap and radix trees forms a radix priority search tree (PST)
+ * which is useful for storing intervals, e.g., we can consider a vma as a closed
+ * interval of file pages [offset_begin, offset_end], and store all vmas that
+ * map a file in a PST. Then, using the PST, we can answer a stabbing query,
+ * i.e., selecting a set of stored intervals (vmas) that overlap with (map) a
+ * given input interval X (a set of consecutive file pages), in "O(log n + m)"
+ * time where 'log n' is the height of the PST, and 'm' is the number of stored
+ * intervals (vmas) that overlap (map) with the input interval X (the set of
+ * consecutive file pages).
+ *
+ * In our implementation, we store closed intervals of the form [radix_index,
+ * heap_index]. We assume that radix_index <= heap_index always holds. McCreight's
+ * PST is designed for storing intervals with unique radix indices, i.e., each
+ * interval has a different radix_index. However, this limitation can be easily
+ * overcome by using the size, i.e., heap_index - radix_index, as part of the
+ * index, so we index the tree using [(radix_index,size), heap_index].
+ *
+ * When the above-mentioned indexing scheme is used, theoretically, in a 32 bit
+ * machine, the maximum height of a PST can be 64. We can use a balanced version
+ * of the priority search tree to optimize the tree height, but the balanced
+ * tree proposed by McCreight is too complex and memory-hungry for our purpose.
+ */
+
+static unsigned long index_bits_to_maxindex[BITS_PER_LONG];
+
+/*
+ * Maximum heap_index that can be stored in a PST with index_bits bits
+ */
+static inline unsigned long prio_tree_maxindex(unsigned int bits)
+{
+ return index_bits_to_maxindex[bits - 1];
+}
+
+/*
+ * Extend a priority search tree so that it can store a node with heap_index
+ * max_heap_index. In the worst case, this algorithm takes O((log n)^2).
+ * However, this function is used rarely and the common case performance is
+ * not bad.
+ */
+static struct prio_tree_node *prio_tree_expand(struct prio_tree_root *root,
+ struct prio_tree_node *node, unsigned long max_heap_index)
+{
+ struct prio_tree_node *first = NULL, *prev, *last = NULL;
+
+ if (max_heap_index > prio_tree_maxindex(root->index_bits))
+ root->index_bits++;
+
+ while (max_heap_index > prio_tree_maxindex(root->index_bits)) {
+ root->index_bits++;
+
+ if (prio_tree_empty(root))
+ continue;
+
+ if (first == NULL) {
+ first = root->prio_tree_node;
+ prio_tree_remove(root, root->prio_tree_node);
+ INIT_PRIO_TREE_NODE(first);
+ last = first;
+ }
+ else {
+ prev = last;
+ last = root->prio_tree_node;
+ prio_tree_remove(root, root->prio_tree_node);
+ INIT_PRIO_TREE_NODE(last);
+ prev->left = last;
+ last->parent = prev;
+ }
+ }
+
+ INIT_PRIO_TREE_NODE(node);
+
+ if (first) {
+ node->left = first;
+ first->parent = node;
+ }
+ else
+ last = node;
+
+ if (!prio_tree_empty(root)) {
+ last->left = root->prio_tree_node;
+ last->left->parent = last;
+ }
+
+ root->prio_tree_node = node;
+ return node;
+}
+
+/*
+ * Replace a prio_tree_node with a new node and return the old node
+ */
+static inline struct prio_tree_node *prio_tree_replace(
+ struct prio_tree_root *root, struct prio_tree_node *old,
+ struct prio_tree_node *node)
+{
+ INIT_PRIO_TREE_NODE(node);
+
+ if (prio_tree_root(old)) {
+ BUG_ON(root->prio_tree_node != old);
+ /*
+ * We can reduce root->index_bits here. However, it is complex
+ * and does not help much to improve performance (IMO).
+ */
+ node->parent = node;
+ root->prio_tree_node = node;
+ }
+ else {
+ node->parent = old->parent;
+ if (old->parent->left == old)
+ old->parent->left = node;
+ else {
+ BUG_ON(old->parent->right != old);
+ old->parent->right = node;
+ }
+ }
+
+ if (!prio_tree_left_empty(old)) {
+ node->left = old->left;
+ old->left->parent = node;
+ }
+
+ if (!prio_tree_right_empty(old)) {
+ node->right = old->right;
+ old->right->parent = node;
+ }
+
+ return old;
+}
+
+#undef swap
+#define swap(x,y,z) do {z = x; x = y; y = z; } while (0)
+
+/*
+ * Insert a prio_tree_node @node into a radix priority search tree @root. The
+ * algorithm typically takes O(log n) time where 'log n' is the number of bits
+ * required to represent the maximum heap_index. In the worst case, the algo
+ * can take O((log n)^2) - check prio_tree_expand.
+ *
+ * If a prior node with same radix_index and heap_index is already found in
+ * the tree, then returns the address of the prior node. Otherwise, inserts
+ * @node into the tree and returns @node.
+ */
+
+struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root,
+ struct prio_tree_node *node)
+{
+ struct prio_tree_node *cur, *res = node;
+ unsigned long radix_index, heap_index;
+ unsigned long r_index, h_index, index, mask;
+ int size_flag = 0;
+
+ GET_INDEX(node, radix_index, heap_index);
+
+ if (prio_tree_empty(root) ||
+ heap_index > prio_tree_maxindex(root->index_bits))
+ return prio_tree_expand(root, node, heap_index);
+
+ cur = root->prio_tree_node;
+ mask = 1UL << (root->index_bits - 1);
+
+ while (mask) {
+ GET_INDEX(cur, r_index, h_index);
+
+ if (r_index == radix_index && h_index == heap_index)
+ return cur;
+
+ if (h_index < heap_index || (h_index == heap_index &&
+ r_index > radix_index))
+ {
+ struct prio_tree_node *tmp = node;
+ node = prio_tree_replace(root, cur, node);
+ cur = tmp;
+ swap(r_index, radix_index, index);
+ swap(h_index, heap_index, index);
+ }
+
+ if (size_flag)
+ index = heap_index - radix_index;
+ else
+ index = radix_index;
+
+ if (index & mask) {
+ if (prio_tree_right_empty(cur)) {
+ INIT_PRIO_TREE_NODE(node);
+ cur->right = node;
+ node->parent = cur;
+ return res;
+ }
+ else
+ cur = cur->right;
+ }
+ else {
+ if (prio_tree_left_empty(cur)) {
+ INIT_PRIO_TREE_NODE(node);
+ cur->left = node;
+ node->parent = cur;
+ return res;
+ }
+ else
+ cur = cur->left;
+ }
+
+ mask >>= 1;
+
+ if (!mask) {
+ mask = 1UL << (root->index_bits - 1);
+ size_flag = 1;
+ }
+ }
+ /* Should not reach here */
+ BUG();
+ return NULL;
+}
+
+/*
+ * Remove a prio_tree_node @node from a radix priority search tree @root. The
+ * algorithm takes O(log n) time where 'log n' is the number of bits required
+ * to represent the maximum heap_index.
+ */
+
+void prio_tree_remove(struct prio_tree_root *root, struct prio_tree_node *node)
+{
+ struct prio_tree_node *cur;
+ unsigned long r_index, h_index_right, h_index_left;
+
+ cur = node;
+
+ while (!prio_tree_left_empty(cur) || !prio_tree_right_empty(cur)) {
+ if (!prio_tree_left_empty(cur))
+ GET_INDEX(cur->left, r_index, h_index_left);
+ else {
+ cur = cur->right;
+ continue;
+ }
+
+ if (!prio_tree_right_empty(cur))
+ GET_INDEX(cur->right, r_index, h_index_right);
+ else {
+ cur = cur->left;
+ continue;
+ }
+
+ /* both h_index_left and h_index_right cannot be 0 */
+ if (h_index_left >= h_index_right)
+ cur = cur->left;
+ else
+ cur = cur->right;
+ }
+
+ if (prio_tree_root(cur)) {
+ BUG_ON(root->prio_tree_node != cur);
+ *root = PRIO_TREE_ROOT;
+ return;
+ }
+
+ if (cur->parent->right == cur)
+ cur->parent->right = cur->parent;
+ else {
+ BUG_ON(cur->parent->left != cur);
+ cur->parent->left = cur->parent;
+ }
+
+ while (cur != node)
+ cur = prio_tree_replace(root, cur->parent, cur);
+
+ return;
+}
+
+/*
+ * Following functions help to enumerate all prio_tree_nodes in the tree that
+ * overlap with the input interval X [radix_index, heap_index]. The enumeration
+ * takes O(log n + m) time where 'log n' is the height of the tree (which is
+ * proportional to # of bits required to represent the maximum heap_index) and
+ * 'm' is the number of prio_tree_nodes that overlap the interval X.
+ */
+
+static inline struct prio_tree_node *__prio_tree_left(
+ struct prio_tree_root *root, struct prio_tree_iter *iter,
+ unsigned long radix_index, unsigned long heap_index,
+ unsigned long *r_index, unsigned long *h_index)
+{
+ if (prio_tree_left_empty(iter->cur))
+ return NULL;
+
+ GET_INDEX(iter->cur->left, *r_index, *h_index);
+
+ if (radix_index <= *h_index) {
+ iter->cur = iter->cur->left;
+ iter->mask >>= 1;
+ if (iter->mask) {
+ if (iter->size_level)
+ iter->size_level++;
+ }
+ else {
+ iter->size_level = 1;
+ iter->mask = 1UL << (root->index_bits - 1);
+ }
+ return iter->cur;
+ }
+
+ return NULL;
+}
+
+
+static inline struct prio_tree_node *__prio_tree_right(
+ struct prio_tree_root *root, struct prio_tree_iter *iter,
+ unsigned long radix_index, unsigned long heap_index,
+ unsigned long *r_index, unsigned long *h_index)
+{
+ unsigned long value;
+
+ if (prio_tree_right_empty(iter->cur))
+ return NULL;
+
+ if (iter->size_level)
+ value = iter->value;
+ else
+ value = iter->value | iter->mask;
+
+ if (heap_index < value)
+ return NULL;
+
+ GET_INDEX(iter->cur->right, *r_index, *h_index);
+
+ if (radix_index <= *h_index) {
+ iter->cur = iter->cur->right;
+ iter->mask >>= 1;
+ iter->value = value;
+ if (iter->mask) {
+ if (iter->size_level)
+ iter->size_level++;
+ }
+ else {
+ iter->size_level = 1;
+ iter->mask = 1UL << (root->index_bits - 1);
+ }
+ return iter->cur;
+ }
+
+ return NULL;
+}
+
+static inline struct prio_tree_node *__prio_tree_parent(
+ struct prio_tree_iter *iter)
+{
+ iter->cur = iter->cur->parent;
+ iter->mask <<= 1;
+ if (iter->size_level) {
+ if (iter->size_level == 1)
+ iter->mask = 1UL;
+ iter->size_level--;
+ }
+ else if (iter->value & iter->mask)
+ iter->value ^= iter->mask;
+ return iter->cur;
+}
+
+static inline int overlap(unsigned long radix_index, unsigned long heap_index,
+ unsigned long r_index, unsigned long h_index)
+{
+ if (heap_index < r_index || radix_index > h_index)
+ return 0;
+
+ return 1;
+}
+
+/*
+ * prio_tree_first:
+ *
+ * Get the first prio_tree_node that overlaps with the interval [radix_index,
+ * heap_index]. Note that radix_index <= heap_index always holds. We do a pre-order
+ * traversal of the tree.
+ */
+struct prio_tree_node *prio_tree_first(struct prio_tree_root *root,
+ struct prio_tree_iter *iter, unsigned long radix_index,
+ unsigned long heap_index)
+{
+ unsigned long r_index, h_index;
+
+ *iter = PRIO_TREE_ITER;
+
+ if (prio_tree_empty(root))
+ return NULL;
+
+ GET_INDEX(root->prio_tree_node, r_index, h_index);
+
+ if (radix_index > h_index)
+ return NULL;
+
+ iter->mask = 1UL << (root->index_bits - 1);
+ iter->cur = root->prio_tree_node;
+
+ while (1) {
+ if (overlap(radix_index, heap_index, r_index, h_index))
+ return iter->cur;
+
+ if (__prio_tree_left(root, iter, radix_index, heap_index,
+ &r_index, &h_index))
+ continue;
+
+ if (__prio_tree_right(root, iter, radix_index, heap_index,
+ &r_index, &h_index))
+ continue;
+
+ break;
+ }
+ return NULL;
+}
+
+/* Get the next prio_tree_node that overlaps with the input interval in iter */
+struct prio_tree_node *prio_tree_next(struct prio_tree_root *root,
+ struct prio_tree_iter *iter, unsigned long radix_index,
+ unsigned long heap_index)
+{
+ unsigned long r_index, h_index;
+
+repeat:
+ while (__prio_tree_left(root, iter, radix_index, heap_index,
+ &r_index, &h_index))
+ if (overlap(radix_index, heap_index, r_index, h_index))
+ return iter->cur;
+
+ while (!__prio_tree_right(root, iter, radix_index, heap_index,
+ &r_index, &h_index)) {
+ while (!prio_tree_root(iter->cur) &&
+ iter->cur->parent->right == iter->cur)
+ __prio_tree_parent(iter);
+
+ if (prio_tree_root(iter->cur))
+ return NULL;
+
+ __prio_tree_parent(iter);
+ }
+
+ if (overlap(radix_index, heap_index, r_index, h_index))
+ return iter->cur;
+
+ goto repeat;
+}
+
+/*
+ * Radix priority search tree for address_space->i_mmap{_shared}
+ *
+ * For each vma that maps a unique set of file pages, i.e., a unique [radix_index,
+ * heap_index] value, we have a corresponding priority search tree node. If
+ * multiple vmas have identical [radix_index, heap_index] value, then one of
+ * them is used as a tree node and others are stored in a vm_set list. The tree
+ * node points to the first vma (head) of the list using vm_set_head.
+ *
+ * prio_tree_root
+ * |
+ * A vm_set_head
+ * / \ /
+ * L R -> H-I-J-K-M-N-O-P-Q-S
+ * ^ ^ <-- vm_set.list -->
+ * tree nodes
+ *
+ * We need some way to identify whether a vma is a tree node, head of a vm_set
+ * list, or just a member of a vm_set list. We cannot use vm_flags to store
+ * such information. The reason is, in the above figure, it is possible that
+ * vm_flags' of R and H are covered by the different mmap_sems. When R is
+ * removed under R->mmap_sem, H replaces R as a tree node. Since we do not hold
+ * H->mmap_sem, we cannot use H->vm_flags for marking that H is a tree node now.
+ * That's why some trick involving shared.both.parent is used for identifying
+ * tree nodes and list head nodes. We can possibly use the least significant
+ * bit of the vm_set_head field to mark tree and list head nodes. I was worried
+ * about the alignment of vm_area_struct in various architectures.
+ *
+ * vma radix priority search tree node rules:
+ *
+ * vma->shared.both.parent != NULL ==> a tree node
+ *
+ * vma->shared.both.parent == NULL
+ * vma->vm_set_head != NULL ==> list head of vmas that map same pages
+ * vma->vm_set_head == NULL ==> a list node
+ */
+
+void __vma_prio_tree_insert(struct prio_tree_root *root,
+ struct vm_area_struct *vma)
+{
+ struct prio_tree_node *ptr;
+ struct vm_area_struct *old;
+
+ ptr = prio_tree_insert(root, &vma->shared.prio_tree_node);
+
+ if (ptr == &vma->shared.prio_tree_node) {
+ vma->vm_set_head = NULL;
+ return;
+ }
+
+ old = prio_tree_entry(ptr, struct vm_area_struct,
+ shared.prio_tree_node);
+
+ __vma_prio_tree_add(vma, old);
+}
+
+void __vma_prio_tree_remove(struct prio_tree_root *root,
+ struct vm_area_struct *vma)
+{
+ struct vm_area_struct *node, *head, *new_head;
+
+ if (vma->shared.both.parent == NULL && vma->vm_set_head == NULL) {
+ list_del_init(&vma->shared.vm_set.list);
+ INIT_VMA_SHARED(vma);
+ return;
+ }
+
+ if (vma->vm_set_head) {
+ /* Leave this BUG_ON till prio_tree patch stabilizes */
+ BUG_ON(vma->vm_set_head->vm_set_head != vma);
+ if (vma->shared.both.parent) {
+ head = vma->vm_set_head;
+ if (!list_empty(&head->shared.vm_set.list)) {
+ new_head = list_entry(
+ head->shared.vm_set.list.next,
+ struct vm_area_struct,
+ shared.vm_set.list);
+ list_del_init(&head->shared.vm_set.list);
+ }
+ else
+ new_head = NULL;
+
+ prio_tree_replace(root, &vma->shared.prio_tree_node,
+ &head->shared.prio_tree_node);
+ head->vm_set_head = new_head;
+ if (new_head)
+ new_head->vm_set_head = head;
+
+ }
+ else {
+ node = vma->vm_set_head;
+ if (!list_empty(&vma->shared.vm_set.list)) {
+ new_head = list_entry(
+ vma->shared.vm_set.list.next,
+ struct vm_area_struct,
+ shared.vm_set.list);
+ list_del_init(&vma->shared.vm_set.list);
+ node->vm_set_head = new_head;
+ new_head->vm_set_head = node;
+ }
+ else
+ node->vm_set_head = NULL;
+ }
+ INIT_VMA_SHARED(vma);
+ return;
+ }
+
+ prio_tree_remove(root, &vma->shared.prio_tree_node);
+ INIT_VMA_SHARED(vma);
+}
+
+void __init prio_tree_init(void)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(index_bits_to_maxindex) - 1; i++)
+ index_bits_to_maxindex[i] = (1UL << (i + 1)) - 1;
+ index_bits_to_maxindex[ARRAY_SIZE(index_bits_to_maxindex) - 1] = ~0UL;
+}
diff -puN mm/shmem.c~prio_tree_core mm/shmem.c
--- mmlinux-2.6/mm/shmem.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/shmem.c 2004-03-21 16:25:01.000000000 -0500
@@ -1328,7 +1328,7 @@ static void do_shmem_file_read(struct fi
* virtual addresses, take care about potential aliasing
* before reading the page on the kernel side.
*/
- if (!list_empty(&mapping->i_mmap_shared))
+ if (!prio_tree_empty(&mapping->i_mmap_shared))
flush_dcache_page(page);
/*
* Mark the page accessed if we read the beginning.
diff -puN mm/swap_state.c~prio_tree_core mm/swap_state.c
--- mmlinux-2.6/mm/swap_state.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/swap_state.c 2004-03-21 16:25:01.000000000 -0500
@@ -32,8 +32,8 @@ struct address_space swapper_space = {
.locked_pages = LIST_HEAD_INIT(swapper_space.locked_pages),
.a_ops = &swap_aops,
.backing_dev_info = &swap_backing_dev_info,
- .i_mmap = LIST_HEAD_INIT(swapper_space.i_mmap),
- .i_mmap_shared = LIST_HEAD_INIT(swapper_space.i_mmap_shared),
+ .i_mmap = PRIO_TREE_ROOT_INIT,
+ .i_mmap_shared = PRIO_TREE_ROOT_INIT,
.i_shared_sem = __MUTEX_INITIALIZER(swapper_space.i_shared_sem),
.truncate_count = ATOMIC_INIT(0),
.private_lock = SPIN_LOCK_UNLOCKED,
diff -puN mm/vmscan.c~prio_tree_core mm/vmscan.c
--- mmlinux-2.6/mm/vmscan.c~prio_tree_core 2004-03-21 16:25:01.000000000 -0500
+++ mmlinux-2.6-jaya/mm/vmscan.c 2004-03-21 16:25:01.000000000 -0500
@@ -191,9 +191,9 @@ static inline int page_mapping_inuse(str
return 1;

/* File is mmap'd by somebody. */
- if (!list_empty(&mapping->i_mmap))
+ if (!prio_tree_empty(&mapping->i_mmap))
return 1;
- if (!list_empty(&mapping->i_mmap_shared))
+ if (!prio_tree_empty(&mapping->i_mmap_shared))
return 1;

return 0;

_


Subject: Re: [RFC][PATCH 2/3] Dave & Hugh's objrmap patch


This patch is the same as Hugh's anobjrmap patch 1, just rediffed on
top of 2.6.5-rc2+prio_tree_core. Please check Hugh's comments below.

anobjrmap 1/6 Dave McCracken's objrmap

Start with Dave McCracken's objrmap from Martin J. Bligh's tree, as did
Andrea. We've each diverged slightly: I've not bothered to include the
filemap.c locking comment, just to remove it again later; and I've not
included the page_table_lock avoidance from mmap.c - I don't see how it
can be safe to unlink a vma while try_to_unmap might be in find_vma
(but that may be fine in Andrea's, which ends up not using find_vma).
In rmap.c: I've not seen the problem which led Andrea to change try
failures from 1 to 0; fixed three comment typos, positioning of
page_test_and_clear_dirty calls, and use ptep_clear_flush.


fs/exec.c | 1
include/linux/mm.h | 1
include/linux/page-flags.h | 5
include/linux/swap.h | 2
mm/fremap.c | 21 ++
mm/memory.c | 8
mm/page_alloc.c | 2
mm/rmap.c | 390 +++++++++++++++++++++++++++++++++++++++++++--
mm/swapfile.c | 1
9 files changed, 417 insertions(+), 14 deletions(-)

diff -puN fs/exec.c~objrmap_dave_hugh_1 fs/exec.c
--- mmlinux-2.6/fs/exec.c~objrmap_dave_hugh_1 2004-03-21 16:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/fs/exec.c 2004-03-21 16:25:07.000000000 -0500
@@ -324,6 +324,7 @@ void put_dirty_page(struct task_struct *
}
lru_cache_add_active(page);
flush_dcache_page(page);
+ SetPageAnon(page);
set_pte(pte, pte_mkdirty(pte_mkwrite(mk_pte(page, prot))));
pte_chain = page_add_rmap(page, pte, pte_chain);
pte_unmap(pte);
diff -puN include/linux/mm.h~objrmap_dave_hugh_1 include/linux/mm.h
--- mmlinux-2.6/include/linux/mm.h~objrmap_dave_hugh_1 2004-03-21 16:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/include/linux/mm.h 2004-03-21 16:25:07.000000000 -0500
@@ -352,6 +352,7 @@ struct page {
struct pte_chain *chain;/* Reverse pte mapping pointer.
* protected by PG_chainlock */
pte_addr_t direct;
+ int mapcount;
} pte;
unsigned long private; /* mapping-private opaque data */

diff -puN include/linux/page-flags.h~objrmap_dave_hugh_1 include/linux/page-flags.h
--- mmlinux-2.6/include/linux/page-flags.h~objrmap_dave_hugh_1 2004-03-21 16:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/include/linux/page-flags.h 2004-03-21 16:25:07.000000000 -0500
@@ -75,6 +75,7 @@
#define PG_mappedtodisk 17 /* Has blocks allocated on-disk */
#define PG_reclaim 18 /* To be reclaimed asap */
#define PG_compound 19 /* Part of a compound page */
+#define PG_anon 20 /* Anonymous page */


/*
@@ -298,6 +299,10 @@ extern void get_full_page_state(struct p
#define SetPageCompound(page) set_bit(PG_compound, &(page)->flags)
#define ClearPageCompound(page) clear_bit(PG_compound, &(page)->flags)

+#define PageAnon(page) test_bit(PG_anon, &(page)->flags)
+#define SetPageAnon(page) set_bit(PG_anon, &(page)->flags)
+#define ClearPageAnon(page) clear_bit(PG_anon, &(page)->flags)
+
/*
* The PageSwapCache predicate doesn't use a PG_flag at this time,
* but it may again do so one day.
diff -puN include/linux/swap.h~objrmap_dave_hugh_1 include/linux/swap.h
--- mmlinux-2.6/include/linux/swap.h~objrmap_dave_hugh_1 2004-03-21 16:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/include/linux/swap.h 2004-03-21 16:25:07.000000000 -0500
@@ -185,6 +185,8 @@ struct pte_chain *FASTCALL(page_add_rmap
void FASTCALL(page_remove_rmap(struct page *, pte_t *));
int FASTCALL(try_to_unmap(struct page *));

+int page_convert_anon(struct page *);
+
/* linux/mm/shmem.c */
extern int shmem_unuse(swp_entry_t entry, struct page *page);
#else
diff -puN mm/fremap.c~objrmap_dave_hugh_1 mm/fremap.c
--- mmlinux-2.6/mm/fremap.c~objrmap_dave_hugh_1 2004-03-21 16:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/fremap.c 2004-03-21 16:25:07.000000000 -0500
@@ -61,10 +61,26 @@ int install_page(struct mm_struct *mm, s
pmd_t *pmd;
pte_t pte_val;
struct pte_chain *pte_chain;
+ unsigned long pgidx;

pte_chain = pte_chain_alloc(GFP_KERNEL);
if (!pte_chain)
goto err;
+
+ /*
+ * Convert this page to anon for objrmap if it's nonlinear
+ */
+ pgidx = (addr - vma->vm_start) >> PAGE_SHIFT;
+ pgidx += vma->vm_pgoff;
+ pgidx >>= PAGE_CACHE_SHIFT - PAGE_SHIFT;
+ if (!PageAnon(page) && (page->index != pgidx)) {
+ lock_page(page);
+ err = page_convert_anon(page);
+ unlock_page(page);
+ if (err < 0)
+ goto err_free;
+ }
+
pgd = pgd_offset(mm, addr);
spin_lock(&mm->page_table_lock);

@@ -85,12 +101,11 @@ int install_page(struct mm_struct *mm, s
pte_val = *pte;
pte_unmap(pte);
update_mmu_cache(vma, addr, pte_val);
- spin_unlock(&mm->page_table_lock);
- pte_chain_free(pte_chain);
- return 0;

+ err = 0;
err_unlock:
spin_unlock(&mm->page_table_lock);
+err_free:
pte_chain_free(pte_chain);
err:
return err;
diff -puN mm/memory.c~objrmap_dave_hugh_1 mm/memory.c
--- mmlinux-2.6/mm/memory.c~objrmap_dave_hugh_1 2004-03-21 16:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/memory.c 2004-03-21 16:25:07.000000000 -0500
@@ -1071,6 +1071,7 @@ static int do_wp_page(struct mm_struct *
++mm->rss;
page_remove_rmap(old_page, page_table);
break_cow(vma, new_page, address, page_table);
+ SetPageAnon(new_page);
pte_chain = page_add_rmap(new_page, page_table, pte_chain);
lru_cache_add_active(new_page);

@@ -1309,6 +1310,7 @@ static int do_swap_page(struct mm_struct

flush_icache_page(vma, page);
set_pte(page_table, pte);
+ SetPageAnon(page);
pte_chain = page_add_rmap(page, page_table, pte_chain);

/* No need to invalidate - it was non-present before */
@@ -1376,6 +1378,7 @@ do_anonymous_page(struct mm_struct *mm,
vma);
lru_cache_add_active(page);
mark_page_accessed(page);
+ SetPageAnon(page);
}

set_pte(page_table, entry);
@@ -1443,6 +1446,10 @@ retry:
if (!pte_chain)
goto oom;

+ /* See if nopage returned an anon page */
+ if (!new_page->mapping || PageSwapCache(new_page))
+ SetPageAnon(new_page);
+
/*
* Should we do an early C-O-W break?
*/
@@ -1453,6 +1460,7 @@ retry:
copy_user_highpage(page, new_page, address);
page_cache_release(new_page);
lru_cache_add_active(page);
+ SetPageAnon(page);
new_page = page;
}

diff -puN mm/page_alloc.c~objrmap_dave_hugh_1 mm/page_alloc.c
--- mmlinux-2.6/mm/page_alloc.c~objrmap_dave_hugh_1 2004-03-21 16:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/page_alloc.c 2004-03-21 16:25:07.000000000 -0500
@@ -224,6 +224,8 @@ static inline void free_pages_check(cons
bad_page(function, page);
if (PageDirty(page))
ClearPageDirty(page);
+ if (PageAnon(page))
+ ClearPageAnon(page);
}

/*
diff -puN mm/rmap.c~objrmap_dave_hugh_1 mm/rmap.c
--- mmlinux-2.6/mm/rmap.c~objrmap_dave_hugh_1 2004-03-21 16:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/rmap.c 2004-03-21 16:25:31.000000000 -0500
@@ -103,6 +103,136 @@ pte_chain_encode(struct pte_chain *pte_c
**/

/**
+ * find_pte - Find a pte pointer given a vma and a struct page.
+ * @vma: the vma to search
+ * @page: the page to find
+ *
+ * Determine if this page is mapped in this vma. If it is, map and return
+ * the pte pointer associated with it. Return null if the page is not
+ * mapped in this vma for any reason.
+ *
+ * This is strictly an internal helper function for the object-based rmap
+ * functions.
+ *
+ * It is the caller's responsibility to unmap the pte if it is returned.
+ */
+static inline pte_t *
+find_pte(struct vm_area_struct *vma, struct page *page, unsigned long *addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pgd_t *pgd;
+ pmd_t *pmd;
+ pte_t *pte;
+ unsigned long loffset;
+ unsigned long address;
+
+ loffset = (page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
+ address = vma->vm_start + ((loffset - vma->vm_pgoff) << PAGE_SHIFT);
+ if (address < vma->vm_start || address >= vma->vm_end)
+ goto out;
+
+ pgd = pgd_offset(mm, address);
+ if (!pgd_present(*pgd))
+ goto out;
+
+ pmd = pmd_offset(pgd, address);
+ if (!pmd_present(*pmd))
+ goto out;
+
+ pte = pte_offset_map(pmd, address);
+ if (!pte_present(*pte))
+ goto out_unmap;
+
+ if (page_to_pfn(page) != pte_pfn(*pte))
+ goto out_unmap;
+
+ if (addr)
+ *addr = address;
+
+ return pte;
+
+out_unmap:
+ pte_unmap(pte);
+out:
+ return NULL;
+}
+
+/**
+ * page_referenced_obj_one - referenced check for object-based rmap
+ * @vma: the vma to look in.
+ * @page: the page we're working on.
+ *
+ * Find a pte entry for a page/vma pair, then check and clear the referenced
+ * bit.
+ *
+ * This is strictly a helper function for page_referenced_obj.
+ */
+static int
+page_referenced_obj_one(struct vm_area_struct *vma, struct page *page)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ pte_t *pte;
+ int referenced = 0;
+
+ if (!spin_trylock(&mm->page_table_lock))
+ return 1;
+
+ pte = find_pte(vma, page, NULL);
+ if (pte) {
+ if (ptep_test_and_clear_young(pte))
+ referenced++;
+ pte_unmap(pte);
+ }
+
+ spin_unlock(&mm->page_table_lock);
+ return referenced;
+}
+
+/**
+ * page_referenced_obj - referenced check for object-based rmap
+ * @page: the page we're checking references on.
+ *
+ * For an object-based mapped page, find all the places it is mapped and
+ * check/clear the referenced flag. This is done by following the page->mapping
+ * pointer, then walking the chain of vmas it holds. It returns the number
+ * of references it found.
+ *
+ * This function is only called from page_referenced for object-based pages.
+ *
+ * The semaphore address_space->i_shared_sem is tried. If it can't be gotten,
+ * assume a reference count of 1.
+ */
+static int
+page_referenced_obj(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct vm_area_struct *vma;
+ int referenced = 0;
+
+ if (!page->pte.mapcount)
+ return 0;
+
+ if (!mapping)
+ BUG();
+
+ if (PageSwapCache(page))
+ BUG();
+
+ if (down_trylock(&mapping->i_shared_sem))
+ return 1;
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared)
+ referenced += page_referenced_obj_one(vma, page);
+
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared)
+ referenced += page_referenced_obj_one(vma, page);
+
+ up(&mapping->i_shared_sem);
+
+ return referenced;
+}
+
+/**
* page_referenced - test if the page was referenced
* @page: the page to test
*
@@ -124,6 +254,10 @@ int fastcall page_referenced(struct page
if (TestClearPageReferenced(page))
referenced++;

+ if (!PageAnon(page)) {
+ referenced += page_referenced_obj(page);
+ goto out;
+ }
if (PageDirect(page)) {
pte_t *pte = rmap_ptep_map(page->pte.direct);
if (ptep_test_and_clear_young(pte))
@@ -155,6 +289,7 @@ int fastcall page_referenced(struct page
__pte_chain_free(pc);
}
}
+out:
return referenced;
}

@@ -177,6 +312,21 @@ page_add_rmap(struct page *page, pte_t *

pte_chain_lock(page);

+ /*
+ * If this is an object-based page, just count it. We can
+ * find the mappings by walking the object vma chain for that object.
+ */
+ if (!PageAnon(page)) {
+ if (!page->mapping)
+ BUG();
+ if (PageSwapCache(page))
+ BUG();
+ if (!page->pte.mapcount)
+ inc_page_state(nr_mapped);
+ page->pte.mapcount++;
+ goto out;
+ }
+
if (page->pte.direct == 0) {
page->pte.direct = pte_paddr;
SetPageDirect(page);
@@ -233,8 +383,21 @@ void fastcall page_remove_rmap(struct pa
pte_chain_lock(page);

if (!page_mapped(page))
- goto out_unlock; /* remap_page_range() from a driver? */
+ goto out_unlock;

+ /*
+ * If this is an object-based page, just uncount it. We can
+ * find the mappings by walking the object vma chain for that object.
+ */
+ if (!PageAnon(page)) {
+ if (!page->mapping)
+ BUG();
+ if (PageSwapCache(page))
+ BUG();
+ page->pte.mapcount--;
+ goto out;
+ }
+
if (PageDirect(page)) {
if (page->pte.direct == pte_paddr) {
page->pte.direct = 0;
@@ -271,16 +434,112 @@ void fastcall page_remove_rmap(struct pa
}
}
out:
- if (page->pte.direct == 0 && page_test_and_clear_dirty(page))
- set_page_dirty(page);
- if (!page_mapped(page))
+ if (!page_mapped(page)) {
+ if (page_test_and_clear_dirty(page))
+ set_page_dirty(page);
dec_page_state(nr_mapped);
+ }
out_unlock:
pte_chain_unlock(page);
return;
}

/**
+ * try_to_unmap_obj_one - unmap a page using the object-based rmap method
+ * @page: the page to unmap
+ *
+ * Determine whether a page is mapped in a given vma and unmap it if it's found.
+ *
+ * This function is strictly a helper function for try_to_unmap_obj.
+ */
+static inline int
+try_to_unmap_obj_one(struct vm_area_struct *vma, struct page *page)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ unsigned long address;
+ pte_t *pte;
+ pte_t pteval;
+ int ret = SWAP_AGAIN;
+
+ if (!spin_trylock(&mm->page_table_lock))
+ return ret;
+
+ pte = find_pte(vma, page, &address);
+ if (!pte)
+ goto out;
+
+ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) {
+ ret = SWAP_FAIL;
+ goto out_unmap;
+ }
+
+ flush_cache_page(vma, address);
+ pteval = ptep_clear_flush(vma, address, pte);
+
+ if (pte_dirty(pteval))
+ set_page_dirty(page);
+
+ if (!page->pte.mapcount)
+ BUG();
+
+ mm->rss--;
+ page->pte.mapcount--;
+ page_cache_release(page);
+
+out_unmap:
+ pte_unmap(pte);
+
+out:
+ spin_unlock(&mm->page_table_lock);
+ return ret;
+}
+
+/**
+ * try_to_unmap_obj - unmap a page using the object-based rmap method
+ * @page: the page to unmap
+ *
+ * Find all the mappings of a page using the mapping pointer and the vma chains
+ * contained in the address_space struct it points to.
+ *
+ * This function is only called from try_to_unmap for object-based pages.
+ *
+ * The semaphore address_space->i_shared_sem is tried. If it can't be gotten,
+ * return a temporary error.
+ */
+static int
+try_to_unmap_obj(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+ struct vm_area_struct *vma;
+ int ret = SWAP_AGAIN;
+
+ if (!mapping)
+ BUG();
+
+ if (PageSwapCache(page))
+ BUG();
+
+ if (down_trylock(&mapping->i_shared_sem))
+ return ret;
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared) {
+ ret = try_to_unmap_obj_one(vma, page);
+ if (ret == SWAP_FAIL || !page->pte.mapcount)
+ goto out;
+ }
+
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ ret = try_to_unmap_obj_one(vma, page);
+ if (ret == SWAP_FAIL || !page->pte.mapcount)
+ goto out;
+ }
+
+out:
+ up(&mapping->i_shared_sem);
+ return ret;
+}
+
+/**
* try_to_unmap_one - worker function for try_to_unmap
* @page: page to unmap
* @ptep: page table entry to unmap from page
@@ -324,7 +583,7 @@ static int fastcall try_to_unmap_one(str
}

/* The page is mlock()d, we cannot swap it out. */
- if (vma->vm_flags & VM_LOCKED) {
+ if (vma->vm_flags & (VM_LOCKED|VM_RESERVED)) {
ret = SWAP_FAIL;
goto out_unlock;
}
@@ -398,11 +657,18 @@ int fastcall try_to_unmap(struct page *
if (!page->mapping)
BUG();

+ /*
+ * If it's an object-based page, use the object vma chain to find all
+ * the mappings.
+ */
+ if (!PageAnon(page)) {
+ ret = try_to_unmap_obj(page);
+ goto out;
+ }
+
if (PageDirect(page)) {
ret = try_to_unmap_one(page, page->pte.direct);
if (ret == SWAP_SUCCESS) {
- if (page_test_and_clear_dirty(page))
- set_page_dirty(page);
page->pte.direct = 0;
ClearPageDirect(page);
}
@@ -439,9 +705,6 @@ int fastcall try_to_unmap(struct page *
} else {
start->next_and_idx++;
}
- if (page->pte.direct == 0 &&
- page_test_and_clear_dirty(page))
- set_page_dirty(page);
break;
case SWAP_AGAIN:
/* Skip this pte, remembering status. */
@@ -454,12 +717,117 @@ int fastcall try_to_unmap(struct page *
}
}
out:
- if (!page_mapped(page))
+ if (!page_mapped(page)) {
+ if (page_test_and_clear_dirty(page))
+ set_page_dirty(page);
dec_page_state(nr_mapped);
+ ret = SWAP_SUCCESS;
+ }
return ret;
}

/**
+ * page_convert_anon - Convert an object-based mapped page to pte_chain-based.
+ * @page: the page to convert
+ *
+ * Find all the mappings for an object-based page and convert them
+ * to 'anonymous', ie create a pte_chain and store all the pte pointers there.
+ *
+ * This function takes the address_space->i_shared_sem, sets the PageAnon flag,
+ * then sets the mm->page_table_lock for each vma and calls page_add_rmap. This
+ * means there is a period when PageAnon is set, but still has some mappings
+ * with no pte_chain entry. This is in fact safe, since page_remove_rmap will
+ * simply not find it. try_to_unmap might erroneously return success, but it
+ * will never be called because the page_convert_anon() caller has locked the
+ * page.
+ *
+ * page_referenced() may fail to scan all the appropriate pte's and may return
+ * an inaccurate result. This is so rare that it does not matter.
+ */
+int page_convert_anon(struct page *page)
+{
+ struct address_space *mapping;
+ struct vm_area_struct *vma;
+ struct pte_chain *pte_chain = NULL;
+ pte_t *pte;
+ int err = 0;
+
+ mapping = page->mapping;
+ if (mapping == NULL)
+ goto out; /* truncate won the lock_page() race */
+
+ down(&mapping->i_shared_sem);
+ pte_chain_lock(page);
+
+ /*
+ * Has someone else done it for us before we got the lock?
+ * If so, pte.direct or pte.chain has replaced pte.mapcount.
+ */
+ if (PageAnon(page)) {
+ pte_chain_unlock(page);
+ goto out_unlock;
+ }
+
+ SetPageAnon(page);
+ if (page->pte.mapcount == 0) {
+ pte_chain_unlock(page);
+ goto out_unlock;
+ }
+ /* This is gonna get incremented by page_add_rmap */
+ dec_page_state(nr_mapped);
+ page->pte.mapcount = 0;
+
+ /*
+ * Now that the page is marked as anon, unlock it. page_add_rmap will
+ * lock it as necessary.
+ */
+ pte_chain_unlock(page);
+
+ list_for_each_entry(vma, &mapping->i_mmap, shared) {
+ if (!pte_chain) {
+ pte_chain = pte_chain_alloc(GFP_KERNEL);
+ if (!pte_chain) {
+ err = -ENOMEM;
+ goto out_unlock;
+ }
+ }
+ spin_lock(&vma->vm_mm->page_table_lock);
+ pte = find_pte(vma, page, NULL);
+ if (pte) {
+ /* Make sure this isn't a duplicate */
+ page_remove_rmap(page, pte);
+ pte_chain = page_add_rmap(page, pte, pte_chain);
+ pte_unmap(pte);
+ }
+ spin_unlock(&vma->vm_mm->page_table_lock);
+ }
+ list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ if (!pte_chain) {
+ pte_chain = pte_chain_alloc(GFP_KERNEL);
+ if (!pte_chain) {
+ err = -ENOMEM;
+ goto out_unlock;
+ }
+ }
+ spin_lock(&vma->vm_mm->page_table_lock);
+ pte = find_pte(vma, page, NULL);
+ if (pte) {
+ /* Make sure this isn't a duplicate */
+ page_remove_rmap(page, pte);
+ pte_chain = page_add_rmap(page, pte, pte_chain);
+ pte_unmap(pte);
+ }
+ spin_unlock(&vma->vm_mm->page_table_lock);
+ }
+
+out_unlock:
+ pte_chain_free(pte_chain);
+ up(&mapping->i_shared_sem);
+out:
+ return err;
+}
+
+/**
** No more VM stuff below this comment, only pte_chain helper
** functions.
**/
diff -puN mm/swapfile.c~objrmap_dave_hugh_1 mm/swapfile.c
--- mmlinux-2.6/mm/swapfile.c~objrmap_dave_hugh_1 2004-03-21 16:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/swapfile.c 2004-03-21 16:25:07.000000000 -0500
@@ -390,6 +390,7 @@ unuse_pte(struct vm_area_struct *vma, un
vma->vm_mm->rss++;
get_page(page);
set_pte(dir, pte_mkold(mk_pte(page, vma->vm_page_prot)));
+ SetPageAnon(page);
*pte_chainp = page_add_rmap(page, dir, *pte_chainp);
swap_free(entry);
}

_
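As a side note, the object-based accounting the rmap.c hunks above introduce
can be modeled in a few lines of userspace C. This is only an illustrative
sketch (the toy_* names are invented; nothing here is kernel code): a
file-backed page keeps a bare mapcount, and the nr_mapped statistic is
adjusted on the first and last mapping, mirroring the contract that
page_add_rmap and page_remove_rmap maintain in the patch.

#include <assert.h>
#include <stdio.h>

static unsigned long nr_mapped;         /* stands in for the page state counter */

struct toy_page {
        unsigned long mapcount;         /* stands in for page->pte.mapcount */
};

static void toy_page_add_rmap(struct toy_page *page)
{
        if (!page->mapcount)
                nr_mapped++;            /* first mapping of this page */
        page->mapcount++;
}

static void toy_page_remove_rmap(struct toy_page *page)
{
        assert(page->mapcount > 0);
        page->mapcount--;
        if (!page->mapcount)
                nr_mapped--;            /* last mapping went away */
}

int main(void)
{
        struct toy_page page = { 0 };

        toy_page_add_rmap(&page);       /* vma 1 maps the page */
        toy_page_add_rmap(&page);       /* vma 2 maps the page */
        toy_page_remove_rmap(&page);
        toy_page_remove_rmap(&page);
        printf("nr_mapped = %lu\n", nr_mapped);        /* prints 0 */
        return 0;
}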

Subject: Re: [RFC][PATCH 3/3] Convert objrmap to use prio_tree...



Convert mm/rmap.c to prio_tree...



mm/rmap.c | 50 +++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 43 insertions(+), 7 deletions(-)

diff -puN mm/rmap.c~objrmap_prio_tree mm/rmap.c
--- mmlinux-2.6/mm/rmap.c~objrmap_prio_tree 2004-03-21 16:25:12.000000000 -0500
+++ mmlinux-2.6-jaya/mm/rmap.c 2004-03-21 16:25:12.000000000 -0500
@@ -129,7 +129,7 @@ find_pte(struct vm_area_struct *vma, str
loffset = (page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
address = vma->vm_start + ((loffset - vma->vm_pgoff) << PAGE_SHIFT);
if (address < vma->vm_start || address >= vma->vm_end)
- goto out;
+ BUG();

pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
@@ -207,6 +207,8 @@ page_referenced_obj(struct page *page)
{
struct address_space *mapping = page->mapping;
struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+ unsigned long loffset;
int referenced = 0;

if (!page->pte.mapcount)
@@ -221,11 +223,22 @@ page_referenced_obj(struct page *page)
if (down_trylock(&mapping->i_shared_sem))
return 1;

- list_for_each_entry(vma, &mapping->i_mmap, shared)
+ loffset = (page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
+
+ vma = __vma_prio_tree_first(&mapping->i_mmap, &iter, loffset, loffset);
+ while (vma) {
referenced += page_referenced_obj_one(vma, page);
+ vma = __vma_prio_tree_next(vma, &mapping->i_mmap, &iter,
+ loffset, loffset);
+ }

- list_for_each_entry(vma, &mapping->i_mmap_shared, shared)
+ vma = __vma_prio_tree_first(&mapping->i_mmap_shared, &iter, loffset,
+ loffset);
+ while (vma) {
referenced += page_referenced_obj_one(vma, page);
+ vma = __vma_prio_tree_next(vma, &mapping->i_mmap_shared, &iter,
+ loffset, loffset);
+ }

up(&mapping->i_shared_sem);

@@ -511,6 +524,8 @@ try_to_unmap_obj(struct page *page)
{
struct address_space *mapping = page->mapping;
struct vm_area_struct *vma;
+ struct prio_tree_iter iter;
+ unsigned long loffset;
int ret = SWAP_AGAIN;

if (!mapping)
@@ -522,16 +537,25 @@ try_to_unmap_obj(struct page *page)
if (down_trylock(&mapping->i_shared_sem))
return ret;

- list_for_each_entry(vma, &mapping->i_mmap, shared) {
+ loffset = (page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
+
+ vma = __vma_prio_tree_first(&mapping->i_mmap, &iter, loffset, loffset);
+ while (vma) {
ret = try_to_unmap_obj_one(vma, page);
if (ret == SWAP_FAIL || !page->pte.mapcount)
goto out;
+ vma = __vma_prio_tree_next(vma, &mapping->i_mmap, &iter,
+ loffset, loffset);
}

- list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ vma = __vma_prio_tree_first(&mapping->i_mmap_shared, &iter, loffset,
+ loffset);
+ while (vma) {
ret = try_to_unmap_obj_one(vma, page);
if (ret == SWAP_FAIL || !page->pte.mapcount)
goto out;
+ vma = __vma_prio_tree_next(vma, &mapping->i_mmap_shared, &iter,
+ loffset, loffset);
}

out:
@@ -749,6 +773,8 @@ int page_convert_anon(struct page *page)
struct address_space *mapping;
struct vm_area_struct *vma;
struct pte_chain *pte_chain = NULL;
+ struct prio_tree_iter iter;
+ unsigned long loffset;
pte_t *pte;
int err = 0;

@@ -783,7 +809,10 @@ int page_convert_anon(struct page *page)
*/
pte_chain_unlock(page);

- list_for_each_entry(vma, &mapping->i_mmap, shared) {
+ loffset = (page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT));
+
+ vma = __vma_prio_tree_first(&mapping->i_mmap, &iter, loffset, loffset);
+ while (vma) {
if (!pte_chain) {
pte_chain = pte_chain_alloc(GFP_KERNEL);
if (!pte_chain) {
@@ -800,8 +829,13 @@ int page_convert_anon(struct page *page)
pte_unmap(pte);
}
spin_unlock(&vma->vm_mm->page_table_lock);
+ vma = __vma_prio_tree_next(vma, &mapping->i_mmap, &iter,
+ loffset, loffset);
}
- list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+
+ vma = __vma_prio_tree_first(&mapping->i_mmap_shared, &iter, loffset,
+ loffset);
+ while (vma) {
if (!pte_chain) {
pte_chain = pte_chain_alloc(GFP_KERNEL);
if (!pte_chain) {
@@ -818,6 +852,8 @@ int page_convert_anon(struct page *page)
pte_unmap(pte);
}
spin_unlock(&vma->vm_mm->page_table_lock);
+ vma = __vma_prio_tree_next(vma, &mapping->i_mmap_shared, &iter,
+ loffset, loffset);
}

out_unlock:

_
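For readers unfamiliar with the prio_tree API used above, the search that
__vma_prio_tree_first/__vma_prio_tree_next perform is a stabbing query:
return every vma whose file-page interval contains loffset. The overlap
test itself is trivial; the tree only makes answering it O(log n + m)
instead of O(n). Below is a toy userspace sketch of just the predicate
(the toy_* names are invented, and a linear scan stands in for the tree
walk):

#include <stdio.h>

struct toy_vma {
        unsigned long vm_pgoff;         /* first file page mapped (radix index) */
        unsigned long npages;           /* vm_pgoff + npages - 1 is the heap index */
};

/* Does this vma map the file page at loffset? */
static int toy_vma_maps_offset(const struct toy_vma *vma,
                               unsigned long loffset)
{
        return loffset >= vma->vm_pgoff &&
               loffset < vma->vm_pgoff + vma->npages;
}

int main(void)
{
        struct toy_vma vmas[] = { { 0, 4 }, { 2, 8 }, { 16, 2 } };
        unsigned long loffset = 3;      /* the page being looked up */
        unsigned int i;

        for (i = 0; i < sizeof(vmas) / sizeof(vmas[0]); i++)
                if (toy_vma_maps_offset(&vmas[i], loffset))
                        printf("vma %u maps page %lu\n", i, loffset);
        return 0;                       /* matches vmas 0 and 1 */
}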

Subject: URL typo...


> Further testing will help. If you like broken-out patches please check:
>
> http://www-personal.engin.umich.edu/~vrajesh/~vrajesh/linux/prio_tree/

Sorry! The URL is:

http://www-personal.engin.umich.edu/~vrajesh/linux/prio_tree/

Thanks,
Rajesh

2004-03-22 00:46:03

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Sun, Mar 21, 2004 at 05:10:45PM -0500, Rajesh Venkatasubramanian wrote:
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107966438414248
>
> Andrea says the system may hang, however, in this case system
> does not hang.

It's a livelock, not a deadlock. I didn't wait more than a few minutes
every time before declaring the kernel broken and rebooting the machine.
Still, if the prio_tree fixed my problem, it means at the very least it
reduced the contention on the locks a lot ;)

It would be interesting to test it after changing the return 1 to
return 0 in the page_referenced trylock failures.

The results look great, thanks.

What about the cost of a tree rebalance? Is it O(log(N)) like with the
rbtrees?

2004-03-22 02:33:12

by Rik van Riel

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, 22 Mar 2004, Andrea Arcangeli wrote:

> It would be interesting to test it after changing the return 1 to
> return 0 in the page_referenced trylock failures.

In the case of a trylock failure, it should probably return a
random value. For heavily page faulting multithreaded apps,
that would mean we'd tend towards random replacement, instead
of FIFO.

Then again, the locking problems shouldn't be too bad in most
cases. If you're swapping the program will be waiting on IO
and if it's not waiting on IO there's no problem.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix


> What about the cost of a tree rebalance? Is it O(log(N)) like with the
> rbtrees?

Currently the tree is not balanced, so the tree can be totally skewed
in some corner cases. However, the maximum height of the tree can be
only 2 * BITS_PER_LONG.

Moreover, I have added an optimization to increase the maximum height
of the tree on demand. The tree height is controlled by keeping track
of the maximum file offset mapped. If the number of bits required to
represent the maximum file offset is B, then the height of the tree
can be only 2 * B. Note that currently B can only increase gradually;
it is not adjusted back to a smaller value when vmas are removed from
the prio_tree. That's a bit tricky to do.
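
To make the bound concrete, here is a rough userspace illustration (not
kernel code): compute B from the maximum file page offset mapped so far,
and the height limit is simply twice that.

#include <stdio.h>

/* Number of bits needed to represent max_pgoff. */
static unsigned int bits_needed(unsigned long max_pgoff)
{
        unsigned int b = 0;

        do {
                b++;
                max_pgoff >>= 1;
        } while (max_pgoff);
        return b;
}

int main(void)
{
        unsigned long max_pgoff = 0x3ffff;      /* ~1GB file in 4KB pages */
        unsigned int b = bits_needed(max_pgoff);

        printf("B = %u, max tree height = %u\n", b, 2 * b);
        return 0;                               /* B = 18, max height = 36 */
}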

There is a balanced version of the prio_tree proposed in the same
McCreight paper. However, it is not interesting because it requires
more memory in each vma and the balancing is too complex even though
it is O(log(N)). I tried to understand the gist of the balanced
version, but it was too hard to follow, so I gave up midway. Even
McCreight claims that the balanced version is just an academic (not
too practical) exercise. If someone is really interested they can
check the paper, but I doubt it would improve performance.

Thanks,
Rajesh



2004-03-22 04:02:18

by Rik van Riel

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Sun, 21 Mar 2004, Rajesh Venkatasubramanian wrote:

> > What about the cost of a tree rebalance? Is it O(log(N)) like with the
> > rbtrees?
>
> Currently the tree is not balanced, so the tree can be totally skewed
> in some corner cases. However, the maximum height of the tree can be
> only 2 * BITS_PER_LONG.

Fair enough for a radix tree. Andrea, remember that page
tables don't need to be balanced either, for obvious reasons ;)

> Moreover, I have added an optimization to increase the maximum height
> of the tree on demand. The tree height is controlled by keeping track
> of the maximum file offset mapped. If the number of bits required to
> represent the maximum file offset is B, then the height of the tree
> can be only 2 * B.

Nice touch. That should really help keep the cost of the
prio_tree down in the common case.

Your stuff is so much nicer than the kb-trees I was thinking
about a year or two ago ... ;)


--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2004-03-22 04:21:08

by Abhishek Rai

Subject: put_super for proc

Hi,
I am trying to add a put_super for proc as part of some project. Although
I've done this right, when I unmount proc, it just doesn't call
proc's put_super. Any clues?

Thanks in advance!
Abhishek

2004-03-22 11:59:56

by Maneesh Soni

Subject: Re: put_super for proc

On Mon, Mar 22, 2004 at 04:22:10AM +0000, Abhishek Rai wrote:
> Hi,
> I am trying to add a put_super for proc as part of some project. Although
> I've done this right, when I unmount proc, it just doesn't call
> proc's put_super. Any clues?
>

Check s_active in the super_block. It probably never drops to zero for /proc.

Maneesh


--
Maneesh Soni
Linux Technology Center,
IBM Software Lab, Bangalore, India
email: [email protected]
Phone: 91-80-25044999 Fax: 91-80-25268553
T/L : 9243696

2004-03-25 23:03:58

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Hi Rajesh,

This will allow compilation with hugetlbfs, please review. Thanks.

I just finished adapting the prio_tree to work on my -aa tree (on top of
anon-vma and objrmap-core).

It compiles cleanly; next is to try to boot it, and in a few hours I will
know more. If it's sort of stable I'll load it on my main desktop and
release a new 2.6-aa with it.

The quality of the prio-tree code is excellent (so it was easy to adapt
to my anon-vma changes, which clean up the vma merging and remove
useless locks etc.), thanks.

As soon as the thing works the three patches
(objrmap-core+anon-vma+prio-tree) are ready for inclusion into mainline.

Really, one could nitpick that anon-vma may need a prio tree too, but
practically the beauty of anon-vma is that a prio tree is not needed,
and in real life it performs a lot better than a find_vma for every mm
mapping the page.

btw, the truncate of hugetlbfs didn't serialize correctly against the
do_no_page page faults, that's fixed too.

--- x/fs/hugetlbfs/inode.c.~1~ 2004-03-21 15:09:25.000000000 +0100
+++ x/fs/hugetlbfs/inode.c 2004-03-25 23:50:32.979427008 +0100
@@ -265,11 +265,13 @@ static void hugetlbfs_drop_inode(struct
* vma->vm_pgoff is in PAGE_SIZE units.
*/
static void
-hugetlb_vmtruncate_list(struct list_head *list, unsigned long h_pgoff)
+hugetlb_vmtruncate_list(struct prio_tree_root *root, unsigned long h_pgoff)
{
struct vm_area_struct *vma;
+ struct prio_tree_iter iter;

- list_for_each_entry(vma, list, shared) {
+ vma = __vma_prio_tree_first(root, &iter, h_pgoff, h_pgoff);
+ while (vma) {
unsigned long h_vm_pgoff;
unsigned long v_length;
unsigned long h_length;
@@ -301,6 +303,8 @@ hugetlb_vmtruncate_list(struct list_head
zap_hugepage_range(vma,
vma->vm_start + v_offset,
v_length - v_offset);
+
+ vma = __vma_prio_tree_next(vma, root, &iter, h_pgoff, h_pgoff);
}
}

@@ -320,9 +324,11 @@ static int hugetlb_vmtruncate(struct ino

inode->i_size = offset;
down(&mapping->i_shared_sem);
- if (!list_empty(&mapping->i_mmap))
+ /* Protect against page fault */
+ atomic_inc(&mapping->truncate_count);
+ if (unlikely(!prio_tree_empty(&mapping->i_mmap)))
hugetlb_vmtruncate_list(&mapping->i_mmap, pgoff);
- if (!list_empty(&mapping->i_mmap_shared))
+ if (unlikely(!prio_tree_empty(&mapping->i_mmap_shared)))
hugetlb_vmtruncate_list(&mapping->i_mmap_shared, pgoff);
up(&mapping->i_shared_sem);
truncate_hugepages(mapping, offset);
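
The truncate_count bump above is the usual anti-race handshake with the
fault path. A rough userspace model of the idea (illustrative only, no
locking shown): the truncating side increments the counter before zapping,
and the faulting side samples it before its sleeping lookup and retries if
it changed.

#include <stdio.h>

static int truncate_count;

/* Truncate side: bump the counter before zapping the mappings. */
static void toy_truncate(void)
{
        truncate_count++;
        /* ... unmap the vmas and drop the pages ... */
}

int main(void)
{
        /* Fault side: sample the counter before the sleeping lookup. */
        int sampled = truncate_count;

        toy_truncate();         /* a truncate sneaks in while the fault sleeps */

        /* Fault side: re-check after retaking the lock. */
        if (sampled != truncate_count)
                printf("raced with truncate, fault retries\n");
        else
                printf("no race, install the page\n");
        return 0;
}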

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix


Hi Andrea,

I am yet to look at the new -aa you released. A small change is
required below. Currently, I cannot generate a patch. Sorry. Please
fix it by hand. Thanks.

>
> - list_for_each_entry(vma, list, shared) {
> + vma = __vma_prio_tree_first(root, &iter, h_pgoff, h_pgoff);

This should be:
vma = __vma_prio_tree_first(root, &iter, h_pgoff, ULONG_MAX);

> + while (vma) {
> unsigned long h_vm_pgoff;
[snip]
> + vma = __vma_prio_tree_next(vma, root, &iter, h_pgoff, h_pgoff);
> }

and here it should be:
vma = __vma_prio_tree_next(vma, root, &iter,
h_pgoff, ULONG_MAX);

Thanks,
Rajesh

2004-03-26 07:52:51

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Thu, Mar 25, 2004 at 11:06:50PM -0500, Rajesh Venkatasubramanian wrote:
>
> Hi Andrea,
>
> I am yet to look at the new -aa you released. A small change is
> required below. Currently, I cannot generate a patch. Sorry. Please
> fix it by hand. Thanks.
>
> >
> > - list_for_each_entry(vma, list, shared) {
> > + vma = __vma_prio_tree_first(root, &iter, h_pgoff, h_pgoff);
>
> This should be:
> vma = __vma_prio_tree_first(root, &iter, h_pgoff, ULONG_MAX);
>
> > + while (vma) {
> > unsigned long h_vm_pgoff;
> [snip]
> > + vma = __vma_prio_tree_next(vma, root, &iter, h_pgoff, h_pgoff);
> > }
>
> and here it should be:
> vma = __vma_prio_tree_next(vma, root, &iter,
> h_pgoff, ULONG_MAX);

I was missing all vmas with vm_start starting after h_pgoff. Thanks.

2004-03-26 12:26:51

by William Lee Irwin III

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Thu, Mar 25, 2004 at 11:59:19PM +0100, Andrea Arcangeli wrote:
> btw, the truncate of hugetlbfs didn't serialize correctly against the
> do_no_page page faults, that's fixed too.

If a fault on hugetlb ever got as far as do_no_page() on ia32, the
kernel would oops on the bogus struct page it gets out of the bogus
pte. I believe the way faults are handled in out-of-tree patches is by
calling hugetlb-specific fault handling stacks instead of
handle_mm_fault() when hugetlb vmas are found by arch code.


-- wli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix


Hi Andrea,

There is a problem with the prio_tree merge. As usual it is
related to VM_NONLINEAR. When I was reading Hugh's nonlinear
patch, I recalled this problem.

Currently, with the prio_tree search in try_to_unmap, you will
not check all the nonlinear vmas. Earlier, with a list walk it
was not a problem. But, now in try_to_unmap we only select vmas
that map a given page. That's meaningless for nonlinear vmas.

I think the fix is straightforward. My plan is to add a
"list_head i_mmap_nonlinear" to the address_space and use the
list to find all nonlinear vmas in try_to_unmap_inode.

In sys_remap_file_pages, we can do something like below:

if (!(vma->vm_flags & VM_NONLINEAR)) { /* vma is not already nonlinear */
__vma_prio_tree_remove(&mapping->i_mmap_shared, vma);
list_add_tail(&vma->shared.vm_set.list,
&mapping->i_mmap_nonlinear);
}

Urggh. That forces us to take i_shared_sem in sys_remap_file_pages.

Please let me know if you have any better idea. Otherwise, tonight
I will send you a patch for 2.6.5-rc2-aa4.

Thanks,
Rajesh


2004-03-26 17:57:50

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Mar 26, 2004 at 10:43:17AM -0500, Rajesh Venkatasubramanian wrote:
>
> Hi Andrea,
>
> There is a problem with the prio_tree merge. As usual it is
> related to VM_NONLINEAR. When I was reading Hugh's nonlinear
> patch, I recalled this problem.
>
> Currently, with the prio_tree search in try_to_unmap, you will
> not check all the nonlinear vmas. Earlier, with a list walk it
> was not a problem. But, now in try_to_unmap we only select vmas
> that map a given page. That's meaningless for nonlinear vmas.
>
> I think the fix is straightforward. My plan is to add a
> "list_head i_mmap_nonlinear" to the address_space and use the
> list to find all nonlinear vmas in try_to_unmap_inode.
>
> In sys_remap_file_pages, we can do something like below:
>
> if (!(vma->vm_flags & VM_NONLINEAR)) { /* vma is not already nonlinear */
> __vma_prio_tree_remove(&mapping->i_mmap_shared, vma);
> list_add_tail(&vma->shared.vm_set.list,
> &mapping->i_mmap_nonlinear);
> }
>
> Urggh. That forces us to take i_shared_sem in sys_remap_file_pages.
>
> Please let me know if you have any better idea. Otherwise, tonight
> I will send you a patch for 2.6.5-rc2-aa4.

I agree this will fix it. If a better fix comes to mind I will let you
know; in the meantime this fix is welcome (despite taking i_shared_sem
in remap_file_pages) and I will merge it in the next prio-tree patch.
Thanks!

2004-03-26 19:17:48

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Mar 26, 2004 at 04:26:36AM -0800, William Lee Irwin III wrote:
> On Thu, Mar 25, 2004 at 11:59:19PM +0100, Andrea Arcangeli wrote:
> > btw, the truncate of hugetlbfs didn't serialize correctly against the
> > do_no_page page faults, that's fixed too.
>
> If a fault on hugetlb ever got as far as do_no_page() on ia32, the
> kernel would oops on the bogus struct page it gets out of the bogus
> pte. I believe the way faults are handled in out-of-tree patches is by
> calling hugetlb-specific fault handling stacks instead of
> handle_mm_fault() when hugetlb vmas are found by arch code.
>

This is certainly true, but still the pmd fault handling should have the
same locking as do_no_page; the race sounds the same no matter whether
it's a pmd or pte fill, no?

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix


This patch adds a list_head i_mmap_nonlinear to the address_space
structure. The list is used for storing all nonlinear vmas. This
is helpful in try_to_unmap_inode to find all nonlinear mappings of
a file.

This patch does not change invalidate_mmap_range_list. The behavior
of truncate on nonlinear mappings is already undefined. We
understand that nonlinear mappings do not guarantee SIGBUS on
truncate. After this patch, we do not touch nonlinear maps on
the truncate path. So it is assured that the nonlinear maps will
not be destroyed by a truncate.

I am not happy with the truncate behavior on nonlinear maps. I
think we can guarantee SIGBUS on nonlinear maps by reusing Andrea's
try_to_unmap_nonlinear code. But I have to study more to do that,
so I am leaving it for the future.

This patch is against 2.6.5-rc2-aa4. The patch was tested on an
SMP machine. It boots and compiles a kernel without any problem.

Please review and apply.

Thanks,
Rajesh


fs/inode.c | 1 +
fs/locks.c | 6 ++++--
include/linux/fs.h | 1 +
mm/filemap.c | 3 ++-
mm/fremap.c | 14 +++++++++++++-
mm/mmap.c | 25 +++++++++++++++++++------
mm/mremap.c | 6 ++++--
mm/objrmap.c | 10 +++++++++-
mm/shmem.c | 3 ++-
mm/swap_state.c | 1 +
mm/vmscan.c | 2 ++
11 files changed, 58 insertions(+), 14 deletions(-)

diff -puN include/linux/fs.h~010_nonlinear include/linux/fs.h
--- mmlinux-2.6/include/linux/fs.h~010_nonlinear 2004-03-27 14:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/include/linux/fs.h 2004-03-27 14:25:07.000000000 -0500
@@ -328,6 +328,7 @@ struct address_space {
struct address_space_operations *a_ops; /* methods */
struct prio_tree_root i_mmap; /* tree of private mappings */
struct prio_tree_root i_mmap_shared; /* tree of shared mappings */
+ struct list_head i_mmap_nonlinear;/*list of nonlinear mappings */
struct semaphore i_shared_sem; /* protect both above lists */
atomic_t truncate_count; /* Cover race condition with truncate */
unsigned long dirtied_when; /* jiffies of first page dirtying */
diff -puN fs/inode.c~010_nonlinear fs/inode.c
--- mmlinux-2.6/fs/inode.c~010_nonlinear 2004-03-27 14:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/fs/inode.c 2004-03-27 14:25:07.000000000 -0500
@@ -187,6 +187,7 @@ void inode_init_once(struct inode *inode
spin_lock_init(&inode->i_data.private_lock);
INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap_shared);
+ INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
spin_lock_init(&inode->i_lock);
i_size_ordered_init(inode);
}
diff -puN fs/locks.c~010_nonlinear fs/locks.c
--- mmlinux-2.6/fs/locks.c~010_nonlinear 2004-03-27 14:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/fs/locks.c 2004-03-27 14:25:07.000000000 -0500
@@ -1455,7 +1455,8 @@ int fcntl_setlk(struct file *filp, unsig
if (IS_MANDLOCK(inode) &&
(inode->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID) {
struct address_space *mapping = filp->f_mapping;
- if (!prio_tree_empty(&mapping->i_mmap_shared)) {
+ if (!prio_tree_empty(&mapping->i_mmap_shared) ||
+ !list_empty(&mapping->i_mmap_nonlinear)) {
error = -EAGAIN;
goto out;
}
@@ -1592,7 +1593,8 @@ int fcntl_setlk64(struct file *filp, uns
if (IS_MANDLOCK(inode) &&
(inode->i_mode & (S_ISGID | S_IXGRP)) == S_ISGID) {
struct address_space *mapping = filp->f_mapping;
- if (!prio_tree_empty(&mapping->i_mmap_shared)) {
+ if (!prio_tree_empty(&mapping->i_mmap_shared) ||
+ !list_empty(&mapping->i_mmap_nonlinear)) {
error = -EAGAIN;
goto out;
}
diff -puN mm/filemap.c~010_nonlinear mm/filemap.c
--- mmlinux-2.6/mm/filemap.c~010_nonlinear 2004-03-27 14:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/filemap.c 2004-03-27 14:25:07.000000000 -0500
@@ -650,7 +650,8 @@ page_ok:
* virtual addresses, take care about potential aliasing
* before reading the page on the kernel side.
*/
- if (!prio_tree_empty(&mapping->i_mmap_shared))
+ if (!prio_tree_empty(&mapping->i_mmap_shared) ||
+ !list_empty(&mapping->i_mmap_nonlinear))
flush_dcache_page(page);

/*
diff -puN mm/shmem.c~010_nonlinear mm/shmem.c
--- mmlinux-2.6/mm/shmem.c~010_nonlinear 2004-03-27 14:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/shmem.c 2004-03-27 14:25:07.000000000 -0500
@@ -1328,7 +1328,8 @@ static void do_shmem_file_read(struct fi
* virtual addresses, take care about potential aliasing
* before reading the page on the kernel side.
*/
- if (!prio_tree_empty(&mapping->i_mmap_shared))
+ if (!prio_tree_empty(&mapping->i_mmap_shared) ||
+ !list_empty(&mapping->i_mmap_nonlinear))
flush_dcache_page(page);
/*
* Mark the page accessed if we read the beginning.
diff -puN mm/swap_state.c~010_nonlinear mm/swap_state.c
--- mmlinux-2.6/mm/swap_state.c~010_nonlinear 2004-03-27 14:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/swap_state.c 2004-03-27 14:25:07.000000000 -0500
@@ -30,6 +30,7 @@ struct address_space swapper_space = {
.backing_dev_info = &swap_backing_dev_info,
.i_mmap = PRIO_TREE_ROOT_INIT,
.i_mmap_shared = PRIO_TREE_ROOT_INIT,
+ .i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear),
.i_shared_sem = __MUTEX_INITIALIZER(swapper_space.i_shared_sem),
.truncate_count = ATOMIC_INIT(0),
.private_lock = SPIN_LOCK_UNLOCKED,
diff -puN mm/vmscan.c~010_nonlinear mm/vmscan.c
--- mmlinux-2.6/mm/vmscan.c~010_nonlinear 2004-03-27 14:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/vmscan.c 2004-03-27 14:25:07.000000000 -0500
@@ -195,6 +195,8 @@ static inline int page_mapping_inuse(str
return 1;
if (!prio_tree_empty(&mapping->i_mmap_shared))
return 1;
+ if (!list_empty(&mapping->i_mmap_nonlinear))
+ return 1;

return 0;
}
diff -puN mm/fremap.c~010_nonlinear mm/fremap.c
--- mmlinux-2.6/mm/fremap.c~010_nonlinear 2004-03-27 14:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/fremap.c 2004-03-27 14:25:07.000000000 -0500
@@ -151,6 +151,8 @@ asmlinkage long sys_remap_file_pages(uns
unsigned long __prot, unsigned long pgoff, unsigned long flags)
{
struct mm_struct *mm = current->mm;
+ struct address_space *mapping;
+ unsigned long linear_pgoff;
unsigned long end = start + size;
struct vm_area_struct *vma;
int err = -EINVAL;
@@ -187,9 +189,19 @@ asmlinkage long sys_remap_file_pages(uns
end > start && start >= vma->vm_start &&
end <= vma->vm_end) {

+ linear_pgoff = vma->vm_pgoff;
+ linear_pgoff += ((start - vma->vm_start) >> PAGE_SHIFT);
/* Must set VM_NONLINEAR before any pages are populated. */
- if (pgoff != ((start - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff)
+ if (pgoff != linear_pgoff && !(vma->vm_flags & VM_NONLINEAR)) {
+ mapping = vma->vm_file->f_mapping;
+ down(&mapping->i_shared_sem);
vma->vm_flags |= VM_NONLINEAR;
+ __vma_prio_tree_remove(&mapping->i_mmap_shared, vma);
+ INIT_VMA_SHARED_LIST(vma);
+ list_add_tail(&vma->shared.vm_set.list,
+ &mapping->i_mmap_nonlinear);
+ up(&mapping->i_shared_sem);
+ }

/* ->populate can take a long time, so downgrade the lock. */
downgrade_write(&mm->mmap_sem);
diff -puN mm/mmap.c~010_nonlinear mm/mmap.c
--- mmlinux-2.6/mm/mmap.c~010_nonlinear 2004-03-27 14:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/mmap.c 2004-03-27 14:25:07.000000000 -0500
@@ -81,7 +81,11 @@ __remove_shared_vm_struct(struct vm_area
if (inode) {
if (vma->vm_flags & VM_DENYWRITE)
atomic_inc(&inode->i_writecount);
- if (vma->vm_flags & VM_SHARED)
+ if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
+ list_del_init(&vma->shared.vm_set.list);
+ INIT_VMA_SHARED(vma);
+ }
+ else if (vma->vm_flags & VM_SHARED)
__vma_prio_tree_remove(&mapping->i_mmap_shared, vma);
else
__vma_prio_tree_remove(&mapping->i_mmap, vma);
@@ -273,7 +277,12 @@ static inline void __vma_link_file(struc
if (vma->vm_flags & VM_DENYWRITE)
atomic_dec(&file->f_dentry->d_inode->i_writecount);

- if (vma->vm_flags & VM_SHARED)
+ if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
+ INIT_VMA_SHARED_LIST(vma);
+ list_add_tail(&vma->shared.vm_set.list,
+ &mapping->i_mmap_nonlinear);
+ }
+ else if (vma->vm_flags & VM_SHARED)
__vma_prio_tree_insert(&mapping->i_mmap_shared, vma);
else
__vma_prio_tree_insert(&mapping->i_mmap, vma);
@@ -430,8 +439,10 @@ static int vma_merge(struct mm_struct *m
i_shared_sem = file ? &file->f_mapping->i_shared_sem : NULL;

if (mapping) {
- if (vm_flags & VM_SHARED)
- root = &mapping->i_mmap_shared;
+ if (vm_flags & VM_SHARED) {
+ if (likely(!(vm_flags & VM_NONLINEAR)))
+ root = &mapping->i_mmap_shared;
+ }
else
root = &mapping->i_mmap;
}
@@ -1271,8 +1282,10 @@ int split_vma(struct mm_struct * mm, str
if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;

- if (vma->vm_flags & VM_SHARED)
- root = &mapping->i_mmap_shared;
+ if (vma->vm_flags & VM_SHARED) {
+ if (likely(!(vma->vm_flags & VM_NONLINEAR)))
+ root = &mapping->i_mmap_shared;
+ }
else
root = &mapping->i_mmap;
}
diff -puN mm/mremap.c~010_nonlinear mm/mremap.c
--- mmlinux-2.6/mm/mremap.c~010_nonlinear 2004-03-27 14:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/mremap.c 2004-03-27 14:25:07.000000000 -0500
@@ -413,8 +413,10 @@ unsigned long do_mremap(unsigned long ad

if (vma->vm_file) {
mapping = vma->vm_file->f_mapping;
- if (vma->vm_flags & VM_SHARED)
- root = &mapping->i_mmap_shared;
+ if (vma->vm_flags & VM_SHARED) {
+ if (likely(!(vma->vm_flags & VM_NONLINEAR)))
+ root = &mapping->i_mmap_shared;
+ }
else
root = &mapping->i_mmap;
down(&mapping->i_shared_sem);
diff -puN mm/objrmap.c~010_nonlinear mm/objrmap.c
--- mmlinux-2.6/mm/objrmap.c~010_nonlinear 2004-03-27 14:25:07.000000000 -0500
+++ mmlinux-2.6-jaya/mm/objrmap.c 2004-03-27 14:25:07.000000000 -0500
@@ -133,8 +133,10 @@ page_referenced_one(struct vm_area_struc
* Tracking the referenced info is too expensive
* for nonlinear mappings.
*/
- if (vma->vm_flags & VM_NONLINEAR)
+ if (unlikely(vma->vm_flags & VM_NONLINEAR)) {
+ BUG();
goto out;
+ }

if (unlikely(!spin_trylock(&mm->page_table_lock)))
goto out;
@@ -630,6 +632,12 @@ try_to_unmap_inode(struct page *page)
loffset, loffset);
}

+ list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
+ ret = try_to_unmap_one(vma, page, &young);
+ if (ret == SWAP_FAIL || !page->mapcount)
+ goto out;
+ }
+
out:
up(&mapping->i_shared_sem);
return ret;

_

2004-03-29 17:23:13

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Sat, Mar 27, 2004 at 02:51:45PM -0500, Rajesh Venkatasubramanian wrote:
>
> This patch adds a list_head i_mmap_nonlinear to the address_space
> structure. The list is used for storing all nonlinear vmas. This
> is helpful in try_to_unmap_inode to find all nonlinear mappings of
> a file.
>
> This patch does not change invalidate_mmap_range_list. The behavior
> of truncate on nonlinear mappings is already undefined. We
> understand that nonlinear mappings do not guarantee SIGBUS on
> truncate. After this patch, we do not touch nonlinear maps on
> the truncate path. So it is assured that the nonlinear maps will
> not be destroyed by a truncate.
>
> I am not happy with the truncate behavior on nonlinear maps. I
> think we can guarantee SIGBUS on nonlinear maps by reusing Andrea's
> try_to_unmap_nonlinear code. But I have to study more to do that,
> so I am leaving it for the future.
>
> This patch is against 2.6.5-rc2-aa4. The patch was tested on an
> SMP machine. It boots and compiles a kernel without any problem.
>
> Please review and apply.

Great work, looks fine; applied, thanks.

Here a further update for xfs:

--- sles/fs/xfs/linux/xfs_vnode.h.~1~ 2004-03-29 18:33:20.047028592 +0200
+++ sles/fs/xfs/linux/xfs_vnode.h 2004-03-29 19:02:37.101915648 +0200
@@ -601,8 +601,8 @@ static __inline__ void vn_flagclr(struct
* Some useful predicates.
*/
#define VN_MAPPED(vp) \
- (!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap)) || \
- (!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_shared))))
+ (!prio_tree_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap)) || \
+ (!prio_tree_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_shared))))
#define VN_CACHED(vp) (LINVFS_GET_IP(vp)->i_mapping->nrpages)
#define VN_DIRTY(vp) mapping_tagged(LINVFS_GET_IP(vp)->i_mapping, \
PAGECACHE_TAG_DIRTY)


and really some other bigger tree needs this part too (not a mainline
issue).

--- sles/fs/xfs/dmapi/dmapi_xfs.c.~1~ 2004-03-29 18:33:03.781501328 +0200
+++ sles/fs/xfs/dmapi/dmapi_xfs.c 2004-03-29 18:58:57.754261560 +0200
@@ -228,17 +228,21 @@ prohibited_mr_events(
struct address_space *mapping = LINVFS_GET_IP(vp)->i_mapping;
int prohibited = (1 << DM_EVENT_READ);
struct vm_area_struct *vma;
+ struct prio_tree_iter iter;

if (!VN_MAPPED(vp))
return 0;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
down(&mapping->i_shared_sem);
- list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
+ vma = __vma_prio_tree_first(&mapping->i_mmap_shared, &iter, 0, ULONG_MAX);
+ while (vma) {
if (!(vma->vm_flags & VM_DENYWRITE)) {
prohibited |= (1 << DM_EVENT_WRITE);
break;
}
+
+ vma = __vma_prio_tree_next(vma, &mapping->i_mmap_shared, &iter, 0, ULONG_MAX);
}
up(&mapping->i_shared_sem);
#else


let me know if you see any bug in the above, thanks!

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix



> #define VN_MAPPED(vp) \
> - (!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap)) || \
> - (!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_shared))))
> + (!prio_tree_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap)) || \
> + (!prio_tree_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_shared))))

I think we will need the following too:
(!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_nonlinear)))


> down(&mapping->i_shared_sem);
> - list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
> + vma = __vma_prio_tree_first(&mapping->i_mmap_shared, &iter, 0, ULONG_MAX);
> + while (vma) {
> if (!(vma->vm_flags & VM_DENYWRITE)) {
> prohibited |= (1 << DM_EVENT_WRITE);
> break;
> }
> +
> + vma = __vma_prio_tree_next(vma, &mapping->i_mmap_shared, &iter, 0, ULONG_MAX);
> }

This part looks fine. But I am not sure whether you have to handle
nonlinear maps here too:

list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
...
}

> up(&mapping->i_shared_sem);
> #else

Hope that helps.

Rajesh

2004-03-29 18:01:21

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, Mar 29, 2004 at 12:50:20PM -0500, Rajesh Venkatasubramanian wrote:
>
>
> > #define VN_MAPPED(vp) \
> > - (!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap)) || \
> > - (!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_shared))))
> > + (!prio_tree_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap)) || \
> > + (!prio_tree_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_shared))))
>
> I think we will need the following too:
> (!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_nonlinear)))
>
>
> > down(&mapping->i_shared_sem);
> > - list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
> > + vma = __vma_prio_tree_first(&mapping->i_mmap_shared, &iter, 0, ULONG_MAX);
> > + while (vma) {
> > if (!(vma->vm_flags & VM_DENYWRITE)) {
> > prohibited |= (1 << DM_EVENT_WRITE);
> > break;
> > }
> > +
> > + vma = __vma_prio_tree_next(vma, &mapping->i_mmap_shared, &iter, 0, ULONG_MAX);
> > }
>
> This part looks fine. But I am not sure whether you have to handle
> nonlinear maps here too:
>
> list_for_each_entry(vma, &mapping->i_mmap_nonlinear, shared.vm_set.list) {
> ...
> }
>

I agree we should handle the nonlinear maps. Since nobody uses nonlinear,
this isn't a big issue for the short term.

There's now also a screwup in the writeback -mm changes for swapsuspend,
it bugs out in radix tree tag, I believe it's because it doesn't
insert the page in the radix tree before doing writeback I/O on it. This
is my first attempt to cure it.

--- sles/mm/page_io.c.~1~ 2004-03-29 19:05:50.014588464 +0200
+++ sles/mm/page_io.c 2004-03-29 19:46:14.282043792 +0200
@@ -151,8 +151,15 @@ int rw_swap_page_sync(int rw, swp_entry_
lock_page(page);

BUG_ON(page_mapping(page));
+ BUG_ON(PageSwapCache(page));
SetPageSwapCache(page);
page->private = entry.val;
+ ret = radix_tree_insert(&page_mapping(page)->page_tree, page->private, page);
+ if (unlikely(ret)) {
+ ClearPageSwapCache(page);
+ unlock_page(page);
+ return ret;
+ }

if (rw == READ) {
ret = swap_readpage(NULL, page);
@@ -161,7 +168,10 @@ int rw_swap_page_sync(int rw, swp_entry_
ret = swap_writepage(page, &swap_wbc);
wait_on_page_writeback(page);
}
+
+ radix_tree_delete(&page_mapping(page)->page_tree, page->private);
ClearPageSwapCache(page);
+
if (ret == 0 && (!PageUptodate(page) || PageError(page)))
ret = -EIO;
return ret;

Then there's x86-64 bombing too:

--- sles/arch/x86_64/ia32/ia32_binfmt.c.~1~ 2004-03-29 19:05:50.516512160 +0200
+++ sles/arch/x86_64/ia32/ia32_binfmt.c 2004-03-29 19:53:20.031320064 +0200
@@ -366,7 +366,7 @@ int setup_arg_pages(struct linux_binprm
mpnt->vm_pgoff = mpnt->vm_start >> PAGE_SHIFT;
mpnt->vm_file = NULL;
mpol_set_vma_default(mpnt);
- INIT_LIST_HEAD(&mpnt->shared);
+ INIT_VMA_SHARED(mpnt);
/* insert_vm_struct takes care of anon_vma_node */
mpnt->anon_vma = NULL;
mpnt->vm_private_data = (void *) 0;

The writeback part will require testing, so I'll postpone further
updates until I get confirmation that swapsuspend works again (this is
all low prio stuff anyway; the previous xfs list_empty miscompilation
was scary instead, so I updated it quickly, since people actually use
MAP_SHARED/MAP_PRIVATE).

2004-03-29 18:12:38

by Hugh Dickins

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, 29 Mar 2004, Andrea Arcangeli wrote:
>
> Here a further update for xfs:
>
> --- sles/fs/xfs/linux/xfs_vnode.h.~1~ 2004-03-29 18:33:20.047028592 +0200
> +++ sles/fs/xfs/linux/xfs_vnode.h 2004-03-29 19:02:37.101915648 +0200
> @@ -601,8 +601,8 @@ static __inline__ void vn_flagclr(struct
> * Some useful predicates.
> */
> #define VN_MAPPED(vp) \
> - (!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap)) || \
> - (!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_shared))))
> + (!prio_tree_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap)) || \
> + (!prio_tree_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_shared))))
> #define VN_CACHED(vp) (LINVFS_GET_IP(vp)->i_mapping->nrpages)
> #define VN_DIRTY(vp) mapping_tagged(LINVFS_GET_IP(vp)->i_mapping, \
> PAGECACHE_TAG_DIRTY)

It also needs to check
!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_nonlinear))

Various arches need a similar conversion too (and use page_mapping(page)
rather than page->mapping: see arch and include/asm in my anobjrmap 3/6).

Those arches which do more than test list_empty (now prio_tree_empty),
arm and parisc (I think that's all), look as if they can take full
advantage of the prio tree; and I hope we can ignore the nonlinears
in those cases - if a page is mapped in a nonlinear vma it may suffer
from D-cache aliasing inconsistencies if also mapped elsewhere in
that user address space, never mind. Is that reasonable?

> and really some other bigger tree needs this part too (not a mainline
> issue).
>
> --- sles/fs/xfs/dmapi/dmapi_xfs.c.~1~ 2004-03-29 18:33:03.781501328 +0200
> +++ sles/fs/xfs/dmapi/dmapi_xfs.c 2004-03-29 18:58:57.754261560 +0200
> @@ -228,17 +228,21 @@ prohibited_mr_events(
> struct address_space *mapping = LINVFS_GET_IP(vp)->i_mapping;
> int prohibited = (1 << DM_EVENT_READ);
> struct vm_area_struct *vma;
> + struct prio_tree_iter iter;
>
> if (!VN_MAPPED(vp))
> return 0;
>
> #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
> down(&mapping->i_shared_sem);
> - list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
> + vma = __vma_prio_tree_first(&mapping->i_mmap_shared, &iter, 0, ULONG_MAX);
> + while (vma) {
> if (!(vma->vm_flags & VM_DENYWRITE)) {
> prohibited |= (1 << DM_EVENT_WRITE);
> break;
> }
> +
> + vma = __vma_prio_tree_next(vma, &mapping->i_mmap_shared, &iter, 0, ULONG_MAX);
> }
> up(&mapping->i_shared_sem);
> #else

This looks horrid (not your change, the original), and would need to look
at nonlinears too; but I thought this was what i_writecount < 0 is for?

Hugh

2004-03-29 18:20:57

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, Mar 29, 2004 at 07:12:34PM +0100, Hugh Dickins wrote:
> On Mon, 29 Mar 2004, Andrea Arcangeli wrote:
> >
> > Here a further update for xfs:
> >
> > --- sles/fs/xfs/linux/xfs_vnode.h.~1~ 2004-03-29 18:33:20.047028592 +0200
> > +++ sles/fs/xfs/linux/xfs_vnode.h 2004-03-29 19:02:37.101915648 +0200
> > @@ -601,8 +601,8 @@ static __inline__ void vn_flagclr(struct
> > * Some useful predicates.
> > */
> > #define VN_MAPPED(vp) \
> > - (!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap)) || \
> > - (!list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_shared))))
> > + (!prio_tree_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap)) || \
> > + (!prio_tree_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_shared))))
> > #define VN_CACHED(vp) (LINVFS_GET_IP(vp)->i_mapping->nrpages)
> > #define VN_DIRTY(vp) mapping_tagged(LINVFS_GET_IP(vp)->i_mapping, \
> > PAGECACHE_TAG_DIRTY)
>
> It also needs to check
> !list_empty(&(LINVFS_GET_IP(vp)->i_mapping->i_mmap_nonlinear))
>
> Various arches need a similar conversion too (and use page_mapping(page)
> rather than page->mapping: see arch and include/asm in my anobjrmap 3/6).
>
> Those arches which do more than test list_empty (now prio_tree_empty),
> arm and parisc (I think that's all), look as if they can take full

I've noticed arm and parisc; luckily no arm/parisc user has tried my
tree yet ;).

> advantage of the prio tree; and I hope we can ignore the nonlinears
> in those cases - if a page is mapped in a nonlinear vma it may suffer
> from D-cache aliasing inconsistencies if also mapped elsewhere in
> that user address space, never mind. Is that reasonable?

Some arch was enforcing a file offset multiple in the mmap API to
avoid aliasing issues too; nonlinear broke that. I'm not sure if that
is being taken into account, but certainly having an i_mmap_nonlinear
will make life easier for those archs.

> > and really some other bigger tree needs this part too (not a mainline
> > issue).
> >
> > --- sles/fs/xfs/dmapi/dmapi_xfs.c.~1~ 2004-03-29 18:33:03.781501328 +0200
> > +++ sles/fs/xfs/dmapi/dmapi_xfs.c 2004-03-29 18:58:57.754261560 +0200
> > @@ -228,17 +228,21 @@ prohibited_mr_events(
> > struct address_space *mapping = LINVFS_GET_IP(vp)->i_mapping;
> > int prohibited = (1 << DM_EVENT_READ);
> > struct vm_area_struct *vma;
> > + struct prio_tree_iter iter;
> >
> > if (!VN_MAPPED(vp))
> > return 0;
> >
> > #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
> > down(&mapping->i_shared_sem);
> > - list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
> > + vma = __vma_prio_tree_first(&mapping->i_mmap_shared, &iter, 0, ULONG_MAX);
> > + while (vma) {
> > if (!(vma->vm_flags & VM_DENYWRITE)) {
> > prohibited |= (1 << DM_EVENT_WRITE);
> > break;
> > }
> > +
> > + vma = __vma_prio_tree_next(vma, &mapping->i_mmap_shared, &iter, 0, ULONG_MAX);
> > }
> > up(&mapping->i_shared_sem);
> > #else
>
> This looks horrid (not your change, the original), and would need to look
> at nonlinears too; but I thought this was what i_writecount < 0 is for?

No idea what the point of this stuff is; maybe Christoph wants to
elaborate.

2004-03-29 18:38:27

by Christoph Hellwig

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, Mar 29, 2004 at 08:20:51PM +0200, Andrea Arcangeli wrote:
> > > --- sles/fs/xfs/dmapi/dmapi_xfs.c.~1~ 2004-03-29 18:33:03.781501328 +0200
> > > +++ sles/fs/xfs/dmapi/dmapi_xfs.c 2004-03-29 18:58:57.754261560 +0200
> > > @@ -228,17 +228,21 @@ prohibited_mr_events(
> > > struct address_space *mapping = LINVFS_GET_IP(vp)->i_mapping;
> > > int prohibited = (1 << DM_EVENT_READ);
> > > struct vm_area_struct *vma;
> > > + struct prio_tree_iter iter;
> > >
> > > if (!VN_MAPPED(vp))
> > > return 0;
> > >
> > > #if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
> > > down(&mapping->i_shared_sem);
> > > - list_for_each_entry(vma, &mapping->i_mmap_shared, shared) {
> > > + vma = __vma_prio_tree_first(&mapping->i_mmap_shared, &iter, 0, ULONG_MAX);
> > > + while (vma) {
> > > if (!(vma->vm_flags & VM_DENYWRITE)) {
> > > prohibited |= (1 << DM_EVENT_WRITE);
> > > break;
> > > }
> > > +
> > > + vma = __vma_prio_tree_next(vma, &mapping->i_mmap_shared, &iter, 0, ULONG_MAX);
> > > }
> > > up(&mapping->i_shared_sem);
> > > #else
> >
> > This looks horrid (not your change, the original), and would need to look
> > at nonlinears too; but I thought this was what i_writecount < 0 is for?
>
> no idea what's the point of this stuff, Christoph maybe wants to
> elaborate.

That's dmapi, a standard for Hierarchical Storage Management. The code is
not in mainline for a reason; no idea where you got it from.

AFAIK the code tries to detect whether there could be anyone writing to the
vma, but ask Dean for the details.

2004-03-29 20:38:20

by Andrew Morton

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli <[email protected]> wrote:
>
> There's now also a screwup in the writeback -mm changes for swapsuspend,
> it bugs out in radix tree tag, I believe it's because it doesn't
> insert the page in the radix tree before doing writeback I/O on it.

hmm, yes, we have pages which satisfy PageSwapCache(), but which are not
actually in swapcache.

How about we use the normal pagecache APIs for this?

(untested):

--- 25/mm/page_io.c~rw_swap_page_sync-fix Mon Mar 29 12:34:24 2004
+++ 25-akpm/mm/page_io.c Mon Mar 29 12:37:13 2004
@@ -139,7 +139,7 @@ struct address_space_operations swap_aop

/*
* A scruffy utility function to read or write an arbitrary swap page
- * and wait on the I/O.
+ * and wait on the I/O. The caller must have a ref on the page.
*/
int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
{
@@ -151,8 +151,7 @@ int rw_swap_page_sync(int rw, swp_entry_
lock_page(page);

BUG_ON(page->mapping);
- page->mapping = &swapper_space;
- page->index = entry.val;
+ add_to_page_cache(page, &swapper_space, entry.val, GFP_NOIO);

if (rw == READ) {
ret = swap_readpage(NULL, page);
@@ -161,7 +160,11 @@ int rw_swap_page_sync(int rw, swp_entry_
ret = swap_writepage(page, &swap_wbc);
wait_on_page_writeback(page);
}
- page->mapping = NULL;
+
+ lock_page(page);
+ remove_from_page_cache(page);
+ unlock_page(page);
+
if (ret == 0 && (!PageUptodate(page) || PageError(page)))
ret = -EIO;
return ret;

_

Subject: Re: 2.6.5-rc2-aa5


Andrew Morton <[email protected]> wrote:
>> Andrea Arcangeli <[email protected]> wrote:
>>
>> Notably there is a BUG_ON(page->mapping) triggering in
>> page_remove_rmap in the pagecache case. That could be ex-pagecache
>> being removed from pagecache before all ptes have been zapped; in
>> fact the page_remove_rmap triggers in the vmtruncate path.
>
> Confused. vmtruncate zaps the ptes before removing pages from
> pagecache,
> so I'd expect a non-null ->mapping in page_remove_rmap() is a very
> common
> thing. truncate a file which someone has mmapped and it'll happen every
> time, will it not?

Andrea missed a not (!) in the BUG_ON. It is BUG_ON(!page->mapping).

The race Andrea hit _may_ be the mremap vs. vmtruncate race I hit:

http://marc.theaimsgroup.com/?l=linux-mm&m=107720111303624

A first truncate that raced with mremap and left an orphaned pte.
The following truncate tried to clear the orphaned pte, and reached
page_remove_rmap with page->mapping == NULL.

Yes. It can happen in all 2.4 and 2.6 kernels.

Hugh has a better fix than mine for the mremap vs. truncate race
in his anobjrmap 7/6 patch.

http://marc.theaimsgroup.com/?l=linux-kernel&m=107998825716363

With prio_tree we have to modify Hugh's fix, though.

Thanks,
Rajesh

2004-03-29 22:25:02

by Hugh Dickins

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, 29 Mar 2004, Andrew Morton wrote:
>
> hmm, yes, we have pages which satisfy PageSwapCache(), but which are not
> actually in swapcache.
>
> How about we use the normal pagecache APIs for this?
>
> + add_to_page_cache(page, &swapper_space, entry.val, GFP_NOIO);
>...
> + remove_from_page_cache(page);

Much nicer, and it'll probably appear to work: but (also untested)
I bet you'll need an additional page_cache_release(page) - damn,
looks like hugetlbfs has found a use for that tiresome asymmetry.

Hugh

2004-03-29 22:39:10

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, Mar 29, 2004 at 12:40:27PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > There's now also a screwup in the writeback -mm changes for swapsuspend,
> > it bugs out in radix tree tag, I believe it's because it doesn't
> > insert the page in the radix tree before doing writeback I/O on it.
>
> hmm, yes, we have pages which satisfy PageSwapCache(), but which are not
> actually in swapcache.

exactly.

> How about we use the normal pagecache APIs for this?

Should work fine too, and it exposes fewer internal vm details. I will
propose your fix for testing too.

2004-03-29 22:40:56

by Andrew Morton

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli <[email protected]> wrote:
>
> On Mon, Mar 29, 2004 at 12:40:27PM -0800, Andrew Morton wrote:
> > Andrea Arcangeli <[email protected]> wrote:
> > >
> > > There's now also a screwup in the writeback -mm changes for swapsuspend,
> > > it bugs out in radix tree tag, I believe it's because it doesn't
> > > insert the page in the radix tree before doing writeback I/O on it.
> >
> > hmm, yes, we have pages which satisfy PageSwapCache(), but which are not
> > actually in swapcache.
>
> exactly.
>
> > How about we use the normal pagecache APIs for this?
>
> Should work fine too, and it exposes fewer internal vm details. I will
> propose your fix for testing too.

As Hugh points out, it was missing a page_cache_release().


25-akpm/mm/page_io.c | 12 ++++++++----
1 files changed, 8 insertions(+), 4 deletions(-)

diff -puN mm/page_io.c~rw_swap_page_sync-fix mm/page_io.c
--- 25/mm/page_io.c~rw_swap_page_sync-fix Mon Mar 29 14:41:08 2004
+++ 25-akpm/mm/page_io.c Mon Mar 29 14:41:28 2004
@@ -139,7 +139,7 @@ struct address_space_operations swap_aop

/*
* A scruffy utility function to read or write an arbitrary swap page
- * and wait on the I/O.
+ * and wait on the I/O. The caller must have a ref on the page.
*/
int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
{
@@ -151,8 +151,7 @@ int rw_swap_page_sync(int rw, swp_entry_
lock_page(page);

BUG_ON(page->mapping);
- page->mapping = &swapper_space;
- page->index = entry.val;
+ add_to_page_cache(page, &swapper_space, entry.val, GFP_NOIO);

if (rw == READ) {
ret = swap_readpage(NULL, page);
@@ -161,7 +160,12 @@ int rw_swap_page_sync(int rw, swp_entry_
ret = swap_writepage(page, &swap_wbc);
wait_on_page_writeback(page);
}
- page->mapping = NULL;
+
+ lock_page(page);
+ remove_from_page_cache(page);
+ unlock_page(page);
+ page_cache_release(page); /* For add_to_page_cache() */
+
if (ret == 0 && (!PageUptodate(page) || PageError(page)))
ret = -EIO;
return ret;

_
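The extra page_cache_release() is worth spelling out, since the asymmetry
bites easily. Here is a toy userspace model of the refcount pairing
(illustrative only; the toy_* names are invented): add_to_page_cache()
takes its own reference on the page for the mapping, but
remove_from_page_cache() only unlinks the page, so the caller has to drop
that reference explicitly.

#include <assert.h>
#include <stdio.h>

struct toy_page {
        int refcount;
        int in_cache;
};

static void toy_add_to_page_cache(struct toy_page *p)
{
        p->refcount++;          /* the pagecache's own reference */
        p->in_cache = 1;
}

static void toy_remove_from_page_cache(struct toy_page *p)
{
        p->in_cache = 0;        /* unlinks only; the reference stays */
}

static void toy_page_cache_release(struct toy_page *p)
{
        assert(p->refcount > 0);
        p->refcount--;
}

int main(void)
{
        struct toy_page page = { 1, 0 };        /* caller holds one ref */

        toy_add_to_page_cache(&page);
        /* ... synchronous swap I/O happens here ... */
        toy_remove_from_page_cache(&page);
        toy_page_cache_release(&page);          /* for add_to_page_cache() */
        printf("refcount back to %d\n", page.refcount); /* prints 1 */
        return 0;
}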

2004-03-29 22:51:14

by Andrea Arcangeli

Subject: Re: 2.6.5-rc2-aa5

On Mon, Mar 29, 2004 at 04:30:51PM -0500, Rajesh Venkatasubramanian wrote:
>
> Andrew Morton <[email protected]> wrote:
> >> Andrea Arcangeli <[email protected]> wrote:
> >>
> >> Notably there is a BUG_ON(page->mapping) triggering in
> >> page_remove_rmap in the pagecache case. That could be ex-pagecache
> >> being removed from pagecache before all ptes have been zapped; in
> >> fact the page_remove_rmap triggers in the vmtruncate path.
> >
> > Confused. vmtruncate zaps the ptes before removing pages from
> > pagecache, so I'd expect a non-null ->mapping in page_remove_rmap()
> > is a very common thing. Truncate a file which someone has mmapped
> > and it'll happen every time, will it not?
>
> Andrea missed a not (!) in the BUG_ON. It is BUG_ON(!page->mapping).

Yep sorry ;)

>
> The race Andrea hit _may_ be the mremap vs. vmtruncate race I hit:
>
> http://marc.theaimsgroup.com/?l=linux-mm&m=107720111303624
>
> A first truncate that raced with mremap and left an orphaned pte.
> The following truncate tried to clear the orphaned pte, and reached
> page_remove_rmap with page->mapping == NULL.
>
> Yes. It can happen in all 2.4 and 2.6 kernels.

ok fine, so my WARN_ON should work.

> Hugh has a better fix than mine for the mremap vs. truncate race
> in his anobjrmap 7/6 patch.
>
> http://marc.theaimsgroup.com/?l=linux-kernel&m=107998825716363
>
> With prio_tree we have to modify Hugh's fix, though.

Hugh, are you interested in extracting the fix against mainline? The
anobjrmap 7/6 is doing mostly stuff I don't really need with anon-vma.

2004-03-29 22:55:50

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, Mar 29, 2004 at 11:24:58PM +0100, Hugh Dickins wrote:
> On Mon, 29 Mar 2004, Andrew Morton wrote:
> >
> > hmm, yes, we have pages which satisfy PageSwapCache(), but which are not
> > actually in swapcache.
> >
> > How about we use the normal pagecache APIs for this?
> >
> > + add_to_page_cache(page, &swapper_space, entry.val, GFP_NOIO);
> >...
> > + remove_from_page_cache(page);
>
> Much nicer, and it'll probably appear to work: but (also untested)
> I bet you'll need an additional page_cache_release(page) - damn,

I'll add the page_cache_release before testing ;)

2004-03-29 23:09:49

by William Lee Irwin III

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, 29 Mar 2004, Andrew Morton wrote:
>> hmm, yes, we have pages which satisfy PageSwapCache(), but which are not
>> actually in swapcache.
>>
>> How about we use the normal pagecache APIs for this?
>>
>> + add_to_page_cache(page, &swapper_space, entry.val, GFP_NOIO);
>>...
>> + remove_from_page_cache(page);

On Mon, Mar 29, 2004 at 11:24:58PM +0100, Hugh Dickins wrote:
> Much nicer, and it'll probably appear to work: but (also untested)
> I bet you'll need an additional page_cache_release(page) - damn,
> looks like hugetlbfs has found a use for that tiresome asymmetry.
> Hugh

The good news is that the use isn't particularly essential.


-- wli

2004-03-31 15:07:25

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, Mar 29, 2004 at 02:42:43PM -0800, Andrew Morton wrote:
> diff -puN mm/page_io.c~rw_swap_page_sync-fix mm/page_io.c
> --- 25/mm/page_io.c~rw_swap_page_sync-fix Mon Mar 29 14:41:08 2004
> +++ 25-akpm/mm/page_io.c Mon Mar 29 14:41:28 2004
> @@ -139,7 +139,7 @@ struct address_space_operations swap_aop
>
> /*
> * A scruffy utility function to read or write an arbitrary swap page
> - * and wait on the I/O.
> + * and wait on the I/O. The caller must have a ref on the page.
> */
> int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
> {
> @@ -151,8 +151,7 @@ int rw_swap_page_sync(int rw, swp_entry_
> lock_page(page);
>
> BUG_ON(page->mapping);
> - page->mapping = &swapper_space;
> - page->index = entry.val;
> + add_to_page_cache(page, &swapper_space, entry.val, GFP_NOIO);
>
> if (rw == READ) {
> ret = swap_readpage(NULL, page);
> @@ -161,7 +160,12 @@ int rw_swap_page_sync(int rw, swp_entry_
> ret = swap_writepage(page, &swap_wbc);
> wait_on_page_writeback(page);
> }
> - page->mapping = NULL;
> +
> + lock_page(page);
> + remove_from_page_cache(page);
> + unlock_page(page);
> + page_cache_release(page); /* For add_to_page_cache() */
> +
> if (ret == 0 && (!PageUptodate(page) || PageError(page)))
> ret = -EIO;
> return ret;

I checked this into CVS last night and today I got this new oops in
bugzilla:

hda: completing PM request, resume
Writing data to swap (18536 pages): .<1>Unable to handle kernel NULL pointer dereference at virtual address 00000004
printing eip:
c01daf24
*pde = 00000000
Oops: 0000 [#1]
CPU: 0
EIP: 0060:[<c01daf24>] Tainted: P
EFLAGS: 00010082 (2.6.4-40.3-default)
EIP is at radix_tree_delete+0x14/0x160
eax: 00000004 ebx: c16b6880 ecx: 00000016 edx: 00001d69
esi: 00001d69 edi: 00000010 ebp: 000011ae esp: f7329e1c
ds: 007b es: 007b ss: 0068
Process powersaved (pid: 4216, threadinfo=f7328000 task=f751f250)
Stack: 00000000 f51b6e00 00000004 00000006 f6326200 f63262bc 0000002e c0108d48
c041f4c0 00000000 000003fd 000026cd c041f4c0 c03ffd45 00000320 0000007b
0000007b ffffff00 c021b78e c16b6880 c0342d60 000011ae 000011ae 000011ae
Call Trace:
[<c0108d48>] common_interrupt+0x18/0x20
[<c021b78e>] serial_in+0x1e/0x40
[<c0150f2c>] swap_free+0x1c/0x30
[<c0152897>] remove_exclusive_swap_page+0x97/0x155
[<c013be2f>] __remove_from_page_cache+0x3f/0xa0
[<c013beab>] remove_from_page_cache+0x1b/0x27
[<c014fe5c>] rw_swap_page_sync+0x9c/0x1b0
[<c0135a9d>] do_magic_suspend_2+0x27d/0x7d0
[<c0125fb0>] process_timeout+0x0/0x10
[<c011ad1e>] __wake_up+0xe/0x20
[<f952be8d>] snd_intel8x0_suspend+0x1d/0x40 [snd_intel8x0]
[<c01e3586>] pci_device_suspend+0x16/0x20
[<c027701d>] do_magic+0x4d/0x130
[<c0135520>] software_suspend+0xd0/0xe0
[<c01fc176>] acpi_system_write_sleep+0xb5/0xd2
[<c01fc0c1>] acpi_system_write_sleep+0x0/0xd2
[<c015514e>] vfs_write+0xae/0xf0
[<c015522c>] sys_write+0x2c/0x50
[<c0107dc9>] sysenter_past_esp+0x52/0x79

Code: 8b 28 8d 7c 24 10 3b 14 ad a0 b9 41 c0 0f 87 18 01 00 00 8d

the oops is in a different place. It seems to bomb in
__remove_from_page_cache while calling radix_tree_delete, as if the
radix_tree_insert didn't work out. I believe it's because you're not
checking the retval of add_to_page_cache: if it runs oom in the radix
tree insert it will crash. You used GFP_NOIO, that's wrong, it should be
GFP_KERNEL to guarantee the allocation. There's no reason to use
GFP_NOIO as far as I can tell.

Furthermore I was thinking your patch is still too low-level; it's
better to use the swapcache entry/exit points, which already do the
hardness checks and the page_cache_release automatically, plus they pin
the swap page so there's no risk of disk corruption etc...

So I rewrote the fix this way:


diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids x-ref/mm/page_io.c x/mm/page_io.c
--- x-ref/mm/page_io.c 2004-03-31 16:57:25.505978008 +0200
+++ x/mm/page_io.c 2004-03-31 17:06:07.028694504 +0200
@@ -139,7 +139,7 @@ struct address_space_operations swap_aop

/*
* A scruffy utility function to read or write an arbitrary swap page
- * and wait on the I/O.
+ * and wait on the I/O. The caller must have a ref on the page.
*/
int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
{
@@ -149,10 +149,9 @@ int rw_swap_page_sync(int rw, swp_entry_
};

lock_page(page);
-
- BUG_ON(page->mapping);
- page->mapping = &swapper_space;
- page->index = entry.val;
+ ret = add_to_swap_cache(page, entry);
+ if (unlikely(ret))
+ goto out_unlock;

if (rw == READ) {
ret = swap_readpage(NULL, page);
@@ -161,7 +160,12 @@ int rw_swap_page_sync(int rw, swp_entry_
ret = swap_writepage(page, &swap_wbc);
wait_on_page_writeback(page);
}
- page->mapping = NULL;
+
+ lock_page(page);
+ delete_from_swap_cache(page);
+ out_unlock:
+ unlock_page(page);
+
if (ret == 0 && (!PageUptodate(page) || PageError(page)))
ret = -EIO;
return ret;


I hope this will work (untested).

2004-03-31 15:26:47

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Wed, Mar 31, 2004 at 05:07:18PM +0200, Andrea Arcangeli wrote:
> diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids x-ref/mm/page_io.c x/mm/page_io.c
> --- x-ref/mm/page_io.c 2004-03-31 16:57:25.505978008 +0200
> +++ x/mm/page_io.c 2004-03-31 17:06:07.028694504 +0200
> @@ -139,7 +139,7 @@ struct address_space_operations swap_aop
>
> /*
> * A scruffy utility function to read or write an arbitrary swap page
> - * and wait on the I/O.
> + * and wait on the I/O. The caller must have a ref on the page.
> */
> int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
> {
> @@ -149,10 +149,9 @@ int rw_swap_page_sync(int rw, swp_entry_
> };
>
> lock_page(page);
> -
> - BUG_ON(page->mapping);
> - page->mapping = &swapper_space;
> - page->index = entry.val;
> + ret = add_to_swap_cache(page, entry);
> + if (unlikely(ret))
> + goto out_unlock;
>
> if (rw == READ) {
> ret = swap_readpage(NULL, page);
> @@ -161,7 +160,12 @@ int rw_swap_page_sync(int rw, swp_entry_
> ret = swap_writepage(page, &swap_wbc);
> wait_on_page_writeback(page);
> }
> - page->mapping = NULL;
> +
> + lock_page(page);
> + delete_from_swap_cache(page);
> + out_unlock:
> + unlock_page(page);
> +
> if (ret == 0 && (!PageUptodate(page) || PageError(page)))
> ret = -EIO;
> return ret;
>
>

this trivial bit is needed as well to allow compilation, you can append
it to the previous patch:

--- x/include/linux/swap.h.~1~ 2004-03-31 17:13:05.064143456 +0200
+++ x/include/linux/swap.h 2004-03-31 17:21:34.241736696 +0200
@@ -192,6 +192,7 @@ extern struct address_space swapper_spac
#define total_swapcache_pages swapper_space.nrpages
extern void show_swap_cache_info(void);
extern int add_to_swap(struct page *);
+extern int add_to_swap_cache(struct page *page, swp_entry_t entry);
extern void __delete_from_swap_cache(struct page *);
extern void delete_from_swap_cache(struct page *);
extern int move_to_swap_cache(struct page *, swp_entry_t);
--- x/mm/swap_state.c.~1~ 2004-03-31 17:13:05.249115336 +0200
+++ x/mm/swap_state.c 2004-03-31 17:21:15.201631232 +0200
@@ -56,7 +56,7 @@ void show_swap_cache_info(void)
swap_cache_info.noent_race, swap_cache_info.exist_race);
}

-static int add_to_swap_cache(struct page *page, swp_entry_t entry)
+int add_to_swap_cache(struct page *page, swp_entry_t entry)
{
int error;

2004-03-31 16:46:01

by Hugh Dickins

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Wed, 31 Mar 2004, Andrea Arcangeli wrote:
>
> So I rewrote the fix this way:
>
> + ret = add_to_swap_cache(page, entry);

I think you'll find that gets into trouble on the header page,
entry 0, which pmdisk/swsusp does access through this interface,
but swapping does not: I'd expect its swap_duplicate to fail.
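
The relevant bit of sys_swapon, roughly (a sketch from memory, not
verbatim source):

	/* entry 0 holds the swap header: never a usable swap slot,
	 * so swap_duplicate() on it fails, and add_to_swap_cache
	 * fails with it */
	p->swap_map[0] = SWAP_MAP_BAD;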

I've put off dealing with this, wasn't a priority for me to
decide what to do with it. You might experiment with setting
p->swap_map[0] = 1 instead of SWAP_MAP_BAD in sys_swapon, but
offhand I'm unsure whether that's enough e.g. would the totals
come out right, would swapoff complete?

Just an idea, not something to finalize.

Hugh

2004-03-31 17:29:00

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Wed, Mar 31, 2004 at 05:45:31PM +0100, Hugh Dickins wrote:
> On Wed, 31 Mar 2004, Andrea Arcangeli wrote:
> >
> > So I rewrote the fix this way:
> >
> > + ret = add_to_swap_cache(page, entry);
>
> I think you'll find that gets into trouble on the header page,
> entry 0, which pmdisk/swsusp does access through this interface,
> but swapping does not: I'd expect its swap_duplicate to fail.

I didn't know they have to modify the header page.

> I've put off dealing with this, wasn't a priority for me to
> decide what to do with it. You might experiment with setting
> p->swap_map[0] = 1 instead of SWAP_MAP_BAD in sys_swapon, but
> offhand I'm unsure whether that's enough e.g. would the totals
> come out right, would swapoff complete?
>
> Just an idea, not something to finalize.

if they run into trouble I'll return to the pagecache API, adding
GFP_KERNEL and checking for oom failure.

2004-04-01 00:46:00

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Wed, Mar 31, 2004 at 07:28:51PM +0200, Andrea Arcangeli wrote:
> if they run into trouble I'll return to the pagecache API, adding
> GFP_KERNEL and checking for oom failure.

there were troubles with the header indeed. So I went back to the
pagecache version (now fixed with GFP_KERNEL and oom retval checking).

the oops I got with the header trouble was weird (but at least the
previous radix_tree_delete crash is gone), so it's not completely clear
whether this will be enough to make it work as well as it did before
the -mm writeback changes. I tried to reproduce but apparently acpi
does nothing here for an echo 4 > sleep :/.

diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids x-ref/mm/page_io.c x/mm/page_io.c
--- x-ref/mm/page_io.c 2004-04-01 02:09:53.846664248 +0200
+++ x/mm/page_io.c 2004-04-01 02:11:41.526294456 +0200
@@ -139,7 +139,7 @@ struct address_space_operations swap_aop

/*
* A scruffy utility function to read or write an arbitrary swap page
- * and wait on the I/O.
+ * and wait on the I/O. The caller must have a ref on the page.
*/
int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
{
@@ -151,8 +151,11 @@ int rw_swap_page_sync(int rw, swp_entry_
lock_page(page);

BUG_ON(page->mapping);
- page->mapping = &swapper_space;
- page->index = entry.val;
+ ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);
+ if (unlikely(ret)) {
+ unlock_page(page);
+ return ret;
+ }

if (rw == READ) {
ret = swap_readpage(NULL, page);
@@ -161,7 +164,12 @@ int rw_swap_page_sync(int rw, swp_entry_
ret = swap_writepage(page, &swap_wbc);
wait_on_page_writeback(page);
}
- page->mapping = NULL;
+
+ lock_page(page);
+ remove_from_page_cache(page);
+ unlock_page(page);
+ page_cache_release(page); /* For add_to_page_cache() */
+
if (ret == 0 && (!PageUptodate(page) || PageError(page)))
ret = -EIO;
return ret;

2004-04-01 01:20:26

by Andrew Morton

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli <[email protected]> wrote:
>
> @@ -151,8 +151,11 @@ int rw_swap_page_sync(int rw, swp_entry_
> lock_page(page);
>
> BUG_ON(page->mapping);
> - page->mapping = &swapper_space;
> - page->index = entry.val;
> + ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);

Doing a __GFP_FS allocation while holding lock_page() is worrisome. It's
OK if that page is private, but how do we know that the caller didn't pass
us some page which is on the LRU?

The only place where I think we can deadlock is if that GFP_KERNEL
allocation tries to write out the page we hold a lock on, and we hit an
error running swap_writepage() and then enter handle_write_error().

This actually cannot happen because swap_writepage() can only fail if
bio_alloc() fails, and that uses a mempool. But ick.
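
Spelled out, the chain I'm worried about looks roughly like this
(hypothetical, exactly because the mempool saves us from the last step):

	lock_page(page);
	add_to_page_cache(..., GFP_KERNEL)
	  -> radix-tree node allocation enters page reclaim (__GFP_FS)
	    -> reclaim finds our page on the LRU, calls swap_writepage()
	      -> the write fails, handle_write_error() does lock_page()
	        -> deadlock: we already hold that page's lock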

Your patch seems reasonable to run with for now, but to be totally anal
about it, I'll run with the below monstrosity.



diff -puN mm/page_io.c~rw_swap_page_sync-fix mm/page_io.c
--- 25/mm/page_io.c~rw_swap_page_sync-fix Wed Mar 31 16:55:44 2004
+++ 25-akpm/mm/page_io.c Wed Mar 31 17:15:31 2004
@@ -19,6 +19,7 @@
#include <linux/buffer_head.h> /* for block_sync_page() */
#include <linux/mpage.h>
#include <linux/writeback.h>
+#include <linux/radix-tree.h>
#include <asm/pgtable.h>

static struct bio *
@@ -137,9 +138,11 @@ struct address_space_operations swap_aop
.set_page_dirty = __set_page_dirty_nobuffers,
};

+#if defined(CONFIG_SOFTWARE_SUSPEND) || defined(CONFIG_PM_DISK)
+
/*
* A scruffy utility function to read or write an arbitrary swap page
- * and wait on the I/O.
+ * and wait on the I/O. The caller must have a ref on the page.
*/
int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
{
@@ -148,11 +151,30 @@ int rw_swap_page_sync(int rw, swp_entry_
.sync_mode = WB_SYNC_ALL,
};

- lock_page(page);
-
BUG_ON(page->mapping);
- page->mapping = &swapper_space;
- page->index = entry.val;
+
+ /*
+ * We shouldn't perform add_to_page_cache(..., GFP_KERNEL) inside
+ * lock_page(), so here we do bizarre things to arrange for the page
+ * to be locked while ensuring that this CPU has sufficient pooled
+ * radix-tree nodes for a successful add_to_page_cache().
+ */
+ for ( ; ; ) {
+ ret = radix_tree_preload(GFP_KERNEL);
+ if (ret)
+ goto out;
+ if (TestSetPageLocked(page) == 0)
+ break;
+ radix_tree_preload_end();
+ lock_page(page);
+ unlock_page(page);
+ }
+ ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_ATOMIC);
+ radix_tree_preload_end();
+ if (ret) {
+ unlock_page(page);
+ goto out;
+ }

if (rw == READ) {
ret = swap_readpage(NULL, page);
@@ -161,8 +183,15 @@ int rw_swap_page_sync(int rw, swp_entry_
ret = swap_writepage(page, &swap_wbc);
wait_on_page_writeback(page);
}
- page->mapping = NULL;
+
+ lock_page(page);
+ remove_from_page_cache(page);
+ unlock_page(page);
+ page_cache_release(page); /* For add_to_page_cache() */
+
if (ret == 0 && (!PageUptodate(page) || PageError(page)))
ret = -EIO;
+out:
return ret;
}
+#endif

_
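
(The preload dance is because radix_tree_preload() fills a per-CPU pool
of radix-tree nodes and returns with preemption disabled, while
radix_tree_preload_end() re-enables it; lock_page() can sleep, so it
must not be called while preloaded, hence the TestSetPageLocked() loop
that only keeps the preload when the trylock succeeds.)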

2004-04-01 01:26:28

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Wed, Mar 31, 2004 at 05:22:16PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > @@ -151,8 +151,11 @@ int rw_swap_page_sync(int rw, swp_entry_
> > lock_page(page);
> >
> > BUG_ON(page->mapping);
> > - page->mapping = &swapper_space;
> > - page->index = entry.val;
> > + ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);
>
> Doing a __GFP_FS allocation while holding lock_page() is worrisome. It's
> OK if that page is private, but how do we know that the caller didn't pass
> us some page which is on the LRU?

it _has_ to be private if it's using rw_swap_page_sync. How can a page
be in a lru if we're going to execute add_to_page_cache on it? That
would be pretty broken in the first place.

> Your patch seems reasonable to run with for now, but to be totally anal
> about it, I'll run with the below monstrosity.

It's not needed IMO. We also already bugcheck on page->mapping; if
you're scared about the page being on the LRU, you can add further
bugchecks on PageLRU etc. Calling add_to_page_cache on anything that is
already visible to the VM on some LRU is broken by design and should be
forbidden. All the users of swap suspend must work with freshly
allocated pages; the page_mapped bugcheck already covers most of the
cases.

2004-04-01 01:51:21

by Andrew Morton

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli <[email protected]> wrote:
>
> > Doing a __GFP_FS allocation while holding lock_page() is worrisome. It's
> > OK if that page is private, but how do we know that the caller didn't pass
> > us some page which is on the LRU?
>
> it _has_ to be private if it's using rw_swap_page_sync. How can a page
> be in a lru if we're going to execute add_to_page_cache on it? That
> would be pretty broken in the first place.

An anonymous user page meets these requirements. I did say "anal", but
rw_swap_page_sync() is a general-purpose library function and we shouldn't
be making assumptions about the type of page which the caller happens to be
feeding us.


2004-04-01 02:02:01

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Wed, Mar 31, 2004 at 05:51:13PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > > Doing a __GFP_FS allocation while holding lock_page() is worrisome. It's
> > > OK if that page is private, but how do we know that the caller didn't pass
> > > us some page which is on the LRU?
> >
> > it _has_ to be private if it's using rw_swap_page_sync. How can a page
> > be in a lru if we're going to execute add_to_page_cache on it? That
> > would be pretty broken in the first place.
>
> An anonymous user page meets these requirements. I did say "anal", but
> rw_swap_page_sync() is a general-purpose library function and we shouldn't
> be making assumptions about the type of page which the caller happens to be
> feeding us.

that is a specialized backdoor to do I/O on _private_ pages, it's not a
general-purpose library function for doing anonymous pages
swapin/swapout; in fact the only user is swap suspend, and we'd better
forbid swap suspend from passing anonymous pages through that interface
and be sure that nobody will ever attempt anything like that.

that interface is useful only to reach the swap device, for doing I/O on
private pages outside the VM; in the old days it was used to read/write
the swap header (again on a private page), and swap suspend is using it
for similar reasons on _private_ pages.

the idea of allowing people to do I/O on anonymous pages using that
interface sounds broken to me. Your code sounds overly complicated for
allowing something that we definitely must forbid.
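
The intended calling pattern boils down to something like this (a
sketch; the type/offset feeding swp_entry() are placeholders, in real
life the caller owns that swap slot):

	/* a freshly allocated private page, never on any LRU */
	struct page *page = alloc_page(GFP_KERNEL);
	swp_entry_t entry = swp_entry(type, offset);
	int err;

	err = rw_swap_page_sync(WRITE, entry, page);	/* synchronous I/O */
	/* ... check err, reuse or free the page ... */
	__free_page(page);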

2004-04-01 05:05:50

by Hugh Dickins

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Thu, 1 Apr 2004, Andrea Arcangeli wrote:
> On Wed, Mar 31, 2004 at 05:51:13PM -0800, Andrew Morton wrote:
> > rw_swap_page_sync() is a general-purpose library function and we shouldn't
> > be making assumptions about the type of page which the caller happens to be
> > feeding us.
>
> that is a specialized backdoor to do I/O on _private_ pages, it's not a
> general-purpose library function for doing anonymous pages

I'm not against anal checks (except personally :), but I'm very much
with Andrea on this: rw_swap_page_sync is horrid, but does manage to
do a particular job. The header page is great fun: sys_swapon and
mkswap read and write it by a totally different route, I shudder
(especially when it's a swapfile with blocksize less than pagesize).
It would be nice to make it more general and correct, but that's
not something you should get stuck on right now.

Hugh

2004-04-01 13:35:58

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Let's forget the "should we allow people to use rw_swap_page_sync to
swapout/swapin anonymous pages" discussion; there's a major issue:
my latest patch still doesn't work:

Writing data to swap (5354 pages): .<1>Unable to handle kernel NULL pointer dereference at virtual address 00000004
printing eip:
c01d9b34
*pde = 00000000
Oops: 0000 [#1]
CPU: 0
EIP: 0060:[<c01d9b34>] Not tainted
EFLAGS: 00010082 (2.6.4-41.8-default)
EIP is at radix_tree_delete+0x14/0x160
eax: 00000004 ebx: c10361c0 ecx: 00000016 edx: 000023ee
esi: 000023ee edi: 00000000 ebp: 000000d0 esp: cdee5e1c
ds: 007b es: 007b ss: 0068
Process bash (pid: 1, threadinfo=cdee4000 task=cdf9d7b0)
Stack: 00000000 f7b0d200 00000004 00000016 c041d440 c03ffe2e c0108d48 c041d440
00000000 000003fd 000026b6 c041d440 c03ffe2e 00000320 0000007b ffff007b
ffffff00 c021a39e 00000060 c10361c0 c0341d20 00000056 00000056 00000056
Call Trace:
[<c0108d48>] common_interrupt+0x18/0x20
[<c021a39e>] serial_in+0x1e/0x40
[<c014fc3c>] swap_free+0x1c/0x30
[<c0151597>] remove_exclusive_swap_page+0x97/0x155
[<c013bc1f>] __remove_from_page_cache+0x3f/0xa0
[<c013bc9b>] remove_from_page_cache+0x1b/0x27
[<c014eb59>] rw_swap_page_sync+0xa9/0x1d0
[<c013588d>] do_magic_suspend_2+0x27d/0x7d0
[<c0275c2d>] do_magic+0x4d/0x130
[<c0135310>] software_suspend+0xd0/0xe0
[<c01fad86>] acpi_system_write_sleep+0xb5/0xd2
[<c01facd1>] acpi_system_write_sleep+0x0/0xd2
[<c0153e4e>] vfs_write+0xae/0xf0
[<c0153f2c>] sys_write+0x2c/0x50
[<c0107dc9>] sysenter_past_esp+0x52/0x79

Code: 8b 28 8d 7c 24 10 3b 14 ad 00 99 41 c0 0f 87 18 01 00 00 8d
<0>Kernel panic: Attempted to kill init!


Pavel told me an SMP kernel cannot suspend, that's probably why I
couldn't reproduce, I'll recompile UP and hopefully I will be able to
reproduce, so I can debug it, and I can try Andrew's latest patch too
(the one allowing anonymous memory swapin/swapouts too).

2004-04-01 15:09:23

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Thu, Apr 01, 2004 at 03:35:55PM +0200, Andrea Arcangeli wrote:
> Let's forget the "should we allow people to use rw_swap_page_sync to
> swapout/swapin anonymous pages" discussion; there's a major issue:
> my latest patch still doesn't work:
>
> Writing data to swap (5354 pages): .<1>Unable to handle kernel NULL pointer dereference at virtual address 00000004
> printing eip:
> c01d9b34
> *pde = 00000000
> Oops: 0000 [#1]
> CPU: 0
> EIP: 0060:[<c01d9b34>] Not tainted
> EFLAGS: 00010082 (2.6.4-41.8-default)
> EIP is at radix_tree_delete+0x14/0x160
> eax: 00000004 ebx: c10361c0 ecx: 00000016 edx: 000023ee
> esi: 000023ee edi: 00000000 ebp: 000000d0 esp: cdee5e1c
> ds: 007b es: 007b ss: 0068
> Process bash (pid: 1, threadinfo=cdee4000 task=cdf9d7b0)
> Stack: 00000000 f7b0d200 00000004 00000016 c041d440 c03ffe2e c0108d48 c041d440
> 00000000 000003fd 000026b6 c041d440 c03ffe2e 00000320 0000007b ffff007b
> ffffff00 c021a39e 00000060 c10361c0 c0341d20 00000056 00000056 00000056
> Call Trace:
> [<c0108d48>] common_interrupt+0x18/0x20
> [<c021a39e>] serial_in+0x1e/0x40
> [<c014fc3c>] swap_free+0x1c/0x30
> [<c0151597>] remove_exclusive_swap_page+0x97/0x155
> [<c013bc1f>] __remove_from_page_cache+0x3f/0xa0
> [<c013bc9b>] remove_from_page_cache+0x1b/0x27
> [<c014eb59>] rw_swap_page_sync+0xa9/0x1d0
> [<c013588d>] do_magic_suspend_2+0x27d/0x7d0
> [<c0275c2d>] do_magic+0x4d/0x130
> [<c0135310>] software_suspend+0xd0/0xe0
> [<c01fad86>] acpi_system_write_sleep+0xb5/0xd2
> [<c01facd1>] acpi_system_write_sleep+0x0/0xd2
> [<c0153e4e>] vfs_write+0xae/0xf0
> [<c0153f2c>] sys_write+0x2c/0x50
> [<c0107dc9>] sysenter_past_esp+0x52/0x79
>
> Code: 8b 28 8d 7c 24 10 3b 14 ad 00 99 41 c0 0f 87 18 01 00 00 8d
> <0>Kernel panic: Attempted to kill init!
>
>
> Pavel told me an SMP kernel cannot suspend, that's probably why I
> couldn't reproduce, I'll recompile UP and hopefully I will be able to
> reproduce, so I can debug it, and I can try Andrew's latest patch too
> (the one allowing anonymous memory swapin/swapouts too).

I think I got it; this should fix it. I'll check it into CVS so they can
test it. I still can't test it myself unfortunately, acpi hangs.

diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids x-ref/mm/page_io.c x/mm/page_io.c
--- x-ref/mm/page_io.c 2004-04-01 17:07:10.231289760 +0200
+++ x/mm/page_io.c 2004-04-01 17:07:33.182800600 +0200
@@ -139,7 +139,7 @@ struct address_space_operations swap_aop

/*
* A scruffy utility function to read or write an arbitrary swap page
- * and wait on the I/O.
+ * and wait on the I/O. The caller must have a ref on the page.
*/
int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
{
@@ -151,8 +151,16 @@ int rw_swap_page_sync(int rw, swp_entry_
lock_page(page);

BUG_ON(page->mapping);
- page->mapping = &swapper_space;
- page->index = entry.val;
+ ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);
+ if (unlikely(ret)) {
+ unlock_page(page);
+ return ret;
+ }
+ /*
+ * get one more reference to make page non-exclusive so
+ * remove_exclusive_swap_page won't mess with it.
+ */
+ page_cache_get(page);

if (rw == READ) {
ret = swap_readpage(NULL, page);
@@ -161,7 +169,13 @@ int rw_swap_page_sync(int rw, swp_entry_
ret = swap_writepage(page, &swap_wbc);
wait_on_page_writeback(page);
}
- page->mapping = NULL;
+
+ lock_page(page);
+ remove_from_page_cache(page);
+ unlock_page(page);
+ page_cache_release(page);
+ page_cache_release(page); /* For add_to_page_cache() */
+
if (ret == 0 && (!PageUptodate(page) || PageError(page)))
ret = -EIO;
return ret;

2004-04-01 15:15:38

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

> diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids x-ref/mm/page_io.c x/mm/page_io.c
> --- x-ref/mm/page_io.c 2004-04-01 17:07:10.231289760 +0200
> +++ x/mm/page_io.c 2004-04-01 17:07:33.182800600 +0200
> @@ -139,7 +139,7 @@ struct address_space_operations swap_aop
>
> /*
> * A scruffy utility function to read or write an arbitrary swap page
> - * and wait on the I/O.
> + * and wait on the I/O. The caller must have a ref on the page.
> */
> int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
> {
> @@ -151,8 +151,16 @@ int rw_swap_page_sync(int rw, swp_entry_
> lock_page(page);
>
> BUG_ON(page->mapping);
> - page->mapping = &swapper_space;
> - page->index = entry.val;
> + ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);
> + if (unlikely(ret)) {
> + unlock_page(page);
> + return ret;
> + }
> + /*
> + * get one more reference to make page non-exclusive so
> + * remove_exclusive_swap_page won't mess with it.
> + */
> + page_cache_get(page);
>
> if (rw == READ) {
> ret = swap_readpage(NULL, page);
> @@ -161,7 +169,13 @@ int rw_swap_page_sync(int rw, swp_entry_
> ret = swap_writepage(page, &swap_wbc);
> wait_on_page_writeback(page);
> }
> - page->mapping = NULL;
> +
> + lock_page(page);
> + remove_from_page_cache(page);
> + unlock_page(page);
> + page_cache_release(page);
> + page_cache_release(page); /* For add_to_page_cache() */
> +
> if (ret == 0 && (!PageUptodate(page) || PageError(page)))
> ret = -EIO;
> return ret;

incrementally to the above I applied these hardness checks in the
anon-vma patch, so we're safe against the problem Andrew outlined
(somebody attempting to do swapin/swapouts of anonymous pages through
that interface, something that shouldn't happen since we want only the
VM to deal with userspace mapped pages).

@@ -149,8 +149,14 @@ int rw_swap_page_sync(int rw, swp_entry_
};

lock_page(page);
-
+ /*
+ * This library call can be only used to do I/O
+ * on _private_ pages just allocated with alloc_pages().
+ */
BUG_ON(page->mapping);
+ BUG_ON(PageSwapCache(page));
+ BUG_ON(PageAnon(page));
+ BUG_ON(PageLRU(page));
ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);
if (unlikely(ret)) {
unlock_page(page);

2004-04-02 00:15:42

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Thu, Apr 01, 2004 at 05:15:34PM +0200, Andrea Arcangeli wrote:
> > diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids x-ref/mm/page_io.c x/mm/page_io.c
> > --- x-ref/mm/page_io.c 2004-04-01 17:07:10.231289760 +0200
> > +++ x/mm/page_io.c 2004-04-01 17:07:33.182800600 +0200
> > @@ -139,7 +139,7 @@ struct address_space_operations swap_aop
> >
> > /*
> > * A scruffy utility function to read or write an arbitrary swap page
> > - * and wait on the I/O.
> > + * and wait on the I/O. The caller must have a ref on the page.
> > */
> > int rw_swap_page_sync(int rw, swp_entry_t entry, struct page *page)
> > {
> > @@ -151,8 +151,16 @@ int rw_swap_page_sync(int rw, swp_entry_
> > lock_page(page);
> >
> > BUG_ON(page->mapping);
> > - page->mapping = &swapper_space;
> > - page->index = entry.val;
> > + ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);
> > + if (unlikely(ret)) {
> > + unlock_page(page);
> > + return ret;
> > + }
> > + /*
> > + * get one more reference to make page non-exclusive so
> > + * remove_exclusive_swap_page won't mess with it.
> > + */
> > + page_cache_get(page);
> >
> > if (rw == READ) {
> > ret = swap_readpage(NULL, page);
> > @@ -161,7 +169,13 @@ int rw_swap_page_sync(int rw, swp_entry_
> > ret = swap_writepage(page, &swap_wbc);
> > wait_on_page_writeback(page);
> > }
> > - page->mapping = NULL;
> > +
> > + lock_page(page);
> > + remove_from_page_cache(page);
> > + unlock_page(page);
> > + page_cache_release(page);
> > + page_cache_release(page); /* For add_to_page_cache() */
> > +
> > if (ret == 0 && (!PageUptodate(page) || PageError(page)))
> > ret = -EIO;
> > return ret;
>
> incrementally to the above I applied these hardness checks in the
> anon-vma patch, so we're safe against the problem Andrew outlined
> (somebody attempting to do swapin/swapouts of anonymous pages through
> that interface, something that shouldn't happen since we want only the
> VM to deal with userspace mapped pages).
>
> @@ -149,8 +149,14 @@ int rw_swap_page_sync(int rw, swp_entry_
> };
>
> lock_page(page);
> -
> + /*
> + * This library call can be only used to do I/O
> + * on _private_ pages just allocated with alloc_pages().
> + */
> BUG_ON(page->mapping);
> + BUG_ON(PageSwapCache(page));
> + BUG_ON(PageAnon(page));
> + BUG_ON(PageLRU(page));
> ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);
> if (unlikely(ret)) {
> unlock_page(page);


the good thing is that I believe this fix will make it work with the -mm
writeback changes. However this fix now collides with anon-vma since
swapsuspend passes compound pages to rw_swap_page_sync and
add_to_page_cache overwrites page->private and the kernel crashes at the
next page_cache_get() since page->private is now the swap entry and not
a page_t pointer. So I guess I've a good reason now to giveup trying to
add the page to the swapcache, and to just fake the radix tree like I
did in my original fix. That way the page won't be swapcache either so I
don't even need to use get_page to avoid remove_exclusive_swap_page to
mess with it.

2004-04-02 00:50:13

by Andrew Morton

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli <[email protected]> wrote:
>
> > @@ -149,8 +149,14 @@ int rw_swap_page_sync(int rw, swp_entry_
> > };
> >
> > lock_page(page);
> > -
> > + /*
> > + * This library call can be only used to do I/O
> > + * on _private_ pages just allocated with alloc_pages().
> > + */
> > BUG_ON(page->mapping);
> > + BUG_ON(PageSwapCache(page));
> > + BUG_ON(PageAnon(page));
> > + BUG_ON(PageLRU(page));
> > ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);
> > if (unlikely(ret)) {
> > unlock_page(page);
>
>
> the good thing is that I believe this fix will make it work with the -mm
> writeback changes. However this fix now collides with anon-vma since
> swapsuspend passes compound pages to rw_swap_page_sync and
> add_to_page_cache overwrites page->private and the kernel crashes at the
> next page_cache_get() since page->private is now the swap entry and not
> a page_t pointer.

Why do swapcache pages have their ->index in ->private? That should have
been commented.

(hugetlb pages are also added to pagecache, and they are compound, but the
code looks OK).

> So I guess I've a good reason now to give up trying to
> add the page to the swapcache, and to just fake the radix tree like I
> did in my original fix. That way the page won't be swapcache either so I
> don't even need to use get_page to avoid remove_exclusive_swap_page to
> mess with it.

The BUG_ON in radix_tree_tag_set() is a fairly arbitrary sanity check:
"hey, why are you tagging a non-existent item?".

We could simply replace it with a `return NULL;'?

2004-04-02 01:03:18

by Hugh Dickins

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, 2 Apr 2004, Andrea Arcangeli wrote:
>
> the good thing is that I believe this fix will make it work with the -mm
> writeback changes. However this fix now collides with anon-vma since
> swapsuspend passes compound pages to rw_swap_page_sync and
> add_to_page_cache overwrites page->private and the kernel crashes at the
> next page_cache_get() since page->private is now the swap entry and not
> a page_t pointer. So I guess I've a good reason now to give up trying to
> add the page to the swapcache, and to just fake the radix tree like I
> did in my original fix. That way the page won't be swapcache either so I
> don't even need to use get_page to avoid remove_exclusive_swap_page to
> mess with it.

Yes, I too was feeling that we'd gone far enough in this "make it like
a real swap page" direction, and we'd probably have better luck with
"take away all resemblance to a real swap page".

I've still done no work or testing on rw_swap_page_sync, but I wonder...
remember how your page_mapping(page) gives &swapper_space on a swap
cache page, whereas my page_mapping(page) gives NULL on them? My guess
(quite possibly wrong) is that I won't have any of the trouble you've
had with this, that the page_writeback functions, seeing NULL mapping,
won't get involved with the radix tree at all - and why should they,
it isn't doing anything useful for rw_swap_page_sync, just getting you
into memory allocation difficulties. No need for add_to_page_cache or
add_to_swap_cache there at all. As I say, I haven't tested this path,
but I do know that the rest of swap works fine with NULL page_mapping.

Hugh

2004-04-02 01:06:55

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Thu, Apr 01, 2004 at 04:52:16PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > > @@ -149,8 +149,14 @@ int rw_swap_page_sync(int rw, swp_entry_
> > > };
> > >
> > > lock_page(page);
> > > -
> > > + /*
> > > + * This library call can be only used to do I/O
> > > + * on _private_ pages just allocated with alloc_pages().
> > > + */
> > > BUG_ON(page->mapping);
> > > + BUG_ON(PageSwapCache(page));
> > > + BUG_ON(PageAnon(page));
> > > + BUG_ON(PageLRU(page));
> > > ret = add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);
> > > if (unlikely(ret)) {
> > > unlock_page(page);
> >
> >
> > the good thing is that I believe this fix will make it work with the -mm
> > writeback changes. However this fix now collides with anon-vma since
> > swapsuspend passes compound pages to rw_swap_page_sync and
> > add_to_page_cache overwrites page->private and the kernel crashes at the
> > next page_cache_get() since page->private is now the swap entry and not
> > a page_t pointer.
>
> Why do swapcache pages have their ->index in ->private? That should have
> been commented.

that's because I must leave page->index free for the anon-vma tracking.
Now an anonymous page, while being swapped, is just like a pagecache
page; however the index on swap is different from the index in the
address space, because the swap is nonlinear. So I need two indexes: one
for finding the page in the anon-vma in the task address space
(page->index), the other (the swap entry) for finding the page in the
swap address space (swapcache, or disk).
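
In short (a sketch of the field usage, not literal code from the patch):

	/*
	 * anonymous page sitting in the swap cache, anon-vma design:
	 *
	 *	page->index	linear index in the task address space,
	 *			used by the anon-vma rmap lookup
	 *	page->private	the swp_entry_t.val, i.e. the page's slot
	 *			in swapper_space (and on the swap device)
	 */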

> (hugetlb pages are also added to pagecache, and they are compound, but the
> code looks OK).

hugetlb is never swapped, so yes, it cannot generate problems. The only
thing swapping a compound page is swap suspend, and that's why we hadn't
noticed it yet.

> > So I guess I've a good reason now to give up trying to
> > add the page to the swapcache, and to just fake the radix tree like I
> > did in my original fix. That way the page won't be swapcache either so I
> > don't even need to use get_page to avoid remove_exclusive_swap_page to
> > mess with it.
>
> The BUG_ON in radix_tree_tag_set() is a fairly arbitrary sanity check:
> "hey, why are you tagging a non-existent item?".
>
> We could simply replace it with a `return NULL;'?

I wouldn't like to reduce the hardness checks in the radix tree; I was
very happy to see these robustness checks trap those bugs so reliably.

But I think the compound thing is overkill for 99% of usages; hugetlbfs
is the only one really needing that sort of transparency in the
refcounting, I believe. So I'm now adding a __GFP_NO_COMP that
swapsuspend will use to allocate the multipages. I'd better add it
before somebody gets the idea of removing the order parameter of
free_pages (something you could do just fine with compound pages, since
the order is in page[1].index ;). This should fix it: I don't want to
teach rw_swap_page_sync how to swapout a compound page, I'd rather make
a compound page look like a regular page as far as rw_swap_page_sync is
concerned. This will not slow down the kernel at all, since the
additional check whether to create a compound page only triggers if the
previous check of order > 0 is positive.
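
For reference, the 2.6-era prep_compound_page goes roughly like this
(quoted from memory, it may differ in detail); note how every subpage's
->private points back at the head page, which is exactly the field
add_to_page_cache reuses for the swap entry:

	static void prep_compound_page(struct page *page, unsigned long order)
	{
		int i;
		int nr_pages = 1 << order;

		page[1].mapping = NULL;
		page[1].index = order;	/* so the order can be recovered at free time */
		for (i = 0; i < nr_pages; i++) {
			struct page *p = page + i;

			SetPageCompound(p);
			p->private = (unsigned long)page;	/* point at the head */
		}
	}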

2004-04-02 01:16:47

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 02:03:14AM +0100, Hugh Dickins wrote:
> On Fri, 2 Apr 2004, Andrea Arcangeli wrote:
> >
> > the good thing is that I believe this fix will make it work with the -mm
> > writeback changes. However this fix now collides with anon-vma since
> > swapsuspend passes compound pages to rw_swap_page_sync and
> > add_to_page_cache overwrites page->private and the kernel crashes at the
> > next page_cache_get() since page->private is now the swap entry and not
> > a page_t pointer. So I guess I've a good reason now to give up trying to
> > add the page to the swapcache, and to just fake the radix tree like I
> > did in my original fix. That way the page won't be swapcache either so I
> > don't even need to use get_page to avoid remove_exclusive_swap_page to
> > mess with it.
>
> Yes, I too was feeling that we'd gone far enough in this "make it like
> a real swap page" direction, and we'd probably have better luck with
> "take away all resemblance to a real swap page".
>
> I've still done no work or testing on rw_swap_page_sync, but I wonder...
> remember how your page_mapping(page) gives &swapper_space on a swap
> cache page, whereas my page_mapping(page) gives NULL on them? My guess

yes.

> (quite possibly wrong) is that I won't have any of the trouble you've
> had with this, that the page_writeback functions, seeing NULL mapping,
> won't get involved with the radix tree at all - and why should they,

Not sure, but I find your way very risky since writepage operations are
address space methods: it's like calling an object method with a null
object as parameter, very risky and dirty. That's the primary reason I
wanted my swap cache to have a true page_mapping(page) ==
&swapper_space; your swapcache having a null mapping looks very dirty to
me, and that's why I avoided it.

Note that the same way you drop the swapper_space with your code
applied, you could drop it independently from mainline too w/o any other
change. I much prefer to have a real swapper_space with a real tree_lock
and a real ->writepage callback etc..

> it isn't doing anything useful for rw_swap_page_sync, just getting you
> into memory allocation difficulties. No need for add_to_page_cache or
> add_to_swap_cache there at all. As I say, I haven't tested this path,

I wouldn't need to call add_to_page_cache either, it's just Andrew
prefers it.

> but I do know that the rest of swap works fine with NULL page_mapping.

though your code still has no way to work, since it will clash on the
compound page just like mine. Note that my code already works fine as
long as it's not a compound page: as far as Andrew's code works in
mainline, my code will work fine, and if my code doesn't work yours
cannot either (we both clash on the compound page in fact).

2004-04-02 01:34:41

by Andrew Morton

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli <[email protected]> wrote:
>
> > it isn't doing anything useful for rw_swap_page_sync, just getting you
> > into memory allocation difficulties. No need for add_to_page_cache or
> > add_to_swap_cache there at all. As I say, I haven't tested this path,
>
> I wouldn't need to call add_to_page_cache either, it's just Andrew
> prefers it.

Well all of this is to avoid a fairly arbitrary BUG_ON in the radix-tree
code. If I hadn't added that, we'd all be happy.

The code is well-tested and has been thrashed to death in the userspace
radix-tree test harness.
(http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz). Let's
remove the BUG_ON.

2004-04-02 02:00:46

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Thu, Apr 01, 2004 at 05:36:49PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > > it isn't doing anything useful for rw_swap_page_sync, just getting you
> > > into memory allocation difficulties. No need for add_to_page_cache or
> > > add_to_swap_cache there at all. As I say, I haven't tested this path,
> >
> > I wouldn't need to call add_to_page_cache either, it's just Andrew
> > prefers it.
>
> Well all of this is to avoid a fairly arbitrary BUG_ON in the radix-tree
> code. If I hadn't added that, we'd all be happy.
>
> The code is well-tested and has been thrashed to death in the userspace
> radix-tree test harness.
> (http://www.zip.com.au/~akpm/linux/patches/stuff/rtth.tar.gz). Let's
> remove the BUG_ON.

the point of the BUG_ON is not to debug the radix tree, it's to debug
the callers.

I don't like removing the BUG_ON... it's very cool to get a BUG when
somebody tags or deletes entries that have never been present in the
radix-tree. That's been useful to me so far and I'd like to retain that
feature, since I believe it's not hurting performance at all.

I now fixed up the whole compound thing, it made no sense to keep
compound off with HUGETLBFS=N, that's a generic setup for all order > 0
not just for hugetlbfs, so it has to be enabled always or never, or it's
just asking for trouble. I can very well imagine drivers touching
page[1]->mapping and crashing with HUGETLBFS=Y and working fine with
HUGETLBFS=N, for something that has nothing to do with hugetlbfs in the
first place. At the same time, the CONFIG_MMU ifdef assuming
HUGETLBFS=N and setting the count for all the secondary pages as well
looks wrong; the MMU isn't related to the page[1-N].count setting at
all, and the layout in which we return pages from alloc_pages must be
the same w/ or w/o MMU.

So I cleaned up the code and also allowed drivers to alloc multipages
not compounded, so that they can do stuff like rw_swap_page_sync on
them (doing the refcount on MMU=y too). The thing works fine here and I
can finally suspend fine. The bug is definitely fixed now. I'm checking
this in CVS so the swapsuspend people will finally be able to suspend,
and unless anybody has objections I'll merge it in the next -aa.

--- x/include/linux/gfp.h.~1~ 2003-08-31 02:38:26.000000000 +0200
+++ x/include/linux/gfp.h 2004-04-02 02:49:21.241968968 +0200
@@ -32,6 +32,7 @@
#define __GFP_NOFAIL 0x800 /* Retry for ever. Cannot fail */
#define __GFP_NORETRY 0x1000 /* Do not retry. Might fail */
#define __GFP_NO_GROW 0x2000 /* Slab internal usage */
+#define __GFP_NO_COMP 0x4000 /* Return non compound pages if order > 0 */

#define __GFP_BITS_SHIFT 16 /* Room for 16 __GFP_FOO bits */
#define __GFP_BITS_MASK ((1 << __GFP_BITS_SHIFT) - 1)
--- x/include/linux/mm.h.~1~ 2004-04-01 18:32:55.000000000 +0200
+++ x/include/linux/mm.h 2004-04-02 03:39:40.884913464 +0200
@@ -445,8 +445,6 @@ struct page {

extern void FASTCALL(__page_cache_release(struct page *));

-#ifdef CONFIG_HUGETLB_PAGE
-
static inline int page_count(struct page *p)
{
if (PageCompound(p))
@@ -478,23 +476,6 @@ static inline void put_page(struct page
__page_cache_release(page);
}

-#else /* CONFIG_HUGETLB_PAGE */
-
-#define page_count(p) atomic_read(&(p)->count)
-
-static inline void get_page(struct page *page)
-{
- atomic_inc(&page->count);
-}
-
-static inline void put_page(struct page *page)
-{
- if (!PageReserved(page) && put_page_testzero(page))
- __page_cache_release(page);
-}
-
-#endif /* CONFIG_HUGETLB_PAGE */
-
/*
* Multiple processes may "see" the same page. E.g. for untouched
* mappings of /dev/null, all processes see the same page full of
--- x/kernel/power/pmdisk.c.~1~ 2004-03-11 08:27:47.000000000 +0100
+++ x/kernel/power/pmdisk.c 2004-04-02 02:51:09.000000000 +0200
@@ -531,7 +531,7 @@ static void calc_order(void)
static int alloc_pagedir(void)
{
calc_order();
- pagedir_save = (suspend_pagedir_t *)__get_free_pages(GFP_ATOMIC | __GFP_COLD,
+ pagedir_save = (suspend_pagedir_t *)__get_free_pages(GFP_ATOMIC | __GFP_COLD | __GFP_NO_COMP,
pagedir_order);
if(!pagedir_save)
return -ENOMEM;
--- x/kernel/power/swsusp.c.~1~ 2004-03-11 08:27:47.000000000 +0100
+++ x/kernel/power/swsusp.c 2004-04-02 03:03:03.327992896 +0200
@@ -442,7 +442,7 @@ static suspend_pagedir_t *create_suspend

pagedir_order = get_bitmask_order(SUSPEND_PD_PAGES(nr_copy_pages));

- p = pagedir = (suspend_pagedir_t *)__get_free_pages(GFP_ATOMIC | __GFP_COLD, pagedir_order);
+ p = pagedir = (suspend_pagedir_t *)__get_free_pages(GFP_ATOMIC | __GFP_COLD | __GFP_NO_COMP, pagedir_order);
if(!pagedir)
return NULL;

--- x/mm/page_alloc.c.~1~ 2004-04-01 18:32:54.000000000 +0200
+++ x/mm/page_alloc.c 2004-04-02 03:53:33.897276336 +0200
@@ -93,10 +93,6 @@ static void bad_page(const char *functio
page->mapcount = 0;
}

-#ifndef CONFIG_HUGETLB_PAGE
-#define prep_compound_page(page, order) do { } while (0)
-#define destroy_compound_page(page, order) do { } while (0)
-#else
/*
* Higher-order pages are called "compound pages". They are structured thusly:
*
@@ -147,7 +143,6 @@ static void destroy_compound_page(struct
ClearPageCompound(p);
}
}
-#endif /* CONFIG_HUGETLB_PAGE */

/*
* Freeing function for a buddy system allocator.
@@ -178,7 +173,7 @@ static inline void __free_pages_bulk (st
{
unsigned long page_idx, index;

- if (order)
+ if (PageCompound(page))
destroy_compound_page(page, order);
page_idx = page - base;
if (page_idx & ~mask)
@@ -306,47 +301,37 @@ expand(struct zone *zone, struct page *p
return page;
}

-static inline void set_page_refs(struct page *page, int order)
-{
-#ifdef CONFIG_MMU
- set_page_count(page, 1);
-#else
- int i;
-
- /*
- * We need to reference all the pages for this order, otherwise if
- * anyone accesses one of the pages with (get/put) it will be freed.
- */
- for (i = 0; i < (1 << order); i++)
- set_page_count(page+i, 1);
-#endif /* CONFIG_MMU */
-}
-
/*
* This page is about to be returned from the page allocator
*/
-static void prep_new_page(struct page *page, int order)
+static void prep_new_page(struct page * _page, int order)
{
- if (page->mapping ||
- page->mapcount ||
- (page->flags & (
- 1 << PG_private |
- 1 << PG_locked |
- 1 << PG_lru |
- 1 << PG_active |
- 1 << PG_dirty |
- 1 << PG_reclaim |
- 1 << PG_anon |
- 1 << PG_maplock |
- 1 << PG_swapcache |
- 1 << PG_writeback )))
- bad_page(__FUNCTION__, page);
+ int i;
+
+ for (i = 0; i < (1 << order); i++) {
+ struct page * page = _page + i;

- page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
- 1 << PG_referenced | 1 << PG_arch_1 |
- 1 << PG_checked | 1 << PG_mappedtodisk);
- page->private = 0;
- set_page_refs(page, order);
+ if (page->mapping ||
+ page->mapcount ||
+ (page->flags & (
+ 1 << PG_private |
+ 1 << PG_locked |
+ 1 << PG_lru |
+ 1 << PG_active |
+ 1 << PG_dirty |
+ 1 << PG_reclaim |
+ 1 << PG_anon |
+ 1 << PG_maplock |
+ 1 << PG_swapcache |
+ 1 << PG_writeback )))
+ bad_page(__FUNCTION__, page);
+
+ page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+ 1 << PG_referenced | 1 << PG_arch_1 |
+ 1 << PG_checked | 1 << PG_mappedtodisk);
+ page->private = 0;
+ set_page_count(page, 1);
+ }
}

/*
@@ -498,10 +483,11 @@ void fastcall free_cold_page(struct page
* or two.
*/

-static struct page *buffered_rmqueue(struct zone *zone, int order, int cold)
+static struct page *buffered_rmqueue(struct zone *zone, int order, int cold_compound)
{
unsigned long flags;
struct page *page = NULL;
+ int cold = !!(cold_compound & __GFP_COLD);

if (order == 0) {
struct per_cpu_pages *pcp;
@@ -530,7 +516,7 @@ static struct page *buffered_rmqueue(str
BUG_ON(bad_range(zone, page));
mod_page_state_zone(zone, pgalloc, 1 << order);
prep_new_page(page, order);
- if (order)
+ if (unlikely(order) && !(cold_compound & __GFP_NO_COMP))
prep_compound_page(page, order);
}
return page;
@@ -570,7 +556,9 @@ __alloc_pages(unsigned int gfp_mask, uns

cold = 0;
if (gfp_mask & __GFP_COLD)
- cold = 1;
+ cold = __GFP_COLD;
+ if (gfp_mask & __GFP_NO_COMP)
+ cold |= __GFP_NO_COMP;

zones = zonelist->zones; /* the list of zones suitable for gfp_mask */
if (zones[0] == NULL) /* no zones in the zonelist */

2004-04-02 02:06:06

by Andrew Morton

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli <[email protected]> wrote:
>
> I now fixed up the whole compound thing, it made no sense to keep
> compound off with HUGETLBFS=N, that's a generic setup for all order > 0
> not just for hugetlbfs, so it has to be enabled always or never, or it's
> just asking for trouble.

It was a modest optimisation for non-hugetlb architectures and configs.
Having it optional has caused no problem in a year.

Was there some reason why you _required_ that it be permanently enabled?

2004-04-02 02:22:41

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Thu, Apr 01, 2004 at 06:08:02PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > I now fixed up the whole compound thing, it made no sense to keep
> > compound off with HUGETLBFS=N, that's a generic setup for all order > 0
> > not just for hugetlbfs, so it has to be enabled always or never, or it's
> > just asking for trouble.
>
> It was a modest optimisation for non-hugetlb architectures and configs.
> Having it optional has caused no problem in a year.
>
> Was there some reason why you _required_ that it be permanently enabled?

Well, I doubt anybody could take advantage of this optimization, since
nobody can ship with hugetlbfs disabled anyways (perhaps with the
exception of the embedded people, but then I doubt they want to risk
drivers breaking because they could depend on the compound framework).
My point is that nobody will ever test with hugetlbfs disabled, so the
fact you don't get bugs doesn't mean it won't crash with hugetlbfs
disabled. There must be a defined API that returns multipages; today
most of the testing has been done with hugetlbfs enabled, which means
with multipages being compound pages, so I'd rather not disable compound
by default.

I find it unreliable that in mainline with hugetlbfs=N we don't set the
page->count of all the page_t's to 1, and still set only the first
page's count; that's just a bug waiting to trigger. The fact that the
MMU people wanted all of them set to 1 just shows some driver would
break with hugetlbfs turned off. That's fixed.

If our objective is to optimize, then we could disable compound by
default and have only hugetlbfs call alloc_pages with __GFP_COMP,
instead of swapsuspend being the only one calling alloc_pages with
__GFP_NO_COMP; though I preferred not to optimize, since the majority of
the testing so far has been done with hugetlbfs=y and I didn't want to
invalidate it. If you want to disable compound by default to optimize
everything (not just the hugetlbfs=n compiles), that's fine with me.
Though I understood DaveM just asked for compound to be always available
for some network thing, and that's what I implemented.

2004-04-02 06:05:39

by Christoph Hellwig

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 04:22:33AM +0200, Andrea Arcangeli wrote:
> Well, I doubt anybody could take advantage of this optimization, since
> nobody can ship with hugetlbfs disabled anyways (perhaps with the
> exception of the embedded people but then I doubt they want to risk

Come on, stop smoking that bad stuff. Almost no-one except the known
oracle whores SuSE and RH needs it. Remember Linux is used much more
widely than by the known "Enterprise" vendors. None of the
NAS/networking/media appliances or PDAs I've seen has the slightest
need for hugetlbfs.

2004-04-02 07:08:45

by Paul Mackerras

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Christoph Hellwig writes:
> On Fri, Apr 02, 2004 at 04:22:33AM +0200, Andrea Arcangeli wrote:
> > Well, I doubt anybody could take advantage of this optimization, since
> > nobody can ship with hugetlbfs disabled anyways (perhaps with the
> > exception of the embedded people but then I doubt they want to risk
>
> Come on, stop smoking that bad stuff. Almost no-one except the known
> oracle whores SuSE and RH needs it. Remember Linux is used much more
> widely than by the known "Enterprise" vendors. None of the
> NAS/networking/media appliances or PDAs I've seen has the slightest
> need for hugetlbfs.

The HPC types also love hugetlbfs since it reduces their tlb miss
rate.

Paul.

2004-04-02 07:11:16

by Christoph Hellwig

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 05:07:52PM +1000, Paul Mackerras wrote:
> The HPC types also love hugetlbfs since it reduces their tlb miss
> rate.

Thanks, forgot that one. Still, it's a tiny subset of the Linux userbase.

2004-04-02 09:43:43

by Christoph Hellwig

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 04:00:22AM +0200, Andrea Arcangeli wrote:
> I now fixed up the whole compound thing, it made no sense to keep
> compound off with HUGETLBFS=N, that's a generic setup for all order > 0

I got lots of the following oopses with 2.6.5-rc3aa2 on a powerpc running
the xfs testsuite (with the truncate fix applied):

Apr 2 13:27:21 bird kernel: Bad page state at destroy_compound_page (in process 'swapper', page c08d9920)
Apr 2 13:27:21 bird kernel: flags:0x00000008 mapping:00000000 mapped:0 count:0
Apr 2 13:27:21 bird kernel: Backtrace:
Apr 2 13:27:21 bird kernel: Call trace:
Apr 2 13:27:21 bird kernel: [c000b5c8] dump_stack+0x18/0x28
Apr 2 13:27:21 bird kernel: [c0048b60] bad_page+0x70/0xb0
Apr 2 13:27:21 bird kernel: [c0048c70] destroy_compound_page+0x80/0xb8
Apr 2 13:27:21 bird kernel: [c0048ec4] free_pages_bulk+0x21c/0x220
Apr 2 13:27:21 bird kernel: [c0049020] __free_pages_ok+0x158/0x16c
Apr 2 13:27:21 bird kernel: [c004d4f8] slab_destroy+0x140/0x234
Apr 2 13:27:21 bird kernel: [c00505c8] reap_timer_fnc+0x1e4/0x2b8
Apr 2 13:27:21 bird kernel: [c002feac] run_timer_softirq+0x134/0x1fc
Apr 2 13:27:21 bird kernel: [c002abd0] do_softirq+0x140/0x144
Apr 2 13:27:21 bird kernel: [c0009e5c] timer_interrupt+0x2d0/0x300
Apr 2 13:27:21 bird kernel: [c0007cac] ret_from_except+0x0/0x14
Apr 2 13:27:21 bird kernel: [c000381c] ppc6xx_idle+0xe4/0xf0
Apr 2 13:27:21 bird kernel: [c0009b7c] cpu_idle+0x28/0x38
Apr 2 13:27:21 bird kernel: [c00038c4] rest_init+0x50/0x60
Apr 2 13:27:21 bird kernel: [c0364784] start_kernel+0x198/0x1d8
Apr 2 13:27:21 bird kernel: Trying to fix it up, but a reboot is needed

2004-04-02 10:24:51

by Marc-Christian Petersen

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Friday 02 April 2004 11:43, Christoph Hellwig wrote:

Hi Christoph,

> I got lots of the following oopses with 2.6.5-rc3aa2 on a powerpc running
> the xfs testsuite (with the truncate fix applied):

What truncate fix? Sorry if I missed that.

Dunno if the below is causing your trouble, but is it intentional that
page_cache_release(page) is called twice?

diff -urNp --exclude CVS --exclude BitKeeper --exclude {arch} --exclude .arch-ids 2.6.5-rc3/mm/page_io.c xx/mm/page_io.c
--- 2.6.5-rc3/mm/page_io.c 2002-12-15 04:18:17.000000000 +0100
+++ xx/mm/page_io.c 2004-04-02 05:32:57.381688904 +0200
@@ -161,7 +176,13 @@ int rw_swap_page_sync(int rw, swp_entry_
ret = swap_writepage(page, &swap_wbc);
wait_on_page_writeback(page);
}
- page->mapping = NULL;
+
+ lock_page(page);
+ remove_from_page_cache(page);
+ unlock_page(page);
+ page_cache_release(page);
+ page_cache_release(page); /* For add_to_page_cache() */



ciao, Marc

2004-04-02 10:55:46

by Hugh Dickins

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, 2 Apr 2004, Marc-Christian Petersen wrote:
>
> Dunno if the below is causing your trouble, but is it intentional that
> page_cache_release(page) is called twice?

It's not pretty, but it is intentional and correct:
the first to balance the page_cache_get higher up (well commented),
the second because add_to_page_cache does a page_cache_get but
remove_from_page_cache does not do the corresponding page_cache_release.
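
(Spelling that accounting out as a sketch, with the reference bookkeeping
in comments; illustrative, not the actual mm/page_io.c code:)

	/* on entry the page holds one extra reference from the explicit
	 * page_cache_get() higher up, and one from add_to_page_cache() */
	lock_page(page);
	remove_from_page_cache(page);	/* unhashes, drops no reference */
	unlock_page(page);
	page_cache_release(page);	/* pairs with the page_cache_get */
	page_cache_release(page);	/* pairs with add_to_page_cache's get */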

Christoph's problems will be somewhere in Andrea's compound page changes.

Hugh

2004-04-02 15:23:02

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 07:05:25AM +0100, Christoph Hellwig wrote:
> On Fri, Apr 02, 2004 at 04:22:33AM +0200, Andrea Arcangeli wrote:
> > Well, I doubt anybody could take advantage of this optimization, since
> > nobody can ship with hugetlbfs disabled anyways (perhaps with the
> > exception of the embedded people but then I doubt they want to risk
>
> Come on, stop smoking that bad stuff. Almost no-one except the known
> oracle whores SuSE and RH needs it. Remember Linux is used much more
> widely than by the known "Enterprise" vendors. None of the
> NAS/networking/media appliances or PDAs I've seen has the slightest need
> for hugetlbfs.

I already explained the reason for the changes, and they've nothing to do
with hugetlbfs. The whole thing has nothing to do with hugetlbfs. I also
proposed a way to optimize _always_, regardless of hugetlbfs=y or =n, by
just turning my __GFP_NO_COMP into a __GFP_COMP, again regardless of
hugetlbfs. The current mainline code returning different things from
alloc_pages depending on a hugetlbfs compile option is totally broken,
and I simply fixed it. This has absolutely nothing to do with the
hugetlbfs users.

About your comment about SUSE and RH being the only ones shipping with
hugetlbfs turned on: I very strongly doubt that any other distribution
can ship with hugetlbfs turned off. Just go ask them; I bet you will be
surprised to find that they turn it on too.

The only ones that may not turn it on are probably the embedded people
using a custom kernel, but as I said I strongly doubt they want to risk
triggering driver bugs with a different alloc_pages API, since nobody
has tested that API, because everybody is going to turn hugetlbfs on.

As far as I can tell the number of people that run with hugetlbfs off
is a niche compared to the ones that will turn it on, and you're totally
wrong to claim the opposite. You're confusing the number of active
hugetlbfs users with the number of users that have a kernel compiled
with hugetlbfs=y. That's a completely different thing. And regardless, I
proposed a way to optimize it. Plus I fixed a very bad bug that triggers
with hugetlbfs=n and that obviously nobody tested, except the
CONFIG_MMU=n people, who in fact had it fixed only for CONFIG_MMU=n;
that as well was totally broken, since the alloc_pages API must be
independent of CONFIG_MMU, which is a physical-memory thing. So stop
making aggressive claims on l-k, especially when you're wrong.

I'll now look into the bug that you triggered with xfs. Did you ever
test with hugetlbfs=y before, btw (maybe you were one of the users
always keeping it off, now noticing the API changes under you, and
now benefiting from my standardization of the API)? Could be a bug in my
changes too (though it works fine for me); we'll see.

2004-04-02 15:27:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 05:22:40PM +0200, Andrea Arcangeli wrote:
> I already explained the reason for the changes, and they've nothing to do
> with hugetlbfs. The whole thing has nothing to do with hugetlbfs. I also
> proposed a way to optimize _always_, regardless of hugetlbfs=y or =n, by
> just turning my __GFP_NO_COMP into a __GFP_COMP, again regardless of
> hugetlbfs. The current mainline code returning different things from
> alloc_pages depending on a hugetlbfs compile option is totally broken,
> and I simply fixed it. This has absolutely nothing to do with the
> hugetlbfs users.

Umm, the users aren't supposed to dig into the VM internals that deep.
Everyone who does has a bug.

> The only ones that may not turn it on are probably the embedded people
> using a custom kernel, but as I said I strongly doubt they want to risk
> triggering driver bugs with a different alloc_pages API, since nobody
> has tested that API, because everybody is going to turn hugetlbfs on.

We can make a little poll on lkml, but I bet most kernel developers will
have it disabled :)

> I'll now look into the bug that you triggered with xfs. Did you ever
> test with hugetlbfs=y before, btw

I for myself haven't ever run with hugetlbfs=y and don't really plan to.

> (maybe you were one of the users
> always keeping it off, now noticing the API changes under you, and
> now benefiting from my standardization of the API)?

Huh? The callchain comes from generic slab code...

2004-04-02 15:28:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 05:07:52PM +1000, Paul Mackerras wrote:
> Christoph Hellwig writes:
> > On Fri, Apr 02, 2004 at 04:22:33AM +0200, Andrea Arcangeli wrote:
> > > Well, I doubt anybody could take advantage of this optimization, since
> > > nobody can ship with hugetlbfs disabled anyways (perhaps with the
> > > exception of the embedded people but then I doubt they want to risk
> >
> > Come on, stop smoking that bad stuff. Almost no-one except the known
> > oracle whores SuSE and RH needs it. Remember Linux is used much more
> > widely than by the known "Enterprise" vendors. None of the
> > NAS/networking/media appliances or PDAs I've seen has the slightest need
> > for hugetlbfs.
>
> The HPC types also love hugetlbfs since it reduces their tlb miss
> rate.

The point is not the number of people who need this; the point is that
any distributor will be forced to turn it on, since the distributor
must allow any user to run HPC or database applications. Shipping with
hugetlbfs=n is like shipping with device-mapper=n or sysfs=n or whatever
like that. And having a different alloc_pages API depending on a
hugetlbfs compile option was totally broken; plus hugetlbfs=n was buggy.
This is all fixed now.

If we want to make the API not generate compound pages unless you call
with __GFP_COMP, that's fine with me; that'll optimize the whole kernel,
and it's a very simple variation of my code. Still, the alloc_pages API
must be independent of a hugetlbfs compile option.

2004-04-02 15:38:05

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 04:27:09PM +0100, Christoph Hellwig wrote:
> On Fri, Apr 02, 2004 at 05:22:40PM +0200, Andrea Arcangeli wrote:
> > I already explained the reason for the changes, and they've nothing to do
> > with hugetlbfs. The whole thing has nothing to do with hugetlbfs. I also
> > proposed a way to optimize _always_, regardless of hugetlbfs=y or =n, by
> > just turning my __GFP_NO_COMP into a __GFP_COMP, again regardless of
> > hugetlbfs. The current mainline code returning different things from
> > alloc_pages depending on a hugetlbfs compile option is totally broken,
> > and I simply fixed it. This has absolutely nothing to do with the
> > hugetlbfs users.
>
> Umm, the users aren't supposed to dig into the VM internals that deep.
> Everyone who does has a bug.

That's why alloc_pages should return the same thing for every user.

>
> > The only ones that may not turn it on are probably the embedded people
> > using a custom kernel, but as I said I strongly doubt they want to risk
> > triggering driver bugs with a different alloc_pages API, since nobody
> > has tested that API, because everybody is going to turn hugetlbfs on.
>
> We can make a little poll on lkml, but I bet most kernel developers will
> have it disabled :)

100 kernel developers, who cares about saving some cycles in 100
machines? Get real.

> > I'll now look into the bug that you triggered with xfs. Did you ever
> > test with hugetlbfs=y before, btw
>
> I for myself haven't ever run with hugetlbfs=y and don't really plan to.

Now I get a crash in swap resume (I cannot test swap resume yet; at
least swap suspend works now). Could be the same bug you triggered.
We'll see.

> Huh? The callchain comes from generic slab code...

slab code may be using multipages too. Anyway, I had no time to look
into it yet, so give me a bit of time. I need to fix swap resume now;
after that works I'll check whether your bug can be explained by the same
issue that swap resume has right now, and if not I'll fix it. Then I'll
do mprotect merging (for file mappings too!).
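
(Why slab shows up in such backtraces, as a hedged sketch: a cache whose
objects don't fit in one page allocates order > 0 pages, which with
compound on by default come back as compound pages and reach
destroy_compound_page() when the cache is reaped. The cache name and
object size below are made up; the 2.6-era kmem_cache_create signature
is assumed.)

	/* 8 kB objects force order-1 (two-page) slabs */
	kmem_cache_t *cachep = kmem_cache_create("example_cache", 8192,
						 0, 0, NULL, NULL);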

2004-04-02 15:46:01

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 05:38:01PM +0200, Andrea Arcangeli wrote:
> 100 kernel developers, who cares about saving some cycles in 100
> machines? Get real.

Just to avoid any misunderstanding: I want to optimize it _everywhere_.
I mean that optimizing it in only 100 machines and an embedded niche is
worthless. I'm not saying it's worthless to optimize it everywhere
(though I doubt it's a measurable slowdown, given that order > 0 is
unlikely in the first place). If you check my first emails about the
compound thing, I wasn't very happy about it. The one single reason I
had to keep it on by default is that currently I feel unsafe about
optimizing it away by turning it off by default, since the big testing
(on weird drivers too) has happened so far with compound on by default,
and disabling it everywhere would risk triggering bugs. This clearly
shows you how unreliable it is to return different things from
alloc_pages as a function of an unrelated hugetlbfs option, and this is
the basic problem I'm fixing.

2004-04-02 16:46:39

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 10:43:34AM +0100, Christoph Hellwig wrote:
> I got lots of the following oopses with 2.6.5-rc3aa2 on a powerpc running
> the xfs testsuite (with the truncate fix applied):
>
> Apr 2 13:27:21 bird kernel: Bad page state at destroy_compound_page (in process 'swapper', page c08d9920)
> Apr 2 13:27:21 bird kernel: flags:0x00000008 mapping:00000000 mapped:0 count:0
> Apr 2 13:27:21 bird kernel: Backtrace:
> Apr 2 13:27:21 bird kernel: Call trace:
> Apr 2 13:27:21 bird kernel: [c000b5c8] dump_stack+0x18/0x28
> Apr 2 13:27:21 bird kernel: [c0048b60] bad_page+0x70/0xb0
> Apr 2 13:27:21 bird kernel: [c0048c70] destroy_compound_page+0x80/0xb8

It's not clear why this triggered: bad_page only shows the "master"
compound page and not the contents of the slave page that triggered the
bad_page. Can you try again with this incremental patch applied?
Thanks!

--- x/mm/page_alloc.c.~1~ 2004-04-02 05:24:50.000000000 +0200
+++ x/mm/page_alloc.c 2004-04-02 18:32:53.189244408 +0200
@@ -73,9 +73,9 @@ static void bad_page(const char *functio
{
printk(KERN_EMERG "Bad page state at %s (in process '%s', page %p)\n",
function, current->comm, page);
- printk(KERN_EMERG "flags:0x%08lx mapping:%p mapped:%d count:%d\n",
+ printk(KERN_EMERG "flags:0x%08lx mapping:%p mapped:%d count:%d private:0x%08lx\n",
(unsigned long)page->flags, page->mapping,
- page_mapped(page), page_count(page));
+ page_mapped(page), page_count(page), page->private);
printk(KERN_EMERG "Backtrace:\n");
dump_stack();
printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n");
@@ -137,9 +137,9 @@ static void destroy_compound_page(struct
struct page *p = page + i;

if (!PageCompound(p))
- bad_page(__FUNCTION__, page);
+ bad_page(__FUNCTION__, p);
if (p->private != (unsigned long)page)
- bad_page(__FUNCTION__, page);
+ bad_page(__FUNCTION__, p);
ClearPageCompound(p);
}
}
@@ -272,8 +272,12 @@ void __free_pages_ok(struct page *page,
int i;

mod_page_state(pgfree, 1 << order);
- for (i = 0 ; i < (1 << order) ; ++i)
- free_pages_check(__FUNCTION__, page + i);
+ for (i = 0 ; i < (1 << order) ; ++i) {
+ struct page * _page = page + i;
+ if (unlikely(i))
+ __put_page(_page);
+ free_pages_check(__FUNCTION__, _page);
+ }
list_add(&page->lru, &list);
kernel_map_pages(page, 1<<order, 0);
free_pages_bulk(page_zone(page), 1, &list, order);
@@ -316,19 +320,21 @@ static void prep_new_page(struct page *
(page->flags & (
1 << PG_private |
1 << PG_locked |
- 1 << PG_lru |
+ 1 << PG_lru |
1 << PG_active |
1 << PG_dirty |
1 << PG_reclaim |
1 << PG_anon |
1 << PG_maplock |
1 << PG_swapcache |
- 1 << PG_writeback )))
+ 1 << PG_writeback |
+ 1 << PG_compound )))
bad_page(__FUNCTION__, page);

page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
1 << PG_referenced | 1 << PG_arch_1 |
- 1 << PG_checked | 1 << PG_mappedtodisk);
+ 1 << PG_checked | 1 << PG_mappedtodisk |
+ 1 << PG_compound);
page->private = 0;
set_page_count(page, 1);
}


This incremental bit made some harmless warnings go away from swap resume,
but it didn't fix swap resume completely yet; OTOH I'm not sure anymore
whether there's any further VM issue or whether it's a swap suspend issue.
The PageCompound bugcheck would already trap any compound page in
rw_swap_page_sync, so I'm sure nobody tried to swap compound pages in
swap resume, and I'm also sure that the page->count is now correct, or
free_pages_check would trigger. I cannot trigger any further bugcheck
here (and the above patch only shuts down some false positives that
couldn't hurt functionality, plus it adds further bugchecks).

2004-04-02 18:59:38

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 06:46:34PM +0200, Andrea Arcangeli wrote:
> It's not clear why this triggered: bad_page only shows the "master"
> compound page and not the contents of the slave page that triggered the
> bad_page. Can you try again with this incremental patch applied?

Bad page state at destroy_compound_page (in process 'swapper', page c0772380)
flags:0x00080008 mapping:00000000 mapped:0 count:134217728 private:0xc07721ff
Backtrace:
Call trace:
[c000b5c8] dump_stack+0x18/0x28
[c0048b64] bad_page+0x74/0xbc
[c0048c7c] destroy_compound_page+0x80/0xb8
[c0048ed0] free_pages_bulk+0x21c/0x220
[c0049030] __free_pages_ok+0x15c/0x188
[c004d520] slab_destroy+0x140/0x234
[c00505f0] reap_timer_fnc+0x1e4/0x2b8
[c002feac] run_timer_softirq+0x134/0x1fc
[c002abd0] do_softirq+0x140/0x144
[c0009e5c] timer_interrupt+0x2d0/0x300
[c0007cac] ret_from_except+0x0/0x14
[c000381c] ppc6xx_idle+0xe4/0xf0
[c0009b7c] cpu_idle+0x28/0x38
[c00038c4] rest_init+0x50/0x60
[c0364784] start_kernel+0x198/0x1d8

2004-04-02 19:29:43

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 07:59:28PM +0100, Christoph Hellwig wrote:
> On Fri, Apr 02, 2004 at 06:46:34PM +0200, Andrea Arcangeli wrote:
> > It's not clear why this triggered: bad_page only shows the "master"
> > compound page and not the contents of the slave page that triggered the
> > bad_page. Can you try again with this incremental patch applied?
>
> Bad page state at destroy_compound_page (in process 'swapper', page c0772380)
> flags:0x00080008 mapping:00000000 mapped:0 count:134217728 private:0xc07721ff

PageCompound and PageUptodate are set.

mapping/mapped is null.

page->count is 0x8000000; that looks weird.

page->private indicates:

>>> (0xc0772380L-0xc07721ffL)/32
12L

that's the 12th page in the array.

Can you check in the asm (you should look at address c0048c7c) whether
it's the first bad_page() that triggers?

if (page[1].index != order)
bad_page(__FUNCTION__, page);


The whole compound thing is very screwed in the above scenario.

Do you have CONFIG_DEBUG_PAGEALLOC enabled?

Could be that compound never worked right on ppc, dunno. You could try
backing out the gfp-no-compound patch and recompiling with hugetlbfs
enabled (can you enable it on PPC?).

In the meantime it seems swap resume got broken by some other change and
the VM side is OK now [rc3-aa2 showed some harmless warnings that
I've fixed in the patch you just tried] (I backed out the other non-VM
changes and resume works better now, though I cannot be 100% sure since
the aic7xxx cannot resume totally, confirmed by Pavel; I need somebody
with suspend-capable hardware to verify).

I also started the mprotect merging and it should be really quick to add
it.

Plus I'm doing a microscalability optimization in fremap.c: the previous
code was taking the page_table_lock right after calculating the
pgd_offset.

> Backtrace:
> Call trace:
> [c000b5c8] dump_stack+0x18/0x28
> [c0048b64] bad_page+0x74/0xbc
> [c0048c7c] destroy_compound_page+0x80/0xb8
> [c0048ed0] free_pages_bulk+0x21c/0x220
> [c0049030] __free_pages_ok+0x15c/0x188
> [c004d520] slab_destroy+0x140/0x234
> [c00505f0] reap_timer_fnc+0x1e4/0x2b8
> [c002feac] run_timer_softirq+0x134/0x1fc
> [c002abd0] do_softirq+0x140/0x144
> [c0009e5c] timer_interrupt+0x2d0/0x300
> [c0007cac] ret_from_except+0x0/0x14
> [c000381c] ppc6xx_idle+0xe4/0xf0
> [c0009b7c] cpu_idle+0x28/0x38
> [c00038c4] rest_init+0x50/0x60
> [c0364784] start_kernel+0x198/0x1d8

2004-04-02 19:54:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 09:29:41PM +0200, Andrea Arcangeli wrote:
> page->private indicates:
>
> >>> (0xc0772380L-0xc07721ffL)/32
> 12L
>
> that's the 12th page in the array.
>
> Can you check in the asm (you should look at address c0048c7c) whether
> it's the first bad_page() that triggers?
>
> if (page[1].index != order)
> bad_page(__FUNCTION__, page);

No, it's the second one (and yes, I get lots of these backtraces; unless
I counted wrongly, 19 this time).

> The whole compound thing is very screwed in the above scenario.
>
> Do you have CONFIG_DEBUG_PAGEALLOC enabled?

No, it's not available on ppc32.

> Could be that compound never worked right on ppc, dunno. You could try
> backing out the gfp-no-compound patch and recompiling with hugetlbfs
> enabled (can you enable it on PPC?).

No, there's no hugetlb support on ppc32.

2004-04-02 20:35:20

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 08:54:10PM +0100, Christoph Hellwig wrote:
> On Fri, Apr 02, 2004 at 09:29:41PM +0200, Andrea Arcangeli wrote:
> > page->private indicates:
> >
> > >>> (0xc0772380L-0xc07721ffL)/32
> > 12L
> >
> > that's the 12th page in the array.
> >
> > Can you check in the asm (you should look at address c0048c7c) whether
> > it's the first bad_page() that triggers?
> >
> > if (page[1].index != order)
> > bad_page(__FUNCTION__, page);
>
> No, it's the second one (and yes, I get lots of these backtraces; unless
> I counted wrongly, 19 this time).

How can that be the second one? (I deduced it was the first one because
it cannot be the second one, and the offset didn't look to be at the very
end of the function.) This is the second one:

if (!PageCompound(p))
bad_page(__FUNCTION__, p);

but bad_page shows p->flags == 0x00080008 and 1<<PG_compound ==
0x80000.

So PG_compound is definitely set for "p" and it can't be the second one
triggering.

Can you double-check? Maybe we should double-check the asm. Something
sounds fundamentally wrong there, it sounds like a miscompilation.
Which compiler are you using?

2004-04-02 21:31:24

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Hi!

> > An anonymous user page meets these requirements. I did say "anal", but
> > rw_swap_page_sync() is a general-purpose library function and we shouldn't
> > be making assumptions about the type of page which the caller happens to be
> > feeding us.
>
> that is a specialized backdoor to do I/O on _private_ pages, it's not a
> general-purpose library function for doing anonymous page
> swapin/swapout; in fact the only user is swap suspend, and we'd better
> forbid swap suspend to pass anonymous pages through that interface and
> be sure that nobody will ever attempt anything like that.
>
> that interface is useful only to reach the swap device, for doing I/O on
> private pages outside the VM; in the old days that was used to
> read/write the swap header (again on a private page), and swap suspend is
> using it for similar reasons on _private_ pages.

Ahha, so *here* is that discussion happening. I was only seeing it at
bugzilla, and could not make sense of it.

If swsusp/pmdisk are the only users of rw_swap_page_sync, perhaps it
should be moved to the power/ directory?
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2004-04-02 21:44:13

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 10:13:43PM +0200, Pavel Machek wrote:
> Hi!
>
> > > An anonymous user page meets these requirements. I did say "anal", but
> > > rw_swap_page_sync() is a general-purpose library function and we shouldn't
> > > be making assumptions about the type of page which the caller happens to be
> > > feeding us.
> >
> > that is a specialized backdoor to do I/O on _private_ pages, it's not a
> > general-purpose library function for doing anonymous page
> > swapin/swapout; in fact the only user is swap suspend, and we'd better
> > forbid swap suspend to pass anonymous pages through that interface and
> > be sure that nobody will ever attempt anything like that.
> >
> > that interface is useful only to reach the swap device, for doing I/O on
> > private pages outside the VM; in the old days that was used to
> > read/write the swap header (again on a private page), and swap suspend is
> > using it for similar reasons on _private_ pages.
>
> Ahha, so *here* is that discussion happening. I was only seeing it at
> bugzilla, and could not make sense of it.

;)

Btw, as far as I can tell I cannot see any more VM issues with the
current CVS kernel; what I get now is:

Resume Machine: resuming from /dev/sda1
Resuming from device sda1
Resume Machine: Signature found, resuming
Resume Machine: Reading pagedir, Relocating pagedir.:|
Reading image data (3420 pages): ...................................|
Reading resume file was successful
hdc: start_power_step(step: 0)
hdc: completing PM request, suspend
Waiting for DMAs to settle down...
Freeing prev allocated pagedir
hdc: Wakeup request inited, waiting for !BSY...
hdc: start_power_step(step: 1000)
hdc: completing PM request, resume
Fixing swap signatures... scsi0:A:0:0: ahc_intr - referenced scb not valid
during seqint 0x71 scb(1)
>>>>>>>>>>>>>>>>>> Dump Card State Begins <<<<<<<<<<<<<<<<<
scsi0: Dumping Card State in Message-in phase, at SEQADDR 0x1a6
Card was paused
ACCUM = 0x0, SINDEX = 0x71, DINDEX = 0xe4, ARG_2 = 0x0
HCNT = 0x0 SCBPTR = 0x3
SCSIPHASE[0x8] SCSISIGI[0xe6] ERROR[0x0] SCSIBUSL[0x0]
LASTPHASE[0xe0] SCSISEQ[0x12] SBLKCTL[0xa] SCSIRATE[0xc2]
SEQCTL[0x10] SEQ_FLAGS[0x0] SSTAT0[0x7] SSTAT1[0x11]
SSTAT2[0x0] SSTAT3[0x0] SIMODE0[0x8] SIMODE1[0xac]
SXFRCTL0[0x88] DFCNTRL[0x4] DFSTATUS[0x89]
STACK: 0xff 0x0 0x163 0x178
SCB count = 8
Kernel NEXTQSCB = 1
Card NEXTQSCB = 5
QINFIFO entries: 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5 1 5
Waiting Queue entries:
Disconnected Queue entries: 2:5 1:1 0:5
QOUTFIFO entries:
Sequencer Free SCB List: 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
Sequencer SCB Info:
0 SCB_CONTROL[0x64] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0x5]
1 SCB_CONTROL[0x64] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0x1]
2 SCB_CONTROL[0x64] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0x5]
3 SCB_CONTROL[0xe0] SCB_SCSIID[0x7] SCB_LUN[0x0] SCB_TAG[0x1]
4 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
5 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
6 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
7 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
8 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
9 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
10 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
11 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
12 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
13 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
14 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
15 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
16 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
17 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
18 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
19 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
20 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
21 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
22 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
23 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
24 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
25 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
26 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
27 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
28 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
29 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
30 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
31 SCB_CONTROL[0x0] SCB_SCSIID[0xff] SCB_LUN[0xff] SCB_TAG[0xff]
Pending list:
5 SCB_CONTROL[0x60] SCB_SCSIID[0x7] SCB_LUN[0x0]
Kernel Free SCB list: 6 2 7 0 3 4
DevQ(0:0:0): 0 waiting
DevQ(0:1:0): 0 waiting

<<<<<<<<<<<<<<<<< Dump Card State Ends >>>>>>>>>>>>>>>>>>
Kernel panic: for safety
In interrupt handler - not syncing

I believe this is fine with you; I mean, you told me this is because the
aic7xxx cannot restore, right?

The only pending VM bug right now (besides the mprotect feature request
that I already implemented at 85%) is the compound bug report from
Christoph, but to me that sounds like a kernel miscompilation: it makes
no sense that PageCompound(p) == 0 and a nanosecond later p->flags &
(1<<PG_compound) == 1. And no, it's not likely a race condition; nothing
weird like that has ever happened on x86 yet, and it's all common code
(no arch details in the compound thing; in fact it must not even depend
on MMU etc.).

>
> If swsusp/pmdisk are the only users of rw_swap_page_sync, perhaps it
> should be moved to the power/ directory?

It's OK to leave it in page_io.c, since it's generating a fake swapcache
entry, and there are writeback details etc. that had better stay in the
mm layer.
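
(A rough sketch of that fake swapcache entry, pieced together from the
page_io.c hunk quoted earlier in the thread; hedged, since the exact
rw_swap_page_sync() differs in detail:)

	/* enter the private page into swapper_space under a fake swap
	 * entry, so the normal writeback path can reach the swap device */
	add_to_page_cache(page, &swapper_space, entry.val, GFP_KERNEL);
	ret = swap_writepage(page, &swap_wbc);
	wait_on_page_writeback(page);
	/* afterwards the fake entry is torn down and the references are
	 * released, as in the hunk quoted above */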

2004-04-02 21:46:10

by Pavel Machek

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Hi!

> > > > An anonymous user page meets these requirements. I did say "anal", but
> > > > rw_swap_page_sync() is a general-purpose library function and we shouldn't
> > > > be making assumptions about the type of page which the caller happens to be
> > > > feeding us.
> > >
> > > that is a specialized backdoor to do I/O on _private_ pages, it's not a
> > > general-purpose library function for doing anonymous page
> > > swapin/swapout; in fact the only user is swap suspend, and we'd better
> > > forbid swap suspend to pass anonymous pages through that interface and
> > > be sure that nobody will ever attempt anything like that.
> > >
> > > that interface is useful only to reach the swap device, for doing I/O on
> > > private pages outside the VM; in the old days that was used to
> > > read/write the swap header (again on a private page), and swap suspend is
> > > using it for similar reasons on _private_ pages.
> >
> > Ahha, so *here* is that discussion happening. I was only seeing it at
> > bugzilla, and could not make sense of it.
>
> ;)
>
> Btw, as far as I can tell I cannot see any more VM issues with the
> current CVS kernel; what I get now is:

What does "current CVS kernel" mean? Current one at bkcvs?

> > If swsusp/pmdisk are the only users of rw_swap_page_sync, perhaps it
> > should be moved to the power/ directory?
>
> It's OK to leave it in page_io.c, since it's generating a fake swapcache
> entry, and there are writeback details etc. that had better stay in the
> mm layer.

Ok.
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]

2004-04-02 21:49:25

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 11:45:48PM +0200, Pavel Machek wrote:
> Hi!
>
> > > > > An anonymous user page meets these requirements. I did say "anal", but
> > > > > rw_swap_page_sync() is a general-purpose library function and we shouldn't
> > > > > be making assumptions about the type of page which the caller happens to be
> > > > > feeding us.
> > > >
> > > > that is a specialized backdoor to do I/O on _private_ pages, it's not a
> > > > general-purpose library function for doing anonymous page
> > > > swapin/swapout; in fact the only user is swap suspend, and we'd better
> > > > forbid swap suspend to pass anonymous pages through that interface and
> > > > be sure that nobody will ever attempt anything like that.
> > > >
> > > > that interface is useful only to reach the swap device, for doing I/O on
> > > > private pages outside the VM; in the old days that was used to
> > > > read/write the swap header (again on a private page), and swap suspend is
> > > > using it for similar reasons on _private_ pages.
> > >
> > > Ahha, so *here* is that discussion happening. I was only seeing it at
> > > bugzilla, and could not make sense of it.
> >
> > ;)
> >
> > Btw, as far as I can tell I cannot see any more VM issues with the
> > current CVS kernel; what I get now is:
>
> What does "current CVS kernel" mean? Current one at bkcvs?

Of course not; it means the kernel-source-26 that we used to reproduce
the suspend problem so far (mainline has no -mm writeback and no
anon-vma, so it cannot have problems with rw_swap_page_sync).

2004-04-03 08:41:07

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Fri, Apr 02, 2004 at 10:35:14PM +0200, Andrea Arcangeli wrote:
> How can that be the second one? (I deduced it was the first one because
> it cannot be the second one, and the offset didn't look to be at the very
> end of the function.) This is the second one:
>
> if (!PageCompound(p))
> bad_page(__FUNCTION__, p);
>
> but bad_page shows p->flags == 0x00080008 and 1<<PG_compound ==
> 0x80000.
>
> So PG_compound is definitely set for "p" and it can't be the second one
> triggering.
>
> Can you double-check? Maybe we should double-check the asm. Something
> sounds fundamentally wrong there, it sounds like a miscompilation.
> Which compiler are you using?

Because I didn't trust my ppc assembly reading that much, I put in a
printk, and it's actually the third bad_page(), sorry.

2004-04-03 15:20:28

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Sat, Apr 03, 2004 at 09:40:58AM +0100, Christoph Hellwig wrote:
> On Fri, Apr 02, 2004 at 10:35:14PM +0200, Andrea Arcangeli wrote:
> > How can that be the second one? (I deduced it was the first one because
> > it cannot be the second one, and the offset didn't look to be at the very
> > end of the function.) This is the second one:
> >
> > if (!PageCompound(p))
> > bad_page(__FUNCTION__, p);
> >
> > but bad_page shows p->flags == 0x00080008 and 1<<PG_compound ==
> > 0x80000.
> >
> > So PG_compound is definitely set for "p" and it can't be the second one
> > triggering.
> >
> > Can you double-check? Maybe we should double-check the asm. Something
> > sounds fundamentally wrong there, it sounds like a miscompilation.
> > Which compiler are you using?
>
> Because I didn't trust my ppc assembly reading that much, I put in a
> printk, and it's actually the third bad_page(), sorry.

OK, no problem, so page->private got screwed. I cannot see what could
change page->private, though. I should also have noticed myself that
page->private was wrong: 0xc07721ff is not a 4-byte-aligned address. That
explains the weird page count too, since page_count follows
page->private to return the page->count of the master page.

I've no idea what could set page->private to such a weird address. The
"p" page is at address c0772380, which seems sane; the page->flags and
page->mapping are sane as well (p->count cannot be seen, what we see is
p->private->count); only page->private is screwed, apparently.
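
(A minimal sketch of the lookup being described, i.e. why a corrupt,
misaligned page->private produces the garbage count seen above; the field
names are assumed from the 2.6-era struct page, not the verbatim source:)

	static inline int sketch_page_count(struct page *p)
	{
		if (PageCompound(p))
			p = (struct page *)p->private;	/* follow to master */
		return atomic_read(&p->count);	/* garbage if private is bogus */
	}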

If you want, you can give this patch a spin. As far as the old code
worked (i.e. with hugetlbfs=n) this should work too, since it disables
the compound feature completely, but if it works it probably only hides
the real bug. You can use rc3-aa3 for this (it already has the latest
robustness fixes I posted to you).

--- x/mm/page_alloc.c.~1~ 2004-04-02 20:37:14.000000000 +0200
+++ x/mm/page_alloc.c 2004-04-03 17:15:52.647449336 +0200
@@ -563,7 +563,9 @@ __alloc_pages(unsigned int gfp_mask, uns
cold = 0;
if (gfp_mask & __GFP_COLD)
cold = __GFP_COLD;
+#if 0
if (gfp_mask & __GFP_NO_COMP)
+#endif
cold |= __GFP_NO_COMP;

zones = zonelist->zones; /* the list of zones suitable for gfp_mask */

2004-04-03 16:00:04

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Sat, Apr 03, 2004 at 05:20:26PM +0200, Andrea Arcangeli wrote:
> If you want, you can give this patch a spin. As far as the old code
> worked (i.e. with hugetlbfs=n) this should work too, since it disables
> the compound feature completely, but if it works it probably only hides
> the real bug. You can use rc3-aa3 for this (it already has the latest
> robustness fixes I posted to you).
>
> --- x/mm/page_alloc.c.~1~ 2004-04-02 20:37:14.000000000 +0200
> +++ x/mm/page_alloc.c 2004-04-03 17:15:52.647449336 +0200
> @@ -563,7 +563,9 @@ __alloc_pages(unsigned int gfp_mask, uns
> cold = 0;
> if (gfp_mask & __GFP_COLD)
> cold = __GFP_COLD;
> +#if 0
> if (gfp_mask & __GFP_NO_COMP)
> +#endif
> cold |= __GFP_NO_COMP;
>
> zones = zonelist->zones; /* the list of zones suitable for gfp_mask */

I've written another piece of debugging code for you; this one also
applies on top of rc3-aa3, but of course not at the same time as the
above one. The above one disables compound completely, while the below
one tries to debug what's going wrong in compound.

Basically I store a backup copy of page->private into page->mapcount
(the arch is 32bit, so they're the same size). We know for sure you're
not going to map those order > 0 pages into userspace since hugetlbfs is
off, so reusing mapcount as a backup copy of page->private for compound
pages should be OK.

This way, when we get the screwed page->private, we see what's going on,
and whether page->mapcount still points to 'page'. If page->mapcount ==
page, at least we know it's only page->private being corrupt. I don't
really see how ppc32 could corrupt page->private, though.

--- x-debug/mm/page_alloc.c.~1~ 2004-04-02 20:37:14.000000000 +0200
+++ x-debug/mm/page_alloc.c 2004-04-03 17:55:16.629069504 +0200
@@ -122,6 +122,7 @@ static void prep_compound_page(struct pa

SetPageCompound(p);
p->private = (unsigned long)page;
+ p->mapcount = (unsigned int)page; /* works 32bit only */
}
}

@@ -130,16 +131,30 @@ static void destroy_compound_page(struct
int i;
int nr_pages = 1 << order;

- if (page[1].index != order)
+ if (page[1].index != order) {
+ printk("Badness in %s at %s:%d\n", __FUNCTION__, __FILE__, __LINE__);
bad_page(__FUNCTION__, page);
+ }
+ if ((unsigned long) page != page->private || page->private != page->mapcount) {
+ printk("Badness in %s at %s:%d\n", __FUNCTION__, __FILE__, __LINE__);
+ printk("private %lx real %x page %p\n", page->private, page->mapcount, page);
+ bad_page(__FUNCTION__, page);
+ }

for (i = 0; i < nr_pages; i++) {
struct page *p = page + i;

- if (!PageCompound(p))
+ if (!PageCompound(p)) {
+ printk("Badness in %s at %s:%d\n", __FUNCTION__, __FILE__, __LINE__);
+ printk("index %d\n", i);
bad_page(__FUNCTION__, p);
- if (p->private != (unsigned long)page)
+ }
+ if (p->private != (unsigned long)page || p->private != p->mapcount) {
+ printk("Badness in %s at %s:%d\n", __FUNCTION__, __FILE__, __LINE__);
+ printk("index %d private %lx real %x page %p\n", i, p->private, p->mapcount, page);
bad_page(__FUNCTION__, p);
+
+ }
ClearPageCompound(p);
}
}
@@ -211,7 +226,6 @@ static inline void __free_pages_bulk (st
static inline void free_pages_check(const char *function, struct page *page)
{
if ( page->mapping != NULL ||
- page->mapcount ||
page_count(page) != 0 ||
(page->flags & (
1 << PG_lru |
@@ -316,7 +330,6 @@ static void prep_new_page(struct page *
struct page * page = _page + i;

if (page->mapping ||
- page->mapcount ||
(page->flags & (
1 << PG_private |
1 << PG_locked |
@@ -336,6 +349,7 @@ static void prep_new_page(struct page *
1 << PG_checked | 1 << PG_mappedtodisk |
1 << PG_compound);
page->private = 0;
+ page->mapcount = 0;
set_page_count(page, 1);
}
}

2004-04-03 17:03:00

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Can you try this potential fix too? (Maybe you want to try this first
thing.)

This is from Hugh's anobjrmap patches.

I merged it once, then I got a crash report, so I backed it out since
things were working anyway; but it was due to a merging error that it
didn't work correctly. The below version should be fine, and it seems
really needed.

I'll upload a new kernel with this applied.

--- x/arch/ppc/mm/pgtable.c.~1~ 2004-02-20 17:26:33.000000000 +0100
+++ x/arch/ppc/mm/pgtable.c 2004-04-03 18:51:35.072468040 +0200
@@ -86,9 +86,14 @@ pte_t *pte_alloc_one_kernel(struct mm_st
extern int mem_init_done;
extern void *early_get_page(void);

- if (mem_init_done)
+ if (mem_init_done) {
pte = (pte_t *)__get_free_page(GFP_KERNEL|__GFP_REPEAT);
- else
+ if (pte) {
+ struct page *ptepage = virt_to_page(pte);
+ ptepage->mapping = (void *) mm;
+ ptepage->index = address & PMD_MASK;
+ }
+ } else
pte = (pte_t *)early_get_page();
if (pte)
clear_page(pte);
@@ -106,8 +111,11 @@ struct page *pte_alloc_one(struct mm_str
#endif

pte = alloc_pages(flags, 0);
- if (pte)
+ if (pte) {
+ pte->mapping = (void *) mm;
+ pte->index = address & PMD_MASK;
clear_highpage(pte);
+ }
return pte;
}

@@ -116,6 +124,7 @@ void pte_free_kernel(pte_t *pte)
#ifdef CONFIG_SMP
hash_page_sync();
#endif
+ virt_to_page(pte)->mapping = NULL;
free_page((unsigned long)pte);
}

@@ -124,6 +133,7 @@ void pte_free(struct page *pte)
#ifdef CONFIG_SMP
hash_page_sync();
#endif
+ pte->mapping = NULL;
__free_page(pte);
}

2004-04-03 17:40:47

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

JFYI, swap suspend now works fine with rc3-aa3, which includes the below
and shuts down the harmless warnings generated by the first
gfp-no-compound patch. This also means the -mm writeback fix doing
the additional page_cache_get to keep remove_exclusive_swap_cache away
worked fine too, and you should merge that fix in -mm.

I believe Christoph's oops on ppc is either the missing tlb flushing
support (fixed by Hugh in the last patch I posted to the list) or some
other ppc issue, and it's the last pending known bug (xfs also has a bug
in truncate exposed by a bugcheck in objrmap, but it's not a VM issue
and its fix should be almost ready too).

I'm very convinced that the alloc_pages API should be the same for all
archs w/o or w/ MMU, and I'm fine if we want to make the non-compound
retval the default (and change __GFP_NO_COMP to __GFP_COMP) in the long
run (to optimize all callers but hugetlbfs). For the short term
__GFP_NO_COMP and compound being the default is the safest (for all
archs).

On Fri, Apr 02, 2004 at 06:46:34PM +0200, Andrea Arcangeli wrote:
> On Fri, Apr 02, 2004 at 10:43:34AM +0100, Christoph Hellwig wrote:
> > I got lots of the following oopses with 2.6.5-rc3aa2 on a powerpc running
> > the xfs testsuite (with the truncate fix applied):
> >
> > Apr 2 13:27:21 bird kernel: Bad page state at destroy_compound_page (in process 'swapper', page c08d9920)
> > Apr 2 13:27:21 bird kernel: flags:0x00000008 mapping:00000000 mapped:0 count:0
> > Apr 2 13:27:21 bird kernel: Backtrace:
> > Apr 2 13:27:21 bird kernel: Call trace:
> > Apr 2 13:27:21 bird kernel: [c000b5c8] dump_stack+0x18/0x28
> > Apr 2 13:27:21 bird kernel: [c0048b60] bad_page+0x70/0xb0
> > Apr 2 13:27:21 bird kernel: [c0048c70] destroy_compound_page+0x80/0xb8
>
> It's not clear why this triggered: bad_page only shows the "master"
> compound page and not the contents of the slave page that triggered the
> bad_page. Can you try again with this incremental patch applied?
> Thanks!
>
> --- x/mm/page_alloc.c.~1~ 2004-04-02 05:24:50.000000000 +0200
> +++ x/mm/page_alloc.c 2004-04-02 18:32:53.189244408 +0200
> @@ -73,9 +73,9 @@ static void bad_page(const char *functio
> {
> printk(KERN_EMERG "Bad page state at %s (in process '%s', page %p)\n",
> function, current->comm, page);
> - printk(KERN_EMERG "flags:0x%08lx mapping:%p mapped:%d count:%d\n",
> + printk(KERN_EMERG "flags:0x%08lx mapping:%p mapped:%d count:%d private:0x%08lx\n",
> (unsigned long)page->flags, page->mapping,
> - page_mapped(page), page_count(page));
> + page_mapped(page), page_count(page), page->private);
> printk(KERN_EMERG "Backtrace:\n");
> dump_stack();
> printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n");
> @@ -137,9 +137,9 @@ static void destroy_compound_page(struct
> struct page *p = page + i;
>
> if (!PageCompound(p))
> - bad_page(__FUNCTION__, page);
> + bad_page(__FUNCTION__, p);
> if (p->private != (unsigned long)page)
> - bad_page(__FUNCTION__, page);
> + bad_page(__FUNCTION__, p);
> ClearPageCompound(p);
> }
> }
> @@ -272,8 +272,12 @@ void __free_pages_ok(struct page *page,
> int i;
>
> mod_page_state(pgfree, 1 << order);
> - for (i = 0 ; i < (1 << order) ; ++i)
> - free_pages_check(__FUNCTION__, page + i);
> + for (i = 0 ; i < (1 << order) ; ++i) {
> + struct page * _page = page + i;
> + if (unlikely(i))
> + __put_page(_page);
> + free_pages_check(__FUNCTION__, _page);
> + }
> list_add(&page->lru, &list);
> kernel_map_pages(page, 1<<order, 0);
> free_pages_bulk(page_zone(page), 1, &list, order);
> @@ -316,19 +320,21 @@ static void prep_new_page(struct page *
> (page->flags & (
> 1 << PG_private |
> 1 << PG_locked |
> - 1 << PG_lru |
> + 1 << PG_lru |
> 1 << PG_active |
> 1 << PG_dirty |
> 1 << PG_reclaim |
> 1 << PG_anon |
> 1 << PG_maplock |
> 1 << PG_swapcache |
> - 1 << PG_writeback )))
> + 1 << PG_writeback |
> + 1 << PG_compound )))
> bad_page(__FUNCTION__, page);
>
> page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
> 1 << PG_referenced | 1 << PG_arch_1 |
> - 1 << PG_checked | 1 << PG_mappedtodisk);
> + 1 << PG_checked | 1 << PG_mappedtodisk |
> + 1 << PG_compound);
> page->private = 0;
> set_page_count(page, 1);
> }
>
>
> This incremental bit made some harmless warnings go away from swap resume,
> but it didn't fix swap resume completely yet; OTOH I'm not sure anymore
> whether there's any further VM issue or whether it's a swap suspend issue.
> The PageCompound bugcheck would already trap any compound page in
> rw_swap_page_sync, so I'm sure nobody tried to swap compound pages in
> swap resume, and I'm also sure that the page->count is now correct, or
> free_pages_check would trigger. I cannot trigger any further bugcheck
> here (and the above patch only shuts down some false positives that
> couldn't hurt functionality, plus it adds further bugchecks).

2004-04-03 20:02:48

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli <[email protected]> wrote:
>
>
> I'm very convinced that the alloc_pages API should be the same for all
> archs w/o or w/ MMU, and I'm fine if we want to make the non-compound
> retval the default (and change __GFP_NO_COMP to __GFP_COMP) in the long
> run (to optimize all callers but hugetlbfs). For the short term
> __GFP_NO_COMP and compound being the default is the safest (for all
> archs).

This single patch enables the compound page logic in
get_page()/put_page():


--- 25/include/linux/mm.h~a 2004-04-03 11:50:56.900246584 -0800
+++ 25-akpm/include/linux/mm.h 2004-04-03 11:50:59.189898504 -0800
@@ -236,7 +236,7 @@ struct page {

extern void FASTCALL(__page_cache_release(struct page *));

-#ifdef CONFIG_HUGETLB_PAGE
+#ifndef CONFIG_HUGETLB_PAGE

static inline int page_count(struct page *p)
{


Increases a 3.5MB vmlinux by 15kB, a lot of it fastpath. We should retain
this optimisation.

It might be better to switch over to address masking in get_user_pages()
and just dump all the compound page logic. I don't immediately see how the
get_user_pages() caller can subsequently do put_page() against the correct
pageframe, but I assume you worked that out?

2004-04-03 23:27:18

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Sat, Apr 03, 2004 at 12:02:27PM -0800, Andrew Morton wrote:
> It might be better to switch over to address masking in get_user_pages()
> and just dump all the compound page logic. I don't immediately see how the

I'm all for it; this is how the 2.4 get_user_pages deals with bigpages
too. I've never enjoyed the compound thing.

> get_user_pages() caller can subsequently do put_page() against the correct
> pageframe, but I assume you worked that out?

see this patch:

http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.23aa2/9910_shm-largepage-18.gz

it's a two-liner fix in follow_page:

@@ -439,6 +457,8 @@ static struct page * follow_page(struct
pmd = pmd_offset(pgd, address);
if (pmd_none(*pmd))
goto out;
+ if (pmd_bigpage(*pmd))
+ return __pmd_page(*pmd) + (address & BIGPAGE_MASK) / PAGE_SIZE;


the BIGPAGE_MASK will never expose anything but the page->private to the
get_user_pages code, and handle_mm_fault takes care of doing the page
faults properly using largepages and pmds if the vma is marked
VM_BIGPAGE.
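
(Illustrative arithmetic for the follow_page hunk above, with example
numbers: 4 MB bigpages and 4 kB base pages, so BIGPAGE_MASK == 0x3fffff;
these constants are assumptions for the example, not taken from the
patch.)

	/* address                = 0x40123456
	 * address & BIGPAGE_MASK = 0x123456  (byte offset inside the bigpage)
	 * offset / PAGE_SIZE     = 0x123     (constituent page 291)
	 * so follow_page() returns the 291st page_t of the bigpage */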

rawio on largepages has been a must-have feature (especially on >=32G)
for more than a year; definitely no need of the compound slowdown for
that.

Still, I would like to understand what's wrong on Christoph's ppc
machine before dumping the whole compound thing.

2004-04-03 23:46:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli <[email protected]> wrote:
>
> On Sat, Apr 03, 2004 at 12:02:27PM -0800, Andrew Morton wrote:
> > It might be better to switch over to address masking in get_user_pages()
> > and just dump all the compound page logic. I don't immediately see how the
>
> I'm all for it, this is how the 2.4 get_user_pages deals with bigpages
> too, I've never enjoyed the compound thing.
>
> > get_user_pages() caller can subsequently do put_page() against the correct
> > pageframe, but I assume you worked that out?
>
> see this patch:
>
> http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.23aa2/9910_shm-largepage-18.gz
>
> it's a two-liner fix in follow_page:
>
> @@ -439,6 +457,8 @@ static struct page * follow_page(struct
> pmd = pmd_offset(pgd, address);
> if (pmd_none(*pmd))
> goto out;
> + if (pmd_bigpage(*pmd))
> + return __pmd_page(*pmd) + (address & BIGPAGE_MASK) / PAGE_SIZE;

OK, that's an x86 solution. But this addresses the easy part - the messy
part happens where we want to unpin the pages at I/O completion in
bio_release_pages(), when the page may not even be in a vma any more...


2004-04-04 00:40:38

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Sat, Apr 03, 2004 at 03:46:08PM -0800, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> > @@ -439,6 +457,8 @@ static struct page * follow_page(struct
> > pmd = pmd_offset(pgd, address);
> > if (pmd_none(*pmd))
> > goto out;
> > + if (pmd_bigpage(*pmd))
> > + return __pmd_page(*pmd) + (address & BIGPAGE_MASK) / PAGE_SIZE;
>
> OK, that's an x86 solution. But this addresses the easy part - the messy

You mean because it assumes the pmd is involved, right?

> part happens where we want to unpin the pages at I/O completion in
> bio_release_pages(), when the page may not even be in a vma any more...

The vma in 2.4 doesn't matter; there's no refcounting on the bigpage
based on the pagetables that map it. This is the zap_pmd code, go
figure:

[..]
do {
if (pmd_bigpage(*pmd))
pmd_clear(pmd);
else
freed += zap_pte_range(tlb, pmd, address, end - address);
[..]



So a vma going away isn't going to make any difference for
get_user_pages, or things would go bad. However, I just noticed that if
you truncate or delete the shm segment during I/O, that will corrupt
memory, since the only refcounting happens in the shm layer in the form
of physical pages indexed by an array, one entry in the array for every
bigpage; so no issues again with refcounting, but you're right that it's
racy against truncate/unlink. That's a fine compromise for 2.4, where
bigpages are under a sysctl that disables local security anyway, but I
agree that in 2.6 doing it with proper refcounting is needed, and I see
the point of compound better now.

Replacing the compound framework with a wrapper that reaches the master
page given any page_t* and the size of the bigpage is certainly doable,
as I suggested some emails ago (even if it's not exactly what 2.4 is
doing; or better, 2.4 does that just fine in the shm layer but not in
the I/O completion routine, which means truncate can race with rawio),
though we'll end up filling the pagecache layer with these math
calculations. So it may not make a huge difference for the pagecache
itself, but it'll definitely free all the non-pagecache users from the
compound-or-equivalent-math overhead.
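
(A hedged sketch of such a wrapper: recover the master page_t by address
arithmetic instead of via page->private, assuming the caller knows the
bigpage order; illustrative only, not from any posted patch.)

	static inline struct page *sketch_master_page(struct page *p,
						      unsigned int order)
	{
		unsigned long pfn = page_to_pfn(p);
		return pfn_to_page(pfn & ~((1UL << order) - 1));
	}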

The thing I care about most is that alloc_pages should return the same
thing on every arch. It's just asking for trouble to return compound
pages on x86 and non-compound pages on ppc. Drivers can very well start
depending on compound pages too, and then ppc users will be more sorry
at runtime than about losing 16k ;); there's nothing that prevents
drivers from using compound pages too.

BTW, have you had a look at Christoph's oops on ppc with gfp-no-compound
applied? I'm currently scratching my head over it. Can you imagine
something corrupting page->private for a compound slab page? I can't see
any problem in my gfp-no-compound patch in rc3-aa3 (in fact swap suspend
finally works fine now ;). I feel like my change is exposing some other
bug that was previously hidden with compound turned off. It'll be very
interesting to hear the effect of the three debugging patches I posted.

2004-04-05 03:14:25

by Rajesh Venkatasubramanian

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix


This patch fixes a couple of mask overflow bugs in the prio_tree
search code. These bugs trigger in some very rare corner cases.
The patch also removes a couple of BUG_ONs from the fast paths.
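
(The kind of overflow being fixed, as a two-line illustration; reading
this as the exact failure mode is an inference from the patch below,
which replaces the shifted-out mask with a ULONG_MAX sentinel.)

	/* once the iterator mask reaches the top bit of an unsigned long,
	 * another left shift silently wraps to 0 and the walk loses its
	 * position */
	unsigned long mask = 1UL << (BITS_PER_LONG - 1);
	mask <<= 1;	/* == 0 */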

Now the code is well tested. I have tested all __vma_prio_tree_*
functions in user space with as many as 10 million vmas, and all
prio_tree functions work fine.

This patch is against 2.6.5-aa2. It will apply on top of Hugh's
patches also.

If you would like to test the prio_tree code further in user space,
the programs at the following link may help you.

http://www-personal.engin.umich.edu/~vrajesh/linux/prio_tree/user_space/



mm/prio_tree.c | 46 ++++++++++++++++++++++++++++++----------------
1 files changed, 30 insertions(+), 16 deletions(-)

diff -puN mm/prio_tree.c~080_prio_tree mm/prio_tree.c
--- mmlinux-2.6/mm/prio_tree.c~080_prio_tree 2004-04-04 22:25:29.000000000 -0400
+++ mmlinux-2.6-jaya/mm/prio_tree.c 2004-04-04 22:25:30.000000000 -0400
@@ -124,10 +124,8 @@ static inline struct prio_tree_node *pri
node->parent = old->parent;
if (old->parent->left == old)
old->parent->left = node;
- else {
- BUG_ON(old->parent->right != old);
+ else
old->parent->right = node;
- }
}

if (!prio_tree_left_empty(old)) {
@@ -271,10 +269,8 @@ void prio_tree_remove(struct prio_tree_r

if (cur->parent->right == cur)
cur->parent->right = cur->parent;
- else {
- BUG_ON(cur->parent->left != cur);
+ else
cur->parent->left = cur->parent;
- }

while (cur != node)
cur = prio_tree_replace(root, cur->parent, cur);
@@ -308,8 +304,16 @@ static inline struct prio_tree_node *__p
iter->size_level++;
}
else {
- iter->size_level = 1;
- iter->mask = 1UL << (root->index_bits - 1);
+ if (iter->size_level) {
+ BUG_ON(!prio_tree_left_empty(iter->cur));
+ BUG_ON(!prio_tree_right_empty(iter->cur));
+ iter->size_level++;
+ iter->mask = ULONG_MAX;
+ }
+ else {
+ iter->size_level = 1;
+ iter->mask = 1UL << (root->index_bits - 1);
+ }
}
return iter->cur;
}
@@ -347,8 +351,16 @@ static inline struct prio_tree_node *__p
iter->size_level++;
}
else {
- iter->size_level = 1;
- iter->mask = 1UL << (root->index_bits - 1);
+ if (iter->size_level) {
+ BUG_ON(!prio_tree_left_empty(iter->cur));
+ BUG_ON(!prio_tree_right_empty(iter->cur));
+ iter->size_level++;
+ iter->mask = ULONG_MAX;
+ }
+ else {
+ iter->size_level = 1;
+ iter->mask = 1UL << (root->index_bits - 1);
+ }
}
return iter->cur;
}
@@ -360,13 +372,15 @@ static inline struct prio_tree_node *__p
struct prio_tree_iter *iter)
{
iter->cur = iter->cur->parent;
- iter->mask <<= 1;
- if (iter->size_level) {
- if (iter->size_level == 1)
- iter->mask = 1UL;
+ if (iter->mask == ULONG_MAX)
+ iter->mask = 1UL;
+ else if (iter->size_level == 1)
+ iter->mask = 1UL;
+ else
+ iter->mask <<= 1;
+ if (iter->size_level)
iter->size_level--;
- }
- else if (iter->value & iter->mask)
+ if (!iter->size_level && (iter->value & iter->mask))
iter->value ^= iter->mask;
return iter->cur;
}

_

2004-04-05 04:42:49

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Sun, Apr 04, 2004 at 11:14:25PM -0400, Rajesh Venkatasubramanian wrote:
>
> This patch fixes a couple of mask overflow bugs in the prio_tree
> search code. These bugs trigger in some very rare corner cases.
> The patch also removes a couple of BUG_ONs from the fast paths.
>
> Now the code is well tested. I have tested all __vma_prio_tree_*
> functions in user space with as many as 10 million vmas, and all
> prio_tree functions work fine.

This is great news.

>
> This patch is against 2.6.5-aa2. It will apply on top of Hugh's
> patches also.

I'm releasing an update for this.

> If you would like to test the prio_tree code further in user space,
> the programs at the following link may help you.
>
> http://www-personal.engin.umich.edu/~vrajesh/linux/prio_tree/user_space/

thanks for this great work.

2004-04-05 09:59:59

by Christoph Hellwig

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Sat, Apr 03, 2004 at 07:02:58PM +0200, Andrea Arcangeli wrote:
> can you try this potential fix too? (maybe you want to try this first
> thing)
>
> this is from Hugh's anobjrmap patches.
>
> I merged it once, then I got a crash report, so I backed it out since it
> was working anyway. But it was due to a merging error that it didn't work
> correctly; the version below should be fine, and it seems really needed.
>
> I'll upload a new kernel with this applied.

Still fails with 2.6.5-aa3, which seems to have this one applied.

2004-04-05 12:11:19

by Christoph Hellwig

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, Apr 05, 2004 at 10:59:12AM +0100, Christoph Hellwig wrote:
> On Sat, Apr 03, 2004 at 07:02:58PM +0200, Andrea Arcangeli wrote:
> > can you try this potential fix too? (maybe you want to try this first
> > thing)
> >
> > this is from Hugh's anobjrmap patches.
> >
> > I merged it once, then I got a crash report, so I backed it out since it
> > was working anyway. But it was due to a merging error that it didn't work
> > correctly; the version below should be fine, and it seems really needed.
> >
> > I'll upload a new kernel with this applied.
>
> Still fails with 2.6.5-aa3, which seems to have this one applied.

Disabling compound pages unconditionally gets it working again.

2004-04-05 16:09:07

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, Apr 05, 2004 at 01:11:13PM +0100, Christoph Hellwig wrote:
> On Mon, Apr 05, 2004 at 10:59:12AM +0100, Christoph Hellwig wrote:
> > On Sat, Apr 03, 2004 at 07:02:58PM +0200, Andrea Arcangeli wrote:
> > > can you try this potential fix too? (maybe you want to try this first
> > > thing)
> > >
> > > this is from Hugh's anobjrmap patches.
> > >
> > > I merged it once, then I got a crash report, so I backed it out since it
> > > was working anyway. But it was due to a merging error that it didn't work
> > > correctly; the version below should be fine, and it seems really needed.
> > >
> > > I'll upload a new kernel with this applied.
> >
> > Still fails with 2.6.5-aa3, which seems to have this one applied.
>
> Disabling compound pages unconditionally gets it working again.

This is weird. It sounds like something is reusing page->private for
slab pages on ppc; how can that be possible?

Can you also double-check that this is not reproducible on x86, just in
case?

Can you try again with compound on and the debugging patch I posted that
replicates page->private into page->mapping, to verify it's only
page->private being corrupted?
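
A hedged sketch of that replicate-and-compare debugging idea (the helper
names are hypothetical; the actual debugging patch isn't shown in this
thread):

#include <linux/mm.h>

/* At setup time, mirror ->private into the (here unused) ->mapping. */
static inline void debug_mirror_private(struct page *page)
{
	page->mapping = (struct address_space *)page->private;
}

/* Later: fires iff something scribbled on ->private but not ->mapping. */
static inline void debug_check_private(struct page *page)
{
	WARN_ON(page->private != (unsigned long)page->mapping);
}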

thanks for the help.

2004-04-06 04:22:32

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, Apr 05, 2004 at 01:11:13PM +0100, Christoph Hellwig wrote:
> Disabling compound pages unconditionally gets it working again.

You know what, I'm looking hard everywhere (this is the only pending,
still-unresolved VM-related bug I have), and in the end I happened to
take a quick look at xfs too. I see nothing else in the kernel as
suspect as the piece of xfs code that I touched below. Unfortunately,
the multitude of xfs paths that can reach that place prevents me from
quickly identifying an exact stack trace where that bugcheck will
trigger at runtime, but it's the only remaining thing I haven't yet
ruled out from the source code audit. Plus, the kind of corruption on
page->private that your previous debugging data showed matches
_exactly_ the kind of bitflag corruption that those set_bits would
generate. Plus disabling compound fixes it. Plus this only triggers on
xfs. Plus you never used hugetlbfs=y before. Then there's
pagebuf_associate_memory, which rings an extremely *loud* bell;
pagebuf_get_no_daddr and XBUF_SET_PTR sound even louder; then I go on
with xlog_get_bp and tons of other things doing pagebuf I/O with
kmalloc'ed memory with a variable kmalloc size. Too many coincidences
for this not to be an xfs bug.

What really happens is that you get errors with my tree because xfs, in
some unlikely case, is messing with set_bit on the page->private of slab
pages of order > 0. This effectively means xfs has always been silently
corrupting memory in 2.6 with hugetlbfs turned on, and I'm just exposing
this bug for the first time to you with my robustness gfp-no-compound
fix.
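
To make the failure mode concrete, here is a small user-space model of
the clash (a reconstruction under one assumption: compound-page setup in
this era stores the head page pointer in each subpage's page->private):

#include <stdio.h>

/* Toy stand-in for struct page, for illustration only. */
struct page {
	unsigned long private;
};

int main(void)
{
	struct page head, tail;

	/* Compound setup: the subpage links back to its head page. */
	tail.private = (unsigned long)&head;

	/* A stray set_bit(1, &page->private) from per-block buffer
	 * bookkeeping flips a bit inside that pointer (bit 1 is clear
	 * in any naturally aligned address, so the damage is visible). */
	tail.private |= 1UL << 1;

	printf("head page at:        %#lx\n", (unsigned long)&head);
	printf("corrupted back-link: %#lx\n", tail.private);
	return 0;
}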

Maybe next time you'll think twice before insulting a patch that helped
you fix a nasty mm corruption bug by enhancing the VM's robustness, and
for the record there's not a single database-specific change in my tree
(yet ;).

Can you give this thing a spin now on top of 2.6.5-aa3 and verify it
really triggers?

Index: fs/xfs/linux/xfs_buf.c
===================================================================
RCS file: /home/andrea/crypto/cvs/linux-2.5/fs/xfs/linux/xfs_buf.c,v
retrieving revision 1.4
diff -u -p -r1.4 xfs_buf.c
--- fs/xfs/linux/xfs_buf.c 3 Mar 2004 06:53:03 -0000 1.4
+++ fs/xfs/linux/xfs_buf.c 6 Apr 2004 02:59:27 -0000
@@ -1285,6 +1285,7 @@ bio_end_io_pagebuf(
for (i = 0; i < bio->bi_vcnt; i++, bvec++) {
struct page *page = bvec->bv_page;

+ BUG_ON(PageCompound(page));
if (pb->pb_error) {
SetPageError(page);
} else if (blocksize == PAGE_CACHE_SIZE) {


And now I give you a blazing fast, 100% reliable fix so you can still
mess with page->private on the slab pages inside xfs as much as you want ;),
but without destabilizing 2.6 mainline with hugetlbfs=y (or 2.6-aa).
Note this fix wouldn't be possible w/o my new gfp-no-compound logic, so
you may have to apply the gfp-no-compound patch to the xfs tree too to
fix the hugetlbfs=y instability; this patch itself is incremental with
2.6.5-aa3.

I cannot reverse the logic to __GFP_COMP (that would micro-optimize the
order > 0 allocations) or, even more simply, add a one-liner in slab's
__get_free_pages call, because that would invalidate all the hard
driver testing done in the last few months: just like xfs breaks with
compound turned on, everything else may break with compound turned off,
and the majority of the recent testing happened with compound on.

Please test it so I can check it into CVS and release a 2.6-aa update.
Many thanks for all the help!

--- x/fs/xfs/linux/xfs_buf.c.~1~ 2004-03-11 08:27:42.000000000 +0100
+++ x/fs/xfs/linux/xfs_buf.c 2004-04-06 06:02:58.095233216 +0200
@@ -189,7 +189,7 @@ free_address(
{
a_list_t *aentry;

- aentry = kmalloc(sizeof(a_list_t), GFP_ATOMIC);
+ aentry = kmalloc(sizeof(a_list_t), GFP_ATOMIC | __GFP_NO_COMP);
if (aentry) {
spin_lock(&as_lock);
aentry->next = as_free_head;
@@ -870,7 +870,7 @@ pagebuf_get_no_daddr(
kfree(rmem); /* free the mem from the previous try */
tlen <<= 1; /* double the size and try again */
}
- if ((rmem = kmalloc(tlen, GFP_KERNEL)) == 0) {
+ if ((rmem = kmalloc(tlen, GFP_KERNEL | __GFP_NO_COMP)) == 0) {
pagebuf_free(pb);
return NULL;
}
@@ -1285,6 +1285,7 @@ bio_end_io_pagebuf(
for (i = 0; i < bio->bi_vcnt; i++, bvec++) {
struct page *page = bvec->bv_page;

+ BUG_ON(PageCompound(page));
if (pb->pb_error) {
SetPageError(page);
} else if (blocksize == PAGE_CACHE_SIZE) {
--- x/fs/xfs/linux/xfs_file.c.~1~ 2004-04-04 08:09:26.000000000 +0200
+++ x/fs/xfs/linux/xfs_file.c 2004-04-06 06:01:42.165776248 +0200
@@ -348,7 +348,7 @@ linvfs_readdir(

/* Try fairly hard to get memory */
do {
- if ((read_buf = (caddr_t)kmalloc(rlen, GFP_KERNEL)))
+ if ((read_buf = (caddr_t)kmalloc(rlen, GFP_KERNEL | __GFP_NO_COMP)))
break;
rlen >>= 1;
} while (rlen >= 1024);
--- x/fs/xfs/linux/xfs_iops.c.~1~ 2004-04-06 05:57:39.817618768 +0200
+++ x/fs/xfs/linux/xfs_iops.c 2004-04-06 06:02:24.030411856 +0200
@@ -418,11 +418,11 @@ linvfs_follow_link(
ASSERT(dentry);
ASSERT(nd);

- link = (char *)kmalloc(MAXNAMELEN+1, GFP_KERNEL);
+ link = (char *)kmalloc(MAXNAMELEN+1, GFP_KERNEL | __GFP_NO_COMP);
if (!link)
return -ENOMEM;

- uio = (uio_t *)kmalloc(sizeof(uio_t), GFP_KERNEL);
+ uio = (uio_t *)kmalloc(sizeof(uio_t), GFP_KERNEL | __GFP_NO_COMP);
if (!uio) {
kfree(link);
return -ENOMEM;
--- x/fs/xfs/linux/xfs_ioctl.c.~1~ 2004-04-04 08:09:26.000000000 +0200
+++ x/fs/xfs/linux/xfs_ioctl.c 2004-04-06 06:01:56.278630768 +0200
@@ -517,7 +517,7 @@ xfs_attrmulti_by_handle(
return -error;

size = am_hreq.opcount * sizeof(attr_multiop_t);
- ops = (xfs_attr_multiop_t *)kmalloc(size, GFP_KERNEL);
+ ops = (xfs_attr_multiop_t *)kmalloc(size, GFP_KERNEL | __GFP_NO_COMP);
if (!ops) {
VN_RELE(vp);
return -XFS_ERROR(ENOMEM);
--- x/fs/xfs/linux/kmem.h.~1~ 2004-04-04 08:09:26.000000000 +0200
+++ x/fs/xfs/linux/kmem.h 2004-04-06 06:08:14.173182064 +0200
@@ -101,7 +101,7 @@ kmem_flags_convert(int flags)
if (PFLAGS_TEST_FSTRANS() || (flags & KM_NOFS))
lflags &= ~__GFP_FS;

- return lflags;
+ return lflags | __GFP_NO_COMP;
}

static __inline void *

2004-04-06 04:43:49

by Andrew Morton

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli <[email protected]> wrote:
>
> Then there's pagebuf_associate_memory, which rings an extremely *loud*
> bell; pagebuf_get_no_daddr and XBUF_SET_PTR sound even louder; then I
> go on with xlog_get_bp and tons of other things doing pagebuf I/O with
> kmalloc'ed memory with a variable kmalloc size. Too many coincidences
> for this not to be an xfs bug.

It does pagebuf I/O with kmalloced memory? Wow. Pretty much anything
which goes from kmalloc virtual addresses back to pageframes is a big fat
warning sign.

Do you see any reason why we shouldn't flip things around and make the
hugetlb code explicitly request the compound page metadata when allocating
the pages?
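
A sketch of the reversal Andrew is suggesting (illustrative: an opt-in
__GFP_COMP flag doesn't exist in this tree yet, and the function below
is made up):

/* Compound metadata becomes opt-in: slab's order > 0 pages stay plain,
 * and only callers like hugetlb request the metadata explicitly. */
static struct page *hugetlb_alloc_page_sketch(void)
{
	return alloc_pages(GFP_HIGHUSER | __GFP_COMP,
			   HPAGE_SHIFT - PAGE_SHIFT);
}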

2004-04-06 05:15:07

by Christoph Hellwig

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, Apr 05, 2004 at 09:43:30PM -0700, Andrew Morton wrote:
> Andrea Arcangeli <[email protected]> wrote:
> >
> > Then there's pagebuf_associate_memory, which rings an extremely *loud*
> > bell; pagebuf_get_no_daddr and XBUF_SET_PTR sound even louder; then I
> > go on with xlog_get_bp and tons of other things doing pagebuf I/O with
> > kmalloc'ed memory with a variable kmalloc size. Too many coincidences
> > for this not to be an xfs bug.
>
> It does pagebuf I/O with kmalloced memory? Wow. Pretty much anything
> which goes from kmalloc virtual addresses back to pageframes is a big fat
> warning sign.

It's for the log I/O. I thought about doing __get_free_page for it, but
that would waste a lot of memory.

2004-04-06 05:17:08

by Christoph Hellwig

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Tue, Apr 06, 2004 at 06:22:22AM +0200, Andrea Arcangeli wrote:
> What really happens is that you get errors with my tree because xfs, in
> some unlikely case, is messing with set_bit on the page->private of slab
> pages of order > 0.

Yes, that case would be a filesystem with blocksize < PAGE_SIZE and a
buffer whose size is > PAGE_SIZE && < MAX_SLAB_SIZE.
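
As a worked example of that window (numbers illustrative): with 512-byte
blocks and an 8 KB log buffer, kmalloc serves the buffer from an order-1
slab allocation, which carries compound metadata whenever compound pages
are enabled:

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;  /* PAGE_SIZE */
	unsigned long blocksize = 512;   /* < PAGE_SIZE: per-block state is
					    kept in page->private */
	unsigned long buflen = 8192;     /* > PAGE_SIZE, small enough for
					    kmalloc */
	unsigned int order = 0;

	while ((page_size << order) < buflen)
		order++;

	printf("%lu-byte buffer -> order-%u slab pages (compound metadata "
	       "when order > 0)\n", buflen, order);
	printf("blocksize %lu < PAGE_SIZE %lu -> completion handler pokes "
	       "page->private\n", blocksize, page_size);
	return 0;
}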

Can you try the patch below (testing it now, but I'm pretty sure it'll fix it)
instead of all the kmalloc changes?:

--- linux-2.5/fs/xfs/linux/xfs_buf.c 2004-04-02 20:10:56.000000000 +0200
+++ linux-2.6.5-aa3/fs/xfs/linux/xfs_buf.c 2004-04-06 09:13:05.275317568 +0200
@@ -448,7 +448,8 @@ _pagebuf_lookup_pages(
if (flags & PBF_READ)
pb->pb_locked = 1;
good_pages--;
- } else if (!PagePrivate(page)) {
+ } else if ((pb->pb_flags & _PBF_PAGECACHE) &&
+ !PagePrivate(page)) {
unsigned long i, range;

/*
@@ -1289,7 +1290,8 @@ bio_end_io_pagebuf(
SetPageError(page);
} else if (blocksize == PAGE_CACHE_SIZE) {
SetPageUptodate(page);
- } else if (!PagePrivate(page)) {
+ } else if ((pb->pb_flags & _PBF_PAGECACHE) &&
+ !PagePrivate(page)) {
unsigned int j, range;

ASSERT(blocksize < PAGE_CACHE_SIZE);

2004-04-06 16:06:17

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Tue, Apr 06, 2004 at 06:16:46AM +0100, Christoph Hellwig wrote:
> Can you try the patch below (testing it now, but I'm pretty sure it'll fix it)
> instead of all the kmalloc changes?:

I'm having some email delay, so I don't know if you sent me more recent
emails. Did it work fine as expected, or should I keep my kmalloc change?

thanks

2004-04-06 21:54:45

by Andrea Arcangeli

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Mon, Apr 05, 2004 at 09:43:30PM -0700, Andrew Morton wrote:
> It does pagebuf I/O with kmalloced memory? Wow. Pretty much anything
> which goes from kmalloc virtual addresses back to pageframes is a big fat
> warning sign.

It's tricky indeed, though it worked fine as long as compound was left
disabled with hugetlbfs=n.

> Do you see any reason why we shouldn't flip things around and make the
> hugetlb code explicitly request the compound page metadata when allocating
> the pages?

I definitely agree we should reverse the logic to __GFP_COMP instead of
__GFP_NO_COMP in mainline. The problem is I couldn't do it in the short
term without invalidating the testing done with hugetlbfs=y. Soon I can
reverse it and add __GFP_COMP only for the hugetlbfs big-order dynamic
allocations, which should be quick to identify.

Christoph, I've got no positive feedback yet for the alternate fix you
proposed, and it's not obvious to my eyes (isn't good_pages going to be
screwed by your fix?), but I wanted to check a fix into CVS in the
meanwhile. So for now I've checked in my __GFP_NO_COMP fix, which I'm sure
doesn't require any testing since it's obviously safe and should
definitely fix the problem. This way you can also take your time
testing your better fix.

What's not clear to me about your fix is whether it's really safe
with good_pages being over-decremented (good_pages doesn't look like just
a hint; there seems to be a valid reason you're doing the set_bit/test_bit
on page->private, no?).

thanks.

2004-04-07 01:34:12

by Nathan Scott

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Tue, Apr 06, 2004 at 06:01:41PM +0200, Andrea Arcangeli wrote:
> On Tue, Apr 06, 2004 at 06:16:46AM +0100, Christoph Hellwig wrote:
> > Can you try the patch below (testing it now, but I'm pretty sure it'll fix it)
> > instead of all the kmalloc changes?:
>
> I'm having some email delay, so I don't know if you sent me more recent
> emails. Did it work fine as expected, or should I keep my kmalloc change?
>

Christoph's away for a few days, so I'll jump in. The patch was not
quite right: the first part should be dropped; the second section
is right. In the first hunk (i.e. the _pagebuf_lookup_pages changes)
we are always working with page cache pages there, so the change was
unnecessary (it also tested a flag which only gets set several lines
down in that call, so it wasn't quite right anyway). The second part of
the patch is the critical piece, as it avoids updating the page->private
field in the I/O completion handler for non-pagecache pages. I've been
testing with just that piece, and it looks good.

I'll be putting this into our local XFS trees shortly, and will send
it on to Linus and Andrew soon.

thanks Andrea.

--
Nathan

2004-04-07 01:40:10

by Nathan Scott

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

On Tue, Apr 06, 2004 at 11:54:41PM +0200, Andrea Arcangeli wrote:
>
> Christoph, I've got no positive feedback yet for the alternate fix you
> proposed, and it's not obvious to my eyes (isn't good_pages going to be
> screwed by your fix?), but I wanted to check a fix into CVS in the
> meanwhile. So for now I've checked in my __GFP_NO_COMP fix, which I'm sure
> doesn't require any testing since it's obviously safe and should
> definitely fix the problem. This way you can also take your time
> testing your better fix.
>
> What's not clear to me about your fix is whether it's really safe
> with good_pages being over-decremented (good_pages doesn't look like just
> a hint; there seems to be a valid reason you're doing the set_bit/test_bit
> on page->private, no?).

Ignore the first part of that patch; it was misdirected (Christoph
would have gone through and put guards around all pagebuf page->private
users; it turns out the first change was unnecessary, and confusing ;).

cheers.

--
Nathan

2004-04-08 19:10:13

by Bill Davidsen

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Andrea Arcangeli wrote:
> JFYI, swap suspend now works fine with rc3-aa3, which includes the below
> and shuts down the harmless warnings generated by the first
> gfp-no-compound patch. This also means the -mm writeback fixes doing
> the additional page_cache_get to keep remove_exclusive_swap_cache
> away worked fine too, and you should merge that fix in -mm.

Does this mean resume works now? Or just that suspend doesn't oops and
turns off the power?

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2004-04-20 22:56:41

by Pavel Machek

Subject: Re: [RFC][PATCH 1/3] radix priority search tree - objrmap complexity fix

Hi!

> > JFYI, swap suspend now works fine with rc3-aa3, which includes the below
> > and shuts down the harmless warnings generated by the first
> > gfp-no-compound patch. This also means the -mm writeback fixes doing
> > the additional page_cache_get to keep remove_exclusive_swap_cache
> > away worked fine too, and you should merge that fix in -mm.
>
> Does this mean resume works now? Or just that suspend doesn't oops and
> turns off the power?

Andrea has a SCSI controller that has no support for suspend/resume, so
even "no oops and power goes off" is a good result for him.
Pavel
--
When do you have a heart between your knees?
[Johanka's followup: and *two* hearts?]