2004-11-15 03:04:18

by Werner Almesberger

[permalink] [raw]
Subject: [RFC] Generalize prio_tree (1/3)

Hi Rajesh,

perhaps you remember me posting a long time ago about generalizing
prio_tree. Now I finally got to make that patch. In fact, there are
three parts:

- the prio_tree "core" in lib/
- switching mm/prio_tree.c to use the "core"
- some debugging extensions

The reason for wanting this generalization is that we'll also need
radix priority search trees for healthier barrier handling in the
IO scheduler (aka disk elevator).

Since this rearranges fairly crucial code, I think a first round
for review is approproate. If nothing major turns up, I'll make
another patch for merging into mainline when the next version is
out.

The patch below puts an includeable version of prio_tree to lib/.
This should be included similar to how inflate.c is used.

The only real change is that index_bits_to_maxindex is now called
prio_tree_index_bits_to_maxindex, and is globally shard.

- Werner

---------------------------------- cut here -----------------------------------

--- linux-2.6.9-orig/include/linux/prio_tree.h Mon Oct 18 18:54:08 2004
+++ linux-2.6.9/include/linux/prio_tree.h Sun Nov 14 21:29:29 2004
@@ -73,4 +73,6 @@
return node->right == node;
}

+extern unsigned long prio_tree_index_bits_to_maxindex[];
+
#endif /* _LINUX_PRIO_TREE_H */
--- linux-2.6.9-orig/lib/Makefile Mon Oct 18 18:53:08 2004
+++ linux-2.6.9/lib/Makefile Sun Nov 14 21:33:17 2004
@@ -6,7 +6,7 @@
lib-y := errno.o ctype.o string.o vsprintf.o cmdline.o \
bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \
kobject.o kref.o idr.o div64.o parser.o int_sqrt.o \
- bitmap.o extable.o
+ bitmap.o extable.o prio_tree_init.o

lib-$(CONFIG_RWSEM_GENERIC_SPINLOCK) += rwsem-spinlock.o
lib-$(CONFIG_RWSEM_XCHGADD_ALGORITHM) += rwsem.o
--- /dev/null Wed Jun 9 20:31:45 2004
+++ linux-2.6.9/lib/prio_tree_init.c Sun Nov 14 21:54:27 2004
@@ -0,0 +1,29 @@
+/*
+ * lib/prio_tree_init.c - priority search tree: initialization
+ *
+ * Copyright (C) 2004, Rajesh Venkatasubramanian <[email protected]>
+ *
+ * This file is released under the GPL v2.
+ *
+ * Based on the radix priority search tree proposed by Edward M. McCreight
+ * SIAM Journal of Computing, vol. 14, no.2, pages 257-276, May 1985
+ *
+ * 02Feb2004 Initial version
+ */
+
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/prio_tree.h>
+
+
+unsigned long prio_tree_index_bits_to_maxindex[BITS_PER_LONG];
+
+void __init prio_tree_init(void)
+{
+ unsigned int i;
+
+ for (i = 0; i < ARRAY_SIZE(prio_tree_index_bits_to_maxindex) - 1; i++)
+ prio_tree_index_bits_to_maxindex[i] = (1UL << (i + 1)) - 1;
+ prio_tree_index_bits_to_maxindex
+ [ARRAY_SIZE(prio_tree_index_bits_to_maxindex) - 1] = ~0UL;
+}
--- /dev/null Wed Jun 9 20:31:45 2004
+++ linux-2.6.9/lib/prio_tree.c Sun Nov 14 23:00:57 2004
@@ -0,0 +1,445 @@
+/*
+ * lib/prio_tree.c - priority search tree: common code
+ *
+ * Copyright (C) 2004, Rajesh Venkatasubramanian <[email protected]>
+ *
+ * This file is released under the GPL v2.
+ *
+ * Based on the radix priority search tree proposed by Edward M. McCreight
+ * SIAM Journal of Computing, vol. 14, no.2, pages 257-276, May 1985
+ *
+ * 02Feb2004 Initial version
+ */
+
+/* Includer will have included linux/prio_tree.h for us */
+
+/*
+ * A clever mix of heap and radix trees forms a radix priority search tree (PST)
+ * which is useful for storing intervals, e.g, we can consider a vma as a closed
+ * interval of file pages [offset_begin, offset_end], and store all vmas that
+ * map a file in a PST. Then, using the PST, we can answer a stabbing query,
+ * i.e., selecting a set of stored intervals (vmas) that overlap with (map) a
+ * given input interval X (a set of consecutive file pages), in "O(log n + m)"
+ * time where 'log n' is the height of the PST, and 'm' is the number of stored
+ * intervals (vmas) that overlap (map) with the input interval X (the set of
+ * consecutive file pages).
+ *
+ * In our implementation, we store closed intervals of the form [radix_index,
+ * heap_index]. We assume that always radix_index <= heap_index. McCreight's PST
+ * is designed for storing intervals with unique radix indices, i.e., each
+ * interval have different radix_index. However, this limitation can be easily
+ * overcome by using the size, i.e., heap_index - radix_index, as part of the
+ * index, so we index the tree using [(radix_index,size), heap_index].
+ *
+ * When the above-mentioned indexing scheme is used, theoretically, in a 32 bit
+ * machine, the maximum height of a PST can be 64. We can use a balanced version
+ * of the priority search tree to optimize the tree height, but the balanced
+ * tree proposed by McCreight is too complex and memory-hungry for our purpose.
+ */
+
+
+/*
+ * Maximum heap_index that can be stored in a PST with index_bits bits
+ */
+static inline unsigned long prio_tree_maxindex(unsigned int bits)
+{
+ return prio_tree_index_bits_to_maxindex[bits - 1];
+}
+
+static void prio_tree_remove(struct prio_tree_root *, struct prio_tree_node *);
+
+/*
+ * Extend a priority search tree so that it can store a node with heap_index
+ * max_heap_index. In the worst case, this algorithm takes O((log n)^2).
+ * However, this function is used rarely and the common case performance is
+ * not bad.
+ */
+static struct prio_tree_node *prio_tree_expand(struct prio_tree_root *root,
+ struct prio_tree_node *node, unsigned long max_heap_index)
+{
+ struct prio_tree_node *first = NULL, *prev, *last = NULL;
+
+ if (max_heap_index > prio_tree_maxindex(root->index_bits))
+ root->index_bits++;
+
+ while (max_heap_index > prio_tree_maxindex(root->index_bits)) {
+ root->index_bits++;
+
+ if (prio_tree_empty(root))
+ continue;
+
+ if (first == NULL) {
+ first = root->prio_tree_node;
+ prio_tree_remove(root, root->prio_tree_node);
+ INIT_PRIO_TREE_NODE(first);
+ last = first;
+ } else {
+ prev = last;
+ last = root->prio_tree_node;
+ prio_tree_remove(root, root->prio_tree_node);
+ INIT_PRIO_TREE_NODE(last);
+ prev->left = last;
+ last->parent = prev;
+ }
+ }
+
+ INIT_PRIO_TREE_NODE(node);
+
+ if (first) {
+ node->left = first;
+ first->parent = node;
+ } else
+ last = node;
+
+ if (!prio_tree_empty(root)) {
+ last->left = root->prio_tree_node;
+ last->left->parent = last;
+ }
+
+ root->prio_tree_node = node;
+ return node;
+}
+
+/*
+ * Replace a prio_tree_node with a new node and return the old node
+ */
+static struct prio_tree_node *prio_tree_replace(struct prio_tree_root *root,
+ struct prio_tree_node *old, struct prio_tree_node *node)
+{
+ INIT_PRIO_TREE_NODE(node);
+
+ if (prio_tree_root(old)) {
+ BUG_ON(root->prio_tree_node != old);
+ /*
+ * We can reduce root->index_bits here. However, it is complex
+ * and does not help much to improve performance (IMO).
+ */
+ node->parent = node;
+ root->prio_tree_node = node;
+ } else {
+ node->parent = old->parent;
+ if (old->parent->left == old)
+ old->parent->left = node;
+ else
+ old->parent->right = node;
+ }
+
+ if (!prio_tree_left_empty(old)) {
+ node->left = old->left;
+ old->left->parent = node;
+ }
+
+ if (!prio_tree_right_empty(old)) {
+ node->right = old->right;
+ old->right->parent = node;
+ }
+
+ return old;
+}
+
+/*
+ * Insert a prio_tree_node @node into a radix priority search tree @root. The
+ * algorithm typically takes O(log n) time where 'log n' is the number of bits
+ * required to represent the maximum heap_index. In the worst case, the algo
+ * can take O((log n)^2) - check prio_tree_expand.
+ *
+ * If a prior node with same radix_index and heap_index is already found in
+ * the tree, then returns the address of the prior node. Otherwise, inserts
+ * @node into the tree and returns @node.
+ */
+static struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root,
+ struct prio_tree_node *node)
+{
+ struct prio_tree_node *cur, *res = node;
+ unsigned long radix_index, heap_index;
+ unsigned long r_index, h_index, index, mask;
+ int size_flag = 0;
+
+ GET_INDEX(node, radix_index, heap_index);
+
+ if (prio_tree_empty(root) ||
+ heap_index > prio_tree_maxindex(root->index_bits))
+ return prio_tree_expand(root, node, heap_index);
+
+ cur = root->prio_tree_node;
+ mask = 1UL << (root->index_bits - 1);
+
+ while (mask) {
+ GET_INDEX(cur, r_index, h_index);
+
+ if (r_index == radix_index && h_index == heap_index)
+ return cur;
+
+ if (h_index < heap_index ||
+ (h_index == heap_index && r_index > radix_index)) {
+ struct prio_tree_node *tmp = node;
+ node = prio_tree_replace(root, cur, node);
+ cur = tmp;
+ /* swap indices */
+ index = r_index;
+ r_index = radix_index;
+ radix_index = index;
+ index = h_index;
+ h_index = heap_index;
+ heap_index = index;
+ }
+
+ if (size_flag)
+ index = heap_index - radix_index;
+ else
+ index = radix_index;
+
+ if (index & mask) {
+ if (prio_tree_right_empty(cur)) {
+ INIT_PRIO_TREE_NODE(node);
+ cur->right = node;
+ node->parent = cur;
+ return res;
+ } else
+ cur = cur->right;
+ } else {
+ if (prio_tree_left_empty(cur)) {
+ INIT_PRIO_TREE_NODE(node);
+ cur->left = node;
+ node->parent = cur;
+ return res;
+ } else
+ cur = cur->left;
+ }
+
+ mask >>= 1;
+
+ if (!mask) {
+ mask = 1UL << (root->index_bits - 1);
+ size_flag = 1;
+ }
+ }
+ /* Should not reach here */
+ BUG();
+ return NULL;
+}
+
+/*
+ * Remove a prio_tree_node @node from a radix priority search tree @root. The
+ * algorithm takes O(log n) time where 'log n' is the number of bits required
+ * to represent the maximum heap_index.
+ */
+static void prio_tree_remove(struct prio_tree_root *root,
+ struct prio_tree_node *node)
+{
+ struct prio_tree_node *cur;
+ unsigned long r_index, h_index_right, h_index_left;
+
+ cur = node;
+
+ while (!prio_tree_left_empty(cur) || !prio_tree_right_empty(cur)) {
+ if (!prio_tree_left_empty(cur))
+ GET_INDEX(cur->left, r_index, h_index_left);
+ else {
+ cur = cur->right;
+ continue;
+ }
+
+ if (!prio_tree_right_empty(cur))
+ GET_INDEX(cur->right, r_index, h_index_right);
+ else {
+ cur = cur->left;
+ continue;
+ }
+
+ /* both h_index_left and h_index_right cannot be 0 */
+ if (h_index_left >= h_index_right)
+ cur = cur->left;
+ else
+ cur = cur->right;
+ }
+
+ if (prio_tree_root(cur)) {
+ BUG_ON(root->prio_tree_node != cur);
+ INIT_PRIO_TREE_ROOT(root);
+ return;
+ }
+
+ if (cur->parent->right == cur)
+ cur->parent->right = cur->parent;
+ else
+ cur->parent->left = cur->parent;
+
+ while (cur != node)
+ cur = prio_tree_replace(root, cur->parent, cur);
+}
+
+/*
+ * Following functions help to enumerate all prio_tree_nodes in the tree that
+ * overlap with the input interval X [radix_index, heap_index]. The enumeration
+ * takes O(log n + m) time where 'log n' is the height of the tree (which is
+ * proportional to # of bits required to represent the maximum heap_index) and
+ * 'm' is the number of prio_tree_nodes that overlap the interval X.
+ */
+
+static struct prio_tree_node *prio_tree_left(struct prio_tree_iter *iter,
+ unsigned long *r_index, unsigned long *h_index)
+{
+ if (prio_tree_left_empty(iter->cur))
+ return NULL;
+
+ GET_INDEX(iter->cur->left, *r_index, *h_index);
+
+ if (iter->r_index <= *h_index) {
+ iter->cur = iter->cur->left;
+ iter->mask >>= 1;
+ if (iter->mask) {
+ if (iter->size_level)
+ iter->size_level++;
+ } else {
+ if (iter->size_level) {
+ BUG_ON(!prio_tree_left_empty(iter->cur));
+ BUG_ON(!prio_tree_right_empty(iter->cur));
+ iter->size_level++;
+ iter->mask = ULONG_MAX;
+ } else {
+ iter->size_level = 1;
+ iter->mask = 1UL << (iter->root->index_bits - 1);
+ }
+ }
+ return iter->cur;
+ }
+
+ return NULL;
+}
+
+static struct prio_tree_node *prio_tree_right(struct prio_tree_iter *iter,
+ unsigned long *r_index, unsigned long *h_index)
+{
+ unsigned long value;
+
+ if (prio_tree_right_empty(iter->cur))
+ return NULL;
+
+ if (iter->size_level)
+ value = iter->value;
+ else
+ value = iter->value | iter->mask;
+
+ if (iter->h_index < value)
+ return NULL;
+
+ GET_INDEX(iter->cur->right, *r_index, *h_index);
+
+ if (iter->r_index <= *h_index) {
+ iter->cur = iter->cur->right;
+ iter->mask >>= 1;
+ iter->value = value;
+ if (iter->mask) {
+ if (iter->size_level)
+ iter->size_level++;
+ } else {
+ if (iter->size_level) {
+ BUG_ON(!prio_tree_left_empty(iter->cur));
+ BUG_ON(!prio_tree_right_empty(iter->cur));
+ iter->size_level++;
+ iter->mask = ULONG_MAX;
+ } else {
+ iter->size_level = 1;
+ iter->mask = 1UL << (iter->root->index_bits - 1);
+ }
+ }
+ return iter->cur;
+ }
+
+ return NULL;
+}
+
+static struct prio_tree_node *prio_tree_parent(struct prio_tree_iter *iter)
+{
+ iter->cur = iter->cur->parent;
+ if (iter->mask == ULONG_MAX)
+ iter->mask = 1UL;
+ else if (iter->size_level == 1)
+ iter->mask = 1UL;
+ else
+ iter->mask <<= 1;
+ if (iter->size_level)
+ iter->size_level--;
+ if (!iter->size_level && (iter->value & iter->mask))
+ iter->value ^= iter->mask;
+ return iter->cur;
+}
+
+static inline int overlap(struct prio_tree_iter *iter,
+ unsigned long r_index, unsigned long h_index)
+{
+ return iter->h_index >= r_index && iter->r_index <= h_index;
+}
+
+/*
+ * prio_tree_first:
+ *
+ * Get the first prio_tree_node that overlaps with the interval [radix_index,
+ * heap_index]. Note that always radix_index <= heap_index. We do a pre-order
+ * traversal of the tree.
+ */
+static struct prio_tree_node *prio_tree_first(struct prio_tree_iter *iter)
+{
+ struct prio_tree_root *root;
+ unsigned long r_index, h_index;
+
+ INIT_PRIO_TREE_ITER(iter);
+
+ root = iter->root;
+ if (prio_tree_empty(root))
+ return NULL;
+
+ GET_INDEX(root->prio_tree_node, r_index, h_index);
+
+ if (iter->r_index > h_index)
+ return NULL;
+
+ iter->mask = 1UL << (root->index_bits - 1);
+ iter->cur = root->prio_tree_node;
+
+ while (1) {
+ if (overlap(iter, r_index, h_index))
+ return iter->cur;
+
+ if (prio_tree_left(iter, &r_index, &h_index))
+ continue;
+
+ if (prio_tree_right(iter, &r_index, &h_index))
+ continue;
+
+ break;
+ }
+ return NULL;
+}
+
+/*
+ * prio_tree_next:
+ *
+ * Get the next prio_tree_node that overlaps with the input interval in iter
+ */
+static struct prio_tree_node *prio_tree_next(struct prio_tree_iter *iter)
+{
+ unsigned long r_index, h_index;
+
+repeat:
+ while (prio_tree_left(iter, &r_index, &h_index))
+ if (overlap(iter, r_index, h_index))
+ return iter->cur;
+
+ while (!prio_tree_right(iter, &r_index, &h_index)) {
+ while (!prio_tree_root(iter->cur) &&
+ iter->cur->parent->right == iter->cur)
+ prio_tree_parent(iter);
+
+ if (prio_tree_root(iter->cur))
+ return NULL;
+
+ prio_tree_parent(iter);
+ }
+
+ if (overlap(iter, r_index, h_index))
+ return iter->cur;
+
+ goto repeat;
+}

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/


2004-11-15 03:13:02

by Werner Almesberger

[permalink] [raw]
Subject: [RFC] prio_tree debugging functions (3/3)

The third and last part, which actually contains new code: debugging
functions for radix priority search trees. They're useful if some bug
happens to mess up the RPST, and you have no clue what is going on.
(Particularly since the RPST isn't exactly the most trivial data
structure either ;-)

I'd just add this to lib/prio_tree.c. Since I don't like overlapping
patches, this isn't in the form of a patch. I'll make one for
submission to mainline.

- Werner

---------------------------------- cut here -----------------------------------

/*
* The following functions validate the internal consistency of an RPST. They
* all have O(n) run time, so use them only on small trees, or in emergencies.
*
* prio_tree_validate_tree validates the internal consistency of an RPST.
*
* prio_tree_absent checks that no tree node points to the "absentee" node.
*
* prio_tree_string turns the tree into a nested list and prints it into a
* string buffer.
*/


#ifdef PRIO_TREE_DEBUG

#define CHECK_BUG(cond) \
if (cond) { \
printk(KERN_EMERG #cond "\n"); \
return -1; \
}


enum prio_tree_walk_state {
prio_tree_walk_down,
prio_tree_walk_left,
prio_tree_walk_right,
};


static int prio_tree_validate_child(const struct prio_tree_node *node,
int right, int bit, int in_size)
{
unsigned long r_parent, h_parent;
unsigned long r_child, h_child;
unsigned long i_parent,i_child;

GET_INDEX(node->parent, r_parent, h_parent);
GET_INDEX(node, r_child, h_child);

CHECK_BUG(r_child > h_child); /* left edge > right edge */
CHECK_BUG(h_parent < h_child); /* sort order wrong */

if (in_size) {
i_parent = r_parent;
i_child = r_child;
}
else {
i_parent = h_parent-r_parent;
i_child = h_child-r_child;
}

/*
* Parent and child can differ only in the lower "bit" bits. Also, the
* value of the next bit is determined by the branch we're taking.
*/
CHECK_BUG(i_parent >> bit != i_child >> bit);
CHECK_BUG(((i_child >> (bit-1)) & 1) != right);

return 0;
}


static int do_prio_tree_validate(const struct prio_tree_node *node, int bits)
{
enum prio_tree_walk_state state = prio_tree_walk_down;
int bit = bits;
int in_start = 1;
int bad;

while (1) {
switch (state) {
/*
* The tree is an RPST keyed by (start,end) down to
* individual start locations. Then it changes to an
* RPST keyed by (size,end).
*/
case prio_tree_walk_down:
if (!prio_tree_left_empty(node)) {
if (!bit) {
CHECK_BUG(!in_start);
bit = bits;
in_start = 0;
}
CHECK_BUG(node->left->parent != node);

node = node->left;
bad = prio_tree_validate_child(node, 0,
bit, in_start);
if (bad)
return bad;
bit--;
break;
}
/* fall through */
case prio_tree_walk_left:
if (!prio_tree_right_empty(node)) {
if (!bit) {
CHECK_BUG(!in_start);
bit = bits;
in_start = 0;
}
CHECK_BUG(node->right->parent != node);

node = node->right;
bad = prio_tree_validate_child(node, 1,
bit, in_start);
if (bad)
return bad;
state = prio_tree_walk_down;
bit--;
break;
}
/* fall through */
case prio_tree_walk_right:
if (prio_tree_root(node))
return 0;
state = node->parent->left == node ?
prio_tree_walk_left : prio_tree_walk_right;
node = node->parent;
bit++;
if (bit == bits && !in_start) {
bit = 0;
in_start = 1;
}
break;
default:
BUG();
}
}
return 0;
}


/*
* We use a slightly different algorithm than the one found in
* abiss_prio_tree_init on purpose.
*/

static int prio_tree_validate_index_map(void)
{
const unsigned long *p;

for (p = prio_tree_index_bits_to_maxindex;
p != (void *) prio_tree_index_bits_to_maxindex+
sizeof(prio_tree_index_bits_to_maxindex); p++) {
int n = p-prio_tree_index_bits_to_maxindex;

CHECK_BUG(*p != ((((1UL << n)-1) << 1) | 1));
}
return 0;
}


int prio_tree_validate(const struct prio_tree_root *root)
{
unsigned long r_root, h_root;
int bad;

bad = prio_tree_validate_index_map();
if (bad)
return bad;

if (prio_tree_empty(root))
return 0;

CHECK_BUG(!root->index_bits);
CHECK_BUG(root->index_bits > sizeof(unsigned long)*8);

/*
* Check the root node now, so that we don't have to check parent and
* child everywhere on the way down the tree.
*/
GET_INDEX(root->prio_tree_node, r_root, h_root);
CHECK_BUG(r_root > h_root);
CHECK_BUG(!prio_tree_root(root->prio_tree_node));

return do_prio_tree_validate(root->prio_tree_node, root->index_bits);
}

#undef CHECK_BUG


static void do_prio_tree_absent(const struct prio_tree_node *node,
const struct prio_tree_node *absentee)
{
enum prio_tree_walk_state state = prio_tree_walk_down;

while (1) {
BUG_ON(node == absentee);
switch (state) {
case prio_tree_walk_down:
if (!prio_tree_left_empty(node)) {
node = node->left;
break;
}
/* fall through */
case prio_tree_walk_left:
if (!prio_tree_right_empty(node)) {
state = prio_tree_walk_down;
node = node->right;
break;
}
/* fall through */
case prio_tree_walk_right:
if (prio_tree_root(node))
return;
state = node->parent->left == node ?
prio_tree_walk_left : prio_tree_walk_right;
node = node->parent;
break;
default:
BUG();
}
}
}


void prio_tree_absent(const struct prio_tree_root *root,
const struct prio_tree_node *absentee)
{
if (prio_tree_empty(root))
return;
do_prio_tree_absent(root->prio_tree_node, absentee);
}


static void do_prio_tree_string(const struct prio_tree_node *node, char **buf,
const char *end)
{
enum prio_tree_walk_state state = prio_tree_walk_down;
unsigned long r_index, h_index;
int len;

while (1) {
GET_INDEX(node, r_index, h_index);
switch (state) {
case prio_tree_walk_down:
len = scnprintf(*buf, end-*buf,
"(%lu[0x%lx] %lu,", r_index, r_index,
h_index);
BUG_ON(len+1 == end-*buf);
*buf += len;
if (!prio_tree_left_empty(node)) {
node = node->left;
break;
}
/* fall through */
case prio_tree_walk_left:
BUG_ON(*buf == end);
*(*buf)++ = ',';
if (!prio_tree_right_empty(node)) {
state = prio_tree_walk_down;
node = node->right;
break;
}
/* fall through */
case prio_tree_walk_right:
BUG_ON(*buf == end);
*(*buf)++ = ')';
if (prio_tree_root(node))
return;
state = node->parent->left == node ?
prio_tree_walk_left : prio_tree_walk_right;
node = node->parent;
break;
default:
BUG();
}
}
}


int prio_tree_string(const struct prio_tree_root *root, char *buf, size_t size)
{
const char *start = buf;
int len;

len = scnprintf(buf, size, "%d@%p:", root->index_bits, root);
BUG_ON(len+1 == size);
buf += len;
if (!prio_tree_empty(root))
do_prio_tree_string(root->prio_tree_node, &buf, start+size);
BUG_ON(buf == start+size);
*buf = 0;
return buf-start;
}


#else

#define prio_tree_validate(root)
#define prio_tree_absent(root, absentee)
#define prio_tree_string(root, buf, size) (*(buf) = 0)

#endif

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2004-11-15 03:44:32

by Werner Almesberger

[permalink] [raw]
Subject: Re: [RFC] Make MM use generalized prio_tree (2/3)

The second part: make mm/prio_tree.c use lib/prio_tree.c

- Werner

---------------------------------- cut here -----------------------------------

--- linux-2.6.9-orig/mm/prio_tree.c Mon Oct 18 18:55:28 2004
+++ linux-2.6.9/mm/prio_tree.c Sun Nov 14 21:32:50 2004
@@ -11,33 +11,12 @@
* 02Feb2004 Initial version
*/

-#include <linux/init.h>
-#include <linux/module.h>
#include <linux/mm.h>
#include <linux/prio_tree.h>

/*
- * A clever mix of heap and radix trees forms a radix priority search tree (PST)
- * which is useful for storing intervals, e.g, we can consider a vma as a closed
- * interval of file pages [offset_begin, offset_end], and store all vmas that
- * map a file in a PST. Then, using the PST, we can answer a stabbing query,
- * i.e., selecting a set of stored intervals (vmas) that overlap with (map) a
- * given input interval X (a set of consecutive file pages), in "O(log n + m)"
- * time where 'log n' is the height of the PST, and 'm' is the number of stored
- * intervals (vmas) that overlap (map) with the input interval X (the set of
- * consecutive file pages).
- *
- * In our implementation, we store closed intervals of the form [radix_index,
- * heap_index]. We assume that always radix_index <= heap_index. McCreight's PST
- * is designed for storing intervals with unique radix indices, i.e., each
- * interval have different radix_index. However, this limitation can be easily
- * overcome by using the size, i.e., heap_index - radix_index, as part of the
- * index, so we index the tree using [(radix_index,size), heap_index].
- *
- * When the above-mentioned indexing scheme is used, theoretically, in a 32 bit
- * machine, the maximum height of a PST can be 64. We can use a balanced version
- * of the priority search tree to optimize the tree height, but the balanced
- * tree proposed by McCreight is too complex and memory-hungry for our purpose.
+ * See ../lib/prio_tree.c for details on the general radix priority search tree
+ * code.
*/

/*
@@ -62,422 +41,9 @@
GET_INDEX_VMA(__tmp, radix, heap); \
} while (0)

-static unsigned long index_bits_to_maxindex[BITS_PER_LONG];

-void __init prio_tree_init(void)
-{
- unsigned int i;
+#include "../lib/prio_tree.c"

- for (i = 0; i < ARRAY_SIZE(index_bits_to_maxindex) - 1; i++)
- index_bits_to_maxindex[i] = (1UL << (i + 1)) - 1;
- index_bits_to_maxindex[ARRAY_SIZE(index_bits_to_maxindex) - 1] = ~0UL;
-}
-
-/*
- * Maximum heap_index that can be stored in a PST with index_bits bits
- */
-static inline unsigned long prio_tree_maxindex(unsigned int bits)
-{
- return index_bits_to_maxindex[bits - 1];
-}
-
-static void prio_tree_remove(struct prio_tree_root *, struct prio_tree_node *);
-
-/*
- * Extend a priority search tree so that it can store a node with heap_index
- * max_heap_index. In the worst case, this algorithm takes O((log n)^2).
- * However, this function is used rarely and the common case performance is
- * not bad.
- */
-static struct prio_tree_node *prio_tree_expand(struct prio_tree_root *root,
- struct prio_tree_node *node, unsigned long max_heap_index)
-{
- struct prio_tree_node *first = NULL, *prev, *last = NULL;
-
- if (max_heap_index > prio_tree_maxindex(root->index_bits))
- root->index_bits++;
-
- while (max_heap_index > prio_tree_maxindex(root->index_bits)) {
- root->index_bits++;
-
- if (prio_tree_empty(root))
- continue;
-
- if (first == NULL) {
- first = root->prio_tree_node;
- prio_tree_remove(root, root->prio_tree_node);
- INIT_PRIO_TREE_NODE(first);
- last = first;
- } else {
- prev = last;
- last = root->prio_tree_node;
- prio_tree_remove(root, root->prio_tree_node);
- INIT_PRIO_TREE_NODE(last);
- prev->left = last;
- last->parent = prev;
- }
- }
-
- INIT_PRIO_TREE_NODE(node);
-
- if (first) {
- node->left = first;
- first->parent = node;
- } else
- last = node;
-
- if (!prio_tree_empty(root)) {
- last->left = root->prio_tree_node;
- last->left->parent = last;
- }
-
- root->prio_tree_node = node;
- return node;
-}
-
-/*
- * Replace a prio_tree_node with a new node and return the old node
- */
-static struct prio_tree_node *prio_tree_replace(struct prio_tree_root *root,
- struct prio_tree_node *old, struct prio_tree_node *node)
-{
- INIT_PRIO_TREE_NODE(node);
-
- if (prio_tree_root(old)) {
- BUG_ON(root->prio_tree_node != old);
- /*
- * We can reduce root->index_bits here. However, it is complex
- * and does not help much to improve performance (IMO).
- */
- node->parent = node;
- root->prio_tree_node = node;
- } else {
- node->parent = old->parent;
- if (old->parent->left == old)
- old->parent->left = node;
- else
- old->parent->right = node;
- }
-
- if (!prio_tree_left_empty(old)) {
- node->left = old->left;
- old->left->parent = node;
- }
-
- if (!prio_tree_right_empty(old)) {
- node->right = old->right;
- old->right->parent = node;
- }
-
- return old;
-}
-
-/*
- * Insert a prio_tree_node @node into a radix priority search tree @root. The
- * algorithm typically takes O(log n) time where 'log n' is the number of bits
- * required to represent the maximum heap_index. In the worst case, the algo
- * can take O((log n)^2) - check prio_tree_expand.
- *
- * If a prior node with same radix_index and heap_index is already found in
- * the tree, then returns the address of the prior node. Otherwise, inserts
- * @node into the tree and returns @node.
- */
-static struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root,
- struct prio_tree_node *node)
-{
- struct prio_tree_node *cur, *res = node;
- unsigned long radix_index, heap_index;
- unsigned long r_index, h_index, index, mask;
- int size_flag = 0;
-
- GET_INDEX(node, radix_index, heap_index);
-
- if (prio_tree_empty(root) ||
- heap_index > prio_tree_maxindex(root->index_bits))
- return prio_tree_expand(root, node, heap_index);
-
- cur = root->prio_tree_node;
- mask = 1UL << (root->index_bits - 1);
-
- while (mask) {
- GET_INDEX(cur, r_index, h_index);
-
- if (r_index == radix_index && h_index == heap_index)
- return cur;
-
- if (h_index < heap_index ||
- (h_index == heap_index && r_index > radix_index)) {
- struct prio_tree_node *tmp = node;
- node = prio_tree_replace(root, cur, node);
- cur = tmp;
- /* swap indices */
- index = r_index;
- r_index = radix_index;
- radix_index = index;
- index = h_index;
- h_index = heap_index;
- heap_index = index;
- }
-
- if (size_flag)
- index = heap_index - radix_index;
- else
- index = radix_index;
-
- if (index & mask) {
- if (prio_tree_right_empty(cur)) {
- INIT_PRIO_TREE_NODE(node);
- cur->right = node;
- node->parent = cur;
- return res;
- } else
- cur = cur->right;
- } else {
- if (prio_tree_left_empty(cur)) {
- INIT_PRIO_TREE_NODE(node);
- cur->left = node;
- node->parent = cur;
- return res;
- } else
- cur = cur->left;
- }
-
- mask >>= 1;
-
- if (!mask) {
- mask = 1UL << (root->index_bits - 1);
- size_flag = 1;
- }
- }
- /* Should not reach here */
- BUG();
- return NULL;
-}
-
-/*
- * Remove a prio_tree_node @node from a radix priority search tree @root. The
- * algorithm takes O(log n) time where 'log n' is the number of bits required
- * to represent the maximum heap_index.
- */
-static void prio_tree_remove(struct prio_tree_root *root,
- struct prio_tree_node *node)
-{
- struct prio_tree_node *cur;
- unsigned long r_index, h_index_right, h_index_left;
-
- cur = node;
-
- while (!prio_tree_left_empty(cur) || !prio_tree_right_empty(cur)) {
- if (!prio_tree_left_empty(cur))
- GET_INDEX(cur->left, r_index, h_index_left);
- else {
- cur = cur->right;
- continue;
- }
-
- if (!prio_tree_right_empty(cur))
- GET_INDEX(cur->right, r_index, h_index_right);
- else {
- cur = cur->left;
- continue;
- }
-
- /* both h_index_left and h_index_right cannot be 0 */
- if (h_index_left >= h_index_right)
- cur = cur->left;
- else
- cur = cur->right;
- }
-
- if (prio_tree_root(cur)) {
- BUG_ON(root->prio_tree_node != cur);
- INIT_PRIO_TREE_ROOT(root);
- return;
- }
-
- if (cur->parent->right == cur)
- cur->parent->right = cur->parent;
- else
- cur->parent->left = cur->parent;
-
- while (cur != node)
- cur = prio_tree_replace(root, cur->parent, cur);
-}
-
-/*
- * Following functions help to enumerate all prio_tree_nodes in the tree that
- * overlap with the input interval X [radix_index, heap_index]. The enumeration
- * takes O(log n + m) time where 'log n' is the height of the tree (which is
- * proportional to # of bits required to represent the maximum heap_index) and
- * 'm' is the number of prio_tree_nodes that overlap the interval X.
- */
-
-static struct prio_tree_node *prio_tree_left(struct prio_tree_iter *iter,
- unsigned long *r_index, unsigned long *h_index)
-{
- if (prio_tree_left_empty(iter->cur))
- return NULL;
-
- GET_INDEX(iter->cur->left, *r_index, *h_index);
-
- if (iter->r_index <= *h_index) {
- iter->cur = iter->cur->left;
- iter->mask >>= 1;
- if (iter->mask) {
- if (iter->size_level)
- iter->size_level++;
- } else {
- if (iter->size_level) {
- BUG_ON(!prio_tree_left_empty(iter->cur));
- BUG_ON(!prio_tree_right_empty(iter->cur));
- iter->size_level++;
- iter->mask = ULONG_MAX;
- } else {
- iter->size_level = 1;
- iter->mask = 1UL << (iter->root->index_bits - 1);
- }
- }
- return iter->cur;
- }
-
- return NULL;
-}
-
-static struct prio_tree_node *prio_tree_right(struct prio_tree_iter *iter,
- unsigned long *r_index, unsigned long *h_index)
-{
- unsigned long value;
-
- if (prio_tree_right_empty(iter->cur))
- return NULL;
-
- if (iter->size_level)
- value = iter->value;
- else
- value = iter->value | iter->mask;
-
- if (iter->h_index < value)
- return NULL;
-
- GET_INDEX(iter->cur->right, *r_index, *h_index);
-
- if (iter->r_index <= *h_index) {
- iter->cur = iter->cur->right;
- iter->mask >>= 1;
- iter->value = value;
- if (iter->mask) {
- if (iter->size_level)
- iter->size_level++;
- } else {
- if (iter->size_level) {
- BUG_ON(!prio_tree_left_empty(iter->cur));
- BUG_ON(!prio_tree_right_empty(iter->cur));
- iter->size_level++;
- iter->mask = ULONG_MAX;
- } else {
- iter->size_level = 1;
- iter->mask = 1UL << (iter->root->index_bits - 1);
- }
- }
- return iter->cur;
- }
-
- return NULL;
-}
-
-static struct prio_tree_node *prio_tree_parent(struct prio_tree_iter *iter)
-{
- iter->cur = iter->cur->parent;
- if (iter->mask == ULONG_MAX)
- iter->mask = 1UL;
- else if (iter->size_level == 1)
- iter->mask = 1UL;
- else
- iter->mask <<= 1;
- if (iter->size_level)
- iter->size_level--;
- if (!iter->size_level && (iter->value & iter->mask))
- iter->value ^= iter->mask;
- return iter->cur;
-}
-
-static inline int overlap(struct prio_tree_iter *iter,
- unsigned long r_index, unsigned long h_index)
-{
- return iter->h_index >= r_index && iter->r_index <= h_index;
-}
-
-/*
- * prio_tree_first:
- *
- * Get the first prio_tree_node that overlaps with the interval [radix_index,
- * heap_index]. Note that always radix_index <= heap_index. We do a pre-order
- * traversal of the tree.
- */
-static struct prio_tree_node *prio_tree_first(struct prio_tree_iter *iter)
-{
- struct prio_tree_root *root;
- unsigned long r_index, h_index;
-
- INIT_PRIO_TREE_ITER(iter);
-
- root = iter->root;
- if (prio_tree_empty(root))
- return NULL;
-
- GET_INDEX(root->prio_tree_node, r_index, h_index);
-
- if (iter->r_index > h_index)
- return NULL;
-
- iter->mask = 1UL << (root->index_bits - 1);
- iter->cur = root->prio_tree_node;
-
- while (1) {
- if (overlap(iter, r_index, h_index))
- return iter->cur;
-
- if (prio_tree_left(iter, &r_index, &h_index))
- continue;
-
- if (prio_tree_right(iter, &r_index, &h_index))
- continue;
-
- break;
- }
- return NULL;
-}
-
-/*
- * prio_tree_next:
- *
- * Get the next prio_tree_node that overlaps with the input interval in iter
- */
-static struct prio_tree_node *prio_tree_next(struct prio_tree_iter *iter)
-{
- unsigned long r_index, h_index;
-
-repeat:
- while (prio_tree_left(iter, &r_index, &h_index))
- if (overlap(iter, r_index, h_index))
- return iter->cur;
-
- while (!prio_tree_right(iter, &r_index, &h_index)) {
- while (!prio_tree_root(iter->cur) &&
- iter->cur->parent->right == iter->cur)
- prio_tree_parent(iter);
-
- if (prio_tree_root(iter->cur))
- return NULL;
-
- prio_tree_parent(iter);
- }
-
- if (overlap(iter, r_index, h_index))
- return iter->cur;
-
- goto repeat;
-}

/*
* Radix priority search tree for address_space->i_mmap

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2004-11-15 04:31:04

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] Generalize prio_tree (1/3)

Werner Almesberger wrote:
> Hi Rajesh,
>
> perhaps you remember me posting a long time ago about generalizing
> prio_tree. Now I finally got to make that patch. In fact, there are
> three parts:
>
> - the prio_tree "core" in lib/
> - switching mm/prio_tree.c to use the "core"
> - some debugging extensions
>
> The reason for wanting this generalization is that we'll also need
> radix priority search trees for healthier barrier handling in the
> IO scheduler (aka disk elevator).
>

I'm curious, how do you plan to use them for healthier barrier handling?

2004-11-15 06:08:10

by Werner Almesberger

[permalink] [raw]
Subject: Re: [RFC] Generalize prio_tree (1/3)

Nick Piggin wrote:
> I'm curious, how do you plan to use them for healthier barrier handling?

Ah, you missed the great discussion on fsdevel and the BOF
at OLS :-)

http://marc.theaimsgroup.com/?t=108787649700005&r=1&w=2&n=34
http://marc.theaimsgroup.com/?t=108809650200006&r=1&w=2&n=11
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=109107082406140&w=2

This comes from prioritization of requests at the elevator.
In order to honor priorities as much as possible, we need to
keep barriers from affecting all requests in the queue.

The idea is to ignore barriers unless requests separated by a
barrier overlap, and at least one of them is a write. prio_tree
is used to find those overlaps.

That way, priorities are only affected by barriers if using
some form of direct IO, with overlaps and writes. While this
isn't perfect (i.e. someone else scribbling over our files can
still spoil all the fun), it allows a much larger class of
applications to enjoy the full benefits of priorities.

Besides that, it also helps the elevator to do a better job for
requests even without priorities, because it doesn't have to go
FIFO whenever it sees a barrier.

See also section 3.6 of
http://abiss.sourceforge.net/doc/abiss-lk.ps.gz

The CPU overhead is actually quite marginal: in tests, the ABISS
elevator would actually outperform AS in terms of CPU time
spent (measured by sending a lot of random requests with AIO
into a large queue). While such tests compare apples and oranges,
they at least indicate that minimalizing the effect of barriers
doesn't have a horrible performance impact.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2004-11-15 11:02:12

by Nick Piggin

[permalink] [raw]
Subject: Re: [RFC] Generalize prio_tree (1/3)

Werner Almesberger wrote:
> Nick Piggin wrote:
>
>>I'm curious, how do you plan to use them for healthier barrier handling?
>
>
> Ah, you missed the great discussion on fsdevel and the BOF
> at OLS :-)
>
> http://marc.theaimsgroup.com/?t=108787649700005&r=1&w=2&n=34
> http://marc.theaimsgroup.com/?t=108809650200006&r=1&w=2&n=11
> http://marc.theaimsgroup.com/?l=linux-fsdevel&m=109107082406140&w=2
>

Ah OK. Interesting reading.

> This comes from prioritization of requests at the elevator.
> In order to honor priorities as much as possible, we need to
> keep barriers from affecting all requests in the queue.
>
> The idea is to ignore barriers unless requests separated by a
> barrier overlap, and at least one of them is a write. prio_tree
> is used to find those overlaps.
>

Hmm OK. I won't quiz you on the impelementation as I'm sure you've
thought it through, and I haven't :)

... but humor me, you _are_ ensuring the following doesn't get
reordered, say:

(write, sect 100), (barrier), (write, sect 200)

Right? It isn't clear from your description what you are intending
(and I'm too lazy to read the code at the moment!).

> That way, priorities are only affected by barriers if using
> some form of direct IO, with overlaps and writes. While this
> isn't perfect (i.e. someone else scribbling over our files can
> still spoil all the fun), it allows a much larger class of
> applications to enjoy the full benefits of priorities.
>
> Besides that, it also helps the elevator to do a better job for
> requests even without priorities, because it doesn't have to go
> FIFO whenever it sees a barrier.
>

Well it goes FIFO to the extent that the barrier definition requires:
all requests before the barrier are issued before any requests
after the barrier.

> See also section 3.6 of
> http://abiss.sourceforge.net/doc/abiss-lk.ps.gz
>
> The CPU overhead is actually quite marginal: in tests, the ABISS
> elevator would actually outperform AS in terms of CPU time
> spent (measured by sending a lot of random requests with AIO
> into a large queue). While such tests compare apples and oranges,
> they at least indicate that minimalizing the effect of barriers
> doesn't have a horrible performance impact.
>

Well it sounds interesting. Good luck with it...

No comment on your prio tree generalization, sorry. Other than: it
seems to be unfortunately quite ugly. Obviously better than duplicating
the code, but... it may just be worth holding onto it until it has
another user about to be merged as well (eg. add the patch to the front
of your ABISS elevator if you submit it to 2.6 or -mm).

But I don't have strong feelings on the matter. If Rajesh says its OK
to go ahead as is, that would be fine by me :)

Good luck!
Nick

2004-11-15 14:33:04

by Werner Almesberger

[permalink] [raw]
Subject: Re: [RFC] Generalize prio_tree (1/3)

Nick Piggin wrote:
> ... but humor me, you _are_ ensuring the following doesn't get
> reordered, say:
>
> (write, sect 100), (barrier), (write, sect 200)

Ah you found the case I didn't mention :-) Yes, that's handled
somewhere else. When the ABISS elevator sees a barrier, it just
pushes the current sort tree for non-reads (writes and weird
stuff) to a FIFO list.

So writes don't benefit that much from priorities. The good
thing is that they also happen to need them less ;-)

Hmm, I just see that power down (suspend and shutdown) sneaked out
again. Well, easy enough to fix.

> No comment on your prio tree generalization, sorry. Other than: it
> seems to be unfortunately quite ugly.

I know. Unfortunately, there doesn't seem to be a nice alternative
that doesn't either require additional fields (e.g. in MM, only half
of the key is not stored in the tree, the other half is calculated)
or *very* busy callbacks.

Of course, if Rajesh or the MM folks in general think that storing
the whole key in the tree is fine, I wouldn't complain :-)

> (eg. add the patch to the front
> of your ABISS elevator if you submit it to 2.6 or -mm).

Ah no, the ABISS elevator won't go into the kernel - it's just for
experimenting. But Jens is planning to add the overlap handling to
the block device layer. So the user is well on its way :-) It's just
more convenient if we can do this one step at a time, particularly
since prio_tree is a very general concept (much like lists or
red-black trees) anyway.

> But I don't have strong feelings on the matter. If Rajesh says its OK
> to go ahead as is, that would be fine by me :)

Cool. Thanks !

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

Subject: Re: [RFC] Generalize prio_tree (1/3)



On Sun, 14 Nov 2004, Werner Almesberger wrote:

> perhaps you remember me posting a long time ago about generalizing
> prio_tree. Now I finally got to make that patch. In fact, there are
> three parts:
>
> - the prio_tree "core" in lib/
> - switching mm/prio_tree.c to use the "core"
> - some debugging extensions
>
[snip]
>
> The patch below puts an includeable version of prio_tree to lib/.
> This should be included similar to how inflate.c is used.

I don't like including prio_tree. I will only choose to go the
include way if and only if every other choice leads to significant
performance overhead.

Currently I prefer the following way, but I didn't implement it
because I don't have a strong reason to generalize prio_tree and
the generalization may lead to some preformance loss (which may be
insignificant loss in real world apps). We have to test to know
the performance loss.

Again I don't like the following approach fully. I couldn't come
out with a clean generalization something like rb_tree code. I don't
think such a clean generalization exists for prio_tree. Therefore,
if we have a strong reason to generalize the prio_tree, then we have to
choose one of the options we have. Untested prototype patch below.

prio_tree generalization...

---

linux-2.6.9-testuser/fs/inode.c | 2
linux-2.6.9-testuser/include/linux/prio_tree.h | 10 +++-
linux-2.6.9-testuser/mm/prio_tree.c | 56 ++++++++++++++-----------
3 files changed, 41 insertions(+), 27 deletions(-)

diff -puN include/linux/prio_tree.h~prio_tree_gen include/linux/prio_tree.h
--- linux-2.6.9/include/linux/prio_tree.h~prio_tree_gen 2004-11-15 12:30:34.513076512 -0500
+++ linux-2.6.9-testuser/include/linux/prio_tree.h 2004-11-15 13:04:55.772717416 -0500
@@ -9,7 +9,8 @@ struct prio_tree_node {

struct prio_tree_root {
struct prio_tree_node *prio_tree_node;
- unsigned int index_bits;
+ unsigned short index_bits;
+ unsigned short type;
};

struct prio_tree_iter {
@@ -31,10 +32,15 @@ static inline void prio_tree_iter_init(s
iter->h_index = h_index;
}

-#define INIT_PRIO_TREE_ROOT(ptr) \
+/* prio_tree types */
+#define VMA_PRIO_TREE 0
+#define ABISS_PRIO_TREE 1
+
+#define INIT_PRIO_TREE_ROOT(ptr, t) \
do { \
(ptr)->prio_tree_node = NULL; \
(ptr)->index_bits = 1; \
+ (ptr)->type = t; \
} while (0)

#define INIT_PRIO_TREE_NODE(ptr) \
diff -puN mm/prio_tree.c~prio_tree_gen mm/prio_tree.c
--- linux-2.6.9/mm/prio_tree.c~prio_tree_gen 2004-11-15 12:32:39.010150080 -0500
+++ linux-2.6.9-testuser/mm/prio_tree.c 2004-11-15 12:54:00.085397040 -0500
@@ -44,23 +44,10 @@
* The following macros are used for implementing prio_tree for i_mmap
*/

-#define RADIX_INDEX(vma) ((vma)->vm_pgoff)
-#define VMA_SIZE(vma) (((vma)->vm_end - (vma)->vm_start) >> PAGE_SHIFT)
+#define VMA_RADIX_INDEX(vma) ((vma)->vm_pgoff)
+#define VMA_SIZE(vma) (((vma)->vm_end - (vma)->vm_start) >> PAGE_SHIFT)
/* avoid overflow */
-#define HEAP_INDEX(vma) ((vma)->vm_pgoff + (VMA_SIZE(vma) - 1))
-
-#define GET_INDEX_VMA(vma, radix, heap) \
-do { \
- radix = RADIX_INDEX(vma); \
- heap = HEAP_INDEX(vma); \
-} while (0)
-
-#define GET_INDEX(node, radix, heap) \
-do { \
- struct vm_area_struct *__tmp = \
- prio_tree_entry(node, struct vm_area_struct, shared.prio_tree_node);\
- GET_INDEX_VMA(__tmp, radix, heap); \
-} while (0)
+#define VMA_HEAP_INDEX(vma) ((vma)->vm_pgoff + (VMA_SIZE(vma) - 1))

static unsigned long index_bits_to_maxindex[BITS_PER_LONG];

@@ -83,6 +70,27 @@ static inline unsigned long prio_tree_ma

static void prio_tree_remove(struct prio_tree_root *, struct prio_tree_node *);

+static void get_index(struct prio_tree_root *root, struct prio_tree_node *node,
+ unsigned long *radix, unsigned long *heap)
+{
+ switch (root->type) {
+ case VMA_PRIO_TREE:
+ struct vm_area_struct *vma;
+ vma = prio_tree_entry(node, struct vm_area_struct,
+ shared.prio_tree_node);
+ *radix = VMA_RADIX_INDEX(vma);
+ *heap = VMA_HEAP_INDEX(vma);
+ break;
+ case ABISS_PRIO_TREE:
+ ...
+ break;
+ default:
+ BUG();
+ break;
+ }
+}
+
+
/*
* Extend a priority search tree so that it can store a node with heap_index
* max_heap_index. In the worst case, this algorithm takes O((log n)^2).
@@ -190,7 +198,7 @@ static struct prio_tree_node *prio_tree_
unsigned long r_index, h_index, index, mask;
int size_flag = 0;

- GET_INDEX(node, radix_index, heap_index);
+ get_index(root, node, &radix_index, &heap_index);

if (prio_tree_empty(root) ||
heap_index > prio_tree_maxindex(root->index_bits))
@@ -200,7 +208,7 @@ static struct prio_tree_node *prio_tree_
mask = 1UL << (root->index_bits - 1);

while (mask) {
- GET_INDEX(cur, r_index, h_index);
+ get_index(root, cur, &r_index, &h_index);

if (r_index == radix_index && h_index == heap_index)
return cur;
@@ -269,14 +277,14 @@ static void prio_tree_remove(struct prio

while (!prio_tree_left_empty(cur) || !prio_tree_right_empty(cur)) {
if (!prio_tree_left_empty(cur))
- GET_INDEX(cur->left, r_index, h_index_left);
+ get_index(root, cur->left, &r_index, &h_index_left);
else {
cur = cur->right;
continue;
}

if (!prio_tree_right_empty(cur))
- GET_INDEX(cur->right, r_index, h_index_right);
+ get_index(root, cur->right, &r_index, &h_index_right);
else {
cur = cur->left;
continue;
@@ -291,7 +299,7 @@ static void prio_tree_remove(struct prio

if (prio_tree_root(cur)) {
BUG_ON(root->prio_tree_node != cur);
- INIT_PRIO_TREE_ROOT(root);
+ INIT_PRIO_TREE_ROOT(root, root->type);
return;
}

@@ -318,7 +326,7 @@ static struct prio_tree_node *prio_tree_
if (prio_tree_left_empty(iter->cur))
return NULL;

- GET_INDEX(iter->cur->left, *r_index, *h_index);
+ get_index(iter->root, iter->cur->left, r_index, h_index);

if (iter->r_index <= *h_index) {
iter->cur = iter->cur->left;
@@ -359,7 +367,7 @@ static struct prio_tree_node *prio_tree_
if (iter->h_index < value)
return NULL;

- GET_INDEX(iter->cur->right, *r_index, *h_index);
+ get_index(iter->root, iter->cur->right, r_index, h_index);

if (iter->r_index <= *h_index) {
iter->cur = iter->cur->right;
@@ -425,7 +433,7 @@ static struct prio_tree_node *prio_tree_
if (prio_tree_empty(root))
return NULL;

- GET_INDEX(root->prio_tree_node, r_index, h_index);
+ get_index(root, root->prio_tree_node, &r_index, &h_index);

if (iter->r_index > h_index)
return NULL;
diff -puN fs/inode.c~prio_tree_gen fs/inode.c
--- linux-2.6.9/fs/inode.c~prio_tree_gen 2004-11-15 12:50:22.393491240 -0500
+++ linux-2.6.9-testuser/fs/inode.c 2004-11-15 12:51:18.155014200 -0500
@@ -201,7 +201,7 @@ void inode_init_once(struct inode *inode
atomic_set(&inode->i_data.truncate_count, 0);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
- INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
+ INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap, VMA_PRIO_TREE);
INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
spin_lock_init(&inode->i_lock);
i_size_ordered_init(inode);
_

2004-11-15 20:59:07

by Werner Almesberger

[permalink] [raw]
Subject: Re: [RFC] Generalize prio_tree (1/3)

Rajesh Venkatasubramanian wrote:
> Again I don't like the following approach fully. I couldn't come
> out with a clean generalization something like rb_tree code.

Hmm, GET_INDEX/get_index grows and grows, and also generates a
hotspot for patch collisions ...

And what if we took the hit and moved the key into struct
prio_tree_node ? struct vm_area_struct.shared.vm_set already is
one word longer than vm_area_struct.shared.prio_tree_node, so
half of the key is free (in terms of storage - the key updates
when vm_pgoff, vm_end, or vm_start changes aren't free). The
other half could also be made free (in terms of storage and
processing) with a little tweaking, e.g. by adding

...
union {
unsigned long vm_pgoff;
struct vm_set {
unsigned long vm_pgoff;
...
} vm_set;
struct prio_tree_node prio_tree_node;
}
...

#define vm_pgoff shared.vm_pgoff

(Untested. This kind of #define is of course risky, so it may be
better to just rewrite all the accesses.)

Then, we could have

struct prio_tree_node {
unsigned long r_index, h_index;
...
};

For the elevators, the keys (the "footprint" of a set of overlapping
requests) are already stored as separate variables, so that could be
migrated very easily, at no additional cost.

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

Subject: Re: [RFC] Generalize prio_tree (1/3)



On Mon, 15 Nov 2004, Werner Almesberger wrote:

> Rajesh Venkatasubramanian wrote:
> > Again I don't like the following approach fully. I couldn't come
> > out with a clean generalization something like rb_tree code.
>
> Hmm, GET_INDEX/get_index grows and grows, and also generates a
> hotspot for patch collisions ...
>
> And what if we took the hit and moved the key into struct
> prio_tree_node ? struct vm_area_struct.shared.vm_set already is
> one word longer than vm_area_struct.shared.prio_tree_node, so
> half of the key is free (in terms of storage - the key updates
> when vm_pgoff, vm_end, or vm_start changes aren't free). The
> other half could also be made free (in terms of storage and
> processing) with a little tweaking, e.g. by adding
>
> ...
> union {
> unsigned long vm_pgoff;
> struct vm_set {
> unsigned long vm_pgoff;
> ...
> } vm_set;
> struct prio_tree_node prio_tree_node;
> }
> ...
>
> #define vm_pgoff shared.vm_pgoff
>
> (Untested. This kind of #define is of course risky, so it may be
> better to just rewrite all the accesses.)


I thought about this, but this will lead to a very intrusive patch.
We have to change the meaning of vm_start and vm_end or increase
the size of vm_area_struct.


> Then, we could have
>
> struct prio_tree_node {
> unsigned long r_index, h_index;
> ...
> };
>
> For the elevators, the keys (the "footprint" of a set of overlapping
> requests) are already stored as separate variables, so that could be
> migrated very easily, at no additional cost.

Why can't we have only 2 types of prio_tree. One VMA_PRIO_TREE and
another GENERIC_PRIO_TREE.

struct generic_prio_tree_node {
unsigned long r_index, h_index;
struct prio_tree_node prio_tree_node;
}

That way all new users of prio_tree code will use the
generic_prio_tree_node if possible. Moreover, we can avoid
an intrusive patch.

I am only worried about the micro-performance loss due to
get_index in the hot-paths such as vma_prio_tree_insert.

Rajesh

2004-11-15 21:43:21

by Werner Almesberger

[permalink] [raw]
Subject: Re: [RFC] Generalize prio_tree (1/3)

Rajesh Venkatasubramanian wrote:
> I thought about this, but this will lead to a very intrusive patch.

Possibly yes, unfortunately :-( All places where a node's keys change
would have to be updated, yes. Are there cases where vm_pgoff,
vm_start, or vm_end can change without doing a prio_tree_insert or
vma_prio_tree_insert afterwards ? If not, the key update could just
be moved into vma_prio_tree_insert and vma_prio_tree_add.

> We have to change the meaning of vm_start and vm_end or increase
> the size of vm_area_struct.

Nope :-) We already have space for adding one more long, i.e. h_index.
So all we need to do it to calculate and set it before going to
prio_tree.

For r_index, one can use what I've described in the last mail.

> I am only worried about the micro-performance loss due to
> get_index in the hot-paths such as vma_prio_tree_insert.

Yes, it starts to look fairly heavy for what it does ...

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

Subject: Re: [RFC] Generalize prio_tree (1/3)



On Mon, 15 Nov 2004, Werner Almesberger wrote:

> Rajesh Venkatasubramanian wrote:
> > I thought about this, but this will lead to a very intrusive patch.
>
> Possibly yes, unfortunately :-( All places where a node's keys change
> would have to be updated, yes. Are there cases where vm_pgoff,
> vm_start, or vm_end can change without doing a prio_tree_insert or
> vma_prio_tree_insert afterwards ?

No, that's not possible. Whenever we change vm_pgoff, vm_start, or
vm_end, we reshuffle the prio_tree. Check mm/mmap.c:vma_adjust().

> If not, the key update could just
> be moved into vma_prio_tree_insert and vma_prio_tree_add.

Yes. That's possible. But we have to expand the vm_area_struct
by sizeof(unsigned long).

> > We have to change the meaning of vm_start and vm_end or increase
> > the size of vm_area_struct.
>
> Nope :-) We already have space for adding one more long, i.e. h_index.

I don't understand what you are thinking. I reread your last mail
to try to understand, but I don't get it.

If you are thinking vm_set.head or vm_set.parent is free to use
for h_index, then it is incorrect. No field is free. If a vma
is a tree node (and in this discussion we care only about tree
nodes), then every field in vma->shared is used. We cannot reuse
them for storing h_index.

> For r_index, one can use what I've described in the last mail.

Yeah. We can use a "#define vm_pgoff shared.r_index". But, that's
very ugly, so we have to change all accesses of vm_pgoff leading to
a relatively intrusive patch.

> > I am only worried about the micro-performance loss due to
> > get_index in the hot-paths such as vma_prio_tree_insert.
>
> Yes, it starts to look fairly heavy for what it does ...

If we impose that there can be only 2 types of prio_tree, then
we can simply add an if-else condition in the GET_INDEX macro.
IMHO, that will not lead to any noticeable performance drop.

But, I agree with you that changing the layout of vm_area_struct
is a better (but intrusive) approach.

Thanks,
Rajesh

2004-11-15 23:04:41

by Werner Almesberger

[permalink] [raw]
Subject: Re: [RFC] Generalize prio_tree (1/3)

Rajesh Venkatasubramanian wrote:
> No, that's not possible. Whenever we change vm_pgoff, vm_start, or
> vm_end, we reshuffle the prio_tree. Check mm/mmap.c:vma_adjust().

Excellent, that makes it a lot easier.

> If you are thinking vm_set.head or vm_set.parent is free to use
> for h_index, then it is incorrect. No field is free. If a vma
> is a tree node (and in this discussion we care only about tree
> nodes), then every field in vma->shared is used. We cannot reuse
> them for storing h_index.

Oh, I see. I thought it was that either vm_set or prio_tree_node
was used. (Which I found somewhat confusing, but attributed my
confusion to simply not understanding all the strange ways of MM.)
Sorry about the confusion.

So yes, one more word then :-(

> If we impose that there can be only 2 types of prio_tree, then
> we can simply add an if-else condition in the GET_INDEX macro.
> IMHO, that will not lead to any noticeable performance drop.

Yes, that sounds better. It would also allow for a later transition
of VMA_PRIO_TREE to GENERIC_PRIO_TREE.

Now, if we want to prepare things already now for a later migration,
it would be nice to call the generic one "struct prio_tree_node",
since it would eventually become the only node type anyway.

Perhaps something along these lines:

struct prio_tree_node {
struct vma_prio_tree_node prio_tree_node;
unsigned long r_index, h_index;
};

Or would you consider this as premature optimization ?

> But, I agree with you that changing the layout of vm_area_struct
> is a better (but intrusive) approach.

I also wonder how expensive the calculations in HEAP_INDEX are.
Probably not very.

To make the intrusive change a bit more palatable, vm_pgoff could
become #define vm_pgoff(vma) ((vma)->shared.prio_tree_node.r_index)

Okay, so now we only need someone who has meaningful MM tests to
tell us how badly this would hurt performance :-)

Thanks,
- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

Subject: Re: [RFC] Generalize prio_tree (1/3)



On Mon, 15 Nov 2004, Werner Almesberger wrote:
> > If we impose that there can be only 2 types of prio_tree, then
> > we can simply add an if-else condition in the GET_INDEX macro.
> > IMHO, that will not lead to any noticeable performance drop.
>
> Yes, that sounds better. It would also allow for a later transition
> of VMA_PRIO_TREE to GENERIC_PRIO_TREE.

Yeap. That will be my preference. Make a small patch to generalize
prio_tree using the split VMA_PRIO_TREE and GENERIC_PRIO_TREE. A small
patch has a better chance of getting merged too. Giving a strong
reason for the generalization and showing that the VMA_PRIO_TREE
and GENERIC_PRIO_TREE split does not affect performance noticeably,
we can hope to convince Andrew to merge. We can worry about
changing vm_area_struct's layout later.

> Now, if we want to prepare things already now for a later migration,
> it would be nice to call the generic one "struct prio_tree_node",
> since it would eventually become the only node type anyway.
>
> Perhaps something along these lines:
>
> struct prio_tree_node {
> struct vma_prio_tree_node prio_tree_node;
> unsigned long r_index, h_index;
> };
>
> Or would you consider this as premature optimization ?

Yeap. That looks sane. However, if you are planning to produce
a patch, please consider the following names:

struct prio_tree_node {
unsigned long start, end;
struct raw_prio_tree_node prio_tree_node;
};

I think the r_index and h_index names are only meaningful in
prio_tree.c. My guess is start and end will be more palatable
to users of prio_tree.

Thank you,
Rajesh

2004-11-16 00:35:34

by Werner Almesberger

[permalink] [raw]
Subject: Re: [RFC] Generalize prio_tree (1/3)

Rajesh Venkatasubramanian wrote:
> Yeap. That looks sane. However, if you are planning to produce
> a patch, please consider the following names:
>
> struct prio_tree_node {
> unsigned long start, end;
> struct raw_prio_tree_node prio_tree_node;
> };

Okay. Any reason why you've put "start, end" before "prio_tree_node" ?
The other way around would seem to make things a lot easier.

> I think the r_index and h_index names are only meaningful in
> prio_tree.c. My guess is start and end will be more palatable
> to users of prio_tree.

Yes, they're a bit confusing :-) It would actually be nice if you
could write a little paper describing this particular type of radix
priority search tree, since it differs quite a bit from the original.
Also, the original paper is comparably difficult to obtain if you
don't have a university library at hand. Better documentation of how
prio_tree works might also encourage new uses of it.

Thanks,
- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

Subject: Re: [RFC] Generalize prio_tree (1/3)



On Mon, 15 Nov 2004, Werner Almesberger wrote:

> Rajesh Venkatasubramanian wrote:
> > Yeap. That looks sane. However, if you are planning to produce
> > a patch, please consider the following names:
> >
> > struct prio_tree_node {
> > unsigned long start, end;
> > struct raw_prio_tree_node prio_tree_node;
> > };
>
> Okay. Any reason why you've put "start, end" before "prio_tree_node" ?
> The other way around would seem to make things a lot easier.

I don't have any reason. I am okay with either.

> > I think the r_index and h_index names are only meaningful in
> > prio_tree.c. My guess is start and end will be more palatable
> > to users of prio_tree.
>
> Yes, they're a bit confusing :-) It would actually be nice if you
> could write a little paper describing this particular type of radix
> priority search tree, since it differs quite a bit from the original.
> Also, the original paper is comparably difficult to obtain if you
> don't have a university library at hand. Better documentation of how
> prio_tree works might also encourage new uses of it.

I have already started doing that. Please check
Documentation/prio_tree.txt in 2.6.10-rc2. The document can be
improved a lot, but for now it tries to explain the differences
from the original paper. Yes. I got a copy of the original paper
from my university library, there are no digital copies available.

Rajesh

2004-11-17 00:08:26

by Werner Almesberger

[permalink] [raw]
Subject: Generalize prio_tree, 2nd try

Hi Rajesh,

this patch (for 2.6.9) implements roughly what we've discussed yesterday.
The main difference is that I didn't nest the two node structures, but
put them in parallel. This avoids a zillion trivial changes when making
this change, and another zillion un-changes when vma_* switches to the
non-raw structure.

The vma_ functions use wrappers to cast between the two types of
structures. Not necessarily the most beautiful way of doing things, but
again, that's hopefully just transitional.

Unlike the patch sent yesterday, this patch leaves everything in
mm/prio_tree.c, to make it easier to see the differences. Moving the
non-vma code to lib/ will be trivial, and there's no hurry for that.

I've also left out the debugging functions. I'll update them later.

I've tested the patch only very lightly under UML. Is there any good VMA
exerciser around that will yield meaningful results on a desktop PC, or
should I just leave the benchmarking to the guys with the big irons ?

Thanks,
- Werner

---------------------------------- cut here -----------------------------------

--- linux-2.6.9-orig/include/linux/prio_tree.h Mon Oct 18 18:54:08 2004
+++ linux-2.6.9/include/linux/prio_tree.h Tue Nov 16 20:07:43 2004
@@ -1,15 +1,38 @@
#ifndef _LINUX_PRIO_TREE_H
#define _LINUX_PRIO_TREE_H

+/*
+ * K&R 2nd ed. A8.3 somewhat obliquely hints that initial sequences of struct
+ * fields with identical types should end up at the same location. We'll use
+ * this until we can scrap struct raw_prio_tree_node.
+ *
+ * Note: all this could be dome more elegantly by using unnamed union/struct
+ * fields. However, gcc 2.95.3 and apparently also gcc 3.0.4 don't support this
+ * language extension.
+ */
+
+struct raw_prio_tree_node {
+ struct prio_tree_node *left;
+ struct prio_tree_node *right;
+ struct prio_tree_node *parent;
+};
+
struct prio_tree_node {
struct prio_tree_node *left;
struct prio_tree_node *right;
struct prio_tree_node *parent;
+ unsigned long start;
+ unsigned long end;
};

struct prio_tree_root {
struct prio_tree_node *prio_tree_node;
- unsigned int index_bits;
+ unsigned short index_bits;
+ unsigned short raw;
+ /*
+ * 0: nodes are of type struct prio_tree_node
+ * 1: nodes are of type raw_prio_tree_node
+ */
};

struct prio_tree_iter {
@@ -31,10 +54,11 @@
iter->h_index = h_index;
}

-#define INIT_PRIO_TREE_ROOT(ptr) \
+#define INIT_PRIO_TREE_ROOT(ptr, _raw) \
do { \
(ptr)->prio_tree_node = NULL; \
(ptr)->index_bits = 1; \
+ (ptr)->raw = (_raw); \
} while (0)

#define INIT_PRIO_TREE_NODE(ptr) \
@@ -73,4 +97,21 @@
return node->right == node;
}

+
+struct prio_tree_node *prio_tree_replace(struct prio_tree_root *root,
+ struct prio_tree_node *old, struct prio_tree_node *node);
+struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root,
+ struct prio_tree_node *node);
+void prio_tree_remove(struct prio_tree_root *root, struct prio_tree_node *node);
+struct prio_tree_node *prio_tree_first(struct prio_tree_iter *iter);
+struct prio_tree_node *prio_tree_next(struct prio_tree_iter *iter);
+
+#define raw_prio_tree_replace(root, old, node) \
+ prio_tree_replace(root, (struct prio_tree_node *) (old), \
+ (struct prio_tree_node *) (node))
+#define raw_prio_tree_insert(root, node) \
+ prio_tree_insert(root, (struct prio_tree_node *) (node))
+#define raw_prio_tree_remove(root, node) \
+ prio_tree_remove(root, (struct prio_tree_node *) (node))
+
#endif /* _LINUX_PRIO_TREE_H */
--- linux-2.6.9-orig/include/linux/mm.h Mon Oct 18 18:53:07 2004
+++ linux-2.6.9/include/linux/mm.h Tue Nov 16 19:26:55 2004
@@ -83,7 +83,7 @@
struct vm_area_struct *head;
} vm_set;

- struct prio_tree_node prio_tree_node;
+ struct raw_prio_tree_node prio_tree_node;
} shared;

/*
--- linux-2.6.9-orig/mm/prio_tree.c Mon Oct 18 18:55:28 2004
+++ linux-2.6.9/mm/prio_tree.c Tue Nov 16 20:22:02 2004
@@ -12,7 +12,6 @@
*/

#include <linux/init.h>
-#include <linux/module.h>
#include <linux/mm.h>
#include <linux/prio_tree.h>

@@ -49,18 +48,22 @@
/* avoid overflow */
#define HEAP_INDEX(vma) ((vma)->vm_pgoff + (VMA_SIZE(vma) - 1))

-#define GET_INDEX_VMA(vma, radix, heap) \
-do { \
- radix = RADIX_INDEX(vma); \
- heap = HEAP_INDEX(vma); \
-} while (0)
-
-#define GET_INDEX(node, radix, heap) \
-do { \
- struct vm_area_struct *__tmp = \
- prio_tree_entry(node, struct vm_area_struct, shared.prio_tree_node);\
- GET_INDEX_VMA(__tmp, radix, heap); \
-} while (0)
+
+static void get_index(struct prio_tree_root *root, struct prio_tree_node *node,
+ unsigned long *radix, unsigned long *heap)
+{
+ if (root->raw) {
+ struct vm_area_struct *vma = prio_tree_entry(
+ node, struct vm_area_struct, shared.prio_tree_node);
+
+ *radix = RADIX_INDEX(vma);
+ *heap = HEAP_INDEX(vma);
+ }
+ else {
+ *radix = node->start;
+ *heap = node->end;
+ }
+}

static unsigned long index_bits_to_maxindex[BITS_PER_LONG];

@@ -81,8 +84,6 @@
return index_bits_to_maxindex[bits - 1];
}

-static void prio_tree_remove(struct prio_tree_root *, struct prio_tree_node *);
-
/*
* Extend a priority search tree so that it can store a node with heap_index
* max_heap_index. In the worst case, this algorithm takes O((log n)^2).
@@ -138,7 +139,7 @@
/*
* Replace a prio_tree_node with a new node and return the old node
*/
-static struct prio_tree_node *prio_tree_replace(struct prio_tree_root *root,
+struct prio_tree_node *prio_tree_replace(struct prio_tree_root *root,
struct prio_tree_node *old, struct prio_tree_node *node)
{
INIT_PRIO_TREE_NODE(node);
@@ -182,7 +183,7 @@
* the tree, then returns the address of the prior node. Otherwise, inserts
* @node into the tree and returns @node.
*/
-static struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root,
+struct prio_tree_node *prio_tree_insert(struct prio_tree_root *root,
struct prio_tree_node *node)
{
struct prio_tree_node *cur, *res = node;
@@ -190,7 +191,7 @@
unsigned long r_index, h_index, index, mask;
int size_flag = 0;

- GET_INDEX(node, radix_index, heap_index);
+ get_index(root, node, &radix_index, &heap_index);

if (prio_tree_empty(root) ||
heap_index > prio_tree_maxindex(root->index_bits))
@@ -200,7 +201,7 @@
mask = 1UL << (root->index_bits - 1);

while (mask) {
- GET_INDEX(cur, r_index, h_index);
+ get_index(root, cur, &r_index, &h_index);

if (r_index == radix_index && h_index == heap_index)
return cur;
@@ -259,8 +260,7 @@
* algorithm takes O(log n) time where 'log n' is the number of bits required
* to represent the maximum heap_index.
*/
-static void prio_tree_remove(struct prio_tree_root *root,
- struct prio_tree_node *node)
+void prio_tree_remove(struct prio_tree_root *root, struct prio_tree_node *node)
{
struct prio_tree_node *cur;
unsigned long r_index, h_index_right, h_index_left;
@@ -269,14 +269,14 @@

while (!prio_tree_left_empty(cur) || !prio_tree_right_empty(cur)) {
if (!prio_tree_left_empty(cur))
- GET_INDEX(cur->left, r_index, h_index_left);
+ get_index(root, cur->left, &r_index, &h_index_left);
else {
cur = cur->right;
continue;
}

if (!prio_tree_right_empty(cur))
- GET_INDEX(cur->right, r_index, h_index_right);
+ get_index(root, cur->right, &r_index, &h_index_right);
else {
cur = cur->left;
continue;
@@ -291,7 +291,7 @@

if (prio_tree_root(cur)) {
BUG_ON(root->prio_tree_node != cur);
- INIT_PRIO_TREE_ROOT(root);
+ INIT_PRIO_TREE_ROOT(root, root->raw);
return;
}

@@ -318,7 +318,7 @@
if (prio_tree_left_empty(iter->cur))
return NULL;

- GET_INDEX(iter->cur->left, *r_index, *h_index);
+ get_index(iter->root, iter->cur->left, r_index, h_index);

if (iter->r_index <= *h_index) {
iter->cur = iter->cur->left;
@@ -359,7 +359,7 @@
if (iter->h_index < value)
return NULL;

- GET_INDEX(iter->cur->right, *r_index, *h_index);
+ get_index(iter->root, iter->cur->right, r_index, h_index);

if (iter->r_index <= *h_index) {
iter->cur = iter->cur->right;
@@ -414,7 +414,7 @@
* heap_index]. Note that always radix_index <= heap_index. We do a pre-order
* traversal of the tree.
*/
-static struct prio_tree_node *prio_tree_first(struct prio_tree_iter *iter)
+struct prio_tree_node *prio_tree_first(struct prio_tree_iter *iter)
{
struct prio_tree_root *root;
unsigned long r_index, h_index;
@@ -425,7 +425,7 @@
if (prio_tree_empty(root))
return NULL;

- GET_INDEX(root->prio_tree_node, r_index, h_index);
+ get_index(root, root->prio_tree_node, &r_index, &h_index);

if (iter->r_index > h_index)
return NULL;
@@ -453,7 +453,7 @@
*
* Get the next prio_tree_node that overlaps with the input interval in iter
*/
-static struct prio_tree_node *prio_tree_next(struct prio_tree_iter *iter)
+struct prio_tree_node *prio_tree_next(struct prio_tree_iter *iter)
{
unsigned long r_index, h_index;

@@ -551,8 +551,8 @@

vma->shared.vm_set.head = NULL;

- ptr = prio_tree_insert(root, &vma->shared.prio_tree_node);
- if (ptr != &vma->shared.prio_tree_node) {
+ ptr = raw_prio_tree_insert(root, &vma->shared.prio_tree_node);
+ if (ptr != (struct prio_tree_node *) &vma->shared.prio_tree_node) {
old = prio_tree_entry(ptr, struct vm_area_struct,
shared.prio_tree_node);
vma_prio_tree_add(vma, old);
@@ -568,7 +568,7 @@
if (!vma->shared.vm_set.parent)
list_del_init(&vma->shared.vm_set.list);
else
- prio_tree_remove(root, &vma->shared.prio_tree_node);
+ raw_prio_tree_remove(root, &vma->shared.prio_tree_node);
} else {
/* Leave this BUG_ON till prio_tree patch stabilizes */
BUG_ON(vma->shared.vm_set.head->shared.vm_set.head != vma);
@@ -583,7 +583,7 @@
} else
new_head = NULL;

- prio_tree_replace(root, &vma->shared.prio_tree_node,
+ raw_prio_tree_replace(root, &vma->shared.prio_tree_node,
&head->shared.prio_tree_node);
head->shared.vm_set.head = new_head;
if (new_head)
--- linux-2.6.9-orig/fs/inode.c Mon Oct 18 18:55:29 2004
+++ linux-2.6.9/fs/inode.c Tue Nov 16 20:09:46 2004
@@ -201,7 +201,7 @@
atomic_set(&inode->i_data.truncate_count, 0);
INIT_LIST_HEAD(&inode->i_data.private_list);
spin_lock_init(&inode->i_data.private_lock);
- INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap);
+ INIT_PRIO_TREE_ROOT(&inode->i_data.i_mmap, 1);
INIT_LIST_HEAD(&inode->i_data.i_mmap_nonlinear);
spin_lock_init(&inode->i_lock);
i_size_ordered_init(inode);

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2004-11-17 01:28:47

by Werner Almesberger

[permalink] [raw]
Subject: Re: Generalize prio_tree, 2nd try

I wrote:
> +static void get_index(struct prio_tree_root *root, struct prio_tree_node *node,

They should of course be "const" ...

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/