2009-01-10 10:46:37

by Johannes Berg

[permalink] [raw]
Subject: [PATCH] add b+tree library

This adds a b+tree library. The API and memory layout is documented in
the header file include/linux/btree.h. There are tree versions for 32, 64 and
128 bit keys as well as unsigned long (32/64 depending on platform).

Signed-off-by: Joern Engel <[email protected]>
Signed-off-by: Johannes Berg <[email protected]>
---
We've tested this code in userspace, and would appreciate still getting
it into 2.6.29 because we're working on two separate users for 2.6.30
and it's easier to manage that way. We do not expect API changes before
using it, as the code using these trees is almost finished but not well
enough tested for .29.

MAINTAINERS | 8
include/linux/btree-128.h | 109 ++++++
include/linux/btree-type.h | 147 ++++++++
include/linux/btree.h | 243 +++++++++++++
lib/Kconfig | 3
lib/Makefile | 1
lib/btree.c | 797 +++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 1308 insertions(+)

--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ wireless-testing/include/linux/btree.h 2009-01-10 11:41:46.000000000 +0100
@@ -0,0 +1,243 @@
+#ifndef BTREE_H
+#define BTREE_H
+
+#include <linux/kernel.h>
+#include <linux/mempool.h>
+
+/**
+ * DOC: B+Tree basics
+ *
+ * A B+Tree is a data structure that maps arbitrary keys (currently
+ * unsigned long, u32, u64 and 2 * u64) to pointers. The data structure
+ * is described at http://en.wikipedia.org/wiki/B-tree; note that we
+ * currently do not use binary search to find the key on lookups.
+ *
+ * Each B+Tree consists of a head, which contains bookkeeping information, and
+ * a variable number (starting with zero) of nodes. Each node contains the keys
+ * and pointers to sub-nodes, or, for leaf nodes, the keys and values for the
+ * tree entries.
+ *
+ * Each node in this implementation has the following layout:
+ * [key1, key2, ..., keyN] [val1, val2, ..., valN]
+ *
+ * Each key here is an array of unsigned longs, geo->no_longs in total. The
+ * number of keys and values (N) is geo->no_pairs.
+ */
+
+/**
+ * struct btree_head - btree head
+ *
+ * @node: the first node in the tree
+ * @mempool: mempool used for node allocations
+ * @height: current height of the tree
+ */
+struct btree_head {
+ unsigned long *node;
+ mempool_t *mempool;
+ int height;
+};
+
+/* btree geometry */
+struct btree_geo;
+
+/**
+ * btree_alloc - allocate function for the mempool
+ * @gfp_mask: gfp mask for the allocation
+ * @pool_data: unused
+ */
+void *btree_alloc(gfp_t gfp_mask, void *pool_data);
+
+/**
+ * btree_free - free function for the mempool
+ * @element: the element to free
+ * @pool_data: unused
+ */
+void btree_free(void *element, void *pool_data);
+
+/**
+ * btree_init_mempool - initialise a btree with given mempool
+ *
+ * @head: the btree head to initialise
+ * @mempool: the mempool to use
+ *
+ * When this function is used, there is no need to destroy
+ * the mempool.
+ */
+void btree_init_mempool(struct btree_head *head, mempool_t *mempool);
+
+/**
+ * btree_init - initialise a btree
+ *
+ * @head: the btree head to initialise
+ *
+ * This function allocates the memory pool that the
+ * btree needs. Returns zero or a negative error code
+ * (-%ENOMEM) when memory allocation fails.
+ *
+ */
+int __must_check btree_init(struct btree_head *head);
+
+/**
+ * btree_destroy - destroy mempool
+ *
+ * @head: the btree head to destroy
+ *
+ * This function destroys the internal memory pool, use only
+ * when using btree_init(), not with btree_init_mempool().
+ */
+void btree_destroy(struct btree_head *head);
+
+/**
+ * btree_lookup - look up a key in the btree
+ *
+ * @head: the btree to look in
+ * @geo: the btree geometry
+ * @key: the key to look up
+ *
+ * This function returns the value for the given key, or %NULL.
+ */
+void *btree_lookup(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key);
+
+/**
+ * btree_insert - insert an entry into the btree
+ *
+ * @head: the btree to add to
+ * @geo: the btree geometry
+ * @key: the key to add (must not already be present)
+ * @val: the value to add (must not be %NULL)
+ * @gfp: allocation flags for node allocations
+ *
+ * This function returns 0 if the item could be added, or an
+ * error code if it failed (may fail due to memory pressure).
+ */
+int __must_check btree_insert(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, void *val, gfp_t gfp);
+/**
+ * btree_update - update an entry in the btree
+ *
+ * @head: the btree to update
+ * @geo: the btree geometry
+ * @key: the key to update
+ * @val: the value to change it to (must not be %NULL)
+ *
+ * This function returns 0 if the update was successful, or
+ * -%ENOENT if the key could not be found.
+ */
+int btree_update(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, void *val);
+/**
+ * btree_remove - remove an entry from the btree
+ *
+ * @head: the btree to update
+ * @geo: the btree geometry
+ * @key: the key to remove
+ *
+ * This function returns the removed entry, or %NULL if the key
+ * could not be found.
+ */
+void *btree_remove(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key);
+
+/**
+ * btree_merge - merge two btrees
+ *
+ * @target: the tree that gets all the entries
+ * @victim: the tree that gets merged into @target
+ * @geo: the btree geometry
+ * @gfp: allocation flags
+ *
+ * The two trees @target and @victim must not contain any common keys;
+ * if they do, that is a bug and triggers a BUG(). This function returns zero
+ * if the trees were merged successfully, and may return a failure
+ * when memory allocation fails, in which case both trees might have
+ * been partially merged, i.e. some entries have been moved from
+ * @victim to @target.
+ */
+int btree_merge(struct btree_head *target, struct btree_head *victim,
+ struct btree_geo *geo, gfp_t gfp);
+
+/**
+ * btree_last - get last entry in btree
+ *
+ * @head: btree head
+ * @geo: btree geometry
+ * @key: last key
+ *
+ * Returns the last entry in the btree, and sets @key to the key
+ * of that entry; returns %NULL if the tree is empty, in which case
+ * @key is not changed.
+ */
+void *btree_last(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key);
+
+/**
+ * btree_get_prev - get previous entry
+ *
+ * @head: btree head
+ * @geo: btree geometry
+ * @key: pointer to key
+ *
+ * This function returns the entry whose key immediately precedes the key
+ * pointed to by @key, updates @key with that entry's key, or returns %NULL
+ * when there is no entry with a key smaller than the given key.
+ */
+void *btree_get_prev(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key);
+
+
+/* internal use, use btree_visitor{l,32,64,128} */
+size_t btree_visitor(struct btree_head *head, struct btree_geo *geo,
+ unsigned long opaque,
+ void (*func)(void *elem, unsigned long opaque,
+ unsigned long *key, size_t index,
+ void *func2),
+ void *func2);
+
+/* internal use, use btree_grim_visitor{l,32,64,128} */
+size_t btree_grim_visitor(struct btree_head *head, struct btree_geo *geo,
+ unsigned long opaque,
+ void (*func)(void *elem, unsigned long opaque,
+ unsigned long *key,
+ size_t index, void *func2),
+ void *func2);
+
+
+#include <linux/btree-128.h>
+
+extern struct btree_geo btree_geo32;
+#define BTREE_TYPE_SUFFIX l
+#define BTREE_TYPE_BITS BITS_PER_LONG
+#define BTREE_TYPE_GEO &btree_geo32
+#define BTREE_KEYTYPE unsigned long
+#include <linux/btree-type.h>
+
+#define btree_for_each_safel(head, key, val) \
+ for (val = btree_lastl(head, &key); \
+ val; \
+ val = btree_get_prevl(head, &key))
+
+#define BTREE_TYPE_SUFFIX 32
+#define BTREE_TYPE_BITS 32
+#define BTREE_TYPE_GEO &btree_geo32
+#define BTREE_KEYTYPE u32
+#include <linux/btree-type.h>
+
+#define btree_for_each_safe32(head, key, val) \
+ for (val = btree_last32(head, &key); \
+ val; \
+ val = btree_get_prev32(head, &key))
+
+extern struct btree_geo btree_geo64;
+#define BTREE_TYPE_SUFFIX 64
+#define BTREE_TYPE_BITS 64
+#define BTREE_TYPE_GEO &btree_geo64
+#define BTREE_KEYTYPE u64
+#include <linux/btree-type.h>
+
+#define btree_for_each_safe64(head, key, val) \
+ for (val = btree_last64(head, &key); \
+ val; \
+ val = btree_get_prev64(head, &key))
+
+#endif
--- wireless-testing.orig/lib/Kconfig 2009-01-10 10:46:42.000000000 +0100
+++ wireless-testing/lib/Kconfig 2009-01-10 11:40:54.000000000 +0100
@@ -136,6 +136,9 @@ config TEXTSEARCH_FSM
config PLIST
boolean

+config BTREE
+ tristate
+
config HAS_IOMEM
boolean
depends on !NO_IOMEM
--- wireless-testing.orig/lib/Makefile 2009-01-10 10:46:42.000000000 +0100
+++ wireless-testing/lib/Makefile 2009-01-10 11:27:11.000000000 +0100
@@ -40,6 +40,7 @@ lib-$(CONFIG_GENERIC_FIND_NEXT_BIT) += f
obj-$(CONFIG_GENERIC_HWEIGHT) += hweight.o
obj-$(CONFIG_LOCK_KERNEL) += kernel_lock.o
obj-$(CONFIG_PLIST) += plist.o
+obj-$(CONFIG_BTREE) += btree.o
obj-$(CONFIG_DEBUG_PREEMPT) += smp_processor_id.o
obj-$(CONFIG_DEBUG_LIST) += list_debug.o
obj-$(CONFIG_DEBUG_OBJECTS) += debugobjects.o
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ wireless-testing/lib/btree.c 2009-01-10 11:41:47.000000000 +0100
@@ -0,0 +1,797 @@
+/*
+ * lib/btree.c - Simple In-memory B+Tree
+ *
+ * As should be obvious for Linux kernel code, license is GPLv2
+ *
+ * Copyright (c) 2007-2008 Joern Engel <[email protected]>
+ * Bits and pieces stolen from Peter Zijlstra's code, which is
+ * Copyright 2007, Red Hat Inc. Peter Zijlstra <[email protected]>
+ * GPLv2
+ *
+ * see http://programming.kicks-ass.net/kernel-patches/vma_lookup/btree.patch
+ *
+ * A relatively simple B+Tree implementation. I have written it as a learning
+ * exercise to understand how B+Trees work. Turned out to be useful as well.
+ *
+ * B+Trees can be used similarly to Linux radix trees (which don't have anything
+ * in common with textbook radix trees, beware). Prerequisite for them working
+ * well is that access to a random tree node is much faster than a large number
+ * of operations within each node.
+ *
+ * Disks have fulfilled the prerequisite for a long time. More recently DRAM
+ * has gained similar properties, as memory access times, when measured in cpu
+ * cycles, have increased. Cacheline sizes have increased as well, which also
+ * helps B+Trees.
+ *
+ * Compared to radix trees, B+Trees are more efficient when dealing with a
+ * sparsely populated address space. Between 25% and 50% of the memory is
+ * occupied with valid pointers. When densely populated, radix trees contain
+ * ~98% pointers - hard to beat. Very sparse radix trees contain only ~2%
+ * pointers.
+ *
+ * This particular implementation stores pointers identified by a long value.
+ * Storing NULL pointers is illegal, lookup will return NULL when no entry
+ * was found.
+ *
+ * One trick was used that is not commonly found in textbooks. The lowest
+ * values are to the right, not to the left. All used slots within a node
+ * are on the left, all unused slots contain NUL values. Most operations
+ * simply loop once over all slots and terminate on the first NUL.
+ */
+
+#include <linux/btree.h>
+#include <linux/cache.h>
+#include <linux/kernel.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+
+#define MAX(a, b) ((a) > (b) ? (a) : (b))
+#define NODESIZE MAX(L1_CACHE_BYTES, 128)
+
+struct btree_geo {
+ int keylen;
+ int no_pairs;
+ int no_longs;
+};
+
+struct btree_geo btree_geo32 = {
+ .keylen = 1,
+ .no_pairs = NODESIZE / sizeof(long) / 2,
+ .no_longs = NODESIZE / sizeof(long) / 2,
+};
+EXPORT_SYMBOL_GPL(btree_geo32);
+
+#define LONG_PER_U64 (64 / BITS_PER_LONG)
+struct btree_geo btree_geo64 = {
+ .keylen = LONG_PER_U64,
+ .no_pairs = NODESIZE / sizeof(long) / (1 + LONG_PER_U64),
+ .no_longs = LONG_PER_U64 * (NODESIZE / sizeof(long) / (1 + LONG_PER_U64)),
+};
+EXPORT_SYMBOL_GPL(btree_geo64);
+
+struct btree_geo btree_geo128 = {
+ .keylen = 2 * LONG_PER_U64,
+ .no_pairs = NODESIZE / sizeof(long) / (1 + 2 * LONG_PER_U64),
+ .no_longs = 2 * LONG_PER_U64 * (NODESIZE / sizeof(long) / (1 + 2 * LONG_PER_U64)),
+};
+EXPORT_SYMBOL_GPL(btree_geo128);
+
+static struct kmem_cache *btree_cachep;
+
+void *btree_alloc(gfp_t gfp_mask, void *pool_data)
+{
+ return kmem_cache_alloc(btree_cachep, gfp_mask);
+}
+EXPORT_SYMBOL_GPL(btree_alloc);
+
+void btree_free(void *element, void *pool_data)
+{
+ kmem_cache_free(btree_cachep, element);
+}
+EXPORT_SYMBOL_GPL(btree_free);
+
+static unsigned long *btree_node_alloc(struct btree_head *head, gfp_t gfp)
+{
+ unsigned long *node;
+
+ node = mempool_alloc(head->mempool, gfp);
+ memset(node, 0, NODESIZE);
+ return node;
+}
+
+static int longcmp(const unsigned long *l1, const unsigned long *l2, size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++) {
+ if (l1[i] < l2[i])
+ return -1;
+ if (l1[i] > l2[i])
+ return 1;
+ }
+ return 0;
+}
+
+static unsigned long *longcpy(unsigned long *dest, const unsigned long *src,
+ size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++)
+ dest[i] = src[i];
+ return dest;
+}
+
+static unsigned long *longset(unsigned long *s, unsigned long c, size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++)
+ s[i] = c;
+ return s;
+}
+
+static void dec_key(struct btree_geo *geo, unsigned long *key)
+{
+ unsigned long val;
+ int i;
+
+ for (i = geo->keylen - 1; i >= 0; i--) {
+ val = key[i];
+ key[i] = val - 1;
+ if (val)
+ break;
+ }
+}
+
+static unsigned long *bkey(struct btree_geo *geo, unsigned long *node, int n)
+{
+ return &node[n * geo->keylen];
+}
+
+static void *bval(struct btree_geo *geo, unsigned long *node, int n)
+{
+ return (void *)node[geo->no_longs + n];
+}
+
+static void setkey(struct btree_geo *geo, unsigned long *node, int n,
+ unsigned long *key)
+{
+ longcpy(bkey(geo, node, n), key, geo->keylen);
+}
+
+static void setval(struct btree_geo *geo, unsigned long *node, int n,
+ void *val)
+{
+ node[geo->no_longs + n] = (unsigned long) val;
+}
+
+static void clearpair(struct btree_geo *geo, unsigned long *node, int n)
+{
+ longset(bkey(geo, node, n), 0, geo->keylen);
+ node[geo->no_longs + n] = 0;
+}
+
+static inline void __btree_init(struct btree_head *head)
+{
+ head->node = NULL;
+ head->height = 0;
+}
+
+void btree_init_mempool(struct btree_head *head, mempool_t *mempool)
+{
+ __btree_init(head);
+ head->mempool = mempool;
+}
+EXPORT_SYMBOL_GPL(btree_init_mempool);
+
+int btree_init(struct btree_head *head)
+{
+ __btree_init(head);
+ head->mempool = mempool_create(0, btree_alloc, btree_free, NULL);
+ if (!head->mempool)
+ return -ENOMEM;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(btree_init);
+
+void btree_destroy(struct btree_head *head)
+{
+ mempool_destroy(head->mempool);
+ head->mempool = NULL;
+}
+EXPORT_SYMBOL_GPL(btree_destroy);
+
+void *btree_last(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key)
+{
+ int height = head->height;
+ unsigned long *node = head->node;
+
+ if (height == 0)
+ return NULL;
+
+ for ( ; height > 1; height--)
+ node = bval(geo, node, 0);
+
+ longcpy(key, bkey(geo, node, 0), geo->keylen);
+ return bval(geo, node, 0);
+}
+EXPORT_SYMBOL_GPL(btree_last);
+
+static int keycmp(struct btree_geo *geo, unsigned long *node, int pos,
+ unsigned long *key)
+{
+ return longcmp(bkey(geo, node, pos), key, geo->keylen);
+}
+
+static int keyzero(struct btree_geo *geo, unsigned long *key)
+{
+ int i;
+
+ for (i = 0; i < geo->keylen; i++)
+ if (key[i])
+ return 0;
+
+ return 1;
+}
+
+void *btree_lookup(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key)
+{
+ int i, height = head->height;
+ unsigned long *node = head->node;
+
+ if (height == 0)
+ return NULL;
+
+ for ( ; height > 1; height--) {
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) <= 0)
+ break;
+ if (i == geo->no_pairs)
+ return NULL;
+ node = bval(geo, node, i);
+ if (!node)
+ return NULL;
+ }
+
+ if (!node)
+ return NULL;
+
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) == 0)
+ return bval(geo, node, i);
+ return NULL;
+}
+EXPORT_SYMBOL_GPL(btree_lookup);
+
+int btree_update(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, void *val)
+{
+ int i, height = head->height;
+ unsigned long *node = head->node;
+
+ if (height == 0)
+ return -ENOENT;
+
+ for ( ; height > 1; height--) {
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) <= 0)
+ break;
+ if (i == geo->no_pairs)
+ return -ENOENT;
+ node = bval(geo, node, i);
+ if (!node)
+ return -ENOENT;
+ }
+
+ if (!node)
+ return -ENOENT;
+
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) == 0) {
+ setval(geo, node, i, val);
+ return 0;
+ }
+ return -ENOENT;
+}
+EXPORT_SYMBOL_GPL(btree_update);
+
+/*
+ * Usually this function is quite similar to normal lookup. But the key of
+ * a parent node may be smaller than the smallest key of all its siblings.
+ * In such a case we cannot just return NULL, as we have only proven that no
+ * key smaller than __key, but larger than this parent key exists.
+ * So we set __key to the parent key and retry. We have to use the smallest
+ * such parent key, which is the last parent key we encountered.
+ */
+void *btree_get_prev(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *__key)
+{
+ int i, height;
+ unsigned long *node, *oldnode;
+ unsigned long *retry_key = NULL, key[geo->keylen];
+
+ if (keyzero(geo, __key))
+ return NULL;
+
+ if (head->height == 0)
+ return NULL;
+retry:
+ longcpy(key, __key, geo->keylen);
+ dec_key(geo, key);
+
+ node = head->node;
+ for (height = head->height ; height > 1; height--) {
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) <= 0)
+ break;
+ if (i == geo->no_pairs)
+ goto miss;
+ oldnode = node;
+ node = bval(geo, node, i);
+ if (!node)
+ goto miss;
+ retry_key = bkey(geo, oldnode, i);
+ }
+
+ if (!node)
+ goto miss;
+
+ for (i = 0; i < geo->no_pairs; i++) {
+ if (keycmp(geo, node, i, key) <= 0) {
+ if (bval(geo, node, i)) {
+ longcpy(__key, bkey(geo, node, i), geo->keylen);
+ return bval(geo, node, i);
+ } else
+ goto miss;
+ }
+ }
+miss:
+ if (retry_key) {
+ __key = retry_key;
+ retry_key = NULL;
+ goto retry;
+ }
+ return NULL;
+}
+
+static int getpos(struct btree_geo *geo, unsigned long *node,
+ unsigned long *key)
+{
+ int i;
+
+ for (i = 0; i < geo->no_pairs; i++) {
+ if (keycmp(geo, node, i, key) <= 0)
+ break;
+ }
+ return i;
+}
+
+static int getfill(struct btree_geo *geo, unsigned long *node, int start)
+{
+ int i;
+
+ for (i = start; i < geo->no_pairs; i++)
+ if (!bval(geo, node, i))
+ break;
+ return i;
+}
+
+/*
+ * locate the node at the given level in the btree that covers @key
+ */
+static unsigned long *find_level(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, int level)
+{
+ unsigned long *node = head->node;
+ int i, height;
+
+ for (height = head->height; height > level; height--) {
+ for (i = 0; i < geo->no_pairs; i++)
+ if (keycmp(geo, node, i, key) <= 0)
+ break;
+
+ if ((i == geo->no_pairs) || !bval(geo, node, i)) {
+ /* right-most key is too large, update it */
+ /* FIXME: If the right-most key on higher levels is
+ * always zero, this wouldn't be necessary. */
+ i--;
+ setkey(geo, node, i, key);
+ }
+ BUG_ON(i < 0);
+ node = bval(geo, node, i);
+ }
+ BUG_ON(!node);
+ return node;
+}
+
+static int btree_grow(struct btree_head *head, struct btree_geo *geo,
+ gfp_t gfp)
+{
+ unsigned long *node;
+ int fill;
+
+ node = btree_node_alloc(head, gfp);
+ if (!node)
+ return -ENOMEM;
+ if (head->node) {
+ fill = getfill(geo, head->node, 0);
+ setkey(geo, node, 0, bkey(geo, head->node, fill - 1));
+ setval(geo, node, 0, head->node);
+ }
+ head->node = node;
+ head->height++;
+ return 0;
+}
+
+static void btree_shrink(struct btree_head *head, struct btree_geo *geo)
+{
+ unsigned long *node;
+ int fill;
+
+ if (head->height <= 1)
+ return;
+
+ node = head->node;
+ fill = getfill(geo, node, 0);
+ BUG_ON(fill > 1);
+ head->node = bval(geo, node, 0);
+ head->height--;
+ mempool_free(node, head->mempool);
+}
+
+static int btree_insert_level(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, void *val, int level,
+ gfp_t gfp)
+{
+ unsigned long *node;
+ int i, pos, fill, err;
+
+ BUG_ON(!val);
+ if (head->height < level) {
+ err = btree_grow(head, geo, gfp);
+ if (err)
+ return err;
+ }
+
+retry:
+ node = find_level(head, geo, key, level);
+ pos = getpos(geo, node, key);
+ fill = getfill(geo, node, pos);
+ /* two identical keys are not allowed */
+ BUG_ON(pos < fill && keycmp(geo, node, pos, key) == 0);
+
+ if (fill == geo->no_pairs) {
+ /* need to split node */
+ unsigned long *new;
+
+ new = btree_node_alloc(head, gfp);
+ if (!new)
+ return -ENOMEM;
+ err = btree_insert_level(head, geo,
+ bkey(geo, node, fill / 2 - 1),
+ new, level + 1, gfp);
+ if (err) {
+ mempool_free(new, head->mempool);
+ return err;
+ }
+ for (i = 0; i < fill / 2; i++) {
+ setkey(geo, new, i, bkey(geo, node, i));
+ setval(geo, new, i, bval(geo, node, i));
+ setkey(geo, node, i, bkey(geo, node, i + fill / 2));
+ setval(geo, node, i, bval(geo, node, i + fill / 2));
+ clearpair(geo, node, i + fill / 2);
+ }
+ if (fill & 1) {
+ setkey(geo, node, i, bkey(geo, node, fill - 1));
+ setval(geo, node, i, bval(geo, node, fill - 1));
+ clearpair(geo, node, fill - 1);
+ }
+ goto retry;
+ }
+ BUG_ON(fill >= geo->no_pairs);
+
+ /* shift and insert */
+ for (i = fill; i > pos; i--) {
+ setkey(geo, node, i, bkey(geo, node, i - 1));
+ setval(geo, node, i, bval(geo, node, i - 1));
+ }
+ setkey(geo, node, pos, key);
+ setval(geo, node, pos, val);
+
+ return 0;
+}
+
+int btree_insert(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, void *val, gfp_t gfp)
+{
+ return btree_insert_level(head, geo, key, val, 1, gfp);
+}
+EXPORT_SYMBOL_GPL(btree_insert);
+
+static void *btree_remove_level(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, int level);
+static void merge(struct btree_head *head, struct btree_geo *geo, int level,
+ unsigned long *left, int lfill,
+ unsigned long *right, int rfill,
+ unsigned long *parent, int lpos)
+{
+ int i;
+
+ for (i = 0; i < rfill; i++) {
+ /* Move all keys to the left */
+ setkey(geo, left, lfill + i, bkey(geo, right, i));
+ setval(geo, left, lfill + i, bval(geo, right, i));
+ }
+ /* Exchange left and right child in parent */
+ setval(geo, parent, lpos, right);
+ setval(geo, parent, lpos + 1, left);
+ /* Remove left (formerly right) child from parent */
+ btree_remove_level(head, geo, bkey(geo, parent, lpos), level + 1);
+ mempool_free(right, head->mempool);
+}
+
+static void rebalance(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, int level, unsigned long *child, int fill)
+{
+ unsigned long *parent, *left = NULL, *right = NULL;
+ int i, no_left, no_right;
+
+ if (fill == 0) {
+ /* Because we don't steal entries from a neighbour, this case
+ * can happen. Parent node contains a single child, this
+ * node, so merging with a sibling never happens.
+ */
+ btree_remove_level(head, geo, key, level + 1);
+ mempool_free(child, head->mempool);
+ return;
+ }
+
+ parent = find_level(head, geo, key, level + 1);
+ i = getpos(geo, parent, key);
+ BUG_ON(bval(geo, parent, i) != child);
+
+ if (i > 0) {
+ left = bval(geo, parent, i - 1);
+ no_left = getfill(geo, left, 0);
+ if (fill + no_left <= geo->no_pairs) {
+ merge(head, geo, level,
+ left, no_left,
+ child, fill,
+ parent, i - 1);
+ return;
+ }
+ }
+ if (i + 1 < getfill(geo, parent, i)) {
+ right = bval(geo, parent, i + 1);
+ no_right = getfill(geo, right, 0);
+ if (fill + no_right <= geo->no_pairs) {
+ merge(head, geo, level,
+ child, fill,
+ right, no_right,
+ parent, i);
+ return;
+ }
+ }
+ /*
+ * We could also try to steal one entry from the left or right
+ * neighbor. By not doing so we changed the invariant from
+ * "all nodes are at least half full" to "no two neighboring
+ * nodes can be merged". Which means that the average fill of
+ * all nodes is still half or better.
+ */
+}
+
+static void *btree_remove_level(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key, int level)
+{
+ unsigned long *node;
+ int i, pos, fill;
+ void *ret;
+
+ if (level > head->height) {
+ /* we recursed all the way up */
+ head->height = 0;
+ head->node = NULL;
+ return NULL;
+ }
+
+ node = find_level(head, geo, key, level);
+ pos = getpos(geo, node, key);
+ fill = getfill(geo, node, pos);
+ if ((level == 1) && (keycmp(geo, node, pos, key) != 0))
+ return NULL;
+ ret = bval(geo, node, pos);
+
+ /* remove and shift */
+ for (i = pos; i < fill - 1; i++) {
+ setkey(geo, node, i, bkey(geo, node, i + 1));
+ setval(geo, node, i, bval(geo, node, i + 1));
+ }
+ clearpair(geo, node, fill - 1);
+
+ if (fill - 1 < geo->no_pairs / 2) {
+ if (level < head->height)
+ rebalance(head, geo, key, level, node, fill - 1);
+ else if (fill - 1 == 1)
+ btree_shrink(head, geo);
+ }
+
+ return ret;
+}
+
+void *btree_remove(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *key)
+{
+ if (head->height == 0)
+ return NULL;
+
+ return btree_remove_level(head, geo, key, 1);
+}
+EXPORT_SYMBOL_GPL(btree_remove);
+
+int btree_merge(struct btree_head *target, struct btree_head *victim,
+ struct btree_geo *geo, gfp_t gfp)
+{
+ unsigned long key[geo->keylen];
+ unsigned long dup[geo->keylen];
+ void *val;
+ int err;
+
+ BUG_ON(target == victim);
+
+ if (!(target->node)) {
+ /* target is empty, just copy fields over */
+ target->node = victim->node;
+ target->height = victim->height;
+ __btree_init(victim);
+ return 0;
+ }
+
+ /* TODO: This needs some optimizations. Currently we do three tree
+ * walks to remove a single object from the victim.
+ */
+ for (;;) {
+ if (!btree_last(victim, geo, key))
+ break;
+ val = btree_lookup(victim, geo, key);
+ err = btree_insert(target, geo, key, val, gfp);
+ if (err)
+ return err;
+ /* We must make a copy of the key, as the original will get
+ * mangled inside btree_remove. */
+ longcpy(dup, key, geo->keylen);
+ btree_remove(victim, geo, dup);
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(btree_merge);
+
+static size_t __btree_for_each(struct btree_head *head, struct btree_geo *geo,
+ unsigned long *node, unsigned long opaque,
+ void (*func)(void *elem, unsigned long opaque,
+ unsigned long *key, size_t index,
+ void *func2),
+ void *func2, int reap, int height, size_t count)
+{
+ int i;
+ unsigned long *child;
+
+ for (i = 0; i < geo->no_pairs; i++) {
+ child = bval(geo, node, i);
+ if (!child)
+ break;
+ if (height > 1)
+ count = __btree_for_each(head, geo, child, opaque,
+ func, func2, reap, height - 1, count);
+ else
+ func(child, opaque, bkey(geo, node, i), count++,
+ func2);
+ }
+ if (reap)
+ mempool_free(node, head->mempool);
+ return count;
+}
+
+static void empty(void *elem, unsigned long opaque, unsigned long *key,
+ size_t index, void *func2)
+{
+}
+
+void visitorl(void *elem, unsigned long opaque, unsigned long *key,
+ size_t index, void *__func)
+{
+ visitorl_t func = __func;
+
+ func(elem, opaque, *key, index);
+}
+EXPORT_SYMBOL_GPL(visitorl);
+
+void visitor32(void *elem, unsigned long opaque, unsigned long *__key,
+ size_t index, void *__func)
+{
+ visitor32_t func = __func;
+ u32 *key = (void *)__key;
+
+ func(elem, opaque, *key, index);
+}
+EXPORT_SYMBOL_GPL(visitor32);
+
+void visitor64(void *elem, unsigned long opaque, unsigned long *__key,
+ size_t index, void *__func)
+{
+ visitor64_t func = __func;
+ u64 *key = (void *)__key;
+
+ func(elem, opaque, *key, index);
+}
+EXPORT_SYMBOL_GPL(visitor64);
+
+void visitor128(void *elem, unsigned long opaque, unsigned long *__key,
+ size_t index, void *__func)
+{
+ visitor128_t func = __func;
+ u64 *key = (void *)__key;
+
+ func(elem, opaque, key[0], key[1], index);
+}
+EXPORT_SYMBOL_GPL(visitor128);
+
+size_t btree_visitor(struct btree_head *head, struct btree_geo *geo,
+ unsigned long opaque,
+ void (*func)(void *elem, unsigned long opaque,
+ unsigned long *key,
+ size_t index, void *func2),
+ void *func2)
+{
+ size_t count = 0;
+
+ if (!func2)
+ func = empty;
+ if (head->node)
+ count = __btree_for_each(head, geo, head->node, opaque, func,
+ func2, 0, head->height, 0);
+ return count;
+}
+EXPORT_SYMBOL_GPL(btree_visitor);
+
+size_t btree_grim_visitor(struct btree_head *head, struct btree_geo *geo,
+ unsigned long opaque,
+ void (*func)(void *elem, unsigned long opaque,
+ unsigned long *key,
+ size_t index, void *func2),
+ void *func2)
+{
+ size_t count = 0;
+
+ if (!func2)
+ func = empty;
+ if (head->node)
+ count = __btree_for_each(head, geo, head->node, opaque, func,
+ func2, 1, head->height, 0);
+ __btree_init(head);
+ return count;
+}
+EXPORT_SYMBOL_GPL(btree_grim_visitor);
+
+static int __init btree_module_init(void)
+{
+ btree_cachep = kmem_cache_create("btree_node", NODESIZE, 0,
+ SLAB_HWCACHE_ALIGN, NULL);
+ return 0;
+}
+
+static void __exit btree_module_exit(void)
+{
+ kmem_cache_destroy(btree_cachep);
+}
+
+/* If core code starts using btree, initialization should happen even earlier */
+module_init(btree_module_init);
+module_exit(btree_module_exit);
+
+MODULE_AUTHOR("Joern Engel <[email protected]>");
+MODULE_AUTHOR("Johannes Berg <[email protected]>");
+MODULE_LICENSE("GPL");
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ wireless-testing/include/linux/btree-128.h 2009-01-10 11:41:47.000000000 +0100
@@ -0,0 +1,109 @@
+extern struct btree_geo btree_geo128;
+
+struct btree_head128 { struct btree_head h; };
+
+static inline void btree_init_mempool128(struct btree_head128 *head,
+ mempool_t *mempool)
+{
+ btree_init_mempool(&head->h, mempool);
+}
+
+static inline int btree_init128(struct btree_head128 *head)
+{
+ return btree_init(&head->h);
+}
+
+static inline void btree_destroy128(struct btree_head128 *head)
+{
+ btree_destroy(&head->h);
+}
+
+static inline void *btree_lookup128(struct btree_head128 *head, u64 k1, u64 k2)
+{
+ u64 key[2] = {k1, k2};
+ return btree_lookup(&head->h, &btree_geo128, (unsigned long *)&key);
+}
+
+static inline void *btree_get_prev128(struct btree_head128 *head,
+ u64 *k1, u64 *k2)
+{
+ u64 key[2] = {*k1, *k2};
+ void *val;
+
+ val = btree_get_prev(&head->h, &btree_geo128,
+ (unsigned long *)&key);
+ *k1 = key[0];
+ *k2 = key[1];
+ return val;
+}
+
+static inline int btree_insert128(struct btree_head128 *head, u64 k1, u64 k2,
+ void *val, gfp_t gfp)
+{
+ u64 key[2] = {k1, k2};
+ return btree_insert(&head->h, &btree_geo128,
+ (unsigned long *)&key, val, gfp);
+}
+
+static inline int btree_update128(struct btree_head128 *head, u64 k1, u64 k2,
+ void *val)
+{
+ u64 key[2] = {k1, k2};
+ return btree_update(&head->h, &btree_geo128,
+ (unsigned long *)&key, val);
+}
+
+static inline void *btree_remove128(struct btree_head128 *head, u64 k1, u64 k2)
+{
+ u64 key[2] = {k1, k2};
+ return btree_remove(&head->h, &btree_geo128, (unsigned long *)&key);
+}
+
+static inline void *btree_last128(struct btree_head128 *head, u64 *k1, u64 *k2)
+{
+ u64 key[2];
+ void *val;
+
+ val = btree_last(&head->h, &btree_geo128, (unsigned long *)&key[0]);
+ if (val) {
+ *k1 = key[0];
+ *k2 = key[1];
+ }
+
+ return val;
+}
+
+static inline int btree_merge128(struct btree_head128 *target,
+ struct btree_head128 *victim,
+ gfp_t gfp)
+{
+ return btree_merge(&target->h, &victim->h, &btree_geo128, gfp);
+}
+
+void visitor128(void *elem, unsigned long opaque, unsigned long *__key,
+ size_t index, void *__func);
+
+typedef void (*visitor128_t)(void *elem, unsigned long opaque,
+ u64 key1, u64 key2, size_t index);
+
+static inline size_t btree_visitor128(struct btree_head128 *head,
+ unsigned long opaque,
+ visitor128_t func2)
+{
+ return btree_visitor(&head->h, &btree_geo128, opaque,
+ visitor128, func2);
+}
+
+static inline size_t btree_grim_visitor128(struct btree_head128 *head,
+ unsigned long opaque,
+ visitor128_t func2)
+{
+ return btree_grim_visitor(&head->h, &btree_geo128, opaque,
+ visitor128, func2);
+}
+
+#define btree_for_each_safe128(head, k1, k2, val) \
+ for (val = btree_last128(head, &k1, &k2); \
+ val; \
+ val = btree_get_prev128(head, &k1, &k2))
+
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ wireless-testing/include/linux/btree-type.h 2009-01-10 11:41:47.000000000 +0100
@@ -0,0 +1,147 @@
+#define __BTREE_TP(pfx, type, sfx) pfx ## type ## sfx
+#define _BTREE_TP(pfx, type, sfx) __BTREE_TP(pfx, type, sfx)
+#define BTREE_TP(pfx) _BTREE_TP(pfx, BTREE_TYPE_SUFFIX,)
+#define BTREE_FN(name) BTREE_TP(btree_ ## name)
+#define BTREE_TYPE_HEAD BTREE_TP(struct btree_head)
+#define VISITOR_FN BTREE_TP(visitor)
+#define VISITOR_FN_T _BTREE_TP(visitor, BTREE_TYPE_SUFFIX, _t)
+
+BTREE_TYPE_HEAD {
+ struct btree_head h;
+};
+
+static inline void BTREE_FN(init_mempool)(BTREE_TYPE_HEAD *head,
+ mempool_t *mempool)
+{
+ btree_init_mempool(&head->h, mempool);
+}
+
+static inline int BTREE_FN(init)(BTREE_TYPE_HEAD *head)
+{
+ return btree_init(&head->h);
+}
+
+static inline void BTREE_FN(destroy)(BTREE_TYPE_HEAD *head)
+{
+ btree_destroy(&head->h);
+}
+
+static inline int BTREE_FN(merge)(BTREE_TYPE_HEAD *target,
+ BTREE_TYPE_HEAD *victim,
+ gfp_t gfp)
+{
+ return btree_merge(&target->h, &victim->h, BTREE_TYPE_GEO, gfp);
+}
+
+#if (BITS_PER_LONG > BTREE_TYPE_BITS)
+static inline void *BTREE_FN(lookup)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key)
+{
+ unsigned long _key = key;
+ return btree_lookup(&head->h, BTREE_TYPE_GEO, &_key);
+}
+
+static inline int BTREE_FN(insert)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key,
+ void *val, gfp_t gfp)
+{
+ unsigned long _key = key;
+ return btree_insert(&head->h, BTREE_TYPE_GEO, &_key, val, gfp);
+}
+
+static inline int BTREE_FN(update)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key,
+ void *val)
+{
+ unsigned long _key = key;
+ return btree_update(&head->h, BTREE_TYPE_GEO, &_key, val);
+}
+
+static inline void *BTREE_FN(remove)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key)
+{
+ unsigned long _key = key;
+ return btree_remove(&head->h, BTREE_TYPE_GEO, &_key);
+}
+
+static inline void *BTREE_FN(last)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE *key)
+{
+ unsigned long _key;
+ void *val = btree_last(&head->h, BTREE_TYPE_GEO, &_key);
+ if (val)
+ *key = _key;
+ return val;
+}
+
+static inline void *BTREE_FN(get_prev)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE *key)
+{
+ unsigned long _key = *key;
+ void *val = btree_get_prev(&head->h, BTREE_TYPE_GEO, &_key);
+ if (val)
+ *key = _key;
+ return val;
+}
+#else
+static inline void *BTREE_FN(lookup)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key)
+{
+ return btree_lookup(&head->h, BTREE_TYPE_GEO, (unsigned long *)&key);
+}
+
+static inline int BTREE_FN(insert)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key,
+ void *val, gfp_t gfp)
+{
+ return btree_insert(&head->h, BTREE_TYPE_GEO, (unsigned long *)&key,
+ val, gfp);
+}
+
+static inline int BTREE_FN(update)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key,
+ void *val)
+{
+ return btree_update(&head->h, BTREE_TYPE_GEO, (unsigned long *)&key, val);
+}
+
+static inline void *BTREE_FN(remove)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE key)
+{
+ return btree_remove(&head->h, BTREE_TYPE_GEO, (unsigned long *)&key);
+}
+
+static inline void *BTREE_FN(last)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE *key)
+{
+ return btree_last(&head->h, BTREE_TYPE_GEO, (unsigned long *)key);
+}
+
+static inline void *BTREE_FN(get_prev)(BTREE_TYPE_HEAD *head, BTREE_KEYTYPE *key)
+{
+ return btree_get_prev(&head->h, BTREE_TYPE_GEO, (unsigned long *)key);
+}
+#endif
+
+void VISITOR_FN(void *elem, unsigned long opaque, unsigned long *key,
+ size_t index, void *__func);
+
+typedef void (*VISITOR_FN_T)(void *elem, unsigned long opaque,
+ BTREE_KEYTYPE key, size_t index);
+
+static inline size_t BTREE_FN(visitor)(BTREE_TYPE_HEAD *head,
+ unsigned long opaque,
+ VISITOR_FN_T func2)
+{
+ return btree_visitor(&head->h, BTREE_TYPE_GEO, opaque,
+ visitorl, func2);
+}
+
+static inline size_t BTREE_FN(grim_visitor)(BTREE_TYPE_HEAD *head,
+ unsigned long opaque,
+ VISITOR_FN_T func2)
+{
+ return btree_grim_visitor(&head->h, BTREE_TYPE_GEO, opaque,
+ visitorl, func2);
+}
+
+#undef VISITOR_FN
+#undef VISITOR_FN_T
+#undef __BTREE_TP
+#undef _BTREE_TP
+#undef BTREE_TP
+#undef BTREE_FN
+#undef BTREE_TYPE_HEAD
+#undef BTREE_TYPE_SUFFIX
+#undef BTREE_TYPE_GEO
+#undef BTREE_KEYTYPE
+#undef BTREE_TYPE_BITS
--- wireless-testing.orig/MAINTAINERS 2009-01-10 11:27:45.000000000 +0100
+++ wireless-testing/MAINTAINERS 2009-01-10 11:43:26.000000000 +0100
@@ -866,6 +866,14 @@ L: [email protected]
W: http://www.linux-ax25.org/
S: Maintained

+B+TREE
+P: Johannes Berg
+M: [email protected]
+P: Joern Engel
+M: [email protected]
+L: [email protected]
+S: Maintained
+
B43 WIRELESS DRIVER
P: Michael Buesch
M: [email protected]


2009-01-10 11:03:09

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

Hi

> This adds a b+tree library. The API and memory layout is documented in
> the header file lib/btree.h. There are tree versions for 32, 64 and
> 128 bit keys as well as unsigned long (32/64 depending on platform).

Can this library remove the btree code in some btree based filesystem?

2009-01-10 11:36:22

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sat, 2009-01-10 at 20:02 +0900, KOSAKI Motohiro wrote:

> > This adds a b+tree library. The API and memory layout is documented in
> > the header file lib/btree.h. There are tree versions for 32, 64 and
> > 128 bit keys as well as unsigned long (32/64 depending on platform).
>
> Can this library remove the btree code in some btree based filesystem?

Joern is going to use it for logfs, that's the point. Not sure about
other filesystems.

johannes



2009-01-10 11:56:55

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sat, 10 January 2009 12:37:08 +0100, Johannes Berg wrote:
> On Sat, 2009-01-10 at 20:02 +0900, KOSAKI Motohiro wrote:
>
> > > This adds a b+tree library. The API and memory layout is documented in
> > > the header file lib/btree.h. There are tree versions for 32, 64 and
> > > 128 bit keys as well as unsigned long (32/64 depending on platform).
> >
> > Can this library remove the btree code in some btree based filesystem?
>
> Joern is going to use it for logfs, that's the point. Not sure about
> other filesystems.

Err...

While true, this is decidedly _not_ a disk-based btree. Logfs uses it
for in-memory data structures. So the short answer is no.

There are several differences between this implementation and one used
for disk data.

- The value is a pointer, 32/64bit depending on architecture. Won't be
suited for a 64bit filesystem on a 32bit host, for example.

- Lookups are a simple loop and don't use binary search. With a 4KiB
btree node and 64B cacheline, binary search will only load 7 out of 64
cachelines - making a big difference. This implementation uses 128B
(or cacheline size, whichever is bigger) nodes, so binary search buys
almost nothing, but causes branch mispredictions, bigger code, etc.
(some rough numbers below).
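
To spell those numbers out: a 4KiB node spans 4096/64 = 64 cachelines,
and a binary search over it only has to touch about log2(64) + 1 = 7 of
them, while a linear scan can touch all 64. A 128B node spans just two
cachelines to begin with, so a linear scan is already about as
cache-friendly as it gets.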

There may be more. So if you plan to replace disk-based btree code with
this, it will certainly be some amount of work. Not sure if it is a
good idea or if one should rather have a separate library.

Jörn

--
The cost of changing business rules is much more expensive for software
than for a secretary.
-- unknown

2009-01-10 12:29:18

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

>> > This adds a b+tree library. The API and memory layout is documented in
>> > the header file lib/btree.h. There are tree versions for 32, 64 and
>> > 128 bit keys as well as unsigned long (32/64 depending on platform).
>>
>> Can this library remove the btree code in some btree based filesystem?
>
> Joern is going to use it for logfs, that's the point. Not sure about
> other filesystems.

Why can't you investigate other filesystem?
if this library is used by only one fs, why does this library routine
need to linux/lib/ directory?

2009-01-10 18:39:53

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sat, 10 January 2009 21:29:04 +0900, KOSAKI Motohiro wrote:
> >
> > Joern is going to use it for logfs, that's the point. Not sure about
> > other filesystems.
>
> Why can't you investigate other filesystem?
> if this library is used by only one fs, why does this library routine
> need to linux/lib/ directory?

Because it is not used for on-disk data structures. It is an
alternative to radix trees, rbtrees, hash tables or linked lists.

Jörn

--
When in doubt, punt. When somebody actually complains, go back and fix it...
The 90% solution is a good thing.
-- Rob Landley

2009-01-10 18:43:24

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sat, 2009-01-10 at 19:39 +0100, Jörn Engel wrote:
> On Sat, 10 January 2009 21:29:04 +0900, KOSAKI Motohiro wrote:
> > >
> > > Joern is going to use it for logfs, that's the point. Not sure about
> > > other filesystems.
> >
> > Why can't you investigate other filesystem?
> > if this library is used by only one fs, why does this library routine
> > need to linux/lib/ directory?
>
> Because it is not used for on-disk data structures. It is an
> alternative to radix trees, rbtrees, hash tables or linked lists.

Sorry, I misunderstood the question here.

Also, to elaborate on that answer, I'm going to use this as a sort of
hash table for wireless, where it ensures better scalability than a pure
hashtable from quiet environments (say wireless off on an airplane) to
your wireless test lab (100+ access points)
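
To make that a bit more concrete, here is a rough sketch of what such a
station table could look like on top of the 64-bit variant of the API in
this patch -- struct sta_info, mac_to_key() and the wrapper names are
made up for illustration, this is not actual mac80211 code:

#include <linux/btree.h>

struct sta_info;	/* stands in for whatever per-station state we keep */

static struct btree_head64 sta_tree;

/* pack a 48-bit MAC address into a 64-bit key */
static u64 mac_to_key(const u8 *addr)
{
	return ((u64)addr[0] << 40) | ((u64)addr[1] << 32) |
	       ((u64)addr[2] << 24) | ((u64)addr[3] << 16) |
	       ((u64)addr[4] <<  8) |  (u64)addr[5];
}

static int sta_table_init(void)
{
	return btree_init64(&sta_tree);
}

/* callers serialize access themselves, the library does no locking */
static int sta_add(const u8 *addr, struct sta_info *sta)
{
	return btree_insert64(&sta_tree, mac_to_key(addr), sta, GFP_KERNEL);
}

static struct sta_info *sta_find(const u8 *addr)
{
	return btree_lookup64(&sta_tree, mac_to_key(addr));
}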

johannes



2009-01-10 19:41:41

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

Johannes Berg <[email protected]> writes:

> Also, to elaborate on that answer, I'm going to use this as a sort of
> hash table for wireless, where it ensures better scalability than a pure
> hashtable from quiet environments (say wireless off on an airplane) to
> your wireless test lab (100+ access points)

Is there any particular reason you can't use the standard rbtrees
for that?

-Andi
--
[email protected]

2009-01-10 20:21:30

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sat, 2009-01-10 at 20:41 +0100, Andi Kleen wrote:
> Johannes Berg <[email protected]> writes:
>
> > Also, to elaborate on that answer, I'm going to use this as a sort of
> > hash table for wireless, where it ensures better scalability than a pure
> > hashtable from quiet environments (say wireless off on an airplane) to
> > your wireless test lab (100+ access points)
>
> Is there any particular reason you can't use the standard rbtrees
> for that?

It just seemed like a lot of overhead, no particular reason I guess.

johannes



2009-01-10 20:23:40

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sat, 10 January 2009 20:41:03 +0100, Andi Kleen wrote:
> Johannes Berg <[email protected]> writes:
>
> > Also, to elaborate on that answer, I'm going to use this as a sort of
> > hash table for wireless, where it ensures better scalability than a pure
> > hashtable from quiet environments (say wireless off on an airplane) to
> > your wireless test lab (100+ access points)
>
> Is there any particular reason you can't use the standard rbtrees
> for that?

Can't think of any. You can use linked lists as well. Whether you want
to is a different matter.

Key difference is the number of cachelines you need to find a particular
entry. rbtrees have a fanout of sqrt(3), so for a million elements (to
pick a random example) you need about 25 cachelines with rbtrees and
about 5-16 with btrees. Closer to 5 if keys and pointers are small and
cachelines are large, closer to 16 if keys and pointers are large and
cachelines are small.
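
The arithmetic behind that: the number of cachelines you touch is
roughly the height of the tree, i.e. log_fanout(n) = ln(n)/ln(fanout).
For n = 10^6, ln(n) is about 13.8, so a fanout of sqrt(3) (~1.73) gives
13.8/0.55, about 25 levels, a node holding 16 pairs gives 13.8/2.77,
about 5, and a node that only fits two or three pairs (big keys, small
cachelines) ends up in the mid-teens.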

Jörn

--
The key to performance is elegance, not battalions of special cases.
-- Jon Bentley and Doug McIlroy

2009-01-10 21:28:20

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sat, Jan 10, 2009 at 09:23:16PM +0100, Jörn Engel wrote:
>
> Key difference is the number of cachelines you need to find a particular
> entry. rbtrees have a fanout of sqrt(3), so for a million elements (to
> pick a random example) you need about 25 cachelines with rbtrees and
> about 5-16 with btrees. Closer to 5 if keys and pointers are small and
> cachelines are large, closer to 16 if keys and pointers are large and
> cachelines are small.

Three questions.... is the number of cachelines in use going to make a
measurable difference for your use case in the filesystem? If the
operation is going to involve disk access, trying to optimize to
improve cacheline utilization may not be the highest priority thing to
worry about.

If you have a million elements, and assuming each element is but 4
bytes (which seems unlikely; very likely you'd be indexing at least
8-12 bytes of data, right?) we're talking about 4 megabytes of
non-swappable kernel memory. Is that likely to happen in your use
case?

Finally, are b+tree so much better than rbtrees that it would be
worthwhile to just *replace* rbtrees with b+trees? Or is the problem
the overhead issue if the number of entries in an rbtree is relatively
small?

Regards,

- Ted

2009-01-10 22:02:19

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sat, 10 January 2009 16:27:40 -0500, Theodore Tso wrote:
> On Sat, Jan 10, 2009 at 09:23:16PM +0100, Jörn Engel wrote:
> >
> > Key difference is the number of cachelines you need to find a particular
> > entry. rbtrees have a fanout of sqrt(3), so for a million elements (to
> > pick a random example) you need about 25 cachelines with rbtrees and
> > about 5-16 with btrees. Closer to 5 if keys and pointers are small and
> > cachelines are large, closer to 16 if keys and pointers are large and
> > cachelines are small.
>
> Three questions.... is the number of cachelines in use going to make a
> measurable difference for your use case in the filesystem? If the
> operation is going to involve disk access, trying to optimize to
> improve cacheline utilization may not be the highest priority thing to
> worry about.

I don't really expect a big difference, even if the filesystem is
intended for flash, not disks. Other overhead will dominate the
picture. The situation may be different for Johannes, though.

> If you have a million elements, and assuming each element is but 4
> bytes (which seems unlikely; very likely you'd be indexing at least
> 8-12 bytes of data, right?) we're talking about 4 megabytes of
> non-swappable kernel memory. Is that likely to happen in your use
> case?

A million was picked because a) it is easy to calculate with and b) it
is sufficiently (insanely) large to illustrate the effect. It is not
likely at all in my case. With 1000 elements, which is much more
realistic, you can just halve the numbers above.

> Finally, are b+tree so much better than rbtrees that it would be
> worthwhile to just *replace* rbtrees with b+trees? Or is the problem
> the overhead issue if the number of entries in an rbtree is relatively
> small?

Maybe and no. The overhead for near-empty or completely empty trees is
fairly low. At one point in time I had one btree for every indirect
block and every inode in the filesystem. As a result, struct btree_head
contains just two pointers and an int.
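
(On a 64-bit machine that is 2*8 + 4 = 20 bytes, padded to 24, per tree
head, and an empty tree allocates no nodes at all until the first
insertion.)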

One key difference is that rbtrees maintain the tree within objects and
btrees maintain the tree externally. So btrees have to allocate memory
on insertion, where rbtrees have the necessary memory as part of the
object. With mempools the memory allocation should be reasonably safe,
so maybe this is a bit of a red herring now.

Another difference is the locking. The current implementation
completely ignores locking and depends on the callers to serialize
access to the btree.
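
Purely to illustrate the difference from the caller's side (made-up
names, sketched against the API in this patch, not code from any actual
user):

#include <linux/btree.h>
#include <linux/rbtree.h>
#include <linux/spinlock.h>

/* rbtree flavour: the linkage is embedded, so insertion never allocates */
struct my_obj_rb {
	struct rb_node node;		/* lives inside the object itself */
	unsigned long key;
};

/* btree flavour: the object carries no tree linkage at all... */
struct my_obj_bt {
	unsigned long key;
};

/* ...instead btree_insertl() pulls internal nodes from the mempool, and
 * serializing against other users of the tree is up to the caller */
static struct btree_headl my_tree;	/* set up earlier with btree_initl() */
static DEFINE_SPINLOCK(my_lock);

static int my_add(struct my_obj_bt *obj)
{
	int err;

	spin_lock(&my_lock);
	err = btree_insertl(&my_tree, obj->key, obj, GFP_ATOMIC);
	spin_unlock(&my_lock);

	return err;	/* can fail under memory pressure; an rbtree insert cannot */
}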

Keeping all that in mind, I believe many rbtree users could be
converted.

Jörn

--
The competent programmer is fully aware of the strictly limited size of
his own skull; therefore he approaches the programming task in full
humility, and among other things he avoids clever tricks like the plague.
-- Edsger W. Dijkstra

2009-01-10 22:12:47

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

> Finally, are b+tree so much better than rbtrees that it would be
> worthwhile to just *replace* rbtrees with b+trees? Or is the problem

There've been a couple of proposals like this over the years, also
with other data structures like judy trees (which seem to take
the cache line optimization Joern talks about to an even greater extreme).
Splay trees also seem to be a popular suggestion, although they're
considered dubious by other people (including me). Another
alternative would be skip lists.

Also I don't remember there was ever a big discussion about
rbtrees vs other trees -- it was just that Andrea liked
them and added a convenient library at some point and other
people found it convenient too and started using it.

But it's unclear how much all that would really help.

I always thought it might be advantageous to look at a little
more abstract interface for the existing rbtree users (shouldn't
be that hard, the interface is already not too bad) and then just
let some students implement a couple of backend data structures
for that interface and then run some benchmarks.

I don't think it's a good idea to add a b*tree library
and use it only from a few users though. After all it's
a lot of code and that should have a clear benefit.

-Andi
--
[email protected]

2009-01-10 22:24:57

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sat, 10 Jan 2009 23:01:35 +0100 Jörn Engel <[email protected]> wrote:

> One key difference is that rbtrees maintain the tree within objects and
> btrees maintain the tree externally. So btrees have to allocate memory
> on insertion, where rbtrees have the necessary memory as part of the
> object.

This is a major disadvantage of the btrees.

See all the hoops we ended up jumping through with things like
radix_tree_preload() and idr_pre_get().
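
For reference, the preload pattern in question looks roughly like this
(modelled loosely on the page cache insertion path, with the accounting
and error unwinding trimmed):

#include <linux/fs.h>
#include <linux/radix-tree.h>
#include <linux/pagemap.h>

static int cache_insert(struct address_space *mapping, pgoff_t offset,
			struct page *page)
{
	int error;

	error = radix_tree_preload(GFP_KERNEL);	/* may sleep, prefills per-cpu nodes */
	if (error)
		return error;

	spin_lock_irq(&mapping->tree_lock);	/* no allocations allowed from here on */
	error = radix_tree_insert(&mapping->page_tree, offset, page);
	spin_unlock_irq(&mapping->tree_lock);

	radix_tree_preload_end();		/* drops the preemption-off preload state */
	return error;
}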

The IDR code there wasn't very well designed and still has holes. The
radix-tree code afaik is solid, but look at all the stuff it does!

> With mempools the memory allocation should be reasonably safe,
> so maybe this is a bit of a red herring now.

No, mempools won't help, particularly if items are being added from
atomic contexts.
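
(mempool_alloc() can only guarantee forward progress by waiting for an
element to come back to the pool; without __GFP_WAIT it simply fails
once the pool is exhausted, so GFP_ATOMIC callers get no guarantee at
all.)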

This is a major major shortcoming which greatly limits the usefulness
of btrees and which greatly reduces the chances of anyone migrating any
rbtree users over to btrees.

2009-01-10 23:57:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sat, 2009-01-10 at 14:23 -0800, Andrew Morton wrote:
> On Sat, 10 Jan 2009 23:01:35 +0100 Jörn Engel <[email protected]> wrote:
>
> > One key difference is that rbtrees maintain the tree within objects and
> > btrees maintain the tree externally. So btrees have to allocate memory
> > on insertion, where rbtrees have the necessary memory as part of the
> > object.
>
> This is a major disadvantage of the btrees.
>
> See all the hoops we ended up jumping through with things like
> radix_tree_preload() and idr_pre_get().
>
> The IDR code there wasn't very well designed and still has holes. The
> radix-tree code afaik is solid, but look at all the stuff it does!

Yeah, it's a bit of a mess, but solvable, as radix trees show.

B-trees however have one thing over RB-trees: B-trees can be made
RCU-safe whereas RB-trees cannot be -- the only problem is that Joern's
doesn't do that.

I've been poking at my B-tree implementation but got distracted by the
mutex spin stuff; it's still buggy (still the insert sibling overflow --
the rest should be ok-ish, although I need a hard look at the memory
barriers).
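
The reason this works out for B-trees is that lookups only ever walk
downwards, so an update can build modified copies of the affected nodes
and publish them with rcu_assign_pointer(), while rbtree rebalancing
rewrites several nodes in place right under a lockless reader. With the
implementation below the reader side would then be roughly (my_root and
obj being made-up names):

	rcu_read_lock();
	obj = btree_lookup(&my_root, index);	/* descends via rcu_dereference() */
	rcu_read_unlock();

with writers still serializing among themselves outside the library.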

---
Subject: lib: RCU friendly B+tree
From: Peter Zijlstra <[email protected]>
Date: Mon Jan 05 14:23:17 CET 2009

A RCU friendly B+tree

Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/btree.h | 150 ++++++
init/main.c | 2
lib/Makefile | 3
lib/btree.c | 1228 ++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 1382 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/btree.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/btree.h
@@ -0,0 +1,150 @@
+#ifndef _LINUX_BTREE_H
+#define _LINUX_BTREE_H
+
+#define BTREE_DEBUG 1
+
+#include <linux/log2.h>
+#include <linux/rcupdate.h>
+#include <linux/bitops.h>
+#include <linux/cache.h>
+
+struct btree_item {
+ unsigned long key;
+ void *data;
+};
+
+#define BTREE_ITEM_CACHELINE (L1_CACHE_BYTES / sizeof(struct btree_item))
+#define BTREE_ITEM_MAX (BTREE_ITEM_CACHELINE < 4 ? 4 : BTREE_ITEM_CACHELINE)
+#define BTREE_ITEM_HALF (BTREE_ITEM_MAX/2)
+
+struct btree_node {
+ struct btree_item item[BTREE_ITEM_MAX];
+};
+
+/*
+ * Max depth of the tree.
+ * Assumes BTREE_ITEM_MAX is power of two.
+ */
+#define BTREE_MAX_DEPTH \
+ (1 + DIV_ROUND_UP(BITS_PER_LONG, ilog2(BTREE_ITEM_MAX)))
+
+/*
+ * Max number of nodes to be RCUed per modification.
+ */
+#define BTREE_NODE_REPLACE (1 + 2 * BTREE_MAX_DEPTH)
+
+struct btree_vec {
+ struct rcu_head rcu_head;
+ int pos;
+ void *slot[BTREE_NODE_REPLACE];
+};
+
+struct btree_root {
+ struct btree_node *root;
+ int height;
+ struct btree_vec *allocvec;
+ struct btree_vec *freevec;
+ gfp_t gfp_mask;
+ void (*flush)(struct btree_vec *);
+};
+
+#ifdef BTREE_DEBUG
+void btree_print(struct btree_root *);
+#else
+#define btree_print(root) do { } while (0)
+#endif
+
+static inline void *btree_item_deref(struct btree_item *itemp)
+{
+ void *ptr = rcu_dereference(itemp->data);
+ __clear_bit(0, (void *)&ptr);
+ return ptr;
+}
+
+static inline unsigned long btree_item_key(struct btree_item *itemp)
+{
+ unsigned long key = itemp->key;
+ /*
+ * match the smp_wmb() in btree_item_key_set()
+ */
+ smp_rmb();
+ return key;
+}
+
+extern struct btree_item *btree_search_next(struct btree_root *root,
+ unsigned long index, struct btree_item **nextp);
+
+static inline void *btree_lookup(struct btree_root *root, unsigned long index)
+{
+ struct btree_item *item;
+
+ item = btree_search_next(root, index, NULL);
+ if (!item || btree_item_key(item) != index)
+ return NULL;
+
+ return btree_item_deref(item);
+}
+
+static inline void *btree_lookup_next(struct btree_root *root,
+ unsigned long index, void **nextp)
+{
+ struct btree_item *next = NULL, *item;
+
+ item = btree_search_next(root, index, &next);
+ if (next)
+ *nextp = btree_item_deref(next);
+ if (!item || btree_item_key(item) != index)
+ return NULL;
+
+ return btree_item_deref(item);
+}
+
+static inline void *btree_stab(struct btree_root *root, unsigned long index)
+{
+ struct btree_item *item;
+
+ item = btree_search_next(root, index, NULL);
+ if (!item)
+ return NULL;
+
+ return btree_item_deref(item);
+}
+
+static inline void *btree_stab_next(struct btree_root *root,
+ unsigned long index, void **nextp)
+{
+ struct btree_item *next = NULL, *item;
+
+ item = btree_search_next(root, index, &next);
+ if (next)
+ *nextp = btree_item_deref(next);
+ if (!item)
+ return NULL;
+
+ return btree_item_deref(item);
+}
+
+extern int btree_preload(struct btree_root *, gfp_t);
+extern int btree_insert(struct btree_root *, unsigned long, void *);
+extern int btree_update(struct btree_root *, unsigned long, unsigned long);
+extern void *btree_remove(struct btree_root *, unsigned long);
+
+extern void btree_root_init(struct btree_root *, gfp_t);
+extern void btree_root_destroy(struct btree_root *);
+
+extern void btree_vec_flush(struct btree_vec *);
+
+#define BTREE_INIT_FLUSH(gfp, f) (struct btree_root){ \
+ .root = NULL, \
+ .height = 0, \
+ .allocvec = NULL, \
+ .freevec = NULL, \
+ .gfp_mask = (gfp), \
+ .flush = (f), \
+}
+
+#define BTREE_INIT(gfp) BTREE_INIT_FLUSH(gfp, btree_vec_flush)
+
+extern void __init btree_init(void);
+
+#endif /* _LINUX_BTREE_H */
Index: linux-2.6/lib/Makefile
===================================================================
--- linux-2.6.orig/lib/Makefile
+++ linux-2.6/lib/Makefile
@@ -11,7 +11,8 @@ lib-y := ctype.o string.o vsprintf.o cmd
rbtree.o radix-tree.o dump_stack.o \
idr.o int_sqrt.o extable.o prio_tree.o \
sha1.o irq_regs.o reciprocal_div.o argv_split.o \
- proportions.o prio_heap.o ratelimit.o show_mem.o is_single_threaded.o
+ proportions.o prio_heap.o ratelimit.o show_mem.o \
+ is_single_threaded.o btree.o

lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o
Index: linux-2.6/lib/btree.c
===================================================================
--- /dev/null
+++ linux-2.6/lib/btree.c
@@ -0,0 +1,1228 @@
+/*
+ * Copyright 2007-2009, Red Hat Inc. Peter Zijlstra <[email protected]>
+ * GPLv2
+ *
+ * Something that started out as an RCU friendly B+tree.
+ *
+ * Contrary to most B-trees the nodes have an even number of slots. This can
+ * be done because we carry all the children in the leafs and thus we don't
+ * need to push one up on a split. We can just duplicate the first.
+ *
+ * The inner nodes are basically an N-way space partition.
+ *
+ * Another difference from the typical B+tree is that this implementation does
+ * not thread the leaves; this is left up to the user if so desired.
+ *
+ *
+ * [------------------------------>[------------------------------>
+ * | |
+ * [------------------->[--------->[-------------->[-------------->
+ * | | | |
+ * [..) [..) [..) [..) [..) [..) [..) [..) [..) [..) [..) [..)
+ *
+ *
+ */
+#include <linux/btree.h>
+#include <linux/slab.h>
+#include <linux/log2.h>
+#include <linux/err.h>
+
+#ifdef BTREE_DEBUG
+static void btree_print(struct btree_root *root);
+static int btree_validate(struct btree_root *root);
+#else
+#define btree_print(root) do { } while (0)
+#define btree_validate(root) (0)
+#endif
+
+static struct kmem_cache *btree_cachep;
+
+static inline void btree_free(struct btree_root *root, void *ptr)
+{
+ root->freevec->slot[root->freevec->pos++] = ptr;
+}
+
+static inline void btree_ptr_flip(struct btree_root *root,
+ void **nodep, void *new)
+{
+ void *old = rcu_dereference(*nodep);
+ rcu_assign_pointer(*nodep, new);
+ __clear_bit(0, (void *)&old);
+ btree_free(root, old);
+}
+
+/*
+ * node pointers have bit 0 set.
+ */
+static inline int btree_item_node(struct btree_item *itemp)
+{
+ return test_bit(0, (void *)&itemp->data);
+}
+
+static inline struct btree_node *btree_node_ptr(struct btree_node *nodep)
+{
+ __set_bit(0, (void *)&nodep);
+ return nodep;
+}
+
+static inline void btree_item_key_set(struct btree_item *itemp,
+ unsigned long index)
+{
+ /*
+ * we can only tighten when the new node is linked in
+ */
+ smp_wmb();
+ itemp->key = index;
+}
+
+static void btree_item_flip(struct btree_root *root,
+ struct btree_item *itemp, struct btree_node *new)
+{
+ unsigned long index;
+
+ btree_ptr_flip(root, &itemp->data, btree_node_ptr(new));
+
+ /* possibly tighten */
+ index = btree_item_key(&new->item[0]);
+ if (btree_item_key(itemp) != index)
+ btree_item_key_set(itemp, index);
+}
+
+static inline void btree_item_copy(void *dst, void *src, int nr)
+{
+ memcpy(dst, src, nr*sizeof(struct btree_item));
+}
+
+static inline int btree_item_count(struct btree_node *node)
+{
+ int j;
+
+ for (j = 0; j < BTREE_ITEM_MAX &&
+ node->item[j].data ; j++)
+ ;
+
+ return j;
+}
+
+static struct btree_node *btree_node_alloc(struct btree_root *root)
+{
+ struct btree_node *node;
+ int pos = --root->allocvec->pos;
+
+ BUG_ON(pos < 0);
+
+ node = root->allocvec->slot[pos];
+
+ BUG_ON(!node);
+
+ return node;
+}
+
+static inline void btree_item_free(struct btree_root *root,
+ struct btree_item *itemp)
+{
+ if (btree_item_node(itemp))
+ btree_free(root, btree_item_deref(itemp));
+}
+
+static int btree_vec_alloc(struct btree_vec *vec, gfp_t gfp_mask, int nr)
+{
+ BUG_ON(nr < 0 || nr > BTREE_NODE_REPLACE);
+
+ while (vec->pos < nr) {
+ struct btree_node *node =
+ kmem_cache_alloc(btree_cachep, gfp_mask | __GFP_ZERO);
+
+ if (!node)
+ return -ENOMEM;
+
+ vec->slot[vec->pos++] = node;
+ }
+
+ return 0;
+}
+
+/*
+ * node storage; expose an interface to claim all the needed memory in
+ * advance. This avoids getting stuck in a situation where going back is as
+ * impossible as going forward.
+ */
+int btree_preload(struct btree_root *root, gfp_t gfp_mask)
+{
+ int ret = 0;
+
+ if (!root->allocvec) {
+ root->allocvec = kzalloc(sizeof(struct btree_vec), gfp_mask);
+ if (!root->allocvec)
+ return -ENOMEM;
+ }
+
+ if (!root->freevec) {
+ root->freevec = kzalloc(sizeof(struct btree_vec), gfp_mask);
+ if (!root->freevec)
+ return -ENOMEM;
+ }
+
+	ret = btree_vec_alloc(root->allocvec, gfp_mask, 1+2*root->height);
+
+ return ret;
+}
+
+static int btree_mod_init(struct btree_root *root)
+{
+ return btree_preload(root, root->gfp_mask);
+}
+
+static int btree_mod_finish(struct btree_root *root, unsigned long index,
+ char *op)
+{
+ int ret = 0;
+
+ if (btree_validate(root)) {
+ printk(KERN_DEBUG "modified(%s): %lu\n", op, index);
+ btree_print(root);
+ BUG();
+ }
+
+ root->flush(root->freevec);
+ root->freevec = NULL;
+
+ return ret;
+}
+
+/*
+ * stack, to store traversal path.
+ */
+struct btree_path {
+ struct btree_node *node;
+ int offset;
+};
+
+struct btree_stack {
+ int p;
+ struct btree_path path[BTREE_MAX_DEPTH];
+};
+
+static inline void btree_stack_init(struct btree_stack *stack)
+{
+ memset(stack, 0, sizeof(*stack));
+}
+
+static inline void btree_stack_push(struct btree_stack *stack,
+ struct btree_node *node, int offset)
+{
+ stack->path[stack->p].node = node;
+ stack->path[stack->p].offset = offset;
+ stack->p++;
+
+ BUG_ON(stack->p > ARRAY_SIZE(stack->path));
+}
+
+static inline void btree_stack_pop(struct btree_stack *stack,
+ struct btree_node **nodep, int *offsetp)
+{
+ stack->p--;
+ *nodep = stack->path[stack->p].node;
+ *offsetp = stack->path[stack->p].offset;
+
+ BUG_ON(stack->p < 0);
+}
+
+static inline int btree_stack_size(struct btree_stack *stack)
+{
+ return stack->p;
+}
+
+static struct btree_node *btree_stack_peek_left(struct btree_stack *stack)
+{
+ struct btree_node *node;
+ struct btree_item *item;
+ int offset;
+
+ if (!btree_stack_size(stack))
+ return NULL;
+
+ node = stack->path[stack->p - 1].node;
+ offset = stack->path[stack->p - 1].offset;
+
+ if (offset == 0)
+ return NULL;
+
+ offset--;
+ item = &node->item[offset];
+ return btree_item_deref(item);
+}
+
+static struct btree_node *btree_stack_peek_right(struct btree_stack *stack)
+{
+ struct btree_node *node;
+ struct btree_item *item;
+ int offset;
+
+ if (!btree_stack_size(stack))
+ return NULL;
+
+ node = stack->path[stack->p - 1].node;
+ offset = stack->path[stack->p - 1].offset;
+
+ offset++;
+ if (offset == BTREE_ITEM_MAX)
+ return NULL;
+
+ item = &node->item[offset];
+ return btree_item_deref(item);
+}
+
+/* ------ search ------- */
+
+/*
+ * Since we can re-adjust the space partitioning in delete and update, a
+ * lockless lookup might hit a wrong branch once in a while, hence we might
+ * need to backtrack.
+ *
+ * Because we search for a less-than-or-equal index we can only backtrack
+ * to the left. So the split can be off to the left but never to the right.
+ * Hence we need to move splits left on descent and right on ascent of the
+ * tree. See update and remove.
+ *
+ * NOTE the backtrack should be limited to one entry so the worst-case time
+ * is: O(2*log(n)-1)
+ */
+
+/*
+ * find which branch to take
+ */
+static inline
+int __btree_item_search(struct btree_root *root,
+ struct btree_node *node, unsigned long index)
+{
+ int i;
+
+ for (i = 1; i < BTREE_ITEM_MAX; i++)
+ if (node->item[i].data == NULL ||
+ index < btree_item_key(&node->item[i]))
+ break;
+ i--;
+ return i;
+}
+
+/*
+ * find item and the path to it.
+ */
+static struct btree_item *__btree_search(struct btree_root *root,
+ struct btree_stack *stack, unsigned long index)
+{
+ struct btree_node *node = rcu_dereference(root->root);
+ struct btree_item *item;
+ int offset;
+
+ btree_stack_init(stack);
+
+ if (!node)
+ return NULL;
+
+ do {
+ /*
+ * if the split was too low, back-track and take the previous
+ * branch
+ */
+ if (unlikely(btree_item_key(&node->item[0]) > index)) {
+ if (!btree_stack_size(stack))
+ return NULL;
+
+ do {
+ btree_stack_pop(stack, &node, &offset);
+ } while (btree_stack_size(stack) && offset == 0);
+
+ if (!btree_stack_size(stack) && offset == 0)
+ return NULL;
+
+ offset--;
+ } else
+ offset = __btree_item_search(root, node, index);
+
+ btree_stack_push(stack, node, offset);
+
+ item = &node->item[offset];
+ if (!btree_item_node(item))
+ break;
+
+ node = btree_item_deref(item);
+ } while (node);
+
+ return item;
+}
+
+/*
+ * find the left-most item in a sub-tree, and the path to it.
+ */
+static struct btree_item *__btree_find_first(struct btree_root *root,
+ struct btree_stack *stack, struct btree_node *node, int offset)
+{
+ struct btree_item *item;
+
+ do {
+ btree_stack_push(stack, node, offset);
+
+ item = &node->item[offset];
+ if (!btree_item_node(item))
+ break;
+
+ node = btree_item_deref(item);
+ offset = 0;
+ } while (node);
+
+ return item;
+}
+
+/*
+ * find the item with a lesser or equal index
+ * and optionally the next item.
+ */
+struct btree_item *btree_search_next(struct btree_root *root,
+ unsigned long index, struct btree_item **nextp)
+{
+ struct btree_stack stack;
+ struct btree_node *node;
+ struct btree_item *item, *next;
+ int offset;
+
+ item = __btree_search(root, &stack, index);
+
+ if (item && nextp) {
+ do {
+ btree_stack_pop(&stack, &node, &offset);
+ } while (btree_stack_size(&stack) &&
+ offset == BTREE_ITEM_MAX-1);
+
+ if (offset != BTREE_ITEM_MAX-1) {
+ offset++;
+ next = __btree_find_first(root, &stack, node, offset);
+ } else
+ next = NULL;
+
+ *nextp = next;
+ } else if (nextp) {
+ node = rcu_dereference(root->root);
+ offset = 0;
+
+ if (node) {
+ next = __btree_find_first(root, &stack, node, offset);
+ *nextp = next;
+ }
+ }
+
+ return item;
+}
+
+/* ------ insert ------- */
+
+/*
+ * insert item; split nodes when needed
+ */
+int btree_insert(struct btree_root *root,
+ unsigned long index, void *data)
+{
+ struct btree_stack stack;
+ struct btree_node *node = NULL,
+ *old_node,
+ *update,
+ *new = NULL,
+ *split = NULL,
+ *uleft,
+ *uright,
+ *left = NULL,
+ *right = NULL;
+ struct btree_item *item;
+ struct btree_item nitem = {
+ .key = index,
+ .data = data,
+ };
+ unsigned long key;
+ int i, j, n, ret = 0;
+
+ ret = btree_mod_init(root);
+ if (ret)
+ return ret;
+
+ item = __btree_search(root, &stack, index);
+ if (!item) {
+ if (!root->root) {
+ root->root = btree_node_alloc(root);
+ root->height = 1;
+ btree_stack_push(&stack, root->root, 0);
+ } else
+ __btree_find_first(root, &stack, root->root, 0);
+ } else if (btree_item_key(item) == index) {
+ ret = -EEXIST;
+ goto out;
+ }
+
+ do {
+ old_node = node;
+ btree_stack_pop(&stack, &node, &i);
+ item = &node->item[i];
+
+ update = NULL;
+ if (btree_item_node(item)) {
+ /* no change */
+ if (!new && !split && !left && !right) {
+ BUG_ON(btree_item_deref(item) != old_node);
+
+ /* possibly tighten the split on our way back up */
+ key = btree_item_key(&old_node->item[0]);
+ if (btree_item_key(item) != key)
+ btree_item_key_set(item, key);
+ continue;
+ }
+
+ /* single change */
+ if (new && !split && !left && !right) {
+ btree_item_flip(root, item, new);
+ new = NULL;
+ continue;
+ }
+
+ /* merge left */
+ if (new && !split && left && !right) {
+ update = btree_node_alloc(root);
+ btree_item_copy(update->item, node->item,
+ BTREE_ITEM_MAX);
+ update->item[i-1] = (struct btree_item){
+ .key = btree_item_key(&left->item[0]),
+ .data = btree_node_ptr(left),
+ };
+ update->item[i] = (struct btree_item){
+ .key = btree_item_key(&new->item[0]),
+ .data = btree_node_ptr(new),
+ };
+ btree_item_free(root, &node->item[i-1]);
+ btree_item_free(root, &node->item[i]);
+
+ left = NULL;
+ new = update;
+ continue;
+ }
+
+ /* merge right */
+ if (new && !split && !left && right) {
+ update = btree_node_alloc(root);
+ btree_item_copy(update->item, node->item,
+ BTREE_ITEM_MAX);
+ update->item[i] = (struct btree_item){
+ .key = btree_item_key(&new->item[0]),
+ .data = btree_node_ptr(new),
+ };
+ update->item[i+1] = (struct btree_item){
+ .key = btree_item_key(&right->item[0]),
+ .data = btree_node_ptr(right),
+ };
+ btree_item_free(root, &node->item[i]);
+ btree_item_free(root, &node->item[i+1]);
+
+ new = update;
+ right = NULL;
+ continue;
+ }
+
+ /* split */
+ BUG_ON(left || right);
+ update = new;
+ }
+
+ if (btree_item_deref(item) &&
+ btree_item_key(item) < nitem.key)
+ i++;
+
+ split = NULL;
+ left = NULL;
+ right = NULL;
+ new = btree_node_alloc(root);
+ /* insert has room */
+ if (node->item[BTREE_ITEM_MAX-1].data == NULL) {
+ btree_item_copy(new->item, node->item, i);
+ new->item[i] = nitem;
+ btree_item_copy(new->item + i + 1, node->item + i,
+ BTREE_ITEM_MAX - i - 1);
+
+ if (update)
+ btree_item_flip(root, &new->item[i-1], update);
+
+ continue;
+ }
+
+ /* insert overflows */
+
+ /* see if left sibling will take some overflow */
+ uleft = btree_stack_peek_left(&stack);
+ if (uleft && uleft->item[BTREE_ITEM_MAX-1].data == NULL) {
+
+ j = btree_item_count(uleft);
+ n = (BTREE_ITEM_MAX - j + 1)/2;
+
+ left = btree_node_alloc(root);
+ /*
+ * LLLLl... CCCCcccc
+ * LLLLlC*. CCCcccc.
+ *
+ * LLLLl... CCCCcccc
+ * LLLLlCC. CCccc*c.
+ */
+ btree_item_copy(left->item, uleft->item, j);
+ if (i < n) {
+ btree_item_copy(left->item+j, node->item, i);
+ left->item[j+i] = nitem;
+ btree_item_copy(left->item+j+i+1, node->item+i,
+ n-i-1);
+ btree_item_copy(new->item, node->item+n-1,
+ BTREE_ITEM_MAX-n+1);
+ } else {
+ btree_item_copy(left->item+j, node->item, n);
+ btree_item_copy(new->item, node->item+n, i-n);
+ new->item[i-n] = nitem;
+ btree_item_copy(new->item+i-n+1, node->item+i,
+ BTREE_ITEM_MAX-i+n-1);
+ }
+
+ if (update) {
+ if (i-1 < n) {
+ btree_item_flip(root,
+ &left->item[j+i-1],
+ update);
+ } else {
+ btree_item_flip(root, &new->item[i-n-1],
+ update);
+ }
+ }
+
+ continue;
+ }
+
+ /* see if right sibling will take some overflow */
+ uright = btree_stack_peek_right(&stack);
+ if (uright && uright->item[BTREE_ITEM_MAX-1].data == NULL) {
+
+ j = btree_item_count(uright);
+ n = (BTREE_ITEM_MAX - j + 1)/2;
+
+ right = btree_node_alloc(root);
+ /*
+ * CCCCcccc RRRR....
+ * CCCC*cc. ccRRRR..
+ *
+ * CCCCcccc RRRR....
+ * CCCCccc. *cRRRR..
+ */
+ if (i <= BTREE_ITEM_MAX-n) {
+ btree_item_copy(new->item, node->item, i);
+ new->item[i] = nitem;
+ btree_item_copy(new->item+i+1, node->item+i,
+ BTREE_ITEM_MAX-n-i);
+ btree_item_copy(right->item,
+ node->item+BTREE_ITEM_MAX-n, n);
+ btree_item_copy(right->item+n, uright->item, j);
+ } else {
+ btree_item_copy(new->item, node->item,
+ BTREE_ITEM_MAX-n);
+
+ btree_item_copy(right->item,
+ node->item+(BTREE_ITEM_MAX-n),
+ i-(BTREE_ITEM_MAX-n));
+ right->item[i-(BTREE_ITEM_MAX-n)] = nitem;
+ btree_item_copy(right->item+i-(BTREE_ITEM_MAX-n)+1,
+ node->item+i, BTREE_ITEM_MAX-i);
+ btree_item_copy(right->item+n+1, uright->item,
+ j);
+ }
+
+ if (update) {
+ if (i-1 <= BTREE_ITEM_MAX-n) {
+ btree_item_flip(root, &new->item[i-1],
+ update);
+ } else {
+ btree_item_flip(root,
+					&right->item[i-1-(BTREE_ITEM_MAX-n)],
+ update);
+ }
+ }
+
+ continue;
+ }
+
+ /* split node */
+ split = btree_node_alloc(root);
+ if (i < BTREE_ITEM_HALF) {
+ btree_item_copy(new->item, node->item, i);
+ new->item[i] = nitem;
+ btree_item_copy(new->item + i + 1, node->item + i,
+ BTREE_ITEM_HALF - i);
+ btree_item_copy(split->item,
+ node->item + BTREE_ITEM_HALF,
+ BTREE_ITEM_HALF);
+ } else {
+ btree_item_copy(new->item, node->item,
+ BTREE_ITEM_HALF);
+ btree_item_copy(split->item,
+ node->item + BTREE_ITEM_HALF,
+ i - BTREE_ITEM_HALF);
+ split->item[i - BTREE_ITEM_HALF] = nitem;
+ btree_item_copy(split->item + i - BTREE_ITEM_HALF + 1,
+ node->item + i,
+ BTREE_ITEM_MAX - i);
+ }
+
+ if (update) {
+ if (i-1 < BTREE_ITEM_HALF)
+ btree_item_flip(root, &new->item[i-1], update);
+ else
+ btree_item_flip(root,
+ &split->item[i-1-BTREE_ITEM_HALF],
+ update);
+ }
+
+ nitem.key = split->item[0].key;
+ nitem.data = btree_node_ptr(split);
+ } while (btree_stack_size(&stack));
+
+ if (new) {
+ if (!split) {
+ btree_ptr_flip(root, (void **)&root->root, new);
+ } else {
+ update = btree_node_alloc(root);
+ update->item[0] = (struct btree_item){
+ .key = new->item[0].key,
+ .data = btree_node_ptr(new),
+ };
+ update->item[1] = (struct btree_item){
+ .key = split->item[0].key,
+ .data = btree_node_ptr(split),
+ };
+ btree_ptr_flip(root, (void **)&root->root, update);
+ root->height++;
+ }
+ }
+
+out:
+ btree_mod_finish(root, index, "insert");
+ return ret;
+}
+
+/* ------ update ------- */
+
+/*
+ * update the index of an item
+ *
+ * be careful: the range
+ * [min(index, new_index), max(index, new_index)]
+ * _MUST_ be free; otherwise remove and re-insert.
+ *
+ * see the search comment for lockless lookup constraints
+ */
+int btree_update(struct btree_root *root,
+ unsigned long index, unsigned long new_index)
+{
+ struct btree_stack stack;
+ struct btree_node *node = rcu_dereference(root->root);
+ struct btree_item *item;
+ unsigned long key;
+ int offset, ret = 0;
+
+ if (index == new_index)
+ return 0;
+
+ if (!node)
+ return -EINVAL;
+
+ btree_stack_init(&stack);
+
+ do {
+ offset = __btree_item_search(root, node, index);
+ btree_stack_push(&stack, node, offset);
+ item = &node->item[offset];
+ if (btree_item_node(item)) {
+ /* loosen downwards */
+ if (index > new_index && btree_item_key(item) == index)
+ btree_item_key_set(item, new_index);
+ node = btree_item_deref(item);
+ } else
+ node = NULL;
+ } while (node);
+
+ if (btree_item_key(item) == index)
+ btree_item_key_set(item, new_index);
+ else
+ ret = -EINVAL;
+
+ do {
+ btree_stack_pop(&stack, &node, &offset);
+ item = &node->item[offset];
+ if (!btree_item_node(item))
+ continue;
+
+ key = btree_item_key(item);
+ /* undo on error */
+ if (ret && index > new_index && key == new_index)
+ btree_item_key_set(item, index);
+ /* tighten upwards */
+ if (!ret && index < new_index && key == index)
+ btree_item_key_set(item, new_index);
+ } while (btree_stack_size(&stack));
+
+ if (btree_validate(root)) {
+ printk(KERN_DEBUG "btree_update: index: %lu new_index: %lu\n",
+ index, new_index);
+ btree_print(root);
+ BUG();
+ }
+
+	return ret;
+}
+
+/* ------ replace ------- */
+
+void *btree_replace(struct btree_root *root,
+ unsigned long index, void *new_data)
+{
+ struct btree_stack stack;
+ struct btree_item *item;
+
+ item = __btree_search(root, &stack, index);
+ if (item && btree_item_key(item) == index) {
+ void *data = btree_item_deref(item);
+ rcu_assign_pointer(item->data, new_data);
+ return data;
+ }
+
+ return ERR_PTR(-EINVAL);
+}
+
+/* ------ delete ------- */
+
+/*
+ * delete an item; borrow items or merge nodes when needed.
+ */
+void *btree_remove(struct btree_root *root, unsigned long index)
+{
+ struct btree_stack stack;
+ struct btree_node *node = NULL,
+ *old_node,
+ *left = NULL,
+ *new = NULL,
+ *right = NULL,
+ *update,
+ *uleft,
+ *uright;
+ struct btree_item *item;
+ void *ret = NULL;
+ unsigned long key;
+ int i, j, n;
+ int err;
+
+ if (!root->root)
+ return ERR_PTR(-EINVAL);
+
+ err = btree_mod_init(root);
+ if (err)
+ return ERR_PTR(err);
+
+ item = __btree_search(root, &stack, index);
+ if (!item || btree_item_key(item) != index) {
+ ret = ERR_PTR(-EINVAL);
+ goto out;
+ } else
+ ret = item->data;
+
+ do {
+ old_node = node;
+ btree_stack_pop(&stack, &node, &i);
+ item = &node->item[i];
+
+ update = NULL;
+ if (btree_item_node(item)) {
+ /* no change */
+ if (!new && !left && !right) {
+ BUG_ON(btree_item_deref(item) != old_node);
+
+ /* tighten the split on our way back up */
+ key = btree_item_key(&old_node->item[0]);
+ if (btree_item_key(item) != key)
+ btree_item_key_set(item, key);
+ continue;
+ }
+
+ /* single change */
+ if (!left && new && !right) {
+ btree_item_flip(root, &node->item[i], new);
+ new = NULL;
+ continue;
+ }
+
+ /* borrowed left */
+ if (left && new && !right) {
+ update = btree_node_alloc(root);
+ btree_item_copy(update->item, node->item,
+ BTREE_ITEM_MAX);
+ update->item[i-1] = (struct btree_item){
+ .key = btree_item_key(&left->item[0]),
+ .data = btree_node_ptr(left),
+ };
+ update->item[i] = (struct btree_item){
+ .key = btree_item_key(&new->item[0]),
+ .data = btree_node_ptr(new),
+ };
+ btree_item_free(root, &node->item[i-1]);
+ btree_item_free(root, &node->item[i]);
+
+ left = NULL;
+ new = update;
+ continue;
+ }
+
+ /* borrowed right */
+ if (!left && new && right) {
+ update = btree_node_alloc(root);
+ btree_item_copy(update->item, node->item,
+ BTREE_ITEM_MAX);
+ update->item[i] = (struct btree_item){
+ .key = btree_item_key(&new->item[0]),
+ .data = btree_node_ptr(new),
+ };
+ update->item[i+1] = (struct btree_item){
+ .key = btree_item_key(&right->item[0]),
+ .data = btree_node_ptr(right),
+ };
+ btree_item_free(root, &node->item[i]);
+ btree_item_free(root, &node->item[i+1]);
+
+ new = update;
+ right = NULL;
+ continue;
+ }
+
+ /* merged left */
+ if (left && !new && !right)
+ update = left;
+ /* merged right */
+ else if (!left && !new && right) {
+ update = right;
+ i++;
+ }
+ /* impossible */
+ else {
+ printk(KERN_ERR "%p %p %p\n", left, new, right);
+ BUG();
+ }
+ }
+
+ /* delete */
+ left = NULL;
+ right = NULL;
+ new = btree_node_alloc(root);
+ if (node->item[BTREE_ITEM_HALF].data) {
+delete_one:
+ /*
+ * CCCxcc..
+ * CCCcc...
+ */
+ btree_item_copy(new->item, node->item, i);
+ btree_item_copy(new->item+i, node->item+i+1,
+ BTREE_ITEM_MAX-i-1);
+
+ if (update)
+ btree_item_flip(root, &new->item[i-1], update);
+
+ btree_item_free(root, &node->item[i]);
+ continue;
+ }
+
+ /* delete underflows */
+
+ /* borrow left */
+ uleft = btree_stack_peek_left(&stack);
+ if (uleft && uleft->item[BTREE_ITEM_HALF].data) {
+ left = btree_node_alloc(root);
+
+ j = btree_item_count(uleft);
+ n = (j-BTREE_ITEM_HALF+1)/2;
+
+ /*
+ * LLLLlll. CxCC....
+ * LLLLl... llCCC...
+ */
+ btree_item_copy(left->item, uleft->item, (j-n));
+ btree_item_copy(new->item, uleft->item+(j-n), n);
+ btree_item_copy(new->item+n, node->item, i);
+ btree_item_copy(new->item+n+i, node->item+i+1,
+ BTREE_ITEM_HALF-i);
+
+ if (update)
+ btree_item_flip(root, &new->item[n+i-1],
+ update);
+
+ btree_item_free(root, &node->item[i]);
+
+ continue;
+ }
+
+ /* borrow right */
+ uright = btree_stack_peek_right(&stack);
+ if (uright && uright->item[BTREE_ITEM_HALF].data) {
+ right = btree_node_alloc(root);
+
+ j = btree_item_count(uright);
+ n = (j-BTREE_ITEM_HALF+1)/2;
+
+ /*
+ * CCxC.... RRRRrrr.
+ * CCCRR... RRrrr...
+ */
+ btree_item_copy(new->item, node->item, i);
+ btree_item_copy(new->item+i, node->item+i+1,
+ BTREE_ITEM_HALF-i-1);
+ btree_item_copy(new->item+BTREE_ITEM_HALF-1,
+ uright->item, n);
+ btree_item_copy(right->item, uright->item+n,
+ BTREE_ITEM_MAX-n);
+
+ if (update)
+ btree_item_flip(root, &new->item[i-1], update);
+
+ btree_item_free(root, &node->item[i]);
+
+ continue;
+ }
+
+ /* merge left */
+ if (uleft) {
+ /*
+ * LLLL.... CCCxc...
+ * LLLLCCCc
+ */
+ btree_item_copy(new->item, uleft->item,
+ BTREE_ITEM_HALF);
+ btree_item_copy(new->item+BTREE_ITEM_HALF,
+ node->item, i);
+ btree_item_copy(new->item+BTREE_ITEM_HALF+i,
+ node->item+i+1, BTREE_ITEM_HALF-i-1);
+
+ if (update)
+ btree_item_flip(root,
+ &new->item[BTREE_ITEM_HALF+i-1],
+ update);
+
+ btree_item_free(root, &node->item[i]);
+
+ left = new;
+ new = NULL;
+ continue;
+ }
+
+ /* merge right */
+ if (uright) {
+ /*
+ * CCxCc... RRRR....
+ * CCCcRRRR
+ */
+ btree_item_copy(new->item, node->item, i);
+ btree_item_copy(new->item+i, node->item+i+1,
+ BTREE_ITEM_HALF-i-1);
+ btree_item_copy(new->item+BTREE_ITEM_HALF-1,
+ uright->item, BTREE_ITEM_HALF);
+
+ if (update)
+ btree_item_flip(root, &new->item[i-1], update);
+
+ btree_item_free(root, &node->item[i]);
+
+ /*
+ * use right to carry the new node to avoid having
+ * the same pattern as single change
+ */
+ right = new;
+ new = NULL;
+ continue;
+ }
+
+ /* only the root may underflow */
+ BUG_ON(root->root != node);
+ goto delete_one;
+
+ } while (btree_stack_size(&stack));
+
+ BUG_ON(left || right);
+
+ if (new)
+ btree_ptr_flip(root, (void **)&root->root, new);
+
+ if (!root->root->item[1].data &&
+ btree_item_node(&root->root->item[0])) {
+ btree_ptr_flip(root, (void **)&root->root,
+ btree_item_deref(&root->root->item[0]));
+ root->height--;
+ }
+
+ if (!root->root->item[0].data) {
+ btree_ptr_flip(root, (void **)&root->root, NULL);
+ root->height = 0;
+ }
+
+out:
+ btree_mod_finish(root, index, "remove");
+ return ret;
+}
+
+/* ------ */
+
+void btree_root_init(struct btree_root *root, gfp_t gfp_mask)
+{
+ *root = BTREE_INIT(gfp_mask);
+}
+
+void btree_vec_flush(struct btree_vec *freevec)
+{
+ int i;
+ for (i = 0; i < freevec->pos; i++)
+ kmem_cache_free(btree_cachep, freevec->slot[i]);
+ kfree(freevec);
+}
+
+void btree_root_destroy(struct btree_root *root)
+{
+ /* assume the tree is empty */
+ BUG_ON(root->height);
+ BUG_ON(root->freevec);
+ if (root->allocvec)
+ btree_vec_flush(root->allocvec);
+}
+
+void __init btree_init(void)
+{
+ btree_cachep = KMEM_CACHE(btree_node, SLAB_HWCACHE_ALIGN);
+}
+
+#ifdef BTREE_DEBUG
+static void __btree_node_print(struct btree_root *root,
+ struct btree_node *node, int recurse)
+{
+ int i, j;
+ for (i = 0; i < BTREE_ITEM_MAX; i++) {
+ if (node->item[i].data) {
+ printk(KERN_DEBUG);
+ for (j = 0; j < recurse; j++)
+ printk(" ");
+ if (!btree_item_node(&node->item[i])) {
+
+ printk(KERN_DEBUG "-> leaf: %p, item: %d, key: %lu, data: %p\n",
+ node, i, node->item[i].key, node->item[i].data);
+
+ } else {
+
+ printk(KERN_DEBUG "node: %p, item: %d, key: %lu, child: %p\n",
+ node, i, node->item[i].key, btree_item_deref(&node->item[i]));
+ if (recurse)
+ __btree_node_print(root, btree_item_deref(&node->item[i]), recurse+1);
+
+ }
+ }
+ }
+}
+
+void btree_node_print(struct btree_root *root, struct btree_node *node)
+{
+ printk(KERN_DEBUG "node: %p\n", node);
+ __btree_node_print(root, node, 0);
+}
+
+void btree_print(struct btree_root *root)
+{
+ printk(KERN_DEBUG "[%p] root: %p, height: %d\n",
+ root, root->root, root->height);
+ if (root->root)
+ __btree_node_print(root, root->root, 1);
+}
+
+static unsigned long __btree_key(struct btree_root *root,
+ struct btree_node *node, int height)
+{
+ if (height == 0)
+ return node->item[0].key;
+
+ return __btree_key(root, btree_item_deref(&node->item[0]), height - 1);
+}
+
+static int __btree_validate(struct btree_root *root, struct btree_node *node,
+ unsigned long *pindex, int height)
+{
+ unsigned long parent_key = *pindex;
+ unsigned long node_key = 0;
+ unsigned long child_key = 0;
+
+ int i;
+ unsigned long key;
+ int nr = 0;
+ int bug = 0;
+
+ for (i = 0; i < BTREE_ITEM_MAX; i++) {
+ struct btree_node *child = btree_item_deref(&node->item[i]);
+ if (!child)
+ continue;
+
+ nr++;
+
+ key = node->item[i].key;
+ if (key < parent_key || (!i && key != parent_key) ||
+ (i && key == parent_key)) {
+ printk(KERN_DEBUG
+ "wrong parent split: key: %lu parent_key: %lu index: %d\n",
+ key, parent_key, i);
+ bug++;
+ }
+
+ if (key < node_key || (i && key == node_key)) {
+ printk(KERN_DEBUG
+ "wrong order: key: %lu node_key: %lu index: %d\n",
+ key, node_key, i);
+ bug++;
+ }
+ node_key = key;
+
+ if (key < child_key || (i && key == child_key)) {
+ printk(KERN_DEBUG
+ "wrong child split: key: %lu child_key: %lu index: %d\n",
+ key, child_key, i);
+ bug++;
+ }
+
+ child_key = max(node_key, child_key);
+
+ if (height)
+ bug += __btree_validate(root, child,
+ &child_key, height - 1);
+
+ *pindex = max(node_key, child_key);
+ }
+
+ if (node != root->root && nr < BTREE_ITEM_HALF) {
+ printk(KERN_DEBUG "node short\n");
+ bug++;
+ }
+
+ if (bug)
+ printk(KERN_DEBUG "bug in node: %p\n", node);
+
+ return bug;
+}
+
+int btree_validate(struct btree_root *root)
+{
+ unsigned long key;
+
+ if (root->root) {
+ key = __btree_key(root, root->root, root->height - 1);
+ return __btree_validate(root, root->root, &key,
+ root->height - 1);
+ }
+
+ return 0;
+}
+#endif
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c
+++ linux-2.6/init/main.c
@@ -66,6 +66,7 @@
#include <linux/kmemcheck.h>
#include <linux/ftrace.h>
#include <trace/boot.h>
+#include <linux/btree.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -654,6 +655,7 @@ asmlinkage void __init start_kernel(void
kmemtrace_init();
debug_objects_mem_init();
idr_init_cache();
+ btree_init();
setup_per_cpu_pageset();
numa_policy_init();
if (late_time_init)

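A second sketch, again not part of the patch: how the preload interface above might be used to keep node allocation out of an atomic section. It assumes writers are serialized externally while readers rely on RCU, and that the root has been set up with btree_root_init(); the lock and function names (example_lock, example_root, example_insert) are made up for illustration.

#include <linux/btree.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

static DEFINE_SPINLOCK(example_lock);	/* hypothetical writer-side lock */
static struct btree_root example_root;	/* set up elsewhere with btree_root_init() */

static int example_insert(unsigned long key, void *val)
{
	int err;

	/* claim the worst-case 1 + 2*height nodes while sleeping is allowed */
	err = btree_preload(&example_root, GFP_KERNEL);
	if (err)
		return err;

	spin_lock(&example_lock);
	/* the preload inside btree_insert() finds the nodes already there */
	err = btree_insert(&example_root, key, val);
	spin_unlock(&example_lock);

	return err;
}
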
2009-01-11 03:14:00

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sat, Jan 10, 2009 at 11:01:35PM +0100, Jörn Engel wrote:
>
> I don't really expect a big difference, even if the filesystem is
> intended for flash, not disks. Other overhead will dominate the
> picture. The situation may be different for Johannes, though.

If there isn't a big difference, is it really worth it to have both
rbtrees and b-trees in the kernel? Especially given the disadvantages
akpm noted....

- Ted

2009-01-11 08:20:55

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sat, 10 January 2009 23:26:56 +0100, Andi Kleen wrote:
>
> > Finally, are b+tree so much better than rbtrees that it would be
> > worthwhile to just *replace* rbtrees with b+trees? Or is the problem
>
> There've been a couple of proposals like this over the years, also
> with other data structures like judy trees (which seem to bring
> the cache line optimization Joern talks about to even more extreme).
> splay trees seem to be also a popular suggestion, although they're
> considered dubious by other people (including me). Another
> alternative would be skip lists.

The number of people who truly understand what Judy trees do may be in the
single digits. The main disadvantage I see is that Judy trees heavily rely
on repacking nodes over and over. Part of Judy is a memory manager with
essentially slab caches for nodes with 2, 4, 6, 8, 12, 16, 24, 32, 48,
64, 96, 128, 192, 256, 384 and 512 words.

Splay trees are still binary trees, so the fan-out argument is identical
to that against rbtrees. If we have to pull in a cacheline, we might as
well use all of it.
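
In rough numbers (assuming 64-byte cachelines, 8-byte keys and pointers, and a node spanning a couple of cachelines, so on the order of k = 8 key/pointer pairs per node; the exact node sizes in the patches in this thread differ):

\[
\text{misses}_{\text{rbtree}} \approx \log_2 n, \qquad
\text{misses}_{\text{B+tree}} \approx \log_k n = \frac{\log_2 n}{\log_2 k}
\]

so with k = 8 a lookup touches roughly a third as many cachelines as an rbtree walk over the same n.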

Skip lists are just a Bad Idea(tm). In O(x) notation they behave like
binary trees, waste cachelines left and right, use more memory, depend
on a sufficiently good random() function,... I guess you never closely
looked at them, because anyone who does tries to forget them as fast as
possible.

> Also I don't remember there was ever a big discussion about
> rbtrees vs other trees -- it was just that Andrea liked
> them and added a convenient library and some point and other
> people found it convenient too and started using it.
>
> But it's unclear how much all that would really help.
>
> I always thought it might be advantageous to look at a little
> more abstract interface for the existing rbtree users (shouldn't
> be that hard, the interface is already not too bad) and then just
> let some students implement a couple of backend data structures
> for that interface and then run some benchmarks.
>
> I don't think it's a good idea to add a b*tree library
> and use it only from a few users though. After all it's
> a lot of code and that should have a clear benefit.

Quoting Dave Chinner:
| I think a btree library would be useful - there are places where
| people are using rb-trees instead of btree simply because it's
| easier to use the rbtree than it is to implement a btree library.
| I can think of several places I could use such a library for
| in-memory extent representation....

Jörn

--
Joern's library part 15:
http://www.knosof.co.uk/cbook/accu06a.pdf

2009-01-11 08:31:35

by Jörn Engel

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sun, 11 January 2009 00:57:20 +0100, Peter Zijlstra wrote:
>
> > B-trees however have one thing over RB-trees, B-trees can be made
> RCU-safe whereas RB-trees cannot be -- the only problem is that Joern's
> doesn't do that.

And yours doesn't support multiple key sizes afaics. I don't mind
using your version as a basis, so long as my^Wboth requirements get
fulfilled. ;)

Do you see a problem combining rcu with keys being an array of unsigned
long?

Jörn

--
Schrödinger's cat is <BLINK>not</BLINK> dead.
-- Illiad

2009-01-11 18:09:06

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library


I only listed the proposals I've heard about before, not necessarily
endorsing them.

> The number of people that truly understand what Judy trees do may be
> single-digit. Main disadvantage I see is that Judy trees heavily rely
> on repacking nodes over and over. Part of Judy is a memory manager with
> essentially slab caches for nodes with 2, 4, 6, 8, 12, 16, 24, 32, 48,
> 64, 96, 128, 192, 256, 384 and 512 words.

Well, complicated code is en vogue recently :-)

> Splay trees are still binary trees, so the fan-out argument is identical
> to that against rbtrees. If we have to pull in a cacheline, we might as
> well use all of it.
>
> Skip lists are just a Bad Idea(tm). In O(x) notation they behave like
> binary trees, waste cachelines left and right, use more memory, depend
> on a sufficiently good random() function,... I guess you never closely
> looked at them, because anyone who does tries to forget them as fast as
> possible.

Using radix trees more would also be an alternative.

I honestly don't know how they will all perform in the kernel; that is why I
thought it would be a good idea to just try them out. But I'm not
volunteering to code it up, so it was more of an idle thought.

Doing that would be a reasonable student project. In fact I've been asked
about this sort of thing by students in the past.

Cleaning up the rbtree interface to be a little more abstract
would probably be a good idea in general. I never really
liked the open-coded searches.

-Andi

2009-01-12 16:20:46

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

On Sun, Jan 11, 2009 at 09:30:34AM +0100, Jörn Engel wrote:
> On Sun, 11 January 2009 00:57:20 +0100, Peter Zijlstra wrote:
> >
> > B-trees however have one thing over RB-trees, B-trees can be made
> > RCU-safe whereas RB-trees cannot be -- the only problem is that Joern's
> > doesn't do that.
>
> And yours doesn't support multiple key sizes afaics. I don't mind
> using your version as a basis, so long as my^Wboth requirements get
> fulfilled. ;)
>
> Do you see a problem combining rcu with keys being an array of unsigned
> long?

There is a theory vs. practice issue here. In theory, you could make any
dynamically allocated search structure RCU-searchable by copy-updating
the entire structure on each update. In many cases, you only have to
replace a fairly small part. In practice, the question is whether your
read-to-update ratio is large enough to make it a win.

The main potential issue with keys being an array of unsigned long is
that it is no longer possible to atomically rewrite a given key in place.
The usual ways to work around this are:

1. Replace each key entry with a pointer to the array of unsigned
longs, so that you can atomically rewrite the pointer. The
downside of this approach is the extra cache line accessed per
key scanned. The upside of this approach is that you might
be able to share code with Peter's tree by using a comparison
function (if NULL, just compare the entries directly) or some
other similar trick.

2. Place the array of unsigned longs directly in the structure, but
copy-update the entire node rather than rewriting keys in place.
This has better cache locality, but makes it more difficult to
share code with the small-key variant.

There are probably other tricks as well.
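
As a rough sketch of the two layouts described above (struct and field names here are illustrative, not taken from either patch):

#include <linux/rcupdate.h>

/* 1: indirect keys -- the key pointer can be swapped atomically with
 *    rcu_assign_pointer(); readers pay one extra cacheline per key. */
struct item_indirect {
	unsigned long *key;	/* array of key words, read via rcu_dereference() */
	void *data;
};

/* 2: inline keys -- better locality, but changing a key means building a
 *    copy of the whole node and publishing the new node pointer instead. */
struct item_inline {
	unsigned long key[2];	/* e.g. a 128-bit key as two longs */
	void *data;
};

static unsigned long *item_indirect_set_key(struct item_indirect *item,
					    unsigned long *new_key)
{
	unsigned long *old = item->key;

	/* concurrent readers see either the old or the new key array */
	rcu_assign_pointer(item->key, new_key);

	/* caller frees 'old' only after a grace period (synchronize_rcu/kfree) */
	return old;
}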

Thanx, Paul

2009-01-17 23:13:47

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] add b+tree library

Hi!


> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ wireless-testing/lib/btree.c 2009-01-10 11:41:47.000000000 +0100
> @@ -0,0 +1,797 @@
> +/*
> + * lib/btree.c - Simple In-memory B+Tree
> + *
> + * As should be obvious for Linux kernel code, license is GPLv2
> + *
> + * Copyright (c) 2007-2008 Joern Engel <[email protected]>
> + * Bits and pieces stolen from Peter Zijlstra's code, which is
> + * Copyright 2007, Red Hat Inc. Peter Zijlstra <[email protected]>
> + * GPLv2
> + *
> + * see http://programming.kicks-ass.net/kernel-patches/vma_lookup/btree.patch
> + *
> + * A relatively simple B+Tree implementation. I have written it as a learning
> + * excercise to understand how B+Trees work. Turned out to be useful as well.
> + *
> + * B+Trees can be used similar to Linux radix trees (which don't have anything
> + * in common with textbook radix trees, beware). Prerequisite for them working
> + * well is that access to a random tree node is much faster than a large number
> + * of operations within each node.

should be 'much slower'?

Because otherwise I don't get it.

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html