Andrew - this should be pretty much identical to the patch series I
mailed out during last merge window, except rebased onto 3.11-rc4 and
retested.
I think the series should be more or less ready to go, and it'd be
really nice to get at least the percpu bits in - think you can have a
look and pick it up if it meets your standards?
Patch series is also available in git -
git://evilpiepirate.org/~kent/linux-bcache.git idr
This is a new, from scratch implementation of ida that should be
simpler, faster and more space efficient.
Two primary reasons for the rewrite:
* A future patch will reimplement idr on top of this ida implementation +
radix trees. Once that's done, the end result will be ~1k fewer lines
of code, much simpler and easier to understand and it should be quite
a bit faster.
* The performance improvements and addition of ganged allocation should
make ida more suitable for use by a percpu id/tag allocator, which
would then act as a frontend to this allocator.
The old ida implementation was done with the idr data structures - this
was IMO backwards. I'll soon be reimplementing idr on top of this new
ida implementation and radix trees - using a separate dedicated data
structure for the free ID bitmap should actually make idr faster, and
the end result is _significantly_ less code.
This implementation conceptually isn't that different from the old one -
it's a tree of bitmaps, where one bit in a given node indicates whether
or not there are free bits in a child node.
The main difference (and advantage) over the old version is that the
tree isn't implemented with pointers - it's implemented in an array,
like how heaps are implemented, which both better space efficiency and
it'll be faster since there's no pointer chasing. (It's not one giant
contiguous array, it's an array of arrays but the algorithm treats it as
one big array)
Time to allocate 1 << 24 ids: 0m0.663s
Time to allocate 1 << 24 ids, old code: 0m28.604s
Time to allocate INT_MAX ids: 1m41.371s
Time to allocate INT_MAX ids, old code: Got bored of waiting for it to finish.
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Stephen Rothwell <[email protected]>
Cc: Fengguang Wu <[email protected]>
---
include/linux/idr.h | 122 ++++---
lib/idr.c | 897 +++++++++++++++++++++++++++++++++++-----------------
2 files changed, 691 insertions(+), 328 deletions(-)
diff --git a/include/linux/idr.h b/include/linux/idr.h
index c0e0c54..a310bb0 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -16,6 +16,92 @@
#include <linux/bitops.h>
#include <linux/init.h>
#include <linux/rcupdate.h>
+#include <linux/spinlock_types.h>
+#include <linux/wait.h>
+
+/* IDA */
+
+struct ida {
+ spinlock_t lock;
+
+ /*
+ * cur_id and allocated_ids are for ida_alloc_cyclic. For cyclic
+ * allocations we search for new ids to allocate starting from the last
+ * id allocated - cur_id is the next id to try allocating.
+ *
+ * But we also don't want the allocated ids to be arbitrarily sparse -
+ * the memory usage for the bitmap could be arbitrarily bad, and if
+ * they're used as keys in a radix tree the memory overhead of the radix
+ * tree could be quite bad as well. So we use allocated_ids to decide
+ * when to restart cur_id from 0, and bound how sparse the bitmap can
+ * be.
+ */
+ unsigned cur_id;
+ unsigned allocated_ids;
+
+ /* size of ida->tree */
+ unsigned nodes;
+
+ /*
+ * Index of first leaf node in ida->tree; equal to the number of non
+ * leaf nodes, ida->nodes - ida->first_leaf == number of leaf nodes
+ */
+ unsigned first_leaf;
+ unsigned sections;
+
+ unsigned long **tree;
+ unsigned long *inline_section;
+ unsigned long inline_node;
+};
+
+#define IDA_INIT(name) \
+{ \
+ .lock = __SPIN_LOCK_UNLOCKED(name.lock), \
+ .nodes = 1, \
+ .first_leaf = 0, \
+ .sections = 1, \
+ .tree = &name.inline_section, \
+ .inline_section = &name.inline_node, \
+}
+#define DEFINE_IDA(name) struct ida name = IDA_INIT(name)
+
+void ida_remove(struct ida *ida, unsigned id);
+int ida_alloc_range(struct ida *ida, unsigned int start,
+ unsigned int end, gfp_t gfp);
+int ida_alloc_cyclic(struct ida *ida, unsigned start, unsigned end, gfp_t gfp);
+void ida_destroy(struct ida *ida);
+int ida_init_prealloc(struct ida *ida, unsigned prealloc);
+
+/**
+ * ida_alloc_range - allocate a new id.
+ * @ida: the (initialized) ida.
+ * @gfp_mask: memory allocation flags
+ *
+ * Allocates an id in the range [0, INT_MAX]. Returns -ENOSPC if no ids are
+ * available, or -ENOMEM on memory allocation failure.
+ *
+ * Returns the smallest available id
+ *
+ * Use ida_remove() to get rid of an id.
+ */
+static inline int ida_alloc(struct ida *ida, gfp_t gfp_mask)
+{
+ return ida_alloc_range(ida, 0, 0, gfp_mask);
+}
+
+/**
+ * ida_init - initialize ida handle
+ * @ida: ida handle
+ *
+ * This function is use to set up the handle (@ida) that you will pass
+ * to the rest of the functions.
+ */
+static inline void ida_init(struct ida *ida)
+{
+ ida_init_prealloc(ida, 0);
+}
+
+/* IDR */
/*
* We want shallower trees and thus more bits covered at each layer. 8
@@ -195,42 +281,6 @@ static inline void __deprecated idr_remove_all(struct idr *idp)
__idr_remove_all(idp);
}
-/*
- * IDA - IDR based id allocator, use when translation from id to
- * pointer isn't necessary.
- *
- * IDA_BITMAP_LONGS is calculated to be one less to accommodate
- * ida_bitmap->nr_busy so that the whole struct fits in 128 bytes.
- */
-#define IDA_CHUNK_SIZE 128 /* 128 bytes per chunk */
-#define IDA_BITMAP_LONGS (IDA_CHUNK_SIZE / sizeof(long) - 1)
-#define IDA_BITMAP_BITS (IDA_BITMAP_LONGS * sizeof(long) * 8)
-
-struct ida_bitmap {
- long nr_busy;
- unsigned long bitmap[IDA_BITMAP_LONGS];
-};
-
-struct ida {
- struct idr idr;
- struct ida_bitmap *free_bitmap;
-};
-
-#define IDA_INIT(name) { .idr = IDR_INIT((name).idr), .free_bitmap = NULL, }
-#define DEFINE_IDA(name) struct ida name = IDA_INIT(name)
-
-void ida_destroy(struct ida *ida);
-void ida_init(struct ida *ida);
-
-int ida_alloc_range(struct ida *ida, unsigned int start, unsigned int end,
- gfp_t gfp_mask);
-void ida_remove(struct ida *ida, unsigned int id);
-
-static inline int ida_alloc(struct ida *ida, gfp_t gfp_mask)
-{
- return ida_alloc_range(ida, 0, 0, gfp_mask);
-}
-
void __init idr_init_cache(void);
#endif /* __IDR_H__ */
diff --git a/lib/idr.c b/lib/idr.c
index 9ac1174..e5afab5 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -8,6 +8,8 @@
*
* Modified by Nadia Derbey to make it RCU safe.
*
+ * IDA completely rewritten by Kent Overstreet <[email protected]>
+ *
* Small id to pointer translation service.
*
* It uses a radix tree like structure as a sparse array indexed
@@ -26,17 +28,612 @@
* with the slab allocator.
*/
-#ifndef TEST // to test in user space...
-#include <linux/slab.h>
-#include <linux/init.h>
-#include <linux/export.h>
-#endif
+#include <linux/bitmap.h>
+#include <linux/bitops.h>
+#include <linux/bug.h>
#include <linux/err.h>
-#include <linux/string.h>
+#include <linux/export.h>
+#include <linux/hardirq.h>
#include <linux/idr.h>
-#include <linux/spinlock.h>
+#include <linux/init.h>
+#include <linux/kernel.h>
#include <linux/percpu.h>
-#include <linux/hardirq.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/string.h>
+#include <linux/spinlock.h>
+
+static void kgfree(void *ptr, size_t size)
+{
+ if (size < PAGE_SIZE)
+ kfree(ptr);
+ else
+ free_pages((unsigned long) ptr, get_order(size));
+}
+
+static void *kgalloc(size_t size, gfp_t gfp)
+{
+ return size < PAGE_SIZE
+ ? kmalloc(size, gfp)
+ : (void *) __get_free_pages(gfp, get_order(size));
+}
+
+/**
+ * DOC: IDA description
+ * IDA - ID (small integer) allocator
+ *
+ * This works much like using a simple bitmap to allocate indices - ida_alloc()
+ * is equivalent to find_first_zero_bit() then __set_bit(), and ida_remove() is
+ * equivalent to __clear_bit(). But it's much more efficient than a large
+ * bitmap, and resizes itself as needed.
+ *
+ * It's implemented as a tree of bitmaps: a node in the tree is a single
+ * unsigned long. The leaf nodes of the tree are segments of the entire bitmap -
+ * a cleared bit indicates a free id, and a set bit indicates an allocated one.
+ * Bits in the parent nodes indicate whether or not there are free bits in the
+ * corresponding child node - when all the bits in a parent node are set, none
+ * of its children have bits free.
+ *
+ * The splay factor of the tree (IDA_TREE_ARY) == BITS_PER_LONG - parent nodes
+ * have 32 or 64 children.
+ *
+ * The tree itself is implemented with an array instead of pointers - exactly
+ * like the textbook implementation of D-ary heaps. The root of the bitmap tree
+ * is at ida->tree[0]. The children of node i are at i * IDA_TREE_ARY + 1 + j,
+ * where j is in the range [0, 63], and the parent of node i is at (i - 1) /
+ * IDA_TREE_ARY.
+ *
+ * This conveniently means that our leaf nodes are all contiguous in memory -
+ * the bit for id i is bit id % BITS_PER_LONG in ida->tree[ida->first_leaf + i /
+ * BITS_PER_LONG].
+ *
+ * Note that the number of ids we can allocate is limited by the amount of
+ * memory we can contiguously allocate. The amount of memory used for the bitmap
+ * tree is only slightly more than a flat bitmap would use - about 1 / TREE_ARY
+ * * (sizeof flat bitmap).
+ *
+ * So for 1 mb of memory (and allocating more than that should be fine with
+ * CONFIG_COMPACTION) you get slightly under 8 million IDs.
+ */
+
+#define IDA_TREE_ARY BITS_PER_LONG
+#define IDA_ALLOC_ORDER_MAX 4
+#define IDA_SECTION_SIZE (PAGE_SIZE << IDA_ALLOC_ORDER_MAX)
+#define IDA_NODES_PER_SECTION (IDA_SECTION_SIZE / sizeof(unsigned long))
+
+static inline unsigned long *ida_index_to_node(struct ida *ida, unsigned node)
+{
+ return ida->tree[node / IDA_NODES_PER_SECTION] +
+ node % IDA_NODES_PER_SECTION;
+}
+
+/*
+ * For a given number of nodes, calculate how many are going to be parent nodes
+ * (equal to ida->first_leaf) and by extension how my will be leaves.
+ */
+static unsigned first_leaf_from_nodes(unsigned nodes)
+{
+ unsigned ret = 0;
+
+ while (ret * IDA_TREE_ARY + 1 < nodes)
+ ret = ret * IDA_TREE_ARY + 1;
+
+ return ret;
+}
+
+static void __ida_remove(struct ida *ida, unsigned int id)
+{
+ unsigned i = ida->first_leaf + id / BITS_PER_LONG;
+ unsigned bit = id % BITS_PER_LONG;
+
+ if (WARN(i >= ida->nodes,
+ "Tried to free an id outside the range of allocated ids\n"))
+ return;
+
+ --ida->allocated_ids;
+
+ while (1) {
+ unsigned long *node = ida_index_to_node(ida, i), old = *node;
+
+ WARN(!test_bit(bit, node),
+ "Tried to free an id that was already free\n");
+ __clear_bit(bit, node);
+
+ if (~old || !i)
+ break;
+
+ /*
+ * If this node's bits were all 1s before we cleared this bit,
+ * we need to clear this node's bit in the parent node - and so
+ * on up to the root.
+ */
+
+ bit = (i - 1) % IDA_TREE_ARY;
+ i = (i - 1) / IDA_TREE_ARY;
+ }
+}
+
+/**
+ * ida_remove - remove an allocated id.
+ * @ida: the (initialized) ida.
+ * @id: the id returned by ida_alloc_range.
+ */
+void ida_remove(struct ida *ida, unsigned int id)
+{
+ unsigned long flags;
+ spin_lock_irqsave(&ida->lock, flags);
+ __ida_remove(ida, id);
+ spin_unlock_irqrestore(&ida->lock, flags);
+}
+EXPORT_SYMBOL(ida_remove);
+
+static void ida_increase_depth(struct ida *ida, unsigned new_nodes,
+ unsigned new_first_leaf)
+{
+ unsigned old_leaves = ida->nodes - ida->first_leaf;
+ unsigned src = ida->nodes;
+ unsigned dst = new_first_leaf + old_leaves;
+ unsigned n, i, bit;
+ unsigned long *node;
+
+ /* Shift leaves up to new position */
+ while (src != ida->first_leaf) {
+ i = min((src - 1) % IDA_NODES_PER_SECTION + 1,
+ (dst - 1) % IDA_NODES_PER_SECTION + 1);
+
+ i = min(i, src - ida->first_leaf);
+
+ src -= i;
+ dst -= i;
+
+ memmove(ida_index_to_node(ida, dst),
+ ida_index_to_node(ida, src),
+ i * sizeof(unsigned long));
+ }
+
+ /* Zero out parent nodes */
+ for (n = 0; n < new_first_leaf; n += i) {
+ i = min_t(unsigned, new_first_leaf - n,
+ IDA_NODES_PER_SECTION);
+
+ memset(ida_index_to_node(ida, n),
+ 0, i * sizeof(unsigned long));
+ }
+
+ /* Reconstruct parent nodes */
+ for (n = new_first_leaf; n < new_first_leaf + old_leaves; n++) {
+ i = n;
+ node = ida_index_to_node(ida, i);
+
+ while (!~*node && i) {
+ bit = (i - 1) % IDA_TREE_ARY;
+ i = (i - 1) / IDA_TREE_ARY;
+
+ node = ida_index_to_node(ida, i);
+ __set_bit(bit, node);
+ }
+ }
+}
+
+/*
+ * Attempt to double the size of the tree. We have to drop ida->lock to allocate
+ * memory, so we might race with another allocation that also tries to resize.
+ * So if the tree's not the size it originally was when we retake ida->lock,
+ * just return 0 - but the caller needs to recheck for the tree being full in
+ * case we _really_ raced badly.
+ */
+static int __ida_resize(struct ida *ida, gfp_t gfp, unsigned long *flags)
+ __releases(&ida->lock)
+ __acquires(&ida->lock)
+{
+ unsigned long *tree, **sections;
+ unsigned cur_nodes, new_nodes, new_first_leaf, cur_sections;
+again:
+ cur_nodes = ida->nodes;
+
+ new_nodes = roundup_pow_of_two(ida->nodes + 1) <= IDA_NODES_PER_SECTION
+ ? roundup_pow_of_two(ida->nodes + 1)
+ : ida->nodes + IDA_NODES_PER_SECTION;
+
+ new_first_leaf = first_leaf_from_nodes(new_nodes);
+
+ sections = NULL;
+ cur_sections = ida->sections;
+
+ BUG_ON(ida->nodes > IDA_NODES_PER_SECTION &&
+ ida->nodes % IDA_NODES_PER_SECTION);
+
+ spin_unlock_irqrestore(&ida->lock, *flags);
+
+ if (ida->nodes >= IDA_NODES_PER_SECTION &&
+ is_power_of_2(cur_sections)) {
+ sections = kgalloc(cur_sections * 2 * sizeof(unsigned long *),
+ __GFP_ZERO|gfp);
+ if (!sections)
+ goto err;
+ }
+
+ tree = kgalloc(min_t(size_t, new_nodes * sizeof(unsigned long),
+ IDA_SECTION_SIZE), __GFP_ZERO|gfp);
+ if (!tree)
+ goto err;
+
+ spin_lock_irqsave(&ida->lock, *flags);
+
+ if (cur_nodes != ida->nodes || cur_sections != ida->sections) {
+ kgfree(sections, cur_sections * 2 * sizeof(unsigned long *));
+ kgfree(tree, min_t(size_t, new_nodes * sizeof(unsigned long),
+ IDA_SECTION_SIZE));
+ return 0;
+ }
+
+ if (sections) {
+ memcpy(sections, ida->tree,
+ ida->sections * sizeof(unsigned long *));
+
+ if (ida->tree != &ida->inline_section)
+ kgfree(ida->tree,
+ ida->sections * sizeof(unsigned long *));
+
+ ida->tree = sections;
+ }
+
+ if (ida->nodes < IDA_NODES_PER_SECTION) {
+ memcpy(tree, ida_index_to_node(ida, 0),
+ ida->nodes * sizeof(unsigned long));
+
+ if (ida->tree[0] != &ida->inline_node)
+ kgfree(ida->tree[0],
+ ida->nodes * sizeof(unsigned long));
+
+ ida->tree[0] = tree;
+ } else {
+ ida->tree[ida->sections++] = tree;
+
+ new_nodes = ida->sections * IDA_NODES_PER_SECTION;
+ new_first_leaf = first_leaf_from_nodes(new_nodes);
+
+ if (new_nodes - new_first_leaf < ida->nodes - ida->first_leaf)
+ goto again;
+ }
+
+ if (new_first_leaf != ida->first_leaf)
+ ida_increase_depth(ida, new_nodes, new_first_leaf);
+
+ ida->nodes = new_nodes;
+ ida->first_leaf = new_first_leaf;
+
+ return 0;
+err:
+ kgfree(sections, cur_sections * 2 * sizeof(unsigned long));
+ spin_lock_irqsave(&ida->lock, *flags);
+ return -ENOMEM;
+}
+
+/*
+ * Ganged allocation - amortize locking and tree traversal for when we've got
+ * another allocator (i.e. a percpu version) acting as a frontend to this code
+ */
+static int __ida_alloc_range_multiple(struct ida *ida, unsigned *ids,
+ unsigned nr_ids, unsigned min_id,
+ unsigned max_id, gfp_t gfp,
+ unsigned long *flags)
+ __releases(&ida->lock)
+ __acquires(&ida->lock)
+{
+ unsigned i = 0, bit, bit_offset, id, ids_found = 0;
+ unsigned long *node = ida_index_to_node(ida, i);
+ int err = 0;
+
+ if (!max_id || max_id > (unsigned) INT_MAX + 1)
+ max_id = (unsigned) INT_MAX + 1;
+
+ if (min_id >= max_id)
+ return -ENOSPC;
+
+ while (ids_found < nr_ids) {
+ /*
+ * If all bits are set in the root, no bits free and we need to
+ * resize.
+ */
+ while (!~*node) {
+resize:
+ if (ida->nodes - ida->first_leaf >=
+ BITS_TO_LONGS(max_id)) {
+ err = -ENOSPC;
+ goto err;
+ }
+
+ err = __ida_resize(ida, gfp, flags);
+ if (err)
+ goto err;
+
+ i = 0;
+ node = ida_index_to_node(ida, i);
+ }
+
+ if (min_id) {
+ /*
+ * If we're starting from a specific index, skip to that
+ * leaf node and start looking there:
+ */
+ bit_offset = min_id % BITS_PER_LONG;
+ i = ida->first_leaf + min_id / BITS_PER_LONG;
+
+ if (i >= ida->nodes)
+ goto resize;
+
+ while (1) {
+ node = ida_index_to_node(ida, i);
+ bit = ffz(*node >> bit_offset) + bit_offset;
+
+ /*
+ * We might have had to go back up the tree
+ * before we found a free bit - so skip down to
+ * where we recurse down the tree.
+ */
+ if (~*node && bit < BITS_PER_LONG)
+ goto found;
+
+ if (!i)
+ goto resize;
+
+ /*
+ * Ok, no bits available in this node - go up a
+ * level. But we have to update bit_offset so we
+ * start searching in the parent _after_ the
+ * node we're currently at
+ */
+ bit_offset = (i - 1) % IDA_TREE_ARY + 1;
+ i = (i - 1) / IDA_TREE_ARY;
+ }
+ }
+
+ /*
+ * Recurse down the tree looking for a free bit. We already
+ * checked to make sure there _were_ free bits, but we might end
+ * up at a leaf node we haven't allocated yet.
+ */
+ while (1) {
+ bit = ffz(*node);
+found:
+ /*
+ * Found a bit - if we're at a leaf node, great! We're
+ * done:
+ */
+ if (i >= ida->first_leaf)
+ break;
+
+ i = i * IDA_TREE_ARY + 1 + bit;
+ node = ida_index_to_node(ida, i);
+
+ /*
+ * Recurse. But if we'd recurse to a node that hasn't
+ * been allocated yet, resize:
+ */
+
+ if (i >= ida->nodes)
+ goto resize;
+
+ BUG_ON(!~*node);
+ }
+
+ /*
+ * Our leaves are contiguous, so we can calculate the id we
+ * allocated from the node we're at and the bit we found within
+ * that node:
+ */
+ id = (i - ida->first_leaf) * BITS_PER_LONG + bit;
+ BUG_ON(id < min_id);
+
+ if (id >= max_id) {
+ err = -ENOSPC;
+ goto err;
+ }
+
+ ids[ids_found++] = id;
+ ida->allocated_ids++;
+
+ /*
+ * Mark the id as allocated. If all the bits are now set in this
+ * node, set this node's bit in the parent node - and so on up
+ * to the root:
+ */
+ while (1) {
+ __set_bit(bit, node);
+
+ if (~*node || !i)
+ break;
+
+ bit = (i - 1) % IDA_TREE_ARY;
+ i = (i - 1) / IDA_TREE_ARY;
+
+ node = ida_index_to_node(ida, i);
+ }
+ }
+err:
+ return ids_found ? ids_found : err;
+}
+
+/**
+ * ida_alloc_range - allocate a new id.
+ * @ida: the (initialized) ida.
+ * @start: the minimum id (inclusive, <= INT_MAX)
+ * @end: the maximum id (exclusive, <= INT_MAX + 1 or 0 for unlimited)
+ * @gfp: memory allocation flags
+ *
+ * Allocates an id in the range [start, end). Returns -ENOSPC if no ids are
+ * available, or -ENOMEM on memory allocation failure.
+ *
+ * Returns the smallest free id >= start.
+ *
+ * Use ida_remove() to get rid of an id.
+ */
+int ida_alloc_range(struct ida *ida, unsigned int start,
+ unsigned int end, gfp_t gfp)
+{
+ int ret;
+ unsigned id;
+ unsigned long flags;
+
+ spin_lock_irqsave(&ida->lock, flags);
+ ret = __ida_alloc_range_multiple(ida, &id, 1, start, end, gfp, &flags);
+ spin_unlock_irqrestore(&ida->lock, flags);
+
+ return ret == 1 ? id : ret;
+}
+EXPORT_SYMBOL(ida_alloc_range);
+
+static int __ida_alloc_cyclic(struct ida *ida, unsigned start, unsigned end,
+ gfp_t gfp, unsigned long *flags)
+ __releases(&ida->lock)
+ __acquires(&ida->lock)
+{
+ int ret;
+ unsigned id;
+
+ ret = __ida_alloc_range_multiple(ida, &id, 1,
+ max(start, ida->cur_id),
+ end, gfp, flags);
+
+ if (ret < 0)
+ ret = __ida_alloc_range_multiple(ida, &id, 1, start,
+ end, gfp, flags);
+ if (ret == 1) {
+ ida->cur_id = id + 1;
+ if ((ida->cur_id - start) / 2 > max(1024U, ida->allocated_ids))
+ ida->cur_id = 0;
+
+ return id;
+ }
+
+ return ret;
+}
+
+/**
+ * ida_alloc_cyclic - allocate new ids cyclically
+ * @ida: the (initialized) ida.
+ * @start: the minimum id (inclusive, <= INT_MAX)
+ * @end: the maximum id (exclusive, <= INT_MAX + 1 or 0 for unlimited)
+ * @gfp: memory allocation flags
+ *
+ * Allocates an id in the range start <= id < end, or returns -ENOSPC.
+ * On memory allocation failure, returns -ENOMEM.
+ *
+ * Instead of returning the smallest free id, start searching from the position
+ * where the last id was allocated - i.e. it won't reuse freed ids right away.
+ *
+ * To avoid the allocated id space (and internal bitmap) becoming arbitrarily
+ * sparse, it can wrap before reaching the maximum id - if less than half of our
+ * current id space is allocated, it resets cur_id to 0
+ *
+ * But we don't want to wrap when the id space is small, so we use the maximum
+ * of (1024, allocated_ids) - see __ida_alloc_cyclic().
+ *
+ * Use ida_remove() to get rid of an id.
+ */
+int ida_alloc_cyclic(struct ida *ida, unsigned start, unsigned end, gfp_t gfp)
+{
+ int ret;
+ unsigned long flags;
+
+ spin_lock_irqsave(&ida->lock, flags);
+ ret = __ida_alloc_cyclic(ida, start, end, gfp, &flags);
+ spin_unlock_irqrestore(&ida->lock, flags);
+
+ return ret;
+}
+EXPORT_SYMBOL(ida_alloc_cyclic);
+
+/**
+ * ida_destroy - release all cached layers within an ida tree
+ * @ida: ida handle
+ */
+void ida_destroy(struct ida *ida)
+{
+ unsigned i;
+
+ if (ida->tree[0] &&
+ ida->tree[0] != &ida->inline_node)
+ kgfree(ida->tree[0], min(ida->nodes * sizeof(unsigned long),
+ IDA_SECTION_SIZE));
+
+ for (i = 1; i < ida->sections; i++)
+ kgfree(ida->tree[i], IDA_SECTION_SIZE);
+
+ if (ida->tree &&
+ ida->tree != &ida->inline_section)
+ kgfree(ida->tree, roundup_pow_of_two(ida->sections) *
+ sizeof(unsigned long *));
+}
+EXPORT_SYMBOL(ida_destroy);
+
+/**
+ * ida_init_prealloc - initialize ida handle
+ * @ida: ida handle
+ * @prealloc: number of ids to preallocate memory for
+ *
+ * Initialize an ida, and preallocate enough memory that ida_alloc() will never
+ * return -ENOMEM if passed max_id <= prealloc.
+ */
+int ida_init_prealloc(struct ida *ida, unsigned prealloc)
+{
+ unsigned leaves = BITS_TO_LONGS(prealloc);
+
+ memset(ida, 0, sizeof(*ida));
+
+ spin_lock_init(&ida->lock);
+
+ ida->nodes = 1;
+ ida->first_leaf = 0;
+ ida->sections = 1;
+ ida->inline_section = &ida->inline_node;
+ ida->tree = &ida->inline_section;
+
+ if (leaves > ida->nodes - ida->first_leaf) {
+ unsigned i;
+
+ while (leaves > ida->nodes - ida->first_leaf) {
+ if (ida->nodes < IDA_NODES_PER_SECTION)
+ ida->nodes *= 2;
+ else
+ ida->nodes += IDA_NODES_PER_SECTION;
+
+ ida->first_leaf = first_leaf_from_nodes(ida->nodes);
+ }
+
+ if (ida->nodes > IDA_NODES_PER_SECTION) {
+ ida->sections = ida->nodes / IDA_NODES_PER_SECTION;
+ ida->tree = kgalloc(roundup_pow_of_two(ida->sections) *
+ sizeof(unsigned long *),
+ __GFP_ZERO|GFP_KERNEL);
+ if (!ida->tree)
+ return -ENOMEM;
+
+ for (i = 0; i < ida->sections; i++) {
+ ida->tree[i] = kgalloc(IDA_SECTION_SIZE,
+ __GFP_ZERO|GFP_KERNEL);
+ if (!ida->tree[i])
+ goto err;
+ }
+ } else {
+ ida->tree[0] =
+ kgalloc(ida->nodes * sizeof(unsigned long),
+ __GFP_ZERO|GFP_KERNEL);
+ if (!ida->tree)
+ return -ENOMEM;
+ }
+ }
+
+ return 0;
+err:
+ ida_destroy(ida);
+ return -ENOMEM;
+
+}
+EXPORT_SYMBOL(ida_init_prealloc);
+
+/* IDR */
#define MAX_IDR_SHIFT (sizeof(int) * 8 - 1)
#define MAX_IDR_BIT (1U << MAX_IDR_SHIFT)
@@ -50,7 +647,6 @@
static struct kmem_cache *idr_layer_cache;
static DEFINE_PER_CPU(struct idr_layer *, idr_preload_head);
static DEFINE_PER_CPU(int, idr_preload_cnt);
-static DEFINE_SPINLOCK(simple_ida_lock);
/* the maximum ID which can be allocated given idr->layers */
static int idr_max(int layers)
@@ -868,286 +1464,3 @@ void idr_init(struct idr *idp)
spin_lock_init(&idp->lock);
}
EXPORT_SYMBOL(idr_init);
-
-
-/**
- * DOC: IDA description
- * IDA - IDR based ID allocator
- *
- * This is id allocator without id -> pointer translation. Memory
- * usage is much lower than full blown idr because each id only
- * occupies a bit. ida uses a custom leaf node which contains
- * IDA_BITMAP_BITS slots.
- *
- * 2007-04-25 written by Tejun Heo <[email protected]>
- */
-
-static void free_bitmap(struct ida *ida, struct ida_bitmap *bitmap)
-{
- unsigned long flags;
-
- if (!ida->free_bitmap) {
- spin_lock_irqsave(&ida->idr.lock, flags);
- if (!ida->free_bitmap) {
- ida->free_bitmap = bitmap;
- bitmap = NULL;
- }
- spin_unlock_irqrestore(&ida->idr.lock, flags);
- }
-
- kfree(bitmap);
-}
-
-/**
- * ida_pre_get - reserve resources for ida allocation
- * @ida: ida handle
- * @gfp_mask: memory allocation flag
- *
- * This function should be called prior to locking and calling the
- * following function. It preallocates enough memory to satisfy the
- * worst possible allocation.
- *
- * If the system is REALLY out of memory this function returns %0,
- * otherwise %1.
- */
-static int ida_pre_get(struct ida *ida, gfp_t gfp_mask)
-{
- /* allocate idr_layers */
- if (!__idr_pre_get(&ida->idr, gfp_mask))
- return 0;
-
- /* allocate free_bitmap */
- if (!ida->free_bitmap) {
- struct ida_bitmap *bitmap;
-
- bitmap = kmalloc(sizeof(struct ida_bitmap), gfp_mask);
- if (!bitmap)
- return 0;
-
- free_bitmap(ida, bitmap);
- }
-
- return 1;
-}
-
-/**
- * ida_get_new_above - allocate new ID above or equal to a start id
- * @ida: ida handle
- * @starting_id: id to start search at
- * @p_id: pointer to the allocated handle
- *
- * Allocate new ID above or equal to @starting_id. It should be called
- * with any required locks.
- *
- * If memory is required, it will return %-EAGAIN, you should unlock
- * and go back to the ida_pre_get() call. If the ida is full, it will
- * return %-ENOSPC.
- *
- * @p_id returns a value in the range @starting_id ... %0x7fffffff.
- */
-static int ida_get_new_above(struct ida *ida, int starting_id, int *p_id)
-{
- struct idr_layer *pa[MAX_IDR_LEVEL + 1];
- struct ida_bitmap *bitmap;
- unsigned long flags;
- int idr_id = starting_id / IDA_BITMAP_BITS;
- int offset = starting_id % IDA_BITMAP_BITS;
- int t, id;
-
- restart:
- /* get vacant slot */
- t = idr_get_empty_slot(&ida->idr, idr_id, pa, 0, &ida->idr);
- if (t < 0)
- return t == -ENOMEM ? -EAGAIN : t;
-
- if (t * IDA_BITMAP_BITS >= MAX_IDR_BIT)
- return -ENOSPC;
-
- if (t != idr_id)
- offset = 0;
- idr_id = t;
-
- /* if bitmap isn't there, create a new one */
- bitmap = (void *)pa[0]->ary[idr_id & IDR_MASK];
- if (!bitmap) {
- spin_lock_irqsave(&ida->idr.lock, flags);
- bitmap = ida->free_bitmap;
- ida->free_bitmap = NULL;
- spin_unlock_irqrestore(&ida->idr.lock, flags);
-
- if (!bitmap)
- return -EAGAIN;
-
- memset(bitmap, 0, sizeof(struct ida_bitmap));
- rcu_assign_pointer(pa[0]->ary[idr_id & IDR_MASK],
- (void *)bitmap);
- pa[0]->count++;
- }
-
- /* lookup for empty slot */
- t = find_next_zero_bit(bitmap->bitmap, IDA_BITMAP_BITS, offset);
- if (t == IDA_BITMAP_BITS) {
- /* no empty slot after offset, continue to the next chunk */
- idr_id++;
- offset = 0;
- goto restart;
- }
-
- id = idr_id * IDA_BITMAP_BITS + t;
- if (id >= MAX_IDR_BIT)
- return -ENOSPC;
-
- __set_bit(t, bitmap->bitmap);
- if (++bitmap->nr_busy == IDA_BITMAP_BITS)
- idr_mark_full(pa, idr_id);
-
- *p_id = id;
-
- /* Each leaf node can handle nearly a thousand slots and the
- * whole idea of ida is to have small memory foot print.
- * Throw away extra resources one by one after each successful
- * allocation.
- */
- if (ida->idr.id_free_cnt || ida->free_bitmap) {
- struct idr_layer *p = get_from_free_list(&ida->idr);
- if (p)
- kmem_cache_free(idr_layer_cache, p);
- }
-
- return 0;
-}
-
-static void __ida_remove(struct ida *ida, int id)
-{
- struct idr_layer *p = ida->idr.top;
- int shift = (ida->idr.layers - 1) * IDR_BITS;
- int idr_id = id / IDA_BITMAP_BITS;
- int offset = id % IDA_BITMAP_BITS;
- int n;
- struct ida_bitmap *bitmap;
-
- /* clear full bits while looking up the leaf idr_layer */
- while ((shift > 0) && p) {
- n = (idr_id >> shift) & IDR_MASK;
- __clear_bit(n, p->bitmap);
- p = p->ary[n];
- shift -= IDR_BITS;
- }
-
- if (p == NULL)
- goto err;
-
- n = idr_id & IDR_MASK;
- __clear_bit(n, p->bitmap);
-
- bitmap = (void *)p->ary[n];
- if (!test_bit(offset, bitmap->bitmap))
- goto err;
-
- /* update bitmap and remove it if empty */
- __clear_bit(offset, bitmap->bitmap);
- if (--bitmap->nr_busy == 0) {
- __set_bit(n, p->bitmap); /* to please idr_remove() */
- idr_remove(&ida->idr, idr_id);
- free_bitmap(ida, bitmap);
- }
-
- return;
-
- err:
- WARN(1, "ida_remove called for id=%d which is not allocated.\n", id);
-}
-
-/**
- * ida_destroy - release all cached layers within an ida tree
- * @ida: ida handle
- */
-void ida_destroy(struct ida *ida)
-{
- idr_destroy(&ida->idr);
- kfree(ida->free_bitmap);
-}
-EXPORT_SYMBOL(ida_destroy);
-
-/**
- * ida_alloc_range - get a new id.
- * @ida: the (initialized) ida.
- * @start: the minimum id (inclusive, < 0x8000000)
- * @end: the maximum id (exclusive, < 0x8000000 or 0)
- * @gfp_mask: memory allocation flags
- *
- * Allocates an id in the range start <= id < end, or returns -ENOSPC.
- * On memory allocation failure, returns -ENOMEM.
- *
- * Use ida_remove() to get rid of an id.
- */
-int ida_alloc_range(struct ida *ida, unsigned int start, unsigned int end,
- gfp_t gfp_mask)
-{
- int ret, id;
- unsigned int max;
- unsigned long flags;
-
- BUG_ON((int)start < 0);
- BUG_ON((int)end < 0);
-
- if (end == 0)
- max = 0x80000000;
- else {
- BUG_ON(end < start);
- max = end - 1;
- }
-
-again:
- if (!ida_pre_get(ida, gfp_mask))
- return -ENOMEM;
-
- spin_lock_irqsave(&simple_ida_lock, flags);
- ret = ida_get_new_above(ida, start, &id);
- if (!ret) {
- if (id > max) {
- __ida_remove(ida, id);
- ret = -ENOSPC;
- } else {
- ret = id;
- }
- }
- spin_unlock_irqrestore(&simple_ida_lock, flags);
-
- if (unlikely(ret == -EAGAIN))
- goto again;
-
- return ret;
-}
-EXPORT_SYMBOL(ida_alloc_range);
-
-/**
- * ida_remove - remove an allocated id.
- * @ida: the (initialized) ida.
- * @id: the id returned by ida_alloc_range.
- */
-void ida_remove(struct ida *ida, unsigned int id)
-{
- unsigned long flags;
-
- BUG_ON((int)id < 0);
- spin_lock_irqsave(&simple_ida_lock, flags);
- __ida_remove(ida, id);
- spin_unlock_irqrestore(&simple_ida_lock, flags);
-}
-EXPORT_SYMBOL(ida_remove);
-
-/**
- * ida_init - initialize ida handle
- * @ida: ida handle
- *
- * This function is use to set up the handle (@ida) that you will pass
- * to the rest of the functions.
- */
-void ida_init(struct ida *ida)
-{
- memset(ida, 0, sizeof(struct ida));
- idr_init(&ida->idr);
-
-}
-EXPORT_SYMBOL(ida_init);
--
1.8.4.rc1
Percpu frontend for allocating ids. With percpu allocation (that works),
it's impossible to guarantee it will always be possible to allocate all
nr_tags - typically, some will be stuck on a remote percpu freelist
where the current job can't get to them.
We do guarantee that it will always be possible to allocate at least
(nr_tags / 2) tags - this is done by keeping track of which and how many
cpus have tags on their percpu freelists. On allocation failure if
enough cpus have tags that there could potentially be (nr_tags / 2) tags
stuck on remote percpu freelists, we then pick a remote cpu at random to
steal from.
Note that there's no cpu hotplug notifier - we don't care, because
steal_tags() will eventually get the down cpu's tags. We _could_ satisfy
more allocations if we had a notifier - but we'll still meet our
guarantees and it's absolutely not a correctness issue, so I don't think
it's worth the extra code.
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: "Nicholas A. Bellinger" <[email protected]>
---
include/linux/idr.h | 46 +++++++++
lib/idr.c | 282 ++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 328 insertions(+)
diff --git a/include/linux/idr.h b/include/linux/idr.h
index a310bb0..f5b889b 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -101,6 +101,52 @@ static inline void ida_init(struct ida *ida)
ida_init_prealloc(ida, 0);
}
+/* Percpu IDA/tag allocator */
+
+struct percpu_ida_cpu;
+
+struct percpu_ida {
+ /*
+ * number of tags available to be allocated, as passed to
+ * percpu_ida_init()
+ */
+ unsigned nr_tags;
+
+ struct percpu_ida_cpu __percpu *tag_cpu;
+
+ /*
+ * Bitmap of cpus that (may) have tags on their percpu freelists:
+ * steal_tags() uses this to decide when to steal tags, and which cpus
+ * to try stealing from.
+ *
+ * It's ok for a freelist to be empty when its bit is set - steal_tags()
+ * will just keep looking - but the bitmap _must_ be set whenever a
+ * percpu freelist does have tags.
+ */
+ unsigned long *cpus_have_tags;
+
+ struct {
+ /*
+ * When we go to steal tags from another cpu (see steal_tags()),
+ * we want to pick a cpu at random. Cycling through them every
+ * time we steal is a bit easier and more or less equivalent:
+ */
+ unsigned cpu_last_stolen;
+
+ /* For sleeping on allocation failure */
+ wait_queue_head_t wait;
+
+ /* Global freelist */
+ struct ida ida;
+ } ____cacheline_aligned_in_smp;
+};
+
+int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp);
+void percpu_ida_free(struct percpu_ida *pool, unsigned tag);
+
+void percpu_ida_destroy(struct percpu_ida *pool);
+int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags);
+
/* IDR */
/*
diff --git a/lib/idr.c b/lib/idr.c
index e5afab5..c94e29e 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -633,6 +633,288 @@ err:
}
EXPORT_SYMBOL(ida_init_prealloc);
+/* Percpu IDA */
+
+/*
+ * Number of tags we move between the percpu freelist and the global freelist at
+ * a time
+ */
+#define IDA_PCPU_BATCH_MOVE 32U
+
+/* Max size of percpu freelist, */
+#define IDA_PCPU_SIZE ((IDA_PCPU_BATCH_MOVE * 3) / 2)
+
+struct percpu_ida_cpu {
+ spinlock_t lock;
+ unsigned nr_free;
+ unsigned freelist[];
+};
+
+/*
+ * Try to steal tags from a remote cpu's percpu freelist.
+ *
+ * We first check how many percpu freelists have tags - we don't steal tags
+ * unless enough percpu freelists have tags on them that it's possible more than
+ * half the total tags could be stuck on remote percpu freelists.
+ *
+ * Then we iterate through the cpus until we find some tags - we don't attempt
+ * to find the "best" cpu to steal from, to keep cacheline bouncing to a
+ * minimum.
+ */
+static inline void steal_tags(struct percpu_ida *pool,
+ struct percpu_ida_cpu *tags)
+{
+ unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
+ struct percpu_ida_cpu *remote;
+
+ for (cpus_have_tags = bitmap_weight(pool->cpus_have_tags, nr_cpu_ids);
+ cpus_have_tags * IDA_PCPU_SIZE > pool->nr_tags / 2;
+ cpus_have_tags--) {
+ cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids, cpu);
+
+ if (cpu == nr_cpu_ids)
+ cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids);
+
+ if (cpu == nr_cpu_ids)
+ BUG();
+
+ pool->cpu_last_stolen = cpu;
+ remote = per_cpu_ptr(pool->tag_cpu, cpu);
+
+ clear_bit(cpu, pool->cpus_have_tags);
+
+ if (remote == tags)
+ continue;
+
+ spin_lock(&remote->lock);
+
+ if (remote->nr_free) {
+ memcpy(tags->freelist,
+ remote->freelist,
+ sizeof(unsigned) * remote->nr_free);
+
+ tags->nr_free = remote->nr_free;
+ remote->nr_free = 0;
+ }
+
+ spin_unlock(&remote->lock);
+
+ if (tags->nr_free)
+ break;
+ }
+}
+
+static inline void alloc_global_tags(struct percpu_ida *pool,
+ struct percpu_ida_cpu *tags)
+{
+ int nr_free = __ida_alloc_range_multiple(&pool->ida, tags->freelist,
+ IDA_PCPU_BATCH_MOVE, 0,
+ pool->nr_tags, GFP_NOWAIT,
+ NULL);
+ if (nr_free > 0)
+ tags->nr_free = nr_free;
+}
+
+static inline unsigned alloc_local_tag(struct percpu_ida *pool,
+ struct percpu_ida_cpu *tags)
+{
+ int tag = -ENOSPC;
+
+ spin_lock(&tags->lock);
+ if (tags->nr_free)
+ tag = tags->freelist[--tags->nr_free];
+ spin_unlock(&tags->lock);
+
+ return tag;
+}
+
+/**
+ * percpu_ida_alloc - allocate a tag
+ * @pool: pool to allocate from
+ * @gfp: gfp flags
+ *
+ * Returns a tag - an integer in the range [0..nr_tags) (passed to
+ * tag_pool_init()), or otherwise -ENOSPC on allocation failure.
+ *
+ * Safe to be called from interrupt context (assuming it isn't passed
+ * __GFP_WAIT, of course).
+ *
+ * Will not fail if passed __GFP_WAIT.
+ */
+int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
+{
+ DEFINE_WAIT(wait);
+ struct percpu_ida_cpu *tags;
+ unsigned long flags;
+ unsigned this_cpu;
+ int tag;
+
+ local_irq_save(flags);
+ this_cpu = smp_processor_id();
+ tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
+
+ /* Fastpath */
+ tag = alloc_local_tag(pool, tags);
+ if (likely(tag >= 0)) {
+ local_irq_restore(flags);
+ return tag;
+ }
+
+ while (1) {
+ spin_lock(&pool->ida.lock);
+
+ /*
+ * prepare_to_wait() must come before steal_tags(), in case
+ * percpu_ida_free() on another cpu flips a bit in
+ * cpus_have_tags
+ *
+ * global lock held and irqs disabled, don't need percpu lock
+ */
+ prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
+
+ if (!tags->nr_free)
+ alloc_global_tags(pool, tags);
+ if (!tags->nr_free)
+ steal_tags(pool, tags);
+
+ if (tags->nr_free) {
+ tag = tags->freelist[--tags->nr_free];
+ if (tags->nr_free)
+ set_bit(this_cpu, pool->cpus_have_tags);
+ }
+
+ spin_unlock(&pool->ida.lock);
+ local_irq_restore(flags);
+
+ if (tag >= 0 || !(gfp & __GFP_WAIT))
+ break;
+
+ schedule();
+
+ local_irq_save(flags);
+ this_cpu = smp_processor_id();
+ tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
+ }
+
+ finish_wait(&pool->wait, &wait);
+ return tag;
+}
+EXPORT_SYMBOL_GPL(percpu_ida_alloc);
+
+/**
+ * percpu_ida_free - free a tag
+ * @pool: pool @tag was allocated from
+ * @tag: a tag previously allocated with percpu_ida_alloc()
+ *
+ * Safe to be called from interrupt context.
+ */
+void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
+{
+ struct percpu_ida_cpu *tags;
+ unsigned long flags;
+ unsigned nr_free, this_cpu;
+
+ BUG_ON(tag >= pool->nr_tags);
+
+ local_irq_save(flags);
+ this_cpu = smp_processor_id();
+ tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
+
+ spin_lock(&tags->lock);
+ tags->freelist[tags->nr_free++] = tag;
+
+ nr_free = tags->nr_free;
+ spin_unlock(&tags->lock);
+
+ if (nr_free == 1) {
+ set_bit(this_cpu, pool->cpus_have_tags);
+ wake_up(&pool->wait);
+ }
+
+ if (nr_free == IDA_PCPU_SIZE) {
+ spin_lock(&pool->ida.lock);
+
+ /*
+ * Global lock held and irqs disabled, don't need percpu
+ * lock
+ */
+ while (tags->nr_free > IDA_PCPU_SIZE - IDA_PCPU_BATCH_MOVE)
+ __ida_remove(&pool->ida,
+ tags->freelist[--tags->nr_free]);
+
+ wake_up(&pool->wait);
+ spin_unlock(&pool->ida.lock);
+ }
+
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(percpu_ida_free);
+
+/**
+ * percpu_ida_destroy - release a tag pool's resources
+ * @pool: pool to free
+ *
+ * Frees the resources allocated by percpu_ida_init().
+ */
+void percpu_ida_destroy(struct percpu_ida *pool)
+{
+ free_percpu(pool->tag_cpu);
+ kfree(pool->cpus_have_tags);
+ ida_destroy(&pool->ida);
+}
+EXPORT_SYMBOL_GPL(percpu_ida_destroy);
+
+/**
+ * percpu_ida_init - initialize a percpu tag pool
+ * @pool: pool to initialize
+ * @nr_tags: number of tags that will be available for allocation
+ *
+ * Initializes @pool so that it can be used to allocate tags - integers in the
+ * range [0, nr_tags). Typically, they'll be used by driver code to refer to a
+ * preallocated array of tag structures.
+ *
+ * Allocation is percpu, but sharding is limited by nr_tags - for best
+ * performance, the workload should not span more cpus than nr_tags / 128.
+ */
+int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags)
+{
+ unsigned cpu;
+
+ memset(pool, 0, sizeof(*pool));
+
+ init_waitqueue_head(&pool->wait);
+ pool->nr_tags = nr_tags;
+
+ /* Guard against overflow */
+ if (nr_tags > (unsigned) INT_MAX + 1) {
+ pr_err("tags.c: nr_tags too large\n");
+ return -EINVAL;
+ }
+
+ if (ida_init_prealloc(&pool->ida, nr_tags))
+ return -ENOMEM;
+
+ pool->cpus_have_tags = kzalloc(BITS_TO_LONGS(nr_cpu_ids) *
+ sizeof(unsigned long), GFP_KERNEL);
+ if (!pool->cpus_have_tags)
+ goto err;
+
+ pool->tag_cpu = __alloc_percpu(sizeof(struct percpu_ida_cpu) +
+ IDA_PCPU_SIZE * sizeof(unsigned),
+ sizeof(unsigned));
+ if (!pool->tag_cpu)
+ goto err;
+
+ for_each_possible_cpu(cpu)
+ spin_lock_init(&per_cpu_ptr(pool->tag_cpu, cpu)->lock);
+
+ return 0;
+err:
+ percpu_ida_destroy(pool);
+ return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(percpu_ida_init);
+
/* IDR */
#define MAX_IDR_SHIFT (sizeof(int) * 8 - 1)
--
1.8.4.rc1
The deprecated idr interfaces don't have any in kernel users, so let's
delete them as prep work for the idr rewrite.
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
---
include/linux/idr.h | 63 -----------------------------------------------------
lib/idr.c | 36 +++---------------------------
2 files changed, 3 insertions(+), 96 deletions(-)
diff --git a/include/linux/idr.h b/include/linux/idr.h
index f5b889b..b26f8b1 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -264,69 +264,6 @@ static inline void *idr_find(struct idr *idr, int id)
#define idr_for_each_entry(idp, entry, id) \
for (id = 0; ((entry) = idr_get_next(idp, &(id))) != NULL; ++id)
-/*
- * Don't use the following functions. These exist only to suppress
- * deprecated warnings on EXPORT_SYMBOL()s.
- */
-int __idr_pre_get(struct idr *idp, gfp_t gfp_mask);
-int __idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id);
-void __idr_remove_all(struct idr *idp);
-
-/**
- * idr_pre_get - reserve resources for idr allocation
- * @idp: idr handle
- * @gfp_mask: memory allocation flags
- *
- * Part of old alloc interface. This is going away. Use
- * idr_preload[_end]() and idr_alloc() instead.
- */
-static inline int __deprecated idr_pre_get(struct idr *idp, gfp_t gfp_mask)
-{
- return __idr_pre_get(idp, gfp_mask);
-}
-
-/**
- * idr_get_new_above - allocate new idr entry above or equal to a start id
- * @idp: idr handle
- * @ptr: pointer you want associated with the id
- * @starting_id: id to start search at
- * @id: pointer to the allocated handle
- *
- * Part of old alloc interface. This is going away. Use
- * idr_preload[_end]() and idr_alloc() instead.
- */
-static inline int __deprecated idr_get_new_above(struct idr *idp, void *ptr,
- int starting_id, int *id)
-{
- return __idr_get_new_above(idp, ptr, starting_id, id);
-}
-
-/**
- * idr_get_new - allocate new idr entry
- * @idp: idr handle
- * @ptr: pointer you want associated with the id
- * @id: pointer to the allocated handle
- *
- * Part of old alloc interface. This is going away. Use
- * idr_preload[_end]() and idr_alloc() instead.
- */
-static inline int __deprecated idr_get_new(struct idr *idp, void *ptr, int *id)
-{
- return __idr_get_new_above(idp, ptr, 0, id);
-}
-
-/**
- * idr_remove_all - remove all ids from the given idr tree
- * @idp: idr handle
- *
- * If you're trying to destroy @idp, calling idr_destroy() is enough.
- * This is going away. Don't use.
- */
-static inline void __deprecated idr_remove_all(struct idr *idp)
-{
- __idr_remove_all(idp);
-}
-
void __init idr_init_cache(void);
#endif /* __IDR_H__ */
diff --git a/lib/idr.c b/lib/idr.c
index c94e29e..3f68665 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -1074,19 +1074,6 @@ static void idr_mark_full(struct idr_layer **pa, int id)
}
}
-int __idr_pre_get(struct idr *idp, gfp_t gfp_mask)
-{
- while (idp->id_free_cnt < MAX_IDR_FREE) {
- struct idr_layer *new;
- new = kmem_cache_zalloc(idr_layer_cache, gfp_mask);
- if (new == NULL)
- return (0);
- move_to_free_list(idp, new);
- }
- return 1;
-}
-EXPORT_SYMBOL(__idr_pre_get);
-
/**
* sub_alloc - try to allocate an id without growing the tree depth
* @idp: idr handle
@@ -1252,21 +1239,6 @@ static void idr_fill_slot(struct idr *idr, void *ptr, int id,
idr_mark_full(pa, id);
}
-int __idr_get_new_above(struct idr *idp, void *ptr, int starting_id, int *id)
-{
- struct idr_layer *pa[MAX_IDR_LEVEL + 1];
- int rv;
-
- rv = idr_get_empty_slot(idp, starting_id, pa, 0, idp);
- if (rv < 0)
- return rv == -ENOMEM ? -EAGAIN : rv;
-
- idr_fill_slot(idp, ptr, rv, pa);
- *id = rv;
- return 0;
-}
-EXPORT_SYMBOL(__idr_get_new_above);
-
/**
* idr_preload - preload for idr_alloc()
* @gfp_mask: allocation mask to use for preloading
@@ -1485,7 +1457,7 @@ void idr_remove(struct idr *idp, int id)
}
EXPORT_SYMBOL(idr_remove);
-void __idr_remove_all(struct idr *idp)
+static void __idr_remove_all(struct idr *idp)
{
int n, id, max;
int bt_mask;
@@ -1518,7 +1490,6 @@ void __idr_remove_all(struct idr *idp)
}
idp->layers = 0;
}
-EXPORT_SYMBOL(__idr_remove_all);
/**
* idr_destroy - release all cached layers within an idr tree
@@ -1580,13 +1551,12 @@ EXPORT_SYMBOL(idr_find_slowpath);
* callback function will be called for each pointer currently
* registered, passing the id, the pointer and the data pointer passed
* to this function. It is not safe to modify the idr tree while in
- * the callback, so functions such as idr_get_new and idr_remove are
- * not allowed.
+ * the callback, so functions such as idr_remove are not allowed.
*
* We check the return of @fn each time. If it returns anything other
* than %0, we break out and return that value.
*
- * The caller must serialize idr_for_each() vs idr_get_new() and idr_remove().
+ * The caller must serialize idr_for_each() vs idr_remove().
*/
int idr_for_each(struct idr *idp,
int (*fn)(int id, void *p, void *data), void *data)
--
1.8.4.rc1
get() implies taking a ref or sometimes an allocation, which this
function definitely does not do - rename it to something more sensible.
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
---
drivers/block/drbd/drbd_main.c | 2 +-
drivers/block/drbd/drbd_nl.c | 2 +-
drivers/mtd/mtdcore.c | 2 +-
include/linux/idr.h | 4 ++--
kernel/cgroup.c | 2 +-
lib/idr.c | 6 +++---
6 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 55635ed..73f4765 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -500,7 +500,7 @@ int conn_lowest_minor(struct drbd_tconn *tconn)
int vnr = 0, m;
rcu_read_lock();
- mdev = idr_get_next(&tconn->volumes, &vnr);
+ mdev = idr_find_next(&tconn->volumes, &vnr);
m = mdev ? mdev_to_minor(mdev) : -1;
rcu_read_unlock();
diff --git a/drivers/block/drbd/drbd_nl.c b/drivers/block/drbd/drbd_nl.c
index 8cc1e64..936da36 100644
--- a/drivers/block/drbd/drbd_nl.c
+++ b/drivers/block/drbd/drbd_nl.c
@@ -2938,7 +2938,7 @@ int get_one_status(struct sk_buff *skb, struct netlink_callback *cb)
}
if (tconn) {
next_tconn:
- mdev = idr_get_next(&tconn->volumes, &volume);
+ mdev = idr_find_next(&tconn->volumes, &volume);
if (!mdev) {
/* No more volumes to dump on this tconn.
* Advance tconn iterator. */
diff --git a/drivers/mtd/mtdcore.c b/drivers/mtd/mtdcore.c
index 048c823..8d64363 100644
--- a/drivers/mtd/mtdcore.c
+++ b/drivers/mtd/mtdcore.c
@@ -91,7 +91,7 @@ EXPORT_SYMBOL_GPL(mtd_table_mutex);
struct mtd_info *__mtd_next_device(int i)
{
- return idr_get_next(&mtd_idr, &i);
+ return idr_find_next(&mtd_idr, &i);
}
EXPORT_SYMBOL_GPL(__mtd_next_device);
diff --git a/include/linux/idr.h b/include/linux/idr.h
index b26f8b1..6395da1 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -211,7 +211,7 @@ int idr_alloc(struct idr *idp, void *ptr, int start, int end, gfp_t gfp_mask);
int idr_alloc_cyclic(struct idr *idr, void *ptr, int start, int end, gfp_t gfp_mask);
int idr_for_each(struct idr *idp,
int (*fn)(int id, void *p, void *data), void *data);
-void *idr_get_next(struct idr *idp, int *nextid);
+void *idr_find_next(struct idr *idp, int *nextid);
void *idr_replace(struct idr *idp, void *ptr, int id);
void idr_remove(struct idr *idp, int id);
void idr_free(struct idr *idp, int id);
@@ -262,7 +262,7 @@ static inline void *idr_find(struct idr *idr, int id)
* is convenient for a "not found" value.
*/
#define idr_for_each_entry(idp, entry, id) \
- for (id = 0; ((entry) = idr_get_next(idp, &(id))) != NULL; ++id)
+ for (id = 0; ((entry) = idr_find_next(idp, &(id))) != NULL; ++id)
void __init idr_init_cache(void);
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index bac5312..7b397f8 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1868,7 +1868,7 @@ int task_cgroup_path(struct task_struct *task, char *buf, size_t buflen)
mutex_lock(&cgroup_mutex);
- root = idr_get_next(&cgroup_hierarchy_idr, &hierarchy_id);
+ root = idr_find_next(&cgroup_hierarchy_idr, &hierarchy_id);
if (root) {
cgrp = task_cgroup_from_root(task, root);
diff --git a/lib/idr.c b/lib/idr.c
index 3f68665..1d67cdd 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -1596,7 +1596,7 @@ int idr_for_each(struct idr *idp,
EXPORT_SYMBOL(idr_for_each);
/**
- * idr_get_next - lookup next object of id to given id.
+ * idr_find_next - lookup next object of id to given id.
* @idp: idr handle
* @nextidp: pointer to lookup key
*
@@ -1607,7 +1607,7 @@ EXPORT_SYMBOL(idr_for_each);
* This function can be called under rcu_read_lock(), given that the leaf
* pointers lifetimes are correctly managed.
*/
-void *idr_get_next(struct idr *idp, int *nextidp)
+void *idr_find_next(struct idr *idp, int *nextidp)
{
struct idr_layer *p, *pa[MAX_IDR_LEVEL + 1];
struct idr_layer **paa = &pa[0];
@@ -1648,7 +1648,7 @@ void *idr_get_next(struct idr *idp, int *nextidp)
}
return NULL;
}
-EXPORT_SYMBOL(idr_get_next);
+EXPORT_SYMBOL(idr_find_next);
/**
--
1.8.4.rc1
The old idr code was really a second radix tree implementation - we
already have one in lib/radix-tree.c.
This patch reimplements idr on top of our existing radix trees, using
our shiny new ida implementation for allocating/freeing the ids. The old
idr code was noticably slower than lib/radix-tree.c in at least some
benchmarks, so in addition to being ~500 lines less code this patch
should improve performance too.
There's one thing left unfinished in this patch - the existing
idr_preload() interface won't work for ida. Another patch on top of this
will fix idr_preload() and update existing users to the new interface.
Signed-off-by: Kent Overstreet <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Tejun Heo <[email protected]>
---
include/linux/idr.h | 159 +++++-----
init/main.c | 1 -
lib/idr.c | 890 ++++++++++------------------------------------------
3 files changed, 246 insertions(+), 804 deletions(-)
diff --git a/include/linux/idr.h b/include/linux/idr.h
index 4926f36..85355d7 100644
--- a/include/linux/idr.h
+++ b/include/linux/idr.h
@@ -1,6 +1,6 @@
/*
* include/linux/idr.h
- *
+ *
* 2002-10-18 written by Jim Houston [email protected]
* Copyright (C) 2002 by Concurrent Computer Corporation
* Distributed under the GNU GPL license version 2.
@@ -12,10 +12,8 @@
#ifndef __IDR_H__
#define __IDR_H__
-#include <linux/types.h>
-#include <linux/bitops.h>
-#include <linux/init.h>
-#include <linux/rcupdate.h>
+#include <linux/gfp.h>
+#include <linux/radix-tree.h>
#include <linux/spinlock_types.h>
#include <linux/wait.h>
@@ -149,76 +147,42 @@ int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags);
/* IDR */
-/*
- * We want shallower trees and thus more bits covered at each layer. 8
- * bits gives us large enough first layer for most use cases and maximum
- * tree depth of 4. Each idr_layer is slightly larger than 2k on 64bit and
- * 1k on 32bit.
+/**
+ * DOC: idr sync
+ * idr synchronization (stolen from radix-tree.h)
+ *
+ * idr_alloc() and idr_remove() do their own locking internally - the user need
+ * not be concerned with synchronization unless there's other operations that
+ * need to be done atomically.
+ *
+ * idr_find() does no locking - it can be called locklessly using RCU, if the
+ * caller ensures calls to this function are made within rcu_read_lock()
+ * regions and does all the other appropriate RCU stuff.
*/
-#define IDR_BITS 8
-#define IDR_SIZE (1 << IDR_BITS)
-#define IDR_MASK ((1 << IDR_BITS)-1)
-
-struct idr_layer {
- int prefix; /* the ID prefix of this idr_layer */
- DECLARE_BITMAP(bitmap, IDR_SIZE); /* A zero bit means "space here" */
- struct idr_layer __rcu *ary[1<<IDR_BITS];
- int count; /* When zero, we can release it */
- int layer; /* distance from leaf */
- struct rcu_head rcu_head;
-};
struct idr {
- struct idr_layer __rcu *hint; /* the last layer allocated from */
- struct idr_layer __rcu *top;
- struct idr_layer *id_free;
- int layers; /* only valid w/o concurrent changes */
- int id_free_cnt;
- int cur; /* current pos for cyclic allocation */
- spinlock_t lock;
+ struct ida ida;
+ struct radix_tree_root ptrs;
};
#define IDR_INIT(name) \
{ \
- .lock = __SPIN_LOCK_UNLOCKED(name.lock), \
+ .ida = IDA_INIT(name.ida), \
+ .ptrs = RADIX_TREE_INIT(GFP_NOWAIT), \
}
#define DEFINE_IDR(name) struct idr name = IDR_INIT(name)
-/**
- * DOC: idr sync
- * idr synchronization (stolen from radix-tree.h)
- *
- * idr_find() is able to be called locklessly, using RCU. The caller must
- * ensure calls to this function are made within rcu_read_lock() regions.
- * Other readers (lock-free or otherwise) and modifications may be running
- * concurrently.
- *
- * It is still required that the caller manage the synchronization and
- * lifetimes of the items. So if RCU lock-free lookups are used, typically
- * this would mean that the items have their own locks, or are amenable to
- * lock-free access; and that the items are freed by RCU (or only freed after
- * having been deleted from the idr tree *and* a synchronize_rcu() grace
- * period).
- */
-
-/*
- * This is what we export.
- */
-
-void *idr_find_slowpath(struct idr *idp, int id);
-void idr_preload(gfp_t gfp_mask);
-int idr_alloc_range(struct idr *idp, void *ptr, int start,
- int end, gfp_t gfp_mask);
-int idr_alloc_cyclic(struct idr *idr, void *ptr, int start,
- int end, gfp_t gfp_mask);
-int idr_for_each(struct idr *idp,
+void *idr_find_next(struct idr *idr, int *nextid);
+int idr_for_each(struct idr *idr,
int (*fn)(int id, void *p, void *data), void *data);
-void *idr_find_next(struct idr *idp, int *nextid);
-void *idr_replace(struct idr *idp, void *ptr, int id);
-void idr_remove(struct idr *idp, int id);
-void idr_free(struct idr *idp, int id);
-void idr_destroy(struct idr *idp);
-void idr_init(struct idr *idp);
+void *idr_replace(struct idr *idr, void *ptr, unsigned id);
+void idr_remove(struct idr *idr, unsigned id);
+int idr_alloc_range(struct idr *idr, void *ptr, unsigned start,
+ unsigned end, gfp_t gfp);
+int idr_alloc_cyclic(struct idr *idr, void *ptr, unsigned start,
+ unsigned end, gfp_t gfp_mask);
+void idr_destroy(struct idr *idr);
+void idr_init(struct idr *idr);
static inline int idr_alloc(struct idr *idr, void *ptr, gfp_t gfp)
{
@@ -233,7 +197,53 @@ static inline int idr_alloc(struct idr *idr, void *ptr, gfp_t gfp)
*/
static inline void idr_preload_end(void)
{
- preempt_enable();
+ radix_tree_preload_end();
+}
+
+/**
+ * idr_preload - preload for idr_alloc_range()
+ * @gfp: allocation mask to use for preloading
+ *
+ * Preload per-cpu layer buffer for idr_alloc_range(). Can only be used from
+ * process context and each idr_preload() invocation should be matched with
+ * idr_preload_end(). Note that preemption is disabled while preloaded.
+ *
+ * The first idr_alloc_range() in the preloaded section can be treated as if it
+ * were invoked with @gfp_mask used for preloading. This allows using more
+ * permissive allocation masks for idrs protected by spinlocks.
+ *
+ * For example, if idr_alloc_range() below fails, the failure can be treated as
+ * if idr_alloc_range() were called with GFP_KERNEL rather than GFP_NOWAIT.
+ *
+ * idr_preload(GFP_KERNEL);
+ * spin_lock(lock);
+ *
+ * id = idr_alloc_range(idr, ptr, start, end, GFP_NOWAIT);
+ *
+ * spin_unlock(lock);
+ * idr_preload_end();
+ * if (id < 0)
+ * error;
+ */
+static inline void idr_preload(gfp_t gfp)
+{
+ might_sleep_if(gfp & __GFP_WAIT);
+
+ /* Well this is horrible, but idr_preload doesn't return errors */
+ if (radix_tree_preload(gfp))
+ preempt_disable();
+}
+
+/* radix tree can't store NULL pointers, so we have to translate... */
+static inline void *__radix_idr_ptr(void *ptr)
+{
+ return ptr != (void *) (~0UL & ~RADIX_TREE_INDIRECT_PTR)
+ ? ptr : NULL;
+}
+
+static inline void *__idr_radix_ptr(void *ptr)
+{
+ return ptr ?: (void *) (~0UL & ~RADIX_TREE_INDIRECT_PTR);
}
/**
@@ -243,24 +253,19 @@ static inline void idr_preload_end(void)
*
* Return the pointer given the id it has been registered with. A %NULL
* return indicates that @id is not valid or you passed %NULL in
- * idr_get_new().
+ * idr_alloc().
*
* This function can be called under rcu_read_lock(), given that the leaf
* pointers lifetimes are correctly managed.
*/
-static inline void *idr_find(struct idr *idr, int id)
+static inline void *idr_find(struct idr *idr, unsigned id)
{
- struct idr_layer *hint = rcu_dereference_raw(idr->hint);
-
- if (hint && (id & ~IDR_MASK) == hint->prefix)
- return rcu_dereference_raw(hint->ary[id & IDR_MASK]);
-
- return idr_find_slowpath(idr, id);
+ return __radix_idr_ptr(radix_tree_lookup(&idr->ptrs, id));
}
/**
* idr_for_each_entry - iterate over an idr's elements of a given type
- * @idp: idr handle
+ * @idr: idr handle
* @entry: the type * to use as cursor
* @id: id entry's key
*
@@ -268,9 +273,7 @@ static inline void *idr_find(struct idr *idr, int id)
* after normal terminatinon @entry is left with the value NULL. This
* is convenient for a "not found" value.
*/
-#define idr_for_each_entry(idp, entry, id) \
- for (id = 0; ((entry) = idr_find_next(idp, &(id))) != NULL; ++id)
-
-void __init idr_init_cache(void);
+#define idr_for_each_entry(idr, entry, id) \
+ for (id = 0; ((entry) = idr_find_next(idr, &(id))) != NULL; ++id)
#endif /* __IDR_H__ */
diff --git a/init/main.c b/init/main.c
index d03d2ec..6b44887 100644
--- a/init/main.c
+++ b/init/main.c
@@ -542,7 +542,6 @@ asmlinkage void __init start_kernel(void)
preempt_disable();
if (WARN(!irqs_disabled(), "Interrupts were enabled *very* early, fixing it\n"))
local_irq_disable();
- idr_init_cache();
rcu_init();
tick_nohz_init();
radix_tree_init();
diff --git a/lib/idr.c b/lib/idr.c
index 5393aa1..89ec59f 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -8,24 +8,10 @@
*
* Modified by Nadia Derbey to make it RCU safe.
*
- * IDA completely rewritten by Kent Overstreet <[email protected]>
+ * Completely rewritten by Kent Overstreet <[email protected]>.
*
- * Small id to pointer translation service.
- *
- * It uses a radix tree like structure as a sparse array indexed
- * by the id to obtain the pointer. The bitmap makes allocating
- * a new id quick.
- *
- * You call it to allocate an id (an int) an associate with that id a
- * pointer or what ever, we treat it as a (void *). You can pass this
- * id to a user for him to pass back at a later time. You then pass
- * that id to this code and it returns your pointer.
-
- * You can release ids at any time. When all ids are released, most of
- * the memory is returned (we keep MAX_IDR_FREE) in a local pool so we
- * don't need to go to the memory "store" during an id allocate, just
- * so you don't need to be too concerned about locking and conflicts
- * with the slab allocator.
+ * id allocator (scalable/resizable bitmap, essentially), and also idr which
+ * combines ida with a radix tree to map pointers to small integers for you.
*/
#include <linux/bitmap.h>
@@ -33,11 +19,10 @@
#include <linux/bug.h>
#include <linux/err.h>
#include <linux/export.h>
-#include <linux/hardirq.h>
#include <linux/idr.h>
-#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/percpu.h>
+#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/string.h>
@@ -915,389 +900,158 @@ err:
}
EXPORT_SYMBOL_GPL(percpu_ida_init);
-/* IDR */
-
-#define MAX_IDR_SHIFT (sizeof(int) * 8 - 1)
-#define MAX_IDR_BIT (1U << MAX_IDR_SHIFT)
-
-/* Leave the possibility of an incomplete final layer */
-#define MAX_IDR_LEVEL ((MAX_IDR_SHIFT + IDR_BITS - 1) / IDR_BITS)
-
-/* Number of id_layer structs to leave in free list */
-#define MAX_IDR_FREE (MAX_IDR_LEVEL * 2)
-
-static struct kmem_cache *idr_layer_cache;
-static DEFINE_PER_CPU(struct idr_layer *, idr_preload_head);
-static DEFINE_PER_CPU(int, idr_preload_cnt);
-
-/* the maximum ID which can be allocated given idr->layers */
-static int idr_max(int layers)
-{
- int bits = min_t(int, layers * IDR_BITS, MAX_IDR_SHIFT);
-
- return (1 << bits) - 1;
-}
-
-/*
- * Prefix mask for an idr_layer at @layer. For layer 0, the prefix mask is
- * all bits except for the lower IDR_BITS. For layer 1, 2 * IDR_BITS, and
- * so on.
+/**
+ * DOC: IDR description
+ * IDR: Maps ids (small integers) to pointers.
+ *
+ * This merely combines ida (id allocation) with a radix tree; idr_alloc()
+ * stores a pointer, and returns you a small integer by which you can refer to
+ * it.
+ *
+ * It'll give you the smallest available integer (within a specified range if
+ * you use idr_alloc_range()) - there's also idr_alloc_cyclic() if you don't
+ * want ids to be reused right away.
+ *
+ * id -> pointer mappings can be deleted with idr_remove().
*/
-static int idr_layer_prefix_mask(int layer)
-{
- return ~idr_max(layer + 1);
-}
-
-static struct idr_layer *get_from_free_list(struct idr *idp)
-{
- struct idr_layer *p;
- unsigned long flags;
-
- spin_lock_irqsave(&idp->lock, flags);
- if ((p = idp->id_free)) {
- idp->id_free = p->ary[0];
- idp->id_free_cnt--;
- p->ary[0] = NULL;
- }
- spin_unlock_irqrestore(&idp->lock, flags);
- return(p);
-}
/**
- * idr_layer_alloc - allocate a new idr_layer
- * @gfp_mask: allocation mask
- * @layer_idr: optional idr to allocate from
+ * idr_find_next - lookup next object of id to given id.
+ * @idr: idr handle
+ * @nextidp: pointer to lookup key
*
- * If @layer_idr is %NULL, directly allocate one using @gfp_mask or fetch
- * one from the per-cpu preload buffer. If @layer_idr is not %NULL, fetch
- * an idr_layer from @idr->id_free.
+ * Returns pointer to registered object with id, which is next number to
+ * given id. After being looked up, *@nextidp will be updated for the next
+ * iteration.
*
- * @layer_idr is to maintain backward compatibility with the old alloc
- * interface - idr_pre_get() and idr_get_new*() - and will be removed
- * together with per-pool preload buffer.
+ * This function can be called under rcu_read_lock(), given that the leaf
+ * pointers lifetimes are correctly managed.
*/
-static struct idr_layer *idr_layer_alloc(gfp_t gfp_mask, struct idr *layer_idr)
-{
- struct idr_layer *new;
-
- /* this is the old path, bypass to get_from_free_list() */
- if (layer_idr)
- return get_from_free_list(layer_idr);
-
- /*
- * Try to allocate directly from kmem_cache. We want to try this
- * before preload buffer; otherwise, non-preloading idr_alloc_range()
- * users will end up taking advantage of preloading ones. As the
- * following is allowed to fail for preloaded cases, suppress
- * warning this time.
- */
- new = kmem_cache_zalloc(idr_layer_cache, gfp_mask | __GFP_NOWARN);
- if (new)
- return new;
-
- /*
- * Try to fetch one from the per-cpu preload buffer if in process
- * context. See idr_preload() for details.
- */
- if (!in_interrupt()) {
- preempt_disable();
- new = __this_cpu_read(idr_preload_head);
- if (new) {
- __this_cpu_write(idr_preload_head, new->ary[0]);
- __this_cpu_dec(idr_preload_cnt);
- new->ary[0] = NULL;
- }
- preempt_enable();
- if (new)
- return new;
- }
-
- /*
- * Both failed. Try kmem_cache again w/o adding __GFP_NOWARN so
- * that memory allocation failure warning is printed as intended.
- */
- return kmem_cache_zalloc(idr_layer_cache, gfp_mask);
-}
-
-static void idr_layer_rcu_free(struct rcu_head *head)
+void *idr_find_next(struct idr *idr, int *nextidp)
{
- struct idr_layer *layer;
+ void **slot;
+ struct radix_tree_iter iter;
+ void *ret = NULL;
- layer = container_of(head, struct idr_layer, rcu_head);
- kmem_cache_free(idr_layer_cache, layer);
-}
+ rcu_read_lock();
-static inline void free_layer(struct idr *idr, struct idr_layer *p)
-{
- if (idr->hint && idr->hint == p)
- RCU_INIT_POINTER(idr->hint, NULL);
- call_rcu(&p->rcu_head, idr_layer_rcu_free);
-}
-
-/* only called when idp->lock is held */
-static void __move_to_free_list(struct idr *idp, struct idr_layer *p)
-{
- p->ary[0] = idp->id_free;
- idp->id_free = p;
- idp->id_free_cnt++;
-}
+ radix_tree_for_each_slot(slot, &idr->ptrs, &iter, *nextidp) {
+ *nextidp = iter.index;
+ ret = radix_tree_deref_slot(slot);
+ break;
+ }
-static void move_to_free_list(struct idr *idp, struct idr_layer *p)
-{
- unsigned long flags;
+ rcu_read_unlock();
- /*
- * Depends on the return element being zeroed.
- */
- spin_lock_irqsave(&idp->lock, flags);
- __move_to_free_list(idp, p);
- spin_unlock_irqrestore(&idp->lock, flags);
-}
-
-static void idr_mark_full(struct idr_layer **pa, int id)
-{
- struct idr_layer *p = pa[0];
- int l = 0;
-
- __set_bit(id & IDR_MASK, p->bitmap);
- /*
- * If this layer is full mark the bit in the layer above to
- * show that this part of the radix tree is full. This may
- * complete the layer above and require walking up the radix
- * tree.
- */
- while (bitmap_full(p->bitmap, IDR_SIZE)) {
- if (!(p = pa[++l]))
- break;
- id = id >> IDR_BITS;
- __set_bit((id & IDR_MASK), p->bitmap);
- }
+ return __radix_idr_ptr(ret);
}
+EXPORT_SYMBOL(idr_find_next);
/**
- * sub_alloc - try to allocate an id without growing the tree depth
- * @idp: idr handle
- * @starting_id: id to start search at
- * @pa: idr_layer[MAX_IDR_LEVEL] used as backtrack buffer
- * @gfp_mask: allocation mask for idr_layer_alloc()
- * @layer_idr: optional idr passed to idr_layer_alloc()
+ * idr_for_each - iterate through all stored pointers
+ * @idr: idr handle
+ * @fn: function to be called for each pointer
+ * @data: data passed back to callback function
+ *
+ * Iterate over the pointers registered with the given idr. The
+ * callback function will be called for each pointer currently
+ * registered, passing the id, the pointer and the data pointer passed
+ * to this function. It is not safe to modify the idr tree while in
+ * the callback, so functions such as idr_remove are not allowed.
*
- * Allocate an id in range [@starting_id, INT_MAX] from @idp without
- * growing its depth. Returns
+ * We check the return of @fn each time. If it returns anything other
+ * than %0, we break out and return that value.
*
- * the allocated id >= 0 if successful,
- * -EAGAIN if the tree needs to grow for allocation to succeed,
- * -ENOSPC if the id space is exhausted,
- * -ENOMEM if more idr_layers need to be allocated.
+ * The caller must serialize idr_for_each() vs idr_remove().
*/
-static int sub_alloc(struct idr *idp, int *starting_id, struct idr_layer **pa,
- gfp_t gfp_mask, struct idr *layer_idr)
+int idr_for_each(struct idr *idr,
+ int (*fn)(int id, void *p, void *data), void *data)
{
- int n, m, sh;
- struct idr_layer *p, *new;
- int l, id, oid;
-
- id = *starting_id;
- restart:
- p = idp->top;
- l = idp->layers;
- pa[l--] = NULL;
- while (1) {
- /*
- * We run around this while until we reach the leaf node...
- */
- n = (id >> (IDR_BITS*l)) & IDR_MASK;
- m = find_next_zero_bit(p->bitmap, IDR_SIZE, n);
- if (m == IDR_SIZE) {
- /* no space available go back to previous layer. */
- l++;
- oid = id;
- id = (id | ((1 << (IDR_BITS * l)) - 1)) + 1;
-
- /* if already at the top layer, we need to grow */
- if (id >= 1 << (idp->layers * IDR_BITS)) {
- *starting_id = id;
- return -EAGAIN;
- }
- p = pa[l];
- BUG_ON(!p);
+ void *p;
+ unsigned id;
+ int error = 0;
- /* If we need to go up one layer, continue the
- * loop; otherwise, restart from the top.
- */
- sh = IDR_BITS * (l + 1);
- if (oid >> sh == id >> sh)
- continue;
- else
- goto restart;
- }
- if (m != n) {
- sh = IDR_BITS*l;
- id = ((id >> sh) ^ n ^ m) << sh;
- }
- if ((id >= MAX_IDR_BIT) || (id < 0))
- return -ENOSPC;
- if (l == 0)
+ idr_for_each_entry(idr, p, id) {
+ error = fn(id, p, data);
+ if (error)
break;
- /*
- * Create the layer below if it is missing.
- */
- if (!p->ary[m]) {
- new = idr_layer_alloc(gfp_mask, layer_idr);
- if (!new)
- return -ENOMEM;
- new->layer = l-1;
- new->prefix = id & idr_layer_prefix_mask(new->layer);
- rcu_assign_pointer(p->ary[m], new);
- p->count++;
- }
- pa[l--] = p;
- p = p->ary[m];
}
- pa[l] = p;
- return id;
+ return error;
}
+EXPORT_SYMBOL(idr_for_each);
-static int idr_get_empty_slot(struct idr *idp, int starting_id,
- struct idr_layer **pa, gfp_t gfp_mask,
- struct idr *layer_idr)
+/**
+ * idr_replace - replace pointer for given id
+ * @idr: idr handle
+ * @ptr: pointer you want associated with the id
+ * @id: lookup key
+ *
+ * Replace the pointer registered with an id and return the old value.
+ * A %-ENOENT return indicates that @id was not found.
+ * A %-EINVAL return indicates that @id was not within valid constraints.
+ */
+void *idr_replace(struct idr *idr, void *ptr, unsigned id)
{
- struct idr_layer *p, *new;
- int layers, v, id;
+ void **slot, *old = ERR_PTR(-ENOENT);
unsigned long flags;
- id = starting_id;
-build_up:
- p = idp->top;
- layers = idp->layers;
- if (unlikely(!p)) {
- if (!(p = idr_layer_alloc(gfp_mask, layer_idr)))
- return -ENOMEM;
- p->layer = 0;
- layers = 1;
- }
- /*
- * Add a new layer to the top of the tree if the requested
- * id is larger than the currently allocated space.
- */
- while (id > idr_max(layers)) {
- layers++;
- if (!p->count) {
- /* special case: if the tree is currently empty,
- * then we grow the tree by moving the top node
- * upwards.
- */
- p->layer++;
- WARN_ON_ONCE(p->prefix);
- continue;
- }
- if (!(new = idr_layer_alloc(gfp_mask, layer_idr))) {
- /*
- * The allocation failed. If we built part of
- * the structure tear it down.
- */
- spin_lock_irqsave(&idp->lock, flags);
- for (new = p; p && p != idp->top; new = p) {
- p = p->ary[0];
- new->ary[0] = NULL;
- new->count = 0;
- bitmap_clear(new->bitmap, 0, IDR_SIZE);
- __move_to_free_list(idp, new);
- }
- spin_unlock_irqrestore(&idp->lock, flags);
- return -ENOMEM;
- }
- new->ary[0] = p;
- new->count = 1;
- new->layer = layers-1;
- new->prefix = id & idr_layer_prefix_mask(new->layer);
- if (bitmap_full(p->bitmap, IDR_SIZE))
- __set_bit(0, new->bitmap);
- p = new;
+ rcu_read_lock();
+ spin_lock_irqsave(&idr->ida.lock, flags);
+
+ slot = radix_tree_lookup_slot(&idr->ptrs, id);
+
+ if (slot) {
+ old = radix_tree_deref_slot(slot);
+ if (old)
+ radix_tree_replace_slot(slot, __idr_radix_ptr(ptr));
}
- rcu_assign_pointer(idp->top, p);
- idp->layers = layers;
- v = sub_alloc(idp, &id, pa, gfp_mask, layer_idr);
- if (v == -EAGAIN)
- goto build_up;
- return(v);
-}
-/*
- * @id and @pa are from a successful allocation from idr_get_empty_slot().
- * Install the user pointer @ptr and mark the slot full.
- */
-static void idr_fill_slot(struct idr *idr, void *ptr, int id,
- struct idr_layer **pa)
-{
- /* update hint used for lookup, cleared from free_layer() */
- rcu_assign_pointer(idr->hint, pa[0]);
+ spin_unlock_irqrestore(&idr->ida.lock, flags);
+ rcu_read_unlock();
- rcu_assign_pointer(pa[0]->ary[id & IDR_MASK], (struct idr_layer *)ptr);
- pa[0]->count++;
- idr_mark_full(pa, id);
+ return __radix_idr_ptr(old);
}
+EXPORT_SYMBOL(idr_replace);
/**
- * idr_preload - preload for idr_alloc_range()
- * @gfp_mask: allocation mask to use for preloading
- *
- * Preload per-cpu layer buffer for idr_alloc_range(). Can only be used from
- * process context and each idr_preload() invocation should be matched with
- * idr_preload_end(). Note that preemption is disabled while preloaded.
- *
- * The first idr_alloc_range() in the preloaded section can be treated as if it
- * were invoked with @gfp_mask used for preloading. This allows using more
- * permissive allocation masks for idrs protected by spinlocks.
- *
- * For example, if idr_alloc_range() below fails, the failure can be treated as
- * if idr_alloc_range() were called with GFP_KERNEL rather than GFP_NOWAIT.
- *
- * idr_preload(GFP_KERNEL);
- * spin_lock(lock);
- *
- * id = idr_alloc_range(idr, ptr, start, end, GFP_NOWAIT);
- *
- * spin_unlock(lock);
- * idr_preload_end();
- * if (id < 0)
- * error;
+ * idr_remove - remove the given id and free its slot
+ * @idr: idr handle
+ * @id: unique key
*/
-void idr_preload(gfp_t gfp_mask)
+void idr_remove(struct idr *idr, unsigned id)
{
- /*
- * Consuming preload buffer from non-process context breaks preload
- * allocation guarantee. Disallow usage from those contexts.
- */
- WARN_ON_ONCE(in_interrupt());
- might_sleep_if(gfp_mask & __GFP_WAIT);
-
- preempt_disable();
-
- /*
- * idr_alloc_range() is likely to succeed w/o full idr_layer buffer and
- * return value from idr_alloc_range() needs to be checked for failure
- * anyway. Silently give up if allocation fails. The caller can
- * treat failures from idr_alloc_range() as if idr_alloc() were called
- * with @gfp_mask which should be enough.
- */
- while (__this_cpu_read(idr_preload_cnt) < MAX_IDR_FREE) {
- struct idr_layer *new;
-
- preempt_enable();
- new = kmem_cache_zalloc(idr_layer_cache, gfp_mask);
- preempt_disable();
- if (!new)
- break;
+ unsigned long flags;
+
+ spin_lock_irqsave(&idr->ida.lock, flags);
- /* link the new one to per-cpu preload list */
- new->ary[0] = __this_cpu_read(idr_preload_head);
- __this_cpu_write(idr_preload_head, new);
- __this_cpu_inc(idr_preload_cnt);
+ radix_tree_delete(&idr->ptrs, id);
+ __ida_remove(&idr->ida, id);
+
+ spin_unlock_irqrestore(&idr->ida.lock, flags);
+}
+EXPORT_SYMBOL(idr_remove);
+
+static int idr_insert(struct idr *idr, void *ptr, unsigned id,
+ gfp_t gfp, unsigned long *flags)
+{
+ int ret = radix_tree_preload(GFP_NOWAIT);
+ if (ret) {
+ spin_unlock_irqrestore(&idr->ida.lock, *flags);
+ ret = radix_tree_preload(gfp);
+ spin_lock_irqsave(&idr->ida.lock, *flags);
+
+ if (ret) {
+ __ida_remove(&idr->ida, id);
+ return ret;
+ }
}
+
+ ret = radix_tree_insert(&idr->ptrs, id, __idr_radix_ptr(ptr));
+ BUG_ON(ret);
+ radix_tree_preload_end();
+ return id;
}
-EXPORT_SYMBOL(idr_preload);
/**
* idr_alloc_range - allocate new idr entry
@@ -1305,44 +1059,34 @@ EXPORT_SYMBOL(idr_preload);
* @ptr: pointer to be associated with the new id
* @start: the minimum id (inclusive)
* @end: the maximum id (exclusive, <= 0 for max)
- * @gfp_mask: memory allocation flags
+ * @gfp: memory allocation flags
*
* Allocate an id in [start, end) and associate it with @ptr. If no ID is
* available in the specified range, returns -ENOSPC. On memory allocation
* failure, returns -ENOMEM.
*
- * Note that @end is treated as max when <= 0. This is to always allow
- * using @start + N as @end as long as N is inside integer range.
- *
- * The user is responsible for exclusively synchronizing all operations
- * which may modify @idr. However, read-only accesses such as idr_find()
- * or iteration can be performed under RCU read lock provided the user
- * destroys @ptr in RCU-safe way after removal from idr.
+ * Note that @end is treated as max when <= 0. This is to always allow using
+ * @start + N as @end as long as N is inside integer range.
*/
-int idr_alloc_range(struct idr *idr, void *ptr, int start,
- int end, gfp_t gfp_mask)
+int idr_alloc_range(struct idr *idr, void *ptr, unsigned start,
+ unsigned end, gfp_t gfp)
{
- int max = end > 0 ? end - 1 : INT_MAX; /* inclusive upper limit */
- struct idr_layer *pa[MAX_IDR_LEVEL + 1];
- int id;
+ int ret;
+ unsigned id;
+ unsigned long flags;
- might_sleep_if(gfp_mask & __GFP_WAIT);
+ might_sleep_if(gfp & __GFP_WAIT);
- /* sanity checks */
- if (WARN_ON_ONCE(start < 0))
- return -EINVAL;
- if (unlikely(max < start))
- return -ENOSPC;
+ spin_lock_irqsave(&idr->ida.lock, flags);
- /* allocate id */
- id = idr_get_empty_slot(idr, start, pa, gfp_mask, NULL);
- if (unlikely(id < 0))
- return id;
- if (unlikely(id > max))
- return -ENOSPC;
+ ret = __ida_alloc_range_multiple(&idr->ida, &id, 1, start,
+ end, gfp, &flags);
+ if (ret == 1)
+ ret = idr_insert(idr, ptr, id, gfp, &flags);
- idr_fill_slot(idr, ptr, id, pa);
- return id;
+ spin_unlock_irqrestore(&idr->ida.lock, flags);
+
+ return ret;
}
EXPORT_SYMBOL_GPL(idr_alloc_range);
@@ -1352,369 +1096,65 @@ EXPORT_SYMBOL_GPL(idr_alloc_range);
* @ptr: pointer to be associated with the new id
* @start: the minimum id (inclusive)
* @end: the maximum id (exclusive, <= 0 for max)
- * @gfp_mask: memory allocation flags
+ * @gfp: memory allocation flags
*
* Essentially the same as idr_alloc_range, but prefers to allocate
* progressively higher ids if it can. If the "cur" counter wraps, then it will
* start again at the "start" end of the range and allocate one that has already
* been used.
*/
-int idr_alloc_cyclic(struct idr *idr, void *ptr, int start, int end,
- gfp_t gfp_mask)
+int idr_alloc_cyclic(struct idr *idr, void *ptr, unsigned start,
+ unsigned end, gfp_t gfp)
{
- int id;
-
- id = idr_alloc_range(idr, ptr, max(start, idr->cur), end, gfp_mask);
- if (id == -ENOSPC)
- id = idr_alloc_range(idr, ptr, start, end, gfp_mask);
-
- if (likely(id >= 0))
- idr->cur = id + 1;
- return id;
-}
-EXPORT_SYMBOL(idr_alloc_cyclic);
-
-static void idr_remove_warning(int id)
-{
- WARN(1, "idr_remove called for id=%d which is not allocated.\n", id);
-}
-
-static void sub_remove(struct idr *idp, int shift, int id)
-{
- struct idr_layer *p = idp->top;
- struct idr_layer **pa[MAX_IDR_LEVEL + 1];
- struct idr_layer ***paa = &pa[0];
- struct idr_layer *to_free;
- int n;
-
- *paa = NULL;
- *++paa = &idp->top;
-
- while ((shift > 0) && p) {
- n = (id >> shift) & IDR_MASK;
- __clear_bit(n, p->bitmap);
- *++paa = &p->ary[n];
- p = p->ary[n];
- shift -= IDR_BITS;
- }
- n = id & IDR_MASK;
- if (likely(p != NULL && test_bit(n, p->bitmap))) {
- __clear_bit(n, p->bitmap);
- rcu_assign_pointer(p->ary[n], NULL);
- to_free = NULL;
- while(*paa && ! --((**paa)->count)){
- if (to_free)
- free_layer(idp, to_free);
- to_free = **paa;
- **paa-- = NULL;
- }
- if (!*paa)
- idp->layers = 0;
- if (to_free)
- free_layer(idp, to_free);
- } else
- idr_remove_warning(id);
-}
+ int ret;
+ unsigned long flags;
-/**
- * idr_remove - remove the given id and free its slot
- * @idp: idr handle
- * @id: unique key
- */
-void idr_remove(struct idr *idp, int id)
-{
- struct idr_layer *p;
- struct idr_layer *to_free;
+ might_sleep_if(gfp & __GFP_WAIT);
- if (id < 0)
- return;
+ spin_lock_irqsave(&idr->ida.lock, flags);
- sub_remove(idp, (idp->layers - 1) * IDR_BITS, id);
- if (idp->top && idp->top->count == 1 && (idp->layers > 1) &&
- idp->top->ary[0]) {
- /*
- * Single child at leftmost slot: we can shrink the tree.
- * This level is not needed anymore since when layers are
- * inserted, they are inserted at the top of the existing
- * tree.
- */
- to_free = idp->top;
- p = idp->top->ary[0];
- rcu_assign_pointer(idp->top, p);
- --idp->layers;
- to_free->count = 0;
- bitmap_clear(to_free->bitmap, 0, IDR_SIZE);
- free_layer(idp, to_free);
- }
- while (idp->id_free_cnt >= MAX_IDR_FREE) {
- p = get_from_free_list(idp);
- /*
- * Note: we don't call the rcu callback here, since the only
- * layers that fall into the freelist are those that have been
- * preallocated.
- */
- kmem_cache_free(idr_layer_cache, p);
- }
- return;
-}
-EXPORT_SYMBOL(idr_remove);
+ ret = __ida_alloc_cyclic(&idr->ida, start, end, gfp, &flags);
+ if (ret >= 0)
+ ret = idr_insert(idr, ptr, ret, gfp, &flags);
-static void __idr_remove_all(struct idr *idp)
-{
- int n, id, max;
- int bt_mask;
- struct idr_layer *p;
- struct idr_layer *pa[MAX_IDR_LEVEL + 1];
- struct idr_layer **paa = &pa[0];
-
- n = idp->layers * IDR_BITS;
- p = idp->top;
- rcu_assign_pointer(idp->top, NULL);
- max = idr_max(idp->layers);
-
- id = 0;
- while (id >= 0 && id <= max) {
- while (n > IDR_BITS && p) {
- n -= IDR_BITS;
- *paa++ = p;
- p = p->ary[(id >> n) & IDR_MASK];
- }
+ spin_unlock_irqrestore(&idr->ida.lock, flags);
- bt_mask = id;
- id += 1 << n;
- /* Get the highest bit that the above add changed from 0->1. */
- while (n < fls(id ^ bt_mask)) {
- if (p)
- free_layer(idp, p);
- n += IDR_BITS;
- p = *--paa;
- }
- }
- idp->layers = 0;
+ return ret;
}
+EXPORT_SYMBOL(idr_alloc_cyclic);
/**
- * idr_destroy - release all cached layers within an idr tree
- * @idp: idr handle
+ * idr_destroy - free all memory owned by @idr
+ * @idr: idr handle
*
- * Free all id mappings and all idp_layers. After this function, @idp is
- * completely unused and can be freed / recycled. The caller is
- * responsible for ensuring that no one else accesses @idp during or after
- * idr_destroy().
+ * After this function, @idr is completely unused and can be freed / recycled.
*
* A typical clean-up sequence for objects stored in an idr tree will use
* idr_for_each() to free all objects, if necessay, then idr_destroy() to
- * free up the id mappings and cached idr_layers.
+ * free the embedded ida and radix tree.
*/
-void idr_destroy(struct idr *idp)
+void idr_destroy(struct idr *idr)
{
- __idr_remove_all(idp);
-
- while (idp->id_free_cnt) {
- struct idr_layer *p = get_from_free_list(idp);
- kmem_cache_free(idr_layer_cache, p);
- }
-}
-EXPORT_SYMBOL(idr_destroy);
-
-void *idr_find_slowpath(struct idr *idp, int id)
-{
- int n;
- struct idr_layer *p;
-
- if (id < 0)
- return NULL;
-
- p = rcu_dereference_raw(idp->top);
- if (!p)
- return NULL;
- n = (p->layer+1) * IDR_BITS;
-
- if (id > idr_max(p->layer + 1))
- return NULL;
- BUG_ON(n == 0);
-
- while (n > 0 && p) {
- n -= IDR_BITS;
- BUG_ON(n != p->layer*IDR_BITS);
- p = rcu_dereference_raw(p->ary[(id >> n) & IDR_MASK]);
- }
- return((void *)p);
-}
-EXPORT_SYMBOL(idr_find_slowpath);
-
-/**
- * idr_for_each - iterate through all stored pointers
- * @idp: idr handle
- * @fn: function to be called for each pointer
- * @data: data passed back to callback function
- *
- * Iterate over the pointers registered with the given idr. The
- * callback function will be called for each pointer currently
- * registered, passing the id, the pointer and the data pointer passed
- * to this function. It is not safe to modify the idr tree while in
- * the callback, so functions such as idr_remove are not allowed.
- *
- * We check the return of @fn each time. If it returns anything other
- * than %0, we break out and return that value.
- *
- * The caller must serialize idr_for_each() vs idr_remove().
- */
-int idr_for_each(struct idr *idp,
- int (*fn)(int id, void *p, void *data), void *data)
-{
- int n, id, max, error = 0;
- struct idr_layer *p;
- struct idr_layer *pa[MAX_IDR_LEVEL + 1];
- struct idr_layer **paa = &pa[0];
-
- n = idp->layers * IDR_BITS;
- p = rcu_dereference_raw(idp->top);
- max = idr_max(idp->layers);
-
- id = 0;
- while (id >= 0 && id <= max) {
- while (n > 0 && p) {
- n -= IDR_BITS;
- *paa++ = p;
- p = rcu_dereference_raw(p->ary[(id >> n) & IDR_MASK]);
- }
-
- if (p) {
- error = fn(id, (void *)p, data);
- if (error)
- break;
- }
-
- id += 1 << n;
- while (n < fls(id)) {
- n += IDR_BITS;
- p = *--paa;
- }
- }
-
- return error;
-}
-EXPORT_SYMBOL(idr_for_each);
-
-/**
- * idr_find_next - lookup next object of id to given id.
- * @idp: idr handle
- * @nextidp: pointer to lookup key
- *
- * Returns pointer to registered object with id, which is next number to
- * given id. After being looked up, *@nextidp will be updated for the next
- * iteration.
- *
- * This function can be called under rcu_read_lock(), given that the leaf
- * pointers lifetimes are correctly managed.
- */
-void *idr_find_next(struct idr *idp, int *nextidp)
-{
- struct idr_layer *p, *pa[MAX_IDR_LEVEL + 1];
- struct idr_layer **paa = &pa[0];
- int id = *nextidp;
- int n, max;
-
- /* find first ent */
- p = rcu_dereference_raw(idp->top);
- if (!p)
- return NULL;
- n = (p->layer + 1) * IDR_BITS;
- max = idr_max(p->layer + 1);
-
- while (id >= 0 && id <= max) {
- while (n > 0 && p) {
- n -= IDR_BITS;
- *paa++ = p;
- p = rcu_dereference_raw(p->ary[(id >> n) & IDR_MASK]);
- }
-
- if (p) {
- *nextidp = id;
- return p;
- }
-
- /*
- * Proceed to the next layer at the current level. Unlike
- * idr_for_each(), @id isn't guaranteed to be aligned to
- * layer boundary at this point and adding 1 << n may
- * incorrectly skip IDs. Make sure we jump to the
- * beginning of the next layer using round_up().
- */
- id = round_up(id + 1, 1 << n);
- while (n < fls(id)) {
- n += IDR_BITS;
- p = *--paa;
- }
- }
- return NULL;
-}
-EXPORT_SYMBOL(idr_find_next);
-
-
-/**
- * idr_replace - replace pointer for given id
- * @idp: idr handle
- * @ptr: pointer you want associated with the id
- * @id: lookup key
- *
- * Replace the pointer registered with an id and return the old value.
- * A %-ENOENT return indicates that @id was not found.
- * A %-EINVAL return indicates that @id was not within valid constraints.
- *
- * The caller must serialize with writers.
- */
-void *idr_replace(struct idr *idp, void *ptr, int id)
-{
- int n;
- struct idr_layer *p, *old_p;
-
- if (id < 0)
- return ERR_PTR(-EINVAL);
-
- p = idp->top;
- if (!p)
- return ERR_PTR(-EINVAL);
-
- n = (p->layer+1) * IDR_BITS;
-
- if (id >= (1 << n))
- return ERR_PTR(-EINVAL);
-
- n -= IDR_BITS;
- while ((n > 0) && p) {
- p = p->ary[(id >> n) & IDR_MASK];
- n -= IDR_BITS;
- }
-
- n = id & IDR_MASK;
- if (unlikely(p == NULL || !test_bit(n, p->bitmap)))
- return ERR_PTR(-ENOENT);
-
- old_p = p->ary[n];
- rcu_assign_pointer(p->ary[n], ptr);
+ void *p;
+ unsigned id;
- return old_p;
-}
-EXPORT_SYMBOL(idr_replace);
+ idr_for_each_entry(idr, p, id)
+ idr_remove(idr, id);
-void __init idr_init_cache(void)
-{
- idr_layer_cache = kmem_cache_create("idr_layer_cache",
- sizeof(struct idr_layer), 0, SLAB_PANIC, NULL);
+ ida_destroy(&idr->ida);
}
+EXPORT_SYMBOL(idr_destroy);
/**
- * idr_init - initialize idr handle
- * @idp: idr handle
+ * idr_init - initialize sparse idr handle
+ * @idr: idr handle
*
- * This function is use to set up the handle (@idp) that you will pass
+ * This function is use to set up the handle (@idr) that you will pass
* to the rest of the functions.
*/
-void idr_init(struct idr *idp)
+void idr_init(struct idr *idr)
{
- memset(idp, 0, sizeof(struct idr));
- spin_lock_init(&idp->lock);
+ ida_init(&idr->ida);
+ INIT_RADIX_TREE(&idr->ptrs, GFP_NOWAIT);
}
EXPORT_SYMBOL(idr_init);
--
1.8.4.rc1
On Wed, 7 Aug 2013, Kent Overstreet wrote:
> +{
> + DEFINE_WAIT(wait);
> + struct percpu_ida_cpu *tags;
> + unsigned long flags;
> + unsigned this_cpu;
> + int tag;
> +
> + local_irq_save(flags);
> + this_cpu = smp_processor_id();
> + tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
tags = this_cpu_ptr(pool->tag_cpu);
> + schedule();
> +
> + local_irq_save(flags);
> + this_cpu = smp_processor_id();
> + tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
And the same here.
> +void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
> +{
> + struct percpu_ida_cpu *tags;
> + unsigned long flags;
> + unsigned nr_free, this_cpu;
> +
> + BUG_ON(tag >= pool->nr_tags);
> +
> + local_irq_save(flags);
> + this_cpu = smp_processor_id();
> + tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
And again
On Wed, Aug 07, 2013 at 05:56:34PM +0000, Christoph Lameter wrote:
> On Wed, 7 Aug 2013, Kent Overstreet wrote:
>
> > +{
> > + DEFINE_WAIT(wait);
> > + struct percpu_ida_cpu *tags;
> > + unsigned long flags;
> > + unsigned this_cpu;
> > + int tag;
> > +
> > + local_irq_save(flags);
>
> > + this_cpu = smp_processor_id();
> > + tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
>
> tags = this_cpu_ptr(pool->tag_cpu);
I was breaking it apart because I was using this_cpu elsewhere too - for
the bitmap of which cpus have non empty freelists.
Or is this_cpu_ptr() doing something smarter than per_cpu_ptr(ptr,
smp_processer_id())? There's so many variants I'm not 100% sure they're
the same.
On Wed, 7 Aug 2013, Kent Overstreet wrote:
> I was breaking it apart because I was using this_cpu elsewhere too - for
> the bitmap of which cpus have non empty freelists.
this_cpu can be retrieved with smp_processor_id().
> Or is this_cpu_ptr() doing something smarter than per_cpu_ptr(ptr,
> smp_processer_id())? There's so many variants I'm not 100% sure they're
> the same.
Yes it is. It uses a sepecial register that contains the offset of this
cpus per cpu area instead of going through the table of all processor
offsets. Its less code.
On Wed, Aug 07, 2013 at 07:40:15PM +0000, Christoph Lameter wrote:
> On Wed, 7 Aug 2013, Kent Overstreet wrote:
>
> > I was breaking it apart because I was using this_cpu elsewhere too - for
> > the bitmap of which cpus have non empty freelists.
>
> this_cpu can be retrieved with smp_processor_id().
>
> > Or is this_cpu_ptr() doing something smarter than per_cpu_ptr(ptr,
> > smp_processer_id())? There's so many variants I'm not 100% sure they're
> > the same.
>
> Yes it is. It uses a sepecial register that contains the offset of this
> cpus per cpu area instead of going through the table of all processor
> offsets. Its less code.
Alright, well here's a fixup patch - untested for the moment though.
One thing that was bugging me - I was never able to figure out for sure
if smp_processor_id() returns a number in the range [0, nr_cpu_ids), at
least I couldn't find where it was documented - could you tell me if
that's true?
>From e2b8016de49c28c0ccbe7849d7254f005c7e2e77 Mon Sep 17 00:00:00 2001
From: Kent Overstreet <[email protected]>
Date: Wed, 7 Aug 2013 12:52:58 -0700
Subject: [PATCH] idr: Use this_cpu_ptr() for percpu_ida
diff --git a/lib/idr.c b/lib/idr.c
index fb374c3..320ffea 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -748,12 +748,10 @@ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
DEFINE_WAIT(wait);
struct percpu_ida_cpu *tags;
unsigned long flags;
- unsigned this_cpu;
int tag;
local_irq_save(flags);
- this_cpu = smp_processor_id();
- tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
+ tags = this_cpu_ptr(pool->tag_cpu);
/* Fastpath */
tag = alloc_local_tag(pool, tags);
@@ -782,7 +780,8 @@ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
if (tags->nr_free) {
tag = tags->freelist[--tags->nr_free];
if (tags->nr_free)
- set_bit(this_cpu, pool->cpus_have_tags);
+ set_bit(smp_processor_id(),
+ pool->cpus_have_tags);
}
spin_unlock(&pool->ida.lock);
@@ -794,8 +793,7 @@ int percpu_ida_alloc(struct percpu_ida *pool, gfp_t gfp)
schedule();
local_irq_save(flags);
- this_cpu = smp_processor_id();
- tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
+ tags = this_cpu_ptr(pool->tag_cpu);
}
finish_wait(&pool->wait, &wait);
@@ -814,13 +812,12 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
{
struct percpu_ida_cpu *tags;
unsigned long flags;
- unsigned nr_free, this_cpu;
+ unsigned nr_free;
BUG_ON(tag >= pool->nr_tags);
local_irq_save(flags);
- this_cpu = smp_processor_id();
- tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
+ tags = this_cpu_ptr(pool->tag_cpu);
spin_lock(&tags->lock);
tags->freelist[tags->nr_free++] = tag;
@@ -829,7 +826,8 @@ void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
spin_unlock(&tags->lock);
if (nr_free == 1) {
- set_bit(this_cpu, pool->cpus_have_tags);
+ set_bit(smp_processor_id(),
+ pool->cpus_have_tags);
wake_up(&pool->wait);
}
Hello, Kent.
On Wed, Aug 07, 2013 at 10:34:58AM -0700, Kent Overstreet wrote:
> + * So for 1 mb of memory (and allocating more than that should be fine with
> + * CONFIG_COMPACTION) you get slightly under 8 million IDs.
Nothing seems to explain the section thing. This is broken up now,
right? Where's the documentation?
Thanks.
--
tejun
On Wed, Aug 07, 2013 at 04:22:01PM -0400, Tejun Heo wrote:
> Hello, Kent.
>
> On Wed, Aug 07, 2013 at 10:34:58AM -0700, Kent Overstreet wrote:
> > + * So for 1 mb of memory (and allocating more than that should be fine with
> > + * CONFIG_COMPACTION) you get slightly under 8 million IDs.
>
> Nothing seems to explain the section thing. This is broken up now,
> right? Where's the documentation?
Whoops, yes. As usual with the documentation...
Here's a fixup patch for that:
>From c24de588c5f31fa77fb8fcbf4c457b32062fee0c Mon Sep 17 00:00:00 2001
From: Kent Overstreet <[email protected]>
Date: Wed, 7 Aug 2013 13:50:42 -0700
Subject: [PATCH] idr: Document ida tree sections
diff --git a/lib/idr.c b/lib/idr.c
index 320ffea..02a221c 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -72,18 +72,37 @@ static void *kgalloc(size_t size, gfp_t gfp)
* the bit for id i is bit id % BITS_PER_LONG in ida->tree[ida->first_leaf + i /
* BITS_PER_LONG].
*
- * Note that the number of ids we can allocate is limited by the amount of
- * memory we can contiguously allocate. The amount of memory used for the bitmap
- * tree is only slightly more than a flat bitmap would use - about 1 / TREE_ARY
- * * (sizeof flat bitmap).
+ * That last line of code is a lie - logically, the data structure is one flat
+ * array - but to avoid giant contiguous allocations we use an array of arrays -
+ * ida_index_to_node() replaces the array lookup in the above example.
*
- * So for 1 mb of memory (and allocating more than that should be fine with
- * CONFIG_COMPACTION) you get slightly under 8 million IDs.
+ * So ida->tree is an array of pointers to sections, where the sections are
+ * different segments of the array the bitmap tree lives in.
+ *
+ * If there's a single section, it's only as big as we need it to be, and we
+ * grow the bitmap tree by doubling the size of the allocation.
+ *
+ * Once the tree is big enough that we start using multiple sections, the
+ * sections are always the same size - the max section size - and we grow the
+ * tree by appending new sections.
+ *
+ * The maximum size of the bitmap tree is when we've allocated all the way up to
+ * INT_MAX ids; we need (INT_MAX / 8) bytes of memory for the leaves, plus a
+ * couple percent for the parent nodes (since TREE_ARY == BITS_PER_LONG the
+ * parent nodes only add around 2%).
+ *
+ * So that's ~256 mb of memory max; we pick the max section size such that the
+ * max size of the array of pointers to sections isn't any bigger than the max
+ * section size.
+ *
+ * So if the max section size is 64k, that's ~4096 sections, with 8 byte
+ * pointers that's a little over 32k for the pointers to sections.
+ *
+ * That means max size sections are order 4 page allocations.
*/
#define IDA_TREE_ARY BITS_PER_LONG
-#define IDA_ALLOC_ORDER_MAX 4
-#define IDA_SECTION_SIZE (PAGE_SIZE << IDA_ALLOC_ORDER_MAX)
+#define IDA_SECTION_SIZE (64UL << 10)
#define IDA_NODES_PER_SECTION (IDA_SECTION_SIZE / sizeof(unsigned long))
static inline unsigned long *ida_index_to_node(struct ida *ida, unsigned node)
On Wed, 7 Aug 2013, Kent Overstreet wrote:
> One thing that was bugging me - I was never able to figure out for sure
> if smp_processor_id() returns a number in the range [0, nr_cpu_ids), at
> least I couldn't find where it was documented - could you tell me if
> that's true?
I always assumed that it was in the range 0 ... nr_cpu_ids - 1 and that is
the assumption under which the kernel code was written. Things would break
horribly if smp_process_id would return nr_cpu_ids or higher.
Hello,
On Wed, Aug 07, 2013 at 01:51:17PM -0700, Kent Overstreet wrote:
> + * So if the max section size is 64k, that's ~4096 sections, with 8 byte
> + * pointers that's a little over 32k for the pointers to sections.
> + *
> + * That means max size sections are order 4 page allocations.
Order 4 allocations for common data structure doesn't really sound
like a good idea to me. It's gonna work fine on relatively large
machines but suck on mobile / small embedded devices, many of which
are still struggling with 32bit address space and compaction may not
be enabled. It just doens't make sense to me to impose 64k
allocations from low level library function like ida.
Thanks.
--
tejun
On Fri, Aug 09, 2013 at 10:57:56AM -0400, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 07, 2013 at 01:51:17PM -0700, Kent Overstreet wrote:
> > + * So if the max section size is 64k, that's ~4096 sections, with 8 byte
> > + * pointers that's a little over 32k for the pointers to sections.
> > + *
> > + * That means max size sections are order 4 page allocations.
>
> Order 4 allocations for common data structure doesn't really sound
> like a good idea to me. It's gonna work fine on relatively large
> machines but suck on mobile / small embedded devices, many of which
> are still struggling with 32bit address space and compaction may not
> be enabled. It just doens't make sense to me to impose 64k
> allocations from low level library function like ida.
I have a hard time seeing how it could really be an issue in practice -
keep in mind, for every bit in the ida tree we're going to have some
struct allocated somewhere else that the id correspends to.
So for a 4k ida, that's... bare minimum around 128k memory that has to
be allocated somewhere else, and in a single subsystem, assuming 16 byte
structs - and typically it'll be many times that. 32k memory in the ida
-> 2 mb in the subsystem, absolute minimum.
If you're convinced this is a real issue though - how about
IDA_SECTION_SIZE conditional on CONFIG_COMPACTION, so we use order 2 or
3 allocations if CONFIG_COMPACTION=n?
Then the max size toplevel array of pointers to segments would be
bigger, but that's only an issue when we're allocating up to near
INT_MAX ids, so it's difficult to see how _that_ would be an issue on a
small/embedded system... and we could even use vmalloc for that
allocation when the size of that array is > IDA_SECTION_SIZE.
Hello,
On Tue, Aug 13, 2013 at 03:13:08PM -0700, Kent Overstreet wrote:
> If you're convinced this is a real issue though - how about
It is a real issue. Large order allocation is fine for optimization
but shouldn't be depended upon. It does fail easily without
compaction and compaction is heavy-ass operation which will blow up
any minute performance advantage you might get from avoiding proper
radix tree implementation.
> IDA_SECTION_SIZE conditional on CONFIG_COMPACTION, so we use order 2 or
> 3 allocations if CONFIG_COMPACTION=n?
>
> Then the max size toplevel array of pointers to segments would be
> bigger, but that's only an issue when we're allocating up to near
> INT_MAX ids, so it's difficult to see how _that_ would be an issue on a
> small/embedded system... and we could even use vmalloc for that
> allocation when the size of that array is > IDA_SECTION_SIZE.
What about cyclic allocations then? This is natrually a radix tree
problem. I don't know why you're resisting radix tree so much here.
Thanks.
--
tejun
On Tue, Aug 13, 2013 at 06:19:28PM -0400, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 13, 2013 at 03:13:08PM -0700, Kent Overstreet wrote:
> > If you're convinced this is a real issue though - how about
>
> It is a real issue. Large order allocation is fine for optimization
> but shouldn't be depended upon. It does fail easily without
> compaction and compaction is heavy-ass operation which will blow up
> any minute performance advantage you might get from avoiding proper
> radix tree implementation.
>
> > IDA_SECTION_SIZE conditional on CONFIG_COMPACTION, so we use order 2 or
> > 3 allocations if CONFIG_COMPACTION=n?
> >
> > Then the max size toplevel array of pointers to segments would be
> > bigger, but that's only an issue when we're allocating up to near
> > INT_MAX ids, so it's difficult to see how _that_ would be an issue on a
> > small/embedded system... and we could even use vmalloc for that
> > allocation when the size of that array is > IDA_SECTION_SIZE.
>
> What about cyclic allocations then? This is natrually a radix tree
> problem. I don't know why you're resisting radix tree so much here.
It's only naturally a radix tree problem _if_ you require sparseness.
Otherwise, radix trees require pointer chasing, which we can avoid -
which saves us both the cost of chasing pointers (which is significant)
and the overhead of storing them.
The patch handles cyclic allocation by limiting sparseness - we talked
about this and I thought you were ok with this solution, though it was
awhile ago and I could be misremembering your comments.
To recap, here's the code that implements that sparseness limiting, it's
documented in ida_alloc_cyclic()'s docs:
static int __ida_alloc_cyclic(struct ida *ida, unsigned start, unsigned end,
gfp_t gfp, unsigned long *flags)
__releases(&ida->lock)
__acquires(&ida->lock)
{
int ret;
unsigned id;
ret = __ida_alloc_range_multiple(ida, &id, 1,
max(start, ida->cur_id),
end, gfp, flags);
if (ret < 0)
ret = __ida_alloc_range_multiple(ida, &id, 1, start,
end, gfp, flags);
if (ret == 1) {
ida->cur_id = id + 1;
if ((ida->cur_id - start) / 2 > max(1024U, ida->allocated_ids))
ida->cur_id = 0;
return id;
}
return ret;
}
On Fri, Aug 09, 2013 at 10:57:56AM -0400, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 07, 2013 at 01:51:17PM -0700, Kent Overstreet wrote:
> > + * So if the max section size is 64k, that's ~4096 sections, with 8 byte
> > + * pointers that's a little over 32k for the pointers to sections.
> > + *
> > + * That means max size sections are order 4 page allocations.
>
> Order 4 allocations for common data structure doesn't really sound
> like a good idea to me. It's gonna work fine on relatively large
> machines but suck on mobile / small embedded devices, many of which
> are still struggling with 32bit address space and compaction may not
> be enabled. It just doens't make sense to me to impose 64k
> allocations from low level library function like ida.
Would this be an acceptable solution?
>From 483cfa0c809b7dc3b0abad93407468f273416578 Mon Sep 17 00:00:00 2001
From: Kent Overstreet <[email protected]>
Date: Tue, 13 Aug 2013 15:31:20 -0700
Subject: [PATCH] ida: Use order 2 allocations when COMPACTION=n
And fall back to vmalloc for the array of pointers to sections so we can
still allocate up to INT_MAX ids.
diff --git a/lib/idr.c b/lib/idr.c
index 02a221c..3bffb52 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -21,16 +21,20 @@
#include <linux/export.h>
#include <linux/idr.h>
#include <linux/kernel.h>
+#include <linux/mm.h>
#include <linux/percpu.h>
#include <linux/rcupdate.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/spinlock.h>
+#include <linux/vmalloc.h>
static void kgfree(void *ptr, size_t size)
{
- if (size < PAGE_SIZE)
+ if (is_vmalloc_addr(ptr))
+ vfree(ptr);
+ else if (size < PAGE_SIZE)
kfree(ptr);
else
free_pages((unsigned long) ptr, get_order(size));
@@ -102,7 +106,14 @@ static void *kgalloc(size_t size, gfp_t gfp)
*/
#define IDA_TREE_ARY BITS_PER_LONG
-#define IDA_SECTION_SIZE (64UL << 10)
+
+/* Max section size, in bytes */
+#ifdef CONFIG_COMPACTION
+#define IDA_SECTION_SIZE (64UL << 10) /* order 4 page allocation */
+#else
+#define IDA_SECTION_SIZE (16UL << 10) /* order 2 */
+#endif
+
#define IDA_NODES_PER_SECTION (IDA_SECTION_SIZE / sizeof(unsigned long))
static inline unsigned long *ida_index_to_node(struct ida *ida, unsigned node)
@@ -251,8 +262,15 @@ again:
if (ida->nodes >= IDA_NODES_PER_SECTION &&
is_power_of_2(cur_sections)) {
- sections = kgalloc(cur_sections * 2 * sizeof(unsigned long *),
- __GFP_ZERO|gfp);
+ size_t bytes = cur_sections * 2 * sizeof(unsigned long *);
+
+ if (bytes <= IDA_SECTION_SIZE)
+ sections = kgalloc(bytes, __GFP_ZERO|gfp);
+ else if (gfp & GFP_KERNEL)
+ sections = vzalloc(bytes);
+ else
+ sections = NULL;
+
if (!sections)
goto err;
}
Hello, Kent.
On Tue, Aug 13, 2013 at 03:27:59PM -0700, Kent Overstreet wrote:
> It's only naturally a radix tree problem _if_ you require sparseness.
Well, it's not necessarily about requiring it but more about surviving
it with some grace when things don't go as expected, which is an
important characteristic for common library stuff.
> Otherwise, radix trees require pointer chasing, which we can avoid -
> which saves us both the cost of chasing pointers (which is significant)
> and the overhead of storing them.
Vast majority of which can be avoided with simple caching, right?
Thanks.
--
tejun
On Tue, Aug 13, 2013 at 06:44:28PM -0400, Tejun Heo wrote:
> Hello, Kent.
>
> On Tue, Aug 13, 2013 at 03:27:59PM -0700, Kent Overstreet wrote:
> > It's only naturally a radix tree problem _if_ you require sparseness.
>
> Well, it's not necessarily about requiring it but more about surviving
> it with some grace when things don't go as expected, which is an
> important characteristic for common library stuff.
The patch I posted should solve the high order allocations stuff, and
sparseness from cyclic allocations was already solved.
> > Otherwise, radix trees require pointer chasing, which we can avoid -
> > which saves us both the cost of chasing pointers (which is significant)
> > and the overhead of storing them.
>
> Vast majority of which can be avoided with simple caching, right?
Whatever caching optimizations you do with a radix tree version I could
apply to this bitmap tree version, and my bitmap tree code is simpler
and _considerably_ faster than the existing code.
Hello,
On Tue, Aug 13, 2013 at 03:59:27PM -0700, Kent Overstreet wrote:
> > Well, it's not necessarily about requiring it but more about surviving
> > it with some grace when things don't go as expected, which is an
> > important characteristic for common library stuff.
>
> The patch I posted should solve the high order allocations stuff, and
> sparseness from cyclic allocations was already solved.
I don't know. Yeah, using vmalloc would be able to work around the
issue for most cases, I suppose. It's iffy to consume vmalloc space
from ida, which functionally is such a basic algorithmic construct.
It probably won't worsen things noticeably but vmalloc area can be a
very precious resource on 32bit configs.
> Whatever caching optimizations you do with a radix tree version I could
> apply to this bitmap tree version, and my bitmap tree code is simpler
> and _considerably_ faster than the existing code.
But the difference won't really matter. Cached performance would be
the same and that's likely to cover most cases, right? It's not like
radix tree is orders of magnitude slower.
Thanks.
--
tejun
On Tue, Aug 13, 2013 at 07:22:11PM -0400, Tejun Heo wrote:
> Hello,
>
> On Tue, Aug 13, 2013 at 03:59:27PM -0700, Kent Overstreet wrote:
> > > Well, it's not necessarily about requiring it but more about surviving
> > > it with some grace when things don't go as expected, which is an
> > > important characteristic for common library stuff.
> >
> > The patch I posted should solve the high order allocations stuff, and
> > sparseness from cyclic allocations was already solved.
>
> I don't know. Yeah, using vmalloc would be able to work around the
> issue for most cases, I suppose. It's iffy to consume vmalloc space
> from ida, which functionally is such a basic algorithmic construct.
> It probably won't worsen things noticeably but vmalloc area can be a
> very precious resource on 32bit configs.
This is only using it for the array of pointers to sections though, not
the bitmap itself - and only when that allocations is > 16k. For INT_MAX
allocated ids (absolute worst case) we'd be using 256k of vmalloc memory
on 64 bit, half that on 32 bit.
>
> > Whatever caching optimizations you do with a radix tree version I could
> > apply to this bitmap tree version, and my bitmap tree code is simpler
> > and _considerably_ faster than the existing code.
>
> But the difference won't really matter. Cached performance would be
> the same and that's likely to cover most cases, right? It's not like
> radix tree is orders of magnitude slower.
Should probably be almost as good, yeah... in theory, but the space
efficiency still isn't going to be as good, and it'll probably be more
code... and at this point I really just don't want to futz with it more.
At this point unless there's something really wrong with this code I
just want to move onto something else :P
Hey, Kent.
On Tue, Aug 13, 2013 at 04:51:33PM -0700, Kent Overstreet wrote:
> Should probably be almost as good, yeah... in theory, but the space
> efficiency still isn't going to be as good, and it'll probably be more
> code... and at this point I really just don't want to futz with it more.
> At this point unless there's something really wrong with this code I
> just want to move onto something else :P
I think it probably would be okay in most cases but don't feel
confident about acking as it's making trade-offs which are unnecessary
and unusual. So, ummm, I really don't know. Maybe it's better enough
than what we have now but at the same time if you want to reimplement
the whole thing you should be persistent / reliable enough to see it
through this time around too, right?
Thanks.
--
tejun
On Tue, Aug 13, 2013 at 07:59:47PM -0400, Tejun Heo wrote:
> Hey, Kent.
>
> On Tue, Aug 13, 2013 at 04:51:33PM -0700, Kent Overstreet wrote:
> > Should probably be almost as good, yeah... in theory, but the space
> > efficiency still isn't going to be as good, and it'll probably be more
> > code... and at this point I really just don't want to futz with it more.
> > At this point unless there's something really wrong with this code I
> > just want to move onto something else :P
>
> I think it probably would be okay in most cases but don't feel
> confident about acking as it's making trade-offs which are unnecessary
> and unusual. So, ummm, I really don't know. Maybe it's better enough
> than what we have now but at the same time if you want to reimplement
> the whole thing you should be persistent / reliable enough to see it
> through this time around too, right?
I was just telling you how I felt :) Regardless of that, IMO what I've
got now is superior to any radix tree based approach for what ida/idr
are supposed to do. I could of course be wrong, but I'm not convinced...
Re: caching the last allocation with a radix tree based implementation.
I thought about that more last night, I don't think that would be viable
for using ida underneath the percpu ida allocator.
Reason being percpu ida has to heavily optimize for the case where
almost all of the id space is allocated, and after awhile the id space
is going to be fragmented - so caching the last allocated id is going to
be useless.
This is also why I implemented the ganged allocation bit, to amortize
the bitmap tree traversal. So we'd lose out on that going back to a
radix tree, or have to reimplement it (and it'll be slower due to
pointer chasing).
Which is all not the end of the world, but it means that if we punt on
the ida/idr rewrites for now or change our minds about them - we _do_
have quite a bit of stuff waiting on the percpu ida allocator, so for
that to go in separate I'll have to change it back to using a stack of
integers for the global freelist - which does use significantly more
memory.
Hello, Kent.
On Wed, Aug 14, 2013 at 05:04:27PM -0700, Kent Overstreet wrote:
> I was just telling you how I felt :) Regardless of that, IMO what I've
> got now is superior to any radix tree based approach for what ida/idr
> are supposed to do. I could of course be wrong, but I'm not convinced...
It's just very difficult to tell either way. You say it's better but
the benefit to weirdity ratio doesn't seem too apparent. The only
thing the proposed solution saves is a few pointer dereferences in
extreme corner cases at the cost of making low level library using
high order allocation or vmalloc allocation.
Weirdity aside, the unsualness even makes evaluating the overhead
muddier. e.g. vmalloc space is expensive not only in terms of address
space real estate but also in terms of runtime performance because
each vmalloc page is additional TLB pressure in most configurations
where the kernel linear address space is mapped with gigantic
mappings. The net effect could be small and won't easily show up in
microbenchmarks as they usually don't tend to push TLB pressure but
then again the performance benefit of the proposed implementation is
likely to extremely minor too.
For a code piece to be unusual, it should have its accompanying clear
benefits, which doesn't seem to be the case here. It's different and
maybe better in some extreme benchmarks specifically designed for it
but that seems to be about it.
> Re: caching the last allocation with a radix tree based implementation.
> I thought about that more last night, I don't think that would be viable
> for using ida underneath the percpu ida allocator.
>
> Reason being percpu ida has to heavily optimize for the case where
> almost all of the id space is allocated, and after awhile the id space
> is going to be fragmented - so caching the last allocated id is going to
> be useless.
A 4k page has 32k bits. It can serve up quite a few IDs even with
internal indexing. Most cases will be fine with single page and
single layer would cover most of what's left. How is that gonna be
very different from the proposed implementation. If you worry about
huge ID space being distributed by a lot of CPUs, you can use per-cpu
hints, which will be faster than the proposed solution anyway.
> This is also why I implemented the ganged allocation bit, to amortize
> the bitmap tree traversal. So we'd lose out on that going back to a
> radix tree, or have to reimplement it (and it'll be slower due to
> pointer chasing).
>
> Which is all not the end of the world, but it means that if we punt on
> the ida/idr rewrites for now or change our minds about them - we _do_
> have quite a bit of stuff waiting on the percpu ida allocator, so for
> that to go in separate I'll have to change it back to using a stack of
> integers for the global freelist - which does use significantly more
> memory.
Yes, I'd like to see better, percpu-aware ida too and there are things
which can benefit from it, but we still need to get ida right and I
don't think it's a very difficult thing to do at this point.
Thanks.
--
tejun
On Thu, 2013-08-08 at 14:32 +0000, Christoph Lameter wrote:
> On Wed, 7 Aug 2013, Kent Overstreet wrote:
>
> > One thing that was bugging me - I was never able to figure out for sure
> > if smp_processor_id() returns a number in the range [0, nr_cpu_ids), at
> > least I couldn't find where it was documented - could you tell me if
> > that's true?
>
> I always assumed that it was in the range 0 ... nr_cpu_ids - 1 and that is
> the assumption under which the kernel code was written. Things would break
> horribly if smp_process_id would return nr_cpu_ids or higher.
>
Hi guys,
Just a heads up that I've put Kent's standalone percpu-ida patch (with
Christoph's recommend changes) into target-pending/for-next here:
https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=47bd524a5b3eb6429b058b8b562b45329ab2c9e7
I've got a number of target patches that depend on this code for v3.12,
and a delay on this particular piece would be painful to endure..
Sooo, please yell loudly if there is an objection to percpu-ida merge as
a completely standalone item, that does not effect any existing ida
code.
--nab
On Tue, 20 Aug 2013 14:19:06 -0700 "Nicholas A. Bellinger" <[email protected]> wrote:
> On Thu, 2013-08-08 at 14:32 +0000, Christoph Lameter wrote:
> > On Wed, 7 Aug 2013, Kent Overstreet wrote:
> >
> > > One thing that was bugging me - I was never able to figure out for sure
> > > if smp_processor_id() returns a number in the range [0, nr_cpu_ids), at
> > > least I couldn't find where it was documented - could you tell me if
> > > that's true?
> >
> > I always assumed that it was in the range 0 ... nr_cpu_ids - 1 and that is
> > the assumption under which the kernel code was written. Things would break
> > horribly if smp_process_id would return nr_cpu_ids or higher.
> >
>
> Hi guys,
>
> Just a heads up that I've put Kent's standalone percpu-ida patch (with
> Christoph's recommend changes) into target-pending/for-next here:
>
> https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=47bd524a5b3eb6429b058b8b562b45329ab2c9e7
>
> I've got a number of target patches that depend on this code for v3.12,
> and a delay on this particular piece would be painful to endure..
>
> Sooo, please yell loudly if there is an objection to percpu-ida merge as
> a completely standalone item, that does not effect any existing ida
> code.
Was hoping that Tejun had time. I'll take a look...
On Tue, Aug 20, 2013 at 02:29:56PM -0700, Andrew Morton wrote:
> On Tue, 20 Aug 2013 14:19:06 -0700 "Nicholas A. Bellinger" <[email protected]> wrote:
>
> > On Thu, 2013-08-08 at 14:32 +0000, Christoph Lameter wrote:
> > > On Wed, 7 Aug 2013, Kent Overstreet wrote:
> > >
> > > > One thing that was bugging me - I was never able to figure out for sure
> > > > if smp_processor_id() returns a number in the range [0, nr_cpu_ids), at
> > > > least I couldn't find where it was documented - could you tell me if
> > > > that's true?
> > >
> > > I always assumed that it was in the range 0 ... nr_cpu_ids - 1 and that is
> > > the assumption under which the kernel code was written. Things would break
> > > horribly if smp_process_id would return nr_cpu_ids or higher.
> > >
> >
> > Hi guys,
> >
> > Just a heads up that I've put Kent's standalone percpu-ida patch (with
> > Christoph's recommend changes) into target-pending/for-next here:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=47bd524a5b3eb6429b058b8b562b45329ab2c9e7
> >
> > I've got a number of target patches that depend on this code for v3.12,
> > and a delay on this particular piece would be painful to endure..
> >
> > Sooo, please yell loudly if there is an objection to percpu-ida merge as
> > a completely standalone item, that does not effect any existing ida
> > code.
>
> Was hoping that Tejun had time. I'll take a look...
I think Tejun and I might be at a bit of an impasse with the ida rewrite
itself, but I don't think there were any outstanding objections to the
percpu ida code itself - and this is a standalone version.
I was meaning to ask you Andrew, if you could take a look at the ida
discussion and lend your opinion - I don't think there's any _specific_
technical objections left to my ida code, and it's now on a more
philisophical "complexity vs. ..." level.
Hello, Kent.
On Tue, Aug 20, 2013 at 07:01:32PM -0700, Kent Overstreet wrote:
> I think Tejun and I might be at a bit of an impasse with the ida rewrite
> itself, but I don't think there were any outstanding objections to the
> percpu ida code itself - and this is a standalone version.
The percpu ida code can be applied separately from the ida rewrite?
> I was meaning to ask you Andrew, if you could take a look at the ida
> discussion and lend your opinion - I don't think there's any _specific_
> technical objections left to my ida code, and it's now on a more
> philisophical "complexity vs. ..." level.
Hmmm... the objection was pretty specific - don't depend on high-order
or vmalloc allocations when it can be easily avoided by using proper
radix tree.
Thanks.
--
tejun
On Tue, Aug 20, 2013 at 10:07:42PM -0400, Tejun Heo wrote:
> Hello, Kent.
>
> On Tue, Aug 20, 2013 at 07:01:32PM -0700, Kent Overstreet wrote:
> > I think Tejun and I might be at a bit of an impasse with the ida rewrite
> > itself, but I don't think there were any outstanding objections to the
> > percpu ida code itself - and this is a standalone version.
>
> The percpu ida code can be applied separately from the ida rewrite?
Yes - at the cost of using significantly more memory for the global
freelist
> > I was meaning to ask you Andrew, if you could take a look at the ida
> > discussion and lend your opinion - I don't think there's any _specific_
> > technical objections left to my ida code, and it's now on a more
> > philisophical "complexity vs. ..." level.
>
> Hmmm... the objection was pretty specific - don't depend on high-order
> or vmalloc allocations when it can be easily avoided by using proper
> radix tree.
We only do vmalloc allocations for CONFIG_COMPACTION=n, and then only
when we need to allocate more than almost 1 _billion_ ids from a single
ida (twice than on 32 bit, so never because that gets us just about to
INT_MAX) - and then it's only 32k of vmalloc memory, for the entire ida.
This is with max allocation order of 4 for COMPACTION=y, 2 for
COMPACTION=n.
All this for a performance improvement of 10x to 50x (or more), for the
ida sizes I measured.
So I could see your point if we were allocating gobs of vmalloc memory,
or high order allocations big enough to realistically be problematic (I
honestly don't think these will be) - but to me, this seems like a
pretty reasonable tradeoff for those performance gains.
(And the performance gains do substantially come from using more
contiguous memory and treating the whole data structure as an array, and
doing less pointer chasing/looping)
Hello, Kent.
On Tue, Aug 20, 2013 at 07:31:51PM -0700, Kent Overstreet wrote:
> All this for a performance improvement of 10x to 50x (or more), for the
> ida sizes I measured.
That's misleading, isn't it? We should see large performance
improvements even without the large pages. What matters more is the
leaf node performance for vast majority of cases and an extra radix
tree layer on top would cover most of whatever is left. Whether to
use high order pages or not only affects the extreme niche use cases
and I don't think going for high order pages to micro optimize those
extreme use cases is the right trade off.
> So I could see your point if we were allocating gobs of vmalloc memory,
> or high order allocations big enough to realistically be problematic (I
> honestly don't think these will be) - but to me, this seems like a
> pretty reasonable tradeoff for those performance gains.
The trade off is made up as the bulk of the performance benefit can be
gained without resorting to high order allocations.
> (And the performance gains do substantially come from using more
> contiguous memory and treating the whole data structure as an array, and
> doing less pointer chasing/looping)
I really have hard time buying that. Let's say you go with single
page leaf node and an extra single page layer on top. How many IDs
are we talking about? For the cases which are most performance
sensitive, this doesn't even matter a bit as percpu caching layer
would be on top anyway. I really don't think the micro optimization
is called for at the cost of high order allocations from low level
tool library.
Thanks.
--
tejun
On Tue, 20 Aug 2013, Nicholas A. Bellinger wrote:
> On Thu, 2013-08-08 at 14:32 +0000, Christoph Lameter wrote:
> > On Wed, 7 Aug 2013, Kent Overstreet wrote:
> >
> > > One thing that was bugging me - I was never able to figure out for sure
> > > if smp_processor_id() returns a number in the range [0, nr_cpu_ids), at
> > > least I couldn't find where it was documented - could you tell me if
> > > that's true?
> >
> > I always assumed that it was in the range 0 ... nr_cpu_ids - 1 and that is
> > the assumption under which the kernel code was written. Things would break
> > horribly if smp_process_id would return nr_cpu_ids or higher.
> >
>
> Hi guys,
>
> Just a heads up that I've put Kent's standalone percpu-ida patch (with
> Christoph's recommend changes) into target-pending/for-next here:
>
> https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=47bd524a5b3eb6429b058b8b562b45329ab2c9e7
>
> I've got a number of target patches that depend on this code for v3.12,
> and a delay on this particular piece would be painful to endure..
>
> Sooo, please yell loudly if there is an objection to percpu-ida merge as
> a completely standalone item, that does not effect any existing ida
> code.
Well the performance is still going to be limited due to the spinlock in
the percpu handling. You do not need the spinlock. Once preempt is off you
should have exclusive access to the per cpu data. This is already
exploited by idr_layer_alloc before the patch. Doing so is going to
reduce the code size of the patch significantly.
Please post the patch inline so that its easy to comment on it.
On Wed, 2013-08-21 at 14:32 +0000, Christoph Lameter wrote:
> On Tue, 20 Aug 2013, Nicholas A. Bellinger wrote:
>
> > On Thu, 2013-08-08 at 14:32 +0000, Christoph Lameter wrote:
> > > On Wed, 7 Aug 2013, Kent Overstreet wrote:
> > >
> > > > One thing that was bugging me - I was never able to figure out for sure
> > > > if smp_processor_id() returns a number in the range [0, nr_cpu_ids), at
> > > > least I couldn't find where it was documented - could you tell me if
> > > > that's true?
> > >
> > > I always assumed that it was in the range 0 ... nr_cpu_ids - 1 and that is
> > > the assumption under which the kernel code was written. Things would break
> > > horribly if smp_process_id would return nr_cpu_ids or higher.
> > >
> >
> > Hi guys,
> >
> > Just a heads up that I've put Kent's standalone percpu-ida patch (with
> > Christoph's recommend changes) into target-pending/for-next here:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=47bd524a5b3eb6429b058b8b562b45329ab2c9e7
> >
> > I've got a number of target patches that depend on this code for v3.12,
> > and a delay on this particular piece would be painful to endure..
> >
> > Sooo, please yell loudly if there is an objection to percpu-ida merge as
> > a completely standalone item, that does not effect any existing ida
> > code.
>
> Well the performance is still going to be limited due to the spinlock in
> the percpu handling. You do not need the spinlock. Once preempt is off you
> should have exclusive access to the per cpu data. This is already
> exploited by idr_layer_alloc before the patch. Doing so is going to
> reduce the code size of the patch significantly.
>
> Please post the patch inline so that its easy to comment on it.
>
Hi Christoph,
The latest version from Kent was posted last week here:
http://marc.info/?l=linux-kernel&m=137669878117020&w=2
--nab
On Wed, 21 Aug 2013 14:32:55 +0000 Christoph Lameter <[email protected]> wrote:
> On Tue, 20 Aug 2013, Nicholas A. Bellinger wrote:
>
> > On Thu, 2013-08-08 at 14:32 +0000, Christoph Lameter wrote:
> > > On Wed, 7 Aug 2013, Kent Overstreet wrote:
> > >
> > > > One thing that was bugging me - I was never able to figure out for sure
> > > > if smp_processor_id() returns a number in the range [0, nr_cpu_ids), at
> > > > least I couldn't find where it was documented - could you tell me if
> > > > that's true?
> > >
> > > I always assumed that it was in the range 0 ... nr_cpu_ids - 1 and that is
> > > the assumption under which the kernel code was written. Things would break
> > > horribly if smp_process_id would return nr_cpu_ids or higher.
> > >
> >
> > Hi guys,
> >
> > Just a heads up that I've put Kent's standalone percpu-ida patch (with
> > Christoph's recommend changes) into target-pending/for-next here:
> >
> > https://git.kernel.org/cgit/linux/kernel/git/nab/target-pending.git/commit/?h=for-next&id=47bd524a5b3eb6429b058b8b562b45329ab2c9e7
> >
> > I've got a number of target patches that depend on this code for v3.12,
> > and a delay on this particular piece would be painful to endure..
> >
> > Sooo, please yell loudly if there is an objection to percpu-ida merge as
> > a completely standalone item, that does not effect any existing ida
> > code.
>
> Well the performance is still going to be limited due to the spinlock in
> the percpu handling. You do not need the spinlock. Once preempt is off you
> should have exclusive access to the per cpu data.
The lock is needed so that one cpu can steal tags from another cpu's cache.
See (the needlessly inlined!) steal_tags().
On Wed, Aug 21, 2013 at 07:59:41AM -0400, Tejun Heo wrote:
> Hello, Kent.
>
> On Tue, Aug 20, 2013 at 07:31:51PM -0700, Kent Overstreet wrote:
> > All this for a performance improvement of 10x to 50x (or more), for the
> > ida sizes I measured.
>
> That's misleading, isn't it?
It's comparing it to the existing version that actually exists, instead
of comparing it to a hypothetical approach that doesn't exist yet. I
don't see how that's misleading.
> We should see large performance
> improvements even without the large pages. What matters more is the
> leaf node performance for vast majority of cases and an extra radix
> tree layer on top would cover most of whatever is left. Whether to
> use high order pages or not only affects the extreme niche use cases
> and I don't think going for high order pages to micro optimize those
> extreme use cases is the right trade off.
>
> > So I could see your point if we were allocating gobs of vmalloc memory,
> > or high order allocations big enough to realistically be problematic (I
> > honestly don't think these will be) - but to me, this seems like a
> > pretty reasonable tradeoff for those performance gains.
>
> The trade off is made up as the bulk of the performance benefit can be
> gained without resorting to high order allocations.
I'm more and more skeptical that that's true, and it's a given that it
wouldn't be as fast.
> > (And the performance gains do substantially come from using more
> > contiguous memory and treating the whole data structure as an array, and
> > doing less pointer chasing/looping)
>
> I really have hard time buying that. Let's say you go with single
> page leaf node and an extra single page layer on top. How many IDs
> are we talking about? For the cases which are most performance
> sensitive, this doesn't even matter a bit as percpu caching layer
> would be on top anyway. I really don't think the micro optimization
> is called for at the cost of high order allocations from low level
> tool library.
These "micro optimizations" mean either less pointer chasing or less
branching in the _common_ case; you'd trade common case performance for
avoiding ever doing higher order allocations (and 2 with COMPACTION=n
and 4 with COMPACTION=y is not particularly high order!).
I don't buy that that's a good tradeoff. If you're convinced radix trees
are the way to go and it can be done without much performance cost, why
not code it up and show us?
Hello, Kent.
On Wed, Aug 21, 2013 at 02:09:01PM -0700, Kent Overstreet wrote:
> These "micro optimizations" mean either less pointer chasing or less
> branching in the _common_ case; you'd trade common case performance for
> avoiding ever doing higher order allocations (and 2 with COMPACTION=n
> and 4 with COMPACTION=y is not particularly high order!).
Order 4 allocation probably isn't as bad as before but it still is a
lot nastier than single page allocations. You say doing it the other
way would harm the common case performance but didn't answer my
question about the number of IDs being served per page. How many can
be served from a single page? And how many from two layer single page
configuration? How are you defining the "common" case?
> I don't buy that that's a good tradeoff. If you're convinced radix trees
> are the way to go and it can be done without much performance cost, why
> not code it up and show us?
Well, I'm not the one trying to rewrite ida, so the onus to justify
the proposed code is primarily on you. Another thing is that the
proposed code is *not* using the existing radix tree and instead
implementing its own simplified radix tree, which *can* be fine but
the bar to clear is fairly high. You have to be able to show
*clearly* that using the existing radix tree is not an option. Until
now, the only thing that I gathered is the simplified thing is gonna
be faster in some extreme cases while having clear disadvantage in
terms of memory allocation. Not very convincing.
Thanks.
--
tejun
On Wed, Aug 21, 2013 at 05:16:50PM -0400, Tejun Heo wrote:
> Hello, Kent.
>
> On Wed, Aug 21, 2013 at 02:09:01PM -0700, Kent Overstreet wrote:
> > These "micro optimizations" mean either less pointer chasing or less
> > branching in the _common_ case; you'd trade common case performance for
> > avoiding ever doing higher order allocations (and 2 with COMPACTION=n
> > and 4 with COMPACTION=y is not particularly high order!).
>
> Order 4 allocation probably isn't as bad as before but it still is a
> lot nastier than single page allocations. You say doing it the other
> way would harm the common case performance but didn't answer my
> question about the number of IDs being served per page. How many can
> be served from a single page? And how many from two layer single page
> configuration? How are you defining the "common" case?
With single page allocations:
1 << 15 bits per page
1 << 9 pointers per page
So two layers of pointers does get us to 1 << 33 bits, which is what we
need.
But now, since we need two layers of pointers instead of one, we need
either another pointer deref for a node lookup - _always_, even when
we've got 8 bytes of bits - or we need to branch on the depth of the
tree, which is something we don't have now.
This is extra overhead _no matter the size of the ida_, over my current
approach.
I'm assuming the common case is < one page of bits, based on the usage
I've seen throughout the kernel that's probably way conservative.
In that case, your approach is going to be slower than mine, and there's
no difference in the size of the allocations.
> > I don't buy that that's a good tradeoff. If you're convinced radix trees
> > are the way to go and it can be done without much performance cost, why
> > not code it up and show us?
>
> Well, I'm not the one trying to rewrite ida, so the onus to justify
> the proposed code is primarily on you. Another thing is that the
> proposed code is *not* using the existing radix tree and instead
> implementing its own simplified radix tree, which *can* be fine but
> the bar to clear is fairly high. You have to be able to show
> *clearly* that using the existing radix tree is not an option. Until
> now, the only thing that I gathered is the simplified thing is gonna
> be faster in some extreme cases while having clear disadvantage in
> terms of memory allocation. Not very convincing.
I've already shown massive performance gains over the existing radix
tree approach, you're the one claiming a different approach would be
better.
Hello, Kent.
On Wed, Aug 21, 2013 at 02:24:42PM -0700, Kent Overstreet wrote:
> With single page allocations:
>
> 1 << 15 bits per page
>
> 1 << 9 pointers per page
>
> So two layers of pointers does get us to 1 << 33 bits, which is what we
> need.
And single layer - 1 << 15 - would cover most of the use cases, right?
With 1 << (9 + 15) probably covering everyone else but the cyclic ones
doing the full circle.
> But now, since we need two layers of pointers instead of one, we need
> either another pointer deref for a node lookup - _always_, even when
> we've got 8 bytes of bits - or we need to branch on the depth of the
> tree, which is something we don't have now.
A likely() branch which is almost always hit is *extremely* cheap.
> This is extra overhead _no matter the size of the ida_, over my current
> approach.
> I'm assuming the common case is < one page of bits, based on the usage
> I've seen throughout the kernel that's probably way conservative.
>
> In that case, your approach is going to be slower than mine, and there's
> no difference in the size of the allocations.
By single likely() branch. I'm not even sure that'd be measureable in
most cases. I'd take that over custom radix tree implementation which
needs high order allocations.
> I've already shown massive performance gains over the existing radix
> tree approach, you're the one claiming a different approach would be
> better.
So? What difference does that make? You should be able to justify
your custom thing. If you do something unusual, of course someone is
gonna ask you to justify it and justifying that is *your*
responsibility.
Thanks.
--
tejun
On Wed, 21 Aug 2013, Andrew Morton wrote:
> The lock is needed so that one cpu can steal tags from another cpu's cache.
> See (the needlessly inlined!) steal_tags().
Stealing tags could also be done via IPIs or some other things.
On 08/22/2013 10:44 AM, Christoph Lameter wrote:
> On Wed, 21 Aug 2013, Andrew Morton wrote:
>
>> The lock is needed so that one cpu can steal tags from another cpu's cache.
>> See (the needlessly inlined!) steal_tags().
>
> Stealing tags could also be done via IPIs or some other things.
That is actually the approach I took for blk-mq tagging. But that isn't
free either. I think it pretty much boils down to how much stealing you
expect. If the tag space is sufficiently large, you would not expect a
lot of stealing. And then it doesn't really matter if you use a lock or
an IPI, since it's the rare case - the lock cacheline isn't going to be
shared across processors anyway. The ticket locks aren't exactly free on
their own, however, so that cost is still paid. But I'd be surprised if
it was much of an issue.
If the tag space isn't large enough (common case), then yes, it matters
a lot what kind of mechanism is used. It's on my TODO to compare the two
under realistic scenarios and see how they fare.
--
Jens Axboe