2013-06-12 04:03:33

by Kent Overstreet

Subject: [PATCH] Percpu tag allocator

Allocates integers out of a predefined range - for use by e.g. a driver
to allocate tags for communicating with the device.

v2: Tejun pointed out that the original strategy completely breaks if
nr_cpus is large enough. I think (hope) I realized that at one point,
but completely forgot in the months since I originally implemented it.

Came up with a new strategy where we keep track of the number of cpus
that have tags on their percpu freelists, then steal from a different
random cpu if too many cpus have tags on their freelists.

Note that the synchronization is _definitely_ tricky - we're using
xchg()/cmpxchg() on the percpu lists, to synchronize between
steal_tags().

The alternative would've been adding a spinlock to protect the percpu
freelists, but that would've required some different tricky code to
avoid deadlock because of the lock ordering.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: "Nicholas A. Bellinger" <[email protected]>
---
include/linux/percpu-tags.h | 63 +++++++++
lib/Makefile | 2 +-
lib/percpu-tags.c | 313 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 377 insertions(+), 1 deletion(-)
create mode 100644 include/linux/percpu-tags.h
create mode 100644 lib/percpu-tags.c

diff --git a/include/linux/percpu-tags.h b/include/linux/percpu-tags.h
new file mode 100644
index 0000000..c86b7af
--- /dev/null
+++ b/include/linux/percpu-tags.h
@@ -0,0 +1,63 @@
+/*
+ * Copyright 2012 Google Inc. All Rights Reserved.
+ * Author: [email protected] (Kent Overstreet)
+ *
+ * Per cpu tag allocator. Allocates small integers - up to nr_tags passed to
+ * percpu_tag_pool_init() - for use with say driver tag structures for talking
+ * to a device.
+ *
+ * It works by caching tags on percpu freelists, and then tags are
+ * allocated/freed from the global freelist in batches.
+ *
+ * Note that it will in general be impossible to allocate all nr_tags tags,
+ * since some tags will be stranded on other cpu's freelists: but we guarantee
+ * that nr_tags / 2 can always be allocated.
+ *
+ * This is done by keeping track of which cpus have tags on their percpu
+ * freelists in a bitmap, and then on allocation failure if too many cpus have
+ * tags on their freelists - i.e. if cpus_have_tags * TAG_CPU_SIZE (64) >
+ * nr_tags / 2 - then we steal one remote cpu's freelist (effectively picked at
+ * random).
+ *
+ * This means that if a workload spans a huge number of cpus - in relation to
+ * the number of tags that can be allocated - performance will suffer somewhat;
+ * but as long as the workload is bounded to a reasonable number of cpus the
+ * percpu-ness of the allocator will not be affected.
+ */
+
+#ifndef _LINUX_TAGS_H
+#define _LINUX_TAGS_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct percpu_tag_cpu_freelist;
+
+struct percpu_tag_pool {
+ unsigned nr_tags;
+
+ struct percpu_tag_cpu_freelist *tag_cpu;
+ unsigned long *cpus_have_tags;
+
+ struct {
+ spinlock_t lock;
+ unsigned cpu_last_stolen;
+ /* Global freelist */
+ unsigned nr_free;
+ unsigned *freelist;
+ wait_queue_head_t wait;
+ } ____cacheline_aligned_in_smp;
+};
+
+unsigned percpu_tag_alloc(struct percpu_tag_pool *pool, gfp_t gfp);
+void percpu_tag_free(struct percpu_tag_pool *pool, unsigned tag);
+
+void percpu_tag_pool_free(struct percpu_tag_pool *pool);
+int percpu_tag_pool_init(struct percpu_tag_pool *pool, unsigned long nr_tags);
+
+enum {
+ TAG_FAIL = -1U,
+ TAG_MAX = TAG_FAIL - 1,
+};
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 386db4b..e29a354 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
- earlycpio.o percpu-refcount.o
+ earlycpio.o percpu-refcount.o percpu-tags.o

obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
lib-$(CONFIG_MMU) += ioremap.o
diff --git a/lib/percpu-tags.c b/lib/percpu-tags.c
new file mode 100644
index 0000000..e43b428
--- /dev/null
+++ b/lib/percpu-tags.c
@@ -0,0 +1,313 @@
+/*
+ * Copyright 2012 Google Inc. All Rights Reserved.
+ * Author: [email protected] (Kent Overstreet)
+ *
+ * Per cpu tag allocator.
+ */
+
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/percpu-tags.h>
+
+#define TAG_CPU_WATERMARK 32U
+#define TAG_CPU_SIZE TAG_CPU_WATERMARK * 2
+
+#define TAG_CPU_STEALING UINT_MAX
+
+struct percpu_tag_cpu_freelist {
+ unsigned nr_free;
+ unsigned freelist[];
+};
+
+static inline void move_tags(unsigned *dst, unsigned *dst_nr,
+ unsigned *src, unsigned *src_nr,
+ unsigned nr)
+{
+ *src_nr -= nr;
+ memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
+ *dst_nr += nr;
+}
+
+static bool steal_tags(struct percpu_tag_pool *pool,
+ struct percpu_tag_cpu_freelist *tags)
+{
+ unsigned i, nr_free, cpus_have_tags = 0, cpu = pool->cpu_last_stolen;
+ struct percpu_tag_cpu_freelist *remote;
+
+ for (i = 0; i < DIV_ROUND_UP(nr_cpu_ids, BITS_PER_LONG); i++)
+ cpus_have_tags += hweight_long(pool->cpus_have_tags[i]);
+
+ while (cpus_have_tags-- * TAG_CPU_SIZE > pool->nr_tags / 2) {
+ cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids, cpu);
+
+ if (cpu == nr_cpu_ids)
+ cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids);
+
+ if (cpu == nr_cpu_ids)
+ BUG();
+
+ pool->cpu_last_stolen = cpu;
+ remote = per_cpu_ptr(pool->tag_cpu, cpu);
+
+ if (remote == tags)
+ continue;
+
+ clear_bit(cpu, pool->cpus_have_tags);
+
+ nr_free = xchg(&remote->nr_free, TAG_CPU_STEALING);
+
+ if (nr_free) {
+ memcpy(tags->freelist,
+ remote->freelist,
+ sizeof(unsigned) * nr_free);
+ smp_mb();
+ remote->nr_free = 0;
+ tags->nr_free = nr_free;
+ return true;
+ } else {
+ remote->nr_free = 0;
+ }
+ }
+
+ return false;
+}
+
+static bool alloc_global_tags(struct percpu_tag_pool *pool,
+ struct percpu_tag_cpu_freelist *tags)
+{
+ if (pool->nr_free) {
+ move_tags(tags->freelist, &tags->nr_free,
+ pool->freelist, &pool->nr_free,
+ min(pool->nr_free, TAG_CPU_WATERMARK));
+ return true;
+ }
+
+ return false;
+}
+
+static unsigned alloc_local_tag(struct percpu_tag_pool *pool,
+ struct percpu_tag_cpu_freelist *tags)
+{
+ unsigned nr_free, old, new, tag;
+
+ /*
+ * Try per cpu freelist
+ * Since we don't have global lock held, need to use cmpxchg()
+ * to guard against a different thread using steal_tags() on us:
+ */
+ nr_free = tags->nr_free;
+
+ do {
+ if (!nr_free || nr_free == TAG_CPU_STEALING)
+ return TAG_FAIL;
+
+ old = nr_free;
+ new = old - 1;
+ tag = tags->freelist[new];
+
+ nr_free = cmpxchg(&tags->nr_free, old, new);
+ } while (nr_free != old);
+
+ return tag;
+}
+
+/**
+ * tag_alloc - allocate a tag
+ * @pool: pool to allocate from
+ * @gfp: gfp flags
+ *
+ * Returns a tag - an integer in the range [0..nr_tags) (passed to
+ * tag_pool_init()), or otherwise TAG_FAIL on allocation failure.
+ *
+ * Will not fail if passed __GFP_WAIT.
+ */
+unsigned percpu_tag_alloc(struct percpu_tag_pool *pool, gfp_t gfp)
+{
+ DEFINE_WAIT(wait);
+ struct percpu_tag_cpu_freelist *tags;
+ unsigned long flags;
+ unsigned tag, this_cpu;
+
+ while (1) {
+ local_irq_save(flags);
+ this_cpu = smp_processor_id();
+ tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
+
+ tag = alloc_local_tag(pool, tags);
+ if (tag != TAG_FAIL)
+ break;
+
+ spin_lock(&pool->lock);
+
+ /*
+ * prepare_to_wait() must come before steal_tags(), in case
+ * percpu_tag_free() on another cpu flips a bit in
+ * cpus_have_tags
+ */
+ prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
+
+ /*
+ * alloc_global_tags(), steal_tags() return true iff we now have
+ * tags on our percpu freelist
+ */
+ if (alloc_global_tags(pool, tags) ||
+ steal_tags(pool, tags)) {
+ /* Global lock held, don't need cmpxchg */
+ tag = tags->freelist[--tags->nr_free];
+ if (tags->nr_free)
+ set_bit(this_cpu, pool->cpus_have_tags);
+
+ spin_unlock(&pool->lock);
+ break;
+ }
+
+ if (!(gfp & __GFP_WAIT)) {
+ spin_unlock(&pool->lock);
+ break;
+ }
+
+ spin_unlock(&pool->lock);
+ local_irq_restore(flags);
+
+ schedule();
+ }
+
+ local_irq_restore(flags);
+ finish_wait(&pool->wait, &wait);
+ return tag;
+}
+EXPORT_SYMBOL_GPL(percpu_tag_alloc);
+
+/**
+ * tag_free - free a tag
+ * @pool: pool @tag was allocated from
+ * @tag: a tag previously allocated with tag_alloc()
+ */
+void percpu_tag_free(struct percpu_tag_pool *pool, unsigned tag)
+{
+ struct percpu_tag_cpu_freelist *tags;
+ unsigned long flags;
+ unsigned nr_free, old, new, this_cpu;
+
+ BUG_ON(tag >= pool->nr_tags);
+
+ local_irq_save(flags);
+ this_cpu = smp_processor_id();
+ tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
+
+ /*
+ * Need to guard against racing with steal_tags() on another cpu - we
+ * can manage with just cmpxchg because we can only race with tags being
+ * pulled off our freelist, not other threads pushing tags back onto our
+ * freelist
+ */
+ nr_free = tags->nr_free;
+
+ do {
+ if (nr_free == TAG_CPU_STEALING) {
+ cpu_relax();
+ nr_free = tags->nr_free;
+ continue;
+ }
+
+ old = nr_free;
+ new = old + 1;
+ tags->freelist[old] = tag;
+
+ nr_free = cmpxchg(&tags->nr_free, old, new);
+ } while (nr_free != old);
+
+ if (!nr_free) {
+ set_bit(this_cpu, pool->cpus_have_tags);
+ wake_up(&pool->wait);
+ }
+
+ if (new == TAG_CPU_SIZE) {
+ spin_lock(&pool->lock);
+ /* Might have had our tags stolen before we took global lock */
+ if (tags->nr_free == TAG_CPU_SIZE) {
+ move_tags(pool->freelist, &pool->nr_free,
+ tags->freelist, &tags->nr_free,
+ TAG_CPU_WATERMARK);
+
+ wake_up(&pool->wait);
+ }
+ spin_unlock(&pool->lock);
+ }
+
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(percpu_tag_free);
+
+/**
+ * tag_pool_free - release a tag pool's resources
+ * @pool: pool to free
+ *
+ * Frees the resources allocated by tag_pool_init().
+ */
+void percpu_tag_pool_free(struct percpu_tag_pool *pool)
+{
+ free_percpu(pool->tag_cpu);
+ kfree(pool->cpus_have_tags);
+ free_pages((unsigned long) pool->freelist,
+ get_order(pool->nr_tags * sizeof(unsigned)));
+}
+EXPORT_SYMBOL_GPL(percpu_tag_pool_free);
+
+/**
+ * tag_pool_init - initialize a percpu tag pool
+ * @pool: pool to initialize
+ * @nr_tags: number of tags that will be available for allocation
+ *
+ * Initializes @pool so that it can be used to allocate tags - integers in the
+ * range [0, nr_tags). Typically, they'll be used by driver code to refer to a
+ * preallocated array of tag structures.
+ *
+ * Allocation is percpu, but sharding is limited by nr_tags - for best
+ * performance, the workload should not span more cpus than nr_tags / 128.
+ */
+int percpu_tag_pool_init(struct percpu_tag_pool *pool, unsigned long nr_tags)
+{
+ unsigned i, order;
+
+ memset(pool, 0, sizeof(*pool));
+
+ spin_lock_init(&pool->lock);
+ init_waitqueue_head(&pool->wait);
+ pool->nr_tags = nr_tags;
+
+ /* Guard against overflow */
+ if (nr_tags > TAG_MAX) {
+ pr_err("tags.c: nr_tags too large\n");
+ return -EINVAL;
+ }
+
+ order = get_order(nr_tags * sizeof(unsigned));
+ pool->freelist = (void *) __get_free_pages(GFP_KERNEL, order);
+ if (!pool->freelist)
+ goto err;
+
+ for (i = 0; i < nr_tags; i++)
+ pool->freelist[i] = i;
+
+ pool->nr_free = nr_tags;
+
+ pool->cpus_have_tags = kzalloc(DIV_ROUND_UP(nr_cpu_ids, BITS_PER_LONG) *
+ sizeof(unsigned long), GFP_KERNEL);
+ if (!pool->cpus_have_tags)
+ goto err;
+
+ pool->tag_cpu = __alloc_percpu(sizeof(struct percpu_tag_cpu_freelist) +
+ TAG_CPU_SIZE * sizeof(unsigned),
+ sizeof(unsigned));
+ if (!pool->tag_cpu)
+ goto err;
+
+ return 0;
+err:
+ percpu_tag_pool_free(pool);
+ return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(percpu_tag_pool_init);
--
1.8.3.rc1


2013-06-12 17:13:03

by Oleg Nesterov

Subject: Re: [PATCH] Percpu tag allocator

On 06/11, Kent Overstreet wrote:
>
> + * This is done by keeping track of which cpus have tags on their percpu
> + * freelists in a bitmap, and then on allocation failure if too many cpus have
> + * tags on their freelists - i.e. if cpus_have_tags * TAG_CPU_SIZE (64) >
> + * nr_tags / 2 - then we steal one remote cpu's freelist (effectively picked at
> + * random).

Interesting... I'll try to read the patch later.

I have to admit, I do not really understand what this bitmap can actually
buy after a quick glance. It is only a hint afaics, and obviously it is
not accurate. alloc_local_tag() never clears the bit, tag_free()->set_bit()
is racy, etc. I guess this is fine, just this doesn't look clear.

But the code looks correct, just a couple of minor nits, feel free to
ignore.

> +enum {
> + TAG_FAIL = -1U,
> + TAG_MAX = TAG_FAIL - 1,
> +};

This can probably go to .c, and it seems that TAG_MAX is not needed.
tag_init() can check "nr_tags >= TAG_FAIL" instead.

> +static bool steal_tags(struct percpu_tag_pool *pool,
> + struct percpu_tag_cpu_freelist *tags)
> +{
> + unsigned i, nr_free, cpus_have_tags = 0, cpu = pool->cpu_last_stolen;
> + struct percpu_tag_cpu_freelist *remote;
> +
> + for (i = 0; i < DIV_ROUND_UP(nr_cpu_ids, BITS_PER_LONG); i++)
> + cpus_have_tags += hweight_long(pool->cpus_have_tags[i]);

bitmap_weight(pool->cpus_have_tags, pool->nr_tags).

> +
> + while (cpus_have_tags-- * TAG_CPU_SIZE > pool->nr_tags / 2) {
> + cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids, cpu);
> +
> + if (cpu == nr_cpu_ids)
> + cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids);
> +
> + if (cpu == nr_cpu_ids)
> + BUG();
> +
> + pool->cpu_last_stolen = cpu;
> + remote = per_cpu_ptr(pool->tag_cpu, cpu);
> +
> + if (remote == tags)
> + continue;
> +
> + clear_bit(cpu, pool->cpus_have_tags);
> +
> + nr_free = xchg(&remote->nr_free, TAG_CPU_STEALING);
> +
> + if (nr_free) {
> + memcpy(tags->freelist,
> + remote->freelist,
> + sizeof(unsigned) * nr_free);
> + smp_mb();
> + remote->nr_free = 0;
> + tags->nr_free = nr_free;
> + return true;
> + } else {
> + remote->nr_free = 0;
> + }

Both branches clear remote->nr_free.

> +int percpu_tag_pool_init(struct percpu_tag_pool *pool, unsigned long nr_tags)
> +{
> + unsigned i, order;
> +
> + memset(pool, 0, sizeof(*pool));
> +
> + spin_lock_init(&pool->lock);
> + init_waitqueue_head(&pool->wait);
> + pool->nr_tags = nr_tags;
> +
> + /* Guard against overflow */
> + if (nr_tags > TAG_MAX) {
> + pr_err("tags.c: nr_tags too large\n");
> + return -EINVAL;
> + }
> +
> + order = get_order(nr_tags * sizeof(unsigned));
> + pool->freelist = (void *) __get_free_pages(GFP_KERNEL, order);
> + if (!pool->freelist)
> + goto err;
> +
> + for (i = 0; i < nr_tags; i++)
> + pool->freelist[i] = i;
> +
> + pool->nr_free = nr_tags;
> +
> + pool->cpus_have_tags = kzalloc(DIV_ROUND_UP(nr_cpu_ids, BITS_PER_LONG) *
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
BITS_TO_LONGS(nr_cpu_ids)

Oleg.

2013-06-12 17:59:22

by Kent Overstreet

Subject: Re: [PATCH] Percpu tag allocator

On Wed, Jun 12, 2013 at 07:08:35PM +0200, Oleg Nesterov wrote:
> On 06/11, Kent Overstreet wrote:
> >
> > + * This is done by keeping track of which cpus have tags on their percpu
> > + * freelists in a bitmap, and then on allocation failure if too many cpus have
> > + * tags on their freelists - i.e. if cpus_have_tags * TAG_CPU_SIZE (64) >
> > + * nr_tags / 2 - then we steal one remote cpu's freelist (effectively picked at
> > + * random).
>
> Interesting... I'll try to read the patch later.
>
> I have to admit, I do not really understand what this bitmap can actually
> buy after a quick glance. It is only a hint afaics, and obviously it is
> not accurate. alloc_local_tag() never clears the bit, tag_free()->set_bit()
> is racy, etc. I guess this is fine, just this doesn't look clear.

Yeah, I'll add a comment explaining how it's used.

The bitmap is actually fairly important because that's how steal_tags()
knows whether it should run - without the bitmap, every time
alloc_global_tags() fails (and if you're keeping your device's queue
full and all tags allocated, that'll happen a lot) steal_tags() would
have to touch every other percpu freelist.

So we'd need at least an atomic counter, but a bitmap isn't really any
more trouble and it lets us skip most of the percpu lists that are empty
- which should make a real difference in scalability to huge nr_cpus.

The main thing is it's fine for a freelist to be empty when the bitmap
is set - steal_tags() will just keep iterating - but the bitmap _must_
be set whenever there's tags on a percpu freelist.

That's why it's ok for alloc_local_tag() to skip clearing it, and it's
probably better for it not to because it means not touching a shared
cacheline, and most likely we'll be doing more work on that cpu and
refilling the percpu freelist soon anyways.

>
> But the code looks correct, just a couple of minor nits, feel freee to
> ignore.
>
> > +enum {
> > + TAG_FAIL = -1U,
> > + TAG_MAX = TAG_FAIL - 1,
> > +};
>
> This can probably go to .c, and it seems that TAG_MAX is not needed.
> tag_init() can check "nr_tags >= TAG_FAIL" instead.

Users need TAG_FAIL to check for allocation failure.
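
E.g. a driver that can't block would do something like this (purely
illustrative - dev->tags and the error handling are made up here,
percpu_tag_alloc()/TAG_FAIL are the real interface):

	tag = percpu_tag_alloc(&dev->tags, GFP_ATOMIC);
	if (tag == TAG_FAIL)
		return -EBUSY;	/* no tags free right now, retry later */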

>
> > +static bool steal_tags(struct percpu_tag_pool *pool,
> > + struct percpu_tag_cpu_freelist *tags)
> > +{
> > + unsigned i, nr_free, cpus_have_tags = 0, cpu = pool->cpu_last_stolen;
> > + struct percpu_tag_cpu_freelist *remote;
> > +
> > + for (i = 0; i < DIV_ROUND_UP(nr_cpu_ids, BITS_PER_LONG); i++)
> > + cpus_have_tags += hweight_long(pool->cpus_have_tags[i]);
>
> bitmap_weight(pool->cpus_have_tags, pool->nr_tags).

I couldn't remember what that was called, thanks :)

> > +
> > + while (cpus_have_tags-- * TAG_CPU_SIZE > pool->nr_tags / 2) {
> > + cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids, cpu);
> > +
> > + if (cpu == nr_cpu_ids)
> > + cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids);
> > +
> > + if (cpu == nr_cpu_ids)
> > + BUG();
> > +
> > + pool->cpu_last_stolen = cpu;
> > + remote = per_cpu_ptr(pool->tag_cpu, cpu);
> > +
> > + if (remote == tags)
> > + continue;
> > +
> > + clear_bit(cpu, pool->cpus_have_tags);
> > +
> > + nr_free = xchg(&remote->nr_free, TAG_CPU_STEALING);
> > +
> > + if (nr_free) {
> > + memcpy(tags->freelist,
> > + remote->freelist,
> > + sizeof(unsigned) * nr_free);
> > + smp_mb();
> > + remote->nr_free = 0;
> > + tags->nr_free = nr_free;
> > + return true;
> > + } else {
> > + remote->nr_free = 0;
> > + }
>
> Both branches clear remote->nr_free.

Yeah, but clearing it has to happen after the memcpy() and the smp_mb().
I couldn't find a way to combine them that I liked, but if you've got
any suggestions I'm all ears.

> BITS_TO_LONGS(nr_cpu_ids)

Aha, thanks.

I've made a few tweaks since doing some profiling this morning - here's
an updated version. Main thing was adding a fast path to
percpu_tag_alloc(), the finish_wait() was more expensive than I
expected:

From 2ae72b8d0b7a995b875d0d702497c76c4f950e2e Mon Sep 17 00:00:00 2001
From: Kent Overstreet <[email protected]>
Date: Tue, 11 Jun 2013 20:59:04 -0700
Subject: [PATCH] Percpu tag allocator

Allocates integers out of a predefined range - for use by e.g. a driver
to allocate tags for communicating with the device.

v2: Tejun pointed out that the original strategy completely breaks if
nr_cpus is large enough. I think (hope) I realized that at one point,
but completely forgot in the months since I originally implemented it.

Came up with a new strategy where we keep track of the number of cpus
that have tags on their percpu freelists, then steal from a different
random cpu if too many cpus have tags on their freelists.

Note that the synchronization is _definitely_ tricky - we're using
xchg()/cmpxchg() on the percpu lists, to synchronize between
steal_tags().

The alternative would've been adding a spinlock to protect the percpu
freelists, but that would've required some different tricky code to
avoid deadlock because of the lock ordering.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: "Nicholas A. Bellinger" <[email protected]>

diff --git a/include/linux/percpu-tags.h b/include/linux/percpu-tags.h
new file mode 100644
index 0000000..9cf3c90
--- /dev/null
+++ b/include/linux/percpu-tags.h
@@ -0,0 +1,73 @@
+/*
+ * Copyright 2012 Google Inc. All Rights Reserved.
+ * Author: [email protected] (Kent Overstreet)
+ *
+ * Per cpu tag allocator. Allocates small integers - up to nr_tags passed to
+ * percpu_tag_pool_init() - for use with say driver tag structures for talking
+ * to a device.
+ *
+ * It works by caching tags on percpu freelists, and then tags are
+ * allocated/freed from the global freelist in batches.
+ *
+ * Note that it will in general be impossible to allocate all nr_tags tags,
+ * since some tags will be stranded on other cpu's freelists: but we guarantee
+ * that nr_tags / 2 can always be allocated.
+ *
+ * This is done by keeping track of which cpus have tags on their percpu
+ * freelists in a bitmap, and then on allocation failure if too many cpus have
+ * tags on their freelists - i.e. if cpus_have_tags * TAG_CPU_SIZE (64) >
+ * nr_tags / 2 - then we steal one remote cpu's freelist (effectively picked at
+ * random).
+ *
+ * This means that if a workload spans a huge number of cpus - in relation to
+ * the number of tags that can be allocated - performance will suffer somewhat;
+ * but as long as the workload is bounded to a reasonable number of cpus the
+ * percpu-ness of the allocator will not be affected.
+ */
+
+#ifndef _LINUX_TAGS_H
+#define _LINUX_TAGS_H
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+
+struct percpu_tag_cpu_freelist;
+
+struct percpu_tag_pool {
+ unsigned nr_tags;
+
+ struct percpu_tag_cpu_freelist *tag_cpu;
+
+ /*
+ * Bitmap of cpus that (may) have tags on their percpu freelists:
+ * steal_tags() uses this to decide when to steal tags, and which cpus
+ * to try stealing from.
+ *
+ * It's ok for a freelist to be empty when its bit is set - steal_tags()
+ * will just keep looking - but the bitmap _must_ be set whenever a
+ * percpu freelist does have tags.
+ */
+ unsigned long *cpus_have_tags;
+
+ struct {
+ spinlock_t lock;
+ unsigned cpu_last_stolen;
+ /* Global freelist */
+ unsigned nr_free;
+ unsigned *freelist;
+ wait_queue_head_t wait;
+ } ____cacheline_aligned_in_smp;
+};
+
+unsigned percpu_tag_alloc(struct percpu_tag_pool *pool, gfp_t gfp);
+void percpu_tag_free(struct percpu_tag_pool *pool, unsigned tag);
+
+void percpu_tag_pool_free(struct percpu_tag_pool *pool);
+int percpu_tag_pool_init(struct percpu_tag_pool *pool, unsigned long nr_tags);
+
+enum {
+ TAG_FAIL = -1U,
+ TAG_MAX = TAG_FAIL - 1,
+};
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 386db4b..e29a354 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
is_single_threaded.o plist.o decompress.o kobject_uevent.o \
- earlycpio.o percpu-refcount.o
+ earlycpio.o percpu-refcount.o percpu-tags.o

obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
lib-$(CONFIG_MMU) += ioremap.o
diff --git a/lib/percpu-tags.c b/lib/percpu-tags.c
new file mode 100644
index 0000000..b26a865
--- /dev/null
+++ b/lib/percpu-tags.c
@@ -0,0 +1,314 @@
+/*
+ * Copyright 2012 Google Inc. All Rights Reserved.
+ * Author: [email protected] (Kent Overstreet)
+ *
+ * Per cpu tag allocator.
+ */
+
+#include <linux/gfp.h>
+#include <linux/module.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/percpu-tags.h>
+
+#define TAG_CPU_WATERMARK 32U
+#define TAG_CPU_SIZE (TAG_CPU_WATERMARK * 2)
+
+#define TAG_CPU_STEALING UINT_MAX
+
+struct percpu_tag_cpu_freelist {
+ unsigned nr_free;
+ unsigned freelist[];
+};
+
+static inline void move_tags(unsigned *dst, unsigned *dst_nr,
+ unsigned *src, unsigned *src_nr,
+ unsigned nr)
+{
+ *src_nr -= nr;
+ memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
+ *dst_nr += nr;
+}
+
+static inline bool steal_tags(struct percpu_tag_pool *pool,
+ struct percpu_tag_cpu_freelist *tags)
+{
+ unsigned nr_free, cpus_have_tags, cpu = pool->cpu_last_stolen;
+ struct percpu_tag_cpu_freelist *remote;
+
+ for (cpus_have_tags = bitmap_weight(pool->cpus_have_tags, nr_cpu_ids);
+ cpus_have_tags * TAG_CPU_SIZE > pool->nr_tags / 2;
+ cpus_have_tags--) {
+ cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids, cpu);
+
+ if (cpu == nr_cpu_ids)
+ cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids);
+
+ if (cpu == nr_cpu_ids)
+ BUG();
+
+ pool->cpu_last_stolen = cpu;
+ remote = per_cpu_ptr(pool->tag_cpu, cpu);
+
+ if (remote == tags)
+ continue;
+
+ clear_bit(cpu, pool->cpus_have_tags);
+
+ nr_free = xchg(&remote->nr_free, TAG_CPU_STEALING);
+
+ if (nr_free) {
+ memcpy(tags->freelist,
+ remote->freelist,
+ sizeof(unsigned) * nr_free);
+ smp_mb();
+ remote->nr_free = 0;
+ tags->nr_free = nr_free;
+ return true;
+ } else {
+ remote->nr_free = 0;
+ }
+ }
+
+ return false;
+}
+
+static inline bool alloc_global_tags(struct percpu_tag_pool *pool,
+ struct percpu_tag_cpu_freelist *tags)
+{
+ if (pool->nr_free) {
+ move_tags(tags->freelist, &tags->nr_free,
+ pool->freelist, &pool->nr_free,
+ min(pool->nr_free, TAG_CPU_WATERMARK));
+ return true;
+ }
+
+ return false;
+}
+
+static inline unsigned alloc_local_tag(struct percpu_tag_pool *pool,
+ struct percpu_tag_cpu_freelist *tags)
+{
+ unsigned nr_free, old, new, tag;
+
+ /*
+ * Try per cpu freelist
+ * Since we don't have global lock held, need to use cmpxchg()
+ * to guard against a different thread using steal_tags() on us:
+ */
+ nr_free = tags->nr_free;
+
+ do {
+ if (unlikely(!nr_free || nr_free == TAG_CPU_STEALING))
+ return TAG_FAIL;
+
+ old = nr_free;
+ new = old - 1;
+ tag = tags->freelist[new];
+
+ nr_free = cmpxchg(&tags->nr_free, old, new);
+ } while (unlikely(nr_free != old));
+
+ return tag;
+}
+
+/**
+ * tag_alloc - allocate a tag
+ * @pool: pool to allocate from
+ * @gfp: gfp flags
+ *
+ * Returns a tag - an integer in the range [0..nr_tags) (passed to
+ * tag_pool_init()), or otherwise TAG_FAIL on allocation failure.
+ *
+ * Will not fail if passed __GFP_WAIT.
+ */
+unsigned percpu_tag_alloc(struct percpu_tag_pool *pool, gfp_t gfp)
+{
+ DEFINE_WAIT(wait);
+ struct percpu_tag_cpu_freelist *tags;
+ unsigned long flags;
+ unsigned tag, this_cpu;
+
+ local_irq_save(flags);
+ this_cpu = smp_processor_id();
+ tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
+
+ /* Fastpath */
+ tag = alloc_local_tag(pool, tags);
+ if (likely(tag != TAG_FAIL)) {
+ local_irq_restore(flags);
+ return tag;
+ }
+
+ while (1) {
+ spin_lock(&pool->lock);
+
+ /*
+ * prepare_to_wait() must come before steal_tags(), in case
+ * percpu_tag_free() on another cpu flips a bit in
+ * cpus_have_tags
+ */
+ prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
+
+ /*
+ * alloc_global_tags(), steal_tags() return true iff we now have
+ * tags on our percpu freelist
+ */
+ if (tags->nr_free ||
+ alloc_global_tags(pool, tags) ||
+ steal_tags(pool, tags)) {
+ /* Global lock held, don't need cmpxchg */
+ tag = tags->freelist[--tags->nr_free];
+ if (tags->nr_free)
+ set_bit(this_cpu, pool->cpus_have_tags);
+ }
+
+ spin_unlock(&pool->lock);
+ local_irq_restore(flags);
+
+ if (tag != TAG_FAIL || !(gfp & __GFP_WAIT))
+ break;
+
+ schedule();
+
+ local_irq_save(flags);
+ this_cpu = smp_processor_id();
+ tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
+ }
+
+ finish_wait(&pool->wait, &wait);
+ return tag;
+}
+EXPORT_SYMBOL_GPL(percpu_tag_alloc);
+
+/**
+ * tag_free - free a tag
+ * @pool: pool @tag was allocated from
+ * @tag: a tag previously allocated with tag_alloc()
+ */
+void percpu_tag_free(struct percpu_tag_pool *pool, unsigned tag)
+{
+ struct percpu_tag_cpu_freelist *tags;
+ unsigned long flags;
+ unsigned nr_free, old, new, this_cpu;
+
+ BUG_ON(tag >= pool->nr_tags);
+
+ local_irq_save(flags);
+ this_cpu = smp_processor_id();
+ tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
+
+ /*
+ * Need to guard against racing with steal_tags() on another cpu - we
+ * can manage with just cmpxchg because we can only race with tags being
+ * pulled off our freelist, not other threads pushing tags back onto our
+ * freelist
+ */
+ nr_free = tags->nr_free;
+
+ do {
+ if (unlikely(nr_free == TAG_CPU_STEALING)) {
+ cpu_relax();
+ nr_free = tags->nr_free;
+ continue;
+ }
+
+ old = nr_free;
+ new = old + 1;
+ tags->freelist[old] = tag;
+
+ nr_free = cmpxchg(&tags->nr_free, old, new);
+ } while (unlikely(nr_free != old));
+
+ if (!nr_free) {
+ set_bit(this_cpu, pool->cpus_have_tags);
+ wake_up(&pool->wait);
+ }
+
+ if (new == TAG_CPU_SIZE) {
+ spin_lock(&pool->lock);
+ /* Might have had our tags stolen before we took global lock */
+ if (tags->nr_free == TAG_CPU_SIZE) {
+ move_tags(pool->freelist, &pool->nr_free,
+ tags->freelist, &tags->nr_free,
+ TAG_CPU_WATERMARK);
+
+ wake_up(&pool->wait);
+ }
+ spin_unlock(&pool->lock);
+ }
+
+ local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(percpu_tag_free);
+
+/**
+ * tag_pool_free - release a tag pool's resources
+ * @pool: pool to free
+ *
+ * Frees the resources allocated by tag_pool_init().
+ */
+void percpu_tag_pool_free(struct percpu_tag_pool *pool)
+{
+ free_percpu(pool->tag_cpu);
+ kfree(pool->cpus_have_tags);
+ free_pages((unsigned long) pool->freelist,
+ get_order(pool->nr_tags * sizeof(unsigned)));
+}
+EXPORT_SYMBOL_GPL(percpu_tag_pool_free);
+
+/**
+ * tag_pool_init - initialize a percpu tag pool
+ * @pool: pool to initialize
+ * @nr_tags: number of tags that will be available for allocation
+ *
+ * Initializes @pool so that it can be used to allocate tags - integers in the
+ * range [0, nr_tags). Typically, they'll be used by driver code to refer to a
+ * preallocated array of tag structures.
+ *
+ * Allocation is percpu, but sharding is limited by nr_tags - for best
+ * performance, the workload should not span more cpus than nr_tags / 128.
+ */
+int percpu_tag_pool_init(struct percpu_tag_pool *pool, unsigned long nr_tags)
+{
+ unsigned i, order;
+
+ memset(pool, 0, sizeof(*pool));
+
+ spin_lock_init(&pool->lock);
+ init_waitqueue_head(&pool->wait);
+ pool->nr_tags = nr_tags;
+
+ /* Guard against overflow */
+ if (nr_tags > TAG_MAX) {
+ pr_err("tags.c: nr_tags too large\n");
+ return -EINVAL;
+ }
+
+ order = get_order(nr_tags * sizeof(unsigned));
+ pool->freelist = (void *) __get_free_pages(GFP_KERNEL, order);
+ if (!pool->freelist)
+ goto err;
+
+ for (i = 0; i < nr_tags; i++)
+ pool->freelist[i] = i;
+
+ pool->nr_free = nr_tags;
+
+ pool->cpus_have_tags = kzalloc(BITS_TO_LONGS(nr_cpu_ids) *
+ sizeof(unsigned long), GFP_KERNEL);
+ if (!pool->cpus_have_tags)
+ goto err;
+
+ pool->tag_cpu = __alloc_percpu(sizeof(struct percpu_tag_cpu_freelist) +
+ TAG_CPU_SIZE * sizeof(unsigned),
+ sizeof(unsigned));
+ if (!pool->tag_cpu)
+ goto err;
+
+ return 0;
+err:
+ percpu_tag_pool_free(pool);
+ return -ENOMEM;
+}
+EXPORT_SYMBOL_GPL(percpu_tag_pool_init);

2013-06-12 19:18:48

by Oleg Nesterov

Subject: Re: [PATCH] Percpu tag allocator

On 06/12, Kent Overstreet wrote:
>
> So we'd need at least an atomic counter, but a bitmap isn't really any
> more trouble and it lets us skip most of the percpu lists that are empty

Yes, yes, I understand.

> - which should make a real difference in scalability to huge nr_cpus.

But this is not obvious to me. I mean, I am not sure I understand why
this all is "optimal". In particular, I do not really understand
"while (cpus_have_tags-- * TAG_CPU_SIZE > pool->nr_tags / 2)" in
steal_tags(), even if "the workload should not span more cpus than
nr_tags / 128" is true. I guess this connects to "we guarantee that
nr_tags / 2 can always be allocated" and we do not want to call
steal_tags() too often/otherwise, but cpus_have_tags * TAG_CPU_SIZE
can easily overestimate the number of free tags.

But I didn't read the patch carefully, and it is not that I think I
can suggest something better.

In short: please ignore ;)

> > > +enum {
> > > + TAG_FAIL = -1U,
> > > + TAG_MAX = TAG_FAIL - 1,
> > > +};
> >
> > This can probably go to .c, and it seems that TAG_MAX is not needed.
> > tag_init() can check "nr_tags >= TAG_FAIL" instead.
>
> Users need TAG_FAIL to check for allocation failure.

Ah, indeed, !__GFP_WAIT...

Hmm, but why gfp_t? Why not "bool atomic"?

Probably this is because you expect that most callers should have
gfp anyway. Looks a bit strange but I won't argue.

> > > + if (nr_free) {
> > > + memcpy(tags->freelist,
> > > + remote->freelist,
> > > + sizeof(unsigned) * nr_free);
> > > + smp_mb();
> > > + remote->nr_free = 0;
> > > + tags->nr_free = nr_free;
> > > + return true;
> > > + } else {
> > > + remote->nr_free = 0;
> > > + }
> >
> > Both branches clear remote->nr_free.
>
> Yeah, but clearing it has to happen after the memcpy() and the smp_mb().

Yes, yes, we need mb() between memcpy() and "remote = 0",

> I couldn't find a way to combine them that I liked, but if you've got
> any suggestions I'm all ears.

Please ignore. Somehow I missed the fact we need to return or continue,
so we need "else" or another check.

Oleg.

2013-06-12 23:38:58

by Andrew Morton

Subject: Re: [PATCH] Percpu tag allocator

On Tue, 11 Jun 2013 21:03:24 -0700 Kent Overstreet <[email protected]> wrote:

> Allocates integers out of a predefined range

0..n, I assume. Not n..m.

> - for use by e.g. a driver
> to allocate tags for communicating with the device.
>
> v2: Tejun pointed out that the original strategy completely breaks if
> nr_cpus is large enough. I think (hope) I realized that at one point,
> but completely forgot in the months since I originally implemented it.
>
> Came up with a new strategy where we keep track of the number of cpus
> that have tags on their percpu freelists, then steal from a different
> random cpu if too many cpus have tags on their freelists.
>
> Note that the synchronization is _definitely_ tricky - we're using
> xchg()/cmpxchg() on the percpu lists, to synchronize between
> steal_tags().
>
> The alternative would've been adding a spinlock to protect the percpu
> freelists, but that would've required some different tricky code to
> avoid deadlock because of the lock ordering.

The changelog makes this code very hard to review. Because it fails to
inform us of critical things such as:

What are the actual objectives here? The requirements as you see them?
What hardware and drivers are we talking about here?

What are the frequency requirements of that hardware?

Why can't we use ida_get_new_above()?

If it is "speed" then how bad is the problem and what efforts have
been made to address them within the idr code? (A per-cpu magazine
frontend to ida_get_new_above() might suit).

If it is "functionality" then what efforts have been made to
suitably modify the ida code?



And all of that is before we even start to look at the implementation!
How can one review an implementation without being told what it is to
achieve?


wrt the NR_CPUS=1024 thing: yes, this is a problem with the per-cpu
design. Especially if the hardware only has a 10-bit-tag! What's the
worst case here? What is the smallest tag size which you reasonably
expect we will encounter? What are the expected failure modes with
large cpu count and small tag size?

>
> ...
>
> +/*
> + * Copyright 2012 Google Inc. All Rights Reserved.
> + * Author: [email protected] (Kent Overstreet)
> + *
> + * Per cpu tag allocator. Allocates small integers - up to nr_tags passed to
> + * percpu_tag_pool_init() - for use with say driver tag structures for talking
> + * to a device.
> + *
> + * It works by caching tags on percpu freelists, and then tags are
> + * allocated/freed from the global freelist in batches.
> + *
> + * Note that it will in general be impossible to allocate all nr_tags tags,
> + * since some tags will be stranded on other cpu's freelists: but we guarantee
> + * that nr_tags / 2 can always be allocated.
> + *
> + * This is done by keeping track of which cpus have tags on their percpu
> + * freelists in a bitmap, and then on allocation failure if too many cpus have
> + * tags on their freelists - i.e. if cpus_have_tags * TAG_CPU_SIZE (64) >
> + * nr_tags / 2 - then we steal one remote cpu's freelist (effectively picked at
> + * random).

OK, this comment partially makes up for changelog deficiencies.

Why did you choose to directly fiddle with the other-cpus data rather
than issuing a cross-cpu IPI call? The latter would presumably result
in slower stealing performance but better common-case performance.

I didn't immediately see where this "if cpus_have_tags * TAG_CPU_SIZE
(64) > nr_tags / 2" was implemented and I don't immediately see *why*
it was implemented. Why do we care? If local allocation fails, just
go steal.

> + * This means that if a workload spans a huge number of cpus - in relation to
> + * the number of tags that can be allocated - performance will suffer somewhat;
> + * but as long as the workload is bounded to a reasonable number of cpus the
> + * percpu-ness of the allocator will not be affected.
> + */
> +
> +#ifndef _LINUX_TAGS_H
> +#define _LINUX_TAGS_H
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +
> +struct percpu_tag_cpu_freelist;
> +
> +struct percpu_tag_pool {
> + unsigned nr_tags;
> +
> + struct percpu_tag_cpu_freelist *tag_cpu;
> + unsigned long *cpus_have_tags;
> +
> + struct {
> + spinlock_t lock;
> + unsigned cpu_last_stolen;
> + /* Global freelist */
> + unsigned nr_free;
> + unsigned *freelist;
> + wait_queue_head_t wait;
> + } ____cacheline_aligned_in_smp;
> +};

Please lovingly document the struct fields - it really pays off.
Certainly the lack of a roadmap here will make this review harder and
less effective than it needs to be.

It's nice that the cacheline alignment only happens on SMP builds, but
this code already adds tremendous amounts of useless fluff to
uniprocessor builds. Those kernels would be much happier with
ida_get_new_above().

> +unsigned percpu_tag_alloc(struct percpu_tag_pool *pool, gfp_t gfp);
> +void percpu_tag_free(struct percpu_tag_pool *pool, unsigned tag);
> +
> +void percpu_tag_pool_free(struct percpu_tag_pool *pool);
> +int percpu_tag_pool_init(struct percpu_tag_pool *pool, unsigned long nr_tags);
> +
> +enum {
> + TAG_FAIL = -1U,
> + TAG_MAX = TAG_FAIL - 1,
> +};

Again, documentation of the meanings of these will help the reader
considerably.

> +#endif
> diff --git a/lib/Makefile b/lib/Makefile
> index 386db4b..e29a354 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
> sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
> proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> is_single_threaded.o plist.o decompress.o kobject_uevent.o \
> - earlycpio.o percpu-refcount.o
> + earlycpio.o percpu-refcount.o percpu-tags.o
>
> obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
> lib-$(CONFIG_MMU) += ioremap.o
> diff --git a/lib/percpu-tags.c b/lib/percpu-tags.c
> new file mode 100644
> index 0000000..e43b428
> --- /dev/null
> +++ b/lib/percpu-tags.c
> @@ -0,0 +1,313 @@
> +/*
> + * Copyright 2012 Google Inc. All Rights Reserved.
> + * Author: [email protected] (Kent Overstreet)
> + *
> + * Per cpu tag allocator.
> + */
> +
> +#include <linux/gfp.h>
> +#include <linux/module.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/percpu-tags.h>

I expect this isn't a long enough include list. Just below I see
memcpy, hweight_long, find_next_bit, per_cpu_ptr and xchg. Please
carefully go through the requirements and don't rely upon
config-dependent nested-include luck.

> +#define TAG_CPU_WATERMARK 32U
> +#define TAG_CPU_SIZE TAG_CPU_WATERMARK * 2
> +
> +#define TAG_CPU_STEALING UINT_MAX

What do these do.

> +struct percpu_tag_cpu_freelist {
> + unsigned nr_free;
> + unsigned freelist[];
> +};
> +
> +static inline void move_tags(unsigned *dst, unsigned *dst_nr,
> + unsigned *src, unsigned *src_nr,
> + unsigned nr)
> +{
> + *src_nr -= nr;
> + memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
> + *dst_nr += nr;
> +}

Overlapping copies are not required?

> +static bool steal_tags(struct percpu_tag_pool *pool,
> + struct percpu_tag_cpu_freelist *tags)

An overview of this function would be helpful here.

> +{
> + unsigned i, nr_free, cpus_have_tags = 0, cpu = pool->cpu_last_stolen;
> + struct percpu_tag_cpu_freelist *remote;
> +
> + for (i = 0; i < DIV_ROUND_UP(nr_cpu_ids, BITS_PER_LONG); i++)
> + cpus_have_tags += hweight_long(pool->cpus_have_tags[i]);

That's an open-coded bitmap_weight(). Generally, the bitmap.h code
might be applicable in this code. (The bitmap code has some pretty
foul inefficiencies, but they're generally easily fixed).

> + while (cpus_have_tags-- * TAG_CPU_SIZE > pool->nr_tags / 2) {
> + cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids, cpu);
> +
> + if (cpu == nr_cpu_ids)
> + cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids);
> +
> + if (cpu == nr_cpu_ids)
> + BUG();
> +
> + pool->cpu_last_stolen = cpu;

I think I can see why this last_stolen thing exists, but I'd prefer to
be told why its creator thinks it needs to exist.

One could just start stealing at this_cpu + 1?

> + remote = per_cpu_ptr(pool->tag_cpu, cpu);
> +
> + if (remote == tags)
> + continue;
> +
> + clear_bit(cpu, pool->cpus_have_tags);
> +
> + nr_free = xchg(&remote->nr_free, TAG_CPU_STEALING);

(looks to see what TAG_CPU_STEALING does)

OK, this looks pretty lame. It adds a rare fallback to the central
allocator, which makes that path hard to test. And it does this at
basically the same cost as plain old spin_lock(). I do think it would
be better to avoid the nondeterministic code and use plain old
spin_lock(). I appreciate the lock ranking concern, but you're a
clever chap ;)

Also, I wonder if this was all done in the incorrect order. Why make
alloc_local_tag() fail? steal_tags() could have just noodled off and
tried to steal from the next CPU if alloc_local_tag() was in progress
on this CPU?

> + if (nr_free) {
> + memcpy(tags->freelist,
> + remote->freelist,
> + sizeof(unsigned) * nr_free);
> + smp_mb();

Please always document barriers. Explain in full detail to the reader
which scenario(s) we're barriering against.

> + remote->nr_free = 0;
> + tags->nr_free = nr_free;
> + return true;
> + } else {
> + remote->nr_free = 0;
> + }
> + }
> +
> + return false;
> +}
> +
> +static bool alloc_global_tags(struct percpu_tag_pool *pool,
> + struct percpu_tag_cpu_freelist *tags)
> +{
> + if (pool->nr_free) {
> + move_tags(tags->freelist, &tags->nr_free,
> + pool->freelist, &pool->nr_free,
> + min(pool->nr_free, TAG_CPU_WATERMARK));
> + return true;
> + }
> +
> + return false;
> +}

OK, getting a bit lost now. What does this do and what is
TAG_CPU_WATERMARK.

> +static unsigned alloc_local_tag(struct percpu_tag_pool *pool,
> + struct percpu_tag_cpu_freelist *tags)
> +{
> + unsigned nr_free, old, new, tag;
> +
> + /*
> + * Try per cpu freelist
> + * Since we don't have global lock held, need to use cmpxchg()
> + * to guard against a different thread using steal_tags() on us:
> + */
> + nr_free = tags->nr_free;
> +
> + do {
> + if (!nr_free || nr_free == TAG_CPU_STEALING)
> + return TAG_FAIL;
> +
> + old = nr_free;
> + new = old - 1;
> + tag = tags->freelist[new];
> +
> + nr_free = cmpxchg(&tags->nr_free, old, new);
> + } while (nr_free != old);
> +
> + return tag;
> +}
> +
> +/**
> + * tag_alloc - allocate a tag

percpu_tag_alloc(), actually.

> + * @pool: pool to allocate from
> + * @gfp: gfp flags
> + *
> + * Returns a tag - an integer in the range [0..nr_tags) (passed to
> + * tag_pool_init()), or otherwise TAG_FAIL on allocation failure.
> + *
> + * Will not fail if passed __GFP_WAIT.
> + */
> +unsigned percpu_tag_alloc(struct percpu_tag_pool *pool, gfp_t gfp)
> +{
> + DEFINE_WAIT(wait);
> + struct percpu_tag_cpu_freelist *tags;
> + unsigned long flags;
> + unsigned tag, this_cpu;
> +
> + while (1) {
> + local_irq_save(flags);

why? Presumably so the main interfaces are irq-safe. That would be a
useful thing to include in the interface description.

What actions can a driver take if an irq-time tag allocation attempt
fails? How can the driver author test those actions?

> + this_cpu = smp_processor_id();
> + tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
> +
> + tag = alloc_local_tag(pool, tags);
> + if (tag != TAG_FAIL)
> + break;
> +
> + spin_lock(&pool->lock);
> +
> + /*
> + * prepare_to_wait() must come before steal_tags(), in case
> + * percpu_tag_free() on another cpu flips a bit in
> + * cpus_have_tags
> + */
> + prepare_to_wait(&pool->wait, &wait, TASK_UNINTERRUPTIBLE);
> +
> + /*
> + * alloc_global_tags(), steal_tags() return true iff we now have
> + * tags on our percpu freelist
> + */
> + if (alloc_global_tags(pool, tags) ||
> + steal_tags(pool, tags)) {
> + /* Global lock held, don't need cmpxchg */
> + tag = tags->freelist[--tags->nr_free];
> + if (tags->nr_free)
> + set_bit(this_cpu, pool->cpus_have_tags);
> +
> + spin_unlock(&pool->lock);
> + break;
> + }
> +
> + if (!(gfp & __GFP_WAIT)) {
> + spin_unlock(&pool->lock);
> + break;

It's a bit inefficient to do the needless prepare_to_wait/finish_wait
in this case, but livable with.

> + }
> +
> + spin_unlock(&pool->lock);
> + local_irq_restore(flags);
> +
> + schedule();
> + }

Does this loop need a try_to_freeze()?

> + local_irq_restore(flags);
> + finish_wait(&pool->wait, &wait);
> + return tag;
> +}
> +EXPORT_SYMBOL_GPL(percpu_tag_alloc);
> +
> +/**
> + * tag_free - free a tag

percpu_tag_free

> + * @pool: pool @tag was allocated from
> + * @tag: a tag previously allocated with tag_alloc()
> + */
> +void percpu_tag_free(struct percpu_tag_pool *pool, unsigned tag)
> +{
> + struct percpu_tag_cpu_freelist *tags;
> + unsigned long flags;
> + unsigned nr_free, old, new, this_cpu;
> +
> + BUG_ON(tag >= pool->nr_tags);
> +
> + local_irq_save(flags);
> + this_cpu = smp_processor_id();
> + tags = per_cpu_ptr(pool->tag_cpu, this_cpu);
> +
> + /*
> + * Need to guard against racing with steal_tags() on another cpu - we
> + * can manage with just cmpxchg because we can only race with tags being
> + * pulled off our freelist, not other threads pushing tags back onto our
> + * freelist
> + */
> + nr_free = tags->nr_free;
> +
> + do {
> + if (nr_free == TAG_CPU_STEALING) {
> + cpu_relax();
> + nr_free = tags->nr_free;
> + continue;
> + }
> +
> + old = nr_free;
> + new = old + 1;
> + tags->freelist[old] = tag;
> +
> + nr_free = cmpxchg(&tags->nr_free, old, new);
> + } while (nr_free != old);
> +
> + if (!nr_free) {
> + set_bit(this_cpu, pool->cpus_have_tags);
> + wake_up(&pool->wait);
> + }
> +
> + if (new == TAG_CPU_SIZE) {
> + spin_lock(&pool->lock);
> + /* Might have had our tags stolen before we took global lock */
> + if (tags->nr_free == TAG_CPU_SIZE) {
> + move_tags(pool->freelist, &pool->nr_free,
> + tags->freelist, &tags->nr_free,
> + TAG_CPU_WATERMARK);
> +
> + wake_up(&pool->wait);
> + }
> + spin_unlock(&pool->lock);
> + }
> +
> + local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL_GPL(percpu_tag_free);
>
> ...
>

2013-06-13 02:05:41

by Kent Overstreet

Subject: Re: [PATCH] Percpu tag allocator

On Wed, Jun 12, 2013 at 04:38:54PM -0700, Andrew Morton wrote:
> On Tue, 11 Jun 2013 21:03:24 -0700 Kent Overstreet <[email protected]> wrote:
>
> > Allocates integers out of a predefined range
>
> 0..n, I assume. Not n..m.

Yes (though n..m would be trivial if anyone wanted it)... and
percpu_tag_pool_init() is explicit about [0, nr_tags)

>
> > - for use by e.g. a driver
> > to allocate tags for communicating with the device.
> >
> > v2: Tejun pointed out that the original strategy completely breaks if
> > nr_cpus is large enough. I think (hope) I realized that at one point,
> > but completely forgot in the months since I originally implemented it.
> >
> > Came up with a new strategy where we keep track of the number of cpus
> > that have tags on their percpu freelists, then steal from a different
> > random cpu if too many cpus have tags on their freelists.
> >
> > Note that the synchronization is _definitely_ tricky - we're using
> > xchg()/cmpxchg() on the percpu lists, to synchronize between
> > steal_tags().
> >
> > The alternative would've been adding a spinlock to protect the percpu
> > freelists, but that would've required some different tricky code to
> > avoid deadlock because of the lock ordering.
>
> The changelog makes this code very hard to review. Because it fails to
> inform us of critical things such as:
>
> What are the actual objectives here? The requirements as you see them?
> What hardware and drivers are we talking about here?
>
> What are the frequency requirements of that hardware?

Code that would be using this (not just driver code, I'm using it for
the aio code to efficiently support cancellation) would typically today
be allocating out of a preallocated array, probably using something like
an atomic bit vector (find_next_unset_bit() etc.) to track free objects.
This is a faster and more scaleable replacement.

I originally wrote it for a driver that was doing exactly that atomic
bit vector thing. Jens forked an earlier version of this code for his
blk-mq stuff, and Nicholas Bellinger just posted a patch the other day
using this in the iscsi target code.

But all it's supposed to do is hand you integers that you then use to
index into your (probably preallocated) array.
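
Roughly, usage looks like this - an illustrative sketch only, my_dev,
my_cmd and NR_TAGS are made up, the percpu_tag_* calls are the
interface from this patch:

	struct my_dev {
		struct percpu_tag_pool	tags;
		struct my_cmd		cmds[NR_TAGS];	/* preallocated, one per tag */
	};

	/* setup */
	ret = percpu_tag_pool_init(&dev->tags, NR_TAGS);

	/* submission path */
	tag = percpu_tag_alloc(&dev->tags, GFP_KERNEL);
	/* GFP_KERNEL includes __GFP_WAIT, so this won't return TAG_FAIL */
	cmd = &dev->cmds[tag];
	/* ... fill in cmd, hand the tag to the hardware ... */

	/* completion path (e.g. from the irq handler) */
	percpu_tag_free(&dev->tags, tag);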

Frequency requirements - aio would probably have higher performance
requirements than any individual piece of hardware, if the cancellation
stuff goes in. Individual flash devices do ~1M iops though and that's
the sort of driver this is targeted at.

> Why can't we use ida_get_new_above()?
>
> If it is "speed" then how bad is the problem and what efforts have
> been made to address them within the idr code? (A per-cpu magazine
> frontend to ida_get_new_above() might suit).
>
> If it is "functionality" then what efforts have been made to
> suitably modify the ida code?

Originally, it was because every time I opened idr.[ch] I was confronted
with an enormous pile of little functions, with very little indication
in the way of what they were all trying to do or which ones I might want
to start with.

Having finally read enough of the code to maybe get a handle on what
it's about - performance is a large part of it, but idr is also a more
complicated api that does more than what I wanted.

Re: the per-cpu magazine frontend - if I did that I'd still need pretty
much exactly what I've got implemented here - the non-SMP version of this
code would be basically just a single stack; all the complexity is in
making it percpu.
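
(For !CONFIG_SMP it really would be little more than this - sketch only,
the actual UP variant isn't written yet:

	/* UP sketch: the pool degenerates to a single stack of free tags */
	unsigned tag_alloc_up(struct percpu_tag_pool *pool)
	{
		return pool->nr_free ? pool->freelist[--pool->nr_free] : TAG_FAIL;
	}

	void tag_free_up(struct percpu_tag_pool *pool, unsigned tag)
	{
		pool->freelist[pool->nr_free++] = tag;
	}

everything else in the code exists to make that percpu.)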

> wrt the NR_CPUS=1024 thing: yes, this is a problem with the per-cpu
> design. Especially if the hardware only has a 10-bit-tag! What's the
> worst case here? What is the smallest tag size which you reasonably
> expect we will encounter? What are the expected failure modes with
> large cpu count and small tag size?

Well, ncq/tcq have a 5 bit tag space, you probably wouldn't want to use
this code for that :)

So the worst case is that on your NR_CPUS=1024 system, you don't have a
big enough tag space to keep tags on most of the percpu freelists _and_
your workload is running across all the cpus at once.

The key thing is with the way tag stealing works, it's not NR_CPUS that
matters, it's how many your workload is currently running on. And you
can't really expect good things if you're trying to use a device from
1024 cpus at once anyways, so I think it'll work out well enough in
practice.

Also, it should degrade _reasonably_ gracefully as the number of cpus
you're running on starts to exceed the point at which you have to steal
tags - since to steal tags we only have to compete with one other cpu
for the needed cachelines. The global lock will eventually become a
bottleneck, but there's only so much we can do about that.

> > +/*
> > + * Copyright 2012 Google Inc. All Rights Reserved.
> > + * Author: [email protected] (Kent Overstreet)
> > + *
> > + * Per cpu tag allocator. Allocates small integers - up to nr_tags passed to
> > + * percpu_tag_pool_init() - for use with say driver tag structures for talking
> > + * to a device.
> > + *
> > + * It works by caching tags on percpu freelists, and then tags are
> > + * allocated/freed from the global freelist in batches.
> > + *
> > + * Note that it will in general be impossible to allocate all nr_tags tags,
> > + * since some tags will be stranded on other cpu's freelists: but we guarantee
> > + * that nr_tags / 2 can always be allocated.
> > + *
> > + * This is done by keeping track of which cpus have tags on their percpu
> > + * freelists in a bitmap, and then on allocation failure if too many cpus have
> > + * tags on their freelists - i.e. if cpus_have_tags * TAG_CPU_SIZE (64) >
> > + * nr_tags / 2 - then we steal one remote cpu's freelist (effectively picked at
> > + * random).
>
> OK, this comment partially makes up for changelog deficiencies.
>
> Why did you choose to directly fiddle with the other-cpus data rather
> than issuing a cross-cpu IPI call? The latter would presumably result
> in slower stealing performance but better common-case performance.

I could handwave an argument (the extra needed synchronization doesn't
cost much since in the common case it's in cache and there's no
contention) - but I haven't tested the IPI version.

Jens used IPIs in his fork (but he didn't figure out the trick of
limiting when we steal and stealing from a random other cpu, so there's
way more cacheline contention in his version).

> I didn't immediately see where this "if cpus_have_tags * TAG_CPU_SIZE
> (64) > nr_tags / 2" was implemented and I don't immediately see *why*
> it was implemented. Why do we care? If local allocation fails, just
> go steal.

There's a tradeoff between percentage of tags you can guarantee will be
available for allocation, and amount of global communication required.

If we say up front "We only guarantee that half of the tag space can be
successfully allocated, depending on what happens with the percpu
freelists" - then we can guarantee drastically less cacheline
contention/bouncing even when we're trying to keep as many tags as
possible allocated.

The cost of stealing tags isn't just to the thread that's doing the
stealing - it's already gone and touched the global stuff, and it's
probably going to sleep, so it might not matter much there. But now some
other cpu is going to have to touch the global cachelines a lot sooner
than it would have otherwise.

And then, if we didn't keep track of which cpus had tags on their
freelists, we'd always have to potentially try stealing from _every_
cpu until we found tags - that would scale quadratically (I think, I'm
tired) as nr_cpus goes up w.r.t. cacheline bouncing.
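
To put concrete numbers on that: with the constants in this patch
(TAG_CPU_SIZE = 64) and, say, nr_tags = 1024, steal_tags() only kicks in
once more than 1024 / 2 / 64 = 8 cpus have tags sitting on their percpu
freelists - the same bound as the "don't span more cpus than
nr_tags / 128" guidance in the header comment.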

>
> Please lovingly document the struct fields - it really pays off.
> Certainly the lack of a roadmap here will make this review harder and
> less effective than it needs to be.

Will do.

>
> It's nice that the cacheline alignment only happens on SMP builds, but
> this code already adds tremendous amounts of useless fluff to
> uniprocessor builds. Those kernels would be much happier with
> ida_get_new_above().

I hate #ifdefs, but I suppose it's warrented here. The non CONFIG_SMP
version will be trivial, I'll add that.

> > +unsigned percpu_tag_alloc(struct percpu_tag_pool *pool, gfp_t gfp);
> > +void percpu_tag_free(struct percpu_tag_pool *pool, unsigned tag);
> > +
> > +void percpu_tag_pool_free(struct percpu_tag_pool *pool);
> > +int percpu_tag_pool_init(struct percpu_tag_pool *pool, unsigned long nr_tags);
> > +
> > +enum {
> > + TAG_FAIL = -1U,
> > + TAG_MAX = TAG_FAIL - 1,
> > +};
>
> Again, documentation of the meanings of these will help the reader
> considerably.

Check.

> > +#endif
> > diff --git a/lib/Makefile b/lib/Makefile
> > index 386db4b..e29a354 100644
> > --- a/lib/Makefile
> > +++ b/lib/Makefile
> > @@ -13,7 +13,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
> > sha1.o md5.o irq_regs.o reciprocal_div.o argv_split.o \
> > proportions.o flex_proportions.o prio_heap.o ratelimit.o show_mem.o \
> > is_single_threaded.o plist.o decompress.o kobject_uevent.o \
> > - earlycpio.o percpu-refcount.o
> > + earlycpio.o percpu-refcount.o percpu-tags.o
> >
> > obj-$(CONFIG_ARCH_HAS_DEBUG_STRICT_USER_COPY_CHECKS) += usercopy.o
> > lib-$(CONFIG_MMU) += ioremap.o
> > diff --git a/lib/percpu-tags.c b/lib/percpu-tags.c
> > new file mode 100644
> > index 0000000..e43b428
> > --- /dev/null
> > +++ b/lib/percpu-tags.c
> > @@ -0,0 +1,313 @@
> > +/*
> > + * Copyright 2012 Google Inc. All Rights Reserved.
> > + * Author: [email protected] (Kent Overstreet)
> > + *
> > + * Per cpu tag allocator.
> > + */
> > +
> > +#include <linux/gfp.h>
> > +#include <linux/module.h>
> > +#include <linux/sched.h>
> > +#include <linux/slab.h>
> > +#include <linux/percpu-tags.h>
>
> I expect this isn't a long enough include list. Just below I see
> memcpy, hweight_long, find_next_bit, per_cpu_ptr and xchg. Please
> carefully go through the requirements and don't rely upon
> config-dependent nested-include luck.

Will do.

> > +#define TAG_CPU_WATERMARK 32U
> > +#define TAG_CPU_SIZE TAG_CPU_WATERMARK * 2
> > +
> > +#define TAG_CPU_STEALING UINT_MAX
>
> What do these do.
>
> > +struct percpu_tag_cpu_freelist {
> > + unsigned nr_free;
> > + unsigned freelist[];
> > +};
> > +
> > +static inline void move_tags(unsigned *dst, unsigned *dst_nr,
> > + unsigned *src, unsigned *src_nr,
> > + unsigned nr)
> > +{
> > + *src_nr -= nr;
> > + memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
> > + *dst_nr += nr;
> > +}
>
> Overlapping copies are not required?

No, it's only for moving tags between different freelists (a percpu
freelist and the global freelist).

> > +static bool steal_tags(struct percpu_tag_pool *pool,
> > + struct percpu_tag_cpu_freelist *tags)
>
> An overview of this function would be helpful here.

Here's what I just wrote:

/*
* Try to steal tags from a remote cpu's percpu freelist.
*
* We first check how many percpu freelists have tags - we don't steal tags
* unless enough percpu freelists have tags on them that it's possible more than
* half the total tags could be stuck on remote percpu freelists.
*
* Then we iterate through the cpus until we find some tags - we don't attempt
* to find the "best" cpu to steal from, to keep cacheline bouncing to a
* minimum.
*
* Returns true on success (our percpu freelist is no longer empty), false on
* failure.
*/

> > +{
> > + unsigned i, nr_free, cpus_have_tags = 0, cpu = pool->cpu_last_stolen;
> > + struct percpu_tag_cpu_freelist *remote;
> > +
> > + for (i = 0; i < DIV_ROUND_UP(nr_cpu_ids, BITS_PER_LONG); i++)
> > + cpus_have_tags += hweight_long(pool->cpus_have_tags[i]);
>
> That's an open-coded bitmap_weight(). Generally, the bitmap.h code
> might be applicable in this code. (The bitmap code has some pretty
> foul inefficiencies, but they're generally easily fixed).

Yeah, Oleg pointed that out, I've already fixed it in my version.

> > + while (cpus_have_tags-- * TAG_CPU_SIZE > pool->nr_tags / 2) {
> > + cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids, cpu);
> > +
> > + if (cpu == nr_cpu_ids)
> > + cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids);
> > +
> > + if (cpu == nr_cpu_ids)
> > + BUG();
> > +
> > + pool->cpu_last_stolen = cpu;
>
> I think I can see why this last_stolen thing exists, but I'd prefer to
> be told why its creator thinks it needs to exist.
>
> One could just start stealing at this_cpu + 1?

Starting at this_cpu + 1 isn't a bad idea.

I think keeping the index around and cycling through is still better
though - suppose you have a workload running on some subset of the cpus
in the system, and those cpus happen to all have adjacent indices.

We want to have an equal chance of picking any given cpu to steal from,
so that the stealing is fair - if we were starting at this_cpu + 1, we'd
steal more from the cpus our workload is currently running on, when what
we want is to make sure we also steal from the cpus we _aren't_
currently running on.
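
The cursor logic is just this (a sketch; next_victim() is a made-up name,
in the patch this lives inline in steal_tags()):

static unsigned next_victim(struct percpu_tag_pool *pool)
{
	/*
	 * Resume the scan wherever the last steal left off, wrapping
	 * around once, so over time every cpu with stranded tags gets
	 * picked on equally often.
	 */
	unsigned cpu = find_next_bit(pool->cpus_have_tags, nr_cpu_ids,
				     pool->cpu_last_stolen);

	if (cpu == nr_cpu_ids)
		cpu = find_first_bit(pool->cpus_have_tags, nr_cpu_ids);

	pool->cpu_last_stolen = cpu;
	return cpu;
}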

>
> > + remote = per_cpu_ptr(pool->tag_cpu, cpu);
> > +
> > + if (remote == tags)
> > + continue;
> > +
> > + clear_bit(cpu, pool->cpus_have_tags);
> > +
> > + nr_free = xchg(&remote->nr_free, TAG_CPU_STEALING);
>
> (looks to see what TAG_CPU_STEALING does)
>
> OK, this looks pretty lame. It adds a rare fallback to the central
> allocator, which makes that path hard to test. And it does this at
> basically the same cost of plain old spin_lock(). I do think it would
> be better to avoid the nondeterministic code and use plain old
> spin_lock(). I appreciate the lock ranking concern, but you're a
> clever chap ;)

Yeah, the cmpxchg() stuff turned out trickier than I'd hoped - it's
actually the barriers (guarding against a race with percpu_tag_free())
that concern me more than that fallback.

I did torture test this code quite a bit and I'm not terribly eager to
futz with it more, but I may try switching to spinlocks for the percpu
freelists to see how it works out - I had the idea that I might be able
to avoid some writes to shared cachelines if I can simplify that stuff,
which would probably make it worth it.

> Also, I wonder if this was all done in the incorrect order. Why make
> alloc_local_tag() fail? steal_tags() could have just noodled off and
> tried to steal from the next CPU if alloc_local_tag() was in progress
> on this CPU?

steal_tags() can't notice that alloc_local_tag() is in progress,
alloc_local_tag() just uses cmpxchg to update nr_free - there's no
sentinel value or anything.

OTOH, if alloc_local_tag() sees TAG_CPU_STEALING - the next thing that's
going to happen is steal_tags() is going to set nr_free to 0, and the
only way tags are going to end up back on our cpu's freelist is if we
fail and do something else - we've got irqs disabled so
percpu_tag_free() can't do it.
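
For reference, the local fast path is roughly this (a sketch of
alloc_local_tag(), not the patch verbatim):

static unsigned alloc_local_tag(struct percpu_tag_cpu_freelist *tags)
{
	unsigned nr_free, tag;

	/*
	 * Caller (percpu_tag_alloc()) has irqs disabled, so we can't race
	 * with percpu_tag_free() on this cpu - only with steal_tags().
	 */
	do {
		nr_free = tags->nr_free;

		/* Empty, or a remote cpu is in the middle of stealing */
		if (!nr_free || nr_free == TAG_CPU_STEALING)
			return TAG_FAIL;

		/* Read the tag before publishing the new count */
		tag = tags->freelist[nr_free - 1];
	} while (cmpxchg(&tags->nr_free, nr_free, nr_free - 1) != nr_free);

	return tag;
}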


>
> > + if (nr_free) {
> > + memcpy(tags->freelist,
> > + remote->freelist,
> > + sizeof(unsigned) * nr_free);
> > + smp_mb();
>
> Please always document barriers. Explain in full detail to the reader
> which scenario(s) we're barriering against.

Will do

>
> > + remote->nr_free = 0;
> > + tags->nr_free = nr_free;
> > + return true;
> > + } else {
> > + remote->nr_free = 0;
> > + }
> > + }
> > +
> > + return false;
> > +}
> > +
> > +static bool alloc_global_tags(struct percpu_tag_pool *pool,
> > + struct percpu_tag_cpu_freelist *tags)
> > +{
> > + if (pool->nr_free) {
> > + move_tags(tags->freelist, &tags->nr_free,
> > + pool->freelist, &pool->nr_free,
> > + min(pool->nr_free, TAG_CPU_WATERMARK));
> > + return true;
> > + }
> > +
> > + return false;
> > +}
>
> OK, getting a bit lost now. What does this do and what is
> TAG_CPU_WATERMARK.

TAG_CPU_WATERMARK is just the number of tags we shuffle back and forth
between the percpu freelists and the global freelist at a time - half
the maximum size of the percpu freelists.

Jens renamed that to BATCH_MOVE, I suppose that might've been a better
name.
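
Concretely, after percpu_tag_free() pushes the freed tag onto the local
freelist it does roughly this (a sketch; spill_to_global() is a made-up
name, in the patch this is open coded in percpu_tag_free()):

static void spill_to_global(struct percpu_tag_pool *pool,
			    struct percpu_tag_cpu_freelist *tags)
{
	/*
	 * The percpu freelist filled up (TAG_CPU_SIZE == 2 * watermark):
	 * spill one watermark's worth back to the global freelist so
	 * other cpus can allocate without stealing.
	 */
	if (tags->nr_free == TAG_CPU_SIZE) {
		spin_lock(&pool->lock);

		move_tags(pool->freelist, &pool->nr_free,
			  tags->freelist, &tags->nr_free,
			  TAG_CPU_WATERMARK);

		wake_up(&pool->wait);
		spin_unlock(&pool->lock);
	}
}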

> percpu_tag_alloc(), actually.

Whoops, missed some renaming.

> > + * @pool: pool to allocate from
> > + * @gfp: gfp flags
> > + *
> > + * Returns a tag - an integer in the range [0..nr_tags) (passed to
> > + * tag_pool_init()), or otherwise TAG_FAIL on allocation failure.
> > + *
> > + * Will not fail if passed __GFP_WAIT.
> > + */
> > +unsigned percpu_tag_alloc(struct percpu_tag_pool *pool, gfp_t gfp)
> > +{
> > + DEFINE_WAIT(wait);
> > + struct percpu_tag_cpu_freelist *tags;
> > + unsigned long flags;
> > + unsigned tag, this_cpu;
> > +
> > + while (1) {
> > + local_irq_save(flags);
>
> why? Presumably so the main interfaces are irq-safe. That would be a
> useful thing to include in the interface description.

An allocator that wasn't irq safe (and explicitly percpu and thus meant
for concurrency) would seem to be an odd thing. I suppose from the
user's perspective it could just be disabling preemption... I'll make a
note of it.

> What actions can a driver take if an irq-time tag allocation attempt
> fails? How can the driver author test those actions?

You mean if a GFP_NOWAIT allocation fails? It's the same as any other
allocation, I suppose.

Did you have something else in mind that could be implemented? I don't
want to add code for a reserve to this code - IMO, if a reserve is
needed it should be done elsewhere (like how mempools work on top of
existing allocators).

> It's a bit inefficient to do the needless prepare_to_wait/finish_wait
> in this case, but livable with.

After profiling this morning I added a fast path that doesn't do the
finish_wait() - that was more expensive than I expected.

>
> > + }
> > +
> > + spin_unlock(&pool->lock);
> > + local_irq_restore(flags);
> > +
> > + schedule();
> > + }
>
> Does this loop need a try_to_freeze()?

I suppose it does, I hadn't run across that one before...

2013-06-13 03:03:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Wed, 12 Jun 2013 19:05:36 -0700 Kent Overstreet <[email protected]> wrote:

> ...
>
> > Why can't we use ida_get_new_above()?
> >
> > If it is "speed" then how bad is the problem and what efforts have
> > been made to address them within the idr code? (A per-cpu magazine
> > frontend to ida_get_new_above() might suit).
> >
> > If it is "functionality" then what efforts have been made to
> > suitably modify the ida code?
>
> Originally, it was because every time I opened idr.[ch] I was confronted
> with an enormous pile of little functions, with very little indication
> in the way of what they were all trying to do or which ones I might want
> to start with.
>
> Having finally read enough of the code to maybe get a handle on what
> it's about - performance is a large part of it, but idr is also a more
> complicated api that does more than what I wanted.

They all sound like pretty crappy reasons ;) If the idr/ida interface
is nasty then it can be wrapped to provide the same interface as the
percpu tag allocator.

I could understand performance being an issue, but diligence demands
that we test that, or at least provide a convincing argument.

>
> ...
>
> > > + remote = per_cpu_ptr(pool->tag_cpu, cpu);
> > > +
> > > + if (remote == tags)
> > > + continue;
> > > +
> > > + clear_bit(cpu, pool->cpus_have_tags);
> > > +
> > > + nr_free = xchg(&remote->nr_free, TAG_CPU_STEALING);
> >
> > (looks to see what TAG_CPU_STEALING does)
> >
> > OK, this looks pretty lame. It adds a rare fallback to the central
> > allocator, which makes that path hard to test. And it does this at
> > basically the same cost of plain old spin_lock(). I do think it would
> > be better to avoid the nondeterministic code and use plain old
> > spin_lock(). I appreciate the lock ranking concern, but you're a
> > clever chap ;)
>
> Yeah, the cmpxchg() stuff turned out trickier than I'd hoped - it's
> actually the barriers (guarding against a race with percpu_tag_free())
> that concern me more than that fallback.
>
> I did torture test this code quite a bit and I'm not terribly eager to
> futz with it more, but I may try switching to spinlocks for the percpu
> freelists to see how it works out - I had the idea that I might be able
> to avoid some writes to shared cachelines if I can simplify that stuff,
> which would probably make it worth it.

The nice thing about a lock per cpu is that the stealing CPU can grab
it and then steal a bunch of tags without releasing the lock: less
lock-taking. Maybe the current code does that amortisation as well;
my intent-reverse-engineering resources got depleted.

Also, with a lock-per-cpu the stolen-from CPU just spins, so the ugly
TAG_CPU_STEALING fallback-to-global-allocator thing goes away.
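
Something along these lines, say (an untested sketch just to show the
shape - add a spinlock to each percpu freelist, take it on the local
alloc/free paths and on the stealing path):

struct percpu_tag_cpu_freelist {
	spinlock_t lock;
	unsigned nr_free;
	unsigned freelist[];
};

/* Stealing side: one lock round trip moves the whole batch into buf[] */
static unsigned steal_batch(struct percpu_tag_cpu_freelist *remote,
			    unsigned *buf)
{
	unsigned nr, dst = 0;

	spin_lock(&remote->lock);
	nr = remote->nr_free;
	if (nr)
		move_tags(buf, &dst, remote->freelist, &remote->nr_free, nr);
	spin_unlock(&remote->lock);

	/*
	 * The caller then takes its own freelist lock to install
	 * buf[0..nr), so no two percpu locks are ever held at once and
	 * the lock ordering question disappears.
	 */
	return nr;
}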

> > Also, I wonder if this was all done in the incorrect order. Why make
> > alloc_local_tag() fail? steal_tags() could have just noodled off and
> > tried to steal from the next CPU if alloc_local_tag() was in progress
> > on this CPU?
>
> steal_tags() can't notice that alloc_local_tag() is in progress,

Yes it can - the stolen-from CPU sets a flag in its cpu-local object
while in its critical section. The stealing CPU sees that and skips.

I think all this could be done with test_and_set_bit() and friends,
btw. xchg() hurts my brain.

> alloc_local_tag() just uses cmpxchg to update nr_free - there's no
> sentinel value or anything.
>
> OTOH, if alloc_local_tag() sees TAG_CPU_STEALING - the next thing that's
> going to happen is steal_tags() is going to set nr_free to 0, and the
> only way tags are going to end up back on our cpu's freelist is if we
> fail and do something else - we've got irqs disabled so
> percpu_tag_free() can't do it.
>
> ...
>
> > What actions can a driver take if an irq-time tag allocation attempt
> > fails? How can the driver author test those actions?
>
> You mean if a GFP_NOWAIT allocation fails? It's the same as any other
> allocation, I suppose.
>
> Did you have something else in mind that could be implemented? I don't
> want to add code for a reserve to this code - IMO, if a reserve is
> needed it should be done elsewhere (like how mempools work on top of
> existing allocators).

Dunno, really - I'm just wondering what the implications of an
allocation failure will be. I suspect it's EIO? Which in some
circumstances could be as serious as total system failure (loss of
data), so reliability/robustness is a pretty big deal.


Another thing: CPU hot-unplug. If it's "don't care" then the
forthcoming changelog should thoroughly explain why, please. Otherwise
it will need a notifier to spill back any reservation.

2013-06-13 03:54:39

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Wed, Jun 12, 2013 at 08:03:11PM -0700, Andrew Morton wrote:
> On Wed, 12 Jun 2013 19:05:36 -0700 Kent Overstreet <[email protected]> wrote:
>
> > ...
> >
> > > Why can't we use ida_get_new_above()?
> > >
> > > If it is "speed" then how bad is the problem and what efforts have
> > > been made to address them within the idr code? (A per-cpu magazine
> > > frontend to ida_get_new_above() might suit).
> > >
> > > If it is "functionality" then what efforts have been made to
> > > suitably modify the ida code?
> >
> > Originally, it was because every time I opened idr.[ch] I was confronted
> > with an enormous pile of little functions, with very little indication
> > in the way of what they were all trying to do or which ones I might want
> > to start with.
> >
> > Having finally read enough of the code to maybe get a handle on what
> > it's about - performance is a large part of it, but idr is also a more
> > complicated api that does more than what I wanted.
>
> They all sound like pretty crappy reasons ;) If the idr/ida interface
> is nasty then it can be wrapped to provide the same interface as the
> percpu tag allocator.

Well, the way I see it right now, idr and this tag allocator are doing
two somewhat different things and I'm really not sure how one piece of
code could do both jobs.

Supposing idr could be made percpu and just as fast as my tag allocator:
the problem is, using idr where right now I'd use the tag allocator
forces you to do two allocations instead of one:

First, you allocate your idr slot - but now that has to point to
something, so you need a kmem_cache_alloc() too. Which is probably the
way to go if you don't want to preallocate everything, but for the kinds
of things I'm using the tag allocator for, preallocating tag structs is
fine but that second allocation would really hurt.

So, suppose we reimplement idr on top of the tag allocator: you have a
parallel array (or array type thing) of pointers, that lets you map
integers -> pointers - that gets you the current idr functionality
(minus whatever I've missed about it).
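
i.e. conceptually something like this (a sketch with a fixed maximum
number of IDs - tag_idr and these function names are made up for this
email, and growing the array is exactly the part that's left out):

struct tag_idr {
	struct percpu_tag_pool pool;
	void **ptrs;		/* nr_tags entries, indexed by tag */
};

static int tag_idr_alloc(struct tag_idr *idr, void *ptr, gfp_t gfp)
{
	unsigned tag = percpu_tag_alloc(&idr->pool, gfp);

	if (tag == TAG_FAIL)
		return -ENOSPC;

	idr->ptrs[tag] = ptr;
	return tag;
}

static void *tag_idr_find(struct tag_idr *idr, unsigned tag)
{
	return tag < idr->pool.nr_tags ? idr->ptrs[tag] : NULL;
}

static void tag_idr_remove(struct tag_idr *idr, unsigned tag)
{
	idr->ptrs[tag] = NULL;
	percpu_tag_free(&idr->pool, tag);
}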

But with idr you don't have to specify the maximum integer/tag up front
- it enlarges the mapping as needed, which is great but it's a fair
amount of complexity. So now we've got two parallel data structures -
the tag allocator, and this extensible array thing - idr, if I'm not
mistaken, rolls the allocation stuff into the extensible array data
structure.

For users of idr who care about scalability this might be the way to go,
but I'm sure there are plenty who don't and this would be a (perhaps
small) regression in memory and runtime overhead.

So right now at least I honestly think letting the tag allocator and idr
be distinct things is probably the way to go.

> I could understand performance being an issue, but diligence demands
> that we test that, or at least provide a convincing argument.

Well, idr is going to be a lot heavier than the atomic bit vector this
tag allocator originally replaced in driver code, and that was a very
significant performance improvement.

(now someone will point out to me all the fancy percpu optimizations in
idr I missed).


> The nice thing about a lock per cpu is that the stealing CPU can grab
> it and then steal a bunch of tags without releasing the lock: less
> lock-taking. Maybe the current code does that amortisation as well;
> my intent-reverse-engineering resources got depleted.

Yeah, the current code does do that.

> Also, with a lock-per-cpu the stolen-from CPU just spins, so the ugly
> TAG_CPU_STEALING fallback-to-global-allocator thing goes away.
>
> > > Also, I wonder if this was all done in the incorrect order. Why make
> > > alloc_local_tag() fail? steal_tags() could have just noodled off and
> > > tried to steal from the next CPU if alloc_local_tag() was in progress
> > > on this CPU?
> >
> > steal_tags() can't notice that alloc_local_tag() is in progress,
>
> Yes it can - the stolen-from CPU sets a flag in its cpu-local object
> while in its critical section. The stealing CPU sees that and skips.
>
> I think all this could be done with test_and_set_bit() and friends,
> btw. xchg() hurts my brain.

Ahh, you were talking about a slightly bigger rework.

It's not xchg/cmpxchg that hurts my brain (I've used cmpxchg once or
twice just for keeping some statistics up to date - just something
slightly more interesting than a counter) - it's when there are
pointers involved and insane ordering that my brain melts.

ABA. Feh.

> > You mean if a GFP_NOWAIT allocation fails? It's the same as any other
> > allocation, I suppose.
> >
> > Did you have something else in mind that could be implemented? I don't
> > want to add code for a reserve to this code - IMO, if a reserve is
> > needed it should be done elsewhere (like how mempools work on top of
> > existing allocators).
>
> Dunno, really - I'm just wondering what the implications of an
> allocation failure will be. I suspect it's EIO? Which in some
> circumstances could be as serious as total system failure (loss of
> data), so reliability/robustness is a pretty big deal.

Ahh. That's just outside the scope of this code - IME, in driver code
GFP_NOWAIT allocations are not the norm - most tags are created when
you're submitting a bio or processing a request, and then you're in
process context. But in your error handling code you could also need to
allocate tags to resubmit things - that'd be an issue if you're silly
enough to stick all your error handling code in your interrupt handling
path.

(i've seen such things done.)

> Another thing: CPU hot-unplug. If it's "don't care" then the
> forthcoming changelog should thoroughly explain why, please. Otherwise
> it will need a notifier to spill back any reservation.

Oh - doesn't care, because steal_tags() will eventually get it. We
_could_ satisfy more allocations if we had a notifier - but we'll still
meet our guarantees and it's absolutely not a correctness issue, so I
lean towards just leaving it out.

2013-06-13 05:46:28

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Wed, 12 Jun 2013 20:54:32 -0700 Kent Overstreet <[email protected]> wrote:

> On Wed, Jun 12, 2013 at 08:03:11PM -0700, Andrew Morton wrote:
> > On Wed, 12 Jun 2013 19:05:36 -0700 Kent Overstreet <[email protected]> wrote:
> >
> > > ...
> > >
> > > > Why can't we use ida_get_new_above()?
> > > >
> > > > If it is "speed" then how bad is the problem and what efforts have
> > > > been made to address them within the idr code? (A per-cpu magazine
> > > > frontend to ida_get_new_above() might suit).
> > > >
> > > > If it is "functionality" then what efforts have been made to
> > > > suitably modify the ida code?
> > >
> > > Originally, it was because every time I opened idr.[ch] I was confronted
> > > with an enormous pile of little functions, with very little indication
> > > in the way of what they were all trying to do or which ones I might want
> > > to start with.
> > >
> > > Having finally read enough of the code to maybe get a handle on what
> > > it's about - performance is a large part of it, but idr is also a more
> > > complicated api that does more than what I wanted.
> >
> > They all sound like pretty crappy reasons ;) If the idr/ida interface
> > is nasty then it can be wrapped to provide the same interface as the
> > percpu tag allocator.
>
> Well, the way I see it right now, idr and this tag allocator are doing
> two somewhat different things and I'm really not sure how one piece of
> code could do both jobs.
>
> Supposing idr could be made percpu and just as fast as my tag allocator:

We don't know how fast either is, nor what the performance requirements are.

> the problem is, using idr where right now I'd use the tag allocator
> forces you to do two allocations instead of one:
>
> First, you allocate your idr slot - but now that has to point to
> something, so you need a kmem_cache_alloc() too.

Only a single allocation is needed - for the bit used to reserve
ida_get_new_above()'s ID, plus epsilon for idr metadata.
ida_get_new_above() will call kmalloc approximately 0.001 times per
invocation.

> So right now at least I honestly think letting the tag allocator and idr
> be distinct things is probably the way to go.

That depends on performance testing results and upon performance
requirements. Neither are known at this time.

> (now someone will point out to me all the fancy percpu optimizations in
> idr I missed).

idr uses percpu data for reliability, not for performance.

Cross-cpu costs might be significant. There are surely ways to reduce
them. I could suggest one but I've found that when I make a
suggestion, others busily work on shooting down my suggestion while
avoiding thinking of their own ideas.

> > I think all this could be done with test_and_set_bit() and friends,
> > btw. xchg() hurts my brain.
>
> Ahh, you were talking about with a slightly bigger rework.

Not really. Just that the xchg() in this code could be replaced in a
straightforward fashion with test_and_set_bit(). Or maybe not.

> > > You mean if a GFP_NOWAIT allocation fails? It's the same as any other
> > > allocation, I suppose.
> > >
> > > Did you have something else in mind that could be implemented? I don't
> > > want to add code for a reserve to this code - IMO, if a reserve is
> > > needed it should be done elsewhere (like how mempools work on top of
> > > existing allocators).
> >
> > Dunno, really - I'm just wondering what the implications of an
> > allocation failure will be. I suspect it's EIO? Which in some
> > circumstances could be as serious as total system failure (loss of
> > data), so reliability/robustness is a pretty big deal.
>
> Ahh. That's just outside the scope of this code - IME, in driver code
> GFP_NOWAIT allocations are not the norm - most tags are created when
> you're submitting a bio or processing a request, and then you're in
> process context. But in your error handling code you could also need to
> allocate tags to resubmit things - that'd be an issue if you're silly
> enough to stick all your error handling code in your interrupt handling
> path.

Our experience with alloc_pages(GFP_ATOMIC) has no relevance to this
code, because this code doesn't call alloc_pages()! Instead of failing
if page reserves are exhausted, it will fail if no tags are available
in this new data structure.

The reliability (or otherwise) of alloc_pages(GFP_ATOMIC) is well
understood whereas the reliability of percpu_tag_alloc(GFP_ATOMIC) is
not understood at all. So let us think about both this and about the
consequences of allocation failures.

> > Another thing: CPU hot-unplug. If it's "don't care" then the
> > forthcoming changelog should thoroughly explain why, please. Otherwise
> > it will need a notifier to spill back any reservation.
>
> Oh - doesn't care, because steal_tags() will eventually get it. We
> _could_ satisfy more allocations if we had a notifier - but we'll still
> meet our guarantees and it's absolutely not a correctness issue, so I
> lean towards just leaving it out.

I agree. It's a design decision which should be explained and justified
in the changelog, please.

2013-06-13 18:53:27

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

Hello, Andrew, Kent.

(cc'ing NFS folks for id[r|a] discussion)

On Wed, Jun 12, 2013 at 08:03:11PM -0700, Andrew Morton wrote:
> They all sound like pretty crappy reasons ;) If the idr/ida interface
> is nasty then it can be wrapped to provide the same interface as the
> percpu tag allocator.
>
> I could understand performance being an issue, but diligence demands
> that we test that, or at least provide a convincing argument.

The thing is that id[r|a] guarantee that the lowest available slot is
allocated and this is important because it's used to name things which
are visible to userland - things like block device minor number,
device indices and so on. That alone pretty much ensures that
alloc/free paths can't be very scalable which usually is fine for most
id[r|a] use cases as long as lookup is fast. I'm doubtful that it's a
good idea to push per-cpu tag allocation into id[r|a]. The use cases
are quite different.

In fact, maybe what we can do is add some features on top of the
tag allocator and move id[r|a] users which don't require strict
in-order allocation to it. For example, NFS allocates an ID for each
transaction it performs and uses it to index the associated command
structure (Jeff, Bruce, please correct me if I'm getting it wrong).
The only requirement on IDs is that they shouldn't be recycled too
fast. Currently, idr implements a cyclic mode for it but it can easily
be replaced with a per-cpu tag allocator like this one and it'd be a lot
more scalable. There are a couple of things to worry about tho - it
probably should use the high bits as a generation number as a tag is
given out so that the actual ID doesn't get recycled quickly, and some
form of dynamic tag sizing would be nice too.

Thanks.

--
tejun

2013-06-13 19:04:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Thu, 13 Jun 2013 11:53:18 -0700 Tejun Heo <[email protected]> wrote:

> Hello, Andrew, Kent.
>
> (cc'ing NFS folks for id[r|a] discussion)
>
> On Wed, Jun 12, 2013 at 08:03:11PM -0700, Andrew Morton wrote:
> > They all sound like pretty crappy reasons ;) If the idr/ida interface
> > is nasty then it can be wrapped to provide the same interface as the
> > percpu tag allocator.
> >
> > I could understand performance being an issue, but diligence demands
> > that we test that, or at least provide a convincing argument.
>
> The thing is that id[r|a] guarantee that the lowest available slot is
> allocated

That isn't the case for ida_get_new_above() - the caller gets to
control the starting index.

> and this is important because it's used to name things which
> are visible to userland - things like block device minor number,
> device indicies and so on. That alone pretty much ensures that
> alloc/free paths can't be very scalable which usually is fine for most
> id[r|a] use cases as long as lookup is fast. I'm doubtful that it's a
> good idea to push per-cpu tag allocation into id[r|a]. The use cases
> are quite different.

You aren't thinking right.

The worst outcome here is that idr.c remains unimproved and we merge a
new allocator which does basically the same thing.

The best outcome is that idr.c gets improved and we don't have to merge
duplicative code.

So please, let's put aside the shiny new thing for now and work out how
we can use the existing tag allocator for these applications. If we
make a genuine effort to do this and decide that it's fundamentally
hopeless then this is the time to start looking at new implementations.

(I can think of at least two ways of making ida_get_new_above() an
order of magnitude faster for this application and I'm sure you guys
can as well.)

2013-06-13 19:06:18

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

Hello, Andrew, Kent.

On Wed, Jun 12, 2013 at 04:38:54PM -0700, Andrew Morton wrote:
...
> > +unsigned percpu_tag_alloc(struct percpu_tag_pool *pool, gfp_t gfp)
> > +{
> > + DEFINE_WAIT(wait);
> > + struct percpu_tag_cpu_freelist *tags;
> > + unsigned long flags;
> > + unsigned tag, this_cpu;
> > +
> > + while (1) {
> > + local_irq_save(flags);
...
> > + schedule();
> > + }
>
> Does this loop need a try_to_freeze()?

I don't think so. Kernel tasks should never enter the freezer without
explicitly knowing it. It should be something evident in the
top-level control flow. The freezer acts as a giant lock, and entering
the freezer deep underneath, where the task could be holding a random
number of resources and locks, can easily develop into a deadlock.

If this allocation wait is gonna be visible to userland, what's
necessary probably would be making the sleeping interruptible. The
freezer will then make the alloc fail and control should return to the
signal delivery path where it'll be frozen without holding any
resources.

Thanks.

--
tejun

2013-06-13 19:08:41

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Thu, Jun 13, 2013 at 11:53:18AM -0700, Tejun Heo wrote:
> The thing is that id[r|a] guarantee that the lowest available slot is
> allocated and this is important because it's used to name things which
> are visible to userland - things like block device minor number,
> device indicies and so on. That alone pretty much ensures that
> alloc/free paths can't be very scalable which usually is fine for most
> id[r|a] use cases as long as lookup is fast. I'm doubtful that it's a
> good idea to push per-cpu tag allocation into id[r|a]. The use cases
> are quite different.
>
> In fact, maybe what we can do is adding some features on top of the
> tag allocator and moving id[r|a] users which don't require strict
> in-order allocation to it. For example, NFS allocates an ID for each
> transaction it performs and uses it to index the associate command
> structure (Jeff, Bruce, please correct me if I'm getting it wrong).

It's not really per-transaction: it's more like an identifier for a
piece of lock state (an open, a lock, whatever), so a bit like a file
descriptor, but there's no requirement they be "small".

> The only requirement on IDs is that they shouldn't be recycled too
> fast.

Yep. --b.

> Currently, idr implements cyclic mode for it but it can easily
> be replaced with per-cpu tag allocator like this one and it'd be a lot
> more scalable. There are a couple things to worry about tho - it
> probably should use the highbits as generation number as a tag is
> given out so that the actual ID doesn't get recycled quickly, and some
> form dynamic tag sizing would be nice too.
>
> Thanks.
>
> --
> tejun

2013-06-13 19:09:26

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Thu, 13 Jun 2013 11:53:18 -0700
Tejun Heo <[email protected]> wrote:

> Hello, Andrew, Kent.
>
> (cc'ing NFS folks for id[r|a] discussion)
>
> On Wed, Jun 12, 2013 at 08:03:11PM -0700, Andrew Morton wrote:
> > They all sound like pretty crappy reasons ;) If the idr/ida interface
> > is nasty then it can be wrapped to provide the same interface as the
> > percpu tag allocator.
> >
> > I could understand performance being an issue, but diligence demands
> > that we test that, or at least provide a convincing argument.
>
> The thing is that id[r|a] guarantee that the lowest available slot is
> allocated and this is important because it's used to name things which
> are visible to userland - things like block device minor number,
> device indicies and so on. That alone pretty much ensures that
> alloc/free paths can't be very scalable which usually is fine for most
> id[r|a] use cases as long as lookup is fast. I'm doubtful that it's a
> good idea to push per-cpu tag allocation into id[r|a]. The use cases
> are quite different.
>
> In fact, maybe what we can do is adding some features on top of the
> tag allocator and moving id[r|a] users which don't require strict
> in-order allocation to it. For example, NFS allocates an ID for each
> transaction it performs and uses it to index the associate command
> structure (Jeff, Bruce, please correct me if I'm getting it wrong).

> The only requirement on IDs is that they shouldn't be recycled too
> fast. Currently, idr implements cyclic mode for it but it can easily
> be replaced with per-cpu tag allocator like this one and it'd be a lot
> more scalable. There are a couple things to worry about tho - it
> probably should use the highbits as generation number as a tag is
> given out so that the actual ID doesn't get recycled quickly, and some
> form dynamic tag sizing would be nice too.
>
> Thanks.
>

Actually, nfsd just uses idr to allocate stateids, which sort of
correspond to an open or locked file state. This is not a terribly
high-performance codepath -- we allocate one of these at a frequency of
roughly one (or sometimes two) per open() or fcntl() call on the
client.

Anyway...you have the important piece right. The main constraint is
that we don't recycle these too quickly. We'd be fine with totally
random IDs as long as we can be sure that we don't reuse any until
we've used all 2^31 possible values once.

--
Jeff Layton <[email protected]>

2013-06-13 19:13:26

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Thu, 13 Jun 2013 12:06:10 -0700 Tejun Heo <[email protected]> wrote:

> Hello, Andrew, Kent.
>
> On Wed, Jun 12, 2013 at 04:38:54PM -0700, Andrew Morton wrote:
> ...
> > > +unsigned percpu_tag_alloc(struct percpu_tag_pool *pool, gfp_t gfp)
> > > +{
> > > + DEFINE_WAIT(wait);
> > > + struct percpu_tag_cpu_freelist *tags;
> > > + unsigned long flags;
> > > + unsigned tag, this_cpu;
> > > +
> > > + while (1) {
> > > + local_irq_save(flags);
> ...
> > > + schedule();
> > > + }
> >
> > Does this loop need a try_to_freeze()?
>
> I don't think so. Kernel tasks should never enter freezer without it
> explicitly knowing it. It should be something evident in the
> top-level control flow. Freezer acts as a giant lock and entering
> freezer deep underneath where the task could be holding random number
> of resources and locks can easily develop into a deadlock.

As I understand it, if a task is stuck in this loop at freeze time, the
whole freeze attempt will fail. But it's been a long time since I
thought about or worked on this stuff.

Another issue is device takedown ordering - this thread is blocked
waiting for tags to be returned by IO completion, so there may be
issues where the hardware has been shut down.

I really don't know - I'm flagging it as something which should be
thought about, tested, etc.

> If this allocation wait is gonna be visible to userland, what's
> necessary probably would be making the sleeping interruptible. The
> freezer will then make the alloc fail and control should return to the
> signal delivery path where it'll be frozen without holding any
> resources.

Maybe. Interruptible sleeps here will be a bit of a nuisance with
signals. Poke Rafael ;)

2013-06-13 19:15:16

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

Hello, Andrew.

On Thu, Jun 13, 2013 at 12:04:39PM -0700, Andrew Morton wrote:
> > The thing is that id[r|a] guarantee that the lowest available slot is
> > allocated
>
> That isn't the case for ida_get_new_above() - the caller gets to
> control the starting index.

Hmmm? get_new_above() is the same, it must allocate the first
available ID above the given low bound - used to exclude unused or
reserved IDs.

> The worst outcome here is that idr.c remains unimproved and we merge a
> new allocator which does basically the same thing.

The lowest number guarantee makes them different. Maybe tag
allocation can be layered on top as a caching layer, I don't know, but
at any rate we need at least two different operation modes.

> The best outcome is that idr.c gets improved and we don't have to merge
> duplicative code.
>
> So please, let's put aside the shiny new thing for now and work out how
> we can use the existing tag allocator for these applications. If we
> make a genuine effort to do this and decide that it's fundamentally
> hopeless then this is the time to start looking at new implementations.
>
> (I can think of at least two ways of making ida_get_new_above() an
> order of magnitude faster for this application and I'm sure you guys
> can as well.)

Oh, I'm sure the current id[r|a] can be improved upon a lot but I'm
very skeptical one can reach the level of scalability necessary for,
say, pci-e attached extremely high-iops devices while still keeping
the lowest number allocation, which can't be achieved without strong
synchronization on each alloc/free.

Maybe we can layer things so that we have a percpu layer on top of
id[r|a] and, say, mapping id to pointer is still done by idr, or the
percpu tag allocator uses ida for tag chunk allocations, but it's
still gonna be something extra on top.

Thanks.

--
tejun

2013-06-13 19:22:02

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

Hello,

On Thu, Jun 13, 2013 at 12:13:24PM -0700, Andrew Morton wrote:
> As I understand it, if a task is stuck in this loop at freeze time, the
> whole freeze attempt will fail. But it's been a long time since I
> thought about or worked on this stuff.

It depends on who's calling the allocator but if the device driver or
its midlayer is doing things properly, the ideal way would be the task
being marked unfreezable (I think freezable kthreads harm more than
they help in general) and the driver suspend callback draining commands
while not issuing new ones. What the tag allocator does doesn't
really matter as all resources necessary for command execution should
eventually be freed before entering suspend anyway.

If that's not the case, the freezing should still be controlled by the
caller and again the allocator should be able to return -EBUSY/INTR to
indicate that, and then the caller, after releasing all other resources,
should try_to_freeze(). There simply is no reliable way to tell
from as deep as the tag allocator whether it'd be safe to freeze or not.

Thanks.

--
tejun

2013-06-13 19:23:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Thu, 13 Jun 2013 12:15:07 -0700 Tejun Heo <[email protected]> wrote:

> Hello, Andrew.
>
> On Thu, Jun 13, 2013 at 12:04:39PM -0700, Andrew Morton wrote:
> > > The thing is that id[r|a] guarantee that the lowest available slot is
> > > allocated
> >
> > That isn't the case for ida_get_new_above() - the caller gets to
> > control the starting index.
>
> Hmmm? get_new_above() is the same, it must allocate the first
> available ID above the given low bound - used to exclude unused or
> reserved IDs.

Right. So using different starting IDs for different CPUs can be used
to improve scalability.

> > The worst outcome here is that idr.c remains unimproved and we merge a
> > new allocator which does basically the same thing.
>
> The lowest number guarantee makes them different. Maybe tag
> allocation can be layered on top as a caching layer, I don't know, but
> at any rate we need at least two different operation modes.

Why? Tag allocation doesn't care about the values - just that they be
unique.

> > The best outcome is that idr.c gets improved and we don't have to merge
> > duplicative code.
> >
> > So please, let's put aside the shiny new thing for now and work out how
> > we can use the existing tag allocator for these applications. If we
> > make a genuine effort to do this and decide that it's fundamentally
> > hopeless then this is the time to start looking at new implementations.
> >
> > (I can think of at least two ways of making ida_get_new_above() an
> > order of magnitude faster for this application and I'm sure you guys
> > can as well.)
>
> Oh, I'm sure the current id[r|a] can be improved upon a lot but I'm
> very skeptical one can reach the level of scalability necessary for,
> say, pci-e attached extremely high-iops devices while still keeping
> the lowest number allocation, which can't be achieved without strong
> synchronization on each alloc/free.
>
> Maybe we can layer things so that we have percpu layer on top of
> id[r|a] and, say, mapping id to point is still done by idr, or the
> percpu tag allocator uses ida for tag chunk allocations, but it's
> still gonna be something extra on top.

It's not obvious that explicit per-cpu is needed. Get an ID from
ida_get_new_above(), multiply it by 16 and store that in device-local
storage, along with a 16-bit bitmap. Blam, 30 lines of code and the
ida_get_new_above() cost is reduced 16x and it's off the map.
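
Back-of-envelope sketch (names and the block size of 16 made up, error
handling omitted):

struct tag_block {
	unsigned base;		/* (id from ida_get_new_above()) * 16 */
	unsigned short used;	/* which of the 16 sub-tags are in use */
};

static int tag_block_alloc(struct tag_block *b)
{
	int bit = ffs(~b->used & 0xffff);

	if (!bit)
		return -1;	/* block exhausted - go get another ida ID */

	b->used |= 1U << (bit - 1);
	return b->base + bit - 1;
}

static void tag_block_free(struct tag_block *b, int tag)
{
	b->used &= ~(1U << (tag - b->base));
}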

Or perhaps you can think of something smarter, but first you have to
start thinking of solutions rather than trying to find problems :(

2013-06-13 19:35:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

Hello,

On Thu, Jun 13, 2013 at 12:23:39PM -0700, Andrew Morton wrote:
> > The lowest number guarantee makes them different. Maybe tag
> > allocation can be layered on top as a caching layer, I don't know, but
> > at any rate we need at least two different operation modes.
>
> Why? Tag allocation doesn't care about the values - just that they be
> unique.

Hmmm? Confused. I was talking about other existing idr users which
need lowest-available allocation. Tag allocation doesn't care. ID
allocators do.

> > Maybe we can layer things so that we have percpu layer on top of
> > id[r|a] and, say, mapping id to point is still done by idr, or the
> > percpu tag allocator uses ida for tag chunk allocations, but it's
> > still gonna be something extra on top.
>
> It's not obvious that explicit per-cpu is needed. Get an ID from
> ida_get_new_above(), multiply it by 16 and store that in device-local
> storage, along with a 16-bit bitmap. Blam, 30 lines of code and the
> ida_get_new_above() cost is reduced 16x and it's off the map.

I'm fairly sure it'd have to be per-cpu. The idr allocation is
reduced 16x but now each of those 16 slots needs to be allocated. The
problem hasn't gone away and we do need some sort of utility to help
with that, as drivers tend to resort to things like a linear bitmap scan
combined with test_and_set_bit(), making one cacheline extremely hot.

> Or perhaps you can think of something smarter, but first you have to
> start thinking of solutions rather than trying to find problems :(

I don't know. It's pretty clear to me that we at least need two
different operation modes from the two conflicting requirements - one
with strict allocation ordering and the other with very high
scalability. Please note that I'm not saying they both can't be built
into id[r|a], but we *need* two different operation modes.

Thanks.

--
tejun

2013-06-13 21:14:15

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Thu, Jun 13, 2013 at 12:21:52PM -0700, Tejun Heo wrote:
> Hello,
>
> On Thu, Jun 13, 2013 at 12:13:24PM -0700, Andrew Morton wrote:
> > As I understand it, if a task is stuck in this loop at freeze time, the
> > whole freeze attempt will fail. But it's been a long time since I
> > thought about or worked on this stuff.
>
> It depends on who's calling the allocator but if the device driver or
> its midlayer is doing things properly, the ideal way would be the task
> being marked unfreezable (I think freezable kthreads harms more than
> help in general) and the driver suspend callback draining commands
> while not issuing new ones. What the tag allocator does doesn't
> really matter as all resources necessary for command execution should
> eventually be freed before entering suspend anyway.
>
> If that's not the case, the freezing should still be controlled by the
> caller and again the allocator should be able to return -EBUSY/INTR to
> indicate and then the caller, after releasing all other resources,
> should try_to_freeze(). There simply is no reliable way to tell
> from as deep as tag allocator whether it'd be safe to freeze or not.

Yeah, I think you're definitely right. (I only started reading up on the
freezer stuff yesterday, though).

Do you know offhand what existing (i.e. slab) allocators do? Whatever
they do should make sense for us.

2013-06-13 21:50:20

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

(cc'ing Rafael and Oleg)

On Thu, Jun 13, 2013 at 02:14:25PM -0700, Kent Overstreet wrote:
> Yeah, I think you're definitely right. (I only started reading up on the
> freezer stuff yesterday, though).
>
> Do you know offhand what existing (i.e. slab) allocators do? Whatever
> they do should make sense for us.

I don't think the memory allocator does anything. Memory allocations
are guaranteed to make forward progress and everything should still be
working while freezing, so it doesn't need to do anything special. If
the tag allocator is to be used only by kernel proper - say drivers,
block layer, it shouldn't need to do anything special. If it's
directly exposed to userland via something like aio and the userland
is involved in guaranteeing forward progress - ie. freeing of tags -
then the allocator would need to be able to fail, I think.

Thanks.

--
tejun

2013-06-13 21:53:47

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Thu, Jun 13, 2013 at 11:53:18AM -0700, Tejun Heo wrote:
> Hello, Andrew, Kent.
>
> (cc'ing NFS folks for id[r|a] discussion)
>
> On Wed, Jun 12, 2013 at 08:03:11PM -0700, Andrew Morton wrote:
> > They all sound like pretty crappy reasons ;) If the idr/ida interface
> > is nasty then it can be wrapped to provide the same interface as the
> > percpu tag allocator.
> >
> > I could understand performance being an issue, but diligence demands
> > that we test that, or at least provide a convincing argument.
>
> The thing is that id[r|a] guarantee that the lowest available slot is
> allocated and this is important because it's used to name things which
> are visible to userland - things like block device minor number,
> device indicies and so on. That alone pretty much ensures that
> alloc/free paths can't be very scalable which usually is fine for most
> id[r|a] use cases as long as lookup is fast. I'm doubtful that it's a
> good idea to push per-cpu tag allocation into id[r|a]. The use cases
> are quite different.
>
> In fact, maybe what we can do is adding some features on top of the
> tag allocator and moving id[r|a] users which don't require strict
> in-order allocation to it. For example, NFS allocates an ID for each
> transaction it performs and uses it to index the associate command
> structure (Jeff, Bruce, please correct me if I'm getting it wrong).
> The only requirement on IDs is that they shouldn't be recycled too
> fast. Currently, idr implements cyclic mode for it but it can easily
> be replaced with per-cpu tag allocator like this one and it'd be a lot
> more scalable. There are a couple things to worry about tho - it
> probably should use the highbits as generation number as a tag is
> given out so that the actual ID doesn't get recycled quickly, and some
> form dynamic tag sizing would be nice too.

Yeah, that sounds like a perfect use.

Using the high bits as a gen number - that's something I've done before
in driver code, and that can be done completely outside the tag
allocator - no need for a cyclic mode.
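
E.g. in the driver it's just something like this (a sketch; the bit
widths are made up - you use whatever the protocol gives you):

#define TAG_BITS	10
#define TAG_MASK	((1U << TAG_BITS) - 1)

/*
 * Visible ID = per-tag generation count in the high bits, tag in the low
 * bits - so a freed tag doesn't come back as the same ID right away.
 */
static u32 make_id(u32 *gen, unsigned tag)
{
	return (++gen[tag] << TAG_BITS) | tag;
}

static unsigned id_to_tag(u32 id)
{
	return id & TAG_MASK;
}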

For dynamic sizing, the issue is not so much dynamically sizing the tag
allocator's data structures - the tag allocator itself will use a
fraction of the memory of your tag structs - it's that you want to do
something slightly more intelligent than preallocating one giant array
of tag structs.

I already ran into this in the aio code - kiocbs are just big enough
that we don't want to preallocate them all when we allocate the kioctx.
I did the simplest thing I could think of for the aio code, but if other
users are going to be running into this too maybe it should be made
generic too.

Anyways, for aio I just use an array of pages for the kiocbs instead of
a flat array, and then the pages are allocated lazily.

http://evilpiepirate.org/git/linux-bcache.git/commit/?h=aio&id=999e7718f6b7ec99512fd576b166e5d63cd45ef2
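
The lookup side of that is roughly (a sketch of the idea, not the actual
code from that branch):

static struct kiocb *kiocb_from_tag(struct page **pages, unsigned tag)
{
	unsigned per_page = PAGE_SIZE / sizeof(struct kiocb);
	struct page *p = pages[tag / per_page];

	/*
	 * Pages are only allocated the first time a tag in their range is
	 * handed out, so lightly loaded kioctxs never touch most of them.
	 */
	if (!p)
		return NULL;

	return (struct kiocb *) page_address(p) + tag % per_page;
}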

Since the tag allocator uses stacks, it'll tend to give out ids that
were previously allocated and this should work pretty well in practice.
The one caveat right now is that if the workload is shifting across
cpus, tags being stranded on percpu freelists would cause us to allocate
pages sooner than we probably want to.

I don't think this is a big issue because the tag stealing is done based
on the worst case number of stranded tags - but I think I can improve it
with a bit of laziness...

2013-06-13 21:57:49

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Thu, Jun 13, 2013 at 02:50:09PM -0700, Tejun Heo wrote:
> (cc'ing Rafael and Oleg)
>
> On Thu, Jun 13, 2013 at 02:14:25PM -0700, Kent Overstreet wrote:
> > Yeah, I think you're definitely right. (I only started reading up on the
> > freezer stuff yesterday, though).
> >
> > Do you know offhand what existing (i.e. slab) allocators do? Whatever
> > they do should make sense for us.
>
> I don't think the memory allocator does anything. Memory allocations
> are guaranteed to make forward progress and everything should still be
> working while freezing, so it doesn't need to do anything special. If
> the tag allocator is to be used only by kernel proper - say drivers,
> block layer, it shouldn't need to do anything special. If it's
> directly exposed to userland via something like aio and the userland
> is involved in guaranteeing forward progress - ie. freeing of tags -
> then the allocator would need to be able to fail, I think.

That's a good point - as long as whatever is allocated is guaranteed to
be freed in bounded time, we should be fine.

2013-06-13 22:10:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Thu, 13 Jun 2013 12:35:14 -0700 Tejun Heo <[email protected]> wrote:

> > > Maybe we can layer things so that we have percpu layer on top of
> > > id[r|a] and, say, mapping id to point is still done by idr, or the
> > > percpu tag allocator uses ida for tag chunk allocations, but it's
> > > still gonna be something extra on top.
> >
> > It's not obvious that explicit per-cpu is needed. Get an ID from
> > ida_get_new_above(), multiply it by 16 and store that in device-local
> > storage, along with a 16-bit bitmap. Blam, 30 lines of code and the
> > ida_get_new_above() cost is reduced 16x and it's off the map.
>
> I'm fairly sure it'd have to be per-cpu. The idr allocation is
> reduced 16x but now each of those 16 slots needs to be allocated. The
> problem hasn't gone away and we do need some sort of utility to help
> that as drivers tend to resort to things like linear bitmap scan
> combined with test_and_set_bit() making one cacheline extremely hot.

Well OK, make it per-cpu then. Or think up something better.

Look, we're falling into the age-old trap of trying to justify existing
code just because it exists.

Stop. Take a step back and pretend that the percpu tag allocator patch
never existed. Now, define the problem and propose solutions.

The absolute dead last and worst solution is "implement something new
which largely duplicates existing code". Such a step requires an
extraordinary amount of justification and that hasn't happened.

2013-06-13 22:30:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

Hello, Andrew.

On Thu, Jun 13, 2013 at 03:10:40PM -0700, Andrew Morton wrote:
> Look, we're falling into the age-old trap of trying to justify existing
> code just because it exists.
>
> Stop. Take a step back and pretend that the percpu tag allocator patch
> never existed. Now, define the problem and propose solutions.
>
> The absolute dead last and worst solution is "implement something new
> which largely duplicates existing code". Such a step requires an
> extraordinary amount of justification and that hasn't happened.

I think you're looking at it from the complete opposite point from
mine. That problem has always existed and solutions have been implemented
many times over in the whole IO stack across aio, block layer proper,
SCSI and individual drivers, and most, if not all, suck. If we can
consolidate those into a properly designed tag allocator, it's a win,
whether it's integrated into idr or not. It's a well-known problem
with incomplete bad solutions implemented in multiple places.

Again, I'm not saying integrating this into idr is a bad idea. If
that's reasonably doable, awesomer, but I think even just
consolidating the existing bad ones into a common one is beneficial
enough.

Thanks.

--
tejun

2013-06-13 22:36:02

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Thu, 13 Jun 2013 15:30:28 -0700 Tejun Heo <[email protected]> wrote:

> Hello, Andrew.
>
> On Thu, Jun 13, 2013 at 03:10:40PM -0700, Andrew Morton wrote:
> > Look, we're falling into the age-old trap of trying to justify existing
> > code just because it exists.
> >
> > Stop. Take a step back and pretend that the percpu tag allocator patch
> > never existed. Now, define the problem and propose solutions.
> >
> > The absolute dead last and worst solution is "implement something new
> > which largely duplicates existing code". Such a step requires an
> > extraordinary amount of justification and that hasn't happened.
>
> I think you're looking at it from the complete opposite point from
> mine. That problem has always existed and the solutions implemented
> many times over in the whole IO stack across aio, block layer proper,
> SCSI and individual drivers, and most, if not all, suck. If we can
> consolidate those into a properly designed tag allocator, it's a win,
> whether it's integrated into idr or not. It's a well-known problem
> with incomplete bad solutions implemented in multiple places.

Well that's news. My awe at the dreadfulness of Kent's changelog
continues to grow. Please identify these duplicated allocators
(function names will suffice). That way, others can actually review
this code :(

2013-06-13 23:13:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

Hello,

On Thu, Jun 13, 2013 at 03:35:51PM -0700, Andrew Morton wrote:
> Well that's news. My awe at the dreadfulness of Kent's changelog
> continues to grow. Please identify these duplicated allocators
> (function names will suffice). That way, others can actually review
> this code :(

Oh yeah, Kent could and should have sold it much better. :)

Thanks.

--
tejun

2013-06-13 23:23:25

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Thu, Jun 13, 2013 at 04:13:42PM -0700, Tejun Heo wrote:
> On Thu, Jun 13, 2013 at 03:35:51PM -0700, Andrew Morton wrote:
> > Well that's news. My awe at the dreadfulness of Kent's changelog
> > continues to grow. Please identify these duplicated allocators
> > (function names will suffice). That way, others can actually review
> > this code :(
>
> Oh yeah, Kent could and should have sold it much better. :)

BTW, here are three that I know of off the top of my head.

* block/blk-tag.c - this is pretty big and does resizing and stuff.

* drivers/ata/libata-core.c::ata_qc_new()

* drivers/block/nvme-core.c::alloc_cmdid()

Thanks.

--
tejun

2013-06-19 01:31:54

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH] Percpu tag allocator

On Thu, Jun 13, 2013 at 12:23:39PM -0700, Andrew Morton wrote:
> On Thu, 13 Jun 2013 12:15:07 -0700 Tejun Heo <[email protected]> wrote:
> > Oh, I'm sure the current id[r|a] can be improved upon a lot but I'm
> > very skeptical one can reach the level of scalability necessary for,
> > say, pci-e attached extremely high-iops devices while still keeping
> > the lowest number allocation, which can't be achieved without strong
> > synchronization on each alloc/free.
> >
> > Maybe we can layer things so that we have percpu layer on top of
> > id[r|a] and, say, mapping id to point is still done by idr, or the
> > percpu tag allocator uses ida for tag chunk allocations, but it's
> > still gonna be something extra on top.
>
> It's not obvious that explicit per-cpu is needed. Get an ID from
> ida_get_new_above(), multiply it by 16 and store that in device-local
> storage, along with a 16-bit bitmap. Blam, 30 lines of code and the
> ida_get_new_above() cost is reduced 16x and it's off the map.

I was just rereading emails and realized I should've replied to this.

So, I started out aiming for something simpler. I don't quite follow
what the approach you're suggesting is, but it ends up that you really do
need the percpu freelists (to batch allocation/freeing against the main
freelist) because ids/tags may not be freed on the cpu they were
allocated on.

In particular, if this is for a driver and the device doesn't implement
per cpu queues, tags are almost always going to be freed on a different
cpu. If you just give each cpu a fraction of the tag space they always
allocate out of (with ida_get_new_above() or similar) - that only helps
allocation; half the cacheline contention comes from freeing.

I originally wrote this tag allocator to use in a driver for a device
that didn't support multiple hardware queues at all, but it was fast
enough that any cacheline bouncing really hurt.

So that right there gets you to the basic design where you've got a
global freelist and percpu freelists, and you just use the percpu
freelists to batch up allocation/freeing to the global freelist.

The tag allocator I was going to submit for the aio code was pretty much
just that, nothing more. It was simple. It worked. I was happy with it.

The one concern with this approach is what happens when all the percpu
freelists except yours are full of free tags. Originally, I had an easy
solution - we calculate the size of the percpu freelists based on
nr_tags and num_possible_cpus, so that there can't be more than half of
the tag space stranded like this.

(Keep in mind the main use case is drivers where the tag/id is used to
talk to the device, so you're limited by whatever the hardware people
thought was enough - 16 bits if you're lucky).
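
The sizing itself was only something like this (a rough sketch):

/*
 * Bound the percpu freelist size so that even if every other cpu's
 * freelist is completely full, at most half the tag space is stranded.
 */
unsigned cpu_freelist_size = (nr_tags / 2) / num_possible_cpus();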

But then Tejun went and pointed out, just as I was about to mail it off
- "Hey, what happens if you've got 4096 cpus and not all that many tags?
You've got a nice divide by zero in there".

After which I proceeded to swear at him a bit, but - well, it's a real
problem. And that is what led to the tag stealing stuff and all the
cmpxchg() shenanigans. And I'm pretty happy with the solution - there's
an elegance to it and I bet if I cared I could come up with a proof that
it's more or less optimal w.r.t. cacheline bounces for some decent
fraction of workloads we might care about. But yeah, it's not as simple
as I would've liked.

Anyways, now you've got an ida/idr api cleanup/rewrite to go along with
it, and it's all nicely integrated. Integrating the percpu tag allocator
with regular ida really doesn't save us any code - the original global
freelist was a stack and like 10 lines of code total.

But having the apis be consistent and having it all be organized and
pretty is nice. I think it is now, anyways :)