2007-09-02 18:27:51

by Nick Piggin

Subject: [rfc][patch] dynamic data structure switching

Hi,

This is my "data structure switch" algorithm that can convert one data
structure into another, with just a single unlikely branch in fastpaths
and no locking or atomic operations (the branch is only triggered when
the data structure is in the process of being converted). A pointer
indirection is generally also needed when converting a global data
structure.

I posted this algorithm a while back and converted the dcache hash to
be dynamically resized at runtime[*], and David Miller started on a
patch to make one of the networking hashes dynamic too.

[*] The algorithm need not only convert between two different sized
hashes, but that's just the most obvious and useful thing to do.
It _could_ convert between a hash and a tree or something crazy
like that too :)

Anyway, the problem with this is that I hadn't found a really nice
way to abstract this functionality and package it in a nice API. It
is almost small enough that open-coding it would be reasonable...
but it would be much nicer to be able to abstract it.

Anyway, here is the algorithm nicely packaged behind a really crappy
API :P I think the idea might be useful enough not to lose it, so
I'm posting this WIP just as a request for comments.

The algorithm implementation is basically optimal, but the API means
that it requires an extra function call and an indirect function call
to work, which probably makes it unusable for important implementations.
(C preprocessor magic to avoid the indirect function calls and allow
some level of templating while retaining the types might be the way
to do it).
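
As a rough, untested sketch of the preprocessor route (not part of the
patch), a lookup "template" could look something like the macro below:
the fastpath then makes a direct, inlinable call, and only a resize in
progress falls back to the generic indirect version:

#define DEFINE_DYN_DATA_LOOKUP(name, search_fn)			\
static inline void *name(struct dyn_data *dd, void *arg)		\
{									\
	void *ret;							\
									\
	rcu_read_lock();						\
	if (likely(!dd->old)) {						\
		/* direct call, so the compiler can inline search_fn */	\
		ret = search_fn(dd->cur, arg);				\
		rcu_read_unlock();					\
		return ret;						\
	}								\
	rcu_read_unlock();						\
	/* resize in progress: take the generic indirect slowpath */	\
	return dyn_data_lookup(dd, search_fn, arg);			\
}

so that, say, kernel/pid.c could do
DEFINE_DYN_DATA_LOOKUP(pid_hash_find, dd_lookup_pid) and call
pid_hash_find() from find_pid(). Retaining the caller's own types rather
than void * would need more macro arguments, which is where it starts to
get ugly.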

Dumb pidhash conversion also provided to help people tinker.


Index: linux-2.6/lib/dyndata.c
===================================================================
--- /dev/null
+++ linux-2.6/lib/dyndata.c
@@ -0,0 +1,80 @@
+#include <linux/dyndata.h>
+#include <linux/mutex.h>
+#include <linux/rcupdate.h>
+#include <linux/seqlock.h>
+
+void dyn_data_init(struct dyn_data *dd, void *cur_data)
+{
+ dd->cur = cur_data;
+ dd->old = NULL;
+ seqcount_init(&dd->resize_seq);
+ mutex_init(&dd->resize_mutex);
+}
+
+/*
+ * Slowpath lookup. There is a resize in progress and we weren't able to
+ * find the item we wanted in dyn_data_lookup: run the full race-free
+ * lookup protocol.
+ */
+static noinline void *dyn_data_lookup_slow(struct dyn_data *dd, dd_lookup_fn fn, void *arg)
+{
+ void *ret;
+ void *cur, *old;
+ unsigned seq;
+
+ cur = dd->cur;
+ old = dd->old;
+
+ do {
+ seq = read_seqcount_begin(&dd->resize_seq);
+ ret = fn(cur, arg);
+ if (ret)
+ return ret;
+ if (!old)
+ return NULL;
+
+ ret = fn(old, arg);
+ if (ret)
+ return ret;
+ } while (read_seqcount_retry(&dd->resize_seq, seq));
+
+ return NULL;
+}
+
+void *dyn_data_lookup(struct dyn_data *dd, dd_lookup_fn fn, void *arg)
+{
+ void *ret;
+
+ rcu_read_lock();
+ if (likely(!dd->old))
+ ret = fn(dd->cur, arg);
+ else
+ ret = dyn_data_lookup_slow(dd, fn, arg);
+ rcu_read_unlock();
+
+ return ret;
+}
+
+void *dyn_data_replace(struct dyn_data *dd, dd_transfer_fn fn, void *new)
+{
+ int xfer_done;
+ void *old;
+
+ BUG_ON(!mutex_is_locked(&dd->resize_mutex));
+ old = dd->cur;
+ BUG_ON(dd->old);
+ dd->old = old;
+ synchronize_rcu();
+ rcu_assign_pointer(dd->cur, new);
+ synchronize_rcu();
+ do {
+ write_seqcount_begin(&dd->resize_seq);
+ xfer_done = fn(old, new);
+ write_seqcount_end(&dd->resize_seq);
+ } while (!xfer_done);
+ dd->old = NULL;
+ synchronize_rcu();
+
+
+ return old;
+}
Index: linux-2.6/include/linux/dyndata.h
===================================================================
--- /dev/null
+++ linux-2.6/include/linux/dyndata.h
@@ -0,0 +1,94 @@
+#ifndef __DYNDATA_H__
+#define __DYNDATA_H__
+
+/*
+ * Dynamic data structure algorithm
+ *
+ * We have a given data structure that we are operating on, this is pointed
+ * to by 'cur', or current, data structure. There is also an 'old' data
+ * structure pointer, which is NULL unless cur is in the process of being
+ * replaced.
+ *
+ * When we wish to replace the cur data structure with another, we perform the
+ * following (numbered) steps (letters are not steps but are relevant info):
+ *
+ * 1. Set 'old' to the value of cur (was NULL). That is, the current data
+ * structure becomes the old one.
+ *
+ * a. If a lookup in the current data structure is negative, old
+ * must be loaded, and if it is non-NULL, then that data structure should be
+ * searched too. This must happen under rcu_read_lock().
+ *
+ * 2. synchronize_rcu(), so all lookups after that point will find old to be
+ * non-NULL (by property a).
+ *
+ * 3. Set cur to the new data structure. This structure will be initially
+ * empty, but property a will ensure that lookups will find old and hence
+ * find the items that were in the old data structure.
+ *
+ * 4. synchronize_rcu(), so all lookups after that point will also see the
+ * new data structure (via cur).
+ *
+ * b. All data structure insertions must insert new items into the current
+ * table. This property ensures that old will not get more items.
+ *
+ * c. If duplicate items cannot be tolerated, insertion must search for a
+ * pre-existing item (ie. do a lookup which may search the old and the
+ * new data structures), before inserting one, if a duplicate could be
+ * possible.
+ *
+ * 5. Transfer items from the old data structure to the new one. Property
+ * b ensures that old will eventually become empty.
+ *
+ * 6. When transferring items, it will often be required to delete the
+ * item from the old data structure before inserting it into the new
+ * one. This opens a window for a possible race where lookup would not
+ * find the item. If transient false negatives cannot be tolerated,
+ * this race must be covered somehow. Fortunately, it only needs to
+ * be considered when the lookup finds old to be non-NULL, and lookups
+ * are usually idempotent and side-effect free, so a simple seqcount
+ * in the lookup slowpath is used here to cover the race.
+ *
+ * 7. After the old data structure is emptied, set the old pointer to NULL.
+ * There is nothing in the data structure, so there is no reason for
+ * lookups to see the old data structure.
+ *
+ * 8. synchronize_rcu() again, to ensure that no lookups will have loaded
+ * the non-NULL value of old.
+ *
+ * 9. The caller is now free to deallocate the old data structure.
+ *
+ * Note: while synchronize_rcu() is used here, call_rcu could be used instead
+ * in some situations.
+ *
+ * Note2: memory barriers are not needed in the read-side (AFAIKS) with this
+ * implementation. Not even rcu_dereference when loading the potentially
+ * changing cur and old pointers: the write-side has gone through a
+ * synchronize_rcu() before making new data visible, which means the read-side
+ * must have gone through a quiescent state which is a full barrier.
+ */
+
+
+#include <linux/mutex.h>
+#include <linux/seqlock.h>
+
+struct dyn_data {
+ void *cur;
+ void *old;
+ seqcount_t resize_seq;
+ struct mutex resize_mutex;
+};
+
+typedef void *(dd_lookup_fn)(void *dstruct, void *arg);
+typedef int (dd_transfer_fn)(void *old_ds, void *new_ds);
+
+void dyn_data_init(struct dyn_data *dd, void *cur_data);
+void *dyn_data_lookup(struct dyn_data *dd, dd_lookup_fn fn, void *arg);
+void *dyn_data_replace(struct dyn_data *dd, dd_transfer_fn fn, void *new);
+
+static inline void *dyn_data_current_dstruct(struct dyn_data *dd)
+{
+ return dd->cur;
+}
+
+#endif
Index: linux-2.6/lib/Makefile
===================================================================
--- linux-2.6.orig/lib/Makefile
+++ linux-2.6/lib/Makefile
@@ -10,7 +10,7 @@ lib-y := ctype.o string.o vsprintf.o kas
lib-$(CONFIG_MMU) += ioremap.o
lib-$(CONFIG_SMP) += cpumask.o

-lib-y += kobject.o kref.o kobject_uevent.o klist.o
+lib-y += kobject.o kref.o kobject_uevent.o klist.o dyndata.o

obj-y += div64.o sort.o parser.o halfmd4.o debug_locks.o random32.o \
bust_spinlocks.o hexdump.o
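
For orientation, a condensed sketch of what a caller of this API looks
like (the my_table / my_item names and helpers are made up, just to show
the shape -- the pidhash conversion is the real example):

static struct dyn_data dyn_myhash;	/* dyn_data_init()ed at boot */

/* dd_lookup_fn: search one table, return the item or NULL */
static void *my_lookup(void *dstruct, void *arg)
{
	struct my_table *t = dstruct;

	return my_table_find(t, (unsigned long)arg);	/* assumed helper */
}

/* dd_transfer_fn: move everything across, return 1 when old is empty */
static int my_transfer(void *old_ds, void *new_ds)
{
	my_table_move_all(old_ds, new_ds);		/* assumed helper */
	return 1;
}

struct my_item *my_find(unsigned long key)
{
	return dyn_data_lookup(&dyn_myhash, my_lookup, (void *)key);
}

void my_resize(struct my_table *bigger)
{
	struct my_table *retired;

	/* callers of dyn_data_replace() serialise on dd->resize_mutex */
	mutex_lock(&dyn_myhash.resize_mutex);
	retired = dyn_data_replace(&dyn_myhash, my_transfer, bigger);
	mutex_unlock(&dyn_myhash.resize_mutex);
	my_table_free(retired);				/* assumed helper */
}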


2007-09-02 18:36:29

by Nick Piggin

Subject: Re: [rfc][patch] dynamic data structure switching

Dumb, slightly incomplete, dynamic pidhash resizing. I'm just sending this
as a reference and testing tool. While the pid-hash might be the least
problematic one we have, RAM saving / collision minimisation aren't the
only reasons to size a hash optimally: in a really lookup intensive
workload, a small dense hash table will have between 4-32x better cache
efficiency than a very sparse one, depending on line size and pointer
size. So it is possible that resizing even reasonably small hashes can
be a good idea.
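
(Back-of-the-envelope on the 4-32x, not measured: the number of hlist_head
bucket heads per cache line is line_size / pointer_size, anywhere from
32/8 = 4 up to 128/4 = 32. In a densely loaded table essentially every
head sharing a cached line is a live bucket, while in a very sparse table
most of them are empty, so that ratio is roughly how much more useful data
the dense table keeps per cache line, which is where the range above comes
from.)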

Index: linux-2.6/kernel/pid.c
===================================================================
--- linux-2.6.orig/kernel/pid.c
+++ linux-2.6/kernel/pid.c
@@ -28,13 +28,25 @@
#include <linux/hash.h>
#include <linux/pid_namespace.h>
#include <linux/init_task.h>
+#include <linux/dyndata.h>

-#define pid_hashfn(nr) hash_long((unsigned long)nr, pidhash_shift)
-static struct hlist_head *pid_hash;
-static int pidhash_shift;
+struct pid_hash {
+ struct hlist_head *table;
+ unsigned int shift;
+};
+
+static struct pid_hash pid_hashes[2];
+static unsigned int ph_cur_idx;
+static struct dyn_data dyn_pidhash;
+static unsigned int nr_pids;
+
+/* cur_pid_hash should be used under rcu_read_lock() */
+#define cur_pid_hash ((struct pid_hash *)dyn_data_current_dstruct(&dyn_pidhash))
+#define pid_hashfn(ph, nr) hash_long((unsigned long)nr, ph->shift)
static struct kmem_cache *pid_cachep;
struct pid init_struct_pid = INIT_STRUCT_PID;

+
int pid_max = PID_MAX_DEFAULT;

#define RESERVED_PIDS 300
@@ -190,21 +202,91 @@ static void delayed_put_pid(struct rcu_h
put_pid(pid);
}

+int dd_transfer_pids(void *old_ds, void *new_ds)
+{
+ struct pid_hash *old = old_ds;
+ struct pid_hash *new = new_ds;
+ unsigned int size, i;
+
+ size = 1UL << old->shift;
+
+ BUG_ON(old == new);
+
+ for (i = 0; i < size; i++) {
+ struct hlist_node *elem;
+ struct pid *pid;
+
+ spin_lock_irq(&pidmap_lock);
+again:
+ hlist_for_each_entry_rcu(pid, elem, &old->table[i], pid_chain) {
+ hlist_del_rcu(&pid->pid_chain);
+ hlist_add_head_rcu(&pid->pid_chain,
+ &new->table[pid_hashfn(new, pid->nr)]);
+ goto again; /* XXX: no _safe variant */
+ }
+ spin_unlock_irq(&pidmap_lock);
+ }
+
+ return 1;
+}
+
+static int init_pid_hash(struct pid_hash *ph, unsigned int pidhash_shift);
+
+static void resize_pid_hash(void)
+{
+ unsigned int old_shift, new_shift;
+
+ if (system_state != SYSTEM_RUNNING)
+ return;
+
+ old_shift = cur_pid_hash->shift;
+ new_shift = ilog2(nr_pids * 2 - 1);
+ if (new_shift == old_shift)
+ return;
+
+ if (!mutex_trylock(&dyn_pidhash.resize_mutex))
+ return;
+
+ old_shift = cur_pid_hash->shift;
+ new_shift = ilog2(nr_pids * 2 - 1);
+ if (new_shift != old_shift) {
+ struct pid_hash *ph, *ret;
+ unsigned int idx = ph_cur_idx ^ 1;
+ ph = &pid_hashes[idx];
+ if (!init_pid_hash(ph, new_shift)) {
+ ph_cur_idx = idx;
+
+ ret = dyn_data_replace(&dyn_pidhash, dd_transfer_pids, ph);
+ BUG_ON(ret == ph);
+ BUG_ON(ret != &pid_hashes[idx ^ 1]);
+ /* XXX: kfree(ret->table) */
+ ret->shift = -1;
+ ret->table = NULL;
+ }
+ }
+ mutex_unlock(&dyn_pidhash.resize_mutex);
+}
+
fastcall void free_pid(struct pid *pid)
{
/* We can be called with write_lock_irq(&tasklist_lock) held */
unsigned long flags;

spin_lock_irqsave(&pidmap_lock, flags);
+ nr_pids--;
hlist_del_rcu(&pid->pid_chain);
spin_unlock_irqrestore(&pidmap_lock, flags);

free_pidmap(&init_pid_ns, pid->nr);
call_rcu(&pid->rcu, delayed_put_pid);
+
+// if (nr_pids <= (1UL << cur_pid_hash->shift) / 2)
+// resize_pid_hash();
}

struct pid *alloc_pid(void)
{
+ struct pid_hash *ph;
struct pid *pid;
enum pid_type type;
int nr = -1;
@@ -223,9 +305,13 @@ struct pid *alloc_pid(void)
INIT_HLIST_HEAD(&pid->tasks[type]);

spin_lock_irq(&pidmap_lock);
- hlist_add_head_rcu(&pid->pid_chain, &pid_hash[pid_hashfn(pid->nr)]);
+ nr_pids++;
+ ph = cur_pid_hash;
+ hlist_add_head_rcu(&pid->pid_chain, &ph->table[pid_hashfn(ph, nr)]);
spin_unlock_irq(&pidmap_lock);

+// if (nr_pids >= 3 * (1UL << cur_pid_hash->shift) / 2)
+ resize_pid_hash();
out:
return pid;

@@ -235,18 +321,25 @@ out_free:
goto out;
}

-struct pid * fastcall find_pid(int nr)
+static void *dd_lookup_pid(void *dstruct, void *arg)
{
+ struct pid_hash *ph = dstruct;
+ int nr = (unsigned long)arg;
struct hlist_node *elem;
struct pid *pid;

hlist_for_each_entry_rcu(pid, elem,
- &pid_hash[pid_hashfn(nr)], pid_chain) {
+ &ph->table[pid_hashfn(ph, nr)], pid_chain) {
if (pid->nr == nr)
return pid;
}
return NULL;
}
+
+struct pid * fastcall find_pid(int nr)
+{
+ return dyn_data_lookup(&dyn_pidhash, dd_lookup_pid, (void *)(unsigned long)nr);
+}
EXPORT_SYMBOL_GPL(find_pid);

/*
@@ -380,6 +473,28 @@ void free_pid_ns(struct kref *kref)
kfree(ns);
}

+static int init_pid_hash(struct pid_hash *ph, unsigned int pidhash_shift)
+{
+ unsigned int i, pidhash_size;
+
+ ph->shift = pidhash_shift;
+ pidhash_size = 1 << pidhash_shift;
+
+ printk("PID hash table entries: %d (order: %d, %Zd bytes)\n",
+ pidhash_size, pidhash_shift,
+ pidhash_size * sizeof(struct hlist_head));
+
+ ph->table = kmalloc(pidhash_size * sizeof(struct hlist_head), GFP_KERNEL);
+ if (!ph->table) {
+ printk("Could not alloc pidhash!\n");
+ return -ENOMEM;
+ }
+ for (i = 0; i < pidhash_size; i++)
+ INIT_HLIST_HEAD(&ph->table[i]);
+
+ return 0;
+}
+
/*
* The pid hash table is scaled according to the amount of memory in the
* machine. From a minimum of 16 slots up to 4096 slots at one gigabyte or
@@ -387,22 +502,28 @@ void free_pid_ns(struct kref *kref)
*/
void __init pidhash_init(void)
{
- int i, pidhash_size;
+ struct pid_hash *ph;
+ int i, pidhash_shift, pidhash_size;
unsigned long megabytes = nr_kernel_pages >> (20 - PAGE_SHIFT);

+ ph_cur_idx = 0;
+ ph = &pid_hashes[ph_cur_idx];
pidhash_shift = max(4, fls(megabytes * 4));
pidhash_shift = min(12, pidhash_shift);
+ ph->shift = pidhash_shift;
pidhash_size = 1 << pidhash_shift;

printk("PID hash table entries: %d (order: %d, %Zd bytes)\n",
pidhash_size, pidhash_shift,
pidhash_size * sizeof(struct hlist_head));

- pid_hash = alloc_bootmem(pidhash_size * sizeof(*(pid_hash)));
- if (!pid_hash)
+ ph->table = alloc_bootmem(pidhash_size * sizeof(struct hlist_head));
+ if (!ph->table)
panic("Could not alloc pidhash!\n");
for (i = 0; i < pidhash_size; i++)
- INIT_HLIST_HEAD(&pid_hash[i]);
+ INIT_HLIST_HEAD(&ph->table[i]);
+
+ dyn_data_init(&dyn_pidhash, ph);
}

void __init pidmap_init(void)

2007-09-05 22:50:43

by Oleg Verych

Subject: Fast path efficiency (Re: [rfc][patch] dynamic data structure switching)

* Sun, 2 Sep 2007 20:36:19 +0200
>

I see that in many places all of the pre-checks are done in negative form,
with an early return or jump out as the result. In that case, once the
function has been called, which path is the likely() one?

> +static void resize_pid_hash(void)
> +{
> + unsigned int old_shift, new_shift;
> +
> + if (system_state != SYSTEM_RUNNING)
> + return;
> +
> + old_shift = cur_pid_hash->shift;
> + new_shift = ilog2(nr_pids * 2 - 1);
> + if (new_shift == old_shift)
> + return;
> +
> + if (!mutex_trylock(&dyn_pidhash.resize_mutex))
> + return;

that one or this?

==
if (system_state == SYSTEM_RUNNING) {
old_shift = cur_pid_hash->shift;
new_shift = ilog2(nr_pids * 2 - 1);
if (new_shift != old_shift && mutex_trylock(&dyn_pidhash.resize_mutex)) {
==
> + old_shift = cur_pid_hash->shift;
> + new_shift = ilog2(nr_pids * 2 - 1);

/* hope this repetition is needed by design */

...

> + mutex_unlock(&dyn_pidhash.resize_mutex);
}

What is more efficient in the general sense,
as opposed to under a particular -Os/-O3/-O2/-O1/-O0 optimization level?

> + if (new_shift != old_shift) {
> + struct pid_hash *ph, *ret;
> + unsigned int idx = ph_cur_idx ^ 1;
> + ph = &pid_hashes[idx];
> + if (!init_pid_hash(ph, new_shift)) {
> + ph_cur_idx = idx;
> +
> + ret = dyn_data_replace(&dyn_pidhash, dd_transfer_pids, ph);
> + BUG_ON(ret == ph);
> + BUG_ON(ret != &pid_hashes[idx ^ 1]);
> + /* XXX: kfree(ret->table) */
> + ret->shift = -1;
> + ret->table = NULL;
> + }
> + }
> + mutex_unlock(&dyn_pidhash.resize_mutex);
> +}
> +

Thanks.

2007-09-06 07:14:39

by Nick Piggin

Subject: Re: Fast path efficiency (Re: [rfc][patch] dynamic data structure switching)

On Thu, Sep 06, 2007 at 01:05:40AM +0200, Oleg Verych wrote:
> * Sun, 2 Sep 2007 20:36:19 +0200
> >
>
> I see that in many places all of the pre-checks are done in negative form,
> with an early return or jump out as the result. In that case, once the
> function has been called, which path is the likely() one?
>
> > +static void resize_pid_hash(void)
> > +{
> > + unsigned int old_shift, new_shift;
> > +
> > + if (system_state != SYSTEM_RUNNING)
> > + return;
> > +
> > + old_shift = cur_pid_hash->shift;
> > + new_shift = ilog2(nr_pids * 2 - 1);
> > + if (new_shift == old_shift)
> > + return;
> > +
> > + if (!mutex_trylock(&dyn_pidhash.resize_mutex))
> > + return;
>
> that one or this?
>
> ==
> if (system_state == SYSTEM_RUNNING) {
> old_shift = cur_pid_hash->shift;
> new_shift = ilog2(nr_pids * 2 - 1);
> if (new_shift != old_shift && mutex_trylock(&dyn_pidhash.resize_mutex)) {
> ==
> > + old_shift = cur_pid_hash->shift;
> > + new_shift = ilog2(nr_pids * 2 - 1);
>
> /* hope this repetition is needed by design */
>
> ...
>
> > + mutex_unlock(&dyn_pidhash.resize_mutex);
> }
>
> What is more efficient in the general sense,
> as opposed to under a particular -Os/-O3/-O2/-O1/-O0 optimization level?

I'm not too sure, but I'd guess that most of the time the compiler will
be able to figure out they are the same.

resize_pid_hash() fortunately isn't a fastpath anyway -- it calls
dyn_data_replace which ends up calling synchronize_rcu() 3 times,
each of which is likely to take a long time!

2007-09-06 11:35:18

by Andi Kleen

Subject: Re: Fast path efficiency (Re: [rfc][patch] dynamic data structure switching)

Oleg Verych <[email protected]> writes:
>
> What is more efficient in the general sense,
> as opposed to under a particular -Os/-O3/-O2/-O1/-O0 optimization level?

In general it shouldn't matter at all on any reasonably
modern CPU (let's say less than 10 years old) unless you do
it tens of thousands of times in a loop. Also gcc has reasonable
default heuristics for this kind of stuff anyways
(e.g. a return of NULL or of a negative value is predicted
unlikely by default).

Cache misses are far more important to care about because
they cost hundreds of cycles.

-Andi

2007-09-10 11:56:14

by Peter Zijlstra

Subject: Re: [rfc][patch] dynamic data structure switching

On Sun, 2007-09-02 at 20:27 +0200, Nick Piggin wrote:

> Index: linux-2.6/lib/dyndata.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6/lib/dyndata.c
> @@ -0,0 +1,80 @@
> +#include <linux/dyndata.h>
> +#include <linux/mutex.h>
> +#include <linux/rcupdate.h>
> +#include <linux/seqlock.h>
> +
> +void dyn_data_init(struct dyn_data *dd, void *cur_data)
> +{
> + dd->cur = cur_data;
> + dd->old = NULL;
> + seqcount_init(&dd->resize_seq);
> + mutex_init(&dd->resize_mutex);
> +}
> +
> +/*
> + * Slowpath lookup. There is a resize in progress and we weren't able to
> + * find the item we wanted in dyn_data_lookup: run the full race-free
> + * lookup protocol.
> + */
> +static noinline void *dyn_data_lookup_slow(struct dyn_data *dd, dd_lookup_fn fn, void *arg)
> +{
> + void *ret;
> + void *cur, *old;
> + unsigned seq;
> +
> + cur = dd->cur;
> + old = dd->old;
> +
> + do {
> + seq = read_seqcount_begin(&dd->resize_seq);
> + ret = fn(cur, arg);
> + if (ret)
> + return ret;
> + if (!old)
> + return NULL;
> +
> + ret = fn(old, arg);
> + if (ret)
> + return ret;
> + } while (read_seqcount_retry(&dd->resize_seq, seq));

With the seqlock protecting the whole xfer, a lookup for a non-existing
item might spin here for a very long time.

Maybe we should create this sleeping seqlock we talked about and provide both
a sleeping and a non-sleeping version of this lookup function.

> + return NULL;
> +}
> +
> +void *dyn_data_lookup(struct dyn_data *dd, dd_lookup_fn fn, void *arg)
> +{
> + void *ret;
> +
> + rcu_read_lock();
> + if (likely(!dd->old))
> + ret = fn(dd->cur, arg);
> + else
> + ret = dyn_data_lookup_slow(dd, fn, arg);
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> +void *dyn_data_replace(struct dyn_data *dd, dd_transfer_fn fn, void *new)
> +{
> + int xfer_done;
> + void *old;
> +
> + BUG_ON(!mutex_is_locked(&dd->resize_mutex));

So it's up to the caller to take resize_mutex, but there is no mention of
this requirement. Why can't it be taken inside this function?

> + old = dd->cur;
> + BUG_ON(dd->old);
> + dd->old = old;
> + synchronize_rcu();
> + rcu_assign_pointer(dd->cur, new);
> + synchronize_rcu();
> + do {
> + write_seqcount_begin(&dd->resize_seq);
> + xfer_done = fn(old, new);
> + write_seqcount_end(&dd->resize_seq);
> + } while (!xfer_done);
> + dd->old = NULL;
> + synchronize_rcu();
> +
> +
> + return old;
> +}
> Index: linux-2.6/include/linux/dyndata.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6/include/linux/dyndata.h
> @@ -0,0 +1,94 @@
> +#ifndef __DYNDATA_H__
> +#define __DYNDATA_H__
> +

<snip>

> +
> +#include <linux/mutex.h>
> +#include <linux/seqlock.h>
> +
> +struct dyn_data {
> + void *cur;
> + void *old;
> + seqcount_t resize_seq;
> + struct mutex resize_mutex;
> +};
> +
> +typedef void *(dd_lookup_fn)(void *dstruct, void *arg);
> +typedef int (dd_transfer_fn)(void *old_ds, void *new_ds);
> +
> +void dyn_data_init(struct dyn_data *dd, void *cur_data);
> +void *dyn_data_lookup(struct dyn_data *dd, dd_lookup_fn fn, void *arg);
> +void *dyn_data_replace(struct dyn_data *dd, dd_transfer_fn fn, void *new);
> +
> +static inline void *dyn_data_current_dstruct(struct dyn_data *dd)
> +{
> + return dd->cur;
> +}
> +
> +#endif


2007-09-10 13:39:46

by Nick Piggin

Subject: Re: [rfc][patch] dynamic data structure switching

On Mon, Sep 10, 2007 at 01:55:52PM +0200, Peter Zijlstra wrote:
> On Sun, 2007-09-02 at 20:27 +0200, Nick Piggin wrote:
>
> > Index: linux-2.6/lib/dyndata.c
> > ===================================================================
> > --- /dev/null
> > +++ linux-2.6/lib/dyndata.c
> > @@ -0,0 +1,80 @@
> > +#include <linux/dyndata.h>
> > +#include <linux/mutex.h>
> > +#include <linux/rcupdate.h>
> > +#include <linux/seqlock.h>
> > +
> > +void dyn_data_init(struct dyn_data *dd, void *cur_data)
> > +{
> > + dd->cur = cur_data;
> > + dd->old = NULL;
> > + seqcount_init(&dd->resize_seq);
> > + mutex_init(&dd->resize_mutex);
> > +}
> > +
> > +/*
> > + * Slowpath lookup. There is a resize in progress and we weren't able to
> > + * find the item we wanted in dyn_data_lookup: run the full race-free
> > + * lookup protocol.
> > + */
> > +static noinline void *dyn_data_lookup_slow(struct dyn_data *dd, dd_lookup_fn fn, void *arg)
> > +{
> > + void *ret;
> > + void *cur, *old;
> > + unsigned seq;
> > +
> > + cur = dd->cur;
> > + old = dd->old;
> > +
> > + do {
> > + seq = read_seqcount_begin(&dd->resize_seq);
> > + ret = fn(cur, arg);
> > + if (ret)
> > + return ret;
> > + if (!old)
> > + return NULL;
> > +
> > + ret = fn(old, arg);
> > + if (ret)
> > + return ret;
> > + } while (read_seqcount_retry(&dd->resize_seq, seq));
>
> With the seqlock protecting the whole xfer, a lookup for a non-existing
> item might spin here for a very long time.
>
> Maybe we should create this sleeping seqlock we talked about and provide both
> a sleeping and a non-sleeping version of this lookup function.

Yeah, it isn't the best this way. One thing that could be done is to
fine-grain the write-side seqlock (eg. take it once per object moved,
or have multiple seqlocks eg. per bucket). It would need some careful
measurement to see what we can get away with.
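
For reference, a rough sketch of the "once per object moved" variant: if
dd_transfer_fn were redefined to move at most one item per call (still
returning non-zero once the old structure is empty -- a hypothetical API
change, not what is posted), the transfer loop in dyn_data_replace() could
hold the seqcount for just a single move at a time:

	do {
		write_seqcount_begin(&dd->resize_seq);
		xfer_done = fn(old, new);	/* moves at most one item */
		write_seqcount_end(&dd->resize_seq);
		cond_resched();
	} while (!xfer_done);

so an unlucky negative lookup retries across one short critical section
instead of spinning for the whole transfer.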


> > + return NULL;
> > +}
> > +
> > +void *dyn_data_lookup(struct dyn_data *dd, dd_lookup_fn fn, void *arg)
> > +{
> > + void *ret;
> > +
> > + rcu_read_lock();
> > + if (likely(!dd->old))
> > + ret = fn(dd->cur, arg);
> > + else
> > + ret = dyn_data_lookup_slow(dd, fn, arg);
> > + rcu_read_unlock();
> > +
> > + return ret;
> > +}
> > +
> > +void *dyn_data_replace(struct dyn_data *dd, dd_transfer_fn fn, void *new)
> > +{
> > + int xfer_done;
> > + void *old;
> > +
> > + BUG_ON(!mutex_is_locked(&dd->resize_mutex));
>
> So it's up to the caller to take resize_mutex, but there is no mention of
> this requirement. Why can't it be taken inside this function?

Yeah, I didn't want to document the API yet because it is too horrible :)
In the case of the dynamic pid hash, this mutex serialises resizers, and
so after you take the lock, you can check whether the hash still needs
resizing, for example (and can also do a trylock instead of a lock).


> > + old = dd->cur;
> > + BUG_ON(dd->old);
> > + dd->old = old;
> > + synchronize_rcu();
> > + rcu_assign_pointer(dd->cur, new);
> > + synchronize_rcu();
> > + do {
> > + write_seqcount_begin(&dd->resize_seq);
> > + xfer_done = fn(old, new);
> > + write_seqcount_end(&dd->resize_seq);
> > + } while (!xfer_done);
> > + dd->old = NULL;
> > + synchronize_rcu();
> > +
> > +
> > + return old;
> > +}

2007-09-10 16:55:52

by Oleg Nesterov

Subject: Re: [rfc][patch] dynamic data structure switching

Nick Piggin wrote:
>
> +void *dyn_data_replace(struct dyn_data *dd, dd_transfer_fn fn, void *new)
> +{
> + int xfer_done;
> + void *old;
> +
> + BUG_ON(!mutex_is_locked(&dd->resize_mutex));
> + old = dd->cur;
> + BUG_ON(dd->old);
> + dd->old = old;
> + synchronize_rcu();
> + rcu_assign_pointer(dd->cur, new);

I think this all is correct, but I have a somewhat offtopic question, hopefully
you can help.

Suppose that we have a global "pid_t NR = 0", and another CPU does

pid = alloc_pid();
wmb();
NR = pid->nr;

Suppose that this CPU sees dd->cur == new, and adds the new item to it.

Now, yet another CPU does:

nr = NR;
rmb();
BUG_ON(nr && !find_pid(nr));

dyn_data_replace() didn't do synchronize_rcu() yet. The question is: how is it
possible to "prove" that the BUG_ON() above can't happen? IOW, why must find_pid()
above also see dd->cur == new if it sees NR != 0?

Once again, I believe this is true, but I can't find a "good" explanation for
myself. To simplify the example above, consider:

A = B = X = 0;
P = Q = &A;

CPU_1           CPU_2           CPU_3

P = &B;         *P = 1;         if (X) {
                wmb();              rmb();
                X = 1;              BUG_ON(*P != 1 && *Q != 1);
                                }

So, it is not possible that CPU_2 sees P == &B, but CPU_3 sees P == &A in this
case, yes?

It looks "obvious" that rmb() guarantees that CPU_3 must see the new value if
any other CPU (CPU_2) already saw it "before", but I can't derive this from the
"all the LOAD operations specified before the barrier will appear to happen
before all the LOAD operations specified after the barrier" definition.

Thanks,

Oleg.

2007-09-10 20:48:59

by Nick Piggin

Subject: Re: [rfc][patch] dynamic data structure switching

On Mon, Sep 10, 2007 at 08:58:14PM +0400, Oleg Nesterov wrote:
> Nick Piggin wrote:
> >
> > +void *dyn_data_replace(struct dyn_data *dd, dd_transfer_fn fn, void *new)
> > +{
> > + int xfer_done;
> > + void *old;
> > +
> > + BUG_ON(!mutex_is_locked(&dd->resize_mutex));
> > + old = dd->cur;
> > + BUG_ON(dd->old);
> > + dd->old = old;
> > + synchronize_rcu();
> > + rcu_assign_pointer(dd->cur, new);
>
> I think this all is correct, but I have a somewhat offtopic question, hopefully
> you can help.
>
> Suppose that we have a global "pid_t NR = 0", and another CPU does
>
> pid = alloc_pid();
> wmb();
> NR = pid->nr;
>
> Suppose that this CPU sees dd->cur == new, and adds the new item to it.
>
> Now, yet another CPU does:
>
> nr = NR;
> rmb();
> BUG_ON(nr && !find_pid(nr));
>
> dyn_data_replace() didn't do synchronize_rcu() yet.

Hmm, it would have to have done synchronize_rcu(), otherwise the first CPU could
not see that dd->cur == new...? Or maybe you mean it hasn't done a second
synchronize_rcu()? I'll assume you mean that.


> The question is: how is it
> possible to "prove" that the BUG_ON() above can't happen? IOW, why must find_pid()
> above also see dd->cur == new if it sees NR != 0?

Hmm, that's a very good question. I was in the middle of starting to write
why I thought it would work, but after thinking about it more, I'm not sure
that it is correct.

I think we have only pairwise barrier semantics, and not causal semantics
(so the write to dd->cur from the 3rd CPU can be seen in any order by
the others, regardless of what barriers _they_ perform).

So you do have a problem. We'd need to do another synchronize_rcu here to
ensure that dd->cur gets propagated out to all CPUs before the first
insert happens. This shouldn't be too hard (simplest way is probably to use
a low-bit in the pointer).
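
Purely a guess at how the low-bit hint might work (none of this is in the
posted patch, and the names are made up): publish the new pointer with its
low bit set to mean "visible, but not yet safe to insert into", wait a
grace period so that every CPU must observe the new cur, then clear the
bit:

#define DD_SETTLING	1UL

static void dyn_data_publish(struct dyn_data *dd, void *new)
{
	/* the new structure becomes visible, but marked "do not insert yet" */
	rcu_assign_pointer(dd->cur, (void *)((unsigned long)new | DD_SETTLING));
	synchronize_rcu();
	/*
	 * Every CPU has since passed through a quiescent state, so none of
	 * them can still be using the old value of cur. Clear the bit:
	 * insertions may now go into the new structure.
	 */
	rcu_assign_pointer(dd->cur, new);
	synchronize_rcu();
}

Inserters would keep inserting into dd->old while the bit is set, and
lookups would simply mask the bit off before dereferencing the pointer.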


> Once again, I believe this is true, but I can't find a "good" explanation for
> myself. To simplify the example above, consider:
>
> A = B = X = 0;
> P = Q = &A;
>
> CPU_1           CPU_2           CPU_3
>
> P = &B;         *P = 1;         if (X) {
>                 wmb();              rmb();
>                 X = 1;              BUG_ON(*P != 1 && *Q != 1);
>                                 }
>
> So, it is not possible that CPU_2 sees P == &B, but CPU_3 sees P == &A in this
> case, yes?
>
> It looks "obvious" that rmb() guarantees that CPU_3 must see the new value if
> any other CPU (CPU_2) already saw it "before", but I can't derive this from the
> "all the LOAD operations specified before the barrier will appear to happen
> before all the LOAD operations specified after the barrier" definition.

I believe this can go out of order (according to the Linux memory model; I don't
know if any actual implementations will do this). The invalidations from CPU1
and CPU2 may reach CPU3 at different times, I think.

Good point. Thanks.