This is a long-discussed space optimization for the VM system, with
what is expected to be a minor time tradeoff.
Typically very few waitqueues are in use at a given time. Maintaining
a field within per-page data structures is expensive, and uses 16 bytes
per page, or approximately 3MB of space on my 768MB home machine.
An alternative is to maintain a hash table of waitqueues so that there
is only one waitqueue for every N pages. The bucket discipline is to
put all waiters on the same queue and wake all, where the waiters go
back to sleep when woken for the wrong page. The hash table should be
sized so that the load is low enough to make collisions rare, and this
does not require much space due to the number of waitqueues typically
in use. The patch against rmap10b, tested on UP i386, follows.
mjc, riel, please apply.
Cheers,
Bill
diff --minimal -Nru a/include/linux/mm.h b/include/linux/mm.h
--- a/include/linux/mm.h Fri Jan 4 09:14:28 2002
+++ b/include/linux/mm.h Fri Jan 4 09:14:28 2002
@@ -161,7 +161,6 @@
unsigned char age; /* Page aging counter. */
unsigned char zone; /* Memory zone the page belongs to. */
struct pte_chain * pte_chain; /* Reverse pte mapping pointer. */
- wait_queue_head_t wait; /* Page locked? Stand in line... */
struct page **pprev_hash; /* Complement to *next_hash. */
struct buffer_head * buffers; /* Buffer maps us to a disk block. */
diff --minimal -Nru a/include/linux/mm_inline.h b/include/linux/mm_inline.h
--- a/include/linux/mm_inline.h Fri Jan 4 09:14:28 2002
+++ b/include/linux/mm_inline.h Fri Jan 4 09:14:28 2002
@@ -4,6 +4,39 @@
#include <linux/mm.h>
/*
+ * Knuth recommends primes in approximately golden ratio to the maximum
+ * integer representable by a machine word for multiplicative hashing.
+ * Chuck Lever verified the effectiveness of this technique for several
+ * hash tables in his paper documenting the benchmark results:
+ * http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf
+ */
+#define GOLDEN_RATIO_PRIME 2654435761UL
+
+/*
+ * In order to wait for pages to become available there must be
+ * waitqueues associated with pages. By using a hash table of
+ * waitqueues where the bucket discipline is to maintain all
+ * waiters on the same queue and wake all when any of the pages
+ * become available, and for the woken contexts to check to be
+ * sure the appropriate page became available, this saves space
+ * at a cost of "thundering herd" phenomena during rare hash
+ * collisions. This cost is great enough that effective hashing
+ * is necessary to maintain performance.
+ */
+static inline wait_queue_head_t *page_waitqueue(struct page *page)
+{
+ const zone_t *zone = page_zone(page);
+ wait_queue_head_t *wait = zone->wait_table;
+ unsigned long hash = (unsigned long)page;
+
+ hash *= GOLDEN_RATIO_PRIME;
+ hash >>= BITS_PER_LONG - zone->wait_table_bits;
+ hash &= zone->wait_table_size - 1;
+
+ return &wait[hash];
+}
+
+/*
* These inline functions tend to need bits and pieces of all the
* other VM include files, meaning they cannot be defined inside
* one of the other VM include files.
diff --minimal -Nru a/include/linux/mmzone.h b/include/linux/mmzone.h
--- a/include/linux/mmzone.h Fri Jan 4 09:14:28 2002
+++ b/include/linux/mmzone.h Fri Jan 4 09:14:28 2002
@@ -7,6 +7,7 @@
#include <linux/config.h>
#include <linux/spinlock.h>
#include <linux/list.h>
+#include <linux/wait.h>
/*
* Free memory management - zoned buddy allocator.
@@ -52,6 +53,35 @@
struct list_head inactive_dirty_list;
struct list_head inactive_clean_list;
free_area_t free_area[MAX_ORDER];
+
+ /*
+ * wait_table -- the array holding the hash table
+ * wait_table_size -- the size of the hash table array
+ * wait_table_bits -- wait_table_size == (1 << wait_table_bits)
+ * so it's the integer logarithm of the size
+ *
+ * The purpose of all these is to keep track of the people
+ * waiting for a page to become available and make them
+ * runnable again when possible. The trouble is that this
+ * consumes a lot of space, especially when so few things
+ * wait on pages at a given time. So instead of using
+ * per-page waitqueues, we use a waitqueue hash table.
+ *
+ * The bucket discipline is to sleep on the same queue when
+ * colliding and wake all in that wait queue when removing.
+ * When something wakes, it must check to be sure its page is
+ * truly available, a la thundering herd. The cost of a
+ * collision is great, but given the expected load of the
+ * table, they should be so rare as to be outweighed by the
+ * benefits from the saved space.
+ *
+ *__wait_on_page() and unlock_page() in mm/filemap.c, are the
+ * primary users of these fields, and in mm/page_alloc.c
+ * free_area_init_core() performs the initialization of them.
+ */
+ wait_queue_head_t *wait_table;
+ unsigned long wait_table_size;
+ unsigned long wait_table_bits;
/*
* Discontig memory support fields.
diff --minimal -Nru a/mm/filemap.c b/mm/filemap.c
--- a/mm/filemap.c Fri Jan 4 09:14:28 2002
+++ b/mm/filemap.c Fri Jan 4 09:14:28 2002
@@ -782,13 +782,28 @@
* This must be called with the caller "holding" the page,
* ie with increased "page->count" so that the page won't
* go away during the wait..
+ *
+ * The waiting strategy is to get on a waitqueue determined
+ * by hashing. Waiters will then collide, and the newly woken
+ * task must then determine whether it was woken for the page
+ * it really wanted, and go back to sleep on the waitqueue if
+ * that wasn't it. With the waitqueue semantics, it never leaves
+ * the waitqueue unless it calls, so the loop moves forward one
+ * iteration every time there is
+ * (1) a collision
+ * and
+ * (2) one of the colliding pages is woken
+ *
+ * This is the thundering herd problem, but it is expected to
+ * be very rare due to the few pages that are actually being
+ * waited on at any given time and the quality of the hash function.
*/
void ___wait_on_page(struct page *page)
{
struct task_struct *tsk = current;
DECLARE_WAITQUEUE(wait, tsk);
- add_wait_queue(&page->wait, &wait);
+ add_wait_queue(page_waitqueue(page), &wait);
do {
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
if (!PageLocked(page))
@@ -796,10 +811,17 @@
sync_page(page);
schedule();
} while (PageLocked(page));
- tsk->state = TASK_RUNNING;
- remove_wait_queue(&page->wait, &wait);
+ __set_task_state(tsk, TASK_RUNNING);
+ remove_wait_queue(page_waitqueue(page), &wait);
}
+/*
+ * unlock_page() is the other half of the story just above
+ * __wait_on_page(). Here a couple of quick checks are done
+ * and a couple of flags are set on the page, and then all
+ * of the waiters for all of the pages in the appropriate
+ * wait queue are woken.
+ */
void unlock_page(struct page *page)
{
clear_bit(PG_launder, &(page)->flags);
@@ -807,8 +829,18 @@
if (!test_and_clear_bit(PG_locked, &(page)->flags))
BUG();
smp_mb__after_clear_bit();
- if (waitqueue_active(&(page)->wait))
- wake_up(&(page)->wait);
+
+ if(!page_waitqueue(page))
+ BUG();
+
+ /*
+ * Although the default semantics of wake_up() are
+ * to wake all, here the specific function is used
+ * to make it even more explicit that a number of
+ * pages are being waited on here.
+ */
+ if(waitqueue_active(page_waitqueue(page)))
+ wake_up_all(page_waitqueue(page));
}
/*
@@ -820,7 +852,7 @@
struct task_struct *tsk = current;
DECLARE_WAITQUEUE(wait, tsk);
- add_wait_queue_exclusive(&page->wait, &wait);
+ add_wait_queue_exclusive(page_waitqueue(page), &wait);
for (;;) {
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
if (PageLocked(page)) {
@@ -830,8 +862,8 @@
if (!TryLockPage(page))
break;
}
- tsk->state = TASK_RUNNING;
- remove_wait_queue(&page->wait, &wait);
+ __set_task_state(tsk, TASK_RUNNING);
+ remove_wait_queue(page_waitqueue(page), &wait);
}
diff --minimal -Nru a/mm/page_alloc.c b/mm/page_alloc.c
--- a/mm/page_alloc.c Fri Jan 4 09:14:28 2002
+++ b/mm/page_alloc.c Fri Jan 4 09:14:28 2002
@@ -809,6 +809,44 @@
}
}
+/*
+ * Helper functions to size the waitqueue hash table.
+ * Essentially these want to choose hash table sizes sufficiently
+ * large so that collisions trying to wait on pages are rare.
+ * But in fact, the number of active page waitqueues on typical
+ * systems is ridiculously low, less than 200. So this is even
+ * conservative, even though it seems large.
+ *
+ * The constant PAGES_PER_WAITQUEUE specifies the ratio of pages to
+ * waitqueues, i.e. the size of the waitq table given the number of pages.
+ */
+
+#define PAGES_PER_WAITQUEUE 256
+
+static inline unsigned long wait_table_size(unsigned long pages)
+{
+ unsigned long size = 1;
+
+ pages /= PAGES_PER_WAITQUEUE;
+
+ while(size < pages)
+ size <<= 1;
+
+ return size;
+}
+
+
+/*
+ * This is an integer logarithm so that shifts can be used later
+ * to extract the more random high bits from the multiplicative
+ * hash function before the remainder is taken.
+ */
+static inline unsigned long wait_table_bits(unsigned long size)
+{
+ return ffz(~size);
+}
+
+
#define LONG_ALIGN(x) (((x)+(sizeof(long))-1)&~((sizeof(long))-1))
/*
@@ -883,9 +921,28 @@
INIT_LIST_HEAD(&zone->active_list);
INIT_LIST_HEAD(&zone->inactive_dirty_list);
INIT_LIST_HEAD(&zone->inactive_clean_list);
+
if (!size)
continue;
+ /*
+ * The per-page waitqueue mechanism requires hash tables
+ * whose buckets are waitqueues. These hash tables are
+ * per-zone, and dynamically sized according to the size
+ * of the zone so as to maintain a good ratio of waiters
+ * to hash table buckets. Right here we just allocate
+ * and initialize them for later use (in filemap.c)
+ */
+ zone->wait_table_size = wait_table_size(size);
+ zone->wait_table_bits = wait_table_bits(zone->wait_table_size);
+ zone->wait_table = (wait_queue_head_t *)
+ alloc_bootmem_node(pgdat,
+ zone->wait_table_size
+ * sizeof(wait_queue_head_t));
+
+ for(i = 0; i < zone->wait_table_size; ++i)
+ init_waitqueue_head(zone->wait_table + i);
+
pgdat->nr_zones = j+1;
mask = (realsize / zone_balance_ratio[j]);
@@ -926,7 +983,6 @@
set_page_zone(page, pgdat->node_id * MAX_NR_ZONES + j);
init_page_count(page);
__SetPageReserved(page);
- init_waitqueue_head(&page->wait);
memlist_init(&page->list);
if (j != ZONE_HIGHMEM)
set_page_address(page, __va(zone_start_paddr));
Hi
The following fixes a compile problem with agpgart:
-------------
--- linux/drivers/char/agp/agpgart_be.c.orig Fri Jan 4 15:01:50 2002
+++ linux/drivers/char/agp/agpgart_be.c Fri Jan 4 15:22:46 2002
@@ -30,6 +30,7 @@
#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/mm.h>
+#include <linux/mm_inline.h>
#include <linux/string.h>
#include <linux/errno.h>
#include <linux/slab.h>
@@ -830,7 +831,7 @@
page = virt_to_page(pt);
atomic_dec(&page->count);
clear_bit(PG_locked, &page->flags);
- wake_up(&page->wait);
+ wake_up(page_waitqueue(page));
free_page((unsigned long) pt);
atomic_dec(&agp_bridge.current_memory_agp);
}
---------------
As a quick test of a new kernel I copy dbench to a tmpfs fs and run "time dbench 32".
This typically takes over 512M swap (with rmap 10c and mainline). With hashed
waitqueues it does not reach 512M. The machine has 512M of memory. I observe
about the same runtimes and datarate with all three kernels.
Ed Tomlinson
>>>>> "WLI" == William Lee Irwin <[email protected]> writes:
WLI> This is a long-discussed space optimization for the VM system, with
WLI> Cheers,
Cheers ;)
I couple of places that need fixing:
./drivers/char/agp/agpgart_be.c:833: wake_up(&page->wait);
./drivers/char/agp/agpgart_be.c:2760: wake_up(&page->wait);
./drivers/char/drm/i810_dma.c:302: wake_up(&virt_to_page(page)->wait);
Regards,
-velco
William Lee Irwin III wrote:
>
> This is a long-discussed space optimization for the VM system, with
> what is expected to be a minor time tradeoff.
Nice code.
> ...
> + /*
> + * Although the default semantics of wake_up() are
> + * to wake all, here the specific function is used
> + * to make it even more explicit that a number of
> + * pages are being waited on here.
> + */
> + if(waitqueue_active(page_waitqueue(page)))
> + wake_up_all(page_waitqueue(page));
Does the compiler CSE these two calls to page_waitqueue()?
All versions? I'd be inclined to do CSE-by-hand here.
Also, why wake_up_all()? That will wake all tasks which are sleeping
in __lock_page(), even though they've asked for exclusive wakeup
semantics. Will a bare wake_up() here not suffice?
-
Andrew Morton wrote:
>
> > + /*
> > + * Although the default semantics of wake_up() are
> > + * to wake all, here the specific function is used
> > + * to make it even more explicit that a number of
> > + * pages are being waited on here.
> > + */
> > + if(waitqueue_active(page_waitqueue(page)))
> > + wake_up_all(page_waitqueue(page));
> ...
>
> Also, why wake_up_all()? That will wake all tasks which are sleeping
> in __lock_page(), even though they've asked for exclusive wakeup
> semantics. Will a bare wake_up() here not suffice?
>
Doh. It helps to read the comment. Suggest that __lock_page()
be changed to use add_wait_queue().
-
William Lee Irwin III wrote:
+ /*
+ * Although the default semantics of wake_up() are
+ * to wake all, here the specific function is used
+ * to make it even more explicit that a number of
+ * pages are being waited on here.
+ */
+ if(waitqueue_active(page_waitqueue(page)))
+ wake_up_all(page_waitqueue(page));
On Fri, Jan 04, 2002 at 03:07:20PM -0800, Andrew Morton wrote:
> Does the compiler CSE these two calls to page_waitqueue()?
> All versions? I'd be inclined to do CSE-by-hand here.
I'm not sure if CSE is run before or after inlining, but it
is probably not wise to leave it up to the compiler.
On Fri, Jan 04, 2002 at 03:07:20PM -0800, Andrew Morton wrote:
> Also, why wake_up_all()? That will wake all tasks which are sleeping
> in __lock_page(), even though they've asked for exclusive wakeup
> semantics. Will a bare wake_up() here not suffice?
A couple of other private responses pointed this out, and also had
suggestions for several ways to avoid the thundering herds. I am not
sure it's possible to honor exclusive wakeup requests here without
creating problems or otherwise having to propagate semantic changes
further up the call chain. I think that comment and one of the others
needs to be expanded to make the exclusive wakeup issue clearer.
Also there is a bug in the hash function (pointed out by an anonymous
private response):
On Fri, Jan 04, 2002 at 07:16:33PM -0000, an anonymous person told me:
> One *bug* in your code is that if toy have 64-bit longs, your
> GOLDEN_RATIO_PRIME isn't large enough and page_waitqueue will
> always compute hash = 0.
> The closest primes to phi * 2^64 = 11400714819323198485.95... are:
> ...
> 11400714819323198333
> 11400714819323198389
> 11400714819323198393
> *
> 11400714819323198549
> 11400714819323198581
> 11400714819323198647
> ...
which is easy enough to repair with a conditional definition of
GOLDEN_RATIO_PRIME.
The same person also mentioned that less work could be done in
the hash function by storing the shift amount directly in the zone,
and also pointed out that the masking operation is unnecessary as the
shift already annihilates the high-order bits.
Cheers,
Bill
On January 4, 2002 06:40 pm, William Lee Irwin III wrote:
> + unsigned long hash = (unsigned long)page;
Just to be anal here, 'long' doesn't add anything useful to this
declaration. In fact, u32 is what you want since you've chosen your
multiplier with a 32 bit register in mind - on 64bit arch I think you'll get
distinctly non-random results as it stands.
> + hash *= GOLDEN_RATIO_PRIME;
> + hash >>= BITS_PER_LONG - zone->wait_table_bits;
> + hash &= zone->wait_table_size - 1;
Nice hash! For arches with expensive multiplies you might want to look
for a near-golden ratio multiplier that has a simple contruction in terms of
1's & 0's so it can be computed with 2 or 3 shift-adds, dumping the
efficiency problem on the compiler. You don't have to restrict your search
to the neighbourhood of 32 bits, you can go a few bits less than that and
substract from a smaller value than BITS_PER_LONG (actually, you should just
write '32' there, since that's what you really have in mind).
--
Daniel
On January 4, 2002 06:40 pm, William Lee Irwin III wrote:
+ unsigned long hash = (unsigned long)page;
On Sat, Jan 05, 2002 at 01:37:40AM +0100, Daniel Phillips wrote:
> Just to be anal here, 'long' doesn't add anything useful to this
> declaration. In fact, u32 is what you want since you've chosen your
> multiplier with a 32 bit register in mind - on 64bit arch I think you'll get
> distinctly non-random results as it stands.
On January 4, 2002 06:40 pm, William Lee Irwin III wrote:
+ hash *= GOLDEN_RATIO_PRIME;
+ hash >>= BITS_PER_LONG - zone->wait_table_bits;
+ hash &= zone->wait_table_size - 1;
On Sat, Jan 05, 2002 at 01:37:40AM +0100, Daniel Phillips wrote:
> Nice hash! For arches with expensive multiplies you might want to look
> for a near-golden ratio multiplier that has a simple contruction in terms of
> 1's & 0's so it can be computed with 2 or 3 shift-adds, dumping the
> efficiency problem on the compiler. You don't have to restrict your search
> to the neighbourhood of 32 bits, you can go a few bits less than that and
> substract from a smaller value than BITS_PER_LONG (actually, you should just
> write '32' there, since that's what you really have in mind).
For 64-bit machines an entirely different multiplier should be used,
one with p/2^64 approximately phi. And I'd rather see something like:
hash = (unsigned long)pages;
hash *= GOLDEN_RATIO_PRIME;
hash >>= zone->wait_table_shift;
with no masking, since the above shift should be initialized so as
to annihilate all the high-order bits above where the table's page
size should be. On the other hand, shifting all the way may not be
truly optimal, and so it may be better to mask with a smaller shift.
2 or 3 shift/adds is really not possible, the population counts of the
primes in those ranges tends to be high, much to my chagrin. I actually
tried unrolling it by hand a few times, and it was slower than
multiplication on i386 (and uncomfortably lengthy). So Fibonacci
hashing may well not be the best strategy for architectures without
hardware integer multiplies.
I believe to address architectures where multiplication is prohibitively
expensive I should do some reading to determine a set of theoretically
sound candidates for non-multiplicative hash functions and benchmark them.
Knuth has some general rules about design but I would rather merely test
some already verified by someone else and use the one that benches best
than duplicate the various historical efforts to find good hash functions.
Cheers,
Bill
On January 5, 2002 02:39 am, William Lee Irwin III wrote:
> 2 or 3 shift/adds is really not possible, the population counts of the
> primes in those ranges tends to be high, much to my chagrin.
It doesn't really have to be a prime, being relatively prime is also
good, i.e., not too many or too small factors. Surely there's a multiplier
in the right range with just two prime factors that can be computed with 3
shift-adds.
> I actually
> tried unrolling it by hand a few times, and it was slower than
> multiplication on i386 (and uncomfortably lengthy).
Right, it's not worth it unless you can get it down to a handful of
shift-adds. How does 2**17 - 1 (Mersenne prime #6) with right-shift by
(16 - bits) work?
> I believe to address architectures where multiplication is prohibitively
> expensive I should do some reading to determine a set of theoretically
> sound candidates for non-multiplicative hash functions and benchmark them.
> Knuth has some general rules about design but I would rather merely test
> some already verified by someone else and use the one that benches best
> than duplicate the various historical efforts to find good hash functions.
It would be nice if you could just look up good ones in a cookbook, but you
can't, that cookbook doesn't exist.
--
Daniel
On January 5, 2002 02:39 am, William Lee Irwin III wrote:
>> 2 or 3 shift/adds is really not possible, the population counts of the
>> primes in those ranges tends to be high, much to my chagrin.
On Sat, Jan 05, 2002 at 03:44:06AM +0100, Daniel Phillips wrote:
> It doesn't really have to be a prime, being relatively prime is also
> good, i.e., not too many or too small factors. Surely there's a multiplier
> in the right range with just two prime factors that can be computed with 3
> shift-adds.
Trying to keep the factors even will probably be difficult, if you were
going to sieve for it you'd probably want to try sieving with various
polynomials something like the quadratic sieving factorization methods,
in order to start from a list of candidates with sparse bit patterns.
Or one can just iterate and filter by population counts...
On Sat, Jan 05, 2002 at 03:44:06AM +0100, Daniel Phillips wrote:
> Right, it's not worth it unless you can get it down to a handful of
> shift-adds. How does 2**17 - 1 (Mersenne prime #6) with right-shift by
> (16 - bits) work?
I haven't started benchmarking different hash functions yet. Also
interesting would be Fermat prime #4. The ratio phi falls in the
range 4/7 < theta < 2/3, so perhaps to be sparse a slight overestimate
like a prime in the range 0xA0000000 < p < 0xA7FFFFFF might be good.
I did a little sieving for numbers in those ranges coprime to the small
prime factors up to 29, and I found the natural numbers in this range
with a population count strictly less than 5 are:
$ factor `./prime`
2483027969: 2483027969
2684354593: 43 61 1023391
2684355073: 101 139 367 521
2684362753: 2684362753
2684485633: 2684485633
2717908993: 2717908993
of these, 2684362753/2^32 differs by a mere 0.69679% from phi, and
so appears to be the most promising. It has a population count of
precisely 4. Among the numbers >= 0xA0000000, there was a consistent
pattern of the hexadecimal digits, which was interesting. It is
unfortunate, though, that among those primes there are none with
population counts less than 4.
The continued fraction expansion of 2684362753/2^32 appears to be:
0, 1, 1, 1, 2, 8190, 1, 1, 1, 2, ... which has the disturbing
sixth term, where on the other hand, theta is better for the seemingly
further away ones:
2483027969/2^32: 0, 1, 1, 2, 1, 2, 3, 1048575, 1, 2
2684485633/2^32: 0, 1, 1, 1, 2, 511, 1, 1, 1, 1
2717908993/2^32: 0, 1, 1, 1, 2, 1, 1, 1, 1, 1
So in the continued fraction representation, 2717908993/2^32 is the
closest to phi (though I am concerned about the term 262143 further out),
and so 2717908993 is probably the best bet. I'll try these out and see
if there is a significant difference in computational expense or key
distribution.
Cheers,
Bill
On Fri, 4 Jan 2002, William Lee Irwin III wrote:
> This is a long-discussed space optimization for the VM system, with
> what is expected to be a minor time tradeoff.
Actually, I don't expect the mentioned time tradeoff to be
anywhere near measurable. When we wait on a page we go through
the scheduler, context switches, etc...
I'll definately apply this patch and am looking forward to the
results from the ODSL test you have queued.
regards,
Rik
--
Shortwave goes a long way: irc.starchat.net #swl
http://www.surriel.com/ http://distro.conectiva.com/
On January 5, 2002 02:39 am, William Lee Irwin III wrote:
>>> 2 or 3 shift/adds is really not possible, the population counts of the
>>> primes in those ranges tends to be high, much to my chagrin.
On Sat, Jan 05, 2002 at 03:44:06AM +0100, Daniel Phillips wrote:
>> It doesn't really have to be a prime, being relatively prime is also
>> good, i.e., not too many or too small factors. Surely there's a multiplier
>> in the right range with just two prime factors that can be computed with 3
>> shift-adds.
The (theoretically) best 64-bit prime with 5 high bits I found is:
11673330234145374209 == 0xa200000000100001
which has continued fraction of p/2^64
= 0,1,1,1,2,1,1,1,1,1,1,1073740799,2,1,1,1,1,6,1,1,5,1023,1,4,1,1,3,3
and the (theoretically) best 64-bit prime with 4 high bits I found is:
11529215046068994049 == 0xa000000000080001
which has continued fraction of p/2^64
= 0,1,1,1,2,549754765313,1,1,1,1,1,4095,2,1,1,1,1,2,1,2
(the continued fractions terminate after the points given here)
Which of the two would be better should depend on whether the penalty
against the distribution for the sixth term of the 4-bit prime is worse
than the computational expense of the extra shift/add for the 5-bit prime.
I need to start benching this stuff.
Cheers,
Bill
On Tue, Jan 08, 2002 at 10:20:37AM -0800, William Lee Irwin III wrote:
> The (theoretically) best 64-bit prime with 5 high bits I found is:
[numbers]
> and the (theoretically) best 64-bit prime with 4 high bits I found is:
[numbers]
Sorry, I forgot to credit my helpers in the last post:
The sieving technique to find these things was devised by
Christophe Rhodes <[email protected]> and [email protected]
Cheers,
Bill
William> On Tue, Jan 08, 2002 at 10:20:37AM -0800, William Lee Irwin III wrote:
>> The (theoretically) best 64-bit prime with 5 high bits I found is:
William> [numbers]
>> and the (theoretically) best 64-bit prime with 4 high bits I found is:
William> [numbers]
William> Sorry, I forgot to credit my helpers in the last post:
William> The sieving technique to find these things was devised by
William> Christophe Rhodes <[email protected]> and [email protected]
On Tue, Jan 08, 2002 at 10:40:33AM -0800, Roland Dreier wrote:
> Just out of curiousity, why do you need a "sieving technique" to find
> these primes? There are only 63 choose 4 (which is 595665) 64 bit
> numbers with only 5 bits set, of which probably no more than 15000 are
> prime, so it seems you could just test all of them. What am I missing?
The sieving technique they devised uses that. In fact, they search only
2*choose(59,2), as bits 63, 0, and one of either 60 or 61 must be set,
since it must be odd to be prime and bits 63 -> 60 are determined by
(up to the choice of either 60 or 61) by 4/7 <= p/2^64 <= 2/3.
Cheers,
Bill
On Tue, 8 Jan 2002, William Lee Irwin III wrote:
> On January 5, 2002 02:39 am, William Lee Irwin III wrote:
> >>> 2 or 3 shift/adds is really not possible, the population counts of the
> >>> primes in those ranges tends to be high, much to my chagrin.
>
> On Sat, Jan 05, 2002 at 03:44:06AM +0100, Daniel Phillips wrote:
> >> It doesn't really have to be a prime, being relatively prime is also
> >> good, i.e., not too many or too small factors. Surely there's a multiplier
> >> in the right range with just two prime factors that can be computed with 3
> >> shift-adds.
>
> The (theoretically) best 64-bit prime with 5 high bits I found is:
>
> 11673330234145374209 == 0xa200000000100001
> which has continued fraction of p/2^64
> = 0,1,1,1,2,1,1,1,1,1,1,1073740799,2,1,1,1,1,6,1,1,5,1023,1,4,1,1,3,3
>
> and the (theoretically) best 64-bit prime with 4 high bits I found is:
> 11529215046068994049 == 0xa000000000080001
> which has continued fraction of p/2^64
> = 0,1,1,1,2,549754765313,1,1,1,1,1,4095,2,1,1,1,1,2,1,2
>
> (the continued fractions terminate after the points given here)
>
> Which of the two would be better should depend on whether the penalty
> against the distribution for the sixth term of the 4-bit prime is worse
> than the computational expense of the extra shift/add for the 5-bit prime.
>
> I need to start benching this stuff.
Why is all this sophistication needed for hashing pages to wait queues?
I understand that you should avoid a stupid hash (such as one where all
pages end up on the same wait queue), and I understand why a cache needs
a well-chosen hash, and I understand why shift is preferred to multiply;
but I don't get why so much discussion of the precise hash for choosing
the wait queue of a page: aren't the waits rare, and the pages mostly
well-distributed anyway?
Hugh
On Tue, 8 Jan 2002, William Lee Irwin III wrote:
>> I need to start benching this stuff.
On Tue, Jan 08, 2002 at 07:20:27PM +0000, Hugh Dickins wrote:
> Why is all this sophistication needed for hashing pages to wait queues?
> I understand that you should avoid a stupid hash (such as one where all
> pages end up on the same wait queue), and I understand why a cache needs
> a well-chosen hash, and I understand why shift is preferred to multiply;
> but I don't get why so much discussion of the precise hash for choosing
> the wait queue of a page: aren't the waits rare, and the pages mostly
> well-distributed anyway?
All this "sophistication" boils down to a single number, perhaps a single
#define. I'd at least like to put some thought into it, at the very least
as due diligence. And also I want to be able to answer the question of
"Where did the number come from?"
It doesn't really require that much effort, either. The non-bitsparse
golden ratio prime was just looked up in Chuck Lever's paper, and the
criteria I'm using to determine potentially useful bitsparse factors
(aside from sparsity itself) are largely from Knuth, who (paraphrasing)
says the important characteristic is the first several terms in the
continued fraction expansion of p/w, where w is the wordsize.
And the sieving "algorithm" is just me asking a couple of people how
they'd do it, and the sieve took well under a minute to run, and maybe
5 or 10 minutes to write.
And if it doesn't matter to you, please remember anyway that when I
wrote my hash functions I did put some thought into it.
Thanks,
Bill
William Lee Irwin III wrote:
> I believe to address architectures where multiplication is prohibitively
> expensive I should do some reading to determine a set of theoretically
> sound candidates for non-multiplicative hash functions and benchmark them.
> Knuth has some general rules about design but I would rather merely test
> some already verified by someone else and use the one that benches best
> than duplicate the various historical efforts to find good hash functions.
Looking up "good hash function" on Google leads to these notable pages:
http://burtleburtle.net/bob/hash/doobs.html
A Hash Function For Hash Table Lookup - Robert Jenkins
http://burtleburtle.net/bob/hash/
Hash Functions and Block Ciphers - Robert Jenkins
http://www.concentric.net/~Ttwang/tech/inthash.htm
Integer Hash Function - Thomas Wang
The last one is interesting because it mentions the golden prime
multiplier function, and suggests good non-multipler functions instead.
(Justification: the multiplier function doesn't distribute bits evenly).
enjoy,
-- Jamie
On Wed, Jan 16, 2002 at 02:21:41PM +0000, Jamie Lokier wrote:
> Looking up "good hash function" on Google leads to these notable pages:
[various URL's]
> The last one is interesting because it mentions the golden prime
> multiplier function, and suggests good non-multipler functions instead.
> (Justification: the multiplier function doesn't distribute bits evenly).
Excellent! I can always use more of these to test.
It seems odd that they don't like Fibonacci hashing, it appears to pass
various chi^2 tests on bucket distribution. And operator-sparse Fibonacci
hashing primes appear to pass it as well, at least once 10 terms of the
continued fraction match (operator-sparse Fibonacci hashing primes means
that the multiplication can be done with shifts and adds or subtracts).
Regardless, various arches want non-multiplicative hash functions and
they'll be getting them. These hash functions will certainly prove
useful in getting a broader base to test against. I don't care to have
a "pet" hash function, only one that is good as possible.
Thanks,
Bill
William Lee Irwin III wrote:
> Regardless, various arches want non-multiplicative hash functions and
> they'll be getting them. These hash functions will certainly prove
> useful in getting a broader base to test against. I don't care to have
> a "pet" hash function, only one that is good as possible.
Take a look at these too. Not much information about how good they are,
but they do provide yet more examples to test, if you're testing.
http://www.physik.tu-muenchen.de/lehrstuehle/T32/matpack/html/LibDoc/Strings/strings.html
http://www.physik.tu-muenchen.de/lehrstuehle/T32/matpack/source/Strings/
-- Jamie