MIME-Version: 1.0
In-Reply-To: <20160428163525.495514517@linutronix.de>
References: <20160428161742.363543816@linutronix.de>
	<20160428163525.495514517@linutronix.de>
Date: Thu, 28 Apr 2016 11:32:47 -0700
Message-ID: <CA+55aFzMtEVc_5QS8EUvdKYfNMmJ9fpgcM4iEhdwaY1Oe=JYmw@mail.gmail.com>
Subject: Re: [patch 2/7] lib/hashmod: Add modulo based hash mechanism
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: LKML <linux-kernel@vger.kernel.org>, Peter Zijlstra <peterz@infradead.org>,
        Ingo Molnar <mingo@kernel.org>,
        Andrew Morton <akpm@linux-foundation.org>,
        Sebastian Andrzej Siewior <bigeasy@linutronix.de>,
        Darren Hart <darren@dvhart.com>,
        Michael Kerrisk <mtk.manpages@googlemail.com>,
        Davidlohr Bueso <dave@stgolabs.net>, Chris Mason <clm@fb.com>,
        "Carlos O'Donell" <carlos@redhat.com>,
        Torvald Riegel <triegel@redhat.com>,
        Eric Dumazet <edumazet@google.com>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 1379
Lines: 32

On Thu, Apr 28, 2016 at 9:42 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> hash_long/hash_ptr() provide really bad hashing for small hash sizes and for
> cases where the lower 12 bits (Page size aligned) of the hash value are 0.

Hmm.

hash_long/ptr really shouldn't care about the low bits being zero at
all, because it should mix in all the bits (using a prime multiplier
and taking the high bits).

That said, numbers rule, so clearly we need to do something. It does
strike me that we would be better off just trying to improve
hash_long().

In particular, there are people and projects that have worked on
nothing but hashing. I'm not sure we should add a new hash algorithm
even if the whole "modulo prime" sounds obviously fine in theory. For
example, your 64-bit code has obvious problems if there are common
patterns in the low and the high 32 bits.. Not a problem for something
like hash_ptr(), but it can certainly be a problem for other cases.

It would be a really good idea to have some real hard numbers on the
hashing in general, but _particularly_ so if/when we start adding new
ones. Have you tested the modulus version with SMhasher, for example?

For example, there's Thomas Wang's hash function which should cascade
all the bits.

I'd really hate to add *another* ad-hoc hash when the previous ad-hoc
hash has been shown to be bad.

               Linus