Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id ; Wed, 21 Feb 2001 19:00:01 -0500 Received: (majordomo@vger.kernel.org) by vger.kernel.org id ; Wed, 21 Feb 2001 18:59:51 -0500 Received: from neon-gw.transmeta.com ([209.10.217.66]:26893 "EHLO neon-gw.transmeta.com") by vger.kernel.org with ESMTP id ; Wed, 21 Feb 2001 18:59:30 -0500 To: linux-kernel@vger.kernel.org From: torvalds@transmeta.com (Linus Torvalds) Subject: Re: [rfc] Near-constant time directory index for Ext2 Date: 21 Feb 2001 15:59:00 -0800 Organization: Transmeta Corporation Message-ID: <971ko4$1e7$1@penguin.transmeta.com> In-Reply-To: <971i36$180$1@penguin.transmeta.com> Sender: linux-kernel-owner@vger.kernel.org X-Mailing-List: linux-kernel@vger.kernel.org In article , Davide Libenzi wrote: > >Yep, 4 is not good as a shifting factor. Prime number are the better choice for >this stuff. Oh, absolutely. It looks like the hash function was done rather early on in the dcache lifetime (one of the first things), back when nobody cared about whether it was really good or not because there were many much more complicated questions like "how the h*ll will this all ever work" ;) And at no point did anybody ever go back and verify whether the hash function made much sense or not. We had another boo-boo with the actual _folding_ of the "full" hash value into the actual hash chain pointer that is done when the name cache is actually looked up, which was even more embarrassing: even if the hash ended up being ok, we would remove most of the valid bits from it because it would under certain circumstances (512MB of RAM on x86) basically xor itself with itself. That took quite a while to find too - the code still worked fine, it just had a horrible distribution on machines with half a gig of memory. >The issue to have a good distribution is not only to have a good hashing >function, but also to give this function not correlated data. >Good hashing function for a Domain A may not be so good for a Domain B. This is not something we can do all that much about. The data we get is generated by the user, and can basically be a random string of characters. HOWEVER, there are certainly tons of _usual_ data, and while there's no way to select the data we can at least try to make sure that the distribution is good for the normal case (ie regular ASCII filenames, not forgetting the fact that many people use more interesting encodings) Linus - To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/