Date: Mon, 23 Mar 2009 13:28:37 -0700
From: Ravikiran G Thirumalai
To: Eric Dumazet
Cc: linux-kernel@vger.kernel.org, Ingo Molnar, shai@scalex86.org, Andrew Morton
Subject: Re: [rfc] [patch 1/2 ] Process private hash tables for private futexes
Message-ID: <20090323202837.GE7278@localdomain>
References: <20090321044637.GA7278@localdomain> <49C4AE64.4060400@cosmosbay.com> <20090322045414.GD7278@localdomain> <49C5F3FD.9010606@cosmosbay.com>
In-Reply-To: <49C5F3FD.9010606@cosmosbay.com>

On Sun, Mar 22, 2009 at 09:17:01AM +0100, Eric Dumazet wrote:
>Ravikiran G Thirumalai wrote:
>>>
>>> Did you try to change FUTEX_HASHBITS instead, since the current value is
>>> really ridiculous?
>>
>> We tried that in the past, and I remember that on a 16-core machine we
>> had to use 32k hash slots to avoid false sharing.
>>
>> Yes, dynamically resizing the hash table is better (looking at the patch
>> you have posted), but there are still no locality guarantees. A process
>> pinned to node X may still end up accessing remote memory while walking
>> the hash table. A process-private table, on the other hand, does not
>> have this problem. I think using a global hash for entirely
>> process-local objects is bad design here.
>>
>
>Bad design, or bad luck... considering the kernel already uses such global
>tables (dentries, inodes, tcp, ip route cache, ...).

Not necessarily. The dentry/inode/route caches need to be shared across
processes, so a global cache makes sense there -- private futexes only need
to be shared between the threads of one process, not with the entire world.

>
>Problem is to size this hash table, private or not. You said you had to
>have 32768 slots to avoid false sharing on a 16-core machine. This seems
>strange to me, given we use jhash. What is the size of the cache line on
>your platforms?

It is large, and true, these bad effects get magnified with larger cache
lines. However, this does forewarn other architectures of the same problem.
Accesses to the virtual addresses listed at the end of this email were seen
to cause cache-line thrashing between nodes on a vSMP system. The eip
corresponds to the spin_lock in 'futex_wake' on a 2.6.27 kernel. These
addresses clearly correspond to the spinlocks in the hash buckets; the
workload was a threaded FEA solver on a 32-core machine. As can be seen,
this would be a problem even on a machine with a 64-byte cache line.
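
For reference, the global hash in kernels of that era looks roughly like the
sketch below. This is reproduced from memory, not pasted verbatim from
2.6.27, so treat member order and the exact FUTEX_HASHBITS value as
approximations. The point is that each bucket is only a spinlock plus a
plist head -- a few tens of bytes -- so several hash-adjacent buckets, and
therefore several unrelated bucket locks, end up on one cache line:

#include <linux/spinlock.h>
#include <linux/plist.h>
#include <linux/jhash.h>
#include <linux/futex.h>

/*
 * Rough sketch of the 2.6.27-era global futex hash (kernel/futex.c),
 * reconstructed from memory -- details are approximate.
 */
#define FUTEX_HASHBITS	8		/* 256 buckets in mainline back then */

struct futex_hash_bucket {
	spinlock_t lock;		/* one lock per bucket...              */
	struct plist_head chain;	/* ...and only ~40 bytes per bucket    */
};

static struct futex_hash_bucket futex_queues[1 << FUTEX_HASHBITS];

/* jhash2() over the futex key picks the bucket */
static struct futex_hash_bucket *hash_futex(union futex_key *key)
{
	u32 hash = jhash2((u32 *)&key->both.word,
			  (sizeof(key->both.word) + sizeof(key->both.ptr)) / 4,
			  key->both.offset);

	return &futex_queues[hash & ((1 << FUTEX_HASHBITS) - 1)];
}

Two unrelated futexes that hash to neighbouring slots therefore contend on
the same cache line even though they never touch the same lock.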
>
>Say we have 32768 slots for the global hash table, and 256 slots for a
>private one; you can probably have a program running slowly with this
>private 256-slot table if the program uses all available cores.

True. As I replied to akpm earlier in this thread, if a workload happens to
be one multi-threaded process with a zillion threads, that workload will
already have bigger overheads due to the sharing of the process address
space and mmap_sem. At least that has been our experience so far. Private
futex hashes solve the problem at hand.

>
>If we use a large private hash table, the setup cost is higher (we need to
>initialize all spinlocks and plist heads at each program startup), unless
>we use a dedicated kmem_cache to keep a pool of preinitialized private
>hash tables...
>

Hmm! How about:

a) Reduce the hash table size for the private futex hash and increase the
   hash table size for the global hash?

OR, better,

b) Since multiple spinlocks on the same cacheline are the real PITA here,
   keep the global table but add one dereference to each hash slot, and
   interleave adjacent hash buckets between nodes/cpus. That way, without
   losing memory to padding, we avoid false sharing on cachelines caused by
   unrelated futexes hashing onto adjacent buckets. (A rough sketch of this
   idea follows the address list below.)

Thanks,
Kiran

Cache misses at futex_wake due to accesses to the following addresses:
----------------------------------------------------------------------
fffff819cc180 fffff819cc1d0 fffff819cc248 fffff819cc310
fffff819cc3b0 fffff819cc400 fffff819cc568 fffff819cc5b8
fffff819cc658 fffff819cc770 fffff819cc798 fffff819cc838
fffff819cc8d8 fffff819cc9c8 fffff819cc9f0 fffff819cca90
fffff819ccae0 fffff819ccb08 fffff819ccd38 fffff819ccd88
fffff819ccdb0 fffff819cce78 fffff819ccf18 fffff819ccfb8
fffff819cd030 fffff819cd058 fffff819cd148 fffff819cd210
fffff819cd260 fffff819cd288 fffff819cd2b0 fffff819cd300
fffff819cd350 fffff819cd3f0 fffff819cd440 fffff819cd558
fffff819cd580 fffff819cd620 fffff819cd738 fffff819cd7b0
fffff819cd7d8 fffff819cd828 fffff819cd8c8 fffff819cd8f0
fffff819cd968 fffff819cd9b8 fffff819cd9e0 fffff819cda08
fffff819cda58 fffff819cdad0 fffff819cdbc0 fffff819cdc10
fffff819cdc60 fffff819cddf0 fffff819cde68 fffff819cdfa8
fffff819cdfd0 fffff819ce020 fffff819ce048 fffff819ce070
fffff819ce098 fffff819ce0c0 fffff819ce0e8 fffff819ce110
fffff819ce1d8 fffff819ce200 fffff819ce228 fffff819ce250
fffff819ce3b8 fffff819ce430 fffff819ce480 fffff819ce5e8
fffff819ce660 fffff819ce728 fffff819ce750 fffff819ce868
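
To make (b) a bit more concrete, here is a rough, untested sketch of the
indirection I have in mind. Nothing below exists in mainline:
futex_queue_ptrs, futex_init_buckets and hash_futex_ptr are made-up names,
and the round-robin node choice is just one possible interleaving policy
(it assumes node ids 0..N-1 are online):

#include <linux/slab.h>
#include <linux/nodemask.h>

/*
 * Hypothetical sketch of option (b): keep one global table, but make each
 * slot a pointer and allocate the bucket itself on an interleaved node, so
 * hash-adjacent buckets never share a cache line.
 */
static struct futex_hash_bucket *futex_queue_ptrs[1 << FUTEX_HASHBITS];

static int __init futex_init_buckets(void)
{
	int i;

	for (i = 0; i < (1 << FUTEX_HASHBITS); i++) {
		int node = i % num_online_nodes();	/* interleave slots */
		struct futex_hash_bucket *hb;

		hb = kmalloc_node(sizeof(*hb), GFP_KERNEL, node);
		if (!hb)
			return -ENOMEM;
		spin_lock_init(&hb->lock);
		plist_head_init(&hb->chain, &hb->lock);
		futex_queue_ptrs[i] = hb;
	}
	return 0;
}

/* lookup pays one extra dereference compared to the flat array */
static struct futex_hash_bucket *hash_futex_ptr(union futex_key *key)
{
	u32 hash = jhash2((u32 *)&key->both.word,
			  (sizeof(key->both.word) + sizeof(key->both.ptr)) / 4,
			  key->both.offset);

	return futex_queue_ptrs[hash & ((1 << FUTEX_HASHBITS) - 1)];
}

The extra pointer chase costs one dependent load per lookup, but each bucket
then lives in its own node-local allocation, so hash-adjacent buckets cannot
false-share a line and no memory is wasted on per-bucket padding.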