Date: Mon, 19 Jan 2009 06:47:30 +0100
From: Nick Piggin
To: Christoph Lameter
Cc: Pekka Enberg, "Zhang, Yanmin", Lin Ming, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Andrew Morton, Linus Torvalds
Subject: Re: [patch] SLQB slab allocator
Message-ID: <20090119054730.GA22584@wotan.suse.de>

On Fri, Jan 16, 2009 at 03:07:52PM -0600, Christoph Lameter wrote:
> On Fri, 16 Jan 2009, Nick Piggin wrote:
>
> > > handled by the same 2M TLB covering a 32k page. If the 4k pages are
> > > dispersed then you may need 8 2M tlbs (which covers already a quarter of
> > > the available 2M TLBs on nehalem f.e.) for which the larger alloc just
> > > needs a single one.
> >
> > Yes I know that. But it's pretty theoretical IMO (and I could equally
> > describe a theoretical situation where increased fragmentation in higher
> > order slabs will result in worse TLB coverage).
>
> Theoretical only for low sizes of memory. If you have terabytes of memory
> then this becomes significant in a pretty fast way.

I don't really buy that as a general statement with no other qualifiers.
If the huge system has a correspondingly increased number of slab objects,
then the potential win gets smaller as the system size increases.

Say you have a 1GB RAM system with 128 2MB TLB entries, and suppose you
have a slab that takes 25% of the RAM. If you pack that slab optimally,
then a random access to one of its objects gets a 100% TLB hit; if its
pages are allocated randomly, the hit rate is 25%.

In a 1TB RAM system, you have ~0.1% chance of a TLB hit in the optimally
packed case and ~0.025% chance in the random case. So there the possible
gain from packing is much smaller (negligible). And note that we're
talking about the best possible packing scenario (i.e. 2MB pages vs 4K
pages); the standard SLUB tune would not get anywhere near that. A rough
sketch of this arithmetic is included at the end of this part of the
reply.

The thing IMO you forget with all these doomsday scenarios about SGI's
peta scale systems is that no matter what you do, you can't avoid the
fact that computing is about locality. Even if you totally take the TLB
out of the equation, you still have the small detail of other caches.
Code that jumps all over that 1024 TB of memory with no locality is
going to suck regardless of what the kernel ever does, due to physical
limitations of hardware.

Anyway, it is trivial to tune this in SLQB on those systems that really
care, giving SLUB no advantage. So we needn't really get sidetracked
with this in the context of SLQB.
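To make the 1GB vs 1TB arithmetic above concrete, here is a minimal
back-of-the-envelope sketch. It is illustrative userspace arithmetic
only, not kernel code, and it simply plugs in the hypothetical figures
used above (128 2MB TLB entries, a slab occupying 25% of RAM):

/*
 * Rough sketch of the TLB coverage arithmetic discussed above.
 * The constants are the hypothetical figures from the mail, not
 * measurements of any particular machine.
 */
#include <stdio.h>

static void coverage(double ram_bytes)
{
	const double tlb_entries = 128;
	const double tlb_page = 2 << 20;		 /* 2MB per TLB entry */
	const double tlb_reach = tlb_entries * tlb_page; /* 256MB total reach */
	const double slab_bytes = ram_bytes * 0.25;	 /* slab uses 25% of RAM */

	/* Optimal packing: the TLB only has to cover the slab itself. */
	double packed = tlb_reach / slab_bytes;
	/* Dispersed 4K pages: objects are spread over all of RAM, so the
	 * TLB effectively covers only a fraction of the whole of RAM. */
	double random = tlb_reach / ram_bytes;

	if (packed > 1.0)
		packed = 1.0;

	printf("RAM %5.0f GB: packed hit ~%6.3f%%, random hit ~%6.3f%%\n",
	       ram_bytes / (1 << 30), packed * 100, random * 100);
}

int main(void)
{
	coverage(1ULL << 30);	/* 1GB: ~100% packed vs 25% random */
	coverage(1ULL << 40);	/* 1TB: ~0.1% packed vs ~0.025% random */
	return 0;
}

Built with any C compiler, this prints roughly 100%/25% for the 1GB case
and 0.1%/0.025% for the 1TB case, i.e. the numbers quoted above.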
> > > It has lists of free objects that are bound to a particular page. That
> > > simplifies numa handling since all the objects in a "queue" (or page) have
> > > the same NUMA characteristics.
> >
> > The same can be said of SLQB and SLAB as well.
>
> Sorry not at all. SLAB and SLQB queue objects from different pages in the
> same queue.

The last sentence is what I was replying to. I.e. "simplification of numa
handling" does not follow from the SLUB implementation of per-page
freelists.

> > > was assigned to a processor. Memory wastage may only occur because
> > > each processor needs to have a separate page from which to allocate. SLAB
> > > like designs need to put a large number of objects in queues which may
> > > keep a number of pages in the allocated pages pool although all objects
> > > are unused. That does not occur with slub.
> >
> > That's wrong. SLUB keeps completely free pages on its partial lists, and
> > also IIRC can keep free pages pinned in the per-cpu page. I have actually
> > seen SLQB use less memory than SLUB in some situations for this reason.
>
> As I said it pins a single page in the per cpu page and uses that in a way
> that you call a queue and I call a freelist.

And you found you have to increase the size of your pages because you
need bigger queues. (Must we argue semantics? It is a list of free
objects.)

> SLUB keeps a few pages on the partial list right now because it tries to
> avoid trips to the page allocator (which is quite slow). These could be
> eliminated if the page allocator would work effectively. However that
> number is a per node limit.

This is the practical vs theoretical I'm talking about.

> SLAB and SLUB can have large quantities of objects in their queues that
> each can keep a single page out of circulation if it's the last
> object in that page. This is a per queue thing and you have at least two

And if that were a problem, SLQB can easily be runtime tuned to keep no
objects in its object lists. But as I said, queueing is good, so why
would anybody want to get rid of it?

> queues per cpu. SLAB has queues per cpu, per pair of cpus, per node and per
> alien node for each node. That can pin quite a number of pages on large
> systems. Note that SLAB has one per cpu whereas you have already 2 per
> cpu? In SMP configurations this may mean that SLQB has more queues than
> SLAB.

Again, this doesn't really go anywhere while we disagree on the
fundamental goodness of queueing. This is just describing the
implementation. A rough sketch of the two shapes being argued about is
appended below for reference.
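Since we keep disagreeing about terminology as much as behaviour, here is
a minimal conceptual sketch of the two shapes in question. These are
illustrative structs with made-up names, not the real SLUB or SLQB data
structures:

/*
 * Illustrative only: simplified shapes of the two designs discussed
 * above. Struct and field names are invented for the example.
 */

/* SLUB-style: free objects are chained inside the page they belong to,
 * so everything reachable from one freelist shares that page's NUMA
 * node. */
struct per_page_freelist {
	void *freelist;		/* first free object inside this page */
	unsigned int inuse;	/* objects currently allocated from it */
	int node;		/* NUMA node of the whole page */
};

/* SLAB/SLQB-style: a per-CPU queue (list) of free objects which may
 * come from many different pages; the queue length is a tunable, and a
 * queued object can keep an otherwise-empty page out of circulation. */
struct per_cpu_object_queue {
	void **objects;		/* free objects, possibly from many pages */
	unsigned int nr;	/* objects currently queued */
	unsigned int limit;	/* tunable maximum queue length */
};

The argument above boils down to whether the second shape (one queue
holding objects from many pages) hurts NUMA placement and pins pages, or
whether it is precisely what keeps the allocation and free fast paths
cheap.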