From: Andi Kleen
To: Nick Piggin
Cc: Pekka Enberg, Linux Memory Management List, Linux Kernel Mailing List,
	Andrew Morton, Lin Ming, "Zhang, Yanmin", Christoph Lameter
Subject: Re: [patch] SLQB slab allocator
Date: Fri, 23 Jan 2009 10:55:26 +0100
Message-ID: <87hc3qcpo1.fsf@basil.nowhere.org>
In-Reply-To: <20090121143008.GV24891@wotan.suse.de> (Nick Piggin's message of
	"Wed, 21 Jan 2009 15:30:08 +0100")

Nick Piggin writes:

Not a full review, just some things I noticed. The code is very
readable, thanks (that's IMHO the main reason slab.c should go, btw;
it's really messy and hard to get through).

> Using lists rather than arrays can reduce the cacheline footprint. When moving
> objects around, SLQB can move a list of objects from one CPU to another by
> simply manipulating a head pointer, wheras SLAB needs to memcpy arrays. Some
> SLAB per-CPU arrays can be up to 1K in size, which is a lot of cachelines that
> can be touched during alloc/free. Newly freed objects tend to be cache hot,
> and newly allocated ones tend to soon be touched anyway, so often there is
> little cost to using metadata in the objects.

You're probably aware of this, but the obvious counter-argument is that
for manipulating a single object, a doubly linked list always requires
touching three cache lines (prev, current, next), while an array access
touches only a single one.

A possible alternative would be a list of shorter arrays.

> +	int objsize;		/* The size of an object without meta data */
> +	int offset;		/* Free pointer offset. */
> +	int objects;		/* Number of objects in slab */
> +
> +	int size;		/* The size of an object including meta data */
> +	int order;		/* Allocation order */
> +	gfp_t allocflags;	/* gfp flags to use on allocation */
> +	unsigned int colour_range;	/* range of colour counter */
> +	unsigned int colour_off;	/* offset per colour */
> +	void (*ctor)(void *);
> +
> +	const char *name;	/* Name (only for display!) */
> +	struct list_head list;	/* List of slab caches */
> +
> +	int align;		/* Alignment */
> +	int inuse;		/* Offset to metadata */

I suspect some of these fields could be short or char (e.g. alignment),
possibly lowering the cache line impact.

> +
> +#ifdef CONFIG_SLQB_SYSFS
> +	struct kobject kobj;	/* For sysfs */
> +#endif
> +#ifdef CONFIG_NUMA
> +	struct kmem_cache_node *node[MAX_NUMNODES];
> +#endif
> +#ifdef CONFIG_SMP
> +	struct kmem_cache_cpu *cpu_slab[NR_CPUS];

Those both really need to be dynamically allocated, otherwise they waste
a lot of memory in the common case (e.g. a NR_CPUS==128 kernel on a dual
core system). And of course on the proposed NR_CPUS==4096 kernels it
becomes prohibitive.

You could use alloc_percpu? There's no alloc_pernode unfortunately;
perhaps there should be one.
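
Roughly what I have in mind, as a sketch only (the helper name here is
made up, and the statically allocated boot-time caches would of course
still need special casing):

	/* sketch: in struct kmem_cache, instead of
	 *	struct kmem_cache_cpu *cpu_slab[NR_CPUS];
	 * keep one pointer that the per-cpu allocator sizes for you:
	 */
	struct kmem_cache_cpu *cpu_slab;	/* alloc_percpu() result */

	/* set up at kmem_cache_create() time: */
	static int alloc_kmem_cache_cpus(struct kmem_cache *s)
	{
		s->cpu_slab = alloc_percpu(struct kmem_cache_cpu);
		if (!s->cpu_slab)
			return -ENOMEM;
		return 0;
	}

	/* fast path then uses per_cpu_ptr() instead of array indexing: */
	c = per_cpu_ptr(s->cpu_slab, smp_processor_id());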
> +#if L1_CACHE_BYTES < 64
> +	if (size > 64 && size <= 96)
> +		return 1;
> +#endif
> +#if L1_CACHE_BYTES < 128
> +	if (size > 128 && size <= 192)
> +		return 2;
> +#endif
> +	if (size <= 8) return 3;
> +	if (size <= 16) return 4;
> +	if (size <= 32) return 5;
> +	if (size <= 64) return 6;
> +	if (size <= 128) return 7;
> +	if (size <= 256) return 8;
> +	if (size <= 512) return 9;
> +	if (size <= 1024) return 10;
> +	if (size <= 2 * 1024) return 11;
> +	if (size <= 4 * 1024) return 12;
> +	if (size <= 8 * 1024) return 13;
> +	if (size <= 16 * 1024) return 14;
> +	if (size <= 32 * 1024) return 15;
> +	if (size <= 64 * 1024) return 16;
> +	if (size <= 128 * 1024) return 17;
> +	if (size <= 256 * 1024) return 18;
> +	if (size <= 512 * 1024) return 19;
> +	if (size <= 1024 * 1024) return 20;
> +	if (size <= 2 * 1024 * 1024) return 21;

Have you looked into other bin sizes? IIRC the original slab paper
mentioned that power of two is usually not the best.

> +	return -1;
> +}
> +
> +#ifdef CONFIG_ZONE_DMA
> +#define SLQB_DMA __GFP_DMA
> +#else
> +/* Disable "DMA slabs" */
> +#define SLQB_DMA (__force gfp_t)0
> +#endif
> +
> +/*
> + * Find the kmalloc slab cache for a given combination of allocation flags and
> + * size.

You should mention that it would be a very bad idea to call this for
!__builtin_constant_p(size).

> + */
> +static __always_inline struct kmem_cache *kmalloc_slab(size_t size, gfp_t flags)
> +{
> +	int index = kmalloc_index(size);
> +
> +	if (unlikely(index == 0))
> +		return NULL;
> +
> +	if (likely(!(flags & SLQB_DMA)))
> +		return &kmalloc_caches[index];
> +	else
> +		return &kmalloc_caches_dma[index];

BTW, I had an old patchkit to kill all GFP_DMA slab users. Perhaps I
should warm that up again; that would lower the inline footprint.

> +#ifdef CONFIG_NUMA
> +void *__kmalloc_node(size_t size, gfp_t flags, int node);
> +void *kmem_cache_alloc_node(struct kmem_cache *, gfp_t flags, int node);
> +
> +static __always_inline void *kmalloc_node(size_t size, gfp_t flags, int node)

kmalloc_node should be infrequent; I suspect it can safely be moved out
of line.

> + * - investiage performance with memoryless nodes. Perhaps CPUs can be given
> + *   a default closest home node via which it can use fastpath functions.

FWIW, that is what x86-64 always did. Perhaps you can just fix ia64 to
do that too and be happy.

> + *   Perhaps it is not a big problem.
> + */
> +
> +/*
> + * slqb_page overloads struct page, and is used to manage some slob allocation
> + * aspects, however to avoid the horrible mess in include/linux/mm_types.h,
> + * we'll just define our own struct slqb_page type variant here.

Hopefully this works for the crash dumpers. Do they have a way to
distinguish slub/slqb/slab kernels with different struct page usage?

> +#define PG_SLQB_BIT (1 << PG_slab)
> +
> +static int kmem_size __read_mostly;
> +#ifdef CONFIG_NUMA
> +static int numa_platform __read_mostly;
> +#else
> +#define numa_platform 0
> +#endif

It would be cheaper if you put that as a flag into the kmem_cache flags;
that way you avoid touching an additional cache line.

> +static inline int slqb_page_to_nid(struct slqb_page *page)
> +{
> +	return page_to_nid(&page->page);
> +}

etc. You've got a lot of wrappers...

> +static inline struct slqb_page *alloc_slqb_pages_node(int nid, gfp_t flags,
> +						unsigned int order)
> +{
> +	struct page *p;
> +
> +	if (nid == -1)
> +		p = alloc_pages(flags, order);
> +	else
> +		p = alloc_pages_node(nid, flags, order);

alloc_pages_node() does that check anyway.
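
I.e. roughly this sketch (ignoring whatever difference the mempolicy
handling in alloc_pages() makes for the nid == -1 case):

	/* sketch: alloc_pages_node() already falls back to the local
	 * node for nid < 0, so the branch can go away */
	static inline struct slqb_page *alloc_slqb_pages_node(int nid,
					gfp_t flags, unsigned int order)
	{
		return (struct slqb_page *)alloc_pages_node(nid, flags, order);
	}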
> +/* Not all arches define cache_line_size */
> +#ifndef cache_line_size
> +#define cache_line_size()	L1_CACHE_BYTES
> +#endif
> +

They should. Better to fix them?

> +
> +	/*
> +	 * Determine which debug features should be switched on
> +	 */

It would be nicer if you could use long options. At least for me, that
would increase the probability that I could remember them without
having to look them up.

> +/*
> + * Allocate a new slab, set up its object list.
> + */
> +static struct slqb_page *new_slab_page(struct kmem_cache *s, gfp_t flags, int node, unsigned int colour)
> +{
> +	struct slqb_page *page;
> +	void *start;
> +	void *last;
> +	void *p;
> +
> +	BUG_ON(flags & GFP_SLAB_BUG_MASK);
> +
> +	page = allocate_slab(s,
> +		flags & (GFP_RECLAIM_MASK | GFP_CONSTRAINT_MASK), node);
> +	if (!page)
> +		goto out;
> +
> +	page->flags |= PG_SLQB_BIT;
> +
> +	start = page_address(&page->page);
> +
> +	if (unlikely(slab_poison(s)))
> +		memset(start, POISON_INUSE, PAGE_SIZE << s->order);
> +
> +	start += colour;

One thing I was wondering: did you try disabling the colouring to see
if it makes much of a difference on modern systems? They tend to have
either larger caches or higher-associativity caches.

Or perhaps it could be made optional based on CPU type?

> +static noinline void *__slab_alloc_page(struct kmem_cache *s, gfp_t gfpflags, int node)
> +{
> +	struct slqb_page *page;
> +	struct kmem_cache_list *l;
> +	struct kmem_cache_cpu *c;
> +	unsigned int colour;
> +	void *object;
> +
> +	c = get_cpu_slab(s, smp_processor_id());
> +	colour = c->colour_next;
> +	c->colour_next += s->colour_off;
> +	if (c->colour_next >= s->colour_range)
> +		c->colour_next = 0;
> +
> +	/* XXX: load any partial? */
> +
> +	/* Caller handles __GFP_ZERO */
> +	gfpflags &= ~__GFP_ZERO;
> +
> +	if (gfpflags & __GFP_WAIT)
> +		local_irq_enable();

At least on P4 you could get some win by avoiding the local_irq_save()
up in the fast path when __GFP_WAIT is set (because storing the eflags
is very expensive there).

> +
> +again:
> +	local_irq_save(flags);
> +	object = __slab_alloc(s, gfpflags, node);
> +	local_irq_restore(flags);
> +
> +	if (unlikely(slab_debug(s)) && likely(object)) {

AFAIK gcc cannot process multiple likely()s in a single condition.

> +/* Initial slabs */
> +#ifdef CONFIG_SMP
> +static struct kmem_cache_cpu kmem_cache_cpus[NR_CPUS];
> +#endif
> +#ifdef CONFIG_NUMA
> +static struct kmem_cache_node kmem_cache_nodes[MAX_NUMNODES];
> +#endif
> +
> +#ifdef CONFIG_SMP
> +static struct kmem_cache kmem_cpu_cache;
> +static struct kmem_cache_cpu kmem_cpu_cpus[NR_CPUS];
> +#ifdef CONFIG_NUMA
> +static struct kmem_cache_node kmem_cpu_nodes[MAX_NUMNODES];
> +#endif
> +#endif
> +
> +#ifdef CONFIG_NUMA
> +static struct kmem_cache kmem_node_cache;
> +static struct kmem_cache_cpu kmem_node_cpus[NR_CPUS];
> +static struct kmem_cache_node kmem_node_nodes[MAX_NUMNODES];
> +#endif

That all needs fixing too, of course.

> +
> +#ifdef CONFIG_SMP
> +static struct kmem_cache_cpu *alloc_kmem_cache_cpu(struct kmem_cache *s, int cpu)
> +{
> +	struct kmem_cache_cpu *c;
> +
> +	c = kmem_cache_alloc_node(&kmem_cpu_cache, GFP_KERNEL, cpu_to_node(cpu));
> +	if (!c)
> +		return NULL;
> +
> +	init_kmem_cache_cpu(s, c);
> +	return c;
> +}
> +
> +static void free_kmem_cache_cpus(struct kmem_cache *s)
> +{
> +	int cpu;
> +
> +	for_each_online_cpu(cpu) {

Is this protected against racing CPU hotplug? It doesn't look like it.
Multiple occurrences.
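
E.g. something like this sketch (free_kmem_cache_cpu() stands in for
whatever the matching teardown helper is; a hotplug notifier would still
be needed for CPUs that come up later):

	/* sketch: hold off CPU hotplug while walking the online mask */
	static void free_kmem_cache_cpus(struct kmem_cache *s)
	{
		int cpu;

		get_online_cpus();
		for_each_online_cpu(cpu) {
			struct kmem_cache_cpu *c = s->cpu_slab[cpu];

			if (c) {
				s->cpu_slab[cpu] = NULL;
				free_kmem_cache_cpu(c, cpu);	/* assumed helper */
			}
		}
		put_online_cpus();
	}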
> +static void cache_trim_worker(struct work_struct *w)
> +{
> +	struct delayed_work *work =
> +		container_of(w, struct delayed_work, work);
> +	struct kmem_cache *s;
> +	int node;
> +
> +	if (!down_read_trylock(&slqb_lock))
> +		goto out;

No counter for this?

> +
> +	/*
> +	 * We are bringing a node online. No memory is availabe yet. We must
> +	 * allocate a kmem_cache_node structure in order to bring the node
> +	 * online.
> +	 */
> +	down_read(&slqb_lock);
> +	list_for_each_entry(s, &slab_caches, list) {
> +		/*
> +		 * XXX: kmem_cache_alloc_node will fallback to other nodes
> +		 * since memory is not yet available from the node that
> +		 * is brought up.
> +		 */
> +		if (s->node[nid]) /* could be lefover from last online */
> +			continue;
> +		n = kmem_cache_alloc(&kmem_node_cache, GFP_KERNEL);
> +		if (!n) {
> +			ret = -ENOMEM;

Surely that should panic? I don't think a slab-less node will be very
useful later.

> +#ifdef CONFIG_SLQB_SYSFS
> +/*
> + * sysfs API
> + */
> +#define to_slab_attr(n) container_of(n, struct slab_attribute, attr)
> +#define to_slab(n) container_of(n, struct kmem_cache, kobj);
> +
> +struct slab_attribute {
> +	struct attribute attr;
> +	ssize_t (*show)(struct kmem_cache *s, char *buf);
> +	ssize_t (*store)(struct kmem_cache *s, const char *x, size_t count);
> +};
> +
> +#define SLAB_ATTR_RO(_name) \
> +	static struct slab_attribute _name##_attr = __ATTR_RO(_name)
> +
> +#define SLAB_ATTR(_name) \
> +	static struct slab_attribute _name##_attr = \
> +	__ATTR(_name, 0644, _name##_show, _name##_store)
> +
> +static ssize_t slab_size_show(struct kmem_cache *s, char *buf)
> +{
> +	return sprintf(buf, "%d\n", s->size);
> +}
> +SLAB_ATTR_RO(slab_size);
> +
> +static ssize_t align_show(struct kmem_cache *s, char *buf)
> +{
> +	return sprintf(buf, "%d\n", s->align);
> +}
> +SLAB_ATTR_RO(align);
> +

When you map back to the attribute, you could use an index into a table
for the field, saving that many functions?

> +#define STAT_ATTR(si, text)					\
> +static ssize_t text##_show(struct kmem_cache *s, char *buf)	\
> +{								\
> +	return show_stat(s, buf, si);				\
> +}								\
> +SLAB_ATTR_RO(text);						\
> +
> +STAT_ATTR(ALLOC, alloc);
> +STAT_ATTR(ALLOC_SLAB_FILL, alloc_slab_fill);
> +STAT_ATTR(ALLOC_SLAB_NEW, alloc_slab_new);
> +STAT_ATTR(FREE, free);
> +STAT_ATTR(FREE_REMOTE, free_remote);
> +STAT_ATTR(FLUSH_FREE_LIST, flush_free_list);
> +STAT_ATTR(FLUSH_FREE_LIST_OBJECTS, flush_free_list_objects);
> +STAT_ATTR(FLUSH_FREE_LIST_REMOTE, flush_free_list_remote);
> +STAT_ATTR(FLUSH_SLAB_PARTIAL, flush_slab_partial);
> +STAT_ATTR(FLUSH_SLAB_FREE, flush_slab_free);
> +STAT_ATTR(FLUSH_RFREE_LIST, flush_rfree_list);
> +STAT_ATTR(FLUSH_RFREE_LIST_OBJECTS, flush_rfree_list_objects);
> +STAT_ATTR(CLAIM_REMOTE_LIST, claim_remote_list);
> +STAT_ATTR(CLAIM_REMOTE_LIST_OBJECTS, claim_remote_list_objects);

This really should be table-driven, shouldn't it? That would give much
smaller code.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
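
P.S. A rough sketch of the table-driven shape I mean (untested, names
assumed; I'm guessing the sysfs_ops dispatcher looks more or less like
SLUB's slab_attr_show(), and the table would still have to be wired
into the attribute group):

	/* one attribute per counter, kept in a table in the same order
	 * as the stat enum, so the index falls out of the pointer */
	static struct slab_attribute stat_attrs[] = {
		[ALLOC]		= __ATTR(alloc, 0444, NULL, NULL),
		[FREE]		= __ATTR(free, 0444, NULL, NULL),
		/* ... */
	};

	static ssize_t slab_attr_show(struct kobject *kobj,
				struct attribute *attr, char *buf)
	{
		struct kmem_cache *s = container_of(kobj, struct kmem_cache, kobj);
		struct slab_attribute *sa = to_slab_attr(attr);

		/* stat attributes share one show path via the table index */
		if (sa >= stat_attrs && sa < stat_attrs + ARRAY_SIZE(stat_attrs))
			return show_stat(s, buf, sa - stat_attrs);

		return sa->show ? sa->show(s, buf) : -EIO;
	}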