Date: Wed, 19 Jul 2017 19:16:35 +0000
From: Josef Bacik
To: Dennis Zhou
Cc: Tejun Heo, Christoph Lameter, kernel-team@fb.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Dennis Zhou
Subject: Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator
Message-ID: <20170719191635.GD23135@li70-116.members.linode.com>
References: <20170716022315.19892-1-dennisz@fb.com> <20170716022315.19892-10-dennisz@fb.com>
In-Reply-To: <20170716022315.19892-10-dennisz@fb.com>

On Sat, Jul 15, 2017 at 10:23:14PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)"
>
> The percpu memory allocator is experiencing scalability issues when
> allocating and freeing large numbers of counters as in BPF.
> Additionally, there is a corner case where iteration is triggered over
> all chunks if the contig_hint is the right size, but wrong alignment.
>
> Implementation:
> This patch removes the area map allocator in favor of a bitmap allocator
> backed by metadata blocks. The primary goal is to provide consistency
> in performance and memory footprint with a focus on small allocations
> (< 64 bytes). The bitmap removes the heavy memmove from the freeing
> critical path and provides a consistent memory footprint. The metadata
> blocks provide a bound on the amount of scanning required by maintaining
> a set of hints.
>
> The chunks previously were managed by free_size, a value maintained in
> bytes. Now, the chunks are managed in terms of bits, which is just a
> scaled value of free_size down by PCPU_MIN_ALLOC_SIZE.
>
> There is one caveat with this implementation. In an effort to make
> freeing fast, the only time metadata is updated on the free path is if a
> whole block becomes free or the freed area spans across metadata blocks.
> This causes the chunk’s contig_hint to be potentially smaller than what
> it could allocate by up to a block. If the chunk’s contig_hint is
> smaller than a block, a check occurs and the hint is kept accurate.
> Metadata is always kept accurate on allocation and therefore the
> situation where a chunk has a larger contig_hint than available will
> never occur.
>
> Evaluation:
> I have primarily done testing against a simple workload of allocation of
> 1 million objects of varying size. Deallocation was done in order,
> alternating, and in reverse. These numbers were collected after rebasing
> on top of a80099a152.
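
The bits/bytes bookkeeping described above is easy to lose track of, so here
is a tiny standalone model of the conversions the patch uses. This is my own
sketch: the constants mirror the patch, PAGE_SIZE is hardcoded to 4096, and
the helper names are illustrative rather than the kernel's.

#include <stdio.h>

/*
 * Userspace model of the unit conversions: bytes served -> bits in the
 * allocation bitmap -> metadata blocks. One block covers one page worth
 * of PCPU_MIN_ALLOC_SIZE units.
 */
#define PAGE_SIZE		4096
#define PCPU_MIN_ALLOC_SHIFT	2
#define PCPU_MIN_ALLOC_SIZE	(1 << PCPU_MIN_ALLOC_SHIFT)
#define PCPU_BITMAP_BLOCK_SIZE	(PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT)

/* bytes served by a chunk -> bits in its allocation bitmap */
static int bytes_to_bits(int bytes)
{
	return bytes / PCPU_MIN_ALLOC_SIZE;
}

/* pages served by a chunk -> number of metadata blocks */
static int pages_to_blocks(int pages)
{
	return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE / PCPU_BITMAP_BLOCK_SIZE;
}

int main(void)
{
	/* a 16-page unit: 65536 bytes -> 16384 bits -> 16 blocks of 1024 bits */
	printf("bits:   %d\n", bytes_to_bits(16 * PAGE_SIZE));
	printf("blocks: %d\n", pages_to_blocks(16));
	return 0;
}
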
> I present the worst-case numbers here:
>
> Area Map Allocator:
>
> Object Size | Alloc Time (ms) | Free Time (ms)
> ----------------------------------------------
>          4B |             335 |           4960
>         16B |             485 |           1150
>         64B |             445 |            280
>        128B |             505 |            177
>       1024B |            3385 |            140
>
> Bitmap Allocator:
>
> Object Size | Alloc Time (ms) | Free Time (ms)
> ----------------------------------------------
>          4B |             725 |             70
>         16B |             760 |             70
>         64B |             855 |             80
>        128B |             910 |             90
>       1024B |            3770 |            260
>
> This data demonstrates the inability of the area map allocator to
> handle less than ideal situations. In the best case of reverse
> deallocation, the area map allocator was able to perform within range
> of the bitmap allocator. In the worst case situation, freeing took
> nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
> dramatically improves the consistency of the free path. The small
> allocations performed nearly identically regardless of the freeing
> pattern.
>
> While it does add to the allocation latency, the allocation scenario
> here is optimal for the area map allocator. The second problem of
> additional scanning can result in the area map allocator completing in
> 52 minutes. The same workload takes only 14 seconds to complete for the
> bitmap allocator. This was produced under a more contrived scenario of
> allocating 1 million 4-byte objects with 8-byte alignment.
>
> Signed-off-by: Dennis Zhou
> ---
>  include/linux/percpu.h |   10 +-
>  init/main.c            |    1 -
>  mm/percpu-internal.h   |   70 ++-
>  mm/percpu-stats.c      |   99 ++--
>  mm/percpu.c            | 1280 ++++++++++++++++++++++++++++++------------------
>  5 files changed, 923 insertions(+), 537 deletions(-)
>
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h
> index a5cedcd..8f62b10 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -26,6 +26,15 @@
> #define PCPU_MIN_ALLOC_SHIFT 2
>
> /*
> + * This determines the size of each metadata block. There are several subtle
> + * constraints around this variable. The reserved_region and dynamic_region
> + * of the first chunk must be multiples of PCPU_BITMAP_BLOCK_SIZE. This is
> + * not a problem if the BLOCK_SIZE encompasses a page, but if exploring blocks
> + * that are backing multiple pages, this needs to be accounted for.
> + */
> +#define PCPU_BITMAP_BLOCK_SIZE (PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT)
> +
> +/*
> * Percpu allocator can serve percpu allocations before slab is
> * initialized which allows slab to depend on the percpu allocator.
> * The following two parameters decide how much resource to
> @@ -120,7 +129,6 @@ extern bool is_kernel_percpu_address(unsigned long addr);
> #if !defined(CONFIG_SMP) || !defined(CONFIG_HAVE_SETUP_PER_CPU_AREA)
> extern void __init setup_per_cpu_areas(void);
> #endif
> -extern void __init percpu_init_late(void);
>
> extern void __percpu *__alloc_percpu_gfp(size_t size, size_t align, gfp_t gfp);
> extern void __percpu *__alloc_percpu(size_t size, size_t align);
> diff --git a/init/main.c b/init/main.c
> index 052481f..c9a9fff 100644
> --- a/init/main.c
> +++ b/init/main.c
> @@ -500,7 +500,6 @@ static void __init mm_init(void)
> page_ext_init_flatmem();
> mem_init();
> kmem_cache_init();
> - percpu_init_late();
> pgtable_init();
> vmalloc_init();
> ioremap_huge_init();
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
> index f0776f6..2dac644 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -4,6 +4,21 @@
> #include
> #include
>
> +/*
> + * pcpu_bitmap_md is the metadata block struct.
> + * All units are in terms of bits. > + */ > +struct pcpu_bitmap_md { > + int contig_hint; /* contig hint for block */ > + int contig_hint_start; /* block relative starting > + position of the contig hint */ > + int left_free; /* size of free space along > + the left side of the block */ > + int right_free; /* size of free space along > + the right side of the block */ > + int first_free; /* block position of first free */ > +}; > + > struct pcpu_chunk { > #ifdef CONFIG_PERCPU_STATS > int nr_alloc; /* # of allocations */ > @@ -11,17 +26,20 @@ struct pcpu_chunk { > #endif > > struct list_head list; /* linked to pcpu_slot lists */ > - int free_size; /* free bytes in the chunk */ > - int contig_hint; /* max contiguous size hint */ > + int free_bits; /* free bits in the chunk */ > + int contig_hint; /* max contiguous size hint > + in bits */ > + int contig_hint_start; /* contig_hint starting > + bit offset */ > void *base_addr; /* base address of this chunk */ > > - int map_used; /* # of map entries used before the sentry */ > - int map_alloc; /* # of map entries allocated */ > - int *map; /* allocation map */ > - struct list_head map_extend_list;/* on pcpu_map_extend_chunks */ > + unsigned long *alloc_map; /* allocation map */ > + unsigned long *bound_map; /* boundary map */ > + struct pcpu_bitmap_md *md_blocks; /* metadata blocks */ > > void *data; /* chunk data */ > - int first_free; /* no free below this */ > + int first_free_block; /* block that contains the first > + free bit */ > bool immutable; /* no [de]population allowed */ > bool has_reserved; /* indicates if the region this chunk > is responsible for overlaps with > @@ -44,6 +62,44 @@ extern struct pcpu_chunk *pcpu_first_chunk; > extern struct pcpu_chunk *pcpu_reserved_chunk; > extern unsigned long pcpu_reserved_offset; > > +/* > + * pcpu_nr_pages_to_blocks - converts nr_pages to # of md_blocks > + * @chunk: chunk of interest > + * > + * This conversion is from the number of physical pages that the chunk > + * serves to the number of bitmap blocks required. It converts to bytes > + * served to bits required and then blocks used. > + */ > +static inline int pcpu_nr_pages_to_blocks(struct pcpu_chunk *chunk) > +{ > + return chunk->nr_pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE / > + PCPU_BITMAP_BLOCK_SIZE; > +} > + > +/* > + * pcpu_pages_to_bits - converts the pages to size of bitmap > + * @pages: number of physical pages > + * > + * This conversion is from physical pages to the number of bits > + * required in the bitmap. > + */ > +static inline int pcpu_pages_to_bits(int pages) > +{ > + return pages * PAGE_SIZE / PCPU_MIN_ALLOC_SIZE; > +} > + > +/* > + * pcpu_nr_pages_to_bits - helper to convert nr_pages to size of bitmap > + * @chunk: chunk of interest > + * > + * This conversion is from the number of physical pages that the chunk > + * serves to the number of bits in the bitmap. > + */ > +static inline int pcpu_nr_pages_to_bits(struct pcpu_chunk *chunk) > +{ > + return pcpu_pages_to_bits(chunk->nr_pages); > +} > + > #ifdef CONFIG_PERCPU_STATS > > #include > diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c > index 6fc04b1..8dbef0c 100644 > --- a/mm/percpu-stats.c > +++ b/mm/percpu-stats.c > @@ -29,64 +29,85 @@ static int cmpint(const void *a, const void *b) > } > > /* > - * Iterates over all chunks to find the max # of map entries used. > + * Iterates over all chunks to find the max nr_alloc entries. 
> */ > -static int find_max_map_used(void) > +static int find_max_nr_alloc(void) > { > struct pcpu_chunk *chunk; > - int slot, max_map_used; > + int slot, max_nr_alloc; > > - max_map_used = 0; > + max_nr_alloc = 0; > for (slot = 0; slot < pcpu_nr_slots; slot++) > list_for_each_entry(chunk, &pcpu_slot[slot], list) > - max_map_used = max(max_map_used, chunk->map_used); > + max_nr_alloc = max(max_nr_alloc, chunk->nr_alloc); > > - return max_map_used; > + return max_nr_alloc; > } > > /* > * Prints out chunk state. Fragmentation is considered between > * the beginning of the chunk to the last allocation. > + * > + * All statistics are in bytes unless stated otherwise. > */ > static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, > int *buffer) > { > - int i, s_index, last_alloc, alloc_sign, as_len; > + int i, last_alloc, as_len, start, end; > int *alloc_sizes, *p; > /* statistics */ > int sum_frag = 0, max_frag = 0; > int cur_min_alloc = 0, cur_med_alloc = 0, cur_max_alloc = 0; > > alloc_sizes = buffer; > - s_index = chunk->has_reserved ? 1 : 0; > - > - /* find last allocation */ > - last_alloc = -1; > - for (i = chunk->map_used - 1; i >= s_index; i--) { > - if (chunk->map[i] & 1) { > - last_alloc = i; > - break; > - } > - } > > - /* if the chunk is not empty - ignoring reserve */ > - if (last_alloc >= s_index) { > - as_len = last_alloc + 1 - s_index; > - > - /* > - * Iterate through chunk map computing size info. > - * The first bit is overloaded to be a used flag. > - * negative = free space, positive = allocated > - */ > - for (i = 0, p = chunk->map + s_index; i < as_len; i++, p++) { > - alloc_sign = (*p & 1) ? 1 : -1; > - alloc_sizes[i] = alloc_sign * > - ((p[1] & ~1) - (p[0] & ~1)); > + /* > + * find_last_bit returns the start value if nothing found. > + * Therefore, we must determine if it is a failure of find_last_bit > + * and set the appropriate value. > + */ > + last_alloc = find_last_bit(chunk->alloc_map, > + pcpu_nr_pages_to_bits(chunk) - 1); > + last_alloc = test_bit(last_alloc, chunk->alloc_map) ? > + last_alloc + 1 : 0; > + > + start = as_len = 0; > + if (chunk->has_reserved) > + start = pcpu_reserved_offset; > + > + /* > + * If a bit is set in the allocation map, the bound_map identifies > + * where the allocation ends. If the allocation is not set, the > + * bound_map does not identify free areas as it is only kept accurate > + * on allocation, not free. > + * > + * Positive values are allocations and negative values are free > + * fragments. > + */ > + while (start < last_alloc) { > + if (test_bit(start, chunk->alloc_map)) { > + end = find_next_bit(chunk->bound_map, last_alloc, > + start + 1); > + alloc_sizes[as_len] = 1; > + } else { > + end = find_next_bit(chunk->alloc_map, last_alloc, > + start + 1); > + alloc_sizes[as_len] = -1; > } > > - sort(alloc_sizes, as_len, sizeof(chunk->map[0]), cmpint, NULL); > + alloc_sizes[as_len++] *= (end - start) * PCPU_MIN_ALLOC_SIZE; > + > + start = end; > + } > + > + /* > + * The negative values are free fragments and thus sorting gives the > + * free fragments at the beginning in largest first order. > + */ > + if (as_len > 0) { > + sort(alloc_sizes, as_len, sizeof(int), cmpint, NULL); > > - /* Iterate through the unallocated fragements. 
*/ > + /* iterate through the unallocated fragments */ > for (i = 0, p = alloc_sizes; *p < 0 && i < as_len; i++, p++) { > sum_frag -= *p; > max_frag = max(max_frag, -1 * (*p)); > @@ -100,8 +121,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, > P("nr_alloc", chunk->nr_alloc); > P("max_alloc_size", chunk->max_alloc_size); > P("empty_pop_pages", chunk->nr_empty_pop_pages); > - P("free_size", chunk->free_size); > - P("contig_hint", chunk->contig_hint); > + P("first_free_block", chunk->first_free_block); > + P("free_size", chunk->free_bits * PCPU_MIN_ALLOC_SIZE); > + P("contig_hint", chunk->contig_hint * PCPU_MIN_ALLOC_SIZE); > P("sum_frag", sum_frag); > P("max_frag", max_frag); > P("cur_min_alloc", cur_min_alloc); > @@ -113,22 +135,23 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk, > static int percpu_stats_show(struct seq_file *m, void *v) > { > struct pcpu_chunk *chunk; > - int slot, max_map_used; > + int slot, max_nr_alloc; > int *buffer; > > alloc_buffer: > spin_lock_irq(&pcpu_lock); > - max_map_used = find_max_map_used(); > + max_nr_alloc = find_max_nr_alloc(); > spin_unlock_irq(&pcpu_lock); > > - buffer = vmalloc(max_map_used * sizeof(pcpu_first_chunk->map[0])); > + /* there can be at most this many free and allocated fragments */ > + buffer = vmalloc((2 * max_nr_alloc + 1) * sizeof(int)); > if (!buffer) > return -ENOMEM; > > spin_lock_irq(&pcpu_lock); > > /* if the buffer allocated earlier is too small */ > - if (max_map_used < find_max_map_used()) { > + if (max_nr_alloc < find_max_nr_alloc()) { > spin_unlock_irq(&pcpu_lock); > vfree(buffer); > goto alloc_buffer; > diff --git a/mm/percpu.c b/mm/percpu.c > index fb01841..569df63 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -4,6 +4,9 @@ > * Copyright (C) 2009 SUSE Linux Products GmbH > * Copyright (C) 2009 Tejun Heo > * > + * Copyright (C) 2017 Facebook Inc. > + * Copyright (C) 2017 Dennis Zhou > + * > * This file is released under the GPLv2 license. > * > * The percpu allocator handles both static and dynamic areas. Percpu > @@ -34,19 +37,20 @@ > * percpu variables from kernel modules. Finally, the dynamic section > * takes care of normal allocations. > * > - * Allocation state in each chunk is kept using an array of integers > - * on chunk->map. A positive value in the map represents a free > - * region and negative allocated. Allocation inside a chunk is done > - * by scanning this map sequentially and serving the first matching > - * entry. This is mostly copied from the percpu_modalloc() allocator. > - * Chunks can be determined from the address using the index field > - * in the page struct. The index field contains a pointer to the chunk. > - * > - * These chunks are organized into lists according to free_size and > - * tries to allocate from the fullest chunk first. Each chunk maintains > - * a maximum contiguous area size hint which is guaranteed to be equal > - * to or larger than the maximum contiguous area in the chunk. This > - * helps prevent the allocator from iterating over chunks unnecessarily. > + * The allocator organizes chunks into lists according to free size and > + * tries to allocate from the fullest chunk first. Each chunk is managed > + * by a bitmap with metadata blocks. The allocation map is updated on > + * every allocation to reflect the current state while the boundary map > + * is only updated on allocation. Each metadata block contains > + * information to help mitigate the need to iterate over large portions > + * of the bitmap. 
The reverse mapping from page to chunk is stored in > + * the page's index. Lastly, units are lazily backed and grow in unison. > + * > + * There is a unique conversion that goes on here between bytes and bits. > + * The chunk tracks the number of pages it is responsible for in nr_pages. > + * From there, helper functions are used to convert from physical pages > + * to bitmap bits and bitmap blocks. All hints are managed in bits > + * unless explicitly stated. > * > * To use this allocator, arch code should do the following: > * > @@ -86,10 +90,13 @@ > > #include "percpu-internal.h" > > -#define PCPU_SLOT_BASE_SHIFT 5 /* 1-31 shares the same slot */ > -#define PCPU_DFL_MAP_ALLOC 16 /* start a map with 16 ents */ > -#define PCPU_ATOMIC_MAP_MARGIN_LOW 32 > -#define PCPU_ATOMIC_MAP_MARGIN_HIGH 64 > +/* > + * The metadata is managed in terms of bits with each bit mapping to > + * a fragment of size PCPU_MIN_ALLOC_SIZE. Thus, the slots are calculated > + * with respect to the number of bits available. > + */ > +#define PCPU_SLOT_BASE_SHIFT 3 > + > #define PCPU_EMPTY_POP_PAGES_LOW 2 > #define PCPU_EMPTY_POP_PAGES_HIGH 4 > > @@ -156,10 +163,11 @@ unsigned long pcpu_reserved_offset __ro_after_init; > DEFINE_SPINLOCK(pcpu_lock); /* all internal data structures */ > static DEFINE_MUTEX(pcpu_alloc_mutex); /* chunk create/destroy, [de]pop, map ext */ > > -struct list_head *pcpu_slot __ro_after_init; /* chunk list slots */ > - > -/* chunks which need their map areas extended, protected by pcpu_lock */ > -static LIST_HEAD(pcpu_map_extend_chunks); > +/* > + * Chunk list slots. These slots order the chunks by the number of > + * free bits available in the bitmap. > + */ > +struct list_head *pcpu_slot __ro_after_init; > > /* > * The number of empty populated pages, protected by pcpu_lock. The > @@ -212,25 +220,25 @@ static bool pcpu_addr_in_reserved_chunk(void *addr) > pcpu_reserved_chunk->nr_pages * PAGE_SIZE; > } > > -static int __pcpu_size_to_slot(int size) > +static int __pcpu_size_to_slot(int bit_size) > { > - int highbit = fls(size); /* size is in bytes */ > + int highbit = fls(bit_size); /* size is in bits */ > return max(highbit - PCPU_SLOT_BASE_SHIFT + 2, 1); > } > > -static int pcpu_size_to_slot(int size) > +static int pcpu_size_to_slot(int bit_size) > { > - if (size == pcpu_unit_size) > + if (bit_size == pcpu_pages_to_bits(pcpu_unit_pages)) > return pcpu_nr_slots - 1; > - return __pcpu_size_to_slot(size); > + return __pcpu_size_to_slot(bit_size); > } > > static int pcpu_chunk_slot(const struct pcpu_chunk *chunk) > { > - if (chunk->free_size < sizeof(int) || chunk->contig_hint < sizeof(int)) > + if (chunk->free_bits == 0 || chunk->contig_hint == 0) > return 0; > > - return pcpu_size_to_slot(chunk->free_size); > + return pcpu_size_to_slot(chunk->free_bits); > } > > /* set the pointer to a chunk in a page struct */ > @@ -277,6 +285,37 @@ static void __maybe_unused pcpu_next_pop(struct pcpu_chunk *chunk, > } > > /* > + * The following are helper functions to help access bitmaps and convert > + * between bitmap offsets to actual address offsets. 
> + */ > +static unsigned long *pcpu_index_alloc_map(struct pcpu_chunk *chunk, int index) > +{ > + return chunk->alloc_map + > + (index * PCPU_BITMAP_BLOCK_SIZE / BITS_PER_LONG); > +} > + > +static unsigned long pcpu_off_to_block_index(int off) > +{ > + return off / PCPU_BITMAP_BLOCK_SIZE; > +} > + > +static unsigned long pcpu_off_to_block_off(int off) > +{ > + return off & (PCPU_BITMAP_BLOCK_SIZE - 1); > +} > + > +static unsigned long pcpu_block_off_to_off(int index, int off) > +{ > + return index * PCPU_BITMAP_BLOCK_SIZE + off; > +} > + > +static unsigned long pcpu_block_get_first_page(int index) > +{ > + return PFN_DOWN(index * PCPU_BITMAP_BLOCK_SIZE * > + PCPU_MIN_ALLOC_SIZE); > +} > + > +/* > * (Un)populated page region iterators. Iterate over (un)populated > * page regions between @start and @end in @chunk. @rs and @re should > * be integer variables and will be set to start and end page index of > @@ -329,38 +368,6 @@ static void pcpu_mem_free(void *ptr) > } > > /** > - * pcpu_count_occupied_pages - count the number of pages an area occupies > - * @chunk: chunk of interest > - * @i: index of the area in question > - * > - * Count the number of pages chunk's @i'th area occupies. When the area's > - * start and/or end address isn't aligned to page boundary, the straddled > - * page is included in the count iff the rest of the page is free. > - */ > -static int pcpu_count_occupied_pages(struct pcpu_chunk *chunk, int i) > -{ > - int off = chunk->map[i] & ~1; > - int end = chunk->map[i + 1] & ~1; > - > - if (!PAGE_ALIGNED(off) && i > 0) { > - int prev = chunk->map[i - 1]; > - > - if (!(prev & 1) && prev <= round_down(off, PAGE_SIZE)) > - off = round_down(off, PAGE_SIZE); > - } > - > - if (!PAGE_ALIGNED(end) && i + 1 < chunk->map_used) { > - int next = chunk->map[i + 1]; > - int nend = chunk->map[i + 2] & ~1; > - > - if (!(next & 1) && nend >= round_up(end, PAGE_SIZE)) > - end = round_up(end, PAGE_SIZE); > - } > - > - return max_t(int, PFN_DOWN(end) - PFN_UP(off), 0); > -} > - > -/** > * pcpu_chunk_relocate - put chunk in the appropriate chunk slot > * @chunk: chunk of interest > * @oslot: the previous slot it was on > @@ -386,385 +393,770 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot) > } > > /** > - * pcpu_need_to_extend - determine whether chunk area map needs to be extended > + * pcpu_cnt_pop_pages- counts populated backing pages in range > * @chunk: chunk of interest > - * @is_atomic: the allocation context > + * @start: start index > + * @end: end index > * > - * Determine whether area map of @chunk needs to be extended. If > - * @is_atomic, only the amount necessary for a new allocation is > - * considered; however, async extension is scheduled if the left amount is > - * low. If !@is_atomic, it aims for more empty space. Combined, this > - * ensures that the map is likely to have enough available space to > - * accomodate atomic allocations which can't extend maps directly. > - * > - * CONTEXT: > - * pcpu_lock. > + * Calculates the number of populated pages in the region [start, end). > + * This lets us keep track of how many empty populated pages are available > + * and decide if we should schedule async work. > * > * RETURNS: > - * New target map allocation length if extension is necessary, 0 > - * otherwise. > + * The nr of populated pages. 
> */ > -static int pcpu_need_to_extend(struct pcpu_chunk *chunk, bool is_atomic) > +static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk, > + int start, int end) > { > - int margin, new_alloc; > - > - lockdep_assert_held(&pcpu_lock); > + return bitmap_weight(chunk->populated, end) - > + bitmap_weight(chunk->populated, start); > +} > > - if (is_atomic) { > - margin = 3; > +/** > + * pcpu_chunk_update_hint - updates metadata about a chunk > + * @chunk: chunk of interest > + * > + * Responsible for iterating over metadata blocks to aggregate the > + * overall statistics of the chunk. > + * > + * Updates: > + * chunk->contig_hint > + * chunk->contig_hint_start > + * nr_empty_pop_pages > + */ > +static void pcpu_chunk_update_hint(struct pcpu_chunk *chunk) > +{ > + bool is_page_empty = true; > + int i, off, cur_contig, nr_empty_pop_pages, l_pop_off; > + struct pcpu_bitmap_md *block; > + > + chunk->contig_hint = cur_contig = 0; > + off = nr_empty_pop_pages = 0; > + l_pop_off = pcpu_block_get_first_page(chunk->first_free_block); > + > + for (i = chunk->first_free_block, block = chunk->md_blocks + i; > + i < pcpu_nr_pages_to_blocks(chunk); i++, block++) { > + /* Manage nr_empty_pop_pages. > + * > + * This is tricky. So the the background work function is > + * triggered when there are not enough free populated pages. > + * This is necessary to make sure atomic allocations can > + * succeed. > + * > + * The first page of each block is kept track of here allowing > + * this to scale in both situations where there are > 1 page > + * per block and where a block may be a portion of a page. > + */ > + int pop_off = pcpu_block_get_first_page(i); > + > + if (pop_off > l_pop_off) { > + if (is_page_empty) > + nr_empty_pop_pages += > + pcpu_cnt_pop_pages(chunk, l_pop_off, > + pop_off); > + l_pop_off = pop_off; > + is_page_empty = true; > + } > + if (block->contig_hint != PCPU_BITMAP_BLOCK_SIZE) > + is_page_empty = false; > > - if (chunk->map_alloc < > - chunk->map_used + PCPU_ATOMIC_MAP_MARGIN_LOW) { > - if (list_empty(&chunk->map_extend_list)) { > - list_add_tail(&chunk->map_extend_list, > - &pcpu_map_extend_chunks); > - pcpu_schedule_balance_work(); > + /* continue from prev block adding to the cur_contig hint */ > + if (cur_contig) { > + cur_contig += block->left_free; > + if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) { > + continue; > + } else if (cur_contig > chunk->contig_hint) { > + chunk->contig_hint = cur_contig; > + chunk->contig_hint_start = off; > } > + cur_contig = 0; > } > - } else { > - margin = PCPU_ATOMIC_MAP_MARGIN_HIGH; > + /* check if the block->contig_hint is larger */ > + if (block->contig_hint > chunk->contig_hint) { > + chunk->contig_hint = block->contig_hint; > + chunk->contig_hint_start = > + pcpu_block_off_to_off(i, > + block->contig_hint_start); > + } > + /* let the next iteration catch the right_free */ > + cur_contig = block->right_free; > + off = (i + 1) * PCPU_BITMAP_BLOCK_SIZE - block->right_free; > } > > - if (chunk->map_alloc >= chunk->map_used + margin) > - return 0; > - > - new_alloc = PCPU_DFL_MAP_ALLOC; > - while (new_alloc < chunk->map_used + margin) > - new_alloc *= 2; > + /* catch last iteration if the last block ends with free space */ > + if (cur_contig > chunk->contig_hint) { > + chunk->contig_hint = cur_contig; > + chunk->contig_hint_start = off; > + } > > - return new_alloc; > + /* > + * Keep track of nr_empty_pop_pages. 
> + * > + * The chunk is maintains the previous number of free pages it held, > + * so the delta is used to update the global counter. The reserved > + * chunk is not part of the free page count as they are populated > + * at init and are special to serving reserved allocations. > + */ > + if (is_page_empty) { > + nr_empty_pop_pages += pcpu_cnt_pop_pages(chunk, l_pop_off, > + chunk->nr_pages); > + } > + if (chunk != pcpu_reserved_chunk) > + pcpu_nr_empty_pop_pages += > + (nr_empty_pop_pages - chunk->nr_empty_pop_pages); > + chunk->nr_empty_pop_pages = nr_empty_pop_pages; > } > > /** > - * pcpu_extend_area_map - extend area map of a chunk > + * pcpu_block_update_hint > * @chunk: chunk of interest > - * @new_alloc: new target allocation length of the area map > + * @index: block index of the metadata block > * > - * Extend area map of @chunk to have @new_alloc entries. > + * Full scan over the entire block to recalculate block-level metadata. > + */ > +static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index) > +{ > + unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index); > + struct pcpu_bitmap_md *block = chunk->md_blocks + index; > + bool is_left_free = false, is_right_free = false; > + int contig; > + unsigned long start, end; > + > + block->contig_hint = 0; > + start = end = block->first_free; > + while (start < PCPU_BITMAP_BLOCK_SIZE) { > + /* > + * Scans the allocation map corresponding to this block > + * to find free fragments and update metadata accordingly. > + */ > + start = find_next_zero_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE, > + start); > + if (start >= PCPU_BITMAP_BLOCK_SIZE) > + break; > + /* returns PCPU_BITMAP_BLOCK_SIZE if no next bit is found */ > + end = find_next_bit(alloc_map, PCPU_BITMAP_BLOCK_SIZE, start); > + /* update left_free */ > + contig = end - start; > + if (start == 0) { > + block->left_free = contig; > + is_left_free = true; > + } > + /* update right_free */ > + if (end == PCPU_BITMAP_BLOCK_SIZE) { > + block->right_free = contig; > + is_right_free = true; > + } > + /* update block contig_hints */ > + if (block->contig_hint < contig) { > + block->contig_hint = contig; > + block->contig_hint_start = start; > + } > + start = end; > + } > + > + /* clear left/right free hints */ > + if (!is_left_free) > + block->left_free = 0; > + if (!is_right_free) > + block->right_free = 0; > +} > + > +/** > + * pcpu_block_update_hint_alloc - update hint on allocation path > + * @chunk: chunk of interest > + * @bit_off: bitmap offset > + * @bit_size: size of request in allocation units > * > - * CONTEXT: > - * Does GFP_KERNEL allocation. Grabs and releases pcpu_lock. > + * Updates metadata for the allocation path. The metadata only has to be > + * refreshed by a full scan iff we break the largest contig region. > * > * RETURNS: > - * 0 on success, -errno on failure. > + * Bool if we need to update the chunk's metadata. This occurs only if we > + * break the chunk's contig hint. 
> */ > -static int pcpu_extend_area_map(struct pcpu_chunk *chunk, int new_alloc) > +static bool pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off, > + int bit_size) > { > - int *old = NULL, *new = NULL; > - size_t old_size = 0, new_size = new_alloc * sizeof(new[0]); > - unsigned long flags; > - > - lockdep_assert_held(&pcpu_alloc_mutex); > + bool update_chunk = false; > + int i; > + int s_index, e_index, s_off, e_off; > + struct pcpu_bitmap_md *s_block, *e_block, *block; > > - new = pcpu_mem_zalloc(new_size); > - if (!new) > - return -ENOMEM; > + /* calculate per block offsets */ > + s_index = pcpu_off_to_block_index(bit_off); > + e_index = pcpu_off_to_block_index(bit_off + bit_size); > + s_off = pcpu_off_to_block_off(bit_off); > + e_off = pcpu_off_to_block_off(bit_off + bit_size); > > - /* acquire pcpu_lock and switch to new area map */ > - spin_lock_irqsave(&pcpu_lock, flags); > + /* > + * If the offset is the beginning of the next block, set it to the > + * end of the previous block as the last bit is the exclusive. > + */ > + if (e_off == 0) { > + e_off = PCPU_BITMAP_BLOCK_SIZE; > + e_index--; > + } > > - if (new_alloc <= chunk->map_alloc) > - goto out_unlock; > + s_block = chunk->md_blocks + s_index; > + e_block = chunk->md_blocks + e_index; > > - old_size = chunk->map_alloc * sizeof(chunk->map[0]); > - old = chunk->map; > + /* > + * Update s_block. > + * > + * block->first_free must be updated if the allocation takes its place. > + * If the allocation breaks the contig_hint, a scan is required to > + * restore this hint. > + */ > + if (s_off == s_block->first_free) > + s_block->first_free = find_next_zero_bit( > + pcpu_index_alloc_map(chunk, s_index), > + PCPU_BITMAP_BLOCK_SIZE, > + s_off + bit_size); > + > + if (s_off >= s_block->contig_hint_start && > + s_off < s_block->contig_hint_start + s_block->contig_hint) { > + pcpu_block_refresh_hint(chunk, s_index); > + } else { > + /* update left and right contig manually */ > + s_block->left_free = min(s_block->left_free, s_off); > + if (s_index == e_index) > + s_block->right_free = min_t(int, s_block->right_free, > + PCPU_BITMAP_BLOCK_SIZE - e_off); > + else > + s_block->right_free = 0; > + } > > - memcpy(new, old, old_size); > + /* > + * Update e_block. > + * If they are different, then e_block's first_free is guaranteed to > + * be the extend of e_off. first_free must be updated and a scan > + * over e_block is issued. > + */ > + if (s_index != e_index) { > + e_block->first_free = find_next_zero_bit( > + pcpu_index_alloc_map(chunk, e_index), > + PCPU_BITMAP_BLOCK_SIZE, e_off); > > - chunk->map_alloc = new_alloc; > - chunk->map = new; > - new = NULL; > + pcpu_block_refresh_hint(chunk, e_index); > + } > > -out_unlock: > - spin_unlock_irqrestore(&pcpu_lock, flags); > + /* update in-between md_blocks */ > + for (i = s_index + 1, block = chunk->md_blocks + i; i < e_index; > + i++, block++) { > + block->contig_hint = 0; > + block->left_free = 0; > + block->right_free = 0; > + } > > /* > - * pcpu_mem_free() might end up calling vfree() which uses > - * IRQ-unsafe lock and thus can't be called under pcpu_lock. > + * The only time a full chunk scan is required is if the global > + * contig_hint is broken. Otherwise, it means a smaller space > + * was used and therefore the global contig_hint is still correct. 
> */ > - pcpu_mem_free(old); > - pcpu_mem_free(new); > + if (bit_off >= chunk->contig_hint_start && > + bit_off < chunk->contig_hint_start + chunk->contig_hint) > + update_chunk = true; > > - return 0; > + return update_chunk; > } > > /** > - * pcpu_fit_in_area - try to fit the requested allocation in a candidate area > - * @chunk: chunk the candidate area belongs to > - * @off: the offset to the start of the candidate area > - * @this_size: the size of the candidate area > - * @size: the size of the target allocation > - * @align: the alignment of the target allocation > - * @pop_only: only allocate from already populated region > - * > - * We're trying to allocate @size bytes aligned at @align. @chunk's area > - * at @off sized @this_size is a candidate. This function determines > - * whether the target allocation fits in the candidate area and returns the > - * number of bytes to pad after @off. If the target area doesn't fit, -1 > - * is returned. > - * > - * If @pop_only is %true, this function only considers the already > - * populated part of the candidate area. > + * pcpu_block_update_hint_free - updates the block hints on the free path > + * @chunk: chunk of interest > + * @bit_off: bitmap offset > + * @bit_size: size of request in allocation units > + * > + * Updates the hint along the free path by taking advantage of current metadata > + * to minimize scanning of the bitmap. Triggers a global update if an entire > + * block becomes free or the free spans across blocks. This tradeoff is to > + * minimize global scanning to update the chunk->contig_hint. The > + * chunk->contig_hint may be off by up to a block, but a chunk->contig_hint > + * will never be more than the available space. If the chunk->contig_hint is > + * in this block, it will be accurate. > + * > + * RETURNS: > + * Bool if we need to update the chunk's metadata. This occurs if a larger > + * contig region is created along the edges or we free across blocks. > */ > -static int pcpu_fit_in_area(struct pcpu_chunk *chunk, int off, int this_size, > - int size, int align, bool pop_only) > +static bool pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off, > + int bit_size) > { > - int cand_off = off; > + bool update_chunk = false; > + int i; > + int s_index, e_index, s_off, e_off; > + int start, end, contig; > + struct pcpu_bitmap_md *s_block, *e_block, *block; > > - while (true) { > - int head = ALIGN(cand_off, align) - off; > - int page_start, page_end, rs, re; > + /* calculate per block offsets */ > + s_index = pcpu_off_to_block_index(bit_off); > + e_index = pcpu_off_to_block_index(bit_off + bit_size); > + s_off = pcpu_off_to_block_off(bit_off); > + e_off = pcpu_off_to_block_off(bit_off + bit_size); > + > + /* > + * If the offset is the beginning of the next block, set it to the > + * end of the previous block as the last bit is the exclusive. > + */ > + if (e_off == 0) { > + e_off = PCPU_BITMAP_BLOCK_SIZE; > + e_index--; > + } > + > + s_block = chunk->md_blocks + s_index; > + e_block = chunk->md_blocks + e_index; > + > + /* > + * Check if the freed area aligns with the block->contig_hint. > + * If it does, then the scan to find the beginning/end of the > + * larger free area can be avoided. > + * > + * start and end refer to beginning and end of the free region > + * within each their respective blocks. This is not necessarily > + * the entire free region as it may span blocks past the beginning > + * or end of the block. 
> + */ > + start = s_off; > + if (s_off == s_block->contig_hint + s_block->contig_hint_start) { > + start = s_block->contig_hint_start; > + } else { > + int l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index), > + start); > + start = (start == l_bit) ? 0 : l_bit + 1; > + } > + > + end = e_off; > + if (e_off == e_block->contig_hint_start) > + end = e_block->contig_hint_start + e_block->contig_hint; > + else > + end = find_next_bit(pcpu_index_alloc_map(chunk, e_index), > + PCPU_BITMAP_BLOCK_SIZE, end); > > - if (this_size < head + size) > - return -1; > + /* freeing in the same block */ > + if (s_index == e_index) { > + contig = end - start; > > - if (!pop_only) > - return head; > + if (start == 0) > + s_block->left_free = contig; > > + if (end == PCPU_BITMAP_BLOCK_SIZE) > + s_block->right_free = contig; > + > + s_block->first_free = min(s_block->first_free, start); > + if (contig > s_block->contig_hint) { > + s_block->contig_hint = contig; > + s_block->contig_hint_start = start; > + } > + > + } else { > /* > - * If the first unpopulated page is beyond the end of the > - * allocation, the whole allocation is populated; > - * otherwise, retry from the end of the unpopulated area. > + * Freeing across md_blocks. > + * > + * If the start is at the beginning of the block, just > + * reset the block instead. > */ > - page_start = PFN_DOWN(head + off); > - page_end = PFN_UP(head + off + size); > - > - rs = page_start; > - pcpu_next_unpop(chunk, &rs, &re, PFN_UP(off + this_size)); > - if (rs >= page_end) > - return head; > - cand_off = re * PAGE_SIZE; > + if (start == 0) { > + s_index--; > + } else { > + /* > + * Knowing that the free is across blocks, this means > + * the hint can be updated on the right side and the > + * left side does not need to be touched. > + */ > + s_block->first_free = min(s_block->first_free, start); > + contig = PCPU_BITMAP_BLOCK_SIZE - start; > + s_block->right_free = contig; > + if (contig > s_block->contig_hint) { > + s_block->contig_hint = contig; > + s_block->contig_hint_start = start; > + } > + } > + /* > + * If end is the entire e_block, just reset the block > + * as well. > + */ > + if (end == PCPU_BITMAP_BLOCK_SIZE) { > + e_index++; > + } else { > + /* > + * The hint must only be on the left side, so > + * update accordingly. > + */ > + e_block->first_free = 0; > + e_block->left_free = end; > + if (end > e_block->contig_hint) { > + e_block->contig_hint = end; > + e_block->contig_hint_start = 0; > + } > + } > + > + /* reset md_blocks in the middle */ > + for (i = s_index + 1, block = chunk->md_blocks + i; > + i < e_index; i++, block++) { > + block->first_free = 0; > + block->contig_hint_start = 0; > + block->contig_hint = PCPU_BITMAP_BLOCK_SIZE; > + block->left_free = PCPU_BITMAP_BLOCK_SIZE; > + block->right_free = PCPU_BITMAP_BLOCK_SIZE; > + } > } > + > + /* > + * The hint is only checked in the s_block and e_block when > + * freeing and particularly only when it is self contained within > + * its own block. A scan is required if the free space spans > + * blocks or makes a block whole as the scan will take into > + * account free space across blocks. 
> + */ > + if ((start == 0 && end == PCPU_BITMAP_BLOCK_SIZE) || > + s_index != e_index) { > + update_chunk = true; > + } else if (s_block->contig_hint > chunk->contig_hint) { > + /* check if block contig_hint is bigger */ > + chunk->contig_hint = s_block->contig_hint; > + chunk->contig_hint_start = > + pcpu_block_off_to_off(s_index, > + s_block->contig_hint_start); > + } > + > + return update_chunk; > } > > /** > - * pcpu_alloc_area - allocate area from a pcpu_chunk > + * pcpu_is_populated - determines if the region is populated > * @chunk: chunk of interest > - * @size: wanted size in bytes > - * @align: wanted align > - * @pop_only: allocate only from the populated area > - * @occ_pages_p: out param for the number of pages the area occupies > + * @index: block index > + * @block_off: offset within the bitmap > + * @bit_size: size of request in allocation units > + * @next_index: return value for next block index that is populated > * > - * Try to allocate @size bytes area aligned at @align from @chunk. > - * Note that this function only allocates the offset. It doesn't > - * populate or map the area. > + * For atomic allocations, we must check if the backing pages are populated. > * > - * @chunk->map must have at least two free slots. > + * RETURNS: > + * Bool if the backing pages are populated. next_index is to skip over > + * unpopulated blocks in pcpu_find_block_fit. > + */ > +static bool pcpu_is_populated(struct pcpu_chunk *chunk, int index, > + int block_off, int bit_size, int *next_index) > +{ > + int page_start, page_end, rs, re; > + int off = pcpu_block_off_to_off(index, block_off); > + int e_off = off + bit_size * PCPU_MIN_ALLOC_SIZE; > + > + page_start = PFN_DOWN(off); > + page_end = PFN_UP(e_off); > + > + rs = page_start; > + pcpu_next_unpop(chunk, &rs, &re, PFN_UP(e_off)); > + if (rs >= page_end) > + return true; > + *next_index = re * PAGE_SIZE / PCPU_BITMAP_BLOCK_SIZE; > + return false; > +} > + > +/** > + * pcpu_find_block_fit - finds the block index to start searching > + * @chunk: chunk of interest > + * @bit_size: size of request in allocation units > + * @align: alignment of area (max PAGE_SIZE) > + * @pop_only: use populated regions only > * > - * CONTEXT: > - * pcpu_lock. > + * Given a chunk and an allocation spec, find the offset to begin searching > + * for a free region. This is done by iterating over the bitmap metadata > + * blocks and then only returning regions that will be guaranteed to fit > + * alignment by comparing against the block->contig_hint_start or a correctly > + * aligned offset. Iteration is used within a block as an allocation may be > + * able to be served prior to the contig_hint. > + * > + * Note: This errs on the side of caution by only selecting blocks guaranteed > + * to have a fit in the chunk's contig_hint. Poor alignment can cause > + * us to skip over chunk's that have valid vacancies. > * > * RETURNS: > - * Allocated offset in @chunk on success, -1 if no matching area is > - * found. > + * The offset in the bitmap to begin searching. > + * -1 if no offset is found. 
> */ > -static int pcpu_alloc_area(struct pcpu_chunk *chunk, int size, int align, > - bool pop_only, int *occ_pages_p) > +static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int bit_size, > + size_t align, bool pop_only) > { > - int oslot = pcpu_chunk_slot(chunk); > - int max_contig = 0; > - int i, off; > - bool seen_free = false; > - int *p; > - > - for (i = chunk->first_free, p = chunk->map + i; i < chunk->map_used; i++, p++) { > - int head, tail; > - int this_size; > - > - off = *p; > - if (off & 1) > - continue; > + int i, cur_free; > + int s_index, block_off, next_index, end_off; /* interior alloc index */ > + struct pcpu_bitmap_md *block; > + unsigned long *alloc_map; > > - this_size = (p[1] & ~1) - off; > + lockdep_assert_held(&pcpu_lock); > > - head = pcpu_fit_in_area(chunk, off, this_size, size, align, > - pop_only); > - if (head < 0) { > - if (!seen_free) { > - chunk->first_free = i; > - seen_free = true; > - } > - max_contig = max(this_size, max_contig); > + cur_free = block_off = 0; > + s_index = chunk->first_free_block; > + for (i = chunk->first_free_block; i < pcpu_nr_pages_to_blocks(chunk); > + i++) { > + alloc_map = pcpu_index_alloc_map(chunk, i); > + block = chunk->md_blocks + i; > + > + /* continue from prev block */ > + cur_free += block->left_free; > + if (cur_free >= bit_size) { > + end_off = bit_size; > + goto check_populated; > + } else if (block->left_free == PCPU_BITMAP_BLOCK_SIZE) { > continue; > } > > /* > - * If head is small or the previous block is free, > - * merge'em. Note that 'small' is defined as smaller > - * than sizeof(int), which is very small but isn't too > - * uncommon for percpu allocations. > + * Can this block hold this alloc? > + * > + * Here the block->contig_hint is used to guarantee a fit, > + * but the block->first_free is returned as we may be able > + * to serve the allocation earlier. The population check > + * must take into account the area beginning at first_free > + * through the end of the contig_hint. 
> */ > - if (head && (head < sizeof(int) || !(p[-1] & 1))) { > - *p = off += head; > - if (p[-1] & 1) > - chunk->free_size -= head; > - else > - max_contig = max(*p - p[-1], max_contig); > - this_size -= head; > - head = 0; > + cur_free = 0; > + s_index = i; > + block_off = ALIGN(block->contig_hint_start, align); > + block_off -= block->contig_hint_start; > + if (block->contig_hint >= block_off + bit_size) { > + block_off = block->first_free; > + end_off = block->contig_hint_start - block_off + > + bit_size; > + goto check_populated; > } > > - /* if tail is small, just keep it around */ > - tail = this_size - head - size; > - if (tail < sizeof(int)) { > - tail = 0; > - size = this_size - head; > + /* check right */ > + block_off = ALIGN(PCPU_BITMAP_BLOCK_SIZE - block->right_free, > + align); > + /* reset to start looking in the next block */ > + if (block_off >= PCPU_BITMAP_BLOCK_SIZE) { > + s_index++; > + cur_free = block_off = 0; > + continue; > } > + cur_free = PCPU_BITMAP_BLOCK_SIZE - block_off; > + if (cur_free >= bit_size) { > + end_off = bit_size; > +check_populated: > + if (!pop_only || > + pcpu_is_populated(chunk, s_index, block_off, > + end_off, &next_index)) > + break; > > - /* split if warranted */ > - if (head || tail) { > - int nr_extra = !!head + !!tail; > - > - /* insert new subblocks */ > - memmove(p + nr_extra + 1, p + 1, > - sizeof(chunk->map[0]) * (chunk->map_used - i)); > - chunk->map_used += nr_extra; > - > - if (head) { > - if (!seen_free) { > - chunk->first_free = i; > - seen_free = true; > - } > - *++p = off += head; > - ++i; > - max_contig = max(head, max_contig); > - } > - if (tail) { > - p[1] = off + size; > - max_contig = max(tail, max_contig); > - } > + i = next_index - 1; > + s_index = next_index; > + cur_free = block_off = 0; > } > + } > > - if (!seen_free) > - chunk->first_free = i + 1; > + /* nothing found */ > + if (i == pcpu_nr_pages_to_blocks(chunk)) > + return -1; > > - /* update hint and mark allocated */ > - if (i + 1 == chunk->map_used) > - chunk->contig_hint = max_contig; /* fully scanned */ > - else > - chunk->contig_hint = max(chunk->contig_hint, > - max_contig); > + return s_index * PCPU_BITMAP_BLOCK_SIZE + block_off; > +} > > - chunk->free_size -= size; > - *p |= 1; > > - *occ_pages_p = pcpu_count_occupied_pages(chunk, i); > - pcpu_chunk_relocate(chunk, oslot); > - return off; > - } > +/** > + * pcpu_alloc_area - allocates area from a pcpu_chunk > + * @chunk: chunk of interest > + * @bit_size: size of request in allocation units > + * @align: alignment of area (max PAGE_SIZE) > + * @start: bit_off to start searching > + * > + * This function takes in a start bit_offset to begin searching. It > + * searches the allocation bitmap to verify that the offset is available > + * as block->first_free is provided when allocation within a block is > + * available. > + * > + * RETURNS: > + * Allocated addr offset in @chunk on success, > + * -1 if no matching area is found > + */ > +static int pcpu_alloc_area(struct pcpu_chunk *chunk, int bit_size, > + size_t align, int start) > +{ > + size_t align_mask = (align) ? 
(align - 1) : 0; > + int i, bit_off, oslot; > + struct pcpu_bitmap_md *block; > + > + lockdep_assert_held(&pcpu_lock); > + > + oslot = pcpu_chunk_slot(chunk); > + > + /* search to find fit */ > + bit_off = bitmap_find_next_zero_area(chunk->alloc_map, > + pcpu_nr_pages_to_bits(chunk), > + start, bit_size, align_mask); > + > + if (bit_off >= pcpu_nr_pages_to_bits(chunk)) > + return -1; > + > + /* update alloc map */ > + bitmap_set(chunk->alloc_map, bit_off, bit_size); > + /* update boundary map */ > + set_bit(bit_off, chunk->bound_map); > + bitmap_clear(chunk->bound_map, bit_off + 1, bit_size - 1); > + set_bit(bit_off + bit_size, chunk->bound_map); > + Actually I decided I do want to complain about this. Have you considered making chunks statically sized, like slab does? We could avoid this whole bound_map thing completely and save quite a few cycles trying to figure out how big our allocation was. Thanks, Josef
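
A footnote on the bound_map question: the allocation map by itself cannot say
where one allocation ends and the next begins (adjacent allocations form a
single run of set bits), which is why the free path has to scan bound_map for
the next boundary to learn how much it is freeing. Below is a minimal
userspace model of that bookkeeping; it is my own sketch of the idea, not code
from the patch, and plain byte arrays stand in for the kernel bitmaps.

#include <stdio.h>
#include <string.h>

#define NR_BITS 64

static unsigned char alloc_map[NR_BITS];
static unsigned char bound_map[NR_BITS + 1];

/* mark [off, off + bits) allocated and record its boundaries */
static void model_alloc(int off, int bits)
{
	memset(&alloc_map[off], 1, bits);
	memset(&bound_map[off + 1], 0, bits - 1);
	bound_map[off] = 1;
	bound_map[off + bits] = 1;
}

/* free path: the size must be recovered by scanning bound_map */
static int model_free(int off)
{
	int end = off + 1;

	while (end <= NR_BITS && !bound_map[end])
		end++;
	memset(&alloc_map[off], 0, end - off);
	return end - off;	/* bits freed */
}

int main(void)
{
	model_alloc(8, 5);
	model_alloc(13, 3);
	printf("freed %d bits at offset 8\n", model_free(8));	/* prints 5 */
	return 0;
}

With statically sized chunks, as suggested above, the freed size would be
implied by the chunk's size class, and both the boundary scan and bound_map
itself could be dropped.
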
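Also, for reference, the way the per-block hints bound scanning at the chunk
level can be summarized with a simplified model of the aggregation loop in
pcpu_chunk_update_hint(): free runs touching block edges are stitched together
and only per-block hints are consulted, so recomputing the chunk-wide
contig_hint never walks the raw bitmap. This sketch tracks only the hint size,
not its offset or the populated-page accounting, and the struct below is a
stand-in for the patch's pcpu_bitmap_md.

#include <stdio.h>

#define BLOCK_SIZE 1024

struct block_md {
	int contig_hint;	/* largest free run inside the block */
	int left_free;		/* free run touching the block's left edge */
	int right_free;		/* free run touching the block's right edge */
};

static int chunk_contig_hint(const struct block_md *b, int nr_blocks)
{
	int hint = 0, cur = 0;

	for (int i = 0; i < nr_blocks; i++) {
		/* extend a run carried in from the previous block's right edge */
		cur += b[i].left_free;
		if (b[i].left_free == BLOCK_SIZE)
			continue;	/* block fully free, keep extending */
		if (cur > hint)
			hint = cur;
		/* a run fully contained in this block */
		if (b[i].contig_hint > hint)
			hint = b[i].contig_hint;
		/* start a new run from this block's right edge */
		cur = b[i].right_free;
	}
	if (cur > hint)
		hint = cur;
	return hint;
}

int main(void)
{
	/* block 0 ends with 100 free bits, block 1 is fully free,
	 * block 2 starts with 50 free bits: a 100 + 1024 + 50 bit run */
	struct block_md b[3] = {
		{ .contig_hint = 200, .left_free = 0, .right_free = 100 },
		{ .contig_hint = BLOCK_SIZE, .left_free = BLOCK_SIZE,
		  .right_free = BLOCK_SIZE },
		{ .contig_hint = 50, .left_free = 50, .right_free = 0 },
	};

	printf("chunk contig_hint: %d\n", chunk_contig_hint(b, 3));	/* 1174 */
	return 0;
}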