Hi everyone,
It was reported a while ago [1] that an increase in allocation alignment
requirement [2] caused the percpu memory allocator to do significantly
more work.
After spending quite a bit of time diving into it, it seems the crux was
the following:
1) chunk management by free_bytes caused allocations to scan over
chunks that could not fit due to fragmentation
2) per block fragmentation required scanning from an early first_free
bit causing allocations to repeat work
This series introduces a scan_hint for pcpu_block_md and merges the
paths used to manage the hints. The scan_hint represents the largest
known free area prior to the contig_hint. There are some caveats to
this. First, it may not necessarily be the largest area as we do partial
updates based on freeing of regions and failed scanning in
pcpu_alloc_area(). Second, if contig_hint == scan_hint, then
scan_hint_start > contig_hint_start is possible. This is necessary
for scan_hint discovery when refreshing the hint of a block.
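For illustration only, a hypothetical layout of a single 32-bit block
('x' = allocated, '.' = free) and the hints it would carry:

  alloc_map:  xxxx....xxxx..........xxxx......
              bits [4,8), [12,22), and [26,32) are free

  contig_hint = 10, contig_hint_start = 12  (largest known free area)
  scan_hint   =  4, scan_hint_start   =  4  (largest known free area
                                             before the contig_hint)
  left_free = 0, right_free = 6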
A necessary change is to enforce that a block is the size of a page. This
lets the management of nr_empty_pop_pages be done by breaking and
making full contig_hints in the hint update paths. Previously, this was
done by piggybacking off of refreshing the chunk contig_hint, as it
performed a full scan and counted empty full pages.
The following are the results found using the workload provided in [3].
branch | time
------------------------
5.0-rc7 | 69s
[2] reverted | 44s
scan_hint | 39s
The times above represent the approximate average across multiple runs.
I also tested a basic pattern of 1M 16-byte allocations with no
alignment requirement; times did not differ between 5.0-rc7 and
scan_hint.
[1] https://lore.kernel.org/netdev/CANn89iKb_vW+LA-91RV=zuAqbNycPFUYW54w_S=KZ3HdcWPw6Q@mail.gmail.com/
[2] https://lore.kernel.org/netdev/[email protected]/
[3] https://lore.kernel.org/netdev/[email protected]/
This patchset contains the following 12 patches:
0001-percpu-update-free-path-with-correct-new-free-region.patch
0002-percpu-do-not-search-past-bitmap-when-allocating-an-.patch
0003-percpu-introduce-helper-to-determine-if-two-regions-.patch
0004-percpu-manage-chunks-based-on-contig_bits-instead-of.patch
0005-percpu-relegate-chunks-unusable-when-failing-small-a.patch
0006-percpu-set-PCPU_BITMAP_BLOCK_SIZE-to-PAGE_SIZE.patch
0007-percpu-add-block-level-scan_hint.patch
0008-percpu-remember-largest-area-skipped-during-allocati.patch
0009-percpu-use-block-scan_hint-to-only-scan-forward.patch
0010-percpu-make-pcpu_block_md-generic.patch
0011-percpu-convert-chunk-hints-to-be-based-on-pcpu_block.patch
0012-percpu-use-chunk-scan_hint-to-skip-some-scanning.patch
0001 fixes an issue where the chunk contig_hint was being updated
improperly with the new region's starting offset and possibly differing
contig_hint. 0002 fixes possibly scanning past the end of the bitmap.
0003 introduces a helper to do region overlap comparison. 0004 switches
to chunk management by contig_hint rather than free_bytes. 0005 moves
chunks that fail to allocate to the empty block list to prevent excess
scanning of chunks with small contig_hints and poor alignment.
0006 introduces the constraint PCPU_BITMAP_BLOCK_SIZE == PAGE_SIZE and
modifies nr_empty_pop_pages management to be a part of the hint updates.
0007-0009 introduce the percpu block scan_hint. 0010 makes pcpu_block_md
generic so chunk hints can be managed as a pcpu_block_md responsible
for more bits. 0011-0012 add chunk scan_hints.
This patchset is on top of percpu#master a3b22b9f11d9.
diffstats below:
Dennis Zhou (12):
percpu: update free path with correct new free region
percpu: do not search past bitmap when allocating an area
percpu: introduce helper to determine if two regions overlap
percpu: manage chunks based on contig_bits instead of free_bytes
percpu: relegate chunks unusable when failing small allocations
percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
percpu: add block level scan_hint
percpu: remember largest area skipped during allocation
percpu: use block scan_hint to only scan forward
percpu: make pcpu_block_md generic
percpu: convert chunk hints to be based on pcpu_block_md
percpu: use chunk scan_hint to skip some scanning
include/linux/percpu.h | 12 +-
mm/percpu-internal.h | 15 +-
mm/percpu-km.c | 2 +-
mm/percpu-stats.c | 5 +-
mm/percpu.c | 547 +++++++++++++++++++++++++++++------------
5 files changed, 404 insertions(+), 177 deletions(-)
Thanks,
Dennis
When updating the chunk's contig_hint on the free path for a freed
region that does not touch the page boundaries, the code incorrectly
used the starting offset of the free region together with the block's
contig_hint as the size. This could lead to incorrect assumptions about
fit given a size and the better alignment of the start. Fix this by
using (end - start), as this is only called when updating a hint within
a block.
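For example, with hypothetical numbers: a free that leaves a free region
spanning block bits [8, 24) in a block whose contig_hint happens to be
64 previously passed that 64 as the size to pcpu_chunk_update() along
with the offset of bit 8, overstating the free area at that offset;
passing end - start = 16 reports the region that is actually free there.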
Signed-off-by: Dennis Zhou <[email protected]>
---
mm/percpu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index db86282fd024..53bd79a617b1 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -871,7 +871,7 @@ static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
pcpu_chunk_refresh_hint(chunk);
else
pcpu_chunk_update(chunk, pcpu_block_off_to_off(s_index, start),
- s_block->contig_hint);
+ end - start);
}
/**
--
2.17.1
pcpu_find_block_fit() guarantees that a fit is found within
PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as
it compares against the block's contig_hint. This can lead to
incorrectly scanning past the end of the bitmap. The behavior was okay
given the subsequent check for bit_off >= end and the correctness of the
hints from pcpu_find_block_fit().
This patch fixes this by bounding the end offset by the number of bits
in a chunk.
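As a hypothetical example with PCPU_BITMAP_BLOCK_BITS == 1024 and a
4096-bit chunk: a fit found at start = 3600 for a 32-bit allocation
yields end = 3600 + 32 + 1024 = 4656, which is past the bitmap; bounding
end by pcpu_chunk_map_bits(chunk) keeps bitmap_find_next_zero_area()
within the 4096 bits.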
Signed-off-by: Dennis Zhou <[email protected]>
---
mm/percpu.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 53bd79a617b1..69ca51d238b5 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -988,7 +988,8 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
/*
* Search to find a fit.
*/
- end = start + alloc_bits + PCPU_BITMAP_BLOCK_BITS;
+ end = min_t(int, start + alloc_bits + PCPU_BITMAP_BLOCK_BITS,
+ pcpu_chunk_map_bits(chunk));
bit_off = bitmap_find_next_zero_area(chunk->alloc_map, end, start,
alloc_bits, align_mask);
if (bit_off >= end)
--
2.17.1
In reality, a chunk is just a block covering a larger number of bits.
The hints themselves are one and the same. Rather than maintaining the
hints separately, first introduce nr_bits to genericize
pcpu_block_update() to correctly maintain block->right_free. The next
patch will convert chunk hints to be managed as a pcpu_block_md.
Signed-off-by: Dennis Zhou <[email protected]>
---
mm/percpu-internal.h | 1 +
mm/percpu.c | 20 +++++++++++++-------
2 files changed, 14 insertions(+), 7 deletions(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index ec58b244545d..119bd1119aa7 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -28,6 +28,7 @@ struct pcpu_block_md {
int right_free; /* size of free space along
the right side of the block */
int first_free; /* block position of first free */
+ int nr_bits; /* total bits responsible for */
};
struct pcpu_chunk {
diff --git a/mm/percpu.c b/mm/percpu.c
index e51c151ed692..7cdf14c242de 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -658,7 +658,7 @@ static void pcpu_block_update(struct pcpu_block_md *block, int start, int end)
if (start == 0)
block->left_free = contig;
- if (end == PCPU_BITMAP_BLOCK_BITS)
+ if (end == block->nr_bits)
block->right_free = contig;
if (contig > block->contig_hint) {
@@ -1271,18 +1271,24 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
pcpu_chunk_relocate(chunk, oslot);
}
+static void pcpu_init_md_block(struct pcpu_block_md *block, int nr_bits)
+{
+ block->scan_hint = 0;
+ block->contig_hint = nr_bits;
+ block->left_free = nr_bits;
+ block->right_free = nr_bits;
+ block->first_free = 0;
+ block->nr_bits = nr_bits;
+}
+
static void pcpu_init_md_blocks(struct pcpu_chunk *chunk)
{
struct pcpu_block_md *md_block;
for (md_block = chunk->md_blocks;
md_block != chunk->md_blocks + pcpu_chunk_nr_blocks(chunk);
- md_block++) {
- md_block->scan_hint = 0;
- md_block->contig_hint = PCPU_BITMAP_BLOCK_BITS;
- md_block->left_free = PCPU_BITMAP_BLOCK_BITS;
- md_block->right_free = PCPU_BITMAP_BLOCK_BITS;
- }
+ md_block++)
+ pcpu_init_md_block(md_block, PCPU_BITMAP_BLOCK_BITS);
}
/**
--
2.17.1
Just like blocks, chunks now maintain a scan_hint. This can be used to
skip some scanning by promoting the scan_hint to be the contig_hint.
The chunk's scan_hint is primarily updated on the backside and relies on
full scanning when a block becomes free or the free region spans across
blocks.
Signed-off-by: Dennis Zhou <[email protected]>
---
mm/percpu.c | 36 +++++++++++++++++++++++++++---------
1 file changed, 27 insertions(+), 9 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 197479f2c489..40d49d7fb286 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -711,20 +711,31 @@ static void pcpu_block_update_scan(struct pcpu_chunk *chunk, int bit_off,
/**
* pcpu_chunk_refresh_hint - updates metadata about a chunk
* @chunk: chunk of interest
+ * @full_scan: if we should scan from the beginning
*
* Iterates over the metadata blocks to find the largest contig area.
- * It also counts the populated pages and uses the delta to update the
- * global count.
+ * A full scan can be avoided on the allocation path as this is triggered
+ * if we broke the contig_hint. In doing so, the scan_hint will be before
+ * the contig_hint or after if the scan_hint == contig_hint. This cannot
+ * be prevented on freeing as we want to find the largest area possibly
+ * spanning blocks.
*/
-static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk)
+static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk, bool full_scan)
{
struct pcpu_block_md *chunk_md = &chunk->chunk_md;
int bit_off, bits;
- /* clear metadata */
- chunk_md->contig_hint = 0;
+ /* promote scan_hint to contig_hint */
+ if (!full_scan && chunk_md->scan_hint) {
+ bit_off = chunk_md->scan_hint_start + chunk_md->scan_hint;
+ chunk_md->contig_hint_start = chunk_md->scan_hint_start;
+ chunk_md->contig_hint = chunk_md->scan_hint;
+ chunk_md->scan_hint = 0;
+ } else {
+ bit_off = chunk_md->first_free;
+ chunk_md->contig_hint = 0;
+ }
- bit_off = chunk_md->first_free;
bits = 0;
pcpu_for_each_md_free_region(chunk, bit_off, bits) {
pcpu_block_update(chunk_md, bit_off, bit_off + bits);
@@ -884,6 +895,13 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
if (nr_empty_pages)
pcpu_update_empty_pages(chunk, -1 * nr_empty_pages);
+ if (pcpu_region_overlap(chunk_md->scan_hint_start,
+ chunk_md->scan_hint_start +
+ chunk_md->scan_hint,
+ bit_off,
+ bit_off + bits))
+ chunk_md->scan_hint = 0;
+
/*
* The only time a full chunk scan is required is if the chunk
* contig hint is broken. Otherwise, it means a smaller space
@@ -894,7 +912,7 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
chunk_md->contig_hint,
bit_off,
bit_off + bits))
- pcpu_chunk_refresh_hint(chunk);
+ pcpu_chunk_refresh_hint(chunk, false);
}
/**
@@ -1005,7 +1023,7 @@ static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
* the else condition below.
*/
if (((end - start) >= PCPU_BITMAP_BLOCK_BITS) || s_index != e_index)
- pcpu_chunk_refresh_hint(chunk);
+ pcpu_chunk_refresh_hint(chunk, true);
else
pcpu_block_update(&chunk->chunk_md,
pcpu_block_off_to_off(s_index, start),
@@ -1078,7 +1096,7 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits,
if (bit_off + alloc_bits > chunk_md->contig_hint)
return -1;
- bit_off = chunk_md->first_free;
+ bit_off = pcpu_next_hint(chunk_md, alloc_bits);
bits = 0;
pcpu_for_each_fit_region(chunk, alloc_bits, align, bit_off, bits) {
if (!pop_only || pcpu_is_populated(chunk, bit_off, bits,
--
2.17.1
Blocks now remember the latest scan_hint. This can be used on the
allocation path as when a contig_hint is broken, we can promote the
scan_hint to the contig_hint and scan forward from there. This works
because pcpu_block_refresh_hint() is only called on the allocation path
while block free regions are updated manually in
pcpu_block_update_hint_free().
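A sketch of the effect with hypothetical block state: if an allocation
breaks a contig_hint of 10 at offset 20 while the block holds a
scan_hint of 4 at offset 0, the refresh now promotes [0, 4) to be the
contig_hint and rescans only from bit 4 onward rather than from
first_free, since everything before the old scan_hint is already
accounted for.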
Signed-off-by: Dennis Zhou <[email protected]>
---
mm/percpu.c | 23 +++++++++++++++++------
1 file changed, 17 insertions(+), 6 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index dac18968d79f..e51c151ed692 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -765,14 +765,23 @@ static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
{
struct pcpu_block_md *block = chunk->md_blocks + index;
unsigned long *alloc_map = pcpu_index_alloc_map(chunk, index);
- int rs, re; /* region start, region end */
+ int rs, re, start; /* region start, region end */
+
+ /* promote scan_hint to contig_hint */
+ if (block->scan_hint) {
+ start = block->scan_hint_start + block->scan_hint;
+ block->contig_hint_start = block->scan_hint_start;
+ block->contig_hint = block->scan_hint;
+ block->scan_hint = 0;
+ } else {
+ start = block->first_free;
+ block->contig_hint = 0;
+ }
- /* clear hints */
- block->contig_hint = block->scan_hint = 0;
- block->left_free = block->right_free = 0;
+ block->right_free = 0;
/* iterate over free areas and update the contig hints */
- pcpu_for_each_unpop_region(alloc_map, rs, re, block->first_free,
+ pcpu_for_each_unpop_region(alloc_map, rs, re, start,
PCPU_BITMAP_BLOCK_BITS) {
pcpu_block_update(block, rs, re);
}
@@ -837,6 +846,8 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
s_off,
s_off + bits)) {
/* block contig hint is broken - scan to fix it */
+ if (!s_off)
+ s_block->left_free = 0;
pcpu_block_refresh_hint(chunk, s_index);
} else {
/* update left and right contig manually */
@@ -870,11 +881,11 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
if (e_off > e_block->scan_hint_start)
e_block->scan_hint = 0;
+ e_block->left_free = 0;
if (e_off > e_block->contig_hint_start) {
/* contig hint is broken - scan to fix it */
pcpu_block_refresh_hint(chunk, e_index);
} else {
- e_block->left_free = 0;
e_block->right_free =
min_t(int, e_block->right_free,
PCPU_BITMAP_BLOCK_BITS - e_off);
--
2.17.1
As mentioned in the last patch, a chunk's hints are no different than a
block just responsible for more bits. This converts chunk level hints to
use a pcpu_block_md to maintain them. This lets us reuse the same hint
helper functions as a block. The left_free and right_free are unused by
the chunk's pcpu_block_md.
Signed-off-by: Dennis Zhou <[email protected]>
---
mm/percpu-internal.h | 5 +-
mm/percpu-stats.c | 5 +-
mm/percpu.c | 120 +++++++++++++++++++------------------------
3 files changed, 57 insertions(+), 73 deletions(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index 119bd1119aa7..0468ba500bd4 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -39,9 +39,7 @@ struct pcpu_chunk {
struct list_head list; /* linked to pcpu_slot lists */
int free_bytes; /* free bytes in the chunk */
- int contig_bits; /* max contiguous size hint */
- int contig_bits_start; /* contig_bits starting
- offset */
+ struct pcpu_block_md chunk_md;
void *base_addr; /* base address of this chunk */
unsigned long *alloc_map; /* allocation map */
@@ -49,7 +47,6 @@ struct pcpu_chunk {
struct pcpu_block_md *md_blocks; /* metadata blocks */
void *data; /* chunk data */
- int first_bit; /* no free below this */
bool immutable; /* no [de]population allowed */
int start_offset; /* the overlap with the previous
region to have a page aligned
diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c
index b5fdd43b60c9..ef5034a0464e 100644
--- a/mm/percpu-stats.c
+++ b/mm/percpu-stats.c
@@ -53,6 +53,7 @@ static int find_max_nr_alloc(void)
static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
int *buffer)
{
+ struct pcpu_block_md *chunk_md = &chunk->chunk_md;
int i, last_alloc, as_len, start, end;
int *alloc_sizes, *p;
/* statistics */
@@ -121,9 +122,9 @@ static void chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
P("nr_alloc", chunk->nr_alloc);
P("max_alloc_size", chunk->max_alloc_size);
P("empty_pop_pages", chunk->nr_empty_pop_pages);
- P("first_bit", chunk->first_bit);
+ P("first_bit", chunk_md->first_free);
P("free_bytes", chunk->free_bytes);
- P("contig_bytes", chunk->contig_bits * PCPU_MIN_ALLOC_SIZE);
+ P("contig_bytes", chunk_md->contig_hint * PCPU_MIN_ALLOC_SIZE);
P("sum_frag", sum_frag);
P("max_frag", max_frag);
P("cur_min_alloc", cur_min_alloc);
diff --git a/mm/percpu.c b/mm/percpu.c
index 7cdf14c242de..197479f2c489 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -233,10 +233,13 @@ static int pcpu_size_to_slot(int size)
static int pcpu_chunk_slot(const struct pcpu_chunk *chunk)
{
- if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || chunk->contig_bits == 0)
+ const struct pcpu_block_md *chunk_md = &chunk->chunk_md;
+
+ if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE ||
+ chunk_md->contig_hint == 0)
return 0;
- return pcpu_size_to_slot(chunk->contig_bits * PCPU_MIN_ALLOC_SIZE);
+ return pcpu_size_to_slot(chunk_md->contig_hint * PCPU_MIN_ALLOC_SIZE);
}
/* set the pointer to a chunk in a page struct */
@@ -592,54 +595,6 @@ static inline bool pcpu_region_overlap(int a, int b, int x, int y)
return false;
}
-/**
- * pcpu_chunk_update - updates the chunk metadata given a free area
- * @chunk: chunk of interest
- * @bit_off: chunk offset
- * @bits: size of free area
- *
- * This updates the chunk's contig hint and starting offset given a free area.
- * Choose the best starting offset if the contig hint is equal.
- */
-static void pcpu_chunk_update(struct pcpu_chunk *chunk, int bit_off, int bits)
-{
- if (bits > chunk->contig_bits) {
- chunk->contig_bits_start = bit_off;
- chunk->contig_bits = bits;
- } else if (bits == chunk->contig_bits && chunk->contig_bits_start &&
- (!bit_off ||
- __ffs(bit_off) > __ffs(chunk->contig_bits_start))) {
- /* use the start with the best alignment */
- chunk->contig_bits_start = bit_off;
- }
-}
-
-/**
- * pcpu_chunk_refresh_hint - updates metadata about a chunk
- * @chunk: chunk of interest
- *
- * Iterates over the metadata blocks to find the largest contig area.
- * It also counts the populated pages and uses the delta to update the
- * global count.
- *
- * Updates:
- * chunk->contig_bits
- * chunk->contig_bits_start
- */
-static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk)
-{
- int bit_off, bits;
-
- /* clear metadata */
- chunk->contig_bits = 0;
-
- bit_off = chunk->first_bit;
- bits = 0;
- pcpu_for_each_md_free_region(chunk, bit_off, bits) {
- pcpu_chunk_update(chunk, bit_off, bits);
- }
-}
-
/**
* pcpu_block_update - updates a block given a free area
* @block: block of interest
@@ -753,6 +708,29 @@ static void pcpu_block_update_scan(struct pcpu_chunk *chunk, int bit_off,
pcpu_block_update(block, s_off, e_off);
}
+/**
+ * pcpu_chunk_refresh_hint - updates metadata about a chunk
+ * @chunk: chunk of interest
+ *
+ * Iterates over the metadata blocks to find the largest contig area.
+ * It also counts the populated pages and uses the delta to update the
+ * global count.
+ */
+static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk)
+{
+ struct pcpu_block_md *chunk_md = &chunk->chunk_md;
+ int bit_off, bits;
+
+ /* clear metadata */
+ chunk_md->contig_hint = 0;
+
+ bit_off = chunk_md->first_free;
+ bits = 0;
+ pcpu_for_each_md_free_region(chunk, bit_off, bits) {
+ pcpu_block_update(chunk_md, bit_off, bit_off + bits);
+ }
+}
+
/**
* pcpu_block_refresh_hint
* @chunk: chunk of interest
@@ -800,6 +778,7 @@ static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
int bits)
{
+ struct pcpu_block_md *chunk_md = &chunk->chunk_md;
int nr_empty_pages = 0;
struct pcpu_block_md *s_block, *e_block, *block;
int s_index, e_index; /* block indexes of the freed allocation */
@@ -910,8 +889,9 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
* contig hint is broken. Otherwise, it means a smaller space
* was used and therefore the chunk contig hint is still correct.
*/
- if (pcpu_region_overlap(chunk->contig_bits_start,
- chunk->contig_bits_start + chunk->contig_bits,
+ if (pcpu_region_overlap(chunk_md->contig_hint_start,
+ chunk_md->contig_hint_start +
+ chunk_md->contig_hint,
bit_off,
bit_off + bits))
pcpu_chunk_refresh_hint(chunk);
@@ -930,9 +910,10 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
*
* A chunk update is triggered if a page becomes free, a block becomes free,
* or the free spans across blocks. This tradeoff is to minimize iterating
- * over the block metadata to update chunk->contig_bits. chunk->contig_bits
- * may be off by up to a page, but it will never be more than the available
- * space. If the contig hint is contained in one block, it will be accurate.
+ * over the block metadata to update chunk_md->contig_hint.
+ * chunk_md->contig_hint may be off by up to a page, but it will never be more
+ * than the available space. If the contig hint is contained in one block, it
+ * will be accurate.
*/
static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
int bits)
@@ -1026,8 +1007,9 @@ static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
if (((end - start) >= PCPU_BITMAP_BLOCK_BITS) || s_index != e_index)
pcpu_chunk_refresh_hint(chunk);
else
- pcpu_chunk_update(chunk, pcpu_block_off_to_off(s_index, start),
- end - start);
+ pcpu_block_update(&chunk->chunk_md,
+ pcpu_block_off_to_off(s_index, start),
+ end);
}
/**
@@ -1082,6 +1064,7 @@ static bool pcpu_is_populated(struct pcpu_chunk *chunk, int bit_off, int bits,
static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits,
size_t align, bool pop_only)
{
+ struct pcpu_block_md *chunk_md = &chunk->chunk_md;
int bit_off, bits, next_off;
/*
@@ -1090,12 +1073,12 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits,
* cannot fit in the global hint, there is memory pressure and creating
* a new chunk would happen soon.
*/
- bit_off = ALIGN(chunk->contig_bits_start, align) -
- chunk->contig_bits_start;
- if (bit_off + alloc_bits > chunk->contig_bits)
+ bit_off = ALIGN(chunk_md->contig_hint_start, align) -
+ chunk_md->contig_hint_start;
+ if (bit_off + alloc_bits > chunk_md->contig_hint)
return -1;
- bit_off = chunk->first_bit;
+ bit_off = chunk_md->first_free;
bits = 0;
pcpu_for_each_fit_region(chunk, alloc_bits, align, bit_off, bits) {
if (!pop_only || pcpu_is_populated(chunk, bit_off, bits,
@@ -1190,6 +1173,7 @@ static unsigned long pcpu_find_zero_area(unsigned long *map,
static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
size_t align, int start)
{
+ struct pcpu_block_md *chunk_md = &chunk->chunk_md;
size_t align_mask = (align) ? (align - 1) : 0;
unsigned long area_off = 0, area_bits = 0;
int bit_off, end, oslot;
@@ -1222,8 +1206,8 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
chunk->free_bytes -= alloc_bits * PCPU_MIN_ALLOC_SIZE;
/* update first free bit */
- if (bit_off == chunk->first_bit)
- chunk->first_bit = find_next_zero_bit(
+ if (bit_off == chunk_md->first_free)
+ chunk_md->first_free = find_next_zero_bit(
chunk->alloc_map,
pcpu_chunk_map_bits(chunk),
bit_off + alloc_bits);
@@ -1245,6 +1229,7 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
*/
static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
{
+ struct pcpu_block_md *chunk_md = &chunk->chunk_md;
int bit_off, bits, end, oslot;
lockdep_assert_held(&pcpu_lock);
@@ -1264,7 +1249,7 @@ static void pcpu_free_area(struct pcpu_chunk *chunk, int off)
chunk->free_bytes += bits * PCPU_MIN_ALLOC_SIZE;
/* update first free bit */
- chunk->first_bit = min(chunk->first_bit, bit_off);
+ chunk_md->first_free = min(chunk_md->first_free, bit_off);
pcpu_block_update_hint_free(chunk, bit_off, bits);
@@ -1285,6 +1270,9 @@ static void pcpu_init_md_blocks(struct pcpu_chunk *chunk)
{
struct pcpu_block_md *md_block;
+ /* init the chunk's block */
+ pcpu_init_md_block(&chunk->chunk_md, pcpu_chunk_map_bits(chunk));
+
for (md_block = chunk->md_blocks;
md_block != chunk->md_blocks + pcpu_chunk_nr_blocks(chunk);
md_block++)
@@ -1352,7 +1340,6 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
chunk->nr_populated = chunk->nr_pages;
chunk->nr_empty_pop_pages = chunk->nr_pages;
- chunk->contig_bits = map_size / PCPU_MIN_ALLOC_SIZE;
chunk->free_bytes = map_size;
if (chunk->start_offset) {
@@ -1362,7 +1349,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
set_bit(0, chunk->bound_map);
set_bit(offset_bits, chunk->bound_map);
- chunk->first_bit = offset_bits;
+ chunk->chunk_md.first_free = offset_bits;
pcpu_block_update_hint_alloc(chunk, 0, offset_bits);
}
@@ -1415,7 +1402,6 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t gfp)
pcpu_init_md_blocks(chunk);
/* init metadata */
- chunk->contig_bits = region_bits;
chunk->free_bytes = chunk->nr_pages * PAGE_SIZE;
return chunk;
--
2.17.1
In certain cases, requestors of percpu memory may want specific
alignments. However, it is possible to end up in situations where the
contig_hint matches, but the alignment does not. This causes excess
scanning of chunks that will fail. To prevent this, if a small
allocation fails (< 32B), the chunk is moved to the empty list. Once an
allocation is freed from that chunk, it is placed back into rotation.
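For instance, with hypothetical sizes: a chunk whose contig_hint is
8 bits (32 bytes) at an odd bit offset cannot serve a 32-byte allocation
requiring 32-byte alignment even though the hint size matches.
Previously such a chunk was rescanned on every attempt; with this patch
the failed fit moves it to slot 0 via pcpu_chunk_move(), and a later
free relocates it back into its proper slot.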
Signed-off-by: Dennis Zhou <[email protected]>
---
mm/percpu.c | 35 ++++++++++++++++++++++++++---------
1 file changed, 26 insertions(+), 9 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index c996bcffbb2a..3d7deece9556 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -94,6 +94,8 @@
/* the slots are sorted by free bytes left, 1-31 bytes share the same slot */
#define PCPU_SLOT_BASE_SHIFT 5
+/* chunks in slots below this are subject to being sidelined on failed alloc */
+#define PCPU_SLOT_FAIL_THRESHOLD 3
#define PCPU_EMPTY_POP_PAGES_LOW 2
#define PCPU_EMPTY_POP_PAGES_HIGH 4
@@ -488,6 +490,22 @@ static void pcpu_mem_free(void *ptr)
kvfree(ptr);
}
+static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot,
+ bool move_front)
+{
+ if (chunk != pcpu_reserved_chunk) {
+ if (move_front)
+ list_move(&chunk->list, &pcpu_slot[slot]);
+ else
+ list_move_tail(&chunk->list, &pcpu_slot[slot]);
+ }
+}
+
+static void pcpu_chunk_move(struct pcpu_chunk *chunk, int slot)
+{
+ __pcpu_chunk_move(chunk, slot, true);
+}
+
/**
* pcpu_chunk_relocate - put chunk in the appropriate chunk slot
* @chunk: chunk of interest
@@ -505,12 +523,8 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
{
int nslot = pcpu_chunk_slot(chunk);
- if (chunk != pcpu_reserved_chunk && oslot != nslot) {
- if (oslot < nslot)
- list_move(&chunk->list, &pcpu_slot[nslot]);
- else
- list_move_tail(&chunk->list, &pcpu_slot[nslot]);
- }
+ if (oslot != nslot)
+ __pcpu_chunk_move(chunk, nslot, oslot < nslot);
}
/**
@@ -1381,7 +1395,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
bool do_warn = !(gfp & __GFP_NOWARN);
static int warn_limit = 10;
- struct pcpu_chunk *chunk;
+ struct pcpu_chunk *chunk, *next;
const char *err;
int slot, off, cpu, ret;
unsigned long flags;
@@ -1443,11 +1457,14 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
restart:
/* search through normal chunks */
for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
- list_for_each_entry(chunk, &pcpu_slot[slot], list) {
+ list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) {
off = pcpu_find_block_fit(chunk, bits, bit_align,
is_atomic);
- if (off < 0)
+ if (off < 0) {
+ if (slot < PCPU_SLOT_FAIL_THRESHOLD)
+ pcpu_chunk_move(chunk, 0);
continue;
+ }
off = pcpu_alloc_area(chunk, bits, bit_align, off);
if (off >= 0)
--
2.17.1
While block hints were always accurate, it's possible when spanning
across blocks that we miss updating the chunk's contig_hint. Rather than
rely on correctness of the boundaries of hints, do a full overlap
comparison.
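The following is a standalone userspace sketch (plain C, not part of the
patch) of the three conditions the new helper checks: the allocated
region starting inside the hint, ending inside it, or fully covering it.

#include <assert.h>
#include <stdbool.h>

/* mirrors the logic of pcpu_region_overlap(): hint [a, b) vs region [x, y) */
static bool region_overlap(int a, int b, int x, int y)
{
	return (x >= a && x < b) || (y > a && y <= b) || (x <= a && y >= b);
}

int main(void)
{
	assert(region_overlap(4, 12, 10, 16));  /* starts inside [4, 12) */
	assert(region_overlap(4, 12, 0, 6));    /* ends inside [4, 12) */
	assert(region_overlap(4, 12, 0, 16));   /* fully covers [4, 12) */
	assert(!region_overlap(4, 12, 12, 16)); /* only adjacent, no overlap */
	return 0;
}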
Signed-off-by: Dennis Zhou <[email protected]>
---
mm/percpu.c | 31 +++++++++++++++++++++++++++----
1 file changed, 27 insertions(+), 4 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index 69ca51d238b5..b40112b2fc59 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -546,6 +546,24 @@ static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk, int bit_off,
bitmap_weight(chunk->populated, page_start);
}
+/*
+ * pcpu_region_overlap - determines if two regions overlap
+ * @a: start of first region, inclusive
+ * @b: end of first region, exclusive
+ * @x: start of second region, inclusive
+ * @y: end of second region, exclusive
+ *
+ * This is used to determine if the hint region [a, b) overlaps with the
+ * allocated region [x, y).
+ */
+static inline bool pcpu_region_overlap(int a, int b, int x, int y)
+{
+ if ((x >= a && x < b) || (y > a && y <= b) ||
+ (x <= a && y >= b))
+ return true;
+ return false;
+}
+
/**
* pcpu_chunk_update - updates the chunk metadata given a free area
* @chunk: chunk of interest
@@ -710,8 +728,11 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
PCPU_BITMAP_BLOCK_BITS,
s_off + bits);
- if (s_off >= s_block->contig_hint_start &&
- s_off < s_block->contig_hint_start + s_block->contig_hint) {
+ if (pcpu_region_overlap(s_block->contig_hint_start,
+ s_block->contig_hint_start +
+ s_block->contig_hint,
+ s_off,
+ s_off + bits)) {
/* block contig hint is broken - scan to fix it */
pcpu_block_refresh_hint(chunk, s_index);
} else {
@@ -764,8 +785,10 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
* contig hint is broken. Otherwise, it means a smaller space
* was used and therefore the chunk contig hint is still correct.
*/
- if (bit_off >= chunk->contig_bits_start &&
- bit_off < chunk->contig_bits_start + chunk->contig_bits)
+ if (pcpu_region_overlap(chunk->contig_bits_start,
+ chunk->contig_bits_start + chunk->contig_bits,
+ bit_off,
+ bit_off + bits))
pcpu_chunk_refresh_hint(chunk);
}
--
2.17.1
Percpu allocations attempt to do first fit by scanning forward from the
first_free of a block. However, fragmentation from allocation requests
can cause holes not seen by block hint update functions. To address
this, create a local version of bitmap_find_next_zero_area_off() that
remembers the largest area skipped over. The caveat is that it only sees
regions skipped over due to not fitting, not regions skipped due to
alignment. Prior to updating the scan_hint, a scan backwards is done to
try to recover free bits skipped due to alignment. While this can cause
scanning to miss earlier possible free areas, smaller allocations will
eventually fill those holes.
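A hypothetical walk-through: for a 24-bit allocation with 16-bit
alignment and a free run at [6, 26), the scan starts from the aligned
bit 16 and fails with only 10 free bits there; pcpu_find_zero_area()
records [16, 26) as the largest skipped area, and
pcpu_block_update_scan() then walks backwards with find_last_bit() to
recover the bits [6, 16) lost to alignment, so pcpu_block_update() sees
the full [6, 26) hole.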
Signed-off-by: Dennis Zhou <[email protected]>
---
mm/percpu.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 99 insertions(+), 2 deletions(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index df1aacf58ac8..dac18968d79f 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -716,6 +716,43 @@ static void pcpu_block_update(struct pcpu_block_md *block, int start, int end)
}
}
+/*
+ * pcpu_block_update_scan - update a block given a free area from a scan
+ * @chunk: chunk of interest
+ * @bit_off: chunk offset
+ * @bits: size of free area
+ *
+ * Finding the final allocation spot first goes through pcpu_find_block_fit()
+ * to find a block that can hold the allocation and then pcpu_alloc_area()
+ * where a scan is used. When allocations require specific alignments,
+ * we can inadvertently create holes which will not be seen in the alloc
+ * or free paths.
+ *
+ * This takes a given free area hole and updates a block as it may change the
+ * scan_hint. We need to scan backwards to ensure we don't miss free bits
+ * from alignment.
+ */
+static void pcpu_block_update_scan(struct pcpu_chunk *chunk, int bit_off,
+ int bits)
+{
+ int s_off = pcpu_off_to_block_off(bit_off);
+ int e_off = s_off + bits;
+ int s_index, l_bit;
+ struct pcpu_block_md *block;
+
+ if (e_off > PCPU_BITMAP_BLOCK_BITS)
+ return;
+
+ s_index = pcpu_off_to_block_index(bit_off);
+ block = chunk->md_blocks + s_index;
+
+ /* scan backwards in case of alignment skipping free bits */
+ l_bit = find_last_bit(pcpu_index_alloc_map(chunk, s_index), s_off);
+ s_off = (s_off == l_bit) ? 0 : l_bit + 1;
+
+ pcpu_block_update(block, s_off, e_off);
+}
+
/**
* pcpu_block_refresh_hint
* @chunk: chunk of interest
@@ -1064,6 +1101,62 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits,
return bit_off;
}
+/*
+ * pcpu_find_zero_area - modified from bitmap_find_next_zero_area_off
+ * @map: the address to base the search on
+ * @size: the bitmap size in bits
+ * @start: the bitnumber to start searching at
+ * @nr: the number of zeroed bits we're looking for
+ * @align_mask: alignment mask for zero area
+ * @largest_off: offset of the largest area skipped
+ * @largest_bits: size of the largest area skipped
+ *
+ * The @align_mask should be one less than a power of 2.
+ *
+ * This is a modified version of bitmap_find_next_zero_area_off() to remember
+ * the largest area that was skipped. This is imperfect, but in general is
+ * good enough. The largest remembered region is the largest failed region
+ * seen. This does not include anything we possibly skipped due to alignment.
+ * pcpu_block_update_scan() does scan backwards to try and recover what was
+ * lost to alignment. While this can cause scanning to miss earlier possible
+ * free areas, smaller allocations will eventually fill those holes.
+ */
+static unsigned long pcpu_find_zero_area(unsigned long *map,
+ unsigned long size,
+ unsigned long start,
+ unsigned long nr,
+ unsigned long align_mask,
+ unsigned long *largest_off,
+ unsigned long *largest_bits)
+{
+ unsigned long index, end, i, area_off, area_bits;
+again:
+ index = find_next_zero_bit(map, size, start);
+
+ /* Align allocation */
+ index = __ALIGN_MASK(index, align_mask);
+ area_off = index;
+
+ end = index + nr;
+ if (end > size)
+ return end;
+ i = find_next_bit(map, end, index);
+ if (i < end) {
+ area_bits = i - area_off;
+ /* remember largest unused area with best alignment */
+ if (area_bits > *largest_bits ||
+ (area_bits == *largest_bits && *largest_off &&
+ (!area_off || __ffs(area_off) > __ffs(*largest_off)))) {
+ *largest_off = area_off;
+ *largest_bits = area_bits;
+ }
+
+ start = i + 1;
+ goto again;
+ }
+ return index;
+}
+
/**
* pcpu_alloc_area - allocates an area from a pcpu_chunk
* @chunk: chunk of interest
@@ -1087,6 +1180,7 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
size_t align, int start)
{
size_t align_mask = (align) ? (align - 1) : 0;
+ unsigned long area_off = 0, area_bits = 0;
int bit_off, end, oslot;
lockdep_assert_held(&pcpu_lock);
@@ -1098,11 +1192,14 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
*/
end = min_t(int, start + alloc_bits + PCPU_BITMAP_BLOCK_BITS,
pcpu_chunk_map_bits(chunk));
- bit_off = bitmap_find_next_zero_area(chunk->alloc_map, end, start,
- alloc_bits, align_mask);
+ bit_off = pcpu_find_zero_area(chunk->alloc_map, end, start, alloc_bits,
+ align_mask, &area_off, &area_bits);
if (bit_off >= end)
return -1;
+ if (area_bits)
+ pcpu_block_update_scan(chunk, area_off, area_bits);
+
/* update alloc map */
bitmap_set(chunk->alloc_map, bit_off, alloc_bits);
--
2.17.1
When a chunk becomes fragmented, it can end up having a large number of
small allocation areas free. The free_bytes sorting of chunks leads to
unnecessary checking of chunks that cannot satisfy the allocation.
Switch to contig_bits sorting to prevent scanning chunks that may not be
able to service the allocation request.
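Consider a hypothetical chunk with 4KB free spread across hundreds of
8-byte holes: sorted by free_bytes it sits in a high slot and gets
scanned for allocations (say 256 bytes) it can never satisfy. Sorting by
contig_bits * PCPU_MIN_ALLOC_SIZE places it in the slot matching its
largest contiguous area instead, so such requests skip it.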
Signed-off-by: Dennis Zhou <[email protected]>
---
mm/percpu.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/percpu.c b/mm/percpu.c
index b40112b2fc59..c996bcffbb2a 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -234,7 +234,7 @@ static int pcpu_chunk_slot(const struct pcpu_chunk *chunk)
if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || chunk->contig_bits == 0)
return 0;
- return pcpu_size_to_slot(chunk->free_bytes);
+ return pcpu_size_to_slot(chunk->contig_bits * PCPU_MIN_ALLOC_SIZE);
}
/* set the pointer to a chunk in a page struct */
--
2.17.1
Previously, block size was flexible based on the constraint that
GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the
overhead that keeping a floating count of populated free pages required
scanning over the free regions of a chunk.
Setting the block size to be fixed at PAGE_SIZE lets us know when an
empty page becomes used, as we will break a full contig_hint of a block.
Empty page management piggybacks off this, so we no longer have to scan
the whole chunk upon breaking a contig_hint. A later patch takes
advantage of this to optimize the allocation path by scanning only
forward, using the scan_hint introduced later.
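As a sketch of the accounting in the hunks below: on the allocation
path, each block touched whose contig_hint equals
PCPU_BITMAP_BLOCK_BITS was an empty page that is now used, so the count
is decremented via pcpu_update_empty_pages(); on the free path, a block
whose free region now spans the whole block has just become an empty
page and is counted back in. Because a block now covers exactly one
page, these deltas replace the full-chunk counting previously done by
pcpu_cnt_pop_pages().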
Signed-off-by: Dennis Zhou <[email protected]>
---
include/linux/percpu.h | 12 ++---
mm/percpu-km.c | 2 +-
mm/percpu.c | 111 +++++++++++++++++------------------------
3 files changed, 49 insertions(+), 76 deletions(-)
diff --git a/include/linux/percpu.h b/include/linux/percpu.h
index 70b7123f38c7..9909dc0e273a 100644
--- a/include/linux/percpu.h
+++ b/include/linux/percpu.h
@@ -26,16 +26,10 @@
#define PCPU_MIN_ALLOC_SHIFT 2
#define PCPU_MIN_ALLOC_SIZE (1 << PCPU_MIN_ALLOC_SHIFT)
-/* number of bits per page, used to trigger a scan if blocks are > PAGE_SIZE */
-#define PCPU_BITS_PER_PAGE (PAGE_SIZE >> PCPU_MIN_ALLOC_SHIFT)
-
/*
- * This determines the size of each metadata block. There are several subtle
- * constraints around this constant. The reserved region must be a multiple of
- * PCPU_BITMAP_BLOCK_SIZE. Additionally, PCPU_BITMAP_BLOCK_SIZE must be a
- * multiple of PAGE_SIZE or PAGE_SIZE must be a multiple of
- * PCPU_BITMAP_BLOCK_SIZE to align with the populated page map. The unit_size
- * also has to be a multiple of PCPU_BITMAP_BLOCK_SIZE to ensure full blocks.
+ * The PCPU_BITMAP_BLOCK_SIZE must be the same size as PAGE_SIZE as the
+ * updating of hints is used to manage the nr_empty_pop_pages in both
+ * the chunk and globally.
*/
#define PCPU_BITMAP_BLOCK_SIZE PAGE_SIZE
#define PCPU_BITMAP_BLOCK_BITS (PCPU_BITMAP_BLOCK_SIZE >> \
diff --git a/mm/percpu-km.c b/mm/percpu-km.c
index 0f643dc2dc65..c10bf7466596 100644
--- a/mm/percpu-km.c
+++ b/mm/percpu-km.c
@@ -70,7 +70,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp)
chunk->base_addr = page_address(pages) - pcpu_group_offsets[0];
spin_lock_irqsave(&pcpu_lock, flags);
- pcpu_chunk_populated(chunk, 0, nr_pages, false);
+ pcpu_chunk_populated(chunk, 0, nr_pages);
spin_unlock_irqrestore(&pcpu_lock, flags);
pcpu_stats_chunk_alloc();
diff --git a/mm/percpu.c b/mm/percpu.c
index 3d7deece9556..967c9cc3a928 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -527,37 +527,21 @@ static void pcpu_chunk_relocate(struct pcpu_chunk *chunk, int oslot)
__pcpu_chunk_move(chunk, nslot, oslot < nslot);
}
-/**
- * pcpu_cnt_pop_pages- counts populated backing pages in range
+/*
+ * pcpu_update_empty_pages - update empty page counters
* @chunk: chunk of interest
- * @bit_off: start offset
- * @bits: size of area to check
+ * @nr: nr of empty pages
*
- * Calculates the number of populated pages in the region
- * [page_start, page_end). This keeps track of how many empty populated
- * pages are available and decide if async work should be scheduled.
- *
- * RETURNS:
- * The nr of populated pages.
+ * This is used to keep track of the empty pages now based on the premise
+ * a pcpu_block_md covers a page. The hint update functions recognize if
+ * a block is made full or broken to calculate deltas for keeping track of
+ * free pages.
*/
-static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk, int bit_off,
- int bits)
+static inline void pcpu_update_empty_pages(struct pcpu_chunk *chunk, int nr)
{
- int page_start = PFN_UP(bit_off * PCPU_MIN_ALLOC_SIZE);
- int page_end = PFN_DOWN((bit_off + bits) * PCPU_MIN_ALLOC_SIZE);
-
- if (page_start >= page_end)
- return 0;
-
- /*
- * bitmap_weight counts the number of bits set in a bitmap up to
- * the specified number of bits. This is counting the populated
- * pages up to page_end and then subtracting the populated pages
- * up to page_start to count the populated pages in
- * [page_start, page_end).
- */
- return bitmap_weight(chunk->populated, page_end) -
- bitmap_weight(chunk->populated, page_start);
+ chunk->nr_empty_pop_pages += nr;
+ if (chunk != pcpu_reserved_chunk)
+ pcpu_nr_empty_pop_pages += nr;
}
/*
@@ -611,36 +595,19 @@ static void pcpu_chunk_update(struct pcpu_chunk *chunk, int bit_off, int bits)
* Updates:
* chunk->contig_bits
* chunk->contig_bits_start
- * nr_empty_pop_pages (chunk and global)
*/
static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk)
{
- int bit_off, bits, nr_empty_pop_pages;
+ int bit_off, bits;
/* clear metadata */
chunk->contig_bits = 0;
bit_off = chunk->first_bit;
- bits = nr_empty_pop_pages = 0;
+ bits = 0;
pcpu_for_each_md_free_region(chunk, bit_off, bits) {
pcpu_chunk_update(chunk, bit_off, bits);
-
- nr_empty_pop_pages += pcpu_cnt_pop_pages(chunk, bit_off, bits);
}
-
- /*
- * Keep track of nr_empty_pop_pages.
- *
- * The chunk maintains the previous number of free pages it held,
- * so the delta is used to update the global counter. The reserved
- * chunk is not part of the free page count as they are populated
- * at init and are special to serving reserved allocations.
- */
- if (chunk != pcpu_reserved_chunk)
- pcpu_nr_empty_pop_pages +=
- (nr_empty_pop_pages - chunk->nr_empty_pop_pages);
-
- chunk->nr_empty_pop_pages = nr_empty_pop_pages;
}
/**
@@ -712,6 +679,7 @@ static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
int bits)
{
+ int nr_empty_pages = 0;
struct pcpu_block_md *s_block, *e_block, *block;
int s_index, e_index; /* block indexes of the freed allocation */
int s_off, e_off; /* block offsets of the freed allocation */
@@ -736,6 +704,9 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
* If the allocation breaks the contig_hint, a scan is required to
* restore this hint.
*/
+ if (s_block->contig_hint == PCPU_BITMAP_BLOCK_BITS)
+ nr_empty_pages++;
+
if (s_off == s_block->first_free)
s_block->first_free = find_next_zero_bit(
pcpu_index_alloc_map(chunk, s_index),
@@ -763,6 +734,9 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
* Update e_block.
*/
if (s_index != e_index) {
+ if (e_block->contig_hint == PCPU_BITMAP_BLOCK_BITS)
+ nr_empty_pages++;
+
/*
* When the allocation is across blocks, the end is along
* the left part of the e_block.
@@ -787,6 +761,7 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
}
/* update in-between md_blocks */
+ nr_empty_pages += (e_index - s_index - 1);
for (block = s_block + 1; block < e_block; block++) {
block->contig_hint = 0;
block->left_free = 0;
@@ -794,6 +769,9 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
}
}
+ if (nr_empty_pages)
+ pcpu_update_empty_pages(chunk, -1 * nr_empty_pages);
+
/*
* The only time a full chunk scan is required is if the chunk
* contig hint is broken. Otherwise, it means a smaller space
@@ -826,6 +804,7 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
int bits)
{
+ int nr_empty_pages = 0;
struct pcpu_block_md *s_block, *e_block, *block;
int s_index, e_index; /* block indexes of the freed allocation */
int s_off, e_off; /* block offsets of the freed allocation */
@@ -879,14 +858,19 @@ static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
/* update s_block */
e_off = (s_index == e_index) ? end : PCPU_BITMAP_BLOCK_BITS;
+ if (!start && e_off == PCPU_BITMAP_BLOCK_BITS)
+ nr_empty_pages++;
pcpu_block_update(s_block, start, e_off);
/* freeing in the same block */
if (s_index != e_index) {
/* update e_block */
+ if (end == PCPU_BITMAP_BLOCK_BITS)
+ nr_empty_pages++;
pcpu_block_update(e_block, 0, end);
/* reset md_blocks in the middle */
+ nr_empty_pages += (e_index - s_index - 1);
for (block = s_block + 1; block < e_block; block++) {
block->first_free = 0;
block->contig_hint_start = 0;
@@ -896,15 +880,16 @@ static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
}
}
+ if (nr_empty_pages)
+ pcpu_update_empty_pages(chunk, nr_empty_pages);
+
/*
- * Refresh chunk metadata when the free makes a page free, a block
- * free, or spans across blocks. The contig hint may be off by up to
- * a page, but if the hint is contained in a block, it will be accurate
- * with the else condition below.
+ * Refresh chunk metadata when the free makes a block free or spans
+ * across blocks. The contig_hint may be off by up to a page, but if
+ * the contig_hint is contained in a block, it will be accurate with
+ * the else condition below.
*/
- if ((ALIGN_DOWN(end, min(PCPU_BITS_PER_PAGE, PCPU_BITMAP_BLOCK_BITS)) >
- ALIGN(start, min(PCPU_BITS_PER_PAGE, PCPU_BITMAP_BLOCK_BITS))) ||
- s_index != e_index)
+ if (((end - start) >= PCPU_BITMAP_BLOCK_BITS) || s_index != e_index)
pcpu_chunk_refresh_hint(chunk);
else
pcpu_chunk_update(chunk, pcpu_block_off_to_off(s_index, start),
@@ -1164,9 +1149,7 @@ static struct pcpu_chunk * __init pcpu_alloc_first_chunk(unsigned long tmp_addr,
chunk->immutable = true;
bitmap_fill(chunk->populated, chunk->nr_pages);
chunk->nr_populated = chunk->nr_pages;
- chunk->nr_empty_pop_pages =
- pcpu_cnt_pop_pages(chunk, start_offset / PCPU_MIN_ALLOC_SIZE,
- map_size / PCPU_MIN_ALLOC_SIZE);
+ chunk->nr_empty_pop_pages = chunk->nr_pages;
chunk->contig_bits = map_size / PCPU_MIN_ALLOC_SIZE;
chunk->free_bytes = map_size;
@@ -1261,7 +1244,6 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
* @chunk: pcpu_chunk which got populated
* @page_start: the start page
* @page_end: the end page
- * @for_alloc: if this is to populate for allocation
*
* Pages in [@page_start,@page_end) have been populated to @chunk. Update
* the bookkeeping information accordingly. Must be called after each
@@ -1271,7 +1253,7 @@ static void pcpu_free_chunk(struct pcpu_chunk *chunk)
* is to serve an allocation in that area.
*/
static void pcpu_chunk_populated(struct pcpu_chunk *chunk, int page_start,
- int page_end, bool for_alloc)
+ int page_end)
{
int nr = page_end - page_start;
@@ -1281,10 +1263,7 @@ static void pcpu_chunk_populated(struct pcpu_chunk *chunk, int page_start,
chunk->nr_populated += nr;
pcpu_nr_populated += nr;
- if (!for_alloc) {
- chunk->nr_empty_pop_pages += nr;
- pcpu_nr_empty_pop_pages += nr;
- }
+ pcpu_update_empty_pages(chunk, nr);
}
/**
@@ -1306,9 +1285,9 @@ static void pcpu_chunk_depopulated(struct pcpu_chunk *chunk,
bitmap_clear(chunk->populated, page_start, nr);
chunk->nr_populated -= nr;
- chunk->nr_empty_pop_pages -= nr;
- pcpu_nr_empty_pop_pages -= nr;
pcpu_nr_populated -= nr;
+
+ pcpu_update_empty_pages(chunk, -1 * nr);
}
/*
@@ -1523,7 +1502,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t align, bool reserved,
err = "failed to populate";
goto fail_unlock;
}
- pcpu_chunk_populated(chunk, rs, re, true);
+ pcpu_chunk_populated(chunk, rs, re);
spin_unlock_irqrestore(&pcpu_lock, flags);
}
@@ -1722,7 +1701,7 @@ static void pcpu_balance_workfn(struct work_struct *work)
if (!ret) {
nr_to_pop -= nr;
spin_lock_irq(&pcpu_lock);
- pcpu_chunk_populated(chunk, rs, rs + nr, false);
+ pcpu_chunk_populated(chunk, rs, rs + nr);
spin_unlock_irq(&pcpu_lock);
} else {
nr_to_pop = 0;
--
2.17.1
Fragmentation can cause both blocks and chunks to have an early
first_free bit available, but only be able to satisfy allocations much
later on. This patch introduces a scan_hint to help mitigate some
unnecessary scanning.
The scan_hint remembers the largest area prior to the contig_hint. If
the contig_hint == scan_hint, then scan_hint_start > contig_hint_start.
This is necessary for scan_hint discovery when refreshing a block.
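A sketch of how the hints are consumed by the pcpu_next_hint() helper
added below (hypothetical block state): with scan_hint = 4 at
scan_hint_start = 0 and contig_hint = 10 at contig_hint_start = 20, an
8-bit request cannot fit in the scan_hint, so scanning may begin at
bit 4 (scan_hint_start + scan_hint) instead of first_free; a 2-bit
request still starts from first_free to preserve first fit.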
Signed-off-by: Dennis Zhou <[email protected]>
---
mm/percpu-internal.h | 9 ++++
mm/percpu.c | 101 ++++++++++++++++++++++++++++++++++++++++---
2 files changed, 103 insertions(+), 7 deletions(-)
diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h
index b1739dc06b73..ec58b244545d 100644
--- a/mm/percpu-internal.h
+++ b/mm/percpu-internal.h
@@ -9,8 +9,17 @@
* pcpu_block_md is the metadata block struct.
* Each chunk's bitmap is split into a number of full blocks.
* All units are in terms of bits.
+ *
+ * The scan hint is the largest known contiguous area before the contig hint.
+ * It is not necessarily the actual largest contig hint though. There is an
+ * invariant that the scan_hint_start > contig_hint_start iff
+ * scan_hint == contig_hint. This is necessary because when scanning forward,
+ * we don't know if a new contig hint would be better than the current one.
*/
struct pcpu_block_md {
+ int scan_hint; /* scan hint for block */
+ int scan_hint_start; /* block relative starting
+ position of the scan hint */
int contig_hint; /* contig hint for block */
int contig_hint_start; /* block relative starting
position of the contig hint */
diff --git a/mm/percpu.c b/mm/percpu.c
index 967c9cc3a928..df1aacf58ac8 100644
--- a/mm/percpu.c
+++ b/mm/percpu.c
@@ -320,6 +320,34 @@ static unsigned long pcpu_block_off_to_off(int index, int off)
return index * PCPU_BITMAP_BLOCK_BITS + off;
}
+/*
+ * pcpu_next_hint - determine which hint to use
+ * @block: block of interest
+ * @alloc_bits: size of allocation
+ *
+ * This determines if we should scan based on the scan_hint or first_free.
+ * In general, we want to scan from first_free to fulfill allocations by
+ * first fit. However, if we know a scan_hint at position scan_hint_start
+ * cannot fulfill an allocation, we can begin scanning from there knowing
+ * the contig_hint will be our fallback.
+ */
+static int pcpu_next_hint(struct pcpu_block_md *block, int alloc_bits)
+{
+ /*
+ * The three conditions below determine if we can skip past the
+ * scan_hint. First, does the scan hint exist. Second, is the
+ * contig_hint after the scan_hint (possibly not true iff
+ * contig_hint == scan_hint). Third, is the allocation request
+ * larger than the scan_hint.
+ */
+ if (block->scan_hint &&
+ block->contig_hint_start > block->scan_hint_start &&
+ alloc_bits > block->scan_hint)
+ return block->scan_hint_start + block->scan_hint;
+
+ return block->first_free;
+}
+
/**
* pcpu_next_md_free_region - finds the next hint free area
* @chunk: chunk of interest
@@ -415,9 +443,11 @@ static void pcpu_next_fit_region(struct pcpu_chunk *chunk, int alloc_bits,
if (block->contig_hint &&
block->contig_hint_start >= block_off &&
block->contig_hint >= *bits + alloc_bits) {
+ int start = pcpu_next_hint(block, alloc_bits);
+
*bits += alloc_bits + block->contig_hint_start -
- block->first_free;
- *bit_off = pcpu_block_off_to_off(i, block->first_free);
+ start;
+ *bit_off = pcpu_block_off_to_off(i, start);
return;
}
/* reset to satisfy the second predicate above */
@@ -632,12 +662,57 @@ static void pcpu_block_update(struct pcpu_block_md *block, int start, int end)
block->right_free = contig;
if (contig > block->contig_hint) {
+ /* promote the old contig_hint to be the new scan_hint */
+ if (start > block->contig_hint_start) {
+ if (block->contig_hint > block->scan_hint) {
+ block->scan_hint_start =
+ block->contig_hint_start;
+ block->scan_hint = block->contig_hint;
+ } else if (start < block->scan_hint_start) {
+ /*
+ * The old contig_hint == scan_hint. But, the
+ * new contig is larger so hold the invariant
+ * scan_hint_start < contig_hint_start.
+ */
+ block->scan_hint = 0;
+ }
+ } else {
+ block->scan_hint = 0;
+ }
block->contig_hint_start = start;
block->contig_hint = contig;
- } else if (block->contig_hint_start && contig == block->contig_hint &&
- (!start || __ffs(start) > __ffs(block->contig_hint_start))) {
- /* use the start with the best alignment */
- block->contig_hint_start = start;
+ } else if (contig == block->contig_hint) {
+ if (block->contig_hint_start &&
+ (!start ||
+ __ffs(start) > __ffs(block->contig_hint_start))) {
+ /* start has a better alignment so use it */
+ block->contig_hint_start = start;
+ if (start < block->scan_hint_start &&
+ block->contig_hint > block->scan_hint)
+ block->scan_hint = 0;
+ } else if (start > block->scan_hint_start ||
+ block->contig_hint > block->scan_hint) {
+ /*
+ * Knowing contig == contig_hint, update the scan_hint
+ * if it is farther than or larger than the current
+ * scan_hint.
+ */
+ block->scan_hint_start = start;
+ block->scan_hint = contig;
+ }
+ } else {
+ /*
+ * The region is smaller than the contig_hint. So only update
+ * the scan_hint if it is larger than or equal and farther than
+ * the current scan_hint.
+ */
+ if ((start < block->contig_hint_start &&
+ (contig > block->scan_hint ||
+ (contig == block->scan_hint &&
+ start > block->scan_hint_start)))) {
+ block->scan_hint_start = start;
+ block->scan_hint = contig;
+ }
}
}
@@ -656,7 +731,7 @@ static void pcpu_block_refresh_hint(struct pcpu_chunk *chunk, int index)
int rs, re; /* region start, region end */
/* clear hints */
- block->contig_hint = 0;
+ block->contig_hint = block->scan_hint = 0;
block->left_free = block->right_free = 0;
/* iterate over free areas and update the contig hints */
@@ -713,6 +788,12 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
PCPU_BITMAP_BLOCK_BITS,
s_off + bits);
+ if (pcpu_region_overlap(s_block->scan_hint_start,
+ s_block->scan_hint_start + s_block->scan_hint,
+ s_off,
+ s_off + bits))
+ s_block->scan_hint = 0;
+
if (pcpu_region_overlap(s_block->contig_hint_start,
s_block->contig_hint_start +
s_block->contig_hint,
@@ -749,6 +830,9 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
/* reset the block */
e_block++;
} else {
+ if (e_off > e_block->scan_hint_start)
+ e_block->scan_hint = 0;
+
if (e_off > e_block->contig_hint_start) {
/* contig hint is broken - scan to fix it */
pcpu_block_refresh_hint(chunk, e_index);
@@ -763,6 +847,7 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
/* update in-between md_blocks */
nr_empty_pages += (e_index - s_index - 1);
for (block = s_block + 1; block < e_block; block++) {
+ block->scan_hint = 0;
block->contig_hint = 0;
block->left_free = 0;
block->right_free = 0;
@@ -873,6 +958,7 @@ static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
nr_empty_pages += (e_index - s_index - 1);
for (block = s_block + 1; block < e_block; block++) {
block->first_free = 0;
+ block->scan_hint = 0;
block->contig_hint_start = 0;
block->contig_hint = PCPU_BITMAP_BLOCK_BITS;
block->left_free = PCPU_BITMAP_BLOCK_BITS;
@@ -1084,6 +1170,7 @@ static void pcpu_init_md_blocks(struct pcpu_chunk *chunk)
for (md_block = chunk->md_blocks;
md_block != chunk->md_blocks + pcpu_chunk_nr_blocks(chunk);
md_block++) {
+ md_block->scan_hint = 0;
md_block->contig_hint = PCPU_BITMAP_BLOCK_BITS;
md_block->left_free = PCPU_BITMAP_BLOCK_BITS;
md_block->right_free = PCPU_BITMAP_BLOCK_BITS;
--
2.17.1
On Thu 28 Feb 2019 at 02:18, Dennis Zhou <[email protected]> wrote:
> [snip]
Hi Dennis,
Thank you very much for doing this!
I applied the patches on top of current net-next and can confirm that tc
filter insertion rate is significantly improved and is better compared
to version with offending commit simply reverted.
Regards,
Vlad
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Dennis Zhou
> Sent: 2019年2月28日 10:18
> To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> Lameter <[email protected]>
> Cc: Vlad Buslov <[email protected]>; [email protected];
> [email protected]; [email protected]
> Subject: [PATCH 01/12] percpu: update free path with correct new free region
>
> When updating the chunk's contig_hint on the free path of a hint that does not
> touch the page boundaries, it was incorrectly using the starting offset of the
> free region and the block's contig_hint. This could lead to incorrect
> assumptions about fit given a size and better alignment of the start. Fix this by
> using (end - start) as this is only called when updating a hint within a block.
>
> Signed-off-by: Dennis Zhou <[email protected]>
> ---
> mm/percpu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index db86282fd024..53bd79a617b1 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -871,7 +871,7 @@ static void pcpu_block_update_hint_free(struct
> pcpu_chunk *chunk, int bit_off,
> pcpu_chunk_refresh_hint(chunk);
> else
> pcpu_chunk_update(chunk, pcpu_block_off_to_off(s_index, start),
> - s_block->contig_hint);
> + end - start);
> }
Reviewed-by: Peng Fan <[email protected]>
>
> /**
> --
> 2.17.1
Hi Dennis,
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Dennis Zhou
> Sent: 2019年2月28日 10:18
> To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> Lameter <[email protected]>
> Cc: Vlad Buslov <[email protected]>; [email protected];
> [email protected]; [email protected]
> Subject: [PATCH 02/12] percpu: do not search past bitmap when allocating an
> area
>
> pcpu_find_block_fit() guarantees that a fit is found within
> PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit as it
> compares against the block's contig_hint. This can lead to incorrectly scanning
> past the end of the bitmap. The behavior was okay given the check after for
> bit_off >= end and the correctness of the hints from pcpu_find_block_fit().
>
> This patch fixes this by bounding the end offset by the number of bits in a
> chunk.
>
> Signed-off-by: Dennis Zhou <[email protected]>
> ---
> mm/percpu.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 53bd79a617b1..69ca51d238b5 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -988,7 +988,8 @@ static int pcpu_alloc_area(struct pcpu_chunk *chunk,
> int alloc_bits,
> /*
> * Search to find a fit.
> */
> - end = start + alloc_bits + PCPU_BITMAP_BLOCK_BITS;
> + end = min_t(int, start + alloc_bits + PCPU_BITMAP_BLOCK_BITS,
> + pcpu_chunk_map_bits(chunk));
> bit_off = bitmap_find_next_zero_area(chunk->alloc_map, end, start,
> alloc_bits, align_mask);
> if (bit_off >= end)
> --
Looking at pcpu_alloc_area() itself, I think this is correct to keep
bitmap_find_next_zero_area() from scanning past the boundaries of alloc_map, so
Reviewed-by: Peng Fan <[email protected]>
There are a few points I did not understand well.
Per my understanding, pcpu_find_block_fit() finds the first bit offset in a chunk that
could satisfy the allocation, so bits might be larger than PCPU_BITMAP_BLOCK_BITS. And if
pcpu_find_block_fit() returns a good offset, it means there is an area in the chunk that
could satisfy the allocation, so the following pcpu_alloc_area() will not scan past the
boundaries of alloc_map, right?
Thanks,
Peng.
> 2.17.1
Hi Dennis,
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Dennis Zhou
> Sent: 2019年2月28日 10:19
> To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> Lameter <[email protected]>
> Cc: Vlad Buslov <[email protected]>; [email protected];
> [email protected]; [email protected]
> Subject: [PATCH 03/12] percpu: introduce helper to determine if two regions
> overlap
>
> While block hints were always accurate, it's possible when spanning across
> blocks that we miss updating the chunk's contig_hint. Rather than rely on
> correctness of the boundaries of hints, do a full overlap comparison.
>
> Signed-off-by: Dennis Zhou <[email protected]>
> ---
> mm/percpu.c | 31 +++++++++++++++++++++++++++----
> 1 file changed, 27 insertions(+), 4 deletions(-)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 69ca51d238b5..b40112b2fc59 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -546,6 +546,24 @@ static inline int pcpu_cnt_pop_pages(struct
> pcpu_chunk *chunk, int bit_off,
> bitmap_weight(chunk->populated, page_start); }
>
> +/*
> + * pcpu_region_overlap - determines if two regions overlap
> + * @a: start of first region, inclusive
> + * @b: end of first region, exclusive
> + * @x: start of second region, inclusive
> + * @y: end of second region, exclusive
> + *
> + * This is used to determine if the hint region [a, b) overlaps with
> +the
> + * allocated region [x, y).
> + */
> +static inline bool pcpu_region_overlap(int a, int b, int x, int y) {
> + if ((x >= a && x < b) || (y > a && y <= b) ||
> + (x <= a && y >= b))
I think this could be simplified:
(a < y) && (x < b) could be used to do overlap check.
Regards,
Peng.
> + return true;
> + return false;
> +}
> +
> /**
> * pcpu_chunk_update - updates the chunk metadata given a free area
> * @chunk: chunk of interest
> @@ -710,8 +728,11 @@ static void pcpu_block_update_hint_alloc(struct
> pcpu_chunk *chunk, int bit_off,
> PCPU_BITMAP_BLOCK_BITS,
> s_off + bits);
>
> - if (s_off >= s_block->contig_hint_start &&
> - s_off < s_block->contig_hint_start + s_block->contig_hint) {
> + if (pcpu_region_overlap(s_block->contig_hint_start,
> + s_block->contig_hint_start +
> + s_block->contig_hint,
> + s_off,
> + s_off + bits)) {
> /* block contig hint is broken - scan to fix it */
> pcpu_block_refresh_hint(chunk, s_index);
> } else {
> @@ -764,8 +785,10 @@ static void pcpu_block_update_hint_alloc(struct
> pcpu_chunk *chunk, int bit_off,
> * contig hint is broken. Otherwise, it means a smaller space
> * was used and therefore the chunk contig hint is still correct.
> */
> - if (bit_off >= chunk->contig_bits_start &&
> - bit_off < chunk->contig_bits_start + chunk->contig_bits)
> + if (pcpu_region_overlap(chunk->contig_bits_start,
> + chunk->contig_bits_start + chunk->contig_bits,
> + bit_off,
> + bit_off + bits))
> pcpu_chunk_refresh_hint(chunk);
> }
>
> --
> 2.17.1
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Dennis Zhou
> Sent: 2019年2月28日 10:19
> To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> Lameter <[email protected]>
> Cc: Vlad Buslov <[email protected]>; [email protected];
> [email protected]; [email protected]
> Subject: [PATCH 04/12] percpu: manage chunks based on contig_bits instead
> of free_bytes
>
> When a chunk becomes fragmented, it can end up having a large number of
> small allocation areas free. The free_bytes sorting of chunks leads to
> unnecessary checking of chunks that cannot satisfy the allocation.
> Switch to contig_bits sorting to prevent scanning chunks that may not be able
> to service the allocation request.
>
> Signed-off-by: Dennis Zhou <[email protected]>
> ---
> mm/percpu.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index b40112b2fc59..c996bcffbb2a 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -234,7 +234,7 @@ static int pcpu_chunk_slot(const struct pcpu_chunk
> *chunk)
> if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || chunk->contig_bits
> == 0)
> return 0;
>
> - return pcpu_size_to_slot(chunk->free_bytes);
> + return pcpu_size_to_slot(chunk->contig_bits * PCPU_MIN_ALLOC_SIZE);
> }
>
> /* set the pointer to a chunk in a page struct */
Reviewed-by: Peng Fan <[email protected]>
Not relevant to this patch, but another optimization to percpu might be
to use a per-chunk spin_lock instead of the global pcpu_lock.
Thanks,
Peng.
> --
> 2.17.1
Hi Dennis,
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Dennis Zhou
> Sent: 2019年2月28日 10:19
> To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> Lameter <[email protected]>
> Cc: Vlad Buslov <[email protected]>; [email protected];
> [email protected]; [email protected]
> Subject: [PATCH 05/12] percpu: relegate chunks unusable when failing small
> allocations
>
> In certain cases, requestors of percpu memory may want specific alignments.
> However, it is possible to end up in situations where the contig_hint matches,
> but the alignment does not. This causes excess scanning of chunks that will fail.
> To prevent this, if a small allocation fails (< 32B), the chunk is moved to the
> empty list. Once an allocation is freed from that chunk, it is placed back into
> rotation.
>
> Signed-off-by: Dennis Zhou <[email protected]>
> ---
> mm/percpu.c | 35 ++++++++++++++++++++++++++---------
> 1 file changed, 26 insertions(+), 9 deletions(-)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index c996bcffbb2a..3d7deece9556 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -94,6 +94,8 @@
>
> /* the slots are sorted by free bytes left, 1-31 bytes share the same slot */
> #define PCPU_SLOT_BASE_SHIFT 5
> +/* chunks in slots below this are subject to being sidelined on failed alloc */
> +#define PCPU_SLOT_FAIL_THRESHOLD 3
>
> #define PCPU_EMPTY_POP_PAGES_LOW 2
> #define PCPU_EMPTY_POP_PAGES_HIGH 4
> @@ -488,6 +490,22 @@ static void pcpu_mem_free(void *ptr)
> kvfree(ptr);
> }
>
> +static void __pcpu_chunk_move(struct pcpu_chunk *chunk, int slot,
> + bool move_front)
> +{
> + if (chunk != pcpu_reserved_chunk) {
> + if (move_front)
> + list_move(&chunk->list, &pcpu_slot[slot]);
> + else
> + list_move_tail(&chunk->list, &pcpu_slot[slot]);
> + }
> +}
> +
> +static void pcpu_chunk_move(struct pcpu_chunk *chunk, int slot) {
> + __pcpu_chunk_move(chunk, slot, true);
> +}
> +
> /**
> * pcpu_chunk_relocate - put chunk in the appropriate chunk slot
> * @chunk: chunk of interest
> @@ -505,12 +523,8 @@ static void pcpu_chunk_relocate(struct pcpu_chunk
> *chunk, int oslot) {
> int nslot = pcpu_chunk_slot(chunk);
>
> - if (chunk != pcpu_reserved_chunk && oslot != nslot) {
> - if (oslot < nslot)
> - list_move(&chunk->list, &pcpu_slot[nslot]);
> - else
> - list_move_tail(&chunk->list, &pcpu_slot[nslot]);
> - }
> + if (oslot != nslot)
> + __pcpu_chunk_move(chunk, nslot, oslot < nslot);
> }
>
> /**
> @@ -1381,7 +1395,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t
> align, bool reserved,
> bool is_atomic = (gfp & GFP_KERNEL) != GFP_KERNEL;
> bool do_warn = !(gfp & __GFP_NOWARN);
> static int warn_limit = 10;
> - struct pcpu_chunk *chunk;
> + struct pcpu_chunk *chunk, *next;
> const char *err;
> int slot, off, cpu, ret;
> unsigned long flags;
> @@ -1443,11 +1457,14 @@ static void __percpu *pcpu_alloc(size_t size,
> size_t align, bool reserved,
> restart:
> /* search through normal chunks */
> for (slot = pcpu_size_to_slot(size); slot < pcpu_nr_slots; slot++) {
> - list_for_each_entry(chunk, &pcpu_slot[slot], list) {
> + list_for_each_entry_safe(chunk, next, &pcpu_slot[slot], list) {
> off = pcpu_find_block_fit(chunk, bits, bit_align,
> is_atomic);
> - if (off < 0)
> + if (off < 0) {
> + if (slot < PCPU_SLOT_FAIL_THRESHOLD)
> + pcpu_chunk_move(chunk, 0);
> continue;
> + }
>
> off = pcpu_alloc_area(chunk, bits, bit_align, off);
> if (off >= 0)
For the code: Reviewed-by: Peng Fan <[email protected]>
But I did not understand why 32B was chosen. If there is more
information, it would be better to put it in the commit log.
Thanks,
Peng.
> --
> 2.17.1
On Sat, Mar 02, 2019 at 01:32:04PM +0000, Peng Fan wrote:
> Hi Dennis,
>
> Looking at pcpu_alloc_area() itself, I think this is correct to keep
> bitmap_find_next_zero_area() from scanning past the boundaries of alloc_map, so
>
> Reviewed-by: Peng Fan <[email protected]>
>
> There are a few points I did not understand well.
> Per my understanding, pcpu_find_block_fit() finds the first bit offset in a chunk that
> could satisfy the allocation, so bits might be larger than PCPU_BITMAP_BLOCK_BITS. And if
> pcpu_find_block_fit() returns a good offset, it means there is an area in the chunk that
> could satisfy the allocation, so the following pcpu_alloc_area() will not scan past the
> boundaries of alloc_map, right?
>
pcpu_find_block_fit() finds the chunk offset corresponding to the block
that will be able to fit the allocation. Allocations are done by first fit,
so scanning begins from the first_free of a block. Because the hints are
always accurate, you never fail to find a fit in pcpu_alloc_area() if
pcpu_find_block_fit() gives you an offset. This means you never scan
past the end anyway.
Thanks,
Dennis
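
As a standalone illustration of the bound patch 02 adds (the helper name below is
local to this sketch, not the kernel code): the first-fit search window is
start + alloc_bits + one block's worth of bits, clamped to the chunk's bitmap
length, so even a worst-case walk stays within alloc_map.

/* Sketch only: clamp the first-fit search window to the chunk's bitmap. */
static int scan_end(int start, int alloc_bits, int block_bits, int chunk_map_bits)
{
	int end = start + alloc_bits + block_bits;

	return end < chunk_map_bits ? end : chunk_map_bits;
}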
On Sat, Mar 02, 2019 at 01:37:37PM +0000, Peng Fan wrote:
> Hi Dennis,
>
> > +static inline bool pcpu_region_overlap(int a, int b, int x, int y) {
> > + if ((x >= a && x < b) || (y > a && y <= b) ||
> > + (x <= a && y >= b))
>
> I think this could be simplified:
> (a < y) && (x < b) could be used to do overlap check.
>
I'll change it to be the negative.
Thanks,
Dennis
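
A minimal sketch of what the negated form might look like, based on Peng's
simplification above (illustrative only, not the final patch):

/*
 * Two half-open regions [a, b) and [x, y) overlap iff neither lies entirely
 * before the other; this is equivalent to (a < y) && (x < b).
 */
static inline bool pcpu_region_overlap(int a, int b, int x, int y)
{
	return !(b <= x || y <= a);
}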
On Sat, Mar 02, 2019 at 01:48:20PM +0000, Peng Fan wrote:
> Reviewed-by: Peng Fan <[email protected]>
>
> Not relevant to this patch, but another optimization to percpu might be
> to use a per-chunk spin_lock instead of the global pcpu_lock.
>
Percpu memory itself is expensive and for the most part shouldn't be
part of the critical path. Ideally, we don't have multiple chunks being
allocated simultaneously because once an allocation is given out, the
chunk is pinned until all allocations are freed.
Thanks,
Dennis
On Sat, Mar 02, 2019 at 01:55:54PM +0000, Peng Fan wrote:
> Hi Dennis,
>
> For the code: Reviewed-by: Peng Fan <[email protected]>
>
> But I did not understand why 32B was chosen. If there is more
> information, it would be better to put it in the commit log.
>
There isn't; I just picked a small allocation size.
Thanks,
Dennis
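
For what it's worth, the 32B figure lines up with the slot math: assuming the
fls()-based slot mapping in mm/percpu.c (treated as an assumption here, not taken
from this thread), PCPU_SLOT_BASE_SHIFT == 5 puts allocations under 32 bytes in
slots below PCPU_SLOT_FAIL_THRESHOLD (3). A rough standalone sketch:

#include <stdio.h>

/* Minimal fls(): position of the highest set bit, 1-based; 0 for x == 0. */
static int fls_(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Assumed mapping: slot = fls(size) - PCPU_SLOT_BASE_SHIFT + 2, floored at 1. */
static int size_to_slot(int size)
{
	int slot = fls_(size) - 5 /* PCPU_SLOT_BASE_SHIFT */ + 2;

	return slot > 1 ? slot : 1;
}

int main(void)
{
	printf("16B -> slot %d\n", size_to_slot(16));	/* 2: below the threshold of 3 */
	printf("31B -> slot %d\n", size_to_slot(31));	/* 2: below the threshold of 3 */
	printf("32B -> slot %d\n", size_to_slot(32));	/* 3: at the threshold */
	return 0;
}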
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Dennis Zhou
> Sent: 2019年2月28日 10:19
> To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> Lameter <[email protected]>
> Cc: Vlad Buslov <[email protected]>; [email protected];
> [email protected]; [email protected]
> Subject: [PATCH 06/12] percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
>
> Previously, block size was flexible based on the constraint that the
> GCD(PCPU_BITMAP_BLOCK_SIZE, PAGE_SIZE) > 1. However, this carried the
> overhead that keeping a floating number of populated free pages required
> scanning over the free regions of a chunk.
>
> Setting the block size to be fixed at PAGE_SIZE lets us know when an empty
> page becomes used as we will break a full contig_hint of a block.
> This means we no longer have to scan the whole chunk upon breaking a
> contig_hint which empty page management piggybacks off. A later patch
> takes advantage of this to optimize the allocation path by only scanning
> forward using the scan_hint introduced later too.
>
> Signed-off-by: Dennis Zhou <[email protected]>
> ---
> include/linux/percpu.h | 12 ++---
> mm/percpu-km.c | 2 +-
> mm/percpu.c | 111 +++++++++++++++++------------------------
> 3 files changed, 49 insertions(+), 76 deletions(-)
>
> diff --git a/include/linux/percpu.h b/include/linux/percpu.h index
> 70b7123f38c7..9909dc0e273a 100644
> --- a/include/linux/percpu.h
> +++ b/include/linux/percpu.h
> @@ -26,16 +26,10 @@
> #define PCPU_MIN_ALLOC_SHIFT 2
> #define PCPU_MIN_ALLOC_SIZE (1 << PCPU_MIN_ALLOC_SHIFT)
>
> -/* number of bits per page, used to trigger a scan if blocks are > PAGE_SIZE
> */
> -#define PCPU_BITS_PER_PAGE (PAGE_SIZE >>
> PCPU_MIN_ALLOC_SHIFT)
> -
> /*
> - * This determines the size of each metadata block. There are several
> subtle
> - * constraints around this constant. The reserved region must be a multiple
> of
> - * PCPU_BITMAP_BLOCK_SIZE. Additionally, PCPU_BITMAP_BLOCK_SIZE
> must be a
> - * multiple of PAGE_SIZE or PAGE_SIZE must be a multiple of
> - * PCPU_BITMAP_BLOCK_SIZE to align with the populated page map. The
> unit_size
> - * also has to be a multiple of PCPU_BITMAP_BLOCK_SIZE to ensure full
> blocks.
> + * The PCPU_BITMAP_BLOCK_SIZE must be the same size as PAGE_SIZE as
> the
> + * updating of hints is used to manage the nr_empty_pop_pages in both
> + * the chunk and globally.
> */
> #define PCPU_BITMAP_BLOCK_SIZE PAGE_SIZE
> #define PCPU_BITMAP_BLOCK_BITS (PCPU_BITMAP_BLOCK_SIZE >>
> \
> diff --git a/mm/percpu-km.c b/mm/percpu-km.c index
> 0f643dc2dc65..c10bf7466596 100644
> --- a/mm/percpu-km.c
> +++ b/mm/percpu-km.c
> @@ -70,7 +70,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t
> gfp)
> chunk->base_addr = page_address(pages) - pcpu_group_offsets[0];
>
> spin_lock_irqsave(&pcpu_lock, flags);
> - pcpu_chunk_populated(chunk, 0, nr_pages, false);
> + pcpu_chunk_populated(chunk, 0, nr_pages);
> spin_unlock_irqrestore(&pcpu_lock, flags);
>
> pcpu_stats_chunk_alloc();
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 3d7deece9556..967c9cc3a928 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -527,37 +527,21 @@ static void pcpu_chunk_relocate(struct
> pcpu_chunk *chunk, int oslot)
> __pcpu_chunk_move(chunk, nslot, oslot < nslot); }
>
> -/**
> - * pcpu_cnt_pop_pages- counts populated backing pages in range
> +/*
> + * pcpu_update_empty_pages - update empty page counters
> * @chunk: chunk of interest
> - * @bit_off: start offset
> - * @bits: size of area to check
> + * @nr: nr of empty pages
> *
> - * Calculates the number of populated pages in the region
> - * [page_start, page_end). This keeps track of how many empty populated
> - * pages are available and decide if async work should be scheduled.
> - *
> - * RETURNS:
> - * The nr of populated pages.
> + * This is used to keep track of the empty pages now based on the
> + premise
> + * a pcpu_block_md covers a page. The hint update functions recognize
> + if
> + * a block is made full or broken to calculate deltas for keeping track
> + of
> + * free pages.
> */
> -static inline int pcpu_cnt_pop_pages(struct pcpu_chunk *chunk, int bit_off,
> - int bits)
> +static inline void pcpu_update_empty_pages(struct pcpu_chunk *chunk,
> +int nr)
> {
> - int page_start = PFN_UP(bit_off * PCPU_MIN_ALLOC_SIZE);
> - int page_end = PFN_DOWN((bit_off + bits) * PCPU_MIN_ALLOC_SIZE);
> -
> - if (page_start >= page_end)
> - return 0;
> -
> - /*
> - * bitmap_weight counts the number of bits set in a bitmap up to
> - * the specified number of bits. This is counting the populated
> - * pages up to page_end and then subtracting the populated pages
> - * up to page_start to count the populated pages in
> - * [page_start, page_end).
> - */
> - return bitmap_weight(chunk->populated, page_end) -
> - bitmap_weight(chunk->populated, page_start);
> + chunk->nr_empty_pop_pages += nr;
> + if (chunk != pcpu_reserved_chunk)
> + pcpu_nr_empty_pop_pages += nr;
> }
>
> /*
> @@ -611,36 +595,19 @@ static void pcpu_chunk_update(struct pcpu_chunk
> *chunk, int bit_off, int bits)
> * Updates:
> * chunk->contig_bits
> * chunk->contig_bits_start
> - * nr_empty_pop_pages (chunk and global)
> */
> static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk) {
> - int bit_off, bits, nr_empty_pop_pages;
> + int bit_off, bits;
>
> /* clear metadata */
> chunk->contig_bits = 0;
>
> bit_off = chunk->first_bit;
> - bits = nr_empty_pop_pages = 0;
> + bits = 0;
> pcpu_for_each_md_free_region(chunk, bit_off, bits) {
> pcpu_chunk_update(chunk, bit_off, bits);
> -
> - nr_empty_pop_pages += pcpu_cnt_pop_pages(chunk, bit_off, bits);
> }
> -
> - /*
> - * Keep track of nr_empty_pop_pages.
> - *
> - * The chunk maintains the previous number of free pages it held,
> - * so the delta is used to update the global counter. The reserved
> - * chunk is not part of the free page count as they are populated
> - * at init and are special to serving reserved allocations.
> - */
> - if (chunk != pcpu_reserved_chunk)
> - pcpu_nr_empty_pop_pages +=
> - (nr_empty_pop_pages - chunk->nr_empty_pop_pages);
> -
> - chunk->nr_empty_pop_pages = nr_empty_pop_pages;
> }
>
> /**
> @@ -712,6 +679,7 @@ static void pcpu_block_refresh_hint(struct
> pcpu_chunk *chunk, int index) static void
> pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
> int bits)
> {
> + int nr_empty_pages = 0;
> struct pcpu_block_md *s_block, *e_block, *block;
> int s_index, e_index; /* block indexes of the freed allocation */
> int s_off, e_off; /* block offsets of the freed allocation */
> @@ -736,6 +704,9 @@ static void pcpu_block_update_hint_alloc(struct
> pcpu_chunk *chunk, int bit_off,
> * If the allocation breaks the contig_hint, a scan is required to
> * restore this hint.
> */
> + if (s_block->contig_hint == PCPU_BITMAP_BLOCK_BITS)
> + nr_empty_pages++;
> +
> if (s_off == s_block->first_free)
> s_block->first_free = find_next_zero_bit(
> pcpu_index_alloc_map(chunk, s_index), @@ -763,6
> +734,9 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk
> *chunk, int bit_off,
> * Update e_block.
> */
> if (s_index != e_index) {
> + if (e_block->contig_hint == PCPU_BITMAP_BLOCK_BITS)
> + nr_empty_pages++;
> +
> /*
> * When the allocation is across blocks, the end is along
> * the left part of the e_block.
> @@ -787,6 +761,7 @@ static void pcpu_block_update_hint_alloc(struct
> pcpu_chunk *chunk, int bit_off,
> }
>
> /* update in-between md_blocks */
> + nr_empty_pages += (e_index - s_index - 1);
> for (block = s_block + 1; block < e_block; block++) {
> block->contig_hint = 0;
> block->left_free = 0;
> @@ -794,6 +769,9 @@ static void pcpu_block_update_hint_alloc(struct
> pcpu_chunk *chunk, int bit_off,
> }
> }
>
> + if (nr_empty_pages)
> + pcpu_update_empty_pages(chunk, -1 * nr_empty_pages);
> +
> /*
> * The only time a full chunk scan is required is if the chunk
> * contig hint is broken. Otherwise, it means a smaller space @@
> -826,6 +804,7 @@ static void pcpu_block_update_hint_alloc(struct
> pcpu_chunk *chunk, int bit_off, static void
> pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int bit_off,
> int bits)
> {
> + int nr_empty_pages = 0;
> struct pcpu_block_md *s_block, *e_block, *block;
> int s_index, e_index; /* block indexes of the freed allocation */
> int s_off, e_off; /* block offsets of the freed allocation */
> @@ -879,14 +858,19 @@ static void pcpu_block_update_hint_free(struct
> pcpu_chunk *chunk, int bit_off,
>
> /* update s_block */
> e_off = (s_index == e_index) ? end : PCPU_BITMAP_BLOCK_BITS;
> + if (!start && e_off == PCPU_BITMAP_BLOCK_BITS)
> + nr_empty_pages++;
> pcpu_block_update(s_block, start, e_off);
>
> /* freeing in the same block */
> if (s_index != e_index) {
> /* update e_block */
> + if (end == PCPU_BITMAP_BLOCK_BITS)
> + nr_empty_pages++;
> pcpu_block_update(e_block, 0, end);
>
> /* reset md_blocks in the middle */
> + nr_empty_pages += (e_index - s_index - 1);
> for (block = s_block + 1; block < e_block; block++) {
> block->first_free = 0;
> block->contig_hint_start = 0;
> @@ -896,15 +880,16 @@ static void pcpu_block_update_hint_free(struct
> pcpu_chunk *chunk, int bit_off,
> }
> }
>
> + if (nr_empty_pages)
> + pcpu_update_empty_pages(chunk, nr_empty_pages);
> +
> /*
> - * Refresh chunk metadata when the free makes a page free, a block
> - * free, or spans across blocks. The contig hint may be off by up to
> - * a page, but if the hint is contained in a block, it will be accurate
> - * with the else condition below.
> + * Refresh chunk metadata when the free makes a block free or spans
> + * across blocks. The contig_hint may be off by up to a page, but if
> + * the contig_hint is contained in a block, it will be accurate with
> + * the else condition below.
> */
> - if ((ALIGN_DOWN(end, min(PCPU_BITS_PER_PAGE,
> PCPU_BITMAP_BLOCK_BITS)) >
> - ALIGN(start, min(PCPU_BITS_PER_PAGE,
> PCPU_BITMAP_BLOCK_BITS))) ||
> - s_index != e_index)
> + if (((end - start) >= PCPU_BITMAP_BLOCK_BITS) || s_index != e_index)
> pcpu_chunk_refresh_hint(chunk);
> else
> pcpu_chunk_update(chunk, pcpu_block_off_to_off(s_index, start),
> @@ -1164,9 +1149,7 @@ static struct pcpu_chunk * __init
> pcpu_alloc_first_chunk(unsigned long tmp_addr,
> chunk->immutable = true;
> bitmap_fill(chunk->populated, chunk->nr_pages);
> chunk->nr_populated = chunk->nr_pages;
> - chunk->nr_empty_pop_pages =
> - pcpu_cnt_pop_pages(chunk, start_offset / PCPU_MIN_ALLOC_SIZE,
> - map_size / PCPU_MIN_ALLOC_SIZE);
> + chunk->nr_empty_pop_pages = chunk->nr_pages;
>
> chunk->contig_bits = map_size / PCPU_MIN_ALLOC_SIZE;
> chunk->free_bytes = map_size;
> @@ -1261,7 +1244,6 @@ static void pcpu_free_chunk(struct pcpu_chunk
> *chunk)
> * @chunk: pcpu_chunk which got populated
> * @page_start: the start page
> * @page_end: the end page
> - * @for_alloc: if this is to populate for allocation
> *
> * Pages in [@page_start,@page_end) have been populated to @chunk.
> Update
> * the bookkeeping information accordingly. Must be called after each
> @@ -1271,7 +1253,7 @@ static void pcpu_free_chunk(struct pcpu_chunk
> *chunk)
> * is to serve an allocation in that area.
> */
> static void pcpu_chunk_populated(struct pcpu_chunk *chunk, int page_start,
> - int page_end, bool for_alloc)
> + int page_end)
> {
> int nr = page_end - page_start;
>
> @@ -1281,10 +1263,7 @@ static void pcpu_chunk_populated(struct
> pcpu_chunk *chunk, int page_start,
> chunk->nr_populated += nr;
> pcpu_nr_populated += nr;
>
> - if (!for_alloc) {
> - chunk->nr_empty_pop_pages += nr;
> - pcpu_nr_empty_pop_pages += nr;
> - }
> + pcpu_update_empty_pages(chunk, nr);
> }
>
> /**
> @@ -1306,9 +1285,9 @@ static void pcpu_chunk_depopulated(struct
> pcpu_chunk *chunk,
>
> bitmap_clear(chunk->populated, page_start, nr);
> chunk->nr_populated -= nr;
> - chunk->nr_empty_pop_pages -= nr;
> - pcpu_nr_empty_pop_pages -= nr;
> pcpu_nr_populated -= nr;
> +
> + pcpu_update_empty_pages(chunk, -1 * nr);
> }
>
> /*
> @@ -1523,7 +1502,7 @@ static void __percpu *pcpu_alloc(size_t size, size_t
> align, bool reserved,
> err = "failed to populate";
> goto fail_unlock;
> }
> - pcpu_chunk_populated(chunk, rs, re, true);
> + pcpu_chunk_populated(chunk, rs, re);
> spin_unlock_irqrestore(&pcpu_lock, flags);
> }
>
> @@ -1722,7 +1701,7 @@ static void pcpu_balance_workfn(struct
> work_struct *work)
> if (!ret) {
> nr_to_pop -= nr;
> spin_lock_irq(&pcpu_lock);
> - pcpu_chunk_populated(chunk, rs, rs + nr, false);
> + pcpu_chunk_populated(chunk, rs, rs + nr);
> spin_unlock_irq(&pcpu_lock);
> } else {
> nr_to_pop = 0;
> --
Reviewed-by: Peng Fan <[email protected]>
> 2.17.1
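
A rough standalone sketch of the accounting idea in patch 06 (names below are local
to this sketch): with a block covering exactly one page, a block whose contig_hint
spans the whole block is an empty populated page, so the hint-update paths can
detect a page being broken or restored and apply a delta to the counters instead of
rescanning the chunk.

#define BLOCK_BITS	1024	/* stand-in for PCPU_BITMAP_BLOCK_BITS */

struct blk {
	int contig_hint;	/* largest free run in the block, in bits */
};

/* A fully free block corresponds to an empty populated page. */
static int block_is_empty_page(const struct blk *b)
{
	return b->contig_hint == BLOCK_BITS;
}

/*
 * Apply the delta computed by the hint-update paths, roughly what
 * pcpu_update_empty_pages() does in the patch above (the kernel version
 * skips the global counter for the reserved chunk).
 */
static void update_empty_pages(int *chunk_empty, int *global_empty, int nr)
{
	*chunk_empty += nr;
	*global_empty += nr;
}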
Hi Dennis,
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Dennis Zhou
> Sent: 2019年2月28日 10:19
> To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> Lameter <[email protected]>
> Cc: Vlad Buslov <[email protected]>; [email protected];
> [email protected]; [email protected]
> Subject: [PATCH 07/12] percpu: add block level scan_hint
>
> Fragmentation can cause both blocks and chunks to have an early first_free
> bit available, but only able to satisfy allocations much later on. This patch
> introduces a scan_hint to help mitigate some unnecessary scanning.
>
> The scan_hint remembers the largest area prior to the contig_hint. If the
> contig_hint == scan_hint, then scan_hint_start > contig_hint_start.
> This is necessary for scan_hint discovery when refreshing a block.
>
> Signed-off-by: Dennis Zhou <[email protected]>
> ---
> mm/percpu-internal.h | 9 ++++
> mm/percpu.c | 101
> ++++++++++++++++++++++++++++++++++++++++---
> 2 files changed, 103 insertions(+), 7 deletions(-)
>
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> b1739dc06b73..ec58b244545d 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -9,8 +9,17 @@
> * pcpu_block_md is the metadata block struct.
> * Each chunk's bitmap is split into a number of full blocks.
> * All units are in terms of bits.
> + *
> + * The scan hint is the largest known contiguous area before the contig hint.
> + * It is not necessarily the actual largest contig hint though. There
> + is an
> + * invariant that the scan_hint_start > contig_hint_start iff
> + * scan_hint == contig_hint. This is necessary because when scanning
> + forward,
> + * we don't know if a new contig hint would be better than the current one.
> */
> struct pcpu_block_md {
> + int scan_hint; /* scan hint for block */
> + int scan_hint_start; /* block relative starting
> + position of the scan hint */
> int contig_hint; /* contig hint for block */
> int contig_hint_start; /* block relative starting
> position of the contig hint */ diff --git
> a/mm/percpu.c b/mm/percpu.c index 967c9cc3a928..df1aacf58ac8 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -320,6 +320,34 @@ static unsigned long pcpu_block_off_to_off(int index,
> int off)
> return index * PCPU_BITMAP_BLOCK_BITS + off; }
>
> +/*
> + * pcpu_next_hint - determine which hint to use
> + * @block: block of interest
> + * @alloc_bits: size of allocation
> + *
> + * This determines if we should scan based on the scan_hint or first_free.
> + * In general, we want to scan from first_free to fulfill allocations
> +by
> + * first fit. However, if we know a scan_hint at position
> +scan_hint_start
> + * cannot fulfill an allocation, we can begin scanning from there
> +knowing
> + * the contig_hint will be our fallback.
> + */
> +static int pcpu_next_hint(struct pcpu_block_md *block, int alloc_bits)
> +{
> + /*
> + * The three conditions below determine if we can skip past the
> + * scan_hint. First, does the scan hint exist. Second, is the
> + * contig_hint after the scan_hint (possibly not true iff
> + * contig_hint == scan_hint). Third, is the allocation request
> + * larger than the scan_hint.
> + */
> + if (block->scan_hint &&
> + block->contig_hint_start > block->scan_hint_start &&
> + alloc_bits > block->scan_hint)
> + return block->scan_hint_start + block->scan_hint;
> +
> + return block->first_free;
> +}
> +
> /**
> * pcpu_next_md_free_region - finds the next hint free area
> * @chunk: chunk of interest
> @@ -415,9 +443,11 @@ static void pcpu_next_fit_region(struct pcpu_chunk
> *chunk, int alloc_bits,
> if (block->contig_hint &&
> block->contig_hint_start >= block_off &&
> block->contig_hint >= *bits + alloc_bits) {
> + int start = pcpu_next_hint(block, alloc_bits);
> +
> *bits += alloc_bits + block->contig_hint_start -
> - block->first_free;
> - *bit_off = pcpu_block_off_to_off(i, block->first_free);
> + start;
This might not relevant to this patch.
Not sure it is intended or not.
For `alloc_bits + block->contig_hink_start - [block->first_free or start]`
If the reason is to let pcpu_is_populated return a proper next_off when pcpu_is_populated
fail, it makes sense. If not, why not just use *bits += alloc_bits.
> + *bit_off = pcpu_block_off_to_off(i, start);
> return;
> }
> /* reset to satisfy the second predicate above */ @@ -632,12
> +662,57 @@ static void pcpu_block_update(struct pcpu_block_md *block, int
> start, int end)
> block->right_free = contig;
>
> if (contig > block->contig_hint) {
> + /* promote the old contig_hint to be the new scan_hint */
> + if (start > block->contig_hint_start) {
> + if (block->contig_hint > block->scan_hint) {
> + block->scan_hint_start =
> + block->contig_hint_start;
> + block->scan_hint = block->contig_hint;
> + } else if (start < block->scan_hint_start) {
> + /*
> + * The old contig_hint == scan_hint. But, the
> + * new contig is larger so hold the invariant
> + * scan_hint_start < contig_hint_start.
> + */
> + block->scan_hint = 0;
> + }
> + } else {
> + block->scan_hint = 0;
> + }
> block->contig_hint_start = start;
> block->contig_hint = contig;
> - } else if (block->contig_hint_start && contig == block->contig_hint &&
> - (!start || __ffs(start) > __ffs(block->contig_hint_start))) {
> - /* use the start with the best alignment */
> - block->contig_hint_start = start;
> + } else if (contig == block->contig_hint) {
> + if (block->contig_hint_start &&
> + (!start ||
> + __ffs(start) > __ffs(block->contig_hint_start))) {
> + /* start has a better alignment so use it */
> + block->contig_hint_start = start;
> + if (start < block->scan_hint_start &&
> + block->contig_hint > block->scan_hint)
> + block->scan_hint = 0;
> + } else if (start > block->scan_hint_start ||
> + block->contig_hint > block->scan_hint) {
> + /*
> + * Knowing contig == contig_hint, update the scan_hint
> + * if it is farther than or larger than the current
> + * scan_hint.
> + */
> + block->scan_hint_start = start;
> + block->scan_hint = contig;
> + }
> + } else {
> + /*
> + * The region is smaller than the contig_hint. So only update
> + * the scan_hint if it is larger than or equal and farther than
> + * the current scan_hint.
> + */
> + if ((start < block->contig_hint_start &&
> + (contig > block->scan_hint ||
> + (contig == block->scan_hint &&
> + start > block->scan_hint_start)))) {
> + block->scan_hint_start = start;
> + block->scan_hint = contig;
> + }
> }
> }
>
> @@ -656,7 +731,7 @@ static void pcpu_block_refresh_hint(struct
> pcpu_chunk *chunk, int index)
> int rs, re; /* region start, region end */
>
> /* clear hints */
> - block->contig_hint = 0;
> + block->contig_hint = block->scan_hint = 0;
> block->left_free = block->right_free = 0;
>
> /* iterate over free areas and update the contig hints */ @@ -713,6
> +788,12 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk
> *chunk, int bit_off,
> PCPU_BITMAP_BLOCK_BITS,
> s_off + bits);
>
> + if (pcpu_region_overlap(s_block->scan_hint_start,
> + s_block->scan_hint_start + s_block->scan_hint,
> + s_off,
> + s_off + bits))
> + s_block->scan_hint = 0;
> +
> if (pcpu_region_overlap(s_block->contig_hint_start,
> s_block->contig_hint_start +
> s_block->contig_hint,
> @@ -749,6 +830,9 @@ static void pcpu_block_update_hint_alloc(struct
> pcpu_chunk *chunk, int bit_off,
> /* reset the block */
> e_block++;
> } else {
> + if (e_off > e_block->scan_hint_start)
> + e_block->scan_hint = 0;
> +
> if (e_off > e_block->contig_hint_start) {
> /* contig hint is broken - scan to fix it */
> pcpu_block_refresh_hint(chunk, e_index); @@ -763,6
> +847,7 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk
> *chunk, int bit_off,
> /* update in-between md_blocks */
> nr_empty_pages += (e_index - s_index - 1);
> for (block = s_block + 1; block < e_block; block++) {
> + block->scan_hint = 0;
> block->contig_hint = 0;
> block->left_free = 0;
> block->right_free = 0;
> @@ -873,6 +958,7 @@ static void pcpu_block_update_hint_free(struct
> pcpu_chunk *chunk, int bit_off,
> nr_empty_pages += (e_index - s_index - 1);
> for (block = s_block + 1; block < e_block; block++) {
> block->first_free = 0;
> + block->scan_hint = 0;
> block->contig_hint_start = 0;
> block->contig_hint = PCPU_BITMAP_BLOCK_BITS;
> block->left_free = PCPU_BITMAP_BLOCK_BITS; @@ -1084,6
> +1170,7 @@ static void pcpu_init_md_blocks(struct pcpu_chunk *chunk)
> for (md_block = chunk->md_blocks;
> md_block != chunk->md_blocks + pcpu_chunk_nr_blocks(chunk);
> md_block++) {
> + md_block->scan_hint = 0;
> md_block->contig_hint = PCPU_BITMAP_BLOCK_BITS;
> md_block->left_free = PCPU_BITMAP_BLOCK_BITS;
> md_block->right_free = PCPU_BITMAP_BLOCK_BITS;
Reviewed-by: Peng Fan <[email protected]>
> --
> 2.17.1
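
To illustrate the trickiest part of the pcpu_block_update() changes quoted above,
below is a standalone sketch of just the branch taken when a newly freed region
beats the current contig_hint: the old contig_hint is kept around as the scan_hint
when it lies to the left of the new one. Names are local to this sketch, not the
kernel code.

#include <assert.h>

struct hints {
	int contig_hint, contig_hint_start;
	int scan_hint, scan_hint_start;
};

/* Mirrors only the "contig > block->contig_hint" branch of pcpu_block_update(). */
static void update_on_larger_region(struct hints *h, int start, int contig)
{
	/* promote the old contig_hint to be the new scan_hint where possible */
	if (start > h->contig_hint_start) {
		if (h->contig_hint > h->scan_hint) {
			h->scan_hint_start = h->contig_hint_start;
			h->scan_hint = h->contig_hint;
		} else if (start < h->scan_hint_start) {
			/* matches the corresponding case in the patch */
			h->scan_hint = 0;
		}
	} else {
		h->scan_hint = 0;
	}
	h->contig_hint_start = start;
	h->contig_hint = contig;
}

int main(void)
{
	struct hints h = {
		.contig_hint = 16, .contig_hint_start = 100,
		.scan_hint = 8, .scan_hint_start = 10,
	};

	/* A 20-bit free area appears at offset 300: the old contig_hint
	 * (16 @ 100) becomes the scan_hint, the new area the contig_hint. */
	update_on_larger_region(&h, 300, 20);
	assert(h.contig_hint == 20 && h.contig_hint_start == 300);
	assert(h.scan_hint == 16 && h.scan_hint_start == 100);
	return 0;
}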
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Dennis Zhou
> Sent: 2019??2??28?? 10:19
> To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> Lameter <[email protected]>
> Cc: Vlad Buslov <[email protected]>; [email protected];
> [email protected]; [email protected]
> Subject: [PATCH 10/12] percpu: make pcpu_block_md generic
>
> In reality, a chunk is just a block covering a larger number of bits.
> The hints themselves are one and the same. Rather than maintaining the hints
> separately, first introduce nr_bits to genericize
> pcpu_block_update() to correctly maintain block->right_free. The next patch
> will convert chunk hints to be managed as a pcpu_block_md.
>
> Signed-off-by: Dennis Zhou <[email protected]>
> ---
> mm/percpu-internal.h | 1 +
> mm/percpu.c | 20 +++++++++++++-------
> 2 files changed, 14 insertions(+), 7 deletions(-)
>
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> ec58b244545d..119bd1119aa7 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -28,6 +28,7 @@ struct pcpu_block_md {
> int right_free; /* size of free space along
> the right side of the block */
> int first_free; /* block position of first free
> */
> + int nr_bits; /* total bits responsible for */
> };
>
> struct pcpu_chunk {
> diff --git a/mm/percpu.c b/mm/percpu.c
> index e51c151ed692..7cdf14c242de 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -658,7 +658,7 @@ static void pcpu_block_update(struct pcpu_block_md
> *block, int start, int end)
> if (start == 0)
> block->left_free = contig;
>
> - if (end == PCPU_BITMAP_BLOCK_BITS)
> + if (end == block->nr_bits)
> block->right_free = contig;
>
> if (contig > block->contig_hint) {
> @@ -1271,18 +1271,24 @@ static void pcpu_free_area(struct pcpu_chunk
> *chunk, int off)
> pcpu_chunk_relocate(chunk, oslot);
> }
>
> +static void pcpu_init_md_block(struct pcpu_block_md *block, int
> +nr_bits) {
> + block->scan_hint = 0;
> + block->contig_hint = nr_bits;
> + block->left_free = nr_bits;
> + block->right_free = nr_bits;
> + block->first_free = 0;
> + block->nr_bits = nr_bits;
> +}
> +
> static void pcpu_init_md_blocks(struct pcpu_chunk *chunk) {
> struct pcpu_block_md *md_block;
>
> for (md_block = chunk->md_blocks;
> md_block != chunk->md_blocks + pcpu_chunk_nr_blocks(chunk);
> - md_block++) {
> - md_block->scan_hint = 0;
> - md_block->contig_hint = PCPU_BITMAP_BLOCK_BITS;
> - md_block->left_free = PCPU_BITMAP_BLOCK_BITS;
> - md_block->right_free = PCPU_BITMAP_BLOCK_BITS;
> - }
> + md_block++)
> + pcpu_init_md_block(md_block, PCPU_BITMAP_BLOCK_BITS);
> }
Reviewed-by: Peng Fan <[email protected]>
>
> /**
> --
> 2.17.1
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Dennis Zhou
> Sent: 2019年2月28日 10:19
> To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> Lameter <[email protected]>
> Cc: Vlad Buslov <[email protected]>; [email protected];
> [email protected]; [email protected]
> Subject: [PATCH 11/12] percpu: convert chunk hints to be based on
> pcpu_block_md
>
> As mentioned in the last patch, a chunk's hints are no different than a block
> just responsible for more bits. This converts chunk level hints to use a
> pcpu_block_md to maintain them. This lets us reuse the same hint helper
> functions as a block. The left_free and right_free are unused by the chunk's
> pcpu_block_md.
>
> Signed-off-by: Dennis Zhou <[email protected]>
> ---
> mm/percpu-internal.h | 5 +-
> mm/percpu-stats.c | 5 +-
> mm/percpu.c | 120 +++++++++++++++++++------------------------
> 3 files changed, 57 insertions(+), 73 deletions(-)
>
> diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> 119bd1119aa7..0468ba500bd4 100644
> --- a/mm/percpu-internal.h
> +++ b/mm/percpu-internal.h
> @@ -39,9 +39,7 @@ struct pcpu_chunk {
>
> struct list_head list; /* linked to pcpu_slot lists */
> int free_bytes; /* free bytes in the chunk */
> - int contig_bits; /* max contiguous size hint */
> - int contig_bits_start; /* contig_bits starting
> - offset */
> + struct pcpu_block_md chunk_md;
> void *base_addr; /* base address of this chunk */
>
> unsigned long *alloc_map; /* allocation map */
> @@ -49,7 +47,6 @@ struct pcpu_chunk {
> struct pcpu_block_md *md_blocks; /* metadata blocks */
>
> void *data; /* chunk data */
> - int first_bit; /* no free below this */
> bool immutable; /* no [de]population allowed */
> int start_offset; /* the overlap with the previous
> region to have a page aligned
> diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c index
> b5fdd43b60c9..ef5034a0464e 100644
> --- a/mm/percpu-stats.c
> +++ b/mm/percpu-stats.c
> @@ -53,6 +53,7 @@ static int find_max_nr_alloc(void) static void
> chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
> int *buffer)
> {
> + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> int i, last_alloc, as_len, start, end;
> int *alloc_sizes, *p;
> /* statistics */
> @@ -121,9 +122,9 @@ static void chunk_map_stats(struct seq_file *m,
> struct pcpu_chunk *chunk,
> P("nr_alloc", chunk->nr_alloc);
> P("max_alloc_size", chunk->max_alloc_size);
> P("empty_pop_pages", chunk->nr_empty_pop_pages);
> - P("first_bit", chunk->first_bit);
> + P("first_bit", chunk_md->first_free);
> P("free_bytes", chunk->free_bytes);
> - P("contig_bytes", chunk->contig_bits * PCPU_MIN_ALLOC_SIZE);
> + P("contig_bytes", chunk_md->contig_hint * PCPU_MIN_ALLOC_SIZE);
> P("sum_frag", sum_frag);
> P("max_frag", max_frag);
> P("cur_min_alloc", cur_min_alloc);
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 7cdf14c242de..197479f2c489 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -233,10 +233,13 @@ static int pcpu_size_to_slot(int size)
>
> static int pcpu_chunk_slot(const struct pcpu_chunk *chunk) {
> - if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || chunk->contig_bits
> == 0)
> + const struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> +
> + if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE ||
> + chunk_md->contig_hint == 0)
> return 0;
>
> - return pcpu_size_to_slot(chunk->contig_bits * PCPU_MIN_ALLOC_SIZE);
> + return pcpu_size_to_slot(chunk_md->contig_hint *
> PCPU_MIN_ALLOC_SIZE);
> }
>
> /* set the pointer to a chunk in a page struct */ @@ -592,54 +595,6 @@
> static inline bool pcpu_region_overlap(int a, int b, int x, int y)
> return false;
> }
>
> -/**
> - * pcpu_chunk_update - updates the chunk metadata given a free area
> - * @chunk: chunk of interest
> - * @bit_off: chunk offset
> - * @bits: size of free area
> - *
> - * This updates the chunk's contig hint and starting offset given a free area.
> - * Choose the best starting offset if the contig hint is equal.
> - */
> -static void pcpu_chunk_update(struct pcpu_chunk *chunk, int bit_off, int bits)
> -{
> - if (bits > chunk->contig_bits) {
> - chunk->contig_bits_start = bit_off;
> - chunk->contig_bits = bits;
> - } else if (bits == chunk->contig_bits && chunk->contig_bits_start &&
> - (!bit_off ||
> - __ffs(bit_off) > __ffs(chunk->contig_bits_start))) {
> - /* use the start with the best alignment */
> - chunk->contig_bits_start = bit_off;
> - }
> -}
> -
> -/**
> - * pcpu_chunk_refresh_hint - updates metadata about a chunk
> - * @chunk: chunk of interest
> - *
> - * Iterates over the metadata blocks to find the largest contig area.
> - * It also counts the populated pages and uses the delta to update the
> - * global count.
> - *
> - * Updates:
> - * chunk->contig_bits
> - * chunk->contig_bits_start
> - */
> -static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk) -{
> - int bit_off, bits;
> -
> - /* clear metadata */
> - chunk->contig_bits = 0;
> -
> - bit_off = chunk->first_bit;
> - bits = 0;
> - pcpu_for_each_md_free_region(chunk, bit_off, bits) {
> - pcpu_chunk_update(chunk, bit_off, bits);
> - }
> -}
> -
> /**
> * pcpu_block_update - updates a block given a free area
> * @block: block of interest
> @@ -753,6 +708,29 @@ static void pcpu_block_update_scan(struct
> pcpu_chunk *chunk, int bit_off,
> pcpu_block_update(block, s_off, e_off); }
>
> +/**
> + * pcpu_chunk_refresh_hint - updates metadata about a chunk
> + * @chunk: chunk of interest
> + *
> + * Iterates over the metadata blocks to find the largest contig area.
> + * It also counts the populated pages and uses the delta to update the
> + * global count.
> + */
> +static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk) {
> + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> + int bit_off, bits;
> +
> + /* clear metadata */
> + chunk_md->contig_hint = 0;
> +
> + bit_off = chunk_md->first_free;
> + bits = 0;
> + pcpu_for_each_md_free_region(chunk, bit_off, bits) {
> + pcpu_block_update(chunk_md, bit_off, bit_off + bits);
> + }
> +}
> +
> /**
> * pcpu_block_refresh_hint
> * @chunk: chunk of interest
> @@ -800,6 +778,7 @@ static void pcpu_block_refresh_hint(struct
> pcpu_chunk *chunk, int index) static void
> pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
> int bits)
> {
> + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> int nr_empty_pages = 0;
> struct pcpu_block_md *s_block, *e_block, *block;
> int s_index, e_index; /* block indexes of the freed allocation */
> @@ -910,8 +889,9 @@ static void pcpu_block_update_hint_alloc(struct
> pcpu_chunk *chunk, int bit_off,
> * contig hint is broken. Otherwise, it means a smaller space
> * was used and therefore the chunk contig hint is still correct.
> */
> - if (pcpu_region_overlap(chunk->contig_bits_start,
> - chunk->contig_bits_start + chunk->contig_bits,
> + if (pcpu_region_overlap(chunk_md->contig_hint_start,
> + chunk_md->contig_hint_start +
> + chunk_md->contig_hint,
> bit_off,
> bit_off + bits))
> pcpu_chunk_refresh_hint(chunk);
> @@ -930,9 +910,10 @@ static void pcpu_block_update_hint_alloc(struct
> pcpu_chunk *chunk, int bit_off,
> *
> * A chunk update is triggered if a page becomes free, a block becomes free,
> * or the free spans across blocks. This tradeoff is to minimize iterating
> - * over the block metadata to update chunk->contig_bits.
> chunk->contig_bits
> - * may be off by up to a page, but it will never be more than the available
> - * space. If the contig hint is contained in one block, it will be accurate.
> + * over the block metadata to update chunk_md->contig_hint.
> + * chunk_md->contig_hint may be off by up to a page, but it will never
> + be more
> + * than the available space. If the contig hint is contained in one
> + block, it
> + * will be accurate.
> */
> static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int
> bit_off,
> int bits)
> @@ -1026,8 +1007,9 @@ static void pcpu_block_update_hint_free(struct
> pcpu_chunk *chunk, int bit_off,
> if (((end - start) >= PCPU_BITMAP_BLOCK_BITS) || s_index != e_index)
> pcpu_chunk_refresh_hint(chunk);
> else
> - pcpu_chunk_update(chunk, pcpu_block_off_to_off(s_index, start),
> - end - start);
> + pcpu_block_update(&chunk->chunk_md,
> + pcpu_block_off_to_off(s_index, start),
> + end);
> }
>
> /**
> @@ -1082,6 +1064,7 @@ static bool pcpu_is_populated(struct pcpu_chunk
> *chunk, int bit_off, int bits, static int pcpu_find_block_fit(struct pcpu_chunk
> *chunk, int alloc_bits,
> size_t align, bool pop_only)
> {
> + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> int bit_off, bits, next_off;
>
> /*
> @@ -1090,12 +1073,12 @@ static int pcpu_find_block_fit(struct pcpu_chunk
> *chunk, int alloc_bits,
> * cannot fit in the global hint, there is memory pressure and creating
> * a new chunk would happen soon.
> */
> - bit_off = ALIGN(chunk->contig_bits_start, align) -
> - chunk->contig_bits_start;
> - if (bit_off + alloc_bits > chunk->contig_bits)
> + bit_off = ALIGN(chunk_md->contig_hint_start, align) -
> + chunk_md->contig_hint_start;
> + if (bit_off + alloc_bits > chunk_md->contig_hint)
> return -1;
>
> - bit_off = chunk->first_bit;
> + bit_off = chunk_md->first_free;
> bits = 0;
> pcpu_for_each_fit_region(chunk, alloc_bits, align, bit_off, bits) {
> if (!pop_only || pcpu_is_populated(chunk, bit_off, bits, @@ -1190,6
> +1173,7 @@ static unsigned long pcpu_find_zero_area(unsigned long *map,
> static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
> size_t align, int start)
> {
> + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> size_t align_mask = (align) ? (align - 1) : 0;
> unsigned long area_off = 0, area_bits = 0;
> int bit_off, end, oslot;
> @@ -1222,8 +1206,8 @@ static int pcpu_alloc_area(struct pcpu_chunk
> *chunk, int alloc_bits,
> chunk->free_bytes -= alloc_bits * PCPU_MIN_ALLOC_SIZE;
>
> /* update first free bit */
> - if (bit_off == chunk->first_bit)
> - chunk->first_bit = find_next_zero_bit(
> + if (bit_off == chunk_md->first_free)
> + chunk_md->first_free = find_next_zero_bit(
> chunk->alloc_map,
> pcpu_chunk_map_bits(chunk),
> bit_off + alloc_bits);
> @@ -1245,6 +1229,7 @@ static int pcpu_alloc_area(struct pcpu_chunk
> *chunk, int alloc_bits,
> */
> static void pcpu_free_area(struct pcpu_chunk *chunk, int off) {
> + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> int bit_off, bits, end, oslot;
>
> lockdep_assert_held(&pcpu_lock);
> @@ -1264,7 +1249,7 @@ static void pcpu_free_area(struct pcpu_chunk
> *chunk, int off)
> chunk->free_bytes += bits * PCPU_MIN_ALLOC_SIZE;
>
> /* update first free bit */
> - chunk->first_bit = min(chunk->first_bit, bit_off);
> + chunk_md->first_free = min(chunk_md->first_free, bit_off);
>
> pcpu_block_update_hint_free(chunk, bit_off, bits);
>
> @@ -1285,6 +1270,9 @@ static void pcpu_init_md_blocks(struct pcpu_chunk
> *chunk) {
> struct pcpu_block_md *md_block;
>
> + /* init the chunk's block */
> + pcpu_init_md_block(&chunk->chunk_md,
> pcpu_chunk_map_bits(chunk));
> +
> for (md_block = chunk->md_blocks;
> md_block != chunk->md_blocks + pcpu_chunk_nr_blocks(chunk);
> md_block++)
> @@ -1352,7 +1340,6 @@ static struct pcpu_chunk * __init
> pcpu_alloc_first_chunk(unsigned long tmp_addr,
> chunk->nr_populated = chunk->nr_pages;
> chunk->nr_empty_pop_pages = chunk->nr_pages;
>
> - chunk->contig_bits = map_size / PCPU_MIN_ALLOC_SIZE;
> chunk->free_bytes = map_size;
>
> if (chunk->start_offset) {
> @@ -1362,7 +1349,7 @@ static struct pcpu_chunk * __init
> pcpu_alloc_first_chunk(unsigned long tmp_addr,
> set_bit(0, chunk->bound_map);
> set_bit(offset_bits, chunk->bound_map);
>
> - chunk->first_bit = offset_bits;
> + chunk->chunk_md.first_free = offset_bits;
>
> pcpu_block_update_hint_alloc(chunk, 0, offset_bits);
> }
> @@ -1415,7 +1402,6 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t
> gfp)
> pcpu_init_md_blocks(chunk);
>
> /* init metadata */
> - chunk->contig_bits = region_bits;
> chunk->free_bytes = chunk->nr_pages * PAGE_SIZE;
>
> return chunk;
Reviewed-by: Peng Fan <[email protected]>
Nitpick: how about naming a function __pcpu_md_update and letting
pcpu_chunk_update and pcpu_block_update call __pcpu_md_update? If you
prefer, I could submit a patch.
Thanks,
Peng.
> --
> 2.17.1
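A minimal sketch of the shared helper Peng suggests above, reduced to
the two hint fields so it compiles standalone; the struct name and the
use of __builtin_ffs() (instead of the kernel's __ffs()) are
illustrative assumptions, not code from the series:

  struct md_hints {
          int contig_hint;        /* largest known free area, in bits */
          int contig_hint_start;  /* starting offset of that area */
  };

  /* fold a free region [bit_off, bit_off + bits) into the hints */
  static void __pcpu_md_update(struct md_hints *md, int bit_off, int bits)
  {
          if (bits > md->contig_hint) {
                  md->contig_hint_start = bit_off;
                  md->contig_hint = bits;
          } else if (bits == md->contig_hint && md->contig_hint_start &&
                     (!bit_off || __builtin_ffs(bit_off) >
                                  __builtin_ffs(md->contig_hint_start))) {
                  /* on a tie, keep the better-aligned start */
                  md->contig_hint_start = bit_off;
          }
  }

Both a chunk-level and a block-level wrapper could then call this with
their own metadata; Dennis's reply further down explains why he would
rather rename the existing function than add two thin wrappers.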
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On
> Behalf Of Dennis Zhou
> Sent: 2019年2月28日 10:19
> To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> Lameter <[email protected]>
> Cc: Vlad Buslov <[email protected]>; [email protected];
> [email protected]; [email protected]
> Subject: [PATCH 12/12] percpu: use chunk scan_hint to skip some scanning
>
> Just like blocks, chunks now maintain a scan_hint. This can be used to skip
> some scanning by promoting the scan_hint to be the contig_hint.
> The chunk's scan_hint is primarily updated on the backside and relies on full
> scanning when a block becomes free or the free region spans across blocks.
>
> Signed-off-by: Dennis Zhou <[email protected]>
> ---
> mm/percpu.c | 36 +++++++++++++++++++++++++++---------
> 1 file changed, 27 insertions(+), 9 deletions(-)
>
> diff --git a/mm/percpu.c b/mm/percpu.c
> index 197479f2c489..40d49d7fb286 100644
> --- a/mm/percpu.c
> +++ b/mm/percpu.c
> @@ -711,20 +711,31 @@ static void pcpu_block_update_scan(struct
> pcpu_chunk *chunk, int bit_off,
> /**
> * pcpu_chunk_refresh_hint - updates metadata about a chunk
> * @chunk: chunk of interest
> + * @full_scan: if we should scan from the beginning
> *
> * Iterates over the metadata blocks to find the largest contig area.
> - * It also counts the populated pages and uses the delta to update the
> - * global count.
> + * A full scan can be avoided on the allocation path as this is
> + triggered
> + * if we broke the contig_hint. In doing so, the scan_hint will be
> + before
> + * the contig_hint or after if the scan_hint == contig_hint. This
> + cannot
> + * be prevented on freeing as we want to find the largest area possibly
> + * spanning blocks.
> */
> -static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk)
> +static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk, bool
> +full_scan)
> {
> struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> int bit_off, bits;
>
> - /* clear metadata */
> - chunk_md->contig_hint = 0;
> + /* promote scan_hint to contig_hint */
> + if (!full_scan && chunk_md->scan_hint) {
> + bit_off = chunk_md->scan_hint_start + chunk_md->scan_hint;
> + chunk_md->contig_hint_start = chunk_md->scan_hint_start;
> + chunk_md->contig_hint = chunk_md->scan_hint;
> + chunk_md->scan_hint = 0;
> + } else {
> + bit_off = chunk_md->first_free;
> + chunk_md->contig_hint = 0;
> + }
>
> - bit_off = chunk_md->first_free;
> bits = 0;
> pcpu_for_each_md_free_region(chunk, bit_off, bits) {
> pcpu_block_update(chunk_md, bit_off, bit_off + bits); @@ -884,6
> +895,13 @@ static void pcpu_block_update_hint_alloc(struct pcpu_chunk
> *chunk, int bit_off,
> if (nr_empty_pages)
> pcpu_update_empty_pages(chunk, -1 * nr_empty_pages);
>
> + if (pcpu_region_overlap(chunk_md->scan_hint_start,
> + chunk_md->scan_hint_start +
> + chunk_md->scan_hint,
> + bit_off,
> + bit_off + bits))
> + chunk_md->scan_hint = 0;
> +
> /*
> * The only time a full chunk scan is required is if the chunk
> * contig hint is broken. Otherwise, it means a smaller space @@
> -894,7 +912,7 @@ static void pcpu_block_update_hint_alloc(struct
> pcpu_chunk *chunk, int bit_off,
> chunk_md->contig_hint,
> bit_off,
> bit_off + bits))
> - pcpu_chunk_refresh_hint(chunk);
> + pcpu_chunk_refresh_hint(chunk, false);
> }
>
> /**
> @@ -1005,7 +1023,7 @@ static void pcpu_block_update_hint_free(struct
> pcpu_chunk *chunk, int bit_off,
> * the else condition below.
> */
> if (((end - start) >= PCPU_BITMAP_BLOCK_BITS) || s_index != e_index)
> - pcpu_chunk_refresh_hint(chunk);
> + pcpu_chunk_refresh_hint(chunk, true);
> else
> pcpu_block_update(&chunk->chunk_md,
> pcpu_block_off_to_off(s_index, start), @@ -1078,7
> +1096,7 @@ static int pcpu_find_block_fit(struct pcpu_chunk *chunk, int
> alloc_bits,
> if (bit_off + alloc_bits > chunk_md->contig_hint)
> return -1;
>
> - bit_off = chunk_md->first_free;
> + bit_off = pcpu_next_hint(chunk_md, alloc_bits);
> bits = 0;
> pcpu_for_each_fit_region(chunk, alloc_bits, align, bit_off, bits) {
> if (!pop_only || pcpu_is_populated(chunk, bit_off, bits,
Reviewed-by: Peng Fan <[email protected]>
> --
> 2.17.1
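The hunk above drops the chunk's scan_hint whenever the freshly
allocated region overlaps it. Below is a reduced, standalone sketch of
the kind of half-open interval check pcpu_region_overlap() performs;
the actual helper from patch 0003 may be written differently:

  #include <stdbool.h>

  /* does [a, b) overlap [x, y)? */
  static bool region_overlap(int a, int b, int x, int y)
  {
          return a < y && x < b;
  }

At the call site above, [a, b) is the scan_hint region
[scan_hint_start, scan_hint_start + scan_hint) and [x, y) is the newly
allocated area [bit_off, bit_off + bits).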
> -----Original Message-----
> From: Dennis Zhou [mailto:[email protected]]
> Sent: 2019年3月3日 6:24
> To: Peng Fan <[email protected]>
> Cc: Tejun Heo <[email protected]>; Christoph Lameter <[email protected]>; Vlad
> Buslov <[email protected]>; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH 02/12] percpu: do not search past bitmap when allocating
> an area
>
> On Sat, Mar 02, 2019 at 01:32:04PM +0000, Peng Fan wrote:
> > Hi Dennis,
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:[email protected]]
> On
> > > Behalf Of Dennis Zhou
> > > Sent: 2019年2月28日 10:18
> > > To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>;
> > > Christoph Lameter <[email protected]>
> > > Cc: Vlad Buslov <[email protected]>; [email protected];
> > > [email protected]; [email protected]
> > > Subject: [PATCH 02/12] percpu: do not search past bitmap when
> > > allocating an area
> > >
> > > pcpu_find_block_fit() guarantees that a fit is found within
> > > PCPU_BITMAP_BLOCK_BITS. Iteration is used to determine the first fit
> > > as it compares against the block's contig_hint. This can lead to
> > > incorrectly scanning past the end of the bitmap. The behavior was
> > > okay given the subsequent check for bit_off >= end and the correctness
> > > of the hints from pcpu_find_block_fit().
> > >
> > > This patch fixes this by bounding the end offset by the number of
> > > bits in a chunk.
> > >
> > > Signed-off-by: Dennis Zhou <[email protected]>
> > > ---
> > > mm/percpu.c | 3 ++-
> > > 1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/percpu.c b/mm/percpu.c index
> > > 53bd79a617b1..69ca51d238b5 100644
> > > --- a/mm/percpu.c
> > > +++ b/mm/percpu.c
> > > @@ -988,7 +988,8 @@ static int pcpu_alloc_area(struct pcpu_chunk
> > > *chunk, int alloc_bits,
> > > /*
> > > * Search to find a fit.
> > > */
> > > - end = start + alloc_bits + PCPU_BITMAP_BLOCK_BITS;
> > > + end = min_t(int, start + alloc_bits + PCPU_BITMAP_BLOCK_BITS,
> > > + pcpu_chunk_map_bits(chunk));
> > > bit_off = bitmap_find_next_zero_area(chunk->alloc_map, end,
> start,
> > > alloc_bits, align_mask);
> > > if (bit_off >= end)
> > > --
> >
> > From pcpu_alloc_area() itself, I think this is correct to keep
> > bitmap_find_next_zero_area() from scanning past the boundaries of
> > alloc_map, so
> >
> > Reviewed-by: Peng Fan <[email protected]>
> >
> > There are a few points I did not understand well. Per my understanding,
> > pcpu_find_block_fit() finds the first bit offset in a chunk which
> > could satisfy the allocation, so bits might be larger than
> > PCPU_BITMAP_BLOCK_BITS. And if pcpu_find_block_fit() returns a good
> > offset, it means there is an area in the chunk that could satisfy the
> > allocation, so the following pcpu_alloc_area() will not scan past the
> > boundaries of alloc_map, right?
> >
>
> pcpu_find_block_fit() finds the chunk offset corresponding to the block
> that will be able to fit the allocation. Allocations are done by first
> fit, so scanning begins from the first_free of a block. Because the
> hints are always accurate, you never fail to find a fit in
> pcpu_alloc_area() if pcpu_find_block_fit() gives you an offset. This
> means you never scan past the end anyway.
Thanks for the explanation.
Thanks,
Peng.
>
> Thanks,
> Dennis
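A rough, standalone illustration of the clamp discussed above, with
made-up sizes (patch 0006 pins the block size to one page):

  /* sketch of the bound added in patch 0002 */
  static int search_end(int start, int alloc_bits, int block_bits,
                        int chunk_map_bits)
  {
          int end = start + alloc_bits + block_bits;

          return end < chunk_map_bits ? end : chunk_map_bits;
  }

  /*
   * e.g. start = 1000, alloc_bits = 8, block_bits = 1024 and a
   * single-block chunk with chunk_map_bits = 1024: the unclamped end
   * would be 2032, the clamped end is 1024, so the zero-area search
   * never reads past alloc_map even though, as Dennis notes, the hints
   * guarantee a fit will be found before that point anyway.
   */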
> -----Original Message-----
> From: Dennis Zhou [mailto:[email protected]]
> Sent: 2019年3月3日 6:32
> To: Peng Fan <[email protected]>
> Cc: Tejun Heo <[email protected]>; Christoph Lameter <[email protected]>; Vlad
> Buslov <[email protected]>; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH 04/12] percpu: manage chunks based on contig_bits
> instead of free_bytes
>
> On Sat, Mar 02, 2019 at 01:48:20PM +0000, Peng Fan wrote:
> >
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:[email protected]]
> On
> > > Behalf Of Dennis Zhou
> > > Sent: 2019年2月28日 10:19
> > > To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>;
> > > Christoph Lameter <[email protected]>
> > > Cc: Vlad Buslov <[email protected]>; [email protected];
> > > [email protected]; [email protected]
> > > Subject: [PATCH 04/12] percpu: manage chunks based on contig_bits
> > > instead of free_bytes
> > >
> > > When a chunk becomes fragmented, it can end up having a large number
> > > of small allocation areas free. The free_bytes sorting of chunks
> > > leads to unnecessary checking of chunks that cannot satisfy the allocation.
> > > Switch to contig_bits sorting to prevent scanning chunks that may
> > > not be able to service the allocation request.
> > >
> > > Signed-off-by: Dennis Zhou <[email protected]>
> > > ---
> > > mm/percpu.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/percpu.c b/mm/percpu.c index
> > > b40112b2fc59..c996bcffbb2a 100644
> > > --- a/mm/percpu.c
> > > +++ b/mm/percpu.c
> > > @@ -234,7 +234,7 @@ static int pcpu_chunk_slot(const struct
> > > pcpu_chunk
> > > *chunk)
> > > if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE ||
> chunk->contig_bits
> > > == 0)
> > > return 0;
> > >
> > > - return pcpu_size_to_slot(chunk->free_bytes);
> > > + return pcpu_size_to_slot(chunk->contig_bits *
> > > +PCPU_MIN_ALLOC_SIZE);
> > > }
> > >
> > > /* set the pointer to a chunk in a page struct */
> >
> > Reviewed-by: Peng Fan <[email protected]>
> >
> > Not relevant to this patch, but another possible optimization for
> > percpu might be to use a per-chunk spin_lock instead of the global
> > pcpu_lock.
> >
>
> Percpu memory itself is expensive and for the most part shouldn't be part of
> the critical path. Ideally, we don't have multiple chunks being allocated
> simultaneously because once an allocation is given out, the chunk is pinned
> until all allocations are freed.
Thanks for the clarification. Since this is not a critical path, using the
global lock should be OK.
Thanks,
Peng.
>
> Thanks,
> Dennis
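To make the earlier slotting change (patch 0004) concrete: the slot a
chunk is filed under should reflect what it can still serve, and a
fragmented chunk can have plenty of free bytes while its largest hole is
tiny. The slot math below is a simplified stand-in, not the kernel's
pcpu_size_to_slot():

  /* simplified: slot index grows roughly with log2 of the size */
  static int size_to_slot(int bytes)
  {
          int slot = 0;

          while (bytes > 1) {
                  bytes >>= 1;
                  slot++;
          }
          return slot;
  }

  /*
   * e.g. free_bytes = 4096 spread over 16-byte holes: slotting by
   * free_bytes files the chunk as if it could serve large requests, so
   * they scan it and fail, while slotting by
   * contig_bits * PCPU_MIN_ALLOC_SIZE files it where only requests it
   * can actually satisfy will find it.
   */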
On Sun, Mar 03, 2019 at 08:18:49AM +0000, Peng Fan wrote:
>
>
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On
> > Behalf Of Dennis Zhou
> > Sent: 2019年2月28日 10:19
> > To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> > Lameter <[email protected]>
> > Cc: Vlad Buslov <[email protected]>; [email protected];
> > [email protected]; [email protected]
> > Subject: [PATCH 11/12] percpu: convert chunk hints to be based on
> > pcpu_block_md
> >
> > As mentioned in the last patch, a chunk's hints are no different than a block
> > just responsible for more bits. This converts chunk level hints to use a
> > pcpu_block_md to maintain them. This lets us reuse the same hint helper
> > functions as a block. The left_free and right_free are unused by the chunk's
> > pcpu_block_md.
> >
> > Signed-off-by: Dennis Zhou <[email protected]>
> > ---
> > mm/percpu-internal.h | 5 +-
> > mm/percpu-stats.c | 5 +-
> > mm/percpu.c | 120 +++++++++++++++++++------------------------
> > 3 files changed, 57 insertions(+), 73 deletions(-)
> >
> > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> > 119bd1119aa7..0468ba500bd4 100644
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -39,9 +39,7 @@ struct pcpu_chunk {
> >
> > struct list_head list; /* linked to pcpu_slot lists */
> > int free_bytes; /* free bytes in the chunk */
> > - int contig_bits; /* max contiguous size hint */
> > - int contig_bits_start; /* contig_bits starting
> > - offset */
> > + struct pcpu_block_md chunk_md;
> > void *base_addr; /* base address of this chunk */
> >
> > unsigned long *alloc_map; /* allocation map */
> > @@ -49,7 +47,6 @@ struct pcpu_chunk {
> > struct pcpu_block_md *md_blocks; /* metadata blocks */
> >
> > void *data; /* chunk data */
> > - int first_bit; /* no free below this */
> > bool immutable; /* no [de]population allowed */
> > int start_offset; /* the overlap with the previous
> > region to have a page aligned
> > diff --git a/mm/percpu-stats.c b/mm/percpu-stats.c index
> > b5fdd43b60c9..ef5034a0464e 100644
> > --- a/mm/percpu-stats.c
> > +++ b/mm/percpu-stats.c
> > @@ -53,6 +53,7 @@ static int find_max_nr_alloc(void) static void
> > chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
> > int *buffer)
> > {
> > + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > int i, last_alloc, as_len, start, end;
> > int *alloc_sizes, *p;
> > /* statistics */
> > @@ -121,9 +122,9 @@ static void chunk_map_stats(struct seq_file *m,
> > struct pcpu_chunk *chunk,
> > P("nr_alloc", chunk->nr_alloc);
> > P("max_alloc_size", chunk->max_alloc_size);
> > P("empty_pop_pages", chunk->nr_empty_pop_pages);
> > - P("first_bit", chunk->first_bit);
> > + P("first_bit", chunk_md->first_free);
> > P("free_bytes", chunk->free_bytes);
> > - P("contig_bytes", chunk->contig_bits * PCPU_MIN_ALLOC_SIZE);
> > + P("contig_bytes", chunk_md->contig_hint * PCPU_MIN_ALLOC_SIZE);
> > P("sum_frag", sum_frag);
> > P("max_frag", max_frag);
> > P("cur_min_alloc", cur_min_alloc);
> > diff --git a/mm/percpu.c b/mm/percpu.c
> > index 7cdf14c242de..197479f2c489 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -233,10 +233,13 @@ static int pcpu_size_to_slot(int size)
> >
> > static int pcpu_chunk_slot(const struct pcpu_chunk *chunk) {
> > - if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || chunk->contig_bits
> > == 0)
> > + const struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > +
> > + if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE ||
> > + chunk_md->contig_hint == 0)
> > return 0;
> >
> > - return pcpu_size_to_slot(chunk->contig_bits * PCPU_MIN_ALLOC_SIZE);
> > + return pcpu_size_to_slot(chunk_md->contig_hint *
> > PCPU_MIN_ALLOC_SIZE);
> > }
> >
> > /* set the pointer to a chunk in a page struct */ @@ -592,54 +595,6 @@
> > static inline bool pcpu_region_overlap(int a, int b, int x, int y)
> > return false;
> > }
> >
> > -/**
> > - * pcpu_chunk_update - updates the chunk metadata given a free area
> > - * @chunk: chunk of interest
> > - * @bit_off: chunk offset
> > - * @bits: size of free area
> > - *
> > - * This updates the chunk's contig hint and starting offset given a free area.
> > - * Choose the best starting offset if the contig hint is equal.
> > - */
> > -static void pcpu_chunk_update(struct pcpu_chunk *chunk, int bit_off, int bits)
> > -{
> > - if (bits > chunk->contig_bits) {
> > - chunk->contig_bits_start = bit_off;
> > - chunk->contig_bits = bits;
> > - } else if (bits == chunk->contig_bits && chunk->contig_bits_start &&
> > - (!bit_off ||
> > - __ffs(bit_off) > __ffs(chunk->contig_bits_start))) {
> > - /* use the start with the best alignment */
> > - chunk->contig_bits_start = bit_off;
> > - }
> > -}
> > -
> > -/**
> > - * pcpu_chunk_refresh_hint - updates metadata about a chunk
> > - * @chunk: chunk of interest
> > - *
> > - * Iterates over the metadata blocks to find the largest contig area.
> > - * It also counts the populated pages and uses the delta to update the
> > - * global count.
> > - *
> > - * Updates:
> > - * chunk->contig_bits
> > - * chunk->contig_bits_start
> > - */
> > -static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk) -{
> > - int bit_off, bits;
> > -
> > - /* clear metadata */
> > - chunk->contig_bits = 0;
> > -
> > - bit_off = chunk->first_bit;
> > - bits = 0;
> > - pcpu_for_each_md_free_region(chunk, bit_off, bits) {
> > - pcpu_chunk_update(chunk, bit_off, bits);
> > - }
> > -}
> > -
> > /**
> > * pcpu_block_update - updates a block given a free area
> > * @block: block of interest
> > @@ -753,6 +708,29 @@ static void pcpu_block_update_scan(struct
> > pcpu_chunk *chunk, int bit_off,
> > pcpu_block_update(block, s_off, e_off); }
> >
> > +/**
> > + * pcpu_chunk_refresh_hint - updates metadata about a chunk
> > + * @chunk: chunk of interest
> > + *
> > + * Iterates over the metadata blocks to find the largest contig area.
> > + * It also counts the populated pages and uses the delta to update the
> > + * global count.
> > + */
> > +static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk) {
> > + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > + int bit_off, bits;
> > +
> > + /* clear metadata */
> > + chunk_md->contig_hint = 0;
> > +
> > + bit_off = chunk_md->first_free;
> > + bits = 0;
> > + pcpu_for_each_md_free_region(chunk, bit_off, bits) {
> > + pcpu_block_update(chunk_md, bit_off, bit_off + bits);
> > + }
> > +}
> > +
> > /**
> > * pcpu_block_refresh_hint
> > * @chunk: chunk of interest
> > @@ -800,6 +778,7 @@ static void pcpu_block_refresh_hint(struct
> > pcpu_chunk *chunk, int index) static void
> > pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
> > int bits)
> > {
> > + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > int nr_empty_pages = 0;
> > struct pcpu_block_md *s_block, *e_block, *block;
> > int s_index, e_index; /* block indexes of the freed allocation */
> > @@ -910,8 +889,9 @@ static void pcpu_block_update_hint_alloc(struct
> > pcpu_chunk *chunk, int bit_off,
> > * contig hint is broken. Otherwise, it means a smaller space
> > * was used and therefore the chunk contig hint is still correct.
> > */
> > - if (pcpu_region_overlap(chunk->contig_bits_start,
> > - chunk->contig_bits_start + chunk->contig_bits,
> > + if (pcpu_region_overlap(chunk_md->contig_hint_start,
> > + chunk_md->contig_hint_start +
> > + chunk_md->contig_hint,
> > bit_off,
> > bit_off + bits))
> > pcpu_chunk_refresh_hint(chunk);
> > @@ -930,9 +910,10 @@ static void pcpu_block_update_hint_alloc(struct
> > pcpu_chunk *chunk, int bit_off,
> > *
> > * A chunk update is triggered if a page becomes free, a block becomes free,
> > * or the free spans across blocks. This tradeoff is to minimize iterating
> > - * over the block metadata to update chunk->contig_bits.
> > chunk->contig_bits
> > - * may be off by up to a page, but it will never be more than the available
> > - * space. If the contig hint is contained in one block, it will be accurate.
> > + * over the block metadata to update chunk_md->contig_hint.
> > + * chunk_md->contig_hint may be off by up to a page, but it will never
> > + be more
> > + * than the available space. If the contig hint is contained in one
> > + block, it
> > + * will be accurate.
> > */
> > static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk, int
> > bit_off,
> > int bits)
> > @@ -1026,8 +1007,9 @@ static void pcpu_block_update_hint_free(struct
> > pcpu_chunk *chunk, int bit_off,
> > if (((end - start) >= PCPU_BITMAP_BLOCK_BITS) || s_index != e_index)
> > pcpu_chunk_refresh_hint(chunk);
> > else
> > - pcpu_chunk_update(chunk, pcpu_block_off_to_off(s_index, start),
> > - end - start);
> > + pcpu_block_update(&chunk->chunk_md,
> > + pcpu_block_off_to_off(s_index, start),
> > + end);
> > }
> >
> > /**
> > @@ -1082,6 +1064,7 @@ static bool pcpu_is_populated(struct pcpu_chunk
> > *chunk, int bit_off, int bits, static int pcpu_find_block_fit(struct pcpu_chunk
> > *chunk, int alloc_bits,
> > size_t align, bool pop_only)
> > {
> > + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > int bit_off, bits, next_off;
> >
> > /*
> > @@ -1090,12 +1073,12 @@ static int pcpu_find_block_fit(struct pcpu_chunk
> > *chunk, int alloc_bits,
> > * cannot fit in the global hint, there is memory pressure and creating
> > * a new chunk would happen soon.
> > */
> > - bit_off = ALIGN(chunk->contig_bits_start, align) -
> > - chunk->contig_bits_start;
> > - if (bit_off + alloc_bits > chunk->contig_bits)
> > + bit_off = ALIGN(chunk_md->contig_hint_start, align) -
> > + chunk_md->contig_hint_start;
> > + if (bit_off + alloc_bits > chunk_md->contig_hint)
> > return -1;
> >
> > - bit_off = chunk->first_bit;
> > + bit_off = chunk_md->first_free;
> > bits = 0;
> > pcpu_for_each_fit_region(chunk, alloc_bits, align, bit_off, bits) {
> > if (!pop_only || pcpu_is_populated(chunk, bit_off, bits, @@ -1190,6
> > +1173,7 @@ static unsigned long pcpu_find_zero_area(unsigned long *map,
> > static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
> > size_t align, int start)
> > {
> > + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > size_t align_mask = (align) ? (align - 1) : 0;
> > unsigned long area_off = 0, area_bits = 0;
> > int bit_off, end, oslot;
> > @@ -1222,8 +1206,8 @@ static int pcpu_alloc_area(struct pcpu_chunk
> > *chunk, int alloc_bits,
> > chunk->free_bytes -= alloc_bits * PCPU_MIN_ALLOC_SIZE;
> >
> > /* update first free bit */
> > - if (bit_off == chunk->first_bit)
> > - chunk->first_bit = find_next_zero_bit(
> > + if (bit_off == chunk_md->first_free)
> > + chunk_md->first_free = find_next_zero_bit(
> > chunk->alloc_map,
> > pcpu_chunk_map_bits(chunk),
> > bit_off + alloc_bits);
> > @@ -1245,6 +1229,7 @@ static int pcpu_alloc_area(struct pcpu_chunk
> > *chunk, int alloc_bits,
> > */
> > static void pcpu_free_area(struct pcpu_chunk *chunk, int off) {
> > + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > int bit_off, bits, end, oslot;
> >
> > lockdep_assert_held(&pcpu_lock);
> > @@ -1264,7 +1249,7 @@ static void pcpu_free_area(struct pcpu_chunk
> > *chunk, int off)
> > chunk->free_bytes += bits * PCPU_MIN_ALLOC_SIZE;
> >
> > /* update first free bit */
> > - chunk->first_bit = min(chunk->first_bit, bit_off);
> > + chunk_md->first_free = min(chunk_md->first_free, bit_off);
> >
> > pcpu_block_update_hint_free(chunk, bit_off, bits);
> >
> > @@ -1285,6 +1270,9 @@ static void pcpu_init_md_blocks(struct pcpu_chunk
> > *chunk) {
> > struct pcpu_block_md *md_block;
> >
> > + /* init the chunk's block */
> > + pcpu_init_md_block(&chunk->chunk_md,
> > pcpu_chunk_map_bits(chunk));
> > +
> > for (md_block = chunk->md_blocks;
> > md_block != chunk->md_blocks + pcpu_chunk_nr_blocks(chunk);
> > md_block++)
> > @@ -1352,7 +1340,6 @@ static struct pcpu_chunk * __init
> > pcpu_alloc_first_chunk(unsigned long tmp_addr,
> > chunk->nr_populated = chunk->nr_pages;
> > chunk->nr_empty_pop_pages = chunk->nr_pages;
> >
> > - chunk->contig_bits = map_size / PCPU_MIN_ALLOC_SIZE;
> > chunk->free_bytes = map_size;
> >
> > if (chunk->start_offset) {
> > @@ -1362,7 +1349,7 @@ static struct pcpu_chunk * __init
> > pcpu_alloc_first_chunk(unsigned long tmp_addr,
> > set_bit(0, chunk->bound_map);
> > set_bit(offset_bits, chunk->bound_map);
> >
> > - chunk->first_bit = offset_bits;
> > + chunk->chunk_md.first_free = offset_bits;
> >
> > pcpu_block_update_hint_alloc(chunk, 0, offset_bits);
> > }
> > @@ -1415,7 +1402,6 @@ static struct pcpu_chunk *pcpu_alloc_chunk(gfp_t
> > gfp)
> > pcpu_init_md_blocks(chunk);
> >
> > /* init metadata */
> > - chunk->contig_bits = region_bits;
> > chunk->free_bytes = chunk->nr_pages * PAGE_SIZE;
> >
> > return chunk;
>
> Reviewed-by: Peng Fan <[email protected]>
>
> Nitpick: how about naming a function __pcpu_md_update and letting
> pcpu_chunk_update and pcpu_block_update call __pcpu_md_update? If you
> prefer, I could submit a patch.
>
I don't have a preference, but I do not really want to have two functions
that just wrap the same function. I think overall it'll make sense to
rename it, but I haven't thought about what to rename it to, as the name
will probably be influenced by the direction percpu ends up going.
Thanks,
Dennis
On Sun, Mar 03, 2019 at 06:01:42AM +0000, Peng Fan wrote:
> Hi Dennis
>
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] On
> > Behalf Of Dennis Zhou
> > Sent: 2019年2月28日 10:19
> > To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> > Lameter <[email protected]>
> > Cc: Vlad Buslov <[email protected]>; [email protected];
> > [email protected]; [email protected]
> > Subject: [PATCH 07/12] percpu: add block level scan_hint
> >
> > Fragmentation can cause both blocks and chunks to have an early first_free
> > bit available, but only be able to satisfy allocations much later on. This patch
> > introduces a scan_hint to help mitigate some unnecessary scanning.
> >
> > The scan_hint remembers the largest area prior to the contig_hint. If the
> > contig_hint == scan_hint, then scan_hint_start > contig_hint_start.
> > This is necessary for scan_hint discovery when refreshing a block.
> >
> > Signed-off-by: Dennis Zhou <[email protected]>
> > ---
> > mm/percpu-internal.h | 9 ++++
> > mm/percpu.c | 101
> > ++++++++++++++++++++++++++++++++++++++++---
> > 2 files changed, 103 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> > b1739dc06b73..ec58b244545d 100644
> > --- a/mm/percpu-internal.h
> > +++ b/mm/percpu-internal.h
> > @@ -9,8 +9,17 @@
> > * pcpu_block_md is the metadata block struct.
> > * Each chunk's bitmap is split into a number of full blocks.
> > * All units are in terms of bits.
> > + *
> > + * The scan hint is the largest known contiguous area before the contig hint.
> > + * It is not necessarily the actual largest contig hint though. There
> > + is an
> > + * invariant that the scan_hint_start > contig_hint_start iff
> > + * scan_hint == contig_hint. This is necessary because when scanning
> > + forward,
> > + * we don't know if a new contig hint would be better than the current one.
> > */
> > struct pcpu_block_md {
> > + int scan_hint; /* scan hint for block */
> > + int scan_hint_start; /* block relative starting
> > + position of the scan hint */
> > int contig_hint; /* contig hint for block */
> > int contig_hint_start; /* block relative starting
> > position of the contig hint */ diff --git
> > a/mm/percpu.c b/mm/percpu.c index 967c9cc3a928..df1aacf58ac8 100644
> > --- a/mm/percpu.c
> > +++ b/mm/percpu.c
> > @@ -320,6 +320,34 @@ static unsigned long pcpu_block_off_to_off(int index,
> > int off)
> > return index * PCPU_BITMAP_BLOCK_BITS + off; }
> >
> > +/*
> > + * pcpu_next_hint - determine which hint to use
> > + * @block: block of interest
> > + * @alloc_bits: size of allocation
> > + *
> > + * This determines if we should scan based on the scan_hint or first_free.
> > + * In general, we want to scan from first_free to fulfill allocations
> > +by
> > + * first fit. However, if we know a scan_hint at position
> > +scan_hint_start
> > + * cannot fulfill an allocation, we can begin scanning from there
> > +knowing
> > + * the contig_hint will be our fallback.
> > + */
> > +static int pcpu_next_hint(struct pcpu_block_md *block, int alloc_bits)
> > +{
> > + /*
> > + * The three conditions below determine if we can skip past the
> > + * scan_hint. First, does the scan hint exist. Second, is the
> > + * contig_hint after the scan_hint (possibly not true iff
> > + * contig_hint == scan_hint). Third, is the allocation request
> > + * larger than the scan_hint.
> > + */
> > + if (block->scan_hint &&
> > + block->contig_hint_start > block->scan_hint_start &&
> > + alloc_bits > block->scan_hint)
> > + return block->scan_hint_start + block->scan_hint;
> > +
> > + return block->first_free;
> > +}
> > +
> > /**
> > * pcpu_next_md_free_region - finds the next hint free area
> > * @chunk: chunk of interest
> > @@ -415,9 +443,11 @@ static void pcpu_next_fit_region(struct pcpu_chunk
> > *chunk, int alloc_bits,
> > if (block->contig_hint &&
> > block->contig_hint_start >= block_off &&
> > block->contig_hint >= *bits + alloc_bits) {
> > + int start = pcpu_next_hint(block, alloc_bits);
> > +
> > *bits += alloc_bits + block->contig_hint_start -
> > - block->first_free;
> > - *bit_off = pcpu_block_off_to_off(i, block->first_free);
> > + start;
>
> This might not be relevant to this patch, and I am not sure whether it
> is intended or not. For `alloc_bits + block->contig_hint_start -
> [block->first_free or start]`: if the reason is to let
> pcpu_is_populated() return a proper next_off when pcpu_is_populated()
> fails, it makes sense. If not, why not just use *bits += alloc_bits?
>
This is how the iterator works. Without it, it doesn't.
Thanks,
Dennis
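To make the skip condition in pcpu_next_hint() above concrete, here is a
worked example with made-up hint values (not taken from the thread):

  /* hypothetical block state */
  block->first_free        = 0;
  block->scan_hint         = 2;    /* 2 free bits starting at bit 10 */
  block->scan_hint_start   = 10;
  block->contig_hint       = 64;   /* 64 free bits starting at bit 100 */
  block->contig_hint_start = 100;

  /*
   * alloc_bits = 4: the scan hint exists, the contig hint starts after
   * it, and 4 > 2, so scanning starts at 10 + 2 = 12 instead of at
   * first_free = 0.
   *
   * alloc_bits = 1: 1 > 2 is false, so scanning still starts at
   * first_free = 0 and first-fit behaviour is preserved.
   */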
> -----Original Message-----
> From: Dennis Zhou [mailto:[email protected]]
> Sent: 2019年3月4日 4:22
> To: Peng Fan <[email protected]>
> Cc: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>; Christoph
> Lameter <[email protected]>; Vlad Buslov <[email protected]>;
> [email protected]; [email protected]; [email protected]
> Subject: Re: [PATCH 11/12] percpu: convert chunk hints to be based on
> pcpu_block_md
>
> On Sun, Mar 03, 2019 at 08:18:49AM +0000, Peng Fan wrote:
> >
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:[email protected]]
> On
> > > Behalf Of Dennis Zhou
> > > Sent: 2019年2月28日 10:19
> > > To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>;
> > > Christoph Lameter <[email protected]>
> > > Cc: Vlad Buslov <[email protected]>; [email protected];
> > > [email protected]; [email protected]
> > > Subject: [PATCH 11/12] percpu: convert chunk hints to be based on
> > > pcpu_block_md
> > >
> > > As mentioned in the last patch, a chunk's hints are no different
> > > than a block just responsible for more bits. This converts chunk
> > > level hints to use a pcpu_block_md to maintain them. This lets us
> > > reuse the same hint helper functions as a block. The left_free and
> > > right_free are unused by the chunk's pcpu_block_md.
> > >
> > > Signed-off-by: Dennis Zhou <[email protected]>
> > > ---
> > > mm/percpu-internal.h | 5 +-
> > > mm/percpu-stats.c | 5 +-
> > > mm/percpu.c | 120
> +++++++++++++++++++------------------------
> > > 3 files changed, 57 insertions(+), 73 deletions(-)
> > >
> > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> > > 119bd1119aa7..0468ba500bd4 100644
> > > --- a/mm/percpu-internal.h
> > > +++ b/mm/percpu-internal.h
> > > @@ -39,9 +39,7 @@ struct pcpu_chunk {
> > >
> > > struct list_head list; /* linked to pcpu_slot lists */
> > > int free_bytes; /* free bytes in the chunk */
> > > - int contig_bits; /* max contiguous size hint */
> > > - int contig_bits_start; /* contig_bits starting
> > > - offset */
> > > + struct pcpu_block_md chunk_md;
> > > void *base_addr; /* base address of this chunk */
> > >
> > > unsigned long *alloc_map; /* allocation map */
> > > @@ -49,7 +47,6 @@ struct pcpu_chunk {
> > > struct pcpu_block_md *md_blocks; /* metadata blocks */
> > >
> > > void *data; /* chunk data */
> > > - int first_bit; /* no free below this */
> > > bool immutable; /* no [de]population allowed */
> > > int start_offset; /* the overlap with the previous
> > > region to have a page aligned diff --git
> > > a/mm/percpu-stats.c b/mm/percpu-stats.c index
> > > b5fdd43b60c9..ef5034a0464e 100644
> > > --- a/mm/percpu-stats.c
> > > +++ b/mm/percpu-stats.c
> > > @@ -53,6 +53,7 @@ static int find_max_nr_alloc(void) static void
> > > chunk_map_stats(struct seq_file *m, struct pcpu_chunk *chunk,
> > > int *buffer)
> > > {
> > > + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > > int i, last_alloc, as_len, start, end;
> > > int *alloc_sizes, *p;
> > > /* statistics */
> > > @@ -121,9 +122,9 @@ static void chunk_map_stats(struct seq_file *m,
> > > struct pcpu_chunk *chunk,
> > > P("nr_alloc", chunk->nr_alloc);
> > > P("max_alloc_size", chunk->max_alloc_size);
> > > P("empty_pop_pages", chunk->nr_empty_pop_pages);
> > > - P("first_bit", chunk->first_bit);
> > > + P("first_bit", chunk_md->first_free);
> > > P("free_bytes", chunk->free_bytes);
> > > - P("contig_bytes", chunk->contig_bits * PCPU_MIN_ALLOC_SIZE);
> > > + P("contig_bytes", chunk_md->contig_hint *
> PCPU_MIN_ALLOC_SIZE);
> > > P("sum_frag", sum_frag);
> > > P("max_frag", max_frag);
> > > P("cur_min_alloc", cur_min_alloc); diff --git a/mm/percpu.c
> > > b/mm/percpu.c index 7cdf14c242de..197479f2c489 100644
> > > --- a/mm/percpu.c
> > > +++ b/mm/percpu.c
> > > @@ -233,10 +233,13 @@ static int pcpu_size_to_slot(int size)
> > >
> > > static int pcpu_chunk_slot(const struct pcpu_chunk *chunk) {
> > > - if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE || chunk->contig_bits
> > > == 0)
> > > + const struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > > +
> > > + if (chunk->free_bytes < PCPU_MIN_ALLOC_SIZE ||
> > > + chunk_md->contig_hint == 0)
> > > return 0;
> > >
> > > - return pcpu_size_to_slot(chunk->contig_bits * PCPU_MIN_ALLOC_SIZE);
> > > + return pcpu_size_to_slot(chunk_md->contig_hint *
> > > PCPU_MIN_ALLOC_SIZE);
> > > }
> > >
> > > /* set the pointer to a chunk in a page struct */ @@ -592,54 +595,6
> > > @@ static inline bool pcpu_region_overlap(int a, int b, int x, int y)
> > > return false;
> > > }
> > >
> > > -/**
> > > - * pcpu_chunk_update - updates the chunk metadata given a free area
> > > - * @chunk: chunk of interest
> > > - * @bit_off: chunk offset
> > > - * @bits: size of free area
> > > - *
> > > - * This updates the chunk's contig hint and starting offset given a free
> area.
> > > - * Choose the best starting offset if the contig hint is equal.
> > > - */
> > > -static void pcpu_chunk_update(struct pcpu_chunk *chunk, int
> > > bit_off, int bits) -{
> > > - if (bits > chunk->contig_bits) {
> > > - chunk->contig_bits_start = bit_off;
> > > - chunk->contig_bits = bits;
> > > - } else if (bits == chunk->contig_bits && chunk->contig_bits_start &&
> > > - (!bit_off ||
> > > - __ffs(bit_off) > __ffs(chunk->contig_bits_start))) {
> > > - /* use the start with the best alignment */
> > > - chunk->contig_bits_start = bit_off;
> > > - }
> > > -}
> > > -
> > > -/**
> > > - * pcpu_chunk_refresh_hint - updates metadata about a chunk
> > > - * @chunk: chunk of interest
> > > - *
> > > - * Iterates over the metadata blocks to find the largest contig area.
> > > - * It also counts the populated pages and uses the delta to update
> > > the
> > > - * global count.
> > > - *
> > > - * Updates:
> > > - * chunk->contig_bits
> > > - * chunk->contig_bits_start
> > > - */
> > > -static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk) -{
> > > - int bit_off, bits;
> > > -
> > > - /* clear metadata */
> > > - chunk->contig_bits = 0;
> > > -
> > > - bit_off = chunk->first_bit;
> > > - bits = 0;
> > > - pcpu_for_each_md_free_region(chunk, bit_off, bits) {
> > > - pcpu_chunk_update(chunk, bit_off, bits);
> > > - }
> > > -}
> > > -
> > > /**
> > > * pcpu_block_update - updates a block given a free area
> > > * @block: block of interest
> > > @@ -753,6 +708,29 @@ static void pcpu_block_update_scan(struct
> > > pcpu_chunk *chunk, int bit_off,
> > > pcpu_block_update(block, s_off, e_off); }
> > >
> > > +/**
> > > + * pcpu_chunk_refresh_hint - updates metadata about a chunk
> > > + * @chunk: chunk of interest
> > > + *
> > > + * Iterates over the metadata blocks to find the largest contig area.
> > > + * It also counts the populated pages and uses the delta to update
> > > +the
> > > + * global count.
> > > + */
> > > +static void pcpu_chunk_refresh_hint(struct pcpu_chunk *chunk) {
> > > + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > > + int bit_off, bits;
> > > +
> > > + /* clear metadata */
> > > + chunk_md->contig_hint = 0;
> > > +
> > > + bit_off = chunk_md->first_free;
> > > + bits = 0;
> > > + pcpu_for_each_md_free_region(chunk, bit_off, bits) {
> > > + pcpu_block_update(chunk_md, bit_off, bit_off + bits);
> > > + }
> > > +}
> > > +
> > > /**
> > > * pcpu_block_refresh_hint
> > > * @chunk: chunk of interest
> > > @@ -800,6 +778,7 @@ static void pcpu_block_refresh_hint(struct
> > > pcpu_chunk *chunk, int index) static void
> > > pcpu_block_update_hint_alloc(struct pcpu_chunk *chunk, int bit_off,
> > > int bits)
> > > {
> > > + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > > int nr_empty_pages = 0;
> > > struct pcpu_block_md *s_block, *e_block, *block;
> > > int s_index, e_index; /* block indexes of the freed allocation */
> > > @@ -910,8 +889,9 @@ static void pcpu_block_update_hint_alloc(struct
> > > pcpu_chunk *chunk, int bit_off,
> > > * contig hint is broken. Otherwise, it means a smaller space
> > > * was used and therefore the chunk contig hint is still correct.
> > > */
> > > - if (pcpu_region_overlap(chunk->contig_bits_start,
> > > - chunk->contig_bits_start + chunk->contig_bits,
> > > + if (pcpu_region_overlap(chunk_md->contig_hint_start,
> > > + chunk_md->contig_hint_start +
> > > + chunk_md->contig_hint,
> > > bit_off,
> > > bit_off + bits))
> > > pcpu_chunk_refresh_hint(chunk);
> > > @@ -930,9 +910,10 @@ static void pcpu_block_update_hint_alloc(struct
> > > pcpu_chunk *chunk, int bit_off,
> > > *
> > > * A chunk update is triggered if a page becomes free, a block becomes
> free,
> > > * or the free spans across blocks. This tradeoff is to minimize
> > > iterating
> > > - * over the block metadata to update chunk->contig_bits.
> > > chunk->contig_bits
> > > - * may be off by up to a page, but it will never be more than the
> > > available
> > > - * space. If the contig hint is contained in one block, it will be accurate.
> > > + * over the block metadata to update chunk_md->contig_hint.
> > > + * chunk_md->contig_hint may be off by up to a page, but it will
> > > + never be more
> > > + * than the available space. If the contig hint is contained in
> > > + one block, it
> > > + * will be accurate.
> > > */
> > > static void pcpu_block_update_hint_free(struct pcpu_chunk *chunk,
> > > int bit_off,
> > > int bits)
> > > @@ -1026,8 +1007,9 @@ static void pcpu_block_update_hint_free(struct
> > > pcpu_chunk *chunk, int bit_off,
> > > if (((end - start) >= PCPU_BITMAP_BLOCK_BITS) || s_index !=
> e_index)
> > > pcpu_chunk_refresh_hint(chunk);
> > > else
> > > - pcpu_chunk_update(chunk, pcpu_block_off_to_off(s_index, start),
> > > - end - start);
> > > + pcpu_block_update(&chunk->chunk_md,
> > > + pcpu_block_off_to_off(s_index, start),
> > > + end);
> > > }
> > >
> > > /**
> > > @@ -1082,6 +1064,7 @@ static bool pcpu_is_populated(struct
> > > pcpu_chunk *chunk, int bit_off, int bits, static int
> > > pcpu_find_block_fit(struct pcpu_chunk *chunk, int alloc_bits,
> > > size_t align, bool pop_only) {
> > > + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > > int bit_off, bits, next_off;
> > >
> > > /*
> > > @@ -1090,12 +1073,12 @@ static int pcpu_find_block_fit(struct
> > > pcpu_chunk *chunk, int alloc_bits,
> > > * cannot fit in the global hint, there is memory pressure and
> creating
> > > * a new chunk would happen soon.
> > > */
> > > - bit_off = ALIGN(chunk->contig_bits_start, align) -
> > > - chunk->contig_bits_start;
> > > - if (bit_off + alloc_bits > chunk->contig_bits)
> > > + bit_off = ALIGN(chunk_md->contig_hint_start, align) -
> > > + chunk_md->contig_hint_start;
> > > + if (bit_off + alloc_bits > chunk_md->contig_hint)
> > > return -1;
> > >
> > > - bit_off = chunk->first_bit;
> > > + bit_off = chunk_md->first_free;
> > > bits = 0;
> > > pcpu_for_each_fit_region(chunk, alloc_bits, align, bit_off, bits) {
> > > if (!pop_only || pcpu_is_populated(chunk, bit_off, bits, @@
> > > -1190,6
> > > +1173,7 @@ static unsigned long pcpu_find_zero_area(unsigned long
> > > +*map,
> > > static int pcpu_alloc_area(struct pcpu_chunk *chunk, int alloc_bits,
> > > size_t align, int start)
> > > {
> > > + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > > size_t align_mask = (align) ? (align - 1) : 0;
> > > unsigned long area_off = 0, area_bits = 0;
> > > int bit_off, end, oslot;
> > > @@ -1222,8 +1206,8 @@ static int pcpu_alloc_area(struct pcpu_chunk
> > > *chunk, int alloc_bits,
> > > chunk->free_bytes -= alloc_bits * PCPU_MIN_ALLOC_SIZE;
> > >
> > > /* update first free bit */
> > > - if (bit_off == chunk->first_bit)
> > > - chunk->first_bit = find_next_zero_bit(
> > > + if (bit_off == chunk_md->first_free)
> > > + chunk_md->first_free = find_next_zero_bit(
> > > chunk->alloc_map,
> > > pcpu_chunk_map_bits(chunk),
> > > bit_off + alloc_bits);
> > > @@ -1245,6 +1229,7 @@ static int pcpu_alloc_area(struct pcpu_chunk
> > > *chunk, int alloc_bits,
> > > */
> > > static void pcpu_free_area(struct pcpu_chunk *chunk, int off) {
> > > + struct pcpu_block_md *chunk_md = &chunk->chunk_md;
> > > int bit_off, bits, end, oslot;
> > >
> > > lockdep_assert_held(&pcpu_lock);
> > > @@ -1264,7 +1249,7 @@ static void pcpu_free_area(struct pcpu_chunk
> > > *chunk, int off)
> > > chunk->free_bytes += bits * PCPU_MIN_ALLOC_SIZE;
> > >
> > > /* update first free bit */
> > > - chunk->first_bit = min(chunk->first_bit, bit_off);
> > > + chunk_md->first_free = min(chunk_md->first_free, bit_off);
> > >
> > > pcpu_block_update_hint_free(chunk, bit_off, bits);
> > >
> > > @@ -1285,6 +1270,9 @@ static void pcpu_init_md_blocks(struct
> > > pcpu_chunk
> > > *chunk) {
> > > struct pcpu_block_md *md_block;
> > >
> > > + /* init the chunk's block */
> > > + pcpu_init_md_block(&chunk->chunk_md,
> > > pcpu_chunk_map_bits(chunk));
> > > +
> > > for (md_block = chunk->md_blocks;
> > > md_block != chunk->md_blocks +
> pcpu_chunk_nr_blocks(chunk);
> > > md_block++)
> > > @@ -1352,7 +1340,6 @@ static struct pcpu_chunk * __init
> > > pcpu_alloc_first_chunk(unsigned long tmp_addr,
> > > chunk->nr_populated = chunk->nr_pages;
> > > chunk->nr_empty_pop_pages = chunk->nr_pages;
> > >
> > > - chunk->contig_bits = map_size / PCPU_MIN_ALLOC_SIZE;
> > > chunk->free_bytes = map_size;
> > >
> > > if (chunk->start_offset) {
> > > @@ -1362,7 +1349,7 @@ static struct pcpu_chunk * __init
> > > pcpu_alloc_first_chunk(unsigned long tmp_addr,
> > > set_bit(0, chunk->bound_map);
> > > set_bit(offset_bits, chunk->bound_map);
> > >
> > > - chunk->first_bit = offset_bits;
> > > + chunk->chunk_md.first_free = offset_bits;
> > >
> > > pcpu_block_update_hint_alloc(chunk, 0, offset_bits);
> > > }
> > > @@ -1415,7 +1402,6 @@ static struct pcpu_chunk
> > > *pcpu_alloc_chunk(gfp_t
> > > gfp)
> > > pcpu_init_md_blocks(chunk);
> > >
> > > /* init metadata */
> > > - chunk->contig_bits = region_bits;
> > > chunk->free_bytes = chunk->nr_pages * PAGE_SIZE;
> > >
> > > return chunk;
> >
> > Reviewed-by: Peng Fan <[email protected]>
> >
> > Nitpick: how about naming a function __pcpu_md_update and letting
> > pcpu_chunk_update and pcpu_block_update call __pcpu_md_update? If you
> > prefer, I could submit a patch.
> >
>
> I don't have a preference, but I do not really want to have two functions
> that just wrap the same function. I think overall it'll make sense to
> rename it, but I haven't thought about what to rename it to, as the name
> will probably be influenced by the direction percpu ends up going.
OK, then I'll just leave it up to you.
Thanks,
Peng.
>
> Thanks,
> Dennis
> -----Original Message-----
> From: Dennis Zhou [mailto:[email protected]]
> Sent: 2019年3月4日 4:23
> To: Peng Fan <[email protected]>
> Cc: Tejun Heo <[email protected]>; Christoph Lameter <[email protected]>; Vlad
> Buslov <[email protected]>; [email protected]; [email protected];
> [email protected]
> Subject: Re: [PATCH 07/12] percpu: add block level scan_hint
>
> On Sun, Mar 03, 2019 at 06:01:42AM +0000, Peng Fan wrote:
> > Hi Dennis
> >
> > > -----Original Message-----
> > > From: [email protected] [mailto:[email protected]]
> On
> > > Behalf Of Dennis Zhou
> > > Sent: 2019年2月28日 10:19
> > > To: Dennis Zhou <[email protected]>; Tejun Heo <[email protected]>;
> > > Christoph Lameter <[email protected]>
> > > Cc: Vlad Buslov <[email protected]>; [email protected];
> > > [email protected]; [email protected]
> > > Subject: [PATCH 07/12] percpu: add block level scan_hint
> > >
> > > Fragmentation can cause both blocks and chunks to have an early
> > > first_free bit available, but only be able to satisfy allocations much
> > > later on. This patch introduces a scan_hint to help mitigate some
> > > unnecessary scanning.
> > >
> > > The scan_hint remembers the largest area prior to the contig_hint.
> > > If the contig_hint == scan_hint, then scan_hint_start > contig_hint_start.
> > > This is necessary for scan_hint discovery when refreshing a block.
> > >
> > > Signed-off-by: Dennis Zhou <[email protected]>
> > > ---
> > > mm/percpu-internal.h | 9 ++++
> > > mm/percpu.c | 101
> > > ++++++++++++++++++++++++++++++++++++++++---
> > > 2 files changed, 103 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/mm/percpu-internal.h b/mm/percpu-internal.h index
> > > b1739dc06b73..ec58b244545d 100644
> > > --- a/mm/percpu-internal.h
> > > +++ b/mm/percpu-internal.h
> > > @@ -9,8 +9,17 @@
> > > * pcpu_block_md is the metadata block struct.
> > > * Each chunk's bitmap is split into a number of full blocks.
> > > * All units are in terms of bits.
> > > + *
> > > + * The scan hint is the largest known contiguous area before the contig
> hint.
> > > + * It is not necessarily the actual largest contig hint though.
> > > + There is an
> > > + * invariant that the scan_hint_start > contig_hint_start iff
> > > + * scan_hint == contig_hint. This is necessary because when
> > > + scanning forward,
> > > + * we don't know if a new contig hint would be better than the current
> one.
> > > */
> > > struct pcpu_block_md {
> > > + int scan_hint; /* scan hint for block */
> > > + int scan_hint_start; /* block relative starting
> > > + position of the scan hint */
> > > int contig_hint; /* contig hint for block
> */
> > > int contig_hint_start; /* block relative
> starting
> > > position of the contig hint */ diff
> --git a/mm/percpu.c
> > > b/mm/percpu.c index 967c9cc3a928..df1aacf58ac8 100644
> > > --- a/mm/percpu.c
> > > +++ b/mm/percpu.c
> > > @@ -320,6 +320,34 @@ static unsigned long pcpu_block_off_to_off(int
> > > index, int off)
> > > return index * PCPU_BITMAP_BLOCK_BITS + off; }
> > >
> > > +/*
> > > + * pcpu_next_hint - determine which hint to use
> > > + * @block: block of interest
> > > + * @alloc_bits: size of allocation
> > > + *
> > > + * This determines if we should scan based on the scan_hint or first_free.
> > > + * In general, we want to scan from first_free to fulfill
> > > +allocations by
> > > + * first fit. However, if we know a scan_hint at position
> > > +scan_hint_start
> > > + * cannot fulfill an allocation, we can begin scanning from there
> > > +knowing
> > > + * the contig_hint will be our fallback.
> > > + */
> > > +static int pcpu_next_hint(struct pcpu_block_md *block, int
> > > +alloc_bits) {
> > > + /*
> > > + * The three conditions below determine if we can skip past the
> > > + * scan_hint. First, does the scan hint exist. Second, is the
> > > + * contig_hint after the scan_hint (possibly not true iff
> > > + * contig_hint == scan_hint). Third, is the allocation request
> > > + * larger than the scan_hint.
> > > + */
> > > + if (block->scan_hint &&
> > > + block->contig_hint_start > block->scan_hint_start &&
> > > + alloc_bits > block->scan_hint)
> > > + return block->scan_hint_start + block->scan_hint;
> > > +
> > > + return block->first_free;
> > > +}
> > > +
> > > /**
> > > * pcpu_next_md_free_region - finds the next hint free area
> > > * @chunk: chunk of interest
> > > @@ -415,9 +443,11 @@ static void pcpu_next_fit_region(struct
> > > pcpu_chunk *chunk, int alloc_bits,
> > > if (block->contig_hint &&
> > > block->contig_hint_start >= block_off &&
> > > block->contig_hint >= *bits + alloc_bits) {
> > > + int start = pcpu_next_hint(block, alloc_bits);
> > > +
> > > *bits += alloc_bits + block->contig_hint_start -
> > > - block->first_free;
> > > - *bit_off = pcpu_block_off_to_off(i, block->first_free);
> > > + start;
> >
> > This might not be relevant to this patch, and I am not sure whether
> > it is intended or not. For `alloc_bits + block->contig_hint_start -
> > [block->first_free or start]`: if the reason is to let
> > pcpu_is_populated() return a proper next_off when pcpu_is_populated()
> > fails, it makes sense. If not, why not just use *bits += alloc_bits?
> >
>
> This is how the iterator works. Without it, it doesn't.
Oops, I made a mistake; you are right, the iterator needs such logic.
Thanks,
Peng.
>
> Thanks,
> Dennis
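A rough worked example of the expression Peng asked about, with made-up
numbers and following the reading Peng himself offered (the window
handed to pcpu_is_populated() needs to cover everything from the scan
start up to where the allocation could land, so a failure can hand back
a useful next_off):

  /*
   * Suppose start = 0, block->contig_hint_start = 100, alloc_bits = 4
   * and *bits enters the iteration as 0:
   *
   *      *bits += alloc_bits + block->contig_hint_start - start;
   *      -> *bits = 0 + 4 + 100 - 0 = 104
   *
   * pcpu_find_block_fit() then checks that bits [bit_off, bit_off + 104)
   * are populated rather than just a 4-bit window, and if an unpopulated
   * page sits anywhere in that range, next_off moves the iterator past
   * it, which a bare *bits += alloc_bits would not allow.
   */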
On Wed, Feb 27, 2019 at 09:18:27PM -0500, Dennis Zhou wrote:
> Hi everyone,
>
> It was reported a while [1] that an increase in allocation alignment
> requirement [2] caused the percpu memory allocator to do significantly
> more work.
>
> After spending quite a bit of time diving into it, it seems the crux was
> the following:
> 1) chunk management by free_bytes caused allocations to scan over
> chunks that could not fit due to fragmentation
> 2) per block fragmentation required scanning from an early first_free
> bit causing allocations to repeat work
>
> This series introduces a scan_hint for pcpu_block_md and merges the
> paths used to manage the hints. The scan_hint represents the largest
> known free area prior to the contig_hint. There are some caveats to
> this. First, it may not necessarily be the largest area as we do partial
> updates based on freeing of regions and failed scanning in
> pcpu_alloc_area(). Second, if contig_hint == scan_hint, then
> scan_hint_start > contig_hint_start is possible. This is necessary
> for scan_hint discovery when refreshing the hint of a block.
>
> A necessary change is to enforce a block to be the size of a page. This
> let's the management of nr_empty_pop_pages to be done by breaking and
> making full contig_hints in the hint update paths. Prior, this was done
> by piggy backing off of refreshing the chunk contig_hint as it performed
> a full scan and counting empty full pages.
>
> The following are the results found using the workload provided in [3].
>
> branch | time
> ------------------------
> 5.0-rc7 | 69s
> [2] reverted | 44s
> scan_hint | 39s
>
> The times above represent the approximate average across multiple runs.
> I tested based on a basic 1M 16-byte allocation pattern with no
> alignment requirement and times did not differ between 5.0-rc7 and
> scan_hint.
>
> [1] https://lore.kernel.org/netdev/CANn89iKb_vW+LA-91RV=zuAqbNycPFUYW54w_S=KZ3HdcWPw6Q@mail.gmail.com/
> [2] https://lore.kernel.org/netdev/[email protected]/
> [3] https://lore.kernel.org/netdev/[email protected]/
>
> This patchset contains the following 12 patches:
> 0001-percpu-update-free-path-with-correct-new-free-region.patch
> 0002-percpu-do-not-search-past-bitmap-when-allocating-an-.patch
> 0003-percpu-introduce-helper-to-determine-if-two-regions-.patch
> 0004-percpu-manage-chunks-based-on-contig_bits-instead-of.patch
> 0005-percpu-relegate-chunks-unusable-when-failing-small-a.patch
> 0006-percpu-set-PCPU_BITMAP_BLOCK_SIZE-to-PAGE_SIZE.patch
> 0007-percpu-add-block-level-scan_hint.patch
> 0008-percpu-remember-largest-area-skipped-during-allocati.patch
> 0009-percpu-use-block-scan_hint-to-only-scan-forward.patch
> 0010-percpu-make-pcpu_block_md-generic.patch
> 0011-percpu-convert-chunk-hints-to-be-based-on-pcpu_block.patch
> 0012-percpu-use-chunk-scan_hint-to-skip-some-scanning.patch
>
> 0001 fixes an issue where the chunk contig_hint was being updated
> improperly with the new region's starting offset and possibly differing
> contig_hint. 0002 fixes possibly scanning pass the end of the bitmap.
> 0003 introduces a helper to do region overlap comparison. 0004 switches
> to chunk management by contig_hint rather than free_bytes. 0005 moves
> chunks that fail to allocate to the empty block list to prevent excess
> scanning with of chunks with small contig_hints and poor alignment.
> 0006 introduces the constraint PCPU_BITMAP_BLOCK_SIZE == PAGE_SIZE and
> modifies nr_empty_pop_pages management to be a part of the hint updates.
> 0007-0009 introduces percpu block scan_hint. 0010 makes pcpu_block_md
> generic so chunk hints can be managed as a pcpu_block_md responsible
> for more bits. 0011-0012 add chunk scan_hints.
>
> This patchset is on top of percpu#master a3b22b9f11d9.
>
> diffstats below:
>
> Dennis Zhou (12):
> percpu: update free path with correct new free region
> percpu: do not search past bitmap when allocating an area
> percpu: introduce helper to determine if two regions overlap
> percpu: manage chunks based on contig_bits instead of free_bytes
> percpu: relegate chunks unusable when failing small allocations
> percpu: set PCPU_BITMAP_BLOCK_SIZE to PAGE_SIZE
> percpu: add block level scan_hint
> percpu: remember largest area skipped during allocation
> percpu: use block scan_hint to only scan forward
> percpu: make pcpu_block_md generic
> percpu: convert chunk hints to be based on pcpu_block_md
> percpu: use chunk scan_hint to skip some scanning
>
> include/linux/percpu.h | 12 +-
> mm/percpu-internal.h | 15 +-
> mm/percpu-km.c | 2 +-
> mm/percpu-stats.c | 5 +-
> mm/percpu.c | 547 +++++++++++++++++++++++++++++------------
> 5 files changed, 404 insertions(+), 177 deletions(-)
>
> Thanks,
> Dennis
Applied to percpu/for-5.2.
Thanks,
Dennis