Date: Wed, 19 Jul 2017 19:11:06 +0000
From: Josef Bacik <josef@toxicpanda.com>
To: Dennis Zhou <dennisz@fb.com>
Cc: Tejun Heo <tj@kernel.org>, Christoph Lameter <cl@linux.com>,
        kernel-team@fb.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
        Dennis Zhou <dennisszhou@gmail.com>
Subject: Re: [PATCH 09/10] percpu: replace area map allocator with bitmap
 allocator
Message-ID: <20170719191105.GC23135@li70-116.members.linode.com>
References: <20170716022315.19892-1-dennisz@fb.com>
 <20170716022315.19892-10-dennisz@fb.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20170716022315.19892-10-dennisz@fb.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4068
Lines: 81

On Sat, Jul 15, 2017 at 10:23:14PM -0400, Dennis Zhou wrote:
> From: "Dennis Zhou (Facebook)" <dennisszhou@gmail.com>
> 
> The percpu memory allocator is experiencing scalability issues when
> allocating and freeing large numbers of counters as in BPF.
> Additionally, there is a corner case where iteration is triggered over
> all chunks if the contig_hint is the right size, but wrong alignment.
> 
> Implementation:
> This patch removes the area map allocator in favor of a bitmap allocator
> backed by metadata blocks. The primary goal is to provide consistency
> in performance and memory footprint with a focus on small allocations
> (< 64 bytes). The bitmap removes the heavy memmove from the freeing
> critical path and provides a consistent memory footprint. The metadata
> blocks provide a bound on the amount of scanning required by maintaining
> a set of hints.
> 
> The chunks previously were managed by free_size, a value maintained in
> bytes. Now, the chunks are managed in terms of bits, which is just a
> scaled value of free_size down by PCPU_MIN_ALLOC_SIZE.
> 
> There is one caveat with this implementation. In an effort to make
> freeing fast, the only time metadata is updated on the free path is if a
> whole block becomes free or the freed area spans across metadata blocks.
> This causes the chunk’s contig_hint to be potentially smaller than what
> it could allocate by up to a block. If the chunk’s contig_hint is
> smaller than a block, a check occurs and the hint is kept accurate.
> Metadata is always kept accurate on allocation and therefore the
> situation where a chunk has a larger contig_hint than available will
> never occur.
> 
> Evaluation:
> I have primarily done testing against a simple workload of allocation of
> 1 million objects of varying size. Deallocation was done by in order,
> alternating, and in reverse. These numbers were collected after rebasing
> ontop of a80099a152. I present the worst-case numbers here:
> 
>   Area Map Allocator:
> 
>         Object Size | Alloc Time (ms) | Free Time (ms)
>         ----------------------------------------------
>               4B    |        335      |     4960
>              16B    |        485      |     1150
>              64B    |        445      |      280
>             128B    |        505      |      177
>            1024B    |       3385      |      140
> 
>   Bitmap Allocator:
> 
>         Object Size | Alloc Time (ms) | Free Time (ms)
>         ----------------------------------------------
>               4B    |        725      |       70
>              16B    |        760      |       70
>              64B    |        855      |       80
>             128B    |        910      |       90
>            1024B    |       3770      |      260
> 
> This data demonstrates the inability for the area map allocator to
> handle less than ideal situations. In the best case of reverse
> deallocation, the area map allocator was able to perform within range
> of the bitmap allocator. In the worst case situation, freeing took
> nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator
> dramatically improves the consistency of the free path. The small
> allocations performed nearly identical regardless of the freeing
> pattern.
> 
> While it does add to the allocation latency, the allocation scenario
> here is optimal for the area map allocator. The second problem of
> additional scanning can result in the area map allocator completing in
> 52 minutes. The same workload takes only 14 seconds to complete for the
> bitmap allocator. This was produced under a more contrived scenario of
> allocating 1 milion 4-byte objects with 8-byte alignment.
> 

This was a bear to review, I feel like it could be split into a few smaller
pieces.  You are changing hinting, allocating/freeing, and how you find chunks.
Those seem like good logical divisions to me.  Overall the design seems sound
and I didn't spot any major problems.  Once you've split them up I'll do another
thorough comb through and then add my reviewed by.  Thanks,

Josef