Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S933855AbdGSTLL (ORCPT ); Wed, 19 Jul 2017 15:11:11 -0400 Received: from mail-yb0-f179.google.com ([209.85.213.179]:33349 "EHLO mail-yb0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933834AbdGSTLI (ORCPT ); Wed, 19 Jul 2017 15:11:08 -0400 Date: Wed, 19 Jul 2017 19:11:06 +0000 From: Josef Bacik To: Dennis Zhou Cc: Tejun Heo , Christoph Lameter , kernel-team@fb.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, Dennis Zhou Subject: Re: [PATCH 09/10] percpu: replace area map allocator with bitmap allocator Message-ID: <20170719191105.GC23135@li70-116.members.linode.com> References: <20170716022315.19892-1-dennisz@fb.com> <20170716022315.19892-10-dennisz@fb.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20170716022315.19892-10-dennisz@fb.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4068 Lines: 81 On Sat, Jul 15, 2017 at 10:23:14PM -0400, Dennis Zhou wrote: > From: "Dennis Zhou (Facebook)" > > The percpu memory allocator is experiencing scalability issues when > allocating and freeing large numbers of counters as in BPF. > Additionally, there is a corner case where iteration is triggered over > all chunks if the contig_hint is the right size, but wrong alignment. > > Implementation: > This patch removes the area map allocator in favor of a bitmap allocator > backed by metadata blocks. The primary goal is to provide consistency > in performance and memory footprint with a focus on small allocations > (< 64 bytes). The bitmap removes the heavy memmove from the freeing > critical path and provides a consistent memory footprint. The metadata > blocks provide a bound on the amount of scanning required by maintaining > a set of hints. > > The chunks previously were managed by free_size, a value maintained in > bytes. Now, the chunks are managed in terms of bits, which is just a > scaled value of free_size down by PCPU_MIN_ALLOC_SIZE. > > There is one caveat with this implementation. In an effort to make > freeing fast, the only time metadata is updated on the free path is if a > whole block becomes free or the freed area spans across metadata blocks. > This causes the chunk’s contig_hint to be potentially smaller than what > it could allocate by up to a block. If the chunk’s contig_hint is > smaller than a block, a check occurs and the hint is kept accurate. > Metadata is always kept accurate on allocation and therefore the > situation where a chunk has a larger contig_hint than available will > never occur. > > Evaluation: > I have primarily done testing against a simple workload of allocation of > 1 million objects of varying size. Deallocation was done by in order, > alternating, and in reverse. These numbers were collected after rebasing > ontop of a80099a152. I present the worst-case numbers here: > > Area Map Allocator: > > Object Size | Alloc Time (ms) | Free Time (ms) > ---------------------------------------------- > 4B | 335 | 4960 > 16B | 485 | 1150 > 64B | 445 | 280 > 128B | 505 | 177 > 1024B | 3385 | 140 > > Bitmap Allocator: > > Object Size | Alloc Time (ms) | Free Time (ms) > ---------------------------------------------- > 4B | 725 | 70 > 16B | 760 | 70 > 64B | 855 | 80 > 128B | 910 | 90 > 1024B | 3770 | 260 > > This data demonstrates the inability for the area map allocator to > handle less than ideal situations. In the best case of reverse > deallocation, the area map allocator was able to perform within range > of the bitmap allocator. In the worst case situation, freeing took > nearly 5 seconds for 1 million 4-byte objects. The bitmap allocator > dramatically improves the consistency of the free path. The small > allocations performed nearly identical regardless of the freeing > pattern. > > While it does add to the allocation latency, the allocation scenario > here is optimal for the area map allocator. The second problem of > additional scanning can result in the area map allocator completing in > 52 minutes. The same workload takes only 14 seconds to complete for the > bitmap allocator. This was produced under a more contrived scenario of > allocating 1 milion 4-byte objects with 8-byte alignment. > This was a bear to review, I feel like it could be split into a few smaller pieces. You are changing hinting, allocating/freeing, and how you find chunks. Those seem like good logical divisions to me. Overall the design seems sound and I didn't spot any major problems. Once you've split them up I'll do another thorough comb through and then add my reviewed by. Thanks, Josef