Date: Mon, 10 Jul 2017 21:58:07 -0500
From: Josh Poimboeuf <jpoimboe@redhat.com>
To: Ingo Molnar
Cc: x86@kernel.org, linux-kernel@vger.kernel.org, live-patching@vger.kernel.org,
    Linus Torvalds, Andy Lutomirski, Jiri Slaby, "H. Peter Anvin", Peter Zijlstra
Subject: Re: [PATCH v2 4/8] objtool: add undwarf debuginfo generation
Message-ID: <20170711025807.62fzfgf2dhcgqur6@treble>
References: <20170629072512.pmkfnrgq4dci6od7@gmail.com>
 <20170629140404.qgcvxhcgm7iywrkb@treble>
 <20170629144618.vdzem7o6ib5nqab6@gmail.com>
 <20170629150652.r2dl7f3pzp6cj2i7@treble>
 <20170706203636.lcwfjsphmy2q464v@treble>
 <20170707094437.2vgosia5hjg2wsut@gmail.com>
In-Reply-To: <20170707094437.2vgosia5hjg2wsut@gmail.com>

On Fri, Jul 07, 2017 at 11:44:37AM +0200, Ingo Molnar wrote:
> 
> * Josh Poimboeuf wrote:
> 
> > On Thu, Jun 29, 2017 at 10:06:52AM -0500, Josh Poimboeuf wrote:
> > > On Thu, Jun 29, 2017 at 04:46:18PM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Josh Poimboeuf wrote:
> > > > 
> > > > > > Plus, shouldn't we use __packed for 'struct undwarf' to minimize the
> > > > > > structure's size (to 6 bytes AFAICS?) - or is optimal packing of the
> > > > > > main undwarf array already guaranteed on every platform with this
> > > > > > layout?
> > > > > 
> > > > > Ah yes, it should definitely be packed (assuming that doesn't affect
> > > > > performance negatively).
> > > > 
> > > > So if I count that correctly that should shave another ~1MB off a
> > > > typical ~4MB table size?
> > > 
> > > Here's what my Fedora kernel looks like *before* the packed change:
> > > 
> > >   $ eu-readelf -S vmlinux | grep undwarf
> > >   [15] .undwarf_ip  PROGBITS  ffffffff81f776d0 011776d0 0012d9d0  0 A  0 0 1
> > >   [16] .undwarf     PROGBITS  ffffffff820a50a0 012a50a0 0025b3a0  0 A  0 0 1
> > > 
> > > The total undwarf data size is ~3.5MB.
> > > 
> > > There are 308852 entries in each of two parallel arrays:
> > > 
> > >   * .undwarf    (8 bytes/entry) = 2470816 bytes
> > >   * .undwarf_ip (4 bytes/entry) = 1235408 bytes
> > > 
> > > If we pack undwarf, reducing the size of the .undwarf entries by two
> > > bytes, it will save 308852 * 2 = 617704 bytes.
> > > 
> > > So the savings will be ~600k, and the typical size will be reduced to
> > > ~3MB.
> > 
> > Just for the record, while packing the struct from 8 to 6 bytes did save
> > 600k, it also made the unwinder ~7% slower.  I think that's probably an
> > ok tradeoff, so I'll leave it packed in v3.
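
(To make the 8-vs-6-byte difference concrete, here's a rough sketch of such
an entry.  The field names and bit widths below are only illustrative -- they
are not the actual 'struct undwarf' definition from the patch -- but the
shape is the same: two 16-bit offsets plus a few bits of register/type state,
which GCC pads out to a full 8 bytes unless the struct is packed.)

#include <linux/types.h>                /* s16 */
#include <linux/compiler.h>             /* __packed */

/* Illustrative layout only -- not the real struct undwarf. */
struct undwarf {
        s16             cfa_offset;     /* offset for computing the CFA */
        s16             fp_offset;      /* offset of the saved frame pointer */
        unsigned        cfa_reg:4;      /* base register for cfa_offset */
        unsigned        fp_reg:4;       /* base register for fp_offset */
        unsigned        type:2;         /* frame type */
} __packed;                             /* sizeof() == 6; 8 without __packed */
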
> So, out of curiosity, I'm wondering where that slowdown comes from: on
> modern x86 CPUs indexing by units of 6 bytes ought to be just as fast as
> indexing by 8 bytes, unless I'm missing something?  Is it maybe the not
> naturally aligned 32-bit words?
> 
> Or maybe there's some bad case of a 32-bit word crossing a 64-byte cache
> line boundary that hits some pathological aspect of the CPU?  We could
> probably get around any such problems by padding by 2 bytes on 64-byte
> boundaries - that's only a ~3% data size increase.  The flip side would be
> a complication of the data structure and its accessors - which might cost
> more in terms of code generation efficiency than it buys us to begin
> with ...
> 
> Also, there's another aspect besides RAM footprint: a large data structure
> that is ~20% smaller means 20% less cache footprint, which for cache-cold
> lookups might matter more than the direct computational cost.

tl;dr: Packed really seems to be more like ~2% slower; time for an adult
beverage.

So I tested again with the latest version of my code, and this time packed
was 5% *faster* than unpacked, rather than 7% slower.  'perf stat' showed
that, in both cases, most of the difference was caused by branch misses in
the binary search code.  But that code doesn't even touch the packed
struct...

After some hair-pulling/hand-wringing, I realized that changing the struct
packing caused GCC to change some of the unwinder code a bit, which shifted
the rest of the kernel's function offsets enough to change the behavior of
the unwind table binary search in a way that affected the CPU's branch
prediction.  And my crude benchmark was just unwinding the same stack on
repeat, so a small change in the loop behavior had a big impact on the
overall branch predictability.

Anyway, I used some linker magic to temporarily move the unwinder code to
the end of .text, so that unwinder changes don't add unexpected side effects
to the microbenchmark behavior.  Now I'm getting more consistent results:
the packed struct measures ~2% slower.  The slight slowdown is probably
explained by the extra instructions GCC generates for extracting the fields
out of the packed struct.

In the meantime, I found a ~10% speedup by making the "fast lookup table"
block size a power of two (256), which gets rid of the need for a slow 'div'
instruction.

I think I'm done performance tweaking for now.  I'll keep the packed struct,
add the code for the 'div' removal, and hope to submit v3 soon.

-- 
Josh
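
For reference, the 'div' removal boils down to something like the sketch
below.  The macro and function names are made up for illustration -- this is
not the actual unwinder code -- but the point is that with a power-of-two
block size, finding the fast-lookup-table slot for a given text address is a
subtract and a shift rather than a division:

/* Illustrative sketch, not the actual unwinder code. */
#define LOOKUP_BLOCK_SHIFT      8                       /* 256-byte blocks */
#define LOOKUP_BLOCK_SIZE       (1UL << LOOKUP_BLOCK_SHIFT)

static unsigned int lookup_block(unsigned long ip, unsigned long text_start)
{
        /*
         * With a power-of-two block size this compiles to a subtract and a
         * shift; with an arbitrary block size GCC would have to emit a
         * 'div'.  The resulting index would then seed the binary search
         * over the unwind tables (not shown here).
         */
        return (ip - text_start) >> LOOKUP_BLOCK_SHIFT;
}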