Date: Tue, 19 May 2015 23:38:20 +0200
From: Ingo Molnar
To: Linus Torvalds
Cc: Andy Lutomirski, Davidlohr Bueso, Peter Anvin, Denys Vlasenko,
    Linux Kernel Mailing List, Tim Chen, Borislav Petkov,
    Peter Zijlstra, "Chandramouleeswaran, Aswin", Brian Gerst,
    Paul McKenney, Thomas Gleixner, Jason Low,
    "linux-tip-commits@vger.kernel.org", Arjan van de Ven,
    Andrew Morton
Subject: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions
Message-ID: <20150519213820.GA31688@gmail.com>
In-Reply-To: <20150517055551.GB17002@gmail.com>
References: <20150410121808.GA19918@gmail.com> <20150517055551.GB17002@gmail.com>

* Ingo Molnar wrote:

> This function packing argument fails:
>
>  - for large functions that are physically fragmented
>
>  - if less than half of all functions in a hot workload are packed
>    together. This might be the common case in fact.
>
>  - even if functions are technically 'packed' next to each other, this
>    only works for small functions: larger functions typically are
>    hotter near their heads, with unlikely codepaths being in their
>    tails.
>
> > Size matters, but size matters mainly from an I$ standpoint, not
> > from some absolute 'big is bad' issue.
>
> Absolutely.
>
> > [...] Also, even when size matters, performance matters too. I do
> > want performance numbers. Is this measurable?
>
> Will try to measure this. I'm somewhat sceptical that I'll be able to
> measure any signal: alignment effects are very hard to measure on x86,
> especially on any realistic workload.

So I spent the better part of today trying to measure all this. As
expected it's very hard to measure such alignment effects: anything that
misses the cache and is a real workload tends to be rather noisy.

After a couple of fruitless attempts I wrote the test program below. It
tries to create a more or less meaningful macro-workload that executes a
lot of diverse kernel functions, while still having a relatively low
data footprint:

/*
 * Copyright (C) 2015, Ingo Molnar
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <assert.h>
#include <locale.h>
#include <sys/stat.h>
#include <sys/mman.h>

#define BUG_ON(cond) assert(!(cond))

#define DEFAULT_LOOPS 100000

int main(int argc, char **argv)
{
	const char *file_name = "./tmp-test-creat.txt";
	struct stat stat_buf;
	long loops;
	long i;
	int ret;

	/* Enable the %'ld thousands separators in printf(): */
	setlocale(LC_ALL, "");

	if (argc > 1)
		loops = atol(argv[1]);
	else
		loops = DEFAULT_LOOPS;

	printf("# VFS-mix: creat()+stat()+mmap()+munmap()+close()+unlink()ing a file in $(pwd) %'ld times ...\n", loops);
	fflush(stdout);

	for (i = 0; i < loops; i++) {
		int fd;

		fd = creat(file_name, 00700);
		BUG_ON(fd < 0);

		ret = lstat(file_name, &stat_buf);
		BUG_ON(ret != 0);

		ret = lseek(fd, 4095, SEEK_SET);
		BUG_ON(ret != 4095);

		close(fd);

		/* O_CREAT requires the mode argument: */
		fd = open(file_name, O_RDWR|O_CREAT|O_TRUNC, 00700);
		BUG_ON(fd < 0);

		{
			char c = 1;

			ret = write(fd, &c, 1);
			BUG_ON(ret != 1);
		}
		{
			char *mmap_buf = (char *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

			BUG_ON(mmap_buf == (void *)-1L);

			mmap_buf[0] = 1;

			ret = munmap(mmap_buf, 4096);
			BUG_ON(ret != 0);
		}
		close(fd);

		ret = unlink(file_name);
		BUG_ON(ret != 0);
	}

	printf("done.\n");

	return 0;
}
", loops); fflush(stdout); for (i = 0; i < loops; i++) { int fd; fd = creat(file_name, 00700); BUG_ON(fd < 0); ret = lstat(file_name, &stat_buf); BUG_ON(ret != 0); ret = lseek(fd, 4095, SEEK_SET); BUG_ON(ret != 4095); close(fd); fd = open(file_name, O_RDWR|O_CREAT|O_TRUNC); BUG_ON(fd < 0); { char c = 1; ret = write(fd, &c, 1); BUG_ON(ret != 1); } { char *mmap_buf = (char *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); BUG_ON(mmap_buf == (void *)-1L); mmap_buf[0] = 1; ret = munmap(mmap_buf, 4096); BUG_ON(ret != 0); } close(fd); ret = unlink(file_name); BUG_ON(ret != 0); } printf("done.\n"); return 0; } This is a pretty silly test in itself that re-creates the same file again and again and does some VFS and MM ops on it, then deletes it. But the nice thing about it is that in every iteration, if ran on a journalling filesystem such as ext4, it touches: - the VFS - various filesystem low level functions - various IO/block layer functions - low level IO driver - memory allocators (slub and page allocator) - IRQ handling - page fault handling - mmap() paths - syscall entry/exit path - scheduling - the RCU subsystem - the workqueue subsystem - the perf events subsystem That's a ton of activity: running this on ext4 executes around 1,500 unique kernel functions (!), around 118,000 instructions per iteration - with relatively small data footprint. The profile is pretty flat, with almost 1,000 functions being beyond the 0.01% overhead threshold. This test workload has a non-trivial I$ footprint: with 1,500 functions and the median kernel function size of around 150 bytes, it's about 220k of text executed - well beyond L1 instruction cache sizes. So this is a test that has a realistic instruction cache footprint, but has micro-benchmark data footprint properties - which reduces noise while still excercising the I$ effect we seek. To further reduce measurement noise, I ran this on a system with SSD disks, so the IRQ and block IO workload is almost synchronous and the CPU is fully utilized. Then I bound the workload to a single CPU, with prio SCHED_FIFO:90, and ran it the following way: taskset 1 chrt -f 99 perf stat -a -C 0 --pre sync --repeat 10 \ -e L1-icache-load-misses \ -e instructions \ -e context-switches \ ./test-vfs Then I built 11 kernels, each with different function alignment settings: fomalhaut:~/linux> size linux-*/vmlinux | sort -n text data bss dec filename 12150606 2565544 1634304 16350454 linux-____CC_OPTIMIZE_FOR_SIZE=y/vmlinux 13534242 2579560 1634304 17748106 linux-falign-functions=__1-bytes/vmlinux 13554530 2579560 1634304 17768394 linux-falign-functions=__2-bytes/vmlinux 13590946 2579560 1634304 17804810 linux-falign-functions=__4-bytes/vmlinux 13658786 2579560 1634304 17872650 linux-falign-functions=__8-bytes/vmlinux 13799602 2579560 1634304 18013466 linux-falign-functions=_16-bytes/vmlinux 14075330 2579560 1634304 18289194 linux-falign-functions=_32-bytes/vmlinux 14664898 2579560 1634304 18878762 linux-falign-functions=_64-bytes/vmlinux 15980994 2579560 1634304 20194858 linux-falign-functions=128-bytes/vmlinux 19038018 2591848 1634304 23264170 linux-falign-functions=256-bytes/vmlinux 26391106 2591848 1634304 30617258 linux-falign-functions=512-bytes/vmlinux (I've added the -Os kernel as a comparison.) 
A single run of the above perf stat command results in output like:

 Performance counter stats for 'system wide' (10 runs):

       649,398,972      L1-icache-load-misses      ( +- 0.09% )  (100.00%)
    11,875,234,916      instructions               ( +- 0.00% )
           300,038      context-switches           ( +- 0.00% )

       7.198533444 seconds time elapsed            ( +- 0.28% )

The 'instructions' and 'context-switches' metrics can be used to
cross-check that the run was undisturbed and that it executed exactly
the same workload as the other runs.

'seconds time elapsed' is rather noisy, while 'L1-icache-load-misses'
is the most interesting metric: it's a proxy for the effective I$
footprint of the workload. If the footprint gets larger, the number of
misses goes up; if it gets tighter, the number of misses goes down.

I ran this on two systems.

... an Intel system:

 vendor_id	: GenuineIntel
 cpu family	: 6
 model		: 62
 model name	: Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
 stepping	: 7
 cache size	: 25600 KB
 cache_alignment	: 64

... and an AMD system:

 vendor_id	: AuthenticAMD
 cpu family	: 21
 model		: 1
 model name	: AMD Opteron(tm) Processor 6278
 stepping	: 2
 cache size	: 2048 KB
 cache_alignment	: 64

The CPU frequencies were set to the maximum on both systems, to reduce
power throttling noise and run-to-run skew. The AMD system did not have
SSDs, so there I used tmpfs: this reduced the cache footprint to about
30% of that on the Intel system. The alignment effect was still
borderline measurable.

I ran the test 10 times on the AMD system (so 100 runs in total), and 3
times on the Intel system (which took longer, due to the test touching
ext4 on SSD) - i.e. 30 runs.

I picked the best result from each set of runs, to reduce noise from
workload perturbations and from data cache layout effects. (If I take
the average values instead, the effect is still visible, but noisier.)

Here's the result from the Intel system:

 linux-falign-functions=_64-bytes/res.txt:   647,853,942 L1-icache-load-misses   ( +- 0.07% )  (100.00%)
 linux-falign-functions=128-bytes/res.txt:   669,401,612 L1-icache-load-misses   ( +- 0.08% )  (100.00%)
 linux-falign-functions=_32-bytes/res.txt:   685,969,043 L1-icache-load-misses   ( +- 0.08% )  (100.00%)
 linux-falign-functions=256-bytes/res.txt:   699,130,207 L1-icache-load-misses   ( +- 0.06% )  (100.00%)
 linux-falign-functions=512-bytes/res.txt:   699,130,207 L1-icache-load-misses   ( +- 0.06% )  (100.00%)
 linux-falign-functions=_16-bytes/res.txt:   706,080,917 L1-icache-load-misses   ( +- 0.05% )  (100.00%)  [vanilla kernel]
 linux-falign-functions=__1-bytes/res.txt:   724,539,055 L1-icache-load-misses   ( +- 0.31% )  (100.00%)
 linux-falign-functions=__4-bytes/res.txt:   725,707,848 L1-icache-load-misses   ( +- 0.12% )  (100.00%)
 linux-falign-functions=__8-bytes/res.txt:   726,543,194 L1-icache-load-misses   ( +- 0.04% )  (100.00%)
 linux-falign-functions=__2-bytes/res.txt:   738,946,179 L1-icache-load-misses   ( +- 0.12% )  (100.00%)
 linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:   921,910,808 L1-icache-load-misses   ( +- 0.05% )  (100.00%)

The optimal I$ miss rate is at 64-byte alignment - 9% better than the
vanilla kernel's miss rate at the default 16-byte alignment. The
128/256/512-byte numbers show an increasing amount of cache misses,
probably due to the artificially reduced associativity of the caching.
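(The associativity point can be made concrete with a bit of arithmetic.
Assuming, for illustration, a 32K, 8-way L1 instruction cache with
64-byte lines - 64 sets, indexed by address bits 6-11; the exact
geometry of these two CPUs may differ - function start lines can no
longer reach every cache set once the alignment exceeds the line size.
A minimal standalone sketch, not from the original mail:

/*
 * set-demo.c: count how many L1I cache sets the *first* line of a
 * function can fall into, as a function of the alignment.
 * Assumed geometry: 64-byte lines, 64 sets (32K, 8-way) - illustrative.
 */
#include <stdio.h>

#define LINE_SIZE	64
#define NR_SETS		64

int main(void)
{
	unsigned long align, addr;

	for (align = 16; align <= 512; align *= 2) {
		char seen[NR_SETS] = { 0, };
		int used = 0;

		for (addr = 0; addr < 1024 * 1024; addr += align) {
			unsigned int set = (addr / LINE_SIZE) % NR_SETS;

			if (!seen[set]) {
				seen[set] = 1;
				used++;
			}
		}
		printf("align=%4lu bytes: function heads can use %2d of %d sets\n",
		       align, used, NR_SETS);
	}
	return 0;
}

With 128/256/512-byte alignment, the hot first lines of all functions
compete for only 32/16/8 of the 64 sets - behaving like a cache with
proportionally lower capacity for those lines, consistent with the
rising miss counts above.)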
Surprisingly, there's a rather marked improvement in elapsed time as
well:

 linux-falign-functions=_64-bytes/res.txt:   7.154816369 seconds time elapsed   ( +- 0.03% )
 linux-falign-functions=_32-bytes/res.txt:   7.231074263 seconds time elapsed   ( +- 0.12% )
 linux-falign-functions=__8-bytes/res.txt:   7.292203002 seconds time elapsed   ( +- 0.30% )
 linux-falign-functions=128-bytes/res.txt:   7.314226040 seconds time elapsed   ( +- 0.29% )
 linux-falign-functions=_16-bytes/res.txt:   7.333597250 seconds time elapsed   ( +- 0.48% )  [vanilla kernel]
 linux-falign-functions=__1-bytes/res.txt:   7.367139908 seconds time elapsed   ( +- 0.28% )
 linux-falign-functions=__4-bytes/res.txt:   7.371721930 seconds time elapsed   ( +- 0.26% )
 linux-falign-functions=__2-bytes/res.txt:   7.410033936 seconds time elapsed   ( +- 0.34% )
 linux-falign-functions=256-bytes/res.txt:   7.507029637 seconds time elapsed   ( +- 0.07% )
 linux-falign-functions=512-bytes/res.txt:   7.507029637 seconds time elapsed   ( +- 0.07% )
 linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:   8.531418784 seconds time elapsed   ( +- 0.19% )

The workload got 2.5% faster - which is pretty nice! This result is 5+
standard deviations above the noise of the measurement.

Side note: see how catastrophic -Os (CC_OPTIMIZE_FOR_SIZE=y)
performance is: a markedly higher cache miss rate despite a 'smaller'
kernel, and the workload is 16.3% slower (!).

Part of the -Os picture is that the -Os kernel executes many more
instructions:

 linux-falign-functions=_64-bytes/res.txt:   11,851,763,357 instructions   ( +- 0.01% )
 linux-falign-functions=__1-bytes/res.txt:   11,852,538,446 instructions   ( +- 0.01% )
 linux-falign-functions=_16-bytes/res.txt:   11,854,159,736 instructions   ( +- 0.01% )
 linux-falign-functions=__4-bytes/res.txt:   11,864,421,708 instructions   ( +- 0.01% )
 linux-falign-functions=__8-bytes/res.txt:   11,865,947,941 instructions   ( +- 0.01% )
 linux-falign-functions=_32-bytes/res.txt:   11,867,369,566 instructions   ( +- 0.01% )
 linux-falign-functions=128-bytes/res.txt:   11,867,698,477 instructions   ( +- 0.01% )
 linux-falign-functions=__2-bytes/res.txt:   11,870,853,247 instructions   ( +- 0.01% )
 linux-falign-functions=256-bytes/res.txt:   11,876,281,686 instructions   ( +- 0.01% )
 linux-falign-functions=512-bytes/res.txt:   11,876,281,686 instructions   ( +- 0.01% )
 linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt:   14,318,175,358 instructions   ( +- 0.01% )

About 21% more instructions executed ... that cannot go well.

So this should be a reminder that it's the effective I$ footprint and
the number of instructions executed that matter to performance, not
kernel size alone. With current GCC, -Os should only be used on
embedded systems where one is willing to make the kernel 10%+ slower,
in exchange for a roughly 20% smaller kernel.
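(For the sceptical reader, here's a tiny standalone sketch - not part of
the test program - that re-derives the percentages quoted above purely
from the numbers in the tables: the 2.5% speedup, its ~5 sigma
significance, and the ~21% -Os instruction overhead:

/* check-claims.c: re-derive the quoted percentages from the tables. */
#include <stdio.h>

int main(void)
{
	double t_16  = 7.333597250;		/* vanilla 16-byte kernel */
	double t_64  = 7.154816369;		/* 64-byte aligned kernel */
	double sigma = 7.333597250 * 0.0048;	/* +- 0.48% noise of t_16 */
	double i_64  = 11851763357.0;		/* instructions, 64-byte */
	double i_os  = 14318175358.0;		/* instructions, -Os */

	printf("64-byte speedup vs vanilla: %.1f%%\n", 100.0 * (t_16 - t_64) / t_64);
	printf("distance above the noise:   %.1f sigma\n", (t_16 - t_64) / sigma);
	printf("-Os extra instructions:     %.1f%%\n", 100.0 * (i_os - i_64) / i_64);
	return 0;
}

It prints 2.5%, 5.1 sigma and 20.8%, matching the claims.)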
The AMD system, with a starkly different x86 microarchitecture, shows
similar characteristics:

 linux-falign-functions=_64-bytes/res-amd.txt:   108,886,550 L1-icache-load-misses   ( +- 0.10% )  (100.00%)
 linux-falign-functions=_32-bytes/res-amd.txt:   110,433,214 L1-icache-load-misses   ( +- 0.15% )  (100.00%)
 linux-falign-functions=__1-bytes/res-amd.txt:   113,623,200 L1-icache-load-misses   ( +- 0.17% )  (100.00%)
 linux-falign-functions=128-bytes/res-amd.txt:   119,100,216 L1-icache-load-misses   ( +- 0.22% )  (100.00%)
 linux-falign-functions=_16-bytes/res-amd.txt:   122,916,937 L1-icache-load-misses   ( +- 0.15% )  (100.00%)
 linux-falign-functions=__8-bytes/res-amd.txt:   123,810,566 L1-icache-load-misses   ( +- 0.18% )  (100.00%)
 linux-falign-functions=__2-bytes/res-amd.txt:   124,337,908 L1-icache-load-misses   ( +- 0.71% )  (100.00%)
 linux-falign-functions=__4-bytes/res-amd.txt:   125,221,805 L1-icache-load-misses   ( +- 0.09% )  (100.00%)
 linux-falign-functions=256-bytes/res-amd.txt:   135,761,433 L1-icache-load-misses   ( +- 0.18% )  (100.00%)
 linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:   159,918,181 L1-icache-load-misses   ( +- 0.10% )  (100.00%)
 linux-falign-functions=512-bytes/res-amd.txt:   170,307,064 L1-icache-load-misses   ( +- 0.26% )  (100.00%)

64 bytes is a similar sweet spot. Note that the penalty at 512 bytes is
much steeper than on the Intel system: cache associativity is likely
lower on this AMD CPU.

Interestingly, the 1-byte alignment result is still pretty good on the
AMD system - and I used the exact same kernel image on both systems, so
the layout of the functions is exactly the same.

Elapsed time is noisier, but shows a similar trend:

 linux-falign-functions=_64-bytes/res-amd.txt:   1.928409143 seconds time elapsed   ( +- 2.74% )
 linux-falign-functions=128-bytes/res-amd.txt:   1.932961745 seconds time elapsed   ( +- 2.18% )
 linux-falign-functions=__8-bytes/res-amd.txt:   1.940703051 seconds time elapsed   ( +- 1.84% )
 linux-falign-functions=__1-bytes/res-amd.txt:   1.940744001 seconds time elapsed   ( +- 2.15% )
 linux-falign-functions=_32-bytes/res-amd.txt:   1.962074787 seconds time elapsed   ( +- 2.38% )
 linux-falign-functions=_16-bytes/res-amd.txt:   2.000941789 seconds time elapsed   ( +- 1.18% )
 linux-falign-functions=__4-bytes/res-amd.txt:   2.002305627 seconds time elapsed   ( +- 2.75% )
 linux-falign-functions=256-bytes/res-amd.txt:   2.003218532 seconds time elapsed   ( +- 3.16% )
 linux-falign-functions=__2-bytes/res-amd.txt:   2.031252839 seconds time elapsed   ( +- 1.77% )
 linux-falign-functions=512-bytes/res-amd.txt:   2.080632439 seconds time elapsed   ( +- 1.06% )
 linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt:   2.346644318 seconds time elapsed   ( +- 2.19% )

64-byte alignment is the sweet spot here as well: it's 3.7% faster than
the default 16-byte alignment.

So based on those measurements, I think we should do the exact opposite
of my original patch that reduced alignment to 1 byte, and increase
kernel function address alignment from 16 bytes to the natural cache
line size (64 bytes on modern CPUs).

The cost is a 6.2% larger kernel image:

 13799602	2579560	1634304	18013466	linux-falign-functions=_16-bytes/vmlinux
 14664898	2579560	1634304	18878762	linux-falign-functions=_64-bytes/vmlinux

But this is basically just a RAM cost: it does not increase the
effective cache footprint (in fact it decreases it), so it's likely a
win on everything but RAM-starved embedded systems.

In the future we could perhaps still pack the small functions of
certain subsystems (such as hot locking functions) - see the sketch
right below.
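(A hypothetical sketch of what such packing could look like: kbuild
already supports per-object compiler flags, so selected hot objects
could override the global alignment. The file and object names below
are illustrative only, not part of the patch:

  # kernel/locking/Makefile (illustrative): pack the hot locking
  # primitives tightly, while the rest of the kernel stays
  # cache-line aligned:
  CFLAGS_mutex.o += -falign-functions=1
  CFLAGS_rwsem.o += -falign-functions=1

Since per-object CFLAGS are appended after KBUILD_CFLAGS, the later
-falign-functions=1 would win over the global 64-byte setting for just
those objects.)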
The patch below implements this alignment optimization in the build
system, using the X86_L1_CACHE_SIZE Kconfig value to derive the
function alignment - limited to 64-bit kernels for the time being.

Not-Yet-Signed-off-by: Ingo Molnar
---
 arch/x86/Kconfig.cpu | 17 +++++++++++++++++
 arch/x86/Makefile    |  6 ++++++
 2 files changed, 23 insertions(+)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 6983314c8b37..9eacb85efda9 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -304,6 +304,23 @@ config X86_L1_CACHE_SHIFT
 	default "4" if MELAN || M486 || MGEODEGX1
 	default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
 
+config X86_L1_CACHE_SIZE
+	int
+	default "16" if X86_L1_CACHE_SHIFT=4
+	default "32" if X86_L1_CACHE_SHIFT=5
+	default "64" if X86_L1_CACHE_SHIFT=6
+	default "128" if X86_L1_CACHE_SHIFT=7
+
+#
+# We optimize function alignment on 64-bit kernels,
+# to pack the instruction cache optimally:
+#
+config X86_FUNCTION_ALIGNMENT
+	int
+	default "16" if !64BIT
+	default X86_L1_CACHE_SIZE if 64BIT && !CC_OPTIMIZE_FOR_SIZE
+	default "1" if 64BIT && CC_OPTIMIZE_FOR_SIZE
+
 config X86_PPRO_FENCE
 	bool "PentiumPro memory ordering errata workaround"
 	depends on M686 || M586MMX || M586TSC || M586 || M486 || MGEODEGX1
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 57996ee840dd..45ebf0bbe833 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -83,6 +83,12 @@ else
         # Pack loops tightly as well:
         KBUILD_CFLAGS += -falign-loops=1
 
+        #
+        # Allocate a separate cacheline for every function,
+        # for optimal instruction cache packing:
+        #
+        KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)
+
         # Don't autogenerate traditional x87 instructions
         KBUILD_CFLAGS += $(call cc-option,-mno-80387)
         KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)
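PS: the end result can be sanity-checked on a running kernel by looking
at the alignment distribution of its text symbols. A minimal sketch
(hypothetical align-check.c, not part of the patch; run it as root, or
with kptr_restrict disabled, so the /proc/kallsyms addresses are not
zeroed out):

/*
 * align-check.c: report how many kernel text symbols start on a
 * 64-byte boundary, according to /proc/kallsyms.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/kallsyms", "r");
	char line[512], type, name[256];
	unsigned long addr, total = 0, aligned = 0;

	if (!f) {
		perror("/proc/kallsyms");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%lx %c %255s", &addr, &type, name) != 3)
			continue;
		if (type != 't' && type != 'T')	/* text symbols only */
			continue;
		total++;
		if (addr % 64 == 0)
			aligned++;
	}
	fclose(f);
	printf("%lu of %lu text symbols are 64-byte aligned\n", aligned, total);
	return 0;
}

On a kernel built with this patch, the fraction of 64-byte aligned
function-start symbols should rise sharply versus a vanilla build.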