2015-04-08 19:39:27

by Jason Low

Subject: [PATCH 0/2] locking: Simplify mutex and rwsem spinning code

This patchset applies on top of tip.

Jason Low (2):
locking/mutex: Further refactor mutex_spin_on_owner()
locking/rwsem: Use a return variable in rwsem_spin_on_owner()

kernel/locking/mutex.c | 14 ++++----------
kernel/locking/rwsem-xadd.c | 25 ++++++++++++-------------
2 files changed, 16 insertions(+), 23 deletions(-)

--
1.7.2.5


2015-04-08 19:39:41

by Jason Low

Subject: [PATCH 1/2] locking/mutex: Further refactor mutex_spin_on_owner()

Similar to what Linus suggested for rwsem_spin_on_owner(), in
mutex_spin_on_owner() instead of having while (true) and breaking
out of the spin loop on lock->owner != owner, we can have the loop
directly check for while (lock->owner == owner) to improve the
readability of the code.

Signed-off-by: Jason Low <[email protected]>
---
kernel/locking/mutex.c | 14 ++++----------
1 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 16b2d3c..4cccea6 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -224,20 +224,14 @@ ww_mutex_set_context_slowpath(struct ww_mutex *lock,
static noinline
bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
{
- bool ret;
+ bool ret = true;

rcu_read_lock();
- while (true) {
- /* Return success when the lock owner changed */
- if (lock->owner != owner) {
- ret = true;
- break;
- }
-
+ while (lock->owner == owner) {
/*
* Ensure we emit the owner->on_cpu, dereference _after_
- * checking lock->owner still matches owner, if that fails,
- * owner might point to free()d memory, if it still matches,
+ * checking lock->owner still matches owner. If that fails,
+ * owner might point to freed memory. If it still matches,
* the rcu_read_lock() ensures the memory stays valid.
*/
barrier();
--
1.7.2.5

2015-04-08 19:39:54

by Jason Low

Subject: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

Ingo suggested for mutex_spin_on_owner() that having multiple return
statements is not the cleanest approach, especially when holding locks.

The same thing applies to the rwsem variant. This patch rewrites
much of this function to use a "ret" return value.

Signed-off-by: Jason Low <[email protected]>
---
kernel/locking/rwsem-xadd.c | 25 ++++++++++++-------------
1 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 3417d01..b1c9156 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -327,38 +327,37 @@ done:
static noinline
bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner)
{
- long count;
+ bool ret = true;

rcu_read_lock();
while (sem->owner == owner) {
/*
* Ensure we emit the owner->on_cpu, dereference _after_
- * checking sem->owner still matches owner, if that fails,
- * owner might point to free()d memory, if it still matches,
+ * checking sem->owner still matches owner. If that fails,
+ * owner might point to freed memory. If it still matches,
* the rcu_read_lock() ensures the memory stays valid.
*/
barrier();

- /* abort spinning when need_resched or owner is not running */
+ /* Abort spinning when need_resched or owner is not running. */
if (!owner->on_cpu || need_resched()) {
- rcu_read_unlock();
- return false;
+ ret = false;
+ break;
}

cpu_relax_lowlatency();
}
rcu_read_unlock();

- if (READ_ONCE(sem->owner))
- return true; /* new owner, continue spinning */
-
/*
* When the owner is not set, the lock could be free or
- * held by readers. Check the counter to verify the
- * state.
+ * held by readers. Check the counter to verify the state.
*/
- count = READ_ONCE(sem->count);
- return (count == 0 || count == RWSEM_WAITING_BIAS);
+ if (!READ_ONCE(sem->owner)) {
+ long count = READ_ONCE(sem->count);
+ ret = (count == 0 || count == RWSEM_WAITING_BIAS);
+ }
+ return ret;
}

static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
--
1.7.2.5

2015-04-08 19:49:55

by Davidlohr Bueso

Subject: Re: [PATCH 0/2] locking: Simplify mutex and rwsem spinning code

On Wed, 2015-04-08 at 12:39 -0700, Jason Low wrote:
> This patchset applies on top of tip.
>
> Jason Low (2):
> locking/mutex: Further refactor mutex_spin_on_owner()
> locking/rwsem: Use a return variable in rwsem_spin_on_owner()
>
> kernel/locking/mutex.c | 14 ++++----------
> kernel/locking/rwsem-xadd.c | 25 ++++++++++++-------------
> 2 files changed, 16 insertions(+), 23 deletions(-)

Meh. Personally I wouldn't mind leaving these files alone for a change;
the code is already pretty readable imho. Of course it's a matter of
taste, so I won't argue.

Thanks,
Davidlohr

2015-04-08 20:10:50

by Jason Low

Subject: Re: [PATCH 0/2] locking: Simplify mutex and rwsem spinning code

On Wed, 2015-04-08 at 12:49 -0700, Davidlohr Bueso wrote:
> On Wed, 2015-04-08 at 12:39 -0700, Jason Low wrote:
> > This patchset applies on top of tip.
> >
> > Jason Low (2):
> > locking/mutex: Further refactor mutex_spin_on_owner()
> > locking/rwsem: Use a return variable in rwsem_spin_on_owner()
> >
> > kernel/locking/mutex.c | 14 ++++----------
> > kernel/locking/rwsem-xadd.c | 25 ++++++++++++-------------
> > 2 files changed, 16 insertions(+), 23 deletions(-)
>
> Meh. Personally I wouldn't mind leaving these files alone for a change;
> the code is already pretty readable imho. Of course it's a matter of
> taste, so I won't argue.

I agree that it's quite readable now too, though Peter and Ingo seem
like they would prefer these changes :)

https://lkml.org/lkml/2015/3/7/46
https://lkml.org/lkml/2015/3/16/102

2015-04-09 05:37:34

by Ingo Molnar

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()


* Jason Low <[email protected]> wrote:

> Ingo suggested for mutex_spin_on_owner() that having multiple return
> statements is not the cleanest approach, especially when holding locks.
>
> The same thing applies to the rwsem variant. This patch rewrites
> much of this function to use a "ret" return value.
>
> Signed-off-by: Jason Low <[email protected]>
> ---
> kernel/locking/rwsem-xadd.c | 25 ++++++++++++-------------
> 1 files changed, 12 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index 3417d01..b1c9156 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -327,38 +327,37 @@ done:
> static noinline
> bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner)
> {
> - long count;
> + bool ret = true;
>
> rcu_read_lock();
> while (sem->owner == owner) {
> /*
> * Ensure we emit the owner->on_cpu, dereference _after_
> - * checking sem->owner still matches owner, if that fails,
> - * owner might point to free()d memory, if it still matches,
> + * checking sem->owner still matches owner. If that fails,
> + * owner might point to freed memory. If it still matches,
> * the rcu_read_lock() ensures the memory stays valid.
> */
> barrier();
>
> - /* abort spinning when need_resched or owner is not running */
> + /* Abort spinning when need_resched or owner is not running. */
> if (!owner->on_cpu || need_resched()) {
> - rcu_read_unlock();
> - return false;
> + ret = false;
> + break;
> }
>
> cpu_relax_lowlatency();
> }
> rcu_read_unlock();
>
> - if (READ_ONCE(sem->owner))
> - return true; /* new owner, continue spinning */
> -
> /*
> * When the owner is not set, the lock could be free or
> - * held by readers. Check the counter to verify the
> - * state.
> + * held by readers. Check the counter to verify the state.
> */
> - count = READ_ONCE(sem->count);
> - return (count == 0 || count == RWSEM_WAITING_BIAS);
> + if (!READ_ONCE(sem->owner)) {
> + long count = READ_ONCE(sem->count);
> + ret = (count == 0 || count == RWSEM_WAITING_BIAS);
> + }
> + return ret;
> }
>
> static bool rwsem_optimistic_spin(struct rw_semaphore *sem)

The 'break' path does not seem to be equivalent, we used to do:

> - rcu_read_unlock();
> - return false;

and now we'll do:

> + ret = false;
...
> + if (!READ_ONCE(sem->owner)) {
> + long count = READ_ONCE(sem->count);

it's harmless (we do one more round of checking), but that's not an
equivalent transformation and slows down the preemption trigger a
(tiny) bit, because the chance that we actually catch the lock when
breaking out early is vanishingly small. (It might in fact do the
wrong thing in returning true if need_resched() is set and we've
switched owners in that small window.)

Given how dissimilar the return path is in this case, I'm not sure
it's worth sharing it. This might be one of the few cases where
separate return statements is the better solution.

Thanks,

Ingo

2015-04-09 06:40:16

by Jason Low

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, 2015-04-09 at 07:37 +0200, Ingo Molnar wrote:

> The 'break' path does not seem to be equivalent, we used to do:
>
> > - rcu_read_unlock();
> > - return false;
>
> and now we'll do:
>
> > + ret = false;
> ...
> > + if (!READ_ONCE(sem->owner)) {
> > + long count = READ_ONCE(sem->count);
>
> it's harmless (we do one more round of checking), but that's not an
> equivalent transformation and slows down the preemption trigger a
> (tiny) bit, because the chance that we actually catch the lock when
> breaking out early is vanishingly small. (It might in fact do the
> wrong thing in returning true if need_resched() is set and we've
> switched owners in that small window.)
>
> Given how dissimilar the return path is in this case, I'm not sure
> it's worth sharing it. This might be one of the few cases where
> separate return statements is the better solution.

I also preferred the multiple returns for the rwsem variant to avoid
needing to check sem->owner when it should go to slowpath, as you
mentioned.

Now that I think of it though, if we want to have just one return path,
we can still do that if we add an "out" label.

---
kernel/locking/rwsem-xadd.c | 25 +++++++++++++------------
1 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 3417d01..e74240f 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -327,38 +327,39 @@ done:
static noinline
bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner)
{
- long count;
+ bool ret = true;

rcu_read_lock();
while (sem->owner == owner) {
/*
* Ensure we emit the owner->on_cpu, dereference _after_
- * checking sem->owner still matches owner, if that fails,
- * owner might point to free()d memory, if it still matches,
+ * checking sem->owner still matches owner. If that fails,
+ * owner might point to freed memory. If it still matches,
* the rcu_read_lock() ensures the memory stays valid.
*/
barrier();

- /* abort spinning when need_resched or owner is not running */
+ /* Abort spinning when need_resched or owner is not running. */
if (!owner->on_cpu || need_resched()) {
rcu_read_unlock();
- return false;
+ ret = false;
+ goto out;
}

cpu_relax_lowlatency();
}
rcu_read_unlock();

- if (READ_ONCE(sem->owner))
- return true; /* new owner, continue spinning */
-
/*
* When the owner is not set, the lock could be free or
- * held by readers. Check the counter to verify the
- * state.
+ * held by readers. Check the counter to verify the state.
*/
- count = READ_ONCE(sem->count);
- return (count == 0 || count == RWSEM_WAITING_BIAS);
+ if (!READ_ONCE(sem->owner)) {
+ long count = READ_ONCE(sem->count);
+ ret = (count == 0 || count == RWSEM_WAITING_BIAS);
+ }
+out:
+ return ret;
}

static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
--
1.7.2.5


2015-04-09 07:53:17

by Ingo Molnar

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()


* Jason Low <[email protected]> wrote:

> On Thu, 2015-04-09 at 07:37 +0200, Ingo Molnar wrote:
>
> > The 'break' path does not seem to be equivalent, we used to do:
> >
> > > - rcu_read_unlock();
> > > - return false;
> >
> > and now we'll do:
> >
> > > + ret = false;
> > ...
> > > + if (!READ_ONCE(sem->owner)) {
> > > + long count = READ_ONCE(sem->count);
> >
> > it's harmless (we do one more round of checking), but that's not an
> > equivalent transformation and slows down the preemption trigger a
> > (tiny) bit, because the chance that we actually catch the lock when
> > breaking out early is vanishingly small. (It might in fact do the
> > wrong thing in returning true if need_resched() is set and we've
> > switched owners in that small window.)
> >
> > Given how dissimilar the return path is in this case, I'm not sure
> > it's worth sharing it. This might be one of the few cases where
> > separate return statements is the better solution.
>
> I also preferred the multiple returns for the rwsem variant to avoid
> needing to check sem->owner when it should go to slowpath, as you
> mentioned.
>
> Now that I think of it though, if we want to have just one return path,
> we can still do that if we add an "out" label.

That's the usual pattern we use, but:

> - /* abort spinning when need_resched or owner is not running */
> + /* Abort spinning when need_resched or owner is not running. */
> if (!owner->on_cpu || need_resched()) {
> rcu_read_unlock();
> - return false;
> + ret = false;
> + goto out;
> }

The point is to generally unify the 'out' paths - i.e. to merge it
with the rcu_read_unlock() as well, so that we have really simple
gotos and only a single exit path.

That's not really doable here without extra overhead AFAICS, so I'd
suggest we leave it alone ...

I have applied your other patch.

Thanks,

Ingo

Subject: [tip:locking/core] locking/mutex: Further simplify mutex_spin_on_owner()

Commit-ID: 01ac33c1f907b366dcc50551316b372f1519cca9
Gitweb: http://git.kernel.org/tip/01ac33c1f907b366dcc50551316b372f1519cca9
Author: Jason Low <[email protected]>
AuthorDate: Wed, 8 Apr 2015 12:39:19 -0700
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 9 Apr 2015 08:10:23 +0200

locking/mutex: Further simplify mutex_spin_on_owner()

Similar to what Linus suggested for rwsem_spin_on_owner(), in
mutex_spin_on_owner() instead of having while (true) and
breaking out of the spin loop on lock->owner != owner, we can
have the loop directly check for while (lock->owner == owner) to
improve the readability of the code.

It also shrinks the code a bit:

text data bss dec hex filename
3721 0 0 3721 e89 mutex.o.before
3705 0 0 3705 e79 mutex.o.after

Signed-off-by: Jason Low <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
[ Added code generation info. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/mutex.c | 14 ++++----------
1 file changed, 4 insertions(+), 10 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 16b2d3c..4cccea6 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -224,20 +224,14 @@ ww_mutex_set_context_slowpath(struct ww_mutex *lock,
static noinline
bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
{
- bool ret;
+ bool ret = true;

rcu_read_lock();
- while (true) {
- /* Return success when the lock owner changed */
- if (lock->owner != owner) {
- ret = true;
- break;
- }
-
+ while (lock->owner == owner) {
/*
* Ensure we emit the owner->on_cpu, dereference _after_
- * checking lock->owner still matches owner, if that fails,
- * owner might point to free()d memory, if it still matches,
+ * checking lock->owner still matches owner. If that fails,
+ * owner might point to freed memory. If it still matches,
* the rcu_read_lock() ensures the memory stays valid.
*/
barrier();

2015-04-09 16:47:39

by Linus Torvalds

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, Apr 9, 2015 at 12:53 AM, Ingo Molnar <[email protected]> wrote:
>
> The point is to generally unify the 'out' paths - i.e. to merge it
> with the rcu_read_unlock() as well, so that we have really simple
> gotos and only a single exit path.

Maybe just have the rcu read-locking be done in the *caller* (possibly
through using just a helper wrapper function that does nothing but the
locking), so that you can just do a simple "return false" in the
function itself.
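
Something like this sketch, say (the split into a trivial wrapper plus a
double-underscore helper is only for illustration, it's not existing code):

static noinline
bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner)
{
        bool ret;

        /* Keep the RCU read-side critical section in one obvious place. */
        rcu_read_lock();
        ret = __rwsem_spin_on_owner(sem, owner);  /* plain "return false" inside */
        rcu_read_unlock();

        return ret;
}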

That said, it worries me a bit that we do that spinning while holding
the RCU read lock in the first place. Yes, we stop spinning if
"need_resched()" is set, but what effect - if any - does all of this
have on RCU latency? If somebody is waiting for a RCU grace period,
I'm not seeing that setting need-resched...

At least with CONFIG_PREEMPT_RCU, the read-unlock is *not* just doing
a preempt-disable, so it's not necessarily just about need_resched().
It does all the magic with 'rcu_read_unlock_special.s' too..

Adding Paul. From a RCU locking standpoint, the thing is basically
(not the real code, edited down):

rcu_read_lock();
while (sem->owner == owner) {
        if (!owner->on_cpu || need_resched())
                break;
        cpu_relax_lowlatency();
}
rcu_read_unlock();

so we busy-loop while holding the RCU read lock while

sem->owner == owner && owner->on_cpu && !need_resched()

is true. That is usually not very long, but we've already had
watchdogs go off when we get this wrong, so..

Paul, comments? Are there particular latency concerns wrt
CONFIG_PREEMPT_RCU here? Or am I just being silly?

Linus

2015-04-09 17:56:58

by Paul E. McKenney

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, Apr 09, 2015 at 09:47:36AM -0700, Linus Torvalds wrote:
> On Thu, Apr 9, 2015 at 12:53 AM, Ingo Molnar <[email protected]> wrote:
> >
> > The point is to generally unify the 'out' paths - i.e. to merge it
> > with the rcu_read_unlock() as well, so that we have really simple
> > gotos and only a single exit path.
>
> Maybe just have the rcu read-locking be done in the *caller* (possibly
> through using just a helper wrapper function that does nothing but the
> locking), so that you can just do a simple "return false" in the
> function itself.
>
> That said, it worries me a bit that we do that spinning while holding
> the RCU read lock in the first place. Yes, we stop spinning if
> "need_resched()" is set, but what effect - if any - does all of this
> have on RCU latency? If somebody is waiting for a RCU grace period,
> I'm not seeing that setting need-resched...
>
> At least with CONFIG_PREEMPT_RCU, the read-unlock is *not* just doing
> a preempt-disable, so it's not necessarily just about need_resched().
> It does all the magic with 'rcu_read_unlock_special.s' too..
>
> Adding Paul. From a RCU locking standpoint, the thing is basically
> (not the real code, edited down):
>
> rcu_read_lock();
> while (sem->owner == owner) {
> if (!owner->on_cpu || need_resched())
> break;
> cpu_relax_lowlatency();
> }
> rcu_read_unlock();
>
> so we busy-loop while holding the RCU read lock while
>
> sem->owner == owner && owner->on_cpu && !need_resched()
>
> is true. That is usually not very long, but we've already had
> watchdogs go off when we get this wrong, so..
>
> Paul, comments? Are there particular latency concerns wrt
> CONFIG_PREEMPT_RCU here? Or am I just being silly?

If this was a pure spinlock, then the effects of spinning would overwhelm
any problems from extended grace periods.

But this is a sleeplock. Of course, we stay in the loop only as long as
the lock holder is actually running. But given that this is a sleeplock,
I am worried that some lock holders might run for long time periods.
After all, that is one of the traditional uses for a sleeplock. :-/

If the RCU read-side critical section lasts a few hundred milliseconds,
no problem. If it lasts for more than 500 milliseconds, I would start
getting concerned.

And if such long-term spins are likely, I cannot resist asking if this
should be instead using SRCU. If you have your own srcu_struct, you
get to delay your own SRCU grace periods as long as you want. ;-)
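
(Shape of that suggestion, as a sketch only: the srcu_struct name is
invented, and for it to actually cover the task_struct the exit path
would have to use call_srcu() on this srcu_struct rather than call_rcu():)

/* file scope: */
DEFINE_STATIC_SRCU(rwsem_spin_srcu);

/* in rwsem_spin_on_owner(): */
int idx = srcu_read_lock(&rwsem_spin_srcu);

while (sem->owner == owner) {
        if (!owner->on_cpu || need_resched())
                break;
        cpu_relax_lowlatency();
}
srcu_read_unlock(&rwsem_spin_srcu, idx);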

Thanx, Paul

2015-04-09 18:08:15

by Linus Torvalds

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, Apr 9, 2015 at 10:56 AM, Paul E. McKenney
<[email protected]> wrote:
>
> And if such long-term spins are likely, I cannot resist asking if this
> should be instead using SRCU. If you have your own srcu_struct, you
> get to delay your own SRCU grace periods as long as you want. ;-)

No, this is plain RCU, and it is only needed because the 'struct
task_struct' is RCU-allocated, and we do an optimistic access of that
'owner->on_cpu' without actually holding any locks.

And even *that* wouldn't be needed if it wasn't for DEBUG_PAGEALLOC.
We could just access stale memory.

I wonder if we should get rid of the whole RCU thing (which does add
overhead to a potentially critical piece of code), and replace it with
a new "optimistic_kernel_read()" function that basically just does a
memory read with an exception table entry (ie like __get_user(), but
without any of the user access overhead - no clac etc), so that if we
fault due to DEBUG_PAGEALLOC it just ignores the fault.
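
The spin loop would then be something like this (the primitive doesn't
exist yet, and the "returns nonzero on fault" convention here is just an
assumption for the sketch; shown for the mutex case):

int on_cpu;

while (lock->owner == owner) {
        /* A fault (e.g. under DEBUG_PAGEALLOC) just ends the spin. */
        if (optimistic_kernel_read(on_cpu, &owner->on_cpu))
                break;

        if (!on_cpu || need_resched())
                return false;

        cpu_relax_lowlatency();
}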

Hmm? I think there might be a few other places that currently get RCU
read locks just because they want to do an optimistic read of
something that might be going away from under them.

The pointer is a known-safe kernel pointer - it's just that it was
"known safe" a few instructions ago, and might be rcu-free'd at any
time.

Linus

2015-04-09 18:16:28

by Linus Torvalds

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds
<[email protected]> wrote:
>
> The pointer is a known-safe kernel pointer - it's just that it was
> "known safe" a few instructions ago, and might be rcu-free'd at any
> time.

Actually, we could even do something like this:

static inline int sem_owner_on_cpu(struct semaphore *sem,
                                   struct task_struct *owner)
{
        int on_cpu;

#ifdef CONFIG_DEBUG_PAGEALLOC
        rcu_read_lock();
#endif
        on_cpu = sem->owner == owner && owner->on_cpu;
#ifdef CONFIG_DEBUG_PAGEALLOC
        rcu_read_unlock();
#endif
        return on_cpu;
}

because we really don't need to hold the RCU lock over the whole loop,
we just need to validate that the semaphore owner still matches, and
if so, check that it's on_cpu.

And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance
*at*all*. We will have worse performance problems than doing some RCU
read-locking inside the loop.

And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about
locking, since at worst we just access stale memory for one iteration.

Hmm. It's not pretty, but neither is the current "let's just take a
rcu lock that we don't really need over a loop that doesn't have very
strict bounding".

Comments?

Linus

2015-04-09 18:39:36

by Paul E. McKenney

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, Apr 09, 2015 at 11:16:24AM -0700, Linus Torvalds wrote:
> On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > The pointer is a known-safe kernel pointer - it's just that it was
> > "known safe" a few instructions ago, and might be rcu-free'd at any
> > time.
>
> Actually, we could even do something like this:
>
> static inline int sem_owner_on_cpu(struct semaphore *sem, struct
> task_struct *owner)
> {
> int on_cpu;
>
> #ifdef CONFIG_DEBUG_PAGEALLOC
> rcu_read_lock();
> #endif
> on_cpu = sem->owner == owner && owner->on_cpu;
> #ifdef CONFIG_DEBUG_PAGEALLOC
> rcu_read_unlock();
> #endif
> return on_cpu;
> }
>
> because we really don't need to hold the RCU lock over the whole loop,
> we just need to validate that the semaphore owner still matches, and
> if so, check that it's on_cpu.

Much better from an RCU grace-period-latency perspective.

> And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance
> *at*all*. We will have worse performance problems than doing some RCU
> read-locking inside the loop.
>
> And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about
> locking, since at worst we just access stale memory for one iteration.

But if we are running on a hypervisor, mightn't our VCPU be preempted
just before accessing ->on_cpu, the task exit and its structures be
freed and unmapped? Or is the task structure in memory that is never
unmapped? (If the latter, clearly not a problem.)

Thanx, Paul

> Hmm. It's not pretty, but neither is the current "let's just take a
> rcu lock that we don't really need over a loop that doesn't have very
> strict bounding".
>
> Comments?
>
> Linus
>

2015-04-09 19:43:55

by Jason Low

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, 2015-04-09 at 11:16 -0700, Linus Torvalds wrote:
> On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > The pointer is a known-safe kernel pointer - it's just that it was
> > "known safe" a few instructions ago, and might be rcu-free'd at any
> > time.
>
> Actually, we could even do something like this:
>
> static inline int sem_owner_on_cpu(struct semaphore *sem, struct
> task_struct *owner)
> {
> int on_cpu;
>
> #ifdef CONFIG_DEBUG_PAGEALLOC
> rcu_read_lock();
> #endif
> on_cpu = sem->owner == owner && owner->on_cpu;
> #ifdef CONFIG_DEBUG_PAGEALLOC
> rcu_read_unlock();
> #endif
> return on_cpu;
> }
>
> because we really don't need to hold the RCU lock over the whole loop,
> we just need to validate that the semaphore owner still matches, and
> if so, check that it's on_cpu.
>
> And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance
> *at*all*. We will have worse performance problems than doing some RCU
> read-locking inside the loop.
>
> And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about
> locking, since at worst we just access stale memory for one iteration.
>
> Hmm. It's not pretty, but neither is the current "let's just take a
> rcu lock that we don't really need over a loop that doesn't have very
> strict bounding".
>
> Comments?

So that looks more similar to how the original code was where the
rcu_read_lock() and rcu_read_unlock() was done inside the owner_running
helper function (though without the CONFIG_DEBUG_PAGEALLOC), before
commit 307bf9803f25 ("sched: Simplify mutex_spin_on_owner()") modified
it to be done outside the loop.

2015-04-09 19:59:01

by Paul E. McKenney

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, Apr 09, 2015 at 12:43:38PM -0700, Jason Low wrote:
> On Thu, 2015-04-09 at 11:16 -0700, Linus Torvalds wrote:
> > On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds
> > <[email protected]> wrote:
> > >
> > > The pointer is a known-safe kernel pointer - it's just that it was
> > > "known safe" a few instructions ago, and might be rcu-free'd at any
> > > time.
> >
> > Actually, we could even do something like this:
> >
> > static inline int sem_owner_on_cpu(struct semaphore *sem, struct
> > task_struct *owner)
> > {
> > int on_cpu;
> >
> > #ifdef CONFIG_DEBUG_PAGEALLOC
> > rcu_read_lock();
> > #endif
> > on_cpu = sem->owner == owner && owner->on_cpu;
> > #ifdef CONFIG_DEBUG_PAGEALLOC
> > rcu_read_unlock();
> > #endif
> > return on_cpu;
> > }
> >
> > because we really don't need to hold the RCU lock over the whole loop,
> > we just need to validate that the semaphore owner still matches, and
> > if so, check that it's on_cpu.
> >
> > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance
> > *at*all*. We will have worse performance problems than doing some RCU
> > read-locking inside the loop.
> >
> > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about
> > locking, since at worst we just access stale memory for one iteration.
> >
> > Hmm. It's not pretty, but neither is the current "let's just take a
> > rcu lock that we don't really need over a loop that doesn't have very
> > strict bounding".
> >
> > Comments?
>
> So that looks more similar to how the original code was where the
> rcu_read_lock() and rcu_read_unlock() was done inside the owner_running
> helper function (though without the CONFIG_DEBUG_PAGEALLOC), before
> commit 307bf9803f25 ("sched: Simplify mutex_spin_on_owner()") modified
> it to be done outside the loop.

Another approach would be to post a timer before entering the spinloop,
and have the timer handler set the resched bit. Then the loop would
be bounded, safe, and would run at full speed.
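
Roughly along these lines (a sketch of the idea only, all the names
below are made up and the timeout would need tuning):

struct spin_bound {
        struct hrtimer          timer;
        struct task_struct      *task;
};

static enum hrtimer_restart spin_bound_fn(struct hrtimer *t)
{
        struct spin_bound *sb = container_of(t, struct spin_bound, timer);

        /* Bound the spin: make the spinner see need_resched(). */
        set_tsk_need_resched(sb->task);
        return HRTIMER_NORESTART;
}

/* around the spin loop: */
struct spin_bound sb = { .task = current, };

hrtimer_init(&sb.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
sb.timer.function = spin_bound_fn;
hrtimer_start(&sb.timer, ns_to_ktime(spin_timeout_ns), HRTIMER_MODE_REL_PINNED);

/* ... existing owner spin loop ... */

hrtimer_cancel(&sb.timer);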

Thanx, Paul

2015-04-09 19:59:48

by Davidlohr Bueso

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, 2015-04-09 at 12:43 -0700, Jason Low wrote:
> So that looks more similar to how the original code was where the
> rcu_read_lock() and rcu_read_unlock() was done inside the owner_running
> helper function (though without the CONFIG_DEBUG_PAGEALLOC), before
> commit 307bf9803f25 ("sched: Simplify mutex_spin_on_owner()") modified
> it to be done outside the loop.

I think this is why Linus was mentioning the CONFIG_PREEMPT_RCU case as
well, so taking and releasing the read lock in those cases in the loop
would actually hurt more.

Thanks,
Davidlohr

2015-04-09 20:36:47

by Jason Low

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, 2015-04-09 at 11:16 -0700, Linus Torvalds wrote:
> On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > The pointer is a known-safe kernel pointer - it's just that it was
> > "known safe" a few instructions ago, and might be rcu-free'd at any
> > time.
>
> Actually, we could even do something like this:
>
> static inline int sem_owner_on_cpu(struct semaphore *sem, struct
> task_struct *owner)
> {
> int on_cpu;
>
> #ifdef CONFIG_DEBUG_PAGEALLOC
> rcu_read_lock();
> #endif
> on_cpu = sem->owner == owner && owner->on_cpu;
> #ifdef CONFIG_DEBUG_PAGEALLOC
> rcu_read_unlock();
> #endif
> return on_cpu;
> }
>
> because we really don't need to hold the RCU lock over the whole loop,
> we just need to validate that the semaphore owner still matches, and
> if so, check that it's on_cpu.
>
> And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance
> *at*all*. We will have worse performance problems than doing some RCU
> read-locking inside the loop.
>
> And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about
> locking, since at worst we just access stale memory for one iteration.
>
> Hmm. It's not pretty, but neither is the current "let's just take a
> rcu lock that we don't really need over a loop that doesn't have very
> strict bounding".

So then something like the following (for rwsem)?

We can also run some tests to see how the worst case "access
stale memory for one iteration" in the heuristic can have an effect on
performance, though that probably wouldn't be much of an issue in
practice.

---
kernel/locking/rwsem-xadd.c | 43 ++++++++++++++++++++++++++++---------------
1 files changed, 28 insertions(+), 15 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 3417d01..870c574 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -295,6 +295,31 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
}
}

+static inline bool owner_running(struct rw_semaphore *sem, struct task_struct *owner)
+{
+ bool ret;
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ rcu_read_lock();
+#endif
+ if (READ_ONCE(sem->owner) == owner) {
+ /*
+ * Ensure we emit the owner->on_cpu dereference
+ * after checking sem->owner still matches owner.
+ */
+ barrier();
+ ret = owner->on_cpu;
+ } else {
+ /* Owner changed. */
+ ret = false;
+ }
+
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ rcu_read_unlock();
+#endif
+ return ret;
+}
+
static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
{
struct task_struct *owner;
@@ -329,25 +354,13 @@ bool rwsem_spin_on_owner(struct rw_semaphore *sem, struct task_struct *owner)
{
long count;

- rcu_read_lock();
- while (sem->owner == owner) {
- /*
- * Ensure we emit the owner->on_cpu, dereference _after_
- * checking sem->owner still matches owner, if that fails,
- * owner might point to free()d memory, if it still matches,
- * the rcu_read_lock() ensures the memory stays valid.
- */
- barrier();
-
- /* abort spinning when need_resched or owner is not running */
- if (!owner->on_cpu || need_resched()) {
- rcu_read_unlock();
+ while (owner_running(sem, owner)) {
+ /* Abort spinning when need resched. */
+ if (need_resched())
return false;
- }

cpu_relax_lowlatency();
}
- rcu_read_unlock();

if (READ_ONCE(sem->owner))
return true; /* new owner, continue spinning */
--
1.7.2.5


2015-04-09 20:58:22

by Jason Low

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, Apr 9, 2015 at 12:58 PM, Paul E. McKenney
<[email protected]> wrote:
> On Thu, Apr 09, 2015 at 12:43:38PM -0700, Jason Low wrote:
>> On Thu, 2015-04-09 at 11:16 -0700, Linus Torvalds wrote:
>> > On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds
>> > <[email protected]> wrote:
>> > >
>> > > The pointer is a known-safe kernel pointer - it's just that it was
>> > > "known safe" a few instructions ago, and might be rcu-free'd at any
>> > > time.
>> >
>> > Actually, we could even do something like this:
>> >
>> > static inline int sem_owner_on_cpu(struct semaphore *sem, struct
>> > task_struct *owner)
>> > {
>> > int on_cpu;
>> >
>> > #ifdef CONFIG_DEBUG_PAGEALLOC
>> > rcu_read_lock();
>> > #endif
>> > on_cpu = sem->owner == owner && owner->on_cpu;
>> > #ifdef CONFIG_DEBUG_PAGEALLOC
>> > rcu_read_unlock();
>> > #endif
>> > return on_cpu;
>> > }
>> >
>> > because we really don't need to hold the RCU lock over the whole loop,
>> > we just need to validate that the semaphore owner still matches, and
>> > if so, check that it's on_cpu.
>> >
>> > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance
>> > *at*all*. We will have worse performance problems than doing some RCU
>> > read-locking inside the loop.
>> >
>> > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about
>> > locking, since at worst we just access stale memory for one iteration.
>> >
>> > Hmm. It's not pretty, but neither is the current "let's just take a
>> > rcu lock that we don't really need over a loop that doesn't have very
>> > strict bounding".
>> >
>> > Comments?
>>
>> So that looks more similar to how the original code was where the
>> rcu_read_lock() and rcu_read_unlock() was done inside the owner_running
>> helper function (though without the CONFIG_DEBUG_PAGEALLOC), before
>> commit 307bf9803f25 ("sched: Simplify mutex_spin_on_owner()") modified
>> it to be done outside the loop.
>
> Another approach would be to post a timer before entering the spinloop,
> and have the timer handler set the resched bit. Then the loop would
> be bounded, safe, and would run at full speed.

Though posting a timer, etc. would also add more overhead, right?

2015-04-09 21:07:22

by Paul E. McKenney

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, Apr 09, 2015 at 01:58:16PM -0700, Jason Low wrote:
> On Thu, Apr 9, 2015 at 12:58 PM, Paul E. McKenney
> <[email protected]> wrote:
> > On Thu, Apr 09, 2015 at 12:43:38PM -0700, Jason Low wrote:
> >> On Thu, 2015-04-09 at 11:16 -0700, Linus Torvalds wrote:
> >> > On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds
> >> > <[email protected]> wrote:
> >> > >
> >> > > The pointer is a known-safe kernel pointer - it's just that it was
> >> > > "known safe" a few instructions ago, and might be rcu-free'd at any
> >> > > time.
> >> >
> >> > Actually, we could even do something like this:
> >> >
> >> > static inline int sem_owner_on_cpu(struct semaphore *sem, struct
> >> > task_struct *owner)
> >> > {
> >> > int on_cpu;
> >> >
> >> > #ifdef CONFIG_DEBUG_PAGEALLOC
> >> > rcu_read_lock();
> >> > #endif
> >> > on_cpu = sem->owner == owner && owner->on_cpu;
> >> > #ifdef CONFIG_DEBUG_PAGEALLOC
> >> > rcu_read_unlock();
> >> > #endif
> >> > return on_cpu;
> >> > }
> >> >
> >> > because we really don't need to hold the RCU lock over the whole loop,
> >> > we just need to validate that the semaphore owner still matches, and
> >> > if so, check that it's on_cpu.
> >> >
> >> > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance
> >> > *at*all*. We will have worse performance problems than doing some RCU
> >> > read-locking inside the loop.
> >> >
> >> > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about
> >> > locking, since at worst we just access stale memory for one iteration.
> >> >
> >> > Hmm. It's not pretty, but neither is the current "let's just take a
> >> > rcu lock that we don't really need over a loop that doesn't have very
> >> > strict bounding".
> >> >
> >> > Comments?
> >>
> >> So that looks more similar to how the original code was where the
> >> rcu_read_lock() and rcu_read_unlock() was done inside the owner_running
> >> helper function (though without the CONFIG_DEBUG_PAGEALLOC), before
> >> commit 307bf9803f25 ("sched: Simplify mutex_spin_on_owner()") modified
> >> it to be done outside the loop.
> >
> > Another approach would be to post a timer before entering the spinloop,
> > and have the timer handler set the resched bit. Then the loop would
> > be bounded, safe, and would run at full speed.
>
> Though posting a timer, etc. would also add more overhead, right?

It would. Not on each pass through the loop, though. But yes, might
be too much overhead.

Thanx, Paul

2015-04-10 02:43:51

by Andev

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()

On Thu, Apr 9, 2015 at 4:36 PM, Jason Low <[email protected]> wrote:
> On Thu, 2015-04-09 at 11:16 -0700, Linus Torvalds wrote:
>> On Thu, Apr 9, 2015 at 11:08 AM, Linus Torvalds
>> <[email protected]> wrote:
>> >
>> > The pointer is a known-safe kernel pointer - it's just that it was
>> > "known safe" a few instructions ago, and might be rcu-free'd at any
>> > time.
>>
>> Actually, we could even do something like this:
>>
>> static inline int sem_owner_on_cpu(struct semaphore *sem, struct
>> task_struct *owner)
>> {
>> int on_cpu;
>>
>> #ifdef CONFIG_DEBUG_PAGEALLOC
>> rcu_read_lock();
>> #endif
>> on_cpu = sem->owner == owner && owner->on_cpu;
>> #ifdef CONFIG_DEBUG_PAGEALLOC
>> rcu_read_unlock();
>> #endif
>> return on_cpu;
>> }
>>
>> because we really don't need to hold the RCU lock over the whole loop,
>> we just need to validate that the semaphore owner still matches, and
>> if so, check that it's on_cpu.
>>
>> And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about performance
>> *at*all*. We will have worse performance problems than doing some RCU
>> read-locking inside the loop.
>>
>> And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care about
>> locking, since at worst we just access stale memory for one iteration.
>>
>> Hmm. It's not pretty, but neither is the current "let's just take a
>> rcu lock that we don't really need over a loop that doesn't have very
>> strict bounding".
>
> So then something like the following (for rwsem)?
>
> We can also run some tests to see how the worst case "access
> stale memory for one iteration" in the heuristic can have an effect on
> performance, though that probably wouldn't be much of an issue in
> practice.
>
> ---
> kernel/locking/rwsem-xadd.c | 43 ++++++++++++++++++++++++++++---------------
> 1 files changed, 28 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
> index 3417d01..870c574 100644
> --- a/kernel/locking/rwsem-xadd.c
> +++ b/kernel/locking/rwsem-xadd.c
> @@ -295,6 +295,31 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
> }
> }
>
> +static inline bool owner_running(struct rw_semaphore *sem, struct task_struct *owner)
> +{
> + bool ret;
> +
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> + rcu_read_lock();
> +#endif

Please use 'if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {}' here. It makes
the code much more readable IMHO.
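
I.e. the helper above rewritten that way (same behavior, the condition
folds to a compile-time constant):

static inline bool owner_running(struct rw_semaphore *sem,
                                 struct task_struct *owner)
{
        bool ret;

        if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
                rcu_read_lock();

        if (READ_ONCE(sem->owner) == owner) {
                /*
                 * Ensure we emit the owner->on_cpu dereference
                 * after checking sem->owner still matches owner.
                 */
                barrier();
                ret = owner->on_cpu;
        } else {
                /* Owner changed. */
                ret = false;
        }

        if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
                rcu_read_unlock();

        return ret;
}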

--
Andev

2015-04-10 09:01:02

by Ingo Molnar

Subject: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock


* Paul E. McKenney <[email protected]> wrote:

> > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about
> > performance *at*all*. We will have worse performance problems than
> > doing some RCU read-locking inside the loop.
> >
> > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care
> > about locking, since at worst we just access stale memory for one
> > iteration.
>
> But if we are running on a hypervisor, mightn't our VCPU be
> preempted just before accessing ->on_cpu, the task exit and its
> structures be freed and unmapped? Or is the task structure in
> memory that is never unmapped? (If the latter, clearly not a
> problem.)

kmalloc()able kernel memory is never unmapped in that fashion [*].
Even hotplug memory is based on limiting what gets allocated in that
area and never putting critical kernel data structures there.

Personally I'd be more comfortable with having a special primitive for
this that is DEBUG_PAGEALLOC aware (Linus's first suggestion), so that
we don't use different RCU primitives in the rare case someone tests
CONFIG_DEBUG_PAGEALLOC=y ...

We even have such a primitive: __copy_from_user_inatomic(). It
compiles to a single instruction for integer types on x86. I've
attached a patch that implements it for the regular mutexes (xadd can
be done too), and it all compiles to a rather sweet, compact routine:

0000000000000030 <mutex_spin_on_owner.isra.4>:
30: 48 3b 37 cmp (%rdi),%rsi
33: 48 8d 4e 28 lea 0x28(%rsi),%rcx
37: 75 4e jne 87 <mutex_spin_on_owner.isra.4+0x57>
39: 55 push %rbp
3a: 45 31 c0 xor %r8d,%r8d
3d: 65 4c 8b 0c 25 00 00 mov %gs:0x0,%r9
44: 00 00
46: 48 89 e5 mov %rsp,%rbp
49: 48 83 ec 10 sub $0x10,%rsp
4d: eb 08 jmp 57 <mutex_spin_on_owner.isra.4+0x27>
4f: 90 nop
50: f3 90 pause
52: 48 3b 37 cmp (%rdi),%rsi
55: 75 29 jne 80 <mutex_spin_on_owner.isra.4+0x50>
57: 44 89 c0 mov %r8d,%eax
5a: 90 nop
5b: 90 nop
5c: 90 nop
5d: 8b 11 mov (%rcx),%edx
5f: 90 nop
60: 90 nop
61: 90 nop
62: 85 d2 test %edx,%edx
64: 89 55 fc mov %edx,-0x4(%rbp)
67: 74 0b je 74 <mutex_spin_on_owner.isra.4+0x44>
69: 49 8b 81 10 c0 ff ff mov -0x3ff0(%r9),%rax
70: a8 08 test $0x8,%al
72: 74 dc je 50 <mutex_spin_on_owner.isra.4+0x20>
74: 31 c0 xor %eax,%eax
76: c9 leaveq
77: c3 retq
78: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
7f: 00
80: b8 01 00 00 00 mov $0x1,%eax
85: c9 leaveq
86: c3 retq
87: b8 01 00 00 00 mov $0x1,%eax
8c: c3 retq
8d: 0f 1f 00 nopl (%rax)

No RCU overhead, and this is the access to owner->on_cpu:

69: 49 8b 81 10 c0 ff ff mov -0x3ff0(%r9),%rax

Totally untested and all that, I only built the mutex.o.

What do you think? Am I missing anything?

Thanks,

Ingo

[*] with the exception of CONFIG_DEBUG_PAGEALLOC and other debug
mechanisms like CONFIG_KMEMCHECK (which is on the way out) that
are based on provoking page faults and fixing up page tables to
catch unexpected memory accesses.

=================================>
>From ef3e5e763747d47a43a32f846ee94706089222cf Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Fri, 10 Apr 2015 10:49:11 +0200
Subject: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock

Linus suggested getting rid of the held RCU read-lock in
mutex_spin_on_owner(). The only real complication is that the 'owner'
task might be freed from under us and we might dereference into
possibly freed kernel memory.

As long as the kernel pointer itself is valid this approach is fine in
this case (see the analysis below) - with the exception of
CONFIG_DEBUG_PAGEALLOC=y and similarly instrumented kernels which
might fault on referencing freed kernel memory.

Use the non-faulting copy-from-user primitive to get the owner->on_cpu
value that we use in NMI handlers and which works even on
CONFIG_DEBUG_PAGEALLOC=y instrumented kernels. This compiles to a
single instruction on most platforms.

This approach might briefly copy in junk data from an already freed
(previous) owner task, which might trigger two scenarios:

1) The junk data causes us to loop once more. This is not
a problem as we'll check the owner on the next loop and
break out of the loop.

2) If the junk value causes us to break out of the loop
that's fine too: it's what we'd have done anyway on
the next iteration, as the lock owner changed.

The inatomic context copy primitives are compiler barriers
too - this matters to make sure the above owner check is
emitted before the copy attempt.

We also ignore failed copies, as the next iteration will clean
up after us. This saves an extra branch in the common case.

Suggested-by: Linus Torvalds <[email protected]>
Not-Yet-Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/mutex.c | 40 +++++++++++++++++++++++++++-------------
1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 4cccea6b8934..fcc7db45d62e 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -224,28 +224,42 @@ ww_mutex_set_context_slowpath(struct ww_mutex *lock,
static noinline
bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
{
- bool ret = true;
+ int on_cpu;
+ int ret;

- rcu_read_lock();
while (lock->owner == owner) {
/*
- * Ensure we emit the owner->on_cpu, dereference _after_
- * checking lock->owner still matches owner. If that fails,
- * owner might point to freed memory. If it still matches,
- * the rcu_read_lock() ensures the memory stays valid.
+ * Use the non-faulting copy-user primitive to get the owner->on_cpu
+ * value that works even on CONFIG_DEBUG_PAGEALLOC=y instrumented
+ * kernels. This compiles to a single instruction on most platforms.
+ *
+ * This might briefly copy in junk data from an already freed
+ * (previous) owner task, which might trigger two scenarios:
+ *
+ * 1) The junk data causes us to loop once more. This is not
+ * a problem as we'll check the owner on the next loop and
+ * break out of the loop.
+ *
+ * 2) If the junk value causes us to break out of the loop
+ * that's fine too: it's what we'd have done anyway on
+ * the next iteration, as the lock owner changed.
+ *
+ * NOTE: the inatomic context copy primitives are compiler barriers
+ * too - this matters to make sure the above owner check is
+ * emitted before the copy attempt.
+ *
+ * NOTE2: We ignore failed copies, as the next iteration will clean
+ * up after us. This saves an extra branch in the common case.
*/
- barrier();
+ ret = __copy_from_user_inatomic(&on_cpu, &owner->on_cpu, sizeof(on_cpu));

- if (!owner->on_cpu || need_resched()) {
- ret = false;
- break;
- }
+ if (!on_cpu || need_resched())
+ return false;

cpu_relax_lowlatency();
}
- rcu_read_unlock();

- return ret;
+ return true;
}

/*

2015-04-10 09:04:48

by Ingo Molnar

Subject: Re: [PATCH 2/2] locking/rwsem: Use a return variable in rwsem_spin_on_owner()


* Jason Low <[email protected]> wrote:

> +static inline bool owner_running(struct rw_semaphore *sem, struct task_struct *owner)
> +{
> + bool ret;
> +
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> + rcu_read_lock();
> +#endif
> + if (READ_ONCE(sem->owner) == owner) {
> + /*
> + * Ensure we emit the owner->on_cpu dereference
> + * after checking sem->owner still matches owner.
> + */
> + barrier();
> + ret = owner->on_cpu;
> + } else {
> + /* Owner changed. */
> + ret = false;
> + }
> +
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> + rcu_read_unlock();
> +#endif
> + return ret;
> +}

So I really don't like this due to the asymmetric RCU pattern. (Also,
the fact that we might read from already freed kernel memory here
needs big honking comments.)

But, I think we can do this differently, see the patch I just sent:

[PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock

that should I think work just as well, without having to introduce
owner_running() as the whole mutex_spin_on_owner() logic is kept
pretty simple.

Thanks,

Ingo

2015-04-10 09:13:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock


* Ingo Molnar <[email protected]> wrote:

> 0000000000000030 <mutex_spin_on_owner.isra.4>:
> 30: 48 3b 37 cmp (%rdi),%rsi
> 33: 48 8d 4e 28 lea 0x28(%rsi),%rcx
> 37: 75 4e jne 87 <mutex_spin_on_owner.isra.4+0x57>
> 39: 55 push %rbp
> 3a: 45 31 c0 xor %r8d,%r8d
> 3d: 65 4c 8b 0c 25 00 00 mov %gs:0x0,%r9
> 44: 00 00
> 46: 48 89 e5 mov %rsp,%rbp
> 49: 48 83 ec 10 sub $0x10,%rsp
> 4d: eb 08 jmp 57 <mutex_spin_on_owner.isra.4+0x27>
> 4f: 90 nop
> 50: f3 90 pause
> 52: 48 3b 37 cmp (%rdi),%rsi
> 55: 75 29 jne 80 <mutex_spin_on_owner.isra.4+0x50>
> 57: 44 89 c0 mov %r8d,%eax
> 5a: 90 nop
> 5b: 90 nop
> 5c: 90 nop
> 5d: 8b 11 mov (%rcx),%edx
> 5f: 90 nop
> 60: 90 nop
> 61: 90 nop

Yeah, so what I missed here are those nops: placeholders for the
STAC/CLAC instructions on x86... and this is what Linus mentioned
about the clac() overhead.

But this could be solved I think: by adding a
copy_from_kernel_inatomic() primitive which simply leaves out the
STAC/CLAC sequence: as these are always guaranteed to be kernel
addresses, the SMAP fault should not be generated.

Thanks,

Ingo

2015-04-10 09:22:01

by Ingo Molnar

Subject: [PATCH] uaccess: Add __copy_from_kernel_inatomic() primitive


* Ingo Molnar <[email protected]> wrote:

> Yeah, so what I missed here are those nops: placeholders for the
> STAC/CLAC instructions on x86... and this is what Linus mentioned
> about the clac() overhead.
>
> But this could be solved I think: by adding a
> copy_from_kernel_inatomic() primitive which simply leaves out the
> STAC/CLAC sequence: as these are always guaranteed to be kernel
> addresses, the SMAP fault should not be generated.

So the first step would be to introduce a generic
__copy_from_kernel_inatomic() primitive as attached below.

The next patch will implement efficient __copy_from_kernel_inatomic()
for x86.

Thanks,

Ingo

==================================>
>From 89b2ac882933947513c0aabd38e6b6c5a203c337 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Fri, 10 Apr 2015 11:19:23 +0200
Subject: [PATCH] uaccess: Add __copy_from_kernel_inatomic() primitive

Most architectures can just reuse __copy_from_user_inatomic() to
copy possibly-faulting data from known-valid kernel addresses.

Not-Yet-Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/uaccess.h | 12 ++++++++++++
kernel/locking/mutex.c | 3 ++-
2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index ecd3319dac33..885eea43b69f 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -107,4 +107,16 @@ extern long __probe_kernel_read(void *dst, const void *src, size_t size);
extern long notrace probe_kernel_write(void *dst, const void *src, size_t size);
extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size);

+/*
+ * Generic wrapper, most architectures can just use __copy_from_user_inatomic()
+ * to implement __copy_from_kernel_inatomic():
+ */
+#ifndef ARCH_HAS_COPY_FROM_KERNEL_INATOMIC
+static __must_check __always_inline int
+__copy_from_kernel_inatomic(void *dst, const void __user *src, unsigned size)
+{
+ return __copy_from_user_inatomic(dst, src, size);
+}
+#endif
+
#endif /* __LINUX_UACCESS_H__ */
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index fcc7db45d62e..a4f74cda9fc4 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -26,6 +26,7 @@
#include <linux/interrupt.h>
#include <linux/debug_locks.h>
#include <linux/osq_lock.h>
+#include <linux/uaccess.h>

/*
* In the DEBUG case we are using the "NULL fastpath" for mutexes,
@@ -251,7 +252,7 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
* NOTE2: We ignore failed copies, as the next iteration will clean
* up after us. This saves an extra branch in the common case.
*/
- ret = __copy_from_user_inatomic(&on_cpu, &owner->on_cpu, sizeof(on_cpu));
+ ret = __copy_from_kernel_inatomic(&on_cpu, &owner->on_cpu, sizeof(on_cpu));

if (!on_cpu || need_resched())
return false;

2015-04-10 11:14:35

by Ingo Molnar

Subject: [PATCH] x86/uaccess: Implement get_kernel()


* Ingo Molnar <[email protected]> wrote:

> * Ingo Molnar <[email protected]> wrote:
>
> > Yeah, so what I missed here are those nops: placeholders for the
> > STAC/CLAC instructions on x86... and this is what Linus mentioned
> > about the clac() overhead.
> >
> > But this could be solved I think: by adding a
> > copy_from_kernel_inatomic() primitive which simply leaves out the
> > STAC/CLAC sequence: as these are always guaranteed to be kernel
> > addresses, the SMAP fault should not be generated.
>
> So the first step would be to introduce a generic
> __copy_from_kernel_inatomic() primitive as attached below.
>
> The next patch will implement efficient __copy_from_kernel_inatomic()
> for x86.


The patch below does that. Note, for simplicity I've changed the
interface to 'get_kernel()' (will propagate this through the other
patches as well).

Minimally boot tested; owner-spinning looks sane now:

0000000000000030 <mutex_spin_on_owner.isra.4>:
30: 48 3b 37 cmp (%rdi),%rsi
33: 55 push %rbp
34: 48 89 e5 mov %rsp,%rbp
37: 75 4f jne 88 <mutex_spin_on_owner.isra.4+0x58>
39: 48 8d 56 28 lea 0x28(%rsi),%rdx
3d: 8b 46 28 mov 0x28(%rsi),%eax
40: 85 c0 test %eax,%eax
42: 74 3c je 80 <mutex_spin_on_owner.isra.4+0x50>
44: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
4b: 00 00
4d: 48 8b 80 10 c0 ff ff mov -0x3ff0(%rax),%rax
54: a8 08 test $0x8,%al
56: 75 28 jne 80 <mutex_spin_on_owner.isra.4+0x50>
58: 65 48 8b 0c 25 00 00 mov %gs:0x0,%rcx
5f: 00 00
61: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
68: f3 90 pause
6a: 48 3b 37 cmp (%rdi),%rsi
6d: 75 19 jne 88 <mutex_spin_on_owner.isra.4+0x58>
6f: 8b 02 mov (%rdx),%eax
71: 85 c0 test %eax,%eax
73: 74 0b je 80 <mutex_spin_on_owner.isra.4+0x50>
75: 48 8b 81 10 c0 ff ff mov -0x3ff0(%rcx),%rax
7c: a8 08 test $0x8,%al
7e: 74 e8 je 68 <mutex_spin_on_owner.isra.4+0x38>
80: 31 c0 xor %eax,%eax
82: 5d pop %rbp
83: c3 retq
84: 0f 1f 40 00 nopl 0x0(%rax)
88: b8 01 00 00 00 mov $0x1,%eax
8d: 5d pop %rbp
8e: c3 retq
8f: 90 nop

although the double unrolled need_resched() check looks silly:

4d: 48 8b 80 10 c0 ff ff mov -0x3ff0(%rax),%rax
54: a8 08 test $0x8,%al

75: 48 8b 81 10 c0 ff ff mov -0x3ff0(%rcx),%rax
7c: a8 08 test $0x8,%al

Thanks,

Ingo

=============================>
>From 7f5917c7f9ee3598b353adbd3d0849ecf4485748 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Fri, 10 Apr 2015 13:01:39 +0200
Subject: [PATCH] x86/uaccess: Implement get_kernel()

Implement an optimized get_kernel() primitive on x86: avoid the
STAC/CLAC overhead. Only offer a 64-bit variant for now, 32-bit
will fall back to get_user().

Not-Yet-Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/uaccess_64.h | 32 ++++++++++++++++++++++++++++++++
include/linux/uaccess.h | 8 ++------
kernel/locking/mutex.c | 3 +--
3 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index f2f9b39b274a..3ef75f0addac 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -207,6 +207,38 @@ __copy_from_user_inatomic(void *dst, const void __user *src, unsigned size)
return __copy_from_user_nocheck(dst, src, size);
}

+
+#define __get_kernel_asm_ex(dst, src, int_type, reg_type, lh_type) \
+ asm volatile("1: mov"int_type" %1,%"reg_type"0\n" \
+ "2:\n" \
+ _ASM_EXTABLE_EX(1b, 2b) \
+ : lh_type(dst) : "m" (__m(src)))
+
+extern void __get_kernel_BUILD_ERROR(void);
+
+/*
+ * Simple copy-from-possibly-faulting-kernel-addresses method that
+ * avoids the STAC/CLAC SMAP overhead.
+ *
+ * NOTE: this does not propagate the error code of faulting kernel
+ * addresses properly. You can recover it via uaccess_catch()
+ * if you really need to.
+ */
+#define get_kernel(dst, src) \
+do { \
+ typeof(*(src)) __val; \
+ \
+ switch (sizeof(__val)) { \
+ case 1: __get_kernel_asm_ex(__val, src, "b", "b", "=q"); break; \
+ case 2: __get_kernel_asm_ex(__val, src, "w", "w", "=r"); break; \
+ case 4: __get_kernel_asm_ex(__val, src, "l", "k", "=r"); break; \
+ case 8: __get_kernel_asm_ex(__val, src, "q", " ", "=r"); break; \
+ default: __get_kernel_BUILD_ERROR(); \
+ } \
+ (dst) = __val; \
+} while (0)
+#define ARCH_HAS_GET_KERNEL
+
static __must_check __always_inline int
__copy_to_user_inatomic(void __user *dst, const void *src, unsigned size)
{
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index 885eea43b69f..03a025b98d2e 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -111,12 +111,8 @@ extern long notrace __probe_kernel_write(void *dst, const void *src, size_t size
* Generic wrapper, most architectures can just use __copy_from_user_inatomic()
* to implement __copy_from_kernel_inatomic():
*/
-#ifndef ARCH_HAS_COPY_FROM_KERNEL_INATOMIC
-static __must_check __always_inline int
-__copy_from_kernel_inatomic(void *dst, const void __user *src, unsigned size)
-{
- return __copy_from_user_inatomic(dst, src, size);
-}
+#ifndef ARCH_HAS_GET_KERNEL
+# define get_kernel(dst, src) get_user(dst, src)
#endif

#endif /* __LINUX_UACCESS_H__ */
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a4f74cda9fc4..68c750f4e8e8 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -226,7 +226,6 @@ static noinline
bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
{
int on_cpu;
- int ret;

while (lock->owner == owner) {
/*
@@ -252,7 +251,7 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
* NOTE2: We ignore failed copies, as the next iteration will clean
* up after us. This saves an extra branch in the common case.
*/
- ret = __copy_from_kernel_inatomic(&on_cpu, &owner->on_cpu, sizeof(on_cpu));
+ get_kernel(on_cpu, &owner->on_cpu);

if (!on_cpu || need_resched())
return false;

2015-04-10 11:27:54

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] mutex: Improve mutex_spin_on_owner() code generation


* Ingo Molnar <[email protected]> wrote:

> although the double unrolled need_resched() check looks silly:
>
> 4d: 48 8b 80 10 c0 ff ff mov -0x3ff0(%rax),%rax
> 54: a8 08 test $0x8,%al
>
> 75: 48 8b 81 10 c0 ff ff mov -0x3ff0(%rcx),%rax
> 7c: a8 08 test $0x8,%al

The patch below fixes that and shaves 40 bytes off
mutex_spin_on_owner()'s code size.

Thanks,

Ingo

===================================>
From 065e46b7398e38f2e4be98c453e797ee511170e2 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Fri, 10 Apr 2015 13:21:24 +0200
Subject: [PATCH] mutex: Improve mutex_spin_on_owner() code generation

GCC somewhat stupidly decides that the loop within mutex_spin_on_owner()
needs unrolling:

0000000000000030 <mutex_spin_on_owner.isra.4>:
30: 48 3b 37 cmp (%rdi),%rsi
33: 55 push %rbp
34: 48 89 e5 mov %rsp,%rbp
37: 75 4f jne 88 <mutex_spin_on_owner.isra.4+0x58>
39: 48 8d 56 28 lea 0x28(%rsi),%rdx
3d: 8b 46 28 mov 0x28(%rsi),%eax
40: 85 c0 test %eax,%eax
42: 74 3c je 80 <mutex_spin_on_owner.isra.4+0x50>
44: 65 48 8b 04 25 00 00 mov %gs:0x0,%rax
4b: 00 00
4d: 48 8b 80 10 c0 ff ff mov -0x3ff0(%rax),%rax
54: a8 08 test $0x8,%al
56: 75 28 jne 80 <mutex_spin_on_owner.isra.4+0x50>
58: 65 48 8b 0c 25 00 00 mov %gs:0x0,%rcx
5f: 00 00
61: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
68: f3 90 pause
6a: 48 3b 37 cmp (%rdi),%rsi
6d: 75 19 jne 88 <mutex_spin_on_owner.isra.4+0x58>
6f: 8b 02 mov (%rdx),%eax
71: 85 c0 test %eax,%eax
73: 74 0b je 80 <mutex_spin_on_owner.isra.4+0x50>
75: 48 8b 81 10 c0 ff ff mov -0x3ff0(%rcx),%rax
7c: a8 08 test $0x8,%al
7e: 74 e8 je 68 <mutex_spin_on_owner.isra.4+0x38>
80: 31 c0 xor %eax,%eax
82: 5d pop %rbp
83: c3 retq
84: 0f 1f 40 00 nopl 0x0(%rax)
88: b8 01 00 00 00 mov $0x1,%eax
8d: 5d pop %rbp
8e: c3 retq

The need_resched() check is duplicated:

4d: 48 8b 80 10 c0 ff ff mov -0x3ff0(%rax),%rax
54: a8 08 test $0x8,%al
56: 75 28 jne 80 <mutex_spin_on_owner.isra.4+0x50>

75: 48 8b 81 10 c0 ff ff mov -0x3ff0(%rcx),%rax
7c: a8 08 test $0x8,%al
7e: 74 e8 je 68 <mutex_spin_on_owner.isra.4+0x38>

So restructure the loop a bit, to get much tighter code:

0000000000000030 <mutex_spin_on_owner.isra.5>:
30: 55 push %rbp
31: 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx
38: 00 00
3a: 48 89 e5 mov %rsp,%rbp
3d: 48 39 37 cmp %rsi,(%rdi)
40: 75 1e jne 60 <mutex_spin_on_owner.isra.5+0x30>
42: 8b 46 28 mov 0x28(%rsi),%eax
45: 85 c0 test %eax,%eax
47: 74 0d je 56 <mutex_spin_on_owner.isra.5+0x26>
49: f3 90 pause
4b: 48 8b 82 10 c0 ff ff mov -0x3ff0(%rdx),%rax
52: a8 08 test $0x8,%al
54: 74 e7 je 3d <mutex_spin_on_owner.isra.5+0xd>
56: 31 c0 xor %eax,%eax
58: 5d pop %rbp
59: c3 retq
5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
60: b8 01 00 00 00 mov $0x1,%eax
65: 5d pop %rbp
66: c3 retq

What changed relative to the previous loop is that cpu_relax()
is done before the need_resched() check. This in fact makes sense:
only after waiting a bit (or a lot, on virtualized platforms) should
we expect 'need_resched' to have changed.

Not-Yet-Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/mutex.c | 18 +++++++++++-------
1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index 68c750f4e8e8..45d2445e457a 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -225,9 +225,12 @@ ww_mutex_set_context_slowpath(struct ww_mutex *lock,
static noinline
bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
{
- int on_cpu;
+ int owner_on_cpu;
+
+ for (;;) {
+ if (unlikely(lock->owner != owner))
+ return true;

- while (lock->owner == owner) {
/*
* Use the non-faulting copy-user primitive to get the owner->on_cpu
* value that works even on CONFIG_DEBUG_PAGEALLOC=y instrumented
@@ -251,15 +254,16 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
* NOTE2: We ignore failed copies, as the next iteration will clean
* up after us. This saves an extra branch in the common case.
*/
- get_kernel(on_cpu, &owner->on_cpu);
-
- if (!on_cpu || need_resched())
+ get_kernel(owner_on_cpu, &owner->on_cpu);
+ if (unlikely(!owner_on_cpu))
return false;

cpu_relax_lowlatency();
- }

- return true;
+ /* Stop spinning if we are to be preempted: */
+ if (need_resched())
+ return false;
+ }
}

/*

2015-04-10 11:34:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] x86/uaccess: Implement get_kernel()

On Fri, Apr 10, 2015 at 01:14:27PM +0200, Ingo Molnar wrote:
> +/*
> + * Simple copy-from-possibly-faulting-kernel-addresses method that
> + * avoids the STAC/CLAC SMAP overhead.
> + *
> + * NOTE: this does not propagate the error code of faulting kernel
> + * addresses properly. You can recover it via uaccess_catch()
> + * if you really need to.
> + */
> +#define get_kernel(dst, src) \
> +do { \
> + typeof(*(src)) __val; \

Should we make that:

typeof(*(src)) __val = (dst);

> + \
> + switch (sizeof(__val)) { \
> + case 1: __get_kernel_asm_ex(__val, src, "b", "b", "=q"); break; \
> + case 2: __get_kernel_asm_ex(__val, src, "w", "w", "=r"); break; \
> + case 4: __get_kernel_asm_ex(__val, src, "l", "k", "=r"); break; \
> + case 8: __get_kernel_asm_ex(__val, src, "q", " ", "=r"); break; \
> + default: __get_kernel_BUILD_ERROR(); \
> + } \
> + (dst) = __val; \
> +} while (0)

Such that when we fault, the value is unmodified? The way it is we'll
assign whatever was on stack for __val, which seems undesirable, no?
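
For reference, the quoted macro with just that initializer added would
look like this - a sketch only, not a tested patch; everything except
the initializer is taken unchanged from the patch being discussed:

#define get_kernel(dst, src)						\
do {									\
	typeof(*(src)) __val = (dst);	/* start from dst's old value */ \
									\
	switch (sizeof(__val)) {					\
	case 1: __get_kernel_asm_ex(__val, src, "b", "b", "=q"); break;	\
	case 2: __get_kernel_asm_ex(__val, src, "w", "w", "=r"); break;	\
	case 4: __get_kernel_asm_ex(__val, src, "l", "k", "=r"); break;	\
	case 8: __get_kernel_asm_ex(__val, src, "q", " ", "=r"); break;	\
	default: __get_kernel_BUILD_ERROR();				\
	}								\
	(dst) = __val;							\
} while (0)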

2015-04-10 12:09:14

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] x86: Align jump targets to 1 byte boundaries


* Ingo Molnar <[email protected]> wrote:

> So restructure the loop a bit, to get much tighter code:
>
> 0000000000000030 <mutex_spin_on_owner.isra.5>:
> 30: 55 push %rbp
> 31: 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx
> 38: 00 00
> 3a: 48 89 e5 mov %rsp,%rbp
> 3d: 48 39 37 cmp %rsi,(%rdi)
> 40: 75 1e jne 60 <mutex_spin_on_owner.isra.5+0x30>
> 42: 8b 46 28 mov 0x28(%rsi),%eax
> 45: 85 c0 test %eax,%eax
> 47: 74 0d je 56 <mutex_spin_on_owner.isra.5+0x26>
> 49: f3 90 pause
> 4b: 48 8b 82 10 c0 ff ff mov -0x3ff0(%rdx),%rax
> 52: a8 08 test $0x8,%al
> 54: 74 e7 je 3d <mutex_spin_on_owner.isra.5+0xd>
> 56: 31 c0 xor %eax,%eax
> 58: 5d pop %rbp
> 59: c3 retq
> 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
> 60: b8 01 00 00 00 mov $0x1,%eax
> 65: 5d pop %rbp
> 66: c3 retq

Btw., totally off topic, the following NOP caught my attention:

> 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)

That's a dead NOP that bloats the function a bit, added for the 16 byte
alignment of one of the jump targets.

I realize that x86 CPU manufacturers recommend 16-byte jump target
alignments (it's in the Intel optimization manual), but the cost of
that is very significant:

text data bss dec filename
12566391 1617840 1089536 15273767 vmlinux.align.16-byte
12224951 1617840 1089536 14932327 vmlinux.align.1-byte

By using 1 byte jump target alignment (i.e. no alignment at all) we
get an almost 3% reduction in kernel size (!) - and a probably similar
reduction in I$ footprint.

So I'm wondering, is the 16 byte jump target optimization suggestion
really worth this price? The patch below boots fine and I've not
measured any noticeable slowdown, but I've not tried hard.

Now, the usual justification for jump target alignment is the
following: with 16 byte instruction-cache cacheline sizes, if a
forward jump is aligned to cacheline boundary then prefetches will
start from a new cacheline.

But I think that argument is flawed for typical optimized kernel code
flows: forward jumps often go to 'cold' (uncommon) pieces of code, and
aligning cold code to cache lines does not bring a lot of advantages
(they are uncommon), while it causes collateral damage:

- their alignment 'spreads out' the cache footprint, it shifts
followup hot code further out

- plus it slows down even 'cold' code that immediately follows 'hot'
code (like in the above case), which could have benefited from the
partial cacheline that comes off the end of hot code.

What do you guys think about this? I think we should seriously
consider relaxing our alignment defaults.

Thanks,

Ingo

==================================>
From 5b83a095e1abdfee5c710c34a5785232ce74f939 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Fri, 10 Apr 2015 13:50:05 +0200
Subject: [PATCH] x86: Align jump targets to 1 byte boundaries

Not-Yet-Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Makefile | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5ba2d9ce82dc..0366d6b44a14 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -77,6 +77,9 @@ else
KBUILD_AFLAGS += -m64
KBUILD_CFLAGS += -m64

+ # Align jump targets to 1 byte, not the default 16 bytes:
+ KBUILD_CFLAGS += -falign-jumps=1
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

2015-04-10 12:18:15

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] x86: Pack function addresses tightly as well


* Ingo Molnar <[email protected]> wrote:

> I realize that x86 CPU manufacturers recommend 16-byte jump target
> alignments (it's in the Intel optimization manual), but the cost of
> that is very significant:
>
> text data bss dec filename
> 12566391 1617840 1089536 15273767 vmlinux.align.16-byte
> 12224951 1617840 1089536 14932327 vmlinux.align.1-byte
>
> By using 1 byte jump target alignment (i.e. no alignment at all) we
> get an almost 3% reduction in kernel size (!) - and a probably
> similar reduction in I$ footprint.

Likewise we could pack functions tightly as well via the patch below:

text data bss dec filename
12566391 1617840 1089536 15273767 vmlinux.align.16-byte
12224951 1617840 1089536 14932327 vmlinux.align.1-byte
11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte

Which brings another 2% reduction in the kernel's code size.

It would be interesting to see some benchmarks with these two patches
applied. Only lightly tested.

Thanks,

Ingo

============================>
From 773dbbfbf37df3520dd98e91972fbfdef5fe91ad Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Fri, 10 Apr 2015 14:14:29 +0200
Subject: [PATCH] x86: Pack function addresses tightly as well

Not-Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Makefile | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 0366d6b44a14..573d0c459f99 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -77,9 +77,12 @@ else
KBUILD_AFLAGS += -m64
KBUILD_CFLAGS += -m64

- # Align jump targets to 1 byte, not the default 16 bytes:
+ # Pack jump targets tightly, don't align them to the default 16 bytes:
KBUILD_CFLAGS += -falign-jumps=1

+ # Pack functions tightly as well:
+ KBUILD_CFLAGS += -falign-functions=1
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

2015-04-10 12:30:45

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] x86: Pack loops tightly as well


* Ingo Molnar <[email protected]> wrote:

> > I realize that x86 CPU manufacturers recommend 16-byte jump target
> > alignments (it's in the Intel optimization manual), but the cost
> > of that is very significant:
> >
> > text data bss dec filename
> > 12566391 1617840 1089536 15273767 vmlinux.align.16-byte
> > 12224951 1617840 1089536 14932327 vmlinux.align.1-byte
> >
> > By using 1 byte jump target alignment (i.e. no alignment at all)
> > we get an almost 3% reduction in kernel size (!) - and a probably
> > similar reduction in I$ footprint.
>
> Likewise we could pack functions tightly as well via the patch
> below:
>
> text data bss dec filename
> 12566391 1617840 1089536 15273767 vmlinux.align.16-byte
> 12224951 1617840 1089536 14932327 vmlinux.align.1-byte
> 11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte
>
> Which brings another 2% reduction in the kernel's code size.
>
> It would be interesting to see some benchmarks with these two
> patches applied. Only lightly tested.

And the final patch below also packs loops tightly:

text data bss dec filename
12566391 1617840 1089536 15273767 vmlinux.align.16-byte
12224951 1617840 1089536 14932327 vmlinux.align.1-byte
11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte
11903735 1617840 1089536 14611111 vmlinux.align.1-byte.funcs-1-byte.loops-1-byte

The total reduction is 5.5%.

Now loop alignment is beneficial if:

- a loop is cache-hot and its surroundings are not.

Loop alignment is harmful if:

- a loop is cache-cold
- a loop's surroundings are cache-hot as well
- two cache-hot loops are close to each other

and I'd argue that the latter three harmful scenarios are much more
common in the kernel. Similar arguments can be made for function
alignment as well. (Jump target alignment is a bit different but I
think the same conclusion holds.)

(I might have missed some CPU microarchitectural details though that
would make such packing undesirable.)

Thanks,

Ingo

=============================>
From cfc2ca24908cce66b9df1f711225d461f5d59b97 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <[email protected]>
Date: Fri, 10 Apr 2015 14:20:30 +0200
Subject: [PATCH] x86: Pack loops tightly as well

Not-Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Makefile | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 573d0c459f99..10989a73b986 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -83,6 +83,9 @@ else
# Pack functions tightly as well:
KBUILD_CFLAGS += -falign-functions=1

+ # Pack loops tightly as well:
+ KBUILD_CFLAGS += -falign-loops=1
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

2015-04-10 12:51:16

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 04/10/2015 02:08 PM, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> So restructure the loop a bit, to get much tighter code:
>>
>> 0000000000000030 <mutex_spin_on_owner.isra.5>:
>> 30: 55 push %rbp
>> 31: 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx
>> 38: 00 00
>> 3a: 48 89 e5 mov %rsp,%rbp
>> 3d: 48 39 37 cmp %rsi,(%rdi)
>> 40: 75 1e jne 60 <mutex_spin_on_owner.isra.5+0x30>
>> 42: 8b 46 28 mov 0x28(%rsi),%eax
>> 45: 85 c0 test %eax,%eax
>> 47: 74 0d je 56 <mutex_spin_on_owner.isra.5+0x26>
>> 49: f3 90 pause
>> 4b: 48 8b 82 10 c0 ff ff mov -0x3ff0(%rdx),%rax
>> 52: a8 08 test $0x8,%al
>> 54: 74 e7 je 3d <mutex_spin_on_owner.isra.5+0xd>
>> 56: 31 c0 xor %eax,%eax
>> 58: 5d pop %rbp
>> 59: c3 retq
>> 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
>> 60: b8 01 00 00 00 mov $0x1,%eax
>> 65: 5d pop %rbp
>> 66: c3 retq
>
> Btw., totally off topic, the following NOP caught my attention:
>
>> 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)



> That's a dead NOP that bloats the function a bit, added for the 16 byte
> alignment of one of the jump targets.
>
> I realize that x86 CPU manufacturers recommend 16-byte jump target
> alignments (it's in the Intel optimization manual), but the cost of
> that is very significant:
>
> text data bss dec filename
> 12566391 1617840 1089536 15273767 vmlinux.align.16-byte
> 12224951 1617840 1089536 14932327 vmlinux.align.1-byte
>
> By using 1 byte jump target alignment (i.e. no alignment at all) we
> get an almost 3% reduction in kernel size (!) - and a probably similar
> reduction in I$ footprint.
>
> So I'm wondering, is the 16 byte jump target optimization suggestion
> really worth this price? The patch below boots fine and I've not
> measured any noticeable slowdown, but I've not tried hard.

I am absolutely thrilled by the proposal to cut down on sadistic amounts
of alignment.

However, I'm an -Os guy. Expect -O2 people to disagree :)

New-ish versions of gcc allow people to specify optimization
options per function:

https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes

optimize
The optimize attribute is used to specify that a function is to be compiled
with different optimization options than specified on the command line.
Arguments can either be numbers or strings. Numbers are assumed to be an
optimization level. Strings that begin with O are assumed to be an
optimization option, while other options are assumed to be used with
a -f prefix.

How about not aligning code by default, and using

#define hot_func __attribute__((optimize("O2","align-functions=16","align-jumps=16")))
...

void hot_func super_often_called_func(...) {...}

in hot code paths?
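
Spelled out as a self-contained sketch (the attribute strings follow
the gcc documentation quoted above; the function name and body are just
placeholders from the example):

/*
 * Build most of the kernel packed (-falign-functions=1 etc.), but let
 * a known-hot function opt back into -O2-style 16-byte alignment:
 */
#define hot_func \
	__attribute__((optimize("O2", "align-functions=16", "align-jumps=16")))

int hot_func super_often_called_func(int x)
{
	return x * 2;		/* placeholder for the hot code path */
}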

2015-04-10 13:19:28

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 04/10/2015 05:50 AM, Denys Vlasenko wrote:
>
> However, I'm an -Os guy. Expect -O2 people to disagree :)
>

The problem with -Os is that the compiler will make *any* tradeoffs to
save a byte. It is really designed to squeeze as much code into a
fixed-size chunk, e.g. a ROM, as possible.

We have asked for an -Okernel mode from the gcc folks forever. It
basically would mean "-Os except when really dumb."

As far as the 16-byte alignment, my understanding is not that it is
related to the I$ but rather is the decoder datum.

-hpa

2015-04-10 13:58:21

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote:
> Now, the usual justification for jump target alignment is the
> following: with 16 byte instruction-cache cacheline sizes, if a

You mean 64 bytes?

Cacheline size on modern x86 is 64 bytes. The 16 alignment is probably
some branch predictor stride thing.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-10 13:48:28

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Pack loops tightly as well

On Fri, Apr 10, 2015 at 02:30:18PM +0200, Ingo Molnar wrote:
> And the final patch below also packs loops tightly:
>
> text data bss dec filename
> 12566391 1617840 1089536 15273767 vmlinux.align.16-byte
> 12224951 1617840 1089536 14932327 vmlinux.align.1-byte
> 11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte
> 11903735 1617840 1089536 14611111 vmlinux.align.1-byte.funcs-1-byte.loops-1-byte
>
> The total reduction is 5.5%.
>
> Now loop alignment is beneficial if:
>
> - a loop is cache-hot and its surroundings are not.
>
> Loop alignment is harmful if:
>
> - a loop is cache-cold
> - a loop's surroundings are cache-hot as well
> - two cache-hot loops are close to each other
>
> and I'd argue that the latter three harmful scenarios are much more
> common in the kernel. Similar arguments can be made for function
> alignment as well. (Jump target alignment is a bit different but I
> think the same conclusion holds.)

So I IMHO think the loop alignment is coupled to the fetch window size
and alignment. I'm looking at the AMD opt. manuals and both for fam 0x15
and 0x16 say that hot loops should be 32-byte aligned due to 32-byte
aligned fetch window in each cycle.

So if we have hot loops, we probably want them 32-byte aligned (I don't
know what that number on Intel is, need to look).

Family 0x16 says, in addition, that if you have branches in those loops,
the first two branches in a cacheline can be processed in a cycle when
they're in the branch predictor. And so to guarantee that you should
align your loop start to a cacheline.

And this all depends on the uarch so I can imagine optimizing for the
one would harm the other.

Looks like a long project of experimenting and running perf counters :-)

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-10 13:55:08

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 04/10/2015 03:19 PM, Borislav Petkov wrote:
> On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote:
>> Now, the usual justification for jump target alignment is the
>> following: with 16 byte instruction-cache cacheline sizes, if a
>
> You mean 64 bytes?
>
> Cacheline size on modern x86 is 64 bytes. The 16 alignment is probably
> some branch predictor stride thing.

IIRC it's a maximum decode bandwidth. Decoders on the most powerful
x86 CPUs, both Intel and AMD, attempt to decode in one cycle
up to four instructions. For this they fetch up to 16 bytes.
If cacheline ends before 16 bytes are available, then decode
will operate on fewer bytes, or it will wait for next cacheline
to be fetched.


2015-04-10 14:04:00

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Fri, Apr 10, 2015 at 03:54:57PM +0200, Denys Vlasenko wrote:
> On 04/10/2015 03:19 PM, Borislav Petkov wrote:
> > On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote:
> >> Now, the usual justification for jump target alignment is the
> >> following: with 16 byte instruction-cache cacheline sizes, if a
> >
> > You mean 64 bytes?
> >
> > Cacheline size on modern x86 is 64 bytes. The 16 alignment is probably
> > some branch predictor stride thing.
>
> IIRC it's a maximum decode bandwidth. Decoders on the most powerful
> x86 CPUs, both Intel and AMD, attempt to decode in one cycle
> up to four instructions. For this they fetch up to 16 bytes.

32 bytes fetch window per cycle for AMD F15h and F16h, see my other
mail. And Intel probably do the same.

> If cacheline ends before 16 bytes are available, then decode
> will operate on fewer bytes, or it will wait for next cacheline
> to be fetched.

Yap.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-10 14:10:20

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
> > So restructure the loop a bit, to get much tighter code:
> >
> > 0000000000000030 <mutex_spin_on_owner.isra.5>:
> > 30: 55 push %rbp
> > 31: 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx
> > 38: 00 00
> > 3a: 48 89 e5 mov %rsp,%rbp
> > 3d: 48 39 37 cmp %rsi,(%rdi)
> > 40: 75 1e jne 60 <mutex_spin_on_owner.isra.5+0x30>
> > 42: 8b 46 28 mov 0x28(%rsi),%eax
> > 45: 85 c0 test %eax,%eax
> > 47: 74 0d je 56 <mutex_spin_on_owner.isra.5+0x26>
> > 49: f3 90 pause
> > 4b: 48 8b 82 10 c0 ff ff mov -0x3ff0(%rdx),%rax
> > 52: a8 08 test $0x8,%al
> > 54: 74 e7 je 3d <mutex_spin_on_owner.isra.5+0xd>
> > 56: 31 c0 xor %eax,%eax
> > 58: 5d pop %rbp
> > 59: c3 retq
> > 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
> > 60: b8 01 00 00 00 mov $0x1,%eax
> > 65: 5d pop %rbp
> > 66: c3 retq
>
> Btw., totally off topic, the following NOP caught my attention:
>
> > 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
>
> > That's a dead NOP that bloats the function a bit, added for the 16 byte
> alignment of one of the jump targets.
>
> I realize that x86 CPU manufacturers recommend 16-byte jump target
> alignments (it's in the Intel optimization manual), but the cost of
> that is very significant:
>
> text data bss dec filename
> 12566391 1617840 1089536 15273767 vmlinux.align.16-byte
> 12224951 1617840 1089536 14932327 vmlinux.align.1-byte
>
> By using 1 byte jump target alignment (i.e. no alignment at all) we
> get an almost 3% reduction in kernel size (!) - and a probably similar
> reduction in I$ footprint.
>
> So I'm wondering, is the 16 byte jump target optimization suggestion
> really worth this price? The patch below boots fine and I've not
> measured any noticeable slowdown, but I've not tried hard.

Good point, adding Josh Triplett on CC. I suspect that he might be
interested. ;-)

Thanx, Paul

> Now, the usual justification for jump target alignment is the
> following: with 16 byte instruction-cache cacheline sizes, if a
> forward jump is aligned to cacheline boundary then prefetches will
> start from a new cacheline.
>
> But I think that argument is flawed for typical optimized kernel code
> flows: forward jumps often go to 'cold' (uncommon) pieces of code, and
> aligning cold code to cache lines does not bring a lot of advantages
> (they are uncommon), while it causes collateral damage:
>
> - their alignment 'spreads out' the cache footprint, it shifts
> followup hot code further out
>
> - plus it slows down even 'cold' code that immediately follows 'hot'
> code (like in the above case), which could have benefited from the
> partial cacheline that comes off the end of hot code.
>
> What do you guys think about this? I think we should seriously
> consider relaxing our alignment defaults.
>
> Thanks,
>
> Ingo
>
> ==================================>
> From 5b83a095e1abdfee5c710c34a5785232ce74f939 Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <[email protected]>
> Date: Fri, 10 Apr 2015 13:50:05 +0200
> Subject: [PATCH] x86: Align jump targets to 1 byte boundaries
>
> Not-Yet-Signed-off-by: Ingo Molnar <[email protected]>
> ---
> arch/x86/Makefile | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/arch/x86/Makefile b/arch/x86/Makefile
> index 5ba2d9ce82dc..0366d6b44a14 100644
> --- a/arch/x86/Makefile
> +++ b/arch/x86/Makefile
> @@ -77,6 +77,9 @@ else
> KBUILD_AFLAGS += -m64
> KBUILD_CFLAGS += -m64
>
> + # Align jump targets to 1 byte, not the default 16 bytes:
> + KBUILD_CFLAGS += -falign-jumps=1
> +
> # Don't autogenerate traditional x87 instructions
> KBUILD_CFLAGS += $(call cc-option,-mno-80387)
> KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)
>

2015-04-10 14:20:32

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock

On Fri, Apr 10, 2015 at 11:00:51AM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > > And if CONFIG_DEBUG_PAGEALLOC is set, we don't care about
> > > performance *at*all*. We will have worse performance problems than
> > > doing some RCU read-locking inside the loop.
> > >
> > > And if CONFIG_DEBUG_PAGEALLOC isn't set, we don't really care
> > > about locking, since at worst we just access stale memory for one
> > > iteration.
> >
> > But if we are running on a hypervisor, mightn't our VCPU be
> > preempted just before accessing ->on_cpu, the task exit and its
> > structures be freed and unmapped? Or is the task structure in
> > memory that is never unmapped? (If the latter, clearly not a
> > problem.)
>
> kmalloc()able kernel memory is never unmapped in that fashion [*].
> Even hotplug memory is based on limiting what gets allocated in that
> area and never putting critical kernel data structures there.
>
> Personally I'd be more comfortable with having a special primitive for
> this that is DEBUG_PAGEALLOC aware (Linus's first suggestion), so that
> we don't use different RCU primitives in the rare case someone tests
> CONFIG_DEBUG_PAGEALLOC=y ...
>
> We even have such a primitive: __copy_from_user_inatomic(). It
> compiles to a single instruction for integer types on x86. I've
> attached a patch that implements it for the regular mutexes (xadd can
> be done too), and it all compiles to a rather sweet, compact routine:
>
> 0000000000000030 <mutex_spin_on_owner.isra.4>:
> 30: 48 3b 37 cmp (%rdi),%rsi
> 33: 48 8d 4e 28 lea 0x28(%rsi),%rcx
> 37: 75 4e jne 87 <mutex_spin_on_owner.isra.4+0x57>
> 39: 55 push %rbp
> 3a: 45 31 c0 xor %r8d,%r8d
> 3d: 65 4c 8b 0c 25 00 00 mov %gs:0x0,%r9
> 44: 00 00
> 46: 48 89 e5 mov %rsp,%rbp
> 49: 48 83 ec 10 sub $0x10,%rsp
> 4d: eb 08 jmp 57 <mutex_spin_on_owner.isra.4+0x27>
> 4f: 90 nop
> 50: f3 90 pause
> 52: 48 3b 37 cmp (%rdi),%rsi
> 55: 75 29 jne 80 <mutex_spin_on_owner.isra.4+0x50>
> 57: 44 89 c0 mov %r8d,%eax
> 5a: 90 nop
> 5b: 90 nop
> 5c: 90 nop
> 5d: 8b 11 mov (%rcx),%edx
> 5f: 90 nop
> 60: 90 nop
> 61: 90 nop
> 62: 85 d2 test %edx,%edx
> 64: 89 55 fc mov %edx,-0x4(%rbp)
> 67: 74 0b je 74 <mutex_spin_on_owner.isra.4+0x44>
> 69: 49 8b 81 10 c0 ff ff mov -0x3ff0(%r9),%rax
> 70: a8 08 test $0x8,%al
> 72: 74 dc je 50 <mutex_spin_on_owner.isra.4+0x20>
> 74: 31 c0 xor %eax,%eax
> 76: c9 leaveq
> 77: c3 retq
> 78: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
> 7f: 00
> 80: b8 01 00 00 00 mov $0x1,%eax
> 85: c9 leaveq
> 86: c3 retq
> 87: b8 01 00 00 00 mov $0x1,%eax
> 8c: c3 retq
> 8d: 0f 1f 00 nopl (%rax)
>
> No RCU overhead, and this is the access to owner->on_cpu:
>
> 69: 49 8b 81 10 c0 ff ff mov -0x3ff0(%r9),%rax
>
> Totally untested and all that, I only built the mutex.o.
>
> What do you think? Am I missing anything?

I suspect it is good, but let's take a look at Linus' summary of the code:

rcu_read_lock();
while (sem->owner == owner) {
if (!owner->on_cpu || need_resched())
break;
cpu_relax_lowlatency();
}
rcu_read_unlock();

The cpu_relax_lowlatency() looks to have barrier() semantics, so the
sem->owner should get reloaded every time through the loop. This is
needed, because otherwise the task structure could get freed and
reallocated as something else that happened to have the field at the
->on_cpu offset always zero, resulting in an infinite loop.

Of course, the implicit barrier() could well be forcing unnecessary
register spilling. If so, it might be worth trying a variant of
cpu_relax_lowlatency() that doesn't have barrier() semantics and
placing READ_ONCE() around the sem->owner in the "while" condition.
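
A minimal sketch of that variant of the loop quoted above;
cpu_relax_lowlatency_nobarrier() is a purely hypothetical name for a
relax primitive without barrier() semantics, everything else follows
the summary above:

	rcu_read_lock();
	while (READ_ONCE(sem->owner) == owner) {
		if (!owner->on_cpu || need_resched())
			break;
		/*
		 * Relax without a compiler barrier; the READ_ONCE() above
		 * already forces sem->owner to be reloaded each iteration.
		 */
		cpu_relax_lowlatency_nobarrier();
	}
	rcu_read_unlock();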

So, in short, given the big fat comment you called for, I don't see
any problem with this approach.

Which means that any bugs will be a surprise, which will at least have
the advantage of avoiding boredom. ;-)

Thanx, Paul

> Thanks,
>
> Ingo
>
> [*] with the exception of CONFIG_DEBUG_PAGEALLOC and other debug
> mechanisms like CONFIG_KMEMCHECK (which is on the way out) that
> are based on provoking page faults and fixing up page tables to
> catch unexpected memory accesses.
>
> =================================>
> From ef3e5e763747d47a43a32f846ee94706089222cf Mon Sep 17 00:00:00 2001
> From: Ingo Molnar <[email protected]>
> Date: Fri, 10 Apr 2015 10:49:11 +0200
> Subject: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock
>
> Linus suggested to get rid of the held RCU read-lock in
> mutex_spin_on_owner(). The only real complication is that the 'owner'
> task might be freed from under us and we might dereference into
> possibly freed kernel memory.
>
> As long as the kernel pointer itself is valid this approach is fine in
> this case (see the analysis below) - with the exception of
> CONFIG_DEBUG_PAGEALLOC=y and similarly instrumented kernels which
> might fault on referencing freed kernel memory.
>
> Use the non-faulting copy-from-user primitive to get the owner->on_cpu
> value that we use in NMI handlers and which works even on
> CONFIG_DEBUG_PAGEALLOC=y instrumented kernels. This compiles to a
> single instruction on most platforms.
>
> This approach might briefly copy in junk data from an already freed
> (previous) owner task, which might trigger two scenarios:
>
> 1) The junk data causes us to loop once more. This is not
> a problem as we'll check the owner on the next loop and
> break out of the loop.
>
> 2) If the junk value causes us to break out of the loop
> that's fine too: it's what we'd have done anyway on
> the next iteration, as the lock owner changed.
>
> The inatomic context copy primitives are compiler barriers
> too - this matters to make sure the above owner check is
> emitted before the copy attempt.
>
> We also ignore failed copies, as the next iteration will clean
> up after us. This saves an extra branch in the common case.
>
> Suggested-by: Linus Torvalds <[email protected]>
> Not-Yet-Signed-off-by: Ingo Molnar <[email protected]>
> ---
> kernel/locking/mutex.c | 40 +++++++++++++++++++++++++++-------------
> 1 file changed, 27 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
> index 4cccea6b8934..fcc7db45d62e 100644
> --- a/kernel/locking/mutex.c
> +++ b/kernel/locking/mutex.c
> @@ -224,28 +224,42 @@ ww_mutex_set_context_slowpath(struct ww_mutex *lock,
> static noinline
> bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
> {
> - bool ret = true;
> + int on_cpu;
> + int ret;
>
> - rcu_read_lock();
> while (lock->owner == owner) {
> /*
> - * Ensure we emit the owner->on_cpu, dereference _after_
> - * checking lock->owner still matches owner. If that fails,
> - * owner might point to freed memory. If it still matches,
> - * the rcu_read_lock() ensures the memory stays valid.
> + * Use the non-faulting copy-user primitive to get the owner->on_cpu
> + * value that works even on CONFIG_DEBUG_PAGEALLOC=y instrumented
> + * kernels. This compiles to a single instruction on most platforms.
> + *
> + * This might briefly copy in junk data from an already freed
> + * (previous) owner task, which might trigger two scenarios:
> + *
> + * 1) The junk data causes us to loop once more. This is not
> + * a problem as we'll check the owner on the next loop and
> + * break out of the loop.
> + *
> + * 2) If the junk value causes us to break out of the loop
> + * that's fine too: it's what we'd have done anyway on
> + * the next iteration, as the lock owner changed.
> + *
> + * NOTE: the inatomic context copy primitives are compiler barriers
> + * too - this matters to make sure the above owner check is
> + * emitted before the copy attempt.
> + *
> + * NOTE2: We ignore failed copies, as the next iteration will clean
> + * up after us. This saves an extra branch in the common case.
> */
> - barrier();
> + ret = __copy_from_user_inatomic(&on_cpu, &owner->on_cpu, sizeof(on_cpu));
>
> - if (!owner->on_cpu || need_resched()) {
> - ret = false;
> - break;
> - }
> + if (!on_cpu || need_resched())
> + return false;
>
> cpu_relax_lowlatency();
> }
> - rcu_read_unlock();
>
> - return ret;
> + return true;
> }
>
> /*
>

2015-04-10 14:54:39

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 04/10/2015 04:01 PM, Borislav Petkov wrote:
> On Fri, Apr 10, 2015 at 03:54:57PM +0200, Denys Vlasenko wrote:
>> On 04/10/2015 03:19 PM, Borislav Petkov wrote:
>>> On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote:
>>>> Now, the usual justification for jump target alignment is the
>>>> following: with 16 byte instruction-cache cacheline sizes, if a
>>>
>>> You mean 64 bytes?
>>>
>>> Cacheline size on modern x86 is 64 bytes. The 16 alignment is probably
>>> some branch predictor stride thing.
>>
>> IIRC it's a maximum decode bandwidth. Decoders on the most powerful
>> x86 CPUs, both Intel and AMD, attempt to decode in one cycle
>> up to four instructions. For this they fetch up to 16 bytes.
>
> 32 bytes fetch window per cycle for AMD F15h and F16h, see my other
> mail. And Intel probably do the same.

There are people who experimentally researched this.
According to this guy:

http://www.agner.org/optimize/microarchitecture.pdf

Intel CPUs can decode only up to 16 bytes at a time
(but they have loop buffers and some have a uop cache,
which can skip decoding entirely).
AMD CPUs can decode 21 bytes at best. With two cores active,
only 16 bytes.


"""
10 Haswell pipeline
...
10.1 Pipeline
The pipeline is similar to previous designs, but improved with more of everything. It is
designed for a throughput of four instructions per clock cycle.
Each core has a reorder buffer with 192 entries, the reservation station has 60 entries, and
the register file has 168 integer registers and 168 vector registers, according to the literature
listed on page 145 below.
All parts of the pipeline are shared between two threads in those CPU models that can run
two threads in each core. Each thread gets half of the total throughput when two threads are
running in the same core.

10.2 Instruction fetch and decoding
The instruction fetch unit can fetch a maximum of 16 bytes of code per clock cycle in single
threaded applications.
There are four decoders, which can handle instructions generating up to four μops per clock
cycle in the way described on page 120 for Sandy Bridge.
Instructions with any number of prefixes are decoded in a single clock cycle. There is no
penalty for redundant prefixes.

...
...

15 AMD Bulldozer, Piledriver and Steamroller pipeline
15.1 The pipeline in AMD Bulldozer, Piledriver and Steamroller
...
15.2 Instruction fetch
The instruction fetcher is shared between the two cores of an execution unit. The instruction
fetcher can fetch 32 aligned bytes of code per clock cycle from the level-1 code cache. The
measured fetch rate was up to 16 bytes per clock per core when two cores were active, and
up to 21 bytes per clock in linear code when only one core was active. The fetch rate is
lower than these maximum values when instructions are misaligned.
Critical subroutine entries and loop entries should not start near the end of a 32-bytes block.
You may align critical entries by 16 or at least make sure there is no 16-bytes boundary in
the first four instructions after a critical label.
"""

2015-04-10 15:27:28

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Fri, Apr 10, 2015 at 04:53:29PM +0200, Denys Vlasenko wrote:
> There are people who experimentally researched this.
> According to this guy:
>
> http://www.agner.org/optimize/microarchitecture.pdf
>
> Intel CPUs can decode only up to 16 bytes at a time
> (but they have loop buffers and some have a uop cache,
> which can skip decoding entirely).

Ok, so Intel don't need a 32-byte fetch window. Probably the uop cache
and other tricks make a larger fetch window not really important in that
case. A larger fetch window also means more power, and having predecoded
stuff in a cache is a much better story no matter how you look at it.

> AMD CPUs can decode 21 bytes at best. With two cores active,
> only 16 bytes.
>
>
> """
> 10 Haswell pipeline
> ...
> 10.1 Pipeline
> The pipeline is similar to previous designs, but improved with more of everything. It is

"more of everything", yeah! :)

> 10.2 Instruction fetch and decoding
> The instruction fetch unit can fetch a maximum of 16 bytes of code per clock cycle in single
> threaded applications.

That uop cache is simply spot on, it seems.

> There are four decoders, which can handle instructions generating up to four μops per clock
> cycle in the way described on page 120 for Sandy Bridge.
> Instructions with any number of prefixes are decoded in a single clock cycle. There is no
> penalty for redundant prefixes.

That's nice.

> 15 AMD Bulldozer, Piledriver and Steamroller pipeline
> 15.1 The pipeline in AMD Bulldozer, Piledriver and Steamroller
> ...
> 15.2 Instruction fetch
> The instruction fetcher is shared between the two cores of an execution unit. The instruction
> fetcher can fetch 32 aligned bytes of code per clock cycle from the level-1 code cache. The

That's also understandable - you want to enlarge the fetch window for
the two cores of a compute unit as they share a front end.

> measured fetch rate was up to 16 bytes per clock per core when two cores were active, and
> up to 21 bytes per clock in linear code when only one core was active. The fetch rate is
> lower than these maximum values when instructions are misaligned.
> Critical subroutine entries and loop entries should not start near the end of a 32-bytes block.
> You may align critical entries by 16 or at least make sure there is no 16-bytes boundary in
> the first four instructions after a critical label.
> """

All F15h models are Bulldozer uarch with improvements. For example,
later F15h models have things like a loop buffer and loop predictor
which can replay loops under certain conditions, thus diminishing the
importance of the fetch window size wrt loop performance.

And then there's AMD F16h Software Optimization Guide, that's the Jaguar
uarch:

"...The processor can fetch 32 bytes per cycle and can scan two 16-byte
instruction windows for up to two instruction decodes per cycle.

...

2.7.2 Loop Alignment

For the Family 16h processor loop alignment is not usually a significant
issue. However, for hot loops, some further knowledge of trade-offs can
be helpful. Since the processor can read an aligned 32-byte fetch block
every cycle, to achieve maximum fetch bandwidth the loop start point
should be aligned to 32 bytes. For very hot loops, it may be useful to
further consider branch placement. The branch predictor can process the
first two branches in a cache line in a single cycle through the sparse
predictor. For best performance, any branches in the first cache line
of the hot loop should be in the sparse predictor. The simplest way to
guarantee this for very hot loops is to align the start point to a cache
line (64-byte) boundary."

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-10 15:50:00

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 04/10/2015 05:25 PM, Borislav Petkov wrote:
>> measured fetch rate was up to 16 bytes per clock per core when two cores were active, and
>> up to 21 bytes per clock in linear code when only one core was active. The fetch rate is
>> lower than these maximum values when instructions are misaligned.
>> Critical subroutine entries and loop entries should not start near the end of a 32-bytes block.
>> You may align critical entries by 16 or at least make sure there is no 16-bytes boundary in
>> the first four instructions after a critical label.
>> """
>
> All F15h models are Bulldozer uarch with improvements. For example,
> later F15h models have things like loop buffer and loop predictor
> which can replay loops under certain conditions, thus diminishing the
> importance of the fetch window size wrt to loops performance.
>
> And then there's AMD F16h Software Optimization Guide, that's the Jaguar
> uarch:
>
> "...The processor can fetch 32 bytes per cycle and can scan two 16-byte
> instruction windows for up to two instruction decodes per cycle.

As you know, manuals are not be-all, end-all documents.
They contain mistakes. And they are written before silicon
is finalized, and sometimes they advertise capabilities
which in the end had to be downscaled. It's hard to check
a 1000+ page document and correct all mistakes, especially
hard-to-quantify ones.

In the same document by Agner Fog, he says that he failed to confirm
32-byte fetch on Fam16h CPUs:

"""
16 AMD Bobcat and Jaguar pipeline
...
...
16.2 Instruction fetch
The instruction fetch rate is stated as "up to 32 bytes per cycle",
but this is not confirmed by my measurements which consistently show
a maximum of 16 bytes per clock cycle on average for both Bobcat and
Jaguar. Some reports say that the Jaguar has a loop buffer,
but I cannot detect any improvement in performance for tiny loops.
"""

2015-04-10 15:56:26

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Fri, Apr 10, 2015 at 05:48:51PM +0200, Denys Vlasenko wrote:
> As you know, manuals are not be-all, end-all documents.
> They contain mistakes. And they are written before silicon
> is finalized, and sometimes they advertise capabilities
> which in the end had to be downscaled. It's hard to check
> a 1000+ page document and correct all mistakes, especially
> hard-to-quantify ones.
>
> In the same document by Agner Fog, he says that he failed to confirm
> 32-byte fetch on Fam16h CPUs:

Hmm, ok. I probably should poke some people about it. Lemme see...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-10 17:44:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock


* Paul E. McKenney <[email protected]> wrote:

> > No RCU overhead, and this is the access to owner->on_cpu:
> >
> > 69: 49 8b 81 10 c0 ff ff mov -0x3ff0(%r9),%rax
> >
> > Totally untested and all that, I only built the mutex.o.
> >
> > What do you think? Am I missing anything?
>
> I suspect it is good, but let's take a look at Linus' summary of the code:
>
> rcu_read_lock();
> while (sem->owner == owner) {
> if (!owner->on_cpu || need_resched())
> break;
> cpu_relax_lowlatency();
> }
> rcu_read_unlock();

Note that I patched the mutex case as a prototype, which is more
commonly used than rwsem-xadd. But the rwsem case is similar as well.

> The cpu_relax_lowlatency() looks to have barrier() semantics, so the
> sem->owner should get reloaded every time through the loop. This is
> needed, because otherwise the task structure could get freed and
> reallocated as something else that happened to have the field at the
> ->on_cpu offset always zero, resulting in an infinite loop.

So at least with the get_kernel(..., &owner->on_cpu) approach, the
get_kernel() copy has barrier semantics as well (it's in assembly), so
it will be reloaded in every iteration in a natural fashion.

Thanks,

Ingo

2015-04-10 17:49:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86/uaccess: Implement get_kernel()

On Fri, Apr 10, 2015 at 4:14 AM, Ingo Molnar <[email protected]> wrote:
>
>>
>> The next patch will implement efficient __copy_from_kernel_inatomic()
>> for x86.
>
> The patch below does that. Note, for simplicity I've changed the
> interface to 'get_kernel()' (will propagate this through the other
> patches as well).

So I think this needs a couple of changes:

- That "get_kernel()" name is not clear enough about what the issue
is. I think it should make it clearer that it's an unsafe access that
could fault, and that we don't want a user access.

So maybe "get_kernel_stalepointer()" or something like that.

- you're just re-implementing "__get_user_asm_ex()" afaik. Try to
share the code, renaming it to be something common.

- I think we should look at sharing the code for __get_user(). Could
we do something like this:

(a) implement the basic "load with exceptions" as __get_with_exception()
(b) #define get_kernel_stalepointer() __get_with_exception
(c) make "__get_user()" be "stac(); __get_with_exception(); clac()"

- finally, I wonder what the exact semantics of
"get_kernel_stalepointer()" should be. I could well imagine that what
we should do is

#ifdef CONFIG_DEBUG_PAGEALLOC
#define get_kernel_stalepointer(x,ptr) ((x)=READ_ONCE(*(ptr)), 0)
#else
#define get_kernel_stalepointer(x,ptr) __get_with_exception(x,ptr)
#endif

because I think it's reasonable to require that the kernel pointer is
_stale_, and not "invalid". IOW, guarantee that it *has* been a kernel
pointer, and that the only reason it would trap is for
DEBUG_PAGEALLOC.

That last point might need to be verified with hotplug memory. I think
hotplug memory does a stop_machine() or similar, but I'm not sure.

Hmm?

Linus

2015-04-10 17:54:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries


* H. Peter Anvin <[email protected]> wrote:

> On 04/10/2015 05:50 AM, Denys Vlasenko wrote:
> >
> > However, I'm an -Os guy. Expect -O2 people to disagree :)
> >
>
> The problem with -Os is that the compiler will make *any* tradeoffs
> to save a byte. It is really designed to squeeze as much code into
> a fixed-size chunk, e.g. a ROM, as possible.
>
> We have asked for an -Okernel mode from the gcc folks forever. It
> basically would mean "-Os except when really dumb."

Yes, and it appears that not aligning to 16 bytes gives 5.5% size
savings already - which is a big chunk of the -Os win.

So we might be able to get a "poor man's -Okernel" by not aligning.
(I'm also looking at GCC options to make loop unrolls less aggressive
- that's another common source of bloat.)

I strongly suspect it's the silly 'use weird, wildly data-dependent
instructions just to save a single byte' games that are killing
-Os performance in practice.

> As far as the 16-byte alignment, my understanding is not that it is
> related to the I$ but rather is the decoder datum.

Yeah, but the decoder stops if the prefetch crosses a cache line? So
it appears to be an interaction of the 16 byte prefetch window and
cache line boundaries?

Btw., given that much of a real life kernel's instructions execute
cache-cold, a 5.5% reduction in kernel size could easily speed up
cache-cold execution by a couple of percent. In the cache-cold case
the prefetch window size is probably not important at all, what
determines execution speed is cache miss latency and cache footprint.

[ At least in my simple mental picture of it, which might be wrong ;-) ]

Thanks,

Ingo

2015-04-10 18:04:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86/uaccess: Implement get_kernel()


* Linus Torvalds <[email protected]> wrote:

> On Fri, Apr 10, 2015 at 4:14 AM, Ingo Molnar <[email protected]> wrote:
> >
> >>
> >> The next patch will implement efficient
> >> __copy_from_kernel_inatomic() for x86.
> >
> > The patch below does that. Note, for simplicity I've changed the
> > interface to 'get_kernel()' (will propagate this through the other
> > patches as well).
>
> So I think this needs a couple of changes:
>
> - That "get_kernel()" name is not clear enough about what the issue
> is. I think it should make it clearer that it's an unsafe access
> that could fault, and that we don't want a user access.
>
> So maybe "get_kernel_stalepointer()" or something like that.

Ok.

> - you're just re-implementing "__get_user_asm_ex()" afaik. Try to
> share the code, renaming it to be something common.

Ok, will try that.

> - I think we should look at sharing the code for __get_user(). Could
> we do something like this:
>
> (a) implement the basic "load with exceptions" as __get_with_exception()
> (b) #define get_kernel_stalepointer() __get_with_exception
> (c) make "__get_user()" be "stac(); __get_with_exception(); clac()"

Will try.

The only possible complication there might be the way we don't recover
the error code in the _ex() variants, that's actually a pretty
important aspect to making this zero cost. Since the error code comes
back from assembly code in some cases we cannot make it go away in the
_ex() case. So I'm not sure we can share code between _ex() and the
normal methods - but we can certainly share with the _ex() variants.

> - finally, I wonder what the exact semantics of
> "get_kernel_stalepointer()" should be. I could well imagine that
> what we should do is
>
> #ifdef CONFIG_DEBUG_PAGEALLOC
> #define get_kernel_stalepointer(x,ptr) ((x)=READ_ONCE(*(ptr)), 0)
> #else
> #define get_kernel_stalepointer(x,ptr) __get_with_exception(x,ptr)
> #endif

I guess you meant that to be the other way around?

> because I think it's reasonable to require that the kernel pointer
> is _stale_, and not "invalid". [...]

Absolutely, and I think this is a hard requirement: we don't (ever)
want to dereference random addresses, due to possible mmio side
effects.

> [...] IOW, guarantee that it *has* been a kernel pointer, and that
> the only reason it would trap is for DEBUG_PAGEALLOC.

Yes.

> That last point might need to be verified with hotplug memory. I
> think hotplug memory does a stop_machine() or similar, but I'm not
> sure.

So memory hotplug does it in a pretty simple fashion IIRC: only such
zones are movable and hot-unpluggable which don't contain
kmalloc()-able or gfp()-able memory - they are limited purpose memory
pools only usable for user pages and the page cache.

So stale pointers should never point to hot-unpluggable memory.

Thanks,

Ingo

2015-04-10 18:04:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86/uaccess: Implement get_kernel()


* Peter Zijlstra <[email protected]> wrote:

> On Fri, Apr 10, 2015 at 01:14:27PM +0200, Ingo Molnar wrote:
> > +/*
> > + * Simple copy-from-possibly-faulting-kernel-addresses method that
> > + * avoids the STAC/CLAC SMAP overhead.
> > + *
> > + * NOTE: this does not propagate the error code of faulting kernel
> > + * addresses properly. You can recover it via uaccess_catch()
> > + * if you really need to.
> > + */
> > +#define get_kernel(dst, src) \
> > +do { \
> > + typeof(*(src)) __val; \
>
> Should we make that:
>
> typeof(*(src)) __val = (dst);
>
> > + \
> > + switch (sizeof(__val)) { \
> > + case 1: __get_kernel_asm_ex(__val, src, "b", "b", "=q"); break; \
> > + case 2: __get_kernel_asm_ex(__val, src, "w", "w", "=r"); break; \
> > + case 4: __get_kernel_asm_ex(__val, src, "l", "k", "=r"); break; \
> > + case 8: __get_kernel_asm_ex(__val, src, "q", " ", "=r"); break; \
> > + default: __get_kernel_BUILD_ERROR(); \
> > + } \
> > + (dst) = __val; \
> > +} while (0)
>
> Such that when we fault, the value is unmodified? The way it is we'll
> assign whatever was on stack for __val, which seems undesirable, no?

Yes, indeed.

Thanks,

Ingo

2015-04-10 18:05:30

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH] mutex: Speed up mutex_spin_on_owner() by not taking the RCU lock

On Fri, Apr 10, 2015 at 07:44:00PM +0200, Ingo Molnar wrote:
>
> * Paul E. McKenney <[email protected]> wrote:
>
> > > No RCU overhead, and this is the access to owner->on_cpu:
> > >
> > > 69: 49 8b 81 10 c0 ff ff mov -0x3ff0(%r9),%rax
> > >
> > > Totally untested and all that, I only built the mutex.o.
> > >
> > > What do you think? Am I missing anything?
> >
> > I suspect it is good, but let's take a look at Linus' summary of the code:
> >
> > rcu_read_lock();
> > while (sem->owner == owner) {
> > if (!owner->on_cpu || need_resched())
> > break;
> > cpu_relax_lowlatency();
> > }
> > rcu_read_unlock();
>
> Note that I patched the mutex case as a prototype, which is more
> commonly used than rwsem-xadd. But the rwsem case is similar as well.
>
> > The cpu_relax_lowlatency() looks to have barrier() semantics, so the
> > sem->owner should get reloaded every time through the loop. This is
> > needed, because otherwise the task structure could get freed and
> > reallocated as something else that happened to have the field at the
> > ->on_cpu offset always zero, resulting in an infinite loop.
>
> So at least with the get_kernel(..., &owner->on_cpu) approach, the
> get_kernel() copy has barrier semantics as well (it's in assembly), so
> it will be reloaded in every iteration in a natural fashion.

Good point, even better!

Thanx, Paul

2015-04-10 18:09:41

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86/uaccess: Implement get_kernel()

On Fri, Apr 10, 2015 at 11:04 AM, Ingo Molnar <[email protected]> wrote:
>
>>
>> So maybe "get_kernel_stalepointer()" or something like that.
>
> Ok.

Note that I'm not at all married to that particular name, I just want
something that describes the requirements a bit more.

>> - I think we should look at sharing the code for __get_user(). Could
>> we do something like this:
>>
>> (a) implement the basic "load with exceptions" as __get_with_exception()
>> (b) #define get_kernel_stalepointer() __get_with_exception
>> (c) make "__get_user()" be "stac(); __get_with_exception(); clac()"
>
> Will try.
>
> The only possible complication there might be the way we don't recover
> the error code in the _ex() variants, that's actually a pretty
> important aspect to making this zero cost.

Yeah. You may be right. What I would really want is that "asm goto"
with an output register, but gcc doesn't do that. Then we could
improve on the whole try/catch thing too, so that it would just jump
to the catch..

>> #ifdef CONFIG_DEBUG_PAGEALLOC
>> #define get_kernel_stalepointer(x,ptr) ((x)=READ_ONCE(*(ptr)), 0)
>> #else
>> #define get_kernel_stalepointer(x,ptr) __get_with_exception(x,ptr)
>> #endif
>
> I guess you meant that to be the other way around?

Yes I did.

Linus
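
[ For the record, the corrected orientation of that #ifdef would be the
following sketch; __get_with_exception() is the helper proposed above,
not an existing API: ]

/*
 * With DEBUG_PAGEALLOC the stale pointer may genuinely fault, so the
 * exception-handling variant is needed there; otherwise a plain
 * READ_ONCE() is sufficient.
 */
#ifdef CONFIG_DEBUG_PAGEALLOC
# define get_kernel_stalepointer(x, ptr)	__get_with_exception(x, ptr)
#else
# define get_kernel_stalepointer(x, ptr)	((x) = READ_ONCE(*(ptr)), 0)
#endif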

2015-04-10 18:33:32

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 04/10/2015 10:54 AM, Ingo Molnar wrote:
>
> Yeah, but the decoder stops if the prefetch crosses a cache line? So
> it appears to be an interaction of the 16 byte prefetch window and
> cache line boundaries?
>

I don't think it has anything to do with cache lines. Rather, you may
lose a cycle if the first instruction bundle crosses into a new 16-byte
chunk.

-hpa

2015-04-10 18:48:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Fri, Apr 10, 2015 at 5:50 AM, Denys Vlasenko <[email protected]> wrote:
>
> However, I'm an -Os guy. Expect -O2 people to disagree :)

I used to be an -Os guy too. I'm a big believer in I$ density.

HOWEVER.

It turns out that gcc's -Os is just horrible nasty crap. It doesn't
actually make good tradeoffs for code density, because it doesn't make
any tradeoffs at all. It tries to choose small code, even when it's
ridiculously bad small code.

For example, a 24-byte static memcpy is best done as three quad-word
load/store pairs. That's very cheap, and not at all unreasonable.

But what does gcc do? It does a "rep movsl".

Seriously. That's *shit*. It absolutely kills performance on some very
critical code.

I'm not making that up. Try "-O2" and "-Os" on the appended trivial
code. Yes, the "rep movsl" is smaller, but it's incredibly expensive,
particularly if the result is partially used afterwards.

And I'm not a hater of "rep movs" - not at all. I think that "rep
movsb" is basically a perfect way to tell the CPU "do an optimized
memcpy with whatever cache situation you have". So I'm a big fan of
the string instructions, but only when appropriate. And "appropriate"
here very much includes "I don't know the memory copy size, so I'm
going to call out to some complex generic code that does all kinds of
size checks and tricks".

Replacing three pairs of "mov" instructions with a "rep movs" is insane.

(There are a couple of other examples of that kind of issue with
"-Os". Like using "imul $15" instead of a single shift-by-4 and
subtract. Again, the "imul" is certainly smaller, but can have quite
bad latency and throughput issues).

So I'm no longer a fan of -Os. It disables too many obviously good
code optimizations.

Linus

---
struct dummy {
	unsigned long a, b, c;
};

void test(struct dummy *a, struct dummy *b)
{
	*b = *a;
}
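
[ Illustrative only, not actual compiler output: a rough C rendering of
the -O2 code Linus describes, with the corresponding quad-word
load/store pairs sketched in the comments; -Os instead replaces the
whole body of test() with a "rep movsl" copy. ]

void test_unrolled(struct dummy *a, struct dummy *b)
{
	b->a = a->a;	/* movq 0x00(%rdi),%rax ; movq %rax,0x00(%rsi) */
	b->b = a->b;	/* movq 0x08(%rdi),%rax ; movq %rax,0x08(%rsi) */
	b->c = a->c;	/* movq 0x10(%rdi),%rax ; movq %rax,0x10(%rsi) */
}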

2015-04-10 18:54:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Fri, Apr 10, 2015 at 7:53 AM, Denys Vlasenko <[email protected]> wrote:
>
> There are people who experimentally researched this.
> According to this guy:
>
> http://www.agner.org/optimize/microarchitecture.pdf
>
> Intel CPUs can decode only up to 16 bytes at a time

Indeed.

For Intel decoding, the old "4-1-1-1" decode patterns are almost
entirely immaterial these days. Even the "single uop" (the "1"s in the
4-1-1-1) cover the vast majority of cases.

So for Intel decoders, the biggest limit - especially for x86-64
instructions - tends to be the 16-byte decode window. The problem with
x86 decoding isn't that individual instructions are complicated, but
the fact that when you try to decode multiple instructions at once,
finding the start of each instruction is somewhat painful. What I
*think* Intel does is have this rather complex net of logic that
basically decodes 16 bytes in parallel, but has this rippling thing
that just disables the incorrect decodes.

That said, the fetch boundary from L2 is probably an issue too,
especially if the front-end hasn't had time to run ahead of the
execution engine. That's likely where the "32 byte alignment" comes
from.

Linus

2015-04-10 19:23:44

by Daniel Borkmann

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 04/10/2015 02:50 PM, Denys Vlasenko wrote:
...
> New-ish versions of gcc allow people to specify optimization
> options per function:
>
> https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes
>
> optimize
> The optimize attribute is used to specify that a function is to be compiled
> with different optimization options than specified on the command line.
> Arguments can either be numbers or strings. Numbers are assumed to be an
> optimization level. Strings that begin with O are assumed to be an
> optimization option, while other options are assumed to be used with
> a -f prefix.
>
> How about not aligning code by default, and using
>
> #define hot_func __attribute__((optimize("O2","align-functions=16","align-jumps=16")))

I stumbled over that some time ago in a different context. Apparently,
that's being considered broken by gcc folks [1]. ;)

[1] http://gcc.gnu.org/ml/gcc/2012-07/msg00211.html

2015-04-10 21:46:51

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Fri, Apr 10, 2015 at 05:54:10PM +0200, Borislav Petkov wrote:
> On Fri, Apr 10, 2015 at 05:48:51PM +0200, Denys Vlasenko wrote:
> > As you know, manuals are not be-all, end-all documents.
> > They contains mistakes. And they are written before silicon
> > is finalized, and sometimes they advertise capabilities
> > which in the end had to be downscaled. It's hard to check
> > a 1000+ pages document and correct all mistakes, especially
> > hard-to-quantify ones.
> >
> > In the same document by Agner Fog, he says that he failed to confirm
> > 32-byte fetch on Fam16h CPUs:

So reportedly, the fetch window is 32-byte wide but I think there are
restrictions in the pipe elsewhere, which would have negative influence
on the actual throughput. And which could explain Agner's observations.
AFAICT.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.
--

2015-04-11 09:20:40

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] x86: Turn off GCC branch probability heuristics


* Ingo Molnar <[email protected]> wrote:

> Btw., totally off topic, the following NOP caught my attention:
>
> > 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
>
> That's a dead NOP that bloats the function a bit, added for the 16 byte
> alignment of one of the jump targets.

Another thing caught my attention (and I'm hijacking the RCU thread
again): GCC's notion of how to place branches seems somewhat random,
and rather bloaty.

So I tried the experiment below on an x86 defconfig, turning off GCC's
branch heuristics, and it's rather surprising:

text data bss dec filename
12566447 1617840 1089536 15273823 vmlinux.fguess-branch-probability
11923593 1617840 1089536 14630969 vmlinux.fno-guess-branch-probability

That's a 5.4% code size improvement!

So maybe we should try this, as it results in much more predictable
(and more compact!) code by default - and allows us to shape loops and
branches in a natural fashion: by their placement, and also via
likely()/unlikely() hints when absolutely necessary.

Thoughts?

Thanks,

Ingo


=================>
From: Ingo Molnar <[email protected]>
Date: Sat, 11 Apr 2015 11:16:30 +0200
Subject: [PATCH] x86: Turn off GCC branch probability heuristics

Not-Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Makefile | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5ba2d9ce82dc..7c12b3f56915 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -81,6 +81,9 @@ else
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

+ # Don't guess branch probabilities, follow the code and unlikely()/likely() hints:
+ KBUILD_CFLAGS += -fno-guess-branch-probability
+
# Use -mpreferred-stack-boundary=3 if supported.
KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=3)

2015-04-11 13:49:03

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 2015.04.10 at 14:50 +0200, Denys Vlasenko wrote:
> New-ish versions of gcc allow people to specify optimization
> options per function:
>
> https://gcc.gnu.org/onlinedocs/gcc/Function-Attributes.html#Function-Attributes
>
> optimize
> The optimize attribute is used to specify that a function is to be compiled
> with different optimization options than specified on the command line.
> Arguments can either be numbers or strings. Numbers are assumed to be an
> optimization level. Strings that begin with O are assumed to be an
> optimization option, while other options are assumed to be used with
> a -f prefix.
>
> How about not aligning code by default, and using
>
> #define hot_func __attribute__((optimize("O2","align-functions=16","align-jumps=16")))
> ...
>
> void hot_func super_often_called_func(...) {...}
>
> in hot code paths?

__attribute__((optimize)) is meant for compiler debugging only (to help
folks to reduce their testcases to a single function).
It should _not_ be used in production code, because the implementation
is very buggy.

--
Markus

2015-04-11 14:28:59

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Fri, Apr 10, 2015 at 07:10:08AM -0700, Paul E. McKenney wrote:
> On Fri, Apr 10, 2015 at 02:08:46PM +0200, Ingo Molnar wrote:
> > * Ingo Molnar <[email protected]> wrote:
[...]
> > Btw., totally off topic, the following NOP caught my attention:
> >
> > > 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
> >
> > That's a dead NOP that bloats the function a bit, added for the 16 byte
> > alignment of one of the jump targets.
> >
> > I realize that x86 CPU manufacturers recommend 16-byte jump target
> > alignments (it's in the Intel optimization manual), but the cost of
> > that is very significant:
> >
> > text data bss dec filename
> > 12566391 1617840 1089536 15273767 vmlinux.align.16-byte
> > 12224951 1617840 1089536 14932327 vmlinux.align.1-byte
> >
> > By using 1 byte jump target alignment (i.e. no alignment at all) we
> > get an almost 3% reduction in kernel size (!) - and a probably similar
> > reduction in I$ footprint.
> >
> > So I'm wondering, is the 16 byte jump target optimization suggestion
> > really worth this price? The patch below boots fine and I've not
> > measured any noticeable slowdown, but I've not tried hard.
>
> Good point, adding Josh Triplett on CC. I suspect that he might be
> interested. ;-)

Quite interested, yes. Even if there *are* benchmarks to support
keeping the optimization (which wouldn't surprise me), it'd be nice to
have a Kconfig option to enable the jump-target optimization. (With 'y'
meaning "pad jump targets to 16 bytes", so that allnoconfig and
tinyconfig automatically don't.)

- Josh Triplett

2015-04-11 14:41:40

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 2015.04.10 at 06:18 -0700, H. Peter Anvin wrote:
> On 04/10/2015 05:50 AM, Denys Vlasenko wrote:
> >
> > However, I'm an -Os guy. Expect -O2 people to disagree :)
> >
>
> The problem with -Os is that the compiler will make *any* tradeoffs to
> save a byte. It is really designed to squeeze as much code into a
> fixed-size chunk, e.g. a ROM, as possible.
>
> We have asked for an -Okernel mode from the gcc folks forever. It
> basically would mean "-Os except when really dumb."

If you want the best of both worlds perhaps you should reconsider Andy's
LTO patch? With -flto gcc automatically optimizes all functions that it
considers cold for size. So you could expect some code size savings even
with -O2 (or -O3).

--
Markus

2015-04-11 17:41:48

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86: Turn off GCC branch probability heuristics

On Apr 11, 2015 2:20 AM, "Ingo Molnar" <[email protected]> wrote:
>
> Another thing caught my attention (and I'm hijacking the RCU thread
> again): GCC's notion of how to place branches seems somewhat random,
> and rather bloaty.
>
> So I tried the experiment below on an x86 defconfig, turning off GCC's
> branch heuristics, and it's rather surprising:
>
> text data bss dec filename
> 12566447 1617840 1089536 15273823 vmlinux.fguess-branch-probability
> 11923593 1617840 1089536 14630969 vmlinux.fno-guess-branch-probability
>
> That's a 5.4% code size improvement!

Ugh. That's much larger than I would have expected. Is it just because
gcc ends up turning

if (a)
b
c

into

if (a) goto out-of-line
return:
c
..
out-of-line:
b
goto return;

a lot? Still, 5% sounds insanely big.

How much of that 5% comes from code alignment? Or was this on *top* of
the 1-byte alignment test?

Linus
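
[ A tiny, hypothetical C sketch of that transform; do_b() stands in for
the "b" above and is not a real function: ]

int do_b(void);				/* hypothetical helper */

int f(int a, int c)
{
	if (a)				/* if GCC guesses this branch is cold ...   */
		c += do_b();		/* ... this body may be emitted out of line */
	return c;			/* so the fall-through path stays dense      */
}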

2015-04-11 18:57:57

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] x86: Turn off GCC branch probability heuristics

On Sat, 11 Apr 2015, Linus Torvalds wrote:
> On Apr 11, 2015 2:20 AM, "Ingo Molnar" <[email protected]> wrote:
> >
> > Another thing caught my attention (and I'm hijacking the RCU thread
> > again): GCC's notion of how to place branches seems somewhat random,
> > and rather bloaty.
> >
> > So I tried the experiment below on an x86 defconfig, turning off GCC's
> > branch heuristics, and it's rather surprising:
> >
> > text data bss dec filename
> > 12566447 1617840 1089536 15273823 vmlinux.fguess-branch-probability
> > 11923593 1617840 1089536 14630969 vmlinux.fno-guess-branch-probability
> >
> > That's a 5.4% code size improvement!
>
> Ugh. That's much larger than I would have expected. Is it just because
> gcc ends up turning
>
> if (a)
> b
> c
>
> into
>
> if (a) goto out-of-line
> return:
> c
> ..
> out-of-line:
> b
> goto return;
>
> a lot? Still, 5% sounds insanely big.
>
> How much of that 5% comes from code alignment? Or was this on *top* of
> the 1-byte alignment test?

I think it's just the no-guess one:

text data dec patch reduction
7563475 1781048 10302987
7192973 1780024 9931461 no-guess -4.8%
7354819 1781048 958464 align-1 -2.7%
7192973 1780024 9931461 no-guess + align-1 -4.8%

So with the no-guess applied the align-1 does not matter anymore.

Thanks,

tglx

2015-04-11 19:36:01

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86: Turn off GCC branch probability heuristics

On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <[email protected]> wrote:
>
> I think it's just the no-guess one:
>
> text data dec patch reduction
> 7563475 1781048 10302987
> 7192973 1780024 9931461 no-guess -4.8%
> 7354819 1781048 958464 align-1 -2.7%
> 7192973 1780024 9931461 no-guess + align-1 -4.8%

Yeah, a 5% code expansion is a big deal. Sadly, it looks like
'no-guess' also disables our explicit likely/unlikely handling.

Damn. If it actually honored likely/unlikely, then we should just do
it - and manually fix up any places where we really care.

But the fact that it apparently entirely disables not just the
guesses, but our *explicit* likely/unlikely, means that we can't fix
up the mistakes.

And in many of the hot codepaths that likely/unlikely really does
matter. Some of our hottest paths have known "this basically never
happens" situations that we do *not* want to break up our L1 I$ over.
There's a number of functions that have been optimized to really
generate good code, and "-fno-guess-branch-probability" disables those
manual optimizations.

So we'd have no way to fix it for the cases that matter.

Sad.

It might be worth bringing this up with some gcc people. I added Jakub
to the cc. Any other gcc people suggestions?

Linus

2015-04-12 05:47:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Turn off GCC branch probability heuristics


* Linus Torvalds <[email protected]> wrote:

> On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <[email protected]> wrote:
> >
> > I think it's just the no-guess one:
> >
> > text data dec patch reduction
> > 7563475 1781048 10302987
> > 7192973 1780024 9931461 no-guess -4.8%
> > 7354819 1781048 958464 align-1 -2.7%
> > 7192973 1780024 9931461 no-guess + align-1 -4.8%
>
> Yeah, a 5% code expansion is a big deal. Sadly, it looks like
> 'no-guess' also disables our explicit likely/unlikely handling.

So I spent some time trying to get as much code size reduction as
possible via GCC optimization options, and the total savings possible
are 10.1%:

text data bss dec filename
12566391 1617840 1089536 15273767 vmlinux.vanilla
11416805 1617840 1089536 14124181 vmlinux.combo
10532552 1596080 1089536 13218168 vmlinux.Os

(combo patch attached below.)

The -Os savings are 19% total - but as you mentioned before, it's
sometimes achieved through unacceptable techniques.

Unfortunately I found no other GCC options to achieve what -Os does -
the missing 9% can only be achieved via -Os itself, with no cherry-picking
possible.

The other, smaller savings are:

+ # Reduces vmlinux size by 0.25%:
+ KBUILD_CFLAGS += -fno-caller-saves
+
+ # Reduces vmlinux size by 1.10%:
+ KBUILD_CFLAGS += -fno-inline-small-functions
+
+ # Reduces vmlinux size by about 0.95%:
+ KBUILD_CFLAGS += -fno-tree-ch

(each of them has to be double checked to make sure it leads to
nothing silly and unacceptable - I just blindly tried to find GCC
options that impacted kernel code size.)

Thanks,

Ingo


---
arch/x86/Makefile | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5ba2d9ce82dc..999e94685d12 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -77,10 +77,34 @@ else
KBUILD_AFLAGS += -m64
KBUILD_CFLAGS += -m64

+ # Pack jump targets tightly, don't align them to the default 16 bytes:
+ KBUILD_CFLAGS += -falign-jumps=1
+
+ # Pack functions tightly as well:
+ KBUILD_CFLAGS += -falign-functions=1
+
+ # Pack loops tightly as well:
+ KBUILD_CFLAGS += -falign-loops=1
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

+ #
+ # Don't guess branch probabilities, follow the code and unlikely()/likely() hints,
+ # which reduces vmlinux size by about 5.4%:
+ #
+ KBUILD_CFLAGS += -fno-guess-branch-probability
+
+ # Reduces vmlinux size by 0.25%:
+ KBUILD_CFLAGS += -fno-caller-saves
+
+ # Reduces vmlinux size by 1.10%:
+ KBUILD_CFLAGS += -fno-inline-small-functions
+
+ # Reduces vmlinux size by about 0.95%:
+ KBUILD_CFLAGS += -fno-tree-ch
+
# Use -mpreferred-stack-boundary=3 if supported.
KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=3)


2015-04-12 06:20:21

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH] x86: Turn off GCC branch probability heuristics

On 2015.04.12 at 07:47 +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <[email protected]> wrote:
> > >
> > > I think it's just the no-guess one:
> > >
> > > text data dec patch reduction
> > > 7563475 1781048 10302987
> > > 7192973 1780024 9931461 no-guess -4.8%
> > > 7354819 1781048 958464 align-1 -2.7%
> > > 7192973 1780024 9931461 no-guess + align-1 -4.8%
> >
> > Yeah, a 5% code expansion is a big deal. Sadly, it looks like
> > 'no-guess' also disables our explicit likely/unlikely handling.
>
> So I spent some time trying to get as much code size reduction as
> possible via GCC optimization options, and the total savings possible
> are 10.1%:
>
> text data bss dec filename
> 12566391 1617840 1089536 15273767 vmlinux.vanilla
> 11416805 1617840 1089536 14124181 vmlinux.combo
> 10532552 1596080 1089536 13218168 vmlinux.Os

If you like to play with more knobs you could explore the various
--param options that are listed in the gcc man page...

--
Markus

2015-04-12 07:41:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Turn off GCC branch probability heuristics


* Linus Torvalds <[email protected]> wrote:

> On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <[email protected]> wrote:
> >
> > I think it's just the no-guess one:
> >
> > text data dec patch reduction
> > 7563475 1781048 10302987
> > 7192973 1780024 9931461 no-guess -4.8%
> > 7354819 1781048 958464 align-1 -2.7%
> > 7192973 1780024 9931461 no-guess + align-1 -4.8%
>
> Yeah, a 5% code expansion is a big deal. Sadly, it looks like
> 'no-guess' also disables our explicit likely/unlikely handling.
>
> Damn. If it actually honored likely/unlikely, then we should just do
> it - and manually fix up any places where we really care.
>
> But the fact that it apparently entirely disables not just the
> guesses, but our *explicit* likely/unlikely, means that we can't fix
> up the mistakes.
>
> And in many of the hot codepaths that likely/unlikely really does
> matter. Some of our hottest paths have known "this basically never
> happens" situations that we do *not* want to break up our L1 I$ over.
> There's a number of functions that have been optimized to really
> generate good code, and "-fno-guess-branch-probability" disables those
> manual optimizations.
>
> So we'd have no way to fix it for the cases that matter.
>
> Sad.
>
> It might be worth bringing this up with some gcc people. I added Jakub
> to the cc. Any other gcc people suggestions?

(Skip to the 'More numbers' section below to see more measurements.)

So what would be nice to have is a GCC optimization option that
disables all branch probability heuristics (like
-fno-guess-branch-probability does), except for explicit
__builtin_expect() hints (which -fno-guess-branch-probability
unfortunately disables as well).

In other words, something like a -fno-branch-probability-heuristics
GCC option, or a "--param branch-probability-heuristics=0" option.

I found one related option:

--param predictable-branch-outcome=N

predictable-branch-outcome
When branch is predicted to be taken with probability
lower than this threshold (in percent), then it is
considered well predictable. The default is 10.

I tried the values 1 and 0 on a -O2 kernel (i.e. with guess-branch-probability
enabled), in the hope that it would turn off all non-__builtin_expect()
branch heuristics, but it only had a very minor size impact, in the 0.001% range.

More numbers:
-------------

I also measured the effect of our __builtin_expect()
(likely()/unlikely()) branch annotations.

As an experiment I mapped both likely()/unlikely() to the four
possible __builtin_expect() probability settings:

likely(): __builtin_expect(, 1) unlikely(): __builtin_expect(, 0)
likely(): __builtin_expect(, 0) unlikely(): __builtin_expect(, 1)
likely(): __builtin_expect(, 0) unlikely(): __builtin_expect(, 0)
likely(): __builtin_expect(, 1) unlikely(): __builtin_expect(, 1)

(the first mapping is the only one that makes sense, and it is the one
that is used by the kernel currently.)
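
[ For reference, the kernel's hint macros boil down to roughly the
following (as in include/linux/compiler.h, ignoring the branch-profiling
variants); the experiment above simply varies the second argument: ]

#define likely(x)	__builtin_expect(!!(x), 1)	/* the "10" mapping */
#define unlikely(x)	__builtin_expect(!!(x), 0)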

The goal of mixing up the probability mappings was to measure the full
'range' of code size impact that the kernel's explicit hints are
causing, versus the impact that GCC's own branch probability
heuristics are causing:

text data bss dec filename
12566383 1617840 1089536 15273759 vmlinux.expect=10 [==vanilla]
12460250 1617840 1089536 15167626 vmlinux.expect=01
12563332 1617840 1089536 15270708 vmlinux.expect=00
12463035 1617840 1089536 15170411 vmlinux.expect=11
12533382 1617840 1089536 15240758 vmlinux.no-expect

11923529 1617840 1089536 14630905 vmlinux.-fno-guess-branch-probability


[ This was done on a v4.1-rc7-ish vanilla -O2 kernel (no alignment
tweaks), using 'make defconfig' and 'make kvmconfig' on x86-64 to
turn it into a minimally bootable kernel. I used GCC 4.9.1. ]

The 'vmlinux.no-expect' kernel (the 'none' mapping) is a vanilla kernel
with all __builtin_expect() hints removed: i.e. GCC heuristics are deciding
all branch probabilities in the kernel.

So the code size 'range' that the kernel's own probability hints are
moving in is around 0.4%:

- the 'vanilla' mapping is the largest (not unexpectedly)

- the 'inverse' mapping is the smallest, by 0.8%

- the 00 and 11 mappings are about mid range, 0.4% away from the
extremes

- the 'none' mapping is also mid-range, which too is somewhat
expected: GCC heuristics would pick only part of our hints.

But note how the 'no GCC heuristics at all' setting reduces size
brutally, by 5.4%.

So if we were able to do that, while keeping __builtin_expect(), and
used our own heuristics only via __builtin_expect(), then we could
still possibly expect a total code size shrinkage of around 5.0%:

-5.4% code size reduction that comes from removal of all hints,

+0.4% code size increase between the 'none' and the '10' mappings
in the experiment above.

Note that even if I apply all the align=1 patches,
-fno-guess-branch-probability on top of that still gives me 2.6% code
size savings:

text data bss dec filename
12566383 1617840 1089536 15273759 vmlinux.expect=10 [==vanilla]
11923529 1617840 1089536 14630905 vmlinux.-fno-guess-branch-probability
11903663 1617840 1089536 14611039 vmlinux.align=1
11646102 1617840 1089536 14353478 vmlinux.align=1+fno-guess-branch-probability

The smallest vmlinux has:

- about 41,000 functions ('tT' symbols in System.map)
- about 300,000 branch/jump instructions (all objdump -d asm mnemonics starting with 'j')
- about 165,000 function calls (all objdump -d asm mnemonics matching 'call')
- about 2,330,000 instructions (all objdump -d asm mnemonics)

With align=1, GCC's heuristics added about 1,200 new branches,
1,350 new function calls and 76,900 instructions - altogether +257,561
bytes of code:

# of x86 instructions
2549742 vmlinux.expect=10 [==vanilla]
2391069 vmlinux.-fno-guess-branch-probability
2411568 vmlinux.align=1
2334595 vmlinux.align=1+fno-guess-branch-probability

(For completeness, the patch generating the smallest kernel is
attached below.)

The takeaway: the code size savings above are 7.9%. Even if we
restored __builtin_expect() hints, we'd probably still see a combined
7.5% code shrinkage.

That would be a rather significant I$ win, with very little cost that
I can see!

Thanks,

Ingo

---
arch/x86/Makefile | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5ba2d9ce82dc..a6d3feb90b97 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -77,10 +77,25 @@ else
KBUILD_AFLAGS += -m64
KBUILD_CFLAGS += -m64

+ # Pack jump targets tightly, don't align them to the default 16 bytes:
+ KBUILD_CFLAGS += -falign-jumps=1
+
+ # Pack functions tightly as well:
+ KBUILD_CFLAGS += -falign-functions=1
+
+ # Pack loops tightly as well:
+ KBUILD_CFLAGS += -falign-loops=1
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

+ #
+ # Don't guess branch probabilities, follow the code and unlikely()/likely() hints,
+ # which reduces vmlinux size by about 5.4%:
+ #
+ KBUILD_CFLAGS += -fno-guess-branch-probability
+
# Use -mpreferred-stack-boundary=3 if supported.
KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=3)

2015-04-12 07:56:19

by Mike Galbraith

[permalink] [raw]
Subject: Re: [PATCH] x86: Turn off GCC branch probability heuristics

On Sun, 2015-04-12 at 07:47 +0200, Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
> > On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <
> > [email protected]> wrote:
> > >
> > > I think it's just the no-guess one:
> > >
> > > text data dec patch reduction
> > > 7563475 1781048 10302987
> > > 7192973 1780024 9931461 no-guess -4.8%
> > > 7354819 1781048 958464 align-1 -2.7%
> > > 7192973 1780024 9931461 no-guess + align-1 -4.8%
> >
> > Yeah, a 5% code expansion is a big deal. Sadly, it looks like
> > 'no-guess' also disables our explicit likely/unlikely handling.
>
> So I spent some time trying to get as much code size reduction as
> possible via GCC optimization options, and the total savings possible
> are 10.1%:
>
> text data bss dec filename
> 12566391 1617840 1089536 15273767 vmlinux.vanilla
> 11416805 1617840 1089536 14124181 vmlinux.combo
> 10532552 1596080 1089536 13218168 vmlinux.Os
>
> (combo patch attached below.)
>
> The -Os savings are 19% total - but as you mentioned before, it's
> sometimes achieved through unacceptable techniques.
>
> Unfortunately I found no other GCC options to achieve what -Os does -
> the missing 9% can only be achieved via -Os itself, with no cherry-picking
> possible.
>
> The other, smaller savings are:
>
> + # Reduces vmlinux size by 0.25%:
> + KBUILD_CFLAGS += -fno-caller-saves
> +
> + # Reduces vmlinux size by 1.10%:
> + KBUILD_CFLAGS += -fno-inline-small-functions
> +
> + # Reduces vmlinux size by about 0.95%:
> + KBUILD_CFLAGS += -fno-tree-ch
>
> (each of them has to be double checked to make sure it leads to
> nothing silly and unacceptable - I just blindly tried to find GCC
> options that impacted kernel code size.)

Ew, my i4790 really hated this patch.

taskset 0xc pipe-test 1
avg
733.2 727.1 743.1 746.6 737.1 737 KHz 1.000 -gcc_twiddle
713.7 717.9 715.5 718.0 708.7 714 KHz .968 +gcc_twiddle

tbench.sh 8 30

3566.14 3560.91 3566.35 3556.57 3549.69 3559.93 MB/S 1.000 -gcc_twiddle
2862.18 2899.51 2888.74 2897.18 2878.63 2885.25 MB/S .810 +gcc_twiddle

-Mike

2015-04-12 08:07:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Turn off GCC branch probability heuristics


* Thomas Gleixner <[email protected]> wrote:

> > How much of that 5% comes from code alignment? Or was this on
> > *top* of the 1-byte alignment test?
>
> I think it's just the no-guess one:
>
> text data dec patch reduction
> 7563475 1781048 10302987
> 7192973 1780024 9931461 no-guess -4.8%
> 7354819 1781048 958464 align-1 -2.7%
> 7192973 1780024 9931461 no-guess + align-1 -4.8%
>
> So with the no-guess applied the align-1 does not matter anymore.

What kernel config and GCC version did you use?

I used x86-64 defconfig+kvmconfig with GCC 4.9.1, and it gave me:

text filename reduction
12566383 vmlinux.expect=10 [==vanilla]
11923529 vmlinux.-fno-guess-branch-probability -5.4%
11903663 vmlinux.align=1 -5.6%
11646102 vmlinux.align=1+fno-guess-branch-probability -7.9%

So there still appears to be an additional -2.5% to be gained from
turning off GCC branch heuristics.

x86 defconfig was derived from a distro kernel config and is still
pretty close to what distros typically enable, so it's a good
'typical' config to use for testing.

To double check that assumption I also tested a distro kernel .config
and looked at vmlinux sizes:

text filename reduction
12612201 vmlinux.expect=10 [==vanilla]
12107614 vmlinux.-fno-guess-branch-probability -4.1%
12021817 vmlinux.align=1 -4.9%
11846771 vmlinux.align=1+fno-guess-branch-probability -6.5%

this was cloned from a different major Linux distro than the one the
original x86 defconfig was derived from - still the vmlinux sizes are
pretty similar.

So x86 'make defconfig' measurements are pretty representative if you
are looking for a quick, independent way to measure 'typical Linux
distro' kernel characteristics.

( The only bigger difference is that FUNCTION_TRACER was turned on in
the distro config - which bloats function prologues a bit and thus
reduces the relative savings a bit. )

Thanks,

Ingo

2015-04-12 10:14:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries


* Markus Trippelsdorf <[email protected]> wrote:

> On 2015.04.10 at 06:18 -0700, H. Peter Anvin wrote:
> > On 04/10/2015 05:50 AM, Denys Vlasenko wrote:
> > >
> > > However, I'm an -Os guy. Expect -O2 people to disagree :)
> > >
> >
> > The problem with -Os is that the compiler will make *any* tradeoffs to
> > save a byte. It is really designed to squeeze as much code into a
> > fixed-size chunk, e.g. a ROM, as possible.
> >
> > We have asked for an -Okernel mode from the gcc folks forever. It
> > basically would mean "-Os except when really dumb."
>
> If you want the best of both worlds perhaps you should reconsider Andy's
> LTO patch? With -flto gcc automatically optimizes all functions that it
> considers cold for size. So you could expect some code size savings even
> with -O2 (or -O3).

In my (past) experience the main win from -flto is not due to better
hot/cold decisions, but simply due to more aggressive dead code
elimination. -flto has less of an effect on code that is actually
being executed.

Which isn't to be sneered at, but it's far less of a direct effect as
branch probabilities are, which cut to the core of most hotpaths in
the kernel.

Thanks,

Ingo

2015-04-12 10:15:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Turn off GCC branch probability heuristics


* Markus Trippelsdorf <[email protected]> wrote:

> On 2015.04.12 at 07:47 +0200, Ingo Molnar wrote:
> >
> > * Linus Torvalds <[email protected]> wrote:
> >
> > > On Sat, Apr 11, 2015 at 11:57 AM, Thomas Gleixner <[email protected]> wrote:
> > > >
> > > > I think it's just the no-guess one:
> > > >
> > > > text data dec patch reduction
> > > > 7563475 1781048 10302987
> > > > 7192973 1780024 9931461 no-guess -4.8%
> > > > 7354819 1781048 958464 align-1 -2.7%
> > > > 7192973 1780024 9931461 no-guess + align-1 -4.8%
> > >
> > > Yeah, a 5% code expansion is a big deal. Sadly, it looks like
> > > 'no-guess' also disables our explicit likely/unlikely handling.
> >
> > So I spent some time trying to get as much code size reduction as
> > possible via GCC optimization options, and the total savings possible
> > are 10.1%:
> >
> > text data bss dec filename
> > 12566391 1617840 1089536 15273767 vmlinux.vanilla
> > 11416805 1617840 1089536 14124181 vmlinux.combo
> > 10532552 1596080 1089536 13218168 vmlinux.Os
>
> If you like to play with more knobs you could explore the various
> --param options that are listed in the gcc man page...

Well, I had a look, they are rather incomplete (at least as far as
branch optimizations go), and I wouldn't want to rely on them for
production kernel patches in any case, only on the more well-known
compiler options.

Thanks,

Ingo

2015-04-12 21:15:11

by Jan Hubicka

[permalink] [raw]
Subject: Re: [PATCH] x86: Turn off GCC branch probability heuristics

Hello,
as a GCC developer, this is definitely an interesting observation for me.
Let me briefly explain what happens here.

-fguess-branch-probability does a lot more than just BB reordering. The way
GCC works is that it first guesses the probability of every branch and then
uses the probabilities to estimate a profile (that is, the frequency at which
every BB is executed). This profile is then used throughout the optimization
queue to make decisions that either bloat the code (such as loop header
copying) or that pessimize one code path to improve another (such as spill
code placement).

This is based on
T Ball, JR Larus: Branch prediction for free
and
Y Wu, JR Larus: Static branch frequency and program profile analysis.
In GCC it was implemented in 2000
(http://www.ucw.cz/~hubicka/papers/proj/index.html), and the earlier ad-hoc
heuristics (which, for example, placed spill code depending on the loop depth
of the basic blocks) were slowly replaced by profile-driven ones. (Most
compilers work this way nowadays.) This, however, also means that the
compiler is quite dependent on the profile.

-fno-guess-branch-probability leaves all frequencies at 0 and all
edge probabilities at 0%. This makes most of these heuristics shut
themselves off or possibly behave strangely. Disabling all the heuristics
except for __builtin_expect() would likely result in rather odd profiles,
too, where frequencies drop to 50% after every branch.

I am very interested in getting -O2 code size down (GCC 5 has quite a few
improvements in this area, mostly motivated by C++ code and LTO). We
should try to identify which optimizations cause most of the bloat and
start from those.

http://www.ucw.cz/~hubicka/papers/amd64/node4.html has 13-year-old data on
performance and code size with this flag. Back then branch prediction
improved performance by 4.1% on SPECint at the expense of a 5.66% code size
cost. A very large portion of the bloat was the basic block reordering. I
will try to re-run the benchmarks.

In 2003 the largest portion of the growth came from basic block reordering
(-freorder-blocks), which does some tail code duplication that helped the AMD
K7/K8 generation chips it was tested on because of their decoder limitations.
The basic block reordering pass was implemented in 2000 based on
http://dl.acm.org/citation.cfm?id=305178 and, while it was improved in the
meantime (primarily by Google adding special cases and hot/cold
partitioning), it is still the same software trace algorithm. So it may be
time to rewrite it, which should not be too hard.

Would it be possible to identify the functions that change the most and look into those?

Honza
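
[ Since loop header copying comes up both here and in the -fno-tree-ch
knob in Ingo's combo patch, a small conceptual sketch of that transform
(hand-written illustration, not actual GCC output): ]

int sum(const int *p, int n)		/* as written */
{
	int s = 0;

	while (n > 0) {
		s += *p++;
		n--;
	}
	return s;
}

int sum_after_ch(const int *p, int n)	/* after copying the loop header */
{
	int s = 0;

	if (n > 0) {			/* duplicated entry test */
		do {
			s += *p++;
			n--;
		} while (n > 0);	/* loop test now at the bottom */
	}
	return s;
}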

2015-04-12 23:44:59

by Maciej W. Rozycki

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Fri, 10 Apr 2015, Linus Torvalds wrote:

> It turns out that gcc's -Os is just horrible nasty crap. It doesn't
> actually make good tradeoffs for code density, because it doesn't make
> any tradeoffs at all. It tries to choose small code, even when it's
> ridiculously bad small code.
>
> For example, a 24-byte static memcpy is best done as three quad-word
> load/store pairs. That's very cheap, and not at all unreasonable.
>
> But what does gcc do? It does a "rep movsl".
>
> Seriously. That's *shit*. It absolutely kills performance on some very
> critical code.
>
> I'm not making that up. Try "-O2" and "-Os" on the appended trivial
> code. Yes, the "rep movsl" is smaller, but it's incredibly expensive,
> particularly if the result is partially used afterwards.
>
> And I'm not a hater of "rep movs" - not at all. I think that "rep
> movsb" is basically a perfect way to tell the CPU "do an optimized
> memcpy with whatever cache situation you have". So I'm a big fan of
> the string instructions, but only when appropriate. And "appropriate"
> here very much includes "I don't know the memory copy size, so I'm
> going to call out to some complex generic code that does all kinds of
> size checks and tricks".
>
> Replacing three pairs of "mov" instructions with a "rep movs" is insane.
>
> (There are a couple of other examples of that kind of issues with
> "-Os". Like using "imul $15" instead of single shift-by-4 and
> subtract. Again, the "imul" is certainly smaller, but can have quite
> bad latency and throughput issues).
>
> So I'm no longer a fan of -Os. It disables too many obviously good
> code optimizations.

I think the issue is that -Os is a binary yes/no option, without further
tuning of how desperate GCC is asked to be about saving code size. That's
what we'd probably have for speed optimisation too if there was only a
single -O GCC option -- equivalent to today's -O3. Instead, however, GCC
has -O1, -O2, -O3, which gradually turn on more and more possibly insane
optimisations (plus a load of -f options for further fine-tuning).

So a possible complementary solution for size saving could be keeping -Os
as it is for people's build recipe compatibility, and then having, say, -Os1,
-Os2, -Os3 enable more and more insane optimisations, on the size side for a
change. In that case -Os3 would be equivalent to today's -Os. There
could be further fine-tuning options to control things like the string moves
you mention.

The thing here is someone would have to implement all of it and I gather
GCC folks have more than enough stuff to do already. I'm fairly sure they
wouldn't decline a patch though. ;)

Maciej

2015-04-13 16:23:15

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 2015.04.12 at 12:14 +0200, Ingo Molnar wrote:
> In my (past) experience the main win from -flto is not due to better
> hot/cold decisions, but simply due to more aggressive dead code
> elimination. -flto has less of an effect on code that is actually
> being executed.
>
> Which isn't to be sneered at, but it's far less of a direct effect as
> branch probabilities are, which cut to the core of most hotpaths in
> the kernel.

I did some measurements with gcc-5.1-RC on X86_64 using Andi's latest
LTO kernel patch for 4.0. With my simple monolithic .config the code
size savings are below 1%. That is lower than I've expected.

--
Markus

2015-04-13 17:26:30

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 2015.04.13 at 18:23 +0200, Markus Trippelsdorf wrote:
> On 2015.04.12 at 12:14 +0200, Ingo Molnar wrote:
> > In my (past) experience the main win from -flto is not due to better
> > hot/cold decisions, but simply due to more aggressive dead code
> > elimination. -flto has less of an effect on code that is actually
> > being executed.
> >
> > Which isn't to be sneered at, but it's far less of a direct effect as
> > branch probabilities are, which cut to the core of most hotpaths in
> > the kernel.
>
> I did some measurements with gcc-5.1-RC on X86_64 using Andi's latest
> LTO kernel patch for 4.0. With my simple monolithic .config the code
> size savings are below 1%. That is lower than I've expected.

I must have made a measurement mistake above, because the actual code
size savings are roughly 5%:

text data bss dec filename
8746230 970072 802816 10519118 ./vmlinux gcc-5 (lto)
9202488 978512 811008 10992008 ./vmlinux gcc-5
8686246 1009104 811008 10506358 ./vmlinux gcc-4.9 (lto)
9228994 992976 815104 11037074 ./vmlinux gcc-4.9

--
Markus


Attachments:
config (70.08 kB)

2015-04-13 18:31:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Mon, Apr 13, 2015 at 10:26 AM, Markus Trippelsdorf
<[email protected]> wrote:
>
> I must have made a measurement mistake above, because the actual code
> size savings are roughly 5%:

Can you check against the -fno-guess-branch-probability output?

Does lto (without pgo) perhaps end up undoing a lot of the random
"branch out and back" noise?

Linus

2015-04-13 19:09:17

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 2015.04.13 at 11:31 -0700, Linus Torvalds wrote:
> On Mon, Apr 13, 2015 at 10:26 AM, Markus Trippelsdorf
> <[email protected]> wrote:
> >
> > I must have made a measurement mistake above, because the actual code
> > size savings are roughly 5%:
>
> Can you check against the -fno-guess-branch-probability output?

text data bss dec filename
8746230 970072 802816 10519118 ./vmlinux gcc-5 (lto)
9202488 978512 811008 10992008 ./vmlinux gcc-5
8036915 970296 802816 9810027 ./vmlinux gcc-5 (lto -fno-guess-branch-probability)
8593615 978512 811008 10383135 ./vmlinux gcc-5 (-fno-guess-branch-probability)

> Does lto (without pgo) perhaps end up undoing a lot of the random
> "branch out and back" noise?

As Honza wrote somewhere else in this thread, gcc uses
-fguess-branch-probability to get a profile estimate that is then used
throughout the whole optimization chain. So disabling this option
really makes no sense and will result in significant performance loss.

On the other hand the LTO code size savings should not affect
performance negatively at all.

--
Markus

2015-04-14 05:38:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries


* Markus Trippelsdorf <[email protected]> wrote:

> On 2015.04.13 at 11:31 -0700, Linus Torvalds wrote:
> > On Mon, Apr 13, 2015 at 10:26 AM, Markus Trippelsdorf
> > <[email protected]> wrote:
> > >
> > > I must have made a measurement mistake above, because the actual code
> > > size savings are roughly 5%:
> >
> > Can you check against the -fno-guess-branch-probability output?
>
> text data bss dec filename
> 8746230 970072 802816 10519118 ./vmlinux gcc-5 (lto)
> 9202488 978512 811008 10992008 ./vmlinux gcc-5
> 8036915 970296 802816 9810027 ./vmlinux gcc-5 (lto -fno-guess-branch-probability)
> 8593615 978512 811008 10383135 ./vmlinux gcc-5 (-fno-guess-branch-probability)

Just to make sure, could you please also apply the 3 alignment patches
attached below? There's a lot of noise from extra alignment.

Having said that, LTO should have three main effects:

1) better cross-unit inlining decisions

2) better register allocation and clobbering knowledge (if a small
function is known not to clobber caller-saved registers, then the
saving can be skipped)

3) better dead code elimination

1)-2) are probably worth the price, 3) in isolation isn't. So we'd have
to estimate how significant each one is, to judge the value of LTO -
but I haven't seen any effort so far to disambiguate them.

_Possibly_ if you build kernel/built-in.o only, and compared its
sizes, that would help a bit, because the core kernel has very little
dead code, giving a fairer estimation of 'true' optimizations.

Thanks,

Ingo

======

arch/x86/Makefile | 9 +++++++++
1 file changed, 9 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5ba2d9ce82dc..10989a73b986 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -77,6 +77,15 @@ else
KBUILD_AFLAGS += -m64
KBUILD_CFLAGS += -m64

+ # Pack jump targets tightly, don't align them to the default 16 bytes:
+ KBUILD_CFLAGS += -falign-jumps=1
+
+ # Pack functions tightly as well:
+ KBUILD_CFLAGS += -falign-functions=1
+
+ # Pack loops tightly as well:
+ KBUILD_CFLAGS += -falign-loops=1
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

2015-04-14 08:23:59

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 2015.04.14 at 07:38 +0200, Ingo Molnar wrote:
>
> Just to make sure, could you please also apply the 3 alignment patches
> attached below? There's a lot of noise from extra alignment.

Here's an updated table:

text data bss dec filename
8746230 970072 802816 10519118 ./vmlinux gcc-5 (lto)
9202488 978512 811008 10992008 ./vmlinux gcc-5
8036915 970296 802816 9810027 ./vmlinux gcc-5 (lto -fno-guess-branch-probability)
8593615 978512 811008 10383135 ./vmlinux gcc-5 (-fno-guess-branch-probability)
8202614 970072 802816 9975502 ./vmlinux gcc-5 (lto + Ingo's patch)
8801016 978512 811008 10590536 ./vmlinux gcc-5 (Ingo's patch)
8733943 952088 798720 10484751 ./vmlinux gcc-5 (lto + -malign-data=abi)
9186105 958320 806912 10951337 ./vmlinux gcc-5 (-malign-data=abi)
8190327 952088 798720 9941135 ./vmlinux gcc-5 (lto + Ingo's patch + -malign-data=abi)
8784633 958320 806912 10549865 ./vmlinux gcc-5 (Ingo's patch + -malign-data=abi)

For the "lto + Ingo's patch + -malign-data=abi" combination there is a
10% text size reduction.

-malign-data is a new option for gcc-5 that controls how the compiler
aligns variables. "abi" aligns variables according to psABI and give the
tightest packing.
"compat" is the default and uses an increased alignment value compatible
with gcc-4.8. But this should be unnecessary for the kernel.
(The other possible value is "cache", which increases the alignment
value to match the cache line size.)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5ba2d9ce82dc..93702eef1684 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -77,6 +77,9 @@ else
KBUILD_AFLAGS += -m64
KBUILD_CFLAGS += -m64

+ # Align variables according to psABI
+ KBUILD_CFLAGS += -malign-data=abi
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

> Having said that, LTO should have three main effects:
>
> 1) better cross-unit inlining decisions
>
> 2) better register allocation and clobbering knowledge (if a small
> function is known not to clobber caller-saved registers, then the
> saving can be skipped)
>
> 3) better dead code elimination
>
> 1)-2) is probably worth the price, 3) in isolation isn't. So we'd have
> to estimate which one is how significant, to judge the value of LTO -
> but I haven't seen any effort so far to disambiguate it.

For a high level overview of LTO in gcc-5 see Honza's recent article:
http://hubicka.blogspot.de/2015/04/GCC5-IPA-LTO-news.html

I haven't looked at the generated code at all yet, because the kernel is
huge and I'm not sure where to best look for specific changes.

> _Possibly_ if you build kernel/built-in.o only, and compared its
> sizes, that would help a bit, because the core kernel has very little
> dead code, giving a fairer estimation of 'true' optimizations.

This isn't possible, because kernel/built-in.o is a 'slim' LTO object
file that only contains compressed LTO sections with the compiler's
internal representation.

--
Markus

2015-04-14 09:16:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries


* Markus Trippelsdorf <[email protected]> wrote:

> On 2015.04.14 at 07:38 +0200, Ingo Molnar wrote:
> >
> > Just to make sure, could you please also apply the 3 alignment patches
> > attached below? There's a lot of noise from extra alignment.
>
> Here's an updated table:
>
> text data bss dec filename
> 8746230 970072 802816 10519118 ./vmlinux gcc-5 (lto)
> 9202488 978512 811008 10992008 ./vmlinux gcc-5
> 8036915 970296 802816 9810027 ./vmlinux gcc-5 (lto -fno-guess-branch-probability)
> 8593615 978512 811008 10383135 ./vmlinux gcc-5 (-fno-guess-branch-probability)
> 8202614 970072 802816 9975502 ./vmlinux gcc-5 (lto + Ingo's patch)
> 8801016 978512 811008 10590536 ./vmlinux gcc-5 (Ingo's patch)
> 8733943 952088 798720 10484751 ./vmlinux gcc-5 (lto + -malign-data=abi)
> 9186105 958320 806912 10951337 ./vmlinux gcc-5 (-malign-data=abi)
> 8190327 952088 798720 9941135 ./vmlinux gcc-5 (lto + Ingo's patch + -malign-data=abi)
> 8784633 958320 806912 10549865 ./vmlinux gcc-5 (Ingo's patch + -malign-data=abi)
>
> For the "lto + Ingo's patch + -malign-data=abi" combination there is a
> 10% text size reduction.

Let's rename "Ingo's patch" to "code_align=1". The interesting one
would be to compare:

code_align=1 + -fno-guess-branch-probability
vs.
lto + code_align=1 + -fno-guess-branch-probability

Or rather, you could try "Ingo's combo patch" further below, with and
without LTO.

I'd expect LTO to still be in the 5% reduction range.

> -malign-data is a new option for gcc-5 that controls how the
> compiler aligns variables. "abi" aligns variables according to psABI
> and give the tightest packing.

I'm not so sure about that one, our data access patterns are usually a
lot more well thought out than our code alignment (which is really
mostly compiler controlled). It also gives limited savings:

9202488 vmlinux gcc-5
9186105 vmlinux gcc-5 (-malign-data=abi)

Which is 0.1%. I've got a handful of options in that size range:

+ # Reduces vmlinux size by 0.25%:
+ KBUILD_CFLAGS += -fno-caller-saves
+
+ # Reduces vmlinux size by 1.10%:
+ KBUILD_CFLAGS += -fno-inline-small-functions
+
+ # Reduces vmlinux size by about 0.95%:
+ KBUILD_CFLAGS += -fno-tree-ch

but obviously they are more obscure and thus riskier. Find below an
updated "Ingo's combo patch". It gives more than 10% savings here on
x86 defconfig using gcc 4.9, without LTO.

Thanks,

Ingo

---------------->
arch/x86/Makefile | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5ba2d9ce82dc..999e94685d12 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -77,10 +77,34 @@ else
KBUILD_AFLAGS += -m64
KBUILD_CFLAGS += -m64

+ # Pack jump targets tightly, don't align them to the default 16 bytes:
+ KBUILD_CFLAGS += -falign-jumps=1
+
+ # Pack functions tightly as well:
+ KBUILD_CFLAGS += -falign-functions=1
+
+ # Pack loops tightly as well:
+ KBUILD_CFLAGS += -falign-loops=1
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

+ #
+ # Don't guess branch probabilities, follow the code and unlikely()/likely() hints,
+ # which reduces vmlinux size by about 5.4%:
+ #
+ KBUILD_CFLAGS += -fno-guess-branch-probability
+
+ # Reduces vmlinux size by 0.25%:
+ KBUILD_CFLAGS += -fno-caller-saves
+
+ # Reduces vmlinux size by 1.10%:
+ KBUILD_CFLAGS += -fno-inline-small-functions
+
+ # Reduces vmlinux size by about 0.95%:
+ KBUILD_CFLAGS += -fno-tree-ch
+
# Use -mpreferred-stack-boundary=3 if supported.
KBUILD_CFLAGS += $(call cc-option,-mpreferred-stack-boundary=3)

2015-04-14 11:17:19

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 2015.04.14 at 11:16 +0200, Ingo Molnar wrote:
>
> * Markus Trippelsdorf <[email protected]> wrote:
>
> > On 2015.04.14 at 07:38 +0200, Ingo Molnar wrote:
> > >
> > > Just to make sure, could you please also apply the 3 alignment patches
> > > attached below? There's a lot of noise from extra alignment.
> >
> > Here's an updated table:
> >
> > text data bss dec filename
> > 8746230 970072 802816 10519118 ./vmlinux gcc-5 (lto)
> > 9202488 978512 811008 10992008 ./vmlinux gcc-5
> > 8036915 970296 802816 9810027 ./vmlinux gcc-5 (lto -fno-guess-branch-probability)
> > 8593615 978512 811008 10383135 ./vmlinux gcc-5 (-fno-guess-branch-probability)
> > 8202614 970072 802816 9975502 ./vmlinux gcc-5 (lto + Ingo's patch)
> > 8801016 978512 811008 10590536 ./vmlinux gcc-5 (Ingo's patch)
> > 8733943 952088 798720 10484751 ./vmlinux gcc-5 (lto + -malign-data=abi)
> > 9186105 958320 806912 10951337 ./vmlinux gcc-5 (-malign-data=abi)
> > 8190327 952088 798720 9941135 ./vmlinux gcc-5 (lto + Ingo's patch + -malign-data=abi)
> > 8784633 958320 806912 10549865 ./vmlinux gcc-5 (Ingo's patch + -malign-data=abi)
> >
> > For the "lto + Ingo's patch + -malign-data=abi" combination there is a
> > 10% text size reduction.
>
> Lets rename "Ingo's patch" to "code_align=1". The interesting one
> would be to compare:
>
> code_align=1 + -fno-guess-branch-probability
> vs.
> lto + code_align=1 + -fno-guess-branch-probability
>
> I'd expect LTO to still be in the 5% reduction range.

Yes:

text data bss dec
7886231 970296 802816 9659343 lto + code_align=1 + -fno-guess-branch-probability
8398284 978512 811008 10187804 code_align=1 + -fno-guess-branch-probability

> > -malign-data is a new option for gcc-5 that controls how the
> > compiler aligns variables. "abi" aligns variables according to psABI
> > and give the tightest packing.
>
> I'm not so sure about that one, our data access patterns are usually a
> lot more well thought out than our code alignment (which is really
> mostly compiler controlled). It also gives limited savings:
>
> 9202488 vmlinux gcc-5
> 9186105 vmlinux gcc-5 (-malign-data=abi)
>
> Which is 0.1%. I've got a handful of options in that size range:
>
> + # Reduces vmlinux size by 0.25%:
> + KBUILD_CFLAGS += -fno-caller-saves
> +
> + # Reduces vmlinux size by 1.10%:
> + KBUILD_CFLAGS += -fno-inline-small-functions
> +
> + # Reduces vmlinux size by about 0.95%:
> + KBUILD_CFLAGS += -fno-tree-ch
>
> but obviously they are more obscure and thus riskier. Find below an
> updated "Ingo's combo patch". It gives more than 10% savings here on
> x86 defconfig using gcc 4.9, without LTO.

Well obviously, if you do not care about performance you can reduce the
text size further and further. But what is interesting is to keep the
performance up (or even increase it) and still reduce the text size.
And that is what the "lto + Ingo's patch + -malign-data=abi" kernel
hopefully achieves (but it would need further benchmarking to confirm
this claim).

--
Markus

2015-04-14 12:10:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries


* Markus Trippelsdorf <[email protected]> wrote:

> > I'm not so sure about that one, our data access patterns are
> > usually a lot more well thought out than our code alignment (which
> > is really mostly compiler controlled). It also gives limited
> > savings:
> >
> > 9202488 vmlinux gcc-5
> > 9186105 vmlinux gcc-5 (-malign-data=abi)
> >
> > Which is 0.1%. I've got a handful of options in that size range:
> >
> > + # Reduces vmlinux size by 0.25%:
> > + KBUILD_CFLAGS += -fno-caller-saves
> > +
> > + # Reduces vmlinux size by 1.10%:
> > + KBUILD_CFLAGS += -fno-inline-small-functions
> > +
> > + # Reduces vmlinux size by about 0.95%:
> > + KBUILD_CFLAGS += -fno-tree-ch
> >
> > but obviously they are more obscure and thus riskier. Find below
> > an updated "Ingo's combo patch". It gives more than 10% savings
> > here on x86 defconfig using gcc 4.9, without LTO.
>
> Well obviously, if you do not care about performance you can reduce
> the text size further and further. [...]

Yes, but I picked GCC options that I don't think impact performance
negatively and offer a sizable debloating effect. Especially with
inlining, if code size increases it's probably a net loss.

> [...] But what is interesting is to keep the performance up (or even
> increase it) and still reduce the text size.

By my (admittedly quick) review I think those 3 extra options I added
still generate pretty OK code in the end. I.e. they are not like -Os
that generates utter crap to save a byte.

Thanks,

Ingo

2015-05-14 12:00:27

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 04/10/2015 02:08 PM, Ingo Molnar wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> So restructure the loop a bit, to get much tighter code:
>>
>> 0000000000000030 <mutex_spin_on_owner.isra.5>:
>> 30: 55 push %rbp
>> 31: 65 48 8b 14 25 00 00 mov %gs:0x0,%rdx
>> 38: 00 00
>> 3a: 48 89 e5 mov %rsp,%rbp
>> 3d: 48 39 37 cmp %rsi,(%rdi)
>> 40: 75 1e jne 60 <mutex_spin_on_owner.isra.5+0x30>
>> 42: 8b 46 28 mov 0x28(%rsi),%eax
>> 45: 85 c0 test %eax,%eax
>> 47: 74 0d je 56 <mutex_spin_on_owner.isra.5+0x26>
>> 49: f3 90 pause
>> 4b: 48 8b 82 10 c0 ff ff mov -0x3ff0(%rdx),%rax
>> 52: a8 08 test $0x8,%al
>> 54: 74 e7 je 3d <mutex_spin_on_owner.isra.5+0xd>
>> 56: 31 c0 xor %eax,%eax
>> 58: 5d pop %rbp
>> 59: c3 retq
>> 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
>> 60: b8 01 00 00 00 mov $0x1,%eax
>> 65: 5d pop %rbp
>> 66: c3 retq
>
> Btw., totally off topic, the following NOP caught my attention:
>
>> 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)
>
> That's a dead NOP that bloats the function a bit, added for the 16 byte
> alignment of one of the jump targets.
>
> I realize that x86 CPU manufacturers recommend 16-byte jump target
> alignments (it's in the Intel optimization manual), but the cost of
> that is very significant:
>
> text data bss dec filename
> 12566391 1617840 1089536 15273767 vmlinux.align.16-byte
> 12224951 1617840 1089536 14932327 vmlinux.align.1-byte
>
> By using 1 byte jump target alignment (i.e. no alignment at all) we
> get an almost 3% reduction in kernel size (!) - and a probably similar
> reduction in I$ footprint.
>
> So I'm wondering, is the 16 byte jump target optimization suggestion
> really worth this price? The patch below boots fine and I've not
> measured any noticeable slowdown, but I've not tried hard.
>
> Now, the usual justification for jump target alignment is the
> following: with 16 byte instruction-cache cacheline sizes, if a
> forward jump is aligned to cacheline boundary then prefetches will
> start from a new cacheline.
>
> But I think that argument is flawed for typical optimized kernel code
> flows: forward jumps often go to 'cold' (uncommon) pieces of code, and
> aligning cold code to cache lines does not bring a lot of advantages
> (they are uncommon), while it causes collateral damage:
>
> - their alignment 'spreads out' the cache footprint, it shifts
> followup hot code further out
>
> - plus it slows down even 'cold' code that immediately follows 'hot'
> code (like in the above case), which could have benefited from the
> partial cacheline that comes off the end of hot code.
>
> What do you guys think about this? I think we should seriously
> consider relaxing our alignment defaults.

Looks like nobody objected. I think it's ok to submit
this patch for real.

> + # Align jump targets to 1 byte, not the default 16 bytes:
> + KBUILD_CFLAGS += -falign-jumps=1

2015-05-14 18:17:13

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries


* Denys Vlasenko <[email protected]> wrote:

> > What do you guys think about this? I think we should seriously
> > consider relaxing our alignment defaults.
>
> Looks like nobody objected. I think it's ok to submit
> this patch for real.

Yeah, so my plan is to apply the following three changes from that
discussion:

--- tip.orig/arch/x86/Makefile
+++ tip/arch/x86/Makefile
@@ -77,6 +77,15 @@ else
KBUILD_AFLAGS += -m64
KBUILD_CFLAGS += -m64

+ # Pack jump targets tightly, don't align them to the default 16 bytes:
+ KBUILD_CFLAGS += -falign-jumps=1
+
+ # Pack functions tightly as well:
+ KBUILD_CFLAGS += -falign-functions=1
+
+ # Pack loops tightly as well:
+ KBUILD_CFLAGS += -falign-loops=1
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

... and not do -fno-guess-branch-probability, because it destroys
likely()/unlikely() annotations.

Which is a pity, considering the size effect on defconfig:

text data bss dec filename
12566383 1617840 1089536 15273759 vmlinux.expect=10 [==vanilla]
11923529 1617840 1089536 14630905 vmlinux.-fno-guess-branch-probability
11903663 1617840 1089536 14611039 vmlinux.align=1
11646102 1617840 1089536 14353478 vmlinux.align=1+fno-guess-branch-probability

I.e. 2.6% of savings on top of the above three patches, while the
effect of our hot/cold branch annotations is only around 0.4%, so if
GCC preserved our annotations under -fno-guess-branch-probability we'd
be good by at least 2%.

But GCC doesn't.
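For reference, this kind of per-flag size comparison can be redone with
kbuild's KCFLAGS override - a rough sketch, assuming an already configured
defconfig tree (the flag list below is just illustrative):

# Rebuild the same tree with each candidate flag and record the text size.
# kbuild should notice the changed command line and rebuild the affected
# objects; do a clean build per flag if you want to be completely safe.
for flags in "" \
             "-falign-jumps=1" \
             "-falign-functions=1" \
             "-falign-loops=1" \
             "-fno-guess-branch-probability"; do
        make -s -j"$(nproc)" KCFLAGS="$flags" vmlinux
        printf '%-32s ' "${flags:-baseline}"
        size vmlinux | tail -n1
done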

There were also these other changes I tested:

+ # Reduces vmlinux size by 0.25%:
+ KBUILD_CFLAGS += -fno-caller-saves
+
+ # Reduces vmlinux size by 1.10%:
+ KBUILD_CFLAGS += -fno-inline-small-functions
+
+ # Reduces vmlinux size by about 0.95%:
+ KBUILD_CFLAGS += -fno-tree-ch

We could maybe consider -fno-caller-saves. What do you think about
that option?

-fno-inline-small-functions is probably a bad idea, and -fno-tree-ch
is probably a bad idea as well - it's a dangerously rare option that
could break in unexpected ways.

Thanks,

Ingo

2015-05-14 19:06:30

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On 05/14/2015 08:17 PM, Ingo Molnar wrote:
> There were also these other changes I tested:
>
> + # Reduces vmlinux size by 0.25%:
> + KBUILD_CFLAGS += -fno-caller-saves
> +
> + # Reduces vmlinux size by 1.10%:
> + KBUILD_CFLAGS += -fno-inline-small-functions
> +
> + # Reduces vmlinux size by about 0.95%:
> + KBUILD_CFLAGS += -fno-tree-ch
>
> We could maybe consider -fno-caller-saves. What do you think about
> that option?

Quick googling didn't even find what it does.
It would need a comment :)

2015-05-14 19:44:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries


* Denys Vlasenko <[email protected]> wrote:

> On 05/14/2015 08:17 PM, Ingo Molnar wrote:
> > There were also these other changes I tested:
> >
> > + # Reduces vmlinux size by 0.25%:
> > + KBUILD_CFLAGS += -fno-caller-saves
> > +
> > + # Reduces vmlinux size by 1.10%:
> > + KBUILD_CFLAGS += -fno-inline-small-functions
> > +
> > + # Reduces vmlinux size by about 0.95%:
> > + KBUILD_CFLAGS += -fno-tree-ch
> >
> > We could maybe consider -fno-caller-saves. What do you think about
> > that option?
>
> Quick googling didn't even find what it does.
> It would need a comment :)

The GCC manpage says:

-fcaller-saves
Enable allocation of values to registers that are clobbered
by function calls, by emitting extra instructions to save
and restore the registers around such calls. Such
allocation is done only when it seems to result in better
code.

This option is always enabled by default on certain
machines, usually those which have no call-preserved
registers to use instead.

Enabled at levels -O2, -O3, -Os.

but I have not checked the disassembly.
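For what it's worth, the effect can be eyeballed in isolation by compiling
a small file with and without the option and diffing the assembly - a
sketch only, with a made-up test function that may or may not show a
difference on a given target:

# Toy function with a value live across a call - the case -fcaller-saves
# is about. Compare the generated assembly with and without the option.
cat > caller-saves-test.c <<'EOF'
int ext(int);

int f(int a, int b)
{
	int t = a * 3 + b;	/* live across the call below */
	return ext(t) + t;
}
EOF
gcc -O2 -fcaller-saves    -S -o with-caller-saves.s    caller-saves-test.c
gcc -O2 -fno-caller-saves -S -o without-caller-saves.s caller-saves-test.c
diff -u without-caller-saves.s with-caller-saves.s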

0.25% defconfig reduction still seems significant.

Thanks,

Ingo

Subject: [tip:x86/asm] x86: Align jump targets to 1-byte boundaries

Commit-ID: be6cb02779ca74d83481f017db21578cfe92891c
Gitweb: http://git.kernel.org/tip/be6cb02779ca74d83481f017db21578cfe92891c
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 10 Apr 2015 14:08:46 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 15 May 2015 11:04:28 +0200

x86: Align jump targets to 1-byte boundaries

The following NOP in a hot function caught my attention:

> 5a: 66 0f 1f 44 00 00 nopw 0x0(%rax,%rax,1)

That's a dead NOP that bloats the function a bit, added for the
default 16-byte alignment that GCC applies for jump targets.

I realize that x86 CPU manufacturers recommend 16-byte jump
target alignments (it's in the Intel optimization manual),
to help their relatively narrow decoder prefetch alignment
and uop cache constraints, but the cost of that is very
significant:

text data bss dec filename
12566391 1617840 1089536 15273767 vmlinux.align.16-byte
12224951 1617840 1089536 14932327 vmlinux.align.1-byte

By using 1-byte jump target alignment (i.e. no alignment at all)
we get an almost 3% reduction in kernel size (!) - and a
probably similar reduction in I$ footprint.

Now, the usual justification for jump target alignment is the
following:

- modern decoders tend to have 16-byte (effective) decoder
prefetch windows. (AMD documents it higher but measurements
suggest the effective prefetch window on current uarchs is
still around 16 bytes)

- on Intel there's also the uop-cache with cachelines that have
16-byte granularity and limited associativity.

- older x86 uarchs had a penalty for decoder fetches that crossed
16-byte boundaries. These limits are mostly gone from recent
uarchs.

So if a forward jump target is aligned to a cacheline boundary then
prefetches will start from a new prefetch-cacheline and there's a
higher chance of decoding in fewer steps and packing tightly.

But I think that argument is flawed for typical optimized kernel
code flows: forward jumps often go to 'cold' (uncommon) pieces
of code, and aligning cold code to cache lines does not bring a
lot of advantages (they are uncommon), while it causes
collateral damage:

- their alignment 'spreads out' the cache footprint, it shifts
followup hot code further out

- plus it slows down even 'cold' code that immediately follows 'hot'
code (like in the above case), which could have benefited from the
partial cacheline that comes off the end of hot code.

But even in the cache-hot case the 16 byte alignment brings
disadvantages:

- it spreads out the cache footprint, possibly making the code
fall out of the L1 I$.

- On Intel CPUs, recent microarchitectures have plenty of
uop cache (typically doubling every 3 years) - while the
size of the L1 cache grows much less aggressively. So
workloads are rarely uop cache limited.

The only situations where alignment might matter are tight
loops that could fit into a single 16 byte chunk - but those
are pretty rare in the kernel: if they exist they tend
to be pointer chasing or generic memory ops, which both tend
to be cache miss (or cache allocation) intensive and are not
decoder bandwidth limited.

So the balance of arguments strongly favors packing kernel
instructions tightly versus maximizing for decoder bandwidth:
this patch changes the jump target alignment from 16 bytes
to 1 byte (tightly packed, unaligned).

Acked-by: Denys Vlasenko <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jason Low <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Makefile | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index c7c3187..ca17e5f 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -77,6 +77,9 @@ else
KBUILD_AFLAGS += -m64
KBUILD_CFLAGS += -m64

+ # Align jump targets to 1 byte, not the default 16 bytes:
+ KBUILD_CFLAGS += -falign-jumps=1
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

Subject: [tip:x86/asm] x86: Pack function addresses tightly as well

Commit-ID: 4874fe1eeb40b403a8c9d0ddeb4d166cab3f37ba
Gitweb: http://git.kernel.org/tip/4874fe1eeb40b403a8c9d0ddeb4d166cab3f37ba
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 10 Apr 2015 14:18:08 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 15 May 2015 11:05:21 +0200

x86: Pack function addresses tightly as well

So as per the arguments laid out in:

be6cb02779ca ("x86: Align jump targets to 1-byte boundaries")

We can pack function addresses tightly as well:

text data bss dec filename
12566391 1617840 1089536 15273767 vmlinux.align.16-byte
12224951 1617840 1089536 14932327 vmlinux.align.1-byte
11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte

Which brings another 2% reduction in the defconfig kernel's
code size.

Acked-by: Denys Vlasenko <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jason Low <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Makefile | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index ca17e5f..5c7edf9 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -80,6 +80,9 @@ else
# Align jump targets to 1 byte, not the default 16 bytes:
KBUILD_CFLAGS += -falign-jumps=1

+ # Pack functions tightly as well:
+ KBUILD_CFLAGS += -falign-functions=1
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

Subject: [tip:x86/asm] x86: Pack loops tightly as well

Commit-ID: 24b0af706205ba49cd139913f92fea837a5724a7
Gitweb: http://git.kernel.org/tip/24b0af706205ba49cd139913f92fea837a5724a7
Author: Ingo Molnar <[email protected]>
AuthorDate: Fri, 10 Apr 2015 14:30:18 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 15 May 2015 11:17:11 +0200

x86: Pack loops tightly as well

Packing loops tightly (-falign-loops=1) is beneficial to code size:

text data bss dec filename
12566391 1617840 1089536 15273767 vmlinux.align.16-byte
12224951 1617840 1089536 14932327 vmlinux.align.1-byte
11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte
11903735 1617840 1089536 14611111 vmlinux.align.1-byte.funcs-1-byte.loops-1-byte

Which reduces the size of the kernel by another 0.6%, so the
total combined size reduction of the alignment-packing
patches is ~5.5%.

The x86 decoder bandwidth and caching arguments laid out in:

be6cb02779ca ("x86: Align jump targets to 1-byte boundaries")

apply to loop alignment as well.

Furthermore, modern CPU uarchs have a loop cache/buffer that
is a L0 cache before even any uop cache, covering a few
dozen most recently executed instructions.

This loop cache generally does not have the 16-byte alignment
restrictions of the uop cache.

Now loop alignment can still be beneficial if:

- a loop is cache-hot and its surroundings are not.

- if the loop is so cache hot that the instruction
flow becomes x86 decoder bandwidth limited

But loop alignment is harmful if:

- a loop is cache-cold

- a loop's surroundings are cache-hot as well

- two cache-hot loops are close to each other

- if the loop fits into the loop cache

- if the code flow is not decoder bandwidth limited

and I'd argue that the latter five scenarios are much
more common in the kernel, as our hottest loops are
typically:

- pointer chasing: this should fit into the loop cache
in most cases and is typically data cache and address
generation limited

- generic memory ops (memset, memcpy, etc.): these generally
fit into the loop cache as well, and are likewise data
cache limited.

So this patch packs loop addresses tightly as well.

Acked-by: Denys Vlasenko <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jason Low <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Makefile | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 5c7edf9..8c7cc44 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -83,6 +83,9 @@ else
# Pack functions tightly as well:
KBUILD_CFLAGS += -falign-functions=1

+ # Pack loops tightly as well:
+ KBUILD_CFLAGS += -falign-loops=1
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

2015-05-15 15:45:50

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Mon, Sep 17, 2001 at 07:00:00AM +0000, Ingo Molnar wrote:
> * Denys Vlasenko <[email protected]> wrote:
> > > What do you guys think about this? I think we should seriously
> > > consider relaxing our alignment defaults.
> >
> > Looks like nobody objected. I think it's ok to submit
> > this patch for real.
>
> Yeah, so my plan is to apply the following three changes from that
> discussion:
>
> --- tip.orig/arch/x86/Makefile
> +++ tip/arch/x86/Makefile
> <at> <at> -77,6 +77,15 <at> <at> else
> KBUILD_AFLAGS += -m64
> KBUILD_CFLAGS += -m64
>
> + # Pack jump targets tightly, don't align them to the default 16 bytes:
> + KBUILD_CFLAGS += -falign-jumps=1
> +
> + # Pack functions tightly as well:
> + KBUILD_CFLAGS += -falign-functions=1
> +
> + # Pack loops tightly as well:
> + KBUILD_CFLAGS += -falign-loops=1
> +
> # Don't autogenerate traditional x87 instructions
> KBUILD_CFLAGS += $(call cc-option,-mno-80387)
> KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

It looks like the patch you applied to the tip tree only included one of
these (-falign-jumps=1), not the other two.

Also, you've only applied these to 64-bit; could you please apply them
to both 32-bit and 64-bit, since many embedded systems aiming for small
code size use 32-bit? (Unless 32-bit already defaults to these.)

Have you considered including -falign-labels=1 as well? Does that make
a difference on top of the other three?

Finally, it looks like -Os already implies all four of those, as well
as a few others, so unfortunately the code size benefits don't actually
apply to the tiniest kernels, which already effectively incorporate this
change. Oh well.
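(One way to double-check what a given optimization level already turns on
is to ask GCC directly - a sketch, and the output format varies between
GCC versions:)

# List the optimizations (including the -falign-* ones) enabled per level.
gcc -O2 -Q --help=optimizers | grep align
gcc -Os -Q --help=optimizers | grep align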

- Josh Triplett

2015-05-15 18:36:55

by Linus Torvalds

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86: Pack function addresses tightly as well

On Fri, May 15, 2015 at 2:39 AM, tip-bot for Ingo Molnar
<[email protected]> wrote:
>
> We can pack function addresses tightly as well:

So I really want to see performance numbers on a few
microarchitectures for this one in particular.

The kernel generally doesn't have loops (well, not the kinds of
high-rep loops that tend to be worth aligning), and I think the
general branch/loop alignment is likely fine. But the function
alignment doesn't tend to have the same kind of I$ advantages, it's
more likely purely a size issue and not as interesting. Function
targets are also more likely to be not in the cache, I suspect, since
you don't have a loop priming it or a short forward jump that just got
the cacheline anyway. And then *not* aligning the function would
actually tend to make it *less* dense in the I$.

Put another way: I suspect this is more likely to hurt, and less
likely to help than the others.

Size matters, but size matters mainly from an I$ standpoint, not from
some absolute 'big is bad" issue. Also, even when size matters,
performance matters too. I do want performance numbers. Is this
measurable?

Linus

2015-05-15 20:52:52

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86: Pack function addresses tightly as well

On 05/15/2015 08:36 PM, Linus Torvalds wrote:
> On Fri, May 15, 2015 at 2:39 AM, tip-bot for Ingo Molnar
> <[email protected]> wrote:
>>
>> We can pack function addresses tightly as well:
>
> So I really want to see performance numbers on a few
> microarchitectures for this one in particular.
>
> The kernel generally doesn't have loops (well, not the kinds of
> high-rep loops that tend to be worth aligning), and I think the
> general branch/loop alignment is likely fine. But the function
> alignment doesn't tend to have the same kind of I$ advantages, it's
> more likely purely a size issue and not as interesting. Function
> targets are also more likely to be not in the cache, I suspect, since
> you don't have a loop priming it or a short forward jump that just got
> the cacheline anyway. And then *not* aligning the function would
> actually tend to make it *less* dense in the I$.

How about taking an intermediate step and using -falign-functions=6?
This means "align to 8 if it requires skipping less than 6 bytes".

Why < 6? Because with CONFIG_FTRACE=y, every function starts with
5-byte instruction ("call ftrace", replaced by a 5-byte nop).
We want at least this one insn to be decoded at once.

Without CONFIG_FTRACE, it's not as clear-cut, but typical x86
insns are 5 bytes or less, so it will still make most fuctions
start executing reasonably quickly at the cost of only 2.5 bytes of padding
on average.

I'd prefer "align to 16 if it requires skipping less than 6 bytes"
because aligning to an 8-byte boundary that is not also a 16-byte boundary
doesn't make sense on modern CPUs (it can in fact hurt a bit), but alas,
gcc's option format doesn't allow that.

If you don't like the 8-byte alignment, the smallest option which would
align to 16 bytes is -falign-functions=9: it means
"align to 16 if it requires skipping less than 9 bytes".
Still significantly better than insane padding to 16 even if we are at
an address just a few bytes past the cacheline start (0x1231 -> 0x1240).

The last thing. If CONFIG_CC_OPTIMIZE_FOR_SIZE=y, we probably
shouldn't do any alignment.
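One quick way to see what a given -falign-functions value actually does to
function start addresses is to bucket the text symbols of the resulting
vmlinux by their offset within a 16-byte window - a rough sketch using
portable nm/awk:

# Count text symbols by start address modulo 16 (the last hex digit).
nm -n vmlinux | awk '
	$2 ~ /^[tT]$/ { mod16[substr($1, length($1), 1)]++ }
	END { for (d in mod16) printf "offset 0x%s: %d functions\n", d, mod16[d] }' |
	sort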

2015-05-17 05:34:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries


* Josh Triplett <[email protected]> wrote:

> On Mon, Sep 17, 2001 at 07:00:00AM +0000, Ingo Molnar wrote:
> > * Denys Vlasenko <[email protected]> wrote:
> > > > What do you guys think about this? I think we should seriously
> > > > consider relaxing our alignment defaults.
> > >
> > > Looks like nobody objected. I think it's ok to submit
> > > this patch for real.
> >
> > Yeah, so my plan is to apply the following three changes from that
> > discussion:
> >
> > --- tip.orig/arch/x86/Makefile
> > +++ tip/arch/x86/Makefile
> > <at> <at> -77,6 +77,15 <at> <at> else
> > KBUILD_AFLAGS += -m64
> > KBUILD_CFLAGS += -m64
> >
> > + # Pack jump targets tightly, don't align them to the default 16 bytes:
> > + KBUILD_CFLAGS += -falign-jumps=1
> > +
> > + # Pack functions tightly as well:
> > + KBUILD_CFLAGS += -falign-functions=1
> > +
> > + # Pack loops tightly as well:
> > + KBUILD_CFLAGS += -falign-loops=1
> > +
> > # Don't autogenerate traditional x87 instructions
> > KBUILD_CFLAGS += $(call cc-option,-mno-80387)
> > KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)
>
> It looks like the patch you applied to the tip tree only included one of
> these (-falign-jumps=1), not the other two.

It's three separate patches, in case there are any regressions.

> Also, you've only applied these to 64-bit; could you please apply
> them to both 32-bit and 64-bit, since many embedded systems aiming
> for small code size use 32-bit? (Unless 32-bit already defaults to
> these.)

First things first - 64-bit is getting far more testing these days
than 32-bit.

> Have you considered including -falign-labels=1 as well? Does that
> make a difference on top of the other three?

So isn't the default on x86 for -falign-labels already 1?

Thanks,

Ingo

2015-05-17 05:58:50

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86: Pack function addresses tightly as well


* Linus Torvalds <[email protected]> wrote:

> On Fri, May 15, 2015 at 2:39 AM, tip-bot for Ingo Molnar
> <[email protected]> wrote:
> >
> > We can pack function addresses tightly as well:
>
> So I really want to see performance numbers on a few
> microarchitectures for this one in particular.
>
> The kernel generally doesn't have loops (well, not the kinds of
> high-rep loops that tend to be worth aligning), and I think the
> general branch/loop alignment is likely fine. But the function
> alignment doesn't tend to have the same kind of I$ advantages, it's
> more likely purely a size issue and not as interesting. Function
> targets are also more likely to be not in the cache, I suspect,
> since you don't have a loop priming it or a short forward jump that
> just got the cacheline anyway. And then *not* aligning the function
> would actually tend to make it *less* dense in the I$.
>
> Put another way: I suspect this is more likely to hurt, and less
> likely to help than the others.

Yeah, indeed.

So my thinking was that it would help, because:

- There's often locality of reference between functions: we often
have a handful of hot functions that are sitting next to each
other and they could thus be packed closer to each other this way,
creating a smaller net I$ footprint.

- We have a handful of 'clusters' of small and often hot functions,
especially in the locking code:

ffffffff81893080 T _raw_spin_unlock_irqrestore
ffffffff81893090 T _raw_read_unlock_irqrestore
ffffffff818930a0 T _raw_write_unlock_irqrestore
ffffffff818930b0 T _raw_spin_trylock_bh
ffffffff81893110 T _raw_spin_unlock_bh
ffffffff81893130 T _raw_read_unlock_bh
ffffffff81893150 T _raw_write_unlock_bh
ffffffff81893170 T _raw_read_trylock
ffffffff818931a0 T _raw_write_trylock
ffffffff818931d0 T _raw_read_lock_irqsave
ffffffff81893200 T _raw_write_lock_irqsave
ffffffff81893230 T _raw_spin_lock_bh
ffffffff81893270 T _raw_spin_lock_irqsave
ffffffff818932c0 T _raw_write_lock
ffffffff818932e0 T _raw_write_lock_irq
ffffffff81893310 T _raw_write_lock_bh
ffffffff81893340 T _raw_spin_trylock
ffffffff81893380 T _raw_read_lock
ffffffff818933a0 T _raw_read_lock_irq
ffffffff818933c0 T _raw_read_lock_bh
ffffffff818933f0 T _raw_spin_lock
ffffffff81893430 T _raw_spin_lock_irq
ffffffff81893450

That's 976 bytes total if 16 bytes aligned.

With function packing, they compress into:

ffffffff817f2458 T _raw_spin_unlock_irqrestore
ffffffff817f2463 T _raw_read_unlock_irqrestore
ffffffff817f2472 T _raw_write_unlock_irqrestore
ffffffff817f247d T _raw_read_unlock_bh
ffffffff817f2498 T _raw_write_unlock_bh
ffffffff817f24af T _raw_spin_unlock_bh
ffffffff817f24c6 T _raw_read_trylock
ffffffff817f24ef T _raw_write_trylock
ffffffff817f250e T _raw_spin_lock_bh
ffffffff817f2536 T _raw_read_lock_irqsave
ffffffff817f255e T _raw_write_lock_irqsave
ffffffff817f2588 T _raw_spin_lock_irqsave
ffffffff817f25be T _raw_spin_trylock_bh
ffffffff817f25f6 T _raw_spin_trylock
ffffffff817f2615 T _raw_spin_lock
ffffffff817f2632 T _raw_spin_lock_irq
ffffffff817f2650 T _raw_write_lock
ffffffff817f266b T _raw_write_lock_irq
ffffffff817f2687 T _raw_write_lock_bh
ffffffff817f26ad T _raw_read_lock
ffffffff817f26c6 T _raw_read_lock_bh
ffffffff817f26ea T _raw_read_lock_irq
ffffffff817f2704

That's 684 bytes - a very stark difference that will show up in
better I$ footprint even if usage is sparse.

OTOH, on the flip side, their ordering is far from ideal, so for
example the rarely used 'trylock' variants are mixed into the
middle, and the way we mix rwlock with spinlock ops isn't
very pretty either.

So we could reduce alignment for just the locking APIs, via per
.o cflags in the Makefile, if packing otherwise hurts the common
case.
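A sketch of what such a per-object override could look like - the file and
the flag value here are illustrative, not a proposal:

# Keep the global alignment default, but pack just the raw locking ops
# tightly via kbuild's per-object cflags.
cat >> kernel/locking/Makefile <<'EOF'
CFLAGS_spinlock.o += -falign-functions=1
EOF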

This function packing argument fails:

- for large functions that are physically fragmented

- if less than half of all functions in a hot workload are
packed together. This might be the common case in fact.

- even if functions are technically 'packed' next to each other,
this only works for small functions: larger functions typically
are hotter near their heads, with unlikely codepaths being in
their tails.

> Size matters, but size matters mainly from an I$ standpoint, not
> from some absolute 'big is bad" issue.

Absolutely.

> [...] Also, even when size matters, performance matters too. I do
> want performance numbers. Is this measurable?

Will try to measure this. I'm somewhat sceptical that I'll be able to
measure any signal: alignment effects are very hard to measure on x86,
especially on any realistic workload.

In any case, consider this function alignment patch shelved until it's
properly measured.

Thanks,

ingo

Subject: [tip:x86/apic] x86: Pack loops tightly as well

Commit-ID: 52648e83c9a6b9f7fc3dd272d4d10175e93aa62a
Gitweb: http://git.kernel.org/tip/52648e83c9a6b9f7fc3dd272d4d10175e93aa62a
Author: Ingo Molnar <[email protected]>
AuthorDate: Sun, 17 May 2015 07:56:54 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Sun, 17 May 2015 07:56:54 +0200

x86: Pack loops tightly as well

Packing loops tightly (-falign-loops=1) is beneficial to code size:

text data bss dec filename
12566391 1617840 1089536 15273767 vmlinux.align.16-byte
12224951 1617840 1089536 14932327 vmlinux.align.1-byte
11976567 1617840 1089536 14683943 vmlinux.align.1-byte.funcs-1-byte
11903735 1617840 1089536 14611111 vmlinux.align.1-byte.funcs-1-byte.loops-1-byte

Which reduces the size of the kernel by another 0.6%, so the
total combined size reduction of the alignment-packing
patches is ~5.5%.

The x86 decoder bandwidth and caching arguments laid out in:

be6cb02779ca ("x86: Align jump targets to 1-byte boundaries")

apply to loop alignment as well.

Furthermore, modern CPU uarchs have a loop cache/buffer that
is a L0 cache before even any uop cache, covering a few
dozen most recently executed instructions.

This loop cache generally does not have the 16-byte alignment
restrictions of the uop cache.

Now loop alignment can still be beneficial if:

- a loop is cache-hot and its surroundings are not.

- if the loop is so cache hot that the instruction
flow becomes x86 decoder bandwidth limited

But loop alignment is harmful if:

- a loop is cache-cold

- a loop's surroundings are cache-hot as well

- two cache-hot loops are close to each other

- if the loop fits into the loop cache

- if the code flow is not decoder bandwidth limited

and I'd argue that the latter five scenarios are much
more common in the kernel, as our hottest loops are
typically:

- pointer chasing: this should fit into the loop cache
in most cases and is typically data cache and address
generation limited

- generic memory ops (memset, memcpy, etc.): these generally
fit into the loop cache as well, and are likewise data
cache limited.

So this patch packs loop addresses tightly as well.

Acked-by: Denys Vlasenko <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Aswin Chandramouleeswaran <[email protected]>
Cc: Borislav Petkov <[email protected]>
Cc: Brian Gerst <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Jason Low <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: Paul E. McKenney <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Tim Chen <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Makefile | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index ca17e5f..57996ee 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -80,6 +80,9 @@ else
# Align jump targets to 1 byte, not the default 16 bytes:
KBUILD_CFLAGS += -falign-jumps=1

+ # Pack loops tightly as well:
+ KBUILD_CFLAGS += -falign-loops=1
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

2015-05-17 07:09:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86: Pack function addresses tightly as well


* Ingo Molnar <[email protected]> wrote:

> > Put another way: I suspect this is more likely to hurt, and less
> > likely to help than the others.
>
> Yeah, indeed.
>
> So my thinking was that it would help, because:
>
> - There's often locality of reference between functions: we often
> have a handful of hot functions that are sitting next to each
> other and they could thus be packed closer to each other this way,
> creating a smaller net I$ footprint.
>
> - We have a handful of 'clusters' of small and often hot functions,
> especially in the locking code:
>
> ffffffff81893080 T _raw_spin_unlock_irqrestore
> ffffffff81893090 T _raw_read_unlock_irqrestore
> ffffffff818930a0 T _raw_write_unlock_irqrestore
> ffffffff818930b0 T _raw_spin_trylock_bh
> ffffffff81893110 T _raw_spin_unlock_bh
> ffffffff81893130 T _raw_read_unlock_bh
> ffffffff81893150 T _raw_write_unlock_bh
> ffffffff81893170 T _raw_read_trylock
> ffffffff818931a0 T _raw_write_trylock
> ffffffff818931d0 T _raw_read_lock_irqsave
> ffffffff81893200 T _raw_write_lock_irqsave
> ffffffff81893230 T _raw_spin_lock_bh
> ffffffff81893270 T _raw_spin_lock_irqsave
> ffffffff818932c0 T _raw_write_lock
> ffffffff818932e0 T _raw_write_lock_irq
> ffffffff81893310 T _raw_write_lock_bh
> ffffffff81893340 T _raw_spin_trylock
> ffffffff81893380 T _raw_read_lock
> ffffffff818933a0 T _raw_read_lock_irq
> ffffffff818933c0 T _raw_read_lock_bh
> ffffffff818933f0 T _raw_spin_lock
> ffffffff81893430 T _raw_spin_lock_irq
> ffffffff81893450
>
> That's 976 bytes total if 16 bytes aligned.
>
> With function packing, they compress into:
>
> ffffffff817f2458 T _raw_spin_unlock_irqrestore
> ffffffff817f2463 T _raw_read_unlock_irqrestore
> ffffffff817f2472 T _raw_write_unlock_irqrestore
> ffffffff817f247d T _raw_read_unlock_bh
> ffffffff817f2498 T _raw_write_unlock_bh
> ffffffff817f24af T _raw_spin_unlock_bh
> ffffffff817f24c6 T _raw_read_trylock
> ffffffff817f24ef T _raw_write_trylock
> ffffffff817f250e T _raw_spin_lock_bh
> ffffffff817f2536 T _raw_read_lock_irqsave
> ffffffff817f255e T _raw_write_lock_irqsave
> ffffffff817f2588 T _raw_spin_lock_irqsave
> ffffffff817f25be T _raw_spin_trylock_bh
> ffffffff817f25f6 T _raw_spin_trylock
> ffffffff817f2615 T _raw_spin_lock
> ffffffff817f2632 T _raw_spin_lock_irq
> ffffffff817f2650 T _raw_write_lock
> ffffffff817f266b T _raw_write_lock_irq
> ffffffff817f2687 T _raw_write_lock_bh
> ffffffff817f26ad T _raw_read_lock
> ffffffff817f26c6 T _raw_read_lock_bh
> ffffffff817f26ea T _raw_read_lock_irq
> ffffffff817f2704
>
> That's 684 bytes - a very stark difference that will show up in
> better I$ footprint even if usage is sparse.
>
> OTOH, on the flip side, their ordering is far from ideal, so for
> example the rarely used 'trylock' variants are mixed into the
> middle, and the way we mix rwlock with spinlock ops isn't
> very pretty either.
>
> So we could reduce alignment for just the locking APIs, via per
> .o cflags in the Makefile, if packing otherwise hurts the common
> case.

Another datapoint: despite the widespread stereotype of bloated,
kernel stack guzzling kernel functions, most kernel functions are
pretty small and fit into a single (or at most two) cachelines.

Here's a histogram of x86-64 defconfig function sizes (in bytes, cut
off at around 1200 bytes):

600 ++----------------+------------------+-----------------+-----------------+------------------+----------------++
+ + + + kernel function sizes histogram A +
| |
| |
| |
| |
500 +AA ++
| |
A |
| |
| |
| |
400 ++ ++
| |
|A |
| |
| |
300 +AAA ++
| A |
|A |
|A |
| |
| AAA |
200 +AAAA ++
| AAAAA |
| AAAAAA |
| A AAAA A |
|A A AAAAAAA |
| AAAAA |
100 +A AAAA ++
| AAAAAAA |
| AAAAAAAAAA |
| AAAAAAAAAAAA |
|A AAAAAAAAAAAA AAAA |
+ + AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA AA A A A + A + +
0 AA----------------+------------------AA-AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA--------------++
0 200 400 600 800 1000 1200


Zoomed to just the first 200 bytes:

600 ++--------------------------+--------------------------+---------------------------+-------------------------++
+ + + kernel function sizes histogram A +
| |
| |
| |
| |
500 ++ A A ++
| |
A |
| |
| |
| |
400 ++ ++
| |
| A |
| |
| |
300 ++ A A A ++
| A |
| A |
| A |
| |
| A A A A |
200 ++ A A A AAA A ++
| A A AA A A A A |
| A A A AA AAAA A A A A |
| A A A AA AAAA AAAAAA A AA |
| A A A A A A A A A AAAA A A AAA AAA A |
| AA A A AA AAAA AAAAA AA |
100 ++ A A A A AA AAAAA AA A ++
| A AA AA AA AAA AAAAA AA A A |
| A AA AAAA AAAAAAAAA AAAAAA A A|
| A AAAAA AA AAA
| AA A |
+ + + + +
0 +AAAA-----------------------+--------------------------+---------------------------+-------------------------++
0 50 100 150 200


The median function size is around 1 cacheline (64-byte one), ~80%
fitting into two cachelines, with a big peak for very small functions
that make up something like 20% of all functions - so packing makes
sense unless the actual functions being executed are more fragmented
than I'd estimate around 50%.

Which is still a significant hurdle.
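Such a size distribution can be approximated from any vmlinux with symbols
- a rough sketch, assuming GNU awk; sizes are inferred from the gap to the
next text symbol, so they include alignment padding:

# Bucket per-function sizes (gap to the next text symbol) into 16-byte bins.
nm -n vmlinux | awk '
	$2 ~ /^[tT]$/ { addr[n++] = strtonum("0x" $1) }
	END {
		for (i = 1; i < n; i++) {
			size = addr[i] - addr[i-1]
			if (size > 0 && size <= 1200)
				hist[int(size / 16)]++
		}
		for (b in hist)
			printf "%4d bytes: %d functions\n", b * 16, hist[b]
	}' | sort -n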

Plus there are a truckload of other details, like how exactly the flow
of execution goes within the function (which part is hot). For example
locking APIs tend to return early after a relatively straight path of
execution, making the effective hot function length even smaller than
the median.

All in one: there's no clear winner, so there's no way around actually
measuring it ... ;-)

Thanks,

Ingo

2015-05-17 07:30:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86: Pack function addresses tightly as well


* Ingo Molnar <[email protected]> wrote:

> The median function size is around 1 cacheline (64-byte one), ~80%
> fitting into two cachelines, with a big peak for very small
> functions that make up something like 20% of all functions [...]

Correction:

32% of kernel functions fit into a single cacheline,
55% fit into two cachelines,
70% into three cachelines,
76% into four cachelines

so the tail is longer than my quick read of the graph suggested.

OTOH, probability of use is biased towards smaller functions: we tend
to use smaller, facility functions more frequently.

Thanks,

Ingo

2015-05-17 19:18:59

by Josh Triplett

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries

On Sun, May 17, 2015 at 07:34:29AM +0200, Ingo Molnar wrote:
>
> * Josh Triplett <[email protected]> wrote:
>
> > On Mon, Sep 17, 2001 at 07:00:00AM +0000, Ingo Molnar wrote:
> > > * Denys Vlasenko <[email protected]> wrote:
> > > > > What do you guys think about this? I think we should seriously
> > > > > consider relaxing our alignment defaults.
> > > >
> > > > Looks like nobody objected. I think it's ok to submit
> > > > this patch for real.
> > >
> > > Yeah, so my plan is to apply the following three changes from that
> > > discussion:
> > >
> > > --- tip.orig/arch/x86/Makefile
> > > +++ tip/arch/x86/Makefile
> > > <at> <at> -77,6 +77,15 <at> <at> else
> > > KBUILD_AFLAGS += -m64
> > > KBUILD_CFLAGS += -m64
> > >
> > > + # Pack jump targets tightly, don't align them to the default 16 bytes:
> > > + KBUILD_CFLAGS += -falign-jumps=1
> > > +
> > > + # Pack functions tightly as well:
> > > + KBUILD_CFLAGS += -falign-functions=1
> > > +
> > > + # Pack loops tightly as well:
> > > + KBUILD_CFLAGS += -falign-loops=1
> > > +
> > > # Don't autogenerate traditional x87 instructions
> > > KBUILD_CFLAGS += $(call cc-option,-mno-80387)
> > > KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)
> >
> > It looks like the patch you applied to the tip tree only included one of
> > these (-falign-jumps=1), not the other two.
>
> It's three separate patches, in case there are any regressions.

Fair enough. At the time I sent my mail, only the first of the three
had shown up on LKML.

> > Also, you've only applied these to 64-bit; could you please apply
> > them to both 32-bit and 64-bit, since many embedded systems aiming
> > for small code size use 32-bit? (Unless 32-bit already defaults to
> > these.)
>
> First things first - 64-bit is getting far more testing these days
> than 32-bit.

What testing do you want to see on these patches before applying them to
32-bit as well?

> > Have you considered including -falign-labels=1 as well? Does that
> > make a difference on top of the other three?
>
> So isn't the default on x86 for -falign-labels already 1?

GCC's manual says that -O2 and above turn on -falign-labels, which has a
machine-specific default alignment.

A fair bit of digging turned up gcc/config/i386/i386.c, which does seem
to have processor-specific defaults for the other three but not for
align-labels. So it looks like it does indeed use the general default
of 1. Nevermind.

- Josh Triplett

2015-05-18 06:49:01

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries


* Josh Triplett <[email protected]> wrote:

> > > Also, you've only applied these to 64-bit; could you please
> > > apply them to both 32-bit and 64-bit, since many embedded
> > > systems aiming for small code size use 32-bit? (Unless 32-bit
> > > already defaults to these.)
> >
> > First things first - 64-bit is getting far more testing these days
> > than 32-bit.
>
> What testing do you want to see on these patches before applying
> them to 32-bit as well?

Not much specific testing is needed I think, basically just
'no regressions for a few weeks'.

These alignment changes should be pretty safe, since -Os (via
CONFIG_CC_OPTIMIZE_FOR_SIZE=y) already enabled these.

Thanks,

Ingo

2015-05-18 09:28:56

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [tip:x86/asm] x86: Pack function addresses tightly as well

On Sun, May 17, 2015 at 7:58 AM, Ingo Molnar <[email protected]> wrote:
> With function packing, they compress into:
>
> ffffffff817f2458 T _raw_spin_unlock_irqrestore
> ffffffff817f2463 T _raw_read_unlock_irqrestore

So _raw_spin_unlock_irqrestore is only 11 bytes long?

Sounds like it should have been inlined.
Is this with CONFIG_OPTIMIZE_INLINES=y ?

2015-05-19 21:38:35

by Ingo Molnar

[permalink] [raw]
Subject: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions


* Ingo Molnar <[email protected]> wrote:

> This function packing argument fails:
>
> - for large functions that are physically fragmented
>
> - if less than half of all functions in a hot workload are
> packed together. This might be the common case in fact.
>
> - even if functions are technically 'packed' next to each other,
> this only works for small functions: larger functions typically
> are hotter near their heads, with unlikely codepaths being in
> their tails.
>
> > Size matters, but size matters mainly from an I$ standpoint, not
> > from some absolute 'big is bad" issue.
>
> Absolutely.
>
> > [...] Also, even when size matters, performance matters too. I do
> > want performance numbers. Is this measurable?
>
> Will try to measure this. I'm somewhat sceptical that I'll be able
> to measure any signal: alignment effects are very hard to measure on
> x86, especially on any realistic workload.

So I spent the better part of today trying to measure all this. As
expected it's very hard to measure such alignment effects: anything
that misses the cache and is a real workload tends to be rather noisy.

After a couple of fruitless attempts I wrote the test program below.

It tries to create a more or less meaningful macro-workload that
executes a lot of diverse kernel functions, while still having
relatively low data footprint:

/*
* Copyright (C) 2015, Ingo Molnar <[email protected]>
*/
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>
#include <stdlib.h>
#include <locale.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define BUG_ON(cond) assert(!(cond))

#define DEFAULT_LOOPS 100000

int main(int argc, char **argv)
{
const char *file_name = "./tmp-test-creat.txt";
struct stat stat_buf;
long loops;
long i;
int ret;

setlocale(LC_ALL, "");

if (argc > 1)
loops = atol(argv[1]);
else
loops = DEFAULT_LOOPS;

printf("# VFS-mix: creat()+stat()+mmap()+munmap()+close()+unlink()ing a file in $(pwd) %'ld times ... ", loops);
fflush(stdout);

for (i = 0; i < loops; i++) {
int fd;

fd = creat(file_name, 00700);
BUG_ON(fd < 0);

ret = lstat(file_name, &stat_buf);
BUG_ON(ret != 0);

ret = lseek(fd, 4095, SEEK_SET);
BUG_ON(ret != 4095);

close(fd);

fd = open(file_name, O_RDWR|O_CREAT|O_TRUNC, 00700);
BUG_ON(fd < 0);

{
char c = 1;

ret = write(fd, &c, 1);
BUG_ON(ret != 1);
}

{
char *mmap_buf = (char *)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

BUG_ON(mmap_buf == (void *)-1L);

mmap_buf[0] = 1;

ret = munmap(mmap_buf, 4096);
BUG_ON(ret != 0);
}

close(fd);

ret = unlink(file_name);
BUG_ON(ret != 0);
}

printf("done.\n");

return 0;
}
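(Built and run the obvious way - a sketch, with the source file name
assumed:)

gcc -O2 -Wall -o test-vfs test-vfs.c
./test-vfs 100000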

This is a pretty silly test in itself that re-creates the same file
again and again and does some VFS and MM ops on it, then deletes it.
But the nice thing about it is that in every iteration, if run on a
journalling filesystem such as ext4, it touches:

- the VFS
- various filesystem low level functions
- various IO/block layer functions
- low level IO driver
- memory allocators (slub and page allocator)
- IRQ handling
- page fault handling
- mmap() paths
- syscall entry/exit path
- scheduling
- the RCU subsystem
- the workqueue subsystem
- the perf events subsystem

That's a ton of activity: running this on ext4 executes around 1,500
unique kernel functions (!), around 118,000 instructions per iteration
- with a relatively small data footprint.

The profile is pretty flat, with almost 1,000 functions being beyond
the 0.01% overhead threshold.
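That function count can be approximated from a profile of a single run - a
rough sketch; the event, the 0.01% threshold and the CPU pinning below are
illustrative:

# Count distinct kernel symbols above the threshold for one run.
perf record -q -a -e cycles:k -- taskset -c 0 ./test-vfs 100000
perf report --stdio --sort symbol --percent-limit 0.01 | grep -c '\[k\]'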

This test workload has a non-trivial I$ footprint: with 1,500
functions and the median kernel function size of around 150 bytes,
it's about 220k of text executed - well beyond L1 instruction cache
sizes.

So this is a test that has a realistic instruction cache footprint,
but has micro-benchmark data footprint properties - which reduces
noise while still exercising the I$ effect we seek.

To further reduce measurement noise, I ran this on a system with SSD
disks, so the IRQ and block IO workload is almost synchronous and the
CPU is fully utilized.

Then I bound the workload to a single CPU, with prio SCHED_FIFO:90,
and ran it the following way:

taskset 1 chrt -f 99 perf stat -a -C 0 --pre sync --repeat 10 \
-e L1-icache-load-misses \
-e instructions \
-e context-switches \
./test-vfs

Then I built 11 kernels, each with different function alignment
settings:

fomalhaut:~/linux> size linux-*/vmlinux | sort -n
text data bss dec filename
12150606 2565544 1634304 16350454 linux-____CC_OPTIMIZE_FOR_SIZE=y/vmlinux
13534242 2579560 1634304 17748106 linux-falign-functions=__1-bytes/vmlinux
13554530 2579560 1634304 17768394 linux-falign-functions=__2-bytes/vmlinux
13590946 2579560 1634304 17804810 linux-falign-functions=__4-bytes/vmlinux
13658786 2579560 1634304 17872650 linux-falign-functions=__8-bytes/vmlinux
13799602 2579560 1634304 18013466 linux-falign-functions=_16-bytes/vmlinux
14075330 2579560 1634304 18289194 linux-falign-functions=_32-bytes/vmlinux
14664898 2579560 1634304 18878762 linux-falign-functions=_64-bytes/vmlinux
15980994 2579560 1634304 20194858 linux-falign-functions=128-bytes/vmlinux
19038018 2591848 1634304 23264170 linux-falign-functions=256-bytes/vmlinux
26391106 2591848 1634304 30617258 linux-falign-functions=512-bytes/vmlinux

(I've added the -Os kernel as a comparison.)

A single run results in output like:

Performance counter stats for 'system wide' (10 runs):

649,398,972 L1-icache-load-misses ( +- 0.09% ) (100.00%)
11,875,234,916 instructions ( +- 0.00% )
300,038 context-switches ( +- 0.00% )

7.198533444 seconds time elapsed ( +- 0.28% )

The 'instructions' and 'context-switches' metrics can be used to
cross-check that the run was undisturbed and that it executed exactly
the same workload as other runs.

'time elapsed' is rather noisy, while 'L1-icache-load-misses' is the
L1 instruction cache misses metric that is the most interesting one:
it's a proxy metric for effective I$ footprint of the workload: if the
footprint is larger, the number of misses goes up; if it's tighter,
the number of misses goes down.

I ran this on two systems:

... an Intel system:

vendor_id : GenuineIntel
cpu family : 6
model : 62
model name : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
stepping : 7

cache size : 25600 KB
cache_alignment : 64

... and an AMD system:

vendor_id : AuthenticAMD
cpu family : 21
model : 1
model name : AMD Opteron(tm) Processor 6278
stepping : 2

cache size : 2048 KB
cache_alignment : 64

The CPU frequencies were set to the max on both systems, to reduce
power throttling noise and run-to-run skew.

The AMD system did not have SSDs, so there I used tmpfs: this reduced
the cache footprint to about 30% of that of the Intel system. The
alignment effect was still borderline measurable.

I ran the test 10 times on the AMD system (so 100 runs), and 3 times
on the Intel system (which took longer due to the file touching
ext4/SSD) - i.e. 30 runs.

I picked the best result from the tests, to reduce noise from workload
perturbations and from data cache layout effects. (If I take the
average values then the effect is still visible, but noisier.)

Here's the result from the Intel system:

linux-falign-functions=_64-bytes/res.txt: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%)
linux-falign-functions=128-bytes/res.txt: 669,401,612 L1-icache-load-misses ( +- 0.08% ) (100.00%)
linux-falign-functions=_32-bytes/res.txt: 685,969,043 L1-icache-load-misses ( +- 0.08% ) (100.00%)
linux-falign-functions=256-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
linux-falign-functions=512-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
linux-falign-functions=_16-bytes/res.txt: 706,080,917 L1-icache-load-misses [vanilla kernel] ( +- 0.05% ) (100.00%)
linux-falign-functions=__1-bytes/res.txt: 724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%)
linux-falign-functions=__4-bytes/res.txt: 725,707,848 L1-icache-load-misses ( +- 0.12% ) (100.00%)
linux-falign-functions=__8-bytes/res.txt: 726,543,194 L1-icache-load-misses ( +- 0.04% ) (100.00%)
linux-falign-functions=__2-bytes/res.txt: 738,946,179 L1-icache-load-misses ( +- 0.12% ) (100.00%)
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 921,910,808 L1-icache-load-misses ( +- 0.05% ) (100.00%)

The optimal I$ miss rate is at 64 bytes - which is 9% better than the
default kernel's I$ miss rate at 16 bytes alignment.

The 128/256/512 bytes numbers show an increasing number of cache
misses: probably due to the artificially reduced associativity of the
caching.

Surprisingly there's a rather marked improvement in elapsed time as
well:

linux-falign-functions=_64-bytes/res.txt: 7.154816369 seconds time elapsed ( +- 0.03% )
linux-falign-functions=_32-bytes/res.txt: 7.231074263 seconds time elapsed ( +- 0.12% )
linux-falign-functions=__8-bytes/res.txt: 7.292203002 seconds time elapsed ( +- 0.30% )
linux-falign-functions=128-bytes/res.txt: 7.314226040 seconds time elapsed ( +- 0.29% )
linux-falign-functions=_16-bytes/res.txt: 7.333597250 seconds time elapsed [vanilla kernel] ( +- 0.48% )
linux-falign-functions=__1-bytes/res.txt: 7.367139908 seconds time elapsed ( +- 0.28% )
linux-falign-functions=__4-bytes/res.txt: 7.371721930 seconds time elapsed ( +- 0.26% )
linux-falign-functions=__2-bytes/res.txt: 7.410033936 seconds time elapsed ( +- 0.34% )
linux-falign-functions=256-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% )
linux-falign-functions=512-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% )
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 8.531418784 seconds time elapsed ( +- 0.19% )

the workload got 2.5% faster - which is pretty nice! This result is 5+
standard deviations above the noise of the measurement.

Side note: see how catastrophic -Os (CC_OPTIMIZE_FOR_SIZE=y)
performance is: markedly higher cache miss rate despite a 'smaller'
kernel, and the workload is 16.3% slower (!).

Part of the -Os picture is that the -Os kernel is executing many more
instructions:

linux-falign-functions=_64-bytes/res.txt: 11,851,763,357 instructions ( +- 0.01% )
linux-falign-functions=__1-bytes/res.txt: 11,852,538,446 instructions ( +- 0.01% )
linux-falign-functions=_16-bytes/res.txt: 11,854,159,736 instructions ( +- 0.01% )
linux-falign-functions=__4-bytes/res.txt: 11,864,421,708 instructions ( +- 0.01% )
linux-falign-functions=__8-bytes/res.txt: 11,865,947,941 instructions ( +- 0.01% )
linux-falign-functions=_32-bytes/res.txt: 11,867,369,566 instructions ( +- 0.01% )
linux-falign-functions=128-bytes/res.txt: 11,867,698,477 instructions ( +- 0.01% )
linux-falign-functions=__2-bytes/res.txt: 11,870,853,247 instructions ( +- 0.01% )
linux-falign-functions=256-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% )
linux-falign-functions=512-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% )
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 14,318,175,358 instructions ( +- 0.01% )

21.2% more instructions executed ... that cannot go well.

So this should be a reminder that it's effective I$ footprint and
number of instructions executed that matter to performance, not
kernel size alone. With current GCC -Os should only be used on
embedded systems where one is willing to make the kernel 10%+ slower,
in exchange for a 20% smaller kernel.

The AMD system, with a starkly different x86 microarchitecture, is
showing similar characteristics:

linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%)
linux-falign-functions=_32-bytes/res-amd.txt: 110,433,214 L1-icache-load-misses ( +- 0.15% ) (100.00%)
linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%)
linux-falign-functions=128-bytes/res-amd.txt: 119,100,216 L1-icache-load-misses ( +- 0.22% ) (100.00%)
linux-falign-functions=_16-bytes/res-amd.txt: 122,916,937 L1-icache-load-misses ( +- 0.15% ) (100.00%)
linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%)
linux-falign-functions=__2-bytes/res-amd.txt: 124,337,908 L1-icache-load-misses ( +- 0.71% ) (100.00%)
linux-falign-functions=__4-bytes/res-amd.txt: 125,221,805 L1-icache-load-misses ( +- 0.09% ) (100.00%)
linux-falign-functions=256-bytes/res-amd.txt: 135,761,433 L1-icache-load-misses ( +- 0.18% ) (100.00%)
linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 159,918,181 L1-icache-load-misses ( +- 0.10% ) (100.00%)
linux-falign-functions=512-bytes/res-amd.txt: 170,307,064 L1-icache-load-misses ( +- 0.26% ) (100.00%)

64 bytes is a similar sweet spot. Note that the penalty at 512 bytes
is much steeper than on Intel systems: cache associativity is likely
lower on this AMD CPU.

Interestingly the 1 byte alignment result is still pretty good on AMD
systems - and I used the exact same kernel image on both systems, so
the layout of the functions is exactly the same.

Elapsed time is noisier, but shows a similar trend:

linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% )
linux-falign-functions=128-bytes/res-amd.txt: 1.932961745 seconds time elapsed ( +- 2.18% )
linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% )
linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% )
linux-falign-functions=_32-bytes/res-amd.txt: 1.962074787 seconds time elapsed ( +- 2.38% )
linux-falign-functions=_16-bytes/res-amd.txt: 2.000941789 seconds time elapsed ( +- 1.18% )
linux-falign-functions=__4-bytes/res-amd.txt: 2.002305627 seconds time elapsed ( +- 2.75% )
linux-falign-functions=256-bytes/res-amd.txt: 2.003218532 seconds time elapsed ( +- 3.16% )
linux-falign-functions=__2-bytes/res-amd.txt: 2.031252839 seconds time elapsed ( +- 1.77% )
linux-falign-functions=512-bytes/res-amd.txt: 2.080632439 seconds time elapsed ( +- 1.06% )
linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 2.346644318 seconds time elapsed ( +- 2.19% )

64-byte alignment is the sweet spot here as well: it's 3.7% faster
than the default 16-byte alignment.

So based on those measurements, I think we should do the exact
opposite of my original patch that reduced alignment to 1 byte, and
increase kernel function address alignment from 16 bytes to the
natural cache line size (64 bytes on modern CPUs).

The cost is a 6.2% larger kernel image:

13799602 2579560 1634304 18013466 linux-falign-functions=_16-bytes/vmlinux
14664898 2579560 1634304 18878762 linux-falign-functions=_64-bytes/vmlinux

But this is basically just a RAM cost: it does not increase the
effective cache footprint (in fact it decreases it), so it's likely a
win on everything but RAM-starved embedded systems.

In the future we could perhaps still pack small functions of certain
subsystems (such as hot locking functions).

The patch below implements the alignment optimization in the build
system, using X86_L1_CACHE_SIZE to align functions; it is limited to
64-bit systems for the time being.

Not-Yet-Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/Kconfig.cpu | 17 +++++++++++++++++
arch/x86/Makefile | 6 ++++++
2 files changed, 23 insertions(+)

diff --git a/arch/x86/Kconfig.cpu b/arch/x86/Kconfig.cpu
index 6983314c8b37..9eacb85efda9 100644
--- a/arch/x86/Kconfig.cpu
+++ b/arch/x86/Kconfig.cpu
@@ -304,6 +304,23 @@ config X86_L1_CACHE_SHIFT
default "4" if MELAN || M486 || MGEODEGX1
default "5" if MWINCHIP3D || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX

+config X86_L1_CACHE_SIZE
+ int
+ default "16" if X86_L1_CACHE_SHIFT=4
+ default "32" if X86_L1_CACHE_SHIFT=5
+ default "64" if X86_L1_CACHE_SHIFT=6
+ default "128" if X86_L1_CACHE_SHIFT=7
+
+#
+# We optimize function alignment on 64-bit kernels,
+# to pack the instruction cache optimally:
+#
+config X86_FUNCTION_ALIGNMENT
+ int
+ default "16" if !64BIT
+ default X86_L1_CACHE_SIZE if 64BIT && !CC_OPTIMIZE_FOR_SIZE
+ default "1" if 64BIT && CC_OPTIMIZE_FOR_SIZE
+
config X86_PPRO_FENCE
bool "PentiumPro memory ordering errata workaround"
depends on M686 || M586MMX || M586TSC || M586 || M486 || MGEODEGX1
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 57996ee840dd..45ebf0bbe833 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -83,6 +83,12 @@ else
# Pack loops tightly as well:
KBUILD_CFLAGS += -falign-loops=1

+ #
+ # Allocate a separate cacheline for every function,
+ # for optimal instruction cache packing:
+ #
+ KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)
+
# Don't autogenerate traditional x87 instructions
KBUILD_CFLAGS += $(call cc-option,-mno-80387)
KBUILD_CFLAGS += $(call cc-option,-mno-fp-ret-in-387)

2015-05-20 00:47:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

On Tue, May 19, 2015 at 2:38 PM, Ingo Molnar <[email protected]> wrote:
>
> The optimal I$ miss rate is at 64 bytes - which is 9% better than the
> default kernel's I$ miss rate at 16 bytes alignment.

Ok, these numbers look reasonable (which is, of course, defined as
"meets Linus' expectations"), so I like it.

At the same time, I have to admit that I abhor a 64-byte function
alignment, when we have a fair number of functions that are (much)
smaller than that.

Is there some way to get gcc to take the size of the function into
account? Because aligning a 16-byte or 32-byte function on a 64-byte
alignment is just criminally nasty and wasteful.

From your numbers the 64-byte alignment definitely makes sense in
general, but I really think it would be much nicer if we could get
something like "align functions to their power-of-two size rounded up,
up to a maximum of 64 bytes"

Maybe I did something wrong, but doing this:

export last=0
nm vmlinux | grep ' [tT] ' | sort | while read i t name
do
size=$((0x$i-$last)); last=0x$i; lastname=$name
[ $size -ge 16 ] && echo $size $lastname
done | sort -n | less -S

seems to say that we have a *lot* of small functions (don't do this
with a debug build that has a lot of odd things, do it with something
you'd actually boot and run).

The above assumes the default 16-byte alignment, and gets rid of the
zero-sized ones (mainly due to system call aliases) and the ones less
than 16 bytes (obviously not aligned as-is). But you still end up with
a *lot* of functions. A lot of the really small ones are silly setup
functions etc, but there's actually a fair number of 16-byte
functions.

I seem to get ~30k functions in my defconfig vmlinux file, and about
half seem to be less than 96 bytes (that's _with_ the 16-byte
alignment). In fact, there seem to be ~5500 functions that are 32
bytes or less, of which 1850 functions are 16 bytes or less.
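
A bucketed variant of that script might look like this - just a rough
sketch, not the exact script used for the numbers above. Note that the
size computed on each loop iteration is really the size of the
*previous* symbol, so this variant credits it to that symbol's name:

last=0; lastname=
nm vmlinux | grep ' [tT] ' | sort | while read addr type name
do
        size=$(( 0x$addr - last ))
        [ -n "$lastname" ] && [ "$size" -ge 16 ] && echo "$size $lastname"
        last=$(( 0x$addr )); lastname=$name
done | awk '{ if ($1 <= 16) a++; else if ($1 <= 32) b++; else if ($1 <= 96) c++; else d++ }
        END { printf "<=16: %d  17-32: %d  33-96: %d  >96: %d\n", a, b, c, d }'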

Aligning a 16-byte function to 64 bytes really does sound wrong, and
there's a fair number of them. Of course, just how much memory it
wastes depends on what's around it, but it *definitely* doesn't help
the I$ to round small functions up to the next cacheline either.

I dunno. I might have screwed up the above shellscript badly and my
numbers may be pure garbage. But apart from the tail end that has
insanely big sizes (due to section changes or intermixed data or
something, I suspect) it doesn't look obviously wrong. So I think it
might be a reasonable approximation.

We'd need toolchain help to do saner alignment.

Linus

2015-05-20 11:31:08

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

On 05/19/2015 11:38 PM, Ingo Molnar wrote:
> Here's the result from the Intel system:
>
> linux-falign-functions=_64-bytes/res.txt: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%)
> linux-falign-functions=128-bytes/res.txt: 669,401,612 L1-icache-load-misses ( +- 0.08% ) (100.00%)
> linux-falign-functions=_32-bytes/res.txt: 685,969,043 L1-icache-load-misses ( +- 0.08% ) (100.00%)
> linux-falign-functions=256-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
> linux-falign-functions=512-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
> linux-falign-functions=_16-bytes/res.txt: 706,080,917 L1-icache-load-misses [vanilla kernel] ( +- 0.05% ) (100.00%)
> linux-falign-functions=__1-bytes/res.txt: 724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%)
> linux-falign-functions=__4-bytes/res.txt: 725,707,848 L1-icache-load-misses ( +- 0.12% ) (100.00%)
> linux-falign-functions=__8-bytes/res.txt: 726,543,194 L1-icache-load-misses ( +- 0.04% ) (100.00%)
> linux-falign-functions=__2-bytes/res.txt: 738,946,179 L1-icache-load-misses ( +- 0.12% ) (100.00%)
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 921,910,808 L1-icache-load-misses ( +- 0.05% ) (100.00%)
>
> The optimal I$ miss rate is at 64 bytes - which is 9% better than the
> default kernel's I$ miss rate at 16 bytes alignment.
>
> The 128/256/512 bytes numbers show an increasing amount of cache
> misses: probably due to the artificially reduced associativity of the
> caching.
>
> Surprisingly there's a rather marked improvement in elapsed time as
> well:
>
> linux-falign-functions=_64-bytes/res.txt: 7.154816369 seconds time elapsed ( +- 0.03% )
> linux-falign-functions=_32-bytes/res.txt: 7.231074263 seconds time elapsed ( +- 0.12% )
> linux-falign-functions=__8-bytes/res.txt: 7.292203002 seconds time elapsed ( +- 0.30% )
> linux-falign-functions=128-bytes/res.txt: 7.314226040 seconds time elapsed ( +- 0.29% )
> linux-falign-functions=_16-bytes/res.txt: 7.333597250 seconds time elapsed [vanilla kernel] ( +- 0.48% )
> linux-falign-functions=__1-bytes/res.txt: 7.367139908 seconds time elapsed ( +- 0.28% )
> linux-falign-functions=__4-bytes/res.txt: 7.371721930 seconds time elapsed ( +- 0.26% )
> linux-falign-functions=__2-bytes/res.txt: 7.410033936 seconds time elapsed ( +- 0.34% )
> linux-falign-functions=256-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% )
> linux-falign-functions=512-bytes/res.txt: 7.507029637 seconds time elapsed ( +- 0.07% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 8.531418784 seconds time elapsed ( +- 0.19% )
>
> the workload got 2.5% faster - which is pretty nice! This result is 5+
> standard deviations above the noise of the measurement.
>
> Side note: see how catastrophic -Os (CC_OPTIMIZE_FOR_SIZE=y)
> performance is: markedly higher cache miss rate despite a 'smaller'
> kernel, and the workload is 16.3% slower (!).
>
> Part of the -Os picture is that the -Os kernel is executing much more
> instructions:
>
> linux-falign-functions=_64-bytes/res.txt: 11,851,763,357 instructions ( +- 0.01% )
> linux-falign-functions=__1-bytes/res.txt: 11,852,538,446 instructions ( +- 0.01% )
> linux-falign-functions=_16-bytes/res.txt: 11,854,159,736 instructions ( +- 0.01% )
> linux-falign-functions=__4-bytes/res.txt: 11,864,421,708 instructions ( +- 0.01% )
> linux-falign-functions=__8-bytes/res.txt: 11,865,947,941 instructions ( +- 0.01% )
> linux-falign-functions=_32-bytes/res.txt: 11,867,369,566 instructions ( +- 0.01% )
> linux-falign-functions=128-bytes/res.txt: 11,867,698,477 instructions ( +- 0.01% )
> linux-falign-functions=__2-bytes/res.txt: 11,870,853,247 instructions ( +- 0.01% )
> linux-falign-functions=256-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% )
> linux-falign-functions=512-bytes/res.txt: 11,876,281,686 instructions ( +- 0.01% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 14,318,175,358 instructions ( +- 0.01% )
>
> 21.2% more instructions executed ... that cannot go well.
>
> So this should be a reminder that it's effective I$ footprint and
> number of instructions executed that matters to performance, not
> kernel size alone. With current GCC -Os should only be used on
> embedded systems where one is willing to make the kernel 10%+ slower,
> in exchange for a 20% smaller kernel.

Can you post your .config for the test?
If you have CONFIG_OPTIMIZE_INLINING=y in your -Os test,
consider re-testing with it turned off.
You may be seeing this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122


> The AMD system, with a starkly different x86 microarchitecture, is
> showing similar characteristics:
>
> linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%)
> linux-falign-functions=_32-bytes/res-amd.txt: 110,433,214 L1-icache-load-misses ( +- 0.15% ) (100.00%)
> linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%)
> linux-falign-functions=128-bytes/res-amd.txt: 119,100,216 L1-icache-load-misses ( +- 0.22% ) (100.00%)
> linux-falign-functions=_16-bytes/res-amd.txt: 122,916,937 L1-icache-load-misses ( +- 0.15% ) (100.00%)
> linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%)
> linux-falign-functions=__2-bytes/res-amd.txt: 124,337,908 L1-icache-load-misses ( +- 0.71% ) (100.00%)
> linux-falign-functions=__4-bytes/res-amd.txt: 125,221,805 L1-icache-load-misses ( +- 0.09% ) (100.00%)
> linux-falign-functions=256-bytes/res-amd.txt: 135,761,433 L1-icache-load-misses ( +- 0.18% ) (100.00%)
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 159,918,181 L1-icache-load-misses ( +- 0.10% ) (100.00%)
> linux-falign-functions=512-bytes/res-amd.txt: 170,307,064 L1-icache-load-misses ( +- 0.26% ) (100.00%)
>
> 64 bytes is a similar sweet spot. Note that the penalty at 512 bytes
> is much steeper than on Intel systems: cache associativity is likely
> lower on this AMD CPU.
>
> Interestingly the 1 byte alignment result is still pretty good on AMD
> systems - and I used the exact same kernel image on both systems, so
> the layout of the functions is exactly the same.
>
> Elapsed time is noisier, but shows a similar trend:
>
> linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% )
> linux-falign-functions=128-bytes/res-amd.txt: 1.932961745 seconds time elapsed ( +- 2.18% )
> linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% )
> linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% )
> linux-falign-functions=_32-bytes/res-amd.txt: 1.962074787 seconds time elapsed ( +- 2.38% )
> linux-falign-functions=_16-bytes/res-amd.txt: 2.000941789 seconds time elapsed ( +- 1.18% )
> linux-falign-functions=__4-bytes/res-amd.txt: 2.002305627 seconds time elapsed ( +- 2.75% )
> linux-falign-functions=256-bytes/res-amd.txt: 2.003218532 seconds time elapsed ( +- 3.16% )
> linux-falign-functions=__2-bytes/res-amd.txt: 2.031252839 seconds time elapsed ( +- 1.77% )
> linux-falign-functions=512-bytes/res-amd.txt: 2.080632439 seconds time elapsed ( +- 1.06% )
> linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 2.346644318 seconds time elapsed ( +- 2.19% )
>
> 64 bytes alignment is the sweet spot here as well, it's 3.7% faster
> than the default 16 bytes alignment.

On AMD, 64 bytes wins too, yes, but by a *very* small margin.
The 8-byte and 1-byte alignments have basically the same timings,
and both take, what, +0.63% longer to run?

linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed
linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed
linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed

I wouldn't say that it's the same as Intel. There the difference between 64-byte
alignment and no alignment at all is five times larger than for AMD; it's +3%:

linux-falign-functions=_64-bytes/res.txt: 7.154816369 seconds time elapsed
linux-falign-functions=_32-bytes/res.txt: 7.231074263 seconds time elapsed
linux-falign-functions=__8-bytes/res.txt: 7.292203002 seconds time elapsed
linux-falign-functions=_16-bytes/res.txt: 7.333597250 seconds time elapsed
linux-falign-functions=__1-bytes/res.txt: 7.367139908 seconds time elapsed

> So based on those measurements, I think we should do the exact
> opposite of my original patch that reduced alignment to 1 bytes, and
> increase kernel function address alignment from 16 bytes to the
> natural cache line size (64 bytes on modern CPUs).

> + #
> + # Allocate a separate cacheline for every function,
> + # for optimal instruction cache packing:
> + #
> + KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)

How about -falign-functions=CONFIG_X86_FUNCTION_ALIGNMENT/2 + 1 instead?

This avoids pathological cases where a function starting just a few bytes
after a 64-byte boundary gets aligned to the next one, wasting ~60 bytes.
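
With CONFIG_X86_FUNCTION_ALIGNMENT=64 that would be -falign-functions=33.
A quick way to see what the compiler emits for it (a hypothetical check,
not part of the patch, assuming gcc's usual "align to the next power of
two, skip at most N-1 bytes" behaviour):

echo 'void f(void) {}' | gcc -O2 -falign-functions=33 -S -x c - -o - | grep p2align
#   expect a line like:  .p2align 6,,32
#   i.e. align to a 64-byte boundary, but only when at most 32 bytes of
#   padding are needed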
--
vda

2015-05-20 12:22:53

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

On 05/20/2015 02:47 AM, Linus Torvalds wrote:
> On Tue, May 19, 2015 at 2:38 PM, Ingo Molnar <[email protected]> wrote:
>>
>> The optimal I$ miss rate is at 64 bytes - which is 9% better than the
>> default kernel's I$ miss rate at 16 bytes alignment.
>
> Ok, these numbers looks reasonable (which is, of course, defined as
> "meets Linus' expectations"), so I like it.
>
> At the same time, I have to admit that I abhor a 64-byte function
> alignment, when we have a fair number of functions that are (much)
> smaller than that.
>
> Is there some way to get gcc to take the size of the function into
> account? Because aligning a 16-byte or 32-byte function on a 64-byte
> alignment is just criminally nasty and wasteful.
>
> From your numbers the 64-byte alignment definitely makes sense in
> general, but I really think it would be much nicer if we could get
> something like "align functions to their power-of-two size rounded up,
> up to a maximum of 64 bytes"

Well, that would be a bit hard to implement for gcc, at least in its traditional
mode where it emits assembly source, not machine code.

However, not all is lost.

I was thinking about Ingo's AMD results:

linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed
linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed
linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed

AMD is almost perfect. Having no alignment at all still works
very well. Almost perfect. Where does the "almost" come from?

I bet it comes from the small fraction of functions which got unlucky
enough to have their first instruction split by a 64-byte boundary.

If we were able to avoid just this corner case, that would help a lot.

And GNU as has means to do that!
See https://sourceware.org/binutils/docs/as/P2align.html

.p2align N1,FILL,N3

"The third expression is also absolute, and is also optional.
If it is present, it is the maximum number of bytes that should
be skipped by this alignment directive."

So what we need is to put something like ".p2align 64,,7"
before every function.

(
Why 7?

defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions.
6923 of them have 1st insn 5 or more bytes long,
5841 of them have 1st insn 6 or more bytes long,
5095 of them have 1st insn 7 or more bytes long,
786 of them have 1st insn 8 or more bytes long,
548 of them have 1st insn 9 or more bytes long,
375 of them have 1st insn 10 or more bytes long,
73 of them have 1st insn 11 or more bytes long,
one of them has 1st insn 12 bytes long:
this "heroic" instruction is in local_touch_nmi()
65 48 c7 05 44 3c 00 7f 00 00 00 00
movq $0x0,%gs:0x7f003c44(%rip)

Thus ensuring that at least the first seven bytes do not cross a
64-byte boundary would cover >98% of all functions.
)
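
A rough way to reproduce such a first-instruction-length histogram
(a hypothetical sketch, not the script used for the numbers above) is
to parse the disassembly; --insn-width keeps all the opcode bytes of
long instructions on one line so they get counted correctly:

objdump -d --insn-width=16 vmlinux | awk '
        /^[0-9a-f]+ <.*>:$/ { expect = 1; next }     # symbol label line
        expect {
                split($0, f, "\t")                   # f[2] = opcode bytes of the 1st insn
                gsub(/^ +| +$/, "", f[2])
                print split(f[2], bytes, / +/)       # its length in bytes
                expect = 0
        }' | sort -n | uniq -c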

gcc can't do that right now. With -falign-functions=N, it emits
".p2align log2(next_power_of_2(N)),,N-1", i.e. it aligns to the next
power of two >= N while skipping at most N-1 bytes.

We need to make it just a tiny bit smarter.
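
As a self-contained illustration of the directive's semantics (a
hypothetical demo, not part of any patch here - note that .p2align's
first operand is the log2 of the alignment, so "64-byte alignment,
skip at most 7 bytes" is spelled ".p2align 6,,7"):

cat > p2align-demo.s <<'EOF'
	.text
	.p2align 6,,7
f1:	ret
	.skip 60		# location counter now at 61: 3 bytes short of 64
	.p2align 6,,7		# needs only 3 <= 7 padding bytes -> f2 is aligned to 64
f2:	ret
	.skip 10		# location counter now at 75: 53 bytes short of 128
	.p2align 6,,7		# would need 53 > 7 padding bytes -> no padding, f3 stays at 75
f3:	ret
EOF
as -o p2align-demo.o p2align-demo.s
nm -n p2align-demo.o		# f1 at 0x0, f2 at 0x40, f3 at 0x4b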

> We'd need toolchain help to do saner alignment.

Yep.
I'm going to create a gcc BZ with a feature request,
unless you disagree with my musings above.

--
vda

2015-05-20 13:09:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions


* Linus Torvalds <[email protected]> wrote:

> On Tue, May 19, 2015 at 2:38 PM, Ingo Molnar <[email protected]> wrote:
>
> > The optimal I$ miss rate is at 64 bytes - which is 9% better than
> > the default kernel's I$ miss rate at 16 bytes alignment.
>
> Ok, these numbers looks reasonable (which is, of course, defined as
> "meets Linus' expectations"), so I like it.
>
> At the same time, I have to admit that I abhor a 64-byte function
> alignment, when we have a fair number of functions that are (much)
> smaller than that.
>
> Is there some way to get gcc to take the size of the function into
> account? Because aligning a 16-byte or 32-byte function on a 64-byte
> alignment is just criminally nasty and wasteful.
>
> From your numbers the 64-byte alignment definitely makes sense in
> general, but I really think it would be much nicer if we could get
> something like "align functions to their power-of-two size rounded
> up, up to a maximum of 64 bytes"

I think the ideal strategy would be to minimize the number of cache
line boundaries that cut across a function body, but otherwise pack as
tightly as possible.

I.e. a good first approximation would be to pack functions tightly
within a single cache line as long as the next function still fits -
and go to the next cacheline if it doesn't.

This makes sure we use the cachelines to their max, while also making
sure that functions are not fragmented across more cachelines than
necessary.

> Maybe I did something wrong, but doing this:
>
> export last=0
> nm vmlinux | grep ' [tT] ' | sort | while read i t name
> do
> size=$((0x$i-$last)); last=0x$i; lastname=$name
> [ $size -ge 16 ] && echo $size $lastname
> done | sort -n | less -S
>
> seems to say that we have a *lot* of small functions (don't do this
> with a debug build that has a lot of odd things, do it with
> something you'd actually boot and run).

Yeah, we do, and I ran your script and it looks similar to what I did
a few days ago, so I think your observations are correct.

> The above assumes the default 16-byte alignment, and gets rid of the
> the zero-sized ones (due to mainly system call aliases), and the
> ones less than 16 bytes (obviously not aligned as-is). But you still
> end up with a *lot* of functions.a lot of the really small ones are
> silly setup functions etc, but there's actually a fair number of
> 16-byte functions.

So if you build with -falign-functions=1 to get the true size of the
functions, then the numbers are even more convincing: about 8% of all
functions in vmlinux on a typical distro config are 16 bytes or
smaller, 20% are 32 bytes or smaller, and 36% are 64 bytes or smaller.
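
For reference, such percentages can be derived from a "size name" list
like the one your script prints (run on a -falign-functions=1 build and
without the '-ge 16' filter) - a rough sketch, with a made-up file name:

awk '{ n++; if ($1 <= 16) a++; if ($1 <= 32) b++; if ($1 <= 64) c++ }
     END { printf "<=16: %.0f%%  <=32: %.0f%%  <=64: %.0f%%\n", 100*a/n, 100*b/n, 100*c/n }' sizes.txt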

> I seem to get ~30k functions in my defconfig vmlinux file, and about
> half seem to be lless than 96 bytes (that's _with_ the 16-byte
> alignment). In fact, there seems to be ~5500 functions that are 32
> bytes or less, of which 1850 functions are 16 bytes or less.

Yes.

So given the prevalence of small functions I still find my result
highly non-intuitive: packing them tightly _should_ have helped I$
footprint.

But I'm certainly not going to argue against numbers!

> Aligning a 16-byte function to 64 bytes really does sound wrong, and
> there's a fair number of them. Of course, it depends on what's
> around it just how much memory it wastes, but it *definitely*
> doesn't help I$ to round small functions up to the next cacheline
> too.
>
> I dunno. I might have screwed up the above shellscript badly and my
> numbers may be pure garbage. But apart from the tail end that has
> insane big sizes (due to section changes or intermixed data or
> something, I suspect) it doesn't look obviously wrong. So I think it
> might be a reasonable approximation.
>
> We'd need toolchain help to do saner alignment.

So in theory we could use -ffunction-sections and then create a linker
script on the fly with arbitrary alignment logic to our liking, but
I'd guess it would be a bit slow and possibly also somewhat fragile,
as linker scripts aren't the most robust pieces of GNU tooling.

Another advantage would be that we could reorder functions (within the
same .o) to achieve better packing.
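
A tiny illustration of the building block (hypothetical, nothing
kernel-specific): with -ffunction-sections every function lands in its
own .text.<name> section, which a generated linker script could then
place and align individually:

echo 'void a(void) {} void b(void) {}' | gcc -O2 -ffunction-sections -x c -c - -o fs-demo.o
objdump -h fs-demo.o | grep '\.text\.'
#   expect one .text.a and one .text.b section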

I'll try to play with it a bit to see how feasible it is, and to see
whether more performance is possible with better I$ packing.

Thanks,

Ingo

2015-05-21 11:36:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions


* Denys Vlasenko <[email protected]> wrote:

> I was thinking about Ingo's AMD results:
>
> linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed
> linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed
> linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed
>
> AMD is almost perfect. Having no alignment at all still works very
> well. [...]

Not quite. As I mentioned it in my post, the 'time elapsed' numbers
were very noisy in the AMD case - and you've cut off the stddev column
that shows this. Here is the full data:

linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% )
linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% )
linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% )

A stddev of 2-3% for a 3.7% speedup is not conclusive.

What you should use instead is the cachemiss counts, which are a good
proxy and a lot more stable statistically:

linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%)
linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%)
linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%)

which shows that 64 bytes alignment still generates a better I$ layout
than tight packing, resulting in 4.3% fewer I$ misses.

On Intel it's more pronounced:

linux-falign-functions=_64-bytes/res.txt: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%)
linux-falign-functions=__1-bytes/res.txt: 724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%)

A 12% difference. Note that the Intel workload runs on SSDs, which
makes the cache footprint several times larger, and the workload is
also more realistic than the AMD test, which ran in tmpfs.

I think it's a fair bet to assume that the AMD system will show a
similar difference if it were to run the same workload.

Allowing smaller functions to be cut in half by cacheline boundaries
looks like a losing strategy, especially with larger workloads.

The modified scheme I suggested (64-byte alignment plus intelligent
packing) might do even better than dumb 64-byte alignment.

Thanks,

Ingo

2015-05-21 11:39:24

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

On 05/20/2015 02:21 PM, Denys Vlasenko wrote:
> So what we need is to put something like ".p2align 64,,7"
> before every function.
>
> (
> Why 7?
>
> defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions.
> 6923 of them have 1st insn 5 or more bytes long,
> 5841 of them have 1st insn 6 or more bytes long,
> 5095 of them have 1st insn 7 or more bytes long,
> 786 of them have 1st insn 8 or more bytes long,
> 548 of them have 1st insn 9 or more bytes long,
> 375 of them have 1st insn 10 or more bytes long,
> 73 of them have 1st insn 11 or more bytes long,
> one of them has 1st insn 12 bytes long:
> this "heroic" instruction is in local_touch_nmi()
> 65 48 c7 05 44 3c 00 7f 00 00 00 00
> movq $0x0,%gs:0x7f003c44(%rip)
>
> Thus ensuring that at least seven first bytes do not cross
> 64-byte boundary would cover >98% of all functions.
> )
>
> gcc can't do that right now. With -falign-functions=N,
> it emits ".p2align next_power_of_2(N),,N-1"
>
> We need to make it just a tiny bit smarter.
>
>> We'd need toolchain help to do saner alignment.
>
> Yep.
> I'm going to create a gcc BZ with a feature request,
> unless you disagree with my musings above.

The BZ is here:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240

2015-05-21 13:28:29

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions


* Denys Vlasenko <[email protected]> wrote:

> Can you post your .config for the test?
> If you have CONFIG_OPTIMIZE_INLINING=y in your -Os test,
> consider re-testing with it turned off.

Yes, I had CONFIG_OPTIMIZE_INLINING=y.

With that turned off, on GCC 4.9.2, I'm seeing:

fomalhaut:~/linux/linux-____CC_OPTIMIZE_FOR_SIZE=y> size vmlinux.OPTIMIZE_INLINING\=*
text data bss dec hex filename
12150606 2565544 1634304 16350454 f97cf6 vmlinux.OPTIMIZE_INLINING=y
12354814 2572520 1634304 16561638 fcb5e6 vmlinux.OPTIMIZE_INLINING=n

I.e. forcing the inlining increases the kernel size again, by about
1.7%.

I re-ran the tests on the Intel system, and got these I$ miss rates:

linux-falign-functions=_64-bytes: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%)
linux-falign-functions=_16-bytes: 706,080,917 L1-icache-load-misses ( +- 0.05% ) (100.00%)
linux-CC_OPTIMIZE_FOR_SIZE=y+OPTIMIZE_INLINING=y: 921,910,808 L1-icache-load-misses ( +- 0.05% ) (100.00%)
linux-CC_OPTIMIZE_FOR_SIZE=y+OPTIMIZE_INLINING=n: 792,395,265 L1-icache-load-misses ( +- 0.05% ) (100.00%)

So yeah, it got better - but the I$ cache miss rate is still 22.4%
higher than that of the 64-bytes aligned kernel and 12.2% higher than
the vanilla kernel.

Elapsed time had this original OPTIMIZE_FOR_SIZE result:

8.531418784 seconds time elapsed ( +- 0.19% )

this now improved to:

7.686174880 seconds time elapsed ( +- 0.18% )

but it's still much worse than the 64-byte aligned one:

7.154816369 seconds time elapsed ( +- 0.03% )

and the 16-byte aligned one:

7.333597250 seconds time elapsed ( +- 0.48% )

> You may be seeing this: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

Yeah, disabling OPTIMIZE_INLINING made a difference - but it didn't
recover the performance loss: -Os is still 4.8% slower in this
workload than the vanilla kernel.

Thanks,

Ingo

2015-05-21 14:03:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions


* Denys Vlasenko <[email protected]> wrote:

> > The AMD system, with a starkly different x86 microarchitecture, is
> > showing similar characteristics:
> >
> > linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%)
> > linux-falign-functions=_32-bytes/res-amd.txt: 110,433,214 L1-icache-load-misses ( +- 0.15% ) (100.00%)
> > linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%)
> > linux-falign-functions=128-bytes/res-amd.txt: 119,100,216 L1-icache-load-misses ( +- 0.22% ) (100.00%)
> > linux-falign-functions=_16-bytes/res-amd.txt: 122,916,937 L1-icache-load-misses ( +- 0.15% ) (100.00%)
> > linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%)
> > linux-falign-functions=__2-bytes/res-amd.txt: 124,337,908 L1-icache-load-misses ( +- 0.71% ) (100.00%)
> > linux-falign-functions=__4-bytes/res-amd.txt: 125,221,805 L1-icache-load-misses ( +- 0.09% ) (100.00%)
> > linux-falign-functions=256-bytes/res-amd.txt: 135,761,433 L1-icache-load-misses ( +- 0.18% ) (100.00%)
> > linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 159,918,181 L1-icache-load-misses ( +- 0.10% ) (100.00%)
> > linux-falign-functions=512-bytes/res-amd.txt: 170,307,064 L1-icache-load-misses ( +- 0.26% ) (100.00%)
> >
> > 64 bytes is a similar sweet spot. Note that the penalty at 512 bytes
> > is much steeper than on Intel systems: cache associativity is likely
> > lower on this AMD CPU.
> >
> > Interestingly the 1 byte alignment result is still pretty good on AMD
> > systems - and I used the exact same kernel image on both systems, so
> > the layout of the functions is exactly the same.
> >
> > Elapsed time is noisier, but shows a similar trend:
> >
> > linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% )
> > linux-falign-functions=128-bytes/res-amd.txt: 1.932961745 seconds time elapsed ( +- 2.18% )
> > linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% )
> > linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% )
> > linux-falign-functions=_32-bytes/res-amd.txt: 1.962074787 seconds time elapsed ( +- 2.38% )
> > linux-falign-functions=_16-bytes/res-amd.txt: 2.000941789 seconds time elapsed ( +- 1.18% )
> > linux-falign-functions=__4-bytes/res-amd.txt: 2.002305627 seconds time elapsed ( +- 2.75% )
> > linux-falign-functions=256-bytes/res-amd.txt: 2.003218532 seconds time elapsed ( +- 3.16% )
> > linux-falign-functions=__2-bytes/res-amd.txt: 2.031252839 seconds time elapsed ( +- 1.77% )
> > linux-falign-functions=512-bytes/res-amd.txt: 2.080632439 seconds time elapsed ( +- 1.06% )
> > linux-____CC_OPTIMIZE_FOR_SIZE=y/res-amd.txt: 2.346644318 seconds time elapsed ( +- 2.19% )
> >
> > 64 bytes alignment is the sweet spot here as well, it's 3.7%
> > faster than the default 16 bytes alignment.
>
> In AMD, 64 bytes win too, yes, but by a *very* small margin. 8 bytes
> and 1 byte alignments have basically same timings, and both take
> what, +0.63% of time longer to run?

See my previous mail for why this is a misleading conclusion: the
timings have so much stddev that they are not really comparable. But
if you look at the cachemiss counts you'll see the true difference:

linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%)
linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%)
linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%)

1 byte alignment isn't optimal on AMD either.

> > + #
> > + # Allocate a separate cacheline for every function,
> > + # for optimal instruction cache packing:
> > + #
> > + KBUILD_CFLAGS += -falign-functions=$(CONFIG_X86_FUNCTION_ALIGNMENT)
>
> How about -falign-functions=CONFIG_X86_FUNCTION_ALIGNMENT/2 + 1 instead?
>
> This avoids pathological cases where function starting just a few
> bytes after 64-bytes boundary gets aligned to the next one, wasting
> ~60 bytes.

Well, that should be pretty similar to the 32-byte aligned numbers I
already posted, right?

The 32-byte numbers are better than the default 16-byte alignment, but
64-byte alignment was even better.

A quick build shows:

text data bss dec hex filename
13534242 2579560 1634304 17748106 10ed08a linux-falign-functions=__1-bytes/vmlinux
13554530 2579560 1634304 17768394 10f1fca linux-falign-functions=__2-bytes/vmlinux
13590946 2579560 1634304 17804810 10fae0a linux-falign-functions=__4-bytes/vmlinux
13658786 2579560 1634304 17872650 110b70a linux-falign-functions=__8-bytes/vmlinux
13799602 2579560 1634304 18013466 112dd1a linux-falign-functions=_16-bytes/vmlinux
13943906 2579560 1634304 18157770 11510ca linux-falign-functions=_33-bytes/vmlinux [1]
14075330 2579560 1634304 18289194 117122a linux-falign-functions=_32-bytes/vmlinux [2]
14664898 2579560 1634304 18878762 120112a linux-falign-functions=_64-bytes/vmlinux
15980994 2579560 1634304 20194858 134262a linux-falign-functions=128-bytes/vmlinux
19038018 2591848 1634304 23264170 162fbaa linux-falign-functions=256-bytes/vmlinux
26391106 2591848 1634304 30617258 1d32eaa linux-falign-functions=512-bytes/vmlinux

Interestingly, the -falign-functions=33 kernel is marginally smaller
than the -falign-functions=32 kernel, by about 0.9% of text size.

But the numbers aren't exceptional:

671,702,272 L1-icache-load-misses ( +- 0.06% ) (100.00%)
11,892,913,320 instructions ( +- 0.01% )
300,030 context-switches ( +- 0.00% )

7.349312237 seconds time elapsed ( +- 1.20% )

Compared to the others it ranks third:

linux-falign-functions=_64-bytes/res.txt: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%)
linux-falign-functions=128-bytes/res.txt: 669,401,612 L1-icache-load-misses ( +- 0.08% ) (100.00%)
linux-falign-functions=_33-bytes/res.txt: [NEW] 671,702,272 L1-icache-load-misses ( +- 0.06% ) (100.00%)
linux-falign-functions=_32-bytes/res.txt: 685,969,043 L1-icache-load-misses ( +- 0.08% ) (100.00%)
linux-falign-functions=256-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
linux-falign-functions=512-bytes/res.txt: 699,130,207 L1-icache-load-misses ( +- 0.06% ) (100.00%)
linux-falign-functions=_16-bytes/res.txt: 706,080,917 L1-icache-load-misses [vanilla kernel] ( +- 0.05% ) (100.00%)
linux-falign-functions=__1-bytes/res.txt: 724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%)
linux-falign-functions=__4-bytes/res.txt: 725,707,848 L1-icache-load-misses ( +- 0.12% ) (100.00%)
linux-falign-functions=__8-bytes/res.txt: 726,543,194 L1-icache-load-misses ( +- 0.04% ) (100.00%)
linux-falign-functions=__2-bytes/res.txt: 738,946,179 L1-icache-load-misses ( +- 0.12% ) (100.00%)
linux-____CC_OPTIMIZE_FOR_SIZE=y/res.txt: 921,910,808 L1-icache-load-misses ( +- 0.05% ) (100.00%)

So I still think the best strategy would be a variant of what I
mentioned before:

1) 'small functions': whose size is 1-63 bytes.

2) 'large functions': whose size is 64 bytes or more.

3) align all large functions to 64 bytes cachelines.

4) pack all small functions into remaining large function tail
holes, without them spilling over into the next cacheline

5) pack the remaining small functions with each other, tightly, as
long as they don't spill over to the next cacheline.

6) once optimization steps 1-5 are performed, for all cachelines
that contain small functions and still have a hole near the end
of the cacheline, allow functions to be aligned to 32/16/8/4/2
bytes as long as they still fit into the cacheline.

So for example lets consider this layout with tight packing:

[small-fn1][small-fn2][large-fn3...........................][small-fn4 ..... ][hole............]
[......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

Both 'fn3' and 'fn4' straddle two cachelines.

After step 5 we'd get such a packing:

[small-fn1][small-fn2][hole....][large-fn3...........................][small-fn4 ..... ][hole..]
[......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

fn1, fn2 and fn4 now each fit within a single cacheline, while
large-fn3 necessarily has to spill into the next cacheline due to its
size. In particular, 'fn4' no longer straddles a cacheline boundary.

After step 6 we'd get some extra intra-cacheline alignment by
utilizing the holes:

[small-fn1]..[small-fn2][hole..][large-fn3...........................]...[small-fn4 ..... ][hle]
[......... cacheline 1 ........][......... cacheline 2 ........][......... cacheline 3 ........]

Note how small-fn2 got moved forward by 2 bytes to put it on an
8-byte boundary, and small-fn4 got moved forward by 3 bytes - at the
expense of the holes at the end of the cachelines.

This function alignment optimization basically minimizes the number of
cache line boundaries intersecting individual functions, so this
minimizes the I$ footprint aggressively.

This is not a trivial computation, btw: even if we do this only per
.o object file, we'd have to consider nr_fns! permutations of all
possible function orderings. We'd want to pick the ordering that
matches the above 1-6 constraints, _and_ which is as close to 'source
code ordering' as possible.

Such an optimization cannot be described via simple local alignment
rules like any variant of -falign-functions, as it requires reordering
of functions.

Thanks,

Ingo

2016-04-16 21:08:57

by Denys Vlasenko

[permalink] [raw]
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

On Thu, May 21, 2015 at 1:38 PM, Denys Vlasenko <[email protected]> wrote:
> On 05/20/2015 02:21 PM, Denys Vlasenko wrote:
>> So what we need is to put something like ".p2align 64,,7"
>> before every function.
>>
>> (
>> Why 7?
>>
>> defconfig vmlinux (w/o FRAME_POINTER) has 42141 functions.
>> 6923 of them have 1st insn 5 or more bytes long,
>> 5841 of them have 1st insn 6 or more bytes long,
>> 5095 of them have 1st insn 7 or more bytes long,
>> 786 of them have 1st insn 8 or more bytes long,
>> 548 of them have 1st insn 9 or more bytes long,
>> 375 of them have 1st insn 10 or more bytes long,
>> 73 of them have 1st insn 11 or more bytes long,
>> one of them has 1st insn 12 bytes long:
>> this "heroic" instruction is in local_touch_nmi()
>> 65 48 c7 05 44 3c 00 7f 00 00 00 00
>> movq $0x0,%gs:0x7f003c44(%rip)
>>
>> Thus ensuring that at least seven first bytes do not cross
>> 64-byte boundary would cover >98% of all functions.
>> )
>>
>> gcc can't do that right now. With -falign-functions=N,
>> it emits ".p2align next_power_of_2(N),,N-1"
>>
>> We need to make it just a tiny bit smarter.
>>
>>> We'd need toolchain help to do saner alignment.
>>
>> Yep.
>> I'm going to create a gcc BZ with a feature request,
>> unless you disagree with my musings above.
>
> The BZ is here:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66240

...and now this BZ has a working patch, which implements e.g.
-falign-functions=64,7