Date: Thu, 21 May 2015 13:36:17 +0200
From: Ingo Molnar <mingo@kernel.org>
To: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
        Andy Lutomirski <luto@amacapital.net>,
        Davidlohr Bueso <dave@stgolabs.net>, Peter Anvin <hpa@zytor.com>,
        Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
        Tim Chen <tim.c.chen@linux.intel.com>, Borislav Petkov <bp@alien8.de>,
        Peter Zijlstra <peterz@infradead.org>,
        "Chandramouleeswaran, Aswin" <aswin@hp.com>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>,
        Brian Gerst <brgerst@gmail.com>,
        Paul McKenney <paulmck@linux.vnet.ibm.com>,
        Thomas Gleixner <tglx@linutronix.de>, Jason Low <jason.low2@hp.com>,
        "linux-tip-commits@vger.kernel.org" 
	<linux-tip-commits@vger.kernel.org>,
        Arjan van de Ven <arjan@infradead.org>,
        Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [RFC PATCH] x86/64: Optimize the effective instruction cache
 footprint of kernel functions
Message-ID: <20150521113617.GA7911@gmail.com>
References: <20150410121808.GA19918@gmail.com>
 <tip-4874fe1eeb40b403a8c9d0ddeb4d166cab3f37ba@git.kernel.org>
 <CA+55aFywCXk083w78cQGbRKh-ERLtE8v9PZhqxaHcyHJxSNsFQ@mail.gmail.com>
 <20150517055551.GB17002@gmail.com>
 <20150519213820.GA31688@gmail.com>
 <CA+55aFwog9oCat-FQqW52SvVGyZohNZPeuWk92a2cvhjc6H38Q@mail.gmail.com>
 <555C7C57.1070608@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <555C7C57.1070608@redhat.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3022
Lines: 58


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> I was thinking about Ingo's AMD results:
> 
> linux-falign-functions=_64-bytes/res-amd.txt:        1.928409143 seconds time elapsed
> linux-falign-functions=__8-bytes/res-amd.txt:        1.940703051 seconds time elapsed
> linux-falign-functions=__1-bytes/res-amd.txt:        1.940744001 seconds time elapsed
> 
> AMD is almost perfect. Having no alignment at all still works very 
> well. [...]

Not quite. As I mentioned it in my post, the 'time elapsed' numbers 
were very noisy in the AMD case - and you've cut off the stddev column 
that shows this. Here is the full data:

 linux-falign-functions=_64-bytes/res-amd.txt:        1.928409143 seconds time elapsed                                          ( +-  2.74% )
 linux-falign-functions=__8-bytes/res-amd.txt:        1.940703051 seconds time elapsed                                          ( +-  1.84% )
 linux-falign-functions=__1-bytes/res-amd.txt:        1.940744001 seconds time elapsed                                          ( +-  2.15% )

2-3% of stddev for a 3.7% speedup is not conclusive.

What you should use instead is the cachemiss counts, which is a good 
proxy and a lot more stable statistically:

 linux-falign-functions=_64-bytes/res-amd.txt:        108,886,550      L1-icache-load-misses                                         ( +-  0.10% )  (100.00%)
 linux-falign-functions=__8-bytes/res-amd.txt:        123,810,566      L1-icache-load-misses                                         ( +-  0.18% )  (100.00%)
 linux-falign-functions=__1-bytes/res-amd.txt:        113,623,200      L1-icache-load-misses                                         ( +-  0.17% )  (100.00%)

which shows that 64 bytes alignment still generates a better I$ layout 
than tight packing, resulting in 4.3% fewer I$ misses.

On Intel it's more pronounced:

 linux-falign-functions=_64-bytes/res.txt:        647,853,942      L1-icache-load-misses                                         ( +-  0.07% )  (100.00%)
 linux-falign-functions=__1-bytes/res.txt:        724,539,055      L1-icache-load-misses                                         ( +-  0.31% )  (100.00%)

12% difference. Note that the Intel workload is running on SSDs which 
makes the cache footprint several times larger, and the workload is 
more realistic as well than the AMD test that was running in tmpfs.

I think it's a fair bet to assume that the AMD system will show a 
similar difference if it were to run the same workload.

Allowing smaller functions to be cut in half by cacheline boundaries 
looks like a losing strategy, especially with larger workloads.

The modified scheme I suggested: 64 bytes alignment + intelligent 
packing might do even better than dumb 64 bytes alignment.

Thanks,

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/