MIME-Version: 1.0
In-Reply-To: <5527C700.3030405@redhat.com>
References: <CA+55aFz6KKxGVxPAbsmw9GsKJfy85P2C0EmYBrGpn+aJDjZJWw@mail.gmail.com>
	<20150409175652.GI6464@linux.vnet.ibm.com>
	<CA+55aFzXMDjQQ7jTjsPdh1RikXfgV7OCd-+13cz06MOmDBA33w@mail.gmail.com>
	<CA+55aFwZWi6ecDmVsMBQJTrgrW3GD2DaRtpiOspe=5amR1=dNg@mail.gmail.com>
	<20150409183926.GM6464@linux.vnet.ibm.com>
	<20150410090051.GA28549@gmail.com>
	<20150410091252.GA27630@gmail.com>
	<20150410092152.GA21332@gmail.com>
	<20150410111427.GA30477@gmail.com>
	<20150410112748.GB30477@gmail.com>
	<20150410120846.GA17101@gmail.com>
	<5527C700.3030405@redhat.com>
Date: Fri, 10 Apr 2015 11:48:32 -0700
Message-ID: <CA+55aFy5TKacYVk=UoL377-jcSoagetGxrORbLi+MOY-VYOz=g@mail.gmail.com>
Subject: Re: [PATCH] x86: Align jump targets to 1 byte boundaries
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        Jason Low <jason.low2@hp.com>, Peter Zijlstra <peterz@infradead.org>,
        Davidlohr Bueso <dave@stgolabs.net>,
        Tim Chen <tim.c.chen@linux.intel.com>,
        Aswin Chandramouleeswaran <aswin@hp.com>,
        LKML <linux-kernel@vger.kernel.org>, Borislav Petkov <bp@alien8.de>,
        Andy Lutomirski <luto@amacapital.net>, Brian Gerst <brgerst@gmail.com>,
        "H. Peter Anvin" <hpa@zytor.com>, Thomas Gleixner <tglx@linutronix.de>,
        Peter Zijlstra <a.p.zijlstra@chello.nl>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 2136
Lines: 59

On Fri, Apr 10, 2015 at 5:50 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
> However, I'm an -Os guy. Expect -O2 people to disagree :)

I used to be an -Os guy too. I'm a big believer in I$ density.

HOWEVER.

It turns out that gcc's -Os is just horrible nasty crap. It doesn't
actually make good tradeoffs for code density, because it doesn't make
any tradeoffs at all. It tries to choose small code, even when it's
ridiculously bad small code.

For example, a 24-byte static memcpy is best done as three quad-word
load/store pairs. That's very cheap, and not at all unreasonable.

But what does gcc do? It does a "rep movsl".

Seriously. That's *shit*. It absolutely kills performance on some very
critical code.

I'm not making that up. Try "-O2" and "-Os" on the appended trivial
code. Yes, the "rep movsl" is smaller, but it's incredibly expensive,
particularly if the result is partially used afterwards.
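
As a rough sketch of what "three quad-word load/store pairs" means in C
terms, assuming a 64-bit target where unsigned long is 8 bytes: the
helper names copy_by_assign, copy_by_fields and check_copies are made up
for illustration and are not from the appended test case. What the
compiler actually emits depends on the gcc version and target, so
compare the output of "gcc -O2 -S" and "gcc -Os -S" yourself.

```c
#include <assert.h>
#include <string.h>

struct dummy {
        unsigned long a, b, c;
};

/* Same shape as test() in the appended code: one struct assignment. */
static void copy_by_assign(const struct dummy *a, struct dummy *b)
{
        *b = *a;
}

/*
 * Hypothetical hand-written equivalent: three word-sized loads and
 * three word-sized stores, one pair per member.
 */
static void copy_by_fields(const struct dummy *a, struct dummy *b)
{
        b->a = a->a;
        b->b = a->b;
        b->c = a->c;
}

/* Both variants must produce the same bytes in the destination. */
static void check_copies(void)
{
        struct dummy src = { 1, 2, 3 };
        struct dummy d1 = { 0, 0, 0 }, d2 = { 0, 0, 0 };

        copy_by_assign(&src, &d1);
        copy_by_fields(&src, &d2);
        assert(memcmp(&d1, &d2, sizeof(d1)) == 0);
}
```

The two functions are semantically identical; the only question is
which instruction sequence the optimization level picks for them.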

And I'm not a hater of "rep movs" - not at all. I think that "rep
movsb" is basically a perfect way to tell the CPU "do an optimized
memcpy with whatever cache situation you have". So I'm a big fan of
the string instructions, but only when appropriate. And "appropriate"
here very much includes "I don't know the memory copy size, so I'm
going to call out to some complex generic code that does all kinds of
size checks and tricks".

Replacing three pairs of "mov" instructions with a "rep movs" is insane.

(There are a couple of other examples of that kind of issue with
"-Os". Like using "imul $15" instead of a single shift-by-4 and
subtract. Again, the "imul" is certainly smaller, but can have quite
bad latency and throughput issues).
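
The shift-and-subtract form relies on x * 15 == x * 16 - x. A minimal
sketch of that identity in C, with made-up function names (mul15_plain,
mul15_shift_sub, check_mul15) and unsigned 32-bit arithmetic assumed so
that wraparound is well defined:

```c
#include <assert.h>
#include <stdint.h>

/* Plain multiply: the compiler decides between imul and shift/sub. */
static uint32_t mul15_plain(uint32_t x)
{
        return x * 15;
}

/* x * 15 == (x << 4) - x: shift left by 4 (times 16), then subtract x. */
static uint32_t mul15_shift_sub(uint32_t x)
{
        return (x << 4) - x;
}

/* The two forms agree modulo 2^32, including when the result wraps. */
static void check_mul15(void)
{
        static const uint32_t samples[] = {
                0u, 1u, 2u, 15u, 1000u, 0x10000000u, 0xFFFFFFFFu
        };
        unsigned int i;

        for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
                assert(mul15_plain(samples[i]) == mul15_shift_sub(samples[i]));
}
```

Whether a given compiler turns mul15_plain() into an "imul" or into the
shift-and-subtract pair depends on the optimization level and target,
so check the generated assembly rather than assuming either.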

So I'm no longer a fan of -Os. It disables too many obviously good
code optimizations.

                   Linus

---
struct dummy {
        unsigned long a, b, c;
};

void test(struct dummy *a, struct dummy *b)
{
        *b = *a;
}