Date: Tue, 10 Feb 2015 15:29:02 -0800
Subject: Re: [RFC V2] test_bit before clear files_struct bits
From: Linus Torvalds
To: "Kirill A. Shutemov"
Cc: Andrew Morton, "Wang, Yalin", viro@zeniv.linux.org.uk,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    "Gao, Neil"
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Feb 10, 2015 at 2:46 PM, Kirill A. Shutemov wrote:
>
> But I still fail to understand why my micro-benchmark is faster with
> branch before store comparing to plain store.

Very tight artificial loops like that tend to be horrible for
performance analysis on modern cores, because you end up seeing mostly
random microarchitectural details rather than any real performance.

At a guess, since you write just one word per cacheline, what happens
is that the store buffer continually fills up faster than the stores
get drained to cache. So then the stores start stalling.

The extra load - that you expect to slow things down - likely ends up
effectively just prefetching the hot L2 cacheline into L1 so that the
store buffer then drains more cleanly.

And both the load and the branch are effectively free, because the
branch predicts perfectly, and the load just prefetches a cacheline
that will have to be fetched for the subsequent store buffer drain
anyway.

And as you say, there are no cacheline bouncing issues, and the
working set presumably fits in the caches - even if it doesn't fit in
the L1.

But that's all just a wild guess. It could equally easily be some very
specific microarchitectural store buffer stall due to the simpler loop
hitting just the right cycle count between stores. There are all kinds
of random odd small corner-cases that are generally very very rare and
hidden in the noise, but then a loop with just the right strides can
happen to just hit them.

It used to be *trivial* to hit things like address generation stalls,
and even though modern Intel CPUs tend to be quite robust
performance-wise, it's not true that they always handle any code
sequence "perfectly".

                 Linus
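
For concreteness, the kind of loop being compared is presumably along
the lines of the sketch below. The actual benchmark isn't quoted in
this thread, so the 1 MiB working set, the 64-byte line size and the
store-of-1 are assumptions rather than a reproduction of Kirill's test;
it just shows the two shapes - "plain store" vs "load and branch before
the same store" - touching one word per cacheline:

  /* gcc -O2 sketch: one touch per cache line. Both variants store to
   * every line; one also does a load and a perfectly predicted branch
   * first. */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define LINE_WORDS (64 / sizeof(uint64_t)) /* one word per 64-byte line   */
  #define NWORDS     (1u << 17)              /* 1 MiB: misses L1, fits L2/L3 */
  #define ITERS      10000

  static void plain_store(uint64_t *a)
  {
          for (size_t i = 0; i < NWORDS; i += LINE_WORDS)
                  a[i] = 1;                  /* unconditionally dirty the line */
  }

  static void branch_before_store(uint64_t *a)
  {
          for (size_t i = 0; i < NWORDS; i += LINE_WORDS)
                  if (a[i])                  /* extra load + always-taken branch */
                          a[i] = 1;          /* same store as above */
  }

  static double bench(void (*fn)(uint64_t *), uint64_t *a)
  {
          struct timespec t0, t1;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (int i = 0; i < ITERS; i++)
                  fn(a);
          clock_gettime(CLOCK_MONOTONIC, &t1);
          return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  }

  int main(void)
  {
          uint64_t *a = malloc(NWORDS * sizeof(*a));

          if (!a)
                  return 1;
          for (size_t i = 0; i < NWORDS; i++)
                  a[i] = 1;                  /* nonzero, so the branch is always taken */
          printf("plain store:         %.3f s\n", bench(plain_store, a));
          printf("branch before store: %.3f s\n", bench(branch_before_store, a));
          free(a);
          return 0;
  }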
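
For reference, the files_struct change in the RFC subject uses the same
load-before-store shape, except that there the point is to skip the
dirtying store entirely when the bit is already clear. The patch itself
isn't quoted in this message, so the helper name below is made up; it
only illustrates the pattern against the fdtable fields:

  #include <linux/bitops.h>
  #include <linux/fdtable.h>

  /* Illustration only: test_bit() is a plain read, so when the bit is
   * already clear the cacheline never gets dirtied at all. */
  static inline void clear_close_on_exec_if_set(unsigned int fd,
                                                struct fdtable *fdt)
  {
          if (test_bit(fd, fdt->close_on_exec))           /* read-only check   */
                  __clear_bit(fd, fdt->close_on_exec);    /* store only if set */
  }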