Date: Wed, 11 Feb 2015 00:46:01 +0200
From: "Kirill A. Shutemov"
To: Linus Torvalds
Cc: Andrew Morton, "Wang, Yalin", viro@zeniv.linux.org.uk,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	"Gao, Neil"
Subject: Re: [RFC V2] test_bit before clear files_struct bits

On Tue, Feb 10, 2015 at 12:49:46PM -0800, Linus Torvalds wrote:
> On Tue, Feb 10, 2015 at 12:22 PM, Andrew Morton wrote:
> >
> > The patch is good, but I'm still wondering if any CPUs can do this
> > speedup for us. The CPU has to pull in the target word to modify the
> > bit, and what it *could* do is avoid dirtying the cacheline if it
> > sees that the bit is already in the desired state.
>
> Sadly, no CPU I know of actually does this. Probably because it would
> take a bit more core resources, and conditional writes to memory are
> not normally part of an x86 core (it might be more natural for
> something like old-style ARM, which has conditional writes).
>
> Also, even if the store were to be conditional, the cacheline would
> have been acquired in exclusive state, and in many cache protocols the
> state machine goes from exclusive to dirty (since normally the only
> reason to get a cacheline for exclusive use is in order to write to
> it). So a "read, test, conditional write" ends up actually being more
> complicated than you'd think - because you *want* that
> exclusive->dirty transition for the case where you really are going to
> change the bit, and to avoid extra cache protocol stages you don't
> generally want to read the cacheline into a shared read mode first
> (only to then have to turn it into exclusive/dirty as a second stage).

That all sounds reasonable. But I still fail to understand why my
micro-benchmark is faster with a branch before the store compared to a
plain store:

http://article.gmane.org/gmane.linux.kernel.cross-arch/26254

In this case we would not have the intermediate shared state, because we
don't have anybody else to share the cache line with. So with the branch
we would have the same E->M transition and write-back to memory as
without it. But that doesn't explain why the branch makes the code
faster.
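For context, the pattern the RFC proposes looks roughly like this sketch,
modelled on the fs/file.c fdtable helpers (the exact helper in the patch
may differ):

	/*
	 * Only touch the bitmap word when the bit actually has to change:
	 * if it is already clear, the load in test_bit() leaves the
	 * cacheline clean instead of dirtying it with a redundant store.
	 */
	static inline void __clear_close_on_exec(unsigned int fd,
						 struct fdtable *fdt)
	{
		if (test_bit(fd, fdt->close_on_exec))
			__clear_bit(fd, fdt->close_on_exec);
	}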
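And the comparison behind the benchmark link above, reconstructed as a
minimal userspace test (buffer size and the timing harness are my
illustration here, not the exact code from the link; every word starts
non-zero, so both loops execute the same stores and dirty the same
cachelines):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>

	#define WORDS	(16UL << 20)	/* 128MB of longs, far beyond LLC */

	static double now(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec + ts.tv_nsec / 1e9;
	}

	int main(void)
	{
		unsigned long *buf = malloc(WORDS * sizeof(*buf));
		unsigned long i;
		double t;

		if (!buf)
			return 1;

		memset(buf, 0xff, WORDS * sizeof(*buf));
		t = now();
		for (i = 0; i < WORDS; i++)
			buf[i] = 0;			/* plain store */
		printf("plain store:         %f s\n", now() - t);

		memset(buf, 0xff, WORDS * sizeof(*buf));
		t = now();
		for (i = 0; i < WORDS; i++)
			if (buf[i])			/* branch before store */
				buf[i] = 0;
		printf("branch before store: %f s\n", now() - t);

		free(buf);
		return 0;
	}

Any ideas?

--
 Kirill A. Shutemov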