Date: Wed, 11 Feb 2015 00:46:01 +0200
From: "Kirill A. Shutemov"
To: Linus Torvalds
Cc: Andrew Morton, "Wang, Yalin", viro@zeniv.linux.org.uk,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	"Gao, Neil"
Subject: Re: [RFC V2] test_bit before clear files_struct bits

On Tue, Feb 10, 2015 at 12:49:46PM -0800, Linus Torvalds wrote:
> On Tue, Feb 10, 2015 at 12:22 PM, Andrew Morton wrote:
> >
> > The patch is good, but I'm still wondering if any CPUs can do this
> > speedup for us. The CPU has to pull in the target word to modify the
> > bit, and what it *could* do is avoid dirtying the cacheline if it
> > sees that the bit is already in the desired state.
>
> Sadly, no CPU I know of actually does this. Probably because it would
> take a bit more core resources, and conditional writes to memory are
> not normally part of an x86 core (it might be more natural for
> something like old-style ARM, which has conditional writes).
>
> Also, even if the store were to be conditional, the cacheline would
> have been acquired in exclusive state, and in many cache protocols the
> state machine goes from exclusive to dirty (since normally the only
> reason to get a cacheline for exclusive use is in order to write to
> it). So a "read, test, conditional write" ends up actually being more
> complicated than you'd think - because you *want* that
> exclusive->dirty transition for the case where you really are going to
> change the bit, and to avoid extra cache protocol stages you don't
> generally want to read the cacheline into a shared read mode first
> (only to then have to turn it into exclusive/dirty as a second stage).

That all sounds reasonable. But I still fail to understand why my
micro-benchmark is faster with a branch before the store compared to a
plain store:

http://article.gmane.org/gmane.linux.kernel.cross-arch/26254

In this case we would not have the intermediate shared state, because we
don't have anybody else to share the cache line with. So with the branch
we would have the same E->M transition and write-back to memory as
without it. But that doesn't explain why the branch makes the code
faster.
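For context, the pattern the RFC proposes looks roughly like this sketch,
modelled on the fs/file.c fdtable helpers (the exact helper in the patch
may differ):

	/*
	 * Only touch the bitmap word when the bit actually has to change:
	 * if it is already clear, the load in test_bit() leaves the
	 * cacheline clean instead of dirtying it with a redundant store.
	 */
	static inline void __clear_close_on_exec(unsigned int fd,
						 struct fdtable *fdt)
	{
		if (test_bit(fd, fdt->close_on_exec))
			__clear_bit(fd, fdt->close_on_exec);
	}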
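And the comparison behind the benchmark link above, reconstructed as a
minimal userspace test (buffer size and the timing harness are my
illustration here, not the exact code from the link; every word starts
non-zero, so both loops execute the same stores and dirty the same
cachelines):

	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <time.h>

	#define WORDS	(16UL << 20)	/* 128MB of longs, far beyond LLC */

	static double now(void)
	{
		struct timespec ts;

		clock_gettime(CLOCK_MONOTONIC, &ts);
		return ts.tv_sec + ts.tv_nsec / 1e9;
	}

	int main(void)
	{
		unsigned long *buf = malloc(WORDS * sizeof(*buf));
		unsigned long i;
		double t;

		if (!buf)
			return 1;

		memset(buf, 0xff, WORDS * sizeof(*buf));
		t = now();
		for (i = 0; i < WORDS; i++)
			buf[i] = 0;			/* plain store */
		printf("plain store:         %f s\n", now() - t);

		memset(buf, 0xff, WORDS * sizeof(*buf));
		t = now();
		for (i = 0; i < WORDS; i++)
			if (buf[i])			/* branch before store */
				buf[i] = 0;
		printf("branch before store: %f s\n", now() - t);

		free(buf);
		return 0;
	}

Any ideas?

--
 Kirill A. Shutemov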