Date: Tue, 10 Feb 2015 15:29:02 -0800
Subject: Re: [RFC V2] test_bit before clear files_struct bits
From: Linus Torvalds
To: "Kirill A. Shutemov"
Cc: Andrew Morton, "Wang, Yalin", viro@zeniv.linux.org.uk,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    "Gao, Neil"
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Feb 10, 2015 at 2:46 PM, Kirill A. Shutemov wrote:
>
> But I still fail to understand why my micro-benchmark is faster with
> branch before store comparing to plain store.

Very tight artificial loops like that tend to be horrible for
performance analysis on modern cores, because you end up seeing mostly
random microarchitectural details rather than any real performance.

At a guess, since you write just one word per cacheline, what happens
is that the store buffer continually fills up faster than the stores
get drained to cache. So then the stores start stalling.

The extra load - that you expect to slow things down - likely ends up
effectively just prefetching the hot L2 cacheline into L1 so that the
store buffer then drains more cleanly.

And both the load and the branch are effectively free, because the
branch predicts perfectly, and the load just prefetches a cacheline
that will have to be fetched for the subsequent store buffer drain
anyway.

And as you say, there are no cacheline bouncing issues, and the
working set presumably fits in the caches - even if it doesn't fit in
the L1.

But that's all just a wild guess. It could equally easily be some very
specific microarchitectural store buffer stall due to the simpler loop
hitting just the right cycle count between stores. There are all kinds
of random odd small corner-cases that are generally very very rare and
hidden in the noise, but then a loop with just the right strides can
happen to just hit them.

It used to be *trivial* to hit things like address generation stalls,
and even though modern Intel CPUs tend to be quite robust
performance-wise, it's not true that they always handle any code
sequence "perfectly".

                 Linus
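
For concreteness, the kind of loop being compared is presumably along
the lines of the sketch below. The actual benchmark isn't quoted in
this thread, so the 1 MiB working set, the 64-byte line size and the
store-of-1 are assumptions rather than a reproduction of Kirill's test;
it just shows the two shapes - "plain store" vs "load and branch before
the same store" - touching one word per cacheline:

  /* gcc -O2 sketch: one touch per cache line. Both variants store to
   * every line; one also does a load and a perfectly predicted branch
   * first. */
  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>

  #define LINE_WORDS (64 / sizeof(uint64_t)) /* one word per 64-byte line   */
  #define NWORDS     (1u << 17)              /* 1 MiB: misses L1, fits L2/L3 */
  #define ITERS      10000

  static void plain_store(uint64_t *a)
  {
          for (size_t i = 0; i < NWORDS; i += LINE_WORDS)
                  a[i] = 1;                  /* unconditionally dirty the line */
  }

  static void branch_before_store(uint64_t *a)
  {
          for (size_t i = 0; i < NWORDS; i += LINE_WORDS)
                  if (a[i])                  /* extra load + always-taken branch */
                          a[i] = 1;          /* same store as above */
  }

  static double bench(void (*fn)(uint64_t *), uint64_t *a)
  {
          struct timespec t0, t1;

          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (int i = 0; i < ITERS; i++)
                  fn(a);
          clock_gettime(CLOCK_MONOTONIC, &t1);
          return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
  }

  int main(void)
  {
          uint64_t *a = malloc(NWORDS * sizeof(*a));

          if (!a)
                  return 1;
          for (size_t i = 0; i < NWORDS; i++)
                  a[i] = 1;                  /* nonzero, so the branch is always taken */
          printf("plain store:         %.3f s\n", bench(plain_store, a));
          printf("branch before store: %.3f s\n", bench(branch_before_store, a));
          free(a);
          return 0;
  }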
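
For reference, the files_struct change in the RFC subject uses the same
load-before-store shape, except that there the point is to skip the
dirtying store entirely when the bit is already clear. The patch itself
isn't quoted in this message, so the helper name below is made up; it
only illustrates the pattern against the fdtable fields:

  #include <linux/bitops.h>
  #include <linux/fdtable.h>

  /* Illustration only: test_bit() is a plain read, so when the bit is
   * already clear the cacheline never gets dirtied at all. */
  static inline void clear_close_on_exec_if_set(unsigned int fd,
                                                struct fdtable *fdt)
  {
          if (test_bit(fd, fdt->close_on_exec))           /* read-only check   */
                  __clear_bit(fd, fdt->close_on_exec);    /* store only if set */
  }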