Subject: Re: [PATCH v8 0/9] rwsem performance optimizations
From: Tim Chen
To: Ingo Molnar
Cc: Andrew Morton, Linus Torvalds, Andrea Arcangeli, Alex Shi,
    Andi Kleen, Michel Lespinasse, Davidlohr Bueso, Matthew R Wilcox,
    Dave Hansen, Peter Zijlstra, Rik van Riel, Peter Hurley,
    "Paul E. McKenney", Jason Low, Waiman Long,
    linux-kernel@vger.kernel.org, linux-mm
Date: Tue, 15 Oct 2013 17:09:16 -0700
Message-ID: <1381882156.11046.178.camel@schen9-DESK>
In-Reply-To: <20131010075444.GD17990@gmail.com>
References: <1380753493.11046.82.camel@schen9-DESK>
 <20131003073212.GC5775@gmail.com>
 <1381186674.11046.105.camel@schen9-DESK>
 <20131009061551.GD7664@gmail.com>
 <1381336441.11046.128.camel@schen9-DESK>
 <20131010075444.GD17990@gmail.com>

On Thu, 2013-10-10 at 09:54 +0200, Ingo Molnar wrote:
> * Tim Chen wrote:
> 
> > The throughput of mmap with pthread-mutex vs pure mmap is below:
> > 
> > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads        vanilla    all rwsem     without optspin
> >                            patches
> > 1                  3.0%       -1.0%         -1.7%
> > 5                  7.2%      -26.8%          5.5%
> > 10                 5.2%      -10.6%         22.1%
> > 20                 6.8%       16.4%         12.5%
> > 40                -0.2%       32.7%          0.0%
> > 
> > So with mutex, the vanilla kernel and the one without optspin both
> > run faster.  This is consistent with what Peter reported.  With
> > optspin, the picture is more mixed, with lower throughput at a low
> > to moderate number of threads and higher throughput with a high
> > number of threads.
> 
> So, going back to your original table:
> 
> > % change in performance of the mmap with pthread-mutex vs pure mmap
> > #threads        vanilla    all        without optspin
> > 1                  3.0%      -1.0%       -1.7%
> > 5                  7.2%     -26.8%        5.5%
> > 10                 5.2%     -10.6%       22.1%
> > 20                 6.8%      16.4%       12.5%
> > 40                -0.2%      32.7%        0.0%
> > 
> > In general, the vanilla and no-optspin cases perform better with
> > pthread-mutex.  For the case with optspin, mmap with pthread-mutex
> > is worse at low to moderate contention and better at high
> > contention.
> 
> It appears that 'without optspin' is a pretty good choice - if it
> wasn't for that '1 thread' number, which, if I assume correctly, is
> the uncontended case and one of the most common usecases ...
> 
> How can the single-threaded case get slower?  None of the patches
> should really cause noticeable overhead in the non-contended case.
> That looks weird.
> 
> It would also be nice to see the 2, 3, 4 thread numbers - those are
> the most common contention scenarios in practice - where do we see
> the first improvement in performance?
> 
> Also, it would be nice to include a noise/stddev figure; it's really
> hard to tell whether -1.7% is statistically significant.

Ingo,

I think that the optimistic spin changes to rwsem should enhance the
performance of real workloads after all.  In my previous tests, I was
doing mmap followed immediately by munmap, without doing anything to
the memory.
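(Roughly, that earlier test's loop was just an mmap/munmap pair; this
is a sketch, since the exact code is not reproduced in this thread:)

#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define MEMSIZE (1 * 1024 * 1024)

void testcase(unsigned long long *iterations)
{
	while (1) {
		char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
			       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		assert(c != MAP_FAILED);
		/* unmap immediately; the memory is never touched */
		munmap(c, MEMSIZE);
		(*iterations)++;
	}
}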
No real workload will behave that way, and it is not the scenario we
should optimize for.  A much better approximation of real usage is to
mmap, then touch the memory being mmaped, and then munmap.  This
changes the dynamics of the rwsem: we are now dominated by read
acquisitions of the mmap sem due to the page faults, instead of having
only write acquisitions from mmap.  In this case, any delay in a write
acquisition is costly, as it blocks a lot of readers.  This is where
optimistic spinning on write acquisitions of the mmap sem can provide
a very significant boost to throughput.

I changed the test case to the following, with writes to the mmaped
memory:

#define MEMSIZE (1 * 1024 * 1024)

char *testcase_description = "Anonymous memory mmap/munmap of 1MB";

void testcase(unsigned long long *iterations)
{
	int i;

	while (1) {
		char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
			       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		assert(c != MAP_FAILED);
		/*
		 * Write to the mapped area so each page is faulted in.
		 * The original loop body is truncated in this copy of
		 * the mail; the write loop below is a reconstruction
		 * that assumes 4 KB pages.
		 */
		for (i = 0; i < MEMSIZE; i += 4096)
			c[i] = 0xa;
		munmap(c, MEMSIZE);
		(*iterations)++;
	}
}
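testcase() is meant to be driven by a harness that samples the
iteration counter.  Something like the trivial driver below is enough
to run it stand-alone; the 10-second alarm, single thread, and output
format here are just for illustration, not the multi-threaded rig that
produced the numbers above:

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

extern void testcase(unsigned long long *iterations);

static unsigned long long iterations;

static void report(int sig)
{
	(void)sig;
	printf("%llu iterations\n", iterations);
	exit(0);
}

int main(void)
{
	signal(SIGALRM, report);
	alarm(10);              /* let the loop run for 10 seconds */
	testcase(&iterations);  /* never returns; SIGALRM ends the run */
	return 0;
}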