Message-ID: <55DE4FA8.7050701@hpe.com>
Date: Wed, 26 Aug 2015 16:45:44 -0700
From: Hideaki Kimura
To: Frederic Weisbecker
CC: Jason Low, Oleg Nesterov, Andrew Morton, Peter Zijlstra, Ingo Molnar,
    Thomas Gleixner, "Paul E. McKenney", linux-kernel@vger.kernel.org,
    Linus Torvalds, Steven Rostedt, Rik van Riel, Scott J Norton
Subject: Re: [PATCH 0/3] timer: Improve itimers scalability
References: <1440559068-29680-1-git-send-email-jason.low2@hp.com>
    <20150825202710.d960a928.akpm@linux-foundation.org>
    <1440606804.23728.85.camel@j-VirtualBox>
    <20150826170851.GA5264@redhat.com>
    <1440626847.23728.122.camel@j-VirtualBox>
    <55DE4366.9080104@hpe.com>
    <20150826231326.GE11992@lerouge>
In-Reply-To: <20150826231326.GE11992@lerouge>

On 08/26/2015 04:13 PM, Frederic Weisbecker wrote:
> On Wed, Aug 26, 2015 at 03:53:26PM -0700, Hideaki Kimura wrote:
>> Sure, let me elaborate.
>>
>> Executive summary:
>> Yes, enabling a process-wide timer on such a large machine is not wise,
>> but sometimes users/applications cannot avoid it.
>>
>> The issue was actually observed not in the database itself but in a
>> common library it links to: gperftools.
>>
>> The database itself is optimized for many cores/sockets, so it certainly
>> avoids installing a process-wide timer or doing other unscalable things.
>> It merely links to libprofiler for an optional feature that profiles
>> performance bottlenecks when the user turns it on. We of course avoid
>> turning the feature on except while we debug/tune the database.
>>
>> However, libprofiler sets the timer even when the client program doesn't
>> invoke any of its functions: libprofiler does it when the shared library
>> is loaded. We asked the developer of libprofiler to change the behavior,
>> but it seems there is a reason to keep it:
>> https://code.google.com/p/gperftools/issues/detail?id=133
>>
>> Based on this, I think there are two reasons why we should ameliorate
>> this issue in the kernel layer.
>>
>> 1. In this particular case, it's hard to prevent or even detect the
>> issue in user space.
>>
>> We (a team of low-level database and kernel experts) in fact spent a
>> huge amount of time just figuring out what the bottleneck was, because
>> nothing measurable happens in user space. I pulled out countless hairs.
>>
>> Also, the user has to unlink the library from the application to prevent
>> the itimer installation. Imagine a case where the software is
>> proprietary; that won't fly.
>>
>> 2. This is just one example. There could be many other such
>> binaries/libraries that do similar things somewhere in a complex
>> software stack.
>>
>> Today we haven't heard of many such cases, but people will start hitting
>> it once 100s~1,000s of cores become common.
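[For reference, a minimal sketch, not gperftools' actual code, of the
load-time behavior described above: a shared-library constructor that arms
a process-wide ITIMER_PROF before the application calls anything. The
handler body and the 10 ms interval are placeholders.]

#include <signal.h>
#include <string.h>
#include <sys/time.h>

static void prof_handler(int sig)
{
	(void)sig;	/* a real profiler would sample the stack here */
}

/* Runs from the dynamic loader, before main() and before any API call. */
__attribute__((constructor))
static void install_profiling_timer(void)
{
	struct sigaction sa;
	struct itimerval it;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = prof_handler;
	sigaction(SIGPROF, &sa, NULL);

	memset(&it, 0, sizeof(it));
	it.it_interval.tv_usec = 10000;	/* 10 ms, illustrative */
	it.it_value.tv_usec = 10000;

	/* Arms a process-wide CPU-time timer covering every thread. */
	setitimer(ITIMER_PROF, &it, NULL);
}

[Because the constructor runs when the shared object is loaded, merely
linking against such a library is enough to arm the process-wide timer.]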
>>
>> After applying this patchset, we have observed that the performance hit
>> almost completely went away, at least for 240 cores. So it's quite
>> beneficial in the real world.
>
> I can easily imagine that a lot of code incidentally uses posix cpu
> timers when it's not strictly required. But it doesn't look right to fix
> the kernel for that, for this simple reason: posix cpu timers, even after
> your fix, still introduce noticeable overhead. All threads of a process
> with a timer enqueued account their elapsed cputime in a shared atomic
> variable. Add to that the overhead of enqueuing the timer and firing it.
> There are a bunch of scalability issues there.

I totally agree that this is not a perfect solution. If there are 10x more
cores and sockets, even just the atomic fetch_add might be too expensive.

However, it's comparatively/realistically the best thing we can do without
any drawbacks. We can't magically force all library developers to always
write the most scalable code.

My point is: this is a safety net, and a very effective one.

--
Hideaki Kimura
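[A rough userspace model, using C11 atomics, of the accounting pattern
discussed above: while a process-wide timer is armed, every thread folds
its elapsed CPU time into one shared counter on each tick, so that cache
line bounces across all cores. The names and layout are illustrative, not
the kernel's actual code.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct process_cputime {
	atomic_bool timer_armed;		/* any process-wide timer queued? */
	atomic_uint_fast64_t sum_exec_ns;	/* shared by all threads */
	uint64_t expiry_ns;			/* earliest timer expiration */
};

/* Called from each thread's tick/accounting path.  Returns true if the
 * process-wide timer should fire. */
static bool account_tick(struct process_cputime *pc, uint64_t delta_ns)
{
	/* Fast path: no timer armed, no shared cache-line traffic. */
	if (!atomic_load_explicit(&pc->timer_armed, memory_order_relaxed))
		return false;

	/* The contended part: one atomic add per thread per tick. */
	uint64_t total = atomic_fetch_add_explicit(&pc->sum_exec_ns,
						   delta_ns,
						   memory_order_relaxed)
			 + delta_ns;
	return total >= pc->expiry_ns;
}

[The timer_armed fast path is what keeps processes without a process-wide
timer unaffected; the fetch_add on sum_exec_ns is the per-tick cost that
remains once such a timer is installed.]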