Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1030210AbbEAP7i (ORCPT ); Fri, 1 May 2015 11:59:38 -0400
Received: from mail-wi0-f173.google.com ([209.85.212.173]:38090 "EHLO mail-wi0-f173.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753625AbbEAP7T (ORCPT ); Fri, 1 May 2015 11:59:19 -0400
Date: Fri, 1 May 2015 17:59:12 +0200
From: Ingo Molnar
To: Rik van Riel
Cc: linux-kernel@vger.kernel.org, x86@kernel.org, williams@redhat.com, luto@kernel.org, fweisbec@redhat.com, peterz@infradead.org, heiko.carstens@de.ibm.com, tglx@linutronix.de, Ingo Molnar , Paolo Bonzini
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
Message-ID: <20150501155912.GA451@gmail.com>
References: <1430429035-25563-1-git-send-email-riel@redhat.com> <1430429035-25563-4-git-send-email-riel@redhat.com> <20150501064044.GA18957@gmail.com> <554399D1.6010405@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <554399D1.6010405@redhat.com>
User-Agent: Mutt/1.5.23 (2014-03-12)
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 2768
Lines: 72

* Rik van Riel wrote:

> > I.e. what's the baseline we are talking about?
>
> It's an astounding difference. This is not a kernel without
> nohz_full, just a CPU without nohz_full running the same kernel I
> tested with yesterday:
>
>                 run time    system time
> vanilla         5.49s       2.08s
> __acct patch    5.21s       1.92s
> both patches    4.88s       1.71s
> CPU w/o nohz    3.12s       1.63s    <-- your numbers, mostly
>
> What is even more interesting is that the majority of the time
> difference seems to come from _user_ time, which has gone down from
> around 3.4 seconds in the vanilla kernel to around 1.5 seconds on
> the CPU without nohz_full enabled...
>
> At syscall entry time, the nohz_full context tracking code is very
> straightforward. We check thread_info->flags &
> _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which
> handles USER -> KERNEL context transition.
>
> Syscall exit time is a convoluted mess. Both do_notify_resume and
> syscall_trace_leave call exit_user() on entry and enter_user() on
> exit, leaving the time spent looping around between int_with_check
> and syscall_return: in entry_64.S accounted as user time.
>
> I sent an email about this last night, it may be useful to add a
> third test & function call point to the syscall return code, where
> we can call user_enter() just ONCE, and remove the other context
> tracking calls from that loop.

So what I'm wondering about is the big picture:

 - This is crazy big overhead in something as fundamental as system
   calls!

 - We don't even have the excuse of the syscall auditing code, which
   kind of has to run for every syscall if it wants to do its job!

 - [ The 'precise vtime' stuff that is driven from syscall entry/exit
     is crazy, and I hope not enabled in any distro. ]

 - So why are we doing this in every syscall time at all?

Basically the whole point of user-context tracking is to be able to
flush pending RCU callbacks. But that's crazy, we can sure defer a few
kfree()s on this CPU, even indefinitely!

If some other CPU does a sync_rcu(), then it can very well pluck those
callbacks from this super low latency CPU's RCU lists (with due care)
and go and free stuff itself ... There's no need to disturb this CPU
for that!

If user-space does not do anything kernel-ish then there won't be any
new RCU callbacks piled up, so it's not like it's a resource leak
issue either.

So what's the point? Why not remove this big source of overhead
altogether?
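
For a rough sense of scale, the quoted measurements can be turned into
relative overheads against the CPU without nohz_full. The sketch below is
only an illustration: the numbers are copied from the table above, while
the names (MEASUREMENTS, overhead_percent, approx_user_time) are
hypothetical helpers, and treating "run time minus system time" as user
time is an assumption that only holds if the benchmark is CPU-bound and
single-threaded.

```python
# Run time and system time in seconds, copied from the quoted table.
MEASUREMENTS = {
    "vanilla":      (5.49, 2.08),
    "__acct patch": (5.21, 1.92),
    "both patches": (4.88, 1.71),
    "CPU w/o nohz": (3.12, 1.63),
}
BASELINE = "CPU w/o nohz"


def overhead_percent(value, baseline):
    """Relative slowdown of `value` versus `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def approx_user_time(run, system):
    """Assumption: user time ~= run time - system time (CPU-bound task)."""
    return run - system


def report(measurements=MEASUREMENTS, baseline=BASELINE):
    """Return (name, run overhead %, system overhead %, approx user s)."""
    base_run, base_sys = measurements[baseline]
    rows = []
    for name, (run, system) in measurements.items():
        rows.append((
            name,
            overhead_percent(run, base_run),
            overhead_percent(system, base_sys),
            approx_user_time(run, system),
        ))
    return rows


if __name__ == "__main__":
    for name, run_pct, sys_pct, user in report():
        print(f"{name:<14} run +{run_pct:5.1f}%  "
              f"system +{sys_pct:5.1f}%  user ~{user:.2f}s")
```

Under that assumption the derived user times come out at roughly 3.4s
for the vanilla kernel and 1.5s for the CPU without nohz_full, which
matches the figures quoted above, and the vanilla nohz_full run is about
three quarters slower in wall-clock terms than the non-nohz_full CPU.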

Thanks,

	Ingo