Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751219AbbEBF1l (ORCPT ); Sat, 2 May 2015 01:27:41 -0400 Received: from mail-wi0-f180.google.com ([209.85.212.180]:36565 "EHLO mail-wi0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750894AbbEBF1j (ORCPT ); Sat, 2 May 2015 01:27:39 -0400 Date: Sat, 2 May 2015 07:27:33 +0200 From: Ingo Molnar To: Andy Lutomirski Cc: Rik van Riel , "linux-kernel@vger.kernel.org" , X86 ML , williams@redhat.com, Andrew Lutomirski , fweisbec@redhat.com, Peter Zijlstra , Heiko Carstens , Thomas Gleixner , Ingo Molnar , Paolo Bonzini , "Paul E. McKenney" , Linus Torvalds Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry Message-ID: <20150502052733.GA9983@gmail.com> References: <554399D1.6010405@redhat.com> <20150501155912.GA451@gmail.com> <20150501162109.GA1091@gmail.com> <5543A94B.3020108@redhat.com> <20150501163431.GB1327@gmail.com> <5543C05E.9040209@redhat.com> <20150501184025.GA2114@gmail.com> <5543CFE5.1030509@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4213 Lines: 110 * Andy Lutomirski wrote: > On Fri, May 1, 2015 at 12:11 PM, Rik van Riel wrote: > > On 05/01/2015 02:40 PM, Ingo Molnar wrote: > > > >> Or we could do that in the syscall path with a single store of a > >> constant flag to a location in the task struct. We have a number of > >> natural flags that get written on syscall entry, such as: > >> > >> pushq_cfi $__USER_DS /* pt_regs->ss */ > >> > >> That goes to a constant location on the kernel stack. On return from > >> system calls we could write 0 to that location. > > Huh? IRET with zero there will fault, and we can't guarantee that > all syscalls return using sysret. [...] So IRET is a slowpath - I was thinking about the fastpath mainly. > [...] Also, we'd have to audit all the entries, and > system_call_after_swapgs currently enables interrupts early enough > that an interrupt before all the pushes will do unpredictable things > to pt_regs. An irq hardware frame won't push zero to that selector value, right? That's the only bad thing that would confuse the code. > We could further abuse orig_ax, but that would require a lot of > auditing. Honestly, though, I think keeping a flag in an > otherwise-hot cache line is a better bet. [...] That would work too, at the cost of one more instruction, as now we'd have to both set and clear it. > [...] Also, making it per-cpu instead of per-task will probably be > easier on the RCU code, since otherwise the RCU code will have to do > some kind of synchronization (even if it's a wait-free probe of the > rq lock or similar) to figure out which pt_regs to look at. So the synchronize_rcu() part is an even slower slow path, in comparison with system call entry overhead. But yes, safely accessing the remote task is a complication, but with such a scheme we cannot avoid it, we'd still have to set TIF_NOHZ. So even if we do: > If we went that route, I'd advocate sticking the flag in tss->sp1. > That cacheline is unconditionally read on kernel entry already, and > sp1 is available in tip:x86/asm (and maybe even in Linus' tree -- I > lost track). [1] > > Alternatively, I'd suggest that we actually add a whole new word to > pt_regs. ... we'd still have to set TIF_NOHZ and are back to square one in terms of race complexity. But it's not overly complex: by taking the remote CPU's rq-lock from synchronize_rcu() we could get a stable pointer to the currently executing task. And if we do that, we might as well use the opportunity and take a look at pt_regs as well - this is why sticking it into pt_regs might be better. So I'd: - enlarge pt_regs by a word and stick the flag there (this allocation is essentially free) - update the word from entry/exit - synchronize_rcu() avoids having to send an IPI by taking a peak at rq->curr's pt_regs::flag, and if: - the flag is 0 then it has observed a quiescent state. - the flag is 1, then it would set TIF_NOHZ and wait for a completion from a TIF_NOHZ callback. synchronize_rcu() often involves waiting for (timer tick driven) grace periods anyway, so this is a relatively fast solution - and it would limit the overhead to 2 instructions. On systems that have zero nohz-full CPUs (i.e. !context_tracking_enabled) we could patch out those two instructions into NOPs, which would be eliminated in the decoder. Regarding the user/kernel execution time split measurement: 1) the same flag could be used to sample a remote CPU's statistics from another CPU and update the stats in the currently executing task. As long as there's at least one non-nohz-full CPU, this would work. Or are there systems were all CPUs are nohz-full? 2) Alternatively we could just drive user/kernel split statistics from context switches, which would be inaccurate if the workload is SCHED_FIFO that only rarely context switches. How does this sound? Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/