Date: Fri, 8 May 2015 08:37:11 +0200
From: Ingo Molnar
To: Andy Lutomirski
Cc: fweisbec@redhat.com, Paolo Bonzini, X86 ML, Thomas Gleixner,
    Peter Zijlstra, Heiko Carstens, Ingo Molnar, Mike Galbraith,
    Rik van Riel, linux-kernel@vger.kernel.org, williams@redhat.com
Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq
    disable & enable from context tracking on syscall entry

* Andy Lutomirski wrote:

> > So do you mean:
> >
> >         this_cpu_write(rcu_state, IN_KERNEL);
> >         ...
> >         this_cpu_inc(rcu_qs_ctr);
> >         this_cpu_write(rcu_state, IN_USER);
> >
> > ?
> >
> > So in your proposal we'd have an INC and two MOVs. I think we can
> > make it just two simple stores into a byte flag, one on entry and
> > one on exit:
> >
> >         this_cpu_write(rcu_state, IN_KERNEL);
> >         ...
> >         this_cpu_write(rcu_state, IN_USER);
>
> I was thinking that either a counter or a state flag could make
> sense. Doing both would be pointless. The counter could use the low
> bits to indicate the state. The benefit of the counter would be
> that the RCU-waiting CPU could observe that the counter has
> incremented and that therefore a grace period has elapsed. Getting
> it right would require lots of care.

So if you mean:

        ...
        this_cpu_inc(rcu_qs_ctr);
        ...

I don't see how this would work reliably: how do you handle the case
of a SCHED_FIFO task never returning from user-space (a common
technique in RT apps)? synchronize_rcu() would block indefinitely,
as it would never see rcu_qs_ctr increase.

We have to be able to observe user-mode anyway, for system-time
statistics purposes, and that flag could IMHO also drive the RCU GC
machinery.

> > > The problem is that I don't see how TIF_RCU_THINGY can work
> > > reliably. If the remote CPU sets it, it'll be too late and
> > > we'll still enter user mode without seeing it. If it's just an
> > > optimization, though, then it should be fine.
> >
> > Well, after setting it, the remote CPU has to re-check whether
> > the RT CPU has entered user-mode - before it goes to wait.
>
> How?
>
> Suppose the exit path looked like:
>
>         this_cpu_write(rcu_state, IN_USER);
>
>         if (ti->flags & _TIF_RCU_NOTIFY) {
>                 if (test_and_clear_bit(TIF_RCU_NOTIFY, &ti->flags))
>                         slow_notify_rcu_that_we_are_exiting();
>         }
>
>         iret or sysret;

No, it would look like this:

        this_cpu_write(rcu_state, IN_USER);

        iret or sysret;

I.e. IN_USER is set well after all notifications are checked, and no
kernel execution happens afterwards. (No extra checks are added: the
regular return-to-user-work checks would handle TIF_RCU_NOTIFY.)

( Same goes for idle: we just mark it IN_IDLE and move it back to
  IN_KERNEL after the idling ends. )
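[ As a concrete illustration of that ordering, the exit path could be
  structured along these lines. This is only a hypothetical sketch:
  exit_to_user_mode() and _TIF_WORK_MASK are placeholder names;
  rcu_state, IN_USER, TIF_RCU_NOTIFY and
  slow_notify_rcu_that_we_are_exiting() are taken from the snippets
  above. ]

        /*
         * Hypothetical sketch of the proposed exit path: all pending
         * TIF work - including TIF_RCU_NOTIFY - is processed by the
         * regular return-to-user-work loop, and only once that loop
         * is done do we mark the CPU as being in user mode. No
         * kernel code runs after the final store. (Irqs are assumed
         * to be disabled here, as usual on this path.)
         */
        static void exit_to_user_mode(struct thread_info *ti)
        {
                unsigned long flags = READ_ONCE(ti->flags);

                while (flags & _TIF_WORK_MASK) {
                        if (flags & _TIF_RCU_NOTIFY) {
                                if (test_and_clear_bit(TIF_RCU_NOTIFY,
                                                       &ti->flags))
                                        slow_notify_rcu_that_we_are_exiting();
                        }
                        /* ... handle rescheduling, signals, etc. ... */
                        flags = READ_ONCE(ti->flags);
                }

                /* The very last store before the IRET or SYSRET: */
                this_cpu_write(rcu_state, IN_USER);
        }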
> The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets
> _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET
> because stores can be delayed. (It could even happen after sysret,
> IIRC, but IRET is serializing.)

All that the RCU-waiting CPU has to do in the synchronize_rcu()
slowpath is something like:

        if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) {
                smp_mb__before_atomic();
                set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY);
                smp_rmb();
                if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL)
                        ... go wait ...
        }
        /* Cool, we observed quiescent state: */

The cost of the trivial barrier is nothing compared to the 'go wait'
cost, which we will pay in 99.9% of the cases!

> If we put an mfence after the this_cpu_write() or did an
> unconditional test_and_clear_bit on ti->flags then this problem
> goes away, but that would probably be slower than we'd like.

We are talking about a dozen cycles, while a typical
synchronize_rcu() will wait millions (sometimes billions) of cycles.
There's absolutely zero performance concern here, and it's all
CPU-local in any case.

In fact a user-mode/kernel-mode flag speeds up naive implementations
of synchronize_rcu(): they can observe the extended quiescent state
immediately, without having to wait for a counter to increase (which
was always the classic way to observe grace periods).

If all CPUs are in user mode or are idle (which is rather common!)
then synchronize_rcu() could return almost immediately - while
previously it had to wait for scheduling or periodic timer irqs to
trigger on all CPUs, adding many milliseconds of delay even in the
best of cases.
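[ To illustrate that last point with a hypothetical sketch: a naive
  synchronize_rcu() fast path could scan the per-CPU flags directly.
  all_cpus_in_extended_qs() is a made-up helper name, and a real
  version would still need the set-flag-and-recheck ordering shown
  in the slowpath snippet above. ]

        /*
         * Hypothetical helper: if every online CPU is already in an
         * extended quiescent state (user mode or idle), then no CPU
         * can be inside an RCU read-side critical section, and a
         * naive synchronize_rcu() could return almost immediately.
         */
        static bool all_cpus_in_extended_qs(void)
        {
                int cpu;

                for_each_online_cpu(cpu) {
                        int state = per_cpu(rcu_state, cpu);

                        if (state != IN_USER && state != IN_IDLE)
                                return false;
                }

                return true;
        }

Thanks,

	Ingo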