Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753439AbbEHL1U (ORCPT ); Fri, 8 May 2015 07:27:20 -0400 Received: from mail-wi0-f175.google.com ([209.85.212.175]:38328 "EHLO mail-wi0-f175.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752886AbbEHL1Q (ORCPT ); Fri, 8 May 2015 07:27:16 -0400 Date: Fri, 8 May 2015 13:27:11 +0200 From: Ingo Molnar To: Andy Lutomirski Cc: fweisbec@redhat.com, Paolo Bonzini , Thomas Gleixner , X86 ML , Peter Zijlstra , Ingo Molnar , Heiko Carstens , Mike Galbraith , "linux-kernel@vger.kernel.org" , Rik van Riel , williams@redhat.com Subject: Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry Message-ID: <20150508112711.GA20444@gmail.com> References: <55466E72.8060602@redhat.com> <20150507104845.GB14924@gmail.com> <20150507124437.GB17443@gmail.com> <20150507150845.GA20608@gmail.com> <20150508063711.GA4934@gmail.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5897 Lines: 172 * Andy Lutomirski wrote: > On May 8, 2015 12:07 PM, "Ingo Molnar" wrote: > > > > > > * Andy Lutomirski wrote: > > > > > > So do you mean: > > > > > > > > this_cpu_set(rcu_state) = IN_KERNEL; > > > > ... > > > > this_cpu_inc(rcu_qs_ctr); > > > > this_cpu_set(rcu_state) = IN_USER; > > > > > > > > ? > > > > > > > > So in your proposal we'd have an INC and two MOVs. I think we can make > > > > it just two simple stores into a byte flag, one on entry and one on > > > > exit: > > > > > > > > this_cpu_set(rcu_state) = IN_KERNEL; > > > > ... > > > > this_cpu_set(rcu_state) = IN_USER; > > > > > > > > > > I was thinking that either a counter or a state flag could make sense. > > > Doing both would be pointless. The counter could use the low bits to > > > indicate the state. The benefit of the counter would be that the > > > RCU-waiting CPU could observe that the counter has incremented and > > > that therefore a grace period has elapsed. Getting it right would > > > require lots of care. > > > > So if you mean: > > > > > > ... > > this_cpu_inc(rcu_qs_ctr); > > > > > > I don't see how this would work reliably: how do you handle the case > > of a SCHED_FIFO task never returning from user-space (common technique > > in RT apps)? synchronize_rcu() would block indefinitely as it would > > never see rcu_qs_ctr increase. > > > > We have to be able to observe user-mode anyway, for system-time > > statistics purposes, and that flag could IMHO also drive the RCU GC > > machinery. > > The counter would have to be fancier than that to work. We could > say that an even value means user mode, for example. IOW the high > bits of the counter would count transitions to quiescent and the low > bits would indicate the state. > > Actually, I'd count backwards. We could use an andl instruction to > go to user mode and a decl to go to kernel mode. at which point it's not really a monotonic quiescent state counter - which your naming suggested before. Also it would have to be done both at entry and at exit. But yes, if you replace a state flag with a recursive state flag that is INC/DEC maintained that would work as well. That's not a counter though. > > > The RCU-waiting CPU sees that rcu_state == IN_KERNEL and sets > > > _TIF_RCU_NOTIFY. This could happen arbitrarily late before IRET > > > because stores can be delayed. (It could even happen after > > > sysret, IIRC, but IRET is serializing.) > > > > All it has to do in the synchronize_rcu() slowpath is something > > like: > > I don't think this works. Suppose rt_cpu starts in kernel mode. > Then it checks ti flags. > > > > > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { > > smp_mb__before_atomic(); > > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); > > smp_rmb(); > > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) > > Now it sets rcu_state and exits to user mode. It never notices > TIF_RCU_NOTIFY. Indeed, you are right, that's racy. > [...] We could fix this by sending an IPI to kick it. With an IPI we won't need any flags or counters on the RT CPU - it's the IPI we are trying to avoid. So how about the following way to fix the race: simply do a poll loop with a fixed timeout sleep: like it would do anyway if synchronize_rcu() was waiting for the timer irq to end the grace period on the RT-CPU. The TIF flag would cause an RCU grace period to lapse, no need to wake up the synchronize_rcu() side: it will notice the (regular) increased RCU counter. > > We are talking about a dozen cycles, while a typical > > synchronize_rcu() will wait millions (sometimes billions) of > > cycles. There's absolutely zero performance concern here and it's > > all CPU local in any case. > > The barrier would affect all syscalls, though. [...] What barrier? I never suggested any barrier in the syscall fast path, this bit: > > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) { > > smp_mb__before_atomic(); > > set_tsk_thread_flag(remote_task, TIF_RCU_NOTIFY); > > smp_rmb(); > > if (per_cpu(rcu_state, rt_cpu) == IN_KERNEL) runs inside synchronize_rcu() or so. But none of that is needed if we do: - simple IN_KERNEL/IN_USER/[/IN_IDLE] flag updated at context entry/exit time, straight in the assembly - TIF_RCU_QS to trigger a TIF slowpath on the RT-CPU that does nothing but: this_cpu_inc(rcu_qs_ctr). - synchronize_rcu() injects TIF_RCU_QS into the RT-CPU and then does a timeout-poll-loop with jiffies granular timeouts, simulating a timer IRQ in essence. It will either observe IN_USER or will see the regular RCU qs counter increase. this should be race free. Alternatively we can even get rid of the TIF flag by merging the percpu counter with the percpu state. Quiescent states are recorded via a counter shifted by two bits: rcu_qs_ctr += 4; while user/kernel/idle mode is recorded in the lower 2 bits. So on entering the kernel we'd do: rcu_qs_ctr += 4+1; /* Register QS and set bit 0 to IN_KERNEL */ on returning to user-space we'd do: rcu_qs_ctr += 4-1; /* We have bit 0 set already, clear it and register a QS */ on going idle we'd do: rcu_qs_ctr += 4+2; /* Register QS, set bit 1 */ on return from idle we'd do: rcu_qs_ctr += 4-2+1; /* Register QS, clear bit 1, set bit 0 */ etc. On all boundary transitions we can use a constant ADD on a suitable percpu variable. Thanks, Ingo -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/