Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753293AbbFZTcD (ORCPT ); Fri, 26 Jun 2015 15:32:03 -0400 Received: from mail-la0-f46.google.com ([209.85.215.46]:36854 "EHLO mail-la0-f46.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753306AbbFZTbm (ORCPT ); Fri, 26 Jun 2015 15:31:42 -0400 MIME-Version: 1.0 In-Reply-To: <1050218158.4054.1435342186284.JavaMail.zimbra@efficios.com> References: <20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com> <20150624222609.6116.30992.stgit@kitami.mtv.corp.google.com> <1050218158.4054.1435342186284.JavaMail.zimbra@efficios.com> From: Andy Lutomirski Date: Fri, 26 Jun 2015 12:31:21 -0700 Message-ID: Subject: Re: [RFC PATCH 2/3] restartable sequences: x86 ABI To: Mathieu Desnoyers Cc: Paul Turner , Peter Zijlstra , "Paul E. McKenney" , Andrew Hunter , Andi Kleen , Lai Jiangshan , linux-api , "linux-kernel@vger.kernel.org" , rostedt , Josh Triplett , Ingo Molnar , Andrew Morton , Linus Torvalds , Chris Lameter Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4619 Lines: 106 On Fri, Jun 26, 2015 at 11:09 AM, Mathieu Desnoyers wrote: > ----- On Jun 24, 2015, at 6:26 PM, Paul Turner pjt@google.com wrote: > >> Implements the x86 (i386 & x86-64) ABIs for interrupting and restarting >> execution within restartable sequence sections. >> >> With respect to the x86-specific ABI: >> On 32-bit: Upon restart, the interrupted rip is placed in %ecx >> On 64-bit (or x32): Upon restart, the interrupted rip is placed in %r10 >> >> While potentially surprising at first glance, this choice is strongly motivated >> by the fact that the available scratch registers under the i386 function call >> ABI overlap with those used as argument registers under x86_64. >> >> Given that sequences are already personality specific and that we always want >> the arguments to be available for sequence restart, it's much more natural to >> ultimately differentiate the ABI in these two cases. >> >> Signed-off-by: Paul Turner >> --- >> arch/x86/include/asm/restartable_sequences.h | 50 +++++++++++++++++++ >> arch/x86/kernel/Makefile | 2 + >> arch/x86/kernel/restartable_sequences.c | 69 ++++++++++++++++++++++++++ >> arch/x86/kernel/signal.c | 12 +++++ >> kernel/restartable_sequences.c | 11 +++- >> 5 files changed, 141 insertions(+), 3 deletions(-) >> create mode 100644 arch/x86/include/asm/restartable_sequences.h >> create mode 100644 arch/x86/kernel/restartable_sequences.c >> >> diff --git a/arch/x86/include/asm/restartable_sequences.h >> b/arch/x86/include/asm/restartable_sequences.h >> new file mode 100644 >> index 0000000..0ceb024 >> --- /dev/null >> +++ b/arch/x86/include/asm/restartable_sequences.h >> @@ -0,0 +1,50 @@ >> +#ifndef _ASM_X86_RESTARTABLE_SEQUENCES_H >> +#define _ASM_X86_RESTARTABLE_SEQUENCES_H >> + >> +#include >> +#include >> +#include >> + >> +#ifdef CONFIG_RESTARTABLE_SEQUENCES >> + >> +static inline bool arch_rseq_in_crit_section(struct task_struct *p, >> + struct pt_regs *regs) >> +{ >> + struct task_struct *leader = p->group_leader; >> + struct restartable_sequence_state *rseq_state = &leader->rseq_state; >> + >> + unsigned long ip = (unsigned long)regs->ip; >> + if (unlikely(ip < (unsigned long)rseq_state->crit_end && >> + ip >= (unsigned long)rseq_state->crit_start)) >> + return true; >> + >> + return false; >> +} >> + >> +static inline bool arch_rseq_needs_notify_resume(struct task_struct *p) >> +{ >> +#ifdef CONFIG_PREEMPT >> + /* >> + * Under CONFIG_PREEMPT it's possible for regs to be incoherent in the >> + * case that we took an interrupt during syscall entry. Avoid this by >> + * always deferring to our notify-resume handler. >> + */ >> + return true; > > I'm a bit puzzled about this. If I look at perf_get_regs_user() in the perf > code, task_pt_regs() seems to return the user-space pt_regs for a task with > a current->mm set (iow, not a kernel thread), even if an interrupt nests on > top of a system call. The only corner-case is NMIs, where an NMI may interrupt > in the middle of setting up the task pt_regs, but scheduling should never happen > there, right ? Careful, here! task_pt_regs returns a pointer to the place where regs would be if they were fully initialized. We can certainly take an interrupt in the middle of pt_regs setup (entry_SYSCALL_64 enables interrupts very early, for example). To me, the question is whether we can ever be preemptable at such a time. It's a bit worse, though: we can certainly be preemptible when other code is accessing pt_regs. clone, execve, sigreturn, and signal delivery come to mind. Why don't we give up on poking at user state from the scheduler and do it on exit to user mode instead? Starting in 4.3 (hopefully landing in -tip in a week or two), we should have a nice function prepare_exit_to_usermode that runs with well-defined state, non-reentrantly, that can do whatever you want here, *including user memory access*. The remaining question would be what the ABI should be. Could we get away with a vDSO function along the lines of "set *A=B and *X=Y if we're on cpu N and *X=Z"? Straight-up cmpxchg would be even simpler. --Andy -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/