Received: by 2002:a05:6a10:22f:0:0:0:0 with SMTP id 15csp1193163pxk; Fri, 4 Sep 2020 03:14:24 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwdmN+vsSyIVc57sCoP3/REgIG5qp4Kw9TN7+yHmkO8cfwWHg2sRhxW7SP2p9TDsS494Cle X-Received: by 2002:aa7:d34b:: with SMTP id m11mr7899207edr.178.1599214464046; Fri, 04 Sep 2020 03:14:24 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1599214464; cv=none; d=google.com; s=arc-20160816; b=SXxg7rYOpQmdVpTlJnRPmmW4L1tEv6ybIxuWlsz8+tLEfDSF3nWrRy1bDCMBYsRG7C CS4fx7XwD5X6wiUItb/shbnG1IPKGJA7/j0+7jalbkYyqIYSvkUHa9pAG27jKXR4MrHq OxnZ+tUYRn5t692uxjH/mIO6cxqlPAuHZaldSvaViiZJN2/o9aeSSzvPPi2kFsR6oSBO 6sUfIJDfTJLquVEaaiHcOxVq8PxH2aNizZ5j7w9azoxSE98JIh9OAeSOLB/fZqrbFjCj XuqCa+djNXrq28qdbkFn8S48eF78rh13ftJ/A/XY/UXM5ZF4wV0QQkV52mxhBSe+E5la uYAw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:mime-version:message-id:date:references :in-reply-to:subject:cc:to:dkim-signature:dkim-signature:from; bh=Z8OvW6DJXZpgU3aKJ+ArJf6u2HaUfprFbOOK8smF874=; b=TXq8f6dny3PS1P6LfKlBgq2BeBRlasM58oDx5e7p29mK4xSVsz3kJrXXIoJmerUMHq 5Q3kAvvVVxyaVGWbhKhsrGD9dM914/UllPn6fskmOGxvsPPM8oA+Hdn+d2tw4cSniQ8L E6JT2aVM9CeVp2ISOJiy4RWbYR3rMbQ/1FhJYZV/sfLCCQ/R/3hannitAW+oJc+FNuDm Pv32C5Tk//FLVl15BjWxyBDJIjCsMV+Nex4OCJ9JDsx54hN2ytw4z4O5jCNYR6afMh1Z RZ5E4pP6csY8sdQsAyYx1f6WOy9ndbM5Q4LsrUnVkKmiShpph3yIjlyZo6BtsZu4DvLK UYPA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=dZBSlId5; dkim=neutral (no key) header.i=@linutronix.de header.s=2020e; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id h12si4120943ejx.85.2020.09.04.03.14.00; Fri, 04 Sep 2020 03:14:24 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@linutronix.de header.s=2020 header.b=dZBSlId5; dkim=neutral (no key) header.i=@linutronix.de header.s=2020e; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=linutronix.de Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729297AbgIDKNY (ORCPT + 99 others); Fri, 4 Sep 2020 06:13:24 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:48850 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726171AbgIDKNU (ORCPT ); Fri, 4 Sep 2020 06:13:20 -0400 Received: from galois.linutronix.de (Galois.linutronix.de [IPv6:2a0a:51c0:0:12e:550::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 146D0C061244; Fri, 4 Sep 2020 03:13:19 -0700 (PDT) From: Thomas Gleixner DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1599214396; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=Z8OvW6DJXZpgU3aKJ+ArJf6u2HaUfprFbOOK8smF874=; b=dZBSlId5W09VtfHcoWhseB5XiTBbGXWSjSWG5XFU58ZWYuqDdtM4Ipu9mNGZY+ddAn37dS AxDs/0r1rhHicd824BDPwytOgDf8f17n5gdpXS06ccu1J8J02/S3Px3uFgDRLoGRl+PjKM bU83O/2nJ/tSYGIE4CElngX6XiPbSOHvrFIlF/46BHUlEwSBUrrrMnzuy1Ks/8KOesKn82 yzz5TXpXJFv9IUUrWA8fr+U2kWyVy8VmvS31HDRYjYZ0xy05l0NK41J5gNkBa6AjMjEEvi SqxPITHKMuMWLYE38SvauwgernJ2tQIyJdf2qZxYvlL917z72wWol9JRX9XuYQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1599214396; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=Z8OvW6DJXZpgU3aKJ+ArJf6u2HaUfprFbOOK8smF874=; b=eoru9GuSlO3Zvk6GwC/O3LrVM2/p6u5L16LR2oCIwbP9/NjUC2FQtUYVwbzdk2rHLSP1Ta igirNT0+2/4DKxDw== To: Andy Lutomirski Cc: Andy Lutomirski , Brian Gerst , X86 ML , LKML , Will Deacon , Christian Borntraeger , Michael Ellerman , Benjamin Herrenschmidt , Paul Mackerras , Catalin Marinas , linux-arm-kernel , Heiko Carstens , Vasily Gorbik , linux-s390 , linuxppc-dev Subject: Re: ptrace_syscall_32 is failing In-Reply-To: References: <87k0xdjbtt.fsf@nanos.tec.linutronix.de> <87blioinub.fsf@nanos.tec.linutronix.de> Date: Fri, 04 Sep 2020 12:13:15 +0200 Message-ID: <87mu254zpg.fsf@nanos.tec.linutronix.de> MIME-Version: 1.0 Content-Type: text/plain Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Andy, On Wed, Sep 02 2020 at 09:49, Andy Lutomirski wrote: > On Wed, Sep 2, 2020 at 1:29 AM Thomas Gleixner wrote: >> >> But you might tell me where exactly you want to inject the SIGTRAP in >> the syscall exit code flow. > > It would be a bit complicated. Definitely after any signals from the > syscall are delivered. Right now, I think that we don't deliver a > SIGTRAP on the instruction boundary after SYSCALL while > single-stepping. (I think we used to, but only sometimes, and now we > are at least consistent.) This is because IRET will not trap if it > starts with TF clear and ends up setting it. (I asked Intel to > document this, and I think they finally did, although I haven't gotten > around to reading the new docs. Certainly the old docs as of a year > or two ago had no description whatsoever of how TF changes worked.) > > Deciding exactly *when* a trap should occur would be nontrivial -- we > can't trap on sigreturn() from a SIGTRAP, for example. > > So this isn't fully worked out. Oh well. >> >> I don't think we want that in general. The current variant is perfectly >> >> fine for everything except the 32bit fast syscall nonsense. Also >> >> irqentry_entry/exit is not equivalent to the syscall_enter/exit >> >> counterparts. >> > >> > If there are any architectures in which actual work is needed to >> > figure out whether something is a syscall in the first place, they'll >> > want to do the usual kernel entry work before the syscall entry work. >> >> That's low level entry code which does not require RCU, lockdep, tracing >> or whatever muck we setup before actual work can be done. >> >> arch_asm_entry() >> ... >> arch_c_entry(cause) { >> switch(cause) { >> case EXCEPTION: arch_c_exception(...); >> case SYSCALL: arch_c_syscall(...); >> ... >> } > > You're assuming that figuring out the cause doesn't need the kernel > entry code to run first. In the case of the 32-bit vDSO fast > syscalls, we arguably don't know whether an entry is a syscall until > we have done a user memory access. Logically, we're doing: > > if (get_user() < 0) { > /* Not a syscall. This is actually a silly operation that sets AX = > -EFAULT and returns. Do not audit or invoke ptrace. */ > } else { > /* This actually is a syscall. */ > } Yes, that's what I've addressed with providing split interfaces. >> You really want to differentiate between exception and syscall >> entry/exit. >> > > Why do we want to distinguish between exception and syscall > entry/exit? For the enter part, AFAICS the exception case boils down > to enter_from_user_mode() and the syscall case is: > > enter_from_user_mode(regs); > instrumentation_begin(); > > local_irq_enable(); > ti_work = READ_ONCE(current_thread_info()->flags); > if (ti_work & SYSCALL_ENTER_WORK) > syscall = syscall_trace_enter(regs, syscall, ti_work); > instrumentation_end(); > > Which would decompose quite nicely as a regular (non-syscall) entry > plus the syscall part later. There is a difference between syscall entry and exception entry at least in my view: syscall: enter_from_user_mode(regs); local_irq_enable(); exception: enter_from_user_mode(regs); >> we'd have: >> >> arch_c_entry() >> irqentry_enter(); >> local_irq_enble(); >> nr = syscall_enter_from_user_mode_work(); >> ... >> >> which enforces two calls for sane entries and more code in arch/.... > > This is why I still like my: > > arch_c_entry() > irqentry_enter_from_user_mode(); > generic_syscall(); > exit... So what we have now (with my patch applied) is either: 1) arch_c_entry() nr = syscall_enter_from_user_mode(); arch_handle_syscall(nr); syscall_exit_to_user_mode(); or for that extra 32bit fast syscall thing: 2) arch_c_entry() syscall_enter_from_user_mode_prepare(); arch_do_stuff(); nr = syscall_enter_from_user_mode_work(); arch_handle_syscall(nr); syscall_exit_to_user_mode(); So for sane cases you just use #1. Ideally we'd not need arch_handle_syscall(nr) at all, but that does not work with multiple ABIs supported, i.e. the compat muck. The only way we could make that work is to have: syscall_enter_exit(regs, mode) nr = syscall_enter_from_user_mode(); arch_handle_syscall(mode, nr); syscall_exit_to_user_mode(); and then arch_c_entry() becomes: syscall_enter_exit(regs, mode); which means that arch_handle_syscall() would have to evaluate the mode and chose the appropriate syscall table. Not sure whether that's a win. Thanks, tglx