From: Andy Lutomirski
Date: Thu, 2 Apr 2015 07:26:19 -0700
Subject: Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
To: Ingo Molnar
Cc: Denys Vlasenko, Brian Gerst, Andy Lutomirski,
    "the arch/x86 maintainers", Linux Kernel Mailing List,
    Borislav Petkov, Linus Torvalds, Borislav Petkov
In-Reply-To: <20150402123159.GA25151@gmail.com>
References: <9472f1ca4c19a38ecda45bba9c91b7168135fcfa.1427923514.git.luto@kernel.org>
    <20150402090744.GA26846@gmail.com> <551D14D3.1070907@redhat.com>
    <20150402103735.GA21105@gmail.com> <551D3503.6000508@redhat.com>
    <20150402123159.GA25151@gmail.com>

On Thu, Apr 2, 2015 at 5:31 AM, Ingo Molnar wrote:
>
> * Denys Vlasenko wrote:
>
>> On 04/02/2015 01:14 PM, Brian Gerst wrote:
>> >>>> So I merged this as it's an obvious bugfix, but in hindsight I'm
>> >>>> really uneasy about the whole opportunistic SYSRET concept: it appears
>> >>>> that the chance that %rcx matches return-%rip is astronomical - this
>> >>>> is why this bug wasn't noticed live so far.
>> >>>>
>> >>>> So should we really be doing this?
>> >>>
>> >>> Andy does this not for the off-chance that userspace's RCX is equal
>> >>> to return address and R11 == RFLAGS. The chances of that are
>> >>> astronomically small.
>> >>>
>> >>> This code path triggers when ptrace/audit/seccomp is active. Instead
>> >>> of torturing ourselves trying to not divert into IRET return, now
>> >>> code is steered that way. But then immediately before actual IRET,
>> >>> we check again: "do we really need IRET?" IOW "did ptrace really
>> >>> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
>> >>> cases will not be true.
>> >>
>> >> I keep forgetting about that, my test systems have the audit muck
>> >> turned off ;-)
>> >>
>> >> Fair enough - and it's sensible to share the IRET path between
>> >> interrupts and complex-return system calls, even though the check
>> >> is unnecessary overhead for the pure interrupt return path...
>> >
>> >
>> > Maybe we could reintroduce TIF_IRET for this purpose instead of
>> > (ab)using TIF_NOTIFY_RESUME. Then we would only do the opportunistic
>> > check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip
>> > it for interrupts.
>>
>> The very first check in the existing code, pt_regs->cx ==
>> pt_regs->ip, will fail for interrupt returns.
>>
>> You hardly can save anything by placing a (ti->flags &
>> TIF_TRY_SYSRET) check in front of it, it's almost as expensive.
>
> Well, what I was thinking of was to have a pure irq (well, async
> context) return path, not shared with the weird-syscall-IRET return
> path at all ...
>
> It would be open coded, not obfuscated via macros.
>
> That way AFAICS the upsides are:
>
>   - it's easier to read (and maintain) what goes on in which case.
>     '*intr*' labels would truly identify interrupt return related
>     processing, for a change!
>
>   - we can optimize in a more directed fashion - like here
>
> ... while the downsides are:
>
>   - more code
>   - a (small) chance of a fix going to one path while not the other.
>
> How much extra code would it be?

Negative if we did it right, perhaps.

I think the best approach is a complete rewrite, not an attempt to
incrementally improve it. The current code is held together by gotos
and baling wire, and I'm surprised it works at all.
Some of it seems to work by accident AFAICT. For example, the sysret
audit "fast path" that I deleted as part of the opportunistic sysret
work was quite buggy AFAICT, but no one who looks at this code cares
about audit, and it's nearly impossible to even tell what the code is
supposed to do.

Linus tried to rewrite some of it last year, but it was incomplete.

Here's my vague inventory of what the exit paths need to do on return
to userspace:

 - syscall_trace_leave [1] (syscall only)

 - context tracking if TIF_NOHZ (all exits)

 - one-shot work. These are things that must be done if the flags are
   set, and doing it clears the flags. These flags need to be checked
   with IRQs off, and IRQs cannot be re-enabled afterwards. This
   includes signal delivery, scheduling, and user return notifiers.
   (all exits)

 - uprobes. Maybe this counts as one-shot work. (all exits, I assume)

 - Check for sysret applicability (only important for syscalls)

 - espfix, iret, unless we're using sysret (all exits)

There's also the special case of interrupt returns to kernel mode.
That should schedule in preemptable kernels.

To me, this suggests that we should have a total of four asm exit paths:

 - syscall return. IMO this should call a single C function, check
   the sysret conditions, and jump to the espfix code.

 - paranoid return to kernel. This should just return after a
   possible swapgs. (We do this now.)

 - interrupt/exception return. IMO this should call a single C
   function and jump to the espfix code. [2]

 - NMI -- this is its own crazy thing.

All of the check/careful/very_careful crap can just go. Good riddance.

[1] syscall_trace_leave contains context tracking hooks that are
AFAICT completely unnecessary. That's almost okay, though -- other
similarly silly context tracking hooks fix up the mess. This might
explain part of why context tracking is so incredibly slow.

[2] I don't even think we need the retint_careful vs retint_kernel
distinction in asm.
If we ditch that, we should be able to replace it with something like:

    call prepare_intr_exception_return
    testl %ebx,%ebx
    jz 1f
    SWAPGS
1:
    INTERRUPT_RETURN

Of course, prepare_intr_exception_return would need to DTRT if we're
returning to kernel mode, but that's easy.

--Andy
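
For reference, the "check for sysret applicability" step discussed in
this thread could be modeled in plain userspace C roughly as below.
This is only a sketch under stated assumptions: struct model_regs,
MODEL_EFLAGS_TF and model_can_sysret() are hypothetical names made up
for illustration, not the kernel's struct pt_regs or its entry code.
It covers only the conditions named in the thread (cx == ip,
r11 == flags, TF clear) and leaves out the segment checks
(pt_regs->ss and friends).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of the saved user registers.
 * Not the kernel's struct pt_regs. */
struct model_regs {
    uint64_t ip;    /* saved return RIP */
    uint64_t cx;    /* saved RCX */
    uint64_t r11;   /* saved R11 */
    uint64_t flags; /* saved RFLAGS */
};

/* Trap flag is bit 8 of RFLAGS. */
#define MODEL_EFLAGS_TF 0x100ULL

/* Returns true if the saved state could be restored by SYSRET
 * under the conditions mentioned in the thread. */
static bool model_can_sysret(const struct model_regs *regs)
{
    /* SYSRET loads RIP from RCX, so the two must already match. */
    if (regs->cx != regs->ip)
        return false;

    /* SYSRET loads RFLAGS from R11, so those must match too. */
    if (regs->r11 != regs->flags)
        return false;

    /* The patch under discussion: fall back to IRET if TF is set. */
    if (regs->flags & MODEL_EFLAGS_TF)
        return false;

    return true;
}
```

An interrupt return fails the first comparison almost every time,
which is the point made in the quoted text about the check being
cheap on that path.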