From: Andy Lutomirski
Date: Tue, 30 Apr 2019 09:33:51 -0700
Subject: Re: [PATCH 3/4] x86/ftrace: make ftrace_int3_handler() not to skip fops invocation
To: Linus Torvalds
Cc: Peter Zijlstra, Andy Lutomirski, Steven Rostedt, Nicolai Stange, Thomas Gleixner, Ingo Molnar, Borislav Petkov, "H. Peter Anvin", the arch/x86 maintainers, Josh Poimboeuf, Jiri Kosina, Miroslav Benes, Petr Mladek, Joe Lawrence, Shuah Khan, Konrad Rzeszutek Wilk, Tim Chen, Sebastian Andrzej Siewior, Mimi Zohar, Juergen Gross, Nick Desaulniers, Nayna Jain, Masahiro Yamada, Joerg Roedel, Linux List Kernel Mailing, live-patching@vger.kernel.org, "open list:KERNEL SELFTEST FRAMEWORK"
References: <20190428133826.3e142cfd@oasis.local.home> <20190430135602.GD2589@hirez.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Apr 30, 2019 at 9:06 AM Linus Torvalds wrote:
>
> On Tue, Apr 30, 2019 at 6:56 AM Peter Zijlstra wrote:
> >
>
> Realistically, I don't think you can hit the problem in practice. The
> only way to hit that incredibly small race of "one instruction, *both*
> NMI and interrupts" is to have a lot of interrupts going all at the
> same time, but that will also then solve the latency problem, so the
> very act of triggering it will also fix it.
>
> I don't see any case where it's really bad. The "sti sysexit" race is
> similar, just about latency of user space signal reporting (and
> perhaps any pending TIF_WORK_xyz flags).

In the worst case, it actually kills the machine. Last time I tracked
a bug like this down, I think the issue was that we got preempted
after the last TIF_ check, entered a VM, exited, context switched
back, and switched to user mode without noticing that there was a
pending KVM user return notifier.
This left us with bogus CPU state and the machine exploded.

Linus, can I ask you to reconsider your opposition to Josh's other
approach of just shifting the stack on int3 entry?  I agree that it's
ugly, but the ugliness is easily manageable and fairly self-contained.
We add a little bit of complication to the entry asm (but it's not
like it's unprecedented -- the entry asm does all kinds of stack
rearrangement due to IST and PTI crap already), and we add an
int3_emulate_call(struct pt_regs *regs, unsigned long target) helper
that has appropriate assertions that the stack is okay and emulates
the call.  And that's it.

In contrast, your approach involves multiple asm trampolines, hash
tables, batching complications, and sti shadows.

As an additional argument, with the stack-shifting approach, it runs
on *every int3 from kernel mode*.  This means that we can do something
like this:

static bool int3_emulate_call_okay(struct pt_regs *regs)
{
	unsigned long available_stack = regs->sp - (unsigned long)<...>;
	return available_stack >= sizeof(long);
}

void do_int3(...)
{
	WARN_ON_ONCE(!user_mode(regs) && !int3_emulate_call_okay(regs));
	...;
}

static void int3_emulate_call(struct pt_regs *regs, unsigned long target)
{
	BUG_ON(user_mode(regs) || !int3_emulate_call_okay(regs));
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = target;
	/* CET SHSTK fixup goes here */
}

Obviously the CET SHSTK fixup might be rather nasty, but I suspect
it's a solvable problem.

A major benefit of this is that the entry asm nastiness will get
exercised all the time, and, if we screw it up, the warning will fire.
This is the basic principle behind why the entry stuff *works* these
days.  I've put a lot of effort into making sure that running kernels
with CONFIG_DEBUG_ENTRY and running the selftests actually exercises
the nasty cases.

--Andy