Date: Sun, 19 Jul 2020 23:53:59 -0400
From: Joel Fernandes
To: Thomas Gleixner
Cc: Vineeth Remanan Pillai, Nishanth Aravamudan, Julien Desfossez,
    Peter Zijlstra, Tim Chen, mingo@kernel.org, pjt@google.com,
    torvalds@linux-foundation.org, linux-kernel@vger.kernel.org,
    subhra.mazumdar@oracle.com, fweisbec@gmail.com, keescook@chromium.org,
    kerrnel@google.com, Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider,
    Mel Gorman, Pawan Gupta, Paolo Bonzini, vineethrp@gmail.com, Chen Yu,
    Christian Brauner, Aubrey Li, Tim Chen, "Paul E. McKenney"
Subject: Re: [RFC PATCH 14/16] irq: Add support for core-wide protection of IRQ and softirq
Message-ID: <20200720035359.GA4187092@google.com>
References: <871rl9r9xr.fsf@nanos.tec.linutronix.de>
In-Reply-To: <871rl9r9xr.fsf@nanos.tec.linutronix.de>

On Sat, Jul 18, 2020 at 01:36:16AM +0200, Thomas Gleixner wrote:
> Vineeth, Joel!

Hi Thomas,

> Vineeth Remanan Pillai writes:
> > From: "Joel Fernandes (Google)"
> >
> > Lastly, we also check in the schedule loop if we are about to schedule
> > an untrusted process while the core is in such a state. This is possible
> > if a trusted thread enters the scheduler by way of yielding CPU. This
> > would involve no transitions through the irq_exit() point to do any
> > waiting, so we have to explicitly do the waiting there.
>
> Lots of 'we' and 'this patch' in this changelog. Care to read
> Documentation/process/submitting-patches.rst ?

Sure, will fix it.

[..]
> is the usual kernel coding style.
>
> > +	static_branch_disable(&sched_core_irq_pause);
> > +
> > +	return 1;
> > +}
> > +__setup("sched_core_irq_pause=", set_sched_core_irq_pause);
> > +#endif
> > +
> >  asmlinkage __visible void __softirq_entry __do_softirq(void)
> >  {
> >  	unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
> > @@ -273,6 +291,16 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
> >  	/* Reset the pending bitmask before enabling irqs */
> >  	set_softirq_pending(0);
> >
> > +#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
> > +	/*
> > +	 * Core scheduling mitigations require entry into softirq to send stall
> > +	 * IPIs to sibling hyperthreads if needed (ex, sibling is running
> > +	 * untrusted task). If we are here from irq_exit(), no IPIs are sent.
> > +	 */
> > +	if (static_branch_likely(&sched_core_irq_pause))
> > +		sched_core_irq_enter();
>
> So you already have a #ifdef CONFIG_SCHED_CORE_IRQ_PAUSE section
> above. Why on earth can't you just provide helper functions and stubs
> for the CONFIG_SCHED_CORE_IRQ_PAUSE=n case there instead of sprinkling
> #ifdeffery all over the place?

These ifdeffery and checkpatch / command line parameter issues were
added by Vineeth before he sent out my patch. I'll let him comment on
these; agreed they all need fixing!
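Something like the following is what I have in mind for cleaning that up
(just a sketch: the double-underscore names and the header placement are
made up here, only sched_core_irq_pause and sched_core_irq_enter() exist
in this series):

	/* Sketch: one header carries all the ifdeffery. */
	#ifdef CONFIG_SCHED_CORE_IRQ_PAUSE
	DECLARE_STATIC_KEY_TRUE(sched_core_irq_pause);

	void __sched_core_irq_enter(void);
	void __sched_core_irq_exit(void);

	static inline void sched_core_irq_enter(void)
	{
		if (static_branch_likely(&sched_core_irq_pause))
			__sched_core_irq_enter();
	}

	static inline void sched_core_irq_exit(void)
	{
		if (static_branch_likely(&sched_core_irq_pause))
			__sched_core_irq_exit();
	}
	#else /* !CONFIG_SCHED_CORE_IRQ_PAUSE */
	static inline void sched_core_irq_enter(void) { }
	static inline void sched_core_irq_exit(void) { }
	#endif

Callers such as __do_softirq() and irq_exit() would then just call
sched_core_irq_enter()/sched_core_irq_exit() unconditionally, with the
static key and the =n stubs compiling the calls away.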
> > +#endif
> > +
> >  	local_irq_enable();
>
> How is __do_softirq() relevant at all?
>
> __do_softirq() is called from:
>
>  1) the tail of interrupt handling
>  2) the tail of local_bh_enable()
>  3) the ksoftirq thread
>
> None of this is relevant for what you are trying to do:
>
>  #1: Is covered by the interrupt handling code simply because this nests
>      into the interrupt handling.
>
>  #2: local_bh_enable() only ends up there when invoked from thread
>      context.
>
>      This would only make sense if one sibling is in the kernel and the
>      other in user space or guest mode. Isn't that forbidden by core
>      scheduling in the first place?

One sibling being in the kernel while the other sibling is in userspace
is not yet forbidden by core-scheduling; we need to add support for
that. After reading your email and thinking about this for the last few
days, I think we can make it work by sending IPIs on both syscall and
IRQ entries.

>  #3: See #2
>
> But what's worse is that this is called from an interrupt disabled
> region. So you brilliantly created a trivial source of livelocks.
> stomp_machine() being the most obvious candidate. But it's easy enough
> to come up with more subtle ones.
>
> The point is that you want to bring out the sibling thread from user
> space or guest mode when the other HT thread:
>
>  A) invokes a syscall or traps into the hypervisor which is
>     semantically equivalent (not code wise).
>
>  B) handles an interrupt/exception
>
> So #A covers #2 and #3 above and #B covers #1 above.

Right. #A was never implemented, but now I'm going to do it!

> But you have the same livelock problem with doing this core wait thingy
> within any interrupt disabled region.

Right, which we fixed by moving the waiting code to
prepare_exit_to_usermode() in the other test patch I sent to fix
Aubrey's deadlock; but as you suggested, I will make it work with the
new generic entry code.

> What you really want to do is on syscall and interrupt/exception entry:
>
>     if (other_sibling_in_user_or_guest())
>         send_IPI();

Ok.

> Of course it's convenient to piggyback that onto the reschedule IPI,
> but that sucks. What you want is a dedicated IPI which does:
>
>     magic_ipi()
>         set_thread_flag(TIF_CORE_SCHED_EXIT);

I tried IPIs before and ran into a few issues, mostly related to CSD
locking if I remember correctly; I will try again.

> And on the exit side you want to look at the entry code generalisation I
> just posted:
>
>   https://lore.kernel.org/r/20200716182208.180916541@linutronix.de

Thank you, I went over this code and it makes sense to work based on that.

> Add TIF_CORE_SCHED_EXIT to EXIT_TO_USER_MODE_WORK and to
> EXIT_TO_GUEST_MODE_WORK and handle it in exit_to_guest_mode_work() and
> exit_to_user_mode_loop() with interrupts enabled.

There's already state to determine whether waiting is needed: we have
counters that track whether any sibling HT in the core is handling an
IRQ, and if so I wait before exiting to user mode. The outermost IRQ
entry is tracked so we don't send IPIs on every entry.

But I got an idea from your email: whenever a 'tagged' task becomes
current, I can set a TIF flag marking that it is a tagged task and
needs special handling when exiting to user mode (or guest). Then on
exiting to usermode (or guest), I can check whether any waiting is
needed and wait there, in the interrupt-enabled regions you are
suggesting. This should simplify the code, since currently we have that
complex if-expression to determine whether the task is "special".
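Roughly what I am picturing on top of your entry series, as a sketch
only: sched_core_safe_to_return() is a placeholder for the existing
counter-based check, and the TIF bit is the one you suggest, nothing
here is code from this series:

	/* Dedicated IPI handler on the sibling: only flag the task.  The
	 * interrupt itself brings it out of user/guest mode; the flag makes
	 * it wait before going back. */
	static void sched_core_exit_ipi(void *unused)
	{
		set_thread_flag(TIF_CORE_SCHED_EXIT);
	}

	/* Run from exit_to_user_mode_loop() / exit_to_guest_mode_work() with
	 * interrupts enabled, so the wait cannot livelock an
	 * interrupt-disabled region the way waiting in irq_exit() could. */
	static void sched_core_exit_work(void)
	{
		clear_thread_flag(TIF_CORE_SCHED_EXIT);

		/* Placeholder for the existing check of whether any sibling
		 * HT in the core is still in IRQ/softirq or untrusted. */
		while (!sched_core_safe_to_return())
			cpu_relax();
	}

	/* In exit_to_user_mode_loop(), after adding the bit to
	 * EXIT_TO_USER_MODE_WORK: */
		if (ti_work & _TIF_CORE_SCHED_EXIT)
			sched_core_exit_work();

The waiting then happens only with interrupts enabled, and the hooks in
__do_softirq()/irq_exit() can go away.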
> Trying to do this in fully interrupt disabled context is just a recipe
> for disaster.
>
> The entry case condition wants to have a TIF bit as well, i.e.
>
>     if (thread_test_bit(TIF_CORE_SCHED_REQUIRED)) {
>         sched_ipi_dance() {
>             if (other_sibling_in_user_or_guest())
>                 send_IPI();
>         }
>     }

I did not understand this bit. Could you explain more about it? Are you
talking about the IPIs sent from the schedule() loop in this series?

> I know you hate this, but we are not going to add tons of unholy hackery
> to make this "work" by some definition of work.
>
> The only way to make this work without making a complete mess and
> killing performance is to do this on a best effort basis. It really does
> not matter whether the result of the orchestration is perfect or
> not. What matters is that you destroy any predictability. If the
> orchestration terminates early occasionally then the resulting damage is
> a purely academic exercise.
>
> Making it work reliably without hard to debug livelocks and other
> horrors is not so academic.
>
> So please stop overengineering this and just take the pragmatic approach
> of making it mostly correct. That will prevent 99.999% of realistic
> attacks.

Agreed, that's actually what I was aiming for. Even the IPIs are sent
only if needed. I did a lot of benchmarking and did not see a
performance hit.

> It might not prevent the carefully orchestrated academic paper attack
> which utilizes the 0.001% failure rate by artificially enforcing a
> scenario which cannot be enforced in a real world environment.

Got it.

thanks,

 - Joel