Received: by 2002:a25:868d:0:0:0:0:0 with SMTP id z13csp3780391ybk; Tue, 19 May 2020 12:47:25 -0700 (PDT) X-Google-Smtp-Source: ABdhPJzkEonfQC6NO9iStFdzjx6Xou8GejyxVZlrt25J2mCkYf6u4BhbdFvD9OZLbzrcMZAPrPqP X-Received: by 2002:a17:906:5615:: with SMTP id f21mr748787ejq.372.1589917645356; Tue, 19 May 2020 12:47:25 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1589917645; cv=none; d=google.com; s=arc-20160816; b=wJCqodQW29pAGaWXMQkx4EPiKsMyEO6fQrFbLNY1c5FXnt0PITAr9S2ZI47zX8rN2k 12tHM+YU95nlwlw/ni8v4GgZwIe3HgkALoYtXnAfKY1bkzipD8OqFHemvYVgAvszdg+A glb8Qz+SFCoHwkszWnqE3RH5LPYTXbSbN+BOYXjovDV2yqilNY8J6oti1BLW+1IL8QHM YGRlJTwAbKI7trr6x+ZTZWTx+EYY8/Skw74AII06PY2EfyGSDvoacV+o8Ma4O0KMal2s 9EyKTPLDAN/2fbXeX5bPg7SEFiDhE9ulml8Ax1ca2R3q0gMmUqBjaWKEpOW/rl5CE1ce IL/g== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:dkim-signature; bh=Ys/aLNZOeJEw+PiV7iCdmrcB1AiY/eEsv6c4jnpR/Ms=; b=X0jqsROAltx/o6hDpGPPsV1rseYMG5v2twcGO4riedmnd7usFGyOKJGyekvcJAUZiz gDKQZWZ/cMzeqd/JnTK+jhk0VqyAIqlBIb28c9Llk1z98dKmM5SEq5b7vAoBEvDDzzyb l/nBx7LMLAfUJSP7s/s090fAd/yIivhZlfzaZQiza29dWvW+/ihhxMkVJBIQPGalCMje zmnu5l5HxHGOR4C4DwWppdH2juy+qFeDVIRdw37dSIHkoBtj5LSjj7YNpsPn6FyOSR2P RW/dz5Wu3pf2HaFbvNFys8oQY/uHIHrlo52PhV5wQPwB5jdSDqoMsVz+/xjRt7/d+C0y TPFA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=BFi5FVJ0; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id oq11si483654ejb.552.2020.05.19.12.47.01; Tue, 19 May 2020 12:47:25 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=BFi5FVJ0; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726583AbgESTpI (ORCPT + 99 others); Tue, 19 May 2020 15:45:08 -0400 Received: from mail.kernel.org ([198.145.29.99]:45096 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726161AbgESTpI (ORCPT ); Tue, 19 May 2020 15:45:08 -0400 Received: from mail-wr1-f47.google.com (mail-wr1-f47.google.com [209.85.221.47]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 05510206C3 for ; Tue, 19 May 2020 19:45:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1589917507; bh=prph2qoH9QCMW3pwAxbLaio9VKuKuClidqYb/wDuWzU=; h=References:In-Reply-To:From:Date:Subject:To:Cc:From; b=BFi5FVJ06lBapTZSSRbCWUzV32vFq/MmzdZvPPIUD8KJShdBwApyOsiZG08VAsI+7 MfL/tRJNb9nMAnoX+Q5+0fG8mJQF/UljG6cO/KxHE3suMTNSyVVbTo9zykPg/1NIKs 993iAY5SJY9m4atQ37no1oaF8yhhG64uX0koKqEY= Received: by mail-wr1-f47.google.com with SMTP id y3so759328wrt.1 for ; Tue, 19 May 2020 12:45:06 -0700 (PDT) X-Gm-Message-State: AOAM5317P26w8pmaxbyLLtQ/EVnHRz7fNhJR3ElcyXLG5dmWkyvDpjFm MQ5kevUhPFT9pccKtzDMPlhmRjPWJzknSxBZ+utY3Q== X-Received: by 2002:adf:eccf:: with SMTP id s15mr536305wro.70.1589917505450; Tue, 19 May 2020 12:45:05 -0700 (PDT) MIME-Version: 1.0 References: <20200515234547.710474468@linutronix.de> <20200515235125.425810667@linutronix.de> <87imgr7nwp.fsf@nanos.tec.linutronix.de> In-Reply-To: <87imgr7nwp.fsf@nanos.tec.linutronix.de> From: Andy Lutomirski Date: Tue, 19 May 2020 12:44:54 -0700 X-Gmail-Original-Message-ID: Message-ID: Subject: Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY To: Thomas Gleixner Cc: Andy Lutomirski , Andrew Cooper , LKML , X86 ML , "Paul E. McKenney" , Alexandre Chartre , Frederic Weisbecker , Paolo Bonzini , Sean Christopherson , Masami Hiramatsu , Petr Mladek , Steven Rostedt , Joel Fernandes , Boris Ostrovsky , Juergen Gross , Brian Gerst , Mathieu Desnoyers , Josh Poimboeuf , Will Deacon , Tom Lendacky , Wei Liu , Michael Kelley , Jason Chen CJ , Zhao Yakui , "Peter Zijlstra (Intel)" Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, May 19, 2020 at 11:58 AM Thomas Gleixner wrote: > > Andy Lutomirski writes: > > On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner wrote: > >> @@ -573,6 +578,16 @@ static __always_inline void __idtentry_exit(struct pt_regs *regs) > >> instrumentation_end(); > >> return; > >> } > >> + } else if (IS_ENABLED(CONFIG_XEN_PV)) { > >> + if (preempt_hcall) { > >> + /* See CONFIG_PREEMPTION above */ > >> + instrumentation_begin(); > >> + rcu_irq_exit_preempt(); > >> + xen_maybe_preempt_hcall(); > >> + trace_hardirqs_on(); > >> + instrumentation_end(); > >> + return; > >> + } > > > > Ewwwww! This shouldn't be taken as a NAK -- it's just an expression > > of disgust. > > I'm really not proud of it, but that was the least horrible thing I > could come up with. > > > Shouldn't this be: > > > > instrumentation_begin(); > > if (!irq_needs_irq_stack(...)) > > __blah(); > > else > > run_on_irqstack(__blah, NULL); > > instrumentation_end(); > > > > or even: > > > > instrumentation_begin(); > > run_on_irqstack_if_needed(__blah, NULL); > > instrumentation_end(); > > Yeah. In that case the instrumentation markers are not required as they > will be inside the run....() function. > > > ****** BUT ******* > > > > I think this is all arse-backwards. This is a giant mess designed to > > pretend we support preemption and to emulate normal preemption in a > > non-preemptible kernel. I propose one to two massive cleanups: > > > > A: Just delete all of this code. Preemptible hypercalls on > > non-preempt kernels will still process interrupts but won't get > > preempted. If you want preemption, compile with preemption. > > I'm happy to do so, but the XEN folks might have opinions on that :) > > > B: Turn this thing around. Specifically, in the one and only case we > > care about, we know pretty much exactly what context we got this entry > > in: we're running in a schedulable context doing an explicitly > > preemptible hypercall, and we have RIP pointing at a SYSCALL > > instruction (presumably, but we shouldn't bet on it) in the hypercall > > page. Ideally we would change the Xen PV ABI so the hypercall would > > return something like EAGAIN instead of auto-restarting and we could > > ditch this mess entirely. But the ABI seems to be set in stone or at > > least in molasses, so how about just: > > > > idt_entry(exit(regs)); > > if (inhcall && need_resched()) > > schedule(); > > Which brings you into the situation that you call schedule() from the > point where we just moved it out. If we would go there we'd need to > ensure that RCU is watching as well. idtentry_exit() might have it > turned off .... I don't think this is possible. Once you untangle all the wrappers, the call sites are effectively: __this_cpu_write(xen_in_preemptible_hcall, true); CALL_NOSPEC to the hypercall page __this_cpu_write(xen_in_preemptible_hcall, false); I think IF=1 when this happens, but I won't swear to it. RCU had better be watching. As I understand it, the one and only situation Xen wants to handle is that an interrupt gets delivered during the hypercall. The hypervisor is too clever for its own good and deals with this by rewinding RIP to the beginning of whatever instruction did the hypercall and delivers the interrupt, and we end up in this handler. So, if this happens, the idea is to not only handle the interrupt but to schedule if scheduling would be useful. So I don't think we need all this RCU magic. This really ought to be able to be simplified to: idtentry_exit(); if (appropriate condition) schedule(); Obviously we don't want to schedule if this is a nested entry, but we should be able to rule that out by checking that regs->flags & X86_EFLAGS_IF and by handling the percpu variable a little more intelligently. So maybe the right approach is: bool in_preemptible_hcall = __this_cpu_read(xen_in_preemptible_hcall); __this_cpu_write(xen_in_preemptible_hcall, false); idtentry_enter(...); do the acutal work; idtentry_exit(...); if (in_preemptible_hcall) { assert regs->flags & X86_EFLAGS_IF; assert that RCU is watching; assert that we're on the thread stack; assert whatever else we feel like asserting; if (need_resched()) schedule(); } __this_cpu_write(xen_in_preemptible_hcall, in_preemptible_hcall); And now we don't have a special idtentry_exit() case just for Xen, and all the mess is entirely contained in the Xen PV code. And we need to mark all the preemptible hypercalls noinstr. Does this seem reasonable? That being said, right now, with or without your patch, I think we're toast if the preemptible hypercall code gets traced. So maybe the right thing is to just drop all the magic preemption stuff from your patch and let the Xen maintainers submit something new (maybe like what I suggest above) if they want magic preemption back.