Received: by 2002:a25:868d:0:0:0:0:0 with SMTP id z13csp356219ybk; Wed, 20 May 2020 01:10:46 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxP6UZA6vy9F4iVxZr5Ucuz+cKracEhs3KfoQd1QSo1GmMO3WEqTtb4YDcKCUpsTcJiA1FK X-Received: by 2002:a17:906:4d82:: with SMTP id s2mr2623724eju.542.1589962246553; Wed, 20 May 2020 01:10:46 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1589962246; cv=none; d=google.com; s=arc-20160816; b=mkpKSqtSornHXbWRcGAB58iXjMMyazz6GxNPVwYPmwJfzaX5+6pyh+6hYEaXkfZaqF NEGYChScJhjOxenhL5SMtKXGTHyErV1hk68C7HCfkJy58AUMCMPAfP4i9Eso8nJoe4WJ 5VFkc/SSEA/4eMU1Dvv8kvYTboTxDd9oavFe9wdggufl/1UZbw2b6fQdFC2mxjTFwyXB X1sXH+IwVu5UwD+aoRpDgUfXohnMF/Yv4uFho9ULt+G/ySMHu3QodDT9CwLdeIy/SHb3 +GEP2bRkyoFS1EAhS9/GK5mj583yJ65wdw7FUC3gGI4PNH7JPrUwC3udqvz+BVkxHQWz cfqw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:from:references:cc:to:subject; bh=IKbvnSPCasiuGfnDp0TV+We4ZETkjpLpPCWm1NFBkzM=; b=fNgKrelr9PcEghEITlFFzSZC/fzEXQ6YjQgfCY43lKzByVEXcu5LHrRb84LIBs7vTU zK4HIwuS0njUj2S7LmFCLnHM+RtQBHoY6koeHVb/yJJOH5NZCQEjvS+JP3NiYKaNZBo5 Omh/IYr77ifnxo+wHNcQB8fJg4Ro1+TJ+4oG1xnItXmt0lMhQYXw/thEuRfriHDw6GRw pnS1ctZUwOOBCjYE0HrHzFLQ4NbC/J2dBjn5fZdG1dBrf+0Cql1qi9aL4XwNFuV4xJlB 045z8mC9+FIMwaKLtNI70BFjHwah2qem1fWPumBCSDN6XT1TNRcAU6kMaeQMYs5Ovxje PuHA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id l10si1080548edv.152.2020.05.20.01.10.22; Wed, 20 May 2020 01:10:46 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726852AbgETIGi (ORCPT + 99 others); Wed, 20 May 2020 04:06:38 -0400 Received: from mx2.suse.de ([195.135.220.15]:48682 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726436AbgETIGi (ORCPT ); Wed, 20 May 2020 04:06:38 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (unknown [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 57470ACA7; Wed, 20 May 2020 08:06:37 +0000 (UTC) Subject: Re: [patch V6 10/37] x86/entry: Switch XEN/PV hypercall entry to IDTENTRY To: Andy Lutomirski , Thomas Gleixner Cc: Andrew Cooper , LKML , X86 ML , "Paul E. McKenney" , Alexandre Chartre , Frederic Weisbecker , Paolo Bonzini , Sean Christopherson , Masami Hiramatsu , Petr Mladek , Steven Rostedt , Joel Fernandes , Boris Ostrovsky , Brian Gerst , Mathieu Desnoyers , Josh Poimboeuf , Will Deacon , Tom Lendacky , Wei Liu , Michael Kelley , Jason Chen CJ , Zhao Yakui , "Peter Zijlstra (Intel)" References: <20200515234547.710474468@linutronix.de> <20200515235125.425810667@linutronix.de> <87imgr7nwp.fsf@nanos.tec.linutronix.de> From: =?UTF-8?B?SsO8cmdlbiBHcm/Dnw==?= Message-ID: <3dd0e972-1b80-cd6b-6490-5b745ada68c8@suse.com> Date: Wed, 20 May 2020 10:06:32 +0200 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.8.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8; format=flowed Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 19.05.20 21:44, Andy Lutomirski wrote: > On Tue, May 19, 2020 at 11:58 AM Thomas Gleixner wrote: >> >> Andy Lutomirski writes: >>> On Fri, May 15, 2020 at 5:10 PM Thomas Gleixner wrote: >>>> @@ -573,6 +578,16 @@ static __always_inline void __idtentry_exit(struct pt_regs *regs) >>>> instrumentation_end(); >>>> return; >>>> } >>>> + } else if (IS_ENABLED(CONFIG_XEN_PV)) { >>>> + if (preempt_hcall) { >>>> + /* See CONFIG_PREEMPTION above */ >>>> + instrumentation_begin(); >>>> + rcu_irq_exit_preempt(); >>>> + xen_maybe_preempt_hcall(); >>>> + trace_hardirqs_on(); >>>> + instrumentation_end(); >>>> + return; >>>> + } >>> >>> Ewwwww! This shouldn't be taken as a NAK -- it's just an expression >>> of disgust. >> >> I'm really not proud of it, but that was the least horrible thing I >> could come up with. >> >>> Shouldn't this be: >>> >>> instrumentation_begin(); >>> if (!irq_needs_irq_stack(...)) >>> __blah(); >>> else >>> run_on_irqstack(__blah, NULL); >>> instrumentation_end(); >>> >>> or even: >>> >>> instrumentation_begin(); >>> run_on_irqstack_if_needed(__blah, NULL); >>> instrumentation_end(); >> >> Yeah. In that case the instrumentation markers are not required as they >> will be inside the run....() function. >> >>> ****** BUT ******* >>> >>> I think this is all arse-backwards. This is a giant mess designed to >>> pretend we support preemption and to emulate normal preemption in a >>> non-preemptible kernel. I propose one to two massive cleanups: >>> >>> A: Just delete all of this code. Preemptible hypercalls on >>> non-preempt kernels will still process interrupts but won't get >>> preempted. If you want preemption, compile with preemption. >> >> I'm happy to do so, but the XEN folks might have opinions on that :) Indeed. :-) >> >>> B: Turn this thing around. Specifically, in the one and only case we >>> care about, we know pretty much exactly what context we got this entry >>> in: we're running in a schedulable context doing an explicitly >>> preemptible hypercall, and we have RIP pointing at a SYSCALL >>> instruction (presumably, but we shouldn't bet on it) in the hypercall >>> page. Ideally we would change the Xen PV ABI so the hypercall would >>> return something like EAGAIN instead of auto-restarting and we could >>> ditch this mess entirely. But the ABI seems to be set in stone or at >>> least in molasses, so how about just: >>> >>> idt_entry(exit(regs)); >>> if (inhcall && need_resched()) >>> schedule(); >> >> Which brings you into the situation that you call schedule() from the >> point where we just moved it out. If we would go there we'd need to >> ensure that RCU is watching as well. idtentry_exit() might have it >> turned off .... > > I don't think this is possible. Once you untangle all the wrappers, > the call sites are effectively: > > __this_cpu_write(xen_in_preemptible_hcall, true); > CALL_NOSPEC to the hypercall page > __this_cpu_write(xen_in_preemptible_hcall, false); > > I think IF=1 when this happens, but I won't swear to it. RCU had > better be watching. Preemptible hypercalls are never done with interrupts off. To be more precise: they are only ever done during ioctl() processing. I can add an ASSERT() to xen_preemptible_hcall_begin() if you want. > > As I understand it, the one and only situation Xen wants to handle is > that an interrupt gets delivered during the hypercall. The hypervisor > is too clever for its own good and deals with this by rewinding RIP to > the beginning of whatever instruction did the hypercall and delivers > the interrupt, and we end up in this handler. So, if this happens, > the idea is to not only handle the interrupt but to schedule if > scheduling would be useful. Correct. More precise: the hypercalls in question can last very long (up to several seconds) and so they need to be interruptible. As said before: the interface how this is done is horrible. :-( > > So I don't think we need all this RCU magic. This really ought to be > able to be simplified to: > > idtentry_exit(); > > if (appropriate condition) > schedule(); > > Obviously we don't want to schedule if this is a nested entry, but we > should be able to rule that out by checking that regs->flags & > X86_EFLAGS_IF and by handling the percpu variable a little more > intelligently. So maybe the right approach is: > > bool in_preemptible_hcall = __this_cpu_read(xen_in_preemptible_hcall); > __this_cpu_write(xen_in_preemptible_hcall, false); > idtentry_enter(...); > > do the acutal work; > > idtentry_exit(...); > > if (in_preemptible_hcall) { > assert regs->flags & X86_EFLAGS_IF; > assert that RCU is watching; > assert that we're on the thread stack; > assert whatever else we feel like asserting; > if (need_resched()) > schedule(); > } > > __this_cpu_write(xen_in_preemptible_hcall, in_preemptible_hcall); > > And now we don't have a special idtentry_exit() case just for Xen, and > all the mess is entirely contained in the Xen PV code. And we need to > mark all the preemptible hypercalls noinstr. Does this seem > reasonable? From my point of view this sounds fine. > > That being said, right now, with or without your patch, I think we're > toast if the preemptible hypercall code gets traced. So maybe the > right thing is to just drop all the magic preemption stuff from your > patch and let the Xen maintainers submit something new (maybe like > what I suggest above) if they want magic preemption back. > I'd prefer to not break preemptible hypercall in between. IMO the patch should be modified along your suggestion. I'd be happy to test it. Juergen