Subject: Re: [PATCH 2/3] KVM: x86: guest debug: don't inject interrupts while single stepping
To: Maxim Levitsky, Sean Christopherson
Cc: kvm@vger.kernel.org, Vitaly Kuznetsov, linux-kernel@vger.kernel.org,
 Thomas Gleixner, Wanpeng Li, Kieran Bingham, Jessica Yu, Andrew Morton,
 "maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)", Joerg Roedel,
 Jim Mattson, Borislav Petkov, Stefano Garzarella, "H. Peter Anvin",
 Paolo Bonzini, Ingo Molnar
From: Jan Kiszka
Message-ID: <83f69225-f2ad-9073-3afb-c7cb8435059c@siemens.com>
Date: Tue, 16 Mar 2021 16:56:37 +0100
X-Mailing-List: linux-kernel@vger.kernel.org

On 16.03.21 16:49, Maxim Levitsky wrote:
> On Tue, 2021-03-16 at 16:31 +0100, Jan Kiszka wrote:
>> On 16.03.21 15:34, Maxim Levitsky wrote:
>>> On Tue, 2021-03-16 at 14:46 +0100, Jan Kiszka wrote:
>>>> On 16.03.21 13:34, Maxim Levitsky wrote:
>>>>> On Tue, 2021-03-16 at 12:27 +0100, Jan Kiszka wrote:
>>>>>> On 16.03.21 11:59, Maxim Levitsky wrote:
>>>>>>> On Tue, 2021-03-16 at 10:16 +0100, Jan Kiszka wrote:
>>>>>>>> On 16.03.21 00:37, Sean Christopherson wrote:
>>>>>>>>> On Tue, Mar 16, 2021, Maxim Levitsky wrote:
>>>>>>>>>> This change greatly helps with two issues:
>>>>>>>>>>
>>>>>>>>>> * Resuming from a breakpoint is much more reliable.
>>>>>>>>>>
>>>>>>>>>> When resuming execution from a breakpoint, with interrupts enabled, more often
>>>>>>>>>> than not, KVM would inject an interrupt and make the CPU jump immediately to
>>>>>>>>>> the interrupt handler and eventually return to the breakpoint, to trigger it
>>>>>>>>>> again.
>>>>>>>>>>
>>>>>>>>>> From the user point of view it looks like the CPU never executed a
>>>>>>>>>> single instruction and in some cases that can even prevent forward progress,
>>>>>>>>>> for example, when the breakpoint is placed by an automated script
>>>>>>>>>> (e.g lx-symbols), which does something in response to the breakpoint and then
>>>>>>>>>> continues the guest automatically.
>>>>>>>>>> If the script execution takes enough time for another interrupt to arrive,
>>>>>>>>>> the guest will be stuck on the same breakpoint RIP forever.
>>>>>>>>>>
>>>>>>>>>> * Normal single stepping is much more predictable, since it won't land the
>>>>>>>>>> debugger into an interrupt handler, so it is much more usable.
>>>>>>>>>>
>>>>>>>>>> (If entry to an interrupt handler is desired, the user can still place a
>>>>>>>>>> breakpoint at it and resume the guest, which won't activate this workaround
>>>>>>>>>> and let the gdb still stop at the interrupt handler)
>>>>>>>>>>
>>>>>>>>>> Since this change is only active when guest is debugged, it won't affect
>>>>>>>>>> KVM running normal 'production' VMs.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Maxim Levitsky
>>>>>>>>>> Tested-by: Stefano Garzarella
>>>>>>>>>> ---
>>>>>>>>>>  arch/x86/kvm/x86.c | 6 ++++++
>>>>>>>>>>  1 file changed, 6 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>>>>>>>>> index a9d95f90a0487..b75d990fcf12b 100644
>>>>>>>>>> --- a/arch/x86/kvm/x86.c
>>>>>>>>>> +++ b/arch/x86/kvm/x86.c
>>>>>>>>>> @@ -8458,6 +8458,12 @@ static void inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit
>>>>>>>>>>  			can_inject = false;
>>>>>>>>>>  	}
>>>>>>>>>>
>>>>>>>>>> +	/*
>>>>>>>>>> +	 * Don't inject interrupts while single stepping to make guest debug easier
>>>>>>>>>> +	 */
>>>>>>>>>> +	if (vcpu->guest_debug & KVM_GUESTDBG_SINGLESTEP)
>>>>>>>>>> +		return;
>>>>>>>>>
>>>>>>>>> Is this something userspace can deal with? E.g. disable IRQs and/or set NMI
>>>>>>>>> blocking at the start of single-stepping, unwind at the end? Deviating this far
>>>>>>>>> from architectural behavior will end in tears at some point.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Does this happen to address this suspicious workaround in the kernel?
>>>>>>>>
>>>>>>>> 	/*
>>>>>>>> 	 * The kernel doesn't use TF single-step outside of:
>>>>>>>> 	 *
>>>>>>>> 	 *  - Kprobes, consumed through kprobe_debug_handler()
>>>>>>>> 	 *  - KGDB, consumed through notify_debug()
>>>>>>>> 	 *
>>>>>>>> 	 * So if we get here with DR_STEP set, something is wonky.
>>>>>>>> 	 *
>>>>>>>> 	 * A known way to trigger this is through QEMU's GDB stub,
>>>>>>>> 	 * which leaks #DB into the guest and causes IST recursion.
>>>>>>>> 	 */
>>>>>>>> 	if (WARN_ON_ONCE(dr6 & DR_STEP))
>>>>>>>> 		regs->flags &= ~X86_EFLAGS_TF;
>>>>>>>>
>>>>>>>> (arch/x86/kernel/traps.c, exc_debug_kernel)
>>>>>>>>
>>>>>>>> I wonder why this got merged while no one fixed QEMU/KVM, for years? Oh,
>>>>>>>> yeah, question to myself as well, dancing around broken guest debugging
>>>>>>>> for a long time while trying to fix other issues...
>>>>>>>
>>>>>>> To be honest I didn't see that warning even once, but I can imagine KVM
>>>>>>> leaking #DB due to bugs in that code. That area historically didn't receive
>>>>>>> much attention since it can only be triggered by
>>>>>>> KVM_GET/SET_GUEST_DEBUG which isn't used in production.
>>>>>>
>>>>>> I've triggered it recently while debugging a guest, that's why I got
>>>>>> aware of the code path. Long ago, all this used to work (soft BPs,
>>>>>> single-stepping etc.)
>>>>>>
>>>>>>> The only issue that I on the other hand did
>>>>>>> see which is mostly gdb fault is that it fails to remove a software breakpoint
>>>>>>> when resuming over it, if that breakpoint's python handler messes up
>>>>>>> with gdb's symbols, which is what lx-symbols does.
>>>>>>>
>>>>>>> And that despite the fact that lx-symbol doesn't mess with the object
>>>>>>> (that is the kernel) where the breakpoint is defined.
>>>>>>>
>>>>>>> Just adding/removing one symbol file is enough to trigger this issue.
>>>>>>>
>>>>>>> Since lx-symbols already works this around when it reloads all symbols,
>>>>>>> I extended that workaround to happen also when loading/unloading
>>>>>>> only a single symbol file.
>>>>>>
>>>>>> You have no issue with interactive debugging when NOT using gdb scripts
>>>>>> / lx-symbol?
>>>>>
>>>>> To be honest I don't use guest debugging that much,
>>>>> so I probably missed some issues.
>>>>>
>>>>> Now that I fixed lx-symbols though I'll probably use
>>>>> guest debugging much more.
>>>>> I will keep an eye on any issues that I find.
>>>>>
>>>>> The main push to fix lx-symbols actually came
>>>>> from me wanting to understand if there is something
>>>>> broken with KVM's guest debugging knowing that
>>>>> lx-symbols crashes the guest when module is loaded
>>>>> after lx-symbols was executed.
>>>>>
>>>>> That lx-symbols related guest crash I traced to issue
>>>>> with gdb as I explained, and the lack of blocking of the interrupts
>>>>> on single step is not a bug but more a missing feature
>>>>> that should be implemented to make single step easier to use.
>>>>
>>>> Again, this used to work fine. But maybe this patch can change the
>>>> picture by avoid that the unavoidable short TF leakage into the guest
>>>> escalates beyond the single instruction to step over.
>>>
>>>
>>> Actually now I think I understand what is going on.
>>>
>>> The TF flag isn't auto cleared as RF flag is, and if the instruction
>>> which is single stepped gets an interrupt it is pushed onto the interrupt stack.
>>> (then it is cleared for the duration of the interrupt handler)
>>> Since we use the TF flag for single stepping the guest, this indeed can
>>> cause it to be leaked.
>>>
>>> So this patch actually should mitigate this almost completely.
>>>
>>> Also now I understand why Intel has the 'monitor trap' feature, I think it
>>> is exactly for the cases when hypervisor wants to single step the guest
>>> without the fear of changing of the guest visible cpu state.
>>
>> Exactly.
>>
>>>
>>> KVM on VMX should probably switch to using monitor trap for single stepping.
>>
>> Back then, when I was hacking on the gdb-stub and KVM support, the
>> monitor trap flag was not yet broadly available, but the idea to once
>> use it was already there. Now it can be considered broadly available,
>> but it would still require some changes to get it in.
>>
>> Unfortunately, we don't have such thing with SVM, even recent versions,
>> right? So, a proper way of avoiding diverting event injections while we
>> are having the guest in an "incorrect" state should definitely be the goal.
> Yes, I am not aware of anything like monitor trap on SVM.
>
>>
>> Given that KVM knows whether TF originates solely from guest debugging
>> or was (also) injected by the guest, we should be able to identify the
>> cases where your approach is best to apply. And that without any extra
>> control knob that everyone will only forget to set.
> Well I think that the downside of this patch is that the user might actually
> want to single step into an interrupt handler,
> and this patch makes it a bit more complicated, and changes the default
> behavior.

If the default makes debugging practically impossible and breaks the
guest by leaking host state, that is also not OK.

We must not leak, that is priority one. So, even if we consider stepping
into the interrupt a use case (surely not the default one, so having
this opt-in only), we must ensure that TF will never end up on the guest
stack saved for the interrupt handler. That is KVM's default
responsibility.

Jan

>
> I have no objections though to use this patch as is, or at least make this
> the new default with a new flag to override this.
>
> Sean Christopherson, what do you think?
>
> Best regards,
> Maxim Levitsky
>
>>
>> Jan
>>
>
>

-- 
Siemens AG, T RDA IOT
Corporate Competence Center Embedded Linux
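
For illustration only, the TF leak discussed in this thread can be
sketched as a toy model in plain C. This is not KVM code: struct toy_vcpu,
the toy_* helpers and the defer_while_stepping switch are all hypothetical
names made up for this sketch. Only the X86_EFLAGS_TF and X86_EFLAGS_IF bit
values are the architectural ones. The model assumes that interrupt delivery
saves the current flags for the handler and then clears TF and IF, and that
the hypervisor remembers whether the guest had TF set before the debugger
started stepping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_TF 0x100u /* trap flag, bit 8 */
#define X86_EFLAGS_IF 0x200u /* interrupt enable flag, bit 9 */

/* Toy vCPU state. Hypothetical, not a KVM structure. */
struct toy_vcpu {
	uint64_t rflags;       /* flags the vCPU currently runs with */
	uint64_t saved_rflags; /* flags image saved for the interrupt handler */
	bool host_singlestep;  /* the debugger requested a single step */
	bool guest_set_tf;     /* the guest had TF set on its own */
	bool irq_pending;      /* an interrupt is waiting for injection */
};

/* Debugger-driven single step: the host sets TF on the guest's behalf. */
void toy_begin_step(struct toy_vcpu *v)
{
	v->guest_set_tf = (v->rflags & X86_EFLAGS_TF) != 0;
	v->host_singlestep = true;
	v->rflags |= X86_EFLAGS_TF;
}

/* Model of interrupt delivery: save the flags, then clear TF and IF. */
void toy_deliver_irq(struct toy_vcpu *v)
{
	assert(v->irq_pending);
	v->saved_rflags = v->rflags;
	v->rflags &= ~(uint64_t)(X86_EFLAGS_TF | X86_EFLAGS_IF);
	v->irq_pending = false;
}

/*
 * Inject a pending interrupt unless the policy says to defer it.
 * Deferral applies only while TF is owned solely by the host debugger.
 * Returns true if the interrupt was injected.
 */
bool toy_inject_pending(struct toy_vcpu *v, bool defer_while_stepping)
{
	if (!v->irq_pending)
		return false;
	if (defer_while_stepping && v->host_singlestep && !v->guest_set_tf)
		return false;
	toy_deliver_irq(v);
	return true;
}

/* True if a host-owned TF ended up in the frame the guest handler sees. */
bool toy_tf_leaked(const struct toy_vcpu *v)
{
	return (v->saved_rflags & X86_EFLAGS_TF) != 0 && !v->guest_set_tf;
}
```

In this model, calling toy_begin_step() and then toy_inject_pending() with
deferral disabled leaves TF in saved_rflags, which is the leak described
above. With deferral enabled the interrupt simply stays pending until the
step is over, and a guest that set TF itself is left alone.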