Received: by 10.223.185.111 with SMTP id b44csp77125wrg; Fri, 9 Mar 2018 01:14:33 -0800 (PST) X-Google-Smtp-Source: AG47ELtFfs0brV9eFYoIFQCryvJDLcC+p059j3QxN8F94vtpxmLwEU12Oc7iQyeXzSfxPGXEotTr X-Received: by 2002:a17:902:d81:: with SMTP id 1-v6mr27451486plv.324.1520586873334; Fri, 09 Mar 2018 01:14:33 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1520586873; cv=none; d=google.com; s=arc-20160816; b=Cx/cNE5m5WyjVtJxhQhj/NCTbmS19c0xFDaaZ1hthCBiw+lvgPB5TnuO+ru4PPqmz1 JRtVyZiBs+NiwZNavIHa049LZXwKAlBYXH72p7u1Xj8Unfjp6GU7liKM80m2FQzMN7PY Qwv2MuOD4SJnTDoYTau7y3T4MFdP0QxUS5LwTVVCv4VLlwTTgKtdVNZbw788ZWZq59n7 IqyEOHaJxogmoP8NVGV8Ms+53xemotCqzw+gJcxQdyCzB//aKn9Y0gog3/GSveB9INHA mPsYxNUoZVOplFNezkLYcYCI4X4gM4ttpfiBqRwho8aTyjK05MEQe3BytbQ4w1F5aIir SBXQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:organization:from:references:cc:to:subject :arc-authentication-results; bh=wPBZDdJTpp8zbr1OyU4/UmzIv9xFVha6ZjAfatFaVPE=; b=gJ1jKzf68d46ojtk62pLopeGX5ooEVK7O3lra2QQxM3RUST90SrqvHJkF5UIaphI/s osLUFAV47LniUlYbPIiFISAeGuZuu7risYYkG6RfzPYDCow8egl1bnHDDjDNiOBvAEzy pwSAkjSx9qSIE6sFn94O7mkxEHKPvmBDwKiVsUrUIs+4qlXqpkvi4wXO0uCT5oxQAqp3 OWJnzKjnfQhvQNy8EGY1YklYS6yOeGQDt4HSAjBv+kohje5zvrakUiLk/rOXpeEdRQVm /OLxI2HYPgsQF1gPpfI/7rAt84YgIOeiDhMX4ag7/dLFK3/JEFnrI+w9SLyy/n3JJIgg kx3g== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id u10-v6si527388plu.509.2018.03.09.01.14.18; Fri, 09 Mar 2018 01:14:33 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751495AbeCIJMd (ORCPT + 99 others); Fri, 9 Mar 2018 04:12:33 -0500 Received: from foss.arm.com ([217.140.101.70]:48490 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751173AbeCIJMI (ORCPT ); Fri, 9 Mar 2018 04:12:08 -0500 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.72.51.249]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 6E46B80D; Fri, 9 Mar 2018 01:12:07 -0800 (PST) Received: from [10.1.207.62] (usa-sjc-imap-foss1.foss.arm.com [10.72.51.249]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 4C9E73F25C; Fri, 9 Mar 2018 01:12:03 -0800 (PST) Subject: Re: [RFC PATCH] KVM: arm/arm64: vgic: change condition for level interrupt resampling To: Auger Eric , Christoffer Dall Cc: Shunyong Yang , ard.biesheuvel@linaro.org, will.deacon@arm.com, david.daney@cavium.com, linux-arm-kernel@lists.infradead.org, kvmarm@lists.cs.columbia.edu, linux-kernel@vger.kernel.org, Joey Zheng References: <1520492490-7943-1-git-send-email-shunyong.yang@hxt-semitech.com> <9ad47673-068e-f732-d2ca-9c76a8fbdfbc@arm.com> <0a15633d-8944-cb9b-3e6b-b08ee5ec42b9@arm.com> <20180308161900.GC1917@lvm> <86r2oubho3.wl-marc.zyngier@arm.com> From: Marc Zyngier Organization: ARM Ltd Message-ID: <6e438ffc-4b80-4706-a767-7c84aa896348@arm.com> Date: Fri, 9 Mar 2018 09:12:01 +0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Language: en-GB Content-Transfer-Encoding: 8bit Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 08/03/18 18:12, Auger Eric wrote: > Hi Marc, Christoffer, > > On 08/03/18 18:28, Marc Zyngier wrote: >> On Thu, 08 Mar 2018 16:19:00 +0000, >> Christoffer Dall wrote: >>> >>> On Thu, Mar 08, 2018 at 11:54:27AM +0000, Marc Zyngier wrote: >>>> On 08/03/18 09:49, Marc Zyngier wrote: >>>>> [updated Christoffer's email address] >>>>> >>>>> Hi Shunyong, >>>>> >>>>> On 08/03/18 07:01, Shunyong Yang wrote: >>>>>> When resampling irqfds is enabled, level interrupt should be >>>>>> de-asserted when resampling happens. On page 4-47 of GIC v3 >>>>>> specification IHI0069D, it said, >>>>>> "When the PE acknowledges an SGI, a PPI, or an SPI at the CPU >>>>>> interface, the IRI changes the status of the interrupt to active >>>>>> and pending if: >>>>>> • It is an edge-triggered interrupt, and another edge has been >>>>>> detected since the interrupt was acknowledged. >>>>>> • It is a level-sensitive interrupt, and the level has not been >>>>>> deasserted since the interrupt was acknowledged." >>>>>> >>>>>> GIC v2 specification IHI0048B.b has similar description on page >>>>>> 3-42 for state machine transition. >>>>>> >>>>>> When some VFIO device, like mtty(8250 VFIO mdev emulation driver >>>>>> in samples/vfio-mdev) triggers a level interrupt, the status >>>>>> transition in LR is pending-->active-->active and pending. >>>>>> Then it will wait resampling to de-assert the interrupt. >>>>>> >>>>>> Current design of lr_signals_eoi_mi() will return false if state >>>>>> in LR is not invalid(Inactive). It causes resampling will not happen >>>>>> in mtty case. >>>>> >>>>> Let me rephrase this, and tell me if I understood it correctly: >>>>> >>>>> - A level interrupt is injected, activated by the guest (LR state=active) >>>>> - guest exits, re-enters, (LR state=pending+active) >>>>> - guest EOIs the interrupt (LR state=pending) >>>>> - maintenance interrupt >>>>> - we don't signal the resampling because we're not in an invalid state >>>>> >>>>> Is that correct? >>>>> >>>>> That's an interesting case, because it seems to invalidate some of the >>>>> optimization that went in over a year ago. >>>>> >>>>> 096f31c4360f KVM: arm/arm64: vgic: Get rid of MISR and EISR fields >>>>> b6095b084d87 KVM: arm/arm64: vgic: Get rid of unnecessary save_maint_int_state >>>>> af0614991ab6 KVM: arm/arm64: vgic: Get rid of unnecessary process_maintenance operation >>>>> >>>>> We could compare the value of the LR before the guest entry with >>>>> the value at exit time, but we still could miss it if we have a >>>>> transition such as P+A -> P -> A and assume a long enough propagation >>>>> delay for the maintenance interrupt (which is very likely). >>>>> >>>>> In essence, we have lost the benefit of EISR, which was to give us a >>>>> way to deal with asynchronous signalling. >>>>> >>>>>> >>>>>> This will cause interrupt fired continuously to guest even 8250 IIR >>>>>> has no interrupt. When 8250's interrupt is configured in shared mode, >>>>>> it will pass interrupt to other drivers to handle. However, there >>>>>> is no other driver involved. Then, a "nobody cared" kernel complaint >>>>>> occurs. >>>>>> >>>>>> / # cat /dev/ttyS0 >>>>>> [ 4.826836] random: crng init done >>>>>> [ 6.373620] irq 41: nobody cared (try booting with the "irqpoll" >>>>>> option) >>>>>> [ 6.376414] CPU: 0 PID: 1307 Comm: cat Not tainted 4.16.0-rc4 #4 >>>>>> [ 6.378927] Hardware name: linux,dummy-virt (DT) >>>>>> [ 6.380876] Call trace: >>>>>> [ 6.381937] dump_backtrace+0x0/0x180 >>>>>> [ 6.383495] show_stack+0x14/0x1c >>>>>> [ 6.384902] dump_stack+0x90/0xb4 >>>>>> [ 6.386312] __report_bad_irq+0x38/0xe0 >>>>>> [ 6.387944] note_interrupt+0x1f4/0x2b8 >>>>>> [ 6.389568] handle_irq_event_percpu+0x54/0x7c >>>>>> [ 6.391433] handle_irq_event+0x44/0x74 >>>>>> [ 6.393056] handle_fasteoi_irq+0x9c/0x154 >>>>>> [ 6.394784] generic_handle_irq+0x24/0x38 >>>>>> [ 6.396483] __handle_domain_irq+0x60/0xb4 >>>>>> [ 6.398207] gic_handle_irq+0x98/0x1b0 >>>>>> [ 6.399796] el1_irq+0xb0/0x128 >>>>>> [ 6.401138] _raw_spin_unlock_irqrestore+0x18/0x40 >>>>>> [ 6.403149] __setup_irq+0x41c/0x678 >>>>>> [ 6.404669] request_threaded_irq+0xe0/0x190 >>>>>> [ 6.406474] univ8250_setup_irq+0x208/0x234 >>>>>> [ 6.408250] serial8250_do_startup+0x1b4/0x754 >>>>>> [ 6.410123] serial8250_startup+0x20/0x28 >>>>>> [ 6.411826] uart_startup.part.21+0x78/0x144 >>>>>> [ 6.413633] uart_port_activate+0x50/0x68 >>>>>> [ 6.415328] tty_port_open+0x84/0xd4 >>>>>> [ 6.416851] uart_open+0x34/0x44 >>>>>> [ 6.418229] tty_open+0xec/0x3c8 >>>>>> [ 6.419610] chrdev_open+0xb0/0x198 >>>>>> [ 6.421093] do_dentry_open+0x200/0x310 >>>>>> [ 6.422714] vfs_open+0x54/0x84 >>>>>> [ 6.424054] path_openat+0x2dc/0xf04 >>>>>> [ 6.425569] do_filp_open+0x68/0xd8 >>>>>> [ 6.427044] do_sys_open+0x16c/0x224 >>>>>> [ 6.428563] SyS_openat+0x10/0x18 >>>>>> [ 6.429972] el0_svc_naked+0x30/0x34 >>>>>> [ 6.431494] handlers: >>>>>> [ 6.432479] [<000000000e9fb4bb>] serial8250_interrupt >>>>>> [ 6.434597] Disabling IRQ #41 >>>>>> >>>>>> This patch changes the lr state condition in lr_signals_eoi_mi() from >>>>>> invalid(Inactive) to active and pending to avoid this. >>>>>> >>>>>> I am not sure about the original design of the condition of >>>>>> invalid(active). So, This RFC is sent out for comments. >>>>>> >>>>>> Cc: Joey Zheng >>>>>> Signed-off-by: Shunyong Yang >>>>>> --- >>>>>> virt/kvm/arm/vgic/vgic-v2.c | 4 ++-- >>>>>> virt/kvm/arm/vgic/vgic-v3.c | 4 ++-- >>>>>> 2 files changed, 4 insertions(+), 4 deletions(-) >>>>>> >>>>>> diff --git a/virt/kvm/arm/vgic/vgic-v2.c b/virt/kvm/arm/vgic/vgic-v2.c >>>>>> index e9d840a75e7b..740ee9a5f551 100644 >>>>>> --- a/virt/kvm/arm/vgic/vgic-v2.c >>>>>> +++ b/virt/kvm/arm/vgic/vgic-v2.c >>>>>> @@ -46,8 +46,8 @@ void vgic_v2_set_underflow(struct kvm_vcpu *vcpu) >>>>>> >>>>>> static bool lr_signals_eoi_mi(u32 lr_val) >>>>>> { >>>>>> - return !(lr_val & GICH_LR_STATE) && (lr_val & GICH_LR_EOI) && >>>>>> - !(lr_val & GICH_LR_HW); >>>>>> + return !((lr_val & GICH_LR_STATE) ^ GICH_LR_STATE) && >>>>> >>>>> That feels very wrong. You're now signalling the resampling in both >>>>> invalid and pending+active, and the latter state doesn't mean you've >>>>> EOIed anything. You're now over-signalling, and signalling the >>>>> wrong event. >>>>> >>>>>> + (lr_val & GICH_LR_EOI) && !(lr_val & GICH_LR_HW); >>>>>> } >>>>>> >>>>>> /* >>>>>> diff --git a/virt/kvm/arm/vgic/vgic-v3.c b/virt/kvm/arm/vgic/vgic-v3.c >>>>>> index 6b329414e57a..43111bba7af9 100644 >>>>>> --- a/virt/kvm/arm/vgic/vgic-v3.c >>>>>> +++ b/virt/kvm/arm/vgic/vgic-v3.c >>>>>> @@ -35,8 +35,8 @@ void vgic_v3_set_underflow(struct kvm_vcpu *vcpu) >>>>>> >>>>>> static bool lr_signals_eoi_mi(u64 lr_val) >>>>>> { >>>>>> - return !(lr_val & ICH_LR_STATE) && (lr_val & ICH_LR_EOI) && >>>>>> - !(lr_val & ICH_LR_HW); >>>>>> + return !((lr_val & ICH_LR_STATE) ^ ICH_LR_STATE) && >>>>>> + (lr_val & ICH_LR_EOI) && !(lr_val & ICH_LR_HW); >>>>>> } >>>>>> >>>>>> void vgic_v3_fold_lr_state(struct kvm_vcpu *vcpu) >>>>>> >>>>> >>>>> Assuming I understand the issue correctly, I cannot really see how >>>>> to solve this without reintroducing EISR, which sucks majorly. >>>>> >>>>> I'll try to cook something shortly and we can all have a good >>>>> fight about how crap this is. >>>> >>>> Here's what I came up with. I don't really like it, but that's >>>> the least invasive this I could come up with. Please let me >>>> know if that helps with your test case. Note that I have only >>>> boot-tested this on a sample of 1 machine, so I don't expect this >>>> to be perfect. >>>> >>>> Also, any guideline on how to reproduce this would be much appreciated. >>>> I never used this mdev/mtty thing, so please bear with me. >>>> >>>> Thanks, >>>> >>>> M. >>>> >>>> From 66a7c4cfc1029b0169dd771e196e2876ba3f17b1 Mon Sep 17 00:00:00 2001 >>>> From: Marc Zyngier >>>> Date: Thu, 8 Mar 2018 11:14:06 +0000 >>>> Subject: [PATCH] KVM: arm/arm64: Do not rely on LR state to guess EOI MI >>>> status >>>> >>>> We so far rely on the LR state to decide whether the guest has >>>> EOI'd a level interrupt or not. While this looks like a good >>>> idea on the surface, it leads to a couple of annoying corner >>>> cases: >>>> >>>> Example 1: (P = Pending, A = Active, MI = Maintenance Interrupt) >>>> P -> guest IAR -> A -> exit/entry -> P+A -> guest EOI -> P -> MI >>> >>> Do we really get an EOI maintenance interrupt here? Reading the MISR >>> and EISR descriptions make me thing this is not the case... > > Hum yes in EISR it is said that ICH_LR.State = 0b00! >> >> Yeah, it looks like I always want EISR to do what I want, and not to >> do what it does. Man, this thing is such a piece of crap. >> >> OK, scratch that. We need to do it without the help of the HW. >> >>>> The state is now pending, we've really EOI'd the interrupt, and >>>> yet lr_signals_eoi_mi() returns false, since the state is not 0. >>>> The result is that we won't signal anything on the corresponding >>>> irqfd, which people complain about. Meh. >>> >>> So the core of the problem is that when we've entered the guest with >>> PENDING+ACTIVE and when we exit (for some reason) we don't signal the >>> resamplefd, right? The solution seems to me that we don't ever do >>> PENDING+ACTIVE if you need to resample after each deactivate. What >>> would be the point of appending a pending state that you only know to be >>> valid after a resample anyway? >> >> The question is then to identify that a given source needs to be >> signalled back to VFIO. Calling into the eventfd code on the hot path >> is pretty horrid (I'm not sure if we can really call into this with >> interrupts disabled, for example). >> >>> >>>> >>>> Example 2: >>>> P+A -> guest EOI -> P -> delayed MI -> guest IAR -> A -> MI fires >>> >>> We could be more clever and do the following calculation on every exit: >>> >>> If you enter with P, and exit with either A or 0, then signal. >>> >>> If you enter with P+A, and you exit with either P, A, or 0, then signal. >>> >>> Wouldn't that also solve it? (Although I have a feeling you'd miss some >>> exits in this case). >> >> I'd be more confident if we did forbid P+A for such interrupts >> altogether, as they really feel like another kind of HW interrupt. > > the LR P+A looks strange to me too. all the more so it may cause the > same IRQ to be acked twice? If the pending bit isn't dropped by the time we get to EOI the first one, probably. But that's pretty much expected with a level interrupt isn't it? > P -> A -> 0 (resample). Doesn't our issue come from the fact we reinject > the P in LR until the line level is deasserted? Which is consistent with the life cycle of a level interrupt. What usually happens is (for a non HW interrupt): P -> IAR -> A -> lower the line in the device -> 0 If you generate an exit at the right spot, and yet don't lower the line, you end up with: P -> IAR -> A -> exit/enter -> P+A From there, if you lower the line, it is likely to cause an exit: P+A -> MMIO trap lowering the line -> A >> >> Eric: Is there any way to get a callback from the eventfd code to flag >> a given irq as requiring a notification on EOI? > > bool kvm_irq_has_notifier(struct kvm *kvm, unsigned irqchip, unsigned > pin) was used in the past. I think it does what you want. > Not exactly. I'm very reluctant to call this on the hot path (I'd need the info on hw_flush), and I'd rather have a callback from the eventfd subsystem to tell me when a pin is being associated with a notifier (because this is likely to be very rare). If that doesn't exit, never mind. We can see if that solves Shunyong issue and optimize later. M. -- Jazz is not dead. It just smells funny...