Subject: Re: Interrupt for port 19, but apparently not enabled; per-user 000000004af23acc
To: Juergen Gross
Cc: "xen-devel@lists.xenproject.org" <xen-devel@lists.xenproject.org>, linux-kernel@vger.kernel.org, mheyne@amazon.de
References: <6552fc66-ba19-2c77-7928-b0272d3e1622@xen.org>
  <4d8a7ba7-a9f6-2999-8750-bfe2b85f064e@suse.com>
From: Julien Grall
Message-ID: <9a08bbf2-ba6a-6e49-3bcb-bfe2beb32b99@xen.org>
Date: Tue, 22 Jun 2021 14:21:18 +0200
In-Reply-To: <4d8a7ba7-a9f6-2999-8750-bfe2b85f064e@suse.com>

Hi Juergen,

On 22/06/2021 13:04, Juergen Gross wrote:
> On 22.06.21 12:24, Julien Grall wrote:
>> Hi Juergen,
>>
>> As discussed on IRC yesterday, we noticed a couple of splats in
>> 5.13-rc6 (and stable 5.4) in the evtchn driver:
>>
>> [    7.581000] ------------[ cut here ]------------
>> [    7.581899] Interrupt for port 19, but apparently not enabled;
>> per-user 000000004af23acc
>> [    7.583401] WARNING: CPU: 0 PID: 467 at
>> /home/ANT.AMAZON.COM/jgrall/works/oss/linux/drivers/xen/evtchn.c:169
>> evtchn_interrupt+0xd5/0x100
>> [    7.585583] Modules linked in:
>> [    7.586188] CPU: 0 PID: 467 Comm: xenstore-read Not tainted
>> 5.13.0-rc6 #240
>> [    7.587462] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
>> BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
>> [    7.589462] RIP: e030:evtchn_interrupt+0xd5/0x100
>> [    7.590361] Code: 48 8d bb d8 01 00 00 ba 01 00 00 00 be 1d 00 00 00
>> e8 5f 72 c4 ff eb b2 8b 75 20 48 89 da 48 c7 c7 a8 03 5f 82 e8 6b 2d 96
>> ff <0f> 0b e9 4d ff ff ff 41 0f b6 f4 48 c7 c7 80 da a2 82 e8 f0
>> [    7.593662] RSP: e02b:ffffc90040003e60 EFLAGS: 00010082
>> [    7.594636] RAX: 0000000000000000 RBX: ffff888102328c00 RCX:
>> 0000000000000027
>> [    7.595924] RDX: 0000000000000000 RSI: ffff88817fe18ad0 RDI:
>> ffff88817fe18ad8
>> [    7.597216] RBP: ffff888108ef8140 R08: 0000000000000000 R09:
>> 0000000000000001
>> [    7.598522] R10: 0000000000000000 R11: 7075727265746e49 R12:
>> 0000000000000000
>> [    7.599810] R13: ffffc90040003ec4 R14: ffff8881001b8000 R15:
>> ffff888109b36f80
>> [    7.601113] FS:  0000000000000000(0000) GS:ffff88817fe00000(0000)
>> knlGS:0000000000000000
>> [    7.602570] CS:  10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [    7.603700] CR2: 00007f15b390e368 CR3: 000000010bb04000 CR4:
>> 0000000000050660
>> [    7.604993] Call Trace:
>> [    7.605501]  <IRQ>
>> [    7.605929]  __handle_irq_event_percpu+0x4c/0x330
>> [    7.606817]  handle_irq_event_percpu+0x32/0xa0
>> [    7.607670]  handle_irq_event+0x3a/0x60
>> [    7.608416]  handle_edge_irq+0x9b/0x1f0
>> [    7.609154]  generic_handle_irq+0x4f/0x60
>> [    7.609918]  __evtchn_fifo_handle_events+0x195/0x3a0
>> [    7.610864]  __xen_evtchn_do_upcall+0x66/0xb0
>> [    7.611693]  __xen_pv_evtchn_do_upcall+0x1d/0x30
>> [    7.612582]  xen_pv_evtchn_do_upcall+0x9d/0xc0
>> [    7.613439]  </IRQ>
>> [    7.613882]  exc_xen_hypervisor_callback+0x8/0x10
>>
>> This is quite similar to the problem I reported a few months ago (see
>> [1]) but this time it is happening with fifo rather than 2L.
>>
>> I haven't been able to reproduce it reliably so far. But looking at
>> the code, I think I have found another potential race after commit
>>
>> commit b6622798bc50b625a1e62f82c7190df40c1f5b21
>> Author: Juergen Gross
>> Date:   Sat Mar 6 17:18:33 2021 +0100
>>
>>     xen/events: avoid handling the same event on two cpus at the
>>     same time
>>
>>     When changing the cpu affinity of an event it can happen today that
>>     (with some unlucky timing) the same event will be handled on the old
>>     and the new cpu at the same time.
>>
>>     Avoid that by adding an "event active" flag to the per-event data and
>>     call the handler only if this flag isn't set.
>>
>>     Cc: stable@vger.kernel.org
>>     Reported-by: Julien Grall
>>     Signed-off-by: Juergen Gross
>>     Reviewed-by: Julien Grall
>>     Link: https://lore.kernel.org/r/20210306161833.4552-4-jgross@suse.com
>>     Signed-off-by: Boris Ostrovsky
>>
>> The evtchn driver will use the lateeoi handlers. So the code to ack
>> looks like:
>>
>>   do_mask(..., EVT_MASK_REASON_EOI_PENDING);
>>   smp_store_release(&info->is_active, 0);
>>   clear_evtchn(info->evtchn);
>>
>> The code to handle an interrupt looks like:
>>
>>   clear_link(...);
>>   if (evtchn_fifo_is_pending(port) && !evtchn_fifo_is_masked(port)) {
>>       if (xchg_acquire(&info->is_active, 1))
>>           return;
>>       generic_handle_irq(...);
>>   }
>>
>> After changing the affinity, an interrupt may be received once on the
>> previous vCPU. So, I think the following can happen:
>>
>> vCPU0                           | vCPU1
>>                                 |
>> Receive event                   |
>>                                 | change affinity to vCPU1
>> clear_link()                    |
>>                                 |
>>      /* The interrupt is re-raised */
>>                                 | receive event
>>                                 |
>>                                 | /* The interrupt is not masked */
>> info->is_active = 1             |
>> do_mask(...)                    |
>> info->is_active = 0             |
>>                                 | info->is_active = 1
>> clear_evtchn(...)               |
>>                                 | do_mask(...)
>>                                 | info->is_active = 0
>>                                 | clear_evtchn(...)
>>
>> Does this look plausible to you?
>
> Yes, it does.
>
> Thanks for the analysis.
>
> So I guess for lateeoi events we need to clear is_active only in
> xen_irq_lateeoi()? At a first glance this should fix the issue.

It should work and would be quite neat. But I believe clear_evtchn()
would have to stay in the ack helper to avoid losing interrupts.

Cheers,

-- 
Julien Grall