Subject: Re: [PATCH 0/7] xen/events: bug fixes and some diagnostic aids
To: Jürgen Groß, xen-devel@lists.xenproject.org,
    linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
    netdev@vger.kernel.org, linux-scsi@vger.kernel.org
Cc: Boris Ostrovsky,
    Stefano Stabellini, stable@vger.kernel.org, Konrad Rzeszutek Wilk,
    Roger Pau Monné, Jens Axboe, Wei Liu, Paul Durrant, "David S. Miller",
    Jakub Kicinski
References: <20210206104932.29064-1-jgross@suse.com>
From: Julien Grall
Date: Mon, 8 Feb 2021 09:11:18 +0000
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Juergen,

On 07/02/2021 12:58, Jürgen Groß wrote:
> On 06.02.21 19:46, Julien Grall wrote:
>> Hi Juergen,
>>
>> On 06/02/2021 10:49, Juergen Gross wrote:
>>> The first three patches are fixes for XSA-332. They avoid WARN splats
>>> and a performance issue with interdomain events.
>>
>> Thanks for helping to figure out the problem. Unfortunately, I still
>> reliably see the WARN splat with the latest Linux master
>> (1e0d27fce010) + your first 3 patches.
>>
>> I am using Xen 4.11 (1c7d984645f9) and dom0 is forced to use the 2L
>> events ABI.
>>
>> After some debugging, I think I have an idea of what went wrong. The
>> problem happens when the event is initially bound from vCPU0 to a
>> different vCPU.
>>
>> From the comment in xen_rebind_evtchn_to_cpu(), we are masking the
>> event to prevent it being delivered on an unexpected vCPU. However, I
However, I >> believe the following can happen: >> >> vCPU0                | vCPU1 >>                  | >>                  | Call xen_rebind_evtchn_to_cpu() >> receive event X            | >>                  | mask event X >>                  | bind to vCPU1 >>         | unmask event X >>                  | >>                  | receive event X >>                  | >>                  | handle_edge_irq(X) >> handle_edge_irq(X)        |  -> handle_irq_event() >>                  |   -> set IRQD_IN_PROGRESS >>   -> set IRQS_PENDING        | >>                  |   -> evtchn_interrupt() >>                  |   -> clear IRQD_IN_PROGRESS >>                  |  -> IRQS_PENDING is set >>                  |  -> handle_irq_event() >>                  |   -> evtchn_interrupt() >>                  |     -> WARN() >>                  | >> >> All the lateeoi handlers expect a ONESHOT semantic and >> evtchn_interrupt() is doesn't tolerate any deviation. >> >> I think the problem was introduced by 7f874a0447a9 ("xen/events: fix >> lateeoi irq acknowledgment") because the interrupt was disabled >> previously. Therefore we wouldn't do another iteration in >> handle_edge_irq(). > > I think you picked the wrong commit for blaming, as this is just > the last patch of the three patches you were testing. I actually found the right commit for blaming but I copied the information from the wrong shell :/. The bug was introduced by: c44b849cee8c ("xen/events: switch user event channels to lateeoi model") > >> Aside the handlers, I think it may impact the defer EOI mitigation >> because in theory if a 3rd vCPU is joining the party (let say vCPU A >> migrate the event from vCPU B to vCPU C). So info->{eoi_cpu, >> irq_epoch, eoi_time} could possibly get mangled? >> >> For a fix, we may want to consider to hold evtchn_rwlock with the >> write permission. Although, I am not 100% sure this is going to >> prevent everything. 
>
> It will make things worse, as it would violate the locking hierarchy
> (xen_rebind_evtchn_to_cpu() is called with the IRQ-desc lock held).

Ah, right.

>
> At first glance I think we'll need a 3rd masking state ("temporarily
> masked") in the second patch in order to avoid a race with lateeoi.
>
> In order to avoid the race you outlined above we need an "event is being
> handled" indicator checked via test_and_set() semantics in
> handle_irq_for_port() and reset only when calling clear_evtchn().

It feels like we are trying to work around the IRQ flow we are using
(i.e. handle_edge_irq()).

This reminds me of the thread we had before discovering XSA-332 (see [1]).
Back then, it was suggested to switch back to handle_fasteoi_irq().

Cheers,

[1] https://lore.kernel.org/xen-devel/alpine.DEB.2.21.2004271552430.29217@sstabellini-ThinkPad-T480s/

-- 
Julien Grall
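
The "event is being handled" indicator proposed in the quoted text can be
sketched in userspace C. This is only a minimal model under stated
assumptions, not the kernel change: struct evt_model and the evt_model_*()
helpers are hypothetical names, C11 atomic_exchange() stands in for
test_and_set() semantics, and evt_model_ack() stands in for the point where
clear_evtchn() would be called.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-event state; this is NOT the kernel's struct irq_info. */
struct evt_model {
    atomic_bool being_handled;  /* the "event is being handled" indicator */
    int handler_runs;           /* number of times the handler body ran
                                 * (plain int: the model is driven from a
                                 * single thread) */
};

static void evt_model_init(struct evt_model *e)
{
    atomic_init(&e->being_handled, false);
    e->handler_runs = 0;
}

/*
 * Models the entry check assumed to sit in handle_irq_for_port().
 * Returns true if this caller may run the handler, false if the
 * event is already in flight and this delivery is a duplicate.
 */
static bool evt_model_try_handle(struct evt_model *e)
{
    /* atomic_exchange() returns the previous value, i.e. test-and-set. */
    if (atomic_exchange(&e->being_handled, true))
        return false;
    e->handler_runs++;
    return true;
}

/* Models the reset that would happen only when clear_evtchn() is called. */
static void evt_model_ack(struct evt_model *e)
{
    atomic_store(&e->being_handled, false);
}
```

In this model a second delivery that arrives before the acknowledgment (the
vCPU0/vCPU1 overlap in the diagram) is rejected by evt_model_try_handle()
rather than re-entering the handler. Whether such a delivery should be
dropped or deferred in the real lateeoi path is a design question the sketch
does not settle.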