Date: Thu, 30 Nov 2023 11:48:54 +0000
From: Mark Rutland
To: "Ashley, William"
Cc: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
	linux-perf-users@vger.kernel.org, will@kernel.org
Subject: Re: armv8pmu: Pending overflow interrupt is discarded when perf event is disabled

On Wed, Nov 29, 2023 at 04:35:01PM +0000, Mark Rutland wrote:
> Hi William,
>
> On Mon, Nov 20, 2023 at 10:32:10PM +0000, Ashley, William wrote:
> > Adding linux-arm-kernel@lists.infradead.org and linux-kernel@vger.kernel.org,
> > sorry for the noise.
>
> Thanks for this!
>
> For the benefit of others, the original mail (and attachment) can be found at:
>
>   https://lore.kernel.org/linux-perf-users/950001BD-490C-4BAC-8EEA-CDB9F7C4ADFC@amazon.com/
>
> For some reason, links and whitespace seem to have been mangled in the resend;
> I'm not sure what happened there.
>
> I've added Will Deacon, as Will and I co-maintain the ARM PMU drivers.
> > On 11/20/23, 12:36 PM, "Ashley, William" wrote:
> >
> > An issue [1] was opened in the rr-debugger project reporting occasional missed
> > perf event overflow signals on arm64. I've been digging into this and think I
> > understand what's happening, but wanted to confirm my understanding.
> >
> > The attached example application, derived from an rr-debugger test case, reports
> > when the value of a counter doesn't increase by the expected period +/- some
> > tolerance. When it is ping-ponged between cores (e.g. with taskset) at a high
> > frequency, it frequently reports increases of ~2x the expected. I've confirmed
> > this same behavior on kernels 5.4, 5.10, 6.1 and 6.5.
> >
> > I found armv8pmu_disable_intens [2] that is called as part of event
> > de-scheduling and contains
> >
> >     /* Clear the overflow flag in case an interrupt is pending. */
> >     write_pmovsclr(mask);
> >
> > which results in any pending overflow interrupt being dropped. I added some
> > debug output here and indeed there is a correlation of this bit being high at
> > the point of the above code and the reproducer identifying a missed signal.
>
> I think you're right that if we had an overflow asserted at this point, we'll
> throw away the occurrence of the overflow (and so not call
> perf_event_overflow() and generate a sample, etc).
>
> It looks like we only lose the occurrence of the overflow; the actual counts
> will get sampled correctly and when we next reprogram the event,
> armpmu_event_set_period() should set up the next overflow period.
>
> > This behavior does not occur with pseudo-NMIs (irqchip.gicv3_pseudo_nmi=1)
> > enabled.
>
> That's interesting, because it implies that the PMU overflow interrupt is being
> recognised by the CPU while regular interrupts are disabled. There are some
> narrow races where that could occur (e.g. taking a timer or scheduler IRQ
> *just* as an overflow occurs), and some other cases I'd expect RR to avoid by
> construction (e.g. if RR isn't using mode exclusion and also counts kernel
> events). It's also worth noting that this means there are races even with
> pseudo-NMI where overflows could be lost.
>
> How often do you see the overflow being lost?
>
> Does RR set any of the perf_event_attr::exclude_* bits? If not, does RR
> intentionally count events that occur within the kernel?

Looking at the test, I see it sets perf_event_attr::exclude_kernel to 1, but
doesn't set perf_event_attr::exclude_host or perf_event_attr::exclude_hv. I
think the poorly-defined exclude_* bits are part of the problem here.

Using your test as-is on my ThunderX2, I can reproduce the period being longer
than expected by concurrently running the following in a shell:

  while true; do for C in $(seq 0 63); do taskset -c -p $C ${TEST_PID_HERE}; done; done > /dev/null

... resulting in:

| [mark@gravadlaks:~]% ./counter-overflow
| Pid 20060 running with period 10000 tolerance 1000
| Signal #1: last: 0, cur: 10292, max diff: 0
| Signal #415330: delta of 19999 is outside 10000 +/- 1000
| Signal #415330: last: 4153290187, cur: 4153310186, max diff: 10292
| Signal #489879: delta of 19998 is outside 10000 +/- 1000
| Signal #511842: delta of 20058 is outside 10000 +/- 1000
| Signal #511842: last: 5118430130, cur: 5118450188, max diff: 19999

However, if I modify the test to also set perf_event_attr::exclude_host=1, I do
not see any lost overflows after many minutes.
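For illustration, a minimal sketch of the attribute setup being described here,
i.e. setting all of the non-user exclude_* bits; the event type, config, period
and function name are placeholders, not taken from the attached reproducer:

  #include <linux/perf_event.h>
  #include <string.h>
  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <unistd.h>

  /*
   * Sketch only: open a sampling counter restricted to user space by setting
   * exclude_kernel/exclude_guest (as RR already does) plus exclude_hv and
   * exclude_host. Event type, config and period are placeholders.
   */
  static int open_user_only_counter(pid_t pid, unsigned long long period)
  {
          struct perf_event_attr attr;

          memset(&attr, 0, sizeof(attr));
          attr.size = sizeof(attr);
          attr.type = PERF_TYPE_HARDWARE;             /* placeholder */
          attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* placeholder */
          attr.sample_period = period;
          attr.disabled = 1;
          attr.exclude_kernel = 1;    /* what RR already sets */
          attr.exclude_guest = 1;     /* what RR already sets */
          attr.exclude_hv = 1;        /* additionally exclude the hypervisor */
          attr.exclude_host = 1;      /* additionally exclude the (VHE) host */

          return syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
  }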
On VHE hosts (like the ThunderX2), the host kernel gets counted when
perf_event_attr::exclude_host=0, even if perf_event_attr::exclude_kernel=1
(which I agree is surprising), so I think what's happening is the counters are
counting in the host kernel, which isn't what RR actually wants regardless.

From looking at:

  https://github.com/rr-debugger/rr/blob/master/src/PerfCounters.cc

... it looks like RR only sets perf_event_attr::exclude_{kernel,guest}, and due
to this I think it's still counting the host kernel (and potentially the
hypervisor, depending on that hypervisor's policy for what it filters from the
guest).

There's still a potential race on the kernel side, but I suspect that if RR
were to set the non-user exclude_* bits it could significantly reduce the
chance of a lost overflow signal on existing kernels, at least on bare-metal.

This is an annoying ABI; the interaction between all the bits is ill-defined,
especially host/hv/kernel (since for VHE those are the same thing). Ideally
this would have had the opposite polarity and you'd have only needed to set an
include_user bit and be done with it.

I'll continue to look at what we can do kernel-side, but I reckon it's worth
having RR try the other exclude bits regardless, if that's possible? It would
be interesting to know whether that helps you under a hypervisor.

Thanks,
Mark.

> > When an event is not being explicitly torn down (e.g. being closed), this seems
> > like an undesirable behavior.
>
> I agree it's undesirable, though my understanding is that for most other users
> this isn't any worse than losing samples for other reasons (e.g. the perf ring
> buffer being full), and for the general case of sample-based profiling, losing
> a sample every so often isn't the end of the world.
>
> > I haven't attempted to demo it yet, but I suspect
> > an application disabling an event temporarily could occasionally see the same
> > missed overflow signals. Is my understanding here correct?
>
> That sounds right to me, though I haven't checked that end-to-end yet.
>
> > Does anyone have thoughts on how this could be addressed without creating
> > other issues?
>
> We should be able to detect overflow from the counter value alone, so we might
> be able to account for that when we actually read the event out, or when we
> schedule it back in and reprogram the period.
>
> I'm not sure if we can reasonably do that when scheduling the event out, since
> if we're switching tasks, that'll queue up IRQ work which will be triggered in
> the context of the next task.
>
> We might be able to figure out that we have an overflow when we schedule the
> event in under armpmu_start(), but I'll need to go digging to see if there are
> any free-running counters as the comment above the call to
> armpmu_event_set_period() says, or whether that's a historical artifact.
>
> I suspect we might need to restructure the code somewhat to be able to catch
> overflows more consistently. I'll see if I can come up with something, but we
> might not be able to guarantee this in all cases.
>
> Mark.
>
> > [1] https://github.com/rr-debugger/rr/issues/3607
> > [2] https://github.com/torvalds/linux/blob/c42d9eeef8e5ba9292eda36fd8e3c11f35ee065c/drivers/perf/arm_pmuv3.c#L652C20-L652C43
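As a rough illustration of the "detect overflow from the counter value alone"
idea quoted above (a sketch, not a proposed patch: the helper name below is
hypothetical, and only the period_left bookkeeping from the generic ARM PMU
code is assumed):

  #include <linux/perf_event.h>
  #include <linux/perf/arm_pmu.h>

  /*
   * Illustrative sketch only. armpmu_event_set_period() programs the counter
   * to -period_left, and armpmu_event_update() subtracts the observed delta
   * from period_left. So if period_left has gone non-positive by the time the
   * event is next scheduled in, an overflow period elapsed whose interrupt may
   * have been discarded (e.g. by armv8pmu_disable_intens()). Generating the
   * missed sample from this context is the hard part noted above.
   */
  static bool armpmu_event_missed_overflow(struct perf_event *event)
  {
          struct hw_perf_event *hwc = &event->hw;

          /* Fold the current hardware count into event->count/period_left. */
          armpmu_event_update(event);

          return local64_read(&hwc->period_left) <= 0;
  }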