Message-ID: <22e2728f-d51a-bf92-4791-a7df5c4a2c15@arm.com>
Date: Tue, 25 Apr 2023 16:49:49 +0100
Subject: Re: [PATCH] perf/arm-cmn: Fix DTC reset
From: Robin Murphy
To: Geoff Blake
Cc: will@kernel.org, mark.rutland@arm.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, ilkka@os.amperecomputing.com
References: <5ea7ec4e-bf9b-f3a7-965c-fa85b640d00f@arm.com>
In-Reply-To: <5ea7ec4e-bf9b-f3a7-965c-fa85b640d00f@arm.com>

On 14/04/2023 3:19 pm, Robin Murphy wrote:
> On 2023-04-06 22:25, Geoff Blake wrote:
>> Ran this patch on an AWS C6g.metal and unfortunately still see the
>> spurious IRQs trigger quickly (within 10 tries) when using the following
>> flow:
>>
>> perf stat -a -e arm_cmn_0/event=0x5,type=0x5/ -- sleep 600
>> kexec -e
>>
>> Adding in the simple shutdown routine, I have run over 100 of the
>> above cycles and the spurious IRQs haven't triggered.  I think we still
>> need both for now.
>
> There is no "need both" - if this patch doesn't work to reset the PMU as
> intended then we still need a better patch that does. After yet more
> trying, I still cannot reproduce your results, but I do suspect this
> patch isn't as good as it initially seemed.
>
> I got my hands on a C6g.metal instance, and I'm building the mainline
> version of arm-cmn.c from my cmn-dev branch (including the two other
> pending fixes that I've sent recently) against the 5.15.0-1031-aws
> kernel that it came with, as a standalone module with a trivial
> makefile. Even running "stress -m 60" in the background, as the most
> effective thing I've found so far, that hnf_pocq_reqs_recvd event takes
> well over 8 minutes to overflow, so I have failed to achieve the
> necessary timing to kexec at just the right point for the residual
> interconnect traffic to add up and overflow the event during the handful
> of seconds that the kexec takes.
> For completeness, I have managed to run
> the perf stat/kexec, then run stress for 10 minutes under the new
> kernel, *then* finally load the module to achieve the right conditions,
> but that's so utterly contrived and long-winded that I don't really have
> the patience to do it more than the twice that I already did.
>
> What I can do instantly and reliably is reproduce equivalent conditions
> with my (now even more stripped-down) remove hack[1] and a simple
> rmmod/insmod (with a few seconds in between for good measure), leading
> to demonstrable latent overflows on all 4 DTCs every time. The existing
> code does seem to manage to reset DTC0 such that its interrupt (IRQ 27)
> does not fire, consistent with what I've observed on other machines,
> while I see the secondary DTCs (IRQs 28, 29 and 30) each fire 100000
> times spuriously and get disabled. With this patch on top[2], that
> consistently does not happen over 100 unload/reload cycles.
>
> Given that you say the same write to clear DTC_CTL, but a few seconds
> earlier in the form of the shutdown hook, does seem to work, I have
> still been wary of some kind of weird timing issue all along, but the
> fact that I was getting such consistent behaviour even on C6g seemed to
> be pointing away from that :/
>
> The closest I've got so far is by leaving this even more involved test
> loop (with real PMU programming in between) running overnight:
>
> for i in {1..10000}; do sudo insmod arm-cmn.ko && sudo perf stat -e
> arm_cmn_0/eventid=5,type=5/ sleep 1 && sudo rmmod arm-cmn && sleep 4; done
>
> and now coming back to find /proc/interrupts saying this:
>
>  27:          1          0          0...
>  28:          1          0          0...
>  29:          2          0          0...
>  30:          1          0          0...
>
> I've quite often seen a single IRQ firing earlier than expected (not
> necessarily spuriously), so I still need to check what's up with that -
> it may be that writing to the counters doesn't always take either.
> However, the single extra incidence of IRQ 29 which has happened at some
> point after I went home is more of a smoking gun:
>
> [84581.790043] WARNING: CPU: 0 PID: 0 at /home/ubuntu/arm-cmn.c:1828
> arm_cmn_handle_irq+0x148/0x1cc [arm_cmn]
>
> So something still snuck through reset, but it *was* at least visible
> and clearable by the time the IRQ was enabled. Interestingly the other
> warning for !dtc->cycles did not fire at the same time, even though the
> hack normally overflows PMCCNTR before PMEVCNTR(0). I'll keep digging...

I realise I neglected to follow up on this - where I got to was adding an
extra read back of CMN_DTC_CTL after the write, tweaking my remove hack
to generate overflows even more reliably for good measure, and that then
ran for ~56,000 test cycles (until my time on the instance ran out)
without any stray interrupts at all. However, I wasn't keen on posting
that as a v2 without any better justification than it being an "add
random things to change the timing a bit" bodge. Since we're in the merge
window now, I'll see if I can get a better answer by -rc1.

Thanks,
Robin.
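
For anyone repeating the overnight insmod/rmmod loop quoted above, a small
helper along these lines could total the per-CPU counts for the DTC
interrupt lines rather than reading /proc/interrupts by eye. This is only a
sketch: the function name irq_totals is made up for illustration, and the
IRQ numbers 27-30 are simply the ones reported on the C6g.metal instance in
this thread, so they will likely differ on other machines.

```shell
#!/bin/sh
# irq_totals (hypothetical helper, not part of any existing tool):
# read /proc/interrupts-style text on stdin and print "<irq> <total>"
# for each IRQ number listed in the first argument.
irq_totals() {
    awk -v want="$1" '
        BEGIN {
            # Build a lookup set from the space-separated IRQ list.
            n = split(want, w, " ")
            for (i = 1; i <= n; i++) sel[w[i]] = 1
        }
        {
            irq = $1
            sub(/:$/, "", irq)            # "29:" -> "29"
            if (!(irq in sel)) next       # skip header and other IRQs
            total = 0
            # Per-CPU counts are the leading all-digit fields after the
            # label; stop at the first field that is not purely numeric.
            for (f = 2; f <= NF && $f ~ /^[0-9]+$/; f++) total += $f
            print irq, total
        }'
}
```

Usage, assuming the same IRQ numbering as above:

    irq_totals "27 28 29 30" < /proc/interrupts

Comparing the totals before and after a run gives the number of times each
DTC interrupt fired during that run.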