Received: by 2002:ab2:6309:0:b0:1fb:d597:ff75 with SMTP id s9csp293364lqt; Thu, 6 Jun 2024 04:01:37 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCUFFqdF2WOj3psbPt/uBk1sLCg8s0rvQYOcjsxT19BigIXCHOMaRvlmsM9O1l5KPyiZZK638OUGDPpRdRmBXwzh/VRDsBaKFg1A0pAbeA== X-Google-Smtp-Source: AGHT+IEXygLplrT1Fa9kUNnVn5Nf0tTjgD0fSlRnk2JPC2zP0X9WMRREqfYgExPizXV8NJxw5qAu X-Received: by 2002:a05:6870:a54b:b0:24f:d1aa:9bd0 with SMTP id 586e51a60fabf-25121d01240mr5898280fac.13.1717671697317; Thu, 06 Jun 2024 04:01:37 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1717671697; cv=pass; d=google.com; s=arc-20160816; b=gVe7pfXmVyx8yfP9aQVhkGkpTI+OuCE3qonMIRKw3Rih4wcumQG6Tgt/OK0g/jGj5J /bqydA0TZU+ba5m99xsXhlZOJQd84ZvlNGSSRfSJYllSvM2waA7t7gkPKV8PfgJgeG3+ S990wP7AVFgkxqsda3fqJkp5xgljjGSMusbYDLJqzBAM/c2H6RH9OM1FuKNnkVwrxk/1 d/SHWKuzuTAHxjAfNkfnXZcnFSkxEAErJxJzeLjAWMEs3sjj2s5HaFzUJi6AqVcurrLx 8uVOqhZW0TT3LnwoaSh5R54YrmQw1sBu0/BgjqaCPzp3XK/99Hg8k9R0fmhbmcA9zdj6 lndg== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=content-transfer-encoding:in-reply-to:from:content-language :references:cc:to:subject:user-agent:mime-version:list-unsubscribe :list-subscribe:list-id:precedence:date:message-id; bh=ztenOfUc21HZMsRoz4eyylyuUJX2u1MscUHGReeYHII=; fh=KyJhxZLysNoWu6C00xbcpFM1wwdmwyO18vlYftq6EUA=; b=zdWvT+ELstd4L94QhpX4cjRWJGQ9Y+/6XMdfTdwa0jkMoiRJHc2ap+0Ym+g6maxUpy +UBygv3w/PfkdBdnNFlD9v/yys3Iq1jMzeFjuX10Hi8hTCCTlyquu2aG42VEHokb+so7 5Y9RkXl8Fcn0WuNEKAG0DJvOmdYiJLk/rkkJDCWClDikmvSFy15XOvlBXtLO09cb8AGp dNSqg8hQPJkLkYUK0HZ9Rjw5IiuD7biNl+fJj0RCFg5+OC6XNVy7NOFTH0dXjStfAPSg a4iSXFRzNHBfvrBavuJFXuIBufhSLOuRqhOamLiLjm5asVsk9X7Yck6+azexgIYSUmuf g+yw==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; arc=pass (i=1 spf=pass spfdomain=arm.com dmarc=pass fromdomain=arm.com); spf=pass (google.com: domain of linux-kernel+bounces-204151-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-204151-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Return-Path: Received: from ny.mirrors.kernel.org (ny.mirrors.kernel.org. [2604:1380:45d1:ec00::1]) by mx.google.com with ESMTPS id af79cd13be357-79538545a0asi20462285a.450.2024.06.06.04.01.37 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 06 Jun 2024 04:01:37 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-204151-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) client-ip=2604:1380:45d1:ec00::1; Authentication-Results: mx.google.com; arc=pass (i=1 spf=pass spfdomain=arm.com dmarc=pass fromdomain=arm.com); spf=pass (google.com: domain of linux-kernel+bounces-204151-linux.lists.archive=gmail.com@vger.kernel.org designates 2604:1380:45d1:ec00::1 as permitted sender) smtp.mailfrom="linux-kernel+bounces-204151-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ny.mirrors.kernel.org (Postfix) with ESMTPS id F0F461C25934 for ; Thu, 6 Jun 2024 11:01:36 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id C4316194C9A; Thu, 6 Jun 2024 11:01:31 +0000 (UTC) Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by smtp.subspace.kernel.org (Postfix) with ESMTP id 540F63A8C0; Thu, 6 Jun 2024 11:01:28 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=217.140.110.172 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717671690; cv=none; b=LVyeDQGPac88AuNH2hAOjv77Sz3Hhnr3KWMoiaWj0s7DtkTEVGDpT+pLXuNXxb7QWQ7Tntnzqcj3MidtJvdpnADcvarSKrcoGo3Rpqz/ABRtTXw9GxeXTO2zIyM6FMlXq9T/f1h0gGoEr6QQgapeJdt/BFQ9ELjYzMlzdGYS74Q= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1717671690; c=relaxed/simple; bh=XOb74sq0KCeqPlWTLiQ4TrJW2NPp7b949OWlKC/giPM=; h=Message-ID:Date:MIME-Version:Subject:To:Cc:References:From: In-Reply-To:Content-Type; b=VoCnwdLqvVHpiks6H5bbCLmlePH8TwPIYPxoJABhZ6A+21KwQjwbbMZCck8+Aty73W7DUWz1sy67MaiF/jJGJM3ckRxflQLvNzTPvCLb74XAadfonSwbG8LBOJQCwTGE5IWxzYUD1MM4NY0fA2kfvBPhVi1WZWz7ih6CLKFpnG4= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com; spf=pass smtp.mailfrom=arm.com; arc=none smtp.client-ip=217.140.110.172 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=arm.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 3886B2F4; Thu, 6 Jun 2024 04:01:52 -0700 (PDT) Received: from [192.168.1.100] (usa-sjc-mx-foss1.foss.arm.com [172.31.20.19]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id E0B283F762; Thu, 6 Jun 2024 04:01:25 -0700 (PDT) Message-ID: <5e968259-2fd5-49fa-940f-e019b35966b2@arm.com> Date: Thu, 6 Jun 2024 12:01:25 +0100 Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling To: Anshuman Khandual , Mark Rutland Cc: Mark Brown , Rob Herring , Marc Zyngier , Suzuki Poulose , Peter Zijlstra , Ingo Molnar , Arnaldo Carvalho de Melo , linux-perf-users@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, will@kernel.org, catalin.marinas@arm.com References: <20240405024639.1179064-1-anshuman.khandual@arm.com> <80d33844-bdd2-4fee-81dd-9cd37c63d42c@arm.com> <5dcfbebe-f9f1-4767-bf21-7717f20af5d0@arm.com> Content-Language: en-US From: James Clark In-Reply-To: <5dcfbebe-f9f1-4767-bf21-7717f20af5d0@arm.com> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit On 06/06/2024 05:58, Anshuman Khandual wrote: > On 5/30/24 23:11, Mark Rutland wrote: >> On Thu, May 30, 2024 at 10:47:34AM +0100, James Clark wrote: >>> On 05/04/2024 03:46, Anshuman Khandual wrote: >>>> This series enables perf branch stack sampling support on arm64 platform >>>> via a new arch feature called Branch Record Buffer Extension (BRBE). All >>>> the relevant register definitions could be accessed here. >>>> >>>> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers >>>> >>>> This series applies on 6.9-rc2. >>>> >>>> Also this series is being hosted below for quick access, review and test. >>>> >>>> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17) >>>> >>>> There are still some open questions regarding handling multiple perf events >>>> with different privilege branch filters getting on the same PMU, supporting >>>> guest branch stack tracing from the host etc. Finally also looking for some >>>> suggestions regarding supporting BRBE inside the guest. The series has been >>>> re-organized completely as suggested earlier. >>>> >>>> - Anshuman >>>> >>> [...] >>>> >>>> ------------------ Possible 'branch_sample_type' Mismatch ----------------- >>>> >>>> Branch stack sampling attributes 'event->attr.branch_sample_type' generally >>>> remain the same for all the events during a perf record session. >>>> >>>> $perf record -e -e -j [workload] >>>> >>>> event_1->attr.branch_sample_type == event_2->attr.branch_sample_type >>>> >>>> This 'branch_sample_type' is used to configure the BRBE hardware, when both >>>> events i.e and get scheduled on a given PMU. But during >>>> PMU HW event's privilege filter inheritance, 'branch_sample_type' does not >>>> remain the same for all events. Let's consider the following example >>>> >>>> $perf record -e cycles:u -e instructions:k -j any,save_type ls >>>> >>>> cycles->attr.branch_sample_type != instructions->attr.branch_sample_type >>>> >>>> Because cycles event inherits PERF_SAMPLE_BRANCH_USER and instruction event >>>> inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures >>>> BRBE hardware with 'branch_sample_type' from last event to be added in the >>>> PMU and hence captured branch records only get passed on to matching events >>>> during a PMU interrupt. >>>> >>> >>> Hi Anshuman, >>> >>> Surely because of this example we should merge? At least we have to try >>> to make the most common basic command lines work. Unless we expect all >>> tools to know whether the branch buffer is shared between PMUs on each >>> architecture or not. The driver knows though, so can merge the settings >>> because it all has to go into one BRBE. >> >> The difficulty here is that these are opened as independent events (not >> in the same event group), and so from the driver's PoV, this is no >> different two two users independently doing: >> >> perf record -e event:u -j any,save_type -p ${SOME_PID} >> >> perf record -e event:k -j any,save_type -p ${SOME_PID} >> >> ... where either would be surprised to get the merged result. > > Right, that's the problem. The proposed idea here ensures that each event > here will get only expected branch records, even though sample size might > get reduced as the HW filters overrides might not be evenly split between > them during the perf session. > >> >> So, if we merge the filters into the most permissive set, we *must* >> filter them when handing them to userspace so that each event gets the >> expected set of branch records. > > Agreed, if the branch filters get merged to the most permissive set via > an OR semantics, then results must be filtered before being pushed into > the ring buffer for each individual event that has overflown during the > PMU IRQ. > >> >> Assuming we do that, for Anshuman's case above: >> >> perf record -e cycles:u -e instructions:k -j any,save_type ls >> >> ... the overflow of the cycles evnt will only record user branch >> records, and the overflow of the instructions event will only record >> kernel branch records. > > Right. > >> >> I think it's arguable either way as to whether users want that or merged >> records; we should probably figure out with the tools folk what the >> desired behaviour is for that command line, but as above I don't think >> that we can have the kernel give both events merged records unless >> userspace asks for that explicitly. > > Right, we should not give merged results unless explicitly asked by the > event. Otherwise that might break the semantics. > >> >>> Merging the settings in tools would be a much harder problem. >> >> I can see that it'd be harder to do that up-front when parsing each >> event independently, but there is a phase where the tool knows about all >> of the events and *might* be able to do that, where the driver doesn't >> really know at any point that these events are related. >> >> Regardless, I assume that if the user passes: >> >> perf record -e cycles:u -e instructions:k -k any,u,k,save_type ls >> >> ... both events will be opened with PERF_SAMPLE_BRANCH_USER and >> PERF_SAMPLE_BRANCH_KERNEL, so maybe that's good, and in-kernel filtering >> is sufficient. > Kernel filtering will not be required in this case as "-j any,u,k," overrides > both event's individual privilege request i.e cycles:u and instructions:k. So > both the events will receive branch records related to PERF_SAMPLE_BRANCH_USER > and PERF_SAMPLE_BRANCH_KERNEL. From branch sample perspective privilege filter > is (PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_KERNEL). > >> >>> Any user that doesn't have permission for anything in the merged result >>> can continue to get nothing. >>> >>> And we can always add filtering in the kernel later on if we want to to >>> make it appear to behave even more normally. >> >> As above, I think if we merge the HW filters in the kernel then the >> kernel must do SW filtering. I don't think that's something we can leave >> for later. > > Alright. > >> >>>> static int >>>> armpmu_add(struct perf_event *event, int flags) >>>> { >>>> ........ >>>> if (has_branch_stack(event)) { >>>> /* >>>> * Reset branch records buffer if a new task event gets >>>> * scheduled on a PMU which might have existing records. >>>> * Otherwise older branch records present in the buffer >>>> * might leak into the new task event. >>>> */ >>>> if (event->ctx->task && hw_events->brbe_context != event->ctx) { >>>> hw_events->brbe_context = event->ctx; >>>> if (armpmu->branch_reset) >>>> armpmu->branch_reset(); >>>> } >>>> hw_events->brbe_users++; >>>> Here -------> hw_events->brbe_sample_type = event->attr.branch_sample_type; >>>> } >>>> ........ >>>> } >>>> >>>> Instead of overriding existing 'branch_sample_type', both could be merged. >>>> >>> >>> I can't see any use case where anyone would want the override behavior. >>> Especially if you imagine multiple users not even aware of each other. >> >> I completely agree that one event overriding another is not an >> acceptable solution. >> >>> Either the current "no records for mismatches" or the merged one make sense. >> >> I think our options are: >> >> 1) Do not allow events with conflicting HW filters to be scheduled (e.g. >> treating this like a counter constraint). > > That's the easiest solution and will keep things simple but downside being > the sample size will probably get much reduced. But such scenarios will be > rare, and hence won't matter much. > >> >> 2) Allow events with conflicting HW filters to be scheduled, merge the >> active HW filters, and SW filter the records in the kernel. >> >> 3) Allow events with conflicting branch filters to be scheduled, but >> only those which match the "active" filter get records. > > That's the proposed solution. "Active" filter gets decided on which event > comes last thus override the previous and PMU interrupt handler delivers > branch records only to the matching events. > >> >> So far (2) seems to me like the best option, and IIUC that's what x86 >> does with LBR. I suspect that also justifies discarding records at >> context switch time, since we can't know how many relevant records were >> discarded due to conflicting records (and so cannot safely stitch >> records), and users have to expect that they may get fewer records than >> may exist in HW anyway. > > So if we implement merging branch filters requests and SW filtering for the > captured branch records, we should also drop saving and stitching mechanism > completely ? > > Coming back to the implementation for option (2), the following code change > (tentative and untested) will merge branch filter requests and drop the event > filter check during PMU interrupt. > > diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c > index 45ac2d0ca04c..9afa4e48d957 100644 > --- a/drivers/perf/arm_brbe.c > +++ b/drivers/perf/arm_brbe.c > @@ -286,7 +286,13 @@ void armv8pmu_branch_stack_add(struct perf_event *event, struct pmu_hw_events *h > armv8pmu_branch_stack_reset(); > } > hw_events->branch_users++; > - hw_events->branch_sample_type = event->attr.branch_sample_type; > + > + /* > + * Merge all branch filter requests from different perf > + * events being added into this PMU. This includes both > + * privilege and branch type filters. > + */ > + hw_events->branch_sample_type |= event->attr.branch_sample_type; > } > > void armv8pmu_branch_stack_del(struct perf_event *event, struct pmu_hw_events *hw_events) > diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c > index 6137ae4ba7c3..c5311d5365cc 100644 > --- a/drivers/perf/arm_pmuv3.c > +++ b/drivers/perf/arm_pmuv3.c > @@ -856,15 +856,12 @@ static void read_branch_records(struct pmu_hw_events *cpuc, > return; > > /* > - * Overflowed event's branch_sample_type does not match the configured > - * branch filters in the BRBE HW. So the captured branch records here > - * cannot be co-related to the overflowed event. Report to the user as > - * if no branch records have been captured, and flush branch records. > - * The same scenario is applicable when the current task context does > - * not match with overflown event. > + * When the current task context does not match with the PMU overflown > + * event, the captured branch records here cannot be co-related to the > + * overflowed event. Report to the user as if no branch records have > + * been captured, and flush branch records. > */ > - if ((cpuc->branch_sample_type != event->attr.branch_sample_type) || > - (event->ctx->task && cpuc->branch_context != event->ctx)) > + if (event->ctx->task && cpuc->branch_context != event->ctx) > return; > > /* > > and something like the following change (tentative and untested) implements the > required SW branch records filtering. > > diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c > index c5311d5365cc..d2390657c466 100644 > --- a/drivers/perf/arm_pmuv3.c > +++ b/drivers/perf/arm_pmuv3.c > @@ -843,11 +843,97 @@ static void armv8pmu_stop(struct arm_pmu *cpu_pmu) > armv8pmu_pmcr_write(armv8pmu_pmcr_read() & ~ARMV8_PMU_PMCR_E); > } > > +static bool filter_branch_type(struct perf_event *event, struct perf_branch_entry *entry) > +{ > + u64 br_type = event->attr.branch_sample_type; > + u64 mask = PERF_BR_UNCOND; > + > + if (br_type & PERF_SAMPLE_BRANCH_ANY) > + return true; > + > + if (entry->type == PERF_BR_UNKNOWN) > + return true; > + > + if (br_type & PERF_SAMPLE_BRANCH_IND_JUMP) > + mask |= PERF_BR_IND; > + > + if (br_type & PERF_SAMPLE_BRANCH_COND) { > + mask &= ~PERF_BR_UNCOND; > + mask |= PERF_BR_COND; > + } > + > + if (br_type & PERF_SAMPLE_BRANCH_CALL) > + mask |= PERF_BR_CALL; > + > + if (br_type & PERF_SAMPLE_BRANCH_IND_CALL) > + mask |= PERF_BR_IND_CALL; > + > + if (br_type & PERF_SAMPLE_BRANCH_ANY_CALL) { > + mask |= PERF_BR_CALL; > + mask |= PERF_BR_IRQ; > + mask |= PERF_BR_SYSCALL; > + mask |= PERF_BR_SERROR; > + if (br_type & PERF_SAMPLE_BRANCH_COND) > + mask |= PERF_BR_COND_CALL; > + } > + > + if (br_type & PERF_SAMPLE_BRANCH_ANY_RETURN) { > + mask |= PERF_BR_RET; > + mask |= PERF_BR_ERET; > + mask |= PERF_BR_SYSRET; > + if (br_type & PERF_SAMPLE_BRANCH_COND) > + mask |= PERF_BR_COND_RET; > + } > + return mask & entry->type; > +} You can build this mask once for each event because the options don't change. At the moment it's built for each record which seems excessive. > + > +static bool filter_branch_privilege(struct perf_event *event, struct perf_branch_entry *entry) > +{ > + u64 br_type = event->attr.branch_sample_type; > + > + if (!(br_type & PERF_SAMPLE_BRANCH_USER)) { > + if (is_ttbr0_addr(entry->from) || is_ttbr0_addr(entry->to)) > + return false; > + } > + > + if (!(br_type & PERF_SAMPLE_BRANCH_KERNEL)) { > + if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to)) > + return false; > + } > + > + if (!(br_type & PERF_SAMPLE_BRANCH_HV)) { > + if (is_kernel_in_hyp_mode()) { > + if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to)) > + return false; > + } > + } > + return true; > +} > + > +static void filter_branch_records(struct perf_event *event, struct pmu_hw_events *cpuc, > + struct branch_records *event_records) > +{ > + struct perf_branch_entry *entry; > + int idx, count; > + > + memset(event_records, 0, sizeof(struct branch_records)); > + for (idx = 0, count = 0; idx < cpuc->branches->branch_stack.nr; idx++) { > + entry = &cpuc->branches->branch_entries[idx]; > + if (!filter_branch_privilege(event, entry) || !filter_branch_type(event, entry)) > + continue; > + > + memcpy(&event_records->branch_entries[count], &entry, sizeof(struct perf_branch_entry)); > + count++; > + } > +} > + > static void read_branch_records(struct pmu_hw_events *cpuc, > struct perf_event *event, > struct perf_sample_data *data, > bool *branch_captured) > { > + struct branch_records event_records; > + > /* > * CPU specific branch records buffer must have been allocated already > * for the hardware records to be captured and processed further. > @@ -874,6 +960,20 @@ static void read_branch_records(struct pmu_hw_events *cpuc, > armv8pmu_branch_read(cpuc, event); > *branch_captured = true; > } > + > + /* > + * Filter captured branch records > + * > + * PMU captured branch records would contain samples applicable for > + * the agregated branch filters, for all events that got scheduled > + * on this PMU. Hence the branch records need to be filtered first > + * so that each individual event get samples they had requested. > + */ > + if (cpuc->branch_sample_type != event->attr.branch_sample_type) { > + filter_branch_records(event, cpuc, &event_records); > + perf_sample_save_brstack(data, event, &event_records.branch_stack, NULL); > + return; > + } > perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack, NULL); > } > >