Date: Thu, 6 Jun 2024 11:57:20 +0530
Subject: Re: [PATCH V17 0/9] arm64/perf: Enable branch stack sampling
From: Anshuman Khandual
To: Mark Rutland, James Clark
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose, Peter Zijlstra,
 Ingo Molnar, Arnaldo Carvalho de Melo, linux-perf-users@vger.kernel.org,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 will@kernel.org, catalin.marinas@arm.com
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
References: <20240405024639.1179064-1-anshuman.khandual@arm.com>
 <80d33844-bdd2-4fee-81dd-9cd37c63d42c@arm.com>
 <5dcfbebe-f9f1-4767-bf21-7717f20af5d0@arm.com>
In-Reply-To: <5dcfbebe-f9f1-4767-bf21-7717f20af5d0@arm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 6/6/24 10:28, Anshuman Khandual wrote:
> On 5/30/24 23:11, Mark Rutland wrote:
>> On Thu, May 30, 2024 at 10:47:34AM +0100, James Clark wrote:
>>> On 05/04/2024 03:46, Anshuman Khandual wrote:
>>>> This series enables perf branch stack sampling support on the arm64 platform
>>>> via a new arch feature called Branch Record Buffer Extension (BRBE). All
>>>> the relevant register definitions can be accessed here.
>>>>
>>>> https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
>>>>
>>>> This series applies on 6.9-rc2.
>>>>
>>>> Also this series is being hosted below for quick access, review and test.
>>>>
>>>> https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v17)
>>>>
>>>> There are still some open questions regarding handling multiple perf events
>>>> with different privilege branch filters getting on the same PMU, supporting
>>>> guest branch stack tracing from the host, etc. Finally, also looking for some
>>>> suggestions regarding supporting BRBE inside the guest. The series has been
>>>> re-organized completely as suggested earlier.
>>>>
>>>> - Anshuman
>>>>
>>> [...]
>>>>
>>>> ------------------ Possible 'branch_sample_type' Mismatch -----------------
>>>>
>>>> Branch stack sampling attributes 'event->attr.branch_sample_type' generally
>>>> remain the same for all the events during a perf record session.
>>>>
>>>> $perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
>>>>
>>>> event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
>>>>
>>>> This 'branch_sample_type' is used to configure the BRBE hardware when both
>>>> events, i.e. <event_1> and <event_2>, get scheduled on a given PMU. But during
>>>> PMU HW event privilege filter inheritance, 'branch_sample_type' does not
>>>> remain the same for all events.
>>>> Let's consider the following example:
>>>>
>>>> $perf record -e cycles:u -e instructions:k -j any,save_type ls
>>>>
>>>> cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
>>>>
>>>> Because the cycles event inherits PERF_SAMPLE_BRANCH_USER and the instructions
>>>> event inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution here configures
>>>> the BRBE hardware with 'branch_sample_type' from the last event to be added to
>>>> the PMU, and hence captured branch records only get passed on to matching
>>>> events during a PMU interrupt.
>>>>
>>>
>>> Hi Anshuman,
>>>
>>> Surely because of this example we should merge? At least we have to try
>>> to make the most common basic command lines work. Unless we expect all
>>> tools to know whether the branch buffer is shared between PMUs on each
>>> architecture or not. The driver knows though, so it can merge the settings
>>> because it all has to go into one BRBE.
>>
>> The difficulty here is that these are opened as independent events (not
>> in the same event group), and so from the driver's PoV, this is no
>> different to two users independently doing:
>>
>> perf record -e event:u -j any,save_type -p ${SOME_PID}
>>
>> perf record -e event:k -j any,save_type -p ${SOME_PID}
>>
>> ... where either would be surprised to get the merged result.
>
> Right, that's the problem. The proposed idea here ensures that each event
> will get only the expected branch records, even though the sample size might
> get reduced, as the HW filter overrides might not be evenly split between
> them during the perf session.
>
>>
>> So, if we merge the filters into the most permissive set, we *must*
>> filter them when handing them to userspace so that each event gets the
>> expected set of branch records.
>
> Agreed. If the branch filters get merged into the most permissive set via
> OR semantics, then the results must be filtered before being pushed into
> the ring buffer for each individual event that has overflown during the
> PMU IRQ.
>
>>
>> Assuming we do that, for Anshuman's case above:
>>
>> perf record -e cycles:u -e instructions:k -j any,save_type ls
>>
>> ... the overflow of the cycles event will only record user branch
>> records, and the overflow of the instructions event will only record
>> kernel branch records.
>
> Right.
>
>>
>> I think it's arguable either way as to whether users want that or merged
>> records; we should probably figure out with the tools folk what the
>> desired behaviour is for that command line, but as above I don't think
>> that we can have the kernel give both events merged records unless
>> userspace asks for that explicitly.
>
> Right, we should not give merged results unless explicitly asked for by the
> event. Otherwise that might break the semantics.
>
>>
>>> Merging the settings in tools would be a much harder problem.
>>
>> I can see that it'd be harder to do that up-front when parsing each
>> event independently, but there is a phase where the tool knows about all
>> of the events and *might* be able to do that, whereas the driver doesn't
>> really know at any point that these events are related.
>>
>> Regardless, I assume that if the user passes:
>>
>> perf record -e cycles:u -e instructions:k -j any,u,k,save_type ls
>>
>> ... both events will be opened with PERF_SAMPLE_BRANCH_USER and
>> PERF_SAMPLE_BRANCH_KERNEL, so maybe that's good, and in-kernel filtering
>> is sufficient.
>
> Kernel filtering will not be required in this case, as "-j any,u,k" overrides
> both events' individual privilege requests, i.e. cycles:u and instructions:k.
> So both events will receive branch records matching PERF_SAMPLE_BRANCH_USER
> and PERF_SAMPLE_BRANCH_KERNEL. From the branch sample perspective, the
> privilege filter is (PERF_SAMPLE_BRANCH_USER | PERF_SAMPLE_BRANCH_KERNEL).
>
>>
>>> Any user that doesn't have permission for anything in the merged result
>>> can continue to get nothing.
>>>
>>> And we can always add filtering in the kernel later on if we want to
>>> make it appear to behave even more normally.
>>
>> As above, I think if we merge the HW filters in the kernel then the
>> kernel must do SW filtering. I don't think that's something we can leave
>> for later.
>
> Alright.
>
>>
>>>> static int
>>>> armpmu_add(struct perf_event *event, int flags)
>>>> {
>>>>         ........
>>>>         if (has_branch_stack(event)) {
>>>>                 /*
>>>>                  * Reset branch records buffer if a new task event gets
>>>>                  * scheduled on a PMU which might have existing records.
>>>>                  * Otherwise older branch records present in the buffer
>>>>                  * might leak into the new task event.
>>>>                  */
>>>>                 if (event->ctx->task && hw_events->brbe_context != event->ctx) {
>>>>                         hw_events->brbe_context = event->ctx;
>>>>                         if (armpmu->branch_reset)
>>>>                                 armpmu->branch_reset();
>>>>                 }
>>>>                 hw_events->brbe_users++;
>>>> Here ------->   hw_events->brbe_sample_type = event->attr.branch_sample_type;
>>>>         }
>>>>         ........
>>>> }
>>>>
>>>> Instead of overriding the existing 'branch_sample_type', both could be merged.
>>>>
>>>
>>> I can't see any use case where anyone would want the override behavior.
>>> Especially if you imagine multiple users not even aware of each other.
>>
>> I completely agree that one event overriding another is not an
>> acceptable solution.
>>
>>> Either the current "no records for mismatches" or the merged one makes sense.
>>
>> I think our options are:
>>
>> 1) Do not allow events with conflicting HW filters to be scheduled (e.g.
>>    treating this like a counter constraint).
>
> That's the easiest solution and will keep things simple, but the downside is
> that the sample size will probably get much reduced. Then again, such
> scenarios will be rare, and hence won't matter much.
>
>>
>> 2) Allow events with conflicting HW filters to be scheduled, merge the
>>    active HW filters, and SW filter the records in the kernel.
>>
>> 3) Allow events with conflicting branch filters to be scheduled, but
>>    only those which match the "active" filter get records.
>
> That's the proposed solution. The "active" filter is decided by whichever
> event comes last, overriding the previous one, and the PMU interrupt handler
> delivers branch records only to the matching events.
>
>>
>> So far (2) seems to me like the best option, and IIUC that's what x86
>> does with LBR. I suspect that also justifies discarding records at
>> context switch time, since we can't know how many relevant records were
>> discarded due to conflicting records (and so cannot safely stitch
>> records), and users have to expect that they may get fewer records than
>> may exist in HW anyway.
>
> So if we implement merging of branch filter requests and SW filtering for the
> captured branch records, should we also drop the saving and stitching
> mechanism completely?
>
> Coming back to the implementation for option (2), the following code change
> (tentative and untested) will merge branch filter requests and drop the event
> filter check during the PMU interrupt.
>
> diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> index 45ac2d0ca04c..9afa4e48d957 100644
> --- a/drivers/perf/arm_brbe.c
> +++ b/drivers/perf/arm_brbe.c
> @@ -286,7 +286,13 @@ void armv8pmu_branch_stack_add(struct perf_event *event, struct pmu_hw_events *h
>                 armv8pmu_branch_stack_reset();
>         }
>         hw_events->branch_users++;
> -       hw_events->branch_sample_type = event->attr.branch_sample_type;
> +
> +       /*
> +        * Merge all branch filter requests from different perf
> +        * events being added into this PMU. This includes both
> +        * privilege and branch type filters.
> +        */
> +       hw_events->branch_sample_type |= event->attr.branch_sample_type;
>  }
>
>  void armv8pmu_branch_stack_del(struct perf_event *event, struct pmu_hw_events *hw_events)
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index 6137ae4ba7c3..c5311d5365cc 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -856,15 +856,12 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
>                 return;
>
>         /*
> -        * Overflowed event's branch_sample_type does not match the configured
> -        * branch filters in the BRBE HW. So the captured branch records here
> -        * cannot be co-related to the overflowed event. Report to the user as
> -        * if no branch records have been captured, and flush branch records.
> -        * The same scenario is applicable when the current task context does
> -        * not match with overflown event.
> +        * When the current task context does not match with the PMU overflown
> +        * event, the captured branch records here cannot be co-related to the
> +        * overflowed event. Report to the user as if no branch records have
> +        * been captured, and flush branch records.
>          */
> -       if ((cpuc->branch_sample_type != event->attr.branch_sample_type) ||
> -           (event->ctx->task && cpuc->branch_context != event->ctx))
> +       if (event->ctx->task && cpuc->branch_context != event->ctx)
>                 return;
>
>         /*
>
> and something like the following change (tentative and untested) implements
> the required SW branch record filtering.
>
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index c5311d5365cc..d2390657c466 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -843,11 +843,97 @@ static void armv8pmu_stop(struct arm_pmu *cpu_pmu)
>         armv8pmu_pmcr_write(armv8pmu_pmcr_read() & ~ARMV8_PMU_PMCR_E);
>  }
>
> +static bool filter_branch_type(struct perf_event *event, struct perf_branch_entry *entry)
> +{
> +       u64 br_type = event->attr.branch_sample_type;
> +       u64 mask = PERF_BR_UNCOND;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_ANY)
> +               return true;
> +
> +       if (entry->type == PERF_BR_UNKNOWN)
> +               return true;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_IND_JUMP)
> +               mask |= PERF_BR_IND;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_COND) {
> +               mask &= ~PERF_BR_UNCOND;
> +               mask |= PERF_BR_COND;
> +       }
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_CALL)
> +               mask |= PERF_BR_CALL;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_IND_CALL)
> +               mask |= PERF_BR_IND_CALL;
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
> +               mask |= PERF_BR_CALL;
> +               mask |= PERF_BR_IRQ;
> +               mask |= PERF_BR_SYSCALL;
> +               mask |= PERF_BR_SERROR;
> +               if (br_type & PERF_SAMPLE_BRANCH_COND)
> +                       mask |= PERF_BR_COND_CALL;
> +       }
> +
> +       if (br_type & PERF_SAMPLE_BRANCH_ANY_RETURN) {
> +               mask |= PERF_BR_RET;
> +               mask |= PERF_BR_ERET;
> +               mask |= PERF_BR_SYSRET;
> +               if (br_type & PERF_SAMPLE_BRANCH_COND)
> +                       mask |= PERF_BR_COND_RET;
> +       }
> +       return mask & entry->type;
> +}
> +
> +static bool filter_branch_privilege(struct perf_event *event, struct perf_branch_entry *entry)
> +{
> +       u64 br_type = event->attr.branch_sample_type;
> +
> +       if (!(br_type & PERF_SAMPLE_BRANCH_USER)) {
> +               if (is_ttbr0_addr(entry->from) || is_ttbr0_addr(entry->to))
> +                       return false;
> +       }
> +
> +       if (!(br_type & PERF_SAMPLE_BRANCH_KERNEL)) {
> +               if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to))
> +                       return false;
> +       }
> +
> +       if (!(br_type & PERF_SAMPLE_BRANCH_HV)) {
> +               if (is_kernel_in_hyp_mode()) {
> +                       if (is_ttbr1_addr(entry->from) || is_ttbr1_addr(entry->to))
> +                               return false;
> +               }
> +       }
> +       return true;
> +}
> +
> +static void filter_branch_records(struct perf_event *event, struct pmu_hw_events *cpuc,
> +                                 struct branch_records *event_records)
> +{
> +       struct perf_branch_entry *entry;
> +       int idx, count;
> +
> +       memset(event_records, 0, sizeof(struct branch_records));
> +       for (idx = 0, count = 0; idx < cpuc->branches->branch_stack.nr; idx++) {
> +               entry = &cpuc->branches->branch_entries[idx];
> +               if (!filter_branch_privilege(event, entry) || !filter_branch_type(event, entry))
> +                       continue;
> +
> +               memcpy(&event_records->branch_entries[count], entry, sizeof(struct perf_branch_entry));
> +               count++;
> +       }
> +       event_records->branch_stack.nr = count;
> +}
> +
>  static void read_branch_records(struct pmu_hw_events *cpuc,
>                                 struct perf_event *event,
>                                 struct perf_sample_data *data,
>                                 bool *branch_captured)
>  {
> +       struct branch_records event_records;
> +
>         /*
>          * CPU specific branch records buffer must have been allocated already
>          * for the hardware records to be captured and processed further.
> @@ -874,6 +960,20 @@ static void read_branch_records(struct pmu_hw_events *cpuc,
>                 armv8pmu_branch_read(cpuc, event);
>                 *branch_captured = true;
>         }
> +
> +       /*
> +        * Filter captured branch records
> +        *
> +        * PMU captured branch records would contain samples applicable for
> +        * the aggregated branch filters, for all events that got scheduled
> +        * on this PMU. Hence the branch records need to be filtered first
> +        * so that each individual event gets the samples it had requested.
> +        */
> +       if (cpuc->branch_sample_type != event->attr.branch_sample_type) {
> +               filter_branch_records(event, cpuc, &event_records);
> +               perf_sample_save_brstack(data, event, &event_records.branch_stack, NULL);
> +               return;
> +       }
>         perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack, NULL);
>  }

But now with the above SW filtering enabled during branch record processing,
there are two possible problems:

- Potential for RCU stalls, which have been observed at times
- IRQ handling gets delayed while these records are processed repeatedly for
  each overflown event