Date: Wed, 10 Jan 2024 14:49:39 +0000
From: Mark Rutland
To: Namhyung Kim
Cc: Peter Zijlstra, Ingo Molnar, Alexander Shishkin, Arnaldo Carvalho de Melo,
        LKML, Mingwei Zhang, Ian Rogers, Kan Liang
Subject: Re: [PATCH RESEND 1/2] perf/core: Update perf_adjust_freq_unthr_context()
X-Mailing-List: linux-kernel@vger.kernel.org
References: <20240109213623.449371-1-namhyung@kernel.org>
In-Reply-To: <20240109213623.449371-1-namhyung@kernel.org>

On Tue, Jan 09, 2024 at 01:36:22PM -0800, Namhyung Kim wrote:
> It was unnecessarily disabling and enabling PMUs for each event. It
> should be done at PMU level. Add pmu_ctx->nr_freq counter to check it
> at each PMU. As pmu context has separate active lists for pinned group
> and flexible group, factor out a new function to do the job.
>
> Another minor optimization is that it can skip PMUs w/ CAP_NO_INTERRUPT
> even if it needs to unthrottle sampling events.
>
> Reviewed-by: Ian Rogers
> Reviewed-by: Kan Liang
> Tested-by: Mingwei Zhang
> Signed-off-by: Namhyung Kim

Hi,

I've taken a quick look and I don't think this is quite right for
hybrid/big.LITTLE, but I think that should be relatively simple to fix
(more on that below).

This seems to be a bunch of optimizations; was that based on inspection
alone, or have you found a workload where this has a measurable impact?
> ---
>  include/linux/perf_event.h |  1 +
>  kernel/events/core.c       | 68 +++++++++++++++++++++++---------------
>  2 files changed, 43 insertions(+), 26 deletions(-)
>
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index d2a15c0c6f8a..b2ff60fa487e 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -883,6 +883,7 @@ struct perf_event_pmu_context {
>
>          unsigned int                    nr_events;
>          unsigned int                    nr_cgroups;
> +        unsigned int                    nr_freq;
>
>          atomic_t                        refcount; /* event <-> epc */
>          struct rcu_head                 rcu_head;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 59b332cce9e7..ce9db9dbfd4c 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2277,8 +2277,10 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx)
>
>          if (!is_software_event(event))
>                  cpc->active_oncpu--;
> -        if (event->attr.freq && event->attr.sample_freq)
> +        if (event->attr.freq && event->attr.sample_freq) {
>                  ctx->nr_freq--;
> +                epc->nr_freq--;
> +        }
>          if (event->attr.exclusive || !cpc->active_oncpu)
>                  cpc->exclusive = 0;
>
> @@ -2533,9 +2535,10 @@ event_sched_in(struct perf_event *event, struct perf_event_context *ctx)
>
>          if (!is_software_event(event))
>                  cpc->active_oncpu++;
> -        if (event->attr.freq && event->attr.sample_freq)
> +        if (event->attr.freq && event->attr.sample_freq) {
>                  ctx->nr_freq++;
> -
> +                epc->nr_freq++;
> +        }
>          if (event->attr.exclusive)
>                  cpc->exclusive = 1;
>
> @@ -4098,30 +4101,14 @@ static void perf_adjust_period(struct perf_event *event, u64 nsec, u64 count, bo
>          }
>  }
>
> -/*
> - * combine freq adjustment with unthrottling to avoid two passes over the
> - * events. At the same time, make sure, having freq events does not change
> - * the rate of unthrottling as that would introduce bias.
> - */
> -static void
> -perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
> +static void perf_adjust_freq_unthr_events(struct list_head *event_list)
>  {
>          struct perf_event *event;
>          struct hw_perf_event *hwc;
>          u64 now, period = TICK_NSEC;
>          s64 delta;
>
> -        /*
> -         * only need to iterate over all events iff:
> -         * - context have events in frequency mode (needs freq adjust)
> -         * - there are events to unthrottle on this cpu
> -         */
> -        if (!(ctx->nr_freq || unthrottle))
> -                return;
> -
> -        raw_spin_lock(&ctx->lock);
> -
> -        list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
> +        list_for_each_entry(event, event_list, active_list) {
>                  if (event->state != PERF_EVENT_STATE_ACTIVE)
>                          continue;
>
> @@ -4129,8 +4116,6 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
>                  if (!event_filter_match(event))
>                          continue;
>
> -                perf_pmu_disable(event->pmu);
> -
>                  hwc = &event->hw;
>
>                  if (hwc->interrupts == MAX_INTERRUPTS) {
> @@ -4140,7 +4125,7 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
>                  }
>
>                  if (!event->attr.freq || !event->attr.sample_freq)
> -                        goto next;
> +                        continue;
>
>                  /*
>                   * stop the event and update event->count
> @@ -4162,8 +4147,39 @@ perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
>                          perf_adjust_period(event, period, delta, false);
>
>                  event->pmu->start(event, delta > 0 ? PERF_EF_RELOAD : 0);
> -        next:
> -                perf_pmu_enable(event->pmu);
> +        }
> +}
> +
> +/*
> + * combine freq adjustment with unthrottling to avoid two passes over the
> + * events. At the same time, make sure, having freq events does not change
> + * the rate of unthrottling as that would introduce bias.
> + */
> +static void
> +perf_adjust_freq_unthr_context(struct perf_event_context *ctx, bool unthrottle)
> +{
> +        struct perf_event_pmu_context *pmu_ctx;
> +
> +        /*
> +         * only need to iterate over all events iff:
> +         * - context have events in frequency mode (needs freq adjust)
> +         * - there are events to unthrottle on this cpu
> +         */
> +        if (!(ctx->nr_freq || unthrottle))
> +                return;
> +
> +        raw_spin_lock(&ctx->lock);
> +
> +        list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
> +                if (!(pmu_ctx->nr_freq || unthrottle))
> +                        continue;
> +                if (pmu_ctx->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT)
> +                        continue;
> +
> +                perf_pmu_disable(pmu_ctx->pmu);
> +                perf_adjust_freq_unthr_events(&pmu_ctx->pinned_active);
> +                perf_adjust_freq_unthr_events(&pmu_ctx->flexible_active);
> +                perf_pmu_enable(pmu_ctx->pmu);
>          }

I don't think this is correct for big.LITTLE/hybrid systems. Imagine a
system where CPUs 0-1 have pmu_a, CPUs 2-3 have pmu_b, and a task has
events for both pmu_a and pmu_b.

The perf_event_context for that task will have a perf_event_pmu_context
for each PMU in its pmu_ctx_list.

Say that task is run on CPU0, and perf_event_task_tick() is called. That
will call perf_adjust_freq_unthr_context(), and it will iterate over the
pmu_ctx_list. Note that regardless of pmu_ctx->nr_freq, if 'unthrottle'
is true, we'll go ahead and call the following for all of the pmu
contexts in the pmu_ctx_list:

        perf_pmu_disable(pmu_ctx->pmu);
        perf_adjust_freq_unthr_events(&pmu_ctx->pinned_active);
        perf_adjust_freq_unthr_events(&pmu_ctx->flexible_active);
        perf_pmu_enable(pmu_ctx->pmu);

... and that means we might call that for pmu_b, even though it's not
associated with CPU0. That could be fatal depending on what those
callbacks do.

The old logic avoided that possibility implicitly, since the events for
pmu_b couldn't be active, and so the check at the start of the loop
would skip all of pmu_b's events:

        if (event->state != PERF_EVENT_STATE_ACTIVE)
                continue;

We could do something similar by keeping track of how many active
events each perf_event_pmu_context has, which'd allow us to do
something like:

        if (pmu_ctx->nr_active == 0)
                continue;

How does that sound to you?

Mark.
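
To make that concrete, a rough, untested sketch of the nr_active
bookkeeping suggested above might look like the following; the exact
placement of the increments/decrements (next to the existing nr_freq
accounting in event_sched_in()/event_sched_out()) and the ordering of
the checks in the loop are assumptions, not a reviewed implementation:

        /* include/linux/perf_event.h: count active events per PMU context */
        struct perf_event_pmu_context {
                ...
                unsigned int                    nr_freq;
                unsigned int                    nr_active;      /* events on this epc's active lists */
                ...
        };

        /* kernel/events/core.c: event_sched_in(), alongside the existing accounting */
                epc->nr_active++;

        /* kernel/events/core.c: event_sched_out(), likewise */
                epc->nr_active--;

        /* perf_adjust_freq_unthr_context(): skip PMUs with nothing active on this CPU */
        list_for_each_entry(pmu_ctx, &ctx->pmu_ctx_list, pmu_ctx_entry) {
                if (!pmu_ctx->nr_active)
                        continue;
                if (!(pmu_ctx->nr_freq || unthrottle))
                        continue;
                if (pmu_ctx->pmu->capabilities & PERF_PMU_CAP_NO_INTERRUPT)
                        continue;

                perf_pmu_disable(pmu_ctx->pmu);
                perf_adjust_freq_unthr_events(&pmu_ctx->pinned_active);
                perf_adjust_freq_unthr_events(&pmu_ctx->flexible_active);
                perf_pmu_enable(pmu_ctx->pmu);
        }

Since pmu_b's events cannot be active while the task runs on CPU0, its
perf_event_pmu_context would have nr_active == 0 there, so its
callbacks would never be invoked from that CPU.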