Date: Thu, 24 Nov 2016 16:19:09 +0000
From: Mark Rutland
To: kan.liang@intel.com
Cc: peterz@infradead.org, mingo@redhat.com, acme@kernel.org,
	linux-kernel@vger.kernel.org, alexander.shishkin@linux.intel.com,
	tglx@linutronix.de, namhyung@kernel.org, jolsa@kernel.org,
	adrian.hunter@intel.com, wangnan0@huawei.com, andi@firstfloor.org
Subject: Re: [PATCH 02/14] perf/x86: output NMI overhead
Message-ID: <20161124161712.GA2444@remoulade>
References: <1479894292-16277-1-git-send-email-kan.liang@intel.com>
	<1479894292-16277-3-git-send-email-kan.liang@intel.com>
In-Reply-To: <1479894292-16277-3-git-send-email-kan.liang@intel.com>

On Wed, Nov 23, 2016 at 04:44:40AM -0500, kan.liang@intel.com wrote:
> From: Kan Liang
> 
> NMI handler is one of the most important part which brings overhead.
> 
> There are lots of NMI during sampling. It's very expensive to log each
> NMI. So the accumulated time and NMI# will be output when event is going
> to be disabled or task is scheduling out.
> The newly introduced flag PERF_EF_LOG indicate to output the overhead
> log.
> 
> Signed-off-by: Kan Liang
> ---
>  arch/x86/events/core.c          | 19 ++++++++++++++-
>  arch/x86/events/perf_event.h    |  2 ++
>  include/linux/perf_event.h      |  1 +
>  include/uapi/linux/perf_event.h |  2 ++
>  kernel/events/core.c            | 54 ++++++++++++++++++++++-------------------
>  5 files changed, 52 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index d31735f..6c3b0ef 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -1397,6 +1397,11 @@ static void x86_pmu_del(struct perf_event *event, int flags)
> 
>  	perf_event_update_userpage(event);
> 
> +	if ((flags & PERF_EF_LOG) && cpuc->nmi_overhead.nr) {
> +		cpuc->nmi_overhead.cpu = smp_processor_id();
> +		perf_log_overhead(event, PERF_NMI_OVERHEAD, &cpuc->nmi_overhead);
> +	}
> +
>  do_del:
>  	if (x86_pmu.del) {
>  		/*

> @@ -1475,11 +1480,21 @@ void perf_events_lapic_init(void)
>  	apic_write(APIC_LVTPC, APIC_DM_NMI);
>  }
> 
> +static void
> +perf_caculate_nmi_overhead(u64 time)

s/caculate/calculate/ - this tripped me up when grepping.

> @@ -1492,8 +1507,10 @@ perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
>  	start_clock = sched_clock();
>  	ret = x86_pmu.handle_irq(regs);
>  	finish_clock = sched_clock();
> +	clock = finish_clock - start_clock;
> 
> -	perf_sample_event_took(finish_clock - start_clock);
> +	perf_caculate_nmi_overhead(clock);
> +	perf_sample_event_took(clock);

Ah, so it's the *sampling* overhead, not the NMI overhead. This doesn't
take into account the cost of entering/exiting the handler, which could
be larger than the sampling overhead (e.g. if the PMU is connected
through chained interrupt controllers).

> enum perf_record_overhead_type {
> +	PERF_NMI_OVERHEAD = 0,

As above, it may be worth calling this PERF_SAMPLE_OVERHEAD; this
doesn't count the entire cost of the NMI, and other architectures may
want to implement this, yet don't have NMI.

[...]
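To flesh out the naming suggestion above, I had something roughly like
the below in mind (untested, and the comment is only my reading of what
is actually being accumulated):

enum perf_record_overhead_type {
	/*
	 * Time spent in the sampling path of the PMU interrupt handler
	 * (an NMI on x86), not the full cost of taking the interrupt.
	 */
	PERF_SAMPLE_OVERHEAD	= 0,

	/* further overhead types would follow here */
};

That keeps the record format free of x86-specific naming, so e.g. arm64
could report the same figure from a regular IRQ handler.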
>  static void
>  event_sched_out(struct perf_event *event,
>  		struct perf_cpu_context *cpuctx,
> -		struct perf_event_context *ctx)
> +		struct perf_event_context *ctx,
> +		bool log_overhead)

Boolean parameters are always confusing. Why not pass the flags
directly? That way we can pass *which* overhead to log, and make the
callsites easier to understand (rough sketch at the end of this mail).

>  	event->tstamp_stopped = tstamp;
> -	event->pmu->del(event, 0);
> +	event->pmu->del(event, log_overhead ? PERF_EF_LOG : 0);

... which we could pass on here.

> @@ -1835,20 +1835,21 @@ event_sched_out(struct perf_event *event,
>  static void
>  group_sched_out(struct perf_event *group_event,
>  		struct perf_cpu_context *cpuctx,
> -		struct perf_event_context *ctx)
> +		struct perf_event_context *ctx,
> +		bool log_overhead)

Likewise.

> @@ -1872,7 +1873,7 @@ __perf_remove_from_context(struct perf_event *event,
>  {
>  	unsigned long flags = (unsigned long)info;
> 
> -	event_sched_out(event, cpuctx, ctx);
> +	event_sched_out(event, cpuctx, ctx, false);
>  	if (flags & DETACH_GROUP)
>  		perf_group_detach(event);
>  	list_del_event(event, ctx);

> @@ -1918,9 +1919,9 @@ static void __perf_event_disable(struct perf_event *event,
>  	update_cgrp_time_from_event(event);
>  	update_group_times(event);
>  	if (event == event->group_leader)
> -		group_sched_out(event, cpuctx, ctx);
> +		group_sched_out(event, cpuctx, ctx, true);
>  	else
> -		event_sched_out(event, cpuctx, ctx);
> +		event_sched_out(event, cpuctx, ctx, true);

Why does this differ from __perf_remove_from_context()? What's the
policy for when we do or do not measure overhead?

Thanks,
Mark.
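
P.S. By "pass the flags directly" above, I mean something roughly like
the below (untested; the "int flags" parameter and the masking with
PERF_EF_LOG are just one possible scheme, not a demand for this exact
shape):

static void
event_sched_out(struct perf_event *event,
		struct perf_cpu_context *cpuctx,
		struct perf_event_context *ctx,
		int flags)
{
	/* ... existing bookkeeping unchanged ... */

	event->tstamp_stopped = tstamp;
	/* Forward only the logging flag; the caller says whether to log. */
	event->pmu->del(event, flags & PERF_EF_LOG);

	/* ... */
}

Callers would then pass PERF_EF_LOG (or 0) explicitly, which also makes
the difference between __perf_event_disable() and
__perf_remove_from_context() visible at the callsites.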