From: David Carrillo-Cisneros
Date: Tue, 27 Dec 2016 13:33:46 -0800
Subject: Re: [PATCH 01/14] x86/cqm: Intel Resource Monitoring Documentation
To: Andi Kleen
Cc: Shivappa Vikas, Peter Zijlstra, Vikas Shivappa, linux-kernel, x86,
 Thomas Gleixner, "Shankar, Ravi V", "Luck, Tony", Fenghua Yu,
 Stephane Eranian, hpa@zytor.com

On Tue, Dec 27, 2016 at 12:00 PM, Andi Kleen wrote:
> Shivappa Vikas writes:
>>
>> Ok, looks like the interface is the problem. Will try to fix
>> this. We are just trying to have a lightweight monitoring
>> option so that it's reasonable to monitor for a very long time
>> (like the lifetime of a process). Mainly to not have all
>> the perf scheduling overhead.
>
> That seems like an odd reason to define a completely new user interface.
> Is this to avoid one MSR write for an RMID change per context switch
> in/out of a cgroup, or is it other code too?
>
> Is there some number you can put to the overhead?

I obtained some timings by manually instrumenting the kernel on a Haswell EP.
When using one intel_cmt/llc_occupancy/ cgroup perf_event on one CPU, the
average time to do __perf_event_task_sched_out + __perf_event_task_sched_in
is ~1170 ns; most of that time is spent in the cgroup context switch
(~1120 ns).

When using continuous monitoring in the CQM driver, the average time to find
the RMID to write inside the PQR context switch is ~16 ns.

Note that this excludes the MSR write. It's only the overhead of finding the
RMID to write into PQR_ASSOC. Both paths call the same routine to find the
RMID, so there are about 1100 ns of overhead in perf_cgroup_switch. By
inspection I assume most of it comes from iterating over the pmu list.

> Or is there some other overhead other than the MSR write
> you're concerned about?

No, that problem is solved with the PQR software cache introduced in the
series.

> Perhaps some optimization could be done in the code to make it faster,
> then the new interface wouldn't be needed.

There are some. One on my list is to create a list of pmus with at least one
cgroup event and iterate over that in perf_cgroup_switch, instead of the
"pmus" list. The pmus list has grown a lot recently with the addition of all
the uncore pmus.

Despite this optimization, it's unlikely that the whole sched_out + sched_in
path gets close to the ~16 ns of the non-perf_event approach.

Please note that context switch time matters more for llc_occupancy events
than for other events because, to obtain reliable measurements, the RMID
switch must be active _all_ the time, not only while the event is read.

> FWIW there are some pending changes to context switch that will
> eliminate at least one common MSR write [1]. If that was fixed
> you could do the RMID MSR write "for free"

That may remove the need for the PQR software cache in this series, but it
won't speed up the context switch.

Thanks,
David