Date: Tue, 16 Oct 2018 11:32:53 +0200
From: Peter Zijlstra
To: Stephane Eranian
Cc: Alexey Budankov, Ingo Molnar, LKML, Arnaldo Carvalho de Melo,
    Alexander Shishkin, Jiri Olsa, songliubraving@fb.com, Thomas Gleixner,
    Mark Rutland, megha.dey@intel.com, frederic@kernel.org
Subject: Re: [RFC][PATCH] perf: Rewrite core context handling
Message-ID: <20181016093253.GD4030@hirez.programming.kicks-ass.net>
References: <20181010104559.GO5728@hirez.programming.kicks-ass.net>
 <3a738a08-2295-a4e9-dce7-a3e2b2ad794e@linux.intel.com>
 <20181015083448.GN9867@hirez.programming.kicks-ass.net>

On Mon, Oct 15, 2018 at 11:31:24AM -0700, Stephane Eranian wrote:

> I have always had a hard time understanding the role of all these
> structs in the generic code. This is still very confusing and very
> hard to follow.
>
> In my mind, you have per-task and per-cpu perf_events contexts. And
> for each you can have multiple PMUs, some hw some sw. Each PMU has
> its own list of events maintained in an RB tree. There are never any
> interactions between PMUs.

That is more or less how it was. We have per-PMU task or CPU contexts:

  task_struct::perf_events_ctxp[] <-> perf_event_context <-> perf_cpu_context
       ^                                 |    ^     |     ^
       `---------------------------------'    |     `--> pmu <--'
                                              v           ^
                                         perf_event ------'

Each task has an array of pointers to a perf_event_context. Each
perf_event_context has a direct relation to a PMU and a group of events
for that PMU. The task-related perf_event_contexts have a pointer back
to that task.
Each PMU has a per-cpu pointer to a per-cpu perf_cpu_context, which
includes a perf_event_context, which again has a direct relation to that
PMU, and a group of events for that PMU.

The perf_cpu_context also tracks which task context is currently
associated with that CPU and includes a few other things like the
hrtimer for rotation etc.

Each perf_event is then associated with its PMU and one
perf_event_context.

> Maybe this is how this is done or proposed by your patches, but it
> certainly is not obvious.

No; my patch rather completely wrecks the above, and reduces it to a
single task context and a single CPU context.

There were a number of problems with the above. One is the task array
of pointers, which limited the number of task contexts we could have.

Now, we could've easily changed that to a list and called it a day; that
is not in fact a horribly difficult patch. If you combine that with a
patch that actually freed task contexts when they go empty, it might
actually work.

But there were a number of other considerations that resulted in the
patch as presented:

 - There is a bunch of per-context state that is simply duplicated
   between contexts, like for instance the timekeeping. There is no
   point in tracking the time for 'n' per-task/cpu contexts when in
   fact they're all the same.

 - On context switch we have to iterate all these 'n' contexts and
   switch them one by one, instead of just switching one context and
   calling it a day.

 - For big.LITTLE we'd end up with 2 per-task contexts and only ever
   use 1 at any one time, which increases 'n' in the above cases for no
   purpose.

 - The actual per-pmu-per-context state is very small (as I think
   Alexey already implied).

 - A single context simplifies a bunch of things, including the
   move_group case (we no longer have to adjust perf_event::ctx), the
   cpu-online tests and the ctx locking, and it removes a bunch of
   context lists (like active_ctx_list).

So a single context is what I went with.
That all results in:

  task_struct::perf_event_ctxp -> perf_event_context <- perf_cpu_context
       ^                             |   ^ ^
       `-----------------------------'   | |
                                         | `--> perf_event_pmu_context
                                         |        ^  ^
                                         |        |  |
                                         |  ,-----'  v
                                         |  |  perf_cpu_pmu_context
                                         |  |        ^
                                         |  |        |
                                         v  v        v
                                       perf_event ---> pmu

Because while the per-pmu-per-context state is small, it does exist;
this gives rise to perf_event_pmu_context. It tracks nr_events and
nr_active, which are used to (quickly) tell if rotation is required (it
is possible to reduce this state I think, but I've not yet gotten it
down to 0). It also tracks which events are actually active, since
iterating a list is cheaper than finding them all in the RB-tree. It
also contains the task_ctx_data thing for LBR, which is a PMU-specific
extra data thingy.

We then also keep a list of (active) perf_event_pmu_context in
perf_event_context, such that we can quickly find which PMUs are in
fact involved with the context. This simplifies context scheduling a
little.

We then also need per-pmu-per-cpu state, which gives rise to
perf_cpu_pmu_context. That mostly includes bits to drive the event
rotation, which per the ABI is per PMU, but it also includes bits to do
perf_event_attr::exclusive scheduling, which is also naturally
per-pmu-per-cpu.

And yes, the above looks more complicated, but at the same time a bunch
of things did get simplified. Maybe once the dust settles someone can
turn this here email into a sensible comment or something ;-)

> Also the Intel LBR is not a PMU on its own. Maybe you are talking
> about the BTS in arch/x86/events/intel/bts.c.

This thing:

  https://lkml.kernel.org/r/1510970046-25387-1-git-send-email-megha.dey@linux.intel.com