Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751022AbdIEMGf (ORCPT ); Tue, 5 Sep 2017 08:06:35 -0400 Received: from mga06.intel.com ([134.134.136.31]:15940 "EHLO mga06.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750741AbdIEMGe (ORCPT ); Tue, 5 Sep 2017 08:06:34 -0400 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.41,480,1498546800"; d="scan'208";a="145617251" Subject: Re: [RFC][PATCH] perf: Rewrite enabled/running timekeeping From: Alexey Budankov To: Peter Zijlstra Cc: Ingo Molnar , Arnaldo Carvalho de Melo , Alexander Shishkin , Andi Kleen , Kan Liang , Dmitri Prokhorov , Valery Cherepennikov , Mark Rutland , Stephane Eranian , David Carrillo-Cisneros , linux-kernel , Vince Weaver , Thomas Gleixner References: <96c7776f-1f17-a39e-23e9-658596216d6b@linux.intel.com> <20170803150052.za2vofyqfgarukdr@hirez.programming.kicks-ass.net> <20170822204743.GR32112@worktop.programming.kicks-ass.net> <2a426aa2-42c8-e839-1cec-aa3971651f3e@linux.intel.com> <20170831171837.njnc6r6elsvkl7lt@hirez.programming.kicks-ass.net> <46f72a3f-f18b-0227-3d78-fb23c8a6e18e@linux.intel.com> <20170904120843.oazlv73phoxoinlj@hirez.programming.kicks-ass.net> <385005b6-51ea-383e-df81-43365f3f5152@linux.intel.com> <20170904154145.xl4fyg7vhgbnmhwi@hirez.programming.kicks-ass.net> <9c3c1fc7-5a2e-2745-dab1-390dc7d58d50@linux.intel.com> Organization: Intel Corp. Message-ID: Date: Tue, 5 Sep 2017 15:06:26 +0300 User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:52.0) Gecko/20100101 Thunderbird/52.3.0 MIME-Version: 1.0 In-Reply-To: <9c3c1fc7-5a2e-2745-dab1-390dc7d58d50@linux.intel.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 6901 Lines: 143 Hi, On 05.09.2017 13:17, Alexey Budankov wrote: > On 04.09.2017 18:41, Peter Zijlstra wrote: >> On Mon, Sep 04, 2017 at 05:56:06PM +0300, Alexey Budankov wrote: >>> On 04.09.2017 15:08, Peter Zijlstra wrote: >>>> On Mon, Sep 04, 2017 at 01:46:45PM +0300, Alexey Budankov wrote: >>>>>> So the below completely rewrites timekeeping (and probably breaks >>>>>> world) but does away with the need to touch events that don't get >>>>>> scheduled. >>>>> >>>>> We still need and do iterate thru all events at some points e.g. on context switches. >>>> >>>> Why do we _need_ to? >>> >>> We do so in the current implementation with several tstamp_* fields. >> >> Right, but we want to stop doing so asap :-) >> > > Well, I see you point :). It turns out that with straightforward timekeeping > we can also avoid whole tree iteration on context switches additionally to RB > tree based iterations and rotations on hrtimer interrupt. That brings even more > performance and rotation switch can be avoided. > > However we can't completely get rid of whole tree iterations because of > inheritance code on forks in perf_event_init_context() here: > > perf_event_groups_for_each(event, &parent_ctx->pinned_groups, group_node) { > ret = inherit_task_group(event, parent, parent_ctx, > child, ctxn, &inherited_all); > if (ret) > goto out_unlock; > } > > and here: > > perf_event_groups_for_each(event, &parent_ctx->flexible_groups, group_node) { > ret = inherit_task_group(event, parent, parent_ctx, > child, ctxn, &inherited_all); > if (ret) > goto out_unlock; > } > > Below is the patch set put on top of the timekeeping rework. > > It is for tip/master branch. > > --- > include/linux/perf_event.h | 40 +-- > kernel/events/core.c | 839 +++++++++++++++++++++++---------------------- > 2 files changed, 448 insertions(+), 431 deletions(-) > Got this under perf_fuzzer on Xeon Phi (KNL): [ 6614.226280] ------------[ cut here ]------------ [ 6614.226305] WARNING: CPU: 45 PID: 43385 at kernel/events/core.c:239 event_function+0xb3/0xe0 [ 6614.226310] Modules linked in: btrfs xor raid6_pq ufs hfsplus hfs minix vfat msdos fat jfs xfs reiserfs binfmt_misc xt_CHECKSUM iptable_mangle fuse ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables cmac arc4 md4 nls_utf8 nfsv3 cifs rpcsec_gss_krb5 nfsv4 nfs ccm dns_resolver fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl hfi1 sb_edac nfsd x86_pkg_temp_thermal intel_powerclamp rdmavt ipmi_ssif coretemp iTCO_wdt joydev ib_core crct10dif_pclmul iTCO_vendor_support crc32_pclmul ipmi_si ghash_clmulni_intel [ 6614.226444] auth_rpcgss ipmi_devintf mei_me tpm_tis intel_cstate mei tpm_tis_core intel_uncore pcspkr intel_rapl_perf tpm ipmi_msghandler shpchp nfs_acl lpc_ich lockd i2c_i801 wmi grace acpi_power_meter acpi_pad sunrpc mgag200 drm_kms_helper ttm drm igb crc32c_intel ptp pps_core dca i2c_algo_bit [ 6614.226501] CPU: 45 PID: 43385 Comm: perf_fuzzer Not tainted 4.13.0-v11.1.2+ #5 [ 6614.226506] Hardware name: Intel Corporation S7200AP/S7200AP, BIOS S72C610.86B.01.01.0190.080520162104 08/05/2016 [ 6614.226511] task: ffff8d0f5866c000 task.stack: ffffa6b05aae8000 [ 6614.226517] RIP: 0010:event_function+0xb3/0xe0 [ 6614.226522] RSP: 0018:ffffa6b05aaebc30 EFLAGS: 00010087 [ 6614.226528] RAX: 0000000000000000 RBX: ffffc6b03c545a90 RCX: ffffa6b05aaebd00 [ 6614.226532] RDX: 0000000000000000 RSI: ffffa6b05aaebca0 RDI: ffffc6b03c545a98 [ 6614.226536] RBP: ffffa6b05aaebc58 R08: 000000000001f8e0 R09: ffff8d0fa99fd120 [ 6614.226540] R10: ffff8d0fbb407900 R11: 0000000000000000 R12: ffffc6b03ba05a90 [ 6614.226544] R13: 0000000000000000 R14: ffffa6b05aaebd48 R15: ffff8d0f9c48a000 [ 6614.226550] FS: 00007f05cff85740(0000) GS:ffff8d0fbc540000(0000) knlGS:0000000000000000 [ 6614.226555] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6614.226559] CR2: 00000000007211c0 CR3: 0000002f269fa000 CR4: 00000000001407e0 [ 6614.226563] Call Trace: [ 6614.226577] remote_function+0x3b/0x50 [ 6614.226585] generic_exec_single+0x9a/0xd0 [ 6614.226592] smp_call_function_single+0xc8/0x100 [ 6614.226599] cpu_function_call+0x43/0x60 [ 6614.226606] ? cpu_clock_event_read+0x10/0x10 [ 6614.226612] perf_event_read+0xc7/0xe0 [ 6614.226619] ? perf_install_in_context+0xf0/0xf0 [ 6614.226625] __perf_read_group_add+0x25/0x180 [ 6614.226632] perf_read+0xcb/0x2b0 [ 6614.226640] __vfs_read+0x37/0x160 [ 6614.226648] ? security_file_permission+0x9d/0xc0 [ 6614.226655] vfs_read+0x8c/0x130 [ 6614.226661] SyS_read+0x55/0xc0 [ 6614.226670] do_syscall_64+0x67/0x180 [ 6614.226678] entry_SYSCALL64_slow_path+0x25/0x25 [ 6614.226684] RIP: 0033:0x7f05cfaad980 [ 6614.226688] RSP: 002b:00007fff1b562b48 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 [ 6614.226694] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f05cfaad980 [ 6614.226699] RDX: 0000000000004fcf RSI: 0000000000735680 RDI: 0000000000000003 [ 6614.226703] RBP: 00007fff1b562b60 R08: 00007f05cfd800f4 R09: 00007f05cfd80140 [ 6614.226707] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401980 [ 6614.226711] R13: 00007fff1b564f60 R14: 0000000000000000 R15: 0000000000000000 [ 6614.226716] Code: e2 48 89 de 4c 89 ff 41 ff 56 08 31 c0 4d 85 ed 74 05 41 c6 45 08 00 c6 43 08 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 49 39 dc 74 cf <0f> ff eb cb 0f ff 0f 1f 80 00 00 00 00 e9 78 ff ff ff 0f ff 66 [ 6614.226812] ---[ end trace ff12704813059a28 ]--- static int event_function(void *info) { struct event_function_struct *efs = info; struct perf_event *event = efs->event; struct perf_event_context *ctx = event->ctx; struct perf_cpu_context *cpuctx = __get_cpu_context(ctx); struct perf_event_context *task_ctx = cpuctx->task_ctx; int ret = 0; WARN_ON_ONCE(!irqs_disabled()); perf_ctx_lock(cpuctx, task_ctx); /* * Since we do the IPI call without holding ctx->lock things can have * changed, double check we hit the task we set out to hit. */ if (ctx->task) { if (ctx->task != current) { ret = -ESRCH; goto unlock; } /* * We only use event_function_call() on established contexts, * and event_function() is only ever called when active (or * rather, we'll have bailed in task_function_call() or the * above ctx->task != current test), therefore we must have * ctx->is_active here. */ WARN_ON_ONCE(!ctx->is_active); /* * And since we have ctx->is_active, cpuctx->task_ctx must * match. */ WARN_ON_ONCE(task_ctx != ctx); } else { ===> WARN_ON_ONCE(&cpuctx->ctx != ctx); } efs->func(event, cpuctx, ctx, efs->data); unlock: perf_ctx_unlock(cpuctx, task_ctx); return ret; }