Received: by 10.213.65.68 with SMTP id h4csp873820imn; Wed, 14 Mar 2018 02:46:51 -0700 (PDT) X-Google-Smtp-Source: AG47ELscc3CmbLveMJSwcgYUYg1+C5Y/kSl5NjP45emyi44RNaqM/qF0KJZVlSpgJOuX7OGiBq+a X-Received: by 10.101.75.82 with SMTP id k18mr208217pgt.335.1521020811639; Wed, 14 Mar 2018 02:46:51 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1521020811; cv=none; d=google.com; s=arc-20160816; b=VnIb63BOJ1n3tixfJMX+EC+PHjmh/1clVVFyAh+jngtlECc/osY1/vK5pkl2dfooKN iaKp/56eK0NSaGR12/zB7xhCumE3FpQ/HndHvs/ktwG/PAbsBbNaBxEpcf/SXStJd9YA h/kRoA7LzUXzXlpqirkKFa0C+1wgKf0fIM2o4MnCFQa9O44f84viNUL56ushEuT+i84H p1w0N+YR0A3+vWK8ofxTk/C76pdFFRqactIgNjRcz0khyTX3sPqaUdqPppBHUWPQZ85V GAvX1ZnULeYRBR0Q5egGyDCeWuN2IpFABowYVo6RUgvtnVjF6o+kCPWVZUFINqNj5j+o xOAw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature:arc-authentication-results; bh=lTPgQuKWJXReFRlJ3ddUllkfTcw9oEmYUOy4v3yaKcM=; b=sPfV+4x+kWHArVIpvfzub75dRKjxZIKVUFvOvF6R84HUdkrTetuIIfkQ8paKPzkIwR Y73sqw+rp5tDp4Y9RZgjfHevoF+arsQz626mYYQm6gLhz5PfrU+v6ffo0qSFuSeh1IPa p6JmYvoCzF/R+1BWIHXayl9jX/1/3ApE8qrNG4iBVkJ7klTsGwZh9K1F11eIREywz+2N XwjHzKuKGVxrMvgPZJFJz2YSr+efiDL8osRNJN2smLoWSu0SOgx8igrV0L8bXdIf94FX y1si3Z3sjTaToLaPpHvhckwn81RxnEy8nDg4HpF6tzAXRPHOm3MHbkybCzOkd4rOeDcO +40Q== ARC-Authentication-Results: i=1; mx.google.com; dkim=fail header.i=@infradead.org header.s=bombadil.20170209 header.b=fgLhnbFi; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id z22-v6si318318plo.28.2018.03.14.02.46.37; Wed, 14 Mar 2018 02:46:51 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=fail header.i=@infradead.org header.s=bombadil.20170209 header.b=fgLhnbFi; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751523AbeCNJpb (ORCPT + 99 others); Wed, 14 Mar 2018 05:45:31 -0400 Received: from bombadil.infradead.org ([198.137.202.133]:52922 "EHLO bombadil.infradead.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751366AbeCNJp3 (ORCPT ); Wed, 14 Mar 2018 05:45:29 -0400 DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20170209; h=In-Reply-To:Content-Type:MIME-Version :References:Message-ID:Subject:Cc:To:From:Date:Sender:Reply-To: Content-Transfer-Encoding:Content-ID:Content-Description:Resent-Date: Resent-From:Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:List-Id: List-Help:List-Unsubscribe:List-Subscribe:List-Post:List-Owner:List-Archive; bh=lTPgQuKWJXReFRlJ3ddUllkfTcw9oEmYUOy4v3yaKcM=; b=fgLhnbFiOARjjiXBpLBEQqNaW WYHGUDWUYAakJbvJ+iE01zLgrRHF6Z7Z5B5e32PjEOV0z8D0nBcIhECZXgIHvNCA6AZHOHJBGyYTk /pq5av6HdLA3w4vSXC0PxX4PSG1lsahHdCcSoYFGY3IQpsGT4aAAhp6lNFnCRoi5GqleyeP0L8bVv +67dWqfqHgN4ex6vvQcjuidKKYr3eH37SYWtZou470xi/d+eNJX2C0zOX2aaSzhkzr2s5NN4DcJ/R 8d2Y9F53U4/qvd3k+9ooyhqceYUhU8O8sFLdcfqPQ5F3Q6uf7Bt8fveDiobhYuI3q09Lm/tigct0O TWmcjNbLA==; Received: from j217100.upc-j.chello.nl ([24.132.217.100] helo=hirez.programming.kicks-ass.net) by bombadil.infradead.org with esmtpsa (Exim 4.90_1 #2 (Red Hat Linux)) id 1ew2yH-0005Ch-MY; Wed, 14 Mar 2018 09:45:22 +0000 Received: by hirez.programming.kicks-ass.net (Postfix, from userid 1000) id 56EE32029F86C; Wed, 14 Mar 2018 10:45:19 +0100 (CET) Date: Wed, 14 Mar 2018 10:45:19 +0100 From: Peter Zijlstra To: Jason Vas Dias Cc: x86@kernel.org, LKML , Thomas Gleixner , andi Subject: Re: [PATCH v4.16-rc4 2/2] x86/vdso: on Intel, VDSO should handle CLOCK_MONOTONIC_RAW Message-ID: <20180314094519.GD4082@hirez.programming.kicks-ass.net> References: <20180312082741.GD4064@hirez.programming.kicks-ass.net> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.9.3 (2018-01-21) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Mar 13, 2018 at 11:45:45PM +0000, Jason Vas Dias wrote: > On 12/03/2018, Peter Zijlstra wrote: > > On Mon, Mar 12, 2018 at 07:01:20AM +0000, Jason Vas Dias wrote: > >> Sometimes, particularly when correlating elapsed time to performance > >> counter values, > > > > So what actual problem are you tring to solve here? Perf can already > > give you sample time in various clocks, including MONOTONIC_RAW. > > > > > > Yes, I am sampling perf counters, You're not in fact sampling, you're just reading the counters. > including CPU_CYCLES , INSTRUCTIONS, > CPU_CLOCK, TASK_CLOCK, etc, in a Group FD I open with > perf_event_open() , for the current thread on the current CPU - > I am doing this for 4 threads , on Intel & ARM cpus. > > Reading performance counters does involve 2 ioctls and a read() , > which takes time that already far exceeds the time required to read > the TSC or CNTPCT in the VDSO . So you can avoid the whole ioctl(ENABLE), ioctl(DISABLE) nonsense and just let them run and do: read(group_fd, &buf_pre, size); /* your code section */ read(group_fd, &buf_post, size); /* compute buf_post - buf_pre */ Which is only 2 system calls, not 4. Also, a while back there was the proposal to extend the mmap() self-monitoring interface to groups, see: https://lkml.kernel.org/r/20170530172555.5ya3ilfw3sowokjz@hirez.programming.kicks-ass.net I never did get around to writing the actual code for it, but it shouldn't be too hard. > The CPU_CLOCK software counter should give the converted TSC cycles > seen between the ioctl( grp_fd, PERF_EVENT_IOC_ENABLE , ...) > and the ioctl( grp_fd, PERF_EVENT_IOC_DISABLE ), and the > difference between the event->time_running and time_enabled > should also measure elapsed time . While CPU_CLOCK is TSC based, there is no guarantee it has any correlation to CLOCK_MONOTONIC_RAW (even if that is also TSC based). (although, I think I might have fixed that recently and it might just work, but it's very much not guaranteed). If you want to correlate to CLOCK_MONOTONIC_RAW you have to read CLOCK_MONOTONIC_RAW and not some random other clock value. > This gives the "inner" elapsed time, from the perpective of the kernel, > while the measured code section had the counters enabled. > > But unless the user-space program also has a way of measuring elapsed > time from the CPU's perspective , ie. without being subject to > operator or NTP / PTP adjustment, it has no way of correlating this > inner elapsed time with any "outer" You could read the time using the group_fd's mmap() page. That actually includes the TSC mult,shift,offset as used by perf clocks. > Currently, users must parse the log file or use gdb / objdump to > inspect /proc/kcore to get the TSC calibration and exact > mult+shift values for the TSC value conversion. Which ;-) there's multiple floating around.. > Intel does not publish, nor does the CPU come with in ROM or firmware, > the actual precise TSC frequency - this must be calibrated against the > other clocks , according to a complicated procedure in section 18.2 of > the SDM . My TSC has a "rated" / nominal TSC frequency , which one > can compute from CPUID leaves, of 2.3ghz, but the "Refined TSC frequency" > is 2.8333ghz . You might want to look at commit: b51120309348 ("x86/tsc: Fix erroneous TSC rate on Skylake Xeon") There is no such thing as a precise TSC frequency, there's a reason we have NTP/PTP. > Hence I think Linux should export this calibrated frequency somehow ; > its "calibration" is expressed as the raw clocksource 'mult' and 'shift' > values, and is exported to the VDSO . > > I think the VDSO should read the TSC and use the calibration > to render the raw, unadjusted time from the CPU's perspective. > > Hence, the patch I am preparing , which is again attached. I have no objection to adding CLOCK_MONOTONIC_RAW support to the VDSO, but you seem to be rather confused on how things work. Now, if you wanted to actually have CLOCK_MONOTONIC_RAW times from perf you'd need something like the below patch. You'd need to create your events with: attr.use_clockid = 1; attr.clockid = CLOCK_MONOTONIC_RAW; attr.read_format |= PERF_FORMAT_TIME; But whatever you do, you really have to stop mixing clocks, that's broken, even if it magically works for now. --- include/uapi/linux/perf_event.h | 5 ++++- kernel/events/core.c | 23 ++++++++++++++++++++--- 2 files changed, 24 insertions(+), 4 deletions(-) diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h index 912b85b52344..e210c9a97f2b 100644 --- a/include/uapi/linux/perf_event.h +++ b/include/uapi/linux/perf_event.h @@ -271,9 +271,11 @@ enum { * { u64 time_enabled; } && PERF_FORMAT_TOTAL_TIME_ENABLED * { u64 time_running; } && PERF_FORMAT_TOTAL_TIME_RUNNING * { u64 id; } && PERF_FORMAT_ID + * { u64 time; } && PERF_FORMAT_TIME * } && !PERF_FORMAT_GROUP * * { u64 nr; + * { u64 time; } && PERF_FORMAT_TIME * { u64 time_enabled; } && PERF_FORMAT_TOTAL_TIME_ENABLED * { u64 time_running; } && PERF_FORMAT_TOTAL_TIME_RUNNING * { u64 value; @@ -287,8 +289,9 @@ enum perf_event_read_format { PERF_FORMAT_TOTAL_TIME_RUNNING = 1U << 1, PERF_FORMAT_ID = 1U << 2, PERF_FORMAT_GROUP = 1U << 3, + PERF_FORMAT_TIME = 1U << 4, - PERF_FORMAT_MAX = 1U << 4, /* non-ABI */ + PERF_FORMAT_MAX = 1U << 5, /* non-ABI */ }; #define PERF_ATTR_SIZE_VER0 64 /* sizeof first published struct */ diff --git a/kernel/events/core.c b/kernel/events/core.c index c87decf03757..4298b4a39bc0 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -1707,6 +1707,9 @@ static void __perf_event_read_size(struct perf_event *event, int nr_siblings) size += sizeof(u64); } + if (event->attr.read_format & PERF_FORMAT_TIME) + size += sizeof(u64); + size += entry * nr; event->read_size = size; } @@ -4685,6 +4688,9 @@ static int __perf_read_group_add(struct perf_event *leader, int n = 1; /* skip @nr */ int ret; + if (read_format & PERF_FORMAT_TIME) + n++; /* skip @time */ + ret = perf_event_read(leader, true); if (ret) return ret; @@ -4739,6 +4745,9 @@ static int perf_read_group(struct perf_event *event, values[0] = 1 + leader->nr_siblings; + if (read_format & PERF_FORMAT_TIME) + values[1] = perf_event_clock(event); + /* * By locking the child_mutex of the leader we effectively * lock the child list of all siblings.. XXX explain how. @@ -4773,7 +4782,7 @@ static int perf_read_one(struct perf_event *event, u64 read_format, char __user *buf) { u64 enabled, running; - u64 values[4]; + u64 values[5]; int n = 0; values[n++] = __perf_event_read_value(event, &enabled, &running); @@ -4783,6 +4792,8 @@ static int perf_read_one(struct perf_event *event, values[n++] = running; if (read_format & PERF_FORMAT_ID) values[n++] = primary_event_id(event); + if (read_format & PERF_FORMAT_TIME) + values[n++] = perf_event_clock(event) if (copy_to_user(buf, values, n * sizeof(u64))) return -EFAULT; @@ -6034,7 +6045,7 @@ static void perf_output_read_one(struct perf_output_handle *handle, u64 enabled, u64 running) { u64 read_format = event->attr.read_format; - u64 values[4]; + u64 values[5]; int n = 0; values[n++] = perf_event_count(event); @@ -6049,6 +6060,9 @@ static void perf_output_read_one(struct perf_output_handle *handle, if (read_format & PERF_FORMAT_ID) values[n++] = primary_event_id(event); + if (read_format & PERF_FORMAT_TIME) + values[n++] = perf_event_clock(event); + __output_copy(handle, values, n * sizeof(u64)); } @@ -6058,11 +6072,14 @@ static void perf_output_read_group(struct perf_output_handle *handle, { struct perf_event *leader = event->group_leader, *sub; u64 read_format = event->attr.read_format; - u64 values[5]; + u64 values[6]; int n = 0; values[n++] = 1 + leader->nr_siblings; + if (read_format & PERF_FORMAT_TIME) + values[n++] = perf_event_clock(event); + if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) values[n++] = enabled;