References: <20230703105745.1314475-1-tero.kristo@linux.intel.com> <20230703105745.1314475-2-tero.kristo@linux.intel.com> <64a64e46b7d5b_b20ce208de@john.notmuch> <4b874e4c-4ad3-590d-3885-b4a3b894524e@linux.intel.com>
In-Reply-To: <4b874e4c-4ad3-590d-3885-b4a3b894524e@linux.intel.com>
From: Alexei Starovoitov
Date: Thu, 6 Jul 2023 12:51:16 -0700
Subject: Re: [PATCH 1/2] x86/tsc: Add new BPF helper call bpf_rdtsc
To: Tero Kristo
Cc: John Fastabend, Shuah Khan, Thomas Gleixner, X86 ML, Borislav Petkov, Dave Hansen, Ingo Molnar, Alexei Starovoitov, "open list:KERNEL SELFTEST FRAMEWORK", LKML, Andrii Nakryiko, Daniel Borkmann, bpf
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jul 6, 2023 at 4:59 AM Tero Kristo wrote:
>
>
> On 06/07/2023 08:16, John Fastabend wrote:
> > Alexei Starovoitov wrote:
> >> On Mon, Jul 3, 2023 at 3:58 AM Tero Kristo wrote:
> >>> Currently the raw TSC counter can be read within kernel via rdtsc_ordered()
> >>> and friends, and additionally even userspace has access to it via the
> >>> RDTSC assembly instruction. BPF programs on the other hand don't have
> >>> direct access to the TSC counter, but alternatively must go through the
> >>> performance subsystem (bpf_perf_event_read), which only provides relative
> >>> value compared to the start point of the program, and is also much slower
> >>> than the direct read. Add a new BPF helper definition for bpf_rdtsc() which
> >>> can be used for any accurate profiling needs.
> >>>
> >>> A use-case for the new API is for example wakeup latency tracing via
> >>> eBPF on Intel architecture, where it is extremely beneficial to be able
> >>> to get raw TSC timestamps and compare these directly to the value
> >>> programmed to the MSR_IA32_TSC_DEADLINE register. This way a direct
> >>> latency value from the hardware interrupt to the execution of the
> >>> interrupt handler can be calculated. Having the functionality within
> >>> eBPF also has added benefits of allowing to filter any other relevant
> >>> data like C-state residency values, and also to drop any irrelevant
> >>> data points directly in the kernel context, without passing all the
> >>> data to userspace for post-processing.
> >>>
> >>> Signed-off-by: Tero Kristo
> >>> ---
> >>>  arch/x86/include/asm/msr.h |  1 +
> >>>  arch/x86/kernel/tsc.c      | 23 +++++++++++++++++++++++
> >>>  2 files changed, 24 insertions(+)
> >>>
> >>> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> >>> index 65ec1965cd28..3dde673cb563 100644
> >>> --- a/arch/x86/include/asm/msr.h
> >>> +++ b/arch/x86/include/asm/msr.h
> >>> @@ -309,6 +309,7 @@ struct msr *msrs_alloc(void);
> >>>  void msrs_free(struct msr *msrs);
> >>>  int msr_set_bit(u32 msr, u8 bit);
> >>>  int msr_clear_bit(u32 msr, u8 bit);
> >>> +u64 bpf_rdtsc(void);
> >>>
> >>>  #ifdef CONFIG_SMP
> >>>  int rdmsr_on_cpu(unsigned int cpu, u32 msr_no, u32 *l, u32 *h);
> >>> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> >>> index 344698852146..ded857abef81 100644
> >>> --- a/arch/x86/kernel/tsc.c
> >>> +++ b/arch/x86/kernel/tsc.c
> >>> @@ -15,6 +15,8 @@
> >>>  #include
> >>>  #include
> >>>  #include
> >>> +#include
> >>> +#include
> >>>
> >>>  #include
> >>>  #include
> >>> @@ -29,6 +31,7 @@
> >>>  #include
> >>>  #include
> >>>  #include
> >>> +#include
> >>>
> >>>  unsigned int __read_mostly cpu_khz;  /* TSC clocks / usec, not used here */
> >>>  EXPORT_SYMBOL(cpu_khz);
> >>> @@ -1551,6 +1554,24 @@ void __init tsc_early_init(void)
> >>>         tsc_enable_sched_clock();
> >>>  }
> >>>
> >>> +u64 bpf_rdtsc(void)
> >>> +{
> >>> +       /* Check if Time Stamp is enabled only in ring 0 */
> >>> +       if (cr4_read_shadow() & X86_CR4_TSD)
> >>> +               return 0;
> >> Why check this? It's always enabled in the kernel, no?
>
> It is always enabled, but there are certain syscalls that can be used to
> disable the TSC access for oneself. prctl(PR_SET_TSC, ...) and
> seccomp(SET_MODE_STRICT,...). Not having the check in place would in
> theory allow a restricted BPF program to circumvent this (if there ever
> was such a thing.) But yes, I do agree this part is a bit debatable
> whether it should be there at all.

What do you mean 'circumvent'?
It's a tracing bpf prog running in the kernel loaded by root
and reading tsc for the purpose of the kernel.
There is no unprivileged access to tsc here.

>
> >>> +
> >>> +       return rdtsc_ordered();
> >> Why _ordered? Why not just rdtsc ?
> >> Especially since you want to trace latency. Extra lfence will ruin
> >> the measurements.
> >>
> > If we used it as a fast way to order events on multiple CPUs I
> > guess we need the lfence? We use ktime_get_ns() now for things
> > like this when we just need an order counter. We have also
> > observed time going backwards with this and have heuristics
> > to correct it but its rare.
>
> Yeah, I think it is better to induce some extra latency instead of
> having some weird ordering issues with the timestamps.

lfence is not 'some extra latency'.
I suspect rdtsc_ordered() will be slower than bpf_ktime_get_ns().
What's the point of using it then?

>
> Also, things like the ftrace also use rdtsc_ordered() as its underlying
> clock, if you use x86-tsc as the trace clock (see
> arch/x86/kernel/trace_clock.c.)
>
> -Tero
>
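
For reference, a minimal userspace sketch of the cost question raised above
(plain RDTSC versus an LFENCE-serialized read). It is an illustration only
and not code from the patch. Assumptions: tsc_plain(), tsc_fenced(),
ticks_elapsed() and avg_cost() are hypothetical names; the fenced reader only
approximates what the kernel's rdtsc_ordered() does, since the kernel may
pick a different instruction sequence depending on CPU features; on builds
that are not x86-64 both readers fall back to clock_gettime(CLOCK_MONOTONIC)
so the file still compiles.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__)
#include <x86intrin.h>  /* __rdtsc(), _mm_lfence() (GCC/Clang) */
#define HAVE_TSC 1
#else
#include <time.h>       /* fallback clock, an assumption for portability */
#define HAVE_TSC 0
#endif

/* Plain read: nothing stops the CPU from executing RDTSC early. */
static inline uint64_t tsc_plain(void)
{
#if HAVE_TSC
	return __rdtsc();
#else
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Fenced read: LFENCE waits for earlier instructions before RDTSC runs. */
static inline uint64_t tsc_fenced(void)
{
#if HAVE_TSC
	_mm_lfence();
	return __rdtsc();
#else
	return tsc_plain();
#endif
}

/* Elapsed ticks; unsigned subtraction stays correct across a wrap. */
static inline uint64_t ticks_elapsed(uint64_t start, uint64_t end)
{
	return end - start;
}

typedef uint64_t (*tsc_read_fn)(void);

/* Average ticks per call of fn over n back-to-back calls (0 if n == 0). */
static inline uint64_t avg_cost(tsc_read_fn fn, unsigned int n)
{
	assert(fn);
	if (n == 0)
		return 0;

	uint64_t start = tsc_fenced();

	for (unsigned int i = 0; i < n; i++)
		(void)fn();

	return ticks_elapsed(start, tsc_fenced()) / n;
}
```

Calling avg_cost(tsc_plain, 1000000) and avg_cost(tsc_fenced, 1000000) from
a small main() and comparing the two averages gives a rough per-call figure.
Results depend on the CPU, frequency scaling and virtualization, so no
numbers are claimed here.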