Received: by 2002:a05:6358:3188:b0:123:57c1:9b43 with SMTP id q8csp7652472rwd; Tue, 6 Jun 2023 14:13:30 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ470kFR29WJEKm49ek9lFCIPM9s2OfHpEKlMcsQ9gwWgagvcOTUit0wuT1WGSoRbQFTnkbM X-Received: by 2002:a17:903:228c:b0:1af:e295:ad56 with SMTP id b12-20020a170903228c00b001afe295ad56mr3955308plh.49.1686086010079; Tue, 06 Jun 2023 14:13:30 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1686086010; cv=none; d=google.com; s=arc-20160816; b=FqA9zV6hysqHIl1LoRuQIqMbVr74CjKZEN5S1SOUg2p1JDWAwCSrBa+DcODDwWsZ8o ulzWKPryDoVtqGSwdMIpRAqIHVqvhYVfxHcqJcb8VzurCkw1XJsmpkDlsGHWO+7sh3SI m3zcsrwZrEd6LNorCO8Bh3dnZEFddQSHaaSHsp9bKURw7FkBWi8rezZk9PWl1mdGo7Hu xH56vw20P1ii/9xtAtSs0/LlxbrfLwzb0EY73TALQrunsWLS38JpPd0M3C6TOWkgkYRP eyB1caW+6eyhH7xhb1bP7Q2Yd5U+BUkMx/BrIfDr2p9wX0hx97QouiPks/iinKSrw5+q gueg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-transfer-encoding :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature:dkim-filter; bh=0ETLWs8Ceu9GxJZiRavPxUta0//U9m1gAOQEZ5ERUwU=; b=mwwW9/udgCXqyhXOF+IcNvPgWRio3zZvgQFvWnMFTe1R8yM/1fUxyWoyU4HdD8iCGU RgdkwZXFInFf2Nu5P1WDjc78IFDNXhNaXnNGqehbbF97BpkjLtCnQY3Zh9jXsCN+crpO Z78qyuT68EgRg1OYE2NhVld3xegng+7nd5PiI0KSwQx60V1ma0VUvdIVInU/jLqwcw/g vW3kJiivswk1P/h80fS01JLBVdAAa+1n+STwNBl9/JQ/xs6oU1fPL3njLjpCm3z/jG0x V8YPBJ5NdQVW3856z9U44NDRqS6Id2BOqhgmLXF7HXpSfJTREi8kubZ22h7oW7ZtlHVc I9UA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linux.microsoft.com header.s=default header.b=L8o1R6hO; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.microsoft.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id 18-20020a170902c11200b001b06fa86b08si7619144pli.412.2023.06.06.14.13.14; Tue, 06 Jun 2023 14:13:30 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@linux.microsoft.com header.s=default header.b=L8o1R6hO; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linux.microsoft.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S238455AbjFFU5l (ORCPT + 99 others); Tue, 6 Jun 2023 16:57:41 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:58740 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231706AbjFFU5j (ORCPT ); Tue, 6 Jun 2023 16:57:39 -0400 Received: from linux.microsoft.com (linux.microsoft.com [13.77.154.182]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id D3EBD1707; Tue, 6 Jun 2023 13:57:37 -0700 (PDT) Received: from W11-BEAU-MD.localdomain (unknown [76.135.27.212]) by linux.microsoft.com (Postfix) with ESMTPSA id E580820BE492; Tue, 6 Jun 2023 13:57:36 -0700 (PDT) DKIM-Filter: OpenDKIM Filter v2.11.0 linux.microsoft.com E580820BE492 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linux.microsoft.com; s=default; t=1686085057; bh=0ETLWs8Ceu9GxJZiRavPxUta0//U9m1gAOQEZ5ERUwU=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=L8o1R6hOw20Q2OWIIrmyxakKURspNvlCCiRArHQP/p4Lbia0vbpaHE0XIODGn0J7a qrJJbcbRtKDznCwU5ka91xEZMOmRPqe6/rIaCfxPTn9CT/zkyXE2iA6Ltz7AWuqx+a MUkr3uiaEvsb+g67+0KFrI9OqGW8WzTVYC6H97fk= Date: Tue, 6 Jun 2023 13:57:30 -0700 From: Beau Belgrave To: Andrii Nakryiko Cc: Masami Hiramatsu , Alexei Starovoitov , Steven Rostedt , LKML , linux-trace-kernel@vger.kernel.org, Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , bpf , David Vernet , Linus Torvalds , dthaler@microsoft.com, brauner@kernel.org, hch@infradead.orgl, Mathieu Desnoyers Subject: Re: [PATCH] tracing/user_events: Run BPF program if attached Message-ID: <20230606205730.GA163@W11-BEAU-MD.localdomain> References: <20230508163751.841-1-beaub@linux.microsoft.com> <20230509130111.62d587f1@rorschach.local.home> <20230509163050.127d5123@rorschach.local.home> <20230515165707.hv65ekwp2djkjj5i@MacBook-Pro-8.local> <20230515192407.GA85@W11-BEAU-MD.localdomain> <20230517003628.aqqlvmzffj7fzzoj@MacBook-Pro-8.local> <20230606225741.a9d8003a22451db96545b5a8@kernel.org> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Spam-Status: No, score=-19.8 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_MED, SPF_HELO_PASS,SPF_PASS,T_SCC_BODY_TEXT_LINE,USER_IN_DEF_DKIM_WL, USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jun 06, 2023 at 09:57:14AM -0700, Andrii Nakryiko wrote: > On Tue, Jun 6, 2023 at 6:57 AM Masami Hiramatsu wrote: > > > > Hi, > > > > On Tue, 16 May 2023 17:36:28 -0700 > > Alexei Starovoitov wrote: > > > > > BPF progs have three ways to access kernel tracepoints: > > > 1. traditional tracepoint > > > > This is the trace_events, which is used by ftrace, right? > > > > > 2. raw tracepoint > > > 3. raw tracepoint with BTF > > > > > > 1 was added first and now rarely used (only by old tools), since it's slow. > > > 2 was added later to address performance concerns. > > > 3 was added after BTF was introduced to provide accurate types. > > > > > > 3 is the only one that bpf community recommends and is the one that is used most often. > > > > > > As far as I know trace_events were never connected to bpf. > > > Unless somebody sneaked the code in without us seeing it. > > > > With this design, I understand that you may not want to connect BPF > > directly to user_events. It needs a different model. > > > > > > > > I think you're trying to model user_events+bpf as 1. > > > Which means that you'll be repeating the same mistakes. > > > > The user_events is completely different from the traceppoint and > > must have no BTF with it. > > Also, all information must be sent in the user-written data packet. > > (No data structure, event if there is a structure, it must be fully > > contained in the packet.) > > > > For the tracepoint, there is a function call with some values or > > pointers of data structure. So it is meaningful to skip using the > > traceevent (which converts all pointers to actual field values of > > the data structure and store it to ftrace buffer) because most of > > the values can be ignored in the BPF prog. > > > > However, for the user_events, the data is just passed from the > > user as a data packet, and BPF prog can access to the data packet > > (to avoid accessing malicious data, data validator can not be > > skipped). So this seems like 1. but actually you can access to > > the validated data on perf buffer. Maybe we can allow BPF to > > hook the write syscall and access user-space data, but it may > > not safe. I think this is the safest way to do that. > > I'm trying to understand why we need a new kernel concept for all > this. It looks like we are just creating a poor man's > publisher/subscriber solution in the kernel, but mostly intend to use > it from user-space? Why not just use Unix domain sockets for this, > though? Use SOCK_SEQPACKET, put "event data" into a single packet > that's guaranteed to not be broken up. Expose this to other processes > through named pipes, if necessary. > > Sorry if it's naive questions, but it's not clear what problem > user_events are solving and why we need a new thing and can't use > existing kernel primitives? > There's a number of reasons why we did not do as you suggest. The first reason is we want to only take the write() syscall cost when events are wanting to be monitored. This is done at a per-trace_event level and is not at a per-process level. user_events gives us the ability to know cheaply when an event is or is not to be written. It does this by setting/clearing a bit in each process when the trace_event classes register function is invoked to attach/detach perf or ftrace. By using a bit instead of bytes, we also have the ability to share tracing out to the kernel as well as any local user tracer in the future, this work was started by Mathieu Desnoyers via libside [1]. The second reason is we have found user based buffers to be unreliable when either the user process is crashing or has a corruption bug. By having the data buffers reside within the kernel, we prevent this from happening. If the kernel panics, we can also pull events out of the perf_event buffers via GDB to understand what our user processes were doing before the time of the panic. The third reason is we want to make use of all the features that perf, ftrace, and eBPF have. We do not want to have to re-write all of those features. The main things are being able to filter upon event payloads and aggregate them together. We also selectively turn on and off stack walking for some events (not all). Perf lets us selectively do this on a per-event basis in addition to grabbing raw stack data to enable unwinding via DWARF instructions. When we monitor events via perf/ftrace, we can find each offset and type for the fields within the event. We need to know these to properly decode events and analyze them. Tracefs gives a us a single place to see all of these events and efficiently decode them, including a stable event ID. We would have to replicate all of that work in userspace in addition to the other features we rely upon. The fourth reason is related to the third, we have a lot of existing diagnostics that rely upon and setup perf ring buffers. We want the user and kernel diagnostics to land in the same buffers with the same timestamps so we can see a full picture of what is going on. Thanks, -Beau 1. https://github.com/compudj/libside > > > > > Thank you, > > > > -- > > Masami Hiramatsu (Google)