Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id D7E5DC433EF for ; Wed, 22 Dec 2021 14:18:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S245645AbhLVOSj (ORCPT ); Wed, 22 Dec 2021 09:18:39 -0500 Received: from dfw.source.kernel.org ([139.178.84.217]:51772 "EHLO dfw.source.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S245605AbhLVOSi (ORCPT ); Wed, 22 Dec 2021 09:18:38 -0500 Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 1F18461AD8; Wed, 22 Dec 2021 14:18:38 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 83A37C36AEA; Wed, 22 Dec 2021 14:18:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1640182717; bh=OzziU8y1aBqNeaEv60Y5cbCEDvhp8GglpPLz7KOa9bg=; h=Date:From:To:Cc:Subject:In-Reply-To:References:From; b=atGpQ2TNpSELVfQncWhRfzMWupI1gxhGzf2dJB042ILWGHgTpHVLuFu4F1Z8k8+od n8S+if5VcoAAJ8wy6NocOoIkbM6fSOsU2TRavFo8ihBajLtwzFB6mIyH8vjT5HNu/v GiTXHs8ysXrfAJJQWBJhVw4Z6vwzo/TyLHM9daJ2iEHwgWOu1+lFsla0Vs6Xs5AOKP aW4iX2NOqlxG0BqQlA52nnW5NmnyizUpMff4003gO2HHSTKEYBgDhCLpwKw2+EreAo mhsq67J6QpvbOYLwQcPAnpg/HoowScmHxmzebucFZXXyAv3ZY0MBAz2wr1E9dSiK8S bKfVeuorfr74A== Date: Wed, 22 Dec 2021 23:18:34 +0900 From: Masami Hiramatsu To: Beau Belgrave Cc: rostedt@goodmis.org, linux-trace-devel@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v8 09/12] user_events: Add documentation file Message-Id: <20211222231834.875f662b37fe673dec0b3663@kernel.org> In-Reply-To: <20211216173511.10390-10-beaub@linux.microsoft.com> References: <20211216173511.10390-1-beaub@linux.microsoft.com> <20211216173511.10390-10-beaub@linux.microsoft.com> X-Mailer: Sylpheed 3.7.0 (GTK+ 2.24.32; x86_64-pc-linux-gnu) Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Beau, On Thu, 16 Dec 2021 09:35:08 -0800 Beau Belgrave wrote: > Add a documentation file about user_events with example code, etc. > explaining how it may be used. > > Signed-off-by: Beau Belgrave > --- > Documentation/trace/index.rst | 1 + > Documentation/trace/user_events.rst | 195 ++++++++++++++++++++++++++++ > 2 files changed, 196 insertions(+) > create mode 100644 Documentation/trace/user_events.rst > > diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst > index 3769b9b7aed8..3a47aa8341c6 100644 > --- a/Documentation/trace/index.rst > +++ b/Documentation/trace/index.rst > @@ -30,3 +30,4 @@ Linux Tracing Technologies > stm > sys-t > coresight/index > + user_events > diff --git a/Documentation/trace/user_events.rst b/Documentation/trace/user_events.rst > new file mode 100644 > index 000000000000..36104b537476 > --- /dev/null > +++ b/Documentation/trace/user_events.rst > @@ -0,0 +1,195 @@ > +========================================= > +user_events: User-based Event Tracing > +========================================= > + > +:Author: Beau Belgrave > + > +Overview > +-------- > +User based trace events allow user processes to create events and trace data > +that can be viewed via existing tools, such as ftrace, perf and eBPF. > +To enable this feature, build your kernel with CONFIG_USER_EVENTS=y. > + > +Programs can view status of the events via > +/sys/kernel/debug/tracing/user_events_status and can both register and write > +data out via /sys/kernel/debug/tracing/user_events_data. > + > +Programs can also use /sys/kernel/debug/tracing/dynamic_events to register and > +delete user based events via the u: prefix. The format of the command to > +dynamic_events is the same as the ioctl with the u: prefix applied. > + > +Typically programs will register a set of events that they wish to expose to > +tools that can read trace_events (such as ftrace and perf). The registration > +process gives back two ints to the program for each event. The first int is the > +status index. This index describes which byte in the > +/sys/kernel/debug/tracing/user_events_status file represents this event. The > +second int is the write index. This index describes the data when a write() or > +writev() is called on the /sys/kernel/debug/tracing/user_events_data file. > + > +The structures referenced in this document are contained with the > +/include/uap/linux/user_events.h file in the source tree. > + > +**NOTE:** *Both user_events_status and user_events_data are under the tracefs > +filesystem and may be mounted at different paths than above.* > + > +Registering > +----------- > +Registering within a user process is done via ioctl() out to the > +/sys/kernel/debug/tracing/user_events_data file. The command to issue is > +DIAG_IOCSREG. This command takes a struct user_reg as an argument. > + Could you add the user_reg data structure here? > +The struct user_reg requires two values, the first is the size of the structure > +to ensure forward and backward compatibility. The second is the command string > +to issue for registering. This explanation may be a bit out of date? user_reg has 4 fields. 2 for input, 2 for output. And could you add a section for DIAG_IOCSDEL? > + > +User based events show up under tracefs like any other event under the > +subsystem named "user_events". This means tools that wish to attach to the > +events need to use /sys/kernel/debug/tracing/events/user_events/[name]/enable > +or perf record -e user_events:[name] when attaching/recording. > + > +**NOTE:** *The write_index returned is only valid for the FD that was used* > + > +Command Format > +^^^^^^^^^^^^^^ > +The command string format is as follows:: > + > + name[:FLAG1[,FLAG2...]] [Field1[;Field2...]] > + > +Supported Flags > +^^^^^^^^^^^^^^^ > +**BPF_ITER** - EBPF programs attached to this event will get the raw iovec > +struct instead of any data copies for max performance. > + > +Field Format > +^^^^^^^^^^^^ > +:: > + > + type name [size] > + > +Basic types are supported (__data_loc, u32, u64, int, char, char[20], etc). > +User programs are encouraged to use clearly sized types like u32. > + > +**NOTE:** *Long is not supported since size can vary between user and kernel.* > + > +The size is only valid for types that start with a struct prefix. > +This allows user programs to describe custom structs out to tools, if required. > + > +For example, a struct in C that looks like this:: > + > + struct mytype { > + char data[20]; > + }; > + > +Would be represented by the following field:: > + > + struct mytype myname 20 > + > +Status > +------ > +When tools attach/record user based events the status of the event is updated > +in realtime. This allows user programs to only incur the cost of the write() or > +writev() calls when something is actively attached to the event. > + > +User programs call mmap() on /sys/kernel/debug/tracing/user_events_status to > +check the status for each event that is registered. The byte to check in the > +file is given back after the register ioctl() via user_reg.status_index. > +Currently the size of user_events_status is a single page, however, custom > +kernel configurations can change this size to allow more user based events. In > +all cases the size of the file is a multiple of a page size. > + > +For example, if the register ioctl() gives back a status_index of 3 you would > +check byte 3 of the returned mmap data to see if anything is attached to that > +event. > + > +Administrators can easily check the status of all registered events by reading > +the user_events_status file directly via a terminal. The output is as follows:: > + > + Byte:Name [# Comments] > + ... > + > + Active: ActiveCount > + Busy: BusyCount > + Max: MaxCount > + > +For example, on a system that has a single event the output looks like this:: > + > + 1:test > + > + Active: 1 > + Busy: 0 > + Max: 4096 > + > +If a user enables the user event via ftrace, the output would change to this:: > + > + 1:test # Used by ftrace > + > + Active: 1 > + Busy: 1 > + Max: 4096 > + > +**NOTE:** *A status index of 0 will never be returned. This allows user > +programs to have an index that can be used on error cases.* > + > +Status Bits > +^^^^^^^^^^^ > +The byte being checked will be non-zero if anything is attached. Programs can > +check specific bits in the byte to see what mechanism has been attached. > + > +The following values are defined to aid in checking what has been attached: > + > +**EVENT_STATUS_FTRACE** - Bit set if ftrace has been attached (Bit 0). > + > +**EVENT_STATUS_PERF** - Bit set if perf/eBPF has been attached (Bit 1). > + > +Writing Data > +------------ > +After registering an event the same fd that was used to register can be used > +to write an entry for that event. The write_index returned must be at the start > +of the data, then the remaining data is treated as the payload of the event. > + > +For example, if write_index returned was 1 and I wanted to write out an int > +payload of the event. Then the data would have to be 8 bytes (2 ints) in size, > +with the first 4 bytes being equal to 1 and the last 4 bytes being equal to the > +value I want as the payload. > + > +In memory this would look like this:: > + > + int index; > + int payload; > + > +User programs might have well known structs that they wish to use to emit out > +as payloads. In those cases writev() can be used, with the first vector being > +the index and the following vector(s) being the actual event payload. > + > +For example, if I have a struct like this:: > + > + struct payload { > + int src; > + int dst; > + int flags; > + }; > + > +It's advised for user programs to do the following:: > + > + struct iovec io[2]; > + struct payload e; > + > + io[0].iov_base = &write_index; > + io[0].iov_len = sizeof(write_index); > + io[1].iov_base = &e; > + io[1].iov_len = sizeof(e); > + > + writev(fd, (const struct iovec*)io, 2); > + > +**NOTE:** *The write_index is not emitted out into the trace being recorded.* > + > +EBPF > +---- > +EBPF programs that attach to a user-based event tracepoint are given a pointer > +to a struct user_bpf_context. The bpf context contains the data type (which can > +be a user or kernel buffer, or can be a pointer to the iovec) and the data > +length that was emitted (minus the write_index). > + > +Example Code > +------------ > +See sample code in samples/user_events. Maybe tools/testing/selftests/user_events ? Thank you, > -- > 2.17.1 > -- Masami Hiramatsu