Date: Mon, 18 Apr 2022 17:25:49 -0700
From: Beau Belgrave
To: Hagen Paul Pfeifer
Cc: rostedt@goodmis.org, mhiramat@kernel.org, linux-trace-devel@vger.kernel.org,
 linux-kernel@vger.kernel.org, Mathieu Desnoyers
Subject: Re: [PATCH v8 00/12] user_events: Enable user processes to create and write to trace events
Message-ID: <20220419002549.GA2055@kbox>
References: <20211216173511.10390-1-beaub@linux.microsoft.com>

On Mon, Apr 18, 2022 at 10:43:29PM +0200, Hagen Paul Pfeifer wrote:
> * Beau Belgrave | 2021-12-16 09:34:59 [-0800]:
>
> >The typical scenario is on process start to mmap user_events_status. Processes
> >then register the events they plan to use via the REG ioctl. The ioctl reads
> >and updates the passed in user_reg struct. The status_index of the struct is
> >used to know the byte in the status page to check for that event. The
> >write_index of the struct is used to describe that event when writing out to
> >the fd that was used for the ioctl call. The data must always include this
> >index first when writing out data for an event. Data can be written either by
> >write() or by writev().
>
> Hey Beau, a little bit late to the party. A few questions from my side: What
> are the exact weak points of USDT compared to User Events that stand in the
> way of further extend USDT (in a non-compatible way, sure, just as an
> different approach!)? The nice thing about USDT is that I can search for all
> possible probes of the system via "find / | readelf | ". Since they are listed
> in a dedicated ELF section (.note.stapsdt) - they are visible & transparent.
> I can also map a hierarchy/structure in Executable/DSO via clever choice of
> names. The big disadvantage of USDT is the lack of type information, but from
> a registration, explicit point of view, they are nice.
>
> Or in other words: why not extends the USDT approach? Why not
>
>   u32 val = 23;
>   const char *garbage = "tracestring";
>
>   DYNAMIC_TRACE_PROBE2("foo:bar", val, u32, garbage, cstring);
>

We actually tried some USDT extension methods early on, by extending the
.note.stapsdt sections and seeing how far we could get our definitions into
that form.

There are a few problems when running in a heavily containerized/cgroup
environment, even if you can get our formats into stapsdt.

It costs a lot to traverse every ELF file on the machine to find all the
notes. When profiling or tracing many containers, each cgroup's mount
namespace must be entered and then tracked. Since these files are in
different locations, each needs a separate probe definition, because the
definitions/patches are tied to the location of the binary to patch. As new
cgroups come online, we would have to keep track of each new binary location
and find probes that match that location. This becomes really hard to manage
if, for example, we just want to always enable a specific event regardless of
where it is on the filesystem. Events are limited to a maximum of 2^16;
having many duplicate events in the system might start to approach that limit
on high-core-count machines with many small cgroup isolations.

We run programs that are built on interpreted or JIT'd code (C#, JavaScript,
etc.). These don't have great places to put a stap definition, since they
aren't ELF files. I've seen approaches where temporary ELF files are
generated; however, this costs a lot. Now we have even more temporary files
to go patch, meaning more events and more probe definitions (many of them in
our case would be duplicates of the others).

In production environments we have them locked down heavily with both SELinux
and IPE enabled.
This prevents us from patching user-mode code on the fly, so the typical
perf probe calls fail here.

We typically want to know what events are available to us with very little
overhead. With programs already registering to a well-known location
(trace_events, tracefs), I can easily see all the user events on the system
by just doing ls on /sys/kernel/tracing/events/user_events. I can also see
all their data formats and easily enable hist and filtering, since these
formats are known to the kernel.

In our testing, uprobes are much more costly to the running program than the
write syscall.

For managed code, such as Java, code moves around and is not always at a
static location. The probe locations can change, etc. Calling from a managed
location into a native one has performance implications as well when using a
dynamic/temporary ELF stub approach.

We are actively using user_events to solve these problems in our
environments, which have previously seen high overheads to achieve the same
results. Many times we cannot afford to miss any events, so live scanning for
new ELF files doesn't work for us, as the programs and cgroups are
short-lived.

>
> Sure, the argument names, here "val" and "garbage" should also be saved. I
> also like the "just one additional header to the project to get things
> running" (#include "sdt.h"). Sure, a DYNAMIC_TRACE_IS_ACTIVE("foo:bar") would
> be great. But in fact we have never needed that in the past.
>
>
> hgn

Thanks,
-Beau
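
The check-then-write flow described in the quoted cover-letter text can be
sketched in C roughly as follows. This is a minimal sketch under stated
assumptions, not the series' own sample code: struct event_handle,
event_enabled() and event_write() are hypothetical names, the 32-bit width of
both indexes is an assumption, and the mmap of user_events_status and the REG
ioctl that would fill in the indexes are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Hypothetical stand-in for the two values the REG ioctl hands back.
 * The real struct user_reg lives in the kernel's uapi header; the field
 * widths here are an assumption made so the sketch compiles without it.
 */
struct event_handle {
	uint32_t status_index;	/* which byte of the status page to check */
	uint32_t write_index;	/* must come first in every write to the fd */
};

/* Non-zero when the status byte for this event is set. */
int event_enabled(const volatile unsigned char *status_page,
		  const struct event_handle *ev)
{
	return status_page[ev->status_index] != 0;
}

/*
 * Emit one event with a single writev(): write_index first, payload second.
 * Returns the number of bytes written, 0 if the event is disabled,
 * or -1 on error (errno set by writev).
 */
ssize_t event_write(int fd, const volatile unsigned char *status_page,
		    const struct event_handle *ev,
		    const void *payload, size_t len)
{
	struct iovec io[2];

	/* Disabled events cost one memory read and no syscall. */
	if (!event_enabled(status_page, ev))
		return 0;

	io[0].iov_base = (void *)&ev->write_index;
	io[0].iov_len = sizeof(ev->write_index);
	io[1].iov_base = (void *)payload;
	io[1].iov_len = len;

	return writev(fd, io, 2);
}
```

Checking the status byte before building the iovec keeps the disabled case to
a plain memory read, and using one writev() call keeps the index and the
payload together in a single write.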