Date: Wed, 7 Jun 2023 12:26:11 -0700
From: Beau Belgrave
To: Masami Hiramatsu
Cc: Christian Brauner, Alexei Starovoitov, Steven Rostedt, LKML,
    linux-trace-kernel@vger.kernel.org, Daniel Borkmann, Andrii Nakryiko,
    bpf, David Vernet, Linus Torvalds, Dave Thaler, Christoph Hellwig,
    Mathieu Desnoyers
Subject: Re: [PATCH] tracing/user_events: Run BPF program if attached
Message-ID: <20230607192611.GA143@W11-BEAU-MD.localdomain>
References: <20230516212658.2f5cc2c6@gandalf.local.home>
 <20230517165028.GA71@W11-BEAU-MD.localdomain>
 <20230601-urenkel-holzofen-cd9403b9cadd@brauner>
 <20230601152414.GA71@W11-BEAU-MD.localdomain>
 <20230601-legten-festplatten-fe053c6f16a4@brauner>
 <20230601162921.GA152@W11-BEAU-MD.localdomain>
 <20230606223752.65dd725c04b11346b45e0546@kernel.org>
 <20230606170549.GA71@W11-BEAU-MD.localdomain>
 <20230607230702.03c6d3a213d527a221bdc533@kernel.org>
In-Reply-To: <20230607230702.03c6d3a213d527a221bdc533@kernel.org>

On Wed, Jun 07, 2023 at 11:07:02PM +0900, Masami Hiramatsu wrote:
> On Tue, 6 Jun 2023 10:05:49 -0700
> Beau Belgrave wrote:
>
> > On Tue, Jun 06, 2023 at 10:37:52PM +0900, Masami Hiramatsu wrote:
> > > Hi Beau,
> > >
> > > On Thu, 1 Jun 2023 09:29:21 -0700
> > > Beau Belgrave wrote:
> > >
> > > > > > These are stubs to integrate namespace support. I've been
> > > > > > working on a series that adds tracing namespace support similar
> > > > > > to the IMA namespace work [1]. That series is ending up taking
> > > > > > more time than I
> > > > >
> > > > > Look, this is all well and nice but you've integrated user events
> > > > > with tracefs.
> > > > > This is currently a single-instance global filesystem. So what
> > > > > you're effectively implying is that you're namespacing tracefs by
> > > > > hanging it off of struct user namespace, making it mountable by
> > > > > unprivileged users. Or what's the plan?
> > > >
> > > > We don't have plans for unprivileged users currently. I think that
> > > > is a great goal and requires a proper tracing namespace, which we
> > > > currently don't have. I've done some thinking on this, but I would
> > > > like to hear your thoughts and others' on how to do this properly.
> > > > We do talk about this in the tracefs meetings (those might be out of
> > > > your time zone, unfortunately).
> > > >
> > > > > That alone is massive work with _wild_ security implications. My
> > > > > appetite for exposing more stuff under user namespaces is very low
> > > > > given the amount of CVEs we've had over the years.
> > > >
> > > > Ok, I based that approach on the feedback given in the LPC 2022
> > > > Containers and Checkpoint/Restore MC [1]. I believe you gave
> > > > feedback to use user namespaces to provide the encapsulation that
> > > > was required :)
> > >
> > > Even with the user namespace, I think we still need to provide a
> > > separate "eventname-space" for each application, since it may depend
> > > on the context of who launched it and where. I think the easiest
> > > solution is (perhaps) providing a new PID-based group for each
> > > instance (the PID prefix or suffix would be hidden from the
> > > application). I think it may not be good to allow unprivileged user
> > > processes to detect each other's registered event names by default.
> >
> > Regarding PID, are you referring to the PID namespace the application
> > resides within, or the actual single PID of the process?
>
> I meant the actual single PID of the process. That will be the safest
> way by default.

How do you feel about using the effective user ID instead of a single PID?
That way we wouldn't have so many events on the system, and the user
controls what runs and can share events. I could see a way for admins to
also override the user_event suffix on a per-user basis to allow for
broader event name scopes if required (IE: our k8s and production
scenarios).

> > In production we monitor things in sets that encompass more than a
> > single application. A requirement we have is the ability to group
> > like-processes together for monitoring purposes.
> >
> > We really need a way to know that a set of events is for a given
> > group; the easiest way to do that is by the system name provided on
> > each event. If this were by single PID (and not the PID namespace),
> > then we wouldn't be able to achieve this requirement. Ideally an admin
> > would be able to set up the name in some way that means something to
> > them in user-space.
>
> Would you mean using the same events between several different
> processes? I think it needs more care about security concerns.

More on this later.

> If not, I think the admin has a way to identify which processes are
> running in the same group outside of ftrace, and can set the filter
> correctly.

Agree that's possible, but it's going to be a massive amount of events
for both tracefs and the perf_event ring buffers to handle (we need a
perf FD per trace_event ID).

> > IE: user_events_critical as a system name, vs knowing that
> > (user_events_5 or user_events_6 or user_events_8) are "critical".
>
> My thought is the latter. Then processes cannot access each other's
> namespaces.
>
> > Another simple example is the same "application" that gets exec'd more
> > than once. Each time it execs, the system name would change if it were
> > really keyed by the actual PID vs the PID namespace. This would be
> > very hard to manage at the perf_event or eBPF level for us. It would
> > also vastly increase the number of trace_events that get created on
> > the system.
>
> Indeed.
> But fundamentally, allowing a user to create (register) a new event
> means such a DoS attack can happen. That's why we have a limitation on
> the max number of user_events. (BTW, I want to make this number
> controllable from sysctl or tracefs. Also, we need something against
> the event-id space contamination by this DoS attack.)
> I also think it would be better to have some rate limit on registering
> new events.

Totally agree here.

> > > > > > anticipated.
> > > > >
> > > > > Yet you were confident enough to leave the namespacing stubs for
> > > > > this functionality in the code. ;)
> > > > >
> > > > > What is the overall goal here? Letting arbitrary unprivileged
> > > > > containers define their own custom user event type by mounting
> > > > > tracefs inside unprivileged containers? If so, what security
> > > > > story is going to guarantee that writing arbitrary tracepoints
> > > > > from random unprivileged containers is safe?
> > > >
> > > > Unprivileged containers is not a goal; however, having a per-pod
> > > > user_event system name, such as user_event_, would be ideal for
> > > > certain diagnostic scenarios, such as monitoring the entire pod.
> > >
> > > That can be done in the user-space tools, not in the kernel.
> > >
> > Right, during k8s pod creation we would create the group and name it
> > something that makes sense to the operator, as an example. I'm sure
> > there are lots of scenarios user-space can handle. However, in our
> > scenarios they almost always involve more than one application
> > together.
>
> Yeah, if it is always used with k8s on the backend servers, it may be
> OK. But if it is used in a less trusted environment, we need to
> consider malicious normal users.
>
> > > > When you have a lot of containers, you also want to limit how many
> > > > tracepoints each container can create, even if they are given
> > > > access to the tracefs file.
> > > > The per-group limit can control how many events/tracepoints that
> > > > container can create; since we currently only have 16-bit
> > > > identifiers for trace_events, we need to be cautious we don't run
> > > > out.
> > >
> > > I agree, we need to have a knob to limit it to avoid DoS attacks.
> > >
> > > > user_events in general has tracepoint validators to ensure the
> > > > payloads coming in are "safe" from what the kernel might do with
> > > > them, such as filtering out data.
> > >
> > > [...]
> > >
> > > > > > changing the system name of user_events on a per-namespace
> > > > > > basis.
> > > > >
> > > > > What is the "system name" and how does it protect against
> > > > > namespaces messing with each other?
> > > >
> > > > trace_events in the tracing facility require both a system name and
> > > > an event name. IE: in sched/sched_waking, sched is the system name
> > > > and sched_waking is the event name. For user_events in the root
> > > > group, the system name is "user_events". When groups are
> > > > introduced, the system name can be "user_events_" for example.
> > >
> > > So my suggestion is using the PID in the root pid namespace instead
> > > of a GUID by default.
> >
> > By default this would be fine for our purposes as long as admins can
> > change this to a larger group before activation. PID, however, might
> > be a bit too granular an identifier for our scenarios, as I've
> > explained above.
> >
> > I think these logical steps make sense:
> > 1. Create "event namespace" (default system name suffix, max count)
> > 2. Setup "event namespace" (change system name suffix, max count)
> > 3. Attach "event namespace"
> >
> > I'm not sure we know what to attach to in #3 yet; so far both a tracer
> > namespace and a user namespace have been proposed. I think we need to
> > answer that. Right now everything is in the root "event namespace" and
> > is simply referred to by default as "user_events" as the system name,
> > without a suffix and with the boot-configured max event count.
>
> OK, so I think we are on the same page :)
>
> I think the user namespace is not enough for protecting events on a
> multi-user system without containers, so it has less flexibility.
> The new tracer namespace may be OK; we would still need a helper user
> program like 'user_eventd' for managing access based on some policy.
> If we have a way to manage it with SELinux etc., that would be the
> best, I think. (Perhaps using a UNIX domain socket would give us such
> flexibility.)

I'm adding Mathieu to CC since I think he had a few cases where a static
namespace wasn't enough and we might need hierarchy support. If we don't
need hierarchy support, I think it's a lot easier to do.

I like the idea of a per-user event namespace vs a per-PID event
namespace, knowing what we have to do to monitor all of this via perf.
Like I said above, per-PID would be a huge number of events compared to a
per-user or namespace approach.

But I do like where this is headed, and I'm glad we are having this
conversation :)

Thanks,
-Beau