Received: by 2002:ab2:3350:0:b0:1f4:6588:b3a7 with SMTP id o16csp1612703lqe; Mon, 8 Apr 2024 14:29:46 -0700 (PDT) X-Forwarded-Encrypted: i=3; AJvYcCXzQnfPGRTvs7mNV6oSzfiEOPkDYoOHUTfHEWM20d6AQuhNv3pZu7nA5yxdWMuEJY142b+xUc6lkBAEDdhRJsRDGPg9JS9TnNZJjYeu7A== X-Google-Smtp-Source: AGHT+IHoPI4PT8TUC5Hpah9dQn/CpYShlNW2s9HF7fOviQDkgyKgAyr9NTe7sRfjiMtdLOxK1Ioz X-Received: by 2002:a17:90b:4382:b0:2a0:3c31:7d07 with SMTP id in2-20020a17090b438200b002a03c317d07mr7802037pjb.48.1712611785821; Mon, 08 Apr 2024 14:29:45 -0700 (PDT) ARC-Seal: i=2; a=rsa-sha256; t=1712611785; cv=pass; d=google.com; s=arc-20160816; b=B8e2pIjSYYFWtwob2A0M+/DDG7bil8m/8vxPQYeJhacRH++lCXhtGUjulN5EWYgEso 0doag2vbhnPtZdh9QFu2m2sbXEUYeZddHLBuk5/g9KKqWJFUg8/Htc7aG7NZaqTWVXTg m6WFhCjKtNeeIkeLWV44Yu+nloGUED6WcXrerKozBXElrdvXjTzCjt/C0wO3MaOAUNt5 ABeW8I4FmHBNiSCQdqfpHZcyvcHp08uRv/WujSM2ZF3Ad5Svb2dK4VDOmG/SOuRsbu3D OENyOyL25usxxyRf6aCapu2zRcwRPVO7cI/uGxBHZfovLg3TGVcMRu2lZgEwZSraQ6Uz 5qUQ== ARC-Message-Signature: i=2; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=in-reply-to:content-transfer-encoding:content-disposition :mime-version:list-unsubscribe:list-subscribe:list-id:precedence :references:message-id:subject:cc:to:from:date:dkim-signature; bh=IapgfDVJ0FVifSAHIhEo8F37MsbgKVLxtLY8RK88aAs=; fh=tWedduKdZfmc5x5dPveXf04NtuTwlscpi5X/WE4lpxw=; b=IYSg4QaN88vdKO9pVXw/8N5/4t4mo9ibrS2evccPpEQPugmZHWYssD5YPp+OjWOKkj 2ORj0s2hlFoHaeJ8/rBW8p8EwGHK2Hscaup2+YcHP5LAX0be7vHYBZ68BrzFGanpNp5W UNwA9cD9d8MFA+uTYmwsMdVx9F8DpVl/UNonlntOwiWm8JHkbxz6mpF5Di5NJBi8sw2/ MU8foUrVjdnPAHOgt8IkumPk+pGxu+wR/xmQGNw2sxWMjoXpv8xodTcR1XBHF5dZb1rw UZvrBxNVII3xul/iDqUlvbQD6UiX1HW6BqekDOCFWwUB8IGg2VDuhLpFfpPqqb+64svi ZEMg==; dara=google.com ARC-Authentication-Results: i=2; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=WmsjPJ7N; arc=pass (i=1 dkim=pass dkdomain=kernel.org); spf=pass (google.com: domain of linux-kernel+bounces-135941-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-135941-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from sv.mirrors.kernel.org (sv.mirrors.kernel.org. [139.178.88.99]) by mx.google.com with ESMTPS id g23-20020a17090ace9700b002a2c3529123si9396891pju.184.2024.04.08.14.29.45 for (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 08 Apr 2024 14:29:45 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel+bounces-135941-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) client-ip=139.178.88.99; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=WmsjPJ7N; arc=pass (i=1 dkim=pass dkdomain=kernel.org); spf=pass (google.com: domain of linux-kernel+bounces-135941-linux.lists.archive=gmail.com@vger.kernel.org designates 139.178.88.99 as permitted sender) smtp.mailfrom="linux-kernel+bounces-135941-linux.lists.archive=gmail.com@vger.kernel.org"; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: from smtp.subspace.kernel.org (wormhole.subspace.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by sv.mirrors.kernel.org (Postfix) with ESMTPS id 3AED4281C66 for ; Mon, 8 Apr 2024 21:29:15 +0000 (UTC) Received: from localhost.localdomain (localhost.localdomain [127.0.0.1]) by smtp.subspace.kernel.org (Postfix) with ESMTP id C78011487ED; Mon, 8 Apr 2024 21:29:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="WmsjPJ7N" Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BCAF0495CB; Mon, 8 Apr 2024 21:29:06 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712611746; cv=none; b=ESjhSneVbZSVUCPnY7Wd1eWMQxmI0eBuaEVRxCGubcnmXEdAK3Sfe4Ii/XEJQC4Mc9VyacfK26vADwDSDLhh/NzPypwMLB9xrDVQJ+auvwvk3RK5PKMx0wnViULH1+GvOvkfYt7gDKDakVGgWzaqOS6ttO4L6jgHzuqBFHtyZQY= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1712611746; c=relaxed/simple; bh=b9zm69YyzYDWwGcIyc+2QzF3WetCtaijzlqkIvHbTRs=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=LHf9+7KF3ujGJsO69iL9wL3s6pphGPxyJMD8z/swZjeyHQJpNYb9krwY1EZSrvMVY9PhWfE5Vik6mWV8KIg1grpuMli1Fpn5aqtc7Exi0cHRU4/tVHRXMBVKGvUQCSqnjnoVWbPscWNk3gFguj6HCDcLE+cmgBXDKqgrgdbzpVE= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=WmsjPJ7N; arc=none smtp.client-ip=10.30.226.201 Received: by smtp.kernel.org (Postfix) with ESMTPSA id BAD47C433F1; Mon, 8 Apr 2024 21:29:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1712611746; bh=b9zm69YyzYDWwGcIyc+2QzF3WetCtaijzlqkIvHbTRs=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=WmsjPJ7NG90s04vEQU4fL1/nHsAzU3ifvRZiiOUXvXb/5k1D8Me0QZnTLAkpBMnBm iSkWKwXDXQPYW9gPQ5JaVf36cijcICsH7s/i6vqq3CrN1/4HXVdeC7As2zMkss0KYZ 8D8ouunUOdpEsfHaU5o4+bjb3ZpEKWvlZeqhjMfHZ0pG2p3544+mzObvQ2KCrQzfkG nlP+U14YLlGcsYOnrRCUspbiFc0hjivATpf2PPV7AAluIj1DJMj7iblhXvUSqD89C1 y8y213watDoybYaavzPdp6i4KSRolSXVVg9dWrRhX6vVOBSaGPUSaA9QV4FF3JOKnX M/GI00UkaFweQ== Date: Mon, 8 Apr 2024 23:29:03 +0200 From: Frederic Weisbecker To: Sebastian Andrzej Siewior Cc: linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org, Adrian Hunter , Alexander Shishkin , Arnaldo Carvalho de Melo , Ian Rogers , Ingo Molnar , Jiri Olsa , Marco Elver , Mark Rutland , Namhyung Kim , Peter Zijlstra , Thomas Gleixner , Arnaldo Carvalho de Melo Subject: Re: [PATCH v3 2/4] perf: Enqueue SIGTRAP always via task_work. Message-ID: References: <20240322065208.60456-1-bigeasy@linutronix.de> <20240322065208.60456-3-bigeasy@linutronix.de> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20240322065208.60456-3-bigeasy@linutronix.de> Le Fri, Mar 22, 2024 at 07:48:22AM +0100, Sebastian Andrzej Siewior a ?crit : > A signal is delivered by raising irq_work() which works from any context > including NMI. irq_work() can be delayed if the architecture does not > provide an interrupt vector. In order not to lose a signal, the signal > is injected via task_work during event_sched_out(). > > Instead going via irq_work, the signal could be added directly via > task_work. The signal is sent to current and can be enqueued on its > return path to userland instead of triggering irq_work. A dummy IRQ is > required in the NMI case to ensure the task_work is handled before > returning to user land. For this irq_work is used. An alternative would > be just raising an interrupt like arch_send_call_function_single_ipi(). > > During testing with `remove_on_exec' it become visible that the event > can be enqueued via NMI during execve(). The task_work must not be kept > because free_event() will complain later. Also the new task will not > have a sighandler installed. > > Queue signal via task_work. Remove perf_event::pending_sigtrap and > and use perf_event::pending_work instead. Raise irq_work in the NMI case > for a dummy interrupt. Remove the task_work if the event is freed. > > Tested-by: Marco Elver > Tested-by: Arnaldo Carvalho de Melo > Signed-off-by: Sebastian Andrzej Siewior It clashes a bit with a series I have posted. Let's see the details: > --- > include/linux/perf_event.h | 3 +- > kernel/events/core.c | 58 ++++++++++++++++++++++---------------- > 2 files changed, 34 insertions(+), 27 deletions(-) > > diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h > index d2a15c0c6f8a9..24ac6765146c7 100644 > --- a/include/linux/perf_event.h > +++ b/include/linux/perf_event.h > @@ -781,7 +781,6 @@ struct perf_event { > unsigned int pending_wakeup; > unsigned int pending_kill; > unsigned int pending_disable; > - unsigned int pending_sigtrap; > unsigned long pending_addr; /* SIGTRAP */ > struct irq_work pending_irq; > struct callback_head pending_task; > @@ -959,7 +958,7 @@ struct perf_event_context { > struct rcu_head rcu_head; > > /* > - * Sum (event->pending_sigtrap + event->pending_work) > + * Sum (event->pending_work + event->pending_work) > * > * The SIGTRAP is targeted at ctx->task, as such it won't do changing > * that until the signal is delivered. > diff --git a/kernel/events/core.c b/kernel/events/core.c > index c7a0274c662c8..e0b2da8de485f 100644 > --- a/kernel/events/core.c > +++ b/kernel/events/core.c > @@ -2283,21 +2283,6 @@ event_sched_out(struct perf_event *event, struct perf_event_context *ctx) > state = PERF_EVENT_STATE_OFF; > } > > - if (event->pending_sigtrap) { > - bool dec = true; > - > - event->pending_sigtrap = 0; > - if (state != PERF_EVENT_STATE_OFF && > - !event->pending_work) { > - event->pending_work = 1; > - dec = false; > - WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount)); > - task_work_add(current, &event->pending_task, TWA_RESUME); > - } > - if (dec) > - local_dec(&event->ctx->nr_pending); > - } > - > perf_event_set_state(event, state); > > if (!is_software_event(event)) > @@ -6741,11 +6726,6 @@ static void __perf_pending_irq(struct perf_event *event) > * Yay, we hit home and are in the context of the event. > */ > if (cpu == smp_processor_id()) { > - if (event->pending_sigtrap) { > - event->pending_sigtrap = 0; > - perf_sigtrap(event); > - local_dec(&event->ctx->nr_pending); > - } > if (event->pending_disable) { > event->pending_disable = 0; > perf_event_disable_local(event); > @@ -9592,14 +9572,23 @@ static int __perf_event_overflow(struct perf_event *event, > > if (regs) > pending_id = hash32_ptr((void *)instruction_pointer(regs)) ?: 1; > - if (!event->pending_sigtrap) { > - event->pending_sigtrap = pending_id; > + if (!event->pending_work) { > + event->pending_work = pending_id; > local_inc(&event->ctx->nr_pending); > - irq_work_queue(&event->pending_irq); > + WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount)); > + task_work_add(current, &event->pending_task, TWA_RESUME); If the overflow happens between exit_task_work() and perf_event_exit_task(), you're leaking the event. (This was there before this patch). See: https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m5e6c8ebbef04ab9a1d7f05340cd3e2716a9a8c39 > + /* > + * The NMI path returns directly to userland. The > + * irq_work is raised as a dummy interrupt to ensure > + * regular return path to user is taken and task_work > + * is processed. > + */ > + if (in_nmi()) > + irq_work_queue(&event->pending_irq); > } else if (event->attr.exclude_kernel && valid_sample) { > /* > * Should not be able to return to user space without > - * consuming pending_sigtrap; with exceptions: > + * consuming pending_work; with exceptions: > * > * 1. Where !exclude_kernel, events can overflow again > * in the kernel without returning to user space. > @@ -9609,7 +9598,7 @@ static int __perf_event_overflow(struct perf_event *event, > * To approximate progress (with false negatives), > * check 32-bit hash of the current IP. > */ > - WARN_ON_ONCE(event->pending_sigtrap != pending_id); > + WARN_ON_ONCE(event->pending_work != pending_id); > } > > event->pending_addr = 0; > @@ -13049,6 +13038,13 @@ static void sync_child_event(struct perf_event *child_event) > &parent_event->child_total_time_running); > } > > +static bool task_work_cb_match(struct callback_head *cb, void *data) > +{ > + struct perf_event *event = container_of(cb, struct perf_event, pending_task); > + > + return event == data; > +} I suggest we introduce a proper API to cancel an actual callback head, see: https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#mbfac417463018394f9d80c68c7f2cafe9d066a4b https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m0a347249a462523358724085f2489ce9ed91e640 > + > static void > perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx) > { > @@ -13088,6 +13084,18 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx) > * Kick perf_poll() for is_event_hup(); > */ > perf_event_wakeup(parent_event); > + /* > + * Cancel pending task_work and update counters if it has not > + * yet been delivered to userland. free_event() expects the > + * reference counter at one and keeping the event around until > + * the task returns to userland can be a unexpected if there is > + * no signal handler registered. > + */ > + if (event->pending_work && > + task_work_cancel_match(current, task_work_cb_match, event)) { > + put_event(event); > + local_dec(&event->ctx->nr_pending); > + } So exiting task, privileged exec and also exit on exec call into this before releasing the children. And parents rely on put_event() from file close + the task work. But what about remote release of children on file close? See perf_event_release_kernel() directly calling free_event() on them. One possible fix is to avoid the reference count game around task work and flush them on free_event(). See here: https://lore.kernel.org/all/202403310406.TPrIela8-lkp@intel.com/T/#m63c28147d8ac06b21c64d7784d49f892e06c0e50 Thanks. > free_event(event); > put_event(parent_event); > return; > -- > 2.43.0 > >