Received: by 2002:a05:6a10:af89:0:0:0:0 with SMTP id iu9csp1287516pxb; Fri, 21 Jan 2022 14:25:28 -0800 (PST) X-Google-Smtp-Source: ABdhPJzF8Eua6kcYlNjkJ+yLy5t2WUtkUBo7MIV860atTaB/MIgp7/RXeklFJfM1W3JIpq7PNgOQ X-Received: by 2002:a17:902:b206:b0:149:3b5d:2b8b with SMTP id t6-20020a170902b20600b001493b5d2b8bmr5375507plr.162.1642803928403; Fri, 21 Jan 2022 14:25:28 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1642803928; cv=none; d=google.com; s=arc-20160816; b=H7XwYWRBcxeKX0EBgedS3qOBFr1dmAuCeRPVgXEviNBswWn+FkVdn6n4PplDYwOUWR Cw/gHyW9nnZPI96AFIIPuNP5oYpEGnJkd9Dg96o4aRTuNF55/h4UHGFokRLP6vTJij8x zbiJ63/kz9c9VN8S81M27c9qnlKIQYyYnzWc3kS75GuvlxaCG2V6mpQDOZqlk6s84oSw 7Rg3smU8co1WfMou9V0yR5csZSqp9bxXOicEk4QpwulGzyI4yMcnYzSTPM7pJAhuQGAq X5U0Th0vbWnb7hUMZ4AytfwgI/gn9LzFvmDGDCMlE5pffdn5ViSmV2l1aHHJt/+qTpA8 cSkw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :message-id:date:subject:cc:to:from; bh=+Npro4q1vd27ws/pCDxOsnkBTmmFNT/bhx4xiJ4Cx94=; b=h8+zZHOACMqeXhXqqLJmFenMcCqp6H/CqcIUGi5wTzgWwhXJRs29whnSMvayVgvDz8 W6p1llfJ2OlwusRtLTG3ggNknC0bLQ+js1H6a6wpK6ZfAo9Ns2Ac6XnQWfewOvZA5RQl tT/wU1VtOr546a8e0g70JRKVTEIW8Txs6zxqa3pWjaUCcBzf3N4LRrul3rbdiIjKIqhO 5fBEuC9UKKOGqjqckJH4L+osHiGgThsRzwNU6P8sT+WpZ7d/AZAy0ilLakraj12H9fvW cu3KOlI8Z1sJpacGoqyR+MdKRQZm/1oiBW9jPawVOh/vSMOKZIKX1Zcx7NFcdSZ1cNsc Ztgw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id n8si8015573plh.598.2022.01.21.14.25.15; Fri, 21 Jan 2022 14:25:28 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1376968AbiATQZv (ORCPT + 99 others); Thu, 20 Jan 2022 11:25:51 -0500 Received: from foss.arm.com ([217.140.110.172]:44002 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S238473AbiATQZu (ORCPT ); Thu, 20 Jan 2022 11:25:50 -0500 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 1C7AAED1; Thu, 20 Jan 2022 08:25:50 -0800 (PST) Received: from e113632-lin.cambridge.arm.com (e113632-lin.cambridge.arm.com [10.1.196.57]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPA id A15F53F73D; Thu, 20 Jan 2022 08:25:47 -0800 (PST) From: Valentin Schneider To: linux-kernel@vger.kernel.org Cc: Abhijeet Dharmapurikar , =?UTF-8?q?Uwe=20Kleine-K=C3=B6nig?= , Dietmar Eggemann , Steven Rostedt , Peter Zijlstra , Ingo Molnar , Vincent Guittot , Thomas Gleixner , Sebastian Andrzej Siewior , Juri Lelli , Daniel Bristot de Oliveira , Kees Cook , Andrew Morton , "Eric W. Biederman" , Alexey Gladkov , "Kenta.Tada@sony.com" , Randy Dunlap , Ed Tsai Subject: [PATCH v3 0/2] sched/tracing: sched_switch prev_state reported as TASK_RUNNING when it's not Date: Thu, 20 Jan 2022 16:25:18 +0000 Message-Id: <20220120162520.570782-1-valentin.schneider@arm.com> X-Mailer: git-send-email 2.25.1 MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi folks, Problem ======= Abhijeet pointed out that the following sequence of trace events can be observed: cat-1676 [001] d..3 103.010411: sched_waking: comm=grep pid=1677 prio=120 target_cpu=004 grep-1677 [004] d..2 103.010440: sched_switch: prev_comm=grep prev_pid=1677 prev_prio=120 prev_state=R 0x0 ==> next_comm=swapper/4 next_pid=0 next_prio=120 -0 [004] dNh3 103.0100459: sched_wakeup: comm=grep pid=1677 prio=120 target_cpu=004 IOW, not-yet-woken task gets presented as runnable and switched out in favor of the idle task... Dietmar and I had a look, turns out this sequence can happen: p->__state = TASK_INTERRUPTIBLE; __schedule() deactivate_task(p); ttwu() READ !p->on_rq p->__state=TASK_WAKING trace_sched_switch() __trace_sched_switch_state() task_state_index() return 0; TASK_WAKING isn't in TASK_REPORT, hence why task_state_index(p) returns 0. This can happen as of commit c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu") which punted the TASK_WAKING write to the other side of smp_cond_load_acquire(&p->on_cpu, !VAL); in ttwu(). Uwe reported on #linux-rt what I think is a similar issue with TASK_RTLOCK_WAIT on PREEMPT_RT; again that state isn't in TASK_REPORT so you get prev_state=0 despite the task blocking on a lock. Both of those are very confusing for tooling or even human observers. Patches ======= For the TASK_WAKING case, pushing the prev_state read in __schedule() down to __trace_sched_switch_state() seems sufficient. That's patch 1. For TASK_RTLOCK_WAIT, it's a bit trickier. One could use ->saved_state as prev_state, but a) that is also susceptible to racing (see ttwu() / ttwu_state_match()) b) some lock substitutions (e.g. rt_spin_lock()) leave ->saved_state as TASK_RUNNING. Patch 2 adds TASK_RTLOCK_WAIT to TASK_REPORT instead. Testing ======= Briefly tested on an Arm Juno via ftrace+hackbench against o tip/sched/core@82762d2af31a o v5.16-rt15-rebase w/ CONFIG_PREEMPT_RT I also had a look at the __schedule() disassembly as suggested by Peter; on x86 prev_state gets pushed to the stack and popped prior to the trace event static call, the rest seems mostly unchanged (GCC 9.3). Revisions ========= v2 -> v3 ++++++++ o Dropped TASK_RTLOCK_WAIT from TASK_REPORT, made it appear as TASK_UNINTERRUPTIBLE instead (Eric) RFC v1 -> v2 ++++++++++++ o Collected tags (Steven, Sebastian) o Patched missed trace_switch event clients (Steven) Cheers, Valentin Valentin Schneider (2): sched/tracing: Don't re-read p->state when emitting sched_switch event sched/tracing: Report TASK_RTLOCK_WAIT tasks as TASK_UNINTERRUPTIBLE include/linux/sched.h | 19 ++++++++++++++++--- include/trace/events/sched.h | 11 +++++++---- kernel/sched/core.c | 4 ++-- kernel/trace/fgraph.c | 4 +++- kernel/trace/ftrace.c | 4 +++- kernel/trace/trace_events.c | 8 ++++++-- kernel/trace/trace_osnoise.c | 4 +++- kernel/trace/trace_sched_switch.c | 1 + kernel/trace/trace_sched_wakeup.c | 1 + 9 files changed, 42 insertions(+), 14 deletions(-) -- 2.25.1