From: Vincent Guittot
Date: Fri, 14 Jun 2024 12:48:37 +0200
Subject: Re: [PATCH v2 00/14] Introducing TIF_NOTIFY_IPI flag
To: Peter Zijlstra
Cc: K Prateek Nayak, linux-kernel@vger.kernel.org, "Gautham R. Shenoy",
    Richard Henderson, Ivan Kokshaysky, Matt Turner, Russell King, Guo Ren,
    Michal Simek, Dinh Nguyen, Jonas Bonn, Stefan Kristiansson,
    Stafford Horne, "James E.J. Bottomley", Helge Deller, Michael Ellerman,
    Nicholas Piggin, Christophe Leroy, "Naveen N. Rao", Yoshinori Sato,
    Rich Felker, John Paul Adrian Glaubitz, "David S. Miller",
    Andreas Larsson, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
    Dave Hansen, "H. Peter Anvin", "Rafael J. Wysocki", Daniel Lezcano,
    Juri Lelli, Dietmar Eggemann, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Valentin Schneider, Andrew Donnellan,
    Benjamin Gray, Frederic Weisbecker, Xin Li, Kees Cook, Rick Edgecombe,
    Tony Battersby, Bjorn Helgaas, Brian Gerst, Leonardo Bras, Imran Khan,
    "Paul E. McKenney", Rik van Riel, Tim Chen, David Vernet, Julia Lawall,
    linux-alpha@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
    linux-csky@vger.kernel.org, linux-openrisc@vger.kernel.org,
    linux-parisc@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
    linux-sh@vger.kernel.org, sparclinux@vger.kernel.org,
    linux-pm@vger.kernel.org, x86@kernel.org
In-Reply-To: <20240614092801.GL8774@noisy.programming.kicks-ass.net>
References: <20240613181613.4329-1-kprateek.nayak@amd.com>
    <20240614092801.GL8774@noisy.programming.kicks-ass.net>
McKenney" , Rik van Riel , Tim Chen , David Vernet , Julia Lawall , linux-alpha@vger.kernel.org, linux-arm-kernel@lists.infradead.org, linux-csky@vger.kernel.org, linux-openrisc@vger.kernel.org, linux-parisc@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-sh@vger.kernel.org, sparclinux@vger.kernel.org, linux-pm@vger.kernel.org, x86@kernel.org Content-Type: text/plain; charset="UTF-8" On Fri, 14 Jun 2024 at 11:28, Peter Zijlstra wrote: > > On Thu, Jun 13, 2024 at 06:15:59PM +0000, K Prateek Nayak wrote: > > Effects of call_function_single_prep_ipi() > > ========================================== > > > > To pull a TIF_POLLING thread out of idle to process an IPI, the sender > > sets the TIF_NEED_RESCHED bit in the idle task's thread info in > > call_function_single_prep_ipi() and avoids sending an actual IPI to the > > target. As a result, the scheduler expects a task to be enqueued when > > exiting the idle path. This is not the case with non-polling idle states > > where the idle CPU exits the non-polling idle state to process the > > interrupt, and since need_resched() returns false, soon goes back to > > idle again. > > > > When TIF_NEED_RESCHED flag is set, do_idle() will call schedule_idle(), > > a large part of which runs with local IRQ disabled. In case of ipistorm, > > when measuring IPI throughput, this large IRQ disabled section delays > > processing of IPIs. Further auditing revealed that in absence of any > > runnable tasks, pick_next_task_fair(), which is called from the > > pick_next_task() fast path, will always call newidle_balance() in this > > scenario, further increasing the time spent in the IRQ disabled section. > > > > Following is the crude visualization of the problem with relevant > > functions expanded: > > -- > > CPU0 CPU1 > > ==== ==== > > do_idle() { > > __current_set_polling(); > > ... > > monitor(addr); > > if (!need_resched()) > > mwait() { > > /* Waiting */ > > smp_call_function_single(CPU1, func, wait = 1) { ... > > ... ... > > set_nr_if_polling(CPU1) { ... > > /* Realizes CPU1 is polling */ ... > > try_cmpxchg(addr, ... > > &val, ... > > val | _TIF_NEED_RESCHED); ... > > } /* Does not send an IPI */ ... > > ... } /* mwait exit due to write at addr */ > > csd_lock_wait() { } > > /* Waiting */ preempt_set_need_resched(); > > ... __current_clr_polling(); > > ... flush_smp_call_function_queue() { > > ... func(); > > } /* End of wait */ } > > } schedule_idle() { > > ... > > local_irq_disable(); > > smp_call_function_single(CPU1, func, wait = 1) { ... > > ... ... > > arch_send_call_function_single_ipi(CPU1); ... > > \ ... > > \ newidle_balance() { > > \ ... > > /* Delay */ ... > > \ } > > \ ... > > \--------------> local_irq_enable(); > > /* Processes the IPI */ > > -- > > > > > > Skipping newidle_balance() > > ========================== > > > > In an earlier attempt to solve the challenge of the long IRQ disabled > > section, newidle_balance() was skipped when a CPU waking up from idle > > was found to have no runnable tasks, and was transitioning back to > > idle [2]. Tim [3] and David [4] had pointed out that newidle_balance() > > may be viable for CPUs that are idling with tick enabled, where the > > newidle_balance() has the opportunity to pull tasks onto the idle CPU. > > I don't think we should be relying on this in any way shape or form. > NOHZ can kill that tick at any time. > > Also, semantically, calling newidle from the idle thread is just daft. > You're really not newly idle in that case. 
>
> > Vincent [5] pointed out a case where the idle load kick will fail to
> > run on an idle CPU since the IPI handler launching the ILB will check
> > for need_resched(). In such cases, the idle CPU relies on
> > newidle_balance() to pull tasks towards itself.
>
> Is this the need_resched() in _nohz_idle_balance()? Should we change
> this to 'need_resched() && (rq->nr_running || rq->ttwu_pending)' or
> something along those lines?

It's not only this one, but also the need_resched() in do_idle(), which
exits the idle loop to look for tasks to schedule.

> I mean, it's fairly trivial to figure out if there really is going to be
> work there.
>
> > Using an alternate flag instead of NEED_RESCHED to indicate a pending
> > IPI was suggested as the correct approach to solve this problem on the
> > same thread.
>
> So adding per-arch changes for this seems like something we shouldn't
> do unless there really are no other sane options.
>
> That is, I really think we should start with something like the below
> and then fix any fallout from that.

The main problem is that need_resched() becomes somewhat meaningless,
because it no longer only means "I need to reschedule a task", and we
have to add more tests around it, even for CPUs that are not using
polling.

> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 0935f9d4bb7b..cfa45338ae97 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -5799,7 +5800,7 @@ static inline struct task_struct *
>  __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  {
>  	const struct sched_class *class;
> -	struct task_struct *p;
> +	struct task_struct *p = NULL;
>
>  	/*
>  	 * Optimization: we know that if all tasks are in the fair class we can
> @@ -5810,9 +5811,11 @@ __pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
>  	if (likely(!sched_class_above(prev->sched_class, &fair_sched_class) &&
>  		   rq->nr_running == rq->cfs.h_nr_running)) {
>
> -		p = pick_next_task_fair(rq, prev, rf);
> -		if (unlikely(p == RETRY_TASK))
> -			goto restart;
> +		if (rq->nr_running) {

How do you tell the difference between a spurious need_resched() caused
by polling and a CPU actually becoming idle? Isn't rq->nr_running zero
in both cases? In the latter case we need to call
sched_balance_newidle(), but not in the former.

> +			p = pick_next_task_fair(rq, prev, rf);
> +			if (unlikely(p == RETRY_TASK))
> +				goto restart;
> +		}
>
>  		/* Assume the next prioritized class is idle_sched_class */
>  		if (!p) {
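
To make the inline question above concrete, here is a toy model in plain
C, not kernel code: the struct, its field names, and the
should_newidle_balance() helper are invented for the illustration, and
notify_ipi_pending only stands in for the spirit of the TIF_NOTIFY_IPI
flag this series proposes. With need_resched() and rq->nr_running alone,
the "spurious polling wakeup" and "newly idle" cases both present
nr_running == 0 with need_resched set, so only an extra pending-IPI
indication lets the idle exit path skip the balance.

--
/* Toy model only; see the lead-in above for assumptions. */
#include <stdbool.h>
#include <stdio.h>

struct cpu_state {
	bool need_resched;	 /* set in both cases, so it cannot disambiguate */
	bool notify_ipi_pending; /* hypothetical extra flag, distinct from NEED_RESCHED */
	unsigned int nr_running; /* runnable tasks on this runqueue */
};

/* Should the CPU leaving idle attempt a newidle balance? */
static bool should_newidle_balance(const struct cpu_state *s)
{
	if (s->nr_running)
		return false;	/* there is already work to pick */
	if (s->notify_ipi_pending)
		return false;	/* woken only to flush the call-function queue */
	return true;		/* genuinely newly idle: try to pull work */
}

int main(void)
{
	struct cpu_state spurious   = { .need_resched = true,
					.notify_ipi_pending = true };
	struct cpu_state newly_idle = { .need_resched = true };

	/* Without the extra flag, both cases look identical (nr_running == 0). */
	printf("spurious wakeup -> balance? %d\n",
	       should_newidle_balance(&spurious));
	printf("newly idle      -> balance? %d\n",
	       should_newidle_balance(&newly_idle));
	return 0;
}
--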