Received: by 2002:a05:6358:45e:b0:b5:b6eb:e1f9 with SMTP id 30csp841336rwe; Thu, 25 Aug 2022 10:02:27 -0700 (PDT) X-Google-Smtp-Source: AA6agR5SCRrJZTueFGsFmiKF+5EDlRdWSR1HJu0XCtdMIzqSvPGVPRCPBL65O25EqSlhvB9SncPr X-Received: by 2002:a17:902:b413:b0:172:a628:7915 with SMTP id x19-20020a170902b41300b00172a6287915mr4601519plr.99.1661446946413; Thu, 25 Aug 2022 10:02:26 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1661446946; cv=none; d=google.com; s=arc-20160816; b=KQOc55GeFzML6pJrmArqCErVr0ZOU/lYGtgpSBYtfyt4LtLaW+MbNX40+XhUSVyEdZ P8C3b8rJpO1ckknsVs1s15jZi1fYxoI2s/RwBQ0Lf+3A2UJtNMQXXbwEHgdqQ4TmTGU9 GBgjWWDOpXz3qpz0wcdsQgFu9dnV4snjE8RCBkcSYU/RVFl5Lu02/UWy6ZINS3RNLgIJ 2KhF8gpS5ppHujV5ny8YhVHlfomr9y9M7wLpYBEQ7z1YIihjOoWGe6Ne+tIZtraoZz2T 2n/mgkebKN65E4aewZSWHWNbjNIei/GLaXgpGUjmTuMX55DWEcvx4xnjUBrMfbCK2863 BxMA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=/PP0ux4xnVggwZX9NLWdLDgmxpszMjAcO6bhJJu5nGY=; b=IivrecpUw4JHfrCRwy+4M5F+m9ty2n0o/TstPAjKfGLx8sK0iaQmVo03vZcykSByjk fWeC/VVNl9B7Zlwl0Pzbm2lbyVBPtHNT+FhZpqTWR59VLCqRPWW0iXN2uWCleKjXzSQs g3/lcsSxfVB78XTANmJreLvQNxljRIk+O0BHZ9ywTVpXa3wuC6HGhxEzuwaoGjz6DjDZ BzShvJALOUcDAd8es9ljrplWTuG9ACqGGlBvFA83+pTqqcXqTB7MYyMbwpk0SEYlNPkS foGvhlBjXgO/YkhRT5ZzmA+Av/SzNSue8mw1MbMBYff6lahx/M2rfMbO82QpJCT5Xuj8 DLhw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=UvrI7uci; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=bytedance.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id c17-20020a056a000ad100b0052d163e23a7si19141374pfl.51.2022.08.25.10.02.14; Thu, 25 Aug 2022 10:02:26 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@bytedance-com.20210112.gappssmtp.com header.s=20210112 header.b=UvrI7uci; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=bytedance.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S243039AbiHYQov (ORCPT + 99 others); Thu, 25 Aug 2022 12:44:51 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43614 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S243109AbiHYQoN (ORCPT ); Thu, 25 Aug 2022 12:44:13 -0400 Received: from mail-pl1-x635.google.com (mail-pl1-x635.google.com [IPv6:2607:f8b0:4864:20::635]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B3220BC808 for ; Thu, 25 Aug 2022 09:43:02 -0700 (PDT) Received: by mail-pl1-x635.google.com with SMTP id x23so19002389pll.7 for ; Thu, 25 Aug 2022 09:43:02 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance-com.20210112.gappssmtp.com; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc; bh=/PP0ux4xnVggwZX9NLWdLDgmxpszMjAcO6bhJJu5nGY=; b=UvrI7uci/V05kvoMUf3ILFGeF/R2vrNsWDDvRZeB1AmYyS1o9UyAv2f66GZ5tWvutD jPDVN2oJpVizBYXfth4UD7YeuYwPlowmHWVQPbCwt05ME/b6wVLIV9kc2kPirv/OFl9i btPY2W2iucdmMEUFSBkdATSREsUNQ7vaoH149BgbQ3S86zaHC9oUg5Yw0rzLnq4G/rLg HuzDirca8bvvnoQqqiucI0eHE4eUPaOvKMPTWtbpp4+7mLuKzHKzvYuahaS9ZFfI2a3m gDBqSb5pjSbikdNeNk/H+bN1mm72C/boZEjpal5ZqWuH8ZObfBghEQkPuPZd4kwYlIyd lFHw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc; bh=/PP0ux4xnVggwZX9NLWdLDgmxpszMjAcO6bhJJu5nGY=; b=LZz8UlpfWJJzI2GaZXkXQHKj+6FfRnCI392xZYsEZvXeNoV5Vd/2+32DJUF1C8fUIs K3DVjwtxCwlH9P+5TWFA8UaZU4PwtjygyS1xssAkj1IbJq41NDi8TZ6B1qx7uDyrYHtn ZihFjEJNOPPl8a1TJI2hK+yDydJkc2IGyydRr98Gmi6LFIdIX3jcb0epb1X2a4tyO+LT TpWr3yWGmJYPlUInMY8AX+jK58a6MKnFifqo1nwAjPHhz3eAAsAMiOTgPPhNyf3WQmyx EVCm2ShtfU5MiPvpDa2UK0cH0kQRsNmXpy2MavM3TYuqQ50ee4DeuxVbNWhFq8hE21U7 /ReQ== X-Gm-Message-State: ACgBeo38bHE4xvquzkB4bxJbV9GbZbB3fMFwrGCKcJIv37kjNpmS/+w0 MM0D54xQxdad2crUu8Ao9N3t5Q== X-Received: by 2002:a17:902:bb94:b0:172:8fd9:7438 with SMTP id m20-20020a170902bb9400b001728fd97438mr4372382pls.174.1661445781918; Thu, 25 Aug 2022 09:43:01 -0700 (PDT) Received: from localhost.localdomain ([139.177.225.236]) by smtp.gmail.com with ESMTPSA id b18-20020a62a112000000b005362314bf80sm12779408pff.67.2022.08.25.09.42.56 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 25 Aug 2022 09:43:01 -0700 (PDT) From: Chengming Zhou To: hannes@cmpxchg.org, tj@kernel.org, mkoutny@suse.com, surenb@google.com Cc: mingo@redhat.com, peterz@infradead.org, gregkh@linuxfoundation.org, corbet@lwn.net, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, songmuchun@bytedance.com, Chengming Zhou Subject: [PATCH v4 05/10] sched/psi: optimize task switch inside shared cgroups again Date: Fri, 26 Aug 2022 00:41:06 +0800 Message-Id: <20220825164111.29534-6-zhouchengming@bytedance.com> X-Mailer: git-send-email 2.35.1 In-Reply-To: <20220825164111.29534-1-zhouchengming@bytedance.com> References: <20220825164111.29534-1-zhouchengming@bytedance.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Way back when PSI_MEM_FULL was accounted from the timer tick, task switching could simply iterate next and prev to the common ancestor to update TSK_ONCPU and be done. Then memstall ticks were replaced with checking curr->in_memstall directly in psi_group_change(). That meant that now if the task switch was between a memstall and a !memstall task, we had to iterate through the common ancestors at least ONCE to fix up their state_masks. We added the identical_state filter to make sure the common ancestor elimination was skipped in that case. It seems that was always a little too eager, because it caused us to walk the common ancestors *twice* instead of the required once: the iteration for next could have stopped at the common ancestor; prev could have updated TSK_ONCPU up to the common ancestor, then finish to the root without changing any flags, just to get the new curr->in_memstall into the state_masks. This patch recognizes this and makes it so that we walk to the root exactly once if state_mask needs updating, which is simply catching up on a missed optimization that could have been done in commit 7fae6c8171d2 ("psi: Use ONCPU state tracking machinery to detect reclaim") directly. Apart from this, it's also necessary for the next patch "sched/psi: remove NR_ONCPU task accounting". Suppose we walk the common ancestors twice: (1) psi_group_change(.clear = 0, .set = TSK_ONCPU) (2) psi_group_change(.clear = TSK_ONCPU, .set = 0) We previously used tasks[NR_ONCPU] to record TSK_ONCPU, tasks[NR_ONCPU]++ in (1) then tasks[NR_ONCPU]-- in (2), so tasks[NR_ONCPU] still be correct. The next patch change to use one bit in state mask to record TSK_ONCPU, PSI_ONCPU bit will be set in (1), but then be cleared in (2), which cause the psi_group_cpu has task running on CPU but without PSI_ONCPU bit set! With this patch, we will never walk the common ancestors twice, so won't have above problem. Suggested-by: Johannes Weiner Signed-off-by: Chengming Zhou --- kernel/sched/psi.c | 21 +++++++++------------ 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c index 77d53c03a76f..d71dbc2356ff 100644 --- a/kernel/sched/psi.c +++ b/kernel/sched/psi.c @@ -820,20 +820,15 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, u64 now = cpu_clock(cpu); if (next->pid) { - bool identical_state; - psi_flags_change(next, 0, TSK_ONCPU); /* - * When switching between tasks that have an identical - * runtime state, the cgroup that contains both tasks - * we reach the first common ancestor. Iterate @next's - * ancestors only until we encounter @prev's ONCPU. + * Set TSK_ONCPU on @next's cgroups. If @next shares any + * ancestors with @prev, those will already have @prev's + * TSK_ONCPU bit set, and we can stop the iteration there. */ - identical_state = prev->psi_flags == next->psi_flags; iter = NULL; while ((group = iterate_groups(next, &iter))) { - if (identical_state && - per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { + if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { common = group; break; } @@ -877,10 +872,12 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, psi_group_change(group, cpu, clear, set, now, wake_clock); /* - * TSK_ONCPU is handled up to the common ancestor. If we're tasked - * with dequeuing too, finish that for the rest of the hierarchy. + * TSK_ONCPU is handled up to the common ancestor. If there are + * any other differences between the two tasks (e.g. prev goes + * to sleep, or only one task is memstall), finish propagating + * those differences all the way up to the root. */ - if (sleep) { + if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) { clear &= ~TSK_ONCPU; for (; group; group = iterate_groups(prev, &iter)) psi_group_change(group, cpu, clear, set, now, wake_clock); -- 2.37.2