Received: by 2002:a05:6358:45e:b0:b5:b6eb:e1f9 with SMTP id 30csp695783rwe; Wed, 24 Aug 2022 07:39:34 -0700 (PDT) X-Google-Smtp-Source: AA6agR5e2eibOOhsG59NhpW/Ike7WDGI52VIiWJgp36Rm6rUCxbvp9bebTrzQ3rVsTpM1hnSVj0B X-Received: by 2002:a17:902:8486:b0:171:48cd:1356 with SMTP id c6-20020a170902848600b0017148cd1356mr27983325plo.153.1661351974413; Wed, 24 Aug 2022 07:39:34 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1661351974; cv=none; d=google.com; s=arc-20160816; b=upo1Rp1mMzXWphXQt0W5vX+F9CvQua21sAkftb5+J9wsbuNL/4Uw31jmPz4YWU6kgO HlPXBJmPFaVdwQd9L5bPB9mSxcEA/vk0EDP+rihUIj7t2ntd9CriqjNaTXSxmsTtk1ej ElK27daDRszb8YzzmcyAHHSxZbQV29UTBYbAvi6kg2XfPfaxGlyqvEH8u52qpbg/2e3w G1fJi/TOZFC/9ihQ4Tj5lUYAf4WCho0bK5HomtTo2gTWhU2GQJWR+tJZY3Xf8QpRHOo5 VmhFS4Shno0RV1BW2Hmu5aHMFKEgRlLz5U+3F7V9XB9NIiH07rT8+S6pU/iQyh2fj/9h HuAQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:in-reply-to:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature; bh=WISgtRtj7XAbDMj5crexRB3gKRdBJedqKO8qGlTO7ac=; b=Im7UD1w6gpPH8yINlnTw8om85UoX8igdrGY3MIhxtdLS7KV9g5veULpKVjw40QHQSn aCe3668YBsC1PsVRvHc2QQGx7WR0O4Ak77NqHV8r5BaHMx9A8NMj1z6Vf0oiLo35IhYE /vyCasQ5/5dSP4q72sYEwNGU/S4rARSsC5URXfSpvkGGEwWDV68QGQYIt5eRNBlUR3G8 vhk2BhNQ68QVhc8XV4hYCAuU3rSfh58UfRB/rwdLhKAivoDg8N1wWyT4xsruZ6pTK5w8 Uoj2ACHkXEbeLyHERB3m85qlQpVkNF5JMivJ8gwBqd4wxD2+nD7V/zgpoOuRuQ3n9/Y3 ELZw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cmpxchg-org.20210112.gappssmtp.com header.s=20210112 header.b=T3LH1lzO; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id y68-20020a633247000000b0042b34b89285si679858pgy.323.2022.08.24.07.39.22; Wed, 24 Aug 2022 07:39:34 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@cmpxchg-org.20210112.gappssmtp.com header.s=20210112 header.b=T3LH1lzO; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=cmpxchg.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235867AbiHXOGf (ORCPT + 99 others); Wed, 24 Aug 2022 10:06:35 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:47342 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239128AbiHXOGO (ORCPT ); Wed, 24 Aug 2022 10:06:14 -0400 Received: from mail-ed1-x52b.google.com (mail-ed1-x52b.google.com [IPv6:2a00:1450:4864:20::52b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4FBBF98A5D for ; Wed, 24 Aug 2022 07:06:11 -0700 (PDT) Received: by mail-ed1-x52b.google.com with SMTP id w10so10062125edc.3 for ; Wed, 24 Aug 2022 07:06:11 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=cmpxchg-org.20210112.gappssmtp.com; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc; bh=WISgtRtj7XAbDMj5crexRB3gKRdBJedqKO8qGlTO7ac=; b=T3LH1lzO4KaEX4zn8CgVYuDuv1KauxJR+HLCXKEs3xkm360Emw4hVW/MuvafYZ6sl/ 3JPoNCsUwsH6+piwal7pIV4xPcGk8/LhMhEmlDsIW5wXG1GsuodbK5PAQMitAaN3SklC UDZ4VPj9lBpiLtgWK1EICxF2A+N21C/ZEJZAlVolTuMYN7KyxVC3LlArNy48s6WdarRN QM7u7L8czZFsRahByWnn++hZx6eXqbmVB8j1Vab8qkWaBUCGWKA7wqQqGZS8B2rDK/6s I50j59hZkbewhI7anqmsnrTDwIZg46TVpjej35gDcENEkxdo/tXAd8zJRfrcLqyBaNeq DT8A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc; bh=WISgtRtj7XAbDMj5crexRB3gKRdBJedqKO8qGlTO7ac=; b=Yg9qY7D8y5sbo84wwTSSqk00+SuDAYOt46ngefivREGEQVUGv2/eMqnyUoebTxaxEc ModBy4UPU7a/rieG/y+XoFcrrrMCvv2JJiCEV+qzsMoBJ5eqdUOwxLajKA82PU+4sPRG P1p1lfW43XxoEMSo2rr8Cy2J2iS4MlbgQe+VXCMzLk/oS5vuzNTKxb8U89/fNnyEb97N t7zPvX1EurOKhGIgiSApT7PW2QwbQQlupZ77C+cUqQmK0QFYMCDk5hZsPbXItEHv+pwi OQl7hi0cO/5wS1cdqtnTh4DiJKAhY2PNBFKbnx+wxbUgNFbbBZR4RTsRKLRTRuAlIkHu BxyQ== X-Gm-Message-State: ACgBeo2CRbKzieTu609+jQQ0S7ykJBKUDbDj1/xTY1vVBEVoYsDGMhNm GRbZlZAeq1RmmiGDUatzF1pJ6g== X-Received: by 2002:a05:6402:f14:b0:447:7944:558 with SMTP id i20-20020a0564020f1400b0044779440558mr3068928eda.29.1661349969800; Wed, 24 Aug 2022 07:06:09 -0700 (PDT) Received: from localhost ([2a02:8070:6389:a4c0:2ca9:6d59:782b:fff3]) by smtp.gmail.com with ESMTPSA id cz21-20020a0564021cb500b004465d1db765sm3103733edb.89.2022.08.24.07.06.08 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 24 Aug 2022 07:06:09 -0700 (PDT) Date: Wed, 24 Aug 2022 10:06:08 -0400 From: Johannes Weiner To: Chengming Zhou Cc: tj@kernel.org, mkoutny@suse.com, surenb@google.com, gregkh@linuxfoundation.org, corbet@lwn.net, mingo@redhat.com, peterz@infradead.org, songmuchun@bytedance.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org Subject: Re: [PATCH v3 05/10] sched/psi: optimize task switch inside shared cgroups again Message-ID: References: <20220824081829.33748-1-zhouchengming@bytedance.com> <20220824081829.33748-6-zhouchengming@bytedance.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20220824081829.33748-6-zhouchengming@bytedance.com> X-Spam-Status: No, score=-1.9 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Aug 24, 2022 at 04:18:24PM +0800, Chengming Zhou wrote: > commit 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") > defer prev task sleep handling to psi_task_switch(), so we don't need > to clear and set TSK_ONCPU state for common cgroups. > > A > | > B > / \ > C D > / \ > prev next > > After that commit psi_task_switch() do: > 1. psi_group_change(next, .set=TSK_ONCPU) for D > 2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C > 3. psi_group_change(prev, .clear=TSK_RUNNING) for B, A > > But there is a limitation "prev->psi_flags == next->psi_flags" that > if not satisfied, will make this cgroups optimization unusable for both > sleep switch or running switch cases. For example: > > prev->in_memstall != next->in_memstall when sleep switch: > 1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A > 2. psi_group_change(prev, .clear=TSK_ONCPU | TSK_RUNNING) for C, B, A > > prev->in_memstall != next->in_memstall when running switch: > 1. psi_group_change(next, .set=TSK_ONCPU) for D, B, A > 2. psi_group_change(prev, .clear=TSK_ONCPU) for C, B, A > > The reason why this limitation exist is that we consider a group is > PSI_MEM_FULL if the CPU is actively reclaiming and nothing productive > could run even if it were runnable. So when CPU curr changed from prev > to next and their in_memstall status is different, we have to change > PSI_MEM_FULL status for their common cgroups. > > This patch remove this limitation by making psi_group_change() change > PSI_MEM_FULL status depend on CPU curr->in_memstall status. > > Signed-off-by: Chengming Zhou Hoo boy, that took me a second. Way back when PSI_MEM_FULL was accounted from the timer tick, task switching could simply iterate next and prev to the common ancestor to update TSK_ONCPU and be done. Then memstall ticks were replaced with checking curr->in_memstall directly in psi_group_change(). That meant that now if the task switch was between a memstall and a !memstall task, we had to iterate through the common ancestors at least ONCE to fix up their state_masks. We added the identical_state filter to make sure the common ancestor elimination was skipped in that case. It seems that was always a little too eager, because it caused us to walk the common ancestors *twice* instead of the required once: the iteration for next could have stopped at the common ancestor; prev could have updated TSK_ONCPU up to the common ancestor, then finish to the root without changing any flags, just to get the new curr->in_memstall into the state_masks. This patch recognizes this and makes it so that we walk to the root exactly once if state_mask needs updating. Unless I missed anything, would you mind adding this to the changelog? I'm not quite sure how 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups") fits into the picture. That optimized the sleep case, but the sleep case never had the common ancestor optimization (the dq would have already cleared TSK_ONCPU up to the root). Let me know if I am mistaken. AFAICS I can see, this patch here is simply catching up on a missed optimization that could have been done in 7fae6c8171d2 ("psi: Use ONCPU state tracking machinery to detect reclaim") directly already. So I think it all makes sense. I have just two notes on the diff: > @@ -820,8 +820,6 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > u64 now = cpu_clock(cpu); > > if (next->pid) { > - bool identical_state; > - > psi_flags_change(next, 0, TSK_ONCPU); > /* > * When switching between tasks that have an identical > @@ -829,11 +827,9 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > * we reach the first common ancestor. Iterate @next's > * ancestors only until we encounter @prev's ONCPU. > */ The comment is rather stale now. Could you change it to this? /* * Set TSK_ONCPU on @next's cgroups. If @next shares any * ancestors with @prev, those will already have @prev's * TSK_ONCPU bit set, and we can stop the iteration there. */ > - identical_state = prev->psi_flags == next->psi_flags; > iter = NULL; > while ((group = iterate_groups(next, &iter))) { > - if (identical_state && > - per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { > + if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) { > common = group; > break; > } > @@ -880,7 +876,7 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next, > * TSK_ONCPU is handled up to the common ancestor. If we're tasked > * with dequeuing too, finish that for the rest of the hierarchy. > */ > - if (sleep) { > + if (sleep || unlikely(prev->in_memstall != next->in_memstall)) { > clear &= ~TSK_ONCPU; > for (; group; group = iterate_groups(prev, &iter)) > psi_group_change(group, cpu, clear, set, now, wake_clock); Okay, this computes too. But it is somewhat special-cased, without explaining why the memstall state in particular matters. Instead of focusing on the exceptions though, can we just generalize this a bit? /* * TSK_ONCPU is handled up to the common ancestor. If there are * any other differences between the two tasks (e.g. prev goes * to sleep, or only one task is memstall), finish propagating * those differences all the way up to the root. */ if ((prev->psi_flags ^ next->psi_flags) & ~TSK_ONCPU) { clear &= ~TSK_ONCPU; for (; group; group = iterate_groups(prev, &iter)) psi_group_change(group, cpu, clear, set, now, wake_clock); } Thanks Johannes