Received: by 2002:a6b:500f:0:0:0:0:0 with SMTP id e15csp124680iob; Tue, 3 May 2022 13:06:07 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxTYY2v7sWTVGSZ5LtqvOYnUnqTIRwadplkgHTUDggoqWTK8UT6CUrnKDs1pUOEE8Y/Z3kn X-Received: by 2002:a05:6a00:2148:b0:4fa:92f2:bae3 with SMTP id o8-20020a056a00214800b004fa92f2bae3mr17484184pfk.69.1651608367488; Tue, 03 May 2022 13:06:07 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1651608367; cv=none; d=google.com; s=arc-20160816; b=bbz1aYXPIJFSjzfz1zbSjzoemXg0C+IInor9Waoj0oxMULs5NcW8YkBV2ofc8W4sY1 U8DsiNvSq0u/2HoVZyWOp8ZnDuneUIs8kye2oRYsgErsM0Zs+xVneMQroAQ04szyuie3 ZlfJ8yP74T30xEU+GTuGgoWEGgTuWXM4G5y3NssBe7NubFmH9uSAIcHuky5EGHG4GbRx 7Ib7Ml32iVFRAYdgKXpVQuO03gO1OMI/eCiJKqTe7sxp6/xSPOZMb0FgPHbnoJ2lWcyy XAxYabDvsjDoYrTurhn4A+7ftH5liFNy6occNxkHXOCsN14uTeyL204irBR4kbQ7E2fo NDvw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:mime-version :references:in-reply-to:message-id:date:subject:cc:to:from :dkim-signature; bh=2rVia1pQrXLAijSVfWAWpBEkZYOPuySuFEwnkNUZOK4=; b=uLHr/RhYMrK+4RSfNNeHKNN2xOEXyVDT5iaFTEZgNjlcr9YFHRrHldJ+HucV/CaGQx NlUBgMfBYYLpmGQP1fVz7Y4BUroJnUJ4kH0HaVdE0VjfwhbgVja7oaBMhtaRGMqIWRo2 tweLsG5Z0/eh4zLJf5H8NKWra2uJBC34mvHcaNkJRbiQy47hXfgsqywjfMwRhJ2P1YfQ uV+Yj9/8KBshI2HLc/Pn0perGcYd0uldLDur/rmHFuEEaz5EJIr+6+Kme7jWaaHE75t4 mKi2OR9G9D+yCoAKu+CJKTFv+vyfMHAtdba98KWWmLOZIDD/t7pmBhP7bNOtcxeHyStR zvAA== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="KAd2kUw/"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id 189-20020a6215c6000000b0050d3f057ea9si16733294pfv.235.2022.05.03.13.05.48; Tue, 03 May 2022 13:06:07 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@redhat.com header.s=mimecast20190719 header.b="KAd2kUw/"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=redhat.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232452AbiECQ02 (ORCPT + 99 others); Tue, 3 May 2022 12:26:28 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37802 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S239377AbiECQ0U (ORCPT ); Tue, 3 May 2022 12:26:20 -0400 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTP id 90FB239815 for ; Tue, 3 May 2022 09:22:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1651594966; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=2rVia1pQrXLAijSVfWAWpBEkZYOPuySuFEwnkNUZOK4=; b=KAd2kUw/W6TfmtAu/a1afQ0MezTTfjpERAu+aCiHGoY3CiajA3ZCbRC8cyx5pjb1T+MwIm hJ6qyYMZAx3w6gq9YAx/NMEFsKlp35bGZABWHy1qv1FcqWZPTk96FDOiV5nRBOjR2rvlHg LKb27gbRznJJxi0WccyPIWZmg7fSKz4= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-316-KuRXog86OYG_rKIEwm-Shg-1; Tue, 03 May 2022 12:22:40 -0400 X-MC-Unique: KuRXog86OYG_rKIEwm-Shg-1 Received: from smtp.corp.redhat.com (int-mx10.intmail.prod.int.rdu2.redhat.com [10.11.54.10]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id C5A8AC77B43; Tue, 3 May 2022 16:22:39 +0000 (UTC) Received: from llong.com (dhcp-17-215.bos.redhat.com [10.18.17.215]) by smtp.corp.redhat.com (Postfix) with ESMTP id 576E8463E0B; Tue, 3 May 2022 16:22:39 +0000 (UTC) From: Waiman Long To: Tejun Heo , Zefan Li , Johannes Weiner , Jonathan Corbet , Shuah Khan Cc: cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, Andrew Morton , Roman Gushchin , Phil Auld , Peter Zijlstra , Juri Lelli , Frederic Weisbecker , Marcelo Tosatti , =?UTF-8?q?Michal=20Koutn=C3=BD?= , Waiman Long Subject: [PATCH v10 5/8] cgroup/cpuset: Add a new isolated cpus.partition type Date: Tue, 3 May 2022 12:21:46 -0400 Message-Id: <20220503162149.1764245-6-longman@redhat.com> In-Reply-To: <20220503162149.1764245-1-longman@redhat.com> References: <20220503162149.1764245-1-longman@redhat.com> MIME-Version: 1.0 Content-type: text/plain Content-Transfer-Encoding: 8bit X-Scanned-By: MIMEDefang 2.85 on 10.11.54.10 X-Spam-Status: No, score=-3.4 required=5.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_LOW, SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Cpuset v1 uses the sched_load_balance control file to determine if load balancing should be enabled. Cpuset v2 gets rid of sched_load_balance as its use may require disabling load balancing at cgroup root. For workloads that require very low latency like DPDK, the latency jitters caused by periodic load balancing may exceed the desired latency limit. When cpuset v2 is in use, the only way to avoid this latency cost is to use the "isolcpus=" kernel boot option to isolate a set of CPUs. After the kernel boot, however, there is no way to add or remove CPUs from this isolated set. For workloads that are more dynamic in nature, that means users have to provision enough CPUs for the worst case situation resulting in excess idle CPUs. To address this issue for cpuset v2, a new cpuset.cpus.partition type "isolated" is added which allows the creation of a cpuset partition without load balancing. This will allow system administrators to dynamically adjust the size of isolated partition to the current need of the workload without rebooting the system. Signed-off-by: Waiman Long --- kernel/cgroup/cpuset.c | 74 +++++++++++++++++++++++++++++++++++------- 1 file changed, 63 insertions(+), 11 deletions(-) diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c index 1f1d76013046..073db69ac8fc 100644 --- a/kernel/cgroup/cpuset.c +++ b/kernel/cgroup/cpuset.c @@ -177,11 +177,15 @@ struct cpuset { * * 0 - member (not a partition root) * 1 - partition root + * 2 - partition root without load balancing (isolated) * -1 - invalid partition root + * -2 - invalid isolated partition root */ #define PRS_MEMBER 0 #define PRS_ROOT 1 +#define PRS_ISOLATED 2 #define PRS_INVALID_ROOT -1 +#define PRS_INVALID_ISOLATED -2 static inline bool is_prs_invalid(int prs_state) { @@ -278,7 +282,8 @@ static inline int is_partition_invalid(const struct cpuset *cs) static inline void set_partition_invalid(struct cpuset *cs) { - cs->partition_root_state = PRS_INVALID_ROOT; + if (is_partition_valid(cs)) + cs->partition_root_state = -cs->partition_root_state; } /* @@ -1369,17 +1374,19 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd, if (cmd == partcmd_update) { /* - * Check for possible transition between PRS_ROOT - * and PRS_INVALID_ROOT. + * Check for possible transition between valid and invalid + * partition root. */ switch (cs->partition_root_state) { case PRS_ROOT: + case PRS_ISOLATED: if (part_error) - new_prs = PRS_INVALID_ROOT; + new_prs = -old_prs; break; case PRS_INVALID_ROOT: + case PRS_INVALID_ISOLATED: if (!part_error) - new_prs = PRS_ROOT; + new_prs = -old_prs; break; } } @@ -1389,7 +1396,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd, /* * Transitioning from invalid to valid (partcmd_update) may require - * setting CS_CPU_EXCLUSIVE. + * setting CS_CPU_EXCLUSIVE and clearing CS_SCHED_LOAD_BALANCE later. */ if ((old_prs != new_prs) && is_prs_invalid(old_prs)) { if (!is_cpu_exclusive(cs) && @@ -1430,8 +1437,17 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd, if (adding || deleting) update_tasks_cpumask(parent); + /* + * Set or clear CS_SCHED_LOAD_BALANCE when partcmd_update, if necessary. + * rebuild_sched_domains_locked() may be called. + */ + if ((old_prs != new_prs) && (cmd == partcmd_update)) { + if (old_prs == PRS_ISOLATED) + update_flag(CS_SCHED_LOAD_BALANCE, cs, 1); + else if (new_prs == PRS_ISOLATED) + update_flag(CS_SCHED_LOAD_BALANCE, cs, 0); + } notify_partition_change(cs, old_prs); - return 0; } @@ -1506,6 +1522,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp, if ((cp != cs) && old_prs) { switch (parent->partition_root_state) { case PRS_ROOT: + case PRS_ISOLATED: update_parent = true; break; @@ -1515,7 +1532,8 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp, * invalid, child partition roots become * invalid too. */ - new_prs = PRS_INVALID_ROOT; + if (is_partition_valid(cp)) + new_prs = -cp->partition_root_state; break; } } @@ -2097,6 +2115,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs, static int update_prstate(struct cpuset *cs, int new_prs) { int err = 0, old_prs = cs->partition_root_state; + bool sched_domain_rebuilt = false; struct cpuset *parent = parent_cs(cs); struct tmpmasks tmpmask; @@ -2107,8 +2126,10 @@ static int update_prstate(struct cpuset *cs, int new_prs) * For a previously invalid partition root, leave it at being * invalid if new_prs is not "member". */ - if (new_prs && is_prs_invalid(old_prs)) + if (new_prs && is_prs_invalid(old_prs)) { + cs->partition_root_state = -new_prs; return 0; + } if (alloc_cpumasks(NULL, &tmpmask)) return -ENOMEM; @@ -2134,6 +2155,22 @@ static int update_prstate(struct cpuset *cs, int new_prs) update_flag(CS_CPU_EXCLUSIVE, cs, 0); goto out; } + + if (new_prs == PRS_ISOLATED) { + /* + * Disable the load balance flag should not return an + * error unless the system is running out of memory. + */ + update_flag(CS_SCHED_LOAD_BALANCE, cs, 0); + sched_domain_rebuilt = true; + } + } else if (old_prs && new_prs) { + /* + * A change in load balance state only, no change in cpumasks. + */ + update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED)); + sched_domain_rebuilt = true; + goto out; /* Sched domain is rebuilt in update_flag() */ } else { /* * Switching back to member is always allowed even if it @@ -2155,6 +2192,12 @@ static int update_prstate(struct cpuset *cs, int new_prs) /* Turning off CS_CPU_EXCLUSIVE will not return error */ update_flag(CS_CPU_EXCLUSIVE, cs, 0); + + if (!is_sched_load_balance(cs)) { + /* Make sure load balance is on */ + update_flag(CS_SCHED_LOAD_BALANCE, cs, 1); + sched_domain_rebuilt = true; + } } update_tasks_cpumask(parent); @@ -2162,13 +2205,14 @@ static int update_prstate(struct cpuset *cs, int new_prs) if (parent->child_ecpus_count) update_sibling_cpumasks(parent, cs, &tmpmask); - rebuild_sched_domains_locked(); + if (!sched_domain_rebuilt) + rebuild_sched_domains_locked(); out: /* * Make partition invalid if an error happen */ if (err) - new_prs = PRS_INVALID_ROOT; + new_prs = -new_prs; spin_lock_irq(&callback_lock); cs->partition_root_state = new_prs; spin_unlock_irq(&callback_lock); @@ -2680,12 +2724,18 @@ static int sched_partition_show(struct seq_file *seq, void *v) case PRS_ROOT: seq_puts(seq, "root\n"); break; + case PRS_ISOLATED: + seq_puts(seq, "isolated\n"); + break; case PRS_MEMBER: seq_puts(seq, "member\n"); break; case PRS_INVALID_ROOT: seq_puts(seq, "root invalid\n"); break; + case PRS_INVALID_ISOLATED: + seq_puts(seq, "isolated invalid\n"); + break; } return 0; } @@ -2706,6 +2756,8 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf, val = PRS_ROOT; else if (!strcmp(buf, "member")) val = PRS_MEMBER; + else if (!strcmp(buf, "isolated")) + val = PRS_ISOLATED; else return -EINVAL; -- 2.27.0