Subject: Re: [PATCH v8 3/6] cpuset: Add cpuset.sched.load_balance flag to v2
To: Peter Zijlstra
Cc: Tejun Heo, Li Zefan, Johannes Weiner, Ingo Molnar,
 cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, kernel-team@fb.com, pjt@google.com,
 luto@amacapital.net, Mike Galbraith, torvalds@linux-foundation.org,
 Roman Gushchin, Juri Lelli
References: <1526590545-3350-1-git-send-email-longman@redhat.com>
 <1526590545-3350-4-git-send-email-longman@redhat.com>
 <20180524154341.GJ12198@hirez.programming.kicks-ass.net>
 <9f547771-54e8-118e-80f7-48f99c7b0a12@redhat.com>
 <20180528124508.GE3452@worktop.programming.kicks-ass.net>
From: Waiman Long
Organization: Red Hat
Message-ID: <5487d08f-322c-15d3-06e2-f3d3408141d6@redhat.com>
Date: Mon, 28 May 2018 14:31:48 -0400
In-Reply-To: <20180528124508.GE3452@worktop.programming.kicks-ass.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On 05/28/2018 08:45 AM, Peter Zijlstra wrote:
> On Thu, May 24, 2018 at 02:55:25PM -0400, Waiman Long wrote:
>> On 05/24/2018 11:43 AM, Peter Zijlstra wrote:
>>> I'm confused... why exactly do we have both domain and load_balance ?
>> The domain is for partitioning the CPUs only. It doesn't change the load
>> balancing state. So the load_balance flag is still need to turn on and
>> off load balancing.
> OK, so we have to two boolean flags, giving 4 possible states. Lets just
> go through them one by on:
>
> A) domain:0 load_balance:0 -- we have no exclusive domain, but have
>    load-balancing disabled across them. AFAICT this should be an invalid
>    state.
>
> B) domain:0 load_balance:1 -- we have no exclusive domain, but have
>    load-balancing enabled. AFAICT this is the default state and is a
>    no-op.
>
> C) domain:1 load_balance:0 -- we have an exclusive domain, and have
>    load-balancing disabled across it.
>    This is, AFAICT, identical to
>    having a bunch of sub/sibling groups each with a single CPU domain.
>
> D) domain:1 load_balance:1 -- we have an exclusive domain, and have
>    load-balancing enabled. This is a partition.
>
> Now, I think I've overlooked the fact that load_balance==1 only really
> means something when the parent's load_balance==0, but I'm not sure that
> really changes anything.
>
> So, afaict, the above only have two useful states: B and D. Which again
> raises the question, why two knobs? What useful configurations does it
> allow?

I am working on the v9 patch, and below is the current draft of the
documentation. Hopefully it will clarify some of the concepts that we
are discussing here.

  cpuset.sched.domain_root
        A read-write single value file which exists on non-root
        cpuset-enabled cgroups. It is a binary value flag that accepts
        either "0" (off) or "1" (on). This flag is set by the parent
        and is not delegatable.

        If set, it indicates that the current cgroup is the root of a
        new scheduling domain or partition that comprises itself and
        all its descendants except those that are scheduling domain
        roots themselves and their descendants. The root cgroup is
        always a scheduling domain root.

        There are constraints on where this flag can be set. It can
        only be set in a cgroup if all the following conditions are
        true.

        1) The "cpuset.cpus" is not empty and the list of CPUs is
           exclusive, i.e. they are not shared by any of its siblings.
        2) The parent cgroup is also a scheduling domain root.
        3) There are no child cgroups with cpuset enabled. This is for
           eliminating corner cases that have to be handled if such a
           condition is allowed.

        Setting this flag will take the CPUs away from the effective
        CPUs of the parent cgroup. Once it is set, this flag cannot be
        cleared if there are any child cgroups with cpuset enabled.
        Further changes made to "cpuset.cpus" are allowed as long as
        the first condition above is still true.
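The three conditions for setting "cpuset.sched.domain_root" can be modeled in a few lines of Python. This is only a sketch under stated assumptions, not kernel code: the `Cpuset` class and `can_set_domain_root()` are hypothetical names made up for illustration, and every child in the model is assumed to be cpuset-enabled.

```python
class Cpuset:
    """Hypothetical in-memory model of a cpuset-enabled cgroup (not kernel code)."""

    def __init__(self, name, cpus=(), parent=None, domain_root=False):
        self.name = name
        self.cpus = set(cpus)          # models "cpuset.cpus"
        self.parent = parent
        self.children = []             # assumption: every child is cpuset-enabled
        self.domain_root = domain_root
        if parent is not None:
            parent.children.append(self)


def can_set_domain_root(cs):
    """Return True if all three conditions from the draft hold for cs."""
    if cs.parent is None:
        # The root cgroup is always a domain root and has no such file.
        return False
    siblings = [s for s in cs.parent.children if s is not cs]
    # 1) "cpuset.cpus" is non-empty and shared with no sibling.
    exclusive = bool(cs.cpus) and all(cs.cpus.isdisjoint(s.cpus) for s in siblings)
    # 2) The parent is itself a scheduling domain root.
    # 3) There are no (cpuset-enabled) children.
    return exclusive and cs.parent.domain_root and not cs.children


root = Cpuset("root", cpus=range(4), domain_root=True)
a = Cpuset("a", cpus={0, 1}, parent=root)
b = Cpuset("b", cpus={1, 2}, parent=root)

print(can_set_domain_root(a))  # False: CPU 1 is shared with sibling "b"
b.cpus = {2, 3}
print(can_set_domain_root(a))  # True: all three conditions hold
```

The model deliberately ignores effective CPUs and the rule about clearing the flag; it only captures the precondition check.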
        A parent scheduling domain root cgroup cannot distribute all
        its CPUs to its child scheduling domain root cgroups unless
        its load balancing flag is turned off.

  cpuset.sched.load_balance
        A read-write single value file which exists on non-root
        cpuset-enabled cgroups. It is a binary value flag that accepts
        either "0" (off) or "1" (on). This flag is set by the parent
        and is not delegatable. It is on by default in the root cgroup.

        When it is on, tasks within this cpuset will be load-balanced
        by the kernel scheduler. Tasks will periodically be moved from
        CPUs with high load to other CPUs within the same cpuset with
        less load.

        When it is off, there will be no load balancing among CPUs on
        this cgroup. Tasks will stay on the CPUs they are running on
        and will not be moved to other CPUs.

        The load balancing state of a cgroup can only be changed on a
        scheduling domain root cgroup with no cpuset-enabled children.
        All cgroups within a scheduling domain or partition must have
        the same load balancing state. As descendant cgroups of a
        scheduling domain root are created, they inherit the load
        balancing state of their root.

The main purpose of the new domain_root flag is to enable users to
create new partitions without the trick of disabling load_balance in the
parent and enabling it in the child. Now, we can create as many
partitions as we want without ever turning off load balancing in any of
the cpusets. I find it more straightforward and easier to understand
than the load_balance trick.

Of course, turning off load balancing is still useful in some use cases,
so it is supported. To simplify things, it is mandated that all the
cpusets within a partition must have the same load balancing state. This
ensures that the load_balance trick cannot be used to create additional
partitions underneath it. The domain_root flag is the only way to create
a partition.

A) domain_root: 0, load_balance: 0 -- a non-domain-root cpuset within a
   partition with load balancing off.
B) domain_root: 0, load_balance: 1 -- a non-domain-root cpuset within a
   partition with load balancing on.
C) domain_root: 1, load_balance: 0 -- the domain root cpuset of a
   partition with load balancing off.
D) domain_root: 1, load_balance: 1 -- the domain root cpuset of a
   partition with load balancing on.

Hope this helps.

Cheers,
Longman
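
For illustration only, the four states in the list above can be reproduced with a small Python model of the rule that every cpuset in a partition shares the load balancing state of its domain root. This is a sketch, not the patch's implementation: the `Cpuset` class and the helpers `partition_root()`, `load_balance()` and `state()` are hypothetical names, and the model assumes the flag is stored only on the domain root.

```python
class Cpuset:
    """Hypothetical model of a cpuset; only the two flags are represented."""

    def __init__(self, name, parent=None, domain_root=False, load_balance=True):
        self.name = name
        self.parent = parent
        # The root cgroup is always a scheduling domain root.
        self.domain_root = domain_root or parent is None
        # Assumption: only consulted on a domain root; descendants inherit it.
        self._load_balance = load_balance


def partition_root(cs):
    """Walk up to the nearest scheduling domain root (possibly cs itself)."""
    while not cs.domain_root:
        cs = cs.parent
    return cs


def load_balance(cs):
    """All cpusets in a partition share the load balancing state of its root."""
    return partition_root(cs)._load_balance


def state(cs):
    """Map a cpuset to the letter A-D used in the list above."""
    table = {(False, False): "A", (False, True): "B",
             (True, False): "C", (True, True): "D"}
    return table[(cs.domain_root, load_balance(cs))]


root = Cpuset("root")  # load balancing is on by default
p1 = Cpuset("p1", parent=root, domain_root=True, load_balance=False)
t1 = Cpuset("t1", parent=p1)   # inherits "off" from its partition root p1
x = Cpuset("x", parent=root)   # stays in the root partition

for cs in (root, p1, t1, x):
    print(cs.name, state(cs))
```

In this model, state A is reachable only below a domain root whose load balancing is off, which matches the description of A as a cpuset inside a partition with load balancing off rather than a free-standing configuration.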