Subject: Re: [PATCH v8 1/6] cpuset: Enable cpuset controller in default hierarchy
To: Patrick Bellasi
Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar,
    cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-doc@vger.kernel.org, kernel-team@fb.com, pjt@google.com,
    luto@amacapital.net, Mike Galbraith, torvalds@linux-foundation.org,
    Roman Gushchin, Juri Lelli
References: <1526590545-3350-1-git-send-email-longman@redhat.com>
    <1526590545-3350-2-git-send-email-longman@redhat.com>
    <20180521115528.GR30654@e110439-lin>
    <20180521150924.GS30654@e110439-lin>
From: Waiman Long <longman@redhat.com>
Organization: Red Hat
Date: Mon, 21 May 2018 12:10:47 -0400
Message-ID: <80841a12-6f82-91ae-8925-3092398efe32@redhat.com>
In-Reply-To: <20180521150924.GS30654@e110439-lin>

On 05/21/2018 11:09 AM, Patrick Bellasi wrote:
> On 21-May 09:55, Waiman Long wrote:
>
>> Changing cpuset.cpus will require searching for all the tasks in
>> the cpuset and changing their cpu masks.
>
> ... I'm wondering if that has to be the case. In principle there can
> be a different solution, which is: update on demand. In the wakeup
> path, once we know a task really needs a CPU and we want to find one
> for it, at that point we can align the cpuset mask with the task's
> one. Sort of using the cpuset mask as a clamp on top of the task's
> affinity mask.
>
> The main downside of such an approach could be the overheads in the
> wakeup path... but, still... that should be measured.
> The advantage is that we do not spend time changing attributes of
> tasks which, potentially, could be sleeping for a long time.

We already have a linked list of tasks in a cgroup, so it isn't too
hard to find them. Doing the update on demand would require adding a
bunch of code to the wakeup path.
So unless there is a good reason to do it, I don't see it as necessary
at this point.

>> That isn't a fast operation, but it shouldn't be too bad either,
>> depending on how many tasks are in the cpuset.
>
> Indeed, although it still seems a bit odd and overkill to update the
> task affinity of tasks which are not currently RUNNABLE. Isn't it?
>
>> I would not suggest doing rapid changes to cpuset.cpus as a means to
>> tune the behavior of a task. So what exactly is the tuning you are
>> thinking about? Is it moving a task from a high-power cpu to a
>> low-power one or vice versa?
>
> That's definitely a possible use case. In Android, for example, we
> usually assign more resources to TOP_APP tasks (those belonging to the
> application you are currently using) while we restrict the resources
> once we switch an app to BACKGROUND.

Switching an app between foreground and background shouldn't happen
that frequently -- maybe once every few seconds, at most. I am just
wondering what use cases would require changing cpuset attributes tens
of times per second.

> More in general, if you think about a generic Run-Time Resource
> Management framework which assigns resources to the tasks of multiple
> applications, you want to have fine-grained control.
>
>> If so, it is probably better to move the task from one cpuset of
>> high-power cpus to another cpuset of low-power cpus.
>
> This is what Android does now, but it is also what we want to
> possibly change, for two main reasons:
>
> 1. it does not fit with the "number one guideline" for proper
>    cgroup usage, which is "Organize Once and Control":
>       https://elixir.bootlin.com/linux/latest/source/Documentation/cgroup-v2.txt#L518
>    where it says that:
>       migrating processes across cgroups frequently as a means to
>       apply different resource restrictions is discouraged.
>
>    Despite this guideline, it turns out that, in v1 at least, it
>    seems to be faster to move tasks across cpusets than to tune
>    cpuset attributes...
>    also when all the tasks are sleeping.

It is probably similar in v2, as the core logic is almost the same.

> 2. it does not allow to get the advantages of accounting controllers
>    such as the memory controller where, by moving tasks around, we
>    cannot properly account and control the amount of memory a task
>    can use.

For v1, the memory controller and the cpuset controller can be in
different hierarchies. For v2, we have a unified hierarchy. However,
we don't need to enable all the controllers at every level of the
hierarchy. For example,

    A (memory, cpuset) --- B1 (cpuset)
                        \- B2 (cpuset)

Cgroup A has the memory and cpuset controllers enabled. The child
cgroups B1 and B2 have only cpuset enabled. You can move tasks between
B1 and B2, and they will be subject to the same memory limitation as
imposed by the memory controller in A. So there are ways to work
around that.

> Thus, for these reasons, and also to possibly migrate to the unified
> hierarchy schema proposed by cgroups v2... we would like a
> low-overhead mechanism for setting/tuning cpusets at run-time with
> whatever frequency you like.

We may be able to improve the performance of changing cpuset
attributes somewhat, but I don't believe there will be much
improvement here.

>>>> +
>>>> +The "cpuset" controller is hierarchical.  That means the controller
>>>> +cannot use CPUs or memory nodes not allowed in its parent.
>>>> +
>>>> +
>>>> +Cpuset Interface Files
>>>> +~~~~~~~~~~~~~~~~~~~~~~
>>>> +
>>>> +  cpuset.cpus
>>>> +	A read-write multiple values file which exists on non-root
>>>> +	cpuset-enabled cgroups.
>>>> +
>>>> +	It lists the CPUs allowed to be used by tasks within this
>>>> +	cgroup.  The CPU numbers are comma-separated numbers or
>>>> +	ranges.  For example:
>>>> +
>>>> +	  # cat cpuset.cpus
>>>> +	  0-4,6,8-10
>>>> +
>>>> +	An empty value indicates that the cgroup is using the same
>>>> +	setting as the nearest cgroup ancestor with a non-empty
>>>> +	"cpuset.cpus" or all the available CPUs if none is found.
>>> Does that mean that we can move tasks into a newly created group
>>> for which we have not yet configured this value?
>>> AFAIK, that's a different behavior wrt v1... and I like it better.
>>>
>> For v2, if you haven't set up cpuset.cpus, it defaults to the
>> effective cpu list of its parent.
>
> +1
>
>>>> +
>>>> +	The value of "cpuset.cpus" stays constant until the next update
>>>> +	and won't be affected by any CPU hotplug events.
>>> This also sounds interesting; does it mean that we use the
>>> cpuset.cpus mask to restrict online CPUs, whatever they are?
>> cpuset.cpus holds the cpu list written by the users.
>> cpuset.cpus.effective is the actual cpu mask that is being used. The
>> effective cpu mask is always a subset of cpuset.cpus. They differ if
>> not all the CPUs in cpuset.cpus are online.
> And that's fine: the effective mask is updated based on HP events.
>
> The main limitation on this side, so far, is that in
> update_tasks_cpumask() we walk all the tasks and call
> set_cpus_allowed_ptr() on each of them, independently of whether they
> are RUNNABLE or not. Isn't that so?

That is true.

> Thus, this ensures a valid mask at wakeup time, but perhaps it's not
> such a big overhead to do the same update on the wakeup path... thus
> speeding up update_cpumasks_hier() quite a lot, especially when you
> have many SLEEPING tasks in a cpuset.
>
> A first measurement and tracing show that this update can cost up to
> 4ms on a Pixel 2 device when you update the cpus of a cpuset
> containing a single task that is always sleeping.

The 4ms cost is more than what I would have expected. If you think
delaying the update until wakeup time is the right move, you can
create a patch to do that and we can discuss the merits of doing so on
LKML.
>
>>> I'll have a better look at the code, but my understanding of v1 is
>>> that we spent a lot of effort to keep task cpu-affinity masks
>>> aligned with the cpuset in which they live, and we do something
>>> similar at each HP event, which ultimately generates a lot of
>>> overhead in systems where you have many HP events and/or
>>> cpuset.cpus changes quite frequently.
>>>
>>> I hope to find some better behavior in this series.
>>>
>> The behavior of the CPU offline event should be similar in v2. Any
>> HP event will cause the system to reset the cpu masks of the tasks
>> affected by the event. The online event, however, will be a bit
>> different between v1 and v2. For v1, the online event won't restore
>> the CPU back to those cpusets that previously had the onlined CPU.
>> For v2, the onlined CPU will be restored back to those cpusets. So
>> there is less work for the management layer, but the overhead of
>> doing the restore is still there in the kernel.
>
> On that side, I still have to look more closely into the v1 and v2
> implementations, but for the util_clamp extension of the cpu
> controller:
>    https://lkml.org/lkml/2018/4/9/601
> I'm proposing a different update schema which, it seems, can give you
> the benefits of "restoring the mask" after a UP event as well as a
> fast update/tuning path at run-time.
>
> Along the lines of the above implementation, it would mean that the
> task affinity mask is constrained/clamped/masked by the TG's affinity
> mask. This should be an operation performed "on-demand" whenever it
> makes sense.
>
> However, to be honest, I never measured the overhead of combining two
> cpu masks, and it could very well be overkill for the wakeup path. I
> don't think the AND by itself should be an issue, since it's already
> used in the fast wakeup path, e.g.
>
>    select_task_rq_fair()
>        select_idle_sibling()
>            select_idle_core()
>                cpumask_and(cpus, sched_domain_span(sd),
>                            &p->cpus_allowed);
>
> What eventually could be an issue is the race between the scheduler
> looking at the cpuset cpumasks and cgroups changing them... but
> perhaps that's something that could be fixed with a proper locking
> mechanism.
>
> I will try to run some experiments to at least collect some overhead
> numbers.

Collecting more information on where the slowdown is will be helpful.

-Longman