Date: Mon, 21 May 2018 16:09:24 +0100
From: Patrick Bellasi
To: Waiman Long
Cc: Tejun Heo, Li Zefan, Johannes Weiner, Peter Zijlstra, Ingo Molnar, cgroups@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, kernel-team@fb.com, pjt@google.com, luto@amacapital.net, Mike Galbraith, torvalds@linux-foundation.org, Roman Gushchin, Juri Lelli
Subject: Re: [PATCH v8 1/6] cpuset: Enable cpuset controller in default hierarchy
Message-ID: <20180521150924.GS30654@e110439-lin>
References: <1526590545-3350-1-git-send-email-longman@redhat.com> <1526590545-3350-2-git-send-email-longman@redhat.com> <20180521115528.GR30654@e110439-lin>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent:
Mutt/1.5.24 (2015-08-30)
Sender: linux-kernel-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org

On 21-May 09:55, Waiman Long wrote:
> On 05/21/2018 07:55 AM, Patrick Bellasi wrote:
> > Hi Waiman!

[...]

> >> +Cpuset
> >> +------
> >> +
> >> +The "cpuset" controller provides a mechanism for constraining
> >> +the CPU and memory node placement of tasks to only the resources
> >> +specified in the cpuset interface files in a task's current cgroup.
> >> +This is especially valuable on large NUMA systems where placing jobs
> >> +on properly sized subsets of the systems with careful processor and
> >> +memory placement to reduce cross-node memory access and contention
> >> +can improve overall system performance.
> > Another quite important use-case for cpuset is Android, where they are
> > actively used to do both power-saving as well as performance tunings.
> > For example, depending on the status of an application, its threads
> > can be allowed to run on all available CPUs (e.g. foreground apps) or
> > be restricted to only a few energy efficient CPUs (e.g. background apps).
> >
> > Since here we are at "rewriting" cpusets for v2, I think it's important
> > to keep this mobile world scenario into consideration.
> >
> > For example, in this context, we are looking at the possibility to
> > update/tune cpuset.cpus with a relatively high rate, i.e. tens of
> > times per second. Not sure that's the same update rate usually
> > required for the large NUMA systems you cite above. However, in this
> > case it's quite important to have really small overheads for these
> > operations.
>
> The cgroup interface isn't designed for high update throughput.

Indeed, I had the same impression...

> Changing cpuset.cpus will require searching for all the tasks in
> the cpuset and changing their cpu masks.

... I'm wondering if that has to be the case. In principle there can be
a different solution which is: update on demand.
In the wakeup path, once we know a task really needs a CPU and we want
to find one for it, at that point we can align the cpuset mask with
the task's one. Sort of using the cpuset mask as a clamp on top of the
task's affinity mask.

The main downside of such an approach could be the overheads in the
wakeup path... but, still... that should be measured.
The advantage is that we do not spend time changing attributes of
tasks which, potentially, could be sleeping for a long time.

> That isn't a fast operation, but it shouldn't be too bad either
> depending on how many tasks are in the cpuset.

Indeed, although it still seems a bit odd and overkill to update the
task affinity for tasks which are not currently RUNNABLE. Doesn't it?

> I would not suggest doing rapid changes to cpuset.cpus as a means to tune
> the behavior of a task. So what exactly is the tuning you are thinking
> about? Is it moving a task from a high-power cpu to a low-power one
> or vice versa?

That's definitely a possible use case. In Android for example we
usually assign more resources to TOP_APP tasks (those belonging to the
application you are currently using) while we restrict the resources
once we switch an app to be in BACKGROUND.

More in general, think about a generic Run-Time Resource Management
framework, which assigns resources to the tasks of multiple
applications and wants to have fine-grained control.

> If so, it is probably better to move the task from one cpuset of
> high-power cpus to another cpuset of low-power cpus.

This is what Android does now but also what we want to possibly change,
for two main reasons:

1. it does not fit with the "number one guideline" for proper
   CGroups usage, which is "Organize Once and Control":
      https://elixir.bootlin.com/linux/latest/source/Documentation/cgroup-v2.txt#L518
   where it says that:
      migrating processes across cgroups frequently as a means to
      apply different resource restrictions is discouraged.
   Despite this guideline, it turns out that in v1 at least, it seems
   to be faster to move tasks across cpusets than to tune cpuset
   attributes... also when all the tasks are sleeping.

2. it does not allow us to take advantage of accounting controllers
   such as the memory controller where, by moving tasks around, we
   cannot properly account and control the amount of memory a task
   can use.

Thus, for these reasons and also to possibly migrate to the unified
hierarchy schema proposed by CGroups v2... we would like a
low-overhead mechanism for setting/tuning cpuset at run-time with
whatever frequency you like.

> >> +
> >> +The "cpuset" controller is hierarchical. That means the controller
> >> +cannot use CPUs or memory nodes not allowed in its parent.
> >> +
> >> +
> >> +Cpuset Interface Files
> >> +~~~~~~~~~~~~~~~~~~~~~~
> >> +
> >> +  cpuset.cpus
> >> +        A read-write multiple values file which exists on non-root
> >> +        cpuset-enabled cgroups.
> >> +
> >> +        It lists the CPUs allowed to be used by tasks within this
> >> +        cgroup. The CPU numbers are comma-separated numbers or
> >> +        ranges. For example:
> >> +
> >> +          # cat cpuset.cpus
> >> +          0-4,6,8-10
> >> +
> >> +        An empty value indicates that the cgroup is using the same
> >> +        setting as the nearest cgroup ancestor with a non-empty
> >> +        "cpuset.cpus" or all the available CPUs if none is found.
> > Does that mean that we can move tasks into a newly created group for
> > which we have not yet configured this value?
> > AFAIK, that's a different behavior wrt v1... and I like it better.
> >
>
> For v2, if you haven't set up the cpuset.cpus, it defaults to the
> effective cpu list of its parent.

+1

>
> >> +
> >> +        The value of "cpuset.cpus" stays constant until the next update
> >> +        and won't be affected by any CPU hotplug events.
> > This also sounds interesting, does it mean that we use the
> > cpuset.cpus mask to restrict online CPUs, whatever they are?
>
> cpuset.cpus holds the cpu list written by the users.
> cpuset.cpus.effective is the actual cpu mask that is being used. The
> effective cpu mask is always a subset of cpuset.cpus. They differ if not
> all the CPUs in cpuset.cpus are online.

And that's fine: the effective mask is updated based on HP events.

The main limitation on this side, so far, is that in
update_tasks_cpumask() we walk all the tasks and call
set_cpus_allowed_ptr() on each of them, regardless of whether they
are RUNNABLE or not. Isn't it?

Thus, this will ensure to have a valid mask at wakeup time, but
perhaps it's not such a big overhead to update the same on the wakeup
path... thus speeding up quite a lot the update_cpumasks_hier(),
especially when you have many SLEEPING tasks in a cpuset.

A first measurement and tracing shows that this update could cost up
to 4ms on a Pixel2 device where you update the cpus for a cpuset
containing a single task which is always sleeping.

> > I'll have a better look at the code, but my understanding of v1 is
> > that we spent a lot of effort to keep task cpu-affinity masks aligned
> > with the cpuset in which they live, and we do something similar at each
> > HP event, which ultimately generates a lot of overheads in systems
> > where: you have many HP events and/or cpuset.cpus change quite
> > frequently.
> >
> > I hope to find some better behavior in this series.
> >
>
> The behavior of CPU offline event should be similar in v2. Any HP event
> will cause the system to reset the cpu masks of tasks affected by the
> event. The online event, however, will be a bit different between v1 and
> v2. For v1, the online event won't restore the CPU back to those cpusets
> that had the onlined CPU previously. For v2, the online CPU will
> be restored back to those cpusets. So there is less work from the
> management layer, but overhead is still there in the kernel of doing the
> restore.
On that side, I still have to better look into the v1 and v2
implementations, but for the util_clamp extension of the cpu
controller:
   https://lkml.org/lkml/2018/4/9/601
I'm proposing a different update schema which it seems can give you
the benefits of "restoring the mask" after an UP event as well as a
fast update/tuning path at run-time.

Along the line of the above implementation, it would mean that the task
affinity mask is constrained/clamped/masked by the TG's affinity mask.
This should be an operation performed "on-demand" whenever it makes
sense.

However, to be honest, I never measured the overheads to combine two
cpu masks and it can very well be overkill for the wakeup path. I
don't think the AND by itself should be an issue, since it's already
used in the fast wakeup path, e.g.

   select_task_rq_fair()
     select_idle_sibling()
       select_idle_core()
         cpumask_and(cpus, sched_domain_span(sd), &p->cpus_allowed);

What eventually could be an issue is the race between the scheduler
looking at the cpuset cpumask and cgroups changing it... but perhaps
that's something that could be fixed with a proper locking mechanism.

I will try to run some experiments to at least collect some overhead
numbers.

[...]

> >> @@ -2104,8 +2144,10 @@ struct cgroup_subsys cpuset_cgrp_subsys = {
> >>  	.post_attach	= cpuset_post_attach,
> >>  	.bind		= cpuset_bind,
> >>  	.fork		= cpuset_fork,
> >> -	.legacy_cftypes	= files,
> >> +	.legacy_cftypes	= legacy_files,
> >> +	.dfl_cftypes	= dfl_files,
> >>  	.early_init	= true,
> >> +	.threaded	= true,
> > Which means that by default we can attach tasks instead of only
> > processes, right?
>
> Yes, you can control task placement on the thread level, not just process.

+1

-- 
#include

Patrick Bellasi
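
To make the "update on demand" idea above a bit more concrete, here is a
minimal userspace sketch. It is NOT kernel code and does not reflect how
cpuset is implemented: all the names (cpumask_model_t, cpuset_model,
task_model and the helpers) are hypothetical, CPUs are modelled as bits
of a 64-bit word, and the fallback to the full cpuset mask when the
intersection is empty is just an assumption. The race between the
scheduler and cgroups mentioned above is deliberately ignored (no
locking at all).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one bit per CPU, at most 64 CPUs. */
typedef uint64_t cpumask_model_t;

struct cpuset_model {
	cpumask_model_t cpus;	/* CPUs currently allowed by the group */
	unsigned int seq;	/* bumped on every update of 'cpus' */
};

struct task_model {
	cpumask_model_t requested;	/* affinity requested by the task */
	cpumask_model_t effective;	/* cached 'requested' clamped by the cpuset */
	unsigned int seen_seq;		/* cpuset seq used to compute 'effective' */
};

/* Clamp a task mask with the group mask. Assumption: if the two masks
 * do not intersect, fall back to the group mask. */
static cpumask_model_t clamp_mask(cpumask_model_t requested,
				  cpumask_model_t allowed)
{
	cpumask_model_t m = requested & allowed;

	return m ? m : allowed;
}

/* Writer side: O(1), no walk over the tasks of the group. */
static void cpuset_model_update(struct cpuset_model *cs, cpumask_model_t cpus)
{
	assert(cs != NULL);
	cs->cpus = cpus;
	cs->seq++;
}

/* Attach a task to the group and compute its initial effective mask. */
static void task_model_init(struct task_model *t,
			    const struct cpuset_model *cs,
			    cpumask_model_t requested)
{
	t->requested = requested;
	t->effective = clamp_mask(requested, cs->cpus);
	t->seen_seq = cs->seq;
}

/* Wakeup side: refresh the cached mask only if the cpuset has changed
 * since the last time this task looked at it. */
static cpumask_model_t task_model_wakeup_mask(struct task_model *t,
					      const struct cpuset_model *cs)
{
	if (t->seen_seq != cs->seq) {
		t->effective = clamp_mask(t->requested, cs->cpus);
		t->seen_seq = cs->seq;
	}
	return t->effective;
}
```

With this schema a sleeping task costs nothing when the group mask is
updated: its cached mask is refreshed only by the next call to
task_model_wakeup_mask(). The price is one extra compare (and sometimes
an AND) in the wakeup path, which is exactly the overhead that should
be measured.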