Date: Fri, 25 Aug 2017 08:02:43 +0200
From: luca abeni
To: Mathieu Poirier
Cc: Ingo Molnar, Peter Zijlstra, tj@kernel.org, vbabka@suse.cz,
 Li Zefan, akpm@linux-foundation.org, weiyongjun1@huawei.com,
 Juri Lelli, Steven Rostedt, Claudio Scordino,
 Daniel Bristot de Oliveira, linux-kernel@vger.kernel.org,
 Tommaso Cucinotta
Subject: Re: [PATCH 0/7] sched/deadline: fix cpusets bandwidth accounting
Message-ID: <20170825080243.7591aa0c@sweethome>
References: <1502918443-30169-1-git-send-email-mathieu.poirier@linaro.org>
 <20170822142136.3604336e@luca>
 <20170824095326.4f5c1777@luca>

Hi Mathieu,

On Thu, 24 Aug 2017 14:32:20 -0600
Mathieu Poirier wrote:
[...]
> >> > if we want to create some "second level cpusets" inside a "parent
> >> > cpuset", allowing deadline tasks to be placed inside both the
> >> > "parent cpuset" and the "second level cpusets", then we have to
> >> > subtract the "second level cpusets'" maximum utilizations from
> >> > the "parent cpuset" utilization.
> >> >
> >> > I am not sure how difficult it can be to implement this...
> >>
> >> Humm... I am missing some context here.
> >
> > Or maybe I misunderstood the issue you were seeing (I am no expert
> > on cpusets). Is it related to hierarchies of cpusets (with one
> > cpuset contained inside another one)?
>
> Having spent a lot of time in the CPUset code, I can understand the
> confusion.
>
> CPUset allows the creation of a hierarchy of sets, _seemingly_ creating
> overlapping root domains. Fortunately that isn't the case -
> overlapping CPUsets are morphed together to create non-overlapping
> root domains. The magic happens in rebuild_sched_domains_locked() [1],
> where generate_sched_domains() [2] transforms any CPUset topology into
> disjoint domains.

Ok; thanks for explaining

[...]

> root@linaro-developer:/sys/fs/cgroup/cpuset# mkdir set1 set2
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set1/cpuset.mems
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > set2/cpuset.mems
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0,1 > set1/cpuset.cpus
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 2,3 > set2/cpuset.cpus
> root@linaro-developer:/sys/fs/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> root@linaro-developer:/sys/fs/cgroup/cpuset#
>
> At this time runqueue0 and runqueue1 point to root domain A while
> runqueue2 and runqueue3 point to root domain B (something that can't
> be seen without adding more instrumentation).

Ok; up to here, everything is clear to me ;-)

> Newly created tasks can roam on all the CPUs available:
>
> root@linaro-developer:/home/linaro# ./burn &
> [1] 3973
> root@linaro-developer:/home/linaro# grep Cpus_allowed: /proc/3973/status
> Cpus_allowed: f
> root@linaro-developer:/home/linaro#

This happens because the task is not in set1 nor in set2, right?

I _think_ (but I am not sure; I did not design this part of
SCHED_DEADLINE) that the original idea was that in this situation
SCHED_DEADLINE tasks can only be in set1 or in set2 (SCHED_DEADLINE
tasks are not allowed to be in the "default" CPUset, in this setup).
Is this what one of your later patches enforces?

> The above demonstrates that even if we have two CPUsets, new tasks
> belong to the "default" CPUset and as such can use all the available
> CPUs.
I still have a doubt (probably showing all my ignorance about
CPUsets :)... In this situation, we have 3 CPUsets: "default", set1,
and set2... Is every one of these CPUsets associated with a root
domain (so, we have 3 root domains)? Or are only set1 and set2
associated with a root domain?

> Now let's make task 3973 a DL task:
>
> root@linaro-developer:/home/linaro# ./schedtool -E -t 900000:1000000 3973
> root@linaro-developer:/home/linaro# grep dl /proc/sched_debug
> dl_rq[0]:
>   .dl_nr_running   : 0
>   .dl_nr_migratory : 0
>   .dl_bw->bw       : 996147
>   .dl_bw->total_bw : 0        <------ Problem

Ok; I think I understand the problem, now...

> dl_rq[3]:
>   .dl_nr_running   : 0
>   .dl_nr_migratory : 0
>   .dl_bw->bw       : 996147
>   .dl_bw->total_bw : 943718   <------ As expected
> root@linaro-developer:/home/linaro/jlelli#
>
> When task 3973 was promoted to a DL task it was running on either CPU2
> or CPU3. The acceptance test was done on root domain B and the task's
> utilisation added as expected. But as pointed out above, task 3973 can
> still be scheduled on CPU0 and CPU1, and that is a problem since the
> utilisation hasn't been added there as well. The task is now spread
> over two root domains rather than a single one, as currently expected
> by the DL code (note that there are many ways to reproduce this
> situation).

I think this is a bug, and the only reasonable solution is to allow
the task to become SCHED_DEADLINE only if it is in set1 or set2 (so,
if its affinity mask coincides exactly with all of the CPUs of the
root domain where the task utilization is added).

> In its current form the patchset prevents specific operations from
> being carried out if we recognise that a task could end up spanning
> more than a single root domain.

Good. I think this is the right way to go.

> But that will break as soon as we
> find a new way to create a DL task that spans multiple domains (and I
> may not have caught them all either).
So, we need to fix that too ;-)

> Another way to fix this is to do an acceptance test on all the root
> domains of a task.

I think we need to understand what the intended behaviour of
SCHED_DEADLINE in this situation is... My understanding is that
SCHED_DEADLINE is designed to do global EDF scheduling inside an
"isolated" CPUset; a SCHED_DEADLINE task spanning multiple domains
would break some SCHED_DEADLINE properties (from the scheduling
theory point of view) in some interesting ways...

I am not saying we should not do this, but I believe that allowing
tasks to span multiple domains requires some redesign of the
admission test and migration mechanisms in SCHED_DEADLINE. I think
this is related to the "generic affinities" issue that Peter
mentioned some time ago.

> So above we'd run the acceptance test on root
> domains A and B before promoting the task. Of course we'd also have to
> add the utilisation of that task to both root domains. Although simple,
> it goes to the core of the DL scheduler and touches pretty much every
> aspect of it, something I'm reluctant to embark on.

I see... So, the "default" CPUset does not have any root domain
associated with it? If it had, we could just subtract the maximum
utilizations of set1 and set2 from it when creating the root domains
of set1 and set2.

Thanks,
Luca

> [1]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L814
> [2]. http://elixir.free-electrons.com/linux/latest/source/kernel/cgroup/cpuset.c#L634
> [3]. https://github.com/jlelli/tests.git
> [4]. https://github.com/jlelli/schedtool-dl.git
> [5]. https://lkml.org/lkml/2016/2/3/966
>
> >
> >> Nonetheless the approach I
> >> was contemplating was to repeat the current mathematics for all the
> >> root domains accessible from a p->cpus_allowed flag.
> >
> > I think in the original SCHED_DEADLINE design there should be only
> > one root domain compatible with the task's affinity...
> > If this does
> > not happen, I suspect it is a bug (Juri, can you confirm?).
> >
> > My understanding is that with SCHED_DEADLINE, cpusets should be used
> > to partition the system's CPUs into disjoint sets (and I think there
> > is one root domain for each one of those disjoint sets). And the
> > task affinity mask should correspond to the CPUs composing the
> > set in which the task is executing.
> >
> >> As such we'd
> >> have the same acceptance test but repeated on more than one root
> >> domain. To do that, time can be an issue, but the real problem I
> >> see is related to the current DL code. It is geared around a
> >> single root domain, and changing that means meddling in a lot of
> >> places. I had a prototype that was beginning to address that but
> >> decided to gather people's opinions before getting in too deep.
> >
> > I still do not fully understand this (I got the impression that
> > this is related to hierarchies of cpusets, but I am not sure if this
> > understanding is correct). Maybe an example would help me to
> > understand.
>
> The above should say it all - please get back to me if I haven't
> expressed myself clearly.
>
> > Thanks,
> > Luca